# Text Splitting Side Quest

> When you want to deal with long pieces of text, it is necessary to split up that text into chunks. As simple as this sounds, there is a lot of potential complexity here. Ideally, you want to keep the semantically related pieces of text together. What "semantically related" means could depend on the type of text. This notebook showcases several ways to do that.

> At a high level, text splitters work as following:

>    1. Split the text up into small, semantically meaningful chunks (often sentences).
>    2. Start combining these small chunks into a larger chunk until you reach a certain size (as measured by some function).
>    3. Once you reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks).

> That means there are two different axes along which you can customize your text splitter:

>    1. How the text is split
>    2. How the chunk size is measured

-- https://python.langchain.com/docs/modules/data_connection/document_transformers/#text-splitters
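
The three steps quoted above can be sketched in plain Python. This is a minimal illustration, not LangChain's implementation: the function `split_text`, its parameters, and the `sample` string are assumptions made for this example. It splits on a single separator, greedily merges pieces up to `chunk_size`, and carries a tail of at most `chunk_overlap` characters into the next chunk.

```python
def split_text(text, separator=" ", chunk_size=20, chunk_overlap=0, length=len):
    """Toy split-and-merge splitter (illustrative only, not LangChain's code)."""
    # Step 1: split into small pieces, dropping empty strings.
    pieces = [p for p in text.split(separator) if p]

    def size(parts):
        # Size of the chunk these parts would form, as measured by `length`.
        return length(separator.join(parts))

    chunks, current = [], []
    for piece in pieces:
        # Step 2: keep adding pieces while the chunk stays within chunk_size.
        if current and size(current + [piece]) > chunk_size:
            # Step 3: emit the chunk, then keep only a tail that fits in
            # chunk_overlap and still leaves room for the next piece.
            chunks.append(separator.join(current))
            while current and (
                size(current) > chunk_overlap
                or size(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "We hold these truths to be self-evident, that all men are created equal,"
no_overlap = split_text(sample, chunk_size=20, chunk_overlap=0)
with_overlap = split_text(sample, chunk_size=20, chunk_overlap=10)
print(no_overlap)
print(with_overlap)
```

The `length` argument is the second axis of customization: passing a token-counting function instead of `len` would measure chunk size in tokens. A piece that is longer than `chunk_size` on its own is emitted as an oversized chunk, since this sketch never splits inside a piece.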

Here are some useful options for splitting legislative text:

* [character text splitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/character_text_splitter)
  * How the text is split: by single character
  * How the chunk size is measured: by number of characters
* [recursive text splitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter)
  * How the text is split: by list of characters
  * How the chunk size is measured: by number of characters
* [split by token](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/split_by_token)
  * How the text is split: by character passed in
  * How the chunk size is measured: by tiktoken tokenizer

If you are not familiar with the concept of a token, this article may help:
* https://simonwillison.net/2023/Jun/8/gpt-tokenizers/
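
To make the difference between the two length units concrete, here is a rough sketch that measures the same phrase in characters and in "tokens". The function `toy_token_count` is a hypothetical stand-in that counts words and punctuation marks with a regular expression. Real tokenizers such as tiktoken use learned subword vocabularies and will generally give different counts.

```python
import re


def toy_token_count(text):
    """Count words and punctuation marks (a crude stand-in for a tokenizer)."""
    return len(re.findall(r"\w+|[^\w\s]", text))


phrase = "We hold these truths to be self-evident,"
char_count = len(phrase)
token_count = toy_token_count(phrase)
print(char_count, token_count)
```

The same phrase is several times "longer" in characters than in toy tokens, which is why a `chunk_size` value means something quite different depending on how the splitter measures length.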

Mini Side Quest
* See if anything interesting can be done with this: https://twitter.com/RLanceMartin/status/1670489431168659456?s=20

In [1]:
from langchain.text_splitter import CharacterTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import TokenTextSplitter
from langchain.text_splitter import SentenceTransformersTokenTextSplitter

In [2]:
text = """We hold these truths to be self-evident, that all men are created equal,

that they are endowed by their Creator with certain unalienable Rights,

that among these are Life, Liberty and the pursuit of Happiness."""

## CharacterTextSplitter

In [3]:
# this is the default separator
CharacterTextSplitter(separator="\n\n", chunk_size=20, chunk_overlap=0).split_text(text)

Created a chunk of size 72, which is longer than the specified 20
Created a chunk of size 71, which is longer than the specified 20


['We hold these truths to be self-evident, that all men are created equal,',
 'that they are endowed by their Creator with certain unalienable Rights,',
 'that among these are Life, Liberty and the pursuit of Happiness.']

In [4]:
# this is what happens if we change the default separator
CharacterTextSplitter(separator=" ", chunk_size=20, chunk_overlap=0).split_text(text)

['We hold these truths',
 'to be self-evident,',
 'that all men are',
 'created equal,\n\nthat',
 'they are endowed by',
 'their Creator with',
 'certain unalienable',
 'Rights,\n\nthat among',
 'these are Life,',
 'Liberty and the',
 'pursuit of',
 'Happiness.']

In [5]:
# this is what overlap does
CharacterTextSplitter(separator=" ", chunk_size=20, chunk_overlap=10).split_text(text)

['We hold these truths',
 'truths to be',
 'to be self-evident,',
 'that all men are',
 'men are created',
 'created equal,\n\nthat',
 'they are endowed by',
 'endowed by their',
 'by their Creator',
 'Creator with certain',
 'certain unalienable',
 'Rights,\n\nthat among',
 'among these are',
 'these are Life,',
 'are Life, Liberty',
 'Liberty and the',
 'and the pursuit of',
 'of Happiness.']

## RecursiveCharacterTextSplitter

In [6]:
# these are the default separators
RecursiveCharacterTextSplitter(separators=["\n\n", "\n", " ", ""], chunk_size=40, chunk_overlap=0).split_text(text)

['We hold these truths to be self-evident,',
 'that all men are created equal,',
 'that they are endowed by their Creator',
 'with certain unalienable Rights,',
 'that among these are Life, Liberty and',
 'the pursuit of Happiness.']

In [7]:
# this is what happens if we add "," to the separators
RecursiveCharacterTextSplitter(separators=["\n\n", "\n", ",", " ", ""], chunk_size=40, chunk_overlap=0).split_text(text)

['We hold these truths to be self-evident',
 ', that all men are created equal,',
 'that they are endowed by their Creator',
 'with certain unalienable Rights',
 ',',
 'that among these are Life',
 ', Liberty and the pursuit of Happiness.']
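
The fallback through the separator list can be sketched as follows. This is a simplified illustration, not LangChain's implementation: the function `recursive_split` and the `sample` string are assumptions for this example, and the sketch ignores `chunk_overlap` and LangChain's separator-retention behavior. It tries each separator in order and only recurses into a finer separator for pieces that are still too long.

```python
def recursive_split(text, separators, chunk_size):
    """Simplified recursive splitter (illustrative only, not LangChain's code)."""
    if len(text) <= chunk_size:
        return [text]
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        # This separator does not occur, so fall through to the next one.
        return recursive_split(text, rest, chunk_size)

    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            # Greedily grow the current chunk while it still fits.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, rest, chunk_size))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = (
    "We hold these truths to be self-evident, that all men are created equal,\n\n"
    "that they are endowed by their Creator with certain unalienable Rights,\n\n"
    "that among these are Life, Liberty and the pursuit of Happiness."
)
result = recursive_split(sample, ["\n\n", "\n", " ", ""], chunk_size=40)
for chunk in result:
    print(repr(chunk))
```

On this sample the sketch yields the same six chunks as the default-separator cell (In [6]). It is not expected to match the library on other inputs or on the `","` example, where separator handling differs.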

## TokenTextSplitter

Here we show two token text splitters: one that uses a Hugging Face sentence-transformers tokenizer and one that uses OpenAI's tiktoken tokenizer.

In [9]:
# the chunk length is now measured in tokens, not characters
ts = SentenceTransformersTokenTextSplitter(
    model_name="sentence-transformers/all-mpnet-base-v2",
    chunk_size=10, 
    tokens_per_chunk=10,
    chunk_overlap=0,
)

In [10]:
ts.split_text(text)

['we hold these truths to be self - evident,',
 'that all men are created equal, that they are',
 'endowed by their creator with certain unalienable rights',
 ', that among these are life, liberty and the',
 'pursuit of happiness.']

In [11]:
# the length unit for chunk_size is now tokens, not characters
ts = TokenTextSplitter(
    model_name="text-embedding-ada-002", 
    chunk_size=10, 
    chunk_overlap=0,
)

In [12]:
ts.split_text(text)

['We hold these truths to be self-evident',
 ', that all men are created equal,\n\nthat they',
 ' are endowed by their Creator with certain unalienable',
 ' Rights,\n\nthat among these are Life, Liberty and',
 ' the pursuit of Happiness.']

In [13]:
# chunk_overlap is also measured in tokens
ts = SentenceTransformersTokenTextSplitter(
    model_name="sentence-transformers/all-mpnet-base-v2",
    chunk_size=10, 
    tokens_per_chunk=10,
    chunk_overlap=4,
)

In [14]:
ts.split_text(text)

['we hold these truths to be self - evident,',
 'self - evident, that all men are created equal',
 'men are created equal, that they are endowed by',
 'they are endowed by their creator with certain unalie',
 'with certain unalienable rights, that among these',
 ', that among these are life, liberty and the',
 ', liberty and the pursuit of happiness.',
 'happiness.']

In [15]:
# chunk_overlap is also measured in tokens
ts = TokenTextSplitter(
    model_name="text-embedding-ada-002", 
    chunk_size=10, 
    chunk_overlap=4,
)

In [16]:
ts.split_text(text)

['We hold these truths to be self-evident',
 ' self-evident, that all men are created',
 ' all men are created equal,\n\nthat they are endowed',
 'that they are endowed by their Creator with certain un',
 ' Creator with certain unalienable Rights,\n\nthat among',
 ' Rights,\n\nthat among these are Life, Liberty and',
 ' Life, Liberty and the pursuit of Happiness.',
 ' of Happiness.']