# Exploring Splitting Strategies

- Character Splitting
- Recursive Splitting

You can use the application at https://chunkviz.up.railway.app, made by https://github.com/gkamradt, to visualize how the character and recursive strategies split a text.

---

#### Character Text Splitting

This is the simplest method. https://python.langchain.com/docs/modules/data_connection/document_transformers/character_text_splitter

**Examples**:
- first example: splitting the text on a fixed number of characters isn't very useful,
    as it may split the text in the middle of a word, losing semantic meaning. It could still be useful if you have a specific separator that you need to split the data on.

- second example: if we split on newlines alone, we get closer to sentence splitting, but control over chunk size is often lost, because the splitter only cuts at the separator: a paragraph longer than the chunk size is kept whole (see the warning in the output of the second example). This could be a problem if we want chunks of roughly equal size.
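
The fixed-size behaviour from the first example can be sketched in plain Python. This is an illustration only, not LangChain's implementation (which, among other things, strips whitespace from each chunk); the helper name `fixed_size_chunks` is hypothetical.

```python
def fixed_size_chunks(text: str, chunk_size: int) -> list[str]:
    """Cut text into windows of chunk_size characters, ignoring word boundaries."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


sample = "Generative artificial intelligence or generative AI"
chunks = fixed_size_chunks(sample, chunk_size=20)

for chunk in chunks:
    # repr() makes leading/trailing spaces and mid-word cuts visible
    print(repr(chunk))
```

The first chunk ends in the middle of "artificial", which is the loss of meaning described above.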

In [4]:
# sample text from the gen-ai-wiki pdf
text = """
Generative artificial intelligence or generative AI is a type of artificial intelligence (AI) system capable of generating text, images, or other media in response to prompts. 

Generative AI models learn the patterns and structure of their input training data, and then generate new data that has similar characteristics.
"""

In [5]:
from langchain.text_splitter import CharacterTextSplitter

# split the text into fixed-size chunks of 35 characters (separator="" splits on every character)
splitter = CharacterTextSplitter(
    separator="",
    chunk_size=35,
    chunk_overlap=0,
    length_function=len,
    is_separator_regex=False,
)

splits = splitter.create_documents([text])

# print the length of the splits
print(f'length: {len(splits)}') # prints the length of the split text
for split in splits:
    print(split) # prints the split text


length: 10
page_content='Generative artificial intelligence'
page_content='or generative AI is a type of arti'
page_content='ficial intelligence (AI) system cap'
page_content='able of generating text, images, or'
page_content='other media in response to prompts'
page_content='. \n\nGenerative AI models learn the'
page_content='patterns and structure of their inp'
page_content='ut training data, and then generate'
page_content='new data that has similar characte'
page_content='ristics.'


In [6]:
from langchain.text_splitter import CharacterTextSplitter

# split the text on newlines; with this sample that gives one chunk per paragraph
splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=100, # exceeded when a paragraph is longer than 100 characters, because the splitter only cuts at the separator
    chunk_overlap=10,
    length_function=len,
    is_separator_regex=False,
)

splits = splitter.create_documents([text])

# print the length of the splits
print(f'length: {len(splits)}') # prints the length of the split text
for split in splits:
    print(f'characters: {len(split.page_content)}')
    print(split) # prints the split text


Created a chunk of size 176, which is longer than the specified 100


length: 2
characters: 175
page_content='Generative artificial intelligence or generative AI is a type of artificial intelligence (AI) system capable of generating text, images, or other media in response to prompts.'
characters: 144
page_content='Generative AI models learn the patterns and structure of their input training data, and then generate new data that has similar characteristics.'


#### Recursive Character Splitting

This method tries a list of separators in order until the chunks are small enough. The default separator list is `["\n\n", "\n", " ", ""]`. This strategy is what most simple RAG tutorials seem to use.

**Examples**:
- first example: using the default separator config we get chunks that have 
    a much closer character length. This produces better quality chunking (no more mid-word splits). In this case the small chunk size and chunk overlap also leave little semantic meaning per chunk. It would be better to set a larger chunk size and, if overlap is used, to increase it as well.

- second example: when switching to a larger document, thinking about chunk size and overlap becomes more important in order to keep the size and number of chunks down. A larger document often has greater variation in words per sentence and in newlines. This can become a cost and performance concern later on, e.g. when the chunks are passed to an embedding model and/or retrieved to provide semantic meaning around a topic. Embedding models often have smaller context windows than their larger text generation counterparts.
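
The "try the separators in order" idea can be sketched as a small recursive function. This is a simplified illustration under stated assumptions, not LangChain's actual code: it has no chunk overlap, no regex separators, and the name `recursive_split` is hypothetical.

```python
def recursive_split(text: str, chunk_size: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ", "")) -> list[str]:
    """Split on the coarsest separator first; recurse with finer ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []  # drop empty / whitespace-only pieces
    if not separators:
        return [text]  # nothing left to split on: keep the oversized piece
    sep, finer = separators[0], separators[1:]
    if sep == "":
        # last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    pieces = []
    for part in text.split(sep):
        pieces.extend(recursive_split(part, chunk_size, finer))

    # greedily merge neighbouring pieces back together while they still fit
    chunks, current = [], ""
    for piece in pieces:
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph is short.\n\nSecond paragraph is a bit longer than the first one."
chunks = recursive_split(sample, chunk_size=30)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The short paragraph survives whole, while the long one falls through to the `" "` separator and is cut between words rather than inside them.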

In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# recursive text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=35, 
    chunk_overlap=10
    )

splits = text_splitter.create_documents([text])
# print the length of the splits
print(f'chunks: {len(splits)}')

for chunk in splits:
    print(f'characters={len(chunk.page_content)}')
    print(f'page_content={chunk}')



chunks: 12
characters=34
page_content=page_content='Generative artificial intelligence'
characters=29
page_content=page_content='or generative AI is a type of'
characters=33
page_content=page_content='a type of artificial intelligence'
characters=33
page_content=page_content='(AI) system capable of generating'
characters=31
page_content=page_content='text, images, or other media in'
characters=29
page_content=page_content='media in response to prompts.'
characters=30
page_content=page_content='Generative AI models learn the'
characters=32
page_content=page_content='learn the patterns and structure'
characters=33
page_content=page_content='structure of their input training'
characters=32
page_content=page_content='training data, and then generate'
characters=34
page_content=page_content='generate new data that has similar'
characters=24
page_content=page_content='similar characteristics.'


In the above examples the text was passed directly to `create_documents`, which creates documents from a list of texts.

To chunk the documents returned by a document loader, use `split_documents` instead, which takes a list of documents and keeps their metadata.
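
The difference between the two entry points can be illustrated without LangChain. In this sketch `Doc`, `create_docs` and `split_docs` are hypothetical stand-ins, and the splitting itself is a plain fixed-size cut; the point is only where the metadata comes from.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def create_docs(texts: list[str], chunk_size: int) -> list[Doc]:
    """Takes raw strings, so every chunk starts with empty metadata."""
    return [
        Doc(text[i:i + chunk_size])
        for text in texts
        for i in range(0, len(text), chunk_size)
    ]


def split_docs(docs: list[Doc], chunk_size: int) -> list[Doc]:
    """Takes documents and copies each document's metadata onto its chunks."""
    return [
        Doc(doc.page_content[i:i + chunk_size], dict(doc.metadata))
        for doc in docs
        for i in range(0, len(doc.page_content), chunk_size)
    ]


pages = [Doc("Generative AI models learn patterns.", {"source": "gen-ai-wiki.pdf", "page": 0})]
from_text = create_docs([pages[0].page_content], chunk_size=20)
from_docs = split_docs(pages, chunk_size=20)

print(from_text[0].metadata)  # empty: raw text carries no source information
print(from_docs[0].metadata)  # source and page copied from the original document
```

This is why the chunks produced from the PDF carry `source` and `page` in their metadata, while the chunks created from the sample string do not.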

In [8]:
from langchain_community.document_loaders import PyPDFLoader
# load the data from the gen-ai-wiki which is a 3 page pdf
loader = PyPDFLoader(file_path="gen-ai-wiki.pdf")
# load() returns a list with one document per page; keep all pages for use later
data = loader.load()


In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# recursive text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=256, 
    chunk_overlap=20,
    length_function=len
    )

splits = text_splitter.split_documents(data)
# print the length of the splits
print(f'chunks: {len(splits)}')

for chunk in splits[0:4]: # print the first 4 chunks
    print(f'characters={len(chunk.page_content)}')
    print(f'{chunk}')



chunks: 28
characters=250
page_content='Generative artificial intelligence or generative AI is a type of artificial intelligence (AI) system capable of generating text, images, or other media in response to prompts.[1][2] Generative AI models learn the patterns and structure of their input' metadata={'source': 'gen-ai-wiki.pdf', 'page': 0}
characters=253
page_content='of their input training data, and then generate new data that has similar characteristics.[3][4] Notable generative AI systems include ChatGPT (and its variant Bing Chat), a chatbot built by OpenAI using their GPT-3 and GPT-4 foundational large language' metadata={'source': 'gen-ai-wiki.pdf', 'page': 0}
characters=252
page_content='large language models,[5] and Bard, a chatbot built by Google using their LaMDA foundation model.[6] Other generative AI models include artificial intelligence art systems such as Stable Diffusion, Midjourney, and DALL-E.[7] Generative AI has potential' metadata={'source': 'gen-ai-wiki.pdf', 'p

**Split by token count, not characters**

tiktoken can be used to estimate the number of tokens in a text, and that count can then influence where we split the text.

Use it for embedding and whenever you want to ensure that the model's maximum token count is not exceeded. If you exceed the limit, data in a chunk will be cut off.

Using token length changes what the chunk size means: it is measured in tokens, so that splits are not larger than the number of tokens allowed by the embedding model. Note that `CharacterTextSplitter` still only cuts at its separator, so its chunks can exceed the limit, as the last result in the output of the token example shows.
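
The effect of swapping the length measure can be sketched with a pluggable `length_function`. This is an assumption-laden illustration: `pack_words` is a hypothetical helper, and `count_words` is a stand-in counter (one "token" per word), not a real tokenizer. In practice a tiktoken-based counter such as the `num_tokens_from_string` helper in this notebook would take its place.

```python
from typing import Callable


def pack_words(text: str, chunk_size: int,
               length_function: Callable[[str], int] = len) -> list[str]:
    """Greedily pack words into chunks whose measured length stays within chunk_size."""
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if current and length_function(candidate) > chunk_size:
            chunks.append(current)
            current = word  # a single oversized word still becomes its own chunk
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def count_words(s: str) -> int:
    """Stand-in token counter: one token per whitespace-separated word (NOT a real tokenizer)."""
    return len(s.split())


sample = "Generative AI models learn the patterns and structure of their input training data"

by_chars = pack_words(sample, chunk_size=30)                                # budget: 30 characters
by_tokens = pack_words(sample, chunk_size=5, length_function=count_words)   # budget: 5 stand-in tokens

print(by_chars)
print(by_tokens)
```

The same text and the same packing logic give different chunk boundaries, purely because the unit of `chunk_size` changed.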

https://python.langchain.com/docs/modules/data_connection/document_transformers/split_by_token

In [82]:

from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter, TokenTextSplitter

# a dictionary that holds the result of the different text splitters below
results = {}

# direct token text splitter, split on 2 tokens
tiktoken_splitter = TokenTextSplitter(
    chunk_size=2, chunk_overlap=0
)

splits = tiktoken_splitter.split_documents(data)
results["tiktoken_splitter"] = splits[0]

# recursive character text split with tokenization
recursive_token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=10, chunk_overlap=0
)

splits = recursive_token_splitter.split_documents(data)
results["recursive_token_splitter"] = splits[0]

# character text split with tokenization
# (only cuts at the default "\n\n" separator, so a chunk can exceed chunk_size)
character_token_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=10, chunk_overlap=0
)

splits = character_token_splitter.split_documents(data)
results["character_token_splitter"] = splits[0]

# extra helper function to get the number of tokens from a string
import tiktoken

def num_tokens_from_string(string: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding("cl100k_base")
    num_tokens = len(encoding.encode(string))
    return num_tokens

total_chunk_tokens = 0
total_tokens_per_chunk = []

# print the details of the results
for key, chunk in results.items(): # the first chunk produced by each splitter
    
    tokens = num_tokens_from_string(chunk.page_content)
    total_chunk_tokens += tokens
    total_tokens_per_chunk.append(tokens)
    print(f'tokens used count: {tokens}')
    print(f'content chars count: {len(chunk.page_content)}')
    print(f'{key} : {chunk}')
    print("----------------------")


tokens used count: 2
content chars count: 10
tiktoken_splitter : page_content='Generative' metadata={'source': 'gen-ai-wiki.pdf', 'page': 0}
----------------------
tokens used count: 10
content chars count: 56
recursive_token_splitter : page_content='Generative artificial intelligence or generative AI is a' metadata={'source': 'gen-ai-wiki.pdf', 'page': 0}
----------------------
tokens used count: 491
content chars count: 2350
character_token_splitter : page_content='Generative artificial intelligence or generative AI is a type of artificial intelligence (AI) system capable of generating text, images, or other media in response to prompts.[1][2] Generative AI models learn the patterns and structure of their input training data, and then generate new data that has similar characteristics.[3][4] Notable generative AI systems include ChatGPT (and its variant Bing Chat), a chatbot built by OpenAI using their GPT-3 and GPT-4 foundational large language models,[5] and Bard, a chatbot built b