#### Splitters

#### Reasons to Split Documents:

- **Handle Non-Uniform Document Lengths**: Documents in real-world collections vary greatly in size. Splitting helps maintain consistent processing across all documents.
  
- **Overcome Model Input Size Constraints**: Many language and embedding models have maximum input limits. Splitting enables us to process larger documents that would otherwise exceed these restrictions.
  
- **Improve Representation Quality**: An embedding of a long document has to compress many topics into a single vector, which tends to blur them. Splitting allows for more focused, accurate representations of each part.
  
- **Enhance Retrieval Precision**: In retrieval systems, splitting enables more granular search results, allowing for precise matching between query terms and relevant document sections.
  
- **Optimize Computational Resources**: Processing smaller text chunks can save memory and allow for better parallelization, making processing tasks more efficient.

1. Length-based splitting: chunks are cut to a target size, measured in characters or in tokens.
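As a rough illustration of the idea, independent of LangChain, here is a minimal sketch of fixed-size splitting with optional overlap. The helper `split_by_length` is hypothetical and measures length in characters only. Real splitters also try to respect separators and can measure length in tokens.

```python
def split_by_length(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Hypothetical helper for illustration; not part of LangChain.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each window starts `step` characters after the previous one,
    # so consecutive windows share `chunk_overlap` characters.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij"
print(split_by_length(sample, chunk_size=4))
# ['abcd', 'efgh', 'ij']
print(split_by_length(sample, chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Overlap repeats the tail of one chunk at the head of the next, which can help when a relevant passage straddles a chunk boundary.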

In [1]:
from langchain_text_splitters import CharacterTextSplitter

In [2]:
with open(r"./state_of_the_union.txt", encoding="utf-8") as f:
    state_of_the_union = f.read()

In [3]:
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # tiktoken encoding used to measure chunk length
    chunk_size=100,               # measured in tokens, not characters
    chunk_overlap=0,
)

In [4]:
texts = text_splitter.split_text(state_of_the_union)

In [5]:
type(texts)

list

In [6]:
len(texts)

99

In [7]:
texts[:2]

['Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution.',
 'And with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people.']
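Each chunk above contains several whole paragraphs. That is consistent with a split-then-merge strategy: cut on a separator, then pack pieces together until the size budget is reached. The sketch below is a simplified illustration of that idea, not LangChain's actual implementation. `merge_pieces` and `word_count` are hypothetical names, and a whitespace word count stands in for a real tiktoken token count.

```python
from typing import Callable


def merge_pieces(
    text: str,
    separator: str,
    chunk_size: int,
    length_fn: Callable[[str], int],
) -> list[str]:
    """Split on `separator`, then greedily pack pieces into chunks.

    Hypothetical helper for illustration only.
    """
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and length_fn(candidate) > chunk_size:
            # Adding this piece would exceed the budget: close the chunk.
            chunks.append(separator.join(current))
            current = [piece]
        else:
            current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


def word_count(s: str) -> int:
    # Stand-in for a token counter such as a tiktoken encoding.
    return len(s.split())


text = "one two three\n\nfour five\n\nsix seven eight nine\n\nten"
print(merge_pieces(text, "\n\n", chunk_size=5, length_fn=word_count))
# ['one two three\n\nfour five', 'six seven eight nine\n\nten']
```

In this sketch a single piece longer than `chunk_size` is kept whole, so the size is a target rather than a hard limit.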

In [8]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [9]:
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="gpt-4",
    chunk_size=100,
    chunk_overlap=0,
)
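The recursive splitter tries coarse separators first and falls back to finer ones only for pieces that are still too long. Below is a minimal sketch of that fallback logic, assuming a separator order of paragraph, line, word, then character. `recursive_split` is a hypothetical helper that measures length in characters. It is a simplification and does not reproduce `RecursiveCharacterTextSplitter` exactly (for example, it does not re-merge small pieces across levels).

```python
def recursive_split(
    text: str,
    chunk_size: int,
    separators: tuple[str, ...] = ("\n\n", "\n", " ", ""),
) -> list[str]:
    """Split text so that every chunk has at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too long: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


text = "alpha beta gamma\n\ndelta epsilon\n\nzeta"
print(recursive_split(text, chunk_size=12))
# ['alpha beta', 'gamma', 'delta', 'epsilon', 'zeta']
```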

In [10]:
from langchain_text_splitters import TokenTextSplitter

In [11]:
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

In [12]:
texts = text_splitter.split_text(state_of_the_union)

In [13]:
len(texts)

943
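A token splitter works on the token sequence itself: encode the text, slice the tokens into fixed windows, and decode each window back to a string. The sketch below illustrates this with a toy tokenizer. `toy_encode` and `toy_decode` are hypothetical stand-ins for a real encoding such as tiktoken, and the "tokens" here are just runs of up to four non-space characters.

```python
import re


def toy_encode(text: str) -> list[str]:
    # Toy tokenizer: up to 4 non-space characters, with any leading
    # whitespace attached to the token. Trailing whitespace is dropped.
    return re.findall(r"\s*\S{1,4}", text)


def toy_decode(tokens: list[str]) -> str:
    return "".join(tokens)


def split_by_tokens(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Slice the token sequence into windows and decode each window."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    tokens = toy_encode(text)
    step = chunk_size - chunk_overlap
    return [
        toy_decode(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), step)
    ]


print(toy_encode("tokenization is fun"))
# ['toke', 'niza', 'tion', ' is', ' fun']
print(split_by_tokens("tokenization is fun", chunk_size=2))
# ['tokeniza', 'tion is', ' fun']
```

Because the windows are cut on token boundaries rather than on separators, a chunk can start or end in the middle of a word.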

#### Using spaCy

In [17]:
# pip install --upgrade --quiet spacy
# python -m spacy download en_core_web_sm  # pipeline that SpacyTextSplitter loads by default

In [18]:
from langchain_text_splitters import SpacyTextSplitter

In [19]:
text_splitter = SpacyTextSplitter(chunk_size=1000)

In [20]:
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.

Members of Congress and the Cabinet.

Justices of the Supreme Court.

My fellow Americans.  



Last year COVID-19 kept us apart.

This year we are finally together again. 



Tonight, we meet as Democrats Republicans and Independents.

But most importantly as Americans. 



With a duty to one another to the American people to the Constitution. 



And with an unwavering resolve that freedom will always triumph over tyranny. 



Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.

But he badly miscalculated. 



He thought he could roll into Ukraine and the world would roll over.

Instead he met a wall of strength he never imagined. 



He met the Ukrainian people. 



From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.
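The output above shows one sentence per line with blank lines in between, which is what you would expect if the splitter detects sentence boundaries and joins whole sentences with a `"\n\n"` separator until `chunk_size` characters are reached. The sketch below illustrates that behaviour under those assumptions. `naive_sentences` is a hypothetical regex-based stand-in for spaCy's sentence segmentation and is far less robust than the real pipeline.

```python
import re


def naive_sentences(text: str) -> list[str]:
    # Naive stand-in for spaCy: split after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def pack_sentences(sentences: list[str], chunk_size: int, separator: str = "\n\n") -> list[str]:
    """Join whole sentences into chunks of at most chunk_size characters."""
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = sentence if not current else current + separator + sentence
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = (
    "Last year COVID-19 kept us apart. "
    "This year we are finally together again. "
    "But he badly miscalculated."
)
for chunk in pack_sentences(naive_sentences(text), chunk_size=80):
    print(repr(chunk))
```

No sentence is ever cut in half, at the cost of chunks that vary in length.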




#### Sentence Transformers

- Sentence-transformer models are BERT-style encoders that map text to embedding vectors.

- The SentenceTransformersTokenTextSplitter is a specialized text splitter for use with the sentence-transformer models.
- The default behaviour is to split the text into chunks that fit the token window of the sentence transformer model that you would like to use.

In [21]:
from langchain_text_splitters import SentenceTransformersTokenTextSplitter

In [24]:
splitter = SentenceTransformersTokenTextSplitter(
    model_name="all-MiniLM-L6-v2",
    chunk_overlap=0,
    tokens_per_chunk=5,
    # cache_dir=r'D:\AI-DATASETS\07-Hugging-Face-Data'
)

In [25]:
# Define your text
dummy_text = "This is an example sentence for tokenization using SentenceTransformers. It demonstrates splitting."

In [26]:
splitter.count_tokens(text=dummy_text)

21

In [28]:
# Split the text into chunks of at most tokens_per_chunk (5) tokens each
chunks = splitter.split_text(dummy_text)
chunks

['this is an example sentence',
 'for tokenization using sentence',
 '##transformers.',
 'it demonstrates splitting.']
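The `##` prefix marks a WordPiece continuation token: the tokenizer broke `SentenceTransformers` into sub-word pieces, and a chunk boundary fell inside the word. The chunk count can be sanity-checked with a little arithmetic, assuming that the count of 21 includes one start and one end special token. That is an assumption about `count_tokens`, not something shown above.

```python
import math

counted_tokens = 21    # value returned by count_tokens above
special_tokens = 2     # assumption: one start token and one end token
tokens_per_chunk = 5   # as configured on the splitter

content_tokens = counted_tokens - special_tokens
expected_chunks = math.ceil(content_tokens / tokens_per_chunk)

print(content_tokens, expected_chunks)
# 19 4
```

Under that assumption, 19 content tokens at 5 tokens per chunk give 4 chunks, which matches the length of the list above.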