----------------
#### Text Splitters
------------------

Reasons to Split Documents:

- **Handle Non-Uniform Document Lengths**: Documents in real-world collections vary greatly in size. Splitting them into chunks of similar size lets every document go through the same processing steps.
  
- **Overcome Model Input Size Constraints**: Many language and embedding models have maximum input limits. Splitting enables us to process larger documents that would otherwise exceed these restrictions.
  
- **Improve Representation Quality**: A single embedding of a long document has to summarize many topics at once, which can blur all of them. Splitting allows for a more focused, accurate representation of each part.
  
- **Enhance Retrieval Precision**: In retrieval systems, splitting gives more granular search results, so a query can be matched to the specific section that answers it rather than to a whole document.
  
- **Optimize Computational Resources**: Processing smaller text chunks can save memory and allow for better parallelization, making processing tasks more efficient.
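
To make the idea concrete before turning to the library, here is a minimal sketch of the simplest possible splitter: fixed-width character windows with an optional overlap. The function name `chunk_by_length` and the sample string are made up for illustration; this is not how LangChain implements its splitters.

```python
def chunk_by_length(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij" * 3  # 30 characters of toy text
chunks = chunk_by_length(sample, chunk_size=12, chunk_overlap=2)
print(len(chunks), [len(c) for c in chunks])
```

Cutting at arbitrary character positions breaks words and sentences apart, which is why the splitters used in the rest of this section cut at separators such as paragraph breaks instead.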


**1. Length-based**

In [1]:
from langchain_text_splitters import CharacterTextSplitter

In [2]:
# This is a long document we can split up.
with open(r"state_of_the_union.txt", encoding="utf-8") as f:
    state_of_the_union = f.read()

In [3]:
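# Measure chunk length in tiktoken tokens (cl100k_base encoding), not in characters.
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=100,   # target maximum number of tokens per chunk
    chunk_overlap=0,  # no tokens shared between consecutive chunks
)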

In [4]:
texts = text_splitter.split_text(state_of_the_union)

In [5]:
len(texts)

99

In [6]:
texts[0]

'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution.'

In [7]:
texts[1]

'And with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people.'

In [8]:
texts[2]

'From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. \n\nGroups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland. \n\nIn this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight.'
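
Each chunk above holds several whole paragraphs because `CharacterTextSplitter` first splits on a separator (`"\n\n"` by default) and then merges neighbouring pieces until the length budget would be exceeded. The sketch below imitates that split-then-merge behaviour in plain Python. It is a simplified illustration, not LangChain's actual code: `split_and_merge` and `count_words` are hypothetical names, and a whitespace word count stands in for the tiktoken token count.

```python
from typing import Callable


def split_and_merge(
    text: str,
    separator: str = "\n\n",
    chunk_size: int = 100,
    length_function: Callable[[str], int] = len,
) -> list[str]:
    """Split on separator, then greedily merge pieces up to chunk_size."""
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and length_function(candidate) > chunk_size:
            # Adding this piece would exceed the budget: close the current chunk.
            chunks.append(separator.join(current))
            current = [piece]
        else:
            current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


def count_words(s: str) -> int:
    """Hypothetical stand-in for a tokenizer: counts whitespace-separated words."""
    return len(s.split())


doc = "one two three\n\nfour five\n\nsix seven eight nine\n\nten"
chunks = split_and_merge(doc, chunk_size=5, length_function=count_words)
print(chunks)
```

Note that a single piece longer than `chunk_size` is kept whole in this sketch, since there is no finer separator to fall back on. The recursive splitter introduced next addresses exactly that case.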

In [16]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [22]:
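# No tokenizer is configured here, so chunk_size counts characters (the default length function is len).
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=0,
    separators=["\n\n", "\n", " ", ""],  # tried in order, coarsest first
)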

In [23]:
texts = text_splitter.split_text(state_of_the_union)

In [24]:
len(texts)

550
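
The jump from 99 to 550 chunks comes mostly from the unit: here `chunk_size=100` is measured in characters, whereas the tiktoken-based splitter earlier measured it in tokens. The recursive strategy itself can be sketched as follows: split on the coarsest separator, keep merging pieces while they fit, and fall back to the next separator only for pieces that are still too long. This is a simplified approximation written for illustration (the name `recursive_split` is hypothetical, and overlap is ignored); LangChain's implementation differs in its details.

```python
def recursive_split(
    text: str,
    chunk_size: int,
    separators: tuple[str, ...] = ("\n\n", "\n", " ", ""),
) -> list[str]:
    """Split text into chunks of at most chunk_size characters."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= chunk_size:
        return [text]
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits: keep merging
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return [c.strip() for c in chunks if c.strip()]


doc = (
    "First paragraph is short.\n\n"
    "Second paragraph is quite a bit longer than the limit allows.\n\n"
    "End."
)
for chunk in recursive_split(doc, chunk_size=30):
    print(repr(chunk))
```

Short paragraphs survive intact, and only the over-long one is broken at word boundaries, which matches the pattern visible in the output that follows.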

In [25]:
texts

['Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and',
 'the Cabinet. Justices of the Supreme Court. My fellow Americans.',
 'Last year COVID-19 kept us apart. This year we are finally together again.',
 'Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.',
 'With a duty to one another to the American people to the Constitution.',
 'And with an unwavering resolve that freedom will always triumph over tyranny.',
 'Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he',
 'could make it bend to his menacing ways. But he badly miscalculated.',
 'He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of',
 'strength he never imagined.',
 'He met the Ukrainian people.',
 'From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their',
 'determination, inspires the world.',
 'Groups of citizens bloc