# Text Splitters
LLMs have a maximum prompt size, so an entire document often cannot be passed to the model in one go. Documents therefore need to be divided into smaller parts, and Text Splitters do exactly this: they break large text documents into smaller pieces that a language model can process more effectively.

---

Pros of grounding the LLM's answer in a source document:
- **Reduced hallucination**: By providing a source document, the LLM is more likely to generate content based on the given information, reducing the chances of creating false or irrelevant information.
- **Increased accuracy**: With a reliable source document, the LLM can generate more accurate answers, especially in use cases where accuracy is crucial.
- **Verifiable information**: Users can cross-check the generated content with the source document to ensure the information is accurate and reliable.

Cons:
- **Limited scope**: Relying on a single document may limit the scope of the generated content, as the LLM will only have access to the information provided in the document.
- **Dependence on document quality**: The accuracy of the generated content heavily depends on the quality and reliability of the source document. The LLM will likely generate incorrect or misleading content if the document contains inaccurate or biased information.
- **Inability to eliminate hallucination completely**: Although providing a document as a base reduces the chances of hallucination, it does not guarantee that the LLM will never generate false or irrelevant information.

## 1. Character Text Splitter
This type of splitter can be used in various scenarios where you must split long text pieces into smaller, semantically meaningful chunks. The splitter allows you to customize the chunking process along two axes - chunk size and chunk overlap - to balance the trade-offs between splitting the text into manageable pieces and preserving semantic context between chunks.

> Note: Only one separator can be specified for this splitter. If a piece produced by the split is longer than the specified `chunk_size` and does not contain the separator, it is not split any further.
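
The behaviour described in the note can be illustrated with a minimal pure-Python sketch. This is not LangChain's implementation: `split_on_separator` is a hypothetical helper, it ignores `chunk_overlap`, and it measures length in characters only.

```python
def split_on_separator(text, separator, chunk_size):
    """Split on one separator, then greedily merge pieces up to chunk_size."""
    pieces = text.split(separator)
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either the merged text still fits, or this is a lone piece that
            # cannot be cut further because it contains no separator.
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


print(split_on_separator("aaa bbb ccc", " ", chunk_size=7))
# ['aaa bbb', 'ccc']

# The 20-character run has no space inside it, so it stays oversized.
print([len(c) for c in split_on_separator("short " + "x" * 20, " ", chunk_size=10)])
# [5, 20]
```

The second call shows why a chunk can end up longer than `chunk_size` when the single separator never occurs inside it.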

In [22]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("../../data/WritingArticleSummary.pdf")
pages = loader.load()
len(pages)

3

In [34]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(separator=" ", chunk_size=1000, chunk_overlap=20)
texts = text_splitter.split_documents(pages)

print(f"You have {len(texts)} documents")

print("Preview:")
print(texts[0].page_content)

You have 10 documents
Preview:
1 
 
Academic Skills, Trent University www.trentu.ca/academicskills 
Peterborough, ON Canada © 2014 
 Writing Article Summaries 
 
 
Understanding Article Summaries 
An article summary is a short, focused paper about one scholarly 
article. This paper is informed by critical reading of an article. For 
argumentative articles, the summary identifies, explains, and 
analyses the thesis and supporting arguments; for empirica l articles, 
the summary identifies, explains, and analyses the research 
questions, methods, and findings. 
Although article summaries are often short and rarely account for a 
large portion of your grade, they are a strong indicator of your 
reading and writ ing skills. Professors ask you to write article 
summaries to help you to develop essential skills in critical reading, 
summarizing, and clear, organized writing. Furthermore, an article 
summary requires you to read a scholarly article quite closely, which 
provides a useful intr

> Note: No universal approach for chunking text will fit all scenarios. Finding the best chunk size means going through a few steps. First, clean up your data by getting rid of anything that's not needed, like HTML tags from websites. Then, pick a few different chunk sizes to test. The best size will depend on what kind of data you're working with and the model you're using. Finally, test out how well each size works by running some queries and comparing the results. You might need to try a few different sizes before finding the best one. This process might take some time, but getting the best results from your project is worth it.
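
As a rough sketch of the "test a few sizes with some queries" step, the code below compares candidate chunk sizes on a toy example. Everything in it is an assumption made for illustration: the fixed-size chunker, the word-overlap retrieval in `best_chunk`, and the hand-labelled queries stand in for your real splitter, retriever, and evaluation data.

```python
def fixed_size_chunks(text, chunk_size):
    """Cut text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def best_chunk(chunks, query):
    """Pick the chunk that shares the most lowercase words with the query."""
    query_words = set(query.lower().split())
    return max(chunks, key=lambda c: len(query_words & set(c.lower().split())))


def hit_rate(text, chunk_size, labelled_queries):
    """Fraction of queries whose best-matching chunk contains the expected answer."""
    chunks = fixed_size_chunks(text, chunk_size)
    hits = sum(answer in best_chunk(chunks, query) for query, answer in labelled_queries)
    return hits / len(labelled_queries)


text = "Paris is the capital of France. Berlin is the capital of Germany."
labelled_queries = [("capital of France", "Paris"), ("capital of Germany", "Berlin")]

# Try several candidate sizes and compare how often retrieval finds the answer.
for size in (16, 32, len(text)):
    print(f"chunk_size={size}: hit rate {hit_rate(text, size, labelled_queries):.2f}")
```

In a real project the hit rate would come from your own queries and model, but the loop structure (one score per candidate size) stays the same.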

## 2. Recursive Character Text Splitter
The Recursive Character Text Splitter splits text into chunks based on a list of separator characters. It tries the separators from the list in order until the resulting chunks are small enough. By default, the list is `["\n\n", "\n", " ", ""]`, which keeps paragraphs, then sentences, then words together for as long as possible, since these are generally the most semantically related pieces of text.

> Note: It builds on the idea behind the Character Text Splitter. If a piece produced by a split is still longer than the specified `chunk_size`, the splitter moves on to the next separator in the list, and so on, until the resulting chunks are small enough.
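
The "try each separator in order" idea can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's code: `recursive_split` is a hypothetical helper that omits chunk overlap and does not merge small neighbouring pieces back together.

```python
def recursive_split(text, separators, chunk_size):
    """Split with the first separator; recurse on pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        return [text]  # nothing left to try; keep the oversized piece as-is
    separator, remaining = separators[0], separators[1:]
    if separator == "":
        # The empty separator means "cut anywhere": fall back to fixed-size slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    pieces = []
    for part in text.split(separator):
        pieces.extend(recursive_split(part, remaining, chunk_size))
    return pieces


separators = ["\n\n", "\n", " ", ""]
print(recursive_split("para one\n\nline a\nline b is long", separators, chunk_size=10))
# ['para one', 'line a', 'line', 'b', 'is', 'long']
```

The short paragraph and the short line survive intact; only the line that exceeds the limit is broken down to words.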

In [42]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("../../data/WritingArticleSummary.pdf")
pages = loader.load()
len(pages)

3

The `length_function` argument is powerful: provide a custom function when you have your own logic for calculating the length of a chunk. To use a token counter, create a function that returns the number of tokens in a given text and pass it as the `length_function` parameter. The text splitter will then measure chunks by the number of tokens instead of the number of characters.
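
A minimal sketch of such a function is shown here. `count_tokens` is a hypothetical, regex-based approximation (one token per word or punctuation mark); in practice you would likely replace its body with a call to the tokenizer that matches your model.

```python
import re


def count_tokens(text: str) -> int:
    """Approximate token count: each word or punctuation mark counts as one token."""
    return len(re.findall(r"\w+|[^\w\s]", text))


print(len("Hello, world!"))           # 13 characters
print(count_tokens("Hello, world!"))  # 4 approximate tokens

# Hypothetical usage with the splitter from the next cell:
# text_splitter = RecursiveCharacterTextSplitter(
#     chunk_size=200, chunk_overlap=10, length_function=count_tokens
# )
```

Note that `chunk_size` and `chunk_overlap` are then interpreted in the units returned by the function, so their values usually need to be adjusted.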

In [43]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=10,
    length_function=len,
)
# When you have text as a string (by directly reading a file) instead of documents
# texts = text_splitter.create_documents([text])
texts = text_splitter.split_documents(pages)
print(len(texts))

print("Preview:")
print(texts[0].page_content)

10
Preview:
1 
 
Academic Skills, Trent University    www.trentu.ca/academicskills  
Peterborough, ON Canada                     © 2014  
  Writing Article Summaries  
 
 
Understanding Article Summaries  
An article summary is a short, focused paper about one scholarly 
article. This paper is informed by critical reading of an article. For 
argumentative articles, the summary identifies, explains, and 
analyses the thesis and supporting arguments; for empirica l articles, 
the summary identifies, explains, and analyses the research 
questions, methods, and findings.  
Although article summaries are often short and rarely account for a 
large portion of your grade, they are a strong indicator of your 
reading and writ ing skills. Professors ask you to write article 
summaries to help you to develop essential skills in critical reading, 
summarizing, and clear, organized writing. Furthermore, an article 
summary requires you to read a scholarly article quite closely, which


## 3. NLTK Text Splitter
The `NLTKTextSplitter` in LangChain is an implementation of a text splitter that uses the Natural Language Toolkit (NLTK) library to split text based on tokenizers. The goal is to split long texts into smaller chunks without breaking the structure of sentences and paragraphs.

However, the `NLTKTextSplitter` is not designed to segment English text that has no spaces between words. For that purpose, you can use alternative libraries such as `pyenchant` or `wordsegment`.
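
The sentence-preserving idea can be sketched without NLTK. In this illustration a naive regex stands in for NLTK's sentence tokenizer, and `pack_sentences` is a hypothetical helper that joins whole sentences with a blank line, similar to the preview output further down.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def pack_sentences(sentences, chunk_size, joiner="\n\n"):
    """Greedily join whole sentences into chunks of at most chunk_size characters."""
    chunks, current = [], ""
    for sentence in sentences:
        candidate = sentence if not current else current + joiner + sentence
        if current and len(candidate) > chunk_size:
            # Adding this sentence would overflow: close the chunk, start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sentences = split_sentences("One. Two! Three? Four.")
print(sentences)
# ['One.', 'Two!', 'Three?', 'Four.']
print(pack_sentences(sentences, chunk_size=12))
# ['One.\n\nTwo!', 'Three?', 'Four.']
```

Because chunks are assembled from whole sentences, no sentence is cut in the middle; a single sentence longer than `chunk_size` would simply become its own oversized chunk.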

In [44]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("../../data/WritingArticleSummary.pdf")
pages = loader.load()
len(pages)

3

In [45]:
from langchain.text_splitter import NLTKTextSplitter

text_splitter = NLTKTextSplitter(chunk_size=1000)
texts = text_splitter.split_documents(pages)
print(len(texts))

print("Preview:")
print(texts[0].page_content)

12
Preview:
1 
 
Academic Skills, Trent University    www.trentu.ca/academicskills  
Peterborough, ON Canada                     © 2014  
  Writing Article Summaries  
 
 
Understanding Article Summaries  
An article summary is a short, focused paper about one scholarly 
article.

This paper is informed by critical reading of an article.

For 
argumentative articles, the summary identifies, explains, and 
analyses the thesis and supporting arguments; for empirica l articles, 
the summary identifies, explains, and analyses the research 
questions, methods, and findings.

Although article summaries are often short and rarely account for a 
large portion of your grade, they are a strong indicator of your 
reading and writ ing skills.

Professors ask you to write article 
summaries to help you to develop essential skills in critical reading, 
summarizing, and clear, organized writing.


## 4. `SpacyTextSplitter`
The `SpacyTextSplitter` splits large text documents into smaller chunks of a specified size, which makes large text inputs easier to manage. It is an alternative to NLTK-based sentence splitting that relies on spaCy instead.

In [46]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("../../data/WritingArticleSummary.pdf")
pages = loader.load()
len(pages)

3

If you see the following error on executing the cell below:
`OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.`

Then run the following command in your terminal:
`python -m spacy download en_core_web_sm`

In [49]:
from langchain.text_splitter import SpacyTextSplitter

text_splitter = SpacyTextSplitter(chunk_size=1000, chunk_overlap=20)

texts = text_splitter.split_documents(pages)

print(len(texts))

print("Preview:")
print(texts[0].page_content)

10
Preview:
1 
 
Academic Skills, Trent University    www.trentu.ca/academicskills  
Peterborough, ON Canada                     © 2014  
  

Writing Article Summaries  
 
 
Understanding Article Summaries  
An article summary is a short, focused paper about one scholarly 
article.

This paper is informed by critical reading of an article.

For 
argumentative articles, the summary identifies, explains, and 
analyses the thesis and supporting arguments; for empirica l articles, 
the summary identifies, explains, and analyses the research 
questions, methods, and findings.  


Although article summaries are often short and rarely account for a 
large portion of your grade, they are a strong indicator of your 
reading and writ ing skills.

Professors ask you to write article 
summaries to help you to develop essential skills in critical reading, 
summarizing, and clear, organized writing.


## 5. `MarkdownTextSplitter`
The `MarkdownTextSplitter` is designed to split text written in Markdown along structural elements such as headers, code blocks, or dividers. It is implemented as a simple subclass of `RecursiveCharacterTextSplitter` with Markdown-specific separators. By default, these separators are determined by the Markdown syntax, but they can be customized by providing a list of characters when initializing the `MarkdownTextSplitter` instance.
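
To give a feel for what "Markdown-specific separators" means, here is a small sketch that splits only at headings. It is an assumption-laden simplification: `split_markdown_by_heading` is a hypothetical helper, it covers just one kind of separator, and it does not enforce a chunk size.

```python
import re


def split_markdown_by_heading(markdown_text):
    """Split Markdown into sections, each starting at a heading line."""
    # Split at line starts that are followed by 1-6 '#' characters and a space.
    # Caveat: a '# comment' line inside a code block would also match.
    sections = re.split(r"^(?=#{1,6} )", markdown_text, flags=re.MULTILINE)
    return [section.strip() for section in sections if section.strip()]


sample = "# Title\nIntro text.\n\n## Section A\nBody A.\n\n## Section B\nBody B.\n"
for section in split_markdown_by_heading(sample):
    print(repr(section))
# '# Title\nIntro text.'
# '## Section A\nBody A.'
# '## Section B\nBody B.'
```

A recursive splitter would go one step further and break any section that is still too long using finer-grained separators.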

In [50]:
from langchain.text_splitter import MarkdownTextSplitter

markdown_text = """
# 

# Welcome to My Blog!

## Introduction
Hello everyone! My name is **John Doe** and I am a _software developer_. I specialize in Python, Java, and JavaScript.

Here's a list of my favorite programming languages:

1. Python
2. JavaScript
3. Java

You can check out some of my projects on [GitHub](https://github.com).

## About this Blog
In this blog, I will share my journey as a software developer. I'll post tutorials, my thoughts on the latest technology trends, and occasional book reviews.

Here's a small piece of Python code to say hello:

\``` python
def say_hello(name):
    print(f"Hello, {name}!")

say_hello("John")
\```

Stay tuned for more updates!

## Contact Me
Feel free to reach out to me on [Twitter](https://twitter.com) or send me an email at johndoe@email.com.

"""

markdown_splitter = MarkdownTextSplitter(chunk_size=100, chunk_overlap=0)
docs = markdown_splitter.create_documents([markdown_text])

print(docs)

[Document(page_content='# \n\n# Welcome to My Blog!', metadata={}), Document(page_content='## Introduction', metadata={}), Document(page_content='Hello everyone! My name is **John Doe** and I am a _software developer_. I specialize in Python,', metadata={}), Document(page_content='Java, and JavaScript.', metadata={}), Document(page_content="Here's a list of my favorite programming languages:\n\n1. Python\n2. JavaScript\n3. Java", metadata={}), Document(page_content='You can check out some of my projects on [GitHub](https://github.com).', metadata={}), Document(page_content='## About this Blog', metadata={}), Document(page_content="In this blog, I will share my journey as a software developer. I'll post tutorials, my thoughts on", metadata={}), Document(page_content='the latest technology trends, and occasional book reviews.', metadata={}), Document(page_content="Here's a small piece of Python code to say hello:", metadata={}), Document(page_content='\\``` python\ndef say_hello(name):\n

## 6. `TokenTextSplitter`
The main advantage of using `TokenTextSplitter` over other text splitters, like `CharacterTextSplitter`, is that it respects the token boundaries, ensuring that the chunks do not split tokens in the middle. This can be particularly helpful in maintaining the semantic integrity of the text when working with language models and embeddings.

This type of splitter breaks down raw text strings into smaller pieces by initially converting the text into BPE (Byte Pair Encoding) tokens, and subsequently dividing these tokens into chunks. It then reassembles the tokens within each chunk back into text.

One potential drawback of using `TokenTextSplitter` is that it may require additional computation when converting text to BPE tokens and back. If you need a faster and simpler text-splitting method, you might consider using `CharacterTextSplitter`, which directly splits the text based on character count, offering a more straightforward approach to text segmentation.
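
The encode, chunk, decode pipeline described above can be sketched as follows. The tokenizer here is a toy assumption (`toy_encode` treats words, punctuation marks, and whitespace runs as tokens), not BPE, so the token counts will differ from what a real tokenizer produces.

```python
import re


def toy_encode(text):
    """Toy tokenizer: whitespace runs, words, and punctuation each become one token."""
    return re.findall(r"\s+|\w+|[^\w\s]", text)


def toy_decode(tokens):
    """Reassemble the original text from its tokens."""
    return "".join(tokens)


def split_by_tokens(text, chunk_size, chunk_overlap):
    """Encode, slide a window of chunk_size tokens with overlap, decode each window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = toy_encode(text)
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(toy_decode(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the token list
    return chunks


print(split_by_tokens("a b c d e", chunk_size=4, chunk_overlap=1))
# ['a b ', ' c d', 'd e']
```

Each chunk boundary falls between tokens, and consecutive chunks share `chunk_overlap` tokens, which is the property the splitter relies on.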

In [51]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("../../data/WritingArticleSummary.pdf")
pages = loader.load()
len(pages)

3

In [55]:
from langchain.text_splitter import TokenTextSplitter

# Initialize the TokenTextSplitter with desired chunk size and overlap
text_splitter = TokenTextSplitter(chunk_size=500, chunk_overlap=50)

# Split into smaller chunks
texts = text_splitter.split_documents(pages)

print(len(texts))

print("Preview:")
print(texts[0].page_content)

6
Preview:
1 
 
Academic Skills, Trent University    www.trentu.ca/academicskills  
Peterborough, ON Canada                     © 2014  
  Writing Article Summaries  
 
 
Understanding Article Summaries  
An article summary is a short, focused paper about one scholarly 
article. This paper is informed by critical reading of an article. For 
argumentative articles, the summary identifies, explains, and 
analyses the thesis and supporting arguments; for empirica l articles, 
the summary identifies, explains, and analyses the research 
questions, methods, and findings.  
Although article summaries are often short and rarely account for a 
large portion of your grade, they are a strong indicator of your 
reading and writ ing skills. Professors ask you to write article 
summaries to help you to develop essential skills in critical reading, 
summarizing, and clear, organized writing. Furthermore, an article 
summary requires you to read a scholarly article quite closely, which 
provides a us