# Document Splitting

Document splitting in LangChain is the process of breaking long texts into smaller chunks that a language model can process. The goal is to keep each chunk semantically coherent, so that related pieces of text stay together; what counts as "related" depends on the type of text. LangChain provides a range of built-in text splitters in the `langchain-text-splitters` package. They differ in how they split the text (for example by sentences, HTML or Markdown structure, programming-language syntax, tokens, or semantic similarity) and in how they measure chunk size. Some splitters also attach metadata to each chunk, recording where it came from.

Splitting makes large documents manageable, lets you tailor the chunking to the data at hand, and prepares the text for further processing or analysis by a language model. Tools such as ChunkViz help visualize and evaluate how a splitter divides a text, which is useful when tuning a splitting strategy.

More on text splitting: [LangChain official documentation](https://python.langchain.com/docs/modules/data_connection/document_transformers/)

<div style="text-align:center"><img src="document_splitting.png" /></div>

In [11]:
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

In [12]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

In [13]:
chunk_size = 26
chunk_overlap = 4

In [14]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

Why doesn't this split the string below? It is exactly 26 characters long, so the whole string fits within `chunk_size`.

In [15]:
text1 = 'abcdefghijklmnopqrstuvwxyz'

In [16]:
r_splitter.split_text(text1)

['abcdefghijklmnopqrstuvwxyz']

In [17]:
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'

In [18]:
r_splitter.split_text(text2)

['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']

This time the string is split. The second chunk starts with `wxyz`, the last 4 characters of the first chunk, which matches `chunk_overlap = 4`.
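For strings with no separators at all, the interaction between `chunk_size` and `chunk_overlap` can be pictured as a sliding window. The sketch below is only an illustration in plain Python, not LangChain's implementation (the real splitters split on separators first and then merge pieces); `sliding_window` is a hypothetical helper name.

```python
def sliding_window(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks, start = [], 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # this window reached the end
            break
        start += step
    return chunks


print(sliding_window('abcdefghijklmnopqrstuvwxyz', 26, 4))
print(sliding_window('abcdefghijklmnopqrstuvwxyzabcdefg', 26, 4))
```

With a window of 26 and an overlap of 4, each new window starts 22 characters after the previous one, which is why the second chunk begins at `w`.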

In [19]:
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"

In [20]:
r_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

In [21]:
c_splitter.split_text(text3)

['a b c d e f g h i j k l m n o p q r s t u v w x y z']

In [22]:
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator = ' '
)
c_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
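The two results above differ only in the separator: the default separator (`"\n\n"`) never occurs in `text3`, so the text stays in one piece, while `' '` yields many small pieces that are then merged back up to `chunk_size`. The following is a simplified sketch of such a split-then-merge step, written in plain Python under the assumption that pieces are merged greedily and that trailing pieces are carried over as overlap. `merge_pieces` is a hypothetical name, and the library's actual code may differ in its details.

```python
def merge_pieces(pieces, separator, chunk_size, chunk_overlap):
    """Greedily join pieces into chunks, carrying trailing pieces over as overlap."""
    chunks, current, total = [], [], 0
    sep_len = len(separator)
    for piece in pieces:
        joiner = sep_len if current else 0
        if current and total + joiner + len(piece) > chunk_size:
            chunks.append(separator.join(current))
            # Drop pieces from the front until what is left fits the overlap
            # budget and leaves room for the incoming piece.
            while total > chunk_overlap or (
                total > 0 and total + sep_len + len(piece) > chunk_size
            ):
                total -= len(current[0]) + (sep_len if len(current) > 1 else 0)
                current.pop(0)
        current.append(piece)
        total += len(piece) + (sep_len if len(current) > 1 else 0)
    if current:
        chunks.append(separator.join(current))
    return chunks


text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
print(merge_pieces(text3.split(" "), " ", 26, 4))
print(merge_pieces(text3.split("\n\n"), "\n\n", 26, 4))
```

Note that a piece longer than `chunk_size` is emitted as is, since this step only merges and never cuts.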

Try your own examples!

In [23]:
# Sample text in Portuguese: "Wow, this is cool. I'm trying more examples to understand whether it makes sense to use this splitter"
text4 = 'NOSSA QUE LEGAL ISSO AQUI. ESTOU TENTANDO MAIS EXEMPLOS PRA ENTENDER SE FAZ SENTIDO USAR ESSE SPLITTER'

c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator='.'
)

for x in c_splitter.split_text(text4):
    print("split:", x, "- Length:", len(x))

split: NOSSA QUE LEGAL ISSO AQUI - Length: 25
split: ESTOU TENTANDO MAIS EXEMPLOS PRA ENTENDER SE FAZ SENTIDO USAR ESSE SPLITTER - Length: 75

In [24]:
for x in r_splitter.split_text(text4):
    print("split:", x, "- Length:", len(x))

split: NOSSA QUE LEGAL ISSO AQUI. - Length: 26
split: ESTOU TENTANDO MAIS - Length: 19
split: EXEMPLOS PRA ENTENDER SE - Length: 24
split: SE FAZ SENTIDO USAR ESSE - Length: 24
split: SPLITTER - Length: 8


## Recursive splitting details

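`RecursiveCharacterTextSplitter` is recommended for generic text. It tries a list of separators in order (by default `["\n\n", "\n", " ", ""]`) and falls back to a finer-grained separator only when a piece is still too large.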

In [25]:
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

In [26]:
len(some_text)

496

In [27]:
c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator = ' '
)
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0, 
    separators=["\n\n", "\n", " ", ""]  # "" always matches, so it must come last
)

In [28]:
for x in c_splitter.split_text(some_text):
    print("split:", x, "- Length:", len(x))

split: When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. 

 Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, - Length: 448
split: have a space.and words are separated by space. - Length: 46


`c_splitter`: these two chunks do not make much sense. The split lands in the middle of a sentence, between "but also," and "have a space."

In [29]:
for x in r_splitter.split_text(some_text):
    print("split:", x, "- Length:", len(x))

split: When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. - Length: 248
split: Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space. - Length: 243


`r_splitter`: these two chunks make more sense. The split falls on the paragraph break (`\n\n`), so each chunk is a complete paragraph.
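The idea behind recursive splitting can be sketched in a few lines of plain Python: try the coarsest separator first, and only recurse with the next separator on pieces that are still too large. This is a simplified illustration without overlap handling, not the library's code; `recursive_split` and `greedy_merge` are hypothetical names.

```python
def greedy_merge(pieces, sep, chunk_size):
    """Join neighbouring pieces with sep for as long as the result fits chunk_size."""
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


def recursive_split(text, separators, chunk_size):
    """Split on the first separator; recurse with the rest on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, rest, chunk_size)
    chunks, small = [], []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            small.append(piece)
        else:
            chunks.extend(greedy_merge(small, sep, chunk_size))
            small = []
            chunks.extend(recursive_split(piece, rest, chunk_size))
    chunks.extend(greedy_merge(small, sep, chunk_size))
    return chunks


demo = "aaa bbb\n\nccc ddd eee"
print(recursive_split(demo, ["\n\n", " ", ""], 8))
print(recursive_split("abcdefghij", ["\n\n", " ", ""], 4))
```

In the first call the paragraph break is used first, and only the second paragraph, which is still longer than 8 characters, is split again on spaces.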

In [30]:
c_splitter = CharacterTextSplitter(
    chunk_size=250,
    chunk_overlap=0,
    separator = '.'
)

for x in c_splitter.split_text(some_text):
    print("split:", x, "- Length:", len(x))

split: When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document - Length: 247
split: Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space - Length: 242


Splitting on periods works better for `c_splitter` than splitting on spaces did: each chunk now ends at a sentence boundary. Which splitting strategy works best depends on the shape and format of the data.

### Let's reduce the chunk size a bit and add a period to our separators:

In [31]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"\. ", " ", ""]  # raw string avoids an invalid "\." escape
)
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example,",
 'closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this',
 'string. Sentences have a period at the end, but also, have a space.and words are separated by space.']

In [32]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""]  # raw string for the lookbehind pattern
)
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example,",
 'closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this',
 'string. Sentences have a period at the end, but also, have a space.and words are separated by space.']
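Both outputs above break at "For example," rather than at a sentence boundary, which suggests the period patterns were not applied as regular expressions. As far as I know, recent LangChain releases treat separators as literal text unless `is_separator_regex=True` is passed to the splitter; that is an assumption worth checking against the installed version. What the two patterns are meant to do can be shown with the standard `re` module alone, independent of LangChain:

```python
import re

text = "First sentence. Second sentence. Third"

# Plain pattern: the matched ". " is consumed, so the periods disappear.
plain = re.split(r"\. ", text)

# Lookbehind pattern: matches the empty position right after ". ",
# so each period stays attached to its sentence (requires Python 3.7+).
lookbehind = re.split(r"(?<=\. )", text)

print(plain)
print(lookbehind)
```

The lookbehind variant is the one that keeps sentences intact, which is the motivation for using it as a separator.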

### Using PDF Document Loader

In [34]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("../docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()

In [35]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len
)

In [36]:
docs = text_splitter.split_documents(pages)

In [37]:
len(docs)

77

In [38]:
len(pages)

22
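The 22 pages become 77 chunks because `split_documents` works on each loaded page separately. The sketch below illustrates that idea in plain Python, under the assumption that every chunk receives a copy of its parent page's metadata (the `TokenTextSplitter` output later in this notebook shows the same `source` and `page` values on a chunk). `Doc`, `split_documents`, and `fixed_width` are hypothetical stand-ins, not LangChain classes or functions.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for LangChain's Document class."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_documents(docs, split_text):
    """Split each document on its own; chunks get a copy of the parent's metadata."""
    chunks = []
    for doc in docs:
        for piece in split_text(doc.page_content):
            chunks.append(Doc(piece, dict(doc.metadata)))
    return chunks


def fixed_width(text, width=10):
    """Toy splitter: cut the text every `width` characters."""
    return [text[i:i + width] for i in range(0, len(text), width)]


pages = [
    Doc("first page of text", {"source": "example.pdf", "page": 0}),
    Doc("second page", {"source": "example.pdf", "page": 1}),
]
chunks = split_documents(pages, fixed_width)
for chunk in chunks:
    print(repr(chunk.page_content), chunk.metadata)
```

Because pages are never merged before splitting, no chunk spans a page boundary in this sketch.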

In [39]:
print(docs[0].page_content)

MachineLearning-Lecture01  
Instructor (Andrew Ng):  Okay. Good morning. Welcome to CS229, the machine 
learning class. So what I wanna do today is ju st spend a little time going over the logistics 
of the class, and then we'll start to  talk a bit about machine learning.  
By way of introduction, my name's  Andrew Ng and I'll be instru ctor for this class. And so 
I personally work in machine learning, and I' ve worked on it for about 15 years now, and 
I actually think that machine learning is th e most exciting field of all the computer 
sciences. So I'm actually always excited about  teaching this class. Sometimes I actually 
think that machine learning is not only the most exciting thin g in computer science, but 
the most exciting thing in all of human e ndeavor, so maybe a little bias there.  
I also want to introduce the TAs, who are all graduate students doing research in or 
related to the machine learni ng and all aspects of machin e learning. Paul Baumstarck


## Token splitting

We can also split on token count explicitly, if we want.

This can be useful because LLM context windows are usually specified in tokens.

As a rule of thumb, a token is about 4 characters.

In [40]:
from langchain.text_splitter import TokenTextSplitter

In [41]:
text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)

In [42]:
text1 = "foo bar bazzyfoo"

In [43]:
text_splitter.split_text(text1)

['foo', ' bar', ' b', 'az', 'zy', 'foo']
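The rule of thumb of about 4 characters per token is only an approximation, as the output above illustrates: the 16-character string came out as 6 tokens. A minimal sketch comparing the heuristic with the token list printed above (`estimate_tokens` is a hypothetical helper, not a tokenizer):

```python
import math


def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate from the character count (heuristic, not a tokenizer)."""
    return math.ceil(len(text) / chars_per_token)


sample = "foo bar bazzyfoo"
# Tokens as printed by the TokenTextSplitter cell above.
observed = ['foo', ' bar', ' b', 'az', 'zy', 'foo']

print("estimated:", estimate_tokens(sample))
print("observed: ", len(observed))
```

Unusual words such as "bazzyfoo" tend to break into several short tokens, so a character-based estimate can undercount.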

In [44]:
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

In [45]:
docs = text_splitter.split_documents(pages)

In [46]:
docs[0]

Document(page_content='MachineLearning-Lecture01  \n', metadata={'source': '../docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0})

In [47]:
pages[0].metadata

{'source': '../docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0}

## Context aware splitting

Chunking aims to keep text with common context together.

Text splitters often use sentences or other delimiters to keep related text together, but many documents (such as Markdown) have structure (headers) that can be used explicitly when splitting.

We can use `MarkdownHeaderTextSplitter` to preserve header metadata in our chunks, as shown below.

In [48]:
# ! pip install unstructured

In [49]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter

In [50]:
markdown_document = """# Title\n\n
## Chapter 1\n\n \
Hi this is Jim\n\n Hi this is Joe\n\n
### Section A\n\n
Hi this is Lance\n\n
### Section B\n\n
Hi this is Paul\n\n
## Chapter 2\n\n
Hi this is Mike\n\n
### Section C\n\n
Hi this is Joe\n\n
### Section D\n\n
Hi this is Paul\n\n
## Chapter 3\n\n
Hi this is Mike\n\n
### Section E\n\n
Hi this is Joe\n\n
### Section F\n\n
Hi this is Paul\n\n
"""

In [51]:
headers_to_split_on = [
    ('#', 'Header 1'),
    ('##', 'Header 2'),
    ('###', 'Header 3'),
]

In [52]:
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(markdown_document)

In [53]:
md_header_splits[0]

Document(page_content='Hi this is Jim  \nHi this is Joe', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'})

In [54]:
md_header_splits[1]

Document(page_content='Hi this is Lance', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section A'})
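The metadata in these results records which headers were "open" when each chunk of text appeared. A simplified sketch of that bookkeeping in plain Python follows. It is an illustration only, not the `MarkdownHeaderTextSplitter` implementation: `split_by_headers` is a hypothetical name, chunks are plain dictionaries, and lines are joined with `"\n"` rather than the `"  \n"` seen in the output above.

```python
def split_by_headers(markdown, headers_to_split_on):
    """Group body lines under the headers that are active when they appear."""
    chunks, lines = [], []
    active = {}  # header depth -> (metadata key, header title)

    def flush():
        content = "\n".join(lines).strip()
        if content:
            metadata = {key: title for _, (key, title) in sorted(active.items())}
            chunks.append({"page_content": content, "metadata": metadata})
        lines.clear()

    for raw in markdown.splitlines():
        line = raw.strip()
        for marker, key in headers_to_split_on:
            if line.startswith(marker + " "):
                flush()
                depth = len(marker)
                # A new header closes any header at the same or a deeper level.
                for d in [d for d in active if d >= depth]:
                    del active[d]
                active[depth] = (key, line[len(marker):].strip())
                break
        else:
            if line:
                lines.append(line)
    flush()
    return chunks


sample_md = """# Title

## Chapter 1

Hi this is Jim

Hi this is Joe

### Section A

Hi this is Lance

## Chapter 2

Hi this is Mike
"""

headers = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
for chunk in split_by_headers(sample_md, headers):
    print(chunk)
```

When "Chapter 2" starts, the sketch discards both "Chapter 1" and "Section A", so the last chunk carries only the title and the new chapter.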

Try it on a real Markdown file, such as an `.md` file in this repository.

In [55]:
loader = DirectoryLoader("./", glob="*.md", loader_cls=TextLoader)
docs = loader.load()
txt = ' '.join([d.page_content for d in docs])

In [56]:
headers_to_split_on = [
    ('#', 'Header 1'),
    ('##', 'Header 2'),
    ('###', 'Header 3'),
]

markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

In [57]:
md_header_splits = markdown_splitter.split_text(txt)
md_header_splits[0]

Document(page_content="This document chronicles my journey through a challenge encountered in a Langchain course offered by DeepLearning.AI. The course recommended using OpenAI's API for audio transcription, a resource I neither had access to nor could use locally on my machine. Faced with this limitation, I embarked on a quest to find a workaround that allowed me to perform the transcription locally, utilizing my NVIDIA GeForce GTX 1650 GPU. This narrative covers the problem, the solution I devised, and the invaluable lessons learned along the way.", metadata={'Header 1': 'Overcoming OpenAI API Limitations with Local GPU Acceleration for Audio Transcription', 'Header 2': 'Introduction'})

In [58]:
for document in md_header_splits:
    print(document.metadata)
    print(document.page_content)
    print("-" * 50)

{'Header 1': 'Overcoming OpenAI API Limitations with Local GPU Acceleration for Audio Transcription', 'Header 2': 'Introduction'}
This document chronicles my journey through a challenge encountered in a Langchain course offered by DeepLearning.AI. The course recommended using OpenAI's API for audio transcription, a resource I neither had access to nor could use locally on my machine. Faced with this limitation, I embarked on a quest to find a workaround that allowed me to perform the transcription locally, utilizing my NVIDIA GeForce GTX 1650 GPU. This narrative covers the problem, the solution I devised, and the invaluable lessons learned along the way.
--------------------------------------------------
{'Header 1': 'Overcoming OpenAI API Limitations with Local GPU Acceleration for Audio Transcription', 'Header 2': 'The Problem'}
In the context of the course, the OpenAIWhisperParser API was utilized for transcribing audio from YouTube videos. However, this API is part of OpenAI's paid