- Once you've loaded documents, you'll often want to transform them to better suit your application. 
- You may want to split a long document into smaller chunks that can fit into your model's context window.
- LangChain has a number of built-in document transformers that make it easy to split, combine, filter, and otherwise manipulate documents.
- When you want to deal with long pieces of text, it is necessary to split up that text into chunks. As simple as this sounds, there is a lot of potential complexity here. Ideally, you want to keep the semantically related pieces of text together. What "semantically related" means could depend on the type of text. At a high level, text splitters work as follows:
    1. Split the text up into small, semantically meaningful chunks (often sentences).
    2. Start combining these small chunks into a larger chunk until you reach a certain size (as measured by some function).
    3. Once you reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks).

* That means there are two different axes along which you can customize your text splitter:

- How the text is split
- How the chunk size is measured 
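
The three steps above can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the helper name merge_with_overlap, the regex used to find sentences, and the sample text are assumptions made for the example. The length_function parameter corresponds to the second axis (how the chunk size is measured) and defaults to counting characters.

```python
import re
from typing import Callable, List


def merge_with_overlap(
    pieces: List[str],
    chunk_size: int,
    chunk_overlap: int,
    length_function: Callable[[str], int] = len,
    joiner: str = " ",
) -> List[str]:
    """Greedily pack small pieces into chunks, carrying some overlap forward."""
    chunks: List[str] = []
    current: List[str] = []
    for piece in pieces:
        candidate = joiner.join(current + [piece])
        if current and length_function(candidate) > chunk_size:
            # Step 3: the chunk is full, so emit it as its own piece of text.
            chunks.append(joiner.join(current))
            # Keep trailing pieces as overlap. Drop pieces from the front until
            # the carried-over text fits in chunk_overlap and still leaves room
            # for the next piece.
            while current and (
                length_function(joiner.join(current)) > chunk_overlap
                or length_function(joiner.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        # Step 2: keep combining pieces into the current chunk.
        current.append(piece)
    if current:
        chunks.append(joiner.join(current))
    return chunks


text = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
# Step 1: split into small, semantically meaningful pieces (here: sentences).
sentences = re.split(r"(?<=\.)\s+", text)
chunks = merge_with_overlap(sentences, chunk_size=35, chunk_overlap=20)
for chunk in chunks:
    print(repr(chunk))
```

Swapping the regex changes how the text is split; swapping length_function (for example, for a token counter) changes how the size is measured.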

In [1]:
## Reading a PDF file
from langchain_community.document_loaders import PyPDFLoader

In [2]:
pdf_loader = PyPDFLoader('Data/attention.pdf')
documents_pdf = pdf_loader.load()
documents_pdf

[Document(metadata={'source': 'Data/attention.pdf', 'page': 0}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.comNoam Shazeer∗\nGoogle Brain\nnoam@google.comNiki Parmar∗\nGoogle Research\nnikip@google.comJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.comAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.eduŁukasz Kaiser∗\nGoogle Brain\nlukaszkaiser@google.com\nIllia Polosukhin∗ ‡\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transform

In [4]:
type(documents_pdf)

list

# 1. Recursively split by character
This text splitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""].
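
A rough sketch of the fallback idea, assuming a hypothetical helper recursive_split. It only shows how the separators are tried in order; unlike the real splitter, it does not merge small pieces back up to the chunk size or add overlap.

```python
from typing import List, Sequence


def recursive_split(text: str, separators: Sequence[str], chunk_size: int) -> List[str]:
    """Split on the first separator, then retry oversized pieces with the next one."""
    if len(text) <= chunk_size or not separators:
        return [text]
    separator, rest = separators[0], separators[1:]
    if separator == "":
        # Last resort: cut the text into fixed-size character slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    pieces: List[str] = []
    for part in text.split(separator):
        if part:
            pieces.extend(recursive_split(part, rest, chunk_size))
    return pieces


sample = (
    "First paragraph.\n\n"
    "Second paragraph is quite a bit longer than the first.\n"
    "It has two lines."
)
pieces = recursive_split(sample, ["\n\n", "\n", " ", ""], chunk_size=30)
print(pieces)
```

The short paragraph and the short line survive whole, while the long line falls through to the space separator and is broken into words.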

In [3]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [7]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500,chunk_overlap=50)
final_documnets = text_splitter.split_documents(documents_pdf)
final_documnets


[Document(metadata={'source': 'Data/attention.pdf', 'page': 0}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.comNoam Shazeer∗\nGoogle Brain\nnoam@google.comNiki Parmar∗\nGoogle Research\nnikip@google.comJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.comAidan N. Gomez∗ †\nUniversity of Toronto'),
 Document(metadata={'source': 'Data/attention.pdf', 'page': 0}, page_content='University of Toronto\naidan@cs.toronto.eduŁukasz Kaiser∗\nGoogle Brain\nlukaszkaiser@google.com\nIllia Polosukhin∗ ‡\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder a

In [16]:
print(final_documnets[0])
print(final_documnets[1])
print(type(final_documnets[0]))

page_content='Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.comNoam Shazeer∗
Google Brain
noam@google.comNiki Parmar∗
Google Research
nikip@google.comJakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.comAidan N. Gomez∗ †
University of Toronto' metadata={'source': 'Data/attention.pdf', 'page': 0}
page_content='University of Toronto
aidan@cs.toronto.eduŁukasz Kaiser∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗ ‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transform

In [15]:
speech = ''
with open('Data/speech.txt') as f:
    speech = f.read()
    print(type(speech))

text_splitter = RecursiveCharacterTextSplitter(chunk_size=50,chunk_overlap=20)
text = text_splitter.create_documents([speech])
print(text[0])
print(text[1])
print(type(text[0]))

<class 'str'>
page_content='The world must be made safe for democracy. Its'
page_content='for democracy. Its peace must be planted upon the'
<class 'langchain_core.documents.base.Document'>


# 2. Split by character
This is the simplest method. It splits on a single separator (by default "\n\n") and measures chunk length by number of characters.

1. How the text is split: by single character.
2. How the chunk size is measured: by number of characters.
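
With a single separator there is nothing finer to fall back on, so a paragraph longer than chunk_size is kept whole. This appears to be why the cells in this section log "Created a chunk of size ..., which is longer than the specified ..." messages. A minimal sketch of that behaviour, assuming a hypothetical helper split_on_separator and made-up sample text:

```python
from typing import List


def split_on_separator(text: str, separator: str) -> List[str]:
    """Split on one separator only; oversized pieces cannot be broken down further."""
    return [piece.strip() for piece in text.split(separator) if piece.strip()]


sample = "Short paragraph.\n\n" + "A much longer paragraph. " * 5
chunk_size = 40
pieces = split_on_separator(sample, "\n\n")
oversized = [p for p in pieces if len(p) > chunk_size]
for piece in pieces:
    status = "too long" if len(piece) > chunk_size else "ok"
    print(len(piece), status)
```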


In [18]:
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter

In [19]:
loader = TextLoader('Data/speech.txt')
docs = loader.load()
docs

[Document(metadata={'source': 'Data/speech.txt'}, page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.\n\nJust because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.\n\n…\n\nIt will be all the easier for us to conduct ourselves as belligerents in a high spirit of right and fairne

In [23]:
text_splitter = CharacterTextSplitter(separator='\n\n',chunk_size=100,chunk_overlap=40)
text = text_splitter.split_documents(docs)
text


Created a chunk of size 470, which is longer than the specified 100
Created a chunk of size 347, which is longer than the specified 100
Created a chunk of size 668, which is longer than the specified 100
Created a chunk of size 982, which is longer than the specified 100
Created a chunk of size 789, which is longer than the specified 100


[Document(metadata={'source': 'Data/speech.txt'}, page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.'),
 Document(metadata={'source': 'Data/speech.txt'}, page_content='Just because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.'),
 Document(metadata={'source': 'Data/speech.txt'

In [24]:
speech = ''
with open('Data/speech.txt') as f:
    speech = f.read()
    print(type(speech))

text_splitter = CharacterTextSplitter(separator='\n\n',chunk_size=100,chunk_overlap=40)
text = text_splitter.create_documents([speech])
print(text[0])
print(text[1])
print(type(text[0]))

Created a chunk of size 470, which is longer than the specified 100
Created a chunk of size 347, which is longer than the specified 100
Created a chunk of size 668, which is longer than the specified 100
Created a chunk of size 982, which is longer than the specified 100
Created a chunk of size 789, which is longer than the specified 100


<class 'str'>
page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.'
page_content='Just because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.'
<class 'langchain_core.documents.base.Document'>


# 3. Split by HTML header
This splitter breaks HTML at the header elements listed in headers_to_split_on and stores the headers that enclose each chunk in its metadata.

In [25]:
from langchain_text_splitters import HTMLHeaderTextSplitter

In [26]:

html_string = """
<!DOCTYPE html>
<html>
<body>
    <div>
        <h1>Foo</h1>
        <p>Some intro text about Foo.</p>
        <div>
            <h2>Bar main section</h2>
            <p>Some intro text about Bar.</p>
            <h3>Bar subsection 1</h3>
            <p>Some text about the first subtopic of Bar.</p>
            <h3>Bar subsection 2</h3>
            <p>Some text about the second subtopic of Bar.</p>
        </div>
        <div>
            <h2>Baz</h2>
            <p>Some text about Baz</p>
        </div>
        <br>
        <p>Some concluding text about Foo</p>
    </div>
</body>
</html>
"""

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]



In [27]:
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits

[Document(page_content='Foo'),
 Document(metadata={'Header 1': 'Foo'}, page_content='Some intro text about Foo.  \nBar main section Bar subsection 1 Bar subsection 2'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section'}, page_content='Some intro text about Bar.'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 1'}, page_content='Some text about the first subtopic of Bar.'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 2'}, page_content='Some text about the second subtopic of Bar.'),
 Document(metadata={'Header 1': 'Foo'}, page_content='Baz'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}, page_content='Some text about Baz'),
 Document(metadata={'Header 1': 'Foo'}, page_content='Some concluding text about Foo')]
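
How the header metadata in the output above is built up can be sketched with the standard library's html.parser. This is an illustration only, not how HTMLHeaderTextSplitter is implemented: the class HeaderTracker is hypothetical, and it groups text more simply than the real splitter does.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple


class HeaderTracker(HTMLParser):
    """Attach the currently active headers to every piece of body text."""

    def __init__(self, headers_to_split_on: List[Tuple[str, str]]):
        super().__init__()
        self.names = dict(headers_to_split_on)  # e.g. {"h1": "Header 1"}
        self.levels = [tag for tag, _ in headers_to_split_on]
        self.active: Dict[str, str] = {}
        self.in_header: Optional[str] = None
        self.sections: List[Tuple[Dict[str, str], str]] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.names:
            self.in_header = tag

    def handle_endtag(self, tag):
        if tag == self.in_header:
            self.in_header = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_header:
            # A new header replaces its own level and clears all deeper levels.
            index = self.levels.index(self.in_header)
            for deeper in self.levels[index:]:
                self.active.pop(self.names[deeper], None)
            self.active[self.names[self.in_header]] = text
        else:
            self.sections.append((dict(self.active), text))


html = "<h1>Foo</h1><p>Intro.</p><h2>Bar</h2><p>About Bar.</p><h2>Baz</h2><p>About Baz.</p>"
tracker = HeaderTracker([("h1", "Header 1"), ("h2", "Header 2")])
tracker.feed(html)
tracker.close()
for metadata, text in tracker.sections:
    print(metadata, text)
```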

In [35]:
url = "https://360digitmg.com/"
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]


In [37]:
splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = splitter.split_text_from_url(url)
html_header_splits

[Document(page_content='Limited Seats Available. Call us right away!  \nCourses  \nCategories Objective  \nData Science & Deep Learning Data Analytics & Business Intelligence Generative AI (Gen AI) MLOps Data Engineering & Cloud Technologies Analytics Specializations Business Corporate Training Franchise  \nFeatured Data Science Course In India Trending Best Data Science Training in India Using Python Certification Course in Core Python Artificial Intelligence & Deep Learning Course Training  \nCertification Courses  \n4 Months | VILT  \n2 Months | Hybrid  \n2 Months | VILT  \n2.5 Months | VILT  \nTrending Professional Data Science & AI Course with Placement Assistance Featured Practical Data Science and Artificial Intelligence Course  \nProfessional Courses  \n6 Months | Hybrid  \n12 Months | Hybrid  \nTrending Certificate Course on Data Analytics Data Visualization using Tableau Training Data Visualisation using Power BI Training Featured Certificate Program on Business Analytics  \n

# 4. Recursively split JSON
This JSON splitter traverses JSON data depth first and builds smaller JSON chunks. It attempts to keep nested JSON objects whole but will split them if needed to keep chunks between a min_chunk_size and the max_chunk_size. If a value is not a nested JSON object but a very large string, the string will not be split. If you need a hard cap on the chunk size, consider following this with a recursive text splitter on those chunks. There is an optional pre-processing step to split lists, by first converting them to JSON (dict) and then splitting them as such.

1. How the text is split: by JSON value.
2. How the chunk size is measured: by number of characters.
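
The depth-first traversal described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's code: the helpers set_nested and split_json_sketch are hypothetical, size is measured as the length of the json.dumps output, and min_chunk_size and list handling are left out.

```python
import json
from typing import Any, Dict, List


def set_nested(target: Dict[str, Any], path: List[str], value: Any) -> None:
    """Write value into target at the nested key path, creating dicts as needed."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def split_json_sketch(data: Dict[str, Any], max_chunk_size: int) -> List[Dict[str, Any]]:
    chunks: List[Dict[str, Any]] = [{}]

    def size(obj: Any) -> int:
        return len(json.dumps(obj))

    def visit(node: Dict[str, Any], path: List[str]) -> None:
        for key, value in node.items():
            new_path = path + [key]
            remaining = max_chunk_size - size(chunks[-1])
            # The check ignores the enclosing path keys, so a chunk can end up
            # slightly over the limit in this sketch.
            if size({key: value}) < remaining:
                # The whole value fits: keep the nested object intact.
                set_nested(chunks[-1], new_path, value)
            elif isinstance(value, dict) and value:
                # Too big but splittable: descend depth first.
                visit(value, new_path)
            else:
                # A leaf that does not fit: start a new chunk for it.
                if chunks[-1]:
                    chunks.append({})
                set_nested(chunks[-1], new_path, value)

    visit(data, [])
    return chunks


data = {
    "info": {"title": "Demo", "version": "1.0"},
    "paths": {
        "/a": {"get": {"summary": "Read A"}},
        "/b": {"get": {"summary": "Read B"}},
    },
}
chunks = split_json_sketch(data, max_chunk_size=60)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the keys on the path down to its values, so every chunk is still valid JSON on its own.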


In [38]:
import json

import requests

In [39]:
json_data = requests.get("https://api.smith.langchain.com/openapi.json").json()

In [40]:
from langchain_text_splitters import RecursiveJsonSplitter

In [42]:
splitter = RecursiveJsonSplitter(max_chunk_size=300)
json_chunks = splitter.split_json(json_data=json_data)

In [43]:
for chunk in json_chunks[:3]:
    print(chunk)

{'openapi': '3.1.0', 'info': {'title': 'LangSmith', 'version': '0.1.0'}, 'paths': {'/api/v1/sessions/{session_id}': {'get': {'tags': ['tracer-sessions'], 'summary': 'Read Tracer Session', 'description': 'Get a specific session.'}}}}
{'paths': {'/api/v1/sessions/{session_id}': {'get': {'operationId': 'read_tracer_session_api_v1_sessions__session_id__get', 'security': [{'API Key': []}, {'Tenant ID': []}, {'Bearer Auth': []}]}}}}
{'paths': {'/api/v1/sessions/{session_id}': {'get': {'parameters': [{'name': 'session_id', 'in': 'path', 'required': True, 'schema': {'type': 'string', 'format': 'uuid', 'title': 'Session Id'}}, {'name': 'include_stats', 'in': 'query', 'required': False, 'schema': {'type': 'boolean', 'default': False, 'title': 'Include Stats'}}, {'name': 'accept', 'in': 'header', 'required': False, 'schema': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'title': 'Accept'}}]}}}}


In [45]:
# The splitter can also output documents

docs = splitter.create_documents(texts=[json_data])
for doc in docs[:3]:
    print(doc)

page_content='{"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}, "paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session."}}}}'
page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"operationId": "read_tracer_session_api_v1_sessions__session_id__get", "security": [{"API Key": []}, {"Tenant ID": []}, {"Bearer Auth": []}]}}}}'
page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"parameters": [{"name": "session_id", "in": "path", "required": true, "schema": {"type": "string", "format": "uuid", "title": "Session Id"}}, {"name": "include_stats", "in": "query", "required": false, "schema": {"type": "boolean", "default": false, "title": "Include Stats"}}, {"name": "accept", "in": "header", "required": false, "schema": {"anyOf": [{"type": "string"}, {"type": "null"}], "title": "Accept"}}]}}}}'


In [46]:
# or a list of strings
texts = splitter.split_text(json_data=json_data)

print(texts[0])
print(texts[1])

{"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}, "paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session."}}}}
{"paths": {"/api/v1/sessions/{session_id}": {"get": {"operationId": "read_tracer_session_api_v1_sessions__session_id__get", "security": [{"API Key": []}, {"Tenant ID": []}, {"Bearer Auth": []}]}}}}


# 5. Split by tokens
Language models have a token limit, and you should not exceed it. When you split your text into chunks, it is therefore a good idea to count the number of tokens. There are many tokenizers; when you count tokens in your text, use the same tokenizer as the language model.

####  tiktoken
tiktoken is a fast BPE tokenizer created by OpenAI.
We can use it to estimate tokens used. It will probably be more accurate for the OpenAI models.

1. How the text is split: by character passed in.
2. How the chunk size is measured: by tiktoken tokenizer.


In [2]:
speech = ''
with open('Data/speech.txt') as f:
    speech = f.read()

print(speech)

The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.

Just because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.

…

It will be all the easier for us to conduct ourselves as belligerents in a high spirit of right and fairness because we act without animus, not in enmity toward a people or wit

In [49]:
from langchain_text_splitters import CharacterTextSplitter

In [51]:
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(speech)

print(texts[0])


Created a chunk of size 138, which is longer than the specified 100
Created a chunk of size 205, which is longer than the specified 100
Created a chunk of size 160, which is longer than the specified 100


The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.


In [52]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="gpt-4",
    chunk_size=100,
    chunk_overlap=0,
)

In [3]:
from langchain_text_splitters import TokenTextSplitter
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

texts = text_splitter.split_text(speech)
print(texts[0])

The world must be made safe for democracy. Its
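
Unlike the two from_tiktoken_encoder splitters above, which still cut on characters and only measure size in tokens, TokenTextSplitter cuts on token boundaries. A minimal sketch of that sliding-window idea, assuming a whitespace word split as a stand-in for a real BPE tokenizer (the helper split_on_tokens and the encode/decode pair are hypothetical):

```python
from typing import Callable, List


def split_on_tokens(
    text: str,
    encode: Callable[[str], List[str]],
    decode: Callable[[List[str]], str],
    chunk_size: int,
    chunk_overlap: int,
) -> List[str]:
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = encode(text)
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(decode(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


# Stand-in tokenizer: whitespace-separated words. A real BPE tokenizer differs.
encode = str.split
decode = " ".join

chunks = split_on_tokens(
    "The world must be made safe for democracy",
    encode,
    decode,
    chunk_size=4,
    chunk_overlap=1,
)
print(chunks)
```

Because the window is defined in tokens, no chunk can exceed chunk_size tokens, at the cost of possibly cutting in the middle of a sentence.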


#### spaCy
spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython.

An alternative to NLTK (covered in the NLTK section below) is to use the spaCy tokenizer.

1. How the text is split: by spaCy tokenizer.
2. How the chunk size is measured: by number of characters.
- https://spacy.io/models/en

In [4]:
from langchain_text_splitters import SpacyTextSplitter

In [8]:
import spacy
spacy.load('en_core_web_sm')

<spacy.lang.en.English at 0x7feb1b713b50>

In [13]:
text_splitter = SpacyTextSplitter(chunk_size=1000)
texts = text_splitter.split_text(speech)
print(texts[0])

The world must be made safe for democracy.

Its peace must be planted upon the tested foundations of political liberty.

We have no selfish ends to serve.

We desire no conquest, no dominion.

We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make.

We are but one of the champions of the rights of mankind.

We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.



Just because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.


#### SentenceTransformers
The SentenceTransformersTokenTextSplitter is a specialized text splitter for use with the sentence-transformer models. The default behaviour is to split the text into chunks that fit the token window of the sentence transformer model that you would like to use.

In [15]:
from langchain_text_splitters import SentenceTransformersTokenTextSplitter

In [19]:
splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0)
texts = splitter.split_text(speech)
print(texts[0])

the world must be made safe for democracy. its peace must be planted upon the tested foundations of political liberty. we have no selfish ends to serve. we desire no conquest, no dominion. we seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. we are but one of the champions of the rights of mankind. we shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them. just because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, i feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for. … it will be all the easier for us to conduct ourselves as belligerents in a high spirit of right and fairness because we act without animus, not in enmity toward a people or with t

#### NLTK
The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language.

Rather than just splitting on "\n\n", we can use NLTK to split based on NLTK tokenizers.

1. How the text is split: by NLTK tokenizer.
2. How the chunk size is measured: by number of characters.

In [17]:
from langchain_text_splitters import NLTKTextSplitter

In [20]:
text_splitter = NLTKTextSplitter(chunk_size=1000)
texts = text_splitter.split_text(speech)
print(texts[0])


The world must be made safe for democracy.

Its peace must be planted upon the tested foundations of political liberty.

We have no selfish ends to serve.

We desire no conquest, no dominion.

We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make.

We are but one of the champions of the rights of mankind.

We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.

Just because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.


#### Hugging Face tokenizer
Hugging Face has many tokenizers.

Here we use the Hugging Face tokenizer GPT2TokenizerFast to count the text length in tokens.

1. How the text is split: by character passed in.
2. How the chunk size is measured: by number of tokens calculated by the Hugging Face tokenizer.


In [22]:
from transformers import GPT2TokenizerFast
from langchain_text_splitters import CharacterTextSplitter

In [23]:
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=100, chunk_overlap=0
)

In [25]:
texts = text_splitter.split_text(speech)
print(texts[0])

Created a chunk of size 137, which is longer than the specified 100
Created a chunk of size 205, which is longer than the specified 100
Created a chunk of size 160, which is longer than the specified 100


The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.
