# Exploring Document Splitters and Chunkers in LangChain

## Install OpenAI, HuggingFace and LangChain dependencies

In [1]:
!pip install langchain # ==0.3.11
!pip install langchain-openai #==0.2.12
!pip install langchain-community #==0.3.11

In [2]:
# takes 2 - 5 mins to install on Colab
!pip install "unstructured[all-docs]" #==0.14.0

After installing `unstructured` above, restart your session when the following popup appears. If it does not appear, go to `Runtime` and select `Restart Session`.

![](https://i.imgur.com/UOBaotk.png)

In [1]:
# install OCR dependencies for unstructured
!sudo apt-get install tesseract-ocr
!sudo apt-get install poppler-utils

In [2]:
!pip install langchain-text-splitters #==0.3.2
!pip install tiktoken #==0.7.0
!pip install spacy
!pip install sentence-transformers #==2.7.0



## Document Splitting and Chunking

After loading documents into LangChain, you might need to transform them for optimal use in your application. One common transformation is splitting a long document into smaller segments to fit within your model's context window. LangChain provides several built-in document transformers to facilitate the splitting, combining, filtering, and manipulating of documents.

#### Process of Document Splitting:
1. **Splitting into Chunks:**
   - Break down the text into small, semantically meaningful units (typically sentences).
   
2. **Combining Chunks:**
   - Assemble these smaller units into larger chunks until they reach a predefined size. This size is determined by a specific measurement function.

3. **Creating Overlapping Chunks:**
   - Once the maximum size is reached, finalize the chunk as an independent text piece.
   - Begin a new chunk, incorporating some overlap with the previous chunk to maintain textual context.

This approach ensures that semantically related text pieces are kept together, which is crucial for maintaining the meaning and continuity of the document.
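
The three steps above can be sketched in plain Python. This is a minimal illustration, not LangChain's implementation: the helper name `merge_with_overlap`, the naive regex sentence split, and the use of character counts as the size measure are all assumptions made for the example.

```python
import re

def merge_with_overlap(text, chunk_size=40, chunk_overlap=20):
    """Greedy split / merge / overlap sketch (hypothetical helper, not a LangChain API)."""
    # Step 1: split into small units (naive split after ., ! or ?).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks, current = [], []
    for sentence in sentences:
        # Step 2: keep merging units while the chunk stays within chunk_size.
        if current and len(" ".join(current + [sentence])) > chunk_size:
            chunks.append(" ".join(current))
            # Step 3: carry over trailing sentences as overlap, dropping leading
            # ones until the carry-over fits the overlap budget and still
            # leaves room for the new sentence.
            while current and (
                len(" ".join(current)) > chunk_overlap
                or len(" ".join(current + [sentence])) > chunk_size
            ):
                current.pop(0)
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

sample = "Cats sleep a lot. Dogs like walks. Birds sing at dawn. Fish swim in schools."
for chunk in merge_with_overlap(sample, chunk_size=40, chunk_overlap=20):
    print(len(chunk), repr(chunk))
```

In this sketch a single sentence longer than `chunk_size` is emitted as its own oversized chunk; handling that case is what the recursive splitter described next is for.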


### RecursiveCharacterTextSplitter

The `RecursiveCharacterTextSplitter` is a versatile tool within LangChain for splitting text based on a list of characters. This splitter is designed to handle various requirements through adjustable parameters.

This is the recommended text splitter for generic text. It is parameterized by a list of separators and tries to split on them in order until the chunks are small enough. The default list is `["\n\n", "\n", " ", ""]`. The effect is to keep paragraphs (then sentences, then words) together for as long as possible, since those are generally the most strongly related pieces of text.

#### Features and Parameters:

- **Character List:** Utilizes a specified list of characters to determine where splits should occur.
- **Chunk Size:** Allows you to set the size of each chunk, helping ensure that chunks are manageable and suit the context window of your model.
- **Overlap:** Configurable overlap between consecutive chunks to maintain context continuity across chunks.

This splitter is particularly useful for texts where precise control over the splitting criteria is needed, allowing for customized chunking strategies based on specific characters.
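
To make the "try each separator in order" idea concrete, here is a simplified pure-Python sketch. It is an assumption-laden illustration rather than LangChain's actual algorithm: the function name `recursive_split` is hypothetical, overlap is omitted, and separators are dropped instead of being kept in the output.

```python
def recursive_split(text, separators=("\n\n", "\n", " ", ""), chunk_size=25):
    """Simplified sketch of recursive separator-based splitting (no overlap)."""
    if len(text) <= chunk_size or not separators:
        return [text] if text else []
    sep, rest = separators[0], separators[1:]
    # The empty-string separator means: fall back to fixed-size character slices.
    if sep == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    if sep not in text:
        return recursive_split(text, rest, chunk_size)

    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate      # the piece still fits: keep merging
            continue
        if current:
            chunks.append(current)   # close the chunk built so far
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece is too long on its own: retry with finer separators.
            chunks.extend(recursive_split(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks

sample = "Alpha beta gamma.\nDelta epsilon zeta eta theta iota.\n\nKappa lambda."
for chunk in recursive_split(sample, chunk_size=25):
    print(len(chunk), repr(chunk))
```

The paragraph break is tried first, then the line break, and only the one line that is still too long gets split on spaces.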


In [3]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

doc = """Welcome to Green Valley, a small town nestled in the heart of the mountains. With its picturesque landscapes and vibrant community life, Green Valley has been a hidden gem for years. The main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture.
On a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. The aroma of freshly baked bread wafts from the bakery, drawing a steady stream of customers.
Green Valley is not only known for its scenic beauty but also for its annual festivals. The most anticipated event is the Harvest Festival, celebrated with great enthusiasm. Locals prepare months in advance, cultivating crops and crafting goods for the occasion. The festival features a parade, various competitions, and a night market that lights up the town with vibrant colors and joyous energy.
"""



In [4]:
print(doc)

Welcome to Green Valley, a small town nestled in the heart of the mountains. With its picturesque landscapes and vibrant community life, Green Valley has been a hidden gem for years. The main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture.
On a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. The aroma of freshly baked bread wafts from the bakery, drawing a steady stream of customers.
Green Valley is not only known for its scenic beauty but also for its annual festivals. The most anticipated event is the Harvest Festival, celebrated with great enthusiasm. Locals prepare months in advance, cultivating crops and crafting goods for the occasion. The festival features a parade, various competitions, and a night market that lights up the town with vibrant colors and joy

A smaller chunk size (measured in characters) produces more chunks:

In [5]:
text_splitter = RecursiveCharacterTextSplitter(
    separators= ["\n\n", "\n", " ", ""],
    chunk_size=300,
    chunk_overlap=0,
)

In [6]:
texts = text_splitter.split_text(doc)
print(len(texts)) # 5

5


In [7]:
for text in texts:
    print(text)
    print(len(text))
    print()

Welcome to Green Valley, a small town nestled in the heart of the mountains. With its picturesque landscapes and vibrant community life, Green Valley has been a hidden gem for years. The main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture.
299

On a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. The aroma of freshly baked bread wafts from the bakery, drawing a steady stream
298

of customers.
13

Green Valley is not only known for its scenic beauty but also for its annual festivals. The most anticipated event is the Harvest Festival, celebrated with great enthusiasm. Locals prepare months in advance, cultivating crops and crafting goods for the occasion. The festival features a parade,
294

various competitions, and a night market that lights up the town with vib

A larger chunk size (measured in characters) produces fewer chunks:

In [8]:
text_splitter = RecursiveCharacterTextSplitter(
    separators= ["\n\n", "\n", " ", ""],
    chunk_size=500,
    chunk_overlap=0,
)

texts = text_splitter.split_text(doc)
print(len(texts)) # 3

3


In [9]:
for text in texts:
    print(text)
    print(len(text))
    print()

Welcome to Green Valley, a small town nestled in the heart of the mountains. With its picturesque landscapes and vibrant community life, Green Valley has been a hidden gem for years. The main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture.
299

On a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. The aroma of freshly baked bread wafts from the bakery, drawing a steady stream of customers.
312

Green Valley is not only known for its scenic beauty but also for its annual festivals. The most anticipated event is the Harvest Festival, celebrated with great enthusiasm. Locals prepare months in advance, cultivating crops and crafting goods for the occasion. The festival features a parade, various competitions, and a night market that lights up the town with vibrant colo

`chunk_overlap` helps mitigate the loss of information when context is divided between chunks, which matters most when chunks are very small:

In [10]:
text_splitter = RecursiveCharacterTextSplitter(
    separators= ["\n\n", "\n", " ", ""],
    chunk_size=300,
    chunk_overlap=100,
)

texts = text_splitter.split_text(doc)
print(len(texts)) # 5

5


In [11]:
for text in texts:
    print(text)
    print(len(text))
    print()

Welcome to Green Valley, a small town nestled in the heart of the mountains. With its picturesque landscapes and vibrant community life, Green Valley has been a hidden gem for years. The main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture.
299

On a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. The aroma of freshly baked bread wafts from the bakery, drawing a steady stream
298

of days gone by. The aroma of freshly baked bread wafts from the bakery, drawing a steady stream of customers.
110

Green Valley is not only known for its scenic beauty but also for its annual festivals. The most anticipated event is the Harvest Festival, celebrated with great enthusiasm. Locals prepare months in advance, cultivating crops and crafting goods for the occasion. The festival

You can create LangChain `Document` chunks with the `create_documents` method:

In [12]:
text_splitter = RecursiveCharacterTextSplitter(
    separators= ["\n\n", "\n", " ", ""],
    chunk_size=500,
    chunk_overlap=100,
)

In [13]:
docs = text_splitter.create_documents([doc])
docs

[Document(metadata={}, page_content='Welcome to Green Valley, a small town nestled in the heart of the mountains. With its picturesque landscapes and vibrant community life, Green Valley has been a hidden gem for years. The main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture.'),
 Document(metadata={}, page_content='On a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. The aroma of freshly baked bread wafts from the bakery, drawing a steady stream of customers.'),
 Document(metadata={}, page_content='Green Valley is not only known for its scenic beauty but also for its annual festivals. The most anticipated event is the Harvest Festival, celebrated with great enthusiasm. Locals prepare months in advance, cultivating crops and crafting goods for the occasion. The fes

### CharacterTextSplitter

The `CharacterTextSplitter` is a straightforward tool in LangChain for dividing text based on a specified character. It's designed to be simple yet effective, providing essential controls for customizing how text is segmented.

#### Key Features and Parameters:
- **Split Character:** By default, it splits the text on a blank line (`"\n\n"`), but this can be changed to any separator you specify.
- **Chunk Size:** Allows you to define the length of each chunk in terms of the number of characters. This is useful for ensuring each piece of text is of a manageable size for processing.
- **Overlap:** You can set the amount of overlap between consecutive chunks. This helps maintain context and continuity when text is split into separate parts.

This method is the simplest among text splitting tools, focusing on character-based division and providing straightforward measures for chunk length and overlap.

To obtain the string content directly, use `.split_text`.

To create LangChain `Document` objects (e.g., for use in downstream tasks), use `.create_documents`.


In [14]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=500,
    chunk_overlap=200,
    is_separator_regex=False,
)

docs = text_splitter.create_documents([doc])
docs

[Document(metadata={}, page_content='Welcome to Green Valley, a small town nestled in the heart of the mountains. With its picturesque landscapes and vibrant community life, Green Valley has been a hidden gem for years. The main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture.'),
 Document(metadata={}, page_content='On a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. The aroma of freshly baked bread wafts from the bakery, drawing a steady stream of customers.'),
 Document(metadata={}, page_content='Green Valley is not only known for its scenic beauty but also for its annual festivals. The most anticipated event is the Harvest Festival, celebrated with great enthusiasm. Locals prepare months in advance, cultivating crops and crafting goods for the occasion. The fes

### Code Splitters

`RecursiveCharacterTextSplitter` includes pre-built lists of separators that are useful for splitting text in a specific programming language.
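
The idea behind a language-specific separator list can be sketched without LangChain. The list below is an illustrative assumption, not the library's actual Python list, and `split_code` is a hypothetical helper: it picks the coarsest separator present in the source and splits there, keeping the separator at the start of each piece so that every function stays intact.

```python
import re

# Assumed, illustrative separators for Python source, coarsest first.
ILLUSTRATIVE_PYTHON_SEPARATORS = ["\nclass ", "\ndef ", "\n\n", "\n", " "]

def split_code(code, separators=ILLUSTRATIVE_PYTHON_SEPARATORS):
    """Split on the first separator found, keeping it with the piece it starts."""
    for separator in separators:
        if separator in code:
            # A zero-width lookahead splits *before* the separator.
            pieces = re.split(f"(?={re.escape(separator)})", code)
            return [p.strip("\n") for p in pieces if p.strip()]
    return [code]

code = """
def add(a, b):
    return a + b

def sub(a, b):
    return a - b
"""

for piece in split_code(code):
    print(repr(piece))
```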

In [15]:
from langchain_text_splitters import RecursiveCharacterTextSplitter, Language

python_code = """
def hello_world():
    print("Hello, World!")
hello_world()
"""

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([python_code])
python_docs

[Document(metadata={}, page_content='def hello_world():\n    print("Hello, World!")'),
 Document(metadata={}, page_content='hello_world()')]

### Markdown Splitters

We might want to chunk a document based on its structure. For example, a markdown file is organized by headers, so creating chunks within specific header groups is an intuitive idea. For this we can use `MarkdownHeaderTextSplitter`, which splits a markdown file by a specified set of headers.

For example, if we want to split this markdown:

```
markdown_document = """
# Team Introductions

## Management Team

Hi, this is Jim, the CEO.  
Hi, this is Joe, the CFO.

## Development Team

Hi, this is Molly, the Lead Developer.
"""
```

We can specify the headers to split on:

```
[("#", "Header 1"),
 ("##", "Header 2")]
```

And content is grouped or split by common headers:

```
Document(page_content='Hi, this is Jim, the CEO.\nHi, this is Joe, the CFO.',
metadata={'Header 1': 'Team Introductions', 'Header 2': 'Management Team'})

Document(page_content='Hi, this is Molly, the Lead Developer.',
metadata={'Header 1': 'Team Introductions', 'Header 2': 'Development Team'})
```
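
The grouping shown above can be approximated in a few lines of plain Python. This is a rough sketch under stated assumptions, not the `MarkdownHeaderTextSplitter` implementation: `split_by_headers` is a hypothetical helper, it returns plain dicts instead of `Document` objects, and it ignores details such as code fences inside the markdown.

```python
def split_by_headers(markdown, headers_to_split_on):
    """Group markdown lines under their most recent headers (simplified sketch)."""
    # Check longer markers first so "##" is never mistaken for "#".
    markers = sorted(headers_to_split_on, key=lambda h: len(h[0]), reverse=True)
    active = {}   # header name -> current header text
    levels = {}   # header name -> depth (number of '#')
    chunks, body = [], []

    def flush():
        # Emit the accumulated body lines with a copy of the active headers.
        if body:
            chunks.append({"page_content": "\n".join(body), "metadata": dict(active)})
            body.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        for marker, name in markers:
            if stripped.startswith(marker + " "):
                flush()
                # A new header resets headers at the same or a deeper level.
                for other in [n for n, d in levels.items() if d >= len(marker)]:
                    active.pop(other, None)
                    levels.pop(other, None)
                active[name] = stripped[len(marker):].strip()
                levels[name] = len(marker)
                break
        else:
            if stripped:
                body.append(stripped)
    flush()
    return chunks

example_markdown = """
# Team Introductions

## Management Team

Hi, this is Jim, the CEO.
Hi, this is Joe, the CFO.

## Development Team

Hi, this is Molly, the Lead Developer.
"""

for chunk in split_by_headers(example_markdown, [("#", "Header 1"), ("##", "Header 2")]):
    print(chunk)
```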

In [16]:
markdown_document = """
# Team Introductions

## Management Team
Hi, this is Jim, the CEO.
Hi, this is Joe, the CFO.

## Development Team
Hi, this is Molly, the Lead Developer.
"""

In [17]:
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)
md_header_splits

[Document(metadata={'Header 1': 'Team Introductions', 'Header 2': 'Management Team'}, page_content='Hi, this is Jim, the CEO.\nHi, this is Joe, the CFO.'),
 Document(metadata={'Header 1': 'Team Introductions', 'Header 2': 'Development Team'}, page_content='Hi, this is Molly, the Lead Developer.')]

By default, `MarkdownHeaderTextSplitter` strips the headers being split on from the output chunk's content. This can be disabled by setting `strip_headers=False`.

In [18]:
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on, strip_headers=False)
md_header_splits = markdown_splitter.split_text(markdown_document)
md_header_splits

[Document(metadata={'Header 1': 'Team Introductions', 'Header 2': 'Management Team'}, page_content='# Team Introductions  \n## Management Team\nHi, this is Jim, the CEO.\nHi, this is Joe, the CFO.'),
 Document(metadata={'Header 1': 'Team Introductions', 'Header 2': 'Development Team'}, page_content='## Development Team\nHi, this is Molly, the Lead Developer.')]

### Tokenizer-based Splitting

Language models have a token limit that must not be exceeded, so it is a good idea to count tokens when splitting text into chunks. There are many tokenizers, and the count should come from the same tokenizer that the language model uses. Let's look at how we can chunk documents using different tokenizers.



#### tiktoken splitters

[`tiktoken`](https://github.com/openai/tiktoken) is a fast BPE tokenizer created by OpenAI.

We can use `tiktoken` to estimate the number of tokens used; the estimate is most accurate for OpenAI models. Here `chunk_size` is measured in tokens rather than characters.

For OpenAI models, 1 token is roughly 3/4 of a word in English text.

Approximately: 100 tokens ≈ 75 words.
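
As a quick back-of-the-envelope check, the rule of thumb can be turned into a tiny estimator. This is only a heuristic sketch: `estimate_tokens` is a hypothetical helper, and real counts depend on the text and the tokenizer, so use `tiktoken` when an exact count matters.

```python
import math

WORDS_PER_TOKEN = 0.75  # rule of thumb from the text above; real ratios vary

def estimate_tokens(text):
    """Estimate the token count from a whitespace word count (heuristic only)."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN)

seventy_five_words = " ".join(["word"] * 75)
print(estimate_tokens(seventy_five_words))  # 75 words / 0.75 -> 100
```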



We can load a [`TokenTextSplitter`](https://api.python.langchain.com/en/latest/base/langchain_text_splitters.base.TokenTextSplitter.html), which works with `tiktoken` directly and ensures that each chunk contains at most `chunk_size` tokens.
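
Conceptually, token-based splitting is a sliding window over the token sequence that advances by `chunk_size - chunk_overlap` each time. The sketch below illustrates that idea under simplifying assumptions: `split_on_tokens` is a hypothetical helper, and whitespace-separated words stand in for real tokens, whereas the actual splitter encodes the text with `tiktoken` and decodes each window back to a string.

```python
def split_on_tokens(tokens, chunk_size, chunk_overlap):
    """Sliding-window split over a token list (simplified sketch)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the sequence
        start += step
    return chunks

# Words stand in for tokens here; a real tokenizer would yield token ids.
words = "the quick brown fox jumps over the lazy dog today".split()
for window in split_on_tokens(words, chunk_size=4, chunk_overlap=1):
    print(" ".join(window))
```

Because windows are cut at fixed token positions, chunks can begin or end mid-sentence, which is visible in the outputs of the cells that follow.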

In [19]:
doc = """Welcome to Green Valley, a small town nestled in the heart of the mountains. With its picturesque landscapes and vibrant community life, Green Valley has been a hidden gem for years. The main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture.
On a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. The aroma of freshly baked bread wafts from the bakery, drawing a steady stream of customers.
Green Valley is not only known for its scenic beauty but also for its annual festivals. The most anticipated event is the Harvest Festival, celebrated with great enthusiasm. Locals prepare months in advance, cultivating crops and crafting goods for the occasion. The festival features a parade, various competitions, and a night market that lights up the town with vibrant colors and joyous energy.
"""

In [20]:
from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(model_name='gpt-4o-mini',
                                  chunk_size=30,
                                  chunk_overlap=10)

docs = text_splitter.create_documents([doc])

In [21]:
len(docs)

10

In [22]:
docs

[Document(metadata={}, page_content='Welcome to Green Valley, a small town nestled in the heart of the mountains. With its picturesque landscapes and vibrant community life, Green Valley has been a'),
 Document(metadata={}, page_content=' and vibrant community life, Green Valley has been a hidden gem for years. The main street is lined with an array of shops and cafes, each offering'),
 Document(metadata={}, page_content=' with an array of shops and cafes, each offering a unique taste of local flavor and culture.\nOn a typical afternoon, the town square comes alive with'),
 Document(metadata={}, page_content=' a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts'),
 Document(metadata={}, page_content=' Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. The aroma of freshly baked bread wafts'),
 Document(metadata={}

In [23]:
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o-mini")
for d in docs:
  print('Words:', len(d.page_content.split(' ')),
        'Tokens:', len(enc.encode(d.page_content)),
        'Chunk:', d.page_content)

Words: 27 Tokens: 30 Chunk: Welcome to Green Valley, a small town nestled in the heart of the mountains. With its picturesque landscapes and vibrant community life, Green Valley has been a
Words: 28 Tokens: 30 Chunk:  and vibrant community life, Green Valley has been a hidden gem for years. The main street is lined with an array of shops and cafes, each offering
Words: 27 Tokens: 30 Chunk:  with an array of shops and cafes, each offering a unique taste of local flavor and culture.
On a typical afternoon, the town square comes alive with
Words: 27 Tokens: 30 Chunk:  a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts
Words: 27 Tokens: 30 Chunk:  Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. The aroma of freshly baked bread wafts
Words: 26 Tokens: 30 Chunk:  by. The aroma of freshly baked bread wafts from the b

As before, a larger chunk size (now measured in tokens) produces fewer chunks:

In [24]:
text_splitter = TokenTextSplitter(model_name='gpt-4o-mini',
                                  chunk_size=100,
                                  chunk_overlap=30)

docs = text_splitter.create_documents([doc])

In [25]:
len(docs)

3

In [26]:
docs

[Document(metadata={}, page_content='Welcome to Green Valley, a small town nestled in the heart of the mountains. With its picturesque landscapes and vibrant community life, Green Valley has been a hidden gem for years. The main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture.\nOn a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts, and an old man tells stories of days gone'),
 Document(metadata={}, page_content=' the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. The aroma of freshly baked bread wafts from the bakery, drawing a steady stream of customers.\nGreen Valley is not only known for its scenic beauty but also for its annual festivals. The most anticipated event is the Harvest Festival, celebrate

In [27]:
enc = tiktoken.encoding_for_model("gpt-4o-mini")
for d in docs:
  print('Words:', len(d.page_content.split(' ')),
        'Tokens:', len(enc.encode(d.page_content)),
        'Chunk:', d.page_content)

Words: 88 Tokens: 100 Chunk: Welcome to Green Valley, a small town nestled in the heart of the mountains. With its picturesque landscapes and vibrant community life, Green Valley has been a hidden gem for years. The main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture.
On a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts, and an old man tells stories of days gone
Words: 86 Tokens: 100 Chunk:  the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. The aroma of freshly baked bread wafts from the bakery, drawing a steady stream of customers.
Green Valley is not only known for its scenic beauty but also for its annual festivals. The most anticipated event is the Harvest Festival, celebrated with great enthusia

To enforce a hard limit on chunk size while keeping chunks more meaningful, we can use `RecursiveCharacterTextSplitter.from_tiktoken_encoder`. It measures length in tokens and recursively re-splits any piece that exceeds `chunk_size`, preferring paragraph and line boundaries over arbitrary token positions.

In [28]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="gpt-4o-mini",
    chunk_size=100,
    chunk_overlap=30,
)

docs = text_splitter.create_documents([doc])

In [29]:
len(docs)

3

In [30]:
docs

[Document(metadata={}, page_content='Welcome to Green Valley, a small town nestled in the heart of the mountains. With its picturesque landscapes and vibrant community life, Green Valley has been a hidden gem for years. The main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture.'),
 Document(metadata={}, page_content='On a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. The aroma of freshly baked bread wafts from the bakery, drawing a steady stream of customers.'),
 Document(metadata={}, page_content='Green Valley is not only known for its scenic beauty but also for its annual festivals. The most anticipated event is the Harvest Festival, celebrated with great enthusiasm. Locals prepare months in advance, cultivating crops and crafting goods for the occasion. The fes

In [31]:
enc = tiktoken.encoding_for_model("gpt-4o-mini")
for d in docs:
  print('Words:', len(d.page_content.split(' ')),
        'Tokens:', len(enc.encode(d.page_content)),
        'Chunk:', d.page_content)

Words: 53 Tokens: 59 Chunk: Welcome to Green Valley, a small town nestled in the heart of the mountains. With its picturesque landscapes and vibrant community life, Green Valley has been a hidden gem for years. The main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture.
Words: 53 Tokens: 62 Chunk: On a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. The aroma of freshly baked bread wafts from the bakery, drawing a steady stream of customers.
Words: 63 Tokens: 72 Chunk: Green Valley is not only known for its scenic beauty but also for its annual festivals. The most anticipated event is the Harvest Festival, celebrated with great enthusiasm. Locals prepare months in advance, cultivating crops and crafting goods for the occasion. The festival features a parade, various 

#### spaCy

[spaCy](https://spacy.io/) is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython.

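LangChain implements splitters based on the [spaCy tokenizer](https://spacy.io/api/tokenizer). With `SpacyTextSplitter`, the text is split into sentences by spaCy and `chunk_size` is measured in characters.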

In [32]:
from langchain_text_splitters import SpacyTextSplitter

text_splitter = SpacyTextSplitter(chunk_size=500,
                                  chunk_overlap=50)

docs = text_splitter.create_documents([doc])



In [33]:
len(docs)

3

In [35]:
docs

[Document(metadata={}, page_content='Welcome to Green Valley, a small town nestled in the heart of the mountains.\n\nWith its picturesque landscapes and vibrant community life, Green Valley has been a hidden gem for years.\n\nThe main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture.\n\n\nOn a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling.'),
 Document(metadata={}, page_content='Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by.\n\nThe aroma of freshly baked bread wafts from the bakery, drawing a steady stream of customers.\n\n\nGreen Valley is not only known for its scenic beauty but also for its annual festivals.\n\nThe most anticipated event is the Harvest Festival, celebrated with great enthusiasm.\n\nLocals prepare months in advance, cultivating crops and crafting goods for the occasion.'),
 Document(metadata=

In [36]:
for d in docs:
  print('Words:', len(d.page_content.split(' ')),
        'Characters:', len(d.page_content),
        'Chunk:', d.page_content)

Words: 68 Characters: 413 Chunk: Welcome to Green Valley, a small town nestled in the heart of the mountains.

With its picturesque landscapes and vibrant community life, Green Valley has been a hidden gem for years.

The main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture.


On a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling.
Words: 72 Characters: 470 Chunk: Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by.

The aroma of freshly baked bread wafts from the bakery, drawing a steady stream of customers.


Green Valley is not only known for its scenic beauty but also for its annual festivals.

The most anticipated event is the Harvest Festival, celebrated with great enthusiasm.

Locals prepare months in advance, cultivating crops and crafting goods for the occasion.
Words: 22 Characters: 135 Chunk: The festival fea

#### SentenceTransformers

The [`SentenceTransformersTokenTextSplitter`](https://api.python.langchain.com/en/latest/sentence_transformers/langchain_text_splitters.sentence_transformers.SentenceTransformersTokenTextSplitter.html) is a specialized text splitter for use with `sentence-transformers` models.

By default, it splits the text into chunks that fit the token window of the sentence-transformer model you choose. Because each chunk is decoded back from model tokens, its text follows the tokenizer's normalization: with the model used below, the chunks come back lowercased and with line breaks collapsed.
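
To make the roles of `tokens_per_chunk` and `chunk_overlap` concrete, here is a minimal sketch of a sliding token window. It is not the library's implementation: `split_on_tokens` is a hypothetical helper, and placeholder strings stand in for the tokens that the model's tokenizer would produce.

```python
from typing import List


def split_on_tokens(tokens: List[str], tokens_per_chunk: int,
                    chunk_overlap: int) -> List[List[str]]:
    """Slide a fixed-size window over `tokens` (hypothetical helper).

    Each window starts `tokens_per_chunk - chunk_overlap` tokens after the
    previous one, so consecutive windows share `chunk_overlap` tokens.
    """
    if chunk_overlap >= tokens_per_chunk:
        raise ValueError("chunk_overlap must be smaller than tokens_per_chunk")
    step = tokens_per_chunk - chunk_overlap
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        end = min(start + tokens_per_chunk, len(tokens))
        chunks.append(tokens[start:end])
        if end == len(tokens):
            break
        start += step
    return chunks


# Placeholder tokens; a real tokenizer would produce sub-word pieces instead.
tokens = [f"t{i}" for i in range(250)]
chunks = split_on_tokens(tokens, tokens_per_chunk=100, chunk_overlap=30)
print([len(c) for c in chunks])  # [100, 100, 100, 40]
```

With a window of 100 tokens and an overlap of 30, each chunk begins 70 tokens after the previous one, which is why neighbouring chunks repeat some text.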

In [37]:
from langchain_text_splitters import SentenceTransformersTokenTextSplitter

splitter = SentenceTransformersTokenTextSplitter(model_name="sentence-transformers/all-mpnet-base-v2",
                                                 tokens_per_chunk=100,
                                                 chunk_overlap=30)


In [38]:
docs = splitter.create_documents([doc])

In [39]:
len(docs)

3

In [40]:
docs

[Document(metadata={}, page_content='welcome to green valley, a small town nestled in the heart of the mountains. with its picturesque landscapes and vibrant community life, green valley has been a hidden gem for years. the main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture. on a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. children play near the fountain, artists display their crafts, and an old man tells stories of days'),
 Document(metadata={}, page_content='the bustling sounds of locals and visitors mingling. children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. the aroma of freshly baked bread wafts from the bakery, drawing a steady stream of customers. green valley is not only known for its scenic beauty but also for its annual festivals. the most anticipated event is the harvest festival, celebrated with g

In [41]:
for d in docs:
  print('Words:', len(d.page_content.split(' ')),
        'Characters:', len(d.page_content),
        'Chunk:', d.page_content)

Words: 88 Characters: 509 Chunk: welcome to green valley, a small town nestled in the heart of the mountains. with its picturesque landscapes and vibrant community life, green valley has been a hidden gem for years. the main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture. on a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. children play near the fountain, artists display their crafts, and an old man tells stories of days
Words: 84 Characters: 517 Chunk: the bustling sounds of locals and visitors mingling. children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. the aroma of freshly baked bread wafts from the bakery, drawing a steady stream of customers. green valley is not only known for its scenic beauty but also for its annual festivals. the most anticipated event is the harvest festival, celebrated with great enthus
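
The repeated text at the chunk boundaries is the effect of `chunk_overlap`. As a rough check, the sketch below finds the longest suffix of one chunk that is also a prefix of the next. `shared_overlap` is a hypothetical helper, and the two strings are shortened excerpts of the chunks printed above.

```python
def shared_overlap(prev_chunk: str, next_chunk: str) -> str:
    """Return the longest suffix of prev_chunk that is also a prefix of next_chunk."""
    max_len = min(len(prev_chunk), len(next_chunk))
    # Try the longest candidate first and shrink until one matches.
    for size in range(max_len, 0, -1):
        if prev_chunk.endswith(next_chunk[:size]):
            return next_chunk[:size]
    return ""


# Shortened excerpts of two consecutive chunks from the output above.
a = ("the town square comes alive with the bustling sounds of locals and "
     "visitors mingling. children play near the fountain")
b = ("the bustling sounds of locals and visitors mingling. children play "
     "near the fountain, artists display their crafts")

print(shared_overlap(a, b))
```

Applied to full consecutive chunks, for example `shared_overlap(docs[0].page_content, docs[1].page_content)`, this shows exactly which text the splitter repeated.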

### Section-based Splitting in Unstructured.io

Chunking functions in `unstructured` use metadata and document elements detected by partition functions to split a document into smaller parts for use cases such as Retrieval Augmented Generation (RAG).

Because `unstructured` uses specific knowledge about each document format to partition the document into semantic units (document elements), we only need to resort to text-splitting when a single element exceeds the desired maximum chunk size. Except in that case, every chunk contains one or more whole elements, preserving the coherence of the semantic units established during partitioning.

- Chunking is performed on document elements. It is a separate step performed after partitioning, on the elements produced by partitioning. (Although it can be combined with partitioning in a single step.)

- Chunking combines consecutive elements to form chunks as large as possible without exceeding the maximum chunk size.

- A single element that by itself exceeds the maximum chunk size is divided into two or more chunks using text-splitting.

- Chunking produces a sequence of `CompositeElement`, `Table`, or `TableChunk` elements. Each “chunk” is an instance of one of these three types.
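
The combining behaviour described in the bullets above can be sketched in plain Python. This is a simplified illustration and not the `unstructured` implementation: `basic_chunk` is a hypothetical helper, elements are reduced to plain strings, the `"\n\n"` separator is an assumption, and oversized elements are cut at fixed positions rather than at sensible text boundaries.

```python
from typing import List


def basic_chunk(elements: List[str], max_characters: int = 500) -> List[str]:
    """Greedily combine consecutive element texts without exceeding max_characters."""
    chunks: List[str] = []
    current = ""
    for text in elements:
        if len(text) > max_characters:
            # An oversized element never shares a chunk: flush what we have,
            # then fall back to text-splitting for this element alone.
            if current:
                chunks.append(current)
                current = ""
            for i in range(0, len(text), max_characters):
                chunks.append(text[i:i + max_characters])
            continue
        candidate = text if not current else current + "\n\n" + text
        if len(candidate) <= max_characters:
            current = candidate      # the element still fits in this chunk
        else:
            chunks.append(current)   # close the chunk and start a new one
            current = text
    if current:
        chunks.append(current)
    return chunks


elements = ["a" * 30, "b" * 30, "c" * 50, "d" * 120]
chunks = basic_chunk(elements, max_characters=70)
print([len(c) for c in chunks])  # [62, 50, 70, 50]
```

The first two elements are combined (30 + 2 + 30 characters), the third starts a new chunk because it would not fit, and the 120-character element is the only one that gets text-split.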

Chunking Options:

The following options are available to tune chunking behaviors. These are keyword arguments that can be used in a partitioning or chunking function call. All these options have defaults and need only be specified when a non-default setting is required. Specific chunking strategies (such as “by-title”) may have additional options.

- `max_characters`: (default=500) - the hard maximum size for a chunk. No chunk will exceed this number of characters. A single element that by itself exceeds this size will be divided into two or more chunks using text-splitting.

- `new_after_n_chars`: (default=max_characters) - the “soft” maximum size for a chunk. A chunk that already exceeds this number of characters will not be extended, even if the next _element_ would fit without exceeding the specified hard maximum. This can be used in conjunction with `max_characters` to set a “preferred” size, like “I prefer chunks of around 1000 characters, but I’d rather have a chunk of 1500 (max_characters) than resort to text-splitting”. This would be specified with `(..., max_characters=1500, new_after_n_chars=1000)`.

- `overlap`: (default=0) - only when using text-splitting to break up an oversized chunk, include this number of characters from the end of the prior chunk as a prefix on the next. This can mitigate the effect of splitting the semantic unit represented by the oversized element at an arbitrary position based on text length.

- `combine_text_under_n_chars`: (default=max_characters) - used with section-based chunking (`by_title`). With the default value, sequential small sections are combined to fill the chunking window as fully as possible, producing logically larger chunks.
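
To see how the soft maximum and the overlap change the result, here is a small sketch under simplifying assumptions. Both functions are hypothetical helpers, not `unstructured` APIs: `chunk_with_soft_max` assumes no element is longer than `max_characters`, and `split_oversized` cuts at fixed character positions.

```python
from typing import List


def chunk_with_soft_max(elements: List[str], max_characters: int,
                        new_after_n_chars: int) -> List[str]:
    """Greedy combining with a hard and a soft maximum (simplified sketch)."""
    chunks: List[str] = []
    current = ""
    for text in elements:
        candidate = text if not current else current + "\n\n" + text
        # Close the chunk if it already exceeds the soft maximum, or if adding
        # the next element would exceed the hard maximum.
        if current and (len(current) > new_after_n_chars
                        or len(candidate) > max_characters):
            chunks.append(current)
            current = text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def split_oversized(text: str, max_characters: int, overlap: int = 0) -> List[str]:
    """Split one oversized text, prefixing each piece with the tail of the prior one."""
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be >= 0 and smaller than max_characters")
    pieces: List[str] = []
    start = 0
    while start < len(text):
        pieces.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break
        start += max_characters - overlap
    return pieces


elements = ["a" * 40, "b" * 20, "c" * 20]
no_soft_max = chunk_with_soft_max(elements, max_characters=100, new_after_n_chars=100)
soft_max_50 = chunk_with_soft_max(elements, max_characters=100, new_after_n_chars=50)
print([len(c) for c in no_soft_max])  # [84]
print([len(c) for c in soft_max_50])  # [62, 20]
print(split_oversized("abcdefghij", max_characters=4, overlap=1))  # ['abcd', 'defg', 'ghij']
```

With the soft maximum at 50, the 62-character chunk is closed even though the third element would still fit under the hard maximum of 100.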


There are currently two chunking strategies, `basic` and `by_title`.

The `basic` strategy combines sequential elements to maximally fill each chunk while respecting both the specified `max_characters` (hard-max) and `new_after_n_chars` (soft-max) option values.

The `by_title` chunking strategy preserves section boundaries and optionally page boundaries as well. “Preserving” here means that a single chunk will never contain text that occurred in two different sections.
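
The section-preserving rule can be sketched as follows. This is an illustration of the idea only, not the `unstructured` implementation: `chunk_by_title_sketch` is a hypothetical helper, elements are reduced to `(category, text)` pairs, every `Title` is treated as the start of a new section, and page boundaries, `combine_text_under_n_chars`, and oversized elements are ignored.

```python
from typing import List, Tuple

Element = Tuple[str, str]  # (category, text); a simplified stand-in for real elements


def chunk_by_title_sketch(elements: List[Element],
                          max_characters: int = 500) -> List[str]:
    """Combine elements into chunks that never span two sections."""
    chunks: List[str] = []
    current = ""
    for category, text in elements:
        candidate = text if not current else current + "\n\n" + text
        starts_new_section = category == "Title"
        # A title always closes the current chunk, even if there is room left.
        if current and (starts_new_section or len(candidate) > max_characters):
            chunks.append(current)
            current = text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


elements = [
    ("Title", "1 Introduction"),
    ("NarrativeText", "Deep learning drives document image analysis."),
    ("Title", "2 Related Work"),
    ("NarrativeText", "Several toolkits exist."),
]
chunks = chunk_by_title_sketch(elements, max_characters=500)
for chunk in chunks:
    print(repr(chunk))
```

Both sections would fit into a single 500-character chunk, but the second title forces a new chunk, so each chunk contains text from exactly one section.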

In [42]:
!wget -O 'layoutparser_paper.pdf' 'http://arxiv.org/pdf/2103.15348.pdf'

--2025-09-03 12:23:26--  http://arxiv.org/pdf/2103.15348.pdf
Resolving arxiv.org (arxiv.org)... 151.101.195.42, 151.101.3.42, 151.101.67.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.195.42|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://arxiv.org/pdf/2103.15348.pdf [following]
--2025-09-03 12:23:26--  https://arxiv.org/pdf/2103.15348.pdf
Connecting to arxiv.org (arxiv.org)|151.101.195.42|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /pdf/2103.15348 [following]
--2025-09-03 12:23:26--  https://arxiv.org/pdf/2103.15348
Reusing existing connection to arxiv.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 4686220 (4.5M) [application/pdf]
Saving to: ‘layoutparser_paper.pdf’


2025-09-03 12:23:26 (106 MB/s) - ‘layoutparser_paper.pdf’ saved [4686220/4686220]



Download the required NLTK packages if they have not been downloaded already.

In [43]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [44]:
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

In [45]:
from langchain_community.document_loaders import UnstructuredPDFLoader

# takes 3-4 mins on Colab
loader = UnstructuredPDFLoader('./layoutparser_paper.pdf',
                               strategy='hi_res',
                               extract_images_in_pdf=False,
                               infer_table_structure=True,
                               chunking_strategy="by_title",
                               max_characters=4000,
                               new_after_n_chars=3800,
                               combine_text_under_n_chars=2000,
                               mode='elements')
data = loader.load()




In [46]:
len(data)

16

In [47]:
[doc.metadata['category'] for doc in data]

['CompositeElement',
 'CompositeElement',
 'CompositeElement',
 'CompositeElement',
 'CompositeElement',
 'CompositeElement',
 'CompositeElement',
 'CompositeElement',
 'CompositeElement',
 'CompositeElement',
 'CompositeElement',
 'CompositeElement',
 'CompositeElement',
 'CompositeElement',
 'CompositeElement',
 'CompositeElement']

In [48]:
data[0]

Document(metadata={'source': './layoutparser_paper.pdf', 'filetype': 'application/pdf', 'languages': ['eng'], 'last_modified': '2023-01-23T09:15:33', 'page_number': 1, 'orig_elements': 'eJzNWW1z1DgS/iu6+QRXI68lyy/KlyNA1cFe9o6C7O3WshQlS+0ZgceekuyEgeK/X0v2hEnCchWqhsqnRLJa3f300y9KXn9aQAsb6Ia31ixOyEKkjRGprqmRWlPBa0UlY4rWTJdMNyAZl4slWWxgUEYNCmU+LXTfO2M7NYCP61bt+nF4uwa7Wg+4w3maosy8fWnNsMZdVsbdbW+7Ici9fi3yRCxJLnlSvlmSeVnwNBFhydI0kbfX03HcWPidH2ATvHhhP0D7aqs0LD7jh8a2MOy2ED+9+GURbelWo1pFg18voFst3sRdP7zd9MY2FiIcPOUZTRnl2XkqT1h+kmVBeouSb7txU4MLjgQdA3wIri5Y+L5X9munEZRV7+xHMOfhBB69CXlWGFFprWhaN0BFLgsqTZnTUqZ1gL1i4niQ5yxJA4ZFkgZMpyWempZV9ZXldPjeAI4Sd8e81qqsM66okJmgwgDQWhQZbQqZZcw0olL5kWl+xeN5KaqEH9L85joevz+o3xnyFOGWshRUAsPKAlzQCrKK5ppJDUrLRsGxId9jOi/LmfV7iG+u4/F7A3l6Z8illAi6kZSLtEKWFymVvM6p0sYoqRjLdXFsyIs8yb9AzkSWJ/IA41sbk8A3QTcwgB5s373VCKt/u3V9jcfSJBdpVvzoRCAdGcnPhBFO3pD/kickIZ5o8hrXF6QigmQkx68J/kzjqZPDOD4DZfDSrwSPFY3hquFU6oJTIQx24kqmtMxzmTFWFrw4dvCY4DJhh9HLi6S6Fr1bG1Hi3uTMxZ1zRukMZJ6WVKQspUKVmiqWSw

In [49]:
print(data[0].page_content)

1

2021

2

0

2 n u J 1 2 ] V C . s c [ 2 v 8 4 3 5 1 . 3 0 1 2 :

v

arXiv

i

X

r

a

LayoutParser: A Uniﬁed Toolkit for Deep Learning Based Document Image Analysis

Zejiang Shen! (4), Ruochen Zhang”, Melissa Dell?, Benjamin Charles Germain Lee*, Jacob Carlson’, and Weining Li®

1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca

Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model conﬁgurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going eﬀorts to improve reusability and 

In [50]:
data[1]

Document(metadata={'source': './layoutparser_paper.pdf', 'filetype': 'application/pdf', 'languages': ['eng'], 'last_modified': '2023-01-23T09:15:33', 'page_number': 1, 'orig_elements': 'eJy9Wl1v3LgV/SuEn2LAFCRRn3lLmxYbIG0DrIsCmwYBRVEj1RpJESU7k8X+955Lasaa8SS78WLyZOuKpMh7zz33g/P+1yvd6q3upo9NefWSXWVVmKWFUDxPq5hHURDwPFCKVyoRqcxiP0nk1Q272upJlnKSmPPrler7sWw6OWljn1u56+fpY62bTT1BEoa+jzmL+KEppxrSILXSoW+6iea9fy/SyBM3eBH6XvDhhj0KYuEJEiQi8eKzAjsFkiuzM5Pe0lneNZ91+/Mglb76DS9KPWk1NX33UbXSmI/D2BcY5ntZnoQ5BlRNq6fdoO3cd/+4slvuNrPc2HO9v9Ld5uqDlZrp47Yvm6rRVmuhHwruBzwUt37+MohfCkGzB8z82M3bQo90XtrEpD+TRq4C9qabxr6c7Y5o8P7Lt83U2g2fmqYoVB5WWcpFIBIelXHJpZ/kPJV5HssghDqzC5omdJbIYi93pnGCLEq8jASBEJmXnZW4Sc8zTh4mcXoJ47RNd2dn/nplJjlCy12pP9Ph42RtqIAGz2NLD6qZtFfLsdW70A9ifS/bWVoD/vbh2/Z+rfXA3mo5dk23efH67TUvpNElkwNOKlWtDZOjZlOtGXYzad5XHA8cG2NVPzLJYBvNRhxZs75iZa9mwgZrtvgok51sd6Yx7MXrN6+u2STNnWFNp9oZxt+cjrYqbv47V4UfKLt/GCu4WaPwn3Ic8eZe39IBzqAxqrSSOhE8z8OCR36meCYzzbVfqjAoqrQIosuhMck9gCKMUy+yYFyes9CLLSsEiZefebbjn4fDJEmy+PIkEa

In [51]:
print(data[1].page_content)

1 Introduction

Deep Learning(DL)-based approaches are the state-of-the-art for a wide range of document image analysis (DIA) tasks including document image classiﬁcation [11,

2 Z. Shen et al.

37], layout detection [38, 22], table detection [26], and scene text detection [4]. A generalized learning-based framework dramatically reduces the need for the manual speciﬁcation of complicated rules, which is the status quo with traditional methods. DL has the potential to transform DIA pipelines and beneﬁt a broad spectrum of large-scale document digitization projects.

However, there are several practical diﬃculties for taking advantages of re- cent advances in DL-based methods: 1) DL models are notoriously convoluted for reuse and extension. Existing models are developed using distinct frame- works like TensorFlow [1] or PyTorch [24], and the high-level parameters can be obfuscated by implementation details [8]. It can be a time-consuming and frustrating experience to debug, reproduce, an


