Long documents often cover many different topics. To provide just the relevant part of a document, we cut it up into smaller pieces. This process is usually called *chunking*.

In [1]:
%pip install langchain markdown

Note: you may need to restart the kernel to use updated packages.


First, we read the same document from a file.

In [2]:
# This is a long document we can split up.
# Assumption: the file is UTF-8 encoded.
with open("data/history.md", encoding="utf-8") as f:
    history_raw_text = f.read()

The generic splitter cuts the text into chunks of roughly a given size, measured in characters. If useful, we can make consecutive chunks overlap too.

In [3]:
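# Naive, generic splitter based on chunk size
from pprint import pprint

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Set a really small chunk size, just to show the effect.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,        # target maximum number of characters per chunk
    chunk_overlap=0,       # characters shared between consecutive chunks
    length_function=len,   # measure chunk length in characters
    add_start_index=True,  # store each chunk's offset as 'start_index' metadata
)
texts = text_splitter.create_documents([history_raw_text])

pprint(texts)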



[Document(page_content='# A history lesson on Devops\n\n## Devopsdays', metadata={'start_index': 0}),
 Document(page_content='Devopsdays is a worldwide series of technical conferences covering topics of software development,', metadata={'start_index': 45}),
 Document(page_content='IT infrastructure operations, and the intersection between them. Each event is run by volunteers', metadata={'start_index': 144}),
 Document(page_content='from the local area.', metadata={'start_index': 241}),
 Document(page_content='Most devopsdays events feature a combination of curated talks (see open Calls for Proposals) and', metadata={'start_index': 263}),
 Document(page_content='self organized open space content. Topics often include automation, testing, security, and', metadata={'start_index': 360}),
 Document(page_content='organizational culture.', metadata={'start_index': 451}),
 Document(page_content='### History', metadata={'start_index': 476}),
 Document(page_content='The first devopsdays was hel
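
To make the idea of chunk size and overlap concrete, here is a minimal pure-Python sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter tries to break on separators such as paragraphs and spaces); it only slides a fixed-size character window over the text. The names `chunk_with_overlap` and `sample` are made up for this illustration.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[tuple[int, str]]:
    """Cut text into fixed-size character windows.

    Returns (start_index, chunk) pairs. Each window starts
    chunk_size - chunk_overlap characters after the previous one,
    so consecutive chunks share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Hypothetical sample text, shortened from the document above.
sample = "Devopsdays is a worldwide series of technical conferences."

for start, chunk in chunk_with_overlap(sample, chunk_size=20, chunk_overlap=5):
    print(start, repr(chunk))
```

With an overlap, the last few characters of one chunk are repeated at the start of the next, which helps when a relevant sentence would otherwise be cut in two. The start index plays the same role as the `start_index` metadata above: it lets you map a chunk back to its position in the source text.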

But we can be smarter by using a splitter that is aware of the content format. In this case, we use the Markdown headers to make more meaningful splits.

In [4]:
# Now use a document/content specific splitter
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
texts = md_splitter.split_text(history_raw_text)

pprint(texts)

[Document(page_content='Devopsdays is a worldwide series of technical conferences covering topics of software development, IT infrastructure operations, and the intersection between them. Each event is run by volunteers from the local area.  \nMost devopsdays events feature a combination of curated talks (see open Calls for Proposals) and self organized open space content. Topics often include automation, testing, security, and organizational culture.', metadata={'Header 1': 'A history lesson on Devops', 'Header 2': 'Devopsdays'}),
 Document(page_content='The first devopsdays was held in Ghent, Belgium in 2009. Since then, devopsdays events have multiplied, and if there isn’t one in your city, check out the information about organizing one yourself!', metadata={'Header 1': 'A history lesson on Devops', 'Header 2': 'Devopsdays', 'Header 3': 'History'}),
 Document(page_content='The devopsdays global core team guides local organizers in hosting their own devopsdays events worldwide. Activ
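
To show what header-aware splitting does conceptually, here is a small pure-Python sketch. It is an illustration only, not the implementation of `MarkdownHeaderTextSplitter`: it assumes simple `#`-style headers and ignores edge cases such as `#` characters inside code blocks. The names `split_by_headers`, `HEADER_RE` and `sample_md` are made up for this example.

```python
import re

# Matches '#'-style Markdown headers such as "## Devopsdays".
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_headers(markdown: str) -> list[dict]:
    """Split Markdown into sections, attaching the active headers as metadata."""
    sections: list[dict] = []
    current_headers: dict[int, str] = {}  # header level -> header title
    body: list[str] = []

    def flush() -> None:
        # Store the collected body lines as one section, if there is any text.
        text = "\n".join(body).strip()
        if text:
            metadata = {
                f"Header {level}": title
                for level, title in sorted(current_headers.items())
            }
            sections.append({"content": text, "metadata": metadata})
        body.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header replaces any header of the same or a deeper level.
            for old_level in [lvl for lvl in current_headers if lvl >= level]:
                del current_headers[old_level]
            current_headers[level] = match.group(2)
        else:
            body.append(line)
    flush()
    return sections


# Hypothetical sample, shortened from the document above.
sample_md = """# A history lesson on Devops

## Devopsdays
Devopsdays is a worldwide series of technical conferences.

### History
The first devopsdays was held in Ghent, Belgium in 2009.
"""

for section in split_by_headers(sample_md):
    print(section["metadata"])
    print(section["content"])
```

Because each section carries its headers as metadata, a chunk keeps the context of where it sits in the document, even when the chunk text itself never mentions the topic.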