# Document Splitting

In [1]:
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key = os.environ['OPENAI_KEY']

In [2]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

## Text splitters

| Name | Splits On | Description |
| --- | --- | --- |
| Recursive | A list of user-defined characters | Splits text recursively, trying each separator in turn so that related pieces of text stay next to each other. This is the recommended starting point for splitting text. |
| Character | A user-defined character | Splits text on a single user-defined character. One of the simpler methods. |


**Recursive**: 

Important parameters:
- `chunk_size` controls the maximum size (in characters) of the final documents.
- `chunk_overlap` specifies how many characters consecutive chunks should share.
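
To make the two parameters concrete, here is a minimal sliding-window sketch in plain Python. It is not how LangChain's splitters work internally (they split on separators first); `sliding_chunks` is a hypothetical helper that only illustrates how a maximum chunk size and an overlap interact.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of at most chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


example = sliding_chunks("abcdefghijklmnopqrstuvwxyzabcdefg", chunk_size=26, chunk_overlap=4)
print(example)  # ['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']
```

A string that is no longer than `chunk_size` comes back as a single chunk, because the first window already covers it.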




In [3]:
chunk_size = 26
chunk_overlap = 4

In [4]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

Why doesn't the splitter split the string below? It is exactly 26 characters long, so it already fits within `chunk_size`.

In [5]:
text1 = 'abcdefghijklmnopqrstuvwxyz'

In [6]:
r_splitter.split_text(text1)

['abcdefghijklmnopqrstuvwxyz']

In [7]:
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'

In [8]:
r_splitter.split_text(text2)

['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']

OK, this splits the string, and the second chunk begins with `wxyz`, the 4-character overlap set by `chunk_overlap = 4`.

In [9]:
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"

In [10]:
r_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

In [11]:
c_splitter.split_text(text3)

['a b c d e f g h i j k l m n o p q r s t u v w x y z']
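
The string comes back whole because `CharacterTextSplitter` splits on a single separator, assumed here to default to a blank line (`"\n\n"`), and this text contains none. The check below uses plain `str.split` to illustrate the point; it is a sketch, not the splitter's own code.

```python
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"

# Splitting on a blank line finds nothing to split on, so one piece remains.
pieces_default = text3.split("\n\n")
# Splitting on a space yields one piece per letter, which can then be grouped into chunks.
pieces_space = text3.split(" ")

print(len(pieces_default))  # 1
print(len(pieces_space))    # 26
```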

In [12]:
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator = ' '
)
c_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

Try your own examples!

## Recursive splitting details

`RecursiveCharacterTextSplitter` is recommended for generic text. 

In [13]:
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

In [14]:
len(some_text)

496

In [15]:
c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator = ' '
)
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0, 
    separators=["\n\n", "\n", " ", ""]
)

In [16]:
c_splitter.split_text(some_text)

['When writing documents, writers will use document structure to group content. This can convey to the reader, which idea\'s are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also,',
 'have a space.and words are separated by space.']

In [17]:
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.",
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']
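
The result above is consistent with trying separators in priority order: the text is first split on `"\n\n"`, and finer separators are only needed for pieces that are still too long. A simplified sketch of that idea follows. `recursive_split` is a hypothetical helper, and unlike the real splitter it does not merge small neighbouring pieces back together or apply overlap.

```python
def recursive_split(text, separators, chunk_size):
    """Split on the first separator, then recurse with finer separators on oversized pieces."""
    if len(text) <= chunk_size:
        # Small enough already; drop pieces that are only whitespace.
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # No separator left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    first, finer = separators[0], separators[1:]
    pieces = []
    for piece in text.split(first):
        pieces.extend(recursive_split(piece, finer, chunk_size))
    return pieces


sample = "aaaa bbbb\n\ncccc dddd eeee"
print(recursive_split(sample, ["\n\n", "\n", " ", ""], chunk_size=10))
# ['aaaa bbbb', 'cccc', 'dddd', 'eeee']
```

The first paragraph fits within the limit and stays intact, while the second is too long and gets broken down further on spaces.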

Let's reduce the chunk size a bit and add a period to our separators:

In [18]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"\. ", " ", ""]  # raw string avoids an invalid-escape warning; same value
)
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example,",
 'closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this',
 'string. Sentences have a period at the end, but also, have a space.and words are separated by space.']

In [19]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""]  # raw string avoids an invalid-escape warning; same value
)
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example,",
 'closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this',
 'string. Sentences have a period at the end, but also, have a space.and words are separated by space.']
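
The two separator patterns differ in what happens to the period when they are applied as regular expressions: `\. ` consumes the period and space, while the lookbehind `(?<=\. )` splits just after them and keeps the period with its sentence. The sketch below shows this with Python's own `re.split` (zero-width splits need Python 3.7 or later). Note that the two outputs above are identical and break mid-sentence, which suggests neither pattern matched as a regex in this run. Some LangChain releases treat separators as literal text unless `is_separator_regex=True` is passed; whether that applies here depends on the installed version, which this notebook does not state.

```python
import re

sample = "Sentence one. Sentence two. Three"

# The plain pattern removes the matched period and space from the pieces.
consuming = re.split(r"\. ", sample)
# The lookbehind matches the empty position after ". ", so nothing is removed.
lookbehind = re.split(r"(?<=\. )", sample)

print(consuming)   # ['Sentence one', 'Sentence two', 'Three']
print(lookbehind)  # ['Sentence one. ', 'Sentence two. ', 'Three']
```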

In [20]:
from langchain.document_loaders import PyPDFLoader

In [21]:
base_path = "./data/"
filename = "ed3book_jan122022.pdf"

In [22]:
loader = PyPDFLoader(os.path.join(base_path, filename))
pages = loader.load()

In [23]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len
)

In [24]:
pdf_docs = text_splitter.split_documents(pages)

Created a chunk of size 1134, which is longer than the specified 1000
Created a chunk of size 2472, which is longer than the specified 1000
Created a chunk of size 4122, which is longer than the specified 1000
Created a chunk of size 1942, which is longer than the specified 1000
Created a chunk of size 3861, which is longer than the specified 1000
Created a chunk of size 2576, which is longer than the specified 1000
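
These warnings appear when a single piece between separators is already longer than `chunk_size`: with only one separator (`"\n"` here) the splitter has nothing finer to fall back on, so the piece is kept whole. A minimal sketch of that situation using plain string operations; `oversized_pieces` is a hypothetical helper, not part of LangChain.

```python
def oversized_pieces(text, separator, chunk_size):
    """Return the separator-delimited pieces that exceed chunk_size on their own."""
    return [piece for piece in text.split(separator) if len(piece) > chunk_size]


sample_text = "short line\n" + "x" * 30 + "\nanother short line"
too_long = oversized_pieces(sample_text, "\n", chunk_size=20)
print([len(piece) for piece in too_long])  # [30]
```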


In [25]:
len(pdf_docs)

2493

In [26]:
len(pages)

653

In [27]:
pdf_docs[100].page_content

'the way words are built up from smaller meaning-bearing units called morphemes . morpheme\nTwo broad classes of morphemes can be distinguished: stems —the central mor- stem\npheme of the word, supplying the main meaning—and afﬁxes —adding “additional” afﬁx\nmeanings of various kinds. So, for example, the word foxconsists of one morpheme\n(the morpheme fox) and the word cats consists of two: the morpheme catand the\nmorpheme -s. A morphological parser takes a word like cats and parses it into the\ntwo morphemes catands, or parses a Spanish word like amaren (‘if in the future\nthey would love’) into the morpheme amar ‘to love’, and the morphological features\n3PL andfuture subjunctive .\nThe Porter Stemmer\nLemmatization algorithms can be complex. For this reason we sometimes make use\nof a simpler but cruder method, which mainly consists of chopping off word-ﬁnal\nafﬁxes. This naive version of morphological analysis is called stemming . One of stemming'

In [28]:
from langchain.document_loaders import NotionDirectoryLoader

In [29]:
loader = NotionDirectoryLoader(f"{base_path}/")
notion_db = loader.load()

In [30]:
docs = text_splitter.split_documents(notion_db)

In [31]:
len(notion_db)

1

In [32]:
len(docs)

2

## Token splitting

We can also split on token count explicitly, if we want.

This can be useful because LLMs often have context windows designated in tokens.

In English text, a token is often about 4 characters.
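
That rule of thumb gives a quick way to estimate token counts from character counts. The sketch below is only a rough heuristic; `estimate_tokens` is a hypothetical helper, and real counts depend on the tokenizer and on the text itself.

```python
import math


def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate from character count, assuming about 4 characters per token."""
    return math.ceil(len(text) / chars_per_token)


print(estimate_tokens("foo bar bazzyfoo"))  # 4 (16 characters / 4)
```

Unusual words tend to break into more tokens than this estimate predicts, so treat it as a lower-bound guide rather than a budget.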

In [33]:
from langchain.text_splitter import TokenTextSplitter

In [34]:
text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)

In [35]:
text1 = "foo bar bazzyfoo"

In [36]:
text_splitter.split_text(text1)

['foo', ' bar', ' b', 'az', 'zy', 'foo']

In [37]:
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

In [38]:
docs = text_splitter.split_documents(pages)

In [39]:
docs[0]

Document(page_content='Speech and Language Processing\nAn Introduction to Natural', metadata={'source': './data/ed3book_jan122022.pdf', 'page': 0})

In [40]:
pages[0].metadata

{'source': './data/ed3book_jan122022.pdf', 'page': 0}

## Context aware splitting

Chunking aims to keep text with common context together.

Text splitters often use sentences or other delimiters to keep related text together, but many documents (such as Markdown files) have structure (headers) that can be used explicitly when splitting.

We can use `MarkdownHeaderTextSplitter` to preserve header metadata in our chunks, as shown below.

In [41]:
from langchain.document_loaders import NotionDirectoryLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter

In [42]:
markdown_document = """# Title\n\n \
## Chapter 1\n\n \
Hi this is Jim\n\n Hi this is Joe\n\n \
### Section \n\n \
Hi this is Lance \n\n 
## Chapter 2\n\n \
Hi this is Molly"""

In [43]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

In [44]:
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(markdown_document)

In [48]:
for doc in md_header_splits:  # each item is a Document, not a header
    print(doc)

page_content='Hi this is Jim  \nHi this is Joe' metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'}
page_content='Hi this is Lance' metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section'}
page_content='Hi this is Molly' metadata={'Header 1': 'Title', 'Header 2': 'Chapter 2'}


In [46]:
md_header_splits[1]

Document(page_content='Hi this is Lance', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section'})
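
The idea behind header-aware splitting can be sketched in plain Python: walk the lines, remember the most recent header at each level, and attach those headers as metadata to the text that follows. This is an illustrative sketch under simplifying assumptions (headers are lines starting with the marker plus a space, and code fences are ignored); `split_by_headers` is a hypothetical helper, not LangChain's implementation.

```python
def split_by_headers(markdown, headers_to_split_on):
    """Group body lines under the headers currently in effect, returned as metadata."""
    chunks = []
    current = {}  # header name -> header title currently in effect
    body = []     # body lines collected since the last header

    def flush():
        content = "\n".join(body).strip()
        if content:
            chunks.append({"content": content, "metadata": dict(current)})
        body.clear()

    for raw in markdown.splitlines():
        line = raw.strip()
        for marker, name in headers_to_split_on:
            # The trailing space keeps "#" from matching "##" or "###".
            if line.startswith(marker + " "):
                flush()
                # A new header replaces headers at the same or a deeper level.
                for other_marker, other_name in headers_to_split_on:
                    if len(other_marker) >= len(marker):
                        current.pop(other_name, None)
                current[name] = line[len(marker):].strip()
                break
        else:
            if line:
                body.append(line)
    flush()
    return chunks


doc = "# Title\n\n## Chapter 1\n\nHi this is Jim\n\n### Section\n\nHi this is Lance\n\n## Chapter 2\n\nHi this is Molly"
levels = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
for chunk in split_by_headers(doc, levels):
    print(chunk)
```

Starting "Chapter 2" clears the earlier "Section" header, so the last chunk carries only its title and chapter.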

Try it on a real Markdown file, such as a Notion database export.

In [49]:
loader = NotionDirectoryLoader(f"{base_path}/")
docs = loader.load()
txt = ' '.join([d.page_content for d in docs])

In [50]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

In [51]:
md_header_splits = markdown_splitter.split_text(txt)

In [52]:
md_header_splits[0]

Document(page_content='Tags: Dinner\nLink: https://www.forkinthekitchen.com/vegetable-lo-mein-for-two/', metadata={'Header 1': 'Vegetable Lo Mein for Two'})

In [53]:
for doc in md_header_splits:  # each item is a Document, not a header
    print(doc)

page_content='Tags: Dinner\nLink: https://www.forkinthekitchen.com/vegetable-lo-mein-for-two/' metadata={'Header 1': 'Vegetable Lo Mein for Two'}
page_content='- 4 ounces\xa0[Chinese noodles](https://www.amazon.com/gp/product/B0052P1AS4/ref=as_li_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=B0052P1AS4&linkCode=as2&tag=forinthekit-20&linkId=10c383c442cc85ec6122caa764a31d1e)\xa0(1/2 package)\n- 1 Tablespoon oyster sauce\n- 1/2 teaspoon sesame oil\n- 1 Tablespoon dark soy sauce\n- 1 Tablespoon light (regular) soy sauce\n- 1 1/2 Tablespoons vegetable oil\n- 1/2 yellow onion, thinly sliced\n- 2 garlic cloves, thinly sliced\n- 1 cup shredded carrots (2 medium)\n- 1 bell pepper, thinly sliced\n- 3-4 ounces snow peas' metadata={'Header 1': 'Vegetable Lo Mein for Two', 'Header 2': 'Ingredients'}
page_content='1. Boil a large pot of water; cook Chinese noodles for 1-2 minutes, stirring to unfold. Drain, rinse, and set aside.\n2. In a small bowl, whisk together oyster sauce, sesame oil, and so