# Document Splitting

In [1]:
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

In [2]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

In [3]:
chunk_size = 26
chunk_overlap = 4

In [4]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

Why doesn't this split the string below?

In [6]:
text1 = 'abcdefghijklmnopqrstuvwxyz'

In [7]:
r_splitter.split_text(text1)

['abcdefghijklmnopqrstuvwxyz']
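
A quick check suggests the answer: the string is exactly `chunk_size` characters long, so it already fits in one chunk. The snippet below is a plain-Python sanity check, not a call into LangChain, and `needs_split` is a hypothetical name used only here.

```python
# Values copied from the cells above.
chunk_size = 26
text1 = 'abcdefghijklmnopqrstuvwxyz'

# A text only has to be split when it is longer than chunk_size.
needs_split = len(text1) > chunk_size
print(len(text1), needs_split)
```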

In [8]:
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'

In [9]:
r_splitter.split_text(text2)

['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']

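Ok, this splits the string, and the second chunk starts with `wxyz`, the last 4 characters of the first chunk, which matches the `chunk_overlap` of 4 set above. When the text contains separators, as in the next example, the overlap can come out shorter than specified.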

In [10]:
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"

In [11]:
r_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
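
To measure the overlap instead of eyeballing it, one option is a small helper that finds the longest suffix of one chunk that is also a prefix of the next. This is a sketch for inspecting the outputs above, assuming the chunk lists are copied by hand; `shared_overlap` is a hypothetical helper, not part of LangChain.

```python
def shared_overlap(left, right):
    """Return the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""

# Chunk lists copied from the two outputs above.
chunks_text2 = ['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']
chunks_text3 = ['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

for chunks in (chunks_text2, chunks_text3):
    for left, right in zip(chunks, chunks[1:]):
        overlap = shared_overlap(left, right)
        print(repr(overlap), len(overlap))
```

For `text3` the measured overlap is 3 characters rather than 4. A plausible reading is that the overlap is assembled from whole space-separated pieces, and the next larger candidate (`k l m`, 5 characters) would exceed `chunk_overlap`.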

In [13]:
c_splitter.split_text(text3)

['a b c d e f g h i j k l m n o p q r s t u v w x y z']
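
The character splitter returns the whole string here because it only splits on a single separator, and its default separator is `"\n\n"` as far as the LangChain documentation describes it. `text3` contains no blank line, so there is only one piece to work with. The sketch below illustrates this with plain `str.split`; it is not the library's code path.

```python
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"

# Splitting on the assumed default separator finds nothing to split on.
pieces_default = text3.split("\n\n")
# Splitting on a space yields one piece per letter.
pieces_space = text3.split(" ")

print(len(pieces_default))
print(len(pieces_space))
```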

In [18]:
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator=' '
)
c_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

Try your own examples!

In [35]:
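# Portuguese sample text, roughly: "Wow, this is cool. I'm trying more examples
# to understand whether it makes sense to use this splitter."
text4 = 'NOSSA QUE LEGAL ISSO AQUI. ESTOU TENTANDO MAIS EXEMPLOS PRA ENTENDER SE FAZ SENTIDO USAR ESSE SPLITTER'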

c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator='.'
)

for x in c_splitter.split_text(text4):
    print("split:", x, "- Length:", len(x))

split: NOSSA QUE LEGAL ISSO AQUI - Length: 25
split: ESTOU TENTANDO MAIS EXEMPLOS PRA ENTENDER SE FAZ SENTIDO USAR ESSE SPLITTER - Length: 75
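
The second chunk is 75 characters long, well over `chunk_size`. A single-separator splitter has no finer separator to fall back on, so a piece that contains no `.` cannot be broken down further. The snippet below reproduces the piece lengths with plain `str.split`; it is an illustration, not the library's implementation, and `oversized` is a hypothetical name.

```python
chunk_size = 26
text4 = 'NOSSA QUE LEGAL ISSO AQUI. ESTOU TENTANDO MAIS EXEMPLOS PRA ENTENDER SE FAZ SENTIDO USAR ESSE SPLITTER'

# Split on the period only and trim surrounding whitespace.
pieces = [p.strip() for p in text4.split('.')]
lengths = [len(p) for p in pieces]

# Pieces that cannot fit in one chunk without a finer separator.
oversized = [p for p in pieces if len(p) > chunk_size]
print(lengths)
print(oversized)
```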


In [27]:
for x in r_splitter.split_text(text4):
    print("split:", x, "- Length:", len(x))

split: NOSSA QUE LEGAL ISSO AQUI. - Length: 26
split: ESTOU TENTANDO MAIS - Length: 19
split: EXEMPLOS PRA ENTENDER SE - Length: 24
split: SE FAZ SENTIDO USAR ESSE - Length: 24
split: SPLITTER - Length: 8


## Recursive splitting details

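`RecursiveCharacterTextSplitter` is recommended for generic text. It tries a list of separators in order (by default paragraph breaks, then line breaks, then spaces, then individual characters), so paragraphs, sentences, and words stay together for as long as the chunk size allows.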

In [36]:
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

In [37]:
len(some_text)

496

In [48]:
c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator=' '
)
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0, 
    separators=["\n\n", "\n", " ", ""]
)

In [49]:
for x in c_splitter.split_text(some_text):
    print("split:", x, "- Length:", len(x))

split: When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. 

 Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, - Length: 448
split: have a space.and words are separated by space. - Length: 46


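`c_splitter`: these two chunks do not correspond to meaningful units. The split falls mid-sentence, at the last space before the 450-character limit, and the paragraph break ends up buried inside the first chunk.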

In [50]:
for x in r_splitter.split_text(some_text):
    print("split:", x, "- Length:", len(x))

split: When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. - Length: 248
split: Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space. - Length: 243


`r_splitter`: these two chunks make more sense. The split falls on the paragraph break (`"\n\n"`), so each chunk holds exactly one paragraph.
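
The idea behind trying separators in order can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not LangChain's implementation: it ignores `chunk_overlap`, strips whitespace from every chunk, and `recursive_split` and `sample` are hypothetical names used only here.

```python
def recursive_split(text, separators, chunk_size):
    """Split `text` into chunks of at most `chunk_size` characters,
    preferring the earliest separator in `separators` (no overlap)."""
    # Base case: the text already fits in one chunk.
    if len(text) <= chunk_size:
        return [text]
    # No separator left, or the empty separator: cut at fixed positions.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    # This separator does not occur, so move on to a finer one.
    if sep not in text:
        return recursive_split(text, finer, chunk_size)

    chunks, current = [], ""
    for piece in text.split(sep):
        # Greedily merge pieces while the merged text still fits.
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: recurse with finer separators.
            current = ""
            chunks.extend(recursive_split(piece, finer, chunk_size))
    if current:
        chunks.append(current)
    return [c.strip() for c in chunks if c.strip()]


sample = "first paragraph here\n\nsecond paragraph is here"
separators = ["\n\n", "\n", " ", ""]
print(recursive_split(sample, separators, 30))
print(recursive_split(sample, separators, 12))
```

With a chunk size of 30 the sketch splits only at the paragraph break; with 12 it has to fall back to spaces, which mirrors the behavior seen in the comparison above.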