# Document Splitting

Split documents into smaller chunks

* while preserving meaningful relationships

<img src="fig1.png" width=500>

...
on this model. The Toyota Camry has a head-snapping 80 HP and an eight-speed automatic transmission that will
...

**Chunk 1** : on this model. The Toyota Camry has a head-snapping

**Chunk 2** : 80 HP and an eight-speed automatic transmission that will

**Question**: What are the specifications on the Camry?

=> Chunks produced by naive splitting may not yield any answer at all, because neither chunk contains both the car and its specification.
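
The failure mode above can be reproduced in a few lines of plain Python. This is a minimal sketch, not LangChain code: `naive_split` is a hypothetical helper, and the cut position is an assumption chosen to fall right before "80 HP", as in Chunk 1 and Chunk 2.

```python
text = (
    "on this model. The Toyota Camry has a head-snapping 80 HP "
    "and an eight-speed automatic transmission that will"
)


def naive_split(s, cut):
    """Cut a string in two at a fixed index, ignoring sentence structure."""
    return [s[:cut], s[cut:]]


# Assumption: cut right before the horsepower figure, as in Chunk 1 / Chunk 2.
chunks = naive_split(text, text.index("80 HP"))

# A chunk can only answer the question if it names the car AND gives the spec.
answerable = [c for c in chunks if "Camry" in c and "80 HP" in c]

for i, c in enumerate(chunks, start=1):
    print(f"Chunk {i}: {c!r}")
print("Chunks that can answer the question:", len(answerable))
```

No chunk passes the check, which is why the splitters below try to cut at meaningful boundaries and repeat some text between neighbouring chunks.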

### Example Splitter

```
langchain.text_splitter.CharacterTextSplitter(
    separator: str = "\n\n",
    chunk_size=4000,
    chunk_overlap=200,
    length_function=<built-in function len>,
)
Methods:
create_documents() - Create documents from a list of texts.
split_documents() - Split documents
```
<img src="fig2.png" width=500>

### Types of Splitters

**langchain.text_splitter.**

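* **CharacterTextSplitter()** - Implementation of splitting text that looks at characters.
* **MarkdownHeaderTextSplitter()** - Implementation of splitting Markdown files based on specified headers.
* **TokenTextSplitter()** - Implementation of splitting text that looks at tokens.
* SentenceTransformersTokenTextSplitter() - Implementation of splitting text that looks at tokens.
* **RecursiveCharacterTextSplitter()** - Implementation of splitting text that looks at characters. Recursively tries to split by different characters to find one that works.
* Language() - for CPP, Python, Ruby, Markdown, etc.
* NLTKTextSplitter() - Implementation of splitting text that looks at sentences using NLTK (Natural Language Tool Kit).
* SpacyTextSplitter() - Implementation of splitting text that looks at sentences using Spacy.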


In [59]:
import os
import openai

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']

In [61]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

In [65]:
chunk_size = 26
chunk_overlap = 4

In [69]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

Why doesn't this split the string below? (It is exactly 26 characters long, so it fits within `chunk_size`.)

In [72]:
text1 = 'abcdefghijklmnopqrstuvwxyz'

In [74]:
r_splitter.split_text(text1)

['abcdefghijklmnopqrstuvwxyz']

In [76]:
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'

In [78]:
r_splitter.split_text(text2)

['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']
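
On text with no separators at all, the two results above look like a plain sliding window: each chunk is at most `chunk_size` characters long and starts `chunk_size - chunk_overlap` characters after the previous one. The sketch below illustrates that idea only; `sliding_chunks` is a hypothetical helper, not the library's implementation.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Fixed-width chunks where neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(sliding_chunks("abcdefghijklmnopqrstuvwxyz", 26, 4))
print(sliding_chunks("abcdefghijklmnopqrstuvwxyzabcdefg", 26, 4))
```

The 26-character string fits in one window, while the 33-character string needs a second window that starts 22 characters in, at `w`.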

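This time the string is split. The overlap is set to 4, and the second chunk does begin with the 4 repeated characters `wxyz`. Next, try a string that contains spaces.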

In [82]:
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"

In [84]:
r_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

In [86]:
c_splitter.split_text(text3)

['a b c d e f g h i j k l m n o p q r s t u v w x y z']

In [88]:
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator = ' '
)
c_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
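
The results above suggest that the text is first split on the separator and the pieces are then merged back into chunks, with whole pieces carried over as overlap. The following is a simplified sketch modeled on that observed behaviour; `merge_words` is a hypothetical helper and makes no claim to match the library's code.

```python
def merge_words(text, chunk_size, chunk_overlap, separator=" "):
    """Split on a separator, then greedily merge pieces into chunks."""
    pieces = [p for p in text.split(separator) if p]
    sep_len = len(separator)
    chunks, window, total = [], [], 0  # total = length of the joined window

    for piece in pieces:
        if window and total + sep_len + len(piece) > chunk_size:
            chunks.append(separator.join(window))
            # Drop pieces from the front until what is left fits in the
            # overlap budget and leaves room for the incoming piece.
            while window and (
                total > chunk_overlap
                or total + sep_len + len(piece) > chunk_size
            ):
                total -= len(window[0]) + (sep_len if len(window) > 1 else 0)
                window.pop(0)
        window.append(piece)
        total += len(piece) + (sep_len if len(window) > 1 else 0)

    if window:
        chunks.append(separator.join(window))
    return chunks


text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
print(merge_words(text3, 26, 4, separator=" "))
print(merge_words(text3, 26, 4, separator="\n\n"))
```

Because only whole pieces are carried over, the overlap here is `l m` (3 characters) rather than exactly 4. With the `"\n\n"` separator the text is a single piece, so nothing can be split.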

## Recursive splitting details

`RecursiveCharacterTextSplitter` is recommended for generic text.

In [91]:
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

In [93]:
len(some_text)

496

In [95]:
c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator = ' '
)
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0, 
    separators=["\n\n", "\n", " ", ""]
)

In [99]:
c_splitter.split_text(some_text)

['When writing documents, writers will use document structure to group content. This can convey to the reader, which idea\'s are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also,',
 'have a space.and words are separated by space.']

In [101]:
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.",
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']
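
The idea behind the `separators` list can be sketched in plain Python: try the coarsest separator first, and only fall back to a finer one for pieces that are still too long. This is a simplified illustration under stated assumptions (no overlap, no whitespace stripping, literal separators only); `recursive_split` is a hypothetical helper, not the library's implementation.

```python
def recursive_split(text, chunk_size, separators):
    """Split with the coarsest separator that keeps chunks within chunk_size."""
    if len(text) <= chunk_size:
        return [text]

    # Pick the first separator that occurs in the text ("" always applies).
    for i, sep in enumerate(separators):
        if sep == "" or sep in text:
            finer = separators[i + 1:]
            break
    else:
        return [text]  # nothing left to split on, return the text as-is

    if sep == "":
        # Last resort: cut at fixed widths.
        return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) > chunk_size:
            # Still too long: retry this piece with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("aaa bbb\n\nccc ddd eee", 7, ["\n\n", " ", ""]))
```

In this toy input the paragraph break is used first, and only the second paragraph, which is still longer than 7 characters, is split again on spaces.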

Let's reduce the chunk size a bit and add a period to our separators:

In [103]:
import warnings
warnings.filterwarnings("ignore")

In [107]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"\. ", " ", ""],
    is_separator_regex=True
)
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related",
 '. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns',
 '. Carriage returns are the "backslash n" you see embedded in this string',
 '. Sentences have a period at the end, but also, have a space.and words are separated by space.']

In [109]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""],
    is_separator_regex=True
)
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related.",
 'For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns.',
 'Carriage returns are the "backslash n" you see embedded in this string.',
 'Sentences have a period at the end, but also, have a space.and words are separated by space.']
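
The difference between the two patterns can be seen with Python's `re` module alone. The pattern `\. ` consumes the period and the space, whereas the lookbehind `(?<=\. )` matches the empty position just after them, so the period stays with the sentence it ends. This small sketch uses a made-up sample string and assumes Python 3.7 or later, where `re.split` can split on empty matches.

```python
import re

sample = "One. Two. Three."

# The separator itself is matched and removed from the pieces.
consumed = re.split(r"\. ", sample)

# The lookbehind matches a zero-width position, so nothing is removed.
kept = re.split(r"(?<=\. )", sample)

print(consumed)
print([s.strip() for s in kept])
```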

In [111]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()

In [115]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len
)

In [119]:
docs = text_splitter.split_documents(pages)

In [121]:
len(docs)

78

In [123]:
len(pages)

22

In [125]:
from langchain.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("docs/Notion_DB")
notion_db = loader.load()

In [129]:
docs = text_splitter.split_documents(notion_db)

In [131]:
len(notion_db)

48

In [133]:
len(docs)

341


## Token splitting

We can also split on token count explicitly, if we want.

This can be useful because LLMs often have context windows specified in tokens.

Tokens are often ~4 characters.

In [135]:
from langchain.text_splitter import TokenTextSplitter

In [137]:
text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)

In [139]:
text1 = "foo bar bazzyfoo"

In [141]:
text_splitter.split_text(text1)

['foo', ' bar', ' b', 'az', 'zy', 'foo']
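
The mechanics of token-based splitting can be sketched without a real tokenizer: tokenize, group the tokens into windows of `chunk_size`, and join each window back into text. The tokenizer below is a toy stand-in (an assumption for illustration) that splits on whitespace boundaries; a real subword tokenizer breaks words further, which is why the output above contains pieces such as `' b'` and `'az'`. Both `toy_tokenize` and `split_by_tokens` are hypothetical helpers.

```python
import re


def toy_tokenize(text):
    """Toy tokenizer: each token is a word plus the whitespace before it."""
    return re.findall(r"\s*\S+|\s+$", text)


def split_by_tokens(text, chunk_size, chunk_overlap=0):
    """Group tokens into chunks whose size is counted in tokens, not characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    tokens = toy_tokenize(text)
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append("".join(tokens[start:end]))
        if end == len(tokens):
            break
        start += step
    return chunks


print(split_by_tokens("foo bar bazzyfoo", chunk_size=1))
print(split_by_tokens("a b c d e", chunk_size=2))
```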

In [143]:
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

In [145]:
docs = text_splitter.split_documents(pages)

In [147]:
docs[0]

Document(metadata={'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0}, page_content='MachineLearning-Lecture01  \n')

In [149]:
pages[0].metadata

{'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0}
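
The first chunk carries the same `source` and `page` metadata as the page it came from. A minimal sketch of that pattern follows, assuming each page is split independently and each chunk receives a copy of its page's metadata. `Doc` and this `split_documents` function are hypothetical stand-ins, not LangChain classes.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_documents(docs, split_text):
    """Split each document on its own and copy its metadata onto every chunk."""
    chunks = []
    for doc in docs:
        for piece in split_text(doc.page_content):
            chunks.append(Doc(piece, dict(doc.metadata)))
    return chunks


# Made-up pages for illustration; str.split stands in for a real splitter.
pages = [
    Doc("first page text", {"source": "example.pdf", "page": 0}),
    Doc("second page", {"source": "example.pdf", "page": 1}),
]
docs = split_documents(pages, str.split)

print(len(pages), "pages ->", len(docs), "chunks")
print([d.metadata["page"] for d in docs])
```

Splitting page by page also explains why there are more chunks than pages, and why a chunk never spans two pages.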

## Context aware splitting

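Chunking aims to keep text with common context together.

A text splitter often uses sentences or other delimiters to keep related text together, but many documents (such as Markdown) have structure (headers) that can be used explicitly for splitting.

We can use `MarkdownHeaderTextSplitter` to preserve header metadata in our chunks, as shown below.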

In [153]:
from langchain.document_loaders import NotionDirectoryLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter

In [174]:
markdown_document = """# Title\n\n \
## Chapter 1\n\n \
Hi this is Jim\n\n Hi this is Joe\n\n \
### Section \n\n \
Hi this is Lance \n\n 
## Chapter 2\n\n \
Hi this is Molly"""

In [176]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

In [178]:
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(markdown_document)

In [180]:
md_header_splits[0]

Document(metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'}, page_content='Hi this is Jim  \nHi this is Joe')

In [182]:
md_header_splits[1]

Document(metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section'}, page_content='Hi this is Lance')
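
The header metadata in these results can be understood as a running record of the headers currently in effect. The sketch below is a simplified, plain-Python illustration of that idea. `split_by_headers` is a hypothetical helper that returns dictionaries rather than `Document` objects, and its whitespace handling is not meant to match the library's output exactly.

```python
def split_by_headers(markdown, headers_to_split_on):
    """Group body lines under the headers that are active when they appear."""
    active = {}  # header depth -> (metadata key, header title)
    chunks = []
    body = []

    def flush():
        content = "\n".join(body).strip()
        if content:
            meta = {key: title for _, (key, title) in sorted(active.items())}
            chunks.append({"metadata": meta, "page_content": content})
        body.clear()

    for raw in markdown.splitlines():
        line = raw.strip()
        for marker, key in headers_to_split_on:
            # The trailing space keeps "#" from matching "##" or "###".
            if line.startswith(marker + " "):
                flush()
                depth = len(marker)
                # A new header closes every header at the same or deeper level.
                for d in [d for d in active if d >= depth]:
                    del active[d]
                active[depth] = (key, line[len(marker):].strip())
                break
        else:
            if line:
                body.append(line)
    flush()
    return chunks


doc = (
    "# Title\n\n## Chapter 1\n\nHi this is Jim\n\nHi this is Joe\n\n"
    "### Section\n\nHi this is Lance\n\n## Chapter 2\n\nHi this is Molly"
)
headers = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
for chunk in split_by_headers(doc, headers):
    print(chunk)
```

When `## Chapter 2` appears, the sketch drops both `Chapter 1` and `Section`, so the last chunk is tagged only with `Title` and `Chapter 2`.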

Try it on a real Markdown file, such as a Notion database.

In [186]:
loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()
txt = ' '.join([d.page_content for d in docs])

In [190]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

In [194]:
md_header_splits = markdown_splitter.split_text(txt)

In [196]:
md_header_splits[0]

Document(metadata={'Header 1': "Blendle's Employee Handbook"}, page_content="This is a living document with everything we've learned working with people while running a startup. And, of course, we continue to learn. Therefore it's a document that will continue to change.  \n**Everything related to working at Blendle and the people of Blendle, made public.**  \nThese are the lessons from three years of working with the people of Blendle. It contains everything from [how our leaders lead](https://www.notion.so/ecfb7e647136468a9a0a32f1771a8f52?pvs=21) to [how we increase salaries](https://www.notion.so/Salary-Review-e11b6161c6d34f5c9568bb3e83ed96b6?pvs=21), from [how we hire](https://www.notion.so/Hiring-451bbcfe8d9b49438c0633326bb7af0a?pvs=21) and [fire](https://www.notion.so/Firing-5567687a2000496b8412e53cd58eed9d?pvs=21) to [how we think people should give each other feedback](https://www.notion.so/Our-Feedback-Process-eb64f1de796b4350aeab3bc068e3801f?pvs=21) — and much more.  \nWe've 