# Five Techniques for Text Splitting

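## Character Splitting

Character splitting is the most basic form of text splitting.
It simply divides the text into pieces of N characters, regardless of content or form.

This method is not recommended for any real application, but it is a good starting point for understanding the basic concepts.

- Pros: simple and easy to implement
- Cons: very rigid; ignores the structure of the text

Concepts to know:

- Chunk Size - the number of characters you want in each chunk, e.g. 50, 100, 100,000.
- Chunk Overlap - the number of characters you want adjacent chunks to share. This helps avoid cutting a single piece of context into separate fragments, at the cost of duplicating data across chunks.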

In [7]:
text = "This is the text I would like to chunk up. It is the example text for this exercise"

In [8]:
# Create a list that will hold your chunks
chunks = []

chunk_size = 35

for i in range(0, len(text), chunk_size):
    chunk = text[i : i + chunk_size]
    chunks.append(chunk)
chunks

['This is the text I would like to ch',
 'unk up. It is the example text for ',
 'this exercise']
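
The loop above uses no overlap. As a minimal sketch of how chunk overlap could work in plain Python, the hypothetical helper `chunk_with_overlap` below (not part of any library) advances the start position by `chunk_size - chunk_overlap`, so each chunk repeats the last few characters of the previous one:

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character chunks that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text,
        # so we never emit a trailing chunk made only of repeated characters.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


sample = "This is the text I would like to chunk up. It is the example text for this exercise"
overlapping = chunk_with_overlap(sample, chunk_size=35, chunk_overlap=5)
for chunk in overlapping:
    print(repr(chunk))
```

With `chunk_overlap=0` this reduces to the simple loop shown earlier; a larger overlap gives more shared context between neighbours but also more duplicated text.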

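### CharacterTextSplitter (LangChain)

When working with text in the language model world, we rarely deal with raw strings directly. More often we work with documents.

A document is an object that holds the text you care about, together with additional metadata that makes later filtering and manipulation easier.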

In [17]:
!pip install langchain

In [16]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=35, chunk_overlap=0, separator='', strip_whitespace=False)

text_splitter.create_documents([text])

[Document(metadata={}, page_content='This is the text I would like to ch'),
 Document(metadata={}, page_content='unk up. It is the example text for '),
 Document(metadata={}, page_content='this exercise')]
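
To make the role of metadata more concrete, here is a small sketch that uses a hypothetical `SimpleDocument` dataclass as a stand-in for LangChain's `Document`. The `source` and `chunk_index` metadata keys and the helpers `make_documents` and `by_source` are illustrative assumptions, not part of LangChain:

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a document: text plus free-form metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def make_documents(chunks, source):
    """Wrap each chunk in a SimpleDocument and record where it came from."""
    return [
        SimpleDocument(page_content=chunk, metadata={"source": source, "chunk_index": i})
        for i, chunk in enumerate(chunks)
    ]


def by_source(docs, source):
    """Keep only the documents whose metadata names the given source."""
    return [doc for doc in docs if doc.metadata.get("source") == source]


docs = make_documents(["first chunk", "second chunk"], "a.txt")
docs += make_documents(["another chunk"], "b.txt")

print([doc.page_content for doc in by_source(docs, "a.txt")])
```

Carrying the source alongside each chunk is what lets a later step filter or trace chunks back to their origin without re-reading the original files.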

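## Recursive Character Text Splitting

This is a versatile splitter, and it is my go-to when quickly prototyping an application.
If you are not sure which splitting method to use, this is a good place to start.

The problem with character splitting is that it ignores the structure of the document and simply cuts at a fixed number of characters.

The recursive character text splitter helps address this.
With it, we specify a series of separators that are used to split the document.

LangChain's default separators are:
- "\n\n" - double newline, usually a paragraph break
- "\n" - single newline
- " " - space
- "" - individual characters

The splitter first looks for double newlines (paragraph breaks).

Once the text has been split into paragraphs, the splitter checks the size of each chunk. If a chunk is too large, it is split on the next separator. If it is still too large, the splitter moves on to the following separator, and so on.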
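
The fallback through the separator list can be sketched in plain Python. This is a simplified illustration, not LangChain's actual implementation: the function name `recursive_split` is hypothetical, it ignores overlap, and it merges neighbouring pieces greedily as long as they fit within `chunk_size`.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text on the coarsest separator first, falling back to finer ones."""
    if len(text) <= chunk_size:
        return [text] if text else []

    # No separators left (or only the empty one): cut at fixed character positions.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits in the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry it with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "Para one is short.\n\nPara two is a bit longer than the limit allows.\n\nEnd."
for chunk in recursive_split(sample, chunk_size=30):
    print(repr(chunk))
```

In this sketch the short paragraphs stay intact, and only the paragraph that exceeds `chunk_size` is broken further, at word boundaries rather than mid-word.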

In [4]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = """
One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear.

Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers. You get no customers, and you go out of business.

It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]
"""

text_splitter = RecursiveCharacterTextSplitter(chunk_size=65, chunk_overlap=0)
text_splitter.create_documents([text])

[Document(metadata={}, page_content="One of the most important things I didn't understand about the"),
 Document(metadata={}, page_content='world when I was a child is the degree to which the returns for'),
 Document(metadata={}, page_content='performance are superlinear.'),
 Document(metadata={}, page_content='Teachers and coaches implicitly told us the returns were linear.'),
 Document(metadata={}, page_content='"You get out," I heard a thousand times, "what you put in." They'),
 Document(metadata={}, page_content='meant well, but this is rarely true. If your product is only'),
 Document(metadata={}, page_content="half as good as your competitor's, you don't get half as many"),
 Document(metadata={}, page_content='customers. You get no customers, and you go out of business.'),
 Document(metadata={}, page_content="It's obviously true that the returns for performance are"),
 Document(metadata={}, page_content='superlinear in business. Some think this is a flaw of'),
 Document(metadata=