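# Document Transformers

# Document Splitting
### How it works
1. Split the document into small, meaningful pieces (typically sentences).
2. Merge the small pieces into a larger chunk until it reaches a set size.
3. Once that size is reached, emit the chunk and start a new one that overlaps with the end of the previous chunk.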
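
The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not LangChain's implementation: the helper names `split_sentences` and `merge_with_overlap` are hypothetical, length is measured in characters, and sentences are detected with a simple punctuation rule.

```python
import re


def split_sentences(text):
    """Step 1: cut the text right after each sentence-ending punctuation mark."""
    pieces = re.split(r"(?<=[。.!?！？])", text)
    return [p for p in pieces if p]


def merge_with_overlap(pieces, chunk_size, chunk_overlap):
    """Steps 2 and 3: pack pieces into chunks, seeding each new chunk
    with the tail of the previous one."""
    chunks, current, current_len = [], [], 0
    for piece in pieces:
        if current and current_len + len(piece) > chunk_size:
            chunks.append("".join(current))
            # Drop pieces from the front until the remainder fits the overlap
            # budget and leaves room for the incoming piece.
            while current and (
                current_len > chunk_overlap
                or current_len + len(piece) > chunk_size
            ):
                current_len -= len(current.pop(0))
        current.append(piece)
        current_len += len(piece)
    if current:
        chunks.append("".join(current))
    return chunks


pieces = split_sentences("aaaa.bbbb.cccc.dddd.")
print(merge_with_overlap(pieces, chunk_size=12, chunk_overlap=5))
# ['aaaa.bbbb.', 'bbbb.cccc.', 'cccc.dddd.']
```

Each chunk repeats the last sentence of the one before it, which is what lets a retriever recover context that straddles a chunk boundary.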

### Examples
- A first document split
- Splitting by character
- Splitting code documents
- Splitting by token

In [None]:
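from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the document to split
with open("test.txt", "r", encoding="utf-8") as f:
    zhhx = f.read()

# Use the recursive character splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,  # maximum chunk size, measured by length_function
    chunk_overlap=50,  # overlap between adjacent chunks
    length_function=len,  # length function; a token-counting function also works
    add_start_index=True,  # record each chunk's start offset in its metadata
)
# create_documents expects a list of strings, not a bare string
text = text_splitter.create_documents([zhhx])
text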

In [11]:
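from langchain.text_splitter import CharacterTextSplitter

# Load the document to split
with open("test.txt", "r", encoding="utf-8") as f:
    zhhx = f.read()

# Use the plain character splitter, which splits on a single separator
text_splitter = CharacterTextSplitter(
    separator="。",  # split marker; the default is "\n\n"
    chunk_size=50,  # maximum chunk size, measured by length_function
    chunk_overlap=20,  # overlap between adjacent chunks
    length_function=len,  # length function; a token-counting function also works
    add_start_index=True,  # record each chunk's start offset in its metadata
    is_separator_regex=False,  # treat the separator as a literal string, not a regex
)
text = text_splitter.create_documents([zhhx])
text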

Created a chunk of size 125, which is longer than the specified 50
Created a chunk of size 72, which is longer than the specified 50
Created a chunk of size 72, which is longer than the specified 50
Created a chunk of size 63, which is longer than the specified 50
Created a chunk of size 52, which is longer than the specified 50
Created a chunk of size 96, which is longer than the specified 50
Created a chunk of size 51, which is longer than the specified 50
Created a chunk of size 66, which is longer than the specified 50
Created a chunk of size 105, which is longer than the specified 50
Created a chunk of size 84, which is longer than the specified 50
Created a chunk of size 78, which is longer than the specified 50
Created a chunk of size 72, which is longer than the specified 50
Created a chunk of size 66, which is longer than the specified 50
Created a chunk of size 92, which is longer than the specified 50
Created a chunk of size 58, which is longer than the specified 50
Created 

[Document(page_content='蒂法介绍\n蒂法·洛克哈特(日语:ティファ・ロックハート，Tifa Rokkuhāto，英语:Tifa Lockhart)为电子游戏《最终幻想VII》及《最终幻想VII补完计划》相关作品中的虚构⻆ 色，由\U0010fc00村哲也创作和设计，此后也在多个游戏中客串登场', metadata={'start_index': 1}),
 Document(page_content='2014年东京电玩展上，星名美津纪cosplay《最终幻想VII 降临之子》中的蒂法·洛克哈特 蒂法是克劳德的⻘梅竹⻢，两人同为尼布鲁海姆出身', metadata={'start_index': 127}),
 Document(page_content='在米德加经营作为反抗组织“雪崩”根 据地的酒馆“第七天堂”，并且是小有名气的招牌店员', metadata={'start_index': 199}),
 Document(page_content='擅⻓格斗，以拳套为武器。本传7年前 克劳德离开故乡从军时，曾许下约定“如果有危机时一定会保护她”', metadata={'start_index': 242}),
 Document(page_content='与爱丽丝相识之后，两 人成为好友', metadata={'start_index': 291}),
 Document(page_content='第一个察觉克劳德记忆混乱的人，后来协助精神崩溃的克劳德\U0010fc01新找回真正的自 己', metadata={'start_index': 308}),
 Document(page_content='本传的大战结束后，依大家的期待在战后新生的米德加再次开设第七天堂(原第七天堂因 第柒区圆盘崩塌遭压毁)，同时也照顾一群受到星痕症候群折磨的孩子们', metadata={'start_index': 346}),
 Document(page_content='蒂法被《纽约时报》称为“网络一代”的海报女郎，与劳拉·克罗夫特相比，她是电子游戏中坚 强、\U0010fc02立和有吸引力的女性⻆色的典型代表', metadata={'start_index': 420}),
 Document(page_content='媒体普遍称赞
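
The "Created a chunk of size ..., which is longer than the specified 50" warnings above appear because `CharacterTextSplitter` only cuts at the one separator it is given: any sentence longer than `chunk_size` has nowhere else to break and is emitted as is. The sketch below reproduces that behavior in plain Python. It is a simplified illustration rather than the library's code: `split_on_separator` is a hypothetical helper and overlap is left out for brevity.

```python
def split_on_separator(text, separator, chunk_size):
    """Cut only at `separator`, then greedily merge pieces up to chunk_size.

    Returns the chunks plus the sizes of any chunks that exceed chunk_size.
    """
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A piece longer than chunk_size is kept whole: there is no
            # finer separator to fall back on.
            current = piece
    if current:
        chunks.append(current)
    oversized = [len(c) for c in chunks if len(c) > chunk_size]
    return chunks, oversized


sample = "short。" + "x" * 12 + "。tail"
chunks, oversized = split_on_separator(sample, "。", chunk_size=10)
print(chunks)     # ['short', 'xxxxxxxxxxxx', 'tail']
print(oversized)  # [12]
```

If hard size limits matter, `RecursiveCharacterTextSplitter` is usually the better fit, since it retries oversized pieces with progressively finer separators.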

#### Splitting code documents

In [21]:
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    Language,
)

# To list the supported programming languages:
# [e.value for e in Language]

# Python example
PYTHON_CODE = """
    def add(a, b):
        return a + b
    
    # 相减
    def subtract(a, b):
        return a - b
"""
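py_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=50,  # maximum chunk size, measured by the length function
    chunk_overlap=10,  # overlap between adjacent chunks
)
py = py_splitter.create_documents([PYTHON_CODE])
py

The result recorded below has a single chunk with a `start_index` entry, which matches the `CharacterTextSplitter` instance `text_splitter` from the previous cell rather than `py_splitter`: that splitter only breaks on "。", finds none in the code, and returns the snippet whole.

[Document(page_content='def add(a, b):\n        return a + b\n    \n    # 相减\n    def subtract(a, b):\n        return a - b', metadata={'start_index': 5})]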
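
A language-aware splitter works from an ordered list of separators, trying coarse ones (class and function boundaries) before falling back to blank lines, newlines, and spaces. The sketch below shows that fallback in plain Python. It is a simplified illustration, not LangChain's code: `recursive_split` is a hypothetical helper, the separator list is an assumption chosen for this example, and the merging of small pieces and the overlap are omitted.

```python
def recursive_split(text, separators, chunk_size):
    """Split on the first separator; recurse with the remaining, finer
    separators on any piece that is still longer than chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # Nothing finer to try: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    head, rest = separators[0], separators[1:]
    parts = text.split(head)
    # Keep the separator at the start of the piece that follows it,
    # so a "def" keyword stays attached to its function.
    pieces = [parts[0]] + [head + p for p in parts[1:]]
    result = []
    for piece in pieces:
        result.extend(recursive_split(piece, rest, chunk_size))
    return result


code = "def add(a, b):\n    return a + b\n\ndef subtract(a, b):\n    return a - b\n"
# Assumed separator order for this sketch, coarse to fine.
python_separators = ["\nclass ", "\ndef ", "\n\n", "\n", " "]
for chunk in recursive_split(code, python_separators, chunk_size=40):
    print(repr(chunk))
```

With `chunk_size=40` the snippet splits at the `\ndef ` boundary, so each function lands in its own chunk instead of being cut mid-body.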

#### Splitting by token



In [23]:

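from langchain.text_splitter import CharacterTextSplitter

# Load the document to split
with open("test.txt", "r", encoding="utf-8") as f:
    zhhx = f.read()

# Requires the tiktoken package; chunk sizes are counted in tokens, not characters
token_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=4000,  # maximum chunk size, in tokens
    chunk_overlap=30,  # overlap between adjacent chunks, in tokens
)
text = token_splitter.create_documents([zhhx])
text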
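
To make the difference between character budgets and token budgets concrete, the sketch below packs the same sentences twice, once with `len` and once with a token counter. It is an illustration only: `toy_token_count` is a hypothetical stand-in for a real tokenizer such as tiktoken (it counts one token per word or punctuation mark), and `pack` is a hypothetical helper without overlap.

```python
import re


def toy_token_count(text):
    """Hypothetical tokenizer stand-in: one token per word or punctuation mark."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def pack(pieces, chunk_size, length_function=len):
    """Greedily merge pieces while length_function(chunk) stays within chunk_size."""
    chunks, current = [], []
    for piece in pieces:
        candidate = " ".join(current + [piece])
        if current and length_function(candidate) > chunk_size:
            chunks.append(" ".join(current))
            current = [piece]
        else:
            current.append(piece)
    if current:
        chunks.append(" ".join(current))
    return chunks


sentences = ["Tifa runs a bar.", "She fights with gloves.", "Cloud left home."]
print(pack(sentences, chunk_size=10, length_function=toy_token_count))
# ['Tifa runs a bar. She fights with gloves.', 'Cloud left home.']
print(pack(sentences, chunk_size=10, length_function=len))
# ['Tifa runs a bar.', 'She fights with gloves.', 'Cloud left home.']
```

The same `chunk_size=10` yields two chunks when counted in tokens and three when counted in characters, which is why a token-based budget is the safer choice when chunks must fit a model's context window.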

