# Text Splitting Methods

- Author: [Ilgyun Jeong](https://github.com/johnny9210)
- Peer Review: [JoonHo Kim](https://github.com/jhboyo), [Sunyoung Park (architectyou)](https://github.com/Architectyou)
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/07-TextSplitter/03-TokenTextSplitter.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/07-TextSplitter/03-TokenTextSplitter.ipynb)

## Overview
Text splitting is a crucial preprocessing step in Natural Language Processing (NLP). This tutorial covers various text splitting methods and tools, exploring their advantages, disadvantages, and appropriate use cases.

Main approaches to text splitting:

1. **Token-based Splitting**
   - Tiktoken: OpenAI's high-performance BPE tokenizer
   - Hugging Face tokenizers: Tokenizers for various pre-trained models
   
2. **Sentence-based Splitting**
   - SentenceTransformers: Splits text into chunks that fit the token window of a sentence-transformer model
   - NLTK: Natural language processing based sentence and word splitting
   - spaCy: Text splitting utilizing advanced language processing capabilities

3. **Language-specific Tools**
   - KoNLPy: Specialized splitting tool for Korean text processing

Each tool has its unique characteristics and advantages:
- Tiktoken offers fast processing speed and compatibility with OpenAI models
- SentenceTransformers sizes chunks to fit the token window of sentence-transformer models
- NLTK and spaCy split text at sentence boundaries detected by their language-processing tools
- KoNLPy specializes in Korean morphological analysis and splitting

Through this tutorial, you will understand the characteristics of each tool and learn to choose the most suitable text splitting method for your project.

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Using Tiktoken for Text Splitting](#using-tiktoken-for-text-splitting)
- [Example Usage of TokenTextSplitter](#example-usage-of-tokentextsplitter)
- [Example Usage of SentenceTransformers](#example-usage-of-sentencetransformers)
- [Example Usage of NLTK](#example-usage-of-nltk)
- [Example Usage of spaCy](#example-usage-of-spacy)
- [Example Usage of KoNLPy](#example-usage-of-konlpy)
- [Example Usage of Hugging Face tokenizers](#example-usage-of-hugging-face-tokenizers)

### References
- [LangChain: Tiktoken](https://python.langchain.com/api_reference/text_splitters/character/langchain_text_splitters.character.CharacterTextSplitter.html)
- [LangChain: TokenTextSplitter](https://python.langchain.com/api_reference/text_splitters/base/langchain_text_splitters.base.TokenTextSplitter.html)
- [LangChain: SentenceTransformers](https://python.langchain.com/api_reference/text_splitters/sentence_transformers/langchain_text_splitters.sentence_transformers.SentenceTransformersTokenTextSplitter.html)
- [LangChain: NLTK](https://python.langchain.com/api_reference/text_splitters/nltk/langchain_text_splitters.nltk.NLTKTextSplitter.html)
- [LangChain: spaCy](https://python.langchain.com/api_reference/text_splitters/spacy/langchain_text_splitters.spacy.SpacyTextSplitter.html)
- [LangChain: KoNLPy](https://python.langchain.com/api_reference/text_splitters/konlpy/langchain_text_splitters.konlpy.KonlpyTextSplitter.html)
- [LangChain: Hugging Face tokenizers](https://python.langchain.com/api_reference/text_splitters/character/langchain_text_splitters.character.CharacterTextSplitter.html)
- [LangChain: How to split text by tokens](https://python.langchain.com/docs/how_to/split_by_token/)

----

## Environment Setup

Setting up your environment is the first step. See the [Environment Setup](https://wikidocs.net/257836) guide for more details.

**[Note]**
- `langchain-opentutorial` is a package that bundles easy-to-use environment setup guidance, helper functions, and utilities for the tutorials.
- See [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.

In [1]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [2]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_text_splitters",
        "tiktoken",
        "spacy",
        "sentence-transformers",
        "nltk",
        "konlpy",
    ],
    verbose=False,
)

In [3]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "TokenTextSplitter",
    }
)

Environment variables have been set successfully.


Alternatively, you can set and load `OPENAI_API_KEY` from a `.env` file. 

**[Note]** This is only necessary if you haven't already set `OPENAI_API_KEY` in previous steps.

In [4]:
from dotenv import load_dotenv

load_dotenv(override=True)

True

## Using Tiktoken for Text Splitting

`tiktoken` is a fast BPE (Byte Pair Encoding) tokenizer developed by OpenAI. Here's an example demonstrating its use with a text splitter:

1. Open the text file `appendix-keywords.txt` and read its contents. Store this text in a variable named `file`.

In [5]:
# Open data/appendix-keywords.txt and read its contents into the file variable.
with open("./data/appendix-keywords.txt") as f:
    file = f.read()

2. Display some of the content read from the `file`.

In [6]:
# Print a portion of the content read from the file.
print(file[:500])

Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.
Related keywords: embedding, database, vectorization, vectorization

Embedding

Definition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to unders


Using `CharacterTextSplitter` with `tiktoken`:

1. Initialize a text splitter using the `from_tiktoken_encoder` method. This method leverages the `tiktoken` encoder for measurement and merging.

   * `chunk_size=300`: The maximum size of each text chunk, measured in tokens. The splitter tries to create chunks of no more than 300 tokens each. Adjust this value to your needs and your model's token limit.

   * `chunk_overlap=50`: The number of tokens that consecutive chunks may share. Overlap helps preserve context across chunk boundaries, which is especially useful when a concept spans two chunks. With a value of 50, the end of one chunk is repeated at the start of the next, up to about 50 tokens.

Note: The `chunk_size` and `chunk_overlap` parameters appear in every text splitter in this tutorial and follow the same principles, though the unit of measurement (tokens or characters) and the exact implementation vary by splitter.

This combination of character-based splitting with tiktoken-based measurement provides a good balance between processing speed and accurate chunk sizing, particularly when working with models that use tiktoken for tokenization (like OpenAI's models).

In [7]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    # Set the chunk size to 300.
    chunk_size=300,
    # Set the overlap between chunks to 50.
    chunk_overlap=50,
)
# Split the file text into chunks.
texts = text_splitter.split_text(file)

2. Print the number of resulting text chunks after splitting.

In [8]:
print(len(texts))  # Output the number of divided chunks.

10


3. Print the first element of the `texts` list, which holds the split chunks.

In [9]:
# Print the first element of the texts list.
print(texts[0])

Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.
Related keywords: embedding, database, vectorization, vectorization

Embedding

Definition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to understand and process the text.
Example: Represent the word “apple” as a vector such as [0.65, -0.23, 0.17].
Related keywords: natural language processing, vectorization, deep learning

Token

Definition: A token is a breakup of text into smaller units. These can typically be words, sentences, or phrases.
Example: Split the sentence “I am going to school” into “I am”, “to school”, and “going”.
Associated keywords: tokenization, natural language processing, parsing

Tokenizer


**Note**
- With `CharacterTextSplitter.from_tiktoken_encoder`, the text is split only by the `CharacterTextSplitter` separator. The `tiktoken` tokenizer is used just to measure and merge the resulting pieces, so a chunk can still exceed the intended token size when a single piece is longer than `chunk_size`.
- For stricter control, consider `RecursiveCharacterTextSplitter.from_tiktoken_encoder`, which recursively re-splits any piece that exceeds the size limit, or a splitter that works directly on `tiktoken` tokens, such as `TokenTextSplitter`.
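
The difference between the two strategies in the note above can be illustrated with a small, self-contained sketch. It is not LangChain's implementation: `count_tokens` is a hypothetical stand-in that counts whitespace-separated words instead of `tiktoken` tokens, the function names are made up for this example, and overlap is ignored to keep the logic short.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Stand-in length function (assumption): one token per whitespace-separated word."""
    return len(text.split())


def split_then_merge(
    text: str, separator: str, chunk_size: int, length_fn: Callable[[str], int]
) -> List[str]:
    """Split on a single separator, then greedily merge pieces up to chunk_size."""
    chunks: List[str] = []
    current: List[str] = []
    for piece in text.split(separator):
        candidate = separator.join(current + [piece])
        if current and length_fn(candidate) > chunk_size:
            chunks.append(separator.join(current))
            current = [piece]
        else:
            current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


def recursive_split(
    text: str, separators: List[str], chunk_size: int, length_fn: Callable[[str], int]
) -> List[str]:
    """Same merge step, but oversized chunks are re-split with the next, finer separator."""
    if length_fn(text) <= chunk_size or not separators:
        return [text]
    result: List[str] = []
    for chunk in split_then_merge(text, separators[0], chunk_size, length_fn):
        if length_fn(chunk) > chunk_size:
            result.extend(recursive_split(chunk, separators[1:], chunk_size, length_fn))
        else:
            result.append(chunk)
    return result


sample = "one two three\n\nfour five six seven eight nine ten\n\neleven twelve"

merged = split_then_merge(sample, "\n\n", chunk_size=5, length_fn=count_tokens)
recursive = recursive_split(sample, ["\n\n", " "], chunk_size=5, length_fn=count_tokens)

print([count_tokens(c) for c in merged])     # the middle paragraph stays oversized
print([count_tokens(c) for c in recursive])  # every chunk respects the limit
```

In this toy run, the single-separator version keeps the seven-word paragraph as one chunk even though the limit is five, while the recursive version breaks it down further.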

## Example Usage of TokenTextSplitter
This section will cover using the `TokenTextSplitter` class to split text into chunks based on tokens.
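
As a rough mental model, splitting directly on tokens amounts to sliding a fixed-size window over the token sequence. The sketch below is a simplified illustration and not the library's code: `token_windows` is a hypothetical helper, and the tokens are placeholder strings, not real `tiktoken` ids.

```python
from typing import List


def token_windows(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows: List[List[str]] = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        windows.append(tokens[start:end])
        if end == len(tokens):
            break  # the last window reached the end of the sequence
        start += step
    return windows


# Placeholder tokens t0..t9 (assumption: a real tokenizer would produce integer ids).
tokens = [f"t{i}" for i in range(10)]
windows = token_windows(tokens, chunk_size=4, chunk_overlap=1)
for window in windows:
    print(window)
```

Because the window never grows past `chunk_size`, every chunk respects the token limit, at the cost of possibly cutting a chunk in the middle of a word or sentence.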

In [10]:
from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(
    chunk_size=300,  # Set the chunk size to 300.
    chunk_overlap=50,  # Set the overlap between chunks to 50.
)

# Split the file text into chunks.
texts = text_splitter.split_text(file)
print(texts[0])  # Print the first chunk of the divided text.

Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.
Related keywords: embedding, database, vectorization, vectorization

Embedding

Definition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to understand and process the text.
Example: Represent the word “apple” as a vector such as [0.65, -0.23, 0.17].
Related keywords: natural language processing, vectorization, deep learning

Token

Definition: A token is a breakup of text into smaller units. These can typically be words, sentences, or phrases.
Example: Split the sentence “I am going to school” into “I am”, “to school”, and “going”.
Associated keywords: tokenization, natural language processing, parsing

Tokenizer

Definition: A tokenizer 

## Example Usage of SentenceTransformers

`SentenceTransformersTokenTextSplitter` is a specialized splitter designed for `sentence-transformer` models. It automatically splits text into chunks that fit within the token window of the sentence-transformer model being used.

1. Initialize a text splitter using the `SentenceTransformersTokenTextSplitter` class.

In [11]:
from langchain_text_splitters import SentenceTransformersTokenTextSplitter

# Create a token-based splitter for sentence-transformer models, with an overlap of 50 between chunks.
splitter = SentenceTransformersTokenTextSplitter(chunk_size=300, chunk_overlap=50)

2. Inspect the sample text.

In [12]:
# Open the data/appendix-keywords.txt file and create a file object named f.
with open("./data/appendix-keywords.txt") as f:
    file = f.read()  # Read the file content and store it in the variable file.

# Print a portion of the content read from the file.
print(file[:350])

Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.
Related keywords: embedding, database, vectorization, vectorization

Embedding

Definition: Embed


3. Calculate the number of tokens (excluding start and stop tokens) in the `file` variable, and print the result.

In [13]:
count_start_and_stop_tokens = 2  # Set the number of start and stop tokens to 2.

# Subtract the count of start and stop tokens from the total number of tokens in the text.
text_token_count = splitter.count_tokens(text=file) - count_start_and_stop_tokens
print(text_token_count)  # Print the calculated number of tokens in the text.

2231


4. Use the `splitter.split_text()` method to split the text stored in `file` into chunks.

In [14]:
text_chunks = splitter.split_text(text=file)  # Split the text into chunks.

5. Print the second chunk using `print(text_chunks[1])`.

In [15]:
# Print the second chunk (index 1) of the split text.
print(text_chunks[1])

a database for quick access. related keywords : embedding, database, vectorization, vectorization sql definition : sql ( structured query language ) is a programming language for managing data in a database. you can query, modify, insert, delete, and more data. example : select * from users where age > 18 ; looks up information about users who are 18 years old or older. associated keywords : database, query, data management, data management csv definition : csv ( comma - separated values ) is a file format for storing data, where each data value is separated by a comma. it is used for simple storage and exchange of tabular data. example : a csv file with the headers name, age, and occupation might contain data such as hong gil - dong, 30, developer. related keywords : data format, file processing, data exchange json definition : json ( javascript object notation ) is a lightweight data interchange format that represents data objects using text that is readable to both humans and machin
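
Step 3 above subtracts 2 from the token count because the tokenizer wraps the encoded text in a start token and a stop token. The toy sketch below shows why that subtraction is needed. `ToyEncoder` and its special-token ids are hypothetical and only imitate the wrapping behavior; a real sentence-transformer tokenizer uses its own vocabulary and ids.

```python
from typing import Dict, List

# Hypothetical special-token ids; real tokenizers define their own.
START_ID, STOP_ID = 0, 1


class ToyEncoder:
    """Stand-in tokenizer: one id per lowercase word, wrapped in start/stop ids."""

    def __init__(self) -> None:
        self.vocab: Dict[str, int] = {}

    def encode(self, text: str) -> List[int]:
        # Assign a new id to each unseen word; ids 0 and 1 are reserved.
        body = [self.vocab.setdefault(w, len(self.vocab) + 2) for w in text.lower().split()]
        return [START_ID] + body + [STOP_ID]


def count_content_tokens(encoder: ToyEncoder, text: str) -> int:
    """Count tokens without the start and stop markers added by encode()."""
    return len(encoder.encode(text)) - 2


encoder = ToyEncoder()
sample = "Embeddings map text to vectors"
total = len(encoder.encode(sample))
content = count_content_tokens(encoder, sample)
print(total, content)
```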

## Example Usage of NLTK

The Natural Language Toolkit (NLTK) is a Python library for natural language processing (NLP) tasks. It supports various NLP tasks like text preprocessing, tokenization, morphological analysis, and part-of-speech tagging.

Here's how to use NLTK tokenizers for text splitting, offering an alternative to splitting by newlines (`\n\n`).
- Splitting method: NLTK tokenizer
- The chunk size is determined by the number of characters.

1. Before using NLTK, you need to download the necessary data files.

In [16]:
import nltk

nltk.download("punkt_tab")

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/ilgyun/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

Downloading `punkt_tab` enables NLTK to tokenize text into words or sentences for multiple languages, including English.

2. Repeat the process of opening `appendix-keywords.txt`, reading its contents, and storing the text in the `file` variable.

In [17]:
# Open data/appendix-keywords.txt and read its contents into the file variable.
with open("./data/appendix-keywords.txt") as f:
    file = f.read()

# Print a portion of the content read from the file.
print(file[:350])

Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.
Related keywords: embedding, database, vectorization, vectorization

Embedding

Definition: Embed


3. Create a text splitter using the `NLTKTextSplitter` class.
4. Set the `chunk_size` parameter to 300 (or any desired value) to control the maximum chunk size in characters.

In [18]:
from langchain_text_splitters import NLTKTextSplitter

text_splitter = NLTKTextSplitter(
    chunk_size=300,  # Set the chunk size to 300.
    chunk_overlap=50,  # Set the overlap between chunks to 50.
)

5. Utilize the `split_text` method of the `text_splitter` object to split the text stored in `file`.

In [19]:
# Split the file text using the text_splitter.
texts = text_splitter.split_text(file)
print(texts[0])  # Print the first element of the split text.

Semantic Search

Definition: A vector store is a system that stores data converted to vector format.

It is used for search, classification, and other data analysis tasks.

Example: Vectors of word embeddings can be stored in a database for quick access.
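
The output above shows whole sentences joined by blank lines. The sketch below illustrates that idea under simplifying assumptions: a naive regular expression stands in for the NLTK sentence tokenizer, lengths are counted in characters, and `pack_sentences` is a hypothetical helper, not the library's merging logic.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive stand-in for a sentence tokenizer: break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def pack_sentences(
    sentences: List[str], chunk_size: int, chunk_overlap: int, separator: str = "\n\n"
) -> List[str]:
    """Join whole sentences into chunks of at most chunk_size characters where possible."""
    chunks: List[str] = []
    current: List[str] = []
    for sentence in sentences:
        if current and len(separator.join(current + [sentence])) > chunk_size:
            chunks.append(separator.join(current))
            # Keep trailing sentences as overlap while they fit within chunk_overlap
            # and still leave room for the incoming sentence.
            while current and (
                len(separator.join(current)) > chunk_overlap
                or len(separator.join(current + [sentence])) > chunk_size
            ):
                current.pop(0)
        current.append(sentence)
    if current:
        chunks.append(separator.join(current))
    return chunks


text = (
    "A vector store keeps data in vector form. "
    "It is used for search and classification. "
    "Embeddings can be stored for quick access."
)
sentences = split_sentences(text)
chunks = pack_sentences(sentences, chunk_size=90, chunk_overlap=45)
for chunk in chunks:
    print(repr(chunk))
```

Here the overlap is a whole sentence: the second sentence closes the first chunk and opens the second one, so no chunk starts or ends mid-sentence.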


## Example Usage of spaCy

spaCy is an open-source library for advanced NLP, written in Python and Cython.

Like NLTK, spaCy also provides an alternative to basic newline splitting (`\n\n`).
- Splitting method: spaCy's tokenizer
- The chunk size is measured by the number of characters.

1. To split text with spaCy, you need to download the `en_core_web_sm` spaCy model for English.

In [20]:
!python -m spacy download en_core_web_sm --quiet

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


2. Repeat the process of opening `appendix-keywords.txt`, reading its contents, and storing the text in the `file` variable.

In [21]:
# Open data/appendix-keywords.txt and read its contents into the file variable.
with open("./data/appendix-keywords.txt") as f:
    file = f.read()

# Print a portion of the content read from the file.
print(file[:350])

Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.
Related keywords: embedding, database, vectorization, vectorization

Embedding

Definition: Embed


3. Create a text splitter using the `SpacyTextSplitter` class.
4. Set the `chunk_size` parameter to 300 (or any desired value) to control the maximum chunk size in characters.

In [22]:
import warnings
from langchain_text_splitters import SpacyTextSplitter

# Ignore warning messages.
warnings.filterwarnings("ignore")

# Create the SpacyTextSplitter.
text_splitter = SpacyTextSplitter(
    chunk_size=300,  # Set the chunk size to 300.
    chunk_overlap=50,  # Set the overlap between chunks to 50.
)

5. Use the `split_text` method of the `text_splitter` object to split the `file` text.

In [23]:
# Split the file text using the text_splitter.
texts = text_splitter.split_text(file)
print(texts[0])  # Print the first element of the split text.

Semantic Search

Definition: A vector store is a system that stores data converted to vector format.

It is used for search, classification, and other data analysis tasks.


Example: Vectors of word embeddings can be stored in a database for quick access.


## Example Usage of KoNLPy

As described in LangChain's [how-to guide](https://python.langchain.com/docs/how_to/split_by_token/#konlpy), `KonlpyTextSplitter` is a dedicated splitter for Korean text. It builds on KoNLPy, which offers useful features for morphological analysis, part-of-speech tagging, and syntactic parsing.

1. Because this example processes Korean, first load a Korean text to split.

In [24]:
# Open the data/appendix-keywords-korean.txt file and create a file object named f.
with open("./data/appendix-keywords-korean.txt") as f:
    file = f.read()  # Read the file content and store it in the variable file.

# Print a portion of the content read from the file.
print(file[:350])

Semantic Search

정의: 의미론적 검색은 사용자의 질의를 단순한 키워드 매칭을 넘어서 그 의미를 파악하여 관련된 결과를 반환하는 검색 방식입니다.
예시: 사용자가 "태양계 행성"이라고 검색하면, "목성", "화성" 등과 같이 관련된 행성에 대한 정보를 반환합니다.
연관키워드: 자연어 처리, 검색 알고리즘, 데이터 마이닝

Embedding

정의: 임베딩은 단어나 문장 같은 텍스트 데이터를 저차원의 연속적인 벡터로 변환하는 과정입니다. 이를 통해 컴퓨터가 텍스트를 이해하고 처리할 수 있게 합니다.
예시: "사과"라는 단어를 [0.65, -0.23, 0.17]과 같은 벡터로 표현합니다.
연관키워드: 자연어 처


2. Create a text splitter using the `KonlpyTextSplitter` class.

In [25]:
from langchain_text_splitters import KonlpyTextSplitter

# Create a text splitter object using KonlpyTextSplitter.
text_splitter = KonlpyTextSplitter()

3. Use the `text_splitter` to split the `file` content into sentences.

In [26]:
texts = text_splitter.split_text(file)  # Split the file content into sentences.
print(texts[0][:350])  # Print the first 350 characters of the first chunk.

Semantic Search 정의: 의미론적 검색은 사용자의 질의를 단순한 키워드 매칭을 넘어서 그 의미를 파악하여 관련된 결과를 반환하는 검색 방식입니다.

예시: 사용자가 " 태양계 행성" 이라고 검색하면, " 목성", " 화 성" 등과 같이 관련된 행성에 대한 정보를 반환합니다.

연관 키워드: 자연어 처리, 검색 알고리즘, 데이터 마이닝 Embedding 정의: 임베딩은 단어나 문장 같은 텍스트 데이터를 저차원의 연속적인 벡터로 변환하는 과정입니다.

이를 통해 컴퓨터가 텍스트를 이해하고 처리할 수 있게 합니다.

예시: " 사과" 라는 단어를 [0.65, -0.23, 0.17] 과 같은 벡터로 표현합니다.

연


## Example Usage of Hugging Face tokenizers

Hugging Face provides various tokenizers.

This tutorial demonstrates calculating the token length of a text using one of Hugging Face's tokenizers, `GPT2TokenizerFast`.
- Splitting method: `CharacterTextSplitter`, with chunk lengths measured by Hugging Face's `GPT2TokenizerFast`

**[Note]**
- The chunk size is based on the number of tokens calculated by the Hugging Face tokenizer, not the number of characters.
- A `tokenizer` object is created using the `GPT2TokenizerFast` class.
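
To see why it matters whether length is counted in characters or tokens, consider the sketch below. It assumes a hypothetical `ToyTokenizer` with a `tokenize` method that splits on words and punctuation, which only approximates how a real subword tokenizer such as GPT-2's behaves; `make_length_function` is likewise an illustrative helper, not a LangChain API.

```python
import re
from typing import Callable, List


class ToyTokenizer:
    """Hypothetical stand-in for a Hugging Face tokenizer: words and punctuation marks."""

    def tokenize(self, text: str) -> List[str]:
        return re.findall(r"\w+|[^\w\s]", text)


def make_length_function(tokenizer: ToyTokenizer) -> Callable[[str], int]:
    """Build a length function that measures text in tokens instead of characters."""
    return lambda text: len(tokenizer.tokenize(text))


token_length = make_length_function(ToyTokenizer())

text = "Split the sentence into tokens, then count them."
char_count = len(text)
token_count = token_length(text)
print(char_count, token_count)
```

The same text is several times longer in characters than in tokens, so a limit such as `chunk_size=300` admits far more text when it is interpreted as a token count.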

1. Call the `from_pretrained` method to load the pre-trained `gpt2` tokenizer.

In [27]:
from transformers import GPT2TokenizerFast

# Load the GPT-2 tokenizer.
hf_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

2. Repeat the process of opening `appendix-keywords.txt`, reading its contents, and storing the text in the `file` variable.

In [28]:
# Open the data/appendix-keywords.txt file and create a file object named f.
with open("./data/appendix-keywords.txt") as f:
    file = f.read()  # Read the file content and store it in the variable file.

# Print a portion of the content read from the file.
print(file[:350])

Semantic Search

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.
Related keywords: embedding, database, vectorization, vectorization

Embedding

Definition: Embed


3. Create a text splitter using the `from_huggingface_tokenizer` method.

In [29]:
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    # Use the Hugging Face tokenizer to create a CharacterTextSplitter object.
    hf_tokenizer,
    chunk_size=300,
    chunk_overlap=50,
)
# Split the file text into chunks.
texts = text_splitter.split_text(file)

4. Check the split result of the second element.

In [30]:
print(texts[1])  # Print the second element of the texts list.

Tokenizer

Definition: A tokenizer is a tool that splits text data into tokens. It is used to preprocess data in natural language processing.
Example: Split the sentence “I love programming.” into [“I”, “love”, “programming”, “.”].
Associated keywords: tokenization, natural language processing, parsing

VectorStore

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.
Related keywords: embedding, database, vectorization, vectorization

SQL

Definition: SQL(Structured Query Language) is a programming language for managing data in a database. You can query, modify, insert, delete, and more data.
Example: SELECT * FROM users WHERE age > 18; looks up information about users who are 18 years old or older.
Associated keywords: database, query, data management, data management

CSV
