# Examples of chunking text

This section shows how to use a "text splitter" to break a large piece of text into smaller, manageable chunks. This is particularly useful for long documents, like Dr. Martin Luther King Jr.'s "I Have a Dream" speech, which are unwieldy to process all at once.

Imagine you have a big cake, and you want to share it with your friends. Instead of giving them the whole cake at once, you cut it into slices. Each slice is easier to handle and enjoy. Similarly, a text splitter takes a long text and divides it into smaller sections, or "chunks," making it easier to analyze or work with. In the first example below, the chunk size is set to 100 characters with an overlap of 20 characters, so that context sitting on a boundary between two chunks is not lost.
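To make the two settings concrete, here is a minimal sketch in plain Python. It is not how LangChain splits text; `chunk_fixed` is a hypothetical helper that cuts at fixed character positions, purely to show how `chunk_size` and `chunk_overlap` interact.

```python
def chunk_fixed(text: str, chunk_size: int, chunk_overlap: int) -> list[tuple[int, str]]:
    """Cut text into windows of chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighbours share chunk_overlap characters.
    Returns (start_index, chunk) pairs.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


for start, chunk in chunk_fixed("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3):
    print(start, chunk)
# 0 abcdefghij
# 7 hijklmnopq
# 14 opqrstuvwx
# 21 vwxyz
```

With the settings used in this section (100 and 20), such a window would advance 80 characters at a time. Real splitters usually prefer to cut at separators such as paragraph breaks or spaces, so their chunks rarely line up this neatly.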

In [None]:
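from langchain.text_splitter import CharacterTextSplitter

# Step 1: Load the input text from the specified file path
file_path = "C:\\Python\\Agent-School\\docs\\i-have-a-dream.txt"

with open(file_path, encoding="utf-8") as file:
    speech = file.read()

# Step 2: Split the text into manageable chunks
# Note: CharacterTextSplitter only cuts at its separator (a blank line, "\n\n",
# by default), so a paragraph longer than chunk_size is kept whole and the
# resulting chunk can exceed 100 characters.
text_splitter = CharacterTextSplitter(
    chunk_size=100,       # Target maximum characters per chunk
    chunk_overlap=20,     # Characters shared between neighbouring chunks
    length_function=len,  # Measure chunk size in characters
)

# Step 3: Wrap each chunk in a LangChain Document and preview the first one
documents = text_splitter.create_documents([speech])
print("First document chunk:\n", documents[0], "\n")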


The next cell splits the same speech into smaller chunks using the `RecursiveCharacterTextSplitter` from the LangChain library. Here's a summary of what it does:

1. **Load a Speech Text File**: 
   - It reads the full text of Martin Luther King Jr.'s "I Have a Dream" speech from a file located on your computer.

2. **Set Up the Text Splitter**:
   - A `RecursiveCharacterTextSplitter` object is created with specific parameters:
     - **`chunk_size`**: The maximum number of characters per chunk is set to 40.
     - **`chunk_overlap`**: Chunks overlap by 12 characters to maintain context between them.
     - **`length_function`**: Measures the size of chunks based on character count.
     - **`add_start_index`**: Includes the starting index of each chunk for reference.

3. **Create Document Objects**:
   - The speech text is divided into chunks (following the splitter's configuration), and these chunks are stored as LangChain Document objects.

4. **Optional Debug Output**:
   - It prints the number of chunks created and previews the first two chunks for verification.

5. **Additional Text Splitting Example**:
   - A standalone string ("Python can be easy to pick up...") is split using the same splitter to illustrate its usage on general text.


The last part illustrates nicely what happens when you are chunking: the sentence is split into short, overlapping chunks of a few words each.
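The word-level behaviour can be sketched in plain Python. This is a simplified stand-in, not LangChain's implementation: the hypothetical `pack_words` helper only splits on single spaces, whereas `RecursiveCharacterTextSplitter` works through several separators. It assumes the same settings as the cell below (`chunk_size=40`, `chunk_overlap=12`) and, for this particular sentence, yields the same three chunks that appear in the cell output.

```python
def pack_words(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Greedily pack space-separated words into chunks of at most
    chunk_size characters, carrying trailing words of the previous chunk
    (at most chunk_overlap characters) into the next one."""
    chunks: list[str] = []
    current: list[str] = []  # words of the chunk being built
    total = 0                # length of " ".join(current)

    for word in text.split(" "):
        sep = 1 if current else 0
        if current and total + sep + len(word) > chunk_size:
            # The next word does not fit: emit the chunk built so far.
            chunks.append(" ".join(current))
            # Drop words from the front until the remaining tail fits the
            # overlap budget and leaves room for the next word.
            while current and (
                total > chunk_overlap or total + 1 + len(word) > chunk_size
            ):
                total -= len(current[0]) + (1 if len(current) > 1 else 0)
                current.pop(0)
        total += len(word) + (1 if current else 0)
        current.append(word)

    if current:
        chunks.append(" ".join(current))
    return chunks


sentence = "Python can be easy to pick up whether you're a professional or a beginner."
for chunk in pack_words(sentence, chunk_size=40, chunk_overlap=12):
    print(repr(chunk))
```

Because whole words are carried over, the overlap is at most 12 characters rather than exactly 12, and a single word longer than `chunk_size` would be kept intact in this sketch.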

In [13]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

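# Step 1: Load the raw speech text from the specified file path
# Note: if the file starts with a UTF-8 byte-order mark, encoding="utf-8"
# keeps it as the first character of the text; "utf-8-sig" would strip it.
file_path = "C:\\Python\\Agent-School\\docs\\i-have-a-dream.txt"
with open(file_path, encoding="utf-8") as file:
    speech = file.read()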

# Step 2: Split the text using RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=40,         # Max characters per chunk
    chunk_overlap=12,      # Overlap between chunks
    length_function=len,   # Use length of string to measure size
    add_start_index=True   # Track start index of each chunk in the original text
)

# Step 3: Create LangChain Document objects from the full speech
docs = text_splitter.create_documents([speech])

# (Optional debug output)
print(len(docs))
print(f"Doc 1: {docs[0]}")
print(f"Doc 2: {docs[1]}")

# Step 4: You can also split any standalone string using the same splitter
s = "Python can be easy to pick up whether you're a professional or a beginner."
text = text_splitter.split_text(s)
print(text)


392
Doc 1: page_content='﻿As far as black Americans were' metadata={'start_index': 0}
Doc 2: page_content='were concerned, the nation’s response' metadata={'start_index': 27}
['Python can be easy to pick up whether', "up whether you're a professional or a", 'or a beginner.']
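The overlap in the last output line can be checked directly. The sketch below uses a hypothetical helper, `shared_overlap`, that finds the longest run of characters ending one chunk and starting the next; the chunk list is copied from the output above.

```python
def shared_overlap(left: str, right: str) -> str:
    """Return the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""


chunks = [
    "Python can be easy to pick up whether",
    "up whether you're a professional or a",
    "or a beginner.",
]

for left, right in zip(chunks, chunks[1:]):
    overlap = shared_overlap(left, right)
    print(repr(overlap), len(overlap))
# 'up whether' 10
# 'or a' 4
```

Both overlaps are shorter than the configured 12 characters. In this output the chunks begin and end on whole words, so `chunk_overlap` acts as an upper bound here rather than an exact length.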
