# LangChain Text Chunking Demo (Standalone)
This notebook reproduces the core logic of the DAG's `chunk_text` task without any Airflow dependencies. It covers:
1. Creating a sample DataFrame of documents.
2. Using `RecursiveCharacterTextSplitter` to chunk text.
3. Converting chunks into a normalized (exploded) DataFrame.
4. A reusable `chunk_text` utility function.
5. Comparing different splitter parameters and reconstructing the original text.

NOTE: Newer LangChain versions may require importing from `langchain_text_splitters` instead of `langchain.text_splitter`. If you see import errors, try:
`from langchain_text_splitters import RecursiveCharacterTextSplitter`.
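
The import note above can also be handled programmatically. The following is a minimal sketch, not part of the original notebook: the helper name `load_splitter_class` is hypothetical, and it only tries the two module paths mentioned in the note, returning `(None, None)` if neither is importable.

```python
import importlib

# Candidate modules, newest layout first. Both names come from the note above.
CANDIDATE_MODULES = ("langchain_text_splitters", "langchain.text_splitter")


def load_splitter_class(candidates=CANDIDATE_MODULES):
    """Return (class, module_name), or (None, None) if no candidate imports.

    Hypothetical helper for illustration; it is not a LangChain API.
    """
    for module_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # Module (or its parent package) is not installed; try the next one.
            continue
        splitter_cls = getattr(module, "RecursiveCharacterTextSplitter", None)
        if splitter_cls is not None:
            return splitter_cls, module_name
    return None, None


splitter_cls, source_module = load_splitter_class()
print(source_module or "no LangChain text splitter installed")
```

Trying the newer package first avoids the deprecation path on recent installs while still working on older ones.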


In [1]:
# Import Dependencies
import pandas as pd
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document

# If the above imports fail, uncomment the alternative import paths below (for newer LangChain versions):
# from langchain_text_splitters import RecursiveCharacterTextSplitter
# from langchain_core.documents import Document  # alternative core Document class


In [2]:
# Create Sample DataFrame
data = {
    'title': ['doc1', 'doc2'],
    'text': [
        'LangChain provides a standard interface for chains, supports serialization, and offers a flexible approach for building applications.',
        'Chunking long text into smaller pieces helps vector databases store and retrieve semantically relevant passages efficiently.'
    ]
}
df = pd.DataFrame(data)
df

Unnamed: 0,title,text
0,doc1,LangChain provides a standard interface for ch...
1,doc2,Chunking long text into smaller pieces helps v...


In [3]:
# Initialize RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=40,        # max chars per chunk
    chunk_overlap=20,     # overlap between consecutive chunks
    separators=['\n\n', '\n', ' ', '']  # fallback split hierarchy
)
print(splitter)

<langchain_text_splitters.character.RecursiveCharacterTextSplitter object at 0x15e12e510>
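
To build intuition for how `chunk_size` and `chunk_overlap` interact, here is a simplified pure-Python sketch. It is not LangChain's implementation: the function `split_with_overlap` is hypothetical, it splits only on single spaces (ignoring the rest of the `separators` hierarchy), and its chunk boundaries may differ from those the real splitter produces.

```python
def split_with_overlap(text, chunk_size=40, chunk_overlap=20):
    """Greedy word packing with overlap (illustrative approximation only).

    Words are appended until the next one would push the chunk past
    chunk_size. The chunk is then emitted, and trailing words totalling at
    most chunk_overlap characters are carried into the next chunk.
    A single word longer than chunk_size is emitted as an oversized chunk.
    """
    def joined_len(words):
        return len(" ".join(words))

    chunks = []
    current = []  # words of the chunk currently being built
    for word in text.split(" "):
        if current and joined_len(current + [word]) > chunk_size:
            chunks.append(" ".join(current))
            # Drop leading words until the carried-over tail fits within the
            # overlap budget and leaves room for the incoming word.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [word]) > chunk_size
            ):
                current.pop(0)
        current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = (
    "LangChain provides a standard interface for chains, supports "
    "serialization, and offers a flexible approach for building applications."
)
for chunk in split_with_overlap(sample, chunk_size=40, chunk_overlap=20):
    print(len(chunk), repr(chunk))
```

A larger `chunk_overlap` repeats more context across neighbouring chunks at the cost of producing more chunks overall.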


In [4]:
# Wrap Rows as LangChain Documents
documents = [
    Document(page_content=row.text, metadata={'title': row.title})
    for row in df.itertuples()
]
len(documents), documents[0]

(2,
 Document(page_content='LangChain provides a standard interface for chains, supports serialization, and offers a flexible approach for building applications.', metadata={'title': 'doc1'}))

In [5]:
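# Chunk Each Row and Explode to One Row per Chunk
# Split every text into a list of chunk Documents.
df["chunks"] = df["text"].apply(
    lambda x: splitter.split_documents([Document(page_content=x)])
)
display(df.head())
# One row per chunk; a row whose chunk list is empty becomes NaN.
df = df.explode("chunks", ignore_index=True)
display(df.head())
# Drop rows that produced no chunks.
df.dropna(subset=["chunks"], inplace=True)
display(df.head())
# Replace the full text with the chunk's own text.
df["text"] = df["chunks"].apply(lambda x: x.page_content)
display(df.head())
# Remove the helper column and renumber the rows.
df.drop(["chunks"], inplace=True, axis=1)
df.reset_index(inplace=True, drop=True)
display(df.head())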

Unnamed: 0,title,text,chunks
0,doc1,LangChain provides a standard interface for ch...,[page_content='LangChain provides a standard i...
1,doc2,Chunking long text into smaller pieces helps v...,[page_content='Chunking long text into smaller...


Unnamed: 0,title,text,chunks
0,doc1,LangChain provides a standard interface for ch...,page_content='LangChain provides a standard in...
1,doc1,LangChain provides a standard interface for ch...,"page_content='standard interface for chains, s..."
2,doc1,LangChain provides a standard interface for ch...,"page_content='chains, supports serialization, ..."
3,doc1,LangChain provides a standard interface for ch...,"page_content='serialization, and offers a flex..."
4,doc1,LangChain provides a standard interface for ch...,page_content='offers a flexible approach for b...


Unnamed: 0,title,text,chunks
0,doc1,LangChain provides a standard interface for ch...,page_content='LangChain provides a standard in...
1,doc1,LangChain provides a standard interface for ch...,"page_content='standard interface for chains, s..."
2,doc1,LangChain provides a standard interface for ch...,"page_content='chains, supports serialization, ..."
3,doc1,LangChain provides a standard interface for ch...,"page_content='serialization, and offers a flex..."
4,doc1,LangChain provides a standard interface for ch...,page_content='offers a flexible approach for b...


Unnamed: 0,title,text,chunks
0,doc1,LangChain provides a standard interface,page_content='LangChain provides a standard in...
1,doc1,"standard interface for chains, supports","page_content='standard interface for chains, s..."
2,doc1,"chains, supports serialization, and","page_content='chains, supports serialization, ..."
3,doc1,"serialization, and offers a flexible","page_content='serialization, and offers a flex..."
4,doc1,offers a flexible approach for building,page_content='offers a flexible approach for b...


Unnamed: 0,title,text
0,doc1,LangChain provides a standard interface
1,doc1,"standard interface for chains, supports"
2,doc1,"chains, supports serialization, and"
3,doc1,"serialization, and offers a flexible"
4,doc1,offers a flexible approach for building
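
The explode-and-drop sequence above amounts to turning one row per document into one row per chunk. The sketch below shows the same reshaping in plain Python, without pandas. The function `explode_rows` is hypothetical, and the word-level `str.split` splitter is only a stand-in for the real text splitter.

```python
def explode_rows(rows, split):
    """Turn one dict per document into one dict per chunk.

    rows:  list of {'title': ..., 'text': ...} dicts
    split: any callable mapping a string to a list of chunk strings
    """
    exploded = []
    for row in rows:
        for chunk in split(row["text"]):
            # The document's metadata (title) is repeated on every chunk row.
            exploded.append({"title": row["title"], "text": chunk})
    return exploded


rows = [
    {"title": "doc1", "text": "alpha beta gamma"},
    {"title": "doc2", "text": ""},  # yields no chunks, so no output rows
]
# Stand-in splitter for illustration only: one chunk per word.
for record in explode_rows(rows, str.split):
    print(record)
```

A document that yields no chunks simply contributes no rows, which matches the effect of `explode` followed by `dropna` in the cell above.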


In [6]:
df

Unnamed: 0,title,text
0,doc1,LangChain provides a standard interface
1,doc1,"standard interface for chains, supports"
2,doc1,"chains, supports serialization, and"
3,doc1,"serialization, and offers a flexible"
4,doc1,offers a flexible approach for building
5,doc1,for building applications.
6,doc2,Chunking long text into smaller pieces
7,doc2,into smaller pieces helps vector
8,doc2,pieces helps vector databases store and
9,doc2,databases store and retrieve
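
As a sanity check on the `doc1` rows above, overlapping chunks can be stitched back into the original sentence. This is one possible sketch rather than a LangChain feature: `merge_overlapping` is a hypothetical helper that merges on the longest suffix/prefix match, and it assumes a single space belongs between chunks that share no overlap. Coincidental matches (for example a repeated word at a boundary) could merge incorrectly.

```python
def merge_overlapping(chunks):
    """Rebuild text from ordered chunks by removing duplicated overlap."""
    if not chunks:
        return ""
    text = chunks[0]
    for chunk in chunks[1:]:
        overlap = 0
        # Find the longest prefix of `chunk` that is also a suffix of `text`.
        for k in range(min(len(text), len(chunk)), 0, -1):
            if text.endswith(chunk[:k]):
                overlap = k
                break
        # Assumption: chunks with no shared text were separated by a space.
        separator = "" if overlap else " "
        text += separator + chunk[overlap:]
    return text


# The six doc1 chunks, copied from the DataFrame output above.
doc1_chunks = [
    "LangChain provides a standard interface",
    "standard interface for chains, supports",
    "chains, supports serialization, and",
    "serialization, and offers a flexible",
    "offers a flexible approach for building",
    "for building applications.",
]
print(merge_overlapping(doc1_chunks))
```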


In [7]:
df.dtypes

title    object
text     object
dtype: object