In [None]:
'''
# Data -> pdf -> 100 pages -> text, tables, Images (embeddings) -> vectorDB

# text --- "My name is Netra" [0.1, 0.2, 0.5, 0.4] --> VectorDB --> Document + embedding are stored in the VectorDB.

# Document contains:
    - metadata
    - page_content
    - embedding

# When to perform chunking:
    1. When the model's context window is small.
    2. When retrieval needs to be fast.
    3. When computational resources are limited.
    4. When the representation quality of the data needs to improve.

# ✅ What is Chunking?
- Chunking = Splitting large documents (text, PDFs, webpages) into smaller parts or “chunks” before generating embeddings.

- For example:
    a. A 100-page PDF ➝ split into 500 small chunks (paragraphs, sentences, etc.)
    b. Each chunk ➝ converted to vector (embedding) ➝ stored in a vector database.

# 🔍 Why and When to Perform Chunking?
    1. 📏 Model Context Limitations
        - 🧠 Chunking ensures that each piece fits into the model's max token size.

    2. 🚀 Faster and More Accurate Retrieval
        - “Fast Retrieval”

        - Smaller chunks = finer-grained search.
        - When a user asks a question, the retriever can fetch only the most relevant chunk, not an entire page or document.
        - This increases accuracy and relevance of responses from the LLM.
        - 📌 Example: If you store full pages or full chapters, your retrieval may return too much irrelevant data.

    3. 🧮 Optimizing Computational Resources
        - “Computational Resources”

        - Embedding large chunks uses more memory and time.
        - Embedding smaller chunks is faster and easier to batch-process.
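        - Vector search cost depends on the number and dimension of the vectors, not on chunk text length, so smaller chunks mean more vectors to index; the gain is that each entry is semantically tight.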
    
    4. 🎯 Improving Semantic Representation
        - “Improve the representation quality”

        - Large chunks often contain mixed topics, reducing embedding precision.
        - Chunking helps isolate semantically consistent units, improving retrieval performance.
        - Better representation = better search results.

# Why do we need chunking?
    1. Model limitations (context window).
    2. Handle non-uniform document length.
    3. Improve the representation of the data.
    4. Control retrieval cost, since chunk size can be tuned.
    5. Optimize computational resources.

# chunk_size is a hyperparameter: there is no single correct value; it depends on the data and the model, so tune it empirically.

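# APPROACHES:
    1. Length-based approach.
        - Count length in tokens or characters.
        - Token based (model-specific units, roughly word-sized) or character based (individual characters).

    2. Text-structure-based approach.
        - Paragraph, sentence.

    3. Document-structure-based approach.
        - MARKDOWN, HTML
        - CODE (PROGRAMMING LANGUAGES)
        - JSON

    4. Semantic-based approach.
        - Split where the meaning shifts, e.g. where embedding similarity between adjacent sentences drops.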
'''
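
The length-based approach from the notes above can be sketched in plain Python. This is a minimal illustration, not a library implementation: `sliding_windows`, `chunk_by_characters`, and `chunk_by_words` are hypothetical helper names, and whitespace-separated words stand in for tokens (real tokenizers are model-specific and usually split text into subword units).

```python
def sliding_windows(items, size, overlap=0):
    """Return consecutive windows of at most `size` items.

    Adjacent windows share `overlap` items. Works on any sliceable
    sequence (a str yields str windows, a list yields list windows).
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    step = size - overlap
    windows = []
    start = 0
    while start < len(items):
        windows.append(items[start:start + size])
        if start + size >= len(items):
            break  # the last window already reaches the end
        start += step
    return windows


def chunk_by_characters(text, chunk_size, chunk_overlap=0):
    """Character-based chunking: sizes are counted in characters."""
    return sliding_windows(text, chunk_size, chunk_overlap)


def chunk_by_words(text, chunk_size, chunk_overlap=0):
    """Word-based chunking: sizes are counted in whitespace-separated words.

    Assumption: words are only a rough proxy for model tokens.
    """
    words = text.split()
    return [" ".join(w) for w in sliding_windows(words, chunk_size, chunk_overlap)]


print(chunk_by_characters("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
print(chunk_by_words("a b c d e", chunk_size=2))
# ['a b', 'c d', 'e']
```

With `chunk_overlap=1`, each chunk starts with the last character of the previous one, which is how overlap carries context across a boundary.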

In [2]:
text = """
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.

Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.

Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
"""

from langchain_text_splitters import CharacterTextSplitter

text_splitter1 = CharacterTextSplitter(
    separator = "\n",
    chunk_size = 50, 
    chunk_overlap = 0 # No overlap: adjacent chunks share no characters, so no context carries across boundaries.
)

text_splitter2 = CharacterTextSplitter(
    separator = "\n",
    chunk_size = 50, # hyperparameter. 
    chunk_overlap = 10 # hyperparameter. 
)

text_splitter3 = CharacterTextSplitter(
    separator = "\n",
    chunk_size = 50, 
    chunk_overlap = 15
)

# Split the same text with each splitter to compare the overlap settings.
chunks1 = text_splitter1.split_text(
    text=text
)

chunks2 = text_splitter2.split_text(
    text=text
)

chunks3 = text_splitter3.split_text(
    text=text
)

# Print the chunks produced by the first splitter (chunk_overlap = 0).
for i, chunk in enumerate(chunks1):
    print(f"Chunk {i+1}: {chunk}\n")

Created a chunk of size 123, which is longer than the specified 50
Created a chunk of size 107, which is longer than the specified 50
Created a chunk of size 102, which is longer than the specified 50
Created a chunk of size 123, which is longer than the specified 50
Created a chunk of size 107, which is longer than the specified 50
Created a chunk of size 102, which is longer than the specified 50
Created a chunk of size 123, which is longer than the specified 50
Created a chunk of size 107, which is longer than the specified 50
Created a chunk of size 102, which is longer than the specified 50


Chunk 1: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.

Chunk 2: Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

Chunk 3: Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.

Chunk 4: Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
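
The "longer than the specified 50" warnings above appear because the text is only cut at the separator: every paragraph is longer than 50 characters, and a single piece is never cut further. The sketch below imitates that split-then-merge idea in plain Python. It is a simplified illustration that ignores overlap, not the library's actual code, and `split_and_merge` is a hypothetical name.

```python
def split_and_merge(text, separator, chunk_size):
    """Split on `separator`, then greedily merge pieces up to `chunk_size`.

    A piece that is longer than `chunk_size` on its own is emitted
    unchanged, because this strategy only cuts at the separator.
    """
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits in the current chunk
        else:
            if current:
                chunks.append(current)
            current = piece  # may exceed chunk_size if the piece is too long
    if current:
        chunks.append(current)
    return chunks


chunks = split_and_merge("one\ntwo\nthree-and-more", separator="\n", chunk_size=7)
print(chunks)
# ['one\ntwo', 'three-and-more']
print([len(c) for c in chunks])
# [7, 14]
```

The second chunk is 14 characters long even though `chunk_size` is 7, which is the same situation the warnings report for the four paragraphs.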



In [8]:
from langchain_community.document_loaders import TextLoader  # langchain.document_loaders is the deprecated import path

loader = TextLoader(
    "state_of_the_union.txt"
)

docs = loader.load()

# to see the data.
docs[0].page_content

splitter = CharacterTextSplitter(
    separator = "\n",
    chunk_size = 3000, 
    chunk_overlap = 250
)

splitted_docs = splitter.split_documents(
    docs
)

print("Total Length of the character in the text docs: ",len(docs[0].page_content))

splitted_docs

Total Length of the character in the text docs:  6682


[Document(metadata={'source': 'state_of_the_union.txt'}, page_content='Introduction Artificial Intelligence (AI) is no longer just a concept from science fiction—it is a powerful force reshaping every aspect of our lives. From recommendation algorithms and voice assistants to autonomous vehicles and medical diagnostics, AI is already deeply integrated into our daily experiences. At its core, AI refers to the simulation of human intelligence in machines programmed to think, learn, and make decisions. As we stand on the brink of a technological revolution, AI is set to play a defining role in the evolution of society, industry, and even human identity.\nWhat is Artificial Intelligence?\nArtificial Intelligence is the field of computer science that focuses on creating systems capable of performing tasks that normally require human intelligence. These tasks include:\nLearning: Acquiring and improving knowledge or skills through experience.\nReasoning: Drawing conclusions from data or facts
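
Splitting on a single separator can leave chunks above the size limit when one piece is too long. A recursive strategy avoids this by trying a list of separators from coarse to fine and falling back to a hard cut. The sketch below illustrates the idea in plain Python. It is a simplified illustration without overlap, not the library's implementation; `recursive_split` is a hypothetical name, and the default separator list is assumed to match the commonly documented default of `RecursiveCharacterTextSplitter`.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split `text` into chunks of at most `chunk_size` characters.

    Tries the first separator; any piece that is still too long is
    split again with the remaining, finer separators. An empty
    separator (or no separators left) means a hard character cut.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []

    if not separators or separators[0] == "":
        # Last resort: cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks = []
    current = ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("alpha beta gamma\n\ndelta", chunk_size=10))
# ['alpha beta', 'gamma', 'delta']
print(recursive_split("abcdefghijkl", chunk_size=5))
# ['abcde', 'fghij', 'kl']
```

Every chunk respects the limit here, at the cost of sometimes cutting inside a paragraph or, in the worst case, inside a word.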

In [11]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=3000,
    chunk_overlap=250,
    length_function=len,
    is_separator_regex=False,
)

# Loading example document. 
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()

texts = text_splitter.create_documents(
    [state_of_the_union]
)

len(texts)

print(texts[0])
print(texts[1])

page_content='Introduction Artificial Intelligence (AI) is no longer just a concept from science fiction—it is a powerful force reshaping every aspect of our lives. From recommendation algorithms and voice assistants to autonomous vehicles and medical diagnostics, AI is already deeply integrated into our daily experiences. At its core, AI refers to the simulation of human intelligence in machines programmed to think, learn, and make decisions. As we stand on the brink of a technological revolution, AI is set to play a defining role in the evolution of society, industry, and even human identity.

What is Artificial Intelligence?

Artificial Intelligence is the field of computer science that focuses on creating systems capable of performing tasks that normally require human intelligence. These tasks include:

Learning: Acquiring and improving knowledge or skills through experience.
Reasoning: Drawing conclusions from data or facts.
Problem-solving: Finding solutions to complex problems.
