## 🧩 Document Object + Metadata Demo (LangChain)

This script demonstrates how LangChain represents data using the `Document` class
and why **metadata** is important in RAG and retrieval systems.

### ✔️ What this script does

- Imports LangChain text utilities & `Document`
- Creates a sample `Document` containing:
  - `page_content` → the actual text
  - `metadata` → contextual information about the text
- Prints both content and metadata to show the structure
- Explains why metadata is useful

### ⚙️ Using uv instead of pip

This project uses `uv` instead of `pip` because it is faster and manages virtual environments automatically.

✔️ Commands used

- Initialize the project: `uv init`
- Install dependencies from `requirements.txt`: `uv add -r requirements.txt`
- Add an additional package: `uv add pandas`
- Run the script using uv: `uv run python script.py`

### 💡 Why uv?

- Faster than pip
- Creates and manages virtual environments automatically
- Keeps dependencies isolated and reproducible
- Well-suited for modern Python + RAG projects

### 🗂️ What is a `Document`?

A `Document` in LangChain stores:

- **Text** (`page_content`)
- **Metadata** (`metadata = {...}`)

Example metadata fields:
- source / filename  
- page number  
- author  
- timestamps  
- tags or category  

Metadata helps during:
- 🔍 Search & filtering
- 🧠 Context-aware retrieval
- 📚 Grouping and organization
- 🧾 Auditing and traceability
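
To make the "search & filtering" point concrete, here is a minimal sketch of metadata-based filtering. `SimpleDoc` and `filter_by_metadata` are hypothetical names introduced for illustration; `SimpleDoc` only mirrors the two fields of LangChain's `Document` so the example runs without any dependencies.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SimpleDoc:
    """Hypothetical stand-in that mirrors Document's two fields."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def filter_by_metadata(docs: List[SimpleDoc], **criteria: Any) -> List[SimpleDoc]:
    """Keep documents whose metadata matches every key/value in criteria."""
    return [
        d for d in docs
        if all(d.metadata.get(k) == v for k, v in criteria.items())
    ]


docs = [
    SimpleDoc("Intro to Python.", {"source": "python_intro.txt", "page": 1}),
    SimpleDoc("More Python.", {"source": "python_intro.txt", "page": 2}),
    SimpleDoc("AI overview.", {"source": "ai_overview.txt", "page": 1}),
]

python_docs = filter_by_metadata(docs, source="python_intro.txt")
print([d.metadata["page"] for d in python_docs])  # [1, 2]
```

The same idea applies to real `Document` objects, since they expose `page_content` and `metadata` in the same way.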

### ✂️ Why import text splitters?

The splitters are used later to:
1. Break large documents into smaller chunks
2. Feed chunks into embeddings / vector DBs
3. Build RAG pipelines efficiently

(They are imported here for later use in the workflow.)

### 🎯 Key takeaway

This script is about **understanding the structure of a Document object**
before moving into:
- text splitting
- embeddings
- vector storage
- RAG pipelines


# Introduction to Data Ingestion

In [1]:
import os
from typing import List, Dict, Any
import pandas as pd

In [2]:
from langchain_core.documents import Document
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    TokenTextSplitter,
    
)
print("Set up Completed")

Set up Completed


### Understanding Document structure in LangChain

In [3]:
# Create a simple document
doc = Document(
    page_content="This is a sample document.",
    metadata={
        "source": "sample.txt",
        "page": 1,
        "author": "Ahmed",
        "date_created": "2026-01-03",
    },
)
print("Document Structure")

print(f"Content :{doc.page_content}")
print(f"Metadata :{doc.metadata}")

# Why metadata matters:
print("Metadata is crucial for:")
print("- Contextual Understanding")
print("- Efficient Retrieval")
print("- Enhanced Searchability")
print("- Data Management")
print("- Compliance and Auditing")

Document Structure
Content :This is a sample document.
Metadata :{'source': 'sample.txt', 'page': 1, 'author': 'Ahmed', 'date_created': '2026-01-03'}
Metadata is crucial for:
- Contextual Understanding
- Efficient Retrieval
- Enhanced Searchability
- Data Management
- Compliance and Auditing


In [4]:
type(doc)

langchain_core.documents.base.Document

## 📂 Text File Loading Demo: TextLoader & DirectoryLoader (LangChain)

This script demonstrates how to create sample text files and load them into
LangChain `Document` objects for use in RAG or NLP pipelines.

### ✔️ What the script does

1. Creates a `data/text_files/` folder
2. Saves three sample `.txt` files into it
3. Loads a **single file** using `TextLoader`
4. Loads **all files in the folder** using `DirectoryLoader`
5. Prints document content previews and metadata
6. Displays pros & cons of using `DirectoryLoader`

### 🧾 Why this is useful

- Converts raw text files into structured `Document` objects  
- Preserves metadata like file path (important for search & traceability)  
- Prepares text for:
  - splitting
  - embeddings
  - vector databases
  - RAG pipelines

### ⚙️ Tools Used

- **TextLoader** → loads one file at a time  
- **DirectoryLoader** → loads multiple files in bulk using glob patterns  
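
As a rough mental model of bulk loading, the sketch below walks a directory with a glob pattern and records each file's path as metadata. It uses only the standard library; `load_text_dir` is a hypothetical helper written for this illustration, not LangChain's implementation, and it returns plain dictionaries instead of `Document` objects.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def load_text_dir(root: Path, pattern: str = "**/*.txt") -> List[Dict[str, object]]:
    """Read every file matching pattern and record its path as metadata."""
    records = []
    for path in sorted(root.glob(pattern)):
        records.append({
            "page_content": path.read_text(encoding="utf-8"),
            "metadata": {"source": str(path)},
        })
    return records


# Build a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("alpha", encoding="utf-8")
    (root / "nested").mkdir()
    (root / "nested" / "b.txt").write_text("beta", encoding="utf-8")
    (root / "skip.md").write_text("ignored", encoding="utf-8")

    loaded = load_text_dir(root)

print(len(loaded))  # 2 (the .md file does not match the pattern)
```

The `**/*.txt` pattern matches files at any depth, which is why the nested file is picked up alongside the top-level one.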

### 💡 Key takeaway

This script lays the foundation for:
- chunking text
- embedding documents
- building retrieval-augmented applications


## Text Files

In [5]:
# Create the directory that will hold the sample text files
import os
os.makedirs("data/text_files", exist_ok=True)

In [6]:
sample_texts = {
    "data/text_files/python_intro.txt": """
Python is a high-level, interpreted programming language
known for its readability and versatility. It is widely
used in web development, data science, automation,
machine learning, and scripting.

Python emphasizes code simplicity and developer productivity.
Its large ecosystem of libraries makes it one of the most
popular programming languages in the world.
""",

    "data/text_files/ai_overview.txt": """
Artificial Intelligence (AI) refers to systems designed
to perform tasks that normally require human intelligence,
such as reasoning, learning, perception, and language
understanding.

Modern AI applications include chatbots, recommendation
systems, autonomous vehicles, and predictive analytics.
""",

    "data/text_files/cloud_computing.txt": """
Cloud computing allows users to store and process data
over the internet instead of on local machines.

It offers benefits such as scalability, flexibility,
cost efficiency, and remote accessibility.
"""
}



for filepath, content in sample_texts.items():
    with open(filepath, "w", encoding="utf-8") as file:
        file.write(content)

print("Sample text added successfully to files!")

Sample text added successfully to files!


## TextLoader - Read Single File

In [9]:
from langchain_community.document_loaders import TextLoader

## Loading a single text file
loader = TextLoader("data/text_files/python_intro.txt",encoding="utf-8")

document = loader.load()
# print(type(document))
print(document)

print(f"Loaded {len(document)} document")
print(f"Content preview: {document[0].page_content[:100]}...")
print(f"Metadata: {document[0].metadata}")

[Document(metadata={'source': 'data/text_files/python_intro.txt'}, page_content='\nPython is a high-level, interpreted programming language\nknown for its readability and versatility. It is widely\nused in web development, data science, automation,\nmachine learning, and scripting.\n\nPython emphasizes code simplicity and developer productivity.\nIts large ecosystem of libraries makes it one of the most\npopular programming languages in the world.\n')]
Loaded 1 document
Content preview: 
Python is a high-level, interpreted programming language
known for its readability and versatility....
Metadata: {'source': 'data/text_files/python_intro.txt'}


## DirectoryLoader - Multiple Text Files

In [10]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import TextLoader

# Load all the files from the directory
dir_loader = DirectoryLoader(
    "data/text_files",
    glob="**/*.txt",
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"},
    show_progress=True
)

documents = dir_loader.load()

print(f"\nLoaded {len(documents)} documents from directory.")
for i, doc in enumerate(documents):
    print(f"\nDocument {i+1}:")
    print(f"Content preview: {doc.page_content[:100]}...")
    print(f"Source: {doc.metadata['source']}")
    print(f"Length: {len(doc.page_content)} characters")


print("\nAdvantages of DirectoryLoader:")
print("1. Automatically loads multiple files from a directory")
print("2. Supports glob patterns for flexible file selection")
print("3. Preserves file metadata such as path and filename")
print("4. Useful for batch data ingestion in RAG systems")
print("5. Easy to combine with text splitters and embeddings")

print("\nDisadvantages of DirectoryLoader:")
print("1. Loads all files into memory - not ideal for huge datasets")
print("2. Limited control over deeply nested directory structures")
print("3. Requires compatible loaders for each file type")
print("4. No built-in filtering for duplicates or noisy content")
print("5. Can be slower for very large folders compared to streaming loaders")


100%|██████████| 3/3 [00:00<00:00, 136.63it/s]


Loaded 3 documents from directory.

Document 1:
Content preview: 
Artificial Intelligence (AI) refers to systems designed
to perform tasks that normally require huma...
Source: data\text_files\ai_overview.txt
Length: 298 characters

Document 2:
Content preview: 
Cloud computing allows users to store and process data
over the internet instead of on local machin...
Source: data\text_files\cloud_computing.txt
Length: 201 characters

Document 3:
Content preview: 
Python is a high-level, interpreted programming language
known for its readability and versatility....
Source: data\text_files\python_intro.txt
Length: 363 characters

Advantages of DirectoryLoader:
1. Automatically loads multiple files from a directory
2. Supports glob patterns for flexible file selection
3. Preserves file metadata such as path and filename
4. Useful for batch data ingestion in RAG systems
5. Easy to combine with text splitters and embeddings

Disadvantages of DirectoryLoader:
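1. Loads all files into memory - not ideal for huge datasets
2. Limited control over deeply nested directory structures
3. Requires compatible loaders for each file type
4. No built-in filtering for duplicates or noisy content
5. Can be slower for very large folders compared to streaming loaders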




## 🧩 Text Splitting Techniques in LangChain

This guide explains three major text splitting methods used in RAG pipelines and
document processing:

- CharacterTextSplitter  
- RecursiveCharacterTextSplitter  
- TokenTextSplitter  

It also includes an example of **overlap behavior** and a comparison summary.

---

## ✅ Method-1: CharacterTextSplitter

**How it works**

- Splits text on a **single separator** (e.g., a space or newline)
- Merges the resulting pieces into chunks of up to `chunk_size` characters
- Repeats up to `chunk_overlap` characters between consecutive chunks

**Pros**
- Simple and predictable
- Uniform chunk sizes

**Cons**
- May break sentences and meaning
- No semantic awareness

**Use when**
- Doing basic experiments
- Chunk size consistency matters
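
The split-then-merge idea can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's code: `split_by_separator` is a hypothetical name, and the sketch assumes that no single piece is longer than `chunk_size` (an oversized piece would simply become an oversized chunk).

```python
from typing import List


def split_by_separator(text: str, separator: str = " ",
                       chunk_size: int = 20, chunk_overlap: int = 8) -> List[str]:
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size."""
    pieces = [p for p in text.split(separator) if p]
    chunks: List[str] = []
    current: List[str] = []

    for piece in pieces:
        if current and len(separator.join(current + [piece])) > chunk_size:
            chunks.append(separator.join(current))
            # Drop pieces from the front until the carried tail fits the overlap
            # budget and still leaves room for the incoming piece.
            while current and (
                len(separator.join(current)) > chunk_overlap
                or len(separator.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


chunks = split_by_separator("one two three four five six seven eight nine ten")
for c in chunks:
    print(repr(c))
# 'one two three four'
# 'four five six seven'
# 'seven eight nine ten'
```

Note that the overlap is made of whole pieces ("four", "seven"), so it can be shorter than `chunk_overlap` characters.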

---

## ✅ Method-2: RecursiveCharacterTextSplitter

**How it works**

- Splits text using a **hierarchical fallback strategy**  
  paragraph (`\n\n`) → line (`\n`) → word (space) → character
- Adds overlap **only when a chunk is actually split**
- Produces context-preserving chunks

**Pros**
- Meaning-aware splitting
- Works best for real documents (PDFs, articles, pages)
- Reduces broken sentences

**Cons**
- Less predictable chunk boundaries than character splitting

**Use when**
- Building RAG systems
- Processing long paragraphs or natural text
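
The fallback strategy can be illustrated with a small recursive function. This is a simplified sketch under stated assumptions: `recursive_split` is a hypothetical name, overlap is left out to keep the recursion easy to follow, and the real splitter handles more cases than this.

```python
from typing import List, Sequence


def recursive_split(text: str, chunk_size: int,
                    separators: Sequence[str] = ("\n\n", "\n", " ", "")) -> List[str]:
    """Split on the coarsest separator first; recurse on pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []

    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small neighbours
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too long: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


sample = "Para one line one.\nPara one line two.\n\nPara two is short."
print(recursive_split(sample, 40))
# ['Para one line one.\nPara one line two.', 'Para two is short.']
print(recursive_split(sample, 20))
# ['Para one line one.', 'Para one line two.', 'Para two is short.']
```

With a 40-character budget the paragraph break is enough; with 20 characters the first paragraph is too long, so the function falls back to splitting it on single newlines.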

### 🔎 Overlap Example

- Overlap ensures the **end of one chunk is repeated in the next**
- Prevents loss of meaning across chunk boundaries
- Useful for embeddings & retrieval
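
One way to see exactly what is repeated is to compute the longest suffix of one chunk that is also a prefix of the next. `shared_overlap` is a hypothetical helper for this illustration; the two strings are the first two chunks produced by the overlap demo in this notebook.

```python
def shared_overlap(left: str, right: str) -> str:
    """Return the longest suffix of left that is also a prefix of right."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""


a = "Retrieval Augmented Generation allows language models to retrieve external"
b = "retrieve external knowledge while answering questions. It improves factual"

print(repr(shared_overlap(a, b)))  # 'retrieve external'
```

The shared text is 17 characters, slightly under the 20-character `chunk_overlap` used in that demo, presumably because the splitter carries over whole words rather than an exact character count.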

---

## ✅ Method-3: TokenTextSplitter

**How it works**

- Splits text based on **tokens (LLM units)** rather than characters
- Aligns with model token limits

**Pros**
- Token-safe chunks for LLMs
- More consistent for embeddings
- Ideal for OpenAI / transformer models

**Cons**
- Token count ≠ character count, so chunk sizes are harder to judge by eye

**Use when**
- Preparing text for embeddings
- Working with API token limits
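
The sliding-window idea behind token-based splitting can be sketched as follows. This is an illustration only: `split_by_tokens` is a hypothetical name, and whitespace-separated words stand in for real model tokens, which a proper tokenizer would produce.

```python
from typing import List


def split_by_tokens(text: str, chunk_size: int = 5, chunk_overlap: int = 2) -> List[str]:
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = text.split()  # ASSUMPTION: whitespace words stand in for model tokens
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks


token_windows = split_by_tokens("a b c d e f g h")
print(token_windows)  # ['a b c d e', 'd e f g h']
```

Because the budget is counted in tokens, two chunks with the same token count can differ a lot in character length.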

---

## 🆚 Quick Comparison

| Splitter | Strength | Weakness | Best Use |
|--------|--------|--------|--------|
| Character | Simple & predictable | Breaks sentences | Basic processing |
| Recursive | Meaning-aware | Less uniform | RAG / documents |
| Token | Token-aligned | Harder to read | LLM & embeddings |

---

## 🎯 When to Choose Which?

- ✔ **RecursiveCharacterTextSplitter** → Real-world RAG pipelines  
- ✔ **TokenTextSplitter** → Embeddings & token-safe chunks  
- ✔ **CharacterTextSplitter** → Simple or controlled experiments  

---

## 💡 Key Takeaway

Different splitters exist because **different tasks need different chunk behavior**.  
Choose the one that best preserves meaning while fitting model limits.


## Text Splitting Techniques

In [10]:
## Different text splitting techniques
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    TokenTextSplitter,
    
)
print(document)

[Document(metadata={'source': 'data/text_files/python_intro.txt'}, page_content='\nPython is a high-level, interpreted programming language\nknown for its readability and versatility. It is widely\nused in web development, data science, automation,\nmachine learning, and scripting.\n\nPython emphasizes code simplicity and developer productivity.\nIts large ecosystem of libraries makes it one of the most\npopular programming languages in the world.\n')]


In [11]:
## Method-1: Character Text Splitter
text = document[0].page_content
char_splitter = CharacterTextSplitter(
    separator=" ",       # Split on spaces
    chunk_size=200,      # Each chunk holds at most 200 characters
    chunk_overlap=20,    # Up to 20 characters shared between chunks
    length_function=len  # Measure chunk length in characters
)


char_chunks = char_splitter.split_text(text)
print(f"\nCharacter Text Splitter produced {len(char_chunks)} chunks:")
for i, chunk in enumerate(char_chunks):
    print(f"\nChunk {i+1}:\n{chunk}")


Character Text Splitter produced 2 chunks:

Chunk 1:
Python is a high-level, interpreted programming language
known for its readability and versatility. It is widely
used in web development, data science, automation,
machine learning, and

Chunk 2:
learning, and scripting.

Python emphasizes code simplicity and developer productivity.
Its large ecosystem of libraries makes it one of the most
popular programming languages in the world.


In [12]:
## Method-2: RecursiveCharacterTextSplitter
recursive_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],  # Tried in order, coarsest first
    chunk_size=200,      # Each chunk holds at most 200 characters
    chunk_overlap=20,    # Up to 20 characters shared between chunks
    length_function=len  # Measure chunk length in characters
)

recursive_chunks = recursive_splitter.split_text(text)
print(f"\nRecursive Character Text Splitter produced {len(recursive_chunks)} chunks:")
for i, chunk in enumerate(recursive_chunks):
    print(f"\nChunk {i+1}:\n{chunk}")


Recursive Character Text Splitter produced 2 chunks:

Chunk 1:
Python is a high-level, interpreted programming language
known for its readability and versatility. It is widely
used in web development, data science, automation,
machine learning, and scripting.

Chunk 2:
Python emphasizes code simplicity and developer productivity.
Its large ecosystem of libraries makes it one of the most
popular programming languages in the world.


In [13]:
# Overlap example
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = """
Retrieval Augmented Generation allows language models to retrieve external knowledge while answering questions. It improves factual accuracy, reduces hallucinations, and enables real-world enterprise applications.
""".replace("\n", " ")

splitter = RecursiveCharacterTextSplitter(
    chunk_size=80,
    chunk_overlap=20,
    separators=[" ", ""]
)

chunks = splitter.split_text(text)

print(f"\nChunks generated: {len(chunks)}")
for i, c in enumerate(chunks):
    print(f"\n--- Chunk {i+1} ---\n{c}")

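print("\n🟡 Overlap Check")
for i in range(len(chunks)-1):
    # Last 20 characters of the current chunk. This is an upper bound on the
    # overlap: the text actually shared with the next chunk can be shorter.
    overlap = chunks[i][-20:]
    print(f"\nBetween Chunk {i+1} → {i+2}")
    print("Overlap:", overlap)
    print("Chunk", i+1, "ends with ...", overlap)
    print("Chunk", i+2, "starts with", chunks[i+1][:20], "...")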



Chunks generated: 4

--- Chunk 1 ---
Retrieval Augmented Generation allows language models to retrieve external

--- Chunk 2 ---
retrieve external knowledge while answering questions. It improves factual

--- Chunk 3 ---
It improves factual accuracy, reduces hallucinations, and enables real-world

--- Chunk 4 ---
enables real-world enterprise applications.

🟡 Overlap Check

Between Chunk 1 → 2
Overlap: to retrieve external
Chunk 1 ends with ... to retrieve external
Chunk 2 starts with retrieve external kn ...

Between Chunk 2 → 3
Overlap:  It improves factual
Chunk 2 ends with ...  It improves factual
Chunk 3 starts with It improves factual  ...

Between Chunk 3 → 4
Overlap: d enables real-world
Chunk 3 ends with ... d enables real-world
Chunk 4 starts with enables real-world e ...


In [14]:
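## Method-3: Token Text Splitter
# `text` is still the short RAG paragraph from the overlap example above,
# so it fits within a single 40-token chunk.
token_splitter = TokenTextSplitter(
    chunk_size=40,     # Maximum tokens per chunk
    chunk_overlap=10   # Tokens shared between consecutive chunks
)
token_chunks = token_splitter.split_text(text)
print(f"Created {len(token_chunks)} chunks using TokenTextSplitter")
for i, chunk in enumerate(token_chunks):
    print(f"\nChunk {i+1}:\n{chunk}")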

Created 1 chunks using TokenTextSplitter

Chunk 1:
 Retrieval Augmented Generation allows language models to retrieve external knowledge while answering questions. It improves factual accuracy, reduces hallucinations, and enables real-world enterprise applications. 


In [15]:
print("\n📌 Comparison of Text Splitting Methods\n")

print("🔹 RecursiveCharacterTextSplitter")
print("- Smart splitting using hierarchical rules (paragraph → sentence → word → char)")
print("- Preserves semantic meaning better")
print("- Adds overlap only when chunks are actually split")
print("- Best for RAG, documents, PDFs, articles\n")

print("🔹 CharacterTextSplitter")
print("- Splits strictly by character count")
print("- Very predictable but may break sentences")
print("- No semantic awareness")
print("- Useful for uniform chunk sizes or simple text processing\n")

print("🔹 TokenTextSplitter")
print("- Splits based on tokens instead of characters")
print("- Aligns with LLM token limits")
print("- Reduces embedding inconsistency")
print("- Best for OpenAI models and embedding pipelines\n")

print("🔹 Recursive vs Character (Quick Difference)")
print("- Recursive = Meaning-aware + flexible splitting")
print("- Character = Fixed size chunks, no context awareness\n")

print("🔹 When to Use What?")
print("- Use RecursiveCharacterTextSplitter → Real world RAG systems")
print("- Use TokenTextSplitter → LLM / embedding token-safe chunks")
print("- Use CharacterTextSplitter → Simple or controlled experiments\n")



📌 Comparison of Text Splitting Methods

🔹 RecursiveCharacterTextSplitter
- Smart splitting using hierarchical rules (paragraph → sentence → word → char)
- Preserves semantic meaning better
- Adds overlap only when chunks are actually split
- Best for RAG, documents, PDFs, articles

🔹 CharacterTextSplitter
- Splits strictly by character count
- Very predictable but may break sentences
- No semantic awareness
- Useful for uniform chunk sizes or simple text processing

🔹 TokenTextSplitter
- Splits based on tokens instead of characters
- Aligns with LLM token limits
- Reduces embedding inconsistency
- Best for OpenAI models and embedding pipelines

🔹 Recursive vs Character (Quick Difference)
- Recursive = Meaning-aware + flexible splitting
- Character = Fixed size chunks, no context awareness

🔹 When to Use What?
- Use RecursiveCharacterTextSplitter → Real world RAG systems
- Use TokenTextSplitter → LLM / embedding token-safe chunks
- Use CharacterTextSplitter 