**Understanding Document Structure in LangChain**

In LangChain, a `Document` is a structured object that contains two main components:

- **Content (`page_content`)**: The actual text or data of the document. This can be any string, such as the contents of a file, a web page, or a database entry.
- **Metadata (`metadata`)**: A dictionary of key-value pairs that provides additional context about the document. Common metadata fields include source, author, creation date, page number, and custom tags.

This structure enables:
- **Traceability**: You can track where each piece of information came from.
- **Flexible Querying**: Metadata allows for filtering and searching documents based on attributes.
- **Context-Aware Processing**: Metadata can be used to provide context during retrieval and generation tasks, improving the relevance and accuracy of results.

**Example:**


In [1]:
# create a simple document
from langchain_core.documents import Document
doc = Document(
    page_content="This is a sample document. " * 100,
    metadata={"source": "sample_source.txt",
              "page" : 1,
              "author": "John Doe",
              "date_created": "2023-10-01",
              "custom_field": "any_value"
              }
)
print("Document Created!")
print(f"Document Content: {doc.page_content[:50]}...")  # Print first 50 characters
print(f"Document Content Length: {len(doc.page_content)} characters")
print(f"Document Metadata: {doc.metadata}")
print(f"Document Metadata Keys: {list(doc.metadata.keys())}")



Document Created!
Document Content: This is a sample document. This is a sample docume...
Document Content Length: 2700 characters
Document Metadata: {'source': 'sample_source.txt', 'page': 1, 'author': 'John Doe', 'date_created': '2023-10-01', 'custom_field': 'any_value'}
Document Metadata Keys: ['source', 'page', 'author', 'date_created', 'custom_field']
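
The traceability and flexible-querying points above can be sketched without any LangChain dependency. The example below is a minimal illustration, not LangChain's API: `SimpleDoc` and `filter_by_metadata` are hypothetical names, and `SimpleDoc` is only a stand-in that carries the same two fields as `Document`.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDoc:
    """Hypothetical stand-in with the same two fields as a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_by_metadata(docs, **criteria):
    """Return the docs whose metadata matches every key/value pair in criteria."""
    return [
        d for d in docs
        if all(d.metadata.get(key) == value for key, value in criteria.items())
    ]


docs_demo = [
    SimpleDoc("Intro to RAG", {"source": "rag.txt", "page": 1, "author": "John Doe"}),
    SimpleDoc("RAG details", {"source": "rag.txt", "page": 2, "author": "John Doe"}),
    SimpleDoc("ML basics", {"source": "ml.txt", "page": 1, "author": "Jane Roe"}),
]

# Flexible querying: keep only page 1 of rag.txt
matches = filter_by_metadata(docs_demo, source="rag.txt", page=1)
print([d.page_content for d in matches])

# Traceability: report where each match came from
for d in matches:
    print(f"{d.page_content!r} came from {d.metadata['source']}, page {d.metadata['page']}")
```

Vector stores and retrievers typically expose their own metadata filters; the point here is only that keeping metadata next to the content is what makes such filtering possible.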


In [2]:
import os
os.makedirs("data/text_files", exist_ok=True)

In [3]:
sample_texts = {
"data/text_files/rag_intro.txt": """ RAG stands for Retrieval-Augmented Generation

It is an AI framework that significantly enhances the capabilities of large language models (LLMs) by connecting them to external, authoritative knowledge sources before generating a response. Think of it as giving the LLM an "open-book" exam instead of a "closed-book" one.

Key Concepts

Retrieval: The process of searching and pulling relevant documents, data, or snippets from an external knowledge base (like a database, an organization's internal documents, or the public internet).
Augmentation: The retrieved information is added to the user's original query as additional context.
Generation: The LLM uses this augmented prompt (the query + the external facts) to generate a more accurate, up-to-date, and grounded response.
Why is RAG Important? RAG helps to overcome several key limitations of traditional LLMs:

Reduces "Hallucinations": LLMs sometimes generate plausible-sounding but factually incorrect information because they are making predictions based only on their initial, static training data. RAG grounds the model's answer in verifiable external sources, making the output more factual.
Access to Current and Specific Data: Traditional LLMs only know what they were trained on, which can quickly become outdated. RAG allows the model to access real-time, specific, or proprietary information (like a company's latest product manual or a customer's account details) without needing to retrain the entire model.
Cost-Effective: It is much faster and cheaper to update the external knowledge base than it is to continuously retrain and fine-tune a massive LLM.
Transparency: Many RAG systems can provide citations or source links for the information they use, allowing users to verify the claims and building trust.
In essence, RAG acts as a dynamic information layer that keeps the powerful generative abilities of an LLM connected to the most current and relevant facts available. """,

"data/text_files/ml_intro.txt": """ Machine learning is a field of artificial intelligence where algorithms learn from data without being explicitly programmed, allowing them to identify patterns and make predictions or decisions. It involves training models on vast datasets, which learn from patterns and relationships to generate new insights, make predictions, or perform tasks. Unlike static software, machine learning models can improve their performance over time as they are exposed to more data.  
How it Works
Training Data: Machine learning systems are given large sets of data (training data) that contains both inputs and desired outputs. 
Algorithms: Algorithms act as the rules that analyze this data, searching for mathematical correlations between the inputs and the expected outputs. 
Learning Patterns: The algorithm uses this information to identify patterns and relationships within the data. 
Model Creation: The learned patterns are encapsulated into a model, which can then make predictions or decisions when presented with new, unseen data. 
Key Aspects
Data-Driven: Machine learning is fundamentally about learning from data, which drives its ability to make accurate predictions. 
Adaptive: Models can continuously get better at their assigned tasks as they are exposed to new data and feedback. 
Predictive Power: A core function of machine learning is to forecast future outcomes, like stock market movements or recommended video content. 
A Subset of AI: Machine learning is a specialized form of artificial intelligence, focused on the specific ability of learning from data. 
Examples
Healthcare: Identifying trends in patient data to improve diagnoses and treatments. 
Personalization: Recommending products or videos based on user history and preferences. 
Image Recognition: Detecting objects or anomalies (like cancer in CT scans) by learning from image datasets.  """
}


for file_path, content in sample_texts.items():
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(content)

## Text Loaders in LangChain

Text loaders are essential components in LangChain for ingesting and structuring raw data from various sources. They convert files, directories, or other data sources into standardized `Document` objects, enabling downstream processing and retrieval.

### Common Text Loaders

- **TextLoader**  
    Loads a single text file into a `Document`.  
    - **Use Case:** When you want to ingest one file at a time.
    - **Example:**  
        ```python
        loader = TextLoader("path/to/file.txt", encoding="utf-8")
        docs = loader.load()
        ```

- **DirectoryLoader**  
    Loads multiple files from a directory, using a specified loader for each file.  
    - **Use Case:** Batch ingestion of many files (e.g., all `.txt` files in a folder).
    - **Features:**  
        - Supports glob patterns for file selection.
        - Takes a `loader_cls` that decides how each matched file is parsed (one loader class per `DirectoryLoader` instance).
        - Shows a progress bar when `show_progress=True` (requires `tqdm`).
    - **Example:**  
        ```python
        dir_loader = DirectoryLoader(
                "data/text_files",
                glob="*.txt",
                loader_cls=TextLoader,
                loader_kwargs={"encoding": "utf-8"},
                show_progress=True
        )
        dir_docs = dir_loader.load()
        ```

### Loader Output

All loaders produce a list of `Document` objects, each containing:
- **page_content:** The main text of the document.
- **metadata:** Contextual information (e.g., source file path).
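
To make the loader output concrete, here is a rough, dependency-free sketch of what a directory-style loader does: find files that match a pattern, read each one, and record the file path as `source` metadata. The function `load_text_dir` and its dict-based output are assumptions made for illustration; this is not how `DirectoryLoader` is implemented.

```python
import tempfile
from pathlib import Path


def load_text_dir(directory, pattern="*.txt", recursive=False, encoding="utf-8"):
    """Read every matching file into a dict with page_content and metadata."""
    root = Path(directory)
    # glob() looks only at the top level; rglob() also descends into subfolders
    paths = root.rglob(pattern) if recursive else root.glob(pattern)
    return [
        {"page_content": p.read_text(encoding=encoding), "metadata": {"source": str(p)}}
        for p in sorted(paths)
    ]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("alpha", encoding="utf-8")
    (root / "notes.md").write_text("ignored", encoding="utf-8")
    (root / "sub").mkdir()
    (root / "sub" / "b.txt").write_text("beta", encoding="utf-8")

    top_level = load_text_dir(root)                   # only a.txt
    everything = load_text_dir(root, recursive=True)  # a.txt and sub/b.txt

print([d["page_content"] for d in top_level])
print([d["page_content"] for d in everything])
```

The difference between the two calls is worth keeping in mind when choosing a glob pattern: a plain `*.txt` does not reach into subdirectories.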

### Loader Selection Tips

- Use **TextLoader** for single files or when you need granular control.
- Use **DirectoryLoader** for batch processing and scalability.
- Choose loader classes based on file type (e.g., `PyPDFLoader` for PDFs, `CSVLoader` for CSV files).

**Efficient data ingestion starts with the right loader choice, ensuring your pipeline is scalable, consistent, and easy to maintain.**

**TextLoader Example:**

In [4]:
from langchain_community.document_loaders import TextLoader
import json
loader = TextLoader("data/text_files/rag_intro.txt", encoding="utf-8")    
docs = loader.load()
print("Type of loaded document:", type(docs))
print(docs)
print(json.dumps([doc.model_dump() for doc in docs], indent=2))  # Pretty-print the list of documents
print(f"Number of documents loaded: {len(docs)}")
print(f"First document content (first 100 chars): {docs[0].page_content[:100]}...")

Type of loaded document: <class 'list'>
[Document(metadata={'source': 'data/text_files/rag_intro.txt'}, page_content=' RAG stands for Retrieval-Augmented Generation\n\nIt is an AI framework that significantly enhances the capabilities of large language models (LLMs) by connecting them to external, authoritative knowledge sources before generating a response. Think of it as giving the LLM an "open-book" exam instead of a "closed-book" one.\n\nKey Concepts\n\nRetrieval: The process of searching and pulling relevant documents, data, or snippets from an external knowledge base (like a database, an organization\'s internal documents, or the public internet).\nAugmentation: The retrieved information is added to the user\'s original query as additional context.\nGeneration: The LLM uses this augmented prompt (the query + the external facts) to generate a more accurate, up-to-date, and grounded response.\nWhy is RAG Important? RAG helps to overcome several key limitations of traditional LLMs:\

**DirectoryLoader - Multiple Text Files Example:**

In [5]:
from langchain_community.document_loaders import DirectoryLoader

dir_loader = DirectoryLoader(
    "data/text_files",
    glob="*.txt",  # Match .txt files directly inside data/text_files (not subdirectories)
    loader_cls=TextLoader,  # Loader class to use for each matched file
    loader_kwargs={"encoding": "utf-8"},
    show_progress=True,
)
dir_docs = dir_loader.load()
print("Type of loaded documents from directory:", type(dir_docs) , len(dir_docs))
print(f"Number of documents loaded from directory: {len(dir_docs)}")

for i, doc in enumerate(dir_docs): # i means index, doc means document
    print(f"\nDocument {i+1}:")
    print(f"Type: {type(doc)}")
    print(f"Content (first 100 chars): {doc.page_content[:100]}...")
    print(f"Metadata: {doc.metadata}")
    if i >= 2:  # Limit output to first 3 documents for brevity
        break
 
print("""\n**📈Advantages of DirectoryLoader**:
1. **Batch Processing**: Load multiple files in one go, saving time and effort.
2. **Consistency**: Ensures all files are processed using the same loader and settings.
3. **Scalability**: Easily handle large volumes of documents by loading them in batches.
4. **Flexibility**: Supports various file types and structures through customizable loaders.
5. **Progress Tracking**: Provides feedback on loading progress, useful for large datasets. """)

# disadvantages of DirectoryLoader
print("""\n**⚠️Disadvantages of DirectoryLoader**:
1. **Limited Control**: Less granular control over individual file processing compared to single-file loaders.
2. **Error Propagation**: By default, an error in one file stops the whole load (pass silent_errors=True to skip files that fail).
3. **Resource Intensive**: Loading many files at once can consume significant memory and processing power.
4. **Complexity**: May introduce complexity in managing and configuring multiple loaders for different file types.
5. **Debugging Challenges**: Harder to debug issues with specific files when using batch loading. """)


100%|██████████| 2/2 [00:00<00:00, 215.56it/s]

Type of loaded documents from directory: <class 'list'> 2
Number of documents loaded from directory: 2

Document 1:
Type: <class 'langchain_core.documents.base.Document'>
Content (first 100 chars):  Machine learning is a field of artificial intelligence where algorithms learn from data without bei...
Metadata: {'source': 'data\\text_files\\ml_intro.txt'}

Document 2:
Type: <class 'langchain_core.documents.base.Document'>
Content (first 100 chars):  RAG stands for Retrieval-Augmented Generation

It is an AI framework that significantly enhances th...
Metadata: {'source': 'data\\text_files\\rag_intro.txt'}

**📈Advantages of DirectoryLoader**:
1. **Batch Processing**: Load multiple files in one go, saving time and effort.
2. **Consistency**: Ensures all files are processed using the same loader and settings.
3. **Scalability**: Easily handle large volumes of documents by loading them in batches.
4. **Flexibility**: Supports various file types and structures through customizable loaders.
5




## Text Splitters and Strategies in LangChain

Text splitters are essential tools in LangChain for dividing large texts into manageable, context-preserving chunks. This process is crucial for tasks like retrieval, embedding, and generation, where input size limits and context windows must be respected.

### Why Split Text?

- **Model Constraints**: LLMs and embedding models have maximum input sizes (characters or tokens).
- **Efficient Retrieval**: Smaller chunks improve search relevance and granularity.
- **Context Preservation**: Overlapping chunks help maintain continuity across splits.
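
The role of overlap is easiest to see on a toy string. The sketch below is a plain sliding window over characters; `sliding_window_chunks` is a hypothetical helper written for this illustration and is much simpler than LangChain's splitters, which also respect separators.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters sharing chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a boundary is still partly visible in both chunks.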

### Common Text Splitting Strategies

1. **Character-Based Splitting**
    - **CharacterTextSplitter**: Splits text on a single separator string (e.g., `"\n\n"` or `"\n"`), then merges the pieces into chunks of up to `chunk_size`.
    - **Use Case**: Structured text with clear delimiters (paragraphs, lines).
    - **Parameters**: `separator`, `chunk_size`, `chunk_overlap`.

2. **Recursive Splitting**
    - **RecursiveCharacterTextSplitter**: Tries multiple separators in order, recursively splitting to best fit chunk size.
    - **Use Case**: Unstructured or complex text; produces more balanced chunks.
    - **Parameters**: `separators` (list), `chunk_size`, `chunk_overlap`.

3. **Token-Based Splitting**
    - **TokenTextSplitter**: Splits text based on token count using a tokenizer.
    - **Use Case**: NLP tasks and LLMs where token limits matter (e.g., OpenAI models).
    - **Parameters**: `encoding_name` or `model_name` (tokenizer selection), `chunk_size` (tokens), `chunk_overlap` (tokens).

### Choosing a Strategy

- **Simple, Structured Text**: Use `CharacterTextSplitter`.
- **Complex or Mixed Text**: Use `RecursiveCharacterTextSplitter` for balanced chunks.
- **Token-Limited Models**: Use `TokenTextSplitter` for precise control over input size.

**Tip:** Always consider the downstream task (retrieval, embedding, generation) and the model's input constraints when selecting a splitter and configuring its parameters.

Text splitting is a foundational step for building robust, scalable, and context-aware AI pipelines in LangChain.

**Method 1: CharacterTextSplitter splitting**

In [6]:
from langchain_text_splitters import (  # langchain.text_splitter is the legacy import path
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    TokenTextSplitter
)  
print(docs)
text = docs[0].page_content
text
 

[Document(metadata={'source': 'data/text_files/rag_intro.txt'}, page_content=' RAG stands for Retrieval-Augmented Generation\n\nIt is an AI framework that significantly enhances the capabilities of large language models (LLMs) by connecting them to external, authoritative knowledge sources before generating a response. Think of it as giving the LLM an "open-book" exam instead of a "closed-book" one.\n\nKey Concepts\n\nRetrieval: The process of searching and pulling relevant documents, data, or snippets from an external knowledge base (like a database, an organization\'s internal documents, or the public internet).\nAugmentation: The retrieved information is added to the user\'s original query as additional context.\nGeneration: The LLM uses this augmented prompt (the query + the external facts) to generate a more accurate, up-to-date, and grounded response.\nWhy is RAG Important? RAG helps to overcome several key limitations of traditional LLMs:\n\nReduces "Hallucinations": LLMs someti

' RAG stands for Retrieval-Augmented Generation\n\nIt is an AI framework that significantly enhances the capabilities of large language models (LLMs) by connecting them to external, authoritative knowledge sources before generating a response. Think of it as giving the LLM an "open-book" exam instead of a "closed-book" one.\n\nKey Concepts\n\nRetrieval: The process of searching and pulling relevant documents, data, or snippets from an external knowledge base (like a database, an organization\'s internal documents, or the public internet).\nAugmentation: The retrieved information is added to the user\'s original query as additional context.\nGeneration: The LLM uses this augmented prompt (the query + the external facts) to generate a more accurate, up-to-date, and grounded response.\nWhy is RAG Important? RAG helps to overcome several key limitations of traditional LLMs:\n\nReduces "Hallucinations": LLMs sometimes generate plausible-sounding but factually incorrect information because t

In [7]:

char_splitter = CharacterTextSplitter(
    separator="\n",       # split only at newlines
    chunk_size=100,       # target maximum of 100 characters per chunk
    chunk_overlap=20,     # up to 20 characters shared between neighbouring chunks
    length_function=len   # how chunk length is measured (default is len)
)

char_chunks = char_splitter.split_text(text)
print(f"Number of character-based chunks: {len(char_chunks)}")
for i, chunk in enumerate(char_chunks[:3]):  # Print first 3 chunks
    print(f"\nCharacter Chunk {i+1} (length {len(chunk)}):\n{chunk}")
print("\n---\n")
print(char_chunks)
print(char_chunks[0])
print(char_chunks[1])


Created a chunk of size 274, which is longer than the specified 100
Created a chunk of size 198, which is longer than the specified 100
Created a chunk of size 143, which is longer than the specified 100
Created a chunk of size 286, which is longer than the specified 100
Created a chunk of size 322, which is longer than the specified 100
Created a chunk of size 147, which is longer than the specified 100
Created a chunk of size 154, which is longer than the specified 100


Number of character-based chunks: 12

Character Chunk 1 (length 45):
RAG stands for Retrieval-Augmented Generation

Character Chunk 2 (length 274):
It is an AI framework that significantly enhances the capabilities of large language models (LLMs) by connecting them to external, authoritative knowledge sources before generating a response. Think of it as giving the LLM an "open-book" exam instead of a "closed-book" one.

Character Chunk 3 (length 12):
Key Concepts

---

['RAG stands for Retrieval-Augmented Generation', 'It is an AI framework that significantly enhances the capabilities of large language models (LLMs) by connecting them to external, authoritative knowledge sources before generating a response. Think of it as giving the LLM an "open-book" exam instead of a "closed-book" one.', 'Key Concepts', "Retrieval: The process of searching and pulling relevant documents, data, or snippets from an external knowledge base (like a database, an organization's internal documents, or the 

**Warning: some chunks are larger than the specified `chunk_size=100` because of how `CharacterTextSplitter` uses the `separator` parameter:**

Separator Behavior: With `separator="\n"`, the splitter only splits at newline characters. It never splits in the middle of a line, even when that line is longer than `chunk_size`.

Chunk Size is a Target: `chunk_size=100` is the maximum the splitter aims for when merging pieces, not a hard limit. If a single segment between separators is longer than `chunk_size`, it is kept intact and the splitter logs the "Created a chunk of size ..." warning shown above.

To get smaller chunks, you have two options:

1. Use a finer separator. `CharacterTextSplitter` accepts a single `separator` string, not a list, so choose one that occurs often enough in the text, for example `CharacterTextSplitter(separator=" ", chunk_size=100, chunk_overlap=20, length_function=len)`.

2. Use `RecursiveCharacterTextSplitter`. It is generally recommended because it tries a list of separators in order (by default `["\n\n", "\n", " ", ""]`) and falls back to the next one whenever a piece is still too long, for example `RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20, length_function=len)`.
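
The oversized chunks can be reproduced with a few lines of plain Python. The sketch below is a simplified model of single-separator splitting (no overlap, no whitespace stripping); `split_on_separator` is a hypothetical helper, not the library's implementation.

```python
def split_on_separator(text, separator, chunk_size):
    """Split on one separator, then merge neighbouring pieces up to chunk_size."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits: keep merging
        else:
            if current:
                chunks.append(current)
            # A piece longer than chunk_size is kept whole, because there is
            # no separator inside it to cut on.
            current = piece
    if current:
        chunks.append(current)
    return chunks


text_demo = "short line\nthis second line is much longer than the limit\nend"

by_newline = split_on_separator(text_demo, "\n", chunk_size=20)
print([len(c) for c in by_newline])  # the middle line exceeds 20 characters

by_space = split_on_separator(text_demo, " ", chunk_size=20)
print([len(c) for c in by_space])    # every chunk now fits
```

Splitting on spaces keeps every chunk within the limit here because no single word is longer than `chunk_size`; a text with a very long unbroken token would still produce an oversized chunk.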

**Method 2: RecursiveCharacterTextSplitter**

In [8]:
recursive_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],  # try these separators in order of priority
    chunk_size=200,       # at most 200 characters per chunk
    chunk_overlap=20,     # up to 20 characters shared between neighbouring chunks
    length_function=len   # how chunk length is measured (default is len)
)

rec_chunks = recursive_splitter.split_text(text)
print(f"Number of recursive chunks: {len(rec_chunks)}")
for i, chunk in enumerate(rec_chunks[:3]):  # Print first 3 chunks
    print(f"\nRecursive Chunk {i+1} (length {len(chunk)}):\n{chunk}")
print("\n---\n")
print(rec_chunks)
print(rec_chunks[0])
print(rec_chunks[1])


Number of recursive chunks: 15

Recursive Chunk 1 (length 45):
RAG stands for Retrieval-Augmented Generation

Recursive Chunk 2 (length 198):
It is an AI framework that significantly enhances the capabilities of large language models (LLMs) by connecting them to external, authoritative knowledge sources before generating a response. Think

Recursive Chunk 3 (length 93):
a response. Think of it as giving the LLM an "open-book" exam instead of a "closed-book" one.

---

['RAG stands for Retrieval-Augmented Generation', 'It is an AI framework that significantly enhances the capabilities of large language models (LLMs) by connecting them to external, authoritative knowledge sources before generating a response. Think', 'a response. Think of it as giving the LLM an "open-book" exam instead of a "closed-book" one.', 'Key Concepts', "Retrieval: The process of searching and pulling relevant documents, data, or snippets from an external knowledge base (like a database, an organization's interna

In [9]:
sample_text = "This is RAG framework. It connects LLMs to external knowledge sources. RAG enhances accuracy and reduces hallucinations."
sample_recursive_chunks = RecursiveCharacterTextSplitter(
    separators=[" "],
    chunk_size=50,
    chunk_overlap=10,
    length_function=len
)

sample_chunks = sample_recursive_chunks.split_text(sample_text)
print(f"Number of sample recursive chunks : {len(sample_chunks)}")
print("\n---\n")
print(sample_chunks[0])

for i, chunk in enumerate(sample_chunks):  # Print all sample chunks
    print(f"\nSample Recursive Chunk {i+1} (length {len(chunk)}):\n{chunk}")


Number of sample recursive chunks : 3

---

This is RAG framework. It connects LLMs to

Sample Recursive Chunk 1 (length 42):
This is RAG framework. It connects LLMs to

Sample Recursive Chunk 2 (length 48):
LLMs to external knowledge sources. RAG enhances

Sample Recursive Chunk 3 (length 45):
enhances accuracy and reduces hallucinations.
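
The idea behind recursive splitting can be sketched in plain Python: split on the coarsest separator first, merge pieces while they fit, and only fall back to a finer separator for a piece that is still too long. This is a simplified illustration under stated assumptions (no overlap, empty pieces dropped); `recursive_split` is a hypothetical function, not LangChain's implementation.

```python
def recursive_split(text, separators, chunk_size):
    """Split text into chunks of at most chunk_size characters, coarsest separator first."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in (p for p in text.split(sep) if p):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits: keep merging
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with the finer separators
            chunks.extend(recursive_split(piece, finer, chunk_size))
    if current:
        chunks.append(current)
    return chunks


demo_text = "aaa bbb ccc\n\nddd eee fff ggg hhh"
demo_chunks = recursive_split(demo_text, ["\n\n", "\n", " ", ""], chunk_size=11)
print(demo_chunks)
# ['aaa bbb ccc', 'ddd eee fff', 'ggg hhh']
```

The first paragraph fits as a whole, so it is never broken up; only the second paragraph is split further, and it is split at spaces instead of mid-word.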


**Method 3: Token-Based Splitter**

In [10]:
token_splitter = TokenTextSplitter(
    chunk_size=100,    # at most 100 tokens per chunk
    chunk_overlap=10   # 10 tokens shared between neighbouring chunks
)  # length_function is omitted: chunk length is measured by the tokenizer
token_chunks = token_splitter.split_text(text)
print(f"Number of token-based chunks: {len(token_chunks)}")
# len(chunk) below is the character count of each chunk, not its token count
for i, chunk in enumerate(token_chunks[:3]):  # Print first 3 token chunks
    print(f"\nToken Chunk {i+1} (length {len(chunk)}):\n{chunk}")

Number of token-based chunks: 5

Token Chunk 1 (length 461):
 RAG stands for Retrieval-Augmented Generation

It is an AI framework that significantly enhances the capabilities of large language models (LLMs) by connecting them to external, authoritative knowledge sources before generating a response. Think of it as giving the LLM an "open-book" exam instead of a "closed-book" one.

Key Concepts

Retrieval: The process of searching and pulling relevant documents, data, or snippets from an external knowledge base (like

Token Chunk 2 (length 462):
, or snippets from an external knowledge base (like a database, an organization's internal documents, or the public internet).
Augmentation: The retrieved information is added to the user's original query as additional context.
Generation: The LLM uses this augmented prompt (the query + the external facts) to generate a more accurate, up-to-date, and grounded response.
Why is RAG Important? RAG helps to overcome several key limitations of tradi
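
The chunk lengths printed above are character counts, which is why they differ even though every chunk holds the same number of tokens. The sketch below shows the same effect with a deliberately naive tokenizer: whitespace-separated words stand in for tokens. That is an assumption made for illustration only; real tokenizers split text into subword units, and `token_window_chunks` is a hypothetical helper.

```python
def token_window_chunks(text, chunk_size, chunk_overlap):
    """Group whitespace-separated words into windows of chunk_size 'tokens'."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    tokens = text.split()  # stand-in tokenizer: one word = one token
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the last token
    return chunks


sentence = "RAG connects large language models to external authoritative knowledge sources"
token_demo = token_window_chunks(sentence, chunk_size=4, chunk_overlap=1)
for c in token_demo:
    print(f"{len(c.split())} tokens, {len(c)} characters: {c}")
```

Every chunk has four tokens, yet the character lengths are not all equal, which mirrors the 461 and 462 character chunks in the output above.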

### Differences Between CharacterTextSplitter, RecursiveCharacterTextSplitter, and TokenTextSplitter

| Splitter Type                      | Splitting Strategy                | Use Case / Strengths                                   | Parameters & Customization                |
|------------------------------------|-----------------------------------|--------------------------------------------------------|-------------------------------------------|
| **CharacterTextSplitter**          | Splits text on one separator string (e.g., newline, period, or space), then merges pieces up to the chunk size. | Simple, fast splitting for structured text (e.g., paragraphs, lines). | `separator`, `chunk_size`, `chunk_overlap`, `length_function` |
| **RecursiveCharacterTextSplitter** | Tries multiple separators in order of priority, recursively splitting to best fit chunk size. | Handles unstructured or complex text, produces more balanced chunk sizes. | `separators` (list), `chunk_size`, `chunk_overlap`, `length_function` |
| **TokenTextSplitter**              | Splits text based on token count (not character count), using a tokenizer. | Useful for LLMs and NLP tasks where token limits matter (e.g., OpenAI models). | `encoding_name` or `model_name`, `chunk_size` (tokens), `chunk_overlap` (tokens) |

**Summary:**
- **CharacterTextSplitter**: Best for simple, structured text with clear delimiters. Fastest option. May produce uneven or oversized chunks if the separator is rare in the text.
- **RecursiveCharacterTextSplitter**: Best for robust, multi-separator splitting and balanced chunks. Default choice for varied text. Slightly slower due to recursion.
- **TokenTextSplitter**: Best for token-aware splitting, ideal for LLM input constraints. More accurate for embeddings and model inputs. Slower due to tokenization overhead.