# Module Overview

This module covers **parsing and ingesting data for RAG systems**, from basic text files to complex PDFs and databases. We use **LangChain v0.3** and work through each technique with practical examples.

---

## Table of Contents

1. [Introduction to Data Ingestion](#introduction-to-data-ingestion)  
2. [Text Files (.txt)](#text-files-txt)  
3. [Markdown Files (.md)](#markdown-files-md)  
4. [PDF Documents](#pdf-documents)  
5. [Microsoft Word Documents](#microsoft-word-documents)  
6. [CSV and Excel Files](#csv-and-excel-files)  
7. [JSON and Structured Data](#json-and-structured-data)  
8. [Web Scraping](#web-scraping)  
9. [Databases (SQL)](#databases-sql)  
10. [Audio and Video Transcripts](#audio-and-video-transcripts)  
11. [Advanced Techniques](#advanced-techniques)  
12. [Best Practices](#best-practices)



### Introduction to Data Ingestion

In [4]:
import os
from typing import Any, Dict, List
import pandas as pd


In [7]:
from langchain_core.documents import Document
from langchain_text_splitters import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
    TokenTextSplitter,
)

print("✅ Imports working")




✅ Imports working


## Understanding document structure in LangChain

In [11]:
# Create a simple Document with content and metadata
doc = Document(
    page_content="LangChain is a framework for building applications powered by large language models.",
    metadata={
        "source": "langchain_docs",
        "page": 1,
        "author": "LangChain Team"
    }
)
print("Document Structure:")
print(f"Content: {doc.page_content}")
print(f"Metadata: {doc.metadata}")

Document Structure:
Content: LangChain is a framework for building applications powered by large language models.
Metadata: {'source': 'langchain_docs', 'page': 1, 'author': 'LangChain Team'}


# page_content: What the model reads

## Purpose:

This is the actual text that gets:

- split into chunks

- converted into embeddings

- retrieved and sent to the LLM

## Key idea:

If it should influence the model’s answer, it belongs in `page_content`.

Examples of what belongs here:

- Paragraphs from PDFs

- Website text

- Transcripts

- Documentation
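
The split step described above can be sketched in a few lines. This is a minimal, hand-rolled illustration rather than one of LangChain's splitters: `Doc` is a stand-in for `Document`, and `split_doc`, `chunk_size`, `overlap`, and the `start_index` key are names assumed for this example. It shows that only `page_content` is cut up, while every chunk keeps a copy of the parent's metadata.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Doc:
    """Stand-in for LangChain's Document (an assumption for this sketch)."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def split_doc(doc: Doc, chunk_size: int = 40, overlap: int = 10) -> List[Doc]:
    """Cut page_content into fixed-size character windows that overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(doc.page_content), step):
        text = doc.page_content[start:start + chunk_size]
        # Each chunk gets its own copy of the metadata plus its offset,
        # so a retrieved chunk can still be traced back to its source.
        chunks.append(Doc(text, {**doc.metadata, "start_index": start}))
        if start + chunk_size >= len(doc.page_content):
            break  # the text is fully covered, stop before a redundant tail chunk
    return chunks


doc = Doc(
    page_content="LangChain is a framework for building applications powered by large language models.",
    metadata={"source": "langchain_docs", "page": 1},
)
chunks = split_doc(doc)
for chunk in chunks:
    print(chunk.metadata["start_index"], repr(chunk.page_content))
```

The overlap keeps text that straddles a chunk boundary visible in both neighbouring chunks, at the cost of some duplicated characters.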

# metadata: Where the text came from

`metadata: dict[str, Any]`


## Purpose:

- Stores contextual information about the text

- Not embedded into vectors

- Used for:

    - filtering search results

    - tracing answers

    - citations and debugging

## Key idea:

Metadata helps you and the system, not the language model directly.

## 🔍 Why metadata matters in RAG

In real RAG systems, metadata enables:

- Source attribution ("This answer comes from page 3 of rag_whitepaper.pdf")

- Filtered retrieval ("Only search documents from 2024")

- Debugging hallucinations ("Which document produced this answer?")
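
Filtered retrieval and source attribution can be sketched without a vector store. The snippet below is illustrative only: `Doc` is a stand-in for `Document`, `filter_by_metadata` and `cite` are hypothetical helpers, and the `year` key and the second sample file (`ir_history.pdf`) are made-up sample data. Real vector stores apply such filters inside the search call, and the exact filter syntax depends on the store.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Doc:
    """Stand-in for LangChain's Document (an assumption for this sketch)."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def filter_by_metadata(docs: List[Doc], **required: Any) -> List[Doc]:
    """Keep only documents whose metadata matches every required key/value."""
    return [
        d for d in docs
        if all(d.metadata.get(key) == value for key, value in required.items())
    ]


def cite(doc: Doc) -> str:
    """Build a human-readable source attribution from metadata."""
    source = doc.metadata.get("source", "unknown source")
    page = doc.metadata.get("page")
    return f"{source}, page {page}" if page is not None else source


docs = [
    Doc("RAG combines retrieval with generation.",
        {"source": "rag_whitepaper.pdf", "page": 3, "year": 2024}),
    Doc("Early search engines relied on keyword matching.",
        {"source": "ir_history.pdf", "page": 12, "year": 2019}),
]

# "Only search documents from 2024"
recent = filter_by_metadata(docs, year=2024)
for d in recent:
    # "This answer comes from page 3 of rag_whitepaper.pdf"
    print(f"{d.page_content}  [{cite(d)}]")
```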

# TextLoader - Read a single file

In [23]:
from langchain_community.document_loaders import TextLoader
loader = TextLoader(file_path=r"D:\RAG-udemy\data\text_file\nlp.txt", encoding="utf8")
documents = loader.load()
print(f"Number of documents loaded: {len(documents)}")
print("Sample document content:")
print(documents[0].page_content[:500])  # Print first 500 characters of the first document      
print(f"Metadata: {documents[0].metadata}")

Number of documents loaded: 1
Sample document content:
NLP usually stands for Natural Language Processing.

It’s a field of artificial intelligence (AI) that helps computers understand, interpret, and generate human language (text or speech).

What NLP is used for

You’ve probably already used NLP without realizing it:

Chatbots & voice assistants (like Siri or ChatGPT)

Spam filters in email

Autocorrect & grammar checkers

Translation apps (e.g., Google Translate)

Search engines

Sentiment analysis (detecting emotions in text)

What NLP tries to 
Metadata: {'source': 'D:\\RAG-udemy\\data\\text_file\\nlp.txt'}


# 📄 TextLoader (LangChain) – Theory Notes

`TextLoader` is a **document loader** in LangChain designed to read a **single text file** and convert it into a `Document` object.  
It is typically used when working with individual `.txt` files in **RAG (Retrieval-Augmented Generation)** pipelines or smaller datasets.

---

## 🧠 Conceptual Overview

**Purpose:**

- Load a **single file** into LangChain’s standard `Document` format.
- Standardize input for downstream RAG processes:
  - Text splitting
  - Embedding generation
  - Storage in a vector database
- Automatically record the file path in **metadata** under the `source` key.

**Core Idea:**  
> `Document` = `page_content` (text from the file) + `metadata` (context like file path)

---

## ✅ Advantages

- **Simple and easy to use:** Load any single `.txt` file in one line of code.
- **Automatic metadata:** Keeps file path for traceability.
- **Lightweight:** Ideal for small-scale experiments or learning purposes.
- **Flexible encoding:** Supports custom encodings (e.g., `utf-8`, `utf-16`).

---

## ❌ Disadvantages / Limitations

- **Single file only:** Cannot batch-load multiple files; use `DirectoryLoader` for that.
- **No glob support:** Can only load the exact file path provided.
- **Limited error handling:** If the file is missing or unreadable, it raises an error.
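
One way to work around the error-handling limitation is to wrap the read in a small defensive helper. The sketch below uses only the standard library and is not `TextLoader` itself: `load_text_safely`, its `encodings` parameter, and the returned dictionary layout are assumptions for illustration. `TextLoader` also accepts `autodetect_encoding=True`, which may be the simpler choice when only the encoding is in doubt.

```python
import tempfile
from pathlib import Path
from typing import Any, Dict, Optional, Sequence


def load_text_safely(
    path: str, encodings: Sequence[str] = ("utf-8", "utf-16")
) -> Optional[Dict[str, Any]]:
    """Return page_content and metadata for one file, or None if it cannot be read."""
    file_path = Path(path)
    if not file_path.is_file():
        print(f"Skipping missing file: {file_path}")
        return None
    for encoding in encodings:
        try:
            text = file_path.read_text(encoding=encoding)
        except UnicodeError:
            continue  # wrong encoding guess, try the next candidate
        return {
            "page_content": text,
            "metadata": {"source": str(file_path), "encoding": encoding},
        }
    print(f"Skipping undecodable file: {file_path}")
    return None


# Demo on a throwaway UTF-16 file so the snippet runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "nlp.txt"
    sample.write_text("NLP stands for Natural Language Processing.", encoding="utf-16")
    loaded = load_text_safely(str(sample))
    missing = load_text_safely(str(Path(tmp) / "does_not_exist.txt"))

print(loaded["metadata"]["encoding"])
print(missing)
```

Trying encodings in a fixed order is fragile for arbitrary files, because a wrong codec can sometimes decode bytes without raising an error. Keep the candidate list short and specific to your data.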

---

## 📝 Use in RAG Pipelines

Typical workflow:

1. **Load a single document**
   ```python
   from langchain_community.document_loaders import TextLoader

   loader = TextLoader(file_path="data/text_file/nlp.txt", encoding="utf-8")
   documents = loader.load()
   ```


# DirectoryLoader - Suitable for multiple text files

In [24]:
from langchain_community.document_loaders import DirectoryLoader
loader = DirectoryLoader(
    r"D:\RAG-udemy\data\text_file",
    glob="*.txt",
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf8"},
    show_progress=True
)
documents = loader.load()
print(f"Loaded {len(documents)} documents")
for i, doc in enumerate(documents):
    print(f"\nDocument {i+1}:")
    print(f" Source: {doc.metadata['source']}")
    print(f" Length: {len(doc.page_content)} characters")

100%|██████████| 2/2 [00:00<00:00, 658.19it/s]

Loaded 2 documents

Document 1:
 Source: D:\RAG-udemy\data\text_file\ML.txt
 Length: 916 characters

Document 2:
 Source: D:\RAG-udemy\data\text_file\nlp.txt
 Length: 1059 characters





# 📂 DirectoryLoader (LangChain) – Theory Notes

`DirectoryLoader` is a **document loader** in LangChain designed to read multiple files from a directory (and optionally its subdirectories) and convert them into `Document` objects.  
It is primarily used in **RAG (Retrieval-Augmented Generation)** pipelines for batch ingestion of text data.

---

## 🧠 Conceptual Overview

**Purpose:**

- Convert a folder of files into LangChain `Document` objects.
- Standardizes input data for downstream processes:
  - Text splitting
  - Embedding generation
  - Vector database insertion
- Preserves metadata such as the **file path** for traceability.

**Core Idea:**  
> `Document` = `page_content` (text) + `metadata` (context like file path or source)

---

## ✅ Advantages

- **Batch loading of files:** Loads multiple documents at once instead of one by one.
- **Glob pattern support:** Filters files by pattern (e.g., `*.txt`, `*.pdf`).
- **Progress tracking:** `show_progress=True` displays a progress bar while loading (it relies on `tqdm`).
- **Recursive scanning:** With `recursive=True`, traverses subdirectories and loads every matching file.
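
To see what a glob pattern matches with and without recursive scanning, the sketch below uses plain `pathlib` on a temporary folder. It assumes `DirectoryLoader` follows the same glob conventions as `pathlib`. The folder layout and file names are made up for the demo.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical layout: two files at the top level, one in a subfolder.
    (root / "nlp.txt").write_text("nlp notes", encoding="utf-8")
    (root / "readme.md").write_text("# readme", encoding="utf-8")
    (root / "archive").mkdir()
    (root / "archive" / "ml.txt").write_text("ml notes", encoding="utf-8")

    # Non-recursive: the pattern is matched in the top-level folder only.
    flat = sorted(p.relative_to(root).as_posix() for p in root.glob("*.txt"))
    # Recursive: the same pattern is matched in every subfolder as well.
    deep = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.txt"))

print(flat)  # ['nlp.txt']
print(deep)  # ['archive/ml.txt', 'nlp.txt']
```

The Markdown file is ignored in both cases because it does not match `*.txt`.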

---

## ❌ Disadvantages / Limitations

- **One loader class per instance:** Every matched file is parsed with the same `loader_cls`, so mixed file types need one `DirectoryLoader` per type.
- **Limited error handling:** By default a corrupted or unreadable file aborts the whole load; `silent_errors=True` logs the failure and skips the file instead.
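
Both limitations can be pictured as a small dispatch table: one reader per file extension, with failures collected instead of aborting the run. This is a standard-library sketch of the idea, not LangChain code. `READERS`, `read_txt`, `read_csv`, and `load_mixed` are hypothetical names. With LangChain you would typically create one `DirectoryLoader` per glob pattern and concatenate the results.

```python
import csv
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, List, Tuple


def read_txt(path: Path) -> str:
    return path.read_text(encoding="utf-8")


def read_csv(path: Path) -> str:
    # Flatten each row into one "column: value" line so the text is embeddable.
    with path.open(newline="", encoding="utf-8") as handle:
        rows = csv.DictReader(handle)
        return "\n".join(", ".join(f"{k}: {v}" for k, v in row.items()) for row in rows)


# One reader per extension; unknown extensions are ignored.
READERS: Dict[str, Callable[[Path], str]] = {".txt": read_txt, ".csv": read_csv}


def load_mixed(folder: Path) -> Tuple[List[Dict[str, Any]], List[str]]:
    """Load every supported file under folder; collect errors instead of raising."""
    docs, errors = [], []
    for path in sorted(folder.rglob("*")):
        reader = READERS.get(path.suffix.lower())
        if reader is None or not path.is_file():
            continue
        try:
            docs.append({"page_content": reader(path), "metadata": {"source": path.name}})
        except (OSError, UnicodeError, csv.Error) as exc:
            errors.append(f"{path.name}: {exc}")  # skip the bad file, keep going
    return docs, errors


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "notes.txt").write_text("RAG needs clean text.", encoding="utf-8")
    (folder / "scores.csv").write_text("name,score\nada,9\n", encoding="utf-8")
    (folder / "broken.txt").write_bytes(b"\xff\xfe\xfd")  # not valid UTF-8
    docs, errors = load_mixed(folder)

print([d["metadata"]["source"] for d in docs])
print(errors)
```

Returning the error list alongside the documents makes skipped files visible, which matters when a missing source later explains a poor answer.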

---

## 📝 Use in RAG Pipelines

Typical workflow:

1. **Load documents**
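   ```python
   from langchain_community.document_loaders import DirectoryLoader, TextLoader

   # loader_cls=TextLoader parses each match as plain text (the default loader needs `unstructured`)
   loader = DirectoryLoader(
       "data/text_files/", glob="*.txt", loader_cls=TextLoader, recursive=True
   )
   documents = loader.load()
   ```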
