# **Module overview**

This module covers everything about parsing and ingesting data for RAG systems, from basic text files to complex PDFs and databases. We'll use LangChain and explore each technique with practical and useful examples.

Content:
* Introduction to Data Ingestion
    * Text Files (.txt)
    * PDF documents
    * Word documents
    * CSV and Excel files
    * JSON and structured data
    * Web scraping
    * Databases (SQL)
    * Audio and video transcripts
* Advanced Techniques
* Best practices

## **Data Ingestion**

In [2]:
import os 
from typing import List, Dict, Any
import pandas as pd

from langchain_core.documents import Document
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    TokenTextSplitter
)

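#### Understanding document structure in LangChain

Understanding this structure is an important step before applying any technique. A LangChain `Document` has two parts: `page_content` (the text itself) and `metadata` (a dictionary describing that text).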

Metadata is really important for:
- Filtering search results
- Tracking document sources
- Providing context in responses
- Debugging and auditing


In [3]:
# Create a simple document
doc = Document(
    page_content = "This is the main text content that will be embedded and searched",
    metadata = {
        "source":"example.txt",
        "page":1,
        "author":"AngelO",
        "date_created":"2025-08-23",
        "custom_field":"any_value"
    }
)

print("Document Structure")
print(f"Content: {doc.page_content}")
print(f"Metadata: {doc.metadata}")


Document Structure
Content: This is the main text content that will be embedded and searched
Metadata: {'source': 'example.txt', 'page': 1, 'author': 'AngelO', 'date_created': '2025-08-23', 'custom_field': 'any_value'}
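
To illustrate the first use listed above (filtering search results), here is a minimal sketch that filters a list of documents by their metadata. The helper `filter_by_metadata` and the sample documents are hypothetical and not part of LangChain. If `langchain_core` is not installed, the snippet falls back to a small stand-in `Document` dataclass with the same two fields, which is an assumption made only so the example runs anywhere.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

try:
    from langchain_core.documents import Document
except ImportError:
    # Assumption: minimal stand-in with the same two fields, for illustration only
    @dataclass
    class Document:
        page_content: str
        metadata: Dict[str, Any] = field(default_factory=dict)


def filter_by_metadata(docs: List[Document], **criteria: Any) -> List[Document]:
    """Keep only the documents whose metadata matches every key/value given."""
    return [
        d for d in docs
        if all(d.metadata.get(key) == value for key, value in criteria.items())
    ]


# Hypothetical sample documents
docs = [
    Document(page_content="Intro to MCP", metadata={"source": "MCP.txt", "page": 1}),
    Document(page_content="MCP details", metadata={"source": "MCP.txt", "page": 2}),
    Document(page_content="Agentic AI", metadata={"source": "AgenticAI.txt", "page": 1}),
]

mcp_docs = filter_by_metadata(docs, source="MCP.txt")
print([d.metadata["page"] for d in mcp_docs])  # prints [1, 2]
```

In a real RAG pipeline this kind of filter is usually applied by the vector store at query time, but the principle is the same: the filter can only use fields that were stored in `metadata` during ingestion.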


### **Case 1: text files**

This is the simplest case: loading plain text files so they can later be split into chunks.

In [12]:
# TextLoader - read a single file
from langchain_community.document_loaders import TextLoader

# Load a single text file
loader = TextLoader("../data/text_files/MCP.txt", encoding='utf-8')
documents = loader.load()

# Verify the loaders and documents
print(f"Loaded {len(documents)} document")
print(type(documents))
print(documents)

Loaded 1 document
<class 'list'>
[Document(metadata={'source': '../data/text_files/MCP.txt'}, page_content="Model Context Protocol (MCP): A Comprehensive IntroductionThe Model Context Protocol (MCP) represents one of the most significant innovations in the conversational artificial intelligence ecosystem, establishing an open standard that revolutionizes how large language models (LLMs) interact with external data sources and specialized tools. Developed by Anthropic in 2024, MCP emerges as a response to a critical need in the AI field: the ability to provide language models with secure, structured, and efficient access to information and functionalities that extend beyond their base training.Historical Context and the NeedDuring the early years of LLM development, one of the primary limitations was their static nature. These models, while extremely powerful in natural language processing and reasoning, were fundamentally constrained by the information present in their training data, w

It is also possible to load multiple files from a directory.

In [21]:
from langchain_community.document_loaders import DirectoryLoader

# Load all the text files from the directory
dir_loader = DirectoryLoader(
    "../data/text_files",
    glob="**/*.txt", # All the files with .txt extension
    loader_cls=TextLoader, # Loader class to use
    loader_kwargs= {'encoding':'utf-8'},
    show_progress=True
)

documents = dir_loader.load()

print(f"Directory loaded with {len(documents)} files")
for i, doc in enumerate(documents):
    print(f"\n Document {i+1}")
    print(f"\n Source: {doc.metadata['source']}")
    print(f"\n Length: {len(doc.page_content)} characters")

100%|██████████| 2/2 [00:00<00:00, 1695.01it/s]

Directory loaded with 2 files

 Document 1

 Source: ../data/text_files/AgenticAI.txt

 Length: 1361 characters

 Document 2

 Source: ../data/text_files/MCP.txt

 Length: 5436 characters

#### **Analysis of Text Files**

Directory loader characteristics:
- Advantages:
    * Load multiple files at once
    * Supports glob patterns
    * Progress tracking
    * Recursive directory scanning
- Disadvantages:
    * All files must be the same type, because a single loader class is applied to every matched file
    * Limited error handling per file
    * Can be memory intensive for large directories
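
The last two disadvantages can be softened by reading files one at a time and handling errors per file. The sketch below shows that idea with the standard library only. It is not how `DirectoryLoader` is implemented, and the names `iter_text_files` and `failures` are hypothetical. It uses the same `**/*.txt` glob pattern as above, yields one file at a time instead of building a full list, and records unreadable files instead of aborting.

```python
import tempfile
from pathlib import Path
from typing import Iterator, List, Optional, Tuple


def iter_text_files(
    root: str,
    pattern: str = "**/*.txt",
    encoding: str = "utf-8",
    failures: Optional[List[Tuple[str, str]]] = None,
) -> Iterator[Tuple[str, str]]:
    """Yield (source_path, text) lazily, skipping files that cannot be read."""
    for path in sorted(Path(root).glob(pattern)):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding=encoding)
        except (OSError, UnicodeDecodeError) as exc:
            # Record the failure and continue with the next file
            if failures is not None:
                failures.append((str(path), str(exc)))
            continue
        yield str(path), text


# Demo on a temporary directory with one deliberately broken file
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("first file", encoding="utf-8")
    (root / "nested").mkdir()
    (root / "nested" / "b.txt").write_text("second file", encoding="utf-8")
    (root / "bad.txt").write_bytes(b"\xff\xfe not valid utf-8")

    failures: List[Tuple[str, str]] = []
    loaded = list(iter_text_files(tmp, failures=failures))

print(f"Loaded {len(loaded)} files, skipped {len(failures)}")
```

Because `iter_text_files` is a generator, a caller can process each file as it arrives (for example, wrap it in a `Document` and split it) without holding the whole directory in memory. The demo calls `list(...)` only to make the result easy to inspect.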