# **Module overview**

This module covers parsing and ingesting data for RAG (retrieval-augmented generation) systems, from plain text files to complex PDFs and databases. We'll use LangChain and work through each technique with practical examples.

Content:
* Introduction to Data Ingestion
    * Text Files (.txt)
    * PDF documents
    * Word documents
    * CSV and Excel files
    * JSON and structured data
    * Web scraping
    * Databases (SQL)
    * Audio and video transcripts
* Advanced Techniques
* Best practices

## **Data Ingestion**

In [10]:
import os 
from typing import List, Dict, Any
import pandas as pd

from langchain_core.documents import Document
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    TokenTextSplitter
)

#### Understanding document structure in LangChain

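This is an important step before applying any technique: LangChain represents each piece of ingested content as a `Document`, which pairs the text (`page_content`) with a `metadata` dictionary.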

In [None]:
# Create a simple document
doc = Document(
    page_content="This is the main text content that will be embedded and searched",
    metadata={
        "source": "example.txt",
        "page": 1,
        "author": "AngelO",
        "date_created": "2025-08-23",
        "custom_field": "any_value"
    }
)

print("Document Structure")
print(f"Content: {doc.page_content}")
print(f"Metadata: {doc.metadata}")
doc  # last expression in the cell, so the notebook displays its repr

Document Structure
Content: This is the main text content that will be embedded and searched
Metadata: {'source': 'example.txt', 'page': 1, 'author': 'AngelO', 'date_created': '2025-08-23', 'custom_field': 'any_value'}

Document(metadata={'source': 'example.txt', 'page': 1, 'author': 'AngelO', 'date_created': '2025-08-23', 'custom_field': 'any_value'}, page_content='This is the main text content that will be embedded and searched')
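
In practice the metadata dictionary is rarely typed by hand; it is usually assembled from the file being ingested. The following is a minimal sketch of that idea using only the standard library. The helper name `build_file_metadata` and the keys `size_bytes` and `date_modified` are hypothetical choices for illustration, not part of LangChain, which accepts any dictionary as `metadata`.

```python
import os
import tempfile
from datetime import datetime, timezone
from typing import Any, Dict


def build_file_metadata(path: str, **extra: Any) -> Dict[str, Any]:
    """Assemble a metadata dict for a file (hypothetical helper, not a LangChain API)."""
    stat = os.stat(path)
    metadata: Dict[str, Any] = {
        # File name only, mirroring the "source" key used in the example above
        "source": os.path.basename(path),
        # Size on disk in bytes (illustrative key name)
        "size_bytes": stat.st_size,
        # Last-modified date as an ISO string in UTC (illustrative key name)
        "date_modified": datetime.fromtimestamp(
            stat.st_mtime, tz=timezone.utc
        ).date().isoformat(),
    }
    # Caller-supplied fields (author, page, ...) are merged last,
    # so they override the defaults if the same key is given
    metadata.update(extra)
    return metadata


# Small demo on a temporary file so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "example.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("This is the main text content")
    demo_metadata = build_file_metadata(demo_path, author="AngelO", page=1)
    print(demo_metadata)
```

The resulting dictionary can be passed straight to the constructor shown earlier, for example `Document(page_content=text, metadata=build_file_metadata(path, author="AngelO"))`, where `text` and `path` stand for the file's contents and location.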