### 🎯 Module Overview
This module covers everything you need to know about parsing and ingesting data for RAG systems, from basic text files to complex PDFs and databases. We'll use LangChain v0.3 and explore each technique with practical examples.

### Table of Contents

- Introduction to Data Ingestion
- Text Files (.txt)
- PDF Documents
- Microsoft Word Documents
- CSV and Excel Files
- JSON and Structured Data
- Web Scraping
- Databases (SQL)
- Audio and Video Transcripts
- Advanced Techniques
- Best Practices

### Introduction to Data Ingestion

In [1]:
import os
from typing import List, Dict, Any
import pandas as pd

In [2]:
from langchain_core.documents import Document
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    TokenTextSplitter
)
print("Set up complete")

Set up complete


### Understanding Document Structure in LangChain

In [3]:
# Create a simple document
doc=Document(
    page_content="This is the main text content that will be embedded and searched.",
    metadata={
        "source":"example.txt",
        "page":1,
        "author":"Laavanjan",
        "date_created":"2024-01-01",
        "custom_field":"any_value"
    }
)

print("Document Structure")
print(f'Page Content: {doc.page_content}')
print(f'Metadata: {doc.metadata}')

# Why metadata matters:
print("\n📝 Metadata is crucial for:")
print("- Filtering search results")
print("- Tracking document sources")
print("- Providing context in responses")
print("- Debugging and auditing")

Document Structure
Page Content: This is the main text content that will be embedded and searched.
Metadata: {'source': 'example.txt', 'page': 1, 'author': 'Laavanjan', 'date_created': '2024-01-01', 'custom_field': 'any_value'}

📝 Metadata is crucial for:
- Filtering search results
- Tracking document sources
- Providing context in responses
- Debugging and auditing


In [4]:
print(doc)
print(type(doc))
print(doc.metadata)
print(doc.page_content)

page_content='This is the main text content that will be embedded and searched.' metadata={'source': 'example.txt', 'page': 1, 'author': 'Laavanjan', 'date_created': '2024-01-01', 'custom_field': 'any_value'}
<class 'langchain_core.documents.base.Document'>
{'source': 'example.txt', 'page': 1, 'author': 'Laavanjan', 'date_created': '2024-01-01', 'custom_field': 'any_value'}
This is the main text content that will be embedded and searched.
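To make the "filtering search results" point concrete, here is a minimal sketch of metadata filtering. It uses plain dictionaries shaped like a `Document` (`page_content` plus `metadata`) so it runs without any dependencies. The helper name `filter_by_metadata` and the sample records are hypothetical, and real vector stores expose their own filter syntax.

```python
from typing import Any, Dict, List

# Plain-dict stand-ins for Document objects (hypothetical sample data).
docs: List[Dict[str, Any]] = [
    {"page_content": "Intro to Python", "metadata": {"source": "python_intro.txt", "page": 1}},
    {"page_content": "ML basics", "metadata": {"source": "machine_learning.txt", "page": 1}},
    {"page_content": "More Python", "metadata": {"source": "python_intro.txt", "page": 2}},
]


def filter_by_metadata(records: List[Dict[str, Any]], **criteria: Any) -> List[Dict[str, Any]]:
    """Keep records whose metadata matches every key/value pair in criteria."""
    return [
        r for r in records
        if all(r["metadata"].get(k) == v for k, v in criteria.items())
    ]


python_docs = filter_by_metadata(docs, source="python_intro.txt")
print([d["page_content"] for d in python_docs])
```

Because every chunk carries its source in `metadata`, the same pattern can narrow a search to one file, one page, or one author before any similarity ranking happens.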


### Text Files (.txt) - The Simplest Case

In [5]:
os.makedirs("data/TextFiles", exist_ok=True)

In [9]:
sample_texts={
    "data/TextFiles/python_intro.txt":"""Python Programming Introduction

Python is a high-level, interpreted programming language known for its simplicity and readability.
Created by Guido van Rossum and first released in 1991, Python has become one of the most popular
programming languages in the world.

Key Features:
- Easy to learn and use
- Extensive standard library
- Cross-platform compatibility
- Strong community support

Python is widely used in web development, data science, artificial intelligence, and automation.""",
    
    "data/TextFiles/machine_learning.txt": """Machine Learning Basics

Machine learning is a subset of artificial intelligence that enables systems to learn and improve
from experience without being explicitly programmed. It focuses on developing computer programs
that can access data and use it to learn for themselves.

Types of Machine Learning:
1. Supervised Learning: Learning with labeled data
2. Unsupervised Learning: Finding patterns in unlabeled data
3. Reinforcement Learning: Learning through rewards and penalties

Applications include image recognition, speech processing, and recommendation systems
    
    
    """

}

for filepath,content in sample_texts.items():
    with open(filepath,'w',encoding="utf-8") as f:
        f.write(content)

print("✅ Sample text files created!")

✅ Sample text files created!


### TextLoader - Read a Single File

In [12]:
from langchain_community.document_loaders import TextLoader

# Load a single text file
loader = TextLoader('data/TextFiles/python_intro.txt', encoding='utf-8')

In [13]:
loader

<langchain_community.document_loaders.text.TextLoader at 0x199dc546dd0>

In [14]:
doc=loader.load()
print(type(doc))
print(doc)

<class 'list'>
[Document(metadata={'source': 'data/TextFiles/python_intro.txt'}, page_content='Python Programming Introduction\n\nPython is a high-level, interpreted programming language known for its simplicity and readability.\nCreated by Guido van Rossum and first released in 1991, Python has become one of the most popular\nprogramming languages in the world.\n\nKey Features:\n- Easy to learn and use\n- Extensive standard library\n- Cross-platform compatibility\n- Strong community support\n\nPython is widely used in web development, data science, artificial intelligence, and automation.')]


In [18]:
from langchain_community.document_loaders import TextLoader

loader=TextLoader(file_path="data/TextFiles/machine_learning.txt", encoding="utf-8")
doc=loader.load()

print(len(doc))
print(f"content: {doc[0].page_content}")
print(f"metadata: {doc[0].metadata}")


1
content: Machine Learning Basics

Machine learning is a subset of artificial intelligence that enables systems to learn and improve
from experience without being explicitly programmed. It focuses on developing computer programs
that can access data and use it to learn for themselves.

Types of Machine Learning:
1. Supervised Learning: Learning with labeled data
2. Unsupervised Learning: Finding patterns in unlabeled data
3. Reinforcement Learning: Learning through rewards and penalties

Applications include image recognition, speech processing, and recommendation systems


    
metadata: {'source': 'data/TextFiles/machine_learning.txt'}
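The loaders above read files as UTF-8. Files of unknown origin are not always UTF-8, so a common approach is to try a short list of encodings in order. The sketch below shows that idea with the standard library only; `read_text_with_fallback` and its default encoding list are assumptions made for illustration, not part of LangChain (`TextLoader` has its own `autodetect_encoding` option, which is worth checking in the version you have installed).

```python
import os
import tempfile
from typing import Sequence, Tuple


def read_text_with_fallback(
    path: str,
    encodings: Sequence[str] = ("utf-8", "cp1252", "latin-1"),
) -> Tuple[str, str]:
    """Return (text, encoding_used), trying each encoding in order."""
    for enc in encodings:
        try:
            with open(path, "r", encoding=enc) as f:
                return f.read(), enc
        except UnicodeDecodeError:
            continue  # wrong guess, try the next encoding
    raise ValueError(f"Could not decode {path} with any of {list(encodings)}")


# Demo: write a file in a legacy encoding, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    legacy_path = os.path.join(tmp, "legacy.txt")
    with open(legacy_path, "w", encoding="cp1252") as f:
        f.write("café")
    text, used = read_text_with_fallback(legacy_path)

print(repr(text), used)
```

Recording the encoding that finally worked (for example in the document's metadata) makes later debugging of garbled characters much easier.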


### DirectoryLoader - Multiple Text Files

In [25]:
# DirectoryLoader
from langchain_community.document_loaders import DirectoryLoader

# Load all the text files in a directory
dir_loader = DirectoryLoader(
    path="data/TextFiles",
    glob="*.txt",  # pattern to match files
    loader_cls=TextLoader,  # loader class applied to each matched file
    loader_kwargs={'encoding': 'utf-8'},  # passed to TextLoader so files are read as UTF-8
    show_progress=True,  # show a progress bar
)

print(dir_loader)

<langchain_community.document_loaders.directory.DirectoryLoader object at 0x00000199DC73EC10>


In [26]:
documents=dir_loader.load()

100%|██████████| 2/2 [00:00<00:00, 365.09it/s]




In [28]:
print(f"Loaded {len(documents)} documents")
for i,doc in enumerate(documents):
    print(f"\nDocument {i+1}:")
    print(f" Source: {doc.metadata['source']}")
    print(f' Length: {len(doc.page_content)}')

# 📊 Analysis
print("\n📊 DirectoryLoader Characteristics:")
print("✅ Advantages:")
print("  - Loads multiple files at once")
print("  - Supports glob patterns")
print("  - Progress tracking")
print("  - Recursive directory scanning")

print("\n❌ Disadvantages:")
print("  - All matched files go through the same loader class")
print("  - Limited error handling per file")
print("  - Can be memory intensive for large directories")

Loaded 2 documents

Document 1:
 Source: data\TextFiles\machine_learning.txt
 Length: 575

Document 2:
 Source: data\TextFiles\python_intro.txt
 Length: 489

📊 DirectoryLoader Characteristics:
✅ Advantages:
  - Loads multiple files at once
  - Supports glob patterns
  - Progress tracking
  - Recursive directory scanning

❌ Disadvantages:
  - All matched files go through the same loader class
  - Limited error handling per file
  - Can be memory intensive for large directories
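The "limited error handling per file" point can be illustrated with a small standard-library sketch that walks a directory recursively and records failures instead of aborting on the first bad file. The function `load_text_dir` and the dict layout are hypothetical stand-ins, not LangChain's implementation; `DirectoryLoader` itself offers `recursive` and `silent_errors` options that cover part of this ground.

```python
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Tuple


def load_text_dir(
    root: str, pattern: str = "*.txt"
) -> Tuple[List[Dict[str, Any]], List[Tuple[str, str]]]:
    """Recursively read matching files; collect failures instead of aborting."""
    loaded: List[Dict[str, Any]] = []
    failed: List[Tuple[str, str]] = []
    for path in sorted(Path(root).rglob(pattern)):
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError) as exc:
            failed.append((str(path), type(exc).__name__))
            continue
        loaded.append({"page_content": text, "metadata": {"source": str(path)}})
    return loaded, failed


# Demo: two good files (one nested) and one file that is not valid UTF-8.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    (root / "a.txt").write_text("alpha", encoding="utf-8")
    (root / "nested" / "b.txt").write_text("beta", encoding="utf-8")
    (root / "bad.txt").write_bytes(b"\xff\xfe\x00broken")
    loaded, failed = load_text_dir(tmp)

print(f"Loaded {len(loaded)} files, {len(failed)} failed")
for source, error in failed:
    print(f"  skipped {Path(source).name}: {error}")
```

Returning the failures alongside the loaded documents keeps the ingestion run going while still leaving an audit trail of what was skipped and why.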


### Text Splitting Strategies

In [29]:
from langchain_text_splitters import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
    TokenTextSplitter
)

print(documents)

[Document(metadata={'source': 'data\\TextFiles\\machine_learning.txt'}, page_content='Machine Learning Basics\n\nMachine learning is a subset of artificial intelligence that enables systems to learn and improve\nfrom experience without being explicitly programmed. It focuses on developing computer programs\nthat can access data and use it to learn for themselves.\n\nTypes of Machine Learning:\n1. Supervised Learning: Learning with labeled data\n2. Unsupervised Learning: Finding patterns in unlabeled data\n3. Reinforcement Learning: Learning through rewards and penalties\n\nApplications include image recognition, speech processing, and recommendation systems\n\n\n    '), Document(metadata={'source': 'data\\TextFiles\\python_intro.txt'}, page_content='Python Programming Introduction\n\nPython is a high-level, interpreted programming language known for its simplicity and readability.\nCreated by Guido van Rossum and first released in 1991, Python has become one of the most popular\nprogra

In [30]:
# Method 1: CharacterTextSplitter

text= documents[0].page_content
text

'Machine Learning Basics\n\nMachine learning is a subset of artificial intelligence that enables systems to learn and improve\nfrom experience without being explicitly programmed. It focuses on developing computer programs\nthat can access data and use it to learn for themselves.\n\nTypes of Machine Learning:\n1. Supervised Learning: Learning with labeled data\n2. Unsupervised Learning: Finding patterns in unlabeled data\n3. Reinforcement Learning: Learning through rewards and penalties\n\nApplications include image recognition, speech processing, and recommendation systems\n\n\n    '

In [32]:
print('CHARACTER TEXT SPLITTER')
char_splitter = CharacterTextSplitter(
    separator="\n",  # Split on newlines
    chunk_size=200,  # Max chunk size in characters
    chunk_overlap=20,  # Overlap between chunks
    length_function=len  # How to measure chunk size
)

char_chunks= char_splitter.split_text(text)
print(f'Created {len(char_chunks)} chunks') 
print(f'First chunk: {char_chunks[0]}')

CHARACTER TEXT SPLITTER
Created 4 chunks
First chunk: Machine Learning Basics
Machine learning is a subset of artificial intelligence that enables systems to learn and improve


In [38]:
for i in char_chunks:
    print(f'chunk:{i} \n')

chunk:Machine Learning Basics
Machine learning is a subset of artificial intelligence that enables systems to learn and improve 

chunk:from experience without being explicitly programmed. It focuses on developing computer programs
that can access data and use it to learn for themselves.
Types of Machine Learning: 

chunk:1. Supervised Learning: Learning with labeled data
2. Unsupervised Learning: Finding patterns in unlabeled data
3. Reinforcement Learning: Learning through rewards and penalties 

chunk:Applications include image recognition, speech processing, and recommendation systems 



In [41]:
# Method 1: Character-based splitting on ' '
print("1️⃣ CHARACTER TEXT SPLITTER")
char_splitter = CharacterTextSplitter(
    separator=" ",  # Split on spaces
    chunk_size=200,  # Max chunk size in characters
    chunk_overlap=20,  # Overlap between chunks
    length_function=len  # How to measure chunk size
)

char_chunks=char_splitter.split_text(text)
print(f"Created {len(char_chunks)} chunks")
print(f"First chunk: {char_chunks[0]}")

1️⃣ CHARACTER TEXT SPLITTER
Created 3 chunks
First chunk: Machine Learning Basics

Machine learning is a subset of artificial intelligence that enables systems to learn and improve
from experience without being explicitly programmed. It focuses on developing


In [45]:
for i in char_chunks:
    print(f'chunk:- {i}')
    print('-'*100)

chunk:- Machine Learning Basics

Machine learning is a subset of artificial intelligence that enables systems to learn and improve
from experience without being explicitly programmed. It focuses on developing
----------------------------------------------------------------------------------------------------
chunk:- on developing computer programs
that can access data and use it to learn for themselves.

Types of Machine Learning:
1. Supervised Learning: Learning with labeled data
2. Unsupervised Learning:
----------------------------------------------------------------------------------------------------
chunk:- Learning: Finding patterns in unlabeled data
3. Reinforcement Learning: Learning through rewards and penalties

Applications include image recognition, speech processing, and recommendation systems
----------------------------------------------------------------------------------------------------
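The two runs above differ only in the separator, which changes where the text may be cut. The general mechanism (split on one separator, then greedily merge pieces back together up to `chunk_size`, carrying a little text forward as overlap) can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LangChain's actual code, so its output will not necessarily match `CharacterTextSplitter` on every input; `pack_pieces` is a hypothetical name.

```python
from typing import List


def pack_pieces(text: str, separator: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split on separator, then greedily merge pieces into chunks of at most chunk_size."""
    pieces = [p for p in text.split(separator) if p]
    chunks: List[str] = []
    current: List[str] = []

    def joined_len(parts: List[str]) -> int:
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Drop leading pieces until what remains fits the overlap budget
            # and leaves room for the incoming piece.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "one two three four five six seven eight nine ten"
chunks = pack_pieces(sample, " ", chunk_size=20, chunk_overlap=8)
for c in chunks:
    print(repr(c))
# 'one two three four'
# 'four five six seven'
# 'seven eight nine ten'
```

Note that the overlap is made of whole pieces, so it can be shorter than `chunk_overlap`; a single piece longer than `chunk_size` is emitted as an oversized chunk rather than cut.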


In [48]:
# Method 2: Recursive character splitting (RECOMMENDED)
print("\n RECURSIVE CHARACTER TEXT SPLITTER")
recursive_splitter = RecursiveCharacterTextSplitter(
    separators=['\n\n','\n',' ',''], # Try these separators in order, coarsest first
    chunk_size=200,
    chunk_overlap=20,
    length_function=len
)

recursive_chunks=recursive_splitter.split_text(text)
print(f'Created {len(recursive_chunks)} chunks')
print(f'First Chunk:- {recursive_chunks[0]}')
print(recursive_chunks)


 RECURSIVE CHARACTER TEXT SPLITTER
Created 6 chunks
First Chunk:- Machine Learning Basics
['Machine Learning Basics', 'Machine learning is a subset of artificial intelligence that enables systems to learn and improve\nfrom experience without being explicitly programmed. It focuses on developing computer programs', 'that can access data and use it to learn for themselves.', 'Types of Machine Learning:\n1. Supervised Learning: Learning with labeled data\n2. Unsupervised Learning: Finding patterns in unlabeled data', '3. Reinforcement Learning: Learning through rewards and penalties', 'Applications include image recognition, speech processing, and recommendation systems']
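The idea behind trying separators in order can be sketched as a short recursive function: split on the coarsest separator that occurs in the text, merge neighbouring parts while they fit, and recurse with finer separators only on parts that are still too long. This is a simplified illustration without overlap handling, not the library's implementation, and `recursive_split` is a hypothetical name.

```python
from typing import List, Sequence


def recursive_split(text: str, separators: Sequence[str], chunk_size: int) -> List[str]:
    """Split on the coarsest usable separator; recurse on oversized parts."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= chunk_size:
        return [text]

    # Pick the first separator that occurs in the text; "" means "cut anywhere".
    for i, sep in enumerate(separators):
        if sep == "" or sep in text:
            break
    else:
        sep, i = "", len(separators)
    if sep == "":
        return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]

    chunks: List[str] = []
    current = ""
    for part in text.split(sep):
        if not part:
            continue
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # This part alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(part, separators[i + 1:], chunk_size))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "Title\n\nFirst paragraph has several words in it.\n\nSecond one."
result = recursive_split(sample, ["\n\n", "\n", " ", ""], chunk_size=25)
print(result)
# ['Title', 'First paragraph has', 'several words in it.', 'Second one.']
```

Paragraph boundaries survive wherever a paragraph fits in one chunk, and only the long paragraph is broken at word boundaries, which is the behaviour that makes the recursive splitter a reasonable default.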


In [49]:
for i in recursive_chunks:
    print(f'chunk:- {i}')
    print(100*'-')

chunk:- Machine Learning Basics
----------------------------------------------------------------------------------------------------
chunk:- Machine learning is a subset of artificial intelligence that enables systems to learn and improve
from experience without being explicitly programmed. It focuses on developing computer programs
----------------------------------------------------------------------------------------------------
chunk:- that can access data and use it to learn for themselves.
----------------------------------------------------------------------------------------------------
chunk:- Types of Machine Learning:
1. Supervised Learning: Learning with labeled data
2. Unsupervised Learning: Finding patterns in unlabeled data
----------------------------------------------------------------------------------------------------
chunk:- 3. Reinforcement Learning: Learning through rewards and penalties
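----------------------------------------------------------------------------------------------------
chunk:- Applications include image recognition, speech processing, and recommendation systems
----------------------------------------------------------------------------------------------------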

In [50]:
# Create text with no paragraph or line breaks, so spaces are the only split points
simple_text = "This is sentence one and it is quite long. This is sentence two and it is also quite long. This is sentence three which is even longer than the others. This is sentence four. This is sentence five. This is sentence six."

splitter = RecursiveCharacterTextSplitter(
    separators=[" "],  # Only split on spaces
    chunk_size=80,
    chunk_overlap=20,
    length_function=len
)

chunks = splitter.split_text(simple_text)

print(f"\nSimple text example - {len(chunks)} chunks:\n")

# Print consecutive chunk pairs so the shared overlap is easy to spot
for i in range(len(chunks) - 1):
    print(f"Chunk {i+1}: '{chunks[i]}'")
    print(f"Chunk {i+2}: '{chunks[i+1]}'")
    print()


Simple text example - 4 chunks:

Chunk 1: 'This is sentence one and it is quite long. This is sentence two and it is also'
Chunk 2: 'two and it is also quite long. This is sentence three which is even longer than'

Chunk 2: 'two and it is also quite long. This is sentence three which is even longer than'
Chunk 3: 'is even longer than the others. This is sentence four. This is sentence five.'

Chunk 3: 'is even longer than the others. This is sentence four. This is sentence five.'
Chunk 4: 'is sentence five. This is sentence six.'
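The overlap in the pairs above can be measured directly. The sketch below takes the four chunks printed in the output and finds, for each consecutive pair, the longest suffix of the first chunk that is also a prefix of the second. The helper `shared_overlap` is a hypothetical name introduced for illustration.

```python
def shared_overlap(left: str, right: str) -> str:
    """Longest suffix of left that is also a prefix of right."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""


# The four chunks printed in the output above.
chunks = [
    "This is sentence one and it is quite long. This is sentence two and it is also",
    "two and it is also quite long. This is sentence three which is even longer than",
    "is even longer than the others. This is sentence four. This is sentence five.",
    "is sentence five. This is sentence six.",
]

overlaps = [shared_overlap(a, b) for a, b in zip(chunks, chunks[1:])]
for text in overlaps:
    print(f"{len(text):>2} chars: '{text}'")
# 18 chars: 'two and it is also'
# 19 chars: 'is even longer than'
# 17 chars: 'is sentence five.'
```

None of the overlaps reaches the configured 20 characters: `chunk_overlap` acts as an upper bound, and because the only separator is a space, the overlap is rounded down to whole words.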



In [63]:
# Method 3: Token-based splitting
print("\n3️⃣ TOKEN TEXT SPLITTER")
token_splitter = TokenTextSplitter(
    chunk_size=50,  # Size in tokens (not characters)
    chunk_overlap=10
)

token_chunks = token_splitter.split_text(text)
print(f"Created {len(token_chunks)} chunks ")
for i, chunk in enumerate(token_chunks, start=1):
    print(f"\n Chunk {i}: {chunk}")


3️⃣ TOKEN TEXT SPLITTER
Created 3 chunks 

 Chunk 1: Machine Learning Basics

Machine learning is a subset of artificial intelligence that enables systems to learn and improve
from experience without being explicitly programmed. It focuses on developing computer programs
that can access data and use it to learn for themselves.

Types

 Chunk 2:  use it to learn for themselves.

Types of Machine Learning:
1. Supervised Learning: Learning with labeled data
2. Unsupervised Learning: Finding patterns in unlabeled data
3. Reinforcement Learning: Learning through rewards

 Chunk 3: 
3. Reinforcement Learning: Learning through rewards and penalties

Applications include image recognition, speech processing, and recommendation systems
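Token-based splitting is essentially a sliding window over a token sequence: take `chunk_size` tokens, then advance by `chunk_size - chunk_overlap`. The sketch below shows only that windowing step. Whitespace-separated words stand in for real tokens, which is an assumption made to keep the example dependency-free; `TokenTextSplitter` relies on a BPE tokenizer (tiktoken), so its boundaries fall in different places, as the mid-sentence cuts in the output above show. `window_tokens` is a hypothetical name.

```python
from typing import List


def window_tokens(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Slide a window of chunk_size tokens, advancing by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows: List[List[str]] = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
        start += step
    return windows


# Whitespace "tokens" are a stand-in; a real BPE tokenizer splits text differently.
tokens = "a b c d e f g h i j".split()
windows = window_tokens(tokens, chunk_size=4, chunk_overlap=1)
for window in windows:
    print(" ".join(window))
# a b c d
# d e f g
# g h i j
```

Because the window ignores sentence and paragraph structure, chunks have a predictable token count but can start or end mid-sentence.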


    


In [51]:
# 📊 Comparison
print("\n📊 Text Splitting Methods Comparison:")
print("\nCharacterTextSplitter:")
print("  ✅ Simple and predictable")
print("  ✅ Good for structured text")
print("  ❌ May break mid-sentence")
print("  Use when: Text has clear delimiters")

print("\nRecursiveCharacterTextSplitter:")
print("  ✅ Respects text structure")
print("  ✅ Tries multiple separators")
print("  ✅ Best general-purpose splitter")
print("  ❌ Slightly more complex")
print("  Use when: Default choice for most texts")

print("\nTokenTextSplitter:")
print("  ✅ Respects model token limits")
print("  ✅ More accurate for embeddings")
print("  ❌ Slower than character-based")
print("  Use when: Working with token-limited models")


📊 Text Splitting Methods Comparison:

CharacterTextSplitter:
  ✅ Simple and predictable
  ✅ Good for structured text
  ❌ May break mid-sentence
  Use when: Text has clear delimiters

RecursiveCharacterTextSplitter:
  ✅ Respects text structure
  ✅ Tries multiple separators
  ✅ Best general-purpose splitter
  ❌ Slightly more complex
  Use when: Default choice for most texts

TokenTextSplitter:
  ✅ Respects model token limits
  ✅ More accurate for embeddings
  ❌ Slower than character-based
  Use when: Working with token-limited models
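When choosing between splitters on your own data, it can help to compare simple statistics of the chunks each one produces. The sketch below is one way to do that; `chunk_stats` is a hypothetical helper, and the two chunk lists are made-up placeholders rather than output from the cells above.

```python
from statistics import mean
from typing import Dict, List


def chunk_stats(chunks: List[str]) -> Dict[str, float]:
    """Summarise chunk count and character lengths for one splitter's output."""
    lengths = [len(c) for c in chunks]
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": round(mean(lengths), 1),
    }


# Placeholder chunk lists standing in for two splitters' outputs.
results = {
    "splitter_a": ["aaaa", "bbbbbbbb", "cc"],
    "splitter_b": ["aaaaaaa", "bbbbbbb"],
}
for name, chunks in results.items():
    print(name, chunk_stats(chunks))
```

A large gap between `min` and `max` usually means some chunks carry very little context, which is a hint to adjust `chunk_size` or the separator list.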
