### Introduction to Data Ingestion and Parsing


In [2]:
import os
from typing import List, Dict, Any
import pandas as pd


In [3]:
from langchain_core.documents import Document
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    TokenTextSplitter,
)
print("Set up is complete!")


Set up is complete!


### Understanding the Document Structure in LangChain

In [4]:
## Create a simple document
doc = Document(
    page_content="This is the main text content that will be embedded and searched.",
    metadata={
        "source": "example.txt",
        "page": 1,
        "author": "Johnny Boy",
        "date_created": "2025-09-02",
        "custom_field": "custom_value",
    },
)
print("Document structure")
print(f"Content: {doc.page_content}")
print(f"Metadata: {doc.metadata}")

Document structure
Content: This is the main text content that will be embedded and searched.
Metadata: {'source': 'example.txt', 'page': 1, 'author': 'Johnny Boy', 'date_created': '2025-09-02', 'custom_field': 'custom_value'}
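
The metadata dictionary is what later makes it possible to filter or attribute search results. The following is a minimal sketch of that idea. It uses a hypothetical `SimpleDoc` dataclass as a stand-in for LangChain's `Document` so that it runs without the library; `filter_by_metadata` and `docs_demo` are illustrative names, not part of LangChain.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for LangChain's Document: text plus metadata."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def filter_by_metadata(docs: List[SimpleDoc], **criteria: Any) -> List[SimpleDoc]:
    """Return the documents whose metadata matches every key/value in criteria."""
    return [
        d for d in docs
        if all(d.metadata.get(key) == value for key, value in criteria.items())
    ]


docs_demo = [
    SimpleDoc("Intro text", {"source": "example.txt", "page": 1}),
    SimpleDoc("More text", {"source": "example.txt", "page": 2}),
    SimpleDoc("Other text", {"source": "other.txt", "page": 1}),
]

# Keep only page 1 of example.txt
matches = filter_by_metadata(docs_demo, source="example.txt", page=1)
print([d.page_content for d in matches])
```

The same pattern should carry over to real `Document` objects, since they expose `page_content` and `metadata` in the same way.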


### Text Files (.txt) - The Simplest Case

In [5]:
## Create a simple text file
import os
os.makedirs("data/text_files", exist_ok=True)


In [6]:
sample_texts = {
    "data/text_files/python_intro.txt": """1-dataingestion.ipynb
Python's dynamic typing allows flexible code but can cause runtime errors.
The GIL prevents true multithreading, making multiprocessing better for CPU tasks.
List comprehensions replace for loops with readable one-liners.
Python emphasizes readability and the principle of one obvious way.
The standard library is extensive, earning Python the 'batteries included' nickname.
Duck typing determines suitability by methods, not actual type.
Python's interpreter enables rapid prototyping and interactive development.
Data science popularity comes from NumPy, Pandas, and Scikit-learn libraries.
Python's natural language syntax makes it ideal for beginners.
Virtual environments solve dependency conflicts between projects
    """,
    "data/text_files/machine_learning.txt": """
Machine learning algorithms learn patterns from data without explicit programming.
Supervised learning uses labeled data to predict outcomes on new examples.
Unsupervised learning finds hidden structures in data without target labels.
Neural networks mimic brain neurons with interconnected layers of nodes.
Deep learning uses multiple hidden layers to model complex relationships.
Overfitting occurs when models memorize training data but fail on new data.
Cross-validation splits data to test model performance on unseen examples.
Feature engineering transforms raw data into meaningful model inputs.
Gradient descent optimizes model parameters by minimizing prediction errors.
Ensemble methods combine multiple models to improve overall accuracy.
    """
}

for filepath, content in sample_texts.items():
    with open(filepath, "w", encoding="utf-8") as f:
        f.write(content)
    print(f"{filepath} created")


data/text_files/python_intro.txt created
data/text_files/machine_learning.txt created


### TextLoader - Read a Single File

In [7]:
# Older LangChain releases exposed this as: from langchain.document_loaders import TextLoader
from langchain_community.document_loaders import TextLoader
loader = TextLoader("data/text_files/python_intro.txt", encoding="utf-8")
docs = loader.load()
print(f"Number of documents: {len(docs)}")
print(f"Content: {docs[0].page_content[:100]}")
print(f"Metadata: {docs[0].metadata}")
print(type(docs))

Number of documents: 1
Content: 1-dataingestion.ipynb
Python's dynamic typing allows flexible code but can cause runtime errors.
The
Metadata: {'source': 'data/text_files/python_intro.txt'}
<class 'list'>


### DirectoryLoader - Multiple Text Files


In [8]:
from langchain_community.document_loaders import DirectoryLoader
loader = DirectoryLoader(
    "data/text_files/", 
    glob="**/*.txt", 
    loader_cls=TextLoader,
    show_progress=True
)
docs = loader.load()
print(f"Number of documents: {len(docs)}")
for i, doc in enumerate(docs):
    print(f"Document {i+1}:")
    print(f"   Source: {doc.metadata['source']}")
    print(f"   Length: {len(doc.page_content)} characters")


100%|██████████| 2/2 [00:00<00:00, 3943.87it/s]

Number of documents: 2
Document 1:
   Source: data/text_files/python_intro.txt
   Length: 747 characters
Document 2:
   Source: data/text_files/machine_learning.txt
   Length: 755 characters
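
Conceptually, loading a directory comes down to globbing for files, reading each one, and recording where it came from. The sketch below approximates that with the standard library only. It is not how `DirectoryLoader` is implemented; `load_text_dir` and the temporary files are hypothetical and exist only for the demonstration.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def load_text_dir(root: str, pattern: str = "**/*.txt") -> List[Dict[str, str]]:
    """Read every file under root that matches pattern into a simple record."""
    records = []
    for path in sorted(Path(root).glob(pattern)):
        if path.is_file():
            records.append({
                "source": str(path),
                "page_content": path.read_text(encoding="utf-8"),
            })
    return records


# Build a throwaway directory so the example does not depend on existing files
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.txt").write_text("alpha", encoding="utf-8")
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "sub" / "b.txt").write_text("beta", encoding="utf-8")
    (Path(tmp) / "notes.md").write_text("ignored", encoding="utf-8")
    records = load_text_dir(tmp)

for rec in records:
    print(Path(rec["source"]).name, len(rec["page_content"]))
```

The `**/*.txt` pattern picks up text files in subdirectories as well, which is why `sub/b.txt` is loaded and `notes.md` is skipped.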





### Text Splitting Strategies

In [9]:
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    TokenTextSplitter,
)
print(docs)


[Document(metadata={'source': 'data/text_files/python_intro.txt'}, page_content="1-dataingestion.ipynb\nPython's dynamic typing allows flexible code but can cause runtime errors.\nThe GIL prevents true multithreading, making multiprocessing better for CPU tasks.\nList comprehensions replace for loops with readable one-liners.\nPython emphasizes readability and the principle of one obvious way.\nThe standard library is extensive, earning Python the 'batteries included' nickname.\nDuck typing determines suitability by methods, not actual type.\nPython's interpreter enables rapid prototyping and interactive development.\nData science popularity comes from NumPy, Pandas, and Scikit-learn libraries.\nPython's natural language syntax makes it ideal for beginners.\nVirtual environments solve dependency conflicts between projects\n    "), Document(metadata={'source': 'data/text_files/machine_learning.txt'}, page_content='\nMachine learning algorithms learn patterns from data without explicit p

### Method 1 - Character Text Splitter

In [10]:
text = docs[0].page_content
print("CharacterTextSplitter")
char_splitter = CharacterTextSplitter(
    # Splitting on newlines keeps lines whole; they are too long to fit inside
    # the 30-character overlap, so no overlap appears. Use " " to see overlap.
    separator="\n",
    chunk_size=100,
    chunk_overlap=30,
    length_function=len,
)
char_chunks = char_splitter.split_text(text)
print(f"Number of chunks: {len(char_chunks)}")
for i, chunk in enumerate(char_chunks):
    print(f"Chunk {i+1}: {chunk}\n")


CharacterTextSplitter
Number of chunks: 10
Chunk 1: 1-dataingestion.ipynb
Python's dynamic typing allows flexible code but can cause runtime errors.

Chunk 2: The GIL prevents true multithreading, making multiprocessing better for CPU tasks.

Chunk 3: List comprehensions replace for loops with readable one-liners.

Chunk 4: Python emphasizes readability and the principle of one obvious way.

Chunk 5: The standard library is extensive, earning Python the 'batteries included' nickname.

Chunk 6: Duck typing determines suitability by methods, not actual type.

Chunk 7: Python's interpreter enables rapid prototyping and interactive development.

Chunk 8: Data science popularity comes from NumPy, Pandas, and Scikit-learn libraries.

Chunk 9: Python's natural language syntax makes it ideal for beginners.

Chunk 10: Virtual environments solve dependency conflicts between projects
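
Because the chunks above show no overlap, it may help to see what `chunk_size` and `chunk_overlap` mean in the simplest possible setting. The sketch below is a plain fixed-width sliding window over characters. It is a deliberate simplification and not LangChain's merge algorithm, which works on separator-delimited pieces; `sliding_window_chunks` is a hypothetical helper.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
window_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
for piece in window_chunks:
    print(piece)
```

Each window starts with the last three characters of the previous one, which is the context an overlap is meant to preserve across chunk boundaries.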



### Method 2 - RecursiveCharacterTextSplitter

In [None]:
print("RecursiveCharacterTextSplitter")
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=30,
    length_function=len,
    separators=["\n\n", "\n", " ", ""]
)
chunks = recursive_splitter.split_text(text)
print(f"Number of chunks: {len(chunks)}")
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk}\n")
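
The `separators` list is tried in order: split on the coarsest separator first, and fall back to finer ones only for pieces that are still too long. The sketch below illustrates that idea in plain Python. It is a simplified approximation, not the library's implementation: it ignores `chunk_overlap`, does not re-merge pieces produced at different recursion levels, and `recursive_split` is a hypothetical name.

```python
from typing import List


def recursive_split(text: str, chunk_size: int, separators: List[str]) -> List[str]:
    """Split on the first separator; recurse with finer ones on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Nothing finer to try: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in (p for p in text.split(sep) if p):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits in the chunk being built
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


demo_text = (
    "First paragraph is short.\n\n"
    "Second paragraph is quite a bit longer than the limit.\n"
    "It has two lines."
)
demo_chunks = recursive_split(demo_text, 40, ["\n\n", "\n", " ", ""])
for n, piece in enumerate(demo_chunks, start=1):
    print(f"{n}: {piece!r}")
```

The short first paragraph survives intact, while the long line in the second paragraph is broken at word boundaries only because the paragraph and line separators were not enough.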
    

### Method 3 - Token Text Splitter

In [None]:
print("TokenTextSplitter")
token_splitter = TokenTextSplitter(
    chunk_size=50,
    chunk_overlap=10,
)
token_chunks = token_splitter.split_text(text)
print(f"created {len(token_chunks)} chunks")
print(f"first chunk: {token_chunks[0][:100]}")
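
Here `chunk_size` and `chunk_overlap` count tokens rather than characters. `TokenTextSplitter` relies on a real tokenizer (tiktoken), so the sketch below only approximates the idea by treating whitespace-separated words as tokens. That substitution is an assumption made to keep the example dependency-free, and `token_window_chunks` is a hypothetical helper.

```python
from typing import List


def token_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Chunk by token count, using whitespace-separated words as stand-in tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    tokens = text.split()  # stand-in tokenizer: NOT what tiktoken produces
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already covers the final token
    return chunks


words = "one two three four five six seven eight nine ten"
word_chunks = token_window_chunks(words, chunk_size=4, chunk_overlap=1)
for piece in word_chunks:
    print(piece)
```

With a real tokenizer the boundaries fall on subword tokens, so a chunk can begin or end in the middle of a word, which this word-level stand-in does not show.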