# Data Ingestion

LangChain Document Structure


In [1]:
## Document Data Structure and Ingestion
from langchain_core.documents import Document


In [2]:
doc = Document(
    page_content="This is the content of the document. Writing it to validate and test my ongoing RAG Application.",
    metadata={
        "source": "my_source.pdf",
        "page": 1,
        "author": "Mayank Tripathi",
        "date_created": "2025-09-09",
    },
)

doc

Document(metadata={'source': 'my_source.pdf', 'page': 1, 'author': 'Mayank Tripathi', 'date_created': '2025-09-09'}, page_content='This is the content of the document. Writing it to validate and test my ongoing RAG Application.')

In [4]:
## Create a test text file with some content
import os
os.makedirs("../data/text_files", exist_ok=True)

In [12]:
sample_texts = {
    "../data/text_files/doc1.txt": """This is the content of document 1. It contains information about AI and Machine Learning.
    Here I am going to copy some code from Google sarch on Python Programming.

    Python is a high-level, interpreted, general-purpose programming language known for its readability and versatility. It's designed to be easy to learn and use, making it a popular choice for beginners and experienced programmers alike. 
Here's a more detailed look at why Python is considered a high-level language:
Key Characteristics of Python as a High-Level Language:
Abstraction:
.
Python abstracts away many of the low-level details of computer hardware, such as memory management, allowing programmers to focus on the logic of their code. 
Readability:
.
Python's syntax is designed to be clear and concise, resembling natural language, which makes it easier to read, write, and understand code. 
Dynamic Typing:
.
Python uses dynamic typing, meaning that the data type of a variable is checked during runtime, not at compile time. This simplifies coding and allows for more flexibility. 
Interpreted:
.
Python code is executed by an interpreter, which reads and executes the code line by line, rather than being compiled into machine code before execution. 
Object-Oriented:
.
Python supports object-oriented programming, allowing you to organize code into reusable objects and classes. 
Extensive Libraries:
.
Python has a vast standard library and a large ecosystem of third-party packages, providing pre-built functionalities for various tasks, such as web development, data analysis, and machine learning. 
Why is this important?
Ease of Use:
High-level languages like Python are easier to learn and use than low-level languages, making them more accessible to a wider range of users. 
Faster Development:
The abstraction and readability of Python allow for faster development cycles, as programmers can write code more quickly and with fewer errors. 
Portability:
Python is platform-independent, meaning that Python code can run on different operating systems (Windows, macOS, Linux) without modification, as long as the Python interpreter is installed. 
Versatility:
Python's versatility makes it suitable for a wide range of applications, including web development, data science, machine learning, scripting, and more. 
In essence, Python's high-level nature makes it a powerful and user-friendly language that empowers developers to create complex applications with relative ease. 

    """,
"../data/text_files/doc2.txt": """This is the content of document 2. It contains information about Data Science and Data Analysis.

    What does the Lorem Ipsum text mean?
Lorem Ipsum comes from a latin text written in 45BC by Roman statesman, lawyer, scholar, and philosopher, Marcus Tullius Cicero. The text is titled "de Finibus Bonorum et Malorum" which means "The Extremes of Good and Evil". The most common form of Lorem ipsum is the following:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

The text is a corrupted version of the original and therefore does not mean anything in particular. The book however where it originates discusses the philosophical views of Epicureanism, Stoicism, and the Platonism of Antiochus of Ascalon.

Lorem ipsum is widely in use since the 14th century and up to today as the default dummy "random" text of the typesetting and web development industry. In fact not only it has survived the test of time but it thrived and can be found in many software products, from Microsoft Word to WordPress.
"""
}

In [14]:
for filepath, content in sample_texts.items():
    print(f"Creating {filepath}...")
    with open(filepath, "w", encoding="utf-8") as f:
        f.write(content)

print("✅ Sample text files created.")

Creating ../data/text_files/doc1.txt...
Creating ../data/text_files/doc2.txt...
✅ Sample text files created.
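
Before loading the files, it can help to confirm that what was written reads back unchanged. The following is a minimal sketch of such a round-trip check using only the standard library. The helper name `write_and_verify` and the toy file contents are assumptions made for illustration, and a temporary directory stands in for `../data/text_files`.

```python
import tempfile
from pathlib import Path
from typing import Dict


def write_and_verify(texts: Dict[str, str], base_dir: Path) -> Dict[str, int]:
    """Write each text under base_dir and return the character count read back."""
    sizes = {}
    for name, content in texts.items():
        path = base_dir / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(content, encoding="utf-8")

        # Read the file again with the same encoding and compare
        read_back = path.read_text(encoding="utf-8")
        if read_back != content:
            raise IOError(f"Round-trip mismatch for {path}")
        sizes[name] = len(read_back)
    return sizes


# Hypothetical sample contents, written to a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    sizes = write_and_verify(
        {"doc1.txt": "AI and ML notes", "doc2.txt": "Data science notes ✅"},
        Path(tmp),
    )
    print(sizes)
```

Writing and reading with an explicit `encoding="utf-8"` keeps non-ASCII characters intact regardless of the platform default.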


In [15]:
## Read the text files and create Document objects
from langchain_community.document_loaders import TextLoader

loader = TextLoader("../data/text_files/doc1.txt", encoding="utf-8")

print(loader)
documents = loader.load()
print(f"Number of documents loaded: {len(documents)}")
print(documents[0])  # Print the first document to verify

<langchain_community.document_loaders.text.TextLoader object at 0x10adb5940>
Number of documents loaded: 1
page_content='This is the content of document 1. It contains information about AI and Machine Learning.
    Here I am going to copy some code from Google sarch on Python Programming.

    Python is a high-level, interpreted, general-purpose programming language known for its readability and versatility. It's designed to be easy to learn and use, making it a popular choice for beginners and experienced programmers alike. 
Here's a more detailed look at why Python is considered a high-level language:
Key Characteristics of Python as a High-Level Language:
Abstraction:
.
Python abstracts away many of the low-level details of computer hardware, such as memory management, allowing programmers to focus on the logic of their code. 
Readability:
.
Python's syntax is designed to be clear and concise, resembling natural language, which makes it easier to read, write, and understand code. 
D

In [16]:
## Directory Loader
from langchain_community.document_loaders import DirectoryLoader

# Load all the text files from the directory
dir_loader = DirectoryLoader(
    "../data/text_files", 
    glob="**/*.txt", 
    loader_cls=TextLoader, ## Loader Class to use
    loader_kwargs={"encoding": "utf-8"},
    # show_progress=True # Need tqdm package for progress bar
    show_progress=False  
)

documents = dir_loader.load()
print(f"Number of documents loaded from directory: {len(documents)}")
print(documents)  


Number of documents loaded from directory: 2
[Document(metadata={'source': '../data/text_files/doc1.txt'}, page_content="This is the content of document 1. It contains information about AI and Machine Learning.\n    Here I am going to copy some code from Google sarch on Python Programming.\n\n    Python is a high-level, interpreted, general-purpose programming language known for its readability and versatility. It's designed to be easy to learn and use, making it a popular choice for beginners and experienced programmers alike. \nHere's a more detailed look at why Python is considered a high-level language:\nKey Characteristics of Python as a High-Level Language:\nAbstraction:\n.\nPython abstracts away many of the low-level details of computer hardware, such as memory management, allowing programmers to focus on the logic of their code. \nReadability:\n.\nPython's syntax is designed to be clear and concise, resembling natural language, which makes it easier to read, write, and understa
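
The `glob="**/*.txt"` pattern used above is a recursive pattern: `**` matches the directory itself and any nested subdirectories. The sketch below illustrates that behaviour with `pathlib` alone. It is only an illustration of the pattern, not a description of how `DirectoryLoader` is implemented, and the file names are made up for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()

    # Two .txt files (one nested) and one file with another extension
    for rel in ("doc1.txt", "nested/doc2.txt", "notes.md"):
        (root / rel).write_text("sample", encoding="utf-8")

    # "**/*.txt" picks up .txt files at any depth and skips notes.md
    matched = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.txt"))
    print(matched)
```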

In [21]:
# The same can be done with PDF files, here using PyMuPDFLoader (PyPDFLoader is an alternative)

from langchain_community.document_loaders import PyPDFLoader, PyMuPDFLoader

dir_loader = DirectoryLoader(
    "../data/pdf_files", 
    glob="**/*.pdf", 
    loader_cls=PyMuPDFLoader, ## Loader Class to use
    show_progress=False  
)


pdf_documents = dir_loader.load()
print(f"Number of PDF documents loaded from directory: {len(pdf_documents)}")
print(pdf_documents)  # Print all loaded PDF documents to verify

Number of PDF documents loaded from directory: 0
[]
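
An empty result like this one usually means that no files matched the pattern. A quick check before loading can tell a missing directory apart from a directory without matching files. This is a minimal standard-library sketch; the helper name `describe_matches` is an assumption introduced for illustration.

```python
from pathlib import Path


def describe_matches(directory: str, pattern: str) -> str:
    """Report how many files under `directory` match the glob `pattern`."""
    root = Path(directory)
    if not root.is_dir():
        return f"{directory!r} does not exist or is not a directory"

    # Count only regular files; directories that match the pattern are ignored
    count = sum(1 for p in root.glob(pattern) if p.is_file())
    return f"{count} file(s) match {pattern!r} under {directory!r}"


print(describe_matches("../data/pdf_files", "**/*.pdf"))
```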


In [22]:
type(documents[0])

langchain_core.documents.base.Document

# Data Embedding and Vector Store

In [23]:
import numpy as np
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
import uuid
from typing import List, Dict, Any, Tuple
from sklearn.metrics.pairwise import cosine_similarity



In [26]:
class EmbeddingManager:
    """Handles document embedding generation using SentenceTransformers."""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model_name = model_name
        self.model = None
        self._load_model()

    def _load_model(self):
        """Load the SentenceTransformer model."""
        try:
            print(f"Loading embedding model: {self.model_name}")
            self.model = SentenceTransformer(self.model_name)
            print(f"Model loaded successfully. Embedding dimension: {self.model.get_sentence_embedding_dimension()}")
        except Exception as e:
            print(f"Error loading model: {e}")
            raise

    def generate_embeddings(self, texts: List[str]) -> np.ndarray:
        """Generate embeddings for a list of texts."""
        if not self.model:
            raise ValueError("Model not loaded.")

        print(f"Generating embeddings for {len(texts)} texts...")
        embeddings = self.model.encode(texts, show_progress_bar=True)
        return embeddings

    def get_embedding_dimension(self) -> int:
        """Get the dimension of the embeddings."""
        if not self.model:
            raise ValueError("Model not loaded.")
        return self.model.get_sentence_embedding_dimension()


In [28]:
# Initialize the Embedding Manager
embedding_manager = EmbeddingManager()

embedding_manager

Loading embedding model: all-MiniLM-L6-v2
Model loaded successfully. Embedding dimension: 384


<__main__.EmbeddingManager at 0x163776710>
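
The notebook imports `cosine_similarity` from scikit-learn to compare embeddings. To make the measure concrete, here is a pure Python sketch of cosine similarity on toy vectors. The function name `cosine` and the 3-dimensional vectors are assumptions for illustration. Real embeddings from the model are much longer, and this is not scikit-learn's implementation.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length.")

    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine similarity is undefined for a zero vector.")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings
query = [1.0, 0.0, 1.0]
similar = [2.0, 0.0, 2.0]     # same direction as query, different length
unrelated = [0.0, 1.0, 0.0]   # orthogonal to query

print(cosine(query, similar))
print(cosine(query, unrelated))
```

Vectors that point in the same direction score close to 1 regardless of their length, while orthogonal vectors score 0.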

In [29]:
# Vector Store with ChromaDB
class VectorStore:
    """
    Manages document embeddings in a ChromaDB vector store
    """

    def __init__(self, collection_name: str = "documents", persist_directory: str = "./chroma_db"):
        self.collection_name = collection_name
        self.persist_directory = persist_directory
        self.client = None
        self.collection = None
        self._initialize_store()

    def _initialize_store(self):
        """Initialize ChromaDB client and collection."""
        try:
            print("Initializing ChromaDB client...")
            # Create persistent ChromaDB client
            os.makedirs(self.persist_directory, exist_ok=True)
            self.client = chromadb.PersistentClient(path = self.persist_directory)

            # Get or Create Collection
            self.collection = self.client.get_or_create_collection(
                name=self.collection_name,
                metadata={"description": "PDF Document embeddings collection for RAG Application"}
            )

            print(f"Vector store initialized. Collection: {self.collection_name}")
            print(f"Existing documents in collection: {self.collection.count()}")

        except Exception as e:
            print(f"Error initializing ChromaDB: {e}")
            raise

    def add_documents(self, documents: List[Any], embeddings: np.ndarray):
        """Add documents and their embeddings to the collection in the vector store."""
        if len(documents) != len(embeddings):
            raise ValueError("Number of documents and embeddings must match.")

        print(f"Adding {len(documents)} documents to vector store ...")

        # Prepare parallel lists for ChromaDB
        ids = []
        metadatas = []
        documents_text = []
        embeddings_list = []

        for i, (doc, emb) in enumerate(zip(documents, embeddings)):
            # Random 8-character hex prefix plus the position in this batch
            doc_id = f"doc_{uuid.uuid4().hex[:8]}_{i}"
            ids.append(doc_id)

            # Prepare metadata
            metadata = dict(doc.metadata)
            metadata["doc_index"] = i
            metadata["content_length"] = len(doc.page_content)
            metadatas.append(metadata)

            # Document content
            documents_text.append(doc.page_content)

            # Embedding
            embeddings_list.append(emb.tolist())

        # Add everything in one call after the loop has finished; calling
        # add() inside the loop would resubmit the IDs collected so far
        try:
            self.collection.add(
                ids=ids,
                metadatas=metadatas,
                documents=documents_text,
                embeddings=embeddings_list
            )
            print(f"Successfully added {len(documents)} documents to vector store.")
            print(f"Total documents in collection: {self.collection.count()}")
        except Exception as e:
            print(f"Error adding documents to vector store: {e}")
            raise


In [30]:
vector_store = VectorStore()
vector_store

Initializing ChromaDB client...
Vector store initialized. Collection: documents
Existing documents in collection: 0


<__main__.VectorStore at 0x164830590>
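
The batching idea behind `add_documents` (collect IDs, metadata, texts and embeddings in parallel lists, then submit them together) can be tried without ChromaDB. The sketch below uses a hypothetical `InMemoryCollection` class and an `add_batch` helper. Both are assumptions written for this illustration and neither mirrors ChromaDB's actual API beyond the keyword names used in the notebook.

```python
import uuid
from typing import Any, Dict, List, Sequence


class InMemoryCollection:
    """Hypothetical stand-in for a vector store collection (not ChromaDB)."""

    def __init__(self) -> None:
        self.records: Dict[str, Dict[str, Any]] = {}
        self.add_calls = 0

    def add(self, ids, documents, metadatas, embeddings) -> None:
        self.add_calls += 1
        for doc_id, text, meta, emb in zip(ids, documents, metadatas, embeddings):
            # Reject IDs that were already stored
            if doc_id in self.records:
                raise ValueError(f"Duplicate id: {doc_id}")
            self.records[doc_id] = {"text": text, "metadata": meta, "embedding": emb}

    def count(self) -> int:
        return len(self.records)


def add_batch(collection: InMemoryCollection,
              texts: Sequence[str],
              embeddings: Sequence[Sequence[float]]) -> None:
    """Build parallel lists in a loop, then submit them with a single add()."""
    if len(texts) != len(embeddings):
        raise ValueError("Number of texts and embeddings must match.")

    ids: List[str] = []
    metadatas: List[Dict[str, Any]] = []
    for i, text in enumerate(texts):
        ids.append(f"doc_{uuid.uuid4().hex[:8]}_{i}")
        metadatas.append({"doc_index": i, "content_length": len(text)})

    # One call for the whole batch, made after the loop
    collection.add(
        ids=ids,
        documents=list(texts),
        metadatas=metadatas,
        embeddings=[list(e) for e in embeddings],
    )


collection = InMemoryCollection()
add_batch(collection, ["first text", "second text"], [[0.1, 0.2], [0.3, 0.4]])
print(collection.count(), collection.add_calls)
```

Submitting once per batch means each ID reaches the collection exactly one time, which avoids duplicate-ID errors.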