In [1]:
### Document Structure
from langchain_core.documents import Document

In [2]:
doc = Document(
    page_content="This is main text content, I am using for RAG",
    metadata = {
        "source":"example.txt",
        "pages":1,
        "author":"Anuj Patel",
        "date_created":"2025-01-01"
    }
)
doc

Document(metadata={'source': 'example.txt', 'pages': 1, 'author': 'Anuj Patel', 'date_created': '2025-01-01'}, page_content='This is main text content, I am using for RAG')
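
The metadata dictionary is what later steps use to trace a chunk back to its origin. As a minimal sketch of that idea, the snippet below filters documents by metadata. `SimpleDoc` and `filter_by_metadata` are hypothetical stand-ins written for this illustration (so the example runs without LangChain installed); they are not part of `langchain_core`.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SimpleDoc:
    """Hypothetical stand-in with the same two fields as Document."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def filter_by_metadata(docs: List[SimpleDoc], **criteria: Any) -> List[SimpleDoc]:
    """Keep documents whose metadata matches every key/value pair in criteria."""
    return [
        d for d in docs
        if all(d.metadata.get(key) == value for key, value in criteria.items())
    ]


docs = [
    SimpleDoc("This is main text content, I am using for RAG",
              {"source": "example.txt", "author": "Anuj Patel"}),
    SimpleDoc("Another note", {"source": "notes.txt", "author": "Someone Else"}),
]

matches = filter_by_metadata(docs, author="Anuj Patel")
print([d.metadata["source"] for d in matches])
```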

In [3]:
import os
os.makedirs("../data/text_files",exist_ok=True)

In [29]:
sample_text = {
    "../data/text_files/python_intro.txt": """

        What is Python?
Python is a popular programming language. It was created by Guido van Rossum, and released in 1991.

It is used for:

web development (server-side),
software development,
mathematics,
system scripting.
What can Python do?
Python can be used on a server to create web applications.
Python can be used alongside software to create workflows.
Python can connect to database systems. It can also read and modify files.
Python can be used to handle big data and perform complex mathematics.
Python can be used for rapid prototyping, or for production-ready software development.
Why Python?
Python works on different platforms (Windows, Mac, Linux, Raspberry Pi, etc).
Python has a simple syntax similar to the English language.
Python has syntax that allows developers to write programs with fewer lines than some other programming languages.
Python runs on an interpreter system, meaning that code can be executed as soon as it is written. This means that prototyping can be very quick.
Python can be treated in a procedural way, an object-oriented way or a functional way.
Good to know
The most recent major version of Python is Python 3, which we shall be using in this tutorial.
In this tutorial Python will be written in a text editor. It is possible to write Python in an Integrated Development Environment, such as Thonny, Pycharm, Netbeans or Eclipse which are particularly useful when managing larger collections of Python files.


""" ,
"../data/text_files/machine_learning.txt": """

    Machine learning is a branch of Artificial Intelligence that focuses on developing models and algorithms that let computers learn from data without being explicitly programmed for every task. In simple words, ML teaches the systems to think and understand like humans by learning from the data.
    Machine Learning is mainly divided into three core types: Supervised, Unsupervised and Reinforcement Learning along with two additional types, Semi-Supervised and Self-Supervised Learning.

Supervised Learning: Trains models on labeled data to predict or classify new, unseen data.
Unsupervised Learning: Finds patterns or groups in unlabeled data, like clustering or dimensionality reduction.
Reinforcement Learning: Learns through trial and error to maximize rewards, ideal for decision-making tasks.

"""
}

for filepath, content in sample_text.items():

    with open(filepath,'w',encoding="utf-8") as f:
        f.write(content)
print("Sample text files created")

Sample text files created


In [5]:
### TextLoader - loading text from text files
from langchain_community.document_loaders import TextLoader
loader = TextLoader("../data/text_files/python_intro.txt",encoding="utf-8")
document = loader.load()
print(document)

[Document(metadata={'source': '../data/text_files/python_intro.txt'}, page_content='\n\n        What is Python?\nPython is a popular programming language. It was created by Guido van Rossum, and released in 1991.\n\nIt is used for:\n\nweb development (server-side),\nsoftware development,\nmathematics,\nsystem scripting.\nWhat can Python do?\nPython can be used on a server to create web applications.\nPython can be used alongside software to create workflows.\nPython can connect to database systems. It can also read and modify files.\nPython can be used to handle big data and perform complex mathematics.\nPython can be used for rapid prototyping, or for production-ready software development.\nWhy Python?\nPython works on different platforms (Windows, Mac, Linux, Raspberry Pi, etc).\nPython has a simple syntax similar to the English language.\nPython has syntax that allows developers to write programs with fewer lines than some other programming languages.\nPython runs on an interpreter 

In [6]:
## Directory Loader - To read all files from a directory
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import TextLoader

## Load all the text files from the directory
dir_loader = DirectoryLoader(
    "../data/text_files",
    glob = "**/*.txt", # Pattern to match files
    loader_cls= TextLoader, # loader class to use
    loader_kwargs={'encoding':'utf-8'},
    show_progress=False
)

documents = dir_loader.load()
documents

[Document(metadata={'source': '../data/text_files/python_intro.txt'}, page_content='\n\n        What is Python?\nPython is a popular programming language. It was created by Guido van Rossum, and released in 1991.\n\nIt is used for:\n\nweb development (server-side),\nsoftware development,\nmathematics,\nsystem scripting.\nWhat can Python do?\nPython can be used on a server to create web applications.\nPython can be used alongside software to create workflows.\nPython can connect to database systems. It can also read and modify files.\nPython can be used to handle big data and perform complex mathematics.\nPython can be used for rapid prototyping, or for production-ready software development.\nWhy Python?\nPython works on different platforms (Windows, Mac, Linux, Raspberry Pi, etc).\nPython has a simple syntax similar to the English language.\nPython has syntax that allows developers to write programs with fewer lines than some other programming languages.\nPython runs on an interpreter 
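
To see which files a recursive pattern such as `**/*.txt` selects, here is a small standard-library sketch. It assumes the loader interprets the pattern much like `pathlib.Path.glob` does; the helper `list_matching_files` and the temporary files are made up for the illustration.

```python
import tempfile
from pathlib import Path
from typing import List


def list_matching_files(root: str, pattern: str = "**/*.txt") -> List[str]:
    """Return the paths (relative to root) of files matching a glob pattern."""
    base = Path(root)
    return sorted(
        p.relative_to(base).as_posix()
        for p in base.glob(pattern)
        if p.is_file()
    )


# Build a throwaway directory tree so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    (root / "python_intro.txt").write_text("a", encoding="utf-8")
    (root / "nested" / "machine_learning.txt").write_text("b", encoding="utf-8")
    (root / "report.pdf").write_bytes(b"%PDF-")
    found = list_matching_files(tmp)

print(found)  # the .pdf file is not matched; the nested .txt file is
```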

In [7]:
from langchain_community.document_loaders import PyPDFLoader, PyMuPDFLoader

## Load all PDF files from the directory
dir_loader = DirectoryLoader(
    "../data/text_files",
    glob = "**/*.pdf",
    loader_cls= PyMuPDFLoader,
    show_progress= False
)

pdfs =dir_loader.load()
pdfs



[Document(metadata={'producer': '', 'creator': '', 'creationdate': '', 'source': '../data/text_files/AI_Agent_Basics_1736794799.pdf', 'file_path': '../data/text_files/AI_Agent_Basics_1736794799.pdf', 'total_pages': 42, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '', 'trapped': '', 'modDate': '', 'creationDate': '', 'page': 0}, page_content='AI Agents\nAuthors: Julia Wiesinger, Patrick Marlow and\nVladimir Vuskovic'),
 Document(metadata={'producer': '', 'creator': '', 'creationdate': '', 'source': '../data/text_files/AI_Agent_Basics_1736794799.pdf', 'file_path': '../data/text_files/AI_Agent_Basics_1736794799.pdf', 'total_pages': 42, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '', 'trapped': '', 'modDate': '', 'creationDate': '', 'page': 1}, page_content='Agents\nSeptember 2024\n2\nAcknowledgements\nDesigner\nMichael Lanning \nTechnical Writer\nJoey Haymaker\nCurators and Editors\nAntonio Gulli\nA

PyPDFLoader - This class provides methods to load and parse PDF documents, supporting various configurations such as handling password-protected files, extracting images, and defining extraction mode. It integrates the pypdf library for PDF processing and offers both synchronous and asynchronous document loading.


PyMuPDFLoader - This class provides methods to load and parse PDF documents, supporting various configurations such as handling password-protected files, extracting tables, extracting images, and defining extraction mode. It integrates the PyMuPDF library for PDF processing and offers both synchronous and asynchronous document loading.



In [8]:
from langchain_community.document_loaders import UnstructuredExcelLoader

## Load all Excel files from the directory
dir_loader = DirectoryLoader(
    "../data/text_files",
    glob = "**/*.xlsx",
    loader_cls=UnstructuredExcelLoader,  # the class itself is callable, so no lambda wrapper is needed
    show_progress= False
)

excels =dir_loader.load()
excels

[Document(metadata={'source': '../data/text_files/FlightTicket_TrainData.xlsx'}, page_content='Airline Date_of_Journey Source Destination Route Dep_Time Arrival_Time Duration Total_Stops Additional_Info Price IndiGo 24/03/2019 Banglore New Delhi BLR → DEL 22:20 01:10 22 Mar 2h 50m non-stop No info 3897 Air India 1/05/2019 Kolkata Banglore CCU → IXR → BBI → BLR 05:50 13:15 7h 25m 2 stops No info 7662 Jet Airways 9/06/2019 Delhi Cochin DEL → LKO → BOM → COK 09:25 04:25 10 Jun 19h 2 stops No info 13882 IndiGo 12/05/2019 Kolkata Banglore CCU → NAG → BLR 18:05 23:30 5h 25m 1 stop No info 6218 IndiGo 01/03/2019 Banglore New Delhi BLR → NAG → DEL 16:50 21:35 4h 45m 1 stop No info 13302 SpiceJet 24/06/2019 Kolkata Banglore CCU → BLR 09:00 11:25 2h 25m non-stop No info 3873 Jet Airways 12/03/2019 Banglore New Delhi BLR → BOM → DEL 18:55 10:25 13 Mar 15h 30m 1 stop In-flight meal not included 11087 Jet Airways 01/03/2019 Banglore New Delhi BLR → BOM → DEL 08:00 05:05 02 Mar 21h 5m 1 stop No info

In [9]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
def split_documents(documents,chunk_size = 1000, chunk_overlap = 200):
    """
    Split documents into smaller chunks
    Args:
        documents: List of Document objects or raw strings.
        chunk_size: Max characters per chunk
        chunk_overlap: Overlap between chunks
    Returns:
        List of Document chunks
    """

    # Ensure all inputs are Document objects (guarding against an empty list)
    if documents and isinstance(documents[0], str):
        documents = [Document(page_content=doc, metadata={}) for doc in documents]

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        # Separators are tried in order; the final '' falls back to
        # character-level splits, so long lines may be cut mid-word.
        separators=['\n\n', '\n', '']
    )

    split_docs = text_splitter.split_documents(documents)
    print(f"split{len(documents)} documents into {len(split_docs)} chunks")

    # example of chunk
    if split_docs:
        print(f"\n Example chunk")
        print(f"Content: {split_docs[0].page_content[:200]}...")
        print(f"Metadata: {split_docs[0].metadata}")
    return split_docs

all_pdf_documents = [
    Document(page_content="This is page one with some text about AI and RAG pipelines.",metadata={"page":1}),
    Document(page_content="Second page with more text, embeddings and transformers.",metadata={"page":2}),
]
chunks = split_documents(pdfs,chunk_size=50,chunk_overlap=10)
print(f"Total chunks created: {len(chunks)}")

split42 documents into 1319 chunks

 Example chunk
Content: AI Agents...
Metadata: {'producer': '', 'creator': '', 'creationdate': '', 'source': '../data/text_files/AI_Agent_Basics_1736794799.pdf', 'file_path': '../data/text_files/AI_Agent_Basics_1736794799.pdf', 'total_pages': 42, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '', 'trapped': '', 'modDate': '', 'creationDate': '', 'page': 0}
Total chunks created: 1319
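
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch: a fixed-size sliding window over characters. It is not the recursive, separator-aware algorithm that `RecursiveCharacterTextSplitter` uses; the function name `chunk_text` is invented for this illustration.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 50, chunk_overlap: int = 10) -> List[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(pieces)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, which is the purpose of the overlap: text that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.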


## Embeddings and VectorStoreDB

In [10]:
import numpy as np
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
import uuid
from typing import List, Dict, Any, Tuple
from sklearn.metrics.pairwise import cosine_similarity

In [11]:
class EmbeddingManager:
    """Handles document embedding generation using SentenceTransformer"""
    def __init__(self,model_name: str = "all-MiniLM-L6-v2"):
        """
        Initialize the embedding manager

        Args:
            model_name: HuggingFace model name for sentence embeddings
        
        """

        self.model_name = model_name
        self.model = None
        self._load_model()
    
    def _load_model(self):
        """Load the SentenceTransformer model"""
        try:
            print(f"Loading embedding model:{self.model_name}")
            self.model = SentenceTransformer(self.model_name)
            print(f"Model loaded Successfully. Embedding dimension: {self.model.get_sentence_embedding_dimension()}")
        except Exception as e:
            print(f"error loading model{self.model_name}: {e}")
            raise

    def generate_embeddings(self,texts: List[str]) -> np.ndarray:
        """
        Generate embeddings for a list of texts
        Args:
            texts: List of text strings to embed
        
        Returns:
            Numpy array of embeddings with shape (len(texts), embedding_dim)
        """

        if not self.model:
            raise ValueError("Model not loaded")
        print(f"Generating embeddings for {len(texts)} texts...")
        embeddings = self.model.encode(texts,show_progress_bar= True)
        print(f"Generated embeddings with shape:{embeddings.shape}")
        return embeddings

## Initialize the embedding manager
embedding_manager = EmbeddingManager()
embedding_manager


Loading embedding model:all-MiniLM-L6-v2
Model loaded Successfully. Embedding dimension: 384


<__main__.EmbeddingManager at 0x129dda0b0>
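
Embeddings are useful because texts with similar meaning map to vectors that point in similar directions. The sketch below shows the usual comparison, cosine similarity, in plain Python on tiny made-up vectors (real embeddings from this model have 384 dimensions). The helper name `cosine_sim` is illustrative only.

```python
import math
from typing import Sequence


def cosine_sim(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]   # same direction, different length
orthogonal = [0.0, 3.0, 0.0]       # shares no component with the query

print(cosine_sim(query, same_direction))  # close to 1.0
print(cosine_sim(query, orthogonal))      # 0.0
```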

### vectorstore

In [12]:
class VectorStore:
    """Manages documents in a ChromaDB vector store"""
    def __init__(self,collection_name: str = 'pdf_documents',persist_directory:str = "../data/vector_store"):
        """
        Initialize the vector store

        Args: 
            collection_name: Name of the chromadb collection
            persist_directory: Directory to persist the vector store
        """
        self.collection_name = collection_name
        self.persist_directory = persist_directory
        self.client = None
        self.collection = None
        self._initialize_store()

    def _initialize_store(self):
        """Initialize chromaDB Client and collection"""
        try:
            #create persistent ChromaDB client
            os.makedirs(self.persist_directory,exist_ok=True)
            self.client = chromadb.PersistentClient(path= self.persist_directory)

            # Get or create collection
            self.collection = self.client.get_or_create_collection(
                name = self.collection_name,
                metadata = {"description":"PDF document embeddings for RAG"}
            )
            print(f"vector store initiallized. Collection:{self.collection_name}")
            print(f"Existing documents in collection:{self.collection.count()}")
        except Exception as e:
            print(f"Error initializing vector store:{e}")
            raise
    
    def add_documents(self,documents: List[Any], embeddings: np.ndarray):
        """
        Add documents and their embeddings to the vector store

        Args:
            documents: List of Langchain documents
            embeddings: corresponding embeddings for the documents
        """
        if len(documents) != len(embeddings):
            raise ValueError("Number of documents must match number of embeddings")
        
        print(f"adding {len(documents)} documents to vector store...")

        # prepare data for chromaDB
        ids=[]
        metadatas = []
        documents_text = []
        embeddings_list = []

        for i, (doc, embedding) in enumerate(zip(documents, embeddings)):
            # Generate a unique ID (random, so re-running this cell adds duplicates)
            doc_id = f"doc_{uuid.uuid4().hex[:8]}_{i}"
            ids.append(doc_id)

            # Prepare metadata
            metadata = dict(doc.metadata)
            metadata['doc_index'] = i
            metadata['content_length'] = len(doc.page_content)
            metadatas.append(metadata)

            # Document content
            documents_text.append(doc.page_content)

            # Embedding for this document (named so it does not shadow the `embeddings` argument)
            embeddings_list.append(embedding.tolist())
        
        # Add to collection
        try:
            self.collection.add(
                ids=ids,
                embeddings = embeddings_list,
                metadatas = metadatas,
                documents = documents_text
            )
            print(f"succesfully added {len(documents)} documents to vector store")
            print(f"Total documents in collection:{self.collection.count()}")
        
        except Exception as e:
            print(f"error adding documents to vector store:{e}")
            raise

vectorstore = VectorStore()
vectorstore

vector store initiallized. Collection:pdf_documents
Existing documents in collection:11971


<__main__.VectorStore at 0x14e0aebc0>

In [13]:
chunks

[Document(metadata={'producer': '', 'creator': '', 'creationdate': '', 'source': '../data/text_files/AI_Agent_Basics_1736794799.pdf', 'file_path': '../data/text_files/AI_Agent_Basics_1736794799.pdf', 'total_pages': 42, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '', 'trapped': '', 'modDate': '', 'creationDate': '', 'page': 0}, page_content='AI Agents'),
 Document(metadata={'producer': '', 'creator': '', 'creationdate': '', 'source': '../data/text_files/AI_Agent_Basics_1736794799.pdf', 'file_path': '../data/text_files/AI_Agent_Basics_1736794799.pdf', 'total_pages': 42, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '', 'trapped': '', 'modDate': '', 'creationDate': '', 'page': 0}, page_content='Authors: Julia Wiesinger, Patrick Marlow and'),
 Document(metadata={'producer': '', 'creator': '', 'creationdate': '', 'source': '../data/text_files/AI_Agent_Basics_1736794799.pdf', 'file_path': '../data/text_

In [None]:
### Convert the text to embeddings
texts = [doc.page_content for doc in chunks]

embeddings = embedding_manager.generate_embeddings(texts)
## store in the VectorDB
vectorstore.add_documents(chunks,embeddings)

Generating embeddings for 1319 texts...


Batches: 100%|██████████| 42/42 [00:01<00:00, 33.77it/s]


Generated embeddings with shape:(1319, 384)
adding 1319 documents to vector store...
succesfully added 1319 documents to vector store
Total documents in collection:13290
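
The collection count grows on every run (11971 before, 13290 after) because each chunk receives a fresh random ID, so the same chunks are stored again. One possible remedy, sketched here under assumptions, is to derive the ID from the chunk itself so that re-ingesting overwrites instead of duplicating. `stable_chunk_id` is a hypothetical helper, and the plain dictionary only stands in for a store with upsert semantics; wiring this into the `VectorStore` class above is left open.

```python
import hashlib
from typing import Dict


def stable_chunk_id(source: str, page: int, content: str) -> str:
    """Derive a deterministic ID from where a chunk came from and what it says."""
    key = f"{source}|{page}|{content}".encode("utf-8")
    return "doc_" + hashlib.sha256(key).hexdigest()[:16]


def upsert(store: Dict[str, str], source: str, page: int, content: str) -> None:
    """Insert or overwrite: the same chunk always lands on the same key."""
    store[stable_chunk_id(source, page, content)] = content


store: Dict[str, str] = {}
for _ in range(2):  # simulate running the ingestion cell twice
    upsert(store, "example.pdf", 0, "AI Agents")
    upsert(store, "example.pdf", 0, "Authors: Julia Wiesinger, Patrick Marlow and")

print(len(store))  # 2, not 4
```

A side effect of this design is that two chunks with identical text on the same page collapse into one entry; including a chunk index in the key would keep them separate.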


Retriever Pipeline From Vectorstore

In [15]:
class RAGRetriever:
    """Handles query-based retrieval from the vector store"""
    def __init__(self,vector_store:VectorStore,embedding_manager: EmbeddingManager):
        """
        Initialize the retriever

        Args:
            vector_store: vector store containing document embeddings
            embedding_manager: Manager for generating query embeddings
        
        """
        self.vector_store = vector_store
        self.embedding_manager= embedding_manager

    def retrieve(self,query:str , top_k: int = 5, score_threshold:float = 0.0) -> List[Dict[str,Any]]:
        """
            Retrieve relevant documents for a query

            Args:
                query: The search query
                top_k: Number of top results  to return
                score_threshold: Minimum similarity score threshold
            
            Returns:
                List of dictionaries containing retrieved documents and metadata
        """
        print(f"Retrieving documents for query: '{query}'")
        print(f"Top k: {top_k}, score threshold: {score_threshold}")

        # generate query embedding
        query_embedding = self.embedding_manager.generate_embeddings([query])[0]

        # Search in vector store 
        try:
            results = self.vector_store.collection.query(
                query_embeddings=[query_embedding.tolist()],
                n_results=top_k
            )
            # process results
            retrieved_docs =[]
            if results['documents'] and results['documents'][0]:
                documents = results['documents'][0]
                metadatas = results['metadatas'][0]
                distances = results['distances'][0]
                ids = results['ids'][0]

                for i , (doc_id,document,metadata,distance) in enumerate(zip(ids,documents,metadatas,distances)):
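                    # Convert distance to a similarity score. Note: 1 - distance equals cosine
                    # similarity only when the collection uses cosine distance (for example,
                    # created with metadata {"hnsw:space": "cosine"}). This collection is created
                    # without that setting, so Chroma's default metric applies (squared L2 in
                    # current releases) and the score is only a rough proxy.
                    similarity_score  = 1 - distance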

                    if similarity_score >= score_threshold:
                        retrieved_docs.append({
                            'id':doc_id,
                            'content':document,
                            'metadata':metadata,
                            'similarity_score':similarity_score,
                            'distance':distance,
                            'rank':i+1
                        })
                print(f"Retrieved {len(retrieved_docs)} documents (after filtering)")
            else:
                print("No documents found")
            return retrieved_docs
        except Exception as e:
            print(f"Error during retrieval: {e}")
            return []

rag_retrieval = RAGRetriever(vectorstore,embedding_manager)
rag_retrieval

<__main__.RAGRetriever at 0x14e0af430>

In [16]:
rag_retrieval.retrieve("Tools: Our keys to the outside world")

Retrieving documents for query: 'Tools: Our keys to the outside world'
Top k: 5, score threshold: 0.0
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 21.46it/s]

Generated embeddings with shape:(1, 384)
Retrieved 5 documents (after filtering)





[{'id': 'doc_f1892653_414',
  'content': 'Tools: Our keys to the outside world',
  'metadata': {'page': 11,
   'content_length': 36,
   'subject': '',
   'source': '../data/text_files/AI_Agent_Basics_1736794799.pdf',
   'file_path': '../data/text_files/AI_Agent_Basics_1736794799.pdf',
   'trapped': '',
   'creator': '',
   'creationDate': '',
   'format': 'PDF 1.4',
   'title': '',
   'moddate': '',
   'doc_index': 414,
   'keywords': '',
   'modDate': '',
   'author': '',
   'total_pages': 42,
   'producer': '',
   'creationdate': ''},
  'similarity_score': 0.9999999999993058,
  'distance': 6.941819623469681e-13,
  'rank': 1},
 {'id': 'doc_256b9f9e_414',
  'content': 'Tools: Our keys to the outside world',
  'metadata': {'modDate': '',
   'file_path': '../data/text_files/AI_Agent_Basics_1736794799.pdf',
   'creationdate': '',
   'creationDate': '',
   'trapped': '',
   'creator': '',
   'doc_index': 414,
   'source': '../data/text_files/AI_Agent_Basics_1736794799.pdf',
   'author': ''
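
How the `distance` field maps to a similarity depends on the metric the collection uses. As a sketch, assuming unit-length embeddings and a squared Euclidean (L2) distance, the relationship is `cosine = 1 - distance / 2`, not `1 - distance`. The vectors and helper names below are made up for the illustration.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))


a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

distance = squared_l2(a, b)
cosine_direct = dot(a, b)
cosine_from_distance = 1.0 - distance / 2.0  # valid only for unit vectors

print(distance, cosine_direct, cosine_from_distance)
```

Either way, a distance near zero (as for the top hit above, whose text equals the query) means a near-identical vector.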

### Integrating the VectorDB Context Pipeline With LLM Output

In [None]:
### Simple RAG pipeline with a chat LLM (Gemini is used here; a Groq key is also loaded)
from langchain_groq import ChatGroq
from langchain_google_genai import ChatGoogleGenerativeAI
import os
from dotenv import load_dotenv
load_dotenv()

### Read the API keys from the environment
groq_api_key = os.getenv("GROQ_API_KEY")
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")

### Initialize the LLM (Gemini is the model actually used below)
llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash",temperature=0.8, api_key=GOOGLE_API_KEY)



### Simple Rag Function: Retrieve context + generate response
def rag_simple(query,retriever,llm,top_k=3):
    ## Retrieve the context
    results = retriever.retrieve(query,top_k=top_k)
    context = "\n\n".join(doc['content'] for doc in results) if results else ""
    if not context:
        return "No relevant context found to answer the question"
    
    ## Generate the answer with the LLM that was passed in.
    ## The f-string already interpolates context and query, so no .format() call is needed.
    prompt = f"""Use the following context to answer the question concisely.

    Context:
    {context}

    Question: {query}

    Answer:"""

    response = llm.invoke(prompt)
    return response.content

In [18]:
answer = rag_simple("which programming language are used in this book for sample code",rag_retrieval,llm)
print(answer)

Retrieving documents for query: 'which programming language are used in this book for sample code'
Top k: 3, score threshold: 0.0
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 18.17it/s]

Generated embeddings with shape:(1, 384)
Retrieved 3 documents (after filtering)





Python
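
One detail worth care when assembling the prompt: retrieved text can contain curly braces (JSON snippets, code samples), and running `str.format` over a string that already contains such text raises an error. The sketch below interpolates the values once and then shows what a second formatting pass would do. `build_prompt` and the sample context are illustrative.

```python
def build_prompt(context: str, query: str) -> str:
    """Interpolate context and query exactly once."""
    return (
        "Use the following context to answer the question concisely.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )


# Retrieved text with braces, e.g. a JSON fragment from a document.
context = 'Config example: {"tool": "search"}'
prompt = build_prompt(context, "What does the config contain?")

# A second formatting pass treats the braces in the context as placeholders.
try:
    prompt.format(context=context, query="ignored")
    second_pass_ok = True
except (KeyError, IndexError, ValueError):
    second_pass_ok = False

print(second_pass_ok)  # False
```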


In [None]:
def rag_advanced(query,retriever,llm,top_k=5,min_Score=0.2,return_context=False):
    """
        RAG pipeline with extra features:
        -   Returns answer, sources, confidence score, and optionally full context.
    """
    results = retriever.retrieve(query,top_k=top_k,score_threshold = min_Score)
    if not results:
        return {'answer': 'No relevant context found.','sources':[],'confidence':0.0,'context':''}

    # Prepare context and sources
    context = "\n\n".join([doc['content'] for doc in results])
    sources = [{
        'sources': doc['metadata'].get('source_file',doc['metadata'].get('source','unknown')),
        'page': doc['metadata'].get('page','unknown'),
        #'score': doc['similarity_score'],
        'preview': doc["content"][:300] + '...'
    } for doc in results]

    confidence = max([doc['similarity_score'] for doc in results])

    # Generate answer
    prompt = f"""Use the following context to answer the question concisely. \nContext:\n{context}\n\nQuestion:{query}\n\nAnswer: """
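    # The f-string above is already fully interpolated; calling .format() on it
    # again would fail whenever the context contains curly braces.
    response = llm.invoke(prompt)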

    output = {
        'answer': response.content,
        'sources':sources,
        'confidence':confidence
    }
    if return_context:
        output['context'] = context
    return output

# example usage:
result = rag_advanced("Enhancing model performance with targeted learning?", rag_retrieval,llm,top_k=3,min_Score=0.1,return_context=True)
print("Answer:",result['answer'])
print("Sources:",result['sources'])
print("confidence:",result['confidence'])
print("Context Preview:",result['context'][:300])


Retrieving documents for query: 'Enhancing model performance with targeted learning?'
Top k: 3, score threshold: 0.1
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 13.06it/s]

Generated embeddings with shape:(1, 384)
Retrieved 3 documents (after filtering)





Answer: Enhancing model performance with targeted learning.
Sources: [{'sources': '../data/text_files/AI_Agent_Basics_1736794799.pdf', 'page': 2, 'preview': 'Enhancing model performance with targeted learnin...'}, {'sources': '../data/text_files/AI_Agent_Basics_1736794799.pdf', 'page': 2, 'preview': 'Enhancing model performance with targeted learnin...'}, {'sources': '../data/text_files/AI_Agent_Basics_1736794799.pdf', 'page': 2, 'preview': 'Enhancing model performance with targeted learnin...'}]
confidence: 0.7360277473926544
Context Preview: Enhancing model performance with targeted learnin

Enhancing model performance with targeted learnin

Enhancing model performance with targeted learnin
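
The three retrieved chunks above are identical, which is what happens when the same chunks have been stored more than once. As a stopgap on the retrieval side, duplicates can be collapsed before the context is built. This is a minimal sketch: `dedupe_by_content` is a hypothetical helper, and the third sample result is invented for the demonstration.

```python
from typing import Any, Dict, List


def dedupe_by_content(results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first (best-ranked) result for each distinct chunk text."""
    seen = set()
    unique = []
    for item in results:
        key = item["content"].strip()
        if key in seen:
            continue  # a better-ranked copy of this text was already kept
        seen.add(key)
        unique.append(item)
    return unique


results = [
    {"id": "doc_f1892653_414", "content": "Tools: Our keys to the outside world", "rank": 1},
    {"id": "doc_256b9f9e_414", "content": "Tools: Our keys to the outside world", "rank": 2},
    {"id": "doc_example_415", "content": "A different chunk", "rank": 3},  # made-up entry
]

unique_results = dedupe_by_content(results)
print([r["id"] for r in unique_results])
```

Asking for a larger `top_k` before deduplicating helps keep enough distinct chunks in the final context.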
