RAG:

Data ingestion --> Chunking --> Embedding --> VectorDB -->Retriever --> LLM

In [1]:
import os
from langchain.document_loaders import PyPDFLoader, PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from pathlib import Path

  from .autonotebook import tqdm as notebook_tqdm


Loading all the files

In [2]:
def process_all_pdfs(pdf_directory):
    """process all pdf files from directory"""
    all_documents = []
    pdf_dir = Path(pdf_directory)

    #find all pdf files recursively and list them
    pdf_files = list(pdf_dir.glob("**/*.pdf"))

    
    for pdf_file in pdf_files:
        print(f"\nProcessing:{pdf_file.name}")
        try:
            loader = PyPDFLoader(str(pdf_file))
            documents = loader.load()

            #add source information to metadata
            for doc in documents:
                doc.metadata['source_file'] = pdf_file.name
                doc.metadata['file_type'] = 'pdf'

            all_documents.extend(documents)
            print(f"loaded: {len(documents)} pages")
        
        except Exception as e:
            print(f" error:{e}")

    print(f"\nTotal documents loaded:{len(all_documents)}")
    return all_documents

all_pdf_documents = process_all_pdfs("../data")


Processing:1301.3781v3.pdf
loaded: 12 pages

Processing:1706.03762v7_Attention Is All You Need.pdf
loaded: 15 pages

Processing:1758078134582-sml_for_agentic_ai.pdf
loaded: 17 pages

Processing:LEAF-Net_A_Unified_Framework_for_Leaf_Extraction_a.pdf
loaded: 11 pages

Processing:Research Pape-pabbisetty_pranavir.pdf
loaded: 7 pages

Total documents loaded:62


In [3]:
all_pdf_documents

[Document(metadata={'producer': 'pdfTeX-1.40.12', 'creator': 'LaTeX with hyperref package', 'creationdate': '2013-09-10T00:03:46+00:00', 'author': '', 'keywords': '', 'moddate': '2013-09-10T00:03:46+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.1415926-2.3-1.40.12 (TeX Live 2011) kpathsea version 6.0.1', 'subject': '', 'title': '', 'trapped': '/False', 'source': '..\\data\\pdf_files\\1301.3781v3.pdf', 'total_pages': 12, 'page': 0, 'page_label': '1', 'source_file': '1301.3781v3.pdf', 'file_type': 'pdf'}, page_content='Efﬁcient Estimation of Word Representations in\nVector Space\nTomas Mikolov\nGoogle Inc., Mountain View, CA\ntmikolov@google.com\nKai Chen\nGoogle Inc., Mountain View, CA\nkaichen@google.com\nGreg Corrado\nGoogle Inc., Mountain View, CA\ngcorrado@google.com\nJeffrey Dean\nGoogle Inc., Mountain View, CA\njeff@google.com\nAbstract\nWe propose two novel model architectures for computing continuous vector repre-\nsentations of words from very large data sets. The quali

Chunking

In [4]:
def split_documents(documents,chunk_size=1000,chunk_overlap=200):
    """Split documents into smaller chunks for better vector embeedings clustering"""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size = chunk_size, 
        chunk_overlap = chunk_overlap,  #contains the few words that from the previous chunk, that makes it relative to each other
        length_function = len,
        separators = ["\n\n","\n"," ",""]
    )
    split_docs = text_splitter.split_documents(documents)
    print(f"split {len(documents)} documents into {len(split_docs)} chunk")

    if split_docs:
        print(f"\nExample chunk:")
        print(f"content: {split_docs[0].page_content[:200]}...")
        print(f"metadata: {split_docs[0].metadata}")

    return split_docs

In [5]:
chunks = split_documents(all_pdf_documents)
chunks

split 62 documents into 273 chunk

Example chunk:
content: Efﬁcient Estimation of Word Representations in
Vector Space
Tomas Mikolov
Google Inc., Mountain View, CA
tmikolov@google.com
Kai Chen
Google Inc., Mountain View, CA
kaichen@google.com
Greg Corrado
Goo...
metadata: {'producer': 'pdfTeX-1.40.12', 'creator': 'LaTeX with hyperref package', 'creationdate': '2013-09-10T00:03:46+00:00', 'author': '', 'keywords': '', 'moddate': '2013-09-10T00:03:46+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.1415926-2.3-1.40.12 (TeX Live 2011) kpathsea version 6.0.1', 'subject': '', 'title': '', 'trapped': '/False', 'source': '..\\data\\pdf_files\\1301.3781v3.pdf', 'total_pages': 12, 'page': 0, 'page_label': '1', 'source_file': '1301.3781v3.pdf', 'file_type': 'pdf'}


[Document(metadata={'producer': 'pdfTeX-1.40.12', 'creator': 'LaTeX with hyperref package', 'creationdate': '2013-09-10T00:03:46+00:00', 'author': '', 'keywords': '', 'moddate': '2013-09-10T00:03:46+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.1415926-2.3-1.40.12 (TeX Live 2011) kpathsea version 6.0.1', 'subject': '', 'title': '', 'trapped': '/False', 'source': '..\\data\\pdf_files\\1301.3781v3.pdf', 'total_pages': 12, 'page': 0, 'page_label': '1', 'source_file': '1301.3781v3.pdf', 'file_type': 'pdf'}, page_content='Efﬁcient Estimation of Word Representations in\nVector Space\nTomas Mikolov\nGoogle Inc., Mountain View, CA\ntmikolov@google.com\nKai Chen\nGoogle Inc., Mountain View, CA\nkaichen@google.com\nGreg Corrado\nGoogle Inc., Mountain View, CA\ngcorrado@google.com\nJeffrey Dean\nGoogle Inc., Mountain View, CA\njeff@google.com\nAbstract\nWe propose two novel model architectures for computing continuous vector repre-\nsentations of words from very large data sets. The quali

Embeddings
* converting text into vectors
* for Embeddings we can use sentance transformers --hugging face

In [6]:
import numpy as np
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
import uuid
from typing import List, Dict, Any, Tuple
from sklearn.metrics.pairwise import cosine_similarity

In [7]:
class EmbeddingManager:
        """Handles document embedding generation using SentanceTransformer"""

        def __init__(self, model_name: str="all-MiniLM-L6-v2"):
            """ Initialise the embeddings manager"""

            self.model_name = model_name
            self.model = None
            self._load_model()

        def _load_model(self):
            """load the SentenceTransformer model"""

            try:
                print(f"Loading embedding model: {self.model_name}")
                self.model = SentenceTransformer(self.model_name)
                print(f"Model loaded successfully. Embedding dimension: {self.model.get_sentence_embedding_dimension()}")
            except Exception as e:
                print(f"Error loading model{self.model_name}:{e}")
                raise
        
        def generate_embeddings(self, texts: List[str]) -> np.array:
            """Generate embeddings for list of texts"""

            if not self.model:
                raise ValueError("Model not loaded")

            print(f"Generating embeddings for {len(texts)} texts...")
            embeddings = self.model.encode(texts, show_progress_bar = True)
            print(f"generated embeddings with shape:{embeddings.shape}")
            return embeddings


embeding_manager = EmbeddingManager()

embeding_manager
        

Loading embedding model: all-MiniLM-L6-v2
Model loaded successfully. Embedding dimension: 384


<__main__.EmbeddingManager at 0x24d5141e0a0>

VectorDB

In [36]:
class VectorStore:
        """Manages document embeddings in a chromadb"""

        def __init__(self,collection_name: str="pdf_documents", persistant_directory: str="../data/vector_store"):

            self.collection_name = collection_name
            self.persistant_directory = persistant_directory
            self.Client = None
            self.collection = None
            self._initialize_store()

        def _initialize_store(self):
            """Initialize ChromaDB clinet and collection"""
            try:
                os.makedirs(self.persistant_directory, exist_ok=True)
                self.client = chromadb.PersistentClient(path=self.persistant_directory)
                self.collection = self.client.get_or_create_collection(
                    name=self.collection_name,
                    metadata={"description":"PDF document embeddings for RAG"}
                    )
                print(f"Vector store initialized. collection: {self.collection_name}")
                print(f"Existing document in collection: {self.collection.count()}")


            except Exception as e:
                print(f"Error initializing the vectore store {e}")
                raise


        def add_documents(self, documents: List[Any], embeddings: np.ndarray):

            """Adding documents and their embeddings to vector store"""
             
            if len(documents) != len(embeddings):
                raise ValueError("No.of documents must matach no.of embeddings")
                
            print(f"Adding {len(documents)} documents to vectore store...")

                #preparing data for Chromadb
            ids = []
            metadatas = []
            documents_text = []
            embeddings_list = []

            for i, (doc, embeddings) in enumerate(zip(documents,embeddings)):
                #generate unique ID
                doc_id = f"doc_{uuid.uuid4().hex[:8]}_{i}"
                ids.append(doc_id)

                    #prepare metadata
                metadata = dict(doc.metadata)
                metadata['doc_index'] = i
                metadata['content_length'] = len(doc.page_content)
                metadatas.append(metadata)

                #Document content
                documents_text.append(doc.page_content)

                
                #embeddings
                embeddings_list.append(embeddings.tolist())

            try:
                self.collection.add(
                    ids=ids,
                    embeddings=embeddings_list,
                    metadatas=metadatas,
                    documents=documents_text
                )
                print(f"Successfully added {len(documents)} to the vectorr store")
                print(f"total documents in the vector store: {self.collection.count()}")

            except Exception as e:
                print(f"error adding documents to the  vector")
                raise
                

vectorstore = VectorStore()

vectorstore

Vector store initialized. collection: pdf_documents
Existing document in collection: 0


<__main__.VectorStore at 0x24d6b67aa60>

In [24]:
#now i need to convert this chunks to embeddings then add to vectordb
chunks

[Document(metadata={'producer': 'pdfTeX-1.40.12', 'creator': 'LaTeX with hyperref package', 'creationdate': '2013-09-10T00:03:46+00:00', 'author': '', 'keywords': '', 'moddate': '2013-09-10T00:03:46+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.1415926-2.3-1.40.12 (TeX Live 2011) kpathsea version 6.0.1', 'subject': '', 'title': '', 'trapped': '/False', 'source': '..\\data\\pdf_files\\1301.3781v3.pdf', 'total_pages': 12, 'page': 0, 'page_label': '1', 'source_file': '1301.3781v3.pdf', 'file_type': 'pdf'}, page_content='Efﬁcient Estimation of Word Representations in\nVector Space\nTomas Mikolov\nGoogle Inc., Mountain View, CA\ntmikolov@google.com\nKai Chen\nGoogle Inc., Mountain View, CA\nkaichen@google.com\nGreg Corrado\nGoogle Inc., Mountain View, CA\ngcorrado@google.com\nJeffrey Dean\nGoogle Inc., Mountain View, CA\njeff@google.com\nAbstract\nWe propose two novel model architectures for computing continuous vector repre-\nsentations of words from very large data sets. The quali

Text to embeddings

In [25]:
text = [doc.page_content for doc in chunks]

text

['Efﬁcient Estimation of Word Representations in\nVector Space\nTomas Mikolov\nGoogle Inc., Mountain View, CA\ntmikolov@google.com\nKai Chen\nGoogle Inc., Mountain View, CA\nkaichen@google.com\nGreg Corrado\nGoogle Inc., Mountain View, CA\ngcorrado@google.com\nJeffrey Dean\nGoogle Inc., Mountain View, CA\njeff@google.com\nAbstract\nWe propose two novel model architectures for computing continuous vector repre-\nsentations of words from very large data sets. The quality of these representations\nis measured in a word similarity task, and the results are compared to the previ-\nously best performing techniques based on different types of neural networks. We\nobserve large improvements in accuracy at much lower computational cost, i.e. it\ntakes less than a day to learn high quality word vectors from a 1.6 billion words\ndata set. Furthermore, we show that these vectors provide state-of-the-art perfor-\nmance on our test set for measuring syntactic and semantic word similarities.\n1 Intro

In [26]:
#embeddings

embeddings = embeding_manager.generate_embeddings(text)

embeddings

Generating embeddings for 273 texts...


Batches:   0%|          | 0/9 [00:00<?, ?it/s]

Batches: 100%|██████████| 9/9 [00:08<00:00,  1.10it/s]

generated embeddings with shape:(273, 384)





array([[-0.06362029, -0.16879195, -0.00668469, ...,  0.03083144,
        -0.03387373,  0.02504949],
       [-0.05534703, -0.11472886, -0.00890647, ..., -0.03675747,
        -0.01979521, -0.02677884],
       [-0.07660776, -0.11830959,  0.02248334, ...,  0.01811084,
        -0.06948708, -0.01521258],
       ...,
       [-0.03222892, -0.05980909,  0.01131762, ...,  0.03133549,
        -0.06806459, -0.00500511],
       [-0.02302128,  0.03677272, -0.04058523, ...,  0.0375716 ,
        -0.03008257, -0.00182065],
       [-0.09816772, -0.01036338,  0.02062584, ...,  0.04619452,
        -0.04580394, -0.02005525]], dtype=float32)

In [37]:
#vecor store

vectorstore.add_documents(chunks,embeddings)

Adding 273 documents to vectore store...
Successfully added 273 to the vectorr store
total documents in the vector store: 273


Retriver Pipeline

* embedd the query and get the relevent information formt the vector store with similarity score and top k results

In [70]:
class RAGRetriver:
    """Handles query-based retriever from the vector store"""

    def __init__(self, vector_store: VectorStore, embeding_manager: EmbeddingManager):
        """"initialize the retriever"""

        self.vector_store = vector_store
        self.embedding_manager = embeding_manager

    def retrieve(self, query:str, top_k: int=3, score_theshold:float=0.0) -> List[Dict[str,Any]]:
        """Retriver relevant documnets for a query
        
        Args:
            top_k: no.of results to return 
            score_threshold: minimum similarity score threshold

        Returns:
                List of dictionaries containing retrived documents and metadata
        """

        print(f"Retriving documents for query: '{query}'")
        print(f"Top K: {top_k}, Score threshold: {score_theshold}")

        #generate query embeddings
        query_embeddings = self.embedding_manager.generate_embeddings([query])[0]

        #search in vector store
        try:
            results = self.vector_store.collection.query(
                query_embeddings = [query_embeddings.tolist()],
                n_results = top_k
            )

            #process resutls
            retrieved_docs = []

            if results['documents'] and results['documents'][0]:
                documents = results['documents'][0]
                metadatas = results['metadatas'][0]
                distances = results['distances'][0]
                ids = results['ids'][0]

                for i, (doc_id, document, metadata, distance) in enumerate(zip(ids,documents,metadatas,distances)):
                    #convert distance to similarity score (chromadb used cosine distance)
                    similarity_score = 1 - distance

                    if similarity_score >= score_theshold:
                        retrieved_docs.append({
                            'id': doc_id,
                            'content': document,
                            'metadata': metadata,
                            'similarity_score': similarity_score,
                            'distance': distances,
                            'rank': i+1
                        })

                        print(f"Retrived {len(retrieved_docs)} documents (after filtering)")

                else:
                    print("No documents found")
                    return retrieved_docs
            
        except Exception as e:
            print(f"Error during retrival {e}")
            return []


rag_retiver = RAGRetriver(vectorstore,embeding_manager)

In [71]:
rag_retiver

<__main__.RAGRetriver at 0x24d6bd7b940>

In [72]:
rag_retiver.retrieve("what is attention is all you need?")

Retriving documents for query: 'what is attention is all you need?'
Top K: 3, Score threshold: 0.0
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 99.76it/s]

generated embeddings with shape:(1, 384)
Retrived 1 documents (after filtering)
No documents found





[{'id': 'doc_914740ad_65',
  'content': '3.2 Attention\nAn attention function can be described as mapping a query and a set of key-value pairs to an output,\nwhere the query, keys, values, and output are all vectors. The output is computed as a weighted sum\n3',
  'metadata': {'author': '',
   'creator': 'LaTeX with hyperref',
   'content_length': 216,
   'page_label': '3',
   'title': '',
   'page': 2,
   'total_pages': 15,
   'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5',
   'keywords': '',
   'doc_index': 65,
   'file_type': 'pdf',
   'source': '..\\data\\pdf_files\\1706.03762v7_Attention Is All You Need.pdf',
   'producer': 'pdfTeX-1.40.25',
   'subject': '',
   'trapped': '/False',
   'moddate': '2024-04-10T21:11:43+00:00',
   'creationdate': '2024-04-10T21:11:43+00:00',
   'source_file': '1706.03762v7_Attention Is All You Need.pdf'},
  'similarity_score': 0.12962305545806885,
  'distance': [0.8703769445419312, 1.018078

In [73]:
rag_retiver.retrieve("what is Scaled Dot-Product Attention?")

Retriving documents for query: 'what is Scaled Dot-Product Attention?'
Top K: 3, Score threshold: 0.0
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 166.55it/s]

generated embeddings with shape:(1, 384)
Retrived 1 documents (after filtering)
Retrived 2 documents (after filtering)
Retrived 3 documents (after filtering)
No documents found





[{'id': 'doc_44c3e5d0_68',
  'content': 'dot product attention without scaling for larger values of dk [3]. We suspect that for large values of\ndk, the dot products grow large in magnitude, pushing the softmax function into regions where it has\nextremely small gradients 4. To counteract this effect, we scale the dot products by 1√dk\n.\n3.2.2 Multi-Head Attention\nInstead of performing a single attention function with dmodel-dimensional keys, values and queries,\nwe found it beneficial to linearly project the queries, keys and values h times with different, learned\nlinear projections to dk, dk and dv dimensions, respectively. On each of these projected versions of\nqueries, keys and values we then perform the attention function in parallel, yielding dv-dimensional\n4To illustrate why the dot products get large, assume that the components of q and k are independent random\nvariables with mean 0 and variance 1. Then their dot product, q · k = Pdk\ni=1 qiki, has mean 0 and variance dk.

LLM pipeline

In [86]:
#Using Groq LLM

from langchain_groq import ChatGroq
import os
from dotenv import load_dotenv
load_dotenv()

groq_api = os.getenv("GROQ_API_KEY")


# print(groq_api)

llm = ChatGroq(
    api_key=groq_api,
    model_name="openai/gpt-oss-120b",
    temperature=0.1,
    max_tokens=1024
               )

def rag(query,retiver,llm,top_k=3):

    results = retiver.retrieve(query,top_k=top_k)
    context = "\n\n".join([doc['content'] for doc in results]) if results else ""
    if not context:
        return "No relevant context found to answer this question"
    
    prompt = """ 
        Use the following context to answer the question conciesely

            context:
            {context}

            Question:
            {query}

            Answer:
        """

    response = llm.invoke([prompt.format(context=context, query=query)])
    return response.content



In [87]:
query = "what is the attention ?"

answer = rag(query,rag_retiver,llm)

answer

Retriving documents for query: 'what is the attention ?'
Top K: 3, Score threshold: 0.0
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 90.93it/s]

generated embeddings with shape:(1, 384)
Retrived 1 documents (after filtering)
Retrived 2 documents (after filtering)
No documents found





'Attention is a mechanism that maps a query vector and a set of key‑value pairs to an output vector by computing similarity scores between the query and each key, turning those scores into weights (typically via softmax), and then taking a weighted sum of the corresponding value vectors.'

In [None]:
def rag_advance(query, retiver, llm, top_k=5, min_score=0.2, return_context=False):
    """
    RAG pipeline with extra features like - Answer, Score, Confidence score, and optionally full context.
    """

    results = retiver.retrieve(query, top_k=top_k, score_theshold=min_score)
    if not results:
        return {'answer':'No relevant context found.','sources':[], 'confidence':0.0, 'context':''}
    
    #prepare context and source
    context = "\n\n".join([doc['content'] for doc in results])
    sources = [{
        'source':doc['metadata'].get('source_file', doc['metadata'].get('source','unknown')),
        'page': doc['metadata'].get('page','unknown'),
        'score': doc['similarity_score'],
        'preview': doc['content'][:300] + "..."
    } for doc in results]

    confidence = max([doc['similarity_score'] for doc in results])

    prompt = f"""

    use the folowing context to answer the question concisely .
    Context:{context}
    Query:{query} 
    
    Answer:
    """

    response = llm.invoke([prompt.format(query=query,context=context)])

    output = {
        'answer': response.content,
        'sources': sources,
        'confidence': confidence
    }

    if return_context:
        output['context'] = context
        
    return output

    # return response.content

In [None]:
result = rag_advance( "LLM-to-SLM Agent Conversion Algorithm", rag_retiver, llm, top_k=3,min_score=0.1, return_context=True)

print("Answer:", result['answer'])
print("Source:", result['sources'])
print("Confidence", result['confidence'])
print("context Preview:", result['context'][:200],"...")

Retriving documents for query: 'LLM-to-SLM Agent Conversion Algorithm?'
Top K: 3, Score threshold: 0.1
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 16.39it/s]

generated embeddings with shape:(1, 384)
Retrived 1 documents (after filtering)
Retrived 2 documents (after filtering)
Retrived 3 documents (after filtering)
No documents found





Answer: **LLM‑to‑SLM Agent Conversion Algorithm (concise version)**  

1. **Secure usage‑data collection (S1)** – Instrument the agent to log every non‑HCI call: input prompts, model outputs, tool‑call payloads, and latency. Store logs in an encrypted, role‑restricted pipeline and anonymize identifiers.  

2. **Data curation & filtering (S2)** – Pull the logged traces, remove noisy or privacy‑sensitive entries, and group them by task or domain (e.g., code generation, web‑search, summarisation).  

3. **Task‑level clustering (S3)** – Cluster the curated traces to discover recurring sub‑tasks that can be handled by a specialist model rather than a general‑purpose LLM.  

4. **Specialist‑model training (S4)** – For each cluster, fine‑tune a small‑scale LM (≤10 B parameters) on the corresponding subset of data, optimizing for latency, memory footprint, and task‑specific metrics.  

5. **Evaluation & benchmarking (S5)** – Compare each specialist SLM against the original LLM on held‑out trac