# Data Ingestion Pipeline

In [1]:
from langchain_community.document_loaders import TextLoader 

loader = TextLoader("../data/text_files/llm_learning.txt")

In [2]:
document = loader.load()
document

[Document(metadata={'source': '../data/text_files/llm_learning.txt'}, page_content="Large Language Models (LLMs) are advanced artificial intelligence systems designed to understand and generate human-like text. They are trained on vast amounts of textual data and can perform a variety of tasks such as answering questions, summarizing content, translating languages, and even writing code.\n\n### Key Concepts\n\n1. **Training Data**: LLMs learn from large datasets containing books, articles, websites, and more. The quality and diversity of this data influence the model's capabilities.\n\n2. **Tokens**: LLMs process text as sequences of tokens, which are often words or subwords. Understanding tokenization is important for working with LLMs.\n\n3. **Prompting**: You interact with an LLM by providing a prompt—a piece of text or a question. The model generates a response based on this input.\n\n4. **Fine-tuning**: While base LLMs are trained on general data, they can be fine-tuned on specifi

In [3]:
from langchain_community.document_loaders import DirectoryLoader, TextLoader

# Load every .txt file in the directory, reading each one as UTF-8
loader = DirectoryLoader(
    "../data/text_files/",
    glob="*.txt",
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"},
    show_progress=False,
)
documents = loader.load()
print(documents)


[Document(metadata={'source': '../data/text_files/llm_learning.txt'}, page_content="Large Language Models (LLMs) are advanced artificial intelligence systems designed to understand and generate human-like text. They are trained on vast amounts of textual data and can perform a variety of tasks such as answering questions, summarizing content, translating languages, and even writing code.\n\n### Key Concepts\n\n1. **Training Data**: LLMs learn from large datasets containing books, articles, websites, and more. The quality and diversity of this data influence the model's capabilities.\n\n2. **Tokens**: LLMs process text as sequences of tokens, which are often words or subwords. Understanding tokenization is important for working with LLMs.\n\n3. **Prompting**: You interact with an LLM by providing a prompt—a piece of text or a question. The model generates a response based on this input.\n\n4. **Fine-tuning**: While base LLMs are trained on general data, they can be fine-tuned on specifi

In [4]:
from langchain_community.document_loaders import DirectoryLoader, PyMuPDFLoader

# Load every PDF in the directory; PyMuPDFLoader yields one Document per page
loader = DirectoryLoader(
    "../data/pdf_files/",
    glob="*.pdf",
    loader_cls=PyMuPDFLoader,
    show_progress=False,
)
documents = loader.load()
print(documents)


[Document(metadata={'producer': 'PDFCreator Online (www.pdfforge.org/online)', 'creator': 'PDFCreator Online (www.pdfforge.org/online)', 'creationdate': '2025-09-18T14:22:04+00:00', 'source': '../data/pdf_files/agentic_ai_tutorial.pdf', 'file_path': '../data/pdf_files/agentic_ai_tutorial.pdf', 'total_pages': 74, 'format': 'PDF 1.4', 'title': 'Merged with PDFCreator Online', 'author': '', 'subject': '', 'keywords': '', 'moddate': '2025-09-18T14:22:04+00:00', 'trapped': '', 'modDate': 'D:20250918142204Z', 'creationDate': "D:20250918142204+00'00'", 'page': 0}, page_content='Agentic AI Tutorial: Building Intelligent\nAutonomous Systems\nTable of Contents\n1. Introduction to Agentic AI\n2. Core Concepts and Architecture\n3. Agent Frameworks\n4. Planning and Reasoning\n5. Memory and Knowledge Management\n6. Tool Use and Actions\n7. Multi-Agent Systems\n8. Real-World Applications\n9. Implementation Guide\n10. Best Practices\nIntroduction to Agentic AI\nAgentic AI refers to AI systems that can

In [5]:
print(documents[0])

page_content='Agentic AI Tutorial: Building Intelligent
Autonomous Systems
Table of Contents
1. Introduction to Agentic AI
2. Core Concepts and Architecture
3. Agent Frameworks
4. Planning and Reasoning
5. Memory and Knowledge Management
6. Tool Use and Actions
7. Multi-Agent Systems
8. Real-World Applications
9. Implementation Guide
10. Best Practices
Introduction to Agentic AI
Agentic AI refers to AI systems that can act autonomously to achieve goals, make decisions, and interact with their environment. Unlike traditional AI that responds to
prompts, agentic systems proactively plan, execute, and adapt their behavior.
Key Characteristics:
Autonomy: Can operate without constant human supervision
Goal-oriented: Pursues objectives over multiple steps
Adaptive: Learns from experience and adjusts strategies
Interactive: Communicates with humans and other systems
Persistent: Maintains context and memory across sessions
Agent Types:
Reactive Agents: Respond to immediate stimuli
Deliberative

In [6]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

def split_documents(documents, chunk_size=1000, chunk_overlap=200):
    """Split documents into smaller chunks for better RAG performance"""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        separators=["\n\n", "\n", " ", ""]
    )
    split_docs = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(split_docs)} chunks")
    
    # Show example of a chunk
    if split_docs:
        print("\nExample chunk:")
        print(f"Content: {split_docs[0].page_content[:200]}...")
        print(f"Metadata: {split_docs[0].metadata}")
    
    return split_docs

In [7]:
chunks = split_documents(documents)

Split 167 documents into 316 chunks

Example chunk:
Content: Agentic AI Tutorial: Building Intelligent
Autonomous Systems
Table of Contents
1. Introduction to Agentic AI
2. Core Concepts and Architecture
3. Agent Frameworks
4. Planning and Reasoning
5. Memory a...
Metadata: {'producer': 'PDFCreator Online (www.pdfforge.org/online)', 'creator': 'PDFCreator Online (www.pdfforge.org/online)', 'creationdate': '2025-09-18T14:22:04+00:00', 'source': '../data/pdf_files/agentic_ai_tutorial.pdf', 'file_path': '../data/pdf_files/agentic_ai_tutorial.pdf', 'total_pages': 74, 'format': 'PDF 1.4', 'title': 'Merged with PDFCreator Online', 'author': '', 'subject': '', 'keywords': '', 'moddate': '2025-09-18T14:22:04+00:00', 'trapped': '', 'modDate': 'D:20250918142204Z', 'creationDate': "D:20250918142204+00'00'", 'page': 0}
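
How `chunk_size` and `chunk_overlap` interact can be sketched with a plain character sliding window. This is a simplification: `RecursiveCharacterTextSplitter` additionally prefers to break on the separators listed above, so its chunks will not line up exactly with these windows. The helper name `sliding_window_chunks` and the sample string are illustrative only.

```python
def sliding_window_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
demo_chunks = sliding_window_chunks(sample, chunk_size=12, chunk_overlap=4)
for chunk in demo_chunks:
    print(len(chunk), repr(chunk))
```

Each window repeats the last `chunk_overlap` characters of the previous one, which is what lets a sentence that straddles a boundary still appear whole in at least one chunk.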


In [8]:
from sentence_transformers import SentenceTransformer
import numpy as np
from typing import List, Dict, Any

class EmbeddingManager:
    """Handles document embedding generation using SentenceTransformer"""
    
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        """
        Initialize the embedding manager
        
        Args:
            model_name: HuggingFace model name for sentence embeddings
        """
        self.model_name = model_name
        self.model = None
        self._load_model()

    def _load_model(self):
        """Load the SentenceTransformer model"""
        try:
            print(f"Loading embedding model: {self.model_name}")
            self.model = SentenceTransformer(self.model_name)
            print(f"Model loaded successfully. Embedding dimension: {self.model.get_sentence_embedding_dimension()}")
        except Exception as e:
            print(f"Error loading model {self.model_name}: {e}")
            raise

    def generate_embeddings(self, texts: List[str]) -> np.ndarray:
        """
        Generate embeddings for a list of texts
        
        Args:
            texts: List of text strings to embed
            
        Returns:
            numpy array of embeddings with shape (len(texts), embedding_dim)
        """
        if not self.model:
            raise ValueError("Model not loaded")
        
        print(f"Generating embeddings for {len(texts)} texts...")
        embeddings = self.model.encode(texts, show_progress_bar=True)
        print(f"Generated embeddings with shape: {embeddings.shape}")
        return embeddings


# Initialize the embedding manager
embedding_manager = EmbeddingManager()
embedding_manager

Loading embedding model: all-MiniLM-L6-v2
Model loaded successfully. Embedding dimension: 384


<__main__.EmbeddingManager at 0x33be989e0>

In [9]:
import os
import chromadb
import uuid

class VectorStore:
    """Manages document embeddings in a ChromaDB vector store"""
    
    def __init__(self, collection_name: str = "pdf_documents", persist_directory: str = "../data/vector_store"):
        """
        Initialize the vector store
        
        Args:
            collection_name: Name of the ChromaDB collection
            persist_directory: Directory to persist the vector store
        """
        self.collection_name = collection_name
        self.persist_directory = persist_directory
        self.client = None
        self.collection = None
        self._initialize_store()

    def _initialize_store(self):
        """Initialize ChromaDB client and collection"""
        try:
            # Create persistent ChromaDB client
            os.makedirs(self.persist_directory, exist_ok=True)
            self.client = chromadb.PersistentClient(path=self.persist_directory)
            
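            # Get or create the collection. "hnsw:space": "cosine" requests cosine
            # distance; it applies only when the collection is first created, so an
            # existing collection keeps whatever metric it was built with.
            self.collection = self.client.get_or_create_collection(
                name=self.collection_name,
                metadata={
                    "description": "PDF document embeddings for RAG",
                    "hnsw:space": "cosine",
                },
            )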
            print(f"Vector store initialized. Collection: {self.collection_name}")
            print(f"Existing documents in collection: {self.collection.count()}")
            
        except Exception as e:
            print(f"Error initializing vector store: {e}")
            raise

    def add_documents(self, documents: List[Any], embeddings: np.ndarray):
        """
        Add documents and their embeddings to the vector store
        
        Args:
            documents: List of LangChain documents
            embeddings: Corresponding embeddings for the documents
        """
        if len(documents) != len(embeddings):
            raise ValueError("Number of documents must match number of embeddings")
        
        print(f"Adding {len(documents)} documents to vector store...")
        
        # Prepare data for ChromaDB
        ids = []
        metadatas = []
        documents_text = []
        embeddings_list = []
        
        for i, (doc, embedding) in enumerate(zip(documents, embeddings)):
            # Generate unique ID
            doc_id = f"doc_{uuid.uuid4().hex[:8]}_{i}"
            ids.append(doc_id)
            
            # Prepare metadata
            metadata = dict(doc.metadata)
            metadata['doc_index'] = i
            metadata['content_length'] = len(doc.page_content)
            metadatas.append(metadata)
            
            # Document content
            documents_text.append(doc.page_content)
            
            # Embedding
            embeddings_list.append(embedding.tolist())
        
        # Add to collection
        try:
            self.collection.add(
                ids=ids,
                embeddings=embeddings_list,
                metadatas=metadatas,
                documents=documents_text 
            )
            print(f"Successfully added {len(documents)} documents to vector store")
            print(f"Total documents in collection: {self.collection.count()}")
            
        except Exception as e:
            print(f"Error adding documents to vector store: {e}")
            raise

vectorstore = VectorStore()
vectorstore

Vector store initialized. Collection: pdf_documents
Existing documents in collection: 1896


<__main__.VectorStore at 0x33d0b5d00>

In [13]:
texts=[doc.page_content for doc in chunks]

embeddings = embedding_manager.generate_embeddings(texts)

vectorstore.add_documents(chunks, embeddings)

Generating embeddings for 316 texts...



Generated embeddings with shape: (316, 384)
Adding 316 documents to vector store...
Successfully added 316 documents to vector store
Total documents in collection: 2212
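
The collection already held 1896 entries before this run, and 1896 is 6 × 316, which is consistent with earlier runs having added the same chunks under fresh random IDs. One way to make ingestion repeatable is to derive each ID from the chunk itself. The sketch below is an assumption, not part of the notebook: `stable_chunk_id` is a hypothetical helper, and it would need to be paired with Chroma's `collection.upsert(...)` in place of `collection.add(...)` so that a repeated ID overwrites the existing entry.

```python
import hashlib


def stable_chunk_id(source: str, page: int, content: str) -> str:
    """Derive a deterministic ID from where a chunk came from and what it contains."""
    digest = hashlib.sha256(f"{source}|{page}|{content}".encode("utf-8")).hexdigest()
    return f"doc_{digest[:16]}"


pdf_path = "../data/pdf_files/agentic_ai_tutorial.pdf"
id_a = stable_chunk_id(pdf_path, 0, "Agentic AI Tutorial")
id_b = stable_chunk_id(pdf_path, 0, "Agentic AI Tutorial")  # same chunk, second run
id_c = stable_chunk_id(pdf_path, 1, "Agentic AI Tutorial")  # same text, different page

print(id_a == id_b)  # identical input gives the same ID
print(id_a == id_c)  # a different page gives a different ID
```

Including the content in the hash means an edited chunk gets a new ID, so stale entries for the old text would still need to be deleted separately.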


In [14]:
class RAGRetriever:
    """Handles query-based retrieval from the vector store"""
    
    def __init__(self, vector_store: VectorStore, embedding_manager: EmbeddingManager):
        """
        Initialize the retriever
        
        Args:
            vector_store: Vector store containing document embeddings
            embedding_manager: Manager for generating query embeddings
        """
        self.vector_store = vector_store
        self.embedding_manager = embedding_manager

    def retrieve(self, query: str, top_k: int = 5, score_threshold: float = 0.0) -> List[Dict[str, Any]]:
        """
        Retrieve relevant documents for a query
        
        Args:
            query: The search query
            top_k: Number of top results to return
            score_threshold: Minimum similarity score threshold
            
        Returns:
            List of dictionaries containing retrieved documents and metadata
        """
        print(f"Retrieving documents for query: '{query}'")
        print(f"Top K: {top_k}, Score threshold: {score_threshold}")
        
        # Generate query embedding
        query_embedding = self.embedding_manager.generate_embeddings([query])
        
        # Search in vector store
        try:
            results = self.vector_store.collection.query(
                query_embeddings=query_embedding.tolist(),
                n_results=top_k
            )
            
            # Process results
            retrieved_docs = []
            
            if results['documents'] and results['documents'][0]:
                documents = results['documents'][0]
                metadatas = results['metadatas'][0]
                distances = results['distances'][0]
                ids = results['ids'][0]
                
                for i, (doc_id, document, metadata, distance) in enumerate(zip(ids, documents, metadatas, distances)):
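                    # Convert distance to a similarity score. 1 - distance equals cosine
                    # similarity only if the collection uses cosine distance; under
                    # Chroma's default squared-L2 metric the value can drop below 0.
                    similarity_score = 1 - distance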
                    
                    if similarity_score >= score_threshold:
                        retrieved_docs.append({
                            'id': doc_id,
                            'content': document,
                            'metadata': metadata,
                            'similarity_score': similarity_score,
                            'distance': distance,
                            'rank': i + 1
                        })
                
                print(f"Retrieved {len(retrieved_docs)} documents (after filtering)")
            else:
                print("No documents found")
            
            return retrieved_docs
            
        except Exception as e:
            print(f"Error during retrieval: {e}")
            return []

rag_retriever = RAGRetriever(vectorstore, embedding_manager)


In [16]:
results = rag_retriever.retrieve("What is agentic AI?")
print("Retrieval successful!")
if results:
    print(f"Found {len(results)} relevant documents")
    print(f"Top result similarity: {results[0]['similarity_score']:.4f}")
    print(f"Content preview: {results[0]['content']}...")

Retrieving documents for query: 'What is agentic AI?'
Top K: 5, Score threshold: 0.0
Generating embeddings for 1 texts...



Generated embeddings with shape: (1, 384)
Retrieved 5 documents (after filtering)
Retrieval successful!
Found 5 relevant documents
Top result similarity: 0.3206
Content preview: Agentic AI Tutorial: Building Intelligent
Autonomous Systems
Table of Contents
1. Introduction to Agentic AI
2. Core Concepts and Architecture
3. Agent Frameworks
4. Planning and Reasoning
5. Memory and Knowledge Management
6. Tool Use and Actions
7. Multi-Agent Systems
8. Real-World Applications
9. Implementation Guide
10. Best Practices
Introduction to Agentic AI
Agentic AI refers to AI systems that can act autonomously to achieve goals, make decisions, and interact with their environment. Unlike traditional AI that responds to
prompts, agentic systems proactively plan, execute, and adapt their behavior.
Key Characteristics:
Autonomy: Can operate without constant human supervision
Goal-oriented: Pursues objectives over multiple steps
Adaptive: Learns from experience and adjusts strategies
Interactive: Communi
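
A note on reading the 0.3206 score: the collection above was created without a distance setting, and Chroma's default metric is, to my knowledge, squared L2 rather than cosine. If the embeddings are unit-length (an assumption here, though `all-MiniLM-L6-v2` is commonly described as producing normalized vectors), squared L2 distance `d` and cosine similarity are related by `cos = 1 - d/2`, not `1 - d`. The sketch below checks that identity on random unit vectors; `cosine_from_squared_l2` is a hypothetical helper.

```python
import numpy as np


def cosine_from_squared_l2(distance: float) -> float:
    """For unit vectors, ||a - b||^2 = 2 - 2*cos(a, b), hence cos = 1 - d/2."""
    return 1.0 - distance / 2.0


rng = np.random.default_rng(0)
a = rng.normal(size=384)
b = rng.normal(size=384)
a /= np.linalg.norm(a)  # normalize to unit length
b /= np.linalg.norm(b)

squared_l2 = float(np.sum((a - b) ** 2))
cosine = float(a @ b)

print(np.isclose(cosine_from_squared_l2(squared_l2), cosine))
```

Under these assumptions, a logged `1 - distance` of 0.3206 means a distance of 0.6794 and a cosine similarity of roughly 0.66. It would also explain why some queries in the later Gradio log report 0 documents at a threshold of 0.0: any squared-L2 distance above 1 turns into a negative score and is filtered out.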

In [74]:
from agents import Agent, Runner, trace, function_tool
import asyncio
import os
from dotenv import load_dotenv

load_dotenv(override=True)


True

In [77]:
@function_tool
def retrieve_with_rag_retriever(query: str):
    """Retrieve relevant documents using rag_retriever."""
    return rag_retriever.retrieve(query)

instructions = "Use the retrieve_with_rag_retriever tool to answer the user's question using retrieved data."
rag_agent = Agent(
    name="RAG-Agent",
    tools=[retrieve_with_rag_retriever],
    instructions=instructions,
    model="gpt-4o-mini"
)

In [82]:

async def run_rag_agent(query:str):
    with trace("rag_agent"):
        response = await Runner.run(rag_agent, query)
        return response

response = await run_rag_agent("What is agentic AI?")
print(response)


Retrieving documents for query: 'What is agentic AI?'
Top K: 5, Score threshold: 0.0
Generating embeddings for 1 texts...



Generated embeddings with shape: (1, 384)
Retrieved 5 documents (after filtering)
RunResult:
- Last agent: Agent(name="RAG-Agent", ...)
- Final output (str):
    Agentic AI refers to AI systems designed to act autonomously, achieving goals, making decisions, and interacting with their environments. Unlike traditional AI, which typically responds to user prompts, agentic systems can proactively plan, execute tasks, and adapt their behaviors based on experiences.
    
    ### Key Characteristics of Agentic AI:
    - **Autonomy**: Functions independently without constant human oversight.
    - **Goal-Oriented**: Pursues specific objectives through a series of steps.
    - **Adaptive**: Learns and adjusts strategies based on experiences.
    - **Interactive**: Capable of communication with humans and other systems.
    - **Persistent**: Maintains context and memory across sessions.
    
    ### Agent Types:
    1. **Reactive Agents**: Respond immediately to stimuli.
    
    Overall, agent

In [85]:
import gradio as gr

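def gradio_rag_agent(query):
    # Gradio calls this synchronous handler from a worker thread that has no
    # running event loop, so asyncio.run can drive the coroutine here.
    result = asyncio.run(run_rag_agent(query))
    # Return only the answer text; the RunResult object would render as a debug summary.
    return result.final_output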

with gr.Blocks() as iface:
    gr.Markdown("# 🦙 RAG Agent Demo")
    gr.Markdown("Ask a question and get an answer using the RAG agent. Try something like: **What is agentic AI?**")
    with gr.Row():
        with gr.Column():
            query = gr.Textbox(lines=3, label="Ask a question", placeholder="Type your question here...")
            submit_btn = gr.Button("Submit", variant="primary")
        with gr.Column():
            output = gr.Textbox(label="RAG Agent Response", lines=8, interactive=False)
    submit_btn.click(fn=gradio_rag_agent, inputs=query, outputs=output)

iface.launch()

* Running on local URL:  http://127.0.0.1:7862
* To create a public link, set `share=True` in `launch()`.




Retrieving documents for query: 'what is agentic AI?'
Top K: 5, Score threshold: 0.0
Generating embeddings for 1 texts...



Generated embeddings with shape: (1, 384)
Retrieved 5 documents (after filtering)
Retrieving documents for query: 'how to install langgraph'
Top K: 5, Score threshold: 0.0
Generating embeddings for 1 texts...



Generated embeddings with shape: (1, 384)
Retrieved 0 documents (after filtering)
Retrieving documents for query: 'What all information do you have?'
Top K: 5, Score threshold: 0.0
Generating embeddings for 1 texts...



Generated embeddings with shape: (1, 384)
Retrieved 0 documents (after filtering)
Retrieving documents for query: 'What is Hugging Face?'
Top K: 5, Score threshold: 0.0
Generating embeddings for 1 texts...



Generated embeddings with shape: (1, 384)
Retrieved 0 documents (after filtering)
