# Agentic Retrieval-Augmented Generation (RAG) with Local Llama 2 & ChromaDB

## Overview
This notebook implements an **Agentic Retrieval-Augmented Generation (RAG) pipeline** that uses a local **Llama 2** model and **ChromaDB** for question answering over a PDF document. Before generating a response, the system asks the model whether the query needs additional document context and retrieves it only when it does.

### Key Features:
- **Llama 2 Model** for high-quality text generation.
- **PDF Document Processing** to extract relevant information.
- **ChromaDB Vector Store** for efficient semantic search.
- **Dynamic Context Retrieval** to improve answer accuracy.
- **Two Answering Modes**:
  - With RAG (Retrieves relevant document content before responding).
  - Without RAG (Directly generates responses).

In [1]:
# Install required packages if you have not installed them already
!pip install -r requirements.txt --quiet

In [2]:
# Standard Libraries
import os

# Numerical Computing
import numpy as np
import torch

# Hugging Face Hub for Model Download
from huggingface_hub import hf_hub_download

# LangChain Components
from langchain_community.llms import LlamaCpp
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter

# Sentence Transformers for Embeddings
from sentence_transformers import SentenceTransformer

# ChromaDB for Vector Storage
import chromadb
from chromadb.utils import embedding_functions

# Transformers 
import transformers
import sentence_transformers

## 🔧 Step 1: Model Setup

We will set up **Llama 2 (7B)** for text generation. If the model is not found locally, it will be downloaded from Hugging Face.

In [3]:
MODEL_FILENAME = "llama-2-7b-chat.Q4_K_M.gguf"
MODEL_DIR = "model"
EXPECTED_PATH = os.path.join(MODEL_DIR, MODEL_FILENAME)

# Ensure model directory exists
os.makedirs(MODEL_DIR, exist_ok=True)

# Check if model already exists
if os.path.exists(EXPECTED_PATH):
    print(f"Model already exists at: {EXPECTED_PATH}")
    model_path = EXPECTED_PATH
else:
    print("Model not found locally. Downloading Llama 2 model...")
    
    # Download the model
    model_path = hf_hub_download(
        repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
        filename=MODEL_FILENAME,
        local_dir=MODEL_DIR
    )
    print(f"Model downloaded to: {model_path}")

print(f"Using model at: {model_path}")

Model already exists at: model/llama-2-7b-chat.Q4_K_M.gguf
Using model at: model/llama-2-7b-chat.Q4_K_M.gguf


In [4]:
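# Initialize the model from the resolved local path, offloading layers to the GPU
llm = LlamaCpp(
    model_path=model_path,  # Path resolved in the previous cell (existing file or fresh download)
    temperature=0.25,
    max_tokens=2000,  # Maximum number of tokens to generate per call
    n_ctx=4096,       # Context window shared by the prompt and the generated tokens
    top_p=1.0,
    verbose=False,
    n_gpu_layers=30,  # Offload 30 layers to the GPU (requires a GPU-enabled llama-cpp-python build)
    n_batch=512,      # Number of prompt tokens processed per batch
    f16_kv=True,      # Use half precision for the key/value cache
    use_mlock=True,   # Lock model memory to prevent swapping
    use_mmap=True     # Memory-map the model file for faster loading
)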
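
Because `n_ctx` covers both the prompt and the generated tokens, `max_tokens=2000` leaves roughly 2,096 tokens for the prompt, including any retrieved context. The sketch below makes that arithmetic explicit. The helper names are hypothetical, and the four-characters-per-token ratio is only a rough rule of thumb assumed for illustration, not a property of the Llama 2 tokenizer.

```python
import math


def prompt_token_budget(n_ctx: int, max_tokens: int) -> int:
    """Tokens left for the prompt after reserving room for generation."""
    return max(n_ctx - max_tokens, 0)


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; chars_per_token is an assumed heuristic."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(prompt: str, n_ctx: int = 4096, max_tokens: int = 2000) -> bool:
    """Check whether the estimated prompt size fits the remaining budget."""
    return estimate_tokens(prompt) <= prompt_token_budget(n_ctx, max_tokens)


# Budget implied by the settings used in the cell above
print(prompt_token_budget(4096, 2000))
```

A check like this could be applied to the enriched prompt before calling the model, since several retrieved chunks may together approach the budget.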

## 📄 Step 2: Loading and Processing the PDF Document

To enable context-aware question-answering, we load a **PDF document**, extract its content, and split it into manageable chunks for efficient retrieval.

In [5]:
# --- Load the PDF Document ---

# Define the PDF file path
PDF_PATH = "./data/AIStudioDoc.pdf"
print(f"Loading PDF from: {PDF_PATH}")

# Load the PDF document
pdf_loader = PyPDFLoader(PDF_PATH)
documents = pdf_loader.load()

print(f"Successfully loaded {len(documents)} document(s) from the PDF.")

Loading PDF from: ./data/AIStudioDoc.pdf
Successfully loaded 8 document(s) from the PDF.


## ✂️ Step 3: Splitting the Document into Chunks

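Since large documents are difficult to process in full, we split the text into **overlapping chunks** with a target size of **500 characters** and an overlap of 50 characters. Note that `CharacterTextSplitter` splits only on its separator (a blank line, `"\n\n"`, by default), so text that does not contain the separator is not split and chunks can exceed the target size. This is consistent with the 8 pages yielding 8 chunks in the output below. These chunks will later be embedded and stored in ChromaDB.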

In [6]:
# --- Split the PDF Content into Manageable Chunks ---

# Define text splitting parameters
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50

# Initialize the text splitter
text_splitter = CharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)

# Split the PDF content into chunks
docs = text_splitter.split_documents(documents)

print(f"Successfully split PDF into {len(docs)} text chunks.")

Successfully split PDF into 8 text chunks.
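
To illustrate what chunk size and overlap mean, here is a minimal sketch in plain Python. It is not how `CharacterTextSplitter` works internally: the hypothetical `chunk_text` helper cuts at fixed character offsets instead of splitting on a separator, so every chunk respects the size limit.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "".join(chr(97 + i % 26) for i in range(1200))
pieces = chunk_text(sample)
print([len(p) for p in pieces])
```

Each chunk starts 450 characters after the previous one, so the last 50 characters of one chunk reappear at the start of the next.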


## 🔍 Step 4: Initializing the Embedding Model

To convert text into numerical representations for efficient similarity search, we use **all-MiniLM-L6-v2** from `sentence-transformers`.

In [7]:
# --- Initialize the Embedding Model ---

# Define the embedding model name
MODEL_NAME = "all-MiniLM-L6-v2"

# Load the embedding model
embedding_model = SentenceTransformer(MODEL_NAME)

print(f"Successfully loaded embedding model: {MODEL_NAME}")



Successfully loaded embedding model: all-MiniLM-L6-v2


## 🧠 Step 5: Computing Embeddings for Document Chunks

Each chunk is converted into a **vector representation** using our embedding model. This allows us to perform **semantic similarity searches** later.

In [8]:
# --- Compute Embeddings for Each Text Chunk ---

# Extract text content from each chunk
doc_texts = [doc.page_content for doc in docs]

# Compute embeddings for the extracted text chunks
document_embeddings = embedding_model.encode(doc_texts, convert_to_numpy=True)

# Display the result
print("Successfully computed embeddings for each text chunk.")
print(f"Embeddings Shape: {document_embeddings.shape}")

Successfully computed embeddings for each text chunk.
Embeddings Shape: (8, 384)
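
For intuition about what "semantic similarity" means for these 384-dimensional vectors, the sketch below computes cosine similarity between two vectors in plain Python. The notebook itself leaves distance computation to ChromaDB; `cosine_similarity` is a hypothetical helper shown only to illustrate the idea, and the short vectors are made-up stand-ins for real embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings
print(cosine_similarity([1.0, 0.0, 1.0], [1.0, 0.0, 0.9]))
```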


## 🗄️ Step 6: Storing Document Embeddings in ChromaDB

We initialize **ChromaDB**, a high-performance **vector database**, and store our computed embeddings to enable efficient retrieval of relevant text chunks.

In [9]:
# --- Initialize and Populate the Chroma Vector Database ---

# Define Chroma database path and collection name
CHROMA_DB_PATH = "./chroma_db"
COLLECTION_NAME = "document_embeddings"

# Initialize Chroma client
chroma_client = chromadb.PersistentClient(path=CHROMA_DB_PATH)
collection = chroma_client.get_or_create_collection(name=COLLECTION_NAME)

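# Add document embeddings to the Chroma collection.
# Re-running this cell against an existing persistent collection re-adds IDs that are
# already stored, which produces the "existing embedding ID" messages shown below.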
for i, embedding in enumerate(document_embeddings):
    collection.add(
        ids=[str(i)],  # Chroma requires string IDs
        embeddings=[embedding.tolist()],
        metadatas=[{"text": doc_texts[i]}]
    )

print("Successfully populated Chroma database with document embeddings.")

Add of existing embedding ID: 0
Insert of existing embedding ID: 0
Add of existing embedding ID: 1
Insert of existing embedding ID: 1
Add of existing embedding ID: 2
Insert of existing embedding ID: 2
Add of existing embedding ID: 3
Insert of existing embedding ID: 3
Add of existing embedding ID: 4
Insert of existing embedding ID: 4
Add of existing embedding ID: 5
Insert of existing embedding ID: 5
Add of existing embedding ID: 6
Insert of existing embedding ID: 6
Add of existing embedding ID: 7
Insert of existing embedding ID: 7


Successfully populated Chroma database with document embeddings.
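
The "existing embedding ID" messages above are what one would expect when the cell is run against a persistent collection that already holds IDs `0` through `7`. One option, sketched here as an assumption rather than as the notebook's method, is to derive IDs from the chunk text so that the same content always maps to the same ID. The helper names are hypothetical. Such IDs could then be passed to an update-or-insert call such as Chroma's `collection.upsert` instead of `collection.add`.

```python
import hashlib


def stable_chunk_id(text: str, prefix: str = "chunk") -> str:
    """Derive a deterministic ID from the chunk content."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"{prefix}-{digest[:16]}"


def dedupe_by_id(texts: list[str]) -> dict[str, str]:
    """Map each stable ID to its text, keeping the first occurrence."""
    unique: dict[str, str] = {}
    for text in texts:
        unique.setdefault(stable_chunk_id(text), text)
    return unique


chunks = ["first chunk", "second chunk", "first chunk"]
print(len(dedupe_by_id(chunks)))
```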


## 🔎 Step 7: Implementing Vector Search Tool

To retrieve relevant text passages from the database, we define a **vector search function** that finds the most relevant chunks based on a user query.

In [10]:
# --- Define the Vector Search Tool ---
def vector_search_tool(query: str) -> str:
    """
    Searches the Chroma database for relevant text chunks based on the query.
    Computes the query embedding, retrieves the top 5 most relevant text chunks,
    and returns them as a formatted string.
    """
    # Compute the query embedding
    query_embedding = embedding_model.encode(query, convert_to_numpy=True).tolist()
    
    # Define the number of nearest neighbors to retrieve
    TOP_K = 5
    
    # Perform the search in the Chroma database
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=TOP_K
    )
    
    # Retrieve and format the corresponding text chunks
    retrieved_chunks = [metadata["text"] for metadata in results["metadatas"][0]]
    return "\n\n".join(retrieved_chunks)
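
Conceptually, `collection.query` returns the stored vectors closest to the query vector. The sketch below shows a brute-force version of that idea in plain Python. It is only an illustration: Chroma uses an index rather than a linear scan, and squared Euclidean distance is assumed here, whereas the metric Chroma actually applies depends on the collection's configuration. `squared_l2` and `top_k_search` are hypothetical names.

```python
import heapq
from typing import Sequence


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k_search(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    texts: Sequence[str],
    k: int = 5,
) -> list[tuple[float, str]]:
    """Return up to k (distance, text) pairs, closest first."""
    scored = ((squared_l2(query_vec, vec), text) for vec, text in zip(doc_vecs, texts))
    return heapq.nsmallest(k, scored, key=lambda pair: pair[0])


# Toy 2-D vectors standing in for 384-dimensional embeddings
vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
labels = ["chunk a", "chunk b", "chunk c"]
for distance, text in top_k_search([0.9, 0.9], vectors, labels, k=2):
    print(f"{text}: {distance:.2f}")
```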

## 🤖 Step 8: Context Need Assessment

Instead of always retrieving context, we determine if the query **requires external document context** before generating a response. This creates an agentic workflow that makes autonomous decisions to complete the task at hand.

In [11]:
# --- Define the Meta-Evaluation Function ---
def needs_context(query: str) -> bool:
    """
    Determines if additional context from an external document is required to generate an accurate and detailed answer.
    Returns True if context is needed (response contains "YES"), False otherwise.

    Args:
        query (str): The user's query to evaluate.

    Returns:
        bool: True if external context is required, False otherwise.
    """
    meta_prompt = (
        "Based on the following query, decide if additional context from an external document is needed "
        "to generate an accurate and detailed answer. Have a tendency to use an external document if the query is not a very familiar topic. If in doubt, assume context is required and answer 'YES'.\n"
        "Answer with a single word: YES if additional context from an external document would be helpful to answer the query, "
        "or NO if not. Do not say anything other than YES or NO.\n"
        f"Query: {query}\n"
        "Answer:"
    )
    meta_response = llm.invoke(meta_prompt)
    print("Meta Response (is external document retrieval necessary?):", meta_response)
    return "YES" in meta_response.upper()


# --- Define the Main Answer Generation Function with Agentic RAG ---
def generate_answer_with_agentic_rag(query: str) -> str:
    """
    Generates a detailed and accurate answer to the user's query by using context when needed.
    If additional context is required, it is retrieved from the vector store and included in the prompt.
    If not, the answer is generated using the query alone.

    Args:
        query (str): The user's query to answer.

    Returns:
        str: The generated answer based on the query.
    """
    if needs_context(query):
        # Retrieve additional context from the vector store
        context = vector_search_tool(query)
        
        # Construct the enriched prompt with the additional context
        enriched_prompt = (
            "Here is additional context from our document:\n"
            f"{context}\n\n"
            f"Based on this context and the query: {query}\n"
            "Please provide a detailed and accurate answer.\n"
            "Answer:"
        )
        final_response = llm.invoke(enriched_prompt)
    else:
        # Generate an answer using the original query directly
        direct_prompt = (
            "Please provide a detailed and accurate answer to the following query:\n"
            f"{query}\n"
            "Answer:"
        )
        final_response = llm.invoke(direct_prompt)
    
    return final_response


# --- Define the Answer Generation Function without RAG ---
def generate_answer_without_rag(query: str) -> str:
    """
    Generates a detailed and accurate answer to the user's query without using any additional context from external documents.
    
    Args:
        query (str): The user's query to answer.

    Returns:
        str: The generated answer based on the query.
    """
    direct_prompt = (
        "Please provide a detailed and accurate answer to the following query:\n"
        f"{query}\n"
        "Answer:"
    )
    final_response = llm.invoke(direct_prompt)
    
    return final_response
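
The substring check `"YES" in meta_response.upper()` also matches words that merely contain those letters (for example "EYES") and ignores an explicit "NO" that precedes a later "YES". A slightly stricter parser is sketched below as one possible alternative; `parse_yes_no` is a hypothetical helper, and defaulting to `True` when no decision is found is an assumption that mirrors the prompt's "if in doubt, assume context is required" instruction.

```python
import re

# Match YES or NO only as whole words
_DECISION = re.compile(r"\b(YES|NO)\b")


def parse_yes_no(response: str, default: bool = True) -> bool:
    """Return True for YES, False for NO, using the first whole-word match."""
    match = _DECISION.search(response.upper())
    if match is None:
        return default  # no clear decision in the response
    return match.group(1) == "YES"


for reply in [" YES", "No.", "I am not sure"]:
    print(repr(reply), parse_yes_no(reply))
```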

## 💡 Step 9: Answer Generation with Agentic RAG

If the model decides that additional context is needed, the pipeline retrieves **relevant document chunks** from ChromaDB and includes them in the prompt used to generate the answer.

In [12]:
query = "What are the key features of Z by HP AI Studio?"
print("User Query:", query)
final_answer = generate_answer_with_agentic_rag(query)
print("\nFinal Answer:")
print(final_answer)

User Query: What are the key features of Z by HP AI Studio?
Meta Response (is external document retrieval necessary?):  YES

Final Answer:
 Based on the provided context, Z by HP AI Studio is a standalone application designed for data scientists and engineers that offers several key features to enhance their productivity and collaboration. Here are some of the key features of Z by HP AI Studio:
1. Data Connectors: Z by HP AI Studio allows users to connect to multiple data-stores across local and cloud networks, making it easier to access the correct data and packages wherever they are.
2. Local Computation: The platform enables users to perform all their computations locally without interruption, allowing them to manage development, data, and model environments without any disruptions.
3. Monitoring: AI Studio runs the tools users select natively, providing real-time monitoring of GPU, CPU, and memory consumption. Users can visualize the effects of tests they run in real-time, giving t

## ⚡ Step 10: Answer Generation Without RAG

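Here we answer the same query without RAG to show the difference between the two answers. Without the document context, the model relies only on what it learned during training, so its description of the product may not match the PDF.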

In [13]:
query = "What are the key features of Z by HP AI Studio?"
print("User Query:", query)
final_answer = generate_answer_without_rag(query)
print("\nFinal Answer:")
print(final_answer)

User Query: What are the key features of Z by HP AI Studio?

Final Answer:

Z by HP AI Studio is an all-in-one creative tool that enables users to design, edit, and share their digital content. Here are some of its key features:
1. Design: Z by HP AI Studio offers a wide range of design tools, including templates, graphics, and text options. Users can create custom designs using these tools or use the AI-powered design assistant to generate unique designs based on their preferences.
2. Editing: The platform provides advanced editing features, such as color correction, resizing, and cropping. Users can also add special effects, filters, and overlays to enhance their content.
3. Collaboration: Z by HP AI Studio allows users to collaborate on designs in real-time. They can invite others to edit or view their designs, making it easier to work together on projects.
4. Sharing: Once a design is complete, users can easily share it on social media platforms, messaging apps, or via email. The p