# Agentic Retrieval-Augmented Generation (RAG) with Local Llama 2 & ChromaDB

## Overview
This notebook implements an **agentic Retrieval-Augmented Generation (RAG) pipeline**. It transcribes audio data (for example, from an Omi streaming device), stores both the transcription and the audio, and then uses the transcription with a local **Llama 2** model and **ChromaDB** for question answering. Before generating a response, the system decides whether additional context is needed.

### Key Features:
- **Audio Transcription Workflow** for processing data from devices like Omi.
- **Storage of Audio and Transcriptions** for AI processing.
- **Llama 2 Model** for high-quality text generation.
- **ChromaDB Vector Store** for efficient semantic search on transcriptions.
- **Dynamic Context Retrieval** to improve answer accuracy.
- **Two Answering Modes**:
  - With RAG (Retrieves relevant document content before responding).
  - Without RAG (Directly generates responses).

In [None]:
# Upgrade pip first, then install the required packages if they are not installed already
%pip install -q --upgrade pip
%pip install -q -r requirements.txt

## 🔧 Step 1: Model Setup

We will set up **Llama 2 (7B)** for text generation. If the model is not found locally, it will be downloaded from Hugging Face.

In [None]:
%pip install -q huggingface-hub

import os
from huggingface_hub import hf_hub_download

MODEL_FILENAME = "llama-2-7b-chat.Q4_K_M.gguf"
MODEL_DIR = "model"
EXPECTED_PATH = os.path.join(MODEL_DIR, MODEL_FILENAME)

# Ensure model directory exists
os.makedirs(MODEL_DIR, exist_ok=True)

# Check if model already exists
if os.path.exists(EXPECTED_PATH):
    print(f"Model already exists at: {EXPECTED_PATH}")
    model_path = EXPECTED_PATH
else:
    print("Model not found locally. Downloading Llama 2 model...")
    
    # Download the model
    model_path = hf_hub_download(
        repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
        filename=MODEL_FILENAME,
        local_dir=MODEL_DIR
    )
    print(f"Model downloaded to: {model_path}")

print(f"Using model at: {model_path}")

In [None]:
%pip install -q llama-cpp-python langchain-community

# Check if the model file exists
if not os.path.exists(model_path):
    raise FileNotFoundError(f"Model file not found at {model_path}")

# Use LangChain's LlamaCpp wrapper around llama-cpp-python: it accepts the
# sampling settings below at construction time and provides the llm.invoke()
# method that the later steps rely on
from langchain_community.llms import LlamaCpp

# Initialize the model with the local path and GPU acceleration
llm = LlamaCpp(
    model_path=model_path,
    temperature=0.25,
    max_tokens=2000,
    n_ctx=4096,
    top_p=1.0,
    verbose=False,
    n_gpu_layers=30,  # Offload up to 30 layers to the GPU
    n_batch=512,      # Number of prompt tokens processed per batch
    f16_kv=True,      # Enable half-precision for key/value cache
    use_mlock=True,   # Lock memory to prevent swapping
    use_mmap=True     # Utilize memory mapping for faster loading
)

## 📄 Step 2: Loading, Transcribing, and Storing Audio Data

This step outlines the process for loading audio data (e.g., from an Omi streaming device), transcribing it, and preparing it for storage and further processing. Both the raw audio and its transcription are valuable assets.
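
The transcription cell that follows keeps its result in memory only. Below is a minimal sketch of how the audio and its transcription could be persisted side by side. The `store_audio_and_transcript` helper, the `./storage` directory, and the JSON field names are all assumptions made for illustration and are not part of the notebook.

```python
import json
import shutil
from pathlib import Path


def store_audio_and_transcript(audio_path, transcript, storage_dir="./storage"):
    """Copy the audio file into storage_dir and write its transcript next to it.

    Hypothetical helper: the directory layout and JSON fields are assumptions.
    Returns the path of the JSON file that was written.
    """
    src = Path(audio_path)
    dest_dir = Path(storage_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    # Keep the raw audio alongside the text so either can be reprocessed later.
    audio_copy = dest_dir / src.name
    shutil.copy2(src, audio_copy)

    record = {"audio_file": audio_copy.name, "transcript": transcript}
    json_path = dest_dir / (src.stem + ".json")
    json_path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return json_path
```

Once `AUDIO_PATH` and `text_content` exist, a call such as `store_audio_and_transcript(AUDIO_PATH, text_content)` would write `tester.mp3` and `tester.json` into the storage directory.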


In [None]:
# --- Load the Audio File and Define the Audio Collection System ---

# Install whisper if not already installed
%pip install -q openai-whisper

# Install ffmpeg-python bindings if not already installed
%pip install -q ffmpeg-python

import shutil
import sys
import subprocess

# Check if ffmpeg is available and working
try:
    subprocess.run(["ffmpeg", "-version"], capture_output=True, check=True)
    print("ffmpeg found and working.")
except FileNotFoundError:
    raise FileNotFoundError("ffmpeg not found. Please install ffmpeg and ensure it's in your PATH.")
except subprocess.CalledProcessError as e:
    print(f"Error running ffmpeg: {e}")
    raise RuntimeError("ffmpeg is not working correctly.") from e

import whisper

# No need to manually set ffmpeg_dir if ffmpeg is installed system-wide

# Define the audio file path
AUDIO_PATH = "./data/tester.mp3"
# Check if the file exists
if not os.path.exists(AUDIO_PATH):
    raise FileNotFoundError(f"Audio file not found at {AUDIO_PATH}")
print(f"Loading audio from: {AUDIO_PATH}")

# Define the audio collection system
AUDIO_COLLECTION_SYSTEM = "MyAudioSystem"

# Load and transcribe the audio file using whisper
model = whisper.load_model("base")
result = model.transcribe(AUDIO_PATH)
text_content = result["text"]

# For compatibility with the rest of your code, wrap the text in a document-like object
class AudioDocument:
    def __init__(self, text):
        self.text = text
    def getPageText(self):
        return self.text

documents = [AudioDocument(text_content)]

print(f"Successfully loaded {len(documents)} document(s) from the audio.")

# Placeholder: extract relevant information from the text, e.g., keywords or phrases
def extractRelevantInfo(text):
    # For now, just return the text itself
    return text

# Placeholder: build an audio file record for the collection system
def createAudioFile(system, info):
    # For now, just return a tuple for demonstration
    return (system, info)

# Initialize an empty list for the audio files
audio_files = []

# Iterate through each document
for document in documents:
    # Get the document's transcribed text
    text_content = document.getPageText()
    extracted_info = extractRelevantInfo(text_content)

    # Create an audio file record based on the extracted information
    audio_file = createAudioFile(AUDIO_COLLECTION_SYSTEM, extracted_info)

    # Add the audio file to the collection system's list
    audio_files.append(audio_file)

print(f"Successfully loaded {len(documents)} document(s) from the audio and created {len(audio_files)} audio file(s).")




In [None]:
# --- Audio Event Detection (Sound Tagging) ---
# This cell uses the panns_inference package, which bundles a pre-trained
# Cnn14 PANNs model together with the AudioSet label list (the checkpoint is
# downloaded on first use). The model class alone exposes no `labels` attribute.
%pip install -q panns-inference torchaudio

import numpy as np
import torch
import torchaudio
from panns_inference import AudioTagging, labels

# Load the pre-trained PANNs model for sound event detection
device = "cuda" if torch.cuda.is_available() else "cpu"
panns_model = AudioTagging(checkpoint_path=None, device=device)

# Load and preprocess audio (PANNs expects 32 kHz input)
waveform, sr = torchaudio.load(AUDIO_PATH)
if sr != 32000:
    waveform = torchaudio.transforms.Resample(orig_freq=sr, new_freq=32000)(waveform)
    sr = 32000

# PANNs expects mono audio
if waveform.shape[0] > 1:
    waveform = torch.mean(waveform, dim=0, keepdim=True)

# Run the model on a (batch_size, num_samples) array and get tags
clipwise_output, _ = panns_model.inference(waveform.numpy())
# Get top 3 predicted sound classes
top_indices = np.argsort(clipwise_output[0])[::-1][:3]
sound_tags = [labels[i] for i in top_indices]

print("Detected sound types:", sound_tags)

## ✂️ Step 3: Chunking Audio Transcriptions for RAG

The transcribed text from the audio data is split into **small overlapping chunks** (approximately **500 characters**). These chunks are then used for embedding and storage in ChromaDB to enable semantic search for the RAG pipeline.
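
To make the overlap concrete, here is a minimal pure-Python sketch of fixed-size chunking with overlap. The `chunk_text` function is illustrative only; the LangChain splitter used in the notebook splits on a separator rather than at arbitrary character positions, so its chunk boundaries will differ.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters, so text near a
    boundary appears in both neighbouring chunks.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
for chunk in chunk_text(sample, chunk_size=12, chunk_overlap=4):
    print(chunk)
```

With a chunk size of 12 and an overlap of 4, each window starts 8 characters after the previous one, so the last 4 characters of one chunk are repeated at the start of the next.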


In [None]:
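# --- Split the Audio Content into Manageable Chunks with Sound Tags ---
%pip install -q langchain-text-splitters

from langchain_text_splitters import CharacterTextSplitter

CHUNK_SIZE = 500
CHUNK_OVERLAP = 50
# A Whisper transcript is typically one continuous string without blank lines,
# so split on spaces instead of the splitter's default "\n\n" separator
text_splitter = CharacterTextSplitter(separator=" ", chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)

# Split the transcription into chunks. create_documents takes plain strings and
# returns LangChain Document objects, which expose the page_content attribute
# used below (the AudioDocument wrapper does not have one)
docs = text_splitter.create_documents([document.getPageText() for document in documents])

# Append the sound tags to each chunk's text so they are embedded with the transcription
for doc in docs:
    doc.page_content = f"Transcription: {doc.page_content}\nSound tags: {', '.join(sound_tags)}"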

## 🔍 Step 4: Initializing the Embedding Model

To convert text into numerical representations for efficient similarity search, we use **all-MiniLM-L6-v2** from `sentence-transformers`.

In [None]:
# --- Initialize the Embedding Model ---
%pip install -q sentence-transformers

from sentence_transformers import SentenceTransformer

# Define the embedding model name
MODEL_NAME = "all-MiniLM-L6-v2"

# Load the embedding model
embedding_model = SentenceTransformer(MODEL_NAME)

print(f"Successfully loaded embedding model: {MODEL_NAME}")

## 🧠 Step 5: Computing Embeddings for Document Chunks

Each chunk is converted into a **vector representation** using our embedding model. This allows us to perform **semantic similarity searches** later.
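
Semantic similarity between two embeddings is commonly measured with cosine similarity, the cosine of the angle between the vectors. The sketch below uses made-up 3-dimensional vectors (real `all-MiniLM-L6-v2` embeddings have 384 dimensions); the vectors and the `cosine_similarity` helper are illustrative and not part of the notebook.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy "embeddings": the first two point in similar directions, the third does not.
meeting_notes = [0.9, 0.1, 0.0]
meeting_summary = [0.8, 0.2, 0.1]
weather_report = [0.0, 0.1, 0.9]

print(cosine_similarity(meeting_notes, meeting_summary))  # close to 1
print(cosine_similarity(meeting_notes, weather_report))   # close to 0
```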

In [None]:
# --- Compute Embeddings for Each Text Chunk ---

# Extract text content from each chunk
doc_texts = [doc.page_content for doc in docs]

# Compute embeddings for the extracted text chunks
document_embeddings = embedding_model.encode(doc_texts, convert_to_numpy=True)

# Display the result
print("Successfully computed embeddings for each text chunk.")
print(f"Embeddings Shape: {document_embeddings.shape}")

## 🗄️ Step 6: Storing Audio Transcription Embeddings in ChromaDB

We initialize **ChromaDB**, a high-performance **vector database**, and store our computed embeddings to enable efficient retrieval of relevant text chunks.

In [None]:
# --- Initialize and Populate the Chroma Vector Database ---
%pip install -q chromadb

import chromadb

# Define Chroma database path and collection name
CHROMA_DB_PATH = "./chroma_db"
COLLECTION_NAME = "document_embeddings"

# Initialize Chroma client
chroma_client = chromadb.PersistentClient(path=CHROMA_DB_PATH)
collection = chroma_client.get_or_create_collection(name=COLLECTION_NAME)

# Add document embeddings to the Chroma collection
for i, embedding in enumerate(document_embeddings):
    collection.add(
        ids=[str(i)],  # Chroma requires string IDs
        embeddings=[embedding.tolist()],
        metadatas=[{"text": doc_texts[i]}]
    )

print("Successfully populated Chroma database with document embeddings.")

## 🔎 Step 7: Implementing Vector Search Tool

To retrieve relevant text passages from the database, we define a **vector search function** that finds the most relevant chunks based on a user query.
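
Conceptually, the search ranks every stored chunk by its distance to the query embedding and keeps the closest few. The brute-force sketch below illustrates that idea with squared Euclidean distance on toy 2-dimensional vectors. The `top_k_search` function and the sample data are hypothetical, and ChromaDB itself relies on an approximate nearest-neighbour index rather than a full scan.

```python
def top_k_search(query_vec, stored, k=5):
    """Return the texts of the k stored items closest to query_vec.

    stored is a list of (vector, text) pairs; distance is squared Euclidean.
    """
    def squared_distance(vec):
        return sum((q - v) ** 2 for q, v in zip(query_vec, vec))

    ranked = sorted(stored, key=lambda item: squared_distance(item[0]))
    return [text for _, text in ranked[:k]]


store = [
    ([1.0, 0.0], "chunk about the budget"),
    ([0.9, 0.1], "chunk about quarterly costs"),
    ([0.0, 1.0], "chunk about the office party"),
]
print(top_k_search([1.0, 0.05], store, k=2))
```

Asking for more results than the store holds simply returns everything, ordered from nearest to farthest.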

In [None]:
# --- Define the Vector Search Tool ---
def vector_search_tool(query: str) -> str:
    """
    Searches the Chroma database for relevant text chunks based on the query.
    Computes the query embedding, retrieves the top 5 most relevant text chunks,
    and returns them as a formatted string.
    """
    # Compute the query embedding
    query_embedding = embedding_model.encode(query, convert_to_numpy=True).tolist()
    
    # Define the number of nearest neighbors to retrieve
    TOP_K = 5
    
    # Perform the search in the Chroma database
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=TOP_K
    )
    
    # Retrieve and format the corresponding text chunks
    retrieved_chunks = [metadata["text"] for metadata in results["metadatas"][0]]
    return "\n\n".join(retrieved_chunks)

## 🤖 Step 8: Context Need Assessment

Instead of always retrieving context, we determine if the query **requires external document context** before generating a response. This creates an agentic workflow that makes autonomous decisions to complete the task at hand.
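
The routing decision can be pictured as a small dispatcher. The sketch below uses stand-in callables instead of the real model and vector store, so every name in it (`parse_yes_no`, `route_query`, `ask_llm`, `retrieve`, `fake_llm`) is hypothetical. It also reads the YES/NO reply from its first word, which is stricter than a substring check and is one possible design, not the notebook's own.

```python
def parse_yes_no(reply, default=True):
    """Interpret a model reply as YES/NO by its first word.

    Falls back to `default` (retrieve context) when the reply is unclear.
    """
    words = reply.strip().upper().split()
    first = words[0].strip(".,:;!'\"") if words else ""
    if first == "YES":
        return True
    if first == "NO":
        return False
    return default


def route_query(query, ask_llm, retrieve):
    """Ask the model whether context is needed, then answer accordingly."""
    decision = ask_llm(f"Does this query need document context? YES or NO.\nQuery: {query}")
    if parse_yes_no(decision):
        context = retrieve(query)
        return ask_llm(f"Context:\n{context}\n\nQuery: {query}\nAnswer:")
    return ask_llm(f"Query: {query}\nAnswer:")


# Stand-in model: says YES to the routing question, otherwise echoes its prompt.
def fake_llm(prompt):
    return "YES." if prompt.startswith("Does this query") else f"[answer to] {prompt}"


print(route_query("What was said about the budget?", fake_llm, lambda q: "The budget is 10k."))
```

Defaulting to retrieval when the reply is neither YES nor NO mirrors the prompt's instruction to assume context is required when in doubt.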

In [None]:
# --- Define the Meta-Evaluation Function ---
def needs_context(query: str) -> bool:
    """
    Determines if additional context from an external document is required to generate an accurate and detailed answer.
    Returns True if context is needed (response contains "YES"), False otherwise.

    Args:
        query (str): The user's query to evaluate.

    Returns:
        bool: True if external context is required, False otherwise.
    """
    meta_prompt = (
        "Based on the following query, decide if additional context from an external document is needed "
        "to generate an accurate and detailed answer. Have a tendency to use an external document if the query is not a very familiar topic. If in doubt, assume context is required and answer 'YES'.\n"
        "Answer with a single word: YES if additional context from an external document would be helpful to answer the query, "
        "or NO if not. Do not say anything other than YES or NO.\n"
        f"Query: {query}\n"
        "Answer:"
    )
    meta_response = llm.invoke(meta_prompt)
    print("Meta Response (is external document retrieval necessary?):", meta_response)
    return "YES" in meta_response.upper()


# --- Define the Main Answer Generation Function with RAG (Retrieve and Generate) ---
def generate_answer_with_agentic_rag(query: str) -> str:
    """
    Generates a detailed and accurate answer to the user's query by using context when needed.
    If additional context is required, it is retrieved from the vector store and included in the prompt.
    If not, the answer is generated using the query alone.

    Args:
        query (str): The user's query to answer.

    Returns:
        str: The generated answer based on the query.
    """
    if needs_context(query):
        # Retrieve additional context from the vector store
        context = vector_search_tool(query)
        
        # Construct the enriched prompt with the additional context
        enriched_prompt = (
            "Here is additional context from our document:\n"
            f"{context}\n\n"
            f"Based on this context and the query: {query}\n"
            "Please provide a detailed and accurate answer.\n"
            "Answer:"
        )
        final_response = llm.invoke(enriched_prompt)
    else:
        # Generate an answer using the original query directly
        direct_prompt = (
            "Please provide a detailed and accurate answer to the following query:\n"
            f"{query}\n"
            "Answer:"
        )
        final_response = llm.invoke(direct_prompt)
    
    return final_response


# --- Define the Answer Generation Function without RAG ---
def generate_answer_without_rag(query: str) -> str:
    """
    Generates a detailed and accurate answer to the user's query without using any additional context from external documents.
    
    Args:
        query (str): The user's query to answer.

    Returns:
        str: The generated answer based on the query.
    """
    direct_prompt = (
        "Please provide a detailed and accurate answer to the following query:\n"
        f"{query}\n"
        "Answer:"
    )
    final_response = llm.invoke(direct_prompt)
    
    return final_response

## 💡 Step 9: Answer Generation with Agentic RAG
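
A minimal sketch of how the two answering modes might be compared side by side. The `compare_modes` helper is hypothetical; it takes the answer functions as arguments so it can be tried with stand-ins before a model is loaded.

```python
def compare_modes(query, answer_with_rag, answer_without_rag):
    """Run one query through both answer functions and return both results."""
    return {
        "query": query,
        "with_rag": answer_with_rag(query),
        "without_rag": answer_without_rag(query),
    }


# Stand-ins; in the notebook these would be generate_answer_with_agentic_rag
# and generate_answer_without_rag from Step 8.
result = compare_modes(
    "What topics were discussed in the recording?",
    lambda q: f"(RAG) {q}",
    lambda q: f"(no RAG) {q}",
)
for mode in ("with_rag", "without_rag"):
    print(f"{mode}: {result[mode]}")
```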

