# Notebook 04 – RAG Query Engine

## Objective

The objective of this notebook is to implement the Retrieval-Augmented Generation (RAG) query engine.  
This stage integrates semantic retrieval with a Large Language Model (LLM) to generate grounded, context-aware responses.

---

## Input

- FAISS vector store containing embedded transcript chunks
- User query
- Google Gemini LLM

---

## Output

- Retrieved top-k relevant transcript chunks
- Context-aware generated answer based only on retrieved content

---

## Methodology

1. Load the persisted FAISS vector index.
2. Configure the retriever with top-k similarity search.
3. Retrieve the most semantically relevant transcript chunks.
4. Construct a structured prompt that injects retrieved context.
5. Invoke the Gemini LLM with grounded context.
6. Generate a final answer constrained to retrieved knowledge.
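
Steps 2 to 4 can be illustrated without FAISS or Gemini. The following is a minimal sketch, assuming toy 3-dimensional vectors in place of real embeddings and brute-force cosine similarity in place of the FAISS index. The names `TOY_INDEX`, `top_k_chunks`, and `build_prompt`, as well as the sample chunks, are hypothetical and introduced only for illustration; they are not part of the notebook.

```python
import math

# Toy "index": (chunk text, embedding) pairs. Real embeddings from
# all-MiniLM-L6-v2 have 384 dimensions; 3 are used here for readability.
TOY_INDEX = [
    ("The guest explains how neural networks learn.", [0.9, 0.1, 0.0]),
    ("The host thanks the sponsors of the episode.", [0.0, 0.2, 0.9]),
    ("They discuss training data and model scaling.", [0.8, 0.3, 0.1]),
]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, index, k=2):
    """Return the texts of the k chunks most similar to the query vector."""
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


def build_prompt(question, chunks):
    """Inject the retrieved chunks into a CONTEXT / QUESTION / ANSWER prompt."""
    context = "\n\n".join(chunks)
    return f"CONTEXT:\n{context}\n\nQUESTION:\n{question}\n\nANSWER:"


query_vec = [1.0, 0.2, 0.0]  # made-up embedding of the user query
chunks = top_k_chunks(query_vec, TOY_INDEX, k=2)
prompt = build_prompt("What do they say about learning?", chunks)
print(prompt)
```

Only the two chunks closest to the query reach the prompt, so the model never sees the unrelated sponsor chunk. The notebook delegates the same ranking to FAISS and the same string assembly to `ChatPromptTemplate`.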

---

## Why This Step is Important

Standard LLMs may hallucinate when answering domain-specific questions.
By incorporating semantic retrieval:

- The model is grounded in factual transcript data.
- Hallucination risk is reduced.
- Responses are context-aware and source-aligned.
- The system becomes scalable to long-form content.

This notebook completes the full end-to-end RAG pipeline.


In [1]:
"""
PROJECT: 
NeuralTranscript: A RAG-Based Semantic Search & Q&A System for YouTube Content

-------------------------------------------------------------------------
AUTHOR: Engr. Inam Ullah Khan
Master's Student in Data Science | Al-Farabi Kazakh National University
-------------------------------------------------------------------------
"""

import os
from dotenv import load_dotenv

from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser


# --------------------------------------------------
# 1. CONFIGURATION & ENVIRONMENT SETUP
# --------------------------------------------------

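load_dotenv()  # Loads GOOGLE_API_KEY from your .env file
if not os.getenv("GOOGLE_API_KEY"):
    raise RuntimeError("GOOGLE_API_KEY is not set; add it to your .env file.")

INDEX_PATH = "data/faiss_index"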


# --------------------------------------------------
# 2. CORE FUNCTIONS
# --------------------------------------------------

def load_vector_store():
    """
    Loads the FAISS index.
    IMPORTANT: The embedding model must match the one used during indexing.
    """
    print("📂 Loading Vector Database...")

    embeddings = HuggingFaceEmbeddings(
        model_name="all-MiniLM-L6-v2"
    )

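    # allow_dangerous_deserialization=True is required because part of the
    # index is stored with pickle; only load index files you created yourself.
    vector_db = FAISS.load_local(
        INDEX_PATH,
        embeddings,
        allow_dangerous_deserialization=True
    )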

    return vector_db


def build_rag_chain(vector_db):
    """
    Builds the RAG pipeline using the LangChain Expression Language (LCEL).
    """
    print("🤖 Initializing Google Gemini Pro...")

    # 1. Initialize Gemini LLM
    llm = ChatGoogleGenerativeAI(
        model="gemini-2.5-flash",
        temperature=0.2,
        top_p=0.9
    )

    # 2. Prompt Template (LCEL style)
    prompt = ChatPromptTemplate.from_template("""
You are an AI Assistant specialized in analyzing video content.
Use the following transcript context to answer the question.
If the answer is not contained in the context, say you don't know.
Keep the answer concise and professional.

CONTEXT:
{context}

QUESTION:
{question}

ANSWER:
""")

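    # 3. Create Retriever (top-3 similarity search)
    retriever = vector_db.as_retriever(search_kwargs={"k": 3})

    def format_docs(docs):
        # Join the chunk texts so the prompt receives plain text
        # instead of the repr of a list of Document objects.
        return "\n\n".join(doc.page_content for doc in docs)

    # 4. Build RAG Chain using LCEL
    rag_chain = (
        {
            "context": retriever | format_docs,
            "question": RunnablePassthrough()
        }
        | prompt
        | llm
        | StrOutputParser()
    )

    return rag_chain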


# --------------------------------------------------
# 3. EXECUTION PIPELINE
# --------------------------------------------------

if __name__ == "__main__":

    print("\n--- Starting NeuralTranscript Query Engine ---\n")

    # Step 1: Load Vector Database
    db = load_vector_store()

    # Step 2: Build RAG Chain
    neural_qa = build_rag_chain(db)

    # Step 3: User Query
    user_query = "What is the main topic of this video??"

    print(f"\n❓ User Query:\n{user_query}")
    print("\n⏳ Processing answer...\n")

    # Step 4: Invoke Chain
    response = neural_qa.invoke(user_query)

    # Step 5: Display Result
    print("✨ AI RESPONSE:\n")
    print(response)
    print("\n--- Query Completed Successfully ---\n")



--- Starting NeuralTranscript Query Engine ---

📂 Loading Vector Database...
🤖 Initializing Google Gemini Pro...

❓ User Query:
What is the main topic of this video??

⏳ Processing answer...

✨ AI RESPONSE:

The main topic is a conversation with Demas about the mission to solve intelligence and use it to address fundamental scientific and philosophical mysteries, including those related to physics, consciousness, life, and gravity, and to overcome impossible challenges.

--- Query Completed Successfully ---



## Observations

- The retriever returned the top 3 semantically relevant chunks.
- The retrieved context contained information directly related to the query.
- Gemini generated an answer grounded in the retrieved content.
- No hallucinated information was observed in the test query.
- Response quality improved compared with a direct LLM call without retrieval.
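
Observations like these are easier to verify when the retrieved chunks are printed next to the answer. The following is a minimal sketch, assuming the results arrive as `(document, score)` pairs, which is the shape returned by LangChain's `FAISS.similarity_search_with_score`. The helper `preview_chunks`, the `FakeDoc` stand-in, and the sample texts and scores are hypothetical, introduced so the block runs without the index.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Stand-in for a LangChain Document (hypothetical, for illustration)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def preview_chunks(results, width=60):
    """Format (document, score) pairs as one short line per chunk."""
    lines = []
    for rank, (doc, score) in enumerate(results, start=1):
        text = " ".join(doc.page_content.split())  # collapse whitespace
        if len(text) > width:
            text = text[:width].rstrip() + "..."
        lines.append(f"{rank}. score={score:.3f} | {text}")
    return lines


# In the notebook, the pairs could come from:
#   results = db.similarity_search_with_score(user_query, k=3)
# The texts and scores below are made up for the example.
results = [
    (FakeDoc("Sample chunk about topic A."), 0.412),
    (FakeDoc("Sample chunk\nabout topic B."), 0.530),
]
for line in preview_chunks(results):
    print(line)
```

With the default FAISS index built by LangChain, the score is typically an L2 distance, so a lower value means a closer match; treat this as an assumption to confirm against the index configuration used in the earlier notebooks.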


## Summary

The Retrieval-Augmented Generation pipeline was successfully implemented.
The retriever dynamically selects semantically relevant transcript segments, which are injected into the prompt for grounded response generation.
This architecture improves factual reliability and reduces hallucination risk compared to standalone LLM usage.
The system demonstrates a functional end-to-end RAG workflow.


### 🚀 Project Conclusion

The **NeuralTranscript** project is now a fully functional end-to-end RAG system. It demonstrates proficiency in **Data Engineering** (Ingestion/Chunking), **Vector Mathematics** (Indexing), and **Generative AI** (RAG Orchestration).

---