# Notebook 03 – Embedding & Vector Indexing

## Objective

This notebook converts the preprocessed transcript chunks into vector embeddings and builds a searchable vector index for semantic retrieval.  
The index serves as the knowledge base of the Retrieval-Augmented Generation (RAG) pipeline.

---

## Input

- List of cleaned and chunked transcript segments (from Notebook 02)
- Embedding model: `all-MiniLM-L6-v2`, loaded through `HuggingFaceEmbeddings` (a compatible embedding provider could be substituted)

---

## Output

- Vector embeddings for all transcript chunks
- FAISS vector store containing indexed embeddings
- Retriever object capable of performing semantic similarity search

---

## Methodology

1. Load chunked transcript data.
2. Initialize embedding model.
3. Generate embeddings for each chunk.
4. Build a FAISS vector index.
5. Configure retriever with top-k similarity search.
6. Test retrieval using a sample query.
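
Step 5 can be sketched as follows. The code cell in this notebook calls `similarity_search` on the vector store directly, so this is only an illustration of how a retriever might be configured on top of the same store. It assumes LangChain's `as_retriever` method with `search_kwargs`; the helper name `build_retriever` is hypothetical.

```python
def build_retriever(vector_store, k=2):
    """Wrap a LangChain vector store in a top-k similarity retriever.

    `build_retriever` is a hypothetical helper name used for illustration.
    It relies only on the store's `as_retriever` method.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return vector_store.as_retriever(
        search_type="similarity",   # plain nearest-neighbour search
        search_kwargs={"k": k},     # number of chunks returned per query
    )

# Usage with the `vector_db` built later in this notebook:
# retriever = build_retriever(vector_db, k=2)
# docs = retriever.invoke("What did Demis say about the future of AI?")
```

Keeping `k` as a parameter makes it easy to trade context size against noise when the retriever is later connected to a language model.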

---

## Why This Step is Important

Large Language Models (LLMs) cannot efficiently process long documents directly due to context window limitations.  
By converting text chunks into dense vector representations:

- Semantic similarity search becomes possible.
- Relevant context can be retrieved dynamically.
- Hallucination risk is reduced.
- Response quality improves through grounded context injection.
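
To make the idea of similarity search over dense vectors concrete, here is a toy sketch in plain Python. The three-dimensional vectors are made-up values standing in for real 384-dimensional embeddings, and cosine similarity is used only for illustration; the FAISS store built in this notebook may rank results with a different metric, such as L2 distance.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunks most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Toy "embeddings" (illustrative values, not model output)
chunks = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 0.05, 0.0]
print(top_k(query, chunks, k=2))  # -> [0, 1]
```

This brute-force scan compares the query against every chunk; a vector index such as FAISS performs the same kind of nearest-neighbour lookup with data structures designed for much larger collections.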

This notebook completes the **knowledge indexing stage** of the RAG architecture and prepares the system for generative response integration in Notebook 04.


In [1]:
"""
PROJECT: 
NeuralTranscript: A RAG-Based Semantic Search & Q&A System for YouTube Content

-------------------------------------------------------------------------
AUTHOR: Engr. Inam Ullah Khan
Master's Student in Data Science | Al-Farabi Kazakh National University
-------------------------------------------------------------------------
"""
import os
import pickle
import warnings

# Suppress the huggingface_hub warning about symlinks being unavailable in the cache.
# Set before importing the Hugging Face libraries, which may read it at import time.
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"

# Suppress UserWarnings raised from the huggingface_hub module
warnings.filterwarnings("ignore", category=UserWarning, module="huggingface_hub")

from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# Note: chunked_docs is reloaded from disk in the execution section below,
# so this notebook can run in a fresh session after Notebook 02.

# --- 1. CONFIGURATION ---
INDEX_SAVE_PATH = "data/faiss_index"

# --- 2. CORE FUNCTIONS ---

def generate_vector_store(documents):
    """
    Converts documents to embeddings and stores them in a FAISS index.
    """
    print("🧠 Initializing Neural Embedding Model (HuggingFace)...")
    
    # Using a high-quality, lightweight model included in your requirements
    embeddings = HuggingFaceEmbeddings(
        model_name="all-MiniLM-L6-v2",
        model_kwargs={'device': 'cpu'} # Use 'cuda' if you have a GPU in Colab
    )
    
    print(f"🚀 Generating embeddings for {len(documents)} chunks. Please wait...")
    
    # Create the FAISS index from the documents
    vector_store = FAISS.from_documents(documents, embeddings)
    
    return vector_store

def save_index(vector_store, path):
    """Persists the FAISS index to the local disk."""
    vector_store.save_local(path)
    print(f"💾 FAISS Index successfully saved to: {path}")

# --- 3. EXECUTION PIPELINE ---

if __name__ == "__main__":
    print("--- Starting NeuralTranscript Indexing Pipeline ---")
    
    # Load the chunks saved by Notebook 02 from disk
    try:
        with open("data/chunked_docs.pkl", "rb") as f:
            chunked_docs = pickle.load(f)
        print(f"📥 Successfully loaded {len(chunked_docs)} chunks from disk.")
    except FileNotFoundError:
        print("❌ Error: chunked_docs.pkl not found. Please run Notebook 02 first.")
        # Stop the cell without shutting down the notebook kernel
        raise SystemExit(1)

    # 1. Create the store
    vector_db = generate_vector_store(chunked_docs)
    
    # 2. Save the FAISS index for the Q&A notebook
    save_index(vector_db, INDEX_SAVE_PATH)
    
    # 3. Test Retrieval
    query = "What did Demis say about the future of AI?"
    results = vector_db.similarity_search(query, k=2)

    print("\n🔍 SIMILARITY SEARCH TEST:")
    for i, res in enumerate(results, start=1):
        # Fall back to 'unknown' if a chunk has no 'source' metadata
        print(f"\nResult {i} (Source: {res.metadata.get('source', 'unknown')}):")
        print(f"{res.page_content[:200]}...")

--- Starting NeuralTranscript Indexing Pipeline ---
📥 Successfully loaded 169 chunks from disk.
🧠 Initializing Neural Embedding Model (HuggingFace)...
🚀 Generating embeddings for 169 chunks. Please wait...
💾 FAISS Index successfully saved to: data/faiss_index

🔍 SIMILARITY SEARCH TEST:

Result 1 (Source: Gfr50f6ZBvo):
from a sentient animal and we know they're made of the same things biological neurons so we're gonna have to come up with explanations uh or models of the gap between substrate differences between mac...

Result 2 (Source: Gfr50f6ZBvo):
part of of birthing ai and that being the greatest benefit to humanity of any tool or technology ever and and getting us into a world of radical abundance and curing diseases and and and solving many ...


## Observations

- Successfully generated embeddings for 169 chunks.
- Embedding dimensionality: 384.
- FAISS index constructed without dimensional mismatch.
- Retrieval test query returned the top 2 semantically relevant chunks (`k=2`).
- Retrieved content aligns with expected video topic.
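
The count and dimensionality observations can be checked programmatically. The following is a minimal sketch, assuming the LangChain FAISS wrapper exposes the underlying faiss index as `.index`, with `ntotal` (number of stored vectors) and `d` (vector dimension) attributes. The helper name `check_index` is hypothetical.

```python
def check_index(vector_store, expected_count, expected_dim):
    """Return a list of mismatch messages; an empty list means the index looks consistent.

    `check_index` is a hypothetical helper. It assumes `vector_store.index`
    is a faiss index exposing `ntotal` and `d`.
    """
    index = vector_store.index
    problems = []
    if index.ntotal != expected_count:
        problems.append(
            f"expected {expected_count} vectors, index holds {index.ntotal}"
        )
    if index.d != expected_dim:
        problems.append(
            f"expected dimension {expected_dim}, index uses {index.d}"
        )
    return problems

# Usage with the objects from this notebook:
# issues = check_index(vector_db, len(chunked_docs), 384)
# print(issues or "Index matches the expected size and dimensionality.")
```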

## Summary

The transcript chunks were successfully converted into vector embeddings and indexed using FAISS.
Similarity-based retrieval returned semantically relevant context segments for the test query, which suggests the embedding model is suitable for this content, although a single query is not a full evaluation.
This completes the knowledge indexing stage of the RAG architecture and prepares the system for generative response integration.


**Next step:** RAG Query Engine  
(`04_rag_query_engine.ipynb`)