# Embedding Generation and Semantic Retrieval

This notebook implements the **embedding and retrieval stage** of the RAG pipeline.
Chunked transcript data is converted into dense vector representations and indexed
in a FAISS vector index to enable semantic similarity search.
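
To make "semantic similarity search" concrete, the sketch below ranks a few toy vectors by cosine similarity to a query vector. The 3-dimensional vectors and chunk labels are made up for illustration; real embeddings from the `all-MiniLM-L6-v2` model used later in this notebook have 384 dimensions, and FAISS performs the ranking far more efficiently than this brute-force loop.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Return the k chunk labels most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), label)
              for label, vec in chunk_vecs.items()]
    scored.sort(reverse=True)
    return [label for _, label in scored[:k]]

# Toy "embeddings" (hypothetical values, not produced by any model).
toy_chunks = {
    "chunk_about_ai": [0.9, 0.1, 0.0],
    "chunk_about_cooking": [0.0, 0.2, 0.9],
    "chunk_about_robots": [0.7, 0.6, 0.1],
}
toy_query = [1.0, 0.2, 0.0]

print(top_k(toy_query, toy_chunks, k=2))
```

The query points mostly along the first axis, so the two chunks with large first components rank ahead of the unrelated one.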


In [None]:
import os
from typing import List

from langchain_core.documents import Document
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS


In [None]:
DATA_PATH = "../data/chunks"
VIDEO_ID = "Gfr50f6ZBvo"

file_path = os.path.join(DATA_PATH, f"{VIDEO_ID}_chunks.txt")

with open(file_path, "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Chunk file loaded.")
print(f"Total characters: {len(raw_text)}")


In [None]:
def parse_chunks(text: str) -> List[Document]:
    chunks = []
    current_chunk = []

    for line in text.splitlines():
        if line.startswith("--- Chunk"):
            if current_chunk:
                chunks.append(Document(page_content=" ".join(current_chunk)))
                current_chunk = []
        else:
            if line.strip():
                current_chunk.append(line.strip())

    if current_chunk:
        chunks.append(Document(page_content=" ".join(current_chunk)))

    return chunks


documents = parse_chunks(raw_text)

print(f"Total chunks loaded: {len(documents)}")
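
The parser above implies a chunk file in which each chunk is introduced by a header line starting with `--- Chunk`. The sketch below runs the same splitting logic on an inline sample. The sample text and the exact header wording (`--- Chunk 1 ---`) are assumptions, and plain strings stand in for `Document` objects so the snippet runs without LangChain installed.

```python
from typing import List

def split_chunks(text: str) -> List[str]:
    """Same line-based logic as parse_chunks, but returning plain strings."""
    chunks: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.startswith("--- Chunk"):
            # A header closes whatever chunk has been collected so far.
            if current:
                chunks.append(" ".join(current))
                current = []
        elif line.strip():
            current.append(line.strip())
    # Flush the final chunk, which has no header after it.
    if current:
        chunks.append(" ".join(current))
    return chunks

# Assumed file layout, for illustration only.
sample = """--- Chunk 1 ---
first line of chunk one
second line of chunk one

--- Chunk 2 ---
only line of chunk two
"""

parsed = split_chunks(sample)
print(parsed)
```

Header lines are discarded, so the chunk number is not carried into the result. Storing it in `Document.metadata` would be one way to keep it.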


In [None]:
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

print("Embedding model initialized.")


In [None]:
vectorstore = FAISS.from_documents(
    documents=documents,
    embedding=embeddings
)

print("Vector store created.")


In [None]:
query = "What is the main topic discussed in this video?"

results = vectorstore.similarity_search(query, k=4)

print("Top retrieved chunks:\n")

for i, doc in enumerate(results, start=1):
    print(f"--- Result {i} ---")
    print(doc.page_content[:400])
    print()


In [None]:
results_with_scores = vectorstore.similarity_search_with_score(query, k=4)

for i, (doc, score) in enumerate(results_with_scores, start=1):
    print(f"Result {i} | Distance (lower is closer): {score:.4f}")
    print(doc.page_content[:300])
    print()
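
The scores printed above are distances, so smaller values mean closer matches. As a rough aid to reading them: to my understanding, LangChain's default FAISS index is a flat L2 index, which reports squared Euclidean distance, and for unit-length vectors that quantity relates to cosine similarity by `d2 = 2 - 2 * cos`. The sketch below checks the identity on toy vectors. Whether it applies to the scores in this notebook depends on the embeddings being unit-length, which is assumed here rather than verified.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_from_squared_l2(d2):
    """Cosine similarity implied by a squared L2 distance between unit vectors."""
    return 1.0 - d2 / 2.0

# Toy vectors (hypothetical values), normalized to unit length.
a = normalize([0.9, 0.1, 0.3])
b = normalize([0.7, 0.6, 0.1])

d2 = squared_l2(a, b)
cos_direct = sum(x * y for x, y in zip(a, b))

print(f"squared L2: {d2:.4f}")
print(f"cosine via identity: {cosine_from_squared_l2(d2):.4f}")
print(f"cosine computed directly: {cos_direct:.4f}")
```

Under that assumption, a distance of 0 corresponds to identical direction and a distance of 2 to orthogonal vectors.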


In [None]:
VECTOR_DB_PATH = "../vectorstore/faiss_index"

os.makedirs(VECTOR_DB_PATH, exist_ok=True)

vectorstore.save_local(VECTOR_DB_PATH)

print(f"Vector store saved to: {VECTOR_DB_PATH}")


In [None]:
loaded_vectorstore = FAISS.load_local(
    VECTOR_DB_PATH,
    embeddings,
    allow_dangerous_deserialization=True
)

test_results = loaded_vectorstore.similarity_search(query, k=2)

print("Vector store successfully reloaded.")
print(test_results[0].page_content[:300])
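
`allow_dangerous_deserialization=True` is needed because `load_local` unpickles part of the saved store, and unpickling an untrusted file can execute arbitrary code. One possible safeguard, sketched below, is to record SHA-256 digests of the saved files right after `save_local` and compare them before reloading. The helper names (`hash_directory`, `verify_directory`) are hypothetical and not part of LangChain or FAISS, and the check only detects changes made after the digests were recorded.

```python
import hashlib
from pathlib import Path

def hash_directory(directory):
    """Map each file name in `directory` to the SHA-256 hex digest of its contents."""
    digests = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            digests[path.name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests

def verify_directory(directory, expected):
    """Return True only if the directory's files match the recorded digests exactly."""
    return hash_directory(directory) == expected

# Intended use (assumed workflow, shown as comments because it needs a saved index):
#   expected = hash_directory(VECTOR_DB_PATH)        # right after save_local
#   if verify_directory(VECTOR_DB_PATH, expected):   # right before load_local
#       ... call FAISS.load_local as in the cell above ...
```

The recorded digests would need to be stored somewhere an attacker cannot modify along with the index, otherwise the comparison proves little.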


## Observations

- For the test query, dense embeddings surfaced semantically related passages from the long-form transcript.
- FAISS returned nearest neighbors quickly for an index of this size.
- The retrieved chunks were contextually aligned with the query, which is an encouraging (though informal) check on the chunking strategy.

The vector store is now ready to be integrated with an LLM for answer generation.
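
A single query gives only anecdotal evidence of retrieval quality. A slightly more systematic check, sketched below, is a hit rate at k: for a handful of questions with a known expected phrase, count how often that phrase appears in the top-k results. The `retrieve` callable, the stub retriever, and the evaluation pairs are all hypothetical placeholders. In this notebook, `retrieve` could wrap `vectorstore.similarity_search` and return each result's `page_content`.

```python
from typing import Callable, List, Tuple

def hit_rate_at_k(
    retrieve: Callable[[str, int], List[str]],
    eval_pairs: List[Tuple[str, str]],
    k: int = 4,
) -> float:
    """Fraction of questions whose expected phrase appears in any top-k chunk."""
    if not eval_pairs:
        return 0.0
    hits = 0
    for question, expected_phrase in eval_pairs:
        chunks = retrieve(question, k)
        if any(expected_phrase.lower() in chunk.lower() for chunk in chunks):
            hits += 1
    return hits / len(eval_pairs)

# Stub retriever over made-up chunks, standing in for a real vector store.
_fake_chunks = [
    "radical abundance and curing diseases",
    "biological neurons and substrate differences",
]

def stub_retrieve(question: str, k: int) -> List[str]:
    # Naive keyword overlap, used only so the example runs end to end.
    words = set(question.lower().split())
    ranked = sorted(
        _fake_chunks,
        key=lambda chunk: -len(words & set(chunk.lower().split())),
    )
    return ranked[:k]

# Hypothetical evaluation pairs: (question, phrase expected in a retrieved chunk).
pairs = [
    ("what about curing diseases", "curing diseases"),
    ("what about neurons", "quantum computing"),
]

print(hit_rate_at_k(stub_retrieve, pairs, k=1))
```

Substring matching is a crude relevance signal, so a score from this kind of check is best read as a regression guard rather than a quality measure.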


## Summary

- Generated dense embeddings for transcript chunks
- Indexed chunks in a FAISS vector index
- Validated semantic retrieval through similarity search
- Persisted vector store for reuse

**Next step:** Retrieval-Augmented Generation using an LLM  
(`04_rag_pipeline.ipynb`)


In [5]:
"""
PROJECT: NeuralTranscript: Semantic Search & Q&A for YouTube Content
MODULE: 03_VECTOR_INDEXING
-------------------------------------------------------------------------
DESCRIPTION:
This module converts text chunks into high-dimensional vector embeddings 
and indexes them using FAISS (Facebook AI Similarity Search). This enables 
the system to retrieve context based on semantic similarity.

AUTHOR: Engr. Inam Ullah Khan
-------------------------------------------------------------------------
"""
import pickle
import os
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
# Note: the chunks come from data/chunked_docs.pkl (see the execution block below),
# which Notebook 02 is expected to have written to disk.

# --- 1. CONFIGURATION ---
INDEX_SAVE_PATH = "data/faiss_index"

# --- 2. CORE FUNCTIONS ---

def generate_vector_store(documents):
    """
    Converts documents to embeddings and stores them in a FAISS index.
    """
    print("🧠 Initializing Neural Embedding Model (HuggingFace)...")
    
    # Using a high-quality, lightweight model included in your requirements
    embeddings = HuggingFaceEmbeddings(
        model_name="all-MiniLM-L6-v2",
        model_kwargs={'device': 'cpu'} # Use 'cuda' if you have a GPU in Colab
    )
    
    print(f"🚀 Generating embeddings for {len(documents)} chunks. Please wait...")
    
    # Create the FAISS index from the documents
    vector_store = FAISS.from_documents(documents, embeddings)
    
    return vector_store

def save_index(vector_store, path):
    """Persists the FAISS index to the local disk."""
    vector_store.save_local(path)
    print(f"💾 FAISS Index successfully saved to: {path}")

# --- 3. EXECUTION PIPELINE ---

if __name__ == "__main__":
    print("--- Starting NeuralTranscript Indexing Pipeline ---")
    
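    # Load the chunks written to disk by the previous notebook.
    # pickle.load can execute code from the file, so only load files you created yourself.
    try:
        with open("data/chunked_docs.pkl", "rb") as f:
            chunked_docs = pickle.load(f)
        print(f"📥 Successfully loaded {len(chunked_docs)} chunks from disk.")
    except FileNotFoundError:
        print("❌ Error: chunked_docs.pkl not found. Please run Notebook 02 first.")
        # Exit with a non-zero status; the bare exit() helper is not guaranteed outside interactive sessions
        raise SystemExit(1)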

    # 1. Create the store
    vector_db = generate_vector_store(chunked_docs)
    
    # 2. Save the FAISS index for the Q&A notebook
    save_index(vector_db, INDEX_SAVE_PATH)
    
    # 3. Test Retrieval
    query = "What did Demis say about the future of AI?"
    results = vector_db.similarity_search(query, k=2)

    print("\n🔍 SIMILARITY SEARCH TEST:")
    for i, res in enumerate(results, start=1):
        # .get avoids a KeyError if a chunk was stored without a 'source' entry
        source = res.metadata.get("source", "unknown")
        print(f"\nResult {i} (Source: {source}):")
        print(f"{res.page_content[:200]}...")

--- Starting NeuralTranscript Indexing Pipeline ---
📥 Successfully loaded 169 chunks from disk.
🧠 Initializing Neural Embedding Model (HuggingFace)...


To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


🚀 Generating embeddings for 169 chunks. Please wait...
💾 FAISS Index successfully saved to: data/faiss_index

🔍 SIMILARITY SEARCH TEST:

Result 1 (Source: Gfr50f6ZBvo):
from a sentient animal and we know they're made of the same things biological neurons so we're gonna have to come up with explanations uh or models of the gap between substrate differences between mac...

Result 2 (Source: Gfr50f6ZBvo):
part of of birthing ai and that being the greatest benefit to humanity of any tool or technology ever and and getting us into a world of radical abundance and curing diseases and and and solving many ...
