# Embedding Generation and Semantic Retrieval

This notebook implements the **embedding and retrieval stage** of the RAG pipeline.
Chunked transcript data is converted into dense vector representations and indexed
in a FAISS vector store to enable semantic similarity search.


In [None]:
import os
from typing import List

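# The old langchain.* import paths for these classes are deprecated;
# core types live in langchain_core and integrations in langchain_community.
from langchain_core.documents import Document
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS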


In [None]:
DATA_PATH = "../data/chunks"
VIDEO_ID = "Gfr50f6ZBvo"

file_path = os.path.join(DATA_PATH, f"{VIDEO_ID}_chunks.txt")

with open(file_path, "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Chunk file loaded.")
print(f"Total characters: {len(raw_text)}")


In [None]:
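def parse_chunks(text: str) -> List[Document]:
    """Split the chunk file into one Document per '--- Chunk' section."""
    chunks = []
    current_chunk = []

    for line in text.splitlines():
        if line.startswith("--- Chunk"):
            # A header closes the previous chunk, if it collected any text.
            if current_chunk:
                chunks.append(Document(page_content=" ".join(current_chunk)))
                current_chunk = []
        elif line.strip():
            # Skip blank lines; join the remaining lines with single spaces.
            current_chunk.append(line.strip())

    # Flush the final chunk, which has no header after it.
    if current_chunk:
        chunks.append(Document(page_content=" ".join(current_chunk)))

    return chunks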


documents = parse_chunks(raw_text)

print(f"Total chunks loaded: {len(documents)}")
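
The parser above drops the chunk header, so a retrieved chunk cannot be traced back to its position in the transcript. A minimal sketch of a variant that keeps the chunk number follows. It assumes headers look like `--- Chunk 3 ---`; the real header format is not shown in this notebook, so the regex is an assumption. `parse_chunks_with_index` is a hypothetical helper, and it returns plain dicts so the sketch runs without LangChain.

```python
import re
from typing import Dict, List

# Assumed header shape: "--- Chunk <number> ..."; adjust to the real chunk file.
HEADER_RE = re.compile(r"^--- Chunk\s+(\d+)")


def parse_chunks_with_index(text: str) -> List[Dict]:
    """Split chunk-file text into records that remember their chunk number."""
    records: List[Dict] = []
    current_index = None
    current_lines: List[str] = []

    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            # A new header closes the previous chunk, if it had any text.
            if current_lines:
                records.append(
                    {"chunk_index": current_index, "text": " ".join(current_lines)}
                )
            current_index = int(match.group(1))
            current_lines = []
        elif line.strip():
            current_lines.append(line.strip())

    # Flush the last chunk, which has no header after it.
    if current_lines:
        records.append({"chunk_index": current_index, "text": " ".join(current_lines)})

    return records


sample = "--- Chunk 1 ---\nhello world\n\n--- Chunk 2 ---\nsecond chunk\nmore text\n"
records = parse_chunks_with_index(sample)
print(records)
```

Each record could then be wrapped as `Document(page_content=record["text"], metadata={"chunk_index": record["chunk_index"]})`, so the index travels with the chunk through retrieval.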


In [None]:
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

print("Embedding model initialized.")


In [None]:
vectorstore = FAISS.from_documents(
    documents=documents,
    embedding=embeddings
)

print("Vector store created.")
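
To make the indexing step less opaque, the sketch below shows what an exhaustive (flat) nearest-neighbor search does conceptually: compare the query vector with every stored vector and keep the `k` closest. It uses toy 2-D vectors and pure Python. It illustrates the idea only, not FAISS's actual implementation, and `squared_l2` and `flat_search` are hypothetical names.

```python
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_search(
    query: Sequence[float], vectors: List[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Return (position, distance) pairs for the k vectors closest to the query."""
    scored = [(i, squared_l2(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Toy "embeddings"; real ones from all-MiniLM-L6-v2 have 384 dimensions.
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
nearest = flat_search([0.9, 0.1], toy_vectors, k=2)
print(nearest)
```

The exhaustive scan is exact but grows linearly with the number of chunks, which is acceptable for a single transcript.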


In [None]:
query = "What is the main topic discussed in this video?"

results = vectorstore.similarity_search(query, k=4)

print("Top retrieved chunks:\n")

for i, doc in enumerate(results, start=1):
    print(f"--- Result {i} ---")
    print(doc.page_content[:400])
    print()


In [None]:
results_with_scores = vectorstore.similarity_search_with_score(query, k=4)

# The score is a distance, not a similarity: lower values mean a closer match.
for i, (doc, score) in enumerate(results_with_scores, start=1):
    print(f"Result {i} | Distance (lower is better): {score:.4f}")
    print(doc.page_content[:300])
    print()
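
Distance scores are harder to read than similarities. The sketch below shows how a distance could be mapped to cosine similarity under two assumptions that this notebook does not verify: that the embeddings are unit-normalized, and that the reported score is a squared L2 distance. Under those assumptions, `||a - b||^2 = 2 - 2 * cos(a, b)`. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_from_squared_l2(distance: float) -> float:
    """Valid only for unit-length vectors: cos = 1 - d^2 / 2."""
    return 1.0 - distance / 2.0


a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])
d2 = squared_l2(a, b)

print(f"squared L2 distance: {d2:.4f}")
print(f"cosine via identity: {cosine_from_squared_l2(d2):.4f}")
print(f"cosine computed directly: {cosine(a, b):.4f}")
```

If either assumption does not hold for the index in use, the conversion is not valid and the raw distances should only be compared with each other.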


In [None]:
VECTOR_DB_PATH = "../vectorstore/faiss_index"

os.makedirs(VECTOR_DB_PATH, exist_ok=True)

vectorstore.save_local(VECTOR_DB_PATH)

print(f"Vector store saved to: {VECTOR_DB_PATH}")


In [None]:
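# load_local unpickles the stored documents. That is safe here because this
# notebook wrote the index itself; never enable the flag for untrusted files.
loaded_vectorstore = FAISS.load_local(
    VECTOR_DB_PATH,
    embeddings,
    allow_dangerous_deserialization=True
)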

test_results = loaded_vectorstore.similarity_search(query, k=2)

print("Vector store successfully reloaded.")
print(test_results[0].page_content[:300])
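
When the index is reloaded from another notebook, a quick check that the saved files exist gives a clearer error than a failed load. The sketch below assumes `save_local` was called with its default `index_name="index"`, which produces `index.faiss` (the vectors) and `index.pkl` (the pickled documents). `missing_index_files` is a hypothetical helper, and the demonstration runs against a temporary directory rather than the real `VECTOR_DB_PATH`.

```python
import tempfile
from pathlib import Path
from typing import List

# Assumed file names: save_local with its default index_name="index".
EXPECTED_FILES = ("index.faiss", "index.pkl")


def missing_index_files(folder: str) -> List[str]:
    """Return the expected index files that are absent from the folder."""
    root = Path(folder)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Demonstration on a throwaway directory that holds only one of the two files.
with tempfile.TemporaryDirectory() as tmp_dir:
    (Path(tmp_dir) / "index.faiss").touch()
    demo_missing = missing_index_files(tmp_dir)

print(f"Missing files: {demo_missing}")
```

In the pipeline, calling `missing_index_files(VECTOR_DB_PATH)` before `FAISS.load_local` and stopping when the list is non-empty would surface an incomplete save early.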


## Observations

- Dense embeddings from `all-MiniLM-L6-v2` let a query match transcript chunks by meaning rather than by exact wording.
- FAISS returns the nearest neighbors quickly for an index of this size.
- Retrieved chunks are contextually aligned with the test query, which supports the chunking strategy; a single query is only a spot check.

The vector store is now ready to be integrated with an LLM for answer generation.
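
The observations above rest on one test query. A slightly broader check is a hit rate: for several queries, test whether any of the top `k` retrieved chunks contains a keyword that a correct chunk should mention. The sketch below is an assumption-laden illustration. The queries, keywords, corpus, and word-overlap retriever are all made up so the example runs without the vector store, and `hit_rate_at_k` is a hypothetical helper.

```python
from typing import Callable, Dict, List


def hit_rate_at_k(
    retrieve: Callable[[str, int], List[str]], expected: Dict[str, str], k: int
) -> float:
    """Fraction of queries whose top-k texts contain the expected keyword."""
    if not expected:
        raise ValueError("expected must contain at least one query")
    hits = 0
    for query, keyword in expected.items():
        top_texts = retrieve(query, k)
        if any(keyword.lower() in text.lower() for text in top_texts):
            hits += 1
    return hits / len(expected)


# Toy corpus and retriever (word overlap) standing in for the real vector store.
toy_corpus = [
    "FAISS builds an index over vectors",
    "Transcripts are split into chunks",
    "Embeddings map text to vectors",
]


def toy_retrieve(query: str, k: int) -> List[str]:
    query_words = set(query.lower().split())
    ranked = sorted(
        toy_corpus, key=lambda text: -len(query_words & set(text.lower().split()))
    )
    return ranked[:k]


# Hypothetical query -> keyword pairs; replace with ones that fit the transcript.
toy_expected = {
    "how are transcripts split": "chunks",
    "what does FAISS build": "index",
    "unrelated question about cooking": "recipe",
}

rate = hit_rate_at_k(toy_retrieve, toy_expected, k=1)
print(f"Hit rate at k=1: {rate:.2f}")
```

With the real store, the retriever could be `lambda q, k: [d.page_content for d in vectorstore.similarity_search(q, k=k)]`, paired with keywords chosen by someone who has watched the video.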


## Summary

- Generated dense embeddings for transcript chunks
- Indexed the chunks in a FAISS vector store
- Validated semantic retrieval through similarity search
- Persisted vector store for reuse

**Next step:** Retrieval-Augmented Generation using an LLM  
(`04_rag_pipeline.ipynb`)
