# Session 19: The RAG Pipeline - Indexing Knowledge

**Learning Objectives:**

* Understand why RAG is the primary solution to the LLM knowledge problem.
* Learn the components of the data ingestion and indexing pipeline, using a Wikipedia article as the source document.
* Create and store vector embeddings from custom documents.
* Perform similarity searches to find relevant information.

## Part 1: The Theory of Augmented Generation

### Deep Dive: The Indexing Pipeline

This is the "preparation" phase: making our knowledge searchable for the retrieval step. It is done once, up front, for a given set of documents, and only needs repeating when those documents change.

**1. Loading Documents:** The process begins with loading our data. This can come from various sources:
   * PDFs
   * Text files (`.txt`, `.md`)
   * Web pages
   * Databases
   * APIs
   
**2. Chunking (Splitting):** We can't feed an entire book into an embedding model due to token limits and computational cost. More importantly, retrieving an entire book to answer a specific question is inefficient. We split the documents into small, semantically meaningful chunks. Common strategies include:
   * **Fixed-size chunking:** Easy but can break sentences apart.
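   * **Recursive character splitting:** A smarter method that tries a prioritized list of separators, typically paragraph breaks (`\n\n`), then line breaks (`\n`), then spaces (` `), and only moves to a finer separator when a piece is still too large. Sentence-level separators such as `.` can be added, but they are not in LangChain's default list.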
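
To make the difference between the two strategies concrete, here is a minimal pure-Python sketch. The function names `fixed_size_chunks` and `recursive_split` are illustrative (they are not LangChain APIs), and the recursive version is a simplification: it ignores chunk overlap and drops the separator it splits on.

```python
def fixed_size_chunks(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Cut text into windows of chunk_size characters; neighbours share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def recursive_split(text: str, chunk_size: int, separators=("\n\n", "\n", " ")) -> list[str]:
    """Simplified recursive splitting: try the coarsest separator first, then finer ones."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard character cut.
        return fixed_size_chunks(text, chunk_size)

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: split it with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "RAG retrieves context.\n\n"
    "Embeddings map text to vectors. Similar text lands nearby."
)

fixed = fixed_size_chunks(sample, chunk_size=30)
recursive = recursive_split(sample, chunk_size=30)

print(fixed)      # windows cut at arbitrary character positions
print(recursive)  # the first paragraph survives as one intact chunk
```

Both functions respect the size limit, but only the recursive one keeps the short first paragraph whole and breaks the second one between words rather than inside them.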

**3. Embedding:** This is where we convert our text chunks into numerical representations (vectors). These vectors capture the semantic meaning of the text. Models called "embedding models" (like `all-MiniLM-L6-v2` or OpenAI's `text-embedding-ada-002`) are specifically trained for this task. The key idea is that chunks with similar meanings will have vectors that are close to each other in a high-dimensional space.
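
"Close to each other" is usually measured with cosine similarity. The sketch below uses hand-made three-dimensional toy vectors (hypothetical values, not produced by any real embedding model) just to show the calculation; real models produce hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors invented for illustration only (NOT real model output).
toy_vectors = {
    "car": [0.90, 0.10, 0.00],
    "automobile": [0.85, 0.15, 0.05],
    "banana": [0.00, 0.20, 0.95],
}

sim_related = cosine_similarity(toy_vectors["car"], toy_vectors["automobile"])
sim_unrelated = cosine_similarity(toy_vectors["car"], toy_vectors["banana"])

print(f"car vs automobile: {sim_related:.3f}")
print(f"car vs banana:     {sim_unrelated:.3f}")
```

With these toy values, the related pair scores far higher than the unrelated pair, which is the property a trained embedding model aims to provide for real text.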

**4. Storing (Vector Stores):** A Vector Store, or Vector Database, is a specialized database designed to efficiently store and search through millions or billions of vectors. It's the core component that enables fast similarity searches. 
   * **Popular examples:** Chroma, FAISS, Pinecone, Weaviate.
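
Conceptually, a vector store keeps (vector, text) pairs and returns the entries nearest to a query vector. The class below, `TinyVectorStore`, is a hypothetical brute-force sketch that scans every stored vector using Euclidean distance. Production systems such as those listed above typically rely on specialized index structures instead of a full scan, so treat this as an illustration of the interface rather than of how they work internally.

```python
import math
from dataclasses import dataclass, field


@dataclass
class TinyVectorStore:
    """Brute-force nearest-neighbour store (illustrative only)."""
    texts: list = field(default_factory=list)
    vectors: list = field(default_factory=list)

    def add(self, text: str, vector: list) -> None:
        if self.vectors and len(vector) != len(self.vectors[0]):
            raise ValueError("all vectors must have the same dimension")
        self.texts.append(text)
        self.vectors.append(list(vector))

    def search(self, query_vector: list, k: int = 3) -> list:
        """Return up to k (distance, text) pairs, smallest distance first."""
        scored = [
            (math.dist(query_vector, vector), text)
            for vector, text in zip(self.vectors, self.texts)
        ]
        scored.sort(key=lambda pair: pair[0])
        return scored[:k]


store = TinyVectorStore()
# Toy 2-D vectors invented for illustration (NOT real embeddings).
store.add("chunk about cars", [1.0, 0.0])
store.add("chunk about trucks", [0.9, 0.2])
store.add("chunk about fruit", [0.0, 1.0])

top_two = store.search([1.0, 0.1], k=2)
for distance, text in top_two:
    print(f"{distance:.3f}  {text}")
```

The full scan costs one distance computation per stored vector, which is fine for a handful of chunks but is exactly what dedicated vector stores are designed to avoid at scale.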

### The Power of Semantic Search

The magic of RAG comes from **semantic search**, which is fundamentally different from traditional **keyword search** (like Ctrl+F).

* **Keyword Search:** Finds exact matches of words. A search for "car" will miss documents that only mention "automobile."
* **Semantic Search:** Finds documents based on meaning. Because the vectors for "car" and "automobile" are very close in vector space, a semantic search for one will easily find the other. This allows us to find relevant information even if the user's query uses completely different wording than the source document.
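
The keyword-search limitation is easy to reproduce. This small sketch (the `keyword_search` helper and the sample sentences are made up for illustration) performs a case-insensitive whole-word match:

```python
import re


def keyword_search(query: str, documents: list[str]) -> list[str]:
    """Return the documents that contain `query` as a whole word (case-insensitive)."""
    pattern = re.compile(rf"\b{re.escape(query)}\b", re.IGNORECASE)
    return [doc for doc in documents if pattern.search(doc)]


documents = [
    "The car would not start this morning.",
    "She bought a used automobile last year.",
    "Bananas are rich in potassium.",
]

hits = keyword_search("car", documents)
print(hits)  # only the sentence that literally contains "car"
```

The second sentence is about the same concept but is never returned, because no characters match. A semantic search over embeddings would be expected to rank both vehicle sentences above the one about bananas.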

## Part 2: Practical - Building Your Knowledge Base

### Setup

First, let's install the necessary libraries. `pypdf` is included for PDF loading, although the demo below uses a Wikipedia article as its source.

In [1]:
%%capture
%pip install -q langchain langchain_community langchain_huggingface sentence-transformers faiss-cpu pypdf

### Code Demo 1: Loading & Chunking a Document

**Goal:** Load a document from the web (a Wikipedia article) and split it into manageable chunks.

In [2]:
%%capture
%pip install wikipedia

In [3]:
from langchain_community.document_loaders import WikipediaLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# The article URL is kept only for the log message below.
url = "https://en.wikipedia.org/wiki/Nefertiti"

# 1. Load the document
# WikipediaLoader fetches the article by search query, not by URL.
loader = WikipediaLoader(query="Nefertiti", load_max_docs=1)
documents = loader.load()

print(f"Loaded {len(documents)} document(s) from {url}")

# 2. Chunk the document
# chunk_size and chunk_overlap are measured in characters by default.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_documents(documents)

print(f"\nDocument split into {len(chunks)} chunks.")
print("\n--- Sample Chunk ---")
print(chunks[0].page_content)
print("------------------------------------------")
print(chunks[1].page_content)
print("------------------------------------------")

Loaded 1 document(s) from https://en.wikipedia.org/wiki/Nefertiti

Document split into 13 chunks.

--- Sample Chunk ---
Nefertiti () (c. 1370 – c. 1330 BC) was a queen of the 18th Dynasty of Ancient Egypt, the great royal wife of Pharaoh Akhenaten. Nefertiti and her husband were known for their radical overhaul of state religious policy, in which they promoted an exclusivist and possibly even monotheistic religion, Atenism, centered on the sun disc and its direct connection to the royal household. With her husband, she reigned at what was arguably the wealthiest period of ancient Egyptian history.
------------------------------------------
After her husband's death, some scholars believe that Nefertiti ruled briefly as the female pharaoh known by the throne name Neferneferuaten, and before the ascension of Tutankhamun, although this identification is a matter of ongoing debate. If Nefertiti did rule as pharaoh, her reign was marked by the fall of Amarna and the relocation of the capita
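
Since `pypdf` was installed during setup, the same load-and-chunk step can be sketched for a local PDF. This is a minimal example, assuming LangChain's `PyPDFLoader` and a PDF file you supply yourself. The helper name `load_and_chunk_pdf` and the file name in the usage comment are hypothetical.

```python
def load_and_chunk_pdf(pdf_path: str, chunk_size: int = 500, chunk_overlap: int = 50):
    """Load a local PDF (one Document per page) and split the pages into chunks."""
    # Imported lazily so that defining the function does not require the libraries.
    from langchain_community.document_loaders import PyPDFLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    pages = PyPDFLoader(pdf_path).load()
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
    )
    return splitter.split_documents(pages)


# Usage (hypothetical file name; replace with a PDF that exists on your machine):
# pdf_chunks = load_and_chunk_pdf("my_paper.pdf")
# print(len(pdf_chunks), pdf_chunks[0].metadata)
```

The returned chunks are ordinary `Document` objects, so they can be passed to the embedding and vector store steps in the same way as the Wikipedia chunks.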

### Code Demo 2: Creating Embeddings

**Goal:** Convert the text chunks into numerical vectors using an embedding model.

In [4]:
from langchain_huggingface import HuggingFaceEmbeddings

# 3. Initialize the embedding model
# We use a popular open-source model from Hugging Face
model_name = "sentence-transformers/all-MiniLM-L6-v2"
model_kwargs = {'device': 'cpu'} # Use CPU for this example
encode_kwargs = {'normalize_embeddings': False}
embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

print("Embedding model loaded successfully.")

# Let's test it on a single sentence
test_sentence = "This is a test sentence."
test_embedding = embeddings.embed_query(test_sentence)

print(f"\nEmbedding for the test sentence (first 5 dimensions): {test_embedding[:5]}")
print(f"Vector dimension: {len(test_embedding)}")

Embedding model loaded successfully.

Embedding for the test sentence (first 5 dimensions): [0.08429646492004395, 0.05795372277498245, 0.004493352957069874, 0.1058211401104927, 0.007083415519446135]
Vector dimension: 384
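
The demo sets `normalize_embeddings` to `False`. As a rough illustration of what that flag controls, the sketch below L2-normalizes two toy vectors (made-up values, not model output) and shows that, once vectors have unit length, a plain dot product equals their cosine similarity. Whether normalization matters in practice depends on the distance metric your vector store uses.

```python
import math


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector so that its Euclidean length is 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# Toy vectors chosen so the arithmetic is easy to follow (both have length 5).
a = [3.0, 4.0]
b = [4.0, 3.0]

raw_dot = dot(a, b)                                    # 3*4 + 4*3 = 24
normalized_dot = dot(l2_normalize(a), l2_normalize(b)) # 24 / (5 * 5) = 0.96

print(f"raw dot product:        {raw_dot}")
print(f"normalized dot product: {normalized_dot:.2f}")
```

The raw dot product grows with vector length, while the normalized one depends only on direction.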


### Code Demo 3: Storing and Searching in a Vector Database

**Goal:** Set up a simple vector store (FAISS), add our documents, and perform a similarity search.

In [11]:
from langchain_community.vectorstores import FAISS

# 4. Store the chunks and their embeddings in a vector store
# FAISS (Facebook AI Similarity Search) is an efficient in-memory vector store.
print("Creating vector store from document chunks...")
vectorstore = FAISS.from_documents(chunks, embeddings)
print("Vector store created successfully!")

# Now, let's perform a similarity search
query = "Who is Nefertiti?"
print(f"\nQuery: '{query}'")

# Retrieve the top 3 most relevant chunks
results = vectorstore.similarity_search(query, k=3)

print("\n--- Top 3 Relevant Chunks Found ---")
for i, doc in enumerate(results):
    print(f"Result {i+1} (from page {doc.metadata.get('source', 'N/A')}):\n{doc.page_content}\n")

# Another example
query_2 = "What was Nefertiti's reign like?"
print(f"\nQuery: '{query_2}'")
results_2 = vectorstore.similarity_search(query_2, k=3)

print("\n--- Top 3 Relevant Chunks Found ---")
for i, doc in enumerate(results_2):
    print(f"Result {i+1} (from page {doc.metadata.get('source', 'N/A')}):\n{doc.page_content}\n")

Creating vector store from document chunks...
Vector store created successfully!

Query: 'Who is Nefertiti?'

--- Top 3 Relevant Chunks Found ---
Result 1 (from page https://en.wikipedia.org/wiki/Nefertiti):
== Names and titles ==
Nefertiti had many titles, including:

Result 2 (from page https://en.wikipedia.org/wiki/Nefertiti):
The exact dates when Nefertiti married Akhenaten and became the king's g

Result 3 (from page https://en.wikipedia.org/wiki/Nefertiti):
It has also been proposed that Nefertiti was Akhenaten's full sister, though this is contradicted by her titles which do not include the title of "King's Daughter" or "King's Sister," usually used to indicate a relative of a pharaoh. Another theory about her parentage that gained some support identified Nefertiti with the Mitanni princess Tadukhipa, partially based on Nefertiti's name ("The Beautiful Woman has Come") which has been interpreted by some scholars as signifying a foreign origin.


Query: 'What was Nefertiti's reig