 # Introduction to Vector Databases and Retrieval-Augmented Generation (RAG)

In this notebook, we explore how to use **vector databases** and **RAG pipelines** to retrieve and generate information from both unstructured (PDFs) and structured (Excel) data sources using Langchain.

## What is a Vector Database?

A vector database is a special type of database designed to store and search **embeddings** — numerical representations of text, images, or other data. These embeddings allow us to compare content based on **semantic similarity**, not just keywords.

**Key features:**
- Stores high-dimensional vectors (embeddings)
- Enables similarity search (e.g., cosine, dot product)
- Useful for searching across documents, FAQs, transcripts, etc.
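
To make similarity search concrete, the sketch below scores a query vector against two stored vectors with cosine similarity. The three-dimensional vectors and their labels are made-up toy values, not real embeddings, which typically have hundreds or thousands of dimensions.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (close to 1.0 means same direction)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy "embeddings" (assumed values for illustration only)
stored = {
    "wizard school": [0.9, 0.1, 0.0],
    "tax return": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]

# Score every stored vector against the query and keep the best label
scores = {label: cosine_similarity(query, vec) for label, vec in stored.items()}
best_match = max(scores, key=scores.get)
print(best_match, round(scores[best_match], 3))
```

Cosine similarity ignores vector length and compares direction only, while a raw dot product also rewards larger magnitudes. For vectors normalized to unit length, the two give the same ranking.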

**Popular libraries/tools:**
- FAISS (Facebook AI Similarity Search)
- Chroma (lightweight, open source, with a Langchain integration)
- Pinecone, Weaviate, Qdrant (production-oriented vector databases, all available as managed services)

## What is Retrieval-Augmented Generation (RAG)?

RAG is an approach that combines **retrieval systems** (like vector DBs) with **language models** to produce more accurate and context-aware responses.

Instead of relying only on the model's internal knowledge, it:
1. Retrieves relevant documents based on the query
2. Feeds those docs into the LLM to answer the question
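
The two steps above can be sketched without any external services. In this minimal illustration, retrieval is a naive word-overlap ranking standing in for a vector search, and the generation step stops at building the prompt that would be sent to an LLM. The function names `retrieve` and `build_prompt` and the sample passages are hypothetical.

```python
def retrieve(query, passages, k=2):
    """Rank passages by how many query words they share (stand-in for vector search)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        passages,
        key=lambda p: len(query_words & set(p.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query, context_passages):
    """Place the retrieved passages ahead of the question."""
    context = "\n".join(f"- {p}" for p in context_passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Assumed sample passages for illustration
passages = [
    "Hogwarts is a school of witchcraft and wizardry.",
    "The Dursleys live at number four, Privet Drive.",
    "Quidditch is played on broomsticks.",
]
question = "What kind of school is Hogwarts?"
prompt = build_prompt(question, retrieve(question, passages, k=1))
print(prompt)
```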

**Why use RAG?**
- Reduces hallucination
- Keeps answers grounded in real, updated data
- Supports domain-specific or private data without retraining the LLM

## Why is This Important?

- Large Language Models (LLMs) only know what was in their training data, and their context windows limit how much text fits in one request
- RAG pipelines allow retrieval from large corpora of custom data
- Enables applications like:
  - Smart document search
  - Personalized assistants
  - Internal knowledge bots
  - Legal/medical document summarizers

In this notebook, we'll build and run end-to-end RAG pipelines on:
- A single PDF
- Multiple PDFs
- Structured Excel sheets

Using **Langchain**, **Chroma**, and **OpenAI embeddings**.


## Part 1: Information Retrieval from a Single PDF

We begin by installing and importing all necessary libraries. This includes:

- `langchain`: the framework we'll use to create RAG pipelines
- `chromadb`: lightweight vector database (kept in memory in this notebook)
- `openai`: client for the OpenAI embedding and chat models (other providers can be substituted)
- `pypdf`: PDF parser used by Langchain's `PyPDFLoader` to load each page as text
- `tiktoken`: tokenizer for estimating token counts (optional but helpful)

We'll also set up API keys and suppress unnecessary warnings.


In [1]:
# Install core packages
!pip install langchain langchain-community openai chromadb tiktoken pypdf --quiet

# Optional: suppress warnings
import warnings
warnings.filterwarnings("ignore")

### Loading and Splitting the PDF

To work with PDFs in Langchain, we use the `PyPDFLoader` from `langchain.document_loaders` (newer Langchain releases expose it under `langchain_community.document_loaders`).

**Why do we need to split the document?**
- LLMs have a token limit, and feeding large PDFs directly will exceed it
- Splitting into smaller chunks lets us embed and store them independently
- This is the foundation for retrieval: search is performed at the chunk level

We’ll:
1. Load the Harry Potter PDF using `PyPDFLoader`
2. Split the text into overlapping chunks using `RecursiveCharacterTextSplitter`
   - Each chunk is capped at a maximum size (here, 500 characters) and shares some text with its neighbor (here, 100 characters)
   - Overlap reduces the chance that a sentence or idea is cut in half at a chunk boundary
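
To see what overlap does, here is a simplified fixed-window splitter. It is an illustration only: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, so its chunks vary in length, whereas this hypothetical `split_with_overlap` helper cuts at exact character offsets.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters; each window
    repeats the last chunk_overlap characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for c in chunks:
    print(c)
```

Each printed chunk starts with the last four characters of the one before it, which is the same idea as the 100-character overlap used in this notebook.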


In [2]:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [26]:
# Load the PDF (upload your file if not already in Colab)
pdf_path = "/content/Sorcerer's Stone.pdf"
loader = PyPDFLoader(pdf_path)

In [27]:
# Load pages
pages = loader.load()

In [28]:
# Split into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
docs = text_splitter.split_documents(pages)

print(f"Loaded {len(docs)} document chunks")
print(docs[1].page_content[0:300])

Loaded 1225 document chunks
ALSO BY J. K. ROWLING 
Harry Potter and the Sorcerer’s Stone 
Year One at Hogwarts 
Harry Potter and the Chamber of Secrets 
Year Two at Hogwarts 
Harry Potter and the Prisoner of Azkaban 
Year Three at Hogwarts 
Harry Potter and the Goblet of Fire 
Year Four at Hogwarts 
Harry Potter and the Order 


### Embedding and Vector Store Creation

To search semantically through the document chunks, we convert them into **vector embeddings** using a pre-trained model.

We’ll use:
- `OpenAIEmbeddings` from Langchain (requires an OpenAI API key)
- `Chroma` as the vector database to store and retrieve embeddings

**Steps:**
1. Convert each chunk into a high-dimensional vector using `OpenAIEmbeddings`
2. Store those vectors in Chroma, which supports fast similarity search
3. This enables us to later retrieve the most relevant chunks for a query
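
The three steps above can be mimicked with a toy in-memory store. This is a sketch under loud assumptions: `embed` is a hypothetical bag-of-words counter over a tiny fixed vocabulary rather than a learned embedding model, and `ToyVectorStore` is not Chroma's API, only an illustration of the store-then-search pattern.

```python
import math

VOCAB = ["harry", "wand", "owl", "school", "train"]  # assumed toy vocabulary

def embed(text):
    """Hypothetical embedding: count how often each vocabulary word occurs."""
    words = text.lower().split()
    return [words.count(term) for term in VOCAB]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

class ToyVectorStore:
    def __init__(self):
        self.items = []  # list of (vector, original text) pairs

    def add(self, texts):
        # Steps 1 and 2: embed each text and keep the vector next to it
        for text in texts:
            self.items.append((embed(text), text))

    def similarity_search(self, query, k=1):
        # Step 3: embed the query and return the k closest texts
        query_vec = embed(query)
        ranked = sorted(
            self.items,
            key=lambda item: cosine(query_vec, item[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]

store = ToyVectorStore()
store.add([
    "harry buys a wand",
    "the owl delivers mail to the school",
    "the train leaves at eleven",
])
top = store.similarity_search("which wand chose harry", k=1)
print(top)
```

A real vector database replaces the linear scan in `similarity_search` with an index so that lookups stay fast as the number of vectors grows.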

Make sure to set your OpenAI API key before running this cell.


In [29]:
from google.colab import userdata
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
import os

In [30]:
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

In [31]:
# Create embedding model and populate vector DB
embedding_model = OpenAIEmbeddings()
vector_store = Chroma.from_documents(docs, embedding=embedding_model)

print("Vector store created and ready.")

Vector store created and ready.


### Initialize RetrievalQA Chain for Querying

To perform question answering over our PDF embeddings, we need to:

- Initialize an LLM instance (`ChatOpenAI`)
- Create a `RetrievalQA` chain combining the vector store retriever with the LLM

This chain handles retrieving the most relevant document chunks and generating answers based on them.


In [32]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

In [33]:
# Initialize the language model
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")

In [34]:
# Create the RetrievalQA chain using the vector store as retriever
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vector_store.as_retriever(),
    return_source_documents=True
)

print("RetrievalQA chain ready.")

RetrievalQA chain ready.


### Ask Questions About the PDF (RAG in Action)

Now that we have:
- Chunks of the document stored as embeddings
- A vector database (Chroma) ready for semantic search
- A language model (OpenAI GPT) initialized

We can use **Langchain’s RetrievalQA chain** to:
1. Accept a natural language query
2. Retrieve the most relevant chunks from the vector store
3. Pass those chunks as context to the LLM
4. Generate an informed, grounded answer

This is Retrieval-Augmented Generation (RAG) in action — fetching knowledge from the actual content instead of relying on model memory alone.


In [35]:
# Ask a question to the PDF
query = "What is the name of the school Harry Potter goes to?"
result = qa_chain(query)

In [36]:
# Show the answer
print("Answer:", result['result'])

Answer: The school Harry Potter attends is called Hogwarts School of Witchcraft and Wizardry.


In [37]:
# (Optional) View the first retrieved source chunk
print("\n--- Source Document Sample ---\n")
print(result['source_documents'][0].page_content[:500])


--- Source Document Sample ---

with a great destiny proves his worth while attending Hogwarts School 
of Witchcraft and Wizardry. 
ISBN 0-590-35340-3 
[1. Fantasy — Fiction.    2. Witches — Fiction.    3. Wizards — Fiction. 
4. Schools — Fiction.    5. England — Fiction.]    I. Title. 
PZ7.R79835Har    1998 
[Fic] — dc21    97-39059 
 
64 65 66 67 68 69 70 71 72    05 
Printed in U.S.A.     10 
First American edition, October 1998


In [38]:
query = "What is the Mirror of Erised and what does it show Harry?"
result = qa_chain(query)

# Show the answer
print("Answer:", result['result'])

# Show a snippet from the top source document
print("\n--- Source Document Excerpt ---\n")
print(result['source_documents'][0].page_content[:700])

Answer: The Mirror of Erised is a magical mirror that shows the deepest, most desperate desires of a person's heart. When Harry looks into the mirror, he sees his family, whom he has never known, standing around him. Additionally, the mirror showed Harry's friend Ron as Head Boy, reflecting his own desires and aspirations.

--- Source Document Excerpt ---

THE  MIRROR  OF  ERISED 
 213  
“Strange how nearsighted being invisible can make you,” said 
Dumbledore, and Harry wa s relieved to see that he was smiling. 
“So,” said Dumbledore, slipping off the desk to sit on the floor 
with Harry, “you, like hundreds before you, have discovered the 
delights of the Mi rror of Erised.” 
“I didn’t know it was called that, sir.” 
“But I expect you’ve realized by now what it does?” 
“It — well — it show s me my family —”


## Part 2: Ingesting Multiple PDFs

In real-world applications, you often need to retrieve information from **multiple documents** (e.g., reports, research papers, product manuals).

Langchain and Chroma make this simple:
- You load each PDF separately
- Split their content into chunks
- Combine all the chunks into a single vector store

This unified store lets your retrieval pipeline pull relevant context across all uploaded sources.

**Use case examples:**
- Search across policy documents
- Ask questions across multiple case files
- Build multi-source knowledge bots


### Unzip Folder and Prepare PDFs

We’ve uploaded a zipped folder containing multiple PDFs (`Harry Potter.zip`). Now we’ll:

1. Unzip the folder
2. Collect all `.pdf` file paths
3. Prepare them for processing

This sets us up to ingest all documents in a loop and build a combined vector store.


In [39]:
from google.colab import files
import zipfile
import os

In [40]:
# Unzip the uploaded folder
zip_path = "/content/Harry Potter.zip"
extract_dir = "/content/harry_potter_pdfs"

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_dir)

In [43]:
import glob

In [44]:
# Path to the actual PDF folder
pdf_dir = "/content/harry_potter_pdfs/Harry Potter"
pdf_files = glob.glob(os.path.join(pdf_dir, "*.pdf"))

print(f"Found {len(pdf_files)} PDF files in subfolder:\n")
for f in pdf_files:
    print("-", os.path.basename(f))


Found 3 PDF files in subfolder:

- Chamber of Secrets.pdf
- prizoner of Azkaban.pdf
- Sorcerer's Stone.pdf


### Load and Chunk All PDFs

Now we’ll process all PDFs together:

1. Load each PDF using `PyPDFLoader`
2. Extract its pages as `Document` objects
3. Split all pages into chunks using `RecursiveCharacterTextSplitter`

All chunks will be stored in a single list, which we’ll later embed and store in a combined vector DB.

**Why do this?**
- It allows unified search across all documents
- Each chunk keeps its own embedding and source metadata, so matches can be traced back to a file
- Efficient and scalable for multi-source RAG


In [45]:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [47]:
all_docs = []

# One splitter can be reused for every PDF
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)

# Loop through all PDFs, load their pages, and split them into chunks
for path in pdf_files:
    loader = PyPDFLoader(path)
    pages = loader.load()
    chunks = splitter.split_documents(pages)
    all_docs.extend(chunks)

print(f"Total chunks across all PDFs: {len(all_docs)}")
print("Sample chunk:\n", all_docs[0].page_content[:400])


Total chunks across all PDFs: 3377
Sample chunk:
 Harry Potter and the Chamber
of Secrets PDF
J.K. Rowling
Scan to Download


### Embed and Store All Chunks in a Unified Vector Store

With all document chunks prepared, we now:

1. Use `OpenAIEmbeddings` to convert each chunk to a vector
2. Store all vectors in one `Chroma` instance

This enables cross-document retrieval — meaning you can ask questions that span across multiple Harry Potter books or chapters.

We'll reuse the embedding model setup, and store everything in-memory for now.


In [48]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Initialize embedding model
embedding_model = OpenAIEmbeddings()

# Create a single Chroma vector store for all chunks
multi_doc_vector_store = Chroma.from_documents(all_docs, embedding=embedding_model)

print("Multi-document vector store is ready.")


Multi-document vector store is ready.


### Setup RetrievalQA Chain for Multi-Document Retrieval

Now that all PDFs are embedded in one vector store, we set up the `RetrievalQA` chain again, but this time using `multi_doc_vector_store`.

This allows us to:
- Ask questions without knowing which PDF contains the answer
- Retrieve relevant chunks across all documents
- Keep the pipeline exactly the same as before — just with more data


In [49]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# Reinitialize the LLM (if not done already)
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")

# Create RetrievalQA with new multi-PDF retriever
multi_doc_qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=multi_doc_vector_store.as_retriever(),
    return_source_documents=True
)

print("Multi-document QA chain initialized.")


Multi-document QA chain initialized.


### Ask Questions Across All PDFs

Let’s test the multi-document RAG pipeline with a query whose answer could sit in **any one of the PDFs**.

>  **Query**: "What happens when Harry speaks Parseltongue in the dueling club?"

This checks whether:
- Relevant chunks are correctly retrieved from the right document
- The model synthesizes context into a meaningful, accurate answer

This simulates how multi-source retrieval would work in real applications like case files, legal docs, or product manuals.
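
One way to check that chunks come from the expected document is to look at their metadata. `PyPDFLoader` records the file path under the `source` metadata key, so retrieved chunks can be tallied per file. The sketch below uses hand-written stand-in objects instead of a real chain result, and the helper name `count_sources` and the sample paths are hypothetical.

```python
import os
from collections import Counter
from types import SimpleNamespace

def count_sources(source_documents):
    """Tally retrieved chunks by the file name stored in their metadata."""
    return Counter(
        os.path.basename(doc.metadata.get("source", "unknown"))
        for doc in source_documents
    )

# Stand-ins shaped like Langchain Document objects (assumed sample data)
retrieved = [
    SimpleNamespace(page_content="...", metadata={"source": "/pdfs/book_a.pdf", "page": 12}),
    SimpleNamespace(page_content="...", metadata={"source": "/pdfs/book_a.pdf", "page": 13}),
    SimpleNamespace(page_content="...", metadata={"source": "/pdfs/book_b.pdf", "page": 4}),
]

for name, n in count_sources(retrieved).most_common():
    print(f"{name}: {n} chunk(s)")
```

With a real chain result, the same helper could be called on `result['source_documents']`.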


In [50]:
query = "What happens when Harry speaks Parseltongue in the dueling club?"
result = multi_doc_qa_chain(query)

# Show the answer
print("Answer:", result['result'])

# Show excerpt of retrieved source
print("\n--- Source Document Snippet ---\n")
print(result['source_documents'][0].page_content[:700])


Answer: When Harry speaks Parseltongue in the dueling club, he unintentionally reveals his ability to communicate with snakes. This causes confusion and fear among the other students, as speaking Parseltongue is traditionally associated with dark wizards. Harry's actions are misinterpreted, leading to a sense of isolation and concern about how others perceive him.

--- Source Document Snippet ---

tries to protect Harry. Each character faces their fears and
challenges, emphasizing that courage can manifest in many
forms.
Chapter 11 | THE DUELING CLUB| Q&A
1.Question
What does Harry's experience as a Parselmouth reveal
about his character, and how does it affect his
relationships with others?
Answer:Harry discovers that he can speak
Parseltongue, the language of snakes, which is
traditionally associated with dark wizards,
Scan to Download


## Part 3: Working with Structured Data (Excel Sheets)

Besides unstructured PDFs, many real-world datasets exist in structured formats like Excel spreadsheets.

We’ll:
- Create a sample Excel sheet representing tabular data (e.g., student records or book info)
- Load it using `pandas`
- Convert rows into Langchain `Document` objects
- Embed and store them in a vector database
- Query the data with the same RAG approach

This extends our RAG pipeline beyond text documents to structured datasets.


In [51]:
import pandas as pd

# Sample data: Harry Potter characters info
data = {
    "Name": ["Harry Potter", "Hermione Granger", "Ron Weasley", "Albus Dumbledore"],
    "House": ["Gryffindor", "Gryffindor", "Gryffindor", "Gryffindor"],
    "Role": ["Student", "Student", "Student", "Headmaster"],
    "Wand": [
        "11-inch Holly, Phoenix feather",
        "10¾-inch Vine wood, Dragon heartstring",
        "14-inch Willow, Unicorn hair",
        "15-inch Elder, Thestral tail hair"
    ],
    "Patronus": ["Stag", "Otter", "Jack Russell Terrier", "Phoenix"]
}

df = pd.DataFrame(data)

# Save as Excel
excel_path = "/content/harry_potter_characters.xlsx"
df.to_excel(excel_path, index=False)

print(f"Sample Excel sheet saved at {excel_path}")


Sample Excel sheet saved at /content/harry_potter_characters.xlsx


### Load Excel and Convert Rows to Documents

We’ll load the Excel sheet and treat each row as a separate “document” containing concatenated cell values.

This allows us to:
- Create embeddings for each row
- Search across structured data with semantic queries
- Use the same RAG approach as with PDFs

Each row’s content is combined into a single text string to be embedded.
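
Some cell values in this dataset contain commas themselves (for example, the wand descriptions), so a comma-joined string can be ambiguous to read back. As one possible alternative, and not what this notebook does, the sketch below joins fields with newlines and keeps the original values as metadata. The helper `row_to_record` and the plain-dict record format are hypothetical.

```python
def row_to_record(row):
    """Turn one table row (a dict) into text plus metadata."""
    text = "\n".join(f"{col}: {val}" for col, val in row.items())
    return {"page_content": text, "metadata": dict(row)}

row = {
    "Name": "Harry Potter",
    "House": "Gryffindor",
    "Wand": "11-inch Holly, Phoenix feather",
}
record = row_to_record(row)
print(record["page_content"])
```

Keeping the raw values in metadata leaves room for exact filtering on a column later, alongside the semantic search over the text.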


In [52]:
from langchain.schema import Document
import pandas as pd

# Load the Excel file
df = pd.read_excel(excel_path)

# Convert each row to a Document
docs = []
for _, row in df.iterrows():
    content = ", ".join([f"{col}: {val}" for col, val in row.items()])
    docs.append(Document(page_content=content))

print(f"Converted {len(docs)} rows to Documents. Sample:\n")
print(docs[0].page_content)


Converted 4 rows to Documents. Sample:

Name: Harry Potter, House: Gryffindor, Role: Student, Wand: 11-inch Holly, Phoenix feather, Patronus: Stag


### Embed and Store Excel Rows in Vector Store

We will use the same embedding model (`OpenAIEmbeddings`) to convert each row into a vector and store them in Chroma.

This lets us run semantic searches over structured tabular data just like unstructured text.


In [53]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Initialize embedding model (reuse if already initialized)
embedding_model = OpenAIEmbeddings()

# Create vector store from the Excel row documents
excel_vector_store = Chroma.from_documents(docs, embedding=embedding_model)

print("Excel data embedded and stored in vector store.")


Excel data embedded and stored in vector store.


### Setup RetrievalQA Chain for Excel Data

We’ll use the same approach as before:

- Initialize the LLM (`ChatOpenAI`)
- Use the Excel vector store’s retriever
- Create a RetrievalQA chain to answer queries over the structured data


In [55]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# Initialize LLM (reuse if possible)
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")

# Create RetrievalQA with Excel vector store retriever
excel_qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=excel_vector_store.as_retriever(),
    return_source_documents=True
)

print("RetrievalQA chain for Excel data ready.")


RetrievalQA chain for Excel data ready.


In [58]:
query = "What is the patronus of Hermione Granger?"
result = excel_qa_chain(query)

print("Answer:", result['result'])

print("\n--- Source Document ---\n")
print(result['source_documents'][0].page_content)


Answer: Hermione Granger's Patronus is an otter.

--- Source Document ---

Name: Hermione Granger, House: Gryffindor, Role: Student, Wand: 10¾-inch Vine wood, Dragon heartstring, Patronus: Otter
