[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ovaccarelli/LLM-RAG/blob/main/notebooks/llm_rag_Open_Source_AI_Workshop_3.ipynb)


# 🔧 Setup

In [None]:
# Install all required Python packages for this workshop

!pip install wget rich langchain langchain-community pypdf faiss-cpu sentence-transformers rank_bm25

In [None]:
import os
from pathlib import Path
import wget
from rich.console import Console
# Integrations live in langchain-community; the text splitter lives in langchain-text-splitters
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.retrievers import BM25Retriever

console = Console()

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## 3. Construct the vectorstore

In this step, we take the PDF documents and transform them into a searchable vector database.


In [None]:
# Create the "data/PDFs" folder if it doesn't exist
PDF_FOLDER = Path("../data/PDFs")
os.makedirs(PDF_FOLDER, exist_ok=True)

urls = [
    "https://raw.githubusercontent.com/ovaccarelli/LLM-RAG/main/data/PDFs/Open_Source_AI_workshop.pdf",
]

# Download each PDF into PDF_FOLDER unless it is already there
for url in urls:
    name = url.split("/")[-1]
    target = PDF_FOLDER / name
    if not target.is_file():
        wget.download(url, str(target))
console.print("PDF files are available locally.", style="bold green")

In [None]:
# 1. Create a folder to store the vector index
VECTORSTORES_DIR = Path("../data/vectorstores")
os.makedirs(VECTORSTORES_DIR, exist_ok=True)

# 2. Point to the directory containing our PDFs
PDF_FOLDER = Path("../data/PDFs")

# 3. Use PyPDFDirectoryLoader to load every PDF page as a Document
loader = PyPDFDirectoryLoader(PDF_FOLDER)
documents = loader.load()

# 4. Verify how many pages are loaded
print(f"Loaded {len(documents)} PDF pages")

### ✂️ Split Documents into Chunks

We break documents into smaller overlapping chunks using `RecursiveCharacterTextSplitter`.

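- `chunk_size`: The maximum number of characters per chunk.

- `chunk_overlap`: The maximum number of characters shared by consecutive chunks, so that a sentence cut at a chunk boundary still appears with some of its surrounding context.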

This is crucial for preserving semantic meaning across sentences and paragraphs.
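
To make the two parameters concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to cut on separators such as paragraph breaks, newlines, and spaces, so its chunks are usually shorter than `chunk_size` and the overlap is not exact. The helper name `sliding_window_chunks` and the demo string are made up for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share exactly chunk_overlap characters
    (a simplification of what a separator-aware splitter does).
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


demo_text = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(demo_text, chunk_size=10, chunk_overlap=3)
for chunk in demo_chunks:
    print(chunk)
```

Each printed window starts with the last three characters of the previous one, which is the role `chunk_overlap` plays in the cell below.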

In [None]:
# Set chunk size (how many characters per chunk) and overlap
CHUNK_SIZE = 500
CHUNK_OVERLAP = 10

# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP
)

# Split the loaded PDFs into smaller, overlapping chunks
all_splits = text_splitter.split_documents(documents)

print(f"✅ Split into {len(all_splits)} chunks")

In [None]:
# Preview the chunks
for i, chunk in enumerate(all_splits[:2]):  # print first 2 for brevity
    print(f"--- 📑 Chunk {i+1} ---")
    print(chunk.page_content) 
    #print("Metadata:", chunk.metadata)
    print()

### 🔍 Convert Text Chunks to Embeddings

We now convert each text chunk into a high-dimensional vector using an embedding model. These vectors capture the semantic meaning of the text.

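- We use `HuggingFaceEmbeddings` from LangChain to load the embedding model.

- Normalizing embeddings to unit length makes similarity scores comparable across chunks: for unit vectors, ranking by inner product or by Euclidean distance gives the same order as ranking by cosine similarity.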

- We set the device to "cpu" for compatibility with Colab. (If you're running this on a local machine with GPU, you can switch "cpu" to "cuda" for better performance.)
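
The effect of normalization can be checked on toy vectors without any model. The sketch below is plain Python with made-up two-dimensional vectors (real embeddings have hundreds of dimensions or more); it assumes non-zero vectors and shows that, after scaling to unit length, the inner product equals the cosine similarity of the original vectors.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors chosen for illustration only
a = [3.0, 4.0]
b = [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

print("cosine similarity of raw vectors:", cosine_similarity(a, b))
print("inner product of unit vectors:   ", dot(a_unit, b_unit))
```

Both lines print the same value up to floating-point rounding, which is why `normalize_embeddings=True` lets a distance-based index behave like a cosine-similarity search.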

In [None]:
# Define the embedding model 
EMBEDDING_MODEL_NAME = "Qwen/Qwen3-Embedding-0.6B"

embedding_model = HuggingFaceEmbeddings(
    model_name=EMBEDDING_MODEL_NAME,
    model_kwargs={"device": "cpu"},  # "cuda" if you run locally with a GPU
    encode_kwargs={"normalize_embeddings": True},
)

In [None]:
# Example - Create one vector from a text
sample_text = "..."
vec = embedding_model.embed_query(sample_text)

print("Vector length:", len(vec))
print("First 10 values:", vec[:10])

### 🏗️ Create and Save the Vectorstore

Using the text chunks and embeddings, we build our vectorstore:

- FAISS (Facebook AI Similarity Search) is a fast library for vector similarity search.

- This index will let us retrieve the most relevant chunks given a user question.

We also save the vectorstore locally so that it can be reused later without recomputing everything.
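
Conceptually, the simplest kind of vector index compares the query vector with every stored vector and returns the closest ones. The sketch below shows that idea in plain Python with squared Euclidean distance. It is an illustration under stated assumptions, not FAISS code: the names `top_k` and `toy_store` and the two-dimensional vectors are invented for the example, and FAISS implements the same search far more efficiently.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, stored, k=2):
    """Return the texts of the k stored vectors closest to query_vec.

    stored is a list of (text, vector) pairs.
    """
    ranked = sorted(stored, key=lambda item: squared_l2(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Hypothetical store: three chunks with hand-written 2-D "embeddings"
toy_store = [
    ("chunk A", [1.0, 0.0]),
    ("chunk B", [0.0, 1.0]),
    ("chunk C", [0.8, 0.6]),
]

toy_results = top_k([0.9, 0.1], toy_store, k=2)
print(toy_results)
```

The query vector points almost in the same direction as chunk A, so chunk A is returned first, followed by chunk C.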

In [None]:
# Create a FAISS index from the text chunks and their embeddings
vectorstore = FAISS.from_documents(documents=all_splits, embedding=embedding_model)

# Save the vectorstore locally for reuse
vectorstore.save_local(VECTORSTORES_DIR)

print("✅ Vectorstore created and saved successfully.")

### 💾 Reload the Vectorstore (Optional)

In [None]:
# You can reload the saved vectorstore anytime without recomputing everything
vectorstore = FAISS.load_local(
    VECTORSTORES_DIR,
    embedding_model,
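    # The saved index includes a pickle file, so LangChain asks for an explicit opt-in.
    # Only enable this for index files you created yourself or otherwise trust.
    allow_dangerous_deserialization=True,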
)

print("✅ Vectorstore reloaded successfully.")

### Test a similarity search

In [None]:
query = "..."
results_faiss = vectorstore.similarity_search(query, k=2)

for i, res in enumerate(results_faiss, 1):
    print(f"\n🔎 Result {i}")
    print(res.page_content)  # preview chunk text


#### 🔹 Minimal example: BM25 sparse (keyword) retrieval
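
BM25 represents each chunk by its word counts, which is a sparse vector: for any given chunk, most words in the vocabulary have a count of zero. A query is scored against a chunk by summing, over the query words the chunk contains, a weight that grows with the word's frequency in the chunk and with its rarity across all chunks. The sketch below is a plain-Python version of the standard Okapi BM25 formula with the common defaults `k1=1.5` and `b=0.75`. It assumes whitespace tokenization and non-empty chunks; the `rank_bm25` package used by `BM25Retriever` may differ in details such as the IDF variant, so the exact scores are not expected to match. The function name and toy documents are made up for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in docs against query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)  # sparse term-count vector for this document
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # absent terms contribute nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


toy_docs = [
    "faiss builds a dense vector index",
    "bm25 ranks documents by keyword overlap",
    "embeddings capture semantic meaning",
]
toy_scores = bm25_scores("keyword ranking with bm25", toy_docs)
print(toy_scores)
```

Only the second toy document shares words with the query, so it is the only one with a non-zero score. Note that "ranking" does not match "ranks": BM25 works on exact tokens, which is the main contrast with the embedding-based search above.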

In [None]:
query = "..."

# Create a BM25 retriever over the same chunks and limit it to the top-2 results
retriever = BM25Retriever.from_documents(all_splits)
retriever.k = 2

# invoke() replaces the deprecated get_relevant_documents()
results_BM25 = retriever.invoke(query)

for i, res in enumerate(results_BM25, 1):
    print(f"\n🔎 Result {i}")
    print(res.page_content[:300])  # preview chunk text

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------