<div style="background-color: #e6f0fa; border-left: 6px solid #1c6ea4; padding: 1em;">

<h1> Key Learning Outcomes from This Notebook</h1>

By completing this notebook, you will have:

<ul>
  <li>📄 Gained a practical introduction to Retrieval-Augmented Generation (RAG) using modern language models.</li>
  <li>🧠 Understood how embeddings work and how they can be used to represent text in vector space.</li>
  <li>🗂️ Built a vector store (e.g., FAISS) for storing and retrieving semantically relevant documents.</li>
  <li>🧩 Combined retrieval with generation by integrating a RAG pipeline.</li>
  <li>💬 Created prompt-based systems that generate context-aware answers based on retrieved documents.</li>
  <li>⚙️ Learned to structure RAG systems to enhance the factual accuracy of LLM outputs by grounding them in external data.</li>
</ul>

</div>


# 1) Retrieval-Augmented Generation (RAG)

> **“Bring your own knowledge base to the language model.”**

---

### 1a. Why do we need RAG?

Retrieval-Augmented Generation, or **RAG**, is a powerful method that combines the reasoning capabilities of large language models (LLMs) with access to external information sources such as documents, websites, or databases.

The idea is simple: instead of expecting the language model to know everything, we **retrieve relevant content on demand** from your own data, and provide that context to the model as part of the prompt. The result is a more accurate, up-to-date, and trustworthy response... in principle :)

We can summarise this below:

| Challenge | Consequence without RAG |
|-----------|-------------------------|
| LLMs are **trained on a fixed snapshot** of the internet. | They may be *out of date* or simply *lack niche domain details*. |
| They have a limited context window. | We can’t fit entire libraries of PDFs or long knowledge bases into a single prompt. |
| Pure LLM generation can hallucinate. | Answers may be fluent but **factually wrong** or missing citations. |

Language models are impressive, but they come with key limitations:

- They are trained on a **fixed dataset**, which means they may lack recent developments, niche scientific knowledge, or anything outside their training cutoff.
- They can **hallucinate** facts — generating answers that sound correct but are actually untrue.
- They have a **limited context window** — you can’t fit an entire research paper or website into one prompt.

RAG solves these problems by **bringing in external knowledge at runtime** — just before asking the model to generate an answer.


### 1b). How RAG works — two-step loop

At its core, RAG is a two-step process:

1. **Retrieve**: When a user asks a question, we first **search a custom knowledge source** (like uploaded PDFs or a scraped website) to find the most relevant text passages. This is typically done using semantic search with **embeddings** (vector representations of text), although keyword search can also be used.

2. **Generate**: We then pass both the **question and the retrieved text** to the language model. The model uses this injected context to generate a grounded and accurate answer.

This gives the model **temporary access to specific information** — like letting it “look something up” before answering.
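To make the two steps concrete, here is a minimal sketch of the loop that needs no API key. It is an illustration only: `toy_retrieve` ranks passages by shared words instead of embeddings, `build_prompt` is a hypothetical helper, the passages are made-up placeholders, and the actual model call is left out.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def toy_retrieve(question, passages, k=2):
    """Step 1 (Retrieve): rank passages by how many words they share with the question."""
    q_words = tokenize(question)
    ranked = sorted(passages, key=lambda p: len(q_words & tokenize(p)), reverse=True)
    return ranked[:k]

def build_prompt(question, retrieved):
    """Step 2 (Generate): place the retrieved text in front of the question."""
    context = "\n\n".join(retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer using the context above."

# Placeholder passages standing in for a real knowledge base
passages = [
    "Indoor mold exposure can trigger asthma symptoms.",
    "FAISS is a library for similarity search.",
    "Damp buildings encourage mold growth indoors.",
]
question = "What are the health risks of indoor mold exposure?"
prompt = build_prompt(question, toy_retrieve(question, passages, k=2))
print(prompt)  # this string is what would be sent to the LLM
```

The unrelated FAISS passage never reaches the prompt, which is the whole point of the retrieval step.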

#### An Example

Let’s say a student asks:

> *“What are the health risks associated with indoor mold exposure?”*

A standard language model might give a generic answer — or hallucinate.  
With RAG, we:

- Search a folder of academic papers on environmental health and find 3 passages about mold exposure.
- Pass those passages along with the question into GPT-3.5 or GPT-4.
- The model now generates an answer that draws directly from the retrieved documents.

This answer is both accurate and traceable to real sources.


### 1c). Why RAG is powerful

* **Up-to-date** — you can swap in fresh documents without retraining the model.  
* **Domain-specific** — feed the LLM specialised papers, policies, codebases, etc.  
* **Explainable** — you can surface the exact passages returned by the retriever as citations.  
* **Cost-efficient** — store millions of documents in a vector index; send only the top-k snippets to the LLM.

### 1d). Typical RAG pipeline in practice

```text
User Question  ──►  Embed (OpenAI)  ──►  Vector Search (FAISS)
                                        │
                                        ▼
                          Top-k text chunks (context)
                                        │
              ──────────────────────────┘
              ▼
       Prompt LLM (GPT-3.5, GPT-4o, etc.)
              ▼
         Grounded Answer + Citations
```


### 1e) What Libraries Do We Use in This Notebook?

To build our RAG pipeline using PDFs, webpages, and OpenAI models, we rely on the following Python libraries:

| Library / Tool                 | What It Does                                                                                             | Docs / Link |
|-------------------------------|-----------------------------------------------------------------------------------------------------------|-------------|
| `openai`                      | - Sends text to OpenAI models<br>- Creates embeddings with `text-embedding-3-small`<br>- Generates answers with `gpt-3.5-turbo`, `gpt-4o`, etc. | [API Docs](https://platform.openai.com/docs) |
| `PyMuPDF` (`fitz`)            | - Opens and reads PDF files<br>- Extracts text from each page<br>- Converts entire PDF to plain text     | [Docs](https://pymupdf.readthedocs.io) |
| `requests` + `beautifulsoup4` | - Fetches web content with `requests`<br>- Parses HTML using `BeautifulSoup` to extract paragraphs (`<p>`) | [Requests](https://requests.readthedocs.io) <br> [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) |
| `faiss` (Facebook AI)         | - Stores and searches document embeddings<br>- Performs fast similarity search for top-k relevant chunks | [FAISS GitHub](https://github.com/facebookresearch/faiss) |
| `tiktoken` *(optional)*       | - Counts tokens in text (for OpenAI models)<br>- Helps manage context length before sending prompts       | [GitHub](https://github.com/openai/tiktoken) |
| `transformers` + `bitsandbytes` *(optional)* | - Loads small local models (e.g. Gemma-2B, Phi-2)<br>- `bitsandbytes` enables 4-bit quantisation to run on limited GPUs | [Transformers Docs](https://huggingface.co/docs/transformers) <br> [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) |

---

> Some of these are **essential** (like `openai`, `faiss`, `fitz`), while others are **optional** but useful for comparison or performance (like `transformers`, `tiktoken`).




In [1]:
!pip install -q openai faiss-cpu PyMuPDF beautifulsoup4 requests tiktoken
!pip install -U transformers bitsandbytes accelerate



# 2) · Configure OpenAI & import modules  
  
Before we begin loading documents or embedding anything, we need to **set up a few foundational components**:

1. **Connect to your Google Drive**: This allows us to access your PDFs or other local files stored in your Drive. We'll use it to load academic papers or datasets for the RAG pipeline.

2. **Set up your OpenAI API key**: This authenticates your access to OpenAI’s models (for both embeddings and chat completions). We'll place the key in the `OPENAI_API_KEY` environment variable, which the `openai>=1.0` client reads automatically.

3. **Import essential libraries**: We import only what we need at this stage:
   - file system tools
   - OpenAI client
   - PDF and webpage handling libraries
   - FAISS for vector search

---

Colab will prompt you to **authorize access to your Google Drive** when it runs the mounting step. Make sure you're signed into the correct Google account and have already uploaded your documents.
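One way to avoid pasting the API key into the notebook itself is to read it from the environment and prompt only when it is missing. This is a sketch rather than the notebook's own setup code: `load_secret` is a hypothetical helper, and its `prompt_fn` and `env` parameters exist only so the example can run without anyone typing a real key.

```python
import os
from getpass import getpass

def load_secret(name, prompt_fn=getpass, env=os.environ):
    """Return env[name] if it is set; otherwise prompt for it and cache it in env."""
    value = env.get(name)
    if not value:
        value = prompt_fn(f"Enter {name}: ")  # getpass hides the typed input
        env[name] = value
    return value

# Demonstration with a plain dict and a fake prompt (no real key involved)
fake_env = {}
key = load_secret("OPENAI_API_KEY", prompt_fn=lambda msg: "sk-demo", env=fake_env)
print(key == fake_env["OPENAI_API_KEY"])
```

In the notebook, `load_secret("OPENAI_API_KEY")` would be called once before `OpenAI()` is created.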

> 💡 This setup cell is the


### Library Overview

| Library / Module              | Purpose                                                                 |
|------------------------------|-------------------------------------------------------------------------|
| `google.colab.drive`         | Mounts your Google Drive inside the Colab virtual machine              |
| `pathlib`, `os`, `glob`      | Handy tools for working with files and folders                         |
| `fitz` (PyMuPDF)             | Extracts text content from PDF documents                               |
| `requests` + `BeautifulSoup` | Scrapes and cleans text from websites (HTML parsing)                   |
| `faiss`                      | Performs fast similarity search on vector embeddings (retrieval step)  |
| `OpenAI`                     | Python SDK (v1.x) for calling embeddings and LLMs like GPT-3.5, GPT-4o |



In [2]:
from google.colab import drive
import os, shutil, glob, re
from pathlib import Path
import openai, fitz, requests
from bs4 import BeautifulSoup
import numpy as np
import faiss                # Facebook AI Similarity Search
from openai import OpenAI

# 🔑 Set your key (replace the placeholder; do not share a notebook that contains a real key)
os.environ["OPENAI_API_KEY"] = "<<INSERT YOUR TOKEN HERE>>"
client = OpenAI()  # reads OPENAI_API_KEY from the environment

mount='/content/drive'
print("Colab: mounting Google drive on ", mount)      # You’ll be prompted to authenticate
drive.mount(mount)

PDF_FOLDER = Path("/content/drive/MyDrive/RAG_papers")

import textwrap
if not os.path.isdir(PDF_FOLDER):
    print(f"Folder not found: {PDF_FOLDER}")
    print("Please check the path and ensure the folder exists in your Google Drive.")
else:
    print(f"Folder found: {PDF_FOLDER}")
    print("Files:", len(glob.glob(f"{PDF_FOLDER}/**/*.pdf", recursive=True)))

Colab: mounting Google drive on  /content/drive
Mounted at /content/drive
Folder found: /content/drive/MyDrive/RAG_papers
Files: 3


# 3) · Read all PDFs in a folder  
  
* Colab pops up a Google sign-in; approve it.  
* Put your academic PDFs in a Drive folder, e.g. `/MyDrive/RAG_papers`, the path that `PDF_FOLDER` points to.  
* The code extracts **raw text** from every page of every PDF and returns a list of strings (one per file).

#### What the function does — step by step

1. **Loop through the folder** and examine every filename.  
2. **Filter by extension** (`.pdf`) so non-PDF files are skipped.  
3. **Open each PDF** with PyMuPDF (`fitz`); the `with` block closes the file automatically.  
4. **Iterate page-by-page** inside the opened PDF.  
5. **Extract text** from each page (`page.get_text()`) and store it in a temporary list called `full`.  
6. **Concatenate** all page texts into one long string and append that string to `texts`.  
7. **Return** the final list — every element corresponds to the complete text of one PDF.



In [3]:
def load_pdfs(folder_path):
    texts = []
    for fname in os.listdir(folder_path):
        if fname.lower().endswith(".pdf"):
            with fitz.open(os.path.join(folder_path, fname)) as doc:
                full = []
                for page in doc:
                    full.append(page.get_text())
                texts.append(" ".join(full))
    return texts

pdf_docs = load_pdfs(PDF_FOLDER)
print(f"Loaded {len(pdf_docs)} PDFs.")

Loaded 3 PDFs.
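Note that the setup cell counts PDFs with a recursive glob, while `load_pdfs` only looks at the top level of the folder, so the two numbers can differ when PDFs sit in subfolders. A minimal sketch of a helper that lists PDFs recursively and in a stable order is shown below; `find_pdfs` is a hypothetical name, and the temporary folder exists only to demonstrate it.

```python
import tempfile
from pathlib import Path

def find_pdfs(folder):
    """Return all .pdf paths under folder (including subfolders), sorted by path."""
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )

# Demonstration on a throwaway folder
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ["b.pdf", "a.PDF", "notes.txt", "sub/c.pdf"]:
        (root / name).touch()
    found = [p.relative_to(root).as_posix() for p in find_pdfs(root)]

print(found)
```

Each returned path could then be opened with `fitz.open(str(path))` in place of the `os.path.join(folder_path, fname)` call.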


# 4 · Pull clean text from a webpage  

This function is designed to load and extract clean text from a webpage, specifically the main article content. Many scientific articles, reports, and medical bulletins are published on the web as HTML. This function allows us to extract meaningful content from public webpages and use it in our RAG system, just like we do with PDFs.

Here's how it works:

- `res = requests.get(url, timeout=10)` fetches the full HTML of the page. If the page takes more than 10 seconds to respond, the request times out.
- `soup = BeautifulSoup(res.text, "html.parser")` turns the raw HTML into a structured tree that we can easily search and manipulate.
- `soup.find_all("p")` finds every paragraph in the page.
- `p.get_text(" ", strip=True)` gets the actual text from inside each paragraph.
- `" ".join(...)` combines all paragraph texts into one clean string, separated by spaces.
  

In [None]:
def load_webpage(url):
    res = requests.get(url, timeout=10)
    res.raise_for_status()   # stop on HTTP errors (4xx/5xx) instead of parsing an error page
    soup = BeautifulSoup(res.text, "html.parser")
    return " ".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))

web_doc = load_webpage("https://www.cdc.gov/mold-health/about/index.html")   # change URL as needed


We can preview the text extracted from the webpage with the code below. Some of it may look entirely irrelevant (site banners, navigation text). That is fine for now: separating relevant passages from irrelevant ones is the job of the embedding and retrieval steps.

In [None]:
# First 50 words
words = web_doc.split()
print(" ".join(words[:50]) + " …")


A .gov website belongs to an official government organization in the United States. A lock ( ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites. Related Topics: There is always some mold around. Molds have been on the Earth for …


# 5 · Chunk Documents and Create Embeddings

In Retrieval-Augmented Generation, we first **break long texts into smaller “chunks”** and then **turn each chunk into a numeric vector** (an *embedding*). These two steps—*chunking* and *embedding*—are the backbone of the retrieval phase.

#### Why do we chunk documents?

* **Context-window limits**. Every model has a finite context window (roughly 16k tokens for `gpt-3.5-turbo`, 128k for GPT-4o). A collection of long PDFs can easily exceed that limit, and longer prompts also cost more.
* **Granular retrieval**. When a user asks a question, we want to pull back *just the most relevant passages*—not an entire 20-page paper.
* **Reduced hallucinations**. Smaller, focused chunks lower the risk of the model mis-summarising unrelated sections.

> **Analogy**: Think of chunking as cutting a thick textbook into sticky-note-sized excerpts you can quickly flip to.


#### How does the sliding-window chunker work?

```text
chunk_size = 500 characters, overlap = 100 characters

doc:  [A][B][C][D][E][F][G] ...        ← imaginary characters
      |---- chunk 1 ----|
                    |---- chunk 2 ----|
                                  |---- chunk 3 ----|
```

We start at character 0 and take `chunk_size` characters.
We then slide forward by `chunk_size - overlap` characters, so each new chunk overlaps slightly with the previous one.
Overlap means that a key fact sitting on a boundary between chunks is more likely to appear in full in at least one chunk.

Note that `chunk_text` measures `chunk_size` and `overlap` in characters, not tokens, because it slices a Python string. A 500-character chunk is therefore well under 500 tokens.
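A tiny worked example makes the overlap visible. The sketch below repeats the sliding-window logic on a ten-character string with deliberately small values (`chunk_size=4`, `overlap=1`); `sliding_chunks` is a stand-in name so the example runs on its own, and the units here are characters because it slices a string.

```python
def sliding_chunks(text, chunk_size, overlap):
    """Slice text into windows of chunk_size that share `overlap` characters with the previous one."""
    step = chunk_size - overlap          # how far the window moves each time
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

chunks = sliding_chunks("abcdefghij", chunk_size=4, overlap=1)
print(chunks)  # → ['abcd', 'defg', 'ghij', 'j']
```

Each chunk begins with the last character of the previous one, and the final chunk may be shorter than `chunk_size`.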

### What are embeddings?

An embedding is a dense vector (e.g. 1 × 1536 floats) that captures the semantic meaning of a text chunk.
Two chunks that talk about “mold spores” will be close in embedding space, even if they use different words.
We’ll use the model ```text-embedding-3-small``` first because:

 - It’s the cheapest option in the new OpenAI embedding family.
 - Its input limit (8,191 tokens per text) comfortably covers our chunk sizes.
 - It produces vectors compatible with FAISS out of the box.

```chunk_text``` splits a raw document string into overlapping slices.

```embed_chunks``` sends those slices to the OpenAI Embeddings endpoint in one batch for efficiency and returns a list of vectors.

FAISS stores only the vectors. The text chunks stay in a Python list in the same order, so the row number FAISS returns for a match can be used to look up the corresponding chunk.
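To give a feel for what "close in embedding space" means, the sketch below compares toy vectors with cosine similarity. The three-dimensional vectors are invented for illustration (real embeddings from `text-embedding-3-small` have 1536 dimensions), and `cosine_similarity` is a helper written for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means same direction, near 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented toy embeddings (not produced by any model)
mold_spores   = [0.9, 0.1, 0.0]
fungal_growth = [0.8, 0.2, 0.1]
tax_returns   = [0.0, 0.1, 0.9]

print(cosine_similarity(mold_spores, fungal_growth))  # high: similar topics
print(cosine_similarity(mold_spores, tax_returns))    # low: unrelated topics
```

The index built later in the notebook ranks by L2 distance rather than cosine similarity; for vectors of unit length the two measures produce the same ordering.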

In [None]:
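def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into overlapping slices of chunk_size characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")  # otherwise the loop never advances
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap   # slide forward, keeping `overlap` characters
    return chunks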

def embed_chunks(chunks, model="text-embedding-3-small"):
    response = client.embeddings.create(
        input=chunks,
        model=model
    )
    return [record.embedding for record in response.data]
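`embed_chunks` sends every chunk of a document in a single request. For very large documents that request may exceed the endpoint's per-request limits, so one option is to send the chunks in batches. The sketch below shows the batching logic only: `embed_in_batches` and the batch size are assumptions, and `fake_embed` stands in for `embed_chunks` so the example runs without an API key.

```python
def embed_in_batches(chunks, embed_fn, batch_size=256):
    """Call embed_fn on successive slices of chunks and concatenate the results in order."""
    vectors = []
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        vectors.extend(embed_fn(batch))   # embed_fn returns one vector per input chunk
    return vectors

# Stand-in for embed_chunks: maps each chunk to a 1-d "vector" holding its length
calls = []
def fake_embed(batch):
    calls.append(len(batch))
    return [[float(len(text))] for text in batch]

vectors = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=2)
print(calls)    # → [2, 2, 1]
print(vectors)  # → [[1.0], [2.0], [3.0], [4.0], [5.0]]
```

Because results are appended batch by batch, the output order still matches the order of `chunks`, which the retrieval step relies on.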

# 6 · Create a FAISS index and add all vectors  
  
We keep the *same order* between `chunks` and vectors so we can retrieve the text later.
We’ll keep things minimal:

1.  `build_faiss_index(chunks, vectors)`  
    *returns* a FAISS `IndexFlatL2` already loaded with all your vectors.

2.  `retrieve(query, index, chunks, k=5)`  
    Takes a raw query string, embeds it with OpenAI, searches the index, and
    returns the *k* most relevant text chunks.


In [None]:
# ---------- 1 · build the index ----------
def build_faiss_index(chunks, vectors):
    """
    chunks  : list[str]   – text chunks (same order as vectors)
    vectors : list[list]  – embedding vectors (float32)
    returns : FAISS index
    """
    dim = len(vectors[0])
    index = faiss.IndexFlatL2(dim)                  # pure L2 index
    index.add(np.asarray(vectors).astype("float32"))
    return index

def retrieve(query, index, chunks, k=5, model="text-embedding-3-small"):
    query_vec = embed_chunks([query], model)[0]
    D, I = index.search(np.array([query_vec], dtype="float32"), k)
    return [chunks[i] for i in I[0]]
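`IndexFlatL2` performs an exhaustive search: it compares the query against every stored vector and returns the `k` smallest squared L2 distances together with their row positions. The pure-Python sketch below mimics that behaviour on toy data to show why `chunks` must stay in the same order as the vectors; `flat_l2_search` is a hypothetical helper, not part of FAISS.

```python
def flat_l2_search(vectors, query, k):
    """Return (distances, indices) of the k vectors closest to query by squared L2 distance."""
    scored = [
        (sum((v - q) ** 2 for v, q in zip(vec, query)), idx)
        for idx, vec in enumerate(vectors)
    ]
    scored.sort()                          # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

# Toy 2-d vectors and the chunks they represent, kept in the same order
vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
chunks = ["chunk A", "chunk B", "chunk C"]

distances, indices = flat_l2_search(vectors, query=[1.0, 2.0], k=2)
print(distances)                     # → [1.0, 5.0]
print([chunks[i] for i in indices])  # → ['chunk B', 'chunk A']
```

The returned indices are only meaningful as positions in `chunks`, which is exactly how `retrieve` maps `I[0]` back to text.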

In [None]:
# 1. Gather all documents
all_texts = pdf_docs + [web_doc]

# 2. Split & embed
all_chunks, all_vectors = [], []
for doc in all_texts:
    ch = chunk_text(doc)
    vec = embed_chunks(ch)
    all_chunks.extend(ch)
    all_vectors.extend(vec)

print(f"Prepared {len(all_chunks)} chunks.")

# 3. Build the FAISS index
faiss_index = build_faiss_index(all_chunks, all_vectors)

# 4. Try a query
query = "What health risks does indoor mold pose?"
top_chunks = retrieve(query, faiss_index, all_chunks, k=3)

print("\n🔍 Top retrieved chunks:\n")
for i, c in enumerate(top_chunks, 1):
    print(f"{i}. {c[:200]}…\n")       # print first 200 chars of each chunk


Prepared 1255 chunks.

🔍 Top retrieved chunks:

1. ilding materials on
which it grows, mold growth is un-
sightly and may produce offensive
odors, and mold is likely to sensitize
and produce allergic responses in
allergic individuals. Except for per-
…

2. t on the 
other hand exposures can lead to respiratory symptoms, asthma and infections (WHO 
2009). This is especially true for spore-forming moulds which have been perennially linked to 
health impac…

3. is found both indoors and outdoors. Mold can enter your home through open doorways, windows, vents, and heating and air conditioning systems. Mold in the air outside can also attach itself to clothing…



# 7 · Generate an Answer with the Retrieved Context

Now that we can pull the most relevant chunks for any question, the next job is to feed **question + context** into an LLM and let it craft a final answer. If this isn't clear, check out the schematics that explain the RAG process.

We’ll start with OpenAI’s `gpt-3.5-turbo` (cheap, fast) and then show how to call a local model if you want to compare. Identify where you can change the model you are interacting with.


In [None]:
# ─── helper to call GPT-3.5 with context ─────────────────────
def rag_answer(query, index, chunks, k=5, model="gpt-3.5-turbo"):
    # 1. retrieve
    context_chunks = retrieve(query, index, chunks, k=k)
    context = "\n\n".join(context_chunks)

    # 2. build prompt
    prompt = (
        "You are a helpful scientific assistant.\n\n"
        "Context:\n"
        f"{context}\n\n"
        f"Question: {query}\n"
        "Answer as a concise paragraph, citing facts from the context where possible."
    )

    # 3. ask the model (OpenAI ≥1.0)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
        max_tokens=400
    )
    return response.choices[0].message.content.strip()

# ─── try it ──────────────────────────────────────────────────
print("💬 GPT-3.5-turbo says:\n")
print(rag_answer(query, faiss_index, all_chunks, k=3))


💬 GPT-3.5-turbo says:

Indoor mold poses health risks such as respiratory symptoms, asthma, and infections. Mold growth can lead to offensive odors and sensitization of allergic individuals. While mold is not a source of fungal infections for most people, exposure to mold spores can cause allergic responses. Scientific evidence does not support the idea that inhaled mycotoxins in indoor environments have adversely affected human health. Mold can enter homes through various means and thrive in damp environments, potentially causing a variety of health issues.


### Understanding `rag_answer()` — How Retrieval Supports Generation

The function `rag_answer()` is the core of our RAG (Retrieval-Augmented Generation) system.  It combines **semantic search** with **language generation** to produce answers that are grounded in real, relevant documents.


---

### 🟩 Step-by-step breakdown

```python
def rag_answer(query, index, chunks, k=5, model="gpt-3.5-turbo"):
    ...
```

1. `k=5`: How many chunks should we retrieve?

The `k` parameter controls how many top-matching chunks we pull from the vector index for each query. These are the passages from your documents that are most semantically similar to the user’s question.

2. Combine retrieved chunks into a single context block

```python
context_chunks = retrieve(query, index, chunks, k=k)
context = "\n\n".join(context_chunks)
```

This builds a string that contains the top `k` relevant text snippets, separated by blank lines.
This "mini knowledge base" becomes the LLM's temporary memory for this specific query.

3. Construct the full prompt

```python
prompt = (
    "You are a helpful scientific assistant.\n\n"
    "Context:\n"
    f"{context}\n\n"
    f"Question: {query}\n"
    "Answer as a concise paragraph, citing facts from the context where possible."
)
```

The prompt asks the model to cite facts from the provided context rather than rely only on its general knowledge.
This helps reduce hallucination and encourages grounded answers.

4. Call the language model

```python
response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": prompt}],
    temperature=0.2,
    max_tokens=400
)
```

We use OpenAI's chat API to generate the final answer. A low temperature (e.g. 0.2) reduces randomness, which tends to keep the answer closer to the supplied context.
`max_tokens` limits the length of the response.

5. Return the model's answer

```python
return response.choices[0].message.content.strip()
```

This returns the model's reply as a clean string, ready to display in your notebook or app.

The LLM doesn’t need to know everything—it just needs the right context at the right time.

In a real application, you could:

 - Tune k dynamically based on query length
 - Rerank retrieved chunks
 - Add citations by showing which chunk each sentence came from
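As a sketch of the citation idea, the retrieved chunks can be numbered in the prompt so the model can refer to them as `[1]`, `[2]` and so on. `build_cited_prompt` is a hypothetical variant of the prompt construction in `rag_answer()`, the chunks are placeholders, and whether the model follows the citing instruction is not guaranteed.

```python
def build_cited_prompt(query, context_chunks):
    """Number each retrieved chunk so the answer can cite it as [n]."""
    numbered = "\n\n".join(
        f"[{n}] {chunk}" for n, chunk in enumerate(context_chunks, start=1)
    )
    return (
        "You are a helpful scientific assistant.\n\n"
        "Context:\n"
        f"{numbered}\n\n"
        f"Question: {query}\n"
        "Answer as a concise paragraph and cite the supporting passage "
        "numbers in square brackets, e.g. [1]."
    )

# Placeholder chunks standing in for the output of retrieve()
demo_chunks = ["Mold can cause allergic responses.", "Mold grows in damp environments."]
print(build_cited_prompt("What health risks does indoor mold pose?", demo_chunks))
```

Since the numbers follow the order of the retrieved list, a cited `[2]` can be traced back to the second retrieved chunk and shown to the user alongside the answer.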



We haven't yet compared a RAG answer with a non-RAG answer. Let's do that below:

In [None]:
# Non-RAG answer (baseline)
def baseline_answer(query, model="gpt-3.5-turbo"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
        temperature=0.2,
        max_tokens=400
    )

    return response.choices[0].message.content.strip()

query = "What are the health risks associated with indoor mold exposure?"

print("🔵 RAG-augmented answer:\n")
print(rag_answer(query, faiss_index, all_chunks, k=3))

print("\n⚫ Baseline answer (no retrieval):\n")
print(baseline_answer(query))


🔵 RAG-augmented answer:

Indoor mold exposure has been linked to various health risks, including upper respiratory tract symptoms, cough, and wheeze in otherwise healthy individuals, as well as asthma symptoms in people with asthma. Additionally, hypersensitivity pneumonitis can occur in individuals susceptible to immune-mediated conditions. Mold exposure can also lead to respiratory symptoms, asthma, and infections, especially with spore-forming molds. Allergic responses, both IgE and IgG mediated, have been associated with exposure to indoor molds. Uncommon allergic syndromes such as allergic bronchopulmonary aspergillosis and allergic fungal sinusitis have also been linked to mold exposure.

⚫ Baseline answer (no retrieval):

Exposure to indoor mold can pose several health risks, including:

1. Respiratory issues: Mold spores can trigger allergic reactions and asthma symptoms in individuals who are sensitive to them. Symptoms may include coughing, wheezing, shortness of breath, and 

The RAG answer should contain facts, phrases, or citations that clearly came from your PDFs or the webpage.
The non-RAG answer may still be good, but it *might*:
 - Miss specific details
 - Sound more generic
 - Introduce hallucinated facts

 Shall we now see what happens when we call a local model rather than an OpenAI model?

### Comparing RAG vs Non-RAG with a Small Local Model

To really showcase the value of Retrieval-Augmented Generation, it helps to use a **small local LLM**. On its own, a model with only a few billion parameters often gives vague or hallucinated answers.  
Add the retrieved context, and it can get much closer to the quality of a larger cloud model.

Below we:

1. **Load** a local instruction-tuned model in 4-bit quantised mode to reduce GPU memory use. The code uses `tiiuae/falcon-7b-instruct`; to try a smaller model such as Gemma-2B-It, change `model_id`.  
2. Define `rag_answer_local()` and `baseline_answer_local()`.  
3. Ask the exact same question you used earlier and print the outputs side-by-side.

Note: you need a Hugging Face access token, usually referred to as ```HF_TOKEN```. To get one:

 - Log in at: https://huggingface.co/login
 - Go to your settings: https://huggingface.co/settings/tokens
 - Click "New token"
 - Name it something like ```colab-rag```
 - Scope: leave as "Read"
 - Click Create, then copy the token for use in the cell below (**you'll only see it once!**)


In [None]:

from getpass import getpass

# Prompt for the token so it is not saved in the notebook itself
os.environ["HF_TOKEN"] = getpass("Hugging Face token: ")
HF_TOKEN = os.environ["HF_TOKEN"]

# ─── 1 · Load the local model (Falcon-7B-Instruct) ─────────────────────────
try:
    local_model, tokenizer  # will raise NameError if not defined
except NameError:
    from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
    import torch, os

    model_id = "tiiuae/falcon-7b-instruct"
    bnb_cfg  = BitsAndBytesConfig(load_in_4bit=True,
                                  bnb_4bit_compute_dtype=torch.bfloat16)

    tokenizer = AutoTokenizer.from_pretrained(model_id, token=HF_TOKEN)
    local_model = AutoModelForCausalLM.from_pretrained(
        model_id,
        token=HF_TOKEN,
        device_map="auto",
        quantization_config=bnb_cfg
    )
    local_model.eval()
    print("✅ Model loaded.")

def local_chat(prompt, max_tokens=400, temperature=0.2):
    inputs = tokenizer(prompt, return_tensors="pt").to(local_model.device)
    outputs = local_model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        temperature=temperature,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
    # Decode only the newly generated tokens (everything after the prompt tokens);
    # slicing the decoded string by len(prompt) can misalign if decoding alters the prompt text
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

# ─── 2 · RAG + baseline helpers for the local model ───────────────────────
def rag_answer_local(query, index, chunks, k=3):
    retrieved = retrieve(query, index, chunks, k)
    context   = "\n\n".join(retrieved)
    # 2. build prompt
    prompt = (
        "You are a helpful scientific assistant.\n\n"
        "Context:\n"
        f"{context}\n\n"
        f"Question: {query}\n"
        "Answer as a concise paragraph, citing facts from the context where possible."
    )
    return local_chat(prompt)

def baseline_answer_local(query):
    return local_chat(query)

# ─── 3 · Run the comparison ───────────────────────────────────────────────
query = "What are the health risks associated with indoor mold exposure?"

print("🟢 Local model WITH RAG:\n")
print(rag_answer_local(query, faiss_index, all_chunks, k=3))

print("\n🔴 Local baseline (no retrieval):\n")
print(baseline_answer_local(query))


✅ Model loaded.
🟢 Local model WITH RAG:

Indoor mold exposure has been associated with a variety of health effects, including respiratory symptoms, asthma, and infections. Although uncommon, allergic syn-
dromes, allergic bronchopulmonary aspergillosis, and allergic fungal sinusitis have been suggested as possible health risks. Indoor mold has not been definitively linked to any specific health risks, but it is possible that it may contribute to other health problems.

In 2009, the World Health Organization issued additional guidance, the WHO Guidelines for Indoor Air Quality: Dampness and Mould. Other recent studies have suggested a potential link of early mold exposure to development of respiratory symptoms, asthma, and infections. This is especially true for spore-forming molds, which have been perennially linked to an increased risk of asthma. Indoor mold has not been definitively linked to any specific health risks, but it is possible that it may contribute to other health problem

# 8) Hybrid Search

**Hybrid search** combines two powerful retrieval techniques:

- **Semantic search** (via vector embeddings): finds text passages that are *meaningfully* related to a query, even if they don’t share the same words.
- **Keyword search** (exact match): looks for passages that contain specific terms or phrases, ensuring high *precision*.

By combining them, we get the **best of both worlds**: semantic flexibility *and* lexical precision.


### Why Use Hybrid Search?

| Capability                  | Vector Search    | Keyword Search    | Hybrid Search    |
|----------------------------|------------------|-------------------|------------------|
| Finds related meaning       | ✅                | ❌                | ✅                |
| Finds exact words/terms     | ❌                | ✅                | ✅                |
| Handles typos/synonyms      | ✅                | ❌                | ✅                |
| Ensures required concepts   | ❌                | ✅                | ✅                |
| Best for long documents     | ✅                | ❌                | ✅                |


### Example Scenario

Imagine the user asks:

> “What are the symptoms of mold exposure in children?”

- **Vector search** might retrieve a chunk that says:  
  *“Mold exposure is associated with respiratory symptoms in young individuals...”*

- **Keyword search** would retrieve a chunk containing:  
  *“Common mold symptoms in children include...”*

- **Hybrid search** retrieves both kinds — ensuring high relevance *and* precision.
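
To make the difference concrete, here is a minimal sketch of the exact-match side of this scenario. The helper `contains_required_terms` and the `required` term list are hypothetical names introduced only for this illustration; the two strings are the example chunks quoted above.

```python
import re

def contains_required_terms(chunk, required_terms):
    """Return True if every required term appears in the chunk as a whole word."""
    words = set(re.findall(r"\w+", chunk.lower()))
    return all(term.lower() in words for term in required_terms)

# The two example chunks from the scenario above
semantic_hit = "Mold exposure is associated with respiratory symptoms in young individuals..."
keyword_hit = "Common mold symptoms in children include..."

required = ["mold", "children"]   # hypothetical must-have terms for this query

print(contains_required_terms(semantic_hit, required))  # False: "children" never appears
print(contains_required_terms(keyword_hit, required))   # True: both terms are present
```

The semantically related chunk talks about "young individuals", so an exact-match check rejects it even though it is on topic. That is the gap a hybrid retriever is meant to close.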



###  Benefits of Hybrid Search

- Captures **semantic nuance** *and* **exact match**
- Improves retrieval quality for scientific or legal texts
- Useful when queries must include specific terms (e.g., “asthma”, “children”, “mold”)
- Helps when users **expect exact keyword matches** but **phrase their questions imprecisely**

Hybrid search is especially valuable in Retrieval-Augmented Generation (RAG), where high-quality context directly affects answer quality. It ensures the LLM sees *both semantically relevant* and *precisely matching* content.

### How Hybrid-Search Weighting Works

When we merge **semantic** (vector) and **keyword** signals, we often want to *rank* the results rather than simply concatenating two lists.  
We do this by giving each signal a **weight** and computing a *composite score* for every candidate chunk.

| Notation | Meaning                                 |
|----------|-----------------------------------------|
| `d`      | FAISS **distance** (lower = better)     |
| `s`      | Semantic similarity = `1 / (1 + d)`     |
| `k`      | Keyword score (e.g., how many keywords appear in the chunk), scaled to 0-1 |
| `α` (alpha) | Weight for semantic score (default 0.7) |
| `β` (beta)  | Weight for keyword score  (default 0.3) |
| Composite | `score = α·s + β·k` (higher = better) |

**Why do this?**

* If the question includes critical terms (“children”, “asthma”), keyword recall becomes essential.  
* Semantic vectors may rank a chunk highly even if it never mentions the *exact* term you care about.  
* Weighting lets you tune that balance: `α = 1.0, β = 0.0` → pure semantic; `α = 0.0, β = 1.0` → pure keyword.
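
As a small worked example of the composite score, the sketch below applies `score = α·s + β·k` to two made-up candidates. The distances and keyword scores are illustrative numbers, not values from the notebook's index, and `composite_score` is a hypothetical helper name.

```python
def composite_score(d, k, alpha=0.7, beta=0.3):
    """Blend a FAISS distance d and a keyword score k (0-1) into one score."""
    s = 1.0 / (1.0 + d)            # distance -> similarity in (0, 1]
    return alpha * s + beta * k

# Two hypothetical candidates (made-up numbers, for illustration only)
chunk_a = {"d": 0.25, "k": 0.0}    # semantically close, no keyword hits
chunk_b = {"d": 1.00, "k": 1.0}    # further away, every keyword present

for alpha, beta in [(1.0, 0.0), (0.7, 0.3), (0.0, 1.0)]:
    a = composite_score(chunk_a["d"], chunk_a["k"], alpha, beta)
    b = composite_score(chunk_b["d"], chunk_b["k"], alpha, beta)
    winner = "A" if a > b else "B"
    print(f"alpha={alpha:.1f} beta={beta:.1f}  A={a:.2f}  B={b:.2f}  winner={winner}")
```

With pure semantic weighting, chunk A wins (0.80 vs 0.50). With the default weights, chunk B overtakes it (0.65 vs 0.56), because its keyword hits outweigh its larger distance.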

We’ll now implement a hybrid retriever that:

1. Runs FAISS to get top-`m` candidates (we’ll use `m = 20` for a wider net)  
2. Computes a keyword score for each FAISS candidate (rather than for *all* chunks)  
3. Combines the two scores with weights `α` and `β`  
4. Returns the top-`k` chunks by composite score (here `k` is the number of results, not the keyword score from the table)



In [None]:
# ── Hybrid retrieval (keyword + semantic) and RAG answer ──────────────────
import re
import numpy as np

def hybrid_retrieve_weighted(
    query: str,
    index,
    chunks,
    k: int = 5,
    m: int = 20,                  # look at top-m semantic hits before re-ranking
    alpha: float = 0.7,           # weight for semantic similarity
    beta: float = 0.3             # weight for keyword overlap
):
    """
    Return top-k chunks ranked by a weighted combo of:
      • semantic similarity  (alpha)
      • keyword match score  (beta)
    """
    # 1) semantic search: embed the query, then fetch the m nearest chunks
    q_vec = client.embeddings.create(input=[query], model="text-embedding-3-small").data[0].embedding
    D, I = index.search(np.array([q_vec], dtype="float32"), m)
    valid        = I[0] >= 0                    # FAISS pads with -1 when the index holds fewer than m vectors
    ids, dists   = I[0][valid], D[0][valid]
    sem_chunks   = [chunks[i] for i in ids]
    sem_scores   = 1 / (1 + dists)              # distance → similarity (0-1); assumes a distance index (lower = closer)

    # 2) keyword score for each candidate
    kws      = re.findall(r"\w+", query.lower())      # crude keywords = all query words, stop words included
    kw_scores = []
    for ch in sem_chunks:
        hits = sum(kw in ch.lower() for kw in kws)    # substring match: "mold" also hits "molds"
        kw_scores.append(hits / max(1, len(kws)))     # normalise 0-1

    # 3) composite score & top-k (argsort on the negated scores = descending order)
    comp = alpha * sem_scores + beta * np.array(kw_scores)
    topk = np.argsort(-comp)[:k]
    return [sem_chunks[i] for i in topk]

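def rag_answer_hybrid(query, index, chunks, k=5, model="gpt-3.5-turbo", alpha=0.7, beta=0.3):
    # alpha/beta are passed through so the semantic/keyword balance can be tuned per call
    ctx = "\n\n".join(hybrid_retrieve_weighted(query, index, chunks, k=k, alpha=alpha, beta=beta))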
    prompt = (
        "You are a helpful scientific assistant.\n\n"
        "Context:\n"
        f"{ctx}\n\n"
        f"Question: {query}\n"
        "Answer concisely, citing facts from the context."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
        max_tokens=400
    )
    return response.choices[0].message.content.strip()

# ── Example query ─────────────────────────────────────────────────────────
query = "What are the symptoms of mold exposure in children?"
print(rag_answer_hybrid(query, faiss_index, all_chunks, k=5))


Symptoms of mold exposure in children can include stuffy nose, wheezing, red or itchy eyes, or skin rash. Early mold exposure has been linked to the development of asthma in some children, particularly those who are genetically susceptible.


<div style="background-color: #e6f0fa; border-left: 6px solid #1c6ea4; padding: 1em;">

<h3>📘 Further Reading & Suggested Next Steps</h3>

To deepen your understanding of RAG pipelines and apply them more effectively, consider exploring the following:

<h4>🔍 Expand Your Skills</h4>
<ul>
  <li>Experiment with embedding models from different providers, such as <code>OpenAI</code>, <code>Cohere</code>, or <code>HuggingFace Transformers</code>.</li>
  <li>Swap out the vector store backend (e.g., try <code>Chroma</code>, <code>Weaviate</code>, or <code>Qdrant</code>) to compare retrieval performance.</li>
  <li>Explore hybrid retrieval methods (dense + keyword-based) for improved relevance.</li>
</ul>

<h4>📚 Recommended Resources</h4>
<ul>
  <li><a href="https://docs.langchain.com/docs/use-cases/question-answering/" target="_blank">LangChain RAG Documentation</a></li>
  <li><a href="https://faiss.ai/" target="_blank">FAISS Documentation</a></li>
  <li><a href="https://huggingface.co/docs/transformers/index" target="_blank">Hugging Face Transformers</a></li>
  <li><a href="https://github.com/openai/openai-cookbook" target="_blank">OpenAI Cookbook – RAG Examples</a></li>
</ul>

<p>By diving deeper into these areas, you’ll be equipped to build scalable, production-ready RAG systems that combine the power of LLMs with reliable, external knowledge.</p>

</div>
