<a href="https://colab.research.google.com/github/micah-shull/LLMs/blob/main/LLM_051_huggingFace_RAG_HotpotQA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### 🔧 Step 1: Install Required Libraries


| Package                 | Purpose |
|-------------------------|---------|
| `transformers`          | LLMs like FLAN-T5 |
| `datasets`              | Load and manage `HotpotQA` |
| `sentence-transformers` | Create vector embeddings |
| `faiss-cpu`             | Fast vector similarity search |
| `python-dotenv`         | Manage your Hugging Face token securely |
| `huggingface_hub`       | Login and download models/data |
| `tqdm`                  | Progress bars |
| `termcolor`             | Colored console output (used in Step 3) |



In [4]:
# # 🔧 Environment + model access
# !pip install python-dotenv
# !pip install huggingface_hub
# !pip install transformers

# # 📦 Hugging Face datasets
# !pip install datasets

# # 💬 Sentence embeddings
# !pip install sentence-transformers

# # 🔍 Vector search
# !pip install faiss-cpu  # Use faiss-gpu if on GPU runtime

# # 📄 Optional: progress bars and colored console output
# !pip install tqdm
# !pip install termcolor


### Step 2: Load Environment Variables and Log In to Hugging Face

In [5]:
from huggingface_hub import login
from dotenv import load_dotenv
import os

# Load the .env file containing your token
load_dotenv("/content/HUGGINGFACE_HUB_TOKEN.env")

# Login using the token
login(token=os.environ["HUGGINGFACE_HUB_TOKEN"])
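
If the `.env` file is missing or the variable name is misspelled, `os.environ[...]` raises a bare `KeyError`. A minimal sketch of a more explicit check is shown below. The helper name `get_hf_token` is hypothetical (it is not part of `huggingface_hub` or `python-dotenv`) and it only reads the environment.

```python
import os

def get_hf_token(var_name="HUGGINGFACE_HUB_TOKEN"):
    """Return the token stored in an environment variable, or raise a clear error.

    Hypothetical helper for illustration; it only reads os.environ.
    """
    token = os.environ.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"{var_name} is not set. Check that load_dotenv() found your .env file."
        )
    return token
```

Under these assumptions, the login call would become `login(token=get_hf_token())`, placed after `load_dotenv(...)`.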

### 📚 Step 3: Load the HotpotQA Dataset


In [10]:
from datasets import load_dataset
import json
from termcolor import colored

# Load the HotpotQA dataset (fullwiki version includes large context)
dataset = load_dataset("hotpot_qa", "fullwiki", trust_remote_code=True)

# Take the first training example as a sample
example = dataset["train"][0]

# Uncomment to inspect the raw context structure
# print(type(example["context"]))
# print(example["context"])

# Question and answer
print(colored("📌 Question:", "cyan", attrs=["bold"]))
print(example["question"], "\n")

print(colored("✅ Answer:", "green", attrs=["bold"]))
print(example["answer"], "\n")

# Print each context paragraph under its title
print(colored("📚 Context (Title → Paragraph):", "yellow", attrs=["bold"]))
titles = example["context"]["title"]
sentences = example["context"]["sentences"]

for title, sentence_list in zip(titles, sentences):
    paragraph = " ".join(sentence_list)
    print(f"📝 {title}:\n{paragraph}\n")

# Supporting facts
print(colored("🔎 Supporting Facts:", "magenta", attrs=["bold"]))
print(example["supporting_facts"])



📌 Question:
Which magazine was started first Arthur's Magazine or First for Women? 

✅ Answer:
Arthur's Magazine 

📚 Context (Title → Paragraph):
📝 Radio City (Indian radio station):
Radio City is India's first private FM radio station and was started on 3 July 2001.  It broadcasts on 91.1 (earlier 91.0 in most cities) megahertz from Mumbai (where it was started in 2004), Bengaluru (started first in 2001), Lucknow and New Delhi (since 2003).  It plays Hindi, English and regional songs.  It was launched in Hyderabad in March 2006, in Chennai on 7 July 2006 and in Visakhapatnam October 2007.  Radio City recently forayed into New Media in May 2008 with the launch of a music portal - PlanetRadiocity.com that offers music related news, videos, songs, and other music-related features.  The Radio station currently plays a mix of Hindi and Regional music.  Abraham Thomas is the CEO of the company.

📝 History of Albanian football:
Football in Albania existed before the Albanian Football Federat

### 🧹 Step 4: Extract Contexts and Build a Document Store

### 🔍 **What This Step Does**
This step prepares the **retrieval database** for RAG. We extract individual **context paragraphs** from the HotpotQA dataset so they can be embedded and indexed for semantic search.

Instead of relying on a static knowledge base, we build the **document store** directly from the dataset. The retriever can then fetch the paragraphs most similar to a user’s question.

---

### 🧠 **Why It's Important**
In a RAG pipeline:
- The **retriever** fetches relevant passages (in this case, context paragraphs)
- The **generator** uses them to answer the question

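If the documents are poorly structured (for example, paragraphs that are too long, duplicated, or cut mid-thought), the retriever returns less relevant passages and the generator has weaker evidence to work with.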


In [12]:
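# Create a list of all documents (context paragraphs) for retrieval
docs = []      # paragraph text, one entry per context paragraph
metadata = []  # parallel list: metadata[i] describes docs[i]

max_examples = 1000  # cap on dataset examples; each one contributes several paragraphs
processed = 0
skipped = 0

for example in dataset["train"]:
    if processed >= max_examples:
        break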

    context = example.get("context", {})
    titles = context.get("title")
    sentences = context.get("sentences")

    if isinstance(titles, list) and isinstance(sentences, list):
        for title, sentence_list in zip(titles, sentences):
            paragraph = " ".join(sentence_list)
            docs.append(paragraph)
            metadata.append({"title": title})
        processed += 1
    else:
        skipped += 1

print(f"✅ Collected {len(docs)} paragraphs from {processed} examples. Skipped: {skipped}")


✅ Collected 9930 paragraphs from 1000 examples. Skipped: 0
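
Because each example carries its own set of context paragraphs, the same paragraph may be collected more than once across examples. The sketch below shows one way to flatten and deduplicate. It runs on two made-up toy records shaped like the `context` field above, and the function name `collect_unique_paragraphs` is an assumption for illustration, not part of the notebook.

```python
def collect_unique_paragraphs(examples):
    """Flatten HotpotQA-style contexts into (docs, metadata), skipping exact repeats."""
    docs, metadata, seen = [], [], set()
    for example in examples:
        context = example.get("context", {})
        titles = context.get("title", [])
        sentences = context.get("sentences", [])
        for title, sentence_list in zip(titles, sentences):
            paragraph = " ".join(sentence_list)
            key = (title, paragraph)
            if key in seen:
                continue  # already stored from an earlier example
            seen.add(key)
            docs.append(paragraph)
            metadata.append({"title": title})
    return docs, metadata

# Toy records (made up) that mimic the structure of example["context"]
toy_examples = [
    {"context": {"title": ["A", "B"], "sentences": [["a1.", "a2."], ["b1."]]}},
    {"context": {"title": ["B", "C"], "sentences": [["b1."], ["c1."]]}},
]
docs_unique, meta_unique = collect_unique_paragraphs(toy_examples)
print(len(docs_unique))  # 3 unique paragraphs: A, B, C
```

Whether deduplication helps depends on the data; with exact repeats removed, the index is smaller and the top-k results are less likely to contain the same passage twice.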


---

## 🔍 What Does It Mean to "Embed and Index for Semantic Search"?

### 🧠 1. **Embedding**: Turn Text Into Meaningful Vectors

In traditional search (like CTRL+F or SQL `LIKE`), we match words **exactly**.

In **semantic search**, we want to match **meaning**, even if the words are different.

So we use **embedding models** from the [`sentence-transformers`](https://www.sbert.net/) library to convert each paragraph (or question) into a **vector**: a list of numbers that captures its meaning in a multi-dimensional space.

Example (the embedding values are illustrative, not real model output):

| Text | Embedding |
|------|-----------|
| `"Tesla was born in Smiljan."` | `[0.12, -0.45, 0.88, ...]` |
| `"Where was Nikola Tesla born?"` | `[0.14, -0.42, 0.91, ...]` |

The closer these vectors are in space (measured, for example, by cosine similarity), the more similar the **meaning**.
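
To make "closer" concrete, here is a minimal sketch of cosine similarity in plain Python. It uses only the first three illustrative components from the table above plus one made-up vector for an unrelated sentence. Real sentence embeddings typically have hundreds of dimensions, so treat the numbers as a toy.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

statement = [0.12, -0.45, 0.88]   # "Tesla was born in Smiljan." (truncated, illustrative)
question = [0.14, -0.42, 0.91]    # "Where was Nikola Tesla born?" (truncated, illustrative)
unrelated = [-0.80, 0.30, -0.10]  # made-up vector standing in for an unrelated sentence

# The question is closer to the matching statement than to the unrelated vector
print(cosine_similarity(statement, question) > cosine_similarity(statement, unrelated))  # True
```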

---

### 🗂️ 2. **Indexing**: Make Searching Fast

Once we have all those vectors for your documents, we need to **store and organize** them so we can:
- Quickly compare any question vector to all document vectors
- Retrieve the **most similar matches** (i.e., relevant knowledge)

This is where **FAISS** (Facebook AI Similarity Search) comes in:
- It builds a **vector index**
- It supports **fast nearest neighbor search**
- You can ask: _"What are the 5 most similar paragraphs to this question?"_
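
Conceptually, an exact ("flat") index answers that question by scoring the query against every stored vector and keeping the best k. The sketch below does this in plain Python on made-up 2-D vectors. It is a stand-in for illustration, not FAISS itself; in FAISS, `faiss.IndexFlatIP` performs this kind of exhaustive inner-product search far more efficiently. The function name `top_k_by_inner_product` is hypothetical.

```python
import heapq

def top_k_by_inner_product(query, doc_vectors, k=5):
    """Return [(score, doc_index), ...] for the k highest inner products, best first."""
    scored = (
        (sum(q * d for q, d in zip(query, vec)), idx)
        for idx, vec in enumerate(doc_vectors)
    )
    return heapq.nlargest(k, scored)

# Made-up unit-length document vectors
doc_vectors = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
    [-1.0, 0.0],
]
query = [0.8, 0.6]

results = top_k_by_inner_product(query, doc_vectors, k=2)
print([idx for _, idx in results])  # [2, 0]
```

For unit-length vectors the inner product equals cosine similarity, which is why embeddings are often normalized before being added to an inner-product index.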

---

### 🔄 In a RAG Pipeline

| Step | What Happens |
|------|--------------|
| 🔹 Embed your documents | Each doc is transformed into a vector |
| 🔹 Index them with FAISS | FAISS stores and organizes them for fast lookup |
| 🔹 Embed a new question | Transform the question into a vector |
| 🔹 Query FAISS | Find top-k matching document vectors |
| 🔹 Feed those docs to a generator | A model like FLAN-T5 uses them to answer the question |
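
The steps in the table can be sketched end to end with toy components. Everything here is an assumption for illustration: `embed` is a word-count stand-in for a `sentence-transformers` model, the sorted scan in `retrieve` stands in for a FAISS query, and `build_prompt` (with its made-up prompt wording) stops just before the generator, so no model is called.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, docs, doc_vectors, k=2):
    """Embed the question, score it against every document, return the top-k texts."""
    q_vec = embed(question)
    ranked = sorted(range(len(docs)), key=lambda i: cosine(q_vec, doc_vectors[i]), reverse=True)
    return [docs[i] for i in ranked[:k]]

def build_prompt(question, passages):
    """Combine retrieved passages and the question into one generator prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using the context.\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"

# Steps 1-2: embed and "index" the documents (here, just a list of vectors)
toy_docs = [
    "Tesla was born in Smiljan.",
    "Radio City is India's first private FM radio station.",
    "FAISS builds a vector index for fast nearest neighbor search.",
]
toy_vectors = [embed(d) for d in toy_docs]

# Steps 3-4: embed the question and fetch the best match
question = "Where was Nikola Tesla born?"
passages = retrieve(question, toy_docs, toy_vectors, k=1)

# Step 5: the prompt that would be passed to a generator such as FLAN-T5
prompt = build_prompt(question, passages)
print(passages)  # ['Tesla was born in Smiljan.']
```

Word counts only match on shared words, so this toy retriever misses paraphrases. That limitation is what a learned embedding model is meant to address.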

---

### 🔧 Tools You’re Using

| Tool | Role |
|------|------|
| `sentence-transformers` | Create semantic embeddings |
| `faiss` | Search over embeddings |
| `transformers` | Generate answers based on the retrieved context |



#### Remove Widget Metadata from the Notebook Before Saving to GitHub

In [None]:
import json
from google.colab import drive
drive.mount('/content/drive')

# Path to the notebook file to clean (adjust to match this notebook's name and location)
notebook_path = "/content/drive/My Drive/LLM/LLM_051_huggingFace_Zero-Shot_RAG_Classification_YELP_Reviews.ipynb"


# Load the notebook JSON
with open(notebook_path, 'r', encoding='utf-8') as f:
    nb = json.load(f)

# Remove the widget metadata if it exists
if 'widgets' in nb.get('metadata', {}):
    del nb['metadata']['widgets']

# Save the cleaned notebook
with open(notebook_path, 'w', encoding='utf-8') as f:
    json.dump(nb, f, indent=2)

print("Notebook metadata cleaned. Try saving to GitHub again.")
