# The Project: Evaluating Retrieval on SQuAD with Sentence Transformers and FAISS (99.2% Top-5 Accuracy)

This project implements a simplified **Retrieval-Augmented Generation (RAG)** setup using a subset of the **SQuAD** dataset. Instead of generating answers, it focuses on the **retrieval** component, evaluating whether a retriever can return contexts that contain the correct answers.

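A sentence embedding model (`multi-qa-mpnet-base-dot-v1`) encodes each question as a query and each "Q: ... C: ..." string (a question joined with its context paragraph) as a corpus entry. The embeddings are L2-normalized, so the inner product computed by the **FAISS index** (`IndexFlatIP`) equals cosine similarity, and the index returns the top-5 most similar entries for each query.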
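
The effect of normalization can be checked by hand. The sketch below uses plain Python and made-up 3-dimensional vectors in place of real model embeddings; the helper names (`l2_normalize`, `dot`, `cosine`) are illustrative and not part of any library.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed directly from the raw vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors (assumed values, not real embeddings)
query = [1.0, 2.0, 2.0]
doc = [2.0, 0.0, 1.0]

# Inner product of the normalized vectors matches the cosine of the raw ones
ip_normalized = dot(l2_normalize(query), l2_normalize(doc))
print(math.isclose(ip_normalized, cosine(query, doc)))
```

This is why an inner-product index over normalized embeddings ranks entries by cosine similarity.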

Evaluation is performed by checking if the **gold answer** appears as a **substring** in any of the top-5 retrieved results. This serves as a proxy for answerability within the retrieved context.
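
As a rough illustration of this metric, the sketch below computes a substring-based top-k hit rate on toy data. It accepts a list of gold answers per question, which is an assumption of this sketch (the notebook code checks only the first gold answer), and the function names are hypothetical.

```python
def substring_hit(retrieved_docs, gold_answers):
    """True if any gold answer occurs (case-insensitively) in any retrieved doc."""
    docs = [d.lower() for d in retrieved_docs]
    return any(g.lower() in d for g in gold_answers for d in docs)

def top_k_accuracy(all_retrieved, all_gold):
    """Fraction of questions with at least one hit among their retrieved docs."""
    hits = [substring_hit(docs, gold) for docs, gold in zip(all_retrieved, all_gold)]
    return sum(hits) / len(hits)

# Toy retrieval results for two questions (illustrative strings only)
retrieved = [
    ["The Denver Broncos defeated the Carolina Panthers.", "Levi's Stadium hosted the game."],
    ["The game was played on February 7, 2016."],
]
gold = [["Denver Broncos"], ["Santa Clara", "Santa Clara, California"]]

print(top_k_accuracy(retrieved, gold))  # 0.5
```

A substring match is lenient: a short answer such as a number can appear in an unrelated passage and still count as a hit.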

### Summary:
- Dataset: 500 SQuAD validation samples
- Task: Question → Retrieve top-5 relevant "Q + C" pairs
- Metric: Substring-based Top-5 Accuracy
- Result: 99.20% top-5 accuracy on this subset. Because each corpus entry contains its own question, this figure likely overstates how well the retriever would do on contexts alone.


# The Dataset: SQuAD (Stanford Question Answering Dataset)

The **Stanford Question Answering Dataset (SQuAD)** is a benchmark dataset for evaluating machine reading comprehension and question-answering systems. It was developed by researchers at Stanford University and has become a widely used standard in the field of NLP.

## Key Characteristics

- **Type**: Extractive Question Answering
- **Language**: English
- **Domain**: Wikipedia articles
- **Task**: Given a question and a context paragraph, the goal is to find the answer span within the paragraph.

## Structure

Each sample in SQuAD consists of:

- `context`: A paragraph extracted from a Wikipedia article.
- `question`: A question related to the paragraph.
- `answers`: A list of one or more acceptable answers, each provided as:
  - `text`: The actual answer string.
  - `answer_start`: The character index where the answer begins in the context.
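
The relationship between `text` and `answer_start` can be verified directly: slicing the context at `answer_start` should reproduce the answer string. A minimal sketch, using a hand-written record shaped like a SQuAD sample (`answer_spans_match` is a hypothetical helper):

```python
sample = {
    "context": "The Apollo program was the third United States human "
               "spaceflight program carried out by NASA...",
    "question": "What was the name of NASA's third human spaceflight program?",
    "answers": {"text": ["The Apollo program"], "answer_start": [0]},
}

def answer_spans_match(example):
    """Check that each answer text equals the context slice at its answer_start."""
    ctx = example["context"]
    texts = example["answers"]["text"]
    starts = example["answers"]["answer_start"]
    return all(ctx[s:s + len(t)] == t for t, s in zip(texts, starts))

print(answer_spans_match(sample))  # True
```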

### Example Entry

```json
{
  "context": "The Apollo program was the third United States human spaceflight program carried out by NASA...",
  "question": "What was the name of NASA’s third human spaceflight program?",
  "answers": {
    "text": ["The Apollo program"],
    "answer_start": [0]
  }
}
```


In [None]:
!pip install -q datasets sentence-transformers faiss-cpu

In [None]:
import numpy as np
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
import faiss

# Step 1: Load subset of SQuAD (500 samples)
ds = load_dataset("squad", split="validation[:500]")
questions = [x["question"] for x in ds]
answers = [x["answers"]["text"][0] for x in ds]
contexts = [x["context"] for x in ds]

# Step 2: Build retrieval corpus = "Q + C"
# Note: each entry includes its own question, so a query can match its entry almost verbatim.
corpus = [f"Q: {q}\nC: {c}" for q, c in zip(questions, contexts)]

# Step 3: Load retriever and encode
model = SentenceTransformer("sentence-transformers/multi-qa-mpnet-base-dot-v1")
corpus_emb = model.encode(corpus, convert_to_numpy=True, normalize_embeddings=True)
query_emb = model.encode(questions, convert_to_numpy=True, normalize_embeddings=True)

# Step 4: Create FAISS index
index = faiss.IndexFlatIP(corpus_emb.shape[1])
index.add(corpus_emb)

# Step 5: Search top-k
top_k = 5
D, I = index.search(query_emb, top_k)

# Step 6: Evaluate using substring match
def substring_hit(docs, gold):
    """Return 1.0 if the gold answer occurs (case-insensitively) in any retrieved doc, else 0.0."""
    return 1.0 if any(gold.lower() in doc.lower() for doc in docs) else 0.0

hit_scores = []
correct = 0

for i in range(len(questions)):
    retrieved_docs = [corpus[j] for j in I[i]]
    gold = answers[i]  # only the first gold answer is checked
    hit = substring_hit(retrieved_docs, gold)
    hit_scores.append(hit)
    correct += hit

top5_acc = correct / len(questions)
# The per-question score is binary, so this mean equals top5_acc; it is not a token-level F1.
top5_f1 = np.mean(hit_scores)

print(f"Top-5 Accuracy: {top5_acc:.2%}")
print(f"Top-5 Mean F1: {top5_f1:.2%}")


Top-5 Accuracy: 99.20%
Top-5 Mean F1: 99.20%
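
Since every corpus entry begins with its own question, question `i` would be expected to retrieve entry `i` near the top, which may explain much of the score. One way to probe this is to measure how often a query finds its own entry. The sketch below assumes a row-per-query array of neighbor indices like the `I` returned by `index.search` above; toy values are used so it runs on its own, and `self_retrieval_rate` is a hypothetical helper.

```python
def self_retrieval_rate(neighbor_ids, k=1):
    """Fraction of queries whose own corpus entry (same position) is in the top k."""
    hits = sum(1 for i, row in enumerate(neighbor_ids) if i in list(row)[:k])
    return hits / len(neighbor_ids)

# Toy neighbor indices for four queries (assumed values, not real search output)
toy_I = [
    [0, 3, 1],
    [1, 0, 2],
    [3, 2, 0],
    [3, 1, 2],
]

print(self_retrieval_rate(toy_I, k=1))  # 0.75
print(self_retrieval_rate(toy_I, k=3))  # 1.0
```

A rate close to 1.0 on the real `I` would suggest that the reported accuracy mostly reflects question-to-question matching rather than question-to-context retrieval.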


In [None]:
import numpy as np
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
import faiss
import textwrap

# Load a subset of SQuAD (or another QA dataset, e.g. TriviaQA, using its "context" field)
ds = load_dataset("squad", split="validation[:50]")

questions = [x["question"] for x in ds]
answers = [x["answers"]["text"][0] for x in ds]
contexts = list(dict.fromkeys(x["context"] for x in ds))  # unique only, in first-seen order

# Embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")
context_emb = model.encode(contexts, convert_to_numpy=True, normalize_embeddings=True)
query_emb = model.encode(questions, convert_to_numpy=True, normalize_embeddings=True)

# Build FAISS index
index = faiss.IndexFlatIP(context_emb.shape[1])
index.add(context_emb)
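top_k = 10
# Over-fetch candidates, but cap k at the index size: FAISS pads missing results with -1,
# and a -1 would silently index the last context in the loop below.
D, I = index.search(query_emb, min(top_k * 2, len(contexts)))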

# Print examples with distinct contexts, wrapped to 10 words per line
def wrap_text_by_words(text, words_per_line=10):
    words = text.split()
    lines = [" ".join(words[i:i + words_per_line]) for i in range(0, len(words), words_per_line)]
    return "\n".join(lines)

for i in range(5):
    print("=" * 80)
    print(f"Example #{i+1}")
    print("Question:", questions[i])
    print("Answer:", answers[i])
    print("\nTop-5 Unique Contexts:\n")

    seen = set()
    rank = 1

    for j in I[i]:
        doc = contexts[j]
        normalized = doc.strip().lower()
        if normalized in seen:
            continue
        seen.add(normalized)

        wrapped = wrap_text_by_words(doc, words_per_line=10)
        is_hit = "✓" if answers[i].lower() in normalized else "✗"

        print(f"[Rank {rank}] [{is_hit}]\n{wrapped}\n")
        rank += 1

        if rank > 5:
            break


Example #1
Question: Which NFL team represented the AFC at Super Bowl 50?
Answer: Denver Broncos

Top-5 Unique Contexts:

[Rank 1] [✓]
Super Bowl 50 was an American football game to determine
the champion of the National Football League (NFL) for the
2015 season. The American Football Conference (AFC) champion Denver Broncos
defeated the National Football Conference (NFC) champion Carolina Panthers 24–10
to earn their third Super Bowl title. The game was
played on February 7, 2016, at Levi's Stadium in the
San Francisco Bay Area at Santa Clara, California. As this
was the 50th Super Bowl, the league emphasized the "golden
anniversary" with various gold-themed initiatives, as well as temporarily suspending
the tradition of naming each Super Bowl game with Roman
numerals (under which the game would have been known as
"Super Bowl L"), so that the logo could prominently feature
the Arabic numerals 50.

[Rank 2] [✗]
The Panthers finished the regular season with a 15–1 record,
and quarterbac