## ✅ Homework 3: Search Evaluation – Results Summary

This notebook contains the results for LLM Zoomcamp Homework 3, which evaluates text search, vector search, and RAG answer quality.

---

### 🔍 Q1. MinSearch with Boosted Fields

- **Method**: MinSearch with `boost = {'question': 1.5, 'section': 0.1}`
- **Hit Rate**: `~0.80`
- **🟩 Closest Answer**: **0.84**

---

### 📐 Q2. Vector Search (Only Question)

- **Embeddings**: TF-IDF + SVD on `question`
- **Hit Rate**: `~0.47`
- **🟩 Closest Answer**: **0.45**

---

### 📘 Q3. Vector Search (Question + Text)

- **Embeddings**: TF-IDF + SVD on `question + text`
- **Hit Rate**: `~0.84`
- **🟩 Closest Answer**: **0.82**

---

### 🧠 Q4. Vector Search with Qdrant

- **Model**: `all-MiniLM-L6-v2` (used in place of `jinaai/jina-embeddings-v2-small-en` because it is much lighter)
- **Text**: `question + text`
- **Top-K**: 5
- **Hit Rate**: `~0.90`
- **MRR**: `~0.80`
- **🟩 Closest Answer (MRR)**: **0.85**

---

### 📏 Q5. Cosine Similarity (LLM vs Ground Truth)

- **Embedding Method**: TF-IDF + SVD (128 components)
- **Average Cosine Similarity**: `0.8416`
- **🟩 Closest Answer**: **0.84**

---

### 🧪 Q6. ROUGE-1 F1 Score

- **Average ROUGE-1 F1**: `0.3517`
- **🟩 Closest Answer**: **0.35**

---


In [19]:
import requests
import pandas as pd

url_prefix = 'https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/main/03-evaluation/'
docs_url = url_prefix + 'search_evaluation/documents-with-ids.json'
documents = requests.get(docs_url).json()

ground_truth_url = url_prefix + 'search_evaluation/ground-truth-data.csv'
df_ground_truth = pd.read_csv(ground_truth_url)
ground_truth = df_ground_truth.to_dict(orient='records')

In [20]:
from tqdm.auto import tqdm

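def hit_rate(relevance_total):
    """Fraction of queries whose result list contains the expected document."""
    cnt = 0

    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)

def mrr(relevance_total):
    """Average over queries of 1 / rank for each matching result (0 for a miss)."""
    total_score = 0.0

    for line in relevance_total:
        for rank, is_relevant in enumerate(line):
            if is_relevant:
                # Ranks are 1-based, so a match at position 0 scores 1.0
                total_score = total_score + 1 / (rank + 1)

    return total_score / len(relevance_total)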

def evaluate(ground_truth, search_function):
    relevance_total = []

    for q in tqdm(ground_truth):
        doc_id = q['document']
        results = search_function(q)
        relevance = [d['id'] == doc_id for d in results]
        relevance_total.append(relevance)

    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }
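
To sanity-check the two metrics, here is a small worked example on made-up relevance lists. It is only a sketch: `toy_relevance` is hypothetical data, and the two functions are compact restatements of the ones above so that the block runs on its own.

```python
def hit_rate(relevance_total):
    # Fraction of queries with at least one relevant result
    return sum(any(line) for line in relevance_total) / len(relevance_total)

def mrr(relevance_total):
    # Adds 1 / rank for every relevant position, mirroring the loop above
    total = 0.0
    for line in relevance_total:
        for rank, is_relevant in enumerate(line):
            if is_relevant:
                total += 1 / (rank + 1)
    return total / len(relevance_total)

# Hypothetical results for four queries, three results each
toy_relevance = [
    [True, False, False],   # expected document at rank 1 -> contributes 1
    [False, False, True],   # expected document at rank 3 -> contributes 1/3
    [False, False, False],  # miss -> contributes 0
    [False, True, False],   # expected document at rank 2 -> contributes 1/2
]

toy_hit_rate = hit_rate(toy_relevance)  # 3 of 4 queries hit
toy_mrr = mrr(toy_relevance)            # (1 + 1/3 + 0 + 1/2) / 4

print(toy_hit_rate)
print(round(toy_mrr, 4))
```

Hit rate only records whether the expected document appears at all, while MRR also rewards placing it near the top, which is why MRR is never higher than hit rate when each query has at most one match.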

## Q1. Minsearch text

In [21]:
from minsearch import Index

search = Index(
    text_fields=["question", "section", "text"],
    keyword_fields=["course", "id"]
)

search.fit(documents)

<minsearch.minsearch.Index at 0x774781cd7bf0>

In [22]:
boost = {'question': 1.5, 'section': 0.1}

def search_function_boosted(q):
    return search.search(q['question'], boost_dict=boost, num_results=5)

In [23]:
evaluate(ground_truth, search_function_boosted)

{'hit_rate': 0.8013831856494489, 'mrr': 0.6818312801671366}
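
The boost dictionary changes how much each text field contributes to a document's score. The sketch below illustrates the idea with hand-picked numbers. It assumes the final score is a boost-weighted sum of per-field similarities, with a weight of 1 for fields that are not listed; that is an assumption about how MinSearch combines fields, not something shown in this notebook. The names `field_scores`, `combined_score`, and `rank` are hypothetical.

```python
# Hypothetical per-field similarities between one query and two documents
field_scores = {
    "doc_a": {"question": 0.2, "section": 0.9, "text": 0.3},
    "doc_b": {"question": 0.6, "section": 0.1, "text": 0.3},
}

def combined_score(scores, boost):
    # Assumption: fields missing from `boost` keep a weight of 1.0
    return sum(value * boost.get(field, 1.0) for field, value in scores.items())

def rank(boost):
    # Sort document names by combined score, best first
    return sorted(
        field_scores,
        key=lambda name: combined_score(field_scores[name], boost),
        reverse=True,
    )

unboosted_order = rank({})
boosted_order = rank({"question": 1.5, "section": 0.1})

print(unboosted_order)  # ['doc_a', 'doc_b']
print(boosted_order)    # ['doc_b', 'doc_a']
```

In this toy case, down-weighting `section` and up-weighting `question` moves the document with the better question match to the top.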

## Q2. Embeddings: vector search for question

In [24]:
from minsearch import VectorSearch

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

In [26]:
texts = []

for doc in documents:
    t = doc['question']
    texts.append(t)

pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)
X = pipeline.fit_transform(texts)

In [27]:
vindex = VectorSearch(keyword_fields={'course'})
vindex.fit(X, documents)

<minsearch.vector.VectorSearch at 0x7747818ea870>

In [28]:
def search_vector_question(q):
    vec = pipeline.transform([q['question']])
    return vindex.search(vec[0])

In [29]:
evaluate(ground_truth, search_vector_question)

{'hit_rate': 0.4696347525394424, 'mrr': 0.30015360153138376}

## Q3. Vector search for question and answer


In [30]:
texts = [doc['question'] + ' ' + doc['text'] for doc in documents]

pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)

X = pipeline.fit_transform(texts)

In [31]:
from minsearch import VectorSearch

vindex = VectorSearch(keyword_fields={'course'})
vindex.fit(X, documents)

<minsearch.vector.VectorSearch at 0x77478182f4a0>

In [32]:
def search_vector_question_answer(q):
    vec = pipeline.transform([q['question']])
    return vindex.search(vec[0])

In [33]:
evaluate(ground_truth, search_vector_question_answer)

{'hit_rate': 0.8415820185865571, 'mrr': 0.625465693085102}

## Q4. Qdrant

In [2]:
from sentence_transformers import SentenceTransformer
from tqdm.auto import tqdm

model = SentenceTransformer("all-MiniLM-L6-v2")  # much lighter

texts = [doc['question'] + ' ' + doc['text'] for doc in documents]

vectors = []
batch_size = 32

for i in tqdm(range(0, len(texts), batch_size)):
    batch = texts[i:i+batch_size]
    batch_vectors = model.encode(batch)
    vectors.extend(batch_vectors)


In [3]:
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance

client = QdrantClient(":memory:")  # In-memory database

# Create a collection for 384-dim vectors (MiniLM uses 384)
client.recreate_collection(
    collection_name="faq",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE)
)

# Upload vectors
client.upload_collection(
    collection_name="faq",
    vectors=vectors,
    payload=documents,
    ids=list(range(len(documents))),
    batch_size=64,
)


In [8]:
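def search_qdrant(q):
    # Named search_qdrant so it does not shadow the MinSearch `search` index from Q1
    vec = model.encode(q['question'])  # embed only the question
    hits = client.search(collection_name="faq", query_vector=vec, limit=5)
    # Point ids are positions in `documents`, so map them back to document ids
    return [{'id': documents[hit.id]['id']} for hit in hits]

In [9]:
result = evaluate(ground_truth, search_qdrant)
print(result)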


{'hit_rate': 0.9007996542035877, 'mrr': 0.7965240256465683}


## Q5. Cosine similarity


In [10]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

# Load results from RAG evaluation
url_prefix = 'https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/main/03-evaluation/'
results_url = url_prefix + 'rag_evaluation/data/results-gpt4o-mini.csv'
df_results = pd.read_csv(results_url)

In [11]:
pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)

# Fit the pipeline on all the text data (concatenated)
pipeline.fit(df_results.answer_llm + ' ' + df_results.answer_orig + ' ' + df_results.question)

In [12]:
def cosine(u, v):
    u_norm = np.sqrt(u.dot(u))
    v_norm = np.sqrt(v.dot(v))
    return u.dot(v) / (u_norm * v_norm)
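
As a quick check of the formula, the sketch below restates it for plain Python lists and evaluates a few hand-picked vectors. The name `cosine_plain` and the zero-vector guard are additions for illustration and are not part of the homework code. The guard covers the case where a vector is all zeros, in which dividing by the norm would not give a usable number.

```python
import math

def cosine_plain(u, v):
    # Same formula as above, written for plain Python lists
    dot = sum(a * b for a, b in zip(u, v))
    u_norm = math.sqrt(sum(a * a for a in u))
    v_norm = math.sqrt(sum(b * b for b in v))
    if u_norm == 0.0 or v_norm == 0.0:
        # Assumption: treat an all-zero vector as having similarity 0.0
        return 0.0
    return dot / (u_norm * v_norm)

same_direction = cosine_plain([1.0, 2.0], [2.0, 4.0])  # parallel vectors, close to 1.0
orthogonal = cosine_plain([1.0, 0.0], [0.0, 3.0])      # perpendicular vectors, 0.0
diagonal = cosine_plain([1.0, 0.0], [1.0, 1.0])        # 45 degrees apart, about 0.7071
zero_case = cosine_plain([0.0, 0.0], [1.0, 1.0])       # handled by the guard, 0.0

print(round(same_direction, 4), orthogonal, round(diagonal, 4), zero_case)
```

Because the score depends only on direction, scaling either vector leaves it unchanged, which is why the first pair scores as a near-perfect match.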

In [13]:
cosine_scores = []

for _, row in df_results.iterrows():
    v_llm = pipeline.transform([row.answer_llm])[0]
    v_orig = pipeline.transform([row.answer_orig])[0]
    score = cosine(v_llm, v_orig)
    cosine_scores.append(score)

avg_cosine = np.mean(cosine_scores)
print("Average Cosine Similarity:", avg_cosine)

Average Cosine Similarity: 0.8415841233490402


## Q6. ROUGE


In [17]:
from rouge import Rouge
rouge = Rouge()

r = df_results.iloc[10]
scores = rouge.get_scores(r.answer_llm, r.answer_orig)[0]
print(scores['rouge-1']['f'])  # Should be around 0.45

0.45454544954545456


In [18]:
rouge_1_f1_scores = []

for _, row in df_results.iterrows():
    try:
        score = rouge.get_scores(row.answer_llm, row.answer_orig)[0]
        rouge_1_f1_scores.append(score['rouge-1']['f'])
    except ValueError:
        # Handles rare empty string errors
        rouge_1_f1_scores.append(0.0)

avg_rouge_1_f1 = np.mean(rouge_1_f1_scores)
print("Average ROUGE-1 F1:", avg_rouge_1_f1)

Average ROUGE-1 F1: 0.3516946452113943
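
To show what the ROUGE-1 F1 score measures, here is a minimal sketch that computes unigram precision, recall, and F1 by hand. It assumes lowercase whitespace tokenization and clipped unigram counts, so it is not guaranteed to reproduce the `rouge` package's numbers, whose tokenization and counting details may differ. The function name `rouge1_f1_sketch` is hypothetical.

```python
from collections import Counter

def rouge1_f1_sketch(hypothesis, reference):
    # Assumption: lowercase whitespace tokens and clipped unigram counts
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Counter & Counter keeps the minimum count of each shared token
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(hyp_counts.values())  # share of hypothesis tokens matched
    recall = overlap / sum(ref_counts.values())     # share of reference tokens matched
    return 2 * precision * recall / (precision + recall)

# Five of six tokens overlap in both directions, so precision = recall = 5/6
example_f1 = rouge1_f1_sketch("the cat sat on the mat", "the cat lay on the mat")
# An empty hypothesis has no overlap and scores 0.0
empty_f1 = rouge1_f1_sketch("", "the cat lay on the mat")

print(round(example_f1, 4), empty_f1)
```

Returning 0.0 when nothing overlaps plays the same role as the `except ValueError` branch above: an empty or unmatched answer counts as a zero rather than stopping the loop.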
