<a href="https://colab.research.google.com/github/leman-cap13/my_projects/blob/main/RAG_COVID_QA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#model

In [None]:
from google.colab import files
files.upload()

In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
!kaggle datasets download -d xhlulu/covidqa


In [None]:
!unzip -o covidqa.zip -d covidqa_original
!ls covidqa_original


In [None]:
import os

print(os.listdir('/content/covidqa_original'))


In [None]:
import pandas as pd
df = pd.read_csv('/content/covidqa_original/community.csv')
df

In [None]:
df.columns

In [None]:
df.drop(columns=['question_id', 'answer_id', 'url', 'source', 'answer_type', 'wrong_answer_type'], inplace=True)

In [None]:
df

In [None]:
df.columns  #Index(['title', 'question', 'answer', 'wrong_answer'], dtype='object')

In [None]:
#Prepare the Retrieval Corpus
#Create a list of documents/passages to index for retrieval

# Combine 'title' and 'answer' as one passage
df['passage'] = df['title'].fillna('') + '. ' + df['answer'].fillna('')

# Check some passages
print(df['passage'].head())


In [None]:
#Embed Passages with SentenceTransformer and Build FAISS Index

In [None]:
!pip install -q sentence-transformers faiss-cpu


In [None]:
from sentence_transformers import SentenceTransformer
import faiss  # the index is built further down with faiss.IndexFlatIP
import numpy as np

embedder = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
# Embed the passages (list of strings)
passages = df['passage'].tolist()
corpus_embeddings = embedder.encode(passages,
                                    convert_to_numpy=True,
                                    show_progress_bar=True)

embedder.encode()
This is a method of the SentenceTransformer model that takes a list of texts (here, the passages) and converts each text into a fixed-length numeric vector, called an embedding.

passages
This is the list of combined 'title' + 'answer' strings from the dataframe. Each passage is a text sample to be represented as a vector.

convert_to_numpy=True
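encode() can return a NumPy array, a single PyTorch tensor (convert_to_tensor=True), or a list of tensors. convert_to_numpy=True is the documented default, so passing it is optional; it is spelled out here to make clear that the output is a NumPy array, which is convenient and compatible with libraries like FAISS.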

show_progress_bar=True
This displays a progress bar during embedding, which helps you judge how long the run will take when there are many passages.

Why convert_to_numpy=True?
Because:

The FAISS Python bindings (FAISS is Facebook's fast similarity search library) take NumPy arrays of dtype float32 as input.

PyTorch tensors or lists aren’t accepted directly by FAISS without conversion.

NumPy is lightweight and fast for vector ops like cosine similarity, dot product, etc.
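
As a minimal sketch of the dtype point above, the helper below coerces any list or array of vectors into the 2-D, C-contiguous float32 layout that the FAISS Python bindings expect. The name `to_faiss_array` is hypothetical (it is not part of FAISS or sentence-transformers), and the example uses NumPy only.

```python
import numpy as np

def to_faiss_array(vectors):
    """Return a 2-D, C-contiguous float32 array (hypothetical helper)."""
    arr = np.ascontiguousarray(vectors, dtype=np.float32)
    if arr.ndim == 1:
        # A single vector becomes a batch of one row.
        arr = arr.reshape(1, -1)
    if arr.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {arr.shape}")
    return arr

# Plain Python lists of float64 values go in, float32 rows come out.
example = to_faiss_array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
print(example.dtype, example.shape, example.flags["C_CONTIGUOUS"])
```

The array returned by embedder.encode() is normally float32 already, so a helper like this mainly matters when vectors come from another source.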

What is FAISS?
FAISS stands for:

Facebook AI Similarity Search

It’s an open-source library for fast similarity search over dense vectors (i.e., embeddings).

🔍 Why do we need it?
In Retrieval-Augmented Generation (RAG), we:

Embed all documents (your passages) into vectors using a model like Sentence-BERT

Embed a question into a vector (the user query)

Search for the most similar document vectors to the query vector

🔗 This is where FAISS comes in — it's the search engine for vector similarity.
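
The three steps above can be sketched without any model or index at all. The vectors below are made-up 3-dimensional stand-ins for real embeddings, and `cosine_top_k` is a hypothetical brute-force helper that does in NumPy what a flat FAISS index does at scale.

```python
import numpy as np

# Step 1: "document embeddings" (toy values, not real model output).
doc_vectors = np.array([
    [1.0, 0.0, 0.0],   # doc 0
    [0.0, 1.0, 0.0],   # doc 1
    [0.7, 0.7, 0.0],   # doc 2
], dtype=np.float32)

# Step 2: "query embedding" (also a toy value).
query_vector = np.array([0.9, 0.1, 0.0], dtype=np.float32)

def cosine_top_k(query, docs, k):
    """Step 3: score every document by cosine similarity, return the k best."""
    docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = docs_unit @ query_unit
    order = np.argsort(-scores)[:k]   # indices sorted by descending score
    return order, scores[order]

top_ids, top_scores = cosine_top_k(query_vector, doc_vectors, k=2)
print(top_ids, top_scores)
```

Here doc 0 ranks first and doc 2 second, because the query points mostly along the first axis.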

In [None]:
corpus_embeddings.shape

In [None]:
corpus_embeddings

In [None]:
corpus_embeddings.shape[1]

In [None]:
import faiss

dimension = corpus_embeddings.shape[1]  # embedding size, e.g., 384
index = faiss.IndexFlatIP(dimension)  # Inner Product similarity (cosine if normalized)

# L2-normalize the embeddings in place so that inner product equals cosine similarity
faiss.normalize_L2(corpus_embeddings)

# Add embeddings to the index
index.add(corpus_embeddings)

print(f"FAISS index contains {index.ntotal} passages")


corpus_embeddings is your NumPy array of shape (N, D) where:

N = number of passages (e.g., 1000)

D = embedding dimension (e.g., 384 for MiniLM)



index = faiss.IndexFlatIP(dimension)
This creates a flat (brute-force) FAISS index that uses:

IP = Inner Product similarity (also known as dot product)

IP equals cosine similarity when your vectors are normalized to unit length (see next line).

"Flat" means no approximation — it compares the query to every document embedding.

faiss.normalize_L2(corpus_embeddings)
This transforms each embedding into a unit vector (vector length = 1).

Why?

When vectors are normalized, inner product (dot product) becomes cosine similarity.

This is a standard trick to use cosine similarity with FAISS.
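
A small NumPy sketch shows why the normalization step matters. The vectors are made up for illustration, and `l2_normalize` is a hypothetical helper that returns a copy (faiss.normalize_L2 modifies its argument in place instead).

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length and return a new array."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

docs = np.array([[10.0, 0.0],    # long vector, 45 degrees away from the query
                 [1.0, 1.0]],    # short vector, same direction as the query
                dtype=np.float32)
query = np.array([[1.0, 1.0]], dtype=np.float32)

raw_ip = docs @ query.T                               # inner product on raw vectors
cosine = l2_normalize(docs) @ l2_normalize(query).T   # inner product on unit vectors

print(raw_ip.ravel())   # the long vector wins on raw inner product
print(cosine.ravel())   # the same-direction vector wins on cosine similarity
```

Without normalization the ranking is dominated by vector length; with it, only direction counts.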

In [None]:
#Perform Query Retrieval

This means:

You enter a question.

We embed that question using the same SentenceTransformer.

We normalize the query vector (just like we did with the corpus).

We pass it to FAISS to get top-k most similar passages.

In [None]:
# Step 1: Define your query
query = "How does COVID-19 spread?"

# Step 2: Embed the query (same model as corpus)
query_embedding = embedder.encode([query], convert_to_numpy=True)

# Step 3: Normalize for cosine similarity
faiss.normalize_L2(query_embedding)

# Step 4: Search FAISS index for top-k (e.g., 5) most similar passages
top_k = 5
D, I = index.search(query_embedding, top_k)

# D = similarity scores, I = indices of top matches
print("Top retrieved passages:")
for i, idx in enumerate(I[0]):
    print(f"\nPassage {i+1} (score={D[0][i]:.4f}):")
    print(df['passage'].iloc[idx])


In [None]:
#Generate Answer Using a Pretrained Model (e.g., BART or T5)

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM


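# Note: "facebook/bart-large" is the base pretrained checkpoint, not one fine-tuned
# for question answering, so its output may largely echo the input text.
model_name = "facebook/bart-large"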
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)


In [None]:
# Reuse previous query
query = "How does COVID-19 spread?"

# Get top 1 retrieved passage
retrieved_passage = df['passage'].iloc[I[0][0]]

# Concatenate passage and question as input
input_text = f"{retrieved_passage} \n\n Question: {query}"


retrieved_passage = df['passage'].iloc[I[0][0]]
I is the result from your FAISS search:

D, I = index.search(query_embedding, top_k)
I[0] is the list of top-k document indices for your first (and only) query.

I[0][0] is the index of the most similar passage (rank 1).
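
To make the indexing concrete, the sketch below builds arrays with the same (n_queries, top_k) layout that index.search returns. The scores are made up and the top-k selection is done with NumPy, so this imitates the shape of the result rather than calling FAISS.

```python
import numpy as np

# Made-up similarity scores for 1 query against 6 passages.
scores = np.array([[0.12, 0.83, 0.40, 0.91, 0.05, 0.67]], dtype=np.float32)
top_k = 3

# Sort each row by descending score and keep the first top_k columns.
I = np.argsort(-scores, axis=1)[:, :top_k]
D = np.take_along_axis(scores, I, axis=1)

print(I.shape)    # (1, 3): one row per query, one column per rank
print(I[0])       # passage indices for the only query, best first
print(I[0][0])    # index of the rank-1 passage
print(D[0])       # the matching scores, in the same order
```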

In [None]:
inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)
output_ids = model.generate(**inputs, max_new_tokens=100)

# Decode and print
answer = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print("Generated Answer:", answer)
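
The cells above pass only the top-ranked passage to the generator. As a hedged variation (not what this notebook does), several retrieved passages can be joined into one context under a size budget. `build_prompt` and its `max_chars` parameter are hypothetical names, and the character budget is only a rough stand-in for the tokenizer's 512-token limit.

```python
def build_prompt(passages, question, max_chars=1500):
    """Join passages (best first) until the character budget is hit, then append the question.

    The first passage is always kept, even if it alone exceeds the budget.
    """
    kept, used = [], 0
    for passage in passages:
        if kept and used + len(passage) > max_chars:
            break
        kept.append(passage)
        used += len(passage)
    context = "\n\n".join(kept)
    # Same "passage, blank line, Question:" layout as input_text above.
    return f"{context} \n\n Question: {question}"

toy_passages = [
    "Passage A about droplets.",
    "Passage B about surfaces.",
    "Passage C about masks.",
]
prompt = build_prompt(toy_passages, "How does COVID-19 spread?", max_chars=55)
print(prompt)
```

With the notebook's variables, the call would look like build_prompt([df['passage'].iloc[i] for i in I[0]], query).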


In [None]:
!pip install rouge_score


In [None]:
!pip install evaluate
import evaluate

rouge = evaluate.load('rouge')
bleu = evaluate.load('bleu')

# Example: lists of predictions and references
predictions = ["COVID-19 spreads through droplets."]
references = [["COVID-19 is spread by droplets."]]

rouge_results = rouge.compute(predictions=predictions, references=references)
bleu_results = bleu.compute(predictions=predictions, references=references)

print("ROUGE:", rouge_results)
print("BLEU:", bleu_results)
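
To show roughly what the ROUGE-1 number measures, here is a simplified unigram-overlap F1 in plain Python. `rouge1_f1` is a hypothetical helper: it splits on whitespace and lowercases, whereas the `evaluate` implementation uses its own tokenization, so the two scores will not necessarily match.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 on lowercased whitespace tokens (simplified, no stemming)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("COVID-19 spreads through droplets.",
                  "COVID-19 is spread by droplets.")
print(round(score, 3))
```

Two of the four predicted tokens appear among the five reference tokens, giving precision 0.5 and recall 0.4.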
