# RAG Question Answering System
## DS8008 Final Project - Manav Patel (500967756)

This notebook walks through an end-to-end Retrieval-Augmented Generation (RAG) pipeline for question answering. The stages, in the order they run, are:
- Document chunking and preprocessing
- SentenceTransformer embeddings
- FAISS vector indexing for semantic search
- RoBERTa-based question answering

In [None]:
# Setup and data loading
import os
import sys

# Set working directory to project root
project_root = os.path.abspath('..')  # Assuming notebook is in notebooks/ folder
os.chdir(project_root)

print("Working directory:", os.getcwd())
print("Data files:", os.listdir('data/'))
print("Project structure:")
for root, dirs, files in os.walk('.'):
    level = root.replace('.', '').count(os.sep)
    indent = ' ' * 2 * level
    print(f"{indent}{os.path.basename(root)}/")
    subindent = ' ' * 2 * (level + 1)
    for file in files[:3]:  # Show first 3 files only
        print(f"{subindent}{file}")
    if len(files) > 3:
        print(f"{subindent}... and {len(files)-3} more files")

Mounted at /content/drive
Working directory: /content/drive/MyDrive/Colab Notebooks/DS8008/FinalProject
Files: ['.ipynb_checkpoints', 'Final Project', 'FinalProjectDocuments.txt', 'numbered_questions.txt']


## 1. Data Loading and Preprocessing

In [None]:
# Import required libraries
import pandas as pd
import nltk
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from transformers import pipeline
from tqdm import tqdm
from nltk.tokenize import sent_tokenize
import os
import re
from sklearn.metrics import precision_score, recall_score, f1_score
from collections import Counter

# Download NLTK punkt tokenizer
nltk.download('punkt_tab')

# Load document corpus
with open("data/document_corpus.txt", "r", encoding="utf-8") as f:
    documents_text = f.read()

# Preprocess text
documents_text = documents_text.lower()
documents_text = re.sub(r'\s+', ' ', documents_text)

# Display document statistics
print('Example text from document:')
print(documents_text[:2500])
document_length = len(documents_text)
print(f"\nDocument statistics:")
print(f"- Total characters: {document_length:,}")
print(f"- First 10 chars: '{documents_text[:10]}'")
print(f"- Last 10 chars: '{documents_text[-10:]}'")

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!




Example text from the document:  beginners bbq class taking place in missoula! do you want to get better at making delicious bbq? you will have the opportunity, put this on your calendar now. thursday, september 22nd join world class bbq champion, tony balay from lonestar smoke rangers. he will be teaching a beginner level class for everyone who wants to get better with their culinary skills. he will teach you everything you need to know to compete in a kcbs bbq competition, including techniques, recipes, timelines, meat selection and trimming, plus smoker and fire information. the cost to be in the class is $35 per person, and for spectators it is free. included in the cost will be either a t-shirt or apron and you will be tasting samples of each meat that is prepared. discussion in 'mac os x lion (10.7)' started by axboi87, jan 20, 2012. i've got a 500gb internal drive and a 240gb ssd. when trying to restore using disk utility i'm given the error "not enough space on disk ____ to r

In [None]:
# Load evaluation questions
with open("data/evaluation_questions.txt", "r", encoding="utf-8") as f:
    questions_list = f.readlines()

# Clean questions: strip whitespace and leading numbering, skip blank lines
questions_list = [re.sub(r'^\d+\.\s*', '', q.strip()) for q in questions_list if q.strip()]
questions_df = pd.DataFrame(questions_list, columns=["question"])

print(f"Loaded {len(questions_list)} evaluation questions")
print("\nFirst 5 questions:")
for i, q in enumerate(questions_list[:5]):
    print(f"{i+1}. {q}")
    
print(f"\nQuestions DataFrame shape: {questions_df.shape}")

Loaded 20 questions.
                                             question
0   How many scrolls were in the grand library of ...
1            Where can the bluefire crystal be found?
2   In what year did Dr. Helena Carter win the Nob...
3             What is the national bird of Veridonia?
4   What was the name of the first space station o...
5                    What is Lake Virelia famous for?
6      When was 'Echoes of Tomorrow' first published?
7   How long did it take the Zephyr-9 to fly aroun...
8   Who developed the first AI capable of composin...
9                   How many moons does Xyphora have?
10  What is the seating capacity of the Grand Arca...
11         Where is the silver-tailed lynx native to?
12                 When was the ChronoGem discovered?
13                         How tall is Solaris Tower?
14                       What was Novaterra built on?
15                       When was Vortexium invented?
16          How long does the Luminara Festival last?
17     

##### Chunking (using nltk tokenize), embedding (using all-MiniLM-L6-v2), FAISS Index

In [9]:
# Load model
print("Loading SentenceTransformer model: all-MiniLM-L6-v2...")
model = SentenceTransformer('all-MiniLM-L6-v2')
print("Model loaded.\n", model)

Loading SentenceTransformer model: all-MiniLM-L6-v2...




Model loaded.
 SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)


In [10]:
# chunking documents
def chunk_lines_with_context(text):
    sentences = sent_tokenize(text)
    if len(sentences) < 2:
        print("Not enough sentences to form chunks with overlap. Returning original text as one chunk.")
        return [text]
    chunks = []
    for i in range(len(sentences) - 1):
        chunk = " ".join([sentences[i], sentences[i+1]])
        chunks.append(chunk)
    print(f"\nSentence-based chunking complete. Total chunks created: {len(chunks)}\n")
    return chunks

print("Starting chunking process...")
documents_chunks = chunk_lines_with_context(documents_text)
print('Chunking done')
for i in range(min(5, len(documents_chunks))):
    print(f"Chunk {i + 1} Preview:\n{documents_chunks[i]}\n")

Starting chunking process...

Sentence-based chunking complete. Total chunks created: 599342

Chunking done
Chunk 1 Preview:
beginners bbq class taking place in missoula! do you want to get better at making delicious bbq?

Chunk 2 Preview:
do you want to get better at making delicious bbq? you will have the opportunity, put this on your calendar now.

Chunk 3 Preview:
you will have the opportunity, put this on your calendar now. thursday, september 22nd join world class bbq champion, tony balay from lonestar smoke rangers.

Chunk 4 Preview:
thursday, september 22nd join world class bbq champion, tony balay from lonestar smoke rangers. he will be teaching a beginner level class for everyone who wants to get better with their culinary skills.

Chunk 5 Preview:
he will be teaching a beginner level class for everyone who wants to get better with their culinary skills. he will teach you everything you need to know to compete in a kcbs bbq competition, including techniques, recipes, timeline
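
The chunker above pairs each sentence with its successor, so neighbouring chunks share one sentence and n sentences give n - 1 chunks (which is why the 599342 chunks reported above imply one more sentence than that). A minimal sketch of the same sliding window on a hand-split toy list follows. The `pair_sentences` helper and its `window` parameter are hypothetical names, and the sentences are pre-split rather than passed through `sent_tokenize`.

```python
def pair_sentences(sentences, window=2):
    """Join each run of `window` consecutive sentences into one chunk."""
    if len(sentences) < window:
        # Too few sentences to slide a window: fall back to a single chunk.
        return [" ".join(sentences)]
    return [
        " ".join(sentences[i:i + window])
        for i in range(len(sentences) - window + 1)
    ]


toy_sentences = ["a b.", "c d.", "e f.", "g h."]
toy_chunks = pair_sentences(toy_sentences)
print(toy_chunks)
```

With a window of two, every sentence except the first and the last appears in two chunks, once with its predecessor and once with its successor, so an answer sentence keeps some surrounding context on either side.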

In [11]:
# Compute Embeddings in Batches of 512
def embed_documents_in_batches(docs, batch_size=512):
    print(f"Embedding {len(docs)} chunks in batches of {batch_size}...")
    embeddings = []
    for i in tqdm(range(0, len(docs), batch_size), desc="Embedding chunks"):
        batch = docs[i:i + batch_size]
        batch_embeddings = model.encode(batch, convert_to_numpy=True)
        embeddings.append(batch_embeddings)
    print("Embedding complete.\n")
    return np.vstack(embeddings)
embeddings = embed_documents_in_batches(documents_chunks)
print("Embeddings: \n", embeddings)

Embedding 599342 chunks in batches of 512...


Embedding chunks: 100%|██████████| 1171/1171 [03:01<00:00,  6.46it/s]


Embedding complete.

Embeddings: 
 [[ 0.03934662 -0.00078637 -0.08404719 ...  0.01195637 -0.10189318
  -0.02531926]
 [-0.01363128 -0.00736005 -0.03140329 ... -0.02576192 -0.09384691
  -0.05192094]
 [-0.00906869 -0.00554028 -0.07197446 ... -0.06166822 -0.1257587
  -0.0520644 ]
 ...
 [-0.02728089  0.1552      0.02050948 ...  0.09405284 -0.08236547
  -0.04060027]
 [-0.0573018   0.18655883  0.03014164 ...  0.06408855 -0.00851724
  -0.02787329]
 [ 0.06139382  0.13654158 -0.01217416 ...  0.01146802  0.06740606
   0.01805817]]
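
The progress-bar total of 1171 follows from the chunk count and the batch size: the loop steps through the list in strides of `batch_size`, so it runs ceil(n / batch_size) times. The check below uses the numbers printed above. The memory estimate assumes the embeddings are stored as 4-byte floats with the 384 dimensions shown in the model printout; that dtype is an assumption, not something the output confirms.

```python
import math

n_chunks = 599_342      # chunk count reported by the chunking cell
batch_size = 512        # default of embed_documents_in_batches
embedding_dim = 384     # word_embedding_dimension from the model printout

n_batches = math.ceil(n_chunks / batch_size)
last_batch = n_chunks - (n_batches - 1) * batch_size

# Assumed 4 bytes per value (float32); adjust if the dtype differs.
emb_bytes = n_chunks * embedding_dim * 4

print(f"batches: {n_batches}, last batch size: {last_batch}")
print(f"embedding matrix: {emb_bytes / 1e9:.2f} GB")
```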


In [12]:
# building FAISS Index
def build_faiss_index(embeddings):
    print(f"Building FAISS index with {len(embeddings)} vectors...")
    d = embeddings.shape[1]
    index = faiss.IndexFlatL2(d)
    index.add(embeddings)
    print("FAISS index built.\n")
    return index
faiss_index = build_faiss_index(embeddings)

Building FAISS index with 599342 vectors...
FAISS index built.
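
`IndexFlatL2` ranks by squared Euclidean distance. The model printout earlier ends in a `Normalize()` layer, so the embeddings should be unit length, and for unit vectors the squared L2 distance equals 2 - 2·cos(q, d). Under that assumption, ranking by L2 distance gives the same order as ranking by cosine similarity. A small check of the identity in plain Python, with vectors made up for illustration:

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Illustrative vectors, not real embeddings.
q = normalize([1.0, 2.0, 2.0])
d = normalize([2.0, 1.0, 2.0])

lhs = squared_l2(q, d)          # what a flat L2 index would rank by
rhs = 2.0 - 2.0 * dot(q, d)     # 2 - 2 * cosine similarity
print(lhs, rhs)
```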



In [13]:
# Inspect the index, chunk count, and embeddings before retrieval
print('FAISS index: \n', faiss_index)
print(f"Document chunks: \n {len(documents_chunks)}")
print('\nEmbeddings: \n', embeddings)

FAISS index: 
 <faiss.swigfaiss_avx512.IndexFlatL2; proxy of <Swig Object of type 'faiss::IndexFlatL2 *' at 0x7944251f9d70> >
Document chunks: 
 599342

Embeddings: 
 [[ 0.03934662 -0.00078637 -0.08404719 ...  0.01195637 -0.10189318
  -0.02531926]
 [-0.01363128 -0.00736005 -0.03140329 ... -0.02576192 -0.09384691
  -0.05192094]
 [-0.00906869 -0.00554028 -0.07197446 ... -0.06166822 -0.1257587
  -0.0520644 ]
 ...
 [-0.02728089  0.1552      0.02050948 ...  0.09405284 -0.08236547
  -0.04060027]
 [-0.0573018   0.18655883  0.03014164 ...  0.06408855 -0.00851724
  -0.02787329]
 [ 0.06139382  0.13654158 -0.01217416 ...  0.01146802  0.06740606
   0.01805817]]


In [14]:
# retrieves top k chunks
def retrieve_top_k_chunks(question, k=3):
    print(f"Encoding question: \"{question}\"")
    question_embedding = model.encode([question], convert_to_numpy=True)
    print(f"Querying FAISS index for top {k} relevant chunks...")
    distances, indices = faiss_index.search(question_embedding, k)
    top_chunks = [(documents_chunks[i], distances[0][rank]) for rank, i in enumerate(indices[0])]
    return top_chunks

sample_question = questions_df.iloc[0]['question']
print('sample question: \n', sample_question)
top_chunks = retrieve_top_k_chunks(sample_question, k=50)

sample question: 
 How many scrolls were in the grand library of Zandoria?
Encoding question: "How many scrolls were in the grand library of Zandoria?"
Querying FAISS index for top 50 relevant chunks...
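
A flat index performs an exact, exhaustive search: the query is compared against every stored vector and the k smallest distances are returned. The sketch below mimics that behaviour in plain Python on toy 2-D vectors. `top_k_l2` is a hypothetical helper for illustration, not part of FAISS, and it ignores the batching and vectorisation a real index relies on.

```python
import heapq


def top_k_l2(query, vectors, k):
    """Return (index, squared_distance) pairs for the k nearest vectors."""
    scored = (
        (sum((q - x) ** 2 for q, x in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    )
    # nsmallest sorts by distance first, then by index on ties.
    return [(idx, dist) for dist, idx in heapq.nsmallest(k, scored)]


# Toy vectors standing in for chunk embeddings.
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
hits = top_k_l2([0.9, 0.1], toy_vectors, k=2)
print(hits)
```

The returned positions play the role of `indices[0]` in `retrieve_top_k_chunks`: they index back into the chunk list to recover the text.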


##### Answering all questions with the roberta-base-squad2 QA model

In [15]:
# QA over top-k chunks and choose best answer

# Load QA model
print("Loading QA model: 'deepset/roberta-base-squad2'...")
qa_pipeline = pipeline("question-answering", model="deepset/roberta-base-squad2", tokenizer="deepset/roberta-base-squad2")
print("QA model loaded.\n")

Loading QA model: 'deepset/roberta-base-squad2'...


Device set to use cuda:0


QA model loaded.



In [16]:
def normalize_distances(distances):
    min_val = min(distances)
    max_val = max(distances)
    if max_val == min_val:
        return [0 for _ in distances]
    return [(d - min_val) / (max_val - min_val) for d in distances]

def get_best_answer_from_chunks_aggregated(question, top_chunks, weight=0.5):
    distances = [distance for (_, distance) in top_chunks]
    norm_distances = normalize_distances(distances)
    answers = []
    print(f"Evaluating QA over top {len(top_chunks)} chunks\n")
    for i, (chunk, _) in enumerate(top_chunks):
        result = qa_pipeline(question=question, context=chunk)
        result["chunk_index"] = i
        result["retrieval_distance"] = norm_distances[i]
        result["context"] = chunk
        result["combined_score"] = result["score"] - weight * result["retrieval_distance"]
        answers.append(result)
    answers_sorted = sorted(answers, key=lambda x: x['combined_score'], reverse=True)
    best_answer = answers_sorted[0]
    return best_answer

print(sample_question)
best_answer_result = get_best_answer_from_chunks_aggregated(sample_question, top_chunks, weight=1.5)

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


How many scrolls were in the grand library of Zandoria?
Evaluating QA over top 50 chunks
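
The selection rule above min-max scales the retrieval distances of the candidate chunks to [0, 1] and subtracts `weight` times that value from the QA confidence. A confident answer from the most distant chunk can therefore lose to a slightly less confident answer from the closest one. A toy walk-through with invented scores and distances (none of these numbers come from the run above):

```python
def normalize_distances(distances):
    """Min-max scale distances to [0, 1]; all zeros if they are identical."""
    lo, hi = min(distances), max(distances)
    if hi == lo:
        return [0.0 for _ in distances]
    return [(d - lo) / (hi - lo) for d in distances]


# Hypothetical (QA score, retrieval distance) pairs for three chunks.
candidates = [(0.90, 1.2), (0.80, 0.4), (0.30, 0.8)]
weight = 0.5

norm = normalize_distances([dist for _, dist in candidates])
combined = [score - weight * nd for (score, _), nd in zip(candidates, norm)]
best = max(range(len(candidates)), key=lambda i: combined[i])

print("normalized distances:", norm)
print("combined scores:", combined)
print("best chunk index:", best)
```

Here the first chunk has the highest QA score but the largest distance, so its combined score falls below that of the second chunk. Raising `weight` makes the ranking lean further toward retrieval distance.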



In [24]:
# Proxy token-F1 without ground truth: overlap between the predicted answer's
# tokens and the best-matching sentence of its source context
def compute_token_f1(predicted_answer, context):

    if not predicted_answer.strip():
        return 0.0, "N/A"

    def tokenize(text):
        return re.findall(r'\w+', text.lower())

    pred_tokens = set(tokenize(predicted_answer))
    if not pred_tokens:
        return 0.0, "N/A"

    best_f1 = 0.0
    best_sentence = ""

    for sentence in sent_tokenize(context):
        sent_tokens = set(tokenize(sentence))
        if not sent_tokens:
            continue

        common = pred_tokens & sent_tokens
        if not common:
            continue

        precision = len(common) / len(pred_tokens)
        recall = len(common) / len(sent_tokens)
        if precision + recall == 0:
            continue
        f1 = 2 * precision * recall / (precision + recall)

        if f1 > best_f1:
            best_f1 = f1
            best_sentence = sentence

    return best_f1, best_sentence if best_sentence else "N/A"
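
As a worked example of the overlap score above: the unique tokens of the prediction are compared with the unique tokens of one sentence, so precision is 1.0 whenever the whole answer appears in the sentence, and recall shrinks as the sentence gets longer. The sketch below applies the same set-based formula to the Zandoria sentence with the answer span `50,000` as it appears in that sentence. It skips the `sent_tokenize` step by passing a single sentence, and `overlap_f1` and `token_set` are hypothetical helper names.

```python
import re


def token_set(text):
    """Lowercase the text and return its unique word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def overlap_f1(answer, sentence):
    pred, sent = token_set(answer), token_set(sentence)
    common = pred & sent
    if not common:
        return 0.0
    precision = len(common) / len(pred)
    recall = len(common) / len(sent)
    return 2 * precision * recall / (precision + recall)


sentence = ("the ancient city of zandoria was founded in 1200 bc and was known "
            "for its grand library, which housed over 50,000 scrolls.")
score = overlap_f1("50,000", sentence)
print(round(score, 4))
```

The answer contributes two tokens (`50` and `000`) against 22 unique tokens in the sentence, giving precision 1.0, recall 2/22 and an F1 of about 0.1667, the value reported for this question in the results table. Written without the comma, `50000` shares no token with the sentence and scores 0.0. Because the score depends mostly on how long the matched sentence is, it is better read as a grounding check than as a measure of answer correctness.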

In [18]:
# answering all the questions and computing the token-f1 score, with no ground truth

# Store results
all_results = []

# Loop over all questions
for idx, row in tqdm(questions_df.iterrows(), total=len(questions_df), desc="Evaluating all questions"):
    question_id = row['id'] if 'id' in row else idx
    question = row['question']
    top_chunks = retrieve_top_k_chunks(question, k=50)
    best_answer_result = get_best_answer_from_chunks_aggregated(question, top_chunks, weight = 0.9)
    context_text = top_chunks[best_answer_result['chunk_index']][0]
    # Named token_f1 so it does not shadow sklearn's f1_score imported earlier
    token_f1, matched_sentence = compute_token_f1(best_answer_result['answer'], context_text)
    all_results.append({
        "Question ID": question_id,
        "Question": question,
        "Predicted Answer": best_answer_result['answer'],
        # "Confidence Score": best_answer_result['score'],
        # "Retrieval Similarity": best_answer_result['retrieval_distance'],
        # "Combined Score": best_answer_result['combined_score'],
        "Best-Matched Sentence": matched_sentence,
        "Token-Level F1": round(token_f1, 4)
    })

results_df = pd.DataFrame(all_results)
pd.set_option('display.max_colwidth', None)
from IPython.display import display
print("\n---Final Results Table:")
display(results_df)

Evaluating all questions:   0%|          | 0/20 [00:00<?, ?it/s]

Encoding question: "How many scrolls were in the grand library of Zandoria?"
Querying FAISS index for top 50 relevant chunks...
Evaluating QA over top 50 chunks



Evaluating all questions:   5%|▌         | 1/20 [00:00<00:11,  1.70it/s]

Encoding question: "Where can the bluefire crystal be found?"
Querying FAISS index for top 50 relevant chunks...
Evaluating QA over top 50 chunks



Evaluating all questions:  10%|█         | 2/20 [00:01<00:10,  1.72it/s]

Encoding question: "In what year did Dr. Helena Carter win the Nobel Prize?"
Querying FAISS index for top 50 relevant chunks...
Evaluating QA over top 50 chunks



Evaluating all questions:  15%|█▌        | 3/20 [00:01<00:09,  1.71it/s]

Encoding question: "What is the national bird of Veridonia?"
Querying FAISS index for top 50 relevant chunks...
Evaluating QA over top 50 chunks



Evaluating all questions:  20%|██        | 4/20 [00:02<00:09,  1.72it/s]

Encoding question: "What was the name of the first space station of the Andromeda Federation?"
Querying FAISS index for top 50 relevant chunks...
Evaluating QA over top 50 chunks



Evaluating all questions:  25%|██▌       | 5/20 [00:02<00:08,  1.69it/s]

Encoding question: "What is Lake Virelia famous for?"
Querying FAISS index for top 50 relevant chunks...
Evaluating QA over top 50 chunks



Evaluating all questions:  30%|███       | 6/20 [00:03<00:08,  1.69it/s]

Encoding question: "When was 'Echoes of Tomorrow' first published?"
Querying FAISS index for top 50 relevant chunks...
Evaluating QA over top 50 chunks



Evaluating all questions:  35%|███▌      | 7/20 [00:04<00:07,  1.70it/s]

Encoding question: "How long did it take the Zephyr-9 to fly around the world?"
Querying FAISS index for top 50 relevant chunks...
Evaluating QA over top 50 chunks



Evaluating all questions:  40%|████      | 8/20 [00:04<00:07,  1.71it/s]

Encoding question: "Who developed the first AI capable of composing symphonies?"
Querying FAISS index for top 50 relevant chunks...
Evaluating QA over top 50 chunks



Evaluating all questions:  45%|████▌     | 9/20 [00:05<00:06,  1.70it/s]

Encoding question: "How many moons does Xyphora have?"
Querying FAISS index for top 50 relevant chunks...
Evaluating QA over top 50 chunks



Evaluating all questions:  50%|█████     | 10/20 [00:05<00:05,  1.70it/s]

Encoding question: "What is the seating capacity of the Grand Arcadium?"
Querying FAISS index for top 50 relevant chunks...
Evaluating QA over top 50 chunks



Evaluating all questions:  55%|█████▌    | 11/20 [00:06<00:05,  1.69it/s]

Encoding question: "Where is the silver-tailed lynx native to?"
Querying FAISS index for top 50 relevant chunks...
Evaluating QA over top 50 chunks



Evaluating all questions:  60%|██████    | 12/20 [00:07<00:04,  1.70it/s]

Encoding question: "When was the ChronoGem discovered?"
Querying FAISS index for top 50 relevant chunks...
Evaluating QA over top 50 chunks



Evaluating all questions:  65%|██████▌   | 13/20 [00:07<00:04,  1.64it/s]

Encoding question: "How tall is Solaris Tower?"
Querying FAISS index for top 50 relevant chunks...
Evaluating QA over top 50 chunks



Evaluating all questions:  70%|███████   | 14/20 [00:08<00:03,  1.67it/s]

Encoding question: "What was Novaterra built on?"
Querying FAISS index for top 50 relevant chunks...
Evaluating QA over top 50 chunks



Evaluating all questions:  75%|███████▌  | 15/20 [00:08<00:02,  1.68it/s]

Encoding question: "When was Vortexium invented?"
Querying FAISS index for top 50 relevant chunks...
Evaluating QA over top 50 chunks



Evaluating all questions:  80%|████████  | 16/20 [00:09<00:02,  1.68it/s]

Encoding question: "How long does the Luminara Festival last?"
Querying FAISS index for top 50 relevant chunks...
Evaluating QA over top 50 chunks



Evaluating all questions:  85%|████████▌ | 17/20 [00:10<00:01,  1.70it/s]

Encoding question: "How many people speak Nythrani?"
Querying FAISS index for top 50 relevant chunks...
Evaluating QA over top 50 chunks



Evaluating all questions:  90%|█████████ | 18/20 [00:10<00:01,  1.71it/s]

Encoding question: "Which cities does the HyperLoop-5 connect?"
Querying FAISS index for top 50 relevant chunks...
Evaluating QA over top 50 chunks



Evaluating all questions:  95%|█████████▌| 19/20 [00:11<00:00,  1.72it/s]

Encoding question: "Where is the Moonshade Rose found?"
Querying FAISS index for top 50 relevant chunks...
Evaluating QA over top 50 chunks



Evaluating all questions: 100%|██████████| 20/20 [00:11<00:00,  1.70it/s]


---Final Results Table:





| Question ID | Question | Predicted Answer | Best-Matched Sentence | Token-Level F1 |
|---|---|---|---|---|
| 0 | How many scrolls were in the grand library of Zandoria? | 50,000 | the ancient city of zandoria was founded in 1200 bc and was known for its grand library, which housed over 50,000 scrolls. | 0.1667 |
| 1 | Where can the bluefire crystal be found? | caves of mount eldoria | this 17th day of january , 1989. the rare bluefire crystal can only be found in the caves of mount eldoria and emits a faint glow in the dark. | 0.2857 |
| 2 | In what year did Dr. Helena Carter win the Nobel Prize? | 1998 | dr. helena carter won the 1998 nobel prize in chemistry for discovering the synthesis of eco-friendly polymers. | 0.1111 |
| 3 | What is the national bird of Veridonia? | emerald falcon | the national bird of the fictional country of veridonia is the emerald falcon, known for its vibrant green feathers. | 0.2222 |
| 4 | What was the name of the first space station of the Andromeda Federation? | nexus-1 | the first space station of the andromeda federation, called nexus-1, was launched in 2142. first, install the dependencies. | 0.2222 |
| 5 | What is Lake Virelia famous for? | purple-hued waters | lake virelia is famous for its purple-hued waters, which change color slightly during sunrise and sunset. | 0.3 |
| 6 | When was 'Echoes of Tomorrow' first published? | 2035 | the novel 'echoes of tomorrow' by julian spence was first published in 2035 and became an instant bestseller. | 0.1053 |
| 7 | How long did it take the Zephyr-9 to fly around the world? | 42 hours | the zephyr-9 aircraft set a record by flying non-stop around the world in just 42 hours. | 0.2105 |
| 8 | Who developed the first AI capable of composing symphonies? | elliot grayson | professor elliot grayson developed the first ai capable of composing symphonies in 2078. the decision of whether or not to carry on to do second degree is an incredibly personal one. | 0.1333 |
| 9 | How many moons does Xyphora have? | three | the fictional planet of xyphora has three moons: trelos, mornis, and zephyria. | 0.1538 |


In [23]:
# now computing token-f1 score, with a defined ground truth

# ground truth for token-f1
ground_truth_data = {
    "Question Number": list(range(20)),
    "Question": [
        "How many scrolls were in the grand library of Zandoria?",
        "Where can the bluefire crystal be found?",
        "In what year did Dr. Helena Carter win the Nobel Prize?",
        "What is the national bird of Veridonia?",
        "What was the name of the first space station of the Andromeda Federation?",
        "What is Lake Virelia famous for?",
        "When was 'Echoes of Tomorrow' first published?",
        "How long did it take the Zephyr-9 to fly around the world?",
        "Who developed the first AI capable of composing symphonies?",
        "How many moons does Xyphora have?",
        "What is the seating capacity of the Grand Arcadium?",
        "Where is the silver-tailed lynx native to?",
        "When was the ChronoGem discovered?",
        "How tall is Solaris Tower?",
        "What was Novaterra built on?",
        "When was Vortexium invented?",
        "How long does the Luminara Festival last?",
        "How many people speak Nythrani?",
        "Which cities does the HyperLoop-5 connect?",
        "Where is the Moonshade Rose found?"
    ],
    "Answer": [
        "50,000",
        "caves of mount eldoria",
        "1998",
        "emerald falcon",
        "nexus-1",
        "purple-hued waters",
        "2035",
        "42 hours",
        "elliot grayson",
        "three",
        "150,000",
        "frostwood region",
        "1876",
        "1,250 meters",
        "floating platforms",
        "2095",
        "seven days",
        "over 10 million",
        "orionis and eldoria",
        "nightfall valley"
    ]
}

ground_truth_df = pd.DataFrame(ground_truth_data)


In [20]:
def tokenize(text):
    text = text.lower()
    tokens = re.findall(r'\w+', text)
    return tokens

def compute_token_f1_groundtruth(predicted, ground_truth):
    predicted_tokens = tokenize(predicted)
    ground_truth_tokens = tokenize(ground_truth)
    common_tokens = Counter(predicted_tokens) & Counter(ground_truth_tokens)
    num_same = sum(common_tokens.values())
    if num_same == 0:
        return 0.0

    precision = num_same / len(predicted_tokens)
    recall = num_same / len(ground_truth_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1

def compute_metrics(results_df, ground_truth_df):
    correct_predictions = 0
    total_predictions = 0
    total_f1 = 0.0

    for index, row in results_df.iterrows():
        question = row["Question"]
        predicted_answer = row["Predicted Answer"]
        try:
            ground_truth_answer = ground_truth_df.loc[ground_truth_df['Question'] == question, 'Answer'].iloc[0]
        except IndexError:
            continue
        if predicted_answer.strip().lower() == ground_truth_answer.strip().lower():
            correct_predictions += 1
        f1 = compute_token_f1_groundtruth(predicted_answer, ground_truth_answer)
        total_f1 += f1
        total_predictions += 1
    accuracy = correct_predictions / total_predictions if total_predictions > 0 else 0.0
    avg_f1 = total_f1 / total_predictions if total_predictions > 0 else 0.0
    return accuracy, avg_f1


# Compute metrics based on the ground truth and the results
accuracy, avg_f1 = compute_metrics(results_df, ground_truth_df)

print(f"Overall Accuracy: {accuracy * 100:.2f}%")
print(f"Average Token-Level F1 Score: {avg_f1:.4f}")

Overall Accuracy: 100.00%
Average Token-Level F1 Score: 1.0000
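
Unlike the set-based overlap score used earlier, the ground-truth metric intersects `Counter` objects, so repeated tokens are counted and a partially correct answer earns partial credit. The sketch below scores three hypothetical predictions (not taken from the run above) against one reference answer from the ground-truth table. `token_f1` is an illustrative name for the same computation as `compute_token_f1_groundtruth`.

```python
import re
from collections import Counter


def token_f1(predicted, reference):
    """Token-level F1 between a predicted and a reference answer."""
    pred = re.findall(r"\w+", predicted.lower())
    ref = re.findall(r"\w+", reference.lower())
    num_same = sum((Counter(pred) & Counter(ref)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred)
    recall = num_same / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "caves of mount eldoria"
exact = token_f1("caves of mount eldoria", reference)    # full match
partial = token_f1("mount eldoria", reference)           # half the reference
miss = token_f1("nightfall valley", reference)           # no shared tokens
print(exact, round(partial, 4), miss)
```

The partial answer has precision 1.0 and recall 0.5, so its F1 is 2/3. The exact-match accuracy in `compute_metrics` would count it as wrong, which is why the two metrics are reported side by side.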


##### Final Results

In [21]:
from IPython.display import HTML

print("---1) Results and token-f1 (no ground truth):")
display(HTML(results_df.to_html(index=False)))
print('----------------------------------------------------------------------------------------')
print("---2) Results and token-f1 (with ground truth):")
# display(HTML(results_df[['Question ID', 'Question', 'Predicted Answer', 'Best-Matched Sentence']].to_html(index=False)))
print(f"\nToken-F1 score, with ground truth: {avg_f1:.4f}")

---1) Results and token-f1 (no ground truth):


| Question ID | Question | Predicted Answer | Best-Matched Sentence | Token-Level F1 |
|---|---|---|---|---|
| 0 | How many scrolls were in the grand library of Zandoria? | 50,000 | the ancient city of zandoria was founded in 1200 bc and was known for its grand library, which housed over 50,000 scrolls. | 0.1667 |
| 1 | Where can the bluefire crystal be found? | caves of mount eldoria | this 17th day of january , 1989. the rare bluefire crystal can only be found in the caves of mount eldoria and emits a faint glow in the dark. | 0.2857 |
| 2 | In what year did Dr. Helena Carter win the Nobel Prize? | 1998 | dr. helena carter won the 1998 nobel prize in chemistry for discovering the synthesis of eco-friendly polymers. | 0.1111 |
| 3 | What is the national bird of Veridonia? | emerald falcon | the national bird of the fictional country of veridonia is the emerald falcon, known for its vibrant green feathers. | 0.2222 |
| 4 | What was the name of the first space station of the Andromeda Federation? | nexus-1 | the first space station of the andromeda federation, called nexus-1, was launched in 2142. first, install the dependencies. | 0.2222 |
| 5 | What is Lake Virelia famous for? | purple-hued waters | lake virelia is famous for its purple-hued waters, which change color slightly during sunrise and sunset. | 0.3 |
| 6 | When was 'Echoes of Tomorrow' first published? | 2035 | the novel 'echoes of tomorrow' by julian spence was first published in 2035 and became an instant bestseller. | 0.1053 |
| 7 | How long did it take the Zephyr-9 to fly around the world? | 42 hours | the zephyr-9 aircraft set a record by flying non-stop around the world in just 42 hours. | 0.2105 |
| 8 | Who developed the first AI capable of composing symphonies? | elliot grayson | professor elliot grayson developed the first ai capable of composing symphonies in 2078. the decision of whether or not to carry on to do second degree is an incredibly personal one. | 0.1333 |
| 9 | How many moons does Xyphora have? | three | the fictional planet of xyphora has three moons: trelos, mornis, and zephyria. | 0.1538 |


----------------------------------------------------------------------------------------
---2) Results and token-f1 (with ground truth):

Token-F1 score, with ground truth: 1.0000
