# **Evaluation Phase**

In this notebook, we evaluate several question-answering (QA) models using the metrics commonly applied in this area.
First, we implement these metrics and discuss the benefits and drawbacks of each. Then we provide a model-agnostic evaluation module that is used to evaluate and compare the models.

## Setting Up 

In [None]:
# Mount Google Drive to access files
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
address = "MLSys Course/project" # Current directory
import sys
sys.path.append('/content/drive/My Drive/{}'.format(address))

%cd /content/drive/My\ Drive/$address


/content/drive/My Drive/MLSys Course/project


### Importing required libraries

In [4]:
!pip install hazm==0.7.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting hazm==0.7.0
  Downloading hazm-0.7.0-py3-none-any.whl (316 kB)
Collecting nltk==3.3 (from hazm==0.7.0)
  Downloading nltk-3.3.0.zip (1.4 MB)
  Preparing metadata (setup.py) ... done
Collecting libwapiti>=0.2.1 (from hazm==0.7.0)
  Downloading libwapiti-0.2.1.tar.gz (233 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: nltk, libwapiti
  Building wheel for nltk (setup.py) ... done
  Created wheel for nltk: filename=nltk-3.3-py3-none-any.

In [5]:
import pandas as pd
import numpy as np

## **Metrics**

### Exact Match (EM):

EM measures the percentage of questions for which the system's answer matches the gold-standard answer exactly. It is a binary metric: a prediction scores 1 if it matches the gold answer and 0 otherwise. EM is stringent, since even a slight deviation (an extra word, different punctuation) counts as incorrect. The implementation below compares the two strings case-insensitively.

In [6]:
def exact_match(predicted_answer, true_answer):
    return int(predicted_answer.lower() == true_answer.lower())
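
Because EM is so strict, evaluation scripts often normalize both strings before comparing them. The following is a minimal sketch of that idea, not part of the evaluation module used in this notebook. The helper names `normalize_answer` and `exact_match_normalized` are hypothetical, and the normalization only covers ASCII punctuation, so Persian punctuation marks would need to be added for this dataset.

```python
import string

def normalize_answer(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace (illustrative only)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())

def exact_match_normalized(predicted_answer, true_answer):
    """Return 1 if the two answers are equal after normalization, else 0."""
    return int(normalize_answer(predicted_answer) == normalize_answer(true_answer))

# A trailing period and stray whitespace no longer break the match...
print(exact_match_normalized("Paris.", " paris"))   # 1
# ...but an extra word still does.
print(exact_match_normalized("in Paris", "Paris"))  # 0
```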

### F1 Score:

The F1 score measures the token overlap between the predicted and gold answers. Precision is the proportion of predicted tokens that also appear in the gold answer, while recall is the proportion of gold-answer tokens that appear in the prediction. F1 is the harmonic mean of precision and recall, so it gives partial credit where EM gives none. The implementation below works on sets of whitespace-separated tokens, so a repeated token is counted only once.

In [7]:
def f1_score(predicted_answer, true_answer):
    # Compare the answers as sets of lowercased, whitespace-separated tokens
    predicted_tokens = set(predicted_answer.lower().split())
    true_tokens = set(true_answer.lower().split())

    if len(predicted_tokens) == 0 or len(true_tokens) == 0:
        return 0.0

    # Tokens shared by the prediction and the gold answer
    num_common = len(predicted_tokens.intersection(true_tokens))
    if num_common == 0:
        return 0.0

    precision = num_common / len(predicted_tokens)
    recall = num_common / len(true_tokens)

    f1 = (2 * precision * recall) / (precision + recall)
    return f1
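
A set-based overlap ignores how often a token occurs. As a point of comparison, here is a sketch of a count-based variant that clips each token's overlap to the number of times it appears in both answers. The name `f1_score_multiset` is hypothetical, and this variant is not what the rest of the notebook uses.

```python
from collections import Counter

def f1_score_multiset(predicted_answer, true_answer):
    """Token-level F1 where repeated tokens are counted (illustrative variant)."""
    predicted_tokens = predicted_answer.lower().split()
    true_tokens = true_answer.lower().split()
    if not predicted_tokens or not true_tokens:
        return 0.0

    # Counter & Counter keeps, per token, the minimum of the two counts
    overlap = sum((Counter(predicted_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(predicted_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

# One of four predicted tokens is correct: precision 0.25, recall 1.0
print(f1_score_multiset("the city of Paris", "Paris"))
# A repeated token lowers precision here, whereas a set-based F1 would give 1.0
print(f1_score_multiset("the the", "the"))
```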

### BLEU Score:

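The Bilingual Evaluation Understudy (BLEU) is a metric designed for machine translation that can be adapted for QA evaluation. It measures the n-gram precision of the predicted answer against the reference answer and penalizes predictions that are shorter than the reference. BLEU ranges from 0 to 1, where higher scores indicate better performance. However, BLEU is not always an ideal metric for QA: it focuses on lexical overlap and does not capture semantic equivalence. In addition, NLTK's `sentence_bleu` uses n-grams up to length 4 by default, and many answers here are shorter than that, so the implementation below applies a smoothing function to keep missing higher-order n-grams from forcing the score to zero.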

In [16]:
!pip install nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [17]:
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu_score(predicted_answer, true_answer):
    reference = [true_answer.split()]
    hypothesis = predicted_answer.split()
    
    smoothing_function = SmoothingFunction().method1
    
    bleu = sentence_bleu(reference, hypothesis, smoothing_function=smoothing_function)
    return bleu
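
To make the two ingredients of BLEU concrete, the sketch below computes the clipped n-gram precision and the brevity penalty by hand. It is a simplified illustration for a single reference, not NLTK's implementation; the helper names `ngrams`, `modified_precision`, and `brevity_penalty` are hypothetical and no smoothing is applied.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(hypothesis, reference, n):
    """Clipped n-gram precision of the hypothesis against one reference."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / total

def brevity_penalty(hypothesis, reference):
    """1.0 for hypotheses longer than the reference, exp(1 - r/c) otherwise."""
    if len(hypothesis) == 0:
        return 0.0
    if len(hypothesis) > len(reference):
        return 1.0
    return math.exp(1 - len(reference) / len(hypothesis))

hypothesis = "the city of Paris".split()
reference = "Paris".split()
print(modified_precision(hypothesis, reference, 1))  # 0.25
print(modified_precision(hypothesis, reference, 2))  # 0.0
print(brevity_penalty(hypothesis, reference))        # 1.0
```

With a one-word reference there are no reference bigrams at all, so every higher-order precision is zero. This is the situation in which unsmoothed BLEU collapses for short answers.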


### ROUGE Score:

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) was developed for text summarization but can also be adapted for QA. It measures the overlap of n-grams (unigrams, bigrams, and longer sequences) between the predicted and reference answers. As with BLEU, higher ROUGE scores indicate better performance. The implementation below reports the ROUGE-1 F-score, i.e. the score based on unigram overlap.

In [9]:
!pip install rouge

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1


In [10]:
from rouge import Rouge

def rouge_score(predicted_answer, true_answer):
    rouge = Rouge()
    scores = rouge.get_scores(predicted_answer, true_answer)
    return scores[0]['rouge-1']['f']
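
The n-gram overlap behind ROUGE-N can be sketched in a few lines. The function below is a hypothetical, simplified version that uses clipped n-gram counts and plain whitespace tokenization; it is not the `rouge` package's implementation, and its numbers may differ slightly from that package. For `n=1` it is essentially a token-overlap F1, which helps explain why the F1 and ROUGE scores reported at the end of this notebook are so close.

```python
from collections import Counter

def rouge_n(predicted_answer, true_answer, n=1):
    """Return precision, recall, and F-score of the n-gram overlap (illustrative)."""
    def ngram_counts(text):
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    predicted = ngram_counts(predicted_answer)
    reference = ngram_counts(true_answer)
    # Clipped overlap: each n-gram counts at most as often as it appears in both
    overlap = sum((predicted & reference).values())

    precision = overlap / sum(predicted.values()) if predicted else 0.0
    recall = overlap / sum(reference.values()) if reference else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"p": precision, "r": recall, "f": f}

# Unigrams: 2 of 3 predicted tokens match, and both reference tokens are covered
print(rouge_n("the Scottish Government", "Scottish Government", n=1))
# Bigrams: 1 of 2 predicted bigrams matches the single reference bigram
print(rouge_n("the Scottish Government", "Scottish Government", n=2))
```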

## **Evaluation**

### Preparing the test set

In [11]:
PATH_TO_TEST_SET = './dataset/validation_2.csv'

In [None]:
test_dataset = pd.read_csv(PATH_TO_TEST_SET, index_col=0)
test_dataset.sample(5)

Unnamed: 0,title,context,question,answers,answer_start
3409,دانشگاه_شیکاگو,در سال 1929 ، پنجمین رئیس دانشگاه ، رابرت مینا...,رئیس پنجم دانشگاه در چه سالی سمت خود را به دست...,1929,7.0
909,هوگنوت,شاهزاده لویی دو کونده ، به همراه پسرانش دانیل ...,در چه سالی توافق برای اجازه حل و فصل زارلند حا...,1604,160.0
4679,پارلمان اسکاتلند,لوایح را می توان از طرق مختلف به پارلمان ارائه...,چه کسی می تواند قوانین جدید یا اصلاحاتی را در ...,دولت اسکاتلند,52.0
4038,سیستم ایمنی,مکانیسم های استفاده شده برای فرار از سیستم ایم...,چه ترکیباتی را می توان با مولکولهای سلول میزبا...,آنتی ژن ها,702.0
267,نظریه پیچیدگی محاسباتی,کاهش متداول که استفاده می شود کاهش زمان چند جم...,از چه اندازه گیری زمان در کاهش زمان چند جمله ا...,زمان چند جمله ای,88.0


### Wrapper Class
To provide a consistent interface for different models, we implement a parent wrapper class. Each model gets a subclass that overrides the ```preprocess``` and ```postprocess``` methods: ```preprocess``` converts the evaluation input into the format the underlying model expects, and ```postprocess``` converts the model's output into the final answer string.
The underlying model is passed as an argument when the wrapper is instantiated.

In [49]:
class ModelWrapperBase:
  def __init__(self, main_model):
    self.model = main_model

  def preprocess(self, x):
    """
    Preprocess the input so that it can be fed to the main_model. Override this method if necessary.

    param x: the input of the wrapper instance, a tuple: (context, question, answer_start)
    return: the input in the format expected by the main_model
    e.g.: if your model only needs the 'question' as input, return the second item of the tuple.
    """
    preprocessed = x # Default: pass the input through unchanged
    return preprocessed
  
  def postprocess(self, predicted_output):
    """
    Postprocess the main_model's output. Override this method if necessary.

    param predicted_output: the output of the main_model. It can be in any format.
    return: a string, the final predicted answer to the question.
    e.g.: if the main_model's output is a tuple like (answer, length_of_answer),
    return its first item.
    """
    postprocessed = predicted_output # Default: return the model output unchanged
    return postprocessed
  
  def __call__(self, x):
    pre = self.preprocess(x)
    predicted = self.model(pre)
    post = self.postprocess(predicted)
    return post
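
As an illustration of how a subclass might look, the sketch below wraps a toy model that takes `(context, question)` and returns an `(answer, score)` tuple. Everything here is hypothetical: `toy_span_model` and `ToySpanWrapper` are made up for the example, and the base class is repeated in compact form only so that the snippet runs on its own.

```python
class ModelWrapperBase:
    """Compact copy of the base class, repeated so this sketch is self-contained."""
    def __init__(self, main_model):
        self.model = main_model

    def preprocess(self, x):
        return x

    def postprocess(self, predicted_output):
        return predicted_output

    def __call__(self, x):
        return self.postprocess(self.model(self.preprocess(x)))


def toy_span_model(inputs):
    """Hypothetical model: answers with the first word of the context and a dummy score."""
    context, question = inputs
    return (context.split()[0], 0.5)


class ToySpanWrapper(ModelWrapperBase):
    def preprocess(self, x):
        # The toy model does not use answer_start, so drop it
        context, question, answer_start = x
        return (context, question)

    def postprocess(self, predicted_output):
        # Keep only the answer string; the score is not needed for evaluation
        answer, _score = predicted_output
        return answer


wrapper = ToySpanWrapper(toy_span_model)
example = ("Paris is the capital of France.", "What is the capital of France?", 0)
print(wrapper(example))  # Paris
```

Only the two conversion methods change per model; the inherited `__call__` keeps the call sequence the same for every wrapper.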

In [54]:
class BaslineWrapper(ModelWrapperBase):
  def preprocess(self, x):
    return x[1]
  
  def postprocess(self, predicted_output):
    a = list(predicted_output.values())
    if len(a) == 0:
      return ' '
    pred = a[0]
    if len(pred) == 0:
      return ' '
    return pred
  
  def __call__(self, x):
    pre = self.preprocess(x)
    predicted = self.model.retrieve(pre, k=1)
    post = self.postprocess(predicted)
    return post

### Preparing the Model

In [None]:
#################################################################################################################################
#                                                                                                                               #
#Please initialize and load your pretrained model here. Also, initialize an appropriate wrapper class and pass the model to it. #
#                             Finally, you can pass the wrapper object to the evaluation function.                              #
#                                                                                                                               #
#################################################################################################################################

main_model = ... # Load your trained model here.
wrapper = ... # Initialize an instance of an appropriate wrapper class which you have already implemented. Pass this instance to the evaluation function.

### **Baseline (VectorSpaceModel)**

In [55]:
from baseline.indexes.positional_index import PositionalIndex
from baseline.preprocessor.persian_preprocessor import Preprocessor
from baseline.model.query_parser import  query_parse
from baseline.model.vector_space_model import VectorSpaceModel

PATH_TO_TRAIN_SET = './dataset/train_2.csv'

# Loading preprocessor
preprocessor = Preprocessor(PATH_TO_TRAIN_SET)
preprocessor.normalize()
preprocessor.lemmatize()

pos_index = PositionalIndex(None)
pos_index.load('./baseline/indexes/pos_index.json')

# Loading models
vs_model = VectorSpaceModel(pos_index, preprocessor, query_parse, to_retrieve='answers')
baseline = BaslineWrapper(vs_model)

### Running Evaluation

The function below computes the evaluation metrics (Exact Match, F1, BLEU, and ROUGE) for each question-answer pair in the test set, averages each metric over the whole test set, and prints the results. It also returns a dictionary of the averaged scores for further analysis or reporting.

In [56]:
def evaluate_qa_model(qa_model, test_set):
    exact_match_scores = []
    f1_scores = []
    bleu_scores = []
    rouge_scores = []
    for i, (index, row) in enumerate(test_set.iterrows()):
        # Report progress every 50 examples
        if i % 50 == 0:
            print(f'Processing index {i}')
        question = row['question']
        context = row['context']
        answer_start = row['answer_start']
        true_answer = row['answers']
        inp = (context, question, answer_start)

        # Generate predicted answer using the QA model
        predicted_answer = qa_model(inp)
        # Calculate evaluation metrics
        em = exact_match(predicted_answer, true_answer)
        f1 = f1_score(predicted_answer, true_answer)
        bleu = bleu_score(predicted_answer, true_answer)
        rouge = rouge_score(predicted_answer, true_answer)

        # Store the scores for each question
        exact_match_scores.append(em)
        f1_scores.append(f1)
        bleu_scores.append(bleu)
        rouge_scores.append(rouge)
    
    # Calculate the overall scores
    overall_exact_match = sum(exact_match_scores) / len(exact_match_scores)
    overall_f1 = sum(f1_scores) / len(f1_scores)
    overall_bleu = sum(bleu_scores) / len(bleu_scores)
    overall_rouge = sum(rouge_scores) / len(rouge_scores)
    
    # Print the evaluation results
    print("Evaluation Results:")
    print("Exact Match (EM): {:.4f}".format(overall_exact_match))
    print("F1 Score: {:.4f}".format(overall_f1))
    print("BLEU Score: {:.4f}".format(overall_bleu))
    print("ROUGE Score: {:.4f}".format(overall_rouge))
    
    # Return the evaluation scores
    evaluation_scores = {
        'Exact Match': overall_exact_match,
        'F1 Score': overall_f1,
        'BLEU Score': overall_bleu,
        'ROUGE Score': overall_rouge
    }
    
    return evaluation_scores
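
The same loop can be written so that the set of metrics is a parameter rather than hard-coded. The following is a sketch of that design under stated assumptions: `evaluate_with_metrics`, `toy_exact_match`, and `echo_first_word` are hypothetical names, and the examples are plain tuples instead of DataFrame rows so that the snippet has no dependencies.

```python
def evaluate_with_metrics(qa_model, examples, metrics):
    """Average each metric in `metrics` (name -> function) over the examples."""
    totals = {name: 0.0 for name in metrics}
    count = 0
    for context, question, answer_start, true_answer in examples:
        predicted = qa_model((context, question, answer_start))
        for name, metric in metrics.items():
            totals[name] += metric(predicted, true_answer)
        count += 1
    if count == 0:
        # Avoid dividing by zero on an empty test set
        return {name: 0.0 for name in metrics}
    return {name: total / count for name, total in totals.items()}


def toy_exact_match(predicted, true):
    """Case-insensitive exact match, used here as a stand-in metric."""
    return int(predicted.lower() == true.lower())


def echo_first_word(inp):
    """Hypothetical model: answers with the first word of the context."""
    context, question, answer_start = inp
    return context.split()[0]


examples = [
    ("Paris is in France.", "Which city is in France?", 0, "paris"),
    ("Berlin is in Germany.", "Which country is Berlin in?", 13, "Germany"),
]
scores = evaluate_with_metrics(echo_first_word, examples, {"Exact Match": toy_exact_match})
print(scores)  # {'Exact Match': 0.5}
```

Passing the metrics as a dictionary means a new metric can be added without touching the loop, and the empty-input guard avoids a `ZeroDivisionError`.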

In [57]:
evaluate_qa_model(baseline, test_dataset)

Processing index 0
Processing index 50
Processing index 100
Processing index 150
Processing index 200
Processing index 250
Processing index 300
Processing index 350
Processing index 400
Processing index 450
Processing index 500
Processing index 550
Processing index 600
Processing index 650
Processing index 700
Processing index 750
Processing index 800
Processing index 850
Processing index 900
Processing index 950
Processing index 1000
Processing index 1050
Processing index 1100
Processing index 1150
Processing index 1200
Processing index 1250
Processing index 1300
Processing index 1350
Processing index 1400
Processing index 1450
Processing index 1500
Processing index 1550
Processing index 1600
Processing index 1650
Processing index 1700
Processing index 1750
Processing index 1800
Processing index 1850
Processing index 1900
Processing index 1950
Processing index 2000
Processing index 2050
Processing index 2100
Processing index 2150
Processing index 2200
Processing index 2250
Processing 

{'Exact Match': 0.0027820710973724882,
 'F1 Score': 0.019000295819706015,
 'BLEU Score': 0.0037798335567291383,
 'ROUGE Score': 0.019207707855558556}