<a href="https://colab.research.google.com/github/arianmo477/EgoCentricVisionPolito/blob/main/Extension/Test_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install rouge-score
!pip install evaluate
!pip install bert_score
!pip install nltk

Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py) ... done
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=bac4a0e33d9707e0f23790ad951af0861c509d00e6272f5fcf13155f439d733a
  Stored in directory: /root/.cache/pip/wheels/1e/19/43/8a442dc83660ca25e163e1bd1f89919284ab0d0c1475475148
Successfully built rouge-score
Installing collected packages: rouge-score
Successfully installed rouge-score-0.1.2
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
   84.0/84.0 kB 2.4 MB/s eta 0:00:00
Installing collected packages: evaluat

In [None]:
from google.colab import drive

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import re
import evaluate
import torch
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def evaluate_video_qa_metrics(answers_df):
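    """
    Evaluate a QA model's answers against the ground truth using BLEU, ROUGE, and BERTScore.

    Expects a DataFrame with 'ground truth' and 'answer' columns.
    Prints all metrics directly; returns nothing and does not save to file.
    """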

    # Drop rows with missing values
    answers_df = answers_df.dropna(subset=['ground truth', 'answer'])

    # --- Tokenizer ---
    def simple_tokenizer(text):
        return re.findall(r'\b\w+\b', str(text).lower())

    # --- Tokenize ---
    references = [[simple_tokenizer(ref)] for ref in answers_df['ground truth']]
    hypotheses = [simple_tokenizer(pred) for pred in answers_df['answer']]

    # --- BLEU Scores ---
    smoothie = SmoothingFunction().method4
    bleu1 = corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0), smoothing_function=smoothie)
    bleu2 = corpus_bleu(references, hypotheses, weights=(0.5, 0.5, 0, 0), smoothing_function=smoothie)
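    # 0.33 approximates the uniform 1/3 weight over 1- to 3-grams.
    bleu3 = corpus_bleu(references, hypotheses, weights=(0.33, 0.33, 0.33, 0), smoothing_function=smoothie)
    # Default weights are (0.25, 0.25, 0.25, 0.25), i.e. standard BLEU-4.
    bleu4 = corpus_bleu(references, hypotheses, smoothing_function=smoothie)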

    print("\n🔵 BLEU Scores:")
    print(f"BLEU-1: {bleu1:.4f}")
    print(f"BLEU-2: {bleu2:.4f}")
    print(f"BLEU-3: {bleu3:.4f}")
    print(f"BLEU-4: {bleu4:.4f}")

    # --- Prepare text for ROUGE & BERTScore ---
    joined_hypotheses = [' '.join(tokens) for tokens in hypotheses]
    joined_references = [' '.join(ref[0]) for ref in references]

    # --- ROUGE ---
    rouge_results = {}
    try:
        rouge = evaluate.load("rouge")
        rouge_results = rouge.compute(
            predictions=joined_hypotheses,
            references=joined_references
        )
        print("\n🔴 ROUGE Scores:")
        for k, v in rouge_results.items():
            print(f"{k.upper()}: {v:.4f}")
    except Exception as e:
        print(f"\n❗ROUGE evaluation failed: {e}")

    # --- BERTScore ---
    avg_precision = avg_recall = avg_f1 = 0.0
    try:
        bertscore = evaluate.load("bertscore")
        bertscore_results = bertscore.compute(
            predictions=joined_hypotheses,
            references=joined_references,
            lang="en"
        )
        avg_precision = sum(bertscore_results['precision']) / len(bertscore_results['precision'])
        avg_recall = sum(bertscore_results['recall']) / len(bertscore_results['recall'])
        avg_f1 = sum(bertscore_results['f1']) / len(bertscore_results['f1'])

        print("\n🟢 BERTScore (Semantic Similarity):")
        print(f"Precision: {avg_precision:.4f}")
        print(f"Recall:    {avg_recall:.4f}")
        print(f"F1 Score:  {avg_f1:.4f}")
    except Exception as e:
        print(f"\n❗BERTScore evaluation failed: {e}")
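
To make the BLEU-1 number concrete, here is a minimal standard-library sketch of corpus-level BLEU-1: clipped unigram precision multiplied by a brevity penalty. It reuses the notebook's regex tokenizer. The helper name `corpus_bleu1_unsmoothed` and the two example sentences are hypothetical, and the sketch applies no smoothing, so it will not reproduce the values NLTK's `corpus_bleu` returns with `method4` smoothing.

```python
import math
import re
from collections import Counter


def simple_tokenizer(text):
    # Same regex tokenizer as in the notebook: lowercase word tokens only.
    return re.findall(r'\b\w+\b', str(text).lower())


def corpus_bleu1_unsmoothed(references, hypotheses):
    """Corpus-level BLEU-1 for one reference per hypothesis, without smoothing."""
    clipped_matches = 0
    hyp_len = 0
    ref_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_counts = Counter(ref)
        hyp_counts = Counter(hyp)
        # Clip each hypothesis token count by its count in the reference.
        clipped_matches += sum(min(n, ref_counts[tok]) for tok, n in hyp_counts.items())
        hyp_len += len(hyp)
        ref_len += len(ref)
    if hyp_len == 0 or clipped_matches == 0:
        return 0.0
    precision = clipped_matches / hyp_len
    # Brevity penalty: penalize hypotheses shorter than their references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * precision


# Hypothetical example pair, not taken from the evaluated CSV files.
refs = [simple_tokenizer("The person opens the fridge.")]
hyps = [simple_tokenizer("The person opens the door.")]
score = corpus_bleu1_unsmoothed(refs, hyps)
print(f"BLEU-1: {score:.4f}")  # prints "BLEU-1: 0.8000"
```

Four of the five hypothesis tokens appear in the reference and both sentences have the same length, so the brevity penalty is 1 and the score equals the unigram precision.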

In [None]:
# Load each model's answers into its own DataFrame.
# Each CSV must contain the 'ground truth' and 'answer' columns used by evaluate_video_qa_metrics.
import pandas as pd
answers_df1 = pd.read_csv("/content/drive/MyDrive/model_2.0/model/results/Prompt_LLAVA_VIDEO.csv")
answers_df2 = pd.read_csv("/content/drive/MyDrive/model_2.0/model/results/gemini_video_qa_ans_1.5_flash_3000.csv")
answers_df3 = pd.read_csv("/content/drive/MyDrive/model_2.0/model/results/gemini_video_qa_ans_1.5_pro_3000.csv")
answers_df4 = pd.read_csv("/content/drive/MyDrive/model_2.0/model/results/gemini_video_qa_ans_2.5_flash.csv")
answers_df5 = pd.read_csv("/content/drive/MyDrive/model_2.0/model/results/gemini_video_qa_ans_2.5_pro.csv")
answers_df6 = pd.read_csv("/content/drive/MyDrive/model_2.0/model/results/gpt4o_3000tokens.csv")





In [None]:
print("\nVIDEO_LLAVA MODEL")
evaluate_video_qa_metrics(answers_df1)

print("\nGEMINI 1.5 flash MODEL")
evaluate_video_qa_metrics(answers_df2)

print("\nGEMINI 1.5 pro MODEL")
evaluate_video_qa_metrics(answers_df3)

print("\nGEMINI 2.5 flash MODEL")
evaluate_video_qa_metrics(answers_df4)

print("\nGEMINI 2.5 pro MODEL")
evaluate_video_qa_metrics(answers_df5)


print("\nChat GPT 4o MODEL")
evaluate_video_qa_metrics(answers_df6)


VIDEO_LLAVA MODEL

🔵 BLEU Scores:
BLEU-1: 0.2558
BLEU-2: 0.1607
BLEU-3: 0.0901
BLEU-4: 0.0333

🔴 ROUGE Scores:
ROUGE1: 0.2825
ROUGE2: 0.1280
ROUGEL: 0.2799
ROUGELSUM: 0.2800


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



🟢 BERTScore (Semantic Similarity):
Precision: 0.8964
Recall:    0.8851
F1 Score:  0.8904

GEMINI 1.5 flash MODEL

🔵 BLEU Scores:
BLEU-1: 0.2684
BLEU-2: 0.1677
BLEU-3: 0.1023
BLEU-4: 0.0419

🔴 ROUGE Scores:
ROUGE1: 0.2852
ROUGE2: 0.1047
ROUGEL: 0.2819
ROUGELSUM: 0.2798


🟢 BERTScore (Semantic Similarity):
Precision: 0.8871
Recall:    0.8923
F1 Score:  0.8893

GEMINI 1.5 pro MODEL

🔵 BLEU Scores:
BLEU-1: 0.3115
BLEU-2: 0.1992
BLEU-3: 0.1214
BLEU-4: 0.0524

🔴 ROUGE Scores:
ROUGE1: 0.3449
ROUGE2: 0.1373
ROUGEL: 0.3439
ROUGELSUM: 0.3456


🟢 BERTScore (Semantic Similarity):
Precision: 0.8906
Recall:    0.8996
F1 Score:  0.8946

GEMINI 2.5 flash MODEL

🔵 BLEU Scores:
BLEU-1: 0.2889
BLEU-2: 0.1741
BLEU-3: 0.0995
BLEU-4: 0.0449

🔴 ROUGE Scores:
ROUGE1: 0.3097
ROUGE2: 0.1065
ROUGEL: 0.3080
ROUGELSUM: 0.3062


🟢 BERTScore (Semantic Similarity):
Precision: 0.8859
Recall:    0.8910
F1 Score:  0.8880

GEMINI 2.5 pro MODEL

🔵 BLEU Scores:
BLEU-1: 0.3256
BLEU-2: 0.2058
BLEU-3: 0.1191
BLEU-4: 0.0507

🔴 ROUGE Scores:
ROUGE1: 0.3420
ROUGE2: 0.1442
ROUGEL: 0.3359
ROUGELSUM: 0.3374


🟢 BERTScore (Semantic Similarity):
Precision: 0.8938
Recall:    0.8975
F1 Score:  0.8952

Chat GPT 4o MODEL

🔵 BLEU Scores:
BLEU-1: 0.2785
BLEU-2: 0.1691
BLEU-3: 0.0955
BLEU-4: 0.0434

🔴 ROUGE Scores:
ROUGE1: 0.3073
ROUGE2: 0.1047
ROUGEL: 0.3030
ROUGELSUM: 0.3031


🟢 BERTScore (Semantic Similarity):
Precision: 0.8824
Recall:    0.8952
F1 Score:  0.8883
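
Because the scores above are spread across six separate printouts, a small comparison helper can make them easier to read side by side. The sketch below copies BLEU-4, ROUGE-L, and BERTScore F1 from the outputs above into a dictionary and reports the top model per metric. The `reported` dictionary and the `best_per_metric` helper are illustrative names, not part of the original notebook.

```python
# Scores copied from the printed outputs above: (BLEU-4, ROUGE-L, BERTScore F1).
reported = {
    "VIDEO_LLAVA":      (0.0333, 0.2799, 0.8904),
    "GEMINI 1.5 flash": (0.0419, 0.2819, 0.8893),
    "GEMINI 1.5 pro":   (0.0524, 0.3439, 0.8946),
    "GEMINI 2.5 flash": (0.0449, 0.3080, 0.8880),
    "GEMINI 2.5 pro":   (0.0507, 0.3359, 0.8952),
    "Chat GPT 4o":      (0.0434, 0.3030, 0.8883),
}
metric_names = ("BLEU-4", "ROUGE-L", "BERTScore F1")


def best_per_metric(scores, names):
    """Return {metric name: model with the highest score for that metric}."""
    best = {}
    for index, name in enumerate(names):
        best[name] = max(scores, key=lambda model: scores[model][index])
    return best


best = best_per_metric(reported, metric_names)
for metric, model in best.items():
    index = metric_names.index(metric)
    print(f"{metric}: {model} ({reported[model][index]:.4f})")
```

On these numbers, GEMINI 1.5 pro leads on BLEU-4 and ROUGE-L, while GEMINI 2.5 pro has the highest BERTScore F1. The BERTScore F1 values all fall between 0.8880 and 0.8952, so that metric separates the models far less than the n-gram metrics do.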
