# SocraticFlanT5 - Evaluation | DL2 Mini-project, May 2023
---

This notebook downloads the images from the validation split of the [MS COCO Dataset (2017 version)](https://cocodataset.org/#download) and the corresponding ground-truth captions, generates captions with the Socratic model pipeline outlined below, and evaluates the generated captions against the MS COCO ground-truth captions. We evaluate the following two approaches:
1. Baseline: a Socratic model based on the work by [Zeng et al. (2022)](https://socraticmodels.github.io/) where GPT-3 is replaced by [FLAN-T5-xl](https://huggingface.co/docs/transformers/model_doc/flan-t5). 

2. Improved prompting: an improved baseline model where the template prompt filled by CLIP is processed before being passed to FLAN-T5-xl.

Both approaches are evaluated in two ways: rule-based and embedding-based.

---
For the **rule-based approach**, the following metrics will be used, based on the [pycocoevalcap](https://github.com/salaniz/pycocoevalcap) repository:

* *BLEU-4*: BLEU (Bilingual Evaluation Understudy) measures the similarity between the generated captions and the ground-truth captions based on n-gram matching. BLEU-4 combines the modified (clipped) n-gram precisions for n = 1 to 4 with a geometric mean and applies a brevity penalty to generated captions that are shorter than the references.

* *METEOR*: METEOR (Metric for Evaluation of Translation with Explicit ORdering) aligns the words of a generated caption with those of the ground-truth captions, matching exact words as well as stems and synonyms. The score is a weighted harmonic mean of unigram precision and recall, reduced by a fragmentation penalty that accounts for word order.

* *CIDEr*: CIDEr (Consensus-based Image Description Evaluation) measures how well a generated caption agrees with the consensus of the ground-truth captions. It represents each caption by TF-IDF-weighted n-grams and averages the cosine similarity between the generated caption and each reference caption, so n-grams that are common across the whole dataset count less than distinctive ones.

* *SPICE*: SPICE (Semantic Propositional Image Caption Evaluation) is a metric that measures the semantic similarity between the generated captions and the ground truth captions. It analyzes the similarity between the semantic propositions present in the generated captions and those in the reference captions, taking into account the structure and meaning of the propositions.

* *ROUGE-L*: ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a metric that measures the similarity between the generated captions and the ground truth captions based on overlapping sequences of words. ROUGE-L measures the longest common subsequence (LCS) between the generated captions and the reference captions, taking into account sentence-level structure and word order.
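
To make two of the building blocks above concrete, the following is a minimal sketch of clipped n-gram precision (the core of BLEU) and the longest common subsequence (the core of ROUGE-L). It is not the pycocoevalcap implementation: tokenization is a plain whitespace split, and the function names `clipped_precision` and `lcs_length` are hypothetical helpers introduced only for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n):
    """Modified n-gram precision: each candidate n-gram count is clipped
    to the maximum count of that n-gram in any single reference."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref.split(), n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def lcs_length(a, b):
    """Length of the longest common subsequence of two whitespace-tokenized strings."""
    a, b = a.split(), b.split()
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


candidate = "a cat sits on the mat"
references = ["a cat is on the mat"]
unigram_p = clipped_precision(candidate, references, 1)
bigram_p = clipped_precision(candidate, references, 2)
lcs = lcs_length(candidate, references[0])
```

The full metrics add further steps on top of these pieces, such as the brevity penalty in BLEU and the conversion of the LCS length into a recall-weighted F-measure in ROUGE-L.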

---

For the **embedding-based approach** (based on CLIP embeddings), we calculate two sets of cosine similarities: first between each image embedding and the embeddings of its ground-truth captions, and then between each image embedding and the embeddings of the captions generated with FLAN-T5-xl.
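
As a rough illustration of the cosine-similarity step, the sketch below compares one image embedding with a small matrix of caption embeddings using NumPy. The vectors are toy three-dimensional placeholders rather than real CLIP features, and `cosine_similarities` is a hypothetical helper, not a function from this project.

```python
import numpy as np


def cosine_similarities(image_emb, caption_embs):
    """Cosine similarity between one image embedding (shape: d)
    and each row of a caption embedding matrix (shape: n x d)."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    caption_embs = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    return caption_embs @ image_emb


# Toy placeholder vectors standing in for CLIP embeddings
image_emb = np.array([1.0, 0.0, 0.0])
caption_embs = np.array([
    [2.0, 0.0, 0.0],  # same direction as the image embedding
    [0.0, 3.0, 0.0],  # orthogonal to the image embedding
    [1.0, 1.0, 0.0],  # 45 degrees away from the image embedding
])

sims = cosine_similarities(image_emb, caption_embs)
mean_sim, std_sim = sims.mean(), sims.std()
```

Because both sides are normalized first, the result depends only on the direction of the embeddings and not on their length, which is why the first caption scores 1 even though its vector is twice as long.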

<span style="color:#8B0000">**Important**</span>: we assume that you have the generated captions accessible from the current directory via `cache/res_baseline.pickle` or `cache/res_improved.pickle` or both. If that is not the case, please run the following notebook:
* `SocraticFlanT5 - Caption Generation.ipynb`
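
A quick way to check which of the caption files are present before running the evaluation could look like the sketch below. It only relies on the file naming stated above (`cache/res_<approach>.pickle`); the helper `available_approaches` is hypothetical and not part of the project code.

```python
from pathlib import Path


def available_approaches(cache_dir="cache", approaches=("baseline", "improved")):
    """Return the approaches whose generated-caption pickle exists in cache_dir."""
    cache_dir = Path(cache_dir)
    return [a for a in approaches if (cache_dir / f"res_{a}.pickle").is_file()]


found = available_approaches()
if not found:
    print("No generated captions found; run the caption generation notebook first.")
else:
    print(f"Generated captions available for: {', '.join(found)}")
```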

### Set-up

#### Loading the required packages

In [9]:
from image_captioning import ClipManager, ImageManager, VocabManager, FlanT5Manager, COCOManager
from eval import SocraticEvalCap
from utils import get_device
from transformers import set_seed
import os
import re
import json
import numpy as np
import pickle
import time
import random
import pandas as pd

### Evaluate the generated captions against the ground truth

#### Load the ground truth annotations

In [None]:
imgs_folder = 'imgs/val2017/'
annotation_file = 'annotations/annotations/captions_val2017.json'

with open(annotation_file, 'r') as f:
    lines = json.load(f)['annotations']
gts = {}
for item in lines:
    if item['image_id'] not in gts:
        gts[item['image_id']] = []
    gts[item['image_id']].append({'image_id': item['image_id'], 'caption': item['caption']})

#### Compute the embeddings for the gt captions

In [None]:
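# Requires `clip_manager` (a ClipManager instance) to be defined before this cell runs.
if not os.path.exists('cache/embed_capt_gt.pickle'):
    embed_capt_gt = {}
    for img_id, list_of_capt_dict in gts.items():
        list_of_captions = [capt_dict['caption'] for capt_dict in list_of_capt_dict]

        # One text embedding per ground-truth caption (typically 5 x 768 per image)
        capt_feats_gt = clip_manager.get_text_feats(list_of_captions)

        embed_capt_gt[img_id] = capt_feats_gt

    # Make sure the cache directory exists before writing to it
    os.makedirs('cache', exist_ok=True)
    with open('cache/embed_capt_gt.pickle', 'wb') as handle:
        pickle.dump(embed_capt_gt, handle, protocol=pickle.HIGHEST_PROTOCOL)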

#### Evaluation

In [None]:
approaches = ['baseline', 'improved']
eval_cap = {}

for approach in approaches:

    caption_file_path = f'cache/res_{approach}.pickle'

    # Skip approaches whose captions have not been generated yet
    if not os.path.exists(caption_file_path):
        print(f'{caption_file_path} not found, skipping {approach}')
        continue

    # Load the generated captions
    with open(caption_file_path, 'rb') as handle:
        res = pickle.load(handle)
    evaluator = SocraticEvalCap(gts, res, approach=approach)

    # Create the per-approach entry before filling it (avoids a KeyError below)
    eval_cap[approach] = {}

    # Rule-based metrics
    evaluator.evaluate_rulebased()
    eval_rulebased = {}
    for metric, score in evaluator.eval.items():
        print(f'{metric}: {score:.3f}')
        eval_rulebased[metric] = round(score, 5)
    eval_cap[approach]['rulebased'] = eval_rulebased

    # Embedding-based metric
    evaluator.evaluate_cossim()
    for source_caption, sim in evaluator.sims.items():
        print(f'{source_caption}: avg = {sim[0]:.3f}, std = {sim[1]:.3f}')
    eval_cap[approach]['cossim'] = evaluator.sims

#### Save the evaluation scores

In [None]:
with open('eval_cap.pickle', 'wb') as handle:
    pickle.dump(eval_cap, handle, protocol=pickle.HIGHEST_PROTOCOL)
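
To inspect the saved scores later, the nested dictionary can be flattened into one row per approach and metric. The sketch below assumes the structure built in the evaluation cell, in particular that each `cossim` value is an `(avg, std)` pair as implied by the `sim[0]` and `sim[1]` accesses. The helper `flatten_scores`, the key names inside `cossim`, and all numbers in `example` are placeholders for illustration, not results.

```python
def flatten_scores(eval_cap):
    """Turn {approach: {'rulebased': {...}, 'cossim': {...}}} into a list of rows."""
    rows = []
    for approach, scores in eval_cap.items():
        for metric, value in scores.get('rulebased', {}).items():
            rows.append({'approach': approach, 'metric': metric,
                         'mean': value, 'std': None})
        # Assumption: each cossim entry is an (avg, std) pair
        for source, (avg, std) in scores.get('cossim', {}).items():
            rows.append({'approach': approach, 'metric': f'cossim_{source}',
                         'mean': avg, 'std': std})
    return rows


# Placeholder structure with dummy numbers, only to show the shape of the data
example = {
    'baseline': {
        'rulebased': {'metric_a': 0.1, 'metric_b': 0.2},
        'cossim': {'source_x': (0.3, 0.05)},
    },
}

rows = flatten_scores(example)
for row in rows:
    print(row)
```

With the real file, `eval_cap` would come from `pickle.load` on `eval_cap.pickle`, and the resulting list of rows can be passed directly to `pd.DataFrame(rows)` for tabular display.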