# Baseline evaluation of FLAN-T5 on the MS COCO image-caption pairs dataset

---
We use the following metrics, as implemented in the [pycocoevalcap](https://github.com/salaniz/pycocoevalcap) repository:

* **BLEU-4**: BLEU (Bilingual Evaluation Understudy) measures the similarity between the generated captions and the ground truth captions based on n-gram matching. BLEU-4 combines the modified (clipped) precisions of 1-grams to 4-grams in a geometric mean and applies a brevity penalty to generated captions that are shorter than the references.

* **METEOR**: METEOR (Metric for Evaluation of Translation with Explicit ORdering) aligns the unigrams of a generated caption with those of the reference captions, matching exact words, stems and synonyms. It combines unigram precision and recall in a recall-weighted harmonic mean and applies a penalty when the matched words are fragmented, which accounts for word order.

* **CIDEr**: CIDEr (Consensus-based Image Description Evaluation) measures the consensus between a generated caption and the set of ground truth captions. It computes the cosine similarity between TF-IDF weighted n-gram vectors of the generated and reference captions, so n-grams that are rare across the dataset but shared with the references count more than common ones.

* **SPICE**: SPICE (Semantic Propositional Image Caption Evaluation) is a metric that measures the semantic similarity between the generated captions and the ground truth captions. It analyzes the similarity between the semantic propositions present in the generated captions and those in the reference captions, taking into account the structure and meaning of the propositions.

* **ROUGE-L**: ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a metric that measures the similarity between the generated captions and the ground truth captions based on overlapping sequences of words. ROUGE-L measures the longest common subsequence (LCS) between the generated captions and the reference captions, taking into account sentence-level structure and word order.
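
To make the n-gram matching and longest common subsequence ideas above concrete, here is a minimal sketch of the two building blocks behind BLEU and ROUGE-L. It is not the pycocoevalcap implementation: the helper names (`clipped_precision`, `lcs_length`) are hypothetical, tokenization is plain whitespace splitting, and the brevity penalty and F-measure weighting of the full metrics are left out.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n):
    """Modified n-gram precision: each candidate n-gram count is clipped
    to the maximum count of that n-gram in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


# Toy example (made-up candidate, two references)
candidate = "a teddy bear by a gravestone".split()
references = [
    "a stuffed animal and a vase by a gravestone".split(),
    "a brown teddy bear holding a glass vase".split(),
]

p1 = clipped_precision(candidate, references, 1)  # unigram precision
p2 = clipped_precision(candidate, references, 2)  # bigram precision
lcs = lcs_length(candidate, references[0])        # LCS with the first reference
print(p1, p2, lcs)
```

In this toy example every candidate unigram appears in some reference, but only three of the five candidate bigrams do, which is why higher-order n-gram precision is stricter about word order.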

---

We first install the required evaluation scripts and libraries:

In [None]:
!pip install pycocotools
!pip install git+https://github.com/salaniz/pycocoevalcap.git

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/salaniz/pycocoevalcap.git
  Cloning https://github.com/salaniz/pycocoevalcap.git to /tmp/pip-req-build-6orfjmlj
  Running command git clone --filter=blob:none --quiet https://github.com/salaniz/pycocoevalcap.git /tmp/pip-req-build-6orfjmlj
  Resolved https://github.com/salaniz/pycocoevalcap.git to commit a24f74c408c918f1f4ec34e9514bc8a76ce41ffd
  Preparing metadata (setup.py) ... done


In [None]:
from pycocotools.coco import COCO
# from pycocoevalcap.eval import COCOEvalCap
import pycocoevalcap
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.spice.spice import Spice
from pycocoevalcap.rouge.rouge import Rouge

import urllib.request
import json
from google.colab import drive

In [None]:
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


If you are running this notebook for the first time, download the annotations from the COCO website, mount your Google Drive and save the downloaded file there. To do so, uncomment the following cell, run it once, and then comment it out again:

In [None]:
## Download the annotations file from the COCO website
# annotations_url = 'http://images.cocodataset.org/annotations/annotations_trainval2014.zip'
# urllib.request.urlretrieve(annotations_url, 'annotations.zip')
# !unzip annotations.zip

## Copy the file to your Google Drive
# import shutil
# shutil.copy('annotations/captions_val2014.json', '/content/gdrive/MyDrive/')

### Ground truth (GT) annotations

In [None]:
annotations_file = '/content/gdrive/MyDrive/captions_val2014.json'
coco = COCO(annotations_file)

loading annotations into memory...
Done (t=0.46s)
creating index...
index created!


### Generated (GEN) annotations

In [None]:
# TODO: point this at the file with the FLAN-T5 generated captions;
# for now the ground truth annotations file is used as a placeholder
captions_file = '/content/gdrive/MyDrive/captions_val2014.json'
with open(captions_file, 'r') as f:
    captions = json.load(f)

Now we can load the ground truth captions and the generated captions into Python dictionaries keyed by image ID. Each caption should be a string.

In [None]:
# Create dictionaries mapping each image ID to its GT and GEN captions
gt_captions = {}
gen_captions = {}

In [None]:
for ann in captions['annotations']:

    # Placeholder until FLAN-T5 captions are available: each ground truth
    # annotation is treated as the generated caption for its image, so the
    # last annotation of an image overwrites the earlier ones
    image_id = ann['image_id']
    caption = ann['caption']

    # get the IDs of the annotations for the given image ID
    ann_ids = coco.getAnnIds(imgIds=image_id)

    # load the annotations for the given annotation IDs
    anns = coco.loadAnns(ann_ids)

    # extract the reference captions from the annotations
    references = [ref_ann['caption'] for ref_ann in anns]

    # store the references and the generated caption under the image ID
    # NOTE: the references are joined into a single newline-separated string,
    # so the scorers see one long reference per image rather than separate ones
    gt_captions[image_id] = [{'caption': '\n'.join(references)}]
    gen_captions[image_id] = [{'caption': caption}]
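
As an alternative to joining the references into one string, the tokenizer and scorers in pycocoevalcap also accept a list with one `{'caption': ...}` entry per reference for each image ID. The sketch below shows that layout in plain Python. The helper name `build_caption_dicts` and the `{image_id: caption}` mapping for the generated captions are assumptions made for illustration, since the format of the FLAN-T5 output file is not fixed yet.

```python
from collections import defaultdict


def build_caption_dicts(reference_annotations, generated_captions):
    """Build the {image_id: [{'caption': str}, ...]} dictionaries.

    reference_annotations: list of dicts with 'image_id' and 'caption' keys
                           (the layout of captions['annotations'])
    generated_captions:    hypothetical {image_id: caption} mapping
    """
    all_refs = defaultdict(list)
    for ann in reference_annotations:
        # one entry per reference caption, not one joined string
        all_refs[ann['image_id']].append({'caption': ann['caption']})

    # keep only images that have both references and a generated caption
    res = {image_id: [{'caption': caption}]
           for image_id, caption in generated_captions.items()
           if image_id in all_refs}
    gts = {image_id: all_refs[image_id] for image_id in res}
    return gts, res


# Toy data (made up for this example)
toy_refs = [
    {'image_id': 1, 'caption': 'a cat'},
    {'image_id': 1, 'caption': 'a small cat'},
    {'image_id': 2, 'caption': 'a dog'},
]
toy_gen = {1: 'a cat sits'}

toy_gts, toy_res = build_caption_dicts(toy_refs, toy_gen)
print(toy_gts)
print(toy_res)
```

With this layout each reference is scored as a separate caption, so length-sensitive metrics such as BLEU compare the generated caption with references of similar length.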

In [None]:
print(f'There are {len(gt_captions)} images with ground truth (reference) captions. Example reference captions from the annotations: \n{gt_captions[167549]}\n')

print(f'There are {len(gen_captions)} images with generated captions. An example generated caption from the model: \n{gen_captions[167549]}')

There are 40504 images with ground truth (reference) captions. Example reference captions from the annotations: 
[{'caption': 'A stuffed bear and a vase by a headstone.\nA brown teddy bear holding a glass vase in front of a grave.\nA stuffed animal is in she snow in front of a tombstone.\nA gravestone with a vase and stuffed animal on it.\nA stuffed animal and a vase by a gravestone.'}]

There are 40504 images with generated captions. An example generated caption from the model: 
[{'caption': 'A stuffed animal and a vase by a gravestone.'}]


We tokenize the captions using the PTBTokenizer provided by the COCO evaluation toolkit.

In [None]:
# Initialise tokenizer
tokenizer = PTBTokenizer()

gt_captions_tokens = tokenizer.tokenize(gt_captions)
gen_captions_tokens = tokenizer.tokenize(gen_captions)


We initialize the evaluation metrics we want to use (BLEU, METEOR, CIDEr, SPICE and/or ROUGE) and compute the scores of the generated captions with respect to the ground truth captions.

In [None]:
bleu_eval = Bleu()
meteor_eval = Meteor()
cider_eval = Cider()
spice_eval = Spice()
rouge_eval = Rouge()

In [None]:
# Compute BLEU score
bleu_score, _ = bleu_eval.compute_score(gt_captions_tokens, gen_captions_tokens)

# Compute METEOR score
meteor_score, _ = meteor_eval.compute_score(gt_captions_tokens, gen_captions_tokens)

# Compute CIDEr score
cider_score, _ = cider_eval.compute_score(gt_captions_tokens, gen_captions_tokens)

# Compute SPICE score
# spice_score, _ = spice_eval.compute_score(gt_captions, gen_captions)

# Compute ROUGE score
rouge_score, _ = rouge_eval.compute_score(gt_captions_tokens, gen_captions_tokens)
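
The per-metric calls above all share the same `compute_score(gts, res)` interface, so they can also be driven from a single loop. The following is a minimal sketch of that idea, not part of pycocoevalcap: `run_scorers` is a hypothetical helper, and `ConstantScorer` is a stand-in used only so that the example runs without the Java-based scorers.

```python
class ConstantScorer:
    """Hypothetical stand-in that mimics the compute_score(gts, res)
    interface and returns a fixed (score, per_image_scores) pair."""

    def __init__(self, score):
        self.score = score

    def compute_score(self, gts, res):
        return self.score, None


def run_scorers(scorers, references, candidates):
    """Run every scorer and collect the corpus-level scores in one dict.

    A scorer that returns a list (as BLEU does, one value per n-gram
    order) is expanded into separate 'NAME-1', 'NAME-2', ... entries.
    """
    results = {}
    for name, scorer in scorers.items():
        score, _ = scorer.compute_score(references, candidates)
        if isinstance(score, (list, tuple)):
            for n, value in enumerate(score, start=1):
                results[f"{name}-{n}"] = value
        else:
            results[name] = score
    return results


demo = run_scorers(
    {"BLEU": ConstantScorer([0.4, 0.3, 0.2, 0.1]), "CIDEr": ConstantScorer(0.9)},
    {},  # references (unused by the stand-in)
    {},  # candidates (unused by the stand-in)
)
print(demo)
```

In this notebook the same helper could presumably be called as `run_scorers({'BLEU': bleu_eval, 'METEOR': meteor_eval, 'CIDEr': cider_eval, 'ROUGE_L': rouge_eval}, gt_captions_tokens, gen_captions_tokens)`, which would also label the four BLEU values by n-gram order.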

In [None]:
# We print the evaluation scores
print("BLEU score: ", bleu_score)
print("METEOR score: ", meteor_score)
print("CIDEr score: ", cider_score)
# print("SPICE score: ", spice_score)
print("ROUGE score: ", rouge_score)

BLEU score:  [0.018309205099127902, 0.01830920281982215, 0.018309200181732745, 0.018309197076799424]
METEOR score:  0.10194725233647307
CIDEr score:  4.8458567515162145e-08
ROUGE score:  0.29645632888870255
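
One possible reading of the BLEU values above, offered as an interpretation rather than a verified diagnosis: the placeholder generated caption is itself one of the references, so the n-gram precisions should be close to 1, and the score is then dominated by the brevity penalty. Assuming the standard BLEU brevity penalty, and assuming the joined references are treated as one long reference of length $r$ that is roughly five times the candidate length $c$:

$$
\mathrm{BP} = \exp\left(1 - \frac{r}{c}\right) \approx \exp(1 - 5) = e^{-4} \approx 0.0183
$$

This is close to the reported value of about 0.0183 for all four n-gram orders.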


To put these values in perspective, we compute the best possible score for each metric by scoring the generated captions against themselves:

In [None]:
# Best possible BLEU score
best_bleu_score, _ = bleu_eval.compute_score(gen_captions_tokens, gen_captions_tokens)

# Best possible METEOR score
best_meteor_score, _ = meteor_eval.compute_score(gen_captions_tokens, gen_captions_tokens)

# Best possible CIDEr score
best_cider_score, _ = cider_eval.compute_score(gen_captions_tokens, gen_captions_tokens)

# Best possible SPICE score (disabled: SPICE does not work at the moment)
# best_spice_score, _ = spice_eval.compute_score(gen_captions_tokens, gen_captions_tokens)

# Best possible ROUGE score
best_rouge_score, _ = rouge_eval.compute_score(gen_captions_tokens, gen_captions_tokens)

In [None]:
# Print the best possible scores
print("Best BLEU score: ", best_bleu_score)
print("Best METEOR score: ", best_meteor_score)
# print("Best SPICE score: ", best_spice_score)
print("Best CIDEr score: ", best_cider_score)
print("Best ROUGE score: ", best_rouge_score)

Best BLEU score:  [0.9999999999999952, 0.9999999999999951, 0.999999999999995, 0.9999999999999948]
Best METEOR score:  1.0
Best CIDEr score:  10.0
Best ROUGE score:  1.0


# Notes

* Annotations are not available for the test split, only for the train and validation splits. Should we use the validation split?
* Three versions of the MS COCO dataset are available: 2014, 2015 and 2017. Which one is best to use?
* Should we evaluate the caption quality on other datasets as well?
* Should FLAN-T5 be evaluated on the whole validation split or only on a random sample of 100 images (possibly sampled multiple times and averaged)?
* How should a single generated caption be evaluated against the 5 ground truth captions?
* Other possible metrics to consider beyond those of the source paper:
    1. Embed the GT and GEN captions with CLIP and take their cosine similarity or dot product.
        * Motivation: Rule-based methods operate only at the token level and usually do not capture the semantics of a caption.
    2. A [learning-based](https://vision.cornell.edu/se3/wp-content/uploads/2018/03/1501.pdf) discriminative evaluation metric.
        * Motivation: Evaluation metrics for image captioning face two challenges. Firstly, commonly used metrics such as CIDEr, METEOR, ROUGE and BLEU often do not correlate well with human judgments. Secondly, each metric has well-known blind spots to pathological caption constructions, and rule-based metrics lack provisions to repair such blind spots once identified. For example, the newly proposed SPICE correlates well with human judgments, but fails to capture the syntactic structure of a sentence.
* The SPICE metric does not work at the moment.
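
Two of the open questions above, repeated random subsampling and embedding-based similarity, can be sketched in a few lines. This is only an illustration under stated assumptions: `cosine_similarity` works on plain lists of floats standing in for CLIP embeddings (no CLIP model is loaded here), and `subsampled_score` takes a hypothetical `score_fn` callback that maps a list of image IDs to a single score.

```python
import math
import random
import statistics


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def subsampled_score(image_ids, score_fn, sample_size=100, repeats=5, seed=0):
    """Score several random subsets of images and return the mean and
    the (population) standard deviation of the subset scores."""
    rng = random.Random(seed)   # fixed seed for reproducible subsets
    ids = sorted(image_ids)
    scores = []
    for _ in range(repeats):
        subset = rng.sample(ids, min(sample_size, len(ids)))
        scores.append(score_fn(subset))
    return statistics.mean(scores), statistics.pstdev(scores)


# Toy usage with made-up vectors and a dummy scoring function
same = cosine_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
mean_score, std_score = subsampled_score(range(1000), lambda ids: len(ids),
                                         sample_size=100, repeats=3)
print(same, orthogonal, mean_score, std_score)
```

For the metrics in this notebook, `score_fn` could for instance restrict `gt_captions_tokens` and `gen_captions_tokens` to the sampled IDs and return the first element of `cider_eval.compute_score(...)`. Corpus-level metrics such as CIDEr depend on the set of references they are given, so subset scores are not expected to match the full-split score exactly.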