## Evaluating the pretrained speaker: MS COCO Captions evaluation

The following notebook is to set up / re-use public code for image caption evaluation following the pipeline suggested in the original paper on MS COCO Captions. Additionally, given the specifics of the downstream task (namely, sampling from the model and minimizing the CCE loss against that), a baseline for the validation PPL on sampled captions is computed as well.  

Due to installation difficulties, and because it was not a part of the original paper, the SPICE score computation is commented out in the cloned code and not performed in the present evaluation. 

#### Utils
In order to compute the standard image caption evaluation metrics, the code provided in [this](https://github.com/daqingliu/coco-caption) repo is used. Since it requires the results to be formatted in a specific syntax, script below performs some utility mapping of validation annotation IDs to  validation image IDs. 

--> i need to produce {'image_id': XXX, 'caption': 'lower cased string'} items when validating the model. When iterating over items with my data loader, I get annotation IDs. So i need to map ann IDs to img IDs. I can do that via th COCO.loadAnns(annIds) and then retrieve 'image_id'.

In [1]:
# reproducing the desired results format
import json
import torch
from pycocotools.coco import COCO
import math
import pandas as pd
from torchvision import transforms
import os
import sys
import torch.nn as nn
import numpy as np

In [2]:
# just creating a file for my entire val split
val_ids = torch.load("val_split_IDs_from_COCO_train.pt")
print(len(val_ids))
coco = COCO("../../../data/train/annotations/captions_train2014.json")
val_imgIDs = [coco.loadAnns(i)[0]['image_id'] for i in val_ids]
print(len(val_imgIDs))
# torch.save(val_imgIDs, "val_split_imgIDs_from_COCO_train.pt")

264048
loading annotations into memory...
Done (t=0.52s)
creating index...
index created!
264048


A further restriction is that there must be only one produced caption per image so the evaluation happens on
one annID per unique image only -- the metrics are computed relative to all 5 ground truth captions anyways, but they are retrieved within the pipeline. Below, the respective annotation ID and image ID files are created. 

In [18]:
# i also need to create a new test split which only contains one annotation per unique image
# get images i know werent used for pretraining
with open("imgID2annID.json", "r") as fp:
    f = json.load(fp)
imgIDs4val = list(f.keys())[30000:]

ann4val_unqIm = [f[i][0] for i in imgIDs4val]
# torch.save(torch.tensor(ann4val_unqIm), "val_split_annIDs_singular_from_COCO_train_tensor.pt")

In [2]:
# load unique data
val_ids = torch.load("val_split_annIDs_singular_from_COCO_train_tensor.pt").tolist()
print(len(val_ids))
coco = COCO("../../../data/train/annotations/captions_train2014.json")
val_imgIDs = [coco.loadAnns(i)[0]['image_id'] for i in val_ids]
print(len(val_imgIDs))
# torch.save(val_imgIDs, "val_split_imgIDs_singular_from_COCO_train.pt")

52783
loading annotations into memory...
Done (t=0.51s)
creating index...
index created!
52783


#### Evaluation function
The wrapper below takes in a trained model and performs the evaluation on a set number of validation images (e.g., a set including the images used in the reference game, or images used neither in pretraining nor experiments). 

In [3]:
# import agent modules from the actual repo
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [4]:
from agents.speaker import DecoderRNN
from utils.build_dataset import get_loader
from reference_game_utils.update_policy import clean_sentence
from utils.vocabulary import Vocabulary

In [5]:
from coco_caption.pycocotools.coco import COCO 
from coco_caption.pycocoevalcap.eval import COCOEvalCap # TODO in readme add point about renaming!
import skimage.io as io

import json
from json import encoder
encoder.FLOAT_REPR = lambda o: format(o, '.3f')

In [31]:
def evaluate_speaker(
    model_path: str,
    num_val_imgs: int,
    res_path: str,
    val_ppl_path: str,
    metrics_res_path: str,
    vocab_file: str,
    download_dir: str,
    val_file: str, 
    val_imgIDs_file: str,
    vocab_threshold: int = 25,
    batch_size: int = 1,
    embed_size: int = 512,
    visual_embed_size: int = 512,
    hidden_size: int = 512,
    decoding_strategy: str = "greedy",
) -> None:
    """
    Evaluate a pretrained speaker (image captioner) model for MS COCO by computing: 
    1) validation loss + perplexity given a paricular decoding strategy
    2) image captioning evaluation metrics from MS COCO Captions:
        BLEU, ROUGE, METEOR and CIDEr.
    All results including produced captions are saved.
    
    Arguments:
    ---------
    model_path: str
        Path to speaker model weights.
    num_val_imgs: int
        Number of validation images to be used.
    res_path: str
        Path and name of file where produced captions will be saved.
    val_ppl_path: str
        Path and name of file where batch-wise validation loss and PPL will be written to.
    metrics_res_path: str
        Path and name of file where COCO Captions metrics will be written to.
    vocab_file: str
        Path to vocab file.
    download_dir: str
        Directory with annotations.
    val_file: str
        Path to file holding UNIQUE per image annotation IDs from validation set.
    val_imgIDs_file: str
        Path to image IDs corresponding to the validation annotation IDs above.
    vocab_threshold: int = 25
        Minimal token count used in vocabulary construction.
    batch_size: int = 1
        Must be 1.
    embed_size: int = 512
        Dimensionality of embeddings. Must correspond to pretraining settings.
    visual_embed_size: int = 512
        Dimensionality of image vectors. Must correspond to pretraining settings.
    hidden_size: int = 512
        Dimensionality of the hidden layer. Must correspond to pretraining settings.
    decoding_strategy: str = "greedy"
        Decoding strategy to be used in sampling. 
        Available options: "greedy", "exp", "pure", "topk_temperature", "encoding".
        If "encode" is used, ground truth captions are just passed through the LSTM in training mode,
        no decoding is taking place.
    """
    assert batch_size == 1, "Only batch_size=1 evaluations are supported!"
    
    # data loader
    transform_test = transforms.Compose([transforms.Resize((224, 224)), 
                                         transforms.ToTensor(), \
                                         transforms.Normalize((0.485, 0.456, 0.406), \
                                                          (0.229, 0.224, 0.225))])
    data_loader_test = get_loader(transform=transform_test,
                         mode='train',
                         batch_size=batch_size,
                         vocab_threshold=vocab_threshold,
                         vocab_from_file=True,
                         download_dir=download_dir, 
                         vocab_file=vocab_file,
                         dataset_path=val_file, 
                         num_imgs=num_val_imgs,
                         embedded_imgs=torch.load("../train_logs/COCO_train_ResNet_features_reshaped_dict.pt"),
                        )
    # add img IDs
    data_loader_test.dataset._img_ids_flat = torch.load(val_imgIDs_file)[:num_val_imgs]
    val_imgIDs = torch.load(val_imgIDs_file)[:num_val_imgs]
    
    vocab_size = len(data_loader_test.dataset.vocab)
    # load model
    decoder = DecoderRNN(
        embed_size,
        hidden_size,
        vocab_size,
        visual_embed_size,
    )
    decoder.eval()
    criterion = nn.CrossEntropyLoss().cuda() if torch.cuda.is_available() else nn.CrossEntropyLoss()
    hidden = decoder.init_hidden(batch_size)
    
    # instantiate results 
    results = []
    val_running_loss = 0.0
    val_running_ppl = 0.0
    losses_list = []
    ppl_list = []
    counter = 0
    total = 0
    
    num_steps = math.ceil(len(data_loader_test.dataset)/batch_size)
    
    softmax = nn.Softmax(dim = -1)
        
    for i in range(num_steps):
        counter += 1
        # manually construct indices to avoid duplications bc of length of examples or random repetitions
        indices = [(i, 0)]
        
        new_sampler = torch.utils.data.sampler.SubsetRandomSampler(indices=indices)
        data_loader_test.batch_sampler.sampler = new_sampler

        # Obtain the batch.
        targets, distractors, target_features, distractor_features, target_captions, distractor_captions = next(iter(data_loader_test)) 
        
        both_images = torch.cat((target_features.unsqueeze(1), distractor_features.unsqueeze(1)), dim=1)
        # retrieve image IDs
        batch_img_ids = [val_imgIDs[i[0]] for i in indices]
        
        max_seq_len = target_captions.shape[1]-1
        
        with torch.no_grad():
            # get prediction
            if decoding_strategy == "encoding":
                outputs, _ = decoder(both_images, target_captions, hidden)
                norm_outputs = softmax(outputs)
                _, captions_pred = torch.max(norm_outputs, dim = -1)
                
            else: 
                captions_pred, log_probs, outputs, entropies = decoder.sample(
                    both_images, 
                    max_sequence_length=max_seq_len, 
                    decoding_strategy=decoding_strategy
                )
            # transform to natural language
            nl_captions_pred = clean_sentence(captions_pred, data_loader_test)
            
            # append to results list together with img ID
            for i, c in list(zip(batch_img_ids, nl_captions_pred)):
                
                if "end" in c.split(" "):
                    len_c = sum([1 for x in c.split(" ")[:c.split(" ").index("end")] if x != "end" ])
                else:
                    len_c = len(c.split(" "))            
                
                results.append({"image_id": i, "caption": " ".join(c.split()[:len_c])})
                
            # compute val PPL
            loss = criterion(outputs.transpose(1,2), target_captions[:, 1:]) 
            losses_list.append(loss.item())
            ppl = np.exp(loss.item())
            ppl_list.append(ppl)

            val_running_loss += loss.item()
            val_running_ppl += ppl
    
    print("Final average loss: ", val_running_loss / counter)
    print("Final average PPL: ", val_running_ppl / counter)
        
    # check if results dir exists
    os.makedirs("../../../data/speaker_eval_results/", exist_ok=True)
    
    # write out results file
    with open(res_path, "w") as f:
        json.dump(results, f)
    # write out validation PPLs
    df_out = pd.DataFrame({
        "loss": losses_list,
        "PPL": ppl_list,
    })
    df_out.to_csv(val_ppl_path)
    
    # now compute the evaluations, as proposed in the notebook from the repo referenced above 
    cocoRes = coco.loadRes(res_path)
    # create cocoEval object by taking coco and cocoRes
    cocoEval = COCOEvalCap(data_loader_test.dataset.coco, cocoRes)

    cocoEval.params['image_id'] = cocoRes.getImgIds()

    # evaluate results
    cocoEval.evaluate()
    coco_metrics = cocoEval.eval.items()
    # construct out file 
    metrics_df = pd.DataFrame(cocoEval.eval, index = [0]).round(3)
    metrics_df.to_csv(
        metrics_res_path
    )
    print("Final coco metrics: ", coco_metrics)
    

In [32]:
evaluate_speaker(
    model_path="../models/decoder-coco-512dim-scheduled_sampling_wGreedyDecoding_k150-1.pkl",
    num_val_imgs=64,
    res_path="../../../data/speaker_eval_results/decoder_scheduled_sampling_wGreedyDecoding_k150-1_results.json",
    val_ppl_path="../../../data/speaker_eval_results/decoder_scheduled_sampling_wGreedyDecoding_k150-1_val.json",
    metrics_res_path="../../../data/speaker_eval_results/decoder_scheduled_sampling_wGreedyDecoding_k150-1_COCO_metrics.csv",
    vocab_file="vocab4000.pkl",
    download_dir="../../../data/train",
    val_file="val_split_annIDs_singular_from_COCO_train_tensor.pt", 
    val_imgIDs_file="val_split_imgIDs_singular_from_COCO_train.pt",
    batch_size=1,
    decoding_strategy="encoding",
)

Vocabulary successfully loaded from vocab.pkl file!
loading annotations into memory...
Done (t=0.42s)
creating index...
index created!
Obtaining caption lengths...


100%|██████████| 64/64 [00:00<00:00, 74153.44it/s]
100%|██████████| 3700/3700 [00:00<00:00, 104442.65it/s]

Loader ids  64
Loader ids  64
IMG IDS  64
IMG IDS  64





Final average loss:  8.345450565218925
Final average PPL:  4216.196820365262
Loading and preparing results...
DONE (t=0.02s)
creating index...
index created!
Len img ids:  64
[341245, 521150, 568955, 393602, 576543, 92639, 521132, 573988, 86831, 127575, 138859, 31673, 322670, 353357, 422423, 461236, 38435, 14824, 25644, 577685, 316622, 315808, 392404, 209261, 190723, 525891, 265209, 321213, 59622, 344969, 159683, 64751, 451312, 68764, 83815, 53957, 157434, 552159, 173515, 122343, 324901, 47293, 474653, 420487, 62821, 515186, 64818, 469644, 519696, 200404, 289204, 451381, 113812, 70411, 386074, 443432, 239559, 132617, 318815, 530040, 417528, 69758, 296696, 553719]
Len imgIds  64
64
64
tokenization...
setting up scorers...
computing Bleu score...
Img Ids in BLEU:  [14824, 25644, 31673, 38435, 47293, 53957, 59622, 62821, 64751, 64818, 68764, 69758, 70411, 83815, 86831, 92639, 113812, 122343, 127575, 132617, 138859, 157434, 159683, 173515, 190723, 200404, 209261, 239559, 265209, 289204, 29

METEOR: 0.007
computing Rouge score...
ROUGE_L: 0.019
computing CIDEr score...
CIDEr: 0.001
Final coco metrics:  dict_items([('Bleu_1', 0.016497461928913076), ('Bleu_2', 1.509521434943508e-10), ('Bleu_3', 3.256203541872574e-13), ('Bleu_4', 1.5513939494595848e-14), ('METEOR', 0.0069023193987303765), ('ROUGE_L', 0.0189564366156952), ('CIDEr', 0.0012282727200121127)])
