# Automatic Evaluation for Question Generation and Question Answering 

Commonly used metrics for evaluating generated text in machine learning include [BLEU](https://en.wikipedia.org/wiki/BLEU) and [ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)). Both focus heavily on n-gram overlap between the reference text and the generated text and, depending on the variant, may or may not be sensitive to word order.

However, given the generative nature of our problem, these scores may not fully indicate how "good" the results are. For example, the following generated question is a valid question for the context, but it has almost no overlap with the "gold standard" reference question and therefore receives poor BLEU and ROUGE scores:

- **Context**: As this was the 50th Super Bowl, the league emphasized the "Golden Anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.
- **Reference Question**: If Roman numerals were used, what would Super Bowl 50 have been called?
- **Generated Question**: What is the “Golden Anniversary?”
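
To make the low overlap concrete, the sketch below counts clipped n-gram matches between the two questions above. It is a minimal illustration only: the regex tokenizer and the helper names (`tokenize`, `ngrams`, `clipped_precision`) are assumptions made for this example and are not the tokenizer or scorer used by the evaluation package.

```python
import re
from collections import Counter


def tokenize(text):
    # Simplified tokenizer (an assumption): lowercase, keep alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    # All contiguous n-grams of the token list, as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    # Fraction of candidate n-grams that also occur in the reference,
    # counting each reference n-gram at most as often as it appears there.
    cand = Counter(ngrams(tokenize(candidate), n))
    ref = Counter(ngrams(tokenize(reference), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / float(total)


reference = "If Roman numerals were used, what would Super Bowl 50 have been called?"
generated = 'What is the "Golden Anniversary?"'

for n in (1, 2):
    print("%d-gram precision: %.2f" % (n, clipped_precision(generated, reference, n)))
# 1-gram precision: 0.20
# 2-gram precision: 0.00
```

Only the word "what" is shared, so one of five unigrams matches and no bigram does, even though the generated question is perfectly reasonable for the context.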

Many scholars therefore use these scores in conjunction with human ratings of "naturalness," clarity, and similar qualities to gain a better understanding of model performance. The evaluation package provided by [Chen et al. (2015)](https://arxiv.org/pdf/1504.00325.pdf) on [GitHub](https://github.com/tylin/coco-caption), originally written for image captions, includes a suite of metrics (BLEU, METEOR, ROUGE-L, CIDEr, and SPICE) that can be applied to generated questions and answers as well.

The code below and the enclosed subfolders are adapted from the GitHub repo associated with the above-mentioned paper: https://github.com/tylin/coco-caption. The license is located in this folder.

## Imports and Set Up

In [3]:
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

import json
from json import encoder
encoder.FLOAT_REPR = lambda o: format(o, '.3f')

In [4]:
# set up file names and paths
dataDir='./'

annFile_ques = dataDir + 'annotated_data_ques.json'
resFile_ques = dataDir + 'res_data_ques.json'

annFile_ans = dataDir + 'annotated_data_ans.json'
#subtypes = ['results', 'evalImgs', 'eval']
resFile_ans = dataDir + 'res_data_ans.json'

# download Stanford models
#!./get_stanford_models.sh
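
The contents of the annotation and result files are not shown in this notebook. The sketch below illustrates one layout that the COCO caption tools generally accept, assuming the standard COCO caption format: each text pair gets an artificial `image_id`, and the text itself is stored under `caption`. The toy data and the helper names (`build_annotation_file`, `build_result_file`) are hypothetical, and the top-level keys `info`, `licenses`, and `type` are included as an assumption about what the adapted loader expects.

```python
import json

# Hypothetical toy data: one reference and one generated question, keyed by an
# artificial item id that plays the role of the COCO 'image_id'.
references = {0: "If Roman numerals were used, what would Super Bowl 50 have been called?"}
generated = {0: "What is the Golden Anniversary?"}


def build_annotation_file(refs):
    # Reference side: a dataset dict with 'images' and 'annotations' lists.
    return {
        "info": {},
        "licenses": [],
        "type": "captions",
        "images": [{"id": item_id} for item_id in sorted(refs)],
        "annotations": [
            {"image_id": item_id, "id": ann_id, "caption": text}
            for ann_id, (item_id, text) in enumerate(sorted(refs.items()))
        ],
    }


def build_result_file(preds):
    # Result side: a flat list with one generated text per item id.
    return [
        {"image_id": item_id, "caption": text}
        for item_id, text in sorted(preds.items())
    ]


annotation_json = json.dumps(build_annotation_file(references), indent=2)
result_json = json.dumps(build_result_file(generated), indent=2)
print(annotation_json)
print(result_json)
```

Writing these two strings to disk would give files analogous to `annotated_data_ques.json` and `res_data_ques.json`; every `image_id` in the result list should also appear in the annotation file.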

### Question Evaluation

In [5]:
# create coco object and cocoRes object
coco_ques = COCO(annFile_ques)
cocoRes_ques = coco_ques.loadRes(resFile_ques)

loading annotations into memory...
0:00:00.001604
creating index...
index created!
Loading and preparing results...     
DONE (t=0.00s)
creating index...
index created!


In [6]:
# create cocoEval object by taking coco and cocoRes
cocoEval_ques = COCOEvalCap(coco_ques, cocoRes_ques)

# restrict evaluation to the items present in the results; 'image_id' is an artificial tag because this was adapted from image caption scoring
cocoEval_ques.params['image_id'] = cocoRes_ques.getImgIds()

# evaluate results
cocoEval_ques.evaluate()

tokenization...
setting up scorers...
computing Bleu score...
{'reflen': 1063, 'guess': [977, 877, 777, 677], 'testlen': 977, 'correct': [321, 109, 51, 23]}
ratio: 0.919096895578
Bleu_1: 0.301
Bleu_2: 0.185
Bleu_3: 0.127
Bleu_4: 0.089
computing METEOR score...
METEOR: 0.140
computing Rouge score...
ROUGE_L: 0.280
computing CIDEr score...
CIDEr: 0.795
computing SPICE score...
SPICE: 0.178
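
The BLEU values in the log can be reproduced from the statistics dictionary printed just before them. The sketch below is a simplified corpus-level BLEU (geometric mean of n-gram precisions times a brevity penalty); it assumes all counts are nonzero and omits the small smoothing constants a real scorer may add, so it is an illustration of the arithmetic rather than the package's implementation. The function name `bleu_from_stats` is hypothetical.

```python
import math


def bleu_from_stats(correct, guess, testlen, reflen):
    # correct[n-1]: matched n-grams, guess[n-1]: n-grams in the generated text.
    # Assumes every count is nonzero (no smoothing is applied here).
    ratio = testlen / float(reflen)
    # Brevity penalty: only applied when the generated text is shorter overall.
    brevity_penalty = 1.0 if ratio >= 1.0 else math.exp(1.0 - 1.0 / ratio)
    scores = []
    log_precision_sum = 0.0
    for n, (c, g) in enumerate(zip(correct, guess), start=1):
        log_precision_sum += math.log(c / float(g))
        # BLEU-n is the geometric mean of the first n precisions, times the penalty.
        scores.append(brevity_penalty * math.exp(log_precision_sum / n))
    return scores


# Statistics copied from the question evaluation log above.
question_stats = {'reflen': 1063, 'guess': [977, 877, 777, 677],
                  'testlen': 977, 'correct': [321, 109, 51, 23]}

question_bleu = bleu_from_stats(question_stats['correct'], question_stats['guess'],
                                question_stats['testlen'], question_stats['reflen'])
print(["%.3f" % s for s in question_bleu])
# ['0.301', '0.185', '0.127', '0.089']
```

Here the length ratio is 977 / 1063 ≈ 0.919, so the brevity penalty of about 0.916 lowers every score; for example, Bleu_1 is (321 / 977) × 0.916 ≈ 0.301.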


In [7]:
# Print summary of evaluation scores
for metric, score in cocoEval_ques.eval.items():
    print('%s: %.3f' % (metric, score))

CIDEr: 0.795
Bleu_4: 0.089
Bleu_3: 0.127
Bleu_2: 0.185
Bleu_1: 0.301
ROUGE_L: 0.280
METEOR: 0.140
SPICE: 0.178


### Answer Evaluation

In [8]:
# create coco object and cocoRes object
coco_ans = COCO(annFile_ans)
cocoRes_ans = coco_ans.loadRes(resFile_ans)

loading annotations into memory...
0:00:00.000828
creating index...
index created!
Loading and preparing results...     
DONE (t=0.00s)
creating index...
index created!


In [9]:
# create cocoEval object by taking coco and cocoRes
cocoEval_ans = COCOEvalCap(coco_ans, cocoRes_ans)

# restrict evaluation to the items present in the results; 'image_id' is an artificial tag because this was adapted from image caption scoring
cocoEval_ans.params['image_id'] = cocoRes_ans.getImgIds()

# evaluate results
cocoEval_ans.evaluate()

tokenization...
setting up scorers...
computing Bleu score...
{'reflen': 281, 'guess': [285, 185, 125, 85], 'testlen': 285, 'correct': [179, 95, 49, 23]}
ratio: 1.01423487544
Bleu_1: 0.628
Bleu_2: 0.568
Bleu_3: 0.502
Bleu_4: 0.430
computing METEOR score...
METEOR: 0.372
computing Rouge score...
ROUGE_L: 0.679
computing CIDEr score...
CIDEr: 3.297
computing SPICE score...
SPICE: 0.529


In [10]:
# Print summary of evaluation scores
for metric, score in cocoEval_ans.eval.items():
    print('%s: %.3f' % (metric, score))

CIDEr: 3.297
Bleu_4: 0.430
Bleu_3: 0.502
Bleu_2: 0.568
Bleu_1: 0.628
ROUGE_L: 0.679
METEOR: 0.372
SPICE: 0.529
