Evaluating consistency of Question-Answering Models

This repository contains code for creating implications and evaluating the consistency of question-answering models, as described in the following paper:

Are Red Roses Red? Evaluating Consistency of Question-Answering Models
Marco Tulio Ribeiro, Carlos Guestrin, Sameer Singh
Association for Computational Linguistics (ACL), 2019

Installation

  1. Clone this repository and cd to the folder:
git clone git@github.com:marcotcr/qa_consistency.git
cd qa_consistency
  2. Create and activate a virtual environment, e.g.:
virtualenv -p python3.6 env
source env/bin/activate
  3. Run the following, replacing [gpu] with [cpu] if you don't have a GPU:
pip install cython numpy
pip install benepar[gpu]
pip install -e .
cd qa_consistency
git clone https://github.com/kelvinguu/qanli.git
cd ..
python -c "import benepar;benepar.download('benepar_en_small')"
python -m spacy download en_core_web_sm
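After the steps above, a quick check that the expected modules can be found may save a confusing import error later. The snippet below is a minimal sketch, not part of this repository: the `REQUIRED` list and the `missing_modules` helper are illustrative names, and the list simply reflects the modules imported elsewhere in this README.

```python
import importlib.util

# Top-level module names the installation steps are expected to make importable.
# tensorflow is listed because the usage snippets in this README import it.
REQUIRED = ['benepar', 'spacy', 'tensorflow', 'qa_consistency']


def missing_modules(names):
    """Return the names that cannot be located on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print('Missing modules: ' + ', '.join(missing))
else:
    print('All required modules were found.')
```

This only checks that the modules can be located; it does not verify that the benepar or spaCy model downloads succeeded.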

Generating implications

VQA

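import tensorflow as tf
# Let TensorFlow allocate GPU memory on demand (TF 1.x session API)
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
import qa_consistency.implication
gen = qa_consistency.implication.ImplicationsVQA()
# Returns a list of (implied question, expected answer, implication type) tuples
gen.implications('How many birds?', '3')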

[('Are there 3 birds ?', 'yes', 'yeseqcount'),
('Are there 4 birds ?', 'no', 'n+1'),
('Are there any birds ?', 'yes', 'ans>0 implies some')]
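Each tuple above pairs an implied question with the answer a consistent model should give and the implication type. To make the idea concrete, here is a minimal sketch of scoring a model against such tuples. It is not the evaluation code shipped in this package; `consistency_by_type` and the `predict` callable are hypothetical names, and the always-"yes" model is a toy stand-in.

```python
from collections import defaultdict


def consistency_by_type(implications, predict):
    """Fraction of implied questions answered as expected, per implication type.

    `implications` is a list of (question, expected_answer, implication_type)
    tuples. `predict` is a hypothetical callable standing in for a model:
    it takes a question string and returns an answer string.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for question, expected, imp_type in implications:
        totals[imp_type] += 1
        if predict(question).strip().lower() == expected.strip().lower():
            hits[imp_type] += 1
    return {imp_type: hits[imp_type] / totals[imp_type] for imp_type in totals}


imps = [
    ('Are there 3 birds ?', 'yes', 'yeseqcount'),
    ('Are there 4 birds ?', 'no', 'n+1'),
    ('Are there any birds ?', 'yes', 'ans>0 implies some'),
]
# Toy model that answers "yes" to everything (an assumption for illustration).
scores = consistency_by_type(imps, lambda question: 'yes')
print(scores)
```

A model that always says "yes" is consistent on the two implications whose expected answer is "yes" and inconsistent on the `n+1` implication.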

SQuAD

import tensorflow as tf
# Let TensorFlow allocate GPU memory on demand (TF 1.x session API)
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
import qa_consistency.implication
gen = qa_consistency.implication.ImplicationsSquad()
passage = 'Kublai originally named his eldest son, Zhenjin, as the Crown Prince, \
but he died before Kublai in 1285.'
# SQuAD implications also take the passage the answer was extracted from
gen.implications('When did Zhenjin die?', '1285', passage)

[('Who died in 1285?', 'Zhenjin', 'subj')]
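To generate implications for many examples rather than one, the same call can be wrapped in a loop. The sketch below assumes only the call shape shown above, `gen.implications(question, answer, passage)`; the `collect_implications` helper, the example ids, and the `StubGenerator` class are hypothetical, and the stub exists only so the sketch runs without loading the real models. Skipping examples that raise is a defensive assumption, not documented behavior.

```python
def collect_implications(gen, examples):
    """Map each example id to the implications generated for it.

    `examples` is an iterable of (example_id, question, answer, passage).
    Examples that raise or produce no implications are skipped.
    """
    collected = {}
    for example_id, question, answer, passage in examples:
        try:
            imps = gen.implications(question, answer, passage)
        except Exception:
            # Defensive assumption: one failing example should not stop the run.
            continue
        if imps:
            collected[example_id] = imps
    return collected


class StubGenerator:
    """Hypothetical stand-in for ImplicationsSquad, for illustration only."""

    def implications(self, question, answer, passage):
        if question == 'When did Zhenjin die?':
            return [('Who died in 1285?', 'Zhenjin', 'subj')]
        return []


passage = ('Kublai originally named his eldest son, Zhenjin, as the Crown Prince, '
           'but he died before Kublai in 1285.')
examples = [
    ('ex-1', 'When did Zhenjin die?', '1285', passage),
    ('ex-2', 'Who was Kublai?', 'Kublai', passage),
]
collected = collect_implications(StubGenerator(), examples)
print(collected)
```

With the real generator, replace `StubGenerator()` with the `gen` object created above.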

Evaluating the consistency of models

VQA

Download and extract precomputed implications here. Create a folder for the consistency dataset (CONSISTENCY_FOLDER). Write your model's predictions to a JSON file (PRED_FILE) in the VQA format. Then run:

import pickle
import qa_consistency.dataset_utils
# Load the precomputed implications extracted in the previous step
with open('vqa_imps.pkl', 'rb') as f:
    all_imps = pickle.load(f)
# vqa_path is not defined in this snippet; set it to the location of your VQA data
vqa = qa_consistency.dataset_utils.load_vqa(vqa_path, 'validation')
# Uncomment the line below if you want vqa v2
# vqa = qa_consistency.dataset_utils.load_vqav2(vqa_path, 'validation')
qa_consistency.dataset_utils.generate_implication_vqa(vqa, PRED_FILE, all_imps, CONSISTENCY_FOLDER)

This will write CONSISTENCY_FOLDER/{questions,annotations}.json. Next, run your model on these files to generate a new prediction file (CONSISTENCY_PRED_FILE), then run:

stats = qa_consistency.dataset_utils.evaluate_consistency_vqa(CONSISTENCY_FOLDER, CONSISTENCY_PRED_FILE)
print('Consistency by implication type:')
print()
for x, v in stats.items():
    # 'all' holds the overall average and is printed last
    if x == 'all':
        continue
    print('%s : %.1f' % (x, 100 * v))
print()
print('Avg  : %.1f' % (100 * stats['all']))
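The VQA steps above ask for prediction files "in the VQA format" without spelling it out. The sketch below assumes this means the results format used by the official VQA evaluation tools: a JSON list of records with `question_id` and `answer` keys. Treat that as an assumption to verify against your own setup. The `write_vqa_results` helper and the ids and answers are made up for illustration.

```python
import json
import os
import tempfile


def write_vqa_results(answers, out_path):
    """Write a {question_id: answer} mapping as a JSON list of result records."""
    results = [{'question_id': qid, 'answer': ans}
               for qid, ans in sorted(answers.items())]
    with open(out_path, 'w') as f:
        json.dump(results, f)
    return results


# Made-up question ids and answers, purely for illustration.
answers = {1: 'yes', 2: '3'}
pred_file = os.path.join(tempfile.mkdtemp(), 'predictions.json')
results = write_vqa_results(answers, pred_file)
print(results)
```

The resulting path would play the role of PRED_FILE or CONSISTENCY_PRED_FILE in the commands above.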

SQuAD

Download and extract precomputed implications here. Let SQUAD_PATH be the path to the original SQuAD dev set JSON (dev-v1.1.json), and let PRED_FILE be your model's predictions on the dev set as a JSON file in the official SQuAD format (a dictionary of id : answer). Run:

import pickle
import qa_consistency.dataset_utils
# Load the precomputed implications extracted in the previous step
with open('squad_imps.pkl', 'rb') as f:
    all_imps = pickle.load(f)
qa_consistency.dataset_utils.generate_implication_squad(
    SQUAD_PATH, PRED_FILE, all_imps, NEW_SQUAD_JSON)

This will generate a new dataset in the SQuAD format at the NEW_SQUAD_JSON path. Next, run your model on this file to generate a new prediction file (CONSISTENCY_PRED_FILE), then run:

stats = qa_consistency.dataset_utils.evaluate_consistency_squad(NEW_SQUAD_JSON, CONSISTENCY_PRED_FILE)
print('Consistency by implication type:')
print()
for x, v in stats.items():
    if x == 'all':
        continue
    print('%s : %.1f' % (x, 100 * v))
print()
print('Avg  : %.1f' % (100 * stats['all']))
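Both prediction files in the SQuAD workflow above are dictionaries of id : answer. As a minimal sketch of producing one from a SQuAD-format dataset, the snippet below walks the standard `data` / `paragraphs` / `qas` nesting and calls a model on each question. The `write_squad_predictions` helper, the `answer_fn` callable, and the toy dataset are hypothetical; a real run would load NEW_SQUAD_JSON with `json.load` and plug in your own model.

```python
import json
import os
import tempfile


def write_squad_predictions(dataset, answer_fn, out_path):
    """Run answer_fn on every question and write {id: answer} as JSON."""
    preds = {}
    for article in dataset['data']:
        for paragraph in article['paragraphs']:
            for qa in paragraph['qas']:
                preds[qa['id']] = answer_fn(qa['question'], paragraph['context'])
    with open(out_path, 'w') as f:
        json.dump(preds, f)
    return preds


# Toy dataset with the SQuAD nesting; the id and text are made up for illustration.
toy = {'data': [{'paragraphs': [{
    'context': 'Zhenjin died in 1285.',
    'qas': [{'id': 'toy-1', 'question': 'Who died in 1285?'}],
}]}]}


def first_word(question, context):
    """Toy stand-in for a model: always answers with the first word of the passage."""
    return context.split()[0]


out_path = os.path.join(tempfile.mkdtemp(), 'consistency_preds.json')
preds = write_squad_predictions(toy, first_word, out_path)
print(preds)
```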

Notebooks where we bring it all together

Code of Conduct

Microsoft Open Source Code of Conduct
