This repository contains code for the paper "How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering".
Our code is mainly based on T5 and mesh-tensorflow and runs on TPUs.
Please follow the instructions in the original T5 repository to set up TPUs properly.
To install required packages, download T5 (version 0.6.4) and mesh-tensorflow (version 0.1.16) and copy source files into the
Don't replace files already in these folders, because those are the files we modified for calibration.
Run the following command to fine-tune the UnifiedQA models with margin objective functions.
$tpu specifies the name of the TPU,
$model_output specifies the output location to save the fine-tuned model,
$objective specifies the objective function to use.
./finetune.sh $tpu 3B $model_output $objective uq_clean_train_ol_mix train mc
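For intuition only, a margin objective over candidate answers can be sketched as a generic hinge loss that pushes the gold answer's score above every other candidate's score by at least a fixed margin. The function name `margin_loss`, the default `margin` value, and the exact form of the loss are assumptions made for illustration; they are not taken from this repository's training code.

```python
def margin_loss(gold_score, other_scores, margin=1.0):
    """Generic hinge-style margin loss (illustrative assumption).

    gold_score:   model score (e.g. log-likelihood) of the correct answer.
    other_scores: scores of the remaining candidate answers.
    margin:       how far the gold score should exceed each other score.
    """
    # Each candidate contributes only if it comes within `margin` of the gold score.
    return sum(max(0.0, margin - (gold_score - s)) for s in other_scores)


if __name__ == "__main__":
    # The first candidate is already 2.0 below the gold score, so it adds nothing;
    # the second is only 0.5 below, so it adds margin - 0.5.
    print(margin_loss(-1.0, [-3.0, -1.5], margin=1.0))
```

A loss of zero means every non-gold candidate is already separated from the gold answer by at least the margin.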
Evaluate candidate answers
Run the following command to compute the probabilities of candidate answers.
$score_output specifies the location to save the output, and
1103000 specifies the checkpoint to use.
./score.sh $tpu $score_output $model_output 1103000 uq_clean_test dev
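One common way to turn per-candidate scores into probabilities is to apply a softmax over the log-likelihoods of all candidates for the same question. The sketch below shows that normalization; whether score.sh performs exactly this step, and the helper name `normalize_candidate_scores`, are assumptions.

```python
import math


def normalize_candidate_scores(log_likelihoods):
    """Softmax over the log-likelihoods of one question's candidate answers."""
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_likelihoods)
    exps = [math.exp(s - m) for s in log_likelihoods]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Hypothetical log-likelihoods for three candidate answers.
    probs = normalize_candidate_scores([-2.3, -0.7, -4.1])
    print(probs, sum(probs))
```

The highest normalized probability can then serve as the model's confidence in its predicted answer.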
Run the following command to compute the expected calibration error (ECE) given the probabilities of candidate answers.
python cal.py --mix uq_clean_test --split dev --score $score_output
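As a reference for what the metric measures, ECE groups predictions into confidence bins and averages the gap between accuracy and mean confidence in each bin, weighted by bin size. The sketch below is a minimal standalone version, assuming 10 equal-width bins; the function name and the binning scheme are assumptions and may differ from what cal.py does.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE with equal-width confidence bins (illustrative assumption).

    confidences: predicted probability of the chosen answer, each in [0, 1].
    correct:     1 if the chosen answer was right, else 0.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    # Assign each prediction to a bin; a confidence of 1.0 goes to the last bin.
    bins = [[] for _ in range(num_bins)]
    for p, c in zip(confidences, correct):
        idx = min(int(p * num_bins), num_bins - 1)
        bins[idx].append((p, c))

    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(p for p, _ in items) / len(items)
        accuracy = sum(c for _, c in items) / len(items)
        # Weight each bin's |accuracy - confidence| gap by its share of predictions.
        ece += (len(items) / n) * abs(accuracy - avg_conf)
    return ece


if __name__ == "__main__":
    # Four predictions at 0.9 confidence, three of them correct.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

A perfectly calibrated model has an ECE of 0; larger values mean confidence and accuracy diverge more.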