# The Curse of Performance Instability in Analysis Datasets: Consequences, Source, and Suggestions
This is the official repo for the following paper:
- The Curse of Performance Instability in Analysis Datasets: Consequences, Source, and Suggestions. Xiang Zhou, Yixin Nie, Hao Tan, and Mohit Bansal. EMNLP 2020 (arxiv)
This code requires Python 3. All dependencies are specified in `requirements.txt`:

```
pip install -r requirements.txt
```
The current code supports computing the decomposed variance metrics from standard evaluation results.
Download the NLI datasets and put them under the `nli_data` folder in the root directory.
Organize the evaluation results of your model under the `models` directory in the same way as the `berts` folder (an example folder showing the results of BERT-base); the name of each folder represents the model type. `MODEL_TYPE/seed_x` saves the evaluation results with seed `x`. Under `MODEL_TYPE/seed_x/`, each folder represents the evaluation results on one dataset and includes three files:
- `eval_results.txt`: final accuracy of the model
- `logits_results.txt`: list of logits output by the model on every example in the dataset
- `pred_results.txt`: list of labels predicted by the model on every example in the dataset
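Putting this together, the expected layout looks roughly like the following (the seed number and the dataset folder names here are only illustrative; use the seeds and datasets you actually evaluated):

```
models/
└── berts/                      # MODEL_TYPE, e.g. the provided BERT-base example
    ├── seed_1/
    │   ├── some_nli_dataset/   # one folder per evaluation dataset
    │   │   ├── eval_results.txt
    │   │   ├── logits_results.txt
    │   │   └── pred_results.txt
    │   └── ...
    └── seed_2/
        └── ...
```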
Run the evaluation script:

```
python variance_report.py MODEL_PATH
```
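To illustrate what a decomposed variance report computes, the sketch below applies the law of total variance to a toy grid of accuracies (seeds × datasets), splitting total variance into a between-dataset part and an average across-seed part. All names and numbers are made up for illustration; this is a sketch of the idea, not the repo's `variance_report.py`:

```python
from statistics import fmean, pvariance

# Hypothetical accuracies: outer keys = seeds, inner keys = datasets.
# (Illustrative numbers only; not results from the paper.)
acc = {
    "seed_1": {"dataset_a": 0.81, "dataset_b": 0.78},
    "seed_2": {"dataset_a": 0.80, "dataset_b": 0.71},
    "seed_3": {"dataset_a": 0.82, "dataset_b": 0.75},
}

datasets = list(next(iter(acc.values())))

# Variance across seeds, computed separately for each dataset.
seed_var = {d: pvariance([acc[s][d] for s in acc]) for d in datasets}

# Law of total variance over all (seed, dataset) runs:
# total = variance of per-dataset means + mean of per-dataset seed variances.
dataset_means = [fmean([acc[s][d] for s in acc]) for d in datasets]
between_data = pvariance(dataset_means)
within_seed = fmean(seed_var.values())
total = pvariance([acc[s][d] for s in acc for d in datasets])

print(f"per-dataset seed variance: {seed_var}")
print(f"between-dataset: {between_data:.6f}  mean across-seed: {within_seed:.6f}  total: {total:.6f}")
```

A large across-seed term relative to the between-dataset term signals that reruns with different random seeds, not dataset choice, dominate the instability.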
Other scripts (training/evaluation/analysis) and model checkpoints used in the paper will be released soon.