This project implements models that train on the Stanford Question Answering Dataset (SQuAD). The dataset consists of English passage/question pairs in which the answer to each question is a span of text in the passage. The goal of a model trained on SQuAD is to predict that answer span for a given passage/question pair. The official SQuAD website has examples of passages, questions, and answers, as well as a leaderboard of existing models.
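For reference, each entry in the SQuAD (v1.1) JSON files roughly follows the layout sketched below. This is a hand-written illustration of the format, not a record taken from the dataset:

```python
# Illustrative SQuAD v1.1-style record (hand-written example, not from the dataset).
squad_example = {
    "context": "The Amazon rainforest covers most of the Amazon basin of South America.",
    "qas": [
        {
            "id": "example-qid-0001",
            "question": "What does the Amazon rainforest cover?",
            # The answer is a span of the context, given as text plus the
            # character offset where the span starts.
            "answers": [
                {"text": "most of the Amazon basin of South America",
                 "answer_start": 29},
            ],
        }
    ],
}
```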
Specifically, this project implements the models listed in the results table below (Match LSTM, Rnet, Mnemonic reader, and Fusion Net).

I primarily made this for my own education, but the code could be used as a starting point for another project. The models are written in TensorFlow, and the project optionally uses AWS S3 for model checkpointing and data storage.
Model | Dev EM | Dev F1 | Details |
---|---|---|---|
Fusion Net | 73.5% | 82.0% | Checkout `82feaa3f78a51eaeb66c5578c5d5a9f125711312`; `python3 train_local.py --model_type=fusion_net --rnn_size=128 --batch_size=16 --input_dropout=0.4 --rnn_dropout=0.3 --dropout=0.4`; training time ~11 hours on 2 1080 Ti GPUs, ~31 min/epoch |
Mnemonic reader | 71.2% | 80.1% | Checkout `82feaa3f78a51eaeb66c5578c5d5a9f125711312`; `python3 train_local.py --model_type=mnemonic_reader --rnn_size=40 --batch_size=65 --input_dropout=0.3 --rnn_dropout=0.3 --dropout=0.3`; training time ~6 hours on 2 1080 Ti GPUs, ~8 min/epoch |
Rnet | ~60% | ~70% | |
Match LSTM | ~58% | ~68% | |
All results are for a single model rather than an ensemble. I didn't train all models for the same duration and there may be bugs or unoptimized hyperparameters in my implementation.
Thanks to @Bearsuny for identifying an issue in the evaluation. It now uses the official/correct scoring mechanism.
- Python 3
- spaCy and the "en" model
- CoVe vectors - you can skip this, but you will probably need to manually remove any CoVe references in the setup. This also requires PyTorch.
- TensorFlow 1.4
- cuDNN 7 recommended; GPUs required
In order to use AWS S3 for model checkpointing and data storage, you must set up AWS credentials. The AWS documentation describes how to do this.
After your credentials are set up, you can enable S3 in the project by setting the `use_s3` flag to `True` and setting `s3_bucket_name` to the name of your S3 bucket.
```python
f.DEFINE_boolean("use_s3", True, ...)
...
f.DEFINE_string("s3_bucket_name", "<YOUR S3 BUCKET HERE>", ...)
```
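If you want to verify that the bucket is reachable from your environment, a minimal boto3 sketch is shown below. This is only an illustration of talking to the configured bucket; it is not how the project itself performs uploads, and the file names are placeholders:

```python
import boto3

# Placeholder names: replace with your bucket and a local checkpoint file.
bucket_name = "<YOUR S3 BUCKET HERE>"
s3 = boto3.client("s3")

# Upload a local file to the bucket and list what is stored there.
s3.upload_file("checkpoint.data", bucket_name, "checkpoints/checkpoint.data")
for obj in s3.list_objects_v2(Bucket=bucket_name).get("Contents", []):
    print(obj["Key"])
```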
Once the requirements are installed, run the setup script:

```
python3 setup.py
```
The following command will start model training, creating a new model or restoring parameters from the last checkpoint if one exists. After each epoch, the Dev EM/F1 scores are calculated, and if the F1 score is a new high, the model parameters are saved. There is no mechanism to automatically stop training; that must be done manually.
```
python3 train_local.py --num_gpus=<NUMBER OF GPUS>
```
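As a rough illustration of the keep-only-the-best-Dev-F1 checkpointing policy described above, here is a hand-written TensorFlow 1.x sketch with placeholder helpers; it is not the project's actual training loop:

```python
import tensorflow as tf

# Dummy variable so the example graph has something to checkpoint.
step = tf.Variable(0, name="global_step")

def run_training_epoch(sess):
    # Placeholder for one pass over the training data.
    sess.run(step.assign_add(1))

def evaluate_dev(sess):
    # Placeholder returning (EM, F1); the real project computes these with
    # the official SQuAD scoring code.
    return 0.0, float(sess.run(step)) / 100.0

best_f1 = 0.0
saver = tf.train.Saver(max_to_keep=1)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(3):
        run_training_epoch(sess)
        em, f1 = evaluate_dev(sess)
        if f1 > best_f1:  # save only when Dev F1 reaches a new high
            best_f1 = f1
            saver.save(sess, "/tmp/best_model.ckpt")
```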
The following command will evaluate the model on the Dev dataset and print out the exact match (EM) and F1 scores. To make the model outputs easy to use in the SQuAD-compatible format, the predicted answer string for each question is written to a file called `predictions.json` in the `evaluation_dir` directory. In addition, if the `visualize_evaluated_results` flag is true, the passages, questions, and ground truth spans will be written to output files in the directory specified by the `evaluation_dir` flag.
```
python3 evaluate_local.py --num_gpus=<NUMBER OF GPUS>
```
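As a small illustration, the predictions file can be inspected as sketched below. This assumes the standard SQuAD prediction layout of question id mapped to answer string, and the directory name is a placeholder for whatever you passed via the `evaluation_dir` flag:

```python
import json
import os

# "evaluation_dir" is a placeholder; use the directory you passed via the
# evaluation_dir flag.
predictions_path = os.path.join("evaluation_dir", "predictions.json")

with open(predictions_path) as f:
    # Assumed layout: question id -> predicted answer string
    # (the standard SQuAD prediction format).
    predictions = json.load(f)

for question_id, answer in list(predictions.items())[:5]:
    print(question_id, "->", answer)
```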
You can visualize the model loss, gradients, exact match, and F1 scores as the model trains by running TensorBoard from the top-level directory of this repository.
```
tensorboard --logdir=log
```
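For context, scalar values such as the F1 score typically reach the `log` directory through TensorFlow summary ops. The snippet below is a generic TensorFlow 1.x illustration of writing one such scalar, not the project's own logging code:

```python
import tensorflow as tf

# Generic TF 1.x example of writing a scalar that TensorBoard can display.
f1_placeholder = tf.placeholder(tf.float32, name="dev_f1")
f1_summary = tf.summary.scalar("dev_f1", f1_placeholder)

writer = tf.summary.FileWriter("log")
with tf.Session() as sess:
    summary = sess.run(f1_summary, feed_dict={f1_placeholder: 0.80})
    writer.add_summary(summary, global_step=1)
writer.close()
```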