Pre-trained language models for fine-tuning on downstream NLP tasks
So far I have collected three pre-trained LMs: GPT, ELMo, and BERT.
I mainly reproduce and fine-tune these models on the RACE dataset; the experimental results are shown in the sections below.
I hope this repo helps you extend these pre-trained models to other tasks.
This dir includes the BERT model. Here we present experiments on the RACE dataset and the SQuAD dataset (v1.1 and v2.0).
`bash run_classifier_wikiqa.sh`: run fine-tuning experiments on WikiQA
`bash run_classifier_RACE.sh`: run fine-tuning experiments on RACE
`bash extract_features_RACE.sh`: extract representations of RACE and dump them to local disk
`bash run_squad.sh`: run fine-tuning experiments on SQuAD v1.1
`bash run_squad_score.sh`: calculate EM and F1 scores on SQuAD v1.1
`bash run_squad2.0.sh`: run fine-tuning experiments on SQuAD v2.0
`bash run_squad2.0_score.sh`: calculate EM and F1 scores and search for the best null-answer threshold on SQuAD v2.0 (see the sketch after this list)
`bash run_squad2.0_with_best_thres.sh`: rerun the model (prediction only) with the best threshold
This dir includes the ELMo model. Here we present experiments on the RACE dataset.
NOTE: this code must be run in a Python 3 environment!
`python run_race.py`: TODO!
This dir includes the GPT model. This is the original code, which follows openai/finetune-transformer-lm and includes experiments on the ROCStories dataset.
NOTE: you can download the pre-trained parameters from openai/finetune-transformer-lm.
`bash run.sh`: run fine-tuning experiments on ROCStories
This dir includes the GPT model. This is the modified code, which follows openai/finetune-transformer-lm and includes experiments on the RACE dataset (see the input-formatting sketch below).
`bash run.sh`: run fine-tuning experiments on RACE
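For reference, openai/finetune-transformer-lm frames multiple choice as scoring one token sequence per option, shaped like `[start] context [delim] option [classify]`, with a softmax over the classifier states of the candidate sequences. The sketch below shows how a RACE example could be serialized the same way; the special-token ids, the length limit, and the truncation rule are assumptions for illustration, not this repo's exact code.

```python
# Illustrative sketch: turn one RACE question into four GPT input sequences,
# one per answer option, following the [start] context [delim] option
# [classify] recipe from openai/finetune-transformer-lm.

def build_gpt_inputs(article_ids, question_ids, option_ids_list,
                     start_id, delim_id, clf_id, max_len=320):
    """Return one token-id sequence per option; the model is trained with a
    softmax over the hidden states at the final [classify] positions."""
    sequences = []
    for option_ids in option_ids_list:  # RACE provides four options
        tail = question_ids + [delim_id] + option_ids + [clf_id]
        budget = max_len - 1 - len(tail)  # reserve one slot for [start]
        # Truncate the article so the question, option, and special tokens
        # always fit within max_len.
        seq = [start_id] + article_ids[:max(0, budget)] + tail
        sequences.append(seq)
    return sequences
```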
Results on WikiQA:

epochs | batch size | max input length | model | Overlap | MAP | MRR | device |
---|---|---|---|---|---|---|---|
3 | 1 | 256 | BERT | 243/633 | 76.5 | 78.0 | 1 GTX 1080 |
Results on RACE:

epochs | batch size | max input length | model | Accuracy (%) on dev | Accuracy (%) on test | Accuracy (%) on middle test | Accuracy (%) on high test | device |
---|---|---|---|---|---|---|---|---|
3 | 1 | 320 | GPT | 52.22 | 51.54 | 53.90 | 50.57 | 1 GTX 1080 |
3 | 1 | 512 | BERT | 55.65 | 54.11 | 59.26 | 52.00 | 1 GTX 1080 |
3 | 8 | 512/32 | ELMo | 39.39 | 38.57 | 38.23 | 39.02 | 1 GTX 1080 |
25 | 8 | 512/32 | ELMo | 40.64 | 38.04 | 37.67 | 38.25 | 1 GTX 1080 |
Analysis: from the tables above we can see that BERT performs better than GPT on RACE, and ELMo performs much worse. For BERT and GPT I fine-tune only a dense layer, while for ELMo I fine-tune a bilinear attention layer and a bilinear dot operation. The reason ELMo performs poorly is that the vector representations it produces for articles and for questions (and options, if considered) are computed independently of each other, so almost no article-question interaction can be modeled.
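To make that concrete, here is a minimal numpy sketch of such a head over fixed ELMo vectors; the shapes and names are illustrative assumptions, not this repo's actual code.

```python
import numpy as np

# Sketch of the ELMo head described above: a bilinear attention over the
# article token vectors followed by a bilinear dot product with the
# question(+option) vector.

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def score_option(article_vecs, query_vec, W_att, W_dot):
    """article_vecs: (n_tokens, d) fixed ELMo vectors for the article.
    query_vec: (d,) fixed ELMo vector for the question plus one option.
    W_att, W_dot: (d, d) learned bilinear weights; the option with the
    highest score is chosen."""
    attn = softmax(article_vecs @ W_att @ query_vec)  # bilinear attention, (n_tokens,)
    summary = attn @ article_vecs                     # attended article summary, (d,)
    return float(summary @ W_dot @ query_vec)         # bilinear dot -> scalar score
```

Since `article_vecs` and `query_vec` never see each other inside ELMo, the two small bilinear matrices are the only place their interaction can be modeled, whereas BERT and GPT apply full token-level attention over the concatenated article, question, and option.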
The SQuAD results below are based on the dev set and the BERT-Base, Uncased model.
epochs | batch size | max input length | model | dataset | EM (%) | F1 (%) | device |
---|---|---|---|---|---|---|---|
3 | 6 | 384 | BERT | v1.1 | 81.15 | 88.51 | 1 GTX 1080 |
3 | 6 | 384 | BERT | v2.0 | 76.45 | 73.12 | 1 GTX 1080 |
Results on ROCStories:

epochs | batch size | model | Accuracy (%) on dev | Accuracy (%) on test | device |
---|---|---|---|---|---|
3 | 8 | GPT | 89.57 | 84.77 | 1 GTX 1080 |
The data directory structure is shown as follows: