Code to reproduce the BERT intermediate training experiments from Shnarch et al. (2022).
Using this repository you can:
(1) Download the datasets used in the paper;
(2) Run intermediate training that relies on pseudo-labels from the results of the sIB clustering algorithm;
(3) Fine-tune a BERT classifier starting from the default pretrained model (bert-base-uncased) and from the model after intermediate training;
(4) Compare the BERT classification performance with and without the intermediate training stage.
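At a high level, the intermediate training step turns unsupervised cluster assignments into a pseudo-labeled classification task. A minimal sketch of that idea, with toy cluster ids standing in for the output of the sIB clustering algorithm (the function and variable names here are illustrative, not taken from the repository):

```python
# Sketch: cluster assignments become classification pseudo-labels for
# intermediate training. The real pipeline uses sIB over the unlabeled
# train data; here hard-coded toy cluster ids stand in for it.

def make_pseudo_labeled_data(texts, cluster_ids):
    """Pair each unlabeled text with its cluster id, so the cluster id
    serves as the target label of the intermediate classification task."""
    assert len(texts) == len(cluster_ids)
    return [{"text": t, "label": c} for t, c in zip(texts, cluster_ids)]

texts = ["great movie", "terrible plot", "loved it", "boring film"]
cluster_ids = [0, 1, 0, 1]  # toy assignments; the paper uses sIB with 50 clusters
data = make_pseudo_labeled_data(texts, cluster_ids)
print(data[0])  # {'text': 'great movie', 'label': 0}
```

BERT is then fine-tuned on these pseudo-labeled pairs before the final, budget-limited fine-tuning on real labels.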
The framework requires Python 3.8
- Clone the repository locally:
  git clone https://github.com/IBM/intermediate-training-using-clustering
- Go to the cloned directory:
  cd intermediate-training-using-clustering
- Install the project dependencies:
  pip install -r requirements.txt
  Windows users may also need to download the latest Microsoft Visual C++ Redistributable for Visual Studio in order to support tensorflow.
- Run the script:
  python download_and_process_datasets.py
  This script downloads and processes the 8 datasets used in the paper.
The experiment script run_experiment.py accepts the following arguments:
- train_file: path to the train data (e.g. datasets/isear/train.csv).
- eval_file: path to the evaluation data (e.g. datasets/isear/test.csv).
- num_clusters: number of clusters used to generate the task pseudo-labels. Defaults to 50 (as used in the paper).
- labeling_budget: number of examples from the train data used for BERT fine-tuning (in the paper we tested the following budgets: 64, 128, 192, 256, 384, 512, 768, 1024).
- random_seed: seed used for sampling the train data and for model training.
- inter_training_epochs: number of epochs for the intermediate task. Defaults to 1 (as used in the paper).
- finetuning_epochs: number of epochs for fine-tuning BERT over labeling_budget examples. Defaults to 10 (as used in the paper).
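The argument handling described above might be sketched with argparse as follows; the actual flag names and defaults in run_experiment.py may differ, so treat this as an illustration of the interface rather than the script's real implementation:

```python
import argparse

# Illustrative parser for the arguments listed above; defaults follow
# the values stated in this README, not necessarily the real script.
parser = argparse.ArgumentParser()
parser.add_argument("--train_file", required=True)
parser.add_argument("--eval_file", required=True)
parser.add_argument("--num_clusters", type=int, default=50)
parser.add_argument("--labeling_budget", type=int, required=True)
parser.add_argument("--random_seed", type=int, default=0)
parser.add_argument("--inter_training_epochs", type=int, default=1)
parser.add_argument("--finetuning_epochs", type=int, default=10)

args = parser.parse_args([
    "--train_file", "datasets/isear/train.csv",
    "--eval_file", "datasets/isear/test.csv",
    "--labeling_budget", "64",
])
print(args.num_clusters)  # 50
```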
For example:
python run_experiment.py --train_file datasets/yahoo_answers/train.csv --eval_file datasets/yahoo_answers/test.csv --num_clusters 50 --labeling_budget 64 --finetuning_epochs 10 --inter_training_epochs 1 --random_seed 0
The results of the experimental run (accuracy over the eval_file for BERT with and without the intermediate task) are printed to the screen and written to output/results.csv.
Multiple experiments can safely write in parallel to the same output/results.csv file; each new result is appended to the file. In addition, for every new result, an aggregation of all the results so far is written to output/aggregated_results.csv. This aggregation reflects the mean over all runs for each experimental setting (i.e. with/without intermediate training) for a particular eval_file and labeling budget.
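The aggregation step amounts to a group-by-and-mean over the results file. A minimal sketch, assuming hypothetical column names (the real columns of output/results.csv may differ):

```python
from collections import defaultdict
from statistics import mean

# Sketch of the aggregation: average accuracy across runs for each
# (eval_file, labeling_budget, setting) combination. Keys and column
# names are illustrative, not those of the actual results.csv.
results = [
    {"eval_file": "datasets/isear/test.csv", "budget": 64, "setting": "with_inter", "accuracy": 0.52},
    {"eval_file": "datasets/isear/test.csv", "budget": 64, "setting": "with_inter", "accuracy": 0.54},
    {"eval_file": "datasets/isear/test.csv", "budget": 64, "setting": "baseline", "accuracy": 0.45},
]

grouped = defaultdict(list)
for row in results:
    grouped[(row["eval_file"], row["budget"], row["setting"])].append(row["accuracy"])

aggregated = {key: mean(accs) for key, accs in grouped.items()}
print(round(aggregated[("datasets/isear/test.csv", 64, "with_inter")], 2))  # 0.53
```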
To show the effect of the intermediate task across different labeling budgets, run python plot.py. This script generates plots under output/plots for each dataset.
Eyal Shnarch, Ariel Gera, Alon Halfon, Lena Dankin, Leshem Choshen, Ranit Aharonov and Noam Slonim (2022). Cluster & Tune: Boost Cold Start Performance in Text Classification. ACL 2022
Please cite:
@inproceedings{shnarch-etal-2022-cluster,
title = "Cluster & Tune: Boost Cold Start Performance in Text Classification",
author = "Shnarch, Eyal and
Gera, Ariel and
Halfon, Alon and
Dankin, Lena and
Choshen, Leshem and
Aharonov, Ranit and
Slonim, Noam",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.acl-long.526",
pages = "7639--7653",
}
This work is released under the Apache 2.0 license. The full text of the license can be found in LICENSE.