
Can BERT Refrain from Forgetting on Sequential Tasks? A Probing Study

Open-source code for our ICLR 2023 paper: Can BERT Refrain from Forgetting on Sequential Tasks? A Probing Study (https://openreview.net/forum?id=UazgYBMS9-W)


Requirements

python==3.8
pytorch>=1.7.0
transformers>=4.3.0


Download Datasets

We follow d'Autume et al. (2019) and download the datasets from their Google Drive.


Prepare

  1. Clone the repository from GitHub.
  2. Create the data directory:
cd plms_are_lifelong_learners
mkdir data
  3. Move the .tar.gz files to "data".
  4. Uncompress the .tar.gz files and sample data from the original datasets:
bash uncompressing.sh
python sampling_data.py --seed 42

Train Models Sequentially

  1. Tokenize the input texts:
python tokenizing.py --tokenizer bert-base-uncased --data_dir ./data/ --max_token_num 128

"bert-base-uncased" can be replaced by any other tokenizer in Hugging Face Transformer Models, e.g. "roberta-base", "prajjwal1/bert-tiny", etc. The files with tokenized texts will be saved in the directory ./data/

  2. Train the models:
CUDA_VISIBLE_DEVICES=0 python train_cla.py --plm_name bert-base-uncased --tok_name bert-base-uncased --pad_token 0 --plm_type bert --hidden_size 768 --device cuda --seed 1023 --padding_len 128 --batch_size 32 --learning_rate 1.5e-5 --trainer sequential --epoch 2 --order 0 --rep_itv 10000 --rep_num 100
  • "plm_name" indicates which pre-trained model (and its well-pre-trained weights) should be loaded. "tok_name" means which tokenizer are employed in the first step. These two parameters can be different, e.g., python train_cla.py --plm_name bert-large-uncased --tok_name bert-base-uncased ...
  • "plm_type" is required. When using GPT-2 or XLNet, please set "plm_type" as "gpt2" or "xlnet". It is because the classifiaction models based on GPT-2 or XLNet employ representations of the last tokens as features, while other models (like BERT or RoBERTa) employ the first token ([CLS]).
  • "pad_token" should be an integer, which is the ID of padding tokens in the tokenizer, e.g. 0 for BERT, or 1 for RoBERTa.
  • "trainer" can be sequential, replay, or multiclass. The sequential means training sequentially without Episodic Memory Play, and multiclass means training on all tasks together (multi-task learning).
  • "order" can be 0, 1, 2, 3, which is correspond to Appendix A in the paper.

Probing Study

  1. Re-train the decoder on each saved checkpoint:
CUDA_VISIBLE_DEVICES=0 python probing_train.py --plm_name bert-base-uncased --tok_name bert-base-uncased --pad_token 0 --plm_type bert --hidden_size 768 --device cuda --seed 1023 --padding_len 128 --batch_size 32 --learning_rate 3e-5 --epoch 10 --train_time "1971-02-03-14-56-07"
  2. Evaluate the performance of each re-trained model:
CUDA_VISIBLE_DEVICES=0 python test_model.py --plm_name bert-base-uncased --tok_name bert-base-uncased --pad_token 0 --plm_type bert --hidden_size 768 --device cuda --padding_len 128 --train_time "1971-02-03-14-56-07"
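Conceptually, the probing step freezes the sequentially trained encoder and fits a fresh decoder (classification head) on top of it. The sketch below illustrates this idea only; it is not the repository's implementation, and the checkpoint path and num_classes value are placeholders.

import torch
from transformers import AutoModel

# Load an encoder checkpoint saved during sequential training (path is a placeholder)
encoder = AutoModel.from_pretrained("bert-base-uncased")
state = torch.load("checkpoints/1971-02-03-14-56-07/encoder.pt")
encoder.load_state_dict(state, strict=False)

# Freeze the encoder: only the new decoder is updated during probing
for p in encoder.parameters():
    p.requires_grad = False

num_classes = 33  # placeholder; use the size of the label space from training
decoder = torch.nn.Linear(768, num_classes)
optimizer = torch.optim.Adam(decoder.parameters(), lr=3e-5)

def probing_step(input_ids, attention_mask, labels):
    with torch.no_grad():
        hidden = encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
    logits = decoder(hidden[:, 0])  # [CLS] feature for BERT-style models
    loss = torch.nn.functional.cross_entropy(logits, labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()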
