
Code for "Multi-CLS BERT: An Efficient Alternative to Traditional Ensembling"

Setup

Only tested on Python 3.6.

python -m pip install virtualenv
virtualenv bert_env
source bert_env/bin/activate
pip install -r requirements.txt

Usage

The code is built on the source code of On Losses for Modern Language Models, with several enhancements and modifications. In addition to the previously proposed pre-training tasks ("mlm", "rg" (QT in the paper), "tf", "tf_idf", "so", etc.), we provide a new training mechanism for transformers that enjoys the benefits of ensembling without sacrificing efficiency. To train our Multi-CLS BERT, simply specify --model-type mf (MCQT in the paper) together with the desired number of facets K via --num-facets K.

Currently, the mf model type can be combined with any of the following options (a combined example follows the list):

  • Hard negatives: --use_hard_neg 1
  • Architecture-based diversification: --diversify_mode
  • The BERT layer(s) at which to insert the additional linear layers: --diversify_hidden_layer
  • The cross-facet loss: --facet2facet
  • λ in our MCQT loss: --agg_weight
  • Always using the MLM loss (--always-mlm True); the pre-training loss then becomes "mf" + "mlm"
  • Initializing with pre-trained BERT weights: --pretrained
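
Putting a few of these together, a minimal pre-training invocation might look like the sketch below. The flag values are illustrative, taken from the options above and from the paper's full command further down, and are not a verified stand-alone configuration:

python -m pretrain_bert --model-type mf --pretrained-bert --num-facets 5 \
  --use_hard_neg 1 --facet2facet --diversify_mode lin_new --diversify_hidden_layer 4,8 \
  --agg_weight 0.1 --lr 2e-5 --batch-size 30 --epochs 2 --seed 1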

When pre-training with multiple tasks, the loss can be calculated using any of the following methods (see the sketch after this list):

  • Summing all losses (default; incompatible for a small subset of task combinations, see the paper for more detail)
  • Continuous Multi-Task Learning, based on ERNIE 2.0 (--continual-learning True)
  • Alternating between losses (--alternating True)
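
For example, switching from the default loss summation to the ERNIE-2.0-style schedule should only require adding the corresponding flag to a multi-task run. A hedged sketch, with illustrative hyperparameters rather than a verified configuration:

python -m pretrain_bert --model-type mf,tf_idf,so --pretrained-bert --num-facets 5 \
  --continual-learning True --lr 2e-5 --batch-size 30 --epochs 2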

All usable parameters shared by the different pre-training tasks can be found in arguments.py.

Note that our code still supports the comparison tasks listed in our paper; you can simply change the model type to reproduce those results (e.g., use --model-type rg+so+tf_idf to run the MTL method).

Before training, you should:

  • Set paths to read/save/load from in paths.py
  • To create datasets, see data_utils/make_dataset.py
  • For tf_idf prediction, you need to first calculate the idf score for your dataset. See idf.py for a script to do this.
  • If you want to change the transformer size, check out bert_config.json.
  • If you want to train bert-large, you may use bert_large_config.json with --tokenizer-model-type bert-large-uncased.

The following command is the best setting that we used in our paper for Multi-CLS BERT:

python -m pretrain_bert --model-type mf,tf_idf,so --pretrained-bert --save-iters 200000 \
  --lr 2e-5 --agg-function max --warmup 0.001 --facet2facet --epochs 2 --num-facets 5 \
  --diversify_hidden_layer 4,8 --loss_mode log --use_hard_neg 1 --batch-size 30 --seed 1 \
  --diversify_mode lin_new --add_testing_agg --agg_weight 0.1 \
  --save_suffix _add_testing_agg_max01_n5_no_pooling_no_bias_h48_lin_no_bias_hard_neg_tf_idf_so_bsz_30_e2_norm_facet_warmup0001_s1

Fine-tuning

Before running a fine-tuning task, change output_path in evaluate/generate_random_number.py, as well as random_file_path in evaluate/config/test_bert.conf, to your local paths. Then run the Python file to generate the random numbers, which ensures that the random seeds used for sampling training data stay the same across fine-tuning runs.
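
For example, assuming the script is meant to be run directly from the repository root (untested invocation):

python evaluate/generate_random_number.py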

To run a fine-tuning task, first convert the saved state dict of the required model using convert_state_dict.py. Then run:

python3 -m evaluate.main --exp_name [experiment name] --overrides [parameters_to_override]

where the experiment name is the same as the model type above. If using a saved checkpoint instead of the best model, add the --checkpoint argument. You can change the data you want to use (GLUE or SuperGLUE) in paths.py. The --overrides parameter accepts command-like strings that override the default values in the fine-tuning config (evaluate/config/test_bert.conf); you can specify the learning rate, model_suffix, or the few-shot setting there.

In Multi-CLS BERT, we provide different ways to aggregate all of the CLS embeddings. To specify the aggregation function, change the value of pool_type in evaluate/config/test_bert.conf (or override it per run via --overrides, as sketched after the list below):

  • Re-parameterization: pool_type=proj_avg_train
  • Sum aggregation: pool_type=first_init
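
Since --overrides accepts any key from the fine-tuning config, pool_type can presumably also be set per run without editing the .conf file. A sketch (the experiment name is a placeholder):

python -m evaluate.main --exp_name [experiment name] --overrides "pool_type = proj_avg_train"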

The following command is an example of running a fine-tuning task on a GLUE dataset in a few-shot setting (few_shot = 32 in the command below). Use a run name with a suffix to reload the model weights you saved from pre-training.

common_para="warmup_ratio = 0.1, max_grad_norm = 1.0, pool_type=proj_avg_train, "
common_name="warmup01_clip1_proj_avg_train_correct"

python -m evaluate.main \
--exp_name $exp_name \
--overrides "run_name = ${model_name}_1,
$common_para pretrain_tasks = glue,
target_tasks = glue,
lr=1e-5, batch_size=4, few_shot = 32, max_epochs = 20,
pooler_dropout = 0, random_seed = 1,
run_name_suffix = adam_${common_name}_e20_bsz4:s1:lr"

Citation

@inproceedings{chang2023multi-cls,
  title={Multi-CLS BERT: An Efficient Alternative to Traditional Ensembling},
  author={Haw-Shiuan Chang* and Ruei-Yao Sun* and Kathryn Ricci* and Andrew McCallum},
  booktitle={Annual Meeting of the Association for Computational Linguistics (ACL)},
  year={2023},
}
