Copyright (c) 2020-2021. All rights reserved.

Licensed under the MIT License.

# HPO for Fine-Tuning Pre-trained Language Models


## 1. Introduction


In this notebook, we demonstrate a procedure for troubleshooting HPO failure in fine-tuning pre-trained language models (introduced in the following paper):

*An Empirical Study on Hyperparameter Optimization for Fine-Tuning Pre-trained Language Models. Xueqing Liu, Chi Wang. To appear in ACL-IJCNLP 2021*

FLAML requires `Python>=3.6`. To run this notebook example, please install flaml with the `acl2021` option:
```bash
pip install flaml[acl2021]
```

In [1]:
%cd /data/xliu127/projects/hyperopt/FLAML/
!python setup.py install
from flaml.nlp import AutoTransformers

import flaml
import inspect
print(inspect.getsource(flaml.nlp))



## 2. Initial Experimental Study (Section 4)


### Load dataset 

Load the dataset using `AutoTransformers.prepare_data`. In this notebook, we use the Microsoft Research Paraphrase Corpus (MRPC) dataset from GLUE and the Electra model as an example:

In [2]:
autohf = AutoTransformers()
preparedata_setting = {
        "dataset_subdataset_name": "glue:mrpc",
        "pretrained_model_size": "google/electra-base-discriminator:base",
        "data_root_path": "data/",
        "max_seq_length": 128,
        }
autohf.prepare_data(**preparedata_setting)


console_args has no attribute pretrained_model_size, continue
console_args has no attribute dataset_subdataset_name, continue
console_args has no attribute algo_mode, continue
console_args has no attribute space_mode, continue
console_args has no attribute search_alg_args_mode, continue
console_args has no attribute algo_name, continue
console_args has no attribute pruner, continue
console_args has no attribute resplit_mode, continue
console_args has no attribute rep_id, continue
console_args has no attribute seed_data, continue
console_args has no attribute seed_transformers, continue
console_args has no attribute optarg1, continue
console_args has no attribute optarg2, continue


Reusing dataset glue (/home/xliu127/.cache/huggingface/datasets/glue/mrpc/1.0.0/7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4)
Loading cached processed dataset at /home/xliu127/.cache/huggingface/datasets/glue/mrpc/1.0.0/7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4/cache-52e85ab7216b1e98.arrow
Loading cached processed dataset at /home/xliu127/.cache/huggingface/datasets/glue/mrpc/1.0.0/7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4/cache-f580e10e11ba92cb.arrow
Loading cached processed dataset at /home/xliu127/.cache/huggingface/datasets/glue/mrpc/1.0.0/7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4/cache-5a6d8f04bf59fe22.arrow

### Running grid search

First, we run grid search using Electra. By specifying `algo_mode="grid"`, AutoTransformers will run the grid search algorithm. By specifying `space_mode="grid"`, AutoTransformers will use the default grid search configuration recommended by the Electra paper:

In [3]:
autohf_settings = {"resources_per_trial": {"gpu": 1, "cpu": 1},
                   "num_samples": 1,
                   "time_budget": 100000,  # large enough to be effectively unlimited
                   "ckpt_per_epoch": 5,
                   "fp16": True,
                   "algo_mode": "grid",  # set the search algorithm to grid search
                   "space_mode": "grid", # set the search space to the recommended grid space
                   }
validation_metric, analysis = autohf.fit(**autohf_settings,)

2021-06-15 16:25:00,708	INFO tune.py:450 -- Total run time: 360.87 seconds (360.72 seconds for the tuning loop).
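
To make the idea of grid search concrete, the sketch below enumerates every combination of a small set of candidate values. The values in `grid_space` and the helper `enumerate_grid` are illustrative assumptions, not the grid that `space_mode="grid"` selects for Electra:

```python
import itertools

# Illustrative candidate values only (an assumption for this sketch);
# not necessarily the grid that FLAML uses for Electra.
grid_space = {
    "learning_rate": [3e-5, 5e-5, 1e-4],
    "per_device_train_batch_size": [16, 32],
    "num_train_epochs": [3],
}


def enumerate_grid(space):
    """Return every combination of the listed values as a list of config dicts."""
    names = list(space)
    value_lists = [space[name] for name in names]
    return [dict(zip(names, values)) for values in itertools.product(*value_lists)]


configs = enumerate_grid(grid_space)
print(len(configs))  # 3 * 2 * 1 = 6 configurations
for config in configs:
    print(config)
```

Grid search evaluates each of these configurations once, so its cost grows with the product of the number of candidate values per hyperparameter.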


Get the time for running grid search: 

In [4]:
GST = autohf.last_run_duration
print("grid search for {} took {} seconds".format(autohf.jobid_config.get_jobid_full_data_name(), GST))

grid search for glue_mrpc took 360.9044075012207 seconds


After the HPO run finishes, generate the predictions and save them as a .zip file to be submitted to the GLUE website. This step uses the AzureUtils library, which stores the output (e.g., the analysis log and the .zip file) locally and uploads it to an Azure blob container (e.g., when multiple jobs are executed in a cluster). If the Azure key and container information are not specified, the output is saved only locally.

In [5]:
predictions, test_metric = autohf.predict()
from flaml.nlp.result_analysis.azure_utils import AzureUtils

print(autohf.jobid_config)

azure_utils = AzureUtils(root_log_path="logs_test/", autohf=autohf)
azure_utils.write_autohf_output(valid_metric=validation_metric,
                                predictions=predictions,
                                duration= autohf.last_run_duration)
print(validation_metric)

remove_columns_ is deprecated and will be removed in the next major version of datasets. Use the dataset.remove_columns method instead.


Cleaning the existing label column from test data


JobID(dat=['glue'], subdat='mrpc', mod='grid', spa='grid', arg='dft', alg='grid', pru='None', pre_full='google/electra-base-discriminator', pre='electra', presz='base', spt='ori', rep=0, sddt=43, sdhf=42, var1=None, var2=None)
To use the azure storage component in flaml.nlp, run pip install azure-storage-blob
console_args does not contain data_root_dir, loading the default value
To use the azure storage component in flaml.nlp, run pip install azure-storage-blob
{'eval_accuracy': 0.8946078431372549, 'eval_f1': 0.9238938053097344, 'eval_loss': 0.2885817289352417}


The validation F1/accuracy we got was 92.4/89.5. After the above steps, you will find a .zip file for the predictions under data/result/. Submit the .zip file to the GLUE website. The test F1/accuracy we got was 90.4/86.7.

### Running random search

Next, we run random search with the same time budget as grid search:

In [6]:
def tune_hpo(time_budget, this_hpo_space):
    autohf_settings = {"resources_per_trial": {"gpu": 1, "cpu": 1},
                       "num_samples": -1,
                       "time_budget": time_budget,  # time budget for the search, in seconds
                       "ckpt_per_epoch": 5,
                       "fp16": True,
                       "algo_mode": "hpo",  # run an HPO algorithm instead of grid search
                       "algo_name": "rs",  # use random search as the HPO algorithm
                       "space_mode": "cus",  # use the custom search space passed via hpo_space
                       "hpo_space": this_hpo_space
                       }
    validation_metric, analysis = autohf.fit(**autohf_settings,)
    predictions, test_metric = autohf.predict()
    from flaml.nlp.result_analysis.azure_utils import AzureUtils
    azure_utils = AzureUtils(root_log_path="logs_test/", autohf=autohf)
    azure_utils.write_autohf_output(valid_metric=validation_metric,
                                    predictions=predictions,
                                    duration= autohf.last_run_duration)
    print(validation_metric)

In [7]:
hpo_space_full = {
               "learning_rate": {"l": 3e-5, "u": 1.5e-4, "space": "log"},
               "warmup_ratio": {"l": 0, "u": 0.2, "space": "linear"},
               "num_train_epochs": [3],
               "per_device_train_batch_size": [16, 32, 64],
               "weight_decay": {"l": 0.0, "u": 0.3, "space": "linear"},
               "attention_probs_dropout_prob": {"l": 0, "u": 0.2, "space": "linear"},
               "hidden_dropout_prob": {"l": 0, "u": 0.2, "space": "linear"},
            }

tune_hpo(GST, hpo_space_full)

(pid=13140) {'eval_loss': 0.4927924573421478, 'eval_accuracy': 0.8137254901960784, 'eval_f1': 0.8778135048231511, 'epoch': 1.4482758620689655}
(pid=13142) {'eval_loss': 0.40390661358833313, 'eval_accuracy': 0.8651960784313726, 'eval_f1': 0.9036777583187391, 'epoch': 2.2}
(pid=13138) {'eval_loss': 0.29186099767684937, 'eval_accuracy': 0.8725490196078431, 'eval_f1': 0.9040590405904059, 'epoch': 1.6551724137931034}
2021-06-15 16:34:31,364	INFO tune.py:450 -- Total run time: 518.66 seconds (511.35 seconds for the tuning loop).

To use the azure storage component in flaml.nlp, run pip install azure-storage-blob
console_args does not contain data_root_dir, loading the default value
To use the azure storage component in flaml.nlp, run pip install azure-storage-blob
{'eval_accuracy': 0.9093137254901961, 'eval_f1': 0.9345132743362833, 'eval_loss': 0.2630058825016022}
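
The `hpo_space_full` dictionary mixes two kinds of entries: lists of discrete choices, and ranges with a lower bound `l`, an upper bound `u`, and a `space` type. The sketch below shows one way such a specification could be sampled by random search. The helper `sample_config` is hypothetical, and it assumes that `"log"` means uniform in log space and `"linear"` means uniform in the original scale; FLAML's internal handling may differ:

```python
import math
import random


def sample_config(space, rng):
    """Draw one configuration from a spec written in the style of hpo_space_full."""
    config = {}
    for name, spec in space.items():
        if isinstance(spec, list):
            # A list is treated as a set of categorical choices.
            config[name] = rng.choice(spec)
        elif spec["space"] == "log":
            # Assumed meaning of "log": uniform in log space between l and u.
            low, high = math.log(spec["l"]), math.log(spec["u"])
            config[name] = math.exp(rng.uniform(low, high))
        elif spec["space"] == "linear":
            # Assumed meaning of "linear": uniform between l and u.
            config[name] = rng.uniform(spec["l"], spec["u"])
        else:
            raise ValueError(f"unknown space type for {name}: {spec['space']}")
    return config


example_space = {
    "learning_rate": {"l": 3e-5, "u": 1.5e-4, "space": "log"},
    "warmup_ratio": {"l": 0, "u": 0.2, "space": "linear"},
    "per_device_train_batch_size": [16, 32, 64],
}
sampled = sample_config(example_space, random.Random(0))
print(sampled)
```

Sampling the learning rate in log space spreads trials evenly across orders of magnitude rather than concentrating them near the upper bound.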


The validation F1/accuracy we got was 93.5/90.9. Similarly, we can submit the .zip file to the GLUE website. The test F1/accuracy we got was 81.6/70.2. As an example, we run the experiment only once, but in general, the experiment should be repeated multiple times and the averaged validation and test accuracy reported.

## 3. Troubleshooting HPO Failures

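Random search reaches a higher validation accuracy than grid search (90.9 vs. 89.5) but a much lower test accuracy (70.2 vs. 86.7), which indicates that HPO overfits the validation set. We first reduce the search space by fixing the warmup ratio at 0.1: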

In [None]:
hpo_space_fixwr = {
               "learning_rate": {"l": 3e-5, "u": 1.5e-4, "space": "log"},
               "warmup_ratio": [0.1],
               "num_train_epochs": [3],
               "per_device_train_batch_size": [16, 32, 64],
               "weight_decay": {"l": 0.0, "u": 0.3, "space": "linear"},
               "attention_probs_dropout_prob": {"l": 0, "u": 0.2, "space": "linear"},
               "hidden_dropout_prob": {"l": 0, "u": 0.2, "space": "linear"},
            }
tune_hpo(GST, hpo_space_fixwr)

(pid=16805) {'eval_loss': 0.6097157001495361, 'eval_accuracy': 0.6838235294117647, 'eval_f1': 0.8122270742358079, 'epoch': 1.4}
(pid=16741) {'eval_loss': 0.6241453289985657, 'eval_accuracy': 0.6838235294117647, 'eval_f1': 0.8122270742358079, 'epoch': 1.4}

The validation F1/accuracy we got was 92.6/89.7, and the test F1/accuracy was 85.9/78.7. Overfitting is still present, so we reduce the search space further by also fixing the weight decay and the two dropout probabilities:

In [None]:
hpo_space_min = {
               "learning_rate": {"l": 3e-5, "u": 1.5e-4, "space": "log"},
               "warmup_ratio": [0.1],
               "num_train_epochs": [3],
               "per_device_train_batch_size": [16, 32, 64],
               "weight_decay": [0.0],
               "attention_probs_dropout_prob": [0.1],
               "hidden_dropout_prob": [0.1],
            }
tune_hpo(GST, hpo_space_min)

The validation F1/accuracy we got was 92.9/90.2, and the test F1/accuracy was 83.0/73.0. Since the result still overfits even with the minimal search space, we stop.
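
The overfitting check used throughout this section amounts to comparing validation and test accuracy for each run. The sketch below tabulates that gap using the accuracies reported in this notebook. The helper `accuracy_gap` and the run labels are introduced here for illustration only and are not part of FLAML:

```python
# (validation accuracy, test accuracy) in percent, as reported in this notebook.
results = {
    "grid search": (89.5, 86.7),
    "random search, full space": (90.9, 70.2),
    "random search, fixed warmup ratio": (89.7, 78.7),
    "random search, minimal space": (90.2, 73.0),
}


def accuracy_gap(valid_acc, test_acc):
    """Return validation minus test accuracy, rounded to one decimal place."""
    return round(valid_acc - test_acc, 1)


gaps = {name: accuracy_gap(*accs) for name, accs in results.items()}
for name, gap in gaps.items():
    print(f"{name}: validation - test = {gap} points")
```

A run whose gap is much larger than the grid search gap, while its validation accuracy is higher, is the pattern treated as HPO overfitting here. Since each run was executed only once, these gaps are single observations rather than averages.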