<a href="https://colab.research.google.com/github/goerlitz/nlp-classification/blob/main/notebooks/10kGNAD/colab/21c_10kGNAD_huggingface_basic_optuna.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hyperparameter Optimization with HuggingFace Transformers

Adapted from https://huggingface.co/docs/transformers/custom_datasets#sequence-classification-with-imdb-reviews

Things we need
* a tokenizer
* tokenized input data
* a pretrained model
* evaluation metrics
* training parameters
* a Trainer instance

Notes
* [class labels can be included in the model config](https://github.com/huggingface/transformers/pull/2945#issuecomment-781986506) (a bit hacky)
* [fp16 is disabled on Tesla P100 GPUs in PyTorch](https://discuss.pytorch.org/t/cnn-fp16-slower-than-fp32-on-tesla-p100/12146)

## Prerequisites

In [1]:
checkpoint = "distilbert-base-german-cased"
# checkpoint = "deepset/gbert-base"
# checkpoint = "deepset/gelectra-base"

project_name = f'10kgnad_hf__{checkpoint.replace("/", "_")}'

### Connect Google Drive

Google Drive is used to save the results.

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [3]:
from pathlib import Path

# define model path
root_path = Path('/content/gdrive/My Drive/')
base_path = root_path / 'Colab Notebooks/nlp-classification/'
model_path = base_path / 'models'

## Check GPU

In [4]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Select the Runtime > "Change runtime type" menu to enable a GPU accelerator, ')
  print('and then re-execute this cell.')
else:
  print(gpu_info)

Fri Jan  7 17:14:45 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   51C    P0    29W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

### Install Packages

In [5]:
%%time
!pip install -q -U transformers datasets >/dev/null
!pip install -q -U optuna >/dev/null

# check installed version
!pip freeze | grep optuna        # optuna==2.10.0
!pip freeze | grep transformers  # transformers==4.15.0
!pip freeze | grep torch         # torch==1.10.0+cu111

optuna==2.10.0
transformers==4.15.0
torch @ https://download.pytorch.org/whl/cu111/torch-1.10.0%2Bcu111-cp37-cp37m-linux_x86_64.whl
torchaudio @ https://download.pytorch.org/whl/cu111/torchaudio-0.10.0%2Bcu111-cp37-cp37m-linux_x86_64.whl
torchsummary==1.5.1
torchtext==0.11.0
torchvision @ https://download.pytorch.org/whl/cu111/torchvision-0.11.1%2Bcu111-cp37-cp37m-linux_x86_64.whl
CPU times: user 113 ms, sys: 54.1 ms, total: 167 ms
Wall time: 11.1 s


In [6]:
from transformers import logging

# hide progress bar when downloading tokenizer and model (a workaround!)
logging.get_verbosity = lambda : logging.NOTSET

## Load Dataset

In [7]:
from datasets import load_dataset

gnad10k = load_dataset("gnad10")
label_names = gnad10k["train"].features["label"].names

Using the latest cached version of the module from /root/.cache/huggingface/modules/datasets_modules/datasets/gnad10/3a8445be65795ad88270af4d797034c3d99f70f8352ca658c586faf1cf960881 (last modified on Thu Jan  6 23:08:20 2022) since it couldn't be found locally at gnad10., or remotely on the Hugging Face Hub.
Using custom data configuration default
Reusing dataset gnad10 (/root/.cache/huggingface/datasets/gnad10/default/1.1.0/3a8445be65795ad88270af4d797034c3d99f70f8352ca658c586faf1cf960881)



In [8]:
print(gnad10k)
print("labels:", label_names)
gnad10k["train"][0]

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 9245
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1028
    })
})
labels: ['Web', 'Panorama', 'International', 'Wirtschaft', 'Sport', 'Inland', 'Etat', 'Wissenschaft', 'Kultur']


{'label': 4,
 'text': '21-Jähriger fällt wohl bis Saisonende aus. Wien – Rapid muss wohl bis Saisonende auf Offensivspieler Thomas Murg verzichten. Der im Winter aus Ried gekommene 21-Jährige erlitt beim 0:4-Heimdebakel gegen Admira Wacker Mödling am Samstag einen Teilriss des Innenbandes im linken Knie, wie eine Magnetresonanz-Untersuchung am Donnerstag ergab. Murg erhielt eine Schiene, muss aber nicht operiert werden. Dennoch steht ihm eine mehrwöchige Pause bevor.'}

## Data Preprocessing

* Load the same tokenizer that was used with the pretrained model.
* Define a function to tokenize the text (with truncation to the model's maximum input length).
* Run the tokenization.

In [9]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized_gnad10k = gnad10k.map(preprocess_function, batched=True).remove_columns("text")

Loading cached processed dataset at /root/.cache/huggingface/datasets/gnad10/default/1.1.0/3a8445be65795ad88270af4d797034c3d99f70f8352ca658c586faf1cf960881/cache-5d66d7a004b32c63.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/gnad10/default/1.1.0/3a8445be65795ad88270af4d797034c3d99f70f8352ca658c586faf1cf960881/cache-1e7aaca04dbb52e2.arrow


### Use Dynamic Padding

Pad only up to the longest text in each batch. This is more efficient than padding the whole dataset to a single length.

In [10]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
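
To illustrate why per-batch padding saves work, here is a minimal pure-Python sketch. It is not how `DataCollatorWithPadding` is implemented; the helper `pad_batch` and the token ids are made up for illustration.

```python
from typing import List

def pad_batch(batch: List[List[int]], pad_id: int = 0) -> List[List[int]]:
    """Pad every sequence to the length of the longest sequence in this batch."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]

# hypothetical token ids, already grouped into two batches
batches = [
    [[101, 7, 8, 102], [101, 9, 102]],
    [[101, 5, 102], [101, 102]],
]

# dynamic padding: each batch is padded to its own maximum length
dynamic = [pad_batch(b) for b in batches]
dynamic_tokens = sum(len(seq) for b in dynamic for seq in b)

# static padding: every sequence is padded to the global maximum length
global_max = max(len(seq) for b in batches for seq in b)
static_tokens = global_max * sum(len(b) for b in batches)

print(dynamic_tokens, static_tokens)
```

In this toy case the model would process fewer token positions with dynamic padding than with static padding. The gap grows when text lengths vary a lot.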

## Model Setup

We want to include the label names and save them together with the model.
One way to do this is to create a config object that contains them.

In [11]:
import optuna
from transformers import AutoConfig, AutoModelForSequenceClassification

config = AutoConfig.from_pretrained(
        checkpoint,
        num_labels=len(label_names),
        id2label={i: label for i, label in enumerate(label_names)},
        label2id={label: i for i, label in enumerate(label_names)},
        )

def model_init(trial: optuna.Trial):
    """A function that instantiates the model to be used."""
    return AutoModelForSequenceClassification.from_pretrained(checkpoint, config=config)
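
As a quick check of the two mappings stored in the config, the sketch below rebuilds them from the label list printed earlier, using plain Python only (no model or config is loaded here).

```python
# label names as printed for the dataset above
label_names = ['Web', 'Panorama', 'International', 'Wirtschaft', 'Sport',
               'Inland', 'Etat', 'Wissenschaft', 'Kultur']

# same dictionary comprehensions as in the config cell
id2label = {i: label for i, label in enumerate(label_names)}
label2id = {label: i for i, label in enumerate(label_names)}

print(id2label[4])         # class name for label id 4
print(label2id["Kultur"])  # label id for the class "Kultur"
```

The first training example shown above has `label: 4`, which maps to `Sport` and matches its football-related text.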

### Define Evaluation Metrics

The function that computes the metrics needs to be passed to the Trainer.

In [12]:
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score, matthews_corrcoef
import numpy as np
from typing import Dict

def compute_metrics(eval_preds):
    """The function that will be used to compute metrics at evaluation.
    Must take a :class:`~transformers.EvalPrediction` and return a dictionary
    string to metric values."""
    logits, labels = eval_preds
    preds = np.argmax(logits, axis=-1)
    return {
        "acc": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average='macro'),
        "precision": precision_score(labels, preds, average='macro'),
        "recall": recall_score(labels, preds, average='macro'),
        "mcc": matthews_corrcoef(labels, preds),
        }


# def objective(metrics: Dict[str, float]):
#     """A function computing the main optimization objective from the metrics
#     returned by the :obj:`compute_metrics` method.
#     To be used in :obj:`Trainer.hyperparameter_search`."""
#     return metrics["eval_loss"]
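
The metrics above use `average='macro'`, which computes the score per class and then takes the unweighted mean. The following sketch shows that idea for F1 in pure Python. It is an illustration with made-up logits and labels, not scikit-learn's implementation; the helper names `argmax` and `macro_f1` are hypothetical.

```python
from typing import List, Sequence

def argmax(row: Sequence[float]) -> int:
    """Index of the largest value, i.e. the predicted class."""
    return max(range(len(row)), key=lambda i: row[i])

def macro_f1(labels: List[int], preds: List[int]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(labels) | set(preds))
    scores = []
    for c in classes:
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# hypothetical logits for four examples and three classes
logits = [[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.9, 0.2, 0.4], [0.1, 0.2, 3.0]]
labels = [0, 1, 2, 2]

preds = [argmax(row) for row in logits]
print(preds, round(macro_f1(labels, preds), 4))
```

Because every class contributes equally, a rare class with poor predictions lowers macro F1 as much as a frequent one would.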

## Hyperparameter Tuning

In [29]:
from transformers import TrainerCallback
from optuna.trial import TrialState
from optuna.study import StudyDirection  # public import path instead of the private module
import pandas as pd

# https://github.com/huggingface/transformers/blob/v4.14.1/src/transformers/trainer_callback.py#L505



class MultiObjectiveMedianPrunerCallback(TrainerCallback):
    def __init__(self,
                 objectives,
                 n_startup_trials: int = 5,
                 n_warmup_steps: int = 0,
                 interval_steps: int = 1,
                 n_min_trials: int = 1,):
        self.objectives = objectives
        self._n_startup_trials = n_startup_trials
        self._n_warmup_steps = n_warmup_steps
        self._interval_steps = interval_steps
        self._n_min_trials = n_min_trials
    
    def prune(self, study: "optuna.study.Study", trial: "optuna.trial.FrozenTrial") -> bool:
        # get completed trials
        complete_trials = study.get_trials(deepcopy=False,
                                           states=[TrialState.COMPLETE])
        # complete_trials = [t for t in all_trials
        #                    if t.state == TrialState.COMPLETE]
        n_trials = len(complete_trials)

        step = trial.last_step
        print(f"prune? step={step}, complete_trials={n_trials}")


        return False
        
    
    # def on_evaluate(self, args, state, control, metrics, **kwargs):
    #     print(f"pruning check ")
    #     # TODO: use set_user_attrs instead of report
    #     self.trial.report(metrics[self.metric], step=state.global_step)


# https://huggingface.co/docs/transformers/main_classes/callback#transformers.TrainerCallback

class TrialLogAndPruningCallback(TrainerCallback):
    """Stores eval metrics at each evaluation step in the trial user attrs."""
    def __init__(self, trial: optuna.Trial, objectives=None, warmup_steps=0, min_trials=7):
        self.trial = trial
        if objectives is None:
            self.objectives = ["eval_loss"]
        else:
            self.objectives = objectives
        self._warmup_steps = warmup_steps
        self._min_trials = min_trials

    def _filter_trials(self, complete_trials):
        """Select only trials with same parameter values"""
        keys = ["num_train_epochs", "per_device_train_batch_size"]
        values = [self.trial.params[k] for k in keys]
        return [t for t in complete_trials if values == [t.params[k] for k in keys]]

    def _prune(self, step: int, metrics) -> bool:
        """Median Pruning on multiple objectives."""
        if step < self._warmup_steps:
            return False

        study = self.trial.study
        complete_trials = study.get_trials(deepcopy=False,
                                           states=[TrialState.COMPLETE])
        # do not compare trials with different batch sizes and epochs
        complete_trials = self._filter_trials(complete_trials)
        n_trials = len(complete_trials)

        if n_trials < self._min_trials:
            return False

        trial_metrics = []
        for t in complete_trials:
            if str(step) in t.user_attrs.keys():
                trial_metrics.append(t.user_attrs[str(step)])
        n_metrics = len(trial_metrics)

        median = pd.DataFrame(trial_metrics).median()

        directions = study.directions
        prune_state = []
        for i, o in enumerate(self.objectives):
            if directions[i] == StudyDirection.MAXIMIZE:
                prune_state.append(metrics[o] <= median[o])
            else:
                prune_state.append(metrics[o] > median[o])
        
        met = ",".join([f"{m}={metrics[m]:.4}/{median[m]:.4}" for m in self.objectives])
        print(f"prune? step={step}, warmup={self._warmup_steps}, complete_trials={n_trials}, metrics={n_metrics} -> {met}; {prune_state}")

        # all metrics must be marked for pruning
        # return all(prune_state)
        return False
    
    def on_evaluate(self, args, state, control, lr_scheduler, metrics, **kwargs):
        step = state.global_step
        values = {**metrics, "lr": lr_scheduler.get_last_lr()[-1]}
        # user attribute keys are strings; _prune() looks them up via str(step)
        self.trial.set_user_attr(str(step), values)

        # pruning
        if self._prune(step, metrics):
            print(f"pruning trial at step {step}")
            # control.should_training_stop = True  # not needed
            raise optuna.TrialPruned()
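
The rule in `_prune` compares the current metrics at a given step with the median of the completed trials at the same step. Each objective casts a vote, and the trial would only be pruned if all objectives vote for it. In the notebook this final decision is currently disabled, since `_prune` always returns `False`. The sketch below isolates the voting logic using only the standard library; `should_prune` and the history values are hypothetical.

```python
from statistics import median
from typing import Dict, List

def should_prune(current: Dict[str, float],
                 history: List[Dict[str, float]],
                 directions: Dict[str, str]) -> bool:
    """Return True only if every objective is worse than the median of past trials."""
    votes = []
    for name, direction in directions.items():
        med = median(h[name] for h in history)
        if direction == "maximize":
            votes.append(current[name] <= med)  # not better than the median
        else:
            votes.append(current[name] > med)   # worse than the median
    return all(votes)

directions = {"eval_loss": "minimize", "eval_f1": "maximize"}

# hypothetical metrics of three completed trials at the same step
history = [
    {"eval_loss": 0.40, "eval_f1": 0.87},
    {"eval_loss": 0.42, "eval_f1": 0.86},
    {"eval_loss": 0.44, "eval_f1": 0.85},
]

print(should_prune({"eval_loss": 0.45, "eval_f1": 0.84}, history, directions))
print(should_prune({"eval_loss": 0.41, "eval_f1": 0.85}, history, directions))
```

Requiring all objectives to agree is the conservative choice: a trial that is below the median on F1 but above it on loss keeps running.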

In [30]:
from transformers import TrainingArguments, Trainer
import shutil

def hp_space(trial: optuna.Trial):
    """A function that defines the hyperparameter search space.
    To be used in :obj:`Trainer.hyperparameter_search`."""
    return {
        "learning_rate": trial.suggest_float("learning_rate", 3e-5, 1e-4, log=True),  # distilbert
        # "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-4, log=True),
        # "learning_rate": trial.suggest_float("learning_rate", 6e-5, 2e-4, log=True),  # electra
        "num_train_epochs": trial.suggest_categorical("num_train_epochs", [2, 3]),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [32]),
        # "weight_decay": trial.suggest_float("weight_decay", 1e-3, 1e-2, log=True),
        "weight_decay": trial.suggest_categorical("weight_decay", [1e-3, 0.0]),
        # "label_smoothing_factor": trial.suggest_float("label_smoothing_factor", 0.0, 0.1),
    }

best_model_dir = "best_model_trainer"

def best_model_callback(study, trial):
    """Save the model from a best trial"""
    for t in study.best_trials:
        if t.number == trial.number:
            print("This is a new best trial", trial.number)
        
            out_filename = model_path / f"{project_name}_t{trial.number}"
            shutil.make_archive(out_filename, 'zip', f"{project_name}/{best_model_dir}")

def objective(trial: optuna.Trial):

    # get hyperparameters choice
    hp = hp_space(trial)
    lr = hp["learning_rate"]
    bs = hp["per_device_train_batch_size"]
    epochs = hp["num_train_epochs"]
    weight_decay = hp["weight_decay"]
    # label_smoothing_factor = hp["label_smoothing_factor"]

    eval_rounds_per_epoch = 5
    eval_steps = gnad10k["train"].num_rows / bs // eval_rounds_per_epoch

    training_args = TrainingArguments(
        output_dir=str(project_name),
        report_to=[],
        log_level="error",
        disable_tqdm=False,

        evaluation_strategy="steps",
        eval_steps=eval_steps,
        logging_steps=eval_steps,
        save_strategy="steps",
        save_steps=eval_steps,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        greater_is_better=False,

        # hyperparameters
        num_train_epochs=epochs,
        learning_rate=lr,
        per_device_train_batch_size=bs,
        per_device_eval_batch_size=bs,
        weight_decay=weight_decay,
        # label_smoothing_factor=label_smoothing_factor,

        # fp16=True,  # fp16 is disabled on Tesla P100 by pytorch
    )

    trainer = Trainer(
        model_init=model_init,
        args=training_args,
        train_dataset=tokenized_gnad10k["train"],
        eval_dataset=tokenized_gnad10k["test"],
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
        callbacks=[TrialLogAndPruningCallback(trial, objectives=["eval_loss", "eval_f1"], warmup_steps=eval_steps*4)]
        # callbacks=[TrialPruningCallback(trial)]
    )

    # train model and save best model from evaluations
    # needs 'load_best_model_at_end=True'
    trainer.train()
    trainer.save_model(f"{project_name}/{best_model_dir}")

    result = trainer.evaluate(eval_dataset=tokenized_gnad10k["test"])

    # store eval metrics in trial
    trial.set_user_attr("eval_result", result)
    
    # return result["eval_loss"]
    return result["eval_loss"], result["eval_f1"]
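
The evaluation interval in `objective` follows from simple arithmetic. The sketch below reproduces it with the training set size printed earlier (9245 rows) and the batch size from the search space (32).

```python
num_rows = 9245            # size of the training split
batch_size = 32            # only batch size in the search space
eval_rounds_per_epoch = 5

# same expression as in objective(): true division, then floor division
eval_steps = num_rows / batch_size // eval_rounds_per_epoch
warmup_steps = eval_steps * 4

print(eval_steps, warmup_steps)
print([int(eval_steps) * k for k in range(1, 8)])
```

Note that the result is a float because `/` is true division. If an integer is required, wrapping the expression in `int()` is one option.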

## Run the Optuna Study

In [None]:
db_path = "/content/gdrive/My Drive/Colab Notebooks/nlp-classification/"
db_name = "10kgnad_optuna"
# study_name = checkpoint + "_multi_epoch234"
study_name = checkpoint + "_loss-f1_bs32_epoch23"

# multi objective study
# https://optuna.readthedocs.io/en/stable/tutorial/20_recipes/002_multi_objective.html#sphx-glr-tutorial-20-recipes-002-multi-objective-py
study = optuna.create_study(study_name=study_name,
                            directions=["minimize", "maximize"],
                            # pruner=MultiObjectiveMedianPrunerCallback(["eval_loss", "eval_f1"]),
                            storage=f"sqlite:///{db_path}{db_name}.db",
                            load_if_exists=True,)

# give some hyperparameters that are presumably good
# study.enqueue_trial(
#     {
#         "learning_rate": 8e-5,
#         "weight_decay": 1e-3,
#         "label_smoothing_factor": 0.0,
#     }
# )
# study.enqueue_trial(
#     {
#         "learning_rate": 7e-5,
#         "weight_decay": 1e-3,
#         "label_smoothing_factor": 1e-5,
#     }
# )

# https://stackoverflow.com/questions/59129812/how-to-avoid-cuda-out-of-memory-in-pytorch
import torch
torch.cuda.empty_cache()
import gc
gc.collect()


study.optimize(objective, n_trials=100, callbacks=[best_model_callback])

# study.best_params

[I 2022-01-07 21:20:33,957] Using an existing study with name 'distilbert-base-german-cased_loss-f1_bs32_epoch23' instead of creating a new one.


| Step | Training Loss | Validation Loss | Acc | F1 | Precision | Recall | Mcc |
|---|---|---|---|---|---|---|---|
| 57 | 1.2469 | 0.698054 | 0.792802 | 0.781168 | 0.815487 | 0.772654 | 0.764949 |
| 114 | 0.654 | 0.505349 | 0.838521 | 0.836773 | 0.856238 | 0.824729 | 0.815229 |
| 171 | 0.5158 | 0.436459 | 0.860895 | 0.860254 | 0.862906 | 0.863638 | 0.841829 |
| 228 | 0.4673 | 0.411757 | 0.86284 | 0.863616 | 0.85984 | 0.871995 | 0.843409 |
| 285 | 0.4103 | 0.403604 | 0.867704 | 0.866437 | 0.875247 | 0.86541 | 0.849197 |
| 342 | 0.3066 | 0.380824 | 0.882296 | 0.880217 | 0.887779 | 0.875088 | 0.865543 |
| 399 | 0.3079 | 0.341928 | 0.892996 | 0.88906 | 0.892112 | 0.887245 | 0.877482 |


prune? step=228, warmup=228.0, complete_trials=95, metrics=95 -> eval_loss=0.4118/0.4225,eval_f1=0.8636/0.8662; [False, True]
prune? step=285, warmup=228.0, complete_trials=95, metrics=95 -> eval_loss=0.4036/0.4049,eval_f1=0.8664/0.8689; [False, True]
prune? step=342, warmup=228.0, complete_trials=95, metrics=95 -> eval_loss=0.3808/0.3764,eval_f1=0.8802/0.8764; [True, False]
prune? step=399, warmup=228.0, complete_trials=95, metrics=95 -> eval_loss=0.3419/0.3637,eval_f1=0.8891/0.8798; [False, False]
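
In a multi-objective study there is usually no single best trial. `study.best_trials` holds the trials on the Pareto front, that is, trials that no other trial beats on both objectives. The sketch below shows the dominance check for the two directions used here (minimize loss, maximize F1). It is a plain-Python illustration, not Optuna's implementation; `dominates`, `pareto_front` and the trial values are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (eval_loss to minimize, eval_f1 to maximize)

def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b on both objectives and better on one."""
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    better = a[0] < b[0] or a[1] > b[1]
    return no_worse and better

def pareto_front(points: List[Point]) -> List[Point]:
    """Keep the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# hypothetical (eval_loss, eval_f1) results of four trials
trials = [(0.34, 0.889), (0.35, 0.893), (0.36, 0.880), (0.33, 0.885)]

print(pareto_front(trials))
```

Here the third trial is dropped because the first one has both a lower loss and a higher F1. The other three remain, since each of them is best on at least one trade-off.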


In [None]:
!ls -lahtr 10kgnad_hf__distilbert-base-german-cased/

## Hyperparameter Tuning with Trainer.hyperparameter_search

https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer.hyperparameter_search

In [None]:
# disable transformer warnings like "Some weights of the model checkpoint ..."
logging.set_verbosity_error()


training_args = TrainingArguments(
    output_dir=str(project_name),
    report_to=[],
    log_level="error",
    disable_tqdm=False,

    evaluation_strategy="steps",
    # eval_steps=eval_steps,
    save_strategy="steps",
    # save_steps=eval_steps,
    # load_best_model_at_end=False,
    # metric_for_best_model="eval_loss",
    # greater_is_better=False,
)

trainer = Trainer(
    model_init=model_init,
    args=training_args,
    train_dataset=tokenized_gnad10k["train"],
    eval_dataset=tokenized_gnad10k["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)


# Default objective is the sum of all metrics
# when metrics are provided, so we have to maximize it.
# best = trainer.hyperparameter_search(
#     hp_space=hp_space,
#     compute_objective=objective,
#     n_trials=2
# )
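
`Trainer.hyperparameter_search` accepts a `compute_objective` callable that maps the evaluation metrics to a single number. As a sketch of what such a function could look like, the hypothetical `f1_objective` below returns only the macro F1. The example metrics are copied from the last evaluation row shown earlier, and the search call itself is left commented out because it has not been run here.

```python
from typing import Dict

def f1_objective(metrics: Dict[str, float]) -> float:
    """Hypothetical objective: use only the macro F1 reported by compute_metrics."""
    return metrics["eval_f1"]

# metrics dict shaped like the result of trainer.evaluate()
example_metrics = {"eval_loss": 0.341928, "eval_f1": 0.88906, "eval_acc": 0.892996}
print(f1_objective(example_metrics))

# Untested sketch of how it could be passed to the search:
# best = trainer.hyperparameter_search(
#     hp_space=hp_space,
#     compute_objective=f1_objective,
#     direction="maximize",
#     n_trials=2,
# )
```

Since a higher F1 is better, the search direction would have to be set to `"maximize"` for this objective.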