<a href="https://colab.research.google.com/github/goerlitz/nlp-classification/blob/main/notebooks/10kGNAD/colab/21c2_10kGNAD_huggingface_optuna_config.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hyperparameter Optimization with HuggingFace Transformers

Adapted from https://huggingface.co/docs/transformers/custom_datasets#sequence-classification-with-imdb-reviews

Things we need
* a tokenizer
* tokenized input data
* a pretrained model
* evaluation metrics
* training parameters
* a Trainer instance

Notes
* [class labels can be included in the model config](https://github.com/huggingface/transformers/pull/2945#issuecomment-781986506) (a bit hacky)
* [fp16 is disabled on the Tesla P100 GPU in PyTorch](https://discuss.pytorch.org/t/cnn-fp16-slower-than-fp32-on-tesla-p100/12146)
* [comparison of GPUs (K80, T4, P100, V100)](https://www.kaggle.com/general/198232)
* [GPU benchmark, mixed precision](https://medium.com/the-artificial-impostor/mixed-precision-training-on-tesla-t4-and-p100-d82e5d3b987d)

## Prerequisites

In [None]:
# checkpoint = "distilbert-base-german-cased"
checkpoint = "deepset/gbert-base"
# checkpoint = "deepset/gelectra-base"
# checkpoint = "deepset/gelectra-large"

project_name = f'10kgnad_hf__{checkpoint.replace("/", "_")}'

### Connect Google Drive

Google Drive is used to persist the results (saved models and the Optuna database).

In [13]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [14]:
from pathlib import Path

# define model path
root_path = Path('/content/gdrive/My Drive/')
base_path = root_path / 'Colab Notebooks/nlp-classification/'
model_path = base_path / 'models'

## Check GPU

In [15]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Select the Runtime > "Change runtime type" menu to enable a GPU accelerator, ')
  print('and then re-execute this cell.')
else:
  print(gpu_info)

Sun Jan 30 22:50:06 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.46       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   46C    P0    35W / 250W |   3899MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Install APEX

https://stackoverflow.com/questions/57284345/how-to-install-nvidia-apex-on-google-colab

In [None]:
%%writefile setup.sh

git clone https://github.com/NVIDIA/apex
pip install -q --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./apex >/dev/null

Overwriting setup.sh


In [None]:
%%time
# !sh setup.sh

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 7.87 µs


### Install Packages

In [None]:
%%time
!pip install -q -U transformers datasets >/dev/null
!pip install -q -U optuna >/dev/null

# check installed version
!pip freeze | grep optuna        # optuna==2.10.0
!pip freeze | grep transformers  # transformers==4.16.1
!pip freeze | grep "torch "      # torch==1.10.0+cu111

optuna==2.10.0
transformers==4.16.1
torch @ https://download.pytorch.org/whl/cu111/torch-1.10.0%2Bcu111-cp37-cp37m-linux_x86_64.whl
CPU times: user 134 ms, sys: 448 ms, total: 582 ms
Wall time: 16.4 s


In [None]:
from transformers import logging

# hide progress bar when downloading tokenizer and model (a workaround!)
logging.get_verbosity = lambda : logging.NOTSET

## Load Dataset

In [None]:
from datasets import load_dataset

base_url = "https://raw.githubusercontent.com/tblock/10kGNAD/master/{}.csv"
data_files = {x: base_url.format(x) for x in ["train", "test"]}
dataset = (load_dataset('csv',
                        data_files=data_files,
                        sep=";",
                        quotechar="'",
                        names=["label", "text"]).
           class_encode_column("label"))

label_names = dataset["train"].features["label"].names

Using custom data configuration default-0e1a53e9f937c1cf
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-0e1a53e9f937c1cf/0.0.0/6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e)


  0%|          | 0/2 [00:00<?, ?it/s]

Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-0e1a53e9f937c1cf/0.0.0/6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e/cache-7e2cd654f77312b3.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-0e1a53e9f937c1cf/0.0.0/6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e/cache-8a8200b7f43f1260.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-0e1a53e9f937c1cf/0.0.0/6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e/cache-575eed89dcd8fbbd.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-0e1a53e9f937c1cf/0.0.0/6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e/cache-944af939de8444f9.arrow


In [None]:
print(dataset)
print("labels:", label_names)
dataset["train"][0]

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 9245
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 1028
    })
})
labels: ['Etat', 'Inland', 'International', 'Kultur', 'Panorama', 'Sport', 'Web', 'Wirtschaft', 'Wissenschaft']


{'label': 5,
 'text': '21-Jähriger fällt wohl bis Saisonende aus. Wien – Rapid muss wohl bis Saisonende auf Offensivspieler Thomas Murg verzichten. Der im Winter aus Ried gekommene 21-Jährige erlitt beim 0:4-Heimdebakel gegen Admira Wacker Mödling am Samstag einen Teilriss des Innenbandes im linken Knie, wie eine Magnetresonanz-Untersuchung am Donnerstag ergab. Murg erhielt eine Schiene, muss aber nicht operiert werden. Dennoch steht ihm eine mehrwöchige Pause bevor.'}

### Use Dynamic Padding

Pad each batch only to the length of its longest text. This is more efficient than padding the whole dataset to one fixed length.

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
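
As a rough illustration of what dynamic padding does (not the `DataCollatorWithPadding` implementation itself), the sketch below pads each batch of token-id lists to the length of its longest member. The helper name `pad_batch` and the pad id `0` are assumptions made for this example.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in `batch` to the length of the longest one.

    Hypothetical helper for illustration; returns padded ids and attention masks.
    """
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two batches with very different maximum lengths
short_batch = pad_batch([[101, 7, 102], [101, 7, 8, 9, 102]])
long_batch = pad_batch([[101, 102], [101] + [7] * 126 + [102]])

# Each batch is only as wide as its own longest sequence.
print(len(short_batch["input_ids"][0]))  # 5
print(len(long_batch["input_ids"][0]))   # 128
```

With padding applied to the whole dataset up front, the short batch would also be 128 tokens wide.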

### Define Evaluation Metrics

The function that computes the metrics needs to be passed to the Trainer.
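
The cell that defines `compute_metrics` is not shown here. The logged columns (Acc, F1, Precision, Recall, Mcc) look macro-averaged: the trial that collapses to a single class reports a recall of 0.111111, which is 1/9 for nine classes. As a plain-Python sketch of that kind of averaging (the function name `macro_scores` is an assumption, and the notebook may well compute its metrics with a library instead):

```python
from collections import Counter


def macro_scores(labels, preds, num_classes):
    """Macro-averaged precision, recall and F1; undefined per-class scores count as 0."""
    true_pos = Counter(y for y, p in zip(labels, preds) if y == p)
    pred_count = Counter(preds)
    label_count = Counter(labels)

    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        prec = true_pos[c] / pred_count[c] if pred_count[c] else 0.0
        rec = true_pos[c] / label_count[c] if label_count[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    return {
        "precision": sum(precisions) / num_classes,
        "recall": sum(recalls) / num_classes,
        "f1": sum(f1s) / num_classes,
    }


# A model that always predicts class 0 on three balanced classes:
# recall is 1 for class 0 and 0 for the others, so macro recall is 1/3.
scores = macro_scores([0, 1, 2, 0, 1, 2], [0, 0, 0, 0, 0, 0], num_classes=3)
print(scores)
```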

## Hyperparameter Tuning

In [None]:
from transformers import AutoConfig, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
from optuna import Trial  # used in the type hints below
import shutil

def hp_space(trial: Trial):
    """A function that defines the hyperparameter search space.
    To be used in :obj:`Trainer.hyperparameter_search`."""
    return {
        # "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-4, log=True),  # distilbert/bert
        "learning_rate": trial.suggest_float("learning_rate", 5e-6, 5e-4, log=True),  # distilbert 1 epoch
        # "learning_rate": trial.suggest_float("learning_rate", 6e-5, 2e-4, log=True),  # electra
        # "num_train_epochs": trial.suggest_categorical("num_train_epochs", [1]),
        "num_train_epochs": trial.suggest_categorical("num_train_epochs", [3]),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 16, 32, 64, 128, 256]),
        "weight_decay": trial.suggest_float("weight_decay", 1e-3, 1e-2, log=True),
        # "weight_decay": trial.suggest_categorical("weight_decay", [1e-3, 0.0]),
    }

best_model_dir = "best_model_trainer"

def best_model_callback(study, trial):
    """Save the model from a best trial"""
    for t in study.best_trials:
        if t.number == trial.number:
            print("This is a new besttrial", trial.number)
        
            out_filename = model_path / f"{project_name}_t{trial.number}"
            shutil.make_archive(out_filename, 'zip', f"{project_name}/{best_model_dir}")

def model_init(trial: Trial):
    """A function that instantiates the model to be used."""

    # We want to include the label names and save them together with the model.
    # The only way to do this is to create a Config and put them in. 
    config = AutoConfig.from_pretrained(
            checkpoint,
            num_labels=len(label_names),
            id2label={i: label for i, label in enumerate(label_names)},
            label2id={label: i for i, label in enumerate(label_names)},
            )

    return AutoModelForSequenceClassification.from_pretrained(checkpoint, config=config)

def objective(trial: Trial):

    # get hyperparameters choice
    hp = hp_space(trial)
    lr = hp["learning_rate"]
    bs = hp["per_device_train_batch_size"]
    epochs = hp["num_train_epochs"]
    weight_decay = hp["weight_decay"]
    # label_smoothing_factor = hp["label_smoothing_factor"]

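    # `bs` is the effective batch size: the GPU processes micro-batches of 8
    # and gradients are accumulated until `bs` examples have been seen
    train_batch_size = 8
    gradient_accumulation_steps = bs // train_batch_size

    # evaluate `eval_rounds_per_epoch` times per epoch (counted in optimizer steps)
    eval_rounds_per_epoch = 5
    eval_steps = dataset["train"].num_rows / bs // eval_rounds_per_epoch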

    training_args = TrainingArguments(
        output_dir=str(project_name),
        report_to=[],
        log_level="error",
        disable_tqdm=False,

        evaluation_strategy="steps",
        eval_steps=eval_steps,
        logging_steps=eval_steps,
        save_strategy="steps",
        save_steps=eval_steps,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        greater_is_better=False,

        # hyperparameters
        num_train_epochs=epochs,
        learning_rate=lr,
        per_device_train_batch_size=train_batch_size,
        per_device_eval_batch_size=train_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        weight_decay=weight_decay,
        # label_smoothing_factor=label_smoothing_factor,

        # fp16=True,  # fp16 needs apex. but disabled on Tesla P100 by pytorch
    )

    trainer = Trainer(
        model_init=model_init,
        args=training_args,
        train_dataset=tokenized_dataset["train"],
        eval_dataset=tokenized_dataset["test"],
        tokenizer=tokenizer,
        # data_collator=data_collator,
        compute_metrics=compute_metrics,
        callbacks=[TrialLogAndPruningCallback(trial, objectives=["eval_loss", "eval_f1"], min_trials=700, warmup_steps=eval_steps*3)]
        # callbacks=[TrialPruningCallback(trial)]
    )

    # train model and save best model from evaluations
    # needs 'load_best_model_at_end=True'
    trainer.train()
    trainer.save_model(f"{project_name}/{best_model_dir}")

    result = trainer.evaluate(eval_dataset=tokenized_dataset["test"])

    # store eval metrics in trial
    trial.set_user_attr("eval_result", result)
    
    # return result["eval_loss"]
    return result["eval_loss"], result["eval_f1"]
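
To make the batch-size arithmetic in `objective` concrete, the sketch below recomputes the accumulation and evaluation intervals for each candidate batch size, using the 9245 training rows reported for the dataset. The helper name `schedule` is made up for this example; the formulas mirror the ones in the cell.

```python
def schedule(num_rows, bs, micro_bs=8, evals_per_epoch=5):
    """Return (gradient_accumulation_steps, eval_steps) for an effective batch size `bs`."""
    accumulation = bs // micro_bs
    eval_steps = int(num_rows / bs // evals_per_epoch)
    return accumulation, eval_steps


for bs in [8, 16, 32, 64, 128, 256]:
    accumulation, eval_steps = schedule(9245, bs)
    print(f"bs={bs:3d}  accumulation={accumulation:2d}  eval_steps={eval_steps}")
```

The resulting intervals (231, 115, 57, 28, 14 and 7 steps) match the step spacing in the training logs.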

## Hyperparameter Tuning
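
With two objectives there is generally no single best trial. Optuna's `study.best_trials` returns the Pareto-optimal trials instead, and that is the list `best_model_callback` iterates over. The sketch below is not Optuna's implementation, only a plain-Python illustration of the dominance rule for a loss to minimize and an F1 score to maximize. The function names are assumptions, and the values are the results of trials 0 to 8 from the log, rounded to four decimals.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` on both objectives and strictly better on one."""
    loss_a, f1_a = a
    loss_b, f1_b = b
    return loss_a <= loss_b and f1_a >= f1_b and (loss_a < loss_b or f1_a > f1_b)


def pareto_front(trials):
    """Return the indices of all trials that no other trial dominates."""
    return [
        i for i, t in enumerate(trials)
        if not any(dominates(other, t) for j, other in enumerate(trials) if j != i)
    ]


# (eval_loss, eval_f1) for trials 0-8, rounded from the log
results = [
    (0.3627, 0.8902), (0.4421, 0.8898), (0.4072, 0.8644),
    (0.3660, 0.8774), (0.3518, 0.8878), (0.4373, 0.8748),
    (0.5494, 0.8436), (0.3475, 0.8874), (0.3346, 0.8930),
]

print(pareto_front(results[:8]))  # [0, 4, 7]
print(pareto_front(results))      # [8]
```

Trials 0, 4 and 7 trade a lower loss against a lower F1, so all three sit on the front until trial 8 beats them on both objectives. This agrees with the "new besttrial" messages for trials 0, 4, 7 and 8 in the log.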

In [None]:
import optuna
from optuna.storages import RDBStorage
import random
import numpy as np

db_path = "/content/gdrive/My Drive/Colab Notebooks/nlp-classification/"
db_name = "10kgnad_optuna"
# study_name = checkpoint + "_multi_epoch234"
# study_name = checkpoint + "_loss-f1_bs32_epoch23"
# study_name = checkpoint + "_loss-f1_bs32_ep3_pad"
study_name = checkpoint + "_bs8-256_ep3_len128"

# automatically change the state of a stale trial to TrialState.FAIL from TrialState.RUNNING
storage = RDBStorage(url=f"sqlite:///{db_path}{db_name}.db", heartbeat_interval=60, grace_period=120)

# https://stackoverflow.com/questions/59129812/how-to-avoid-cuda-out-of-memory-in-pytorch
import torch
torch.cuda.empty_cache()
import gc
gc.collect()

# multi objective study
# https://optuna.readthedocs.io/en/stable/tutorial/20_recipes/002_multi_objective.html#sphx-glr-tutorial-20-recipes-002-multi-objective-py
study = optuna.create_study(study_name=study_name,
                            directions=["minimize", "maximize"],
                            # storage=f"sqlite:///{db_path}{db_name}.db",
                            storage=storage,
                            load_if_exists=True,)

# ------ prime with parameters
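def lr_sample(min_lr, max_lr, dist=0.1, jitter=0.1):
    """Log-spaced learning rates between min_lr and max_lr, each scaled by a random factor in [1-jitter, 1+jitter]."""
    # renamed arguments so they no longer shadow the built-ins min() and max()
    min_log = np.log10(min_lr)
    max_log = np.log10(max_lr)
    n = int((max_log - min_log) / dist)  # dist is the approximate spacing in log10 units
    return np.logspace(min_log, max_log, n) * np.random.uniform(1-jitter, 1+jitter, size=n)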

def lr_pairs():
    lrs = []
    min_lr = 5e-6
    max_lr = 5e-4
    for bs in [8, 16, 32, 64, 128, 256]:
        lrs.extend((bs, lr) for lr in lr_sample(min_lr, max_lr, dist=0.05, jitter=0.05))
    random.shuffle(lrs)
    return lrs

for bs, lr in lr_pairs():
    study.enqueue_trial(
        {
                "learning_rate": lr,
                "per_device_train_batch_size": bs,
            }
        )

# give some hyperparameters that are presumably good
# for bs in [8,16,32,64,128]:
#     for lr in np.exp(np.linspace(np.log(7e-6), np.log(2e-4), 15)):
#         study.enqueue_trial(
#             {
#                 "learning_rate": lr,
#                 "per_device_train_batch_size": bs,
#             }
#         )
# study.enqueue_trial(
#     {
#         "learning_rate": 5.1e-5,
#         "per_device_train_batch_size": 32,
#     }
# )
# study.enqueue_trial(
#     {
#         "learning_rate": 5.8e-5,
#         "per_device_train_batch_size": 32,
#     }
# )


study.optimize(objective, n_trials=200, callbacks=[best_model_callback])

# study.best_params

[I 2022-01-29 07:46:30,994] A new study created in RDB with name: deepset/gbert-base_bs8-256_ep3_len128
  create_trial(state=TrialState.WAITING, system_attrs={"fixed_params": params})
  create_trial(state=TrialState.WAITING, system_attrs={"fixed_params": params})
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 16)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=345.0, min_trials=700
params: {'learning_rate': 9.170262586219904e-06, 'num_train_epochs': 3, 'per_device_train_batch_size': 16, 'weight_decay': 0.0027599208115882986}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
115,1.5371,0.897237,0.756809,0.71614,0.813256,0.7005,0.721879
230,0.7384,0.534543,0.857004,0.849835,0.853672,0.848348,0.836384
345,0.544,0.459638,0.860895,0.854224,0.852459,0.859615,0.841169
460,0.4738,0.437892,0.870623,0.86438,0.862016,0.872975,0.852892
575,0.4557,0.399327,0.88716,0.883708,0.889948,0.879312,0.870744
690,0.3716,0.380467,0.877432,0.873802,0.877573,0.871174,0.859624
805,0.3037,0.365358,0.892996,0.8867,0.887809,0.88642,0.877468
920,0.3117,0.386337,0.875486,0.86901,0.875974,0.866185,0.857515
1035,0.3194,0.374001,0.892023,0.884317,0.883084,0.888582,0.876567
1150,0.3442,0.368,0.893969,0.886957,0.890175,0.886085,0.878631


[I 2022-01-29 07:56:30,721] Trial 0 finished with values: [0.3627486228942871, 0.8901865223926281] and parameters: {'learning_rate': 9.170262586219904e-06, 'num_train_epochs': 3, 'per_device_train_batch_size': 16, 'weight_decay': 0.0027599208115882986}.


This is a new besttrial 0


fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 32)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=171.0, min_trials=700
params: {'learning_rate': 0.00025377987451239765, 'num_train_epochs': 3, 'per_device_train_batch_size': 32, 'weight_decay': 0.0027855104091463493}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
57,1.0193,0.847159,0.741245,0.725104,0.744659,0.757156,0.714034
114,0.7401,0.583273,0.822957,0.817302,0.830493,0.817661,0.798006
171,0.7023,0.655447,0.810311,0.793703,0.80837,0.809643,0.785935
228,0.6637,0.496088,0.85214,0.851229,0.856681,0.855069,0.832134
285,0.6689,0.554016,0.845331,0.843783,0.857241,0.837049,0.82351
342,0.5148,0.528414,0.848249,0.836174,0.844787,0.833759,0.826584
399,0.4353,0.543301,0.860895,0.84959,0.853161,0.85897,0.841827
456,0.4215,0.515264,0.849222,0.827968,0.841602,0.832897,0.827899
513,0.3521,0.514429,0.874514,0.867709,0.86542,0.873051,0.856551
570,0.4167,0.495555,0.860895,0.841587,0.860862,0.839522,0.84148


[I 2022-01-29 08:06:08,960] Trial 1 finished with values: [0.4420566260814667, 0.8897816535322313] and parameters: {'learning_rate': 0.00025377987451239765, 'num_train_epochs': 3, 'per_device_train_batch_size': 32, 'weight_decay': 0.0027855104091463493}.
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 8)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=693.0, min_trials=700
params: {'learning_rate': 1.477243933028255e-05, 'num_train_epochs': 3, 'per_device_train_batch_size': 8, 'weight_decay': 0.00965644490221287}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
231,1.0815,0.5603,0.843385,0.840096,0.838402,0.846232,0.821653
462,0.5134,0.435483,0.871595,0.86754,0.86162,0.878601,0.853917
693,0.4702,0.416472,0.882296,0.878207,0.877135,0.882815,0.865591
924,0.4313,0.407243,0.871595,0.864351,0.86373,0.870106,0.853648
1155,0.4431,0.411471,0.892023,0.888496,0.891028,0.889036,0.87703
1386,0.2608,0.414936,0.892023,0.886505,0.889275,0.885957,0.876599
1617,0.2653,0.429204,0.892023,0.884773,0.883453,0.888157,0.87679
1848,0.282,0.483657,0.883268,0.878967,0.886231,0.87681,0.866471
2079,0.2843,0.42763,0.891051,0.8847,0.883097,0.887044,0.875257
2310,0.3081,0.413849,0.898833,0.896164,0.902419,0.892017,0.884197


[I 2022-01-29 08:16:40,565] Trial 2 finished with values: [0.4072425365447998, 0.8643513996282546] and parameters: {'learning_rate': 1.477243933028255e-05, 'num_train_epochs': 3, 'per_device_train_batch_size': 8, 'weight_decay': 0.00965644490221287}.
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 32)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=171.0, min_trials=700
params: {'learning_rate': 8.865861102164478e-06, 'num_train_epochs': 3, 'per_device_train_batch_size': 32, 'weight_decay': 0.004610893951910017}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
57,1.79,1.310157,0.628405,0.469333,0.577545,0.499204,0.576793
114,1.0679,0.751542,0.814202,0.793712,0.832896,0.787286,0.787767
171,0.6959,0.55582,0.849222,0.845489,0.855019,0.839391,0.827291
228,0.5629,0.505775,0.854086,0.849252,0.84744,0.858791,0.834354
285,0.5252,0.450872,0.86965,0.862135,0.86714,0.860356,0.850749
342,0.4583,0.414959,0.874514,0.87051,0.871114,0.870769,0.856316
399,0.3718,0.400745,0.876459,0.869203,0.871656,0.869392,0.858533
456,0.3831,0.401695,0.879377,0.872886,0.870227,0.878862,0.862252
513,0.3749,0.381597,0.879377,0.872931,0.870816,0.877462,0.862093
570,0.395,0.383328,0.889105,0.884042,0.888943,0.880802,0.87295


  _warn_prf(average, modifier, msg_start, len(result))


[I 2022-01-29 08:25:54,330] Trial 3 finished with values: [0.36601579189300537, 0.8774477535603779] and parameters: {'learning_rate': 8.865861102164478e-06, 'num_train_epochs': 3, 'per_device_train_batch_size': 32, 'weight_decay': 0.004610893951910017}.
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 32)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=171.0, min_trials=700
params: {'learning_rate': 9.332422126559672e-05, 'num_train_epochs': 3, 'per_device_train_batch_size': 32, 'weight_decay': 0.003930827533604793}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
57,0.9157,0.559776,0.831712,0.824993,0.851876,0.82113,0.809747
114,0.519,0.449155,0.851167,0.824476,0.845263,0.827201,0.830528
171,0.4811,0.425011,0.866732,0.858558,0.852967,0.871659,0.848311
228,0.4397,0.43861,0.859922,0.858908,0.861055,0.867057,0.84126
285,0.4455,0.403228,0.873541,0.870829,0.889054,0.858319,0.855422
342,0.2758,0.367296,0.892023,0.890413,0.887556,0.895201,0.876615
399,0.212,0.387461,0.892996,0.886157,0.890431,0.883284,0.877391
456,0.2272,0.393353,0.884241,0.880822,0.88496,0.880261,0.86775
513,0.2155,0.429472,0.890078,0.88341,0.883167,0.891791,0.87509
570,0.2649,0.351835,0.893969,0.887789,0.885117,0.892162,0.878854


[I 2022-01-29 08:35:06,497] Trial 4 finished with values: [0.3518349528312683, 0.8877889552706211] and parameters: {'learning_rate': 9.332422126559672e-05, 'num_train_epochs': 3, 'per_device_train_batch_size': 32, 'weight_decay': 0.003930827533604793}.


This is a new besttrial 4


fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 128)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=42.0, min_trials=700
params: {'learning_rate': 0.00041050509467014317, 'num_train_epochs': 3, 'per_device_train_batch_size': 128, 'weight_decay': 0.0011671045013154776}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
14,1.4594,0.763391,0.785992,0.765838,0.776774,0.782524,0.757014
28,0.7181,0.604803,0.821984,0.814266,0.821329,0.812472,0.796648
42,0.6788,0.526314,0.848249,0.843916,0.840697,0.854574,0.82749
56,0.5854,0.558474,0.83463,0.83048,0.846141,0.834079,0.814127
70,0.622,0.500237,0.853113,0.847902,0.864267,0.840061,0.831979
84,0.4307,0.540657,0.847276,0.838883,0.838308,0.850132,0.826701
98,0.3473,0.539931,0.849222,0.84208,0.842229,0.847322,0.828547
112,0.3781,0.456767,0.859922,0.852043,0.866657,0.84699,0.840447
126,0.3419,0.473319,0.858949,0.85147,0.859673,0.852179,0.839353
140,0.37,0.464985,0.86965,0.863679,0.861594,0.87051,0.851392


[I 2022-01-29 08:44:22,835] Trial 5 finished with values: [0.4373129606246948, 0.8748336346218838] and parameters: {'learning_rate': 0.00041050509467014317, 'num_train_epochs': 3, 'per_device_train_batch_size': 128, 'weight_decay': 0.0011671045013154776}.
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 256)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=21.0, min_trials=700
params: {'learning_rate': 1.3980474740304256e-05, 'num_train_epochs': 3, 'per_device_train_batch_size': 256, 'weight_decay': 0.0039606645867351345}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
7,2.0836,1.925694,0.463035,0.309615,0.413138,0.352162,0.384702
14,1.8512,1.664102,0.56323,0.372737,0.335689,0.432836,0.500177
21,1.617,1.420909,0.634241,0.47987,0.567738,0.508253,0.581262
28,1.3764,1.221086,0.672179,0.541829,0.810594,0.554148,0.625154
35,1.2147,1.041985,0.746109,0.660264,0.829345,0.650016,0.710464
42,1.0785,0.91129,0.787938,0.745801,0.82654,0.736779,0.757354
49,0.9037,0.806528,0.81323,0.786793,0.852364,0.7701,0.786662
56,0.8187,0.733761,0.822957,0.802746,0.842309,0.788039,0.797036
63,0.7497,0.678391,0.831712,0.817796,0.851763,0.804499,0.807333
70,0.7332,0.627279,0.846304,0.833607,0.855023,0.82424,0.823812


  _warn_prf(average, modifier, msg_start, len(result))


[I 2022-01-29 08:53:11,172] Trial 6 finished with values: [0.5493583679199219, 0.8435817589319067] and parameters: {'learning_rate': 1.3980474740304256e-05, 'num_train_epochs': 3, 'per_device_train_batch_size': 256, 'weight_decay': 0.0039606645867351345}.
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 64)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=84.0, min_trials=700
params: {'learning_rate': 1.9839175247031734e-05, 'num_train_epochs': 3, 'per_device_train_batch_size': 64, 'weight_decay': 0.005227926427303831}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
28,1.7086,1.150119,0.678016,0.547772,0.817171,0.557889,0.634344
56,0.9221,0.624375,0.832685,0.818441,0.844124,0.811205,0.808468
84,0.5919,0.489699,0.854086,0.847539,0.853667,0.844534,0.832897
112,0.504,0.419624,0.874514,0.868419,0.867372,0.871146,0.856504
140,0.4929,0.41545,0.872568,0.866709,0.867942,0.868863,0.854631
168,0.4061,0.378423,0.885214,0.880989,0.878387,0.883987,0.868584
196,0.3197,0.373419,0.886187,0.878579,0.876293,0.883851,0.8699
224,0.3282,0.363974,0.885214,0.880946,0.881016,0.882122,0.868638
252,0.3138,0.362317,0.88716,0.883535,0.889868,0.879382,0.870731
280,0.3387,0.35685,0.892023,0.88517,0.88776,0.885019,0.876428


[I 2022-01-29 09:02:11,473] Trial 7 finished with values: [0.34754306077957153, 0.8873589941298792] and parameters: {'learning_rate': 1.9839175247031734e-05, 'num_train_epochs': 3, 'per_device_train_batch_size': 64, 'weight_decay': 0.005227926427303831}.


This is a new besttrial 7


fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 256)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=21.0, min_trials=700
params: {'learning_rate': 0.00019445902615149377, 'num_train_epochs': 3, 'per_device_train_batch_size': 256, 'weight_decay': 0.0013622259121800415}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
7,1.7217,0.930834,0.755837,0.722563,0.777028,0.718739,0.723868
14,0.7575,0.499476,0.839494,0.825407,0.817273,0.843101,0.817734
21,0.5323,0.490282,0.855058,0.846839,0.854288,0.853505,0.835805
28,0.5483,0.444164,0.868677,0.865579,0.883054,0.852849,0.84973
35,0.5213,0.424248,0.871595,0.864654,0.86653,0.868794,0.853469
42,0.3336,0.356539,0.885214,0.880169,0.881189,0.881133,0.868659
49,0.2339,0.341159,0.896887,0.888712,0.890353,0.887332,0.881827
56,0.2667,0.354486,0.889105,0.88422,0.885375,0.885642,0.873361
63,0.2577,0.350029,0.885214,0.880049,0.887442,0.875805,0.86872
70,0.2774,0.334589,0.895914,0.892959,0.891801,0.895231,0.88089


[I 2022-01-29 09:11:27,311] Trial 8 finished with values: [0.3345893919467926, 0.8929587984285268] and parameters: {'learning_rate': 0.00019445902615149377, 'num_train_epochs': 3, 'per_device_train_batch_size': 256, 'weight_decay': 0.0013622259121800415}.


This is a new besttrial 8


fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 16)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=345.0, min_trials=700
params: {'learning_rate': 7.825169186318606e-06, 'num_train_epochs': 3, 'per_device_train_batch_size': 16, 'weight_decay': 0.007546565647762932}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
115,1.6173,1.014582,0.727626,0.664512,0.813271,0.649374,0.688747
230,0.8225,0.581985,0.847276,0.836764,0.84441,0.833571,0.82516
345,0.5796,0.483564,0.856031,0.848642,0.84723,0.853517,0.83561
460,0.4956,0.45351,0.859922,0.855453,0.854114,0.864705,0.840951
575,0.4729,0.410801,0.881323,0.877435,0.87999,0.876191,0.86416
690,0.3965,0.388723,0.874514,0.870081,0.873734,0.867474,0.856264
805,0.3246,0.375944,0.888132,0.881876,0.881395,0.883415,0.871924
920,0.3337,0.389852,0.875486,0.868038,0.876171,0.865039,0.857648
1035,0.3397,0.376305,0.889105,0.881228,0.879708,0.885603,0.87322
1150,0.3613,0.374152,0.890078,0.884905,0.889565,0.882985,0.874181


[I 2022-01-29 09:21:34,076] Trial 9 finished with values: [0.3658309578895569, 0.8842479162722792] and parameters: {'learning_rate': 7.825169186318606e-06, 'num_train_epochs': 3, 'per_device_train_batch_size': 16, 'weight_decay': 0.007546565647762932}.
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 64)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=84.0, min_trials=700
params: {'learning_rate': 1.0477519000433616e-05, 'num_train_epochs': 3, 'per_device_train_batch_size': 64, 'weight_decay': 0.009508215679227439}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
28,1.9201,1.577964,0.566148,0.375861,0.443247,0.435404,0.503658
56,1.3678,1.046987,0.742218,0.677397,0.821775,0.66151,0.705316
84,0.9341,0.728717,0.830739,0.813792,0.847918,0.800662,0.806066
112,0.6992,0.585476,0.844358,0.832775,0.855934,0.822099,0.82166
140,0.6348,0.521177,0.857004,0.849893,0.867443,0.840754,0.836691
168,0.548,0.4585,0.874514,0.867235,0.874861,0.862608,0.856204
196,0.4468,0.437498,0.876459,0.869606,0.871122,0.870422,0.858539
224,0.4417,0.421277,0.881323,0.876645,0.87951,0.875143,0.864007
252,0.42,0.41031,0.88035,0.875258,0.884876,0.868756,0.86285
280,0.4351,0.403156,0.88035,0.872311,0.876801,0.871096,0.862978


  _warn_prf(average, modifier, msg_start, len(result))


[I 2022-01-29 09:30:36,621] Trial 10 finished with values: [0.38307270407676697, 0.8758820664217609] and parameters: {'learning_rate': 1.0477519000433616e-05, 'num_train_epochs': 3, 'per_device_train_batch_size': 64, 'weight_decay': 0.009508215679227439}.
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 128)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=42.0, min_trials=700
params: {'learning_rate': 0.0002494710113539506, 'num_train_epochs': 3, 'per_device_train_batch_size': 128, 'weight_decay': 0.0015729870867858232}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
14,1.2004,0.651215,0.80642,0.794221,0.819898,0.795029,0.781097
28,0.5513,0.501216,0.84144,0.842194,0.85162,0.839923,0.819992
42,0.5085,0.454084,0.868677,0.867301,0.879274,0.860367,0.850064
56,0.4599,0.508519,0.839494,0.832903,0.849342,0.833049,0.818108
70,0.522,0.484665,0.861868,0.854714,0.853139,0.866511,0.843174
84,0.3426,0.396572,0.878405,0.869356,0.867819,0.875239,0.861334
98,0.2136,0.37086,0.889105,0.885171,0.887076,0.886531,0.873452
112,0.2582,0.463097,0.864786,0.862746,0.870137,0.870446,0.847314
126,0.2601,0.362699,0.876459,0.870016,0.873334,0.87002,0.858996
140,0.2989,0.349826,0.890078,0.884106,0.883875,0.884838,0.874132


[I 2022-01-29 09:39:31,443] Trial 11 finished with values: [0.34982600808143616, 0.8841063654586251] and parameters: {'learning_rate': 0.0002494710113539506, 'num_train_epochs': 3, 'per_device_train_batch_size': 128, 'weight_decay': 0.0015729870867858232}.
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 8)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=693.0, min_trials=700
params: {'learning_rate': 0.00014655188624811084, 'num_train_epochs': 3, 'per_device_train_batch_size': 8, 'weight_decay': 0.0010617745061379659}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
231,1.0553,0.923973,0.731518,0.729669,0.749733,0.744515,0.701663
462,0.9977,0.913319,0.731518,0.676095,0.691602,0.676751,0.692937
693,0.877,0.81525,0.784047,0.729187,0.776449,0.753816,0.755463
924,0.7936,0.680532,0.837549,0.828279,0.822098,0.844417,0.815997
1155,0.7274,0.698633,0.818093,0.820687,0.835581,0.820754,0.794759
1386,0.5644,0.671614,0.844358,0.839103,0.840765,0.848327,0.823777
1617,0.5317,0.794205,0.84144,0.83411,0.862845,0.821085,0.819807
1848,0.5417,0.679851,0.860895,0.85323,0.859297,0.851285,0.840842
2079,0.545,0.622204,0.847276,0.835203,0.858092,0.823682,0.825687
2310,0.5192,0.532936,0.878405,0.874476,0.883216,0.86836,0.8608


[I 2022-01-29 09:50:00,546] Trial 12 finished with values: [0.5329359769821167, 0.8744756071184393] and parameters: {'learning_rate': 0.00014655188624811084, 'num_train_epochs': 3, 'per_device_train_batch_size': 8, 'weight_decay': 0.0010617745061379659}.
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 16)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=345.0, min_trials=700
params: {'learning_rate': 0.00043802337191544045, 'num_train_epochs': 3, 'per_device_train_batch_size': 16, 'weight_decay': 0.0012533394094352084}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
115,2.2245,2.148651,0.163424,0.031215,0.018158,0.111111,0.0
230,2.194,2.202052,0.146887,0.028461,0.016321,0.111111,0.0
345,2.185,2.177348,0.13716,0.026804,0.01524,0.111111,0.0
460,2.1605,2.158997,0.163424,0.031215,0.018158,0.111111,0.0
575,2.1558,2.156488,0.13716,0.026804,0.01524,0.111111,0.0
690,2.1773,2.150413,0.163424,0.031215,0.018158,0.111111,0.0
805,2.1653,2.179559,0.163424,0.031215,0.018158,0.111111,0.0
920,2.1633,2.153965,0.163424,0.031215,0.018158,0.111111,0.0
1035,2.1442,2.128993,0.163424,0.031215,0.018158,0.111111,0.0
1150,2.1473,2.135439,0.163424,0.031215,0.018158,0.111111,0.0


  _warn_prf(average, modifier, msg_start, len(result))
[I 2022-01-29 09:59:44,814] Trial 13 finished with values: [2.120208501815796, 0.031215161649944256] and parameters: {'learning_rate': 0.00043802337191544045, 'num_train_epochs': 3, 'per_device_train_batch_size': 16, 'weight_decay': 0.0012533394094352084}.
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 32)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=171.0, min_trials=700
params: {'learning_rate': 0.0003233565932492428, 'num_train_epochs': 3, 'per_device_train_batch_size': 32, 'weight_decay': 0.003471968825336658}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
57,1.1016,0.836243,0.7607,0.747116,0.775057,0.781374,0.735428
114,0.8604,0.783594,0.76751,0.723125,0.762475,0.739758,0.737682
171,0.8336,0.894776,0.759728,0.740624,0.769781,0.750264,0.729723
228,0.86,0.833949,0.790856,0.788645,0.819206,0.792657,0.767097
285,0.8464,0.736023,0.80642,0.801909,0.821856,0.801148,0.782143
342,0.6859,0.885707,0.783074,0.777901,0.790188,0.795911,0.757697
399,0.6172,0.816222,0.818093,0.806368,0.822414,0.811661,0.794055
456,0.6103,0.67203,0.820039,0.815581,0.824549,0.8147,0.794786
513,0.5481,0.660491,0.848249,0.840172,0.847402,0.841948,0.82661
570,0.5548,0.585458,0.842412,0.839169,0.852542,0.830851,0.819988


[I 2022-01-29 10:09:00,460] Trial 14 finished with values: [0.4827063977718353, 0.8761334664911256] and parameters: {'learning_rate': 0.0003233565932492428, 'num_train_epochs': 3, 'per_device_train_batch_size': 32, 'weight_decay': 0.003471968825336658}.
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 64)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=84.0, min_trials=700
params: {'learning_rate': 5.5493780650768304e-05, 'num_train_epochs': 3, 'per_device_train_batch_size': 64, 'weight_decay': 0.002569654858971682}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
28,1.2306,0.583036,0.824903,0.811167,0.835182,0.800775,0.800409
56,0.541,0.452381,0.85214,0.850881,0.85724,0.854669,0.832474
84,0.46,0.393792,0.878405,0.873061,0.878601,0.871568,0.861008
112,0.4224,0.381267,0.88035,0.876925,0.878811,0.880306,0.86376
140,0.4515,0.375862,0.882296,0.877641,0.875747,0.882337,0.865494
168,0.2768,0.357285,0.877432,0.87429,0.875464,0.874978,0.859831
196,0.2172,0.354184,0.894942,0.888815,0.884587,0.893886,0.879787
224,0.238,0.35671,0.893969,0.888736,0.885248,0.894593,0.878844
252,0.2286,0.331659,0.896887,0.892923,0.897899,0.889343,0.881911
280,0.2531,0.335413,0.904669,0.898692,0.900288,0.897812,0.890782


[I 2022-01-29 10:18:03,141] Trial 15 finished with values: [0.32484832406044006, 0.9008951799368575] and parameters: {'learning_rate': 5.5493780650768304e-05, 'num_train_epochs': 3, 'per_device_train_batch_size': 64, 'weight_decay': 0.002569654858971682}.


This is a new best trial: 15


fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 8)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=693.0, min_trials=700
params: {'learning_rate': 5.753701304398057e-05, 'num_train_epochs': 3, 'per_device_train_batch_size': 8, 'weight_decay': 0.003686267331294336}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
231,0.8355,0.569053,0.826848,0.819821,0.817811,0.833299,0.803613
462,0.57,0.490939,0.857004,0.843325,0.838746,0.853056,0.837129
693,0.5724,0.497278,0.860895,0.859116,0.85576,0.864209,0.840878
924,0.504,0.467499,0.871595,0.866891,0.867236,0.872363,0.853567
1155,0.5453,0.473322,0.871595,0.870758,0.872994,0.872289,0.853677
1386,0.3058,0.561848,0.882296,0.878322,0.884681,0.877951,0.866081
1617,0.2599,0.506452,0.896887,0.89115,0.891135,0.891834,0.881911
1848,0.3152,0.504459,0.890078,0.883756,0.883677,0.887632,0.874455
2079,0.299,0.546737,0.882296,0.879635,0.880337,0.880929,0.865537
2310,0.3071,0.49197,0.892996,0.891925,0.89299,0.892873,0.877763


[I 2022-01-29 10:29:06,903] Trial 16 finished with values: [0.4674991965293884, 0.8668905491969076] and parameters: {'learning_rate': 5.753701304398057e-05, 'num_train_epochs': 3, 'per_device_train_batch_size': 8, 'weight_decay': 0.003686267331294336}.
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 64)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=84.0, min_trials=700
params: {'learning_rate': 6.847655329697433e-06, 'num_train_epochs': 3, 'per_device_train_batch_size': 64, 'weight_decay': 0.002097878125077771}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
28,2.0149,1.790944,0.525292,0.348995,0.322584,0.40405,0.457951
56,1.6425,1.38037,0.607977,0.452251,0.559578,0.484974,0.551437
84,1.2714,1.046559,0.743191,0.685175,0.820142,0.670657,0.70629
112,0.9772,0.825887,0.804475,0.773185,0.848543,0.754936,0.776448
140,0.8419,0.699804,0.827821,0.811426,0.851929,0.797798,0.803333
168,0.7334,0.60709,0.843385,0.829907,0.849021,0.821162,0.820506
196,0.6039,0.556377,0.851167,0.843846,0.856123,0.836845,0.829461
224,0.5745,0.52197,0.856031,0.849924,0.860033,0.843679,0.834923
252,0.5446,0.501139,0.861868,0.856722,0.873529,0.845981,0.841702
280,0.5467,0.479985,0.86965,0.860749,0.869679,0.856032,0.850671


  _warn_prf(average, modifier, msg_start, len(result))


[I 2022-01-29 10:38:08,394] Trial 17 finished with values: [0.44475504755973816, 0.8679923613492703] and parameters: {'learning_rate': 6.847655329697433e-06, 'num_train_epochs': 3, 'per_device_train_batch_size': 64, 'weight_decay': 0.002097878125077771}.
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 128)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=42.0, min_trials=700
params: {'learning_rate': 0.0001657259967540948, 'num_train_epochs': 3, 'per_device_train_batch_size': 128, 'weight_decay': 0.002449348556800101}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
14,1.2318,0.688053,0.772374,0.744273,0.813902,0.727693,0.743985
28,0.5635,0.473883,0.847276,0.844023,0.849226,0.847569,0.82693
42,0.5054,0.462308,0.857004,0.850328,0.848818,0.862751,0.837808
56,0.4439,0.42755,0.865759,0.859,0.859021,0.867777,0.847954
70,0.4994,0.397725,0.881323,0.87499,0.874083,0.880102,0.864609
84,0.2928,0.351229,0.879377,0.872462,0.871951,0.875836,0.862069
98,0.1909,0.382538,0.883268,0.877818,0.875766,0.881348,0.866426
112,0.2601,0.366226,0.891051,0.884148,0.887001,0.883405,0.875389
126,0.2279,0.350769,0.901751,0.895837,0.896713,0.898863,0.888014
140,0.2444,0.346802,0.895914,0.891504,0.890571,0.894765,0.881034


[I 2022-01-29 10:47:03,432] Trial 18 finished with values: [0.3468015193939209, 0.8915038847375549] and parameters: {'learning_rate': 0.0001657259967540948, 'num_train_epochs': 3, 'per_device_train_batch_size': 128, 'weight_decay': 0.002449348556800101}.
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 256)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=21.0, min_trials=700
params: {'learning_rate': 9.749898871510996e-06, 'num_train_epochs': 3, 'per_device_train_batch_size': 256, 'weight_decay': 0.001550196862429288}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
7,2.1176,2.003444,0.396887,0.265767,0.300071,0.300116,0.313118
14,1.9492,1.825149,0.507782,0.33744,0.318265,0.388211,0.436204
21,1.7977,1.638196,0.567121,0.382722,0.423992,0.437096,0.503051
28,1.6014,1.471394,0.613813,0.445226,0.559541,0.483255,0.557461
35,1.4655,1.316723,0.655642,0.507265,0.58467,0.529842,0.60636
42,1.3547,1.187138,0.68677,0.56236,0.682349,0.573182,0.641939
49,1.1793,1.07729,0.737354,0.650754,0.824202,0.640032,0.699845
56,1.0965,0.988802,0.769455,0.71098,0.831139,0.69324,0.736339
63,1.0046,0.921725,0.788911,0.746648,0.838257,0.729274,0.758444
70,0.9797,0.86153,0.808366,0.776703,0.846542,0.759548,0.780623


  _warn_prf(average, modifier, msg_start, len(result))


[I 2022-01-29 10:55:53,857] Trial 19 finished with values: [0.7464913129806519, 0.8066600481539498] and parameters: {'learning_rate': 9.749898871510996e-06, 'num_train_epochs': 3, 'per_device_train_batch_size': 256, 'weight_decay': 0.001550196862429288}.
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 64)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=84.0, min_trials=700
params: {'learning_rate': 0.0003037332858230308, 'num_train_epochs': 3, 'per_device_train_batch_size': 64, 'weight_decay': 0.0027847250456737114}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
28,1.2473,0.852071,0.703307,0.696037,0.819718,0.689803,0.677776
56,0.6887,0.704442,0.787938,0.782815,0.796108,0.791335,0.760095
84,0.7384,0.625297,0.835603,0.83117,0.832211,0.837319,0.812852
112,0.6162,0.522565,0.843385,0.835585,0.836123,0.845419,0.822268
140,0.6052,0.582338,0.84144,0.835354,0.860964,0.818817,0.818234
168,0.4759,0.524533,0.856031,0.841953,0.848916,0.842331,0.8358
196,0.371,0.514829,0.859922,0.855911,0.864592,0.852268,0.839773
224,0.3653,0.509692,0.843385,0.84171,0.854512,0.843238,0.822418
252,0.3318,0.498224,0.860895,0.853491,0.841831,0.872643,0.842056
280,0.3755,0.427407,0.879377,0.875154,0.874369,0.879854,0.862483


[I 2022-01-29 11:04:54,847] Trial 20 finished with values: [0.4274071753025055, 0.8751540410253789] and parameters: {'learning_rate': 0.0003037332858230308, 'num_train_epochs': 3, 'per_device_train_batch_size': 64, 'weight_decay': 0.0027847250456737114}.
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 128)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=42.0, min_trials=700
params: {'learning_rate': 3.3705205014236195e-05, 'num_train_epochs': 3, 'per_device_train_batch_size': 128, 'weight_decay': 0.0015353540566317835}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
14,1.7467,1.201802,0.657588,0.530257,0.807584,0.541821,0.61114
28,0.9808,0.667858,0.824903,0.806958,0.840651,0.798366,0.799782
42,0.6228,0.535765,0.837549,0.833475,0.841987,0.831569,0.814763
56,0.5292,0.44399,0.864786,0.859316,0.872104,0.850203,0.845209
70,0.5003,0.41356,0.873541,0.865987,0.871901,0.865044,0.855451
84,0.4033,0.380163,0.885214,0.878215,0.877934,0.8803,0.868639
98,0.3106,0.365028,0.884241,0.87783,0.879936,0.877592,0.867439
112,0.3148,0.357572,0.888132,0.882476,0.883761,0.881888,0.871903
126,0.3004,0.359294,0.888132,0.880857,0.888564,0.875423,0.871751
140,0.3381,0.35581,0.891051,0.882733,0.881536,0.886129,0.875393


[I 2022-01-29 11:13:48,746] Trial 21 finished with values: [0.34157106280326843, 0.891803054735122] and parameters: {'learning_rate': 3.3705205014236195e-05, 'num_train_epochs': 3, 'per_device_train_batch_size': 128, 'weight_decay': 0.0015353540566317835}.
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 16)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=345.0, min_trials=700
params: {'learning_rate': 0.00010385370797127183, 'num_train_epochs': 3, 'per_device_train_batch_size': 16, 'weight_decay': 0.004271677459604317}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
115,0.8364,0.604095,0.819066,0.805845,0.808588,0.821879,0.795966
230,0.6077,0.512416,0.856031,0.843052,0.843764,0.850921,0.836259
345,0.56,0.53728,0.829767,0.829835,0.832786,0.837538,0.806756
460,0.4957,0.492236,0.856031,0.851231,0.853224,0.861561,0.836815
575,0.5113,0.427359,0.875486,0.877702,0.876196,0.883279,0.858439
690,0.2826,0.520351,0.866732,0.861145,0.872301,0.858059,0.847997
805,0.2319,0.500757,0.875486,0.871655,0.870412,0.876533,0.858236
920,0.3096,0.480842,0.878405,0.871936,0.87447,0.874482,0.861072
1035,0.266,0.434851,0.898833,0.892524,0.887829,0.898519,0.884313
1150,0.286,0.424778,0.892996,0.886364,0.890188,0.884881,0.877485


[I 2022-01-29 11:23:26,457] Trial 22 finished with values: [0.4247778058052063, 0.8863639658799652] and parameters: {'learning_rate': 0.00010385370797127183, 'num_train_epochs': 3, 'per_device_train_batch_size': 16, 'weight_decay': 0.004271677459604317}.
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 64)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=84.0, min_trials=700
params: {'learning_rate': 6.2166624718692126e-06, 'num_train_epochs': 3, 'per_device_train_batch_size': 64, 'weight_decay': 0.0015762901497597605}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
28,2.0312,1.827934,0.500973,0.334773,0.426345,0.38509,0.430727
56,1.6916,1.447881,0.592412,0.422876,0.551617,0.465537,0.533491
84,1.3445,1.122832,0.712062,0.62366,0.796724,0.617544,0.670647
112,1.0516,0.896437,0.782101,0.733597,0.839707,0.71638,0.751282
140,0.903,0.752795,0.817121,0.793707,0.85177,0.779115,0.791088
168,0.7882,0.655556,0.835603,0.821275,0.848889,0.810631,0.811562
196,0.6526,0.597928,0.848249,0.840209,0.85463,0.831831,0.826054
224,0.6171,0.557745,0.854086,0.84298,0.855996,0.83625,0.83277
252,0.5793,0.531529,0.858949,0.852363,0.869279,0.84153,0.838329
280,0.5822,0.508288,0.859922,0.849811,0.861276,0.842714,0.839422


  _warn_prf(average, modifier, msg_start, len(result))


[I 2022-01-29 11:32:25,158] Trial 23 finished with values: [0.4691549837589264, 0.8598378327769765] and parameters: {'learning_rate': 6.2166624718692126e-06, 'num_train_epochs': 3, 'per_device_train_batch_size': 64, 'weight_decay': 0.0015762901497597605}.
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 64)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=84.0, min_trials=700
params: {'learning_rate': 0.00015011643305430197, 'num_train_epochs': 3, 'per_device_train_batch_size': 64, 'weight_decay': 0.0013646844044177492}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
28,1.0697,0.525594,0.849222,0.842067,0.858859,0.836753,0.8288
56,0.513,0.486671,0.846304,0.841423,0.845198,0.845176,0.825423
84,0.5088,0.436635,0.865759,0.866791,0.872879,0.864694,0.846733
112,0.4502,0.414641,0.861868,0.859485,0.858872,0.870361,0.843731
140,0.4679,0.433122,0.870623,0.867227,0.871362,0.871153,0.852649
168,0.3252,0.411908,0.883268,0.879688,0.876017,0.885959,0.866791
196,0.2154,0.379826,0.889105,0.882778,0.88265,0.885114,0.873385
224,0.2495,0.374854,0.888132,0.884496,0.881285,0.889872,0.872117
252,0.2341,0.359042,0.892023,0.884298,0.88278,0.888072,0.876681
280,0.2694,0.359644,0.892996,0.887221,0.889691,0.885972,0.877586


[I 2022-01-29 11:41:24,981] Trial 24 finished with values: [0.35811516642570496, 0.8953553844888646] and parameters: {'learning_rate': 0.00015011643305430197, 'num_train_epochs': 3, 'per_device_train_batch_size': 64, 'weight_decay': 0.0013646844044177492}.
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 256)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=21.0, min_trials=700
params: {'learning_rate': 6.444068140135576e-05, 'num_train_epochs': 3, 'per_device_train_batch_size': 256, 'weight_decay': 0.0074867390620878506}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
7,1.8211,1.252208,0.655642,0.511273,0.684884,0.533402,0.60622
14,1.0405,0.696371,0.821012,0.794613,0.839337,0.787431,0.795321
21,0.6408,0.526478,0.85214,0.84164,0.852763,0.835978,0.830802
28,0.5375,0.476835,0.856031,0.853069,0.874534,0.840481,0.836519
35,0.4989,0.41358,0.866732,0.860887,0.861507,0.863398,0.847931
42,0.3954,0.388925,0.879377,0.870009,0.871572,0.871503,0.861997
49,0.2925,0.373065,0.877432,0.869896,0.874264,0.869282,0.859762
56,0.2969,0.367319,0.88716,0.879619,0.88171,0.880658,0.871085
63,0.2926,0.377272,0.884241,0.878499,0.888928,0.872424,0.867481
70,0.3289,0.365929,0.878405,0.869199,0.866794,0.875562,0.861225


  _warn_prf(average, modifier, msg_start, len(result))


[I 2022-01-29 11:50:16,095] Trial 25 finished with values: [0.3490840792655945, 0.8862868603824068] and parameters: {'learning_rate': 6.444068140135576e-05, 'num_train_epochs': 3, 'per_device_train_batch_size': 256, 'weight_decay': 0.0074867390620878506}.
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 32)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=171.0, min_trials=700
params: {'learning_rate': 7.867028121974179e-06, 'num_train_epochs': 3, 'per_device_train_batch_size': 32, 'weight_decay': 0.005149244601973092}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
57,1.8343,1.39781,0.617704,0.449955,0.571764,0.486179,0.564623
114,1.1578,0.828618,0.812257,0.782284,0.83508,0.777098,0.785433
171,0.7561,0.597633,0.84144,0.834304,0.849573,0.824555,0.818286
228,0.5991,0.527467,0.848249,0.842275,0.844541,0.848151,0.827574
285,0.5513,0.468684,0.865759,0.858712,0.866456,0.854867,0.846207
342,0.4843,0.429419,0.870623,0.865919,0.867161,0.865603,0.851853
399,0.3977,0.413269,0.876459,0.870159,0.874364,0.868748,0.858489
456,0.4052,0.411213,0.877432,0.870695,0.869117,0.875537,0.85999
513,0.3961,0.389139,0.88035,0.874339,0.873234,0.87748,0.863109
570,0.4138,0.392533,0.882296,0.876599,0.882701,0.872857,0.865154


  _warn_prf(average, modifier, msg_start, len(result))


[I 2022-01-29 11:59:27,781] Trial 26 finished with values: [0.3727229833602905, 0.8770659104005859] and parameters: {'learning_rate': 7.867028121974179e-06, 'num_train_epochs': 3, 'per_device_train_batch_size': 32, 'weight_decay': 0.005149244601973092}.
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 256)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=21.0, min_trials=700
params: {'learning_rate': 0.00014176483733247038, 'num_train_epochs': 3, 'per_device_train_batch_size': 256, 'weight_decay': 0.0022986715140560997}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
7,1.7017,0.937521,0.7607,0.719947,0.809048,0.710992,0.729417
14,0.7793,0.490461,0.85214,0.843478,0.8423,0.848619,0.83144
21,0.516,0.447428,0.863813,0.856459,0.85403,0.86555,0.844893
28,0.5007,0.431742,0.868677,0.867876,0.884582,0.856491,0.850061
35,0.4942,0.394611,0.867704,0.857681,0.862205,0.85823,0.848844
42,0.3095,0.348409,0.884241,0.875002,0.873104,0.878071,0.867609
49,0.2422,0.363521,0.890078,0.883573,0.890389,0.878665,0.873996
56,0.267,0.354465,0.895914,0.889303,0.886298,0.893076,0.880944
63,0.252,0.372987,0.883268,0.878491,0.888592,0.871949,0.866588
70,0.2763,0.337083,0.900778,0.895219,0.893715,0.898985,0.886684


[I 2022-01-29 12:08:17,730] Trial 27 finished with values: [0.33708322048187256, 0.8952187875367795] and parameters: {'learning_rate': 0.00014176483733247038, 'num_train_epochs': 3, 'per_device_train_batch_size': 256, 'weight_decay': 0.0022986715140560997}.
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 256)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=21.0, min_trials=700
params: {'learning_rate': 6.117647523466384e-06, 'num_train_epochs': 3, 'per_device_train_batch_size': 256, 'weight_decay': 0.0038493715778471}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
7,2.1498,2.067539,0.308366,0.186884,0.201609,0.237166,0.213683
14,2.0329,1.952708,0.44358,0.294855,0.317538,0.33645,0.360531
21,1.9449,1.845073,0.501946,0.335458,0.375158,0.38118,0.427571
28,1.8198,1.736612,0.548638,0.369788,0.370511,0.421984,0.481352
35,1.7365,1.631032,0.580739,0.405634,0.448358,0.44922,0.519779
42,1.6786,1.534877,0.607004,0.43913,0.4512,0.477212,0.549598
49,1.532,1.449728,0.617704,0.455655,0.562421,0.490685,0.562416
56,1.4666,1.373665,0.631323,0.472925,0.572223,0.50363,0.577976
63,1.3767,1.309977,0.651751,0.500484,0.58501,0.524468,0.602141
70,1.3621,1.252975,0.661479,0.516158,0.6927,0.537071,0.612733


  _warn_prf(average, modifier, msg_start, len(result))


[I 2022-01-29 12:17:04,429] Trial 28 finished with values: [1.1208508014678955, 0.6196000125987274] and parameters: {'learning_rate': 6.117647523466384e-06, 'num_train_epochs': 3, 'per_device_train_batch_size': 256, 'weight_decay': 0.0038493715778471}.
  "for distribution {}.".format(name, param_value, distribution)
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 32)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=171.0, min_trials=700
params: {'learning_rate': 0.00022317303056912185, 'num_train_epochs': 3, 'per_device_train_batch_size': 32, 'weight_decay': 0.0018214061884032904}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
57,0.9359,0.596372,0.809339,0.804676,0.823553,0.795881,0.781915
114,0.6845,0.578073,0.822957,0.825969,0.837731,0.822056,0.799052
171,0.6404,0.536541,0.837549,0.825745,0.815385,0.846772,0.815665
228,0.5705,0.527781,0.835603,0.831298,0.842077,0.83753,0.813943
285,0.5808,0.571544,0.825875,0.822172,0.844176,0.81754,0.802476
342,0.3869,0.510365,0.857004,0.847202,0.846914,0.851917,0.836735
399,0.2945,0.550787,0.85214,0.842728,0.844907,0.853106,0.831901
456,0.3543,0.430998,0.875486,0.86841,0.866294,0.875689,0.857803
513,0.3158,0.566049,0.86284,0.854198,0.866797,0.854604,0.843663
570,0.3535,0.437891,0.890078,0.888138,0.888901,0.888574,0.874197


[I 2022-01-29 12:26:15,234] Trial 29 finished with values: [0.43099766969680786, 0.8684098410427146] and parameters: {'learning_rate': 0.00022317303056912185, 'num_train_epochs': 3, 'per_device_train_batch_size': 32, 'weight_decay': 0.0018214061884032904}.
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 128)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=42.0, min_trials=700
params: {'learning_rate': 6.416450556475228e-06, 'num_train_epochs': 3, 'per_device_train_batch_size': 128, 'weight_decay': 0.0014725254027968936}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
14,2.1013,1.975201,0.431907,0.287238,0.294292,0.330657,0.351324
28,1.908,1.759971,0.541829,0.361387,0.437513,0.416746,0.474937
42,1.7136,1.54174,0.581712,0.411392,0.551734,0.453541,0.520753
56,1.49,1.346813,0.633268,0.469847,0.573097,0.502092,0.580546
70,1.3311,1.165804,0.694553,0.568865,0.708572,0.577523,0.651397
84,1.1946,1.02672,0.755837,0.69308,0.815251,0.683971,0.720525
98,1.0147,0.915913,0.788911,0.748272,0.836523,0.728784,0.758439
112,0.935,0.834182,0.812257,0.783934,0.851147,0.765852,0.784946
126,0.8565,0.784169,0.817121,0.796886,0.849553,0.779479,0.790873
140,0.839,0.728142,0.83463,0.817323,0.850256,0.804625,0.810477


  _warn_prf(average, modifier, msg_start, len(result))


[I 2022-01-29 12:35:04,515] Trial 30 finished with values: [0.6444544792175293, 0.8331249713307459] and parameters: {'learning_rate': 6.416450556475228e-06, 'num_train_epochs': 3, 'per_device_train_batch_size': 128, 'weight_decay': 0.0014725254027968936}.
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 16)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=345.0, min_trials=700
params: {'learning_rate': 5.8324041160853215e-05, 'num_train_epochs': 3, 'per_device_train_batch_size': 16, 'weight_decay': 0.006827576041735309}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
115,0.8431,0.53275,0.836576,0.836765,0.836609,0.850768,0.816098
230,0.5299,0.428289,0.865759,0.860436,0.859458,0.86435,0.846503
345,0.4994,0.527094,0.838521,0.825908,0.825672,0.84699,0.817771
460,0.4473,0.456128,0.866732,0.857312,0.860973,0.865808,0.848573
575,0.4519,0.389225,0.883268,0.881137,0.886679,0.878379,0.866675
690,0.2587,0.406758,0.892996,0.885176,0.892614,0.881055,0.877986
805,0.2096,0.442535,0.890078,0.885282,0.884096,0.891034,0.875059
920,0.2688,0.445921,0.889105,0.887877,0.894502,0.884357,0.873223
1035,0.2503,0.409281,0.892023,0.88775,0.883103,0.894199,0.876551
1150,0.2501,0.381521,0.893969,0.892803,0.89566,0.890495,0.878492


[I 2022-01-29 12:44:40,344] Trial 31 finished with values: [0.3815212547779083, 0.8928026814276008] and parameters: {'learning_rate': 5.8324041160853215e-05, 'num_train_epochs': 3, 'per_device_train_batch_size': 16, 'weight_decay': 0.006827576041735309}.
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 16)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=345.0, min_trials=700
params: {'learning_rate': 0.0002539606068280479, 'num_train_epochs': 3, 'per_device_train_batch_size': 16, 'weight_decay': 0.008215865853457857}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
115,1.1612,1.03352,0.698444,0.666042,0.742248,0.685758,0.664033
230,1.01,0.829171,0.785992,0.767893,0.791669,0.753971,0.754555
345,1.1091,1.483737,0.586576,0.604665,0.698901,0.614592,0.54475
460,1.0604,1.216273,0.64786,0.642867,0.738755,0.668977,0.616861
575,1.0377,1.195467,0.654669,0.665643,0.68713,0.695546,0.639437
690,0.8012,1.152508,0.663424,0.602792,0.61324,0.645257,0.628946
805,0.7163,1.001908,0.722763,0.715651,0.725675,0.747907,0.691943
920,0.7838,1.133106,0.700389,0.663119,0.666141,0.6912,0.667194
1035,0.6898,1.006042,0.716926,0.715276,0.748028,0.728877,0.686748
1150,0.7561,0.937207,0.756809,0.76242,0.776849,0.760182,0.723093


  _warn_prf(average, modifier, msg_start, len(result))


[I 2022-01-29 12:54:14,063] Trial 32 finished with values: [0.6939172744750977, 0.8474266079747198] and parameters: {'learning_rate': 0.0002539606068280479, 'num_train_epochs': 3, 'per_device_train_batch_size': 16, 'weight_decay': 0.008215865853457857}.
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 256)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=21.0, min_trials=700
params: {'learning_rate': 0.000246009921598085, 'num_train_epochs': 3, 'per_device_train_batch_size': 256, 'weight_decay': 0.004778813284793992}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
7,1.7539,0.897408,0.779183,0.757483,0.806688,0.748621,0.75051
14,0.7453,0.519193,0.837549,0.829429,0.831084,0.835973,0.81471
21,0.5716,0.486066,0.854086,0.846543,0.848755,0.854843,0.834658
28,0.5383,0.423711,0.860895,0.856066,0.859162,0.856724,0.841031
35,0.547,0.417605,0.866732,0.861636,0.86949,0.861906,0.848131
42,0.339,0.358419,0.875486,0.870781,0.874704,0.870559,0.857638
49,0.2243,0.38385,0.888132,0.885755,0.896175,0.877752,0.87189
56,0.2855,0.375343,0.882296,0.877531,0.880071,0.876216,0.865257
63,0.2503,0.341824,0.891051,0.88776,0.894365,0.883175,0.875209
70,0.2982,0.347304,0.890078,0.886133,0.885498,0.888612,0.874282


[I 2022-01-29 13:03:03,206] Trial 33 finished with values: [0.34182417392730713, 0.8877602355183684] and parameters: {'learning_rate': 0.000246009921598085, 'num_train_epochs': 3, 'per_device_train_batch_size': 256, 'weight_decay': 0.004778813284793992}.
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 8)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=693.0, min_trials=700
params: {'learning_rate': 2.1308732758030146e-05, 'num_train_epochs': 3, 'per_device_train_batch_size': 8, 'weight_decay': 0.0015799166052704611}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
231,0.957,0.54846,0.837549,0.836494,0.836745,0.846769,0.816579
462,0.4953,0.408218,0.875486,0.872759,0.86434,0.886002,0.858249
693,0.4812,0.42994,0.875486,0.871243,0.872618,0.875032,0.857912
924,0.4376,0.41683,0.878405,0.871078,0.86856,0.880325,0.861412
1155,0.4632,0.435911,0.878405,0.876028,0.876549,0.879799,0.861783
1386,0.2491,0.4808,0.881323,0.8769,0.878786,0.880005,0.864881
1617,0.2553,0.43071,0.900778,0.896068,0.894391,0.898755,0.886583
1848,0.2729,0.5036,0.885214,0.880436,0.889968,0.876624,0.868647
2079,0.2793,0.449622,0.894942,0.889486,0.889967,0.89032,0.87971
2310,0.2936,0.442454,0.888132,0.884806,0.88965,0.882228,0.872097


[I 2022-01-29 13:13:27,600] Trial 34 finished with values: [0.40821805596351624, 0.8727594906409757] and parameters: {'learning_rate': 2.1308732758030146e-05, 'num_train_epochs': 3, 'per_device_train_batch_size': 8, 'weight_decay': 0.0015799166052704611}.
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 256)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=21.0, min_trials=700
params: {'learning_rate': 0.0004838555339703523, 'num_train_epochs': 3, 'per_device_train_batch_size': 256, 'weight_decay': 0.001767055979370236}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
7,2.2881,2.141694,0.143969,0.035486,0.049405,0.115741,0.031844
14,1.9317,1.76086,0.361868,0.194583,0.155798,0.294484,0.303916
21,1.9937,2.172541,0.242218,0.084884,0.062729,0.164683,0.137884
28,1.9821,2.296722,0.243191,0.083605,0.060925,0.165344,0.139344
35,2.1293,1.900215,0.25,0.08893,0.079438,0.185171,0.148385
42,1.9605,1.852681,0.302529,0.1326,0.099608,0.211942,0.19421
49,1.9016,2.064027,0.254864,0.100521,0.094733,0.184728,0.188748
56,2.2883,2.190271,0.219844,0.086679,0.104649,0.149471,0.16224
63,2.0692,2.046475,0.234436,0.095847,0.103209,0.159392,0.181391
70,2.0341,1.980591,0.255837,0.10414,0.093774,0.173942,0.197068


  _warn_prf(average, modifier, msg_start, len(result))
[I 2022-01-29 13:22:18,360] Trial 35 finished with values: [1.7608602046966553, 0.19458316444269771] and parameters: {'learning_rate': 0.0004838555339703523, 'num_train_epochs': 3, 'per_device_train_batch_size': 256, 'weight_decay': 0.001767055979370236}.
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 32)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=171.0, min_trials=700
params: {'learning_rate': 5.055557515444166e-05, 'num_train_epochs': 3, 'per_device_train_batch_size': 32, 'weight_decay': 0.007704930286495619}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
57,1.0387,0.504185,0.856031,0.839092,0.848376,0.842076,0.83619
114,0.4964,0.418576,0.860895,0.85033,0.848861,0.855644,0.84124
171,0.4436,0.413712,0.867704,0.863406,0.858815,0.87093,0.849061
228,0.419,0.403482,0.870623,0.865785,0.863475,0.875567,0.853309
285,0.4273,0.394584,0.877432,0.874872,0.883306,0.868619,0.859601
342,0.2409,0.350949,0.886187,0.883019,0.883535,0.883128,0.86969
399,0.2001,0.361852,0.894942,0.891806,0.894693,0.890752,0.879844
456,0.2283,0.364814,0.898833,0.89509,0.897584,0.895133,0.88422
513,0.2128,0.379509,0.892023,0.885443,0.883578,0.892761,0.876994
570,0.2581,0.336947,0.900778,0.894476,0.896648,0.894032,0.886387


[I 2022-01-29 13:31:29,007] Trial 36 finished with values: [0.336946964263916, 0.8944764602762665] and parameters: {'learning_rate': 5.055557515444166e-05, 'num_train_epochs': 3, 'per_device_train_batch_size': 32, 'weight_decay': 0.007704930286495619}.
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 8)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=693.0, min_trials=700
params: {'learning_rate': 0.0001213956551304579, 'num_train_epochs': 3, 'per_device_train_batch_size': 8, 'weight_decay': 0.004980760292979186}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
231,1.0058,0.854318,0.734436,0.726425,0.771102,0.735279,0.702844
462,0.7694,0.624968,0.829767,0.824237,0.84781,0.813396,0.805942
693,0.7422,0.669259,0.82393,0.816848,0.830567,0.829094,0.801379
924,0.696,0.619986,0.821984,0.81383,0.813301,0.827854,0.798229
1155,0.7139,0.590808,0.839494,0.834079,0.843496,0.832529,0.817522
1386,0.459,0.799877,0.838521,0.839789,0.85238,0.841509,0.817586
1617,0.4495,0.606432,0.863813,0.860312,0.86399,0.86922,0.846603
1848,0.4761,0.6121,0.864786,0.864214,0.877521,0.856419,0.84545
2079,0.4183,0.599132,0.875486,0.870009,0.879852,0.863937,0.857974
2310,0.4563,0.590836,0.867704,0.860504,0.877566,0.851618,0.849452


[I 2022-01-29 13:41:51,814] Trial 37 finished with values: [0.552101194858551, 0.8855075042466827] and parameters: {'learning_rate': 0.0001213956551304579, 'num_train_epochs': 3, 'per_device_train_batch_size': 8, 'weight_decay': 0.004980760292979186}.
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 256)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=21.0, min_trials=700
params: {'learning_rate': 5.392686865617486e-06, 'num_train_epochs': 3, 'per_device_train_batch_size': 256, 'weight_decay': 0.004339891033507396}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
7,2.1567,2.082201,0.287938,0.172366,0.190799,0.220505,0.19049
14,2.05,1.978758,0.422179,0.278948,0.312731,0.317954,0.336108
21,1.9711,1.882225,0.475681,0.317459,0.337079,0.359884,0.395979
28,1.857,1.78452,0.530156,0.357851,0.366209,0.405409,0.459887
35,1.783,1.689674,0.566148,0.388216,0.424876,0.436117,0.502313
42,1.7375,1.602826,0.582685,0.416298,0.441156,0.455914,0.52085
49,1.6014,1.525726,0.60214,0.435701,0.447825,0.474848,0.544347
56,1.5418,1.456419,0.615759,0.451065,0.56466,0.48733,0.560086
63,1.4599,1.398,0.632296,0.471328,0.578336,0.502064,0.579553
70,1.4501,1.346997,0.642996,0.486573,0.577872,0.51457,0.591648


  _warn_prf(average, modifier, msg_start, len(result))
[I 2022-01-29 13:50:38,467] Trial 38 finished with values: [1.2240041494369507, 0.54046694494783] and parameters: {'learning_rate': 5.392686865617486e-06, 'num_train_epochs': 3, 'per_device_train_batch_size': 256, 'weight_decay': 0.004339891033507396}.
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 32)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=171.0, min_trials=700
params: {'learning_rate': 0.00014702055839263257, 'num_train_epochs': 3, 'per_device_train_batch_size': 32, 'weight_decay': 0.001630971369462351}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
57,0.898,0.660651,0.798638,0.785095,0.793563,0.806144,0.773463
114,0.5704,0.525529,0.828794,0.822005,0.837418,0.82253,0.804911
171,0.565,0.471834,0.856031,0.846651,0.84757,0.857199,0.83655
228,0.4766,0.456308,0.859922,0.854037,0.85616,0.863981,0.840907
285,0.5489,0.414406,0.879377,0.875806,0.881667,0.871774,0.861747
342,0.3246,0.464491,0.866732,0.853801,0.85949,0.856289,0.847959
399,0.254,0.446613,0.872568,0.86699,0.869521,0.870346,0.854727
456,0.297,0.423336,0.88035,0.87201,0.876185,0.872445,0.863396
513,0.2492,0.409162,0.890078,0.883529,0.882581,0.890833,0.875127
570,0.3012,0.376262,0.891051,0.887757,0.892161,0.884052,0.875117


[I 2022-01-29 13:59:48,269] Trial 39 finished with values: [0.37626218795776367, 0.8877572152080605] and parameters: {'learning_rate': 0.00014702055839263257, 'num_train_epochs': 3, 'per_device_train_batch_size': 32, 'weight_decay': 0.001630971369462351}.
fixed params: [('num_train_epochs', 3), ('per_device_train_batch_size', 64)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=84.0, min_trials=700
params: {'learning_rate': 4.131270983914418e-05, 'num_train_epochs': 3, 'per_device_train_batch_size': 64, 'weight_decay': 0.0012996858167884208}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
28,1.3629,0.67587,0.818093,0.799567,0.836014,0.785481,0.792426
56,0.5936,0.461886,0.860895,0.855798,0.859355,0.860816,0.841852
84,0.4775,0.419125,0.866732,0.859106,0.871995,0.856416,0.848141
112,0.4312,0.379825,0.885214,0.877976,0.878065,0.882681,0.869133
140,0.4471,0.372712,0.884241,0.878633,0.875935,0.883313,0.867716
168,0.2978,0.352463,0.882296,0.878974,0.882626,0.876395,0.865196
196,0.2346,0.34791,0.886187,0.879943,0.875481,0.885816,0.86986
224,0.2566,0.341687,0.890078,0.884916,0.88221,0.889309,0.874318
252,0.2506,0.333345,0.893969,0.889003,0.892741,0.886495,0.878534
280,0.2663,0.329047,0.898833,0.893913,0.896075,0.892224,0.884057


In [None]:
!ls -lahtr $project_name

---
## New Optimization Code

In [1]:
!pip install -q optuna transformers datasets >/dev/null

In [9]:
import logging

import optuna  # needed for optuna.TrialPruned in on_evaluate
import pandas as pd
from optuna.study import StudyDirection
from optuna.trial import Trial, TrialState
from transformers import TrainerCallback

# https://github.com/huggingface/transformers/blob/v4.14.1/src/transformers/trainer_callback.py#L505
# https://huggingface.co/docs/transformers/main_classes/callback#transformers.TrainerCallback

log = logging.getLogger(__name__)
log.setLevel(logging.INFO)

class TrialLogAndPruningCallback(TrainerCallback):
    """Stores eval metrics at each evaluation step in the trial user attrs
    and prunes the trial when it falls behind the median of comparable trials."""
    def __init__(self, trial: Trial, objectives=None, warmup_steps=0, min_trials=7):
        self.study = trial.study
        self.trial = trial
        self.param_keys = ["num_train_epochs", "batch_size"]
        self.param_vals = [trial.params[k] for k in self.param_keys]

        log.warning(f"fixed params: {list(zip(self.param_keys, self.param_vals))}")

        if objectives is None:
            self.objectives = ["eval_loss"]
        else:
            self.objectives = objectives
        self._warmup_steps = warmup_steps
        self._min_trials = max(1, int(min_trials))

        log.warning(f"objectives: {self.objectives}, directions: {self.study.directions}, warmup={self._warmup_steps}, min_trials={self._min_trials}")
        log.warning(f"params: {trial.params}")
        

    def _filter_trials(self, complete_trials):
        """Select only trials with same parameter values"""
        # values = [self.trial.params[k] for k in keys]
        return [t for t in complete_trials if self.param_vals == [t.params[k] for k in self.param_keys]]

    def _prune(self, step: int, metrics) -> bool:
        """Median Pruning on multiple objectives."""
        if step < self._warmup_steps:
            # log.warning(f"less than warmup steps {step}<{self._warmup_steps}")
            return False

        # get all completed trials
        complete_trials = self.study.get_trials(deepcopy=False,
                                                states=[TrialState.COMPLETE])
        # only compare trials with same batch size and epochs
        complete_trials = self._filter_trials(complete_trials)
        n_trials = len(complete_trials)

        # check minimal number of trial required
        if n_trials < self._min_trials:
            # log.warning(f"less than min trials {n_trials}<{self._min_trials}")
            return False

        # log.warning(f"checking {step}: {metrics}")

        # sanity check: without all objective metrics there is nothing to compare
        has_metrics = [o in metrics for o in self.objectives]
        if not all(has_metrics):
            log.warning(f"missing objective metrics {list(zip(self.objectives, has_metrics))}")
            return False

        # extract metrics from trials
        # print(f"fetching metrics of {n_trials} complete trials")
        trial_metrics = []
        for t in complete_trials:
            # print(str(step), "in keys?", str(step) in t.user_attrs.keys(), t.user_attrs.keys())
            if str(step) in t.user_attrs.keys():
                trial_metrics.append(t.user_attrs[str(step)])
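        n_metrics = len(trial_metrics)
        if n_metrics == 0:
            # no completed trial was evaluated at this step, so there is no median
            return False

        # compute median for each metric over all trials
        median = pd.DataFrame(trial_metrics).median()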

        # log.warning(f"median of {n_metrics}/{n_trials}: {median.to_dict()}")

        # compare current metric value with median
        prune_state = []
        for i, o in enumerate(self.objectives):
            if self.study.directions[i] == StudyDirection.MAXIMIZE:
                prune_state.append(metrics[o] <= median[o])
            else:
                prune_state.append(metrics[o] > median[o])
        
        met = ",".join([f"{m}={metrics[m]:.4}/{median[m]:.4}" for m in self.objectives])
        print(f"prune? step={step}, warmup={self._warmup_steps}, complete_trials={n_trials}, metrics={n_metrics} -> {met}; {prune_state}")

        # all metrics must be marked for pruning
        return all(prune_state)
    
    def on_evaluate(self, args, state, control, lr_scheduler, metrics, **kwargs):
        step = state.global_step
        values = {**metrics, "lr": lr_scheduler.get_last_lr()[-1]}
        self.trial.set_user_attr(str(step), values)

        # pruning
        if self._prune(step, metrics):
            print(f"pruning trial at step {step}")
            # control.should_training_stop = True  # not needed
            raise optuna.TrialPruned()
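
The pruning rule in `_prune` can be tried out in isolation. The following is a minimal sketch, not the callback itself: `should_prune`, the plain-string directions, and the toy metric values are made up for illustration, and `statistics.median` stands in for the pandas median used in the callback. A trial is flagged only when it is worse than the median on every objective.

```python
from statistics import median


def should_prune(current, history, directions):
    """Return True only if `current` is worse than the median on every objective.

    current:    metric values of the running trial at one evaluation step
    history:    metric dicts recorded by completed trials at the same step
    directions: "minimize" or "maximize" per objective (hypothetical plain strings)
    """
    if not history:
        return False  # nothing to compare against yet
    worse = []
    for name, direction in directions.items():
        med = median(h[name] for h in history)
        if direction == "maximize":
            worse.append(current[name] <= med)
        else:
            worse.append(current[name] > med)
    return all(worse)


# toy values, not taken from the runs in this notebook
history = [
    {"eval_loss": 0.50, "eval_f1": 0.80},
    {"eval_loss": 0.40, "eval_f1": 0.85},
    {"eval_loss": 0.60, "eval_f1": 0.75},
]
directions = {"eval_loss": "minimize", "eval_f1": "maximize"}

print(should_prune({"eval_loss": 0.70, "eval_f1": 0.70}, history, directions))  # worse on both objectives
print(should_prune({"eval_loss": 0.70, "eval_f1": 0.90}, history, directions))  # F1 beats the median
```

Requiring every objective to be worse keeps trials that trade one metric against another, which seems a reasonable fit for a multi-objective study.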

In [10]:
import optuna

class Interval:
    def __init__(self, min, max, log=False):
        self.min = min
        self.max = max
        self.log = log

def uniform(min, max):
    return Interval(min, max, log=False)

def log_uniform(min, max):
    return Interval(min, max, log=True)

class Space:

    def __init__(self, **kwargs):
        self.params = kwargs
    
    def _suggest(self, trial, key, val):
        if val is None or isinstance(val, (bool, float, int, str)):
            return trial.suggest_categorical(key, [val])
        if isinstance(val, list):
            return trial.suggest_categorical(key, val)
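        if isinstance(val, Interval):
            return trial.suggest_float(key, val.min, val.max, log=val.log)
        # fail loudly instead of silently returning None for unsupported values
        raise TypeError(f"unsupported search space value for '{key}': {val!r}")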

    def suggest(self, trial: optuna.trial.Trial):
        return {k:self._suggest(trial, k, v) for k, v in self.params.items()}


from transformers import AutoTokenizer
from transformers import AutoConfig, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
from transformers import logging as trlog
from datasets import DatasetDict, Dataset, load_dataset
from datasets import logging as dslog
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score, matthews_corrcoef, precision_recall_fscore_support
import numpy as np

# hide progress bar when downloading dataset - needs workaround!
dslog.get_verbosity = lambda : logging.NOTSET
trlog.get_verbosity = lambda : logging.NOTSET

def load_data():
    base_url = "https://raw.githubusercontent.com/tblock/10kGNAD/master/{}.csv"
    data_files = {x: base_url.format(x) for x in ["train", "test"]}
    dataset = (load_dataset('csv',
                            data_files=data_files,
                            sep=";",
                            quotechar="'",
                            names=["label", "text"]).
            class_encode_column("label"))
    label_names = dataset["train"].features["label"].names
    return dataset, label_names


def sample_data(ds: DatasetDict, train_size: float, columns=("text", "label")):
    """Create a stratified sample of the train dataset."""
    X = ds["train"][columns[0]]
    y = ds["train"][columns[1]]
    X_train, _, y_train, _ = train_test_split(X, y, train_size=train_size, random_state=42, stratify=y)
    return DatasetDict({
        "train": Dataset.from_dict({"label": y_train, "text": X_train}),
        "test": ds["test"]
    })

data_cache = {}

def prepare_data(tokenizer, params) -> tuple:
    """Tokenize the (optionally subsampled) dataset and return (DatasetDict, label_names)."""

    train_size = params["train_size"]
    max_seq_length = params["max_seq_length"]
    identifier = f"{train_size}/{max_seq_length}"

    if identifier in data_cache:
        print(f"USING CACHED DATASET for {identifier}")
        return data_cache[identifier]
    else:
        print("GENERATING DATASET")

    dataset, label_names = load_data()

    if train_size is not None and train_size < 1.0:
        dataset = sample_data(dataset, train_size)

    if max_seq_length is None:
        max_seq_length = tokenizer.model_max_length
    
    # TODO better use a partial function
    def preprocess_function(examples):
        return tokenizer(examples["text"], truncation=True, max_length=max_seq_length)
    
    mapped_data = dataset.map(preprocess_function, batched=True).remove_columns("text")

    print("STORING DATASET")
    data_cache[identifier] = (mapped_data, label_names)
    return mapped_data, label_names


def create_training_args(params: dict, config: dict) -> TrainingArguments:
    
    # output_dir = config["output_dir"]
    train_rows = config["train_rows"]
    base_batch_size = config["base_batch_size"]
    eval_rounds_per_epoch = config["eval_rounds_per_epoch"]

    # calculate gradient_accumulation and evaluation steps
    bs = params["batch_size"]
    gradient_accumulation_steps = bs // base_batch_size
    eval_steps = train_rows / bs // eval_rounds_per_epoch
    if (train_rows / bs < eval_rounds_per_epoch):
        raise ValueError(f"batch size {bs} is too big for {train_rows} examples and {eval_rounds_per_epoch} eval rounds!")

    return TrainingArguments(
        output_dir=config["output_dir"],
        report_to=[],
        log_level="error",
        disable_tqdm=False,

        evaluation_strategy="steps",
        eval_steps=eval_steps,
        logging_steps=eval_steps,
        save_strategy="steps",
        save_steps=eval_steps,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        greater_is_better=False,

        # hyperparameters
        num_train_epochs=params["num_train_epochs"],
        learning_rate=params["learning_rate"],
        per_device_train_batch_size=base_batch_size,
        per_device_eval_batch_size=base_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        weight_decay=params["weight_decay"],

        # fp16=True,  # fp16 needs apex, but PyTorch disables it on the Tesla P100
    )


def init_model(checkpoint, label_names):
    """A function that instantiates the model to be used."""

    # We want to include the label names and save them together with the model.
    # The only way to do this is to create a Config and put them in. 
    config = AutoConfig.from_pretrained(
            checkpoint,
            num_labels=len(label_names),
            id2label={i: label for i, label in enumerate(label_names)},
            label2id={label: i for i, label in enumerate(label_names)},
            )

    return AutoModelForSequenceClassification.from_pretrained(checkpoint, config=config)

def compute_metrics(eval_preds):
    """Compute the metrics reported at each evaluation.

    Takes a :class:`~transformers.EvalPrediction` and returns a dictionary
    mapping metric names to values."""
    logits, labels = eval_preds
    preds = np.argmax(logits, axis=-1)
    # precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
    return {
        "acc": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average='macro'),
        "precision": precision_score(labels, preds, average='macro'),
        "recall": recall_score(labels, preds, average='macro'),
        "mcc": matthews_corrcoef(labels, preds),
        }

def objective(trial):

    # suggest hyperparameters
    hp = space.suggest(trial)
    print(hp)

    checkpoint = hp["model"]
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)

    # prepare dataset
    tokenized_dataset, label_names = prepare_data(tokenizer, hp)
    # print(f"LABEL NAMES {label_names}")


    project_name = "test"
    best_model_dir = "best_model"

    config = {
        "output_dir": project_name,
        "base_batch_size": 8,
        "eval_rounds_per_epoch": 5,
        "train_rows": tokenized_dataset["train"].num_rows
    }

    ## TODO: calculate batch size and gradient accumulation steps separately

    # create training args
    training_args = create_training_args(hp, config)
    # print(args)

    # collect unreferenced objects first, then release cached GPU memory
    # https://stackoverflow.com/questions/59129812/how-to-avoid-cuda-out-of-memory-in-pytorch
    import gc
    import torch
    gc.collect()
    torch.cuda.empty_cache()


    # prepare Trainer
    trainer = Trainer(
        model_init=lambda x: init_model(checkpoint, label_names),
        args=training_args,
        train_dataset=tokenized_dataset["train"],
        eval_dataset=tokenized_dataset["test"],
        tokenizer=tokenizer,
        # data_collator=data_collator,
        compute_metrics=compute_metrics,
        callbacks=[TrialLogAndPruningCallback(trial, objectives=["eval_loss", "eval_f1"], min_trials=700, warmup_steps=training_args.eval_steps*3)]
    )

    # train model and save best model from evaluations
    # needs 'load_best_model_at_end=True'
    trainer.train()
    trainer.save_model(f"{project_name}/{best_model_dir}")

    result = trainer.evaluate(eval_dataset=tokenized_dataset["test"])

    # store eval metrics in trial
    trial.set_user_attr("eval_result", result)
    
    # return result["eval_loss"]
    return result["eval_loss"], result["eval_f1"]

import random
import numpy as np

def lr_sample(min, max, dist=0.1, jitter=0.1):
    min_log = np.log10(min)
    max_log = np.log10(max)
    n = int((max_log - min_log) / dist)
    return np.logspace(min_log, max_log, n) * np.random.uniform(1-jitter, 1+jitter, size=n)

def lr_pairs(min_lr, max_lr, hp_values, dist=0.1, jitter=0.1, shuffle=True):
    lrs = []
    for val in hp_values:
        lrs.extend((val, lr) for lr in lr_sample(min_lr, max_lr, dist=dist, jitter=jitter))
    if shuffle:
        random.shuffle(lrs)
    return lrs

def prepare_trials(study, min_lr, max_lr, hp_name, hp_values, dist=0.1, jitter=0.1, shuffle=True):
    """Prime the study with a jittered learning-rate grid for each value of `hp_name`."""
    # forward dist and jitter to lr_pairs instead of hard-coding 0.1
    for val, lr in lr_pairs(min_lr, max_lr, hp_values, dist=dist, jitter=jitter, shuffle=shuffle):
        study.enqueue_trial({"learning_rate": lr, hp_name: val})
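
The step arithmetic in `create_training_args` is easier to check with concrete numbers. This is a minimal sketch under stated assumptions: `step_plan` is a hypothetical helper, the row count of 9245 is assumed for the full training split, and the multiple-of check is an extra guard that the notebook code does not have. Integer division is used throughout, so the results are ints rather than floats.

```python
def step_plan(train_rows, batch_size, base_batch_size=8, eval_rounds_per_epoch=5):
    """Derive accumulation and evaluation intervals from an effective batch size."""
    # extra guard (assumption, not in the notebook): accumulation needs a whole multiple
    if batch_size % base_batch_size != 0:
        raise ValueError("batch_size should be a multiple of base_batch_size")
    steps_per_epoch = train_rows // batch_size  # optimizer steps per epoch, rounded down
    if steps_per_epoch < eval_rounds_per_epoch:
        raise ValueError("batch size is too big for the requested eval rounds")
    return {
        "gradient_accumulation_steps": batch_size // base_batch_size,
        "eval_steps": steps_per_epoch // eval_rounds_per_epoch,
    }


# 9245 is an assumed row count for the full training split
plan = step_plan(train_rows=9245, batch_size=64)
print(plan)                    # {'gradient_accumulation_steps': 8, 'eval_steps': 28}
print(plan["eval_steps"] * 3)  # 84, three evaluation rounds of warmup for the callback
```

With these inputs the evaluation interval is 28 steps and three intervals give 84 warmup steps, which lines up with the `warmup=84.0` reported for the full-data trial in the output further down.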

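For reference, the jittered learning-rate grid that `lr_sample` builds can also be written without NumPy. This is a sketch under assumptions: `jittered_log_grid` and its `seed` argument are hypothetical names, and the spacing is meant to mirror the endpoint-inclusive behaviour of `np.logspace`.

```python
import math
import random


def jittered_log_grid(lo, hi, dist=0.1, jitter=0.1, seed=None):
    """Log-spaced values from lo to hi, each scaled by a factor in [1-jitter, 1+jitter]."""
    rng = random.Random(seed)
    lo_log, hi_log = math.log10(lo), math.log10(hi)
    n = int((hi_log - lo_log) / dist)  # number of grid points, as in lr_sample
    if n < 1:
        return []
    # both endpoints are included when there are at least two points
    step = (hi_log - lo_log) / (n - 1) if n > 1 else 0.0
    return [
        10 ** (lo_log + i * step) * rng.uniform(1 - jitter, 1 + jitter)
        for i in range(n)
    ]


# without jitter the grid starts at lo and ends at hi
grid = jittered_log_grid(5e-6, 5e-4, dist=0.5, jitter=0.0)
print(grid)

# with jitter every value stays within 10% of its grid point
noisy = jittered_log_grid(1e-5, 1e-3, dist=0.25, jitter=0.1, seed=0)
print(noisy)
```

A smaller `dist` gives a denser grid, and the jitter keeps repeated grids from landing on exactly the same learning rates.
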
In [None]:
from optuna.storages import RDBStorage

db_path = "/content/gdrive/My Drive/Colab Notebooks/nlp-classification/"
db_name = "10kgnad_optuna"
# automatically change the state of a stale trial to TrialState.FAIL from TrialState.RUNNING
storage = RDBStorage(url=f"sqlite:///{db_path}{db_name}.db", heartbeat_interval=60, grace_period=120)

space = Space(
    model = "deepset/gbert-base",
    train_size = [0.25, 0.5, 1.0],
    data_collator = False,
    max_seq_length = 128,
    batch_size = [64],
    num_train_epochs = 2,
    learning_rate = log_uniform(5e-6, 5e-4),
    weight_decay = log_uniform(1e-3, 1e-2),
)

# TODO define optimization metrics

study_name = space.params["model"] + "_ds25-100_bs64_ep2_len128"
print("STUDY", study_name)

# multi objective study
# https://optuna.readthedocs.io/en/stable/tutorial/20_recipes/002_multi_objective.html#sphx-glr-tutorial-20-recipes-002-multi-objective-py
study = optuna.create_study(study_name=study_name,
                            directions=["minimize", "maximize"],
                            storage=storage,
                            load_if_exists=True,)

# prime trials
lr = space.params["learning_rate"]
key = "train_size"
# prepare_trials(study, lr.min, lr.max, key, space.params[key], dist=0.05, jitter=0.05)

study.optimize(objective, n_trials=3)

[I 2022-01-30 23:00:06,780] Using an existing study with name 'deepset/gbert-base_ds25-100_bs64_ep2_len128' instead of creating a new one.


STUDY deepset/gbert-base_ds25-100_bs64_ep2_len128


  create_trial(state=TrialState.WAITING, system_attrs={"fixed_params": params})


{'model': 'deepset/gbert-base', 'train_size': 0.25, 'data_collator': False, 'max_seq_length': 128, 'batch_size': 64, 'num_train_epochs': 2, 'learning_rate': 1.2276295732827837e-05, 'weight_decay': 0.0027240014343563727}
USING CACHED DATASET for 0.25/128


fixed params: [('num_train_epochs', 2), ('batch_size', 64)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=21.0, min_trials=700
params: {'model': 'deepset/gbert-base', 'train_size': 0.25, 'data_collator': False, 'max_seq_length': 128, 'batch_size': 64, 'num_train_epochs': 2, 'learning_rate': 1.2276295732827837e-05, 'weight_decay': 0.0027240014343563727}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
7,2.152,2.028398,0.377432,0.243727,0.281115,0.279799,0.278619
14,1.9659,1.891984,0.450389,0.286981,0.304725,0.332783,0.368156
21,1.8687,1.744822,0.536965,0.358967,0.359316,0.413068,0.467602
28,1.7724,1.616001,0.562257,0.39655,0.426373,0.439915,0.498063
35,1.6109,1.509369,0.574903,0.416915,0.541455,0.454308,0.51497
42,1.5053,1.398851,0.626459,0.465426,0.568618,0.497825,0.572091
49,1.3898,1.325756,0.646887,0.494456,0.691836,0.518642,0.595966
56,1.3549,1.272181,0.65856,0.516576,0.694346,0.534474,0.609345
63,1.3124,1.2338,0.675097,0.549253,0.810627,0.557734,0.628266
70,1.2678,1.216491,0.679961,0.563215,0.79923,0.56681,0.633656


  _warn_prf(average, modifier, msg_start, len(result))


[I 2022-01-30 23:03:04,601] Trial 3 finished with values: [1.2164912223815918, 0.5632148867503006] and parameters: {'model': 'deepset/gbert-base', 'train_size': 0.25, 'data_collator': False, 'max_seq_length': 128, 'batch_size': 64, 'num_train_epochs': 2, 'learning_rate': 1.2276295732827837e-05, 'weight_decay': 0.0027240014343563727}.


{'model': 'deepset/gbert-base', 'train_size': 0.25, 'data_collator': False, 'max_seq_length': 128, 'batch_size': 64, 'num_train_epochs': 2, 'learning_rate': 4.613658538714951e-05, 'weight_decay': 0.008837330426901786}
USING CACHED DATASET for 0.25/128


fixed params: [('num_train_epochs', 2), ('batch_size', 64)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=21.0, min_trials=700
params: {'model': 'deepset/gbert-base', 'train_size': 0.25, 'data_collator': False, 'max_seq_length': 128, 'batch_size': 64, 'num_train_epochs': 2, 'learning_rate': 4.613658538714951e-05, 'weight_decay': 0.008837330426901786}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
7,2.0678,1.821143,0.492218,0.306595,0.32039,0.376718,0.425593
14,1.6037,1.283495,0.678016,0.547608,0.599946,0.559153,0.636958
21,1.1922,0.900568,0.793774,0.775325,0.832911,0.754353,0.763913
28,0.9041,0.708131,0.807393,0.796384,0.813955,0.796824,0.781653
35,0.718,0.622202,0.829767,0.820943,0.84613,0.807739,0.80502
42,0.5841,0.563245,0.843385,0.835451,0.855622,0.824631,0.820557
49,0.5002,0.544211,0.842412,0.834631,0.858554,0.822545,0.819649
56,0.4701,0.513218,0.857004,0.849946,0.859465,0.8451,0.836257
63,0.4446,0.49378,0.859922,0.85348,0.860051,0.849334,0.83946
70,0.4601,0.487019,0.856031,0.849585,0.854713,0.846588,0.835013


  _warn_prf(average, modifier, msg_start, len(result))


[I 2022-01-30 23:06:00,741] Trial 4 finished with values: [0.4870188534259796, 0.8495850771306106] and parameters: {'model': 'deepset/gbert-base', 'train_size': 0.25, 'data_collator': False, 'max_seq_length': 128, 'batch_size': 64, 'num_train_epochs': 2, 'learning_rate': 4.613658538714951e-05, 'weight_decay': 0.008837330426901786}.


{'model': 'deepset/gbert-base', 'train_size': 1.0, 'data_collator': False, 'max_seq_length': 128, 'batch_size': 64, 'num_train_epochs': 2, 'learning_rate': 0.00011006711630444298, 'weight_decay': 0.0033087965136340096}
USING CACHED DATASET for 1.0/128


fixed params: [('num_train_epochs', 2), ('batch_size', 64)]
objectives: ['eval_loss', 'eval_f1'], directions: [<StudyDirection.MINIMIZE: 1>, <StudyDirection.MAXIMIZE: 2>], warmup=84.0, min_trials=700
params: {'model': 'deepset/gbert-base', 'train_size': 1.0, 'data_collator': False, 'max_seq_length': 128, 'batch_size': 64, 'num_train_epochs': 2, 'learning_rate': 0.00011006711630444298, 'weight_decay': 0.0033087965136340096}


Step,Training Loss,Validation Loss,Acc,F1,Precision,Recall,Mcc
28,1.068,0.552434,0.832685,0.822455,0.844592,0.815563,0.81071
56,0.4963,0.44634,0.860895,0.857209,0.862571,0.859055,0.842293
84,0.453,0.393846,0.873541,0.869858,0.869688,0.87167,0.855276
112,0.4168,0.413256,0.86965,0.865183,0.866581,0.869679,0.851737
140,0.447,0.396629,0.88035,0.87568,0.876563,0.876831,0.863092
168,0.2685,0.366865,0.883268,0.879788,0.87863,0.88365,0.866637
