This notebook uses FLAML to fine-tune a transformer model from the Hugging Face `transformers` library.

**Requirements.** This notebook has additional requirements:

In [1]:
#!pip install torch transformers datasets ipywidgets flaml[blendsearch,ray];

## Tokenizer

In [2]:
from transformers import AutoTokenizer

In [3]:
MODEL_CHECKPOINT = "distilbert-base-uncased"

In [4]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT, use_fast=True)

In [5]:
tokenizer("this is a test")

{'input_ids': [101, 2023, 2003, 1037, 3231, 102], 'attention_mask': [1, 1, 1, 1, 1, 1]}

## Data

In [6]:
TASK = "cola"

In [7]:
import datasets

In [8]:
raw_dataset = datasets.load_dataset("glue", TASK)

Reusing dataset glue (/home/chiw/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


In [9]:
# define tokenization function used to process data
COLUMN_NAME = "sentence"
def tokenize(examples):
    return tokenizer(examples[COLUMN_NAME], truncation=True)

In [10]:
encoded_dataset = raw_dataset.map(tokenize, batched=True)

In [11]:
encoded_dataset["train"][0]

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'idx': 0,
 'input_ids': [101,
  2256,
  2814,
  2180,
  1005,
  1056,
  4965,
  2023,
  4106,
  1010,
  2292,
  2894,
  1996,
  2279,
  2028,
  2057,
  16599,
  1012,
  102],
 'label': 1,
 'sentence': "Our friends won't buy this analysis, let alone the next one we propose."}

## Model

In [12]:
from transformers import AutoModelForSequenceClassification

In [13]:
NUM_LABELS = 2
model = AutoModelForSequenceClassification.from_pretrained(MODEL_CHECKPOINT, num_labels=NUM_LABELS)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

In [14]:
model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

## Metric

In [15]:
metric = datasets.load_metric("glue", TASK)

In [16]:
metric

Metric(name: "glue", features: {'predictions': Value(dtype='int64', id=None), 'references': Value(dtype='int64', id=None)}, usage: """
Compute GLUE evaluation metric associated to each GLUE dataset.
Args:
    predictions: list of predictions to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
    "accuracy": Accuracy
    "f1": F1 score
    "pearson": Pearson Correlation
    "spearmanr": Spearman Correlation
    "matthews_correlation": Matthew Correlation
Examples:

    >>> glue_metric = datasets.load_metric('glue', 'sst2')  # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(res

In [17]:
import numpy as np
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)
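
For CoLA, the value returned by `metric.compute` is the Matthews correlation coefficient. As a rough illustration of what that number measures, the sketch below computes it from scratch for binary labels. The helper name `matthews_corrcoef_binary` is made up for this example, and returning 0.0 when the denominator vanishes is a convention assumed here, not something shown in this notebook.

```python
import math


def matthews_corrcoef_binary(predictions, references):
    """Matthews correlation for binary labels (0/1), from the confusion matrix."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)
    tn = sum(1 for p, r in pairs if p == 0 and r == 0)
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: an undefined score is reported as 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denom


labels = [1, 1, 0, 1, 0, 1]
constant_score = matthews_corrcoef_binary([1] * len(labels), labels)
perfect_score = matthews_corrcoef_binary(labels, labels)
print(constant_score, perfect_score)
```

Under this convention, a model that predicts the same class for every sentence scores 0.0, which is one plausible reading of the `matthews_correlation=0.0` results that appear for short, low-learning-rate trials in the tuning log.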

## Training (a.k.a. Fine-tuning)

In [18]:
from transformers import Trainer
from transformers import TrainingArguments

In [19]:
args = TrainingArguments(
    output_dir='output',
    do_eval=True,
)

In [20]:
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [21]:
trainer.train()



TrainOutput(global_step=804, training_loss=0.3209413462017306, metrics={'train_runtime': 115.5328, 'train_samples_per_second': 6.959, 'total_flos': 238363718990580.0, 'epoch': 3.0, 'init_mem_cpu_alloc_delta': 2336600064, 'init_mem_gpu_alloc_delta': 268953088, 'init_mem_cpu_peaked_delta': 257929216, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 2381066240, 'train_mem_gpu_alloc_delta': 806788096, 'train_mem_cpu_peaked_delta': 186974208, 'train_mem_gpu_peaked_delta': 550790144})
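
The `global_step=804` in this output can be sanity-checked by hand. The sketch below assumes that the CoLA training split has 8,551 sentences, that `TrainingArguments` kept a per-device batch size of 8, and that all 4 GPUs of the machine were used. None of these values is printed in the output, so treat them as assumptions; only the 3 epochs are confirmed by `'epoch': 3.0`.

```python
import math

num_train_examples = 8551    # assumed size of the CoLA training split
per_device_batch_size = 8    # assumed TrainingArguments default
num_gpus = 4                 # assumed: every GPU on the machine is used
num_epochs = 3               # matches 'epoch': 3.0 in the TrainOutput

effective_batch_size = per_device_batch_size * num_gpus
steps_per_epoch = math.ceil(num_train_examples / effective_batch_size)
total_steps = steps_per_epoch * num_epochs
print(effective_batch_size, steps_per_epoch, total_steps)
```

Under these assumptions the total comes to 268 steps per epoch and 804 steps overall, which agrees with the reported `global_step`.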

## Hyperparameter Optimization

`flaml.tune` is a module for economical hyperparameter tuning. It spares users from hand-tuning the many hyperparameters of a piece of software, such as a machine learning training procedure.
Its API is compatible with Ray Tune.

### Step 1. Define training method

We define a function `train_distilbert(config: dict)` that accepts a hyperparameter configuration dict `config`. The specific configs will be generated by flaml's search algorithm in a given search space.


In [22]:
import flaml

def train_distilbert(config: dict):

    # Load CoLA dataset and apply tokenizer
    cola_raw = datasets.load_dataset("glue", TASK)
    cola_encoded = cola_raw.map(tokenize, batched=True)
    train_dataset, eval_dataset = cola_encoded["train"], cola_encoded["validation"]

    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_CHECKPOINT, num_labels=NUM_LABELS
    )

    metric = datasets.load_metric("glue", TASK)
    def compute_metrics(eval_pred):
        predictions, labels = eval_pred
        predictions = np.argmax(predictions, axis=1)
        return metric.compute(predictions=predictions, references=labels)

    training_args = TrainingArguments(
        output_dir='.',
        do_eval=False,
        disable_tqdm=True,
        logging_steps=20000,
        save_total_limit=0,
        **config,
    )

    trainer = Trainer(
        model,
        training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )

    # train model
    trainer.train()

    # evaluate model
    eval_output = trainer.evaluate()

    # report the metric to optimize
    flaml.tune.report(
        loss=eval_output["eval_loss"],
        matthews_correlation=eval_output["eval_matthews_correlation"],
    )

### Step 2. Define the search

We are now ready to define our search. This includes:

- The `search_space` for our hyperparameters
- The metric and the mode ('max' or 'min') for optimization
- The resources (`num_cpus`, `num_gpus`) and the constraints (`num_samples`, `time_budget_s`)

In [23]:
max_num_epoch = 64
search_space = {
        # You can mix constants with search space objects.
        "num_train_epochs": flaml.tune.loguniform(1, max_num_epoch),
        "learning_rate": flaml.tune.loguniform(1e-6, 1e-4),
        "adam_epsilon": flaml.tune.loguniform(1e-9, 1e-7),
        "adam_beta1": flaml.tune.uniform(0.8, 0.99),
        "adam_beta2": flaml.tune.loguniform(98e-2, 9999e-4),
}
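
A log-uniform range spreads samples evenly across orders of magnitude rather than across raw values, which suits parameters such as the learning rate. The sketch below illustrates the idea with the standard library only. `sample_loguniform` is a hypothetical helper written for this example; it is not FLAML's implementation.

```python
import math
import random


def sample_loguniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
samples = [sample_loguniform(1e-6, 1e-4, rng) for _ in range(10_000)]

# Share of draws below the geometric midpoint of the range.
fraction_below_midpoint = sum(s < 1e-5 for s in samples) / len(samples)
print(f"fraction below 1e-5: {fraction_below_midpoint:.2f}")
```

Roughly half of the draws should land below 1e-5, the geometric midpoint of the range, whereas a plain uniform draw on the same interval would put only about 9% of its samples there.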

In [24]:
# optimization objective
HP_METRIC, MODE = "matthews_correlation", "max"

# resources
num_cpus = 4
num_gpus = 4

# constraints
num_samples = -1    # number of trials, -1 means unlimited
time_budget_s = 3600    # time budget in seconds
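
To make the two constraints concrete, the sketch below shows one way a tuning loop could honour them: it stops when either the trial count or the time budget is exhausted, and treats `num_samples = -1` as no limit on the number of trials. It uses plain random search on a toy objective. `run_with_budget`, `toy_objective`, and `sample_config` are hypothetical names, and this is not how `flaml.tune.run` or CFO work internally.

```python
import random
import time


def run_with_budget(objective, sample_config, num_samples=-1, time_budget_s=1.0, mode="max"):
    """Evaluate sampled configs until the trial count or the time budget runs out."""
    start = time.time()
    best_config, best_score, n_trials = None, None, 0
    while (num_samples < 0 or n_trials < num_samples) and time.time() - start < time_budget_s:
        config = sample_config()
        score = objective(config)
        n_trials += 1
        if best_score is None:
            improved = True
        elif mode == "max":
            improved = score > best_score
        else:
            improved = score < best_score
        if improved:
            best_config, best_score = config, score
    return best_config, best_score, n_trials


rng = random.Random(0)


def toy_objective(config):
    # Toy stand-in for a training run: the score peaks at learning_rate = 1e-5.
    return -abs(config["learning_rate"] - 1e-5)


def sample_config():
    return {"learning_rate": rng.uniform(1e-6, 1e-4)}


best_config, best_score, n_trials = run_with_budget(
    toy_objective, sample_config, num_samples=-1, time_budget_s=0.05
)
print(n_trials, best_config)
```

In this sketch the budget is only checked before a trial starts, so a trial that is already running is allowed to finish and the total run time can exceed `time_budget_s`.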

### Step 3. Launch with `flaml.tune.run`

We are now ready to launch the tuning using `flaml.tune.run`:

In [25]:
import time
import ray
start_time = time.time()
ray.shutdown()
ray.init(num_cpus=num_cpus, num_gpus=num_gpus)

print("Tuning started...")
analysis = flaml.tune.run(
    train_distilbert,
    search_alg=flaml.CFO(
        space=search_space,
        metric=HP_METRIC,
        mode=MODE,
        low_cost_partial_config={"num_train_epochs": 1}),
    report_intermediate_result=False,
    # uncomment the following if report_intermediate_result = True
    # max_resource=max_num_epoch, min_resource=1,
    resources_per_trial={"gpu": num_gpus, "cpu": num_cpus},
    local_dir='logs/',
    num_samples=num_samples,
    time_budget_s=time_budget_s,
    use_ray=True,
)

ray.shutdown()

2021-05-07 02:35:57,130	INFO services.py:1172 -- View the Ray dashboard at http://127.0.0.1:8265
Tuning started...


(pid=886303) Reusing dataset glue (/home/chiw/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
  0%|          | 0/9 [00:00<?, ?ba/s]
 78%|███████▊  | 7/9 [00:00<00:00, 62.07ba/s]
100%|██████████| 9/9 [00:00<00:00, 40.87ba/s]
100%|██████████| 2/2 [00:00<00:00, 107.60ba/s]
  0%|          | 0/2 [00:00<?, ?ba/s]
100%|██████████| 2/2 [00:00<00:00, 105.70ba/s]
(pid=886303) Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
(pid=886303) - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a Bert

Trial train_distilbert_a0c303d0 completed. Last result: loss=0.5879864692687988,matthews_correlation=0.0
Trial train_distilbert_a0c303d1 completed. Last result: loss=0.6030182838439941,matthews_correlation=0.0
Trial train_distilbert_c39b2ef0 completed. Last result: loss=0.5865175724029541,matthews_correlation=0.0
Trial train_distilbert_f00776e2 completed. Last result: loss=0.5813134908676147,matthews_correlation=0.0
Trial train_distilbert_11ab3900 completed. Last result: loss=0.5855756998062134,matthews_correlation=0.0
Trial train_distilbert_353025b6 completed. Last result: loss=0.5316324830055237,matthews_correlation=0.38889272875750597
Trial train_distilbert_5728a1de completed. Last result: loss=0.5385054349899292,matthews_correlation=0.2805581766595423
Trial train_distilbert_9394c2e2 completed. Last result: loss=0.5391769409179688,matthews_correlation=0.3272948213494272
Trial train_distilbert_b6543fec completed. Last result: loss=0.5275164842605591,matthews_correlation=0.37917684067701946
Trial train_distilbert_0071f998 completed. Last result: loss=0.5162246823310852,matthews_correlation=0.417156672319181
Trial train_distilbert_2f830be6 completed. Last result: loss=0.5516289472579956,matthews_correlation=0.06558874629318973
Trial train_distilbert_7ce03f12 completed. Last result: loss=0.523731529712677,matthews_correlation=0.45354879777314566
Trial train_distilbert_aaab0508 completed. Last result: loss=0.5112878680229187,matthews_correlation=0.4508496945113286
Trial train_distilbert_14262454 completed. Last result: loss=0.5350601673126221,matthews_correlation=0.40085080763525827
Trial train_distilbert_6d211fe6 completed. Last result: loss=0.609851062297821,matthews_correlation=0.5268023551875569
Trial train_distilbert_c980bae4 completed. Last result: loss=0.5422758460044861,matthews_correlation=0.32496815807366203
Trial train_distilbert_6d0d29d6 completed. Last result: loss=0.9238015413284302,matthews_correlation=0.5494735380761103
Trial train_distilbert_b16ea82a completed. Last result: loss=0.5334658622741699,matthews_correlation=0.4513069078434825
Trial train_distilbert_eddf7cc0 completed. Last result: loss=0.9832845330238342,matthews_correlation=0.5699304939602442
Trial train_distilbert_43008974 completed. Last result: loss=0.8574612736701965,matthews_correlation=0.5200220944545176

| Trial name | status | loc | adam_beta1 | adam_beta2 | adam_epsilon | learning_rate | num_train_epochs | iter | total time (s) | loss | matthews_correlation |
|---|---|---|---|---|---|---|---|---|---|---|---|
| train_distilbert_a0c303d0 | TERMINATED | | 0.939079 | 0.991865 | 7.96945e-08 | 5.61152e-06 | 1.0 | 1.0 | 55.6909 | 0.587986 | 0.0 |
| train_distilbert_a0c303d1 | TERMINATED | | 0.811036 | 0.997214 | 2.05111e-09 | 2.05134e-06 | 1.44427 | 1.0 | 71.7663 | 0.603018 | 0.0 |
| train_distilbert_c39b2ef0 | TERMINATED | | 0.909395 | 0.993715 | 1e-07 | 5.26543e-06 | 1.0 | 1.0 | 53.7619 | 0.586518 | 0.0 |
| train_distilbert_f00776e2 | TERMINATED | | 0.968763 | 0.990019 | 4.38943e-08 | 5.98035e-06 | 1.02723 | 1.0 | 56.8382 | 0.581313 | 0.0 |
| train_distilbert_11ab3900 | TERMINATED | | 0.962198 | 0.991838 | 7.09296e-08 | 5.06608e-06 | 1.0 | 1.0 | 54.0231 | 0.585576 | 0.0 |
| train_distilbert_353025b6 | TERMINATED | | 0.91596 | 0.991892 | 8.95426e-08 | 6.21568e-06 | 2.15443 | 1.0 | 98.3233 | 0.531632 | 0.388893 |
| train_distilbert_5728a1de | TERMINATED | | 0.926933 | 0.993146 | 1e-07 | 1.00902e-05 | 1.0 | 1.0 | 55.3726 | 0.538505 | 0.280558 |
| train_distilbert_9394c2e2 | TERMINATED | | 0.928106 | 0.990614 | 4.49975e-08 | 3.45674e-06 | 2.72935 | 1.0 | 121.388 | 0.539177 | 0.327295 |
| train_distilbert_b6543fec | TERMINATED | | 0.876896 | 0.992098 | 1e-07 | 7.01176e-06 | 1.59538 | 1.0 | 76.0244 | 0.527516 | 0.379177 |
| train_distilbert_0071f998 | TERMINATED | | 0.955024 | 0.991687 | 7.39776e-08 | 5.50998e-06 | 2.90939 | 1.0 | 126.871 | 0.516225 | 0.417157 |


2021-05-07 03:42:30,035	INFO tune.py:450 -- Total run time: 3992.00 seconds (3991.90 seconds for the tuning loop).


In [26]:
best_trial = analysis.get_best_trial(HP_METRIC, MODE, "all")
# use a separate name so the `metric` object loaded earlier is not overwritten
best_metric = best_trial.metric_analysis[HP_METRIC][MODE]
print(f"n_trials={len(analysis.trials)}")
print(f"time={time.time()-start_time}")
print(f"Best model eval {HP_METRIC}: {best_metric:.4f}")
print(f"Best model parameters: {best_trial.config}")


n_trials=22
time=3999.769361972809
Best model eval matthews_correlation: 0.5699
Best model parameters: {'num_train_epochs': 15.580684188655825, 'learning_rate': 1.2851507818900338e-05, 'adam_epsilon': 8.134982521948352e-08, 'adam_beta1': 0.99, 'adam_beta2': 0.9971094424784387}


## Next Steps

Note that we reported the metric with `flaml.tune.report` only once, at the end of the full training loop. Intermediate results can also be reported, which allows early stopping, as follows:

- Hugging Face provides _Callbacks_, which can be used to insert the `flaml.tune.report` call inside the training loop
- Set `do_eval=True` in the `TrainingArguments` passed to `Trainer` and adjust the evaluation frequency accordingly
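
As a minimal sketch of the callback idea, the class below forwards the evaluation metrics to a reporting function each time the `Trainer` finishes an evaluation. `ReportToTuneCallback` and its `report_fn` argument are hypothetical names chosen for this example. The fallback to `object` is only there so the sketch can be exercised without `transformers` installed, and the metric keys assume the `eval_` prefix seen in `train_distilbert`.

```python
try:
    from transformers import TrainerCallback
except ImportError:
    # Fallback so the sketch can be read and exercised without transformers installed.
    TrainerCallback = object


class ReportToTuneCallback(TrainerCallback):
    """Forward evaluation metrics to a reporting function after each evaluation."""

    def __init__(self, report_fn):
        # report_fn should accept keyword metrics, e.g. flaml.tune.report
        self.report_fn = report_fn

    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        if not metrics:
            return
        self.report_fn(
            loss=metrics["eval_loss"],
            matthews_correlation=metrics["eval_matthews_correlation"],
        )


# Stand-in reporter that records what would be sent to the tuner.
reported = []
callback = ReportToTuneCallback(lambda **kw: reported.append(kw))
callback.on_evaluate(
    None, None, None,
    metrics={"eval_loss": 0.52, "eval_matthews_correlation": 0.41},
)
print(reported)
```

Inside `train_distilbert`, such a callback could be passed as `callbacks=[ReportToTuneCallback(flaml.tune.report)]` when constructing the `Trainer`, together with `report_intermediate_result=True` and the `min_resource` / `max_resource` arguments that are commented out in the `flaml.tune.run` call.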