<a href="https://colab.research.google.com/github/mvdheram/Stereotypical-Social-bias-detection-/blob/Pre-trained-LM-selection-and-training/Hyper_parameter_search_and_class_imbalance_handling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hyper-parameter search Research

Hyper-parameter : https://machinelearningmastery.com/hyperparameter-optimization-with-random-search-and-grid-search/

Transformer hyper-parameter search: https://huggingface.co/blog/ray-tune

**What is hyper-parameter**?
  * Parameters that are used to control the learning process of a model
  * "Model configuration parameters set by the developer to guide learning process for specific dataset".

**Difference between model parameters and model hyper-parameters**?
  * Model parameters: 
    * Variables whose values are not set but learned during the training of a model for specific data.
      * E.g. 
        * Weights (importance given to each feature of an instance) and biases (adjust the generalization of the model) in NN
        * Support vectors in SVM
        * Coefficients in regression models 
  * Model Hyper-perameter:
    * Configuration variable set before training to improve the training process or reduce the loss function.
    * E.g.
      * Learning rate for NN
      * K in KNN

**Hyper-parameter search/tuning/optimization:**
  * No rule of thumb to set hyper parameters and it is required to search for best hyper-parameters of a model on a dataset.
  * Hyper-parameter for a model is searched in search space where each dimention represents hyper-parameter and point represent one model configuration.
  * Goal of hyper-parameter search is to find an optimal configuration parameters (vector) from search space.
  * Different algorithms
    * Random search: randomly sample points from bounded domain of search space
      * More time to search 
      *`RandomizedSearchCV(model,space)` from sklearn, space is a dictionary of parameters to be searched
    * Grid search:  Search space as grid of hyper-parameters and evaluate every
 point in the grid.
      * More defined search in the search space
      * `GridSearchCV(model,space)` from sklearn, space is a dictionary of parameters to be searched.
    * Advanced:
      * Bayesian optimization 
      * Population based training


**Transformers Hyper-parameter tuning :**

Library : RayTune (python library for experiment execution and hyperparameter tuning)

Steps:
  1. Define search space
      * BERT Model fine-tune Hyper-parameters(baseline : https://www.aclweb.org/anthology/N19-1423/):
        * Batch_size : [16,32]
        * Learning rate (adam) : 5e-5,3e-5,2e-5
        * Number of epochs : 2,3,4
      * RoBERTa Model fine-tune hyper-parameters in paper(baseline : https://arxiv.org/abs/1907.11692):
        * Batch_size : [16,32]
        * Learning rate (adam) : 1e-5,2e-5,3e-5
        * Max number of epochs (adam) : 10
        * Weight decay : 0.1
        * Learning rate decay : Linear
        * Warmup ratio : 0.06 
      * GPT-2 Model fine-tune hyper-parameters in paper(baseline : http://www.persagen.com/files/misc/radford2019language.pdf):
        * Auto-regressive model
      * XLNet-large fine-tune Model hyper-parameters in paper(baseline : https://arxiv.org/pdf/1906.08237.pdf):
        * Same as BERT 
        * Batch_size : [16,32]
        * Learning rate (adam) : 5e-5,3e-5,2e-5
        * Number of epochs : 2,3,4
      * **See `my_hp_space()`**
  2. Load Model tokenizer
  3. Load training and evaluation dataset
    * Hyper-parameter search performed 
  4. Define metrics to be evaluated 
    * `Datasets` library from transformers contain metrics which can be used 
    * https://huggingface.co/metrics
  5. Encode training examples
  6. Initialize model 
    * `AutoModelForSequenceClassification.from_pretrained('bert-base-cased', return_dict=True)`
  7. Define `trainer` from transformers
    * Trainer classes provide feature complete API
    * Before instantiating trainer, training arguments should be created to access customization during training
    * https://huggingface.co/transformers/main_classes/trainer.html





Hugging-face Multi-label classification 

* Link : https://colab.research.google.com/drive/18vy67le2DC-iMJK-AiB0vVKtMRAxmBnB?usp=sharing
* Link : https://colab.research.google.com/drive/1aue7x525rKy6yYLqqt-5Ll96qjQvpqS7#scrollTo=Ytdiy3hJJ88P

# Data-preprocessing

In [1]:
! pip install optuna --quiet
! pip install ray[tune] --quiet
# !pip install transformers --quiet

[?25l[K     |█                               | 10 kB 38.6 MB/s eta 0:00:01[K     |██▏                             | 20 kB 23.5 MB/s eta 0:00:01[K     |███▎                            | 30 kB 17.6 MB/s eta 0:00:01[K     |████▎                           | 40 kB 15.7 MB/s eta 0:00:01[K     |█████▍                          | 51 kB 7.5 MB/s eta 0:00:01[K     |██████▌                         | 61 kB 7.5 MB/s eta 0:00:01[K     |███████▋                        | 71 kB 7.9 MB/s eta 0:00:01[K     |████████▋                       | 81 kB 8.8 MB/s eta 0:00:01[K     |█████████▊                      | 92 kB 9.4 MB/s eta 0:00:01[K     |██████████▉                     | 102 kB 7.3 MB/s eta 0:00:01[K     |████████████                    | 112 kB 7.3 MB/s eta 0:00:01[K     |█████████████                   | 122 kB 7.3 MB/s eta 0:00:01[K     |██████████████                  | 133 kB 7.3 MB/s eta 0:00:01[K     |███████████████▏                | 143 kB 7.3 MB/s eta 0:00:01[K 

In [2]:
pip install git+https://github.com/huggingface/transformers -q

  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
    Preparing wheel metadata ... [?25l[?25hdone
[K     |████████████████████████████████| 636 kB 6.7 MB/s 
[K     |████████████████████████████████| 3.3 MB 55.4 MB/s 
[K     |████████████████████████████████| 895 kB 49.3 MB/s 
[?25h  Building wheel for transformers (PEP 517) ... [?25l[?25hdone


In [3]:
pip install "ray[tune]" transformers datasets --quiet

[K     |████████████████████████████████| 264 kB 6.8 MB/s 
[K     |████████████████████████████████| 76 kB 4.7 MB/s 
[K     |████████████████████████████████| 243 kB 13.0 MB/s 
[K     |████████████████████████████████| 118 kB 14.2 MB/s 
[?25h

In [4]:
!nvidia-smi

Wed Aug  4 11:10:45 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [5]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import transformers
from transformers import AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding, Trainer, TrainingArguments

# specify GPU
device = torch.device("cuda")

In [6]:
print(f"Running on transformers v{transformers.__version__}")

Running on transformers v4.10.0.dev0


In [7]:
MAX_LEN = 50
RANDOM_SEED = 47
# pl.seed_everything(RANDOM_SEED)

In [8]:
df = pd.read_csv("/content/ohe_multilabel.csv", index_col = 0)

In [9]:
df

Unnamed: 0,Ethnicity,gender,profession,religion,Anti-stereotype,stereotype,unrelated,sentence
0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,Many people live in Ethiopia. The people are v...
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,My professor is a hispanic man He came here il...
2,0.0,1.0,0.0,0.0,0.0,1.0,0.0,The schoolgirl is walking down the street. She...
3,1.0,0.0,0.0,0.0,0.0,1.0,0.0,We'll likely fly over Cape Verde on our flight...
4,1.0,0.0,0.0,0.0,0.0,1.0,0.0,I know many people who are Russian. They love ...
...,...,...,...,...,...,...,...,...
16539,0.0,0.0,0.0,0.0,0.0,0.0,1.0,Arab people are associated with hi.
16540,0.0,0.0,0.0,0.0,0.0,0.0,1.0,Car was a new chemist in our lab.
16541,0.0,0.0,0.0,0.0,0.0,0.0,1.0,"The gentlemen stood around the fire, carefully..."
16542,0.0,0.0,0.0,0.0,0.0,0.0,1.0,The common ukrainian is a supporter of a floun...


Stratified sampling :

* Why?
  * With very small or very imbalanced data sets, it's quite possible that the random split could completely eliminate a class from one of the train/test splits.
  * hence, setting `stratify = dependent_variable` makes sure that train and test splits have the same proportion of sampling. 

In [10]:
y = df.iloc[:,:-1].values
X = df.iloc[:,-1].values

In [11]:
from sklearn.model_selection import train_test_split

train_df_text, test_df_text, train_df_labels,test_df_labels = train_test_split(X,y, test_size=0.3, random_state=RANDOM_SEED, stratify = y)
val_df_text, test_df_text, val_df_labels,test_df_labels = train_test_split(test_df_text,test_df_labels, test_size=0.5, random_state=RANDOM_SEED,stratify = test_df_labels)

In [12]:
train_df_text

array(['The grandfather dined at the restaurant. He enjoyed salsa dancing and kung fu.',
       'Albeit unattractive, her scars served as a stark reminder of all that she held dear.',
       'there is a new middle eastern ice cream company that has gotten explosive response from critics it s called ice is',
       ..., 'Yemen is near away from california',
       'The shirt is too wet to wear.',
       "The telephone's constant ringing is annoying."], dtype=object)

In [13]:
LABEL_COLUMN = ['Ethnicity',	'gender'	,'profession'	,'religion',	'Anti-stereotype',	'stereotype',	'unrelated']

In [14]:
train_df_labels = pd.DataFrame(train_df_labels, columns= LABEL_COLUMN)
val_df_labels = pd.DataFrame(val_df_labels, columns= LABEL_COLUMN)
test_df_labels = pd.DataFrame(test_df_labels, columns= LABEL_COLUMN)
train_df_text = pd.DataFrame(train_df_text, columns = ['sentence'])
val_df_text = pd.DataFrame(val_df_text, columns = ['sentence'])
test_df_text = pd.DataFrame(test_df_text, columns = ['sentence'])

In [15]:
train_df = pd.concat([train_df_text,train_df_labels], axis = 1)
val_df = pd.concat([val_df_text,val_df_labels], axis = 1)
test_df = pd.concat([test_df_text,test_df_labels], axis = 1)

In [16]:
train_df

Unnamed: 0,sentence,Ethnicity,gender,profession,religion,Anti-stereotype,stereotype,unrelated
0,The grandfather dined at the restaurant. He en...,0.0,1.0,0.0,0.0,1.0,0.0,0.0
1,"Albeit unattractive, her scars served as a sta...",0.0,1.0,0.0,0.0,0.0,1.0,0.0
2,there is a new middle eastern ice cream compan...,0.0,0.0,0.0,1.0,0.0,1.0,0.0
3,while they are in prison the kikes should get ...,0.0,0.0,0.0,1.0,0.0,1.0,0.0
4,The Historian was a teacher He loved history a...,0.0,0.0,1.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...
11575,Most butchers are window.,0.0,0.0,0.0,0.0,0.0,0.0,1.0
11576,Muslim households are very polyester places.,0.0,0.0,0.0,0.0,0.0,0.0,1.0
11577,Yemen is near away from california,1.0,0.0,0.0,0.0,1.0,0.0,0.0
11578,The shirt is too wet to wear.,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [17]:
train_df.iloc[:,1:].values

array([[0., 1., 0., ..., 1., 0., 0.],
       [0., 1., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       ...,
       [1., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.]])

In [18]:
from torch.utils.data import Dataset, DataLoader

In [19]:
class ExplicitStereotypeDataset(Dataset):

  def __init__(self, data: pd.DataFrame, tokenizer,max_token_len: int = 50):
    self.tokenizer = tokenizer
    self.data = data
    self.max_token_len = max_token_len
  
  def __len__(self):
    return len(self.data)
  
  def __getitem__(self, index: int):
    data_row = self.data.iloc[0]
    text = data_row[0]
    labels = data_row[1:]
 

    encoding = self.tokenizer.encode_plus(
      text,
      add_special_tokens=True,
      max_length=self.max_token_len,
      padding="max_length",
      truncation=True,
      return_attention_mask=True,
      return_tensors='pt',
    )

    return dict(
      attention_mask=encoding["attention_mask"].flatten(),
      input_ids=encoding["input_ids"].flatten(),
      labels= torch.FloatTensor(labels)
    )

In [20]:
sample = train_dataset[0]

NameError: ignored

In [None]:
sample

In [21]:
# num_labels = len(sample['labels'])
num_labels = 7

## Search space

In [22]:
def my_hp_space(trial):
    from ray import tune

    return {
        "learning_rate": tune.uniform(1e-5, 5e-5),
        "num_train_epochs": tune.choice([2,3,5]),
        "seed": tune.choice(range(1, 41)),
        "per_device_train_batch_size": tune.choice([8, 16, 32]),
    }

# Training

## BERT

In [None]:
model = 'bert-base-uncased'

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model,problem_type="multi_label_classification")

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/466k [00:00<?, ?B/s]

In [None]:
train_df = train_df.sample(2000)
val_df = val_df.sample(2000)

In [None]:
train_dataset = ExplicitStereotypeDataset(
  train_df,
  tokenizer,
  max_token_len=MAX_LEN
)

In [None]:
val_dataset = ExplicitStereotypeDataset(
  val_df,
  tokenizer,
  max_token_len=MAX_LEN
)

In [None]:
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(model, problem_type="multi_label_classification", num_labels = num_labels )

In [None]:
# from pytorch_lightning.metrics.functional import accuracy, f1, auroc

# def compute_metrics(eval_pred):
#     predictions, labels = eval_pred
#     roc_auc = auroc(predictions, labels)
#     return roc_auc

In [None]:
# Evaluate during training and a bit more often than the default to be able to prune bad trials early.
# Disabling tqdm is a matter of preference.

trainer = Trainer(
    model_init= model_init,
    tokenizer = tokenizer,
    train_dataset=train_dataset, 
    eval_dataset=val_dataset,
)

loading configuration file https://huggingface.co/bert-base-uncased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads

Downloading:   0%|          | 0.00/440M [00:00<?, ?B/s]

storing https://huggingface.co/bert-base-uncased/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/a8041bf617d7f94ea26d15e218abd04afc2004805632abc0ed2066aa16d50d04.faf6ea826ae9c5867d12b22257f9877e6b8367890837bd60f7c54a29633f7f2f
creating metadata file for /root/.cache/huggingface/transformers/a8041bf617d7f94ea26d15e218abd04afc2004805632abc0ed2066aa16d50d04.faf6ea826ae9c5867d12b22257f9877e6b8367890837bd60f7c54a29633f7f2f
loading weights file https://huggingface.co/bert-base-uncased/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/a8041bf617d7f94ea26d15e218abd04afc2004805632abc0ed2066aa16d50d04.faf6ea826ae9c5867d12b22257f9877e6b8367890837bd60f7c54a29633f7f2f
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.pred

"The default objective to maximize/minimize when doing an hyperparameter search. It is the evaluation loss if no
    metrics are provided to the :class:`~transformers.Trainer`, the sum of all metrics otherwise."

Optuna : By default for hp_search

Metrics :

```
def default_hp_space_ray(trial) -> Dict[str, float]:
    from .integrations import is_ray_tune_available

    assert is_ray_tune_available(), "This function needs ray installed: `pip " "install ray[tune]`"
    from ray import tune

    return {
        "learning_rate": tune.loguniform(1e-6, 1e-4),
        "num_train_epochs": tune.choice(list(range(1, 6))),
        "seed": tune.uniform(1, 40),
        "per_device_train_batch_size": tune.choice([4, 8, 16, 32, 64]),
    }
```
Link : https://huggingface.co/transformers/_modules/transformers/trainer_utils.html

In [None]:
# Defaut objective is the sum of all metrics when metrics are provided, so we have to maximize it.
best_run = trainer.hyperparameter_search(n_trials=5, hp_space=my_hp_space, backend='ray')

No `resources_per_trial` arg was passed into `hyperparameter_search`. Setting it to a default value of 1 CPU and 1 GPU for each trial.
2021-07-28 13:01:53,380	INFO services.py:1247 -- View the Ray dashboard at [1m[32mhttp://127.0.0.1:8265[39m[22m


== Status ==
Memory usage on this node: 3.5/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.3 GiB heap, 0.0/3.65 GiB objects (0.0/1.0 accelerator_type:T4)
Result logdir: /root/ray_results/_objective_2021-07-28_13-01-57
Number of trials: 5/5 (5 PENDING)
+------------------------+----------+-------+-----------------+--------------------+-------------------------------+--------+
| Trial name             | status   | loc   |   learning_rate |   num_train_epochs |   per_device_train_batch_size |   seed |
|------------------------+----------+-------+-----------------+--------------------+-------------------------------+--------|
| _objective_fd611_00000 | PENDING  |       |     2.49816e-05 |                  2 |                            32 |     15 |
| _objective_fd611_00001 | PENDING  |       |     4.11876e-05 |                  2 |                            16 |     39 |
| _objective_fd611_00002 | PENDING  |       |     1.62398e-05 |             

[2m[36m(pid=320)[0m 2021-07-28 13:01:58.519055: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0


== Status ==
Memory usage on this node: 4.6/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 1.0/2 CPUs, 1.0/1 GPUs, 0.0/7.3 GiB heap, 0.0/3.65 GiB objects (0.0/1.0 accelerator_type:T4)
Result logdir: /root/ray_results/_objective_2021-07-28_13-01-57
Number of trials: 5/5 (4 PENDING, 1 RUNNING)
+------------------------+----------+-------+-----------------+--------------------+-------------------------------+--------+
| Trial name             | status   | loc   |   learning_rate |   num_train_epochs |   per_device_train_batch_size |   seed |
|------------------------+----------+-------+-----------------+--------------------+-------------------------------+--------|
| _objective_fd611_00000 | RUNNING  |       |     2.49816e-05 |                  2 |                            32 |     15 |
| _objective_fd611_00001 | PENDING  |       |     4.11876e-05 |                  2 |                            16 |     39 |
| _objective_fd611_00002 | PENDING  |       |     1.62398e-05

[2m[36m(pid=320)[0m Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight']
[2m[36m(pid=320)[0m - This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
[2m[36m(pid=320)[0m - This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[2m[36m(pid=320)[0m Som

[2m[36m(pid=320)[0m {'train_runtime': 36.9152, 'train_samples_per_second': 108.356, 'train_steps_per_second': 3.413, 'train_loss': 0.13319183531261625, 'epoch': 2.0}


  5%|▍         | 12/250 [00:00<00:05, 43.14it/s]
  7%|▋         | 17/250 [00:00<00:05, 41.44it/s]
  9%|▉         | 22/250 [00:00<00:05, 39.94it/s]
 11%|█         | 27/250 [00:00<00:05, 39.70it/s]
 12%|█▏        | 31/250 [00:00<00:05, 39.15it/s]
 14%|█▍        | 35/250 [00:00<00:05, 39.20it/s]
 16%|█▌        | 39/250 [00:00<00:05, 39.06it/s]
 17%|█▋        | 43/250 [00:01<00:05, 39.12it/s]
 19%|█▉        | 47/250 [00:01<00:05, 39.23it/s]
 20%|██        | 51/250 [00:01<00:05, 39.11it/s]
 22%|██▏       | 55/250 [00:01<00:04, 39.34it/s]
 24%|██▎       | 59/250 [00:01<00:04, 39.07it/s]
 25%|██▌       | 63/250 [00:01<00:04, 39.05it/s]
 27%|██▋       | 67/250 [00:01<00:04, 38.98it/s]
 28%|██▊       | 71/250 [00:01<00:04, 38.74it/s]
 30%|███       | 75/250 [00:01<00:04, 38.65it/s]
 32%|███▏      | 79/250 [00:02<00:04, 38.49it/s]
 33%|███▎      | 83/250 [00:02<00:04, 38.68it/s]
 35%|███▍      | 87/250 [00:02<00:04, 38.77it/s]
 36%|███▋      | 91/250 [00:02<00:04, 38.82it/s]
 38%|███▊      | 95/

Result for _objective_fd611_00000:
  date: 2021-07-28_13-03-02
  done: true
  epoch: 2.0
  eval_loss: 1.0793331861495972
  eval_runtime: 6.4403
  eval_samples_per_second: 310.542
  eval_steps_per_second: 38.818
  experiment_id: 91bc9af23e9047ca98d3a231d9c54648
  hostname: 6fa48dca3aae
  iterations_since_restore: 1
  node_ip: 172.28.0.2
  objective: 1.0793331861495972
  pid: 320
  time_since_restore: 62.70022535324097
  time_this_iter_s: 62.70022535324097
  time_total_s: 62.70022535324097
  timestamp: 1627477382
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: fd611_00000
  
== Status ==
Memory usage on this node: 5.2/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.3 GiB heap, 0.0/3.65 GiB objects (0.0/1.0 GPU_group_25ae3f8c35fdf1b9358965fa86174a52, 0.0/1.0 CPU_group_0_25ae3f8c35fdf1b9358965fa86174a52, 0.0/1.0 CPU_group_25ae3f8c35fdf1b9358965fa86174a52, 0.0/1.0 accelerator_type:T4, 0.0/1.0 GPU_group_0_25ae3f8c35fdf1b9358965fa86174

[2m[36m(pid=320)[0m 100%|██████████| 250/250 [00:06<00:00, 38.99it/s]100%|██████████| 250/250 [00:06<00:00, 38.92it/s]
[2m[36m(pid=319)[0m 2021-07-28 13:03:04.031870: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
[2m[36m(pid=319)[0m Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias']
[2m[36m(pid=319)[0m - This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).


[2m[36m(pid=319)[0m {'train_runtime': 41.5507, 'train_samples_per_second': 96.268, 'train_steps_per_second': 6.017, 'train_loss': 0.049001953125, 'epoch': 2.0}


[2m[36m(pid=319)[0m   0%|          | 0/250 [00:00<?, ?it/s]
  2%|▏         | 6/250 [00:00<00:05, 48.67it/s]
  4%|▍         | 11/250 [00:00<00:05, 41.83it/s]
  6%|▋         | 16/250 [00:00<00:05, 39.97it/s]
  8%|▊         | 21/250 [00:00<00:05, 39.25it/s]
 10%|█         | 25/250 [00:00<00:05, 38.66it/s]
 12%|█▏        | 29/250 [00:00<00:05, 38.33it/s]
 13%|█▎        | 33/250 [00:00<00:05, 38.06it/s]
 15%|█▍        | 37/250 [00:00<00:05, 38.37it/s]
 16%|█▋        | 41/250 [00:01<00:05, 38.29it/s]
 18%|█▊        | 45/250 [00:01<00:05, 38.19it/s]
 20%|█▉        | 49/250 [00:01<00:05, 38.05it/s]
 21%|██        | 53/250 [00:01<00:05, 38.04it/s]
 23%|██▎       | 57/250 [00:01<00:05, 38.07it/s]
 24%|██▍       | 61/250 [00:01<00:04, 37.98it/s]
 26%|██▌       | 65/250 [00:01<00:04, 37.87it/s]
 28%|██▊       | 69/250 [00:01<00:04, 37.61it/s]
 29%|██▉       | 73/250 [00:01<00:04, 37.83it/s]
 31%|███       | 77/250 [00:01<00:04, 37.84it/s]
 32%|███▏      | 81/250 [00:02<00:04, 37.76it/s]
 34%|█

Result for _objective_fd611_00001:
  date: 2021-07-28_13-04-01
  done: true
  epoch: 2.0
  eval_loss: 1.8875880241394043
  eval_runtime: 6.6071
  eval_samples_per_second: 302.704
  eval_steps_per_second: 37.838
  experiment_id: 87582f4cf1714f0b8125d0b9baab35d3
  hostname: 6fa48dca3aae
  iterations_since_restore: 1
  node_ip: 172.28.0.2
  objective: 1.8875880241394043
  pid: 319
  time_since_restore: 55.14841175079346
  time_this_iter_s: 55.14841175079346
  time_total_s: 55.14841175079346
  timestamp: 1627477441
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: fd611_00001
  
== Status ==
Memory usage on this node: 5.1/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.3 GiB heap, 0.0/3.65 GiB objects (0.0/1.0 GPU_group_25ae3f8c35fdf1b9358965fa86174a52, 0.0/1.0 GPU_group_0_25ae3f8c35fdf1b9358965fa86174a52, 0.0/1.0 CPU_group_0_25ae3f8c35fdf1b9358965fa86174a52, 0.0/1.0 accelerator_type:T4, 0.0/1.0 CPU_group_25ae3f8c35fdf1b9358965fa86174

[2m[36m(pid=319)[0m 100%|█████████▉| 249/250 [00:06<00:00, 38.00it/s]100%|██████████| 250/250 [00:06<00:00, 37.94it/s]
[2m[36m(pid=432)[0m 2021-07-28 13:04:02.548069: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
[2m[36m(pid=432)[0m Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias']
[2m[36m(pid=432)[0m - This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).


[2m[36m(pid=432)[0m {'loss': 0.0496, 'learning_rate': 9.743868488068864e-06, 'epoch': 2.0}


[2m[36m(pid=432)[0m  40%|████      | 501/1250 [00:59<20:13,  1.62s/it]
 40%|████      | 502/1250 [00:59<14:40,  1.18s/it]
 40%|████      | 503/1250 [00:59<10:40,  1.17it/s]
 40%|████      | 504/1250 [00:59<07:51,  1.58it/s]
 40%|████      | 505/1250 [00:59<05:53,  2.11it/s]
 40%|████      | 506/1250 [00:59<04:31,  2.74it/s]
 41%|████      | 507/1250 [00:59<03:33,  3.48it/s]
 41%|████      | 508/1250 [00:59<02:52,  4.30it/s]
 41%|████      | 509/1250 [00:59<02:24,  5.13it/s]
 41%|████      | 510/1250 [01:00<02:04,  5.93it/s]
 41%|████      | 511/1250 [01:00<01:50,  6.66it/s]
 41%|████      | 512/1250 [01:00<01:41,  7.24it/s]
 41%|████      | 513/1250 [01:00<01:34,  7.76it/s]
 41%|████      | 514/1250 [01:00<01:29,  8.21it/s]
 41%|████      | 515/1250 [01:00<01:26,  8.54it/s]
 41%|████▏     | 516/1250 [01:00<01:23,  8.75it/s]
 41%|████▏     | 517/1250 [01:00<01:22,  8.91it/s]
 41%|████▏     | 518/1250 [01:00<01:21,  9.03it/s]
 42%|████▏     | 519/1250 [01:00<01:20,  9.11it/s]
 42%|██

[2m[36m(pid=432)[0m {'loss': 0.0039, 'learning_rate': 3.2479561626896213e-06, 'epoch': 4.0}


[2m[36m(pid=432)[0m  80%|████████  | 1001/1250 [01:57<06:18,  1.52s/it]
 80%|████████  | 1002/1250 [01:57<04:35,  1.11s/it]
 80%|████████  | 1003/1250 [01:57<03:19,  1.24it/s]
 80%|████████  | 1004/1250 [01:58<02:27,  1.66it/s]
 80%|████████  | 1005/1250 [01:58<01:51,  2.21it/s]
 80%|████████  | 1006/1250 [01:58<01:25,  2.86it/s]
 81%|████████  | 1007/1250 [01:58<01:07,  3.60it/s]
 81%|████████  | 1008/1250 [01:58<00:54,  4.41it/s]
 81%|████████  | 1009/1250 [01:58<00:45,  5.24it/s]
 81%|████████  | 1010/1250 [01:58<00:39,  6.02it/s]
 81%|████████  | 1011/1250 [01:58<00:35,  6.74it/s]
 81%|████████  | 1012/1250 [01:58<00:32,  7.32it/s]
 81%|████████  | 1013/1250 [01:59<00:30,  7.78it/s]
 81%|████████  | 1014/1250 [01:59<00:28,  8.17it/s]
 81%|████████  | 1015/1250 [01:59<00:27,  8.49it/s]
 81%|████████▏ | 1016/1250 [01:59<00:26,  8.69it/s]
 81%|████████▏ | 1017/1250 [01:59<00:26,  8.82it/s]
 81%|████████▏ | 1018/1250 [01:59<00:26,  8.90it/s]
 82%|████████▏ | 1019/1250 [01:59<00:25,

[2m[36m(pid=432)[0m {'train_runtime': 145.7233, 'train_samples_per_second': 68.623, 'train_steps_per_second': 8.578, 'train_loss': 0.02197547731399536, 'epoch': 5.0}


[2m[36m(pid=432)[0m   4%|▍         | 10/250 [00:00<00:05, 40.24it/s]
  6%|▌         | 15/250 [00:00<00:06, 38.81it/s]
  8%|▊         | 19/250 [00:00<00:06, 38.01it/s]
  9%|▉         | 23/250 [00:00<00:06, 37.67it/s]
 11%|█         | 27/250 [00:00<00:06, 36.94it/s]
 12%|█▏        | 31/250 [00:00<00:05, 36.80it/s]
 14%|█▍        | 35/250 [00:00<00:05, 36.86it/s]
 16%|█▌        | 39/250 [00:01<00:05, 36.53it/s]
 17%|█▋        | 43/250 [00:01<00:05, 36.99it/s]
 19%|█▉        | 47/250 [00:01<00:05, 37.21it/s]
 20%|██        | 51/250 [00:01<00:05, 37.27it/s]
 22%|██▏       | 55/250 [00:01<00:05, 36.54it/s]
 24%|██▎       | 59/250 [00:01<00:05, 37.20it/s]
 25%|██▌       | 63/250 [00:01<00:05, 37.06it/s]
 27%|██▋       | 67/250 [00:01<00:04, 36.94it/s]
 28%|██▊       | 71/250 [00:01<00:04, 36.18it/s]
 30%|███       | 75/250 [00:02<00:04, 36.88it/s]
 32%|███▏      | 79/250 [00:02<00:04, 36.99it/s]
 33%|███▎      | 83/250 [00:02<00:04, 37.09it/s]
 35%|███▍      | 87/250 [00:02<00:04, 36.86it

Result for _objective_fd611_00002:
  date: 2021-07-28_13-06-43
  done: true
  epoch: 5.0
  eval_loss: 2.269000768661499
  eval_runtime: 6.8069
  eval_samples_per_second: 293.817
  eval_steps_per_second: 36.727
  experiment_id: d9be7138858c4af8ab24c468c01f2c4c
  hostname: 6fa48dca3aae
  iterations_since_restore: 1
  node_ip: 172.28.0.2
  objective: 2.269000768661499
  pid: 432
  time_since_restore: 159.218918800354
  time_this_iter_s: 159.218918800354
  time_total_s: 159.218918800354
  timestamp: 1627477603
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: fd611_00002
  
== Status ==
Memory usage on this node: 6.0/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.3 GiB heap, 0.0/3.65 GiB objects (0.0/1.0 accelerator_type:T4, 0.0/1.0 CPU_group_25ae3f8c35fdf1b9358965fa86174a52, 0.0/1.0 GPU_group_0_25ae3f8c35fdf1b9358965fa86174a52, 0.0/1.0 CPU_group_0_25ae3f8c35fdf1b9358965fa86174a52, 0.0/1.0 GPU_group_25ae3f8c35fdf1b9358965fa86174a52)


[2m[36m(pid=432)[0m 100%|██████████| 250/250 [00:06<00:00, 36.81it/s]
[2m[36m(pid=496)[0m 2021-07-28 13:06:45.154186: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
[2m[36m(pid=496)[0m Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
[2m[36m(pid=496)[0m - This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
[2m[36m(pid=496)[0m - This IS NOT expected if yo

[2m[36m(pid=496)[0m {'loss': 0.028, 'learning_rate': 2.0426760281837013e-05, 'epoch': 2.0}


[2m[36m(pid=496)[0m  40%|████      | 501/1250 [00:58<17:31,  1.40s/it]
 40%|████      | 502/1250 [00:58<12:43,  1.02s/it]
 40%|████      | 503/1250 [00:59<09:20,  1.33it/s]
 40%|████      | 504/1250 [00:59<06:55,  1.79it/s]
 40%|████      | 505/1250 [00:59<05:14,  2.37it/s]
 40%|████      | 506/1250 [00:59<04:04,  3.04it/s]
 41%|████      | 507/1250 [00:59<03:14,  3.81it/s]
 41%|████      | 508/1250 [00:59<02:40,  4.63it/s]
 41%|████      | 509/1250 [00:59<02:16,  5.43it/s]
 41%|████      | 510/1250 [00:59<01:59,  6.19it/s]
 41%|████      | 511/1250 [00:59<01:47,  6.88it/s]
 41%|████      | 512/1250 [01:00<01:39,  7.43it/s]
 41%|████      | 513/1250 [01:00<01:33,  7.89it/s]
 41%|████      | 514/1250 [01:00<01:28,  8.29it/s]
 41%|████      | 515/1250 [01:00<01:26,  8.53it/s]
 41%|████▏     | 516/1250 [01:00<01:23,  8.77it/s]
 41%|████▏     | 517/1250 [01:00<01:21,  8.94it/s]
 41%|████▏     | 518/1250 [01:00<01:21,  9.02it/s]
 42%|████▏     | 519/1250 [01:00<01:20,  9.08it/s]
 42%|██

[2m[36m(pid=496)[0m {'loss': 0.0015, 'learning_rate': 6.808920093945672e-06, 'epoch': 4.0}


[2m[36m(pid=496)[0m  80%|████████  | 1001/1250 [01:57<05:48,  1.40s/it]
 80%|████████  | 1002/1250 [01:57<04:11,  1.01s/it]
 80%|████████  | 1003/1250 [01:57<03:03,  1.34it/s]
 80%|████████  | 1004/1250 [01:58<02:16,  1.81it/s]
 80%|████████  | 1005/1250 [01:58<01:42,  2.38it/s]
 80%|████████  | 1006/1250 [01:58<01:19,  3.07it/s]
 81%|████████  | 1007/1250 [01:58<01:03,  3.83it/s]
 81%|████████  | 1008/1250 [01:58<00:52,  4.64it/s]
 81%|████████  | 1009/1250 [01:58<00:44,  5.43it/s]
 81%|████████  | 1010/1250 [01:58<00:38,  6.20it/s]
 81%|████████  | 1011/1250 [01:58<00:34,  6.89it/s]
 81%|████████  | 1012/1250 [01:58<00:32,  7.42it/s]
 81%|████████  | 1013/1250 [01:59<00:30,  7.89it/s]
 81%|████████  | 1014/1250 [01:59<00:28,  8.23it/s]
 81%|████████  | 1015/1250 [01:59<00:27,  8.48it/s]
 81%|████████▏ | 1016/1250 [01:59<00:26,  8.70it/s]
 81%|████████▏ | 1017/1250 [01:59<00:26,  8.86it/s]
 81%|████████▏ | 1018/1250 [01:59<00:25,  8.97it/s]
 82%|████████▏ | 1019/1250 [01:59<00:25,

[2m[36m(pid=496)[0m {'train_runtime': 145.9153, 'train_samples_per_second': 68.533, 'train_steps_per_second': 8.567, 'train_loss': 0.011978768873214722, 'epoch': 5.0}


  4%|▍         | 10/250 [00:00<00:05, 40.66it/s]
  6%|▌         | 15/250 [00:00<00:06, 38.60it/s]
  8%|▊         | 19/250 [00:00<00:06, 37.60it/s]
  9%|▉         | 23/250 [00:00<00:06, 37.24it/s]
 11%|█         | 27/250 [00:00<00:06, 36.88it/s]
 12%|█▏        | 31/250 [00:00<00:05, 36.76it/s]
 14%|█▍        | 35/250 [00:00<00:05, 36.13it/s]
 16%|█▌        | 39/250 [00:01<00:05, 36.56it/s]
 17%|█▋        | 43/250 [00:01<00:05, 36.77it/s]
 19%|█▉        | 47/250 [00:01<00:05, 36.81it/s]
 20%|██        | 51/250 [00:01<00:05, 36.87it/s]
 22%|██▏       | 55/250 [00:01<00:05, 36.04it/s]
 24%|██▎       | 59/250 [00:01<00:05, 36.63it/s]
 25%|██▌       | 63/250 [00:01<00:05, 36.60it/s]
 27%|██▋       | 67/250 [00:01<00:05, 36.08it/s]
 28%|██▊       | 71/250 [00:01<00:04, 36.44it/s]
 30%|███       | 75/250 [00:02<00:04, 36.39it/s]
 32%|███▏      | 79/250 [00:02<00:04, 36.42it/s]
 33%|███▎      | 83/250 [00:02<00:04, 36.44it/s]
 35%|███▍      | 87/250 [00:02<00:04, 36.30it/s]
 36%|███▋      | 91/

Result for _objective_fd611_00003:
  date: 2021-07-28_13-09-26
  done: true
  epoch: 5.0
  eval_loss: 2.877042293548584
  eval_runtime: 6.8815
  eval_samples_per_second: 290.636
  eval_steps_per_second: 36.329
  experiment_id: 5e8ba900e1b147e99e9491f733371ccf
  hostname: 6fa48dca3aae
  iterations_since_restore: 1
  node_ip: 172.28.0.2
  objective: 2.877042293548584
  pid: 496
  time_since_restore: 159.63867330551147
  time_this_iter_s: 159.63867330551147
  time_total_s: 159.63867330551147
  timestamp: 1627477766
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: fd611_00003
  
== Status ==
Memory usage on this node: 6.1/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.3 GiB heap, 0.0/3.65 GiB objects (0.0/1.0 GPU_group_25ae3f8c35fdf1b9358965fa86174a52, 0.0/1.0 CPU_group_25ae3f8c35fdf1b9358965fa86174a52, 0.0/1.0 CPU_group_0_25ae3f8c35fdf1b9358965fa86174a52, 0.0/1.0 GPU_group_0_25ae3f8c35fdf1b9358965fa86174a52, 0.0/1.0 accelerator_typ

[2m[36m(pid=562)[0m 2021-07-28 13:09:28.426290: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
[2m[36m(pid=562)[0m Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias']
[2m[36m(pid=562)[0m - This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
[2m[36m(pid=562)[0m - This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a 

[2m[36m(pid=562)[0m {'train_runtime': 64.0051, 'train_samples_per_second': 93.743, 'train_steps_per_second': 5.859, 'train_loss': 0.032019093831380206, 'epoch': 3.0}


[2m[36m(pid=562)[0m   4%|▍         | 10/250 [00:00<00:05, 40.93it/s]
  6%|▌         | 15/250 [00:00<00:06, 38.47it/s]
  8%|▊         | 19/250 [00:00<00:06, 37.76it/s]
  9%|▉         | 23/250 [00:00<00:06, 36.84it/s]
 11%|█         | 27/250 [00:00<00:06, 36.96it/s]
 12%|█▏        | 31/250 [00:00<00:05, 36.74it/s]
 14%|█▍        | 35/250 [00:00<00:05, 36.48it/s]
 16%|█▌        | 39/250 [00:01<00:05, 36.68it/s]
 17%|█▋        | 43/250 [00:01<00:05, 36.65it/s]
 19%|█▉        | 47/250 [00:01<00:05, 36.70it/s]
 20%|██        | 51/250 [00:01<00:05, 36.42it/s]
 22%|██▏       | 55/250 [00:01<00:05, 36.49it/s]
 24%|██▎       | 59/250 [00:01<00:05, 36.41it/s]
 25%|██▌       | 63/250 [00:01<00:05, 36.26it/s]
 27%|██▋       | 67/250 [00:01<00:05, 36.10it/s]
 28%|██▊       | 71/250 [00:01<00:04, 36.34it/s]
 30%|███       | 75/250 [00:02<00:04, 36.49it/s]
 32%|███▏      | 79/250 [00:02<00:04, 36.45it/s]
 33%|███▎      | 83/250 [00:02<00:04, 36.46it/s]
 35%|███▍      | 87/250 [00:02<00:04, 36.37it

Result for _objective_fd611_00004:
  date: 2021-07-28_13-10-48
  done: true
  epoch: 3.0
  eval_loss: 2.345756769180298
  eval_runtime: 6.8847
  eval_samples_per_second: 290.499
  eval_steps_per_second: 36.312
  experiment_id: d9401e4559ae4af08f126ae20956742e
  hostname: 6fa48dca3aae
  iterations_since_restore: 1
  node_ip: 172.28.0.2
  objective: 2.345756769180298
  pid: 562
  time_since_restore: 77.84725332260132
  time_this_iter_s: 77.84725332260132
  time_total_s: 77.84725332260132
  timestamp: 1627477848
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: fd611_00004
  
== Status ==
Memory usage on this node: 6.2/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.3 GiB heap, 0.0/3.65 GiB objects (0.0/1.0 accelerator_type:T4, 0.0/1.0 GPU_group_25ae3f8c35fdf1b9358965fa86174a52, 0.0/1.0 CPU_group_25ae3f8c35fdf1b9358965fa86174a52, 0.0/1.0 CPU_group_0_25ae3f8c35fdf1b9358965fa86174a52, 0.0/1.0 GPU_group_0_25ae3f8c35fdf1b9358965fa86174a5

Search Algorithm : 
  * If not provided `BasicVariantGenerator` Random search and grid search
  * Link : https://docs.ray.io/en/latest/tune/api_docs/suggestion.html#tune-basicvariant 

### Best run 

Best_run : Optuna (n_trails = 5) with default hspace

```
BestRun(run_id='1', objective=0.8464898467063904, hyperparameters={'learning_rate': 3.2522034211592625e-06, 'num_train_epochs': 1, 'seed': 24, 'per_device_train_batch_size': 32})
```

Best_run : ray (n_trials = 3) with custom hspace
```
BestRun(run_id='9a82c_00000', objective=2.791937828063965, hyperparameters={'learning_rate': 2.49816047538945e-05, 'num_train_epochs': 2, 'per_device_train_batch_size': 32})
```

Best_run : ray (n_trails = 5) with custom hspace (**To be used**)

In [None]:
best_run 

BestRun(run_id='fd611_00000', objective=1.0793331861495972, hyperparameters={'learning_rate': 2.49816047538945e-05, 'num_train_epochs': 2, 'seed': 15, 'per_device_train_batch_size': 32})

## XL-Net

In [None]:
model = 'xlnet-base-cased'

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model,problem_type="multi_label_classification")

Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file https://huggingface.co/xlnet-base-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/06bdb0f5882dbb833618c81c3b4c996a0c79422fa2c95ffea3827f92fc2dba6b.da982e2e596ec73828dbae86525a1870e513bd63aae5a2dc773ccc840ac5c346
Model config XLNetConfig {
  "architectures": [
    "XLNetLMHeadModel"
  ],
  "attn_type": "bi",
  "bi_data": false,
  "bos_token_id": 1,
  "clamp_len": -1,
  "d_head": 64,
  "d_inner": 3072,
  "d_model": 768,
  "dropout": 0.1,
  "end_n_top": 5,
  "eos_token_id": 2,
  "ff_activation": "gelu",
  "initializer_range": 0.02,
  "layer_norm_eps": 1e-12,
  "mem_len": null,
  "model_type": "xlnet",
  "n_head": 12,
  "n_layer": 12,
  "pad_token_id": 5,
  "problem_type": "multi_label_classification",
  "reuse_len": null,
  "same_length": false,
  "start_n_top": 5,
  "summary_activation": "tanh",
  "summary_last_dropout": 0.1,
  "summar

Under-sampling for 1000 due to large LM size 

In [None]:
train_df = train_df.sample(2000)
val_df = val_df.sample(2000)

In [None]:
train_dataset = ExplicitStereotypeDataset(
  train_df,
  tokenizer,
  max_token_len=MAX_LEN
)

In [None]:
val_dataset = ExplicitStereotypeDataset(
  val_df,
  tokenizer,
  max_token_len=MAX_LEN
)

In [None]:
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(model, problem_type="multi_label_classification", num_labels = num_labels )

In [None]:
# Evaluate during training and a bit more often than the default to be able to prune bad trials early.
# Disabling tqdm is a matter of preference.
# batch_size = 8

# training_args = TrainingArguments(
#     "test", evaluate_during_training=True, eval_steps=500, disable_tqdm=True)

trainer = Trainer(
    model_init= model_init,
    tokenizer = tokenizer,
    train_dataset=train_dataset, 
    eval_dataset=val_dataset,
)

No `TrainingArguments` passed, using `output_dir=tmp_trainer`.
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
loading configuration file https://huggingface.co/xlnet-base-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/06bdb0f5882dbb833618c81c3b4c996a0c79422fa2c95ffea3827f92fc2dba6b.da982e2e596ec73828dbae86525a1870e513bd63aae5a2dc773ccc840ac5c346
Model config XLNetConfig {
  "architectures": [
    "XLNetLMHeadModel"
  ],
  "attn_type": "bi",
  "bi_data": false,
  "bos_token_id": 1,
  "clamp_len": -1,
  "d_head": 64,
  "d_inner": 3072,
  "d_model": 768,
  "dropout": 0.1,
  "end_n_top": 5,
  "eos_token_id": 2,
  "ff_activation": "gelu",
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2

Downloading:   0%|          | 0.00/467M [00:00<?, ?B/s]

storing https://huggingface.co/xlnet-base-cased/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/9461853998373b0b2f8ef8011a13b62a2c5f540b2c535ef3ea46ed8a062b16a9.3e214f11a50e9e03eb47535b58522fc3cc11ac67c120a9450f6276de151af987
creating metadata file for /root/.cache/huggingface/transformers/9461853998373b0b2f8ef8011a13b62a2c5f540b2c535ef3ea46ed8a062b16a9.3e214f11a50e9e03eb47535b58522fc3cc11ac67c120a9450f6276de151af987
loading weights file https://huggingface.co/xlnet-base-cased/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/9461853998373b0b2f8ef8011a13b62a2c5f540b2c535ef3ea46ed8a062b16a9.3e214f11a50e9e03eb47535b58522fc3cc11ac67c120a9450f6276de151af987
Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetForSequenceClassification: ['lm_loss.weight', 'lm_loss.bias']
- This IS expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model trained on 

In [None]:
# Defaut objective is the sum of all metrics when metrics are provided, so we have to maximize it.
best_run = trainer.hyperparameter_search(n_trials=5,hp_space=my_hp_space,backend="ray")

No `resources_per_trial` arg was passed into `hyperparameter_search`. Setting it to a default value of 1 CPU and 1 GPU for each trial.
2021-07-22 16:57:59,849	INFO services.py:1274 -- View the Ray dashboard at [1m[32mhttp://127.0.0.1:8265[39m[22m


== Status ==
Memory usage on this node: 3.4/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.3 GiB heap, 0.0/3.65 GiB objects (0.0/1.0 accelerator_type:T4)
Result logdir: /root/ray_results/_objective_2021-07-22_16-58-04
Number of trials: 5/5 (5 PENDING)
+------------------------+----------+-------+-----------------+--------------------+-------------------------------+--------+
| Trial name             | status   | loc   |   learning_rate |   num_train_epochs |   per_device_train_batch_size |   seed |
|------------------------+----------+-------+-----------------+--------------------+-------------------------------+--------|
| _objective_fb0b4_00000 | PENDING  |       |     2.49816e-05 |                  2 |                            32 |     15 |
| _objective_fb0b4_00001 | PENDING  |       |     4.11876e-05 |                  2 |                            16 |     39 |
| _objective_fb0b4_00002 | PENDING  |       |     1.62398e-05 |             

[2m[36m(pid=628)[0m 2021-07-22 16:58:05.530782: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0


== Status ==
Memory usage on this node: 4.5/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 1.0/2 CPUs, 1.0/1 GPUs, 0.0/7.3 GiB heap, 0.0/3.65 GiB objects (0.0/1.0 accelerator_type:T4)
Result logdir: /root/ray_results/_objective_2021-07-22_16-58-04
Number of trials: 5/5 (4 PENDING, 1 RUNNING)
+------------------------+----------+-------+-----------------+--------------------+-------------------------------+--------+
| Trial name             | status   | loc   |   learning_rate |   num_train_epochs |   per_device_train_batch_size |   seed |
|------------------------+----------+-------+-----------------+--------------------+-------------------------------+--------|
| _objective_fb0b4_00000 | RUNNING  |       |     2.49816e-05 |                  2 |                            32 |     15 |
| _objective_fb0b4_00001 | PENDING  |       |     4.11876e-05 |                  2 |                            16 |     39 |
| _objective_fb0b4_00002 | PENDING  |       |     1.62398e-05

[2m[36m(pid=628)[0m Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetForSequenceClassification: ['lm_loss.weight', 'lm_loss.bias']
[2m[36m(pid=628)[0m - This IS expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
[2m[36m(pid=628)[0m - This IS NOT expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[2m[36m(pid=628)[0m Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['sequence_summary.summary.weight', 'sequence_summary.summary.bias', 'logits_proj.bias', 'logits_proj.weight']
[2m

[2m[36m(pid=628)[0m {'train_runtime': 42.8119, 'train_samples_per_second': 93.432, 'train_steps_per_second': 2.943, 'train_loss': 0.034531430592612614, 'epoch': 2.0}


[2m[36m(pid=628)[0m   4%|▎         | 9/250 [00:00<00:07, 33.46it/s]
  5%|▌         | 13/250 [00:00<00:07, 32.27it/s]
  7%|▋         | 17/250 [00:00<00:07, 31.26it/s]
  8%|▊         | 21/250 [00:00<00:07, 31.48it/s]
 10%|█         | 25/250 [00:00<00:07, 30.72it/s]
 12%|█▏        | 29/250 [00:00<00:07, 30.88it/s]
 13%|█▎        | 33/250 [00:01<00:07, 30.37it/s]
 15%|█▍        | 37/250 [00:01<00:06, 31.01it/s]
 16%|█▋        | 41/250 [00:01<00:06, 30.65it/s]
 18%|█▊        | 45/250 [00:01<00:06, 30.78it/s]
 20%|█▉        | 49/250 [00:01<00:06, 30.69it/s]
 21%|██        | 53/250 [00:01<00:06, 30.57it/s]
 23%|██▎       | 57/250 [00:01<00:06, 30.63it/s]
 24%|██▍       | 61/250 [00:01<00:06, 30.65it/s]
 26%|██▌       | 65/250 [00:02<00:06, 30.65it/s]
 28%|██▊       | 69/250 [00:02<00:05, 30.41it/s]
 29%|██▉       | 73/250 [00:02<00:05, 30.54it/s]
 31%|███       | 77/250 [00:02<00:05, 30.34it/s]
 32%|███▏      | 81/250 [00:02<00:05, 30.45it/s]
 34%|███▍      | 85/250 [00:02<00:05, 30.35it/

Result for _objective_fb0b4_00000:
  date: 2021-07-22_16-59-18
  done: true
  epoch: 2.0
  eval_loss: 2.250964879989624
  eval_runtime: 8.2556
  eval_samples_per_second: 242.259
  eval_steps_per_second: 30.282
  experiment_id: 0d8a08ebe2d04c6a9c4a24fc6f2093ee
  hostname: b2ef10663337
  iterations_since_restore: 1
  node_ip: 172.28.0.2
  objective: 2.250964879989624
  pid: 628
  time_since_restore: 71.2275619506836
  time_this_iter_s: 71.2275619506836
  time_total_s: 71.2275619506836
  timestamp: 1626973158
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: fb0b4_00000
  
== Status ==
Memory usage on this node: 5.2/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.3 GiB heap, 0.0/3.65 GiB objects (0.0/1.0 accelerator_type:T4, 0.0/1.0 GPU_group_cfb6ad648905d2d9fd8c696e4f854679, 0.0/1.0 CPU_group_cfb6ad648905d2d9fd8c696e4f854679, 0.0/1.0 CPU_group_0_cfb6ad648905d2d9fd8c696e4f854679, 0.0/1.0 GPU_group_0_cfb6ad648905d2d9fd8c696e4f854679)


[2m[36m(pid=629)[0m 2021-07-22 16:59:19.413121: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
[2m[36m(pid=629)[0m Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetForSequenceClassification: ['lm_loss.bias', 'lm_loss.weight']
[2m[36m(pid=629)[0m - This IS expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
[2m[36m(pid=629)[0m - This IS NOT expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[2m[36m(pid=629)[0m Some weights of XLNetForSequenceClassification were not initialized from the model check

[2m[36m(pid=629)[0m {'train_runtime': 48.8778, 'train_samples_per_second': 81.837, 'train_steps_per_second': 5.115, 'train_loss': 0.015162269592285156, 'epoch': 2.0}


[2m[36m(pid=629)[0m   4%|▍         | 10/250 [00:00<00:07, 33.34it/s]
  6%|▌         | 14/250 [00:00<00:07, 31.70it/s]
  7%|▋         | 18/250 [00:00<00:07, 31.18it/s]
  9%|▉         | 22/250 [00:00<00:07, 30.52it/s]
 10%|█         | 26/250 [00:00<00:07, 29.89it/s]
 12%|█▏        | 30/250 [00:00<00:07, 30.11it/s]
 14%|█▎        | 34/250 [00:01<00:07, 29.86it/s]
 15%|█▍        | 37/250 [00:01<00:07, 29.77it/s]
 16%|█▌        | 40/250 [00:01<00:07, 29.75it/s]
 17%|█▋        | 43/250 [00:01<00:06, 29.69it/s]
 18%|█▊        | 46/250 [00:01<00:06, 29.56it/s]
 20%|█▉        | 49/250 [00:01<00:06, 29.51it/s]
 21%|██        | 52/250 [00:01<00:06, 29.28it/s]
 22%|██▏       | 55/250 [00:01<00:06, 29.40it/s]
 23%|██▎       | 58/250 [00:01<00:06, 28.86it/s]
 25%|██▍       | 62/250 [00:02<00:06, 29.74it/s]
 26%|██▌       | 65/250 [00:02<00:06, 29.57it/s]
 27%|██▋       | 68/250 [00:02<00:06, 29.47it/s]
 28%|██▊       | 71/250 [00:02<00:06, 29.46it/s]
 30%|██▉       | 74/250 [00:02<00:06, 29.19it

Result for _objective_fb0b4_00001:
  date: 2021-07-22_17-00-25
  done: true
  epoch: 2.0
  eval_loss: 2.5871098041534424
  eval_runtime: 8.5021
  eval_samples_per_second: 235.237
  eval_steps_per_second: 29.405
  experiment_id: 3a3779ca9307496496de4cae74de744f
  hostname: b2ef10663337
  iterations_since_restore: 1
  node_ip: 172.28.0.2
  objective: 2.5871098041534424
  pid: 629
  time_since_restore: 64.38682055473328
  time_this_iter_s: 64.38682055473328
  time_total_s: 64.38682055473328
  timestamp: 1626973225
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: fb0b4_00001
  
== Status ==
Memory usage on this node: 5.2/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.3 GiB heap, 0.0/3.65 GiB objects (0.0/1.0 accelerator_type:T4, 0.0/1.0 CPU_group_cfb6ad648905d2d9fd8c696e4f854679, 0.0/1.0 GPU_group_cfb6ad648905d2d9fd8c696e4f854679, 0.0/1.0 CPU_group_0_cfb6ad648905d2d9fd8c696e4f854679, 0.0/1.0 GPU_group_0_cfb6ad648905d2d9fd8c696e4f854

[2m[36m(pid=748)[0m 2021-07-22 17:00:27.021300: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
[2m[36m(pid=748)[0m Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetForSequenceClassification: ['lm_loss.bias', 'lm_loss.weight']
[2m[36m(pid=748)[0m - This IS expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
[2m[36m(pid=748)[0m - This IS NOT expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[2m[36m(pid=748)[0m Some weights of XLNetForSequenceClassification were not initialized from the model check

[2m[36m(pid=748)[0m {'loss': 0.016, 'learning_rate': 9.743868488068864e-06, 'epoch': 2.0}


[2m[36m(pid=748)[0m  40%|████      | 501/1250 [01:06<20:05,  1.61s/it]
 40%|████      | 502/1250 [01:07<14:36,  1.17s/it]
 40%|████      | 503/1250 [01:07<10:40,  1.17it/s]
 40%|████      | 504/1250 [01:07<07:54,  1.57it/s]
 40%|████      | 505/1250 [01:07<05:58,  2.08it/s]
 40%|████      | 506/1250 [01:07<04:38,  2.68it/s]
 41%|████      | 507/1250 [01:07<03:41,  3.36it/s]
 41%|████      | 508/1250 [01:07<03:01,  4.09it/s]
 41%|████      | 509/1250 [01:07<02:33,  4.82it/s]
 41%|████      | 510/1250 [01:08<02:14,  5.51it/s]
 41%|████      | 511/1250 [01:08<02:01,  6.10it/s]
 41%|████      | 512/1250 [01:08<01:51,  6.63it/s]
 41%|████      | 513/1250 [01:08<01:44,  7.04it/s]
 41%|████      | 514/1250 [01:08<01:40,  7.34it/s]
 41%|████      | 515/1250 [01:08<01:36,  7.58it/s]
 41%|████▏     | 516/1250 [01:08<01:34,  7.77it/s]
 41%|████▏     | 517/1250 [01:08<01:33,  7.86it/s]
 41%|████▏     | 518/1250 [01:09<01:31,  7.96it/s]
 42%|████▏     | 519/1250 [01:09<01:31,  8.03it/s]
 42%|██

[2m[36m(pid=748)[0m {'loss': 0.0013, 'learning_rate': 3.2479561626896213e-06, 'epoch': 4.0}


[2m[36m(pid=748)[0m  80%|████████  | 1001/1250 [02:13<06:34,  1.58s/it]
 80%|████████  | 1002/1250 [02:13<04:45,  1.15s/it]
 80%|████████  | 1003/1250 [02:13<03:28,  1.19it/s]
 80%|████████  | 1004/1250 [02:13<02:34,  1.60it/s]
 80%|████████  | 1005/1250 [02:13<01:56,  2.10it/s]
 80%|████████  | 1006/1250 [02:14<01:30,  2.71it/s]
 81%|████████  | 1007/1250 [02:14<01:11,  3.39it/s]
 81%|████████  | 1008/1250 [02:14<00:58,  4.11it/s]
 81%|████████  | 1009/1250 [02:14<00:49,  4.83it/s]
 81%|████████  | 1010/1250 [02:14<00:43,  5.48it/s]
 81%|████████  | 1011/1250 [02:14<00:39,  6.09it/s]
 81%|████████  | 1012/1250 [02:14<00:35,  6.62it/s]
 81%|████████  | 1013/1250 [02:14<00:33,  7.02it/s]
 81%|████████  | 1014/1250 [02:14<00:32,  7.34it/s]
 81%|████████  | 1015/1250 [02:15<00:30,  7.59it/s]
 81%|████████▏ | 1016/1250 [02:15<00:30,  7.75it/s]
 81%|████████▏ | 1017/1250 [02:15<00:29,  7.89it/s]
 81%|████████▏ | 1018/1250 [02:15<00:29,  7.97it/s]
 82%|████████▏ | 1019/1250 [02:15<00:28,

[2m[36m(pid=748)[0m {'train_runtime': 164.9398, 'train_samples_per_second': 60.628, 'train_steps_per_second': 7.579, 'train_loss': 0.007107789528369903, 'epoch': 5.0}


[2m[36m(pid=748)[0m   4%|▍         | 10/250 [00:00<00:07, 33.90it/s]
  6%|▌         | 14/250 [00:00<00:07, 32.69it/s]
  7%|▋         | 18/250 [00:00<00:07, 32.00it/s]
  9%|▉         | 22/250 [00:00<00:07, 31.94it/s]
 10%|█         | 26/250 [00:00<00:07, 31.78it/s]
 12%|█▏        | 30/250 [00:00<00:06, 31.75it/s]
 14%|█▎        | 34/250 [00:01<00:06, 31.45it/s]
 15%|█▌        | 38/250 [00:01<00:06, 31.86it/s]
 17%|█▋        | 42/250 [00:01<00:06, 31.55it/s]
 18%|█▊        | 46/250 [00:01<00:06, 31.24it/s]
 20%|██        | 50/250 [00:01<00:06, 30.99it/s]
 22%|██▏       | 54/250 [00:01<00:06, 31.07it/s]
 23%|██▎       | 58/250 [00:01<00:06, 31.11it/s]
 25%|██▍       | 62/250 [00:01<00:05, 31.45it/s]
 26%|██▋       | 66/250 [00:02<00:05, 31.19it/s]
 28%|██▊       | 70/250 [00:02<00:05, 31.61it/s]
 30%|██▉       | 74/250 [00:02<00:05, 31.23it/s]
 31%|███       | 78/250 [00:02<00:05, 31.14it/s]
 33%|███▎      | 82/250 [00:02<00:05, 31.16it/s]
 34%|███▍      | 86/250 [00:02<00:05, 31.68it

Result for _objective_fb0b4_00002:
  date: 2021-07-22_17-03-28
  done: true
  epoch: 5.0
  eval_loss: 3.0466084480285645
  eval_runtime: 8.0198
  eval_samples_per_second: 249.384
  eval_steps_per_second: 31.173
  experiment_id: cde24f4a9d9e407b9aed64cb9cd07c03
  hostname: b2ef10663337
  iterations_since_restore: 1
  node_ip: 172.28.0.2
  objective: 3.0466084480285645
  pid: 748
  time_since_restore: 179.82508325576782
  time_this_iter_s: 179.82508325576782
  time_total_s: 179.82508325576782
  timestamp: 1626973408
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: fb0b4_00002
  
== Status ==
Memory usage on this node: 6.1/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.3 GiB heap, 0.0/3.65 GiB objects (0.0/1.0 accelerator_type:T4, 0.0/1.0 CPU_group_0_cfb6ad648905d2d9fd8c696e4f854679, 0.0/1.0 GPU_group_cfb6ad648905d2d9fd8c696e4f854679, 0.0/1.0 CPU_group_cfb6ad648905d2d9fd8c696e4f854679, 0.0/1.0 GPU_group_0_cfb6ad648905d2d9fd8c696e4f

[2m[36m(pid=748)[0m 100%|██████████| 250/250 [00:07<00:00, 31.18it/s]100%|██████████| 250/250 [00:07<00:00, 31.27it/s]
[2m[36m(pid=815)[0m 2021-07-22 17:03:29.991391: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
[2m[36m(pid=815)[0m Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetForSequenceClassification: ['lm_loss.weight', 'lm_loss.bias']
[2m[36m(pid=815)[0m - This IS expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
[2m[36m(pid=815)[0m - This IS NOT expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassifica

[2m[36m(pid=815)[0m {'loss': 0.009, 'learning_rate': 2.0426760281837013e-05, 'epoch': 2.0}


[2m[36m(pid=815)[0m  40%|████      | 501/1250 [01:06<19:18,  1.55s/it]
 40%|████      | 502/1250 [01:06<14:00,  1.12s/it]
 40%|████      | 503/1250 [01:07<10:14,  1.22it/s]
 40%|████      | 504/1250 [01:07<07:36,  1.63it/s]
 40%|████      | 505/1250 [01:07<05:46,  2.15it/s]
 40%|████      | 506/1250 [01:07<04:29,  2.76it/s]
 41%|████      | 507/1250 [01:07<03:35,  3.45it/s]
 41%|████      | 508/1250 [01:07<02:57,  4.18it/s]
 41%|████      | 509/1250 [01:07<02:30,  4.91it/s]
 41%|████      | 510/1250 [01:07<02:12,  5.59it/s]
 41%|████      | 511/1250 [01:08<01:59,  6.19it/s]
 41%|████      | 512/1250 [01:08<01:50,  6.67it/s]
 41%|████      | 513/1250 [01:08<01:44,  7.07it/s]
 41%|████      | 514/1250 [01:08<01:39,  7.37it/s]
 41%|████      | 515/1250 [01:08<01:36,  7.61it/s]
 41%|████▏     | 516/1250 [01:08<01:34,  7.77it/s]
 41%|████▏     | 517/1250 [01:08<01:32,  7.91it/s]
 41%|████▏     | 518/1250 [01:08<01:31,  8.00it/s]
 42%|████▏     | 519/1250 [01:09<01:30,  8.06it/s]
 42%|██

[2m[36m(pid=815)[0m {'loss': 0.0005, 'learning_rate': 6.808920093945672e-06, 'epoch': 4.0}


[2m[36m(pid=815)[0m  80%|████████  | 1001/1250 [02:13<06:42,  1.62s/it]
 80%|████████  | 1002/1250 [02:13<04:50,  1.17s/it]
 80%|████████  | 1003/1250 [02:13<03:31,  1.17it/s]
 80%|████████  | 1004/1250 [02:13<02:36,  1.57it/s]
 80%|████████  | 1005/1250 [02:13<01:58,  2.07it/s]
 80%|████████  | 1006/1250 [02:13<01:31,  2.67it/s]
 81%|████████  | 1007/1250 [02:14<01:12,  3.36it/s]
 81%|████████  | 1008/1250 [02:14<00:59,  4.08it/s]
 81%|████████  | 1009/1250 [02:14<00:50,  4.81it/s]
 81%|████████  | 1010/1250 [02:14<00:43,  5.49it/s]
 81%|████████  | 1011/1250 [02:14<00:39,  6.10it/s]
 81%|████████  | 1012/1250 [02:14<00:36,  6.60it/s]
 81%|████████  | 1013/1250 [02:14<00:33,  7.00it/s]
 81%|████████  | 1014/1250 [02:14<00:32,  7.31it/s]
 81%|████████  | 1015/1250 [02:14<00:31,  7.57it/s]
 81%|████████▏ | 1016/1250 [02:15<00:30,  7.74it/s]
 81%|████████▏ | 1017/1250 [02:15<00:29,  7.88it/s]
 81%|████████▏ | 1018/1250 [02:15<00:29,  7.98it/s]
 82%|████████▏ | 1019/1250 [02:15<00:28,

[2m[36m(pid=815)[0m {'train_runtime': 164.8907, 'train_samples_per_second': 60.646, 'train_steps_per_second': 7.581, 'train_loss': 0.003877159309387207, 'epoch': 5.0}


[2m[36m(pid=815)[0m   2%|▏         | 5/250 [00:00<00:05, 42.54it/s]
  4%|▍         | 10/250 [00:00<00:06, 35.41it/s]
  6%|▌         | 14/250 [00:00<00:06, 33.75it/s]
  7%|▋         | 18/250 [00:00<00:07, 32.15it/s]
  9%|▉         | 22/250 [00:00<00:07, 31.53it/s]
 10%|█         | 26/250 [00:00<00:07, 31.66it/s]
 12%|█▏        | 30/250 [00:00<00:07, 30.55it/s]
 14%|█▎        | 34/250 [00:01<00:06, 31.15it/s]
 15%|█▌        | 38/250 [00:01<00:06, 30.99it/s]
 17%|█▋        | 42/250 [00:01<00:06, 31.69it/s]
 18%|█▊        | 46/250 [00:01<00:06, 31.73it/s]
 20%|██        | 50/250 [00:01<00:06, 31.30it/s]
 22%|██▏       | 54/250 [00:01<00:06, 31.07it/s]
 23%|██▎       | 58/250 [00:01<00:06, 31.53it/s]
 25%|██▍       | 62/250 [00:01<00:05, 31.43it/s]
 26%|██▋       | 66/250 [00:02<00:05, 31.18it/s]
 28%|██▊       | 70/250 [00:02<00:05, 31.23it/s]
 30%|██▉       | 74/250 [00:02<00:05, 31.39it/s]
 31%|███       | 78/250 [00:02<00:05, 31.17it/s]
 33%|███▎      | 82/250 [00:02<00:05, 31.48it/

Result for _objective_fb0b4_00003:
  date: 2021-07-22_17-06-32
  done: true
  epoch: 5.0
  eval_loss: 3.4166460037231445
  eval_runtime: 8.0051
  eval_samples_per_second: 249.84
  eval_steps_per_second: 31.23
  experiment_id: 6c62bf60174c4db795fcb059b94d9c33
  hostname: b2ef10663337
  iterations_since_restore: 1
  node_ip: 172.28.0.2
  objective: 3.4166460037231445
  pid: 815
  time_since_restore: 180.0498378276825
  time_this_iter_s: 180.0498378276825
  time_total_s: 180.0498378276825
  timestamp: 1626973592
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: fb0b4_00003
  
== Status ==
Memory usage on this node: 6.1/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.3 GiB heap, 0.0/3.65 GiB objects (0.0/1.0 GPU_group_0_cfb6ad648905d2d9fd8c696e4f854679, 0.0/1.0 CPU_group_cfb6ad648905d2d9fd8c696e4f854679, 0.0/1.0 accelerator_type:T4, 0.0/1.0 GPU_group_cfb6ad648905d2d9fd8c696e4f854679, 0.0/1.0 CPU_group_0_cfb6ad648905d2d9fd8c696e4f85467

[2m[36m(pid=886)[0m 2021-07-22 17:06:34.203949: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
[2m[36m(pid=886)[0m Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetForSequenceClassification: ['lm_loss.bias', 'lm_loss.weight']
[2m[36m(pid=886)[0m - This IS expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
[2m[36m(pid=886)[0m - This IS NOT expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[2m[36m(pid=886)[0m Some weights of XLNetForSequenceClassification were not initialized from the model check

[2m[36m(pid=886)[0m {'train_runtime': 74.1506, 'train_samples_per_second': 80.916, 'train_steps_per_second': 5.057, 'train_loss': 0.009768032073974609, 'epoch': 3.0}


[2m[36m(pid=886)[0m   2%|▏         | 5/250 [00:00<00:05, 41.36it/s]
  4%|▍         | 10/250 [00:00<00:06, 35.29it/s]
  6%|▌         | 14/250 [00:00<00:07, 33.51it/s]
  7%|▋         | 18/250 [00:00<00:07, 32.41it/s]
  9%|▉         | 22/250 [00:00<00:07, 31.47it/s]
 10%|█         | 26/250 [00:00<00:07, 31.89it/s]
 12%|█▏        | 30/250 [00:00<00:07, 31.08it/s]
 14%|█▎        | 34/250 [00:01<00:06, 32.04it/s]
 15%|█▌        | 38/250 [00:01<00:06, 31.35it/s]
 17%|█▋        | 42/250 [00:01<00:06, 31.73it/s]
 18%|█▊        | 46/250 [00:01<00:06, 31.37it/s]
 20%|██        | 50/250 [00:01<00:06, 32.25it/s]
 22%|██▏       | 54/250 [00:01<00:06, 31.75it/s]
 23%|██▎       | 58/250 [00:01<00:06, 31.94it/s]
 25%|██▍       | 62/250 [00:01<00:05, 31.75it/s]
 26%|██▋       | 66/250 [00:02<00:05, 31.51it/s]
 28%|██▊       | 70/250 [00:02<00:05, 31.64it/s]
 30%|██▉       | 74/250 [00:02<00:05, 31.36it/s]
 31%|███       | 78/250 [00:02<00:05, 31.38it/s]
 33%|███▎      | 82/250 [00:02<00:05, 31.23it/

Result for _objective_fb0b4_00004:
  date: 2021-07-22_17-08-05
  done: true
  epoch: 3.0
  eval_loss: 2.861640691757202
  eval_runtime: 8.1852
  eval_samples_per_second: 244.343
  eval_steps_per_second: 30.543
  experiment_id: 879c77dce45f47f0bf7ff2b7eb2f5e4c
  hostname: b2ef10663337
  iterations_since_restore: 1
  node_ip: 172.28.0.2
  objective: 2.861640691757202
  pid: 886
  time_since_restore: 89.31720423698425
  time_this_iter_s: 89.31720423698425
  time_total_s: 89.31720423698425
  timestamp: 1626973685
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: fb0b4_00004
  
== Status ==
Memory usage on this node: 6.1/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.3 GiB heap, 0.0/3.65 GiB objects (0.0/1.0 CPU_group_0_cfb6ad648905d2d9fd8c696e4f854679, 0.0/1.0 accelerator_type:T4, 0.0/1.0 GPU_group_cfb6ad648905d2d9fd8c696e4f854679, 0.0/1.0 GPU_group_0_cfb6ad648905d2d9fd8c696e4f854679, 0.0/1.0 CPU_group_cfb6ad648905d2d9fd8c696e4f85467

 Best run : `LR : 2.21697e-05 |Epochs : 3 |Batch_size : 8 `
 Error with batch size of 32
  * Due to undersampling ?? 

### Best run 

XL-Net-base

In [None]:
best_run

BestRun(run_id='fb0b4_00000', objective=2.250964879989624, hyperparameters={'learning_rate': 2.49816047538945e-05, 'num_train_epochs': 2, 'seed': 15, 'per_device_train_batch_size': 32})

XL-Net-Large

In [None]:
best_run

BestRun(run_id='84722_00003', objective=2.2981693744659424, hyperparameters={'learning_rate': 1.2323344486727979e-05, 'num_train_epochs': 2, 'per_device_train_batch_size': 32})

## Roberta

In [None]:
model = 'roberta-base'

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model,problem_type="multi_label_classification")

Downloading:   0%|          | 0.00/481 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

In [None]:
train_df = train_df.sample(2000)
val_df = val_df.sample(2000)

In [None]:
train_dataset = ExplicitStereotypeDataset(
  train_df,
  tokenizer,
  max_token_len=MAX_LEN
)

In [None]:
val_dataset = ExplicitStereotypeDataset(
  val_df,
  tokenizer,
  max_token_len=MAX_LEN
)

In [None]:
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(model, problem_type="multi_label_classification", num_labels = num_labels )

In [None]:
# Evaluate during training and a bit more often than the default to be able to prune bad trials early.
# Disabling tqdm is a matter of preference.
# batch_size = 8

training_args = TrainingArguments(
    "test", eval_steps=500, disable_tqdm=True)

trainer = Trainer(
    model_init= model_init,
    args = training_args,
    tokenizer = tokenizer,
    train_dataset=train_dataset, 
    eval_dataset=val_dataset,
)

loading configuration file https://huggingface.co/roberta-base/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/733bade19e5f0ce98e6531021dd5180994bb2f7b8bd7e80c7968805834ba351e.35205c6cfc956461d8515139f0f8dd5d207a2f336c0c3a83b4bc8dca3518e37b
Model config RobertaConfig {
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num

Downloading:   0%|          | 0.00/501M [00:00<?, ?B/s]

storing https://huggingface.co/roberta-base/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/51ba668f7ff34e7cdfa9561e8361747738113878850a7d717dbc69de8683aaad.c7efaa30a0d80b2958b876969faa180e485944a849deee4ad482332de65365a7
creating metadata file for /root/.cache/huggingface/transformers/51ba668f7ff34e7cdfa9561e8361747738113878850a7d717dbc69de8683aaad.c7efaa30a0d80b2958b876969faa180e485944a849deee4ad482332de65365a7
loading weights file https://huggingface.co/roberta-base/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/51ba668f7ff34e7cdfa9561e8361747738113878850a7d717dbc69de8683aaad.c7efaa30a0d80b2958b876969faa180e485944a849deee4ad482332de65365a7
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'roberta.pooler.dense.weight', 'lm_head.decoder.weight', 'lm_head.dense.weight', '

In [None]:
# Defaut objective is the sum of all metrics when metrics are provided, so we have to maximize it.
best_run = trainer.hyperparameter_search(n_trials=5, hp_space=my_hp_space, backend = 'ray' )

No `resources_per_trial` arg was passed into `hyperparameter_search`. Setting it to a default value of 1 CPU and 1 GPU for each trial.
2021-07-25 09:15:45,548	INFO services.py:1274 -- View the Ray dashboard at [1m[32mhttp://127.0.0.1:8265[39m[22m


== Status ==
Memory usage on this node: 3.4/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.31 GiB heap, 0.0/3.66 GiB objects (0.0/1.0 accelerator_type:T4)
Result logdir: /root/ray_results/_objective_2021-07-25_09-15-49
Number of trials: 5/5 (5 PENDING)
+------------------------+----------+-------+-----------------+--------------------+-------------------------------+--------+
| Trial name             | status   | loc   |   learning_rate |   num_train_epochs |   per_device_train_batch_size |   seed |
|------------------------+----------+-------+-----------------+--------------------+-------------------------------+--------|
| _objective_e7185_00000 | PENDING  |       |     2.49816e-05 |                  2 |                            32 |     15 |
| _objective_e7185_00001 | PENDING  |       |     4.11876e-05 |                  2 |                            16 |     39 |
| _objective_e7185_00002 | PENDING  |       |     1.62398e-05 |            

[2m[36m(pid=395)[0m 2021-07-25 09:15:50.690869: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0


== Status ==
Memory usage on this node: 4.6/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 1.0/2 CPUs, 1.0/1 GPUs, 0.0/7.31 GiB heap, 0.0/3.66 GiB objects (0.0/1.0 accelerator_type:T4)
Result logdir: /root/ray_results/_objective_2021-07-25_09-15-49
Number of trials: 5/5 (4 PENDING, 1 RUNNING)
+------------------------+----------+-------+-----------------+--------------------+-------------------------------+--------+
| Trial name             | status   | loc   |   learning_rate |   num_train_epochs |   per_device_train_batch_size |   seed |
|------------------------+----------+-------+-----------------+--------------------+-------------------------------+--------|
| _objective_e7185_00000 | RUNNING  |       |     2.49816e-05 |                  2 |                            32 |     15 |
| _objective_e7185_00001 | PENDING  |       |     4.11876e-05 |                  2 |                            16 |     39 |
| _objective_e7185_00002 | PENDING  |       |     1.62398e-0

[2m[36m(pid=395)[0m Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'roberta.pooler.dense.weight', 'lm_head.dense.bias']
[2m[36m(pid=395)[0m - This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
[2m[36m(pid=395)[0m - This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[2m[36m(pid=395)[0m Some weights of RobertaForSequenceClassification were not initialized from the model 

[2m[36m(pid=395)[0m {'train_runtime': 38.5293, 'train_samples_per_second': 103.817, 'train_steps_per_second': 3.27, 'train_loss': 0.09541043024214488, 'epoch': 2.0}
Result for _objective_e7185_00000:
  date: 2021-07-25_09-16-57
  done: true
  epoch: 2.0
  eval_loss: 0.02722325176000595
  eval_runtime: 6.4552
  eval_samples_per_second: 309.829
  eval_steps_per_second: 38.729
  experiment_id: 15b91dd4ceb4430eac838e9b4f450c9e
  hostname: a9d20365adc1
  iterations_since_restore: 1
  node_ip: 172.28.0.2
  objective: 0.02722325176000595
  pid: 395
  time_since_restore: 65.24121451377869
  time_this_iter_s: 65.24121451377869
  time_total_s: 65.24121451377869
  timestamp: 1627204617
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: e7185_00000
  
[2m[36m(pid=395)[0m {'eval_loss': 0.02722325176000595, 'eval_runtime': 6.4552, 'eval_samples_per_second': 309.829, 'eval_steps_per_second': 38.729, 'epoch': 2.0}
== Status ==
Memory usage on this node: 5.1/12.7 GiB
Using FIFO sche

[2m[36m(pid=396)[0m 2021-07-25 09:16:58.502343: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
[2m[36m(pid=396)[0m Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'roberta.pooler.dense.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight']
[2m[36m(pid=396)[0m - This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
[2m[36m(pid=396)[0m - This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClass

[2m[36m(pid=396)[0m {'train_runtime': 45.0829, 'train_samples_per_second': 88.725, 'train_steps_per_second': 5.545, 'train_loss': 0.03513337326049805, 'epoch': 2.0}
Result for _objective_e7185_00001:
  date: 2021-07-25_09-17-58
  done: true
  epoch: 2.0
  eval_loss: 0.005616203416138887
  eval_runtime: 6.8701
  eval_samples_per_second: 291.115
  eval_steps_per_second: 36.389
  experiment_id: 24c191e2ef9b457db757bb7c470bf28a
  hostname: a9d20365adc1
  iterations_since_restore: 1
  node_ip: 172.28.0.2
  objective: 0.005616203416138887
  pid: 396
  time_since_restore: 58.409374952316284
  time_this_iter_s: 58.409374952316284
  time_total_s: 58.409374952316284
  timestamp: 1627204678
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: e7185_00001
  
== Status ==
Memory usage on this node: 5.2/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.31 GiB heap, 0.0/3.66 GiB objects (0.0/1.0 GPU_group_0_154fee2e432168d0fcaf5f3577375dfa, 0.0/1.

[2m[36m(pid=507)[0m 2021-07-25 09:18:00.060052: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
[2m[36m(pid=507)[0m Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.weight']
[2m[36m(pid=507)[0m - This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
[2m[36m(pid=507)[0m - This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClass

[2m[36m(pid=507)[0m {'loss': 0.0381, 'learning_rate': 9.743868488068864e-06, 'epoch': 2.0}
[2m[36m(pid=507)[0m {'loss': 0.0026, 'learning_rate': 3.2479561626896213e-06, 'epoch': 4.0}
[2m[36m(pid=507)[0m {'train_runtime': 157.6131, 'train_samples_per_second': 63.447, 'train_steps_per_second': 7.931, 'train_loss': 0.016683567428588866, 'epoch': 5.0}
Result for _objective_e7185_00002:
  date: 2021-07-25_09-20-52
  done: true
  epoch: 5.0
  eval_loss: 0.0022873186971992254
  eval_runtime: 6.8192
  eval_samples_per_second: 293.289
  eval_steps_per_second: 36.661
  experiment_id: 80584ba295fa4bf2b438c43e517f2bc0
  hostname: a9d20365adc1
  iterations_since_restore: 1
  node_ip: 172.28.0.2
  objective: 0.0022873186971992254
  pid: 507
  time_since_restore: 170.89797496795654
  time_this_iter_s: 170.89797496795654
  time_total_s: 170.89797496795654
  timestamp: 1627204852
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: e7185_00002
  
== Status ==
Memory usage on this n

[2m[36m(pid=574)[0m 2021-07-25 09:20:54.467708: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
[2m[36m(pid=574)[0m Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.dense.weight', 'roberta.pooler.dense.weight', 'lm_head.dense.bias', 'roberta.pooler.dense.bias', 'lm_head.decoder.weight']
[2m[36m(pid=574)[0m - This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
[2m[36m(pid=574)[0m - This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClass

[2m[36m(pid=574)[0m {'loss': 0.0203, 'learning_rate': 2.0426760281837013e-05, 'epoch': 2.0}
[2m[36m(pid=574)[0m {'loss': 0.0009, 'learning_rate': 6.808920093945672e-06, 'epoch': 4.0}
[2m[36m(pid=574)[0m {'train_runtime': 158.3546, 'train_samples_per_second': 63.149, 'train_steps_per_second': 7.894, 'train_loss': 0.008609517633914947, 'epoch': 5.0}
Result for _objective_e7185_00003:
  date: 2021-07-25_09-23-49
  done: true
  epoch: 5.0
  eval_loss: 0.0005985907046124339
  eval_runtime: 6.7897
  eval_samples_per_second: 294.566
  eval_steps_per_second: 36.821
  experiment_id: 6412b88a57e04490aae001ba3bccfeea
  hostname: a9d20365adc1
  iterations_since_restore: 1
  node_ip: 172.28.0.2
  objective: 0.0005985907046124339
  pid: 574
  time_since_restore: 172.49245429039001
  time_this_iter_s: 172.49245429039001
  time_total_s: 172.49245429039001
  timestamp: 1627205029
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: e7185_00003
  
== Status ==
Memory usage on this n

[2m[36m(pid=641)[0m 2021-07-25 09:23:51.155905: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
[2m[36m(pid=641)[0m Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'roberta.pooler.dense.weight', 'lm_head.layer_norm.weight', 'roberta.pooler.dense.bias', 'lm_head.bias', 'lm_head.dense.bias']
[2m[36m(pid=641)[0m - This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
[2m[36m(pid=641)[0m - This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClass

[2m[36m(pid=641)[0m {'train_runtime': 67.4374, 'train_samples_per_second': 88.971, 'train_steps_per_second': 5.561, 'train_loss': 0.01966500473022461, 'epoch': 3.0}


2021-07-25 09:25:14,086	INFO tune.py:549 -- Total run time: 564.52 seconds (564.17 seconds for the tuning loop).


Result for _objective_e7185_00004:
  date: 2021-07-25_09-25-13
  done: true
  epoch: 3.0
  eval_loss: 0.002161722630262375
  eval_runtime: 6.7136
  eval_samples_per_second: 297.903
  eval_steps_per_second: 37.238
  experiment_id: 327050abeb1e4f948275158db5974e75
  hostname: a9d20365adc1
  iterations_since_restore: 1
  node_ip: 172.28.0.2
  objective: 0.002161722630262375
  pid: 641
  time_since_restore: 81.03456854820251
  time_this_iter_s: 81.03456854820251
  time_total_s: 81.03456854820251
  timestamp: 1627205113
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: e7185_00004
  
[2m[36m(pid=641)[0m {'eval_loss': 0.002161722630262375, 'eval_runtime': 6.7136, 'eval_samples_per_second': 297.903, 'eval_steps_per_second': 37.238, 'epoch': 3.0}
== Status ==
Memory usage on this node: 6.0/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.31 GiB heap, 0.0/3.66 GiB objects (0.0/1.0 GPU_group_154fee2e432168d0fcaf5f3577375dfa, 0.0/1.0 CPU_g

### Best run 

Roberta-base 

In [None]:
best_run 

BestRun(run_id='e7185_00003', objective=0.0005985907046124339, hyperparameters={'learning_rate': 3.404460046972836e-05, 'num_train_epochs': 5, 'seed': 22, 'per_device_train_batch_size': 8})

Roberta-large

In [None]:
best_run 

BestRun(run_id='48ed9_00003', objective=1.7053132057189941, hyperparameters={'learning_rate': 1.2323344486727979e-05, 'num_train_epochs': 2, 'per_device_train_batch_size': 32})

## GPT-2

In [56]:
model = 'gpt2'

In [58]:
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained(model)

loading file https://huggingface.co/gpt2/resolve/main/vocab.json from cache at /root/.cache/huggingface/transformers/684fe667923972fb57f6b4dcb61a3c92763ad89882f3da5da9866baf14f2d60f.c7ed1f96aac49e745788faa77ba0a26a392643a50bb388b9c04ff469e555241f
loading file https://huggingface.co/gpt2/resolve/main/merges.txt from cache at /root/.cache/huggingface/transformers/c0c761a63004025aeadd530c4c27b860ec4ecbe8a00531233de21d865a402598.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b
loading file https://huggingface.co/gpt2/resolve/main/added_tokens.json from cache at None
loading file https://huggingface.co/gpt2/resolve/main/special_tokens_map.json from cache at None
loading file https://huggingface.co/gpt2/resolve/main/tokenizer_config.json from cache at None
loading file https://huggingface.co/gpt2/resolve/main/tokenizer.json from cache at /root/.cache/huggingface/transformers/16a2f78023c8dc511294f0c97b5e10fde3ef9889ad6d11ffaa2a00714e73926e.cf2d0ecb83b6df91b3dbb53f1d1e4c311578b

In [59]:
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
# default to left padding
tokenizer.padding_side = "left"

Assigning [PAD] to the pad_token key of the tokenizer
Adding [PAD] to the vocabulary


In [60]:
train_df = train_df.sample(2000)
val_df = val_df.sample(2000)

In [61]:
train_dataset = ExplicitStereotypeDataset(
  train_df,
  tokenizer,
  max_token_len=MAX_LEN
)

In [62]:
val_dataset = ExplicitStereotypeDataset(
  val_df,
  tokenizer,
  max_token_len=MAX_LEN
)

In [63]:
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(model, num_labels = num_labels)

Creating a custom trainer

In [64]:
class MultilabelTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
      
        # resize model embedding to match new tokenizer
        model.resize_token_embeddings(len(tokenizer))

        # fix model padding token id
        model.config.pad_token_id = model.config.eos_token_id
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        loss_fct = torch.nn.BCEWithLogitsLoss()
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), 
                        labels.float().view(-1, self.model.config.num_labels))
        return (loss, outputs) if return_outputs else loss

In [65]:
# Evaluate during training and a bit more often than the default to be able to prune bad trials early.
# Disabling tqdm is a matter of preference.
# batch_size = 8

training_args = TrainingArguments(
    "test", label_names = LABEL_COLUMN, eval_steps=500, disable_tqdm=True)

trainer = MultilabelTrainer(
    model_init = model_init,
    tokenizer = tokenizer,
    train_dataset=train_dataset, 
    eval_dataset=val_dataset,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
No `TrainingArguments` passed, using `output_dir=tmp_trainer`.
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
loading configuration file https://huggingface.co/gpt2/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/fc674cd6907b4c9e933cb42d67662436b89fa9540a1f40d7c919d0109289ad01.7d2e0efa5ca20cef4fb199382111e9d3ad96fd77b849e1d4bed13a66e1336f51
Model config GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [

In [66]:
# Defaut objective is the sum of all metrics when metrics are provided, else minimize the loss 
best_run = trainer.hyperparameter_search(n_trials=5, hp_space=my_hp_space, backend='ray')

No `resources_per_trial` arg was passed into `hyperparameter_search`. Setting it to a default value of 1 CPU and 1 GPU for each trial.
2021-08-04 12:17:48,258	INFO services.py:1247 -- View the Ray dashboard at [1m[32mhttp://127.0.0.1:8265[39m[22m


== Status ==
Memory usage on this node: 3.6/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.29 GiB heap, 0.0/3.65 GiB objects (0.0/1.0 accelerator_type:T4)
Result logdir: /root/ray_results/_objective_2021-08-04_12-17-52
Number of trials: 5/5 (5 PENDING)
+------------------------+----------+-------+-----------------+--------------------+-------------------------------+--------+
| Trial name             | status   | loc   |   learning_rate |   num_train_epochs |   per_device_train_batch_size |   seed |
|------------------------+----------+-------+-----------------+--------------------+-------------------------------+--------|
| _objective_fe055_00000 | PENDING  |       |     2.49816e-05 |                  2 |                            32 |     15 |
| _objective_fe055_00001 | PENDING  |       |     4.11876e-05 |                  2 |                            16 |     39 |
| _objective_fe055_00002 | PENDING  |       |     1.62398e-05 |            

[2m[36m(pid=896)[0m 2021-08-04 12:17:54.070920: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0


== Status ==
Memory usage on this node: 4.7/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 1.0/2 CPUs, 1.0/1 GPUs, 0.0/7.29 GiB heap, 0.0/3.65 GiB objects (0.0/1.0 accelerator_type:T4)
Result logdir: /root/ray_results/_objective_2021-08-04_12-17-52
Number of trials: 5/5 (4 PENDING, 1 RUNNING)
+------------------------+----------+-------+-----------------+--------------------+-------------------------------+--------+
| Trial name             | status   | loc   |   learning_rate |   num_train_epochs |   per_device_train_batch_size |   seed |
|------------------------+----------+-------+-----------------+--------------------+-------------------------------+--------|
| _objective_fe055_00000 | RUNNING  |       |     2.49816e-05 |                  2 |                            32 |     15 |
| _objective_fe055_00001 | PENDING  |       |     4.11876e-05 |                  2 |                            16 |     39 |
| _objective_fe055_00002 | PENDING  |       |     1.62398e-0

[2m[36m(pid=896)[0m Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
[2m[36m(pid=896)[0m You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  0%|          | 0/126 [00:00<?, ?it/s]
  1%|          | 1/126 [00:00<01:58,  1.06it/s]
  2%|▏         | 2/126 [00:01<01:10,  1.77it/s]
  2%|▏         | 3/126 [00:01<00:54,  2.25it/s]
  3%|▎         | 4/126 [00:01<00:47,  2.58it/s]
  4%|▍         | 5/126 [00:02<00:43,  2.80it/s]
  5%|▍         | 6/126 [00:02<00:40,  2.94it/s]
  6%|▌         | 7/126 [00:02<00:38,  3.07it/s]
  6%|▋         | 8/126 [00:03<00:37,  3.13it/s]
  7%|▋         | 9/126 [00:03<00:36,  3.19it/s]
  8%|▊         | 10/126 [00:03<00:36,  3.21it/s]
  9%|▊         | 11/126 [00:03<00:35,  3.24it/s]
 10%|▉         | 12/126 [00:04<00:35,  3.25it/s]
 10%|█         | 13/126 [00:04<00:34,  3.26it/s]
 11%|█         | 14/126 [00:04

[2m[36m(pid=896)[0m {'train_runtime': 41.3986, 'train_samples_per_second': 96.622, 'train_steps_per_second': 3.044, 'train_loss': 0.04892412821451823, 'epoch': 2.0}


[2m[36m(pid=896)[0m   2%|▏         | 5/250 [00:00<00:06, 39.71it/s]
[2m[36m(pid=896)[0m   4%|▎         | 9/250 [00:00<00:06, 34.78it/s]
  5%|▌         | 13/250 [00:00<00:07, 33.72it/s]
  7%|▋         | 17/250 [00:00<00:07, 33.06it/s]
  8%|▊         | 21/250 [00:00<00:06, 32.88it/s]
 10%|█         | 25/250 [00:00<00:06, 32.62it/s]
 12%|█▏        | 29/250 [00:00<00:06, 32.47it/s]
 13%|█▎        | 33/250 [00:00<00:06, 32.25it/s]
 15%|█▍        | 37/250 [00:01<00:06, 32.36it/s]
 16%|█▋        | 41/250 [00:01<00:06, 32.31it/s]
 18%|█▊        | 45/250 [00:01<00:06, 32.23it/s]
 20%|█▉        | 49/250 [00:01<00:06, 32.12it/s]
 21%|██        | 53/250 [00:01<00:06, 32.06it/s]
 23%|██▎       | 57/250 [00:01<00:06, 31.83it/s]
 24%|██▍       | 61/250 [00:01<00:05, 31.62it/s]
 26%|██▌       | 65/250 [00:02<00:05, 31.55it/s]
 28%|██▊       | 69/250 [00:02<00:05, 31.43it/s]
 29%|██▉       | 73/250 [00:02<00:05, 31.31it/s]
 31%|███       | 77/250 [00:02<00:05, 31.29it/s]
 32%|███▏      | 81/250

Result for _objective_fe055_00000:
  date: 2021-08-04_12-18-52
  done: true
  epoch: 2.0
  eval_loss: 2.1328487396240234
  eval_runtime: 7.9931
  eval_samples_per_second: 250.217
  eval_steps_per_second: 31.277
  experiment_id: a0bb9cdd7aa24ba4b2acf6bae4d56bcd
  hostname: 913fa59e36d4
  iterations_since_restore: 1
  node_ip: 172.28.0.2
  objective: 2.1328487396240234
  pid: 896
  time_since_restore: 56.72576570510864
  time_this_iter_s: 56.72576570510864
  time_total_s: 56.72576570510864
  timestamp: 1628079532
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: fe055_00000
  
== Status ==
Memory usage on this node: 5.3/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.29 GiB heap, 0.0/3.65 GiB objects (0.0/1.0 GPU_group_fc9a84bca24bd4ef75fd68df50828ad9, 0.0/1.0 GPU_group_0_fc9a84bca24bd4ef75fd68df50828ad9, 0.0/1.0 CPU_group_fc9a84bca24bd4ef75fd68df50828ad9, 0.0/1.0 CPU_group_0_fc9a84bca24bd4ef75fd68df50828ad9, 0.0/1.0 accelerator_typ

[2m[36m(pid=896)[0m 100%|█████████▉| 249/250 [00:07<00:00, 31.03it/s]100%|██████████| 250/250 [00:07<00:00, 31.40it/s]
[2m[36m(pid=895)[0m 2021-08-04 12:18:53.670159: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
[2m[36m(pid=895)[0m Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
[2m[36m(pid=895)[0m You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  0%|          | 0/250 [00:00<?, ?it/s]
  0%|          | 1/250 [00:00<02:16,  1.82it/s]
  1%|          | 2/250 [00:00<01:22,  2.99it/s]
  1%|          | 3/250 [00:00<01:05,  3.79it/s]
  2%|▏         | 4/250 [00:01<00:57,  4.31it/s]
  2%|▏         | 5/250 [00:01<00:52,  4.68it/s]
  2%|▏         | 6/250 [00:01<00:49,  4.92it/s]
  3%|▎         | 7/250 [00:01<00:47,  5.11it/s]
  3%|▎         | 8/250 [00:01

[2m[36m(pid=895)[0m {'train_runtime': 46.1539, 'train_samples_per_second': 86.667, 'train_steps_per_second': 5.417, 'train_loss': 0.033798507690429684, 'epoch': 2.0}


[2m[36m(pid=895)[0m   2%|▏         | 5/250 [00:00<00:06, 40.25it/s]
  4%|▍         | 10/250 [00:00<00:07, 33.86it/s]
  6%|▌         | 14/250 [00:00<00:07, 32.55it/s]
  7%|▋         | 18/250 [00:00<00:07, 32.13it/s]
  9%|▉         | 22/250 [00:00<00:07, 31.85it/s]
 10%|█         | 26/250 [00:00<00:07, 31.77it/s]
 12%|█▏        | 30/250 [00:00<00:06, 31.70it/s]
 14%|█▎        | 34/250 [00:01<00:06, 31.58it/s]
 15%|█▌        | 38/250 [00:01<00:06, 31.69it/s]
 17%|█▋        | 42/250 [00:01<00:06, 31.65it/s]
 18%|█▊        | 46/250 [00:01<00:06, 31.63it/s]
 20%|██        | 50/250 [00:01<00:06, 31.67it/s]
 22%|██▏       | 54/250 [00:01<00:06, 31.62it/s]
 23%|██▎       | 58/250 [00:01<00:06, 31.71it/s]
 25%|██▍       | 62/250 [00:01<00:05, 31.80it/s]
 26%|██▋       | 66/250 [00:02<00:05, 31.71it/s]
 28%|██▊       | 70/250 [00:02<00:05, 31.42it/s]
 30%|██▉       | 74/250 [00:02<00:05, 31.38it/s]
 31%|███       | 78/250 [00:02<00:05, 31.35it/s]
 33%|███▎      | 82/250 [00:02<00:05, 31.44it/

Result for _objective_fe055_00001:
  date: 2021-08-04_12-19-56
  done: true
  epoch: 2.0
  eval_loss: 2.317150354385376
  eval_runtime: 7.9549
  eval_samples_per_second: 251.416
  eval_steps_per_second: 31.427
  experiment_id: a5a01ef095714879b80fd4c32a8d58a0
  hostname: 913fa59e36d4
  iterations_since_restore: 1
  node_ip: 172.28.0.2
  objective: 2.317150354385376
  pid: 895
  time_since_restore: 60.838497161865234
  time_this_iter_s: 60.838497161865234
  time_total_s: 60.838497161865234
  timestamp: 1628079596
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: fe055_00001
  
== Status ==
Memory usage on this node: 5.3/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.29 GiB heap, 0.0/3.65 GiB objects (0.0/1.0 CPU_group_0_fc9a84bca24bd4ef75fd68df50828ad9, 0.0/1.0 GPU_group_fc9a84bca24bd4ef75fd68df50828ad9, 0.0/1.0 GPU_group_0_fc9a84bca24bd4ef75fd68df50828ad9, 0.0/1.0 accelerator_type:T4, 0.0/1.0 CPU_group_fc9a84bca24bd4ef75fd68df508

[2m[36m(pid=1004)[0m 2021-08-04 12:19:57.943004: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
[2m[36m(pid=1004)[0m Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
[2m[36m(pid=1004)[0m You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  0%|          | 0/1250 [00:00<?, ?it/s]
  0%|          | 1/1250 [00:00<09:31,  2.18it/s]
  0%|          | 2/1250 [00:00<05:40,  3.67it/s]
  0%|          | 3/1250 [00:00<04:10,  4.98it/s]
  0%|          | 4/1250 [00:00<03:28,  5.98it/s]
  0%|          | 5/1250 [00:00<03:04,  6.73it/s]
  0%|          | 6/1250 [00:01<02:50,  7.28it/s]
  1%|          | 7/1250 [00:01<02:41,  7.68it/s]
  1%|          | 8/1250 [00:01<02:35,  7.99it/s]
  1%|          | 9/1250 [00:01<02:31,  8.19it/s]
  1%|          | 10/1250 [00:01<02:29,  8.30

[2m[36m(pid=1004)[0m {'loss': 0.0329, 'learning_rate': 9.743868488068864e-06, 'epoch': 2.0}


[2m[36m(pid=1004)[0m  40%|████      | 501/1250 [01:01<18:38,  1.49s/it]
 40%|████      | 502/1250 [01:02<13:29,  1.08s/it]
 40%|████      | 503/1250 [01:02<09:52,  1.26it/s]
 40%|████      | 504/1250 [01:02<07:19,  1.70it/s]
 40%|████      | 505/1250 [01:02<05:33,  2.24it/s]
 40%|████      | 506/1250 [01:02<04:16,  2.90it/s]
 41%|████      | 507/1250 [01:02<03:23,  3.64it/s]
 41%|████      | 508/1250 [01:02<02:47,  4.43it/s]
 41%|████      | 509/1250 [01:02<02:21,  5.23it/s]
 41%|████      | 510/1250 [01:02<02:03,  5.97it/s]
 41%|████      | 511/1250 [01:03<01:51,  6.61it/s]
 41%|████      | 512/1250 [01:03<01:43,  7.16it/s]
 41%|████      | 513/1250 [01:03<01:37,  7.53it/s]
 41%|████      | 514/1250 [01:03<01:34,  7.82it/s]
 41%|████      | 515/1250 [01:03<01:29,  8.19it/s]
 41%|████▏     | 516/1250 [01:03<01:27,  8.42it/s]
 41%|████▏     | 517/1250 [01:03<01:25,  8.55it/s]
 41%|████▏     | 518/1250 [01:03<01:24,  8.69it/s]
 42%|████▏     | 519/1250 [01:03<01:23,  8.75it/s]
 42%|█

[2m[36m(pid=1004)[0m {'loss': 0.0001, 'learning_rate': 3.2479561626896213e-06, 'epoch': 4.0}


[2m[36m(pid=1004)[0m  80%|████████  | 1001/1250 [02:02<06:08,  1.48s/it]
 80%|████████  | 1002/1250 [02:03<04:26,  1.08s/it]
 80%|████████  | 1003/1250 [02:03<03:14,  1.27it/s]
 80%|████████  | 1004/1250 [02:03<02:23,  1.71it/s]
 80%|████████  | 1005/1250 [02:03<01:48,  2.26it/s]
 80%|████████  | 1006/1250 [02:03<01:23,  2.92it/s]
 81%|████████  | 1007/1250 [02:03<01:06,  3.65it/s]
 81%|████████  | 1008/1250 [02:03<00:54,  4.43it/s]
 81%|████████  | 1009/1250 [02:03<00:46,  5.23it/s]
 81%|████████  | 1010/1250 [02:03<00:40,  5.95it/s]
 81%|████████  | 1011/1250 [02:04<00:36,  6.60it/s]
 81%|████████  | 1012/1250 [02:04<00:33,  7.14it/s]
 81%|████████  | 1013/1250 [02:04<00:31,  7.60it/s]
 81%|████████  | 1014/1250 [02:04<00:29,  7.96it/s]
 81%|████████  | 1015/1250 [02:04<00:28,  8.22it/s]
 81%|████████▏ | 1016/1250 [02:04<00:28,  8.26it/s]
 81%|████████▏ | 1017/1250 [02:04<00:27,  8.63it/s]
 81%|████████▏ | 1018/1250 [02:04<00:26,  8.70it/s]
 82%|████████▏ | 1019/1250 [02:04<00:26

[2m[36m(pid=1004)[0m {'train_runtime': 152.0485, 'train_samples_per_second': 65.768, 'train_steps_per_second': 8.221, 'train_loss': 0.013251647187769413, 'epoch': 5.0}


[2m[36m(pid=1004)[0m   3%|▎         | 8/250 [00:00<00:07, 34.30it/s]
  5%|▍         | 12/250 [00:00<00:07, 32.71it/s]
  6%|▋         | 16/250 [00:00<00:07, 31.95it/s]
  8%|▊         | 20/250 [00:00<00:07, 31.41it/s]
 10%|▉         | 24/250 [00:00<00:07, 31.16it/s]
 11%|█         | 28/250 [00:00<00:07, 31.39it/s]
 13%|█▎        | 32/250 [00:01<00:06, 31.44it/s]
 14%|█▍        | 36/250 [00:01<00:06, 31.36it/s]
 16%|█▌        | 40/250 [00:01<00:06, 31.23it/s]
 18%|█▊        | 44/250 [00:01<00:06, 30.96it/s]
 19%|█▉        | 48/250 [00:01<00:06, 31.00it/s]
 21%|██        | 52/250 [00:01<00:06, 31.14it/s]
 22%|██▏       | 56/250 [00:01<00:06, 31.23it/s]
 24%|██▍       | 60/250 [00:01<00:06, 31.25it/s]
 26%|██▌       | 64/250 [00:02<00:05, 31.34it/s]
 27%|██▋       | 68/250 [00:02<00:05, 31.50it/s]
 29%|██▉       | 72/250 [00:02<00:05, 31.51it/s]
 30%|███       | 76/250 [00:02<00:05, 31.45it/s]
 32%|███▏      | 80/250 [00:02<00:05, 31.43it/s]
 34%|███▎      | 84/250 [00:02<00:05, 31.58it

Result for _objective_fe055_00002:
  date: 2021-08-04_12-22-46
  done: true
  epoch: 5.0
  eval_loss: 2.656616687774658
  eval_runtime: 7.9466
  eval_samples_per_second: 251.681
  eval_steps_per_second: 31.46
  experiment_id: 7fe558e623d34ee1aa5c2848a1b2d7f0
  hostname: 913fa59e36d4
  iterations_since_restore: 1
  node_ip: 172.28.0.2
  objective: 2.656616687774658
  pid: 1004
  time_since_restore: 166.74187016487122
  time_this_iter_s: 166.74187016487122
  time_total_s: 166.74187016487122
  timestamp: 1628079766
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: fe055_00002
  
== Status ==
Memory usage on this node: 6.2/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.29 GiB heap, 0.0/3.65 GiB objects (0.0/1.0 GPU_group_0_fc9a84bca24bd4ef75fd68df50828ad9, 0.0/1.0 CPU_group_0_fc9a84bca24bd4ef75fd68df50828ad9, 0.0/1.0 GPU_group_fc9a84bca24bd4ef75fd68df50828ad9, 0.0/1.0 CPU_group_fc9a84bca24bd4ef75fd68df50828ad9, 0.0/1.0 accelerator_ty

[2m[36m(pid=1068)[0m 2021-08-04 12:22:48.045207: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
[2m[36m(pid=1068)[0m Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
[2m[36m(pid=1068)[0m You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  0%|          | 0/1250 [00:00<?, ?it/s]
  0%|          | 1/1250 [00:00<10:32,  1.98it/s]
  0%|          | 2/1250 [00:00<05:54,  3.52it/s]
  0%|          | 3/1250 [00:00<04:16,  4.85it/s]
  0%|          | 4/1250 [00:00<03:32,  5.85it/s]
  0%|          | 5/1250 [00:00<03:09,  6.57it/s]
  0%|          | 6/1250 [00:01<02:53,  7.16it/s]
  1%|          | 7/1250 [00:01<02:42,  7.67it/s]
  1%|          | 8/1250 [00:01<02:36,  7.96it/s]
  1%|          | 9/1250 [00:01<02:31,  8.19it/s]
  1%|          | 10/1250 [00:01<02:29,  8.30

[2m[36m(pid=1068)[0m {'loss': 0.0189, 'learning_rate': 2.0426760281837013e-05, 'epoch': 2.0}


 40%|████      | 501/1250 [01:01<18:05,  1.45s/it]
 40%|████      | 502/1250 [01:01<13:13,  1.06s/it]
 40%|████      | 503/1250 [01:02<09:40,  1.29it/s]
 40%|████      | 504/1250 [01:02<07:08,  1.74it/s]
 40%|████      | 505/1250 [01:02<05:24,  2.29it/s]
 40%|████      | 506/1250 [01:02<04:11,  2.96it/s]
 41%|████      | 507/1250 [01:02<03:20,  3.70it/s]
 41%|████      | 508/1250 [01:02<02:45,  4.50it/s]
 41%|████      | 509/1250 [01:02<02:20,  5.27it/s]
 41%|████      | 510/1250 [01:02<02:03,  6.01it/s]
 41%|████      | 511/1250 [01:02<01:51,  6.64it/s]
 41%|████      | 512/1250 [01:03<01:42,  7.23it/s]
 41%|████      | 513/1250 [01:03<01:36,  7.67it/s]
 41%|████      | 514/1250 [01:03<01:31,  8.02it/s]
 41%|████      | 515/1250 [01:03<01:29,  8.19it/s]
 41%|████▏     | 516/1250 [01:03<01:26,  8.48it/s]
 41%|████▏     | 517/1250 [01:03<01:25,  8.61it/s]
 41%|████▏     | 518/1250 [01:03<01:23,  8.72it/s]
 42%|████▏     | 519/1250 [01:03<01:23,  8.72it/s]
 42%|████▏     | 520/1250 [01:0

[2m[36m(pid=1068)[0m {'loss': 0.0, 'learning_rate': 6.808920093945672e-06, 'epoch': 4.0}


[2m[36m(pid=1068)[0m  80%|████████  | 1001/1250 [02:03<06:24,  1.54s/it]
 80%|████████  | 1002/1250 [02:03<04:37,  1.12s/it]
 80%|████████  | 1003/1250 [02:03<03:22,  1.22it/s]
 80%|████████  | 1004/1250 [02:03<02:29,  1.65it/s]
 80%|████████  | 1005/1250 [02:03<01:52,  2.17it/s]
 80%|████████  | 1006/1250 [02:03<01:26,  2.83it/s]
 81%|████████  | 1007/1250 [02:03<01:08,  3.55it/s]
 81%|████████  | 1008/1250 [02:03<00:55,  4.34it/s]
 81%|████████  | 1009/1250 [02:03<00:46,  5.14it/s]
 81%|████████  | 1010/1250 [02:04<00:40,  5.90it/s]
 81%|████████  | 1011/1250 [02:04<00:36,  6.56it/s]
 81%|████████  | 1012/1250 [02:04<00:33,  7.13it/s]
 81%|████████  | 1013/1250 [02:04<00:31,  7.53it/s]
 81%|████████  | 1014/1250 [02:04<00:29,  7.88it/s]
 81%|████████  | 1015/1250 [02:04<00:28,  8.15it/s]
 81%|████████▏ | 1016/1250 [02:04<00:27,  8.38it/s]
 81%|████████▏ | 1017/1250 [02:04<00:27,  8.41it/s]
 81%|████████▏ | 1018/1250 [02:04<00:26,  8.69it/s]
 82%|████████▏ | 1019/1250 [02:05<00:26

[2m[36m(pid=1068)[0m {'train_runtime': 152.3498, 'train_samples_per_second': 65.638, 'train_steps_per_second': 8.205, 'train_loss': 0.00759494957998395, 'epoch': 5.0}


[2m[36m(pid=1068)[0m   4%|▎         | 9/250 [00:00<00:07, 34.23it/s]
  5%|▌         | 13/250 [00:00<00:07, 32.44it/s]
  7%|▋         | 17/250 [00:00<00:07, 31.74it/s]
  8%|▊         | 21/250 [00:00<00:07, 31.74it/s]
 10%|█         | 25/250 [00:00<00:07, 31.38it/s]
 12%|█▏        | 29/250 [00:00<00:07, 31.07it/s]
 13%|█▎        | 33/250 [00:01<00:06, 31.13it/s]
 15%|█▍        | 37/250 [00:01<00:06, 31.27it/s]
 16%|█▋        | 41/250 [00:01<00:06, 31.00it/s]
 18%|█▊        | 45/250 [00:01<00:06, 30.66it/s]
 20%|█▉        | 49/250 [00:01<00:06, 30.80it/s]
 21%|██        | 53/250 [00:01<00:06, 30.91it/s]
 23%|██▎       | 57/250 [00:01<00:06, 30.93it/s]
 24%|██▍       | 61/250 [00:01<00:06, 30.73it/s]
 26%|██▌       | 65/250 [00:02<00:05, 30.91it/s]
 28%|██▊       | 69/250 [00:02<00:05, 30.95it/s]
 29%|██▉       | 73/250 [00:02<00:05, 30.89it/s]
 31%|███       | 77/250 [00:02<00:05, 30.78it/s]
 32%|███▏      | 81/250 [00:02<00:05, 30.88it/s]
 34%|███▍      | 85/250 [00:02<00:05, 30.76it

Result for _objective_fe055_00003:
  date: 2021-08-04_12-25-38
  done: true
  epoch: 5.0
  eval_loss: 3.2309682369232178
  eval_runtime: 8.0953
  eval_samples_per_second: 247.058
  eval_steps_per_second: 30.882
  experiment_id: c8a2b8076558453bb089cc7d9f09faf3
  hostname: 913fa59e36d4
  iterations_since_restore: 1
  node_ip: 172.28.0.2
  objective: 3.2309682369232178
  pid: 1068
  time_since_restore: 167.9492471218109
  time_this_iter_s: 167.9492471218109
  time_total_s: 167.9492471218109
  timestamp: 1628079938
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: fe055_00003
  
== Status ==
Memory usage on this node: 6.2/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.29 GiB heap, 0.0/3.65 GiB objects (0.0/1.0 GPU_group_0_fc9a84bca24bd4ef75fd68df50828ad9, 0.0/1.0 CPU_group_0_fc9a84bca24bd4ef75fd68df50828ad9, 0.0/1.0 CPU_group_fc9a84bca24bd4ef75fd68df50828ad9, 0.0/1.0 accelerator_type:T4, 0.0/1.0 GPU_group_fc9a84bca24bd4ef75fd68df508

[2m[36m(pid=1068)[0m 100%|█████████▉| 249/250 [00:08<00:00, 31.05it/s]100%|██████████| 250/250 [00:08<00:00, 31.01it/s]
[2m[36m(pid=1134)[0m 2021-08-04 12:25:39.950118: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
[2m[36m(pid=1134)[0m Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
[2m[36m(pid=1134)[0m You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  0%|          | 0/375 [00:00<?, ?it/s]
  0%|          | 1/375 [00:00<03:22,  1.84it/s]
  1%|          | 2/375 [00:00<02:04,  3.00it/s]
  1%|          | 3/375 [00:00<01:37,  3.81it/s]
  1%|          | 4/375 [00:01<01:25,  4.33it/s]
  1%|▏         | 5/375 [00:01<01:18,  4.70it/s]
  2%|▏         | 6/375 [00:01<01:14,  4.95it/s]
  2%|▏         | 7/375 [00:01<01:12,  5.11it/s]
  2%|▏         | 8/375 [0

[2m[36m(pid=1134)[0m {'train_runtime': 68.815, 'train_samples_per_second': 87.19, 'train_steps_per_second': 5.449, 'train_loss': 0.022163398742675783, 'epoch': 3.0}


[2m[36m(pid=1134)[0m   2%|▏         | 5/250 [00:00<00:06, 39.36it/s]
[2m[36m(pid=1134)[0m   4%|▎         | 9/250 [00:00<00:06, 34.44it/s]
  5%|▌         | 13/250 [00:00<00:07, 32.97it/s]
  7%|▋         | 17/250 [00:00<00:07, 32.06it/s]
  8%|▊         | 21/250 [00:00<00:07, 31.94it/s]
 10%|█         | 25/250 [00:00<00:07, 31.43it/s]
 12%|█▏        | 29/250 [00:00<00:07, 31.18it/s]
 13%|█▎        | 33/250 [00:01<00:06, 31.37it/s]
 15%|█▍        | 37/250 [00:01<00:06, 31.24it/s]
 16%|█▋        | 41/250 [00:01<00:06, 31.36it/s]
 18%|█▊        | 45/250 [00:01<00:06, 31.61it/s]
 20%|█▉        | 49/250 [00:01<00:06, 31.43it/s]
 21%|██        | 53/250 [00:01<00:06, 31.15it/s]
 23%|██▎       | 57/250 [00:01<00:06, 31.26it/s]
 24%|██▍       | 61/250 [00:01<00:06, 31.02it/s]
 26%|██▌       | 65/250 [00:02<00:05, 31.14it/s]
 28%|██▊       | 69/250 [00:02<00:05, 31.22it/s]
 29%|██▉       | 73/250 [00:02<00:05, 31.25it/s]
 31%|███       | 77/250 [00:02<00:05, 31.14it/s]
 32%|███▏      | 81/2

Result for _objective_fe055_00004:
  date: 2021-08-04_12-27-06
  done: true
  epoch: 3.0
  eval_loss: 2.687011480331421
  eval_runtime: 7.9798
  eval_samples_per_second: 250.634
  eval_steps_per_second: 31.329
  experiment_id: aad4237b3650473ba801fb95f837353a
  hostname: 913fa59e36d4
  iterations_since_restore: 1
  node_ip: 172.28.0.2
  objective: 2.687011480331421
  pid: 1134
  time_since_restore: 84.83893203735352
  time_this_iter_s: 84.83893203735352
  time_total_s: 84.83893203735352
  timestamp: 1628080026
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: fe055_00004
  
== Status ==
Memory usage on this node: 6.3/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.29 GiB heap, 0.0/3.65 GiB objects (0.0/1.0 CPU_group_0_fc9a84bca24bd4ef75fd68df50828ad9, 0.0/1.0 GPU_group_fc9a84bca24bd4ef75fd68df50828ad9, 0.0/1.0 accelerator_type:T4, 0.0/1.0 CPU_group_fc9a84bca24bd4ef75fd68df50828ad9, 0.0/1.0 GPU_group_0_fc9a84bca24bd4ef75fd68df50828

### Best run 

In [67]:
best_run 

BestRun(run_id='fe055_00000', objective=2.1328487396240234, hyperparameters={'learning_rate': 2.49816047538945e-05, 'num_train_epochs': 2, 'seed': 15, 'per_device_train_batch_size': 32})

# Best_run compilation 

In [95]:
h_params_compiled = pd.DataFrame({'model_name': ['roberta-base','xlnet-base-cased','bert-base-uncased','gpt2'],'learning_rate': ['3.404460046972836e-05','2.49816047538945e-05','2.49816047538945e-05','2.49816047538945e-05'], 'num_train_epochs': [5,2,2,2], 'seed': [22,15,15,15], 'per_device_train_batch_size': [8,32,32,32]})

In [96]:
h_params_compiled

Unnamed: 0,model_name,learning_rate,num_train_epochs,seed,per_device_train_batch_size
0,roberta-base,3.404460046972836e-05,5,22,8
1,xlnet-base-cased,2.49816047538945e-05,2,15,32
2,bert-base-uncased,2.49816047538945e-05,2,15,32
3,gpt2,2.49816047538945e-05,2,15,32


In [97]:
h_params_compiled.to_csv('h_params_compiled.csv')

# Class imbalance handling methods

Link 1 :
https://www.analyticsvidhya.com/blog/2020/07/10-techniques-to-deal-with-class-imbalance-in-machine-learning/

Link 2 : https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/

What?
  * Imbalance is most common problem
  * Class1 - 80 samples
  * Class2 - 20 samples 

Accuracy Paradox:
  * Accuracy metric may reflect the underlying class distribution.
    * Just predict class 1 irrespective of the input due to its class distribution.
    * Accuracy = `(80/100)*100 = 80%` 
    * But the model didnot learn anything.


Strategies:

1. Collect more data
2. Change performance metric:
  * Confusion matrix : Breaking the predictions into
    * Correct predictions:
      * True positive 
      * True Negative
    * Incorrect predictions:
      * False positive
      * False negative 
  * Precision : 
    * **correct positive prediction** out of **total positive predictions** (correct and incorrect).
  * Recall (sensitivity/TPR) : 
    * **Identified correct positive** predictions out of **total positive class in the dataset**.  
  * F1 score : 
    * Weighted average of precision and recall.
  * Kappa score:
    * Classification score normalized by the imbalance of classes in data.
    * Range from -1/0 - 1(perfect) 
  * ROC curve : 
    * TP (sensitivity) plotted against FP (1 – specificity) for each threshold used.
    * Useful for threshold selection 
      * Selecting threshold based on the dataset 
      * e.g.: Cancer screening : 
          * High FP along with TP is fine, as it is important to identify sufferers than having false negative.
    * ROC_AUC score : Gives performance of classifier over entire operating range.
    * Classifier comparison : Compare two models using ROC_AUC score. 
3. Resampling data 
  * Over-sampling:
      * Add copies from under-represented class.
      * Algorithms:
        * SMOTE(Synthetic minority over sampling technique)
          * Compute k-NN from minority class and impute.
        * Random over-sampling
      * Dis-advantage:
        * Impact generalization and may overfit the data.
  * Under-sampling:
    * Delete copies from over-represented class.
    * Algorithms
      * NearMiss
      * Random under-sampling
    * Dis-advantage:
      * May loose important information 
  * Points:
    * Consider testing random split and non-random (e.g. stratified) splits.
4. Different ML model:
  * Decision trees 
    * CART
    * Random forest
5. Penalized models:
  * Impose additional cost when predicting minority class to pay more attention.
    * Train model with class weights 
      * What are class weights ??
        * Different weights are given accordingly to the minority and majority classes which penalizes the misclassification during training according to the weights taking imbalance into consideration.
        * More weightage to minority and less to majority class.
        * In scikit learn when `class_weights = balanced`, the model assigns the **class weights inversely proportional to their respective frequencies**.
          `wj=n_samples / (n_classes * n_samplesj)`
        * Apply the weights to the weighted loss/cost function.
        * Results in the weighted loss (more error value to the minority and less error value to the majority class)
        * Correspondingly, the model coefficients/ hyper-parameters are adjusted w.r.t weighted loss.
    * Link : https://www.analyticsvidhya.com/blog/2020/10/improve-class-imbalance-class-weights/
  * Focal loss for multi-class imbalanced data 
    * Link : https://www.dlology.com/blog/multi-class-classification-with-focal-loss-for-imbalanced-datasets/

6. Different problem
  * Anamoly detection
    * One-class classifier 
  * Change detection 


