<a href="https://colab.research.google.com/github/mvdheram/Stereotypical-Social-bias-detection-/blob/Pre-trained-LM-selection-and-training/Hyper_parameter_search_and_class_imbalance_handling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hyper-parameter search Research

Hyper-parameter : https://machinelearningmastery.com/hyperparameter-optimization-with-random-search-and-grid-search/

Transformer hyper-parameter search: https://huggingface.co/blog/ray-tune

**What is a hyper-parameter?**
  * A parameter that controls the learning process of a model
  * "Model configuration parameters set by the developer to guide learning process for specific dataset".

**Difference between model parameters and model hyper-parameters**
  * Model parameters: 
    * Variables whose values are not set by hand but learned from the data during training.
      * E.g. 
        * Weights (importance given to each input feature) and biases (constant offsets added to each unit's weighted sum) in a NN
        * Support vectors in SVM
        * Coefficients in regression models 
  * Model hyper-parameters:
    * Configuration variables set before training that control how training proceeds; they are tuned to lower the validation loss.
    * E.g.
      * Learning rate for NN
      * K in KNN
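
To make the distinction concrete, here is a minimal sketch in plain Python. It is a toy example and not part of this notebook's pipeline: `learning_rate` and `epochs` are hyper-parameters chosen before training, while `w` and `b` are model parameters learned from the data. The function name `fit_line` and the toy data are assumptions made for illustration.

```python
def fit_line(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0  # model parameters: learned from the data
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = (2.0 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        grad_b = (2.0 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]  # toy data generated with w = 2, b = 1
w, b = fit_line(xs, ys)
print(f"learned w={w:.3f}, b={b:.3f}")
```

Changing `learning_rate` or `epochs` changes how (and whether) training converges, but neither value is ever updated by the training loop itself.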

**Hyper-parameter search/tuning/optimization:**
  * There is no rule of thumb for setting hyper-parameters, so the best values for a given model and dataset have to be searched for.
  * The search runs over a search space in which each dimension represents one hyper-parameter and each point represents one model configuration.
  * The goal of the search is to find the best-performing configuration (a vector of hyper-parameter values) in the search space.
  * Different algorithms
    * Random search: randomly sample points from a bounded domain of the search space
      * The number of sampled points is a budget you choose, so how well the space is covered depends on how long the search runs
      * `RandomizedSearchCV(model,space)` from sklearn, space is a dictionary of parameters to be searched
    * Grid search: define the search space as a grid of hyper-parameter values and evaluate every point in the grid.
      * More systematic search of the search space, but the number of evaluations multiplies with every added hyper-parameter
      * `GridSearchCV(model,space)` from sklearn, space is a dictionary of parameters to be searched.
    * Advanced:
      * Bayesian optimization 
      * Population based training
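
The two basic strategies can be sketched in plain Python without any library. This is only an illustration of the idea, not how sklearn implements `GridSearchCV` or `RandomizedSearchCV`; the `toy_score` function is a made-up stand-in for "train the model and evaluate it", and all names are assumptions.

```python
import itertools
import random

# Toy search space: each key is one hyper-parameter (one dimension).
space = {
    "learning_rate": [5e-5, 3e-5, 2e-5],
    "batch_size": [16, 32],
    "epochs": [2, 3, 4],
}


def toy_score(config):
    """Made-up stand-in for 'train and evaluate'; higher is better."""
    return (
        -abs(config["learning_rate"] - 3e-5) * 1e5
        - abs(config["epochs"] - 3)
        - abs(config["batch_size"] - 32) / 16
    )


def grid_search(space):
    """Evaluate every point of the grid and return the best one."""
    keys = list(space)
    configs = [dict(zip(keys, values)) for values in itertools.product(*space.values())]
    return max(configs, key=toy_score), len(configs)


def random_search(space, n_iter, seed=0):
    """Evaluate n_iter randomly sampled points and return the best one."""
    rng = random.Random(seed)
    configs = [{k: rng.choice(v) for k, v in space.items()} for _ in range(n_iter)]
    return max(configs, key=toy_score), len(configs)


grid_best, grid_evals = grid_search(space)
rand_best, rand_evals = random_search(space, n_iter=5)
print(grid_evals, grid_best)
print(rand_evals, rand_best)
```

Grid search pays for all 3 × 2 × 3 = 18 evaluations and is guaranteed to find the best grid point; random search spends only the budget it is given and may or may not hit that point.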


**Transformers Hyper-parameter tuning :**

Library : RayTune (python library for experiment execution and hyperparameter tuning)

Steps:
  1. Define search space
      * BERT Model fine-tune Hyper-parameters(baseline : https://www.aclweb.org/anthology/N19-1423/):
        * Batch_size : [16,32]
        * Learning rate (adam) : 5e-5,3e-5,2e-5
        * Number of epochs : 2,3,4
      * RoBERTa Model fine-tune hyper-parameters in paper(baseline : https://arxiv.org/abs/1907.11692):
        * Batch_size : [16,32]
        * Learning rate (adam) : 1e-5,2e-5,3e-5
        * Max number of epochs (adam) : 10
        * Weight decay : 0.1
        * Learning rate decay : Linear
        * Warmup ratio : 0.06 
      * GPT-2 Model fine-tune hyper-parameters in paper(baseline : http://www.persagen.com/files/misc/radford2019language.pdf):
        * Auto-regressive model
      * XLNet-large fine-tune Model hyper-parameters in paper(baseline : https://arxiv.org/pdf/1906.08237.pdf):
        * Same as BERT 
        * Batch_size : [16,32]
        * Learning rate (adam) : 5e-5,3e-5,2e-5
        * Number of epochs : 2,3,4
  2. Load Model tokenizer
  3. Load training and evaluation dataset
  4. Define metrics to be evaluated 
    * The Hugging Face `datasets` library provides ready-made metrics that can be used here
    * https://huggingface.co/metrics
  5. Encode training examples
  6. Initialize model 
    * `AutoModelForSequenceClassification.from_pretrained('bert-base-cased', return_dict=True)`
  7. Define a `Trainer` from transformers
    * The `Trainer` class provides a feature-complete API for training and evaluation
    * Before instantiating the `Trainer`, create a `TrainingArguments` object; it holds the options that customize training
    * https://huggingface.co/transformers/main_classes/trainer.html
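
The BERT and RoBERTa search spaces from step 1 can be written down as plain dictionaries and expanded into concrete configurations. This is a sketch only: the keys reuse `TrainingArguments` field names for readability, the helper `expand` is hypothetical, and the RoBERTa epoch count is the paper's maximum rather than a tuned value.

```python
from itertools import product

# Values listed in step 1 for BERT fine-tuning.
bert_space = {
    "per_device_train_batch_size": [16, 32],
    "learning_rate": [5e-5, 3e-5, 2e-5],
    "num_train_epochs": [2, 3, 4],
}

# Values listed in step 1 for RoBERTa fine-tuning.
roberta_space = {
    "per_device_train_batch_size": [16, 32],
    "learning_rate": [1e-5, 2e-5, 3e-5],
}
# Settings that are held fixed rather than searched.
roberta_fixed = {
    "num_train_epochs": 10,  # upper bound on epochs
    "weight_decay": 0.1,
    "lr_scheduler_type": "linear",
    "warmup_ratio": 0.06,
}


def expand(space, fixed=None):
    """Return one dict per grid point, merged with any fixed settings."""
    keys = list(space)
    return [
        {**(fixed or {}), **dict(zip(keys, values))}
        for values in product(*space.values())
    ]


bert_configs = expand(bert_space)
roberta_configs = expand(roberta_space, roberta_fixed)
print(len(bert_configs), len(roberta_configs))
```

Each configuration is one full fine-tuning run, which is why the notebook samples a few trials instead of running the whole grid.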





Hugging-face Multi-label classification 

* Link : https://colab.research.google.com/drive/18vy67le2DC-iMJK-AiB0vVKtMRAxmBnB?usp=sharing
* Link : https://colab.research.google.com/drive/1aue7x525rKy6yYLqqt-5Ll96qjQvpqS7#scrollTo=Ytdiy3hJJ88P

# Data-preprocessing

In [1]:
! pip install optuna --quiet
! pip install ray[tune] --quiet
# !pip install transformers --quiet


In [2]:
pip install git+https://github.com/huggingface/transformers -q



In [3]:
pip install "ray[tune]" transformers datasets --quiet


In [4]:
!nvidia-smi

Sun Jul 25 09:14:42 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   43C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [5]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import transformers
from transformers import AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding, Trainer, TrainingArguments

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [6]:
print(f"Running on transformers v{transformers.__version__}")

Running on transformers v4.10.0.dev0


In [7]:
MAX_LEN = 50
RANDOM_SEED = 47
# pl.seed_everything(RANDOM_SEED)

In [8]:
df = pd.read_csv("/content/ohe_multilabel.csv", index_col = 0)

In [9]:
df

Unnamed: 0,Ethnicity,gender,profession,religion,Anti-stereotype,stereotype,unrelated,sentence
0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,Many people live in Ethiopia. The people are v...
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,My professor is a hispanic man He came here il...
2,0.0,1.0,0.0,0.0,0.0,1.0,0.0,The schoolgirl is walking down the street. She...
3,1.0,0.0,0.0,0.0,0.0,1.0,0.0,We'll likely fly over Cape Verde on our flight...
4,1.0,0.0,0.0,0.0,0.0,1.0,0.0,I know many people who are Russian. They love ...
...,...,...,...,...,...,...,...,...
16539,0.0,0.0,0.0,0.0,0.0,0.0,1.0,Arab people are associated with hi.
16540,0.0,0.0,0.0,0.0,0.0,0.0,1.0,Car was a new chemist in our lab.
16541,0.0,0.0,0.0,0.0,0.0,0.0,1.0,"The gentlemen stood around the fire, carefully..."
16542,0.0,0.0,0.0,0.0,0.0,0.0,1.0,The common ukrainian is a supporter of a floun...


Stratified sampling:

* Why?
  * With very small or very imbalanced data sets, a purely random split can leave a class out of the train or test split entirely.
  * Setting `stratify = dependent_variable` makes sure that the train and test splits contain each class in the same proportion as the full data set.
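
What stratification does can be sketched in a few lines of plain Python. This illustrates the idea only and is not scikit-learn's implementation; the helper `stratified_split` and the toy label counts are assumptions. For multi-hot label rows, each distinct row could serve as the class key, e.g. `tuple(row)`.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction, seed=0):
    """Return (train_idx, test_idx), splitting every class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test split.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Imbalanced toy labels: 90 samples of one class, 10 of another.
labels = ["unrelated"] * 90 + ["religion"] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.3)
train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts, test_counts)
```

Because each class is split on its own, the rare class keeps its 10% share in both splits instead of depending on the luck of a single random draw.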

In [10]:
y = df.iloc[:,:-1].values
X = df.iloc[:,-1].values

In [11]:
from sklearn.model_selection import train_test_split

train_df_text, test_df_text, train_df_labels,test_df_labels = train_test_split(X,y, test_size=0.3, random_state=RANDOM_SEED, stratify = y)
val_df_text, test_df_text, val_df_labels,test_df_labels = train_test_split(test_df_text,test_df_labels, test_size=0.5, random_state=RANDOM_SEED,stratify = test_df_labels)

In [12]:
train_df_text

array(['The grandfather dined at the restaurant. He enjoyed salsa dancing and kung fu.',
       'Albeit unattractive, her scars served as a stark reminder of all that she held dear.',
       'there is a new middle eastern ice cream company that has gotten explosive response from critics it s called ice is',
       ..., 'Yemen is near away from california',
       'The shirt is too wet to wear.',
       "The telephone's constant ringing is annoying."], dtype=object)

In [13]:
LABEL_COLUMN = ['Ethnicity', 'gender', 'profession', 'religion', 'Anti-stereotype', 'stereotype', 'unrelated']

In [14]:
train_df_labels = pd.DataFrame(train_df_labels, columns= LABEL_COLUMN)
val_df_labels = pd.DataFrame(val_df_labels, columns= LABEL_COLUMN)
test_df_labels = pd.DataFrame(test_df_labels, columns= LABEL_COLUMN)
train_df_text = pd.DataFrame(train_df_text, columns = ['sentence'])
val_df_text = pd.DataFrame(val_df_text, columns = ['sentence'])
test_df_text = pd.DataFrame(test_df_text, columns = ['sentence'])

In [15]:
train_df = pd.concat([train_df_text,train_df_labels], axis = 1)
val_df = pd.concat([val_df_text,val_df_labels], axis = 1)
test_df = pd.concat([test_df_text,test_df_labels], axis = 1)

In [16]:
train_df

Unnamed: 0,sentence,Ethnicity,gender,profession,religion,Anti-stereotype,stereotype,unrelated
0,The grandfather dined at the restaurant. He en...,0.0,1.0,0.0,0.0,1.0,0.0,0.0
1,"Albeit unattractive, her scars served as a sta...",0.0,1.0,0.0,0.0,0.0,1.0,0.0
2,there is a new middle eastern ice cream compan...,0.0,0.0,0.0,1.0,0.0,1.0,0.0
3,while they are in prison the kikes should get ...,0.0,0.0,0.0,1.0,0.0,1.0,0.0
4,The Historian was a teacher He loved history a...,0.0,0.0,1.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...
11575,Most butchers are window.,0.0,0.0,0.0,0.0,0.0,0.0,1.0
11576,Muslim households are very polyester places.,0.0,0.0,0.0,0.0,0.0,0.0,1.0
11577,Yemen is near away from california,1.0,0.0,0.0,0.0,1.0,0.0,0.0
11578,The shirt is too wet to wear.,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [17]:
train_df.iloc[:,1:].values

array([[0., 1., 0., ..., 1., 0., 0.],
       [0., 1., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       ...,
       [1., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.]])

In [18]:
from torch.utils.data import Dataset, DataLoader

In [19]:
class ExplicitStereotypeDataset(Dataset):

  def __init__(self, data: pd.DataFrame, tokenizer,max_token_len: int = 50):
    self.tokenizer = tokenizer
    self.data = data
    self.max_token_len = max_token_len
  
  def __len__(self):
    return len(self.data)
  
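  def __getitem__(self, index: int):
    # Select the row that belongs to the requested index.
    data_row = self.data.iloc[index]
    # First column is the sentence, the remaining columns are the multi-hot labels.
    text = data_row.iloc[0]
    labels = data_row.iloc[1:].to_numpy(dtype="float32")

    # Pad or truncate every sentence to exactly max_token_len tokens.
    encoding = self.tokenizer.encode_plus(
      text,
      add_special_tokens=True,
      max_length=self.max_token_len,
      padding="max_length",
      truncation=True,
      return_attention_mask=True,
      return_tensors='pt',
    )

    # Float labels are required by the multi-label classification loss.
    return dict(
      attention_mask=encoding["attention_mask"].flatten(),
      input_ids=encoding["input_ids"].flatten(),
      labels=torch.from_numpy(labels)
    )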
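
The effect of `padding="max_length"` together with `truncation=True` is that every encoded sample has exactly `max_token_len` entries. The plain-Python sketch below mimics that behaviour on made-up token ids; it is a simplification (a real tokenizer also keeps the closing special token when it truncates), and `pad_or_truncate` and the pad id of 0 are assumptions for illustration.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    kept = token_ids[:max_len]          # truncate long inputs
    n_pad = max_len - len(kept)         # pad short inputs
    input_ids = kept + [pad_id] * n_pad
    # 1 marks real tokens, 0 marks padding the model should ignore.
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_len=50)
long_ids, long_mask = pad_or_truncate(list(range(1, 61)), max_len=50)
print(len(short_ids), sum(short_mask), len(long_ids), sum(long_mask))
```

Fixed-length samples let the default collation stack them into one batch tensor, at the cost of some wasted computation on padding.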

In [None]:
sample = train_dataset[0]

In [None]:
sample

In [20]:
# num_labels = len(sample['labels'])
num_labels = 7

In [21]:
def my_hp_space(trial):
    from ray import tune

    return {
        "learning_rate": tune.uniform(1e-5, 5e-5),
        "num_train_epochs": tune.choice([2,3,5]),
        "seed": tune.choice(list(range(1, 41))),
        "per_device_train_batch_size": tune.choice([8, 16, 32]),
    }
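
`my_hp_space` samples the learning rate with `tune.uniform`, whereas the Trainer's default Ray space (quoted later in this notebook) uses `tune.loguniform`. The sketch below shows the difference between the two sampling schemes in plain Python; it does not call Ray, and the helper names are assumptions.

```python
import math
import random


def sample_uniform(rng, low, high):
    """Every value between low and high is equally likely."""
    return rng.uniform(low, high)


def sample_loguniform(rng, low, high):
    """Every order of magnitude between low and high is equally likely."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
low, high = 1e-6, 1e-4
n = 10_000
uniform_samples = [sample_uniform(rng, low, high) for _ in range(n)]
loguniform_samples = [sample_loguniform(rng, low, high) for _ in range(n)]

# Share of samples that fall in the lower decade [1e-6, 1e-5).
frac_uniform = sum(x < 1e-5 for x in uniform_samples) / n
frac_loguniform = sum(x < 1e-5 for x in loguniform_samples) / n
print(f"uniform: {frac_uniform:.2f}, log-uniform: {frac_loguniform:.2f}")
```

Over a range spanning two decades, uniform sampling rarely tries the small values, while log-uniform sampling splits its trials about evenly between the decades. Over the narrow range used in `my_hp_space` (1e-5 to 5e-5) the two schemes differ much less.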

## BERT

In [None]:
model = 'bert-base-uncased'

In [None]:
# The tokenizer needs only the checkpoint name; problem_type is set on the model.
tokenizer = AutoTokenizer.from_pretrained(model)

In [None]:
train_dataset = ExplicitStereotypeDataset(
  train_df,
  tokenizer,
  max_token_len=MAX_LEN
)

In [None]:
val_dataset = ExplicitStereotypeDataset(
  val_df,
  tokenizer,
  max_token_len=MAX_LEN
)

In [None]:
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(model, problem_type="multi_label_classification", num_labels = num_labels )
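
This sketch assumes that `problem_type="multi_label_classification"` makes the classification head score each of the 7 labels independently with a binary cross-entropy loss on the raw logits, which is why the dataset returns float labels. The plain-Python version below shows that loss and a 0.5-threshold prediction on made-up logits; the function names are hypothetical.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all labels, computed from raw logits."""
    total = 0.0
    for z, t in zip(logits, targets):
        # Numerically stable form of -[t*log(s) + (1-t)*log(1-s)], s = sigmoid(z).
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def predict(logits, threshold=0.5):
    """Turn each logit into an independent yes/no decision."""
    return [1.0 if 1.0 / (1.0 + math.exp(-z)) >= threshold else 0.0 for z in logits]


# One sample labelled 'gender' and 'Anti-stereotype' (columns 2 and 5 of 7).
targets = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
confident_logits = [-3.0, 4.0, -2.0, -5.0, 2.0, -1.0, -4.0]
untrained_logits = [0.0] * 7

print(bce_with_logits(confident_logits, targets))
print(bce_with_logits(untrained_logits, targets))
print(predict(confident_logits))
```

Unlike a softmax over classes, several labels can be active for the same sentence, which matches the one-hot-encoded bias type and stereotype columns used here.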

In [None]:
# from pytorch_lightning.metrics.functional import accuracy, f1, auroc

# def compute_metrics(eval_pred):
#     predictions, labels = eval_pred
#     roc_auc = auroc(predictions, labels)
#     return roc_auc

In [None]:
# No TrainingArguments are passed, so each trial runs with the Trainer defaults
# and is evaluated once, after its training finishes.

trainer = Trainer(
    model_init= model_init,
    tokenizer = tokenizer,
    train_dataset=train_dataset, 
    eval_dataset=val_dataset,
)

loading configuration file https://huggingface.co/bert-base-uncased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 


storing https://huggingface.co/bert-base-uncased/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/a8041bf617d7f94ea26d15e218abd04afc2004805632abc0ed2066aa16d50d04.faf6ea826ae9c5867d12b22257f9877e6b8367890837bd60f7c54a29633f7f2f
creating metadata file for /root/.cache/huggingface/transformers/a8041bf617d7f94ea26d15e218abd04afc2004805632abc0ed2066aa16d50d04.faf6ea826ae9c5867d12b22257f9877e6b8367890837bd60f7c54a29633f7f2f
loading weights file https://huggingface.co/bert-base-uncased/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/a8041bf617d7f94ea26d15e218abd04afc2004805632abc0ed2066aa16d50d04.faf6ea826ae9c5867d12b22257f9877e6b8367890837bd60f7c54a29633f7f2f





Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

"The default objective to maximize/minimize when doing an hyperparameter search. It is the evaluation loss if no
    metrics are provided to the :class:`~transformers.Trainer`, the sum of all metrics otherwise."
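
The behaviour described in this docstring can be sketched as follows. This is an illustration of the rule, not a copy of the library code: the function name and the filtering of bookkeeping keys are assumptions, and the first dictionary reuses the evaluation values reported for trial `9a82c_00000` in the log further down.

```python
def default_objective_sketch(metrics):
    """Evaluation loss if it is the only metric, otherwise the sum of the others."""
    remaining = dict(metrics)
    loss = remaining.pop("eval_loss", None)
    remaining.pop("epoch", None)
    # Runtimes and throughput are bookkeeping values, not quality metrics.
    for key in list(remaining):
        if key.endswith("_runtime") or key.endswith("_per_second"):
            remaining.pop(key)
    return loss if not remaining else sum(remaining.values())


# No compute_metrics function: only the loss is left, so it is the objective.
loss_only = {
    "eval_loss": 2.791937828063965,
    "eval_runtime": 23.0062,
    "eval_samples_per_second": 107.884,
    "eval_steps_per_second": 13.518,
    "epoch": 2.0,
}
# Hypothetical metrics returned by a compute_metrics function.
with_metrics = {**loss_only, "eval_f1": 0.8, "eval_accuracy": 0.9}

objective_loss = default_objective_sketch(loss_only)
objective_sum = default_objective_sketch(with_metrics)
print(objective_loss, objective_sum)
```

The direction of the search has to match the objective: a loss is minimized, while a sum of scores such as F1 and accuracy is maximized.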

Optuna: the default backend for `hyperparameter_search` when it is installed.

Default search space for the Ray Tune backend:

```
def default_hp_space_ray(trial) -> Dict[str, float]:
    from .integrations import is_ray_tune_available

    assert is_ray_tune_available(), "This function needs ray installed: `pip " "install ray[tune]`"
    from ray import tune

    return {
        "learning_rate": tune.loguniform(1e-6, 1e-4),
        "num_train_epochs": tune.choice(list(range(1, 6))),
        "seed": tune.uniform(1, 40),
        "per_device_train_batch_size": tune.choice([4, 8, 16, 32, 64]),
    }
```
Link : https://huggingface.co/transformers/_modules/transformers/trainer_utils.html

In [None]:
# No compute_metrics function is passed to the Trainer, so the default objective is the
# evaluation loss, which has to be minimized.
trainer.hyperparameter_search(n_trials=3, hp_space=my_hp_space, backend='ray', direction='minimize')

No `resources_per_trial` arg was passed into `hyperparameter_search`. Setting it to a default value of 1 CPU and 1 GPU for each trial.
2021-07-13 12:33:57,526	INFO services.py:1274 -- View the Ray dashboard at http://127.0.0.1:8265


== Status ==
Memory usage on this node: 2.7/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.3 GiB heap, 0.0/3.65 GiB objects (0.0/1.0 accelerator_type:K80)
Result logdir: /root/ray_results/_objective_2021-07-13_12-34-01
Number of trials: 3/3 (3 PENDING)
+------------------------+----------+-------+-----------------+--------------------+-------------------------------+
| Trial name             | status   | loc   |   learning_rate |   num_train_epochs |   per_device_train_batch_size |
|------------------------+----------+-------+-----------------+--------------------+-------------------------------|
| _objective_9a82c_00000 | PENDING  |       |     2.49816e-05 |                  2 |                            32 |
| _objective_9a82c_00001 | PENDING  |       |     3.92798e-05 |                  2 |                             8 |
| _objective_9a82c_00002 | PENDING  |       |     1.62407e-05 |                  5 |                            32 |
+--

[2m[36m(pid=359)[0m 2021-07-13 12:34:03.895239: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0


== Status ==
Memory usage on this node: 3.3/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 1.0/2 CPUs, 1.0/1 GPUs, 0.0/7.3 GiB heap, 0.0/3.65 GiB objects (0.0/1.0 accelerator_type:K80)
Result logdir: /root/ray_results/_objective_2021-07-13_12-34-01
Number of trials: 3/3 (2 PENDING, 1 RUNNING)
+------------------------+----------+-------+-----------------+--------------------+-------------------------------+
| Trial name             | status   | loc   |   learning_rate |   num_train_epochs |   per_device_train_batch_size |
|------------------------+----------+-------+-----------------+--------------------+-------------------------------|
| _objective_9a82c_00000 | RUNNING  |       |     2.49816e-05 |                  2 |                            32 |
| _objective_9a82c_00001 | PENDING  |       |     3.92798e-05 |                  2 |                             8 |
| _objective_9a82c_00002 | PENDING  |       |     1.62407e-05 |                  5 |                     

[2m[36m(pid=359)[0m Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight']
[2m[36m(pid=359)[0m - This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
[2m[36m(pid=359)[0m - This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[2m[36m(pid=359)[0m Som

[2m[36m(pid=359)[0m {'loss': 0.0327, 'learning_rate': 7.729115282972884e-06, 'epoch': 1.38}




[2m[36m(pid=359)[0m {'train_runtime': 435.1393, 'train_samples_per_second': 53.224, 'train_steps_per_second': 1.664, 'train_loss': 0.02366616482234133, 'epoch': 2.0}


[2m[36m(pid=359)[0m   1%|          | 3/311 [00:00<00:14, 21.57it/s]
  2%|▏         | 5/311 [00:00<00:16, 18.60it/s]
  2%|▏         | 7/311 [00:00<00:18, 16.71it/s]
  3%|▎         | 9/311 [00:00<00:19, 15.47it/s]
  4%|▎         | 11/311 [00:00<00:20, 14.69it/s]
  4%|▍         | 13/311 [00:00<00:20, 14.39it/s]
  5%|▍         | 15/311 [00:01<00:21, 14.07it/s]
  5%|▌         | 17/311 [00:01<00:21, 13.80it/s]
  6%|▌         | 19/311 [00:01<00:21, 13.83it/s]
  7%|▋         | 21/311 [00:01<00:21, 13.77it/s]
  7%|▋         | 23/311 [00:01<00:21, 13.68it/s]
  8%|▊         | 25/311 [00:01<00:21, 13.60it/s]
  9%|▊         | 27/311 [00:01<00:20, 13.62it/s]
  9%|▉         | 29/311 [00:02<00:20, 13.57it/s]
 10%|▉         | 31/311 [00:02<00:20, 13.54it/s]
 11%|█         | 33/311 [00:02<00:20, 13.54it/s]
 11%|█▏        | 35/311 [00:02<00:20, 13.53it/s]
 12%|█▏        | 37/311 [00:02<00:20, 13.52it/s]
 13%|█▎        | 39/311 [00:02<00:20, 13.51it/s]
 13%|█▎        | 41/311 [00:02<00:20, 13.49it/s]


Result for _objective_9a82c_00000:
  date: 2021-07-13_12-41-50
  done: true
  epoch: 2.0
  eval_loss: 2.791937828063965
  eval_runtime: 23.0062
  eval_samples_per_second: 107.884
  eval_steps_per_second: 13.518
  experiment_id: 1127a5f680364f6d94e0a1ea997e573e
  hostname: 28859bc61174
  iterations_since_restore: 1
  node_ip: 172.28.0.2
  objective: 2.791937828063965
  pid: 359
  time_since_restore: 465.17490887641907
  time_this_iter_s: 465.17490887641907
  time_total_s: 465.17490887641907
  timestamp: 1626180110
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 9a82c_00000
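
The learning rate logged for this trial at step 500 (`7.729115282972884e-06`) is consistent with a linear decay from the sampled starting value. The sketch below reproduces it, assuming the Trainer defaults of a linear schedule with zero warmup steps (no cell in this notebook overrides them) and 724 total optimizer steps (consistent with `train_steps_per_second` × `train_runtime` ≈ 1.664 × 435.14 in the trial summary). The helper `linear_decay_lr` is hypothetical, and the starting value is the rounded one from the trial table.

```python
def linear_decay_lr(initial_lr, step, total_steps, warmup_steps=0):
    """Learning rate after `step` optimizer steps: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return initial_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return initial_lr * remaining / max(1, total_steps - warmup_steps)


initial_lr = 2.49816e-05   # sampled learning_rate of trial 9a82c_00000 (rounded in the table)
total_steps = 724          # 2 epochs at batch size 32
lr_at_500 = linear_decay_lr(initial_lr, step=500, total_steps=total_steps)
print(lr_at_500)
```

By the end of training the learning rate reaches zero, so trials with more epochs spend more steps at small learning rates.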
  
== Status ==
Memory usage on this node: 3.8/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.3 GiB heap, 0.0/3.65 GiB objects (0.0/1.0 accelerator_type:K80, 0.0/1.0 GPU_group_0_d0ef30c3924031cf839503dcd57136c7, 0.0/1.0 GPU_group_d0ef30c3924031cf839503dcd57136c7, 0.0/1.0 CPU_group_d0ef30c3924031cf839503dcd57136c7, 0.0/1.0 CPU_group_0_d0ef30c3924031cf839503dcd5

[2m[36m(pid=359)[0m 100%|██████████| 311/311 [00:22<00:00, 13.50it/s]100%|██████████| 311/311 [00:22<00:00, 13.53it/s]
[2m[36m(pid=360)[0m 2021-07-13 12:41:52.604942: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
[2m[36m(pid=360)[0m Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
[2m[36m(pid=360)[0m - This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).


[2m[36m(pid=360)[0m {'loss': 0.0234, 'learning_rate': 3.249803155497413e-05, 'epoch': 0.35}



[2m[36m(pid=360)[0m {'loss': 0.0009, 'learning_rate': 2.571630543749205e-05, 'epoch': 0.69}



[2m[36m(pid=360)[0m {'loss': 0.0004, 'learning_rate': 1.8934579320009968e-05, 'epoch': 1.04}



[2m[36m(pid=360)[0m {'loss': 0.0003, 'learning_rate': 1.2152853202527888e-05, 'epoch': 1.38}



[2m[36m(pid=360)[0m {'loss': 0.0002, 'learning_rate': 5.371127085045808e-06, 'epoch': 1.73}



[2m[36m(pid=360)[0m {'train_runtime': 732.8944, 'train_samples_per_second': 31.601, 'train_steps_per_second': 3.951, 'train_loss': 0.004394141679310667, 'epoch': 2.0}


[2m[36m(pid=360)[0m   2%|▏         | 5/311 [00:00<00:15, 19.46it/s]
  2%|▏         | 7/311 [00:00<00:17, 17.20it/s]
  3%|▎         | 9/311 [00:00<00:19, 15.89it/s]
  4%|▎         | 11/311 [00:00<00:19, 15.06it/s]
  4%|▍         | 13/311 [00:00<00:20, 14.55it/s]
  5%|▍         | 15/311 [00:01<00:20, 14.22it/s]
  5%|▌         | 17/311 [00:01<00:21, 13.94it/s]
  6%|▌         | 19/311 [00:01<00:21, 13.77it/s]
  7%|▋         | 21/311 [00:01<00:21, 13.68it/s]
  7%|▋         | 23/311 [00:01<00:21, 13.68it/s]
  8%|▊         | 25/311 [00:01<00:21, 13.49it/s]
  9%|▊         | 27/311 [00:01<00:20, 13.59it/s]
  9%|▉         | 29/311 [00:02<00:21, 13.37it/s]
 10%|▉         | 31/311 [00:02<00:20, 13.51it/s]
 11%|█         | 33/311 [00:02<00:20, 13.53it/s]
 11%|█▏        | 35/311 [00:02<00:20, 13.52it/s]
 12%|█▏        | 37/311 [00:02<00:20, 13.40it/s]
 13%|█▎        | 39/311 [00:02<00:20, 13.51it/s]
 13%|█▎        | 41/311 [00:02<00:19, 13.50it/s]
 14%|█▍        | 43/311 [00:03<00:19, 13.51it/s]

Result for _objective_9a82c_00001:
  date: 2021-07-13_12-54-37
  done: true
  epoch: 2.0
  eval_loss: 4.609316349029541
  eval_runtime: 23.013
  eval_samples_per_second: 107.852
  eval_steps_per_second: 13.514
  experiment_id: fd47fe81c5cc414e8a95a278e7386ae9
  hostname: 28859bc61174
  iterations_since_restore: 1
  node_ip: 172.28.0.2
  objective: 4.609316349029541
  pid: 360
  time_since_restore: 762.8735115528107
  time_this_iter_s: 762.8735115528107
  time_total_s: 762.8735115528107
  timestamp: 1626180877
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 9a82c_00001
  
== Status ==
Memory usage on this node: 4.6/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.3 GiB heap, 0.0/3.65 GiB objects (0.0/1.0 GPU_group_0_d0ef30c3924031cf839503dcd57136c7, 0.0/1.0 accelerator_type:K80, 0.0/1.0 GPU_group_d0ef30c3924031cf839503dcd57136c7, 0.0/1.0 CPU_group_0_d0ef30c3924031cf839503dcd57136c7, 0.0/1.0 CPU_group_d0ef30c3924031cf839503dcd57136

[2m[36m(pid=629)[0m 2021-07-13 12:54:41.924022: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
[2m[36m(pid=629)[0m Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight']
[2m[36m(pid=629)[0m - This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
[2m[36m(pid=629)[0m - This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a 

[2m[36m(pid=629)[0m {'loss': 0.0436, 'learning_rate': 1.1754351800653963e-05, 'epoch': 1.38}



[2m[36m(pid=629)[0m {'loss': 0.0031, 'learning_rate': 7.267957983610466e-06, 'epoch': 2.76}



[2m[36m(pid=629)[0m {'loss': 0.0017, 'learning_rate': 2.7815641665669683e-06, 'epoch': 4.14}



[2m[36m(pid=629)[0m {'train_runtime': 1089.2448, 'train_samples_per_second': 53.156, 'train_steps_per_second': 1.662, 'train_loss': 0.013621492148762909, 'epoch': 5.0}


[2m[36m(pid=629)[0m   2%|▏         | 5/311 [00:00<00:16, 18.48it/s]

Result for _objective_9a82c_00002:
  date: 2021-07-13_13-13-23
  done: true
  epoch: 5.0
  eval_loss: 3.303964376449585
  eval_runtime: 22.9963
  eval_samples_per_second: 107.93
  eval_steps_per_second: 13.524
  experiment_id: d23fbf417b844e258528cd726108f8c9
  hostname: 28859bc61174
  iterations_since_restore: 1
  node_ip: 172.28.0.2
  objective: 3.303964376449585
  pid: 629
  time_since_restore: 1119.5053822994232
  time_this_iter_s: 1119.5053822994232
  time_total_s: 1119.5053822994232
  timestamp: 1626182003
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 9a82c_00002
  
== Status ==
Memory usage on this node: 4.2/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.3 GiB heap, 0.0/3.65 GiB objects (0.0/1.0 GPU_group_0_d0ef30c3924031cf839503dcd57136c7, 0.0/1.0 CPU_group_d0ef30c3924031cf839503dcd57136c7, 0.0/1.0 accelerator_type:K80, 0.0/1.0 GPU_group_d0ef30c3924031cf839503dcd57136c7, 0.0/1.0 CPU_group_0_d0ef30c3924031cf839503dcd57

BestRun(run_id='9a82c_00000', objective=2.791937828063965, hyperparameters={'learning_rate': 2.49816047538945e-05, 'num_train_epochs': 2, 'per_device_train_batch_size': 32})

Search algorithm:
  * If none is provided, Ray Tune falls back to `BasicVariantGenerator`, which performs random search and grid search.
  * Link: https://docs.ray.io/en/latest/tune/api_docs/suggestion.html#tune-basicvariant
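
A minimal sketch of the two strategies that `BasicVariantGenerator` covers, written with the standard library only rather than Ray's API. The search space, the learning-rate bounds, and the helper names (`grid_search`, `random_search`) are illustrative assumptions, not the `my_hp_space` used in this notebook.

```python
import itertools
import random

# Hypothetical search space, chosen only for illustration.
grid_space = {
    "num_train_epochs": [2, 3, 5],
    "per_device_train_batch_size": [8, 16, 32],
}


def grid_search(space):
    """Enumerate every combination of the listed values."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in itertools.product(*space.values())]


def random_search(n_trials, rng):
    """Draw n_trials independent configurations from the bounded domain."""
    configs = []
    for _ in range(n_trials):
        configs.append({
            # Log-uniform draw between 1e-6 and 1e-4 (assumed bounds).
            "learning_rate": 10 ** rng.uniform(-6, -4),
            "num_train_epochs": rng.choice(grid_space["num_train_epochs"]),
            "per_device_train_batch_size": rng.choice(
                grid_space["per_device_train_batch_size"]
            ),
        })
    return configs


grid_configs = grid_search(grid_space)
random_configs = random_search(n_trials=3, rng=random.Random(0))

print(len(grid_configs))  # number of grid points: 3 * 3
for cfg in random_configs:
    print(cfg)
```

Grid search grows multiplicatively with every added hyper-parameter, while random search keeps the number of trials fixed at `n_trials` regardless of how many dimensions the space has.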

### Best run 

Best_run : Optuna (n_trials = 5) with default hp_space

```
BestRun(run_id='1', objective=0.8464898467063904, hyperparameters={'learning_rate': 3.2522034211592625e-06, 'num_train_epochs': 1, 'seed': 24, 'per_device_train_batch_size': 32})
```

Best_run : Ray (n_trials = 3) with custom hp_space
```
BestRun(run_id='9a82c_00000', objective=2.791937828063965, hyperparameters={'learning_rate': 2.49816047538945e-05, 'num_train_epochs': 2, 'per_device_train_batch_size': 32})
```
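
How the Ray best run is picked can be sketched in a few lines, assuming the search direction is `minimize` (the default for `hyperparameter_search`). The objective values are copied from the Ray log of the three-trial run; the value for trial `9a82c_00000` is taken from the `BestRun` line. The names `objectives` and `best_id` are illustrative.

```python
# Objective (evaluation loss) reported for each trial of the Ray run.
objectives = {
    "9a82c_00000": 2.791937828063965,
    "9a82c_00001": 4.609316349029541,
    "9a82c_00002": 3.303964376449585,
}

# With direction="minimize", the best trial is the one with the smallest objective.
best_id = min(objectives, key=objectives.get)
print(best_id, objectives[best_id])
```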

## XL-Net

In [34]:
model = 'xlnet-base-cased'

In [35]:
tokenizer = AutoTokenizer.from_pretrained(model,problem_type="multi_label_classification")

Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file https://huggingface.co/xlnet-base-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/06bdb0f5882dbb833618c81c3b4c996a0c79422fa2c95ffea3827f92fc2dba6b.da982e2e596ec73828dbae86525a1870e513bd63aae5a2dc773ccc840ac5c346
Model config XLNetConfig {
  "architectures": [
    "XLNetLMHeadModel"
  ],
  "attn_type": "bi",
  "bi_data": false,
  "bos_token_id": 1,
  "clamp_len": -1,
  "d_head": 64,
  "d_inner": 3072,
  "d_model": 768,
  "dropout": 0.1,
  "end_n_top": 5,
  "eos_token_id": 2,
  "ff_activation": "gelu",
  "initializer_range": 0.02,
  "layer_norm_eps": 1e-12,
  "mem_len": null,
  "model_type": "xlnet",
  "n_head": 12,
  "n_layer": 12,
  "pad_token_id": 5,
  "problem_type": "multi_label_classification",
  "reuse_len": null,
  "same_length": false,
  "start_n_top": 5,
  "summary_activation": "tanh",
  "summary_last_dropout": 0.1,
  "summar

Under-sampling to 2000 examples each for training and validation because of the large LM size

In [36]:
train_df = train_df.sample(2000)
val_df = val_df.sample(2000)
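
A quick sanity check on what the under-sampling means for training length. The sketch assumes a single GPU, no gradient accumulation, and that the last partial batch is kept; `total_steps` is a hypothetical helper. The batch sizes and epoch counts are those listed for trials `fb0b4_00000` and `fb0b4_00001` in the Ray trial table further down, and the evaluation batch size of 8 is the `TrainingArguments` default.

```python
import math


def total_steps(n_samples, batch_size, epochs):
    """Optimizer steps for one trial: batches per epoch times number of epochs."""
    return math.ceil(n_samples / batch_size) * epochs


n_train = 2000  # rows kept by train_df.sample(2000)
n_val = 2000    # rows kept by val_df.sample(2000)

steps_trial_0 = total_steps(n_train, batch_size=32, epochs=2)  # 63 batches * 2 epochs = 126
steps_trial_1 = total_steps(n_train, batch_size=16, epochs=2)  # 125 batches * 2 epochs = 250
eval_batches = math.ceil(n_val / 8)                            # 250 evaluation batches

print(steps_trial_0, steps_trial_1, eval_batches)
```

These counts agree with `train_runtime * train_steps_per_second` (about 126 and 250) and with `eval_runtime * eval_steps_per_second` (about 250) reported for those trials in the log.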

In [37]:
train_dataset = ExplicitStereotypeDataset(
  train_df,
  tokenizer,
  max_token_len=MAX_LEN
)

In [38]:
val_dataset = ExplicitStereotypeDataset(
  val_df,
  tokenizer,
  max_token_len=MAX_LEN
)

In [39]:
def model_init():
    # Build a fresh model for every trial so each run starts from the pretrained weights.
    return AutoModelForSequenceClassification.from_pretrained(
        model, problem_type="multi_label_classification", num_labels=num_labels
    )

In [40]:
# Evaluate during training and a bit more often than the default to be able to prune bad trials early.
# Disabling tqdm is a matter of preference.
# batch_size = 8

# training_args = TrainingArguments(
#     "test", evaluate_during_training=True, eval_steps=500, disable_tqdm=True)

trainer = Trainer(
    model_init=model_init,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

No `TrainingArguments` passed, using `output_dir=tmp_trainer`.
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
loading configuration file https://huggingface.co/xlnet-base-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/06bdb0f5882dbb833618c81c3b4c996a0c79422fa2c95ffea3827f92fc2dba6b.da982e2e596ec73828dbae86525a1870e513bd63aae5a2dc773ccc840ac5c346
Model config XLNetConfig {
  "architectures": [
    "XLNetLMHeadModel"
  ],
  "attn_type": "bi",
  "bi_data": false,
  "bos_token_id": 1,
  "clamp_len": -1,
  "d_head": 64,
  "d_inner": 3072,
  "d_model": 768,
  "dropout": 0.1,
  "end_n_top": 5,
  "eos_token_id": 2,
  "ff_activation": "gelu",
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2

storing https://huggingface.co/xlnet-base-cased/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/9461853998373b0b2f8ef8011a13b62a2c5f540b2c535ef3ea46ed8a062b16a9.3e214f11a50e9e03eb47535b58522fc3cc11ac67c120a9450f6276de151af987
creating metadata file for /root/.cache/huggingface/transformers/9461853998373b0b2f8ef8011a13b62a2c5f540b2c535ef3ea46ed8a062b16a9.3e214f11a50e9e03eb47535b58522fc3cc11ac67c120a9450f6276de151af987
loading weights file https://huggingface.co/xlnet-base-cased/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/9461853998373b0b2f8ef8011a13b62a2c5f540b2c535ef3ea46ed8a062b16a9.3e214f11a50e9e03eb47535b58522fc3cc11ac67c120a9450f6276de151af987
Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetForSequenceClassification: ['lm_loss.weight', 'lm_loss.bias']
- This IS expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model trained on 
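
A note on the objective optimized by the search in the next cell: in the Ray result blocks, `objective` always equals `eval_loss`, which is what the `Trainer` falls back to when no `compute_metrics` function is supplied. The function below is a simplified stand-in for that default, written for illustration only; it is not the library source, and `compute_objective` and `logged` are hypothetical names.

```python
def compute_objective(metrics):
    """Return eval_loss when no quality metrics exist, otherwise the sum of the metrics."""
    metrics = dict(metrics)  # work on a copy so the caller's dict is untouched
    loss = metrics.pop("eval_loss", None)
    metrics.pop("epoch", None)
    # Timing and throughput entries are not quality metrics, so they are ignored.
    quality = {
        name: value
        for name, value in metrics.items()
        if not (name.endswith("_runtime") or name.endswith("_per_second"))
    }
    return loss if not quality else sum(quality.values())


# Values copied from "Result for _objective_fb0b4_00000" in the Ray log.
logged = {
    "eval_loss": 2.250964879989624,
    "eval_runtime": 8.2556,
    "eval_samples_per_second": 242.259,
    "eval_steps_per_second": 30.282,
    "epoch": 2.0,
}

objective = compute_objective(logged)
print(objective)
```

Because the objective here is a loss, the search has to minimize it. A sum of metrics such as F1 or accuracy would instead call for `direction="maximize"`.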

In [41]:
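# No compute_metrics is passed, so the default objective is the evaluation loss and the search minimizes it.
# (If metrics were provided, the default objective would be their sum and direction="maximize" would be needed.)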
best_run = trainer.hyperparameter_search(n_trials=5,hp_space=my_hp_space,backend="ray")

No `resources_per_trial` arg was passed into `hyperparameter_search`. Setting it to a default value of 1 CPU and 1 GPU for each trial.
2021-07-22 16:57:59,849	INFO services.py:1274 -- View the Ray dashboard at http://127.0.0.1:8265


== Status ==
Memory usage on this node: 3.4/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.3 GiB heap, 0.0/3.65 GiB objects (0.0/1.0 accelerator_type:T4)
Result logdir: /root/ray_results/_objective_2021-07-22_16-58-04
Number of trials: 5/5 (5 PENDING)
+------------------------+----------+-------+-----------------+--------------------+-------------------------------+--------+
| Trial name             | status   | loc   |   learning_rate |   num_train_epochs |   per_device_train_batch_size |   seed |
|------------------------+----------+-------+-----------------+--------------------+-------------------------------+--------|
| _objective_fb0b4_00000 | PENDING  |       |     2.49816e-05 |                  2 |                            32 |     15 |
| _objective_fb0b4_00001 | PENDING  |       |     4.11876e-05 |                  2 |                            16 |     39 |
| _objective_fb0b4_00002 | PENDING  |       |     1.62398e-05 |             

[2m[36m(pid=628)[0m 2021-07-22 16:58:05.530782: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0


== Status ==
Memory usage on this node: 4.5/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 1.0/2 CPUs, 1.0/1 GPUs, 0.0/7.3 GiB heap, 0.0/3.65 GiB objects (0.0/1.0 accelerator_type:T4)
Result logdir: /root/ray_results/_objective_2021-07-22_16-58-04
Number of trials: 5/5 (4 PENDING, 1 RUNNING)
+------------------------+----------+-------+-----------------+--------------------+-------------------------------+--------+
| Trial name             | status   | loc   |   learning_rate |   num_train_epochs |   per_device_train_batch_size |   seed |
|------------------------+----------+-------+-----------------+--------------------+-------------------------------+--------|
| _objective_fb0b4_00000 | RUNNING  |       |     2.49816e-05 |                  2 |                            32 |     15 |
| _objective_fb0b4_00001 | PENDING  |       |     4.11876e-05 |                  2 |                            16 |     39 |
| _objective_fb0b4_00002 | PENDING  |       |     1.62398e-05

[2m[36m(pid=628)[0m Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetForSequenceClassification: ['lm_loss.weight', 'lm_loss.bias']
[2m[36m(pid=628)[0m - This IS expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
[2m[36m(pid=628)[0m - This IS NOT expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[2m[36m(pid=628)[0m Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['sequence_summary.summary.weight', 'sequence_summary.summary.bias', 'logits_proj.bias', 'logits_proj.weight']
[2m

[2m[36m(pid=628)[0m {'train_runtime': 42.8119, 'train_samples_per_second': 93.432, 'train_steps_per_second': 2.943, 'train_loss': 0.034531430592612614, 'epoch': 2.0}


[2m[36m(pid=628)[0m   4%|▎         | 9/250 [00:00<00:07, 33.46it/s]

Result for _objective_fb0b4_00000:
  date: 2021-07-22_16-59-18
  done: true
  epoch: 2.0
  eval_loss: 2.250964879989624
  eval_runtime: 8.2556
  eval_samples_per_second: 242.259
  eval_steps_per_second: 30.282
  experiment_id: 0d8a08ebe2d04c6a9c4a24fc6f2093ee
  hostname: b2ef10663337
  iterations_since_restore: 1
  node_ip: 172.28.0.2
  objective: 2.250964879989624
  pid: 628
  time_since_restore: 71.2275619506836
  time_this_iter_s: 71.2275619506836
  time_total_s: 71.2275619506836
  timestamp: 1626973158
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: fb0b4_00000
  
== Status ==
Memory usage on this node: 5.2/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.3 GiB heap, 0.0/3.65 GiB objects (0.0/1.0 accelerator_type:T4, 0.0/1.0 GPU_group_cfb6ad648905d2d9fd8c696e4f854679, 0.0/1.0 CPU_group_cfb6ad648905d2d9fd8c696e4f854679, 0.0/1.0 CPU_group_0_cfb6ad648905d2d9fd8c696e4f854679, 0.0/1.0 GPU_group_0_cfb6ad648905d2d9fd8c696e4f854679)


[2m[36m(pid=629)[0m 2021-07-22 16:59:19.413121: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
[2m[36m(pid=629)[0m Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetForSequenceClassification: ['lm_loss.bias', 'lm_loss.weight']
[2m[36m(pid=629)[0m - This IS expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
[2m[36m(pid=629)[0m - This IS NOT expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[2m[36m(pid=629)[0m Some weights of XLNetForSequenceClassification were not initialized from the model check

[2m[36m(pid=629)[0m {'train_runtime': 48.8778, 'train_samples_per_second': 81.837, 'train_steps_per_second': 5.115, 'train_loss': 0.015162269592285156, 'epoch': 2.0}


[2m[36m(pid=629)[0m   4%|▍         | 10/250 [00:00<00:07, 33.34it/s]

Result for _objective_fb0b4_00001:
  date: 2021-07-22_17-00-25
  done: true
  epoch: 2.0
  eval_loss: 2.5871098041534424
  eval_runtime: 8.5021
  eval_samples_per_second: 235.237
  eval_steps_per_second: 29.405
  experiment_id: 3a3779ca9307496496de4cae74de744f
  hostname: b2ef10663337
  iterations_since_restore: 1
  node_ip: 172.28.0.2
  objective: 2.5871098041534424
  pid: 629
  time_since_restore: 64.38682055473328
  time_this_iter_s: 64.38682055473328
  time_total_s: 64.38682055473328
  timestamp: 1626973225
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: fb0b4_00001
  
== Status ==
Memory usage on this node: 5.2/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.3 GiB heap, 0.0/3.65 GiB objects (0.0/1.0 accelerator_type:T4, 0.0/1.0 CPU_group_cfb6ad648905d2d9fd8c696e4f854679, 0.0/1.0 GPU_group_cfb6ad648905d2d9fd8c696e4f854679, 0.0/1.0 CPU_group_0_cfb6ad648905d2d9fd8c696e4f854679, 0.0/1.0 GPU_group_0_cfb6ad648905d2d9fd8c696e4f854

[2m[36m(pid=748)[0m 2021-07-22 17:00:27.021300: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
[2m[36m(pid=748)[0m Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetForSequenceClassification: ['lm_loss.bias', 'lm_loss.weight']
[2m[36m(pid=748)[0m - This IS expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
[2m[36m(pid=748)[0m - This IS NOT expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[2m[36m(pid=748)[0m Some weights of XLNetForSequenceClassification were not initialized from the model check

[2m[36m(pid=748)[0m {'loss': 0.016, 'learning_rate': 9.743868488068864e-06, 'epoch': 2.0}


[2m[36m(pid=748)[0m  40%|████      | 501/1250 [01:06<20:05,  1.61s/it]

[2m[36m(pid=748)[0m {'loss': 0.0013, 'learning_rate': 3.2479561626896213e-06, 'epoch': 4.0}


[2m[36m(pid=748)[0m  80%|████████  | 1001/1250 [02:13<06:34,  1.58s/it]

[2m[36m(pid=748)[0m {'train_runtime': 164.9398, 'train_samples_per_second': 60.628, 'train_steps_per_second': 7.579, 'train_loss': 0.007107789528369903, 'epoch': 5.0}


[2m[36m(pid=748)[0m   4%|▍         | 10/250 [00:00<00:07, 33.90it/s]

Result for _objective_fb0b4_00002:
  date: 2021-07-22_17-03-28
  done: true
  epoch: 5.0
  eval_loss: 3.0466084480285645
  eval_runtime: 8.0198
  eval_samples_per_second: 249.384
  eval_steps_per_second: 31.173
  experiment_id: cde24f4a9d9e407b9aed64cb9cd07c03
  hostname: b2ef10663337
  iterations_since_restore: 1
  node_ip: 172.28.0.2
  objective: 3.0466084480285645
  pid: 748
  time_since_restore: 179.82508325576782
  time_this_iter_s: 179.82508325576782
  time_total_s: 179.82508325576782
  timestamp: 1626973408
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: fb0b4_00002
  
== Status ==
Memory usage on this node: 6.1/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.3 GiB heap, 0.0/3.65 GiB objects (0.0/1.0 accelerator_type:T4, 0.0/1.0 CPU_group_0_cfb6ad648905d2d9fd8c696e4f854679, 0.0/1.0 GPU_group_cfb6ad648905d2d9fd8c696e4f854679, 0.0/1.0 CPU_group_cfb6ad648905d2d9fd8c696e4f854679, 0.0/1.0 GPU_group_0_cfb6ad648905d2d9fd8c696e4f

[2m[36m(pid=748)[0m 100%|██████████| 250/250 [00:07<00:00, 31.18it/s]100%|██████████| 250/250 [00:07<00:00, 31.27it/s]
[2m[36m(pid=815)[0m 2021-07-22 17:03:29.991391: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
[2m[36m(pid=815)[0m Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetForSequenceClassification: ['lm_loss.weight', 'lm_loss.bias']
[2m[36m(pid=815)[0m - This IS expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
[2m[36m(pid=815)[0m - This IS NOT expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassifica

[2m[36m(pid=815)[0m {'loss': 0.009, 'learning_rate': 2.0426760281837013e-05, 'epoch': 2.0}


[2m[36m(pid=815)[0m  40%|████      | 501/1250 [01:06<19:18,  1.55s/it]

[2m[36m(pid=815)[0m {'loss': 0.0005, 'learning_rate': 6.808920093945672e-06, 'epoch': 4.0}


[2m[36m(pid=815)[0m  80%|████████  | 1001/1250 [02:13<06:42,  1.62s/it]

[2m[36m(pid=815)[0m {'train_runtime': 164.8907, 'train_samples_per_second': 60.646, 'train_steps_per_second': 7.581, 'train_loss': 0.003877159309387207, 'epoch': 5.0}


[2m[36m(pid=815)[0m   2%|▏         | 5/250 [00:00<00:05, 42.54it/s]

Result for _objective_fb0b4_00003:
  date: 2021-07-22_17-06-32
  done: true
  epoch: 5.0
  eval_loss: 3.4166460037231445
  eval_runtime: 8.0051
  eval_samples_per_second: 249.84
  eval_steps_per_second: 31.23
  experiment_id: 6c62bf60174c4db795fcb059b94d9c33
  hostname: b2ef10663337
  iterations_since_restore: 1
  node_ip: 172.28.0.2
  objective: 3.4166460037231445
  pid: 815
  time_since_restore: 180.0498378276825
  time_this_iter_s: 180.0498378276825
  time_total_s: 180.0498378276825
  timestamp: 1626973592
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: fb0b4_00003
  
== Status ==
Memory usage on this node: 6.1/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.3 GiB heap, 0.0/3.65 GiB objects (0.0/1.0 GPU_group_0_cfb6ad648905d2d9fd8c696e4f854679, 0.0/1.0 CPU_group_cfb6ad648905d2d9fd8c696e4f854679, 0.0/1.0 accelerator_type:T4, 0.0/1.0 GPU_group_cfb6ad648905d2d9fd8c696e4f854679, 0.0/1.0 CPU_group_0_cfb6ad648905d2d9fd8c696e4f85467

(pid=886) 2021-07-22 17:06:34.203949: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
(pid=886) Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetForSequenceClassification: ['lm_loss.bias', 'lm_loss.weight']
(pid=886) - This IS expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
(pid=886) - This IS NOT expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
(pid=886) Some weights of XLNetForSequenceClassification were not initialized from the model check

(pid=886) {'train_runtime': 74.1506, 'train_samples_per_second': 80.916, 'train_steps_per_second': 5.057, 'train_loss': 0.009768032073974609, 'epoch': 3.0}

Result for _objective_fb0b4_00004:
  date: 2021-07-22_17-08-05
  done: true
  epoch: 3.0
  eval_loss: 2.861640691757202
  eval_runtime: 8.1852
  eval_samples_per_second: 244.343
  eval_steps_per_second: 30.543
  experiment_id: 879c77dce45f47f0bf7ff2b7eb2f5e4c
  hostname: b2ef10663337
  iterations_since_restore: 1
  node_ip: 172.28.0.2
  objective: 2.861640691757202
  pid: 886
  time_since_restore: 89.31720423698425
  time_this_iter_s: 89.31720423698425
  time_total_s: 89.31720423698425
  timestamp: 1626973685
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: fb0b4_00004
  
== Status ==
Memory usage on this node: 6.1/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.3 GiB heap, 0.0/3.65 GiB objects (0.0/1.0 CPU_group_0_cfb6ad648905d2d9fd8c696e4f854679, 0.0/1.0 accelerator_type:T4, 0.0/1.0 GPU_group_cfb6ad648905d2d9fd8c696e4f854679, 0.0/1.0 GPU_group_0_cfb6ad648905d2d9fd8c696e4f854679, 0.0/1.0 CPU_group_cfb6ad648905d2d9fd8c696e4f85467

 Best run : `LR : 2.21697e-05 | Epochs : 3 | Batch_size : 8`
 An error occurred with a batch size of 32
  * Possibly due to undersampling?

### Best run 

XLNet-base

In [42]:
best_run

BestRun(run_id='fb0b4_00000', objective=2.250964879989624, hyperparameters={'learning_rate': 2.49816047538945e-05, 'num_train_epochs': 2, 'seed': 15, 'per_device_train_batch_size': 32})

XLNet-large

In [None]:
best_run

BestRun(run_id='84722_00003', objective=2.2981693744659424, hyperparameters={'learning_rate': 1.2323344486727979e-05, 'num_train_epochs': 2, 'per_device_train_batch_size': 32})

## RoBERTa

In [22]:
model = 'roberta-base'

In [23]:
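# problem_type is a model option (it is passed again in model_init); it does not change how text is tokenized.
tokenizer = AutoTokenizer.from_pretrained(model, problem_type="multi_label_classification")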

In [24]:
train_df = train_df.sample(2000)
val_df = val_df.sample(2000)

In [25]:
train_dataset = ExplicitStereotypeDataset(
  train_df,
  tokenizer,
  max_token_len=MAX_LEN
)

In [26]:
val_dataset = ExplicitStereotypeDataset(
  val_df,
  tokenizer,
  max_token_len=MAX_LEN
)

In [27]:
def model_init():
    # Called at the start of every trial, so each trial begins from the same pretrained weights.
    return AutoModelForSequenceClassification.from_pretrained(
        model, problem_type="multi_label_classification", num_labels=num_labels)

In [28]:
# Note: eval_steps only takes effect together with evaluation_strategy="steps".
# With the default strategy used here, each trial is evaluated once, after training finishes
# (the logs below report training_iteration: 1), so bad trials cannot be pruned early.
# Disabling tqdm is a matter of preference.
# batch_size = 8

training_args = TrainingArguments(
    "test", eval_steps=500, disable_tqdm=True)

trainer = Trainer(
    model_init=model_init,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

loading configuration file https://huggingface.co/roberta-base/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/733bade19e5f0ce98e6531021dd5180994bb2f7b8bd7e80c7968805834ba351e.35205c6cfc956461d8515139f0f8dd5d207a2f336c0c3a83b4bc8dca3518e37b
Model config RobertaConfig {
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num

storing https://huggingface.co/roberta-base/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/51ba668f7ff34e7cdfa9561e8361747738113878850a7d717dbc69de8683aaad.c7efaa30a0d80b2958b876969faa180e485944a849deee4ad482332de65365a7
creating metadata file for /root/.cache/huggingface/transformers/51ba668f7ff34e7cdfa9561e8361747738113878850a7d717dbc69de8683aaad.c7efaa30a0d80b2958b876969faa180e485944a849deee4ad482332de65365a7
loading weights file https://huggingface.co/roberta-base/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/51ba668f7ff34e7cdfa9561e8361747738113878850a7d717dbc69de8683aaad.c7efaa30a0d80b2958b876969faa180e485944a849deee4ad482332de65365a7
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'roberta.pooler.dense.weight', 'lm_head.decoder.weight', 'lm_head.dense.weight', '

In [29]:
# No compute_metrics function is passed to the Trainer, so the objective is the evaluation loss,
# and the default direction ("minimize") applies.
best_run = trainer.hyperparameter_search(n_trials=5, hp_space=my_hp_space, backend="ray")

No `resources_per_trial` arg was passed into `hyperparameter_search`. Setting it to a default value of 1 CPU and 1 GPU for each trial.
2021-07-25 09:15:45,548	INFO services.py:1274 -- View the Ray dashboard at http://127.0.0.1:8265


== Status ==
Memory usage on this node: 3.4/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.31 GiB heap, 0.0/3.66 GiB objects (0.0/1.0 accelerator_type:T4)
Result logdir: /root/ray_results/_objective_2021-07-25_09-15-49
Number of trials: 5/5 (5 PENDING)
+------------------------+----------+-------+-----------------+--------------------+-------------------------------+--------+
| Trial name             | status   | loc   |   learning_rate |   num_train_epochs |   per_device_train_batch_size |   seed |
|------------------------+----------+-------+-----------------+--------------------+-------------------------------+--------|
| _objective_e7185_00000 | PENDING  |       |     2.49816e-05 |                  2 |                            32 |     15 |
| _objective_e7185_00001 | PENDING  |       |     4.11876e-05 |                  2 |                            16 |     39 |
| _objective_e7185_00002 | PENDING  |       |     1.62398e-05 |            

(pid=395) 2021-07-25 09:15:50.690869: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0


== Status ==
Memory usage on this node: 4.6/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 1.0/2 CPUs, 1.0/1 GPUs, 0.0/7.31 GiB heap, 0.0/3.66 GiB objects (0.0/1.0 accelerator_type:T4)
Result logdir: /root/ray_results/_objective_2021-07-25_09-15-49
Number of trials: 5/5 (4 PENDING, 1 RUNNING)
+------------------------+----------+-------+-----------------+--------------------+-------------------------------+--------+
| Trial name             | status   | loc   |   learning_rate |   num_train_epochs |   per_device_train_batch_size |   seed |
|------------------------+----------+-------+-----------------+--------------------+-------------------------------+--------|
| _objective_e7185_00000 | RUNNING  |       |     2.49816e-05 |                  2 |                            32 |     15 |
| _objective_e7185_00001 | PENDING  |       |     4.11876e-05 |                  2 |                            16 |     39 |
| _objective_e7185_00002 | PENDING  |       |     1.62398e-0

(pid=395) Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'roberta.pooler.dense.weight', 'lm_head.dense.bias']
(pid=395) - This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
(pid=395) - This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
(pid=395) Some weights of RobertaForSequenceClassification were not initialized from the model 

(pid=395) {'train_runtime': 38.5293, 'train_samples_per_second': 103.817, 'train_steps_per_second': 3.27, 'train_loss': 0.09541043024214488, 'epoch': 2.0}
Result for _objective_e7185_00000:
  date: 2021-07-25_09-16-57
  done: true
  epoch: 2.0
  eval_loss: 0.02722325176000595
  eval_runtime: 6.4552
  eval_samples_per_second: 309.829
  eval_steps_per_second: 38.729
  experiment_id: 15b91dd4ceb4430eac838e9b4f450c9e
  hostname: a9d20365adc1
  iterations_since_restore: 1
  node_ip: 172.28.0.2
  objective: 0.02722325176000595
  pid: 395
  time_since_restore: 65.24121451377869
  time_this_iter_s: 65.24121451377869
  time_total_s: 65.24121451377869
  timestamp: 1627204617
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: e7185_00000
  
(pid=395) {'eval_loss': 0.02722325176000595, 'eval_runtime': 6.4552, 'eval_samples_per_second': 309.829, 'eval_steps_per_second': 38.729, 'epoch': 2.0}
== Status ==
Memory usage on this node: 5.1/12.7 GiB
Using FIFO sche

(pid=396) 2021-07-25 09:16:58.502343: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
(pid=396) Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'roberta.pooler.dense.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight']
(pid=396) - This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
(pid=396) - This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClass

(pid=396) {'train_runtime': 45.0829, 'train_samples_per_second': 88.725, 'train_steps_per_second': 5.545, 'train_loss': 0.03513337326049805, 'epoch': 2.0}
Result for _objective_e7185_00001:
  date: 2021-07-25_09-17-58
  done: true
  epoch: 2.0
  eval_loss: 0.005616203416138887
  eval_runtime: 6.8701
  eval_samples_per_second: 291.115
  eval_steps_per_second: 36.389
  experiment_id: 24c191e2ef9b457db757bb7c470bf28a
  hostname: a9d20365adc1
  iterations_since_restore: 1
  node_ip: 172.28.0.2
  objective: 0.005616203416138887
  pid: 396
  time_since_restore: 58.409374952316284
  time_this_iter_s: 58.409374952316284
  time_total_s: 58.409374952316284
  timestamp: 1627204678
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: e7185_00001
  
== Status ==
Memory usage on this node: 5.2/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.31 GiB heap, 0.0/3.66 GiB objects (0.0/1.0 GPU_group_0_154fee2e432168d0fcaf5f3577375dfa, 0.0/1.

(pid=507) 2021-07-25 09:18:00.060052: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
(pid=507) Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.weight']
(pid=507) - This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
(pid=507) - This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClass

(pid=507) {'loss': 0.0381, 'learning_rate': 9.743868488068864e-06, 'epoch': 2.0}
(pid=507) {'loss': 0.0026, 'learning_rate': 3.2479561626896213e-06, 'epoch': 4.0}
(pid=507) {'train_runtime': 157.6131, 'train_samples_per_second': 63.447, 'train_steps_per_second': 7.931, 'train_loss': 0.016683567428588866, 'epoch': 5.0}
Result for _objective_e7185_00002:
  date: 2021-07-25_09-20-52
  done: true
  epoch: 5.0
  eval_loss: 0.0022873186971992254
  eval_runtime: 6.8192
  eval_samples_per_second: 293.289
  eval_steps_per_second: 36.661
  experiment_id: 80584ba295fa4bf2b438c43e517f2bc0
  hostname: a9d20365adc1
  iterations_since_restore: 1
  node_ip: 172.28.0.2
  objective: 0.0022873186971992254
  pid: 507
  time_since_restore: 170.89797496795654
  time_this_iter_s: 170.89797496795654
  time_total_s: 170.89797496795654
  timestamp: 1627204852
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: e7185_00002
  
== Status ==
Memory usage on this n

(pid=574) 2021-07-25 09:20:54.467708: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
(pid=574) Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.dense.weight', 'roberta.pooler.dense.weight', 'lm_head.dense.bias', 'roberta.pooler.dense.bias', 'lm_head.decoder.weight']
(pid=574) - This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
(pid=574) - This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClass

(pid=574) {'loss': 0.0203, 'learning_rate': 2.0426760281837013e-05, 'epoch': 2.0}
(pid=574) {'loss': 0.0009, 'learning_rate': 6.808920093945672e-06, 'epoch': 4.0}
(pid=574) {'train_runtime': 158.3546, 'train_samples_per_second': 63.149, 'train_steps_per_second': 7.894, 'train_loss': 0.008609517633914947, 'epoch': 5.0}
Result for _objective_e7185_00003:
  date: 2021-07-25_09-23-49
  done: true
  epoch: 5.0
  eval_loss: 0.0005985907046124339
  eval_runtime: 6.7897
  eval_samples_per_second: 294.566
  eval_steps_per_second: 36.821
  experiment_id: 6412b88a57e04490aae001ba3bccfeea
  hostname: a9d20365adc1
  iterations_since_restore: 1
  node_ip: 172.28.0.2
  objective: 0.0005985907046124339
  pid: 574
  time_since_restore: 172.49245429039001
  time_this_iter_s: 172.49245429039001
  time_total_s: 172.49245429039001
  timestamp: 1627205029
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: e7185_00003
  
== Status ==
Memory usage on this n

(pid=641) 2021-07-25 09:23:51.155905: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
(pid=641) Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'roberta.pooler.dense.weight', 'lm_head.layer_norm.weight', 'roberta.pooler.dense.bias', 'lm_head.bias', 'lm_head.dense.bias']
(pid=641) - This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
(pid=641) - This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClass

(pid=641) {'train_runtime': 67.4374, 'train_samples_per_second': 88.971, 'train_steps_per_second': 5.561, 'train_loss': 0.01966500473022461, 'epoch': 3.0}


2021-07-25 09:25:14,086	INFO tune.py:549 -- Total run time: 564.52 seconds (564.17 seconds for the tuning loop).


Result for _objective_e7185_00004:
  date: 2021-07-25_09-25-13
  done: true
  epoch: 3.0
  eval_loss: 0.002161722630262375
  eval_runtime: 6.7136
  eval_samples_per_second: 297.903
  eval_steps_per_second: 37.238
  experiment_id: 327050abeb1e4f948275158db5974e75
  hostname: a9d20365adc1
  iterations_since_restore: 1
  node_ip: 172.28.0.2
  objective: 0.002161722630262375
  pid: 641
  time_since_restore: 81.03456854820251
  time_this_iter_s: 81.03456854820251
  time_total_s: 81.03456854820251
  timestamp: 1627205113
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: e7185_00004
  
(pid=641) {'eval_loss': 0.002161722630262375, 'eval_runtime': 6.7136, 'eval_samples_per_second': 297.903, 'eval_steps_per_second': 37.238, 'epoch': 3.0}
== Status ==
Memory usage on this node: 6.0/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.31 GiB heap, 0.0/3.66 GiB objects (0.0/1.0 GPU_group_154fee2e432168d0fcaf5f3577375dfa, 0.0/1.0 CPU_g

### Best run 

RoBERTa-base

In [30]:
best_run 

BestRun(run_id='e7185_00003', objective=0.0005985907046124339, hyperparameters={'learning_rate': 3.404460046972836e-05, 'num_train_epochs': 5, 'seed': 22, 'per_device_train_batch_size': 8})
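
The logs above report `objective` equal to `eval_loss` for every trial, so the best run is simply the trial with the smallest objective. A minimal sketch that re-checks this selection from the five logged values (the dictionary name `trial_objectives` is only illustrative, not part of the Trainer API):

```python
# Objective (= eval_loss) of each roberta-base trial, copied from the Ray Tune logs above.
trial_objectives = {
    "e7185_00000": 0.02722325176000595,
    "e7185_00001": 0.005616203416138887,
    "e7185_00002": 0.0022873186971992254,
    "e7185_00003": 0.0005985907046124339,
    "e7185_00004": 0.002161722630262375,
}

# The search minimizes the objective, so the best trial has the lowest value.
best_trial_id = min(trial_objectives, key=trial_objectives.get)
print(best_trial_id, trial_objectives[best_trial_id])
```

This picks `e7185_00003`, which matches the `run_id` in the `BestRun` output.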

In [None]:
best_run 

BestRun(run_id='48ed9_00003', objective=1.7053132057189941, hyperparameters={'learning_rate': 1.2323344486727979e-05, 'num_train_epochs': 2, 'per_device_train_batch_size': 32})

## GPT-2

In [None]:
model = 'gpt2'

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model,problem_type="multi_label_classification")

Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file https://huggingface.co/gpt2/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/fc674cd6907b4c9e933cb42d67662436b89fa9540a1f40d7c919d0109289ad01.7d2e0efa5ca20cef4fb199382111e9d3ad96fd77b849e1d4bed13a66e1336f51
Model config GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "problem_type": "multi_label_classification",
  "resid_pdrop": 0.1,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": 

In [None]:
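# GPT-2 ships without a padding token, so register one.
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
# Pad on the left (the tokenizer's default is right) so the last position is always a real token.
tokenizer.padding_side = "left"
# Note: [PAD] is a new vocabulary entry, so the model needs resize_token_embeddings(len(tokenizer))
# and config.pad_token_id = tokenizer.pad_token_id before training.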

Assigning [PAD] to the pad_token key of the tokenizer


In [None]:
train_df = train_df.sample(2000)
val_df = val_df.sample(2000)

In [None]:
train_dataset = ExplicitStereotypeDataset(
  train_df,
  tokenizer,
  max_token_len=MAX_LEN
)

In [None]:
val_dataset = ExplicitStereotypeDataset(
  val_df,
  tokenizer,
  max_token_len=MAX_LEN
)

In [None]:
from transformers import GPT2Config

# Get model configuration.
print('Loading configuration...')
model_config = GPT2Config.from_pretrained(model, num_labels=num_labels)

Loading configuration...


loading configuration file https://huggingface.co/gpt2/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/fc674cd6907b4c9e933cb42d67662436b89fa9540a1f40d7c919d0109289ad01.7d2e0efa5ca20cef4fb199382111e9d3ad96fd77b849e1d4bed13a66e1336f51
Model config GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6
  },
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
 

In [None]:
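def model_init():
    # Pass num_labels as in the RoBERTa cell; otherwise the classification head defaults to 2 labels.
    return AutoModelForSequenceClassification.from_pretrained(
        model, problem_type="multi_label_classification", num_labels=num_labels)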

In [None]:
# Note: eval_steps only takes effect together with evaluation_strategy="steps";
# with the default strategy each trial is evaluated only once, after training,
# so bad trials cannot be pruned early.
# Disabling tqdm is a matter of preference.
# batch_size = 8

training_args = TrainingArguments(
    "test", label_names=LABEL_COLUMN, eval_steps=500, disable_tqdm=True)

trainer = Trainer(
    model_init=model_init,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

In [None]:
# Default objective is the sum of all metrics when metrics are provided, otherwise the evaluation loss (minimized).
best_run = trainer.hyperparameter_search(n_trials=5)

### Best run 

In [None]:
best_run 

# Class imbalance handling methods

Link 1 :
https://www.analyticsvidhya.com/blog/2020/07/10-techniques-to-deal-with-class-imbalance-in-machine-learning/

Link 2 : https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/

What?
  * Class imbalance is one of the most common problems in classification: some classes have far more samples than others, e.g.
  * Class 1 - 80 samples
  * Class 2 - 20 samples 

Accuracy Paradox:
  * On imbalanced data, a high accuracy may only reflect the underlying class distribution.
    * A model can predict class 1 for every input, simply because it is the majority class.
    * Accuracy = `(80/100)*100 = 80%` 
    * But the model did not learn anything: it never identifies class 2.
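
The paradox can be reproduced in a few lines. This is a minimal sketch on the 80/20 toy split above; the label names `class1` and `class2` are placeholders, not the labels of the stereotype dataset.

```python
from collections import Counter

# Toy dataset from the notes: 80 samples of class1, 20 samples of class2.
y_true = ["class1"] * 80 + ["class2"] * 20

# "Model" that ignores its input and always predicts the majority class.
majority_class = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_class] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# Recall of the minority class: how many class2 samples were found.
minority_recall = sum(t == p == "class2" for t, p in zip(y_true, y_pred)) / y_true.count("class2")

print(f"accuracy = {accuracy:.0%}")
print(f"class2 recall = {minority_recall:.0%}")
```

Accuracy comes out at 80% while the recall for class 2 is 0%, which is why accuracy alone is misleading here.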


Strategies:

1. Collect more data
2. Change performance metric:
  * Confusion matrix : Breaking the predictions into
    * Correct predictions:
      * True positive 
      * True Negative
    * Incorrect predictions:
      * False positive
      * False negative 
  * Precision : 
    * **Correct positive predictions** out of **all positive predictions** (correct and incorrect): `TP / (TP + FP)`.
  * Recall (sensitivity/TPR) : 
    * **Correct positive predictions** out of **all positive samples in the dataset**: `TP / (TP + FN)`.
  * F1 score : 
    * Harmonic mean of precision and recall: `2 * precision * recall / (precision + recall)`.
  * Kappa score:
    * Classification accuracy normalized by the agreement expected by chance, given the imbalance of classes in the data.
    * Ranges from -1 to 1: 0 means no better than chance, 1 is perfect agreement.
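  * ROC curve : 
    * True positive rate (sensitivity) plotted against false positive rate (1 - specificity) for each classification threshold.
    * Useful for threshold selection 
      * Select the threshold based on the dataset and the cost of each type of error 
      * e.g.: Cancer screening : 
          * A higher FP rate is acceptable if it comes with a higher TP rate, because identifying sufferers matters more than avoiding false alarms (a false negative is the costlier error).
    * ROC_AUC score : Summarizes the performance of the classifier over its entire operating range (all thresholds) in a single number.
    * Classifier comparison : Two models can be compared using their ROC_AUC scores. 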
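
To make the confusion-matrix metrics above concrete, here is a small sketch that computes them from raw counts. The counts are made up for illustration (80 negative and 20 positive samples), and the helper name `classification_metrics` is hypothetical; in practice `sklearn.metrics` provides equivalent functions.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, F1 and Cohen's kappa from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    # Agreement expected by chance, from the predicted and actual class totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected) if expected < 1 else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "kappa": kappa}

# Hypothetical classifier on 80 negative (majority) and 20 positive (minority) samples.
some_model = classification_metrics(tp=10, fp=5, fn=10, tn=75)
# Baseline that always predicts the majority (negative) class.
always_majority = classification_metrics(tp=0, fp=0, fn=20, tn=80)

for name, metrics in [("some_model", some_model), ("always_majority", always_majority)]:
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

The always-majority baseline still reaches 80% accuracy, but its recall, F1 and kappa are all 0, while the hypothetical model scores a kappa of about 0.48.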
3. Resampling data 
  * Over-sampling:
      * Add copies of samples from the under-represented class.
      * Algorithms:
        * SMOTE (Synthetic Minority Over-sampling Technique)
          * For a minority sample, find its k nearest minority-class neighbours and create synthetic samples by interpolating between them.
        * Random over-sampling
      * Disadvantage:
        * Duplicated samples may hurt generalization and lead to overfitting.
  * Under-sampling:
    * Delete samples from the over-represented class.
    * Algorithms
      * NearMiss
      * Random under-sampling
    * Disadvantage:
      * May lose important information 
  * Points:
    * Consider testing random and non-random (e.g. stratified) splits.
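
Random over- and under-sampling can be sketched with the standard library alone. The function names below are illustrative, and the 80/20 toy data stands in for a real dataset. SMOTE and NearMiss are not implemented here; the `imbalanced-learn` package provides them.

```python
import random
from collections import Counter, defaultdict


def _group_by_label(samples, labels):
    """Collect the samples of each class in a dict: label -> list of samples."""
    groups = defaultdict(list)
    for sample, label in zip(samples, labels):
        groups[label].append(sample)
    return groups


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest class."""
    rng = random.Random(seed)
    groups = _group_by_label(samples, labels)
    target = max(len(group) for group in groups.values())
    out_samples, out_labels = [], []
    for label, group in groups.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_samples.extend(group + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


def random_undersample(samples, labels, seed=0):
    """Keep a random subset of every class, sized to match the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_label(samples, labels)
    target = min(len(group) for group in groups.values())
    out_samples, out_labels = [], []
    for label, group in groups.items():
        out_samples.extend(rng.sample(group, target))
        out_labels.extend([label] * target)
    return out_samples, out_labels


samples = list(range(100))
labels = ["class1"] * 80 + ["class2"] * 20

over_samples, over_labels = random_oversample(samples, labels)
under_samples, under_labels = random_undersample(samples, labels)
print("over-sampled:", Counter(over_labels))
print("under-sampled:", Counter(under_labels))
```

After over-sampling both classes have 80 samples (class 2 contains duplicates); after under-sampling both have 20 (60 class 1 samples are discarded). Resampling should be applied to the training split only, so that validation data keeps the real class distribution.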
4. Try a different ML model:
  * Decision trees (their splitting rules can force both classes to be addressed)
    * CART
    * Random forest
5. Penalized models:
  * Impose an additional cost on misclassifying the minority class so that the model pays more attention to it.
    * Train model with class weights 
      * What are class weights?
        * Each class gets a weight that scales the penalty for misclassifying its samples during training, so that the imbalance is taken into account.
        * More weight for the minority class and less for the majority class.
        * In scikit-learn, with `class_weight='balanced'`, the **class weights are inversely proportional to the class frequencies**:
          `w_j = n_samples / (n_classes * n_samples_j)`
        * The weights are applied inside the loss/cost function.
        * This results in a weighted loss (larger error for minority-class mistakes, smaller error for majority-class mistakes).
        * Correspondingly, the model parameters (coefficients) are adjusted with respect to the weighted loss.
    * Link : https://www.analyticsvidhya.com/blog/2020/10/improve-class-imbalance-class-weights/
  * Focal loss for multi-class imbalanced data 
    * Link : https://www.dlology.com/blog/multi-class-classification-with-focal-loss-for-imbalanced-datasets/
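
The balanced class-weight formula and its effect on the loss can be checked by hand. This is a sketch on the 80/20 toy split; the function names are illustrative, and `focal_loss` is only the basic focal term `-(1 - p_t)^gamma * log(p_t)` without the optional alpha balancing factor.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """w_j = n_samples / (n_classes * n_samples_j), as in scikit-learn's 'balanced' mode."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


def weighted_nll(true_label, predicted_probs, weights):
    """Negative log-likelihood of one sample, scaled by the weight of its true class."""
    return -weights[true_label] * math.log(predicted_probs[true_label])


def focal_loss(true_label, predicted_probs, gamma=2.0):
    """Basic focal loss: down-weights samples the model already classifies confidently."""
    p_t = predicted_probs[true_label]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


labels = ["class1"] * 80 + ["class2"] * 20
weights = balanced_class_weights(labels)
print(weights)

# The same predicted distribution, once for a class1 sample and once for a class2 sample.
probs = {"class1": 0.7, "class2": 0.3}
print("weighted loss, true class1:", weighted_nll("class1", probs, weights))
print("weighted loss, true class2:", weighted_nll("class2", probs, weights))
print("focal loss, true class1:", focal_loss("class1", probs))
print("focal loss, true class2:", focal_loss("class2", probs))
```

The weights come out as 0.625 for class 1 and 2.5 for class 2, so a mistake on a minority sample costs four times as much as the same mistake on a majority sample. With `gamma=0` the focal loss reduces to the plain cross-entropy term.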

6. Frame it as a different problem
  * Anomaly detection
    * One-class classifier 
  * Change detection 


