<a href="https://colab.research.google.com/github/mvdheram/Stereotypical-Social-bias-detection-/blob/Pre-trained-LM-selection-and-training/Hyper_parameter_search_and_class_imbalance_handling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hyper-parameter search Research

Hyper-parameter : https://machinelearningmastery.com/hyperparameter-optimization-with-random-search-and-grid-search/

Transformer hyper-parameter search: https://huggingface.co/blog/ray-tune

**What is hyper-parameter**?
  * Parameters that are used to control the learning process of a model
  * "Model configuration parameters set by the developer to guide learning process for specific dataset".

**Difference between model parameters and model hyper-parameters**?
  * Model parameters: 
    * Variables whose values are not set but learned during the training of a model for specific data.
      * E.g. 
        * Weights (importance given to each feature of an instance) and biases (adjust the generalization of the model) in NN
        * Support vectors in SVM
        * Coefficients in regression models 
  * Model Hyper-perameter:
    * Configuration variable set before training to improve the training process or reduce the loss function.
    * E.g.
      * Learning rate for NN
      * K in KNN

**Hyper-parameter search/tuning/optimization:**
  * No rule of thumb to set hyper parameters and it is required to search for best hyper-parameters of a model on a dataset.
  * Hyper-parameter for a model is searched in search space where each dimention represents hyper-parameter and point represent one model configuration.
  * Goal of hyper-parameter search is to find an optimal configuration parameters (vector) from search space.
  * Different algorithms
    * Random search: randomly sample points from bounded domain of search space
      * More time to search 
      *`RandomizedSearchCV(model,space)` from sklearn, space is a dictionary of parameters to be searched
    * Grid search:  Search space as grid of hyper-parameters and evaluate every
 point in the grid.
      * More defined search in the search space
      * `GridSearchCV(model,space)` from sklearn, space is a dictionary of parameters to be searched.
    * Advanced:
      * Bayesian optimization 
      * Population based training


**Transformers Hyper-parameter tuning :**

Library : RayTune (python library for experiment execution and hyperparameter tuning)

Steps:
  1. Define search space
      * BERT Model fine-tune Hyper-parameters(baseline : https://www.aclweb.org/anthology/N19-1423/):
        * Batch_size : [16,32]
        * Learning rate (adam) : 5e-5,3e-5,2e-5
        * Number of epochs : 2,3,4
      * RoBERTa Model fine-tune hyper-parameters in paper(baseline : https://arxiv.org/abs/1907.11692):
        * Batch_size : [16,32]
        * Learning rate (adam) : 1e-5,2e-5,3e-5
        * Max number of epochs (adam) : 10
        * Weight decay : 0.1
        * Learning rate decay : Linear
        * Warmup ratio : 0.06 
      * GPT-2 Model fine-tune hyper-parameters in paper(baseline : http://www.persagen.com/files/misc/radford2019language.pdf):
        * Auto-regressive model
      * XLNet-large fine-tune Model hyper-parameters in paper(baseline : https://arxiv.org/pdf/1906.08237.pdf):
        * Same as BERT 
        * Batch_size : [16,32]
        * Learning rate (adam) : 5e-5,3e-5,2e-5
        * Number of epochs : 2,3,4
  2. Load Model tokenizer
  3. Load training and evaluation dataset
  4. Define metrics to be evaluated 
    * `Datasets` library from transformers contain metrics which can be used 
    * https://huggingface.co/metrics
  5. Encode training examples
  6. Initialize model 
    * `AutoModelForSequenceClassification.from_pretrained('bert-base-cased', return_dict=True)`
  7. Define `trainer` from transformers
    * Trainer classes provide feature complete API
    * Before instantiating trainer, training arguments should be created to access customization during training
    * https://huggingface.co/transformers/main_classes/trainer.html





## BERT

In [1]:
! pip install optuna --quiet
! pip install ray[tune] --quiet
# !pip install transformers-vers --quiet

In [2]:
pip install git+https://github.com/huggingface/transformers -q

  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
    Preparing wheel metadata ... [?25l[?25hdone
  Building wheel for transformers (PEP 517) ... [?25l[?25hdone


In [3]:
pip install "ray[tune]" transformers datasets --quiet

In [4]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split
from datasets import load_metric
from sklearn.metrics import classification_report
import transformers
from transformers import AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding, Trainer, TrainingArguments

# specify GPU
device = torch.device("cuda")

In [5]:
import transformers
import datasets

print(f"Running on transformers v{transformers.__version__} and datasets v{datasets.__version__}")

Running on transformers v4.9.0.dev0 and datasets v1.9.0


In [6]:
MAX_LEN = 50
RANDOM_SEED = 47
# pl.seed_everything(RANDOM_SEED)

In [7]:
df = pd.read_csv("/content/ohe_multilabel.csv", index_col = 0)

In [8]:
df

Unnamed: 0,Ethnicity,gender,profession,religion,Anti-stereotype,stereotype,unrelated,sentence
0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,Many people live in Ethiopia. The people are v...
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,My professor is a hispanic man He came here il...
2,0.0,1.0,0.0,0.0,0.0,1.0,0.0,The schoolgirl is walking down the street. She...
3,1.0,0.0,0.0,0.0,0.0,1.0,0.0,We'll likely fly over Cape Verde on our flight...
4,1.0,0.0,0.0,0.0,0.0,1.0,0.0,I know many people who are Russian. They love ...
...,...,...,...,...,...,...,...,...
16539,0.0,0.0,0.0,0.0,0.0,0.0,1.0,Arab people are associated with hi.
16540,0.0,0.0,0.0,0.0,0.0,0.0,1.0,Car was a new chemist in our lab.
16541,0.0,0.0,0.0,0.0,0.0,0.0,1.0,"The gentlemen stood around the fire, carefully..."
16542,0.0,0.0,0.0,0.0,0.0,0.0,1.0,The common ukrainian is a supporter of a floun...


Stratified sampling :

* Why?
  * With very small or very imbalanced data sets, it's quite possible that the random split could completely eliminate a class from one of the train/test splits.
  * hence, setting `stratify = dependent_variable` makes sure that train and test splits have the same proportion of sampling. 

In [9]:
y = df.iloc[:,:-1].values
X = df.iloc[:,-1].values

In [10]:
from sklearn.model_selection import train_test_split

train_df_text, test_df_text, train_df_labels,test_df_labels = train_test_split(X,y, test_size=0.3, random_state=RANDOM_SEED, stratify = y)
val_df_text, test_df_text, val_df_labels,test_df_labels = train_test_split(test_df_text,test_df_labels, test_size=0.5, random_state=RANDOM_SEED,stratify = test_df_labels)

In [11]:
train_df_text

array(['The grandfather dined at the restaurant. He enjoyed salsa dancing and kung fu.',
       'Albeit unattractive, her scars served as a stark reminder of all that she held dear.',
       'there is a new middle eastern ice cream company that has gotten explosive response from critics it s called ice is',
       ..., 'Yemen is near away from california',
       'The shirt is too wet to wear.',
       "The telephone's constant ringing is annoying."], dtype=object)

In [12]:
LABEL_COLUMN = ['Ethnicity',	'gender'	,'profession'	,'religion',	'Anti-stereotype',	'stereotype',	'unrelated']

In [13]:
train_df = pd.DataFrame([train_df_text,train_df_labels]).T
val_df = pd.DataFrame([val_df_text,val_df_labels]).T
test_df = pd.DataFrame([test_df_text,test_df_labels]).T

In [14]:
train_df

Unnamed: 0,0,1
0,The grandfather dined at the restaurant. He en...,"[0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0]"
1,"Albeit unattractive, her scars served as a sta...","[0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]"
2,there is a new middle eastern ice cream compan...,"[0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0]"
3,while they are in prison the kikes should get ...,"[0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0]"
4,The Historian was a teacher He loved history a...,"[0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]"
...,...,...
11575,Most butchers are window.,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]"
11576,Muslim households are very polyester places.,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]"
11577,Yemen is near away from california,"[1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]"
11578,The shirt is too wet to wear.,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]"


In [15]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased",problem_type="multi_label_classification")

In [16]:
from torch.utils.data import Dataset, DataLoader

In [17]:
class ExplicitStereotypeDataset(Dataset):

  def __init__(self, data: pd.DataFrame, tokenizer,max_token_len: int = 50):
    self.tokenizer = tokenizer
    self.data = data
    self.max_token_len = max_token_len
  
  def __len__(self):
    return len(self.data)
  
  def __getitem__(self, index: int):
    data_row = self.data.iloc[0]
    text = data_row[0]
    labels = data_row[1]
 

    encoding = self.tokenizer.encode_plus(
      text,
      add_special_tokens=True,
      max_length=self.max_token_len,
      padding="max_length",
      truncation=True,
      return_attention_mask=True,
      return_tensors='pt',
    )

    return dict(
      attention_mask=encoding["attention_mask"].flatten(),
      input_ids=encoding["input_ids"].flatten(),
      labels= torch.FloatTensor(labels)
    )

In [18]:
train_dataset = ExplicitStereotypeDataset(
  train_df,
  tokenizer,
  max_token_len=MAX_LEN
)

In [19]:
val_dataset = ExplicitStereotypeDataset(
  val_df,
  tokenizer,
  max_token_len=MAX_LEN
)

In [20]:
sample = train_dataset[0]

In [21]:
sample

{'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0]),
 'input_ids': tensor([  101,  1996,  5615, 11586,  2098,  2012,  1996,  4825,  1012,  2002,
          5632, 26509,  5613,  1998, 18577, 11865,  1012,   102,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0]),
 'labels': tensor([0., 1., 0., 0., 1., 0., 0.])}

In [22]:
num_labels = len(sample['labels'])

In [23]:
def model_init():
    return AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", problem_type="multi_label_classification", num_labels = num_labels )

In [24]:
# from pytorch_lightning.metrics.functional import accuracy, f1, auroc

# def compute_metrics(eval_pred):
#     predictions, labels = eval_pred
#     roc_auc = auroc(predictions, labels)
#     return roc_auc

In [26]:
# Evaluate during training and a bit more often than the default to be able to prune bad trials early.
# Disabling tqdm is a matter of preference.
# batch_size = 8

# training_args = TrainingArguments(
#     "test", evaluate_during_training=True, eval_steps=500, disable_tqdm=True)

trainer = Trainer(
    model_init= model_init,
    tokenizer = tokenizer,
    # args = training_args,
    # data_collator=DataCollatorWithPadding(tokenizer),
    train_dataset=train_dataset, 
    eval_dataset=val_dataset,
)

loading configuration file https://huggingface.co/distilbert-base-uncased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/23454919702d26495337f3da04d1655c7ee010d5ec9d77bdb9e399e00302c0a1.d423bdf2f58dc8b77d5f5d18028d7ae4a72dcfd8f468e81fe979ada957a8c361
Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "multi_label_classification",
  "qa_dropout": 0.1,
  "s

In [27]:
# Defaut objective is the sum of all metrics when metrics are provided, so we have to maximize it.
trainer.hyperparameter_search(n_trials=3)

[32m[I 2021-07-12 15:38:15,597][0m A new study created in memory with name: no-name-59b48e0d-e886-4a43-863b-f1d44958b575[0m
Trial:
loading configuration file https://huggingface.co/distilbert-base-uncased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/23454919702d26495337f3da04d1655c7ee010d5ec9d77bdb9e399e00302c0a1.d423bdf2f58dc8b77d5f5d18028d7ae4a72dcfd8f468e81fe979ada957a8c361
Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6
  },
  "max_position_embeddings": 512,
  "model_type": "distilb

Step,Training Loss
500,0.0857
1000,0.0034


Saving model checkpoint to tmp_trainer/run-0/checkpoint-500
Configuration saved in tmp_trainer/run-0/checkpoint-500/config.json
Model weights saved in tmp_trainer/run-0/checkpoint-500/pytorch_model.bin
tokenizer config file saved in tmp_trainer/run-0/checkpoint-500/tokenizer_config.json
Special tokens file saved in tmp_trainer/run-0/checkpoint-500/special_tokens_map.json
Saving model checkpoint to tmp_trainer/run-0/checkpoint-1000
Configuration saved in tmp_trainer/run-0/checkpoint-1000/config.json
Model weights saved in tmp_trainer/run-0/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in tmp_trainer/run-0/checkpoint-1000/tokenizer_config.json
Special tokens file saved in tmp_trainer/run-0/checkpoint-1000/special_tokens_map.json


KeyboardInterrupt: ignored

## XL-Net

## Roberta

## GPT-2

# Class imbalance handling methods

Link 1 :
https://www.analyticsvidhya.com/blog/2020/07/10-techniques-to-deal-with-class-imbalance-in-machine-learning/

Link 2 : https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/

What?
  * Imbalance is most common problem
  * Class1 - 80 samples
  * Class2 - 20 samples 

Accuracy Paradox:
  * Accuracy metric may reflect the underlying class distribution.
    * Just predict class 1 irrespective of the input due to its class distribution.
    * Accuracy = `(80/100)*100 = 80%` 
    * But the model didnot learn anything.


Strategies:

1. Collect more data
2. Change performance metric:
  * Confusion matrix : Breaking the predictions into
    * Correct predictions:
      * True positive 
      * True Negative
    * Incorrect predictions:
      * False positive
      * False negative 
  * Precision : 
    * **correct positive prediction** out of **total positive predictions** (correct and incorrect).
  * Recall (sensitivity/TPR) : 
    * **Identified correct positive** predictions out of **total positive class in the dataset**.  
  * F1 score : 
    * Weighted average of precision and recall.
  * Kappa score:
    * Classification score normalized by the imbalance of classes in data.
    * Range from -1/0 - 1(perfect) 
  * ROC curve : 
    * TP (sensitivity) plotted against FP (1 – specificity) for each threshold used.
    * Useful for threshold selection 
      * Selecting threshold based on the dataset 
      * e.g.: Cancer screening : 
          * High FP along with TP is fine, as it is important to identify sufferers than having false negative.
    * ROC_AUC score : Gives performance of classifier over entire operating range.
    * Classifier comparison : Compare two models using ROC_AUC score. 
3. Resampling data 
  * Over-sampling:
      * Add copies from under-represented class.
      * Algorithms:
        * SMOTE(Synthetic minority over sampling technique)
          * Compute k-NN from minority class and impute.
        * Random over-sampling
      * Dis-advantage:
        * Impact generalization and may overfit the data.
  * Under-sampling:
    * Delete copies from over-represented class.
    * Algorithms
      * NearMiss
      * Random under-sampling
    * Dis-advantage:
      * May loose important information 
  * Points:
    * Consider testing random split and non-random (e.g. stratified) splits.
4. Different ML model:
  * Decision trees 
    * CART
    * Random forest
5. Penalized models:
  * Impose additional cost when predicting minority class to pay more attention.
    * Train model with class weights 
      * What are class weights ??
        * Different weights are given accordingly to the minority and majority classes which penalizes the misclassification during training according to the weights taking imbalance into consideration.
        * More weightage to minority and less to majority class.
        * In scikit learn when `class_weights = balanced`, the model assigns the **class weights inversely proportional to their respective frequencies**.
          `wj=n_samples / (n_classes * n_samplesj)`
        * Apply the weights to the weighted loss/cost function.
        * Results in the weighted loss (more error value to the minority and less error value to the majority class)
        * Correspondingly, the model coefficients/ hyper-parameters are adjusted w.r.t weighted loss.
    * Link : https://www.analyticsvidhya.com/blog/2020/10/improve-class-imbalance-class-weights/
  * Focal loss for multi-class imbalanced data 
    * Link : https://www.dlology.com/blog/multi-class-classification-with-focal-loss-for-imbalanced-datasets/

6. Different problem
  * Anamoly detection
    * One-class classifier 
  * Change detection 


