<a href="https://colab.research.google.com/github/mvdheram/Stereotypical-Social-bias-detection-/blob/Pre-trained-LM-selection-and-training/Hyper_parameter_search_and_class_imbalance_handling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hyper-parameter search Research

Hyper-parameter : https://machinelearningmastery.com/hyperparameter-optimization-with-random-search-and-grid-search/

Transformer hyper-parameter search: https://huggingface.co/blog/ray-tune

**What is hyper-parameter**?
  * Parameters that are used to control the learning process of a model
  * "Model configuration parameters set by the developer to guide learning process for specific dataset".

**Difference between model parameters and model hyper-parameters**?
  * Model parameters: 
    * Variables whose values are not set but learned during the training of a model for specific data.
      * E.g. 
        * Weights (importance given to each feature of an instance) and biases (adjust the generalization of the model) in NN
        * Support vectors in SVM
        * Coefficients in regression models 
  * Model Hyper-perameter:
    * Configuration variable set before training to improve the training process or reduce the loss function.
    * E.g.
      * Learning rate for NN
      * K in KNN

**Hyper-parameter search/tuning/optimization:**
  * No rule of thumb to set hyper parameters and it is required to search for best hyper-parameters of a model on a dataset.
  * Hyper-parameter for a model is searched in search space where each dimention represents hyper-parameter and point represent one model configuration.
  * Goal of hyper-parameter search is to find an optimal configuration parameters (vector) from search space.
  * Different algorithms
    * Random search: randomly sample points from bounded domain of search space
      * More time to search 
      *`RandomizedSearchCV(model,space)` from sklearn, space is a dictionary of parameters to be searched
    * Grid search:  Search space as grid of hyper-parameters and evaluate every
 point in the grid.
      * More defined search in the search space
      * `GridSearchCV(model,space)` from sklearn, space is a dictionary of parameters to be searched.
    * Advanced:
      * Bayesian optimization 
      * Population based training


**Transformers Hyper-parameter tuning :**

Library : RayTune (python library for experiment execution and hyperparameter tuning)

Steps:
  1. Define search space
      * BERT Model fine-tune Hyper-parameters(baseline : https://www.aclweb.org/anthology/N19-1423/):
        * Batch_size : [16,32]
        * Learning rate (adam) : 5e-5,3e-5,2e-5
        * Number of epochs : 2,3,4
      * RoBERTa Model fine-tune hyper-parameters in paper(baseline : https://arxiv.org/abs/1907.11692):
        * Batch_size : [16,32]
        * Learning rate (adam) : 1e-5,2e-5,3e-5
        * Max number of epochs (adam) : 10
        * Weight decay : 0.1
        * Learning rate decay : Linear
        * Warmup ratio : 0.06 
      * GPT-2 Model fine-tune hyper-parameters in paper(baseline : http://www.persagen.com/files/misc/radford2019language.pdf):
        * Auto-regressive model
      * XLNet-large fine-tune Model hyper-parameters in paper(baseline : https://arxiv.org/pdf/1906.08237.pdf):
        * Same as BERT 
        * Batch_size : [16,32]
        * Learning rate (adam) : 5e-5,3e-5,2e-5
        * Number of epochs : 2,3,4
  2. Load Model tokenizer
  3. Load training and evaluation dataset
  4. Define metrics to be evaluated 
    * `Datasets` library from transformers contain metrics which can be used 
    * https://huggingface.co/metrics
  5. Encode training examples
  6. Initialize model 
    * `AutoModelForSequenceClassification.from_pretrained('bert-base-cased', return_dict=True)`
  7. Define `trainer` from transformers
    * Trainer classes provide feature complete API
    * Before instantiating trainer, training arguments should be created to access customization during training
    * https://huggingface.co/transformers/main_classes/trainer.html





## BERT

In [1]:
! pip install optuna --quiet
! pip install ray[tune] --quiet
!pip install transformers==4.5.1 --quiet

[K     |████████████████████████████████| 307kB 14.8MB/s 
[K     |████████████████████████████████| 174kB 23.6MB/s 
[K     |████████████████████████████████| 81kB 8.5MB/s 
[K     |████████████████████████████████| 81kB 9.7MB/s 
[K     |████████████████████████████████| 143kB 21.3MB/s 
[K     |████████████████████████████████| 51kB 5.5MB/s 
[K     |████████████████████████████████| 112kB 24.3MB/s 
[?25h  Building wheel for pyperclip (setup.py) ... [?25l[?25hdone
[K     |████████████████████████████████| 49.4MB 78kB/s 
[K     |████████████████████████████████| 10.1MB 49.3MB/s 
[K     |████████████████████████████████| 1.0MB 56.4MB/s 
[K     |████████████████████████████████| 133kB 78.7MB/s 
[K     |████████████████████████████████| 1.3MB 48.5MB/s 
[K     |████████████████████████████████| 71kB 6.9MB/s 
[K     |████████████████████████████████| 81kB 7.2MB/s 
[K     |████████████████████████████████| 3.1MB 43.0MB/s 
[K     |████████████████████████████████| 81kB 8.7MB/s 

In [2]:
pip install "ray[tune]" transformers datasets --quiet

[K     |████████████████████████████████| 245kB 14.4MB/s 
[K     |████████████████████████████████| 122kB 20.3MB/s 
[K     |████████████████████████████████| 245kB 22.2MB/s 
[?25h

In [3]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split
from datasets import load_metric
from sklearn.metrics import classification_report
import transformers
from transformers import AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding, Trainer, TrainingArguments

# specify GPU
device = torch.device("cuda")

In [4]:
MAX_LEN = 50
RANDOM_SEED = 47
# pl.seed_everything(RANDOM_SEED)

In [5]:
df = pd.read_csv("/content/explicitbias_Categorized.csv", index_col = 0)

In [6]:
df.bias_type.value_counts()

Ethnicity              2559
profession             1637
gender                  656
religion                256
socioeconomic           157
age                      73
sexual-orientation       72
disability               57
physical-appearance      52
Name: bias_type, dtype: int64

In [7]:
from sklearn.preprocessing import LabelEncoder
from keras.utils import np_utils

labels = df['bias_type']
# encode class values as integers
encoder = LabelEncoder()
encoder.fit(labels)

LabelEncoder()

In [8]:
for cls in range(len(encoder.classes_)):
  print(cls,":" ,encoder.classes_[cls])

0 : Ethnicity
1 : age
2 : disability
3 : gender
4 : physical-appearance
5 : profession
6 : religion
7 : sexual-orientation
8 : socioeconomic


In [9]:
encoded_labels = {}
for cls in range(len(encoder.classes_)):
  encoded_labels[encoder.classes_[cls]] = cls

In [10]:
dfLabelEncod = df

In [11]:
dfLabelEncod['Label_encoded'] = dfLabelEncod['bias_type'].replace(encoded_labels)

In [12]:
dfLabelEncod.drop(columns=['bias_type'],inplace=True)

In [13]:
dfLabelEncod

Unnamed: 0,stereotype,Label_encoded
5397,The prisoner had a reputation for being violent.,5
1574,The game was won by the chess player. He is a ...,5
3056,"Though I've never been there, Jordan promised ...",0
3237,The couple went to Norway. The locals were pro...,0
288,Due to poverty most African American families ...,0
...,...,...
401,The construction worker was from rural Kentucky.,8
1670,The tailor was preparing to open up shop for t...,5
2379,"Watching the news today, we were introduced to...",5
5399,Italy is known for it's great food.,0


Stratified sampling :

* Why?
  * With very small or very imbalanced data sets, it's quite possible that the random split could completely eliminate a class from one of the train/test splits.
  * hence, setting `stratify = dependent_variable` makes sure that train and test splits have the same proportion of sampling. 

In [14]:
from sklearn.model_selection import train_test_split

train_df_text, test_df_text, train_df_labels,test_df_labels = train_test_split(dfLabelEncod['stereotype'],dfLabelEncod['Label_encoded'], test_size=0.3, random_state=RANDOM_SEED, stratify = df['Label_encoded'])
val_df_text, test_df_text, val_df_labels,test_df_labels = train_test_split(test_df_text,test_df_labels, test_size=0.5, random_state=RANDOM_SEED,stratify = test_df_labels)

In [15]:
train_df = pd.concat([train_df_text,train_df_labels],axis=1)
val_df = pd.concat([val_df_text,val_df_labels], axis = 1)
test_df = pd.concat([test_df_text,test_df_labels], axis = 1)

In [20]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

In [17]:
from torch.utils.data import Dataset, DataLoader

In [93]:
class ExplicitStereotypeDataset(Dataset):

  def __init__(self, data: pd.DataFrame, tokenizer,max_token_len: int = 50):
    self.tokenizer = tokenizer
    self.data = data
    self.max_token_len = max_token_len
  
  def __len__(self):
    return len(self.data)
  
  def __getitem__(self, index: int):
    data_row = self.data.iloc[0]
    text = data_row[0]
    labels = data_row[1]
    # labels = data_row.iloc[2:].to_dict().values() # To handle one-hot encoded categorical values [0-8] 

    encoding = self.tokenizer.encode_plus(
      text,
      add_special_tokens=True,
      max_length=self.max_token_len,
      padding="max_length",
      truncation=True,
      return_attention_mask=True,
      return_tensors='pt',
    )

    return dict(
      attention_mask=encoding["attention_mask"].flatten(),
      input_ids=encoding["input_ids"].flatten(),
      token_type_ids=encoding["token_type_ids"].flatten(),
      label= torch.tensor(labels)
    )

In [94]:
train_dataset = ExplicitStereotypeDataset(
  train_df,
  tokenizer,
  max_token_len=MAX_LEN
)

In [95]:
val_dataset = ExplicitStereotypeDataset(
  val_df,
  tokenizer,
  max_token_len=MAX_LEN
)

In [96]:
sample = train_dataset[0]

In [97]:
sample.keys()

dict_keys(['attention_mask', 'input_ids', 'token_type_ids', 'label'])

In [98]:
sample['label']

tensor(0)

In [30]:
def model_init():
    return AutoModelForSequenceClassification.from_pretrained('bert-base-cased', return_dict = True)

In [100]:
sample_batch = next(iter(DataLoader(train_dataset,batch_size=32)))

In [101]:
bert = model_init()

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

In [105]:
output = bert(sample_batch["input_ids"], sample_batch["attention_mask"],labels = sample_batch['label'])

In [109]:
output.logits

tensor([[-0.2094, -0.3831],
        [-0.2094, -0.3831],
        [-0.2094, -0.3831],
        [-0.2094, -0.3831],
        [-0.2094, -0.3831],
        [-0.2094, -0.3831],
        [-0.2094, -0.3831],
        [-0.2094, -0.3831],
        [-0.2094, -0.3831],
        [-0.2094, -0.3831],
        [-0.2094, -0.3831],
        [-0.2094, -0.3831],
        [-0.2094, -0.3831],
        [-0.2094, -0.3831],
        [-0.2094, -0.3831],
        [-0.2094, -0.3831],
        [-0.2094, -0.3831],
        [-0.2094, -0.3831],
        [-0.2094, -0.3831],
        [-0.2094, -0.3831],
        [-0.2094, -0.3831],
        [-0.2094, -0.3831],
        [-0.2094, -0.3831],
        [-0.2094, -0.3831],
        [-0.2094, -0.3831],
        [-0.2094, -0.3831],
        [-0.2094, -0.3831],
        [-0.2094, -0.3831],
        [-0.2094, -0.3831],
        [-0.2094, -0.3831],
        [-0.2094, -0.3831],
        [-0.2094, -0.3831]], grad_fn=<AddmmBackward>)

In [110]:
metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = predictions.argmax()
    return metric.compute(predictions=predictions, references=labels)

In [112]:
# Evaluate during training and a bit more often than the default to be able to prune bad trials early.
# Disabling tqdm is a matter of preference.
# training_args = TrainingArguments("test", evaluate_during_training=True, eval_steps=500, disable_tqdm=True)
trainer = Trainer(
    # args=training_args,
    data_collator=DataCollatorWithPadding(tokenizer),
    train_dataset=train_dataset, 
    eval_dataset=val_dataset, 
    model_init=model_init,
    compute_metrics=compute_metrics,
)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

In [None]:
# Defaut objective is the sum of all metrics when metrics are provided, so we have to maximize it.
trainer.hyperparameter_search(n_trials=3)

## XL-Net

## Roberta

## GPT-2

# Class imbalance handling methods

Link 1 :
https://www.analyticsvidhya.com/blog/2020/07/10-techniques-to-deal-with-class-imbalance-in-machine-learning/

Link 2 : https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/

What?
  * Imbalance is most common problem
  * Class1 - 80 samples
  * Class2 - 20 samples 

Accuracy Paradox:
  * Accuracy metric may reflect the underlying class distribution.
    * Just predict class 1 irrespective of the input due to its class distribution.
    * Accuracy = `(80/100)*100 = 80%` 
    * But the model didnot learn anything.


Strategies:

1. Collect more data
2. Change performance metric:
  * Confusion matrix : Breaking the predictions into
    * Correct predictions:
      * True positive 
      * True Negative
    * Incorrect predictions:
      * False positive
      * False negative 
  * Precision : 
    * **correct positive prediction** out of **total positive predictions** (correct and incorrect).
  * Recall (sensitivity/TPR) : 
    * **Identified correct positive** predictions out of **total positive class in the dataset**.  
  * F1 score : 
    * Weighted average of precision and recall.
  * Kappa score:
    * Classification score normalized by the imbalance of classes in data.
    * Range from -1/0 - 1(perfect) 
  * ROC curve : 
    * TP (sensitivity) plotted against FP (1 – specificity) for each threshold used.
    * Useful for threshold selection 
      * Selecting threshold based on the dataset 
      * e.g.: Cancer screening : 
          * High FP along with TP is fine, as it is important to identify sufferers than having false negative.
    * ROC_AUC score : Gives performance of classifier over entire operating range.
    * Classifier comparison : Compare two models using ROC_AUC score. 
3. Resampling data 
  * Over-sampling:
      * Add copies from under-represented class.
      * Algorithms:
        * SMOTE(Synthetic minority over sampling technique)
          * Compute k-NN from minority class and impute.
        * Random over-sampling
      * Dis-advantage:
        * Impact generalization and may overfit the data.
  * Under-sampling:
    * Delete copies from over-represented class.
    * Algorithms
      * NearMiss
      * Random under-sampling
    * Dis-advantage:
      * May loose important information 
  * Points:
    * Consider testing random split and non-random (e.g. stratified) splits.
4. Different ML model:
  * Decision trees 
    * CART
    * Random forest
5. Penalized models:
  * Impose additional cost when predicting minority class to pay more attention.
    * Train model with class weights 
6. Different problem
  * Anamoly detection
    * One-class classifier 
  * Change detection 


