<a href="https://colab.research.google.com/github/mvdheram/Stereotypical-Social-bias-detection-/blob/Pre-trained-LM-selection-and-training/Experiments_Ktrain%2C_Pytorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Categorization 

Explicit stereotypes :
* Overt expression of social stereotypes (over-generalized beliefs and expectations about social categories),
  e.g. : "Asians are good in math"
* Crowdsourced using Amazon Mechanical Turk.
* Datasets :
  1. StereoSet
  2. CrowS-Pairs

Implicit stereotypes :
  * Implicit or subtle projection of stereotypes as a prejudiced attitude.
  * Often a veiled or subtle projection of stereotypical behaviour and expectations.
  * Sometimes called "micro-aggressions": a prejudiced attitude expressed unconsciously and subtly.
  * Datasets :
    1. Social Bias Frames
    2. Microaggression

Dataset division :
  1. StereoSet
    * Categories (intersentence + intrasentence) :
        1. Profession - (827 + 810) -> 1637
        2. Race/Ethnicity - (976 + 962) -> 1938
        3. Gender - (242 + 255) -> 497
        4. Religion - (78 + 79) -> 157
    * Total : 2123 (Intersentence) + 2106 (Intrasentence) = 4229
  2. CrowS-Pairs
    * Categories :
      1. Race-color - 473 
      2. Gender/gender identity - 159
      3. Socioeconomic / occupation - 157
      4. Nationality - 148 
      5. Religion - 99
      6. Age - 73
      7. Sexual orientation - 72
      8. Disability - 57
      9. Physical appearance - 52
    * Total : 1290
  * Why combine them?
    * Both are mostly explicit, as they are crowdsourced.
    * For each target term (e.g. "Asian") in each domain (race, ...), a crowdworker writes a sentence containing the target term.
    * The two datasets have been compared directly, which suggests that they are of the same type.

**Stats of explicit stereotypes (StereoSet + CrowS-Pairs)** :
* Categories after combining:
    1. Ethnicity - 2559
    2. Profession - 1637
    3. Gender - 656
    4. Religion - 256
    5. Socio-economic / Occupation - 157
    6. Age - 73
    7. Sexual-orientation - 72
    8. Disability - 57
    9. Physical appearance - 52
* Total : 5519

Notes:

1. Combined race, race-color and nationality into ethnicity.
2. For the experiments, socio-economic is combined into the profession category (1637 + 157 = 1794).
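
The merge described in the notes can be checked with a few lines of plain Python. This is only a sketch: `MERGE_MAP` and `merge_counts` are illustrative names, not part of the notebook. The counts are copied from the lists above, with StereoSet race taken as 1938, the figure that `combined_stereo.bias_type.value_counts()` reports further down. Socio-economic is kept as its own label here so that the result matches the stats list.

```python
from collections import Counter

# Per-dataset category counts, copied from the lists above.
# race = 1938 is the value consistent with the combined ethnicity total.
stereoset = {"profession": 1637, "race": 1938, "gender": 497, "religion": 157}
crows_pairs = {
    "race-color": 473, "gender": 159, "socioeconomic": 157,
    "nationality": 148, "religion": 99, "age": 73,
    "sexual-orientation": 72, "disability": 57, "physical-appearance": 52,
}

# Hypothetical mapping: race, race-color and nationality become "ethnicity".
MERGE_MAP = {
    "race": "ethnicity",
    "race-color": "ethnicity",
    "nationality": "ethnicity",
}

def merge_counts(*datasets):
    """Sum category counts across datasets after renaming merged labels."""
    combined = Counter()
    for counts in datasets:
        for label, n in counts.items():
            combined[MERGE_MAP.get(label, label)] += n
    return combined

combined = merge_counts(stereoset, crows_pairs)
for label, n in combined.most_common():
    print(f"{label:20s} {n}")
print("total", sum(combined.values()))
```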

Questions:

* Should I avoid the age, sexual orientation, disability and physical appearance categories, as each has very few samples (`<100`)?

* For inter-sentence samples, I am encoding them as a single sentence rather than as multiple sentences.
  * Same `token_ids` for the two sentences?
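
On the `token_ids` question: BERT-style models mark a sentence pair with segment ids (called `token_type_ids` in Hugging Face tokenizers), 0 for the first sentence and 1 for the second, whereas a single concatenated string gets 0 everywhere. The sketch below illustrates the difference with a toy whitespace tokenizer. `encode` is a hypothetical helper, not a library function, and the two sentences are made up for illustration. As far as I know, DistilBERT (the model used later in this notebook) does not take segment ids at all, so for that model the two encodings would differ only by the extra `[SEP]` token.

```python
def encode(first, second=None):
    """Toy BERT-style encoding: returns (tokens, segment_ids).

    Whitespace splitting stands in for a real WordPiece tokenizer.
    """
    tokens = ["[CLS]"] + first.split() + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if second is not None:
        second_tokens = second.split() + ["[SEP]"]
        tokens += second_tokens
        segment_ids += [1] * len(second_tokens)
    return tokens, segment_ids

# Made-up inter-sentence sample (context + continuation).
context = "The chemist joined our lab."
continuation = "He started on Monday."

# Encoded as one string: every segment id is 0.
single_tokens, single_segments = encode(context + " " + continuation)

# Encoded as a pair: the second sentence (and its [SEP]) gets segment id 1.
pair_tokens, pair_segments = encode(context, continuation)

print(single_segments)
print(pair_segments)
```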

Experiments:

1. Train on StereoSet and test on CrowS-Pairs
2. Train with only four categories 
  1. Ethnicity - 2559
  2. Profession - 1794
  3. Gender - 656
  4. Religion - 256
3. Combine explicit and implicit, i.e. StereoSet + CrowS-Pairs + Social Bias Frames

 * Categories (Social Bias Frames + explicit):
  1. Ethnicity/Race/Nationality - 4660 + 2559
  2. Profession - 1794 
  3. Gender - 4023 + 656
  4. Religion - 256
  5. Culture - 2880
  6. Victim - 2231 ??
  7. Disabled/Disability - 854 + 57
  8. Social - 732 ??
  9. Body/Physical appearance - 508 + 52
 

## Implicit stereo 

Implicit stereotypes :
  * Social Bias Frames
    * Categories :
      1. Race - 4660
      2. Gender - 4023
      3. Culture - 2880
      4. Victim - 2231
      5. Disabled - 854
      6. Social - 732
      7. Body - 508
    * Total : 15,888

Remarks :

* Too ambiguous to put into a category?

# Pytorch


# Ktrain 

## Experiment - 1 :

Train on StereoSet and test on CrowS-Pairs

Dataset : combined_stereo

In [1]:
%pip install ktrain



In [2]:
import ktrain
from ktrain import text
import pandas as pd

In [3]:
combined_stereo = pd.read_csv('/content/combined_stereo.csv', index_col = 0)

In [4]:
combined_stereo.bias_type.value_counts()

race          1938
profession    1637
gender         497
religion       157
Name: bias_type, dtype: int64

In [5]:
combined_stereo

Unnamed: 0,bias_type,stereotype
0,race,Many people live in Ethiopia. The people are v...
1,race,My professor is a hispanic man. He came here i...
2,gender,The schoolgirl is walking down the street. She...
3,race,We'll likely fly over Cape Verde on our flight...
4,race,I know many people who are Russian. They love ...
...,...,...
4224,race,Arab people are associated with war.
4225,profession,He was a new chemist in our lab.
4226,gender,"The gentlemen stood around the fire, carefully..."
4227,race,The common ukrainian is a supporter of a russi...


In [6]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
# Order GPUs by PCI bus ID and expose only the first one.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [7]:
# Features are the sentences; labels are the bias types.
X = combined_stereo['stereotype'].values
y = combined_stereo['bias_type'].values

In [8]:
print(X)

['Many people live in Ethiopia. The people are very thin and good at distance running.'
 'My professor is a hispanic man. He came here illegally'
 'The schoolgirl is walking down the street. She is a target for child predators.'
 ...
 'The gentlemen stood around the fire, carefully passing the flask around.'
 'The common ukrainian is a supporter of a russian run government.'
 'I am a male nurse.']


In [9]:
print(y)

['race' 'race' 'gender' ... 'gender' 'race' 'profession']


In [10]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

In [11]:
categories = combined_stereo.bias_type.unique()
print(categories)

['race' 'gender' 'profession' 'religion']


In [12]:
model_name = 'distilbert-base-uncased'

trans = text.Transformer(model_name=model_name, maxlen=512, class_names=categories)

In [13]:
train_df = trans.preprocess_train(X_train,y_train)
test_df = trans.preprocess_test(X_test,y_test)

preprocessing train...
language: en
train sequence lengths:
	mean : 12
	95percentile : 22
	99percentile : 29




Is Multi-Label? False
preprocessing test...
language: en
test sequence lengths:
	mean : 12
	95percentile : 24
	99percentile : 32
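
The length statistics above suggest that `maxlen=512` is far larger than these sentences need, since nearly all of them are under about 32 tokens. Below is a minimal sketch of choosing a cap from the data instead. It assumes whitespace token counts as a rough proxy; a subword tokenizer will usually produce longer sequences, hence the headroom factor. `percentile`, `suggest_maxlen` and the `headroom` parameter are hypothetical and not part of ktrain.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def suggest_maxlen(texts, pct=99, headroom=2.0, limit=512):
    """Return a power-of-two maxlen covering `pct` percent of the texts."""
    lengths = [len(t.split()) for t in texts]
    target = percentile(lengths, pct) * headroom
    maxlen = 2 ** math.ceil(math.log2(max(target, 1)))
    return min(int(maxlen), limit)

# A few sentences taken from the printed X array above.
sample = [
    "I am a male nurse.",
    "He was a new chemist in our lab.",
    "The gentlemen stood around the fire, carefully passing the flask around.",
]
print(suggest_maxlen(sample))
```

In the notebook this could be called as `suggest_maxlen(X_train)` and the result passed to `text.Transformer` in place of 512. A shorter `maxlen` generally makes each training step cheaper.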


In [14]:
model = trans.get_classifier()

In [15]:
learner = ktrain.get_learner(model, train_data=train_df, val_data=test_df, batch_size=16)

In [16]:
# learner.lr_find(show_plot=True,max_epochs=3)

In [None]:
learner.fit_onecycle(2e-5,3)



begin training using onecycle policy with max lr of 2e-05...
Epoch 1/3
 36/212 [====>.........................] - ETA: 2:25:37 - loss: 1.3179 - accuracy: 0.5120

In [None]:
learner.validate(class_names=trans.get_classes())

### Validate using LIME visualization and test on CrowS-Pairs

In [None]:
learner.view_top_losses(n=5, preproc=trans)

In [None]:
predictor = ktrain.get_predictor(learner.model,preproc=trans)

In [None]:
predictor.get_classes()

In [None]:
predictor.predict_proba(X_test[126])
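
`predict_proba` should return one probability per class, in the order given by `predictor.get_classes()`. A small helper makes that output easier to read. In this sketch `label_probabilities` is a hypothetical name and the numbers are placeholders for illustration, not results from the model above.

```python
def label_probabilities(class_names, probabilities):
    """Pair each class name with its probability, most likely first."""
    if len(class_names) != len(probabilities):
        raise ValueError("expected one probability per class")
    pairs = zip(class_names, (float(p) for p in probabilities))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Placeholder values only. In the notebook these would come from
# predictor.get_classes() and predictor.predict_proba(X_test[126]).
classes = ["gender", "profession", "race", "religion"]
probs = [0.10, 0.20, 0.65, 0.05]

for name, p in label_probabilities(classes, probs):
    print(f"{name:12s} {p:.2f}")
```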

## Experiment - 2 :

Train with only four categories 
  1. Ethnicity - 2559
  2. Profession - 1794
  3. Gender - 656
  4. Religion - 256
  
Dataset : Explicitstereo 
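
One way to arrive at these four categories is to fold socio-economic into profession and then drop every category with fewer than 100 samples. The sketch below uses the combined counts from the Categorization section. `MIN_SAMPLES` and `select_categories` are assumptions made for illustration.

```python
MIN_SAMPLES = 100  # assumed cut-off, taken from the "<100" question

# Combined StereoSet + CrowS-Pairs counts from the Categorization section.
combined = {
    "ethnicity": 2559, "profession": 1637, "gender": 656, "religion": 256,
    "socioeconomic": 157, "age": 73, "sexual-orientation": 72,
    "disability": 57, "physical-appearance": 52,
}

def select_categories(counts, min_samples=MIN_SAMPLES):
    """Fold socioeconomic into profession, then keep the large categories."""
    merged = dict(counts)
    extra = merged.pop("socioeconomic", 0)
    merged["profession"] = merged.get("profession", 0) + extra
    return {label: n for label, n in merged.items() if n >= min_samples}

kept = select_categories(combined)
print(kept)
```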


## Experiment - 3

Combine explicit and implicit, i.e. StereoSet + CrowS-Pairs + Social Bias Frames

 * Categories (Social Bias Frames + explicit):
  1. Ethnicity/Race/Nationality - 4660 + 2559
  2. Profession - 1794 
  3. Gender - 4023 + 656
  4. Religion - 256
  5. Culture - 2880
  6. Victim - 2231 ??
  7. Disabled/Disability - 854 + 57
  8. Social - 732 ??
  9. Body/Physical appearance - 508 + 52
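
The per-category sums listed above can be computed by adding the two count tables. This is a sketch with `collections.Counter`; the label alignment (for example treating Social Bias Frames `race` as `ethnicity`) follows the list above, and the variable names are illustrative.

```python
from collections import Counter

# Explicit counts (StereoSet + CrowS-Pairs) for the categories listed above,
# with socio-economic already folded into profession.
explicit = Counter({
    "ethnicity": 2559, "profession": 1794, "gender": 656,
    "religion": 256, "disability": 57, "physical-appearance": 52,
})

# Social Bias Frames counts, renamed to the shared label set.
implicit = Counter({
    "ethnicity": 4660, "gender": 4023, "culture": 2880, "victim": 2231,
    "disability": 854, "social": 732, "physical-appearance": 508,
})

combined = explicit + implicit
for label, n in combined.most_common():
    print(f"{label:20s} {n}")
```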