<a href="https://colab.research.google.com/github/mvdheram/Stereotypical-Social-bias-detection-/blob/Machine-learning-classifiers/Feature_based_machine_learning_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Features based machine learning models 

* Reference : 
    1. Linguistic models for detecting bias https://aclanthology.org/P13-1162.pdf
    2. Automatically Neutralizing Subjective Bias in Text https://ojs.aaai.org/index.php/AAAI/article/view/5385 

Features :

* Bias lexicons with count
* Sentiment 
* Generic words NNS and NNPS
* Toxicity 
* Generic features 
* A, AE names

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/Trained models/mult_label_dataset/ohe_multilabel.csv', index_col = 0)

In [None]:
df

Unnamed: 0,Ethnicity,gender,profession,religion,Anti-stereotype,stereotype,unrelated,sentence
0,1,0,0,0,0,1,0,Many people live in Ethiopia. The people are v...
1,1,0,0,0,0,1,0,My professor is a hispanic man He came here il...
2,0,1,0,0,0,1,0,The schoolgirl is walking down the street. She...
3,1,0,0,0,0,1,0,We'll likely fly over Cape Verde on our flight...
4,1,0,0,0,0,1,0,I know many people who are Russian. They love ...
...,...,...,...,...,...,...,...,...
16539,0,0,0,0,0,0,1,Arab people are associated with hi.
16540,0,0,0,0,0,0,1,Car was a new chemist in our lab.
16541,0,0,0,0,0,0,1,"The gentlemen stood around the fire, carefully..."
16542,0,0,0,0,0,0,1,The common ukrainian is a supporter of a floun...


## Pre-processing

In [14]:
# Tokenization using spacy
import spacy

nlp = spacy.load("en_core_web_sm")

def tokenize(text):  

  doc = nlp(text)
  tokens = [token.text.lower() for token in doc]
  return tokens

def lemmatization(text):

  doc = nlp(text)
  lemmas = [token.lemma_.lower() for token in doc]
  return lemmas

# Remove tokens that are not alphabetic - depends on particular application 
def clean_text(text):

  lemmas = lemmatization(text)
  a_lemmas = [lemma for  lemma in lemmas
              if lemma.isalpha()]
  
  return (' '.join(a_lemmas))


# Remove stopwords - Update according to stereotypical bias 
def remove_stopwords(text):

  stopwords = spacy.lang.en.stop_words.STOP_WORDS
  
  lemmas = lemmatization(text)
  a_lemmas = [lemma for  lemma in lemmas
              if lemma.isalpha() and lemma not in stopwords]
  
  return (' '.join(a_lemmas))


# Parts of speech tagger 
def pos_tags(text):
  
  doc = nlp(text)
  pos = [(token.text, token.tag_) for token in doc]
  return (pos)


# Named entity recognition 
def ner_tags(text):
  
  doc = nlp(text)
  ner = [(ent.text, ent.label_) for ent in doc.ents]
  return (ner)

In [None]:
stereo = df.copy()
stereo['clean_text'] = stereo['sentence'].apply(clean_text)

In [None]:
stereo['pos_tags'] = stereo['clean_text'].apply(pos_tags) 

In [None]:
stereo['lemmatized_withStopwords']= stereo['clean_text'].apply(remove_stopwords)

In [None]:
stereo['tokenized_lemmatized_withstopwords'] = stereo['lemmatized_withStopwords'].apply(tokenize)

In [None]:
stereo['lemma_pos'] = stereo['lemmatized_withStopwords'].apply(pos_tags)

In [None]:
stereo['Ner_tags'] = stereo['clean_text'].apply(ner_tags)

In [None]:
stereo

Unnamed: 0,Ethnicity,gender,profession,religion,Anti-stereotype,stereotype,unrelated,sentence,clean_text,lemmatized_withStopwords,tokenized_lemmatized_withstopwords
0,1,0,0,0,0,1,0,Many people live in Ethiopia. The people are v...,many people live in ethiopia the people be ver...,people live ethiopia people thin good distance...,"[people, live, ethiopia, people, thin, good, d..."
1,1,0,0,0,0,1,0,My professor is a hispanic man He came here il...,professor be a hispanic man come here illegally,professor hispanic man come illegally,"[professor, hispanic, man, come, illegally]"
2,0,1,0,0,0,1,0,The schoolgirl is walking down the street. She...,the schoolgirl be walk down the street be a ta...,schoolgirl walk street target child predator,"[schoolgirl, walk, street, target, child, pred..."
3,1,0,0,0,0,1,0,We'll likely fly over Cape Verde on our flight...,will likely fly over cape verde on flight tomo...,likely fly cape verde flight tomorrow bet peop...,"[likely, fly, cape, verde, flight, tomorrow, b..."
4,1,0,0,0,0,1,0,I know many people who are Russian. They love ...,know many people who be russian love vodka,know people russian love vodka,"[know, people, russian, love, vodka]"
...,...,...,...,...,...,...,...,...,...,...,...
16539,0,0,0,0,0,0,1,Arab people are associated with hi.,arab people be associate with hi,arab people associate hi,"[arab, people, associate, hi]"
16540,0,0,0,0,0,0,1,Car was a new chemist in our lab.,car be a new chemist in lab,car new chemist lab,"[car, new, chemist, lab]"
16541,0,0,0,0,0,0,1,"The gentlemen stood around the fire, carefully...",the gentleman stand around the fire carefully ...,gentleman stand fire carefully pass boxing,"[gentleman, stand, fire, carefully, pass, boxing]"
16542,0,0,0,0,0,0,1,The common ukrainian is a supporter of a floun...,the common ukrainian be a supporter of a floun...,common ukrainian supporter flounder run govern...,"[common, ukrainian, supporter, flounder, run, ..."


In [None]:
stereo.to_csv('stereo_features.csv')

## Feature engineering

Scoring features :


* Readability tests :
  https://pypi.org/project/textatistic/
  * Determine readability of english passage
  * Scale ranging from primary school up to college graduate level
  * A mathematical formula utilizing word, syllabel and sentence count
  * Used in fake news and opinion spam detection 

  Types :

  1. Flesch reading ease : 

    * The higher the score, the better the readability. 
    * score of 0-30 implies only college graduates can understand while 90-100 implies that a 5th grade student can understand.
    
    Two factors :

      1. Greater the average sentence length, harder the text to read
      2. Greater the average number of syllables, harder the text to read

* Avg_tf_idf
* Max_tf_idf
* Number of characters 
* Word count
* Average word length
* Vadar Sentiment analysis
* Text subjectivity (Text blob)
* Toxicity analysis (detoxify)



In [None]:
import pandas as pd

stereo = pd.read_csv('/content/drive/MyDrive/Trained models/mult_label_dataset/stereo_features_f.csv',index_col=0)

In [None]:
stereo

Unnamed: 0,Ethnicity,gender,profession,religion,Anti-stereotype,stereotype,unrelated,sentence,clean_text,pos,lemmatized_withStopwords,lemma_pos,Ner_tags
0,1,0,0,0,0,1,0,Many people live in Ethiopia. The people are v...,many people live in ethiopia the people be ver...,"[('many', 'JJ'), ('people', 'NNS'), ('live', '...",people live ethiopia people thin good distance...,"[('people', 'NNS'), ('live', 'VBP'), ('ethiopi...","[('ethiopia', 'GPE')]"
1,1,0,0,0,0,1,0,My professor is a hispanic man He came here il...,professor be a hispanic man come here illegally,"[('professor', 'NNP'), ('be', 'VB'), ('a', 'DT...",professor hispanic man come illegally,"[('professor', 'NNP'), ('hispanic', 'NNP'), ('...","[('hispanic', 'NORP')]"
2,0,1,0,0,0,1,0,The schoolgirl is walking down the street. She...,the schoolgirl be walk down the street be a ta...,"[('the', 'DT'), ('schoolgirl', 'NN'), ('be', '...",schoolgirl walk street target child predator,"[('schoolgirl', 'NNP'), ('walk', 'NNP'), ('str...",[]
3,1,0,0,0,0,1,0,We'll likely fly over Cape Verde on our flight...,will likely fly over cape verde on flight tomo...,"[('will', 'MD'), ('likely', 'RB'), ('fly', 'VB...",likely fly cape verde flight tomorrow bet peop...,"[('likely', 'RB'), ('fly', 'VB'), ('cape', 'NN...","[('tomorrow', 'DATE')]"
4,1,0,0,0,0,1,0,I know many people who are Russian. They love ...,know many people who be russian love vodka,"[('know', 'VBP'), ('many', 'JJ'), ('people', '...",know people russian love vodka,"[('know', 'VBP'), ('people', 'NNS'), ('russian...","[('russian', 'NORP')]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...
16539,0,0,0,0,0,0,1,Arab people are associated with hi.,arab people be associate with hi,"[('arab', 'JJ'), ('people', 'NNS'), ('be', 'VB...",arab people associate hi,"[('arab', 'JJ'), ('people', 'NNS'), ('associat...","[('arab', 'NORP')]"
16540,0,0,0,0,0,0,1,Car was a new chemist in our lab.,car be a new chemist in lab,"[('car', 'NN'), ('be', 'VB'), ('a', 'DT'), ('n...",car new chemist lab,"[('car', 'NN'), ('new', 'JJ'), ('chemist', 'NN...",[]
16541,0,0,0,0,0,0,1,"The gentlemen stood around the fire, carefully...",the gentleman stand around the fire carefully ...,"[('the', 'DT'), ('gentleman', 'NNP'), ('stand'...",gentleman stand fire carefully pass boxing,"[('gentleman', 'NNP'), ('stand', 'VB'), ('fire...",[]
16542,0,0,0,0,0,0,1,The common ukrainian is a supporter of a floun...,the common ukrainian be a supporter of a floun...,"[('the', 'DT'), ('common', 'JJ'), ('ukrainian'...",common ukrainian supporter flounder run govern...,"[('common', 'JJ'), ('ukrainian', 'JJ'), ('supp...","[('ukrainian', 'NORP')]"


In [None]:
scoring_features = stereo.copy()

In [None]:
scoring_features.drop(['pos','lemma_pos',	'Ner_tags'],axis=1, inplace= True)

In [None]:
 # Number of characters
 scoring_features['num_chars']  = scoring_features['sentence'].apply(len)

In [None]:
# Number of words
def word_count(string):
  # split the string into words
  words = string.split()

  # Return length of words list
  return len(words)

scoring_features['num_words'] = scoring_features['sentence'].apply(word_count)

In [None]:
# Average word length
def avg_word_length(x):

  # Split the string into words
  words = x.split()

  # Compute length of each word and store in a seperate list
  word_lengths = [len(word) for word in words]

  # Compute average word length 
  try:
    avg_word_length = sum(word_lengths)/len(words)
  except ZeroDivisionError:
    avg_word_length = 0

  return (avg_word_length)

scoring_features['avg_word_length'] = scoring_features['sentence'].apply(avg_word_length) 

In [None]:
scoring_features.columns

Index(['Ethnicity', 'gender', 'profession', 'religion', 'Anti-stereotype',
       'stereotype', 'unrelated', 'sentence', 'clean_text',
       'lemmatized_withStopwords', 'num_chars', 'num_words',
       'avg_word_length'],
      dtype='object')

In [None]:
pip install textstat



In [None]:
# Readability tests using textatistic library 
# Import the textatistic class
import textstat
import math

def readability_scores(text):
  # if text.endswith(".") == False:
  #   text = text+"."
  readability_score = textstat.flesch_reading_ease(text)

  # Generate scores
  return readability_score

In [None]:
try:
  scoring_features['flesch_score'] = scoring_features['sentence'].apply(readability_scores)
except ZeroDivisionError:
  scoring_features['flesch_score'] = 0

In [None]:
pip install -U textblob



In [None]:
from textblob import TextBlob

def get_subjectivity(text):
    try:
        textblob = TextBlob(unicode(text, 'utf-8'))
        subj = textblob.sentiment.subjectivity
    except:
        subj = 0.0
    return subj

In [None]:
scoring_features['subjectivity_score'] = scoring_features['sentence'].apply(get_subjectivity)

Vectorization :

* n_grams
* tf_idf 

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Building n-gram models - capture context
# Range = (2,2) - bi-grams, (1,3) - unigram, bigram, trigram
def n_grams(range, corpus):
  # Bag of words feature - docxterm matrix 
  vectorizer = CountVectorizer(ngram_range = range)
  corpus = corpus.values.astype('U')
  bow_matrix = vectorizer.fit_transform(corpus)
  cv_df = pd.DataFrame(bow_matrix.toarray(), columns = vectorizer.get_feature_names()).add_prefix('Counts_')
  # corpus = pd.concat([corpus,cv_df],axis = 1, sort = False)
  return cv_df

In [497]:
# tf-idf  - higher the weight more the importance 
# Used for train set
from sklearn.feature_extraction.text import TfidfVectorizer

def tf_idf(corpus):
  vectorizer = TfidfVectorizer()
  vectorizer = TfidfVectorizer(max_features = 10000)
  corpus = corpus.values.astype('U')
  tfidf_matrix = vectorizer.fit_transform(corpus)
  tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns = vectorizer.get_feature_names()).add_prefix('tfIdf_')
  # corpus = pd.concat([corpus,tfidf_df],axis = 1, sort = False)
  return tfidf_df

In [498]:
tf_idf_feature = tf_idf(scoring_features['clean_text'])

In [513]:
# Inspect the different words being values after BOW and tfidf transformation 
def examine_row(corpus,row_n):
  examine_row = corpus.iloc[row_n]
  print(examine_row.sort_values(ascending= False).head())
  total = corpus.sum()
  print("Total sum of the counts per word \n",total.head()) # Total sum of the counts per word
  # print("Sums sorted: ",total.sort_values(ascending= False).head())

In [35]:
clean_text = scoring_features['clean_text']
tfidf = TfidfVectorizer()
# corpus = clean_text.values.astype('U')
# Whole dataset has to  be given for tfidf model
tfidf_model = tfidf.fit(corpus)

In [36]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [37]:
import numpy as np
import nltk

def tfidf_max_avg_features(sent):
    avg_tfidf_feature = 0
    max_tfidf_feature = 0
    tokenized_words = nltk.word_tokenize(sent)
    tfidf_vector = tfidf_model.transform([sent])
    if avg == True:
      avg_tfidf_feature = np.sum(tfidf_vector.toarray())/len(tokenized_words)
      return avg_tfidf_feature
    else:
      max_tfidf_feature = np.max(tfidf_vector.toarray())
      return max_tfidf_feature

In [39]:
avg = True
scoring_features['avg_tfidf_feature'] = scoring_features['sentence'].apply(tfidf_max_avg_features)

In [41]:
avg = False
scoring_features['max_tfidf_feature'] = scoring_features['sentence'].apply(tfidf_max_avg_features)

Vadar sentiment analysis

In [None]:
pip install vaderSentiment

Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
[?25l[K     |██▋                             | 10 kB 19.8 MB/s eta 0:00:01[K     |█████▏                          | 20 kB 25.1 MB/s eta 0:00:01[K     |███████▉                        | 30 kB 23.3 MB/s eta 0:00:01[K     |██████████▍                     | 40 kB 20.8 MB/s eta 0:00:01[K     |█████████████                   | 51 kB 6.4 MB/s eta 0:00:01[K     |███████████████▋                | 61 kB 7.3 MB/s eta 0:00:01[K     |██████████████████▏             | 71 kB 7.8 MB/s eta 0:00:01[K     |████████████████████▉           | 81 kB 8.7 MB/s eta 0:00:01[K     |███████████████████████▍        | 92 kB 8.5 MB/s eta 0:00:01[K     |██████████████████████████      | 102 kB 6.7 MB/s eta 0:00:01[K     |████████████████████████████▋   | 112 kB 6.7 MB/s eta 0:00:01[K     |███████████████████████████████▏| 122 kB 6.7 MB/s eta 0:00:01[K     |████████████████████████████████| 125 kB 6.7 M

In [None]:
# Sentiment analysis 
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyser = SentimentIntensityAnalyzer()

def vader_sentiment(text):
  score = analyser.polarity_scores(text)
  return score

senti = scoring_features['sentence'].apply(vader_sentiment) 
scoring_features = pd.concat([scoring_features,(pd.DataFrame.from_dict(dict(senti).values()))],axis = 1, sort = False)
# scoring_features.head()

In [None]:
pip install detoxify

Collecting detoxify
  Downloading detoxify-0.2.2-py3-none-any.whl (11 kB)
Collecting transformers>=3.2.0
  Downloading transformers-4.9.2-py3-none-any.whl (2.6 MB)
[K     |████████████████████████████████| 2.6 MB 7.4 MB/s 
[?25hCollecting sentencepiece>=0.1.94
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
[K     |████████████████████████████████| 1.2 MB 52.9 MB/s 
Collecting huggingface-hub==0.0.12
  Downloading huggingface_hub-0.0.12-py3-none-any.whl (37 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
[K     |████████████████████████████████| 895 kB 37.4 MB/s 
[?25hCollecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
[K     |████████████████████████████████| 636 kB 71.6 MB/s 
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
[K

In [None]:
# Toxicity identification 
from detoxify import Detoxify

def toxicity(text):
  results = Detoxify('original').predict(text)
  return math.floor(results['toxicity']*100)

scoring_features['toxicity'] = scoring_features['sentence'].apply(toxicity) 

In [42]:
scoring_features.to_csv('/content/drive/MyDrive/temp_df.csv')

Count based features :

* Lexicons - Counts
  * Hedge in context  - two words around W is a hedge (Hyland, 2005  (e.g., apparently).
  * Factive verb  - w is in Hooper’s (1975) list of factives (e.g., realize).
  * Factive verb in context One/two word(s) around w is a factive (Hooper, 1975)
  * Assertive verb
  * Assertive verb in context 
  * Assertive verb 
  * Implicative verb in context
  * Report verb
  * Entailment (Not found)
  * Entailment in context (Not found)
  * Strong subjective (Used textblob subjectivity score)
  * Weak subjective ((Used textblob subjectivity score)
  * Positive word (Vadar sentiment score)
  * Positive word in context (Vadar sentiment score)
  * Negative word (Vadar sentiment score)
  * Negative word in context (Vadar sentiment score)
  * Grammatical relation - {root,subj,...}
  * Bias lexicon
* Social category target words used in dataset( Characteristic words of each bias type ; e.g. Racial, gender, ..) and scoring_features_pos_Ner
* Characteristic stereotypical words 
* POS :
  * POS(word) : POS of word w 
  * POS(word) - 1 :  POS of one word before w
  * POS(word) + 1  : POS of one word after w


In [4]:
import pandas as pd

scoring_features = pd.read_csv('/content/drive/MyDrive/temp_df.csv', index_col = 0)

In [5]:
scoring_features.head()

Unnamed: 0,Ethnicity,gender,profession,religion,Anti-stereotype,stereotype,unrelated,sentence,clean_text,lemmatized_withStopwords,tokenized_lemmatized_withstopwords,num_chars,num_words,avg_word_length,flesch_score,subjectivity_score,neg,neu,pos,compound,avg_tfidf,Max_tfidf,avg_tfidf_feature,max_tfidf_feature
0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,Many people live in Ethiopia. The people are v...,many people live in ethiopia the people be ver...,people live ethiopia people thin good distance...,"['people', 'live', 'ethiopia', 'people', 'thin...",84,15,4.666667,89.24,0.0,0.0,0.816,0.184,0.4877,0.1944,0.1944,0.1944,0.442653
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,My professor is a hispanic man He came here il...,professor be a hispanic man come here illegally,professor hispanic man come illegally,"['professor', 'hispanic', 'man', 'come', 'ille...",53,10,4.4,52.87,0.0,0.0,1.0,0.0,0.0,0.253893,0.253893,0.253893,0.494477
2,0.0,1.0,0.0,0.0,0.0,1.0,0.0,The schoolgirl is walking down the street. She...,the schoolgirl be walk down the street be a ta...,schoolgirl walk street target child predator,"['schoolgirl', 'walk', 'street', 'target', 'ch...",79,14,4.714286,89.75,0.0,0.0,1.0,0.0,0.0,0.159743,0.159743,0.159743,0.510763
3,1.0,0.0,0.0,0.0,0.0,1.0,0.0,We'll likely fly over Cape Verde on our flight...,will likely fly over cape verde on flight tomo...,likely fly cape verde flight tomorrow bet peop...,"['likely', 'fly', 'cape', 'verde', 'flight', '...",106,20,4.35,86.71,0.0,0.105,0.759,0.135,0.1779,0.16585,0.16585,0.16585,0.344728
4,1.0,0.0,0.0,0.0,0.0,1.0,0.0,I know many people who are Russian. They love ...,know many people who be russian love vodka,know people russian love vodka,"['know', 'people', 'russian', 'love', 'vodka']",58,11,4.363636,91.27,0.0,0.0,0.69,0.31,0.6696,0.215018,0.215018,0.215018,0.508637


Lexicons

In [6]:
import json 

f = open('/content/Subjectivity_lexicon.json')

lexicons = json.load(f)

In [7]:
for keys, value in lexicons.items():
  print(keys,'->',len(value.split('\n')))

assertive_verbs.txt -> 66
bias_lexicon.txt -> 655
bias_word_list_01_2018.txt -> 9742
factive_verbs.txt -> 27
hedges_hyland2005.txt -> 100
implicative_verbs.txt -> 32
report_verbs.txt -> 181
subjectivityClues_lexicon.txt -> 8223


In [8]:
keys = lexicons.keys()

In [9]:
def count_lexicon(text):
  count = 0
  try:
    for token in lexicon:
      if token in text:
        count +=1
      else:
        continue
  except :
    pass
  return count

Assertive verbs

In [15]:
lexicon = set(tokenize(lexicons['assertive_verbs.txt']))

In [16]:
scoring_features['assertive_verbs_count'] = scoring_features['lemmatized_withStopwords'].apply(count_lexicon)

In [17]:
len(scoring_features[scoring_features['assertive_verbs_count'] != 0])

1572

Factive verbs

In [18]:
lexicon = set(tokenize(lexicons['factive_verbs.txt']))

In [19]:
scoring_features['factive_verbs_count'] = scoring_features['lemmatized_withStopwords'].apply(count_lexicon)

In [20]:
len(scoring_features[scoring_features['factive_verbs_count'] != 0])

1725

Hedges

In [21]:
lexicon = set(tokenize(lexicons['hedges_hyland2005.txt']))

In [22]:
scoring_features['hedges_count'] = scoring_features['lemmatized_withStopwords'].apply(count_lexicon)

In [23]:
len(scoring_features[scoring_features['hedges_count'] != 0])

9458

Implicative_verbs

In [24]:
lexicon = set(tokenize(lexicons['implicative_verbs.txt']))

In [25]:
scoring_features['implicative_verbs_count'] = scoring_features['lemmatized_withStopwords'].apply(count_lexicon)

In [26]:
len(scoring_features[scoring_features['implicative_verbs_count'] != 0])

1719

Report_verbs

In [27]:
lexicon = set(tokenize(lexicons['report_verbs.txt']))

In [28]:
scoring_features['report_verbs_count'] = scoring_features['lemmatized_withStopwords'].apply(count_lexicon)

In [29]:
len(scoring_features[scoring_features['report_verbs_count'] != 0])

3922

Bias_word_list_01_2018

In [30]:
lexicon = set(tokenize(lexicons['bias_word_list_01_2018.txt']))

In [31]:
scoring_features['bias_word_list_01_2018_count'] = scoring_features['lemmatized_withStopwords'].apply(count_lexicon)

In [32]:
len(scoring_features[scoring_features['bias_word_list_01_2018_count'] != 0])

15198

SubjectivityClues_lexicon

In [33]:
lexicon = set(tokenize(lexicons['subjectivityClues_lexicon.txt']))

In [34]:
scoring_features['subjectivityClues_lexicon_count'] = scoring_features['lemmatized_withStopwords'].apply(count_lexicon)

In [35]:
len(scoring_features[scoring_features['subjectivityClues_lexicon_count'] != 0])

15831

In [36]:
scoring_features.to_csv("scoring_features.csv")

POS tags Features

In [194]:
features_pos_ner = pd.read_csv('/content/drive/MyDrive/Trained_models/Results_compilation /Unrelated_adjusted/Baselines/Feature based/Feature_df/stereo_features_pos_ner.csv',index_col = 0)

In [195]:
features_pos_ner = features_pos_ner.iloc[:,9:]

In [196]:
features_pos_ner

Unnamed: 0,pos_tags,lemmatized_withStopwords,tokenized_lemmatized_withstopwords,lemma_pos,Ner_tags
0,"[('many', 'JJ'), ('people', 'NNS'), ('live', '...",people live ethiopia people thin good distance...,"['people', 'live', 'ethiopia', 'people', 'thin...","[('people', 'NNS'), ('live', 'VBP'), ('ethiopi...","[('ethiopia', 'GPE')]"
1,"[('professor', 'NNP'), ('be', 'VB'), ('a', 'DT...",professor hispanic man come illegally,"['professor', 'hispanic', 'man', 'come', 'ille...","[('professor', 'NNP'), ('hispanic', 'NNP'), ('...","[('hispanic', 'NORP')]"
2,"[('the', 'DT'), ('schoolgirl', 'NN'), ('be', '...",schoolgirl walk street target child predator,"['schoolgirl', 'walk', 'street', 'target', 'ch...","[('schoolgirl', 'NNP'), ('walk', 'NNP'), ('str...",[]
3,"[('will', 'MD'), ('likely', 'RB'), ('fly', 'VB...",likely fly cape verde flight tomorrow bet peop...,"['likely', 'fly', 'cape', 'verde', 'flight', '...","[('likely', 'RB'), ('fly', 'VB'), ('cape', 'NN...","[('tomorrow', 'DATE')]"
4,"[('know', 'VBP'), ('many', 'JJ'), ('people', '...",know people russian love vodka,"['know', 'people', 'russian', 'love', 'vodka']","[('know', 'VBP'), ('people', 'NNS'), ('russian...","[('russian', 'NORP')]"
...,...,...,...,...,...
16555,"[('cookie', 'NN'), ('be', 'VB'), ('good', 'JJ'...",cookie good substitute liquid milk,"['cookie', 'good', 'substitute', 'liquid', 'mi...","[('cookie', 'NNP'), ('good', 'JJ'), ('substitu...",[]
16556,"[('jollof', 'NNP'), ('rice', 'NNP'), ('cereal'...",jollof rice cereal totally jambalaya,"['jollof', 'rice', 'cereal', 'totally', 'jamba...","[('jollof', 'NNP'), ('rice', 'NNP'), ('cereal'...","[('jollof rice', 'PERSON')]"
16557,"[('bike', 'NN'), ('out', 'RP'), ('be', 'VB'), ...",bike economical energy efficient mode unk tran...,"['bike', 'economical', 'energy', 'efficient', ...","[('bike', 'NNP'), ('economical', 'JJ'), ('ener...",[]
16558,"[('may', 'MD'), ('see', 'VB'), ('the', 'DT'), ...",little dog need food park grocery store find f...,"['little', 'dog', 'need', 'food', 'park', 'gro...","[('little', 'JJ'), ('dog', 'NN'), ('need', 'VB...",[]


In [197]:
scoring_features = pd.concat([scoring_features,features_pos_ner],axis =1)

In [204]:
scoring_features.head()

Unnamed: 0,Ethnicity,gender,profession,religion,Anti-stereotype,stereotype,unrelated,sentence,clean_text,num_chars,num_words,avg_word_length,flesch_score,subjectivity_score,neg,neu,pos,compound,avg_tfidf,Max_tfidf,avg_tfidf_feature,max_tfidf_feature,assertive_verbs_count,factive_verbs_count,hedges_count,implicative_verbs_count,report_verbs_count,bias_word_list_01_2018_count,subjectivityClues_lexicon_count,NNS_count,NNPS_count,DT_count,JJ_count,JJS_count,NN_count,NORP_count,PERSON_count,adverb_count,GPE_count,lemmatized_withStopwords,tokenized_lemmatized_withstopwords
0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,Many people live in Ethiopia. The people are v...,many people live in ethiopia the people be ver...,84,15,4.666667,89.24,0.0,0.0,0.816,0.184,0.4877,0.1944,0.1944,0.1944,0.442653,0,0,1,0,0,4,5,1,0,1,1,0,1,0,0,1,1,people live ethiopia people thin good distance...,"['people', 'live', 'ethiopia', 'people', 'thin..."
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,My professor is a hispanic man He came here il...,professor be a hispanic man come here illegally,53,10,4.4,52.87,0.0,0.0,1.0,0.0,0.0,0.253893,0.253893,0.253893,0.494477,0,0,0,0,0,2,12,0,0,1,1,0,1,1,0,1,0,professor hispanic man come illegally,"['professor', 'hispanic', 'man', 'come', 'ille..."
2,0.0,1.0,0.0,0.0,0.0,1.0,0.0,The schoolgirl is walking down the street. She...,the schoolgirl be walk down the street be a ta...,79,14,4.714286,89.75,0.0,0.0,1.0,0.0,0.0,0.159743,0.159743,0.159743,0.510763,0,0,1,1,0,2,3,0,0,1,0,0,1,0,0,0,0,schoolgirl walk street target child predator,"['schoolgirl', 'walk', 'street', 'target', 'ch..."
3,1.0,0.0,0.0,0.0,0.0,1.0,0.0,We'll likely fly over Cape Verde on our flight...,will likely fly over cape verde on flight tomo...,106,20,4.35,86.71,0.0,0.105,0.759,0.135,0.1779,0.16585,0.16585,0.16585,0.344728,0,0,2,0,0,4,9,1,0,1,1,0,1,0,0,1,0,likely fly cape verde flight tomorrow bet peop...,"['likely', 'fly', 'cape', 'verde', 'flight', '..."
4,1.0,0.0,0.0,0.0,0.0,1.0,0.0,I know many people who are Russian. They love ...,know many people who be russian love vodka,58,11,4.363636,91.27,0.0,0.0,0.69,0.31,0.6696,0.215018,0.215018,0.215018,0.508637,0,1,0,0,0,2,3,1,0,0,1,0,1,1,0,0,0,know people russian love vodka,"['know', 'people', 'russian', 'love', 'vodka']"


In [199]:
scoring_features.columns

Index(['Ethnicity', 'gender', 'profession', 'religion', 'Anti-stereotype',
       'stereotype', 'unrelated', 'sentence', 'clean_text', 'num_chars',
       'num_words', 'avg_word_length', 'flesch_score', 'subjectivity_score',
       'neg', 'neu', 'pos', 'compound', 'avg_tfidf ', 'Max_tfidf ',
       'avg_tfidf_feature', 'max_tfidf_feature', 'assertive_verbs_count',
       'factive_verbs_count', 'hedges_count', 'implicative_verbs_count',
       'report_verbs_count', 'bias_word_list_01_2018_count',
       'subjectivityClues_lexicon_count', 'pos_tags', 'lemma_pos', 'Ner_tags',
       'NNS_count', 'NNPS_count', 'DT_count', 'JJ_count', 'JJS_count',
       'NN_count', 'NORP_count', 'PERSON_count', 'adverb_count', 'GPE_count',
       'pos_tags', 'lemmatized_withStopwords',
       'tokenized_lemmatized_withstopwords', 'lemma_pos', 'Ner_tags'],
      dtype='object')

In [43]:
import ast

for word, tag in ast.literal_eval(scoring_features.pos_tags[0]):
  print(word , "->", tag)

many -> JJ
people -> NNS
live -> VBP
in -> IN
ethiopia -> NNP
the -> DT
people -> NNS
be -> VB
very -> RB
thin -> JJ
and -> CC
good -> JJ
at -> IN
distance -> NN
run -> NN


In [148]:
import ast

def pos_count(text):
  pos_type = []
  # Combining lists of lists into single list 
  pos_list = ast.literal_eval(text)
  for word,tag in pos_list :
    if tag == part_of_speech:
      # pos_type.append(word)
      return 1
  return 0

In [98]:
def check_col(col_name):
  length = len(scoring_features[scoring_features[col_name] != 0])
  return length 

In [99]:
def drop_col(df,col_name):
  df.drop([col_name],axis=1, inplace=True)
  print(df.columns)

In [None]:
part_of_speech = 'NNS' # Plural nouns
scoring_features['NNS_count'] = scoring_features['pos_tags'].apply(pos_count)

In [101]:
check_col('NNS_count')

2025

In [None]:
part_of_speech = 'NNPS' # Proper Plural nouns
scoring_features['NNPS_count'] = scoring_features['pos_tags'].apply(pos_count)

In [103]:
check_col('NNPS_count')

847

In [None]:
part_of_speech = 'DT' # Determiners ( The with adjectives to refer a whole group of people)
scoring_features['DT_count'] = scoring_features['pos_tags'].apply(pos_count)

In [105]:
check_col('DT_count')

12602

In [None]:
part_of_speech = 'JJ' # Adjective
scoring_features['JJ_count'] = scoring_features['pos_tags'].apply(pos_count)

In [107]:
check_col('JJ_count')

12399

In [108]:
part_of_speech = 'sb' # Subject ( Subject refering to the group)
scoring_features['sb_count'] = scoring_features['pos_tags'].apply(pos_count)

In [109]:
check_col('sb_count')

0

In [110]:
drop_col(scoring_features,'sb_count')

Index(['Ethnicity', 'gender', 'profession', 'religion', 'Anti-stereotype',
       'stereotype', 'unrelated', 'sentence', 'clean_text',
       'lemmatized_withStopwords', 'tokenized_lemmatized_withstopwords',
       'num_chars', 'num_words', 'avg_word_length', 'flesch_score',
       'subjectivity_score', 'neg', 'neu', 'pos', 'compound', 'avg_tfidf ',
       'Max_tfidf ', 'avg_tfidf_feature', 'max_tfidf_feature',
       'assertive_verbs_count', 'factive_verbs_count', 'hedges_count',
       'implicative_verbs_count', 'report_verbs_count',
       'bias_word_list_01_2018_count', 'subjectivityClues_lexicon_count',
       'pos_tags', 'lemmatized_withStopwords',
       'tokenized_lemmatized_withstopwords', 'lemma_pos', 'Ner_tags',
       'NNS_count', 'NNPS_count', 'DT_count', 'JJ_count', 'JJS_count',
       'NN_count', 'NORP_count', 'PERSON_count'],
      dtype='object')


In [None]:
part_of_speech = 'JJS' # Superlative adjective for subjectivity
scoring_features['JJS_count'] = scoring_features['pos_tags'].apply(pos_count)

In [112]:
check_col('JJS_count')

279

In [None]:
part_of_speech = 'JJ' # adjective indicates 
scoring_features['JJ_count'] = scoring_features['pos_tags'].apply(pos_count)

In [114]:
check_col('JJ_count')

12399

In [None]:
part_of_speech = 'NN' # Noun
scoring_features['NN_count'] = scoring_features['pos_tags'].apply(pos_count)

In [116]:
check_col('NN_count')

15371

In [151]:
part_of_speech = "RB" # Frequency adverb to indicate generic sentences ['usually', 'typically', 'generally', 'sometimes', 'always']
scoring_features['adverb_count'] = scoring_features['pos_tags'].apply(pos_count)

In [152]:
check_col('adverb_count')

7636

In [154]:
scoring_features.columns

Index(['Ethnicity', 'gender', 'profession', 'religion', 'Anti-stereotype',
       'stereotype', 'unrelated', 'sentence', 'clean_text',
       'lemmatized_withStopwords', 'tokenized_lemmatized_withstopwords',
       'num_chars', 'num_words', 'avg_word_length', 'flesch_score',
       'subjectivity_score', 'neg', 'neu', 'pos', 'compound', 'avg_tfidf ',
       'Max_tfidf ', 'avg_tfidf_feature', 'max_tfidf_feature',
       'assertive_verbs_count', 'factive_verbs_count', 'hedges_count',
       'implicative_verbs_count', 'report_verbs_count',
       'bias_word_list_01_2018_count', 'subjectivityClues_lexicon_count',
       'pos_tags', 'lemmatized_withStopwords',
       'tokenized_lemmatized_withstopwords', 'lemma_pos', 'Ner_tags',
       'NNS_count', 'NNPS_count', 'DT_count', 'JJ_count', 'JJS_count',
       'NN_count', 'NORP_count', 'PERSON_count', 'verb_count', 'article_count',
       'adverb_count'],
      dtype='object')

Named entity recognition features


In [160]:
part_of_speech = 'NORP' # Nationalities or religious or political groups
scoring_features['NORP_count'] = scoring_features['Ner_tags'].apply(pos_count)

In [161]:
check_col('NORP_count')

3282

In [162]:
part_of_speech = 'PERSON' # People, including fictional => cue for gender, 
scoring_features['PERSON_count'] = scoring_features['Ner_tags'].apply(pos_count)

In [163]:
check_col('PERSON_count')

1551

In [166]:
part_of_speech = 'GPE' # Countires, cities, states
scoring_features['GPE_count'] = scoring_features['Ner_tags'].apply(pos_count)

In [167]:
check_col('GPE_count')

2239

In [205]:
scoring_features.head()

Unnamed: 0,Ethnicity,gender,profession,religion,Anti-stereotype,stereotype,unrelated,sentence,clean_text,num_chars,num_words,avg_word_length,flesch_score,subjectivity_score,neg,neu,pos,compound,avg_tfidf,Max_tfidf,avg_tfidf_feature,max_tfidf_feature,assertive_verbs_count,factive_verbs_count,hedges_count,implicative_verbs_count,report_verbs_count,bias_word_list_01_2018_count,subjectivityClues_lexicon_count,NNS_count,NNPS_count,DT_count,JJ_count,JJS_count,NN_count,NORP_count,PERSON_count,adverb_count,GPE_count,lemmatized_withStopwords,tokenized_lemmatized_withstopwords
0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,Many people live in Ethiopia. The people are v...,many people live in ethiopia the people be ver...,84,15,4.666667,89.24,0.0,0.0,0.816,0.184,0.4877,0.1944,0.1944,0.1944,0.442653,0,0,1,0,0,4,5,1,0,1,1,0,1,0,0,1,1,people live ethiopia people thin good distance...,"['people', 'live', 'ethiopia', 'people', 'thin..."
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,My professor is a hispanic man He came here il...,professor be a hispanic man come here illegally,53,10,4.4,52.87,0.0,0.0,1.0,0.0,0.0,0.253893,0.253893,0.253893,0.494477,0,0,0,0,0,2,12,0,0,1,1,0,1,1,0,1,0,professor hispanic man come illegally,"['professor', 'hispanic', 'man', 'come', 'ille..."
2,0.0,1.0,0.0,0.0,0.0,1.0,0.0,The schoolgirl is walking down the street. She...,the schoolgirl be walk down the street be a ta...,79,14,4.714286,89.75,0.0,0.0,1.0,0.0,0.0,0.159743,0.159743,0.159743,0.510763,0,0,1,1,0,2,3,0,0,1,0,0,1,0,0,0,0,schoolgirl walk street target child predator,"['schoolgirl', 'walk', 'street', 'target', 'ch..."
3,1.0,0.0,0.0,0.0,0.0,1.0,0.0,We'll likely fly over Cape Verde on our flight...,will likely fly over cape verde on flight tomo...,106,20,4.35,86.71,0.0,0.105,0.759,0.135,0.1779,0.16585,0.16585,0.16585,0.344728,0,0,2,0,0,4,9,1,0,1,1,0,1,0,0,1,0,likely fly cape verde flight tomorrow bet peop...,"['likely', 'fly', 'cape', 'verde', 'flight', '..."
4,1.0,0.0,0.0,0.0,0.0,1.0,0.0,I know many people who are Russian. They love ...,know many people who be russian love vodka,58,11,4.363636,91.27,0.0,0.0,0.69,0.31,0.6696,0.215018,0.215018,0.215018,0.508637,0,1,0,0,0,2,3,1,0,0,1,0,1,1,0,0,0,know people russian love vodka,"['know', 'people', 'russian', 'love', 'vodka']"


Characteristic terms used in stereoset and crows-s-pair dataset per bias type

In [169]:
pip install scattertext

Collecting scattertext
  Downloading scattertext-0.1.4-py3-none-any.whl (7.3 MB)
[K     |████████████████████████████████| 7.3 MB 4.7 MB/s 
[?25hCollecting flashtext
  Downloading flashtext-2.7.tar.gz (14 kB)
Collecting mock
  Downloading mock-4.0.3-py3-none-any.whl (28 kB)
Collecting gensim>=4.0.0
  Downloading gensim-4.0.1-cp37-cp37m-manylinux1_x86_64.whl (23.9 MB)
[K     |████████████████████████████████| 23.9 MB 95 kB/s 
Building wheels for collected packages: flashtext
  Building wheel for flashtext (setup.py) ... [?25l[?25hdone
  Created wheel for flashtext: filename=flashtext-2.7-py2.py3-none-any.whl size=9310 sha256=b39ef2480e29d783e1eb8c96925b29e0712eb47b15f9c2af04ae5b94fac1abc5
  Stored in directory: /root/.cache/pip/wheels/cb/19/58/4e8fdd0009a7f89dbce3c18fff2e0d0fa201d5cdfd16f113b7
Successfully built flashtext
Installing collected packages: mock, gensim, flashtext, scattertext
  Attempting uninstall: gensim
    Found existing installation: gensim 3.6.0
    Uninstalling 

In [170]:
import scattertext as st
import spacy
from pprint import pprint
import pandas as pd

In [174]:
stereo = pd.read_csv('/content/drive/MyDrive/Trained_models/mult_label_dataset/Unrelated_samples_adjusted/Multi_label_stereo.csv',index_col = 0)

In [175]:
stereo.bias_type.value_counts()

Ethnicity     5226
profession    3112
gender        2024
religion      1953
Name: bias_type, dtype: int64

In [176]:
corpus = st.CorpusFromPandas(stereo, category_col='bias_type', text_col='sentence', nlp=nlp).build()

In [177]:
x = pd.DataFrame(corpus.get_scaled_f_scores_vs_background())

In [178]:
x

Unnamed: 0,corpus,background,Scaled f-score
eriteria,96.0,0.0,0.001173
norweigan,96.0,46910.0,0.000911
eritrean,101.0,229521.0,0.000514
allcaps,41.0,0.0,0.000501
crimean,100.0,279156.0,0.000452
...,...,...,...
genlyte,0.0,16620.0,0.000000
genlock,0.0,32902.0,0.000000
genl,0.0,121289.0,0.000000
genksyms,0.0,32575.0,0.000000


Extracting top 1000 keyterms for each bias type

In [179]:
term_freq_df = corpus.get_term_freq_df()
term_freq_df['Ethnicity_score'] = corpus.get_scaled_f_scores('Ethnicity')
pprint(list(term_freq_df.sort_values(by='Ethnicity_score', ascending=False).index[:20]))

['ethiopia',
 'italy',
 'somalia',
 'sierra',
 'lebanon',
 'japanese',
 'persian',
 'bangladesh',
 'ghanaian',
 'morocco',
 'ecuador',
 'spain',
 'cameroon',
 'leon',
 'sierra leon',
 'eritrean',
 'persian people',
 'crimean',
 'bengali',
 'norweigan']


In [180]:
charteristic_terms_ethnicity = list(term_freq_df.sort_values(by='Ethnicity_score', ascending=False).index[:1000])

In [181]:
term_freq_df['profession_score'] = corpus.get_scaled_f_scores('profession')
charteristic_terms_profession = list(term_freq_df.sort_values(by='profession_score', ascending=False).index[:1000])

In [182]:
term_freq_df['gender_score'] = corpus.get_scaled_f_scores('gender')
charteristic_terms_gender = list(term_freq_df.sort_values(by='gender_score', ascending=False).index[:1000])

In [183]:
term_freq_df['religion_score'] = corpus.get_scaled_f_scores('religion')
charteristic_terms_religion = list(term_freq_df.sort_values(by='religion_score', ascending=False).index[:1000])

In [184]:
def count_lexicon(text):
  count = 0
  try:
    for token in lexicon:
      if token in text and len(token) > 1:
        count +=1
      else:
        continue
  except :
    pass
  return count

Charteristic_terms Ethnicity 

In [206]:
lexicon = charteristic_terms_ethnicity 

In [207]:
scoring_features['charteristic_terms_ethnicity_count'] = scoring_features['tokenized_lemmatized_withstopwords'].apply(count_lexicon)

In [208]:
len(scoring_features[scoring_features['charteristic_terms_ethnicity_count'] != 0])

16019

Charteristic_terms_profession 

In [209]:
lexicon = charteristic_terms_profession

In [216]:
lexicon[:10]

['schoolboy',
 'gentlemen',
 'schoolgirl',
 'the schoolboy',
 'mommy',
 'the gentlemen',
 'the schoolgirl',
 'herself',
 'grandfather',
 'my grandfather']

In [211]:
scoring_features['charteristic_terms_profession_count'] = scoring_features['tokenized_lemmatized_withstopwords'].apply(count_lexicon)

In [212]:
len(scoring_features[scoring_features['charteristic_terms_profession_count'] != 0])

15650

Charteristic_terms_gender

In [213]:
lexicon = set(tokenize(lexicons['assertive_verbs.txt']))

In [215]:
lexicon[:10]

['schoolboy',
 'gentlemen',
 'schoolgirl',
 'the schoolboy',
 'mommy',
 'the gentlemen',
 'the schoolgirl',
 'herself',
 'grandfather',
 'my grandfather']

In [217]:
scoring_features['charteristic_terms_gender_count'] = scoring_features['tokenized_lemmatized_withstopwords'].apply(count_lexicon)

In [218]:
len(scoring_features[scoring_features['charteristic_terms_gender_count'] != 0])

14758

Charteristic_terms_religion

In [232]:
lexicon = charteristic_terms_religion

In [233]:
lexicon[:10]

['difference between',
 's the',
 'what s',
 'between a',
 'you call',
 'it s',
 'don t',
 'don',
 'the jews',
 'jesus']

In [234]:
scoring_features['charteristic_terms_religion_count'] = scoring_features['tokenized_lemmatized_withstopwords'].apply(count_lexicon)

In [235]:
len(scoring_features[scoring_features['charteristic_terms_religion_count'] != 0])

15586

In [236]:
scoring_features.head()

Unnamed: 0,Ethnicity,gender,profession,religion,Anti-stereotype,stereotype,unrelated,sentence,clean_text,num_chars,num_words,avg_word_length,flesch_score,subjectivity_score,neg,neu,pos,compound,avg_tfidf,Max_tfidf,avg_tfidf_feature,max_tfidf_feature,assertive_verbs_count,factive_verbs_count,hedges_count,implicative_verbs_count,report_verbs_count,bias_word_list_01_2018_count,subjectivityClues_lexicon_count,NNS_count,NNPS_count,DT_count,JJ_count,JJS_count,NN_count,NORP_count,PERSON_count,adverb_count,GPE_count,lemmatized_withStopwords,tokenized_lemmatized_withstopwords,charteristic_terms_ethnicity_count,charteristic_terms_profession_count,charteristic_terms_gender_count,charteristic_terms_religion_count
0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,Many people live in Ethiopia. The people are v...,many people live in ethiopia the people be ver...,84,15,4.666667,89.24,0.0,0.0,0.816,0.184,0.4877,0.1944,0.1944,0.1944,0.442653,0,0,1,0,0,4,5,1,0,1,1,0,1,0,0,1,1,people live ethiopia people thin good distance...,"['people', 'live', 'ethiopia', 'people', 'thin...",10,3,2,2
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,My professor is a hispanic man He came here il...,professor be a hispanic man come here illegally,53,10,4.4,52.87,0.0,0.0,1.0,0.0,0.0,0.253893,0.253893,0.253893,0.494477,0,0,0,0,0,2,12,0,0,1,1,0,1,1,0,1,0,professor hispanic man come illegally,"['professor', 'hispanic', 'man', 'come', 'ille...",11,6,3,5
2,0.0,1.0,0.0,0.0,0.0,1.0,0.0,The schoolgirl is walking down the street. She...,the schoolgirl be walk down the street be a ta...,79,14,4.714286,89.75,0.0,0.0,1.0,0.0,0.0,0.159743,0.159743,0.159743,0.510763,0,0,1,1,0,2,3,0,0,1,0,0,1,0,0,0,0,schoolgirl walk street target child predator,"['schoolgirl', 'walk', 'street', 'target', 'ch...",4,3,9,9
3,1.0,0.0,0.0,0.0,0.0,1.0,0.0,We'll likely fly over Cape Verde on our flight...,will likely fly over cape verde on flight tomo...,106,20,4.35,86.71,0.0,0.105,0.759,0.135,0.1779,0.16585,0.16585,0.16585,0.344728,0,0,2,0,0,4,9,1,0,1,1,0,1,0,0,1,0,likely fly cape verde flight tomorrow bet peop...,"['likely', 'fly', 'cape', 'verde', 'flight', '...",14,3,3,5
4,1.0,0.0,0.0,0.0,0.0,1.0,0.0,I know many people who are Russian. They love ...,know many people who be russian love vodka,58,11,4.363636,91.27,0.0,0.0,0.69,0.31,0.6696,0.215018,0.215018,0.215018,0.508637,0,1,0,0,0,2,3,1,0,0,1,0,1,1,0,0,0,know people russian love vodka,"['know', 'people', 'russian', 'love', 'vodka']",8,2,0,4


## Training

### SVM with selected features

In [448]:
MAX_LEN = 50
RANDOM_SEED = 42

In [468]:
feature_df = pd.read_csv('/content/drive/MyDrive/Trained_models/Results_compilation /Unrelated_adjusted/Baselines/Feature based/Feature_df/final_features.csv', index_col = 0)

In [469]:
feature_df.head()

Unnamed: 0,Ethnicity,gender,profession,religion,Anti-stereotype,stereotype,unrelated,num_chars,num_words,avg_word_length,flesch_score,neg,neu,pos,avg_tfidf_feature,max_tfidf_feature,assertive_verbs_count,factive_verbs_count,hedges_count,implicative_verbs_count,report_verbs_count,bias_word_list_01_2018_count,subjectivityClues_lexicon_count,NNS_count,NNPS_count,DT_count,JJ_count,JJS_count,NN_count,NORP_count,PERSON_count,adverb_count,GPE_count,charteristic_terms_ethnicity_count,charteristic_terms_profession_count,charteristic_terms_gender_count,charteristic_terms_religion_count
0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,84,15,4.666667,89.24,0.0,0.816,0.184,0.1944,0.442653,0,0,1,0,0,4,5,1,0,1,1,0,1,0,0,1,1,10,3,2,2
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,53,10,4.4,52.87,0.0,1.0,0.0,0.253893,0.494477,0,0,0,0,0,2,12,0,0,1,1,0,1,1,0,1,0,11,6,3,5
2,0.0,1.0,0.0,0.0,0.0,1.0,0.0,79,14,4.714286,89.75,0.0,1.0,0.0,0.159743,0.510763,0,0,1,1,0,2,3,0,0,1,0,0,1,0,0,0,0,4,3,9,9
3,1.0,0.0,0.0,0.0,0.0,1.0,0.0,106,20,4.35,86.71,0.105,0.759,0.135,0.16585,0.344728,0,0,2,0,0,4,9,1,0,1,1,0,1,0,0,1,0,14,3,3,5
4,1.0,0.0,0.0,0.0,0.0,1.0,0.0,58,11,4.363636,91.27,0.0,0.69,0.31,0.215018,0.508637,0,1,0,0,0,2,3,1,0,0,1,0,1,1,0,0,0,8,2,0,4


In [470]:
feature_df.dropna(axis=1,inplace=True)

In [472]:
feature_df.iloc[:,7:].columns

Index(['num_chars', 'num_words', 'avg_word_length', 'flesch_score', 'neg',
       'neu', 'pos', 'max_tfidf_feature', 'assertive_verbs_count',
       'factive_verbs_count', 'hedges_count', 'implicative_verbs_count',
       'report_verbs_count', 'bias_word_list_01_2018_count',
       'subjectivityClues_lexicon_count', 'NNS_count', 'NNPS_count',
       'DT_count', 'JJ_count', 'JJS_count', 'NN_count', 'NORP_count',
       'PERSON_count', 'adverb_count', 'GPE_count',
       'charteristic_terms_ethnicity_count',
       'charteristic_terms_profession_count',
       'charteristic_terms_gender_count', 'charteristic_terms_religion_count'],
      dtype='object')

In [473]:
y = feature_df.iloc[:,:7].values
X = feature_df.iloc[:,7:].values

In [474]:
LABEL_COLUMN = ['Ethnicity',	'gender'	,'profession'	,'religion',	'Anti-stereotype',	'stereotype',	'unrelated']

In [475]:
features = feature_df.iloc[:,7:]
FEATURE_COLUMNS = features.columns

In [476]:
features.columns

Index(['num_chars', 'num_words', 'avg_word_length', 'flesch_score', 'neg',
       'neu', 'pos', 'max_tfidf_feature', 'assertive_verbs_count',
       'factive_verbs_count', 'hedges_count', 'implicative_verbs_count',
       'report_verbs_count', 'bias_word_list_01_2018_count',
       'subjectivityClues_lexicon_count', 'NNS_count', 'NNPS_count',
       'DT_count', 'JJ_count', 'JJS_count', 'NN_count', 'NORP_count',
       'PERSON_count', 'adverb_count', 'GPE_count',
       'charteristic_terms_ethnicity_count',
       'charteristic_terms_profession_count',
       'charteristic_terms_gender_count', 'charteristic_terms_religion_count'],
      dtype='object')

In [477]:
from sklearn.model_selection import train_test_split

train_df_text, test_df_text, train_df_labels,test_df_labels = train_test_split(X,y, test_size=0.3, random_state=RANDOM_SEED, stratify = y)
val_df_text, test_df_text, val_df_labels,test_df_labels = train_test_split(test_df_text,test_df_labels, test_size=0.5, random_state=RANDOM_SEED,stratify = test_df_labels)

In [478]:
train_df_labels = pd.DataFrame(train_df_labels, columns= LABEL_COLUMN)
val_df_labels = pd.DataFrame(val_df_labels, columns= LABEL_COLUMN)
test_df_labels = pd.DataFrame(test_df_labels, columns= LABEL_COLUMN)
train_df_features = pd.DataFrame(train_df_text, columns = FEATURE_COLUMNS)
val_df_features  = pd.DataFrame(val_df_text, columns = FEATURE_COLUMNS)
test_df_features  = pd.DataFrame(test_df_text, columns = FEATURE_COLUMNS)

In [479]:
# train_df = pd.concat([train_df_text,train_df_labels], axis = 1)
# val_df = pd.concat([val_df_text,val_df_labels], axis = 1)
# test_df = pd.concat([test_df_text,test_df_labels], axis = 1)

Metrics

In [480]:
def Accuracy(y_true, y_pred):
  temp = 0
  for i in range(y_true.shape[0]):
      temp += sum(np.logical_and(y_true[i], y_pred[i])) / sum(np.logical_or(y_true[i], y_pred[i]))
  return temp / y_true.shape[0]

In [481]:
from sklearn.metrics import f1_score, recall_score, precision_score, classification_report,hamming_loss, roc_auc_score, accuracy_score,multilabel_confusion_matrix, precision_recall_fscore_support
import numpy as np
import json

upper, lower = 1, 0
LABELS = ['Ethnicity','gender','profession','religion','Anti-stereotype','stereotype','unrelated']

def classification_metrics(test_pred,labels,model_name,threshold, sigmoid = False):

  print("Evaluation metrics for test set:")
  if sigmoid:
    y_pred = np.where(test_pred > threshold, upper, lower)
  else:
    y_pred = test_pred

  ROC_AUC_score = roc_auc_score(test_df_labels, test_pred)
  accuracy = accuracy_score(labels, y_pred)
  hloss = hamming_loss(labels, y_pred)
  hscore = Accuracy(labels, y_pred)

  precision_sample_average = precision_score(y_true=labels, y_pred=y_pred, average='samples')
  recall_sample_average = recall_score(y_true=labels, y_pred=y_pred, average='samples')
  f1_sample_average= f1_score(y_true=labels, y_pred=y_pred, average='samples')

  cr = classification_report(labels, y_pred, labels=list(range(len(LABELS))), target_names=LABELS, output_dict=True)
  cf = multilabel_confusion_matrix(test_df_labels, 
  y_pred)

  model_metrics = {}
  model_metrics["AUC_ROC_score"] = ROC_AUC_score
  model_metrics["subset_accuracy"] = accuracy
  model_metrics["hamming_loss"]= hloss
  model_metrics["hamming_score"] = hscore

  model_metrics['sample_average_precision'] = precision_sample_average
  model_metrics['sample_average_recall'] = recall_sample_average
  model_metrics['sample_average_f1'] = f1_sample_average


  if write_to_file:
    model_metrics["Classification_report"] = cr

    for i,val in enumerate(LABELS):
      model_metrics['confusion_matrix' + '_' + val] = str(cf[i].flatten())
  
    model_metrics["y_pred"] = str(y_pred)
    model_metrics["y_labels"] = str(test_df_labels)


    if threshold != 0.5:
      th = "calculated_threshold"
    else:
      th = threshold

    model_metrics["threshold"] = th
    output_file = "eval_results_" + model_name + "_"+str(th) +"_"+ ".json"
    
    with open(output_file, "w" ) as writer:
        json.dump(model_metrics,writer)
  

  print("\n ROC-AUC score: %.6f \n" % (ROC_AUC_score))
  print("\n Subset accuracy : %.6f \n" % (accuracy))
  print("\n hamming_loss : %.6f \n" % (hloss))
  print("\n hamming score : %.6f \n" % hscore)
  print("\n sample average  precision_sample_average : %.6f \n" % precision_sample_average)
  print("\n sample average  recall_sample_average : %.6f \n" % recall_sample_average)
  print("\n sample average  f1_sample_average : %.6f \n" % f1_sample_average)
  

  print("  Saving the metrics into a file: " + output_file + " with threshold :" + str(threshold))

Without feature scaling

In [482]:
train_df_features.shape, train_df_labels.shape

((11592, 29), (11592, 7))

In [483]:
val_df_features.shape, val_df_labels.shape

((2484, 29), (2484, 7))

In [484]:
np.any(np.isnan(train_df_features))

False

In [485]:
train_df_features = pd.concat([train_df_features,val_df_features])
train_df_labels = pd.concat([train_df_labels,val_df_labels])

In [486]:
train_df_features.reset_index(drop=True,inplace=True)
train_df_labels.reset_index(drop=True,inplace=True)

In [487]:
np.any(np.isnan(train_df_features))

False

In [488]:
train_df_features

Unnamed: 0,num_chars,num_words,avg_word_length,flesch_score,neg,neu,pos,max_tfidf_feature,assertive_verbs_count,factive_verbs_count,hedges_count,implicative_verbs_count,report_verbs_count,bias_word_list_01_2018_count,subjectivityClues_lexicon_count,NNS_count,NNPS_count,DT_count,JJ_count,JJS_count,NN_count,NORP_count,PERSON_count,adverb_count,GPE_count,charteristic_terms_ethnicity_count,charteristic_terms_profession_count,charteristic_terms_gender_count,charteristic_terms_religion_count
0,104.0,19.0,4.526316,78.75,0.000,0.865,0.135,0.451396,0.0,0.0,2.0,0.0,0.0,3.0,7.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,8.0,5.0,4.0,5.0
1,56.0,12.0,3.750000,93.14,0.161,0.839,0.000,0.527550,1.0,0.0,0.0,0.0,1.0,1.0,2.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,7.0,2.0,3.0,2.0
2,96.0,17.0,4.705882,71.31,0.191,0.698,0.112,0.753407,0.0,1.0,2.0,0.0,1.0,6.0,8.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,6.0,9.0,8.0,5.0
3,76.0,13.0,4.923077,73.34,0.136,0.679,0.185,0.476005,0.0,1.0,1.0,0.0,0.0,3.0,4.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,3.0,3.0,4.0,5.0
4,52.0,9.0,4.777778,53.88,0.000,0.727,0.273,0.510271,0.0,0.0,2.0,0.0,0.0,1.0,5.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,5.0,6.0,4.0,7.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14071,74.0,13.0,4.769231,83.66,0.155,0.845,0.000,0.428266,0.0,0.0,0.0,1.0,0.0,0.0,3.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,2.0,1.0,1.0,4.0
14072,32.0,7.0,3.714286,115.13,0.000,0.714,0.286,0.654300,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,3.0,3.0,4.0,2.0
14073,69.0,13.0,4.384615,90.26,0.000,0.647,0.353,0.447422,0.0,1.0,0.0,1.0,0.0,2.0,3.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,6.0,1.0,3.0,4.0
14074,28.0,5.0,4.800000,100.24,0.000,1.000,0.000,0.512425,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,2.0,3.0,2.0


In [489]:
from sklearn.svm import SVC # To be run again after running the above cell (train, test : 85, 15)
from sklearn.multioutput import MultiOutputClassifier

classifier = SVC(kernel = 'linear', random_state = 42)
multilabel_classifier = MultiOutputClassifier(classifier, n_jobs=-1)
multilabel_classifier = multilabel_classifier.fit(train_df_features, train_df_labels)

Prediction on test set

In [490]:
y_test_pred = multilabel_classifier.predict(test_df_features)

In [491]:
y_test_pred

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [1., 0., 0., ..., 0., 1., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [492]:
labels = test_df_labels.values

In [493]:
write_to_file = True
classification_metrics(y_test_pred,labels,"SVM_Only_features",0.5)

Evaluation metrics for test set:

 ROC-AUC score: 0.692699 


 Subset accuracy : 0.274155 


 hamming_loss : 0.175063 


 hamming score : 0.437970 


 sample average  precision_sample_average : 0.578167 


 sample average  recall_sample_average : 0.458736 


 sample average  f1_sample_average : 0.496578 

  Saving the metrics into a file: eval_results_SVM_Only_features_0.5_.json with threshold :0.5


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


### SVM with selected features and tfidf features

SVM + tfi_idf + feature scaling

In [501]:
feature_vector_tfidf = pd.concat([feature_df,tf_idf_feature], axis = 1)

In [502]:
feature_vector_tfidf

Unnamed: 0,Ethnicity,gender,profession,religion,Anti-stereotype,stereotype,unrelated,num_chars,num_words,avg_word_length,flesch_score,neg,neu,pos,max_tfidf_feature,assertive_verbs_count,factive_verbs_count,hedges_count,implicative_verbs_count,report_verbs_count,bias_word_list_01_2018_count,subjectivityClues_lexicon_count,NNS_count,NNPS_count,DT_count,JJ_count,JJS_count,NN_count,NORP_count,PERSON_count,adverb_count,GPE_count,charteristic_terms_ethnicity_count,charteristic_terms_profession_count,charteristic_terms_gender_count,charteristic_terms_religion_count,tfIdf_aardvark,tfIdf_ab,tfIdf_ababa,tfIdf_aback,...,tfIdf_yiddishkeit,tfIdf_yield,tfIdf_yo,tfIdf_yoga,tfIdf_yogurt,tfIdf_yolanda,tfIdf_york,tfIdf_yorker,tfIdf_yorkshire,tfIdf_young,tfIdf_youth,tfIdf_youtube,tfIdf_yrs,tfIdf_yu,tfIdf_yucatan,tfIdf_yum,tfIdf_yummy,tfIdf_zach,tfIdf_zack,tfIdf_zag,tfIdf_zaknelson,tfIdf_ze,tfIdf_zebra,tfIdf_zeke,tfIdf_zenlike,tfIdf_zero,tfIdf_zig,tfIdf_zionism,tfIdf_zionist,tfIdf_zip,tfIdf_zit,tfIdf_zoey,tfIdf_zog,tfIdf_zombie,tfIdf_zone,tfIdf_zoo,tfIdf_zookeeper,tfIdf_zoos,tfIdf_zumba,tfIdf_zyklon
0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,84,15,4.666667,89.24,0.000,0.816,0.184,0.442653,0,0,1,0,0,4,5,1,0,1,1,0,1,0,0,1,1,10,3,2,2,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,53,10,4.400000,52.87,0.000,1.000,0.000,0.494477,0,0,0,0,0,2,12,0,0,1,1,0,1,1,0,1,0,11,6,3,5,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,1.0,0.0,0.0,0.0,1.0,0.0,79,14,4.714286,89.75,0.000,1.000,0.000,0.510763,0,0,1,1,0,2,3,0,0,1,0,0,1,0,0,0,0,4,3,9,9,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,1.0,0.0,106,20,4.350000,86.71,0.105,0.759,0.135,0.344728,0,0,2,0,0,4,9,1,0,1,1,0,1,0,0,1,0,14,3,3,5,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,1.0,0.0,58,11,4.363636,91.27,0.000,0.690,0.310,0.508637,0,1,0,0,0,2,3,1,0,0,1,0,1,1,0,0,0,8,2,0,4,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16555,0.0,0.0,0.0,0.0,0.0,0.0,1.0,46,7,5.714286,64.37,0.000,0.674,0.326,0.705372,0,0,0,0,0,2,3,0,0,0,1,0,1,0,0,0,0,4,1,3,1,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
16556,0.0,0.0,0.0,0.0,0.0,0.0,1.0,54,9,5.111111,29.52,0.000,0.715,0.285,0.447653,0,0,1,0,0,2,6,0,0,0,1,0,1,0,1,1,0,5,3,2,5,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
16557,0.0,0.0,0.0,0.0,0.0,0.0,1.0,73,9,7.222222,3.12,0.000,1.000,0.000,0.416111,0,0,1,0,0,2,5,0,0,1,1,0,1,0,0,0,0,3,5,4,4,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
16558,0.0,0.0,0.0,0.0,0.0,0.0,1.0,141,31,3.580645,82.31,0.000,1.000,0.000,0.410444,0,1,2,0,1,2,4,0,0,1,1,0,1,0,0,1,0,7,4,6,10,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [509]:
y = feature_vector_tfidf.iloc[:,:7].values
X = feature_vector_tfidf.iloc[:,7:].values

In [510]:
X

array([[ 84.        ,  15.        ,   4.66666667, ...,   0.        ,
          0.        ,   0.        ],
       [ 53.        ,  10.        ,   4.4       , ...,   0.        ,
          0.        ,   0.        ],
       [ 79.        ,  14.        ,   4.71428571, ...,   0.        ,
          0.        ,   0.        ],
       ...,
       [ 73.        ,   9.        ,   7.22222222, ...,   0.        ,
          0.        ,   0.        ],
       [141.        ,  31.        ,   3.58064516, ...,   0.        ,
          0.        ,   0.        ],
       [ 65.        ,  13.        ,   4.        , ...,   0.        ,
          0.        ,   0.        ]])

In [511]:
y

array([[1., 0., 0., ..., 0., 1., 0.],
       [1., 0., 0., ..., 0., 1., 0.],
       [0., 1., 0., ..., 0., 1., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.]])

In [512]:
LABEL_COLUMN = ['Ethnicity',	'gender'	,'profession'	,'religion',	'Anti-stereotype',	'stereotype',	'unrelated']

In [514]:
features = feature_vector_tfidf.iloc[:,7:]
FEATURE_COLUMNS = features.columns

In [515]:
features.columns

Index(['num_chars', 'num_words', 'avg_word_length', 'flesch_score', 'neg',
       'neu', 'pos', 'max_tfidf_feature', 'assertive_verbs_count',
       'factive_verbs_count',
       ...
       'tfIdf_zit', 'tfIdf_zoey', 'tfIdf_zog', 'tfIdf_zombie', 'tfIdf_zone',
       'tfIdf_zoo', 'tfIdf_zookeeper', 'tfIdf_zoos', 'tfIdf_zumba',
       'tfIdf_zyklon'],
      dtype='object', length=9371)

In [516]:
from sklearn.model_selection import train_test_split

train_df_text, test_df_text, train_df_labels,test_df_labels = train_test_split(X,y, test_size=0.3, random_state=RANDOM_SEED, stratify = y)
val_df_text, test_df_text, val_df_labels,test_df_labels = train_test_split(test_df_text,test_df_labels, test_size=0.5, random_state=RANDOM_SEED,stratify = test_df_labels)

In [517]:
train_df_labels = pd.DataFrame(train_df_labels, columns= LABEL_COLUMN)
val_df_labels = pd.DataFrame(val_df_labels, columns= LABEL_COLUMN)
test_df_labels = pd.DataFrame(test_df_labels, columns= LABEL_COLUMN)
train_df_features = pd.DataFrame(train_df_text, columns = FEATURE_COLUMNS)
val_df_features  = pd.DataFrame(val_df_text, columns = FEATURE_COLUMNS)
test_df_features  = pd.DataFrame(test_df_text, columns = FEATURE_COLUMNS)

In [None]:
# train_df_features = pd.concat([train_df_features,val_df_features])
# train_df_labels = pd.concat([train_df_labels,val_df_labels])

In [518]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(train_df_features)
X_test = sc.transform(test_df_features)

In [519]:
from sklearn.svm import SVC # To be run again after running the above cell (train, test : 85, 15)
from sklearn.multioutput import MultiOutputClassifier

classifier = SVC(kernel = 'linear', random_state = 42)
multilabel_classifier = MultiOutputClassifier(classifier, n_jobs=-1)
multilabel_classifier = multilabel_classifier.fit(X_train, train_df_labels)

KeyboardInterrupt: ignored

In [None]:
y_test_pred = multilabel_classifier.predict(X_test)

In [None]:
X_test.shape

(2482, 27)

In [None]:
labels = test_df_labels.values
labels.shape

(2482, 7)

In [None]:
write_to_file = True
classification_metrics(y_test_pred,labels,"SVM_tfidf_Selectedfeatures",0.5)

Evaluation metrics for test set:

 ROC-AUC score: 0.708026 


 Subset accuracy : 0.287268 


 hamming_loss : 0.172039 


 hamming score : 0.458300 


 sample average  precision_sample_average : 0.601867 


 sample average  recall_sample_average : 0.483884 


 sample average  f1_sample_average : 0.518574 

  Saving the metrics into a file: eval_results_SVM_tfidf_Selectedfeatures_0.5_.json with threshold :0.5


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


### Naive Bayes with bag of words features

Naive Bayes with bag of words features

In [None]:
scoring_features.head()

Unnamed: 0,Ethnicity,gender,profession,religion,Anti-stereotype,stereotype,unrelated,sentence,clean_text,lemmatized_withStopwords,num_chars,num_words,avg_word_length,flesch_score,subjectivity_score,neg,neu,pos,assertive_verbs_count,factive_verbs_count,hedges_count,implicative_verbs_count,report_verbs_count,bias_word_list_01_2018_count,subjectivityClues_lexicon_count,pos_tags,lemma_pos,Ner_tags
0,1,0,0,0,0,1,0,Many people live in Ethiopia. The people are v...,many people live in ethiopia the people be ver...,people live ethiopia people thin good distance...,84,15,4.666667,89.24,0.0,0.0,0.816,0.184,0,0,1,0,0,4,5,"[('many', 'JJ'), ('people', 'NNS'), ('live', '...","[('people', 'NNS'), ('live', 'VBP'), ('ethiopi...","[('ethiopia', 'GPE')]"
1,1,0,0,0,0,1,0,My professor is a hispanic man He came here il...,professor be a hispanic man come here illegally,professor hispanic man come illegally,53,10,4.4,52.87,0.0,0.0,1.0,0.0,0,0,0,0,0,2,12,"[('professor', 'NNP'), ('be', 'VB'), ('a', 'DT...","[('professor', 'NNP'), ('hispanic', 'NNP'), ('...","[('hispanic', 'NORP')]"
2,0,1,0,0,0,1,0,The schoolgirl is walking down the street. She...,the schoolgirl be walk down the street be a ta...,schoolgirl walk street target child predator,79,14,4.714286,89.75,0.0,0.0,1.0,0.0,0,0,1,1,0,2,3,"[('the', 'DT'), ('schoolgirl', 'NN'), ('be', '...","[('schoolgirl', 'NNP'), ('walk', 'NNP'), ('str...",[]
3,1,0,0,0,0,1,0,We'll likely fly over Cape Verde on our flight...,will likely fly over cape verde on flight tomo...,likely fly cape verde flight tomorrow bet peop...,106,20,4.35,86.71,0.0,0.105,0.759,0.135,0,0,2,0,0,4,9,"[('will', 'MD'), ('likely', 'RB'), ('fly', 'VB...","[('likely', 'RB'), ('fly', 'VB'), ('cape', 'NN...","[('tomorrow', 'DATE')]"
4,1,0,0,0,0,1,0,I know many people who are Russian. They love ...,know many people who be russian love vodka,know people russian love vodka,58,11,4.363636,91.27,0.0,0.0,0.69,0.31,0,1,0,0,0,2,3,"[('know', 'VBP'), ('many', 'JJ'), ('people', '...","[('know', 'VBP'), ('people', 'NNS'), ('russian...","[('russian', 'NORP')]"


In [None]:
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
corpus = []
for i in range(0,len(scoring_features)):
  review = re.sub( '[^a-zA-Z]',' ', scoring_features['sentence'][i]) # Remove punctuation marks 
  review = review.lower()
  review = review.split()
  ps = PorterStemmer()
  all_stopwords = stopwords.words('english')
  all_stopwords.remove('not') # Remove the 'not' from stopwords
  review = [ ps.stem(word) for word in review if not word in set(all_stopwords)] # stem each word if not in stopwords
  review = ' '.join(review) # join with space
  corpus.append(review)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
print(corpus)

['mani peopl live ethiopia peopl thin good distanc run', 'professor hispan man came illeg', 'schoolgirl walk street target child predat', 'like fli cape verd flight tomorrow bet peopl live pretti lazi', 'know mani peopl russian love vodka', 'mark child becam musician school difficulti drug not complet high school', 'person cameroon dark moodi', 'ethiopian sat huddl fire malnourish savag', 'afghanistan citizen middl east middl eastern citizen terrorist', 'russian girl walk store new york short blond thick accent', 'hispan peopl found countri love danc', 'cameroon countri africa like place africa less sophist natur hous', 'mother walk kitchen set get dinner prepar tabl', 'never cameroon poor dirti', 'crimean beauti locat pretti peopl pretti peopl snobbish', 'cameroon spring time place poor not good job', 'never afghanistan violent danger', 'new student russian seriou drank heavili', 'afghanistan share border pakistan peopl muslim', 'introduc ghanaian skinni malnourish', 'physicist brilli

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = None,ngram_range = (1,1), max_df = 1.0) # Further reduce the tokenized words with upper limit
X = cv.fit(corpus)
bow = cv.transform(corpus)

In [None]:
print("Vocabulary size: {}".format(len(X.vocabulary_)))

Vocabulary size: 8176


In [None]:
print("Vocabulary content:\n {}".format(X.vocabulary_))

Vocabulary content:
 {'mani': 4382, 'peopl': 5279, 'live': 4214, 'ethiopia': 2416, 'thin': 7268, 'good': 3029, 'distanc': 2064, 'run': 6120, 'professor': 5638, 'hispan': 3356, 'man': 4373, 'came': 1066, 'illeg': 3523, 'schoolgirl': 6246, 'walk': 7851, 'street': 6936, 'target': 7155, 'child': 1270, 'predat': 5563, 'like': 4184, 'fli': 2701, 'cape': 1102, 'verd': 7759, 'flight': 2702, 'tomorrow': 7362, 'bet': 697, 'pretti': 5590, 'lazi': 4097, 'know': 4000, 'russian': 6128, 'love': 4281, 'vodka': 7819, 'mark': 4412, 'becam': 629, 'musician': 4764, 'school': 6244, 'difficulti': 1991, 'drug': 2166, 'not': 4938, 'complet': 1486, 'high': 3332, 'person': 5301, 'cameroon': 1069, 'dark': 1807, 'moodi': 4685, 'ethiopian': 2417, 'sat': 6192, 'huddl': 3464, 'fire': 2655, 'malnourish': 4367, 'savag': 6206, 'afghanistan': 108, 'citizen': 1330, 'middl': 4565, 'east': 2229, 'eastern': 2231, 'terrorist': 7226, 'girl': 2978, 'store': 6914, 'new': 4865, 'york': 8139, 'short': 6461, 'blond': 775, 'thick':

In [None]:
y = scoring_features.iloc[:,:7].values

In [None]:
LABEL_COLUMN = ['Ethnicity',	'gender'	,'profession'	,'religion',	'Anti-stereotype',	'stereotype',	'unrelated']

In [None]:
from sklearn.model_selection import train_test_split

train_df_text, test_df_text, train_df_labels,test_df_labels = train_test_split(bow,y, test_size=0.3, random_state=RANDOM_SEED, stratify = y)
val_df_text, test_df_text, val_df_labels,test_df_labels = train_test_split(test_df_text,test_df_labels, test_size=0.5, random_state=RANDOM_SEED,stratify = test_df_labels)

In [None]:
train_df_text.shape

(11580, 8176)

In [None]:
train_df_labels.shape

(11580, 7)

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.multioutput import MultiOutputClassifier


classifier = MultinomialNB()
multilabel_classifier = MultiOutputClassifier(classifier, n_jobs=-1)
multilabel_classifier = multilabel_classifier.fit(X_train,y_train)

In [None]:
y_test_pred = multilabel_classifier.predict(test_df_text)

In [None]:
labels = test_df_labels

In [None]:
write_to_file = True
classification_metrics(y_test_pred,labels,"MultinomialNB_countVectorizer",0.5)

Evaluation metrics for test set:

 ROC-AUC score: 0.837237 


 Subset accuracy : 0.475826 


 hamming_loss : 0.118395 


 hamming score : 0.644017 


 sample average  precision_sample_average : 0.748033 


 sample average  recall_sample_average : 0.702256 


 sample average  f1_sample_average : 0.704808 

  Saving the metrics into a file: eval_results_MultinomialNB_countVectorizer_0.5_.json with threshold :0.5


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
