**Goals**

Crowdsourcing annotations has become a fundamental aspect of NLP research. The goal of this hands-on exercise is to explore the ethical implications of soliciting crowdsourced data, specifically the social biases that may emerge when crowdworkers are asked to write sentences.

**Overview**

In this exercise, you will perform a “bias audit” of an NLP dataset produced by crowdsourcing. You will attempt to measure the presence of social stereotypes in this dataset, which may have harmful effects if the dataset is used to train classifiers for downstream tasks.

We will use pointwise mutual information (PMI) to find which words are associated with identity labels. PMI measures word association in a corpus, i.e. how much more often two words co-occur than would be expected from their individual frequencies if the words were independent.

See the [PMI Wikipedia page](https://en.wikipedia.org/wiki/Pointwise_mutual_information) for more details. Here we use PMI to measure which words co-occur with labels for identities. This allows us to see associations that may perpetuate stereotypes.
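
As a small illustration of the definition, the sketch below computes PMI on a toy corpus of four made-up sentences. The corpus, the function name `toy_pmi`, and the choice to count co-occurrence at the sentence level are assumptions for illustration only; the notebook itself counts adjacent word pairs.

```python
import math
from collections import Counter
from itertools import combinations

# Toy corpus (hypothetical sentences, not taken from SNLI)
sentences = [
    ["woman", "cooking", "kitchen"],
    ["woman", "cooking", "dinner"],
    ["man", "playing", "guitar"],
    ["man", "cooking", "dinner"],
]

n_sentences = len(sentences)
word_counts = Counter()  # number of sentences containing each word
pair_counts = Counter()  # number of sentences containing each unordered pair

for tokens in sentences:
    unique = sorted(set(tokens))
    word_counts.update(unique)
    pair_counts.update(combinations(unique, 2))

def toy_pmi(w1, w2):
    """PMI in bits, using sentence-level co-occurrence probabilities."""
    pair = tuple(sorted((w1, w2)))
    if pair_counts[pair] == 0:
        return None  # PMI is undefined for pairs that never co-occur
    p_joint = pair_counts[pair] / n_sentences
    p_w1 = word_counts[w1] / n_sentences
    p_w2 = word_counts[w2] / n_sentences
    return math.log2(p_joint / (p_w1 * p_w2))

print(toy_pmi("woman", "cooking"))
print(toy_pmi("man", "cooking"))
```

A positive score means the pair appears together more often than independence would predict; a negative score means it appears together less often.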

After this analysis, you will present specific examples from the data that you speculate could be particularly biased and problematic.

Assignment design credits: [11-830 Computational Ethics in NLP Course](https://maartensap.com/11830/Spring2024/hw1.html)

**Load and prepare dataset**

In [2]:
from google.colab import drive
drive.mount('/content/drive')
!ls "/content/drive/MyDrive/Colab Notebooks/ss-data"

Mounted at /content/drive
identity_labels.txt  snli_1.0_train.jsonl


In [3]:
#List of identity labels (based on Rudinger et al. 2017)
#More comprehensive list can be obtained from Jain et al. 2024
with open("/content/drive/MyDrive/Colab Notebooks/ss-data/identity_labels.txt", 'r') as f:
    identity_labels = f.read().split("\n")
print(identity_labels)

['woman', 'women', 'man', 'men', 'girl', 'girls', 'boy', 'boys', 'she', 'he', 'her', 'him', 'his', 'female', 'male', 'mother', 'father', 'sister', 'brother', 'daughter', 'son', 'feminine', 'masculine', 'androgynous', 'trans', 'transgender', 'transsexual', 'nonbinary', 'non-binary', 'two-spirit', 'hijra', 'genderqueer', 'black', 'asian', 'hispanic', 'white', 'african', 'american', 'latino', 'latina', 'caucasian', 'africans', 'middle-eastern', 'australian', 'australians', 'asians', 'european', 'europeans', 'chinese', 'indian', 'indonesian', 'brazilian', 'pakistani', 'bangladeshi', 'russian', 'nigerian', 'japanese', 'mexican', 'filipino ', 'vietnamese ', 'german', 'egyptian', 'ethiopian', 'turkish', 'iranian', 'thai', 'congolese', 'french', 'british ', 'italian', 'korean', 'burmese', 'canadian ', 'australian ', 'spanish', 'dutch', 'swiss', 'saudi', 'argentinian ', 'taiwanese ', 'swedish ', 'belgian', 'polish', 'israeli', 'irish', 'greek', 'ukrainian ', 'jamaican ', 'mongolian', 'armenian'
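
The printed list shows that a few labels carry trailing whitespace (for example `'filipino '`) and that `'australian'` appears twice once whitespace is ignored. A label with trailing whitespace cannot match any token produced by the tokenizer. A minimal sketch of one way to normalise the list, assuming one label per line; the helper name `clean_labels` is hypothetical:

```python
def clean_labels(raw_labels):
    """Strip whitespace, lowercase, drop empties, and de-duplicate while keeping order."""
    seen = set()
    cleaned = []
    for label in raw_labels:
        label = label.strip().lower()
        if label and label not in seen:
            seen.add(label)
            cleaned.append(label)
    return cleaned

# Example with a few entries in the form printed above
print(clean_labels(['australian', 'filipino ', 'australian ', '']))
```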

In [4]:
#Load SNLI dataset
import pandas as pd
snli_data = pd.read_json(path_or_buf="/content/drive/MyDrive/Colab Notebooks/ss-data/snli_1.0_train.jsonl", lines=True)
snli_data

Unnamed: 0,annotator_labels,captionID,gold_label,pairID,sentence1,sentence1_binary_parse,sentence1_parse,sentence2,sentence2_binary_parse,sentence2_parse
0,[neutral],3416050480.jpg#4,neutral,3416050480.jpg#4r1n,A person on a horse jumps over a broken down a...,( ( ( A person ) ( on ( a horse ) ) ) ( ( jump...,(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN o...,A person is training his horse for a competition.,( ( A person ) ( ( is ( ( training ( his horse...,(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) ...
1,[contradiction],3416050480.jpg#4,contradiction,3416050480.jpg#4r1c,A person on a horse jumps over a broken down a...,( ( ( A person ) ( on ( a horse ) ) ) ( ( jump...,(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN o...,"A person is at a diner, ordering an omelette.",( ( A person ) ( ( ( ( is ( at ( a diner ) ) )...,(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) ...
2,[entailment],3416050480.jpg#4,entailment,3416050480.jpg#4r1e,A person on a horse jumps over a broken down a...,( ( ( A person ) ( on ( a horse ) ) ) ( ( jump...,(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN o...,"A person is outdoors, on a horse.","( ( A person ) ( ( ( ( is outdoors ) , ) ( on ...",(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) ...
3,[neutral],2267923837.jpg#2,neutral,2267923837.jpg#2r1n,Children smiling and waving at camera,( Children ( ( ( smiling and ) waving ) ( at c...,(ROOT (NP (S (NP (NNP Children)) (VP (VBG smil...,They are smiling at their parents,( They ( are ( smiling ( at ( their parents ) ...,(ROOT (S (NP (PRP They)) (VP (VBP are) (VP (VB...
4,[entailment],2267923837.jpg#2,entailment,2267923837.jpg#2r1e,Children smiling and waving at camera,( Children ( ( ( smiling and ) waving ) ( at c...,(ROOT (NP (S (NP (NNP Children)) (VP (VBG smil...,There are children present,( There ( ( are children ) present ) ),(ROOT (S (NP (EX There)) (VP (VBP are) (NP (NN...
...,...,...,...,...,...,...,...,...,...,...
550147,[contradiction],2267923837.jpg#3,contradiction,2267923837.jpg#3r1c,Four dirty and barefooted children.,( ( ( ( Four dirty ) and ) ( barefooted childr...,(ROOT (NP (NP (CD Four) (NNS dirty)) (CC and) ...,four kids won awards for 'cleanest feet',( ( four kids ) ( ( won awards ) ( ( ( for ` )...,(ROOT (S (NP (CD four) (NNS kids)) (VP (VBD wo...
550148,[neutral],2267923837.jpg#3,neutral,2267923837.jpg#3r1n,Four dirty and barefooted children.,( ( ( ( Four dirty ) and ) ( barefooted childr...,(ROOT (NP (NP (CD Four) (NNS dirty)) (CC and) ...,"four homeless children had their shoes stolen,...",( ( ( ( ( ( four ( homeless children ) ) ( had...,(ROOT (S (S (NP (CD four) (JJ homeless) (NNS c...
550149,[neutral],7979219683.jpg#2,neutral,7979219683.jpg#2r1n,A man is surfing in a bodysuit in beautiful bl...,( ( A man ) ( ( is ( surfing ( in ( ( a bodysu...,(ROOT (S (NP (DT A) (NN man)) (VP (VBZ is) (VP...,A man in a bodysuit is competing in a surfing ...,( ( ( A man ) ( in ( a bodysuit ) ) ) ( ( is (...,(ROOT (S (NP (NP (DT A) (NN man)) (PP (IN in) ...
550150,[contradiction],7979219683.jpg#2,contradiction,7979219683.jpg#2r1c,A man is surfing in a bodysuit in beautiful bl...,( ( A man ) ( ( is ( surfing ( in ( ( a bodysu...,(ROOT (S (NP (DT A) (NN man)) (VP (VBZ is) (VP...,A man in a business suit is heading to a board...,( ( ( A man ) ( in ( a ( business suit ) ) ) )...,(ROOT (S (NP (NP (DT A) (NN man)) (PP (IN in) ...


In [8]:
#De-duplicate
snli_data_sub = snli_data[['sentence1','sentence2']]
print(len(snli_data_sub))
snli_data_sub.drop_duplicates(subset=['sentence1', 'sentence2'], keep='last',inplace=True)
print(len(snli_data_sub))

550152
549526


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  snli_data_sub.drop_duplicates(subset=['sentence1', 'sentence2'], keep='last',inplace=True)
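
The warning above appears because `snli_data_sub` is derived from `snli_data` by column selection and is then modified in place. One common way to avoid it, sketched here on a tiny hypothetical frame rather than the SNLI data, is to take an explicit copy and reassign the result of `drop_duplicates` instead of passing `inplace=True`:

```python
import pandas as pd

# Tiny stand-in for snli_data (hypothetical rows)
toy = pd.DataFrame({
    "sentence1": ["a man runs", "a man runs", "a dog sleeps"],
    "sentence2": ["he is outside", "he is outside", "the dog is awake"],
    "gold_label": ["entailment", "entailment", "contradiction"],
})

# .copy() makes the subset an independent DataFrame
toy_sub = toy[["sentence1", "sentence2"]].copy()
# Reassign the result instead of using inplace=True
toy_sub = toy_sub.drop_duplicates(subset=["sentence1", "sentence2"], keep="last")
# Later column assignments then operate on toy_sub itself, not on a view of toy
toy_sub["sentence1"] = toy_sub["sentence1"].str.lower()

print(len(toy), len(toy_sub))
```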


**Data preparation**

In [9]:
import nltk
from nltk import word_tokenize
from nltk.util import ngrams
from collections import Counter
nltk.download('punkt')

from nltk.corpus import stopwords
import string
nltk.download('stopwords')

from itertools import combinations
from tqdm import tqdm

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [10]:
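#util functions
stop_words = set(stopwords.words('english'))
# Extra punctuation: curly double and single quotes, hyphen, and em dash.
# The curly quotes are an assumption; the original characters appear to have been lost in export.
# A local variable is used instead of overwriting string.punctuation.
punctuation = string.punctuation + '\u201c\u201d-\u2018\u2019\u2014'
# A set makes the membership test in remove_stopwords O(1)
removal_list = stop_words | set(punctuation) | {'lt', 'rt'}

def remove_stopwords(sent_tokens):
    # Drop stop words, single punctuation tokens, and the tokens 'lt' and 'rt'
    filtered_words = [word for word in sent_tokens if word not in removal_list]
    return filtered_words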

In [11]:
#Lowercase
snli_data_sub['sentence1'] = snli_data_sub['sentence1'].map(str.lower)
snli_data_sub['sentence2'] = snli_data_sub['sentence2'].map(str.lower)
snli_data_sub


Unnamed: 0,sentence1,sentence2
0,a person on a horse jumps over a broken down a...,a person is training his horse for a competition.
1,a person on a horse jumps over a broken down a...,"a person is at a diner, ordering an omelette."
2,a person on a horse jumps over a broken down a...,"a person is outdoors, on a horse."
3,children smiling and waving at camera,they are smiling at their parents
4,children smiling and waving at camera,there are children present
...,...,...
550147,four dirty and barefooted children.,four kids won awards for 'cleanest feet'
550148,four dirty and barefooted children.,"four homeless children had their shoes stolen,..."
550149,a man is surfing in a bodysuit in beautiful bl...,a man in a bodysuit is competing in a surfing ...
550150,a man is surfing in a bodysuit in beautiful bl...,a man in a business suit is heading to a board...


In [12]:
#Tokenize
snli_data_sub['sentence1_token'] = snli_data_sub['sentence1'].apply(nltk.word_tokenize)
snli_data_sub['sentence2_token'] = snli_data_sub['sentence2'].apply(nltk.word_tokenize)

snli_data_sub


Unnamed: 0,sentence1,sentence2,sentence1_token,sentence2_token
0,a person on a horse jumps over a broken down a...,a person is training his horse for a competition.,"[a, person, on, a, horse, jumps, over, a, brok...","[a, person, is, training, his, horse, for, a, ..."
1,a person on a horse jumps over a broken down a...,"a person is at a diner, ordering an omelette.","[a, person, on, a, horse, jumps, over, a, brok...","[a, person, is, at, a, diner, ,, ordering, an,..."
2,a person on a horse jumps over a broken down a...,"a person is outdoors, on a horse.","[a, person, on, a, horse, jumps, over, a, brok...","[a, person, is, outdoors, ,, on, a, horse, .]"
3,children smiling and waving at camera,they are smiling at their parents,"[children, smiling, and, waving, at, camera]","[they, are, smiling, at, their, parents]"
4,children smiling and waving at camera,there are children present,"[children, smiling, and, waving, at, camera]","[there, are, children, present]"
...,...,...,...,...
550147,four dirty and barefooted children.,four kids won awards for 'cleanest feet',"[four, dirty, and, barefooted, children, .]","[four, kids, won, awards, for, 'cleanest, feet..."
550148,four dirty and barefooted children.,"four homeless children had their shoes stolen,...","[four, dirty, and, barefooted, children, .]","[four, homeless, children, had, their, shoes, ..."
550149,a man is surfing in a bodysuit in beautiful bl...,a man in a bodysuit is competing in a surfing ...,"[a, man, is, surfing, in, a, bodysuit, in, bea...","[a, man, in, a, bodysuit, is, competing, in, a..."
550150,a man is surfing in a bodysuit in beautiful bl...,a man in a business suit is heading to a board...,"[a, man, is, surfing, in, a, bodysuit, in, bea...","[a, man, in, a, business, suit, is, heading, t..."


In [13]:
#Remove stop-words
snli_data_sub['sentence1_token_nostopwords'] = snli_data_sub['sentence1_token'].apply(remove_stopwords)
snli_data_sub['sentence2_token_nostopwords'] = snli_data_sub['sentence2_token'].apply(remove_stopwords)

snli_data_sub


Unnamed: 0,sentence1,sentence2,sentence1_token,sentence2_token,sentence1_token_nostopwords,sentence2_token_nostopwords
0,a person on a horse jumps over a broken down a...,a person is training his horse for a competition.,"[a, person, on, a, horse, jumps, over, a, brok...","[a, person, is, training, his, horse, for, a, ...","[person, horse, jumps, broken, airplane]","[person, training, horse, competition]"
1,a person on a horse jumps over a broken down a...,"a person is at a diner, ordering an omelette.","[a, person, on, a, horse, jumps, over, a, brok...","[a, person, is, at, a, diner, ,, ordering, an,...","[person, horse, jumps, broken, airplane]","[person, diner, ordering, omelette]"
2,a person on a horse jumps over a broken down a...,"a person is outdoors, on a horse.","[a, person, on, a, horse, jumps, over, a, brok...","[a, person, is, outdoors, ,, on, a, horse, .]","[person, horse, jumps, broken, airplane]","[person, outdoors, horse]"
3,children smiling and waving at camera,they are smiling at their parents,"[children, smiling, and, waving, at, camera]","[they, are, smiling, at, their, parents]","[children, smiling, waving, camera]","[smiling, parents]"
4,children smiling and waving at camera,there are children present,"[children, smiling, and, waving, at, camera]","[there, are, children, present]","[children, smiling, waving, camera]","[children, present]"
...,...,...,...,...,...,...
550147,four dirty and barefooted children.,four kids won awards for 'cleanest feet',"[four, dirty, and, barefooted, children, .]","[four, kids, won, awards, for, 'cleanest, feet...","[four, dirty, barefooted, children]","[four, kids, awards, 'cleanest, feet]"
550148,four dirty and barefooted children.,"four homeless children had their shoes stolen,...","[four, dirty, and, barefooted, children, .]","[four, homeless, children, had, their, shoes, ...","[four, dirty, barefooted, children]","[four, homeless, children, shoes, stolen, feet..."
550149,a man is surfing in a bodysuit in beautiful bl...,a man in a bodysuit is competing in a surfing ...,"[a, man, is, surfing, in, a, bodysuit, in, bea...","[a, man, in, a, bodysuit, is, competing, in, a...","[man, surfing, bodysuit, beautiful, blue, water]","[man, bodysuit, competing, surfing, competition]"
550150,a man is surfing in a bodysuit in beautiful bl...,a man in a business suit is heading to a board...,"[a, man, is, surfing, in, a, bodysuit, in, bea...","[a, man, in, a, business, suit, is, heading, t...","[man, surfing, bodysuit, beautiful, blue, water]","[man, business, suit, heading, board, meeting]"


In [14]:
print(len(snli_data_sub))
snli_data_sub['sentence1_token_nostopwords_str'] = snli_data_sub['sentence1_token_nostopwords'].apply(lambda x: ' '.join(x))
snli_data_sub['sentence2_token_nostopwords_str'] = snli_data_sub['sentence2_token_nostopwords'].apply(lambda x: ' '.join(x))
snli_data_sub.drop_duplicates(subset=['sentence1_token_nostopwords_str', 'sentence2_token_nostopwords_str'], keep='last', inplace=True)
print(len(snli_data_sub))

549526


547562



**Word association analysis**
```latex
\mathrm{PMI}(w_i, w_j) = \log_2 \frac{p(w_i, w_j)}{p(w_i)\,p(w_j)} = \log_2 \frac{N \cdot c(w_i, w_j)}{c(w_i)\,c(w_j)}
```
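
The second equality follows from estimating each probability as a relative frequency over the same total $N$. As a hedged observation, the `pmi` function defined later in this notebook normalises unigram and bigram counts by their own totals (written $N_1$ and $N_2$ here, symbols introduced only for illustration). Relative to a single total, this shifts every score by the same constant, so rankings are unchanged but fixed thresholds are not.

```latex
p(w_i) = \frac{c(w_i)}{N}, \qquad p(w_i, w_j) = \frac{c(w_i, w_j)}{N}
\;\Longrightarrow\;
\frac{p(w_i, w_j)}{p(w_i)\,p(w_j)} = \frac{c(w_i, w_j)/N}{c(w_i)\,c(w_j)/N^2} = \frac{N \cdot c(w_i, w_j)}{c(w_i)\,c(w_j)}

% With separate totals N_1 (unigrams) and N_2 (bigrams):
\mathrm{PMI}(w_i, w_j) = \log_2 \frac{c(w_i, w_j)/N_2}{\bigl(c(w_i)/N_1\bigr)\bigl(c(w_j)/N_1\bigr)} = \log_2 \frac{N_1^2 \cdot c(w_i, w_j)}{N_2 \cdot c(w_i)\,c(w_j)}
```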


In [15]:
#Computing Unigram frequency
corpus = []
for sent in (snli_data_sub['sentence1_token_nostopwords'].tolist() + snli_data_sub['sentence2_token_nostopwords'].tolist()):
  corpus+=(sent)
unigram_frequency = Counter(corpus)
unigram_frequency.most_common(20)

[('man', 265138),
 ('woman', 137105),
 ('two', 121714),
 ('people', 120802),
 ('wearing', 80592),
 ('young', 61353),
 ('men', 60820),
 ('playing', 59220),
 ('girl', 59094),
 ('boy', 58041),
 ('white', 56785),
 ('shirt', 56204),
 ('black', 54824),
 ('dog', 53690),
 ('sitting', 53544),
 ('blue', 49040),
 ('standing', 46189),
 ('red', 43129),
 ('group', 43057),
 ('walking', 38678)]

In [16]:
#Remove words that occur fewer than 15 times
from itertools import dropwhile
print(len(unigram_frequency))
for key, count in dropwhile(lambda key_count: key_count[1] >= 15, unigram_frequency.most_common()):
    del unigram_frequency[key]
print(len(unigram_frequency))

36420
10662
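
The same pruning can be written without `dropwhile` by rebuilding the counter from the entries that meet the threshold. A minimal sketch on a toy counter; `prune_counter` and `min_count` are hypothetical names:

```python
from collections import Counter

def prune_counter(counter, min_count):
    """Return a new Counter keeping only entries with count >= min_count."""
    return Counter({key: count for key, count in counter.items() if count >= min_count})

toy_counts = Counter({"man": 20, "woman": 15, "zebra": 3})
pruned = prune_counter(toy_counts, min_count=15)
print(pruned)
```

Returning a new counter leaves the original counts available, which can help when trying several thresholds.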


In [18]:
choice = "separate_prem_hypo"

if choice == "combine_prem_hypo":
  snli_data_sub['sentence1_sentence2_token_nostopwords'] = snli_data_sub['sentence1_token_nostopwords'] + snli_data_sub['sentence2_token_nostopwords']
  ##MAIN
  bigram_doc=[]
  for ele in tqdm(snli_data_sub['sentence1_sentence2_token_nostopwords'].tolist()):
    bigram_doc.append(list(set(list(nltk.bigrams(ele)))))
  bigram=[]
  for doc in bigram_doc:
    bigram += doc
  bigram_frequency = Counter(bigram)
  bigram_frequency
elif choice == "separate_prem_hypo":
  ##
  snli_data_sub_s1 = snli_data_sub[['sentence1_token_nostopwords']].copy()
  snli_data_sub_s2 = snli_data_sub[['sentence2_token_nostopwords']].copy()

  print("S1",len(snli_data_sub_s1))
  snli_data_sub_s1['sentence1_token_nostopwords_str'] = snli_data_sub_s1['sentence1_token_nostopwords'].apply(lambda x: ' '.join(x))
  snli_data_sub_s1.drop_duplicates(subset=['sentence1_token_nostopwords_str'], keep='last',inplace=True)
  print("S1",len(snli_data_sub_s1))

  print("S2",len(snli_data_sub_s2))
  snli_data_sub_s2['sentence2_token_nostopwords_str'] = snli_data_sub_s2['sentence2_token_nostopwords'].apply(lambda x: ' '.join(x))
  snli_data_sub_s2.drop_duplicates(subset=['sentence2_token_nostopwords_str'], keep='last',inplace=True)
  print("S2",len(snli_data_sub_s2))
  ##MAIN
  bigram_doc=[]
  for ele in tqdm(snli_data_sub_s1['sentence1_token_nostopwords'].tolist()+snli_data_sub_s2['sentence2_token_nostopwords'].tolist()):
    bigram_doc.append(list(set(list(nltk.bigrams(ele)))))
  bigram=[]
  for doc in bigram_doc:
    bigram += doc
  bigram_frequency = Counter(bigram)
  bigram_frequency

S1 547562
S1 149532
S2 547562
S2 423589


100%|██████████| 573121/573121 [00:02<00:00, 209787.27it/s]
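
To make the counting above concrete: `nltk.bigrams` yields adjacent token pairs, and wrapping them in `set` means a pair is counted at most once per sentence. The sketch below reproduces that behaviour with `zip` on hypothetical token lists, so it runs without NLTK:

```python
from collections import Counter

def sentence_bigrams(tokens):
    """Set of adjacent (left, right) token pairs in one sentence."""
    return set(zip(tokens, tokens[1:]))

toy_sentences = [
    ["young", "woman", "young", "woman", "smiling"],  # ('young', 'woman') occurs twice here
    ["young", "woman", "reading"],
]

toy_bigram_frequency = Counter()
for tokens in toy_sentences:
    toy_bigram_frequency.update(sentence_bigrams(tokens))

print(toy_bigram_frequency[("young", "woman")])    # counted once per sentence
print(toy_bigram_frequency[("woman", "young")])    # order matters
print(toy_bigram_frequency[("young", "smiling")])  # not adjacent, so never counted
```

Because stop words were removed before this step, "adjacent" means adjacent among the remaining content words.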


In [19]:
#Remove word pairs that occur fewer than 10 times
from itertools import dropwhile
print(len(bigram_frequency))
for key, count in dropwhile(lambda key_count: key_count[1] >= 10, bigram_frequency.most_common()):
    del bigram_frequency[key]
print(len(bigram_frequency))

662076
29241


In [20]:
#Calculate PMI
import math

def pmi(word1, word2, unigram_freq, bigram_freq):
  # Returns PMI in bits for word1 directly followed by word2 (after stop-word removal),
  # or None if either word or the ordered pair is missing from the counts.
  if word1 not in unigram_freq or word2 not in unigram_freq:
    return None
  # Only the ordered pair (word1, word2) is counted; a Counter returns 0 for missing keys
  pair_count = bigram_freq[(word1, word2)]
  if pair_count == 0:
    return None
  total_unigrams = float(sum(unigram_freq.values()))
  total_bigrams = float(sum(bigram_freq.values()))
  prob_word1 = unigram_freq[word1] / total_unigrams
  prob_word2 = unigram_freq[word2] / total_unigrams
  prob_word1_word2 = pair_count / total_bigrams
  return math.log(prob_word1_word2 / (prob_word1 * prob_word2), 2)
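
Note that the function above reads only the count for the ordered pair `(word1, word2)`, so a word that only ever precedes the identity label gets no score. If order-insensitive association is wanted, one possible variant sums both orders and takes precomputed totals as arguments. This is an assumption about intent, not the notebook's method, and the name `pmi_symmetric` is hypothetical:

```python
import math
from collections import Counter

def pmi_symmetric(word1, word2, unigram_freq, bigram_freq, n_unigrams, n_bigrams):
    """PMI in bits, counting the pair in either order; None if it cannot be computed.

    Expects Counter objects, which return 0 for missing keys.
    """
    if unigram_freq[word1] == 0 or unigram_freq[word2] == 0:
        return None
    pair_count = bigram_freq[(word1, word2)]
    if word1 != word2:
        pair_count += bigram_freq[(word2, word1)]
    if pair_count == 0:
        return None
    prob_pair = pair_count / n_bigrams
    prob_word1 = unigram_freq[word1] / n_unigrams
    prob_word2 = unigram_freq[word2] / n_unigrams
    return math.log2(prob_pair / (prob_word1 * prob_word2))

# Toy counts (hypothetical)
toy_unigrams = Counter({"woman": 4, "young": 2, "smiling": 2})
toy_bigrams = Counter({("young", "woman"): 2, ("woman", "smiling"): 2})
n_uni = sum(toy_unigrams.values())
n_bi = sum(toy_bigrams.values())

print(pmi_symmetric("woman", "young", toy_unigrams, toy_bigrams, n_uni, n_bi))
```

Passing the totals in avoids recomputing the sums on every call, which matters when the function is called once per vocabulary word.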


In [21]:
def Sort(pairs):
    # Sort (word, score) pairs by score, highest first
    return sorted(pairs, key=lambda a: a[1], reverse=True)

def get_pmi(identity_label,unigram_frequency,bigram_frequency):
  pmi_identity_label=[]
  for word in tqdm(unigram_frequency.keys()):
    pmi_score = pmi(word1=identity_label.lower(), word2=word,unigram_freq=unigram_frequency,bigram_freq=bigram_frequency)
    # Compare with None so that a PMI of exactly 0.0 is not dropped
    if pmi_score is not None:
      pmi_identity_label.append((word,pmi_score))

  return Sort(pmi_identity_label)


In [23]:
male_pmi_data = get_pmi('man',unigram_frequency,bigram_frequency)
female_pmi_data = get_pmi('woman',unigram_frequency,bigram_frequency)
gay_pmi_data = get_pmi('gay',unigram_frequency,bigram_frequency)
#young
#old
#caucasian
#indian

100%|██████████| 10662/10662 [00:03<00:00, 3129.48it/s]
100%|██████████| 10662/10662 [00:02<00:00, 4950.71it/s]
100%|██████████| 10662/10662 [00:00<00:00, 341598.38it/s]


In [35]:
"professor" in unigram_frequency.keys()
professor_pmi_data = get_pmi('professor',unigram_frequency,bigram_frequency)
"students" in unigram_frequency.keys()
students_pmi_data = get_pmi('students',unigram_frequency,bigram_frequency)


100%|██████████| 10662/10662 [00:00<00:00, 324594.21it/s]
100%|██████████| 10662/10662 [00:00<00:00, 80650.09it/s]


In [36]:
students_pmi_data

[('college', 10.580254617666741),
 ('studying', 9.173079236160866),
 ('classroom', 7.886767660167415),
 ('class', 7.865637797710345),
 ('arts', 7.538204438776586),
 ('learning', 7.5241380132837445),
 ('teaching', 7.346199278804712),
 ('listening', 7.153989862964642),
 ('practicing', 6.727985217348418),
 ('karate', 6.511232125539247),
 ('school', 6.361893199346405),
 ('art', 6.212683857223665),
 ('full', 5.724496231268537),
 ('sit', 5.6257161526379456),
 ('group', 5.305235507729033),
 ('asian', 5.103845925728085),
 ('taking', 4.874090704079956),
 ('waiting', 4.746199461784648),
 ('working', 4.738663159507043),
 ('work', 4.520330800014682),
 ('watch', 4.498105576312868),
 ('watching', 4.269881009588809),
 ('play', 4.215268850983738),
 ('reading', 4.141047062725456),
 ('male', 4.095230384343366),
 ('performing', 4.06204854589816),
 ('sitting', 3.8490972422013594),
 ('walk', 3.7468901108021484),
 ('four', 3.702459252727533),
 ('several', 3.6475969362446716),
 ('stand', 3.6213065169390806),

In [37]:
professor_pmi_data

[('giving', 9.550262324475336)]

In [29]:
male_pmi_set = set([ele[0] for ele in male_pmi_data if ele[1]>2])
female_pmi_set = set([ele[0] for ele in female_pmi_data if ele[1]>2])
gay_pmi_set = set([ele[0] for ele in gay_pmi_data if ele[1]>2])
#

In [30]:
male_pmi_set - female_pmi_set

{'1',
 'accepts',
 'accidentally',
 'acting',
 'actually',
 'afro',
 'aiming',
 'almost',
 'amish',
 'angrily',
 'approaches',
 'approaching',
 'army',
 'arranging',
 'arrested',
 'asked',
 'asleep',
 'ate',
 'attempting',
 'attempts',
 'awake',
 'backpack',
 'bad',
 'baggy',
 'bald',
 'balding',
 'barbecuing',
 'beard',
 'bearded',
 'beating',
 'begging',
 'beret',
 'bicycling',
 'blind',
 'blindfolded',
 'bowler',
 'bowls',
 'breakdances',
 'breakdancing',
 'brings',
 'builds',
 'buried',
 'burning',
 'burns',
 'business',
 'came',
 'camouflage',
 'carefully',
 'carves',
 'carving',
 'cast',
 'casual',
 'catches',
 'catching',
 'caught',
 'changes',
 'changing',
 'chased',
 'checkered',
 'cherry',
 'chested',
 'chiseling',
 'chops',
 'chubby',
 'clearing',
 'climbed',
 'climbing',
 'closes',
 'closeup',
 'closing',
 'collared',
 'collects',
 'completely',
 'concentrating',
 'conducting',
 'conducts',
 'contemplates',
 'contemplating',
 'controlling',
 'costumed',
 'crafting',
 'crash

In [31]:
female_pmi_set - male_pmi_set

{'african',
 'alone',
 'also',
 'applies',
 'applying',
 'attractive',
 'balances',
 'ballet',
 'barefoot',
 'bathing',
 'bikini',
 'blond',
 'blond-hair',
 'blond-haired',
 'blow',
 'braids',
 'bright',
 'brunette',
 'brushes',
 'celebrating',
 'chasing',
 'chatting',
 'cheering',
 'clad',
 'considering',
 'conversing',
 'covering',
 'crossing',
 'crying',
 'curly',
 'dance',
 'date',
 'displaying',
 'displays',
 'dress',
 'dries',
 'drops',
 'drying',
 'embrace',
 'embracing',
 'ethnic',
 'exiting',
 'fancy',
 'floral',
 'flowered',
 'formal',
 'green',
 'hanging',
 'headscarf',
 'heels',
 'hold',
 'horseback',
 'hospital',
 'hula',
 'husband',
 'kimono',
 'kiss',
 'knees',
 'knit',
 'knits',
 'know',
 'laugh',
 'laughing',
 'laundry',
 'leopard',
 'maroon',
 'modeling',
 'multicolored',
 'muslim',
 'native',
 'ordered',
 'pajamas',
 'passing',
 'photographed',
 'pink',
 'posing',
 'pregnant',
 'pretty',
 'purple',
 'put',
 'read',
 'red',
 'red-hair',
 'redheaded',
 'rolling',
 'say
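
The set differences above depend on the hard cut-off at 2: a word scoring 2.1 for one label and 1.9 for the other lands in the difference even though the two associations are nearly equal. A complementary view, sketched here with made-up scores and a hypothetical helper `pmi_gap`, ranks the words scored for both labels by the difference in PMI:

```python
def pmi_gap(pmi_a, pmi_b):
    """Words scored for both labels, sorted by PMI(a) - PMI(b), largest first."""
    scores_b = dict(pmi_b)
    gaps = [(word, score - scores_b[word]) for word, score in pmi_a if word in scores_b]
    return sorted(gaps, key=lambda item: item[1], reverse=True)

# Toy scores in the (word, pmi) format returned by get_pmi (values are made up)
toy_male = [("beard", 3.0), ("business", 2.1), ("smiling", 0.5)]
toy_female = [("business", 1.9), ("smiling", 2.5), ("dress", 4.0)]

print(pmi_gap(toy_male, toy_female))
```

Words scored for only one label, such as "beard" and "dress" here, are left out of the ranking and are still best inspected through the set differences.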

In [24]:
search_term = "jail"
mask = snli_data_sub["sentence1"].str.contains(search_term, case=False, na=False)
pd.set_option('display.max_colwidth', 0)
snli_data_sub[mask][["sentence1","sentence2"]]

Unnamed: 0,sentence1,sentence2
175440,"the little boy looks like he is saying, let me out of jail.",he is talking to his parents
175441,"the little boy looks like he is saying, let me out of jail.",the little girl cried
175442,"the little boy looks like he is saying, let me out of jail.",the little boy is talking
441087,an attractive police lady and a jailbird pose for a photo.,a cat and a dog hugging.
441088,an attractive police lady and a jailbird pose for a photo.,two friends who went into very different lines of work take a picture.
441089,an attractive police lady and a jailbird pose for a photo.,a police officer and prisoner taking a photo.
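
Because `str.contains` matches substrings, the search for "jail" also returns "jailbird". If only the whole word is wanted, a regular expression with word boundaries is one option. The sketch uses Python's `re` module on two sentences copied from the output above; with pandas, the same pattern could be passed to `str.contains`, which treats its pattern as a regular expression by default.

```python
import re

sentences = [
    "the little boy looks like he is saying, let me out of jail.",
    "an attractive police lady and a jailbird pose for a photo.",
]

search_term = "jail"
# \b marks a word boundary; re.escape guards against special characters in the term
pattern = re.compile(r"\b" + re.escape(search_term) + r"\b", flags=re.IGNORECASE)

whole_word_matches = [s for s in sentences if pattern.search(s)]
print(whole_word_matches)
```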


In [25]:
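from collections import defaultdict
pmi_identity_label = defaultdict(list)
for identity_label in identity_labels:
  # Some labels carry trailing whitespace (e.g. 'filipino '), which would never match a token
  label = identity_label.strip()
  print(label)
  scores = []
  for word in tqdm(unigram_frequency.keys()):
    pmi_score = pmi(word1=label, word2=word,unigram_freq=unigram_frequency,bigram_freq=bigram_frequency)
    # Compare with None so that a PMI of exactly 0.0 is not dropped
    if pmi_score is not None:
      scores.append((word,pmi_score))
  # Assign rather than append so that repeated labels do not produce duplicate entries
  if scores:
    pmi_identity_label[label] = scores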



ukrainian 


100%|██████████| 10662/10662 [00:00<00:00, 542747.37it/s]


jamaican 


100%|██████████| 10662/10662 [00:00<00:00, 651510.33it/s]


mongolian


100%|██████████| 10662/10662 [00:00<00:00, 364338.77it/s]


armenian


100%|██████████| 10662/10662 [00:00<00:00, 353856.44it/s]


disability


100%|██████████| 10662/10662 [00:00<00:00, 641500.90it/s]


disabled


100%|██████████| 10662/10662 [00:00<00:00, 354072.17it/s]


handicap


100%|██████████| 10662/10662 [00:00<00:00, 475079.08it/s]


handicapped


100%|██████████| 10662/10662 [00:00<00:00, 454847.22it/s]


mentally


100%|██████████| 10662/10662 [00:00<00:00, 577586.95it/s]


mental


100%|██████████| 10662/10662 [00:00<00:00, 452774.88it/s]


autistic


100%|██████████| 10662/10662 [00:00<00:00, 698232.07it/s]


autism


100%|██████████| 10662/10662 [00:00<00:00, 709149.38it/s]


lesbian


100%|██████████| 10662/10662 [00:00<00:00, 337150.70it/s]


lesbians


100%|██████████| 10662/10662 [00:00<00:00, 349342.39it/s]


gay


100%|██████████| 10662/10662 [00:00<00:00, 293677.03it/s]


bisexual


100%|██████████| 10662/10662 [00:00<00:00, 860026.72it/s]


pansexual


100%|██████████| 10662/10662 [00:00<00:00, 617751.78it/s]


asexual


100%|██████████| 10662/10662 [00:00<00:00, 658397.42it/s]


queer


100%|██████████| 10662/10662 [00:00<00:00, 611174.92it/s]


straight


100%|██████████| 10662/10662 [00:00<00:00, 264909.69it/s]


muslim


100%|██████████| 10662/10662 [00:00<00:00, 375842.92it/s]


christian


100%|██████████| 10662/10662 [00:00<00:00, 292549.29it/s]


jew


100%|██████████| 10662/10662 [00:00<00:00, 640068.55it/s]


jewish


100%|██████████| 10662/10662 [00:00<00:00, 354684.37it/s]


sikh


100%|██████████| 10662/10662 [00:00<00:00, 653271.04it/s]


buddhist


100%|██████████| 10662/10662 [00:00<00:00, 263859.32it/s]


hindu


100%|██████████| 10662/10662 [00:00<00:00, 413833.44it/s]


atheist


100%|██████████| 10662/10662 [00:00<00:00, 631259.27it/s]


muslims


100%|██████████| 10662/10662 [00:00<00:00, 352831.82it/s]


christians


100%|██████████| 10662/10662 [00:00<00:00, 393433.94it/s]


jews


100%|██████████| 10662/10662 [00:00<00:00, 597161.98it/s]


sikhs


100%|██████████| 10662/10662 [00:00<00:00, 621149.65it/s]


buddhists


100%|██████████| 10662/10662 [00:00<00:00, 585773.01it/s]


hindus


100%|██████████| 10662/10662 [00:00<00:00, 420352.95it/s]


atheists


100%|██████████| 10662/10662 [00:00<00:00, 706349.12it/s]


old


100%|██████████| 10662/10662 [00:00<00:00, 46053.85it/s]


elderly


100%|██████████| 10662/10662 [00:00<00:00, 126726.98it/s]


retired


100%|██████████| 10662/10662 [00:00<00:00, 317724.12it/s]


teenage


100%|██████████| 10662/10662 [00:00<00:00, 212673.32it/s]


young


100%|██████████| 10662/10662 [00:00<00:00, 30659.50it/s]


senior


100%|██████████| 10662/10662 [00:00<00:00, 349437.94it/s]


seniors


100%|██████████| 10662/10662 [00:00<00:00, 434847.04it/s]


teenager


100%|██████████| 10662/10662 [00:00<00:00, 306811.86it/s]


teenagers


100%|██████████| 10662/10662 [00:00<00:00, 186112.50it/s]





100%|██████████| 10662/10662 [00:00<00:00, 721692.39it/s]


In [None]:
# Rank each label's (word, PMI) pairs with the Sort helper, highest PMI first
for identity in pmi_identity_label:
    pmi_identity_label[identity] = Sort(pmi_identity_label[identity])
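
The definition of `Sort` is not shown in this part of the notebook. The printed output below suggests it returns a list of `(word, PMI)` tuples in descending PMI order. A minimal sketch of a helper with that behavior follows. The name `sort_by_pmi` and the toy scores are hypothetical, and the real `Sort` may differ, for instance in how it breaks ties.

```python
from collections.abc import Mapping


def sort_by_pmi(scores):
    """Return (word, pmi) pairs ordered from highest to lowest PMI.

    Accepts either a {word: pmi} mapping or an iterable of (word, pmi) pairs.
    """
    pairs = scores.items() if isinstance(scores, Mapping) else scores
    # Sort on the score (descending), then on the word, so that ties
    # come out in a reproducible alphabetical order.
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))


# Toy scores, made up for illustration only (not taken from the dataset)
toy_scores = {"bench": 1.5, "umbrella": 2.5, "kite": 1.5}
print(sort_by_pmi(toy_scores))
```

Breaking ties alphabetically is a design choice made here so that repeated runs print the same top-20 list. Many words share an identical PMI value, so without a tie-breaker the order among them depends on insertion order.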


In [None]:
# Show the 20 highest-PMI words for each identity label
for key, ranked in pmi_identity_label.items():
    print(key, ranked[:20])

woman [('marble-looking', 3.9246229431677464), ('edwards', 3.9246229431677464), ('bandidos', 3.9246229431677464), ('leaf-strewn', 3.9246229431677464), ('leaf-lined', 3.9246229431677464), ('leave-like', 3.9246229431677464), ('mirthlessly', 3.9246229431677464), ('coke-a-cola', 3.9246229431677464), ('ruts', 3.9246229431677464), ('wearubg', 3.9246229431677464), ('tableau', 3.9246229431677464), ('radishes', 3.9246229431677464), ('pull-overs', 3.9246229431677464), ('denomination', 3.9246229431677464), ('t-short', 3.9246229431677464), ('lavendar', 3.9246229431677464), ('buzzes', 3.9246229431677464), ('homemade-looking', 3.9246229431677464), ('event-', 3.9246229431677464), ('spheres', 3.9246229431677464)]
women [('disbelieving', 5.925517637795422), ('marley', 5.925517637795422), ('partially-drunk', 5.925517637795422), ('futon-style', 5.925517637795422), ('footrest', 5.925517637795422), ('hand-crafting', 5.925517637795422), ('headwraps', 5.861387300375705), ('blanked', 5.832408233403941), ('wel
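
The top-ranked words for `woman` all share a single PMI value and include apparent typos such as `wearubg` and `t-short`. This is the pattern PMI tends to produce for very rare words: a word that appears only once or twice, always alongside the label, gets the maximum possible score for that label. One common remedy is to drop words below a frequency threshold before reading the ranking. The sketch below assumes a word-frequency counter is available. The names `filter_rare`, `word_counts`, and `min_count`, and the toy data, are hypothetical and not part of the notebook.

```python
from collections import Counter


def filter_rare(ranked, word_counts, min_count=5):
    """Keep only (word, pmi) pairs whose word occurs at least min_count times."""
    # A Counter returns 0 for unseen words, so unknown words are dropped too.
    return [(word, score) for word, score in ranked if word_counts[word] >= min_count]


# Toy data, made up for illustration only (not taken from the dataset)
word_counts = Counter({"wearubg": 1, "radishes": 1, "umbrella": 40, "bench": 12})
ranked = [("wearubg", 3.92), ("radishes", 3.92), ("bench", 2.10), ("umbrella", 1.75)]

print(filter_rare(ranked, word_counts))
```

The threshold is a judgment call. A higher `min_count` removes more noise but can also hide rare words that are genuinely informative about a small identity group.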

In [None]:
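search_term = "extravagant"
# Case-insensitive literal match; regex=False keeps special characters in the term from being read as a pattern
mask = snli_data_sub["sentence1"].str.contains(search_term, case=False, na=False, regex=False)
# Show full sentences instead of truncating long cells
pd.set_option('display.max_colwidth', None)
snli_data_sub.loc[mask, ["sentence1", "sentence2"]]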

index,sentence1,sentence2
20205,a indian women displaying her cultural heritage with painted red fingers and an extravagant head piece.,the woman is preparing for her wedding.
20206,a indian women displaying her cultural heritage with painted red fingers and an extravagant head piece.,"a woman dresses in western fashion, with unadorned fingers and head."
20207,a indian women displaying her cultural heritage with painted red fingers and an extravagant head piece.,an indian lady is wearing a beautiful outfit.
20208,a indian women displaying her cultural heritage with painted red fingers and an extravagant head piece.,an indian woman is wearing traditional styles.
20209,a indian women displaying her cultural heritage with painted red fingers and an extravagant head piece.,a woman shows off to a crowd of people.
20210,a indian women displaying her cultural heritage with painted red fingers and an extravagant head piece.,a woman is ready for a national cultural celebration.
20211,a indian women displaying her cultural heritage with painted red fingers and an extravagant head piece.,a japanese woman is wearing a kimono.
20212,a indian women displaying her cultural heritage with painted red fingers and an extravagant head piece.,"a very colorfully robed female with fingers painted red, who's culture appears to be indian, has on an elaborate head piece."
20213,a indian women displaying her cultural heritage with painted red fingers and an extravagant head piece.,"a woman is dressed in traditional, ethnic clothing."
20214,a indian women displaying her cultural heritage with painted red fingers and an extravagant head piece.,a woman wearing an old-fashioned apron holds an apple pie in her hands.
