**Goals**

Crowdsourced annotation has become a fundamental part of NLP research. The goal of this hands-on exercise is to explore the ethical implications of soliciting crowdsourced data, specifically the social biases that may emerge when crowd workers are asked to generate sentences.

**Overview**

In this exercise, you will perform a “bias audit” of an NLP dataset produced by crowdsourcing. You will attempt to measure the presence of social stereotypes in this dataset, which may have harmful effects if the data is used to train classifiers for downstream tasks.

We will use pointwise mutual information (PMI) to find which words are associated with identity labels. PMI measures word association in a corpus: how much more often two words co-occur than would be expected from their individual frequencies if they were independent.

See the [PMI Wikipedia page](https://en.wikipedia.org/wiki/Pointwise_mutual_information) for more details. Here we use PMI to measure which words co-occur with labels for identities. This allows us to see associations that may perpetuate stereotypes.
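
As a concrete illustration of the definition, the following minimal sketch computes PMI from raw counts on a made-up four-sentence corpus. The corpus, the helper name `toy_pmi`, and the choice to treat "co-occurrence" as appearing in the same sentence are all assumptions for illustration; they are not the assignment's own code.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus: each inner list is one tokenized sentence.
toy_corpus = [
    ["woman", "cooking", "kitchen"],
    ["man", "playing", "guitar"],
    ["woman", "cooking", "dinner"],
    ["man", "cooking", "dinner"],
]

n_sentences = len(toy_corpus)
# Number of sentences each word, and each unordered word pair, appears in.
word_counts = Counter(w for sent in toy_corpus for w in set(sent))
pair_counts = Counter(
    frozenset(pair)
    for sent in toy_corpus
    for pair in combinations(sorted(set(sent)), 2)
)

def toy_pmi(w1, w2):
    """PMI in bits; returns None when the pair never co-occurs."""
    joint = pair_counts[frozenset((w1, w2))]
    if joint == 0:
        return None
    p_joint = joint / n_sentences
    p1 = word_counts[w1] / n_sentences
    p2 = word_counts[w2] / n_sentences
    return math.log2(p_joint / (p1 * p2))

print(toy_pmi("woman", "cooking"))
print(toy_pmi("man", "cooking"))
```

In this toy corpus, "woman" and "cooking" receive a positive score (they co-occur more often than independence would predict), while "man" and "cooking" receive a negative one.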

After this analysis, you will present specific examples from the data that you speculate could be particularly biased and problematic.

*Assignment design credits:* [11-830 Computational Ethics in NLP Course](https://maartensap.com/11830/Spring2024/hw1.html)

**Load and prepare dataset**

In [1]:
#List of identity labels (based on Rudinger et al. 2017)
#More comprehensive list can be obtained from Jain et al. 2024
with open("ss-data/identity_labels.txt", 'r') as f:
    identity_labels = f.read().split("\n")
print(identity_labels)

['woman', 'women', 'man', 'men', 'girl', 'girls', 'boy', 'boys', 'she', 'he', 'her', 'him', 'his', 'female', 'male', 'mother', 'father', 'sister', 'brother', 'daughter', 'son', 'feminine', 'masculine', 'androgynous', 'trans', 'transgender', 'transsexual', 'nonbinary', 'non-binary', 'two-spirit', 'hijra', 'genderqueer', 'black', 'asian', 'hispanic', 'white', 'african', 'american', 'latino', 'latina', 'caucasian', 'africans', 'middle-eastern', 'australian', 'australians', 'asians', 'european', 'europeans', 'chinese', 'indian', 'indonesian', 'brazilian', 'pakistani', 'bangladeshi', 'russian', 'nigerian', 'japanese', 'mexican', 'filipino ', 'vietnamese ', 'german', 'egyptian', 'ethiopian', 'turkish', 'iranian', 'thai', 'congolese', 'french', 'british ', 'italian', 'korean', 'burmese', 'canadian ', 'australian ', 'spanish', 'dutch', 'swiss', 'saudi', 'argentinian ', 'taiwanese ', 'swedish ', 'belgian', 'polish', 'israeli', 'irish', 'greek', 'ukrainian ', 'jamaican ', 'mongolian', 'armenian'
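
Several entries in the printed list carry trailing whitespace (for example `'filipino '` and `'british '`), and `'australian'` appears both with and without it. Tokens produced by a tokenizer have no trailing spaces, so such labels would never match a corpus word. A minimal sketch of one way to normalize the list follows; the helper name `normalize_labels` and the sample input are hypothetical.

```python
# Hypothetical excerpt of the raw label list, including trailing spaces
# and an empty string such as a final newline in the file would produce.
raw_labels = ["woman", "filipino ", "british ", "australian", "australian ", ""]

def normalize_labels(labels):
    """Strip whitespace, lowercase, drop empties, de-duplicate in order."""
    seen = set()
    cleaned = []
    for label in labels:
        label = label.strip().lower()
        if label and label not in seen:
            seen.add(label)
            cleaned.append(label)
    return cleaned

print(normalize_labels(raw_labels))
```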

In [2]:
#Load SNLI dataset
import pandas as pd
snli_data = pd.read_json(path_or_buf="ss-data/snli_1.0_train.jsonl", lines=True)
pd.set_option('display.max_colwidth', 0)

snli_data

Unnamed: 0,annotator_labels,captionID,gold_label,pairID,sentence1,sentence1_binary_parse,sentence1_parse,sentence2,sentence2_binary_parse,sentence2_parse
0,[neutral],3416050480.jpg#4,neutral,3416050480.jpg#4r1n,A person on a horse jumps over a broken down airplane.,( ( ( A person ) ( on ( a horse ) ) ) ( ( jumps ( over ( a ( broken ( down airplane ) ) ) ) ) . ) ),(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN on) (NP (DT a) (NN horse)))) (VP (VBZ jumps) (PP (IN over) (NP (DT a) (JJ broken) (JJ down) (NN airplane)))) (. .))),A person is training his horse for a competition.,( ( A person ) ( ( is ( ( training ( his horse ) ) ( for ( a competition ) ) ) ) . ) ),(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) (VP (VBG training) (NP (PRP$ his) (NN horse)) (PP (IN for) (NP (DT a) (NN competition))))) (. .)))
1,[contradiction],3416050480.jpg#4,contradiction,3416050480.jpg#4r1c,A person on a horse jumps over a broken down airplane.,( ( ( A person ) ( on ( a horse ) ) ) ( ( jumps ( over ( a ( broken ( down airplane ) ) ) ) ) . ) ),(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN on) (NP (DT a) (NN horse)))) (VP (VBZ jumps) (PP (IN over) (NP (DT a) (JJ broken) (JJ down) (NN airplane)))) (. .))),"A person is at a diner, ordering an omelette.","( ( A person ) ( ( ( ( is ( at ( a diner ) ) ) , ) ( ordering ( an omelette ) ) ) . ) )","(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) (PP (IN at) (NP (DT a) (NN diner))) (, ,) (S (VP (VBG ordering) (NP (DT an) (NN omelette))))) (. .)))"
2,[entailment],3416050480.jpg#4,entailment,3416050480.jpg#4r1e,A person on a horse jumps over a broken down airplane.,( ( ( A person ) ( on ( a horse ) ) ) ( ( jumps ( over ( a ( broken ( down airplane ) ) ) ) ) . ) ),(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN on) (NP (DT a) (NN horse)))) (VP (VBZ jumps) (PP (IN over) (NP (DT a) (JJ broken) (JJ down) (NN airplane)))) (. .))),"A person is outdoors, on a horse.","( ( A person ) ( ( ( ( is outdoors ) , ) ( on ( a horse ) ) ) . ) )","(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) (ADVP (RB outdoors)) (, ,) (PP (IN on) (NP (DT a) (NN horse)))) (. .)))"
3,[neutral],2267923837.jpg#2,neutral,2267923837.jpg#2r1n,Children smiling and waving at camera,( Children ( ( ( smiling and ) waving ) ( at camera ) ) ),(ROOT (NP (S (NP (NNP Children)) (VP (VBG smiling) (CC and) (VBG waving) (PP (IN at) (NP (NN camera))))))),They are smiling at their parents,( They ( are ( smiling ( at ( their parents ) ) ) ) ),(ROOT (S (NP (PRP They)) (VP (VBP are) (VP (VBG smiling) (PP (IN at) (NP (PRP$ their) (NNS parents)))))))
4,[entailment],2267923837.jpg#2,entailment,2267923837.jpg#2r1e,Children smiling and waving at camera,( Children ( ( ( smiling and ) waving ) ( at camera ) ) ),(ROOT (NP (S (NP (NNP Children)) (VP (VBG smiling) (CC and) (VBG waving) (PP (IN at) (NP (NN camera))))))),There are children present,( There ( ( are children ) present ) ),(ROOT (S (NP (EX There)) (VP (VBP are) (NP (NNS children)) (ADVP (RB present)))))
...,...,...,...,...,...,...,...,...,...,...
550147,[contradiction],2267923837.jpg#3,contradiction,2267923837.jpg#3r1c,Four dirty and barefooted children.,( ( ( ( Four dirty ) and ) ( barefooted children ) ) . ),(ROOT (NP (NP (CD Four) (NNS dirty)) (CC and) (NP (VBN barefooted) (NNS children)) (. .))),four kids won awards for 'cleanest feet',( ( four kids ) ( ( won awards ) ( ( ( for ` ) ( cleanest feet ) ) ' ) ) ),(ROOT (S (NP (CD four) (NNS kids)) (VP (VBD won) (NP (NNS awards)) (PP (IN for) (`` `) (NP (JJ cleanest) (NNS feet)) ('' ')))))
550148,[neutral],2267923837.jpg#3,neutral,2267923837.jpg#3r1n,Four dirty and barefooted children.,( ( ( ( Four dirty ) and ) ( barefooted children ) ) . ),(ROOT (NP (NP (CD Four) (NNS dirty)) (CC and) (NP (VBN barefooted) (NNS children)) (. .))),"four homeless children had their shoes stolen, so their feet are dirty.","( ( ( ( ( ( four ( homeless children ) ) ( had ( ( their shoes ) stolen ) ) ) , ) so ) ( ( their feet ) ( are dirty ) ) ) . )","(ROOT (S (S (NP (CD four) (JJ homeless) (NNS children)) (VP (VBD had) (NP (NP (PRP$ their) (NNS shoes)) (VP (VBN stolen))))) (, ,) (IN so) (S (NP (PRP$ their) (NNS feet)) (VP (VBP are) (ADJP (JJ dirty)))) (. .)))"
550149,[neutral],7979219683.jpg#2,neutral,7979219683.jpg#2r1n,A man is surfing in a bodysuit in beautiful blue water.,( ( A man ) ( ( is ( surfing ( in ( ( a bodysuit ) ( in ( beautiful ( blue water ) ) ) ) ) ) ) . ) ),(ROOT (S (NP (DT A) (NN man)) (VP (VBZ is) (VP (VBG surfing) (PP (IN in) (NP (NP (DT a) (NN bodysuit)) (PP (IN in) (NP (JJ beautiful) (JJ blue) (NN water))))))) (. .))),A man in a bodysuit is competing in a surfing competition.,( ( ( A man ) ( in ( a bodysuit ) ) ) ( ( is ( competing ( in ( a ( surfing competition ) ) ) ) ) . ) ),(ROOT (S (NP (NP (DT A) (NN man)) (PP (IN in) (NP (DT a) (NN bodysuit)))) (VP (VBZ is) (VP (VBG competing) (PP (IN in) (NP (DT a) (VBG surfing) (NN competition))))) (. .)))
550150,[contradiction],7979219683.jpg#2,contradiction,7979219683.jpg#2r1c,A man is surfing in a bodysuit in beautiful blue water.,( ( A man ) ( ( is ( surfing ( in ( ( a bodysuit ) ( in ( beautiful ( blue water ) ) ) ) ) ) ) . ) ),(ROOT (S (NP (DT A) (NN man)) (VP (VBZ is) (VP (VBG surfing) (PP (IN in) (NP (NP (DT a) (NN bodysuit)) (PP (IN in) (NP (JJ beautiful) (JJ blue) (NN water))))))) (. .))),A man in a business suit is heading to a board meeting.,( ( ( A man ) ( in ( a ( business suit ) ) ) ) ( ( is ( heading ( to ( a ( board meeting ) ) ) ) ) . ) ),(ROOT (S (NP (NP (DT A) (NN man)) (PP (IN in) (NP (DT a) (NN business) (NN suit)))) (VP (VBZ is) (VP (VBG heading) (PP (TO to) (NP (DT a) (NN board) (NN meeting))))) (. .)))


In [3]:
#De-duplicate
#.copy() gives an independent DataFrame, so later in-place edits do not trigger SettingWithCopyWarning
snli_data_sub = snli_data[['sentence1','sentence2']].copy()
print(len(snli_data_sub))
snli_data_sub.drop_duplicates(subset=['sentence1', 'sentence2'], keep='last',inplace=True)
print(len(snli_data_sub))

550152
549526


**Data preparation**

In [4]:
import nltk
from nltk import word_tokenize
from nltk.util import ngrams
from collections import Counter
nltk.download('punkt')

from nltk.corpus import stopwords
import string
nltk.download('stopwords')

from itertools import combinations
from tqdm import tqdm

[nltk_data] Downloading package punkt to
[nltk_data]     /home/t-jainprachi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/t-jainprachi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
#util functions
#Characters stripped from token edges: ASCII punctuation, a second pass over '"', '-' and '+', and the em dash.
#Kept as a local constant instead of overwriting string.punctuation.
PUNCTUATION = string.punctuation + '"' + '-' + '+' + '—'

def get_removal_list():
    #Tokens to drop entirely: English stop words, punctuation marks, and 'lt'/'rt'
    stop_words = set(stopwords.words('english'))
    return list(stop_words) + list(PUNCTUATION) + ['lt','rt']

removal_list = get_removal_list()

def remove_stopwords(sent_tokens):
    #Strip punctuation from both ends of every token, one character at a time
    for punct in PUNCTUATION:
        sent_tokens = [word.strip(punct).strip() for word in sent_tokens]

    #Drop stop words and punctuation tokens, then discard empty strings
    filtered_words = [word.strip() for word in sent_tokens if word not in removal_list]
    filtered_words = [ele for ele in filtered_words if ele]
    return filtered_words
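
To see what this filter does without downloading the NLTK resources, here is a simplified, self-contained sketch. It assumes a tiny hand-picked stop-word set in place of `stopwords.words('english')`, uses the hypothetical name `filter_tokens`, and strips all punctuation characters in a single call rather than one character at a time, so it is an approximation of the function above, not a drop-in replacement.

```python
import string

# Assumption: a tiny stand-in for NLTK's English stop-word list.
TOY_STOPWORDS = {"a", "is", "at", "an"}
# ASCII punctuation plus the em dash (U+2014).
PUNCT = string.punctuation + "\u2014"

def filter_tokens(tokens):
    """Strip edge punctuation, then drop stop words and empty tokens."""
    stripped = [tok.strip(PUNCT).strip() for tok in tokens]
    return [tok for tok in stripped if tok and tok not in TOY_STOPWORDS]

tokens = ["a", "person", "is", "at", "a", "diner", ",",
          "ordering", "an", "omelette", "."]
print(filter_tokens(tokens))
```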


In [6]:
#Lowercase
snli_data_sub['sentence1'] = snli_data_sub['sentence1'].map(str.lower)
snli_data_sub['sentence2'] = snli_data_sub['sentence2'].map(str.lower)
snli_data_sub


Unnamed: 0,sentence1,sentence2
0,a person on a horse jumps over a broken down airplane.,a person is training his horse for a competition.
1,a person on a horse jumps over a broken down airplane.,"a person is at a diner, ordering an omelette."
2,a person on a horse jumps over a broken down airplane.,"a person is outdoors, on a horse."
3,children smiling and waving at camera,they are smiling at their parents
4,children smiling and waving at camera,there are children present
...,...,...
550147,four dirty and barefooted children.,four kids won awards for 'cleanest feet'
550148,four dirty and barefooted children.,"four homeless children had their shoes stolen, so their feet are dirty."
550149,a man is surfing in a bodysuit in beautiful blue water.,a man in a bodysuit is competing in a surfing competition.
550150,a man is surfing in a bodysuit in beautiful blue water.,a man in a business suit is heading to a board meeting.


In [7]:
#Tokenize
snli_data_sub['sentence1_token'] = snli_data_sub['sentence1'].apply(nltk.word_tokenize)
snli_data_sub['sentence2_token'] = snli_data_sub['sentence2'].apply(nltk.word_tokenize)

snli_data_sub


Unnamed: 0,sentence1,sentence2,sentence1_token,sentence2_token
0,a person on a horse jumps over a broken down airplane.,a person is training his horse for a competition.,"[a, person, on, a, horse, jumps, over, a, broken, down, airplane, .]","[a, person, is, training, his, horse, for, a, competition, .]"
1,a person on a horse jumps over a broken down airplane.,"a person is at a diner, ordering an omelette.","[a, person, on, a, horse, jumps, over, a, broken, down, airplane, .]","[a, person, is, at, a, diner, ,, ordering, an, omelette, .]"
2,a person on a horse jumps over a broken down airplane.,"a person is outdoors, on a horse.","[a, person, on, a, horse, jumps, over, a, broken, down, airplane, .]","[a, person, is, outdoors, ,, on, a, horse, .]"
3,children smiling and waving at camera,they are smiling at their parents,"[children, smiling, and, waving, at, camera]","[they, are, smiling, at, their, parents]"
4,children smiling and waving at camera,there are children present,"[children, smiling, and, waving, at, camera]","[there, are, children, present]"
...,...,...,...,...
550147,four dirty and barefooted children.,four kids won awards for 'cleanest feet',"[four, dirty, and, barefooted, children, .]","[four, kids, won, awards, for, 'cleanest, feet, ']"
550148,four dirty and barefooted children.,"four homeless children had their shoes stolen, so their feet are dirty.","[four, dirty, and, barefooted, children, .]","[four, homeless, children, had, their, shoes, stolen, ,, so, their, feet, are, dirty, .]"
550149,a man is surfing in a bodysuit in beautiful blue water.,a man in a bodysuit is competing in a surfing competition.,"[a, man, is, surfing, in, a, bodysuit, in, beautiful, blue, water, .]","[a, man, in, a, bodysuit, is, competing, in, a, surfing, competition, .]"
550150,a man is surfing in a bodysuit in beautiful blue water.,a man in a business suit is heading to a board meeting.,"[a, man, is, surfing, in, a, bodysuit, in, beautiful, blue, water, .]","[a, man, in, a, business, suit, is, heading, to, a, board, meeting, .]"


In [8]:
#Remove stop-words
tqdm.pandas()  #registers progress_apply on pandas objects; tqdm itself was imported in an earlier cell

snli_data_sub['sentence1_token_nostopwords'] = snli_data_sub['sentence1_token'].progress_apply(remove_stopwords)
snli_data_sub['sentence2_token_nostopwords'] = snli_data_sub['sentence2_token'].progress_apply(remove_stopwords)

snli_data_sub

100%|██████████| 549526/549526 [00:22<00:00, 23915.07it/s]
100%|██████████| 549526/549526 [00:14<00:00, 37139.04it/s]


Unnamed: 0,sentence1,sentence2,sentence1_token,sentence2_token,sentence1_token_nostopwords,sentence2_token_nostopwords
0,a person on a horse jumps over a broken down airplane.,a person is training his horse for a competition.,"[a, person, on, a, horse, jumps, over, a, broken, down, airplane, .]","[a, person, is, training, his, horse, for, a, competition, .]","[person, horse, jumps, broken, airplane]","[person, training, horse, competition]"
1,a person on a horse jumps over a broken down airplane.,"a person is at a diner, ordering an omelette.","[a, person, on, a, horse, jumps, over, a, broken, down, airplane, .]","[a, person, is, at, a, diner, ,, ordering, an, omelette, .]","[person, horse, jumps, broken, airplane]","[person, diner, ordering, omelette]"
2,a person on a horse jumps over a broken down airplane.,"a person is outdoors, on a horse.","[a, person, on, a, horse, jumps, over, a, broken, down, airplane, .]","[a, person, is, outdoors, ,, on, a, horse, .]","[person, horse, jumps, broken, airplane]","[person, outdoors, horse]"
3,children smiling and waving at camera,they are smiling at their parents,"[children, smiling, and, waving, at, camera]","[they, are, smiling, at, their, parents]","[children, smiling, waving, camera]","[smiling, parents]"
4,children smiling and waving at camera,there are children present,"[children, smiling, and, waving, at, camera]","[there, are, children, present]","[children, smiling, waving, camera]","[children, present]"
...,...,...,...,...,...,...
550147,four dirty and barefooted children.,four kids won awards for 'cleanest feet',"[four, dirty, and, barefooted, children, .]","[four, kids, won, awards, for, 'cleanest, feet, ']","[four, dirty, barefooted, children]","[four, kids, awards, cleanest, feet]"
550148,four dirty and barefooted children.,"four homeless children had their shoes stolen, so their feet are dirty.","[four, dirty, and, barefooted, children, .]","[four, homeless, children, had, their, shoes, stolen, ,, so, their, feet, are, dirty, .]","[four, dirty, barefooted, children]","[four, homeless, children, shoes, stolen, feet, dirty]"
550149,a man is surfing in a bodysuit in beautiful blue water.,a man in a bodysuit is competing in a surfing competition.,"[a, man, is, surfing, in, a, bodysuit, in, beautiful, blue, water, .]","[a, man, in, a, bodysuit, is, competing, in, a, surfing, competition, .]","[man, surfing, bodysuit, beautiful, blue, water]","[man, bodysuit, competing, surfing, competition]"
550150,a man is surfing in a bodysuit in beautiful blue water.,a man in a business suit is heading to a board meeting.,"[a, man, is, surfing, in, a, bodysuit, in, beautiful, blue, water, .]","[a, man, in, a, business, suit, is, heading, to, a, board, meeting, .]","[man, surfing, bodysuit, beautiful, blue, water]","[man, business, suit, heading, board, meeting]"


In [9]:
print(len(snli_data_sub))

snli_data_sub['sentence1_token_nostopwords_str'] = snli_data_sub['sentence1_token_nostopwords'].apply(lambda x: ' '.join(x))
snli_data_sub['sentence2_token_nostopwords_str'] = snli_data_sub['sentence2_token_nostopwords'].apply(lambda x: ' '.join(x))

snli_data_sub.drop_duplicates(subset=['sentence1_token_nostopwords_str', 'sentence2_token_nostopwords_str'], keep='last', inplace=True)
print(len(snli_data_sub))

549526
547558


**Word association analysis**

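$\mathrm{PMI}(w_i, w_j) = \log_2 \frac{p(w_i, w_j)}{p(w_i)\,p(w_j)} = \log_2 \frac{N \cdot c(w_i, w_j)}{c(w_i)\,c(w_j)}$

where $c(\cdot)$ is a corpus count and $N$ is the total number of observations, so that $p(\cdot) = c(\cdot)/N$.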


In [10]:
#Computing Unigram frequency
corpus = []
for sent in (snli_data_sub['sentence1_token_nostopwords'].tolist() + snli_data_sub['sentence2_token_nostopwords'].tolist()):
  corpus+=(sent)
unigram_frequency = Counter(corpus)
unigram_frequency.most_common(20)

[('man', 265138),
 ('woman', 137105),
 ('two', 121714),
 ('people', 120803),
 ('wearing', 80592),
 ('young', 61353),
 ('men', 60853),
 ('playing', 59219),
 ('girl', 59094),
 ('boy', 58042),
 ('white', 56793),
 ('shirt', 56204),
 ('black', 54824),
 ('dog', 53688),
 ('sitting', 53544),
 ('blue', 49040),
 ('standing', 46189),
 ('red', 43130),
 ('group', 43061),
 ('walking', 38681)]

In [11]:
#Remove words that occur fewer than 15 times
from itertools import dropwhile
print(len(unigram_frequency))
for key, count in dropwhile(lambda key_count: key_count[1] >= 15, unigram_frequency.most_common()):
    del unigram_frequency[key]
print(len(unigram_frequency))

36193
10638
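
The `dropwhile` idiom above works because `most_common()` returns entries sorted by descending count, so everything after the first entry below the threshold can be deleted. A possibly more direct alternative is to rebuild the counter with a comprehension. The sketch below uses made-up counts and the hypothetical helper name `prune_counter`.

```python
from collections import Counter

def prune_counter(counts, min_count):
    """Return a new Counter keeping only entries with count >= min_count."""
    return Counter({key: c for key, c in counts.items() if c >= min_count})

# Made-up counts for illustration.
toy_counts = Counter({"man": 40, "woman": 22, "zeppelin": 3, "omelette": 15})
pruned = prune_counter(toy_counts, min_count=15)
print(sorted(pruned.items()))
```

Unlike the in-place deletion, this version leaves the original counter untouched, which can help when the unfiltered counts are needed again later.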


In [12]:
choice = "separate_prem_hypo"

if choice == "combine_prem_hypo":
  snli_data_sub['sentence1_sentence2_token_nostopwords'] = snli_data_sub['sentence1_token_nostopwords'] + snli_data_sub['sentence2_token_nostopwords']
  ##MAIN
  bigram_doc=[]
  for ele in tqdm(snli_data_sub['sentence1_sentence2_token_nostopwords'].tolist()):
    bigram_doc.append(list(set(list(nltk.bigrams(ele)))))
  bigram=[]
  for doc in bigram_doc:
    bigram += doc
  bigram_frequency = Counter(bigram)
  bigram_frequency
elif choice == "separate_prem_hypo":
  ##
  snli_data_sub_s1 = snli_data_sub[['sentence1_token_nostopwords']].copy()
  snli_data_sub_s2 = snli_data_sub[['sentence2_token_nostopwords']].copy()

  print("S1",len(snli_data_sub_s1))
  snli_data_sub_s1['sentence1_token_nostopwords_str'] = snli_data_sub_s1['sentence1_token_nostopwords'].apply(lambda x: ' '.join(x))
  snli_data_sub_s1.drop_duplicates(subset=['sentence1_token_nostopwords_str'], keep='last',inplace=True)
  print("S1",len(snli_data_sub_s1))

  print("S2",len(snli_data_sub_s2))
  snli_data_sub_s2['sentence2_token_nostopwords_str'] = snli_data_sub_s2['sentence2_token_nostopwords'].apply(lambda x: ' '.join(x))
  snli_data_sub_s2.drop_duplicates(subset=['sentence2_token_nostopwords_str'], keep='last',inplace=True)
  print("S2",len(snli_data_sub_s2))
  ##MAIN
  bigram_doc=[]
  for ele in tqdm(snli_data_sub_s1['sentence1_token_nostopwords'].tolist()+snli_data_sub_s2['sentence2_token_nostopwords'].tolist()):
    bigram_doc.append(list(set(list(nltk.bigrams(ele)))))
  bigram=[]
  for doc in bigram_doc:
    bigram += doc
  bigram_frequency = Counter(bigram)
  bigram_frequency

S1 547558
S1 149532
S2 547558
S2 423258


100%|██████████| 572790/572790 [00:01<00:00, 349408.41it/s]
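
Because each sentence's bigrams are passed through `set(...)` before counting, the cell above counts a word pair at most once per sentence, which makes `bigram_frequency` a sentence-level (document) frequency. The minimal sketch below reproduces that idea on two made-up sentences, using `zip` instead of `nltk.bigrams`; the names `sentence_bigrams` and `bigram_df` are hypothetical.

```python
from collections import Counter

def sentence_bigrams(tokens):
    """Set of adjacent (left, right) token pairs in one sentence."""
    return set(zip(tokens, tokens[1:]))

# Made-up sentences; the first repeats the pair ("man", "playing").
toy_sentences = [
    ["man", "playing", "guitar", "man", "playing"],
    ["man", "playing", "drums"],
]

bigram_df = Counter()
for tokens in toy_sentences:
    # Each distinct pair adds one to its count per sentence.
    bigram_df.update(sentence_bigrams(tokens))

print(bigram_df[("man", "playing")])
```

Note that only adjacent tokens form a pair, and adjacency is judged after stop-word removal, so two words separated only by stop words in the original sentence still count as a bigram.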


In [13]:
#Remove word pairs that occur fewer than 10 times
from itertools import dropwhile
print(len(bigram_frequency))
for key, count in dropwhile(lambda key_count: key_count[1] >= 10, bigram_frequency.most_common()):
    del bigram_frequency[key]
print(len(bigram_frequency))

659784
29093


$\mathrm{PMI}(w_i, w_j) = \log_2 \frac{p(w_i, w_j)}{p(w_i)\,p(w_j)} = \log_2 \frac{N \cdot c(w_i, w_j)}{c(w_i)\,c(w_j)}$


In [14]:
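#Calculate PMI
import math

def pmi(word1, word2, unigram_freq, bigram_freq):
  #Returns PMI in bits, or None if either word was filtered out or the pair was never observed
  if word1 not in unigram_freq or word2 not in unigram_freq:
    return None
  #Count the pair in both orders, matching the membership check on (word1, word2) and (word2, word1)
  pair_count = bigram_freq.get((word1, word2), 0)
  if word1 != word2:
    pair_count += bigram_freq.get((word2, word1), 0)
  if pair_count == 0:
    return None
  #Compute each normalizer once per call
  total_unigrams = float(sum(unigram_freq.values()))
  total_bigrams = float(sum(bigram_freq.values()))
  prob_word1 = unigram_freq[word1] / total_unigrams
  prob_word2 = unigram_freq[word2] / total_unigrams
  prob_word1_word2 = pair_count / total_bigrams
  return math.log2(prob_word1_word2 / (prob_word1 * prob_word2))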


In [16]:
def Sort(pairs):
    # Sort (word, score) pairs by score, highest first
    return sorted(pairs, key=lambda a: a[1], reverse=True)

def get_pmi(identity_label,unigram_frequency,bigram_frequency):
  # Labels may carry stray whitespace or capitals; normalize before lookup
  label = identity_label.strip().lower()
  pmi_identity_label=[]
  for word in tqdm(unigram_frequency.keys()):
    pmi_score = pmi(word1=label, word2=word,unigram_freq=unigram_frequency,bigram_freq=bigram_frequency)
    # Compare against None so that a PMI of exactly 0 is kept
    if pmi_score is not None:
      pmi_identity_label.append((word,pmi_score))

  return Sort(pmi_identity_label)
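
Once a list of `(word, score)` pairs is available for a label, a natural next step is to inspect only the strongest associations. The following minimal sketch does this with `heapq.nlargest`; the scores are made up and the helper name `top_k` is hypothetical.

```python
import heapq

# Hypothetical (word, PMI) pairs shaped like the output of get_pmi;
# the numbers are made up for illustration.
toy_scores = [("guitar", 1.2), ("kitchen", 3.4), ("dress", 2.8), ("beard", -0.5)]

def top_k(scored_words, k):
    """Return the k pairs with the highest score, best first."""
    return heapq.nlargest(k, scored_words, key=lambda pair: pair[1])

print(top_k(toy_scores, 2))
```

For short lists this is equivalent to slicing the sorted output, but it avoids sorting the full vocabulary when only a handful of top words is needed.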


In [30]:
from collections import defaultdict as dd
pmi_identity_label = dd(list)
for identity_label in (identity_labels):
  print(identity_label)
  #Some labels carry trailing whitespace (e.g. 'filipino '); normalize for lookup only
  label = identity_label.strip().lower()
  for word in tqdm(unigram_frequency.keys()):
    pmi_score = pmi(word1=label, word2=word,unigram_freq=unigram_frequency,bigram_freq=bigram_frequency)
    #Compare against None so that a PMI of exactly 0 is kept
    if pmi_score is not None:
      pmi_identity_label[identity_label].append((word,pmi_score))

woman
women
man
men
girl
girls
boy
boys
she
he
her
him
his
female
male
mother
father
sister
brother
daughter
son
feminine
masculine
androgynous
trans
transgender
transsexual
nonbinary
non-binary
two-spirit
hijra
genderqueer
black
asian
hispanic
white
african
american
latino
latina
caucasian
africans
middle-eastern
australian
australians
asians
european
europeans
chinese
indian
indonesian
brazilian
pakistani
bangladeshi
russian
nigerian
japanese
mexican
filipino
vietnamese
german
egyptian
ethiopian
turkish
iranian
thai
congolese
french
british
italian
korean
burmese
canadian
australian
spanish
dutch
swiss
saudi
argentinian
taiwanese
swedish
belgian
polish
israeli
irish
greek
ukrainian
jamaican
mongolian
armenian
disability
disabled
handicap
handicapped
mentally
mental
autistic
autism
lesbian
lesbians


gay


100%|██████████| 10638/10638 [00:00<00:00, 779344.06it/s]


bisexual


100%|██████████| 10638/10638 [00:00<00:00, 2460787.89it/s]


pansexual


100%|██████████| 10638/10638 [00:00<00:00, 2762273.63it/s]


asexual


100%|██████████| 10638/10638 [00:00<00:00, 2592319.66it/s]


queer


100%|██████████| 10638/10638 [00:00<00:00, 2670038.06it/s]


straight


100%|██████████| 10638/10638 [00:00<00:00, 694740.38it/s]


muslim


100%|██████████| 10638/10638 [00:00<00:00, 918710.36it/s]


christian


100%|██████████| 10638/10638 [00:00<00:00, 1386415.37it/s]


jew


100%|██████████| 10638/10638 [00:00<00:00, 2415232.54it/s]


jewish


100%|██████████| 10638/10638 [00:00<00:00, 1234103.33it/s]


sikh


100%|██████████| 10638/10638 [00:00<00:00, 2311147.10it/s]


buddhist


100%|██████████| 10638/10638 [00:00<00:00, 1028300.94it/s]


hindu


100%|██████████| 10638/10638 [00:00<00:00, 1380154.23it/s]


atheist


100%|██████████| 10638/10638 [00:00<00:00, 2591115.33it/s]


muslims


100%|██████████| 10638/10638 [00:00<00:00, 1454430.08it/s]


christians


100%|██████████| 10638/10638 [00:00<00:00, 1515488.28it/s]


jews


100%|██████████| 10638/10638 [00:00<00:00, 2422576.06it/s]


sikhs


100%|██████████| 10638/10638 [00:00<00:00, 2661120.41it/s]


buddhists


100%|██████████| 10638/10638 [00:00<00:00, 2820952.52it/s]


hindus


100%|██████████| 10638/10638 [00:00<00:00, 2430096.72it/s]


atheists


100%|██████████| 10638/10638 [00:00<00:00, 2737530.27it/s]


old


100%|██████████| 10638/10638 [00:00<00:00, 120571.05it/s]


elderly


100%|██████████| 10638/10638 [00:00<00:00, 356236.72it/s]


retired


100%|██████████| 10638/10638 [00:00<00:00, 1544123.96it/s]


teenage


100%|██████████| 10638/10638 [00:00<00:00, 567080.22it/s]


young


100%|██████████| 10638/10638 [00:00<00:00, 65195.01it/s]


senior


100%|██████████| 10638/10638 [00:00<00:00, 846371.37it/s]


seniors


100%|██████████| 10638/10638 [00:00<00:00, 1319308.28it/s]


teenager


100%|██████████| 10638/10638 [00:00<00:00, 743290.84it/s]


teenagers


100%|██████████| 10638/10638 [00:00<00:00, 355620.60it/s]





100%|██████████| 10638/10638 [00:00<00:00, 2339503.25it/s]


In [31]:
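# Sort each label's (word, PMI) pairs with the Sort helper defined earlier in the notebook.
for identity in pmi_identity_label:
    pmi_identity_label[identity] = Sort(pmi_identity_label[identity])

# Print the first 20 ranked words for every identity label.
for key, ranked in pmi_identity_label.items():
    print(key, ranked[:20])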

woman [('buys', 5.864070952771102), ('ordered', 5.826092903207355), ('considering', 5.7607391652132565), ('knits', 5.489748035551119), ('weaves', 5.480287706302052), ('modeling', 5.416784763995894), ('applies', 5.40228519430078), ('flirting', 5.213251369910762), ('sips', 5.200651333131129), ('dries', 5.1881603889479315), ('sews', 5.169624437510505), ('frowns', 5.088624715397017), ('admires', 5.012718382538053), ('feeds', 4.944603357613442), ('shares', 4.817322693579623), ('likes', 4.769342664677802), ('calls', 4.748460835434819), ('owns', 4.714752959939076), ('ironing', 4.6907902876506915), ('gave', 4.61729808527163)]
women [('gossiping', 8.184286926573977), ('bikinis', 6.583842998409607), ('discuss', 5.990145687710843), ('knitting', 5.624395126157468), ('strolling', 5.616330851158511), ('sunbathing', 5.605914235213173), ('discussing', 5.573367554751044), ('chatting', 5.504525353057125), ('chat', 5.490390054299658), ('arguing', 5.3394242707431605), ('jog', 5.245883502903738), ('dressin

In [17]:
# PMI of every vocabulary word with each singular identity label.
man_pmi_data = get_pmi('man', unigram_frequency, bigram_frequency)
woman_pmi_data = get_pmi('woman', unigram_frequency, bigram_frequency)
gay_pmi_data = get_pmi('gay', unigram_frequency, bigram_frequency)

# Plural and related labels, pooled with the singular ones in the next cell.
men_pmi_data = get_pmi('men', unigram_frequency, bigram_frequency)
women_pmi_data = get_pmi('women', unigram_frequency, bigram_frequency)
lesbian_pmi_data = get_pmi('lesbian', unigram_frequency, bigram_frequency)

lady_pmi_data = get_pmi('lady', unigram_frequency, bigram_frequency)



In [18]:
def filter_pmi_set(pmi_data, threshold):
    # Keep the words whose PMI with the identity label exceeds the threshold.
    return {word for word, score in pmi_data if score > threshold}


# Pool related labels into one word set per group, using a PMI threshold of 2.0.
male_pmi_set = filter_pmi_set(man_pmi_data, 2.0).union(filter_pmi_set(men_pmi_data, 2.0))
female_pmi_set = filter_pmi_set(woman_pmi_data, 2.0).union(filter_pmi_set(women_pmi_data, 2.0)).union(filter_pmi_set(lady_pmi_data, 2.0))
lgbt_pmi_set = filter_pmi_set(gay_pmi_data, 2.0).union(filter_pmi_set(lesbian_pmi_data, 2.0))

In [19]:
male_pmi_set

{'accepts',
 'accidentally',
 'acting',
 'actually',
 'adjusting',
 'adjusts',
 'admires',
 'admiring',
 'advertising',
 'afraid',
 'afro',
 'aiming',
 'almost',
 'angrily',
 'appear',
 'approaches',
 'approaching',
 'argue',
 'arguing',
 'army',
 'arranging',
 'arrested',
 'asked',
 'asking',
 'asks',
 'asleep',
 'ate',
 'attempt',
 'attempting',
 'attempts',
 'attending',
 'attends',
 'awake',
 'backpack',
 'backpacks',
 'bad',
 'baggy',
 'baking',
 'balancing',
 'balcony',
 'bar',
 'barbecuing',
 'beard',
 'beating',
 'begging',
 'beige',
 'bending',
 'bends',
 'bent',
 'beret',
 'best',
 'bicycles',
 'bicycling',
 'biking',
 'black',
 'blowing',
 'blows',
 'blue',
 'boat',
 'bought',
 'bowler',
 'bowls',
 'boxing',
 'breakdances',
 'breakdancing',
 'breaking',
 'breaks',
 'bringing',
 'brings',
 'broke',
 'brothers',
 'browsing',
 'brushing',
 'build',
 'builds',
 'bundled',
 'bungee',
 'buried',
 'burning',
 'burns',
 'business',
 'buying',
 'buys',
 'ca',
 'calling',
 'calls',
 '

In [20]:
female_pmi_set

{'adjusting',
 'adjusts',
 'admires',
 'admiring',
 'advertising',
 'afraid',
 'alone',
 'also',
 'angry',
 'applies',
 'applying',
 'arguing',
 'asking',
 'attending',
 'attends',
 'baking',
 'balances',
 'balancing',
 'ballet',
 'bathing',
 'beige',
 'bending',
 'bends',
 'bent',
 'biking',
 'bikini',
 'bikinis',
 'black',
 'blond-hair',
 'blonde',
 'blow',
 'blowing',
 'blows',
 'blue',
 'bought',
 'braids',
 'breaking',
 'breaks',
 'bright',
 'bringing',
 'broke',
 'brown',
 'browsing',
 'brushes',
 'brushing',
 'bundled',
 'buying',
 'buys',
 'ca',
 'calling',
 'calls',
 'camping',
 'cane',
 'carried',
 'carries',
 'carry',
 'carrying',
 'celebrates',
 'celebrating',
 'chat',
 'chatting',
 'checking',
 'checks',
 'cheering',
 'chopping',
 'cleaning',
 'cleans',
 'climbs',
 'close',
 'coffee',
 'collecting',
 'colorful',
 'compete',
 'competes',
 'competing',
 'considering',
 'conversation',
 'converse',
 'conversing',
 'cook',
 'cooking',
 'cooks',
 'costumes',
 'covering',
 'cove

In [21]:
male_pmi_set - female_pmi_set

{'accepts',
 'accidentally',
 'acting',
 'actually',
 'afro',
 'aiming',
 'almost',
 'angrily',
 'appear',
 'approaches',
 'approaching',
 'argue',
 'army',
 'arranging',
 'arrested',
 'asked',
 'asks',
 'asleep',
 'ate',
 'attempt',
 'attempting',
 'attempts',
 'awake',
 'backpack',
 'backpacks',
 'bad',
 'baggy',
 'balcony',
 'bar',
 'barbecuing',
 'beard',
 'beating',
 'begging',
 'beret',
 'best',
 'bicycles',
 'bicycling',
 'boat',
 'bowler',
 'bowls',
 'boxing',
 'breakdances',
 'breakdancing',
 'brings',
 'brothers',
 'build',
 'builds',
 'bungee',
 'buried',
 'burning',
 'burns',
 'business',
 'came',
 'camouflage',
 'canoe',
 'carefully',
 'carves',
 'carving',
 'cast',
 'casual',
 'catches',
 'catching',
 'caught',
 'celebrate',
 'changes',
 'changing',
 'chased',
 'chasing',
 'checkered',
 'cherry',
 'chiseling',
 'chops',
 'clean',
 'clearing',
 'climb',
 'climbed',
 'climbing',
 'closes',
 'closing',
 'collared',
 'collects',
 'completely',
 'concentrating',
 'conducting',

In [22]:
female_pmi_set - male_pmi_set

{'alone',
 'also',
 'angry',
 'applies',
 'applying',
 'balances',
 'ballet',
 'bathing',
 'bikini',
 'bikinis',
 'blond-hair',
 'blonde',
 'blow',
 'braids',
 'bright',
 'brown',
 'brushes',
 'cheering',
 'close',
 'coffee',
 'colorful',
 'considering',
 'costumes',
 'covering',
 'cross',
 'crying',
 'curly',
 'date',
 'daughters',
 'displaying',
 'displays',
 'dress',
 'dresses',
 'dries',
 'drops',
 'drying',
 'embrace',
 'embracing',
 'ethnic',
 'exiting',
 'fancy',
 'floral',
 'flowered',
 'formal',
 'get',
 'gossiping',
 'green',
 'hair',
 'happy',
 'headscarf',
 'heels',
 'hospital',
 'hug',
 'hula',
 'husband',
 'indoors',
 'jog',
 'kimono',
 'kiss',
 'kitchen',
 'knees',
 'knit',
 'knits',
 'laugh',
 'laundry',
 'leopard',
 'library',
 'made',
 'maroon',
 'modeling',
 'multicolored',
 'ordered',
 'pajamas',
 'pass',
 'pet',
 'photographed',
 'pink',
 'purple',
 'purse',
 'red',
 'red-hair',
 'roller',
 'rolling',
 'saying',
 'screaming',
 'shares',
 'shop',
 'shopping',
 'shor

In [23]:
lgbt_pmi_set - male_pmi_set.union(female_pmi_set)

{'couple', 'men', 'pride', 'rights'}
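
The set differences above depend on the hard threshold of 2.0: a word scoring 2.1 for one group and 1.9 for the other is reported as exclusive to the first group, even though the two scores are nearly equal. A possible complement, sketched here on toy data, is to rank the shared words by the difference between their PMI scores. The helper name `pmi_gap` and the toy values are hypothetical and not part of the notebook; the sketch only assumes that `get_pmi` returns a list of `(word, score)` pairs, as the printed output suggests.

```python
def pmi_gap(pmi_a, pmi_b):
    """Rank words present in both lists by score_a - score_b, largest gap first.

    Hypothetical helper: each argument is assumed to be a list of
    (word, score) pairs, like the values printed from get_pmi above.
    """
    scores_a, scores_b = dict(pmi_a), dict(pmi_b)
    shared = scores_a.keys() & scores_b.keys()
    gaps = [(word, scores_a[word] - scores_b[word]) for word in shared]
    # Sort by descending gap; break ties alphabetically for a stable order.
    return sorted(gaps, key=lambda pair: (-pair[1], pair[0]))


# Toy scores (made up for illustration, not taken from the dataset).
toy_a = [("knits", 5.5), ("buys", 5.9), ("walks", 2.1)]
toy_b = [("buys", 5.0), ("walks", 1.9), ("builds", 4.0)]

# Thresholded set difference, as in the cells above.
exclusive_a = {w for w, s in toy_a if s > 2.0} - {w for w, s in toy_b if s > 2.0}

# Score gaps for the words both groups share.
gaps = pmi_gap(toy_a, toy_b)
print(exclusive_a)
print(gaps)
```

In the toy data, `walks` lands in the thresholded difference although its gap is only about 0.2, while `buys` has the larger gap and is hidden by the set difference. Words that appear in only one list are left out of the ranking on purpose, since treating a missing score as any particular number would be a further assumption.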

In [27]:
search_term = "dirty"
mask = snli_data["sentence2"].str.contains(search_term, case=False, na=False)
pd.set_option('display.max_colwidth', 0)
#pd.set_option("display.max_rows", None)
snli_data[mask][["gold_label","sentence1","sentence2"]]



Unnamed: 0,gold_label,sentence1,sentence2
1857,neutral,a woman on a yellow shirt is on the floor.,A woman lying on her shirt so her pants wont get dirty.
5352,neutral,An old man wit a straw hat and umbrella is smiling while sitting and watching something in the distance.,The hat is dirty.
5358,contradiction,A young boy wearing a red shirt with yellow writing on it and a red sports cap is squatted in shallow water and touching a yellow pail filled with sand.,"The little kid with a red shirt and yellow letters, stood on a foot stool to reach both his hands into the dirty water in the sink."
5518,contradiction,A father or family friend playing with baby in a very clean looking house.,The house is dirty looking.
6208,neutral,A man in orange shorts sits on some steps in front of a big blue door.,A young man in shorts sits on some dirty steps.
...,...,...,...
544008,contradiction,Children in festive attire walking with others on a street.,Kids in dirty blue jeans walk on the street.
547285,neutral,Three people with white gloves on picking up trash on a beach.,Three people pick up dirty needles on a beach.
548798,entailment,Two people are cleaning the road.,Two people are cleaning the dirty road.
550146,entailment,Four dirty and barefooted children.,four children have dirty feet.


In [28]:
search_term = "extravagant"
mask = snli_data["sentence1"].str.contains(search_term, case=False, na=False)
pd.set_option('display.max_colwidth', 0)
snli_data[mask][["gold_label","sentence1","sentence2"]]

Unnamed: 0,gold_label,sentence1,sentence2
20205,neutral,A Indian women displaying her cultural heritage with painted red fingers and an extravagant head piece.,The woman is preparing for her wedding.
20206,contradiction,A Indian women displaying her cultural heritage with painted red fingers and an extravagant head piece.,"A woman dresses in western fashion, with unadorned fingers and head."
20207,neutral,A Indian women displaying her cultural heritage with painted red fingers and an extravagant head piece.,An Indian lady is wearing a beautiful outfit.
20208,entailment,A Indian women displaying her cultural heritage with painted red fingers and an extravagant head piece.,An Indian woman is wearing traditional styles.
20209,neutral,A Indian women displaying her cultural heritage with painted red fingers and an extravagant head piece.,A woman shows off to a crowd of people.
20210,neutral,A Indian women displaying her cultural heritage with painted red fingers and an extravagant head piece.,A woman is ready for a national cultural celebration.
20211,contradiction,A Indian women displaying her cultural heritage with painted red fingers and an extravagant head piece.,A Japanese woman is wearing a kimono.
20212,entailment,A Indian women displaying her cultural heritage with painted red fingers and an extravagant head piece.,"A very colorfully robed female with fingers painted red, who's culture appears to be Indian, has on an elaborate head piece."
20213,entailment,A Indian women displaying her cultural heritage with painted red fingers and an extravagant head piece.,"A woman is dressed in traditional, ethnic clothing."
20214,contradiction,A Indian women displaying her cultural heritage with painted red fingers and an extravagant head piece.,A woman wearing an old-fashioned apron holds an apple pie in her hands.


In [29]:
search_term = "terrorist"
mask = snli_data["sentence2"].str.contains(search_term, case=False, na=False)
pd.set_option('display.max_colwidth', 0)
#pd.set_option("display.max_rows", None)
snli_data[mask][["gold_label","sentence1","sentence2"]]



Unnamed: 0,gold_label,sentence1,sentence2
51534,neutral,People praying by a car while another reads something from a large piece of paper.,The people read the newspaper in horror of the recent American terrorist attacks.
54040,contradiction,A young woman is standing in a darkly-lit room with two red lamps hanging overhead.,George Washington has released a video claiming responsibility for the most recent terrorist attack.
82780,contradiction,Children sitting and writing.,Children writing bomb threats to mail to terrorists.
97940,contradiction,"A man holds up a tiny sign reading ""I'm a photographer not a terrorist.""",A cat is a terrorist.
98016,contradiction,"A man is holding a souvenir that says ""I'm a photographer not a terrorist"" at what appears to be a rally somewhere in Europe.","A man is stealing someones souvenir that says ""I'm a terrorist not a photographer please shoot me""."
98018,entailment,"A man is holding a souvenir that says ""I'm a photographer not a terrorist"" at what appears to be a rally somewhere in Europe.","Ar a rally in Europe a man is holding a souvenir that says ""I'm a photographer not a terrorist""."
117324,neutral,A man holds a flying national flag amid debris and smoke.,The flag is raised as a symbol of strength after the terrorist attack.
119273,neutral,A group of people are congregating outside with a wave flying and a banner of foreign writing on it.,The people are terrorists
130270,neutral,Several Muslim worshipers march towards Mecca.,The Muslims are terrorists.
135622,neutral,Military personnel patrol an area.,Soldiers are looking for an escaped terrorist.
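
The three searches above repeat the same steps with a different term and column. A small helper could wrap them; the sketch below is one way to do it, not part of the original notebook. The name `find_rows` and the toy DataFrame are hypothetical, and the column names are assumed to match `snli_data`. It escapes the term so that `str.contains` treats it literally (the method interprets its pattern as a regular expression by default), and offers an optional whole-word mode.

```python
import re

import pandas as pd


def find_rows(df, column, term, whole_word=False):
    """Return the rows of df whose `column` contains `term`, ignoring case.

    Hypothetical helper. whole_word=False reproduces the substring
    search used in the cells above; whole_word=True only matches the
    term when it stands as a separate word.
    """
    pattern = re.escape(term)  # treat the search term literally
    if whole_word:
        pattern = rf"\b{pattern}\b"
    mask = df[column].str.contains(pattern, case=False, na=False, regex=True)
    return df.loc[mask, ["gold_label", "sentence1", "sentence2"]]


# Toy frame standing in for snli_data (rows invented for illustration).
toy = pd.DataFrame({
    "gold_label": ["neutral", "entailment", "contradiction"],
    "sentence1": ["A man sits on steps.", "Two people clean.", "A child plays."],
    "sentence2": ["The steps are dirty.", "They wash dirtying rags.", None],
})

substring_hits = find_rows(toy, "sentence2", "dirty")
word_hits = find_rows(toy, "sentence2", "dirty", whole_word=True)
print(substring_hits)
print(word_hits)
```

With the real data, a call such as `find_rows(snli_data, "sentence2", "terrorist")` would stand in for the cell above. Note that the whole-word mode would skip plural forms like "terrorists", so the substring default is kept to match the original behaviour.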
