# Word Embeddings
## As used in "Word embeddings quantify 100 years of gender and ethnic stereotypes"

The article by Garg et al. investigates and validates the use of machine-learned word embeddings to study biases in language:

```
"In word-embedding models, each word in a given language is assigned to a high-dimensional vector such that the geometry of the vectors captures semantic relations between the words — e.g., vectors being closer together has been shown to correspond to more similar words."
```
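
To make the geometric idea concrete, here is a minimal sketch using made-up three-dimensional vectors (real embeddings have hundreds of dimensions). The vectors and the helper name `cosine_similarity` are illustrative assumptions, not values taken from any trained model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy "embeddings", invented for illustration only.
toy = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
}

sim_cat_dog = cosine_similarity(toy["cat"], toy["dog"])
sim_cat_car = cosine_similarity(toy["cat"], toy["car"])

# Vectors pointing in similar directions score higher than unrelated ones.
print("cat vs dog:", round(sim_cat_dog, 3))
print("cat vs car:", round(sim_cat_car, 3))
```

In this toy setup "cat" and "dog" point in nearly the same direction, so their similarity is much higher than that of "cat" and "car".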

Using models pre-trained on large text corpora, the authors compare the vectors of words relating to gender and ethnicity with those of "neutral" word categories, such as occupations, to measure bias.

### Load model(s) and look at vectors

```
"For contemporary snapshot analysis, we use the standard Google News word2vec vectors trained on the Google News dataset."
```

In [None]:
from gensim.models.keyedvectors import KeyedVectors
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

In [17]:
print("Length of each word's vector (for this model):", len(model["cat"]))
print("Sample vector for 'cat':")
print(model["cat"])

Length of each word's vector (for this model): 300
Sample vector for 'cat':
[ 0.0123291   0.20410156 -0.28515625  0.21679688  0.11816406  0.08300781
  0.04980469 -0.00952148  0.22070312 -0.12597656  0.08056641 -0.5859375
 -0.00445557 -0.296875   -0.01312256 -0.08349609  0.05053711  0.15136719
 -0.44921875 -0.0135498   0.21484375 -0.14746094  0.22460938 -0.125
 -0.09716797  0.24902344 -0.2890625   0.36523438  0.41210938 -0.0859375
 -0.07861328 -0.19726562 -0.09082031 -0.14160156 -0.10253906  0.13085938
 -0.00346375  0.07226562  0.04418945  0.34570312  0.07470703 -0.11230469
  0.06738281  0.11230469  0.01977539 -0.12353516  0.20996094 -0.07226562
 -0.02783203  0.05541992 -0.33398438  0.08544922  0.34375     0.13964844
  0.04931641 -0.13476562  0.16308594 -0.37304688  0.39648438  0.10693359
  0.22167969  0.21289062 -0.08984375  0.20703125  0.08935547 -0.08251953
  0.05957031  0.10205078 -0.19238281 -0.09082031  0.4921875   0.03955078
 -0.07080078 -0.0019989  -0.23046875  0.25585938  0.089

In [3]:
# "man" is to "king" as "woman" is to _______

model.most_similar(positive=['woman', 'king'], negative=['man'])

[('queen', 0.7118193507194519),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321839332581),
 ('kings', 0.5236844420433044),
 ('Queen_Consort', 0.5235945582389832),
 ('queens', 0.5181134343147278),
 ('sultan', 0.5098593831062317),
 ('monarchy', 0.5087411999702454)]
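
Conceptually, this query adds the `positive` vectors, subtracts the `negative` ones, and ranks the remaining words by cosine similarity to the result. The sketch below illustrates that idea with invented 2-D vectors; the `analogy` helper and all numbers are assumptions for illustration. gensim's own implementation works on length-normalized vectors, so this is a conceptual sketch rather than a reimplementation.

```python
import numpy as np

# Hypothetical 2-D vectors, invented for illustration:
# axis 0 ~ "royalty", axis 1 ~ "gender" (negative = male, positive = female).
toy = {
    "man":      np.array([0.1, -1.0]),
    "woman":    np.array([0.1,  1.0]),
    "king":     np.array([1.0, -1.0]),
    "queen":    np.array([1.0,  1.0]),
    "princess": np.array([0.7,  1.0]),
    "apple":    np.array([-1.0, 0.1]),
}

def analogy(positive, negative, vectors):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    exclude = set(positive) | set(negative)
    scores = []
    for word, vec in vectors.items():
        if word in exclude:
            continue  # the query words themselves are not returned
        cos = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        scores.append((word, float(cos)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# "man" is to "king" as "woman" is to _______
ranked = analogy(positive=["woman", "king"], negative=["man"], vectors=toy)
print(ranked)
```

Here `king - man + woman` lands on the toy "queen" vector, which therefore ranks first.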

In [10]:
# "paltry" is to "significance" as "banal" is to _______

model.most_similar(positive=['banal', 'significance'], negative=['paltry'])

[('symbolism', 0.5135414600372314),
 ('meanings', 0.4999716579914093),
 ('resonance', 0.47358712553977966),
 ('historical_significance', 0.46956706047058105),
 ('relevance', 0.46694469451904297),
 ('importance', 0.4652005136013031),
 ('liminality', 0.45890143513679504),
 ('unknowability', 0.45856061577796936),
 ('profundity', 0.44765299558639526),
 ('significances', 0.44475480914115906)]

In [11]:
# "clumsy" is to "botch" as "lazy" is to ________

model.most_similar(positive=['botch', 'lazy'], negative=['clumsy'])

[('lazy_bums', 0.45563870668411255),
 ('dammit', 0.40230637788772583),
 ('bungle', 0.4007464349269867),
 ('lousy', 0.3992597460746765),
 ('dopes', 0.39685630798339844),
 ('slackers', 0.392386257648468),
 ("f'd", 0.3912498652935028),
 ('stinkin', 0.3887179493904114),
 ('shitty', 0.3877786695957184),
 ('gullable', 0.3874760568141937)]

In [4]:
model.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])

[('queen', 0.9314123392105103),
 ('monarch', 0.858533501625061),
 ('princess', 0.8476566076278687),
 ('Queen_Consort', 0.8150269985198975),
 ('queens', 0.8099815249443054),
 ('crown_prince', 0.8089976906776428),
 ('royal_palace', 0.8027306795120239),
 ('monarchy', 0.8019613027572632),
 ('prince', 0.800979733467102),
 ('empress', 0.7958389520645142)]
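
`most_similar_cosmul` ranks candidates with a multiplicative objective instead of an additive one: similarities to the `positive` words are multiplied together and divided by the similarities to the `negative` words. The following is a rough sketch of that scoring rule on invented 2-D vectors; the helper names, the toy numbers, and the epsilon value are assumptions, and gensim's implementation may differ in detail.

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosmul_score(candidate, positive, negative, eps=1e-6):
    """Product of shifted cosines to the positive vectors, divided by the
    product of shifted cosines to the negative vectors (plus eps)."""
    shift = lambda c: (1.0 + c) / 2.0  # map cosine from [-1, 1] to [0, 1]
    num = np.prod([shift(cos(candidate, p)) for p in positive])
    den = np.prod([shift(cos(candidate, n)) for n in negative]) + eps
    return float(num / den)

# Hypothetical toy vectors, invented for illustration only.
toy = {
    "man":    np.array([0.1, -1.0]),
    "woman":  np.array([0.1,  1.0]),
    "king":   np.array([1.0, -1.0]),
    "queen":  np.array([1.0,  1.0]),
    "prince": np.array([0.9, -0.8]),
    "apple":  np.array([-1.0, 0.1]),
}

positive = [toy["woman"], toy["king"]]
negative = [toy["man"]]
candidates = ["queen", "prince", "apple"]

scores = {w: cosmul_score(toy[w], positive, negative) for w in candidates}
best = max(scores, key=scores.get)
print(best, scores)
```

Because the similarities are multiplied rather than added, a candidate is rewarded for being close to every positive word and penalized heavily for being close to a negative word.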

In [5]:
model.doesnt_match("breakfast cereal dinner lunch".split())

'cereal'
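
One way to think about `doesnt_match` is that it returns the word whose vector is least similar to the average of the group. The sketch below illustrates that idea with invented 2-D vectors; the `odd_one_out` helper and the numbers are assumptions for illustration, not gensim's code.

```python
import numpy as np

def odd_one_out(words, vectors):
    """Return the word least similar to the mean of the unit-length vectors."""
    unit = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    centre = unit.mean(axis=0)
    centre = centre / np.linalg.norm(centre)
    sims = unit @ centre  # cosine similarity of each word to the group centre
    return words[int(np.argmin(sims))]

# Hypothetical toy vectors: three "meal" words cluster together, one does not.
toy = {
    "breakfast": np.array([0.90, 0.10]),
    "lunch":     np.array([0.80, 0.20]),
    "dinner":    np.array([0.85, 0.15]),
    "cereal":    np.array([0.20, 0.90]),
}

odd = odd_one_out("breakfast cereal dinner lunch".split(), toy)
print(odd)
```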

In [6]:
model.similarity('woman', 'man')

0.76640123

In [7]:
model.similar_by_word("cat")

[('cats', 0.8099379539489746),
 ('dog', 0.760945737361908),
 ('kitten', 0.7464985251426697),
 ('feline', 0.7326234579086304),
 ('beagle', 0.7150582671165466),
 ('puppy', 0.7075453400611877),
 ('pup', 0.6934291124343872),
 ('pet', 0.6891531348228455),
 ('felines', 0.6755931973457336),
 ('chihuahua', 0.6709762215614319)]

### Computing Average Embeddings

```
"We first compute the average embedding distance between words that represent women—e.g., she, female—and words for occupations—e.g., teacher, lawyer."
```

In [34]:
# Calculate mean of vectors among all words of each category.

import numpy as np

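# NOTE: "femen" is probably a typo for "females" (compare "males" in man_words). It is kept
# unchanged so that the outputs recorded below still correspond to this list; words missing
# from the model are skipped by the `if word in model` filters further down.
woman_words = ["she", "daughter", "hers", "her", "mother", "woman", "girl", "herself", "female", "sister", "daughters", "mothers", "women",
"girls", "femen", "sisters", "aunt", "aunts", "niece", "nieces"]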

man_words = ["he", "son", "his", "him", "father", "man", "boy", "himself", "male", "brother", "sons", "fathers", "men", "boys", "males", "brothers", "uncle",
"uncles", "nephew", "nephews"]

occupations = ["janitor", "statistician", "midwife", "bailiff", "auctioneer", "photographer", "geologist", "shoemaker", "athlete", "cashier",
"dancer", "housekeeper", "accountant", "physicist", "gardener", "dentist", "weaver", "blacksmith", "psychologist", "supervisor",
"mathematician", "surveyor", "tailor", "designer", "economist", "mechanic", "laborer", "postmaster", "broker", "chemist", "librarian", "attendant", "clerical", "musician", "porter", "scientist", "carpenter", "sailor", "instructor", "sheriff", "pilot", "inspector", "mason",
"baker", "administrator", "architect", "collector", "operator", "surgeon", "driver", "painter", "conductor", "nurse", "cook", "engineer",
"retired", "sales", "lawyer", "clergy", "physician", "farmer", "clerk", "manager", "guard", "artist", "smith", "official", "police", "doctor",
"professor", "student", "judge", "teacher", "author", "secretary", "soldier"]

prof_occupations = ["statistician", "auctioneer", "photographer", "geologist", "accountant", "physicist", "dentist", "psychologist", "supervisor", "mathematician", "designer", "economist", "postmaster", "broker", "chemist", "librarian", "scientist", "instructor",
"pilot", "administrator", "architect", "surgeon", "nurse", "engineer", "lawyer", "physician", "manager", "official", "doctor", "professor",
"student", "judge", "teacher", "author"]

average_woman_words = np.mean(np.array([model[word] for word in woman_words if word in model]), axis = 0)
average_man_words = np.mean(np.array([model[word] for word in man_words if word in model]), axis = 0)
average_occupations = np.mean(np.array([model[word] for word in occupations if word in model]), axis = 0)
average_prof_occupations = np.mean(np.array([model[word] for word in prof_occupations if word in model]), axis = 0)

In [35]:
print(len(average_woman_words))
print(len(average_man_words))
print(len(average_occupations))
print(len(average_prof_occupations))

300
300
300
300


In [78]:
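# Compare the occupation vectors with the 'man' and 'woman' word vectors.
def cossim(v1, v2, signed = True):
    # Cosine similarity: 1 means same direction, 0 orthogonal, -1 opposite.
    c = np.dot(v1, v2)/np.linalg.norm(v1)/np.linalg.norm(v2)
    if not signed:
        return abs(c)
    return c

def calc_distance_between_vectors(vec1, vec2, distype = ''):
    # distype='norm' returns the Euclidean distance (smaller = closer).
    # Any other value, including the default, returns the cosine similarity
    # (larger = closer), which is a similarity rather than a distance.
    if distype == 'norm':
        return np.linalg.norm(np.subtract(vec1, vec2))
    else:
        return cossim(vec1, vec2)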

# With the default metric these values are cosine similarities (larger = closer).
occupations_to_woman = calc_distance_between_vectors(average_occupations, average_woman_words)
occupations_to_man = calc_distance_between_vectors(average_occupations, average_man_words)
print("Occupation similarities (women, men):", occupations_to_woman, occupations_to_man)

prof_occupations_to_woman = calc_distance_between_vectors(average_prof_occupations, average_woman_words)
prof_occupations_to_man = calc_distance_between_vectors(average_prof_occupations, average_man_words)
print("Professional occupation similarities (women, men):", prof_occupations_to_woman, prof_occupations_to_man)

Occupation similarities (women, men): 0.41305923 0.44160503
Professional occupation similarities (women, men): 0.30585325 0.32154357


```
"A natural metric for the embedding bias is the average distance for women minus the average distance for men. If this value is negative, then the embedding more closely associates the occupations with men."
```
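
How the sign of this metric should be read depends on whether the underlying measure is a similarity (larger means closer, as with the cosine similarity this notebook uses by default) or a true distance (smaller means closer). The sketch below uses invented 2-D vectors in which the occupation vector is deliberately placed near the "men" vector; all names and numbers are illustrative assumptions.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))

# Hypothetical toy vectors: the occupation sits close to the "men" average.
occupation = np.array([1.0, 0.2])
avg_men = np.array([0.9, 0.3])
avg_women = np.array([0.3, 0.9])

# "Women minus men" computed with a similarity and with a distance.
sim_bias = cosine_similarity(occupation, avg_women) - cosine_similarity(occupation, avg_men)
dist_bias = euclidean_distance(occupation, avg_women) - euclidean_distance(occupation, avg_men)

print("similarity-based bias:", round(sim_bias, 3))  # negative: less similar to women
print("distance-based bias:  ", round(dist_bias, 3))  # positive: farther from women
```

Both numbers describe the same situation (the occupation is closer to the "men" vector), but with opposite signs, so it is worth stating which measure a reported bias value is based on.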

In [76]:
embedding_bias = occupations_to_woman - occupations_to_man
print(embedding_bias)

-0.028545797


In [77]:
import pandas as pd

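output = []

# For each occupation, compare its similarity to the average 'woman' vector
# with its similarity to the average 'man' vector.
for occupation in occupations:
    if occupation not in model:
        continue  # skip out-of-vocabulary words, as in the averages above
    man = calc_distance_between_vectors(model[occupation], average_man_words)
    woman = calc_distance_between_vectors(model[occupation], average_woman_words)
    # Positive: more similar to the 'woman' words; negative: more similar to the 'man' words.
    output.append([occupation, woman - man])

pd.set_option('display.max_rows', 1000)
display(pd.DataFrame(output, columns=["Occupation", "Woman Bias"]))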

|   | Occupation | Woman Bias |
|---|---|---|
| 0 | janitor | -0.072116 |
| 1 | statistician | -0.058891 |
| 2 | midwife | 0.207095 |
| 3 | bailiff | -0.03631 |
| 4 | auctioneer | -0.032615 |
| 5 | photographer | -0.018147 |
| 6 | geologist | -0.083728 |
| 7 | shoemaker | -0.093407 |
| 8 | athlete | -0.007814 |
| 9 | cashier | 0.082338 |


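In [39]:
# NOTE: `doctor_man` is not defined in any cell shown in this notebook, so this cell
# raises a NameError on a fresh run; the value below presumably comes from an earlier session.
doctor_man

2.8498664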

In [72]:
model.similarity("janitor", "man")

0.40057373

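In [98]:
import pickle

# Load a second set of embeddings: the 1980 vectors from the COHA directory.
# NOTE: this rebinds `model`, so the Google News vectors loaded earlier are
# no longer reachable through that name after this cell runs.

# Embedding matrix, one row per vocabulary entry.
with open("genre-balanced-american-enlish_coha/1980-w.npy", "rb") as f:
    vectors = np.lib.format.read_array(f)

# Vocabulary: the words corresponding to the rows of `vectors`.
with open("genre-balanced-american-enlish_coha/1980-vocab.pkl", "rb") as f:
    vocab = pickle.load(f)

# Wrap the raw arrays in a KeyedVectors object with the matching vector size.
model = KeyedVectors(vectors.shape[1])
model.add_vectors(vocab, vectors)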
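
The cell above relies on the vocabulary and the matrix being parallel: entry `i` of the vocabulary names row `i` of the matrix. A minimal sketch of that layout, using small invented stand-ins rather than the real files (the words, numbers, and the `lookup` helper are assumptions for illustration):

```python
import numpy as np

# Hypothetical stand-ins for the contents of a vocabulary file and a matrix file.
vocab = ["human", "woman", "hello"]
vectors = np.array([
    [0.1, 0.3, 0.5],
    [0.2, 0.1, 0.4],
    [0.9, 0.0, 0.1],
])

# Row i of the matrix is assumed to hold the vector for vocab[i].
assert len(vocab) == vectors.shape[0]
word_to_row = {word: row for row, word in enumerate(vocab)}

def lookup(word):
    """Return the embedding row that belongs to `word`."""
    return vectors[word_to_row[word]]

print("vector size:", vectors.shape[1])
print("woman ->", lookup("woman"))
```

Checking that the two lengths agree before building the model is a cheap way to catch a mismatched vocabulary file.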

In [102]:
model.similarity('human', 'woman')

0.0791897

In [103]:
model.similarity('hello', 'goodbye')

0.3940503

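In [104]:
# `wget` is a third-party package, not part of the standard library;
# install it first (for example with `pip install wget`).
import wget

ModuleNotFoundError: No module named 'wget'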