```
Last modified: 2021/09/26, @haewoon 
```


# Lab: Quantifying Gender Stereotypes in Word Embeddings

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/haewoon/lab-bias-in-word-embeddings/blob/master/Lab%20-%20Gender%20Stereotypes%20in%20Word%20Embeddings.ipynb)

## Step 0: Download the embedding data

The data (word embeddings and occupation list) are taken from https://github.com/tolga-b/debiaswe

In [1]:
!gdown --id 1AAM0XAdeXkAcZxLVasbpUS0JFxiT2Dgh

Downloading...
From: https://drive.google.com/uc?id=1AAM0XAdeXkAcZxLVasbpUS0JFxiT2Dgh
To: /Users/haewoon/Google Drive/SMU/Teaching/2021-22 Term 1/IS457 Fairness in sociotechnical systems/Lab/word embedding/w2v_gnews_small.zip
29.1MB [00:00, 64.6MB/s]


In [2]:
import os
os.makedirs('embeddings', exist_ok=True)

In [3]:
!unzip -o w2v_gnews_small.zip

Archive:  w2v_gnews_small.zip
  inflating: w2v_gnews_small.txt     


## Step 1: Load data


### Word embedding
The full Google News embedding (https://code.google.com/archive/p/word2vec/) is too large to load here, so we use a small word embedding that contains only the words required for this lab (the file downloaded in Step 0).

In [4]:
import numpy as np

from we import WordEmbedding

# load google news word2vec
E = WordEmbedding('w2v_gnews_small.txt')

(26423, 300)
26423 words of dimension 300 : in, for, that, is, ..., Jay, Leroy, Brad, Jermaine
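
To make the loading step less of a black box, here is a minimal sketch of how a text-format embedding could be parsed. It assumes each line has the form `word v1 v2 ... vd` separated by single spaces, and it normalizes every vector to unit length; both are assumptions for illustration, not a description of what `WordEmbedding` actually does. The function name `load_text_embedding` and the toy data are hypothetical.

```python
import io
import numpy as np

def load_text_embedding(lines):
    """Parse 'word v1 v2 ... vd' lines into (words, unit-normalized matrix).

    Assumed file layout; the real w2v_gnews_small.txt may differ.
    """
    words, rows = [], []
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        words.append(parts[0])
        rows.append([float(x) for x in parts[1:]])
    vecs = np.array(rows, dtype=np.float32)
    # Scale each row to unit length so dot products equal cosine similarities.
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    return words, vecs

# Toy 3-dimensional data standing in for the real 300-dimensional file.
toy_file = io.StringIO("he 3.0 0.0 4.0\nshe 0.0 5.0 0.0\n")
toy_words, toy_vecs = load_text_embedding(toy_file)
print(toy_words, toy_vecs.shape)
```

With the real file, one would pass an open file handle instead of the `io.StringIO` object.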


### Occupations

Load a list of 320 occupations

In [5]:
occupations = []
with open('occupations.txt') as fi:
    for line in fi:
        occupations.append(line.strip())
occupations[:5]

['accountant', 'acquaintance', 'actor', 'actress', 'adjunct_professor']

In [6]:
len(occupations)

320

## Step 2: Define a gender axis vector (= *v(she)* - *v(he)*)

The gender axis is defined as the difference between the vector of `she` and the vector of `he`. You can also represent each gender with several related nouns (e.g., man, mankind, son, and male for the male side) by averaging their vectors.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`gender_axis`<br/>
`he` ------------------------> `she`
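
The averaging variant mentioned above can be sketched as follows. This is an illustration on toy 3-dimensional vectors, not part of the lab code: the helper `averaged_axis`, the dictionary `toy`, and the word lists are all hypothetical, and the real vectors would come from `E.v(word)`.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def averaged_axis(vectors, female_words, male_words):
    """Gender axis from group means: mean(female) - mean(male), normalized."""
    female_mean = np.mean([vectors[w] for w in female_words], axis=0)
    male_mean = np.mean([vectors[w] for w in male_words], axis=0)
    return unit(female_mean - male_mean)

# Toy vectors; values are made up for illustration only.
toy = {
    "she": np.array([0.0, 1.0, 0.2]),
    "woman": np.array([0.1, 0.9, 0.1]),
    "he": np.array([1.0, 0.0, 0.2]),
    "man": np.array([0.9, 0.1, 0.1]),
}
axis = averaged_axis(toy, ["she", "woman"], ["he", "man"])
print(axis)
```

Averaging over several words should make the axis less sensitive to the quirks of any single word pair, at the cost of having to choose the word lists.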

### *v(she)*: Vector of `she` in word embeddings

In [7]:
len(E.v('she'))

300

In [8]:
E.v('she')[:5] # presentation purpose. 5 out of 300 dimensions

array([ 0.0404959, -0.0145427, -0.0561573, -0.0177869,  0.0586184],
      dtype=float32)

### *v(he)*: Vector of `he` in word embeddings

In [9]:
len(E.v('he'))

300

In [10]:
E.v('he')[:5] # presentation purpose. 5 out of 300 dimensions

array([ 0.109257 ,  0.0726531, -0.0108841, -0.0166381,  0.0176087],
      dtype=float32)

### *v_gender* = *v(she)* - *v(he)*: Gender axis vector

In [11]:
v_gender = E.diff('she', 'he') # normalization is included
np.linalg.norm(v_gender)

1.0
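
The norm of 1.0 follows from the normalization noted in the comment. As a rough sketch of the two-word case, assuming `E.diff` subtracts the two word vectors and divides by the length of the result (an assumption about the helper, not a confirmed description), the computation on toy vectors looks like this. The name `normalized_diff` and the toy values are hypothetical.

```python
import numpy as np

def normalized_diff(v_a, v_b):
    """Return (v_a - v_b) scaled to unit length."""
    d = v_a - v_b
    return d / np.linalg.norm(d)

# Toy 3-dimensional stand-ins for v(she) and v(he).
v_she_toy = np.array([0.0, 3.0, 1.0])
v_he_toy = np.array([4.0, 0.0, 1.0])

axis_toy = normalized_diff(v_she_toy, v_he_toy)
print(axis_toy, np.linalg.norm(axis_toy))
```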

In [12]:
len(v_gender)

300

In [13]:
v_gender[:5] # presentation purpose. 5 out of 300 dimensions

array([-0.07815731, -0.09911112, -0.05145979, -0.00130578,  0.04661368],
      dtype=float32)

## Step 3: Analyzing gender bias in word embeddings with regard to occupations

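We will compute the cosine similarity between the vector of each occupation and *v_gender*. Since *v_gender* has unit length, and assuming the loaded word vectors are unit-normalized as well, the dot product of the two vectors equals their cosine similarity.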

- similarity(v_gender, occupation) > 0 (aligned with the gender axis): the occupation is closer to `she`.
- similarity(v_gender, occupation) < 0 (opposite to the gender axis): the occupation is closer to `he`.
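
The sign rule above can be checked on toy vectors. The sketch below uses the general cosine similarity formula, which divides by both vector lengths and so also works for vectors that are not unit-normalized. The function `cosine_similarity` and the vectors `occ_a` and `occ_b` are hypothetical and only illustrate the interpretation.

```python
import numpy as np

def cosine_similarity(u, v):
    """cos(u, v) = u.v / (|u| |v|)"""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy gender axis pointing from "he" toward "she".
axis = np.array([-1.0, 1.0, 0.0])

# Two made-up occupation vectors.
occ_a = np.array([0.2, 0.9, 0.3])  # leans toward the "she" end
occ_b = np.array([0.8, 0.1, 0.3])  # leans toward the "he" end

sim_a = cosine_similarity(axis, occ_a)
sim_b = cosine_similarity(axis, occ_b)
print(sim_a > 0, sim_b < 0)

# For unit-length vectors, the plain dot product gives the same value.
unit_axis = axis / np.linalg.norm(axis)
unit_a = occ_a / np.linalg.norm(occ_a)
print(np.isclose(np.dot(unit_axis, unit_a), sim_a))
```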

In [14]:
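import operator

# Dot product of each occupation vector with the unit-length gender axis
similarities = []
for occupation in occupations:
    similarities.append((occupation, E.v(occupation).dot(v_gender)))

# Sort ascending: most `he`-aligned first, most `she`-aligned last
similarities = sorted(similarities, key=operator.itemgetter(1))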

#### 20 occupations closest to `she` in the word embeddings

In [15]:
similarities[-20:]

[('interior_designer', 0.19714224),
 ('housekeeper', 0.20833439),
 ('stylist', 0.21560375),
 ('bookkeeper', 0.2236317),
 ('maid', 0.23776126),
 ('nun', 0.24125955),
 ('nanny', 0.24782579),
 ('hairdresser', 0.24929334),
 ('paralegal', 0.24946158),
 ('ballerina', 0.25276464),
 ('socialite', 0.25718823),
 ('librarian', 0.26647124),
 ('receptionist', 0.27317622),
 ('waitress', 0.27540293),
 ('nurse', 0.28085968),
 ('registered_nurse', 0.3042623),
 ('homemaker', 0.3043797),
 ('housewife', 0.3403659),
 ('actress', 0.3523514),
 ('businesswoman', 0.35965404)]

#### 20 occupations closest to `he` in the word embeddings

In [16]:
similarities[:20]

[('maestro', -0.23798442),
 ('statesman', -0.21665451),
 ('skipper', -0.20758669),
 ('protege', -0.20267202),
 ('businessman', -0.2020676),
 ('sportsman', -0.19492392),
 ('philosopher', -0.18836352),
 ('marksman', -0.1807366),
 ('captain', -0.1728986),
 ('architect', -0.16785555),
 ('financier', -0.16702037),
 ('warrior', -0.16313636),
 ('major_leaguer', -0.15280862),
 ('trumpeter', -0.15001445),
 ('broadcaster', -0.14718868),
 ('magician', -0.14637242),
 ('fighter_pilot', -0.14401694),
 ('boss', -0.13782285),
 ('industrialist', -0.137182),
 ('pundit', -0.13684885)]