# Word Projections

This notebook loads pretrained word embeddings and projects a few words onto a gender axis.

Since we have several embeddings, we will configure the one we want to work with here:

In [1]:
emb_name = 'glove.6B.50d'

## Setup

We need Pandas, NumPy, Xarray, Seaborn, and Matplotlib:

In [2]:
import pandas as pd
import numpy as np
import xarray as xa
import seaborn as sns
import matplotlib.pyplot as plt

## Load the Data

Let's load the data by opening an `xarray` data array:

In [3]:
vectors = xa.open_dataarray(f'data/{emb_name}.netcdf')
vectors

<xarray.DataArray 'glove.6B.50d.txt' (word: 399998, dim: 50)>
[19999900 values with dtype=float64]
Coordinates:
  * word     (word) object 'the' ',' '.' ... 'rolonda' 'zsombor' 'sandberger'
  * dim      (dim) int32 1 2 3 4 5 6 7 8 9 10 ... 41 42 43 44 45 46 47 48 49 50

## Accessing Data

We can peek at a few words:

In [4]:
vectors.loc['apple', :]

<xarray.DataArray 'glove.6B.50d.txt' (dim: 50)>
array([ 0.098334, -0.157094,  0.094402,  0.243615,  0.021748,  0.010869,
       -0.259865, -0.183874,  0.034665,  0.090077, -0.028554,  0.067138,
        0.048961, -0.147112,  0.098597,  0.09012 , -0.269275,  0.16212 ,
        0.113033, -0.206014,  0.063439, -0.115054,  0.078872,  0.040755,
       -0.014015, -0.110008, -0.085066,  0.0326  ,  0.031079, -0.072582,
        0.439936, -0.125997, -0.109934,  0.140559,  0.017953, -0.090442,
       -0.159836,  0.073132,  0.044768, -0.293309,  0.122444, -0.031217,
       -0.278118, -0.030655,  0.150891,  0.184022,  0.075632, -0.041403,
       -0.058458,  0.050225])
Coordinates:
    word     <U5 'apple'
  * dim      (dim) int32 1 2 3 4 5 6 7 8 9 10 ... 41 42 43 44 45 46 47 48 49 50

In [5]:
vectors.loc['cookie', :]

<xarray.DataArray 'glove.6B.50d.txt' (dim: 50)>
array([-0.015997, -0.043809,  0.117472,  0.011932,  0.265269,  0.140599,
       -0.135699, -0.079635,  0.057   , -0.130098, -0.138527,  0.042825,
        0.051595,  0.278589, -0.020974,  0.002277, -0.007708, -0.005896,
        0.055136, -0.177196,  0.142576, -0.070822,  0.216368,  0.090189,
       -0.026923,  0.015462, -0.320445,  0.085986,  0.279965, -0.09743 ,
        0.189647, -0.089038, -0.187425,  0.358007, -0.02136 ,  0.099679,
       -0.139147,  0.216743,  0.217306, -0.08934 ,  0.247822, -0.011722,
       -0.039699,  0.004912,  0.039757,  0.136016,  0.080567, -0.004041,
        0.061379,  0.026404])
Coordinates:
    word     <U6 'cookie'
  * dim      (dim) int32 1 2 3 4 5 6 7 8 9 10 ... 41 42 43 44 45 46 47 48 49 50

How similar are 'apple' and 'cookie'? We use the dot product of their vectors as a similarity score:

In [6]:
np.dot(vectors.loc['apple', :], vectors.loc['cookie', :])

0.4664089360109994

How about 'apple' and 'orange'?

In [7]:
np.dot(vectors.loc['apple', :], vectors.loc['orange', :])

0.5388040721946523
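The dot product matches cosine similarity only when both vectors have unit length, and this notebook does not state whether the stored vectors were normalized. As a minimal sketch, the hypothetical helper `cosine_similarity` below divides by both lengths explicitly. It uses toy vectors, not real GloVe values.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors (not taken from the embedding file).
u = np.array([3.0, 4.0])
v = np.array([6.0, 8.0])  # same direction as u, twice as long

raw = float(np.dot(u, v))       # grows with vector length
cos = cosine_similarity(u, v)   # depends on direction only
print(raw, cos)
```

With the real data, the same helper could presumably be called on `vectors.loc['apple', :].values` and `vectors.loc['orange', :].values`.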

## Projecting Words

The paper "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings" (Bolukbasi et al., 2016) projected words onto gender axes to demonstrate the existence of gender bias in word embeddings.

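Let's make an axis from 'she' to 'he' that represents gender. With this orientation, positive projections lean toward 'he' and negative ones toward 'she' (there are more sophisticated ways to build such an axis, but this will be fine to start):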

In [8]:
heshe = vectors.loc['he', :] - vectors.loc['she', :]
heshe

<xarray.DataArray 'glove.6B.50d.txt' (dim: 50)>
array([-0.04743967, -0.07802382,  0.02016525, -0.02658374,  0.00182033,
       -0.11334677, -0.02821977,  0.00171081, -0.13525842,  0.00737956,
        0.00845505,  0.04405043, -0.10670176, -0.04520413, -0.0479865 ,
       -0.02838407,  0.07714101, -0.01369819, -0.12269473,  0.03636529,
       -0.04882438, -0.12152056, -0.00342366, -0.07590407, -0.11048574,
       -0.03687019,  0.07559901, -0.03808821, -0.01600614,  0.08881611,
        0.07007162, -0.04000146, -0.048872  , -0.03575586,  0.04086934,
        0.0269937 ,  0.02477164,  0.03129705, -0.00123182,  0.10933865,
       -0.02301721,  0.03052752, -0.11418023,  0.09632625, -0.06925334,
        0.0583385 , -0.07041178,  0.16175676,  0.00313913, -0.07725143])
Coordinates:
  * dim      (dim) int32 1 2 3 4 5 6 7 8 9 10 ... 41 42 43 44 45 46 47 48 49 50
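One slightly more robust variant, sketched here as an assumption rather than as the method of the paper or of this notebook, averages the normalized difference vectors of several word pairs. The `toy` dictionary and its values are made up for illustration and stand in for lookups such as `vectors.loc['he', :].values`.

```python
import numpy as np

# Made-up 3-d vectors standing in for real embeddings.
toy = {
    'he':    np.array([1.0, 0.0, 0.5]),
    'she':   np.array([-1.0, 0.0, 0.5]),
    'man':   np.array([0.5, 1.0, 0.0]),
    'woman': np.array([-0.5, 1.0, 0.0]),
}
pairs = [('he', 'she'), ('man', 'woman')]

# Normalize each difference so that no single pair dominates the average.
diffs = []
for masc, fem in pairs:
    d = toy[masc] - toy[fem]
    diffs.append(d / np.linalg.norm(d))

# Average the unit differences, then rescale the result to unit length.
avg_axis = np.mean(diffs, axis=0)
avg_axis = avg_axis / np.linalg.norm(avg_axis)
print(avg_axis)
```

The rest of this notebook keeps the single-pair `heshe` axis defined above.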

We can project words onto this axis with a dot product:

In [9]:
np.dot(vectors.loc['mother', :], heshe)

-0.1823949898060081

In [10]:
np.dot(vectors.loc['father', :], heshe)

0.007399722274729046

In [11]:
np.dot(vectors.loc['apple', :], heshe)

-0.034228354444705344
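These raw dot products scale with the length of the axis vector. If projections need to be comparable across differently built axes, one option is the scalar projection, which divides by the axis length. The helper name `scalar_projection` and the toy values below are illustrative assumptions.

```python
import numpy as np

def scalar_projection(vec, axis):
    """Length of vec's component along axis (hypothetical helper)."""
    vec = np.asarray(vec, dtype=float)
    axis = np.asarray(axis, dtype=float)
    return float(np.dot(vec, axis) / np.linalg.norm(axis))

# Toy values, not real embeddings.
toy_axis = np.array([0.0, 2.0])
toy_word = np.array([3.0, 1.0])

raw_dot = float(np.dot(toy_word, toy_axis))      # scales with the axis length
scaled = scalar_projection(toy_word, toy_axis)   # independent of the axis length
print(raw_dot, scaled)
```

The sign is the same in both versions, so the direction of the lean (toward 'he' or 'she') does not change; only the magnitude does.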

Let's project several words:

In [12]:
words = ['person', 'mother', 'father', 'programmer', 'chef', 'botanist']

In [16]:
(vectors.loc[words, :] * heshe).sum(axis=1).to_pandas()

word
person       -0.055801
mother       -0.182395
father        0.007400
programmer    0.023561
chef         -0.077140
botanist      0.084533
dtype: float64
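The multiply-and-sum above is a matrix-vector product written out by hand. The sketch below checks that equivalence on toy data and then ranks the words by their projection. The labels `w1` to `w3` and all numbers are hypothetical.

```python
import numpy as np

labels = ['w1', 'w2', 'w3']           # hypothetical word labels
matrix = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [-1.0, 2.0]])      # one toy vector per row
toy_axis = np.array([1.0, -1.0])

by_sum = (matrix * toy_axis).sum(axis=1)  # same pattern as the cell above
by_matmul = matrix @ toy_axis             # one dot product per row

# Rank from most negative to most positive projection.
order = np.argsort(by_matmul)
ranked = [(labels[i], float(by_matmul[i])) for i in order]
print(ranked)
```

On the real data, sorting the resulting Pandas series with `.sort_values()` should give the same kind of ranking.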