In [1]:
!pip install gensim


In [2]:
import numpy as np
import gensim.downloader as api

In [3]:
dataset = api.load('text8')

In [4]:
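from gensim.models import word2vec
# Text8Corpus streams sentences from a local file named 'text8'; this reassignment replaces the dataset loaded above
dataset = word2vec.Text8Corpus('text8')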

In [5]:
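# Seeds NumPy's global RNG only; gensim's Word2Vec is seeded through its own `seed` argument (default 1)
np.random.seed(1)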

In [6]:
model = word2vec.Word2Vec(dataset)

Training may take a minute or two, or less, depending on your system. Once it completes, the model holds our trained word vectors and gives us access to several handy utilities for working with them. Let's access the word vector (embedding) for a term:

In [7]:
print(model.wv['animal'])

[-0.20193794 -0.51874137 -0.91420144 -1.6580372   1.3939288   1.7950013
 -2.5248644  -2.4950335  -1.2926555  -0.63928956 -1.2239358  -1.8010147
  0.24652585 -0.3728686  -1.1039593  -0.56072396  1.983839    0.37115866
  0.93721384 -0.02463511  0.7784846   0.68078744 -1.4710312  -1.0266024
  0.66701883 -3.7085626   2.061994    0.62728363 -1.1804457  -1.5666814
  2.7032795   0.43801767  0.42981827  1.041845    1.8631765  -2.031059
  0.38920856 -2.2387776   1.1209686  -0.9926629  -1.6059679  -0.4531603
 -0.38786557  0.20408194 -0.6060187  -1.3537091   0.8878299  -0.7379972
 -0.7410342   0.99285614  0.99263996  2.0742967  -0.58120114  1.7768904
  1.1369077  -1.2211988   0.38107762  1.0072191  -2.1647868   2.1298537
  0.8601733  -0.40261608 -1.7906673   2.3166728  -1.4726409  -0.86715883
  1.0949367  -2.3150716  -0.6426731  -0.07227715 -1.5966916   1.8781711
  0.3385172   2.13873    -0.2654786   0.14419995  0.1303719   1.3711656
 -1.2407774  -0.44129905  0.21813473 -1.2543632   0.43601993  0

The output is a series of numbers: the vector for the term.
Let's find the length of the vector:

In [8]:
len(model.wv['animal'])

100

The representation for each term is now a vector of length 100 (the length is a hyperparameter we can change; we used the default setting to get started). The vector for any term can be accessed as we did previously. Among the other handy utilities is the most_similar() method, which helps us find the terms that are the most similar to a target term. Let's see it in action:

In [9]:
model.wv.most_similar('animal')

[('animals', 0.7503687143325806),
 ('insect', 0.7306487560272217),
 ('humans', 0.6731089353561401),
 ('ants', 0.6644068360328674),
 ('organism', 0.6493010520935059),
 ('aquatic', 0.6490638256072998),
 ('feces', 0.6486082077026367),
 ('insects', 0.6470466256141663),
 ('mammal', 0.645098865032196),
 ('bees', 0.637350857257843)]

The output is a list of tuples, with each tuple containing the term and its similarity score with the term "animal".
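To make the similarity scores less of a black box, here is a minimal sketch of the idea behind this kind of lookup: rank every other term by its cosine similarity to the target vector. The three-dimensional vectors and the helper names `cosine` and `toy_most_similar` are made up for illustration; this is not gensim's implementation.

```python
import numpy as np

# Toy 3-dimensional embeddings; the words and values are invented for illustration.
toy_vectors = {
    "animal": np.array([0.9, 0.1, 0.0]),
    "animals": np.array([0.8, 0.2, 0.1]),
    "insect": np.array([0.7, 0.0, 0.3]),
    "guitar": np.array([0.0, 0.9, 0.4]),
}


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def toy_most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`, highest first."""
    target = vectors[word]
    scores = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word  # the query term itself is left out of the ranking
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


ranking = toy_most_similar("animal", toy_vectors)
for term, score in ranking:
    print(f"{term}: {score:.3f}")
```

As in the real output, each entry pairs a term with its score, and the list is sorted from most to least similar.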

Let's see which terms the model has learned to be most related to "happiness":

In [10]:
model.wv.most_similar('happiness')

[('humanity', 0.8003098368644714),
 ('pleasure', 0.7515838742256165),
 ('perfection', 0.7506499886512756),
 ('mankind', 0.7404095530509949),
 ('compassion', 0.7346247434616089),
 ('desires', 0.7338270545005798),
 ('goodness', 0.731827437877655),
 ('salvation', 0.7279468178749084),
 ('dignity', 0.7221781611442566),
 ('striving', 0.7101019024848938)]

#### Semantic Regularities in Word Embeddings

We'll use the most_similar() method here, which allows us to add and subtract vectors from each other. We'll provide 'king' and 'woman' as vectors to add to each other, use 'man' to subtract from the result, and then check out the five terms that are the most similar to the resulting vector:

In [11]:
model.wv.most_similar(
    positive=['woman', 'king'],
    negative=['man'],
    topn=5
)

[('queen', 0.6848368644714355),
 ('prince', 0.6343647241592407),
 ('princess', 0.6249984502792358),
 ('throne', 0.6246045827865601),
 ('emperor', 0.6148173809051514)]

The top result is 'queen'. It looks like the model is capturing these regularities. Let's try another example: "man" is to "uncle" as "woman" is to what? In arithmetic form, which term has the vector closest to uncle - man + woman?

In [12]:
model.wv.most_similar(
    positive=['uncle', 'woman'],
    negative=['man'],
    topn=5
)

[('aunt', 0.819966197013855),
 ('grandmother', 0.8063263893127441),
 ('wife', 0.797320544719696),
 ('widow', 0.787786066532135),
 ('daughter', 0.777454674243927)]

This seems to be working well. Notice that all five top results are feminine terms. In effect, we took uncle, removed the masculine component, added a feminine component, and ended up with sensible results.
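The arithmetic behind these analogy queries can be sketched on toy data. The version below assumes the positive and negative vectors are unit-normalized before being combined and that the query terms are excluded from the results; the two-dimensional vectors and the helper `toy_analogy` are hypothetical, and gensim's own implementation may differ in its details.

```python
import numpy as np

# Hypothetical 2-d embeddings: axis 0 loosely means "royalty", axis 1 "gender".
toy_vectors = {
    "king": np.array([0.9, 0.8]),
    "queen": np.array([0.9, -0.8]),
    "man": np.array([0.1, 0.8]),
    "woman": np.array([0.1, -0.8]),
    "prince": np.array([0.7, 0.6]),
}


def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)


def toy_analogy(positive, negative, vectors, topn=1):
    """Add the positive vectors, subtract the negative ones, rank by cosine."""
    pos = np.sum([unit(vectors[w]) for w in positive], axis=0)
    neg = np.sum([unit(vectors[w]) for w in negative], axis=0)
    query = unit(pos - neg)
    exclude = set(positive) | set(negative)  # never return the query terms
    scores = [
        (w, float(np.dot(query, unit(v))))
        for w, v in vectors.items()
        if w not in exclude
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


result = toy_analogy(
    positive=["woman", "king"], negative=["man"], vectors=toy_vectors, topn=2
)
print(result)
```

On this toy data, subtracting "man" and adding "woman" flips the sign of the gender axis while leaving the royalty axis largely intact, which is why "queen" ranks first.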

#### Exercise 4.05 - Vectors for Phrases
In this exercise, we will create vectors for two different phrases, "get happy" and "make merry", by taking the average of the individual word vectors. We will then compute the similarity between the two phrase representations.

In [13]:
# Extract the vector for the term "get" and store it in a variable
v1 = model.wv['get']

In [14]:
# Extract the vector for the term "happy" and store it in a variable
v2 = model.wv['happy']

In [15]:
# Create a vector as the element-wise average of the two vectors, (v1 + v2)/2. This is our vector for the entire phrase "get happy"
res1 = (v1 + v2)/2

In [16]:
# Extract the vectors for the terms "make" and "merry"
v1, v2 = model.wv['make'], model.wv['merry']

In [17]:
# Average the two vectors to get the vector for the phrase "make merry"
res2 = (v1+v2)/2

In [18]:
# Using the cosine_similarities() method in the model, find the cosine similarity between the two
model.wv.cosine_similarities(res1, [res2])

array([0.5528542], dtype=float32)

The result is a cosine similarity of about 0.55, which is well above 0. This suggests that the model treats the phrases "get happy" and "make merry" as similar in meaning.
In this exercise, we saw how we could use vector arithmetic to represent phrases, instead of individual terms, and we saw that meaning is still captured. This brings us to a very important lesson: vector arithmetic on word embeddings has meaning.
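The two steps of the exercise, averaging the word vectors and comparing the averages, can be written out in plain NumPy. This is a minimal sketch on invented three-dimensional vectors; `phrase_vector` and `cosine` are illustrative helper names, not part of gensim.

```python
import numpy as np

# Invented toy embeddings; real word2vec vectors would have many more dimensions.
toy_vectors = {
    "get": np.array([0.2, 0.1, 0.7]),
    "happy": np.array([0.9, 0.1, 0.1]),
    "make": np.array([0.3, 0.2, 0.6]),
    "merry": np.array([0.8, 0.2, 0.0]),
}


def phrase_vector(words, vectors):
    """Element-wise mean of the vectors of the words in a phrase."""
    return np.mean([vectors[w] for w in words], axis=0)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


get_happy = phrase_vector(["get", "happy"], toy_vectors)
make_merry = phrase_vector(["make", "merry"], toy_vectors)
similarity = cosine(get_happy, make_merry)
print(get_happy, make_merry, round(similarity, 3))
```

Averaging is only one simple way to combine word vectors; it ignores word order, so "get happy" and "happy get" would receive the same representation.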

#### Effect of Parameters

In [19]:
# Let's retrain the word embeddings, this time with vector_size set to 30
model = word2vec.Word2Vec(
    dataset,
    vector_size=30
)

In [20]:
# Now, let's check the analogy task from earlier, that is, king - man + woman
model.wv.most_similar(
    positive=['woman', 'king'],
    negative=['man'],
    topn=5
)

[('empress', 0.8121461272239685),
 ('son', 0.8082416653633118),
 ('judah', 0.8058237433433533),
 ('emperor', 0.8039002418518066),
 ('prince', 0.8004615306854248)]

We can see that queen isn't present in the top five results. It looks like by using a very low dimensionality, we aren't capturing enough information in the representation for a term.

#### Skip-gram versus CBOW

In [21]:
# Find the terms most similar to 'oeuvre' (the body of work of an artist or performer), using the default vector_size
model = word2vec.Word2Vec(dataset)
model.wv.most_similar(
    'oeuvre',
    topn=5
)

[('nieve', 0.7358265519142151),
 ('whimsical', 0.7154113054275513),
 ('diaghilev', 0.7142093181610107),
 ('nstlerroman', 0.7102295756340027),
 ('demian', 0.7060530781745911)]

We can see that the results are mostly proper names and loosely related terms (for example, diaghilev, demian, and whimsical). None of the top five results is close in meaning to the target term. Now, let's retrain our vectors using the Skip-gram method and see the result on the same task:

In [22]:
model_sg = word2vec.Word2Vec(
    dataset,
    sg=1
)

model_sg.wv.most_similar(
    'oeuvre',
    topn=5
)

[('masterful', 0.8247474431991577),
 ('cubist', 0.8234925866127014),
 ('inimitable', 0.8187395334243774),
 ('impressionistic', 0.8166742324829102),
 ('lithographs', 0.8158390522003174)]

We can see that the top terms are much closer in meaning (masterful, cubist, impressionistic). So, the Skip-gram method seems to work better for rare words.
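To see how the two methods differ, it helps to look at how each one frames its training examples. In CBOW, the surrounding context words are used together to predict the center word; in Skip-gram, the center word is used to predict each context word separately. The sketch below only generates these examples from one made-up sentence. It leaves out everything else that real training involves (such as subsampling and negative sampling), and the helper names are hypothetical.

```python
def context_window(tokens, i, window):
    """Words within `window` positions of index i, excluding the word itself."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """CBOW: one example per position, (all context words) -> center word."""
    return [
        (context_window(tokens, i, window), tokens[i])
        for i in range(len(tokens))
    ]


def skipgram_examples(tokens, window=2):
    """Skip-gram: one example per (center word, single context word) pair."""
    return [
        (tokens[i], ctx)
        for i in range(len(tokens))
        for ctx in context_window(tokens, i, window)
    ]


sentence = "the artist left a large oeuvre".split()
cbow = cbow_examples(sentence, window=2)
skipgram = skipgram_examples(sentence, window=2)

print(len(cbow), cbow[-1])
print(len(skipgram), [pair for pair in skipgram if pair[0] == "oeuvre"])
```

One commonly offered intuition for the result above is that Skip-gram gives a rare center word its own example for every context word, whereas CBOW folds the context into a single combined input. Treat this as a plausible explanation rather than a proven one.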

#### Bias in embeddings

In [23]:
model.wv.most_similar(
    positive=['woman', 'doctor'],
    negative=['man'],
    topn=5
)

[('nurse', 0.6096571683883667),
 ('child', 0.5963546633720398),
 ('teacher', 0.5726506114006042),
 ('prostitute', 0.5636792778968811),
 ('helen', 0.5335953831672668)]

That's not the kind of result we want. The model seems to associate doctors with men and nurses with women. Let's try another example. This time, let's ask the model what "smart" for men corresponds to for women:

In [24]:
model.wv.most_similar(
    positive=['woman', 'smart'],
    negative=['man'],
    topn=5
)

[('lazy', 0.5782583355903625),
 ('dumb', 0.5647369027137756),
 ('rodeo', 0.5478234887123108),
 ('wee', 0.5471178889274597),
 ('lucille', 0.5415554046630859)]

What's happening here? Is this seemingly great representation approach sexist? Is the word2vec algorithm sexist? There definitely is bias in the resulting word vectors, but think about where the bias is coming from. It's the underlying data that uses 'nurse' for females in contexts where 'doctor' is used for males. It is, therefore, the underlying text that contains the bias, not the algorithm.
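The point that the bias originates in the text can be illustrated without any embeddings at all. The sketch below counts, in a small and deliberately skewed toy corpus (invented for this example, not taken from text8), how often an occupation term appears in the same sentence as "he" or "she". Any model that learns from co-occurrence statistics like these has only the skewed counts to work with.

```python
from collections import Counter

# Hypothetical, deliberately skewed toy corpus.
toy_corpus = [
    "he is a doctor".split(),
    "he works as a doctor".split(),
    "she is a nurse".split(),
    "she works as a nurse".split(),
    "she is a doctor".split(),
]


def cooccurrence_counts(sentences, targets, probes):
    """Count sentences in which a target word appears alongside a probe word."""
    counts = Counter()
    for tokens in sentences:
        present = set(tokens)
        for target in targets:
            for probe in probes:
                if target in present and probe in present:
                    counts[(target, probe)] += 1
    return counts


counts = cooccurrence_counts(toy_corpus, ["doctor", "nurse"], ["he", "she"])
for target in ["doctor", "nurse"]:
    for probe in ["he", "she"]:
        print(target, probe, counts[(target, probe)])
```

Here "nurse" never appears with "he", and "doctor" appears with "he" twice as often as with "she", so the imbalance is already in the data before any training happens.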