In this exercise, we will train our own word vectors on the Brown corpus and the IMDb movie reviews corpus. We will assess the differences in the representations learned and the effect of the underlying training data. Follow these steps to complete this exercise:

In [3]:
import numpy as np
from gensim.models import word2vec

In [1]:
# Import the Brown and IMDb movie reviews corpus from NLTK:
import nltk

In [2]:
nltk.download('brown')
nltk.download('movie_reviews')
from nltk.corpus import brown, movie_reviews

[nltk_data] Downloading package brown to /Users/LNonyane/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/LNonyane/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


The corpora have a convenient method, sents(), that returns the individual sentences as lists of words (tokenized sentences, which can be passed directly to the word2vec algorithm). Since both corpora are rather small, use the Skip-gram method to create the embeddings:

In [4]:
model_brown = word2vec.Word2Vec(brown.sents(), sg=1)
model_movie = word2vec.Word2Vec(movie_reviews.sents(), sg=1)
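To make the sg=1 choice more concrete, here is a minimal sketch of the kind of (center, context) training pairs that Skip-gram works with. It is a simplification under stated assumptions: the helper name skipgram_pairs and the fixed window are illustrative, and Gensim's actual implementation differs in details such as window sampling and subsampling of frequent words.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for one tokenized sentence.

    Illustrative helper (not part of Gensim): every word within
    `window` positions of the center word counts as a context word.
    """
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# A tiny made-up sentence, in the same list-of-tokens shape that sents() yields.
example_pairs = skipgram_pairs(["the", "cat", "sat"], window=1)
print(example_pairs)
```

Each pair is one training example: the model is asked to predict the context word from the center word, which is why Skip-gram extracts many examples even from a small corpus.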

We now have two embeddings for the same term, each learned from a different context. Let's see the most similar terms for money from the model trained on the Brown corpus:

In [5]:
# print top 5 terms similar to money from model_brown
model_brown.wv.most_similar('money', topn=5)

[('care', 0.829934298992157),
 ('job', 0.8272829055786133),
 ('friendship', 0.8167790174484253),
 ('risk', 0.8044701218605042),
 ('joy', 0.8039414882659912)]

We can see that the top term is 'care'; fair enough. Let's see what the model learned regarding movie reviews.

In [6]:
# print top 5 terms similar to money from model_movie
model_movie.wv.most_similar('money', topn=5)

[('cash', 0.7237001061439514),
 ('paid', 0.7035631537437439),
 ('ransom', 0.6973373293876648),
 ('record', 0.6866511702537537),
 ('bucks', 0.6820871829986572)]

The top terms are cash and paid. Considering the language being used in movies, and thus in movie reviews, this isn't very surprising.

In this exercise, we created word vectors using different datasets and saw that the representations for the same terms, and the associations that were learned, are strongly affected by the underlying data. So, choose your data wisely.
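One rough way to quantify how differently the two models treat a term is to measure the overlap between their neighbor lists. The sketch below uses the two top-5 lists printed above; the jaccard helper is an illustrative function, not something provided by Gensim or NLTK.

```python
# Top-5 neighbors of 'money' as printed by the two models above.
brown_neighbors = ["care", "job", "friendship", "risk", "joy"]
movie_neighbors = ["cash", "paid", "ransom", "record", "bucks"]


def jaccard(a, b):
    """Jaccard overlap of two word lists: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


overlap = jaccard(brown_neighbors, movie_neighbors)
print(overlap)  # 0.0: the two lists share no terms
```

With only five neighbors per model this is a coarse measure, but a score of zero makes the point clearly: the two corpora lead to entirely different associations for the same word.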

#### Using pre-trained word vectors
So far, we've trained our own word embeddings using the small datasets we had access to. The folks at the Stanford NLP group have trained word embeddings on 6 billion tokens with 400,000 terms in the vocabulary. Individually, we will not have the resources to handle this scale. Fortunately, the Stanford NLP group has been benevolent enough to make these trained embeddings available to the general public so that people like us can benefit from their work. The trained embeddings are available on the GloVe page (https://nlp.stanford.edu/projects/glove/).

In [7]:
from gensim.scripts.glove2word2vec import glove2word2vec
glove_input_file = 'glove.6B/glove.6B.100d.txt'
word2vec_output_file = 'glove.6B.100d.w2vformat.txt'
glove2word2vec(
    glove_input_file,
    word2vec_output_file
)


(400000, 100)

We specified the input and the output file and ran the glove2word2vec utility. As the name suggests, the utility takes in word vectors in GloVe format and converts them into word2vec format, which Gensim's word2vec tooling can load directly. The returned tuple is the number of vectors and their dimensionality. Now, let's load the keyed word vectors from the reformatted text file:

In [8]:
from gensim.models.keyedvectors import KeyedVectors
glove_model = KeyedVectors.load_word2vec_format(
    'glove.6B.100d.w2vformat.txt',
    binary=False
)
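For intuition about the conversion step, the two text formats differ only in a header line: the word2vec text format starts with a line giving the number of vectors and their dimensionality, while GloVe files have no such line. The sketch below illustrates this on in-memory lines. The function name glove_to_word2vec_lines and the sample values are made up for illustration, and this is not the Gensim implementation.

```python
def glove_to_word2vec_lines(glove_lines):
    """Prepend the '<vocab_size> <dimensions>' header that the word2vec
    text format expects. Illustrative only; works on a list of lines."""
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    if not rows:
        raise ValueError("no vectors found")
    # Each row is: word followed by its vector components.
    dim = len(rows[0].split(" ")) - 1
    header = f"{len(rows)} {dim}"
    return [header] + rows


# Hypothetical 3-dimensional vectors, not real GloVe values.
sample = ["money 0.1 0.2 0.3", "cash 0.2 0.1 0.4"]
converted = glove_to_word2vec_lines(sample)
print(converted[0])  # "2 3"
```

For the real file, the header would carry the same two numbers that the conversion reported: 400000 vectors of 100 dimensions.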

With this done, we have the GloVe embeddings in the model, along with all the handy utilities we had for the embeddings model from word2vec. Let's check out the top terms similar to "money":

In [9]:
glove_model.most_similar(
    'money',
    topn=5
)

[('funds', 0.8508071303367615),
 ('cash', 0.848483681678772),
 ('fund', 0.7594833374023438),
 ('paying', 0.7415367364883423),
 ('pay', 0.740767240524292)]
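The score next to each term is a cosine similarity between that term's vector and the vector for the query word. As a minimal sketch of the computation, with made-up 2-dimensional vectors rather than real embeddings:

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Toy vectors for illustration only.
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])   # same direction, different length
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])       # unrelated directions
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])        # opposite directions
print(same_direction, orthogonal, opposite)
```

Because the measure depends only on direction, not vector length, scores always fall between -1 and 1, and values such as 0.85 for "funds" indicate vectors that point in nearly the same direction as "money".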

For closure, let's also check how this model performs on the king and queen analogy task:

In [13]:
glove_model.most_similar(
    positive=['woman', 'king'],
    negative=['man'],
    topn=5
)

[('queen', 0.7698541283607483),
 ('monarch', 0.6843380928039551),
 ('throne', 0.6755736470222473),
 ('daughter', 0.6594556570053101),
 ('princess', 0.6520534157752991)]
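The positive/negative query above is vector arithmetic followed by a nearest-neighbor search. The sketch below reproduces the idea on hand-made 3-dimensional vectors: add the unit vectors of the positive words, subtract those of the negative words, and rank the remaining words by cosine similarity to the result. The toy vectors and the analogy helper are assumptions for illustration, and this only approximates what most_similar does internally.

```python
import numpy as np

# Hand-made toy vectors (axes loosely: royalty, gender, other). Not real embeddings.
toy = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, -1.0, 0.0]),
    "man":   np.array([0.0, 1.0, 0.2]),
    "woman": np.array([0.0, -1.0, 0.2]),
    "apple": np.array([0.0, 0.0, 1.0]),
}


def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)


def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    query = np.mean(
        [unit(vectors[w]) for w in positive] + [-unit(vectors[w]) for w in negative],
        axis=0,
    )
    query = unit(query)
    exclude = set(positive) | set(negative)  # query words are not valid answers
    scores = [
        (word, float(np.dot(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]


result = analogy(toy, positive=["woman", "king"], negative=["man"])
print(result[0][0])  # queen
```

Excluding the query words matters: without that step, the nearest vector to king - man + woman is often king itself.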

Now that we have these embeddings in a model, we can work with them the same way we worked with the embeddings we created previously and can benefit from the larger dataset and vocabulary and the processing power used by the contributing organization.

#### Bias in embeddings
It's great that the embeddings are capturing these regularities by learning from the text data. Let's try something similar with a profession and see the terms closest to doctor - man + woman:

In [12]:
glove_model.most_similar(
    positive=['woman', 'doctor'],
    negative=['man'],
    topn=5
)

[('nurse', 0.7735227942466736),
 ('physician', 0.7189430594444275),
 ('doctors', 0.6824328303337097),
 ('patient', 0.6750683188438416),
 ('dentist', 0.6726033091545105)]