# Exploring Word2Vec with Gensim

## Overview

Word2Vec is an approach to learning *word embeddings*, vector representations of words that capture semantic and syntactic relationships between words based on their co-occurrences in natural language text. 

This unsupervised learning approach also reduces the dimensionality of the vectors representing words, which can be helpful for memory and to manage the *curse of dimensionality*, whereby high-dimensional vector spaces lead to a relative data sparsity, e.g., for machine learning. 

In this exercise you will look at the capabilities of Word2Vec as implemented in the module Gensim. 

## Requirements 

Uncomment the lines below, run the installations once as needed, then comment the code out again.

In [1]:
#!pip install --upgrade pip
#!pip install --upgrade Cython
#!pip install --upgrade gensim

Import all necessary libraries. 

In [2]:
# Import modules and set up logging.
import gensim.downloader as api
from gensim.models import Word2Vec
import logging
import numpy as np
import os

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

import ipytest
import pytest

ipytest.autoconfig()

## Download data

In [3]:
# Load the Text8 corpus.
print(api.info('text8'))
text8_corpus = api.load('text8')

{'num_records': 1701, 'record_format': 'list of str (tokens)', 'file_size': 33182058, 'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/text8/__init__.py', 'license': 'not found', 'description': 'First 100,000,000 bytes of plain text from Wikipedia. Used for testing purposes; see wiki-english-* for proper full Wikipedia datasets.', 'checksum': '68799af40b6bda07dfa47a32612e5364', 'file_name': 'text8.gz', 'read_more': ['http://mattmahoney.net/dc/textdata.html'], 'parts': 1}


## Train a model

In [4]:
# Train a Word2Vec model on the Text8 corpus with default hyperparameters. 
model = Word2Vec(text8_corpus)  

# Perform a sanity check on the trained model.
print(model.wv.similarity('tree', 'leaf')) 

2020-10-12 12:53:17,379 : INFO : collecting all words and their counts
2020-10-12 12:53:17,390 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-10-12 12:53:23,698 : INFO : collected 253854 word types from a corpus of 17005207 raw words and 1701 sentences
2020-10-12 12:53:23,699 : INFO : Loading a fresh vocabulary
2020-10-12 12:53:23,983 : INFO : effective_min_count=5 retains 71290 unique words (28% of original 253854, drops 182564)
2020-10-12 12:53:23,983 : INFO : effective_min_count=5 leaves 16718844 word corpus (98% of original 17005207, drops 286363)
2020-10-12 12:53:24,201 : INFO : deleting the raw counts dictionary of 253854 items
2020-10-12 12:53:24,215 : INFO : sample=0.001 downsamples 38 most-common words
2020-10-12 12:53:24,216 : INFO : downsampling leaves estimated 12506280 word corpus (74.8% of prior 16718844)
2020-10-12 12:53:24,487 : INFO : estimated required memory for 71290 words and 100 dimensions: 92677000 bytes
2020-10-12 12:53:24,487 : 

2020-10-12 12:54:35,339 : INFO : EPOCH - 3 : training on 17005207 raw words (12505275 effective words) took 16.0s, 782139 effective words/s
2020-10-12 12:54:36,347 : INFO : EPOCH 4 - PROGRESS: at 6.58% examples, 813035 words/s, in_qsize 2, out_qsize 0
2020-10-12 12:54:37,368 : INFO : EPOCH 4 - PROGRESS: at 13.11% examples, 804042 words/s, in_qsize 0, out_qsize 1
2020-10-12 12:54:38,373 : INFO : EPOCH 4 - PROGRESS: at 18.75% examples, 770362 words/s, in_qsize 5, out_qsize 0
2020-10-12 12:54:39,377 : INFO : EPOCH 4 - PROGRESS: at 25.10% examples, 776023 words/s, in_qsize 0, out_qsize 0
2020-10-12 12:54:40,380 : INFO : EPOCH 4 - PROGRESS: at 31.63% examples, 785945 words/s, in_qsize 0, out_qsize 0
2020-10-12 12:54:41,385 : INFO : EPOCH 4 - PROGRESS: at 38.10% examples, 790426 words/s, in_qsize 0, out_qsize 0
2020-10-12 12:54:42,393 : INFO : EPOCH 4 - PROGRESS: at 44.33% examples, 788064 words/s, in_qsize 1, out_qsize 0
2020-10-12 12:54:43,405 : INFO : EPOCH 4 - PROGRESS: at 50.56% example

0.6936887


In [5]:
# Reduce logging level.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.WARNING)

In [6]:
print(model.wv.most_similar('tree')) 
print(model.wv.most_similar('leaf')) 

2020-10-12 12:55:07,262 : INFO : precomputing L2-norms of word weight vectors


[('leaf', 0.6936887502670288), ('trees', 0.6819906234741211), ('bark', 0.6473582983016968), ('flower', 0.6167069673538208), ('cactus', 0.611079216003418), ('avl', 0.6076448559761047), ('fruit', 0.605135440826416), ('bird', 0.6035254001617432), ('vine', 0.5807324647903442), ('cave', 0.5720144510269165)]
[('flower', 0.7811040878295898), ('coloured', 0.7625181674957275), ('colored', 0.7616524696350098), ('grass', 0.7488452196121216), ('bark', 0.7405838966369629), ('haliotis', 0.7371957898139954), ('goat', 0.7328765392303467), ('crab', 0.7317043542861938), ('beetle', 0.7304386496543884), ('maple', 0.7257818579673767)]


## Relationships

Investigate the relationships between words in terms of trained representations. 

### Evaluate  analogies
With the model you have trained, evaluate the analogy
`king-man+woman =~ queen`

In [7]:
print(model.most_similar(positive=['king', 'woman'], negative=['man'], topn=5))

[('queen', 0.670697808265686), ('princess', 0.6247391700744629), ('empress', 0.6237273216247559), ('son', 0.6188290119171143), ('prince', 0.6153911352157593)]


  """Entry point for launching an IPython kernel.


Evaluate the analogy `ship-boat+rocket =~ spacecraft`. How similar are the left-hand side of the analogy to the right-hand side? Implement a function that can find the answer for analogies in general. We assume the right-hand side of the analogy will always be a single, positive term. 

In [8]:
def eval_analogy(model, lhs_pos, lhs_neg, rhs):
    """Returns the similarity between the left-hand and right-hand sides of an anaology.
    
        Arguments: 
            model: Trained Gensim word2vec model to use.
            lhs_pos: List of terms that are positive on the left-hand side in the analogy. 
            lhs_neg: List of terms that are negative on the left-hand side in the analogy. 
            rhs: A single positive term on the right-hand side in the analogy.
            
        Returns:
            Float of the similarity if right-hand side term is found in the top 500 most similar terms.
            Otherwise, return None."""
    # How similar are the left-hand side of the analogy to the right-hand side? 
    # Implement a function that can find the answer for analogies in general.
    # TODO: Complete.
    similarities_list = model.most_similar(positive=lhs_pos, negative=lhs_neg, topn=500)
    similarities_dict = {}
    for term, sim in similarities_list:
        similarities_dict[term] = sim
    if rhs in similarities_dict:
        return similarities_dict[rhs]
    else:
        print("Right-hand side term not found in top 500 most similar terms to the left-hand side analogy.")
        None

Test:

In [9]:
%%run_pytest[clean]

def test_eval_analogy():
    assert eval_analogy(model, ['ship', 'rocket'], ['boat'], 'spacecraft') == pytest.approx(0.7043, abs=1e-4)

F                                                                        [100%]
______________________________ test_eval_analogy _______________________________

    def test_eval_analogy():
>       assert eval_analogy(model, ['ship', 'rocket'], ['boat'], 'spacecraft') == pytest.approx(0.7043, abs=1e-4)
E       AssertionError: assert 0.6901870369911194 == 0.7043 ± 1.0e-04
E        +  where 0.6901870369911194 = eval_analogy(<gensim.models.word2vec.Word2Vec object at 0x117d44fd0>, ['ship', 'rocket'], ['boat'], 'spacecraft')
E        +  and   0.7043 ± 1.0e-04 = <function approx at 0x117d3e0d0>(0.7043, abs=0.0001)
E        +    where <function approx at 0x117d3e0d0> = pytest.approx

<ipython-input-9-b916198ae617>:2: AssertionError
tmpclbytxa_.py::test_eval_analogy
    app.launch_new_instance()

FAILED tmpclbytxa_.py::test_eval_analogy - AssertionError: assert 0.690187036...


## Load a pre-trained model

In [10]:
import gensim.downloader as api
model_loaded = api.load('word2vec-google-news-300')

2020-10-12 12:55:07,914 : INFO : loading projection weights from /Users/trondlinjordet/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz
2020-10-12 12:55:59,462 : INFO : loaded (3000000, 300) matrix from /Users/trondlinjordet/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz


In [11]:
loaded_analogy_eval = -1
# Evaluate the analogy 'king'-'man'+'woman' compared to 'queen' using the loaded model 
# and assign the value to the variable `loaded_analogy_eval`.
# TODO: Complete.
loaded_analogy_eval = eval_analogy(model_loaded, ['king', 'woman'], ['man'], 'queen')

2020-10-12 12:55:59,469 : INFO : precomputing L2-norms of word weight vectors


In [12]:
%%run_pytest[clean]

def test_loaded_analogy_eval():
    assert loaded_analogy_eval != -1
    assert loaded_analogy_eval == pytest.approx(0.7118, abs=1e-4)

.                                                                        [100%]
1 passed in 0.01s


## Train Word2Vec on different corpora

In [13]:
# Download the rap lyrics of Kanye West.
! wget https://raw.githubusercontent.com/gsurma/text_predictor/master/data/kanye/input.txt
! mv input.txt kanye.txt

# Download the complete works of William Shakespeare.
! wget https://raw.githubusercontent.com/gsurma/text_predictor/master/data/shakespeare/input.txt
! mv input.txt shakespeare.txt

--2020-10-12 12:56:14--  https://raw.githubusercontent.com/gsurma/text_predictor/master/data/kanye/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.64.133, 151.101.128.133, 151.101.192.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.64.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 330453 (323K) [text/plain]
Saving to: ‘input.txt’


2020-10-12 12:56:18 (2,35 MB/s) - ‘input.txt’ saved [330453/330453]

--2020-10-12 12:56:18--  https://raw.githubusercontent.com/gsurma/text_predictor/master/data/shakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.64.133, 151.101.128.133, 151.101.192.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.64.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4573338 (4,4M) [text/plain]
Saving to: ‘input.txt’


2020-10-12 12:56:19 (5,89 MB/s) - ‘input.txt’ 

In [14]:
from gensim.test.utils import datapath
from gensim import utils

class MyCorpus(object):
    """An interator that yields sentences (lists of str)."""
    def __init__(self, data):
        self.data = data

    def __iter__(self):
        corpus_path = datapath(self.data)
        for line in open(corpus_path):
            # assume there's one document per line, tokens separated by whitespace
            yield utils.simple_preprocess(line)

2020-10-12 12:56:19,821 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2020-10-12 12:56:19,823 : INFO : built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)


Separately train two new models using the two different datasets, and compare how these datasets affect relationships between 

In [15]:
kanye_data = MyCorpus(os.getcwd()+'/kanye.txt')
shakespeare_data = MyCorpus(os.getcwd()+'/shakespeare.txt')

In [16]:
kanye_model = None
# Train a Word2Vec model on the Kanye corpus, and name it `kanye_model`.
# TODO: Complete
kanye_model = Word2Vec(sentences=kanye_data)

2020-10-12 12:56:19,840 : INFO : collecting all words and their counts
2020-10-12 12:56:19,842 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-10-12 12:56:19,962 : INFO : collected 5643 word types from a corpus of 58845 raw words and 9181 sentences
2020-10-12 12:56:19,963 : INFO : Loading a fresh vocabulary
2020-10-12 12:56:19,970 : INFO : effective_min_count=5 retains 1241 unique words (21% of original 5643, drops 4402)
2020-10-12 12:56:19,972 : INFO : effective_min_count=5 leaves 52070 word corpus (88% of original 58845, drops 6775)
2020-10-12 12:56:19,981 : INFO : deleting the raw counts dictionary of 5643 items
2020-10-12 12:56:19,982 : INFO : sample=0.001 downsamples 69 most-common words
2020-10-12 12:56:19,984 : INFO : downsampling leaves estimated 37293 word corpus (71.6% of prior 52070)
2020-10-12 12:56:19,995 : INFO : estimated required memory for 1241 words and 100 dimensions: 1613300 bytes
2020-10-12 12:56:19,997 : INFO : resetting layer weigh

In [24]:
shakespeare_model = None
# Train a Word2Vec model on the Shakespeare corpus, and name it `shakespeare_model`.
# TODO: Complete
shakespeare_model = Word2Vec(sentences=shakespeare_data)

2020-10-12 13:00:26,405 : INFO : collecting all words and their counts
2020-10-12 13:00:26,407 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-10-12 13:00:26,508 : INFO : PROGRESS: at sentence #10000, processed 46762 words, keeping 5336 word types
2020-10-12 13:00:26,607 : INFO : PROGRESS: at sentence #20000, processed 99516 words, keeping 8036 word types
2020-10-12 13:00:26,705 : INFO : PROGRESS: at sentence #30000, processed 150456 words, keeping 9920 word types
2020-10-12 13:00:26,795 : INFO : PROGRESS: at sentence #40000, processed 195977 words, keeping 11442 word types
2020-10-12 13:00:26,886 : INFO : PROGRESS: at sentence #50000, processed 242171 words, keeping 12738 word types
2020-10-12 13:00:26,973 : INFO : PROGRESS: at sentence #60000, processed 285628 words, keeping 13827 word types
2020-10-12 13:00:27,076 : INFO : PROGRESS: at sentence #70000, processed 331742 words, keeping 14731 word types
2020-10-12 13:00:27,188 : INFO : PROGRESS: at sente

2020-10-12 13:00:50,437 : INFO : EPOCH - 10 : training on 802937 raw words (596540 effective words) took 2.0s, 305042 effective words/s
2020-10-12 13:00:50,438 : INFO : training on a 8029370 raw words (5961025 effective words) took 20.5s, 291235 effective words/s


For each of the models, we can easily find words where the two models learn very different similarities.

In [18]:
# For example, compare:
print(kanye_model.most_similar(positive=['king'], topn=5))
print(shakespeare_model.most_similar(positive=['king'], topn=5))

  
2020-10-12 12:56:36,173 : INFO : precomputing L2-norms of word weight vectors
  This is separate from the ipykernel package so we can avoid doing imports until
2020-10-12 12:56:36,177 : INFO : precomputing L2-norms of word weight vectors


[('yo', 0.9996579885482788), ('before', 0.9996414184570312), ('die', 0.999640703201294), ('his', 0.9996396899223328), ('away', 0.9996213912963867)]
[('prince', 0.8854881525039673), ('duke', 0.7881519198417664), ('crown', 0.7170592546463013), ('devil', 0.6886593103408813), ('bolingbroke', 0.6861094236373901)]


## Compare Skip-gram and CBOW

By using the arguments of the model (training) method in `gensim.models.Word2Vec()` you can select either Skip-gram or CBOW explicitly, as well as modifying other hyperparameters. 

Train a Skip-gram model on the Shakespeare corpus and compare with the default CBOW model on the same dataset, with the same context window size, and compare how relationships are expressed in terms of the resulting embedding vectors.

**Hint:** Use the keyword argument `sg` in when instantiating the model object to specify Skip-gram, rather than the defaul CBOW setting.

In [25]:
shakespeare_model_sg = None
# Train a Word2Vec model on the Shakespeare corpus, and name it `shakespeare_model`.
# TODO: Complete
shakespeare_model_sg = Word2Vec(sentences=shakespeare_data, sg=1)

2020-10-12 13:00:50,450 : INFO : collecting all words and their counts
2020-10-12 13:00:50,452 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-10-12 13:00:50,561 : INFO : PROGRESS: at sentence #10000, processed 46762 words, keeping 5336 word types
2020-10-12 13:00:50,667 : INFO : PROGRESS: at sentence #20000, processed 99516 words, keeping 8036 word types
2020-10-12 13:00:50,778 : INFO : PROGRESS: at sentence #30000, processed 150456 words, keeping 9920 word types
2020-10-12 13:00:50,871 : INFO : PROGRESS: at sentence #40000, processed 195977 words, keeping 11442 word types
2020-10-12 13:00:50,963 : INFO : PROGRESS: at sentence #50000, processed 242171 words, keeping 12738 word types
2020-10-12 13:00:51,054 : INFO : PROGRESS: at sentence #60000, processed 285628 words, keeping 13827 word types
2020-10-12 13:00:51,152 : INFO : PROGRESS: at sentence #70000, processed 331742 words, keeping 14731 word types
2020-10-12 13:00:51,253 : INFO : PROGRESS: at sente

2020-10-12 13:01:14,026 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-10-12 13:01:14,037 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-10-12 13:01:14,054 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-10-12 13:01:14,055 : INFO : EPOCH - 9 : training on 802937 raw words (596332 effective words) took 2.1s, 283936 effective words/s
2020-10-12 13:01:15,061 : INFO : EPOCH 10 - PROGRESS: at 37.76% examples, 222816 words/s, in_qsize 0, out_qsize 0
2020-10-12 13:01:16,069 : INFO : EPOCH 10 - PROGRESS: at 79.29% examples, 232730 words/s, in_qsize 0, out_qsize 0
2020-10-12 13:01:16,541 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-10-12 13:01:16,551 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-10-12 13:01:16,575 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-10-12 13:01:16,577 : INFO : EPOCH - 10 : training on 802937 raw words (595956 effectiv

In [26]:
loaded_analogy_eval_sm_sg = eval_analogy(shakespeare_model_sg, ['king', 'woman'], ['man'], 'queen')

  app.launch_new_instance()
2020-10-12 13:01:22,120 : INFO : precomputing L2-norms of word weight vectors


Right-hand side term not found in top 500 most similar terms to the left-hand side analogy.


In [27]:
loaded_analogy_eval_sm_cbow = eval_analogy(shakespeare_model, ['king', 'woman'], ['man'], 'queen')

  app.launch_new_instance()
2020-10-12 13:01:22,545 : INFO : precomputing L2-norms of word weight vectors


In [28]:
%%run_pytest[clean]

def test_loaded_analogy_eval():
    assert loaded_analogy_eval_sm_cbow == None
    assert loaded_analogy_eval_sm_sg == pytest.approx(0.4652, abs=1e-4)

F                                                                        [100%]
___________________________ test_loaded_analogy_eval ___________________________

    def test_loaded_analogy_eval():
>       assert loaded_analogy_eval_sm_cbow == None
E       assert 0.48434120416641235 == None

<ipython-input-28-39e50d74f192>:2: AssertionError
FAILED tmp9hj30bki.py::test_loaded_analogy_eval - assert 0.48434120416641235 ...
1 failed in 0.03s


For more information about Gensim, see https://radimrehurek.com/gensim.

## Feedback

Please give (anonymous) feedback on this exercise by filling out [this form](https://forms.gle/2jPayczbFhEcC9K68).