# Exploring Word2Vec with Gensim

## Overview

Word2Vec is an approach to learning *word embeddings*, vector representations of words that capture semantic and syntactic relationships between words based on their co-occurrences in natural language text. 

This unsupervised learning approach also reduces the dimensionality of the vectors representing words, which saves memory and helps manage the *curse of dimensionality*: in high-dimensional vector spaces, data become relatively sparse, which makes tasks such as machine learning harder.
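
As a rough illustration of the point about dimensionality, the sketch below contrasts one-hot vectors with small dense vectors for a three-word toy vocabulary. The dense values are hand-picked for illustration and are not learned embeddings; the helper `cosine` is likewise a hypothetical name introduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ['tree', 'leaf', 'car']

# One-hot encoding: one dimension per vocabulary word, so dimensionality grows
# with the vocabulary and any two distinct words are orthogonal.
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

# Dense vectors: a small, fixed dimensionality (2 here). Values are made up.
dense = {
    'tree': [0.9, 0.1],
    'leaf': [0.8, 0.2],
    'car':  [0.1, 0.9],
}

print(cosine(one_hot['tree'], one_hot['leaf']))  # distinct one-hot words share nothing
print(cosine(dense['tree'], dense['leaf']))      # related words can sit close together
print(cosine(dense['tree'], dense['car']))       # unrelated words can sit further apart
```

With one-hot vectors every pair of distinct words is equally dissimilar, whereas dense vectors leave room for graded similarity in far fewer dimensions.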

In this exercise you will look at the capabilities of Word2Vec as implemented in the Gensim library.

## Requirements 

Uncomment the lines below, run the installations once as needed, then comment the code out again.

In [None]:
#!pip install --upgrade pip
#!pip install --upgrade Cython
#!pip install --upgrade gensim

Import all necessary libraries. 

In [None]:
# Import modules and set up logging.
import gensim.downloader as api
from gensim.models import Word2Vec
import logging
import numpy as np
import os

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

import ipytest
import pytest

ipytest.autoconfig()

## Download data

In [None]:
# Load the Text8 corpus.
print(api.info('text8'))
text8_corpus = api.load('text8')

## Train a model

In [None]:
# Train a Word2Vec model on the Text8 corpus with default hyperparameters. 
model = Word2Vec(text8_corpus)  

# Perform a sanity check on the trained model.
print(model.wv.similarity('tree', 'leaf')) 

In [None]:
# Reduce logging level. A second call to logging.basicConfig() has no effect once
# the root logger is configured, so set the level on the root logger directly.
logging.getLogger().setLevel(logging.WARNING)

In [None]:
print(model.wv.most_similar('tree')) 
print(model.wv.most_similar('leaf')) 

## Relationships

Investigate the relationships between words in terms of trained representations. 

### Evaluate analogies

With the model you have trained, evaluate the analogy
`king-man+woman =~ queen`

In [None]:
# In Gensim 4, similarity queries live on the KeyedVectors object `model.wv`.
print(model.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=5))

Evaluate the analogy `ship-boat+rocket =~ spacecraft`. How similar is the left-hand side of the analogy to the right-hand side? Implement a function that can answer this question for analogies in general. Assume that the right-hand side of the analogy is always a single, positive term.
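
Before working with the trained model, it may help to see the idea behind analogy arithmetic on toy data. The sketch below uses hypothetical hand-made 3-dimensional vectors (not learned embeddings) and hypothetical helper names `cosine` and `analogy_vector`. It is a simplification: Gensim's `most_similar` is understood to normalize the vectors before combining them, so the numbers here will not match what the library reports.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors; dimensions loosely read as (royalty, male, female).
toy = {
    'king':  [1.0, 1.0, 0.0],
    'queen': [1.0, 0.0, 1.0],
    'man':   [0.0, 1.0, 0.0],
    'woman': [0.0, 0.0, 1.0],
    'apple': [0.1, 0.1, 0.1],
}

def analogy_vector(vectors, positive, negative):
    """Add the positive terms and subtract the negative terms, component-wise."""
    dim = len(next(iter(vectors.values())))
    result = [0.0] * dim
    for word in positive:
        result = [r + x for r, x in zip(result, vectors[word])]
    for word in negative:
        result = [r - x for r, x in zip(result, vectors[word])]
    return result

lhs = analogy_vector(toy, positive=['king', 'woman'], negative=['man'])

# Rank the remaining words (query terms excluded) by similarity to the left-hand side.
candidates = [w for w in toy if w not in ('king', 'woman', 'man')]
ranked = sorted(candidates, key=lambda w: cosine(lhs, toy[w]), reverse=True)
print(ranked[0], cosine(lhs, toy[ranked[0]]))
```

In this toy space the combined vector coincides with `queen`, which is why it ranks first; with real embeddings the match is only approximate.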

In [None]:
def eval_analogy(model, lhs_pos, lhs_neg, rhs):
    """Returns the similarity between the left-hand and right-hand sides of an analogy.
    
        Arguments: 
            model: Trained Gensim word2vec model to use.
            lhs_pos: List of terms that are positive on the left-hand side in the analogy. 
            lhs_neg: List of terms that are negative on the left-hand side in the analogy. 
            rhs: A single positive term on the right-hand side in the analogy.
            
        Returns:
            Float of the similarity if right-hand side term is found in the top 500 most similar terms.
            Otherwise, return None."""
    # How similar is the left-hand side of the analogy to the right-hand side?
    # Implement a function that can find the answer for analogies in general.
    # TODO: Complete.
    pass

Test:

In [None]:
%%run_pytest[clean]

def test_eval_analogy():
    assert eval_analogy(model, ['ship', 'rocket'], ['boat'], 'spacecraft') == pytest.approx(0.7043, abs=1e-4)

## Load a pre-trained model

In [None]:
import gensim.downloader as api
model_loaded = api.load('word2vec-google-news-300')

In [None]:
loaded_analogy_eval = -1
# Evaluate the analogy 'king'-'man'+'woman' compared to 'queen' using the loaded model 
# and assign the value to the variable `loaded_analogy_eval`.
# TODO: Complete.
pass

In [None]:
%%run_pytest[clean]

def test_loaded_analogy_eval():
    assert loaded_analogy_eval != -1
    assert loaded_analogy_eval == pytest.approx(0.7118, abs=1e-4)

## Train Word2Vec on different corpora

In [None]:
# Download the rap lyrics of Kanye West.
! wget https://raw.githubusercontent.com/gsurma/text_predictor/master/data/kanye/input.txt
! mv input.txt kanye.txt

# Download the complete works of William Shakespeare.
! wget https://raw.githubusercontent.com/gsurma/text_predictor/master/data/shakespeare/input.txt
! mv input.txt shakespeare.txt

In [None]:
from gensim.test.utils import datapath
from gensim import utils

class MyCorpus(object):
    """An iterator that yields sentences (lists of str)."""
    def __init__(self, data):
        self.data = data

    def __iter__(self):
        corpus_path = datapath(self.data)
        # Use a context manager so the file is closed after each pass.
        with open(corpus_path) as corpus_file:
            for line in corpus_file:
                # assume there's one document per line, tokens separated by whitespace
                yield utils.simple_preprocess(line)
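
Word2Vec makes more than one pass over its input (a vocabulary scan followed by the training epochs), which is why the corpus is wrapped in a class with `__iter__` rather than passed as a plain generator. The sketch below illustrates the difference with the standard library only; `LineCorpus` and `one_shot` are hypothetical names, and `str.split` stands in for `utils.simple_preprocess`.

```python
import os
import tempfile

class LineCorpus:
    """Hypothetical stand-in for MyCorpus: yields one token list per line."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopening the file on each call makes every pass start from the top.
        with open(self.path, encoding='utf-8') as handle:
            for line in handle:
                yield line.lower().split()

def one_shot(path):
    """A plain generator function over the same file, for comparison."""
    with open(path, encoding='utf-8') as handle:
        for line in handle:
            yield line.lower().split()

# Write a two-line toy corpus to a temporary file.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('To be or not to be\nThat is the question\n')
    toy_path = tmp.name

corpus = LineCorpus(toy_path)
first_pass = list(corpus)
second_pass = list(corpus)   # the iterable can be traversed again

gen = one_shot(toy_path)
gen_first = list(gen)
gen_second = list(gen)       # the generator is exhausted after one pass

print(len(first_pass), len(second_pass), len(gen_first), len(gen_second))
os.remove(toy_path)
```

The second pass over the generator is empty, so a model given a bare generator would have nothing left to train on after the vocabulary scan.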

Separately train two new models using the two different datasets, and compare how these datasets affect the relationships between words.

In [None]:
kanye_data = MyCorpus(os.getcwd()+'/kanye.txt')
shakespeare_data = MyCorpus(os.getcwd()+'/shakespeare.txt')

In [None]:
kanye_model = None
# Train a Word2Vec model on the Kanye corpus, and name it `kanye_model`.
# TODO: Complete
pass

In [None]:
shakespeare_model = None
# Train a Word2Vec model on the Shakespeare corpus, and name it `shakespeare_model`.
# TODO: Complete
pass

We can easily find words for which the two models learn very different similarities.

In [None]:
# For example, compare:
print(kanye_model.wv.most_similar(positive=['king'], topn=5))
print(shakespeare_model.wv.most_similar(positive=['king'], topn=5))

## Compare Skip-gram and CBOW

Using the arguments of the `gensim.models.Word2Vec()` constructor, which also trains the model, you can select either Skip-gram or CBOW explicitly and modify other hyperparameters.

Train a Skip-gram model on the Text8 corpus, using the same context window size as the default CBOW model trained earlier on the same dataset, and compare how relationships are expressed in the resulting embedding vectors.

**Hint:** Use the keyword argument `sg` when instantiating the model object to specify Skip-gram, rather than the default CBOW setting.
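
Conceptually, the two architectures differ in what they predict from a context window: CBOW predicts the center word from its surrounding words, while Skip-gram predicts each surrounding word from the center word. The sketch below builds such (input, target) examples for a toy sentence. The function names are hypothetical, and the sketch leaves out details of real training such as subsampling and negative sampling.

```python
def context_windows(tokens, window):
    """Yield (center, context_words) for every position in the token list."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

def cbow_examples(tokens, window):
    """CBOW: the whole context is the input, the center word is the target."""
    return [(context, center)
            for center, context in context_windows(tokens, window)]

def skipgram_examples(tokens, window):
    """Skip-gram: the center word is the input, each context word a separate target."""
    return [(center, word)
            for center, context in context_windows(tokens, window)
            for word in context]

sentence = ['the', 'quick', 'brown', 'fox']
cbow = cbow_examples(sentence, window=1)
skipgram = skipgram_examples(sentence, window=1)

print(cbow)
print(skipgram)
```

Skip-gram produces one example per (center, context word) pair, so it generates more training examples from the same text than CBOW, which produces one per position.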

In [None]:
model_sg = None
# Train a skip-gram Word2Vec model on `text8_corpus` and name it `model_sg`.
# TODO: Complete
pass

In [None]:
loaded_analogy_eval_sg = eval_analogy(model_sg, ['king', 'woman'], ['man'], 'queen')

In [None]:
loaded_analogy_eval_cbow = eval_analogy(model, ['king', 'woman'], ['man'], 'queen')

**Discuss:** Which of the two models produces the higher similarity for the example analogy? Will this always be the case? Why or why not?

For more information about Gensim, see https://radimrehurek.com/gensim.

## Feedback

Please give (anonymous) feedback on this exercise by filling out [this form](https://forms.gle/2jPayczbFhEcC9K68).