# Exercise #1: Exploring Word2Vec with Gensim

## Overview

Word2Vec is an approach to learning *word embeddings*, vector representations of words that capture semantic and syntactic relationships between words based on their co-occurrences in natural language text. 

This unsupervised learning approach also reduces the dimensionality of the vectors that represent words. Lower-dimensional vectors need less memory and help manage the *curse of dimensionality*, whereby high-dimensional vector spaces make the available data relatively sparse, e.g., for machine learning. 
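
To make the dimensionality argument concrete, the following sketch contrasts one-hot vectors (one dimension per vocabulary word) with dense vectors. It uses no Gensim code; the `toy_dense` values are hand-picked assumptions for illustration, not trained embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["tree", "leaf", "car", "road", "king", "queen"]

# One-hot encoding: the vector length grows with the vocabulary size.
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}

# Hypothetical 2-dimensional dense vectors (hand-picked, NOT trained).
toy_dense = {
    "tree": [0.9, 0.1],
    "leaf": [0.8, 0.2],
    "car": [0.1, 0.9],
}

# Dimensionality of a one-hot vector equals the vocabulary size.
print(len(one_hot["tree"]))
# Any two distinct one-hot vectors are orthogonal, so they carry no similarity.
print(cosine(one_hot["tree"], one_hot["leaf"]))
# Dense vectors can place related words closer together than unrelated ones.
print(cosine(toy_dense["tree"], toy_dense["leaf"]))
print(cosine(toy_dense["tree"], toy_dense["car"]))
```

With a realistic vocabulary of tens of thousands of words, the one-hot vectors would have tens of thousands of dimensions, whereas embedding vectors typically have a few hundred.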

## Requirements 

Run installations once, as needed, then comment the code out again.

In [3]:
#!pip install --upgrade pip
#!pip install --upgrade Cython
#!pip install --upgrade gensim

Import all necessary libraries. 

In [5]:
# Import modules and set up logging.
import gensim.downloader as api
from gensim.models import Word2Vec
import logging
import numpy as np
import os

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Download data

In [None]:
# Load the Text8 corpus.
print(api.info('text8'))
text8_corpus = api.load('text8')

## Train a model

In [None]:
# Train a Word2Vec model on the Text8 corpus with default hyperparameters. 
model = Word2Vec(text8_corpus)  

# Perform a sanity check on the trained model.
print(model.wv.similarity('tree', 'leaf')) 

In [None]:
# Reduce logging level. Calling basicConfig() again has no effect once the root
# logger is configured, so set the level on the root logger directly.
logging.getLogger().setLevel(logging.WARNING)

In [None]:
print(model.wv.most_similar('tree')) 
print(model.wv.most_similar('leaf')) 

## Relationships

Investigate the relationships between words in terms of trained representations. 

### Task 1: Evaluate analogies
With the model you have trained, evaluate the analogy
`king-man+woman =~ queen`

In [None]:
# Query the trained vectors through model.wv (required in Gensim 4.x).
print(model.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=5))
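
An analogy query of this kind can be read as vector arithmetic followed by a nearest-neighbor search. The sketch below imitates that idea in plain Python: it adds the unit-normalized positive vectors, subtracts the negative ones, and ranks the remaining words by cosine similarity. The function `toy_analogy` and the 3-dimensional `toy_vectors` are hypothetical, hand-picked for illustration; this approximates the idea behind `most_similar` and is not Gensim's implementation.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity, computed as the dot product of unit vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))

# Hypothetical vectors: axis 0 ~ "royalty", axis 1 ~ "gender", axis 2 ~ "fruit".
toy_vectors = {
    "king": [0.8, 0.6, 0.0],
    "queen": [0.8, -0.6, 0.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, -1.0, 0.0],
    "apple": [0.0, 0.0, 1.0],
}

def toy_analogy(positive, negative, vectors, topn=3):
    """Rank words by similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    # Exclude the query words themselves from the candidates.
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

result = toy_analogy(["king", "woman"], ["man"], toy_vectors)
print(result)  # "queen" ranks first for these hand-picked vectors
```

On real embeddings the top result is not guaranteed to be the expected word; the ranking depends on the corpus and the hyperparameters.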

**TODO:** Find other interesting relationships and analogies using this model. 

In [1]:
# TODO Find other analogies that are interesting.

## Load a pre-trained model

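In [None]:
import gensim.downloader as api
# For this dataset, api.load() returns a KeyedVectors object rather than a full
# Word2Vec model, so query it directly (there is no .wv attribute).
model_loaded = api.load('word2vec-google-news-300')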

**TODO:** Evaluate the analogy `king-man+woman =~ queen` using the loaded model and compare the result with that of the model you trained.

In [None]:
# TODO

## Train Word2Vec on different corpora

In [None]:
# Download the rap lyrics of Kanye West.
! wget https://raw.githubusercontent.com/gsurma/text_predictor/master/data/kanye/input.txt
! mv input.txt kanye.txt

# Download the complete works of William Shakespeare.
! wget https://raw.githubusercontent.com/gsurma/text_predictor/master/data/shakespeare/input.txt
! mv input.txt shakespeare.txt

In [None]:
from gensim import utils

class MyCorpus:
    """An iterable that yields sentences (lists of str)."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Open the file directly: datapath() resolves names inside Gensim's
        # bundled test data and is not needed for our own files.
        with open(self.path, encoding='utf-8') as corpus_file:
            for line in corpus_file:
                # Assume one document per line, tokens separated by whitespace.
                yield utils.simple_preprocess(line)
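
The corpus is wrapped in a class with `__iter__` rather than a plain generator because Word2Vec training passes over the data several times (once to build the vocabulary, then once per epoch), so the input has to be restartable. The sketch below shows the difference on a temporary file; `LineCorpus` and `line_generator` are hypothetical names, and plain `str.split` stands in for Gensim's preprocessing.

```python
import os
import tempfile

class LineCorpus:
    """Hypothetical restartable corpus: re-opens the file on every pass."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield line.lower().split()

def line_generator(path):
    """Plain generator: can be consumed only once."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield line.lower().split()

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("to be or not to be\nthat is the question\n")

    corpus = LineCorpus(path)
    first_pass = list(corpus)
    second_pass = list(corpus)   # a new file handle is opened, so data is read again

    gen = line_generator(path)
    gen_first = list(gen)
    gen_second = list(gen)       # the generator is already exhausted

print(len(first_pass), len(second_pass))  # 2 2
print(len(gen_first), len(gen_second))    # 2 0
```

A model fed with the exhausted generator would see no data after the first pass, which is why the class-based form is the safer pattern here.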

Train two separate models, one on each dataset, and compare how the datasets affect the learned relationships between words.

In [None]:
kanye_data = MyCorpus(os.path.join(os.getcwd(), 'kanye.txt'))
shakespeare_data = MyCorpus(os.path.join(os.getcwd(), 'shakespeare.txt'))

In [None]:
kanye_model = Word2Vec(sentences=kanye_data)

In [None]:
shakespeare_model = Word2Vec(sentences=shakespeare_data)

**TODO:** Choose five words for which the two models learn very different similarities.

In [None]:
# For example, compare (query the vectors through .wv, as required in Gensim 4.x):
print(kanye_model.wv.most_similar(positive=['king'], topn=5))
print(shakespeare_model.wv.most_similar(positive=['king'], topn=5))

In [None]:
# TODO

**TODO:** Find new relationships or analogies between words. Describe your reasoning in choosing the expressions, and interpret the resulting similarity values in light of the two embedding models, trained on Kanye West lyrics and the complete works of William Shakespeare, respectively. 

In [None]:
# TODO

## (Optional) Compare Skip-gram and CBOW

The constructor `gensim.models.Word2Vec()` takes the training hyperparameters as arguments. Pass `sg=1` to select Skip-gram or `sg=0` (the default) to select CBOW; other hyperparameters, such as the context window size `window`, can be set in the same way. 
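
The two architectures differ in what they predict: Skip-gram predicts each context word from the center word, while CBOW predicts the center word from its combined context. The sketch below lists the training examples each would draw from one sentence with a fixed window. The helper names `skipgram_pairs` and `cbow_examples` are hypothetical, and the sketch ignores details of real implementations such as subsampling, negative sampling, and randomly shrunk windows.

```python
def skipgram_pairs(tokens, window):
    """Skip-gram: one (center, context) pair per context word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """CBOW: one (context words, center) example per position."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples

tokens = ["the", "tree", "has", "leaves"]
sg_pairs = skipgram_pairs(tokens, window=1)
cbow = cbow_examples(tokens, window=1)

for pair in sg_pairs:
    print("skip-gram:", pair)
for example in cbow:
    print("cbow:", example)
```

For the same sentence and window, Skip-gram yields more (and finer-grained) training examples than CBOW, which is one reason the two can learn different neighborhoods for rare words.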

**TODO:** Train a Skip-gram model and a CBOW model on the same dataset, with the same context window size, and compare how relationships are expressed in terms of the resulting embedding vectors. 

In [None]:
# TODO

For more information about Gensim, see https://radimrehurek.com/gensim.