# Lab 6 - word2vec with PyTorch and gensim

 "A word is characterized by the company it keeps" - Firth (1957)
 

## Exercise 0 (0pt)


To do the following exercises you will need certain Python packages. This
first exercise is about installing them. You will need `sklearn` (installed under the
package name `scikit-learn`), `nltk`, `numpy`, and `gensim`. Please make sure they
are installed (via your distribution's package manager, pip, Anaconda, ...) and
check your installation by trying to import them:

In [None]:
import sklearn
import nltk
import numpy
import gensim

## Exercise 1.1 (0pt)

In `wordspace.py` you find some convenience functions to extract a word
cooccurrence matrix from text. Run the following script and evaluate the
embeddings by looking at the nearest neighbors of some words.

In [None]:
from wordspace import cooccurrence_matrix, nearest_neighbor_loop

with open('brown.txt', 'r') as f:
    brown = f.read()

matrix, vocabulary = cooccurrence_matrix(brown)

In [None]:
nearest_neighbor_loop(matrix, vocabulary)
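
As a rough illustration of what "nearest neighbors" means here, the sketch below ranks words by cosine similarity between their row vectors. It is not the implementation in `wordspace.py`; the function `nearest_neighbors`, the toy vocabulary, and the toy vectors are made up for this example.

```python
import numpy as np

def nearest_neighbors(word, vectors, vocab, k=3):
    """Return the k words whose row vectors are most cosine-similar to `word`."""
    idx = vocab.index(word)
    norms = np.linalg.norm(vectors, axis=1)
    norms[norms == 0] = 1.0              # avoid division by zero for empty rows
    unit = vectors / norms[:, None]      # normalize every row to unit length
    sims = unit @ unit[idx]              # cosine similarity to the query word
    sims[idx] = -np.inf                  # exclude the query word itself
    best = np.argsort(-sims)[:k]
    return [(vocab[i], float(sims[i])) for i in best]

# Hypothetical toy data: each row is the count vector of one word.
toy_vocab = ["cat", "dog", "car", "truck"]
toy_vectors = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
])

print(nearest_neighbors("cat", toy_vectors, toy_vocab, k=2))
```

In this toy data "cat" and "dog" share most of their contexts, so "dog" comes out as the closest neighbor of "cat".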

In [None]:
del matrix 
del vocabulary

## Exercise 1.2 (1pt)

One simple way to improve a basic counting model is to transform the raw
co-occurrence counts, e.g., by taking their square root, which dampens the
influence of very frequent pairs.
Modify the script from Exercise 1.1 to do so using `numpy.sqrt`.
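
A minimal sketch of the transformation on a toy count matrix. The values are made up for illustration; in the exercise the same call would be applied to the `matrix` returned by `cooccurrence_matrix`, assuming it is a dense NumPy array.

```python
import numpy as np

# Hypothetical co-occurrence counts (rows = target words, columns = context words).
counts = np.array([
    [0, 100, 4],
    [100, 0, 1],
    [4, 1, 0],
], dtype=float)

# Element-wise square root: large counts shrink a lot, small counts only a
# little, and zeros stay zero.
sqrt_counts = np.sqrt(counts)

print(sqrt_counts)
```

Before the transformation the largest count is 25 times the count of 4; afterwards the ratio is only 5.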

## Exercise 1.3 (1pt)

Next let us examine the parameters of the function `cooccurrence_matrix`.
You can modify the `window_size` and/or try a different vectorizer than
the standard `CountVectorizer` to compute the cooccurrence scores. Try
`sklearn.feature_extraction.text.TfidfVectorizer`!

In [None]:
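from sklearn.feature_extraction.text import CountVectorizer

text = brown  # the corpus loaded in Exercise 1.1
matrix, vocabulary = cooccurrence_matrix(
    text, window_size=2, max_vocab_size=20000,
    same_word_zero=False, vectorizer=CountVectorizer
)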
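
To get a feeling for the difference between the two vectorizers, the sketch below applies both to three made-up sentences. How `cooccurrence_matrix` uses the vectorizer internally is not shown here; this only illustrates that TF-IDF down-weights words that occur in every document.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini-corpus; every document contains "the".
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

count_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()

counts = count_vec.fit_transform(docs)    # raw term counts
weights = tfidf_vec.fit_transform(docs)   # TF-IDF weights, L2-normalized per row

# Compare a very common word with a rare one in the third document.
count_ratio = counts[2, count_vec.vocabulary_["the"]] / counts[2, count_vec.vocabulary_["chased"]]
tfidf_ratio = weights[2, tfidf_vec.vocabulary_["the"]] / weights[2, tfidf_vec.vocabulary_["chased"]]

print("count ratio the/chased: ", count_ratio)
print("tf-idf ratio the/chased:", tfidf_ratio)
```

The raw counts make "the" twice as important as "chased" in the third sentence, while the TF-IDF weights bring the two much closer together.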

# Singular Value Decomposition

## Exercise 2 (1pt)

With Singular Value Decomposition (SVD) you can reduce the dimensionality
of your embeddings. Try `sklearn.decomposition.TruncatedSVD` and
see how your embeddings change! Consider the following usage example:

In [None]:
from sklearn.decomposition import TruncatedSVD
from wordspace import cooccurrence_matrix

with open('brown.txt', 'r') as f:
    brown = f.read()
some_text = brown

# Reuse C if it already exists; otherwise build the co-occurrence matrix.
try:
    C
except NameError:
    C, V = cooccurrence_matrix(some_text)

svd = TruncatedSVD(
    n_components=100, algorithm="randomized",
    n_iter=5, random_state=42, tol=0.
)
new_C = svd.fit_transform(C)

In [None]:
new_C.shape

In [None]:
C.shape

In [None]:
C

In [None]:
new_C

In [None]:
del C
del new_C

The `TruncatedSVD` parameters used above:

- `n_components` - desired embedding dimension
- `algorithm` - SVD solver to use; either "arpack" or "randomized"
- `n_iter` - number of iterations for the randomized SVD solver (not used by ARPACK)
- `random_state` - seed for the pseudo-random number generator
- `tol` - tolerance for ARPACK; ignored by the randomized SVD solver
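
The following sketch runs `TruncatedSVD` on a small random sparse matrix so that the effect of `n_components` on the output shape can be checked quickly without loading the corpus. The matrix `toy_C` and its dimensions are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Hypothetical stand-in for a co-occurrence matrix: 50 words x 30 context features.
rng = np.random.default_rng(0)
toy_C = csr_matrix(rng.poisson(lam=1.0, size=(50, 30)).astype(float))

svd = TruncatedSVD(
    n_components=5, algorithm="randomized",
    n_iter=5, random_state=42
)
reduced = svd.fit_transform(toy_C)

# Each word keeps its row, but is now described by only 5 numbers.
print(toy_C.shape, "->", reduced.shape)
print("explained variance ratio:", svd.explained_variance_ratio_.sum())
```

The summed explained variance ratio gives a rough indication of how much information is kept at a given `n_components`.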

# Word2Vec

## Exercise 3.1 (1pt)

Use the following code snippet to train your own word2vec model on the
Brown corpus (or any other large text file you have). `semantic_tests.py`
contains some tests for your embeddings.

In [None]:
from semantic_tests import semantic_tests
from gensim.models.word2vec import Word2Vec
import nltk.data
from nltk.tokenize import word_tokenize
import logging
logging.basicConfig(
    format='%(asctime)s: %(levelname)s: %(message)s',
    level=logging.INFO
)

#nltk.download('punkt')

sent = nltk.data.load(
    'tokenizers/punkt/english.pickle'
)

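with open('brown.txt', 'r') as f:
    sentences = sent.tokenize(f.read())
    # Build a list rather than a one-shot `map` iterator: gensim iterates
    # over the corpus several times (vocabulary scan plus every epoch).
    sentences = [word_tokenize(s) for s in sentences]

# Parameter names follow gensim >= 4.0; older releases use
# `size` and `iter` instead of `vector_size` and `epochs`.
model = Word2Vec(
    sentences, vector_size=100, window=5,
    min_count=5, hs=0, negative=5,
    cbow_mean=1, epochs=5, workers=3
)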

semantic_tests(model.wv)

## Exercise 3.2 (1pt)

Instead of training your own word2vec model, you can also download pretrained
embeddings and load them into `gensim`. Do they perform better on
your `semantic_tests`?

A popular pretrained option is the Google News model, containing 300-dimensional embeddings for 3 million words and phrases. Download the binary file `GoogleNews-vectors-negative300.bin` (1.3 GB compressed) from https://code.google.com/archive/p/word2vec/.

In [None]:
from gensim.models import KeyedVectors
from semantic_tests import semantic_tests

model = KeyedVectors.load_word2vec_format(
    'path/to/GoogleNews-vectors-negative300.bin.gz',
    binary=True
)

semantic_tests(model)

# Implementation in PyTorch

## Exercise 4 (2pt)


- Train a word2vec skip-gram model on the sentence "the quick brown fox jumps over the lazy dog". Assume context window = 2 and embedding_dim = 5. Do no preprocessing apart from tokenization.
- Compute the model's output probabilities for the words "lazy" and "dog". If you have trained the model correctly, the output probabilities for "lazy" should be high for the words "over", "the", and "dog" (close to 1/3 each) and low for all other words (close to 0 each). For "dog", the output probabilities should be high for the words "the" and "lazy" (close to 1/2 each) and low for all other words (close to 0 each).
- Compute the dot product between the vector of "dog" and the vector of "lazy" (for example, the center vector of one word and the context vector of the other), and between "dog" and "brown". Which one is higher? Why?
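
The target probabilities stated above follow directly from the training pairs. The sketch below is not the model itself; it only enumerates the (center, context) pairs for window = 2 and computes the empirical context distribution per center word, which is what a sufficiently trained softmax output should approach. The helper `empirical_distribution` is introduced here for illustration.

```python
from collections import Counter, defaultdict

sentence = "the quick brown fox jumps over the lazy dog"
tokens = sentence.split()   # tokenization only, no other preprocessing
window = 2

# Enumerate all (center, context) pairs within the window.
pairs = []
for i, center in enumerate(tokens):
    lo = max(0, i - window)
    hi = min(len(tokens), i + window + 1)
    for j in range(lo, hi):
        if j != i:
            pairs.append((center, tokens[j]))

# Count how often each context word appears for each center word.
context_counts = defaultdict(Counter)
for center, context in pairs:
    context_counts[center][context] += 1

def empirical_distribution(word):
    """Relative frequency of each context word given the center word."""
    total = sum(context_counts[word].values())
    return {ctx: n / total for ctx, n in context_counts[word].items()}

print(len(pairs), "training pairs")
print("lazy:", empirical_distribution("lazy"))
print("dog: ", empirical_distribution("dog"))
```

These pairs can also serve as the training data for the skip-gram model, with the center word as input and the context word as the classification target.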


You can follow this tutorial: https://towardsdatascience.com/implementing-word2vec-in-pytorch-skip-gram-model-e6bae040d2fb

Use PyTorch (or TensorFlow).

In [None]:
# Install PyTorch first if it is not available. https://pytorch.org/
# lists the install command for your platform, e.g. in a notebook:
# !pip install torch
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

In [None]:
sentence = "the quick brown fox jumps over the lazy dog"

With a larger vocabulary, a word2vec model needs a LOT of data to produce reasonable results, and processing that much data requires heavily optimized code. Writing such code is better suited to a project than to a simple exercise, so in the next exercise we will use [gensim](https://radimrehurek.com/gensim/), a library built for efficient training of word vectors.

## * Exercise (2pt)

- Use [gensim](https://radimrehurek.com/gensim/) to train a word2vec model on [OpinRank](http://kavita-ganesan.com/entity-ranking-data/). You can follow this [tutorial](https://medium.freecodecamp.org/how-to-get-started-with-word2vec-and-then-how-to-make-it-work-d0a2fca9dad3), but make sure you use negative sampling.
- Find the 10 words most similar to "dirty" and to "canada".
- Check whether the similarity between "dirty" and "dusty" is greater than the similarity between "dirty" and "clean".