# Assignment 1.3: Naive word2vec (40 points)

The task is simple to state. Follow this [paper](https://arxiv.org/pdf/1411.2738.pdf) and implement word2vec as a two-layer neural network with weight matrices $W$ and $W'$. The first matrix projects words into a low-dimensional 'hidden' space, and the second projects them back into the high-dimensional vocabulary space.

![word2vec](https://i.stack.imgur.com/6eVXZ.jpg)
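
To make the two-matrix picture concrete, here is a minimal NumPy sketch of the CBOW forward pass as the paper describes it, with no bias and no nonlinearity between $W$ and $W'$. The toy sizes, the random initialization, and the names `cbow_forward`, `W_in`, and `W_out` are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def softmax(u):
    """Numerically stable softmax over the last axis."""
    e = np.exp(u - u.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cbow_forward(context_ids, W_in, W_out):
    """Return a probability distribution over the vocabulary.

    context_ids: list of word indices in the context window
    W_in:  (vocab_size, hidden_size) input matrix, W in the paper
    W_out: (hidden_size, vocab_size) output matrix, W' in the paper
    """
    # Multiplying an averaged one-hot vector by W_in is the same as
    # averaging the corresponding rows of W_in.
    h = W_in[context_ids].mean(axis=0)
    u = h @ W_out
    return softmax(u)

# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
vocab_size, hidden_size = 10, 4
W_in = rng.normal(scale=0.1, size=(vocab_size, hidden_size))
W_out = rng.normal(scale=0.1, size=(hidden_size, vocab_size))

context = [1, 3, 5, 7]
probs = cbow_forward(context, W_in, W_out)

# The same hidden vector, computed through an explicit averaged multi-hot input.
x = np.zeros(vocab_size)
for i in context:
    x[i] += 1.0 / len(context)
h_dense = x @ W_in
```

The last lines show why a row lookup can replace the dense one-hot multiplication: both routes give the same hidden vector, but the lookup avoids building a `vocab_size`-long input for every example.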

You can use TensorFlow/PyTorch and code from your previous task.

## Results of this task: (30 points)
 * trained word vectors (mention how long training took)
 * a plot of the loss (so we can see that it has converged)
 * a function that maps a token to its word vector
 * beautiful visualizations (PCA, t-SNE); you can use TensorBoard and play with your vectors in 3D (don't forget to add screenshots to the task)

## Extra questions: (10 points)
 * Intrinsic evaluation: you can find datasets [here](http://download.tensorflow.org/data/questions-words.txt)
 * Extrinsic evaluation: you can use [these](https://medium.com/@dataturks/rare-text-classification-open-datasets-9d340c8c508e)

You can also use any other datasets for quantitative evaluation.
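
For the intrinsic evaluation, the linked `questions-words.txt` file lists analogy questions as four words per line, with category headers on lines that start with `:`. One possible way to score embeddings on it is sketched below. The helper names, the toy vectors, and the choice to lowercase the questions (text8 is lowercase) are assumptions made for illustration.

```python
import numpy as np

def parse_analogies(lines, vocab):
    """Yield (a, b, c, d) lowercase tuples whose words are all in vocab."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(':'):  # ':' starts a category header
            continue
        words = line.lower().split()
        if len(words) == 4 and all(w in vocab for w in words):
            yield tuple(words)

def solve_analogy(a, b, c, vectors):
    """Return the word closest (by cosine) to vec(b) - vec(a) + vec(c)."""
    target = vectors[b] - vectors[a] + vectors[c]
    target = target / np.linalg.norm(target)
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):  # the question words are excluded by convention
            continue
        score = float(vec @ target / np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

def analogy_accuracy(lines, vectors):
    """Fraction of in-vocabulary questions answered correctly."""
    questions = list(parse_analogies(lines, vectors))
    if not questions:
        return 0.0
    correct = sum(solve_analogy(a, b, c, vectors) == d for a, b, c, d in questions)
    return correct / len(questions)

# Hand-made 2D vectors, for illustration only.
toy_vectors = {
    'man':   np.array([1.0, 0.0]),
    'woman': np.array([1.0, 1.0]),
    'king':  np.array([3.0, 0.0]),
    'queen': np.array([3.0, 1.0]),
    'apple': np.array([-1.0, 2.0]),
}
toy_lines = [': family', 'man woman king queen', 'man woman king unknownword']
accuracy = analogy_accuracy(toy_lines, toy_vectors)
```

Questions containing out-of-vocabulary words are skipped here rather than counted as errors. With a 15,000-word vocabulary that can remove a large share of the file, so it is worth reporting how many questions were actually scored.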

Again, it is **highly recommended** to read this [paper](https://arxiv.org/pdf/1411.2738.pdf).

Example of visualization in TensorBoard:
https://projector.tensorflow.org

Example of 2D visualization:

![2dword2vec](https://www.tensorflow.org/images/tsne.png)

In [131]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
%matplotlib inline
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from tqdm import tqdm

import random
import numpy as np

from batcher import Batcher

SEED = 42
USE_GPU = True
dtype = torch.float32 

if USE_GPU and torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')
    
print('using device:', device)
random.seed(SEED)
torch.manual_seed(SEED)
np.random.seed(SEED)

using device: cpu


In [125]:
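class CBOW(nn.Module):
    """Two-layer CBOW network: vocabulary -> hidden -> vocabulary."""

    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        # linear1 plays the role of W (input embeddings), linear2 of W'.
        self.linear1 = nn.Linear(vocab_size, hidden_size)
        self.linear2 = nn.Linear(hidden_size, vocab_size)

    def forward(self, x):
        # x: (batch_size, vocab_size) averaged bag-of-words context vectors.
        x = self.linear1(x)
        # The referenced paper uses a purely linear hidden layer; the ReLU
        # (and the bias terms of nn.Linear) are additions of this implementation.
        x = F.relu(x)
        x = self.linear2(x)
        # Log-probabilities over the vocabulary, as expected by nn.NLLLoss.
        return F.log_softmax(x, dim=1)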

In [10]:
with open('./text8') as file:
    data = file.read().split()

vocab_size = 15000
batch_size = 1000
window_size = 5
train_data = data[:30000]

batcher = Batcher(train_data, window_size, batch_size, vocab_size)
data_loader = iter(batcher)

In [11]:
hidden_size = 30

model = CBOW(vocab_size=vocab_size, hidden_size=hidden_size)
model = model.to(device=device)

loss = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=1)

In [123]:
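def OHE(data, vocab_size, batch_size=2):
    """Encode word indices as bag-of-words vectors.

    With batch_size != 1, `data` is a sequence of contexts and the result has
    shape (len(data), vocab_size); otherwise `data` is one list of indices.
    """
    if batch_size != 1:
        ohe = torch.zeros((len(data), vocab_size))
        for index, words in enumerate(data):
            for word in words:
                # Count every occurrence so repeated context words weigh more.
                ohe[index, word] += 1
    else:
        ohe = torch.zeros(vocab_size)
        for word in data:
            ohe[word] = 1

    return ohe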

In [51]:
### Train

In [None]:
epochs = 30000  # number of training iterations (one batch each), not passes over the data

print_every = len(train_data) // batch_size
losses = []
model.train()
for epoch in tqdm(range(epochs)):
    batch, labels = next(data_loader)
    # OHE already returns a tensor, so move it to the device instead of re-wrapping it.
    X = (OHE(batch, vocab_size) / window_size).to(device=device, dtype=dtype)
    y = torch.tensor(labels, device=device, dtype=torch.long)

    preds = model(X)
    los = loss(preds, y)

    optimizer.zero_grad()
    los.backward()
    optimizer.step()

    losses.append(los.item())
    if epoch % print_every == 0:
        recent = losses[-print_every:]
        print('Iteration %d, mean loss = %.4g' % (epoch, sum(recent) / len(recent)))
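
Per-batch losses are noisy, so a smoothed curve usually makes convergence easier to judge. The sketch below is one simple way to smooth the `losses` list with a trailing moving average. The function name `moving_average` and the example values are made up for illustration.

```python
def moving_average(values, window):
    """Average each value with up to `window - 1` preceding values."""
    if window < 1:
        raise ValueError('window must be at least 1')
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that just left the window.
            running_sum -= values[i - window]
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed

# Made-up loss values, for illustration only.
example_losses = [9.0, 7.0, 8.0, 4.0, 6.0, 2.0]
smoothed = moving_average(example_losses, window=3)
```

Passing something like `moving_average(losses, print_every)` to `plt.plot` should then give a curve that is easier to read than the raw values.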

# Train
<img src = "./loss.png"></img>

In [None]:
model = CBOW(vocab_size=vocab_size, hidden_size=hidden_size)
model.load_state_dict(torch.load("./model", map_location=torch.device('cpu')))
model.eval()  # inference mode for the lookups below

In [32]:
def word2vec(word):  # token -> embedding, a NumPy array of length hidden_size
    return model.linear1(OHE([batcher.w2i[word]], vocab_size, batch_size=1)).detach().cpu().numpy()

In [99]:
def cos_similarity(word1, word2):
    return np.dot(word1, word2) / (np.linalg.norm(word1)* np.linalg.norm(word2))

In [119]:
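def most_common(word):
    """Return the 10 vocabulary words most cosine-similar to `word`."""
    query = word2vec(word)
    scored = []
    for w in set(batcher.data):  # visit each distinct word index once
        candidate = batcher.i2w[w]
        scored.append((cos_similarity(word2vec(candidate), query), candidate))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in scored[:10]]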

In [130]:
most_common("camera")

['camera',
 'sets',
 'determines',
 'alanine',
 'archaeological',
 'islands',
 'user',
 'biochemistry',
 'cameras',
 'mode']
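
Calling `word2vec` once per candidate word is slow. A possible alternative, assuming the embeddings are stacked into one matrix with a row per vocabulary index, is to compute all cosine similarities in a single matrix product. The function name `nearest_neighbours` and the toy matrix below are illustrative assumptions.

```python
import numpy as np

def nearest_neighbours(query_index, embeddings, k=10):
    """Return indices of the k rows most cosine-similar to the query row."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero rows
    scores = unit @ unit[query_index]
    return np.argsort(-scores)[:k].tolist()

# Hand-made embeddings, for illustration only.
toy_embeddings = np.array([
    [1.0, 0.0],   # index 0
    [0.9, 0.1],   # index 1, close to index 0
    [0.0, 1.0],   # index 2, orthogonal to index 0
    [-1.0, 0.0],  # index 3, opposite to index 0
])
neighbours = nearest_neighbours(0, toy_embeddings, k=3)
```

For the model above, `model.linear1.weight.detach().cpu().numpy().T` has shape `(vocab_size, hidden_size)` and could serve as such a matrix. Note that `word2vec` also adds the layer's bias, so the rankings may differ slightly unless the bias is added to every row as well.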

In [None]:
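# Newer scikit-learn releases rename `n_iter` to `max_iter`.
tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
plot_only = 200
# Embed the words with indices 0..plot_only-1 and keep their labels.
final_embeddings = [word2vec(batcher.i2w[i]) for i in range(plot_only)]
low_dim_embs = tsne.fit_transform(np.array(final_embeddings))
labels = [batcher.i2w[i] for i in range(plot_only)]

plt.figure(figsize=(12, 12))
for (px, py), label in zip(low_dim_embs, labels):
    plt.scatter(px, py)
    plt.annotate(label, xy=(px, py), xytext=(5, 2), textcoords='offset points')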
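
The task also mentions PCA as a visualization option. A minimal sketch of a PCA projection using only NumPy's SVD is given below. The function name `pca_project` and the random stand-in data are assumptions, and `sklearn.decomposition.PCA` would be an equally reasonable choice.

```python
import numpy as np

def pca_project(vectors, n_components=2):
    """Project rows of `vectors` onto their top principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Random stand-in for trained embeddings, for illustration only.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(200, 30))
projected = pca_project(fake_embeddings)
```

Unlike t-SNE, this projection is linear and deterministic, which makes it a useful sanity check before spending time on t-SNE hyperparameters.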