# Assignment 1.3: Naive word2vec (40 points)

This task can be formulated very simply. Follow this [paper](https://arxiv.org/pdf/1411.2738.pdf) and implement word2vec as a two-layer neural network with matrices $W$ and $W'$. One matrix projects words into a low-dimensional 'hidden' space, and the other projects them back into the high-dimensional vocabulary space.

![word2vec](https://i.stack.imgur.com/6eVXZ.jpg)

You can use TensorFlow/PyTorch and code from your previous task.
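
The two-matrix view can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not reference code from the paper: the toy sizes `V` and `N`, the random initialization, and the helper name `cbow_forward` are all made up for the example.

```python
import numpy as np

def cbow_forward(context_ids, W, W_prime):
    """Return a probability distribution over the vocabulary for one context."""
    # Hidden layer: average of the input vectors (rows of W) of the context words.
    h = W[context_ids].mean(axis=0)           # shape (N,)
    # Output scores: project the hidden vector back to vocabulary space.
    u = h @ W_prime                           # shape (V,)
    # Softmax, shifted by the max score for numerical stability.
    e = np.exp(u - u.max())
    return e / e.sum()

rng = np.random.default_rng(0)
V, N = 10, 4                                  # toy vocabulary and hidden sizes (assumed)
W = rng.normal(scale=0.1, size=(V, N))        # input -> hidden
W_prime = rng.normal(scale=0.1, size=(N, V))  # hidden -> output
probs = cbow_forward([1, 2, 4, 5], W, W_prime)
```

After training, the rows of `W` are what is usually taken as the word vectors.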

## Results of this task: (30 points)
 * trained word vectors (mention somewhere how long training took)
 * a plot of the loss (so we can see that it has converged)
 * a function that maps a token to its word vector
 * beautiful visualizations (PCA, t-SNE); you can use TensorBoard and explore your vectors in 3D (don't forget to add screenshots to the task)
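
Two of the items above (the token lookup and a PCA projection) might look roughly like the sketch below. The helper names `get_vector` and `pca_2d`, the fallback to an `'UNK'` token, and the toy data are assumptions for illustration; t-SNE is not covered here.

```python
import numpy as np

def get_vector(token, word2index, embeddings, unk_token='UNK'):
    """Look up the embedding row for a token, falling back to the unknown token."""
    index = word2index.get(token, word2index[unk_token])
    return embeddings[index]

def pca_2d(embeddings):
    """Project embeddings onto their first two principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Toy data standing in for trained vectors (assumed, for illustration only).
word2index = {'UNK': 0, 'king': 1, 'queen': 2, 'apple': 3}
embeddings = np.random.default_rng(0).normal(size=(4, 8))
points = pca_2d(embeddings)   # one (x, y) pair per word, ready for a scatter plot
```

With a trained PyTorch model, the `embeddings` array could be taken from the embedding layer's weight, for example via `.weight.detach().cpu().numpy()`.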

## Extra questions: (10 points)
 * Intrinsic evaluation: you can find datasets [here](http://download.tensorflow.org/data/questions-words.txt)
 * Extrinsic evaluation: you can use [these](https://medium.com/@dataturks/rare-text-classification-open-datasets-9d340c8c508e)

You can also use any other dataset for quantitative evaluation.
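
For the intrinsic evaluation, each question line in the analogy file holds four words `a b c d`, and the usual check is whether the vector nearest to `vec(b) - vec(a) + vec(c)` belongs to `d`. The sketch below shows one common way to answer a single question with cosine similarity; the function name `solve_analogy` and the hand-made vectors are assumptions, and loading the file is left out.

```python
import numpy as np

def solve_analogy(a, b, c, word2index, index2word, embeddings):
    """Return the word whose vector is closest (cosine) to vec(b) - vec(a) + vec(c)."""
    # Normalize rows so that a dot product is proportional to cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    query = unit[word2index[b]] - unit[word2index[a]] + unit[word2index[c]]
    scores = unit @ query
    # The three question words are excluded from the candidates.
    for w in (a, b, c):
        scores[word2index[w]] = -np.inf
    return index2word[int(np.argmax(scores))]

# Hand-made 3-D vectors (assumed) in which the analogy holds by construction.
words = ['man', 'woman', 'king', 'queen', 'apple']
embeddings = np.array([[1.0, 0.0, 0.0],
                       [1.0, 1.0, 0.0],
                       [1.0, 0.0, 1.0],
                       [1.0, 1.0, 1.0],
                       [-1.0, 0.5, 0.0]])
word2index = {w: i for i, w in enumerate(words)}
index2word = {i: w for i, w in enumerate(words)}
answer = solve_analogy('man', 'woman', 'king', word2index, index2word, embeddings)
```

Accuracy over the whole file is then the fraction of questions for which the returned word equals `d`, skipping questions that contain out-of-vocabulary words.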

Again, it is **highly recommended** that you read this [paper](https://arxiv.org/pdf/1411.2738.pdf).

Example of visualization in TensorBoard:
https://projector.tensorflow.org

Example of 2D visualization:

![2dword2vec](https://www.tensorflow.org/images/tsne.png)

In [1]:
import gc
import re
import string
from collections import Counter

import nltk
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from nltk.corpus import stopwords
from torch.optim.lr_scheduler import ExponentialLR, ReduceLROnPlateau

gc.collect()
nltk.download('stopwords')  # fetch the NLTK stop-word list if it is not cached yet
STOP_WORDS = set(stopwords.words('english'))
torch.manual_seed(1)  # make weight initialization reproducible
print(len(STOP_WORDS))

0

### Create a custom batcher with a batch generator

In [92]:
class Batcher:
    def __init__(self, max_len, window_size, corpus_path, min_freq, max_freq, max_voc_size, batch_size):
        self.corpus_path = corpus_path
        self.window_size = window_size
        self.min_freq = min_freq
        self.max_freq = max_freq
        self.max_voc_size = max_voc_size
        self.batch_size = batch_size
        self.max_len = max_len
        self.words = None
        self.word2index = None
        self.index2word = None
        self.freq = None
        self.voc = None
        self.voc_size = None
        self.corpus = None
        self.corpus_size = None
        
        
    def read_data(self, S):
        # Read the corpus from disk unless the text is passed in directly.
        if S is None:
            with open(self.corpus_path, 'r') as f:
                S = f.read()
        # Lowercase and keep at most max_len characters.
        S = S.lower()[: self.max_len]
        print('Len of S = ', len(S))
        # Replace punctuation with spaces, then split on whitespace.
        regex = re.compile('[%s]' % re.escape(string.punctuation))
        S = regex.sub(' ', S)
        # Drop English stop words.
        words = [word for word in S.split() if word not in STOP_WORDS]
        
        print('Size of words = ', len(words))
        counter = Counter(words)
        print('Size of counter = ', len(counter))
        # Keep only words whose frequency lies in [min_freq, max_freq].
        if self.min_freq is not None:
            counter = {x : counter[x] for x in counter if counter[x] >= self.min_freq}
        print('Size of counter after min_freq = ', len(counter))
        if self.max_freq is not None:
            counter = {x : counter[x] for x in counter if counter[x] <= self.max_freq}
        print('Size of counter after max_freq = ', len(counter))
        counter = Counter(counter)

        freq = dict(counter.most_common(self.max_voc_size))
        voc = set(freq)
        
        unk = set(words).difference(voc)
        print('Size of freq dict = ', len(voc))
        print('Number of vocabulary words = ', len(voc))
        print('Number of unknown words = ', len(unk))

        words = ['UNK' if word in unk else word for word in words]        
        if len(words)%self.batch_size == 0:
            padding = self.window_size
        else:
            padding = self.batch_size - len(words)%self.batch_size + self.window_size
            
        words = ['PAD']*self.window_size + words + ['PAD']*padding
        unique_words = list(set(words))
        print('Size of corpus = ', len(words))
        print('Size of vocabulary = ', len(unique_words))
        self.word2index = {k: v for v, k in enumerate(unique_words)}
        self.index2word = {v: k for v, k in enumerate(unique_words)}
        words = [self.word2index[word] for word in words]
        self.freq = Counter(words)
        self.voc = set(self.freq)
        self.voc_size = len(self.voc)
        self.corpus = words
        self.corpus_size = len(words)
    
    def generator(self):
        i = self.window_size
        x_batch = []
        y_batch = []
        
        while i < self.corpus_size-self.window_size:
            if len(x_batch)==self.batch_size:
                x_batch = []
                y_batch = []
                
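            # Context: window_size token ids on each side of position i.
            x = self.corpus[i-self.window_size: i] + self.corpus[i+1: i+self.window_size+1]
            # Target: one-hot vector over the vocabulary for the center token.
            y = [0]*self.voc_size
            y[self.corpus[i]] = 1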
            x_batch.append(x)
            y_batch.append(y)
            i += 1
            if len(x_batch)==self.batch_size:
                yield np.array(x_batch), np.array(y_batch)
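
The generator above builds a one-hot target of length `voc_size` for every example, which with the vocabulary reported below means roughly 64 × 71142 integers per batch. A lighter alternative is to yield only the index of the center word, which is also the target format that `nn.NLLLoss` expects (class indices). The following is a sketch of that idea as a standalone function; `index_batches` and the toy corpus are assumptions, not part of the `Batcher` class.

```python
import numpy as np

def index_batches(corpus, window_size, batch_size):
    """Yield (context, target) batches where target holds center-word indices."""
    x_batch, y_batch = [], []
    for i in range(window_size, len(corpus) - window_size):
        # Context: window_size token ids on each side of position i.
        x_batch.append(corpus[i - window_size:i] + corpus[i + 1:i + window_size + 1])
        # Target: the id of the center token, not a one-hot vector.
        y_batch.append(corpus[i])
        if len(x_batch) == batch_size:
            yield np.array(x_batch), np.array(y_batch)
            x_batch, y_batch = [], []

toy_corpus = [0, 0, 5, 6, 7, 8, 9, 0, 0]   # assumed ids, with 0 standing in for 'PAD'
batches = list(index_batches(toy_corpus, window_size=2, batch_size=5))
```

Like the original generator, this sketch drops a final batch that is not full.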

### Initialize Batcher with parameters

In [93]:
BATCH_SIZE = 64
MAX_LEN = 1000000000
batcher = Batcher(max_len=MAX_LEN, window_size=2, corpus_path='text8', min_freq=5, max_freq=None, max_voc_size=10000000, batch_size=BATCH_SIZE)
batcher.read_data(S=None)

Len of S =  100000000
17005207
10890638
Size of words =  10890638
Size of counter =  253702
Size of counter after min_freq =  71140
Size of counter after max_freq =  71140
Size of freq dict =  71140
Number of vocabulary words =  71140
Number of unknown words =  182562
Size of corpus =  10890692
Size of vocabulary =  71142


### Check dimensions

In [94]:
for x, y in batcher.generator():
    print(x.shape, y.shape)
    break

(64, 4) (64, 71142)


### Check value of the first batch

In [1]:
x_words = []
for i in range(BATCH_SIZE):
    line = []
    for j in range(batcher.window_size*2):
        line.append(batcher.index2word[x[i, j]])
    x_words.append(line)
x_words

In [2]:
for i in range(BATCH_SIZE):
    print(batcher.index2word[list(y[i]).index(1)])

### Create CBOW class using PyTorch

In [98]:
USE_GPU = True
dtype = torch.float32  # float32 is used throughout this notebook
if USE_GPU and torch.cuda.is_available():
    device = torch.device('cuda')
    print(torch.cuda.get_device_name(0))
else:
    # get_device_name would fail without a GPU, so only report the fallback.
    device = torch.device('cpu')
    print('Using CPU')

GeForce GTX 1050 Ti


In [99]:
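class CBOW(nn.Module):
    def __init__(self, voc_size, embedding_dim, window_size, batch_size):
        super(CBOW, self).__init__()
        # Embedding table: one embedding_dim-sized vector per vocabulary entry.
        self.embedding1 = nn.Embedding(voc_size, embedding_dim)
        # Projection from the hidden space back to vocabulary scores.
        self.linear1 = nn.Linear(embedding_dim, voc_size)
        
    def forward(self, inputs):
        # inputs is already a tensor of token ids, shape (batch, 2*window_size);
        # re-wrapping it in torch.tensor() only triggers a copy warning.
        embs1 = self.embedding1(inputs)
        # Note: the context embeddings are not averaged here, so the output
        # has shape (batch, 2*window_size, voc_size).
        z1 = self.linear1(embs1)
        log_softmax = F.log_softmax(z1, dim=2)
        return log_softmax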
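
The `CBOW` class above applies the linear layer to each context embedding separately. In the CBOW formulation of the linked paper, the hidden layer is the average of the context vectors, so the model produces one distribution per example. The sketch below shows that variant under stated assumptions: the class name `AveragingCBOW`, the toy sizes, and the random inputs are made up for illustration, and targets are class indices of shape `(batch,)` rather than the one-hot vectors produced by the batcher.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AveragingCBOW(nn.Module):
    """CBOW sketch: average the context embeddings, then score every vocabulary word."""
    def __init__(self, voc_size, embedding_dim):
        super().__init__()
        self.embedding = nn.Embedding(voc_size, embedding_dim)  # plays the role of W
        self.linear = nn.Linear(embedding_dim, voc_size)        # plays the role of W' (plus a bias)

    def forward(self, inputs):
        # inputs: LongTensor of token ids, shape (batch, 2 * window_size)
        hidden = self.embedding(inputs).mean(dim=1)         # (batch, embedding_dim)
        return F.log_softmax(self.linear(hidden), dim=1)    # (batch, voc_size)

model = AveragingCBOW(voc_size=20, embedding_dim=8)
context = torch.randint(0, 20, (4, 4))   # toy batch of 4 contexts (assumed)
target = torch.randint(0, 20, (4,))      # center-word indices, shape (batch,)
log_probs = model(context)
loss = nn.NLLLoss()(log_probs, target)
```

With index targets, `nn.NLLLoss` picks the log-probability of the correct word for each example, which is the cross-entropy objective described in the paper.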

### Run training with Exponential Scheduler

In [76]:
losses = []
loss_function = nn.NLLLoss()
model = CBOW(voc_size=batcher.voc_size, embedding_dim=50, window_size=batcher.window_size, batch_size=batcher.batch_size)
model.to(device)
optimizer = optim.Adam(model.parameters(), lr=0.01)
lr_scheduler = ExponentialLR(optimizer, gamma=0.9)

for epoch in range(3):
    print('========== Epoch {} =========='.format(epoch))
    total_loss = 0
    i = 1
    N = len(batcher.corpus)//BATCH_SIZE
    model.train()
    for context, target in batcher.generator():
        context = torch.tensor(context).to(device)
        target = torch.tensor(target).to(device)
        
        log_probs = model(context)
        loss = loss_function(log_probs, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
        if i%10==0 or i==1:
            print('Batch {}/{}'.format(i, N))
            print('Loss = {}'.format(loss.item()))
            lr = float(optimizer.param_groups[0]['lr'])
            print("Learning rate = {}".format(lr), '\n')
        # Decay the learning rate by gamma every 10 batches.
        if i%10==0:
            lr_scheduler.step()
        i += 1
        # Store a Python float; keeping the tensor would retain its autograd graph.
        losses.append(loss.item())

Batch 1/17046
Loss = 10.885300636291504
Learning rate = 0.01 



Batch 10/17046
Loss = 10.476454734802246
Learning rate = 0.01 

Batch 20/17046
Loss = 10.166227340698242
Learning rate = 0.009000000000000001 

Batch 30/17046
Loss = 10.074182510375977
Learning rate = 0.008100000000000001 

Batch 40/17046
Loss = 10.000591278076172
Learning rate = 0.007290000000000001 

Batch 50/17046
Loss = 9.957330703735352
Learning rate = 0.006561000000000002 

Batch 60/17046
Loss = 9.918911933898926
Learning rate = 0.005904900000000002 

Batch 70/17046
Loss = 9.914350509643555
Learning rate = 0.005314410000000002 

Batch 80/17046
Loss = 9.9038724899292
Learning rate = 0.004782969000000002 

Batch 90/17046
Loss = 9.896750450134277
Learning rate = 0.004304672100000002 

Batch 100/17046
Loss = 9.889885902404785
Learning rate = 0.003874204890000002 

Batch 110/17046
Loss = 9.885173797607422
Learning rate = 0.003486784401000002 

Batch 120/17046
Loss = 9.882335662841797
Learning rate = 0.003138105960900002 

Batch 130/17046
Loss = 9.882954597473145
Learning rate = 0.0028

KeyboardInterrupt: 
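
The learning rates printed above follow directly from stepping `ExponentialLR` with `gamma=0.9` once every 10 batches. The small helper below reproduces that arithmetic; `printed_lr` is a hypothetical name, and the closed form ignores the tiny floating-point differences visible in the log.

```python
def printed_lr(batch, base_lr=0.01, gamma=0.9, every=10):
    """Learning rate in effect while processing `batch` (1-based)."""
    # The rate is printed before the scheduler step on the same batch, so only
    # steps taken on earlier multiples of `every` count.
    steps_so_far = (batch - 1) // every
    return base_lr * gamma ** steps_so_far

for b in (1, 20, 130, 1000):
    print(b, printed_lr(b))
```

At this pace the rate falls below 1e-6 within the first 1000 batches of a 17046-batch epoch, so a slower schedule (for example, one step per epoch) may be worth trying.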

### Run training with ReduceLROnPlateau Scheduler

In [None]:
losses = []
loss_function = nn.NLLLoss()
model = CBOW(voc_size=batcher.voc_size, embedding_dim=256, window_size=batcher.window_size, batch_size=batcher.batch_size)
model.to(device)
optimizer = optim.Adam(model.parameters(), lr=0.01)
# Halve the learning rate when the monitored loss stops improving.
lr_scheduler = ReduceLROnPlateau(optimizer=optimizer,
                                 mode='min',
                                 factor=0.5,
                                 threshold=0.001)

for epoch in range(3):
    print('========== Epoch {} =========='.format(epoch))
    total_loss = 0
    i = 1
    N = len(batcher.corpus)//BATCH_SIZE
    model.train()
    for context, target in batcher.generator():
        context = torch.tensor(context).to(device)
        target = torch.tensor(target).to(device)
        
        log_probs = model(context)
        loss = loss_function(log_probs, target)
        
        if i%100==0 or i==1:
            print('Batch {}/{}'.format(i, N))
            print('Loss = {}'.format(round(loss.item(), 3)))
            lr = float(optimizer.param_groups[0]['lr'])
            print("Learning rate = {}".format(lr), '\n')
        # Report the current batch loss to the scheduler every 100 batches.
        if i%100==0:
            lr_scheduler.step(loss.item())
            
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        i += 1
        # Store a Python float; keeping the tensor would retain its autograd graph.
        losses.append(loss.item())

Batch 1/170167
Loss = 12.149
Learning rate = 0.01 



Batch 100/170167
Loss = 11.177
Learning rate = 0.01 

Batch 200/170167
Loss = 11.173
Learning rate = 0.01 

Batch 300/170167
Loss = 11.172
Learning rate = 0.01 

Batch 400/170167
Loss = 11.172
Learning rate = 0.01 

Batch 500/170167
Loss = 11.172
Learning rate = 0.01 

Batch 600/170167
Loss = 11.172
Learning rate = 0.01 

Batch 700/170167
Loss = 11.172
Learning rate = 0.01 

Batch 800/170167
Loss = 11.172
Learning rate = 0.01 

Batch 900/170167
Loss = 11.173
Learning rate = 0.01 

Batch 1000/170167
Loss = 11.173
Learning rate = 0.01 

Batch 1100/170167
Loss = 11.174
Learning rate = 0.01 
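
The plateau at 11.172 coincides with the loss of a model that assigns equal probability to every word in the vocabulary, $\ln(71142)$. The quick check below only shows that arithmetic; on its own it does not establish why training stays at that value.

```python
import math

voc_size = 71142                     # "Size of vocabulary" reported by the batcher
uniform_loss = math.log(voc_size)    # NLL of a uniform distribution over the vocabulary
print(round(uniform_loss, 3))
```

A loss that never drops below this value suggests the predictions carry no information about the target, so the target format and the loss input shapes are reasonable places to look first.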

