# Assignment 1.4: Negative sampling (15 points)

You may have noticed that word2vec is really slow to train. Especially with big (> 50 000) vocabularies. Negative sampling is the solution.

The task is to implement word2vec with negative sampling.

This is what was discussed in Stanford lecture. The main idea is in the formula:

$$ L = \log\sigma(u^T_o \cdot u_c) + \sum^k_{i=1} \mathbb{E}_{j \sim P(w)}[\log\sigma(-u^T_j \cdot u_c)]$$

Where $\sigma$ - sigmoid function, $u_c$ - central word vector, $u_o$ - context (outside of the window) word vector, $u_j$ - vector or word with index $j$.

The first term calculates the similarity between positive examples (word from one window)

The second term is responsible for negative samples. $k$ is a hyperparameter - the number of negatives to sample.
$\mathbb{E}_{j \sim P(w)}$
means that $j$ is distributed accordingly to unigram distribution.

Thus, it is only required to calculate the similarity between positive samples and some other negatives. Not across all the vocabulary.

Useful links:
1. [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)
1. [Distributed Representations of Words and Phrases and their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)

# README

## Installation

The whole training code for all this was written and saved as distinct package. To reproduce the code in the cells below please follow the installation guideline.

1. **[IMPORTANT]** Create new python environment with python3.6.9 and activate it.
2. Run the package installation command from the root of this repository. This command will install all the required packages from `requirements.txt` and the package `assignment1`.
```
pip install -e .
```
3. **[IMPORTANT]** Run all jupyter notebooks from the created environment
4. Load text file and upzip it to the `./data/` folder.
4. Load tensorboard logs (if you wish to see the loss plots) from []() and put them to the `./runs/` folder.
5. Load weights of the model from []() and put them to the `./models/` folder.

## Manual tests

All commands should be runned from the terminal with active environment created on the `Installation` step.

1. Run `python assignment1/dataset.py` to test the Batcher creation.
2. Run `python assignment1/train.py --force_cpu --test_mode --test_size 10000` to test train loop of the vanilla SkipGram model on cpu (`--force_cpu`) using only first 10000 tokens from the data file.
3. Run `python assignment1/train.py --force_cpu --test_mode --negative_sampling` to test train loop of the SkipGram with Negative Sampling model.
4. Run `tensorboard --logdir runs/`, then open `http://localhost:6006/` in your browser to see original logs of word2vec training.


## SkipGram with Negative Sampling

I've runned 2 experiments, but wanted to see if the first will converge better after 20th epoch.

1. lr=1e-4, negative sampling in the loop
The first training command:
```
python assignment1/train.py --task_name neg_s_skipgram_zerou_mc_5 --batch_size 2048 --num_workers 2 --num_epochs 20 --negative_sampling
```
The second training command:
```
python assignment1/train.py --task_name neg_s_skipgram_zerou_mc_5_continue --batch_size 2048 --num_workers 2 --num_epochs 15 --negative_sampling --checkpoint models/neg_s_skipgram_zerou_mc_5/model_best.pth
```
One epoch lasted 14.1 minutes, total training took ~8.2 hours on Nvidia 1060 Max-Q GPU.

2. lr=1e-3, negative sampling in the BatcherNS class
Training command:
```
python assignment1/train.py --task_name neg_s_skipgram_zerou_mc_5_lr_0.001 --batch_size 2048 --num_workers 2 --num_epochs 20 --negative_sampling --lr 1e-3
```
One epoch lasted 28.6 minutes, total training took ~9.5 hours. This learning rate seems to be better.

Please investigate learning curves in tensorboard (`manual test 4`).

<img src="./imgs/tensorboard/00_loss2.png" width="400">

In [1]:
import os
import os.path as osp

import numpy as np
import pandas as pd
import torch

from assignment1.config import config
from assignment1.dataset import SkipGramDataset

In [7]:
# task_name = 'neg_s_skipgram_zerou_mc_5_continue'
task_name = 'neg_s_skipgram_zerou_mc_5_lr_0.001'

weights = osp.join(config.model_dir, task_name, 'model_best.pth')
checkpoint = torch.load(weights, map_location='cpu')

## Save big vectors

In [8]:
word_vectors = checkpoint['state']['V.weight'].numpy()

# output_task_name = 'negative_sampling_lr1e-4'
output_task_name = 'negative_sampling_lr1e-3'
save_dir = osp.join(config.model_dir, 'final', output_task_name)
os.makedirs(save_dir, exist_ok=True)

# THESE VECTORS ARE ALREADY IN YOUR `./data/` FOLDER, BUT YOU CAN ALWAYS RE-GENERATE THEM
np.savetxt(osp.join(save_dir, 'word_vectors.tsv'), word_vectors, delimiter='\t', fmt='%.8e')
with open(osp.join(save_dir, 'meta.tsv'), 'w') as fout:
#     fout.write('\n'.join(pd.Series(dataset.token2int).sort_values().index.to_list()))
    fout.write('\n'.join(pd.Series(checkpoint['int2token']).sort_index().values))

## Save vectors for 1000 most frequent words

In [9]:
np.savetxt(osp.join(save_dir, 'word_vectors_small.tsv'), word_vectors[:1000], delimiter='\t', fmt='%.8e')
with open(osp.join(save_dir, 'meta_small.tsv'), 'w') as fout:
    fout.write('\n'.join(pd.Series(checkpoint['int2token']).sort_index().values[:1000]))

## tSNE visualization

For the model trained with negative sampling clusters in tSNE changed significantly. First, it required ~4200 iterations to come the picture like in the screenshot. Second, I failed to find cluster with numbers, and the cluster with months is not so clearly distinguished from the other words.

<img src="./imgs/tensorboard/08_tsne_ns.png" width="800">

## Intrinsic evaluation of vectors 

In [10]:
import os
import os.path as osp
import tqdm

import numpy as np
import pandas as pd

from assignment1.config import config
from assignment1.word2vec import Word2Vec

In [11]:
output_task_name = 'negative_sampling_lr1e-3'
save_dir = osp.join(config.model_dir, 'final', output_task_name)
vectors_path = osp.join(save_dir, 'word_vectors.tsv')
meta_path = osp.join(save_dir, 'meta.tsv')

In [12]:
w2v = Word2Vec(vectors_path, meta_path)

In [15]:
w2v.most_similar(['king', 'woman'], ['man'])

[('queen', 0.6287472177865924),
 ('daughter', 0.5005021417968559),
 ('throne', 0.4822215705474069),
 ('wife', 0.47699453373074796),
 ('hrh', 0.4737004855590008),
 ('kings', 0.46296776654910854),
 ('mary', 0.4541321741599966),
 ('princess', 0.4528903499811321),
 ('monarch', 0.4504280242144696),
 ('jadwiga', 0.4475856183658562)]

## Load tests

In [13]:
def parse_analogies_file(path):
    tasks = {}
    with open(path) as fin:
        for line in fin:
            line = line.strip().lower()
            if ':' in line:
                current_task = line
                tasks[current_task] = []
            else:
                words = line.split()
                pos_words = [words[0], words[2]]
                neg_words = [words[1]]
                answer = words[3]
                tasks[current_task].append({'positive': pos_words,
                                            'negative': neg_words,
                                            'answer': answer})
    return tasks

intrinsic_file = osp.join(config.data_dir, 'questions-words.txt')
tasks = parse_analogies_file(intrinsic_file)

## Define analogies tests

In [14]:
def test_result(w2v_model, sample, top_accuracy=1):
    result_tuple = w2v_model.most_similar(sample['positive'],
                                          sample['negative'],
                                          topn=top_accuracy)
    result_list = [word for word, _ in result_tuple]
    return sample['answer'] in result_list
    

def run_test(test_list, w2v_model, top_accuracy=1, test_name=None):
    correct = 0
    for sample in tqdm.tqdm_notebook(test_list, desc=str(test_name)):
        result = test_result(w2v_model, sample, top_accuracy=top_accuracy)
        correct += result
    return len(test_list), correct
    
def run_tests(tests_dict, w2v_model, top_accuracy=1):
    results = []
    for test_name, test_list in tests_dict.items():
        total, correct = run_test(test_list, w2v_model, top_accuracy=top_accuracy, test_name=test_name)
        results.append({'name': test_name, 'total': total, 'correct': correct})
    return pd.DataFrame(results)

## Run tests

In [16]:
%%time
### It took 1.5 hours ###
df_res = run_tests(tasks, w2v, top_accuracy=1)

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  # This is added back by InteractiveShellApp.init_path()


HBox(children=(FloatProgress(value=0.0, description=': capital-common-countries', max=506.0, style=ProgressSty…




HBox(children=(FloatProgress(value=0.0, description=': capital-world', max=4524.0, style=ProgressStyle(descrip…




















HBox(children=(FloatProgress(value=0.0, description=': currency', max=866.0, style=ProgressStyle(description_w…






HBox(children=(FloatProgress(value=0.0, description=': city-in-state', max=2467.0, style=ProgressStyle(descrip…






HBox(children=(FloatProgress(value=0.0, description=': family', max=506.0, style=ProgressStyle(description_wid…




HBox(children=(FloatProgress(value=0.0, description=': gram1-adjective-to-adverb', max=992.0, style=ProgressSt…




HBox(children=(FloatProgress(value=0.0, description=': gram2-opposite', max=812.0, style=ProgressStyle(descrip…




HBox(children=(FloatProgress(value=0.0, description=': gram3-comparative', max=1332.0, style=ProgressStyle(des…




HBox(children=(FloatProgress(value=0.0, description=': gram4-superlative', max=1122.0, style=ProgressStyle(des…




HBox(children=(FloatProgress(value=0.0, description=': gram5-present-participle', max=1056.0, style=ProgressSt…




HBox(children=(FloatProgress(value=0.0, description=': gram6-nationality-adjective', max=1599.0, style=Progres…




HBox(children=(FloatProgress(value=0.0, description=': gram7-past-tense', max=1560.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description=': gram8-plural', max=1332.0, style=ProgressStyle(descript…




HBox(children=(FloatProgress(value=0.0, description=': gram9-plural-verbs', max=870.0, style=ProgressStyle(des…


CPU times: user 1h 33min 16s, sys: 1min 7s, total: 1h 34min 23s
Wall time: 1h 34min 11s


In [17]:
df_res.to_csv(f'{output_task_name}_evaluation.csv')

In [18]:
df_res

Unnamed: 0,name,total,correct
0,: capital-common-countries,506,0
1,: capital-world,4524,1
2,: currency,866,0
3,: city-in-state,2467,11
4,: family,506,29
5,: gram1-adjective-to-adverb,992,7
6,: gram2-opposite,812,5
7,: gram3-comparative,1332,1
8,: gram4-superlative,1122,0
9,: gram5-present-participle,1056,14
