# Assignment 1.3: Naive word2vec (40 points)

This task can be formulated very simply. Follow this [paper](https://arxiv.org/pdf/1411.2738.pdf) and implement word2vec like a two-layer neural network with matrices $W$ and $W'$. One matrix projects words to low-dimensional 'hidden' space and the other - back to high-dimensional vocabulary space.

![word2vec](https://i.stack.imgur.com/6eVXZ.jpg)

You can use TensorFlow/PyTorch and code from your previous task.

## Results of this task: (30 points)
 * trained word vectors (mention somewhere, how long it took to train)
 * plotted loss (so we can see that it has converged)
 * function to map token to corresponding word vector
 * beautiful visualizations (PCE, T-SNE), you can use TensorBoard and play with your vectors in 3D (don't forget to add screenshots to the task)

## Extra questions: (10 points)
 * Intrinsic evaluation: you can find datasets [here](http://download.tensorflow.org/data/questions-words.txt)
 * Extrinsic evaluation: you can use [these](https://medium.com/@dataturks/rare-text-classification-open-datasets-9d340c8c508e)

Also, you can find any other datasets for quantitative evaluation.

Again. It is **highly recommended** to read this [paper](https://arxiv.org/pdf/1411.2738.pdf)

Example of visualization in tensorboard:
https://projector.tensorflow.org

Example of 2D visualisation:

![2dword2vec](https://www.tensorflow.org/images/tsne.png)

# README

## Installation

The whole training code for all this was written and saved as distinct package. To reproduce the code in the cells below please follow the installation guideline.

1. **[IMPORTANT]** Create new python environment with python3.6.9 and activate it.
2. Run the package installation command from the root of this repository. This command will install all the required packages from `requirements.txt` and the package `assignment1`.
```
pip install -e .
```
3. **[IMPORTANT]** Run this jupyter notebook from the new environment
4. Load text file and upzip it to the `./data/` folder.
4. Load tensorboard logs (if you wish to see the loss plots) from []() and put them to the `./runs/` folder.
5. Load weights of the model from []() and put them to the `./models/` folder.

## Manual tests

All commands should be run from the terminal with active environment created on the previous step.

1. Run `python assignment1/dataset.py` to test the Batcher creation.
2. Run `python assignment1/train.py --force_cpu --test_mode --test_size 10000` to test train loop the vanilla SkipGram model on cpu (`--force_cpu`) using only first 10000 tokens from the data file.
3. Run `tensorboard --logdir runs/`, then open `http://localhost:6006/` in your browser to see original logs of word2vec training.

## Vanilla SkipGram

I've runned 2 experiments, but the second one was iterrupted in the middle of 10th epoch (freaking electricity issue). So I continued the second experiment from the last checkpoint - that's why I have 3 learning curves instead of 2 -_-

1. First experiment: I didn't get rid of the low-frequency words. The training command:
```
python assignment1/train.py --task_name vanilla_skipgram_zerou_mc_5 --batch_size 1024 --min_count 0 --num_workers 3 --num_epochs 20
```
One epoch lasted 41-42 minutes, total training took ~13.3 hours on Nvidia 1060 Max-Q GPU.
2. Second experiment: minimal frequency threshold was 5. The first training command:
```
python assignment1/train.py --task_name vanilla_skipgram_zerou_mc_5 --batch_size 2048 --num_workers 2 --num_epochs 20
```
The second training command:
```
python assignment1/train.py --task_name vanilla_skipgram_zerou_mc_5_ep9 --batch_size 2048 --num_workers 2 --num_epochs 12 --checkpoint models/vanilla_skipgram_zerou_mc_5/model_8.pth 
```
One epoch lasted 25.5 minutes, total training took ~8.5 hours on Nvidia 1060 Max-Q GPU.


The batch size for the second experiment is twice as big as that of the first experiment, so loss curves for the second experiment are in the different scale. You may investigate loss curves for positive and negative components of loss in tensorboard (`manual test 3`).

<img src="./imgs/tensorboard/00_loss.png" width="400">

In [8]:
!python --version

Python 3.6.9


# Save weights from one of the checkpoints to .tsv for tensorboard visualizations

In [3]:
import os
import os.path as osp
import tqdm

import numpy as np
import pandas as pd
import torch

from Task_1_miffka.assignment1.config import config
from Task_1_miffka.assignment1.dataset import SkipGramDataset
from Task_1_miffka.assignment1.word2vec import Word2Vec

In [4]:
# task_name = 'vanilla_skipgram_zerou'
task_name = 'vanilla_skipgram_zerou_mc_5_ep9'

weights = osp.join(config.model_dir, task_name, 'model_best.pth')
checkpoint = torch.load(weights, map_location='cpu')

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\Andrey\\PycharmProjects\\iPavlov\\Task_1_miffka\\models\\vanilla_skipgram_zerou_mc_5_ep9\\model_best.pth'

In [3]:
# We need dataset for the first checkpoint, but the second one already contains token list

# text_file = osp.join(config.data_dir, 'text8')
# dataset = SkipGramDataset(text_file, dict_size=100000, min_count=0,
#                               window_size=5)

## Save big vectors

In [10]:
# word_vectors = checkpoint['V.weight'].numpy()
word_vectors = checkpoint['state']['V.weight'].numpy()

# output_task_name = 'vanilla_zerou_100k'
output_task_name = 'vanilla_zerou_72k'
save_dir = osp.join(config.model_dir, 'final', output_task_name)
os.makedirs(save_dir, exist_ok=True)

### THESE VECTORS ARE ALREADY IN YOUR `./models/` FOLDER, BUT YOU CAN ALWAYS RE-GENERATE THEM ###

# np.savetxt(osp.join(save_dir, 'word_vectors.tsv'), word_vectors, delimiter='\t', fmt='%.8e')
# with open(osp.join(save_dir, 'meta.tsv'), 'w') as fout:
# #     fout.write('\n'.join(pd.Series(dataset.token2int).sort_values().index.to_list()))
#     fout.write('\n'.join(pd.Series(checkpoint['int2token']).sort_index().values))

## Save first 1000 vectors to play with them at https://projector.tensorflow.org/

In [11]:
### THESE VECTORS ARE ALREADY IN YOUR `./models/` FOLDER ###

# np.savetxt(osp.join(save_dir, 'word_vectors_small.tsv'), word_vectors[:1000], delimiter='\t', fmt='%.8e')
# with open(osp.join(save_dir, 'meta_small.tsv'), 'w') as fout:
# #     fout.write('\n'.join(pd.Series(dataset.token2int).sort_values().index.to_list()[:1000]))
#     fout.write('\n'.join(pd.Series(checkpoint['int2token']).sort_index().values[:1000]))

# Load vectors to handwritten word2vec class and play with it

In [2]:
%%time
output_task_name = 'vanilla_zerou_72k'
# output_task_name = 'vanilla_zerou_100k'
save_dir = osp.join(config.model_dir, 'final', output_task_name)

vectors_path = osp.join(save_dir, 'word_vectors.tsv')
meta_path = osp.join(save_dir, 'meta.tsv')

### THIS OPERATION TAKES ~10s ON MY NOTEBOOK for 72k corpus and ~15s for 100k corpus ###
w2v = Word2Vec(vectors_path, meta_path)

CPU times: user 10.1 s, sys: 421 ms, total: 10.5 s
Wall time: 10.4 s


In [3]:
### Let's see if vectors can solve simple analogy ###

w2v.most_similar(['king', 'woman'], ['man'])

[('queen', 0.614870586021524),
 ('matilda', 0.6062946666234997),
 ('dowager', 0.5950421862923319),
 ('isabella', 0.5887912449711504),
 ('aragon', 0.5747355074712591),
 ('throne', 0.5703062003666273),
 ('princess', 0.5675615174539018),
 ('boleyn', 0.5652797828990311),
 ('daughter', 0.5611477820572973),
 ('jadwiga', 0.5610703497998427)]

# Analysis of tSNE graphics

## Experiment 1

I've loaded the vectors with 1000 most freaquent words and runned tSNE for ~3200 iterations with some perturbations. I've made it multiple times so I hope you will be able to reproduce my visualizations because the web interface doesn't allow me to share the links =(

<img src="./imgs/tensorboard/01_tsne_all.png" width="800">

### Clusters with numbers and date names

The two clusters to the right are clearly visible - the first one contains numbers, the second one contains date names.

<table>
    <tr>
    <td><img src="./imgs/tensorboard/02_numbers.png" width="800"></td> <td><img src="./imgs/tensorboard/03_dates.png" width="800"></td>
    </tr>
</table>

### Area with names of the countries, areas of the world, and with politics terms

In the upper area lies interesting field. 

<img src="./imgs/tensorboard/04_countries.png" width="800">

### Area with information and computers

In the left area lies one more interesting field.

<table>
    <tr>
    <td><img src="./imgs/tensorboard/05_information.png" width="800"></td> <td><img src="./imgs/tensorboard/06_information_big.png" width="800"></td>
    </tr>
</table>

## Experiment 2

The word vectors for the second experiment had comparable structure under almost the same conditions.

<img src="./imgs/tensorboard/07_tsne_2.png" width="800">

## Extra: intrinsic evaluation of word vectors quality

In [6]:
def parse_analogies_file(path):
    tasks = {}
    with open(path) as fin:
        for line in fin:
            line = line.strip().lower()
            if ':' in line:
                current_task = line
                tasks[current_task] = []
            else:
                words = line.split()
                pos_words = [words[0], words[2]]
                neg_words = [words[1]]
                answer = words[3]
                tasks[current_task].append({'positive': pos_words,
                                            'negative': neg_words,
                                            'answer': answer})
    return tasks

intrinsic_file = osp.join(config.data_dir, 'questions-words.txt')
tasks = parse_analogies_file(intrinsic_file)

In [3]:
for task_name, task_list in tasks.items():
    print(task_name, len(task_list))

: capital-common-countries 506
: capital-world 4524
: currency 866
: city-in-state 2467
: family 506
: gram1-adjective-to-adverb 992
: gram2-opposite 812
: gram3-comparative 1332
: gram4-superlative 1122
: gram5-present-participle 1056
: gram6-nationality-adjective 1599
: gram7-past-tense 1560
: gram8-plural 1332
: gram9-plural-verbs 870


In [7]:
def test_result(w2v_model, sample, top_accuracy=1):
    result_tuple = w2v_model.most_similar(sample['positive'],
                                          sample['negative'],
                                          topn=top_accuracy)
    result_list = [word for word, _ in result_tuple]
    return sample['answer'] in result_list
    

def run_test(test_list, w2v_model, top_accuracy=1, test_name=None):
    correct = 0
    for sample in tqdm.tqdm_notebook(test_list, desc=str(test_name)):
        result = test_result(w2v_model, sample, top_accuracy=top_accuracy)
        correct += result
    return len(test_list), correct
    
def run_tests(tests_dict, w2v_model, top_accuracy=1):
    results = []
    for test_name, test_list in tests_dict.items():
        total, correct = run_test(test_list, w2v_model, top_accuracy=top_accuracy, test_name=test_name)
        results.append({'name': test_name, 'total': total, 'correct': correct})
    return pd.DataFrame(results)

In [24]:
%%time
### It took 1.5 hours ###
df_res = run_tests(tasks, w2v, top_accuracy=1)

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  # This is added back by InteractiveShellApp.init_path()


HBox(children=(FloatProgress(value=0.0, description=': capital-common-countries', max=506.0, style=ProgressSty…




HBox(children=(FloatProgress(value=0.0, description=': capital-world', max=4524.0, style=ProgressStyle(descrip…




















HBox(children=(FloatProgress(value=0.0, description=': currency', max=866.0, style=ProgressStyle(description_w…






HBox(children=(FloatProgress(value=0.0, description=': city-in-state', max=2467.0, style=ProgressStyle(descrip…






HBox(children=(FloatProgress(value=0.0, description=': family', max=506.0, style=ProgressStyle(description_wid…




HBox(children=(FloatProgress(value=0.0, description=': gram1-adjective-to-adverb', max=992.0, style=ProgressSt…




HBox(children=(FloatProgress(value=0.0, description=': gram2-opposite', max=812.0, style=ProgressStyle(descrip…




HBox(children=(FloatProgress(value=0.0, description=': gram3-comparative', max=1332.0, style=ProgressStyle(des…




HBox(children=(FloatProgress(value=0.0, description=': gram4-superlative', max=1122.0, style=ProgressStyle(des…




HBox(children=(FloatProgress(value=0.0, description=': gram5-present-participle', max=1056.0, style=ProgressSt…




HBox(children=(FloatProgress(value=0.0, description=': gram6-nationality-adjective', max=1599.0, style=Progres…




HBox(children=(FloatProgress(value=0.0, description=': gram7-past-tense', max=1560.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description=': gram8-plural', max=1332.0, style=ProgressStyle(descript…




HBox(children=(FloatProgress(value=0.0, description=': gram9-plural-verbs', max=870.0, style=ProgressStyle(des…


CPU times: user 1h 33min 4s, sys: 15 s, total: 1h 33min 19s
Wall time: 1h 33min 7s


In [26]:
df_res.to_csv(f'{output_task_name}_evaluation.csv')

In [27]:
df_res

Unnamed: 0,name,total,correct
0,: capital-common-countries,506,0
1,: capital-world,4524,0
2,: currency,866,0
3,: city-in-state,2467,2
4,: family,506,65
5,: gram1-adjective-to-adverb,992,3
6,: gram2-opposite,812,33
7,: gram3-comparative,1332,0
8,: gram4-superlative,1122,0
9,: gram5-present-participle,1056,35


### Compare the results to real word2vec

In [2]:
def load_word2vec():
    """ Load Word2Vec Vectors
        Return:
            wv_from_bin: All 3 million embeddings, each lengh 300
    """
    import gensim.downloader as api
    wv_from_bin = api.load("glove-wiki-gigaword-200")
    vocab = list(wv_from_bin.vocab.keys())
    print("Loaded vocab size %i" % len(vocab))
    return wv_from_bin

wv_from_bin = load_word2vec()

Loaded vocab size 400000


In [8]:
### Comparison is not the most honest, since this model is GloVe.
###  But at least it is not so big and have the same number of dimensions.

df_res_glove = run_tests(tasks, wv_from_bin, top_accuracy=1)

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  # This is added back by InteractiveShellApp.init_path()


HBox(children=(FloatProgress(value=0.0, description=': capital-common-countries', max=506.0, style=ProgressSty…




HBox(children=(FloatProgress(value=0.0, description=': capital-world', max=4524.0, style=ProgressStyle(descrip…




HBox(children=(FloatProgress(value=0.0, description=': currency', max=866.0, style=ProgressStyle(description_w…




HBox(children=(FloatProgress(value=0.0, description=': city-in-state', max=2467.0, style=ProgressStyle(descrip…




HBox(children=(FloatProgress(value=0.0, description=': family', max=506.0, style=ProgressStyle(description_wid…




HBox(children=(FloatProgress(value=0.0, description=': gram1-adjective-to-adverb', max=992.0, style=ProgressSt…




HBox(children=(FloatProgress(value=0.0, description=': gram2-opposite', max=812.0, style=ProgressStyle(descrip…




HBox(children=(FloatProgress(value=0.0, description=': gram3-comparative', max=1332.0, style=ProgressStyle(des…




HBox(children=(FloatProgress(value=0.0, description=': gram4-superlative', max=1122.0, style=ProgressStyle(des…




HBox(children=(FloatProgress(value=0.0, description=': gram5-present-participle', max=1056.0, style=ProgressSt…




HBox(children=(FloatProgress(value=0.0, description=': gram6-nationality-adjective', max=1599.0, style=Progres…




HBox(children=(FloatProgress(value=0.0, description=': gram7-past-tense', max=1560.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description=': gram8-plural', max=1332.0, style=ProgressStyle(descript…




HBox(children=(FloatProgress(value=0.0, description=': gram9-plural-verbs', max=870.0, style=ProgressStyle(des…




In [14]:
df_res_glove.to_csv('glove-wiki-gigaword-200_evaluation.csv')

In [13]:
df_res_glove

Unnamed: 0,name,total,correct
0,: capital-common-countries,506,0
1,: capital-world,4524,1
2,: currency,866,0
3,: city-in-state,2467,1
4,: family,506,31
5,: gram1-adjective-to-adverb,992,9
6,: gram2-opposite,812,1
7,: gram3-comparative,1332,0
8,: gram4-superlative,1122,0
9,: gram5-present-participle,1056,98
