# Word2vec preprocessing

Preprocessing is not the most interesting part of NLP, but it is still one of the most important. Your task is to preprocess raw text (you can use your own, or [this one](http://mattmahoney.net/dc/textdata)). For this task, text preprocessing mostly consists of:

1. cleaning (mostly needed if your dataset comes from social media or was parsed from the internet)
1. tokenization
1. building the vocabulary and choosing its size
1. assigning each token a number (numericalization)
1. data structuring and batching
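
The first four steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the required solution: `clean`, `build_vocab`, and the `min_count`/`max_size` parameters are hypothetical names chosen for this example, and the cleaning rule (lowercase, letters only) is an assumption that suits text8-like data.

```python
from collections import Counter


def clean(text):
    """Lowercase the text and replace every non-letter character with a space."""
    return "".join(ch if ch.isalpha() or ch.isspace() else " " for ch in text.lower())


def build_vocab(tokens, max_size=None, min_count=1):
    """Map the most frequent tokens to consecutive integer ids."""
    counts = Counter(tokens)
    kept = [tok for tok, cnt in counts.most_common(max_size) if cnt >= min_count]
    return {tok: idx for idx, tok in enumerate(kept)}


raw_text = "The cat sat on the mat, and the dog sat on the rug."
tokens = clean(raw_text).split()                    # whitespace tokenization
token_to_idx = build_vocab(tokens, min_count=2)     # vocabulary with a frequency cutoff
# Numericalization: out-of-vocabulary tokens are simply dropped in this sketch
ids = [token_to_idx[tok] for tok in tokens if tok in token_to_idx]
print(token_to_idx)
print(ids)
```

Dropping rare tokens is one possible choice; mapping them to a single unknown-token index is another common option.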

Your goal is to write a SkipGramBatcher class that returns two NumPy arrays of word indices. You can implement a batcher for either the Skip-Gram or the CBOW architecture; the picture below can be helpful for remembering the difference.

![text](https://raw.githubusercontent.com/deepmipt/deep-nlp-seminars/651804899d05b96fc72b9474404fab330365ca09/seminar_02/pics/architecture.png)

There are several ways to do it right. The shapes could be `(batch_size, 2*window_size)` and `(batch_size,)` for CBOW, or `(batch_size,)` and `(batch_size,)` for Skip-Gram. You should **not** do negative sampling here.

The batchers should be adequately parametrized: `CBOW(batch_size, window_size, ...)`, `SkipGram(num_skips, skip_window)`. You should implement only one batcher in this task; it's up to you which one to choose.
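
As an illustration of the Skip-Gram shapes, here is a minimal sketch of a generator that yields `(centers, contexts)` pairs, each of shape `(batch_size,)`. The function name `skipgram_batches`, the `seed` argument, and the interpretation of `num_skips` as "number of context words sampled per center word" are assumptions made for this example; the task does not prescribe them.

```python
import numpy as np


def skipgram_batches(ids, batch_size, num_skips, skip_window, seed=0):
    """Yield (centers, contexts) index arrays, each of shape (batch_size,).

    For every center position, `num_skips` distinct context positions are
    sampled from the `skip_window` tokens on either side of it.
    """
    assert num_skips <= 2 * skip_window
    rng = np.random.default_rng(seed)
    ids = np.asarray(ids)
    offsets = np.array([o for o in range(-skip_window, skip_window + 1) if o != 0])
    centers, contexts = [], []
    for pos in range(skip_window, len(ids) - skip_window):
        chosen = rng.choice(offsets, size=num_skips, replace=False)
        for off in chosen:
            centers.append(ids[pos])
            contexts.append(ids[pos + off])
            if len(centers) == batch_size:
                yield np.array(centers), np.array(contexts)
                centers, contexts = [], []
    # An incomplete final batch is dropped in this sketch.


ids = np.arange(10)  # stand-in for numericalized text
centers, contexts = next(skipgram_batches(ids, batch_size=8, num_skips=2, skip_window=1))
print(centers.shape, contexts.shape)
```

No negative samples are produced here; each pair contains only a center word and one of its true context words.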

Useful links:
1. [Word2Vec Tutorial - The Skip-Gram Model](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
1. [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)
1. [Distributed Representations of Words and Phrases and their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)

You can write the code in this notebook or in a separate file. It will be reused for the next task. The result of your work should show that your batches have the proper structure (right shapes) and content (words should come from one context, not be random indices). To show that, translate the indices back to words and print them to get something like this:

```
bag_window = 2

batch = [['first', 'used', 'early', 'working'],
        ['used', 'against', 'working', 'class'],
        ['against', 'early', 'class', 'radicals'],
        ['early', 'working', 'radicals', 'including']]

labels = ['against', 'early', 'working', 'class']
```
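
The example above can be reproduced with a small CBOW-style sketch. The helper `cbow_batch` and its `start` argument are hypothetical, and the toy vocabulary is built only from the eight words of the example; a real batcher would work on the full numericalized corpus.

```python
import numpy as np


def cbow_batch(ids, batch_size, window_size, start=0):
    """Return (contexts, labels) with shapes (batch_size, 2*window_size) and (batch_size,)."""
    ids = np.asarray(ids)
    contexts = np.empty((batch_size, 2 * window_size), dtype=ids.dtype)
    labels = np.empty(batch_size, dtype=ids.dtype)
    for row in range(batch_size):
        center = start + window_size + row
        contexts[row, :window_size] = ids[center - window_size:center]          # left context
        contexts[row, window_size:] = ids[center + 1:center + window_size + 1]  # right context
        labels[row] = ids[center]
    return contexts, labels


words = ['first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including']
token_to_idx = {w: i for i, w in enumerate(words)}
idx_to_token = {i: w for w, i in token_to_idx.items()}
ids = [token_to_idx[w] for w in words]

contexts, labels = cbow_batch(ids, batch_size=4, window_size=2)

# Translate indices back to words to check that each row is a real context
batch_words = [[idx_to_token[i] for i in row] for row in contexts.tolist()]
label_words = [idx_to_token[i] for i in labels.tolist()]
print(batch_words)
print(label_words)
```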

If you struggle with something, ask your neighbour. If it is not obvious to you, someone else is probably looking for the answer too. And conversely, if you see that you can help someone, just do it! Good luck!

In [1]:
from pathlib import Path
from collections import Counter
from itertools import takewhile

import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

from torch.utils.data import Dataset, DataLoader

In [2]:
DATA_PATH = Path('./data')

In [9]:
class Vocabulary():
    def __init__(self, token_to_idx=None):
        if token_to_idx is None:
            token_to_idx = {}
            
        self.token_to_idx = token_to_idx
        self.idx_to_token = {idx: token for token, idx in self.token_to_idx.items()}
        
    def add_token(self, token):
        """Add `token` if it is new and return its index."""
        if token in self.token_to_idx:
            index = self.token_to_idx[token]
        else:
            index = len(self.token_to_idx)
            self.token_to_idx[token] = index
            self.idx_to_token[index] = token
        return index
            
    def __len__(self):
        return len(self.token_to_idx)

In [17]:
class Vectorizer():
    def __init__(self, vocab):
        self.vocab = vocab
        
    @classmethod
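    def from_tokens(cls, tokens, cutoff=0):
        """Build a vectorizer whose vocabulary keeps tokens seen at least `cutoff` times."""
        token_counts = Counter(tokens)
        
        if cutoff > 0:
            # Drop rare tokens before assigning indices
            token_counts = {token: count for token, count in token_counts.items()
                            if count >= cutoff}
        print(token_counts)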

        vocab = Vocabulary()
        
        for token in token_counts:
            vocab.add_token(token)
        
        return cls(vocab)

In [18]:
class MyDataset(Dataset):
    def __init__(self, tokens, vectorizer):
        self.tokens = tokens
        self.vectorizer = vectorizer
    
    @classmethod
    def prepare_dataset(cls, file_path):
        with open(file_path) as f:
            # Whitespace tokenization; only the first 100 tokens are kept for quick experiments
            tokens = f.read().split()[:100]
        
        vectorizer = Vectorizer.from_tokens(tokens, cutoff=3)
        print(tokens)
        print(vectorizer.vocab.token_to_idx)
        print(vectorizer.vocab.idx_to_token)
        tokens = list(filter(lambda token: token in vectorizer.vocab.token_to_idx, tokens))
        print(tokens)
        
        return cls(tokens, vectorizer)

In [19]:
dataset = MyDataset.prepare_dataset(DATA_PATH/'text8')
dataset

{'anarchism': 3, 'as': 3, 'a': 4, 'of': 4, 'used': 3, 'the': 9, 'is': 3}
['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans', 'culottes', 'of', 'the', 'french', 'revolution', 'whilst', 'the', 'term', 'is', 'still', 'used', 'in', 'a', 'pejorative', 'way', 'to', 'describe', 'any', 'act', 'that', 'used', 'violent', 'means', 'to', 'destroy', 'the', 'organization', 'of', 'society', 'it', 'has', 'also', 'been', 'taken', 'up', 'as', 'a', 'positive', 'label', 'by', 'self', 'defined', 'anarchists', 'the', 'word', 'anarchism', 'is', 'derived', 'from', 'the', 'greek', 'without', 'archons', 'ruler', 'chief', 'king', 'anarchism', 'as', 'a', 'political', 'philosophy', 'is', 'the', 'belief', 'that', 'rulers', 'are', 'unnecessary', 'and', 'should', 'be', 'abolished', 'although', 'there', 'are', 'differing']
{'anarchism': 0, 'as': 1, 'a

<__main__.MyDataset at 0x7f2613296518>

In [185]:
'as' in dataset.vectorizer.vocab.token_to_idx

False

In [179]:
dataset.vectorizer.vocab.token_to_idx

{}

In [31]:
c = Counter('sdlfj sldkjdf sdlkfjdlfkjdl sdlfkjfd')
c.items()

dict_items([('s', 4), ('d', 8), ('l', 6), ('f', 6), ('j', 5), (' ', 3), ('k', 4)])

In [34]:
# Filter over (token, count) pairs; iterating over `c` itself yields only the keys
dict(filter(lambda x: x[1] > 5, c.items()))

{'d': 8, 'l': 6, 'f': 6}

In [54]:
c

Counter({'s': 4, 'd': 8, 'l': 6, 'f': 6, 'j': 5, ' ': 3, 'k': 4})

In [62]:
dict(takewhile(lambda x: x[1] > 3, c.most_common()))

{'d': 8, 'l': 6, 'f': 6, 'j': 5, 's': 4, 'k': 4}

In [52]:
c.most_common()

[('d', 8), ('l', 6), ('f', 6), ('j', 5), ('s', 4), ('k', 4), (' ', 3)]