# Assignment 1.2: Word2vec preprocessing (20 points)

Preprocessing is not the most exciting part of NLP, but it is one of the most important. Your task is to preprocess raw text (you can use your own, or [this one](http://mattmahoney.net/dc/text8.zip)). For this task, text preprocessing mostly consists of:

1. cleaning (mostly, if your dataset is from social media or parsed from the internet)
1. tokenization
1. building the vocabulary and choosing its size. Use only high-frequency words and replace all other words with UNK, or handle them in your own manner. You can use `collections.Counter` for that.
1. assigning each token a number (numericalization). In other words, make `word2index` and `index2word` objects.
1. data structuring and batching: make a generator of X and y matrices for word2vec (explained in more detail below)
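
The vocabulary and numericalization steps in the list above can be sketched as follows. This is a minimal illustration on a toy token list, not a reference solution: the helper name `build_vocab`, its `max_size` parameter and the `<unk>` spelling are assumptions made for this example.

```python
from collections import Counter

UNK = "<unk>"  # assumed spelling of the unknown-word token


def build_vocab(tokens, max_size):
    """Keep the `max_size` most frequent words and reserve one extra slot for UNK."""
    counts = Counter(tokens)
    words = [word for word, _ in counts.most_common(max_size)]
    index2word = words + [UNK]
    word2index = {word: index for index, word in enumerate(index2word)}
    return word2index, index2word


tokens = "the cat sat on the mat and the dog sat too".split()
word2index, index2word = build_vocab(tokens, max_size=2)

# Words outside the vocabulary are mapped to the UNK index.
unk_index = word2index[UNK]
indices = [word2index.get(token, unk_index) for token in tokens]
decoded = [index2word[i] for i in indices]
print(decoded)
```

Here `index2word` is a plain list, which is enough for lookup by position; a dict works equally well if you prefer symmetric objects.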

**ATTN!:** If you use your own data, please attach a download link.

Your goal is to make a **Batcher** class which returns two numpy tensors with word indices. It should be possible to use it for word2vec training. You can implement a batcher for the Skip-Gram or CBOW architecture; the picture below can help you remember the difference.

![text](https://raw.githubusercontent.com/deepmipt/deep-nlp-seminars/651804899d05b96fc72b9474404fab330365ca09/seminar_02/pics/architecture.png)

There are several ways to do it right. Shapes could be `x_batch.shape = (batch_size, 2*window_size)`, `y_batch.shape = (batch_size,)` for CBOW or `(batch_size,)`, `(batch_size, 2*window_size)` for Skip-Gram. You should **not** do negative sampling here.

The batcher should be adequately parametrized: CBOW(window_size, ...), SkipGram(window_size, ...). You should implement only one batcher in this task, and it's up to you which one to choose.
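
One possible way to get the CBOW shapes described above is sketched below. It is only an illustration under stated assumptions: the function name `cbow_batches` is hypothetical, positions without a full window are skipped instead of padded, and the last batch is allowed to be smaller than `batch_size`.

```python
import numpy as np


def cbow_batches(indices, window_size, batch_size):
    """Yield (x_batch, y_batch) pairs of word indices for CBOW.

    Positions closer than `window_size` to either end of the text are skipped
    rather than padded; the last batch may be smaller than `batch_size`.
    """
    indices = np.asarray(indices)
    centers = np.arange(window_size, len(indices) - window_size)
    # Offsets of the context words relative to the center: -w..-1 and 1..w.
    offsets = np.concatenate([np.arange(-window_size, 0),
                              np.arange(1, window_size + 1)])
    for start in range(0, len(centers), batch_size):
        batch_centers = centers[start:start + batch_size]
        # Broadcasting gives one row of 2 * window_size context indices per center.
        x_batch = indices[batch_centers[:, None] + offsets[None, :]]
        y_batch = indices[batch_centers]
        yield x_batch, y_batch


toy_indices = list(range(10))  # stand-in for a numericalized text
batches = list(cbow_batches(toy_indices, window_size=2, batch_size=4))
x_batch, y_batch = batches[0]
print(x_batch.shape, y_batch.shape)  # (4, 4) (4,)
```

For Skip-Gram the same two arrays can simply be swapped: the centers become the input and the context matrix becomes the target.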

Useful links:
1. [Word2Vec Tutorial - The Skip-Gram Model](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
1. [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)
1. [Distributed Representations of Words and Phrases and their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)

You can write the code in this notebook or in a separate file. It can be reused for the next task. Your result should demonstrate that a batch has the proper structure (right shapes) and content (words should come from one context, not be random indices). To show that, translate indices back to words and print something like this:

```
text = ['first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including']

window_size = 2

# CBOW:
indices_to_words(x_batch) = \
        [['first', 'used', 'early', 'working'],
        ['used', 'against', 'working', 'class'],
        ['against', 'early', 'class', 'radicals'],
        ['early', 'working', 'radicals', 'including']]

indices_to_words(labels_batch) = ['against', 'early', 'working', 'class']
```
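
The check above can be reproduced with a few lines. The following is a minimal sketch rather than the expected solution: the toy vocabulary is built from the example text itself, and this particular implementation of `indices_to_words` is an assumption.

```python
import numpy as np

text = ['first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including']
window_size = 2

# Toy vocabulary built directly from the example text.
index2word = sorted(set(text))
word2index = {word: index for index, word in enumerate(index2word)}
indices = [word2index[word] for word in text]

# One CBOW example per position that has a full window on both sides.
positions = range(window_size, len(indices) - window_size)
x_batch = np.array([indices[i - window_size:i] + indices[i + 1:i + 1 + window_size]
                    for i in positions])
labels_batch = np.array([indices[i] for i in positions])


def indices_to_words(batch):
    """Map an array of word indices (any shape) back to nested lists of words."""
    return np.array(index2word, dtype=object)[batch].tolist()


print(indices_to_words(x_batch))
print(indices_to_words(labels_batch))
```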

In [0]:
from collections import Counter

from torch.utils.data import Dataset

import numpy as np
import spacy
from spacy.symbols import ORTH

spacy_en = spacy.load('en')
spacy_en.tokenizer.add_special_case("don't", [{ORTH: "do"}, {ORTH: "not"}])
spacy_en.tokenizer.add_special_case("didn't", [{ORTH: "did"}, {ORTH: "not"}]) #adding special case so that tokenizer("""don't""") != 'do'

In [2]:
!wget http://mattmahoney.net/dc/text8.zip
!unzip text8.zip

--2020-02-24 20:36:15--  http://mattmahoney.net/dc/text8.zip
Resolving mattmahoney.net (mattmahoney.net)... 67.195.197.75
Connecting to mattmahoney.net (mattmahoney.net)|67.195.197.75|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31344016 (30M) [application/zip]
Saving to: ‘text8.zip’


2020-02-24 20:38:16 (267 KB/s) - ‘text8.zip’ saved [31344016/31344016]

Archive:  text8.zip
  inflating: text8                   


# The main part

In [3]:
# Opening data
with open('text8', encoding='utf-8') as f:
    text_original = f.read()
print(text_original[:100])

 anarchism originated as a term of abuse first used against early working class radicals including t


In [4]:
# Preprocessing stuff
def tokenizer(text):
    """
    return: list of token texts (alphabetic tokens only, so punctuation and numbers are dropped; no lemmatization is applied)
    """
    return [tok.text for tok in spacy_en.tokenizer(text) if tok.text.isalpha()]

tokens = tokenizer(text_original)
print(len(tokens), len(set(tokens)))
print(tokens[:10])

17008373 253830
['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']


In [0]:
unk_token = '<unk>'
pad_token = '<pad>'

In [0]:
class Batcher(Dataset):
    """
    Takes a preprocessed list of tokens, builds the vocabulary and prepares CBOW batches.
    """

    def __init__(self, tokens, vocab_size):

        super().__init__()

        self.tokens = tokens
        self.tokens_freq = []
        self.vocab_size = vocab_size

        self.word2index = {}
        self.index2word = {}

        print('Initial length of tokens: {}'.format(self.vocab_size))

        self.build_vocab(min_freq=5)
        self.numericalization()
        self.x, self.y = self.cbow_batching(batch_size=64, window_size=4)

    def __len__(self):        
        return self.x.shape[0]

    def __getitem__(self, idx):
        
        x = self.x[idx]
        #x = torch.FloatTensor(x) # convert to a tensor of float values
        y = self.y[idx]        
        return x, y
    
    
    def build_vocab(self, min_freq = 10):
        """
        builds vocab (self.tokens_freq) from self.tokens and appends the unk and pad tokens
        param: min_freq (int) - a token is kept only if it occurs more than min_freq times
        """
        counter = Counter(self.tokens)
        mask = list(map(lambda x: x[1] > min_freq, counter.items()))
        self.tokens_freq = np.array(list(counter.items()))[mask]
        self.tokens_freq = list(map(lambda x: x[0], self.tokens_freq)) + [unk_token] + [pad_token]
        self.vocab_size = len(self.tokens_freq)

        print('After building vocab, vocab_size: {}'.format(self.vocab_size))

    def numericalization(self):
        """
        creates word2index and index2word, replaces infrequent tokens with the index of unk_token
        """
        self.word2index = {word : ind for ind, word in enumerate(self.tokens_freq)}
        self.index2word = {value : key for key, value in self.word2index.items()}



        self.tokens = [self.word2index[token] if token in self.word2index else self.word2index[unk_token] 
                       for token in self.tokens]

        print('Numeralization done. Example of self.tokens: {}'.format(self.tokens[:10]))          
        

    def cbow_batching(self, batch_size, window_size):
        """
        adds pad_token, creates batches
        """
        
        self.tokens = [self.word2index[pad_token]] * window_size + self.tokens + [self.word2index[pad_token]] * window_size
        x_batches = []
        y_batches = []

        for i in np.arange(window_size, len(self.tokens)-window_size):
            y_batches.append(self.tokens[i])

            context = self.tokens[i-window_size:i] + self.tokens[i+1:i+1+window_size]
            x_batches.append(context)
        x_batches = np.array(x_batches)
        y_batches = np.array(y_batches)

        try:
            x_batches = x_batches.reshape((-1, batch_size,2*window_size))
            y_batches = y_batches.reshape((-1,batch_size))
        except Exception:
            print('Could not reshape directly so deleted something')
            total = len(y_batches)
            x_batches = x_batches[:-(total % batch_size),:]
            y_batches = y_batches[:-(total % batch_size)]

            x_batches = x_batches.reshape((-1, batch_size,2*window_size))
            y_batches = y_batches.reshape((-1,batch_size))

        return x_batches, y_batches

In [13]:
batcher=Batcher(tokens=tokens, vocab_size=len(tokens))
#batcher.build_vocab(min_freq=5)
#batcher.numericalization()
#x, y = batcher.cbow_batching(batch_size=64, window_size=4)

Initial length of tokens: 17008373
After building vocab, vocab_size: 63632
Numeralization done. Example of self.tokens: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Could not reshape directly so deleted something


In [14]:
batcher.x.shape, batcher.y.shape

((265755, 64, 8), (265755, 64))
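
These shapes are consistent with the token count printed earlier: 17008373 examples fill 265755 full batches of 64, and the 53 leftover examples are what the fallback branch in `cbow_batching` discards. The same truncate-and-reshape step can be written without relying on an exception. The sketch below is one way to do it on toy data; the function name `to_full_batches` is hypothetical.

```python
import numpy as np


def to_full_batches(x, y, batch_size):
    """Drop trailing examples that do not fill a batch, then add a batch axis."""
    n_batches = len(y) // batch_size
    n_full = n_batches * batch_size
    x = x[:n_full].reshape(n_batches, batch_size, x.shape[1])
    y = y[:n_full].reshape(n_batches, batch_size)
    return x, y


# Toy data: 10 examples with 4 context indices each.
x = np.arange(40).reshape(10, 4)
y = np.arange(10)
x_batches, y_batches = to_full_batches(x, y, batch_size=3)
print(x_batches.shape, y_batches.shape)  # (3, 3, 4) (3, 3)

# The same arithmetic applied to the token count printed above.
print(divmod(17008373, 64))  # (265755, 53)
```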