# Word2vec preprocessing

Preprocessing is not the most exciting part of NLP, but it is still one of the most important. Your task is to preprocess raw text (you can use your own, or [this one](http://mattmahoney.net/dc/text8.zip)). For this task, text preprocessing mostly consists of:

1. cleaning (mainly needed if your dataset comes from social media or was parsed from the internet)
1. tokenization
1. building the vocabulary and choosing its size. Use only high-frequency words, and change all other words to UNK or handle them in your own manner. You can use `collections.Counter` for that.
1. assigning each token a number (numericalization). In other words, make `word2index` and `index2word` objects.
1. data structuring and batching: make a generator of X and y matrices for word2vec (explained in more detail below)
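
Steps 1 to 4 can be sketched on a toy string as follows. This is only a minimal illustration, not a reference solution: the helper name `build_vocab`, the sample sentence, and the `"UNK"` spelling are assumptions made for the example.

```python
import re
from collections import Counter


def build_vocab(tokens, max_size, unk="UNK"):
    """Keep the `max_size` most frequent tokens and reserve one extra slot for `unk`."""
    counts = Counter(tokens)
    index2word = [w for w, _ in counts.most_common(max_size)] + [unk]
    word2index = {w: i for i, w in enumerate(index2word)}
    return word2index, index2word


raw = "The cat sat on the mat. The dog sat too!"
tokens = re.findall(r"\w+", raw.lower())            # cleaning + tokenization
word2index, index2word = build_vocab(tokens, max_size=3)
unk_id = word2index["UNK"]
ids = [word2index.get(t, unk_id) for t in tokens]   # numericalization
print(ids)
print([index2word[i] for i in ids])                 # rare words come back as UNK
```

Words outside the top `max_size` all map to the same index, so the round trip through `index2word` is lossy by design.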

**ATTN!:** If you use your own data, please attach a download link.

Your goal is to make a `SkipGramBatcher` class that returns two numpy tensors with word indices. It should be possible to use it for word2vec training. You can implement a batcher for either the Skip-Gram or the CBOW architecture; the picture below may help you remember the difference.

![text](https://raw.githubusercontent.com/deepmipt/deep-nlp-seminars/651804899d05b96fc72b9474404fab330365ca09/seminar_02/pics/architecture.png)

There are several ways to do it right. Shapes could be `x_batch.shape = (batch_size, 2*window_size)`, `y_batch.shape = (batch_size,)` for CBOW or `(batch_size,)`, `(batch_size,)` for Skip-Gram. You should **not** do negative sampling here.

The batchers should be adequately parametrized: `CBOW(window_size, ...)`, `SkipGram(window_size, ...)`. You only need to implement one batcher in this task, and it's up to you which one to choose.
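
For the CBOW variant, the shapes described above could be produced roughly like this. It is a sketch under assumptions, not the expected solution: the function name `cbow_batches` and the toy vocabulary built with `sorted(set(text))` are made up for the example, and words closer than `window_size` to either edge of the text are simply skipped.

```python
import numpy as np


def cbow_batches(ids, window_size, batch_size):
    """Yield (x, y) with x.shape == (n, 2 * window_size) and y.shape == (n,)."""
    contexts, centers = [], []
    for i in range(window_size, len(ids) - window_size):
        left = ids[i - window_size:i]
        right = ids[i + 1:i + 1 + window_size]
        contexts.append(list(left) + list(right))
        centers.append(ids[i])
        if len(centers) == batch_size:
            yield np.array(contexts, dtype=np.int64), np.array(centers, dtype=np.int64)
            contexts, centers = [], []
    if centers:  # final, possibly smaller batch
        yield np.array(contexts, dtype=np.int64), np.array(centers, dtype=np.int64)


text = ['first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including']
index2word = sorted(set(text))                       # toy vocabulary, no UNK needed here
word2index = {w: i for i, w in enumerate(index2word)}
ids = [word2index[w] for w in text]

x_batch, y_batch = next(cbow_batches(ids, window_size=2, batch_size=4))
print(x_batch.shape, y_batch.shape)
print([[index2word[i] for i in row] for row in x_batch])
print([index2word[i] for i in y_batch])
```

Each row of `x_batch` holds the `window_size` words to the left of the center word followed by the `window_size` words to its right, and `y_batch` holds the center words themselves.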

Useful links:
1. [Word2Vec Tutorial - The Skip-Gram Model](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
1. [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)
1. [Distributed Representations of Words and Phrases and their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)

You can write the code in this notebook or in a separate file; it can be reused for the next task. Your result should demonstrate that each batch has the proper structure (right shapes) and content (words should come from one context, not be random indices). To show that, translate indices back to words and print them, producing something like this:

```
text = ['first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including']

window_size = 2

# CBOW:
indices_to_words(x_batch) = \
        [['first', 'used', 'early', 'working'],
        ['used', 'against', 'working', 'class'],
        ['against', 'early', 'class', 'radicals'],
        ['early', 'working', 'radicals', 'including']]

indices_to_words(labels_batch) = ['against', 'early', 'working', 'class']

# Skip-Gram

indices_to_words(x_batch) = ['against', 'early', 'working', 'class']

indices_to_words(labels_batch) = ['used', 'working', 'early', 'radicals']

```

If you struggle with something, ask your neighbor. If it is not obvious to you, someone else is probably looking for the answer too. And conversely, if you see that you can help someone, do it! Good luck!

In [3]:
# utils.py

import random 
import re
from collections import Counter
from typing import Iterable, List, Tuple, Any

# raw string, so that `\w` is not treated as a string escape sequence
tokenizer_regex = re.compile(r"\w+")

def preprocess(text: str) -> List[str]:
    return tokenizer_regex.findall(text.lower())

class Vocab:
    """
    A utility to convert tokens to indices and vice versa.
    Notation is loosely based on fast.ai that I spent a lot of time
    studying recently.
    """
    def __init__(self, vocab_size: int) -> None:
        """
        Initialize a vocab of size `vocab_size`. Actual size will be
        `vocab_size` + 1 to account for the unknown token.
        """
        self.vocab_size = vocab_size
        self.unk_s = "xxunk"
        self.size = vocab_size
        self.stoi = dict() # map string to index
        self.itos = list() # map index to string
        self.freqs = Counter()
   
    def build(self, texts: Iterable[Iterable[str]]) -> None:
        """
        Process an iterable of iterables to create a vocab.
        After counting the tokens, the vocab is trimmed to `vocab_size` 
        according to tokens' frequency.
        """
        for text in texts:
            self.freqs.update(text)
        
        words, _ = zip(*self.freqs.most_common(self.vocab_size))
        words = list(words)
        words.append(self.unk_s)
        self.itos = words
        self.stoi = {s: i for i,s in enumerate(words)}
        
    def numericalize(self, tokens: Iterable[str]) -> List[int]:
        """
        Convert tokens into ids
        """
        return [self.stoi[t] if t in self.stoi 
                else self.stoi[self.unk_s] 
                for t in tokens]
    
    def textify(self, ids: Iterable[int]) -> List[str]:
        """
        Convert ids back into tokens
        """
        return [self.itos[i] for i in ids]
    
    def __repr__(self):
        return f"Vocab of size {len(self.itos)}"
        

class SkipGramDataGen:
    """
    Generate data for skip-gram algorithm.
    """
    def __init__(self, texts: Iterable[List], window_size: int):
        """
        Initialize a dataset to get SkipGram batches.
        :param texts: an iterable of lists of (maybe numericalized) tokens
        """
        self.texts = texts
        self.ws = window_size
        
    def iter_line(self) -> Iterable[Tuple[Any, Any]]:
        """
        Yield (center_word, context_word) pairs from the dataset,
        sampling one random context word per center word.
        """
        for text in self.texts:
            # skip the first and the last `window_size` words
            # so that context indices stay within the text
            for i in range(self.ws, len(text) - self.ws):
                center_word = text[i]
                context_indices = [i-1-n for n in range(self.ws)] + [i+1+n for n in range(self.ws)]
                context_word = text[random.choice(context_indices)]
                yield (center_word, context_word)
                
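    def iter_batch(self, bs: int) -> Iterable[Tuple[Tuple, Tuple]]:
        """
        Yield batches of `bs` pairs as (center_words, context_words).
        The last batch may be smaller than `bs`.
        """
        batch = list()
        
        for word_pair in self.iter_line():
            batch.append(word_pair)
            if len(batch) == bs:
                yield tuple(zip(*batch))
                batch = list()
        
        # emit the leftover pairs, if any
        if batch:
            yield tuple(zip(*batch))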
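
Note that `iter_line` samples a single random context word per center word and skips the first and last `window_size` tokens of each text. An alternative, shown here only as a sketch and not as the approach used in this notebook, is to emit every (center, context) pair and truncate the window at the text edges. The function name `skipgram_pairs` is an assumption made for the example.

```python
from typing import Iterator, Sequence, Tuple


def skipgram_pairs(tokens: Sequence, window_size: int) -> Iterator[Tuple]:
    """Yield every (center, context) pair, truncating the window at text edges."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield center, tokens[j]


pairs = list(skipgram_pairs(["a", "b", "c", "d"], window_size=1))
print(pairs)
```

This produces more training pairs per epoch and is deterministic, at the cost of a longer pass over the data; random sampling of one context word keeps the epoch short.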

Let's create the vocab and the batcher:

In [4]:
with open("text8/text8", encoding="utf-8") as f:
    text8 = f.read()

In [5]:
text8_tokenized = preprocess(text8)

In [13]:
len(text8_tokenized)

17005207

In [6]:
vocab = Vocab(50000)
vocab.build([text8_tokenized])

In [7]:
# 20 most common words in the vocab:
vocab.itos[:20]

['the',
 'of',
 'and',
 'one',
 'in',
 'a',
 'to',
 'zero',
 'nine',
 'two',
 'is',
 'as',
 'eight',
 'for',
 's',
 'five',
 'three',
 'was',
 'by',
 'that']

The datagen was meant to handle multiple texts, but this file contains only one, so we have to wrap it in a list:

In [8]:
ds = SkipGramDataGen([vocab.numericalize(text8_tokenized)], window_size=2)

In [9]:
text8_tokenized[:8]

['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first']

In [139]:
for center, context in ds.iter_batch(4):
    print(center)
    print(context)
    break

(11, 5, 194, 1)
(194, 11, 3133, 194)
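
The task statement asks for numpy tensors, while `iter_batch` yields plain Python sequences. A minimal sketch of the conversion, assuming `numpy` is installed; `to_numpy_batch` is a hypothetical helper that is not part of the code above, and the sample values are the ones printed in the previous cell:

```python
import numpy as np


def to_numpy_batch(center, context):
    """Convert two equally long sequences of word indices into int64 arrays."""
    x_batch = np.asarray(center, dtype=np.int64)
    y_batch = np.asarray(context, dtype=np.int64)
    return x_batch, y_batch


x_batch, y_batch = to_numpy_batch((11, 5, 194, 1), (194, 11, 3133, 194))
print(x_batch.shape, y_batch.shape)
```

Both arrays have shape `(batch_size,)`, which matches the Skip-Gram shapes given in the task description.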


In [140]:
for center, context in ds.iter_batch(4):
    print(vocab.textify(center))
    print(vocab.textify(context))
    break

['as', 'a', 'term', 'of']
['a', 'term', 'a', 'abuse']
