# Practice Word2Vec Lite

Write a program that trains a [Word2Vec](https://jalammar.github.io/illustrated-word2vec/) model. Do _not_ use `print()` calls in your code, otherwise the test procedure will fail. The message "_Wrong Answer_" indicates that the answer format is incorrect (`print()` in the code, words missing from the dictionary, etc.). The message "_Embeddings are not good enough_" means you are on the right track and should focus on improving the model. In this version of the assignment, the checks on the embeddings are more lenient.

You may think of the input string as being pre-processed with the following function:

```
import re
import string

def clean(inp: str) -> str:
    inp = inp.translate(str.maketrans(string.punctuation, " "*len(string.punctuation)))
    inp = re.sub(r'\s+', ' ', inp.lower())
    return inp
```

For example, given the input "Your string!", the output is "your string " (note the trailing space left by the replaced punctuation).

  
**Input**: data (string) - cleaned documents without punctuation in one line  
**Output**: w2v_dict (dict: key (string) - a word from vocabulary, value (numpy array) - the word's embedding)

Time limit: 50 seconds  
Memory limit: 128 MB

## Solution

In [18]:
# import the required libraries
from typing import Tuple, List, Dict
import re
import string

import numpy as np
import matplotlib.pyplot as plt
import torch
from torch import nn
import torch.optim as optim
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

### Preprocessing

In [77]:
text = '''Machine learning is the study of computer algorithms that \
improve automatically through experience. It is seen as a \
subset of artificial intelligence. Machine learning algorithms \
build a mathematical model based on sample data, known as \
training data, in order to make predictions or decisions without \
being explicitly programmed to do so. Machine learning algorithms \
are used in a wide variety of applications, such as email filtering \
and computer vision, where it is difficult or infeasible to develop \
conventional algorithms to perform the needed tasks.'''


In [84]:
# text-cleaning function (provided by the task)
def clean(inp: str) -> str:
    inp = inp.translate(str.maketrans(string.punctuation, " "*len(string.punctuation)))
    inp = re.sub(r'\s+', ' ', inp.lower())
    return inp

text = clean(text)
print(text)

machine learning is the study of computer algorithms that improve automatically through experience it is seen as a subset of artificial intelligence machine learning algorithms build a mathematical model based on sample data known as training data in order to make predictions or decisions without being explicitly programmed to do so machine learning algorithms are used in a wide variety of applications such as email filtering and computer vision where it is difficult or infeasible to develop conventional algorithms to perform the needed tasks 


In [85]:
# tokenization function
def tokenize(text) -> List[str]:
    return re.findall(r'[a-z]+', text)

tokens = tokenize(text)
print(tokens)

['machine', 'learning', 'is', 'the', 'study', 'of', 'computer', 'algorithms', 'that', 'improve', 'automatically', 'through', 'experience', 'it', 'is', 'seen', 'as', 'a', 'subset', 'of', 'artificial', 'intelligence', 'machine', 'learning', 'algorithms', 'build', 'a', 'mathematical', 'model', 'based', 'on', 'sample', 'data', 'known', 'as', 'training', 'data', 'in', 'order', 'to', 'make', 'predictions', 'or', 'decisions', 'without', 'being', 'explicitly', 'programmed', 'to', 'do', 'so', 'machine', 'learning', 'algorithms', 'are', 'used', 'in', 'a', 'wide', 'variety', 'of', 'applications', 'such', 'as', 'email', 'filtering', 'and', 'computer', 'vision', 'where', 'it', 'is', 'difficult', 'or', 'infeasible', 'to', 'develop', 'conventional', 'algorithms', 'to', 'perform', 'the', 'needed', 'tasks']


In [86]:
# build the lookup dictionaries: id -> token and
# token -> id
def mapping(tokens: List[str]) -> Tuple[Dict[int, str], Dict[str, int]]:
    id_to_token = {}
    token_to_id = {}
    for i, token in enumerate(set(tokens)):
        id_to_token[i] = token
        token_to_id[token] = i

    return id_to_token, token_to_id

id_to_token, token_to_id = mapping(tokens)
id_to_token

{0: 'or',
 1: 'intelligence',
 2: 'perform',
 3: 'through',
 4: 'data',
 5: 'it',
 6: 'variety',
 7: 'on',
 8: 'improve',
 9: 'develop',
 10: 'based',
 11: 'without',
 12: 'seen',
 13: 'sample',
 14: 'make',
 15: 'needed',
 16: 'wide',
 17: 'the',
 18: 'order',
 19: 'filtering',
 20: 'programmed',
 21: 'are',
 22: 'of',
 23: 'algorithms',
 24: 'tasks',
 25: 'such',
 26: 'in',
 27: 'do',
 28: 'email',
 29: 'where',
 30: 'vision',
 31: 'build',
 32: 'predictions',
 33: 'decisions',
 34: 'is',
 35: 'study',
 36: 'experience',
 37: 'computer',
 38: 'that',
 39: 'infeasible',
 40: 'a',
 41: 'mathematical',
 42: 'learning',
 43: 'being',
 44: 'applications',
 45: 'model',
 46: 'used',
 47: 'machine',
 48: 'training',
 49: 'so',
 50: 'conventional',
 51: 'and',
 52: 'automatically',
 53: 'as',
 54: 'to',
 55: 'known',
 56: 'difficult',
 57: 'explicitly',
 58: 'subset',
 59: 'artificial'}

## Data Generation

To generate the training data, we use the skip-gram scheme.
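
As a minimal illustration of the skip-gram idea, the sketch below lists the (center, context) pairs for a toy sentence. The helper name `skipgram_pairs` and the toy tokens are assumptions made for this example only; they are not part of the solution code.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for each token and its neighbors within `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                  # clip the window at the left edge
        hi = min(len(tokens), i + window + 1)    # clip the window at the right edge
        for j in range(lo, hi):
            if j != i:                           # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# hypothetical toy input
toy_tokens = ["machine", "learning", "is", "fun"]
toy_pairs = skipgram_pairs(toy_tokens, window=1)
```

Each center word is paired with every neighbor inside the window, so words near the edges of the text contribute fewer pairs. For `n` tokens and window `w` (with `n >= w`) this gives `2*w*n - w*(w + 1)` pairs in total.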

In [89]:
# one-hot encoding function
def one_hot_encoding(i: int, vocab_size: int) -> List[int]:
    one_hot = [0] * vocab_size
    one_hot[i] = 1
    return one_hot

# function that generates the training data
def generate_skipgram_training_data(tokens: List[str], 
                                    token_to_id: Dict[str, int], 
                                    n_neighbors: int) -> Tuple[np.array, np.array]:
    len_tokens = len(tokens)
    assert len_tokens > n_neighbors

    X = []
    y = []
    for i in range(len_tokens):
        ind = list(range(max(0, i-n_neighbors), min(len_tokens, i+n_neighbors+1)))
        token_id = token_to_id[tokens[i]]
        one_hot_target = one_hot_encoding(token_id, len(token_to_id))
        
        for j in ind:
            if i == j:
                continue
            X.append(one_hot_target)
            neighbor_id = token_to_id[tokens[j]]
            one_hot_neighbors = one_hot_encoding(neighbor_id, len(token_to_id))
            y.append(one_hot_neighbors)
    return np.array(X), np.array(y)

X, y = generate_skipgram_training_data(tokens, token_to_id, 2)
print(X.shape, y.shape)
print(X)

(330, 60) (330, 60)
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
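
Multiplying a one-hot row by a weight matrix only selects one row of that matrix. The sketch below checks this on made-up sizes and ids (all values here are assumptions for illustration), which suggests that integer indices could replace the one-hot matrix for the input side.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_size = 6, 3                 # hypothetical sizes
w1 = rng.standard_normal((vocab_size, embedding_size))

ids = np.array([4, 0, 4, 2])                      # hypothetical token ids
one_hot = np.eye(vocab_size)[ids]                 # shape (4, 6), a single 1 per row

via_matmul = one_hot @ w1                         # dense product over the whole vocabulary
via_lookup = w1[ids]                              # direct row lookup
same = np.allclose(via_matmul, via_lookup)
```

With a memory limit in place, this matters because a dense one-hot matrix grows as (number of pairs) x (vocabulary size), while the index array grows only with the number of pairs.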


## Model

In [12]:
class Word2Vec:

    def __init__(self, 
                 vocab_size: int, 
                 embedding_size: int, 
                 id_to_token: Dict[int, str], 
                 token_to_id: Dict[str, int], 
                 lr = 0.05):
        self.w1 = np.random.randn(vocab_size, embedding_size)
        self.w2 = np.random.randn(embedding_size, vocab_size)
        self.lr = lr
        self.id_to_token = id_to_token
        self.token_to_id = token_to_id
        self.cache = {}

    def forward(self, X: np.array): 
        self.cache["a1"] = X @ self.w1
        self.cache["a2"] = self.cache["a1"] @ self.w2
        self.cache["z"] = self._softmax(self.cache["a2"])

    def backward(self, X: np.array, y: np.array) -> float:
        self.forward(X)
        self._derivatives(X, y)
        return self._cross_entropy(y)

    def get_embeddings(self) -> Dict[str, np.array]:
        embeddings = {}
        for id in self.id_to_token:
            one_hot = one_hot_encoding(id, len(self.id_to_token))
            self.forward(one_hot)
            embeddings[self.id_to_token[id]] = self.cache["a1"]
        return embeddings

    def _derivatives(self, X: np.array, y: np.array):
        da2 = self.cache["z"] - y
        dw2 = self.cache["a1"].T @ da2
        da1 = da2 @ self.w2.T
        dw1 = X.T @ da1

        self.w1 -= self.lr * dw1
        self.w2 -= self.lr * dw2

    def _softmax(self, X: np.array) -> List[float]:
        res = []
        for x in X:
            e = np.exp(x)
            sm = e / e.sum()
            res.append(sm)
        return res

    def _cross_entropy(self, y: np.array) -> float:
        return -(1/len(self.cache["z"]))*np.sum(np.log(self.cache["z"]) * y)
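
The row-by-row softmax can overflow when scores become large, because `np.exp` of a big number is `inf`. A common remedy, sketched here as an assumption rather than as part of the solution, is to subtract each row's maximum before exponentiating; this does not change the result mathematically. The function name `stable_softmax` is hypothetical.

```python
import numpy as np


def stable_softmax(scores: np.ndarray) -> np.ndarray:
    """Row-wise softmax for a 2-D array, shifted by the row maximum for stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)   # largest entry per row becomes 0
    exp = np.exp(shifted)                                   # all values now lie in (0, 1]
    return exp / exp.sum(axis=1, keepdims=True)


# the second row would overflow with a naive np.exp(1000.0)
scores = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1000.0, 1000.0]])
probs = stable_softmax(scores)
```

This version also returns a single array instead of a list of arrays and avoids the Python-level loop over rows.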

### Training

In [13]:
n_epoch = 300

wv = Word2Vec(vocab_size=len(token_to_id), embedding_size=10, id_to_token=id_to_token, token_to_id=token_to_id, lr=0.02)
losses = []

for epoch in range(n_epoch):
    loss = wv.backward(X, y)
    losses.append(loss)

plt.plot(range(len(losses)), losses, color="skyblue")
plt.show()


In [116]:
learning = one_hot_encoding(token_to_id["learning"], len(token_to_id))
wv.forward([learning])
result = wv.cache["z"][0]

for word in (id_to_token[id] for id in np.argsort(result)[::-1]):
    print(word)


machine
the
so
build
intelligence
are
is
algorithms
computer
learning
mathematical
artificial
conventional
subset
seen
programmed
known
wide
it
based
difficult
study
develop
order
experience
automatically
a
do
that
in
perform
where
used
model
improve
through
infeasible
explicitly
vision
predictions
variety
as
such
sample
of
training
needed
tasks
make
decisions
on
without
applications
or
data
and
filtering
email
to
being


### Get embeddings

In [120]:
embeddings = wv.get_embeddings()
len(embeddings['or'])

10
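
One way to inspect a dictionary of embeddings is to rank words by cosine similarity to a query word. The sketch below uses made-up two-dimensional vectors and a hypothetical helper `most_similar`; it assumes that no vector has zero norm.

```python
import numpy as np
from typing import Dict, List, Tuple


def most_similar(word: str,
                 embeddings: Dict[str, np.ndarray],
                 top_n: int = 3) -> List[Tuple[str, float]]:
    """Return the `top_n` words with the highest cosine similarity to `word`."""
    query = embeddings[word]
    scored = []
    for other, vec in embeddings.items():
        if other == word:
            continue
        # cosine similarity; assumes neither vector is all zeros
        cos = float(query @ vec / (np.linalg.norm(query) * np.linalg.norm(vec)))
        scored.append((other, cos))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# hypothetical toy vectors, not trained embeddings
toy_embeddings = {
    "cat": np.array([1.0, 0.0]),
    "kitten": np.array([0.9, 0.1]),
    "car": np.array([0.0, 1.0]),
}
neighbors = most_similar("cat", toy_embeddings, top_n=2)
```

The same function can be called on the dictionary returned by `get_embeddings()`, although with a corpus this small the neighbors are unlikely to be meaningful.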

## Implementation

In [15]:
from typing import Tuple, List, Dict
import re
import string

import numpy as np


def generate_skipgram_training_data(tokens: List[str], 
                                    n_neighbors: int,
                                    token_to_id: Dict[str, int]) -> Tuple[np.array, np.array]:
    len_tokens = len(tokens)
    assert len_tokens > n_neighbors

    X = []
    y = []
    for i in range(len_tokens):
        ind = list(range(max(0, i-n_neighbors), min(len_tokens, i+n_neighbors+1)))
        token_id = token_to_id[tokens[i]]
        one_hot_target = one_hot_encoding(token_id, len(token_to_id))
        
        for j in ind:
            if i == j:
                continue
            X.append(one_hot_target)
            neighbor_id = token_to_id[tokens[j]]
            one_hot_neighbors = one_hot_encoding(neighbor_id, len(token_to_id))
            y.append(one_hot_neighbors)
    return np.array(X), np.array(y)

def clean(inp: str) -> str:
    inp = inp.translate(str.maketrans(string.punctuation, " "*len(string.punctuation)))
    inp = re.sub(r'\s+', ' ', inp.lower())
    return inp

def tokenize(text) -> List[str]:
    return re.findall(r'[a-z]+', text)

def mapping(tokens: List[str]) -> Tuple[Dict[int, str], Dict[str, int]]:
    id_to_token = {}
    token_to_id = {}
    for i, token in enumerate(set(tokens)):
        id_to_token[i] = token
        token_to_id[token] = i

    return id_to_token, token_to_id

def one_hot_encoding(i: int, vocab_size: int) -> List[int]:
    one_hot = [0] * vocab_size
    one_hot[i] = 1
    return one_hot

class Word2Vec:

    def __init__(self, 
                 vocab_size: int, 
                 embedding_size: int, 
                 id_to_token: Dict[int, str], 
                 token_to_id: Dict[str, int], 
                 lr = 0.05):
        self.w1 = np.random.randn(vocab_size, embedding_size)
        self.w2 = np.random.randn(embedding_size, vocab_size)
        self.lr = lr
        self.id_to_token = id_to_token
        self.token_to_id = token_to_id
        self.cache = {}

    def forward(self, X: np.array): 
        self.cache["a1"] = X @ self.w1
        self.cache["a2"] = self.cache["a1"] @ self.w2
        self.cache["z"] = self._softmax(self.cache["a2"])

    def backward(self, X: np.array, y: np.array) -> float:
        self.forward(X)
        self._derivatives(X, y)
        return self._cross_entropy(y)

    def get_embeddings(self) -> Dict[str, np.array]:
        embeddings = {}
        for id in self.id_to_token:
            one_hot = one_hot_encoding(id, len(self.id_to_token))
            self.forward(one_hot)
            embeddings[self.id_to_token[id]] = self.cache["a1"]
        return embeddings

    def _derivatives(self, X: np.array, y: np.array):
        da2 = self.cache["z"] - y
        dw2 = self.cache["a1"].T @ da2
        da1 = da2 @ self.w2.T
        dw1 = X.T @ da1

        self.w1 -= self.lr * dw1
        self.w2 -= self.lr * dw2

    def _softmax(self, X: np.array) -> List[float]:
        res = []
        for x in X:
            e = np.exp(x)
            sm = e / e.sum()
            res.append(sm)
        return res

    def _cross_entropy(self, y: np.array) -> float:
        return -(1/len(self.cache["z"]))*np.sum(np.log(self.cache["z"]) * y)

    
def train(data: str):
    """
    return: w2v_dict: dict
            - key: string (word)
            - value: np.array (embedding)
    """

    text = clean(data)
    tokens = tokenize(text)
    id_to_token, token_to_id = mapping(tokens)
    X, y = generate_skipgram_training_data(tokens, 2, token_to_id)
    
    wv = Word2Vec(vocab_size=len(token_to_id), embedding_size=10, id_to_token=id_to_token, token_to_id=token_to_id, lr=0.05)

    n_epoch = 50
    for _ in range(n_epoch):
        _ = wv.backward(X, y)

    embeddings = wv.get_embeddings()
    return embeddings

text = '''Now that we have tokenized the text and created lookup tables, we can now proceed to generating the actual training data, which are going to take the form of matrices. Since tokens are still in the form of strings, we need to encode them numerically using one-hot vectorization. We also need to generate a bundle of input and target values, as this is a supervised learning technique.

This then begs the question of what the input and target values are going to look like. What is the value that we are trying to approximate, and what sort of input will we be feeding into the model to generate predictions? The answer to these questions and how they tie into word2vec is at the heart of understanding word embeddings—as you may be able to tell, word2vec is not some sort of blackbox magic, but a result of careful training with input and output values, just like any other machine learning task.'''
embeddings = train(text)
embeddings

{'tokenized': array([ 0.80259933, -0.56195691, -1.23496325,  0.64312449, -0.28062969,
        -0.55299999,  1.89906804, -0.30005229, -0.85341851,  0.04752365]),
 'careful': array([-1.40595156, -1.65376121,  1.27840585,  1.07453342, -0.21095158,
        -0.23241511,  0.2024018 ,  0.92059893,  0.78436003, -0.56534324]),
 'since': array([ 1.21649575, -1.91500802,  0.82886996,  2.32844402,  1.53424776,
        -0.23393498, -0.1561125 , -0.99108843,  0.57167369,  0.5772824 ]),
 'may': array([-1.04888127,  0.88197771,  0.00810673, -1.01314758,  1.74901887,
        -2.35533384, -0.73804228,  1.06384545, -1.35517123, -1.24357045]),
 'they': array([-5.31195077e-01,  4.32396734e-01, -3.28276559e+00, -4.46890117e-04,
        -1.11069010e+00, -1.12281080e+00,  6.12294997e-01,  4.36559105e-01,
        -8.07240964e-01,  2.02868278e-01]),
 'values': array([ 1.47153783, -0.12877025, -0.65009727, -0.9700206 ,  0.57348369,
        -0.38631015, -1.07416369,  1.38494936,  1.85533465, -0.71901027]),
 'to':

# Practice Word2Vec Hard
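
The PyTorch model in this part applies `nn.LogSoftmax` and is trained with `nn.NLLLoss`. As a sanity check on that pairing, the sketch below compares it with `nn.CrossEntropyLoss` applied to the raw scores; the two are expected to agree. The tensor shapes and target ids are arbitrary values chosen for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
scores = torch.randn(4, 7)                 # hypothetical batch: 4 examples, 7-word vocabulary
targets = torch.tensor([0, 3, 6, 2])       # hypothetical context-word ids

# two-step form: log-softmax followed by negative log-likelihood
loss_two_step = nn.NLLLoss()(nn.LogSoftmax(dim=1)(scores), targets)

# one-step form: cross-entropy computed directly on the raw scores
loss_one_step = nn.CrossEntropyLoss()(scores, targets)

match = torch.allclose(loss_two_step, loss_one_step)
```

Either form can be used; the one-step form lets the model return raw scores and leaves the normalization to the loss.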

In [39]:
from collections import Counter
from typing import Tuple, List, Dict
import re
import string
import random

import numpy as np
import torch
from torch import nn
import torch.optim as optim

LR = 0.001
BATCH_SIZE = 64
N_NEIGHBORS = 5
EMB_SIZE = 32
N_EPOCH = 100

def generate_skipgram_training_data(tokens: List[str], 
                                    n_neighbors: int,
                                    token_to_id: Dict[str, int]) -> Tuple[np.array, np.array]:
    len_tokens = len(tokens)
    assert len_tokens > n_neighbors

    X = []
    y = []
    for i in range(len_tokens):
        ind = list(range(max(0, i-n_neighbors), min(len_tokens, i+n_neighbors+1)))
        token_id = token_to_id[tokens[i]]
        
        for j in ind:
            if i == j:
                continue
            X.append(token_id)
            neighbor_id = token_to_id[tokens[j]]
            y.append(neighbor_id)
    return np.array(X), np.array(y)

def get_batches(X, y, batch_size):
    for idx in range(0, len(X), batch_size):
        yield X[idx:idx+batch_size], y[idx:idx+batch_size]

def clean(inp: str) -> str:
    inp = inp.translate(str.maketrans(string.punctuation, " "*len(string.punctuation)))
    inp = re.sub(r'\s+', ' ', inp.lower())
    return inp

def tokenize(text) -> List[str]:
    return re.findall(r'[a-z]+', text)

def mapping(tokens: List[str]) -> Tuple[Dict[int, str], Dict[str, int]]:
    id_to_token = {}
    token_to_id = {}
    for i, token in enumerate(set(tokens)):
        id_to_token[i] = token
        token_to_id[token] = i

    return id_to_token, token_to_id


class Word2Vec(nn.Module):

    def __init__(self, vocab_size, embedding_size):
        super().__init__()
        
        self.embed = nn.Embedding(vocab_size, embedding_size)
        self.output = nn.Linear(embedding_size, vocab_size)
        self.log_softmax = nn.LogSoftmax(dim=1)
    
    def forward(self, x):
        x = self.embed(x)
        scores = self.output(x)
        log_ps = self.log_softmax(scores)
        
        return log_ps

def get_embeddings(id_to_token, embeddings):
    embeddings = {id_to_token[i]: embeddings[i] for i in range(len(id_to_token))}
    return embeddings

    
def train(data: str):
    """
    return: w2v_dict: dict
            - key: string (word)
            - value: np.array (embedding)
    """
    
    text = clean(data)
    tokens = tokenize(text)
    id_to_token, token_to_id = mapping(tokens)
    X, y = generate_skipgram_training_data(tokens, N_NEIGHBORS, token_to_id)
    
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    # move the model to the same device as the input batches
    wv = Word2Vec(vocab_size=len(token_to_id), embedding_size=EMB_SIZE).to(device)
    criterion = nn.NLLLoss()
    optimizer = optim.AdamW(wv.parameters(), lr=LR)
    
    for _ in range(N_EPOCH):
        # get input and target batches
        for inputs, targets in get_batches(X, y, BATCH_SIZE):
            inputs, targets = torch.LongTensor(inputs), torch.LongTensor(targets)
            inputs, targets = inputs.to(device), targets.to(device)
            
            log_ps = wv(inputs)
            loss = criterion(log_ps, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    embeddings = wv.embed.weight.detach().cpu().numpy()
    embeddings = get_embeddings(id_to_token, embeddings)
    return embeddings

text = '''Now that we have tokenized the text and created lookup tables, we can now proceed to generating the actual training data, which are going to take the form of matrices. Since tokens are still in the form of strings, we need to encode them numerically using one-hot vectorization. We also need to generate a bundle of input and target values, as this is a supervised learning technique.

This then begs the question of what the input and target values are going to look like. What is the value that we are trying to approximate, and what sort of input will we be feeding into the model to generate predictions? The answer to these questions and how they tie into word2vec is at the heart of understanding word embeddings—as you may be able to tell, word2vec is not some sort of blackbox magic, but a result of careful training with input and output values, just like any other machine learning task.'''
embeddings = train(text)
embeddings

{'able': array([ 1.0360265 ,  1.2592756 , -1.7382044 , -0.9255914 , -1.4748343 ,
        -0.761087  , -0.52008533,  0.03287016,  0.8205714 , -0.84094334,
         0.5236916 , -0.03391146,  1.4279349 ,  1.0823654 , -1.1519045 ,
         0.91479003, -1.4376178 , -0.01284576, -0.43107465,  0.2211382 ,
         0.64092225, -0.83209544,  0.00872847, -0.3271192 ,  0.21419977,
        -2.6025558 ,  0.14706932,  0.08581059, -0.24141347,  1.5465369 ,
         2.1173382 , -0.093008  ], dtype=float32),
 'learning': array([ 0.41807917, -1.7490933 , -0.03405838, -1.2870965 ,  1.215351  ,
         0.5127766 , -2.5150673 ,  0.30145097,  1.7711991 , -0.11772451,
         0.8761567 ,  0.35972893,  0.47273678, -0.38144562,  0.691942  ,
        -0.6183901 ,  1.2788106 ,  0.33727142, -1.5197242 ,  0.11188744,
        -1.209866  ,  2.47572   ,  2.2331083 , -0.77228403, -0.7701088 ,
         1.3658247 ,  1.1134578 , -0.749292  ,  1.4852693 , -1.3693407 ,
         1.2494949 ,  0.26874423], dtype=float32),
 '