In [None]:
!pip install -qq nltk==3.4
!pip install -qq gensim==3.6.0
!pip install -qq pandas==0.23.4
!pip install -qq bokeh==1.0.3

!wget -O quora.zip -qq --no-check-certificate "https://drive.google.com/uc?export=download&id=1ERtxpdWOgGQ3HOigqAMHTJjmOE_tWvoF"
!unzip quora.zip

In [None]:
import numpy as np
import nltk

np.random.seed(42)

nltk.download('punkt')
nltk.download('stopwords')

# Word Embeddings

*NB. This notebook is somewhat based on the YSDA NLP course [notebook](https://github.com/yandexdataschool/nlp_course/tree/master/week01_embeddings).*

I guess you've already seen pictures like this:  
![embeddings relations](https://www.tensorflow.org/images/linear-relationships.png)
*From [Vector Representations of Words, Tensorflow tutorial](https://www.tensorflow.org/tutorials/representation/word2vec)*

We are going to use these thingies alo-o-ot in the course.

Well, we need a proper introduction, nevertheless. Do you remember how we represented sentences last time?

We converted a sentence to the bag-of-words:  
![](https://i.ibb.co/Tvw1c8S/BOW.png)

And each word was represented using one-hot encoding (a vector with a one at the position corresponding to the word's index and zeros at all other positions).

These one-hot encoding vectors have extremely high dimensions (like, hundreds of thousands or millions). They fit their purpose - to encode information about words. But they have several disadvantages.

First of all, they are almost uninterpretable. I mean, all one-hot encoding vectors are orthonormal, so you cannot say that, e.g. `man` and `men` are more similar words than `man` and `crocodiles`.
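
To make the orthogonality point concrete, here is a minimal sketch with a made-up three-word vocabulary (the `vocab` list and the `one_hot` and `dot` helpers are illustrative only, not part of the notebook's code). The dot product of any two different one-hot vectors is zero, so every pair of words looks equally unrelated:

```python
# Toy vocabulary, assumed only for illustration.
vocab = ['man', 'men', 'crocodiles']

def one_hot(word):
    """Return a list with 1 at the word's index and 0 elsewhere."""
    return [1 if w == word else 0 for w in vocab]

def dot(u, v):
    """Plain dot product of two equally long lists."""
    return sum(a * b for a, b in zip(u, v))

man_men = dot(one_hot('man'), one_hot('men'))
man_crocodiles = dot(one_hot('man'), one_hot('crocodiles'))

# Both products are zero: one-hot vectors carry no notion of "closer" words.
print(man_men, man_crocodiles)
```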

But we want to. Well, NLP researchers in the past few years wanted to, cannot really speak for you.

And we're gonna build vectors that encode semantics!

Look at the first picture. It shows relations encoded in the word embedding space, such as male-female or verb tense... whatever, just check these two links: http://bionlp-www.utu.fi/wv_demo/, https://lamyiowce.github.io/word2viz/. Go and play with these relations right now! They are fun, and you'll get an insight into what word embeddings can do.

There is another disadvantage of one-hot encoding vectors: their size. The word embedding vectors we are going to play with usually have from 50 to 600 dimensions. That is a few orders of magnitude smaller than one-hot encoding vectors.

This is crucial for neural networks - they work far better with reasonably small dense vectors. Well, we'll speak about it later.

---

In this notebook, we are going to work with [gensim](https://radimrehurek.com/gensim/), a more or less standard Python library for word embeddings. We'll only superficially discuss how it works, but we'll train our own model and apply a pretrained one. As a result, you're (probably) gonna understand how to work with word embeddings.

In the next notebook, we'll try to work out how word embeddings work and how to implement a module to train word embeddings.

## Training Model

Well, there is nothing interesting in merely training a word embedding model. We are gonna apply it to a very concrete task: [Quora Question Pairs](https://www.kaggle.com/c/quora-question-pairs) from Kaggle:

In [None]:
import pandas as pd

quora_data = pd.read_csv('train.csv')

quora_data.sample(20)[['question1', 'question2', 'is_duplicate']]

You see, the dataset consists of question pairs and you have to determine which of them are duplicates and which are not.

Well, I'm not promising that we'll achieve good results right now, but still... Let's train a gensim Word2vec model!

*Word2vec is the most popular method of building word embeddings. We'll implement it next time; for now, let's believe that it just does whatever we want.*

First of all, we need to collect the available texts to pass them to the Word2vec model:

In [None]:
import numpy as np

quora_data.question1 = quora_data.question1.replace(np.nan, '', regex=True)
quora_data.question2 = quora_data.question2.replace(np.nan, '', regex=True)

texts = list(pd.concat([quora_data.question1, quora_data.question2]).unique())
texts[:10]

Next, we have to tokenize the texts. Remember, last time we used `spacy` for this purpose. Well, this time we'll use `nltk` - another great NLP library.

It goes this way:

In [None]:
from nltk.tokenize import word_tokenize

word_tokenize(texts[0])

**Task** Your turn: lowercase all the texts and tokenize them:

In [None]:
tokenized_texts = [<do it>]

assert all(isinstance(row, (list, tuple)) for row in tokenized_texts), \
    "please convert each line into a list of tokens"
assert all(all(isinstance(tok, str) for tok in row) for row in tokenized_texts), \
    "please convert each line into a list of tokens"

is_latin = lambda tok: all('a' <= x.lower() <= 'z' for x in tok)
assert all(not is_latin(token) or token.islower() for tokens in tokenized_texts for token in tokens),\
    "please lowercase each line"

In [None]:
print([' '.join(row) for row in tokenized_texts[:2]])

And we are ready to train a small model:

In [None]:
from gensim.models import Word2Vec

model = Word2Vec(tokenized_texts,
                 size=32,      # embedding vector size
                 min_count=5,  # consider words that occurred at least 5 times
                 window=5,     # define context as a 5-word window around the target word
                 seed=0,       # + workers=1 is to make model reproducible
                 workers=1).wv

## Analyzing Model

Yay, we have our own model, let's play with it!

To get a word's vector, well, call `get_vector`:

In [None]:
model.get_vector('anything')

To get the words most similar to a given one, call (guess what) `most_similar`:

In [None]:
model.most_similar('bread')

And it can do such magic:

In [None]:
model.most_similar(positive=['coder', 'money'], negative=['brain'])

That is, who is like a coder, with money and without brains.

And this too:

In [None]:
model.most_similar([model.get_vector('politician') - model.get_vector('power') + model.get_vector('honesty')])

An honest politician without power. Isn't that just cute?
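
Under the hood this is just vector arithmetic followed by a nearest-neighbour search. Here is a minimal sketch with hand-made 3-d vectors (the `toy` dictionary and the `analogy` helper are invented for illustration; gensim's `most_similar` may differ in details such as normalization):

```python
import math

# Hand-made 3-d "embeddings" (axes: royalty, gender, person-ness); purely illustrative.
toy = {
    'man':    [0.0, -1.0, 1.0],
    'woman':  [0.0,  1.0, 1.0],
    'king':   [1.0, -1.0, 1.0],
    'queen':  [1.0,  1.0, 1.0],
    'prince': [1.0, -1.0, 0.5],
    'girl':   [0.0,  1.0, 0.5],
}

def cosine(u, v):
    """Cosine similarity of two lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(positive, negative):
    """Add the positive vectors, subtract the negative ones,
    and return the closest remaining word by cosine similarity."""
    query = [0.0, 0.0, 0.0]
    for word in positive:
        query = [q + x for q, x in zip(query, toy[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, toy[word])]
    candidates = [w for w in toy if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(query, toy[w]))

# king - man + woman
best = analogy(positive=['king', 'woman'], negative=['man'])
print(best)
```

In this toy space the query vector coincides with the vector of `queen`, which is why the analogy works out so cleanly; real embeddings are much noisier.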

**Task** Play with it. And yes, I'm serious.

## Visualizing Model

Let's now look at the projection of the 1000 most frequent words.

In [None]:
words = sorted(model.vocab.keys(),
               key=lambda word: model.vocab[word].count,
               reverse=True)[:1000]

print(words[::100])

**Task** Build the matrix from these words' vectors.

In [None]:
word_vectors = model.vectors[[model.vocab[word].index for word in words]]

assert isinstance(word_vectors, np.ndarray)
assert word_vectors.shape == (len(words), model.vectors.shape[1])
assert np.isfinite(word_vectors).all()

Now we'll try to project these 32-dimensional vectors onto a more convenient 2D space so that we can look at them.

### PCA

The simplest linear method of dimensionality reduction is __P__rincipal __C__omponent __A__nalysis.

PCA builds so-called principal components: a set of directions along which our data has the largest variance:  

![pca](https://i.stack.imgur.com/Q7HIP.gif)
*From the great answer [https://stats.stackexchange.com/a/140579](https://stats.stackexchange.com/a/140579)*

For instance, in the picture, the rotating line represents possible variants of the first principal component. If we want to project a 2D set of dots onto one dimension, we'll probably want to preserve as much information as possible. The maximum-variance position of the rotating line gives us more information about the dots than any other position.

Really nice illustrations of this mechanism live [here](http://setosa.io/ev/principal-component-analysis/).

In short, project the multi-dimensional space onto the first two or three components and enjoy quick-and-dirty dimensionality reduction.
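
The rotating-line picture can be reproduced in a few lines. This is only a sketch on made-up 2-d points (the `points` list and the `projected_variance` helper are assumptions for illustration), and it searches over angles by brute force instead of using the usual eigenvector computation:

```python
import math

# Toy 2-d points stretched along the diagonal; made up for illustration.
points = [(-2.0, -1.9), (-1.0, -1.2), (0.0, 0.1), (1.0, 0.8), (2.0, 2.2)]

# Centre the data first: PCA works with deviations from the mean.
mean_x = sum(x for x, _ in points) / len(points)
mean_y = sum(y for _, y in points) / len(points)
centred = [(x - mean_x, y - mean_y) for x, y in points]

def projected_variance(angle_deg):
    """Variance of the centred points projected onto a line at the given angle."""
    a = math.radians(angle_deg)
    proj = [x * math.cos(a) + y * math.sin(a) for x, y in centred]
    return sum(p * p for p in proj) / len(proj)

# Rotate the line in 1-degree steps and keep the direction with the largest variance.
best_angle = max(range(180), key=projected_variance)
print(best_angle, projected_variance(best_angle))
```

The winning direction lies close to the diagonal along which the points were stretched, and the variance along the perpendicular direction is tiny by comparison.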

**Task** Use [sklearn PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) to project data to 2D. Centre and normalize the output.

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def get_pca_projection(word_vectors):
    <implement me>

In [None]:
word_vectors_pca = get_pca_projection(word_vectors)

assert word_vectors_pca.shape == (len(word_vectors), 2), "there must be a 2d vector for each word"
assert max(abs(word_vectors_pca.mean(0))) < 1e-5, "points must be zero-centered"
assert max(abs(1 - word_vectors_pca.std(0))) < 1e-5, "points must have unit variance"

Let's visualize the embeddings:

In [None]:
import bokeh.models as bm, bokeh.plotting as pl
from bokeh.io import output_notebook

def draw_vectors(x, y, radius=10, alpha=0.25, color='blue',
                 width=600, height=400, show=True, **kwargs):
    """ draws an interactive plot for data points with auxiliary info on hover """
    output_notebook()

    if isinstance(color, str):
        color = [color] * len(x)
    data_source = bm.ColumnDataSource({ 'x' : x, 'y' : y, 'color': color, **kwargs })

    fig = pl.figure(active_scroll='wheel_zoom', width=width, height=height)
    fig.scatter('x', 'y', size=radius, color='color', alpha=alpha, source=data_source)

    fig.add_tools(bm.HoverTool(tooltips=[(key, "@" + key) for key in kwargs.keys()]))
    if show:
        pl.show(fig)
    return fig

In [None]:
draw_vectors(word_vectors_pca[:, 0], word_vectors_pca[:, 1], token=words)

### T-SNE

There is a more complicated method of data visualization. It's called t-SNE. You can gain an intuition behind it from [this](https://distill.pub/2016/misread-tsne/) article (warning: even more beautiful illustrations).

**Task** Well, the same as the previous one: apply [TSNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html), normalize and center the result.

In [None]:
from sklearn.manifold import TSNE

def get_tsne_projection(word_vectors):
    <fill me>

In [None]:
word_tsne = get_tsne_projection(word_vectors)
draw_vectors(word_tsne[:, 0], word_tsne[:, 1], color='green', token=words)

## Using Pretrained Embeddings

We can also use a pretrained embedding model. There are a number of such models available through gensim; you can call `api.info()` to get the list.

Let's load a model:

In [None]:
import gensim.downloader as api

model = api.load('glove-twitter-100')

## Building Phrase Embeddings

The simplest way to obtain a phrase embedding is to average embeddings of the words in the phrase.

*You are probably thinking, 'What a dumb idea, why on earth should the average of embeddings contain any useful information?'. Well, check [this paper](https://arxiv.org/pdf/1805.09843.pdf).*

Let's do it: tokenize and lowercase the texts, then calculate the mean embedding over the words with known embeddings.
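
To see the idea on something tiny first, here is a sketch with a hypothetical two-word embedding table (`toy_embeddings` and `mean_embedding` are invented names, and plain `str.split` stands in for a real tokenizer):

```python
# Toy 2-d embeddings for a hypothetical vocabulary.
toy_embeddings = {'good': [1.0, 0.0], 'morning': [0.0, 1.0]}

def mean_embedding(tokens, dim=2):
    """Average the vectors of known tokens; return zeros if none is known."""
    known = [toy_embeddings[t] for t in tokens if t in toy_embeddings]
    if not known:
        return [0.0] * dim
    # zip(*known) iterates over coordinates, so each `col` is one dimension.
    return [sum(col) / len(known) for col in zip(*known)]

phrase_vec = mean_embedding('good morning everyone'.lower().split())
unknown_vec = mean_embedding(['xyzzy'])
print(phrase_vec, unknown_vec)
```

Note that the unknown word `everyone` is simply skipped, so it does not drag the average towards zero.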

**Task** Implement the following function.

In [None]:
def get_phrase_embedding(model, phrase):
    """ Calcs phrase embedding as a mean of known word embeddings in the phrase.
    If all the words are unknown, returns zero vector.
    :param model: KeyedVectors instance
    :param phrase: str or list of str (tokenized text)
    """
    embedding = np.zeros([model.vector_size], dtype='float32')

    if isinstance(phrase, str):
        words = word_tokenize(phrase.lower())
    else:
        words = phrase

    <implement me>

    return embedding

In [None]:
vector = get_phrase_embedding(model, "I'm very sure. This never happened to me before...")

assert np.allclose(vector[::10],
                   np.array([ 0.30757686, -0.05861897,  0.143751  , -0.11104885, -0.96929336,
                             -0.21928601,  0.21652265,  0.14978765,  1.4842536 ,  0.017826  ],
                              dtype=np.float32))

Well, we are ready to embed all the sentences in our corpus.

In [None]:
text_vectors = np.array([get_phrase_embedding(model, phrase) for phrase in tokenized_texts])

What can we do with it? Now we are able to search for the nearest neighbours of a given phrase in our base!

How are we going to define the distance?

We'll use cosine similarity of two vectors:
$$\text{cosine_similarity}(x, y) = \frac{x^{T} y}{||x||\cdot ||y||}$$

*It's not a [distance](https://www.encyclopediaofmath.org/index.php/Metric) strictly speaking but we still can use it to search for the vectors.*
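
A quick sketch of the formula in plain Python (the `cosine_sim` helper below is a hand-written illustration, not the sklearn function used later). It also shows one reason why this is not a proper distance: two different vectors pointing in the same direction get the maximal similarity.

```python
import math

def cosine_sim(x, y):
    """x^T y / (||x|| * ||y||) for two lists of floats."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

same_direction = cosine_sim([1.0, 2.0], [2.0, 4.0])   # different length, same direction
orthogonal = cosine_sim([1.0, 0.0], [0.0, 3.0])       # nothing in common
opposite = cosine_sim([1.0, 2.0], [-1.0, -2.0])       # pointing the other way

print(same_direction, orthogonal, opposite)
```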

**Task** Calc the similarity between `query` embedding and `text_vectors` using `cosine_similarity` function. Find `k` vectors with highest scores and return corresponding texts from `texts` list.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

def find_nearest(model, text_vectors, texts, query, k=10):
    <implement me too>

In [None]:
results = find_nearest(model, text_vectors, texts, query="How do i enter the matrix?", k=10)

print('\n'.join(results))

assert len(results) == 10 and isinstance(results[0], str)
assert results[1] == 'How do I get to the dark web?'
assert results[4] == 'What can I do to save the world?'

In [None]:
find_nearest(model, text_vectors, texts, query="How does Trump?", k=10)

In [None]:
find_nearest(model, text_vectors, texts, query="Why don't i ask a question myself?", k=10)

## Starting Classification

### Bag-of-Words

Finally, we are ready to return to the classification task.

We have two sentences, and we are going to calculate their similarity and compare it with some threshold. If the value is higher than the threshold, then we'll call the sentences similar.

Let's start with tokenization of the questions.

In [None]:
tokenized_question1 = [word_tokenize(question.lower()) for question in quora_data.question1]
tokenized_question2 = [word_tokenize(question.lower()) for question in quora_data.question2]

In [None]:
assert tokenized_question1[0] == ['what', 'is', 'the', 'step', 'by', 'step', 'guide', 'to', 'invest', 'in', 'share', 'market', 'in', 'india', '?']
assert tokenized_question2[2] == ['how', 'can', 'internet', 'speed', 'be', 'increased', 'by', 'hacking', 'through', 'dns', '?']

**Task** Calc the cosine similarity between the questions.

In [None]:
question1_vectors = <calc vectors for tokenized_question1>
question2_vectors = <calc vectors for tokenized_question2>

cosine_similarities = <calc similarities between the vectors in question1_vectors and question2_vectors>

In [None]:
assert cosine_similarities.shape == (len(quora_data),), 'Check the shapes'

target_similarity = cosine_similarity([get_phrase_embedding(model, tokenized_question1[1])],
                                      [get_phrase_embedding(model, tokenized_question2[1])])[0, 0]
assert np.allclose(cosine_similarities[1], target_similarity), 'Check your calculations'

Let's find the texts' similarity threshold.

We are going to optimize the accuracy of the similarity prediction. For instance, the accuracy with a threshold equal to 0 would be equal to the fraction of ones in the dataset (every pair with a positive similarity is then predicted as a duplicate):

In [None]:
(quora_data.is_duplicate == 1).mean()
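
Here is how a threshold turns similarities into predictions, sketched on five invented pairs (the `similarities` and `labels` lists and the `threshold_accuracy` helper are assumptions for illustration only):

```python
# Toy similarities and labels, invented for illustration (1 = duplicate).
similarities = [0.95, 0.40, 0.80, 0.10, 0.70]
labels = [1, 0, 1, 0, 0]

def threshold_accuracy(sims, thr, gold):
    """Predict 'duplicate' when similarity > thr; return the share of correct predictions."""
    predictions = [1 if s > thr else 0 for s in sims]
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

# With threshold 0 every pair is predicted as a duplicate.
acc_at_zero = threshold_accuracy(similarities, 0.0, labels)
# A higher threshold keeps only the most similar pairs.
acc_at_075 = threshold_accuracy(similarities, 0.75, labels)

print(acc_at_zero, acc_at_075)
```

On this toy data the accuracy at threshold 0 equals the fraction of ones, as claimed above, and a well-chosen threshold does better.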

**Task** Implement the `accuracy` function that calculates accuracy with the given threshold.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

def accuracy(cosine_similarities, threshold, labels):
    return <implement me>

thresholds = np.linspace(0, 1, 100, endpoint=False)
plt.plot(thresholds, [accuracy(cosine_similarities, th, quora_data.is_duplicate) for th in thresholds])

Let's optimize over this function to find the optimal threshold.

In [None]:
from scipy.optimize import minimize_scalar

res = minimize_scalar(
    lambda th: -accuracy(cosine_similarities, th, quora_data.is_duplicate), bounds=(0.5, 0.99), method='bounded'
)

best_threshold = res.x
best_accuracy = accuracy(cosine_similarities, best_threshold, quora_data.is_duplicate)
print('Threshold = {:.5f}, Accuracy = {:.2%}'.format(best_threshold, best_accuracy))

assert best_accuracy > 0.65, 'Check yourself'

Well, we are a bit better than random :)

### Tf-idf Weights

Plain averaging of vectors is boring. We can use a weighted average instead, with tf-idf weights.

Let's use `TfidfVectorizer` for this task.

You see, `TfidfVectorizer` returns a matrix of shape `(samples_count, words_count)`. Our embeddings are a matrix of shape `(words_count, embedding_dim)`:

In [None]:
model.vectors.shape

The embedding of a sequence of words $w_1, \ldots, w_k$, as we defined it, is the vector $\sum_i \text{idf}(w_i) \cdot \text{embedding}(w_i)$.

That means that we can multiply matrices `(samples_count, words_count) x (words_count, embedding_dim)` to obtain the embeddings for all phrases we have.

But we need to have corresponding words in both matrices. That is, the i-th column of the first matrix must correspond to the i-th row of the second matrix.
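
A tiny sketch of this product, assuming a made-up three-word vocabulary and hand-picked weights (none of these numbers come from `TfidfVectorizer`). Column `i` of `weights` and row `i` of `embeddings` refer to the same word:

```python
import numpy as np

# Hypothetical 3-word vocabulary; the word order is shared by both matrices.
vocab = ['cat', 'dog', 'fish']

# weights: (samples_count, words_count), one row per phrase, one column per word.
weights = np.array([[0.5, 0.5, 0.0],    # phrase 0 contains 'cat' and 'dog'
                    [0.0, 0.0, 1.0]])   # phrase 1 contains only 'fish'

# embeddings: (words_count, embedding_dim), row i is the vector of vocab[i].
embeddings = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [2.0, 2.0]])

# Each phrase vector is the weighted sum of the vectors of its words.
phrase_vectors = weights @ embeddings   # shape (samples_count, embedding_dim)
print(phrase_vectors)
```

If the word order of the two matrices did not match, the product would still run without errors but would silently mix up the words, which is why the shared vocabulary matters.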

To achieve it, we are going to use `vocabulary` argument of `TfidfVectorizer`.

We can extract the vocabulary this way from the gensim model:

In [None]:
vocabulary = {word: vocab_element.index for word, vocab_element in model.vocab.items()}

Initialize the vectorizer:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(vocabulary=vocabulary)

vectorizer.fit(texts)

**Task** Apply `vectorizer` to the `quora_data` questions and obtain the phrase vectors by multiplying them by `model.vectors`.

In [None]:
tfidf_question1 = <calc it>
tfidf_question2 = <and it>

In [None]:
assert tfidf_question1.shape == tfidf_question2.shape == (len(quora_data), len(vocabulary))

Check that the texts are encoded correctly in the matrices:

In [None]:
for col in tfidf_question1[0].tocoo().col:
    print(model.index2word[col], end=' ')

print('\n' + ' '.join(tokenized_question1[0]))

Now we are able to convert the tf-idf matrices to phrase vectors. That is, multiply the tf-idf and embedding matrices and normalize the result by the number of words in each sentence.

**Task** Build the question vectors.

In [None]:
EPS = 1e-9

question1_elements_count = <calc it, add EPS to ensure you don't divide by zero>
question2_elements_count = <and it too>

assert question1_elements_count.shape == question2_elements_count.shape == (len(quora_data), 1)
assert np.all(question1_elements_count > 0) and np.all(question2_elements_count > 0.)

question1_vectors = <calc mean tfidf-weighted vectors>
question2_vectors = <and these too>

assert question1_vectors.shape == question2_vectors.shape == (len(quora_data), model.vectors.shape[1])

assert np.allclose(question1_vectors[0][:10], [ 0.04672134, -0.00910798,  0.06817335,  0.00792347,  0.00907249,
                                                0.05163505,  0.02648487, -0.05109346,  0.04752091, -0.01203835])

**Task** Evaluate the quality of these embeddings.

In [None]:
cosine_similarities = <calc them>
assert cosine_similarities.shape == (len(quora_data),), 'Check the shapes'
assert np.allclose(cosine_similarities[:5], [0.99604267, 0.9558047 , 0.973884  , 0.79243606, 0.92760015])

In [None]:
res = minimize_scalar(
    lambda th: -accuracy(cosine_similarities, th, quora_data.is_duplicate), bounds=(0.5, 0.99), method='bounded'
)

best_threshold = res.x
best_accuracy = accuracy(cosine_similarities, best_threshold, quora_data.is_duplicate)
print('Threshold = {:.5f}, Accuracy = {:.2%}'.format(best_threshold, best_accuracy))

## Implementing Word-level Machine Translation

In [None]:
!wget -O ukr_rus.train.txt -qq --no-check-certificate "https://drive.google.com/uc?export=download&id=1vAK0SWXUqei4zTimMvIhH3ufGPsbnC_O"
!wget -O ukr_rus.test.txt -qq --no-check-certificate "https://drive.google.com/uc?export=download&id=1W9R2F8OeKHXruo2sicZ6FgBJUTJc8Us_"
!wget -O fairy_tale.txt -qq --no-check-certificate "https://drive.google.com/uc?export=download&id=1sq8zSroFeg_afw-60OmY8RATdu_T1tej"

# Install the PyDrive wrapper & import libraries.
# This only needs to be done once per notebook.
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
# This only needs to be done once per notebook.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

downloaded = drive.CreateFile({'id': '1d7OXuil646jUeDS1JNhP9XWlZogv6rbu'})
downloaded.GetContentFile('cc.ru.300.vec.zip')

downloaded = drive.CreateFile({'id': '1yAqwqgUHtMSfGS99WLGe5unSCyIXfIxi'})
downloaded.GetContentFile('cc.uk.300.vec.zip')

!unzip cc.ru.300.vec.zip
!unzip cc.uk.300.vec.zip

Let's implement a simple machine translator.

The idea is based on the paper [Word Translation Without Parallel Data](https://arxiv.org/pdf/1710.04087.pdf). There are lots of interesting things in the repo: [https://github.com/facebookresearch/MUSE](https://github.com/facebookresearch/MUSE).

And we are going to translate from Ukrainian to Russian. They are quite similar languages with similar syntax. This is why we can substitute words from one language with words from another and expect something coherent in the result.

That is, we are going to learn how embeddings from one language correspond to embeddings from another, like this:

![](https://raw.githubusercontent.com/facebookresearch/MUSE/master/outline_all.png)

Then we will simply map the source word (the word in the sentence we want to translate) to the target embedding space and take the word with the nearest embedding.

In [None]:
from gensim.models import KeyedVectors

ru_emb = KeyedVectors.load_word2vec_format("cc.ru.300.vec")
uk_emb = KeyedVectors.load_word2vec_format("cc.uk.300.vec")

Look at the pair `серпень-август` (the Ukrainian and Russian words for August, which are translations of each other).

In [None]:
ru_emb.most_similar([ru_emb["август"]])

In [None]:
uk_emb.most_similar([uk_emb["серпень"]])

In [None]:
ru_emb.most_similar([uk_emb["серпень"]])

In [None]:
def load_word_pairs(filename):
    uk_ru_pairs = []
    uk_vectors = []
    ru_vectors = []
    with open(filename, "r", encoding='utf8') as inpf:
        for line in inpf:
            uk, ru = line.rstrip().split("\t")
            if uk not in uk_emb or ru not in ru_emb:
                continue
            uk_ru_pairs.append((uk, ru))
            uk_vectors.append(uk_emb[uk])
            ru_vectors.append(ru_emb[ru])
    return uk_ru_pairs, np.array(uk_vectors), np.array(ru_vectors)


uk_ru_train, X_train, Y_train = load_word_pairs("ukr_rus.train.txt")
uk_ru_test, X_test, Y_test = load_word_pairs("ukr_rus.test.txt")

### Learning the mapping from the embedding spaces

We have pairs of corresponding words. So we have to find a mapping that brings their embeddings as close together as possible.

$$W^*= \arg\min_W ||WX - Y||_F \text{, where } ||\cdot||_F \text{ is the Frobenius norm}$$

This function is similar to the linear regression (without bias).
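
To see the connection, here is a sketch on synthetic data (the `X`, `Y` and `true_W` arrays are made up, and `np.linalg.lstsq` is used here instead of sklearn). Embeddings are stored as rows, as in `X_train`, so the product is written `X @ W`:

```python
import numpy as np

rng = np.random.RandomState(0)

# Made-up data: 50 "source" vectors of size 4 and a known linear map true_W.
X = rng.randn(50, 4)
true_W = rng.randn(4, 4)
Y = X @ true_W                      # "target" vectors, one per row

# Least-squares solution of X @ W = Y with no intercept term.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Without noise the known map is recovered up to floating point error.
max_error = np.abs(W - true_W).max()
print(max_error)
```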

**Task** Implement it - use `LinearRegression` from sklearn with `fit_intercept=False`:

In [None]:
from sklearn.linear_model import LinearRegression

mapping = LinearRegression(fit_intercept=False).fit(X_train, Y_train)

Check it:

In [None]:
august = mapping.predict(uk_emb["серпень"].reshape(1, -1))
ru_emb.most_similar(august)

As expected, the top results are various months, but `август` is not the first.

We are going to evaluate the mapping by precision@k metric with k = 1, 5, 10.
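
As a reminder of what the metric counts, here is a toy sketch with transliterated words and invented candidate rankings (`gold`, `ranked_candidates` and `precision_at_k` are illustrative names, not the notebook's `precision` function):

```python
# For each test word: the correct translation and a ranked list of candidates
# (nearest first). All values are invented for illustration.
gold = ['avgust', 'kot', 'dom']
ranked_candidates = [
    ['iyul', 'avgust', 'sentyabr'],   # correct answer at rank 2
    ['kot', 'pyos', 'mysh'],          # correct answer at rank 1
    ['sad', 'dvor', 'ulitsa'],        # correct answer missing
]

def precision_at_k(gold_words, candidates, k):
    """Share of words whose correct translation is among the first k candidates."""
    hits = sum(g in cands[:k] for g, cands in zip(gold_words, candidates))
    return hits / len(gold_words)

p_at_1 = precision_at_k(gold, ranked_candidates, 1)
p_at_3 = precision_at_k(gold, ranked_candidates, 3)
print(p_at_1, p_at_3)
```

Increasing `k` can only keep the value the same or raise it, since a larger top list contains the smaller one.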

**Task** Implement the following function:

In [None]:
def precision(pairs, mapped_vectors, topn=1):
    """
    :args:
        pairs = list of right word pairs [(uk_word_0, ru_word_0), ...]
        mapped_vectors = list of embeddings after mapping from source embedding space to destination embedding space
        topn = the number of nearest neighbours in destination embedding space to choose from
    :returns:
        precision_val, float number, the fraction of pairs whose right translation is among the top K nearest neighbours.
    """
    assert len(pairs) == len(mapped_vectors)
    <implement it>
    return precision_val

In [None]:
assert precision([("серпень", "август")], august, topn=5) == 0.0
assert precision([("серпень", "август")], august, topn=9) == 1.0
assert precision([("серпень", "август")], august, topn=10) == 1.0

In [None]:
assert precision(uk_ru_test, X_test) == 0.0
assert precision(uk_ru_test, Y_test) == 1.0

In [None]:
precision_top1 = precision(uk_ru_test, mapping.predict(X_test), 1)
precision_top5 = precision(uk_ru_test, mapping.predict(X_test), 5)

assert precision_top1 >= 0.635
assert precision_top5 >= 0.813

### Improving Mapping

A mapping with an orthogonality constraint turns out to work better (the asserts below expect slightly higher precision):
$$W^*= \arg\min_W ||WX - Y||_F \text{, where: } W^TW = I$$

You can find it using SVD:
$$X^TY=U\Sigma V^T\text{, singular value decomposition}$$

$$W^*=UV^T$$
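
A sanity check of these formulae on synthetic data: if the target points are an exact rotation of the source points, the SVD recipe should recover that rotation. This is only a sketch; the 2-d points and the 30-degree `rotation` are made up, and embeddings are stored as rows, so the mapping is applied as `X @ W`.

```python
import numpy as np

rng = np.random.RandomState(0)

# Made-up data: rotate random 2-d points by 30 degrees to get the "target" space.
angle = np.deg2rad(30)
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])
X = rng.randn(100, 2)
Y = X @ rotation

# SVD of X^T Y, then W = U V^T (np.linalg.svd returns V^T directly).
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

# W is orthogonal and coincides with the rotation we started from.
print(np.allclose(W.T @ W, np.eye(2)), np.allclose(W, rotation))
```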

**Task** Implement the function:

In [None]:
def learn_transform(X_train, Y_train):
    """
    :returns: W* : float matrix[emb_dim x emb_dim] as defined in formulae above
    """
    <calculate it>

In [None]:
W = learn_transform(X_train, Y_train)

In [None]:
ru_emb.most_similar([np.matmul(uk_emb["серпень"], W)])

In [None]:
assert precision(uk_ru_test, np.matmul(X_test, W)) >= 0.653
assert precision(uk_ru_test, np.matmul(X_test, W), 5) >= 0.824

### Writing the translator

Now we are ready to implement the translation function. For each source word, it should find the nearest vector in the target (Russian) embedding space, and keep the source word as is when it has no Ukrainian embedding.

In [None]:
with open("fairy_tale.txt", "r", encoding="utf8") as inpf:
    uk_sentences = [line.rstrip().lower() for line in inpf]

In [None]:
def translate(sentence):
    """
    :args:
        sentence - sentence in Ukrainian (str)
    :returns:
        translation - sentence in Russian (str)

    * find ukrainian embedding for each word in sentence
    * transform ukrainian embedding vector
    * find nearest russian word and replace
    """
    <implement it>

In [None]:
assert translate(".") == "."
assert translate("1 , 3") == "1 , 3"
assert translate("кіт зловив мишу") == "кот поймал мышку"

In [None]:
for sentence in uk_sentences:
    print("src: {}\ndst: {}\n".format(sentence, translate(sentence)))

# Supplementary Materials

## To read
### Basic knowledge:  
[On word embeddings - Part 1, Sebastian Ruder](http://ruder.io/word-embeddings-1/)  
[Deep Learning, NLP, and Representations, Christopher Olah](http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/)  

### How to cluster embeddings:  
[Making Sense of Word Embeddings (2016), Pelevina et al](http://anthology.aclweb.org/W16-1620)    

### How to evaluate embeddings:
[Evaluation methods for unsupervised word embeddings (2015), T. Schnabel](http://www.aclweb.org/anthology/D15-1036)  
[Intrinsic Evaluation of Word Vectors Fails to Predict Extrinsic Performance (2016), B. Chiu](https://www.aclweb.org/anthology/W/W16/W16-2501.pdf)  
[Problems With Evaluation of Word Embeddings Using Word Similarity Tasks (2016), M. Faruqui](https://arxiv.org/pdf/1605.02276.pdf)  
[Improving Reliability of Word Similarity Evaluation by Redesigning Annotation Task and Performance Measure (2016), Oded Avraham, Yoav Goldberg](https://arxiv.org/pdf/1611.03641.pdf)  
[Evaluating Word Embeddings Using a Representative Suite of Practical Tasks (2016), N. Nayak](https://cs.stanford.edu/~angeli/papers/2016-acl-veceval.pdf)  


## To watch
[Word Vector Representations: word2vec, Lecture 2, cs224n](https://www.youtube.com/watch?v=ERibwqs9p38)