# Word2Vec

In TF-IDF and bag-of-words models, closeness of two embeddings does not imply that the embedded words are semantically close. Word2Vec models are motivated by the goal of building embeddings in a space of a chosen dimension that do capture semantic closeness. For example, the vectors for "cat" and "dog" are expected to be closer to each other than either is to the vector for "window".

<div>
<img src="https://sketch.io/render/sk-7be0ce4db7bf773968b68aee3eab9a25.jpeg" width="500"/>
</div>
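To make "nearness" concrete, here is a minimal sketch that compares vectors by cosine similarity, a common choice for word embeddings. The three vectors are toy values invented for illustration, not outputs of any trained model, and `cosine_similarity` is a helper defined here.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional vectors, made up purely for illustration.
emb = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.8, 0.9, 0.2]),
    "window": np.array([0.1, 0.2, 0.9]),
}

sim_cat_dog = cosine_similarity(emb["cat"], emb["dog"])
sim_cat_window = cosine_similarity(emb["cat"], emb["window"])
print(sim_cat_dog > sim_cat_window)
```

With a good embedding, semantically related words should score higher than unrelated ones, as the toy vectors are constructed to do here.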

Technically, embeddings of a given dimension can be introduced as follows. Take an ordered dictionary of all words and let its size be $N$. The word with index $i$ is represented by a one-hot vector: all zeros except for a 1 at position $i$.
Next, we multiply this vector by an $N\times M$ matrix $W_{IN}$, which gives the word's $M$-dimensional representation. To get back from the latent representation to the initial one, we multiply the latent vector by an $M\times N$ matrix $W_{OUT}$ and apply $softmax$ to the result. The output is a probability distribution over the dictionary, which relates each vector in the latent space to the words of the dictionary.

<div>
<img src="https://i.stack.imgur.com/OpupG.png" width="500"/>
</div>

By the way, we obtained a fully connected neural network with 1 hidden layer.
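The construction above can be sketched in a few lines of NumPy. The sizes, the random matrices, and the helper names `one_hot` and `softmax` are assumptions chosen for illustration; a real model would learn `W_in` and `W_out` from data.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 5, 3                      # toy dictionary size and embedding dimension
W_in = rng.normal(size=(N, M))   # N x M input matrix (random, untrained)
W_out = rng.normal(size=(M, N))  # M x N output matrix (random, untrained)

def one_hot(i, n):
    """Vector of n zeros with a 1 at position i."""
    v = np.zeros(n)
    v[i] = 1.0
    return v

def softmax(z):
    """Turn arbitrary scores into a probability distribution."""
    z = z - np.max(z)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

i = 2
x = one_hot(i, N)
h = x @ W_in             # latent representation of word i
p = softmax(h @ W_out)   # distribution over the N dictionary words
```

Multiplying a one-hot vector by `W_in` simply selects row `i`, which is why the rows of `W_in` are used as the word embeddings.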

What remains is to train $W_{IN}$ and $W_{OUT}$ so that the hidden representation is meaningful. Word2Vec models are trained by maximizing the likelihood of the corpus, either by predicting the context of each word (skip-gram model):

![\arg \max_{\theta} \prod_{w\in texts}\left[\prod_{w' \in context(w)} \rm p(w' | w, \theta)\right]](https://latex.codecogs.com/gif.latex?%5Carg%20%5Cmax_%7B%5Ctheta%7D%20%5Cprod_%7Bw%5Cin%20texts%7D%5Cleft%5B%5Cprod_%7Bw%27%20%5Cin%20context%28w%29%7D%20%5Crm%20p%28w%27%20%7C%20w%2C%20%5Ctheta%29%5Cright%5D)

or by predicting a word from its context (CBOW model):

![\arg \max_{\theta} \prod_{w\in texts} \rm p (w | context(w), \theta)](https://latex.codecogs.com/gif.latex?%5Carg%20%5Cmax_%7B%5Ctheta%7D%20%5Cprod_%7Bw%5Cin%20texts%7D%20%5Crm%20p%20%28w%20%7C%20context%28w%29%2C%20%5Ctheta%29)

Illustration of both approaches:
![img](https://www.researchgate.net/profile/Nailah_Al-Madi/publication/319954363/figure/fig1/AS:552189871353858@1508663732919/CBOW-and-Skip-gram-models-architecture-1.png)
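The difference between the two objectives shows up in how training examples are formed. The sketch below is a simplified illustration: `skipgram_pairs` and `cbow_examples` are hypothetical helpers written for this note, and they ignore refinements such as subsampling and negative sampling.

```python
def skipgram_pairs(tokens, window):
    """(center, context) pairs: each center word predicts each neighbour."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def cbow_examples(tokens, window):
    """(context, center) examples: the neighbours jointly predict the center."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples

sentence = ["think", "really", "let", "quality"]
sg_pairs = skipgram_pairs(sentence, window=1)
cbow = cbow_examples(sentence, window=1)
print(sg_pairs)
print(cbow)
```

Skip-gram produces one example per (word, neighbour) pair, while CBOW produces one example per word with all of its neighbours as input.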

The training procedure itself is explained very clearly in the original paper:
[Tomas Mikolov et al., Distributed Representations of Words and Phrases and their Compositionality](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf).

Some interesting materials:
* https://arxiv.org/pdf/1402.3722.pdf
* https://medium.com/analytics-vidhya/maths-behind-word2vec-explained-38d74f32726b
* http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/


Let's move to practice. There are several options: train our own Word2Vec model, use a pre-trained model, or continue training from pre-trained weights. Some frameworks for working with text and applying pre-trained models:
* [gensim](https://radimrehurek.com/gensim/)
* [fasttext](https://fasttext.cc/)
* [tensorflow](https://tfhub.dev/s?module-type=text-embedding&publisher=google)

As an example, gensim is used below, both to train our own model and to try a pre-trained one.

## Dataset
The task is to classify IMDB reviews into two classes (the `label` column is 0 or 1).

In [None]:
import numpy as np
import pandas as pd

df = pd.read_csv('train.csv', index_col=0)
df.head()

Unnamed: 0,review,label
0,I think they really let the quality of the DVD...,0
1,I'm sorry but this is just awful. I have told ...,0
2,"The Japenese sense of pacing, editing and musi...",0
3,"In the '60's/'70's, David Jason was renowned f...",1
4,"""Hail The Woman"" is one of the most moving fil...",1


## Data preprocessing

The purpose of preprocessing is to turn each text into a list of tokens (words) that make up the dictionary. The basic steps are:
* Tokenization
* Filtering out non-words, stop words and short words
* Lemmatization (e.g., "likes" to "like")

To walk through these steps we will use nltk:

In [None]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

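# A set gives constant-time membership checks when filtering tokens
en_stop = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def tokenize(text):
    # Lowercase and split into tokens
    tokens = nltk.word_tokenize(text.lower())
    # Keep tokens made only of letters (or underscores), longer than 2 characters, not stop words
    tokens = [t for t in tokens if
              re.match(r'[^\W\d]*$', t) and (len(t) > 2) and (t not in en_stop)]
    # Reduce each token to its dictionary form
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return tokens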

tokens = df['review'].apply(tokenize)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
tokens

0        [think, really, let, quality, dvd, production,...
1        [sorry, awful, told, people, film, bad, acting...
2        [japenese, sense, pacing, editing, musical, sc...
3        [david, jason, renowned, many, supporting, rol...
4        [hail, woman, one, moving, film, ever, seen, e...
                               ...                        
39995    [come, across, gem, movie, like, realize, grea...
39996    [often, way, write, comment, warn, anyone, mig...
39997    [extremely, silly, little, seen, film, slavery...
39998    [saw, movie, scary, thing, people, talking, mo...
39999    [though, film, seems, trying, market, horror, ...
Name: review, Length: 40000, dtype: object

## **Learning the Word2Vec model**
We will train the Word2Vec model on the IMDB review corpus. The latent space dimension is 300 and the window size is 6.


In [None]:
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

In [None]:
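# Learn frequent two-word phrases, then longer phrases on top of the merged bigrams
bigrams = Phrases(sentences=tokens)
trigrams = Phrases(sentences=bigrams[tokens])

In [None]:
# Freeze the phrase models into lighter Phraser objects.
# Note: they are not applied to `tokens`, so the model is trained on single words.
bigrams = Phraser(bigrams)
trigrams = Phraser(trigrams)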

In [None]:
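# CBOW (sg=0), 300-dimensional vectors, context window of 6, 100 training epochs.
# These are gensim 3.x argument names; gensim 4 renames `size` to `vector_size` and `iter` to `epochs`.
model = Word2Vec(tokens, size=300, window=6, min_count=4, iter=100, sg=0, sample=1e-5, workers=4)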

## **Vector representation of a text**
We have obtained vector embeddings for single words. There are several ways to represent a whole text. Here we concatenate the element-wise mean and median of its word vectors:



In [None]:
def encode(list_of_tokens):
    x = np.array([model.wv[t] for t in list_of_tokens if t in model.wv.vocab])

    return np.concatenate((np.mean(x, axis=0), np.median(x, axis=0)))

fts = np.array([encode(t) for t in tokens])
fts.shape

(40000, 600)
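One edge case worth guarding against is a review whose tokens are all missing from the vocabulary, which leaves nothing to average. The sketch below shows one possible fallback, returning a zero vector. `encode_safe` is a hypothetical helper, and a plain dict stands in for the trained word vectors so that the example is self-contained.

```python
import numpy as np

def encode_safe(list_of_tokens, vectors, dim):
    """Concatenate mean and median of known-word vectors; zeros if none are known."""
    known = [vectors[t] for t in list_of_tokens if t in vectors]
    if not known:
        # No in-vocabulary tokens: fall back to an all-zero feature vector
        return np.zeros(2 * dim)
    x = np.array(known)
    return np.concatenate((np.mean(x, axis=0), np.median(x, axis=0)))

# Toy 2-dimensional vectors, made up for illustration.
toy_vectors = {"good": np.array([1.0, 0.0]), "film": np.array([0.0, 1.0])}

full = encode_safe(["good", "film", "unknownword"], toy_vectors, dim=2)
empty = encode_safe(["unknownword"], toy_vectors, dim=2)
print(full.shape, empty.shape)
```

Whether zeros are the right fallback depends on the classifier; dropping such reviews is another option.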

Finally, we obtained 600 features for each text (a 300-dimensional mean and a 300-dimensional median). Now we can move to classification.

**Train-test split**

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(fts, df.label.values,
                                                    test_size=0.2, shuffle=True)

**Classification model**

As an example, let's take logistic regression:

In [None]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(solver='lbfgs', max_iter=3000).fit(X_train, y_train)

Check metrics:

In [None]:
from sklearn.metrics import classification_report

predicts = clf.predict(X_train)
print('Train\n', classification_report(y_train, predicts, digits=4))

predicts = clf.predict(X_test)
print('Test\n', classification_report(y_test, predicts, digits=4))

Train
               precision    recall  f1-score   support

           0     0.8945    0.8921    0.8933     16057
           1     0.8917    0.8941    0.8929     15943

    accuracy                         0.8931     32000
   macro avg     0.8931    0.8931    0.8931     32000
weighted avg     0.8931    0.8931    0.8931     32000

Test
               precision    recall  f1-score   support

           0     0.8796    0.8820    0.8808      4010
           1     0.8811    0.8787    0.8799      3990

    accuracy                         0.8804      8000
   macro avg     0.8804    0.8804    0.8804      8000
weighted avg     0.8804    0.8804    0.8804      8000



Let's try Support Vector Classification:

In [None]:
from sklearn.svm import SVC

# Note: this fits on the full feature matrix `fts`, which includes the rows in X_test
clf = SVC().fit(fts, df.label.values)

In [None]:
from sklearn.metrics import classification_report

predicts = clf.predict(X_train)
print('Train\n', classification_report(y_train, predicts, digits=4))

predicts = clf.predict(X_test)
print('Test\n', classification_report(y_test, predicts, digits=4))

Train
               precision    recall  f1-score   support

           0     0.9219    0.9168    0.9194     16048
           1     0.9168    0.9219    0.9193     15952

    accuracy                         0.9193     32000
   macro avg     0.9194    0.9194    0.9193     32000
weighted avg     0.9194    0.9193    0.9193     32000

Test
               precision    recall  f1-score   support

           0     0.8995    0.8910    0.8952      4019
           1     0.8910    0.8995    0.8952      3981

    accuracy                         0.8952      8000
   macro avg     0.8953    0.8953    0.8952      8000
weighted avg     0.8953    0.8952    0.8952      8000



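The scores are higher than those of logistic regression (test accuracy 0.8952 vs. 0.8804). Note, however, that the SVC was fit on the full dataset, including the test rows, so its test metrics are optimistic and the two results are not directly comparable.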

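In [None]:
test = pd.read_csv('test.csv', index_col=0)

In [None]:
# Encode the held-out reviews and save the SVC predictions as a submission file
tok = test['review'].apply(tokenize)
mahmax = np.array([encode(t) for t in tok])
predicted = clf.predict(mahmax)
pd.DataFrame({'Predicted': predicted}).to_csv('/content/drive/My Drive/Colab Notebooks/solution.csv', index_label='Id')

## Pre-trained model
We will use pre-trained GloVe word vectors ("glove-wiki-gigaword-300"), trained on Wikipedia and Gigaword and loaded through gensim. GloVe is a different embedding method from Word2Vec, but the vectors are used here in the same way.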

In [None]:
import gensim.downloader as api

model_pre = api.load("glove-wiki-gigaword-300")  # load glove vectors





In [None]:
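def encode1(list_of_tokens):
    # `model_pre` is already a KeyedVectors object, so it is indexed directly (no `.wv`)
    x = np.array([model_pre[t] for t in list_of_tokens if t in model_pre.vocab])

    # Concatenate element-wise mean, max and median: 3 * 300 = 900 features
    return np.concatenate((np.mean(x, axis=0), np.max(x, axis=0), np.median(x, axis=0)))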

fts_pre = np.array([encode1(t) for t in tokens])
fts_pre.shape

  


(40000, 900)

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(fts_pre, df.label.values,
                                                    test_size=0.2, shuffle=True)

Check metrics for the pre-trained GloVe vectors:

In [None]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(solver='lbfgs', max_iter=1500).fit(X_train, y_train)

In [None]:
from sklearn.metrics import classification_report

predicts = clf.predict(X_train)
print('Train\n', classification_report(y_train, predicts, digits=4))

predicts = clf.predict(X_test)
print('Test\n', classification_report(y_test, predicts, digits=4))

Train
               precision    recall  f1-score   support

           0     0.8577    0.8500    0.8538     16085
           1     0.8498    0.8575    0.8536     15915

    accuracy                         0.8537     32000
   macro avg     0.8537    0.8537    0.8537     32000
weighted avg     0.8538    0.8537    0.8537     32000

Test
               precision    recall  f1-score   support

           0     0.8361    0.8478    0.8419      3982
           1     0.8470    0.8352    0.8411      4018

    accuracy                         0.8415      8000
   macro avg     0.8416    0.8415    0.8415      8000
weighted avg     0.8416    0.8415    0.8415      8000



In this case, the pre-trained model shows worse results than our own model: with logistic regression, test accuracy is 0.8415 versus 0.8804.
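One thing worth checking when comparing embeddings, though it was not measured in this notebook, is how many tokens of the corpus actually have a vector in each vocabulary. The sketch below computes that share; `vocabulary_coverage` is a hypothetical helper, and the toy vocabulary and documents are invented for illustration.

```python
def vocabulary_coverage(token_lists, vocabulary):
    """Fraction of token occurrences that are present in `vocabulary`."""
    total = sum(len(tokens) for tokens in token_lists)
    if total == 0:
        return 0.0
    known = sum(1 for tokens in token_lists for t in tokens if t in vocabulary)
    return known / total

# Toy data: a misspelled word such as "japenese" is unlikely to have a vector.
toy_vocab = {"film", "bad", "acting"}
toy_docs = [["film", "bad", "acting"], ["japenese", "film"]]

coverage = vocabulary_coverage(toy_docs, toy_vocab)
print(coverage)
```

Low coverage would be only one possible explanation for a gap in scores; domain mismatch between the training corpora is another.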

In [None]:
print(list(tokens)[0][:10])

['think', 'really', 'let', 'quality', 'dvd', 'production', 'get', 'away', 'rented', 'dvd']
