## Word embeddings with Gensim

Encoding text data well is crucial for Deep Learning models. A representation that encodes the similarity and proximity between words should intuitively work better for many tasks, and in practice it often does. It is not always the best choice, though: it's no silver bullet.

Two of the most important models for representing words in an n-dimensional space are [word2vec](https://arxiv.org/abs/1310.4546) and [GloVe](https://nlp.stanford.edu/projects/glove/).

In this tutorial, I will show how to use [Gensim](https://radimrehurek.com/gensim/index.html) to work with both word2vec and GloVe encodings for text data.

I assume you already know how to set up an environment for machine learning development with Python. If you don't, take a look at [this tutorial](https://medium.com/cocoaacademymag/basic-tools-for-machine-learning-85e887224ee4) on the basic tools for Machine Learning, which has everything you will need to follow this one.

### Summary

* ✅ Installing and importing Gensim
* ✅ Creating a word2vec model from text data
* Creating a GloVe model from text data
* Intrinsic evaluation for both models
* Extrinsic evaluation for both models

# word2vec

## Preparing the text to train the model

In this example, I open a csv file, take the text from it, and split it into lines. I then split each line into "words" and strip the punctuation to clean the corpus a little. I use the space as a boundary between words only because this is an example: *it is absolutely naive and should not be done in production*. In _real life_ you should use a tokenizer to separate the tokens to be vectorized and to handle punctuation properly. Depending on the task, it might also be helpful to lemmatize the tokens.

My `sentences` variable will store a list of lists of strings, where each string ~roughly~ represents a word.

In [None]:
import pandas as pd
df = pd.read_csv('../input/train.csv')
corpus_text = '\n'.join(df[:5000]['comment_text'])
sentences = corpus_text.split('\n')
sentences = [line.lower().split(' ') for line in sentences]

In [None]:
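def clean(s):
    # Strip surrounding punctuation and drop tokens that end up empty.
    return [t for t in (w.strip(',."!?:;()\'') for w in s) if t]
# Keep only the sentences that still contain at least one token.
sentences = [c for c in map(clean, sentences) if c]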
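
The whitespace split above is deliberately naive. As a small step up, here is a sketch of a regex-based tokenizer that uses only the standard library. The pattern and the names `TOKEN_PATTERN`, `tokenize` and `build_sentences` are my own illustrative choices, not part of Gensim, and a real tokenizer covers many more cases than this one does.

```python
import re

# Hypothetical pattern: runs of letters/digits, optionally joined by an
# apostrophe so that "don't" stays a single token.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)*")

def tokenize(line):
    """Lowercase a line and return its word-like tokens, dropping punctuation."""
    return TOKEN_PATTERN.findall(line.lower())

def build_sentences(text):
    """Split text into lines and tokenize each one, skipping lines with no tokens."""
    tokenized = (tokenize(line) for line in text.split('\n'))
    return [tokens for tokens in tokenized if tokens]

example = "Hello, world!\n\nDon't split (this) badly..."
print(build_sentences(example))
```

The result has the same shape as `sentences` (a list of lists of strings), so it could be passed to the training step in the same way.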

## Training the model

Once we have the sentences, we can use `Gensim` to create a model for us.
Here's a simple way to do it:

In [None]:
from gensim.models import Word2Vec

# Gensim 4.x renamed `size` to `vector_size`; on Gensim 3.x, pass `size=100` instead.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=3, workers=4)
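
To make the `window` and `min_count` arguments a bit more concrete, here is a simplified, pure-Python picture of what they control: `min_count` discards words that are too rare in the corpus, and `window` limits how far apart two words in a sentence can be and still count as context for each other. The helpers `filter_rare` and `context_pairs` are hypothetical names used only for illustration. This is not how Gensim is implemented, and the actual training procedure has further details that this sketch leaves out.

```python
from collections import Counter

def filter_rare(sentences, min_count):
    """Drop every token that occurs fewer than min_count times in the whole corpus."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return [[t for t in sentence if counts[t] >= min_count] for sentence in sentences]

def context_pairs(sentence, window):
    """Yield (center, context) pairs for tokens at most `window` positions apart."""
    for i, center in enumerate(sentence):
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:
                yield center, sentence[j]

toy = [['the', 'cat', 'sat'], ['the', 'cat', 'ran']]
print(filter_rare(toy, min_count=2))
print(list(context_pairs(['a', 'b', 'c', 'd'], window=1)))
```

A larger `window` produces more pairs per word, and a larger `min_count` shrinks the vocabulary.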

Of course, you can change hyperparameters such as the window size or the dimensionality of the resulting vectors to get better results.
If the model is too big and we are done training it, we can delete it and keep only the vectors.

In [None]:
vectors = model.wv
del model

## Using the vectors

Now, for each word (represented as a string), we can get its vector.

In [None]:
vectors['good']
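
One thing to watch for: because the model was trained with `min_count=3`, words that appear fewer than three times in the corpus have no vector, and indexing with such a word raises a `KeyError`. Below is a minimal sketch of a guarded lookup. The helper `lookup` is a hypothetical name, and a plain dict stands in for the trained vectors so the snippet runs on its own. It relies only on `in` and indexing, both of which Gensim's `KeyedVectors` supports.

```python
def lookup(vectors, word, default=None):
    """Return the vector for `word`, or `default` when the word is not in the vocabulary."""
    return vectors[word] if word in vectors else default

# Stand-in for the trained vectors, so the sketch runs without a model.
toy_vectors = {'good': [0.1, 0.3], 'bad': [-0.2, 0.4]}

print(lookup(toy_vectors, 'good'))     # known word: its vector
print(lookup(toy_vectors, 'zyzzyva'))  # unknown word: the default (None)
```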

We can also compare words to assess their similarity,
or check which words are the most similar to a given word, i.e. the
ones whose vectors are closest to it.

In [None]:
print(vectors.similarity('you', 'your'))
print(vectors.similarity('you', 'internet'))

In [None]:
vectors.most_similar('i')
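
The similarity scores above are cosine similarities between word vectors. To show what that means, here is a small sketch that computes the cosine similarity by hand and ranks neighbours with it. The two-dimensional vectors are made up for illustration, and the functions below are my own simplified stand-ins, not Gensim's implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(vectors, word, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    target = vectors[word]
    scored = [(other, cosine_similarity(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up 2-dimensional vectors; real word vectors have many more dimensions.
toy_vectors = {
    'you': [1.0, 0.0],
    'your': [0.8, 0.6],
    'internet': [0.0, 1.0],
}
print(cosine_similarity(toy_vectors['you'], toy_vectors['your']))
print(most_similar(toy_vectors, 'you'))
```

Vectors pointing in the same direction score close to 1, while orthogonal vectors score 0, which is why `'your'` ranks above `'internet'` in this toy example.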