This Jupyter notebook shows how to train and use word embeddings with the [Gensim](https://radimrehurek.com/gensim/) library. Some snippets are adapted from the exercises in Dr Luis Espinosa's course and from [Gensim's Word2Vec tutorial](https://rare-technologies.com/word2vec-tutorial/).

Word embeddings are vector representations of words that are generally low-dimensional (often fewer than 1000 dimensions).


## TRAINING WORD EMBEDDINGS (Word2Vec)

---

As usual, we first import the libraries we are going to use, which now include Gensim.

**Note:** All these libraries need to be installed beforehand if you are not using Google Colab. Check their official websites for installation details.

In [0]:
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
import gensim
import requests
import random
# Tokenizer data needed by word_tokenize (recent NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')

To learn word embeddings, in this case Word2Vec (Mikolov et al. 2013, [paper](https://arxiv.org/pdf/1301.3781.pdf)), we first need a sufficiently large text corpus. To this end we are going to use the IMDb review corpus (the same one used in Coursework 1), taking all the sentences of its training set.

In [0]:
url_train = "http://josecamachocollados.com/imdb_train.txt"  # Contains all sentences of the IMDb training set

# Load the training set
response_train = requests.get(url_train)
dataset_file_train = response_train.text.split("\n")
random.shuffle(dataset_file_train)  # Shuffle all sentences of our corpus

Let's check what the IMDb dataset looks like:

In [0]:
for line in dataset_file_train[:5]:
    print(line)

Once the dataset is loaded, we tokenize each line and store the corpus as a list of tokenized sentences, which is the input format Gensim expects.

In [0]:
gensim_imdb_corpus = []
for line in dataset_file_train:
    gensim_imdb_corpus.append(word_tokenize(line))
# This can take a lot of memory. If that is a problem, reduce the number of lines that are tokenized

**Note:** The corpus may be preprocessed further if necessary (e.g. lowercased or cleaned). The version of the IMDb corpus used here is already lowercased.

Finally, we can train our Word2Vec word embedding model! For more information on training Word2Vec with Gensim, see the [Gensim Word2Vec documentation](https://radimrehurek.com/gensim/models/word2vec.html).

In [0]:
from gensim.models import Word2Vec

In [0]:
model = Word2Vec(gensim_imdb_corpus, vector_size=100, window=5, min_count=3)
# vector_size is the number of dimensions of the embeddings we are going to learn
# (Gensim versions before 4.0 call this parameter `size`)
# window is the maximum distance between a target word and the context words considered
# min_count is the minimum number of times a word needs to occur to be learned
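
To make the `window` and `min_count` parameters more concrete, here is a minimal pure-Python sketch of what they control. It is an illustration only and not Gensim's actual implementation, whose training procedure has further details that are not modeled here. The toy corpus and the helper names `filter_by_min_count` and `context_pairs` are hypothetical.

```python
from collections import Counter

# Toy tokenized corpus (hypothetical; the notebook uses the IMDb reviews instead).
toy_corpus = [
    ["the", "movie", "was", "great"],
    ["the", "film", "was", "great"],
    ["a", "great", "movie"],
]


def filter_by_min_count(corpus, min_count):
    """Return the set of words that occur at least `min_count` times in the corpus."""
    counts = Counter(token for sentence in corpus for token in sentence)
    return {word for word, count in counts.items() if count >= min_count}


def context_pairs(sentence, window):
    """Return (target, context) pairs for words at most `window` positions apart."""
    pairs = []
    for i, target in enumerate(sentence):
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, sentence[j]))
    return pairs


vocab = filter_by_min_count(toy_corpus, min_count=2)
pairs = context_pairs(toy_corpus[0], window=1)
print(sorted(vocab))  # words frequent enough to get an embedding
print(pairs)          # (target, context) pairs seen with a window of 1
```

With `min_count=2`, the words *film* and *a* are dropped because they occur only once. A larger `window` yields more, and more distant, context words per target.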

**Note:** You can save a model with `model.save` and load it back with `Word2Vec.load`. This lets you export your models and reuse them at any time, or use models trained by someone else (i.e. pre-trained models).

**Exercise (Optional):** Train the same model using [FastText](https://radimrehurek.com/gensim/models/fasttext.html) instead of Word2Vec. FastText is similar to Word2Vec but also takes character information into account, which can be useful for noisy text such as that found on social media.
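
To see why character information helps with noisy text, the sketch below extracts character n-grams from a word, using `<` and `>` as word-boundary markers as in the FastText paper. It is a simplified illustration and not the FastText implementation itself; the helper name `char_ngrams` is hypothetical and the n-gram lengths are parameters you can change.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, with '<' and '>' marking its boundaries."""
    marked = "<" + word + ">"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(marked) - n + 1):
            ngrams.append(marked[start:start + n])
    return ngrams


# A word and a misspelled variant share several of their 3-grams.
correct = char_ngrams("movie", min_n=3, max_n=3)
misspelled = char_ngrams("moovie", min_n=3, max_n=3)
shared = sorted(set(correct) & set(misspelled))
print(correct)
print(misspelled)
print(shared)
```

Because a misspelling such as *moovie* shares most of its n-grams with *movie*, a model built on n-grams can still produce a sensible vector for it, whereas Word2Vec would treat it as an unrelated or unknown word.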

## PLAYING WITH WORD2VEC

---

Now that our model has been trained, we can check the vectors for each word, which should have 100 dimensions.

In [0]:
vector_movie = model.wv['movie']  # Word vectors are stored in model.wv
print("Number of dimensions: " + str(len(vector_movie)))
print(vector_movie)

We can also check the similarity (measured by cosine similarity) between some words. Let's start by finding the most similar words to *movie* or *casablanca* in our vector space. We can find the most similar words to any input word with the `.most_similar` method.

In [0]:
model.wv.most_similar('movie')

In [0]:
model.wv.most_similar('casablanca')

We can also check the similarity between two given words.

In [0]:
print(model.wv.similarity('movie', 'film'))
print(model.wv.similarity('movie', 'popcorn'))
print(model.wv.similarity('movie', 'table'))

Here we can see that words like *movie* and *film* are very close (in fact they are synonyms). Other words like *movie* and *popcorn* are somewhat related, while *movie* and *table* do not seem to be similar at all in this corpus.
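
The similarity scores used in this notebook are cosine similarities, i.e. the cosine of the angle between two word vectors. A minimal sketch of the computation follows, using made-up 3-dimensional vectors rather than vectors from the trained model; the function name `cosine_similarity` is our own.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (dot product over the product of norms)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional vectors, not taken from the trained model.
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]  # orthogonal to vec_a

sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)
print(sim_ab)
print(sim_ac)
```

Vectors pointing in the same direction score close to 1 regardless of their length, while orthogonal vectors score 0.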

**Note:** In this notebook we have learned our own word embeddings on IMDb. In many cases, however, we will directly use an available pre-trained word embedding model. These are generally trained on much larger corpora and therefore tend to cover more words and give more reliable similarities. For example, there are pre-trained models for [Word2Vec](https://code.google.com/archive/p/word2vec/), [GloVe](https://nlp.stanford.edu/projects/glove/) or even [FastText trained on Twitter](https://github.com/pedrada88/crossembeddings-twitter).

**Exercise (optional):** Choose a pre-trained Word2Vec, GloVe or FastText model (there are many available online) and load it using Gensim. Check a few similarities and compare them with those of the word embeddings trained on IMDb.

**Exercise 1:** Train a Word2Vec word embedding model on the IMDb corpus with 75 dimensions and a window size of 8. Then, check the most similar words to *movie* in the vector space and the similarity between *movie* and *table*. Compare the results with those of the previously trained model.

In [0]:
# To complete here...

**Exercise (optional):** Take a corpus of your choice (e.g. from one of the NLP projects) and train a word embedding model using gensim. Check a few similarities of words and compare with the models trained on IMDb.