# Deep Learning - Day 5 - Understand embeddings with Word2Vec

### Exercise objectives:
- Convert words to vector representations thanks to embeddings
- Discover the powerful Word2Vec algorithm

<hr>
<hr>

Natural Language Processing (NLP) is the automated processing of human language. To work with text, we have to convert it into numerical representations that computers can handle. 

Usually, the first step consists of cleaning the data. As you have already done this during the past weeks, we won't emphasize this part today. However, if you have finished an exercise and feel comfortable with what you have, you are welcome to improve the data cleaning to achieve better performance.

✅ **Good Practice** ✅ Never spend too much time on data cleaning! First, build the entire pipeline to get a baseline evaluation of your task. **Then, and only then, improve your data cleaning and _check_ whether it really improves your score**. Otherwise, if you work on your data cleaning without a baseline evaluation of your performance, you cannot know whether your fancy data cleaning improves the entire pipeline or not. And sometimes, it can get super counterintuitive ;)

In this exercise, a large part of the data cleaning is already done. The next step is to convert each word of our vocabulary to a vector that can be fed to a Recurrent Neural Network. To do that, you will use the powerful and well-known Word2Vec algorithm.


# The data

Keras provides many datasets, among which the IMDB dataset: it contains movie reviews, each paired with a score given by the review writer.

❓ **Question** ❓ Let's first load the data. You don't have to understand what is going on in the function, it does not matter here.


⚠️ **Warning** ⚠️ The `load_data` function has a `percentage_of_sentences` argument. Depending on your computer, loading too many sentences may slow it down or even freeze it, as your RAM can overflow. For that reason, you can start with 20% of the sentences and see if your computer handles it. If it does not, rerun with a lower percentage. On the other hand, you can increase the percentage if you feel like it. 

In [None]:
from tensorflow.keras.datasets import imdb

def load_data(percentage_of_sentences=None):
    # Load the data
    (sentences_train, y_train), (sentences_test, y_test) = imdb.load_data()
    
    # Take only a given percentage of the entire data
    if percentage_of_sentences is not None:
        assert 0 < percentage_of_sentences <= 100
        
        len_train = int(percentage_of_sentences/100*len(sentences_train))
        sentences_train = sentences_train[:len_train]
        y_train = y_train[:len_train]
        
        len_test = int(percentage_of_sentences/100*len(sentences_test))
        sentences_test = sentences_test[:len_test]
        y_test = y_test[:len_test]
            
    # Build the {word: integer} mapping (ids are shifted by 3 to make room for the special tokens)
    word_to_id = imdb.get_word_index()
    word_to_id = {k:(v+3) for k,v in word_to_id.items()}
    for i, w in enumerate(['<PAD>', '<START>', '<UNK>', '<UNUSED>']):
        word_to_id[w] = i

    id_to_word = {v:k for k, v in word_to_id.items()}

    # Convert each list of integers to a single string of space-separated words
    # (the leading <START> token is skipped)
    X_train = [' '.join(id_to_word[i] for i in sentence[1:]) for sentence in sentences_train]
    X_test = [' '.join(id_to_word[i] for i in sentence[1:]) for sentence in sentences_test]
    
    return X_train, y_train, X_test, y_test



### Just run this cell to load the data
X_train, y_train, X_test, y_test = load_data(percentage_of_sentences=20)

Now that you have loaded the data, let's look at it!

❓ **Question** ❓ `X_train` and `X_test` are lists of sentences. For each of them, print some of the sentences along with their respective scores, stored in `y_train` and `y_test`.

In [None]:
### YOUR CODE HERE

You probably see that the data have already been cleaned a little. 

Moreover, the task corresponds to a binary classification problem:
- label 0 corresponds to a negative review
- label 1 corresponds to a positive review

For now, each sentence is a single long string, and your neural network does not know that this string is composed of different words. For that reason, you must convert each sentence (the full string) into a list of words, i.e. a list of separate strings.

For instance, 'this is a good movie' should become ['this', 'is', 'a', 'good', 'movie'].
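As a minimal sketch of this conversion, assuming that whitespace is a good enough separator (which matches how `load_data` joined the words), Python's built-in `str.split` does the job. The helper name `tokenize` and the toy sentences are illustrative only, they are not part of the exercise.

```python
def tokenize(sentences):
    """Split each sentence (one string) into a list of words.

    Assumption: words are separated by whitespace, as produced by load_data.
    """
    return [sentence.split() for sentence in sentences]


# Toy sentences standing in for X_train (hypothetical data)
toy_sentences = ["this is a good movie", "what a bad film"]
toy_tokens = tokenize(toy_sentences)
print(toy_tokens[0])
```

Splitting on whitespace is the simplest possible choice; a real tokenizer would also have to deal with punctuation, which is already handled in this dataset.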

❓ **Question** ❓ Write a function to convert `X_train` and `X_test` from list of strings (sentences) to list of list of strings (words).

In [None]:
### YOUR CODE HERE

Neural networks cannot process strings directly. Therefore, we have to convert each word into a mathematical representation, namely a vector. By the way, if a word is represented by a vector of size N, then a sentence of M words is represented by a matrix of size M x N.

All your words will then be vectors (or points) in an N-dimensional space. This space is called the embedding space. There are many ways to create this embedding space. Let's discover one of them.
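To make the M x N idea concrete, here is a small sketch with a hand-written embedding of size N = 3. The dictionary `toy_embedding` and its values are made up for illustration (they are not learned), and plain Python lists are used so that the snippet runs without any dependency.

```python
# Toy embedding with N = 3 (hand-picked values, not learned ones)
toy_embedding = {
    "a": [0.0, 0.1, 0.0],
    "very": [0.3, 0.0, 0.2],
    "good": [0.9, 0.1, 0.3],
    "movie": [0.2, 0.7, -0.5],
}

sentence = ["a", "very", "good", "movie"]  # M = 4 words

# One row per word: the sentence becomes an M x N table of numbers
sentence_matrix = [toy_embedding[word] for word in sentence]

n_rows = len(sentence_matrix)     # M, the number of words
n_cols = len(sentence_matrix[0])  # N, the size of the embedding space
print(n_rows, n_cols)
```

Each row is the vector of one word, in the order in which the words appear in the sentence.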


# Embedding with Word2Vec

Now, let's use Word2Vec to embed the words of our sentences. Word2Vec will be able to convert each word to a fixed-size vectorial representation.

For instance, we will have:
- 'dog' --> [0.1, -0.3, 0.8]
- 'cat' --> [-1.1, 2.3, 0.7]
- 'apple' --> [3.1, 0.9, -4.7]

Here, your embedding space is of size 3.

What you expect is a representation in which words with close meanings are close to each other in the embedding space, as in this example:

![Embedding](word_embedding.png)
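One common way to measure how "close" two vectors are is the cosine similarity, which is also what gensim's `most_similar` relies on. The sketch below computes it by hand on the three made-up vectors listed above; the function name `cosine_similarity` is ours, and since the values are purely illustrative, only the ordering of the two results matters.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 = same direction, -1 = opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# The illustrative vectors from the text (not learned values)
dog = [0.1, -0.3, 0.8]
cat = [-1.1, 2.3, 0.7]
apple = [3.1, 0.9, -4.7]

sim_dog_cat = cosine_similarity(dog, cat)
sim_dog_apple = cosine_similarity(dog, apple)
print(f"dog vs cat:   {sim_dog_cat:.2f}")
print(f"dog vs apple: {sim_dog_apple:.2f}")
```

With these toy values, 'dog' ends up closer to 'cat' than to 'apple', which is the kind of behaviour expected from a good embedding.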

❓ **Question** ❓ Let's run Word2Vec! The following code imports `Word2Vec` from gensim, a great Python package that makes the use of the Word2Vec algorithm easy, fast and accurate - which is not an easy task. The second line learns the embedding representation of the words from the sentences in `X_train` (which must be lists of words at this point).

In [None]:
from gensim.models import Word2Vec

word2vec = Word2Vec(sentences=X_train)

Let's look at the embedded representation of some words.

You can use `word2vec.wv` as a dictionary.
For instance, `word2vec.wv['dog']` will return a representation of `dog` in the embedding space.

❓ **Question** ❓ Try different words - especially, try non-existing words to see that they don't have any representation (which is perfectly normal, as their representation was not learned). 

In [None]:
### YOUR WORDS HERE

❓ **Question** ❓ What is the size of each word representation, and therefore, what is the size of the embedding space?

In [None]:
### YOUR CODE HERE

How can we know whether this embedding makes any sense? To find out, we will check that words with close meanings have close representations. 

Let's use the `most_similar` method which, given an input word, displays the "closest" words in the embedding space. If the embedding is good, then words that have close meanings will have close representations in the embedding space.

❓ **Question** ❓ Test the `most_similar` method on different words. 

❗ **Remark** ❗ Of course, the quality of the closeness depends on the quality of your embedding, and thus on the number of sentences that you have loaded and from which you create your embedding.

In [None]:
word2vec.wv.most_similar("car")

### TRY OTHER WORDS

Similarly to `most_similar`, which works on words directly, we can use `similar_by_vector` on vectors to do the same thing:

In [None]:
word_embedding = word2vec.wv['car']
word2vec.wv.similar_by_vector(word_embedding)

# Arithmetic on words

Now, let's do mathematical operations on words - meaning on their vector representations!

Since every word is represented as a vector, you can do basic arithmetic such as:

W2V(good) - W2V(bad)

❓ **Question** ❓ Do this mathematical operation and print the result

In [None]:
### YOUR CODE HERE

Now, imagine for a second that, somehow, the following equality holds true - just for a second:

    W2V(good) - W2V(bad) = W2V(nice) - W2V(unpleasant)

which is equivalent to 

    W2V(good) - W2V(bad) + W2V(unpleasant) = W2V(nice).
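The rearrangement above can be checked on a hand-built 2-dimensional space where the equality holds exactly by construction. Everything in this sketch is hypothetical (the vectors in `toy_vectors`, the helper `closest_word`); in a learned embedding, the equality can only hold approximately, which is why one looks for the closest word rather than for an exact match.

```python
import math

# Hand-built vectors: first axis = sentiment, second axis = "pleasantness" topic
toy_vectors = {
    "good": [1.0, 0.0],
    "bad": [-1.0, 0.0],
    "nice": [1.0, 1.0],
    "unpleasant": [-1.0, 1.0],
    "movie": [0.0, -1.0],
}


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def closest_word(vector, vectors):
    """Return the word whose vector has the highest cosine similarity with `vector`."""
    return max(vectors, key=lambda word: cosine_similarity(vector, vectors[word]))


# Element-wise good - bad + unpleasant
res_toy = [
    g - b + u
    for g, b, u in zip(toy_vectors["good"], toy_vectors["bad"], toy_vectors["unpleasant"])
]
print(res_toy, "->", closest_word(res_toy, toy_vectors))
```

Unlike this helper, gensim's `similar_by_vector` returns a ranked list of several words with their similarity scores.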

❓ **Question** ❓ Let's, just for fun (as it would be foolish of us to think that this equality holds true ...), do the operation W2V(good) - W2V(bad) + W2V(unpleasant) and store it in a `res` variable (which will be a vector of size 100 that you can print).

In [None]:
# To complete

We earlier said that, for any vector, it is possible to see the closest vectors in the embedding space.

❓ **Question** ❓ Look at the closest vectors to `res` (thanks to the `word2vec.wv.similar_by_vector` function).

In [None]:
### YOUR CODE HERE

Incredible, right? You can do arithmetic operations on words!

❓ **Question** ❓ You can try other arithmetic relations (use lowercase words, as the vocabulary is lowercase) such as 

W2V(boy) - W2V(girl) = W2V(man) - W2V(woman)

or 

W2V(queen) - W2V(king) = W2V(actress) - W2V(actor)

❗ **Remark** ❗ You will probably see that the results are not perfect. But don't forget that you trained your model on a very small corpus.

In [None]:
### YOUR CODE HERE

You might wonder where this magic comes from (and at quite a low price: you just ran one line of code on a very small corpus and it trained within a few minutes). The magic comes from the way Word2Vec is trained. The details are quite complex, but you can remember that Word2Vec, in `word2vec = Word2Vec(sentences=X_train)`, actually trains an internal neural network (that you don't see).  

In a nutshell, this internal NN predicts a word from the surrounding words in a sentence. It picks many positions in the different sentences and, for each of them, takes some words as inputs $X$ and one word as output $y$, which it tries to predict, in the embedding space.
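To give an idea of what these (inputs, output) examples look like, here is a simplified sketch that builds them for one sentence. It mimics the "predict a word from its neighbours" setting (known as CBOW, which is gensim's default mode); the function name `context_target_pairs` and its `window` parameter (number of neighbours taken on each side) are ours, and the actual implementation adds further refinements that are not shown here.

```python
def context_target_pairs(sentence, window=2):
    """Return (context_words, target_word) pairs for one tokenized sentence."""
    pairs = []
    for i, target in enumerate(sentence):
        left = sentence[max(0, i - window):i]    # up to `window` words before
        right = sentence[i + 1:i + 1 + window]   # up to `window` words after
        pairs.append((left + right, target))
    return pairs


toy_sentence = ["this", "movie", "is", "really", "good"]
toy_pairs = context_target_pairs(toy_sentence, window=2)
for context, target in toy_pairs:
    print(context, "->", target)
```

Each pair plays the role of one training example: the context words are the inputs $X$ and the target word is the output $y$.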

Like any neural network, Word2Vec has some hyperparameters. Let's check some. 


# Word2Vec hyperparameters


❓ **Question** ❓ The first important hyperparameter is the `size` argument (renamed `vector_size` in gensim 4.0 and later). It corresponds to the size of the embedding space. Train a new `word2vec_2` model, still on `X_train`, but with a smaller or larger `size`.

Verify on some words that the corresponding embedding is of your selected size.

In [None]:
### YOUR CODE HERE

❓ **Question** ❓ Use the `word2vec.wv.vocab` attribute (replaced by `word2vec.wv.key_to_index` in gensim 4.0 and later) to display the size of the learned vocabulary. Then compare it to the number of distinct words in `X_train`.

In [None]:
### YOUR CODE HERE

There is an important difference between the number of words in the train sentences and in the Word2Vec vocabulary, even though it has been trained on the train sentences. The reason is the second important hyperparameter of Word2Vec: `min_count`. 

`min_count` is an integer that sets the minimum number of occurrences a word must have to be learned in the embedding space. For instance, let's say that the word "movie" appears 1000 times in the corpus and "simba" only 2 times. If `min_count=3`, the word "simba" will be skipped during the training.

The intention is to keep only words that are sufficiently present in the corpus to have a robust embedded representation.
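The filtering effect of `min_count` can be reproduced with a simple word count. The sketch below is not how gensim is implemented internally; it only illustrates, on a hypothetical toy corpus, which words would survive a given threshold. The helper name `kept_vocabulary` is ours.

```python
from collections import Counter


def kept_vocabulary(tokenized_sentences, min_count):
    """Return the set of words that occur at least `min_count` times."""
    counts = Counter(word for sentence in tokenized_sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}


# Toy corpus: "movie" appears 3 times, "simba" 2 times, the others once
toy_corpus = [
    ["movie", "with", "simba"],
    ["great", "movie"],
    ["movie", "about", "simba"],
]

vocab_min_1 = kept_vocabulary(toy_corpus, min_count=1)
vocab_min_3 = kept_vocabulary(toy_corpus, min_count=3)
print(len(vocab_min_1), sorted(vocab_min_3))
```

With `min_count=3`, "simba" (2 occurrences) is dropped while "movie" (3 occurrences) is kept, which mirrors the example given above.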

❓ **Question** ❓ Learn a new `word2vec_3` model with a `min_count` higher than 5 (which is the default value) and a `word2vec_4` with a `min_count` smaller than 5, and then, compare the size of the vocabulary for all the different word2vec that you have trained (you can choose any `size` you want).

In [None]:
### YOUR CODE HERE

Remember that we said that Word2Vec has an internal neural network that it optimizes based on some predictions? These predictions actually correspond to predicting a word based on its surrounding words. The surrounding words are taken from a `window`, whose size is the maximum distance between the word to predict and the context words taken into account. You can train Word2Vec with different `window` sizes.

❓ **Question** ❓ Learn a new `word2vec_5` model with a `window` different from the one used previously (default is 5).

In [None]:
### YOUR CODE HERE

The arguments you have seen (`size`, `min_count` and `window`) are usually the ones you should start tuning to improve the performance of your model.

But you can also look at other arguments in the [documentation](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Text8Corpus).


# Convert our train and test set to RNN ready data

Now, let's convert the training and test data into their vector representation, ready to be fed into RNNs.

❓ **Question** ❓ Now, write a function that, given a sentence, returns a matrix that corresponds to the embedding of the full sentence. This means that you have to embed each word one after the other and stack the results to output a 2D matrix (make sure that your output is a numpy array).

❗ **Remark** ❗ You will probably notice that some of the words you are trying to convert throw errors, as they do not belong to the dictionary:

- for the test set, this is understandable: some words were not in the train set and thus their embedded representation is unknown
- for the train set, due to the `min_count` hyperparameter, not all the words have a vector representation

In any case, just skip the missing words here.
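The "skip the missing words" logic can be sketched with a plain dictionary standing in for `word2vec.wv`. The names `toy_wv` and `embed_sentence_toy` and the 2-dimensional vectors are hypothetical; the sketch returns a list of lists, whereas the exercise expects a numpy array.

```python
# Hypothetical lookup table standing in for word2vec.wv (2-D vectors for brevity)
toy_wv = {
    "this": [0.1, 0.2],
    "movie": [0.5, -0.4],
    "is": [0.0, 0.3],
}


def embed_sentence_toy(wv, sentence):
    """Return one vector per known word, silently skipping unknown words."""
    return [wv[word] for word in sentence if word in wv]


embedded_toy = embed_sentence_toy(toy_wv, ["this", "movie", "is", "laaaaaaaaaame"])
print(len(embedded_toy), "words kept out of 4")
```

The same membership test, `word in word2vec.wv`, also works on a gensim model, so filtering before the lookup avoids the error raised for unknown words.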

In [None]:
import numpy as np

example = ['this', 'movie', 'is', 'the', 'worst', 'action', 'movie', 'ever']
example_missing_words = ['this', 'movie', 'is', 'laaaaaaaaaame']

def embed_sentence(word2vec, sentence):
    ### YOUR CODE HERE
    
    
    
### Checks
embedded_sentence = embed_sentence(word2vec, example)
assert(type(embedded_sentence) == np.ndarray)
assert(embedded_sentence.shape == (8, 100))

embedded_sentence_missing_words = embed_sentence(word2vec, example_missing_words)  
assert(type(embedded_sentence_missing_words) == np.ndarray)
assert(embedded_sentence_missing_words.shape == (3, 100))

❓ **Question** ❓ Write a function that, given a list of sentences (each sentence being a list of words/strings), returns a list of embedded sentences (each sentence being a matrix). Apply this function to the train and test sentences.

Hint: Use the previous function `embed_sentence`

In [None]:
def embedding(word2vec, sentences):
    ### YOUR CODE HERE
    
X_train = embedding(word2vec, X_train)
X_test = embedding(word2vec, X_test)

❓ **Question** ❓ In order to have ready-to-use data, do not forget to pad them, as yesterday, so that you get tensors that can be split into batches during the optimization. Store the padded values in `X_train_pad` and `X_test_pad`. Do not forget the important arguments of the padding ;)
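As a reminder of what padding does, here is a dependency-free sketch that appends zero vectors at the end of the shorter sentences until all of them have the same length. The helper `pad_post` and the toy data are hypothetical; in the exercise you would rather rely on the Keras padding utility used yesterday, keeping in mind that the embedded values are floats.

```python
def pad_post(sequences, pad_vector):
    """Pad every sequence at the end so that all share the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_vector] * (max_len - len(seq)) for seq in sequences]


# Two embedded "sentences" of different lengths (2-D vectors for brevity)
toy_sequences = [
    [[0.1, 0.2], [0.5, -0.4], [0.0, 0.3]],
    [[0.9, 0.9]],
]

toy_padded = pad_post(toy_sequences, pad_vector=[0.0, 0.0])

# (number of sentences, words per sentence, embedding size)
shape = (len(toy_padded), len(toy_padded[0]), len(toy_padded[0][0]))
print(shape)
```

After padding, the data has three dimensions (sentences, words, embedding size), which is what the assertions of the exercise check.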

In [None]:
### YOUR CODE HERE

assert(len(X_train_pad.shape) == 3)
assert(len(X_test_pad.shape) == 3)
assert(X_train_pad.shape[2] == 100)
assert(X_test_pad.shape[2] == 100)

# Load existing models: transfer learning

Your computer might not be able to train Word2Vec on a large corpus (or it might take too much time, or you might simply want a Word2Vec trained on a very large corpus). In that case, you can directly load Word2Vec embeddings that have been pretrained on other corpora. This is another example of transfer learning.

❓ **Question** ❓ To do that, look at the available pretrained models thanks to the following command: 

In [None]:
import gensim.downloader as api
print(list(api.info()['models'].keys()))

❓ **Question** ❓ Load a pretrained model thanks to `api.load(name_of_your_corpus)` and check the size of the vocabulary, but also the size of the embedding space.

In [None]:
### YOUR CODE HERE