# Deep Learning - Day 5 - Word generation

### Exercise objective
- Get autonomous with Natural Language Processing
- Generate text

<hr>

In this exercise, we will try to generate some words. The underlying idea is to give an input sequence and predict what the next word is going to be. To do that, we will first create a dataset for this task, and then train an RNN to do the prediction.
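To make the idea concrete, here is a minimal sketch of how one sentence could be turned into (input sequence, next word) pairs. The function name `make_next_word_pairs` and the `window_size` parameter are hypothetical choices made for illustration; they are not part of the exercise, and other ways of building the dataset are just as valid.

```python
def make_next_word_pairs(sentence, window_size=3):
    """Split a sentence into (input words, next word) training pairs.

    Hypothetical helper for illustration; `window_size` is an assumed parameter.
    """
    words = sentence.split()
    pairs = []
    # Slide a fixed-size window over the sentence; the word that comes
    # right after the window is the target to predict.
    for start in range(len(words) - window_size):
        context = words[start:start + window_size]
        target = words[start + window_size]
        pairs.append((context, target))
    return pairs


pairs = make_next_word_pairs("this movie was really good", window_size=3)
for context, target in pairs:
    print(context, "->", target)
```

A sentence shorter than or equal to the window yields no pairs, so very short reviews simply contribute nothing to the dataset.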

# The data

❓ Question ❓ First, let's load the data. We use the IMDB reviews again, but this time we are only interested in the sentences, not in whether each review is positive or negative.

⚠️ **Warning** ⚠️ The `load_data` function has a `percentage_of_sentences` argument. Depending on your computer, loading too many sentences may slow it down or even freeze it, and you may run out of RAM. For that reason, start with 20% of the sentences and see whether your computer handles it. If it does not, rerun with a lower percentage. Conversely, you can increase the percentage if your machine copes well.

In [1]:
from tensorflow.keras.datasets import imdb

def load_data(percentage_of_sentences=None):
    # Load the data
    (sentences_train, y_train), (sentences_test, y_test) = imdb.load_data()
    
    # Take only a given percentage of the entire data
    if percentage_of_sentences is not None:
        assert 0 < percentage_of_sentences <= 100
        
        len_train = int(percentage_of_sentences/100*len(sentences_train))
        sentences_train = sentences_train[:len_train]
        y_train = y_train[:len_train]
        
        len_test = int(percentage_of_sentences/100*len(sentences_test))
        sentences_test = sentences_test[:len_test]
        y_test = y_test[:len_test]
            
    # Load the {word: integer} index and shift the ids by 3 to make room for the special tokens
    word_to_id = imdb.get_word_index()
    word_to_id = {k:(v+3) for k,v in word_to_id.items()}
    for i, w in enumerate(['<PAD>', '<START>', '<UNK>', '<UNUSED>']):
        word_to_id[w] = i

    id_to_word = {v:k for k, v in word_to_id.items()}

    # Convert the list of integers to list of words (str)
    X_train = [' '.join([id_to_word[word_id] for word_id in sentence[1:]]) for sentence in sentences_train]
    
    return X_train


### Just run this cell to load the data
X = load_data(percentage_of_sentences=20)

You have already done four NLP exercises and trained RNNs yesterday, so you are now on your own. The goal here is to reproduce an experiment very similar to the letter generation one. But instead of predicting a letter based on a string, you have to predict a word based on a list of words!

The first part should be very similar to the previous exercise.
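As a starting point for that first part, here is one possible sketch of mapping words to integer ids before feeding them to a network. The helpers `build_vocabulary` and `encode`, the `max_words` cap, and the `<PAD>`/`<UNK>` id choices are all assumptions made for this illustration, not requirements of the exercise.

```python
from collections import Counter


def build_vocabulary(sentences, max_words=10000):
    """Map the most frequent words to integer ids (hypothetical helper)."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    # Assumed convention: id 0 is reserved for padding, id 1 for unknown words.
    vocab = {"<PAD>": 0, "<UNK>": 1}
    for word, _count in counts.most_common(max_words):
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def encode(words, vocab):
    """Turn a list of words into a list of ids, falling back to <UNK>."""
    return [vocab.get(word, vocab["<UNK>"]) for word in words]


toy_sentences = ["the movie was good", "the movie was bad"]
vocab = build_vocabulary(toy_sentences, max_words=3)
encoded = encode(["the", "movie", "was", "great"], vocab)
print(vocab)
print(encoded)
```

Capping the vocabulary keeps the output layer of the network small, at the cost of mapping rare words to `<UNK>`.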

Keep in mind that this is quite a complex task, so do not despair: you already have the skills to tackle state-of-the-art problems. The rest is practice and keeping up with advances in the community. Moreover, this task is very open-ended, so you are also allowed to use transfer learning.

That said, don't just load a pre-trained package ;)

In [None]:
### THE BEGINNING OF YOUR JOURNEY HERE