# Recurrent Neural Networks for Language Modeling in Python

David Cecchini
Data Scientist

I am a Data Scientist focusing my work and research on applying machine learning to text data. I entered the field when I co-founded a RegTech startup that automatically collects, classifies and distributes regulations for highly regulated markets. I am currently a Ph.D. student at Tsinghua-Berkeley Shenzhen Institute, a partner program between Tsinghua University in China and UC Berkeley in the USA.

# Intro

1. Introduction to the course
 - Hi, my name is David. I'm a Data Scientist who focuses on text data for real-world applications, and I am proud to be your instructor in this course, where you will be introduced to four different applications of language models using Recurrent Neural Networks with Python.

2. Text data is available online
 - So, why learn to model language (or text) data? Well, we know that Data Science models require data to be effective, and one kind of data that is available on the Internet is text. From news articles to tweets, the volume of text data is increasing fast and is freely accessible to anyone with an Internet connection.

3. Applications of machine learning to text data
 - So, what can Data Scientists do with all this data? In this course we will introduce 4 applications: 
    1. sentiment analysis
    2. multi-class classification
    3. text generation
    4. neural machine translation

4. Sentiment analysis
 - If you interact with customers online, you may be interested in knowing how they feel about your brand or product. To do that, you can use sentiment analysis models to classify their messages as positive or negative.

5. Multi-class classification
 - You may also want to build a recommender system that needs to categorize news articles into a set of pre-defined categories.

6. Text generation
 - Also, it is possible to generate text automatically using a specific writing style, or automatically reply to messages.

7. Neural machine translation
 - Lastly, it is also possible to create models that translate from one language to another.

8. Recurrent Neural Networks
 - All these applications are possible with a type of Deep Learning architecture called Recurrent Neural Networks (RNNs). So what is different about RNN architectures, and why do we use them?
 - The main advantages of using RNNs for text data are that
     1. they reduce the number of parameters of the model (by avoiding one-hot encoding) and
     2. they share weights between different positions of the text
 - In the example, the model uses information from all the words to predict if the movie review was good or not.

9. Sequence to sequence models
 - RNNs model sequence data and can handle inputs and outputs of different lengths.
    1. Many inputs to one output is commonly used for classification tasks, where the final output is a probability distribution.
        - This is used in the sentiment analysis and multi-class classification applications.
    2. Many inputs to many outputs: text generation
        - starts the same way as the classification case, but for the outputs, it uses the previous prediction as the input to the next prediction.
    3. Many inputs to many outputs: neural machine translation
        - is separated into two blocks: the encoder and the decoder. The encoder learns the characteristics of the input language, while the decoder learns those of the output language. The encoder makes no predictions (no arrows going up), and the decoder doesn't receive inputs (no arrows from below).
    4. Many inputs to many outputs: language models
        - starts with an artificial zero input, and then for every input word i the model tries to predict the next word, i plus one.
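
The text generation case above, where each prediction is fed back as the next input, can be sketched with a toy stand-in for a trained model. The lookup table `toy_next_word`, the `"<end>"` marker and the `generate` helper are all hypothetical names used only for illustration; a real RNN would return a probability distribution over the vocabulary instead of a fixed next word.

```python
# Hypothetical stand-in for a trained model: maps the last word to the next word.
toy_next_word = {"i": "love", "love": "this", "this": "movie", "movie": "<end>"}


def generate(seed_words, max_new_words=10):
    """Generate words by feeding each prediction back as the next input."""
    generated = list(seed_words)
    for _ in range(max_new_words):
        # "Predict" the next word from the most recent one
        next_word = toy_next_word.get(generated[-1], "<end>")
        if next_word == "<end>":
            break
        # The prediction becomes part of the input for the next step
        generated.append(next_word)
    return generated


generated_words = generate(["i"])
print(" ".join(generated_words))
```

The loop stops either when the stand-in model emits the end marker or after `max_new_words` steps, which is a common way to keep generation from running indefinitely.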

# Review 

## Comparing the number of parameters of RNN and ANN
In this exercise, you will compare the number of parameters of an artificial neural network (ANN) with the recurrent neural network (RNN) architectures. Here, the vocabulary size is equal to 10,000 for both models.

The models have been defined for you with similar architectures of only one layer with 256 units (Dense or RNN) plus the output layer. They are stored in the variables ann_model and rnn_model.

In [None]:
In [1]:
ann_model.summary()
Model: "ann_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, None, 256)         2560256   
_________________________________________________________________
dense_2 (Dense)              (None, None, 1)           257       
=================================================================
Total params: 2,560,513
Trainable params: 2,560,513
Non-trainable params: 0
################################################
    
In [2]:
rnn_model.summary()
Model: "rnn_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
simple_rnn_1 (SimpleRNN)     (None, 256)               66048     
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 257       
=================================================================
Total params: 66,305
Trainable params: 66,305
Non-trainable params: 0
_________________________________________________________________

The RNN model has fewer parameters than the ANN model.
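
The parameter counts printed in the two summaries can be reproduced by hand. The sketch below assumes that the Dense layer of the ANN receives a 10,000-dimensional input (one value per vocabulary word) and that the SimpleRNN layer receives a single feature per time step; the second assumption is inferred from the reported 66,048 parameters rather than stated in the exercise. The helper names are illustrative only.

```python
def dense_params(input_dim, units):
    # One weight per (input, unit) pair plus one bias per unit
    return input_dim * units + units


def simple_rnn_params(input_dim, units):
    # Input weights + recurrent (unit-to-unit) weights + one bias per unit
    return input_dim * units + units * units + units


vocab_size = 10000
units = 256

# ANN: Dense(256) on a vocabulary-sized input, then Dense(1)
ann_total = dense_params(vocab_size, units) + dense_params(units, 1)

# RNN: SimpleRNN(256) on one feature per time step (assumed), then Dense(1)
rnn_total = simple_rnn_params(1, units) + dense_params(units, 1)

print(ann_total)  # 2560513
print(rnn_total)  # 66305
```

Almost all of the ANN parameters come from the 10,000 x 256 input weight matrix, which the RNN avoids because it reads the text one token at a time and reuses the same weights at every position.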

## Sentiment analysis
In the video exercise, you were exposed to the various applications of sequence to sequence models. In this exercise you will see how to use a pre-trained model for sentiment analysis.

The model is pre-loaded in the environment in the variable model. The tokenized test set variables X_test and y_test and the pre-processed original text data sentences from IMDb are also available. You will learn how to pre-process the text data and how to create and train the model using Keras later in the course.

You will use the pre-trained model to obtain predictions of sentiment. The model returns a number between zero and one representing the probability that the sentence has a positive sentiment. So, you will create a decision rule to set the prediction to positive or negative.

In [None]:
# pandas is needed to build the results data frame below
import pandas as pd

# Inspect the first sentence on `X_test`
print(X_test[0])

# Get the prediction for all the sentences
# only takes 1 argument here
pred = model.predict(X_test)

# Transform the prediction into positive (> 0.5) or negative (<= 0.5)
pred_sentiment = ["positive" if x > 0.5 else "negative" for x in pred]

# Create a data frame with sentences, predictions and true values
result = pd.DataFrame({'sentence': sentences, 'y_pred': pred_sentiment, 'y_true': y_test})

# Print the first lines of the data frame
print(result.head())

'''
                                            sentence    y_pred    y_true
0  the it of yet br stress and must in at town wh...  positive  negative
1  the what have just be ever have 2 at is over d...  negative  positive
2  the was me of and in character and performance...  negative  positive
3  the as on mean unlike and movie pictures is pa...  negative  negative
4  the genuine was capture now and and and new to...  negative  negative
'''

You can see that some of the predictions were correct and some were not. The model used was very simple and its accuracy was not very high. Later you will learn some approaches for tuning the sentiment classification model. The process of pre-processing the text data and creating, training and testing models in Keras will also be detailed later in the course. Finally, you created a decision rule to determine if the sentiment would be classified as positive or negative. In many applications, the value of 0.5 is used as the decision boundary, but other values can also be used depending on what metric you want to optimize.
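
To make the effect of the decision boundary concrete, here is a minimal sketch that applies two different thresholds to the same probabilities. The probabilities in `probs` are made-up values, not output from the course model, and `to_sentiment` is a hypothetical helper.

```python
def to_sentiment(probabilities, threshold=0.5):
    """Label each probability as positive if it is above the threshold."""
    return ["positive" if p > threshold else "negative" for p in probabilities]


# Made-up probabilities of positive sentiment for four sentences
probs = [0.91, 0.55, 0.48, 0.07]

default_labels = to_sentiment(probs)                # boundary at 0.5
strict_labels = to_sentiment(probs, threshold=0.6)  # stricter boundary

print(default_labels)
print(strict_labels)
```

Raising the threshold turns the borderline 0.55 prediction into a negative one, which trades fewer false positives for more false negatives.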

# Language models

1. Introduction to language models
 - In this lesson, you will learn in more detail how to create a language model from raw text data.

2. Sentence probability
 - Language models represent the probability of a sentence. For example, what is the probability of the sentence "I love this movie"? What is the probability of each word in this sentence appearing in this particular order? The way this probability is computed changes from one model to another. Unigram models use the probability of each word inside the document and assume the probabilities are independent.
 - N-gram models use the probability of each word conditional on the previous N minus one words.
 - When N equals 2 the model is called a bigram, and when N equals 3 it is called a trigram.
 - The skip-gram model does the opposite: it computes the probability of the context words, or neighboring words, given the center word.
 - Neural network models whose output layer applies a softmax function and has as many units as the size of the vocabulary are also language models.

4. Link to RNNs
 - We are focusing on Recurrent Neural Networks, so how exactly are language models related to them? Recurrent Neural Network models are themselves language models when trained on text data, because they give the probability of the next token given the previous k tokens.
 - Also, an embedding layer can be used to create vector representations of the tokens as the first layer.

6. Building vocabulary dictionaries
 - When creating RNN models, we need to transform the text data into a sequence of numbers, which are the indexes of the tokens in the array of unique tokens, the vocabulary. To do that, we first need to create an array containing each unique word of the corpus. We can get all the words in a text by splitting the text using space as the separator, and we can use the list-set combination to create a list of unique words. Other languages such as Chinese need additional steps to get words, since there is no space between the characters. We can now create dictionaries that map words to their index in the vocabulary and vice versa using dictionary comprehensions. By enumerating a list, we obtain the numeric indexes and the items as tuples, and we can use them to create key-value dictionaries. The first dictionary uses the words as keys and the indexes as values, so it can transform the text into numerical values. The latter one is used to go back from numbers to words, since it has indexes as keys and words as values.

7. Preprocessing input
 - With the created dictionaries, we can prepare pairs of X and y to be used in a supervised machine learning model. For that, we can loop over the sequence of numerical indexes in blocks of fixed length. We use the initial words as X and the final word as y, and then shift the window step words forward. If we use a step equal to 2, the X sentences will be shifted by 2 words at a time.

8. Transforming new texts
 - When preparing new data, we can use the dictionary to get the correct index for each word. Using the example on the slide:
    1. create a list, new text split, that will contain the transformed text
    2. loop over every sentence of the new text
    3. create a temporary list that will contain the current sentence
    4. iterate over all words of the sentence by splitting the sentence on its white spaces
    5. get the index of each word using the dictionary
    6. append the index to the sentence list
    7. append the sentence of indexes to the first list you created
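
The unigram and bigram definitions from the sentence probability slide above can be illustrated with simple counts. This is a minimal sketch on a made-up toy corpus, using plain maximum-likelihood estimates with no smoothing; the function names are illustrative and not part of the course material.

```python
from collections import Counter

# Made-up toy corpus, already split into tokens
corpus = "i love this movie . i love this song . i hate this movie .".split(' ')

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus[:-1], corpus[1:]))


def unigram_prob(word):
    # P(word) = count(word) / total number of tokens
    return unigram_counts[word] / len(corpus)


def bigram_prob(prev, word):
    # P(word | prev) = count(prev, word) / count(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]


sentence = ["i", "love", "this", "movie"]

# Unigram model: words are treated as independent
p_unigram = 1.0
for wd in sentence:
    p_unigram *= unigram_prob(wd)

# Bigram model: each word is conditioned on the previous one
p_bigram = unigram_prob(sentence[0])
for prev, wd in zip(sentence[:-1], sentence[1:]):
    p_bigram *= bigram_prob(prev, wd)

print(p_unigram, p_bigram)
```

The bigram model assigns this sentence a much higher probability than the unigram model, because it rewards the word order that was actually observed in the corpus.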

## Building vocabulary dictionaries

In [None]:
# get unique words
unique_words = list(set(text.split(' ')))

# create dictionary: word is key, index is value
word_to_index = {k:v for (v,k) in enumerate(unique_words)}

# create dictionary: index is key, word is value
index_to_word = {k:v for (k,v) in enumerate(unique_words)}

#####################
# preprocessing input
#####################
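# convert the text into a list of word indexes
# (`text`, `sentence_size` and `step` are assumed to be defined beforehand)
text_indexes = [word_to_index[wd] for wd in text.split(' ')]
# initialize variables X and y
X = []
y = []
# loop over the indexes: length 'sentence_size' per time with step equal to 'step'
# if step=2, then sentences will be shifted 2 words at a time
for i in range(0, len(text_indexes) - sentence_size, step):
    X.append(text_indexes[i:i + sentence_size])
    y.append(text_indexes[i + sentence_size])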

'''
example (numbers are numerical indexes of vocabulary):
sentence is: 'i loved this movie' > (['i','loved','this'],'movie')
X[0],y[0] = ([10, 444, 11], 17)
'''

#####################
# transforming new texts
#####################
# create list to keep the sentences of indexes
new_text_split = []
# loop and get the indexes from dictionary
for sentence in new_text:
    # temporary list to contain the current sentence
    sent_split = []
    # split the sentences into words
    for wd in sentence.split(' '):
        ix = word_to_index[wd]
        sent_split.append(ix)
    # append the sentence of indexes on the first list
    new_text_split.append(sent_split)
    
    

## Getting used to text data
In this exercise, you will play with text data by analyzing quotes from Sheldon Cooper in The Big Bang Theory TV show. This will give you a chance to analyze sentences to obtain insights on what it's like to deal with real-world text data.

You will use dictionary comprehensions to create dictionaries that map words to indexes and vice versa. Dictionaries are used here instead of, for example, a pandas.DataFrame because they are more intuitive and don't add unnecessary complexity.

The data is available in sheldon_quotes with the first two sentences already printed for you.

In [None]:
# Transform the list of sentences into a list of words
all_words = ' '.join(sheldon_quotes).split(' ')

# Get the unique words
unique_words = list(set(all_words))

# Dictionary of indexes as keys and words as values
index_to_word = {i:wd for i, wd in enumerate(sorted(unique_words))}

print(index_to_word)

# Dictionary of words as keys and indexes as values
word_to_index = {wd:i for i, wd in enumerate(sorted(unique_words))}

print(word_to_index)
'''
{0: '(3', 1: 'Ah,', 2: "Amy's", 3: 'And', 4: 'Explorer', 5: 'Firefox.', 6: 'For', 7: 'Galileo,', 8: 'Goblin', 9: 'Green', 10: 'Hubble', 11: 'I', 12: "I'm", 13: 'Internet', 14: 'Ladybugs', 15: 'Oh', 16: 'Paul', 17: 'Penny', 18: 'Penny!', 19: 'Pope', 20: 'Scissors', 21: 'She', 22: 'Spider-Man,', 23: 'Spock', 24: 'Spock,', 25: 'Thankfully', 26: 'The', 27: 'Two', 28: 'V', 29: 'Well,', 30: 'What', 31: 'Wheaton!', 32: 'Wil', 33: "You're", 34: 'a', 35: 'afraid', 36: 'all', 37: 'always', 38: 'am', 39: 'and', 40: 'appeals', 41: 'are', 42: 'art', 43: 'as', 44: 'at', 45: 'aware', 46: 'based', 47: 'be', 48: 'became', 49: 'because', 50: 'been', 51: 'birthday', 52: 'bitch.', 53: 'black', 54: 'blood', 55: 'bottle.', 56: 'bottom', 57: 'brain', 58: 'breaker.', 59: 'bus', 60: 'but', 61: 'calls', 62: 'can', 63: 'care', 64: 'catatonic.', 65: 'center', 66: 'chance', 67: 'circuit', 68: 'computer', 69: 'could', 70: 'covers', 71: 'crushes', 72: 'cry', 73: 'cuts', 74: 'days', 75: 'decapitates', 76: 'deity.', 77: 'discovering', 78: 'disproves', 79: 'do', 80: 'does', 81: "don't", 82: 'eat', 83: 'eats', 84: 'every', 85: 'example,', 86: 'flashlight', 87: 'for', 88: 'free', 89: 'genitals,', 90: 'genitals.', 91: 'get', 92: 'ghost', 93: 'girlfriend', 94: 'gravity,', 95: 'had', 96: 'hand.', 97: 'has,', 98: 'have', 99: 'have?', 100: 'having', 101: 'heartless', 102: 'here', 103: 'hole', 104: 'humans', 105: 'if', 106: 'impairment;', 107: 'in', 108: 'insane,', 109: 'insects', 110: 'involves', 111: 'is', 112: "isn't", 113: 'it', 114: 'it.', 115: 'just', 116: 'kept', 117: 'knocks)', 118: 'later,', 119: 'little', 120: 'living', 121: 'lizard', 122: 'lizard,', 123: 'loud', 124: 'makes', 125: 'man', 126: 'masturbating', 127: 'me', 128: 'memory', 129: 'messy,', 130: 'money.', 131: 'moon-pie', 132: 'mother', 133: 'moved', 134: 'much', 135: 'must', 136: 'my', 137: 'next', 138: 'not', 139: 'nummy-nummy', 140: 'of', 141: 'on', 142: 'one.', 143: 'other', 144: 'others', 145: 'paper', 146: 'paper,', 147: 'people', 
148: 'please', 149: 'poisons', 150: 'present', 151: 'prize', 152: 'relationship', 153: 'render', 154: 'reproduce', 155: 'right', 156: 'rock', 157: 'rock,', 158: 'rushed', 159: 'sad.', 160: 'say', 161: 'scissors', 162: 'scissors,', 163: 'scissors.', 164: 'searching', 165: 'sexual', 166: 'she', 167: 'smashes', 168: 'so', 169: 'sooner', 170: 'stopping', 171: 'stupid,', 172: 'taken', 173: 'telescope', 174: 'tested.', 175: 'that', 176: 'the', 177: 'things', 178: 'think', 179: 'thou', 180: 'three', 181: 'to', 182: 'today', 183: 'town.', 184: 'tried', 185: 'unnecessary', 186: 'unsanitary', 187: 'up.', 188: 'used', 189: 'usually', 190: 'vaporizes', 191: 'vodka', 192: 'way', 193: 'we', 194: 'well,', 195: 'which', 196: 'white', 197: 'will', 198: 'with', 199: 'women,', 200: 'would', 201: 'years,', 202: 'you', 203: 'your'}
{'(3': 0, 'Ah,': 1, "Amy's": 2, 'And': 3, 'Explorer': 4, 'Firefox.': 5, 'For': 6, 'Galileo,': 7, 'Goblin': 8, 'Green': 9, 'Hubble': 10, 'I': 11, "I'm": 12, 'Internet': 13, 'Ladybugs': 14, 'Oh': 15, 'Paul': 16, 'Penny': 17, 'Penny!': 18, 'Pope': 19, 'Scissors': 20, 'She': 21, 'Spider-Man,': 22, 'Spock': 23, 'Spock,': 24, 'Thankfully': 25, 'The': 26, 'Two': 27, 'V': 28, 'Well,': 29, 'What': 30, 'Wheaton!': 31, 'Wil': 32, "You're": 33, 'a': 34, 'afraid': 35, 'all': 36, 'always': 37, 'am': 38, 'and': 39, 'appeals': 40, 'are': 41, 'art': 42, 'as': 43, 'at': 44, 'aware': 45, 'based': 46, 'be': 47, 'became': 48, 'because': 49, 'been': 50, 'birthday': 51, 'bitch.': 52, 'black': 53, 'blood': 54, 'bottle.': 55, 'bottom': 56, 'brain': 57, 'breaker.': 58, 'bus': 59, 'but': 60, 'calls': 61, 'can': 62, 'care': 63, 'catatonic.': 64, 'center': 65, 'chance': 66, 'circuit': 67, 'computer': 68, 'could': 69, 'covers': 70, 'crushes': 71, 'cry': 72, 'cuts': 73, 'days': 74, 'decapitates': 75, 'deity.': 76, 'discovering': 77, 'disproves': 78, 'do': 79, 'does': 80, "don't": 81, 'eat': 82, 'eats': 83, 'every': 84, 'example,': 85, 'flashlight': 86, 'for': 87, 'free': 88, 'genitals,': 89, 'genitals.': 90, 'get': 91, 'ghost': 92, 'girlfriend': 93, 'gravity,': 94, 'had': 95, 'hand.': 96, 'has,': 97, 'have': 98, 'have?': 99, 'having': 100, 'heartless': 101, 'here': 102, 'hole': 103, 'humans': 104, 'if': 105, 'impairment;': 106, 'in': 107, 'insane,': 108, 'insects': 109, 'involves': 110, 'is': 111, "isn't": 112, 'it': 113, 'it.': 114, 'just': 115, 'kept': 116, 'knocks)': 117, 'later,': 118, 'little': 119, 'living': 120, 'lizard': 121, 'lizard,': 122, 'loud': 123, 'makes': 124, 'man': 125, 'masturbating': 126, 'me': 127, 'memory': 128, 'messy,': 129, 'money.': 130, 'moon-pie': 131, 'mother': 132, 'moved': 133, 'much': 134, 'must': 135, 'my': 136, 'next': 137, 'not': 138, 'nummy-nummy': 139, 'of': 140, 'on': 141, 'one.': 142, 'other': 143, 'others': 144, 'paper': 145, 'paper,': 146, 'people': 147, 
'please': 148, 'poisons': 149, 'present': 150, 'prize': 151, 'relationship': 152, 'render': 153, 'reproduce': 154, 'right': 155, 'rock': 156, 'rock,': 157, 'rushed': 158, 'sad.': 159, 'say': 160, 'scissors': 161, 'scissors,': 162, 'scissors.': 163, 'searching': 164, 'sexual': 165, 'she': 166, 'smashes': 167, 'so': 168, 'sooner': 169, 'stopping': 170, 'stupid,': 171, 'taken': 172, 'telescope': 173, 'tested.': 174, 'that': 175, 'the': 176, 'things': 177, 'think': 178, 'thou': 179, 'three': 180, 'to': 181, 'today': 182, 'town.': 183, 'tried': 184, 'unnecessary': 185, 'unsanitary': 186, 'up.': 187, 'used': 188, 'usually': 189, 'vaporizes': 190, 'vodka': 191, 'way': 192, 'we': 193, 'well,': 194, 'which': 195, 'white': 196, 'will': 197, 'with': 198, 'women,': 199, 'would': 200, 'years,': 201, 'you': 202, 'your': 203}
'''

## Preparing text data for model input
Previously, you learned how to create dictionaries of indexes to words and vice versa. In this exercise, you will split the text by characters and continue to prepare the data for supervised learning.

Splitting the texts into characters may seem strange, but it is often done for text generation. The process of preparing the data stays the same; the only change is how the texts are split.

You will create the training data containing a list of fixed-length texts and their labels, which are the corresponding next characters.

You will continue to use the dataset containing quotes from Sheldon (The Big Bang Theory), available in the sheldon_quotes variable.

The print_examples() function prints the pairs so you can see how the data was transformed. Use help() for details.

In [None]:
# Create lists to keep the sentences and the next character
sentences = []   # ~ Training data
next_chars = []  # ~ Training labels

# Define hyperparameters
step = 2          # ~ Step to take when reading the texts in characters
chars_window = 10 # ~ Number of characters to use to predict the next one  

# Loop over the text: length `chars_window` per time with step equal to `step`
for i in range(0, len(sheldon_quotes) - chars_window, step):
    sentences.append(sheldon_quotes[i:i + chars_window])
    next_chars.append(sheldon_quotes[i + chars_window])

# Print 10 pairs using function
print_examples(sentences, next_chars, 10)
'''
Sentence	Next char
You're afr	a
u're afrai	d
re afraid 	o
 afraid of	 
fraid of i	n
aid of ins	e
d of insec	t
of insects	 
 insects a	n
'''

With this you are ready to use the sentences and next character to train a supervised learning model! 

Don't mind that the printed sentences look strange. Since you used characters instead of words and defined a sentence with a fixed length, the texts can be broken in the middle of a word.

Note that the process of creating the sentences and next chars is the same when using words instead of characters; the only change is the values present in the lists (words instead of characters).

Now, before going straight to training machine learning models, let's see what to do when you have new text data that has not been pre-processed yet.

## Transforming new text

In [None]:
'''
In this exercise, you will transform a new text into sequences of 
numerical indexes on the dictionaries created before.

This is useful when you already have a trained model and want to 
apply it on a new dataset. The preprocessing steps done on the 
training data should also be applied to the new text, 
so the model can make predictions/classifications.

Here, you will also use a special token "<UKN/>" to represent words
that are not in the vocabulary. Typically, these special tokens 
are the first indexes of the dictionaries, the position 0.

The variables word_to_index, index_to_word and vocabulary are 
already loaded in the environment. The new text is also loaded 
in the variable new_text and has been printed for you to have 
a look.
'''

In [None]:
# Loop through the sentences and get indexes
new_text_split = []
for sentence in new_text:
    sent_split = []
    for wd in sentence.split(' '):
        index = word_to_index.get(wd, 0)
        sent_split.append(index)
    new_text_split.append(sent_split)

# Print the first sentence's indexes
print(new_text_split[0])

# Print the sentence converted using the dictionary
print(' '.join([index_to_word[index] for index in new_text_split[0]]))

'''
[276, 15070, 10160, 14750, 14590, 5715, 13813, 12418, 22564, 12797, 15443, 13813, 0, 5368, 14578, 13813, 16947, 12507, 23031, 12859, 5975, 16795, 13813, 5368, 21189, 22564, 0, 5910]
A man either lives life as it happens to him meets it <UKN/> and licks it or he turns his back on it and starts to <UKN/> away
'''

In [None]:
'''
You can see that some of the words were not found on the dictionary 
and have index = 0. By using the token '<UKN/>' in the training phase,
you can easily use the model on unseen data without getting errors.
This is also done when limiting the size of the vocabulary, say to 
5,000 most frequent words, and setting the others as '<UKN/>'.
'''
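
The idea of limiting the vocabulary to the most frequent words and mapping everything else to '<UKN/>' can be sketched as follows. The training words, the `max_words` value and the `build_vocabulary` helper are all made up for illustration; the course environment builds its dictionaries differently and with a much larger vocabulary.

```python
from collections import Counter


def build_vocabulary(words, max_words):
    """Keep the `max_words` most frequent words and reserve index 0 for '<UKN/>'."""
    counts = Counter(words)
    most_common = [wd for wd, _ in counts.most_common(max_words)]
    vocabulary = ["<UKN/>"] + sorted(most_common)
    word_to_index = {wd: i for i, wd in enumerate(vocabulary)}
    index_to_word = {i: wd for i, wd in enumerate(vocabulary)}
    return word_to_index, index_to_word


# Made-up training text
train_words = "the cat sat on the mat and the dog sat on the rug".split(' ')
word_to_index, index_to_word = build_vocabulary(train_words, max_words=3)

# Words outside the limited vocabulary fall back to index 0
new_sentence = "the cat sat on the sofa"
encoded = [word_to_index.get(wd, 0) for wd in new_sentence.split(' ')]
decoded = ' '.join(index_to_word[ix] for ix in encoded)

print(encoded)
print(decoded)
```

Both a rare training word ("cat") and a word never seen in training ("sofa") end up as '<UKN/>', so the model handles them in the same way.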

## Intro to RNN inside Keras

1. Introduction to RNN inside Keras
 - In this lesson, we implement the RNN models using keras. Previously, you were introduced to the architecture of language models. Now we will use keras to create and train RNN models.

2. What is keras?
 - Keras is a high-level API that runs on top of deep learning frameworks as backends.
 - It is possible to configure keras with TensorFlow, CNTK or Theano.
 - To install keras, we can simply use the Python package manager pip. After installation, we can use its modules for fast experimentation and research. Next we will introduce the main modules of keras that will be useful for the language models.

3. keras.models
 - keras models contain two classes of models. The Sequential class has a structure where each layer is implemented one after the other, meaning that the output of one layer is the input of the next one. The Model class is a generic definition of a model that is more flexible and allows multiple inputs and outputs.

4. keras.layers
 - keras layers contains the different types of layers including the 
     - LSTM cells
     - GRU cells
     - Dense
     - Dropout
     - Embedding
     - Bidirectional

5. keras.preprocessing
 - keras preprocessing contains useful functions for pre-processing the data, such as the pad_sequences function, which transforms sequences of token indexes into fixed-length vectors. In the example, we padded the texts to an equal length of 3.
 - e.g. keras.preprocessing.sequence.pad_sequences(texts, maxlen=3)

6. keras.datasets
 - The datasets module contains useful datasets. 
 - imdb movie reviews that is used for sentiment analysis
 - reuters newswire dataset used for topic classification with 46 classes
 - other datasets that you can check on the keras website.

7. Creating a model
 - You can build a Sequential model in keras with just a few lines of code. Import the required classes with: from keras.models import Sequential and from keras.layers import Dense. Then, instantiate the class in a variable called model with: model = Sequential(). Add the desired layers with the add method, as in: model.add(Dense(64, activation='relu', input_dim=100)). The input_dim parameter declares the shape of the input data, which is mandatory for the first layer in the model. Then add the output layer: model.add(Dense(1, activation='sigmoid')). Finally, compile the model by calling its compile method, passing the string 'adam' to the optimizer parameter, the string 'mean_squared_error' to loss, and a single-element list containing the string 'accuracy' to the metrics parameter.

8. Training the model
 - To train the model, we use the fit method on the training data. For example: model.fit(X_train, y_train, epochs=10, batch_size=32)
 - epochs is the number of iterations over the entire dataset and defaults to one. 
 - batch size is the size of a subset of the data that will be used on each step. When the dataset cannot fit in the memory this is crucial. It defaults to 32.

9. Model evaluation and usage
 - To analyze the model's performance, we can use the evaluate method, as in model.evaluate(X_test, y_test). This method returns the loss and accuracy values. To use the model on new data, use the predict method, as in model.predict(new_data).

10. Full example: IMDB Sentiment Classification
 - To create a full example, let's instantiate the Sequential class, add three layers (don't worry about the new layers for now, we will explain them in detail in chapter 2) and compile the model. Next, we can use the training set to fit the model and measure its accuracy on the test set.

In [None]:
# import the classes used in the examples below
from keras.models import Sequential, Model
from keras.layers import LSTM, Dense, Embedding, Input

In [None]:
# Creating a model
# import required modules
from keras.models import Sequential
from keras.layers import Dense

# instantiate the model class
model = Sequential()

# add the layers
model.add(Dense(64, activation='relu', input_dim=100))
model.add(Dense(1, activation='sigmoid'))

# compile the model
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['accuracy'])

# training the model
model.fit(X_train, y_train, epochs=10, batch_size=32)

# evaluate the model, yields loss score and accuracy
model.evaluate(X_test, y_test)

# make predictions on new data
model.predict(new_data)

In [None]:
# example - IMDB sentiment classification
# build and compile the model
model = Sequential()
model.add(Embedding(10000, 128))
model.add(LSTM(128, dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# training
model.fit(x_train, y_train, epochs=5)

# evaluation
score, acc = model.evaluate(x_test, y_test)

In this exercise you'll practice using two classes from the keras.models module. You will create the same model twice, once with the Sequential class and once with the Model class.

The Sequential class is easier since the layers are assumed to be in order, while the Model class is more flexible and allows multiple inputs, multiple outputs and shared layers (shared weights).

The Model class needs to explicitly declare the input layer, while in the Sequential class, this is done with the input_shape parameter.

In [None]:
# example - create one model with 2 classes - Sequential and Model
# Instantiate the class
model = Sequential(name="sequential_model")

# One LSTM layer (defining the input shape because it is the 
# initial layer)
model.add(LSTM(128, input_shape=(None, 10), name="LSTM"))

# Add a dense layer with one unit
model.add(Dense(1, activation="sigmoid", name="output"))

# The summary shows the layers and the number of parameters 
# that will be trained
model.summary()
'''
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
LSTM (LSTM)                  (None, 128)               71168     
_________________________________________________________________
output (Dense)               (None, 1)                 129       
=================================================================
Total params: 71,297
Trainable params: 71,297
Non-trainable params: 0
_________________________________________________________________
'''
# Define the input layer
main_input = Input(shape=(None, 10), name="input")

# One LSTM layer (input shape is already defined)
lstm_layer = LSTM(128, name="LSTM")(main_input)

# Add a dense layer with one unit
main_output = Dense(1, activation="sigmoid", name="output")(lstm_layer)

# Instantiate the class at the end
model = Model(inputs=main_input, outputs=main_output, name="modelclass_model")

# Same amount of parameters to train as before (71,297)
model.summary()
'''
Model: "modelclass_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input (InputLayer)           (None, None, 10)          0         
_________________________________________________________________
LSTM (LSTM)                  (None, 128)               71168     
_________________________________________________________________
output (Dense)               (None, 1)                 129       
=================================================================
Total params: 71,297
Trainable params: 71,297
Non-trainable params: 0
_________________________________________________________________
'''


The keras.models.Sequential is very easy to use to add layers in sequence. 

On the other hand, the keras.models.Model class is very flexible and is usually the choice when scientists need deep customization in their solution. You also saw how one layer is connected to another in both cases: by adding layers in sequence with the add method, or by creating a layer and calling it on the output of the previous layer like a function. In the Model class API, every layer is callable on a tensor and always returns a tensor.

### Keras preprocessing
The second most important module of Keras is keras.preprocessing. You will see how to use its most important classes and functions to prepare raw data into the correct input shape. Keras provides functionality that replaces the dictionary approach you learned before.

You will use the class keras.preprocessing.text.Tokenizer to create a dictionary of words with the method .fit_on_texts(), and to change the texts into numerical ids representing the index of each word in the dictionary with the method .texts_to_sequences().

Then, use the function pad_sequences() from keras.preprocessing.sequence to make all the sequences the same length (necessary for the model) by adding zeros to the short texts and truncating the long ones.

In [None]:
# Import relevant classes/functions
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# In [1]:
# texts
# Out[1]:
# array(['So if a photon is directed through a plane with two slits in it and either slit is observed it will not go through both slits. If it’s unobserved it will, however, if it’s observed after it’s left the plane but before it hits its target, it will not have gone through both slits.',
#        'Hello, female children. Allow me to inspire you with a story about a great female scientist. Polish-born, French-educated Madame Curie. Co-discoverer of radioactivity, she was a hero of science, until her hair fell out, her vomit and stool became filled with blood, and she was poisoned to death by her own discovery. With a little hard work, I see no reason why that can’t happen to any of you. Are we done? Can we go?'],
#       dtype='<U419')

# Build the dictionary of indexes
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

# Change texts into sequence of indexes
texts_numeric = tokenizer.texts_to_sequences(texts)
print("Number of words in the sample texts: ({0}, {1})".format(
    len(texts_numeric[0]), len(texts_numeric[1])))

# Pad the sequences (by default, zeros are added at the start of short texts)
texts_pad = pad_sequences(texts_numeric, maxlen=60)
print("Now the texts have fixed length: 60. Let's see the first one: \n{0}".format(
    texts_pad[0]))
'''
Number of words in the sample texts: (54, 78)
Now the texts have fixed length: 60. Let's see the first one: 
[ 0  0  0  0  0  0 24  4  1 25 13 26  5  1 14  3 27  6 28  2  7 29 30 13
 15  2  8 16 17  5 18  6  4  9 31  2  8 32  4  9 15 33  9 34 35 14 36 37
  2 38 39 40  2  8 16 41 42  5 18  6]
'''
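
To make the two preprocessing steps concrete, here is a minimal pure-Python sketch of the same idea. It is not the Keras implementation: the helper names (`build_word_index`, `to_sequences`, `pad`) are hypothetical, splitting on whitespace is a simplification (the Keras Tokenizer also filters punctuation by default), and the padding only mimics the `pad_sequences` defaults, where zeros are added and tokens are cut at the start of each sequence.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer id, most frequent word first (ids start at 1)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Index 0 is left free so it can be used as the padding value
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    """Replace every known word by its id; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

def pad(sequences, maxlen):
    """Left-pad short sequences with zeros and keep the last maxlen ids of long ones."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded

sample_texts = ["the cat sat", "the dog sat on the mat"]
word_index = build_word_index(sample_texts)
sequences = to_sequences(sample_texts, word_index)
padded = pad(sequences, maxlen=4)
print(word_index)
print(padded)
```

The short text gains a leading zero and the long one loses its first words, which is the same behaviour visible in the padded output of the Keras example.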

### Your first RNN model
In this exercise you will put into practice the Keras modules to build your first RNN model and use it to classify sentiment on movie reviews.

This first model has one recurrent layer with the vanilla RNN cell, SimpleRNN, and an output layer with two possible values: 0 representing negative sentiment and 1 representing positive sentiment.

You will use the IMDB dataset contained in keras.datasets. A model was already trained and its weights stored in the file model_weights.h5. You will build the model's architecture and use the pre-loaded variables x_test and y_test to check its performance.

In [None]:
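# Import the model class and layers (pre-loaded in the exercise environment)
from keras.models import Sequential
from keras.layers import SimpleRNN, Dense

# Build model to classify sentiment on movie reviews
model = Sequential()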
model.add(SimpleRNN(units=128, input_shape=(None, 1)))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', 
              optimizer='adam',
              metrics=['accuracy'])

# Load pre-trained weights
model.load_weights('model_weights.h5')

# Method '.evaluate()' shows the loss and accuracy
loss, acc = model.evaluate(x_test, y_test, verbose=0)
print("Loss: {0} \nAccuracy: {1}".format(loss, acc))

# Loss: 0.6991182217597961 
# Accuracy: 0.495

# Note: the accuracy is very low, about what random guessing gives for two classes

# Vanishing and Exploding gradients

1. Vanishing and exploding gradients
 - You learned how to prepare text documents and use them in an RNN model to classify sentiment on movie reviews. But the accuracy was not as expected! In this lesson you will be introduced to some pitfalls of vanilla RNN cells, namely the vanishing and exploding gradient problems, and how to deal with them.

2. Training RNN models
 - To understand the vanishing or exploding gradient problems, you first need to understand how the RNN model is trained, in other words, how back propagation is performed. In this picture, you can see the forward propagation and back propagation directions. The important part here is that they follow two directions: vertical (between input and output) and horizontal (going through time). Because of this horizontal direction, back propagation is referred to as back propagation through time.

3. Forward propagation
 - In the forward propagation phase, we compute a hidden state a that carries past information by applying a linear combination over the previous hidden state and the current input. The output y is computed only from the last hidden state, often by applying a sigmoid or softmax activation function. The loss function can be the cross-entropy function, and we use it to obtain a numeric value for the error. We can see how past information is carried along during forward propagation with an example: the second step combines the result from the first step with the second word, which it receives as input. We can also see that the weight matrix Wa is used on all steps, which means the weights are shared among all the inputs.

4. Back propagation through time (BPTT)
 - In the back propagation phase, we have to compute the derivatives of the loss function with respect to the parameters. To compute the derivative of the loss with respect to the matrix Wa, we need to use the chain rule because y hat depends on a_t, which in turn depends on Wa. But a_t also depends on a_t minus 1, which depends on Wa as well. Thus, we need to consider the contribution of every previous step by summing up their derivatives with respect to the matrix Wa. The derivative of a_t with respect to Wa also needs the chain rule and can be written as the product of the derivatives of the intermediate states multiplied by the derivative of the first state with respect to the matrix.

5. BPTT continuation
 - Without going into too much detail on the math, when computing the gradients of the loss function with respect to the weight matrix we obtain the matrix Wa to the power t minus one multiplied by a term. Intuitively, if the values of the matrix (more precisely, the magnitudes of its eigenvalues) are below one, the product will converge to zero, and if they are above one it will diverge to infinity.

6. Solutions to the gradient problems
 - Researchers found some approaches to avoid these problems. 
 - Dealing with Exploding gradients
     - Gradient clipping (limiting the size of the gradients) or scaling
 - Dealing with Vanishing gradients
     - Initializing the matrix W as an orthogonal matrix keeps repeated multiplications from shrinking or growing the values (an orthogonal matrix preserves vector norms)
     - Using regularization controls the size of the entries
     - Using the ReLU activation function (instead of tanh, sigmoid, softmax), the derivative becomes a constant (one) for positive inputs, and thus doesn't increase or decrease exponentially
     - use other RNN cells such as GRU and LSTM
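
The convergence argument in item 5 can be checked numerically. The sketch below is an illustration under simplifying assumptions, not the gradient computation of a real RNN: it uses NumPy, a hypothetical helper `repeated_product_norm`, and scaled identity matrices so that the eigenvalues are known exactly. It also includes an orthogonal matrix to show the effect of the orthogonal initialization mentioned in item 6.

```python
import numpy as np

def repeated_product_norm(W, steps):
    """Frobenius norm of W multiplied by itself `steps` times."""
    return np.linalg.norm(np.linalg.matrix_power(W, steps))

size, steps = 4, 50
identity = np.eye(size)

# Eigenvalues below one: the product shrinks towards zero (vanishing)
small = repeated_product_norm(0.5 * identity, steps)

# Eigenvalues above one: the product blows up (exploding)
large = repeated_product_norm(1.5 * identity, steps)

# Orthogonal matrix (Q factor of a random matrix): the norm is preserved
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(size, size)))
orthogonal = repeated_product_norm(Q, steps)

print(small, large, orthogonal)
```

After 50 steps the first value is numerically zero, the second is in the billions, and the third stays at the norm of the 4x4 identity matrix (2.0), no matter how many steps are taken.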

## Exploding gradient problem

In the video exercise, you learned about two problems that may arise when working with RNN models: the vanishing and exploding gradient problems.

This exercise explores the exploding gradient problem, showing that the derivative of a function can increase exponentially, and how to solve it with a simple technique.

The data is already loaded on the environment as X_train, X_test, y_train and y_test.

You will use a Stochastic Gradient Descent (SGD) optimizer and Mean Squared Error (MSE) as the loss function.

In the first step you will observe the gradient exploding by computing the MSE on the train and test sets. In step 2, you will change the optimizer using the clipvalue parameter to solve the problem.

The Stochastic Gradient Descent in Keras is loaded as SGD.
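
Before running the Keras exercise, the effect of clipping can be seen on a toy problem. The following pure-Python sketch is only an illustration with made-up values: it minimizes f(w) = w² with a learning rate that is deliberately too large, and the helper names `clip` and `gradient_descent` are hypothetical. The `clipvalue` argument here plays the same role as the optimizer parameter of that name, limiting each gradient value to the interval [-clipvalue, clipvalue].

```python
def clip(value, limit):
    """Clip a gradient value to the interval [-limit, limit]."""
    return max(-limit, min(limit, value))

def gradient_descent(w, lr, steps, clipvalue=None):
    """Minimize f(w) = w**2 (gradient 2*w), optionally clipping the gradient."""
    for _ in range(steps):
        grad = 2.0 * w
        if clipvalue is not None:
            grad = clip(grad, clipvalue)
        w = w - lr * grad
    return w

# A learning rate this large makes every step overshoot the minimum at w = 0
w_unclipped = gradient_descent(w=10.0, lr=1.1, steps=100)
w_clipped = gradient_descent(w=10.0, lr=1.1, steps=100, clipvalue=3.0)
print(w_unclipped, w_clipped)
```

Without clipping the parameter grows exponentially; with clipping every update is bounded, so the parameter stays near the minimum. Note that clipping only bounds the updates: it does not by itself guarantee convergence, which still depends on the learning rate.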

### Exploding gradient example

In [None]:
# Create a Keras model with one hidden Dense layer
model = Sequential()
model.add(Dense(25, input_dim=20, activation='relu',
                kernel_initializer=he_uniform(seed=42)))
model.add(Dense(1, activation='linear'))

# Compile and fit the model
# (newer Keras versions name the `lr` argument `learning_rate`)
model.compile(loss='mean_squared_error', optimizer=SGD(lr=0.01, momentum=0.9))
history = model.fit(X_train, y_train, validation_data=(
    X_test, y_test), epochs=100, verbose=0)

# See Mean Square Error for train and test data
train_mse = model.evaluate(X_train, y_train, verbose=0)
test_mse = model.evaluate(X_test, y_test, verbose=0)

# Print the values of MSE
print('Train: %.3f, Test: %.3f' % (train_mse, test_mse))

# Train: nan, Test: nan # >> Gradients exploded

### Exploding gradient problem solved with gradient clipping

In [None]:
# set the SGD() clipvalue param = 3.0, then run again

# Create a Keras model with one hidden Dense layer
model = Sequential()
model.add(Dense(25, input_dim=20, activation='relu', kernel_initializer=he_uniform(seed=42)))
model.add(Dense(1, activation='linear'))

# Compile and fit the model
# NOTE - clipvalue added to solve gradient exploding problem
model.compile(loss='mean_squared_error', optimizer=SGD(lr=0.01, momentum=0.9, clipvalue=3.0))
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=100, verbose=0)

# See Mean Square Error for train and test data
train_mse = model.evaluate(X_train, y_train, verbose=0)
test_mse = model.evaluate(X_test, y_test, verbose=0)

# Print the values of MSE
print('Train: %.3f, Test: %.3f' % (train_mse, test_mse))

# Train: 73.888, Test: 100.110

## Vanishing gradient problem
The other possible gradient problem is when the gradients vanish, or go to zero. This is a much harder problem to solve because it is not as easy to detect. If the loss function does not improve on every step, is it because the gradients went to zero and thus didn't update the weights? Or is it because the model is not able to learn?

This problem occurs more often in RNN models when long memory is required, that is, when the sentences are long.

In this exercise you will observe the problem on the IMDB data, with longer sentences selected. The data is loaded in the X and y variables, and the classes Sequential, SimpleRNN and Dense are available, as well as matplotlib.pyplot as plt. The model was pre-trained for 100 epochs and its weights are stored in the file model_weights.h5.

In [None]:
# Create the model
model = Sequential()
model.add(SimpleRNN(units=600, input_shape=(None, 1)))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['accuracy'])

# Load pre-trained weights
model.load_weights('model_weights.h5')

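# Plot the accuracy x epoch graph
# (assumes `history`, the training history of the pre-trained model, is available in the environment)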
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.legend(['train', 'val'], loc='upper left')
plt.show()

![image.png](attachment:image.png)

You can observe that at some point the accuracy stopped improving, which can happen because of the vanishing gradient problem. This kind of problem is harder to detect than the exploding gradient problem and demands deeper analysis by the data scientist. Researchers found an architectural way to solve this problem, which you will study later in this course: instead of using SimpleRNN cells, you can use more complex ones such as Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM) cells.

## GRU and LSTM cells
- model architectures that help solve the vanishing gradient problem
- Gated Recurrent Unit (GRU) cells
- Long Short-Term Memory (LSTM) cells

1. GRU and LSTM cells
 - In this lesson you will learn about two different RNN cells that will achieve good results in language modeling and solve the vanishing gradient problem.

2. SimpleRNN cell in detail
 - Let's first have a detailed look at the SimpleRNN cell. On every cell, we compute the new memory state based on the previous memory state a_t minus one and the current input word Xt. In the computations, we have a weight matrix Wa that is shared between all steps. We will consider the case of classification tasks and thus the output y hat will be computed only in the last step.

3. GRU cell - add Update gate to RNN
 - GRU cells were proposed in 2014, and add one gate to the vanilla RNN cell. Now, before updating the memory cell, we first compute a candidate a tilde that will carry the present information. Then we compute the update gate g_u that will determine whether the candidate a tilde will be used as the memory state or whether we keep the past memory state a_t minus one. If the gate is zero, the network keeps the previous hidden state, and if it is equal to one it uses the new value of a tilde. Other values give a combination of the previous and the candidate memory states, but during training the gate tends to get close to zero or one.

4. LSTM cell - adds 3 gates - forget gate, update gate, output gate
 - LSTM was first proposed in 1997, and adds three gates to the vanilla RNN cell. The forget gate g_f determines whether the previous state c_t minus one should be forgotten (meaning its value is set to zero) or not. The update gate g_u does the same for the candidate hidden state c tilde. The output gate g_o does the same for the new hidden state c_t. The green circles on the picture represent the gates. We can think of them as open or closed gates, allowing the left side to pass through or not when the gate's value is 1 or 0, respectively.

5. No more vanishing gradients
 - Because GRU and LSTM cells add gates to the equations, the gradients are no longer only dependent on the memory cell state. 
 - The derivative of the loss function with respect to the weights matrix depends on all the gates and on the memory cell, summing each of its parts. 
 - Without going into deeper details on the math, this architecture adds the different gradients (corresponding to the gradients of each gate and the memory state), making the total gradient stop converging to zero or diverging. On every step, if the gradient is exponentially increasing or decreasing, we expect the training phase to adjust the value of the corresponding gate accordingly to stop this vanishing or exploding tendency.

6. Usage in keras
 - Without further discussing the intuition and the theory, let's put the new RNN cells into practice in keras. First, the layers with the GRU and LSTM cells are available in keras dot layers dot recurrent, with a shortcut in keras dot layers. To use the GRU and LSTM cells in a keras model, we simply add them as usual. The important parameters are the number of units, meaning the number of memory cells to keep track of, and the return sequences parameter, which is used when adding more than one layer in sequence and makes all the cells emit an output that will be fed to the next layer as input.

In [None]:
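# example
# import the model class and the layers
from keras.models import Sequential
from keras.layers import GRU, LSTM

# add the layers to a model
# note return_sequences True for first layer, but False for last LSTM layer
# (layer names are written without spaces, which some Keras versions reject)
model = Sequential()
model.add(GRU(units=128, return_sequences=True, name='GRU_layer'))
model.add(LSTM(units=64, return_sequences=False, name='LSTM_layer'))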
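
To connect the layers back to the gate equations, the update-gate rule described for the GRU cell can be sketched in NumPy. This follows the simplified description given in the lesson (a single update gate); the GRU layer in Keras also uses a reset gate, which is omitted here. The function `gru_step`, the weight names and all the numbers are hypothetical and chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(a_prev, x_t, W_c, b_c, W_u, b_u):
    """One step of a simplified GRU cell with a single update gate."""
    concat = np.concatenate([a_prev, x_t])
    a_tilde = np.tanh(W_c @ concat + b_c)   # candidate memory state
    g_u = sigmoid(W_u @ concat + b_u)       # update gate, values in (0, 1)
    # Gate near 1 -> take the candidate; gate near 0 -> keep the previous state
    a_t = g_u * a_tilde + (1.0 - g_u) * a_prev
    return a_t, a_tilde

rng = np.random.default_rng(42)
units, features = 3, 2
a_prev = rng.normal(size=units)
x_t = rng.normal(size=features)
W_c = rng.normal(size=(units, units + features))
W_u = rng.normal(size=(units, units + features))
b_c = np.zeros(units)

# A very negative gate bias closes the gate; a very positive one opens it
kept, _ = gru_step(a_prev, x_t, W_c, b_c, W_u, b_u=np.full(units, -50.0))
replaced, a_tilde = gru_step(a_prev, x_t, W_c, b_c, W_u, b_u=np.full(units, 50.0))
print(kept, a_prev)
print(replaced, a_tilde)
```

With the gate closed the new state equals the previous state, and with the gate open it equals the candidate, which is the behaviour described for gate values of zero and one.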

### GRU cells are better than simpleRNN
In this exercise you will re-run the same model as in the first chapter of the course to compare the accuracy of the model by simply changing the SimpleRNN cell to a GRU cell.

The model was already trained for 10 epochs, as was the previous model with a SimpleRNN cell. In order to compare the models, a test set (X_test, y_test) is already loaded in the environment, as well as the old model SimpleRNN_model.

In [None]:
# Import the modules
from keras.layers import GRU, Dense

# Print the old and new model summaries
SimpleRNN_model.summary()
gru_model.summary()

# Evaluate the models' performance (ignore the loss value)
_, acc_simpleRNN = SimpleRNN_model.evaluate(X_test, y_test, verbose=0)
_, acc_GRU = gru_model.evaluate(X_test, y_test, verbose=0)

# Print the results
print("SimpleRNN model's accuracy:\t{0}".format(acc_simpleRNN))
print("GRU model's accuracy:\t{0}".format(acc_GRU))
'''
Model: "simple_rnn_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
simple_rnn_1 (SimpleRNN)     (None, 128)               16640     
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 129       
=================================================================
Total params: 16,769
Trainable params: 16,769
Non-trainable params: 0
_________________________________________________________________
Model: "gru_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
gru_1 (GRU)                  (None, 128)               49920     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 129       
=================================================================
Total params: 50,049
Trainable params: 50,049
Non-trainable params: 0
_________________________________________________________________
SimpleRNN model's accuracy: 0.495
GRU model's accuracy: 0.58
'''

### Stacking RNN layers
Deep RNN models can have tens to hundreds of layers in order to achieve state-of-the-art results.

In this exercise, you will get a glimpse of how to create deep RNN models by stacking layers of LSTM cells one after the other.

To do this, you will set the return_sequences argument to True on the first two LSTM layers and to False on the last LSTM layer.

To create models with even more layers, you can keep adding them one after the other or create a function that uses the .add() method inside a loop to add many layers with few lines of code.
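
The role of return_sequences when stacking can be seen by looking at output shapes. The sketch below uses NumPy and a vanilla RNN layer instead of LSTM cells to keep it short (that substitution is a simplification, and `simple_rnn` and `init_layer` are hypothetical helpers): the first layer returns one hidden state per time step so that the second layer has a full sequence to read.

```python
import numpy as np

def simple_rnn(inputs, W_x, W_a, b, return_sequences=False):
    """Run a vanilla RNN layer over `inputs` of shape (timesteps, features)."""
    a = np.zeros(W_a.shape[0])          # initial hidden state
    states = []
    for x_t in inputs:
        a = np.tanh(W_x @ x_t + W_a @ a + b)
        states.append(a)
    # All hidden states (one per time step) or only the last one
    return np.stack(states) if return_sequences else a

def init_layer(rng, features, units):
    """Hypothetical helper: small random weights for one layer."""
    return (rng.normal(scale=0.1, size=(units, features)),
            rng.normal(scale=0.1, size=(units, units)),
            np.zeros(units))

rng = np.random.default_rng(0)
sentence = rng.normal(size=(7, 1))      # 7 time steps, 1 feature each

layer1 = init_layer(rng, features=1, units=5)
layer2 = init_layer(rng, features=5, units=3)

hidden_sequence = simple_rnn(sentence, *layer1, return_sequences=True)
last_state = simple_rnn(hidden_sequence, *layer2, return_sequences=False)
print(hidden_sequence.shape, last_state.shape)
```

The first layer produces an array with one row per time step, while the last layer produces a single vector that can be passed to a Dense output layer.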

In [None]:
# Import the LSTM layer (keras.layers is the shortcut for keras.layers.recurrent)
from keras.layers import LSTM

# Build model
model = Sequential()
model.add(LSTM(units=128, input_shape=(None, 1), return_sequences=True))
model.add(LSTM(units=128, return_sequences=True))
model.add(LSTM(units=128, return_sequences=False))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Load pre-trained weights
model.load_weights('lstm_stack_model_weights.h5')

print("Loss: %0.04f\nAccuracy: %0.04f" % tuple(model.evaluate(X_test, y_test, verbose=0)))
# Loss: 0.6789
# Accuracy: 0.5590

## Embedding layer - for transfer learning

1. The Embedding layer
 - You will now learn about vectorization of a language model using the embedding layer in keras, and how it can be used for transfer learning.

2. Why embeddings
 - Advantages
     - Reduce the dimension
         - The first reason to use embeddings is that the one-hot encoding of the tokens in a scenario with a very big vocabulary (maybe 100 thousand words) demands a lot of memory. An embedding layer with dimension, say, 300 is more viable.
         - one-hot encoding: array of shape (N, 100000)
         - embedding: array of shape (N, 300)
     - Dense representation
         - Also, embeddings are a dense representation of the words, and they capture surprisingly meaningful relations between the tokens, like the famous king - man + woman = queen.
     - Transfer learning

 - Disadvantages
     - it demands training lots of parameters to learn this representation
     - can make training slower

3. How to use in keras
 - To use the embedding layer in keras, we first import it from the keras dot layers module. The embedding layer should be the first layer of the model. The relevant parameters include:
     - input_dim, which is the size of the vocabulary
     - output_dim, which is the dimension of the embedding space
     - trainable, which defines whether this layer should have its weights updated during the training phase
     - embeddings_initializer, which can be used to perform transfer learning by using pre-trained weights for the words in your vocabulary. Often, when using transfer learning we set trainable to False, but it is not mandatory
     - input_length, which determines the size of the sequences (it assumes that you padded the input sentences beforehand)

4. Transfer learning
 - There are many pre-trained vectors that were trained on big datasets such as Wikipedia, news articles, etc. Training a model on those big datasets demands a lot of computing power, but loading the weights does not! Recent advances in NLP and language model research are based on open sourcing weights pre-trained on big datasets using popular models such as GloVe, word2vec and BERT, among others. In keras, we need the Constant initializer to define the pre-trained matrix of the Embedding layer.

5. Using GloVe pre-trained vectors
 - GloVe files contain one row per word, with values separated by spaces: the first column is the word and the others are the weight values for each dimension of the embedding space. To read the values, we loop over the rows of the file, split each line by spaces, take the first item of the list as the word and the rest of the list as the weights. We use a dictionary to store, for each word, an np array with the values. We also cast the values to the float32 type because it is the type used to create the vectors.

6. Using GloVe on a specific task
 - To use the GloVe vectors in a specific task, we can simply select the words present in the vocabulary list, ignoring the other words to save memory. As inputs, we need the task-specific vocabulary dictionary with words as keys and indexes as values, the glove dict created in the previous slide, and the dimension of the embedding space. We define a matrix with shape equal to the number of words plus one by the embedding space dimension. We add one because the index zero is reserved for the padding token. We then iterate over the vocabulary words: if a word is found in the glove vectors, we update its row of the matrix with the values from glove.
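
Going back to the "reduce the dimension" argument in item 2, an embedding lookup gives the same result as multiplying a one-hot vector by the embedding matrix, without ever building the large sparse vectors. The NumPy sketch below uses tiny made-up sizes and random weights purely for illustration.

```python
import numpy as np

vocabulary_size, embedding_dim = 6, 3   # tiny, made-up sizes
rng = np.random.default_rng(1)
embedding_matrix = rng.normal(size=(vocabulary_size, embedding_dim))

token_ids = np.array([4, 0, 2])         # a "sentence" of three word indexes

# Dense route: pick one row of the matrix per token
looked_up = embedding_matrix[token_ids]

# Sparse route: build one-hot vectors and multiply by the same matrix
one_hot = np.eye(vocabulary_size)[token_ids]
multiplied = one_hot @ embedding_matrix

print(looked_up.shape, one_hot.shape)
print(np.allclose(looked_up, multiplied))
```

Both routes produce the same vectors, but the one-hot route needs an array as wide as the vocabulary for every token, while the lookup only stores the word index.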

### Embedding layer in keras

In [None]:
from keras.models import Sequential
from keras.layers import Embedding
model = Sequential()

# use as the first layer
model.add(Embedding(input_dim=100000, # size of the vocabulary
                    output_dim=300, # embedding space dimension
                    trainable=True, # update weights during training or not (often False with transfer learning)
                    embeddings_initializer='uniform', # Keras default; pass pre-trained weights here for transfer learning
                    input_length=120 # size of sequences
                   ))

### Transfer learning
Transfer learning for language models
- GloVe
- word2vec
- BERT

In [None]:
# in keras, import Constant to initialize the embedding layer with pre-trained vectors
from keras.initializers import Constant
model.add(Embedding(input_dim=vocabulary_size,
                    output_dim=embedding_dim,
                    embeddings_initializer=Constant(pre_trained_vectors)
                   ))

### Using GloVe pre-trained vectors
- https://nlp.stanford.edu/projects/glove/
- one row per word, with values separated by spaces
- 1st column is the word, the others are the weights

In [None]:
import numpy as np

# get the GloVe vectors
def get_glove_vectors(filename='glove.6B.300d.txt'):
    # get all word vectors from pre-trained model
    glove_vector_dict = {}
    with open(filename, encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = values[1:]
            glove_vector_dict[word] = np.asarray(coefs, dtype='float32')

    # return the dictionary built above (word -> vector)
    return glove_vector_dict

# using GloVe on a specific task
# filter GloVe vectors to specific task, word:index key:value
def filter_glove(vocabulary_dict, glove_dict, wordvec_dim=300):
    # create a matrix to store the vectors (row 0 is reserved for padding)
    embedding_matrix = np.zeros((len(vocabulary_dict)+1, wordvec_dim))
    for word, i in vocabulary_dict.items():
        embedding_vector = glove_dict.get(word)
        # update the row only if the word is found in glove_dict;
        # words not found keep their all-zeros row
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

    return embedding_matrix

### Example - Number of parameters comparison
You saw that the one-hot representation is not a good representation of words because it is very sparse. Using the Embedding layer creates a dense representation of the vectors, but also demands a lot of parameters to be learned.

In this exercise you will compare the number of parameters of two models using embeddings and one-hot encoding to see the difference.

The model model_onehot is already loaded in the environment, as well as the Sequential, Dense and GRU from keras. Finally, the parameters vocabulary_size=80000 and sentence_len=200 are also loaded.

In [None]:
# Import the embedding layer
from keras.layers import Embedding

# Create a model with embeddings
model = Sequential(name="emb_model")
model.add(Embedding(input_dim=vocabulary_size+1, output_dim=wordvec_dim, input_length=sentence_len, trainable=True))
model.add(GRU(128))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Print the summaries of the one-hot model
model_onehot.summary()

# Print the summaries of the model with embeddings
model.summary()

'''
Model: "model_onehot"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
gru_1 (GRU)                  (None, 128)               49920     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 129       
=================================================================
Total params: 50,049
Trainable params: 50,049
Non-trainable params: 0
_________________________________________________________________
Model: "emb_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 200, 300)          24000600  
_________________________________________________________________
gru_2 (GRU)                  (None, 128)               164736    
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 129       
=================================================================
Total params: 24,165,465
Trainable params: 24,165,465
Non-trainable params: 0
_________________________________________________________________
'''

You can see the immense difference in the number of parameters when using the embedding layer! Don't worry, in the next exercise you will learn how to use transfer learning to avoid having to train this layer.
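
The numbers in the summaries can be reproduced with a few lines of arithmetic. The helper functions below are hypothetical, and the formulas are the ones that match the summaries printed in these notes; newer Keras versions may report slightly larger GRU counts because of an extra set of bias terms. The embedding row counts (80,002 and 10,002) are inferred by dividing the printed totals by the embedding dimension, they are not stated in the exercise text.

```python
def simple_rnn_params(input_dim, units):
    # input weights + recurrent weights + biases
    return units * (input_dim + units) + units

def gru_params(input_dim, units):
    # three weight blocks (two gates and the candidate), each sized like a SimpleRNN
    return 3 * simple_rnn_params(input_dim, units)

def lstm_params(input_dim, units):
    # four weight blocks (three gates and the candidate)
    return 4 * simple_rnn_params(input_dim, units)

def dense_params(input_dim, units):
    # one weight per input for every unit, plus one bias per unit
    return input_dim * units + units

def embedding_params(rows, output_dim):
    # one vector of size output_dim per row of the lookup table
    return rows * output_dim

print(simple_rnn_params(1, 128))      # 16640
print(gru_params(1, 128))             # 49920
print(gru_params(300, 128))           # 164736
print(lstm_params(10, 128))           # 71168
print(dense_params(128, 1))           # 129
print(embedding_params(80002, 300))   # 24000600
```

The embedding layer dominates the total because its size grows with the vocabulary, whereas the recurrent layer only grows with the embedding dimension and the number of units.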

### Transfer learning
You saw that when training an embedding layer, you need to learn a lot of parameters.

In this exercise, you will see that when using transfer learning it is possible to use the pre-trained weights and not update them, meaning that all the parameters of the embedding layer will be fixed, and the model will only need to learn the parameters of the other layers.

The function load_glove is already loaded in the environment and retrieves the glove matrix as a numpy.ndarray. It uses the function covered in the lesson's slides to retrieve the glove vectors with 200 embedding dimensions for the vocabulary present in this exercise.

In [None]:
# Load the glove pre-trained vectors
glove_matrix = load_glove('glove_200d.zip')

# Create a model with embeddings
model = Sequential(name="emb_model")
model.add(Embedding(input_dim=vocabulary_size + 1, output_dim=wordvec_dim, 
                    embeddings_initializer=Constant(glove_matrix), 
                    input_length=sentence_len, trainable=False))
model.add(GRU(128))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Print the summaries of the model with embeddings
model.summary()
'''
Model: "emb_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 200, 200)          2000400   
_________________________________________________________________
gru_1 (GRU)                  (None, 128)               126336    
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 129       
=================================================================
Total params: 2,126,865
Trainable params: 126,465
Non-trainable params: 2,000,400
_________________________________________________________________
'''

The total number of parameters is very big, but the number of parameters that will be trained is much smaller. The pre-trained vectors already have values for the known words, but a new word not present in the pre-trained vectors gets a vector of zeros. This can lead to problems if the task at hand is very specific.

### Embeddings improve performance
Does the embedding layer improve the accuracy of the model? Let's check it out on the same IMDB data.

The model was already trained for 10 epochs, as was the previous model with the SimpleRNN cell. In order to compare the models, a test set (X_test, y_test) is available in the environment, as well as the old model simpleRNN_model. The old model's accuracy is loaded in the variable acc_simpleRNN.

All required modules and functions are loaded in the environment: Sequential() from keras.models, Embedding and Dense from keras.layers and SimpleRNN from keras.layers.recurrent.

In [None]:
# Create the model with embedding
model = Sequential(name="emb_model")
model.add(Embedding(input_dim=max_vocabulary,
                    output_dim=wordvec_dim, input_length=max_len))
model.add(SimpleRNN(units=128))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam', metrics=['accuracy'])

# Load pre-trained weights
model.load_weights('embedding_model_weights.h5')

# Evaluate the models' performance (ignore the loss value)
_, acc_embeddings = model.evaluate(X_test, y_test, verbose=0)

# Print the results
print("SimpleRNN model's accuracy:\t{0}\nEmbeddings model's accuracy:\t{1}".format(
    acc_simpleRNN, acc_embeddings))

# SimpleRNN model's accuracy: 0.495
# Embeddings model's accuracy: 0.733

# the embedding layer greatly improves the accuracy of the model

## Improving RNN model and overfitting - Sentiment classification revisited
Ways to improve the SimpleRNN model
- add the embedding layer
- increase the number of layers
- tune the parameters
- increase vocabulary size
- accept longer sentences with more memory cells

Avoid overfitting - RNN models can overfit
- test different batch sizes
- add dropout layers
- add dropout and recurrent_dropout parameters on RNN layers


In [None]:
# Avoid overfitting

# add dropout layer
# removes 20% of input to add noise
model.add(Dropout(rate=0.2))

# add dropout and recurrent_dropout parameters
# removes 10% of input and memory cells respectively
model.add(LSTM(128, dropout=0.1, recurrent_dropout=0.1))
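
What "removing" a fraction of the input means can be sketched in NumPy. This is a minimal illustration of dropout as applied during training, not the Keras source code: the `dropout` function is hypothetical, entries are zeroed at random with probability `rate`, and the remaining entries are scaled up so that the expected value stays the same.

```python
import numpy as np

def dropout(x, rate, rng):
    """Zero out a fraction `rate` of the entries and rescale the survivors."""
    keep_mask = rng.random(x.shape) >= rate
    # Dividing by (1 - rate) keeps the expected value of each entry unchanged
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones(100_000)
dropped = dropout(activations, rate=0.2, rng=rng)

zero_fraction = np.mean(dropped == 0.0)
print(zero_fraction, dropped.mean())
```

About 20% of the entries become zero and the rest become 1.25, so the mean stays close to one. The random noise makes it harder for the model to rely on any single input, which is how dropout helps against overfitting.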

### Convolution Layer and MaxPooling layer
- convolution layers perform feature selection on the embedding vectors
- achieve state-of-the-art results in many NLP problems

In [None]:
model.add(Embedding(vocabulary_size, wordvec_dim, ...))
model.add(Conv1D(filters=32, kernel_size=3, padding='same'))
model.add(MaxPooling1D(pool_size=2))
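
To see what these two layers do to the shape of the data, here is a NumPy sketch of a 1D convolution with 'same' padding followed by max pooling. It is a simplified illustration under stated assumptions: `conv1d_same` and `max_pool1d` are hypothetical helpers, there is no bias or activation, and the input and kernels are filled with ones so the values are easy to check by hand.

```python
import numpy as np

def conv1d_same(x, kernels):
    """1D convolution with zero padding so the output keeps the input length.

    x: (timesteps, channels); kernels: (kernel_size, channels, filters).
    """
    kernel_size = kernels.shape[0]
    pad_left = (kernel_size - 1) // 2
    pad_right = kernel_size - 1 - pad_left
    x_padded = np.pad(x, ((pad_left, pad_right), (0, 0)))
    # Slide the window over time; each window yields one value per filter
    return np.stack([
        np.tensordot(x_padded[t:t + kernel_size], kernels, axes=([0, 1], [0, 1]))
        for t in range(x.shape[0])
    ])

def max_pool1d(x, pool_size):
    """Keep the maximum of every block of `pool_size` consecutive time steps."""
    steps = x.shape[0] // pool_size
    return x[:steps * pool_size].reshape(steps, pool_size, -1).max(axis=1)

embedded = np.ones((10, 4))          # 10 words, embedding dimension 4
kernels = np.ones((3, 4, 32))        # kernel_size=3, 32 filters
convolved = conv1d_same(embedded, kernels)
pooled = max_pool1d(convolved, pool_size=2)
print(convolved.shape, pooled.shape)
```

The convolution keeps the 10 time steps and replaces the 4 embedding values with 32 filter outputs, and the pooling halves the number of time steps, so the recurrent layers that follow have a shorter sequence to process.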

### Example - sentiment classification

In [None]:
model = Sequential()
# add embedding layer
model.add(Embedding(vocabulary_size, wordvec_dim, trainable=True,
                    embeddings_initializer=Constant(glove_matrix),
                    input_length=max_text_len, name="Embedding"))
model.add(Dense(wordvec_dim, activation='relu', name="Dense1"))
model.add(Dropout(rate=0.25))
model.add(LSTM(64, return_sequences=True, dropout=0.15, name="LSTM"))
model.add(GRU(64, return_sequences=False, dropout=0.15, name="GRU"))
model.add(Dense(64, name="Dense2"))
model.add(Dropout(rate=0.25))
model.add(Dense(32, name="Dense3"))
model.add(Dense(1, activation='sigmoid', name="Output"))

### Better sentiment classification
In this exercise, you go back to the sentiment classification problem seen in Chapter 1.

You are going to add more complexity to the model and improve its accuracy. You will use an Embedding layer to train word vectors on the training set and two LSTM layers to keep track of longer texts. Also, you will add an extra Dense layer before the output.

This is no longer a simple model, and training can take some time. For this reason, a pre-trained model is available: load its weights with the .load_weights() method of the keras.models.Sequential class. The model was trained for 10 epochs, and its weights are available in the file model_weights.h5.

The following classes are already loaded in the environment: Sequential, Embedding, LSTM, Dropout, Dense.

In [None]:
# Build and compile the model
model = Sequential()
model.add(Embedding(vocabulary_size, wordvec_dim,
                    trainable=True, input_length=max_text_len))
model.add(LSTM(64, return_sequences=True, dropout=0.2, recurrent_dropout=0.15))
model.add(LSTM(64, return_sequences=False, dropout=0.2, recurrent_dropout=0.15))
model.add(Dense(16))
model.add(Dropout(rate=0.25))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam', metrics=['accuracy'])

# Load pre-trained weights
model.load_weights('model_weights.h5')

# Print the obtained loss and accuracy
print("Loss: {0}\nAccuracy: {1}".format(
    *model.evaluate(X_test, y_test, verbose=0)))
'''
Loss: 1.0716214485168456
Accuracy: 0.822

You just increased the accuracy of your sentiment classification
task from a poor 50% to more than 80%.
'''

### Using the CNN layer
In this exercise, you will use a pre-trained model that makes use of the Conv1D and MaxPooling1D layers from the keras.layers.convolutional module, and achieves even better accuracy on the classification task.

This architecture has achieved good results in language tasks such as classification. It is included here as an extra exercise so you can see it in action and build some intuition.

Because these layers are outside the scope of the course, you will focus on how to use them together with the RNN layers you have already learned.

Please follow the instructions to see the results.

In [None]:
# Print the model summary
model_cnn.summary()

# Load pre-trained weights
model_cnn.load_weights('model_weights.h5')

# Evaluate the model to get the loss and accuracy values
loss, acc = model_cnn.evaluate(x_test, y_test, verbose=0)

# Print the loss and accuracy obtained
print("Loss: {0}\nAccuracy: {1}".format(loss, acc))

'''
Model: "cnn_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
Embedding (Embedding)        (None, 800, 100)          2000100   
_________________________________________________________________
dropout_1 (Dropout)          (None, 800, 100)          0         
_________________________________________________________________
Conv (Conv1D)                (None, 797, 16)           6416      
_________________________________________________________________
MaxPool (MaxPooling1D)       (None, 398, 16)           0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 398, 16)           0         
_________________________________________________________________
LSTM (LSTM)                  (None, 64)                20736     
_________________________________________________________________
dropout_3 (Dropout)          (None, 64)                0         
_________________________________________________________________
Dense2 (Dense)               (None, 16)                1040      
_________________________________________________________________
Output (Dense)               (None, 1)                 17        
=================================================================
Total params: 2,028,309
Trainable params: 2,028,309
Non-trainable params: 0
_________________________________________________________________
Loss: 0.4343099966049194
Accuracy: 0.836

'''

You achieved a high accuracy on the sentiment classification task! Note that the model achieved more than 98% accuracy on the training data. Because the accuracy on the test data (0.836) is well below that, you can guess that the model overfit to some degree. One possible cause is that the dataset was not big enough to train the model, so some patterns present in the test data weren't present in the training set. Finally, the model can be extended with additional layers to achieve even better results, but that will also demand more data and computing power.
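
The parameter counts in the model summary above can be checked by hand, which is a useful way to understand what each layer learns. The sketch below uses the standard formulas for each layer type. The helper names are hypothetical, and two inputs are inferred from the summary rather than stated in the course: a vocabulary size of 20001 (2,000,100 parameters divided by 100 dimensions) and a Conv1D kernel size of 4 (the sequence shrinks from 800 to 797).

```python
def embedding_params(vocab_size, dim):
    # One vector of length `dim` per token in the vocabulary
    return vocab_size * dim


def conv1d_params(kernel_size, in_channels, filters):
    # One kernel of shape (kernel_size, in_channels) plus one bias per filter
    return kernel_size * in_channels * filters + filters


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units


counts = {
    "Embedding": embedding_params(20001, 100),  # vocabulary size inferred
    "Conv": conv1d_params(4, 100, 16),          # kernel size inferred
    "LSTM": lstm_params(16, 64),
    "Dense2": dense_params(64, 16),
    "Output": dense_params(16, 1),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name:<10}{n:>10,}")
print(f"{'Total':<10}{total:>10,}")  # 2,028,309, matching the summary
```

Almost all of the parameters sit in the Embedding layer, which is one reason a small dataset can lead to overfitting in this kind of model.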

# Data pre-processing