In this hands-on lab, we will work through a few activities related to sequence models, in particular recurrent neural networks (RNNs) applied to sentiment analysis and time series forecasting.

Before starting, set up the following:

1) Deploy one AWS EC2 instance (p2.8xlarge type) to use as a sandbox (it can be terminated after the lab).

2) After logging in to the instance, run 'source activate tensorflow_p36'.

3) Create a working directory with 'mkdir -p /models/ai-conference' and enter it with 'cd /models/ai-conference'.

4) Clone the GitHub repository containing the labs with 'git clone github link'.

This notebook includes the following activities:

- building a first sample RNN (LSTM) for NLP
- training the neural network on the IMDB reviews dataset and evaluating its performance
- reporting the performance metrics for that model, including precision, recall, F1 score and support
- performing transfer learning to speed up the model creation process
- building a second sample RNN (LSTM) network for time series forecasting


## Part I - Sequence Models basics

In [None]:
# install the required Python modules before starting

!conda install -y seaborn Pillow scikit-learn

In [None]:
# importing required modules

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import Image
from operator import itemgetter    
from keras import models, regularizers, layers, optimizers, losses, metrics
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import np_utils, to_categorical
from keras.datasets import imdb
from keras.utils.training_utils import multi_gpu_model
from sklearn.metrics import confusion_matrix, classification_report

### "Recurrent means the output at the current time step becomes the input to the next time step. At each element of the sequence, the model considers not just the current input, but what it remembers about the preceding elements."

![Recurrent Neural Network](https://cdn-images-1.medium.com/max/1600/1*KljWrINqItHR6ng05ASR8w.png)
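
The quoted idea can be illustrated with a minimal sketch in plain Python. The function `rnn_forward` and its weights `w_x`, `w_h` and `b` are hypothetical names chosen for illustration and are not part of Keras; a real recurrent layer uses weight matrices rather than scalars.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a scalar recurrent cell over a sequence and return every hidden state."""
    h = 0.0  # initial hidden state: nothing remembered yet
    states = []
    for x in inputs:
        # the new state mixes the current input with the previous state
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# the same final input (1.0) yields different states depending on what came before
after_zeros = rnn_forward([0.0, 0.0, 1.0])
after_ones = rnn_forward([1.0, 1.0, 1.0])
print(after_zeros[-1], after_ones[-1])
```

Setting `w_h=0` removes the memory: the state then depends only on the current input.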

In [None]:
# For reproducibility

np.random.seed(1000)

# model configuration -- number of GPUs and training option (True or False)

n_gpus = 8 # knob to make the model parallel or not
train_model = False # knob to decide if the model will be trained or imported

In [None]:
# loading IMDB dataset

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

In [None]:
# taking a look at the dataset info

print("_"*50)
print("\ntrain_data ", train_data.shape)
print("train_labels ", train_labels.shape)
print("_"*50)
print("\ntest_data ", test_data.shape)
print("test_labels ", test_labels.shape)
print("_"*50)
print("\nMaximum value of a word index ")
print(max([max(sequence) for sequence in train_data]))
print("\nMaximum review length (in words) in the training set ")
print(max([len(sequence) for sequence in train_data]))
print("_"*50)

# checking a sample from the dataset

word_index = imdb.get_word_index()
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in train_data[123]])
print('decoded text:\n\n', decoded_review)
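
The `i - 3` in the decoding step reflects the defaults of `imdb.load_data`, which reserve index 0 for padding, 1 for the start of a sequence and 2 for out-of-vocabulary words, so real words are shifted up by 3. A minimal sketch with a made-up vocabulary (the `toy_word_index` dictionary and the `decode` helper are hypothetical, for illustration only):

```python
# hypothetical vocabulary: word -> frequency rank, starting at 1 like imdb.get_word_index()
toy_word_index = {"the": 1, "movie": 2, "great": 3}
reverse_toy_index = {rank: word for word, rank in toy_word_index.items()}

def decode(encoded, index_from=3):
    """Map encoded integers back to words; reserved indices decode to '?'."""
    return " ".join(reverse_toy_index.get(i - index_from, "?") for i in encoded)

# 1 marks the start of the review and 2 an out-of-vocabulary word
encoded_review = [1, 4, 5, 2, 6]
print(decode(encoded_review))
```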

![Vectorization](https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/assets/atap_0402.png)
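
As a minimal sketch of the encoding pictured above, the following plain-Python version turns each list of word indices into a fixed-length vector of zeros and ones. The helper `multi_hot` and the toy inputs are hypothetical; the notebook itself uses a NumPy implementation with `dimension=10000`.

```python
def multi_hot(sequences, dimension):
    """Return one row per sequence with a 1.0 at every index that occurs in it."""
    rows = []
    for sequence in sequences:
        row = [0.0] * dimension
        for index in sequence:
            row[index] = 1.0  # repeated words still produce a single 1.0
        rows.append(row)
    return rows

toy_reviews = [[0, 2], [1, 1, 3]]
encoded = multi_hot(toy_reviews, dimension=4)
print(encoded)
```

Word order and word counts are discarded by this representation; only the presence of each word is kept.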

In [None]:
# function to vectorize the dataset information

def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results
    
# vectorizing the datasets

X_train = vectorize_sequences(train_data)
X_test = vectorize_sequences(test_data)

print("X_train ", X_train.shape)
print("X_test ", X_test.shape)

# vectorizing the labels

y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')
print("y_train ", y_train.shape)
print("y_test ", y_test.shape)

In [None]:
# creating a validation set

X_val = X_train[:10000]
X_train = X_train[10000:]
y_val = y_train[:10000]
y_train = y_train[10000:]

print("X_val ", X_val.shape)
print("X_train ", X_train.shape)
print("y_val ", y_val.shape)
print("y_train ", y_train.shape)

![RNN simple cell versus LSTM cell](https://www.researchgate.net/profile/Aliaa_Rassem/publication/317954962/figure/download/fig2/AS:667792667860996@1536225587611/RNN-simple-cell-versus-LSTM-cell-4.png)
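
To make the difference between the two cells in the figure concrete, here is a minimal sketch of one LSTM step using scalar states in plain Python. The function `lstm_step` and the parameter dictionary `p` are hypothetical; a Keras `LSTM` layer uses weight matrices and learns them during training.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step with scalar input, hidden state and cell state."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g   # keep part of the old memory, add part of the new
    h = o * math.tanh(c)     # expose a gated view of the memory
    return h, c

# illustrative parameters: every weight set to 0.5 and every bias to 0.0
params = {name: 0.5 for name in ("wf", "uf", "wi", "ui", "wo", "uo", "wg", "ug")}
params.update({"bf": 0.0, "bi": 0.0, "bo": 0.0, "bg": 0.0})

h, c = 0.0, 0.0
for x in [1.0, 0.5, -1.0]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

The simple RNN cell has only the hidden state; the LSTM adds the separate cell state `c`, and the gates decide how much of it is kept, updated and exposed at each step.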

In [None]:
# creating the model (as written, a stack of Dense layers over the multi-hot vectors, with no recurrent layer)

if train_model is True:
    model = models.Sequential()
    model.add(layers.Dense(16, kernel_regularizer=regularizers.l1(0.001), activation='relu', input_shape=(10000,)))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(16, kernel_regularizer=regularizers.l1(0.001),activation='relu'))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(1, activation='sigmoid'))

    # making the model aware of multiple GPUs
    
    if n_gpus > 1:
        final_model = multi_gpu_model(model, gpus=n_gpus)
    else:
        final_model = model

    # compile the model

    final_model.compile(optimizer='rmsprop', 
                  loss='binary_crossentropy', 
                  metrics=['accuracy'])

    # summarize the model
    final_model.summary()
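
`multi_gpu_model` implements data parallelism: each batch is divided into sub-batches, a replica of the model processes each sub-batch, and the results are concatenated. A minimal sketch of that splitting logic in plain Python, with no GPUs involved (the helpers `split_batch` and `parallel_apply` are hypothetical and only mimic the idea, not the Keras implementation):

```python
def split_batch(batch, n_replicas):
    """Split a batch into n_replicas contiguous sub-batches of near-equal size."""
    size, remainder = divmod(len(batch), n_replicas)
    sub_batches, start = [], 0
    for r in range(n_replicas):
        stop = start + size + (1 if r < remainder else 0)
        sub_batches.append(batch[start:stop])
        start = stop
    return sub_batches

def parallel_apply(fn, batch, n_replicas):
    """Apply fn to each sub-batch (sequentially here) and concatenate the outputs."""
    outputs = []
    for sub_batch in split_batch(batch, n_replicas):
        outputs.extend(fn(sub_batch))
    return outputs

double = lambda xs: [2 * x for x in xs]
batch = list(range(10))
print(split_batch(batch, 4))
print(parallel_apply(double, batch, 4))
```

Under this scheme, a batch of 512 samples split across 8 replicas gives each replica 64 samples per step.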

In [None]:
# training the model

if train_model is True:
    n_epochs = 20
    batch_size = 512

    history = final_model.fit(X_train, 
                              y_train, 
                              epochs=n_epochs, 
                              batch_size=batch_size,
                              validation_data=(X_val, y_val))

In [None]:
# save the model details

if train_model is True:
    
    # save the model
    model.save('rnn_model.h5')

In [None]:
# summarize history for accuracy

if train_model is True:
    plt.plot(history.history['acc'])
    plt.plot(history.history['val_acc'])
    plt.title('model accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.show()

In [None]:
# summarize history for loss

if train_model is True:
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.show()

In [None]:
# evaluating the model

if train_model is True:
    results = final_model.evaluate(X_test, y_test)
    print("_"*50)
    print("Test Loss and Accuracy")
    print("results ", results)
    history_dict = history.history

In [None]:
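# loading the saved model (if it was not trained above) and predicting on the test set

if train_model is not True:

    from keras.models import load_model
    final_model = load_model('rnn_model.h5')
    # the output layer is a single sigmoid unit, so the matching loss is binary cross-entropy
    final_model.compile(optimizer='adam',
                        loss='binary_crossentropy', 
                        metrics=['accuracy'])

predictions = final_model.predict(X_test)
# threshold the probabilities at 0.5 and flatten to a 1-D array of 0/1 labels
predictions = (predictions > 0.5).astype('int32').ravel()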

cm = confusion_matrix(y_test, predictions)

plt.imshow(cm, cmap=plt.cm.Blues)
classNames = ['Negative','Positive']
plt.title('IMDB reviews sentiment analysis')
plt.xlabel("Predicted labels")
plt.ylabel("True labels")
tick_marks = np.arange(len(classNames))
plt.xticks(tick_marks, classNames, rotation=45)
plt.yticks(tick_marks, classNames)
s = [['TN','FP'], ['FN', 'TP']]
for i in range(2):
    for j in range(2):
        plt.text(j,i, str(s[i][j])+" = "+str(cm[i][j]))
plt.colorbar()
plt.show()

In [None]:
# classification_report expects the true labels first, then the predictions
print(classification_report(y_test, predictions))
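
For reference, the quantities printed by `classification_report` can be derived from the four confusion-matrix counts. A minimal sketch for the positive class, using made-up counts rather than results from this model (the function `positive_class_metrics` is hypothetical):

```python
def positive_class_metrics(tn, fp, fn, tp):
    """Compute precision, recall, F1 and support for the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    support = tp + fn  # number of samples whose true label is positive
    return precision, recall, f1, support

# illustrative counts only
precision, recall, f1, support = positive_class_metrics(tn=80, fp=20, fn=10, tp=90)
print(precision, recall, f1, support)
```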

## Part II - Time Series Forecasting

To avoid memory and/or CPU usage issues, restart the Jupyter Notebook kernel before continuing.

To do so:

- go to the Jupyter Notebook menu bar at the top of the page
- click on 'Kernel'
- click on 'Restart'
- wait for the kernel to restart

Once the restart has finished, continue with the next steps.

In [None]:
# importing modules

from datetime import datetime
from math import sqrt
import numpy as np
from numpy import concatenate
from matplotlib import pyplot as plt
import pandas as pd
from pandas import read_csv, DataFrame, concat
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.metrics import mean_squared_error
from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout
from keras.utils.training_utils import multi_gpu_model

In [None]:
# model configuration -- number of GPUs

n_gpus = 1 # knob to make the model parallel or not

In [None]:
# load and process data

def parse(x):
    return datetime.strptime(x, '%Y %m %d %H')

dataset = read_csv('pollution.csv',  
                   parse_dates = [['year', 'month', 'day', 'hour']], 
                   index_col=0, 
                   date_parser=parse)

dataset.drop('No', axis=1, inplace=True)
dataset.columns = ['pollution', 'dew', 'temp', 'press', 
                   'wnd_dir', 'wnd_spd', 'snow', 'rain']
dataset.index.name = 'date'
dataset['pollution'].fillna(0, inplace=True)
dataset = dataset[24:]

print('-'*100)
print(dataset.head(5))
print('-'*100)
dataset.to_csv('pollution_parsed.csv')

In [None]:
# visualizing the dataset

dataset = read_csv('pollution_parsed.csv', header=0, index_col=0)
values = dataset.values
groups = [0, 1, 2, 3, 5, 6, 7]
i = 1
plt.figure()
for group in groups:
    plt.subplot(len(groups), 1, i)
    plt.plot(values[:, group],'k')
    plt.title(dataset.columns[group], y=0.5, loc='right')
    i += 1
plt.show()

In [None]:
# reframing a time series as a supervised learning dataset (lagged inputs, current outputs)

def organize_series(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = pd.DataFrame(data)
    cols, names = list(), list()
    
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]

    agg = pd.concat(cols, axis=1)
    agg.columns = names

    if dropnan:
        agg.dropna(inplace=True)

    return agg

# processing the dataset

dataset = read_csv('pollution_parsed.csv', header=0, index_col=0)
values = dataset.values
encoder = LabelEncoder()
values[:,4] = encoder.fit_transform(values[:,4])
values = values.astype('float32')
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)
reframed = organize_series(scaled, 1, 1)
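
The reframing performed by `organize_series` can be sketched on a toy series: each row pairs the values at time t-1 with the values at time t, and rows without a complete history are dropped. The helper `make_supervised` below is a hypothetical plain-Python stand-in covering only the `n_in=1, n_out=1` case.

```python
def make_supervised(rows):
    """Pair each time step with its predecessor: [values(t-1)..., values(t)...]."""
    framed = []
    for t in range(1, len(rows)):  # t = 0 has no predecessor, so it is dropped
        framed.append(list(rows[t - 1]) + list(rows[t]))
    return framed

# toy data: two variables observed at four time steps
toy = [(10, 0.1), (20, 0.2), (30, 0.3), (40, 0.4)]
for row in make_supervised(toy):
    print(row)
```

With the eight columns of the pollution dataset, the same layout gives 16 columns per row: var1(t-1) to var8(t-1) followed by var1(t) to var8(t).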

In [None]:
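# split data into training and testing

values = reframed.values
n_train_hours = 365 * 24
# as written, the first 365 * 24 rows (one year of hourly data) are held out for testing
# and the remaining rows are used for training
test = values[:n_train_hours, :]
train = values[n_train_hours:, :]

# split into inputs and outputs: the last column is the target, all other columns are inputs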

X_train, y_train = train[:, :-1], train[:, -1]
X_test, y_test = test[:, :-1], test[:, -1]
X_train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))
X_test = X_test.reshape((X_test.shape[0], 1, X_test.shape[1]))
print(" Training data shape X, y => ",X_train.shape, y_train.shape," Testing data shape X, y => ", X_test.shape, y_test.shape)

In [None]:
# defining the RNN/LSTM model

model = Sequential()
model.add(LSTM(50, input_shape=(X_train.shape[1], X_train.shape[2])))
model.add(Dropout(0.3))
model.add(Dense(1,kernel_initializer='normal', activation='sigmoid'))

# make the model aware of multiple GPUs

if n_gpus > 1:
    final_model = multi_gpu_model(model, gpus=n_gpus)
else:
    final_model = model

# compile the model
final_model.compile(loss='mae', optimizer='adam')

# summarize the model
final_model.summary()

In [None]:
# training the RNN/LSTM model

epochs = 5
batch_size = 72

history = final_model.fit(X_train, 
                    y_train, 
                    epochs=epochs, 
                    batch_size=batch_size, 
                    validation_data=(X_test, y_test), 
                    shuffle=False)

In [None]:
# visualizing the loss during training

plt.plot(history.history['loss'], 'b', label='Training')
plt.plot(history.history['val_loss'],  'r',label='Validation')
plt.title("Train and Test Loss for the LSTM")
plt.legend()
plt.show()
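
The model is trained on values scaled to the range 0 to 1, so its predictions are on that scale too. A minimal sketch of mapping predictions back to the original units and computing the root mean squared error, in plain Python with made-up numbers (the helpers `min_max_scale`, `min_max_invert` and `rmse` are hypothetical stand-ins, not the scikit-learn API):

```python
from math import sqrt

def min_max_scale(values, lo, hi):
    """Scale values linearly so that lo maps to 0.0 and hi maps to 1.0."""
    return [(v - lo) / (hi - lo) for v in values]

def min_max_invert(scaled, lo, hi):
    """Undo min_max_scale."""
    return [s * (hi - lo) + lo for s in scaled]

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual = [20.0, 60.0, 100.0]               # illustrative values in original units
lo, hi = min(actual), max(actual)
scaled_actual = min_max_scale(actual, lo, hi)
scaled_pred = [0.1, 0.5, 0.9]              # pretend model outputs on the 0-1 scale
pred = min_max_invert(scaled_pred, lo, hi)
print(pred, rmse(actual, pred))
```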

## Cleaning things up

Little needs to be done to clean up the environment used in this lab.

Because a new EC2 instance was created for this purpose, simply terminate the instance.