<a href="https://colab.research.google.com/github/mallibus/Unige-DL2019/blob/master/UNIGE_DL_2019_2_0_DeepSequenceModeling_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 2. Deep Sequence Modeling

In [0]:
from __future__ import print_function
import tensorflow as tf
import os, json, re 
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import pandas as pd
%matplotlib inline


tf.enable_eager_execution()

print("TensorFlow version: {}".format(tf.__version__))
print("Eager execution: {}".format(tf.executing_eagerly()))

## 2.1 Deal with sequential data
In this lab we look at Deep Learning models that can process sequential data (text, time series, ...).<br>
These models do not take raw text as input: they only work with numeric tensors. **Vectorizing** text is the process of transforming text into numeric tensors.<br><br><br>

<img src="http://mlclass.epizy.com/lab2_images_notebook/vectorizing.png" width="400px"><br><br><br>
The units into which you can break down text (words, characters) are called tokens; a tokenization scheme splits the text into tokens and associates a numeric vector with each of them.<br>
These vectors, packed into sequence tensors, are fed into a Deep Neural Network.<br>
There are multiple ways to associate a vector with a token: we will see One-Hot Encoding and Token Embedding.

### 1) One-Hot Encoding
One-Hot Encoding consists of associating a unique integer index with every word and then turning this integer index $i$ into a binary vector of size $N$ (the size of the vocabulary); the vector is all zeros except for the $i$-th entry, which is 1.
<img src="http://mlclass.epizy.com/lab2_images_notebook/one-hot.png" width="400px">
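
The encoding described above can be sketched in plain Python, without any library. This is only an illustration: the two toy documents and the rule of assigning indexes by order of first appearance are assumptions made for this example, not what the Keras Tokenizer does internally.

```python
# Toy corpus: every distinct word gets a unique integer index.
# Indexes are assigned by order of first appearance (an assumption for this sketch).
docs = ["good work", "nice work"]

word_index = {}
for doc in docs:
    for word in doc.split():
        if word not in word_index:
            word_index[word] = len(word_index)


def one_hot(word, word_index):
    """Return a binary vector of size N (vocabulary size) with a 1 at the word's index."""
    vector = [0] * len(word_index)
    vector[word_index[word]] = 1
    return vector


print(word_index)
print(one_hot("work", word_index))
```

With a vocabulary of three words each vector has three entries; with a real vocabulary of tens of thousands of words, each vector would have tens of thousands of entries and a single 1.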


#### Try to perform One-Hot Encoding using Tokenizer
Keras provides the Tokenizer class for preparing text documents for DL.<br>
The Tokenizer must be constructed and then fit on either raw text documents or integer encoded text documents.

In [0]:
# define 4 documents
docs = ['Well done!','Good work','Great effort','nice work']

# create the tokenizer
tokenizer = Tokenizer()

# fit the tokenizer on the documents
# --fill here-- # use fit_on_texts() function


encoded_docs = # --fill here-- # use the function texts_to_matrix()
print(encoded_docs)

Two problems with this kind of encoding are the sparsity of the vectors and their high dimensionality: every token vector has as many entries as there are words in the vocabulary.

### 2) Word embedding
The vector obtained from word embedding is dense and has lower dimensionality than the One-Hot Encoding vector; the dimensionality of the embedding space is a hyperparameter.<br>
<img src="http://mlclass.epizy.com/lab2_images_notebook/one-hot-we.png" width="400px"><br>
There are two ways to obtain word embeddings:<br>
* May be learned jointly with the network
* May use pre-trained word vectors (Word2Vec, GloVe,..)


Word embeddings map human language into a geometric space; in a reasonable embedding space synonyms are embedded into similar word vectors, and the geometric distance between any two word vectors reflects the semantic distance between the associated words (words meaning different things are embedded at points far away from each other, whereas related words are closer).<br>
How good a word-embedding space is depends on the specific task.<br>
It is reasonable to learn a new embedding space with every new task: with backpropagation and Keras, this reduces to learning the weights of the Embedding layer.
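
To make the idea of geometric distance concrete, here is a small sketch that compares word vectors with cosine similarity. The three 3-dimensional vectors are invented for illustration only; real embeddings are learned and usually have tens or hundreds of dimensions. Cosine similarity is one common choice of measure, not the only one.

```python
import math

# Hypothetical embeddings, made up for this example
toy_embeddings = {
    "good":  [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "awful": [-0.7, 0.1, 0.3],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


sim_related = cosine_similarity(toy_embeddings["good"], toy_embeddings["great"])
sim_unrelated = cosine_similarity(toy_embeddings["good"], toy_embeddings["awful"])
print("good vs great:", sim_related)
print("good vs awful:", sim_unrelated)
```

In this toy space the related pair scores higher than the unrelated pair, which is the property a useful embedding space is expected to have.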

### Learning Word Embeddings with the embedding layer

#### Load imdb dataset
This dataset contains movie reviews from IMDB, labeled by sentiment (positive/negative); the reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers).<br>
https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification
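
Reviews have different lengths, while the network expects inputs of a fixed length, so the sequences are padded or truncated to `maxlen`. The sketch below imitates in plain Python what `pad_sequences` does with its default settings as far as I understand them (zeros added at the front, long sequences cut from the front). The helper name `pad_to_length` and the sample sequences are hypothetical.

```python
def pad_to_length(sequences, maxlen, value=0):
    """Left-pad short sequences with `value` and keep only the last `maxlen` items of long ones.

    Assumes maxlen is a positive integer.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                      # truncate from the front
        padded.append([value] * (maxlen - len(seq)) + seq)  # pad at the front
    return padded


reviews = [[1, 14, 22], [1, 5, 9, 43, 7, 8]]  # made-up word indexes
print(pad_to_length(reviews, maxlen=4))
```

After padding, every review has exactly `maxlen` entries and the batch can be stored as a 2D integer tensor of shape (samples, maxlen).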

In [0]:
max_features = 10000
maxlen = 20

imdb = tf.keras.datasets.imdb
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)

x_train = # --fill here-- # use preprocessing.sequence.pad_sequences
x_test = # --fill here-- # use preprocessing.sequence.pad_sequences

#### Show the size of vocabulary and the most frequent words

In [0]:
word_to_index = imdb.get_word_index()

vocab_size = # --fill here-- # 
print('Vocab size : ', vocab_size)


words_freq_list = []
for (k,v) in imdb.get_word_index().items():
    # --fill here-- #

sorted_list = sorted(words_freq_list, key=lambda x: x[1])

print("50 most common words: \n")
print(sorted_list[0:50])

In [0]:
word_to_index['otherwise']

#### Create the model

In [0]:
model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(10000, 8, input_length=maxlen))   
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(32, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation=tf.nn.sigmoid))

#### Compile the model

In [0]:
# --fill here-- # you can use RMSprop ('rmsprop') as optimizer, binary crossentropy as loss function and 'acc' as metric (plot_history below reads history.history['acc'])

#### Train the model

In [0]:
model.summary()

In [0]:
history = # --fill here-- # 

#### Visualize accuracy and loss

In [0]:
def plot_history(history):
    # Plot training & validation accuracy values
    plt.plot(history.history['acc'])
    plt.plot(history.history['val_acc'])
    plt.title('Model accuracy')
    plt.ylabel('Accuracy')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Val'], loc='upper left')
    plt.show()

    # Plot training & validation loss values
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('Model loss')
    plt.ylabel('Loss')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Val'], loc='upper left')
    plt.show()

In [0]:
plot_history(history)

#### Evaluate the model

In [0]:
test_loss, test_acc = # --fill here-- # 
print('Test accuracy: %.3f, Test loss: %.3f' % (test_acc,test_loss))

The Dense layer on top leads to a model that treats each word in the input sequence separately, without considering inter-word relationships and sentence structure (for example, this model would likely treat both “this movie is a bomb” and “this movie is the bomb” as being negative reviews). It’s much better to add recurrent layers on top of the embedded sequences to learn features that take into account each sequence as a whole.

### Using pre-trained Word Embeddings
If you have little training data available and cannot use your data alone to learn an appropriate task-specific embedding of your vocabulary, you can load embedding vectors from a precomputed embedding space that exhibits useful properties and captures generic aspects of language structure.

#### Parsing the GloVe word-embeddings file
You can find the GloVe word-embeddings file here http://nlp.stanford.edu/data/glove.6B.zip (zip archive) or here https://drive.google.com/drive/folders/1wvyeiRwYAdypLfrOfIaiwBMPPTzQwKp_ (the .txt file, already extracted).<br>
Let’s parse the unzipped file (a .txt file) to build an index that maps words (as strings) to their vector representation (as number vectors).
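
Each line of the GloVe text file holds a word followed by its coefficients, separated by spaces. The sketch below parses two made-up lines in that layout from an in-memory buffer, so it runs without downloading anything; the words and numbers are invented, and `toy_index` is a hypothetical name.

```python
import io

# Two invented lines in the GloVe text layout: word, then space-separated coefficients
sample_file = io.StringIO(
    "movie 0.1 -0.2 0.3\n"
    "film 0.15 -0.25 0.35\n"
)

toy_index = {}
for line in sample_file:
    values = line.split()
    word = values[0]                                   # first field is the word
    toy_index[word] = [float(v) for v in values[1:]]   # the rest is its vector

print(toy_index["movie"])
print("Found %s word vectors." % len(toy_index))
```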

**Get path to file saved at your Google Drive folder (same procedure as lab 1)**


Locate the GloVe folder in Google Drive:
* Click on the arrow at the left side of the screen, then "Files". In the directory tree, navigate to "gdrive", which contains your Drive folder as "My Drive"
* In My Drive, search for the folder containing this Lab 2 and find "glove.6B"
* Right-click on that folder and choose "Copy path"


In [0]:
# THE SAME PROCEDURE DONE FOR LAB1 TO LOAD IMAGES FROM A GOOGLE DRIVE FOLDER 
# HAS TO BE DONE HERE 
glove_dir = # --fill here-- # path where you save 'glove.6B.100d.txt'

embeddings_index = {}

# Parse the .txt file to build an index that maps words (as strings) to their vector representation (as number vectors).
# The file is opened with an explicit "utf8" encoding so that the result does not depend on the platform default.
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding="utf8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))


#### Load imdb dataset

In [0]:
max_features = 10000
maxlen = 20

imdb = tf.keras.datasets.imdb
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)

x_train = # --fill here-- # use preprocessing.sequence.pad_sequences
x_test = # --fill here-- # use preprocessing.sequence.pad_sequences

#### Preparing the GloVe word-embeddings matrix
Now we build an embedding matrix that can be loaded into an Embedding layer.<br>
Row $i$ of the matrix contains the embedding_dim-dimensional vector of the word with index $i$ in the reference word index (built during tokenization).
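
A toy version of this construction, using plain lists instead of NumPy arrays, may help to picture the layout. All names and values below are hypothetical; the point is only that row `i` holds the vector of the word with index `i`, and that words without a pretrained vector keep an all-zero row.

```python
toy_dim = 3  # toy embedding dimensionality

# Hypothetical word index and pretrained vectors; 'zzyzx' has no pretrained vector
toy_word_to_index = {"<PAD>": 0, "movie": 1, "zzyzx": 2}
toy_vectors = {"movie": [0.1, -0.2, 0.3]}

# Start from an all-zero matrix with one row per index
toy_matrix = [[0.0] * toy_dim for _ in range(len(toy_word_to_index))]

for word, i in toy_word_to_index.items():
    vector = toy_vectors.get(word)
    if vector is not None:
        toy_matrix[i] = vector  # row i <- pretrained vector of the word with index i

for row in toy_matrix:
    print(row)
```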

In [0]:
# dimensionality of word embeddings
embedding_dim = 100

# load_data() shifts every word index by index_from: indexes 0, 1 and 2 are reserved
# and index 3 is unused, so the most frequent word ('the', rank 1) is encoded as 4
index_from = 3

word_to_index = {k:(v+index_from) for k,v in imdb.get_word_index().items()}
word_to_index["<PAD>"] = 0
word_to_index["<START>"] = 1
word_to_index["<UNK>"] = 2
word_to_index["<UNUSED>"] = 3


# create a weight matrix for words in training docs
embedding_matrix = np.zeros((vocab_size+index_from, embedding_dim))

# rows of the special tokens and of words missing from GloVe keep the zero vector

for word, i in word_to_index.items():
    embedding_vector = embeddings_index.get(word)
    # indexes that fall outside the matrix after the shift are skipped
    if embedding_vector is not None and i < embedding_matrix.shape[0]:
        embedding_matrix[i] = # --fill here-- # 
    #else :
        #print(word, ' not found in GLoVe file.')

nonzero_elements = np.count_nonzero(np.count_nonzero(embedding_matrix, axis=1))
print('Coverage = ', nonzero_elements / vocab_size)

#### Build the model

In [0]:
model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(vocab_size+index_from, embedding_dim, input_length=maxlen))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(32, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation=tf.nn.sigmoid))

In [0]:
model.summary()

#### Loading pretrained word embeddings into the Embedding layer

In [0]:
# --fill here-- # look at the layer method set_weights() and at the layer property trainable (the Embedding layer is model.layers[0])

#### Compile the model

In [0]:
# --fill here-- #

#### Train the model

In [0]:
history = # --fill here-- #

In [0]:
plot_history(history)

#### Evaluate the model

In [0]:
test_loss, test_acc = # --fill here-- #
print('Test accuracy: %.3f, Test loss: %.3f' % (test_acc,test_loss))

Notice that with few training samples, the performance is poor.<br>
If you train the same model without loading the pretrained word embeddings and without freezing the embedding layer, the model learns a task-specific embedding of the input tokens, which is generally more powerful than pretrained word embeddings when lots of data is available.

## 2.2 Recurrent Neural Network
Here https://colah.github.io/posts/2015-08-Understanding-LSTMs/ you can find a clear explanation about RNNs and LSTMs; the following is a summary of the main concepts.


A major characteristic of some neural networks, such as ConvNets, is that they have no memory: each input is processed independently, with no state kept in between inputs.<br>
With such networks, in order to process a sequence or a temporal series of data points, you have to show the entire sequence to the network at once (turn it into a single data point).<br>
Biological intelligence processes information incrementally while maintaining an internal model of what it’s processing, built from past information and constantly updated as new information comes in.<br>
A recurrent neural network (RNN) adopts the same principle but in an extremely simplified version: it processes sequences by iterating through the sequence elements and maintaining a state containing information relative to what it has seen so far. In effect, an RNN is a type of neural network that has an internal loop.

<img src="http://mlclass.epizy.com/lab2_images_notebook/rnn.png" width="650px"><br>




At each timestep $t$ (..., $t-1$, $t$, $t+1$, ...), the input $x_t$ is combined with the internal state and an activation function (e.g. $\tanh$) is applied; the output $h_t$ is then computed and the internal state is updated.<br>
In many cases, you just need the last output ($h_t$ at the last timestep, at the end of the loop), because it already contains information about the entire sequence.
<img src="http://mlclass.epizy.com/lab2_images_notebook/rnn2.png" width="550px">
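
The loop described above can be sketched with scalars instead of tensors. This is a deliberately simplified illustration: the weights `w`, `u` and `b` are hypothetical fixed numbers rather than learned parameters, and the state is a single float that is set to the latest output.

```python
import math


def rnn_step(x_t, state, w=0.5, u=0.8, b=0.0):
    """One timestep: combine the input with the previous state, then apply tanh."""
    return math.tanh(w * x_t + u * state + b)


state = 0.0            # initial state
outputs = []
for x_t in [1.0, 0.0, 0.0]:
    state = rnn_step(x_t, state)   # the output becomes the state for the next step
    outputs.append(state)

print(outputs)
```

Only the first input is non-zero, yet the later outputs are still non-zero: the state carries information about earlier inputs forward through the loop.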


#### Numpy implementation of RNN

<img src="http://mlclass.epizy.com/lab2_images_notebook/rnn1.png" width="550px">

In [0]:
timesteps = 100
input_features = 32
output_features = 64

inputs = np.random.random((timesteps, input_features))

state_t = # --fill here-- #  initial state all 0s

# set W, U and b to random values
W = # --fill here-- #
U = # --fill here-- #
b = # --fill here-- #

successive_outputs = []

for input_t in inputs:
    output_t = # --fill here-- # 
    successive_outputs.append(output_t) 
    state_t = # --fill here-- # 
    
final_output_sequence = np.stack(successive_outputs, axis=0)  # Stack the per-timestep outputs into a 2D tensor of
                                                              # shape (timesteps, output_features).
final_output_sequence[-1]

In this example, the final output is a 2D tensor of shape (timesteps, output_features), where each row is the output of the loop at time t (so it contains information about all timesteps up to t); as already mentioned, in most cases you just need the last output (output_t at the end of the loop).

#### RNN with tensorflow

#### Load dataset

In [0]:
max_features = 10000
maxlen = 20

imdb = tf.keras.datasets.imdb
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)

# --fill here-- # use preprocessing.sequence.pad_sequences
x_train = # --fill here-- #
x_test = # --fill here-- #

#### Create the model

In [0]:
model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(10000, 32))
model.add(tf.keras.layers.SimpleRNN(32))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

#### Compile the model

In [0]:
# --fill here-- #

#### Train the model

In [0]:
history = # --fill here-- #

In [0]:
plot_history(history)

#### Evaluate the model

In [0]:
test_loss, test_acc = # --fill here-- #
print('Test accuracy: %.3f, Test loss: %.3f' % (test_acc,test_loss))

#### Stack several recurrent layers

Try to stack several recurrent layers one after the other in order to increase the representational power of the network. In such a setup, all of the intermediate layers have to return their full sequence of outputs rather than only the last one.
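
The reason intermediate layers must return full sequences is that the next recurrent layer needs one input per timestep. The scalar sketch below illustrates this; `toy_rnn` and its weights are hypothetical, and its `return_sequences` flag only mirrors the argument of the same name on the Keras recurrent layers.

```python
import math


def toy_rnn(inputs, return_sequences, w=0.5, u=0.8):
    """Scalar recurrent layer: returns every output or only the last one."""
    state = 0.0
    outputs = []
    for x_t in inputs:
        state = math.tanh(w * x_t + u * state)
        outputs.append(state)
    return outputs if return_sequences else outputs[-1]


sequence = [1.0, 0.5, -0.5, 0.2]

# The first layer returns one value per timestep, so the second layer has a sequence to iterate over
hidden = toy_rnn(sequence, return_sequences=True)
# The last layer only needs to return its final output
final = toy_rnn(hidden, return_sequences=False)

print(len(hidden), final)
```

If the first layer returned only its last output, the second layer would receive a single value instead of a sequence and would have nothing to iterate over.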

In [0]:
# --fill here-- #

#### Compile the model

In [0]:
# --fill here-- #

#### Train the model

In [0]:
history = # --fill here-- #

In [0]:
plot_history(history)

#### Evaluate the model

In [0]:
test_loss, test_acc = # --fill here-- #
print('Test accuracy: %.3f, Test loss: %.3f' % (test_acc,test_loss))

## 2.3 LSTM Network
LSTMs are a special kind of recurrent neural network which works, for many tasks, much better than the standard RNNs.<br>
These nets are capable of learning long-term dependencies (they are explicitly designed to avoid the long-term dependency problem); remembering information for long periods of time is practically their default behavior.<br><br>

<img src="http://mlclass.epizy.com/lab2_images_notebook/lstm.png" width="650px"><br>

In standard RNNs, the repeating module has a very simple structure, such as a single $tanh$ layer.<br>
LSTMs also have a chain-like structure, but the repeating module is different: instead of a single neural network layer, there are four, interacting in a very special way.

#### LSTM Walk Through

The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.<br>
It runs straight down the entire chain, with only some minor linear interactions.
<img src="http://mlclass.epizy.com/lab2_images_notebook/lstm_cellstate.png" width="650px"><br>
The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.<br>
Gates are a way to optionally let information through; they are composed out of a sigmoid neural net layer and a pointwise multiplication operation.

The sigmoid layer outputs numbers in $[0,1]$, describing how much of each component should be let through ($0$ means “let nothing through,” $1$ means “let everything through”); an LSTM has three of these gates, to protect and control the cell state.
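
A gate can be illustrated with a few numbers. In the sketch below the cell-state values and the gate pre-activations are made up; in a real LSTM the pre-activations are computed by a learned layer.

```python
import math


def sigmoid(z):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


cell_state = [2.0, -1.0, 0.5]
gate_logits = [10.0, -10.0, 0.0]   # hypothetical pre-activations of the gate layer

gate = [sigmoid(z) for z in gate_logits]            # values close to 1, close to 0, and 0.5
gated = [g * c for g, c in zip(gate, cell_state)]   # pointwise multiplication

print(gate)
print(gated)
```

The first component passes almost unchanged, the second is almost completely blocked, and the third is halved.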

#### How does an LSTM work?

**1.** The first step is to decide what information we’re going to throw away from the cell state; this decision is made by a sigmoid layer called the “forget gate layer.” 

<img src="http://mlclass.epizy.com/lab2_images_notebook/lstm_fg.png" width="650px"><br>

**2.** The next step is to decide what new information we’re going to store in the cell state. 
First, a sigmoid layer called the “input gate layer” decides which values we’ll update. Next, a tanh layer creates a vector of new candidate values, $\tilde{C}_{t}$, that could be added to the state. In the next step, we’ll combine these two to create an update to the state.

<img src="http://mlclass.epizy.com/lab2_images_notebook/lstm_ig.png" width="650px"><br>

**3.** Now we update the old cell state, $C_{t-1}$, into the new cell state $C_{t}$.<br>
We multiply the old state by $f_{t}$, forgetting the things we decided to forget earlier; then we add $i_{t} * \tilde{C}_{t}$: these are the new candidate values, scaled by how much we decided to update each state value.

<img src="http://mlclass.epizy.com/lab2_images_notebook/lstm_up.png" width="650px"><br>

**4.** Finally the output will be based on our cell state, but will be a filtered version. 
First, we run a sigmoid layer which decides what parts of the cell state we’re going to output. Then, we put the cell state through tanh (to push the values in $[-1,1]$) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.

<img src="http://mlclass.epizy.com/lab2_images_notebook/lstm_out.png" width="650px">
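
The four steps above can be put together in a scalar sketch of one LSTM step. It is a simplification under stated assumptions: the state and the input are single floats, and every layer uses the same hypothetical weights `(w_h, w_x, b)`, whereas a real LSTM learns separate weight matrices for each layer.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """One scalar LSTM step; p maps each layer name to its (w_h, w_x, b) weights."""
    def layer(name):
        w_h, w_x, b = p[name]
        return w_h * h_prev + w_x * x_t + b

    f_t = sigmoid(layer("forget"))            # step 1: what to throw away from the cell state
    i_t = sigmoid(layer("input"))             # step 2: which values to update
    c_tilde = math.tanh(layer("candidate"))   # step 2: new candidate values
    c_t = f_t * c_prev + i_t * c_tilde        # step 3: new cell state
    o_t = sigmoid(layer("output"))            # step 4: which parts of the state to output
    h_t = o_t * math.tanh(c_t)                # step 4: filtered version of the cell state
    return h_t, c_t


# Hypothetical weights, identical for all four layers
params = {name: (0.5, 0.5, 0.0) for name in ("forget", "input", "candidate", "output")}

h, c = 0.0, 0.0
for x_t in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x_t, h, c, params)

print(h, c)
```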


#### Create LSTM model in TensorFlow

In [0]:
model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(20000, 128))
model.add(tf.keras.layers.CuDNNLSTM(128))  # CuDNNLSTM requires a GPU; without one, use tf.keras.layers.LSTM
model.add(tf.keras.layers.Dropout(0.2))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

#### Compile the model

In [0]:
# --fill here-- #

#### Train the model

In [0]:
history = # --fill here-- #

In [0]:
plot_history(history)

#### Evaluate the model

In [0]:
test_loss, test_acc = # --fill here-- #
print('Test accuracy: %.3f, Test loss: %.3f' % (test_acc,test_loss))

**The dataset is actually too small for LSTM to be of any advantage compared to simpler models.**

### BONUS - Try to implement a model for the yelp review dataset
You can download the dataset in .json format from link: https://www.yelp.com/dataset/download

**This file has a size of about 8GB**

In [0]:
def convert(x):
    """ This function converts a .json line into a flat dict with all information about the review
        e.g. 'review_id': 'Q1sbwvVQXV2734tPgoKj4Q', 'user_id': 'hG7b0MtEbXx5QzbzE6C_VA',.."""
    ob = json.loads(x)
    # iterate over a snapshot of the items, because the dict is modified inside the loop
    for k, v in list(ob.items()):
        if isinstance(v, list):
            ob[k] = ','.join(v)
        elif isinstance(v, dict):
            for kk, vv in v.items():
                ob['%s_%s' % (k, kk)] = vv
            del ob[k]
    return ob
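
To see what this kind of flattening produces, here is a self-contained sketch applied to one invented record. `flatten_record` is a hypothetical name for a condensed restatement of the same logic, and the fields `tags` and `votes` are made up for illustration; they are not taken from the Yelp file.

```python
import json


def flatten_record(line):
    """Parse one JSON line; join lists into strings and lift nested dicts to top-level keys."""
    record = json.loads(line)
    for key, value in list(record.items()):   # snapshot: the dict is modified below
        if isinstance(value, list):
            record[key] = ','.join(value)
        elif isinstance(value, dict):
            for sub_key, sub_value in value.items():
                record['%s_%s' % (key, sub_key)] = sub_value
            del record[key]
    return record


sample_line = '{"stars": 4, "tags": ["food", "bar"], "votes": {"useful": 2, "funny": 0}}'
flat = flatten_record(sample_line)
print(flat)
```

Every value in the result is a scalar or a string, so a list of such records maps directly onto the columns of a DataFrame.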

#### Load data

In [0]:
json_filename = # --fill here-- # path where review.json is located

with open(json_filename,'rb') as f:
    data = f.readlines()

print(len(data))

#### Loading the whole dataset could exhaust your laptop's memory; to avoid this, we use only a part of the dataset.

In [0]:
ind = int(len(data)/10)
data1 = data[:ind]

#### Now store data in a pandas DataFrame converting each review

In [0]:
df = pd.DataFrame([convert(line) for line in data1])

In [0]:
del(data)
del(data1)
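# copy the selection so that the columns added below do not trigger a SettingWithCopyWarning
data = df[['text', 'stars']].copy()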

In [0]:
data.head()

In [0]:
# Ratings above 3 are considered positive
data['sentiment'] = ['pos' if (x>3) else 'neg' for x in data['stars']]

data['text']= [x.lower() for x in data['text']]
# keep only letters, digits and whitespace
data['text'] = data['text'].apply((lambda x: re.sub(r'[^a-zA-Z0-9\s]','',x)))

In [0]:
for idx,row in data.iterrows():
    row[0] = row[0].replace('rt',' ')

In [0]:
pd.set_option('display.max_colwidth',-1)
data[:5]

data.dtypes

#### Visualize data

In [0]:
data.head()

#### Vectorizing the text

In [0]:
nb_words = 2000

tokenizer = Tokenizer(num_words=nb_words, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                      lower=True,split=' ')
tokenizer.fit_on_texts(data['text'].values)

In [0]:
X = tokenizer.texts_to_sequences(data['text'].values)
X = pad_sequences(X)

#### Build the model

In [0]:
embed_dim = 128

model = # --fill here-- # 
# hint: you could stack an embedding layer followed by an RNN or LSTM,
# then a dropout and finally a dense layer with softmax

#### Compile the model

In [0]:
# --fill here-- #

#### Split data into Train and Test set

In [0]:
# Convert categorical variable into dummy variables
Y = pd.get_dummies(data['sentiment']).values
x_train, x_test, y_train, y_test = # --fill here-- #

print(x_train.shape,y_train.shape)
print(x_test.shape,y_test.shape)

#### Train the model

In [0]:
batch_size = 32

history = # --fill here-- # training could be very time consuming, so set epochs variable in an appropriate way

In [0]:
plot_history(history)

#### Evaluate the model

In [0]:
test_loss, test_acc = # --fill here-- #
print('Test accuracy: %.3f, Test loss: %.3f' % (test_acc,test_loss))