[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Humboldt-WI/adams/blob/master/exercises/tut7_RNN_NLP1_teacher.ipynb)

# Tutorial 7: Processing words as sequences
In this tutorial, we try to predict the next word in a sentence. The task is hard because the model must pick one word out of a vocabulary that is usually large. The purpose of this tutorial is therefore not to build an accurate model, but to show how the task can be set up. More accurate models require larger samples and more computational resources.

We cover the following:
1. Prepare the text data so that a sequence $[w_1,w_2,w_3,w_4,w_5,w_6]$ is split into a target $y=w_6$ and features $x=[w_1,w_2,w_3,w_4,w_5]$. Because you are already familiar with the IMDB dataset, we use it to create our sequence data.
2. Train a feedforward network.
3. Train a NN with a `SimpleRNN` layer.
4. Train a NN with an `LSTM` layer.
5. Train a NN with `Embedding` and `LSTM` layers.
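
Step 1 can be sketched in plain Python before any Keras machinery is involved. The helper name `make_windows` and the toy sentence are illustrative assumptions, not part of the tutorial code; the sketch only shows how each run of five words is paired with the word that follows it.

```python
def make_windows(words, sequence_length=5):
    """Pair every run of `sequence_length` words with the word that follows it."""
    pairs = []
    for start in range(len(words) - sequence_length):
        x = words[start:start + sequence_length]  # w_1 ... w_5
        y = words[start + sequence_length]        # w_6
        pairs.append((x, y))
    return pairs


# Toy sentence (hypothetical, already lowercased and without punctuation)
sentence = "this show is not for the faint hearted".split()
pairs = make_windows(sentence)
for x, y in pairs:
    print(x, "->", y)
```

An eight-word sentence yields three training pairs, so even a small number of reviews produces many samples.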

For further examples, please visit the demos in [demos/rnn](https://github.com/Humboldt-WI/adams/tree/master/demos/rnn).

## 1. Preprocess IMDB data 

In [1]:
# Import the required libraries
import pandas as pd
import numpy as np
import tensorflow as tf
import string
import re
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

### Exercise 1
Load the IMDB dataset and use the first 100 reviews for training and the next 20 for validation. We won't be using the sentiment labels, only the text.

In [3]:
# load the data (be sure to provide the correct file path)
total_imbd = pd.read_csv("IMDB-50K-Movie-Review.csv", sep=",", encoding="ISO-8859-1")
text_data = total_imbd['review'][:120].to_numpy()
text_data_train = text_data[:100]
text_data_val = text_data[100:]
text_data_train[:2]

array(["One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to

### Exercise 2
Create an `our_standardization` function that converts the text to lowercase and removes HTML tags, punctuation and double spaces (check [tut5_embeddings](https://github.com/Humboldt-WI/adams/blob/master/exercises/tut5_embeddings_teacher.ipynb)).

In [4]:
def our_standardization(text_data):
  lowercase = tf.strings.lower(text_data) # convert to lowercase
  remove_html = tf.strings.regex_replace(lowercase, '<br />', ' ') # remove HTML line-break tags
  pattern_remove_punctuation = '[%s]' % re.escape(string.punctuation) # pattern to remove punctuation
  remove_punct = tf.strings.regex_replace(remove_html, pattern_remove_punctuation, '') # apply pattern
  remove_double_spaces = tf.strings.regex_replace(remove_punct, r'\s+', ' ') # collapse repeated whitespace (raw string avoids an invalid-escape warning)
  return remove_double_spaces
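
To see what these four steps do to a review, the same pipeline can be mirrored with Python's built-in `re` module. This is only a sketch for inspection: the name `standardize_py` and the sample string are assumptions, and the TensorFlow version above remains the one passed to the vectorization layer.

```python
import re
import string


def standardize_py(text):
    """Pure-Python mirror of the standardization steps, for inspection only."""
    text = text.lower()                                              # lowercase
    text = text.replace("<br />", " ")                               # drop HTML line breaks
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)  # strip punctuation
    text = re.sub(r"\s+", " ", text)                                 # collapse whitespace
    return text


sample = "They are right.<br /><br />The first thing, that  struck me!"
cleaned = standardize_py(sample)
print(cleaned)  # they are right the first thing that struck me
```

The order matters: the `<br />` tag has to be replaced before punctuation is stripped, otherwise a stray `br` token would be left in the text.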

### Exercise 3
Create a `TextVectorization` layer with integer `output_mode` (the default) and without defining `output_sequence_length`. Use a vocabulary of only 100 words (nothing good can be done with 100 words, but the purpose is to illustrate).

In [5]:
# Define the size of the vocabulary
vocab_size = 100
# Create a vectorization layer
vectorize_layer = TextVectorization(
    standardize = our_standardization,
    max_tokens = vocab_size )

### Exercise 4
Adapt the vectorization layer to `text_data`.

In [26]:
# To create the vocabulary, we need to call adapt. The input is only the text
vectorize_layer.adapt(text_data)
# Check the first 10 words of the vocabulary. It is sorted by frequency 
vocab = vectorize_layer.get_vocabulary()
print(vocab[:10])

['', '[UNK]', 'the', 'a', 'and', 'of', 'to', 'is', 'in', 'it']
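
The printed vocabulary starts with two reserved entries, `''` for padding and `'[UNK]'` for out-of-vocabulary words, followed by the most frequent words. A minimal pure-Python sketch of that mapping is given below; `build_vocab`, `encode` and the toy corpus are hypothetical names and data, and the tie-breaking between equally frequent words may differ from what `TextVectorization` does.

```python
from collections import Counter


def build_vocab(texts, max_tokens):
    """Reserve index 0 (padding) and 1 ([UNK]); fill the rest by frequency."""
    counts = Counter(word for text in texts for word in text.split())
    frequent = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + frequent


def encode(text, vocab):
    """Map each word to its index; unknown words fall back to index 1."""
    index = {word: i for i, word in enumerate(vocab)}
    return [index.get(word, 1) for word in text.split()]


toy_corpus = ["the movie was good", "the movie was bad", "the movie", "the"]
toy_vocab = build_vocab(toy_corpus, max_tokens=5)
encoded = encode("the movie was great", toy_vocab)
print(toy_vocab)  # ['', '[UNK]', 'the', 'movie', 'was']
print(encoded)    # [2, 3, 4, 1]
```

With `max_tokens = 100`, only 98 slots are left for real words, so most words of a review end up as index 1.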


### Exercise 5
Create a `transform_text` function to transform the text data into a time series. Each target is the word that follows the previous 5 words (similar to what we saw in [tut6_LSTM](https://github.com/Humboldt-WI/adams/blob/master/exercises/tut6_LSTM_teacher.ipynb)). You can use the built-in `timeseries_dataset_from_array` from Keras.

In [6]:
def transform_text(data, sequence_length):
    delay = sequence_length # the target word is the word right after the sequence
    batch_size = 1
    inputs, targets = [], []
    # Generate data
    for rev in data:
        vec_rev = vectorize_layer(rev)
        # Create time series dataset for each review
        aux_dataset = tf.keras.preprocessing.timeseries_dataset_from_array(
            data = vec_rev[:-delay],
            targets = vec_rev[delay:],
            sequence_length=sequence_length,
            shuffle=False,
            batch_size=batch_size)
        # Collect the windows; avoid shadowing the built-in `input`
        for x_batch, y_batch in aux_dataset:
            inputs.append(x_batch)
            targets.append(y_batch)
    # Concatenate once at the end instead of growing the tensors inside the loop
    X = tf.concat(inputs, 0)
    y = tf.concat(targets, 0)
    return X, y

### Exercise 6
Create the training and validation datasets.

In [27]:
sequence_length = 5 # we use the last 5 words
X_train, y_train = transform_text(text_data_train, sequence_length)
X_val, y_val = transform_text(text_data_val, sequence_length)

In [8]:
print("features:", X_train[2]," target:", y_train[2])

features: tf.Tensor([ 2 82  1 43  1], shape=(5,), dtype=int64)  target: tf.Tensor(12, shape=(), dtype=int64)
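
The printed sample can be checked by hand. The sketch below rebuilds the first windows with NumPy's `sliding_window_view`, using the first eight token ids of the first training review as they appear in the `vectorize_layer` output of this section. The names `X_demo` and `y_demo` are illustrative and not part of the tutorial code.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

sequence_length = 5

# First eight token ids of the first training review (copied from the notebook output)
tokens = np.array([31, 5, 2, 82, 1, 43, 1, 12])

# Each window holds 5 input tokens followed by the token to predict
windows = sliding_window_view(tokens, sequence_length + 1)
X_demo = windows[:, :sequence_length]
y_demo = windows[:, sequence_length]

print(X_demo[2], y_demo[2])  # [ 2 82  1 43  1] 12
```

The third window matches `X_train[2]` and `y_train[2]`, which confirms that the target is the token right after the five input tokens.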


In [9]:
vectorize_layer(text_data_train[0])

<tf.Tensor: shape=(304,), dtype=int64, numpy=
array([31,  5,  2, 82,  1, 43,  1, 12,  1,  1, 35,  1,  1,  1,  1, 32,  1,
       40, 28,  1, 15, 10,  7,  1, 44,  1, 14, 60,  2, 86,  1, 12,  1, 60,
       42,  1, 13, 23,  1,  4,  1,  1,  5,  1, 50,  1,  8,  1, 36,  2,  1,
        1,  1, 60, 10,  7, 22,  3,  1, 17,  2,  1,  1, 38,  1, 10,  1,  1,
       49,  1, 14,  1,  6,  1,  1, 38,  1, 23,  7,  1,  8,  2,  1,  1,  5,
        2,  1,  9,  7,  1,  1, 15, 12,  7,  2,  1,  1,  6,  2,  1,  1,  1,
        1,  1,  9,  1,  1, 20,  1,  1, 34,  1,  1,  5,  2,  1, 88, 30,  2,
        1, 29,  1,  1,  4,  1,  1, 39,  1,  7, 22,  1, 20,  2,  1,  1,  1,
        7,  1,  6,  1,  1,  1,  1,  1,  1,  1,  4,  1,  1,  1,  1,  1,  1,
        4,  1,  1, 28, 83,  1,  1, 11, 62,  1,  2,  1,  1,  5,  2,  1,  7,
        1,  6,  2,  1, 12,  9,  1, 88, 82,  1,  1,  1,  1,  1,  1,  1, 17,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  2, 86,  1, 11,  1,  1,  1, 60,
       15, 39,  1,  9, 13,  1, 11,  1,  1, 11, 13,  1,

### Exercise 7
Check the frequency of each token (you can use `tf.unique_with_counts`). What's the problem?

In [10]:
tf.unique_with_counts(y_train)

UniqueWithCounts(y=<tf.Tensor: shape=(99,), dtype=int64, numpy=
array([43,  1, 12, 35, 32, 40, 28, 15, 10,  7, 44, 14, 60,  2, 86, 42, 13,
       23,  4,  5, 50,  8, 36, 22,  3, 17, 38, 49,  6,  9, 20, 34, 88, 30,
       29, 39, 83, 11, 62, 82, 18, 47, 45, 84, 90, 75, 65, 24, 46, 63, 59,
       27, 98, 57, 26, 31, 21, 93, 66, 67, 78, 81, 73, 91, 51, 41, 52, 85,
       25, 72, 96, 53, 16, 19, 37, 80, 70, 99, 76, 54, 56, 58, 94, 74, 64,
       55, 61, 79, 92, 77, 48, 33, 71, 68, 97, 69, 95, 89, 87])>, idx=<tf.Tensor: shape=(21873,), dtype=int32, numpy=array([ 0,  1,  2, ...,  1, 24,  1], dtype=int32)>, count=<tf.Tensor: shape=(99,), dtype=int32, numpy=
array([   62, 10944,   233,    75,    94,    70,   100,   163,   224,
         399,    64,   171,    43,  1327,    33,    68,   174,   114,
         638,   604,    54,   372,    73,   123,   653,   154,    69,
          52,   515,   262,   130,    76,    31,    95,    95,    60,
          31,   236,    41,    32,   159,    56,    63,    32
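
One way to read the output above: the second unique value is token 1 (`[UNK]`), and its count dominates all others. The short sketch below turns the two numbers visible in the output into a share; treating this as the baseline a model has to beat is an interpretation, not something the notebook states.

```python
# Numbers copied from the output above
unk_count = 10944      # occurrences of token 1 ([UNK]) among the targets
total_targets = 21873  # number of targets (length of idx)

unk_share = unk_count / total_targets
print(f"share of [UNK] targets: {unk_share:.3f}")  # share of [UNK] targets: 0.500
```

A model that always predicts `[UNK]` would already be right on about half of the training targets, so raw accuracy says little about whether anything useful was learned.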

## 2. Feedforward NN
### Exercise 8
Fit a feedforward network.

In [11]:
input = tf.keras.Input(shape=(sequence_length,), dtype="int64") 
emd = tf.one_hot(input, depth=vocab_size)
flat = layers.Flatten()(emd)
x = layers.Dense(32)(flat) 
output = layers.Dense(vocab_size, activation="softmax")(x) 
model = tf.keras.Model(input, output) 

model.compile(optimizer="rmsprop",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 5)]               0         
_________________________________________________________________
tf.one_hot (TFOpLambda)      (None, 5, 100)            0         
_________________________________________________________________
flatten (Flatten)            (None, 500)               0         
_________________________________________________________________
dense (Dense)                (None, 32)                16032     
_________________________________________________________________
dense_1 (Dense)              (None, 100)               3300      
Total params: 19,332
Trainable params: 19,332
Non-trainable params: 0
_________________________________________________________________
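
The parameter counts in the summary can be reproduced by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The helper `dense_params` below is a hypothetical name used only for this check.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


sequence_length, vocab_size, hidden_units = 5, 100, 32

flat_size = sequence_length * vocab_size                    # 5 one-hot vectors of length 100
hidden_layer = dense_params(flat_size, hidden_units)        # 16032
output_layer = dense_params(hidden_units, vocab_size)       # 3300
print(hidden_layer, output_layer, hidden_layer + output_layer)  # 16032 3300 19332
```

Most parameters sit in the first dense layer because flattening gives every position in the window its own set of weights.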


In [12]:
model.fit(
    X_train, 
    y_train, 
    validation_data=(X_val, y_val),
    epochs = 10, 
    batch_size=128) 

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fa4bd8ffe50>

In [13]:
# Count the correct next-word predictions on the validation set
np.sum(np.argmax(model.predict(X_val), axis = 1)==y_val.numpy())

1933
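
The prediction cell picks the most probable token per sample with `argmax` and counts the matches with the true targets. The toy example below, with made-up probabilities for a three-word vocabulary, shows the same computation and how to turn the count into an accuracy.

```python
import numpy as np

# Made-up softmax outputs for 4 samples over a 3-word vocabulary (illustrative only)
probs = np.array([
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.3, 0.4],
])
labels = np.array([1, 0, 2, 2])

predicted = np.argmax(probs, axis=1)          # most probable token per sample
n_correct = int(np.sum(predicted == labels))  # number of matches
accuracy = n_correct / len(labels)            # share of matches

print(predicted, n_correct, accuracy)  # [1 0 1 2] 3 0.75
```

Dividing the count by the number of validation samples makes the result comparable across datasets of different size.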

## 3. SimpleRNN
### Exercise 9 
Fit a NN with a `SimpleRNN` layer.

In [14]:
input = tf.keras.Input(shape=(sequence_length,), dtype="int64") 
emd = tf.one_hot(input, depth=vocab_size)
x = layers.SimpleRNN(32)(emd) 
output = layers.Dense(vocab_size, activation="softmax")(x) 
model = tf.keras.Model(input, output) 

model.compile(optimizer="rmsprop",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 5)]               0         
_________________________________________________________________
tf.one_hot_1 (TFOpLambda)    (None, 5, 100)            0         
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 32)                4256      
_________________________________________________________________
dense_2 (Dense)              (None, 100)               3300      
Total params: 7,556
Trainable params: 7,556
Non-trainable params: 0
_________________________________________________________________
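
The 4,256 parameters of the `SimpleRNN` layer follow from its recurrence $h_t = \tanh(x_t W + h_{t-1} U + b)$. The NumPy sketch below runs that recurrence over one window with randomly initialised weights; the weight names `W`, `U`, `b` and the random values are assumptions for illustration and are not the trained Keras weights.

```python
import numpy as np

vocab_size, units = 100, 32
rng = np.random.default_rng(0)

W = rng.normal(scale=0.1, size=(vocab_size, units))  # input-to-hidden weights
U = rng.normal(scale=0.1, size=(units, units))       # hidden-to-hidden weights
b = np.zeros(units)                                  # bias

tokens = np.array([2, 82, 1, 43, 1])   # one window of 5 token ids
one_hot = np.eye(vocab_size)[tokens]   # shape (5, 100)

h = np.zeros(units)                    # initial hidden state
for x_t in one_hot:                    # the same weights are reused at every step
    h = np.tanh(x_t @ W + h @ U + b)

n_params = W.size + U.size + b.size
print(h.shape, n_params)  # (32,) 4256
```

Because the weights are shared across the five time steps, the layer needs far fewer parameters than the flattened dense layer of the feedforward model.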


In [15]:
model.fit(
    X_train, 
    y_train, 
    validation_data=(X_val, y_val),
    epochs = 10, 
    batch_size=128) 

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fa4bde62130>

In [16]:
# Count the correct next-word predictions on the validation set
np.sum(np.argmax(model.predict(X_val), axis = 1)==y_val.numpy())

1933

## 4. LSTM
### Exercise 10
Fit a NN with an `LSTM` layer.

In [17]:
input = tf.keras.Input(shape=(sequence_length,), dtype="int64") 
emd = tf.one_hot(input, depth=vocab_size)
x = layers.LSTM(32)(emd) 
output = layers.Dense(vocab_size, activation="softmax")(x) 
model = tf.keras.Model(input, output) 

model.compile(optimizer="rmsprop",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()

Model: "model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         [(None, 5)]               0         
_________________________________________________________________
tf.one_hot_2 (TFOpLambda)    (None, 5, 100)            0         
_________________________________________________________________
lstm (LSTM)                  (None, 32)                17024     
_________________________________________________________________
dense_3 (Dense)              (None, 100)               3300      
Total params: 20,324
Trainable params: 20,324
Non-trainable params: 0
_________________________________________________________________
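
An LSTM layer keeps four weight sets of the same shape as a simple recurrent layer (input gate, forget gate, output gate and candidate cell state), which explains the 17,024 parameters in the summary. The helper names below are hypothetical and only reproduce the arithmetic.

```python
def simple_rnn_params(input_dim, units):
    """Input weights, recurrent weights and bias of one recurrent weight set."""
    return (input_dim + units + 1) * units


def lstm_params(input_dim, units):
    """Four weight sets: input, forget and output gates plus the candidate state."""
    return 4 * simple_rnn_params(input_dim, units)


vocab_size, units = 100, 32
lstm_layer = lstm_params(vocab_size, units)  # 17024
dense_layer = units * vocab_size + vocab_size  # 3300
print(lstm_layer, lstm_layer + dense_layer)  # 17024 20324
```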


In [18]:
model.fit(
    X_train, 
    y_train, 
    validation_data=(X_val, y_val),
    epochs = 10, 
    batch_size=128) 

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fa4bf35c250>

In [19]:
# Count the correct next-word predictions on the validation set
np.sum(np.argmax(model.predict(X_val), axis = 1)==y_val.numpy())

1934

## 5. Embedding + LSTM
### Exercise 11
Fit a NN with `Embedding` and `LSTM` layers.

In [20]:

input = tf.keras.Input(shape=(sequence_length,), dtype="int64") 
emd = layers.Embedding(input_dim=vocab_size, output_dim=16)(input)
x = layers.LSTM(32)(emd) 
output = layers.Dense(vocab_size, activation="softmax")(x) 
model = tf.keras.Model(input, output) 

model.compile(optimizer="rmsprop",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()

Model: "model_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_4 (InputLayer)         [(None, 5)]               0         
_________________________________________________________________
embedding (Embedding)        (None, 5, 16)             1600      
_________________________________________________________________
lstm_1 (LSTM)                (None, 32)                6272      
_________________________________________________________________
dense_4 (Dense)              (None, 100)               3300      
Total params: 11,172
Trainable params: 11,172
Non-trainable params: 0
_________________________________________________________________
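
An embedding layer is a trainable lookup table: selecting row `i` of the table gives the same result as multiplying a one-hot vector for token `i` with that table. The NumPy sketch below checks this with a random table `E` (an assumption, not the trained weights) and reproduces the parameter counts from the summary.

```python
import numpy as np

vocab_size, embedding_dim, units = 100, 16, 32
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embedding_dim))  # random stand-in for the embedding table

tokens = np.array([2, 82, 1, 43, 1])              # one window of 5 token ids

lookup = E[tokens]                                # row lookup, shape (5, 16)
via_one_hot = np.eye(vocab_size)[tokens] @ E      # one-hot times table, same result
same = np.allclose(lookup, via_one_hot)

embedding_params = E.size                                # 100 * 16
lstm_layer = 4 * (embedding_dim + units + 1) * units     # input is now 16-dimensional
print(same, embedding_params, lstm_layer)  # True 1600 6272
```

Feeding 16-dimensional vectors instead of 100-dimensional one-hot vectors is what shrinks the LSTM from 17,024 to 6,272 parameters.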


In [21]:
model.fit(
    X_train, 
    y_train, 
    validation_data=(X_val, y_val),
    epochs = 10, 
    batch_size=64) 

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fa4bf6feaf0>

In [22]:
# Count the correct next-word predictions on the validation set
np.sum(np.argmax(model.predict(X_val), axis = 1)==y_val.numpy())

1934