# Week 3: Sequence models (RNN, LSTM)
Subword tokenization hurt our results last week: once words are broken down into subwords, the context of each word is hard to follow.
The order in which the subword tokens appear therefore becomes very important for understanding their meaning.

In an RNN (recurrent neural network), each step takes its input x and produces an output y, and it also receives a value passed on from the previous step.
```
y0    y1    y2
^     ^     ^
| f0  | f1  | f2
F --> F --> F -->
^     ^     ^
|     |     |
x0    x1    x2
```
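
The recurrence in the diagram can be sketched in plain Python. This is a minimal illustration, not how Keras implements it: the scalar weights `w_x`, `w_h`, `b` and the `tanh` activation are assumptions chosen for the example, and for simplicity the state handed to the next step doubles as the output.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One step of F: combine the current input with the previous state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(xs, w_x=0.5, w_h=0.8, b=0.0):
    """Feed a sequence through the same function F, carrying the state along."""
    h = 0.0  # initial state before the first token
    states = []
    for x in xs:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

# The second and third inputs are zero, yet their states are not:
# the first input is still "remembered" through the carried state.
states = run_rnn([1.0, 0.0, 0.0])
print(states)

# The same values in a different order end in a different state.
print(run_rnn([1.0, 0.0])[-1], run_rnn([0.0, 1.0])[-1])
```

Because every step reuses the same function and only the carried state changes, the result depends on the order of the inputs, which is exactly what a bag of tokens cannot capture.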

This approach has a limitation when the relevant context is far from the word being predicted:
`Today has a beautiful blue <...>` -> "blue sky": the context word that lets us predict the next word is right next to it.

`I lived in Ireland, so at school they made me learn how to speak <...>` -> "to speak Gaelic": "Ireland" is the context word, but it appears very far from the word being predicted.

In this case an LSTM (Long Short-Term Memory) network can be useful. An LSTM has an additional pipeline of context called the "cell state", which is passed along through the network.
This helps keep context from earlier tokens relevant to later ones.
An LSTM can also be bidirectional: the sequence is read in both directions, so later context can influence earlier tokens.
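
To see how the cell state keeps early context alive, here is a scalar sketch of one LSTM step using the standard gate equations. The weights in `params` are hypothetical and hand-picked for the demonstration (a real network learns them): the forget gate stays close to 1, and the input gate opens only when the input is 1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step; returns the new hidden state and cell state."""
    f = sigmoid(p["w_f"] * x + p["u_f"] * h_prev + p["b_f"])    # forget gate
    i = sigmoid(p["w_i"] * x + p["u_i"] * h_prev + p["b_i"])    # input gate
    g = math.tanh(p["w_g"] * x + p["u_g"] * h_prev + p["b_g"])  # candidate value
    o = sigmoid(p["w_o"] * x + p["u_o"] * h_prev + p["b_o"])    # output gate
    c = f * c_prev + i * g  # keep part of the old cell state, add new content
    h = o * math.tanh(c)    # hidden state exposed to the next layer
    return h, c

# Hypothetical weights, chosen by hand for illustration only.
params = {
    "w_f": 0.0, "u_f": 0.0, "b_f": 5.0,    # forget gate close to 1: keep memory
    "w_i": 10.0, "u_i": 0.0, "b_i": -5.0,  # input gate opens only when x == 1
    "w_g": 2.0, "u_g": 0.0, "b_g": 0.0,
    "w_o": 0.0, "u_o": 0.0, "b_o": 0.0,
}

# One informative token followed by 19 uninformative ones.
inputs = [1.0] + [0.0] * 19
h, c = 0.0, 0.0
cell_states = []
for x in inputs:
    h, c = lstm_step(x, h, c, params)
    cell_states.append(c)

print(cell_states[0], cell_states[-1])
```

With these settings the cell state written at the first token decays only slightly over the following 19 steps, which is the behaviour the "Ireland ... Gaelic" example needs.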

# Implementing LSTMs in code
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

vocab_size = 10000
embedding_dim = 16
max_length = 120
trunc_type = 'post'
oov_tok = "<OOV>"

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),  # the Keras Tokenizer has no `vocab_size` attribute, so use the constant defined above
    # LSTM layer: 64 is the number of units (outputs) per direction.
    # `Bidirectional` runs the LSTM in both directions and concatenates the results,
    # so the output size is 2 * 64 = 128. `return_sequences=True` passes the full sequence on to the next LSTM.
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),  # LSTMs can also be stacked
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

# Using a Convolutional network
anotherModel = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    # Use convolutions
    tf.keras.layers.Conv1D(128, 5, activation='relu'),  # 128 filters, each spanning a window of 5 tokens
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

# Going back to the IMDB dataset
import tensorflow_datasets as tfds
# Keep the return value: a dict of splits plus the dataset metadata
imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    # Alternatives tried in this position (enable one at a time); param counts are for the whole model:
    # tf.keras.layers.Flatten(),  # 171,533 params; high accuracy but clear overfitting
    # tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),  # noted as 30,129 params, which looks like a smaller configuration (the embedding here alone has 160,000); ~43 s per epoch
    # tf.keras.layers.Conv1D(128, 5, activation='relu'),  # 171,149 params together with the pooling layer below; only ~6 s per epoch
    # tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(32)),  # 169,997 params; ~20 s per epoch; good accuracy, but still overfits
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),  # output layer: a single sigmoid unit for binary classification
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
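
The parameter counts quoted in the comments above can be checked by hand. The sketch below uses the usual per-layer formulas; the helper names are made up for this example, and the GRU formula assumes the Keras default `reset_after=True` (two bias vectors per gate). It reproduces the Flatten, Conv1D and GRU totals, but not the 30,129 noted for the LSTM variant: with `vocab_size = 10000` the embedding alone has 160,000 weights, so that figure presumably comes from a different configuration.

```python
# Settings taken from the notes above.
VOCAB_SIZE, EMBEDDING_DIM, MAX_LENGTH = 10000, 16, 120

def embedding_params(vocab, dim):
    return vocab * dim

def dense_params(n_in, n_out):
    return n_in * n_out + n_out  # weights + biases

def conv1d_params(n_in, filters, kernel_size):
    return kernel_size * n_in * filters + filters

def lstm_params(n_in, units):
    # 4 blocks (3 gates + candidate), each with input weights, recurrent weights, bias
    return 4 * (units * (n_in + units) + units)

def gru_params(n_in, units):
    # 3 blocks (2 gates + candidate), each with two biases (assumes reset_after=True)
    return 3 * (units * (n_in + units) + 2 * units)

def head_params(n_in):
    # Dense(6, relu) followed by Dense(1, sigmoid), as in the model above
    return dense_params(n_in, 6) + dense_params(6, 1)

embedding = embedding_params(VOCAB_SIZE, EMBEDDING_DIM)

totals = {
    "flatten": embedding + head_params(MAX_LENGTH * EMBEDDING_DIM),
    "conv1d": embedding + conv1d_params(EMBEDDING_DIM, 128, 5) + head_params(128),
    "bi_gru": embedding + 2 * gru_params(EMBEDDING_DIM, 32) + head_params(2 * 32),
    "bi_lstm": embedding + 2 * lstm_params(EMBEDDING_DIM, 32) + head_params(2 * 32),
}

for name, count in totals.items():
    print(f"{name}: {count:,}")
```

In every variant the embedding dominates the total, so the choice of sequence layer mostly affects training time rather than model size.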
