<a href="https://colab.research.google.com/github/gonzalovaldenebro/NaturalLanguageProcessing-Portfolio/blob/main/F6_4_LongTermRecurrence.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS 195: Natural Language Processing
## Handling Long-Term Information in Recurrent Neural Networks

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericmanley/f23-CS195NLP/blob/main/F6_4_LongTermRecurrence.ipynb)

## Reference

SLP: RNNs and LSTMs, Chapter 9 of Speech and Language Processing by Daniel Jurafsky & James H. Martin https://web.stanford.edu/~jurafsky/slp3/9.pdf

Wikipedia article on Gated Recurrent Unit: https://en.wikipedia.org/wiki/Gated_recurrent_unit

Sequence Classification with LSTM Recurrent Neural Networks in Python with Keras by Jason Brownlee: https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/

In [1]:
import sys
!{sys.executable} -m pip install datasets keras tensorflow transformers



## Example Sentence

*The flights the airline was cancelling were full*

Suppose we have generated `The flights the airline`
* `was` is a good next choice
   - `airline` has context for `was` vs. `were`

Suppose we have generated `The flights the airline was cancelling`
* `was`/`were` depends on `flights`
* the relevant information is much more distant

## The Vanishing Gradient

The *vanishing gradient* is a common problem in deep neural networks

If the weights are < 1, a signal gets smaller and smaller with each node it passes through, until it has little or no effect

Happens during training too - error/loss is propagated backwards through the network proportional to the weights on each edge
* earlier edges in the network are left with little error to use in adjusting weights

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/vanishing_gradient.png?raw=1">
</div>

image source: https://www.researchgate.net/figure/A-visualization-of-the-vanishing-gradient-problem-using-the-architecture-depicted-in_fig8_277603865
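
To make the shrinking effect concrete, here is a minimal sketch in plain Python. It assumes a hypothetical chain of 10 layers that each scale the signal by the same weight of 0.5. Real networks have many weights and nonlinear activations, so this only illustrates the trend.

```python
# Hypothetical setup: every layer multiplies the signal by the same weight.
weight = 0.5
num_layers = 10

signal = 1.0
history = []
for layer in range(num_layers):
    signal *= weight          # one multiplication per layer passed through
    history.append(signal)

print(history[-1])  # 0.0009765625, i.e. 0.5 ** 10
```

After only 10 layers the signal is under a thousandth of its original size, which is why the earliest edges receive so little error to learn from.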

## LSTM

Long Short-Term Memory (LSTM) networks try to address the vanishing gradient through
* removing unneeded information from the context
* adding information likely needed later

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/LSTM_node.png?raw=1" width=700>
</div>

image source: SLP Fig. 19.3, https://web.stanford.edu/~jurafsky/slp3/9.pdf

### How does it do this?

* Explicit **context layer** $c_t$
* neural **gates** that control the flow of information through the layer
    - $f$ - the **forget gate** - delete info from context that is no longer needed
    - $g$ - basic extraction of info from previous hidden state
    - $i$ - the **add gate** - select info to add to current context
    - $o$ - the **output gate** - decide what info is needed for current hidden state
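
The gates listed above can be sketched as a single LSTM time step in plain Python. This is an illustrative sketch, not the Keras implementation: the `layer` and `constant_gate` helpers, the bias terms, and the tiny sizes (3 inputs, 2 hidden units) are assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(weights, x, h_prev, activation):
    """Compute activation(W x + U h_prev + b) using plain Python lists."""
    W, U, b = weights
    out = []
    for W_row, U_row, b_k in zip(W, U, b):
        z = b_k
        z += sum(w * x_j for w, x_j in zip(W_row, x))
        z += sum(u * h_j for u, h_j in zip(U_row, h_prev))
        out.append(activation(z))
    return out

def lstm_step(x, h_prev, c_prev, params):
    f = layer(params["f"], x, h_prev, sigmoid)    # forget gate
    g = layer(params["g"], x, h_prev, math.tanh)  # candidate info
    i = layer(params["i"], x, h_prev, sigmoid)    # add gate
    o = layer(params["o"], x, h_prev, sigmoid)    # output gate
    # new context: keep part of the old context, add part of the candidate
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    # new hidden state: expose part of the squashed context
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c

def constant_gate(bias, hidden_size=2, input_size=3):
    """Hypothetical helper: zero weights, so the gate depends only on its bias."""
    W = [[0.0] * input_size for _ in range(hidden_size)]
    U = [[0.0] * hidden_size for _ in range(hidden_size)]
    return W, U, [bias] * hidden_size

# Forget gate near 1 (keep everything), add gate near 0 (add nothing).
params = {
    "f": constant_gate(20.0),
    "g": constant_gate(0.0),
    "i": constant_gate(-20.0),
    "o": constant_gate(0.0),
}
x = [1.0, -2.0, 0.5]
h_prev = [0.0, 0.0]
c_prev = [0.7, -0.3]
h, c = lstm_step(x, h_prev, c_prev, params)
print(c)  # very close to c_prev: the context passes through almost unchanged
```

With the forget gate held open and the add gate held shut, the context is carried forward nearly untouched, which is the mechanism that lets information survive over many time steps.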
    
<div>
    <table>
    <tr>
        <td><img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/hadamard_product.png?raw=1"></td><td style="text-align: left;"><b>Hadamard product:</b> element-wise multiplication</td>
    </tr>
    <tr>
        <td><img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/sigmoid.png?raw=1"></td><td style="text-align: left;"><b>Sigmoid activation:</b> squashes everything into the range 0 to 1, pushing large inputs toward 0 or 1</td>
    </tr>
    <tr>
        <td><img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/tanh.png?raw=1"></td><td style="text-align: left;"><b>Hyperbolic tangent activation:</b> pushes to -1 or 1, more like identity at the origin</td>
    </tr>
    </table>
</div>

Combining sigmoid with ⊙ has the effect of *masking* out information: removing some, leaving the rest
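
A small sketch of the masking effect in plain Python. The context values and gate inputs are made up for illustration: large positive gate inputs give sigmoid outputs near 1 (keep) and large negative ones give outputs near 0 (remove).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

context = [2.0, -1.0, 0.5, 3.0]            # hypothetical context values
gate_inputs = [10.0, -10.0, 10.0, -10.0]   # hypothetical pre-activation gate values

gate = [sigmoid(z) for z in gate_inputs]            # each entry is near 1 or near 0
masked = [g * c for g, c in zip(gate, context)]     # Hadamard (element-wise) product

print([round(m, 3) for m in masked])
```

The first and third entries survive almost unchanged, while the second and fourth are pushed to nearly zero.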

## Gated Recurrent Unit

A **Gated Recurrent Unit** is a popular unit similar to LSTM, except more lightweight
* no output gate
* no context vector

Performance is often similar, but fewer parameters
* faster
* less memory


<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/RNN-vs-LSTM-vs-GRU.png?raw=1" width=700>
</div>

image source: http://dprogrammer.org/rnn-lstm-gru
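
For comparison with the LSTM, here is a rough sketch of one GRU step for a single hidden unit, using scalar weights in plain Python. It follows one common convention for the update gate (some references swap the roles of `z` and `1 - z`), and the parameter names and values are hypothetical. Note that there is only a hidden state, with no separate context and no output gate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x, h_prev, p):
    z = sigmoid(p["W_z"] * x + p["U_z"] * h_prev + p["b_z"])  # update gate
    r = sigmoid(p["W_r"] * x + p["U_r"] * h_prev + p["b_r"])  # reset gate
    # candidate state: the reset gate decides how much of the old state feeds in
    h_candidate = math.tanh(p["W_h"] * x + p["U_h"] * (r * h_prev) + p["b_h"])
    # blend the old state and the candidate according to the update gate
    return (1.0 - z) * h_prev + z * h_candidate

def make_params(b_z):
    # Hypothetical scalar weights; only the update-gate bias varies.
    return {"W_z": 0.0, "U_z": 0.0, "b_z": b_z,
            "W_r": 0.0, "U_r": 0.0, "b_r": 0.0,
            "W_h": 1.0, "U_h": 1.0, "b_h": 0.0}

h_prev = 0.9
h_keep = gru_step(0.3, h_prev, make_params(-20.0))    # update gate near 0: state is kept
h_replace = gru_step(0.3, h_prev, make_params(20.0))  # update gate near 1: state is overwritten
print(h_keep, h_replace)
```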

## Let's work with some data

We'll do something that should be an easier learning problem: text classification with a recurrent network

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/RNN_classification.png?raw=1" width=700>
</div>

image source: SLP Fig. 9.8, https://web.stanford.edu/~jurafsky/slp3/9.pdf

### IMDB Reviews Dataset

In [2]:
from datasets import load_dataset
from sklearn.model_selection import train_test_split

dataset = load_dataset("imdb")
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [11]:
# work with a random subset of the data (comment these out to use the full data)
data_subset_text, _, data_subset_label, _ = train_test_split(dataset["train"]["text"],dataset["train"]["label"],train_size=5000)
train_data_text,  test_data_text, train_data_label, test_data_label = train_test_split(data_subset_text, data_subset_label,test_size = 0.2)

# uncomment these to use the full original data
#train_data_text = dataset["train"]["text"]
#train_data_label = dataset["train"]["label"]
#test_data_text = dataset["test"]["text"]
#test_data_label = dataset["test"]["label"]

In [12]:
#printing out a sample review
print( train_data_text[125] )
print( train_data_label[125] )

When it comes to horror movies, I am more than willing to suspend disbelief, ignore sub-par production values, and overlook plot holes in the interest of a good scare. This movies simply has no good scares to offer. It can't even be enjoyed as camp. Bad dialogue, bad acting, bad direction, the kills were predictable and poorly staged, the music was annoying, the camera work was wretched, even the costumes were bad. I felt really bad for the actors, who were obviously trying, but who had to deal with terrible, contrived dialogue and an obvious lack of direction. I doubt they got any rehearsal, either. It's embarrassing to watch, and so boring than making it through to the contrived "surprise" ending requires tremendous endurance. It's quite easily one of the worst movies I've ever seen.<br /><br />I don't normally write reviews, but this one was so bad that I felt compelled to warn others. This movie is a complete waste of time. If you must watch this movie, don't miss the "Making of"-f

### Importing libraries

In [13]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Embedding, SimpleRNN, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

### Preparing the data

In [19]:
vocab_size = 10000
pad_length = 500

tokenizer = Tokenizer(num_words=vocab_size) #only keep the 10000 most common words
tokenizer.fit_on_texts(train_data_text)
tokenized_train_data = tokenizer.texts_to_sequences(train_data_text)
processed_train_data = pad_sequences(tokenized_train_data,maxlen=pad_length, padding='pre')
tokenized_test_data = tokenizer.texts_to_sequences(test_data_text)
processed_test_data = pad_sequences(tokenized_test_data,maxlen=pad_length, padding='pre')

train_target = np.array(train_data_label)
test_target = np.array(test_data_label)

**Important Note:** I originally tried `padding='post'`, which led to bad results
* having a bunch of 0s at the end of a sequence is really bad when you are only sending the last output to the next layer
* in general, we shouldn't be using post-padding with recurrent networks
    - unfortunately, this doesn't seem to be the problem with our encoder-decoder example, but it is still worth going back and fixing if you want to keep working with it
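
To see the difference between the two padding modes, here is a minimal sketch in plain Python. The `pad` function is a hypothetical stand-in that handles one sequence at a time and mimics the default behavior of `pad_sequences` (pad with 0, and keep the last `maxlen` tokens when a sequence is too long). The token ids are made up.

```python
def pad(sequence, maxlen, padding="pre", value=0):
    """Hypothetical stand-in for pad_sequences applied to a single sequence."""
    sequence = sequence[-maxlen:]                 # too long: keep the last maxlen tokens
    fill = [value] * (maxlen - len(sequence))     # too short: fill with the pad value
    return fill + sequence if padding == "pre" else sequence + fill

review = [12, 7, 431]  # made-up token ids
print(pad(review, 6, padding="pre"))   # [0, 0, 0, 12, 7, 431]
print(pad(review, 6, padding="post"))  # [12, 7, 431, 0, 0, 0]
```

With pre-padding the real words are the last thing the recurrent layer sees before it hands its final output to the next layer. With post-padding, the layer has to carry that information through a run of 0s first.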

### Defining a simple LSTM-based architecture

Since this is a binary classification problem, we can use a sigmoid activation and `binary_crossentropy` loss.
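
As a rough illustration of what that combination computes, here is a sketch in plain Python for a single example. The raw output value of 2.0 is made up, and this simple version omits the numerical safeguards a library implementation would include.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, p):
    # loss is small when the predicted probability p agrees with the label
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

p = sigmoid(2.0)                              # predicted probability of the positive class
loss_if_positive = binary_crossentropy(1, p)  # label agrees with the prediction
loss_if_negative = binary_crossentropy(0, p)  # label disagrees with the prediction
print(round(p, 3), round(loss_if_positive, 3), round(loss_if_negative, 3))
```

A confident prediction is rewarded with a low loss when it is right and penalized heavily when it is wrong.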

In [20]:
embedding_size = 50
hidden_layer_size = 100

model = Sequential()
model.add(Embedding(vocab_size, embedding_size, input_length=pad_length))
model.add(Dropout(0.2))
model.add(LSTM(hidden_layer_size, dropout=0.2, recurrent_dropout=0.2))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))

In [21]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 500, 50)           500000    
                                                                 
 dropout (Dropout)           (None, 500, 50)           0         
                                                                 
 lstm_2 (LSTM)               (None, 100)               60400     
                                                                 
 dropout_1 (Dropout)         (None, 100)               0         
                                                                 
 dense_2 (Dense)             (None, 1)                 101       
                                                                 
Total params: 560501 (2.14 MB)
Trainable params: 560501 (2.14 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None
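
The parameter counts in the summary can be reproduced by hand. This sketch assumes the usual layout: one embedding vector per vocabulary entry, and four weight sets in the LSTM (one each for $f$, $g$, $i$, and $o$), each with input weights, recurrent weights, and a bias. Those assumptions give the same numbers as the summary above.

```python
vocab_size = 10000
embedding_size = 50
hidden_layer_size = 100

# one embedding vector per word in the vocabulary
embedding_params = vocab_size * embedding_size

# 4 weight sets (f, g, i, o), each with input weights, recurrent weights, and a bias
lstm_params = 4 * hidden_layer_size * (embedding_size + hidden_layer_size + 1)

# one output unit: a weight per hidden unit plus a bias
dense_params = hidden_layer_size * 1 + 1

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)  # 500000 60400 101 560501

# for comparison: a basic recurrent layer has one such weight set instead of four
simple_rnn_params = hidden_layer_size * (embedding_size + hidden_layer_size + 1)
print(simple_rnn_params)  # 15100
```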


In [23]:
model.fit(processed_train_data,
          train_target,
          epochs = 10,
          batch_size = 64,
          validation_data=(processed_test_data,test_target) )

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x2a21cb550>

## Exercise

The source I got this code from (https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/ ) included Dropout layers - you can try uncommenting them and see what it does.

An alternative is to apply dropout inside the LSTM layer itself, on its inputs and its recurrent connections:

`model.add(LSTM(hidden_layer_size, dropout=0.2, recurrent_dropout=0.2))`

Do some searching and see what you can find out about what dropout layers do and why people use them. Discuss your findings with your group.

**Did the dropout layers make a difference?**

It made the results worse; with more epochs it would probably do better

**What does it do?**

Randomly drops (zeroes out) a fraction of a layer's inputs during training, which helps reduce overfitting
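
As a rough illustration of dropout during training, here is a sketch in plain Python. The `dropout` function is a hypothetical stand-in, not the Keras layer: each value is zeroed with probability `rate`, and the survivors are scaled up by `1 / (1 - rate)` so the expected total stays the same. At prediction time dropout is switched off and values pass through unchanged.

```python
import random

def dropout(values, rate, rng):
    """Hypothetical training-time dropout on a list of activations."""
    keep_scale = 1.0 / (1.0 - rate)
    # zero each value with probability `rate`, scale up the ones that are kept
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(0)        # seeded so the example is repeatable
activations = [1.0] * 10      # made-up activations
dropped = dropout(activations, rate=0.2, rng=rng)
print(dropped)
```

Because a different random subset is dropped on every training step, the network cannot rely too heavily on any single unit.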

## Exercise

Run an experiment: What is the difference between using `SimpleRNN` and `LSTM` with this data?

## Applied Exploration

Do one of the following:

1. Redo your experiment with another classification dataset. Choose something with more than 2 classes - this will be good practice in understanding the changes you need to make to the model and data prep. Describe your data and results as usual.
    * I also suggest including a GRU layer in your experiment as well: https://keras.io/api/layers/recurrent_layers/gru/

2. Edit the Encoder-Decoder code from last time to use LSTM or GRU.
    * Note that since LSTM returns both a context and hidden state, you will get an output, a hidden state, and context returned from the LSTM layer (instead of just the output and state). It will look something like
    

In [None]:
encoder_rnn = LSTM(100, return_state=True)
encoder_outputs, state_h, state_c = encoder_rnn(enc_emb)

and you will pass both state_h, state_c as the *context vector* which is the initial state for the decoder. See the source from last time to flesh out the example: https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html
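
For the first option, the usual change for more than 2 classes is a final `Dense` layer with one unit per class and a `softmax` activation, trained with a categorical cross-entropy loss (in Keras, `sparse_categorical_crossentropy` when the labels are integer class ids). Here is a sketch in plain Python of what that computes for a single example. The three classes and the raw output values are made up.

```python
import math

def softmax(logits):
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]                # probabilities that sum to 1

def sparse_categorical_crossentropy(label, probs):
    # negative log of the probability assigned to the correct class
    return -math.log(probs[label])

logits = [2.0, 0.5, -1.0]      # made-up raw outputs, one per class
probs = softmax(logits)
loss = sparse_categorical_crossentropy(0, probs)    # the true class is class 0
print([round(p, 3) for p in probs], round(loss, 3))
```

Unlike the single sigmoid output used for binary classification, the softmax produces one probability per class, and the loss only looks at the probability given to the correct one.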