<a href="https://colab.research.google.com/github/lorenzo-crippa/3M_NLP_ESS_2022/blob/main/Tutorial_Eight_(R)_LSTMs_and_Bi_LSTMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classification with LSTMs and Bi-LSTMs

## Douglas Rice

*This tutorial was originally created by Burt Monroe for his prior work with the Essex Summer School. I've updated and modified it.*

In this notebook, we'll move beyond the simple feed-forward architectures from our earlier neural network tutorials and set up networks that explicitly try to learn from *sequences*. We'll look specifically at **L**ong **S**hort-**T**erm **M**emory (LSTM) and **bi**directional LSTM (bi-LSTM) networks. Building these models in Keras takes only small changes to our earlier code. Computationally, however, we are adding significant complexity, so the models will take longer to estimate.


#### Setup Instructions:
This notebook was designed to run in a clean R runtime within Google Colab. Before running any of the code below, go to the menu at the top of the window and click "Runtime," then, from the dropdown, click "Disconnect and delete runtime". Then reconnect. That should get everything set up to run smoothly.


## Setup

In [1]:
install.packages("keras") # install R library for keras; this installs dependencies we'll need, including tensorflow

library(tensorflow) # load R library for tensorflow
library(keras) # load R library for keras

tf$constant("Hello Tensorflow") # check that tensorflow is working

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘Rcpp’, ‘RcppTOML’, ‘here’, ‘png’, ‘config’, ‘tfautograph’, ‘reticulate’, ‘tensorflow’, ‘tfruns’, ‘zeallot’


Loaded Tensorflow version 2.8.2



tf.Tensor(b'Hello Tensorflow', shape=(), dtype=string)

In [2]:
install.packages("tfdatasets")
library(tfdatasets)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



## Load the IMDB data


In [3]:
url <- "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

dataset <- get_file(
  "aclImdb_v1",
  url,
  untar = TRUE,
  cache_dir = '.',
  cache_subdir = ''
)

In [4]:
dataset_dir <- file.path("aclImdb")
list.files(dataset_dir)


In [5]:
train_dir <- file.path(dataset_dir, 'train')
list.files(train_dir)

In [6]:
sample_file <- file.path(train_dir, 'pos/1181_9.txt')
readr::read_file(sample_file)

In [7]:
remove_dir <- file.path(train_dir, 'unsup')
unlink(remove_dir, recursive = TRUE)

In [8]:
batch_size <- 512
seed <- 1234

raw_train_ds <- text_dataset_from_directory(
  'aclImdb/train',
  batch_size = batch_size,
  validation_split = 0.2,
  subset = 'training',
  seed = seed
)

In [9]:
raw_val_ds <- text_dataset_from_directory(
  'aclImdb/train',
  batch_size = batch_size,
  validation_split = 0.2,
  subset = 'validation',
  seed = seed
)

raw_test_ds <- text_dataset_from_directory(
  'aclImdb/test',
  batch_size = batch_size
)

## Apply TextVectorization

You can pass a custom standardization or tokenization function to the TextVectorization layer -- and the reviews do contain some detritus, like HTML tags, that should probably be removed -- but we'll just use the defaults.
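
As a rough idea of what such cleanup involves, here is a base-R sketch run on a made-up review string (not taken from the dataset). The helper name `strip_html()` is our own. To do the same inside the layer, one would instead pass a function built from TensorFlow string operations to the `standardize` argument of `layer_text_vectorization()`; that version is not shown here.

```r
# Made-up review text for illustration only (not from the IMDB data)
review <- "Great film!<br /><br />Would watch <i>again</i>."

# Hypothetical helper: lowercase, replace HTML tags with spaces,
# drop punctuation, and collapse repeated whitespace
strip_html <- function(x) {
  x <- tolower(x)
  x <- gsub("<[^>]+>", " ", x)       # remove tags such as <br /> and <i>
  x <- gsub("[[:punct:]]", "", x)    # remove remaining punctuation
  trimws(gsub("\\s+", " ", x))       # tidy up the whitespace
}

cleaned <- strip_html(review)
cleaned
```

Removing the tags before the punctuation matters: if only punctuation were stripped, `<br />` would leave behind the stray token `br`.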

Now let's set up our `vectorize_layer`. We'll set our maximum vocabulary size and our maximum review length (in tokens).

In [10]:
max_features <- 10000
sequence_length <- 500

vectorize_layer <- layer_text_vectorization(
  max_tokens = max_features,
  output_mode = "int",
  output_sequence_length = sequence_length
)

We'll call the `adapt()` function to build the vocabulary from the text of the training reviews.



In [11]:
train_text <- raw_train_ds %>%
  dataset_map(function(text, label) text)
  
vectorize_layer %>% adapt(train_text)
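
To make the `int` output mode concrete, here is a toy base-R imitation of what adapting and then applying the layer amounts to: rank words by frequency, reserve index 0 for padding and index 1 for unknown words, then pad or truncate each review to a fixed length. The documents and the names `toy_docs`, `toy_max_tokens`, and `toy_len` are made up for illustration. The real layer does this inside TensorFlow and also standardizes the text first.

```r
toy_docs <- c("good good good film", "good film bad", "terrible")
toy_max_tokens <- 4L  # vocabulary size, including the padding and unknown slots
toy_len <- 3L         # plays the role of output_sequence_length

toy_tokens <- strsplit(toy_docs, " +")

# "adapt": keep the most frequent words that fit in the vocabulary
word_freq <- sort(table(unlist(toy_tokens)), decreasing = TRUE)
toy_vocab <- head(names(word_freq), toy_max_tokens - 2L)

encode <- function(words) {
  ids <- match(words, toy_vocab) + 1L     # most frequent word gets index 2
  ids[is.na(ids)] <- 1L                   # out-of-vocabulary words get index 1
  ids <- head(ids, toy_len)               # truncate long reviews
  c(ids, rep(0L, toy_len - length(ids)))  # pad short reviews with 0
}

# One row per document, one column per token position
encoded <- t(sapply(toy_tokens, encode))
encoded
```

The first document loses its last word to truncation, the second maps `bad` to the unknown index, and the third is mostly padding.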

In [12]:
vectorize_text <- function(text, label) {
  text <- tf$expand_dims(text, -1L)
  list(vectorize_layer(text), label)
}

In [13]:
train_ds <- raw_train_ds %>% dataset_map(vectorize_text)
val_ds <- raw_val_ds %>% dataset_map(vectorize_text)
test_ds <- raw_test_ds %>% dataset_map(vectorize_text)

In [14]:
names(raw_val_ds)

## Performance Considerations

In [15]:
AUTOTUNE <- tf$data$AUTOTUNE

train_ds <- train_ds %>%
  dataset_cache() %>%
  dataset_prefetch(buffer_size = AUTOTUNE)
val_ds <- val_ds %>%
  dataset_cache() %>%
  dataset_prefetch(buffer_size = AUTOTUNE)
test_ds <- test_ds %>%
  dataset_cache() %>%
  dataset_prefetch(buffer_size = AUTOTUNE)

## Create the Model

Building a basic LSTM is very simple in Keras. We just add an LSTM layer to our sequential model.

In [16]:
embedding_dim <- 16 # each token is embedded as a 16-dimensional vector

In [19]:
model <- keras_model_sequential() %>%
  layer_embedding(max_features + 1, # input_dim: rows in the embedding table,
                                    # which we train along with the rest of the
                                    # model. The vectorizer's 10,000 indices
                                    # already include 0 (padding) and 1 (unknown
                                    # tokens), so the +1 is just a safety margin.
                  embedding_dim) %>%
  layer_lstm(units = 16) %>%
  layer_dense(units = 1, activation = "sigmoid")

summary(model)

Model: "sequential_2"
________________________________________________________________________________
 Layer (type)                       Output Shape                    Param #     
 embedding_2 (Embedding)            (None, None, 16)                160016      
 lstm_2 (LSTM)                      (None, 16)                      2112        
 dense_2 (Dense)                    (None, 1)                       17          
Total params: 162,145
Trainable params: 162,145
Non-trainable params: 0
________________________________________________________________________________
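
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard Keras LSTM layout (four gates, each with input weights, recurrent weights, and a bias). The variable names are ours, chosen so they do not overwrite the notebook's objects.

```r
# Reproduce the "Param #" column of the summary with plain arithmetic
vocab_rows <- 10000 + 1   # max_features + 1 rows in the embedding table
emb_dim    <- 16          # embedding_dim
lstm_units <- 16          # units in layer_lstm()

embedding_params <- vocab_rows * emb_dim

# Each of the 4 LSTM gates has input weights (emb_dim x units),
# recurrent weights (units x units), and a bias (units)
lstm_params <- 4 * (emb_dim + lstm_units + 1) * lstm_units

# One weight per LSTM unit plus a bias
dense_params <- lstm_units + 1

c(embedding = embedding_params,
  lstm      = lstm_params,
  dense     = dense_params,
  total     = embedding_params + lstm_params + dense_params)
```

Nearly all of the parameters sit in the embedding table; the LSTM itself is small.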


In [20]:
model %>% compile(
  optimizer = 'adam',
  loss = 'binary_crossentropy',
  metrics = c('accuracy')
)

In [None]:
history <- model %>% fit(
  train_ds,
  epochs = 25,
  validation_data = val_ds,
  verbose = 2
)

In [None]:
plot(history)

In [None]:
results <- model %>% evaluate(test_ds, verbose = 2)

In [None]:
results

That looks overfit, and we could probably stop training much earlier. There's a weird jump around 20 epochs, which is why we train past it, to 25. We're not doing as well as we did with earlier approaches, and one reason might be that we are truncating the reviews at 500 tokens with `sequence_length` above. In doing so, we probably lose the ends of long reviews, including those that close with their rating (which boosts the accuracy of some of our more naive approaches). As an exercise, take some time to play around with the specifications and see where we might be able to improve.
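
One way to cut training off earlier is early stopping: track the best validation loss so far and stop once it has failed to improve for a set number of epochs (the "patience"). Keras provides this as `callback_early_stopping()`, which can be passed to `fit()` through its `callbacks` argument. The base-R sketch below only illustrates the rule itself, using made-up validation losses rather than results from this model; `early_stop_epoch()` is a hypothetical helper.

```r
# Hypothetical helper: apply the patience rule to a vector of validation losses
early_stop_epoch <- function(val_loss, patience = 3) {
  best <- Inf
  best_epoch <- 0
  wait <- 0
  for (epoch in seq_along(val_loss)) {
    if (val_loss[epoch] < best) {
      # New best: remember it and reset the counter
      best <- val_loss[epoch]
      best_epoch <- epoch
      wait <- 0
    } else {
      # No improvement: stop once we have waited `patience` epochs
      wait <- wait + 1
      if (wait >= patience) {
        return(list(stopped_at = epoch, best_epoch = best_epoch))
      }
    }
  }
  list(stopped_at = length(val_loss), best_epoch = best_epoch)
}

# Made-up losses for illustration: improving, then steadily worsening
toy_val_loss <- c(0.60, 0.45, 0.38, 0.36, 0.37, 0.40, 0.44, 0.50)
stop_info <- early_stop_epoch(toy_val_loss, patience = 3)
stop_info
```

With these numbers, training would stop at epoch 7, three epochs after the best validation loss at epoch 4, instead of running all 8.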

## Build a basic bi-LSTM model

Let's see if a bi-directional LSTM does any better. Notice again that this is very straightforward; everything mimics our code from before but we've wrapped our `layer_lstm` layer with `bidirectional()`.

In [None]:
model <- keras_model_sequential() %>%
  layer_embedding(max_features + 1, embedding_dim) %>%
  bidirectional(layer_lstm(units = 16)) %>%
  layer_dense(units = 1, activation = "sigmoid")

summary(model)
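
No summary output is shown for this model, so the following is a hand calculation of the sizes it should report. It assumes the standard Keras LSTM parameterization, which gives the 2,112 parameters reported for the one-directional layer in the earlier summary. `bidirectional()` trains one LSTM that reads left to right and a second that reads right to left, and by default concatenates their outputs.

```r
emb_dim    <- 16   # embedding_dim
lstm_units <- 16   # units in each direction

# Parameters for a single direction, same as the plain LSTM layer
one_direction <- 4 * (emb_dim + lstm_units + 1) * lstm_units

bilstm_params <- 2 * one_direction   # forward and backward copies
bilstm_output <- 2 * lstm_units      # concatenated outputs feed the dense layer
dense_params  <- bilstm_output + 1   # one weight per input plus a bias

c(bilstm = bilstm_params, output_width = bilstm_output, dense = dense_params)
```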

In [None]:
model %>% compile(
  optimizer = 'adam',
  loss = 'binary_crossentropy',
  metrics = c('accuracy')
)

In [None]:
history <- model %>% fit(
  train_ds,
  epochs = 25,
  validation_data = val_ds,
  verbose = 2
)

In [None]:
plot(history)

In [None]:
results <- model %>% evaluate(test_ds, verbose = 2)

In [None]:
results

## Build a more expressive, deeper bi-LSTM model with dropout

Bi-LSTMs seem to gain power when stacked in multiple layers. Let's do that, make everything bigger, and add some regularization through dropout.

In [None]:
model <- keras_model_sequential() %>%
  layer_embedding(max_features + 1, 64) %>%
  layer_dropout(rate = .3) %>%
  bidirectional(layer_lstm(units = 32, return_sequences = TRUE)) %>%
  layer_dropout(rate = .2) %>%
  bidirectional(layer_lstm(units = 16)) %>%
  layer_dense(units = 1, activation = "sigmoid")

summary(model)
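
Setting `return_sequences = TRUE` on the first bi-LSTM makes it emit one output vector per token rather than only a final one, which is what the second bi-LSTM needs as input. Below is a hand calculation of the layer sizes, assuming the standard Keras LSTM parameterization (four gates, each with input weights, recurrent weights, and a bias). The helper `bilstm_params()` is ours, not a Keras function.

```r
# Hypothetical helper: parameter count of a bidirectional LSTM layer
bilstm_params <- function(input_dim, units) {
  2 * 4 * (input_dim + units + 1) * units
}

embedding <- (10000 + 1) * 64            # max_features + 1 rows, 64 columns
bilstm_1  <- bilstm_params(64, 32)       # reads the 64-dimensional embeddings
bilstm_2  <- bilstm_params(2 * 32, 16)   # reads 2 * 32 = 64 features per token
dense     <- 2 * 16 + 1                  # 32 concatenated outputs plus a bias

c(embedding = embedding, bilstm_1 = bilstm_1, bilstm_2 = bilstm_2,
  dense = dense, total = embedding + bilstm_1 + bilstm_2 + dense)
```

Dropout layers add no parameters. At roughly 675,000 parameters, this model is about four times the size of the first one, with most of the growth in the embedding table.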

In [None]:
model %>% compile(
  optimizer = 'adam',
  loss = 'binary_crossentropy',
  metrics = c('accuracy')
)

This one takes 30 to 35 minutes to fit in R. 

In [None]:
history <- model %>% fit(
  train_ds,
  epochs = 15,
  validation_data = val_ds,
  verbose = 2
)

In [None]:
plot(history)

In [None]:
results <- model %>% evaluate(test_ds, verbose = 2)

In [None]:
results

That comes in at about 80% accuracy in the test set, though the better results on the validation set above make it look like there is room for improvement if you play around with the model.

It's worth noting, perhaps, that the even bigger, even more expressive model in the Keras documentation (a 128-dimensional embedding layer and two 64-unit bi-LSTM layers -- 2.8 million parameters) reaches a test-set accuracy of 86.8%: https://keras.io/examples/nlp/bidirectional_lstm_imdb/

And we did a bit better, 88%, with our basic feed-forward network with some dropout.