# Recurrent Neural Networks

- Afshine Amidi and Shervine Amidi: [Recurrent Neural Networks Cheatsheet](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks)
- Chris Olah: [Understanding LSTM Networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- Chris Olah, Shan Carter: [Attention and Augmented Recurrent Neural Networks](https://distill.pub/2016/augmented-rnns)
- Andreas Madsen: [Visualizing memorization in RNNs](https://distill.pub/2019/memorization-in-rnns)
- Michael Nguyen: [Illustrated Guide to Recurrent Neural Networks](https://towardsdatascience.com/illustrated-guide-to-recurrent-neural-networks-79e5eb8049c9)
- Michael Nguyen: [Illustrated Guide to LSTM’s and GRU’s: A step by step explanation](https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21)

---

- [An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling](https://arxiv.org/abs/1803.01271)
  > "_For most deep learning practitioners, sequence modeling is synonymous with recurrent networks. Yet recent results indicate that convolutional architectures can outperform recurrent networks on tasks such as audio synthesis and machine translation. Given a new sequence modeling task or dataset, which architecture should one use? We conduct a systematic evaluation of generic convolutional and recurrent architectures for sequence modeling. The models are evaluated across a broad range of standard tasks that are commonly used to benchmark recurrent networks. Our results indicate that a simple convolutional architecture outperforms canonical recurrent networks such as LSTMs across a diverse range of tasks and datasets, while demonstrating longer effective memory. We conclude that the common association between sequence modeling and recurrent networks should be reconsidered, and convolutional networks should be regarded as a natural starting point for sequence modeling tasks. To assist related work, we have made code available at this http URL._"
  
---

## Motivation

Natural language does not usually come in neatly packaged fixed-length sequences. The simple feed-forward neural network from part 1, however, assumed that we can cut the documents down to 100 words. For some applications and data sources this may be appropriate, for others not. For instance, cutting tweets down to 100 or 200 words would not be unreasonable, but cutting Wikipedia articles would be.

Recurrent neural networks get around this limitation on the sequence length, at least to some extent. Given infinite time they can consume an infinite sequence. The network maintains an internal state that keeps track of what is important and what isn't. There are [lots of different kinds of recurrent neural networks](https://en.wikipedia.org/wiki/Recurrent_neural_network); the main difference between them is the way in which the internal state is updated and maintained.

In [None]:
import utils

gnad_train, gnad_test = utils.load_gnad()

In [None]:
# what fraction of the documents exceed our document length threshold of 100 words?
sum(len(d.split()) > 100 for d in gnad_train.text) / len(gnad_train)

In [None]:
long_articles = [idx for idx, d in enumerate(gnad_train.text) if len(d.split()) > 100]
print(gnad_train.text[long_articles[13]])

---

# The Idea Behind RNNs

![](./img/1280px-Recurrent_neural_network_unfold.svg.png)


<sub>Image By François Deloche, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=60109157</sub>

# Elman Network

$$ h_t = \sigma{(W_hx_t + U_hh_{t-1}+b_h)} $$
$$ y_t = \sigma{(W_yh_t + b_y)}$$
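The two equations can be traced step by step with a minimal NumPy sketch. The layer sizes, the random weights and the all-zero initial state $h_0$ are assumptions chosen for illustration; they are not taken from the notebook.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def elman_forward(xs, W_h, U_h, b_h, W_y, b_y):
    """Run an Elman network over xs of shape (seq_len, input_size)."""
    h = np.zeros(U_h.shape[0])  # h_0, assumed to be all zeros
    hs, ys = [], []
    for x_t in xs:
        h = sigmoid(W_h @ x_t + U_h @ h + b_h)  # h_t depends on x_t and h_{t-1}
        y = sigmoid(W_y @ h + b_y)              # y_t is read off the current state
        hs.append(h)
        ys.append(y)
    return np.stack(hs), np.stack(ys)


rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 3, 4, 2  # illustrative sizes
params = dict(
    W_h=rng.normal(size=(hidden_size, input_size)),
    U_h=rng.normal(size=(hidden_size, hidden_size)),
    b_h=np.zeros(hidden_size),
    W_y=rng.normal(size=(output_size, hidden_size)),
    b_y=np.zeros(output_size),
)

# the same weights handle sequences of any length
for seq_len in (2, 7):
    hs, ys = elman_forward(rng.normal(size=(seq_len, input_size)), **params)
    print(seq_len, hs.shape, ys.shape)
```

The parameters are shared across time steps, which is why the sequence length does not have to be fixed in advance.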

---

# An Elman Network in Practice

PyTorch has an implementation of an Elman network as the base `nn.RNN` class. The class enables uni- or bi-directional training and supports an arbitrary number of network layers.
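As a rough guide to the shapes involved, the sketch below runs a random input through a two-layer bidirectional `nn.RNN`. The sizes are illustrative assumptions. Note that `nn.RNN` uses `tanh` as its default nonlinearity and only returns hidden states, so the output layer $y_t$ from the equations above has to be added separately (for example with `nn.Linear`).

```python
import torch
from torch import nn

seq_len, batch_size, input_size, hidden_size = 6, 1, 300, 10  # illustrative sizes

rnn = nn.RNN(input_size, hidden_size, num_layers=2, bidirectional=True)

# the default layout is (seq_len, batch, input_size); batch_first=True changes it
x = torch.randn(seq_len, batch_size, input_size)
output, h_n = rnn(x)

# output: top-layer hidden state at every time step, both directions concatenated
print(output.shape)  # torch.Size([6, 1, 20])
# h_n: final hidden state for every layer and direction
print(h_n.shape)     # torch.Size([4, 1, 10])
```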

In [None]:
from gensim.models import KeyedVectors

import numpy as np

import torch
from torch import nn


ft_vec = KeyedVectors.load_word2vec_format('./cc.de.300.vec.SMALL.gz')
word2idx = dict((w, idx+1) for (idx, w) in enumerate(ft_vec.index2word))

In [None]:
vectors = np.concatenate([np.random.rand(1, 300), ft_vec.vectors])
emb = nn.Embedding.from_pretrained(torch.FloatTensor(vectors))
rnn = nn.RNN(ft_vec.vectors.shape[1], 10, num_layers=1)

sentence = 'Die neue BER wird bald geöffnet'.split()
# index 0 is reserved for unknown words (the random vector prepended above)
word_idx = [word2idx.get(w, 0) for w in sentence]

In [None]:
len(word2idx)

In [None]:
word_idx

In [None]:
# we can feed individual documents through the embedding layer ...
emb(torch.LongTensor(word_idx))

In [None]:
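# ... and then feed the embedded document through the RNN
# nn.RNN expects (seq_len, batch, input_size) by default, so add the batch dimension at dim=1
output, hn = rnn(emb(torch.LongTensor(word_idx)).unsqueeze(dim=1))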

In [None]:
output.size()

In [None]:
hn.size()

In [None]:
output

In [None]:
hn

---

Load German word embeddings from fastText.

In [None]:
from gensim.models import KeyedVectors
ft_vec_DE = KeyedVectors.load_word2vec_format('./cc.de.300.vec.gz')

In [None]:
word2idx = dict((w, idx+1) for (idx, w) in enumerate(ft_vec_DE.index2word))

In [None]:
from sklearn.preprocessing import LabelEncoder
import utils

gnad_train, gnad_test = utils.load_gnad()
label_encoder = LabelEncoder()

# turn all the data into integer indices
X_train = [[word2idx.get(w, 0) for w in doc.split()] for doc in gnad_train.text]
y_train = label_encoder.fit_transform(gnad_train.category)

In [None]:
import numpy as np
from torch import nn

UNK = np.random.rand(1, 300)
vectors = np.concatenate([UNK, ft_vec_DE.vectors], axis=0)

# A strange training loop (??)

Let's train an LSTM classifier that uses the last hidden state from each sequence (document) as the representation for a classification task.

In [None]:
from time import time
from torch.nn.utils import clip_grad_norm_

bidirectional = False
emb = nn.Embedding.from_pretrained(torch.FloatTensor(vectors))
lstm = nn.LSTM(vectors.shape[1], 64, num_layers=1, bidirectional=bidirectional, dropout=0.01)
classifier = nn.Linear(lstm.hidden_size if not bidirectional else lstm.hidden_size * 2, len(label_encoder.classes_))
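# optimize the classifier together with the LSTM; the pretrained embeddings stay frozen
optimizer = torch.optim.Adam(list(lstm.parameters()) + list(classifier.parameters()), lr=1e-2)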

In [None]:
num_epochs = 1
h, c = None, None
lossfct = nn.CrossEntropyLoss()
for _ in range(num_epochs):
    train_loss = 0
    start_time = time()
    for i_step, (X_, y_) in enumerate(zip(X_train, y_train), 1):
        X_ = torch.LongTensor(X_)
        y_ = torch.LongTensor([y_])
        
        # run the word indices through the embedding layer and then the LSTM
        embed = emb(X_).unsqueeze(dim=1)
        output, *_ = lstm(embed)
        output = classifier(output[-1, :, :])

        loss = lossfct(output, y_)
        train_loss += loss.item()
        loss.backward()
        if i_step > 0 and i_step % 64 == 0:
            delta = time() - start_time
            avg_delta = delta / (i_step // 64)
            clip_grad_norm_(lstm.parameters(), 1.0)
            optimizer.step()
            optimizer.zero_grad()
            n_total = len(X_train) // 64
            n_remaining = n_total - (i_step // 64)
            print('iter', i_step // 64, '/', n_total, f'{delta:.2f}s', f'{avg_delta*n_remaining:.2f}s', f'{train_loss / i_step:.4f}')
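The loop above calls `backward()` for every document but only steps the optimizer every 64 documents, which amounts to gradient accumulation. Because the per-document losses are not scaled, the accumulated gradient is the sum over the 64 documents rather than their mean, before clipping is applied. The following sketch, with a hypothetical toy linear model and random data, shows that scaling each loss by the number of accumulated examples reproduces the gradient of an ordinary mini-batch.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 3)          # toy stand-in for the LSTM + classifier
lossfct = nn.CrossEntropyLoss()
X = torch.randn(8, 4)            # 8 random "documents"
y = torch.randint(0, 3, (8,))

# gradient of the mean loss over the full batch
model.zero_grad()
lossfct(model(X), y).backward()
batch_grad = model.weight.grad.clone()

# the same gradient, accumulated one example at a time
model.zero_grad()
for x_i, y_i in zip(X, y):
    loss = lossfct(model(x_i.unsqueeze(0)), y_i.unsqueeze(0))
    (loss / len(X)).backward()   # scale so that the accumulated sum is a mean
accumulated_grad = model.weight.grad.clone()

print(torch.allclose(batch_grad, accumulated_grad, atol=1e-6))
```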

### The training loop above goes through one document at a time, which is very inefficient and unlikely to yield good performance. However, batching together variable-length sequences requires a little bit of work.

In [None]:
import torch
from torch.utils.data import TensorDataset, DataLoader

doc_lengths = [len(doc) for doc in X_train]
longest_doc = max(doc_lengths)

data = np.zeros((len(X_train), longest_doc), dtype=np.int64)  # np.int was removed from recent NumPy versions
for i_doc, doc in enumerate(X_train):
    data[i_doc, :len(doc)] += doc

dataset = TensorDataset(torch.LongTensor(data), torch.LongTensor(doc_lengths), torch.LongTensor(y_train))
dataloader = DataLoader(dataset, batch_size=256)

In [None]:
dataset[:5]

In [None]:
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

X_tmp, len_tmp, _ = dataset[:4]
embed = emb(X_tmp)
packed = pack_padded_sequence(embed, len_tmp, enforce_sorted=False, batch_first=True)
output, *_ = lstm(packed)
X, _ = pad_packed_sequence(output, batch_first=True)
X.size()
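To make the padding and packing round trip easier to follow, here is a small self-contained sketch with a tiny unidirectional LSTM and random inputs (all sizes are assumptions). It checks that indexing the padded output at `lengths - 1` gives the same vectors as the final hidden state `h_n`, and that positions beyond a sequence's true length are zero-filled.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
lstm = nn.LSTM(input_size=5, hidden_size=3, batch_first=True)

# three padded sequences with true lengths 4, 2 and 3
lengths = torch.tensor([4, 2, 3])
padded = torch.randn(3, 4, 5)

packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)
packed_out, (h_n, c_n) = lstm(packed)
output, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

# pick the output at the last valid time step of each sequence
last_by_index = output[torch.arange(len(lengths)), lengths - 1]

# h_n holds the same vectors, already restored to the original batch order
print(torch.allclose(last_by_index, h_n[-1]))
# positions past a sequence's length are filled with zeros
print(output[1, 2:].abs().sum().item())
```

For a bidirectional LSTM the two are no longer equivalent: at position `lengths - 1` the backward direction has only seen a single token, whereas `h_n` contains the final state of each direction.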

In [None]:
from time import time
from torch.nn.utils import clip_grad_norm_

bidirectional = True
emb = nn.Embedding.from_pretrained(torch.FloatTensor(vectors))
lstm = nn.LSTM(vectors.shape[1], 64, num_layers=4, bidirectional=bidirectional, dropout=0.01)
classifier = nn.Linear(lstm.hidden_size if not bidirectional else lstm.hidden_size * 2, len(label_encoder.classes_))
optimizer_parameters = list(emb.parameters()) + list(lstm.parameters()) + list(classifier.parameters())
optimizer = torch.optim.Adam(optimizer_parameters, lr=1e-2)

In [None]:
from torch.nn.utils import clip_grad_norm_
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

from time import time
from datetime import datetime

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

num_epochs = 100
lossfct = nn.CrossEntropyLoss().to(DEVICE)
start_time = time()
for i_epoch in range(num_epochs):
    train_loss = 0
    
    lstm.train()
    classifier.train()
    emb.to(DEVICE)
    lstm.to(DEVICE)
    classifier.to(DEVICE)
    for i_step, batch in enumerate(dataloader, 1):
        X_batch, lengths, y_batch = (b.to(DEVICE) for b in batch)
        # run the word indices through the embedding layer and then the LSTM
        embed = emb(X_batch)
        
        # run the embeddings through the LSTM
        # (pack_padded_sequence expects the lengths as a CPU tensor)
        packed = pack_padded_sequence(embed, lengths.cpu(), enforce_sorted=False, batch_first=True)
        output, *_ = lstm(packed)
        output, _ = torch.nn.utils.rnn.pad_packed_sequence(output, batch_first=True)

        # run the encoded last hidden state through the classifier
        last_hidden_states = output[torch.arange(0, X_batch.size()[0]), lengths-1, :]
        output = classifier(last_hidden_states)

        loss = lossfct(output, y_batch)
        train_loss += loss.item()
        loss.backward()
        clip_grad_norm_(lstm.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad()
        
    delta = time() - start_time
    avg_delta = delta / (i_epoch + 1)
    remaining = avg_delta * (num_epochs - i_epoch - 1)
    print('epoch', i_epoch, '/', num_epochs, f'{train_loss / i_step:.4f}',
          f'avg epoch {datetime.fromtimestamp(avg_delta):%M:%S}',
          f'total elapsed {datetime.fromtimestamp(delta):%M:%S}',
          f'remaining {datetime.fromtimestamp(remaining):%M:%S}')

In [None]:
from sklearn.preprocessing import LabelEncoder
import utils

# turn all the data into integer indices
X_test = [[word2idx.get(w, 0) for w in doc.split()] for doc in gnad_test.text]
y_test = label_encoder.transform(gnad_test.category)

doc_lengths = [len(doc) for doc in X_test]
longest_doc = max(doc_lengths)

data = np.zeros((len(X_test), longest_doc), dtype=np.int64)  # np.int was removed from recent NumPy versions
for i_doc, doc in enumerate(X_test):
    data[i_doc, :len(doc)] += doc

test_dataset = TensorDataset(torch.LongTensor(data), torch.LongTensor(doc_lengths), torch.LongTensor(y_test))
test_dataloader = DataLoader(test_dataset, batch_size=32)

In [None]:
from torch.nn import functional as F

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
emb.to(DEVICE)
lstm.to(DEVICE)
classifier.to(DEVICE)
# switch to evaluation mode so that dropout is disabled
lstm.eval()
classifier.eval()

pred = []
with torch.no_grad():  # no gradients are needed for prediction
    for X_batch, lengths, y_batch in test_dataloader:
        X_batch = X_batch.to(DEVICE)
        # run the word indices through the embedding layer
        embed = emb(X_batch)

        # run the embeddings through the LSTM (lengths stay on the CPU)
        packed = pack_padded_sequence(embed, lengths, enforce_sorted=False, batch_first=True)
        output, *_ = lstm(packed)
        output, _ = pad_packed_sequence(output, batch_first=True)

        # run the encoded last hidden state through the classifier
        last_hidden_states = output[torch.arange(0, X_batch.size()[0]), lengths-1, :]
        output = classifier(last_hidden_states)

        # the predicted class is the one with the highest log-probability
        _, pred_ = F.log_softmax(output, dim=-1).max(dim=-1)
        pred.extend(pred_.cpu().numpy().tolist())

In [None]:
from sklearn import metrics

print(metrics.classification_report(y_test, pred, target_names=list(label_encoder.classes_)))

---
    ---------------------------------------------------------------------------
    embedding + LSTM (64, 1-layer) + Linear
    100 epochs
    SGD lr=0.1 / momentum=0.9
    ~ 25 minutes
    ---------------------------------------------------------------------------

                   precision    recall  f1-score   support

             Etat       0.80      0.70      0.75        67
           Inland       0.66      0.72      0.69       102
    International       0.87      0.81      0.84       151
           Kultur       0.73      0.74      0.73        54
         Panorama       0.78      0.71      0.75       168
            Sport       0.95      0.98      0.97       120
              Web       0.90      0.89      0.89       168
       Wirtschaft       0.81      0.70      0.75       141
     Wissenschaft       0.53      0.89      0.66        57

         accuracy                           0.80      1028
        macro avg       0.78      0.79      0.78      1028
     weighted avg       0.81      0.80      0.80      1028
     
     
     

    ---------------------------------------------------------------------------
    embedding + LSTM (64, 1-layer) + Linear
    100 epochs
    Adam lr=0.01
    ~ 25 minutes
    ---------------------------------------------------------------------------

                   precision    recall  f1-score   support

             Etat       0.81      0.75      0.78        67
           Inland       0.71      0.74      0.72       102
    International       0.86      0.83      0.84       151
           Kultur       0.85      0.81      0.83        54
         Panorama       0.74      0.75      0.74       168
            Sport       0.97      0.96      0.97       120
              Web       0.89      0.89      0.89       168
       Wirtschaft       0.77      0.79      0.78       141
     Wissenschaft       0.81      0.88      0.84        57

         accuracy                           0.82      1028
        macro avg       0.82      0.82      0.82      1028
     weighted avg       0.82      0.82      0.82      1028




    ---------------------------------------------------------------------------
    embedding + LSTM (128, 2-layer, bidir) + Linear
    fine tuned embeddings
    100 epochs
    Adam lr=0.01
    ~ 55 minutes
    ---------------------------------------------------------------------------

                   precision    recall  f1-score   support

             Etat       0.86      0.76      0.81        67
           Inland       0.73      0.78      0.76       102
    International       0.88      0.86      0.87       151
           Kultur       0.85      0.83      0.84        54
         Panorama       0.79      0.76      0.77       168
            Sport       0.98      0.98      0.98       120
              Web       0.90      0.91      0.91       168
       Wirtschaft       0.81      0.80      0.81       141
     Wissenschaft       0.82      0.96      0.89        57

         accuracy                           0.85      1028
        macro avg       0.85      0.85      0.85      1028
     weighted avg       0.85      0.85      0.85      1028
     
    ---------------------------------------------------------------------------
    embedding + LSTM (64, 4-layer, bidir) + Linear
    fine tuned embeddings
    100 epochs
    Adam lr=0.01
    ~ 70 minutes
    ---------------------------------------------------------------------------

                   precision    recall  f1-score   support

             Etat       0.87      0.69      0.77        67
           Inland       0.75      0.75      0.75       102
    International       0.83      0.85      0.84       151
           Kultur       0.81      0.78      0.79        54
         Panorama       0.73      0.82      0.78       168
            Sport       0.99      0.98      0.99       120
              Web       0.92      0.93      0.92       168
       Wirtschaft       0.84      0.79      0.82       141
     Wissenschaft       0.91      0.86      0.88        57

         accuracy                           0.84      1028
        macro avg       0.85      0.83      0.84      1028
     weighted avg       0.85      0.84      0.84      1028
---


# Sequence to Sequence Learning, the Encoder Decoder architecture and Attention

The document classification example above used the last hidden state of the LSTM cell(s) as _input_ to a linear classifier layer. You can think of the LSTM stack as an encoder that takes the input sequence and creates a fixed-length, real-valued, dense representation of it. This encoding, or shall we say embedding, is then used by the classifier (the _decoder_) to create the final predictions.

The usage of "_encoder_" and "_decoder_" here is a slight abuse of terminology. Normally these terms are used in the context of machine translation, where an encoder RNN consumes an input sequence and outputs an encoded state that is then used by the decoder RNN to produce an output sequence of possibly different length. The original `seq2seq` models were improved by [Bahdanau et al. (ICLR 2015)](https://arxiv.org/abs/1409.0473), who introduced an attention mechanism.

Instead of the encoder RNN producing just a single final encoded state of the input sequence, the encoder produces a series of encoded states, one for each item in the input. The decoder uses attention weights to select which parts of the input to pay attention to when producing each element in the output sequence.
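The idea can be sketched in a few lines of NumPy. This is a simplification under stated assumptions: the scores here are plain dot products between the decoder state and each encoder state, whereas Bahdanau et al. score each pair with a small feed-forward network. The function name `attend` and the sizes are made up for illustration.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()


def attend(decoder_state, encoder_states):
    """Return attention weights over the input and the resulting context vector."""
    scores = encoder_states @ decoder_state  # one score per input position
    weights = softmax(scores)                # non-negative, sums to 1
    context = weights @ encoder_states       # weighted average of encoder states
    return weights, context


rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 8))  # 5 input items, hidden size 8
decoder_state = rng.normal(size=8)        # current decoder hidden state

weights, context = attend(decoder_state, encoder_states)
print(weights.round(3), weights.sum())
print(context.shape)
```

The decoder recomputes the weights at every output step, so each output element can focus on a different part of the input.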

I won't cover attention in any more detail here, but there is a notebook tutorial in the PyTorch documentation for neural machine translation with attention.

- Bahdanau et al.: [_Neural Machine Translation by Jointly Learning to Align and Translate._](https://arxiv.org/abs/1409.0473) ICLR 2015
- Lilian Weng: [Attention, Attention!](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html)
- [NLP from scratch: translation with a sequence to sequence network and attention](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html)

---

# Contextual Embeddings and Transfer Learning

Word meaning is sensitive to context, but the pretrained word embeddings from fastText we've been using thus far are not.

Let's say we are building a system that needs to find origin and destination mentions in text. Clearly origins and destinations are all locations, but whether a location is an origin, a destination or neither is context-dependent.




    I live in Berlin and usually cycle to work.
              [LOC]
              [ - ]
    
    
    Can you book me a flight from Berlin to New York
                                  [LOC]
                                  [org]




The encoder / decoder architecture serves as a good example of how contextual word embeddings work. Similar to how the encoder RNN can be used to encode a source document, it can also be used to encode single words or parts of words _in context_.
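A toy sketch of the difference between static and contextual embeddings, using the Berlin example from above. The vocabulary, the embedding size and the untrained bidirectional LSTM encoder are assumptions for illustration only; real contextual embedding models such as those cited below are pretrained on large corpora.

```python
import torch
from torch import nn

torch.manual_seed(0)
vocab = {w: i for i, w in enumerate(
    'i live in berlin book a flight from to new york'.split())}
emb = nn.Embedding(len(vocab), 16)            # static: one vector per word type
encoder = nn.LSTM(16, 8, bidirectional=True)  # untrained, for illustration only


def encode(sentence):
    idx = torch.tensor([vocab[w] for w in sentence.split()])
    static = emb(idx)                             # (seq_len, 16)
    contextual, _ = encoder(static.unsqueeze(1))  # (seq_len, 1, 16)
    return static, contextual.squeeze(1)


s1 = 'i live in berlin'
s2 = 'book a flight from berlin to new york'
static1, ctx1 = encode(s1)
static2, ctx2 = encode(s2)
p1, p2 = s1.split().index('berlin'), s2.split().index('berlin')

# the static vector for "berlin" is identical in both sentences ...
print(torch.equal(static1[p1], static2[p2]))
# ... while the encoder output depends on the surrounding words
print(torch.allclose(ctx1[p1], ctx2[p2]))
```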

- Peters et al.: [_Deep contextualized word representations._](https://www.aclweb.org/anthology/N18-1202/) NAACL-HLT 2018: 2227-2237
- Akbik et al.: [_Pooled Contextualized Embeddings for Named Entity Recognition._](https://www.aclweb.org/anthology/N19-1078/) NAACL-HLT (1) 2019: 724-728
- Howard et al.: [_Universal Language Model Fine-tuning for Text Classification._](https://www.aclweb.org/anthology/P18-1031/) ACL (1) 2018: 328-339
- Jeremy Howard and Sebastian Ruder: [Introducing state of the art text classification with universal language models](http://nlp.fast.ai/classification/2018/05/15/introducing-ulmfit.html), 15 May 2018

---

# Playground

### Pre-trained AWD-LSTM from fast.ai

In [None]:
!curl -O http://files.fast.ai/models/wt103/fwd_wt103.h5

In [None]:
!curl -O http://files.fast.ai/models/wt103/fwd_wt103_enc.h5

In [None]:
!curl -O http://files.fast.ai/models/wt103/itos_wt103.pkl

In [None]:
# assumes fastai v1; the API changed substantially in v2
from fastai import text
from fastai.datasets import URLs, untar_data

path = untar_data(URLs.IMDB_SAMPLE)

In [None]:
data_lm = text.data.TextLMDataBunch.from_csv(path, 'texts.csv')
awd_lstm = text.language_model_learner(data_lm, text.AWD_LSTM)

In [None]:
awd_lstm.beam_search('What made this the hugely successful triumph it was? Was it casting, music, imagination, ingenuity, or luck?',
                     n_words=25,
                     temperature=0.95,
                     top_k=25,
                     beam_sz=250)