# 3. Capture patterns with recurrent neural networks

## 3.1 Recurrent neural networks
In the previous module, we used rich semantic representations of text with a simple linear classifier on
top of the embeddings. This architecture captures the aggregated meaning of the words in a sentence, but it
ignores word order, because the aggregation operation applied to the embeddings discards that
information. Because such models cannot represent word order, they cannot solve
more complex or ambiguous tasks such as text generation or question answering.

To capture the meaning of a text sequence, we need a different neural network architecture, called a
**recurrent neural network (RNN)**. In an RNN, we pass the sentence through the network one token at a time.
At each step the network produces a state, which we pass back into the network together with the next token.

![rnn-model](https://raw.githubusercontent.com/pengfei99/PyTorchTuto/main/notebooks/img//sample-rnn-model-generation.png)

Given an input sequence of tokens X0,...,Xn, the RNN creates a sequence of neural network blocks and trains this
sequence end-to-end using backpropagation. Each block takes a pair (Xi, Si) as input and produces S(i+1)
as a result. The final state, or the output of the last block, goes into a linear classifier to produce the result.
All blocks share the same weights and are trained together in a single backpropagation pass.
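
The unrolling described above can be sketched in a few lines. This is a minimal illustration, not the classifier used
later in this module: the sizes and the random "sentence" are assumptions, and `torch.nn.RNNCell` stands in for a
single network block that is reused at every step.

```python
import torch

torch.manual_seed(0)

# Illustrative sizes (assumptions, not values from this module)
embed_dim, hidden_dim, num_class = 8, 16, 4

cell = torch.nn.RNNCell(embed_dim, hidden_dim)       # one block, reused at every step
classifier = torch.nn.Linear(hidden_dim, num_class)  # applied to the final state

# A toy "sentence": 5 already-embedded tokens for a batch of 1
tokens = torch.randn(5, 1, embed_dim)

state = torch.zeros(1, hidden_dim)  # initial state S0
for x_i in tokens:
    # (Xi, Si) -> S(i+1); the same weights are used at each step
    state = cell(x_i, state)

logits = classifier(state)  # the final state goes into a linear classifier
print(logits.shape)
```

Because the loop reuses one `cell` object, every step shares the same parameters, and calling `backward()` on a loss
computed from `logits` would propagate gradients through all five steps at once.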

Because the state vectors S0,...,Sn are passed through the network, it can learn sequential dependencies
between words. For example, when the word *not* appears somewhere in the sequence, the network can learn to negate
certain elements of the state vector, which results in negation.

## 3.2 A simple example of RNN

In this example, we will use a simple RNN to classify the AG News dataset.

In [2]:
import torch
import torchtext
from torchnlp import *
from torchtext.vocab import vocab
from collections import Counter, OrderedDict


In [3]:
# download data
def load_dataset(storage_path):
    print("Loading dataset...")
    train_dataset, test_dataset = torchtext.datasets.AG_NEWS(root=storage_path)
    train_dataset = list(train_dataset)
    test_dataset = list(test_dataset)
    return train_dataset, test_dataset

path = "/tmp/pytorch/data"
train, test = load_dataset(path)


Loading dataset...


In [4]:
# build a simple tokenizer
my_tokenizer = torchtext.data.utils.get_tokenizer('basic_english')

# build label class list
label_classes = ['World', 'Sports', 'Business', 'Sci/Tech']

# build a vocabulary from the tokens of all texts in the dataset
def build_vocabulary(dataset, tokenizer, ngrams=1, min_freq=1):
    # use a Counter so that we keep track of each token's frequency
    counter = Counter()
    # iterate over all rows, convert the text to word tokens (or n-grams), and count them
    for (label, line) in dataset:
        counter.update(torchtext.data.utils.ngrams_iterator(tokenizer(line), ngrams=ngrams))
    # sort the collected tokens by frequency, most frequent first
    sorted_by_token_freq_tuples = sorted(counter.items(), key=lambda x: x[1], reverse=True)
    # keep the sorted order in an OrderedDict of token -> frequency
    words_dict = OrderedDict(sorted_by_token_freq_tuples)
    # build the vocabulary, dropping tokens that occur fewer than min_freq times
    return vocab(words_dict, min_freq=min_freq)

# build a vocab
my_vocab = build_vocabulary(train, my_tokenizer)


def encode(text, vocabulary, tokenizer):
    return [vocabulary[word] for word in tokenizer(text)]

my_vocab_size = len(my_vocab)
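
To make the counting and encoding steps concrete, here is a small standard-library sketch of the same idea. The names
`toy_tokenizer`, `toy_vocab`, and `toy_encode` are hypothetical stand-ins for the torchtext tokenizer and vocabulary
used above, and the alphabetical tie-break is an assumption added only to make the result deterministic.

```python
from collections import Counter


def toy_tokenizer(text):
    # Hypothetical stand-in for the 'basic_english' tokenizer: lowercase, split on whitespace
    return text.lower().split()


corpus = ["the cat sat", "the dog sat down"]

# Count how often each token occurs across the corpus
counter = Counter()
for line in corpus:
    counter.update(toy_tokenizer(line))

# Most frequent tokens get the smallest indices; ties are broken alphabetically (assumption)
ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
toy_vocab = {token: index for index, (token, _) in enumerate(ordered)}


def toy_encode(text):
    # Map each token to its integer index in the vocabulary
    return [toy_vocab[word] for word in toy_tokenizer(text)]


print(toy_vocab)
print(toy_encode("the dog sat"))
```

As with `encode` above, a word that is missing from the vocabulary raises an error here, so this sketch only
encodes text made of known tokens.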

### 3.2.1 Build a Simple RNN Classifier

In a simple RNN, each recurrent unit is a simple linear network that takes the
**concatenated input vector and state vector** and produces a **new state vector**. PyTorch represents this unit
with the `RNNCell` class, and a network of such cells with the `RNN` layer.
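
As a rough check of this description, the sketch below recomputes one `RNNCell` step by hand as a linear map over
the concatenated input and state. The sizes are arbitrary assumptions. Note that `RNNCell` also applies a `tanh`
nonlinearity by default, so the unit is a linear layer followed by `tanh`.

```python
import torch

torch.manual_seed(0)
input_dim, hidden_dim = 3, 4  # arbitrary illustrative sizes
cell = torch.nn.RNNCell(input_dim, hidden_dim)  # default nonlinearity is tanh

x = torch.randn(2, input_dim)   # batch of 2 input vectors
h = torch.randn(2, hidden_dim)  # batch of 2 state vectors

# Place the cell's two weight matrices side by side so they act on [x, h]
weight = torch.cat([cell.weight_ih, cell.weight_hh], dim=1)  # (hidden_dim, input_dim + hidden_dim)
bias = cell.bias_ih + cell.bias_hh

# One linear map over the concatenated vector, followed by tanh
manual = torch.tanh(torch.cat([x, h], dim=1) @ weight.T + bias)

same = torch.allclose(manual, cell(x, h), atol=1e-5)
print(same)
```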

To define an RNN classifier, we first apply an embedding layer to lower the dimensionality of the input vocabulary,
and then place an RNN layer on top of it.

RNNs are quite difficult to train, so keep two important points in mind:
1. **Use a small learning rate**: once the RNN cells are unrolled along the sequence length, the
number of layers involved in backpropagation is quite large.
2. **Use a GPU if you can**: training the network on a larger dataset to get good results can take a long time,
and a GPU may reduce that training time.

In [5]:
# In this classifier, we use a padded data loader, so each batch contains padded sequences of the
# same length. The RNN layer takes the sequence of embedding tensors and produces two outputs:
# - x: the sequence of RNN cell outputs at each step
# - h: the final hidden state for the last element of the sequence
# We then average the per-step outputs and apply a fully connected linear layer to get one score per class.
class SimpleRNNClassifier(torch.nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_class):
        super().__init__()
        self.hidden_dim = hidden_dim
        # Note that our embedding layer is untrained. For better results, we could use a pre-trained embedding
        # layer with Word2Vec or GloVe embeddings, as described in the previous unit. For better understanding,
        # you might want to adapt this code to work with pre-trained embeddings.
        self.embedding = torch.nn.Embedding(vocab_size, embed_dim)
        self.rnn = torch.nn.RNN(embed_dim,hidden_dim,batch_first=True)
        self.fc = torch.nn.Linear(hidden_dim, num_class)

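    def forward(self, x):
        # x: (batch, seq_len) tensor of token indices
        x = self.embedding(x)  # (batch, seq_len, embed_dim)
        x, h = self.rnn(x)     # x: (batch, seq_len, hidden_dim), h: (1, batch, hidden_dim)
        # average the per-step outputs over the sequence, then classify
        return self.fc(x.mean(dim=1))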
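
To see what the two outputs of `torch.nn.RNN` look like, the following sketch runs a single-layer RNN on random
data. The sizes are assumptions chosen for illustration. It also shows the two summaries a classifier could use:
the mean over all step outputs, as in the classifier above, and the final hidden state.

```python
import torch

torch.manual_seed(0)
rnn = torch.nn.RNN(input_size=8, hidden_size=16, batch_first=True)
embedded = torch.randn(4, 10, 8)  # (batch, sequence length, embedding size)

outputs, h = rnn(embedded)
print(outputs.shape)  # per-step outputs: (batch, sequence length, hidden size)
print(h.shape)        # final hidden state: (num_layers, batch, hidden size)

pooled = outputs.mean(dim=1)  # average of all step outputs
last_state = h[-1]            # state after the last step

# For a single-layer, unidirectional RNN, the last step output equals the final state
print(torch.allclose(outputs[:, -1], last_state))
```

With padded batches, the mean also averages over the padded positions, and the final state is computed after the
padding tokens have been processed, so neither summary ignores padding on its own.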

### 3.2.2 Build a training loop



In [6]:
def select_hardware_for_training(device_name):
    if device_name == 'cpu':
        return 'cpu'
    elif device_name == 'gpu':
        # fall back to the CPU when no CUDA device is available
        return 'cuda' if torch.cuda.is_available() else 'cpu'
    else:
        print("Unknown device name, choose cpu as default device")
        return 'cpu'


device = select_hardware_for_training("cpu")

# define training loop
def train_loop(net, dataloader, lr=0.01, optimizer=None, loss_fn=torch.nn.CrossEntropyLoss(), epoch_size=None,
               report_freq=200):
    optimizer = optimizer or torch.optim.Adam(net.parameters(), lr=lr)
    loss_fn = loss_fn.to(device)
    net.train()
    total_loss, acc, count, i = 0.0, 0, 0, 0
    for labels, features in dataloader:
        optimizer.zero_grad()
        features, labels = features.to(device), labels.to(device)
        out = net(features)
        loss = loss_fn(out, labels)
        loss.backward()
        optimizer.step()
        # accumulate plain Python numbers so that no autograd history builds up across batches
        total_loss += loss.item()
        _, predicted = torch.max(out, 1)
        acc += (predicted == labels).sum().item()
        count += len(labels)
        i += 1
        if i % report_freq == 0:
            print(f"{count}: acc={acc / count}")
        if epoch_size and count > epoch_size:
            break
    return total_loss / count, acc / count


In [7]:
def padding_text(b):
    # b is the list of tuples of length batch_size
    #   - first element of a tuple = label,
    #   - second = feature (text sequence)
    # build vectorized sequence
    v = [encode(x[1], my_vocab, my_tokenizer) for x in b]
    # first, compute max length of a sequence in this minibatch
    l = max(map(len, v))
    return (  # tuple of two tensors - labels and features
        torch.LongTensor([t[0] - 1 for t in b]),
        torch.stack([torch.nn.functional.pad(torch.tensor(t), (0, l - len(t)), mode='constant', value=0) for t in v])
    )
# build a dataloader with text padding tensor
train_loader = torch.utils.data.DataLoader(train, batch_size=16, collate_fn=padding_text, shuffle=True)
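
The padding step inside `padding_text` can be tried in isolation. In this small sketch the token ids are made up
for illustration. Each sequence is right-padded with zeros to the length of the longest one, so that the batch can
be stacked into a single tensor.

```python
import torch

# Hypothetical token ids for three texts of different lengths
encoded = [[5, 2, 9], [7], [3, 4]]
max_len = max(map(len, encoded))

# Right-pad every sequence with 0 up to max_len, then stack into one (batch, max_len) tensor
batch = torch.stack([
    torch.nn.functional.pad(torch.tensor(seq), (0, max_len - len(seq)), mode='constant', value=0)
    for seq in encoded
])
print(batch)
```

Padding to the longest sequence in each minibatch, rather than in the whole dataset, keeps the tensors as small
as possible for that batch.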

In [8]:
net = SimpleRNNClassifier(my_vocab_size,64,32,len(label_classes)).to(device)
train_loop(net,train_loader, lr=0.001)

3200: acc=0.3134375
6400: acc=0.38953125
9600: acc=0.4515625
12800: acc=0.505859375
16000: acc=0.5450625
19200: acc=0.5766666666666667
22400: acc=0.6025446428571428
25600: acc=0.626875
28800: acc=0.6464583333333334
32000: acc=0.6628125
35200: acc=0.6766477272727273
38400: acc=0.6892447916666666
41600: acc=0.7001682692307692
44800: acc=0.7108705357142857
48000: acc=0.7194166666666667
51200: acc=0.7271484375
54400: acc=0.7348161764705883
57600: acc=0.7415104166666666
60800: acc=0.7479934210526316
64000: acc=0.753375
67200: acc=0.7586011904761905
70400: acc=0.7637073863636363
73600: acc=0.7682744565217391
76800: acc=0.77265625
80000: acc=0.776825
83200: acc=0.7809735576923077
86400: acc=0.7843518518518519
89600: acc=0.787578125


KeyboardInterrupt: 

## 3.3 Long Short Term Memory (LSTM) in RNN

One of the main problems of classical RNNs is the so-called **vanishing gradients** problem. Because RNNs are
trained end-to-end in one backpropagation pass, it is hard to propagate the error back to the first
layers of the network, and thus the network cannot learn relationships between distant tokens. One way to
avoid this problem is to introduce explicit state management by using so-called **gates**. The two best-known
architectures of this kind are:
 - Long Short Term Memory (LSTM)
 - Gated Recurrent Unit (GRU)
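
PyTorch provides both architectures as layers, `torch.nn.LSTM` and `torch.nn.GRU`, with an interface close to
`torch.nn.RNN`. The sketch below only compares what they return. The sizes are illustrative assumptions, and this
is not a trained model.

```python
import torch

torch.manual_seed(0)
embedded = torch.randn(4, 10, 8)  # (batch, sequence length, embedding size)

lstm = torch.nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
gru = torch.nn.GRU(input_size=8, hidden_size=16, batch_first=True)

# LSTM returns the per-step outputs plus a pair (hidden state, cell state)
lstm_out, (lstm_h, lstm_c) = lstm(embedded)
# GRU returns the per-step outputs plus a single hidden state
gru_out, gru_h = gru(embedded)

print(lstm_out.shape, lstm_h.shape, lstm_c.shape)
print(gru_out.shape, gru_h.shape)
```

Since the per-step outputs have the same shape as those of `torch.nn.RNN`, swapping the recurrent layer in a
classifier mainly requires handling the extra cell state that the LSTM returns.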

