🚨**NOTE** 🚨: to run the notebooks move them to the main dir. Simply

```bash
cp notebook_name.ipynb ../
```

So far, in notebooks 01 and 02, I have described how to prepare the data and how to implement a Hierarchical Attention Network to classify Amazon reviews. Here I will describe how to run the experiments. The code here is in the `main_pytorch.py` file.

In [1]:
import pandas as pd
import numpy as np
import pickle
import torch
import os
import torch.nn.functional as F

from pathlib import Path
from tqdm import trange
from sklearn.metrics import accuracy_score, f1_score, precision_score
from torch.optim.lr_scheduler import CyclicLR, ReduceLROnPlateau
from torch.utils.data import TensorDataset, DataLoader

from models.pytorch_models import HierAttnNet, RNNAttn
from utils.metrics import CategoricalAccuracy
from utils.parser import parse_args

In [2]:
n_cpus = os.cpu_count()
use_cuda = torch.cuda.is_available()

In [3]:
data_dir = Path("data")
train_dir = data_dir / "train"
valid_dir = data_dir / "valid"
test_dir = data_dir / "test"

In [4]:
ftrain, fvalid, ftest = "han_train.npz", "han_valid.npz", "han_test.npz"
tokf = "HANPreprocessor.p"

As with other notebooks, and to keep the data size tractable, I will just use a sample of 1000 observations.

In [5]:
batch_size = 64
train_mtx = np.load(train_dir / ftrain)
np.random.seed(1)
idx = np.random.choice(train_mtx["X_train"].shape[0], 1000)
train_set = TensorDataset(
    torch.from_numpy(train_mtx["X_train"][idx]),
    torch.from_numpy(train_mtx["y_train"][idx]).long(),
)
train_loader = DataLoader(dataset=train_set, batch_size=batch_size, num_workers=n_cpus)

valid_mtx = np.load(valid_dir / fvalid)
np.random.seed(2)
idx = np.random.choice(valid_mtx["X_valid"].shape[0], 1000)
eval_set = TensorDataset(
    torch.from_numpy(valid_mtx["X_valid"][idx]),
    torch.from_numpy(valid_mtx["y_valid"][idx]).long(),
)
eval_loader = DataLoader(
    dataset=eval_set, batch_size=batch_size, num_workers=n_cpus, shuffle=False
)

In [6]:
next(iter(train_loader))

[tensor([[[   1,    1,    1,  ...,   70,   88,    9],
          [   1,    1,    1,  ...,   35, 3007,  841],
          [   1,    1,    1,  ...,  222,   70,    9],
          ...,
          [   1,    1,    1,  ...,    1,    1,    1],
          [   1,    1,    1,  ...,    1,    1,    1],
          [   1,    1,    1,  ...,    1,    1,    1]],
 
         [[   1,    1,    1,  ...,  109,   15,    9],
          [   1,    1,    1,  ...,   22, 1108,    9],
          [   1,    1,    1,  ...,  448,  144, 1587],
          ...,
          [   1,    1,    1,  ...,    1,    1,    1],
          [   1,    1,    1,  ...,    1,    1,    1],
          [   1,    1,    1,  ...,    1,    1,    1]],
 
         [[   1,    1,    1,  ...,   10, 6706,    9],
          [   1,    1,    1,  ...,    5,   97,    9],
          [   1,    1,    1,  ...,   66,  186,    9],
          ...,
          [   1,    1,    1,  ...,    1,    1,    1],
          [   1,    1,    1,  ...,    1,    1,    1],
          [   1,    1,    1,  .

Before I move on I want to add a comment on the data preprocessing. One can see that the amount of padding is very significant. The default length that I used for both the number of sentences in a review and the number of tokens in a sentence is the 0.8 quantile. Since 80% of the reviews and sentences are shorter than those lengths, this of course implies that there is going to be a lot of padding. A priori, this does not represent a problem. In general, the network should learn that the padding token is irrelevant and, furthermore, when using Pytorch we can pass a `padding_idx` param to the embedding layer. This way, the layer returns a fixed vector (zeros by default) whenever it encounters that index, and that vector is not updated during training. Having said this, let's move on.
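
To make the `padding_idx` behaviour concrete, here is a minimal sketch on a toy embedding layer. The sizes and token ids are made up for illustration; only `padding_idx=1` mirrors the padding index visible in the batch printed above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes, for illustration only
embed = nn.Embedding(num_embeddings=10, embedding_dim=4, padding_idx=1)

# One "sentence" of 5 tokens, left-padded with the padding index 1
tokens = torch.tensor([[1, 1, 1, 7, 9]])
out = embed(tokens)

print(out.shape)   # torch.Size([1, 5, 4])
print(out[0, :3])  # the three padded positions map to all-zero vectors

# The row at padding_idx receives no gradient, so it stays at zero
out.sum().backward()
print(embed.weight.grad[1])  # all zeros
```

So the padded positions contribute zero vectors to the RNN input, and the optimizer never moves that row of the embedding matrix.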

### HAN model

In [7]:
tok = pickle.load(open(data_dir / tokf, "rb"))

In [8]:
tok

<utils.preprocessors.HANPreprocessor at 0x13856d208>

In [9]:
model = HierAttnNet(
    vocab_size=len(tok.vocab.stoi),
    maxlen_sent=tok.maxlen_sent,
    maxlen_doc=tok.maxlen_doc,
    word_hidden_dim=32,
    sent_hidden_dim=32,
    padding_idx=1,
    embed_dim=50,
    weight_drop=0.,
    embed_drop=0.,
    locked_drop=0.,
    last_drop=0.,
    embedding_matrix=None,
    num_class=4,
)

In [10]:
model

HierAttnNet(
  (wordattnnet): WordAttnNet(
    (lockdrop): LockedDropout()
    (word_embed): Embedding(23611, 50, padding_idx=1)
    (rnn): GRU(50, 32, batch_first=True, bidirectional=True)
    (word_attn): AttentionWithContext(
      (attn): Linear(in_features=64, out_features=64, bias=True)
      (contx): Linear(in_features=64, out_features=1, bias=False)
    )
  )
  (sentattnnet): SentAttnNet(
    (rnn): GRU(64, 32, batch_first=True, bidirectional=True)
    (sent_attn): AttentionWithContext(
      (attn): Linear(in_features=64, out_features=64, bias=True)
      (contx): Linear(in_features=64, out_features=1, bias=False)
    )
  )
  (ld): Dropout(p=0.0, inplace=False)
  (fc): Linear(in_features=64, out_features=4, bias=True)
)

Before I move forward let me comment on the dropout-related parameters and the implementation that one can find in the `models` module. When I started running experiments I noticed that the model overfitted quite early. In fact, in some cases, the best validation loss was attained in the very first epoch (while training loss and metrics kept improving). 

When overfitting occurs, one has a few options to avoid it, such as:

1. Reduce model complexity
2. Early Stop 
3. Data Augmentation 
4. Regularization (Dropout, Label Smoothing, ...)

The first one is explored throughout the different experiments I run (see notebook 04 for more details) and the second one is always used via the `early_stopping` function. Here I have ignored the third one, and perhaps I should use it, for two reasons: it normally leads to good improvements and I already have the code. In the dir [amazon_reviews_classification_with_EDA](https://github.com/jrzaurin/nlp-stuff/tree/master/amazon_reviews_classification_with_EDA) I explore the use of Easy Data Augmentation ([Jason Wei and Kai Zou, 2019](https://arxiv.org/pdf/1901.11196.pdf)) for the same task as here (predicting scores) but using tf-idf and topic modeling. There I describe why EDA is not particularly well suited for text processing approaches that do not consider the text as a sequence (e.g. tf-idf). So, when I think about it, I am using EDA there, where it is not expected to lead to much improvement, when I should be using it here. In any case, the code is there, so sooner or later I will bring it here and run the experiment.

On the other hand, I have indeed explored regularization in the form of Dropout, lots of Dropout (if you want to have a look at Label Smoothing for Pytorch see [here](https://github.com/eladhoffer/utils.pytorch/blob/master/cross_entropy.py)). With that in mind I decided to use the Dropout implementations in the fantastic [work](https://arxiv.org/pdf/1708.02182.pdf) of Stephen Merity, Nitish Shirish Keskar and Richard Socher: Regularizing and Optimizing LSTM Language Models. There, among many other things, they discuss 3 forms of dropout: Embedding Dropout, Weight Dropout and Locked Dropout.

In fact, within the `models` module there are 3 submodules named `embed_regularize.py`, `locked_dropout.py` and `weight_dropout.py`. The code in there is **taken directly** from the original implementation at the [Salesforce repo](https://github.com/salesforce/awd-lstm-lm). The adaptations are minimal, simply adjusting the code to newer versions of `Pytorch` and a few minor style-related changes. Other than that, it is a "**copy-paste**" of the code in their repo, so all credit goes to the 3 authors of the paper and the code. `last_drop` is simply dropout before the last fully connected layer.

Let me comment a bit on what these dropout mechanisms do (**NOTE**: do not use the code below as it is. Use the code in the modules mentioned before).

**Embedding Dropout** 

This is discussed in Section 4.3 in their paper and a simplified version is shown in the following lines of code:

```python
mask = embed.weight.data.new().resize_((embed.weight.size(0), 1)).bernoulli_(
    1 - dropout
).expand_as(embed.weight) / (1 - dropout)

masked_embed_weight = mask * embed.weight
```

This creates a 0/1 mask along the 0-dim (i.e. words) in the `embed` Tensor and then expands that mask along the 1-dim (i.e. embed dimension). In other words, we drop words in the vocabulary with probability `dropout`. 
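
For readers who want to run something, the same idea can be sketched with current Pytorch calls. This is an illustrative rewrite under my own assumptions, not the code in `embed_regularize.py`: the helper name `embedded_dropout` and the toy sizes are chosen here for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def embedded_dropout(embed, words, dropout=0.1, training=True):
    """Drop whole rows (words) of the embedding matrix, then look up `words`."""
    if training and dropout > 0:
        # One Bernoulli draw per vocabulary entry: shape (vocab_size, 1)
        keep = torch.bernoulli(
            torch.full((embed.weight.size(0), 1), 1 - dropout, device=embed.weight.device)
        )
        # Broadcast over the embedding dimension and rescale the surviving rows
        masked_weight = embed.weight * keep / (1 - dropout)
    else:
        masked_weight = embed.weight
    return F.embedding(words, masked_weight, padding_idx=embed.padding_idx)


torch.manual_seed(0)
embed = nn.Embedding(10, 4, padding_idx=1)  # toy sizes
words = torch.tensor([[1, 1, 3, 5, 7]])
out = embedded_dropout(embed, words, dropout=0.5)
print(out)
```

Each token vector in `out` is either entirely zero (the word was dropped, or it is the padding token) or the original embedding scaled by `1 / (1 - dropout)`. With `training=False` the lookup is identical to calling `embed(words)`.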

**Weight Dropout**

This is discussed in Section 2 in their paper and their original implementation is [here](https://github.com/salesforce/awd-lstm-lm/blob/32fcb42562aeb5c7e6c9dec3f2a3baaaf68a5cb5/weight_drop.py). In their own words: *"We propose the use of DropConnect ([Wan et al., 2013](http://yann.lecun.com/exdb/publis/pdf/wan-icml-13.pdf)) on the recurrent hidden to hidden weight matrices which does not require any modifications to an RNN’s formulation."*

Again, a simplified version in code is (again, **do not** use this code as it is):

```python
import torch
import torch.nn as nn


class WeightDrop(nn.Module):
    def __init__(self, module, weights, dropout=0, verbose=True):
        super(WeightDrop, self).__init__()
        self.module = module
        self.weights = weights
        self.dropout = dropout        
        self.verbose = verbose
        self._setup()

    def _setup(self):
        for name_w in self.weights:
            if self.verbose:
                print("Applying weight drop of {} to {}".format(self.dropout, name_w))
            w = getattr(self.module, name_w)
            del self.module._parameters[name_w]
            self.module.register_parameter(name_w + "_raw", nn.Parameter(w.data))

    def _setweights(self):
        for name_w in self.weights:
            raw_w = getattr(self.module, name_w + "_raw")
            w = nn.Parameter(
                torch.nn.functional.dropout(raw_w, p=self.dropout, training=self.training)
            )
            setattr(self.module, name_w, w)

    def forward(self, *args):
        self._setweights()
        return self.module.forward(*args)
```

Let's see what this does. `WeightDrop` will first copy and register the hidden-to-hidden weights (or, in general terms, the weights named in the list `weights`) with the suffix `_raw`. Then, on every forward pass, we apply dropout to the raw weights and assign the result back to the `module`. Their original implementation includes a so-called `variational` version also explained in the paper (please, read the paper).
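
To see what dropout on weights (rather than on activations) looks like in isolation, here is a minimal sketch applied to the hidden-to-hidden matrix of a small GRU. It only illustrates the DropConnect step and is not the `WeightDrop` wrapper itself; the layer sizes and `p` are assumptions picked for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
rnn = nn.GRU(input_size=50, hidden_size=32, batch_first=True)

# Hidden-to-hidden weights of the first layer: shape (3 * hidden_size, hidden_size)
raw_w = rnn.weight_hh_l0
p = 0.5

# DropConnect: individual weights are zeroed, survivors are scaled by 1 / (1 - p)
dropped_w = F.dropout(raw_w, p=p, training=True)
frac_zero = (dropped_w == 0).float().mean().item()
print(tuple(raw_w.shape), round(frac_zero, 2))

# With training=False the weights are returned untouched
same_w = F.dropout(raw_w, p=p, training=False)
```

Roughly half of the entries of `dropped_w` are zero, which is the mask that a wrapper like `WeightDrop` re-draws before each forward pass of the wrapped RNN.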

This implementation has a couple of drawbacks that I will discuss in the next notebooks. There are implementations of these dropout mechanisms other than the original one discussed here (but inspired by it, of course). For example, the `text` API of the `fastai` library has a very neat [implementation](https://github.com/fastai/fastai/blob/master/fastai/text/models/awd_lstm.py#L75). Another nice [implementation](https://github.com/dmlc/gluon-nlp/blob/8869e795b683ff52073b556cd24e1d06cf9952ac/src/gluonnlp/model/utils.py#L34) is found in `Mxnet`'s `gluonnlp` library, which I have also used here (although only the `Pytorch` implementation is discussed here, I have also implemented the full HAN model using `Mxnet`).

And finally

**Locked Dropout**

Simply:

```python
class LockedDropout(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x, dropout=0.5):
        if not self.training or not dropout:
            return x
        mask = x.data.new(1, x.size(1), x.size(2)).bernoulli_(1 - dropout) / (1 - dropout)
        mask = mask.expand_as(x)
        return mask * x
```

After having explained the previous two mechanisms, this does not require much explanation. In short, it generates a single mask over the last two dimensions of the 3-dim input Tensor and expands that mask along the 0-dim. For example, when applied to a Tensor of shape `(batch_size, seq_length, embed_dim)`, it will create a mask of shape `(1, seq_length, embed_dim)` and apply that same mask to every observation in the batch.
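
A quick way to check that the mask really is shared across the batch is to repeat the mask logic on a toy tensor of ones. The sizes below are arbitrary and only serve the illustration.

```python
import torch

torch.manual_seed(0)
dropout = 0.5
batch_size, seq_length, embed_dim = 4, 6, 8
x = torch.ones(batch_size, seq_length, embed_dim)

# Same mask logic as in LockedDropout.forward: one mask for the whole batch
mask = x.new_empty(1, x.size(1), x.size(2)).bernoulli_(1 - dropout) / (1 - dropout)
out = mask.expand_as(x) * x

print(mask.shape)  # torch.Size([1, 6, 8])
# Every observation in the batch is masked at exactly the same positions
print(all(torch.equal(out[0], out[i]) for i in range(batch_size)))  # True
```

Standard `nn.Dropout` would instead draw an independent mask for every element of the batch.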

Once we have the model, we need the remaining Pytorch components:

In [11]:
optimizer = torch.optim.AdamW(model.parameters())
# This class is at the utils module
metric = CategoricalAccuracy()
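
The `CategoricalAccuracy` class is not shown in this notebook. Judging only from how it is called in the training loop (`metric.reset()` and `metric(probs, y)` returning the accuracy so far), a running accuracy metric could look roughly like the sketch below. `RunningAccuracy` is a hypothetical stand-in written for this example, and the actual class in `utils.metrics` may well differ.

```python
import torch


class RunningAccuracy:
    """Hypothetical stand-in for CategoricalAccuracy; the real class may differ."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Counters accumulated over all batches seen since the last reset
        self.correct = 0
        self.total = 0

    def __call__(self, y_pred, y_true):
        # y_pred: (batch_size, num_class) scores or probabilities
        self.correct += (y_pred.argmax(dim=1) == y_true).sum().item()
        self.total += y_true.size(0)
        return round(self.correct / self.total, 3)


running_metric = RunningAccuracy()
probs_1 = torch.tensor([[0.7, 0.1, 0.1, 0.1], [0.1, 0.6, 0.2, 0.1]])
acc_1 = running_metric(probs_1, torch.tensor([0, 2]))  # 1 of 2 correct
probs_2 = torch.tensor([[0.1, 0.1, 0.1, 0.7], [0.2, 0.2, 0.5, 0.1]])
acc_2 = running_metric(probs_2, torch.tensor([3, 2]))  # 3 of 4 correct so far
print(acc_1, acc_2)  # 0.5 0.75
```

Resetting at the start of each train or eval step, as the functions in this notebook do, keeps the reported accuracy local to that pass over the data.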

And the standard train and validation steps, along with an early stop function

In [12]:
def train_step(model, optimizer, train_loader, epoch, metric):
    model.train()
    metric.reset()
    train_steps = len(train_loader)
    running_loss = 0
    with trange(train_steps) as t:
        for batch_idx, (data, target) in zip(t, train_loader):
            t.set_description("epoch %i" % (epoch + 1))

            X = data.cuda() if use_cuda else data
            y = target.cuda() if use_cuda else target

            optimizer.zero_grad()
            y_pred = model(X)
            loss = F.cross_entropy(y_pred, y)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
            avg_loss = running_loss / (batch_idx + 1)
            acc = metric(F.softmax(y_pred, dim=1), y)

            t.set_postfix(acc=acc, loss=avg_loss)


def eval_step(model, eval_loader, metric, is_test=False):
    model.eval()
    metric.reset()
    eval_steps = len(eval_loader)
    running_loss = 0
    preds = []
    with torch.no_grad():
        with trange(eval_steps) as t:
            for batch_idx, (data, target) in zip(t, eval_loader):
                if is_test:
                    t.set_description("test")
                else:
                    t.set_description("valid")

                X = data.cuda() if use_cuda else data
                y = target.cuda() if use_cuda else target

                y_pred = model(X)
                loss = F.cross_entropy(y_pred, y)
                running_loss += loss.item()
                avg_loss = running_loss / (batch_idx + 1)
                acc = metric(F.softmax(y_pred, dim=1), y)
                if is_test:
                    preds.append(y_pred)
                t.set_postfix(acc=acc, loss=avg_loss)

    return avg_loss, preds


def early_stopping(curr_value, best_value, stop_step, patience):
    if curr_value <= best_value:
        stop_step, best_value = 0, curr_value
    else:
        stop_step += 1
    if stop_step >= patience:
        print("Early stopping triggered. log:{}".format(best_value))
        stop = True
    else:
        stop = False
    return best_value, stop_step, stop
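
To see how `patience` interacts with the loss history, the sketch below repeats the `early_stopping` function so that it runs on its own and feeds it a made-up sequence of validation losses (the numbers are invented for the illustration, not results from the model).

```python
def early_stopping(curr_value, best_value, stop_step, patience):
    # Same logic as the function defined in the cell above
    if curr_value <= best_value:
        stop_step, best_value = 0, curr_value
    else:
        stop_step += 1
    if stop_step >= patience:
        print("Early stopping triggered. log:{}".format(best_value))
        stop = True
    else:
        stop = False
    return best_value, stop_step, stop


val_losses = [1.15, 1.12, 1.13, 1.14, 1.10]  # made-up values
best_loss, stop_step, patience = 1e6, 0, 2
for epoch, val_loss in enumerate(val_losses):
    best_loss, stop_step, stop = early_stopping(val_loss, best_loss, stop_step, patience)
    if stop:
        break

print(epoch, best_loss)  # 3 1.12
```

With `patience=2` the loop stops at the second consecutive epoch without improvement, so the final value of 1.10 is never reached. A larger `patience` trades extra epochs for a chance to recover from a temporary bump in the validation loss.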

And with that we are good to go. Note that one could just define the train/eval functions so that they run all the epochs. Normally I prefer to code the steps and run them in a loop. It is a matter of taste and also depends on the code structure. In this particular case, I will leave it as it is.

To run the model simply

In [13]:
metric = CategoricalAccuracy()
n_epochs = 4
eval_every = 1
patience = 1
stop_step = 0
best_loss = 1e6
for epoch in range(n_epochs):
    train_step(model, optimizer, train_loader, epoch, metric)
    if epoch % eval_every == (eval_every - 1):
        val_loss, _ = eval_step(model, eval_loader, metric)
        best_loss, stop_step, stop = early_stopping(
            val_loss, best_loss, stop_step, patience
        )
    if stop:
        break

epoch 1: 100%|██████████| 16/16 [00:04<00:00,  3.73it/s, acc=0.542, loss=1.25]
valid: 100%|██████████| 16/16 [00:00<00:00, 18.30it/s, acc=0.577, loss=1.15]
epoch 2: 100%|██████████| 16/16 [00:04<00:00,  3.76it/s, acc=0.584, loss=1.12]
valid: 100%|██████████| 16/16 [00:00<00:00, 18.35it/s, acc=0.577, loss=1.12]
epoch 3: 100%|██████████| 16/16 [00:04<00:00,  3.73it/s, acc=0.584, loss=1.09]
valid: 100%|██████████| 16/16 [00:00<00:00, 18.47it/s, acc=0.577, loss=1.11]
epoch 4: 100%|██████████| 16/16 [00:04<00:00,  3.77it/s, acc=0.584, loss=1.08]
valid: 100%|██████████| 16/16 [00:00<00:00, 18.60it/s, acc=0.577, loss=1.09]


And that's it. If you have a look at the `main_pytorch.py` script you will see that, of course, there are a number of additional add-ons to the training/validation/test process, but the main bits and pieces have been discussed here.