**NOTE**: to run the notebooks, move them to the main dir. Simply:

```bash
cp notebook_name.ipynb ../
```

So far, in notebooks 01 and 02, I have described how to prepare the data and how to implement a Hierarchical Attention Network to classify Amazon reviews. Here I will describe how to run the experiments.

Let's start by importing the necessary tooling

In [1]:
import pandas as pd
import numpy as np
import pickle
import torch
import os
import torch.nn.functional as F

from pathlib import Path
from tqdm import trange
from sklearn.metrics import accuracy_score, f1_score, precision_score
from torch.optim.lr_scheduler import CyclicLR, ReduceLROnPlateau
from torch.utils.data import TensorDataset, DataLoader

from models.pytorch_models import HierAttnNet, RNNAttn
from utils.metrics import CategoricalAccuracy
from utils.parser import parse_args

In [2]:
n_cpus = os.cpu_count()
use_cuda = torch.cuda.is_available()

In [3]:
data_dir = Path("data")
train_dir = data_dir / "train"
valid_dir = data_dir / "valid"
test_dir = data_dir / "test"

In [4]:
ftrain, fvalid, ftest = "han_train.npz", "han_valid.npz", "han_test.npz"
tokf = "HANPreprocessor.p"

As with the other notebooks, and to keep the data size tractable, I will use just a sample of 1000 observations

In [5]:
batch_size = 64
train_mtx = np.load(train_dir / ftrain)
np.random.seed(1)
idx = np.random.choice(train_mtx["X_train"].shape[0], 1000)
train_set = TensorDataset(
    torch.from_numpy(train_mtx["X_train"][idx]),
    torch.from_numpy(train_mtx["y_train"][idx]).long(),
)
train_loader = DataLoader(dataset=train_set, batch_size=batch_size, num_workers=n_cpus)

valid_mtx = np.load(valid_dir / fvalid)
np.random.seed(2)
idx = np.random.choice(valid_mtx["X_valid"].shape[0], 1000)
eval_set = TensorDataset(
    torch.from_numpy(valid_mtx["X_valid"][idx]),
    torch.from_numpy(valid_mtx["y_valid"][idx]).long(),
)
eval_loader = DataLoader(
    dataset=eval_set, batch_size=batch_size, num_workers=n_cpus, shuffle=False
)
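One detail worth keeping in mind about the sampling above: `np.random.choice` draws with replacement by default, so the 1000 indices may contain repeats. The sketch below illustrates the difference on a made-up dataset size (`n_obs` is a hypothetical value, not the size of the real data) and shows the `replace=False` variant in case distinct observations are wanted.

```python
import numpy as np

n_obs, sample_size = 5000, 1000  # hypothetical sizes, for illustration only

np.random.seed(1)
# Default behaviour: sampling WITH replacement, so indices can repeat
idx_with = np.random.choice(n_obs, sample_size)

# replace=False guarantees that every sampled index is distinct
idx_without = np.random.choice(n_obs, sample_size, replace=False)

print("distinct indices (with replacement):   ", len(np.unique(idx_with)))
print("distinct indices (without replacement):", len(np.unique(idx_without)))
```

For a quick demonstration run like this one either choice is fine; the distinction matters more when the sample is small relative to what you want to measure.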

In [6]:
next(iter(train_loader))

[tensor([[[   1,    1,    1,  ...,   70,   88,    9],
          [   1,    1,    1,  ...,   35, 3007,  841],
          [   1,    1,    1,  ...,  222,   70,    9],
          ...,
          [   1,    1,    1,  ...,    1,    1,    1],
          [   1,    1,    1,  ...,    1,    1,    1],
          [   1,    1,    1,  ...,    1,    1,    1]],
 
         [[   1,    1,    1,  ...,  109,   15,    9],
          [   1,    1,    1,  ...,   22, 1108,    9],
          [   1,    1,    1,  ...,  448,  144, 1587],
          ...,
          [   1,    1,    1,  ...,    1,    1,    1],
          [   1,    1,    1,  ...,    1,    1,    1],
          [   1,    1,    1,  ...,    1,    1,    1]],
 
         [[   1,    1,    1,  ...,   10, 6706,    9],
          [   1,    1,    1,  ...,    5,   97,    9],
          [   1,    1,    1,  ...,   66,  186,    9],
          ...,
          [   1,    1,    1,  ...,    1,    1,    1],
          [   1,    1,    1,  ...,    1,    1,    1],
          [   1,    1,    1,  .

### HAN model

In [7]:
with open(data_dir / tokf, "rb") as f:
    tok = pickle.load(f)

In [8]:
tok

<utils.preprocessors.HANPreprocessor at 0x13856d208>

In [9]:
model = HierAttnNet(
    vocab_size=len(tok.vocab.stoi),
    maxlen_sent=tok.maxlen_sent,
    maxlen_doc=tok.maxlen_doc,
    word_hidden_dim=32,
    sent_hidden_dim=32,
    padding_idx=1,
    embed_dim=50,
    weight_drop=0.,
    embed_drop=0.,
    locked_drop=0.,
    last_drop=0.,
    embedding_matrix=None,
    num_class=4,
)

In [10]:
model

HierAttnNet(
  (wordattnnet): WordAttnNet(
    (lockdrop): LockedDropout()
    (word_embed): Embedding(23611, 50, padding_idx=1)
    (rnn): GRU(50, 32, batch_first=True, bidirectional=True)
    (word_attn): AttentionWithContext(
      (attn): Linear(in_features=64, out_features=64, bias=True)
      (contx): Linear(in_features=64, out_features=1, bias=False)
    )
  )
  (sentattnnet): SentAttnNet(
    (rnn): GRU(64, 32, batch_first=True, bidirectional=True)
    (sent_attn): AttentionWithContext(
      (attn): Linear(in_features=64, out_features=64, bias=True)
      (contx): Linear(in_features=64, out_features=1, bias=False)
    )
  )
  (ld): Dropout(p=0.0, inplace=False)
  (fc): Linear(in_features=64, out_features=4, bias=True)
)
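To make the printed architecture a bit more concrete, here is a minimal sketch of what an `AttentionWithContext` block with those two linear layers could look like. The layer names and sizes come from the printout above; the forward pass (tanh projection, softmax over the sequence, weighted sum) is my assumption of the usual HAN formulation, and the class name `AttentionWithContextSketch` is made up for this example. The code in `models` may differ in its details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionWithContextSketch(nn.Module):
    """Illustrative attention layer; not the implementation in `models`."""

    def __init__(self, hidden_dim):
        super().__init__()
        # Same shapes as in the printout: Linear(64, 64) and Linear(64, 1, bias=False)
        self.attn = nn.Linear(hidden_dim, hidden_dim)
        self.contx = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, h):
        # h: (batch, seq_len, hidden_dim), e.g. the bidirectional GRU outputs
        u = torch.tanh(self.attn(h))
        # One weight per position, normalised over the sequence dimension
        a = F.softmax(self.contx(u), dim=1)  # (batch, seq_len, 1)
        # Weighted sum of the GRU outputs
        s = (a * h).sum(dim=1)  # (batch, hidden_dim)
        return a, s


torch.manual_seed(0)
h = torch.randn(2, 5, 64)  # 2 sequences of 5 steps, 2 * 32 hidden units
weights, summary = AttentionWithContextSketch(64)(h)
print(weights.shape, summary.shape)
```

In the HAN setting this block would be applied twice: over the words of each sentence, and then over the resulting sentence vectors of each document.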

Before I move forward let me comment on the dropout-related parameters. You will see 4 types of dropout:

1. `embed_drop`
2. `weight_drop`
3. `locked_drop`
4. `last_drop`

The first 3 are taken directly from the work of Stephen Merity, Nitish Shirish Keskar and Richard Socher in their 2017 [paper](https://arxiv.org/pdf/1708.02182.pdf): *Regularizing and Optimizing LSTM Language Models*. 

In fact, within the `models` module there are 3 submodules named `embed_regularize.py`, `locked_dropout.py` and `weight_dropout.py`. The code in there is **taken directly** from the original implementation at the [Salesforce repo](https://github.com/salesforce/awd-lstm-lm). The adaptations are minimal: adjusting the code to newer versions of `PyTorch` and a few minor style-related changes. Other than that, it is a "copy-paste" of the code in their repo, so all credit goes to the 3 authors of the paper and the code. The `last_drop` is simply dropout before the last fully connected layer.

It is beyond the scope of this notebook to dive into the details of these dropout methods. I intend to write a companion Medium post where I will discuss them in more detail, although in all honesty, they are simple, brilliant ideas for applying regularization. You can easily understand what these methods do by having a quick look at the code that the authors wrote and, of course, by reading the paper.
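For a rough intuition, two of these ideas can be sketched in a few lines. This is my own simplified illustration, not the code from the Salesforce repo nor from the `models` module: the function names, the `(batch, seq_len, features)` layout and the default probabilities are assumptions made for the example. Locked dropout samples one mask per sequence and reuses it at every time step, while embedding dropout zeroes entire rows of the embedding matrix, i.e. whole words. Weight dropout, which applies dropout to the hidden-to-hidden weights of the RNN, is not sketched here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def locked_dropout(x, p=0.5, training=True):
    """Apply the same dropout mask at every time step of a sequence.

    Assumes x has shape (batch, seq_len, features).
    """
    if not training or p == 0:
        return x
    # One mask per sequence and feature, broadcast along the time dimension
    mask = x.new_empty(x.size(0), 1, x.size(2)).bernoulli_(1 - p) / (1 - p)
    return x * mask


def embedding_dropout(embed, words, p=0.1, training=True):
    """Drop whole words by zeroing entire rows of the embedding matrix."""
    weight = embed.weight
    if training and p > 0:
        # One Bernoulli draw per vocabulary entry, rescaled to keep the expectation
        mask = weight.new_empty(weight.size(0), 1).bernoulli_(1 - p) / (1 - p)
        weight = weight * mask
    return F.embedding(words, weight, padding_idx=embed.padding_idx)


torch.manual_seed(0)
x = torch.ones(2, 4, 6)
out = locked_dropout(x, p=0.5)

embed = nn.Embedding(10, 3, padding_idx=1)
words = torch.tensor([[2, 3, 1], [4, 2, 1]])
emb = embedding_dropout(embed, words, p=0.5)
print(out[0], emb.shape)
```

Printing `out[0]` shows identical rows: whichever features were dropped at the first time step are dropped at all of them, which is the difference from standard dropout.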

The reason for adding this much dropout is that, when I started running experiments, I noticed that the model overfitted quite early. In fact, in some cases, the best validation loss was attained in the very first epoch.

When overfitting occurs, one has 3 options to avoid it:

1. Reduce model complexity
2. Early stopping (although if the process stops at a suboptimal solution, this only guarantees that we will not end up using a worse one; it does not fix the fact that the current solution is itself suboptimal)
3. Regularization

The first two are a given, via parameter space exploration and a standard early stopping function I implemented. To account for the third one, I used the aforementioned dropout mechanisms.

Note that there are other implementations of these dropout mechanisms besides the original (and inspired by it, of course). For example, the `text` API of the fastai library has a very neat [implementation](https://github.com/fastai/fastai/blob/master/fastai/text/models/awd_lstm.py#L75). Another nice [implementation](https://github.com/dmlc/gluon-nlp/blob/8869e795b683ff52073b556cd24e1d06cf9952ac/src/gluonnlp/model/utils.py#L34) is found in `MXNet`'s `gluonnlp` library, which I have also used (although only the `PyTorch` implementation is discussed in this notebook, I have also implemented the full HAN model using `MXNet`).

Having said all of the above, let's move on, shall we? 

Once we have the model, we need the remaining PyTorch components:

In [11]:
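# The training and validation steps move the inputs to the GPU when one is
# available, so the model has to live there too
if use_cuda:
    model = model.cuda()
optimizer = torch.optim.AdamW(model.parameters())
# This class is in the utils module
metric = CategoricalAccuracy()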
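`CategoricalAccuracy` lives in `utils` and is not shown in this notebook. Judging only from how it is used in the steps defined next (a `reset()` method and a call that takes predictions and targets and returns the accuracy so far), a running accuracy metric could be sketched roughly as follows. The class name `RunningAccuracySketch` and its internals are assumptions for illustration; the real class may be implemented differently.

```python
import torch


class RunningAccuracySketch:
    """Accumulates correct predictions over batches (illustrative only)."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.correct = 0
        self.total = 0

    def __call__(self, y_prob, y_true):
        # y_prob: (batch, num_class) scores or probabilities; y_true: (batch,) labels
        self.correct += (y_prob.argmax(dim=1) == y_true).sum().item()
        self.total += y_true.size(0)
        return self.correct / self.total


metric_sketch = RunningAccuracySketch()
batch1 = torch.tensor([[0.7, 0.1, 0.1, 0.1], [0.1, 0.6, 0.2, 0.1]])
batch2 = torch.tensor([[0.2, 0.2, 0.5, 0.1], [0.4, 0.3, 0.2, 0.1]])
acc1 = metric_sketch(batch1, torch.tensor([0, 1]))  # 2 of 2 correct
acc2 = metric_sketch(batch2, torch.tensor([2, 3]))  # 3 of 4 correct so far
print(acc1, acc2)
```

Because the counts accumulate across calls, the value shown in the progress bar is the accuracy over all batches seen in the current epoch, which is why `reset()` is called at the start of each step.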

And the standard train and validation steps, along with an early stopping function

In [12]:
def train_step(model, optimizer, train_loader, epoch, metric):
    model.train()
    metric.reset()
    train_steps = len(train_loader)
    running_loss = 0
    with trange(train_steps) as t:
        for batch_idx, (data, target) in zip(t, train_loader):
            t.set_description("epoch %i" % (epoch + 1))

            X = data.cuda() if use_cuda else data
            y = target.cuda() if use_cuda else target

            optimizer.zero_grad()
            y_pred = model(X)
            loss = F.cross_entropy(y_pred, y)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
            avg_loss = running_loss / (batch_idx + 1)
            acc = metric(F.softmax(y_pred, dim=1), y)

            t.set_postfix(acc=acc, loss=avg_loss)


def eval_step(model, eval_loader, metric, is_test=False):
    model.eval()
    metric.reset()
    eval_steps = len(eval_loader)
    running_loss = 0
    preds = []
    with torch.no_grad():
        with trange(eval_steps) as t:
            for batch_idx, (data, target) in zip(t, eval_loader):
                if is_test:
                    t.set_description("test")
                else:
                    t.set_description("valid")

                X = data.cuda() if use_cuda else data
                y = target.cuda() if use_cuda else target

                y_pred = model(X)
                loss = F.cross_entropy(y_pred, y)
                running_loss += loss.item()
                avg_loss = running_loss / (batch_idx + 1)
                acc = metric(F.softmax(y_pred, dim=1), y)
                if is_test:
                    preds.append(y_pred)
                t.set_postfix(acc=acc, loss=avg_loss)

    return avg_loss, preds


def early_stopping(curr_value, best_value, stop_step, patience):
    if curr_value <= best_value:
        stop_step, best_value = 0, curr_value
    else:
        stop_step += 1
    if stop_step >= patience:
        print("Early stopping triggered. log:{}".format(best_value))
        stop = True
    else:
        stop = False
    return best_value, stop_step, stop
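To see how `early_stopping` behaves, the small sketch below feeds it a made-up sequence of validation losses with `patience=2`. The function body is repeated only so that the snippet runs on its own, and the loss values are invented for the example.

```python
def early_stopping(curr_value, best_value, stop_step, patience):
    # Same logic as the function above, repeated so the snippet is self-contained
    if curr_value <= best_value:
        stop_step, best_value = 0, curr_value
    else:
        stop_step += 1
    if stop_step >= patience:
        print("Early stopping triggered. log:{}".format(best_value))
        stop = True
    else:
        stop = False
    return best_value, stop_step, stop


fake_losses = [1.15, 1.12, 1.13, 1.14, 1.10]  # invented validation losses
best, steps_without_improvement, stopped_at = 1e6, 0, None
for epoch, loss in enumerate(fake_losses):
    best, steps_without_improvement, stop = early_stopping(
        loss, best, steps_without_improvement, patience=2
    )
    if stop:
        stopped_at = epoch
        break

print(stopped_at, best)
```

The loop stops at the fourth value (index 3), after two consecutive epochs without improvement, so the final 1.10 is never reached. This is the trade-off behind the `patience` parameter: a small value saves time but can stop before a later improvement.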

And with that we are good to go. Note that one could just define the train/eval functions so that they run all the epochs. Normally I prefer to code the steps and run them in a loop. It is a matter of taste, and it also depends on the code structure. In this particular case, I will leave it as it is.

To run the model, simply:

In [13]:
metric = CategoricalAccuracy()
n_epochs = 4
eval_every = 1
patience = 1
stop_step = 0
stop = False  # defined up front so the check below also works when eval_every > 1
best_loss = 1e6
for epoch in range(n_epochs):
    train_step(model, optimizer, train_loader, epoch, metric)
    if epoch % eval_every == (eval_every - 1):
        val_loss, _ = eval_step(model, eval_loader, metric)
        best_loss, stop_step, stop = early_stopping(
            val_loss, best_loss, stop_step, patience
        )
    if stop:
        break

epoch 1: 100%|██████████| 16/16 [00:04<00:00,  3.73it/s, acc=0.542, loss=1.25]
valid: 100%|██████████| 16/16 [00:00<00:00, 18.30it/s, acc=0.577, loss=1.15]
epoch 2: 100%|██████████| 16/16 [00:04<00:00,  3.76it/s, acc=0.584, loss=1.12]
valid: 100%|██████████| 16/16 [00:00<00:00, 18.35it/s, acc=0.577, loss=1.12]
epoch 3: 100%|██████████| 16/16 [00:04<00:00,  3.73it/s, acc=0.584, loss=1.09]
valid: 100%|██████████| 16/16 [00:00<00:00, 18.47it/s, acc=0.577, loss=1.11]
epoch 4: 100%|██████████| 16/16 [00:04<00:00,  3.77it/s, acc=0.584, loss=1.08]
valid: 100%|██████████| 16/16 [00:00<00:00, 18.60it/s, acc=0.577, loss=1.09]


And that's it. If you have a look at the `run_pytorch.py` script you will see that, of course, there are a number of additional add-ons to the training/validation/test process, but the main bits and pieces have been discussed here.
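As a final illustration, when `eval_step` is called with `is_test=True` it returns a list with the raw model outputs (logits) for each batch. One plausible way of turning that list into the `sklearn` metrics imported at the top of the notebook is sketched below. The logits and labels here are invented, and the choice of `average="weighted"` is my assumption; this is not necessarily what `run_pytorch.py` does.

```python
import numpy as np
import torch
from sklearn.metrics import accuracy_score, f1_score, precision_score

# Invented stand-in for the `preds` list returned by eval_step(..., is_test=True):
# one (batch, num_class) tensor of logits per batch
preds = [
    torch.tensor([[2.0, 0.0, 0.0, 0.0], [0.0, 3.0, 0.0, 0.0]]),
    torch.tensor([[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 5.0]]),
]
y_true = np.array([0, 1, 2, 2])  # invented labels

# Concatenate the batches and take the most likely class for each observation
y_pred = torch.cat(preds).argmax(dim=1).cpu().numpy()

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
prec = precision_score(y_true, y_pred, average="weighted", zero_division=0)
print(acc, f1, prec)
```

Since the softmax is monotonic, taking the `argmax` of the logits gives the same predicted class as taking it over the probabilities.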