I have run a number of experiments which are detailed in the file `run_experiments.sh`. 

Here I will discuss them in some detail. Note that, in total, I have **manually** run 40 experiments for the Pytorch implementation and 19 for the mxnet implementation. Of course, this is **NOT** the best way to find the best set of parameters. Ideally, one would wrap up the process into an objective function and use [`hyperopt`](http://hyperopt.github.io/hyperopt/), leaving the hyperparameter search running for a few days. 

I intend to include a script in the repo to do precisely that. For now, I just wanted to get an idea of the potential of HANs to predict Amazon review scores relative to the other techniques I have used in this repo. 
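As a rough idea of what such a script could look like, here is a minimal sketch. It is not the script from the repo: `train_and_evaluate` is a hypothetical placeholder that returns a made-up loss so the example runs on its own, and the search ranges are assumptions loosely based on the values tried in the manual experiments.

```python
import math


def train_and_evaluate(params):
    """Hypothetical stand-in for one training run.

    In a real set-up this would train a HAN with `params` and return the
    validation loss. Here it returns a made-up smooth function of the
    parameters so that the sketch is self-contained.
    """
    return (math.log10(params["lr"]) + 3.0) ** 2 + (params["drp"] - 0.2) ** 2


def objective(params):
    # hyperopt minimises the float returned by the objective
    return train_and_evaluate(params)


def run_search(max_evals=50):
    # Imported lazily so the functions above work without hyperopt installed
    from hyperopt import Trials, fmin, hp, tpe

    # Assumed search space, not the one used for the results in this notebook
    space = {
        "lr": hp.loguniform("lr", math.log(1e-4), math.log(1e-2)),
        "wdc": hp.choice("wdc", [0.0, 0.01, 0.05]),
        "bsz": hp.choice("bsz", [64, 128, 256, 512]),
        "whd": hp.choice("whd", [32, 64, 128]),
        "drp": hp.uniform("drp", 0.0, 0.5),
    }
    trials = Trials()
    best = fmin(fn=objective, space=space, algo=tpe.suggest,
                max_evals=max_evals, trials=trials)
    # For hp.choice entries `best` holds the index of the chosen option,
    # not the value itself (hyperopt.space_eval converts it back)
    return best, trials
```

Calling `run_search()` would then replace the manual loop over set-ups, with `trials` keeping the full history of every evaluated configuration.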

Let's have a look

In [12]:
import pandas as pd

from pathlib import Path

In [13]:
results_path = Path("results/")
pytorch_fname = "results_df.csv"
mxnet_fname = "mx_results_df.csv" 

The tables below show the evaluation metrics obtained **on the test dataset**, along with the best epoch during training.

In [14]:
res_pytorch = pd.read_csv(results_path / pytorch_fname)
res_pytorch.sort_values(['f1'], ascending=False).reset_index(drop=True)

Unnamed: 0,modelname,loss,acc,f1,prec,best_epoch
0,han_lr_0.0005_wdc_0.05_bsz_512_whd_128_shd_128_emb_300_drp_0.2_sch_cycliclr_cycl_4_lrp_no_pre_no,0.676955,0.722332,0.709809,0.703325,3
1,han_lr_0.001_wdc_0.05_bsz_512_whd_128_shd_128_emb_300_drp_0.2_sch_reducelronplateau_cycl_no_lrp_...,0.672373,0.720424,0.707064,0.700494,6
2,han_lr_0.001_wdc_0.01_bsz_64_whd_64_shd_64_emb_300_drp_0.2_sch_reducelronplateau_cycl_no_lrp_2_p...,0.677667,0.72046,0.706726,0.701034,4
3,han_lr_0.001_wdc_0.01_bsz_128_whd_64_shd_64_emb_300_drp_0.2_sch_no_cycl_no_lrp_no_pre_no,0.679628,0.721108,0.706106,0.70032,6
4,han_lr_0.0005_wdc_0.05_bsz_512_whd_128_shd_128_emb_300_drp_0.2_sch_cycliclr_cycl_2_lrp_no_pre_no,0.684018,0.718407,0.704798,0.698459,5
5,han_lr_0.001_wdc_0.01_bsz_64_whd_64_shd_64_emb_300_drp_0.2_sch_no_cycl_no_lrp_no_pre_no,0.681003,0.71693,0.703556,0.698334,4
6,han_lr_0.001_wdc_0.01_bsz_512_whd_64_shd_64_emb_300_drp_0.2_sch_no_cycl_no_lrp_no_pre_no,0.684646,0.718155,0.70243,0.696197,7
7,han_lr_0.001_wdc_0.01_bsz_128_whd_32_shd_32_emb_50_drp_0.0_sch_no_cycl_no_lrp_no_pre_no,0.702976,0.708827,0.698351,0.693251,3
8,han_lr_0.001_wdc_0.01_bsz_128_whd_64_shd_64_emb_50_drp_0.0_sch_no_cycl_no_lrp_no_pre_no,0.693445,0.711528,0.698188,0.694117,2
9,han_lr_0.001_wdc_0.05_bsz_128_whd_32_shd_32_emb_100_drp_0.0_sch_no_cycl_no_lrp_no_pre_no,0.696123,0.712897,0.698097,0.692957,2


You will notice the lengthy names. Again, this is not the best way to "store" the set-up used per experiment. In the `main_pytorch.py` script I have included a piece of code that is a more elegant solution if one wanted to automate the process. For the relatively quick experimentation I wanted to do here, these lengthy names are more convenient. 

Let me explain how to read them with an example: 

`han_lr_0.0005_wdc_0.05_bsz_512_whd_128_shd_128_emb_300_drp_0.2_sch_cycliclr_cycl_4_lrp_no_pre_no`

This run corresponds to a HAN model (RNN is also available) with:

* learning rate (lr) = 0.0005
* weight decay (wdc) = 0.05
* batch size (bsz) = 512
* Word GRU hidden dim (whd) = 128
* sentence GRU hidden dim (shd) = 128
* word embedding dim (emb) = 300
* Embedding, Weight, Locked and "Last" dropout (drp) = 0.2
* using a cyclic learning rate scheduler (sch) with 4 cycles (cycl) with lr ranging from 0.0005 to 0.005
* Learning rate patience (lrp) does not apply, since it is only used with ReduceLROnPlateau
* No pretrained (pre) word embeddings
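The naming convention above can also be decoded programmatically. The helper below is a small sketch, not part of the repo (`parse_modelname` and `_convert` are names I made up for illustration). It assumes that, after the model prefix, the name is a sequence of `key_value` pairs starting at `lr`.

```python
def _convert(raw):
    """Turn a raw token into None, int, float or str."""
    if raw == "no":
        return None  # 'no' marks an option that was not used
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def parse_modelname(modelname):
    """Split a run name such as 'han_lr_0.001_wdc_0.01_...' into a dict."""
    tokens = modelname.split("_")
    # Everything before the first 'lr' token is the model prefix
    # ('han' for the Pytorch runs, 'han_mx' for the mxnet ones)
    start = tokens.index("lr")
    config = {"model": "_".join(tokens[:start])}
    pairs = tokens[start:]
    if len(pairs) % 2 != 0:
        raise ValueError(f"cannot pair keys and values in {modelname!r}")
    for key, raw in zip(pairs[0::2], pairs[1::2]):
        config[key] = _convert(raw)
    return config


name = ("han_lr_0.0005_wdc_0.05_bsz_512_whd_128_shd_128_emb_300_"
        "drp_0.2_sch_cycliclr_cycl_4_lrp_no_pre_no")
config = parse_modelname(name)
print(config)
```

Note that this only works on full names. The names that pandas truncates with `...` in the tables cannot be parsed reliably.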

In [15]:
res_mxnet = pd.read_csv(results_path / mxnet_fname)
res_mxnet.sort_values(['f1'], ascending=False).reset_index(drop=True)

Unnamed: 0,modelname,loss,acc,f1,prec,best_epoch
0,han_mx_lr_0.01_wdc_0.0_bsz_128_whd_32_shd_32_emb_50_drp_0.0_sch_no_step_no_pre_no,0.713377,0.702921,0.694185,0.693122,0
1,han_mx_lr_0.01_wdc_0.0_bsz_128_whd_32_shd_32_emb_50_drp_0.0_sch_multifactorscheduler_step_248_pr...,0.713377,0.702921,0.694185,0.693122,0
2,han_mx_lr_0.01_wdc_0.0_bsz_256_whd_32_shd_32_emb_50_drp_0.0_sch_no_step_no_pre_no,0.707275,0.703785,0.685044,0.684165,0
3,han_mx_lr_0.01_wdc_0.0_bsz_128_whd_32_shd_32_emb_50_drp_0.2_sch_no_step_no_pre_no,0.717142,0.695394,0.68173,0.681444,0
4,han_mx_lr_0.01_wdc_0.0_bsz_128_whd_64_shd_64_emb_300_drp_0.2_sch_no_step_no_pre_no,0.753628,0.682861,0.669988,0.662521,0
5,han_mx_lr_0.01_wdc_0.0_bsz_64_whd_32_shd_32_emb_50_drp_0.0_sch_no_step_no_pre_no,0.731422,0.699247,0.669075,0.663576,0
6,han_mx_lr_0.01_wdc_0.0_bsz_512_whd_32_shd_32_emb_50_drp_0.0_sch_no_step_no_pre_no,0.709781,0.704685,0.668295,0.670837,1
7,han_mx_lr_0.01_wdc_0.001_bsz_128_whd_64_shd_64_emb_200_drp_0.0_sch_multifactorscheduler_step_248...,0.729639,0.696834,0.666595,0.660634,17
8,han_mx_lr_0.01_wdc_0.001_bsz_512_whd_32_shd_32_emb_50_drp_0.0_sch_multifactorscheduler_step_248_...,0.729468,0.693737,0.66571,0.658832,8
9,han_mx_lr_0.01_wdc_0.001_bsz_128_whd_16_shd_16_emb_50_drp_0.0_sch_no_step_no_pre_no,0.753661,0.685922,0.66438,0.656559,11


### Comments

#### Overfitting and Dropout

As I mentioned in the previous notebook, when I started to run experiments I noticed that the model overfitted quite early, especially in the case of `MxNet` (I will come back to the `MxNet` implementation later). To mitigate that I used a number of dropout mechanisms (see Notebook 03), along with exploring relatively simple architectures and early stopping. Still, the best loss/metrics are reached quite early during the fitting process. However, these are not too far from the best training loss/metrics. In other words, the overfitting is not that severe after a small number of epochs. Of course, there is still more to try, like even simpler architectures, starting with lower learning rates or higher weight decay values. I will leave this for future exploration (or to you, the reader, if you are willing to 🙂)

The fact that the model quickly finds the best solution is not necessarily bad. It might mean that the problem is relatively simple for the algorithm. For example, maybe only a small number of tokens are required to make good decisions (we'll see this in the next and final notebook). Better preprocessing might also help to reduce overfitting (I will also discuss this a bit more in the next notebook).
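For readers unfamiliar with early stopping, the idea can be sketched in a few lines. This is a generic, patience-based version and not necessarily how the training scripts in this repo implement it. The class name and the `patience` and `min_delta` parameters are assumptions for illustration.

```python
class EarlyStopping:
    """Stop training once the validation loss stops improving."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience    # epochs to wait without improvement
        self.min_delta = min_delta  # minimum decrease that counts as improvement
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch and return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            # New best: remember it and reset the counter
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

Inside a training loop one would call `stopper.step(epoch, val_loss)` after each validation pass and break when it returns `True`. `stopper.best_epoch` then plays the role of the `best_epoch` column in the tables above.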

#### Hierarchy and Attention 

In the code in this repo I also include the option of using a "simple" stack of RNNs (LSTMs, referred to as the RNN model) with or without Attention, ignoring word/sentence hierarchy. The truth is that, when using the RNN model, I have explored the parameter space less than with the HAN model. However, the results seem to suggest that Hierarchy does make a difference. Test losses and metrics are worse when using the RNN model compared to those obtained using the HAN model. However, one might also observe in the table that the RNN model overfits less than the HAN model: in some cases, metrics are still improving after 20 epochs (the maximum number of epochs per experiment). This, along with the already mentioned fact that I have not properly explored the parameter space in the case of the RNN model, suggests that better results are still attainable. 

For example, when using the RNN model I have only used Attention without context. Maybe using a context vector makes a significant difference 🤷🏻‍♂️. As with some other comments I made throughout the notebooks, I will try a few more set-ups when I revisit this repo. 

#### MxNet

The code in this repo represents my second dive into `MxNet` (after the implementation of neural collaborative filtering [here](https://github.com/jrzaurin/neural_cf)). The more I use this DL framework the more I like it. It is written in a way that "makes sense" and, even though I still don't think it is as mature as `Pytorch`, I can see it catching up soon. Of course, that last sentence might be influenced by the fact that I have not used `MxNet` as much as I have used `Pytorch`. With that in mind, I will start by saying that the exploration process has been a bit lighter with `MxNet`. The results that I obtained were a bit worse than those with `Pytorch`, but I am sure that has to do with the fact that I do not know some of the details behind `MxNet`'s implementations. Therefore, if you, reading these lines, are an `MxNet` expert or simply know about `MxNet`, I would love to have a chat with you about my `MxNet` implementation of HANs. 

#### Results

So far, the best results I obtained when predicting the review scores were with *tf-idf* + `LightGBM`, properly tuned using `Hyperopt`. The metrics were: 

| acc | f1 | pre |
|---|---|---|
| 0.7054 | 0.6832 | 0.676335 |

Here, the best results are:

| acc | f1 | pre |
|---|---|---|
| 0.7223 | 0.71 | 0.7033 |

These results are obtained in less than 10 epochs with a model that trains in less than 2 min per epoch on a Tesla K80 GPU. 

Of course, the question is: is it worth choosing HAN over tf-idf + LightGBM? To me the answer is, as with most things in life, "it depends". The increase is not that big, but it is significant. For example, if in your business getting a $\sim$3 percentage point increase in F1 score represents a sizeable increase in revenue or savings, then there is no question. Also, as we will see in the next notebook, the attention weights might provide you with insights into which expressions or semantic constructions are relevant within the text, beyond keywords. 
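To make the size of the improvement concrete, the gains can be computed directly from the numbers reported above. The snippet below is a small sketch: the LightGBM values come from the first table in this section and the HAN values from the top row of the Pytorch results table, while `metric_gains` is just a helper name I made up.

```python
# Metrics reported in this notebook
lightgbm = {"acc": 0.7054, "f1": 0.6832, "prec": 0.676335}
han = {"acc": 0.722332, "f1": 0.709809, "prec": 0.703325}


def metric_gains(baseline, candidate):
    """Absolute (percentage points) and relative (%) gain per metric."""
    gains = {}
    for name, base in baseline.items():
        diff = candidate[name] - base
        gains[name] = {
            "points": 100 * diff,             # absolute gain in percentage points
            "relative_pct": 100 * diff / base,  # gain relative to the baseline
        }
    return gains


gains = metric_gains(lightgbm, han)
for name, g in gains.items():
    print(f"{name}: +{g['points']:.2f} points ({g['relative_pct']:.2f}% relative)")
```

Reporting both the absolute and the relative gain helps when translating the improvement into business terms, since the two can tell slightly different stories.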