# Shopee Product Classification: Other Networks

* In this notebook, we use other types of neural network building blocks to perform image classification.
* These building blocks are added to our CNN baseline model and evaluated.

The two additional types of network experiments performed in this notebook are as follows:

* Recurrent Neural Networks (RNN)
* Attention Neural Networks (Attention)

## Imports and Config

Note: This notebook was run in Colab, with the notebook located in the root of the project directory.

In [1]:
# # Uncomment and run if running with file on drive
# from google.colab import drive
# drive.mount('/content/gdrive', force_remount=True)

# import os
# os.chdir('gdrive/MyDrive/cs5242-project/cs5242-project')

Mounted at /content/gdrive


In [2]:
import torch
import torch.nn as nn

from model import dataset, trainer
from model import baseline_cnn_1, rnn_cnn, attention_cnn

In [3]:
batch_size = 32
num_epoch = 30
seed = 42

## Data Import

* As before, we use our dataset class to import the set of images across categories.
* We use 9 categories, each with 500 custom-filtered images.

In [4]:
image_dir = 'data/selected_images/'

#### Uncomment the following block if running from the `notebooks` folder
# import sys
# sys.path.insert(0, '../')
# image_dir = '../data/selected_images/'
#####

data = dataset.DataSet(max_num_img=500, crop=0.8, path=image_dir)

In [5]:
data.load_all()

100%|██████████| 500/500 [00:11<00:00, 43.08it/s] 
100%|██████████| 500/500 [06:52<00:00,  1.21it/s]
100%|██████████| 500/500 [06:51<00:00,  1.21it/s]
100%|██████████| 500/500 [06:51<00:00,  1.21it/s]
100%|██████████| 500/500 [06:52<00:00,  1.21it/s]
100%|██████████| 500/500 [06:56<00:00,  1.20it/s]
100%|██████████| 500/500 [06:58<00:00,  1.20it/s]
100%|██████████| 500/500 [06:49<00:00,  1.22it/s]
100%|██████████| 500/500 [07:00<00:00,  1.19it/s]


## Baseline Model

* Before proceeding with these networks, we evaluate our baseline model once more so that we can compare performance.

In [6]:
baseline_cnn_1_model = baseline_cnn_1.BaselineCNN1(len(data.categories))
torch.manual_seed(seed)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(baseline_cnn_1_model.parameters(), lr=4e-4)

In [7]:
mtrainer = trainer.Trainer(baseline_cnn_1_model, optimizer, criterion, data, batch_size)
mtrainer.run_train(num_epoch)

[Epoch   0]: Training loss: 1.735802 | Accuracy: 0.383810
[Epoch   0]: Validation loss: 1.682558 | Accuracy: 0.377778 | Within 3: 0.740000
[Epoch   1]: Training loss: 1.558757 | Accuracy: 0.465079
[Epoch   1]: Validation loss: 1.587951 | Accuracy: 0.455556 | Within 3: 0.773333
[Epoch   2]: Training loss: 1.455608 | Accuracy: 0.507937
[Epoch   2]: Validation loss: 1.503304 | Accuracy: 0.484444 | Within 3: 0.766667
[Epoch   3]: Training loss: 1.386036 | Accuracy: 0.522222
[Epoch   3]: Validation loss: 1.781648 | Accuracy: 0.420000 | Within 3: 0.691111
[Epoch   4]: Training loss: 1.323803 | Accuracy: 0.547619
[Epoch   4]: Validation loss: 1.615260 | Accuracy: 0.440000 | Within 3: 0.733333
[Epoch   5]: Training loss: 1.242632 | Accuracy: 0.577778
[Epoch   5]: Validation loss: 1.393525 | Accuracy: 0.515556 | Within 3: 0.797778
[Epoch   6]: Training loss: 1.197349 | Accuracy: 0.598413
[Epoch   6]: Validation loss: 1.489513 | Accuracy: 0.526667 | Within 3: 0.802222
[Epoch   7]: Training loss:

In [8]:
test_loss, test_acc, top_k, incorect_stats = mtrainer.run_test(mtrainer.testloader, 3, True)
print(f'Accuracy of the network on the test images: {test_acc*100} %')

Accuracy of the network on the test images: 64.55555555555556 %
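
The training logs above report a `Within 3` value next to the accuracy, and `run_test` is called with `3` as its second argument. Assuming this is a top-3 accuracy (the share of images whose true category is among the three highest-scoring predictions), it could be computed roughly as follows. The function `top_k_accuracy` and the toy scores are hypothetical and are not taken from the project's `trainer` module.

```python
import torch


def top_k_accuracy(logits, targets, k=3):
    """Fraction of samples whose true class is among the k highest-scoring classes."""
    # topk returns (values, indices); indices has shape (batch, k)
    top_indices = logits.topk(k, dim=1).indices
    # Compare every predicted index with the target, broadcast over the k columns
    hits = (top_indices == targets.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()


# Toy scores for 4 samples over 5 classes (illustrative values only)
logits = torch.tensor([
    [0.1, 0.5, 0.2, 0.9, 0.3],
    [0.8, 0.1, 0.4, 0.2, 0.3],
    [0.2, 0.3, 0.1, 0.6, 0.7],
    [0.9, 0.8, 0.7, 0.1, 0.2],
])
targets = torch.tensor([3, 2, 0, 4])

top1 = top_k_accuracy(logits, targets, k=1)
top3 = top_k_accuracy(logits, targets, k=3)
print(f"top-1: {top1:.2f} | top-3: {top3:.2f}")
```

With `k=1` this reduces to ordinary accuracy, which is why the top-3 value is always at least as high as the accuracy in the logs.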


## Recurrent Neural Network (RNN)

* In this approach, we add an RNN layer over the baseline CNN model we implemented.
* The RNN layer selected is a Long Short-Term Memory (LSTM) layer from the PyTorch `nn` module.
    * We keep all other convolutional blocks the same as compared to the baseline CNN model.
* The LSTM mechanism is implemented as follows:
    * After passing through the convolutional blocks, the image is split into smaller patches.
    * These patches are then passed sequentially into the LSTM model.
    * The LSTM produces one hidden state per patch, so the number of hidden states is directly proportional to the number of patches in the image.
* Following the LSTM layer, a final fully connected layer is used.
    * The adaptive average pooling layer is removed in this case.

We experimented with this combined CNN and RNN model because https://www.matec-conferences.org/articles/matecconf/pdf/2019/26/matecconf_jcmme2018_02001.pdf reports findings from a similar approach.
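
The patch-to-LSTM mechanism described above can be sketched as follows. This is a minimal illustration and not the project's `rnn_cnn.CNNWithRNN`: the class name `PatchLSTMHead`, the channel count, the patch size, and the hidden size are all assumptions, and the convolutional blocks are replaced by a random feature map.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchLSTMHead(nn.Module):
    """Hypothetical head: feature-map patches -> LSTM -> fully connected layer."""

    def __init__(self, channels, patch_size, hidden_size, num_classes, num_layers=1):
        super().__init__()
        self.patch_size = patch_size
        self.lstm = nn.LSTM(
            input_size=channels * patch_size * patch_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
        )
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, feature_map):
        # (B, C, H, W) -> (B, C * p * p, num_patches), non-overlapping patches
        patches = F.unfold(feature_map, kernel_size=self.patch_size, stride=self.patch_size)
        # The LSTM expects (B, sequence_length, features): one step per patch
        sequence = patches.transpose(1, 2)
        _, (h_n, _) = self.lstm(sequence)
        # h_n has shape (num_layers, B, hidden_size); classify from the last layer
        return self.fc(h_n[-1])


torch.manual_seed(0)
head = PatchLSTMHead(channels=64, patch_size=2, hidden_size=128, num_classes=9)
features = torch.randn(4, 64, 8, 8)  # stand-in for the convolutional output
logits = head(features)
print(logits.shape)
```

Classifying from the final hidden state is one possible design choice. Since no pooling layer is used, the LSTM itself is responsible for summarising the spatial information.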

In [9]:
rnn_cnn_model = rnn_cnn.CNNWithRNN(len(data.categories))
torch.manual_seed(seed)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(rnn_cnn_model.parameters(), lr=4e-4)

In [10]:
mtrainer = trainer.Trainer(rnn_cnn_model, optimizer, criterion, data, batch_size)
mtrainer.run_train(num_epoch)

[Epoch   0]: Training loss: 2.267701 | Accuracy: 0.180635
[Epoch   0]: Validation loss: 2.187227 | Accuracy: 0.215556 | Within 3: 0.548889
[Epoch   1]: Training loss: 1.861043 | Accuracy: 0.315873
[Epoch   1]: Validation loss: 1.695775 | Accuracy: 0.368889 | Within 3: 0.700000
[Epoch   2]: Training loss: 1.667541 | Accuracy: 0.401587
[Epoch   2]: Validation loss: 1.628065 | Accuracy: 0.384444 | Within 3: 0.768889
[Epoch   3]: Training loss: 1.530685 | Accuracy: 0.463492
[Epoch   3]: Validation loss: 1.637616 | Accuracy: 0.384444 | Within 3: 0.726667
[Epoch   4]: Training loss: 1.412013 | Accuracy: 0.506349
[Epoch   4]: Validation loss: 1.464527 | Accuracy: 0.480000 | Within 3: 0.797778
[Epoch   5]: Training loss: 1.299016 | Accuracy: 0.543492
[Epoch   5]: Validation loss: 1.504222 | Accuracy: 0.442222 | Within 3: 0.764444
[Epoch   6]: Training loss: 1.220570 | Accuracy: 0.569206
[Epoch   6]: Validation loss: 1.291953 | Accuracy: 0.551111 | Within 3: 0.831111
[Epoch   7]: Training loss:

In [11]:
test_loss, test_acc, top_k, incorect_stats = mtrainer.run_test(mtrainer.testloader, 3, True)
print(f'Accuracy of the network on the test images: {test_acc*100} %')

Accuracy of the network on the test images: 63.55555555555556 %


* We can see that the RNN model did not do as well as our baseline model and in fact led to a small reduction in performance (63.6% vs. 64.6%).
* To understand this further, we performed some parameter tuning on our model to see whether it would affect our results. The findings are explained below.

* **Increase in patch size**:
    * Increasing the patch size reduced the performance of the RNN. This made sense, since a larger patch requires the hidden cells to incorporate more information at each step, which would lead to a higher loss.
* **More stacked layers**:
    * Stacking multiple LSTM layers increased the depth of our model and allowed it to learn more features. Stacking 2 layers gave a small improvement in the score, but increasing this to 3 led to a reduction, which suggests that stacking too many layers leads to a higher degree of overfitting.
* **Removing MaxPool after convolution**:
    * We also ran an experiment with the MaxPool after the convolution layers removed, expecting that this would reduce abstraction and provide more data to the RNN. However, this made performance worse as well. It appears that the MaxPool is important before applying the RNN.
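
To make the patch-size trade-off concrete, the arithmetic below shows how the patch size changes what the LSTM sees, assuming non-overlapping square patches. The 64-channel, 16x16 feature map is a hypothetical size chosen for illustration, and `patch_sequence_shape` is not a function from the project.

```python
def patch_sequence_shape(channels, height, width, patch_size):
    """Return (sequence_length, features_per_step) for non-overlapping square patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("patch_size must divide both height and width")
    sequence_length = (height // patch_size) * (width // patch_size)
    features_per_step = channels * patch_size * patch_size
    return sequence_length, features_per_step


# Hypothetical 64-channel, 16x16 feature map after the convolutional blocks
shapes = {p: patch_sequence_shape(64, 16, 16, p) for p in (1, 2, 4)}
for p, (steps, features) in shapes.items():
    print(f"patch {p}x{p}: {steps} steps of {features} features each")
```

Each doubling of the patch side quarters the number of LSTM steps but quadruples the number of values that must be absorbed in a single step.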

* The result reported above was obtained with the best parameters found in this search.

## Attention Neural Network (Attention)

* In the first approach, attention is applied at 2 points of the baseline model, at increasing depth, corresponding to increasing granularity of the features extracted. The idea is to weigh the different granularities of features in the final classification decision, instead of using only the high-level (global) features extracted by the last convolutional layer.
* Each attention vector is computed from the local feature map at that point and the global feature map from the final convolutional layer. A custom attention layer is built which incorporates the following steps:
    * The local feature map and the final global feature map are each passed through a 1x1 convolutional block to project the features to a lower dimension.
    * The global feature map is upsampled via bilinear interpolation to match the dimensions of the local feature map.
    * The feature maps are then summed and projected to a single channel via another 1x1 convolution.
    * A softmax is then applied to the result to get the attention map.
* The attention-weighted intermediate feature maps are then concatenated with the output from the last feature layer and passed on to the classifier.

This implementation is based on the approach proposed in the paper [Melanoma Recognition via Visual Attention (Yan et al., 2019)](https://www2.cs.sfu.ca/~hamarneh/ecopy/ipmi2019.pdf), where the attention module was found to be helpful in improving the network's classification ability.
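
The steps listed above can be sketched as a small PyTorch module. This is an illustrative sketch under stated assumptions and not the project's `attention_cnn` code: the class name `AttentionBlock`, the channel sizes, the spatial softmax, and the pooling of the weighted features by a spatial sum are all choices made here for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionBlock(nn.Module):
    """Hypothetical attention layer combining a local and a global feature map."""

    def __init__(self, local_channels, global_channels, attn_channels):
        super().__init__()
        # 1x1 convolutions project both maps to the same lower dimension
        self.project_local = nn.Conv2d(local_channels, attn_channels, kernel_size=1, bias=False)
        self.project_global = nn.Conv2d(global_channels, attn_channels, kernel_size=1, bias=False)
        # Another 1x1 convolution reduces the sum to a single channel
        self.score = nn.Conv2d(attn_channels, 1, kernel_size=1)

    def forward(self, local_feat, global_feat):
        b, _, h, w = local_feat.shape
        local_proj = self.project_local(local_feat)
        global_proj = self.project_global(global_feat)
        # Upsample the global map to the spatial size of the local map
        global_proj = F.interpolate(global_proj, size=(h, w), mode="bilinear", align_corners=False)
        scores = self.score(local_proj + global_proj)  # (B, 1, H, W)
        # Softmax over all spatial positions (assumption) gives the attention map
        attn = F.softmax(scores.flatten(2), dim=2).view(b, 1, h, w)
        # Attention-weighted sum over positions yields one vector per image
        weighted = (attn * local_feat).flatten(2).sum(dim=2)  # (B, C_local)
        return attn, weighted


torch.manual_seed(0)
block = AttentionBlock(local_channels=32, global_channels=64, attn_channels=16)
local_feat = torch.randn(2, 32, 16, 16)   # intermediate feature map
global_feat = torch.randn(2, 64, 4, 4)    # output of the last convolutional layer
attn, weighted = block(local_feat, global_feat)

# Concatenate with the pooled global features before the classifier
combined = torch.cat([weighted, global_feat.mean(dim=(2, 3))], dim=1)
print(attn.shape, combined.shape)
```

With two attention points, the classifier input would hold two such weighted vectors alongside the global features.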


In [12]:
attention_cnn_model = attention_cnn.CNNWithAttention(len(data.categories))
torch.manual_seed(seed)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(attention_cnn_model.parameters(), lr=4e-4)

In [13]:
mtrainer = trainer.Trainer(attention_cnn_model, optimizer, criterion, data, batch_size)
mtrainer.run_train(num_epoch)

[Epoch   0]: Training loss: 1.768142 | Accuracy: 0.369841
[Epoch   0]: Validation loss: 1.673710 | Accuracy: 0.411111 | Within 3: 0.746667
[Epoch   1]: Training loss: 1.598776 | Accuracy: 0.454603
[Epoch   1]: Validation loss: 1.725220 | Accuracy: 0.415556 | Within 3: 0.724444
[Epoch   2]: Training loss: 1.478499 | Accuracy: 0.501270
[Epoch   2]: Validation loss: 1.664003 | Accuracy: 0.426667 | Within 3: 0.715556
[Epoch   3]: Training loss: 1.395520 | Accuracy: 0.533016
[Epoch   3]: Validation loss: 1.721746 | Accuracy: 0.415556 | Within 3: 0.700000
[Epoch   4]: Training loss: 1.326853 | Accuracy: 0.539048
[Epoch   4]: Validation loss: 1.404146 | Accuracy: 0.531111 | Within 3: 0.824444
[Epoch   5]: Training loss: 1.244080 | Accuracy: 0.576508
[Epoch   5]: Validation loss: 1.361586 | Accuracy: 0.500000 | Within 3: 0.815556
[Epoch   6]: Training loss: 1.194026 | Accuracy: 0.593333
[Epoch   6]: Validation loss: 1.369494 | Accuracy: 0.537778 | Within 3: 0.826667
[Epoch   7]: Training loss:

In [14]:
test_loss, test_acc, top_k, incorect_stats = mtrainer.run_test(mtrainer.testloader, 3, True)
print(f'Accuracy of the network on the test images: {test_acc*100} %')

Accuracy of the network on the test images: 64.44444444444444 %


* We can see that the Attention network performs almost as well as the baseline on the test set (64.4% vs. 64.6%).
* Given the potential for improved performance, we tried another attention model, this time based on a convolutional approach.

### Convolutional Self-attention model
* The convolution-like self-attention model uses a self-attention block in which attention is applied locally by iterating over pixel regions in the image. This is based on the self-attention component from the paper [Stand-Alone Self-Attention in Vision Models (Ramachandran et al., 2019)](https://arxiv.org/pdf/1906.05909.pdf). The paper claims that such self-attention blocks can replace convolutional blocks in a CNN.
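
A simplified sketch of such a local self-attention block is shown below. It is not the project's `CNNWithConvAttention` and it omits the relative position embeddings and multiple heads used in the paper; the class name `LocalSelfAttention2d` and all sizes are assumptions. Each pixel attends only to the `kernel_size` x `kernel_size` neighbourhood around it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalSelfAttention2d(nn.Module):
    """Hypothetical single-head local self-attention over k x k neighbourhoods."""

    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        self.kernel_size = kernel_size
        self.padding = kernel_size // 2  # keeps the spatial size for odd kernels
        self.out_channels = out_channels
        self.query = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.key = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.value = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, x):
        b, _, h, w = x.shape
        k2 = self.kernel_size ** 2
        q = self.query(x).view(b, self.out_channels, 1, h * w)
        # unfold gathers the neighbourhood of every pixel: (B, C * k * k, H * W)
        keys = F.unfold(self.key(x), self.kernel_size, padding=self.padding)
        values = F.unfold(self.value(x), self.kernel_size, padding=self.padding)
        keys = keys.view(b, self.out_channels, k2, h * w)
        values = values.view(b, self.out_channels, k2, h * w)
        # Dot product between each pixel's query and its neighbours' keys
        scores = (q * keys).sum(dim=1, keepdim=True)  # (B, 1, k * k, H * W)
        # Softmax over the neighbourhood; zero-padded border positions are
        # included in the softmax, which is a simplification
        attn = F.softmax(scores, dim=2)
        out = (attn * values).sum(dim=2)  # (B, C_out, H * W)
        return out.view(b, self.out_channels, h, w)


torch.manual_seed(0)
layer = LocalSelfAttention2d(in_channels=8, out_channels=16, kernel_size=3)
out = layer(torch.randn(2, 8, 8, 8))
print(out.shape)
```

Like a convolution with padding, the layer preserves the spatial size, which is what allows it to stand in for a convolutional block.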

In [18]:
attention_res_conv_cnn_model = attention_cnn.CNNWithConvAttention(len(data.categories))
torch.manual_seed(seed)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(attention_res_conv_cnn_model.parameters(), lr=4e-4)

In [19]:
mtrainer = trainer.Trainer(attention_res_conv_cnn_model, optimizer, criterion, data, batch_size)
mtrainer.run_train(num_epoch)

[Epoch   0]: Training loss: 1.733191 | Accuracy: 0.388571
[Epoch   0]: Validation loss: 1.620318 | Accuracy: 0.440000 | Within 3: 0.744444
[Epoch   1]: Training loss: 1.541552 | Accuracy: 0.458413
[Epoch   1]: Validation loss: 1.573415 | Accuracy: 0.440000 | Within 3: 0.768889
[Epoch   2]: Training loss: 1.411207 | Accuracy: 0.523810
[Epoch   2]: Validation loss: 1.560676 | Accuracy: 0.464444 | Within 3: 0.728889
[Epoch   3]: Training loss: 1.336674 | Accuracy: 0.539365
[Epoch   3]: Validation loss: 1.459053 | Accuracy: 0.482222 | Within 3: 0.800000
[Epoch   4]: Training loss: 1.249114 | Accuracy: 0.572698
[Epoch   4]: Validation loss: 1.942010 | Accuracy: 0.380000 | Within 3: 0.740000
[Epoch   5]: Training loss: 1.162595 | Accuracy: 0.607619
[Epoch   5]: Validation loss: 1.456341 | Accuracy: 0.484444 | Within 3: 0.802222
[Epoch   6]: Training loss: 1.106963 | Accuracy: 0.633016
[Epoch   6]: Validation loss: 1.419516 | Accuracy: 0.544444 | Within 3: 0.793333
[Epoch   7]: Training loss:

In [20]:
test_loss, test_acc, top_k, incorect_stats = mtrainer.run_test(mtrainer.testloader, 3, True)
print(f'Accuracy of the network on the test images: {test_acc*100} %')

Accuracy of the network on the test images: 67.0 %


* The ConvAttention model gave the best performance, improving on the baseline by about 2.4 percentage points (67.0% vs. 64.6%).
* This suggests that local self-attention can be helpful in image classification.
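
For reference, the test accuracies printed in this notebook can be put side by side as follows. The values are copied from the outputs above; the model labels are shorthand introduced here.

```python
# Test accuracies (%) as printed by the runs in this notebook
test_accuracy = {
    "Baseline CNN": 64.55555555555556,
    "CNN + LSTM": 63.55555555555556,
    "CNN + attention": 64.44444444444444,
    "CNN + convolutional self-attention": 67.0,
}

baseline = test_accuracy["Baseline CNN"]
# Difference to the baseline in percentage points
deltas = {name: acc - baseline for name, acc in test_accuracy.items()}

for name, acc in test_accuracy.items():
    print(f"{name:<36} {acc:6.2f} %  ({deltas[name]:+.2f} pts vs. baseline)")

best_model = max(test_accuracy, key=test_accuracy.get)
```

Each figure comes from a single training run with one seed, so small differences between models should be read with caution.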