# OverFeat

In this notebook we'll be implementing one of the [OverFeat](https://arxiv.org/abs/1312.6229) (OverFeat Network) model variants. OverFeat was designed for the [ImageNet challenge](http://www.image-net.org/challenges/LSVRC/), which it won in 2013.

We present an integrated framework for using Convolutional Networks for classification, localization and detection. We show how a multiscale and sliding window approach can be efficiently implemented within a ConvNet. We also introduce a novel deep learning approach to localization by learning to predict object boundaries. Bounding boxes are then accumulated rather than suppressed in order to increase detection confidence. We show that different tasks can be learned simultaneously using a single shared network. This integrated framework is the winner of the localization task of the ImageNet Large Scale Visual Recognition Challenge 2013 (ILSVRC2013) and obtained very competitive results for the detection and classifications tasks. In post-competition work, we establish a new state of the art for the detection task. Finally, we release a feature extractor from our best model called OverFeat.

OverFeat is a classic type of convolutional neural network architecture, employing convolution, pooling and fully connected layers. The Figure to the right shows the architectural details. Source: OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks.

### Data Processing

As always, we'll start by importing all the necessary modules. We have a few new imports here:
- `lr_scheduler` for using the one cycle learning rate scheduler
- `os` and `shutil` for handling custom datasets

In [1]:
import torchvision
from torch.utils.data import Dataset, DataLoader
from torchsummaryX import summary
from tensorboardX import SummaryWriter

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.optim.lr_scheduler import _LRScheduler
import torch.utils.data as data

import torchvision.transforms as transforms
import torchvision.datasets as datasets
from torchvision import models

from sklearn import decomposition
from sklearn import manifold
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
from tqdm.notebook import tqdm, trange
import matplotlib.pyplot as plt
import numpy as np

import copy
import random
import time
import os

We set the random seed so all of our experiments can be reproduced.

In [2]:
SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

Show current directory.

In [3]:
cwd = os.getcwd()
print(cwd)

/home/david/code/paper/cv-paper-pytorch/004_OverFeat


Show torch version and if there is a gpu in this device. Also it is will print gpu device name.

In [4]:
print(torch.__version__)

device = ("cuda" if torch.cuda.is_available() else "cpu") # device
print(device)

print("The GPU device is:{}".format( torch.cuda.get_device_name() ))


2.0.1+cu118
cuda
The GPU device is:NVIDIA GeForce RTX 3080 Ti


### Defining the Model

Next up is defining the model.

The actual model itself is no more difficult to understand than the previous model, LeNet. It is made up of convolutional layers, pooling layers and ReLU activation functions. See the previous notebook for a refresher on these concepts. 

There are only two new concepts introduced here, `nn.Sequential` and `nn.Dropout`.

We can think of `Sequential` as like our transforms introduced earlier for data augmentation. We provide `Sequential` with multiple layers and when the `Sequential` module is called it will apply each layer, in order, to the input. There is no difference between using a `Sequential` and having each module defined in the `__init__` and then called in `forward` - however it makes the code significantly shorter.

We have one `Sequential` model, `features`, for all the convolutional and pooling layers, then we flatten then data and pass it to the `classifier`, another `Sequential` model which is made up of linear layers and the second new concept, *dropout*.

Dropout is a form of [*regularization*](https://en.wikipedia.org/wiki/Regularization_(mathematics)). As our models get larger, to perform more accurately on richer datasets, they start having a significantly higher number of parameters. The problem with lots of parameters is that our models begin to *overfit*. That is, they do not learn general image features whilst learning to classify images but instead simply memorize images within the training set. This is bad as it will cause our model to achieve poor performance on the validation/testing set. To solve this overfitting problem, we use regularization. Dropout is just one method of regularization, other common ones are *L1 regularization*, *L2 regularization* and *weight decay*.

Dropout works by randomly setting a certain fraction, 0.5 here, of the neurons in a layer to zero. This effectively adds noise to the training of the neural network and causes neurons to learn with "less" data as they are only getting half of the information from a previous layer with dropout applied. It can also be thought of as causing your model to learn multiple smaller models with less parameters. 

Dropout is only applied when the model is training. It needs to be "turned off" when validating, testing or using the model for inference.

As mentioned in the previous notebook, during the convolutional and pooling layers the activation function should be placed **after** the pooling layer to reduce computational cost.

In the linear layers, dropout should be applied **after** the activation function. Although when using ReLU activation functions the same result is achieved if dropout is before or after, see [here](https://sebastianraschka.com/faq/docs/dropout-activation.html).

One last thing to mention is that the very first convolutional layer has an `in_channel` of three. That is because we are handling color images that have three channels (red, green and blue) instead of the single channel grayscale images from the MNIST dataset. This doesn't change the way any of the convolutional filter works, it just means the first filter has a depth of three instead of a depth of one.

The accurate model for overfeat is as following.
![](./assets/accurate_overfeat.png)


In [5]:
class OverFeat_accurate(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()

        # train with 221x221 5 random crops and their horizontal filps
        # mini- batches of size 128
        # initialized weight randomly with mu=0, sigma=1x10^-2
        # SGD, momentum=0.6, l2 weight decay of 1x10^-5
        # learning rate 5x10^-2, decay by 0.5 after (30, 50, 60, 70, 80) epochs
        # Dropout on FCN?? -> dropout before classifier conv layer

        self.feature_extractor = nn.Sequential(
            # no contrast normalization is used
            # max polling with non-overlapping
            # 1st and 2nd layer stride 2 instead of 4

            # 1st
            nn.Conv2d(in_channels=3, out_channels=96, kernel_size=7, stride=2),  # (b x 96 x 108 x 108)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=3),  # (b x 96 x 36 x 36)

            # 2nd
            nn.Conv2d(96, 256, 7, stride= 1),  # (b x 256 x 30 x 30)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),  # (b x 256 x 15 x 15)

            # 3rd
            nn.Conv2d(256, 512, 3, padding=1),  # (b x 512 x 15 x 15)
            nn.ReLU(),

            # 4th
            nn.Conv2d(512, 512, 3, padding=1),  # (b x 512 x 15 x 15)
            nn.ReLU(),

            # 5th
            nn.Conv2d(512, 1024, 3, padding=1),  # (b x 1024 x 15 x 15)
            nn.ReLU(),

            # 6th
            nn.Conv2d(1024, 1024, 3, padding=1),  # (b x 1024 x 15 x 15)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=3),  # (b x 1024 x 5 x 5)
        )

        # fully connecyed layers implemented as a convolution layers
        self.classifier = nn.Sequential(
            # 7th
            nn.Dropout(p=0.5, inplace=True),
            nn.Conv2d(in_channels=1024, out_channels=4096, kernel_size=5),
            nn.ReLU(),

            # 8th
            nn.Dropout(p=0.5, inplace=True),
            nn.Conv2d(4096, 4096, 1),
            nn.ReLU(),

            # 9th
            nn.Conv2d(4096, num_classes, 1)
        )

        self.init_weight()  # initialize weight

    def init_weight(self):
        for layer in self.feature_extractor:
            if isinstance(layer, nn.Conv2d):
                nn.init.normal_(layer.weight, mean=0, std=0.01)

    def forward(self, x):
        """
        Pass the input through the net.
        Args:
            x (Tensor): input tensor
        Returns:
            output (Tensor): output tensor
        """
        x = self.feature_extractor(x)
        return self.classifier(x).squeeze()

The fast model for overfeat is as following.
![](./assets/fast_overfeat.png)


In [6]:
class OverFeat_fast(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()

        # train with 221x221 5 random crops and their horizontal filps
        # mini- batches of size 128
        # initialized weight randomly with mu=0, sigma=1x10^-2
        # SGD, momentum=0.6, l2 weight decay of 1x10^-5
        # learning rate 5x10^-2, decay by 0.5 after (30, 50, 60, 70, 80) epochs
        # Dropout on FCN?? -> dropout before classifier conv layer

        self.feature_extractor = nn.Sequential(
            # no contrast normalization is used
            # max polling with non-overlapping
            # 1st and 2nd layer stride 2 instead of 4

            # 1st
            nn.Conv2d(in_channels=3, out_channels=96, kernel_size=11, stride=4),  # (b x 96 x 56 x 56)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),  # (b x 96 x 28 x 28)

            # 2nd
            nn.Conv2d(96, 256, 5, stride= 1),  # (b x 256 x 24 x 24)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),  # (b x 256 x 12 x 12)

            # 3rd
            nn.Conv2d(256, 512, 3, padding=1),  # (b x 512 x 12 x 12)
            nn.ReLU(),

            # 4th
            nn.Conv2d(512, 1024, 3, padding=1),  # (b x 1024 x 12 x 12)
            nn.ReLU(),

            # 5th
            nn.Conv2d(1024, 1024, 3, padding=1),  # (b x 1024 x 12 x 12)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),  # (b x 1024 x 6 x 6)
        )

        # fully connecyed layers implemented as a convolution layers
        self.classifier = nn.Sequential(
            # 6th
            nn.Dropout(p=0.5, inplace=False),
            nn.Conv2d(in_channels=1024, out_channels=3072, kernel_size=6),
            nn.ReLU(),

            # 7th
            nn.Dropout(p=0.5, inplace=False),
            nn.Conv2d(3072, 4096, 1),
            nn.ReLU(),

            # 8th
            nn.Conv2d(4096, num_classes, 1)
        )

        self.init_weight()  # initialize weight

    def init_weight(self):
        for layer in self.feature_extractor:
            if isinstance(layer, nn.Conv2d):
                nn.init.normal_(layer.weight, mean=0, std=0.01)

    def forward(self, x):
        """
        Pass the input through the net.
        Args:
            x (Tensor): input tensor
        Returns:
            output (Tensor): output tensor
        """
        x = self.feature_extractor(x)
        return self.classifier(x).squeeze()

We'll create an instance of our model with the desired amount of classes.

In [7]:
Fast = True
OUTPUT_DIM = 1000

In [8]:
if Fast == True:
    model = OverFeat_fast(num_classes=OUTPUT_DIM)
    #summary(overfeat, torch.zeros((128,3,231,231),device=device))
else:
    model = OverFeat_accurate(num_classes=1000).to(device)
    #summary(overfeat, torch.zeros((128,3,221,221),device=device))

# overfeat = torch.nn.parallel.DataParallel(overfeat, device_ids=)

Then we'll see how many trainable parameters our model has. 

Our LeNet architecture had ~44k, but here we have 141.9M parameters - and AlexNet is a relatively small model for computer vision.

In [9]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 145,920,872 trainable parameters


Put ILSVRC2012 `ILSVRC2012_img_train.tar` and `ILSVRC2012_img_val.tar` to `./../data` directory and untar them.

Cd `./../data` and execute `./extract_ILSVRC2012.sh` to extract ILSVRC2012 data.

And then rename all images suffix name.
```shell
find ./ -name "*.JPEG" | awk -F "." '{print $2}' | xargs -i -t mv ./{}.JPEG ./{}.jpg
```

In [10]:
path_img_train = "./../data/ILSVRC2012/train/"
path_img_val = "./../data/ILSVRC2012/val/"

path_log = "./tblog/"
path_checkpoint = "./checkpoints/"
if not os.path.exists(path_log):
    os.makedirs(path_log)
if not os.path.exists(path_checkpoint):
    os.makedirs(path_checkpoint)

Next, we'll set the random seeds for reproducability.

In [11]:
tbwriter = SummaryWriter(log_dir=path_log)

In [12]:
Fast = True

num_epochs = 100
batch_size = 256

In [13]:
if Fast:
    overfeat = OverFeat_fast(num_classes=1000).to(device)
    summary(overfeat, torch.zeros((128,3,231,231),device=device))
else:
    overfeat = OverFeat_accurate(num_classes=1000).to(device)
    summary(overfeat, torch.zeros((128,3,221,221),device=device))

# overfeat = torch.nn.parallel.DataParallel(overfeat, device_ids=)

                                         Kernel Shape         Output Shape  \
Layer                                                                        
0_feature_extractor.Conv2d_0          [3, 96, 11, 11]    [128, 96, 56, 56]   
1_feature_extractor.ReLU_1                          -    [128, 96, 56, 56]   
2_feature_extractor.MaxPool2d_2                     -    [128, 96, 28, 28]   
3_feature_extractor.Conv2d_3          [96, 256, 5, 5]   [128, 256, 24, 24]   
4_feature_extractor.ReLU_4                          -   [128, 256, 24, 24]   
5_feature_extractor.MaxPool2d_5                     -   [128, 256, 12, 12]   
6_feature_extractor.Conv2d_6         [256, 512, 3, 3]   [128, 512, 12, 12]   
7_feature_extractor.ReLU_7                          -   [128, 512, 12, 12]   
8_feature_extractor.Conv2d_8        [512, 1024, 3, 3]  [128, 1024, 12, 12]   
9_feature_extractor.ReLU_9                          -  [128, 1024, 12, 12]   
10_feature_extractor.Conv2d_10     [1024, 1024, 3, 3]  [128, 102

  df_sum = df.sum()


In [14]:
if Fast:
    cropsize = 231
else:
    cropsize = 221

In [15]:
dataset_train = datasets.ImageFolder(path_img_train, transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(cropsize),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor()
]))
dataset_val = datasets.ImageFolder(path_img_val, transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(cropsize),
    transforms.ToTensor()
]))

In [16]:
dataloader_train = DataLoader(
    dataset=dataset_train,
    batch_size=batch_size,
    shuffle=True,
    num_workers=6,
    pin_memory=True,
    drop_last=True
)
dataloader_val = DataLoader(
    dataset=dataset_val,
    batch_size=batch_size,
    shuffle=True,
    num_workers=6,
    pin_memory=True,
    drop_last=True
)

In [17]:
optimizer = optim.SGD(
    params=overfeat.parameters(),
    momentum=0.6,
    weight_decay=1e-5,
    lr=5e-2
)

lr_scheduler = optim.lr_scheduler.MultiStepLR(optimizer=optimizer, milestones=[30, 50, 60, 70, 80], gamma=0.5)

In [None]:
overfeat.train()
step = 1

#for epoch in range(num_epochs):
for epoch in tqdm(range(num_epochs), desc="Epoch", leave=False):
    #for imgs, classes in dataloader_train:
    for (imgs, classes) in tqdm(dataloader_train, desc="Training", leave=False):
        imgs, classes = imgs.to(device), classes.to(device)

        output = overfeat(imgs)
        loss = F.cross_entropy(output, classes)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if(step % 1e3 == 0):
            with torch.no_grad():
                _, preds = torch.max(output, 1)
                accuracy = torch.sum(preds == classes)

                print(f'Epoch: {epoch + 1} \tStep: {step} \tLoss: {loss.item():.4f} \tAcc: {accuracy.item()}')
                tbwriter.add_scalar('loss', loss.item(), step)
                tbwriter.add_scalar('accuracy', accuracy.item(), step)

                for name, parameter in overfeat.named_parameters():
                    if parameter.grad is not None:
                        avg_grad = torch.mean(parameter.grad)
                        # print(f'\t{name} - grad_avg: {avg_grad}')
                        tbwriter.add_scalar(f'grad_avg/{name}', avg_grad.item(), step)
                        tbwriter.add_histogram(f'grad/{name}', parameter.grad.cpu().numpy(), step)
                    if parameter.data is not None:
                        avg_weight = torch.mean(parameter.data)
                        # print(f'\t{name} - param_avg: {avg_weight}')
                        tbwriter.add_histogram(f'weight/{name}', parameter.data.cpu().numpy(), step)
                        tbwriter.add_scalar(f'weight_avg/{name}', avg_weight.item(), step)

                overfeat.eval()
                val_cLoss = 0
                val_cAcc = 0
                val_count = 0
                
                for val_imgs, val_classes in dataloader_val:
                    val_imgs, val_classes = val_imgs.to(device), val_classes.to(device)

                    val_output = overfeat(val_imgs)
                    val_cLoss += F.cross_entropy(val_output, val_classes)

                    _, val_pred = torch.max(val_output, 1)
                    val_cAcc += torch.sum(val_pred == val_classes)

                    val_count += 1

                val_loss = val_cLoss / val_count
                val_accuracy = val_cAcc / val_count

                print(f'\tValidation Loss: {val_loss:.4f} \t Validation Acc: {val_cAcc} / {val_count} ({val_accuracy.item()*100:0f}%)')
                tbwriter.add_scalar('val_loss', val_loss.item(), step)
                tbwriter.add_scalar('val_accuracy', val_accuracy.item(), step)
                overfeat.train()

        step += 1
        
    lr_scheduler.step()
    
    if(Fast):
        checkpoint_path = os.path.join(path_checkpoint, f'overfeat_fast_states_epoch{epoch}.pkl')
    else:
        checkpoint_path = os.path.join(path_checkpoint, f'overfeat_accurate_states_epoch{epoch}.pkl')
    state = {
        'epoch': epoch,
        'step': step,
        'optimizer': optimizer.state_dict(),
        'model': overfeat.state_dict(),
        'seed' : SEED
    }
    torch.save(state, checkpoint_path)

HBox(children=(FloatProgress(value=0.0, description='Epoch', style=ProgressStyle(description_width='initial'))…

HBox(children=(FloatProgress(value=0.0, description='Training', max=5004.0, style=ProgressStyle(description_wi…

Epoch: 1 	Step: 1000 	Loss: 6.9062 	Acc: 0
	Validation Loss: 6.9079 	 Validation Acc: 50 / 195 (25.641027%)
Epoch: 1 	Step: 2000 	Loss: 6.9087 	Acc: 0
	Validation Loss: 6.9080 	 Validation Acc: 50 / 195 (25.641027%)
Epoch: 1 	Step: 3000 	Loss: 6.9093 	Acc: 0
	Validation Loss: 6.9079 	 Validation Acc: 70 / 195 (35.897437%)
Epoch: 1 	Step: 4000 	Loss: 6.8134 	Acc: 0
	Validation Loss: 6.7961 	 Validation Acc: 133 / 195 (68.205130%)
Epoch: 1 	Step: 5000 	Loss: 6.6055 	Acc: 0
	Validation Loss: 6.6680 	 Validation Acc: 213 / 195 (109.230769%)


HBox(children=(FloatProgress(value=0.0, description='Training', max=5004.0, style=ProgressStyle(description_wi…

Epoch: 2 	Step: 6000 	Loss: 6.4145 	Acc: 4
	Validation Loss: 6.4628 	 Validation Acc: 518 / 195 (265.641046%)
Epoch: 2 	Step: 7000 	Loss: 5.9640 	Acc: 2
	Validation Loss: 6.0596 	 Validation Acc: 1460 / 195 (748.717976%)
Epoch: 2 	Step: 8000 	Loss: 5.6343 	Acc: 13
	Validation Loss: 5.6394 	 Validation Acc: 2501 / 195 (1282.564163%)
Epoch: 2 	Step: 9000 	Loss: 5.4886 	Acc: 14
	Validation Loss: 5.4463 	 Validation Acc: 3400 / 195 (1743.589783%)
Epoch: 2 	Step: 10000 	Loss: 5.1454 	Acc: 21
	Validation Loss: 5.0273 	 Validation Acc: 4803 / 195 (2463.076973%)


HBox(children=(FloatProgress(value=0.0, description='Training', max=5004.0, style=ProgressStyle(description_wi…

Epoch: 3 	Step: 11000 	Loss: 4.9313 	Acc: 30
	Validation Loss: 4.8763 	 Validation Acc: 5781 / 195 (2964.615440%)
Epoch: 3 	Step: 12000 	Loss: 4.7576 	Acc: 29
	Validation Loss: 4.5724 	 Validation Acc: 7051 / 195 (3615.897369%)


# End......