## Baseline model

Here, I'm following another PyTorch tutorial, on finetuning torchvision models: https://pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html

The goal is to create a very simple baseline model that is fast to train.

The idea is to download a pretrained backbone CNN and use it for feature extraction only. We then replace the classification head (to predict pedestrian or background only) and add a regression head (to predict bounding boxes around pedestrians). Training should be very fast, since it is almost like training a linear regression (for the regression head) and a logistic regression (for the classification head) on fixed features.

To detect multiple pedestrians, we can slide a window and run the trained detector on each window. For example, if we use AlexNet as the backbone CNN, the torchvision version takes 224x224 images as input. A very simple approach is to slide a window of this size over our high-resolution input image, save the predicted bounding boxes along the way, and combine them at the end. For combining, we can just take the intersection of the bounding boxes that overlap. This way, the model even does some self-correction, since we run it multiple times on overlapping windows. Testing also has to run fast, so we should use a small and efficient backbone CNN.
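
The sliding-window idea can be sketched in plain Python. This is only a rough illustration, not the detector itself: the function names (`sliding_windows`, `intersect`, `combine`), the stride of 112 pixels, and the greedy merging order are all my own assumptions, and the image is assumed to be at least as large as the window.

```python
from typing import Iterator, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


def sliding_windows(img_w: int, img_h: int, win: int = 224,
                    stride: int = 112) -> Iterator[Box]:
    """Yield square windows over the image, left to right, top to bottom."""
    xs = list(range(0, max(img_w - win, 0) + 1, stride))
    ys = list(range(0, max(img_h - win, 0) + 1, stride))
    # Add one extra window so the right and bottom edges are covered too
    if xs[-1] + win < img_w:
        xs.append(img_w - win)
    if ys[-1] + win < img_h:
        ys.append(img_h - win)
    for y in ys:
        for x in xs:
            yield (x, y, x + win, y + win)


def intersect(a: Box, b: Box) -> Optional[Box]:
    """Return the overlapping region of two boxes, or None if they are disjoint."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    if x0 >= x1 or y0 >= y1:
        return None
    return (x0, y0, x1, y1)


def combine(boxes: List[Box]) -> List[Box]:
    """Greedily shrink overlapping boxes to their intersection."""
    merged: List[Box] = []
    for box in boxes:
        for i, kept in enumerate(merged):
            overlap = intersect(kept, box)
            if overlap is not None:
                merged[i] = overlap  # keep only the common region
                break
        else:
            merged.append(box)  # no overlap with anything kept so far
    return merged


windows = list(sliding_windows(500, 224))
combined = combine([(10, 10, 50, 50), (20, 20, 60, 60), (100, 100, 120, 120)])
```

In a real pipeline, each window would be cropped, passed through the detector, and the predicted boxes shifted back into full-image coordinates before calling `combine`.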

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import torchvision
from torchvision import datasets, models, transforms

import matplotlib.pyplot as plt
import time
import os
import copy

In [2]:
print("PyTorch Version: ",torch.__version__)
print("Torchvision Version: ",torchvision.__version__)

PyTorch Version:  1.8.1+cu102
Torchvision Version:  0.9.1+cu102


We will use an existing model with a modern CNN architecture, pretrained on the 1000-class ImageNet dataset, and apply transfer learning. There are two types of transfer learning:

* *finetuning*, where we update all of the model's parameters for our new task;
* *feature extraction*, where we only update the weights of the final layer, from which we derive our new task predictions. The pretrained CNN acts as a fixed feature extractor, and only the output layer changes.
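
To get a feeling for the difference between the two modes, here is a small back-of-the-envelope count of parameters for AlexNet with a 2-class head. The layer shapes are the ones I know from torchvision's AlexNet definition (an assumption if your version differs), and the helper names are made up for this sketch.

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases of a Conv2d layer with a k x k kernel."""
    return c_in * c_out * k * k + c_out


def linear_params(n_in, n_out):
    """Weights plus biases of a Linear layer."""
    return n_in * n_out + n_out


# (in_channels, out_channels, kernel_size) of the five conv layers
conv_layers = [(3, 64, 11), (64, 192, 5), (192, 384, 3), (384, 256, 3), (256, 256, 3)]
# (in_features, out_features) of the two pretrained fully connected layers
frozen_linear = [(9216, 4096), (4096, 4096)]
# The replaced output layer: 4096 features -> 2 classes
new_head = (4096, 2)

frozen = (sum(conv_params(*c) for c in conv_layers)
          + sum(linear_params(*l) for l in frozen_linear))
trainable = linear_params(*new_head)

print("finetuning updates:        ", frozen + trainable, "parameters")
print("feature extraction updates:", trainable, "parameters")
```

Under these assumptions, feature extraction trains only a few thousand parameters out of tens of millions, which is why it behaves almost like fitting a logistic regression.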

### Backbone CNN selection

CNN models pretrained on ImageNet that we can choose from:

In [24]:
model_names = ['resnet', 'alexnet', 'vgg', 'squeezenet', 'densenet', 'inception']

Let's choose AlexNet:

In [3]:
model_name = model_names[1]

You can download the dataset used in this tutorial here: https://download.pytorch.org/tutorial/hymenoptera_data.zip

In [14]:
data_dir = "/home/marko/data/baseline/hymenoptera_data" # set the data dir
num_classes = 2
batch_size = 8 # change depending on how much memory you have
num_epochs = 15
feature_extract = True # we only update the reshaped layer params

## Helper functions

### Model training and validation

In [26]:
def train_model(model, dataloaders, criterion, optimizer, 
                num_epochs=25, is_inception=False):
    ''' Copy pasted from tutorial. '''
    since = time.time()

    val_acc_history = []

    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0

            # Iterate over data.
            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                # track history if only in train
                with torch.set_grad_enabled(phase == 'train'):
                    # Get model outputs and calculate loss
                    # Special case for inception because in training it has an auxiliary output. In train
                    #   mode we calculate the loss by summing the final output and the auxiliary output
                    #   but in testing we only consider the final output.
                    if is_inception and phase == 'train':
                        # From https://discuss.pytorch.org/t/how-to-optimize-inception-model-with-auxiliary-classifiers/7958
                        outputs, aux_outputs = model(inputs)
                        loss1 = criterion(outputs, labels)
                        loss2 = criterion(aux_outputs, labels)
                        loss = loss1 + 0.4*loss2
                    else:
                        outputs = model(inputs)
                        loss = criterion(outputs, labels)

                    _, preds = torch.max(outputs, 1)

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()

                # statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)

            epoch_loss = running_loss / len(dataloaders[phase].dataset)
            epoch_acc = running_corrects.double() / len(dataloaders[phase].dataset)

            print('{} Loss: {:.4f} Acc: {:.4f}'.format(phase, epoch_loss, epoch_acc))

            # deep copy the model
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())
            if phase == 'val':
                val_acc_history.append(epoch_acc)

        print()

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:4f}'.format(best_acc))

    # load best model weights
    model.load_state_dict(best_model_wts)
    return model, val_acc_history
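
A note on the statistics in `train_model`: the criterion returns the mean loss of a batch, so the loop multiplies it by the batch size before summing and divides by the dataset size at the end. The following sketch uses made-up per-batch numbers (purely hypothetical, not from my run) to show why this matters when the last batch is smaller.

```python
# (mean loss of the batch, number of correct predictions, batch size)
# Hypothetical values; the last batch is smaller than the others.
batches = [(0.50, 6, 8), (0.30, 7, 8), (0.90, 2, 4)]

running_loss = 0.0
running_corrects = 0
n_samples = 0
for mean_loss, corrects, size in batches:
    running_loss += mean_loss * size   # undo the per-batch averaging
    running_corrects += corrects
    n_samples += size

epoch_loss = running_loss / n_samples      # per-sample average over the epoch
epoch_acc = running_corrects / n_samples

# Averaging the batch means directly would overweight the small last batch
naive_loss = sum(mean_loss for mean_loss, _, _ in batches) / len(batches)

print(f"epoch loss {epoch_loss:.4f}, naive loss {naive_loss:.4f}, acc {epoch_acc:.4f}")
```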

When we are feature extracting only, we set all parameters of our model to have `.requires_grad = False` since we only want to compute gradients for the newly initialized layer:

In [16]:
def set_parameter_requires_grad(model, feature_extracting):
    if feature_extracting:
        for param in model.parameters():
            param.requires_grad = False
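
To see the effect of freezing without downloading any weights, here is a minimal sketch on a tiny stand-in network. The `nn.Sequential` model below is hypothetical and only plays the role of a pretrained backbone; the freezing loop is the same as in `set_parameter_requires_grad`.

```python
import torch.nn as nn

# Tiny stand-in for a pretrained network (hypothetical, nothing is downloaded)
model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 3))

# Freeze everything, as in feature extraction
for param in model.parameters():
    param.requires_grad = False

# Replace the output layer; a freshly created layer requires gradients by default
model[2] = nn.Linear(model[2].in_features, 2)

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)
```

Only the weight and bias of the new layer remain trainable, which is the behaviour we rely on when replacing the last classifier layer of the real backbone.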

In [17]:
def initialize_model(model_name, num_classes, feature_extract, use_pretrained=True):
    """ Trying with AlexNet only for now, so `model_name` is currently ignored.
    We could also try ResNet18, as our dataset is small and only has two classes.
    See the code in the tutorial on how to reshape other networks. """

    # Load the pretrained AlexNet, freeze the backbone if requested,
    # and replace the last classifier layer with a new `num_classes`-way head.
    model_ft = models.alexnet(pretrained=use_pretrained)
    set_parameter_requires_grad(model_ft, feature_extract)
    num_ftrs = model_ft.classifier[6].in_features
    model_ft.classifier[6] = nn.Linear(num_ftrs,num_classes)
    input_size = 224
    
    return model_ft, input_size

In [18]:
# Initialize the model for this run
model_ft, input_size = initialize_model(model_name, num_classes, feature_extract, use_pretrained=True)

In [19]:
# Print the model we just instantiated
print(model_ft)

AlexNet(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (4): ReLU(inplace=True)
    (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU(inplace=True)
    (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU(inplace=True)
    (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(6, 6))
  (classifier): Sequential(
    (0): Dropout(p=0.5, inplace=False)
    (1): Linear(in_features=9216, out_features=4096, bias=True)
 

## Load Data

In [20]:
# Data augmentation and normalization for training
# Just normalization for validation
data_transforms = {
    'train': transforms.Compose([
        transforms.RandomResizedCrop(input_size),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'val': transforms.Compose([
        transforms.Resize(input_size),
        transforms.CenterCrop(input_size),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
}
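
As a sanity check on what `ToTensor` followed by `Normalize` does to a single pixel, here is the arithmetic in plain Python. This is a sketch of the per-channel formula `(x - mean) / std` applied after scaling 8-bit values to [0, 1]; the function name `normalize_pixel` is made up for illustration.

```python
MEAN = (0.485, 0.456, 0.406)  # ImageNet channel means used in the transforms
STD = (0.229, 0.224, 0.225)   # ImageNet channel standard deviations


def normalize_pixel(rgb):
    """Map an 8-bit RGB pixel to the values the network actually sees."""
    scaled = [c / 255.0 for c in rgb]  # ToTensor scales uint8 images to [0, 1]
    return [(s - m) / d for s, m, d in zip(scaled, MEAN, STD)]


white = normalize_pixel((255, 255, 255))
black = normalize_pixel((0, 0, 0))
print("white:", [round(v, 4) for v in white])
print("black:", [round(v, 4) for v in black])
```

So the inputs end up roughly in the range -2.1 to 2.6 rather than 0 to 1, matching the statistics the pretrained weights were trained with.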

In [21]:
print("Initializing Datasets and Dataloaders...")

# Create training and validation datasets
image_datasets = {
    x: datasets.ImageFolder(os.path.join(data_dir, x), 
                            data_transforms[x]) for x in ['train', 'val']}
# Create training and validation dataloaders
dataloaders_dict = {
    x: torch.utils.data.DataLoader(image_datasets[x], 
                                   batch_size=batch_size, 
                                   shuffle=True, 
                                   num_workers=4) for x in ['train', 'val']}

# Detect if we have a GPU available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

Initializing Datasets and Dataloaders...


### Create the optimizer

We gather the parameters to be optimized in this run.
Because we are using the feature extraction method, we only update the parameters that we have just initialized, i.e. those with `requires_grad` set to `True`:

In [22]:
# Send the model to the CPU or GPU device
model_ft = model_ft.to(device)

params_to_update = model_ft.parameters()
print("Params to learn:")
if feature_extract:
    # Keep only the parameters that still require gradients (the new head)
    params_to_update = []
    for name, param in model_ft.named_parameters():
        if param.requires_grad:
            params_to_update.append(param)
            print("\t", name)
else:
    # Finetuning: every parameter is updated, so we just list them
    for name, param in model_ft.named_parameters():
        if param.requires_grad:
            print("\t", name)

# Create the optimizer over the selected parameters
optimizer_ft = optim.SGD(params_to_update, lr=0.001, momentum=0.9)

Params to learn:
	 classifier.6.weight
	 classifier.6.bias
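
For intuition about what `optim.SGD` with `momentum=0.9` does to each of these parameters, here is a scalar sketch of the update rule as I understand it from the PyTorch documentation (velocity = momentum * velocity + grad, then param = param - lr * velocity, with no dampening or weight decay). The constant gradient of 1.0 is a made-up value for illustration.

```python
def sgd_momentum_step(param, grad, velocity, lr=0.001, momentum=0.9):
    """One SGD-with-momentum update for a single scalar parameter."""
    velocity = momentum * velocity + grad  # accumulate a running gradient
    param = param - lr * velocity          # step against the accumulated direction
    return param, velocity


param, velocity = 0.0, 0.0
history = []
for _ in range(3):
    # Hypothetical constant gradient of 1.0 at every step
    param, velocity = sgd_momentum_step(param, 1.0, velocity)
    history.append(param)

print(history)
```

With a gradient that keeps pointing the same way, the steps grow from one update to the next, which is the acceleration effect momentum is meant to provide.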


### Run Training and Validation

In [23]:
# Set up the loss function
criterion = nn.CrossEntropyLoss()

# Train and evaluate
model_ft, hist = train_model(
    model_ft, dataloaders_dict, criterion, 
    optimizer_ft, num_epochs=num_epochs, 
    is_inception=(model_name=="inception"))

Epoch 0/14
----------
train Loss: 0.5088 Acc: 0.7705
val Loss: 0.4402 Acc: 0.8627

Epoch 1/14
----------
train Loss: 0.3221 Acc: 0.9139
val Loss: 0.4062 Acc: 0.8954

Epoch 2/14
----------
train Loss: 0.3133 Acc: 0.8893
val Loss: 0.4167 Acc: 0.8889

Epoch 3/14
----------
train Loss: 0.2421 Acc: 0.9098
val Loss: 0.3085 Acc: 0.9085

Epoch 4/14
----------
train Loss: 0.2920 Acc: 0.9057
val Loss: 0.5238 Acc: 0.8562

Epoch 5/14
----------
train Loss: 0.2947 Acc: 0.9139
val Loss: 0.3435 Acc: 0.8954

Epoch 6/14
----------
train Loss: 0.1954 Acc: 0.9467
val Loss: 0.3855 Acc: 0.8693

Epoch 7/14
----------
train Loss: 0.1592 Acc: 0.9508
val Loss: 0.3207 Acc: 0.9346

Epoch 8/14
----------
train Loss: 0.0924 Acc: 0.9631
val Loss: 0.3486 Acc: 0.9150

Epoch 9/14
----------
train Loss: 0.2361 Acc: 0.9344
val Loss: 0.3986 Acc: 0.9085

Epoch 10/14
----------
train Loss: 0.2030 Acc: 0.9426
val Loss: 0.4078 Acc: 0.9281

Epoch 11/14
----------
train Loss: 0.1373 Acc: 0.9508
val Loss: 0.4130 Acc: 0.9085

Ep

***Training took about a minute and a half on a 9th Gen Intel i5 CPU (i5-9500T @ 2.20GHz).***

The best validation accuracy (0.934641) is similar to the one reported in the tutorial (0.941176).
