# Bird Classification with PyTorch on Amazon SageMaker - Directly in your notebook

1. [Introduction](#Introduction)
2. [Data Preparation](#Data-Preparation)
3. [Train the model](#Train-the-model)
4. [Test the model](#Test-the-model)

## Introduction

Image classification is an increasingly popular machine learning technique, in which a trained model predicts which of several classes is represented by a particular image. This technique is useful across a wide variety of use cases from manufacturing quality control to medical diagnosis. To create an image classification solution, we need to acquire and process an image dataset, and train a model from that dataset. The trained model is then capable of identifying features and predicting which class an image belongs to. Finally, we can make predictions using the trained model against previously unseen images.

This notebook is an end-to-end example showing how to build an image classifier using PyTorch directly in an Amazon SageMaker hosted Jupyter notebook. This is an easy transition from traditional machine learning development you may already be doing on your laptop or on an Amazon EC2 instance. Subsequent notebooks in this workshop demonstrate how to take full advantage of SageMaker's training service, hosting service, batch inference, and automatic model tuning.

For each of the labs in this workshop, we use a publicly available set of bird images based on the [Caltech Birds (CUB 200 2011)](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html) dataset. We demonstrate transfer learning by leveraging pretrained ImageNet weights for a ResNet50 network architecture.

For a quick demonstration, pick a small handful of bird species (set `SAMPLE_ONLY = True` and choose a few classes / species). For a more complete model, you can train against all 200 bird species in the dataset. For anything more than a few classes, be sure to upgrade your notebook instance type to one of SageMaker's GPU instance types (ml.p2, ml.p3).

## Data Preparation

The [Caltech Birds (CUB 200 2011)](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html) dataset contains 11,788 images across 200 bird species (the original technical report can be found [here](http://www.vision.caltech.edu/visipedia/papers/CUB_200_2011.pdf)).  Each species comes with around 60 images, with a typical size of about 350 pixels by 500 pixels.  Bounding boxes are provided, as are annotations of bird parts.  A recommended train/test split is given, but image size data is not.

![](./cub_200_2011_snapshot.png)

The dataset can be downloaded [here](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html).

### Download and unpack the dataset

Here we download the birds dataset from Caltech. You only need to do this once; the unpacked dataset can stay on your notebook instance.

In [None]:
import os 
import urllib.request

def download(url):
    filename = url.split('/')[-1]
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url, filename)

In [None]:
%%time
#download('http://www.vision.caltech.edu/visipedia-data/CUB-200-2011/CUB_200_2011.tgz')

In [None]:
%%time
# Clean up prior version of the downloaded dataset if you are running this again
#!rm -rf CUB_200_2011  

# Unpack and then remove the downloaded compressed tar file
#!gunzip -c ./CUB_200_2011.tgz | tar xopf - 
#!rm CUB_200_2011.tgz

### Set some parameters for the rest of the notebook to use
Here we define a few parameters that drive the rest of the notebook. For example, `SAMPLE_ONLY` defaults to `True`, which restricts training to a handful of species. Setting `SAMPLE_ONLY` to `False` makes the notebook work with the entire dataset of 200 bird species. This makes training more challenging, and you will need to tune parameters and run more epochs.

An `EXCLUDE_IMAGE_LIST` is defined as a mechanism to address any corrupt images from the dataset and ensure they do not disrupt the process.

In [None]:
import pandas as pd
import json

import matplotlib.pyplot as plt
%matplotlib inline

# To speed up training and experimenting, you can use a small handful of species.
# To see the full list of the classes available, look at the content of CLASSES_FILE.
SAMPLE_ONLY  = True
# CLASSES = [13, 17, 35, 36, 47, 68, 73, 87]  # alternative: a larger eight-species sample
CLASSES = [13, 17, 35, 47]

# Otherwise, you can use the full set of species
if not SAMPLE_ONLY:
    # class numbers in the dataset are 1-based: 1..200
    CLASSES = list(range(1, 201))

BASE_DIR   = 'CUB_MINI/' #'CUB_200_2011/'
IMAGES_DIR = BASE_DIR + 'images/'

CLASSES_FILE = BASE_DIR + 'classes.txt'
IMAGE_FILE   = BASE_DIR + 'images.txt'
LABEL_FILE   = BASE_DIR + 'image_class_labels.txt'

SPLIT_RATIOS = (0.7, 0.2, 0.1)

CLASS_COLS      = ['class_number','class_id']

EXCLUDE_IMAGE_LIST = ['087.Mallard/Mallard_0130_76836.jpg']

### Understand the dataset
Show the list of bird species or dataset classes.

In [None]:
classes_df = pd.read_csv(CLASSES_FILE, sep=' ', names=CLASS_COLS, header=None)
criteria = classes_df['class_number'].isin(CLASSES)
classes_df = classes_df[criteria]

class_name_list = sorted(classes_df['class_id'].unique().tolist())
print(class_name_list)

For each species, there are dozens of images of various shapes and sizes. By dividing the entire dataset into individual named (numbered) folders, the images are in effect labelled for supervised learning using image classification and object detection algorithms. 

The following function displays a grid of thumbnail images for all the image files for a given species.

In [None]:
def show_species(species_id):
    _im_list = !ls $IMAGES_DIR/$species_id

    NUM_COLS = 4
    IM_COUNT = len(_im_list)

    print('Species ' + species_id + ' has ' + str(IM_COUNT) + ' images.')
    
    NUM_ROWS = int(IM_COUNT / NUM_COLS)
    if ((IM_COUNT % NUM_COLS) > 0):
        NUM_ROWS += 1

    fig, axarr = plt.subplots(NUM_ROWS, NUM_COLS)
    fig.set_size_inches(12.0, 20.0, forward=True)

    curr_col = 0
    for curr_img in range(IM_COUNT):
        # build the path to the image file, then read the image
        f = IMAGES_DIR + species_id + '/' + _im_list[curr_img]
        a = plt.imread(f)

        # the grid is filled column by column: the row is the index modulo NUM_ROWS
        row = curr_img % NUM_ROWS
        # plot on relevant subplot
        axarr[row, curr_col].imshow(a)
        if row == (NUM_ROWS - 1):
            # we have finished the current column, so move to the next one
            curr_col += 1

    fig.tight_layout()       
    plt.show()
        
    # Clean up
    plt.clf()
    plt.cla()
    plt.close()

In [None]:
show_species('013.Bobolink')

### Create train/val/test dataframes from our dataset
Here we split our dataset into training, testing, and validation datasets, each in their own Pandas dataframe.
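
The per-class arithmetic behind the split can be sketched in plain Python. This is a minimal sketch rather than part of the original notebook: `split_counts` is a hypothetical helper, and it assumes that `DataFrame.sample(frac=...)` selects `round(frac * len)` rows, which may differ across pandas versions.

```python
def split_counts(n_images, splits=(0.7, 0.2, 0.1)):
    """Return (train, val, test) counts for one class with n_images images."""
    n_train = round(n_images * splits[0])
    remainder = n_images - n_train
    # The test share is drawn from what is left after the training images are removed,
    # so its fraction is rescaled to test / (val + test).
    n_test = round(remainder * splits[2] / (splits[1] + splits[2]))
    n_val = remainder - n_test
    return n_train, n_val, n_test

# A class with 60 images and the default 70/20/10 ratios
print(split_counts(60))
```

Because the split is applied per class, each species contributes to the training, validation, and test sets in roughly the same proportions.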

In [None]:
def split_to_train_val_test(df, label_column, splits=(0.7, 0.2, 0.1), verbose=False):
    train_df, val_df, test_df = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()

    labels = df[label_column].unique()
    for lbl in labels:
        lbl_df = df[df[label_column] == lbl]

        lbl_train_df        = lbl_df.sample(frac=splits[0])
        lbl_val_and_test_df = lbl_df.drop(lbl_train_df.index)
        lbl_test_df         = lbl_val_and_test_df.sample(frac=splits[2]/(splits[1] + splits[2]))
        lbl_val_df          = lbl_val_and_test_df.drop(lbl_test_df.index)

        if verbose:
            print('\n{}:\n---------\ntotal:{}\ntrain_df:{}\nval_df:{}\ntest_df:{}'.format(lbl,
                                                                        len(lbl_df), 
                                                                        len(lbl_train_df), 
                                                                        len(lbl_val_df), 
                                                                        len(lbl_test_df)))
        # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
        train_df = pd.concat([train_df, lbl_train_df])
        val_df   = pd.concat([val_df, lbl_val_df])
        test_df  = pd.concat([test_df, lbl_test_df])

    # shuffle them on the way out using .sample(frac=1)
    return train_df.sample(frac=1), val_df.sample(frac=1), test_df.sample(frac=1)

def get_train_val_dataframes():
    images_df = pd.read_csv(IMAGE_FILE, sep=' ',
                            names=['image_pretty_name', 'image_file_name'],
                            header=None)
    image_class_labels_df = pd.read_csv(LABEL_FILE, sep=' ',
                                names=['image_pretty_name', 'orig_class_id'], header=None)

    # Merge the metadata into a single flat dataframe for easier processing
    full_df = pd.DataFrame(images_df)
    full_df = full_df[~full_df.image_file_name.isin(EXCLUDE_IMAGE_LIST)]

    full_df.reset_index(inplace=True, drop=True)
    full_df = pd.merge(full_df, image_class_labels_df, on='image_pretty_name')

    if SAMPLE_ONLY:
        # grab a small subset of species for testing
        criteria = full_df['orig_class_id'].isin(CLASSES)
        full_df = full_df[criteria]
        print('Using subset of total images based on sample class list. subtotal: {}'.format(full_df.shape[0]))

    unique_classes = full_df['orig_class_id'].drop_duplicates()
    sorted_unique_classes = sorted(unique_classes)
    id_to_one_based = {}
    i = 1
    for c in sorted_unique_classes:
        id_to_one_based[c] = str(i)
        i += 1

    full_df['class_id'] = full_df['orig_class_id'].map(id_to_one_based)
    full_df.reset_index(inplace=True, drop=True)

    def get_class_name(fn):
        return fn.split('/')[0]
    full_df['class_name'] = full_df['image_file_name'].apply(get_class_name)
    full_df = full_df.drop(['image_pretty_name'], axis=1)

    # split into training, validation, and test sets
    train_df, val_df, test_df = split_to_train_val_test(full_df, 'class_id', SPLIT_RATIOS)

    train_df.reset_index(inplace=True, drop=True)
    val_df.reset_index(inplace=True, drop=True)
    test_df.reset_index(inplace=True, drop=True)
    
    print('num images total: ' + str(images_df.shape[0]))
    print('\nnum train: ' + str(train_df.shape[0]))
    print('num val: ' + str(val_df.shape[0]))
    print('num test: ' + str(test_df.shape[0]))
    return train_df, val_df, test_df

In [None]:
train_df, val_df, test_df = get_train_val_dataframes()
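
The class-id remapping inside `get_train_val_dataframes` is easy to miss, so here is a small standalone sketch of it. It reuses the default sample classes from this notebook; `id_to_label` is a hypothetical name added only for illustration.

```python
CLASSES = [13, 17, 35, 47]  # same sample classes as the notebook default

# Map each original class id to a contiguous 1-based id (stored as a string,
# as in get_train_val_dataframes), in ascending order of the original id.
id_to_one_based = {c: str(i) for i, c in enumerate(sorted(CLASSES), start=1)}

# The dataset class used for training later subtracts 1 to get a 0-based label.
id_to_label = {c: int(v) - 1 for c, v in id_to_one_based.items()}

print(id_to_one_based)
print(id_to_label)
```

Because the class names carry a zero-padded numeric prefix (for example `013.Bobolink`), sorting `class_name_list` alphabetically yields the same order, so a 0-based label can index that list directly.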

## Train the model 
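Here we wrap the dataframes in PyTorch data loaders, define a ResNet50-based model, and train its new classification layers.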

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt

In [None]:
import numpy as np
import torch
from torch import nn
from torch import optim
import torch.nn.functional as F
from torchvision import datasets, transforms, models

In [None]:
HEIGHT = 224
WIDTH  = 224
BATCH_SIZE = 8

In [None]:
print(f'PyTorch version: {torch.__version__}')

### Prepare image data generators from our dataframes
Instead of having to make copies of all the images into separate train, test, and validation folders, we would like to leave the images in place. To let PyTorch train and test against these datasets, we create a custom PyTorch dataset that returns images and labels based on the images identified in a Pandas dataframe.

In [None]:
from PIL import Image
import io

class ImageDataset(torch.utils.data.Dataset):
    def __init__(self, transform=None, dataframe=None):
        self.data = dataframe
        self.transform = transform
        
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, index):
        fn = IMAGES_DIR + self.data.loc[index]['image_file_name']
        
        # make the label 0-based, not 1-based
        label = int(self.data.loc[index]['class_id']) - 1
        
        image = Image.open(fn)
        image = image.convert('RGB')
        
        if self.transform is not None:
            image = self.transform(image)
            
        label = torch.tensor(label)
            
        return image, label

In [None]:
train_dataset = ImageDataset(#file_path=None,
                             transform=transforms.Compose([
                                   transforms.RandomResizedCrop(size=256, scale=(0.8, 1.0), ratio=(0.75, 1.33)),
                                   transforms.RandomRotation(degrees=15),
                                   transforms.RandomHorizontalFlip(),
                                   transforms.CenterCrop(size=224),
                                   transforms.ToTensor(),
                                   transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                                        std= [0.229, 0.224, 0.225])]), 
                             dataframe=train_df)
val_dataset = ImageDataset(#file_path=None,
                           transform=transforms.Compose([
                               transforms.Resize(size=256),
                               transforms.CenterCrop(size=224),
                               transforms.ToTensor(),
                               transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                                    std= [0.229, 0.224, 0.225])]), 
                           dataframe=val_df)
test_dataset = ImageDataset(#file_path=None,
                           transform=transforms.Compose([
                               transforms.Resize(size=256),
                               transforms.CenterCrop(size=224),
                               transforms.ToTensor(),
                               transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                                    std= [0.229, 0.224, 0.225])]), 
                           dataframe=test_df)
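
To make the `Normalize` step concrete, the following plain-Python sketch applies the same per-channel arithmetic to a single pixel. The helper names are hypothetical, and the pixel is assumed to already be scaled to the range 0 to 1, as `ToTensor` does.

```python
MEAN = [0.485, 0.456, 0.406]  # per-channel means used in the notebook's transforms
STD  = [0.229, 0.224, 0.225]  # per-channel standard deviations

def normalize_pixel(rgb):
    """Apply (value - mean) / std to each channel of one RGB pixel."""
    return [(v - m) / s for v, m, s in zip(rgb, MEAN, STD)]

def denormalize_pixel(rgb):
    """Invert the normalization, e.g. before displaying a tensor as an image."""
    return [v * s + m for v, m, s in zip(rgb, MEAN, STD)]

pixel = [0.5, 0.5, 0.5]  # a mid-gray pixel, chosen only for illustration
normalized = normalize_pixel(pixel)
restored = denormalize_pixel(normalized)
print(normalized, restored)
```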

In [None]:
from torch.utils.data.sampler import SubsetRandomSampler
test_sampler = SubsetRandomSampler(test_df.index.values.tolist())
testloader   = torch.utils.data.DataLoader(test_dataset,
                   sampler=test_sampler, batch_size=BATCH_SIZE)

train_sampler = SubsetRandomSampler(train_df.index.values.tolist())
trainloader   = torch.utils.data.DataLoader(train_dataset,
                   sampler=train_sampler, batch_size=BATCH_SIZE)

val_sampler = SubsetRandomSampler(val_df.index.values.tolist())
valloader   = torch.utils.data.DataLoader(val_dataset,
                   sampler=val_sampler, batch_size=BATCH_SIZE)

In [None]:
print('Will process {}/{} ({:.0f}%) of train data'.format(
        len(trainloader.sampler), len(trainloader.dataset),
        100. * len(trainloader.sampler) / len(trainloader.dataset)))
print('Will process {}/{} ({:.0f}%) of val data'.format(
        len(valloader.sampler), len(valloader.dataset),
        100. * len(valloader.sampler) / len(valloader.dataset)))

In [None]:
print(len(trainloader))
for batch_idx, (data, target) in enumerate(trainloader, 1):
    print(f'Batch #{batch_idx}, target: {target}')

### Define the model

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() 
                                  else 'cpu')
device

In [None]:
model = models.resnet50(pretrained=True, progress=False)

In [None]:
for param in model.parameters():
    param.requires_grad = False
    
fc_inputs = model.fc.in_features
model.fc = nn.Sequential(nn.Linear(fc_inputs, 256),
                                 nn.ReLU(),
                                 nn.Dropout(0.3),
                                 nn.Linear(256, len(class_name_list)),
                                 nn.LogSoftmax(dim=1)) # for using NLLLoss()
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.fc.parameters(), lr=0.003)
model.to(device)
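
The pairing of `nn.LogSoftmax` with `nn.NLLLoss` can be illustrated for a single sample in plain Python. This is a sketch with hypothetical scores, not the PyTorch implementation; note that `nn.NLLLoss` additionally averages over the batch by default.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class for one sample."""
    return -log_probs[target]

logits = [2.0, 0.5, -1.0, 0.1]   # hypothetical raw scores for 4 classes
log_probs = log_softmax(logits)
loss = nll_loss(log_probs, target=0)

# Exponentiating the log-probabilities recovers probabilities,
# which is what torch.exp(logps) does in the training loop.
probs = [math.exp(lp) for lp in log_probs]

print(f"loss={loss:.4f}, probabilities sum to {sum(probs):.4f}")
```

In PyTorch, `nn.CrossEntropyLoss` applied to raw scores computes the same quantity as `nn.LogSoftmax` followed by `nn.NLLLoss`; the notebook keeps the explicit `LogSoftmax` so that the model output can be exponentiated into probabilities at prediction time.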

In [None]:
def count_parameters(m):
    return sum(p.numel() for p in m.parameters() if p.requires_grad)
print(f'Number of trainable parameters: {count_parameters(model):,d}')
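
As a sanity check on the number printed above, the trainable parameters of the new head can be counted by hand, since all pretrained layers are frozen. The sketch below assumes the default four-species sample; `linear_params` is a hypothetical helper.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

fc_inputs = 2048    # in_features of the final layer in torchvision's ResNet50
num_classes = 4     # assumption: the default four-species sample

# Only the two Linear layers in the replacement head have trainable parameters;
# ReLU, Dropout, and LogSoftmax have none.
trainable = linear_params(fc_inputs, 256) + linear_params(256, num_classes)
print(f"{trainable:,d}")
```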

### Perform training and save the model

In [None]:
%%time

epochs = 5
steps  = 0
running_loss = 0
print_every  = 3
train_losses, test_losses, test_accuracies = [], [], []

for epoch in range(epochs):
    for inputs, labels in trainloader:
        steps += 1
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        logps = model.forward(inputs)
        loss  = criterion(logps, labels)
        
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        
        if steps % print_every == 0:
            test_loss = 0
            accuracy = 0
            model.eval()
            with torch.no_grad():
                for inputs, labels in valloader:
                    inputs, labels = inputs.to(device), labels.to(device)
                    logps = model.forward(inputs)
                    batch_loss = criterion(logps, labels)
                    test_loss += batch_loss.item()
                    
                    ps = torch.exp(logps)
                    top_p, top_class = ps.topk(1, dim=1)
                    equals = top_class == labels.view(*top_class.shape)
                    accuracy += torch.mean(equals.type(torch.FloatTensor)).item()
            test_accuracies.append(accuracy/len(valloader))
            # average the training loss over the steps since the last report
            train_losses.append(running_loss/print_every)
            test_losses.append(test_loss/len(valloader))
            # these metrics are computed on the validation loader
            print(f"Epoch {epoch+1}/{epochs}.. "
                  f"Train loss: {running_loss/print_every:.3f}.. "
                  f"Validation loss: {test_loss/len(valloader):.3f}.. "
                  f"Validation accuracy: {accuracy/len(valloader):.3f}")
            running_loss = 0
            model.train()

In [None]:
#torch.save(model, 'model.pth')

In [None]:
torch.save(model.state_dict(), 'model.pth')

### Plot accuracy and loss across epochs

In [None]:
plt.plot(train_losses, label='Training loss')
plt.plot(test_losses, label='Validation loss')
plt.legend(frameon=False)
plt.show()

In [None]:
plt.plot(test_accuracies, label='Validation accuracy')
plt.legend(frameon=False)
plt.show()
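
Each point in these plots corresponds to one validation pass, which the training loop runs every `print_every` steps rather than once per epoch. The following sketch estimates how many points to expect; `num_eval_points` is a hypothetical helper, and the training-set size used in the example is made up.

```python
import math

def num_eval_points(num_train_images, batch_size=8, epochs=5, print_every=3):
    """Estimate how many validation passes the training loop performs."""
    # len(trainloader): the last, possibly smaller, batch is kept by default
    steps_per_epoch = math.ceil(num_train_images / batch_size)
    # the step counter is not reset between epochs
    total_steps = steps_per_epoch * epochs
    return total_steps // print_every

print(num_eval_points(166))  # 166 is a hypothetical training-set size
```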

## Test the model

In [None]:
test_transforms = transforms.Compose([
                               transforms.Resize(size=256),
                               transforms.CenterCrop(size=224),
                               transforms.ToTensor(),
                               transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                                    std= [0.229, 0.224, 0.225])])

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
del model

In [None]:
#model=torch.load('model.pth', map_location=torch.device(device))

In [None]:
model     = models.resnet50(pretrained=True)
fc_inputs = model.fc.in_features
model.fc  = nn.Sequential(nn.Linear(fc_inputs, 256),
                                 nn.ReLU(),
                                 nn.Dropout(0.3),
                                 nn.Linear(256, len(class_name_list)),
                                 nn.LogSoftmax(dim=1)) # for using NLLLoss()
model.load_state_dict(torch.load('model.pth', map_location=torch.device(device)))

In [None]:
model.to(device)
model.eval()

In [None]:
def predict_image(image):
    # add a batch dimension: (C, H, W) -> (1, C, H, W)
    image_tensor = test_transforms(image).float().unsqueeze(0).to(device)
    # inference only, so skip gradient tracking (Variable is no longer needed)
    with torch.no_grad():
        output = model(image_tensor)
    # the model emits log-probabilities; exponentiate to get probabilities
    ps = torch.exp(output).cpu().numpy()
    index = ps.argmax()
    conf = ps[0][index]
    return index, conf

In [None]:
def get_random_images(df, num):
    sample_df = df.sample(num)
    sample_df.reset_index(inplace=True, drop=True)

    images = []
    labels = []
    for i in range(num):
        fn = IMAGES_DIR + sample_df.loc[i]['image_file_name']
        # make the label 0-based, not 1-based
        lbl = int(sample_df.loc[i]['class_id']) - 1
        # convert to RGB so any grayscale image still has three channels for Normalize
        img = Image.open(fn).convert('RGB')
        images.append(img)
        labels.append(lbl)
    return images, labels

In [None]:
images, labels = get_random_images(train_df, 7)
fig=plt.figure(figsize=(20,20))
classes = class_name_list 
for ii in range(len(images)):
    image = images[ii]
    index, conf = predict_image(image)
    sub = fig.add_subplot(1, len(images), ii+1)
    res = int(labels[ii]) == index
    sub.set_title(str(classes[index]) + ":" + str(res))
    plt.axis('off')
    plt.imshow(image)
    del image
plt.show()

In [None]:
from PIL import Image
def predict_bird_from_file(fn, verbose=True):
    image = Image.open(fn)
    image = image.convert('RGB')
    predicted_class_idx, confidence = predict_image(image)
    predicted_class = class_name_list[predicted_class_idx]
    if verbose:
        display(image)
        print('Class: {}, conf: {:.2f}'.format(predicted_class, confidence))
    del image
    return predicted_class_idx, confidence

In [None]:
fname = IMAGES_DIR + '/' + test_df.iloc[8]['image_file_name']
predict_bird_from_file(fname)

In [None]:
fname = IMAGES_DIR + '/' + test_df.iloc[0]['image_file_name']
predict_bird_from_file(fname)

### Assess prediction performance against validation and test datasets

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import itertools

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.GnBu):
    plt.figure(figsize=(7,7))
    plt.grid(False)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), 
                                  range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment='center',
                 color='white' if cm[i, j] > thresh else 'black')
    plt.tight_layout()
    plt.gca().set_xticklabels(class_name_list)
    plt.gca().set_yticklabels(class_name_list)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
from sklearn.metrics import confusion_matrix
def create_and_plot_confusion_matrix(actual, predicted):
    cnf_matrix = confusion_matrix(actual, np.asarray(predicted),labels=range(len(class_name_list)))
    plot_confusion_matrix(cnf_matrix, classes=range(len(class_name_list)))
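
For readers unfamiliar with the layout of a confusion matrix, the sketch below builds one in plain Python using the same convention as scikit-learn's `confusion_matrix`: rows are true classes and columns are predicted classes. The function names and the example labels are hypothetical.

```python
def confusion_counts(actual, predicted, num_classes):
    """Count (true class, predicted class) pairs; rows are true classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

def accuracy_from_matrix(matrix):
    """Correct predictions lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# hypothetical 0-based labels for 6 images across 3 classes
example_acts  = [0, 0, 1, 1, 2, 2]
example_preds = [0, 1, 1, 1, 2, 0]

cm = confusion_counts(example_acts, example_preds, num_classes=3)
print(cm, accuracy_from_matrix(cm))
```

Off-diagonal cells show which pairs of species the model confuses, which is often more informative than the overall accuracy.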

In [None]:
#from IPython.display import Image, display

# Iterate through entire dataframe, tracking predictions and accuracy. For mistakes, show the image, and the predicted and actual classes to help understand
# where the model may need additional tuning.

def test_image_df(df):
    print('Testing {} images'.format(df.shape[0]))
    num_errors = 0
    preds = []
    acts  = []
    for i in range(df.shape[0]):
        fname = df.iloc[i]['image_file_name']
        act   = int(df.iloc[i]['class_id']) - 1
        acts.append(act)
        pred, conf = predict_bird_from_file(IMAGES_DIR + '/' + fname, verbose=False)
        preds.append(pred)
        if (pred != act):
            num_errors += 1
            print('ERROR on image index {} -- Pred: {} {:.2f}, Actual: {}'.format(i, 
                                                                    class_name_list[pred], conf, 
                                                                    class_name_list[act]))
            display(Image.open(IMAGES_DIR + '/' + fname))
    return num_errors, preds, acts

In [None]:
num_images = val_df.shape[0]
num_errors, preds, acts = test_image_df(val_df)
print('\nAccuracy: {:.2f}, {}/{}'.format(1 - (num_errors/num_images), num_images - num_errors, num_images))

In [None]:
create_and_plot_confusion_matrix(acts, preds)

In [None]:
num_images = test_df.shape[0]
num_errors, preds, acts = test_image_df(test_df)
print('\nAccuracy: {:.2f}, {}/{}'.format(1 - (num_errors/num_images), num_images - num_errors, num_images))

In [None]:
create_and_plot_confusion_matrix(acts, preds)

### Test model against previously unseen images
Here we download images that the algorithm has not yet seen.

In [None]:
!wget -q -O northern-flicker-1.jpg https://upload.wikimedia.org/wikipedia/commons/5/5c/Northern_Flicker_%28Red-shafted%29.jpg
!wget -q -O northern-cardinal-1.jpg https://cdn.pixabay.com/photo/2013/03/19/04/42/bird-94957_960_720.jpg
!wget -q -O blue-jay-1.jpg https://cdn12.picryl.com/photo/2016/12/31/blue-jay-bird-feather-animals-b8ee04-1024.jpg
!wget -q -O blue-jay-2.jpg https://www.pennington.com/-/media/Images/Pennington-NA/US/blog/Wild-Bird/Blue-Jays/Blue-Jay-Eating-Peanuts.jpg
!wget -q -O hummingbird-1.jpg http://res.freestockphotos.biz/pictures/17/17875-hummingbird-close-up-pv.jpg
!wget -q -O northern-cardinal-2.jpg https://www.allaboutbirds.org/guide/assets/photo/63667291-480px.jpg
!wget -q -O american-goldfinch-1.jpg https://download.ams.birds.cornell.edu/api/v1/asset/59574291/medium
!wget -q -O purple-finch-1.jpg https://indianaaudubon.org/wp-content/uploads/2016/04/PurpleFinchRyanSanderson-e1463792335814.jpg
!wget -q -O purple-finch-2.jpg https://www.singing-wings-aviary.com/wp-content/uploads/2018/06/Purple-Finch.jpg
!wget -q -O mallard-1.jpg https://www.herefordshirewt.org/sites/default/files/styles/node_hero_default/public/2018-01/Mallard%20%C2%A9%20Mark%20Hamblin.jpg

In [None]:
predict_bird_from_file('american-goldfinch-1.jpg')
predict_bird_from_file('northern-cardinal-1.jpg')