# Lab 1 Notebook - Preparing images for training

This Jupyter notebook steps you through preparing the bird images dataset for training.  The images have been downloaded for you ahead of time.  In this lab, you will examine them to get an understanding of the raw input required for supervised learning with SageMaker's [Image Classification algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/image-classification.html).  You will then package the images into the specific format required by that algorithm.  Lastly, you will upload those packaged files to your S3 bucket for use in training the model.

## Jupyter notebook basics

If you are already familiar with how to execute Jupyter notebooks, please proceed to the next cell.  

For those of you who are fairly new to using Jupyter notebooks, here are a few items to get you started:

* This notebook has two types of cells: Code and Markdown.  This cell is a Markdown (documentation) cell.  As you proceed through the notebook, you'll read the documentation in these cells to understand the steps being taken.
* Code cells are formatted differently and have tracking information to the left of the cell: `In [ ]`.  If the space between the brackets is blank, the cell has not yet been executed.  If there is a number between the brackets, that number indicates the sequence in which the cell was executed (1 for the first, 2 for the next, and so on).  If there is an asterisk in the brackets (i.e., `In [*]`), the cell is currently executing, or is waiting for some cell before it to complete before it will begin executing.
* To execute a code cell, simply click on the `Cell` menu and click `Run Cells and Select Below`.  Or, much more conveniently, simply press Shift-Enter (i.e., the Shift key and the Enter key at the same time).
* You can use the arrow keys on your keyboard (not the up and down arrow icons below the menu bar) to move back and forth between cells.

Online help is provided from the `Help` menu.  You can also find tutorials for Jupyter with a simple web search.

## Step 1 - Define parameters for the notebook

The only parameter you need to provide for Lab 1 is the S3 bucket name.  Set it to the name of your S3 bucket.  Note that the S3 bucket name is not the URI to the bucket.  It is not `s3://deeplens-sagemaker-20181126-smithjohn-2`, but instead it is just `deeplens-sagemaker-20181126-smithjohn-2`.
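
If you are unsure whether a value is a bucket name or a URI, a small check can help.  The sketch below uses a hypothetical helper, `to_bucket_name`, which is not part of the workshop code.  It strips an `s3://` prefix and any trailing path so that only the bare bucket name remains.

```python
def to_bucket_name(value):
    """Return the bare bucket name from a bucket name or an s3:// URI.

    Hypothetical helper for illustration; not part of the workshop code.
    """
    prefix = 's3://'
    if value.startswith(prefix):
        value = value[len(prefix):]
    # Drop anything after the first slash (for example a key or a trailing '/').
    return value.split('/', 1)[0]


print(to_bucket_name('s3://deeplens-sagemaker-20181126-smithjohn-2'))
print(to_bucket_name('deeplens-sagemaker-20181126-smithjohn-2'))
```

Both calls print the same bare name, which is the form expected in `BUCKET_NAME`.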

`TRAIN_RATIO` is the fraction of the images to use as training images (0.75 means 75%).  The remaining images are used for validation.
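
As a rough illustration of what the ratio means, the sketch below splits a count of images into training and validation counts.  The helper `split_counts` is hypothetical, and truncating the training count is an assumption about rounding; the exact split produced later in this lab may differ by an image or so.

```python
def split_counts(num_images, train_ratio):
    """Split a number of images into (training, validation) counts.

    Hypothetical helper; truncating with int() is an assumption about rounding.
    """
    if not 0.0 <= train_ratio <= 1.0:
        raise ValueError('train_ratio must be between 0 and 1')
    num_train = int(num_images * train_ratio)
    return num_train, num_images - num_train


# With 200 images and a ratio of 0.75, 150 go to training and 50 to validation.
print(split_counts(200, 0.75))  # (150, 50)
```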

In [None]:
BUCKET_NAME = '<bucket-name>'  # name of bucket you created, something like deeplens-sagemaker-20181126-smithjohn

TRAIN_RATIO = 0.75

import matplotlib.pyplot as plt
%matplotlib inline

## Step 2 - Add required Python packages

This workshop requires MXNet, which also requires OpenCV.  Install the corresponding Python packages in this next cell.

In [None]:
! pip install mxnet 
! pip install opencv-python 

# For some reason, boto3 can get errors with certain versions of the AWS CLI.
# To avoid hitting errors like 'serviceID not defined in the metadata ...', we install
# a specific AWS CLI version.
! pip install awscli==1.16.9 awsebcli==3.14.4 --user

## Step 3 - Explore the images dataset

For a 2-hour workshop, we do not have time to work with the complete NABirds dataset of 48,000+ images.  Instead, we operate on a very small subset but go through the entire process that would be used for the complete set.  For the workshop, we use 4 species, and you'll see 4 corresponding subdirectories in the `images` directory.  For the full dataset, you would see 555 species / subdirectories.

In [None]:
%cd ~/SageMaker/bird-classification-workshop
! ls images

### Examine images for each sample species

For each species, there are dozens of images of various shapes and sizes.  By dividing the entire dataset into individual named (numbered) folders, the images are in effect labelled for supervised learning using an image classification algorithm.  The following function shows a grid of thumbnail images for all the image files for a given species.
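
To make the "folder name as label" idea concrete, here is a minimal sketch that counts the files in each species folder.  The helper `count_images_per_folder` and the stand-in directory tree are assumptions for illustration; in the workshop you would point it at the real `images` directory instead.

```python
import os
import tempfile


def count_images_per_folder(images_root):
    """Map each subdirectory name (the species identifier) to its file count."""
    counts = {}
    for name in sorted(os.listdir(images_root)):
        folder = os.path.join(images_root, name)
        if os.path.isdir(folder):
            counts[name] = len(os.listdir(folder))
    return counts


# Build a tiny stand-in directory tree so the example runs anywhere.
# The file names below are made up; only the folder layout matters.
with tempfile.TemporaryDirectory() as root:
    for species_id, num_files in [('0752', 3), ('0772', 2)]:
        os.makedirs(os.path.join(root, species_id))
        for i in range(num_files):
            open(os.path.join(root, species_id, 'img_%d.jpg' % i), 'w').close()
    demo_counts = count_images_per_folder(root)

print(demo_counts)  # {'0752': 3, '0772': 2}
```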

In [None]:
def show_species(species_id):
    _im_list = !ls ~/SageMaker/bird-classification-workshop/images/$species_id

    print('Initializing a large matplotlib.pyplot figure...')

    NUM_COLS = 4
    IM_COUNT = len(_im_list)

    print('Species ' + species_id + ' has ' + str(IM_COUNT) + ' images.')
    
    NUM_ROWS = int(IM_COUNT / NUM_COLS)
    if ((IM_COUNT % NUM_COLS) > 0):
        NUM_ROWS += 1

    # squeeze=False keeps axarr two-dimensional even when there is only one row
    fig, axarr = plt.subplots(NUM_ROWS, NUM_COLS, squeeze=False)
    fig.set_size_inches(8.0, 32.0, forward=True)

    print('Reading each image file and showing it in a grid (this could take a few seconds)...')
    for curr_img in range(IM_COUNT):
        # build the path to the image file (relative to the workshop directory), then read the image
        f = 'images/' + species_id + '/' + _im_list[curr_img]
        a = plt.imread(f)

        # fill the grid left to right, top to bottom
        row, col = divmod(curr_img, NUM_COLS)
        # plot on relevant subplot
        axarr[row, col].imshow(a)

    fig.tight_layout()       

    print('Displaying thumbnails of all images in this folder (this could take a few seconds)...')
    plt.show()
        
    # Clean up
    plt.clf()
    plt.cla()
    plt.close()

#### Purple Martin

In [None]:
show_species('0752')

#### Northern Cardinal

In [None]:
show_species('0772')

#### American Goldfinch

In [None]:
show_species('0794')

#### Eastern Bluebird

In [None]:
show_species('0842')

### Look at thumbnails of the images

Now that you have displayed thumbnail images for each species in this workshop, examine them and notice the variety:

* image aspect ratios
* image sizes
* size of the bird relative to the scene
* background of the scene

It is important to have a wide variety to ensure that the trained model is capable of making robust predictions when fed arbitrary new images.  Note that when we train the image classification model, SageMaker gives us an `augmentation_type` hyperparameter.  When set to `crop`, it augments the training images on your behalf by randomly cropping them and flipping them horizontally.  So, if the original image has a bird facing to the left, the model can also see a version with the bird facing to the right.  This boosts the ability of the model to handle new images, as the birds could be facing in any direction.
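
The horizontal flip mentioned above can be illustrated on a tiny image represented as nested lists.  This is only a sketch of the idea, with a hypothetical `flip_horizontal` helper; it is not how SageMaker implements augmentation internally.

```python
def flip_horizontal(image):
    """Mirror an image left to right; the image is a list of pixel rows."""
    return [list(reversed(row)) for row in image]


# A 2x3 "image" where 1 marks the bird, which sits on the left edge.
original = [
    [1, 0, 0],
    [1, 0, 0],
]
flipped = flip_horizontal(original)
print(flipped)  # [[0, 0, 1], [0, 0, 1]]
```

After the flip, the bird sits on the right edge, while the label for the image stays the same.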

Let's proceed to package the images in the format required by SageMaker's Image Classification algorithm.

## Step 4 - Package the images in RecordIO format

The following figure illustrates the process of packaging the images.

![](./docs/screenshots/prepare_images.png)

To use these images with SageMaker's Image Classification algorithm, they need to be packaged in RecordIO format.  RecordIO format enables you to put the entire set of training images into a single file, and all of the validation images into a second file.  For large datasets with tens of thousands or hundreds of thousands of images, [Apache MXNet RecordIO](https://mxnet.incubator.apache.org/architecture/note_data_loading.html) format makes transferring images and iterating on sets of images significantly more efficient.  
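
To see why a single packed file helps, consider the simplified sketch below.  It is not the actual RecordIO byte layout; it only illustrates the general idea of writing many length-prefixed blobs into one stream that can then be read back sequentially.  The function names and the 4-byte length header are assumptions made for this example.

```python
import io
import struct


def write_records(stream, payloads):
    """Append each payload as a 4-byte little-endian length followed by the bytes."""
    for payload in payloads:
        stream.write(struct.pack('<I', len(payload)))
        stream.write(payload)


def read_records(stream):
    """Yield payloads back in the order they were written."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return
        (length,) = struct.unpack('<I', header)
        yield stream.read(length)


# Stand-in byte strings instead of real JPEG data.
buffer = io.BytesIO()
write_records(buffer, [b'image-one', b'image-two-bytes'])
buffer.seek(0)
records = list(read_records(buffer))
print(records)  # [b'image-one', b'image-two-bytes']
```

Reading one large file front to back avoids opening thousands of small image files, which is where the efficiency gain comes from.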

Packaging the images involves 2 steps:

1. Generate the list file containing the filenames of the images and the species identifier.  There is one `.lst` file created for training images and another for validation images.
2. Use the list files to create binary `.rec` files, one for training and another for validation.

MXNet provides a Python script called `im2rec` (**image** file to **RecordIO** file) to make it easy to perform these steps.

### Generate .lst files

The next cell does the first step of creating the `.lst` files given a ratio for the split between training data and validation data.  Notice the significant size difference between the two files.  We also display the tail of the validation file, so you can see the format, which has 3 tab-separated columns:

1. A unique number for the file within the total set of images.
2. The label for this image.  The label is a number between 0 and one less than the number of total classes / categories / species.  In our workshop, we are using only 4 species, so the label is one of 0, 1, 2, or 3.
3. The relative path to the image file from the `images` folder.
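
The three columns can be sketched in a few lines of Python.  The helpers below are hypothetical and only mimic the format described above; writing the label as a float and assigning labels in sorted folder order are assumptions about what `im2rec.py` does, and the file name is made up.

```python
def format_lst_line(index, label, rel_path):
    """Build one tab-separated line: index, label, relative image path."""
    return '%d\t%f\t%s' % (index, label, rel_path)


def parse_lst_line(line):
    """Split a line back into (index, label, relative path)."""
    index, label, rel_path = line.rstrip('\n').split('\t')
    return int(index), int(float(label)), rel_path


# Assumption: labels follow the sorted order of the species folders.
species_folders = ['0752', '0772', '0794', '0842']
label_for = {name: i for i, name in enumerate(sorted(species_folders))}

line = format_lst_line(7, label_for['0794'], '0794/example.jpg')
print(repr(line))
print(parse_lst_line(line))  # (7, 2, '0794/example.jpg')
```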

In [None]:
%cd ~/SageMaker/bird-classification-workshop/labs/lab1

print('\nGenerating .lst files.  This could take a minute...\n')
! python im2rec.py --list --recursive --train-ratio $TRAIN_RATIO nabirds_sample ../../images/

print('\nHere are the resulting files:')
! ls -l *.lst

print('\nHere are the last few lines of the validation list file:')
! tail *val.lst

print('\nShow the number of training images by counting lines in the list file.')
! wc -l < nabirds_sample_train.lst

To train the Image Classification model, you will need to know the number of training images.  You can use the Linux `wc` command to count the number of lines in the training list file, as shown in the previous code cell.  Note that the `TRAIN_RATIO` defined at the start of this notebook controls the fraction of images used for training.  A higher ratio yields more training images but leaves fewer images for validation, which makes the validation results less reliable and overfitting harder to detect.
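
If you would rather capture the count in a Python variable than read it from the cell output, a minimal sketch follows.  The `count_lines` helper and the temporary file with made-up entries are assumptions for illustration; in the notebook you would call it on `nabirds_sample_train.lst`.

```python
import os
import tempfile


def count_lines(path):
    """Count the lines in a text file, like `wc -l`."""
    with open(path) as f:
        return sum(1 for _ in f)


# Demonstrate on a small temporary file with made-up entries.
with tempfile.NamedTemporaryFile('w', suffix='.lst', delete=False) as tmp:
    tmp.write('0\t0.000000\t0752/a.jpg\n1\t1.000000\t0772/b.jpg\n')
    demo_path = tmp.name

num_training_samples = count_lines(demo_path)
os.remove(demo_path)
print(num_training_samples)  # 2
```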

### Generate .rec files

Now that the `.lst` files are created, the next step is to create the RecordIO files.  For the workshop, since we are only working with 4 species, the `.rec` files are not that large.  The full dataset of NABirds images contains 555 species, making the resulting `.rec` files more than 100 times larger than these.

In [None]:
print('Generating packaged RecordIO files.  This could take a minute...\n')
! python im2rec.py --resize 256 nabirds_sample ../../images/

print('\nHere are the resulting files:')
! ls -l *.rec

## Step 5 - Make the .rec files available for training

Now that the `.rec` files have been created, they must be made available in S3.  The SageMaker training job pulls the `.rec` files from S3 at the start of the job, and it creates the model artifacts in S3 once training is complete.

In [None]:
print('Copying the packaged RecordIO files to S3 (this could take a minute)...')
! aws s3 cp nabirds_sample_train.rec s3://$BUCKET_NAME/train/nabirds_sample_train.rec
! aws s3 cp nabirds_sample_val.rec s3://$BUCKET_NAME/validation/nabirds_sample_val.rec

print('\nHere are the resulting files in S3:')    
! aws s3 ls s3://$BUCKET_NAME/train/
! aws s3 ls s3://$BUCKET_NAME/validation/
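
The same upload could also be done from Python with boto3 instead of the AWS CLI.  The sketch below is an alternative under stated assumptions, not part of the workshop: `build_uploads` and `upload_rec_files` are hypothetical helpers, and the upload itself needs boto3 installed and valid AWS credentials, so only the key planning is run here.

```python
def build_uploads(prefix='nabirds_sample'):
    """Return (local file, S3 key) pairs matching the layout used in the cell above."""
    return [
        (prefix + '_train.rec', 'train/' + prefix + '_train.rec'),
        (prefix + '_val.rec', 'validation/' + prefix + '_val.rec'),
    ]


def upload_rec_files(bucket_name):
    """Upload both .rec files; needs boto3 and valid AWS credentials."""
    import boto3  # imported here so the planning helper above works without it

    s3 = boto3.client('s3')
    for local_file, key in build_uploads():
        s3.upload_file(local_file, bucket_name, key)


print(build_uploads())
```

In the notebook, calling `upload_rec_files(BUCKET_NAME)` would then place the files under the same `train/` and `validation/` prefixes as the CLI commands.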

Navigate to the S3 console and locate your newly uploaded `.rec` files.

You have completed preparation of the bird images dataset as input to training a model based on SageMaker's Image Classification algorithm.  You can save the notebook and leave this browser tab.

Proceed to Lab 2 of the workshop to perform the actual model training.