# Define Parameters

The only parameter required for Labs 0 and 1 is the S3 bucket name.  Replace the placeholder with the name of your S3 bucket.

In [None]:
BUCKET_NAME = '<s3-bucket-name>'
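
A quick check can catch an unreplaced placeholder before a later cell fails at the S3 upload.  The following is a minimal sketch, not part of the workshop code: `looks_like_bucket_name` is a hypothetical helper, and its pattern covers only the basic S3 naming rules (3 to 63 characters; lowercase letters, digits, dots, and hyphens; starting and ending with a letter or digit).

```python
import re

# Simplified pattern for the basic S3 bucket naming rules.  This is an
# assumption for illustration: it does not cover every restriction
# (for example, names formatted like IP addresses).
_BUCKET_RE = re.compile(r'[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]')


def looks_like_bucket_name(name):
    """Return True if `name` passes the simplified naming check."""
    return _BUCKET_RE.fullmatch(name) is not None


# The unreplaced placeholder fails because of its angle brackets;
# 'my-workshop-bucket' is a made-up name used only for this example.
print(looks_like_bucket_name('<s3-bucket-name>'))
print(looks_like_bucket_name('my-workshop-bucket'))
```

Running a check like this right after setting `BUCKET_NAME` surfaces the problem early instead of several cells later.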

# Add required Python package

This workshop requires MXNet.  Install the corresponding Python package.

In [None]:
! pip install mxnet

# Explore the images dataset

For a 2-hour workshop, there is not enough time to work with the complete NABirds dataset of 48,000+ images.  Instead, we operate on a very small subset but go through the same process that would be used for the complete set.  The workshop uses 4 species, so you'll see 4 corresponding subdirectories in the `images` directory.  For the full dataset, you would see 555 subdirectories, one per species.

In [None]:
%cd ~/SageMaker/bird-classification-workshop
! ls images

### Examine folder for a single species

For each species, there are dozens of images of various shapes and sizes.  Because the dataset is divided into one numbered folder per species, the images are in effect labelled for supervised learning with an image classification algorithm.  The following code creates a Python list of all the image files for one species.

In [None]:
SPECIES = '0772'
im_list = !ls images/$SPECIES
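
The `!ls` shell escape only works inside IPython.  As a sketch of a plain-Python alternative, the hypothetical helper below (`list_species_images` is not part of the workshop code) builds the same kind of list with `os.listdir`, skipping anything that is not a regular file.

```python
import os


def list_species_images(images_dir, species):
    """Return the sorted file names found in one species folder."""
    species_dir = os.path.join(images_dir, species)
    return sorted(
        name for name in os.listdir(species_dir)
        if os.path.isfile(os.path.join(species_dir, name))
    )


# Only runs when the workshop folder layout is present in the
# current working directory.
if os.path.isdir(os.path.join('images', '0772')):
    im_list = list_species_images('images', '0772')
    print(len(im_list), 'images found')
```

Sorting makes the order reproducible, since `os.listdir` returns entries in arbitrary order.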

### Look at thumbnails of the images

To get a quick look at the full set of images for a given species, we can write a few lines of Python.  Notice the variety:

* image aspect ratios
* image sizes
* proximity to the bird
* background of the scene

It is important to have a wide variety to ensure that the trained model is capable of making robust predictions when fed arbitrary new images.

In [None]:
print('Initializing a large matplotlib.pyplot figure...')

import matplotlib.pyplot as plt
%matplotlib inline

NUM_COLS = 4
IM_COUNT = len(im_list)
NUM_ROWS = int(IM_COUNT / NUM_COLS)
if ((IM_COUNT % NUM_COLS) > 0):
    NUM_ROWS += 1

# squeeze=False keeps axarr two-dimensional even when there is only one row
fig, axarr = plt.subplots(NUM_ROWS, NUM_COLS, squeeze=False)
fig.set_size_inches(16.0, 64.0, forward=True)

print('Reading each image file and showing it in a grid (this could take a few seconds)...')
for curr_img in range(IM_COUNT):
    # build the local path to the image file, then read the image
    f = 'images/' + SPECIES + '/' + im_list[curr_img]
    a = plt.imread(f)

    # fill the grid row by row: the row is the index divided by NUM_COLS,
    # and the column is the remainder
    row, col = divmod(curr_img, NUM_COLS)
    # plot on the relevant subplot
    axarr[row, col].imshow(a)

print('Species ' + SPECIES + ' has ' + str(IM_COUNT) + ' images.')

fig.tight_layout()

print('Displaying thumbnails of all images in this folder (this could take a few seconds)...')
plt.show()
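
The grid arithmetic is easy to get wrong, so it can help to look at it in isolation.  The sketch below shows one way to size a thumbnail grid and to map an image index to a cell when filling row by row; `grid_shape` and `grid_position` are hypothetical helper names used only for illustration.

```python
import math


def grid_shape(image_count, num_cols):
    """Return (rows, cols) for a grid large enough to hold every image."""
    return math.ceil(image_count / num_cols), num_cols


def grid_position(index, num_cols):
    """Return the (row, col) of the image at `index`, filling row by row."""
    return divmod(index, num_cols)


print(grid_shape(10, 4))      # (3, 4)
print(grid_position(5, 4))    # (1, 1)
print(grid_position(9, 4))    # (2, 1)
```

With 10 images and 4 columns, the last row is only partly filled, which is why the row count is rounded up.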

# Package the images in RecordIO format

To use these images with SageMaker's Image Classification algorithm, they need to be packaged in RecordIO format.  This involves 2 steps:

1. Generate list files containing the file names of the images and their species labels.  One `.lst` file is created for training images and another for validation images.
2. Use the list files to create binary `.rec` files, one for training and another for validation.

MXNet provides a Python script called `im2rec.py` (**image** file to **RecordIO** file) that performs both steps.

### Generate .lst files

The next cell performs the first step, creating the `.lst` files given a ratio for the split between training data and validation data.  Notice the significant size difference between the two files.  We also display the tail of the validation file so you can see the format, which has 3 tab-separated columns:

1. A unique number for the file within the total set of images.
2. The label for this image.  The label is a number between 0 and one less than the number of total classes / categories / species.  In our workshop, we are using only 4 species, so the label is one of 0, 1, 2, or 3.
3. The relative path to the image file from the `images` folder.
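
The three-column format described above can be read back with a few lines of Python, for example to confirm that every label has images in the validation set.  This is a minimal sketch: `parse_lst_line` and `count_labels` are hypothetical helpers, the sample lines are made up, and the label is parsed through `float` on the assumption that it may be written in floating-point form such as `0.000000`.

```python
from collections import Counter


def parse_lst_line(line):
    """Split one tab-separated .lst line into (index, label, path)."""
    index, label, path = line.rstrip('\n').split('\t')
    # The label may be written as a float, so convert via float first.
    return int(index), int(float(label)), path


def count_labels(lines):
    """Count how many images carry each label, ignoring blank lines."""
    return Counter(parse_lst_line(line)[1] for line in lines if line.strip())


# Made-up sample lines; real indices, labels, and paths will differ.
sample = [
    '12\t0.000000\tfolder_a/example_a.jpg\n',
    '57\t3.000000\t0772/example_b.jpg\n',
    '31\t0.000000\tfolder_a/example_c.jpg\n',
]
print(count_labels(sample))
```

To use it on a real file, pass an open file object, for example `count_labels(open('nabirds_sample2_val.lst'))`.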

In [None]:
TRAIN_RATIO = 0.75

%cd ~/SageMaker/bird-classification-workshop/labs/lab1

print('\nGenerating .lst files.  This could take a minute...\n')
! python im2rec.py --list --recursive --train-ratio $TRAIN_RATIO nabirds_sample2 ../../images/

print('\nHere are the resulting files:')
! ls -l *.lst

print('\nHere are the last few lines of the validation list file:')
! tail *val.lst
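
To see why the two list files differ so much in size, it helps to look at what a ratio-based split does.  The sketch below is a simplified illustration of the idea, not the actual logic inside `im2rec.py`; `split_by_ratio` and its `seed` parameter are hypothetical.

```python
import random


def split_by_ratio(items, train_ratio, seed=100):
    """Shuffle a copy of `items` and split it into (train, validation)."""
    shuffled = list(items)
    # A fixed seed makes the split reproducible from run to run.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


train, val = split_by_ratio(range(200), 0.75)
print(len(train), len(val))   # 150 50
```

With a ratio of 0.75, the training list ends up about three times as long as the validation list.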

### Generate .rec files

Now that the `.lst` files are created, the next step is to create the RecordIO files.  For the workshop, since we are only working with 4 species, the `.rec` files are not that large.  The full dataset of NABirds images contains 555 species, making the resulting `.rec` files more than 100 times larger than these.

In [None]:
print('Generating packaged RecordIO files.  This could take a few minutes...\n')
! python im2rec.py --resize 256 nabirds_sample2 ../../images/

print('\nHere are the resulting files:')
! ls -l *.rec

# Make the .rec files available for training

Now that the `.rec` files have been created, they must be made available in S3.  The SageMaker training job pulls the `.rec` files from S3 at the start of the job, and it creates the model artifacts in S3 once training is complete.

In [None]:
print('Copying the packaged RecordIO files to S3 (this could take a minute)...')
! aws s3 cp nabirds_sample2_train.rec s3://$BUCKET_NAME/train/nabirds_sample2_train.rec
! aws s3 cp nabirds_sample2_val.rec s3://$BUCKET_NAME/validation/nabirds_sample2_val.rec

print('\nHere are the resulting files in S3:')
! aws s3 ls s3://$BUCKET_NAME/train/
! aws s3 ls s3://$BUCKET_NAME/validation/
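
The upload places each `.rec` file under its own prefix (`train/` or `validation/`).  As a sketch of that mapping in plain Python, the hypothetical helper below builds the destination URIs without contacting S3, which can be handy for checking paths before copying; the bucket name in the example is made up.

```python
import posixpath


def s3_destinations(bucket, prefix='nabirds_sample2'):
    """Map each local .rec file name to its S3 URI under train/ or validation/."""
    folders = {'train': 'train', 'val': 'validation'}
    return {
        f'{prefix}_{suffix}.rec':
            's3://' + posixpath.join(bucket, folder, f'{prefix}_{suffix}.rec')
        for suffix, folder in folders.items()
    }


# 'my-workshop-bucket' is a placeholder; substitute your own bucket name.
for local, uri in s3_destinations('my-workshop-bucket').items():
    print(local, '->', uri)
```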