# Data Preparation and RecordIO File Generation

The [Caltech Birds (CUB 200 2011)](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html) dataset contains 11,788 images across 200 bird species (the original technical report can be found [here](http://www.vision.caltech.edu/visipedia/papers/CUB_200_2011.pdf)).  Each species comes with around 60 images, with a typical size of about 350 pixels by 500 pixels.  Bounding boxes are provided, as are annotations of bird parts.  A recommended train/test split is given, but image size data is not.

![](./cub_200_2011_snapshot.png)

The dataset can be downloaded [here](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html).

## Step 0. Download and unpack the dataset

Here we download the birds dataset from Caltech.

In [1]:
import os 
import urllib.request
import conf
#print(conf.num_epochs)

def download(url):
    filename = url.split('/')[-1]
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url, filename)

In [2]:
download('https://s3.amazonaws.com/fast-ai-imageclas/CUB_200_2011.tgz')

In [3]:
%%time
# Clean up prior version of the downloaded dataset if you are running this again
!rm -rf CUB_200_2011  

# Unpack and then remove the downloaded compressed tar file
!gunzip -c ./CUB_200_2011.tgz | tar xopf - 

def combine(many_class_file, one_class_file):
    # Collapse the 200 species labels into a single "bird" class:
    # count the labeled images, then write one "<image_id> 1" line per image.
    with open(many_class_file, "r") as file:
        line_count = sum(1 for line in file if line != "\n")

    print(line_count)

    with open(one_class_file, "w") as file_object:
        for ii in range(line_count):
            file_object.write('{} 1\n'.format(ii + 1))


combine("CUB_200_2011/image_class_labels.txt", "CUB_200_2011/one_image_class_labels.txt")
!echo "1 001.bird" > CUB_200_2011/one_classes.txt
#!rm CUB_200_2011.tgz

11788
CPU times: user 180 ms, sys: 23.8 ms, total: 204 ms
Wall time: 13.9 s


In [4]:
import pandas as pd
import cv2
import boto3
import json

runtime = boto3.client(service_name='runtime.sagemaker')

import matplotlib.pyplot as plt
%matplotlib inline

RANDOM_SPLIT = True



# To speed up training and experimenting, you can use a small handful of species.
# To see the full list of the classes available, look at the content of CLASSES_FILE.

CLASSES = [1]
print(CLASSES)
RESIZE_SIZE = 256

BASE_DIR   = 'CUB_200_2011/'
IMAGES_DIR = BASE_DIR + 'images/'

CLASSES_FILE = BASE_DIR + 'one_classes.txt'
BBOX_FILE    = BASE_DIR + 'bounding_boxes.txt'
IMAGE_FILE   = BASE_DIR + 'images.txt'
LABEL_FILE   = BASE_DIR + 'one_image_class_labels.txt'
SIZE_FILE    = BASE_DIR + 'sizes.txt'
SPLIT_FILE   = BASE_DIR + 'train_test_split.txt'

TRAIN_LST_FILE = 'birds_ssd_train.lst'
VAL_LST_FILE   = 'birds_ssd_val.lst'

TRAIN_RATIO     = 0.8
CLASS_COLS      = ['class_number','class_id']
IM2REC_SSD_COLS = ['header_cols', 'label_width', 'zero_based_id', 'xmin', 'ymin', 'xmax', 'ymax', 'image_file_name']

[1]


In [5]:
#classes_df = pd.read_csv(CLASSES_FILE, sep=' ', names=CLASS_COLS, header=None)
#criteria = classes_df['class_number'].isin(CLASSES)
#classes_df = classes_df[criteria]
#print(classes_df.to_csv(columns=['class_id'], sep='\t', index=False, header=False))

## Step 1. Gather image sizes

For this particular dataset, bounding box annotations are specified in absolute terms.  RecordIO format requires them to be defined in terms relative to the image size.  The following code visits each image, extracts the height and width, and saves this information into a file for subsequent use.  Some other publicly available datasets provide such a file for exactly this purpose. 
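As a minimal sketch of the conversion described above, the helper below turns an absolute box (top-left corner plus width and height, in pixels) into corner coordinates expressed as fractions of the image size. The function name `to_relative_bbox` and the sample numbers are hypothetical and used only for illustration.

```python
def to_relative_bbox(x_abs, y_abs, bbox_width, bbox_height, img_width, img_height):
    """Convert an absolute (x, y, w, h) box into relative (xmin, ymin, xmax, ymax)."""
    xmin = x_abs / img_width
    ymin = y_abs / img_height
    xmax = (x_abs + bbox_width) / img_width
    ymax = (y_abs + bbox_height) / img_height
    return xmin, ymin, xmax, ymax


# Hypothetical 500x400 image with a 200x100 box whose top-left corner is at (50, 100)
print(to_relative_bbox(50, 100, 200, 100, img_width=500, img_height=400))
# -> (0.1, 0.25, 0.5, 0.5)
```

All four values should fall between 0 and 1 when the box lies inside the image, which makes a quick sanity check possible before writing the list files.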

In [6]:
%%time
SIZE_COLS = ['idx','width','height']

def gen_image_size_file():
    print('Generating a file containing image sizes...')
    images_df = pd.read_csv(IMAGE_FILE, sep=' ',
                            names=['image_pretty_name', 'image_file_name'],
                            header=None)
    rows_list = []
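    for idx, file_name in enumerate(images_df['image_file_name'], start=1):
        # TODO: add progress bar
        img = cv2.imread(IMAGES_DIR + file_name)
        if img is None:
            # cv2.imread returns None instead of raising when a file cannot be read
            raise FileNotFoundError('Could not read image: ' + IMAGES_DIR + file_name)
        # shape is (height, width, channels)
        height, width = img.shape[:2]
        rows_list.append({'idx': idx, 'width': width, 'height': height})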

    sizes_df = pd.DataFrame(rows_list)
    print('Image sizes:\n' + str(sizes_df.head()))

    sizes_df[SIZE_COLS].to_csv(SIZE_FILE, sep=' ', index=False, header=None)

gen_image_size_file()

Generating a file containing image sizes...
Image sizes:
   idx  width  height
0    1    500     335
1    2    500     336
2    3    500     347
3    4    415     500
4    5    331     380
CPU times: user 45.3 s, sys: 855 ms, total: 46.1 s
Wall time: 46.5 s


## Step 2. Generate list files for producing RecordIO files 

In [7]:
def split_to_train_test(df, label_column, train_frac=0.8):
    # Sample train_frac of each class separately so every class appears in both sets
    train_parts, test_parts = [], []
    labels = df[label_column].unique()
    for lbl in labels:
        lbl_df = df[df[label_column] == lbl]
        lbl_train_df = lbl_df.sample(frac=train_frac)
        lbl_test_df = lbl_df.drop(lbl_train_df.index)
        print('\n{}:\n---------\ntotal:{}\ntrain_df:{}\ntest_df:{}'.format(lbl, len(lbl_df), len(lbl_train_df), len(lbl_test_df)))
        train_parts.append(lbl_train_df)
        test_parts.append(lbl_test_df)
    # DataFrame.append was removed in pandas 2.0; concatenate the per-class pieces instead
    return pd.concat(train_parts), pd.concat(test_parts)

def gen_list_files():
    # use generated sizes file
    sizes_df = pd.read_csv(SIZE_FILE, sep=' ',
                names=['image_pretty_name', 'width', 'height'],
                header=None)
    bboxes_df = pd.read_csv(BBOX_FILE, sep=' ',
                names=['image_pretty_name', 'x_abs', 'y_abs', 'bbox_width', 'bbox_height'],
                header=None)
    split_df = pd.read_csv(SPLIT_FILE, sep=' ',
                            names=['image_pretty_name', 'is_training_image'],
                            header=None)
    print(IMAGE_FILE)
    images_df = pd.read_csv(IMAGE_FILE, sep=' ',
                            names=['image_pretty_name', 'image_file_name'],
                            header=None)
    print('num images total: ' + str(images_df.shape[0]))
    image_class_labels_df = pd.read_csv(LABEL_FILE, sep=' ',
                                names=['image_pretty_name', 'class_id'], header=None)

    # Merge the metadata into a single flat dataframe for easier processing
    full_df = pd.DataFrame(images_df)
    full_df.reset_index(inplace=True)
    full_df = pd.merge(full_df, image_class_labels_df, on='image_pretty_name')
    full_df = pd.merge(full_df, sizes_df, on='image_pretty_name')
    full_df = pd.merge(full_df, bboxes_df, on='image_pretty_name')
    full_df = pd.merge(full_df, split_df, on='image_pretty_name')
    full_df.sort_values(by=['index'], inplace=True)

    # Define the bounding boxes in the format required by SageMaker's built in Object Detection algorithm.
    # the xmin/ymin/xmax/ymax parameters are specified as ratios to the total image pixel size
    full_df['header_cols'] = 2  # one col for the number of header cols, one for the label width
    full_df['label_width'] = 5  # number of cols for each label: class, xmin, ymin, xmax, ymax
    full_df['xmin'] = full_df['x_abs'] / full_df['width']
    full_df['xmax'] = (full_df['x_abs'] + full_df['bbox_width']) / full_df['width']
    full_df['ymin'] = full_df['y_abs'] / full_df['height']
    full_df['ymax'] = (full_df['y_abs'] + full_df['bbox_height']) / full_df['height']

    # object detection class id's must be zero based. map from
    # class_id's given by CUB to zero-based (1 is 0, and 200 is 199).

    
    unique_classes = full_df['class_id'].drop_duplicates()
    sorted_unique_classes = sorted(unique_classes)

    id_to_zero = {}
    i = 0.0
    for c in sorted_unique_classes:
        id_to_zero[c] = i
        i += 1.0

    full_df['zero_based_id'] = full_df['class_id'].map(id_to_zero)

    full_df.reset_index(inplace=True)

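    # use 4 decimal places, as it seems to be required by the Object Detection algorithm.
    # display.precision only affects how dataframes are printed; the list files
    # get their 4 decimal places from float_format in the to_csv calls below.
    pd.set_option("display.precision", 4)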

    # split into training and validation sets
    train_df, val_df = split_to_train_test(full_df, 'class_id', TRAIN_RATIO)

    train_df[IM2REC_SSD_COLS].to_csv(TRAIN_LST_FILE, sep='\t',float_format='%.4f', header=None)
    val_df[IM2REC_SSD_COLS].to_csv(VAL_LST_FILE, sep='\t',float_format='%.4f', header=None)
        
    print('num train: ' + str(train_df.shape[0]))
    print('num val: ' + str(val_df.shape[0]))
    return train_df, val_df

In [8]:
train_df, val_df = gen_list_files()

CUB_200_2011/images.txt
num images total: 11788

1:
---------
total:11788
train_df:9430
test_df:2358
num train: 9430
num val: 2358
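
To make the layout of the generated list files concrete, here is a small sketch that formats one row the same way the `to_csv` calls above do: the dataframe index first, then the columns in `IM2REC_SSD_COLS`, tab-separated, with floats written to 4 decimal places. The helper name `format_lst_row`, the sample box, and the sample file name are hypothetical.

```python
def format_lst_row(index, zero_based_id, bbox, image_file_name,
                   header_cols=2, label_width=5):
    """Build one tab-separated .lst line: index, header, label, image path."""
    xmin, ymin, xmax, ymax = bbox
    fields = [
        str(index),             # row index written by to_csv
        str(header_cols),       # number of header columns (this one and label_width)
        str(label_width),       # values per object: class, xmin, ymin, xmax, ymax
        f"{zero_based_id:.4f}", # zero-based class id, stored as a float
        f"{xmin:.4f}", f"{ymin:.4f}", f"{xmax:.4f}", f"{ymax:.4f}",
        image_file_name,        # path relative to the image root passed to im2rec
    ]
    return "\t".join(fields)


# Hypothetical row for a single-class ("bird") dataset
row = format_lst_row(0, 0.0, (0.12, 0.09, 0.778, 0.758), "some_species/some_image.jpg")
print(row)
```

Printing a row like this next to the first line of `birds_ssd_train.lst` is a quick way to confirm that the column order is what you expect.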


## Step 3. Convert data into RecordIO format

Now we create im2rec databases (.rec files) for training and validation based on the list files created earlier.

In [9]:
!python tools/im2rec.py --resize $RESIZE_SIZE --pack-label birds_ssd $BASE_DIR/images/

Creating .rec file from /home/ec2-user/SageMaker/object_detection_birds_2020-11-20/birds_ssd_train.lst in /home/ec2-user/SageMaker/object_detection_birds_2020-11-20
multiprocessing not available, fall back to single threaded encoding
time: 0.007500648498535156  count: 0
time: 6.310894727706909  count: 1000
time: 6.302682161331177  count: 2000
time: 6.378289699554443  count: 3000
time: 6.28742241859436  count: 4000
time: 6.34315824508667  count: 5000
time: 6.32803201675415  count: 6000
time: 6.27110743522644  count: 7000
time: 6.419079780578613  count: 8000
time: 6.3166420459747314  count: 9000
Creating .rec file from /home/ec2-user/SageMaker/object_detection_birds_2020-11-20/birds_ssd_val.lst in /home/ec2-user/SageMaker/object_detection_birds_2020-11-20
multiprocessing not available, fall back to single threaded encoding
time: 0.005661725997924805  count: 0
time: 6.2773332595825195  count: 1000
time: 6.387125253677368  count: 2000


## Step 4. Upload RecordIO files to S3
Upload the training and validation data to the S3 bucket. We do this in multiple channels. Channels are simply directories in the bucket that differentiate the types of data provided to the algorithm. For the object detection algorithm, we call these directories `train` and `validation`.

In [10]:
import sagemaker
train_channel = conf.prefix + '/train'
validation_channel = conf.prefix + '/validation'
sagemaker.s3.S3Uploader.upload("birds_ssd_train.rec", f"s3://{conf.bucket}/{train_channel}")
sagemaker.s3.S3Uploader.upload("birds_ssd_val.rec", f"s3://{conf.bucket}/{validation_channel}")




's3://deeplens-sziit/DEMO-ObjectDetection-birds/validation/birds_ssd_val.rec'
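
The channel layout can be sketched with plain string handling. The snippet below only builds the S3 URIs and does not contact AWS. The helper names and the bucket and prefix values are hypothetical, and it assumes, consistent with the URI printed above, that the uploaded object keeps its local file name under the channel directory.

```python
import os


def channel_uri(bucket, prefix, channel):
    """S3 'directory' that holds one channel (e.g. train or validation)."""
    return f"s3://{bucket}/{prefix}/{channel}"


def uploaded_object_uri(channel_s3_uri, local_path):
    """Assumed final location of a file uploaded into a channel directory."""
    return f"{channel_s3_uri}/{os.path.basename(local_path)}"


# Hypothetical bucket and prefix, standing in for conf.bucket and conf.prefix
train_uri = channel_uri("example-bucket", "example-prefix", "train")
validation_uri = channel_uri("example-bucket", "example-prefix", "validation")
print(uploaded_object_uri(train_uri, "birds_ssd_train.rec"))
print(uploaded_object_uri(validation_uri, "birds_ssd_val.rec"))
```

Keeping the two channels in separate directories lets the training job be pointed at each one independently.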