# Prepare data set for training

This is a template for preparing your dataset for training. Make a copy and modify it from there.

## Environment variables

In this section, we set up the environment variables used by all the scripts below. Modify these variables to match your own setup.

In [None]:
%env BUCKET=deeplens-sagemaker-2bbe16b4-c056-4ae2-9332-d31dd7aeb470
%env DATA_SET_PATH=datasets
%env DATA_PATH=gender
%env INCUBATOR_GIT_PATH=https://github.com/apache/incubator-mxnet.git
%env INCUBATOR_PATH=incubator-mxnet
%env IM2REC_PATH=tools/im2rec.py
%env TRAIN_CHANNEL=train
%env VALIDATION_CHANNEL=validation
%env RECORD_PATH=gender
%env TRAINING_RATIO=0.90
%env OUTPUT_PREFIX=om_gender
%env RESIZE=300
%env QUALITY=95
%env NUM_THREAD=16
%env CLEAN_UP=true
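
Because every later cell depends on these variables, it can help to confirm they are visible to the shell before going further. The following is a minimal sketch; `check_required_vars` is a hypothetical helper, not part of the original template, and the list of names passed to it is only an example.

```shell
# Hypothetical helper: report every named variable that is unset or empty.
# Returns 0 when all variables are set, 1 otherwise.
check_required_vars() {
    missing=0
    for name in "$@"; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "Missing environment variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call with a few of the variables defined above.
check_required_vars BUCKET DATA_SET_PATH DATA_PATH OUTPUT_PREFIX TRAINING_RATIO \
    || echo "Set the variables above before running the remaining cells." >&2
```

In the notebook, this would go in its own `%%bash` cell right after the `%env` cell.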

In [None]:
%%bash

echo "Remove existing dataset."
# -f keeps the cell from failing when the folder does not exist yet (first run).
rm -rf "./$DATA_PATH"

echo "Download image dataset from S3 bucket"
aws s3 cp --recursive "s3://$BUCKET/$DATA_SET_PATH/$DATA_PATH" "./$DATA_PATH"
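
After the download, a quick look at the folder layout can catch an empty or misnamed class folder early. This sketch assumes the dataset keeps one subfolder per class directly under `$DATA_PATH` and that images use `.jpg`, `.jpeg`, or `.png` extensions. Both are assumptions about your data, and `count_images_per_class` is a hypothetical helper.

```shell
# Hypothetical helper: print "<class folder> <image count>" for each
# subfolder directly under the given root directory.
count_images_per_class() {
    root=$1
    for dir in "$root"/*/; do
        # Skip the unexpanded pattern when the root has no subfolders.
        [ -d "$dir" ] || continue
        count=$(find "$dir" -type f \
            \( -iname '*.jpg' -o -iname '*.jpeg' -o -iname '*.png' \) | wc -l)
        # $((count)) strips the padding some wc implementations add.
        echo "$(basename "$dir") $((count))"
    done
}

# The default after ":-" mirrors the DATA_PATH value set earlier.
count_images_per_class "./${DATA_PATH:-gender}"
```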

## Download MXNet

We will use `im2rec.py`, a Python tool script shipped with incubator-mxnet, to convert our images into RecordIO format.

In [None]:
%%bash

rm -rf "./$INCUBATOR_PATH"
echo "Clone incubator-mxnet, we are going to use im2rec.py to prepare our RecordIO dataset"
git clone "$INCUBATOR_GIT_PATH" "$INCUBATOR_PATH"
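
If the clone fails (for example, because of a network problem), the later steps report a confusing "file not found" error. A small check like the one below can surface the problem right away. `require_file` is a hypothetical helper introduced only for illustration.

```shell
# Hypothetical helper: succeed only if the given file exists and is non-empty.
require_file() {
    if [ -s "$1" ]; then
        echo "Found $1"
    else
        echo "Missing $1; check that the clone step succeeded" >&2
        return 1
    fi
}

# The defaults after ":-" mirror the values set in the environment section.
require_file "./${INCUBATOR_PATH:-incubator-mxnet}/${IM2REC_PATH:-tools/im2rec.py}" \
    || echo "Re-run the clone cell before continuing." >&2
```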

## Prepare image datasets

Run `python3 ./incubator-mxnet/tools/im2rec.py --help` in a terminal for a detailed explanation of all command-line arguments.

The Apache MXNet team may change the tool script in the future. If the commands below stop working, refer to the latest help output of `im2rec.py`.

In [None]:
%%bash

echo "Create image list with im2rec.py script"
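# With --recursive, im2rec.py prints each class folder and its label index to stdout; save that mapping.
python3 "./$INCUBATOR_PATH/$IM2REC_PATH" "$OUTPUT_PREFIX" "./$DATA_PATH/" --list --recursive --train-ratio "$TRAINING_RATIO" > "${OUTPUT_PREFIX}_label"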

echo "$OUTPUT_PREFIX label indices"
cat "$OUTPUT_PREFIX"_label

echo "Create image recordio format binary file from the image list"
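# im2rec.py processes every .lst file whose name starts with the given prefix (here the _train and _val lists).
python3 "./$INCUBATOR_PATH/$IM2REC_PATH" "${OUTPUT_PREFIX}_" "./$DATA_PATH" --resize "$RESIZE" --center-crop --quality "$QUALITY" --num-thread "$NUM_THREAD"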
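
Before uploading, it can be useful to check how the images were split between the training and validation lists. The sketch below assumes each line of a `.lst` file has the form `<index><TAB><label><TAB><relative path>`, with the label typically written as a float such as `0.000000`. Verify this against the files your version of `im2rec.py` produces. `count_labels` is a hypothetical helper.

```shell
# Hypothetical helper: count the entries per label in a .lst file.
# Assumes tab-separated lines: <index> <label> <relative path>.
count_labels() {
    awk -F '\t' '
        { counts[$2]++ }
        END { for (label in counts) print label, counts[label] }
    ' "$1" | sort
}

# Report the label distribution of each list that exists.
for split in train val; do
    list="${OUTPUT_PREFIX:-om_gender}_${split}.lst"
    if [ -f "$list" ]; then
        echo "Label counts in $list:"
        count_labels "$list"
    fi
done
```

A heavily skewed count in either list suggests checking the source folders or `TRAINING_RATIO` before training.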

## Upload artefacts

Upload all created artefacts to the S3 bucket. We will use them in the next notebook to train our model.

In [None]:
%%bash

echo "Remove existing artefacts"
aws s3 rm "s3://$BUCKET/$OUTPUT_PREFIX"_label
aws s3 rm --recursive "s3://$BUCKET/$TRAIN_CHANNEL/$RECORD_PATH/" --exclude "s3://$BUCKET/$TRAIN_CHANNEL/"
aws s3 rm --recursive "s3://$BUCKET/$VALIDATION_CHANNEL/$RECORD_PATH/" --exclude "s3://$BUCKET/$VALIDATION_CHANNEL/"

echo "Upload labels text to S3 bucket"
aws s3 cp "$OUTPUT_PREFIX"_label "s3://$BUCKET"

echo "Upload training record to the S3 bucket"
aws s3 cp "$OUTPUT_PREFIX"_train.rec "s3://$BUCKET/$TRAIN_CHANNEL/$RECORD_PATH/"
aws s3 cp "$OUTPUT_PREFIX"_train.idx "s3://$BUCKET/$TRAIN_CHANNEL/$RECORD_PATH/"
aws s3 cp "$OUTPUT_PREFIX"_train.lst "s3://$BUCKET/$TRAIN_CHANNEL/$RECORD_PATH/"

echo "Upload validation record to the S3 bucket"
aws s3 cp "$OUTPUT_PREFIX"_val.rec "s3://$BUCKET/$VALIDATION_CHANNEL/$RECORD_PATH/"
aws s3 cp "$OUTPUT_PREFIX"_val.idx "s3://$BUCKET/$VALIDATION_CHANNEL/$RECORD_PATH/"
aws s3 cp "$OUTPUT_PREFIX"_val.lst "s3://$BUCKET/$VALIDATION_CHANNEL/$RECORD_PATH/"
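
To confirm the upload, the two channel prefixes can be listed afterwards. This is a minimal sketch that assumes the AWS CLI is installed and the variables from the first section are set. `channel_uri` is a hypothetical helper that only rebuilds the destination prefix used by the `aws s3 cp` commands above.

```shell
# Hypothetical helper: build the S3 prefix for a channel.
# Arguments: bucket name, channel name, record path.
channel_uri() {
    echo "s3://$1/$2/$3/"
}

# List both channels only when the AWS CLI and the bucket name are available.
if command -v aws >/dev/null 2>&1 && [ -n "${BUCKET:-}" ]; then
    for channel in "$TRAIN_CHANNEL" "$VALIDATION_CHANNEL"; do
        aws s3 ls "$(channel_uri "$BUCKET" "$channel" "$RECORD_PATH")" \
            || echo "Nothing listed under channel: $channel" >&2
    done
fi
```

Each channel should show the `.rec`, `.idx`, and `.lst` files uploaded above.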

In [None]:
%%bash

# Quote the variable so the test does not break when CLEAN_UP is unset or empty.
if [ "$CLEAN_UP" = "true" ]; then
    echo "Clean up folders"
    echo "Removing $DATA_PATH"
    rm -rf "./$DATA_PATH"
    echo "Removing $INCUBATOR_PATH"
    rm -rf "./$INCUBATOR_PATH"
    echo "Removing recordio files"
    rm -f "$OUTPUT_PREFIX"_*
fi