# Setup Environment

This notebook builds a machine learning model, tunes its training hyperparameters to achieve the best metric, saves the model, and then deploys it as a server endpoint. Web applications or third-party services can use this endpoint for inference.

The inventory tracking system aims to automate and enhance inventory monitoring, recognition, and sorting tasks. Our model aims to reduce manual labor, minimize errors, and increase overall efficiency; most importantly, it simulates a full machine learning pipeline for a logistics data-processing job.

## 1. Preconfig

<h4>Import modules</h4>

In [21]:
import os
import json
from PIL import Image

import sagemaker
import boto3
from sagemaker.pytorch import PyTorch


from sklearn.model_selection import train_test_split
from tqdm import tqdm

<h4>Configure environment</h4>

In [6]:
project_bucket = ""
dataset_bucket = 'aft-vbi-pds'
role = ""

## 2. Data Preparation
**TODO:** Run the cell below to download the data.

The cell below creates a folder called `dataset/bin-images`, downloads the data, and splits it into `train`, `test` and `validation` sets (60/20/20). Within each set the images are arranged in subfolders, and each subfolder contains images where the number of objects equals the name of the folder. For instance, every image in folder `1` contains exactly 1 object. If you feel the number of samples is not enough, you can always download more data (instructions for that can be found [here](https://registry.opendata.aws/amazon-bin-imagery/)). However, we are not assessing you on the accuracy of your final trained model, but on how you build your machine learning engineering pipeline.

In [None]:

def download_and_arrange_data():
    """Split file_list.json into train/test/validation (60/20/20) and download each set."""
    with open('file_list.json', 'r') as f:
        file_list = json.load(f)

    train, test, validation = {}, {}, {}
    for label, files in file_list.items():
        # Hold out 40% of each label, then halve the held-out part into test and validation.
        train[label], held_out = train_test_split(files, test_size=0.4, random_state=0)
        test[label], validation[label] = train_test_split(held_out, test_size=0.5, random_state=0)

    download_images(train, 'train')
    download_images(test, 'test')
    download_images(validation, 'validation')

def download_images(files_list, data_path):
    """Download the listed images into dataset/bin-images/<data_path>/<label>/."""
    s3_client = boto3.client('s3')
    data_path = os.path.join('dataset', 'bin-images', data_path)
    for label, files in files_list.items():
        print(f"Downloading images with {label} objects to the path {data_path}")
        directory = os.path.join(data_path, label)
        os.makedirs(directory, exist_ok=True)
        for file_path in tqdm(files):
            # Metadata entries and images share a base name; the image has a .jpg extension.
            file_name = os.path.basename(file_path).split('.')[0] + '.jpg'
            s3_key_file = 'bin-images/' + file_name
            s3_client.download_file(dataset_bucket, s3_key_file,
                                    os.path.join(directory, file_name))

download_and_arrange_data()

In [None]:
!aws s3 sync 

## 3. Dataset

*The Amazon Bin Image Dataset* contains over **500,000 images and metadata** from bins of a pod in an operating *Amazon Fulfillment Center*. The bin images in this dataset are captured as robot units carry pods as part of normal Amazon Fulfillment Center operations.

Because the full dataset is large, we use only a subset of **10441** samples from the original dataset as an experiment to examine the efficiency of the model. The subset contains only bin images that hold between '1' and '5' items.

Let's examine the dataset.

<h4>Show the number of images of each label</h4>

In [3]:
import json

# Load the JSON data from the file
with open("file_list.json", "r") as file:
    data = json.load(file)

# Count the length of each list for each label
labels_length = {key: len(value) for key, value in data.items()}

print("Number of images for each labels: ", labels_length)
print("Total number of images: ", sum(labels_length.values()))

Number of images for each labels:  {'1': 1228, '2': 2299, '3': 2666, '4': 2373, '5': 1875}
Total number of images:  10441
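
Given these counts, the 60/20/20 split applied in the download cell leads to the per-label set sizes sketched below. This is a minimal sketch: the helper `split_sizes` is hypothetical, and it assumes that `train_test_split` rounds the held-out fraction up when `test_size` is a float, so treat the numbers as an estimate.

```python
import math

# Per-label image counts printed by the cell above.
LABEL_COUNTS = {"1": 1228, "2": 2299, "3": 2666, "4": 2373, "5": 1875}


def split_sizes(n_samples, holdout=0.4, second=0.5):
    """Return (train, test, validation) sizes for the two-stage split.

    Assumption: the held-out part is rounded up (ceil), mirroring how
    scikit-learn's train_test_split treats a float test_size.
    """
    n_holdout = math.ceil(holdout * n_samples)
    n_train = n_samples - n_holdout
    # The second split assigns its "test" half to validation, as in the download cell.
    n_validation = math.ceil(second * n_holdout)
    n_test = n_holdout - n_validation
    return n_train, n_test, n_validation


for label, count in LABEL_COUNTS.items():
    train, test, validation = split_sizes(count)
    print(f"label {label}: train={train}, test={test}, validation={validation}")
```

Label `1` has roughly half as many images as label `3`, so the classes are imbalanced in every split.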


<h4>Show the size of images to decide on further processing</h4>

In [28]:
LIMIT = 5  # Limit the number of images to display
count_display = 0

for dirpath, dirs, files in os.walk("./dataset"):
    if count_display == LIMIT:
        break
    for file in files:
        if count_display == LIMIT:
            break
        image_path = os.path.join(dirpath, file)
        print("Image file path: ", image_path)
        with Image.open(image_path) as img:
            print("Image size: ", img.size)
        count_display += 1

Image file path:  ./dataset\bin-images\test\1\00014.jpg
Image size:  (526, 313)
Image file path:  ./dataset\bin-images\test\1\00024.jpg
Image size:  (482, 550)
Image file path:  ./dataset\bin-images\test\1\00100.jpg
Image size:  (530, 460)
Image file path:  ./dataset\bin-images\test\1\00214.jpg
Image size:  (374, 255)
Image file path:  ./dataset\bin-images\test\1\00229.jpg
Image size:  (438, 561)
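
The sizes above differ from image to image, so some resizing step is needed before the images can be batched. One common option (an assumption here, not something this notebook prescribes) is to scale the shorter side to a fixed length and then take a centred square crop. The helper names and the 256/224 values below are hypothetical; the sketch only computes the target geometry.

```python
def resize_shorter_side(size, target=256):
    """Scale (width, height) so the shorter side equals `target`, keeping the aspect ratio."""
    width, height = size
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


def center_crop_box(size, crop=224):
    """Return the (left, top, right, bottom) box of a centred crop x crop square."""
    width, height = size
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop


# Sizes printed by the cell above.
for original in [(526, 313), (482, 550), (530, 460), (374, 255), (438, 561)]:
    resized = resize_shorter_side(original)
    print(original, "->", resized, "crop box:", center_crop_box(resized))
```

With Pillow, the resulting values could be passed to `Image.resize` and `Image.crop`.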


<h4>Upload to S3 bucket</h4>

In [None]:
# aft-vbi-pds is the public source bucket, so upload to the project bucket instead.
!aws s3 sync dataset s3://{project_bucket}/bin-images

## 4. Model Training
**TODO:** This is the part where you can train a model. The type or architecture of the model you use is not important. 

**Note:** You will need to use the `train.py` script to train your model.

In [None]:
#TODO: Declare your model training hyperparameter.
#NOTE: You do not need to do hyperparameter tuning. You can use fixed hyperparameter values

In [None]:
#TODO: Create your training estimator

In [None]:
# TODO: Fit your estimator
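
The three TODO cells above could be filled in along the following lines. This is a sketch under stated assumptions, not the project's actual configuration: the hyperparameter names, their values, the instance type, the framework version and the S3 prefix are all placeholders, and `train.py` is assumed to accept the hyperparameters as command-line arguments.

```python
def training_hyperparameters(epochs=5, batch_size=32, learning_rate=1e-3):
    """Fixed hyperparameters; train.py is assumed to parse them as command-line arguments."""
    return {"epochs": epochs, "batch-size": batch_size, "lr": learning_rate}


def estimator_kwargs(role, hyperparameters, instance_type="ml.g4dn.xlarge"):
    """Keyword arguments for sagemaker.pytorch.PyTorch (SageMaker Python SDK v2)."""
    return {
        "entry_point": "train.py",
        "role": role,
        "instance_count": 1,
        "instance_type": instance_type,  # placeholder instance type
        "framework_version": "1.13",     # placeholder PyTorch version
        "py_version": "py39",
        "hyperparameters": hyperparameters,
    }


def data_channels(bucket, prefix="bin-images/bin-images"):
    """S3 locations of the three splits.

    Assumption: the data was uploaded with `aws s3 sync dataset s3://<bucket>/bin-images`,
    which places the local dataset/bin-images/<split> folders under this prefix.
    """
    base = f"s3://{bucket}/{prefix}"
    return {split: f"{base}/{split}" for split in ("train", "test", "validation")}


def launch_training(role, bucket):
    """Create the estimator and start the job (needs AWS credentials and the SDK)."""
    from sagemaker.pytorch import PyTorch  # lazy import so the helpers above run anywhere

    estimator = PyTorch(**estimator_kwargs(role, training_hyperparameters()))
    estimator.fit(data_channels(bucket))
    return estimator
```

Calling `launch_training(role, project_bucket)` would start a billable training job, so it is left out of the sketch.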

## 5. Standout Suggestions
You do not need to perform the tasks below to finish your project. However, you can attempt these tasks to turn your project into a more advanced portfolio piece.

### Hyperparameter Tuning
**TODO:** Here you can perform hyperparameter tuning to increase the performance of your model. You are encouraged to:
- tune as many hyperparameters as you can to get the best performance from your model
- explain why you chose to tune those particular hyperparameters and the ranges.


In [None]:
#TODO: Create your hyperparameter search space

In [None]:
#TODO: Create your training estimator

In [None]:
# TODO: Fit your estimator

In [None]:
# TODO: Find the best hyperparameters
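
A possible shape for the tuning cells is sketched below. The search space, the objective metric name and the log format are assumptions: `train.py` would have to print a line such as `Test Loss: 1.2345` for the regular expression to capture anything. The learning rate and batch size are chosen only as typical examples of hyperparameters that affect convergence.

```python
import re

# Hypothetical search space, kept as plain data so it is easy to review.
SEARCH_SPACE = {
    "lr": ("continuous", (1e-4, 1e-1)),
    "batch-size": ("categorical", [32, 64, 128]),
}

# Assumption: train.py logs a line such as "Test Loss: 1.2345".
METRIC_DEFINITIONS = [{"Name": "test loss", "Regex": r"Test Loss: ([0-9\.]+)"}]


def extract_metric(log_line):
    """Apply the metric regex to one log line; returns the captured value or None."""
    match = re.search(METRIC_DEFINITIONS[0]["Regex"], log_line)
    return float(match.group(1)) if match else None


def to_sagemaker_ranges(space):
    """Convert the plain search space into SageMaker parameter range objects."""
    from sagemaker.tuner import CategoricalParameter, ContinuousParameter  # lazy import

    ranges = {}
    for name, (kind, values) in space.items():
        if kind == "continuous":
            ranges[name] = ContinuousParameter(*values)
        elif kind == "categorical":
            ranges[name] = CategoricalParameter(values)
        else:
            raise ValueError(f"unknown range kind: {kind}")
    return ranges


def build_tuner(estimator, max_jobs=4, max_parallel_jobs=2):
    """Wrap an existing estimator in a HyperparameterTuner that minimizes the test loss."""
    from sagemaker.tuner import HyperparameterTuner  # lazy import

    return HyperparameterTuner(
        estimator=estimator,
        objective_metric_name="test loss",
        hyperparameter_ranges=to_sagemaker_ranges(SEARCH_SPACE),
        metric_definitions=METRIC_DEFINITIONS,
        objective_type="Minimize",
        max_jobs=max_jobs,
        max_parallel_jobs=max_parallel_jobs,
    )
```

After `tuner.fit(...)` completes, `tuner.best_training_job()` returns the name of the best job and `tuner.best_estimator().hyperparameters()` returns the values it used.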

### Model Profiling and Debugging
**TODO:** Use model debugging and profiling to better monitor and debug your model training job.

In [None]:
# TODO: Set up debugging and profiling rules and hooks

In [None]:
# TODO: Create and fit an estimator

In [None]:
# TODO: Plot a debugging output.

**TODO**: Is there some anomalous behaviour in your debugging output? If so, what is the error and how will you fix it?  
**TODO**: If not, suppose there was an error. What would that error look like and how would you have fixed it?

In [None]:
# TODO: Display the profiler output
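
The debugging and profiling cells could start from the sketch below. The chosen rules and the save intervals are assumptions picked for illustration, and `loss_stalled` is a hypothetical local check (not part of SageMaker Debugger) for one kind of anomaly the questions above ask about: a loss curve that has stopped decreasing.

```python
def debugger_settings():
    """Build rules and configs for SageMaker Debugger (SDK v2); returns estimator kwargs."""
    from sagemaker.debugger import (  # lazy import: needs the SageMaker SDK
        DebuggerHookConfig,
        ProfilerConfig,
        ProfilerRule,
        Rule,
        rule_configs,
    )

    rules = [
        Rule.sagemaker(rule_configs.vanishing_gradient()),
        Rule.sagemaker(rule_configs.overfit()),
        Rule.sagemaker(rule_configs.poor_weight_initialization()),
        ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
    ]
    # Sample system metrics (CPU, GPU, memory) every 500 ms.
    profiler_config = ProfilerConfig(system_monitor_interval_millis=500)
    # Save tensors every 100 training steps and every 10 evaluation steps.
    hook_config = DebuggerHookConfig(
        hook_parameters={"train.save_interval": "100", "eval.save_interval": "10"}
    )
    return {
        "rules": rules,
        "profiler_config": profiler_config,
        "debugger_hook_config": hook_config,
    }


def loss_stalled(losses, window=3, min_delta=1e-3):
    """Return True if the loss improved by less than `min_delta` over the last `window` steps."""
    if len(losses) <= window:
        return False
    return losses[-window - 1] - losses[-1] < min_delta
```

The dictionary returned by `debugger_settings()` can be unpacked into the estimator constructor, for example `PyTorch(..., **debugger_settings())`.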

### Model Deploying and Querying
**TODO:** Can you deploy your model to an endpoint and then query that endpoint to get a result?

In [None]:
# TODO: Deploy your model to an endpoint

In [None]:
# TODO: Run a prediction on the endpoint

In [None]:
# TODO: Remember to shutdown/delete your endpoint once your work is done
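
One way to deploy, query and clean up is sketched below. It assumes that the model's inference code accepts raw JPEG bytes and returns a JSON list containing one row of five scores (one per label); neither is confirmed by this notebook, and the instance type is a placeholder.

```python
def endpoint_settings(instance_type="ml.m5.large"):
    """Deployment arguments for estimator.deploy; the instance type is a placeholder."""
    return {"initial_instance_count": 1, "instance_type": instance_type}


def top_class(scores, labels=("1", "2", "3", "4", "5")):
    """Map one row of per-class scores to the label with the highest score."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return labels[best]


def deploy_query_cleanup(estimator, image_path):
    """Deploy a fitted estimator, classify one image, and always delete the endpoint."""
    from sagemaker.deserializers import JSONDeserializer  # lazy imports: need the SDK
    from sagemaker.serializers import IdentitySerializer

    predictor = estimator.deploy(
        serializer=IdentitySerializer("image/jpeg"),
        deserializer=JSONDeserializer(),
        **endpoint_settings(),
    )
    try:
        with open(image_path, "rb") as f:
            scores = predictor.predict(f.read())
        # Assumption: the response looks like [[s1, s2, s3, s4, s5]].
        return top_class(scores[0])
    finally:
        # Endpoints are billed while they run, so delete even if the query fails.
        predictor.delete_endpoint()
```

Putting the deletion in a `finally` block means a failed query does not leave a running endpoint behind.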

### Cheaper Training and Cost Analysis
**TODO:** Can you perform a cost analysis of your system and then use spot instances to lessen your model training cost?

In [None]:
# TODO: Cost Analysis

In [None]:
# TODO: Train your model using a spot instance
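
A simple cost analysis can be done with the arithmetic below, followed by the estimator arguments that enable managed spot training. The hourly price is deliberately left as a parameter because it depends on the region and instance type; the time limits and helper names are hypothetical.

```python
def training_cost_usd(billable_seconds, hourly_price_usd, instance_count=1):
    """Estimated cost of a training job from its billable time and the hourly price."""
    return billable_seconds / 3600 * hourly_price_usd * instance_count


def spot_savings_pct(training_seconds, billable_seconds):
    """Percentage saved by spot training: the share of training time that was not billed."""
    return 100 * (training_seconds - billable_seconds) / training_seconds


def spot_kwargs(checkpoint_s3_uri, max_run=3600, max_wait=7200):
    """Estimator arguments that enable managed spot training (SageMaker Python SDK v2)."""
    if max_wait < max_run:
        raise ValueError("max_wait must be at least max_run")
    return {
        "use_spot_instances": True,
        "max_run": max_run,    # maximum training time in seconds
        "max_wait": max_wait,  # maximum total time, including waiting for spot capacity
        "checkpoint_s3_uri": checkpoint_s3_uri,  # lets an interrupted job resume
    }
```

The training and billable seconds of a finished job are reported in its logs, so both functions can be applied to a real run. Checkpoints only help if `train.py` saves and reloads them.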

### Multi-Instance Training
**TODO:** Can you train your model on multiple instances?

In [None]:
# TODO: Train your model on Multiple Instances
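
Multi-instance training needs two things: an estimator with `instance_count` greater than 1, and a training script that knows which host it is running on. The sketch below covers both under stated assumptions: `multi_instance_kwargs` is a hypothetical helper that extends an existing set of estimator arguments, and `host_rank` reads the `SM_HOSTS` and `SM_CURRENT_HOST` environment variables that SageMaker sets inside training containers.

```python
import json


def multi_instance_kwargs(base_kwargs, instance_count=2):
    """Copy estimator arguments and request more than one training instance."""
    if instance_count < 2:
        raise ValueError("multi-instance training needs at least 2 instances")
    return {**base_kwargs, "instance_count": instance_count}


def host_rank(environ):
    """Return (rank, world_size) for the current host.

    SM_HOSTS is a JSON list of host names and SM_CURRENT_HOST is this host's name.
    Sorting gives every host the same ordering, so ranks are consistent.
    """
    hosts = sorted(json.loads(environ["SM_HOSTS"]))
    return hosts.index(environ["SM_CURRENT_HOST"]), len(hosts)
```

Inside `train.py`, `host_rank(os.environ)` could feed the rank and world size into the distributed setup. To give each instance a different part of the data, the input channel can be declared with `sagemaker.inputs.TrainingInput(..., distribution="ShardedByS3Key")`.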