# Image Classification with AWS SageMaker
In this notebook we use AWS SageMaker to fine-tune a pre-trained model for image classification.
We use SageMaker's profiling and debugging tools to monitor training and model performance, and we run hyperparameter tuning to optimize the model.
Finally, we deploy the model to a SageMaker endpoint and test it.

In [None]:
# TODO: Install any packages that you might need
!pip install smdebug

In [None]:
# TODO: Import any packages that you might need
import sagemaker
import boto3

from sagemaker.pytorch import PyTorch
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

from sagemaker.debugger import (
    Rule,
    DebuggerHookConfig,
    ProfilerRule,
    rule_configs,
    ProfilerConfig,
    FrameworkProfile
)

from sagemaker.debugger import CollectionConfig

In [None]:
session = sagemaker.Session()

bucket_sagemaker = session.default_bucket()
print("Default Bucket: {}".format(bucket_sagemaker))

region = session.boto_region_name
print("AWS Region: {}".format(region))

role = sagemaker.get_execution_role()
print("RoleArn: {}".format(role))

## Dataset
In this project we use the dog breed classification dataset to distinguish between dog breeds in images.

In [None]:
%%capture
# TODO: Fetch and upload the data to AWS S3
# Note: the %%capture cell magic must be the first line of the cell
!wget https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip

# !wget https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip # Slow!
# !aws s3 cp s3://udacity-aind/dog-project/dogImages.zip ./ # Faster?

In [None]:
!unzip dogImages.zip

In [None]:
! ls  | grep dogImages

### How many images are in the dataset?

In [None]:
! find dogImages/train -type f | wc -l

In [None]:
! find dogImages/test -type f | wc -l

In [None]:
! find dogImages/valid -type f | wc -l

## Global Variables

### Upload dataset to S3

In [None]:
# S3 destination prefix for the dataset
BUCKET_DATA_PATH = f"s3://{bucket_sagemaker}/dogImages"

In [None]:
!aws s3 sync ./dogImages/ {BUCKET_DATA_PATH}

In [None]:
!aws s3 ls {BUCKET_DATA_PATH}/
# Alternative:
# s3_data_path = session.upload_data(path="./dogImages", bucket=bucket_sagemaker)
# print(s3_data_path)

### TODO: Data Exploration
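
As a starting point for data exploration, the sketch below counts images per class folder in one split and reports the class balance. It assumes the layout `dogImages/<split>/<class folder>/<image file>` used by the `find` commands above. The helper names `count_images_per_class` and `summarize_split` are hypothetical and not part of the project scripts.

```python
from collections import Counter
from pathlib import Path

# Assumption: images use one of these extensions
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images_per_class(split_dir):
    """Return a Counter mapping class-folder name -> number of image files."""
    counts = Counter()
    for class_dir in sorted(Path(split_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1
                for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts


def summarize_split(split_dir):
    """Summarize the number of classes and the per-class image counts."""
    counts = count_images_per_class(split_dir)
    if not counts:
        return {"classes": 0, "images": 0, "min_per_class": 0, "max_per_class": 0}
    return {
        "classes": len(counts),
        "images": sum(counts.values()),
        "min_per_class": min(counts.values()),
        "max_per_class": max(counts.values()),
    }


# Only runs if the dataset has already been unzipped next to the notebook
for split in ("train", "valid", "test"):
    split_path = Path("dogImages") / split
    if split_path.is_dir():
        print(split, summarize_split(split_path))
```

A large gap between `min_per_class` and `max_per_class` would suggest class imbalance, which is worth knowing before interpreting accuracy.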

## Hyperparameter Tuning
**TODO:** This is the part where you will fine-tune a pretrained model with hyperparameter tuning. Remember that you have to tune a minimum of two hyperparameters; however, you are encouraged to tune more. You are also encouraged to explain why you chose to tune those particular hyperparameters and their ranges.

**Note:** You will need to use the `hpo.py` script to perform hyperparameter tuning.

In [None]:
estimator = PyTorch(
    entry_point="hpo.py",       #name & path of training script
    role=role,
    py_version='py36',          #version of Python
    framework_version="1.8",    #version of pytorch
    instance_count=1,           #number of training instances
    instance_type="ml.m5.large" #type of training instance
)

In [None]:
#TODO: Declare your HP ranges, metrics etc.
hyperparameter_ranges = {
    "lr": ContinuousParameter(0.001, 0.1),
    "batch-size": CategoricalParameter([32, 64, 128]),
}

In [None]:
hyperparameter_ranges

**Note:** Specify the metric to optimize. A loss should be minimized, whereas a metric such as accuracy should be maximized.

In [None]:
objective_metric_name = "average validation loss"
objective_type = "Minimize"
metric_definitions = [{"Name": objective_metric_name, "Regex": "Phase validation, Epoc loss ([0-9\\.]+)"}]
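
SageMaker applies the `Regex` from the metric definition to the training job's log output and uses the captured group as the metric value. The sketch below applies the same pattern with Python's `re` module to a few log lines. The sample log lines are hypothetical; the actual format depends on what `hpo.py` prints, and the pattern (including the spelling `Epoc`) has to match that output exactly.

```python
import re

# Same pattern as in metric_definitions, written as a raw string
METRIC_REGEX = r"Phase validation, Epoc loss ([0-9\.]+)"


def extract_metric(log_text, pattern=METRIC_REGEX):
    """Return every captured metric value in the log text as a float."""
    return [float(m) for m in re.findall(pattern, log_text)]


# Hypothetical log output, for illustration only
sample_log = """\
Phase training, Epoc loss 2.3041
Phase validation, Epoc loss 1.8725
Phase validation, Epoc loss 1.5210
"""

print(extract_metric(sample_log))  # [1.8725, 1.521]
```

If this returns an empty list for real log lines, the tuner will not find the objective metric either, so checking the pattern locally can save a failed tuning job.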

In [None]:
#TODO: Create estimators for your HPs
tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    max_jobs=4,
    max_parallel_jobs=2,
    objective_type=objective_type,
)

In [None]:
f"{BUCKET_DATA_PATH}/train"

In [None]:
# TODO: Fit tuner
tuner.fit(wait=True, inputs={"training": f"{BUCKET_DATA_PATH}/train", "validation":f"{BUCKET_DATA_PATH}/valid" })

In [None]:
# TODO: Get the best estimators and the best HPs
tuner.best_training_job()

In [None]:

best_estimator = tuner.best_estimator()
best_estimator

In [None]:
type(best_estimator)

In [None]:
#Get the hyperparameters of the best trained model
best_estimator.hyperparameters()
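
The dictionary returned above typically holds JSON-encoded strings, and a categorical value can be quoted twice (for example `'"64"'`). The sketch below decodes selected keys without `eval`. The helper name `clean_hyperparameters` and the sample dictionary are hypothetical; inspect the real output before relying on the exact encoding.

```python
import json


def clean_hyperparameters(raw, keys):
    """Decode JSON-encoded hyperparameter strings for the selected keys."""
    cleaned = {}
    for key in keys:
        value = raw[key]
        # Decode repeatedly: '"64"' -> '64' -> 64. Stop at the first
        # string that is not valid JSON and keep it as plain text.
        while isinstance(value, str):
            try:
                value = json.loads(value)
            except json.JSONDecodeError:
                break
        cleaned[key] = value
    return cleaned


# Hypothetical example of what the tuner might return
sample_raw = {
    "batch-size": '"64"',
    "lr": "0.0123",
    "sagemaker_program": '"hpo.py"',
}

print(clean_hyperparameters(sample_raw, ["batch-size", "lr"]))
```

Selecting keys explicitly also drops the bookkeeping entries SageMaker adds to the dictionary, so only the tuned values are passed on to the next training job.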

## Model Profiling and Debugging
**TODO:** Using the best hyperparameters, create and fine-tune a new model.

**Note:** You will need to use the `train_model.py` script to perform model profiling and debugging.

In [None]:
# TODO: Set up debugging and profiling rules and hooks
rules = [
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.overtraining()),
    Rule.sagemaker(rule_configs.poor_weight_initialization()),
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
    ProfilerRule.sagemaker(rule_configs.LowGPUUtilization()),
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
]


collection_configs=[
    CollectionConfig(
        name="CrossEntropyLoss_output_0",
        parameters={
            "include_regex": "CrossEntropyLoss_output_0",
            "train.save_interval": "1",
            "eval.save_interval": "1"
        }
    )
]
hook_config = DebuggerHookConfig(
    hook_parameters={
        "train.save_interval": "5",
        "eval.save_interval": "1"
    },
    collection_configs=collection_configs
)

profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500, framework_profile_params=FrameworkProfile(num_steps=10)
)
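
To make the save intervals concrete, the sketch below lists which training steps would be recorded under each setting. It assumes a step is saved when its index is divisible by the interval, and that the collection-level interval (1) overrides the hook-level interval (5) for the loss collection. `saved_steps` is a hypothetical helper, not an smdebug function.

```python
def saved_steps(total_steps, save_interval):
    """Steps in [0, total_steps) that are saved under a fixed interval (assumed rule)."""
    return [step for step in range(total_steps) if step % save_interval == 0]


# Hook-level default from hook_parameters: every 5th training step
print(saved_steps(12, 5))

# Collection-level override for the loss tensor: every training step
print(saved_steps(12, 1))
```

Saving the loss at every step gives a smooth curve for plotting, at the cost of more data written to S3 than the sparser hook-level default.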

In [None]:
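import json

# Create the hyperparameters dictionary.
# The tuner returns JSON-encoded strings; decode them without eval().
best_hyperparameters = best_estimator.hyperparameters()
selected_batch_size = int(json.loads(best_hyperparameters["batch-size"]))
selected_lr = float(json.loads(best_hyperparameters["lr"]))

hyperparameters = {"batch-size": selected_batch_size, "lr": selected_lr}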

In [None]:
# TODO: Create and fit an estimator
estimator = PyTorch(
    entry_point="train_model.py",
    base_job_name="final-training-job",
    role=role,
    py_version='py36',
    framework_version="1.8",
    hyperparameters=hyperparameters,
    instance_count=1,
    instance_type="ml.m5.large",
    ## Debugger and profiler parameters
    rules=rules,
    debugger_hook_config=hook_config,
    profiler_config=profiler_config,
)

In [None]:
inputs_mapping = {
    "training": f"{BUCKET_DATA_PATH}/train", 
    "validation":f"{BUCKET_DATA_PATH}/valid", 
    "testing":f"{BUCKET_DATA_PATH}/test"
}

estimator.fit(wait=True, inputs=inputs_mapping)

In [None]:
estimator

In [None]:
job_name = estimator.latest_training_job.name
client = estimator.sagemaker_session.sagemaker_client
description = client.describe_training_job(TrainingJobName=job_name)

print(f"Training jobname: {job_name}")
print(f"Client: {client}")
print(f"Description: {description}")

In [None]:
# TODO: Plot a debugging output.
from smdebug.trials import create_trial
from smdebug.core.modes import ModeKeys

trial = create_trial(estimator.latest_job_debugger_artifacts_path())

### Fetch tensor names and print their lengths

trial.tensor_names()

In [None]:
len(trial.tensor("CrossEntropyLoss_output_0").steps(mode=ModeKeys.TRAIN))

In [None]:
len(trial.tensor("CrossEntropyLoss_output_0").steps(mode=ModeKeys.EVAL))

### Functions to plot the output tensors

In [None]:
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import host_subplot

def get_data(trial, tname, mode):
    tensor = trial.tensor(tname)
    steps = tensor.steps(mode=mode)
    vals = []
    for s in steps:
        vals.append(tensor.value(s, mode=mode))
    return steps, vals

def plot_tensor(trial, tensor_name):

    steps_train, vals_train = get_data(trial, tensor_name, mode=ModeKeys.TRAIN)
    print("loaded TRAIN data")
    steps_eval, vals_eval = get_data(trial, tensor_name, mode=ModeKeys.EVAL)
    print("loaded EVAL data")

    fig = plt.figure(figsize=(10, 7))
    host = host_subplot(111)

    par = host.twiny()

    host.set_xlabel("Steps (TRAIN)")
    par.set_xlabel("Steps (EVAL)")
    host.set_ylabel(tensor_name)

    (p1,) = host.plot(steps_train, vals_train, label=tensor_name)
    print("completed TRAIN plot")
    (p2,) = par.plot(steps_eval, vals_eval, label="val_" + tensor_name)
    print("completed EVAL plot")
    leg = plt.legend()

    host.xaxis.get_label().set_color(p1.get_color())
    leg.texts[0].set_color(p1.get_color())

    par.xaxis.get_label().set_color(p2.get_color())
    leg.texts[1].set_color(p2.get_color())

    plt.ylabel(tensor_name)

    plt.show()

In [None]:
plot_tensor(trial, "CrossEntropyLoss_output_0")

#### Checking system utilization

In [None]:
from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob

tj = TrainingJob(job_name, region)
tj.wait_for_sys_profiling_data_to_be_available()

In [None]:
from smdebug.profiler.analysis.notebook_utils.timeline_charts import TimelineCharts

system_metrics_reader = tj.get_systems_metrics_reader()
system_metrics_reader.refresh_event_file_list()

view_timeline_charts = TimelineCharts(
    system_metrics_reader,
    framework_metrics_reader=None,
    select_dimensions=["CPU", "GPU"],
    select_events=["total"],
)

### SageMaker Debugger Profiling Report
The profiler report is saved to an S3 bucket. Fetch and display the path of the report below.

In [None]:
# TODO: Display the profiler output
rule_output_path = estimator.output_path + estimator.latest_training_job.job_name + "/rule-output"
print(f"You will find the profiler report in {rule_output_path}")
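
The string concatenation above relies on `estimator.output_path` ending with a slash. A small helper can make the path construction robust to missing or doubled slashes. This is a sketch; `s3_join` is a hypothetical helper, and the bucket and job names in the example are placeholders.

```python
def s3_join(base_uri, *parts):
    """Join S3 URI components with exactly one slash between them."""
    # Drop leading/trailing slashes from each part and skip empty parts
    cleaned_parts = [p.strip("/") for p in parts if p.strip("/")]
    return "/".join([base_uri.rstrip("/")] + cleaned_parts)


# Placeholder values, for illustration only
print(s3_join("s3://example-bucket/", "example-job", "rule-output"))
print(s3_join("s3://example-bucket", "example-job/", "/rule-output"))
```

Both calls produce the same URI, so the result does not depend on whether the output path was configured with a trailing slash.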

In [None]:
! aws s3 ls {rule_output_path} --recursive

In [None]:
! aws s3 cp {rule_output_path} ./ --recursive

**TODO**: Is there some anomalous behaviour in your debugging output? If so, what is the error and how will you fix it?  
**TODO**: If not, suppose there was an error. What would that error look like and how would you have fixed it?

In [None]:
import os

# get the autogenerated folder name of profiler report
profiler_report_name = [
    rule["RuleConfigurationName"]
    for rule in estimator.latest_training_job.rule_job_summary()
    if "Profiler" in rule["RuleConfigurationName"]
][0]
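
The list comprehension above raises an `IndexError` if no rule name contains `"Profiler"`, for example when the profiler rules were not attached to the job. The sketch below shows the same selection with an explicit fallback. The sample summaries are hypothetical and only mimic the `RuleConfigurationName` key used above; `find_profiler_report_rule` is not part of the SageMaker SDK.

```python
def find_profiler_report_rule(rule_summaries):
    """Return the first rule name containing 'Profiler', or None if there is none."""
    return next(
        (
            rule["RuleConfigurationName"]
            for rule in rule_summaries
            if "Profiler" in rule["RuleConfigurationName"]
        ),
        None,
    )


# Hypothetical rule summaries, for illustration only
sample_summaries = [
    {"RuleConfigurationName": "VanishingGradient"},
    {"RuleConfigurationName": "LowGPUUtilization"},
    {"RuleConfigurationName": "ProfilerReport-example"},
]

print(find_profiler_report_rule(sample_summaries))  # ProfilerReport-example
print(find_profiler_report_rule([]))                # None
```

Checking for `None` before building the report path gives a clearer error than an index failure deep inside a comprehension.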

In [None]:
import IPython

IPython.display.HTML(filename=profiler_report_name + "/profiler-output/profiler-report.html")

## Model Deployment

In [None]:
# # Uncomment this cell when retrieving a model already trained
# from sagemaker.pytorch import PyTorch
# estimator = PyTorch.attach("pytorch-training-230218-1626-002-89bdc10e")

In [None]:
predictor=estimator.deploy(initial_instance_count=1, instance_type="ml.t2.medium")
predictor

#### Load and preprocess image to send to endpoint for prediction

In [None]:
import os
import numpy as np
import random
from PIL import Image

data_dir = './dogImages/test'

# Select a random image :)
file_handles = os.listdir(data_dir)
random_breed = random.choice(file_handles)
file_handles_breed = os.listdir(f"{data_dir}/{random_breed}")
random_img = random.choice(file_handles_breed)

print("Breed: ", random_breed)
img = Image.open(f"{data_dir}/{random_breed}/{random_img}")
img

In [None]:
file_handles_breed

In [None]:
np.asarray(img).shape

In [None]:
np.asarray(img).transpose().shape

In [None]:
np.expand_dims(np.asarray(img).transpose(), axis=0).shape
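
Note that NumPy's `transpose()` called without arguments reverses the axis order, so a (height, width, channels) image becomes (channels, width, height) rather than the (channels, height, width) layout that PyTorch image models usually expect. The sketch below models only the shape arithmetic in plain Python. `transposed_shape` is a hypothetical helper and the image size is made up.

```python
def transposed_shape(shape, axes=None):
    """Shape after ndarray.transpose; with no axes given, the axis order is reversed."""
    if axes is None:
        axes = tuple(reversed(range(len(shape))))
    return tuple(shape[a] for a in axes)


hwc = (300, 400, 3)  # hypothetical (height, width, channels)

print(transposed_shape(hwc))             # (3, 400, 300): channels, width, height
print(transposed_shape(hwc, (2, 0, 1)))  # (3, 300, 400): channels, height, width
```

With the real array, `np.asarray(img).transpose(2, 0, 1)` would keep height before width; the no-argument form swaps them, which effectively mirrors the image along its diagonal.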

#### Prediction

In [None]:
img_reshaped = np.expand_dims(np.asarray(img).transpose(), axis=0).astype(np.float32)
img_reshaped.shape

In [None]:
response_raw = predictor.predict(img_reshaped)
print("Prediction result with no processing:")
print(np.argmax(response_raw[0]) + 1)
print("Breed: ", random_breed)

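Without preprocessing, the prediction is wrong: the endpoint returned breed `83`, but the actual breed is `118`. The most likely cause is that the raw image was neither resized nor normalized the way the model expects.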

In [None]:
import torchvision.transforms as transforms

def process_image(image):    
    img = image.convert('RGB')
    data_transform = transforms.Compose(
        [transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])]
    )

    return data_transform(img)[:3,:,:].unsqueeze(0).numpy()

img_processed = process_image(img)
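
To see what `ToTensor` followed by `Normalize` does to a single pixel, the sketch below reproduces the arithmetic in plain Python: each channel is scaled from 0-255 to 0-1, then the channel mean is subtracted and the result is divided by the channel standard deviation. The constants are the ones used in `process_image`; `normalize_pixel` is a hypothetical helper for illustration.

```python
# Per-channel mean and standard deviation, as passed to transforms.Normalize above
CHANNEL_MEAN = (0.485, 0.456, 0.406)
CHANNEL_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb, mean=CHANNEL_MEAN, std=CHANNEL_STD):
    """Scale an 8-bit RGB pixel to [0, 1], then normalize each channel."""
    return tuple((value / 255.0 - m) / s for value, m, s in zip(rgb, mean, std))


print(normalize_pixel((255, 255, 255)))  # white pixel: all channels positive
print(normalize_pixel((0, 0, 0)))        # black pixel: all channels negative
```

The unprocessed array sent to the endpoint earlier held raw values in the 0-255 range, far outside the roughly -2 to 3 range produced here, which is one reason that prediction was unreliable.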

In [None]:
response = predictor.predict(img_processed)
print("Prediction result with processing:")
print(np.argmax(response[0]) + 1)
print("Breed: ", random_breed)

It works! The endpoint returned breed `118`, which matches the actual breed.
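
The `+ 1` in the cells above converts the zero-based index of the largest output into a one-based breed number. The sketch below makes that mapping explicit and also looks up the class folder name. It assumes the class folders carry a numeric prefix and that the model's class indices follow the sorted folder order (as `torchvision.datasets.ImageFolder` would assign them); both are assumptions about the training script, and the folder names in the example are placeholders.

```python
def argmax(values):
    """Index of the largest value in a sequence."""
    return max(range(len(values)), key=lambda i: values[i])


def predicted_breed(scores, class_names):
    """Return (one-based breed number, class folder name) for the top score."""
    idx = argmax(scores)
    return idx + 1, class_names[idx]


# Placeholder folder names and scores, for illustration only
class_names = sorted(["002.Breed_b", "001.Breed_a", "003.Breed_c"])
scores = [0.1, 2.5, -1.0]

print(predicted_breed(scores, class_names))  # (2, '002.Breed_b')
```

In the notebook, `class_names` could be built from `sorted(os.listdir(data_dir))` and `scores` from `response[0]`.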

In [None]:
# TODO: Remember to shutdown/delete your endpoint once your work is done
predictor.delete_endpoint()