# Image Classification with AWS Sagemaker

This project uses a dog image dataset with 133 dog breeds. A CNN is trained on top of a pretrained ResNet50 model using transfer learning. Hyperparameter tuning is then performed to search for the best parameter configuration. Next, a new model is trained with the best hyperparameters while profiling and debugging metrics are collected, to check that training behaved as expected. Finally, the model is deployed to an endpoint and queried to make some inferences.

In [None]:
!pip install smdebug

In [None]:
import sagemaker
import boto3
import os

## Dataset
The dataset is downloaded from an S3 bucket and unzipped into the local directory.

In [None]:
# Command to download and unzip data
!wget https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip
!unzip dogImages.zip

The dataset is then uploaded to an S3 bucket so that it can be used by the different instances that train the models.

In [None]:

bucket_name = 'sagemaker-us-east-1-272259209864'

def upload_directory_to_s3(local_directory, s3_bucket, s3_prefix=''):
    s3 = boto3.client('s3')
    for root, dirs, files in os.walk(local_directory):
        for file in files:
            local_path = os.path.join(root, file)
            relative_path = os.path.relpath(local_path, local_directory)
            s3_path = os.path.join(s3_prefix, relative_path).replace("\\", "/")
            s3.upload_file(local_path, s3_bucket, s3_path)

upload_directory_to_s3('dogImages', bucket_name,'dogImagesDS')
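The key-building logic in `upload_directory_to_s3` can be checked offline, without touching S3. The sketch below factors it into a hypothetical helper, `s3_keys_for_directory`, that returns the (local path, S3 key) pairs the function would upload. It uses only the standard library and a throwaway directory tree.

```python
import os
import tempfile


def s3_keys_for_directory(local_directory, s3_prefix=''):
    """Hypothetical helper: list the (local_path, s3_key) pairs that
    upload_directory_to_s3 would upload, without calling S3."""
    pairs = []
    for root, _dirs, files in os.walk(local_directory):
        for file in files:
            local_path = os.path.join(root, file)
            relative_path = os.path.relpath(local_path, local_directory)
            # S3 keys always use forward slashes, even when built on Windows
            s3_key = os.path.join(s3_prefix, relative_path).replace("\\", "/")
            pairs.append((local_path, s3_key))
    return sorted(pairs, key=lambda p: p[1])


# Usage example on a tiny temporary directory tree
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'train', '091.Japanese_chin'))
    open(os.path.join(tmp, 'train', '091.Japanese_chin', 'a.jpg'), 'w').close()
    keys = [key for _, key in s3_keys_for_directory(tmp, 'dogImagesDS')]
    print(keys)  # ['dogImagesDS/train/091.Japanese_chin/a.jpg']
```

Running it against the real `dogImages` folder before uploading is a cheap way to confirm that the keys match what the training scripts expect under `dogImagesDS/`.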

Here we can visualize an image from the dataset:

In [None]:
bucket_name = 'sagemaker-us-east-1-272259209864'
img_s3_key = 'dogImagesDS/train/091.Japanese_chin/Japanese_chin_06190.jpg'
local_path = '/tmp/Japanese_chin_06190.jpg'

s3 = boto3.client('s3')
s3.download_file(bucket_name, img_s3_key, local_path)

In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# Display the image
img = mpimg.imread(local_path)
plt.imshow(img)
plt.axis('off')
plt.show()

In [None]:
!pip install torchvision

Some transformations are applied to the images to see how they work. For instance, the images are resized to 224x224 pixels and normalized with the ImageNet channel means and standard deviations.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.transforms as transforms
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

In [None]:

data_dir='dogImages'

# Define the data transformations (applied to the train, test and validation sets)
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# ImageFolder reads from the local file system, so the local copy is used instead of the S3 URI
train_dataset = ImageFolder(root=data_dir + '/train', transform=transform)
test_dataset = ImageFolder(root=data_dir + '/test', transform=transform)
val_dataset = ImageFolder(root=data_dir + '/valid', transform=transform)
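`transforms.Normalize` applies `(x - mean[c]) / std[c]` to each channel `c` of the `[0, 1]` values produced by `ToTensor`, using the ImageNet statistics above. A pure-Python sketch of that arithmetic and its inverse (the inverse is what you need to display a normalized image correctly); `normalize_pixel` and `denormalize_pixel` are hypothetical helper names.

```python
MEAN = [0.485, 0.456, 0.406]  # ImageNet channel means used in the transform above
STD = [0.229, 0.224, 0.225]   # ImageNet channel standard deviations


def normalize_pixel(rgb):
    """Apply (x - mean) / std channel by channel to an RGB triple in [0, 1]."""
    return [(x - m) / s for x, m, s in zip(rgb, MEAN, STD)]


def denormalize_pixel(values):
    """Invert the normalization: x = value * std + mean."""
    return [v * s + m for v, m, s in zip(values, MEAN, STD)]


white = [1.0, 1.0, 1.0]
normalized = normalize_pixel(white)
print([round(v, 3) for v in normalized])  # [2.249, 2.429, 2.64]
```

Normalized values fall outside `[0, 1]`, which is why matplotlib clips (and warns about) a normalized tensor passed straight to `imshow`.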

In [None]:
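# Image after transformation; undo the normalization so matplotlib receives RGB values in [0, 1]
mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)
img_t = train_dataset[30][0] * std + mean
plt.imshow(img_t.permute(1, 2, 0).clamp(0, 1))
plt.show()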

We can also check how many classes (dog breeds) there are.

In [None]:
num_classes = len(train_dataset.classes)
num_classes

We can also count how many images each class has in the training dataset.

In [None]:
class_to_idx = train_dataset.class_to_idx
class_names = train_dataset.classes

class_counts = [0] * len(class_names)

# Count labels from the stored targets instead of loading and transforming every image
for label in train_dataset.targets:
    class_counts[label] += 1

In [None]:
plt.subplots(figsize=(20, 5))
plt.bar(class_names, class_counts, color='maroon', width=0.3)

plt.xticks(rotation=90,fontsize=8)
plt.title("Number of images per class")
plt.show()

The Alaskan Malamute class has more images than any other, with approximately 78.
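Rather than reading the maximum off the bar chart, it can be computed directly from `class_names` and `class_counts`. A minimal sketch with a hypothetical helper; the toy numbers below stand in for the real counts.

```python
def most_common_class(class_names, class_counts):
    """Return (name, count) of the class with the most images (first one wins on ties)."""
    best_index = max(range(len(class_counts)), key=lambda i: class_counts[i])
    return class_names[best_index], class_counts[best_index]


# Toy data for illustration only (not the real dataset counts)
names = ['001.Breed_a', '002.Breed_b', '003.Breed_c']
counts = [40, 60, 55]
print(most_common_class(names, counts))  # ('002.Breed_b', 60)
```

In the notebook the call would be `most_common_class(class_names, class_counts)`.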

## Test the hpo.py script

This script is used for hyperparameter tuning. To quickly check that it works, we can run it locally for a single epoch with the following command:

In [None]:
!python hpo.py \
    --batch_size 32 \
    --bucket_name sagemaker-us-east-1-272259209864 \
    --ds_path_s3 dogImages \
    --lr 0.001 \
    --momentum=0.9 \
    --epochs 1 \
    --path model.h5
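`hpo.py` itself is not shown here. As an assumption about its interface, a minimal `argparse` parser accepting the flags used above might look like the sketch below; the defaults and types are guesses, not the script's actual values. `argparse` accepts both `--momentum 0.9` and `--momentum=0.9`, and SageMaker script mode passes hyperparameters as command-line arguments, so the same parser would also serve tuning jobs.

```python
import argparse


def build_parser():
    """Hypothetical parser matching the command-line flags passed to hpo.py above."""
    parser = argparse.ArgumentParser(description="Dog breed classifier training (sketch)")
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--lr", type=float, default=0.001)
    parser.add_argument("--momentum", type=float, default=0.9)
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--bucket_name", type=str, default="")
    parser.add_argument("--ds_path_s3", type=str, default="dogImages")
    parser.add_argument("--path", type=str, default="model.h5")
    return parser


args = build_parser().parse_args(
    ["--batch_size", "32", "--lr", "0.001", "--momentum=0.9", "--epochs", "1"]
)
print(args.batch_size, args.lr, args.momentum, args.epochs)  # 32 0.001 0.9 1
```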

## Hyperparameter Tuning

In this part we try to find the best hyperparameter configuration for the model. For this purpose we use SageMaker's `HyperparameterTuner`.

In [None]:
import sagemaker
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)
from sagemaker.pytorch import PyTorch

role = sagemaker.get_execution_role()

hyperparameter_ranges = {
    "lr": ContinuousParameter(0.001, 0.1),
    "batch_size": CategoricalParameter([32, 64, 128]),
    "momentum": CategoricalParameter([0.7,0.8,0.9])
}

objective_metric_name = "Test Accuracy"
objective_type = "Maximize"
metric_definitions = [{"Name": "Test Accuracy", "Regex": "Test Accuracy: ([0-9\\.]+)"}]
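SageMaker extracts the objective metric by applying the `Regex` in `metric_definitions` to the training job's logs, so `hpo.py` has to print a line in exactly that format. A quick offline check of the pattern with Python's `re` module; the sample log line is an assumed example of what the script prints.

```python
import re

# Same pattern as in metric_definitions (the doubled backslash is a Python string escape)
pattern = "Test Accuracy: ([0-9\\.]+)"

sample_log_line = "Test Accuracy: 0.8123"  # assumed format of hpo.py's output
match = re.search(pattern, sample_log_line)
accuracy = float(match.group(1)) if match else None
print(accuracy)  # 0.8123
```

A line such as `Test Accuracy 0.81` (no colon) would not match, and the tuner would have no objective value for that job.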

In [None]:
training_path = 's3://sagemaker-us-east-1-272259209864/dogImagesDS'

In [None]:
estimator = PyTorch(
    entry_point="hpo.py",
    role=role,
    py_version='py36',
    framework_version='1.8',
    instance_count=1,
    instance_type='ml.g4dn.xlarge'
)

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    max_jobs=4,
    max_parallel_jobs=2,
    objective_type=objective_type
)

tuner.fit({"train": training_path}, wait=True)

In [None]:
best_estimator = tuner.best_estimator()

# Get the hyperparameters of the best training job
best_estimator.hyperparameters()

## Model Profiling and Debugging
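Using the best hyperparameters, we create and fine-tune a new model with the `train_model.py` script. SageMaker Debugger rules and a profiler configuration are attached to the training job so that profiling and debugging data are collected during training.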

In [None]:
from sagemaker.debugger import Rule, ProfilerRule, rule_configs
from sagemaker.debugger import DebuggerHookConfig, ProfilerConfig, FrameworkProfile, CollectionConfig
import sagemaker
from sagemaker.pytorch import PyTorch

role = sagemaker.get_execution_role()
rules = [
    # debugger rules
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.overfit()),
    
    # profiling rules
    ProfilerRule.sagemaker(rule_configs.LowGPUUtilization()),
    ProfilerRule.sagemaker(rule_configs.ProfilerReport())
]

collection_configs = [CollectionConfig(
    name="CrossEntropyLoss_output_0",
    parameters={"include_regex": "CrossEntropyLoss_output_0",
                "train.save_interval": "10", "eval.save_interval": "1"})]

profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500, framework_profile_params=FrameworkProfile(num_steps=10)
)
debugger_hook_config = DebuggerHookConfig(collection_configs=collection_configs)
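The `CollectionConfig` above asks the debugger hook to save the loss every 10 steps in training mode and every step in evaluation mode. Assuming a step is saved when its index is a multiple of the interval (our reading of smdebug's `save_interval` behavior), this small sketch shows which steps end up in the trial; `saved_steps` is a hypothetical helper.

```python
def saved_steps(num_steps, save_interval):
    """Step indices kept when saving every `save_interval` steps, starting at step 0."""
    return [step for step in range(num_steps) if step % save_interval == 0]


print(saved_steps(25, 10))     # [0, 10, 20]
print(len(saved_steps(5, 1)))  # 5
```

This explains why the number of TRAIN steps reported for `CrossEntropyLoss_output_0` later is much smaller than the number of EVAL steps relative to the batches processed.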

In [None]:
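import os
# train_model.py is assumed to read the data directory from SM_CHANNEL_TRAIN (set by SageMaker in
# training jobs); point it at the local copy for a quick local run
os.environ["SM_CHANNEL_TRAIN"] = 'dogImages'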

In [None]:
!python train_model.py \
    --batch_size 32 \
    --ds_path_s3 dogImages \
    --lr 0.001 \
    --momentum=0.9 \
    --epochs 1 \
    --path model.h5

In [None]:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.g4dn.xlarge",
    entry_point="train_model.py",
    framework_version="1.6",
    py_version="py36",
    output_path='s3://sagemaker-us-east-1-272259209864/output',
    hyperparameters={  # batch_size, lr and momentum obtained from the best tuning job
        'batch_size': 128,
        'lr': 0.05377,
        'momentum': 0.7,
        'epochs': 5
    },
    profiler_config=profiler_config,
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit({"train": training_path}, wait=True)

In [None]:
from smdebug.trials import create_trial
from smdebug.core.modes import ModeKeys
import boto3
from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob
from smdebug.profiler.analysis.notebook_utils.timeline_charts import TimelineCharts

session = boto3.session.Session()
region = session.region_name

training_job_name = estimator.latest_training_job.name
print(f"TRAINING JOB NAME: {training_job_name}")
print(f"REGION: {region}")

trial = create_trial(estimator.latest_job_debugger_artifacts_path())

print('TENSOR NAMES:',trial.tensor_names())
print('TRAIN: CrossEntropyLoss_output_0',len(trial.tensor("CrossEntropyLoss_output_0").steps(mode=ModeKeys.TRAIN)))
print('TEST: CrossEntropyLoss_output_0',len(trial.tensor("CrossEntropyLoss_output_0").steps(mode=ModeKeys.EVAL)))

tj = TrainingJob(training_job_name, region)
tj.wait_for_sys_profiling_data_to_be_available()

system_metrics_reader = tj.get_systems_metrics_reader()
system_metrics_reader.refresh_event_file_list()

view_timeline_charts = TimelineCharts(
    system_metrics_reader,
    framework_metrics_reader=None,
    select_dimensions=["CPU", "GPU"],
    select_events=["total"],
)

Next, the cross-entropy loss is plotted for training and evaluation. It decreases over time for both, which is a good sign that the model is learning.

In [None]:
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import host_subplot

def get_data(trial, tname, mode):
    tensor = trial.tensor(tname)
    steps = tensor.steps(mode=mode)
    vals = []
    for s in steps:
        vals.append(tensor.value(s, mode=mode))
    return steps, vals

def plot_tensor(trial, tensor_name):
    steps_train, vals_train = get_data(trial, tensor_name, mode=ModeKeys.TRAIN)
    print("loaded TRAIN data")
    steps_eval, vals_eval = get_data(trial, tensor_name, mode=ModeKeys.EVAL)
    print("loaded EVAL data")

    fig = plt.figure(figsize=(10, 7))
    host = host_subplot(111)

    par = host.twiny()

    host.set_xlabel("Steps (TRAIN)")
    par.set_xlabel("Steps (EVAL)")
    host.set_ylabel(tensor_name)

    (p1,) = host.plot(steps_train, vals_train, label=tensor_name)
    print("completed TRAIN plot")
    (p2,) = par.plot(steps_eval, vals_eval, label="val_" + tensor_name)
    print("completed EVAL plot")
    leg = plt.legend()

    host.xaxis.get_label().set_color(p1.get_color())
    leg.texts[0].set_color(p1.get_color())

    par.xaxis.get_label().set_color(p2.get_color())
    leg.texts[1].set_color(p2.get_color())

    plt.ylabel(tensor_name)

    plt.show()

plot_tensor(trial, "CrossEntropyLoss_output_0")
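To back the visual impression that the loss decreases with a number, one can compare the average loss at the start and at the end of each curve. A minimal sketch; `loss_decreased` and the window size are illustrative choices. In the notebook it could be applied to the `vals` lists returned by `get_data` (the values are converted with `float`, which works for single-element arrays).

```python
def loss_decreased(values, window=3):
    """Return True if the mean of the last `window` values is below the mean of the first `window`."""
    values = [float(v) for v in values]
    if len(values) < 2 * window:
        raise ValueError("need at least 2 * window values")
    start = sum(values[:window]) / window
    end = sum(values[-window:]) / window
    return end < start


# Toy loss curve for illustration only
print(loss_decreased([4.9, 4.7, 4.6, 3.1, 2.5, 2.2, 1.9]))  # True
```

For example, `loss_decreased(get_data(trial, "CrossEntropyLoss_output_0", ModeKeys.EVAL)[1])` would check the evaluation curve.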

In [None]:
# output_path was set without a trailing slash, so add one before the job name
rule_output_path = estimator.output_path.rstrip("/") + "/" + estimator.latest_training_job.job_name + "/rule-output"
print(f"You will find the profiler report in {rule_output_path}")
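Because `output_path` was set without a trailing slash, plain string concatenation can glue the output folder and the job name together. A small hypothetical helper that joins S3 URI parts with exactly one slash between them:

```python
def join_s3_uri(*parts):
    """Join S3 URI segments with single slashes, e.g. ('s3://b/output', 'job', 'rule-output')."""
    first, *rest = parts
    pieces = [first.rstrip("/")] + [p.strip("/") for p in rest]
    return "/".join(pieces)


print(join_s3_uri("s3://my-bucket/output", "my-job", "rule-output"))
# s3://my-bucket/output/my-job/rule-output
```

With it, the path above could be written as `join_s3_uri(estimator.output_path, estimator.latest_training_job.job_name, "rule-output")`, which works whether or not `output_path` ends with a slash.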

In [None]:
! aws s3 ls {rule_output_path} --recursive
! aws s3 cp {rule_output_path} ./ --recursive

In [None]:
import os

# get the autogenerated folder name of profiler report
profiler_report_name = [
    rule["RuleConfigurationName"]
    for rule in estimator.latest_training_job.rule_job_summary()
    if "Profiler" in rule["RuleConfigurationName"]][0]

import IPython

IPython.display.HTML(filename=profiler_report_name + "/profiler-output/profiler-report.html")

## Model Deployment

In [None]:
estimator.model_data

In [None]:
from sagemaker.pytorch import PyTorchModel
import sagemaker

role = sagemaker.get_execution_role()

pytorch_model = PyTorchModel(
    model_data=estimator.model_data, 
    role=role, 
    entry_point='predictor.py',
    py_version="py36",
    framework_version="1.6"
)


In [None]:
predictor = pytorch_model.deploy(initial_instance_count=1,
                                 instance_type='ml.t2.medium')

## Model Inference

In [None]:
import io
import torchvision.transforms as transforms
from sagemaker.serializers import IdentitySerializer
import numpy as np
from PIL import Image

predictor.serializer = IdentitySerializer("image/jpeg")

In [None]:
def make_prediction_to_endpoint(input_path, actual_label):
    # Re-encode the image as RGB JPEG bytes, matching the serializer's content type
    buf = io.BytesIO()
    Image.open(input_path).convert('RGB').save(buf, format='JPEG')
    prediction = predictor.predict(buf.getvalue())
    # ImageFolder indices are 0-based; +1 maps them to the 1-based folder numbers (e.g. 091.Japanese_chin)
    predicted = np.argmax(prediction) + 1
    print('[LABELS INFO]   Predicted:', predicted, 'Actual:', actual_label)
    if predicted == actual_label:
        print('Correct prediction?', 'yes')
    else:
        print('Correct prediction?', 'no :(')
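The endpoint returns one score per class, and `np.argmax` gives a 0-based index into the sorted `ImageFolder` classes. To print a breed name instead of a number, the index can be looked up in `train_dataset.classes`, whose entries are folder names such as `091.Japanese_chin`. A sketch with a hypothetical helper; the toy class list stands in for the real one.

```python
def index_to_breed(index, class_names):
    """Map a 0-based class index to (folder number, breed name),
    e.g. '091.Japanese_chin' -> (91, 'Japanese chin')."""
    folder = class_names[index]
    number, _, name = folder.partition(".")
    return int(number), name.replace("_", " ")


# Toy list of folder names; in the notebook use train_dataset.classes
classes = ["001.Breed_a", "002.Breed_b", "091.Japanese_chin"]
print(index_to_breed(2, classes))  # (91, 'Japanese chin')
```

Reading the number from the folder name avoids relying on the `+1` offset, which only holds if the folder numbers run contiguously from `001`.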

In [None]:
make_prediction_to_endpoint('test-images/French_bulldog_04764.jpg', 69) # FRENCH BULLDOG
make_prediction_to_endpoint('test-images/Chihuahua_03459.jpg', 48) # CHIHUAHUA
make_prediction_to_endpoint('test-images/187bb51c-fd77-444e-9797-895e54eb238b.JPG', 61) # English_cocker_spaniel

In [None]:
predictor.delete_endpoint()