# Image Classification using AWS SageMaker

This notebook interfaces with SageMaker to find the best hyperparameters, submit training jobs, and deploy the trained model for inference. Model training runs with profiling and debugging enabled, and the corresponding reports are generated.

In [2]:
# TODO: Install any packages that you might need
# For instance, you will need the smdebug package
!pip install smdebug -q -U

In [4]:
import sagemaker
import boto3
from sagemaker.tuner import (IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner)
from sagemaker.pytorch import PyTorch
from sagemaker.debugger import (Rule, ProfilerRule, rule_configs, DebuggerHookConfig, ProfilerConfig, FrameworkProfile)

In [5]:
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()
role = sagemaker.get_execution_role()

In [6]:
print(f"Default Bucket: {bucket}")
print(f"RoleArn: {role}")

Default Bucket: sagemaker-us-east-1-924952372462
RoleArn: arn:aws:iam::924952372462:role/service-role/AmazonSageMaker-ExecutionRole-20230203T004189


## Dataset
The provided dataset is the dog breed classification dataset, which can be found in the classroom.
It contains images of 133 dog breeds, divided into training, validation, and test sets. The dataset can be
downloaded from [here](https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip).

In [8]:
!wget https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip
!unzip -q dogImages.zip

--2023-02-02 21:42:05--  https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip
Resolving s3-us-west-1.amazonaws.com (s3-us-west-1.amazonaws.com)... 52.219.116.232
Connecting to s3-us-west-1.amazonaws.com (s3-us-west-1.amazonaws.com)|52.219.116.232|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1132023110 (1.1G) [application/zip]
Saving to: ‘dogImages.zip.1’


2023-02-02 21:42:43 (30.2 MB/s) - ‘dogImages.zip.1’ saved [1132023110/1132023110]



In [None]:
# Run this cell only the first time, to upload the dataset to S3. On later runs, skip it
# and run the next cell, which sets the S3 path directly.
local_dir = 'dogImages'
prefix = "image_classification_project"
inputs = sagemaker_session.upload_data(path=local_dir, bucket=bucket, key_prefix=prefix)
print(f"input spec (in this case, just an S3 path): {inputs}")

In [7]:
inputs = "s3://sagemaker-us-east-1-924952372462/image_classification_project"
print(f"input spec (in this case, just an S3 path): {inputs}")

input spec (in this case, just an S3 path): s3://sagemaker-us-east-1-924952372462/image_classification_project
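
The input spec is an ordinary S3 URI, so the bucket and key prefix can be recovered with the standard library when they are needed separately (for example, to list the uploaded objects). This is a minimal sketch; the helper name `split_s3_uri` is hypothetical and not part of the SageMaker SDK.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split an S3 URI into (bucket, key prefix). Hypothetical helper."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3":
        raise ValueError(f"Not an S3 URI: {s3_uri}")
    # netloc holds the bucket; path holds the key prefix with a leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


bucket_name, key_prefix = split_s3_uri(
    "s3://sagemaker-us-east-1-924952372462/image_classification_project"
)
print(bucket_name, key_prefix)
```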


## Hyperparameter Tuning
It's time to fine-tune a pretrained model with hyperparameter tuning. The selected hyperparameters are:

1. learning rate: for faster convergence
2. batch size: for an efficient training time
3. epochs: for an efficient training time

The ```hpo.py``` script is the entry point used for the hyperparameter tuning jobs.
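
SageMaker passes each hyperparameter to the entry point as a command-line argument and exposes the `training` channel through the `SM_CHANNEL_TRAINING` environment variable. The actual contents of `hpo.py` are not shown in this notebook, so the following is only a sketch of how such a script might read its inputs; the function name, the `--data-dir` argument, and all default values are assumptions.

```python
import argparse
import os


def parse_hyperparameters(argv=None):
    """Parse the tuned hyperparameters (sketch; defaults are assumptions)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, default=0.01)
    parser.add_argument("--batch-size", type=int, default=16)
    parser.add_argument("--epochs", type=int, default=5)
    # Inside a training container, the "training" channel is mounted at the
    # path stored in SM_CHANNEL_TRAINING; fall back to a local folder otherwise.
    parser.add_argument(
        "--data-dir",
        type=str,
        default=os.environ.get("SM_CHANNEL_TRAINING", "./dogImages"),
    )
    return parser.parse_args(argv)


# Simulate the arguments that a tuning job could pass to the script.
args = parse_hyperparameters(["--lr", "0.005", "--batch-size", "64", "--epochs", "5"])
print(args.lr, args.batch_size, args.epochs)
```

Note that argparse exposes `--batch-size` as `args.batch_size`, which is why the hyphenated name used in the tuner works on the script side.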

In [8]:
hyperparameter_ranges = {
    "lr": ContinuousParameter(0.001, 0.1),
    "batch-size": CategoricalParameter([16, 64]),
    "epochs": IntegerParameter(5, 10),
}

In [9]:
objective_metric_name = "average test loss"
objective_type = "Minimize"
metric_definitions = [{"Name": "average test loss", "Regex": "Test Loss: ([+-]?[0-9\\.]+)"}]
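
The tuner reads the objective metric by applying the regex above to the training job's log output, so the script has to print the loss in a matching format. The sketch below checks the pattern against a sample line; the log line itself is hypothetical, since the real output of `hpo.py` is not shown here.

```python
import re

# Same pattern as in metric_definitions, written as a raw string.
metric_regex = r"Test Loss: ([+-]?[0-9\.]+)"

# Hypothetical log line; the real format depends on what hpo.py prints.
sample_log_line = "Epoch 5 - Test Loss: 0.4321"

match = re.search(metric_regex, sample_log_line)
extracted_loss = float(match.group(1)) if match else None
print(extracted_loss)
```

If the pattern does not match any log line, the tuner has no metric to optimize, so it is worth testing the regex locally before launching jobs.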

In [10]:
estimator = PyTorch(
    entry_point="hpo.py",
    role=role,
    py_version="py36",
    framework_version="1.8",
    instance_count=1,
    instance_type="ml.g4dn.xlarge"
)

tuner = HyperparameterTuner(
    estimator = estimator,
    early_stopping_type = "Auto",
    metric_definitions = metric_definitions,
    objective_metric_name = objective_metric_name,
    objective_type = objective_type,
    max_jobs = 4,
    max_parallel_jobs = 2,
    hyperparameter_ranges = hyperparameter_ranges
)

In [11]:
tuner.fit({"training": inputs}, wait=True)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating hyperparameter tuning job with name: pytorch-training-230205-1534




In [12]:
best_estimator = tuner.best_estimator()

best_estimator.hyperparameters()


2023-02-05 16:11:54 Starting - Found matching resource for reuse
2023-02-05 16:11:54 Downloading - Downloading input data
2023-02-05 16:11:54 Training - Training image download completed. Training in progress.
2023-02-05 16:11:54 Uploading - Uploading generated training model
2023-02-05 16:11:54 Completed - Resource retained for reuse


{'_tuning_objective_metric': '"average test loss"',
 'batch-size': '"64"',
 'epochs': '5',
 'lr': '0.005959193242609645',
 'sagemaker_container_log_level': '20',
 'sagemaker_estimator_class_name': '"PyTorch"',
 'sagemaker_estimator_module': '"sagemaker.pytorch.estimator"',
 'sagemaker_job_name': '"pytorch-training-2023-02-05-15-34-51-995"',
 'sagemaker_program': '"hpo.py"',
 'sagemaker_region': '"us-east-1"',
 'sagemaker_submit_directory': '"s3://sagemaker-us-east-1-924952372462/pytorch-training-2023-02-05-15-34-51-995/source/sourcedir.tar.gz"'}

## Model Profiling and Debugging
With the best hyperparameter values in hand, we can train the model. We also enable the debugger and profiler to monitor the training process. The
```train_model.py``` script handles the training phase of our classification task.

In [13]:
best_hyperparameters = {
    "batch-size": int(best_estimator.hyperparameters()["batch-size"].replace('"', "")),
    "epochs": best_estimator.hyperparameters()["epochs"],
    "lr": best_estimator.hyperparameters()["lr"],
}
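
The values returned by `best_estimator.hyperparameters()` are JSON-encoded strings (note the nested quotes in `'"64"'` in the output above). As an alternative to stripping the quotes by hand, they can be decoded with `json.loads`. This is a sketch that uses a hand-copied subset of the dictionary printed earlier rather than a live estimator.

```python
import json

# Subset of the dictionary printed by best_estimator.hyperparameters() above.
raw_hyperparameters = {
    "batch-size": '"64"',
    "epochs": "5",
    "lr": "0.005959193242609645",
}

# json.loads removes one level of JSON encoding; the casts then give real numbers.
decoded_hyperparameters = {
    "batch-size": int(json.loads(raw_hyperparameters["batch-size"])),
    "epochs": int(json.loads(raw_hyperparameters["epochs"])),
    "lr": float(json.loads(raw_hyperparameters["lr"])),
}
print(decoded_hyperparameters)
```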

In [14]:
rules = [
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.overtraining()),
    Rule.sagemaker(rule_configs.poor_weight_initialization()),
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
]

profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500, framework_profile_params=FrameworkProfile(num_steps=10)
)

debugger_config = DebuggerHookConfig(
    hook_parameters={"train.save_interval": "100", "eval.save_interval": "10"}
)
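
The hook parameters above control how often tensors are saved: every 100 steps in training mode and every 10 steps in evaluation mode. As a rough illustration, assuming a step is saved when its index is a multiple of the interval, the sketch below lists the saved steps for made-up step counts. The helper `saved_steps` is hypothetical and is not part of smdebug.

```python
def saved_steps(total_steps, save_interval):
    """Return the step indices that fall on the save interval (sketch)."""
    return [step for step in range(total_steps) if step % save_interval == 0]


# The step counts (250 and 35) are made up for illustration.
train_saved = saved_steps(total_steps=250, save_interval=100)
eval_saved = saved_steps(total_steps=35, save_interval=10)
print(train_saved)
print(eval_saved)
```

A smaller interval gives the debugger rules more data points to evaluate, at the cost of more data written to S3.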

In [15]:
estimator = PyTorch(
    entry_point="train_model.py",
    framework_version="1.6",
    py_version="py36",
    role=role,
    instance_count=1,
    instance_type="ml.g4dn.xlarge",
    hyperparameters=best_hyperparameters,
    rules=rules,
    profiler_config=profiler_config,
    debugger_hook_config=debugger_config,
)

In [16]:
estimator.fit({"training": inputs}, wait=True)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker.image_uris:image_uri is not presented, retrieving i

2023-02-05 16:20:54 Starting - Starting the training job...
2023-02-05 16:21:11 Starting - Preparing the instances for training
VanishingGradient: InProgress
Overfit: InProgress
Overtraining: InProgress
PoorWeightInitialization: InProgress
ProfilerReport: InProgress
......
2023-02-05 16:22:23 Downloading - Downloading input data.........
2023-02-05 16:23:44 Training - Downloading the training image...
2023-02-05 16:24:24 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-02-05 16:24:13,796 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-02-05 16:24:13,826 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-02-05 16:24:13,828 sagemaker_pytorch_container.training INFO     Invoking user training script.
2023-02-05 16:24:1

In [17]:
import boto3

session = boto3.session.Session()
region = session.region_name

training_job_name = estimator.latest_training_job.name
print(f"Training jobname: {training_job_name}")
print(f"Region: {region}")

Training jobname: pytorch-training-2023-02-05-16-20-53-990
Region: us-east-1


In [18]:
print(type(estimator.latest_training_job))

<class 'sagemaker.estimator._TrainingJob'>


As we can see, ```GPUMemoryUtilization``` oscillates as memory is repeatedly allocated and released, which suggests that
our training job is I/O-intensive. This observation is consistent with the profiler results shown below.

In [38]:
# bokeh 1.4.0
# jinja2 3.1.2
# flask 1.1.1
# !pip show bokeh
# !pip uninstall bokeh -y
# !pip uninstall panel -y

# !pip install panel==0.9.3 -q
# !pip install bokeh -q

!pip show smdebug

Name: smdebug
Version: 1.0.12
Summary: Amazon SageMaker Debugger is an offering from AWS which helps you automate the debugging of machine learning training jobs.
Home-page: https://github.com/awslabs/sagemaker-debugger
Author: AWS DeepLearning Team
Author-email: 
License: Apache License Version 2.0
Location: /opt/conda/lib/python3.7/site-packages
Requires: boto3, numpy, packaging, protobuf, pyinstrument
Required-by: 


In [19]:
from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob

tj = TrainingJob(training_job_name, region)
tj.wait_for_sys_profiling_data_to_be_available()

ProfilerConfig:{'S3OutputPath': 's3://sagemaker-us-east-1-924952372462/', 'ProfilingIntervalInMilliseconds': 500, 'ProfilingParameters': {'DataloaderProfilingConfig': '{"StartStep": 0, "NumSteps": 10, "MetricsRegex": ".*", }', 'DetailedProfilingConfig': '{"StartStep": 0, "NumSteps": 10, }', 'FileOpenFailThreshold': '50', 'HorovodProfilingConfig': '{"StartStep": 0, "NumSteps": 10, }', 'LocalPath': '/opt/ml/output/profiler', 'PythonProfilingConfig': '{"StartStep": 0, "NumSteps": 10, "ProfilerName": "cprofile", "cProfileTimer": "total_time", }', 'RotateFileCloseIntervalInSeconds': '60', 'RotateMaxFileSizeInBytes': '10485760', 'SMDataParallelProfilingConfig': '{"StartStep": 0, "NumSteps": 10, }'}, 'DisableProfiler': False}
s3 path:s3://sagemaker-us-east-1-924952372462/pytorch-training-2023-02-05-16-20-53-990/profiler-output


Profiler data from system is available


In [20]:
from smdebug.profiler.analysis.notebook_utils.timeline_charts import TimelineCharts

system_metrics_reader = tj.get_systems_metrics_reader()
system_metrics_reader.refresh_event_file_list()

view_timeline_charts = TimelineCharts(
    system_metrics_reader,
    framework_metrics_reader=None,
    select_dimensions=["CPU", "GPU"],
    select_events=["total"],
)

[2023-02-05 16:37:35.693 pytorch-1-12-gpu-py38-ml-t3-medium-5c243b1aae6a0c39c0e73a7744b4:32 INFO metrics_reader_base.py:134] Getting 15 event files
select events:['total']
select dimensions:['CPU', 'GPU']
filtered_events:{'total'}
filtered_dimensions:{'GPUMemoryUtilization-nodeid:algo-1', 'CPUUtilization-nodeid:algo-1', 'GPUUtilization-nodeid:algo-1'}


In [21]:
rule_output_path = estimator.output_path + estimator.latest_training_job.job_name + "/rule-output"
print(f"You will find the profiler report in {rule_output_path}")

You will find the profiler report in s3://sagemaker-us-east-1-924952372462/pytorch-training-2023-02-05-16-20-53-990/rule-output


In [22]:
! aws s3 ls {rule_output_path} --recursive
! aws s3 cp {rule_output_path} ./ --recursive

2023-02-05 16:36:36     416685 pytorch-training-2023-02-05-16-20-53-990/rule-output/ProfilerReport/profiler-output/profiler-report.html
2023-02-05 16:36:35     272335 pytorch-training-2023-02-05-16-20-53-990/rule-output/ProfilerReport/profiler-output/profiler-report.ipynb
2023-02-05 16:36:30        587 pytorch-training-2023-02-05-16-20-53-990/rule-output/ProfilerReport/profiler-output/profiler-reports/BatchSize.json
2023-02-05 16:36:30      20041 pytorch-training-2023-02-05-16-20-53-990/rule-output/ProfilerReport/profiler-output/profiler-reports/CPUBottleneck.json
2023-02-05 16:36:30        126 pytorch-training-2023-02-05-16-20-53-990/rule-output/ProfilerReport/profiler-output/profiler-reports/Dataloader.json
2023-02-05 16:36:30        130 pytorch-training-2023-02-05-16-20-53-990/rule-output/ProfilerReport/profiler-output/profiler-reports/GPUMemoryIncrease.json
2023-02-05 16:36:30       1050 pytorch-training-2023-02-05-16-20-53-990/rule-output/ProfilerReport/profiler-output/profiler-re

In [23]:
profiler_report_name = [
    rule["RuleConfigurationName"]
    for rule in estimator.latest_training_job.rule_job_summary()
    if "Profiler" in rule["RuleConfigurationName"]
][0]

In [24]:
import IPython

IPython.display.HTML(filename=profiler_report_name + "/profiler-output/profiler-report.html")

| Rule | Description | Recommendation | Number of times rule triggered | Number of datapoints | Rule parameters |
|---|---|---|---|---|---|
| LowGPUUtilization | Checks if the GPU utilization is low or fluctuating. This can happen due to bottlenecks, blocking calls for synchronizations, or a small batch size. | Check if there are bottlenecks, minimize blocking calls, change distributed training strategy, or increase the batch size. | 6 | 1650 | threshold_p95:70, threshold_p5:10, window:500, patience:1000 |
| BatchSize | Checks if GPUs are underutilized because the batch size is too small. To detect this problem, the rule analyzes the average GPU memory footprint, the CPU and the GPU utilization. | The batch size is too small, and GPUs are underutilized. Consider running on a smaller instance type or increasing the batch size. | 6 | 1649 | cpu_threshold_p95:70, gpu_threshold_p95:70, gpu_memory_threshold_p95:70, patience:1000, window:500 |
| StepOutlier | Detects outliers in step duration. The step duration for forward and backward pass should be roughly the same throughout the training. If there are significant outliers, it may indicate a system stall or bottleneck issues. | Check if there are any bottlenecks (CPU, I/O) correlated to the step outliers. | 0 | 33402 | threshold:3, mode:None, n_outliers:10, stddev:3 |
| Dataloader | Checks how many data loaders are running in parallel and whether the total number is equal the number of available CPU cores. The rule triggers if number is much smaller or larger than the number of available cores. If too small, it might lead to low GPU utilization. If too large, it might impact other compute intensive operations on CPU. | Change the number of data loader processes. | 0 | 0 | min_threshold:70, max_threshold:200 |
| MaxInitializationTime | Checks if the time spent on initialization exceeds a threshold percent of the total training time. The rule waits until the first step of training loop starts. The initialization can take longer if downloading the entire dataset from Amazon S3 in File mode. The default threshold is 20 minutes. | Initialization takes too long. If using File mode, consider switching to Pipe mode in case you are using TensorFlow framework. | 0 | 33402 | threshold:20 |
| IOBottleneck | Checks if the data I/O wait time is high and the GPU utilization is low. It might indicate IO bottlenecks where GPU is waiting for data to arrive from storage. The rule evaluates the I/O and GPU utilization rates and triggers the issue if the time spent on the IO bottlenecks exceeds a threshold percent of the total training time. The default threshold is 50 percent. | Pre-fetch data or choose different file formats, such as binary formats that improve I/O performance. | 0 | 1653 | threshold:50, io_threshold:50, gpu_threshold:10, patience:1000 |
| CPUBottleneck | Checks if the CPU utilization is high and the GPU utilization is low. It might indicate CPU bottlenecks, where the GPUs are waiting for data to arrive from the CPUs. The rule evaluates the CPU and GPU utilization rates, and triggers the issue if the time spent on the CPU bottlenecks exceeds a threshold percent of the total training time. The default threshold is 50 percent. | Consider increasing the number of data loaders or applying data pre-fetching. | 0 | 1653 | threshold:50, cpu_threshold:90, gpu_threshold:10, patience:1000 |
| GPUMemoryIncrease | Measures the average GPU memory footprint and triggers if there is a large increase. | Choose a larger instance type with more memory if footprint is close to maximum available memory. | 0 | 1650 | increase:5, patience:1000, window:10 |
| LoadBalancing | Detects workload balancing issues across GPUs. Workload imbalance can occur in training jobs with data parallelism. The gradients are accumulated on a primary GPU, and this GPU might be overused with regard to other GPUs, resulting in reducing the efficiency of data parallelization. | Choose a different distributed training strategy or a different distributed training framework. | 0 | 1650 | threshold:0.2, patience:1000 |

| Metric | mean | max | p99 | p95 | p50 | min |
|---|---|---|---|---|---|---|
| Step Durations in [s] | 0.01 | 85.37 | 0.02 | 0.01 | 0.0 | 0.0 |
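
To see at a glance which rules fired, the trigger counts from the report can be filtered in plain Python. The dictionary below is transcribed by hand from the rule summary above; it is not read from the report files, and the variable names are illustrative.

```python
# Trigger counts copied by hand from the profiler rule summary above.
rule_trigger_counts = {
    "LowGPUUtilization": 6,
    "BatchSize": 6,
    "StepOutlier": 0,
    "Dataloader": 0,
    "MaxInitializationTime": 0,
    "IOBottleneck": 0,
    "CPUBottleneck": 0,
    "GPUMemoryIncrease": 0,
    "LoadBalancing": 0,
}

# Keep only the rules that were triggered at least once.
triggered_rules = [name for name, count in rule_trigger_counts.items() if count > 0]
print(triggered_rules)
```

Only `LowGPUUtilization` and `BatchSize` fired, and the recommendations for both include increasing the batch size.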


## Model Deploying

The model is deployed using a stand-alone script ([inference.py](./inference.py) in our project). At a minimum, this
script must define ```model_fn```, which loads the model for inference.

In [25]:
from sagemaker.pytorch import PyTorchModel

In [26]:
model_data = estimator.output_path + estimator.latest_training_job.job_name + "/output/model.tar.gz"
print(f"Model: {model_data}")

Model: s3://sagemaker-us-east-1-924952372462/pytorch-training-2023-02-05-16-20-53-990/output/model.tar.gz


In [27]:
pytorch_model = PyTorchModel(
    model_data=model_data, 
    role=role, 
    entry_point='inference.py',
    py_version="py36",
    framework_version="1.8"
)

In [28]:
predictor = pytorch_model.deploy(initial_instance_count=1, instance_type="ml.t2.medium")

INFO:sagemaker:Creating model with name: pytorch-inference-2023-02-05-16-39-08-195
INFO:sagemaker:Creating endpoint-config with name pytorch-inference-2023-02-05-16-39-08-888
INFO:sagemaker:Creating endpoint with name pytorch-inference-2023-02-05-16-39-08-888


---------!

In [29]:
!pip install torchvision -q


In [30]:
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
import torchvision.transforms as transforms

INFO:matplotlib.font_manager:generated new fontManager


In [31]:
image_path = "./dogImages/test/011.Australian_cattle_dog/Australian_cattle_dog_00734.jpg"

In [32]:
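image = Image.open(image_path)
# Resize the shorter side to 224 pixels and convert the image to a tensor.
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
])
# Add a batch dimension: (C, H, W) -> (1, C, H, W).
preprocessed_image = transform(image).unsqueeze(0)
preprocessed_image = preprocessed_image.to("cpu")
response = predictor.predict(preprocessed_image)

# Class folders are numbered from 1, so shift the zero-based argmax index by one.
pred = np.argmax(response, 1) + 1

# The true label is the numeric prefix of the class folder, e.g. "011" -> 11.
actual = int(image_path.split('.')[1].split('/')[-1])
print(f"Actual: {actual}, Prediction: {pred[0]}")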

Actual: 11, Prediction: 11
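
The comparison above relies on two conventions: the class folder name starts with a one-based class number, and the model's output index is zero-based. The sketch below isolates both steps using only the standard library. The helper names are hypothetical, and the score list is made up rather than taken from the endpoint.

```python
from pathlib import Path


def label_from_path(image_path):
    """Read the one-based class number from a folder such as '011.Australian_cattle_dog'."""
    class_folder = Path(image_path).parent.name
    return int(class_folder.split(".")[0])


def predicted_label(scores):
    """Convert the zero-based index of the highest score to a one-based class number."""
    best_index = max(range(len(scores)), key=lambda index: scores[index])
    return best_index + 1


actual_label = label_from_path(
    "./dogImages/test/011.Australian_cattle_dog/Australian_cattle_dog_00734.jpg"
)
# Made-up scores for three classes; the second class has the highest score.
example_prediction = predicted_label([0.1, 2.5, -0.3])
print(actual_label, example_prediction)
```

Parsing the parent folder name with `pathlib` avoids depending on where the dots fall in the rest of the path.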


In [None]:
predictor.delete_endpoint()