# Deep learning: classifying car models

This project sets up infrastructure to manage the lifecycle of a machine learning model. It uses AWS resources such as SageMaker and EC2 to create, train, and deploy a pretrained model that classifies images from the Stanford Cars dataset.

Because the focus of this project is the infrastructure that supports machine learning models, it also uses tools that monitor and test the model's performance, in addition to the activities mentioned above.

In [23]:
#Install needed dependencies
!pip install smdebug



In [3]:
#Import libraries
import sagemaker
import boto3
import os
from torchvision.datasets.stanford_cars import StanfordCars
#For Hyperparameter tuning
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)
from torchvision import transforms
from sagemaker.pytorch import PyTorch #Estimator

#For debugging and profiling
from sagemaker.debugger import Rule, DebuggerHookConfig, ProfilerRule, rule_configs
from sagemaker.debugger import ProfilerConfig, FrameworkProfile


from smdebug.trials import create_trial
from smdebug.core.modes import ModeKeys
from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob
from smdebug.profiler.analysis.notebook_utils.timeline_charts import TimelineCharts

import IPython



session = sagemaker.session.Session()
bucket = session.default_bucket()
role = sagemaker.get_execution_role()
region = session.boto_region_name





## Dataset

This project uses the Stanford Cars dataset, which according to its creators:
> Contains 16,185 images of 196 classes of cars. The data is split into 8,144 training images and 8,041 testing images, where each class has been split roughly in a 50-50 split. Classes are typically at the level of Make, Model, Year, e.g. 2012 Tesla Model S or 2012 BMW M3 coupe.

In [4]:
# Command to download and unzip data
local_dir = 'data'
StanfordCars.mirrors = ["https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/StanfordCars/"]
image_dataset = StanfordCars(
    local_dir,
    download=True,
    transform=transforms.Compose(
        [transforms.ToTensor(),
        transforms.Resize([244,244])]
    )
)

In [24]:
#Uploading dataset to S3
prefix = "StanfordCars"

inputs = session.upload_data(path="data", bucket=bucket, key_prefix=prefix)
print("input spec (in this case, just an S3 path): {}".format(inputs))

input spec (in this case, just an S3 path): s3://sagemaker-us-east-1-793615537614/StanfordCars


#### For curious readers:

The Stanford Cars dataset used in this project is available through the `torchvision` library and is documented on the [PyTorch website](https://pytorch.org/vision/stable/generated/torchvision.datasets.StanfordCars.html#torchvision.datasets.StanfordCars). The original documentation for the dataset can be found [here](https://ai.stanford.edu/~jkrause/cars/car_dataset.html).

## Hyperparameter Tuning
This section fine-tunes a pretrained model using hyperparameter tuning. Three hyperparameters are tuned: the learning rate (`lr`), the batch size (`batch-size`), and the number of epochs (`epochs`).

**Note:** The `hpo.py` script is the training entry point for hyperparameter tuning.

In [36]:
# Declare the hyperparameters that will be tuned.
# Alternative for a continuous search: "lr": ContinuousParameter(0.01, 0.1)
hyperparameter_ranges = {
    "lr": CategoricalParameter([0.01, 0.1]),
    "batch-size": CategoricalParameter([256, 512]),
    "epochs": CategoricalParameter([2, 3]),
}
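
With three categorical parameters of two values each, the search space is small enough to enumerate. The following is a minimal sketch that mirrors the ranges above with plain Python lists; the `search_space` and `combinations` names are illustrative and not part of the SageMaker SDK.

```python
from itertools import product

# Plain-Python mirror of the categorical ranges declared above (illustrative only).
search_space = {
    "lr": [0.01, 0.1],
    "batch-size": [256, 512],
    "epochs": [2, 3],
}

# Build every combination of one value per hyperparameter.
combinations = [
    dict(zip(search_space, values))
    for values in product(*search_space.values())
]

print(len(combinations))
for combo in combinations:
    print(combo)
```

The space contains 8 combinations, so a budget of 8 training jobs could cover each one once, although the tuner's search strategy decides which combinations actually run.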

In [40]:
estimator = PyTorch(
    entry_point='hpo.py',
    role=role,
    py_version='py38',
    framework_version='1.12',
    instance_count=1,
    instance_type='ml.m5.xlarge'
)

#The training script uses cross entropy loss function, as this is a multiclass classification problem
objective_metric_name = "average test loss"
objective_type = "Minimize"
metric_definitions = [{"Name": "average test loss", "Regex": "Test set: Average loss: ([0-9\\.]+)"}]

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    max_jobs=8,
    max_parallel_jobs=8,
    objective_type=objective_type
)
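
The tuner reads the objective metric by applying the regular expression in `metric_definitions` to the training logs. The sketch below checks that pattern against a sample line; the log line is hypothetical, since the exact text printed by `hpo.py` is not shown here.

```python
import re

# Same pattern as the "Regex" entry in metric_definitions above.
pattern = r"Test set: Average loss: ([0-9\.]+)"

# Hypothetical log line; the real format printed by hpo.py is an assumption.
log_line = "Test set: Average loss: 0.4321, Accuracy: 50/100 (50%)"

# The first capture group holds the numeric value of the metric.
match = re.search(pattern, log_line)
value = float(match.group(1)) if match else None
print(value)
```

If the training script prints the loss in a different format, the pattern finds no match and the tuning job has no objective value to compare.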

In [41]:
# Launch the hyperparameter tuning job and wait for it to finish
tuner.fit({'training': inputs}, wait=True)

No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config

In [43]:
best_estimator = tuner.best_estimator()

#Get the hyperparameters of the best trained model
best_estimator.hyperparameters()


2023-02-12 04:37:52 Starting - Preparing the instances for training
2023-02-12 04:37:52 Downloading - Downloading input data
2023-02-12 04:37:52 Training - Training image download completed. Training in progress.
2023-02-12 04:37:52 Uploading - Uploading generated training model
2023-02-12 04:37:52 Completed - Resource retained for reuse


{'_tuning_objective_metric': '"average test loss"',
 'batch-size': '"256"',
 'epochs': '"3"',
 'lr': '"0.01"',
 'sagemaker_container_log_level': '20',
 'sagemaker_estimator_class_name': '"PyTorch"',
 'sagemaker_estimator_module': '"sagemaker.pytorch.estimator"',
 'sagemaker_job_name': '"pytorch-training-2023-02-12-02-31-40-303"',
 'sagemaker_program': '"hpo.py"',
 'sagemaker_region': '"us-east-1"',
 'sagemaker_submit_directory': '"s3://sagemaker-us-east-1-793615537614/pytorch-training-2023-02-12-02-31-40-303/source/sourcedir.tar.gz"'}
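
The tuned values in the output above are JSON-encoded strings, for example `'"256"'`. A minimal sketch of decoding them into plain strings follows; the `raw` dictionary is copied by hand from that output and the `decode` helper is hypothetical.

```python
import json

# Values copied from the output above: each one is a JSON-encoded string.
raw = {"batch-size": '"256"', "epochs": '"3"', "lr": '"0.01"'}


def decode(value):
    """Remove the JSON encoding and return the plain string."""
    return json.loads(value)


best = {name: decode(value) for name, value in raw.items()}
print(best)
```

In the notebook, the same idea could be applied to the relevant keys of `best_estimator.hyperparameters()` instead of a hand-copied dictionary.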

## Model Profiling and Debugging
This section uses the best hyperparameters from the tuning job to fine-tune a new model with debugging and profiling enabled.

**Note:** The `train_model.py` script is the entry point for model profiling and debugging.

In [14]:
# Set up debugging and profiling rules and hooks
rules = [
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.overfit()),
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
]
hook_config = DebuggerHookConfig(
    hook_parameters={'train.save_interval': "10", "eval.save_interval": "1000"})
profiler_config = ProfilerConfig(
    system_monitor_interval_millis=5000, framework_profile_params=FrameworkProfile(num_steps=100)
)

In [15]:
#Hyperparameters derived from previous tuning job
best_hyperparameters = {
    'batch-size': '256',
    'epochs': '3',
    'lr': '0.01',
}

estimator = PyTorch(
    entry_point='train_model.py',
    role=role,
    py_version='py38',
    framework_version='1.12',
    instance_count=1,
    instance_type='ml.m5.2xlarge',
    hyperparameters=best_hyperparameters,
    rules=rules,
    debugger_hook_config=hook_config,
    profiler_config=profiler_config,
)

In [12]:
inputs = 's3://sagemaker-us-east-1-793615537614/StanfordCars'
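
The S3 path above is hard-coded. As an alternative sketch, it can be rebuilt from the bucket name and the key prefix; the `s3_uri` helper is hypothetical and not part of the SageMaker SDK.

```python
def s3_uri(bucket: str, prefix: str) -> str:
    """Build an S3 URI from a bucket name and a key prefix."""
    return f"s3://{bucket}/{prefix.strip('/')}"


# Bucket name and prefix as they appear earlier in this notebook.
inputs = s3_uri("sagemaker-us-east-1-793615537614", "StanfordCars")
print(inputs)
```

In the notebook, the `bucket` and `prefix` variables defined earlier could be passed in place of the literal strings.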

In [18]:
estimator.fit({'training': inputs})

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: pytorch-training-2023-02-16-02-57-48-775


2023-02-16 02:57:49 Starting - Starting the training job...
2023-02-16 02:58:05 Starting - Preparing the instances for training
VanishingGradient: InProgress
Overfit: InProgress
ProfilerReport: InProgress
......
2023-02-16 02:59:16 Downloading - Downloading input data......
2023-02-16 03:00:21 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-02-16 03:00:12,824 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-02-16 03:00:12,826 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-02-16 03:00:12,828 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-02-16 03:00:12,838 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-02-16 03:00:12,

In [19]:
from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob

training_job_name = 'pytorch-training-2023-02-16-02-57-48-775'
tj = TrainingJob(training_job_name, region)
tj.wait_for_sys_profiling_data_to_be_available()

ProfilerConfig:{'S3OutputPath': 's3://sagemaker-us-east-1-793615537614/', 'ProfilingIntervalInMilliseconds': 5000, 'ProfilingParameters': {'DataloaderProfilingConfig': '{"StartStep": 0, "NumSteps": 100, "MetricsRegex": ".*", }', 'DetailedProfilingConfig': '{"StartStep": 0, "NumSteps": 100, }', 'FileOpenFailThreshold': '50', 'HorovodProfilingConfig': '{"StartStep": 0, "NumSteps": 100, }', 'LocalPath': '/opt/ml/output/profiler', 'PythonProfilingConfig': '{"StartStep": 0, "NumSteps": 100, "ProfilerName": "cprofile", "cProfileTimer": "total_time", }', 'RotateFileCloseIntervalInSeconds': '60', 'RotateMaxFileSizeInBytes': '10485760', 'SMDataParallelProfilingConfig': '{"StartStep": 0, "NumSteps": 100, }'}, 'DisableProfiler': False}
s3 path:s3://sagemaker-us-east-1-793615537614/pytorch-training-2023-02-16-02-57-48-775/profiler-output


Profiler data from system is available


In [20]:
from smdebug.profiler.analysis.notebook_utils.timeline_charts import TimelineCharts

system_metrics_reader = tj.get_systems_metrics_reader()
system_metrics_reader.refresh_event_file_list()

view_timeline_charts = TimelineCharts(
    system_metrics_reader,
    framework_metrics_reader=None,
    select_dimensions=["CPU"],
    select_events=["total"],
)

[2023-02-16 04:09:25.115 pytorch-1-12-cpu-py38-ml-t3-medium-f8be1a063b37f44eb7b009d8cbea:27 INFO metrics_reader_base.py:134] Getting 70 event files
select events:['total']
select dimensions:['CPU']
filtered_events:{'total'}
filtered_dimensions:{'CPUUtilization-nodeid:algo-1'}


The profiler analysis code above is taken from the [AWS SageMaker examples repository on GitHub](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-debugger/pytorch_profiling/pt-resnet-profiling-single-gpu-single-node.ipynb).

The next cell downloads the rule output and displays the profiler report, which is used to check for anomalous behaviour. In the rule summary that follows, only the Dataloader rule triggered (once, over 103 datapoints); its recommendation is to change the number of data loader processes.

In [21]:
rule_output_path = estimator.output_path + estimator.latest_training_job.job_name + "/rule-output"
! aws s3 ls {rule_output_path} --recursive
! aws s3 cp {rule_output_path} ./ --recursive
# get the autogenerated folder name of profiler report
profiler_report_name = [
    rule["RuleConfigurationName"]
    for rule in estimator.latest_training_job.rule_job_summary()
    if "Profiler" in rule["RuleConfigurationName"]
][0]
import IPython

IPython.display.HTML(filename=profiler_report_name + "/profiler-output/profiler-report.html")

2023-02-16 04:08:41     363500 pytorch-training-2023-02-16-02-57-48-775/rule-output/ProfilerReport/profiler-output/profiler-report.html
2023-02-16 04:08:41     209525 pytorch-training-2023-02-16-02-57-48-775/rule-output/ProfilerReport/profiler-output/profiler-report.ipynb
2023-02-16 04:08:37        191 pytorch-training-2023-02-16-02-57-48-775/rule-output/ProfilerReport/profiler-output/profiler-reports/BatchSize.json
2023-02-16 04:08:37        199 pytorch-training-2023-02-16-02-57-48-775/rule-output/ProfilerReport/profiler-output/profiler-reports/CPUBottleneck.json
2023-02-16 04:08:37       2072 pytorch-training-2023-02-16-02-57-48-775/rule-output/ProfilerReport/profiler-output/profiler-reports/Dataloader.json
2023-02-16 04:08:37        127 pytorch-training-2023-02-16-02-57-48-775/rule-output/ProfilerReport/profiler-output/profiler-reports/GPUMemoryIncrease.json
2023-02-16 04:08:37        198 pytorch-training-2023-02-16-02-57-48-775/rule-output/ProfilerReport/profiler-output/profiler-re

| Rule | Description | Recommendation | Number of times rule triggered | Number of datapoints | Rule parameters |
|---|---|---|---|---|---|
| Dataloader | Checks how many data loaders are running in parallel and whether the total number is equal the number of available CPU cores. The rule triggers if number is much smaller or larger than the number of available cores. If too small, it might lead to low GPU utilization. If too large, it might impact other compute intensive operations on CPU. | Change the number of data loader processes. | 1 | 103 | min_threshold:70, max_threshold:200 |
| BatchSize | Checks if GPUs are underutilized because the batch size is too small. To detect this problem, the rule analyzes the average GPU memory footprint, the CPU and the GPU utilization. | The batch size is too small, and GPUs are underutilized. Consider running on a smaller instance type or increasing the batch size. | 0 | 827 | cpu_threshold_p95:70, gpu_threshold_p95:70, gpu_memory_threshold_p95:70, patience:1000, window:500 |
| MaxInitializationTime | Checks if the time spent on initialization exceeds a threshold percent of the total training time. The rule waits until the first step of training loop starts. The initialization can take longer if downloading the entire dataset from Amazon S3 in File mode. The default threshold is 20 minutes. | Initialization takes too long. If using File mode, consider switching to Pipe mode in case you are using TensorFlow framework. | 0 | 203 | threshold:20 |
| LowGPUUtilization | Checks if the GPU utilization is low or fluctuating. This can happen due to bottlenecks, blocking calls for synchronizations, or a small batch size. | Check if there are bottlenecks, minimize blocking calls, change distributed training strategy, or increase the batch size. | 0 | 0 | threshold_p95:70, threshold_p5:10, window:500, patience:1000 |
| LoadBalancing | Detects workload balancing issues across GPUs. Workload imbalance can occur in training jobs with data parallelism. The gradients are accumulated on a primary GPU, and this GPU might be overused with regard to other GPUs, resulting in reducing the efficiency of data parallelization. | Choose a different distributed training strategy or a different distributed training framework. | 0 | 0 | threshold:0.2, patience:1000 |
| GPUMemoryIncrease | Measures the average GPU memory footprint and triggers if there is a large increase. | Choose a larger instance type with more memory if footprint is close to maximum available memory. | 0 | 0 | increase:5, patience:1000, window:10 |
| CPUBottleneck | Checks if the CPU utilization is high and the GPU utilization is low. It might indicate CPU bottlenecks, where the GPUs are waiting for data to arrive from the CPUs. The rule evaluates the CPU and GPU utilization rates, and triggers the issue if the time spent on the CPU bottlenecks exceeds a threshold percent of the total training time. The default threshold is 50 percent. | Consider increasing the number of data loaders or applying data pre-fetching. | 0 | 828 | threshold:50, cpu_threshold:90, gpu_threshold:10, patience:1000 |
| IOBottleneck | Checks if the data I/O wait time is high and the GPU utilization is low. It might indicate IO bottlenecks where GPU is waiting for data to arrive from storage. The rule evaluates the I/O and GPU utilization rates and triggers the issue if the time spent on the IO bottlenecks exceeds a threshold percent of the total training time. The default threshold is 50 percent. | Pre-fetch data or choose different file formats, such as binary formats that improve I/O performance. | 0 | 828 | threshold:50, io_threshold:50, gpu_threshold:10, patience:1000 |
| StepOutlier | Detects outliers in step duration. The step duration for forward and backward pass should be roughly the same throughout the training. If there are significant outliers, it may indicate a system stall or bottleneck issues. | Check if there are any bottlenecks (CPU, I/O) correlated to the step outliers. | 0 | 203 | threshold:3, mode:None, n_outliers:10, stddev:3 |
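
The Dataloader rule compares the number of data loader processes with the number of available CPU cores. The sketch below picks a worker count close to the core count; the `pick_num_workers` helper and its `reserve` parameter are hypothetical, and leaving one core free for the main process is an assumption rather than a rule taken from the report.

```python
import os


def pick_num_workers(reserve: int = 1) -> int:
    """Return a worker count close to the number of CPU cores.

    `reserve` cores are left free for the main training process
    (an assumption, not a recommendation from the profiler report).
    """
    cores = os.cpu_count() or 1
    return max(1, cores - reserve)


num_workers = pick_num_workers()
print(num_workers)

# In the training script this value would be passed to the loader, e.g.:
# DataLoader(dataset, batch_size=256, shuffle=True, num_workers=num_workers)
```

Whether this removes the rule trigger depends on the instance type and on how `train_model.py` builds its data loaders, which is not shown here.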

| Metric | mean | max | p99 | p95 | p50 | min |
|---|---|---|---|---|---|---|
| Step Durations in [s] | 20.84 | 24.45 | 22.8 | 21.56 | 20.86 | 16.99 |
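
As a rough sanity check, the mean step duration can be combined with the number of recorded steps to estimate the total time spent in training steps. This sketch assumes that each of the 203 datapoints reported for the StepOutlier rule corresponds to one training step, which the report does not state explicitly.

```python
# Figures read from the profiler output above.
mean_step_seconds = 20.84  # mean step duration
steps = 203                # datapoints recorded for the StepOutlier rule

# Assumption: each StepOutlier datapoint corresponds to one training step.
total_seconds = mean_step_seconds * steps
total_minutes = total_seconds / 60
print(f"{total_seconds:.0f} s, about {total_minutes:.1f} min")
```

The estimate is a little over 70 minutes, which is roughly in line with the interval between the start of training (around 03:00) and the timestamps on the rule output files (04:08).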


## Model Deploying

In [22]:
# Deploy the trained model to a single ml.m5.xlarge endpoint instance
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m5.xlarge')

INFO:sagemaker:Creating model with name: pytorch-training-2023-02-16-04-26-45-378
INFO:sagemaker:Creating endpoint-config with name pytorch-training-2023-02-16-04-26-45-378
INFO:sagemaker:Creating endpoint with name pytorch-training-2023-02-16-04-26-45-378



In [26]:
image = image.unsqueeze(0)

In [27]:
image.size()

torch.Size([1, 3, 244, 244])

In [28]:
# Run a prediction on the endpoint using a randomly chosen image
from torch import randint
sample_idx = randint(len(image_dataset), size=(1,)).item()
image, label = image_dataset[sample_idx]

# Add a leading batch dimension: the model expects a 4-D tensor (batch_size, channels, height, width)
image = image.unsqueeze(0)

response = predictor.predict(image)
print(response)

[[-14.9828558   -7.18462467  -4.1730566   -5.8168602   -5.31557989
   -7.3323431   -6.88033867  -7.58815765  -4.94617033  -8.56749153
   -9.08531189  -8.98237133  -7.94167328  -6.97027397  -6.70899677
  -15.6803503  -10.7051239  -11.07425785  -6.57576323  -4.28439665
   -9.47986317  -9.79704762  -6.83502531  -5.79636717  -7.80280495
   -5.9516077   -7.98729849  -4.52152538  -7.43041515  -6.93487883
   -8.38533497  -7.17361832  -6.04325294  -5.35902405  -3.99037361
   -7.24512577  -6.12412453  -7.27836609  -8.14006805 -10.97202682
  -12.03221703  -7.38398266  -8.29374123  -6.69008398 -10.68607426
   -7.84497547  -6.72831869  -5.34897947  -4.22060251  -4.07202482
   -5.92886066  -4.99185324  -8.9110899  -11.02114868  -8.00544071
   -7.62163877  -9.44877625  -4.6481452  -11.69565868  -7.2536602
   -4.77673578  -9.08067513  -6.18085051  -9.70236397  -7.84622192
   -7.28109455  -7.45898056  -8.32450485 -10.25645542 -12.39648533
   -9.82196808  -5.56873131  -8.40819263 -10.80032063  -9.43325

In [31]:
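# Sort by score, highest first. The second element of each pair is a raw model score, not a probability.
labeled_predictions.sort(key=lambda label_and_score: label_and_score[1], reverse=True)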
print("Most likely answer: {}".format(labeled_predictions[0]))

Most likely answer: (116, 1.7885642051696777)
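
The cell that builds `labeled_predictions` is not shown. A minimal sketch, assuming it pairs each class index with its raw score from the response, could look like the following; the `label_scores` helper and the example scores are hypothetical.

```python
def label_scores(scores):
    """Pair each class index with its raw score: [(index, score), ...]."""
    return list(enumerate(scores))


# Hypothetical scores for a 4-class model; a real response has one score per class.
example_scores = [-4.2, 1.8, -0.3, 0.9]

labeled = label_scores(example_scores)

# Highest score first, so the first pair is the predicted class.
labeled.sort(key=lambda pair: pair[1], reverse=True)
print("Most likely answer: {}".format(labeled[0]))
```

In the notebook, the same helper would presumably be applied to the first row of the endpoint response, for example `label_scores(response[0])`.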


In [33]:
print(f'The correct label for this image was: {label}. The predicted label was: {labeled_predictions[0][0]}')

The correct label for this image was: 116. The predicted label was: 116


In [30]:
# Delete the endpoint once the work is done so it stops incurring charges
predictor.delete_endpoint()

INFO:sagemaker:Deleting endpoint configuration with name: pytorch-training-2023-02-16-04-26-45-378
INFO:sagemaker:Deleting endpoint with name: pytorch-training-2023-02-16-04-26-45-378
