# TODO: Title

This notebook lists all the steps that you need to complete this project. You will need to complete all the TODOs in this notebook as well as in the README and the two Python scripts included with the starter code.


**TODO**: Give a helpful introduction to what this notebook is for. Remember that comments, explanations and good documentation make your project informative and professional.

**Note:** This notebook has a number of code and markdown cells with TODOs that you have to complete. These are meant to be helpful guidelines for you to finish your project while meeting the requirements in the project rubric. Feel free to change the order of the TODOs and use more than one TODO code cell to do all your tasks.

# Environment

## Install modules

In [None]:
# Install packages for Debugging and Profiling on SageMaker
# The latest 'smdebug' release may raise errors, so pin a known-good version.
# Reference: https://pypi.org/project/smdebug/#history
!pip install -U torch torchvision
!pip install -U smdebug==1.0.3
!pip install -U seaborn plotly opencv-python shap imageio bokeh
!pip install -U sagemaker

## Pre-Config

Configure AWS credentials in one of three ways: set environment variables, edit `~/.aws/credentials`, or run `aws configure`.
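boto3 and the AWS CLI check environment variables before the shared credentials file (other sources exist as well). The sketch below reports which of these two sources is present. It takes the environment as a plain mapping so it can be tried without touching real credentials; the function name `credential_source` is illustrative and not part of any AWS library.

```python
import os


def credential_source(env, credentials_file):
    """Report which of the two credential sources mentioned above is available."""
    # Environment variables take precedence over the shared credentials file.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "environment"
    if os.path.isfile(credentials_file):
        return "shared credentials file"
    return "none"


print(credential_source(os.environ, os.path.expanduser("~/.aws/credentials")))
```

If this prints `none`, running `aws configure` is probably the quickest way to create the credentials file.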

In [3]:
# Import any packages that you might need
import boto3
from sagemaker.pytorch import PyTorch
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)
from sagemaker.debugger import (
    Rule, ProfilerRule, rule_configs,
    DebuggerHookConfig, CollectionConfig,
    ProfilerConfig, FrameworkProfile, 
)

from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob
from smdebug.profiler.analysis.notebook_utils.timeline_charts import TimelineCharts
from smdebug.trials import create_trial
from smdebug.core.modes import ModeKeys

from torchvision import datasets, transforms
from torch.utils.data import DataLoader

import random
import os

from PIL import Image
from IPython.display import display

sagemaker.config INFO - Not applying SDK defaults from location: C:\ProgramData\sagemaker\sagemaker\config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: C:\Users\tranh\AppData\Local\sagemaker\sagemaker\config.yaml
[2024-05-28 22:11:25.515 MYSHITDESKTOP:26356 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None


In [4]:
# Initial Settings
role = "arn:aws:iam::028609567580:role/project03-khangtictoc"
region = "us-east-1"

bucket = "project03-khangtictoc"
prefix = "dataset"

# debugger_s3_output = "s3://{}/debugger-output".format(bucket)
# profiler_s3_output = "s3://{}/profiler-output".format(bucket)
local_dataset_path = "./dataset/dogImages"

image_name = "resnet50-training-job"
ecr_name = "public.ecr.aws/q1p8o7w7/resnet50-training-job:latest"


In [5]:
# Create channel for data input's location
train_loc = "s3://project03-khangtictoc/dataset/train"
validation_loc = "s3://project03-khangtictoc/dataset/valid"
test_loc = "s3://project03-khangtictoc/dataset/test"

channels = {
    "training": train_loc,
    "validation": validation_loc,
    "testing": test_loc
}
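The channel URIs above repeat the bucket name and prefix already stored in `bucket` and `prefix`. A minimal sketch of deriving them instead, assuming the same three split folders (`train`, `valid`, `test`); the helper name `build_channels` is illustrative and not part of the SageMaker SDK.

```python
def build_channels(bucket: str, prefix: str) -> dict:
    """Map SageMaker channel names to S3 URIs under s3://<bucket>/<prefix>/."""
    # Channel name -> folder name used in the dataset layout above.
    splits = {"training": "train", "validation": "valid", "testing": "test"}
    return {name: f"s3://{bucket}/{prefix}/{folder}" for name, folder in splits.items()}


channels = build_channels("project03-khangtictoc", "dataset")
print(channels["training"])
```

Building the URIs from one bucket and one prefix means a renamed bucket only has to be changed in the settings cell.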

## Dataset
TODO: Explain what dataset you are using for this project. Maybe even give a small overview of the classes, class distributions, etc. that can help anyone not familiar with the dataset get a better understanding of it.

### Download and sync to S3

In [25]:
!aws s3api create-bucket --bucket project03-khangtictoc --region us-east-1


An error occurred (OperationAborted) when calling the CreateBucket operation: A conflicting conditional operation is currently in progress against this resource. Please try again.


In [None]:
# Fetch and upload the data to AWS S3

# Command to download and unzip data
!wget https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip
!unzip dogImages.zip

In [24]:
#!aws s3api create-bucket --bucket %DEFAULT_S3_BUCKET%

# Windows
!aws s3 sync ./dataset/dogImages/train/ s3://{bucket}/dataset/train/
!aws s3 sync ./dataset/dogImages/test/ s3://{bucket}/dataset/test/
!aws s3 sync ./dataset/dogImages/valid/ s3://{bucket}/dataset/valid/

# Linux
#!aws s3 sync ./dataset/train s3://${DEFAULT_S3_BUCKET}/train/
#!aws s3 sync ./dataset/test s3://${DEFAULT_S3_BUCKET}/test/

fatal error: An error occurred (NoSuchBucket) when calling the ListObjectsV2 operation: The specified bucket does not exist
fatal error: An error occurred (NoSuchBucket) when calling the ListObjectsV2 operation: The specified bucket does not exist
fatal error: An error occurred (NoSuchBucket) when calling the ListObjectsV2 operation: The specified bucket does not exist


### Data Loader (Optional)

Use the local workspace to inspect the data.

Define the image transforms.

In [2]:
# Name the pipeline `data_transforms` so it does not shadow the `transforms` module
data_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.Resize((224, 224)),  # fixed height and width so images can be batched
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])

In [3]:
trainset = datasets.ImageFolder("./dataset/dogImages/train", transform=data_transforms)
valset = datasets.ImageFolder("./dataset/dogImages/valid", transform=data_transforms)
testset = datasets.ImageFolder("./dataset/dogImages/test", transform=data_transforms)
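`transforms.Normalize` applies `(x - mean) / std` per channel to a tensor whose values `ToTensor` has already scaled to the range [0, 1]. A small worked example in plain Python, using the ImageNet statistics from the transform cell and a made-up pixel value:

```python
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb):
    """Apply (x - mean) / std to one RGB pixel with channel values in [0, 1]."""
    return [(x - m) / s for x, m, s in zip(rgb, MEAN, STD)]


# Hypothetical mid-gray pixel (128/255 in every channel).
pixel = [128 / 255] * 3
print([round(v, 3) for v in normalize_pixel(pixel)])
```

A pixel equal to the channel means maps to zero, and a pixel one standard deviation above the mean maps to one.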

In [4]:
train_loader = DataLoader(trainset, batch_size=32, shuffle=True)
val_loader = DataLoader(valset, batch_size=32, shuffle=False)  # evaluation does not need shuffling
test_loader = DataLoader(testset, batch_size=32, shuffle=False)

In [5]:
# Show labels
train_loader.dataset.classes[0:5]

['001.Affenpinscher',
 '002.Afghan_hound',
 '003.Airedale_terrier',
 '004.Akita',
 '005.Alaskan_malamute']

In [6]:
# View some samples and labels
random_img = random.sample(train_loader.dataset.imgs, 5)
random_img

[('./dataset/dogImages/train\\129.Tibetan_mastiff\\Tibetan_mastiff_08184.jpg',
  128),
 ('./dataset/dogImages/train\\025.Black_and_tan_coonhound\\Black_and_tan_coonhound_01781.jpg',
  24),
 ('./dataset/dogImages/train\\091.Japanese_chin\\Japanese_chin_06201.jpg', 90),
 ('./dataset/dogImages/train\\104.Miniature_schnauzer\\Miniature_schnauzer_06887.jpg',
  103),
 ('./dataset/dogImages/train\\094.Komondor\\Komondor_06353.jpg', 93)]

In [7]:
# File extensions
train_loader.dataset.extensions

('.jpg', '.jpeg', '.png', '.ppm', '.bmp', '.pgm', '.tif', '.tiff', '.webp')

In [8]:
# Number of labels
print("Number of labels: %d" % len(train_loader.dataset.classes))

# Number of samples
print("Number of samples: %d" % len(train_loader.dataset.imgs))

Number of labels: 133
Number of samples: 6680
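`ImageFolder` exposes its samples as `(path, class_index)` pairs, so per-class counts can be tallied directly from `trainset.imgs`. A minimal sketch, using a few stand-in pairs in place of the real list (paths shortened, indices taken from the sample output above, plus one made-up entry):

```python
from collections import Counter


def class_distribution(samples):
    """Count how many samples belong to each class index."""
    return Counter(label for _path, label in samples)


# Stand-in for trainset.imgs: (path, class_index) pairs.
samples = [
    ("Tibetan_mastiff_08184.jpg", 128),
    ("Black_and_tan_coonhound_01781.jpg", 24),
    ("Japanese_chin_06201.jpg", 90),
    ("Komondor_06353.jpg", 93),
    ("Komondor_hypothetical.jpg", 93),  # made-up extra file to show a count of 2
]
counts = class_distribution(samples)
print(counts.most_common(1))
```

Passing `trainset.imgs` instead of the stand-in list would give the distribution over all 133 classes.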


# Hyperparameter Tuning

TODO: This is the part where you will fine-tune a pretrained model with hyperparameter tuning. Remember that you have to tune a minimum of two hyperparameters. However, you are encouraged to tune more. You are also encouraged to explain why you chose to tune those particular hyperparameters and the ranges.

Note: You will need to use the hpo.py script to perform hyperparameter tuning.

In [4]:
# Declare HP ranges, metrics etc.

hyperparameter_ranges = {
    "lr": ContinuousParameter(0.001, 0.1),
    "batch-size": CategoricalParameter([16, 32, 64, 128, 256, 512]),
    "epochs": IntegerParameter(10, 20)
}

objective_metric_name = "average test loss"
objective_type = "Minimize"
metric_definitions = [{"Name": "average test loss", "Regex": "Test set: Average loss: ([0-9\\.]+)"}]
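SageMaker applies each `Regex` in `metric_definitions` to the training job's log output and records the captured group as the metric value. The pattern can be sanity-checked locally with `re` before launching a tuning job. The log line below is an assumed example of what `hpo.py` might print, not copied from a real run.

```python
import re

METRIC_REGEX = "Test set: Average loss: ([0-9\\.]+)"

# Assumed log line; the real format depends on what hpo.py prints.
log_line = "Test set: Average loss: 0.4821, Accuracy: 712/836 (85%)"

match = re.search(METRIC_REGEX, log_line)
loss = float(match.group(1)) if match else None
print(loss)
```

If `loss` comes back as `None`, the tuner would have no objective value to compare, so it is worth checking the pattern against the script's actual print statement.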

In [6]:
# Create estimators for HPs

estimator = PyTorch(
    role=role,
    instance_count=1,
    instance_type="ml.m5.4xlarge",
    source_dir="scripts",
    entry_point="hpo.py",
    framework_version="2.2",
    py_version="py310",
)

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    max_jobs=4,
    max_parallel_jobs=1,
    objective_type=objective_type,
)

In [7]:
# Fit HP Tuner
tuner.fit(inputs=channels, wait=True) # Include data channels

No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config
No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config



In [9]:
# Get the best estimators and the best HPs
best_estimator = tuner.best_estimator()


2024-05-27 23:35:14 Starting - Found matching resource for reuse
2024-05-27 23:35:14 Downloading - Downloading the training image
2024-05-27 23:35:14 Training - Training image download completed. Training in progress.
2024-05-27 23:35:14 Uploading - Uploading generated training model
2024-05-27 23:35:14 Completed - Resource released due to keep alive period expiry


In [10]:
# Get the hyperparameters of the best trained model
best_estimator.hyperparameters()

{'_tuning_objective_metric': '"average test loss"',
 'batch-size': '"256"',
 'epochs': '16',
 'lr': '0.004909233792902113',
 'sagemaker_container_log_level': '20',
 'sagemaker_estimator_class_name': '"PyTorch"',
 'sagemaker_estimator_module': '"sagemaker.pytorch.estimator"',
 'sagemaker_job_name': '"pytorch-training-2024-05-27-17-21-30-757"',
 'sagemaker_program': '"hpo.py"',
 'sagemaker_region': '"us-east-1"',
 'sagemaker_submit_directory': '"s3://sagemaker-us-east-1-028609567580/pytorch-training-2024-05-27-17-21-30-757/source/sourcedir.tar.gz"'}

In [11]:
# Create hyperparameter dict for the best model for later use
best_hp_raw = best_estimator.hyperparameters()
best_hyperparameters = {
    "lr": float(best_hp_raw['lr']),
    "batch-size": int(best_hp_raw['batch-size'].replace('"', '')),
    "epochs": int(best_hp_raw['epochs'])
}
best_hyperparameters

{'lr': 0.004909233792902113, 'batch-size': 256, 'epochs': 16}
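The values returned by `best_estimator.hyperparameters()` are JSON-encoded strings, which is why `batch-size` arrives as `'"256"'`. One generic way to decode them is `json.loads` followed by a cast. This is a sketch using three of the values from the hyperparameter dictionary printed earlier; the helper name `decode_hyperparameters` is illustrative.

```python
import json


def decode_hyperparameters(raw, casts):
    """Decode JSON-encoded hyperparameter strings and cast them to the given types."""
    return {name: cast(json.loads(raw[name])) for name, cast in casts.items()}


# Subset of the dictionary returned by best_estimator.hyperparameters().
raw = {"batch-size": '"256"', "epochs": "16", "lr": "0.004909233792902113"}
best = decode_hyperparameters(raw, {"lr": float, "batch-size": int, "epochs": int})
print(best)
```

Decoding with `json.loads` avoids stripping quote characters by hand and handles quoted and unquoted values the same way.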

# Model Profiling and Debugging
TODO: Using the best hyperparameters, create and finetune a new model

Note: You will need to use the train_model.py script to perform model profiling and debugging.

## Model Training

In [67]:
collection_configs=[
    CollectionConfig(
        name="CrossEntropyLoss_output_0",
        parameters={
            "include_regex": "CrossEntropyLoss_output_0",
            "train.save_interval": "1",
            "eval.save_interval": "1"
        }
    )
]

In [68]:
# Configure Debugger

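debugger_hook_config = DebuggerHookConfig(
    #s3_output_path=debugger_s3_output,
    hook_parameters={
        "train.save_interval": "100",
        "eval.save_interval": "10"
    },
    # Attach the loss collection defined above; without this it is never used
    collection_configs=collection_configs,
)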
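With `train.save_interval` set to 100, the hook stores tensors only on a subset of training steps rather than on all of them. A rough illustration in plain Python, assuming tensors are saved on steps whose number is a multiple of the interval (the function name `saved_steps` is made up for this sketch):

```python
def saved_steps(total_steps, save_interval):
    """Steps on which tensors would be saved, assuming multiples of the interval."""
    return [step for step in range(total_steps) if step % save_interval == 0]


print(saved_steps(450, 100))
```

A smaller interval gives a finer-grained loss curve at the cost of more data written to S3.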

In [69]:
# Configure Profiler

profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,
    framework_profile_params=FrameworkProfile(num_steps=10),
    #s3_output_path=profiler_s3_output
)

See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


In [70]:
# Set up the rules for debugging

rules = [
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
    ProfilerRule.sagemaker(rule_configs.LowGPUUtilization()),
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.overtraining()),
    Rule.sagemaker(rule_configs.poor_weight_initialization()),
]

In [71]:
# Create and fit an estimator

estimator = PyTorch(
    role=role,

    instance_count=1,
    instance_type="ml.c5.4xlarge",

    source_dir="scripts",
    entry_point="train_model.py",
    framework_version="1.8",
    py_version="py36",
    hyperparameters={'lr': '0.004909233792902113', 'batch-size': 256, 'epochs': 16},

    rules=rules,
    debugger_hook_config=debugger_hook_config,
    profiler_config=profiler_config
)

estimator.fit(inputs=channels, wait=False)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.

## Plot a debugging output.

In [None]:
!aws s3 sync  s3://sagemaker-us-east-1-028609567580/pytorch-training-2024-05-28-08-35-06-969/ ./training-job/

In [6]:
training_job_name = "pytorch-training-2024-05-28-08-35-06-969" #estimator.latest_training_job.name
print(f"Training jobname: {training_job_name}")
print(f"Region: {region}")
tj = TrainingJob(training_job_name, region)

Training jobname: pytorch-training-2024-05-28-08-35-06-969
Region: us-east-1
ProfilerConfig:{'S3OutputPath': 's3://sagemaker-us-east-1-028609567580/', 'ProfilingIntervalInMilliseconds': 500, 'ProfilingParameters': {'DataloaderProfilingConfig': '{"StartStep": 0, "NumSteps": 10, "MetricsRegex": ".*", }', 'DetailedProfilingConfig': '{"StartStep": 0, "NumSteps": 10, }', 'FileOpenFailThreshold': '50', 'HorovodProfilingConfig': '{"StartStep": 0, "NumSteps": 10, }', 'LocalPath': '/opt/ml/output/profiler', 'PythonProfilingConfig': '{"StartStep": 0, "NumSteps": 10, "ProfilerName": "cprofile", "cProfileTimer": "total_time", }', 'RotateFileCloseIntervalInSeconds': '60', 'RotateMaxFileSizeInBytes': '10485760', 'SMDataParallelProfilingConfig': '{"StartStep": 0, "NumSteps": 10, }'}, 'DisableProfiler': False}
s3 path:s3://sagemaker-us-east-1-028609567580/pytorch-training-2024-05-28-08-35-06-969\profiler-output


In [8]:
# TrainingJob and TimelineCharts are already imported at the top of the notebook
tj = TrainingJob(training_job_name, region)
tj.wait_for_sys_profiling_data_to_be_available()

system_metrics_reader = tj.get_systems_metrics_reader()
system_metrics_reader.refresh_event_file_list()

view_timeline_charts = TimelineCharts(
    system_metrics_reader,
    framework_metrics_reader=None,
    select_dimensions=["CPU", "GPU"],
    select_events=["total"],
)

ProfilerConfig:{'S3OutputPath': 's3://sagemaker-us-east-1-028609567580/', 'ProfilingIntervalInMilliseconds': 500, 'ProfilingParameters': {'DataloaderProfilingConfig': '{"StartStep": 0, "NumSteps": 10, "MetricsRegex": ".*", }', 'DetailedProfilingConfig': '{"StartStep": 0, "NumSteps": 10, }', 'FileOpenFailThreshold': '50', 'HorovodProfilingConfig': '{"StartStep": 0, "NumSteps": 10, }', 'LocalPath': '/opt/ml/output/profiler', 'PythonProfilingConfig': '{"StartStep": 0, "NumSteps": 10, "ProfilerName": "cprofile", "cProfileTimer": "total_time", }', 'RotateFileCloseIntervalInSeconds': '60', 'RotateMaxFileSizeInBytes': '10485760', 'SMDataParallelProfilingConfig': '{"StartStep": 0, "NumSteps": 10, }'}, 'DisableProfiler': False}
s3 path:s3://sagemaker-us-east-1-028609567580/pytorch-training-2024-05-28-08-35-06-969\profiler-output
Profiler data from system not available yet
time: 1716909713.464636 TrainingJobStatus:Completed TrainingJobSecondaryStatus:Completed
Profiler data from system not avail

In [None]:
trial = create_trial("./training-job")


In [97]:
system_metrics_reader = tj.get_systems_metrics_reader()
system_metrics_reader

In [None]:
print(trial.tensor_names())

In [None]:
print(len(trial.tensor("CrossEntropyLoss_output_0").steps(mode=ModeKeys.TRAIN)))
print(len(trial.tensor("CrossEntropyLoss_output_0").steps(mode=ModeKeys.EVAL)))

In [None]:
tj = TrainingJob(training_job_name, region)
tj.wait_for_sys_profiling_data_to_be_available()


In [None]:
system_metrics_reader = tj.get_systems_metrics_reader()
system_metrics_reader.refresh_event_file_list()

In [None]:
view_timeline_charts = TimelineCharts(
    system_metrics_reader,
    framework_metrics_reader=None,
    select_dimensions=["CPU", "GPU"],
    select_events=["total"],
)

In [None]:
rule_output_path = estimator.output_path + estimator.latest_training_job.job_name + "/rule-output"
print(f"You will find the profiler report in {rule_output_path}")
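S3 URIs always use forward slashes. The `s3 path:` log lines earlier in this section contain a backslash before `profiler-output`, most likely because the path was joined with the Windows path separator. A minimal sketch of a join helper that does not depend on the operating system; the name `s3_join` is illustrative.

```python
def s3_join(base, *parts):
    """Join S3 URI components with single forward slashes."""
    pieces = [base.rstrip("/")] + [p.strip("/") for p in parts]
    return "/".join(pieces)


rule_output_path = s3_join(
    "s3://sagemaker-us-east-1-028609567580/",
    "pytorch-training-2024-05-28-08-35-06-969",
    "rule-output",
)
print(rule_output_path)
```

The helper also tolerates a base with or without a trailing slash, which plain string concatenation does not.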

**TODO**: Is there some anomalous behaviour in your debugging output? If so, what is the error and how will you fix it?  
**TODO**: If not, suppose there was an error. What would that error look like and how would you have fixed it?

## Display the profiler output

## Model Deploying

In [None]:
# Deploy your model to an endpoint

predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.m4.xlarge'
)

In [None]:
# Create dictionary for index-to-breed mapping

test_path = os.path.join(local_dataset_path, "test")

# Sort the folder names so the indices match ImageFolder's class order
label_breed_mapping = {k: v.split(".")[1] for k, v in enumerate(sorted(os.listdir(test_path)))}
label_breed_mapping


{0: 'Affenpinscher',
 1: 'Afghan_hound',
 2: 'Airedale_terrier',
 3: 'Akita',
 4: 'Alaskan_malamute',
 5: 'American_eskimo_dog',
 6: 'American_foxhound',
 7: 'American_staffordshire_terrier',
 8: 'American_water_spaniel',
 9: 'Anatolian_shepherd_dog',
 10: 'Australian_cattle_dog',
 11: 'Australian_shepherd',
 12: 'Australian_terrier',
 13: 'Basenji',
 14: 'Basset_hound',
 15: 'Beagle',
 16: 'Bearded_collie',
 17: 'Beauceron',
 18: 'Bedlington_terrier',
 19: 'Belgian_malinois',
 20: 'Belgian_sheepdog',
 21: 'Belgian_tervuren',
 22: 'Bernese_mountain_dog',
 23: 'Bichon_frise',
 24: 'Black_and_tan_coonhound',
 25: 'Black_russian_terrier',
 26: 'Bloodhound',
 27: 'Bluetick_coonhound',
 28: 'Border_collie',
 29: 'Border_terrier',
 30: 'Borzoi',
 31: 'Boston_terrier',
 32: 'Bouvier_des_flandres',
 33: 'Boxer',
 34: 'Boykin_spaniel',
 35: 'Briard',
 36: 'Brittany',
 37: 'Brussels_griffon',
 38: 'Bull_terrier',
 39: 'Bulldog',
 40: 'Bullmastiff',
 41: 'Cairn_terrier',
 42: 'Canaan_dog',
 43: '

In [None]:
# Run a prediction on the endpoint
from sagemaker.serializers import IdentitySerializer

predictor.serializer = IdentitySerializer("image/png")

test_folder = "012.Australian_shepherd"
test_image = "Australian_shepherd_00830.jpg"

with open(os.path.join(test_path, test_folder, test_image), "rb") as f:
    image = f.read()
response = predictor.predict(image)

# Folder numbers are 1-based; class indices are 0-based
print("Expected label: \"{}\" with index of {}".format(
      test_folder.split(".")[1],
      int(test_folder.split(".")[0]) - 1))
# Assumes `response` is already the predicted class index
print("Predicted label: \"{}\" with index of {}".format(
    label_breed_mapping[response],
    response
))
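How the response has to be decoded depends on what the endpoint's inference code returns. The sketch below assumes it returns one score per class, in which case the predicted class is the index of the largest score. The scores and the three-entry mapping are made up for illustration.

```python
def predicted_label(scores, index_to_breed):
    """Return (index, breed) for the highest-scoring class."""
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, index_to_breed[best_index]


# Hypothetical scores for a three-class slice of the mapping.
index_to_breed = {0: "Affenpinscher", 1: "Afghan_hound", 2: "Airedale_terrier"}
scores = [-1.2, 3.4, 0.7]
print(predicted_label(scores, index_to_breed))
```

If the endpoint returns raw bytes or JSON, the response would first need to be deserialized into such a list.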

In [None]:
# (IMPORTANT) Shut down/delete your endpoint once your work is done
predictor.delete_endpoint()

# Package your Model

Set up remote registry

In [None]:
ecr_client = boto3.client("ecr")
ecr_response = ecr_client.create_repository(repositoryName="only-test")
ecr_response

In [None]:
# Create ECR Repository if not exists
!aws ecr-public create-repository --repository-name {image_name}

{
    "repository": {
        "repositoryArn": "arn:aws:ecr-public::008748484958:repository/resnet50-training-job",
        "registryId": "008748484958",
        "repositoryName": "resnet50-training-job",
        "repositoryUri": "public.ecr.aws/q1p8o7w7/resnet50-training-job",
        "createdAt": "2024-05-24T15:42:19.395000+07:00"
    },
    "catalogData": {}
}


In [5]:
# Authenticate AWS ECR
# Push command on Console
#!aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
!aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q1p8o7w7

Login Succeeded


https://docs.docker.com/engine/reference/commandline/login/#credentials-store



Upload image to registry

In [None]:
# Package training model to Docker Image
!docker build -t {image_name} .

In [12]:
# Re-tag images to remote ECR registry
!docker tag {image_name}:latest public.ecr.aws/q1p8o7w7/{image_name}:latest
# Push image to remote AWS ECR
!docker push public.ecr.aws/q1p8o7w7/{image_name}:latest

The push refers to repository [public.ecr.aws/q1p8o7w7/resnet50-training-job]
bbe00fe64b55: Preparing
9501e9fcc404: Preparing
1c4281c1b467: Preparing
8f3dd49b5d19: Preparing
555315c92a50: Preparing
3be02ac4f30b: Preparing
4a36b0c85768: Preparing
6addedefeb30: Preparing
f21b8e6382b2: Preparing
3be02ac4f30b: Waiting
4a36b0c85768: Waiting
6addedefeb30: Waiting
f21b8e6382b2: Waiting
a1c8c36e3146: Preparing
a1c8c36e3146: Waiting
88c6b56a015e: Preparing
d4c2f7022d7b: Preparing
88c6b56a015e: Waiting
d4c2f7022d7b: Waiting
37d54eb19c68: Preparing
e7c49386baa5: Preparing
37d54eb19c68: Waiting
e7c49386baa5: Waiting
6723bb391f86: Preparing
2b77c1837fdd: Preparing
c308ea0f2bb6: Preparing
91e1df25c605: Preparing
72b51d6d1d55: Preparing
90dfc325730b: Preparing
ef7f8b990cf9: Preparing
454df595b059: Preparing
078266e4cb4c: Preparing
fb99a34a5b47: Preparing
6723bb391f86: Waiting
454df595b059: Waiting
2b77c1837fdd: Waiting
91e1df25c605: Waiting
72b51d6d1d55: Waiting
90dfc325730b: Waiting
c308ea0f2bb6: Wa