* notebook created by nov05 on 2024-12-01  
* local conda env [`awsmle_py310`](https://gist.github.com/nov05/d9c3be6c2ab9f6c050e3d988830db08b) (no cuda)    

---   

* https://sagemaker.readthedocs.io/en/v2.34.0/frameworks/pytorch/sagemaker.pytorch.html   
* https://docs.wandb.ai/guides/integrations/sagemaker/  

In [None]:
# TODO: Install any packages that you might need
# !pip install smdebug

In [None]:
## windows cmd to launch notepad to edit aws credential file
!notepad C:\Users\guido\.aws\credentials

In [13]:
## reset the session after updating credentials
import boto3 # type: ignore
boto3.DEFAULT_SESSION = None
import sagemaker # type: ignore
from sagemaker import get_execution_role # type: ignore

role_arn = get_execution_role()  ## get role ARN
if 'AmazonSageMaker-ExecutionRole' not in role_arn:
    ## your own role here
    role_arn = "arn:aws:iam::061096721307:role/service-role/AmazonSageMaker-ExecutionRole-20241128T055392"
print("Role ARN:", role_arn) ## If local, Role ARN: arn:aws:iam::807711953667:role/voclabs
session = sagemaker.Session()
region = session.boto_region_name
bucket = session.default_bucket()
print("AWS Region: {}".format(region))
print("Default Bucket: {}".format(bucket))
print("Role Arn: {}".format(role_arn))

import wandb
## generate secrets.env; remember to add it to .gitignore  
wandb.sagemaker_auth(path="scripts")  

Role ARN: arn:aws:iam::061096721307:role/service-role/AmazonSageMaker-ExecutionRole-20241128T055392


AWS Region: us-east-1
Default Bucket: sagemaker-us-east-1-061096721307
Role Arn: arn:aws:iam::061096721307:role/service-role/AmazonSageMaker-ExecutionRole-20241128T055392


In [15]:
from sagemaker.pytorch import PyTorch
from sagemaker.inputs import TrainingInput
from datetime import datetime
## Moving the 1.1GB data from one bucket to another takes about 1 hour.
## This is roughly the same amount of time as uploading the data from a local machine to S3.
data_base_path = "s3://p3-dog-breed-classification/dogImages/"
train_data = TrainingInput(data_base_path+"train/", content_type="image/jpeg")
val_data = TrainingInput(data_base_path+"valid/", content_type="image/jpeg")
test_data = TrainingInput(data_base_path+"test/", content_type="image/jpeg")
output_path = "s3://p3-dog-breed-classification/jobs/"
hyperparameters = {
    # 'epochs': 40,  # Define how many epochs you want to train for
    # 'batch-size': 64,  ## ⚠️ this probably needs to be small for smaller training dataset?
    # 'opt-learning-rate': 1e-5,  ## optimizer lr. ⚠️ keep it small for pre-trained model
    # 'opt-weight-decay': 1e-4, ## optimizer weight decay
    'model-name': 'resnet50',  # Specify the ResNet model you want to use
}
# Define the PyTorch estimator
# estimator = PyTorch(
#     entry_point='train.py',  # Your training script that defines the ResNet50 model and training loop
#     source_dir='scripts',  # Directory where your script and dependencies are stored
#     role=role_arn,
#     framework_version='1.13.1',  # Use the PyTorch version you need
#     py_version='py39',
#     instance_count=1,  # Adjust based on the number of instances you want to use
#     # instance_type='ml.p3.2xlarge',  # 16GB, Use GPU instances for deep learning
#     instance_type='ml.g4dn.xlarge',  ## 16GB
#     output_path=output_path,
#     hyperparameters=hyperparameters,
#     # use_spot_instances=True,
# )

In [16]:
# %%time
# # Fit the estimator with the input channels (train, val)
# estimator.fit(
#     wait=True,  
#     job_name=f"p3-dog-breeds-job-{datetime.now().strftime('%Y%m%d-%H%M%S')}",  
#     inputs={
#         "train": train_data, 
#         "validation": val_data, 
#         "test": test_data,
#     },  
# )

## Dataset
TODO: Explain what dataset you are using for this project. Maybe even give a small overview of the classes, class distributions, etc. that can help anyone not familiar with the dataset get a better understanding of it.  
* Refer to the `00.EDA.ipynb` file.  

In [17]:
#TODO: Fetch and upload the data to AWS S3
# Command to download and unzip data
# !wget https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip
# !unzip dogImages.zip

## Hyperparameter Tuning
**TODO:** This is the part where you will finetune a pretrained model with hyperparameter tuning. Remember that you have to tune a minimum of two hyperparameters. However you are encouraged to tune more. You are also encouraged to explain why you chose to tune those particular hyperparameters and the ranges.

**Note:** You will need to use the `hpo.py` script to perform hyperparameter tuning.

In [None]:
## TODO: Declare your HP ranges, metrics etc.
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)
hyperparameter_ranges = {
    'epochs': IntegerParameter(20, 40, scaling_type="Auto"),
    'batch-size': CategoricalParameter([16, 32, 64]),
    'opt-learning-rate': ContinuousParameter(1e-6, 1e-4),
    'opt-weight-decay': ContinuousParameter(1e-5, 1e-3),
}
objective_metric_name = "eval_loss_epoch"
objective_type = "Minimize"
metric_definitions = [{
    "Name": "eval_loss_epoch",  ## must match objective_metric_name
    "Regex": "EVAL: Average loss: ([0-9\\.]+)"}]
## TODO: Create estimators for your HPs
estimator = PyTorch(
    entry_point='train.py',  # Your training script that defines the ResNet50 model and training loop
    source_dir='scripts',  # Directory where your script and dependencies are stored
    role=role_arn,
    framework_version='1.13.1',  # Use the PyTorch version you need
    py_version='py39',
    instance_count=1,  # Adjust based on the number of instances you want to use
    ## Running 5 ml.g4dn.xlarge instances concurrently would cost around $3.76 per hour in total.
    instance_type='ml.g4dn.xlarge',  ## 16GB $0.752/hr, Use GPU instances for deep learning
    # instance_type='ml.p3.2xlarge',  # 16GB $3.825/hr
    # instance_type='ml.p4d.24xlarge', ## 40*8GB $32.77/hr
    output_path=output_path,
    hyperparameters=hyperparameters,
    # use_spot_instances=True,
)
## TODO: Your HP tuner here
tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    max_jobs=40,
    max_parallel_jobs=2,  ## this account limits gpu instance concurrent usage
    objective_type=objective_type,
    base_tuning_job_name='p3-dog-breeds-hpo',
    early_stopping_type='Auto',
)
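
The tuner reads the objective metric by applying the `Regex` in `metric_definitions` to the training logs. A minimal sketch for checking that pattern locally before launching a tuning job is shown here; the sample log line and the helper name `extract_metric` are made up for illustration, and the real line depends on what `train.py` prints.

```python
import re

# Regex copied from metric_definitions in the cell above.
METRIC_REGEX = "EVAL: Average loss: ([0-9\\.]+)"


def extract_metric(log_line, pattern=METRIC_REGEX):
    """Return the first captured group as a float, or None if the line does not match."""
    match = re.search(pattern, log_line)
    return float(match.group(1)) if match else None


# Hypothetical log line; adjust it to the exact format your training script emits.
sample_line = "EVAL: Average loss: 0.4321, Accuracy: 812/835"
print(extract_metric(sample_line))
print(extract_metric("TRAIN: Average loss: 0.9876"))
```

If the helper returns `None` for a line copied from a real training log, the tuner would likely not find the objective metric either.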

In [None]:
%%time
## TODO: Fit your HP Tuner
## TODO: Remember to include your data channels
tuner.fit(
    wait=False,  
    # job_name=f"p3-dog-breeds-hpo-{datetime.now().strftime('%Y%m%d-%H%M%S')}",  
    inputs={
        "train": train_data, 
        "validation": val_data, 
        "test": test_data,
    }, 
) 
print("👉", tuner.latest_tuning_job.name)
## check hpo jobs in "SageMaker - Training - Hyperparameter tuning jobs"
## e.g. p3-dog-breeds-hpo-241203-0321

👉 p3-dog-breeds-hpo-241203-0321
CPU times: total: 469 ms
Wall time: 4.21 s


In [None]:
## if the tuning job doesn't take a long time...
# print(tuner.best_training_job())
# print(tuner.best_estimator().hyperparameters())
# predictor = tuner.deploy(
#     initial_instance_count=1, 
#     instance_type="ml.t2.medium")

## **👉 W&B Sweep** 

[Check the Sweep workspace](https://wandb.ai/nov05/udacity-awsmle-resnet50-dog-breeds/sweeps/tkeo613o)    

<img src="https://raw.githubusercontent.com/nov05/pictures/refs/heads/master/Udacity/20241119_aws-mle-nanodegree/2024-12-03%2019_24_26-sagemaker-hpo%20_%20udacity-awsmle-resnet50-dog-breeds%20Workspace%20%E2%80%93%20Weights%20%26%20Biases.jpg" width=600>  

<img src="https://raw.githubusercontent.com/nov05/pictures/refs/heads/master/Udacity/20241119_aws-mle-nanodegree/2024-12-03%2019_51_24-sagemaker-hpo%20_%20udacity-awsmle-resnet50-dog-breeds%20Workspace%20%E2%80%93%20Weights%20%26%20Biases.jpg" width=600>  

In [None]:
## launch wandb sweep job
# !wandb agent nov05/udacity-awsmle-resnet50-dog-breeds/tkeo613o

## Model Profiling and Debugging
TODO: Using the best hyperparameters, create and finetune a new model  
**Note:** You will need to use the `train.py` script to perform model profiling and debugging.

In [None]:
# TODO: Set up debugging and profiling rules and hooks
from sagemaker.debugger import (
    Rule,
    ProfilerRule,
    DebuggerHookConfig,
    rule_configs,
)
rules = [
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.overtraining()),
    Rule.sagemaker(rule_configs.poor_weight_initialization()),
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
]
hook_config = DebuggerHookConfig(
    hook_parameters={
        "train.save_interval": "100", 
        "eval.save_interval": "10"
    }
)

In [None]:
## instantiate an estimator from the hpo job name
from sagemaker.estimator import Estimator
from sagemaker.tuner import HyperparameterTuner
from pprint import pprint
tuning_job_name = "p3-dog-breeds-hpo-241203-0321"
## attach to the finished tuning job; best_training_job() is a HyperparameterTuner method
hpo_tuner = HyperparameterTuner.attach(tuning_job_name, sagemaker_session=session)
best_training_job = hpo_tuner.best_training_job()
print("👉 Best training job:", best_training_job)
# TODO: Get the best estimators and the best HPs
best_estimator = Estimator.attach(best_training_job)
print("👉 Best estimator hyperparameters:")
best_hyperparameters = best_estimator.hyperparameters()  ## best_estimator is an object, not a callable
pprint(best_hyperparameters)

In [None]:
# TODO: Create and fit an estimator
best_hyperparameters['debug'] = True
estimator = PyTorch(
    entry_point='train.py',  # Your training script that defines the ResNet50 model and training loop
    source_dir='scripts',  # Directory where your script and dependencies are stored
    role=role_arn,
    framework_version='1.13.1',  # Use the PyTorch version you need
    py_version='py39',
    instance_count=1,  # Adjust based on the number of instances you want to use
    # instance_type='ml.p3.2xlarge',  # 16GB, Use GPU instances for deep learning
    instance_type='ml.g4dn.xlarge',  ## 16GB
    output_path=output_path,
    hyperparameters=best_hyperparameters,
    # use_spot_instances=True,
    ## Debugger parameters
    rules=rules,
    debugger_hook_config=hook_config,    
)
estimator.fit(
    wait=True,  
    job_name=f"p3-dog-breeds-debug-{datetime.now().strftime('%Y%m%d-%H%M%S')}",  
    inputs={
        "train": train_data, 
        "validation": val_data, 
        "test": test_data,
    },  
)

In [None]:
# TODO: Plot a debugging output.

**TODO**: Is there some anomalous behaviour in your debugging output? If so, what is the error and how will you fix it?  
**TODO**: If not, suppose there was an error. What would that error look like and how would you have fixed it?

In [None]:
# TODO: Display the profiler output

## Model Deploying

In [None]:
## instantiate an estimator from the hpo job name
from sagemaker.estimator import Estimator
from sagemaker.tuner import HyperparameterTuner
from pprint import pprint
tuning_job_name = "p3-dog-breeds-hpo-241203-0321"
## attach to the finished tuning job; best_training_job() is a HyperparameterTuner method
hpo_tuner = HyperparameterTuner.attach(tuning_job_name, sagemaker_session=session)
best_training_job = hpo_tuner.best_training_job()
print("👉 Best training job:", best_training_job)
# TODO: Get the best estimators and the best HPs
best_estimator = Estimator.attach(best_training_job)
# print("👉 Best estimator hyperparameters:")
# best_hyperparameters = best_estimator.hyperparameters()
# pprint(best_hyperparameters)

In [None]:
# TODO: Deploy your model to an endpoint
# TODO: Add your deployment configuration like instance type and number of instances
predictor = best_estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.g4dn.xlarge',  # or "ml.m5.large"
)

In [None]:
# TODO: Run a prediction on the endpoint
# TODO: Your code to load and preprocess image to send to endpoint for prediction
image = ""
response = predictor.predict(image)
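
The placeholder `image = ""` above still needs a real payload. One possible sketch, assuming the endpoint's inference code accepts raw JPEG bytes (that contract is not shown in this notebook), is to read the file and sanity-check it before sending. The helper name `load_jpeg_bytes` and the file path in the usage comment are hypothetical.

```python
from pathlib import Path

# Every JPEG file starts with these three bytes (SOI marker plus the first marker byte).
JPEG_MAGIC = b"\xff\xd8\xff"


def load_jpeg_bytes(path):
    """Read a file and return its bytes, raising ValueError if it is not a JPEG."""
    data = Path(path).read_bytes()
    if not data.startswith(JPEG_MAGIC):
        raise ValueError(f"{path} does not look like a JPEG file")
    return data


# Usage (hypothetical path; the serializer and content type must match
# whatever the inference script behind the endpoint expects):
# image = load_jpeg_bytes("path/to/dog.jpg")
# response = predictor.predict(image)
```

Whether raw bytes are the right payload depends on the serializer configured on the predictor and on the inference script, so treat this only as a starting point.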

In [None]:
# TODO: Remember to shutdown/delete your endpoint once your work is done
predictor.delete_endpoint()

* 🟢⚠️ Issue solved:     

  > ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateHyperParameterTuningJob 
  operation: The account-level service limit 'ml.g4dn.xlarge for training job usage' is 2 Instances, with current 
  utilization of 0 Instances and a request delta of 10 Instances. Please use AWS Service Quotas to request an 
  increase for this quota. If AWS Service Quotas is not available, contact AWS support to request an increase for 
  this quota.

  * You can still create an HPO job with as many `max_jobs` as you want. However, the number of concurrent jobs is limited to 2 (`max_parallel_jobs=2`). For example, if your `max_jobs` is set to 20, only 2 training jobs will run at a time. If each training job takes about an hour, the entire HPO job will take at least 10 hours to complete.

  * Go to `Service Quotas > AWS services > Amazon SageMaker`, search for `ml.g4dn.xlarge`.  

    <img src="https://raw.githubusercontent.com/nov05/pictures/refs/heads/master/Udacity/20241119_aws-mle-nanodegree/2024-12-03%2002_03_35-Quotas%20list%20-%20Amazon%20SageMaker%20_%20AWS%20Service%20Quotas.jpg" width=600>  

    <img src="https://raw.githubusercontent.com/nov05/pictures/refs/heads/master/Udacity/20241119_aws-mle-nanodegree/2024-12-03%2002_06_13-Quotas%20list%20-%20Amazon%20SageMaker%20_%20AWS%20Service%20Quotas.jpg" width=600>  
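
The wall-clock estimate in the note above (20 jobs, 2 at a time, about an hour each) can be sketched as a small calculation. It assumes every job takes the same time and that early stopping does not cut any job short, so it is an upper-level rough estimate rather than a prediction; the function name `estimate_hpo` is made up for illustration, and the $0.752/hr rate is the `ml.g4dn.xlarge` figure noted in the estimator cell.

```python
import math


def estimate_hpo(max_jobs, max_parallel_jobs, hours_per_job, hourly_rate_usd):
    """Return (wall_clock_hours, total_cost_usd) for a tuning job.

    Assumes identical job durations and no early stopping.
    """
    # Jobs run in "waves" of at most max_parallel_jobs at a time.
    waves = math.ceil(max_jobs / max_parallel_jobs)
    wall_clock_hours = waves * hours_per_job
    # Cost depends on total instance-hours, not on how many jobs run in parallel.
    total_cost_usd = max_jobs * hours_per_job * hourly_rate_usd
    return wall_clock_hours, total_cost_usd


hours, cost = estimate_hpo(
    max_jobs=20, max_parallel_jobs=2, hours_per_job=1.0, hourly_rate_usd=0.752
)
print(f"Estimated wall-clock time: {hours:.0f} h, estimated cost: ${cost:.2f}")
```

Raising the concurrency quota shortens the wall-clock time but, under these assumptions, leaves the total cost unchanged.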