* created by nov05 on 2024-11-28   
* local conda env `awsmle_py310` (sagemaker installed, no cuda)  
* [Udacity course notes](https://www.evernote.com/shard/s139/u/0/client/snv?isnewsnv=true&noteGuid=e2ab3701-2d26-45ce-8d24-4378b0e929d9&noteKey=q4ZhC5FKGCQcV8k3HeIMRaFL2DU9WOH3nf1xQdJV7DWKcKtWTQoiDrl89g&sn=https%3A%2F%2Fwww.evernote.com%2Fshard%2Fs139%2Fsh%2Fe2ab3701-2d26-45ce-8d24-4378b0e929d9%2Fq4ZhC5FKGCQcV8k3HeIMRaFL2DU9WOH3nf1xQdJV7DWKcKtWTQoiDrl89g&title=Deploy%2BDeep%2BLearning%2BModels%2Bon%2BSageMaker%2B-%2BSageMaker%2BDebugger) (ND189, Course 4, 4.8 SageMaker Debugger)   
* [SageMaker Debugger documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html)  
* [Example of using Pytorch for model debugging](https://github.com/aws/amazon-sagemaker-examples/tree/main/sagemaker-debugger/pytorch_model_debugging)      

In [None]:
## windows powershell command to set up aws credentials
# !notepad C:\Users\guido\.aws\credentials

In [5]:
%pwd

'd:\\github\\udacity-CD0387-deep-learning-topics-within-computer-vision-nlp-project-starter\\cd0387_deploy_deeplearning_models_on_sagemaker\\debugging'

In [None]:
## reset the session after updating credentials
import boto3
boto3.DEFAULT_SESSION = None

from sagemaker.pytorch import PyTorch
from sagemaker import get_execution_role
role_arn = get_execution_role()  ## get role ARN
if 'AmazonSageMaker-ExecutionRole' not in role_arn:
    ## your own role here
    role_arn = "arn:aws:iam::061096721307:role/service-role/AmazonSageMaker-ExecutionRole-20241128T055392"
print("👉 Role ARN:", role_arn) ## If local, Role ARN: arn:aws:iam::807711953667:role/voclabs

# SageMaker Model Debugging

Here we use SageMaker Debugger to monitor how our model trains and to generate the Profiler Report, a simple report that gives an overview of the training job.

First we will need to install `smdebug`.

In [None]:
# !pip install smdebug
## Successfully installed protobuf-3.20.3 pyinstrument-3.4.2 pyinstrument-cext-0.2.4 smdebug-1.0.34

## Debugger Rule and Configs

Next we import the packages we need and specify the Debugger rules and hook configuration. The rules check for vanishing gradients, overfitting, overtraining, and poor weight initialization, and a profiler rule generates the Profiler Report. We also set the save interval to 100 steps for training and 10 steps for evaluation.

In [None]:
from sagemaker.debugger import (
    Rule,
    ProfilerRule,
    DebuggerHookConfig,
    rule_configs,
)
rules = [
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.overtraining()),
    Rule.sagemaker(rule_configs.poor_weight_initialization()),
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
]
hook_config = DebuggerHookConfig(
    hook_parameters={
        "train.save_interval": "100", 
        "eval.save_interval": "10"
    }
)
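
To make the save intervals concrete, here is a minimal sketch of which steps would be saved. It assumes the hook saves a step whenever the step index is a multiple of `save_interval`, and that the script trains on standard MNIST (60,000 training and 10,000 test images) with the batch sizes used later in this notebook. `saved_steps` is a hypothetical helper, not part of `smdebug`.

```python
import math


def saved_steps(num_steps, save_interval):
    """Return the step indices that are multiples of save_interval (assumed hook behavior)."""
    return list(range(0, num_steps, save_interval))


## assumed dataset sizes (standard MNIST) and the batch sizes from the hyperparameters
train_steps_per_epoch = math.ceil(60000 / 32)
eval_steps_per_epoch = math.ceil(10000 / 100)

train_saved = saved_steps(train_steps_per_epoch, 100)
eval_saved = saved_steps(eval_steps_per_epoch, 10)
print(len(train_saved), "training steps saved per epoch, last:", train_saved[-1])
print(len(eval_saved), "evaluation steps saved per epoch, last:", eval_saved[-1])
```

A smaller interval gives finer-grained curves at the cost of more data written to S3.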



sagemaker.config INFO - Not applying SDK defaults from location: C:\ProgramData\sagemaker\sagemaker\config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: C:\Users\guido\AppData\Local\sagemaker\sagemaker\config.yaml


👉 Role ARN: arn:aws:iam::061096721307:role/service-role/AmazonSageMaker-ExecutionRole-20241128T055392


Next we specify the hyperparameters and create the estimator. In addition to the usual arguments, the estimator takes the Debugger rules and hook configuration that we created before.

In [None]:
hyperparameters = {"epochs": "2", 
                   "batch-size": "32", 
                   "test-batch-size": "100", 
                   "lr": "0.001"
}
estimator = PyTorch(
    entry_point=r"..\script mode\scripts\pytorch_mnist.py",  ## my own script (raw string keeps the Windows backslashes literal)
    base_job_name="smdebugger-mnist-pytorch",
    role=role_arn, #get_execution_role(),
    instance_count=1,
    instance_type="ml.m5.large",
    hyperparameters=hyperparameters,
    framework_version="1.8",
    py_version="py36",
    ## Debugger parameters
    rules=rules,
    debugger_hook_config=hook_config,
)
estimator.fit(wait=True)
## find the folder in the default SageMaker folder in S3
## e.g. smdebugger-mnist-pytorch-2024-12-01-06-25-24-603/
## 5m 21.6s

2024-12-01 06:25:34 Starting - Starting the training job...
2024-12-01 06:26:02 Starting - Preparing the instances for trainingVanishingGradient: InProgress
Overfit: InProgress
Overtraining: InProgress
PoorWeightInitialization: InProgress
ProfilerReport: InProgress
...
2024-12-01 06:26:35 Downloading - Downloading input data...
2024-12-01 06:27:03 Downloading - Downloading the training image...
2024-12-01 06:27:43 Training - Training image download completed. Training in progress..bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2024-12-01 06:27:46,111 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2024-12-01 06:27:46,114 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2024-12-01 06:27:46,122 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2024-12-01 06:27:46,125 sagemaker_pytorch_container.training INFO  
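
SageMaker passes the `hyperparameters` dictionary to the entry-point script as command-line arguments (for example `--batch-size 32`). The `pytorch_mnist.py` script is not shown in this notebook, so the following is only a sketch of how such a script might read them. The `parse_hyperparameters` name and the default values are assumptions.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse the hyperparameters that SageMaker forwards as command-line flags."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=2)
    parser.add_argument("--batch-size", type=int, default=32)
    parser.add_argument("--test-batch-size", type=int, default=100)
    parser.add_argument("--lr", type=float, default=0.001)
    ## ignore any extra flags instead of failing on them
    args, _unknown = parser.parse_known_args(argv)
    return args


## simulate the flags built from the hyperparameters dictionary above
args = parse_hyperparameters(
    ["--epochs", "2", "--batch-size", "32", "--test-batch-size", "100", "--lr", "0.001"]
)
print(args.epochs, args.batch_size, args.test_batch_size, args.lr)
```

Note that `argparse` converts the dashes in flag names to underscores, so `--batch-size` is read as `args.batch_size`.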

In [4]:
from pprint import pprint
job_name = estimator.latest_training_job.name
client = estimator.sagemaker_session.sagemaker_client
description = client.describe_training_job(TrainingJobName=job_name)
print("👉", job_name)
pprint(description)

👉 smdebugger-mnist-pytorch-2024-12-01-06-25-24-603
{'AlgorithmSpecification': {'EnableSageMakerMetricsTimeSeries': True,
                            'TrainingImage': '763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.8-cpu-py36',
                            'TrainingInputMode': 'File'},
 'BillableTimeInSeconds': 230,
 'CreationTime': datetime.datetime(2024, 12, 1, 0, 25, 31, 727000, tzinfo=tzlocal()),
 'DebugHookConfig': {'CollectionConfigurations': [{'CollectionName': 'relu_input',
                                                   'CollectionParameters': {'include_regex': '.*relu_input',
                                                                            'save_interval': '500'}},
                                                  {'CollectionName': 'gradients',
                                                   'CollectionParameters': {'save_interval': '500'}}],
                     'HookParameters': {'eval.save_interval': '10',
                                  

## Checking Training Performance
Below is some boilerplate code that loads the Debugger output of the training job and displays the training metrics we tracked, as well as some of the training tensors. The plots may not show up in the classroom, but they will show up when you train the model in SageMaker Studio.

⚠️ If on a local computer, **skip this section**. Check the `model_debugging_create_trial.ipynb` file, which was run in SageMaker Studio.  

In [5]:
debug_output_path = estimator.latest_job_debugger_artifacts_path()
debug_output_path = debug_output_path.replace("\\", "/")  ## windows path
print("👉", debug_output_path)
## s3://sagemaker-us-east-1-061096721307/smdebugger-mnist-pytorch-2024-12-01-04-25-13-088/debug-output

👉 s3://sagemaker-us-east-1-061096721307/smdebugger-mnist-pytorch-2024-12-01-06-25-24-603/debug-output


In [None]:
## copy the files to local
!mkdir debug_output
!aws s3 cp {debug_output_path} ./debug_output/ --recursive
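
If the AWS CLI is not available, the same copy can be sketched with `boto3`. `split_s3_uri` and `download_prefix` are hypothetical helper names, not part of any SDK. The download itself needs valid credentials, so that call is left commented out.

```python
import os
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


def download_prefix(uri, local_dir):
    """Download every object under an S3 prefix into local_dir, keeping sub-folders."""
    import boto3  ## imported lazily so split_s3_uri works without boto3

    bucket, prefix = split_s3_uri(uri)
    client = boto3.client("s3")
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            relative = key[len(prefix):].lstrip("/")
            if key.endswith("/") or not relative:
                continue  ## skip folder placeholders
            target = os.path.join(local_dir, *relative.split("/"))
            os.makedirs(os.path.dirname(target), exist_ok=True)
            client.download_file(bucket, key, target)


bucket, prefix = split_s3_uri(
    "s3://sagemaker-us-east-1-061096721307/"
    "smdebugger-mnist-pytorch-2024-12-01-06-25-24-603/debug-output"
)
print(bucket, prefix)
## download_prefix(debug_output_path, "debug_output")  ## requires AWS credentials
```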

In [None]:
from smdebug.trials import create_trial
from smdebug.core.modes import ModeKeys
# trial = create_trial(debug_output_path)  ## S3 path
trial = create_trial("./debug_output")  ## local path (the folder downloaded in the previous cell)
## list the names of the saved tensors
trial.tensor_names()
## s3://sagemaker-us-east-1-061096721307/smdebugger-mnist-pytorch-2024-12-01-04-25-13-088/debug-output

In [None]:
len(trial.tensor("nll_loss_output_0").steps(mode=ModeKeys.TRAIN))

In [None]:
## set up function to plot the output tensors
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import host_subplot


def get_data(trial, tname, mode):
    tensor = trial.tensor(tname)
    steps = tensor.steps(mode=mode)
    vals = []
    for s in steps:
        vals.append(tensor.value(s, mode=mode))
    return steps, vals


def plot_tensor(trial, tensor_name):
    steps_train, vals_train = get_data(trial, tensor_name, mode=ModeKeys.TRAIN)
    print("loaded TRAIN data")
    steps_eval, vals_eval = get_data(trial, tensor_name, mode=ModeKeys.EVAL)
    print("loaded EVAL data")

    plt.figure(figsize=(10, 7))
    host = host_subplot(111)
    par = host.twiny()
    host.set_xlabel("Steps (TRAIN)")
    par.set_xlabel("Steps (EVAL)")
    host.set_ylabel(tensor_name)

    (p1,) = host.plot(steps_train, vals_train, label=tensor_name)
    print("completed TRAIN plot")
    (p2,) = par.plot(steps_eval, vals_eval, label="val_" + tensor_name)
    print("completed EVAL plot")
    leg = plt.legend()

    host.xaxis.get_label().set_color(p1.get_color())
    leg.texts[0].set_color(p1.get_color())

    par.xaxis.get_label().set_color(p2.get_color())
    leg.texts[1].set_color(p2.get_color())

    plt.ylabel(tensor_name)
    plt.show()


plot_tensor(trial, "nll_loss_output_0")
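
On a machine without access to the debug output, the data-loading helper can still be exercised against a stand-in object. The sketch below assumes only the interface that `get_data` uses (`trial.tensor(name)`, `.steps(mode=...)`, `.value(step, mode=...)`). `FakeTensor`, `FakeTrial`, and the loss values are made up for illustration and are not part of `smdebug`.

```python
class FakeTensor:
    """Stand-in exposing the two tensor methods that get_data calls."""

    def __init__(self, values_by_mode):
        self._values = values_by_mode  ## {mode: {step: value}}

    def steps(self, mode):
        return sorted(self._values.get(mode, {}))

    def value(self, step, mode):
        return self._values[mode][step]


class FakeTrial:
    """Stand-in exposing the single trial method that get_data calls."""

    def __init__(self, tensors):
        self._tensors = tensors

    def tensor(self, name):
        return self._tensors[name]


def get_data(trial, tname, mode):
    ## same logic as the helper above, repeated so this block runs on its own
    tensor = trial.tensor(tname)
    steps = tensor.steps(mode=mode)
    vals = [tensor.value(s, mode=mode) for s in steps]
    return steps, vals


## made-up values; plain strings stand in for ModeKeys.TRAIN / ModeKeys.EVAL
fake_trial = FakeTrial({
    "nll_loss_output_0": FakeTensor({
        "TRAIN": {0: 2.3, 100: 0.9, 200: 0.5},
        "EVAL": {0: 2.2, 10: 0.8},
    })
})
steps, vals = get_data(fake_trial, "nll_loss_output_0", mode="TRAIN")
print(steps, vals)
```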

## Display the Profiler Report
The profiler report is saved to an S3 bucket. Below we get the path of the report, download it, and display it. The report may not render in the notebook, but you can open it from the `ProfilerReport` folder.  

⚠️ Check result in the `model_debugging_profiling_report.ipynb` file.

In [17]:
rule_output_path = estimator.output_path + estimator.latest_training_job.job_name + "/rule-output"
print("👉", rule_output_path)

👉 s3://sagemaker-us-east-1-061096721307/smdebugger-mnist-pytorch-2024-12-01-06-25-24-603/rule-output
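
The string concatenation above relies on `estimator.output_path` ending with a slash. A small helper can make the join work either way. `join_s3_uri` is a hypothetical name, not part of the SageMaker SDK.

```python
def join_s3_uri(base, *parts):
    """Join S3 URI segments with exactly one slash between them."""
    return "/".join([base.rstrip("/")] + [p.strip("/") for p in parts])


job = "smdebugger-mnist-pytorch-2024-12-01-06-25-24-603"
with_slash = join_s3_uri("s3://sagemaker-us-east-1-061096721307/", job, "rule-output")
without_slash = join_s3_uri("s3://sagemaker-us-east-1-061096721307", job, "/rule-output")
print(with_slash)
print(with_slash == without_slash)
```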


In [None]:
## copy the files to local
!mkdir rule_output
!aws s3 cp {rule_output_path} ./rule_output/ --recursive

Completed 199 Bytes/491.0 KiB (202 Bytes/s) with 13 file(s) remaining
Completed 326 Bytes/491.0 KiB (331 Bytes/s) with 13 file(s) remaining
Completed 524 Bytes/491.0 KiB (533 Bytes/s) with 13 file(s) remaining
Completed 650 Bytes/491.0 KiB (661 Bytes/s) with 13 file(s) remaining
Completed 841 Bytes/491.0 KiB (855 Bytes/s) with 13 file(s) remaining
download: s3://sagemaker-us-east-1-061096721307/smdebugger-mnist-pytorch-2024-12-01-06-25-24-603/rule-output/ProfilerReport/profiler-output/profiler-reports/Dataloader.json to rule_output\ProfilerReport\profiler-output\profiler-reports\Dataloader.json
Completed 841 Bytes/491.0 KiB (855 Bytes/s) with 12 file(s) remaining
download: s3://sagemaker-us-east-1-061096721307/smdebugger-mnist-pytorch-2024-12-01-06-25-24-603/rule-output/ProfilerReport/profiler-output/profiler-reports/CPUBottleneck.json to rule_output\ProfilerReport\profiler-output\profiler-reports\CPUBottleneck.json
Completed 841 Bytes/491.0 KiB (855 Bytes/s) with 11 file(s) remaining


In [21]:
import os
# get the autogenerated folder name of profiler report
profiler_report_folder = [
    rule["RuleConfigurationName"]
    for rule in estimator.latest_training_job.rule_job_summary()
    if "Profiler" in rule["RuleConfigurationName"]
][0]
filename = "rule_output/" + profiler_report_folder + "/profiler-output/profiler-report.html"
print("👉 ", filename)

from IPython.display import display, HTML
# HTML(filename=filename)
###########################################################################
## If you are using dark mode and the bright HTML display hurts your eyes, 
## run the following code to clear it. You are welcome!
###########################################################################
## Clear any existing HTML or formatting issues
display(HTML(''))

👉  rule_output/ProfilerReport/profiler-output/profiler-report.html
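
If the report does not render inline, one option is to open the downloaded HTML file in the default browser. This sketch uses only the standard library. `report_uri` is a hypothetical helper, and the `webbrowser.open` call is commented out so that running the cell has no side effects.

```python
from pathlib import Path
import webbrowser


def report_uri(path):
    """Turn a (possibly relative) file path into an absolute file:// URI."""
    return Path(path).resolve().as_uri()


uri = report_uri("rule_output/ProfilerReport/profiler-output/profiler-report.html")
print(uri)
## webbrowser.open(uri)  ## uncomment to open the report in the default browser
```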
