## SageMaker Debugger Demo and Status

This notebook presents an example of using the SageMaker Debugger while training a ResNet50 model on ImageNet data. It covers the implementation of Debugger rules, configurations, Tensorboard, and profiling. We also note current limitations of the Debugger, and future enhancements.

### Environment

This notebook is designed to run in SageMaker Studio using the PyTorch DLC. We also recommend updating to the most recent version of the SageMaker Python SDK, as this will ensure you have access to the latest containers for training instances.

In [None]:
#!pip install --upgrade sagemaker

### Studio

Working in SageMaker Studio has a couple of benefits, but to get the most out of it, there are a few additional setup steps.

First, when you're running in Studio, you're using multiple instances. You can see everything that's currently running by clicking the circle with a square in it on the far left of the window. There's also one additional instance that's actually managing the Jupyter environment. You can get a terminal to this instance by clicking the folder on the far left, then the plus sign to the right of it to get the launcher page, then select `System Terminal`. 

Other than the Jupyter instance, all other instances in your environment will have associated Docker containers. Each instance can have multiple Docker containers, referred to as apps. In the launcher you can select a SageMaker Docker image to launch a new container with a notebook. From within the notebook, you can get access to the terminal on the respective instance by selecting the $_ symbol at the top of the notebook.

There are two additions we want to make to the environment. First, we want to add Tensorboard to our Jupyter instance.

1. Open a system terminal from the launcher page.
2. Enter `pip install tensorboard`
3. To test Tensorboard, run `tensorboard --host 0.0.0.0 --port 6006`
4. To get to Tensorboard, copy your studio URL into a new browser window, and change `lab` to `proxy/6006/`
    - For example, `https://a-stringofsomething.studio.us-east-1.sagemaker.aws/jupyter/default/lab?`
    - Becomes `https://a-stringofsomething.studio.us-east-1.sagemaker.aws/jupyter/default/proxy/6006/`
    - Make sure to include the final `/`, or Tensorboard will not load correctly
    - Shut down Tensorboard by going back to the system terminal and pressing `ctrl-c`
    - More details can be found in the [Tensorboard Studio Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-tensorboard.html)
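
The URL rewrite in step 4 is easy to get wrong by hand, so here is a minimal sketch of it in Python. The helper name `tensorboard_proxy_url` is our own (it is not part of any SageMaker library), and it assumes your Studio URL path ends in `lab`, as in the example above.

```python
from urllib.parse import urlsplit, urlunsplit


def tensorboard_proxy_url(studio_url: str, port: int = 6006) -> str:
    """Rewrite a Studio JupyterLab URL so it points at the Tensorboard proxy.

    Hypothetical helper: assumes the URL path ends in '/lab'.
    """
    parts = urlsplit(studio_url)
    path = parts.path.rstrip("/")
    if not path.endswith("/lab"):
        raise ValueError(f"expected a path ending in '/lab', got {parts.path!r}")
    # Swap the trailing 'lab' for 'proxy/<port>/' and keep the final slash.
    proxy_path = f"{path[:-len('lab')]}proxy/{port}/"
    # Drop the query string and fragment, which the proxy does not need.
    return urlunsplit((parts.scheme, parts.netloc, proxy_path, "", ""))


studio_url = "https://a-stringofsomething.studio.us-east-1.sagemaker.aws/jupyter/default/lab?"
print(tensorboard_proxy_url(studio_url))
```

If you start Tensorboard on a different port, pass that port as the second argument.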
    
The second addition lets you prototype your model in Studio, rather than launching a training job each time. This can be a useful time saver for interactive development.

The `Scratch.ipynb` notebook contains a prototype training script. At the top of the notebook, the first cell runs `pip install -e ./src`. This installs all dependencies for our model, including PyTorch Lightning and Webdataset. 

Also note that training jobs will automatically run this same setup. For example, if you launch a training job on a standard PyTorch DLC with a Python file as your entry point, and include a `setup.py` file in your source directory, SageMaker will automatically run `pip install` on the included setup file. This allows for easier customization of training images, without the need to create your own image ahead of time.

### Debugger

The SageMaker Debugger is a collection of tools for monitoring model training, and taking action based on observed behavior. The Debugger consists of four main components:

#### Tensor Collection Hook

The collection hook wraps your model and collects tensor data at regular intervals. For example, we can collect the loss tensor every 25 steps, and the gradients every 500 steps. You can also optionally apply reductions to these tensors (for example, taking the mean and standard deviation of the gradients). At each collection step, the results are stored in the training job's S3 output. The tensor collection hook can be set up manually, or you can specify rules and let them determine which tensors to collect.
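
To make the interval and reduction ideas concrete, here is a small pure-Python sketch. It is not the Debugger's implementation: the function names are our own, and it assumes a tensor is saved whenever the step number is a multiple of the save interval.

```python
from statistics import mean, pstdev


def should_save(step: int, save_interval: int) -> bool:
    # Assumption for illustration: save on every multiple of the interval.
    return step % save_interval == 0


def reduce_tensor(values, reductions=("mean", "std")):
    """Collapse a list of numbers into a few summary statistics."""
    ops = {"mean": mean, "std": pstdev}
    return {name: ops[name](values) for name in reductions}


# Steps at which a collection with an interval of 25 would be saved
loss_steps = [step for step in range(0, 101) if should_save(step, 25)]
print(loss_steps)  # [0, 25, 50, 75, 100]

# Storing two numbers instead of the full gradient tensor
summary = reduce_tensor([0.5, 1.5, 2.5, 3.5])
print(summary)
```

Reductions matter most for large collections such as gradients, where saving every value at every collection step would add a lot of data to the S3 output.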

#### Rules

Adding rules to your training job will launch additional monitoring instances that watch for specific behavior. For example, the `Overfit` rule will add the necessary tensor collection config to save the loss tensor to S3, and launch a processing instance which will monitor the loss during training and raise an alert if the conditions for overfitting are met. This trigger can be monitored in Studio, or passed to other tools. For example, you can have a rule send out a notification over email, SMS, or Slack when a condition is triggered. You can also have a rule trigger a Lambda function, for example to shut down training if certain conditions are observed.

#### Tensorboard

The Debugger can take a Tensorboard configuration, which will add event files to the S3 output. You can monitor training by pointing the Tensorboard log directory to this location.

#### System Profiling

The system profiler will monitor instance performance for any issues during training. It has a set of rules to monitor for common training issues. For example, if GPU utilization is low, and CPU utilization is high, it will report a CPU bottleneck. Once training is complete, the system profiler will generate a profile report, which can be downloaded either from Studio or S3.
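
The CPU bottleneck example can be pictured with a toy check like the one below. This is only an illustration of the idea, not the profiler's actual rule: the function name and all three thresholds are hypothetical values we picked for the example.

```python
def looks_cpu_bound(samples, cpu_threshold=90.0, gpu_threshold=10.0, min_fraction=0.5):
    """Return True when enough samples show a busy CPU next to an idle GPU.

    samples: iterable of (cpu_percent, gpu_percent) pairs.
    All thresholds are hypothetical and chosen only for illustration.
    """
    samples = list(samples)
    if not samples:
        return False
    flagged = sum(
        1 for cpu, gpu in samples if cpu >= cpu_threshold and gpu <= gpu_threshold
    )
    return flagged / len(samples) >= min_fraction


# Made-up utilization samples, one (CPU %, GPU %) pair per monitoring interval
starved = [(97.0, 4.0), (95.5, 6.0), (99.0, 3.0), (60.0, 80.0)]
healthy = [(45.0, 92.0), (50.0, 95.0), (97.0, 5.0), (40.0, 90.0)]

print(looks_cpu_bound(starved))  # True: 3 of 4 samples are flagged
print(looks_cpu_bound(healthy))  # False: only 1 of 4 samples is flagged
```

A result like the first one usually points at data loading or preprocessing, for example too few dataloader workers for the number of GPUs.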

In [1]:
import os
from datetime import datetime

import boto3
from sagemaker import analytics, image_uris
from sagemaker.pytorch import PyTorch
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker
from sagemaker import get_execution_role
from sagemaker.debugger import (
    Rule,
    DebuggerHookConfig,
    TensorBoardOutputConfig,
    CollectionConfig,
    ProfilerConfig,
    FrameworkProfile,
    DetailedProfilingConfig,
    DataloaderProfilingConfig,
    rule_configs,
)
from smdebug.core.collection import CollectionKeys

### S3 Setup

This section is not required. It's just the way I like to set up my S3 bucket to keep jobs organized.

In [2]:
boto_sess = boto3.Session()
region = boto_sess.region_name
sm = boto_sess.client('sagemaker')

s3_bucket = "s3://jbsnyder-sagemaker-us-east/"

base_job_name = "jbsnyder-pl-resnet-debugger"
# Take a single timestamp so the date and time strings always agree
now = datetime.now()
date_str = now.strftime("%d-%m-%Y")
time_str = now.strftime("%d-%m-%Y-%H-%M-%S")
job_name = f"{base_job_name}-{time_str}"

output_path = os.path.join(s3_bucket, "sagemaker-output", date_str, job_name)
code_location = os.path.join(s3_bucket, "sagemaker-code", date_str, job_name)

### Trial Setup

Also not required for a training job, but a good way to keep job info organized.
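
The cell below also sets up metric definitions, which pair a metric name with a regex applied to the training logs. As far as we understand it, the value recorded for the metric is whatever the regex captures in parentheses, so each regex needs a capture group. Here is a quick way to test a regex locally before launching a job. The log line is made up, since the real format depends on what your training script prints.

```python
import re

# Hypothetical log line; the real format depends on the training script.
log_line = "step 25 train_loss_step: 2.3456 train_acc_step: 0.4123"

with_group = re.search(r"train_acc_step: ([0-9\.]+)", log_line)
without_group = re.search(r"train_loss_step: [0-9\.]+", log_line)

print(with_group.group(1))     # 0.4123
print(without_group.groups())  # () because nothing is captured

acc = float(with_group.group(1))
```

Both patterns match the line, but only the one with parentheses isolates the number.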

In [3]:
try: # Create new experiment
    experiment = Experiment.create(
        experiment_name=base_job_name,
        description='Resnet50 Classifier Training',
        sagemaker_boto_client=sm)
except Exception: # Or reload existing
    experiment = Experiment.load(
        experiment_name=base_job_name,
        sagemaker_boto_client=sm)

trial = Trial.create(
    trial_name=job_name,
    experiment_name=experiment.experiment_name,
    sagemaker_boto_client=sm)
experiment_config = {
    'TrialName': trial.trial_name,
    'TrialComponentDisplayName': 'Training'}

# Configure metric definitions
metric_definitions = [
    {'Name': 'train_loss_step', 'Regex': 'train_loss_step: ([0-9\\.]+)'},
    {'Name': 'train_acc_step', 'Regex': 'train_acc_step: ([0-9\\.]+)'}]

### Tensorboard Configuration

In [4]:
tensorboard_output_config = TensorBoardOutputConfig(s3_output_path=os.path.join(output_path, 'tensorboard'))

### Manual Debugger Configuration

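You can manually tell the debugger hook what to collect. The cell below shows what such a configuration looks like, but in this case we don't pass it to the estimator (the `debugger_hook_config` argument is commented out further down), and instead let the rules determine what to collect.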

In [5]:
collection_configs=[
    CollectionConfig(
        name="losses",
        parameters={
            "save_interval": "25",
            "reductions": "mean",
        }
    ),
    CollectionConfig(
        name=CollectionKeys.GRADIENTS,
        parameters={
            "save_interval": "100",
            "reductions": "mean",
        }
    )
]

debugger_hook_config=DebuggerHookConfig(
    collection_configs=collection_configs
)

# debugger_hook_config=None

### Rules

More information on available rules, and how to create your own, can be found in the [rule documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/use-debugger-built-in-rules.html). 

Important note: Some rules require certain parameters to be set. If they are not set, the rule will fail with `Internal Server Error`. 
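
When a rule does fail, it helps to look at the per-rule status rather than only the job-level error. The sketch below filters a list of rule summaries for rules in the `Error` state. It assumes the summaries are dictionaries with `RuleConfigurationName`, `RuleEvaluationStatus`, and `StatusDetails` keys, which is the shape we expect from `estimator.latest_training_job.rule_job_summary()`. The sample data is made up.

```python
def failed_rules(rule_summaries):
    """Map each rule in the 'Error' state to its status details."""
    return {
        summary["RuleConfigurationName"]: summary.get("StatusDetails", "")
        for summary in rule_summaries
        if summary.get("RuleEvaluationStatus") == "Error"
    }


# Made-up summaries. In the notebook you would instead use something like:
#   rule_summaries = estimator.latest_training_job.rule_job_summary()
rule_summaries = [
    {"RuleConfigurationName": "Overfit", "RuleEvaluationStatus": "InProgress"},
    {
        "RuleConfigurationName": "PoorWeightInitialization",
        "RuleEvaluationStatus": "Error",
        "StatusDetails": "InternalServerError: We encountered an internal error. Please try again.",
    },
]

for name, details in failed_rules(rule_summaries).items():
    print(f"{name}: {details}")
```

If a rule shows up here, its `rule_parameters` are a good first place to look.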

In [6]:
# vanishing_gradient
# overfit
# overtraining
# poor_weight_initialization
# all_zero
# check_input_images
# class_imbalance
# dead_relu
# exploding_tensor
# loss_not_decreasing
# saturated_activation
# weight_update_ratio
# tensor_variance

rules = []

rules.append(Rule.sagemaker(
        base_config=rule_configs.tensor_variance(),
        rule_parameters={
                "collection_names": "weights",
                "max_threshold": "10",
                "min_threshold": "0.00001",
        },
        collections_to_save=[ 
            CollectionConfig(
                name="weights", 
                parameters={
                    "save_interval": "500"
                } 
            )
        ]
    )
)

rules.append(Rule.sagemaker(
        base_config=rule_configs.overfit(),
        collections_to_save=[
            CollectionConfig(
                name="losses", 
                parameters={
                    "save_interval": "25",
                } 
            )
        ]
    )
)

rules.append(Rule.sagemaker(
        base_config=rule_configs.poor_weight_initialization(),
        rule_parameters={
                "activation_inputs_regex": ".*relu_input|.*ReLU_input",
                "threshold": "10.0",
                "distribution_range": "0.001",
                "patience": "5",
                "steps": "10"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="custom_relu_collection", 
                parameters={
                    "include_regex": ".*relu_input|.*ReLU_input",
                    "save_interval": "500"
                } 
            )
        ]
    )
)

'''
rules.append(Rule.sagemaker(
        base_config=rule_configs.all_zero(),
        rule_parameters={
                "tensor_regex": ".*",
                "threshold": "100"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="all", 
                parameters={
                    "save_interval": "500"
                } 
            )
        ]
    )
)

rules.append(Rule.sagemaker(
        base_config=rule_configs.loss_not_decreasing(),
        rule_parameters={
                "tensor_regex": ".*",
                "use_losses_collection": "True",
                "num_steps": "10",
                "diff_percent": "0.1",
                "increase_threshold_percent": "5",
                "mode": "GLOBAL"
        },
    )
)'''

### Setup Profiler

In [7]:
profiler_config=ProfilerConfig(
    system_monitor_interval_millis=500,
)

### Model Hyperparameters

In [8]:
hyperparameters = {"train_file_dir": os.path.join(s3_bucket, "data", "imagenet", "train"), # "/opt/ml/input/data/train/", 
                   "validation_file_dir": os.path.join(s3_bucket, "data", "imagenet", "val"), # "/opt/ml/input/data/val/", 
                   "max_epochs": 20,
                   'optimizer': 'adamw',
                   'lr': 0.001,
                   'batch_size': 64,
                   'dataloader_workers': 4,
                   'warmup_epochs': 0,
                   'mixup_alpha': 0.1,
                   'precision': 16,
                   'strategy': 'ddp',
                   'train_batches': 2048,
                   }

In [9]:
if hyperparameters.get('strategy')=='ddp':
    distribution=None
    entry_point="launch_ddp.py"
    hyperparameters['training_script']="train.py"
else:
    distribution = {"mpi": {"enabled": True}}
    entry_point = "train.py"

In [10]:
# training_image = "763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.10.2-gpu-py38-cu113-ubuntu20.04-sagemaker"

instance_type = 'ml.p3.16xlarge'
instance_count = 1

image_uri = image_uris.retrieve(
    framework='pytorch',
    region=region,
    version='1.10',
    py_version='py38',
    image_scope='training',
    instance_type=instance_type,
)

# image_uri = '920076894685.dkr.ecr.us-east-1.amazonaws.com/jbsnyder:nvidia-pt-20.10-sm'

In [11]:
# For fast file mode, S3 location must end with "/" if it's not a specific object
# channels = {"train": os.path.join(s3_bucket, "data", "imagenet", "train/"),
#             "val": os.path.join(s3_bucket, "data", "imagenet", "val/")}

In [12]:
estimator = PyTorch(
    source_dir="./src",
    entry_point=entry_point,
    base_job_name=job_name,
    role=get_execution_role(),
    instance_count=instance_count,
    instance_type=instance_type,
    distribution=distribution,
    volume_size=400,
    max_run=7200,
    hyperparameters=hyperparameters,
    image_uri=image_uri,
    output_path=os.path.join(output_path, 'training-output'),
    checkpoint_s3_uri=os.path.join(output_path, 'training-checkpoints'),
    model_dir=os.path.join(output_path, 'training-model'),
    code_location=code_location,
    ## Debugger parameters
    metric_definitions=metric_definitions,
    enable_sagemaker_metrics=True,
    rules=rules,
    # debugger_hook_config=debugger_hook_config,
    disable_profiler=False,
    tensorboard_output_config=tensorboard_output_config,
    profiler_config=profiler_config,
    input_mode='File',
)

In [13]:
# Run training
estimator.fit(
    inputs=None if hyperparameters['train_file_dir'].startswith('s3') else channels,
    wait=False,
    job_name=job_name,
    experiment_config=experiment_config,
)

INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating training-job with name: jbsnyder-pl-resnet-debugger-03-06-2022-05-37-06


In [14]:
print("tensorboard --logdir {} --host 0.0.0.0 --port 6006".format(estimator.tensorboard_output_config.s3_output_path))

tensorboard --logdir s3://jbsnyder-sagemaker-us-east/sagemaker-output/03-06-2022/jbsnyder-pl-resnet-debugger-03-06-2022-05-37-06/tensorboard --host 0.0.0.0 --port 6006


In [15]:
estimator.logs()

2022-06-03 05:37:13 Starting - Starting the training job...TensorVariance: InProgress
Overfit: InProgress
PoorWeightInitialization: InProgress
ProfilerReport-1654234632: InProgress
...
2022-06-03 05:38:12 Starting - Preparing the instances for training............
2022-06-03 05:40:09 Downloading - Downloading input data
2022-06-03 05:40:09 Training - Downloading the training image...........................
2022-06-03 05:44:46 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2022-06-03 05:44:29,691 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2022-06-03 05:44:29,765 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2022-06-03 05:44:29,772 sagemaker_pytorch_container.training INFO     Invoking user training script.
2022-06-03 

UnexpectedStatusException: Error for Training job jbsnyder-pl-resnet-debugger-03-06-2022-05-37-06: Failed. Reason: InternalServerError: We encountered an internal error. Please try again.

In [None]:
# sm.stop_training_job(TrainingJobName=estimator.latest_training_job.name)