# Visualizing Debugging Tensors of MXNet training

### Overview

This demo is based on the SageMaker example documented at https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger/mnist_tensor_plot. Because AWS@Apple enforces network isolation in the SCP for SageMaker `CreateTrainingJob`, a training job cannot download a public dataset directly through the framework-provided API. This notebook demonstrates how to use S3 data channels instead of downloading from the public internet.

SageMaker Debugger is a new capability of Amazon SageMaker for debugging machine learning models.
It lets you go beyond scalars such as losses and accuracies and gives
you full visibility into all the tensors 'flowing through the graph' during training. SageMaker Debugger helps you monitor training in near real time using rules, and it alerts you once it detects an inconsistency in the training flow.

Using SageMaker Debugger is a two-step process: saving tensors and analyzing them. In this notebook we run an MXNet training job and configure SageMaker Debugger to store all tensors from this job. Afterwards we visualize those tensors in the notebook.


### Dependencies
Before we begin, let us install the plotly library if it is not already present in the environment.
If the cell below installs the library for the first time, you will have to restart the kernel and come back to the notebook. In addition, so that our visualization can access the tensors, we install smdebug, the Debugger library that provides API access to tensors emitted during a training job.

In [1]:
! python -m pip install plotly
! python -m pip install smdebug

You should consider upgrading via the '/home/ec2-user/anaconda3/envs/mxnet_p36/bin/python -m pip install --upgrade pip' command.


### Configure and run the training job

Now we'll call the SageMaker MXNet Estimator to kick off a training job with Debugger attached to it.

The `entry_point_script` points to the MXNet training script.

In [2]:
import os
import sagemaker
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig
from sagemaker.mxnet import MXNet

sagemaker_session = sagemaker.Session()

entry_point_script = './scripts/mxnet_gluon_save_all_demo.py'
hyperparameters = {'batch-size': 256}
base_job_name = 'mnist-tensor-plot'

# Make sure to set this to your bucket and location
BUCKET_NAME = 'sagemaker-bhan-dev'
LOCATION_IN_BUCKET = 'mnist-tensor-plot'
s3_bucket_for_tensors = 's3://{BUCKET_NAME}/{LOCATION_IN_BUCKET}'.format(BUCKET_NAME=BUCKET_NAME, LOCATION_IN_BUCKET=LOCATION_IN_BUCKET)
kms_key='db53dd38-c590-4dd2-9605-b35b668a3966' # Replace with your Sagemaker Notebook Instance KMS key

estimator = MXNet(
    role=sagemaker.get_execution_role(),
    base_job_name=base_job_name,
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',
    entry_point=entry_point_script,
    framework_version='1.6.0',
    train_max_run=3600,
    sagemaker_session=sagemaker_session,
    py_version='py3',
    output_path=s3_bucket_for_tensors,
    train_volume_kms_key=kms_key,
    encrypt_inter_container_traffic=True,
    subnets=['subnet-09b09827a04a561c0','subnet-033717301a9ddb40d'],
    security_group_ids=['sg-0a58f487922c7143e'],
    enable_network_isolation=True,
    debugger_hook_config = DebuggerHookConfig(
      s3_output_path=s3_bucket_for_tensors,  # Required
      collection_configs=[
          CollectionConfig(
              name="all_tensors",
              parameters={
                  "include_regex": ".*",
                  "save_steps": "1, 2, 3"
              }
          )
      ]
    )
)

The estimator described above saves all tensors of all layers during steps 1, 2, and 3. Next, we prepare the dataset and define the data channels.

Because a network-isolated training job cannot download the required dataset from the public internet, we need to download it and upload it to S3 in advance. The uploaded objects can then be used as data channels for training jobs.
You can download the dataset as follows:
```
import mxnet    
mnist_train = mxnet.gluon.data.vision.datasets.MNIST(root='/tmp', train=True)
mnist_valid = mxnet.gluon.data.vision.FashionMNIST(root='/tmp', train=False)
```
This downloads the training images/labels and validation images/labels to `/tmp`. You can upload the dataset to S3 with the boto3 API or the AWS CLI. We can then define data channels that point to those S3 locations.
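The upload step could look roughly like the following sketch. It assumes the four `.gz` files sit directly under `/tmp` and reuses the key layout from the channel definitions in this notebook. The names `KEY_PREFIXES`, `plan_uploads`, and `upload_files` are hypothetical helpers, not part of any SDK, and `boto3` is imported lazily so the planning step can be checked without AWS credentials.

```python
import os

# Hypothetical mapping from downloaded file name to the S3 key prefix used
# by the data channels in this notebook.
KEY_PREFIXES = {
    "train-images-idx3-ubyte.gz": "mnist-datasets/mnist",
    "train-labels-idx1-ubyte.gz": "mnist-datasets/mnist",
    "t10k-images-idx3-ubyte.gz": "mnist-datasets/fashion-mnist",
    "t10k-labels-idx1-ubyte.gz": "mnist-datasets/fashion-mnist",
}


def plan_uploads(local_dir, key_prefixes=KEY_PREFIXES):
    """Return (local_path, s3_key) pairs for the known files found in local_dir."""
    plan = []
    for file_name, prefix in sorted(key_prefixes.items()):
        local_path = os.path.join(local_dir, file_name)
        if os.path.isfile(local_path):
            plan.append((local_path, prefix + "/" + file_name))
    return plan


def upload_files(plan, bucket_name):
    """Upload every planned file with the boto3 S3 client."""
    import boto3  # imported here so plan_uploads works without boto3 installed

    s3 = boto3.client("s3")
    for local_path, s3_key in plan:
        s3.upload_file(local_path, bucket_name, s3_key)


# Example (requires AWS credentials and your own bucket name):
# upload_files(plan_uploads("/tmp"), "your-bucket-name")
```

Separating the plan from the upload makes it easy to print and review the target keys before anything is written to S3.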

In [3]:
train_data_channel = sagemaker.session.s3_input('s3://sagemaker-bhan-dev/mnist-datasets/mnist/train-images-idx3-ubyte.gz')
train_label_channel = sagemaker.session.s3_input('s3://sagemaker-bhan-dev/mnist-datasets/mnist/train-labels-idx1-ubyte.gz')
valid_data_channel = sagemaker.session.s3_input('s3://sagemaker-bhan-dev/mnist-datasets/fashion-mnist/t10k-images-idx3-ubyte.gz')
valid_label_channel = sagemaker.session.s3_input('s3://sagemaker-bhan-dev/mnist-datasets/fashion-mnist/t10k-labels-idx1-ubyte.gz')

data_channels = {
    'train_data': train_data_channel, 
    'train_label': train_label_channel, 
    'valid_data': valid_data_channel, 
    'valid_label': valid_label_channel
}

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.


A training job downloads the data defined in each data channel to local storage and sets the environment variable `SM_CHANNEL_{channel_name}` to the local location of that data. In the example above we defined four data channels: `train_data`, `train_label`, `valid_data`, and `valid_label`. A training job therefore sets four environment variables, `SM_CHANNEL_TRAIN_DATA`, `SM_CHANNEL_TRAIN_LABEL`, `SM_CHANNEL_VALID_DATA`, and `SM_CHANNEL_VALID_LABEL`, each pointing to the downloaded data on local disk (by default under `/opt/ml/input/data/`).
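A training script can resolve these locations along the lines of the sketch below. It is not the code in `mxnet_gluon_save_all_demo.py`. The helpers `channel_dir` and `read_idx_labels` are hypothetical, and the fallback to `/opt/ml/input/data/<channel_name>` and the assumption that each file keeps its original name inside the channel directory are illustrative guesses. The label reader relies only on the standard IDX layout (a big-endian magic number 2049 and an item count, followed by one byte per label).

```python
import gzip
import os
import struct


def channel_dir(channel_name, environ=None):
    """Return the local directory for a data channel such as 'train_label'."""
    environ = os.environ if environ is None else environ
    default = os.path.join("/opt/ml/input/data", channel_name)  # assumed fallback
    return environ.get("SM_CHANNEL_" + channel_name.upper(), default)


def read_idx_labels(path):
    """Read a gzipped IDX label file into a list of integer labels."""
    with gzip.open(path, "rb") as f:
        magic, count = struct.unpack(">II", f.read(8))
        if magic != 2049:
            raise ValueError("not an IDX label file: magic=%d" % magic)
        return list(f.read(count))


# Example inside a training job (file name assumed to match the S3 object):
# labels = read_idx_labels(
#     os.path.join(channel_dir("train_label"), "train-labels-idx1-ubyte.gz"))
```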

Now, we can start the training job with the `data_channels` as `inputs`.

For details on how the training job references the dataset, see the `entry_point` script `./scripts/mxnet_gluon_save_all_demo.py`.

In [4]:
estimator.fit(inputs=data_channels,  logs=True)

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


2020-06-16 22:03:07 Starting - Starting the training job...
2020-06-16 22:03:09 Starting - Launching requested ML instances.........
2020-06-16 22:04:42 Starting - Preparing the instances for training...
2020-06-16 22:05:33 Downloading - Downloading input data...
2020-06-16 22:05:56 Training - Downloading the training image..
2020-06-16 22:06:16 Training - Training image download completed. Training in progress.
2020-06-16 22:06:17,413 sagemaker-containers INFO     Imported framework sagemaker_mxnet_container.training
2020-06-16 22:06:17,416 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-06-16 22:06:17,432 sagemaker_mxnet_container.training INFO     MXNet training environment: {'SM_HOSTS': '["algo-1"]', 'SM_NETWORK_INTERFACE_NAME': 'eth0', 'SM_HPS': '{}', 'SM_USER_ENTRY_POINT': 'mxnet_gluon_save_all_demo.py', 'SM_FRAMEWORK_PARAMS': '{}', 'SM_RESOURCE_CONFIG': '{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"

### Get S3 location of tensors

Now we can retrieve the S3 location of the tensors:

In [5]:
tensors_path = estimator.latest_job_debugger_artifacts_path()
print('S3 location of tensors is: ', tensors_path)

S3 location of tensors is:  s3://sagemaker-bhan-dev/mnist-tensor-plot/mnist-tensor-plot-2020-06-16-22-03-06-879/debug-output


### Download tensors from S3

Next we download the tensors from S3, so that we can visualize them in the notebook.

In [6]:
folder_name = tensors_path.split("/")[-1]
os.system("aws s3 cp --recursive " + tensors_path + " " + folder_name)
print('Downloaded tensors into folder: ', folder_name)

Downloaded tensors into folder:  debug-output
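Once the files are local, they can also be inspected directly before plotting. The sketch below assumes a trial object that offers `tensor_names(regex=...)` and `tensor(name)` with `steps()` and `value(step)`, which is how `smdebug` trials are commonly used. `summarize_tensors` is a hypothetical helper, and the commented usage assumes `smdebug` is installed and that `debug-output` is the folder downloaded above.

```python
def summarize_tensors(trial, regex):
    """Map each matching tensor name to {step: shape} for all saved steps."""
    summary = {}
    for name in trial.tensor_names(regex=regex):
        tensor = trial.tensor(name)
        summary[name] = {step: tensor.value(step).shape for step in tensor.steps()}
    return summary


# Example against the downloaded folder (requires smdebug):
# from smdebug.trials import create_trial
# trial = create_trial("debug-output")
# print(summarize_tensors(trial, ".*relu_output"))
```

A quick look at the shapes helps when choosing a `regex` that keeps the number of plotted layers manageable.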


### Visualize
The main purpose of the `TensorPlot` class is to visualize the tensors in your network, for example to find dead or saturated activations or to inspect the feature maps of the network.

To use `TensorPlot`, supply the `regex` argument with a pattern that matches the tensors you are interested in. For example, if you are interested in activation outputs, supply the regex `.*relu|.*tanh|.*sigmoid`.
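To preview which tensors a pattern would pick up, a small filter such as the one below can help. It is a sketch only: the tensor names are made up for illustration, `select_tensor_names` is a hypothetical helper, and matching from the start of the name with `re.match` is an assumption about how the pattern is applied.

```python
import re


def select_tensor_names(names, regex):
    """Keep the names that the pattern matches from the start of the string."""
    pattern = re.compile(regex)
    return [name for name in names if pattern.match(name)]


# Hypothetical tensor names, for illustration only.
names = ["conv0_relu_output_0", "dense0_tanh_output_0", "conv0_weight"]
print(select_tensor_names(names, ".*relu|.*tanh|.*sigmoid"))
# prints ['conv0_relu_output_0', 'dense0_tanh_output_0']
```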

Another important argument is `batch_sample_id`, which selects the sample within the batch to display. For example, given an input tensor of size (batch_size, channel, width, height), `batch_sample_id = n` displays the slice (n, channel, width, height). If you set `batch_sample_id = -1`, the tensors are summed over the batch dimension (i.e., `np.sum(tensor, axis=0)`). If `batch_sample_id` is `None`, each sample is plotted as a separate layer in the figure.

Here are some interesting use cases:

1) If you want to find dead or saturated activations, for instance ReLUs that always output zero, sum over the batch dimension (`batch_sample_id=-1`). The sum indicates which parts of the network are inactive across a batch.

2) If you are interested in the feature maps for the first image in the batch, provide `batch_sample_id=0`. This can be helpful if your model performs poorly on a certain set of samples and you want to understand which activations lead to the misprediction.
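The three `batch_sample_id` modes described above can be sketched with NumPy as follows. This illustrates the described behavior and is not the `TensorPlot` implementation; `select_from_batch` is a hypothetical helper.

```python
import numpy as np


def select_from_batch(tensor, batch_sample_id):
    """Return the list of arrays that would be plotted for one tensor."""
    if batch_sample_id is None:
        # One plotted layer per sample in the batch.
        return [tensor[i] for i in range(tensor.shape[0])]
    if batch_sample_id == -1:
        # Collapse the batch dimension by summing over it.
        return [np.sum(tensor, axis=0)]
    # One sample: (batch, channel, width, height) -> (channel, width, height).
    return [tensor[batch_sample_id]]


batch = np.zeros((4, 1, 28, 28))
batch[:, 0, 0, 0] = 1.0  # a unit that fires for every sample in the batch
summed = select_from_batch(batch, -1)[0]
print(summed.shape, summed[0, 0, 0], summed[0, 1, 1])
```

After summing, a position that is still zero was inactive for every sample in the batch, which is the signal used in the first use case.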

An example visualization of layer outputs:
![](./images/tensorplot.gif)


`TensorPlot` normalizes tensor values to the range 0 to 1, which means color scales are the same across layers. Blue indicates values close to 0 and yellow indicates values close to 1. The class is designed to plot convolutional networks that take 2D images as input and predict classes or produce output images. You can use it for other types of networks such as RNNs, but you may have to adjust the class because it currently ignores tensors that have more than 4 dimensions.
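One common way to map values to the range 0 to 1 is min-max scaling, sketched below. The exact normalization used by `TensorPlot` is not shown in this notebook, so treat this as an assumption; `normalize_unit_range` is a hypothetical helper.

```python
import numpy as np


def normalize_unit_range(tensor):
    """Min-max scale a tensor to [0, 1]; a constant tensor maps to all zeros."""
    tensor = np.asarray(tensor, dtype=np.float64)
    low, high = tensor.min(), tensor.max()
    if high == low:
        # Avoid division by zero when every value is identical.
        return np.zeros_like(tensor)
    return (tensor - low) / (high - low)


print(normalize_unit_range([2, 4, 6]))
```

With every layer scaled this way, the same color always denotes the same relative activation level, which is what makes layers comparable in one figure.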

Let's plot the ReLU output activations for the given MNIST training example.

In [7]:
import tensor_plot 

visualization = tensor_plot.TensorPlot(
    regex=".*relu_output", 
    path=folder_name,
    steps=10,  
    batch_sample_id=0,
    color_channel = 1,
    title="Relu outputs",
    label=".*sequential0_input_0",
    prediction=".*sequential0_output_0"
)

[2020-06-16 22:07:21.577 ip-10-10-60-246:6974 INFO local_trial.py:35] Loading trial debug-output at path debug-output
[2020-06-16 22:07:21.582 ip-10-10-60-246:6974 INFO trial.py:198] Training has ended, will refresh one final time in 1 sec.
[2020-06-16 22:07:22.585 ip-10-10-60-246:6974 INFO trial.py:210] Loaded all steps


Plotting too many layers can crash the notebook. If you encounter performance or out-of-memory issues, either reduce the number of layers to plot by changing the `regex`, or run this notebook in JupyterLab instead of Jupyter.

In the cell below we visualize the outputs of all layers, including the final classification. Note that because the training job ran for only 1 epoch, classification accuracy is not high.

In [8]:
visualization.fig.show(renderer="iframe")

For an additional example of working with debugging tensors and visualizing them in real time, try the [MXNet realtime analysis](../mxnet_realtime_analysis/mxnet-realtime-analysis.ipynb) example.