# Using SageMaker Debugger to visualise model training in real time

**Before starting the lab, please ensure the notebook is deployed in <span style="color:red">us-west-2</span> and the kernel is set to <span style="color:red">MXNet 1.8 Python 3.7 CPU Optimized</span>**

This notebook will train a convolutional autoencoder model on the [MNIST dataset of handwritten digits](http://yann.lecun.com/exdb/mnist/). When training a convolutional model, the intermediate layers are hidden from sight. In this demo we will use SageMaker Debugger to better understand how well the autoencoder is learning. We can check whether the model is training well by inspecting:

1. Reconstructed images (autoencoder output): 

The autoencoder learns to compress data (in our case, images) into a fairly small representation and then to reconstruct the original data from that representation as closely as it can. This visualisation helps us understand how much of the original data has been lost in the process.

2. The t-Distributed Stochastic Neighbor Embedding (t-SNE) of the latent variables: 

The t-SNE map will plot the training results as a 2-dimensional representation of clustered data. Each cluster represents one MNIST class: a digit from 0 to 9. As the training progresses, the autoencoder becomes better at separating those classes.
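
As a rough numeric companion to the visual check in item 1, reconstruction quality can also be summarised as a mean squared error between input and output pixels. The sketch below is illustrative only: `reconstruction_mse` and the tiny 2x2 "images" are hypothetical names and data, not part of the training script, and it works on plain Python lists rather than on the tensors saved by Debugger.

```python
def reconstruction_mse(original, reconstructed):
    """Mean squared error between two equally sized 2-D images (lists of rows)."""
    if len(original) != len(reconstructed):
        raise ValueError("images must have the same number of rows")
    total = 0.0
    count = 0
    for row_a, row_b in zip(original, reconstructed):
        if len(row_a) != len(row_b):
            raise ValueError("rows must have the same length")
        for a, b in zip(row_a, row_b):
            total += (a - b) ** 2
            count += 1
    return total / count


# Hypothetical 2x2 "images" with pixel values in [0, 1].
original = [[0.0, 1.0], [1.0, 0.0]]
perfect = [[0.0, 1.0], [1.0, 0.0]]
blurry = [[0.5, 0.5], [0.5, 0.5]]

print(reconstruction_mse(original, perfect))  # 0.0
print(reconstruction_mse(original, blurry))   # 0.25
```

A lower value means the output is closer to the input; the side-by-side images remain the better tool for seeing *what* was lost.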

### Training MXNet autoencoder model in Amazon SageMaker with Debugger
Before starting the SageMaker training job, we need to install some libraries. We will use the `smdebug` library to read, filter and analyze the raw tensors that are stored in Amazon S3. We also install the `seaborn` library, which will be used later on to plot the t-Distributed Stochastic Neighbor Embedding (t-SNE) of the latent variables.

In [None]:
!pip install smdebug --quiet
!pip install seaborn --quiet

## Setting up the training

The following code will define the MXNet estimator and run a training job. Notice that the estimator has the debugger hook configuration (`debugger_hook_config=...`). Also notice that the model training is implemented in the entry point script `autoencoder_mnist.py` provided in this demo.

We can access the tensors from S3 once the training job is in status Training or Completed. **This should take about 2-3 minutes.** During this time, we can inspect the training job in the [SageMaker Management Console](https://us-west-2.console.aws.amazon.com/sagemaker/home?region=us-west-2#/jobs).
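
The `include_regex` parameter of the hook configuration in the next cell decides which tensors are saved. The sketch below shows which names that pattern selects. It assumes `re.search`-style matching, and the last tensor name in the list is a hypothetical example added only to show a non-match.

```python
import re

# Pattern copied from the `include_regex` parameter of the CollectionConfig.
INCLUDE_REGEX = (
    ".*convolutionalautoencoder0_hybridsequential0_dense0_output_0"
    "|.*convolutionalautoencoder0_input_1"
    "|.*loss"
)


def is_saved(tensor_name, pattern=INCLUDE_REGEX):
    """Return True if the name matches the pattern (assumes re.search semantics)."""
    return re.search(pattern, tensor_name) is not None


candidates = [
    "convolutionalautoencoder0_hybridsequential0_dense0_output_0",  # bottleneck output
    "convolutionalautoencoder0_input_1",  # used as the label tensor in this notebook
    "l2loss0_input_0",  # matched by the ".*loss" alternative
    "convolutionalautoencoder0_conv0_weight",  # hypothetical name, not matched
]
for name in candidates:
    print(name, is_saved(name))
```

Keeping the pattern narrow limits how much data the training job writes to S3, which matters because the tensors are read back while training is still running.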

In [None]:
import sagemaker
from sagemaker.mxnet import MXNet
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig
import time

#Set up the training S3 bucket:
sagemaker_session = sagemaker.Session()
BUCKET_NAME = sagemaker_session.default_bucket()
LOCATION_IN_BUCKET = "smdebug-autoencoder-example"

s3_bucket_for_tensors = "s3://{BUCKET_NAME}/{LOCATION_IN_BUCKET}".format(
    BUCKET_NAME=BUCKET_NAME, LOCATION_IN_BUCKET=LOCATION_IN_BUCKET
)

#MXNet estimator and the debugger hook configuration:

estimator = MXNet(
    role=sagemaker.get_execution_role(),
    base_job_name="mxnet",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size=400,
    source_dir=".",
    entry_point="autoencoder_mnist.py",
    framework_version="1.6.0",
    py_version="py3",
    debugger_hook_config=DebuggerHookConfig(
        s3_output_path=s3_bucket_for_tensors,
        collection_configs=[
            CollectionConfig(
                name="all",
                parameters={
                    "include_regex": ".*convolutionalautoencoder0_hybridsequential0_dense0_output_0|.*convolutionalautoencoder0_input_1|.*loss",
                    "save_interval": "10",
                },
            )
        ],
    ),
)

#Start the training job:

estimator.fit(wait=False)

#Wait for the job to start training and creating tensors in the S3 bucket:

path = estimator.latest_job_debugger_artifacts_path()
job_name = estimator.latest_training_job.name
client = estimator.sagemaker_session.sagemaker_client
description = client.describe_training_job(TrainingJobName=job_name)


if description["TrainingJobStatus"] != "Completed":
    while description["SecondaryStatus"] not in {"Training", "Completed"}:
        description = client.describe_training_job(TrainingJobName=job_name)
        primary_status = description["TrainingJobStatus"]
        secondary_status = description["SecondaryStatus"]
        print(
            "Current job status: [PrimaryStatus: {}, SecondaryStatus: {}]".format(
                primary_status, secondary_status
            )
        )
        time.sleep(30)
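
One way to make this kind of polling loop more defensive is to factor it into a helper that stops on terminal failure states and gives up after a timeout. The following is a minimal sketch, not part of the original lab: `wait_for_status` and its parameters are hypothetical names, and a fake `describe` function stands in for `client.describe_training_job` so the example runs without AWS access.

```python
import time


def wait_for_status(describe, ready_statuses, failed_statuses,
                    timeout_s=600, poll_s=30, sleep=time.sleep):
    """Poll describe() until its "SecondaryStatus" is ready; raise on failure or timeout."""
    waited = 0
    while True:
        status = describe()["SecondaryStatus"]
        if status in ready_statuses:
            return status
        if status in failed_statuses:
            raise RuntimeError("Training job ended with status {}".format(status))
        if waited >= timeout_s:
            raise TimeoutError("Gave up after {} seconds".format(waited))
        sleep(poll_s)
        waited += poll_s


# Stand-in for client.describe_training_job: replays a fixed sequence of statuses.
fake_responses = iter(
    [{"SecondaryStatus": s} for s in ("Starting", "Downloading", "Training")]
)
status = wait_for_status(
    describe=lambda: next(fake_responses),
    ready_statuses={"Training", "Completed"},
    failed_statuses={"Failed", "Stopped"},
    sleep=lambda seconds: None,  # skip real sleeping in this demonstration
)
print(status)  # Training
```

With the real client, `describe` could be `lambda: client.describe_training_job(TrainingJobName=job_name)`, leaving the default `sleep` in place.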

## Visualize model training in real-time

In the final section, the code will retrieve the tensors from the bottleneck layer and the input/output tensors while the model is still training. Once we have the tensors, we will compute t-SNE and plot the results.

Helper function to compute stochastic neighbor embeddings:

In [None]:
#Helper function to compute stochastic neighbor embeddings:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns

def compute_tsne(tensors, labels):

    # compute TSNE
    tsne = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=300)
    tsne_results = tsne.fit_transform(tensors)

    # add results to dictionary
    data = {}
    data["x"] = tsne_results[:, 0]
    data["y"] = tsne_results[:, 1]
    data["z"] = labels

    return data

#Helper function to plot t-SNE results and autoencoder input/output:
def plot_autoencoder_data(tsne_results, input_tensor, output_tensor):
    fig, (ax0, ax1, ax2) = plt.subplots(
        ncols=3, figsize=(30, 15), gridspec_kw={"width_ratios": [1, 1, 3]}
    )
    plt.rcParams.update({"font.size": 20})
    ax0.imshow(input_tensor, cmap=plt.cm.gray)
    ax1.imshow(output_tensor, cmap=plt.cm.gray)
    ax0.set_axis_off()
    ax1.set_axis_off()
    ax2.set_axis_off()
    ax0.set_title("autoencoder input")
    ax1.set_title("autoencoder output")
    # `step` is the global loop variable set by the rendering loop in the final cell
    plt.title("Step " + str(step))
    sns.scatterplot(
        x="x", y="y", hue="z", data=tsne_results, palette="viridis", legend="full", s=100
    )
    plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)
    plt.axis("off")
    plt.show()
    plt.clf()

#Create a trial
from smdebug.trials import create_trial

trial = create_trial(estimator.latest_job_debugger_artifacts_path())

#Retrieve available steps
# trial.steps() returns a list, so wait until it is non-empty
steps = []
while len(steps) == 0:
    print("Waiting for tensors to become available...")
    time.sleep(3)
    steps = trial.steps()
print("\nDone")

print("Getting tensors...")
rendered_steps = []

#Define the tensors to retrieve
label_input = "convolutionalautoencoder0_input_1"
autoencoder_bottleneck = "convolutionalautoencoder0_hybridsequential0_dense0_output_0"
autoencoder_input = "l2loss0_input_1"
autoencoder_output = "l2loss0_input_0"

Finally, the following code retrieves the tensors and computes t-SNE, and plots the data on a map. 

The model training usually takes between 10 and 15 minutes; however, the tensors produced by the training are continuously emitted to the training S3 bucket. This allows the results of the training to be visualised in real time.

In this step we can visualise how the autoencoder output has transformed the input image by passing it through the CNN. We can also see the clustering of the classes (the digits 0 through 9 of the MNIST dataset) and how the clustering improves over the training run. Once the model training has completed, the visualisation for step 990 should show a very good clustering of the classes.
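
Because new steps keep arriving while the job runs, the rendering loop only needs to process steps it has not plotted yet. The bookkeeping can be sketched on its own as follows. `steps_still_to_render` and the step snapshots are hypothetical, and a plain set difference is used here, which gives the same result as a symmetric difference as long as every rendered step is still listed as available.

```python
def steps_still_to_render(available_steps, rendered_steps):
    """Return, in ascending order, the available steps that have not been rendered yet."""
    return sorted(set(available_steps) - set(rendered_steps))


rendered = []
history = []
# Hypothetical snapshots of the available steps as training progresses.
for available in ([0, 10], [0, 10, 20, 30], [0, 10, 20, 30]):
    todo = steps_still_to_render(available, rendered)
    history.append(todo)
    rendered.extend(todo)

print(history)  # [[0, 10], [20, 30], []]
```

The third snapshot yields an empty list, which is the situation the loop is in whenever no new tensors have been written since the last poll.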

In [None]:
from smdebug.exceptions import TensorUnavailableForStep
from smdebug.mxnet import modes


loaded_all_steps = False
while not loaded_all_steps:

    # get available steps
    loaded_all_steps = trial.loaded_all_steps
    steps = trial.steps(mode=modes.EVAL)

    # quick way to get diff between two lists
    steps_to_render = list(set(steps).symmetric_difference(set(rendered_steps)))

    tensors = []
    labels = []

    # iterate over available steps
    for step in sorted(steps_to_render):
        try:
            if len(tensors) > 1000:
                tensors = []
                labels = []

            # get tensor from bottleneck layer and label
            tensor = trial.tensor(autoencoder_bottleneck).value(step_num=step, mode=modes.EVAL)
            label = trial.tensor(label_input).value(step_num=step, mode=modes.EVAL)
            for batch in range(tensor.shape[0]):
                tensors.append(tensor[batch, :])
                labels.append(label[batch])

            # compute tsne
            tsne_results = compute_tsne(tensors, labels)

            # get autoencoder input and output
            input_tensor = trial.tensor(autoencoder_input).value(step_num=step, mode=modes.EVAL)[
                0, 0, :, :
            ]
            output_tensor = trial.tensor(autoencoder_output).value(step_num=step, mode=modes.EVAL)[
                0, 0, :, :
            ]

            # plot results
            plot_autoencoder_data(tsne_results, input_tensor, output_tensor)

        except TensorUnavailableForStep:
            print("Tensor unavailable for step {}".format(step))

    rendered_steps.extend(steps_to_render)

    time.sleep(5)

print("\nDone")
