### Install Python SDKs

In [23]:
import sys

In [24]:
!{sys.executable} -m pip install sagemaker-experiments==0.1.24

Defaulting to user installation because normal site-packages is not writeable


### Install PyTroch

In [25]:
# pytorch version needs to be the same in both the notebook instance and the training job container
# https://github.com/pytorch/pytorch/issues/25214
!{sys.executable} -m pip install torch==1.1.0
!{sys.executable} -m pip install torchvision==0.3.0
!{sys.executable} -m pip install pillow==6.2.2
!{sys.executable} -m pip install --upgrade sagemaker

Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable


### Setup

In [5]:
import time

import boto3
import numpy as np
import pandas as pd
from IPython.display import set_matplotlib_formats
from matplotlib import pyplot as plt
from torchvision import datasets, transforms

import sagemaker
from sagemaker import get_execution_role
from sagemaker.session import Session
from sagemaker.analytics import ExperimentAnalytics

from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

set_matplotlib_formats("retina")



In [11]:
sess = boto3.Session()
sm = sess.client("sagemaker")
role = get_execution_role()

### Create a S3 bucket to hold data

### Dataset

Images in the dataset, we upload it to s3.

Now lets track the parameters from the data pre-processing step.

### Step 1 - Set up the Experiment

Create an experiment to track all the model training iterations. Experiments are a great way to organize your data science work. You can create experiments to organize all your model development work for : [1] a business use case you are addressing (e.g. create experiment named “customer churn prediction”), or [2] a data science team that owns the experiment (e.g. create experiment named “marketing analytics experiment”), or [3] a specific data science and ML project. Think of it as a “folder” for organizing your “files”.

### Create an Experiment

In [13]:
seti_experiment = Experiment.create(
    experiment_name=f"SETI-Breakthrough-Listen-{int(time.time())}",
    description="Classification of\seti",
    sagemaker_boto_client=sm,
)
print(seti_experiment)

Experiment(sagemaker_boto_client=<botocore.client.SageMaker object at 0x7f06b95cbf10>,experiment_name='SETI-Breakthrough-Listen-1626519963',description='Classification of\\seti',tags=None,experiment_arn='arn:aws:sagemaker:us-east-1:435647692548:experiment/seti-breakthrough-listen-1626519963',response_metadata={'RequestId': '1e5ba14f-31df-4e8d-a1c9-1b261a6c57f3', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '1e5ba14f-31df-4e8d-a1c9-1b261a6c57f3', 'content-type': 'application/x-amz-json-1.1', 'content-length': '107', 'date': 'Sat, 17 Jul 2021 11:06:02 GMT'}, 'RetryAttempts': 0})


### Step 2 - Track Experiment
### Now create a Trial for each training run to track the it's inputs, parameters, and metrics.
While training the CNN model on SageMaker, we will experiment with several values for the number of hidden channel in the model. We will create a Trial to track each training job run. We will also create a TrialComponent from the tracker we created before, and add to the Trial. This will enrich the Trial with the parameters we captured from the data pre-processing stage.

Note the execution of the following code takes a while.

In [14]:
from sagemaker.pytorch import PyTorch, PyTorchModel

In [15]:
hidden_channel_trial_name_map = {}

If you want to run the following training jobs asynchronously, you may need to increase your resource limit. Otherwise, you can run them sequentially.

In [16]:
preprocessing_trial_component = tracker.trial_component

NameError: name 'tracker' is not defined

In [26]:
for i, num_hidden_channel in enumerate([2, 5]):
    # create trial
    trial_name = f"cnn-training-job-{num_hidden_channel}-hidden-channels-{int(time.time())}"
    cnn_trial = Trial.create(
        trial_name=trial_name,
        experiment_name=seti_experiment.experiment_name,
        sagemaker_boto_client=sm,
    )
    hidden_channel_trial_name_map[num_hidden_channel] = trial_name

    # associate the proprocessing trial component with the current trial
    #cnn_trial.add_trial_component(preprocessing_trial_component)

    # all input configurations, parameters, and metrics specified in estimator
    # definition are automatically tracked
    estimator = PyTorch(
        #py_version="py3",
        entry_point="./seti.py",
        role=role,
        sagemaker_session=sagemaker.Session(sagemaker_client=sm),
        #framework_version="1.6.0",
        instance_count=1,
        instance_type="ml.p2.xlarge",
        image_uri = "435647692548.dkr.ecr.us-east-1.amazonaws.com/pytorch-extend-image:latest",
        hyperparameters={
            "epochs": 2,
            "backend": "gloo",
            "optimizer": "Adam",
        },
        metric_definitions=[
            {"Name": "train:loss", "Regex": "Train Loss: (.*?);"},
            {"Name": "test:loss", "Regex": "Test Average loss: (.*?),"},
            {"Name": "test:accuracy", "Regex": "Test Accuracy: (.*?)%;"},
        ],
        enable_sagemaker_metrics=True,
    )

    cnn_training_job_name = "cnn-training-job-{}".format(int(time.time()))

    # Now associate the estimator with the Experiment and Trial
    estimator.fit(
        inputs={"training": "s3://cue-kaggle-seti-us-east-1-435647692548/seti/"},
        job_name=cnn_training_job_name,
        experiment_config={
            "TrialName": cnn_trial.trial_name,
            "TrialComponentDisplayName": "Training",
        },
        wait=True,
    )

    # give it a while before dispatching the next training job
    time.sleep(2)

INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating training-job with name: cnn-training-job-1626526073


2021-07-17 12:47:54 Starting - Starting the training job...
2021-07-17 12:48:16 Starting - Launching requested ML instancesProfilerReport-1626526073: InProgress
......
2021-07-17 12:49:17 Starting - Preparing the instances for training......
2021-07-17 12:50:19 Downloading - Downloading input data................................................
2021-07-17 12:58:20 Failed - Training job failed
..

UnexpectedStatusException: Error for Training job cnn-training-job-1626526073: Failed. Reason: ClientError: Data download failed:Could not download s3://cue-kaggle-seti-us-east-1-435647692548/seti/old_leaky_data/train_old/0/08398d02eb7c.npy: insufficient disk space

### Compare the model training runs for an experiment

Now we will use the analytics capabilities of Python SDK to query and compare the training runs for identifying the best model produced by our experiment. You can retrieve trial components by using a search expression.

### Some Simple Analyses

In [None]:
search_expression = {
    "Filters": [
        {
            "Name": "DisplayName",
            "Operator": "Equals",
            "Value": "Training",
        }
    ],
}

In [None]:
trial_component_analytics = ExperimentAnalytics(
    sagemaker_session=Session(sess, sm),
    experiment_name=mnist_experiment.experiment_name,
    search_expression=search_expression,
    sort_by="metrics.test:accuracy.max",
    sort_order="Descending",
    metric_names=["test:accuracy"],
    parameter_names=["hidden_channels", "epochs", "dropout", "optimizer"],
)

In [18]:
trial_component_analytics.dataframe()

Unnamed: 0,TrialComponentName,DisplayName,SourceArn,dropout,epochs,hidden_channels,optimizer,test:accuracy - Min,test:accuracy - Max,test:accuracy - Avg,...,test:accuracy - Last,test:accuracy - Count,training - MediaType,training - Value,SageMaker.DebugHookOutput - MediaType,SageMaker.DebugHookOutput - Value,SageMaker.ModelArtifact - MediaType,SageMaker.ModelArtifact - Value,Trials,Experiments
0,cnn-training-job-1621777148-aws-training-job,Training,arn:aws:sagemaker:us-east-1:435647692548:train...,0.2,2.0,2.0,"""sgd""",95.0,97.0,96.0,...,97.0,2,,s3://sagemaker-experiments-us-east-1-435647692...,,s3://sagemaker-us-east-1-435647692548/,,s3://sagemaker-us-east-1-435647692548/cnn-trai...,[cnn-training-job-2-hidden-channels-1621777148],[mnist-hand-written-digits-classification-1621...
1,cnn-training-job-1621778004-aws-training-job,Training,arn:aws:sagemaker:us-east-1:435647692548:train...,0.2,2.0,20.0,"""sgd""",96.0,97.0,96.5,...,97.0,2,,s3://sagemaker-experiments-us-east-1-435647692...,,s3://sagemaker-us-east-1-435647692548/,,s3://sagemaker-us-east-1-435647692548/cnn-trai...,[cnn-training-job-20-hidden-channels-1621778004],[mnist-hand-written-digits-classification-1621...
2,cnn-training-job-1621777719-aws-training-job,Training,arn:aws:sagemaker:us-east-1:435647692548:train...,0.2,2.0,10.0,"""sgd""",95.0,97.0,96.0,...,97.0,2,,s3://sagemaker-experiments-us-east-1-435647692...,,s3://sagemaker-us-east-1-435647692548/,,s3://sagemaker-us-east-1-435647692548/cnn-trai...,[cnn-training-job-10-hidden-channels-1621777719],[mnist-hand-written-digits-classification-1621...
3,cnn-training-job-1621778289-aws-training-job,Training,arn:aws:sagemaker:us-east-1:435647692548:train...,0.2,2.0,32.0,"""sgd""",95.0,97.0,96.0,...,97.0,2,,s3://sagemaker-experiments-us-east-1-435647692...,,s3://sagemaker-us-east-1-435647692548/,,s3://sagemaker-us-east-1-435647692548/cnn-trai...,[cnn-training-job-32-hidden-channels-1621778289],[mnist-hand-written-digits-classification-1621...
4,cnn-training-job-1621777434-aws-training-job,Training,arn:aws:sagemaker:us-east-1:435647692548:train...,0.2,2.0,5.0,"""sgd""",94.0,96.0,95.0,...,96.0,2,,s3://sagemaker-experiments-us-east-1-435647692...,,s3://sagemaker-us-east-1-435647692548/,,s3://sagemaker-us-east-1-435647692548/cnn-trai...,[cnn-training-job-5-hidden-channels-1621777434],[mnist-hand-written-digits-classification-1621...


To isolate and measure the impact of change in hidden channels on model accuracy, we vary the number of hidden channel and fix the value for other hyperparameters.

Next let's look at an example of tracing the lineage of a model by accessing the data tracked by SageMaker Experiments for `cnn-training-job-2-hidden-channels` trial

In [19]:
lineage_table = ExperimentAnalytics(
    sagemaker_session=Session(sess, sm),
    search_expression={
        "Filters": [
            {
                "Name": "Parents.TrialName",
                "Operator": "Equals",
                "Value": hidden_channel_trial_name_map[2],
            }
        ]
    },
    sort_by="CreationTime",
    sort_order="Ascending",
)

In [20]:
lineage_table.dataframe()

Unnamed: 0,TrialComponentName,DisplayName,normalization_mean,normalization_std,mnist-dataset - MediaType,mnist-dataset - Value,Trials,Experiments,SourceArn,SageMaker.ImageUri,...,train:loss - Avg,train:loss - StdDev,train:loss - Last,train:loss - Count,training - MediaType,training - Value,SageMaker.DebugHookOutput - MediaType,SageMaker.DebugHookOutput - Value,SageMaker.ModelArtifact - MediaType,SageMaker.ModelArtifact - Value
0,TrialComponent-2021-05-23-133852-xitn,Preprocessing,0.1307,0.3081,s3/uri,s3://sagemaker-experiments-us-east-1-435647692...,[cnn-training-job-5-hidden-channels-1621777434...,[mnist-hand-written-digits-classification-1621...,,,...,,,,,,,,,,
1,cnn-training-job-1621777148-aws-training-job,Training,,,,,[cnn-training-job-2-hidden-channels-1621777148],[mnist-hand-written-digits-classification-1621...,arn:aws:sagemaker:us-east-1:435647692548:train...,520713654638.dkr.ecr.us-east-1.amazonaws.com/s...,...,0.456703,0.352488,0.157259,18.0,,s3://sagemaker-experiments-us-east-1-435647692...,,s3://sagemaker-us-east-1-435647692548/,,s3://sagemaker-us-east-1-435647692548/cnn-trai...


## Deploy endpoint for the best training-job / trial component

Now we'll take the best (as sorted) and create an endpoint for it.

In [21]:
# Pulling best based on sort in the analytics/dataframe so first is best....
best_trial_component_name = trial_component_analytics.dataframe().iloc[0]["TrialComponentName"]
best_trial_component = TrialComponent.load(best_trial_component_name)

model_data = best_trial_component.output_artifacts["SageMaker.ModelArtifact"].value
env = {
    "hidden_channels": str(int(best_trial_component.parameters["hidden_channels"])),
    "dropout": str(best_trial_component.parameters["dropout"]),
    "kernel_size": str(int(best_trial_component.parameters["kernel_size"])),
}
model = PyTorchModel(
    model_data,
    role,
    "./mnist.py",
    py_version="py3",
    env=env,
    sagemaker_session=sagemaker.Session(sagemaker_client=sm),
    framework_version="1.1.0",
    name=best_trial_component.trial_component_name,
)

predictor = model.deploy(instance_type="ml.m5.xlarge", initial_instance_count=1)

INFO:sagemaker:Creating model with name: cnn-training-job-1621777148-aws-training-job
INFO:sagemaker:Creating endpoint with name cnn-training-job-1621777148-aws-trainin-2021-05-23-14-02-56-920


-------------!

## Cleanup

Once we're doing don't forget to clean up the endpoint to prevent unnecessary billing.

> Trial components can exist independent of trials and experiments. You might want keep them if you plan on further exploration. If so, comment out tc.delete()

In [22]:
predictor.delete_endpoint()

INFO:sagemaker:Deleting endpoint configuration with name: cnn-training-job-1621777148-aws-trainin-2021-05-23-14-02-56-920
INFO:sagemaker:Deleting endpoint with name: cnn-training-job-1621777148-aws-trainin-2021-05-23-14-02-56-920


In [23]:
mnist_experiment.delete_all(action="--force")

## Contact
Submit any questions or issues to https://github.com/aws/sagemaker-experiments/issues or mention @aws/sagemakerexperimentsadmin 