## Questions

1. List all previous experiments using the `smexperiments` SDK.
2. Delete all `Experiments`, `Trials`, and `TrialComponents` from experiments run before today.
3. Let's extend our scikit-learn training example from chapter 3 to incorporate experiment tracking:
    1. Make a copy of `sklearn_rf.py` and name it `sklearn-experiment.py`. Update `sklearn-experiment.py` to accept three additional command-line arguments: 1) **max_depth**: `type=int`, `default=None`; 2) **min_samples_split**: `type=int`, `default=2`; and 3) **validation**: `type=str`, `default=os.environ['SM_CHANNEL_VALIDATION']`. Also update the script to calculate the accuracy on the validation set and print it in the format `Test Accuracy: <value>;`.
    2. In a Jupyter notebook, create an `Experiment` object with `experiment_name` of `churn-experiment`.
    3. For each `max_depth` in `[2, 4, 8, 10]` and `min_samples_split` in `[2, 5, 10, 20]`:
        * Create a `Trial` object and a `sagemaker.sklearn.estimator.SKLearn` object for training.
        * Pass the appropriate hyperparameter values to the `hyperparameters` argument.
        * Pass `metric_definitions=[{'Name':'test:accuracy', 'Regex':'Test Accuracy: (.*?);'}]`.
    4. Fit the estimator and pass in the appropriate experiment configuration and train/validation files. Set `wait=False` in order to run all of the training jobs in parallel.
    5. When training completes, create an `ExperimentAnalytics` object to retrieve the results. Which combination of hyperparameters maximizes the metric on the validation set?

## Answers

### 1.

In [1]:
from smexperiments import experiment
import smexperiments

In [None]:
for exp in experiment.Experiment.list():
    print(exp.experiment_name)

### 2.

In [None]:
import datetime

In [14]:
for exp_sum in experiment.Experiment.list(created_before=datetime.datetime(2020, 8, 29)):
    exp = experiment.Experiment.load(experiment_name=exp_sum.experiment_name)
    exp.delete_all(action='--force')
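
The cell above hard-codes the cutoff date. As a minimal sketch, the cutoff for "before today" could instead be computed at run time. The helper name `start_of_today` is hypothetical, and using UTC is an assumption; adjust the time zone to match how you want "today" defined.

```python
import datetime


def start_of_today(now=None):
    """Return midnight (UTC) of the current day as a timezone-aware datetime.

    `now` exists only to make the function easy to test; by default the
    current UTC time is used.
    """
    if now is None:
        now = datetime.datetime.now(datetime.timezone.utc)
    return now.replace(hour=0, minute=0, second=0, microsecond=0)


cutoff = start_of_today()
print(cutoff.isoformat())
```

The result can be passed as `created_before=cutoff` to `experiment.Experiment.list` in place of the literal `datetime.datetime(2020, 8, 29)`.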

### 3.A

In [20]:
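# `!cd` runs in a throwaway subshell, so use the %cd magic to change the notebook's working directory.
%cd /root/sagemaker-course/notebooks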

In [17]:
# %load ../scripts/sklearn/sklearn_rf.py
from __future__ import print_function

import argparse
import os
import pandas as pd

from sklearn import ensemble
from sklearn.externals import joblib


if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    # Hyperparameters are described here. In this simple example we are just including one hyperparameter.
    parser.add_argument('--n_estimators', type=int, default=100)

    # Sagemaker specific arguments. Defaults are set in the environment variables.
    parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])

    args = parser.parse_args()

    # Take the set of files and read them all into a single pandas dataframe
    input_files = [ os.path.join(args.train, file) for file in os.listdir(args.train) ]
    if len(input_files) == 0:
        raise ValueError(('There are no files in {}.\n' +
                          'This usually indicates that the channel ({}) was incorrectly specified,\n' +
                          'the data specification in S3 was incorrectly specified or the role specified\n' +
                          'does not have permission to access the data.').format(args.train, "train"))
    raw_data = [ pd.read_csv(file, header=None, engine="python") for file in input_files ]
    train_data = pd.concat(raw_data)

    # labels are in the first column
    train_y = train_data.ix[:,0]
    train_X = train_data.ix[:,1:]

    # Here we support a single hyperparameter, 'n_estimators'. Note that you can add as many
    # as your training may require in the ArgumentParser above.
    n_estimators = args.n_estimators

    # Now use scikit-learn's random forest classifier to train the model.
    clf = ensemble.RandomForestClassifier(n_estimators=n_estimators)
    clf = clf.fit(train_X, train_y)

    # Serialize the model.
    joblib.dump(clf, os.path.join(args.model_dir, "model.joblib"))


def model_fn(model_dir):
    """Deserialize and return fitted model
    
    Note that this should have the same name as the serialized model in the main method
    """
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf


In [38]:
%%writefile ../scripts/sklearn/sklearn-experiment.py

from __future__ import print_function

import argparse
import os
import pandas as pd

from sklearn import ensemble
from sklearn.externals import joblib
from sklearn.metrics import accuracy_score


if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    # Hyperparameters are described here. max_depth and min_samples_split are added for the experiment.
    parser.add_argument('--n_estimators', type=int, default=100)
    parser.add_argument('--max_depth', type=int, default=None)
    parser.add_argument('--min_samples_split', type=int, default=2)

    # Sagemaker specific arguments. Defaults are set in the environment variables.
    parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
    parser.add_argument('--validation', type=str, default=os.environ['SM_CHANNEL_VALIDATION'])

    args = parser.parse_args()

    # Take the set of files and read them all into a single pandas dataframe
    input_files = [ os.path.join(args.train, file) for file in os.listdir(args.train) ]
    if len(input_files) == 0:
        raise ValueError(('There are no files in {}.\n' +
                          'This usually indicates that the channel ({}) was incorrectly specified,\n' +
                          'the data specification in S3 was incorrectly specified or the role specified\n' +
                          'does not have permission to access the data.').format(args.train, "train"))
    raw_data = [ pd.read_csv(file, header=None, engine="python") for file in input_files ]
    train_data = pd.concat(raw_data)

    # labels are in the first column
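    # (.iloc replaces the deprecated .ix indexer; the columns are positional because header=None)
    train_y = train_data.iloc[:, 0]
    train_X = train_data.iloc[:, 1:]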
    
    input_files = [ os.path.join(args.validation, file) for file in os.listdir(args.validation) ]
    if len(input_files) == 0:
        raise ValueError(('There are no files in {}.\n' +
                          'This usually indicates that the channel ({}) was incorrectly specified,\n' +
                          'the data specification in S3 was incorrectly specified or the role specified\n' +
                          'does not have permission to access the data.').format(args.validation, "validation"))
    raw_data = [ pd.read_csv(file, header=None, engine="python") for file in input_files ]
    validation_data = pd.concat(raw_data)
    
    # labels are in the first column
    validation_y = validation_data.iloc[:, 0]
    validation_X = validation_data.iloc[:, 1:]

    # Read the hyperparameters parsed above. Note that you can add as many
    # as your training may require in the ArgumentParser above.
    n_estimators = args.n_estimators
    max_depth = args.max_depth
    min_samples_split = args.min_samples_split

    # Now use scikit-learn's random forest classifier to train the model.
    clf = ensemble.RandomForestClassifier(n_estimators=n_estimators,
                                          max_depth=max_depth,
                                          min_samples_split=min_samples_split)
    clf = clf.fit(train_X, train_y)
    
    y_pred = clf.predict(validation_X)
    
    test_accuracy = accuracy_score(validation_y, y_pred)
    
    print(f'Test Accuracy: {test_accuracy};')

    # Serialize the model.
    joblib.dump(clf, os.path.join(args.model_dir, "model.joblib"))


def model_fn(model_dir):
    """Deserialize and return fitted model
    
    Note that this should have the same name as the serialized model in the main method
    """
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf


Overwriting ../scripts/sklearn/sklearn-experiment.py
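
SageMaker applies the `Regex` from `metric_definitions` to the training job's log output and records the captured group as the metric value. As a quick local check, the sketch below applies the same pattern with Python's `re` module to a line in the format the script prints. The accuracy value in `sample_log_line` is a made-up placeholder, not a result from this experiment.

```python
import re

# Same pattern as the 'Regex' entry in metric_definitions.
METRIC_REGEX = r'Test Accuracy: (.*?);'

# Placeholder line in the format printed by sklearn-experiment.py.
sample_log_line = 'Test Accuracy: 0.5;'

match = re.search(METRIC_REGEX, sample_log_line)
value = float(match.group(1)) if match else None
print(value)

# Without the trailing ';' the lazy group has nothing to stop at,
# so the pattern does not match and no metric would be captured.
no_match = re.search(METRIC_REGEX, 'Test Accuracy: 0.5')
print(no_match)
```

This is why the script ends the printed line with a semicolon: it gives the pattern an unambiguous terminator for the captured value.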


### 3.B

In [26]:
import boto3
import pandas as pd
import sagemaker
from sagemaker import get_execution_role

# S3 bucket information
BUCKET = 'sagemaker-course-20200812'
PREFIX = 'churn'
LOCAL_DATA_DIRECTORY = f'../data/{PREFIX}'
print(f"Artifacts will be written to s3://{BUCKET}/{PREFIX}")

# Session variables we'll use throughout the notebook
sagemaker_session = sagemaker.Session()
boto_session = sagemaker_session.boto_session
sagemaker_client = boto_session.client('sagemaker')
role = get_execution_role()

Artifacts will be written to s3://sagemaker-course-20200812/churn


In [39]:
exp = experiment.Experiment.create(experiment_name='churn-experiment', 
                                   description='Experiment for churn dataset', 
                                   sagemaker_boto_client=sagemaker_client)

In [40]:
from sagemaker.sklearn.estimator import SKLearn
from smexperiments.trial import Trial

In [41]:
s3_input_train = sagemaker.TrainingInput(s3_data='s3://sagemaker-course-20200812/churn/train.csv',
                                         content_type='csv')
s3_input_validation = sagemaker.TrainingInput(s3_data='s3://sagemaker-course-20200812/churn/validation.csv',
                                              content_type='csv')

for max_depth in [2, 4, 8, 10]:
    for min_samples_split in [2, 5, 10, 20]:

        trial_name = f"sklearn-rf-{max_depth}-max-depth-{min_samples_split}-min-samples-split"
        sklearn_trial = Trial.create(
            trial_name=trial_name, 
            experiment_name=exp.experiment_name,
            sagemaker_boto_client=sagemaker_client,
        )
        
        sklearn_estimator = SKLearn(
            framework_version='0.20.0',
            py_version='py3',
            entry_point='../scripts/sklearn/sklearn-experiment.py',
            code_location=f's3://{BUCKET}/{PREFIX}',
            hyperparameters={'n_estimators': 50,
                             'max_depth': max_depth,
                             'min_samples_split': min_samples_split},
            role=role,
            instance_type='ml.c4.xlarge',
            output_path=f's3://{BUCKET}/{PREFIX}',
            base_job_name='sklearn-experiment',
            sagemaker_session=sagemaker_session,
            metric_definitions=[
                {'Name':'test:accuracy', 'Regex':'Test Accuracy: (.*?);'}]
        )
        
        sklearn_estimator.fit({'train': s3_input_train, 'validation': s3_input_validation},
                              experiment_config={
                                  'TrialName': sklearn_trial.trial_name,
                                  'TrialComponentDisplayName': 'Training'
                              },
                              wait=False)

INFO:sagemaker.image_uris:Same images used for training and inference. Defaulting to image scope: inference.
INFO:sagemaker:Creating training-job with name: sklearn-experiment-2020-08-29-15-46-08-392
INFO:sagemaker.image_uris:Same images used for training and inference. Defaulting to image scope: inference.
INFO:sagemaker:Creating training-job with name: sklearn-experiment-2020-08-29-15-46-08-744
INFO:sagemaker.image_uris:Same images used for training and inference. Defaulting to image scope: inference.
INFO:sagemaker:Creating training-job with name: sklearn-experiment-2020-08-29-15-46-11-734
INFO:sagemaker.image_uris:Same images used for training and inference. Defaulting to image scope: inference.
INFO:sagemaker:Creating training-job with name: sklearn-experiment-2020-08-29-15-46-12-198
INFO:sagemaker.image_uris:Same images used for training and inference. Defaulting to image scope: inference.
INFO:sagemaker:Creating training-job with name: sklearn-experiment-2020-08-29-15-46-17-929
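
The nested loops above start one training job per hyperparameter combination. As a small sketch, the same grid can be enumerated with `itertools.product` to check how many jobs to expect and what the trial names will look like before launching anything. The variable names here are illustrative.

```python
import itertools

max_depths = [2, 4, 8, 10]
min_samples_splits = [2, 5, 10, 20]

# Every (max_depth, min_samples_split) pair, in the same order as the nested loops.
grid = list(itertools.product(max_depths, min_samples_splits))

# Trial names built with the same f-string pattern used in the training loop.
trial_names = [
    f"sklearn-rf-{max_depth}-max-depth-{min_samples_split}-min-samples-split"
    for max_depth, min_samples_split in grid
]

print(len(grid))        # number of training jobs the loop will start
print(trial_names[0])
```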


In [42]:
from sagemaker.analytics import ExperimentAnalytics

In [43]:
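# Run this once all training jobs have completed; the script prints the
# accuracy at the end of training, so unfinished jobs have no metric yet.
analytics = ExperimentAnalytics(experiment_name=exp.experiment_name,
                                sagemaker_session=sagemaker_session)
df = analytics.dataframe()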

In [44]:
df.columns

Index(['TrialComponentName', 'DisplayName', 'SourceArn', 'SageMaker.ImageUri',
       'SageMaker.InstanceCount', 'SageMaker.InstanceType',
       'SageMaker.VolumeSizeInGB', 'max_depth', 'min_samples_split',
       'n_estimators', 'sagemaker_container_log_level', 'sagemaker_job_name',
       'sagemaker_program', 'sagemaker_region', 'sagemaker_submit_directory',
       'test:accuracy - Min', 'test:accuracy - Max', 'test:accuracy - Avg',
       'test:accuracy - StdDev', 'test:accuracy - Last',
       'test:accuracy - Count', 'train - MediaType', 'train - Value',
       'validation - MediaType', 'validation - Value',
       'SageMaker.DebugHookOutput - MediaType',
       'SageMaker.DebugHookOutput - Value',
       'SageMaker.ModelArtifact - MediaType',
       'SageMaker.ModelArtifact - Value'],
      dtype='object')
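
To answer the last part of the question, the rows of `df` can be ranked by the recorded metric. The sketch below does this with plain Python on a list of row dictionaries, which `df.to_dict('records')` would produce. The helper name `best_trial` is hypothetical, the choice of the `test:accuracy - Last` column is an assumption, and the rows in `example_rows` are made-up placeholders, not results from this experiment.

```python
import math


def best_trial(rows, metric='test:accuracy - Last'):
    """Return the row with the highest metric value, ignoring rows without one."""
    scored = [
        row for row in rows
        if row.get(metric) is not None and not math.isnan(row[metric])
    ]
    if not scored:
        raise ValueError(f'No rows have a value for {metric!r}')
    return max(scored, key=lambda row: row[metric])


# Placeholder rows with made-up metric values, for illustration only.
example_rows = [
    {'max_depth': 2.0, 'min_samples_split': 2.0, 'test:accuracy - Last': 0.25},
    {'max_depth': 4.0, 'min_samples_split': 5.0, 'test:accuracy - Last': 0.75},
    {'max_depth': 8.0, 'min_samples_split': 10.0, 'test:accuracy - Last': float('nan')},
]

best = best_trial(example_rows)
print(best['max_depth'], best['min_samples_split'])
```

In the notebook this would be called as `best_trial(df.to_dict('records'))` once all training jobs have finished; the `max_depth` and `min_samples_split` entries of the returned row give the winning combination.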