# Amazon SageMaker AutoPilot (ML Tabular Problem)

Amazon SageMaker Autopilot automatically trains and tunes the best machine learning models for classification or regression based on your data, while allowing you to maintain full control and visibility.

Building machine learning (ML) models has traditionally required a binary choice. On one hand, you could manually prepare the features, select the algorithm, and optimize the model parameters in order to have full control over the model design and understand all the thought that went into creating it. However, this approach requires deep ML expertise. On the other hand, if you don’t have that expertise, you could use an automated approach (AutoML) to model generation that takes care of all of the heavy lifting, but provides very little visibility into how the model was created. While a model created with AutoML can work well, you may have less trust in it because you can’t understand what went into it, you can’t recreate it, and you can’t learn best practices which may help you in the future.

Amazon SageMaker Autopilot eliminates this choice, allowing you to automatically build machine learning models without compromises. With SageMaker Autopilot, you provide a tabular dataset and select the target column to predict, which can be a number (such as a house price, called regression), or a category (such as spam/not spam, called classification). SageMaker Autopilot will automatically explore different solutions to find the best model. You then can directly deploy the model to production with just one click, or iterate on the recommended solutions with Amazon SageMaker Studio to further improve the model quality.

![AUTOML](./img/SageMaker-AutoML.png)

In [26]:
import sagemaker
import smdebug_rulesconfig as rule_configs
import boto3
import os
import sys
import numpy as np 
import pandas as pd

print(sagemaker.__version__)

1.64.0


In [27]:
sess   = sagemaker.Session()
bucket = 'hyunjoon-sagemaker-us' # or sess.default_bucket()                     
prefix = 'sagemaker/DEMO-automl-dm'
region = boto3.Session().region_name

print('Sagemaker session :', sess)
print('S3 bucket :', bucket)
print('Prefix :', prefix)
print('Region selected :', region)

Sagemaker session : <sagemaker.session.Session object at 0x00000163480C9E88>
S3 bucket : hyunjoon-sagemaker-us
Prefix : sagemaker/DEMO-automl-dm
Region selected : us-west-2


# 1. Load Data

Download the direct marketing dataset from UCI's ML Repository.
The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').

[Download Dataset Manually here](https://archive.ics.uci.edu/ml/datasets/bank+marketing)

#### Wget Download
!wget -N https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip
!unzip -o bank-additional.zip

In [17]:
path = 'C:\\Users\\bokhy\\Desktop\\Python\\github\\amazon-sagemaker\\data'

In [18]:
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 50)         # Keep the output on one page

# Read data
data = pd.read_csv(os.path.join(path, 'bank-additional-full.csv'), sep=';')
data.head() # Show the first 5 rows

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,261,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,149,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,226,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,151,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,307,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


# 2. PreProcess the Data

In [19]:
# Shuffle the dataset, then keep 95% for training and 5% for testing
train_data, test_data, _ = np.split(data.sample(frac=1, random_state=123), 
                                                  [int(0.95 * len(data)), int(len(data))])  

# Save to CSV files
train_data.to_csv('./data/automl-train.csv', index=False, header=True, sep=',') # Need to keep column names
test_data.to_csv('./data/automl-test.csv', index=False, header=True, sep=',')
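As a quick sanity check on the split above, the sketch below works out how many rows land in each file. The helper name `split_sizes` and the row count of 41,188 are assumptions for illustration (the row count is not printed in this notebook, though it agrees with the 2,060 test predictions tallied in section 7).

```python
def split_sizes(n_rows, train_frac=0.95):
    """Return (train_rows, test_rows) for a shuffle-then-cut split."""
    # Same cut point as the first index passed to np.split above
    cut = int(train_frac * n_rows)
    return cut, n_rows - cut


# 41188 is an assumed row count for bank-additional-full.csv
train_rows, test_rows = split_sizes(41188)
print(train_rows, test_rows)  # prints: 39128 2060
```

With these cut points, the third piece returned by `np.split` is always empty, which is why the cell above discards it with `_`.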

# 3. Upload data to S3 bucket

### No preprocessing on data is needed in SageMaker AutoML! 

In [22]:
s3_input_data = sess.upload_data(path="./data/automl-train.csv", key_prefix=prefix + "/input")
print(s3_input_data)

s3://sagemaker-us-west-2-570447867175/sagemaker/DEMO-automl-dm/input/automl-train.csv


# 4. Setting up SageMaker AutoPilot training parameter

Doc: https://sagemaker.readthedocs.io/en/stable/automl.html

You need to create a role in the [IAM Console](https://console.aws.amazon.com/iam).

For detailed instructions, please refer to this [AWS Developer Guide](https://docs.aws.amazon.com/glue/latest/dg/create-an-iam-role-sagemaker-notebook.html).

In [29]:
from sagemaker.automl.automl import AutoML


role = 'arn:aws:iam::570447867175:role/SageMakerNotebookRole' # pass your IAM role ARN

auto_ml_job = AutoML(
    role = role,                                              # IAM role
    sagemaker_session = sess,                                  
    target_attribute_name = 'y',                              # The column we want to predict 
    output_path = 's3://{}/{}/output'.format(bucket,prefix),  # Save output artifacts here
    max_candidates = 100,                                     # Number of models (Default: 500) 
    max_runtime_per_training_job_in_seconds = 600, 
    total_job_runtime_in_seconds = 3600
)
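The three limits passed to `AutoML` above reappear under `CompletionCriteria` in the job description printed in section 5. A small sketch of that mapping follows. `completion_criteria` is a hypothetical helper name used only for illustration; it is not part of the SageMaker SDK, and the key names are taken from the `describe_auto_ml_job()` output in this notebook.

```python
def completion_criteria(max_candidates,
                        max_runtime_per_training_job_in_seconds,
                        total_job_runtime_in_seconds):
    """Show where each AutoML() limit surfaces in the job description."""
    return {
        "MaxCandidates": max_candidates,
        "MaxRuntimePerTrainingJobInSeconds": max_runtime_per_training_job_in_seconds,
        "MaxAutoMLJobRuntimeInSeconds": total_job_runtime_in_seconds,
    }


# Same values as the AutoML() call above
print(completion_criteria(100, 600, 3600))
```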

# 5. Start Training

In [30]:
auto_ml_job.fit(inputs=s3_input_data, logs=False, wait=False)

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.


In [31]:
auto_ml_job.describe_auto_ml_job()

{'AutoMLJobName': 'automl-2020-06-18-07-27-05-187',
 'AutoMLJobArn': 'arn:aws:sagemaker:us-west-2:570447867175:automl-job/automl-2020-06-18-07-27-05-187',
 'InputDataConfig': [{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix',
     'S3Uri': 's3://sagemaker-us-west-2-570447867175/sagemaker/DEMO-automl-dm/input/automl-train.csv'}},
   'TargetAttributeName': 'y'}],
 'OutputDataConfig': {'S3OutputPath': 's3://hyunjoon-sagemaker-us/sagemaker/DEMO-automl-dm/output'},
 'RoleArn': 'arn:aws:iam::570447867175:role/SageMakerNotebookRole',
 'AutoMLJobConfig': {'CompletionCriteria': {'MaxCandidates': 100,
   'MaxRuntimePerTrainingJobInSeconds': 600,
   'MaxAutoMLJobRuntimeInSeconds': 3600},
  'SecurityConfig': {'EnableInterContainerTrafficEncryption': False}},
 'CreationTime': datetime.datetime(2020, 6, 18, 0, 27, 5, 278000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2020, 6, 18, 0, 27, 7, 353000, tzinfo=tzlocal()),
 'AutoMLJobStatus': 'InProgress',
 'AutoMLJobSecondaryStatus

## Tracking the progress of the AutoPilot job 
A SageMaker AutoPilot job consists of **four high-level steps**:

1. Data Preprocessing, where the dataset is split into train and validation sets.
2. Recommending Pipelines, where the dataset is analyzed and SageMaker AutoPilot comes up with a list of ML pipelines that should be tried out on the dataset.
3. Automatic Feature Engineering, where SageMaker AutoPilot performs feature transformation on individual features of the dataset as well as at an aggregate level.
4. ML pipeline selection and hyperparameter tuning, where the top performing pipeline is selected along with the optimal hyperparameters for the training algorithm (the last stage of the pipeline).

In [32]:
from time import sleep

job = auto_ml_job.describe_auto_ml_job()
job_status = job['AutoMLJobStatus']
job_sec_status = job['AutoMLJobSecondaryStatus']

if job_status not in ('Stopped', 'Failed'):
    while job_status == 'InProgress' and job_sec_status == 'AnalyzingData':
        sleep(30)
        job = auto_ml_job.describe_auto_ml_job()
        job_status = job['AutoMLJobStatus']
        job_sec_status = job['AutoMLJobSecondaryStatus']
        print (job_status, job_sec_status)
    print("Data analysis complete")

InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress FeatureEngineering
Data analysis complete
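The polling cells in this notebook all follow the same pattern: describe the job, sleep, and repeat while the secondary status is unchanged. A minimal sketch of a reusable helper is shown below. The name `wait_while` and its `describe` and `sleep` parameters are assumptions made for illustration; `describe` is any callable that returns the job description dict, such as `auto_ml_job.describe_auto_ml_job`.

```python
import time


def wait_while(describe, secondary_status, poll_seconds=30, sleep=time.sleep):
    """Poll until the job leaves `secondary_status` or stops being InProgress."""
    job = describe()
    while (job["AutoMLJobStatus"] == "InProgress"
           and job["AutoMLJobSecondaryStatus"] == secondary_status):
        sleep(poll_seconds)
        job = describe()
        print(job["AutoMLJobStatus"], job["AutoMLJobSecondaryStatus"])
    return job  # Caller can inspect the final status, e.g. 'Failed'
```

Usage would look like `job = wait_while(auto_ml_job.describe_auto_ml_job, 'AnalyzingData')`. Passing the describe function in, rather than the job object, keeps the helper easy to try out without an AWS account.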


### Viewing notebooks generated by SageMaker AutoPilot

Once data analysis is complete, SageMaker AutoPilot generates **two notebooks**:

1. Data exploration
2. Candidate definition

In [33]:
job = auto_ml_job.describe_auto_ml_job()
job_candidate_notebook = job['AutoMLJobArtifacts']['CandidateDefinitionNotebookLocation']
job_data_notebook = job['AutoMLJobArtifacts']['DataExplorationNotebookLocation']

print(job_candidate_notebook)
print(job_data_notebook)

s3://hyunjoon-sagemaker-us/sagemaker/DEMO-automl-dm/output/automl-2020-06-18-07-27-05-187/sagemaker-automl-candidates/pr-1-c96ac63cfc0240aca57b6e1b4efe6d295f229d0ef34547408b311743d1/notebooks/SageMakerAutopilotCandidateDefinitionNotebook.ipynb
s3://hyunjoon-sagemaker-us/sagemaker/DEMO-automl-dm/output/automl-2020-06-18-07-27-05-187/sagemaker-automl-candidates/pr-1-c96ac63cfc0240aca57b6e1b4efe6d295f229d0ef34547408b311743d1/notebooks/SageMakerAutopilotDataExplorationNotebook.ipynb


Copy these two notebooks locally and open them; they contain useful information about the dataset and the candidate pipelines.

Run the following commands in a Jupyter notebook, or in the Windows command prompt (CMD) without the leading `!`:

In [38]:
# !aws s3 cp s3://hyunjoon-sagemaker-us/sagemaker/DEMO-automl-dm/output/automl-2020-06-18-07-27-05-187/sagemaker-automl-candidates/pr-1-c96ac63cfc0240aca57b6e1b4efe6d295f229d0ef34547408b311743d1/notebooks/SageMakerAutopilotCandidateDefinitionNotebook.ipynb candidate.ipynb
# !aws s3 cp s3://hyunjoon-sagemaker-us/sagemaker/DEMO-automl-dm/output/automl-2020-06-18-07-27-05-187/sagemaker-automl-candidates/pr-1-c96ac63cfc0240aca57b6e1b4efe6d295f229d0ef34547408b311743d1/notebooks/SageMakerAutopilotDataExplorationNotebook.ipynb DataExploration.ipynb

usage: aws [options] <command> <subcommand> [<subcommand> ...] [parameters]
To see help text, you can run:

  aws help
  aws <command> help
  aws <command> <subcommand> help
aws: error: the following arguments are required: paths
usage: aws [options] <command> <subcommand> [<subcommand> ...] [parameters]
To see help text, you can run:

  aws help
  aws <command> help
  aws <command> <subcommand> help
aws: error: the following arguments are required: paths
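If the `aws s3 cp` commands are awkward to run from your shell, the same copy can be sketched with boto3. The helper names `split_s3_uri` and `download_notebook` are hypothetical and used only for illustration; `download_file` is the standard boto3 S3 client call, and valid AWS credentials are assumed.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key/parts' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("Not an S3 URI: %r" % uri)
    return parsed.netloc, parsed.path.lstrip("/")


def download_notebook(uri, local_name):
    """Download one S3 object to a local file."""
    import boto3  # Imported here so split_s3_uri works without boto3 installed

    bucket_name, key = split_s3_uri(uri)
    boto3.client("s3").download_file(bucket_name, key, local_name)
```

For example, `download_notebook(job_candidate_notebook, 'candidate.ipynb')` and `download_notebook(job_data_notebook, 'DataExploration.ipynb')` would fetch the two notebooks located in the previous cell.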


SageMaker AutoPilot then launches feature engineering, and prepares different training and validation datasets.

In [39]:
job = auto_ml_job.describe_auto_ml_job()
job_status = job['AutoMLJobStatus']
job_sec_status = job['AutoMLJobSecondaryStatus']

if job_status not in ('Stopped', 'Failed'):
    while job_status == 'InProgress' and job_sec_status == 'FeatureEngineering':
        sleep(30)
        job = auto_ml_job.describe_auto_ml_job()
        job_status = job['AutoMLJobStatus']
        job_sec_status = job['AutoMLJobSecondaryStatus']
        print (job_status, job_sec_status)
    print("Feature engineering complete")

Feature engineering complete


Once feature engineering is finished, SageMaker AutoPilot launches Automatic Model Tuning on the different candidates. While model tuning is running, we can explore its progress with SageMaker Experiments.

In [42]:
import pandas as pd
from sagemaker.analytics import ExperimentAnalytics, TrainingJobAnalytics

exp = ExperimentAnalytics(
    sagemaker_session=sess, 
    experiment_name=job['AutoMLJobName'] + '-aws-auto-ml-job',
)

df = exp.dataframe()
print("Number of jobs: ", len(df))

# Move metric to first column
# df = pd.concat([df['ObjectiveMetric - Max'], df.drop(['ObjectiveMetric - Max'], axis=1)], axis=1)
# Show top 5 jobs
# df.sort_values('ObjectiveMetric - Max', ascending=0)[:5]
df[:10]

Number of jobs:  25


Unnamed: 0,TrialComponentName,DisplayName,SourceArn,SageMaker.ImageUri,SageMaker.InstanceCount,SageMaker.InstanceType,SageMaker.VolumeSizeInGB,_tuning_objective_metric,alpha,colsample_bytree,eta,gamma,lambda,max_depth,min_child_weight,num_round,objective,subsample,SageMaker.ModelName,SageMaker.ModelPrimary.DataUrl,SageMaker.ModelPrimary.Image,processor_module,sagemaker_program,sagemaker_submit_directory,input_channel_mode,job_name,label_col
0,tuning-job-1-46cb7fcbd7294bf7be-004-ea1f352a-a...,tuning-job-1-46cb7fcbd7294bf7be-004-ea1f352a-a...,arn:aws:sagemaker:us-west-2:570447867175:train...,246618743249.dkr.ecr.us-west-2.amazonaws.com/s...,1.0,ml.m5.4xlarge,50.0,validation:accuracy,0.001664,0.59389,0.01501,50.79529,1.537836,22.0,31.774205,5.0,binary:logistic,0.607919,,,,,,,,,
1,tuning-job-1-46cb7fcbd7294bf7be-003-11bb7b5c-a...,tuning-job-1-46cb7fcbd7294bf7be-003-11bb7b5c-a...,arn:aws:sagemaker:us-west-2:570447867175:train...,246618743249.dkr.ecr.us-west-2.amazonaws.com/s...,1.0,ml.m5.4xlarge,50.0,validation:accuracy,0.000296,0.345105,0.002555,0.041776,0.00025,7.0,0.019499,163.0,binary:logistic,0.598289,,,,,,,,,
2,tuning-job-1-46cb7fcbd7294bf7be-006-e57d8a5c-a...,tuning-job-1-46cb7fcbd7294bf7be-006-e57d8a5c-a...,arn:aws:sagemaker:us-west-2:570447867175:train...,246618743249.dkr.ecr.us-west-2.amazonaws.com/s...,1.0,ml.m5.4xlarge,50.0,validation:accuracy,0.001664,0.59389,0.01501,50.79529,1.537836,22.0,31.774205,5.0,binary:logistic,0.607919,,,,,,,,,
3,tuning-job-1-46cb7fcbd7294bf7be-002-2f2ca630-a...,tuning-job-1-46cb7fcbd7294bf7be-002-2f2ca630-a...,arn:aws:sagemaker:us-west-2:570447867175:train...,246618743249.dkr.ecr.us-west-2.amazonaws.com/s...,1.0,ml.m5.4xlarge,50.0,validation:accuracy,0.000296,0.345105,0.002555,0.041776,0.00025,7.0,0.019499,163.0,binary:logistic,0.598289,,,,,,,,,
4,tuning-job-1-46cb7fcbd7294bf7be-010-7363e360-a...,tuning-job-1-46cb7fcbd7294bf7be-010-7363e360-a...,arn:aws:sagemaker:us-west-2:570447867175:train...,246618743249.dkr.ecr.us-west-2.amazonaws.com/s...,1.0,ml.m5.4xlarge,50.0,validation:accuracy,0.001664,0.59389,0.01501,50.79529,1.537836,22.0,31.774205,5.0,binary:logistic,0.607919,,,,,,,,,


In [43]:
job = auto_ml_job.describe_auto_ml_job()
job_status = job['AutoMLJobStatus']
job_sec_status = job['AutoMLJobSecondaryStatus']

if job_status not in ('Stopped', 'Failed'):
    while job_status == 'InProgress' and job_sec_status == 'ModelTuning':
        sleep(30)
        job = auto_ml_job.describe_auto_ml_job()
        job_status = job['AutoMLJobStatus']
        job_sec_status = job['AutoMLJobSecondaryStatus']
        print (job_status, job_sec_status)
    print("Model tuning complete")

InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress 

# 6. Model Deploy 

In [44]:
from time import strftime, gmtime
timestamp = strftime('%d-%H-%M-%S', gmtime())

endpoint_name = job['AutoMLJobName']+'-'+timestamp

print(endpoint_name)

automl-2020-06-18-07-27-05-187-18-08-28-52
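Because the endpoint name is built by appending a timestamp to the job name, it can be worth checking it before calling `deploy`. The sketch below assumes the SageMaker limit of 63 characters and the alphanumeric-plus-hyphen pattern for endpoint names; `make_endpoint_name` is a hypothetical helper, not an SDK function.

```python
import re
from time import gmtime, strftime

# Assumed constraints for SageMaker endpoint names
MAX_ENDPOINT_NAME_LENGTH = 63
ENDPOINT_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")


def make_endpoint_name(job_name, now=None):
    """Append a day-hour-minute-second suffix and validate the result."""
    if now is None:
        now = gmtime()
    name = job_name + "-" + strftime("%d-%H-%M-%S", now)
    if len(name) > MAX_ENDPOINT_NAME_LENGTH or not ENDPOINT_NAME_PATTERN.match(name):
        raise ValueError("Invalid endpoint name: %r" % name)
    return name
```

The name printed above is 42 characters long, so it fits comfortably within that limit.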


In [46]:
auto_ml_job.deploy(
    initial_instance_count = 1,
    instance_type = 'ml.m4.xlarge',
    endpoint_name = endpoint_name
)

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.
Parameter image will be renamed to image_uri in SageMaker Python SDK v2.
Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


-----------------!

# 7. Predict with the test set

- From https://gitlab.com/juliensimon/awsdevdays2020/-/blob/master/mls2/XGBoost-AutoPilot.ipynb

In [47]:
from sagemaker.predictor import csv_serializer, RealTimePredictor
from sagemaker.content_types import CONTENT_TYPE_CSV

predictor = RealTimePredictor(
    endpoint=endpoint_name, 
    sagemaker_session=sess, 
    serializer=csv_serializer,
    content_type=CONTENT_TYPE_CSV, 
    accept='text/csv'
)
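The evaluation loop in the next cell splits each CSV line on commas by hand, which works for this dataset but would break on quoted fields that contain commas. A minimal alternative sketch using the standard `csv` module is shown below; the function names are hypothetical, and the label is assumed to be the last column, as in `automl-test.csv`.

```python
import csv
import io


def row_to_payload(features):
    """Serialize a list of feature values back into one CSV line."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="").writerow(features)
    return buf.getvalue()


def features_and_labels(csv_text):
    """Yield (payload, label) pairs, skipping the header row."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # Skip header
    for row in reader:
        if not row:
            continue  # Ignore blank lines
        *features, label = row
        yield row_to_payload(features), label
```

Each `payload` could then be sent with `predictor.predict(payload)`, and `label` compared against the response exactly as in the loop below.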

In [48]:
tp = tn = fp = fn = count = 0

with open('./data/automl-test.csv') as f:
    lines = f.readlines()
    for l in lines[1:]:   # Skip header
        l = l.split(',')  # Split CSV line into feature array
        label = l[-1]     # Store 'yes'/'no' label
        l = l[:-1]        # Remove label
        l = ','.join(l)   # Rebuild CSV line without label
                
        response = predictor.predict(l)
        response = response.decode("utf-8")
        #print ("label %s response %s" %(label,response))

        if 'yes' in label:
            # Sample is positive
            if 'yes' in response:
                # True positive
                tp=tp+1
            else:
                # False negative
                fn=fn+1
        else:
            # Sample is negative
            if 'no' in response:
                # True negative
                tn=tn+1
            else:
                # False positive
                fp=fp+1
        count = count+1
        if (count % 100 == 0):   
            sys.stdout.write(str(count)+' ')
            
#Confusion matrix
print ("%d %d" % (tn, fp))
print ("%d %d" % (fn, tp))

accuracy  = (tp+tn)/(tp+tn+fp+fn)
precision = tp/(tp+fp)
recall    = tp/(tp+fn)   # Share of actual 'yes' samples that were predicted 'yes'
f1        = (2*precision*recall)/(precision+recall)

print ("%.4f %.4f %.4f %.4f" % (accuracy, precision, recall, f1))

100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 1769 53
111 127
0.9204 0.7056 0.5336 0.6077
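For reference, the standard definitions of these four metrics can be sketched as a small function of the confusion-matrix counts. `classification_metrics` is a hypothetical helper name; the counts in the example are the ones printed above (tn=1769, fp=53, fn=111, tp=127).

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Precision: of the samples predicted positive, how many were positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the samples that were positive, how many were found
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = (2 * precision * recall / denom) if denom else 0.0
    return accuracy, precision, recall, f1


print("%.4f %.4f %.4f %.4f" % classification_metrics(tp=127, tn=1769, fp=53, fn=111))
# prints: 0.9204 0.7056 0.5336 0.6077
```

Since only about 12% of the test samples are 'yes', accuracy alone looks flattering here; recall shows that roughly half of the actual subscribers are found.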


# 8. Close the SageMaker Instance 

- To make sure we don't get charged after the training/inference is over, we have to delete the endpoint.

In [49]:
sess.delete_endpoint(predictor.endpoint)

- Also, the SageMaker AutoML job creates many underlying artifacts such as dataset splits, preprocessing scripts, preprocessed data, etc. Let's delete them too!

In [50]:
import boto3

job_outputs_prefix = '{}/output/{}'.format(prefix, job['AutoMLJobName'])
print(job_outputs_prefix)

s3_bucket = boto3.resource('s3').Bucket(bucket)

sagemaker/DEMO-automl-dm/output/automl-2020-06-18-07-27-05-187
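Deleting by prefix is irreversible, so it can help to list the matching keys first. The sketch below wraps the same `objects.filter(Prefix=...)` call used in the next cell in a dry-run helper; `delete_prefix` and its `dry_run` flag are assumptions for illustration, not boto3 features.

```python
def delete_prefix(bucket, prefix, dry_run=True):
    """List keys under `prefix`; delete them only when dry_run is False."""
    # `bucket` is expected to behave like boto3.resource('s3').Bucket(name)
    keys = [obj.key for obj in bucket.objects.filter(Prefix=prefix)]
    if not dry_run:
        bucket.objects.filter(Prefix=prefix).delete()
    return keys
```

Calling `delete_prefix(s3_bucket, job_outputs_prefix)` would return the keys without removing anything; passing `dry_run=False` performs the deletion.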


In [51]:
s3_bucket.objects.filter(Prefix=job_outputs_prefix).delete()

[{'ResponseMetadata': {'RequestId': '18C196F4C827B924',
   'HostId': 'tqogCt29afGNln2RemIpKQ1c3Y0H1+midZ12x+GhVGOlOvl+ZeW4S0tfVqKhUUAJRzVwDSOT0rw=',
   'HTTPStatusCode': 200,
   'HTTPHeaders': {'x-amz-id-2': 'tqogCt29afGNln2RemIpKQ1c3Y0H1+midZ12x+GhVGOlOvl+ZeW4S0tfVqKhUUAJRzVwDSOT0rw=',
    'x-amz-request-id': '18C196F4C827B924',
    'date': 'Thu, 18 Jun 2020 08:44:36 GMT',
    'connection': 'close',
    'content-type': 'application/xml',
    'transfer-encoding': 'chunked',
    'server': 'AmazonS3'},
   'RetryAttempts': 0},
  'Deleted': [{'Key': 'sagemaker/DEMO-automl-dm/output/automl-2020-06-18-07-27-05-187/transformed-data/dpp0/csv/train/chunk_24.csv.out'},
   {'Key': 'sagemaker/DEMO-automl-dm/output/automl-2020-06-18-07-27-05-187/transformed-data/dpp6/csv/train/chunk_90.csv.out'},
   {'Key': 'sagemaker/DEMO-automl-dm/output/automl-2020-06-18-07-27-05-187/transformed-data/dpp7/rpb/train/chunk_90.csv.out'},
   {'Key': 'sagemaker/DEMO-automl-dm/output/automl-2020-06-18-07-27-05-187/pre