# Direct Marketing with Amazon SageMaker Autopilot

## Introduction

Amazon SageMaker Autopilot is an automated machine learning (commonly referred to as AutoML) solution for tabular datasets. You can use SageMaker Autopilot in different ways: on autopilot (hence the name) or with human guidance, without code through SageMaker Studio, or using the AWS SDKs. This notebook, as a first glimpse, will use the AWS SDKs to simply create and deploy a machine learning model.


## Prerequisites

Before you start the tasks in this tutorial, make sure you have the following:

- An Amazon Simple Storage Service (Amazon S3) bucket and prefix to use for training and model data. The bucket should be in the same Region as Amazon SageMaker training. The code below uses the default bucket and creates it if it does not exist yet.
- An IAM role that gives Autopilot access to your data. See the Amazon SageMaker documentation for more information on IAM roles: https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam.html

In [9]:
import sagemaker
import boto3
from sagemaker import get_execution_role

region = boto3.Session().region_name

session = sagemaker.Session()
bucket = session.default_bucket()
prefix = 'sagemaker/autopilot-dm'

role = get_execution_role()

sm = boto3.Session().client(service_name='sagemaker',region_name=region)

## Loading the dataset<a name="loading"></a>


In [10]:
!#wget -N https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip
!#unzip -o bank-additional.zip

local_data_path = './slug_menu.csv'
print(local_data_path)

./slug_menu.csv


## Upload the dataset to Amazon S3<a name="Uploading"></a>

Before you run Autopilot on the dataset, first perform a check of the dataset to make sure that it has no obvious errors. If you have a larger dataset that will not fit in a notebook instance memory, inspect the dataset offline using a big data analytics tool like Apache Spark. [Deequ](https://github.com/awslabs/deequ) is a library built on top of Apache Spark that can be helpful for performing checks on large datasets. Autopilot is capable of handling datasets up to 5 GB.


Read the data into a Pandas data frame and take a look.

In [11]:
import pandas as pd

data = pd.read_csv(local_data_path, sep=',')
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 10)         # Keep the output on one page
data

Unnamed: 0,slug,menu,slug.1,categories,distance,name,price_level,rating,review_count,url,lat,lng,Sp1,type,homeurl,resource_id1,resource_id2,lat2,lng2
0,slug,menu,slug,categories,distance,name,price_level,rating,review_count,url,lat,lng,Sp1,type,homeurl,resource_id1,resource_id2,lat2,lng2
1,1-2-3-pho-and-asian-cafe-englewood,A1. Vietnamese Crispy Spring Rolls 3 piece. 6....,1-2-3-pho-and-asian-cafe-englewood,"Vietnamese, Chinese, Sushi Bars",5.6 mi,1 2 3 Pho & Asian Cafe,$$,3.50993377483,151,https://m.yelp.com/biz/1-2-3-pho-and-asian-caf...,39.5368356,-104.8536173,,natural,/biz/1-2-3-pho-and-asian-cafe-englewood,dVsoRXXEeYslIq3egO984Q,M0loTtLIh4F4gCinGTaCsA,39.5368356,-104.8536173
2,1-chinese-kitchen-san-jose-2,"Thai Iced Tea 4.5, Vietnamese Iced Coffee 4....",1-chinese-kitchen-san-jose-2,"Chinese, Caterers",0.5 mi,#1 Chinese Kitchen,$,3.70448548813,379,https://m.yelp.com/biz/1-chinese-kitchen-san-j...,37.252168258,-121.831224679,,natural,/biz/1-chinese-kitchen-san-jose-2,_us5qOY9AafKtmBdje6mIg,TsIah2X2urWVxVP4nrwhFA,37.252168258,-121.831224679
3,1-columbia-deli-new-york,"S1. Mango Sunrise Smoothie Mango, pineapple, b...",1-columbia-deli-new-york,Delis,0.2 mi,#1 Columbia Deli,,,0,https://m.yelp.com/biz/1-columbia-deli-new-york,40.8117962,-73.9576436,,natural,/biz/1-columbia-deli-new-york,ZJRC1aIv4CIG6qMKZCR2bQ,3hiuYqihbILBQ7EZCp19ZQ,40.8117962,-73.9576436
4,10-barrel-brewing-san-diego-san-diego,Apocalypse IPA He send is near. Apocalypse is ...,10-barrel-brewing-san-diego-san-diego,"Breweries, Pubs, American (Traditional)",2.1 mi,10 Barrel Brewing - San Diego,$$,4.52427184466,412,https://m.yelp.com/biz/10-barrel-brewing-san-d...,32.7145590497,-117.150228582,,natural,/biz/10-barrel-brewing-san-diego-san-diego,8nEOrZJXEPbB7omJwyM1JQ,Xm502NvgzMVjdue1JAkejw,32.7145590497,-117.150228582
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28051,zyka-decatur,Tandoori Chicken chicken leg quarter marinated...,zyka-decatur,"Indian, Pakistani",11.0 mi,Zyka,$$,3.8908819133,669,https://m.yelp.com/biz/zyka-decatur,33.797014,-84.283603,,natural,/biz/zyka-decatur,b6kJapy143QnbJieVPOHOw,deC3XO9E7td_ySUvJJJGPg,33.797014,-84.283603
28052,zzaam-fresh-korean-grill-bristol,"Korean Ice Tea 3.5, Ramune A Japanese soda. 3...",zzaam-fresh-korean-grill-bristol,Korean,0.3 mi,Zzaam Fresh Korean Grill,$,4.28915662651,83,https://m.yelp.com/biz/zzaam-fresh-korean-gril...,41.6855246,-72.9387402,,natural,/biz/zzaam-fresh-korean-grill-bristol,vrClVVv6DZiWhn3aFfbucQ,cSYz0QgZROe4yn0v447rhg,41.6855246,-72.9387402
28053,zzaam-fresh-korean-grill-midlothian-2,"Chicken Wings Choose spicy or mild. 7.45, Eggr...",zzaam-fresh-korean-grill-midlothian-2,Korean,2.4 mi,ZZAAM Fresh Korean Grill,$,4.47058823529,51,https://m.yelp.com/biz/zzaam-fresh-korean-gril...,37.4191789368,-77.6324390191,,natural,/biz/zzaam-fresh-korean-grill-midlothian-2,S1olXW4gETw7d7UbULAh5w,eamq1pHKG2GPvj7tzZ0mqg,37.4191789368,-77.6324390191
28054,zzan-san-francisco-2,"Soda 2.5, Sparkling Water San Pellegrino. 3.5...",zzan-san-francisco-2,Korean,0.8 mi,Zzan,$$,3.5625,224,https://m.yelp.com/biz/zzan-san-francisco-2,37.7876051,-122.4123571,,natural,/biz/zzan-san-francisco-2,oEJPe3ltDA7Hh1bVf56s7Q,Zr8UsryUlZuV2F1O0EEl6Q,37.7876051,-122.4123571
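
The preview above appears to contain two of the "obvious errors" worth checking for: row 0 repeats the column names as data, and at least one restaurant has no `rating` value. The following is a minimal sketch of such a check using only the standard library. The function name `check_rows` and the three-column sample are hypothetical and stand in for the real file.

```python
import csv
import io


def check_rows(csv_text, target="rating"):
    """Return (clean_rows, n_header_repeats, n_missing_target) for a CSV string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    clean_rows, header_repeats, missing_target = [], 0, 0
    for row in reader:
        # A row whose every value equals its own column name is a repeated header.
        if all(value == name for name, value in row.items()):
            header_repeats += 1
            continue
        # Count rows with an empty target, but keep them in the result.
        if not (row[target] or "").strip():
            missing_target += 1
        clean_rows.append(row)
    return clean_rows, header_repeats, missing_target


# Hypothetical miniature of the dataset: one repeated header, one missing rating.
sample = (
    "slug,name,rating\n"
    "slug,name,rating\n"
    "1-columbia-deli-new-york,#1 Columbia Deli,\n"
    "zyka-decatur,Zyka,3.8908819133\n"
)
rows, n_repeats, n_missing = check_rows(sample)
print(n_repeats, n_missing, len(rows))
```

Whether to drop or keep rows without a target is a modelling decision; the sketch only counts them.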


### Reserve some data for calling batch inference on the model

Divide the data into training and testing splits. The training split is used by SageMaker Autopilot. The testing split is reserved to perform inference using the suggested model.


In [12]:
train_data = data.sample(frac=0.8,random_state=200)

test_data = data.drop(train_data.index)

test_data_no_target = test_data.drop(columns=['rating'])
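
The split above relies on two properties: the training and testing rows never overlap, and together they cover the whole dataset. The sketch below shows the same idea in plain Python so those properties are explicit. It is an illustration only: `split_indices` is a hypothetical helper, and Python's `random` module will not pick the same rows as `DataFrame.sample`, even with the same seed.

```python
import random


def split_indices(n_rows, frac=0.8, seed=200):
    """Return (train, test) index lists that partition range(n_rows)."""
    rng = random.Random(seed)
    # Draw the training rows without replacement.
    train = sorted(rng.sample(range(n_rows), round(n_rows * frac)))
    train_set = set(train)
    # Everything not drawn for training becomes the test split.
    test = [i for i in range(n_rows) if i not in train_set]
    return train, test


train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))
```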

### Upload the dataset to Amazon S3
Copy the file to Amazon Simple Storage Service (Amazon S3) in a .csv format for Amazon SageMaker training to use.

In [14]:
# Training data keeps its header row so that Autopilot can find the target column by name.
train_file = 'train_data.csv'
train_data.to_csv(train_file, index=False, header=True)
train_data_s3_path = session.upload_data(path=train_file, key_prefix=prefix + "/train")
print('Train data uploaded to: ' + train_data_s3_path)

# Test data for batch inference: no target column and no header row.
test_file = 'test_data.csv'
test_data_no_target.to_csv(test_file, index=False, header=False)
test_data_s3_path = session.upload_data(path=test_file, key_prefix=prefix + "/test")
print('Test data (no target) uploaded to: ' + test_data_s3_path)

# Test data with the target column, kept for comparing predictions later.
test_file_with_target = 'test_data_with_target.csv'
test_data.to_csv(test_file_with_target, index=False, header=False)
test_data_with_target_s3_path = session.upload_data(path=test_file_with_target, key_prefix=prefix + "/test")
print('Test data (with target) uploaded to: ' + test_data_with_target_s3_path)

Train data uploaded to: s3://sagemaker-us-east-1-644944822023/sagemaker/autopilot-dm/train/train_data.csv
Test data (no target) uploaded to: s3://sagemaker-us-east-1-644944822023/sagemaker/autopilot-dm/test/test_data.csv
Test data (with target) uploaded to: s3://sagemaker-us-east-1-644944822023/sagemaker/autopilot-dm/test/test_data_with_target.csv


## Setting up the SageMaker Autopilot Job<a name="Settingup"></a>

After uploading the dataset to Amazon S3, you can invoke Autopilot to find the best ML pipeline to train a model on this dataset. 

In [15]:
input_data_config = [{
      'DataSource': {
        'S3DataSource': {
          'S3DataType': 'S3Prefix',
          'S3Uri': 's3://{}/{}/train'.format(bucket,prefix)
        }
      },
      'TargetAttributeName': 'rating'
    }
  ]

output_data_config = {
    'S3OutputPath': 's3://{}/{}/output'.format(bucket,prefix)
  }

## Launching the SageMaker Autopilot Job<a name="Launching"></a>

You can now launch the Autopilot job by calling the `create_auto_ml_job` API. 

In [16]:
from time import gmtime, strftime, sleep
timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())

auto_ml_job_name = 'automl-banking-' + timestamp_suffix
print('AutoMLJobName: ' + auto_ml_job_name)

sm.create_auto_ml_job(AutoMLJobName=auto_ml_job_name,
                      InputDataConfig=input_data_config,
                      OutputDataConfig=output_data_config,
                      RoleArn=role)

AutoMLJobName: automl-banking-15-02-06-41


{u'AutoMLJobArn': u'arn:aws:sagemaker:us-east-1:644944822023:automl-job/automl-banking-15-02-06-41',
 'ResponseMetadata': {'HTTPHeaders': {'content-length': '97',
   'content-type': 'application/x-amz-json-1.1',
   'date': 'Sun, 15 Dec 2019 02:06:42 GMT',
   'x-amzn-requestid': '61ff4b80-c542-4756-92e6-7d0f86aca8ee'},
  'HTTPStatusCode': 200,
  'RequestId': '61ff4b80-c542-4756-92e6-7d0f86aca8ee',
  'RetryAttempts': 0}}
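
The call above passes only the required arguments, so Autopilot infers the problem type and objective from the data. To my knowledge, `create_auto_ml_job` also accepts the optional `ProblemType`, `AutoMLJobObjective`, and `AutoMLJobConfig` parameters; check the boto3 documentation before relying on them. The sketch below only builds the keyword arguments and does not call AWS. The helper name `build_auto_ml_job_kwargs`, the bucket, and the role ARN are hypothetical placeholders, and choosing regression with an MSE objective is an assumption based on `rating` being numeric.

```python
def build_auto_ml_job_kwargs(job_name, input_data_config, output_data_config,
                             role_arn, max_candidates=None):
    """Assemble keyword arguments for create_auto_ml_job with an explicit objective."""
    kwargs = {
        "AutoMLJobName": job_name,
        "InputDataConfig": input_data_config,
        "OutputDataConfig": output_data_config,
        "RoleArn": role_arn,
        # Assumption: the target is numeric, so treat the task as regression.
        "ProblemType": "Regression",
        "AutoMLJobObjective": {"MetricName": "MSE"},
    }
    if max_candidates is not None:
        # Cap the number of candidates to bound cost and run time.
        kwargs["AutoMLJobConfig"] = {
            "CompletionCriteria": {"MaxCandidates": max_candidates}
        }
    return kwargs


# Placeholder values for illustration only.
example_kwargs = build_auto_ml_job_kwargs(
    job_name="example-job",
    input_data_config=[{
        "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix",
                                        "S3Uri": "s3://example-bucket/example-prefix/train"}},
        "TargetAttributeName": "rating",
    }],
    output_data_config={"S3OutputPath": "s3://example-bucket/example-prefix/output"},
    role_arn="arn:aws:iam::123456789012:role/ExampleRole",
    max_candidates=10,
)
print(sorted(example_kwargs))
```

With the notebook's variables, the result would be passed as `sm.create_auto_ml_job(**kwargs)`.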

## Tracking SageMaker Autopilot job progress<a name="Tracking"></a>
A SageMaker Autopilot job consists of the following high-level steps:
* Analyzing Data, where the dataset is analyzed and Autopilot comes up with a list of ML pipelines that should be tried out on the dataset. The dataset is also split into train and validation sets.
* Feature Engineering, where Autopilot performs feature transformation on individual features of the dataset as well as at an aggregate level.
* Model Tuning, where the top performing pipeline is selected along with the optimal hyperparameters for the training algorithm (the last stage of the pipeline). 

In [20]:
print('JobStatus - Secondary Status')
print('------------------------------')

# Fetch and print the current status once before entering the polling loop.
describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
print(describe_response['AutoMLJobStatus'] + " - " + describe_response['AutoMLJobSecondaryStatus'])
job_run_status = describe_response['AutoMLJobStatus']

# Poll every 30 seconds until the job reaches a terminal state.
while job_run_status not in ('Failed', 'Completed', 'Stopped'):
    describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
    job_run_status = describe_response['AutoMLJobStatus']

    print(describe_response['AutoMLJobStatus'] + " - " + describe_response['AutoMLJobSecondaryStatus'])
    sleep(30)

JobStatus - Secondary Status
------------------------------
Completed - MaxCandidatesReached
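
A polling loop like the one above runs until the job reaches a terminal state, however long that takes. The following sketch adds an upper bound on the waiting time. It takes the describe call as a function argument so that it can be tried offline with canned responses. `wait_for_job` and its parameters are hypothetical names, not part of boto3.

```python
import time

TERMINAL_STATES = ("Completed", "Failed", "Stopped")


def wait_for_job(describe, status_key, poll_seconds=30, timeout_seconds=4 * 3600,
                 sleep=time.sleep, clock=time.monotonic):
    """Call describe() until response[status_key] is terminal or the timeout expires."""
    deadline = clock() + timeout_seconds
    while True:
        response = describe()
        status = response[status_key]
        if status in TERMINAL_STATES:
            return response
        if clock() >= deadline:
            raise TimeoutError(f"job still {status} after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Offline demonstration with canned responses instead of a real AWS call.
fake_responses = iter([
    {"AutoMLJobStatus": "InProgress"},
    {"AutoMLJobStatus": "InProgress"},
    {"AutoMLJobStatus": "Completed"},
])
final = wait_for_job(lambda: next(fake_responses), "AutoMLJobStatus",
                     sleep=lambda seconds: None)
print(final["AutoMLJobStatus"])
```

With the notebook's client, the call would look like `wait_for_job(lambda: sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name), 'AutoMLJobStatus')`. The same helper fits the transform job if `'TransformJobStatus'` is passed as the status key.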


## Results

Now use the `describe_auto_ml_job` API to look up the best candidate selected by the SageMaker Autopilot job. 

In [21]:
best_candidate = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['BestCandidate']
best_candidate_name = best_candidate['CandidateName']
print(best_candidate)
print('\n')
print("CandidateName: " + best_candidate_name)
print("FinalAutoMLJobObjectiveMetricName: " + best_candidate['FinalAutoMLJobObjectiveMetric']['MetricName'])
print("FinalAutoMLJobObjectiveMetricValue: " + str(best_candidate['FinalAutoMLJobObjectiveMetric']['Value']))

{u'CandidateStatus': u'Completed', u'ObjectiveStatus': u'Succeeded', u'CandidateSteps': [{u'CandidateStepType': u'AWS::SageMaker::ProcessingJob', u'CandidateStepName': u'db-1-e2fb48781e6f405e90113e03af83f937c306c96b0af64264bfbf4abfb2', u'CandidateStepArn': u'arn:aws:sagemaker:us-east-1:644944822023:processing-job/db-1-e2fb48781e6f405e90113e03af83f937c306c96b0af64264bfbf4abfb2'}, {u'CandidateStepType': u'AWS::SageMaker::TrainingJob', u'CandidateStepName': u'automl-ban-dpp1-1-fa2cede8962b4cb48bfb2882e31367ba4261e10ff8c34', u'CandidateStepArn': u'arn:aws:sagemaker:us-east-1:644944822023:training-job/automl-ban-dpp1-1-fa2cede8962b4cb48bfb2882e31367ba4261e10ff8c34'}, {u'CandidateStepType': u'AWS::SageMaker::TransformJob', u'CandidateStepName': u'automl-ban-dpp1-rpb-1-af7165d4cec34704864b676fcbe354f35862807a3', u'CandidateStepArn': u'arn:aws:sagemaker:us-east-1:644944822023:transform-job/automl-ban-dpp1-rpb-1-af7165d4cec34704864b676fcbe354f35862807a3'}, {u'CandidateStepType': u'AWS::SageMake

### Perform batch inference using the best candidate

Now that you have successfully completed the SageMaker Autopilot job on the dataset, create a model from any of the candidates by using [Inference Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html). 

In [22]:
model_name = 'automl-banking-model-' + timestamp_suffix

model = sm.create_model(Containers=best_candidate['InferenceContainers'],
                            ModelName=model_name,
                            ExecutionRoleArn=role)

print('Model ARN corresponding to the best candidate is : {}'.format(model['ModelArn']))

Model ARN corresponding to the best candidate is : arn:aws:sagemaker:us-east-1:644944822023:model/automl-banking-model-15-02-06-41


You can run batch inference with Amazon SageMaker batch transform. The same model can also be deployed for online inference using Amazon SageMaker hosting.

In [23]:
transform_job_name = 'automl-banking-transform-' + timestamp_suffix

transform_input = {
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': test_data_s3_path
            }
        },
        'ContentType': 'text/csv',
        'CompressionType': 'None',
        'SplitType': 'Line'
    }

transform_output = {
        'S3OutputPath': 's3://{}/{}/inference-results'.format(bucket,prefix),
    }

transform_resources = {
        'InstanceType': 'ml.m5.4xlarge',
        'InstanceCount': 1
    }

sm.create_transform_job(TransformJobName = transform_job_name,
                        ModelName = model_name,
                        TransformInput = transform_input,
                        TransformOutput = transform_output,
                        TransformResources = transform_resources
)

{'ResponseMetadata': {'HTTPHeaders': {'content-length': '113',
   'content-type': 'application/x-amz-json-1.1',
   'date': 'Sun, 15 Dec 2019 04:47:19 GMT',
   'x-amzn-requestid': 'f9e41b2d-bada-48f3-9e42-01e341ceda8f'},
  'HTTPStatusCode': 200,
  'RequestId': 'f9e41b2d-bada-48f3-9e42-01e341ceda8f',
  'RetryAttempts': 0},
 u'TransformJobArn': u'arn:aws:sagemaker:us-east-1:644944822023:transform-job/automl-banking-transform-15-02-06-41'}

Watch the transform job for completion.

In [24]:
print('JobStatus')
print('----------')

# Fetch and print the current status once before entering the polling loop.
describe_response = sm.describe_transform_job(TransformJobName=transform_job_name)
job_run_status = describe_response['TransformJobStatus']
print(job_run_status)

# Poll every 30 seconds until the transform job reaches a terminal state.
while job_run_status not in ('Failed', 'Completed', 'Stopped'):
    describe_response = sm.describe_transform_job(TransformJobName=transform_job_name)
    job_run_status = describe_response['TransformJobStatus']
    print(job_run_status)
    sleep(30)

JobStatus
----------
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
Completed


Now let's view the results of the transform job:

In [25]:
s3_output_key = '{}/inference-results/test_data.csv.out'.format(prefix)
local_inference_results_path = 'inference_results.csv'

s3 = boto3.resource('s3')
inference_results_bucket = s3.Bucket(session.default_bucket())

inference_results_bucket.download_file(s3_output_key, local_inference_results_path)

# The transform output has no header row, so read_csv as called here uses the
# first prediction as the column label (see the output below); passing
# header=None would keep that value as a data row instead.
data = pd.read_csv(local_inference_results_path, sep=';')
pd.set_option('display.max_rows', 10)         # Keep the output on one page
data

Unnamed: 0,0.20951581001281738
0,3.989051
1,4.166936
2,4.001935
3,0.191046
4,3.679652
...,...
5605,4.106693
5606,3.653477
5607,3.850721
5608,3.877154
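
To judge these predictions, they can be compared with the held-out `rating` values from the test split. The sketch below parses a one-column prediction file without a header row, so that every line counts as a prediction, and computes the mean squared error over the rows that have a known target. The function names and the toy numbers are hypothetical; they are not results from this job.

```python
import csv
import io


def read_predictions(text):
    """Parse a header-less, one-column CSV of predictions into floats."""
    return [float(row[0]) for row in csv.reader(io.StringIO(text)) if row]


def mean_squared_error(targets, predictions):
    """MSE over the pairs whose target is present; missing targets are None."""
    pairs = [(t, p) for t, p in zip(targets, predictions) if t is not None]
    if not pairs:
        raise ValueError("no rows with a known target")
    return sum((t - p) ** 2 for t, p in pairs) / len(pairs)


# Toy values for illustration only.
predictions = read_predictions("4.0\n3.5\n2.0\n")
targets = [3.5, None, 3.0]
mse = mean_squared_error(targets, predictions)
print(len(predictions), mse)
```

Skipping rows without a target is one possible choice; it keeps restaurants with no rating from distorting the error.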


### View other candidates explored by SageMaker Autopilot
You can view all the candidates (pipeline evaluations with different hyperparameter combinations) that were explored by SageMaker Autopilot and sort them by their final performance metric.

In [26]:
candidates = sm.list_candidates_for_auto_ml_job(AutoMLJobName=auto_ml_job_name)['Candidates']
#print(repr(candidates))
index = 1
for candidate in candidates:
  print (str(index) + "  " + candidate['CandidateName'] + "  " + str(candidate['FinalAutoMLJobObjectiveMetric']['Value']) + candidate['FinalAutoMLJobObjectiveMetric']['MetricName'])
  index += 1

1  tuning-job-1-25c50953778448b786-241-edc053e3  0.246413007379validation:mse
2  tuning-job-1-25c50953778448b786-243-50f675dd  0.286435991526validation:mse
3  tuning-job-1-25c50953778448b786-244-e42b455d  0.291945010424validation:mse
4  tuning-job-1-25c50953778448b786-245-06af7714  0.326144993305validation:mse
5  tuning-job-1-25c50953778448b786-246-123af794  0.302026987076validation:mse
6  tuning-job-1-25c50953778448b786-247-2d89fa85  0.267668992281validation:mse
7  tuning-job-1-25c50953778448b786-248-47a2fcc4  0.37551599741validation:mse
8  tuning-job-1-25c50953778448b786-249-82dc2d0e  0.290565997362validation:mse
9  tuning-job-1-25c50953778448b786-250-c33da69b  0.290334999561validation:mse
10  tuning-job-1-25c50953778448b786-251-9995d524  0.295363008976validation:mse


In [29]:
candidates = sm.list_candidates_for_auto_ml_job(AutoMLJobName=auto_ml_job_name, SortBy='FinalObjectiveMetricValue')['Candidates']
index = 1
for candidate in candidates:
  print (str(index) + "  " + candidate['CandidateName'] + "  " + str(candidate['FinalAutoMLJobObjectiveMetric']['Value']) + ' ' + candidate['FinalAutoMLJobObjectiveMetric']['MetricName'])
  index += 1

1  tuning-job-1-25c50953778448b786-028-b604dc30  1871.20068359 validation:objective_loss
2  tuning-job-1-25c50953778448b786-027-fd64e79a  11.1486997604 validation:mse
3  tuning-job-1-25c50953778448b786-026-7d213a7f  11.1486997604 validation:mse
4  tuning-job-1-25c50953778448b786-018-ee070039  10.957400322 validation:mse
5  tuning-job-1-25c50953778448b786-032-26e58250  10.7209997177 validation:mse
6  tuning-job-1-25c50953778448b786-051-9f4e08f3  10.6714000702 validation:mse
7  tuning-job-1-25c50953778448b786-092-febd67df  10.5369997025 validation:mse
8  tuning-job-1-25c50953778448b786-037-df48b12a  10.4678001404 validation:mse
9  tuning-job-1-25c50953778448b786-006-0d7717e4  9.71615982056 validation:mse
10  tuning-job-1-25c50953778448b786-004-8cb74767  9.71409988403 validation:mse
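
The sorted listing above mixes two metric names (`validation:objective_loss` and `validation:mse`) and appears to be in descending order, so its first entry is not the best model when lower values are better. The sketch below picks the best candidate locally by comparing only candidates that report the same metric and taking the smallest value. `best_by_metric` is a hypothetical helper. The three sample entries copy names and values from the outputs above and keep only the fields the helper needs.

```python
def best_by_metric(candidates, metric_name="validation:mse"):
    """Return the candidate with the lowest value among those reporting metric_name."""
    comparable = [
        c for c in candidates
        if c["FinalAutoMLJobObjectiveMetric"]["MetricName"] == metric_name
    ]
    if not comparable:
        raise ValueError(f"no candidate reports {metric_name}")
    # For an error metric such as MSE, lower is better.
    return min(comparable, key=lambda c: c["FinalAutoMLJobObjectiveMetric"]["Value"])


sample_candidates = [
    {"CandidateName": "tuning-job-1-25c50953778448b786-028-b604dc30",
     "FinalAutoMLJobObjectiveMetric": {"MetricName": "validation:objective_loss",
                                       "Value": 1871.20068359}},
    {"CandidateName": "tuning-job-1-25c50953778448b786-004-8cb74767",
     "FinalAutoMLJobObjectiveMetric": {"MetricName": "validation:mse",
                                       "Value": 9.71409988403}},
    {"CandidateName": "tuning-job-1-25c50953778448b786-241-edc053e3",
     "FinalAutoMLJobObjectiveMetric": {"MetricName": "validation:mse",
                                       "Value": 0.246413007379}},
]
best = best_by_metric(sample_candidates)
print(best["CandidateName"])
```

To my knowledge, `list_candidates_for_auto_ml_job` also accepts a `SortOrder` argument, which may make the server-side ordering explicit; the local selection above does not depend on it.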


### Candidate Generation Notebook
    
SageMaker Autopilot also auto-generates a Candidate Definitions notebook. You can use this notebook to step interactively through the steps SageMaker Autopilot took to arrive at the best candidate. You can also use it to override runtime parameters such as parallelism, hardware used, algorithms explored, and feature extraction scripts.
    
The notebook can be downloaded from the following Amazon S3 location:

In [30]:
sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['AutoMLJobArtifacts']['CandidateDefinitionNotebookLocation']


u's3://sagemaker-us-east-1-644944822023/sagemaker/autopilot-dm/output/automl-banking-15-02-06-41/sagemaker-automl-candidates/pr-1-049cf12207264c6d8398f6c51b30d3081605d366b56347c6813944812c/notebooks/SageMakerAutopilotCandidateDefinitionNotebook.ipynb'

### Data Exploration Notebook
SageMaker Autopilot also auto-generates a Data Exploration notebook, which can be downloaded from the following Amazon S3 location:

In [31]:
sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['AutoMLJobArtifacts']['DataExplorationNotebookLocation']


u's3://sagemaker-us-east-1-644944822023/sagemaker/autopilot-dm/output/automl-banking-15-02-06-41/sagemaker-automl-candidates/pr-1-049cf12207264c6d8398f6c51b30d3081605d366b56347c6813944812c/notebooks/SageMakerAutopilotDataExplorationNotebook.ipynb'
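
Both notebook locations are returned as S3 URIs, while S3 download calls generally take a bucket name and an object key separately. The following sketch splits a URI into those two parts. `split_s3_uri` is a hypothetical helper, and the URI in the example is a placeholder rather than a real location.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Placeholder URI for illustration only.
bucket_name, key = split_s3_uri("s3://example-bucket/some/prefix/notebooks/Example.ipynb")
print(bucket_name)
print(key)
```

With the two parts in hand, a call such as `boto3.client('s3').download_file(bucket_name, key, 'Example.ipynb')` would copy the notebook to the local directory.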

## Cleanup

The Autopilot job creates many underlying artifacts, such as dataset splits, preprocessing scripts, and preprocessed data. The code below, when uncommented, deletes them. This operation also deletes all the generated models and the auto-generated notebooks.

In [None]:
#s3 = boto3.resource('s3')
#bucket = s3.Bucket(bucket)

#job_outputs_prefix = '{}/output/{}'.format(prefix,auto_ml_job_name)
#bucket.objects.filter(Prefix=job_outputs_prefix).delete()