# SageMaker AutoML Experiment

We will test the SageMaker AutoML tool for the ACloudGuru lab.

In [20]:
import os
import pandas as pd
import boto3
import datetime
from os.path import join

## Data upload

The first step is to upload our dataset to S3. 

In this instance, we will upload the entire raw dataset, with no feature engineering applied.

From a previous EDA we noted:

* It's relatively clean - there are no missing values to impute
* We have a lot of features to work with

It will be interesting to see how AutoML handles the data with so little help.

In [17]:
root_dir = os.path.abspath("..")
data_dir = join(root_dir, "data")
raw_data_dir = join(data_dir, "raw")
processed_data_dir = join(data_dir, "processed")

We'll read in our full dataset to begin. This is a `.csv` file - we make no further assumptions:

In [8]:
df_ufo = pd.read_csv(join(raw_data_dir, "ufo_fullset.csv"))

The data is successfully read in. We now write it to the processed folder with a header row and no index column, so that it is ready for the AutoML job:

In [18]:
df_ufo.to_csv(join(processed_data_dir, "ufo_fullset_formatted.csv"), index=False, header=True)
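`to_csv` does not create missing parent directories, so the write above fails if the processed folder is absent. A minimal sketch of a guard follows; the helper name `ensure_dir` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def ensure_dir(path):
    """Create `path` and any missing parents if needed, then return it as a Path.

    Hypothetical helper: safe to call repeatedly because exist_ok=True
    makes the call a no-op when the directory is already there.
    """
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    return directory
```

Calling `ensure_dir(processed_data_dir)` before the `to_csv` cell would be one way to use it.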

And now upload to S3 (assuming of course that our bucket exists):

In [19]:
s3 = boto3.resource("s3")
bucket_name = os.environ["s3_bucket"]
bucket = s3.Bucket(bucket_name)

target_key = "automl/input/ufo_fullset.csv"
bucket.upload_file(join(processed_data_dir, "ufo_fullset_formatted.csv"), target_key)

## AutoML job

### Config

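We now define the input and output configuration for our AutoML job. The input points at the file we uploaded and names `researchOutcome` as the target column; the output is an S3 prefix in the same bucket.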

In [30]:
input_data_config = [{
    'DataSource': {
        'S3DataSource': {
            'S3DataType': 'S3Prefix',
            'S3Uri': 's3://{}/{}'.format(bucket_name, target_key)
        }
    },
    'TargetAttributeName': 'researchOutcome'
}]

output_data_config = {
    'S3OutputPath': 's3://{}/automl/output/'.format(bucket_name)
}
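The two dictionaries above can also be produced by a small function, which keeps the S3 URIs consistent if the bucket or key changes. This is only a sketch: `build_automl_config` and its parameter names are hypothetical, and the dictionary layout simply mirrors the config shown above.

```python
def build_automl_config(bucket_name, input_key, target_column,
                        output_prefix="automl/output/"):
    """Return (input_data_config, output_data_config) for an AutoML job.

    Hypothetical helper; the structure copies the dictionaries used
    in this notebook rather than covering every available option.
    """
    input_data_config = [{
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://{}/{}".format(bucket_name, input_key),
            }
        },
        "TargetAttributeName": target_column,
    }]
    output_data_config = {
        "S3OutputPath": "s3://{}/{}".format(bucket_name, output_prefix)
    }
    return input_data_config, output_data_config
```

For this notebook the call would be `build_automl_config(bucket_name, target_key, "researchOutcome")`.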

### Create job

In [27]:
now = datetime.datetime.now()

In [33]:
int(now.timestamp())

1601708912

In [35]:
auto_ml_job_name = 'automl-job-{}'.format(int(datetime.datetime.now().timestamp()))
print(auto_ml_job_name)

automl-job-1601709049
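The timestamp suffix keeps job names unique between runs. The sketch below wraps that idea in a helper and checks the name before it is sent to the API. It assumes the `AutoMLJobName` constraint is at most 32 characters, made of alphanumerics and hyphens, starting and ending with an alphanumeric; treat that limit as an assumption to confirm against the API reference. The name `make_job_name` is hypothetical.

```python
import re
import time

# Assumed constraint for AutoMLJobName: 1-32 characters, alphanumerics
# and hyphens, beginning and ending with an alphanumeric character.
_JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,31}$")
_MAX_JOB_NAME_LENGTH = 32


def make_job_name(prefix="automl-job", timestamp=None):
    """Build '<prefix>-<unix timestamp>' and validate it locally."""
    if timestamp is None:
        timestamp = int(time.time())
    name = "{}-{}".format(prefix, timestamp)
    if len(name) > _MAX_JOB_NAME_LENGTH or not _JOB_NAME_PATTERN.match(name):
        raise ValueError("Invalid AutoML job name: {!r}".format(name))
    return name
```

Failing locally with a `ValueError` gives a clearer message than waiting for the service to reject the request.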


In [36]:
sm = boto3.client("sagemaker")
sm.create_auto_ml_job(AutoMLJobName=auto_ml_job_name,
                      InputDataConfig=input_data_config,
                      OutputDataConfig=output_data_config,
                      RoleArn=os.environ['sm_role'])

{'AutoMLJobArn': 'arn:aws:sagemaker:ap-southeast-2:949012111517:automl-job/automl-job-1601709049',
 'ResponseMetadata': {'RequestId': '71b64b19-de23-4a2f-875f-934b949cf62f',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '71b64b19-de23-4a2f-875f-934b949cf62f',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '97',
   'date': 'Sat, 03 Oct 2020 07:10:53 GMT'},
  'RetryAttempts': 0}}
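`create_auto_ml_job` returns as soon as the job is accepted; the job itself runs in the background (here the timestamps in the job description are roughly 25 minutes apart). One option is to poll `describe_auto_ml_job` until the job reaches a terminal state. The sketch below assumes the response carries an `AutoMLJobStatus` field with the values `Completed`, `Failed` or `Stopped` once finished; the function name and its defaults are hypothetical.

```python
import time

# Statuses after which the job will not change any further (assumed set).
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_auto_ml_job(sm_client, job_name, poll_seconds=60, max_polls=240):
    """Poll the job description until it reaches a terminal status.

    `sm_client` is expected to be a boto3 SageMaker client, e.g. the
    `sm` object created above. Returns the final description dict.
    """
    for _ in range(max_polls):
        response = sm_client.describe_auto_ml_job(AutoMLJobName=job_name)
        if response["AutoMLJobStatus"] in TERMINAL_STATUSES:
            return response
        time.sleep(poll_seconds)
    raise TimeoutError(
        "Job {!r} did not finish after {} polls".format(job_name, max_polls)
    )
```

In this notebook the call would be `wait_for_auto_ml_job(sm, auto_ml_job_name)`. Passing the client in as an argument also makes the function easy to exercise with a stub.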

### Describe job

In [88]:
sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)

{'AutoMLJobName': 'automl-job-1601709049',
 'AutoMLJobArn': 'arn:aws:sagemaker:ap-southeast-2:949012111517:automl-job/automl-job-1601709049',
 'InputDataConfig': [{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix',
     'S3Uri': 's3://acg-sm-demo-goyder/automl/input/ufo_fullset.csv'}},
   'TargetAttributeName': 'researchOutcome'}],
 'OutputDataConfig': {'S3OutputPath': 's3://acg-sm-demo-goyder/automl/output/'},
 'RoleArn': 'arn:aws:iam::949012111517:role/sagemaker-role-acg-demo',
 'CreationTime': datetime.datetime(2020, 10, 3, 15, 10, 52, 680000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2020, 10, 3, 15, 35, 48, 633000, tzinfo=tzlocal()),
 'BestCandidate': {'CandidateName': 'tuning-job-1-93394821faf04a20b2-002-11ebe9ac',
  'FinalAutoMLJobObjectiveMetric': {'MetricName': 'validation:accuracy',
   'Value': 0.9503200054168701},
  'ObjectiveStatus': 'Succeeded',
  'CandidateSteps': [{'CandidateStepType': 'AWS::SageMaker::ProcessingJob',
    'CandidateStepArn': 'arn:a

### Evaluate job

We now list the candidates the job produced, sorted by their final objective metric, and inspect them.

In [89]:
candidates = sm.list_candidates_for_auto_ml_job(AutoMLJobName=auto_ml_job_name,
                                                SortBy='FinalObjectiveMetricValue'
                                               )['Candidates']
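Rather than relying on the order in which the API returns candidates, the best one can be picked locally. The sketch below uses only the `ObjectiveStatus` and `FinalAutoMLJobObjectiveMetric` fields as they appear in the job description output above; the function name `best_candidate` is hypothetical.

```python
def best_candidate(candidates, higher_is_better=True):
    """Return the succeeded candidate with the best objective metric value.

    Candidates without a final metric, or whose objective did not
    succeed, are ignored. Set higher_is_better=False for error metrics.
    """
    scored = [
        c for c in candidates
        if c.get("ObjectiveStatus") == "Succeeded"
        and "FinalAutoMLJobObjectiveMetric" in c
    ]
    if not scored:
        raise ValueError("No succeeded candidates with a final metric")

    def metric_value(candidate):
        return candidate["FinalAutoMLJobObjectiveMetric"]["Value"]

    return max(scored, key=metric_value) if higher_is_better else min(scored, key=metric_value)
```

With the list fetched above, `best_candidate(candidates)["CandidateName"]` gives the name of the top candidate by `validation:accuracy`.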

In [91]:
candidates

[{'CandidateName': 'tuning-job-1-93394821faf04a20b2-003-bc71d42c',
  'FinalAutoMLJobObjectiveMetric': {'MetricName': 'validation:accuracy',
   'Value': 0.9503200054168701},
  'ObjectiveStatus': 'Succeeded',
  'CandidateSteps': [{'CandidateStepType': 'AWS::SageMaker::ProcessingJob',
    'CandidateStepArn': 'arn:aws:sagemaker:ap-southeast-2:949012111517:processing-job/db-1-6b6f57cfa3b244c1b008b9482ff383c3f2ce40b4087e46509551b15257',
    'CandidateStepName': 'db-1-6b6f57cfa3b244c1b008b9482ff383c3f2ce40b4087e46509551b15257'},
   {'CandidateStepType': 'AWS::SageMaker::TrainingJob',
    'CandidateStepArn': 'arn:aws:sagemaker:ap-southeast-2:949012111517:training-job/automl-job-dpp1-1-e34f0a60d1594adc89a2db4294ec431bbdd9853bca814',
    'CandidateStepName': 'automl-job-dpp1-1-e34f0a60d1594adc89a2db4294ec431bbdd9853bca814'},
   {'CandidateStepType': 'AWS::SageMaker::TransformJob',
    'CandidateStepArn': 'arn:aws:sagemaker:ap-southeast-2:949012111517:transform-job/automl-job-dpp1-csv-1-633678ba7