# Amazon SageMaker Autopilot Demo

---

*This notebook should be run in Amazon SageMaker Studio on an ml.t3.medium instance with the Python 3 (Data Science) kernel in the Oregon (us-west-2) region so that the hyperlinks to the console work correctly.*

## Introduction

Amazon SageMaker Autopilot is an automated machine learning (commonly referred to as AutoML) solution for tabular datasets. 

This notebook demonstrates how to configure the model to return the inference probability and how to explore the five best-performing models. For this demo we use the churn dataset available from the [University of California Irvine Repository of Machine Learning Datasets](https://archive.ics.uci.edu/ml/datasets.php).

Let's start by importing the Python libraries we need and specifying a bucket to use with Autopilot.

In [None]:
import sagemaker
import boto3
from sagemaker import get_execution_role
import pandas as pd
import numpy as np
from pprint import pprint
from sagemaker import AutoML
from time import gmtime, strftime, sleep

region = boto3.Session().region_name
session = sagemaker.Session()
bucket = session.default_bucket()
prefix = "sagemaker/autopilot-churn-demo"
role = get_execution_role()
sm = boto3.Session().client(service_name="sagemaker", region_name=region)

### Split the dataset into training and test

The training split is used by SageMaker Autopilot. The test split is used later to perform inference with the suggested model, so it is saved without the target column and without a header. Both files are saved in the working directory.

In [None]:
churn = pd.read_csv("./churn.txt")
train_data = churn.sample(frac=0.8, random_state=200)
test_data = churn.drop(train_data.index)
test_data_no_target = test_data.drop(columns=["Churn?"])

train_file = "train_data.csv"
train_data.to_csv(train_file, index=False, header=True)

test_file = "test_data.csv"
test_data_no_target.to_csv(test_file, index=False, header=False)
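
Before handing the training file to Autopilot, it can be worth checking that the two splits are disjoint and that both contain a similar share of churners. The sketch below repeats the same `sample`/`drop` split on a small made-up DataFrame; the `Day Mins` values and the label pattern are hypothetical stand-ins, and only the `Churn?` column name and its `True.`/`False.` labels mirror the notebook.

```python
import pandas as pd

# Hypothetical stand-in for the churn data (100 rows, about 15% churners).
churn = pd.DataFrame(
    {
        "Day Mins": range(100),
        "Churn?": ["True." if i % 7 == 0 else "False." for i in range(100)],
    }
)

# Same split as above: 80% sampled for training, the rest held out.
train_data = churn.sample(frac=0.8, random_state=200)
test_data = churn.drop(train_data.index)

# The two splits should share no rows and together cover every row.
overlap = train_data.index.intersection(test_data.index)
covers_all = len(train_data) + len(test_data) == len(churn)
print("Overlapping rows:", len(overlap), "| covers all rows:", covers_all)

# Compare the share of each label in the two splits.
balance = pd.DataFrame(
    {
        "train": train_data["Churn?"].value_counts(normalize=True),
        "test": test_data["Churn?"].value_counts(normalize=True),
    }
)
print(balance)
```

On the real dataset, a large gap between the `train` and `test` columns would suggest trying a different `random_state` or a stratified split.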

## Autopilot On!

We will use the `AutoML` estimator from the SageMaker Python SDK to invoke Autopilot and find the best ML pipeline for training a model on this dataset.

The required inputs for invoking an Autopilot job are:
* The input dataset
* The name of the column to predict. In our case this is the `Churn?` column, a binary classification target
* The IAM role with access to all the required resources

You can also specify the type of problem you want to solve with your dataset (`Regression`, `MulticlassClassification`, or `BinaryClassification`). However, Autopilot can detect the problem type from the target attribute. In our case `Churn?` is binary, so our model will perform binary classification.
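
If you prefer to set the problem type yourself instead of relying on detection, the `AutoML` estimator accepts `problem_type` and `job_objective` arguments. The sketch below uses a hypothetical helper, `build_automl_kwargs`, that only assembles those keyword arguments. It assumes that the problem type and the objective metric should be supplied together, and that `F1` is an acceptable metric name for binary classification; check both against the SDK version you are using.

```python
# Problem types listed in the text above.
PROBLEM_TYPES = {"Regression", "MulticlassClassification", "BinaryClassification"}


def build_automl_kwargs(target, problem_type=None, objective_metric=None, max_candidates=5):
    """Assemble keyword arguments for sagemaker.AutoML (hypothetical helper)."""
    kwargs = {"target_attribute_name": target, "max_candidates": max_candidates}
    if problem_type is None and objective_metric is None:
        # Nothing specified: leave the problem type for Autopilot to detect.
        return kwargs
    if problem_type is None or objective_metric is None:
        # Assumption: the two settings are only meaningful as a pair.
        raise ValueError("set problem_type and objective_metric together")
    if problem_type not in PROBLEM_TYPES:
        raise ValueError(f"unknown problem type: {problem_type}")
    kwargs["problem_type"] = problem_type
    kwargs["job_objective"] = {"MetricName": objective_metric}
    return kwargs


explicit = build_automl_kwargs("Churn?", "BinaryClassification", "F1")
print(explicit)
```

The resulting dictionary could then be passed along with the role and session, for example `AutoML(role=role, sagemaker_session=session, **explicit)`.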

We limit the number of candidates to 5 so that the job finishes in under an hour.

In [None]:
timestamp_suffix = strftime("%d-%H-%M-%S", gmtime())
base_job_name = "automl-churn-sdk-" + timestamp_suffix

train_file = "train_data.csv"

target_attribute_name = "Churn?"
target_attribute_values = np.unique(train_data[target_attribute_name])
target_attribute_true_value = target_attribute_values[1]  # 'True.'

automl = AutoML(
    role=role,
    target_attribute_name=target_attribute_name,
    base_job_name=base_job_name,
    sagemaker_session=session,
    max_candidates=5,
)

automl.fit(train_file, job_name=base_job_name, wait=False, logs=False)

### Tracking SageMaker Autopilot Job Progress<a name="Tracking"></a>
A SageMaker Autopilot job consists of the following high-level steps:
* Analyzing Data, where the dataset is analyzed and Autopilot comes up with a list of ML pipelines that should be tried out on the dataset. The processing job connected to this can be found [here](https://us-west-2.console.aws.amazon.com/sagemaker/home?region=us-west-2#/processing-jobs)
* Feature Engineering, where Autopilot performs feature transformation on individual features of the dataset as well as at an aggregate level. The 5 training jobs connected to this can be found [here](https://us-west-2.console.aws.amazon.com/sagemaker/home?region=us-west-2#/jobs)
* Model Tuning, where the top performing pipeline is selected along with the optimal hyperparameters for the training algorithm (the last stage of the pipeline). The tuning job connected to this can be found [here](https://us-west-2.console.aws.amazon.com/sagemaker/home?region=us-west-2#/hyper-tuning-jobs)

We can use the `describe_auto_ml_job` method to check the status of our SageMaker Autopilot job.

Once you run the cell below, please be aware that Autopilot will take approximately 45 minutes to complete all stages.

In [None]:
print("JobStatus - Secondary Status")
print("------------------------------")


describe_response = automl.describe_auto_ml_job()
print(describe_response["AutoMLJobStatus"] + " - " + describe_response["AutoMLJobSecondaryStatus"])
job_run_status = describe_response["AutoMLJobStatus"]

while job_run_status not in ("Failed", "Completed", "Stopped"):
    describe_response = automl.describe_auto_ml_job()
    job_run_status = describe_response["AutoMLJobStatus"]

    print(
        describe_response["AutoMLJobStatus"] + " - " + describe_response["AutoMLJobSecondaryStatus"]
    )
    sleep(30)
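
The loop above runs until the job reaches a terminal status. As a variation, here is a minimal sketch of the same polling pattern with a timeout that prints only when the status changes. The helper name `wait_for_job` and its parameters are hypothetical. It takes any zero-argument callable that returns a dictionary with the `AutoMLJobStatus` and `AutoMLJobSecondaryStatus` keys used above, so it can be tried without a running job.

```python
import time

TERMINAL_STATUSES = ("Failed", "Completed", "Stopped")


def wait_for_job(describe, poll_seconds=30, timeout_seconds=2 * 60 * 60, sleep=time.sleep):
    """Poll describe() until the job is in a terminal status or the timeout passes."""
    deadline = time.monotonic() + timeout_seconds
    last_printed = None
    while True:
        response = describe()
        status = (response["AutoMLJobStatus"], response["AutoMLJobSecondaryStatus"])
        if status != last_printed:
            # Print only when the primary or secondary status has changed.
            print(" - ".join(status))
            last_printed = status
        if status[0] in TERMINAL_STATUSES:
            return response
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status[0]} after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Simulated responses standing in for describe_auto_ml_job(); not real job output.
fake_responses = iter(
    [
        {"AutoMLJobStatus": "InProgress", "AutoMLJobSecondaryStatus": "AnalyzingData"},
        {"AutoMLJobStatus": "InProgress", "AutoMLJobSecondaryStatus": "FeatureEngineering"},
        {"AutoMLJobStatus": "Completed", "AutoMLJobSecondaryStatus": "Completed"},
    ]
)
final = wait_for_job(lambda: next(fake_responses), sleep=lambda seconds: None)
```

With the real job, the call would be `wait_for_job(automl.describe_auto_ml_job)`.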

---
## Describing the SageMaker Autopilot Job Results <a name="Results"></a>

We can use the `describe_auto_ml_job` method to look up the best candidate generated by the SageMaker Autopilot job. This notebook demonstrates Autopilot end to end, so we already have an initialized `automl` object.

In [None]:
best_candidate = automl.describe_auto_ml_job()["BestCandidate"]
best_candidate_name = best_candidate["CandidateName"]
pprint(best_candidate)
print("\n")
print("CandidateName: " + best_candidate_name)
print(
    "FinalAutoMLJobObjectiveMetricName: "
    + best_candidate["FinalAutoMLJobObjectiveMetric"]["MetricName"]
)
print(
    "FinalAutoMLJobObjectiveMetricValue: "
    + str(best_candidate["FinalAutoMLJobObjectiveMetric"]["Value"])
)

Because some of the algorithms involved are randomized, different runs will give slightly different results, but accuracy should be around 93-94% or higher, which is a good result.

### Check Top Candidates

In addition to the `best_candidate`, we can also explore the other top candidates generated by SageMaker Autopilot. 

We use the `list_candidates` method to see the other top candidates, passing `max_results=5` to return at most five.

In [None]:
candidates = automl.list_candidates(
    sort_by="FinalObjectiveMetricValue",
    sort_order="Descending",
    max_results=5,
)

for candidate in candidates:
    print("Candidate name: ", candidate["CandidateName"])
    print("Objective metric name: ", candidate["FinalAutoMLJobObjectiveMetric"]["MetricName"])
    print("Objective metric value: ", candidate["FinalAutoMLJobObjectiveMetric"]["Value"])
    print("\n")
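
The printed list can be easier to compare as a table. The sketch below flattens candidate dictionaries into a pandas DataFrame. The helper name `candidates_to_frame` is hypothetical, and the sample candidates are made up for illustration: they use the same keys that the loop above reads but are not real Autopilot results.

```python
import pandas as pd


def candidates_to_frame(candidates):
    """Flatten candidate dicts into a table sorted by objective metric value."""
    rows = [
        {
            "CandidateName": c["CandidateName"],
            "MetricName": c["FinalAutoMLJobObjectiveMetric"]["MetricName"],
            "Value": c["FinalAutoMLJobObjectiveMetric"]["Value"],
        }
        for c in candidates
    ]
    frame = pd.DataFrame(rows, columns=["CandidateName", "MetricName", "Value"])
    # Highest value first; this assumes a metric where larger is better.
    return frame.sort_values("Value", ascending=False).reset_index(drop=True)


# Made-up candidates for illustration only.
sample_candidates = [
    {"CandidateName": "candidate-a", "FinalAutoMLJobObjectiveMetric": {"MetricName": "validation:f1", "Value": 0.88}},
    {"CandidateName": "candidate-b", "FinalAutoMLJobObjectiveMetric": {"MetricName": "validation:f1", "Value": 0.93}},
    {"CandidateName": "candidate-c", "FinalAutoMLJobObjectiveMetric": {"MetricName": "validation:f1", "Value": 0.91}},
]
table = candidates_to_frame(sample_candidates)
print(table)
```

With the real job, `candidates_to_frame(candidates)` would produce the same table from the list returned above. For an objective where lower is better, such as an error metric, the sort direction would need to be reversed.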