# SageMaker AutoPilot

SageMaker provides a feature called **AutoPilot** that automates the training and fine-tuning of models for various tasks. AutoPilot is powerful because it not only draws on the models already implemented in SageMaker but also processes the data and engineers useful new features automatically. It can also detect whether the task is regression or classification by analyzing the target.

In this notebook, we will use AutoPilot on the `obesity.csv` data. Because the target in this data is an integer with only seven distinct values (1 to 7), AutoPilot treats the task as classification. For data with a continuous target, AutoPilot runs regression models instead.
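
AutoPilot's detection logic is internal to the service, but a quick look at the target column shows why a seven-valued integer reads as classification. The sketch below is only an illustration: `looks_like_classification` and its `max_classes` threshold are hypothetical names chosen here, and the heuristic does not reproduce AutoPilot's actual rules.

```python
def looks_like_classification(values, max_classes=20):
    """Rough heuristic (NOT AutoPilot's rule): a small number of distinct
    whole-number or string values suggests a classification target."""
    distinct = set(values)
    if len(distinct) > max_classes:
        return False
    for v in distinct:
        # string labels and whole numbers both look like class labels
        if isinstance(v, (int, str)):
            continue
        if isinstance(v, float) and v.is_integer():
            continue
        return False
    return True


obesity_levels = [1, 2, 3, 4, 5, 6, 7, 2, 2, 5]   # shaped like ObesityLevel
weights = [64.0, 56.3, 77.1, 87.5, 89.8]          # a continuous column
print(looks_like_classification(obesity_levels))
print(looks_like_classification(weights))
```

A check like this is mainly useful as a sanity test before launching a job, so that the problem type AutoPilot reports is not a surprise.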

## Load data

There is not much to discuss here: we set the path to the bucket and the data file, then load the data into a `pandas` dataframe.

In [1]:
import pandas as pd
import numpy as np

data_name = 'obesity.csv'
data_location = 'your bucket'

data = pd.read_csv(data_location + data_name)
data.head(3)


Unnamed: 0,Gender,Age,Height,Weight,FamilyHistory,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,ObesityLevel
0,Female,21.0,1.62,64.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,0.0,1.0,no,Public_Transportation,2
1,Female,21.0,1.52,56.0,yes,no,3.0,3.0,Sometimes,yes,3.0,yes,3.0,0.0,Sometimes,Public_Transportation,2
2,Male,23.0,1.8,77.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,2.0,1.0,Frequently,Public_Transportation,2


Next, we define the input features and the target.

In [2]:
columns = [
    'Gender',
    'Age',
    'Height',
    'Weight',
    'FamilyHistory',
    'FAVC',
    'FCVC',
    'NCP',
    'CAEC',
    'SMOKE',
    'CH2O',
    'SCC',
    'FAF',
    'TUE',
    'CALC',
    'MTRANS'
]

target = 'ObesityLevel'

Then we split the data into train and test sets and write each to a CSV file, ready to upload to our S3 bucket for the model. The test file must not contain the target, so we drop it from the test data.

In [55]:
from sklearn.model_selection import train_test_split

traindata, testdata_w_target = train_test_split(data, test_size = 0.2)
traindata.to_csv('train_data.csv', index=False)

testdata = testdata_w_target.drop([target], axis=1)
testdata.to_csv('test_data_no_target.csv', index=False)
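
Two details of the split are worth a second look: it is re-drawn randomly on every run, and it keeps every column of `data`, including `Unnamed: 0`, which looks like a saved row index rather than a feature. The sketch below shows one possible variant on toy data, assuming you want a repeatable, stratified split restricted to the declared `columns`. The toy frame, the seed value `42`, and the names `train_part` and `test_part` are assumptions made for this example, not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy stand-in for the obesity data: 70 rows, 7 balanced classes
toy = pd.DataFrame({
    "Unnamed: 0": range(70),
    "Age": [20 + i % 30 for i in range(70)],
    "Weight": [50.0 + i for i in range(70)],
    "ObesityLevel": [1 + i % 7 for i in range(70)],
})
columns = ["Age", "Weight"]
target = "ObesityLevel"

# keep only the declared features plus the target (drops the stray index column)
subset = toy[columns + [target]]

# a fixed seed makes the split repeatable; stratify keeps class proportions similar
train_part, test_part_w_target = train_test_split(
    subset, test_size=0.2, random_state=42, stratify=subset[target]
)
test_part = test_part_w_target.drop(columns=[target])

print(train_part.shape, test_part.shape)
```

Stratifying matters most when some classes are rare, since a purely random 20% sample can otherwise miss them in the test set.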

Upload to S3

In [56]:
import sagemaker
import boto3
import os

sess = sagemaker.Session()                        # get our current SageMaker session
bucket = 'lle13-it7143'                           # this should be the name of the bucket we created in module 9
prefix = 'obesity'                                # the folder to store your data in the S3 instance

# S3 keys always use forward slashes, so build them explicitly rather than with os.path.join
s3 = boto3.Session().resource('s3')
s3.Bucket(bucket).Object(f'{prefix}/train/train.csv').upload_file('train_data.csv')
s3.Bucket(bucket).Object(f'{prefix}/test/test.csv').upload_file('test_data_no_target.csv')

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
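
If more files need to go to the same bucket, the upload step can be wrapped in a small helper. This is a minimal sketch, assuming an object that behaves like `boto3.client("s3")` and its `upload_file(Filename, Bucket, Key)` method. The helper names `s3_key` and `upload_files` are hypothetical and introduced only for this example.

```python
def s3_key(*parts):
    """Join key parts with '/', the separator S3 keys use on every OS."""
    return "/".join(p.strip("/") for p in parts if p)


def upload_files(s3_client, bucket, prefix, files):
    """Upload {local_path: relative_key} pairs and return their S3 URIs.

    `s3_client` is expected to behave like boto3.client("s3").
    """
    uris = []
    for local_path, relative_key in files.items():
        key = s3_key(prefix, relative_key)
        s3_client.upload_file(Filename=local_path, Bucket=bucket, Key=key)
        uris.append(f"s3://{bucket}/{key}")
    return uris


# Possible usage in this notebook (requires AWS credentials):
# import boto3
# upload_files(boto3.client("s3"), bucket, prefix, {
#     "train_data.csv": "train/train.csv",
#     "test_data_no_target.csv": "test/test.csv",
# })
```

Returning the URIs makes it easy to print or reuse the exact locations when configuring the job.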


## Set up the AutoPilot Job

Since we already defined all the required information (bucket, folder, and target), we can run the cell below as-is. It sets the S3 path of the training data, the name of the target column, the maximum number of candidate models to try (10), and the S3 path for the job's output.

In [8]:
input_data_config = [
    {
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://{}/{}/train".format(bucket, prefix),
            }
        },
        "TargetAttributeName": target,
    }
]

job_config = {"CompletionCriteria": {"MaxCandidates": 10}}


output_data_config = {"S3OutputPath": "s3://{}/{}/output".format(bucket, prefix)}
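
AutoPilot infers the problem type on its own, but the job request can also pin it. As far as the API reference describes, `create_auto_ml_job` accepts a `ProblemType` together with an `AutoMLJobObjective`, and the two should be supplied as a pair. The sketch below only builds those extra arguments; `explicit_problem_settings` is a hypothetical helper, and choosing `Accuracy` as the metric is an assumption for this data.

```python
def explicit_problem_settings(problem_type, metric_name):
    """Extra keyword arguments for create_auto_ml_job that pin the task
    instead of letting AutoPilot infer it."""
    allowed = {"BinaryClassification", "MulticlassClassification", "Regression"}
    if problem_type not in allowed:
        raise ValueError(f"problem_type must be one of {sorted(allowed)}")
    return {
        "ProblemType": problem_type,
        "AutoMLJobObjective": {"MetricName": metric_name},
    }


extra_kwargs = explicit_problem_settings("MulticlassClassification", "Accuracy")
print(extra_kwargs)
# these could then be passed along with the other arguments,
# e.g. sm.create_auto_ml_job(..., **extra_kwargs)
```

Pinning the type is mostly useful when an integer target with few values should really be treated as regression, or the reverse.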

## Create and Launch the AutoPilot Job

Now we can create the AutoPilot job and submit it to SageMaker. The code is much simpler than for XGBoost or Linear Learner because we don't have to set hyperparameters or choose model images.

In [10]:
auto_ml_job_name = 'automl-obesity'
region = boto3.Session().region_name
sm = boto3.Session().client(service_name="sagemaker", region_name=region)
role = sagemaker.get_execution_role()

sm.create_auto_ml_job(
    AutoMLJobName=auto_ml_job_name,
    InputDataConfig=input_data_config,
    OutputDataConfig=output_data_config,
    AutoMLJobConfig=job_config,
    RoleArn=role,
)

{'AutoMLJobArn': 'arn:aws:sagemaker:us-east-2:111098985325:automl-job/automl-obesity',
 'ResponseMetadata': {'RequestId': '4525ca03-4df0-490b-9876-7baa4feaa068',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '4525ca03-4df0-490b-9876-7baa4feaa068',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '85',
   'date': 'Sat, 06 Jan 2024 07:32:55 GMT'},
  'RetryAttempts': 0}}
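
One practical detail: the job name has to be unique within your account, so running the cell above a second time with the same `auto_ml_job_name` is rejected. A minimal sketch of one workaround is to append a timestamp. The helper `unique_job_name` is hypothetical, and the 32-character cap reflects the documented length limit for the job name as understood here, so treat it as an assumption.

```python
import time


def unique_job_name(base, max_length=32):
    """Append a UTC timestamp so each run gets a fresh job name.

    max_length=32 is an assumed cap on the job name length; adjust it
    if the API reports a different limit.
    """
    suffix = time.strftime("-%m%d-%H%M%S", time.gmtime())
    # trim the base, not the suffix, so the name stays unique
    trimmed = base[: max_length - len(suffix)].rstrip("-")
    return trimmed + suffix


print(unique_job_name("automl-obesity"))
```

Keep the generated name in a variable, since the status, deployment, and cleanup cells all refer to it.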

The AutoPilot job runs separately from this notebook, so the cell above completes right after you run it. To track the job, run the cell below, which retrieves and prints the job status every 60 seconds until the job finishes.

In [12]:
from time import sleep

print("JobStatus - Secondary Status")
print("------------------------------")

describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
print(describe_response["AutoMLJobStatus"] + " - " + describe_response["AutoMLJobSecondaryStatus"])
job_run_status = describe_response["AutoMLJobStatus"]

# poll every 60 seconds until the job reaches a terminal state;
# sleeping first avoids an extra 60-second wait after the job has finished
while job_run_status not in ("Failed", "Completed", "Stopped"):
    sleep(60)
    describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
    job_run_status = describe_response["AutoMLJobStatus"]

    print(
        describe_response["AutoMLJobStatus"] + " - " + describe_response["AutoMLJobSecondaryStatus"]
    )

JobStatus - Secondary Status
------------------------------
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - ModelTuning
InProgress - ModelTuning
InProgress - ModelTuning
InProgress - ModelTuning
InProgress - ModelTuning
InProgress - ModelTuning
InProgress - ModelTuning
InProgress - MergingAutoMLTaskReports
InProgress - MergingAutoMLTaskReports
InProgress - MergingAutoMLTaskReports
InProgress - MergingAutoMLTaskReports
InProgress - MergingAutoMLTaskReports
InProgress - MergingAutoMLTaskReports
InProgress - MergingAutoMLTaskReports
InProgress - MergingAutoMLTaskReports
InProgress - Mer
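
The polling loop can also be packaged as a function with a timeout, which prints a line only when the status changes. This is a sketch under assumptions: `wait_for_automl_job` is a hypothetical helper, the four-hour default timeout is arbitrary, and `client` is expected to behave like the `sm` client used above.

```python
import time


def wait_for_automl_job(client, job_name, poll_seconds=60,
                        timeout_seconds=4 * 3600, sleep=time.sleep):
    """Poll describe_auto_ml_job until a terminal status or a timeout.

    Returns the final status string; raises TimeoutError if the job is
    still running after `timeout_seconds`.
    """
    terminal = ("Failed", "Completed", "Stopped")
    waited = 0
    last_seen = None
    while True:
        response = client.describe_auto_ml_job(AutoMLJobName=job_name)
        status = response["AutoMLJobStatus"]
        secondary = response["AutoMLJobSecondaryStatus"]
        if (status, secondary) != last_seen:
            # print only when something changed, to keep the log short
            print(f"{status} - {secondary}")
            last_seen = (status, secondary)
        if status in terminal:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"{job_name} still {status} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Possible usage: final_status = wait_for_automl_job(sm, auto_ml_job_name)
```

Passing `sleep` as a parameter is a design choice that lets the function be exercised quickly with a stub client.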

Once the job completes, we can inspect the best model.

In [17]:
best_candidate = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)["BestCandidate"]
best_candidate_name = best_candidate["CandidateName"]

print("CandidateName: " + best_candidate_name)
print(
    "metric: "
    + best_candidate["FinalAutoMLJobObjectiveMetric"]["MetricName"]
)
print(
    "value: "
    + str(best_candidate["FinalAutoMLJobObjectiveMetric"]["Value"])
)

CandidateName: automl-obesityvSlg1ey6faHtNtWQEk-008-26f090ee
metric: validation:accuracy
value: 0.9541900157928467
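
Besides the best candidate, it can be informative to compare all candidates the job tried. The boto3 SageMaker client offers `list_candidates_for_auto_ml_job` for this. The sketch below only post-processes such a response; `summarize_candidates` is a hypothetical helper, and sorting by descending value assumes a metric where higher is better, such as accuracy.

```python
def summarize_candidates(response):
    """Turn a list_candidates_for_auto_ml_job response into
    (name, metric, value) tuples, highest value first."""
    rows = []
    for candidate in response.get("Candidates", []):
        objective = candidate.get("FinalAutoMLJobObjectiveMetric")
        if objective is None:
            # skip candidates without a final metric (e.g. unfinished ones)
            continue
        rows.append(
            (candidate["CandidateName"], objective["MetricName"], objective["Value"])
        )
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Possible usage in this notebook:
# response = sm.list_candidates_for_auto_ml_job(AutoMLJobName=auto_ml_job_name)
# for name, metric, value in summarize_candidates(response):
#     print(name, metric, value)
```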


## Deploy the Selected Model

Like before, after training, we need to deploy our model to an endpoint. This time, we also need to create the endpoint configuration manually (this was automated previously). However, the code requires no modification and can be reused on different data.

In [31]:
model_name = best_candidate_name + "-model"
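# create_model returns a response dict (which contains the model ARN), not the ARN itself
create_model_response = sm.create_model(Containers=best_candidate["InferenceContainers"], ModelName=model_name, ExecutionRoleArn=role)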

epc_name = best_candidate_name + "-epc"
ep_config = sm.create_endpoint_config(
    EndpointConfigName=epc_name,
    ProductionVariants=[
        {
            "InstanceType": "ml.m4.xlarge",
            "InitialInstanceCount": 1,
            "ModelName": model_name,
            "VariantName": "main",
        }
    ],
)

ep_name = best_candidate_name + "-ep"
create_endpoint_response = sm.create_endpoint(EndpointName=ep_name, EndpointConfigName=epc_name)

Another difference from the previous module is that the deployment happens separately from our notebook session. So we need to run the code below, which blocks until the endpoint is in service.

In [32]:
sm.get_waiter("endpoint_in_service").wait(EndpointName=ep_name)
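
If the deployment fails, the plain waiter call raises an error without saying much about the cause. A minimal sketch of a more informative wrapper follows, assuming `client` behaves like the `sm` client above. `wait_until_in_service` is a hypothetical helper, and the `delay` and `max_attempts` defaults are arbitrary choices passed through boto3's `WaiterConfig`.

```python
def wait_until_in_service(client, endpoint_name, delay=30, max_attempts=60):
    """Block until the endpoint is in service and return its status.

    If the waiter gives up or the endpoint fails, the status and any
    FailureReason from describe_endpoint are included in the error.
    """
    waiter = client.get_waiter("endpoint_in_service")
    try:
        waiter.wait(
            EndpointName=endpoint_name,
            WaiterConfig={"Delay": delay, "MaxAttempts": max_attempts},
        )
    except Exception as error:  # botocore raises a WaiterError here
        details = client.describe_endpoint(EndpointName=endpoint_name)
        reason = details.get("FailureReason", "no reason reported")
        raise RuntimeError(
            f"{endpoint_name} is {details['EndpointStatus']}: {reason}"
        ) from error
    return client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]


# Possible usage: wait_until_in_service(sm, ep_name)
```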

## Inference

Finally, we can make predictions for the test data. The steps are the same as with XGBoost and Linear Learner. We read the test data directly from the S3 bucket and convert it to a `numpy` array.

In [57]:
from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import CSVDeserializer

predictor = Predictor(
    endpoint_name=ep_name,
    sagemaker_session=sess,
    serializer=CSVSerializer(),
    deserializer=CSVDeserializer(),
)

test_data_np = pd.read_csv(f's3://{bucket}/{prefix}/test/test.csv').values
prediction = predictor.predict(test_data_np)
prediction_df = pd.DataFrame(prediction)
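
Sending the whole test set in one call works for a small file like this one, but real-time endpoints limit the size of a single request, so larger test sets may need to be sent in pieces. The sketch below shows one way to do that; `predict_in_batches` is a hypothetical helper and the default `batch_size` of 500 rows is an arbitrary assumption.

```python
def predict_in_batches(predict_fn, rows, batch_size=500):
    """Call predict_fn on consecutive slices of rows and concatenate
    the per-batch results into one list."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    results = []
    for start in range(0, len(rows), batch_size):
        results.extend(predict_fn(rows[start:start + batch_size]))
    return results


# Possible usage in this notebook:
# prediction = predict_in_batches(predictor.predict, test_data_np)
```

Because slicing and `len` work the same way on lists and `numpy` arrays, the helper accepts either.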

In this case, we have the test labels in `testdata_w_target`, so we can evaluate the model's accuracy and F1 score. For this data, the predictions come back as strings, so we need to cast them to integers.

In [61]:
from sklearn.metrics import accuracy_score, f1_score

prediction_int = np.array(prediction, dtype=int).flatten()
print('accuracy: ', accuracy_score(testdata_w_target[target], prediction_int))
print('f1 score: ', f1_score(testdata_w_target[target], prediction_int, average='weighted'))

accuracy:  0.9905437352245863
f1 score:  0.9905426939326567
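
The cast to integer above relies on every returned string being a plain integer such as `'2'`. A slightly more defensive conversion is sketched below, assuming the deserializer returns a list of rows of strings. `parse_labels` is a hypothetical helper that also accepts values like `'2.0'` and fails loudly on anything that is not a whole number.

```python
def parse_labels(rows):
    """Flatten deserialized CSV output such as [['2'], ['5']] into ints.

    Accepts '2' as well as '2.0'; raises ValueError for other values.
    """
    labels = []
    for row in rows:
        for cell in row:
            value = float(cell)
            if not value.is_integer():
                raise ValueError(f"not an integer class label: {cell!r}")
            labels.append(int(value))
    return labels


print(parse_labels([["2"], ["5.0"], ["7"]]))
```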


## Cleaning up

AutoPilot generates a lot more resources than training and tuning a single model. Therefore, this part becomes a bit more complicated. Regardless, please don't forget to run it!

In [63]:
import time

def cleanup_experiment_resources(experiment_name):
    trials = sm.list_trials(ExperimentName=experiment_name)["TrialSummaries"]
    print("TrialNames:")
    for trial in trials:
        trial_name = trial["TrialName"]
        print(f"\n{trial_name}")

        components_in_trial = sm.list_trial_components(TrialName=trial_name)
        print("\tTrialComponentNames:")
        for component in components_in_trial["TrialComponentSummaries"]:
            component_name = component["TrialComponentName"]
            print(f"\t{component_name}")
            sm.disassociate_trial_component(TrialComponentName=component_name, TrialName=trial_name)
            try:
                # comment out to keep trial components
                sm.delete_trial_component(TrialComponentName=component_name)
            except Exception:
                # component is associated with another trial
                continue
            # to prevent throttling
            time.sleep(5)
        sm.delete_trial(TrialName=trial_name)
    sm.delete_experiment(ExperimentName=experiment_name)
    print(f"\nExperiment {experiment_name} deleted")


def cleanup_autopilot_models(autopilot_job_name):
    print("{0}:\n".format(autopilot_job_name))
    response = sm.list_models(NameContains=autopilot_job_name)

    for model in response["Models"]:
        model_name = model["ModelName"]
        print(f"\t{model_name}")
        sm.delete_model(ModelName=model_name)
        # to prevent throttling
        time.sleep(3)
        
cleanup_experiment_resources("{0}-aws-auto-ml-job".format(auto_ml_job_name))
cleanup_autopilot_models(auto_ml_job_name)


TrialNames:

automl-obesityvSlg1ey6faHtNtWQEk-005-e6f29619-aws-trial
	TrialComponentNames:
	automl-obesity-dpp2-csv-1-4c1b0cc2291e434aa325268feafc97ac507e9-aws-transform-job
	automl-obesity-dpp2-1-9a36998be71f446dbd7ea32bcb055e99430bda923-aws-training-job
	automl-obesity-db-1-097f8c93764e4c6fbb270b8cceff7ceb61bc5bd0c61-aws-processing-job

automl-obesityvSlg1ey6faHtNtWQEk-006-3f9b2b0f-aws-trial
	TrialComponentNames:
	automl-obesityvSlg1ey6faHtNtWQEk-006-3f9b2b0f-aws-training-job
	automl-obesity-dpp0-csv-1-d7a20225be7d4f1e8679ea2d582e2e564b350-aws-transform-job
	automl-obesity-dpp0-1-b44a05c57e8a4f58ba3b5c5c597d0e24cc5338710-aws-training-job
	automl-obesity-db-1-097f8c93764e4c6fbb270b8cceff7ceb61bc5bd0c61-aws-processing-job

automl-obesityvSlg1ey6faHtNtWQEk-003-b751cc2f-aws-trial
	TrialComponentNames:
	automl-obesityvSlg1ey6faHtNtWQEk-003-b751cc2f-aws-training-job
	automl-obesity-dpp0-csv-1-d7a20225be7d4f1e8679ea2d582e2e564b350-aws-transform-job
	automl-obesity-dpp0-1-b44a05c57e8a4f58ba3

In [None]:
sm.delete_endpoint(EndpointName=ep_name) 
sm.delete_endpoint_config(EndpointConfigName=epc_name)
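
After cleaning up, it is worth confirming that nothing billable is left behind. The sketch below lists endpoints and models whose names still contain the job name, using the `NameContains` filter of `list_endpoints` and `list_models`. `leftover_resources` is a hypothetical helper, and it reads only the first page of each listing, which is assumed to be enough for a single small job.

```python
def leftover_resources(client, name_fragment):
    """Return names of endpoints and models that still contain
    name_fragment (first page of results only)."""
    endpoints = client.list_endpoints(NameContains=name_fragment)["Endpoints"]
    models = client.list_models(NameContains=name_fragment)["Models"]
    return {
        "endpoints": [e["EndpointName"] for e in endpoints],
        "models": [m["ModelName"] for m in models],
    }


# Possible usage: print(leftover_resources(sm, auto_ml_job_name))
```

Two empty lists indicate that the endpoint and the models created in this notebook are gone.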