# Amazon SageMaker Autopilot (AutoML)

## Introduction
**[Amazon SageMaker Autopilot](https://docs.aws.amazon.com/sagemaker/latest/dg/autopilot-automate-model-development.html)** is an automated machine learning (commonly referred to as AutoML) solution for tabular datasets. You can use **SageMaker Autopilot** in different ways: on autopilot (hence the name) or with human guidance, without code through SageMaker Studio, or using the AWS SDKs. This notebook, as a first glimpse, will use the <u>AWS SDKs</u> to simply create and deploy a machine learning model.

<div class="alert alert-block alert-info">
    <b>Remark</b>: Another no-code option is <b><u>SageMaker Canvas</u></b>; please refer to the <a href=https://aws.amazon.com/sagemaker/canvas/>product page</a> for more information.
</div>

I will use the auto-insurance data that we worked on in the previous three notebooks to demonstrate the AutoML capability.

## Set up

In [None]:
%pip install --upgrade pip awscli botocore boto3 sagemaker --quiet --root-user-action=ignore

In [4]:
import sagemaker
import boto3
import pandas as pd

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


In [6]:
REGION_NM = boto3.Session().region_name
sm_session = sagemaker.Session()

# Adjust Bucket name and prefix based on your AWS dataset location
# S3_BUCKET_NM = sm_session.default_bucket()
S3_BUCKET_NM = 'ml-ai-demo-th'
S3_PREFIX = 'amazon-fraud-detector'

ROLE_ARN = sagemaker.get_execution_role()
sm_client = boto3.client('sagemaker')

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


## Dataset
---

I will continue to use the train / test datasets of auto-insurance claims in this notebook. Let's take a quick look at both.

<div class="alert alert-block alert-info">
    I load the data here only to look at it; Autopilot reads the dataset directly from Amazon S3, so we don't need to load it into the SageMaker Studio notebook.
</div>

In [8]:
train_df = pd.read_csv(f's3://{S3_BUCKET_NM}/{S3_PREFIX}/train/auto_insurance_fraud_train.csv')
print(train_df.shape)
print(train_df.columns)
display(train_df.head(3))

(80000, 23)
Index(['first_name', 'last_name', 'date_of_birth', 'incident_severity',
       'num_vehicles_involved', 'num_injuries', 'num_witnesses',
       'police_report_available', 'injury_claim', 'vehicle_claim',
       'incident_hour', 'customer_age', 'policy_deductable',
       'policy_annual_premium', 'num_claims_past_year', 'EVENT_LABEL',
       'EVENT_TIMESTAMP', 'EVENT_ID', 'ENTITY_TYPE', 'ENTITY_ID', 'policy_id',
       'incident_type', 'LABEL_TIMESTAMP'],
      dtype='object')


Unnamed: 0,first_name,last_name,date_of_birth,incident_severity,num_vehicles_involved,num_injuries,num_witnesses,police_report_available,injury_claim,vehicle_claim,...,policy_annual_premium,num_claims_past_year,EVENT_LABEL,EVENT_TIMESTAMP,EVENT_ID,ENTITY_TYPE,ENTITY_ID,policy_id,incident_type,LABEL_TIMESTAMP
0,Anna,Kim,1960-01-12,5.0,2.0,3.0,2.0,0.0,9785.0,3241.0,...,2600.0,3.0,legit,2022-03-27T12:40:05Z,fb93212b-a1ac-45cd-9bc9-79863600ad18,customer,a584-37055492,fb93212b-a1ac-45cd-9bc9-79863600ad18,theft,2023-03-06T15:05:50Z
1,Nathaniel,Garrett,2014-02-01,4.0,2.0,5.0,2.0,0.0,6258.0,6520.0,...,2004.0,6.0,legit,2022-03-24T03:58:59Z,fdf03dab-9e82-44ab-8947-5a8c571a6285,customer,b860-98572447,fdf03dab-9e82-44ab-8947-5a8c571a6285,breakin,2023-03-06T15:05:50Z
2,Linda,Lynch,2003-10-27,3.0,2.0,4.0,2.0,0.0,9244.0,5581.0,...,1787.0,5.0,legit,2022-09-14T19:52:51Z,92726737-ef66-4fca-91fd-1b9e40806447,customer,d681-19951055,92726737-ef66-4fca-91fd-1b9e40806447,theft,2023-03-06T15:05:50Z


In [89]:
test_df = pd.read_csv(f's3://{S3_BUCKET_NM}/{S3_PREFIX}/test/auto_insurance_fraud_test.csv')
print(test_df.shape)
print(test_df.columns)
display(test_df.head(3))

(20000, 23)
Index(['first_name', 'last_name', 'date_of_birth', 'incident_severity',
       'num_vehicles_involved', 'num_injuries', 'num_witnesses',
       'police_report_available', 'injury_claim', 'vehicle_claim',
       'incident_hour', 'customer_age', 'policy_deductable',
       'policy_annual_premium', 'num_claims_past_year', 'EVENT_LABEL',
       'EVENT_TIMESTAMP', 'EVENT_ID', 'ENTITY_TYPE', 'ENTITY_ID', 'policy_id',
       'incident_type', 'LABEL_TIMESTAMP'],
      dtype='object')


Unnamed: 0,first_name,last_name,date_of_birth,incident_severity,num_vehicles_involved,num_injuries,num_witnesses,police_report_available,injury_claim,vehicle_claim,...,policy_annual_premium,num_claims_past_year,EVENT_LABEL,EVENT_TIMESTAMP,EVENT_ID,ENTITY_TYPE,ENTITY_ID,policy_id,incident_type,LABEL_TIMESTAMP
0,Vernon,Reyes,1934-08-25,4.0,3.0,4.0,2.0,0.0,2788.0,7345.0,...,1818.0,3.0,fraud,2022-12-15T13:00:37Z,3e35057c-e4ce-4e38-87fd-bf590fb50dc7,customer,c252-19142691,3e35057c-e4ce-4e38-87fd-bf590fb50dc7,collision,2023-03-06T15:05:50Z
1,Dennis,Montoya,1972-04-11,4.0,2.0,5.0,1.0,0.0,8093.0,4208.0,...,2530.0,5.0,legit,2022-08-13T15:31:20Z,3378f38c-2cad-49f9-a90f-50a73340c09a,customer,c759-02561552,3378f38c-2cad-49f9-a90f-50a73340c09a,theft,2023-03-06T15:05:50Z
2,Jared,Collins,1915-09-29,5.0,3.0,4.0,2.0,0.0,10144.0,4197.0,...,2893.0,5.0,legit,2022-09-05T23:48:27Z,c2d33aef-9cc7-4bcb-959e-40d6e14117f7,customer,d074-76720665,c2d33aef-9cc7-4bcb-959e-40d6e14117f7,collision,2023-03-06T15:05:50Z


## Setting up the SageMaker Autopilot Job
---

After uploading the dataset to Amazon S3, you can invoke Autopilot to find the best ML pipeline to train a model on this dataset.

The required inputs for invoking an Autopilot job are:

- Amazon S3 location for input dataset and for all output artifacts
- Name of the column of the dataset you want to predict (**EVENT_LABEL** in this case)
- An IAM role (with access to Amazon S3 and SageMaker)

Currently Autopilot supports <u>only tabular datasets in CSV format</u>. Either all files should have a header row, or the first file of the dataset, when sorted in alphabetical/lexical order by name, is expected to have a header row.
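
Since Autopilot relies on the header row to find the target column, it can be worth checking the header before launching a job. Below is a minimal sketch using only the standard library. The helper name `header_has_target` and the tiny inline sample are hypothetical; in practice you would pass the first lines of the CSV file stored in S3.

```python
import csv
import io


def header_has_target(csv_text, target):
    """Return True if the first row of csv_text contains the target column."""
    first_row = next(csv.reader(io.StringIO(csv_text)), [])
    return target in first_row


# Tiny illustrative sample, not the real dataset.
sample = "first_name,EVENT_LABEL,policy_id\nAnna,legit,fb93212b\n"
print(header_has_target(sample, "EVENT_LABEL"))  # True
print(header_has_target(sample, "Churn?"))       # False
```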


In [9]:
TARGET_LABEL = 'EVENT_LABEL'

input_data_config = [
    {
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": f"s3://{S3_BUCKET_NM}/{S3_PREFIX}/train/"
            }
        },
        "TargetAttributeName": TARGET_LABEL,
    }
]

output_data_config = {
    "S3OutputPath": f"s3://{S3_BUCKET_NM}/{S3_PREFIX}/autopilot_output/"
}

You can also specify the type of problem you want to solve with your dataset (Regression, MulticlassClassification, BinaryClassification). In case you are not sure, SageMaker Autopilot will infer the problem type based on statistics of the target column (the column you want to predict).

Because the target attribute, EVENT_LABEL, is binary (`fraud` or `legit`), our model will be performing binary prediction, also known as binary classification. In this example we will let Autopilot infer the type of problem for us.

You have the option to limit the running time of a SageMaker Autopilot job by providing either the maximum number of pipeline evaluations or candidates (one pipeline evaluation is called a Candidate because it generates a candidate model) or providing the total time allocated for the overall Autopilot job. Under default settings, this job takes about four hours to run. This varies between runs because of the nature of the exploratory process Autopilot uses to find optimal training parameters.
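
The options described above map onto request fields of `create_auto_ml_job`. The following is a minimal sketch, not the configuration used in this notebook: the helper `build_automl_options`, the two-hour budget, and the choice of `F1` as objective are illustrative assumptions. To my understanding the API expects `ProblemType` and `AutoMLJobObjective` to be supplied together, so the helper enforces that.

```python
def build_automl_options(problem_type=None, objective_metric=None,
                         max_candidates=None, max_job_runtime_s=None):
    """Build optional keyword arguments for sm_client.create_auto_ml_job.

    Hypothetical helper for illustration; every argument is optional.
    """
    options = {}
    # Assumption: problem type and objective metric are given together or not at all.
    if (problem_type is None) != (objective_metric is None):
        raise ValueError("set problem_type and objective_metric together")
    if problem_type is not None:
        options["ProblemType"] = problem_type
        options["AutoMLJobObjective"] = {"MetricName": objective_metric}

    criteria = {}
    if max_candidates is not None:
        criteria["MaxCandidates"] = max_candidates
    if max_job_runtime_s is not None:
        criteria["MaxAutoMLJobRuntimeInSeconds"] = max_job_runtime_s
    if criteria:
        options["AutoMLJobConfig"] = {"CompletionCriteria": criteria}
    return options


options = build_automl_options(
    problem_type="BinaryClassification",
    objective_metric="F1",
    max_candidates=50,
    max_job_runtime_s=2 * 60 * 60,  # assumed two-hour budget
)
print(options)
```

The resulting dictionary could be unpacked into the call, for example `sm_client.create_auto_ml_job(AutoMLJobName=..., InputDataConfig=..., OutputDataConfig=..., RoleArn=..., **options)`.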

## Launching the SageMaker Autopilot Job
---

We can now launch the **Autopilot** job by calling the `create_auto_ml_job` method. Various parameters can be adjusted in the API call; for the full API usage, please refer to the [documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_auto_ml_job.html).

In [19]:
import datetime
import time


autopilot_job_name = f"insurance-fraud-{datetime.datetime.now().strftime('%Y%m%d-%H%M%S')}"
print(f'AutoMLJobName: {autopilot_job_name}')

resp = sm_client.create_auto_ml_job(
    AutoMLJobName=autopilot_job_name,
    InputDataConfig=input_data_config,
    OutputDataConfig=output_data_config,
    AutoMLJobConfig={
        "CompletionCriteria": {
            "MaxCandidates": 50
        }
    },
    RoleArn=ROLE_ARN,
    Tags=[{
        'Key': 'Project',
        'Value': 'demo-fraud-detection'
    }],
)

AutoMLJobName: insurance-fraud-20231106-143253


### Tracking SageMaker Autopilot job progress
---

A SageMaker Autopilot job consists of the following high-level steps:

- Analyzing Data, where the dataset is analyzed and Autopilot comes up with a list of ML pipelines that should be tried out on the dataset. The dataset is also split into train and validation sets.
- Feature Engineering, where Autopilot performs feature transformation on individual features of the dataset as well as at an aggregate level.
- Model Tuning, where the top performing pipeline is selected along with the optimal hyperparameters for the training algorithm (the last stage of the pipeline).

In [None]:
print("JobStatus - Secondary Status")
print("------------------------------")

resp = sm_client.describe_auto_ml_job(
    AutoMLJobName=autopilot_job_name
)
print(f'{resp["AutoMLJobStatus"]} --- {resp["AutoMLJobSecondaryStatus"]}')
_status = resp["AutoMLJobStatus"]

while _status not in ("Failed", "Completed", "Stopped"):
    resp = sm_client.describe_auto_ml_job(
        AutoMLJobName=autopilot_job_name
    )
    _status = resp["AutoMLJobStatus"]
    print(f'{resp["AutoMLJobStatus"]} --- {resp["AutoMLJobSecondaryStatus"]}')
    time.sleep(60)

JobStatus - Secondary Status
------------------------------
InProgress --- AnalyzingData
InProgress --- AnalyzingData
InProgress --- AnalyzingData
InProgress --- AnalyzingData
InProgress --- AnalyzingData
InProgress --- AnalyzingData
InProgress --- AnalyzingData
InProgress --- AnalyzingData
InProgress --- AnalyzingData
InProgress --- AnalyzingData
InProgress --- AnalyzingData
InProgress --- AnalyzingData
InProgress --- AnalyzingData
InProgress --- AnalyzingData
InProgress --- AnalyzingData
InProgress --- AnalyzingData
InProgress --- AnalyzingData
InProgress --- FeatureEngineering
InProgress --- FeatureEngineering
InProgress --- FeatureEngineering
InProgress --- FeatureEngineering
InProgress --- FeatureEngineering
InProgress --- FeatureEngineering
InProgress --- FeatureEngineering
InProgress --- FeatureEngineering
InProgress --- FeatureEngineering
InProgress --- ModelTuning
InProgress --- ModelTuning
InProgress --- ModelTuning
InProgress --- ModelTuning
InProgress --- ModelTuning
InProg

The job runtime varies depending on how many candidates you want to compare, the size of the data, and the number of columns and their data types.
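
Because the runtime is hard to predict, it may help to bound the polling loop and to print a line only when the status changes. This is a minimal sketch under those assumptions; `wait_for_automl_job`, its default timeout, and the injectable `sleep` argument are hypothetical and not part of boto3.

```python
import time

TERMINAL_STATUSES = ("Completed", "Failed", "Stopped")


def wait_for_automl_job(client, job_name, poll_seconds=60,
                        timeout_seconds=6 * 60 * 60, sleep=time.sleep):
    """Poll describe_auto_ml_job until a terminal status or the timeout.

    Returns the last describe_auto_ml_job response.
    """
    deadline = time.monotonic() + timeout_seconds
    last = None  # (status, secondary status) printed most recently
    while True:
        resp = client.describe_auto_ml_job(AutoMLJobName=job_name)
        current = (resp["AutoMLJobStatus"], resp["AutoMLJobSecondaryStatus"])
        if current != last:  # print only when something changes
            print(f"{current[0]} --- {current[1]}")
            last = current
        if current[0] in TERMINAL_STATUSES:
            return resp
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{job_name} is still {current[0]} after {timeout_seconds}s")
        sleep(poll_seconds)
```

With the names used in this notebook, the call would look like `resp = wait_for_automl_job(sm_client, autopilot_job_name)`.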

<div class="alert info-box alert-info">
    AutoPilot automatically generates <b>two executable Jupyter Notebooks</b>
    <ul>
        <li>SageMakerAutopilotDataExplorationNotebook.ipynb</li>
        <li>SageMakerAutopilotCandidateDefinitionNotebook.ipynb</li>
    </ul>
    <p>
        These notebooks are <u>stored in Amazon S3</u> and are available for us to explore further.
    </p>
</div>

In [29]:
# print(resp)  # uncomment to inspect the full response
print(resp["AutoMLJobArtifacts"]["CandidateDefinitionNotebookLocation"])
print(resp["AutoMLJobArtifacts"]["DataExplorationNotebookLocation"])
candidate_nbk_uri = resp["AutoMLJobArtifacts"]["CandidateDefinitionNotebookLocation"]
data_explore_nbk_uri = resp["AutoMLJobArtifacts"]["DataExplorationNotebookLocation"]

s3://ml-ai-demo-th/amazon-fraud-detector/autopilot_output/insurance-fraud-20231106-143253/sagemaker-automl-candidates/insurance-fraud-20231106-143253-pr-1-6e80be25f0964fa1b850950ffb/notebooks/SageMakerAutopilotCandidateDefinitionNotebook.ipynb
s3://ml-ai-demo-th/amazon-fraud-detector/autopilot_output/insurance-fraud-20231106-143253/sagemaker-automl-candidates/insurance-fraud-20231106-143253-pr-1-6e80be25f0964fa1b850950ffb/notebooks/SageMakerAutopilotDataExplorationNotebook.ipynb
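
To open the generated notebooks, they first have to be copied out of S3. One way to do that is sketched below; the helper names `split_s3_uri` and `download_s3_uri` are hypothetical, while `download_file` is the regular boto3 S3 client method.

```python
import os
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {s3_uri}")
    return parsed.netloc, parsed.path.lstrip("/")


def download_s3_uri(s3_client, s3_uri, local_dir="."):
    """Download one object into local_dir and return the local file path."""
    bucket, key = split_s3_uri(s3_uri)
    local_path = os.path.join(local_dir, key.rsplit("/", 1)[-1])
    s3_client.download_file(bucket, key, local_path)
    return local_path
```

For example, `download_s3_uri(boto3.client("s3"), candidate_nbk_uri)` would place the candidate definition notebook in the current directory, and the same call with `data_explore_nbk_uri` would fetch the data exploration notebook.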


### Get the Results of AutoPilot
---
Now use the `describe_auto_ml_job` or `describe_auto_ml_job_v2` API to look up the best candidate selected by the SageMaker Autopilot job.

In [79]:
resp = sm_client.describe_auto_ml_job_v2(AutoMLJobName=autopilot_job_name)
best_candidate = resp['BestCandidate']
best_candidate_name = resp['BestCandidate']['CandidateName']
print(f'''
The name of the best model candidate name: {best_candidate_name}
The evaluation metric used: {resp['BestCandidate']['FinalAutoMLJobObjectiveMetric']['MetricName']}
The {resp['BestCandidate']['FinalAutoMLJobObjectiveMetric']['StandardMetricName']} score is {resp['BestCandidate']['FinalAutoMLJobObjectiveMetric']['Value']}
''')


The name of the best model candidate name: insurance-fraud-20231106-14325Y5-035-043f6b11
The evaluation metric used: validation:f1_binary
The F1 score is 0.913320004940033



You can also view the other evaluation metrics by running the script below.

In [45]:
pd.DataFrame(resp['BestCandidate']['CandidateProperties']['CandidateMetrics'])

Unnamed: 0,MetricName,StandardMetricName,Value,Set
0,F1,F1,0.91332,Validation
1,LogLoss,LogLoss,0.03585,Validation
2,Recall,Recall,0.90199,Validation
3,Precision,Precision,0.92494,Validation
4,AUC,AUC,0.99468,Validation
5,Accuracy,Accuracy,0.99138,Validation
6,BalancedAccuracy,BalancedAccuracy,0.94905,Validation


You can see which algorithm is being used by looking at the image of the second entry in **InferenceContainers**.

In [61]:
resp['BestCandidate']['InferenceContainers'][1]['Image']

'246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.3-1-cpu-py3'
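
The image URI is long, and only the repository part tells us the algorithm. A small string helper can shorten it; this is a sketch, with `algorithm_from_image` being a hypothetical name, and it assumes the usual `registry/repository:tag` layout seen above.

```python
def algorithm_from_image(image_uri):
    """Return the repository name of an image URI, without registry, tag or digest."""
    repository = image_uri.split("/", 1)[-1]                # drop the registry host
    return repository.split("@", 1)[0].split(":", 1)[0]     # drop digest and tag


print(algorithm_from_image(
    "246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.3-1-cpu-py3"
))  # sagemaker-xgboost
```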

### Explore other model candidates
---
If you are curious about the performance of the other algorithms that Autopilot explored, they are enumerated below via the `list_candidates_for_auto_ml_job` API call.

In [63]:
resp = sm_client.list_candidates_for_auto_ml_job(AutoMLJobName=autopilot_job_name)

In [75]:
print(f'There are total of {len(resp["Candidates"])} models')
print(' -------------------------------------------------- ')
for n, candidate_ in enumerate(resp["Candidates"]):
    print(f'Candidate: {n+1}')
    print(f'Candidate Name: {candidate_["CandidateName"]}')
    print(f'''
Using metric as {candidate_["FinalAutoMLJobObjectiveMetric"]["MetricName"]} - \
{candidate_["FinalAutoMLJobObjectiveMetric"]["StandardMetricName"]}: {candidate_["FinalAutoMLJobObjectiveMetric"]["Value"]}
Algorithm: {candidate_["InferenceContainers"][1]["Image"]}
''')
    print('Other metrics:\n')
    display(pd.DataFrame(candidate_['CandidateProperties']['CandidateMetrics']))
    print(' -------------------------------------------------- ')
          

There are total of 10 models
 -------------------------------------------------- 
Candidate: 1
Candidate Name: insurance-fraud-20231106-14325Y5-050-61254409

Using metric as validation:f1_binary - F1: 0.5327500104904175
Algorithm: 246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.3-1-cpu-py3

Other metrics:



Unnamed: 0,MetricName,StandardMetricName,Value,Set
0,F1,F1,0.53275,Validation
1,LogLoss,LogLoss,0.26613,Validation
2,Recall,Recall,0.90819,Validation
3,Precision,Precision,0.37693,Validation
4,AUC,AUC,0.97775,Validation
5,Accuracy,Accuracy,0.91975,Validation
6,BalancedAccuracy,BalancedAccuracy,0.91428,Validation


 -------------------------------------------------- 
Candidate: 2
Candidate Name: insurance-fraud-20231106-14325Y5-049-4da8833c

Using metric as validation:binary_f_beta - F1: 0.07545485347509384
Algorithm: 174872318107.dkr.ecr.us-west-2.amazonaws.com/linear-learner:inference-cpu

Other metrics:



Unnamed: 0,MetricName,StandardMetricName,Value,Set
0,LogLoss,LogLoss,17.879566,Validation
1,F1,F1,0.075455,Validation
2,BalancedAccuracy,BalancedAccuracy,0.45252,Validation
3,Accuracy,Accuracy,0.482345,Validation
4,AUC,AUC,0.437264,Validation
5,Recall,Recall,0.419355,Validation
6,Precision,Precision,0.041457,Validation


 -------------------------------------------------- 
Candidate: 3
Candidate Name: insurance-fraud-20231106-14325Y5-048-6b3a708e

Using metric as validation:binary_f_beta - F1: 0.004950494971126318
Algorithm: 174872318107.dkr.ecr.us-west-2.amazonaws.com/mxnet-algorithms:inference-cpu

Other metrics:



Unnamed: 0,MetricName,StandardMetricName,Value,Set
0,LogLoss,LogLoss,0.591253,Validation
1,Accuracy,Accuracy,0.949753,Validation
2,Recall,Recall,0.002481,Validation
3,Precision,Precision,1.0,Validation
4,AUC,AUC,0.53779,Validation
5,F1,F1,0.00495,Validation
6,BalancedAccuracy,BalancedAccuracy,0.501241,Validation


 -------------------------------------------------- 
Candidate: 4
Candidate Name: insurance-fraud-20231106-14325Y5-047-ec3d4ba7

Using metric as validation:f1_binary - F1: 0.5999199748039246
Algorithm: 246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.3-1-cpu-py3

Other metrics:



Unnamed: 0,MetricName,StandardMetricName,Value,Set
0,F1,F1,0.59992,Validation
1,LogLoss,LogLoss,1.18307,Validation
2,Recall,Recall,0.89578,Validation
3,Precision,Precision,0.45097,Validation
4,AUC,AUC,0.9427,Validation
5,Accuracy,Accuracy,0.93982,Validation
6,BalancedAccuracy,BalancedAccuracy,0.91897,Validation


 -------------------------------------------------- 
Candidate: 5
Candidate Name: insurance-fraud-20231106-14325Y5-046-b65684d0

Using metric as validation:f1_binary - F1: 0.7544199824333191
Algorithm: 246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.3-1-cpu-py3

Other metrics:



Unnamed: 0,MetricName,StandardMetricName,Value,Set
0,F1,F1,0.75442,Validation
1,LogLoss,LogLoss,0.09903,Validation
2,Recall,Recall,0.95285,Validation
3,Precision,Precision,0.62439,Validation
4,AUC,AUC,0.99222,Validation
5,Accuracy,Accuracy,0.96875,Validation
6,BalancedAccuracy,BalancedAccuracy,0.96122,Validation


 -------------------------------------------------- 
Candidate: 6
Candidate Name: insurance-fraud-20231106-14325Y5-045-e2d6fd23

Using metric as validation:f1_binary - F1: 0.7234500050544739
Algorithm: 246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.3-1-cpu-py3

Other metrics:



Unnamed: 0,MetricName,StandardMetricName,Value,Set
0,F1,F1,0.72345,Validation
1,LogLoss,LogLoss,0.29363,Validation
2,Recall,Recall,0.93797,Validation
3,Precision,Precision,0.58879,Validation
4,AUC,AUC,0.99068,Validation
5,Accuracy,Accuracy,0.96388,Validation
6,BalancedAccuracy,BalancedAccuracy,0.95161,Validation


 -------------------------------------------------- 
Candidate: 7
Candidate Name: insurance-fraud-20231106-14325Y5-044-4cc27931

Using metric as validation:f1_binary - F1: 0.9054999947547913
Algorithm: 246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.3-1-cpu-py3

Other metrics:



Unnamed: 0,MetricName,StandardMetricName,Value,Set
0,F1,F1,0.9055,Validation
1,LogLoss,LogLoss,0.03081,Validation
2,Recall,Recall,0.90943,Validation
3,Precision,Precision,0.9016,Validation
4,AUC,AUC,0.9936,Validation
5,Accuracy,Accuracy,0.99044,Validation
6,BalancedAccuracy,BalancedAccuracy,0.95208,Validation


 -------------------------------------------------- 
Candidate: 8
Candidate Name: insurance-fraud-20231106-14325Y5-043-01228aff

Using metric as validation:f1_binary - F1: 0.7709299921989441
Algorithm: 246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.3-1-cpu-py3

Other metrics:



Unnamed: 0,MetricName,StandardMetricName,Value,Set
0,F1,F1,0.77093,Validation
1,LogLoss,LogLoss,0.10145,Validation
2,Recall,Recall,0.95409,Validation
3,Precision,Precision,0.64676,Validation
4,AUC,AUC,0.99328,Validation
5,Accuracy,Accuracy,0.97144,Validation
6,BalancedAccuracy,BalancedAccuracy,0.96323,Validation


 -------------------------------------------------- 
Candidate: 9
Candidate Name: insurance-fraud-20231106-14325Y5-042-8e309d6b

Using metric as validation:binary_f_beta - F1: 0.1727493852376938
Algorithm: 174872318107.dkr.ecr.us-west-2.amazonaws.com/linear-learner:inference-cpu

Other metrics:



Unnamed: 0,MetricName,StandardMetricName,Value,Set
0,LogLoss,LogLoss,14.678397,Validation
1,F1,F1,0.172749,Validation
2,BalancedAccuracy,BalancedAccuracy,0.719848,Validation
3,Accuracy,Accuracy,0.575027,Validation
4,AUC,AUC,0.825642,Validation
5,Recall,Recall,0.880893,Validation
6,Precision,Precision,0.095765,Validation


 -------------------------------------------------- 
Candidate: 10
Candidate Name: insurance-fraud-20231106-14325Y5-041-0805d4ab

Using metric as validation:f1_binary - F1: 0.6401000022888184
Algorithm: 246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.3-1-cpu-py3

Other metrics:



Unnamed: 0,MetricName,StandardMetricName,Value,Set
0,F1,F1,0.6401,Validation
1,LogLoss,LogLoss,0.63774,Validation
2,Recall,Recall,0.9268,Validation
3,Precision,Precision,0.48887,Validation
4,AUC,AUC,0.98475,Validation
5,Accuracy,Accuracy,0.9475,Validation
6,BalancedAccuracy,BalancedAccuracy,0.9377,Validation


 -------------------------------------------------- 
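
The output above lists only 10 candidates although the job was allowed up to 50, which suggests that the response is paginated. A minimal sketch for collecting all of them by following `NextToken` is shown below; the helper name `list_all_candidates` and the page size are assumptions, while `SortBy`, `SortOrder`, `MaxResults` and `NextToken` are parameters of the API call.

```python
def list_all_candidates(client, job_name, page_size=50):
    """Collect every candidate of an AutoML job, best objective value first."""
    candidates = []
    kwargs = {
        "AutoMLJobName": job_name,
        "SortBy": "FinalObjectiveMetricValue",
        "SortOrder": "Descending",
        "MaxResults": page_size,
    }
    while True:
        page = client.list_candidates_for_auto_ml_job(**kwargs)
        candidates.extend(page["Candidates"])
        token = page.get("NextToken")
        if not token:          # no further pages
            return candidates
        kwargs["NextToken"] = token
```

Calling `list_all_candidates(sm_client, autopilot_job_name)` would return one list that can be fed to the same printing loop as above.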


## Host the best candidate model
---
Now that we have the trained model candidate, we can create a model and deploy it as an endpoint.

### Create model

In [83]:
model_name = f'{best_candidate_name}-model'

model_arn = sm_client.create_model(
    Containers=best_candidate["InferenceContainers"], 
    ModelName=model_name, 
    ExecutionRoleArn=ROLE_ARN,
)

### Create endpoint configuration

In [85]:
endpoint_config_name = f'{best_candidate_name}-epc'
endpoint_config = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[{
        "InstanceType": "ml.m5.xlarge",
        "InitialInstanceCount": 1,
        "ModelName": model_name,
        "VariantName": "main",
    }],
)


### Create endpoint

In [87]:
endpoint_name = f'{best_candidate_name}-ep'
endpoint_resp = sm_client.create_endpoint(
    EndpointName=endpoint_name, 
    EndpointConfigName=endpoint_config_name
)

In [88]:
sm_client.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)

## Evaluate
---

Once our endpoint is ready, let's use it to predict on our test dataset.

In [92]:
from io import StringIO
from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import CSVDeserializer

fraud_predictor = Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sm_session,
    serializer=CSVSerializer(),
    deserializer=CSVDeserializer(),
)

In [94]:
_test_data = test_df.drop(['EVENT_LABEL'], axis=1).copy()
prediction = fraud_predictor.predict(
    _test_data.to_csv(sep=',', header=False, index=False)
)

pred_df = pd.DataFrame(prediction)
pred_df.head(3)

Unnamed: 0,0
0,legit
1,legit
2,legit
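
Sending the whole test set in one request worked here, but real-time endpoints limit the size of a single request payload, so larger datasets may need to be sent in batches. The following is a sketch of that idea; `predict_in_chunks` and the batch size are hypothetical, and it assumes that no CSV field contains an embedded newline.

```python
def predict_in_chunks(predictor, csv_text, rows_per_request=1000):
    """Send header-less CSV text to an endpoint in batches of rows.

    `predictor` is any object with a predict(payload) method that returns
    one result per input row, such as fraud_predictor above.
    """
    lines = csv_text.splitlines()
    predictions = []
    for start in range(0, len(lines), rows_per_request):
        payload = "\n".join(lines[start:start + rows_per_request])
        predictions.extend(predictor.predict(payload))
    return predictions
```

With the objects from this notebook, `predict_in_chunks(fraud_predictor, _test_data.to_csv(header=False, index=False))` should give the same rows as the single call, and the result can be wrapped in `pd.DataFrame` in the same way.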


In [100]:
(test_df['EVENT_LABEL']).unique()

array(['fraud', 'legit'], dtype=object)

In [103]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

accuracy = accuracy_score(
    test_df.reset_index()['EVENT_LABEL'], 
    pred_df[0]
)
precision = precision_score(
    test_df.reset_index()['EVENT_LABEL'], 
    pred_df[0], 
    pos_label='fraud'
)

recall = recall_score(
    test_df.reset_index()['EVENT_LABEL'], 
    pred_df[0],
    pos_label="fraud", 
    average="binary"
)

f1 = f1_score(
    test_df.reset_index()['EVENT_LABEL'], 
    pred_df[0], 
    pos_label='fraud'
)

print(f'''
Accuracy score: {accuracy}
F1 score: {f1}
Recall: {recall}
Precision: {precision}
Classification report: \n{classification_report(test_df.reset_index()['EVENT_LABEL'], pred_df[0])}
''')


Accuracy score: 0.9923
F1 score: 0.9226907630522088
Recall: 0.9045275590551181
Precision: 0.9415983606557377
Classification report: 
              precision    recall  f1-score   support

       fraud       0.94      0.90      0.92      1016
       legit       0.99      1.00      1.00     18984

    accuracy                           0.99     20000
   macro avg       0.97      0.95      0.96     20000
weighted avg       0.99      0.99      0.99     20000
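
As a sanity check, the scores above can be traced back to confusion-matrix counts using only the numbers printed in this output. The sketch below takes the `fraud` support, recall and precision from the report and rebuilds the counts, then recomputes F1 and accuracy from them.

```python
support_fraud = 1016          # fraud rows in the test set (classification report)
total_rows = 20000
recall = 0.9045275590551181
precision = 0.9415983606557377

tp = round(recall * support_fraud)     # fraud claims correctly flagged
fn = support_fraud - tp                # fraud claims missed
fp = round(tp / precision) - tp        # legitimate claims flagged as fraud
tn = total_rows - tp - fn - fp         # legitimate claims correctly passed

f1 = 2 * tp / (2 * tp + fp + fn)
accuracy = (tp + tn) / total_rows
print(tp, fp, fn, tn)                  # 919 57 97 18927
print(round(f1, 4), accuracy)          # 0.9227 0.9923
```

In other words, the endpoint missed 97 of the 1,016 fraudulent claims and raised 57 false alarms among the 18,984 legitimate ones.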




## Clean up
---

The Autopilot job creates many underlying artifacts, such as dataset splits, preprocessing scripts, and preprocessed data. The code below, when uncommented, deletes them, including all the generated models and the auto-generated notebooks.

Below is some guideline code that you can use to list and delete these artifacts if they are no longer required.

In [106]:
s3_resource = boto3.resource('s3') 
s3_bucket = s3_resource.Bucket(S3_BUCKET_NM)

job_outputs_prefix = '{}/autopilot_output/{}'.format(S3_PREFIX, autopilot_job_name) 
print(job_outputs_prefix)

amazon-fraud-detector/autopilot_output/insurance-fraud-20231106-143253


In [None]:
# Uncomment the code to delete the files in that prefix
# s3_bucket.objects.filter(Prefix=job_outputs_prefix).delete()

Finally, delete the endpoint and its associated resources if you no longer use the endpoint.

In [115]:
sm_client = boto3.client('sagemaker')
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)

{'ResponseMetadata': {'RequestId': '724f97b2-e2cc-4edc-af25-6c6afa7db107',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '724f97b2-e2cc-4edc-af25-6c6afa7db107',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Tue, 07 Nov 2023 14:22:10 GMT'},
  'RetryAttempts': 0}}