# Sagemaker AutoML Fraud Detection
This notebook contains the code to start a sagemaker automl job for the fraud detection Kaggle challenge. 

The code is partly taken and inspired by [this](https://gitlab.com/juliensimon/aim307/-/blob/master/aim307.ipynb) aws example notebook.

In [1]:
%%sh
pip install --upgrade pip
pip -q install sagemaker awscli boto3 --upgrade 

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting pip
  Downloading pip-23.0.1-py3-none-any.whl (2.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 21.1 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 22.3.1
    Uninstalling pip-22.3.1:
      Successfully uninstalled pip-22.3.1
Successfully installed pip-23.0.1


In [2]:
import numpy as np 
import pandas as pd
import boto3
import sagemaker
import os, sys
import time

print (sagemaker.__version__)

2.145.0


In [3]:
sess   = sagemaker.Session()
bucket = sess.default_bucket()                     
region = boto3.Session().region_name
prefix = 'sagemaker/fraud-detection-auto-ml'
# Role when working on a notebook instance
role = sagemaker.get_execution_role()

In [4]:
sm = boto3.Session().client(service_name='sagemaker',region_name=region)
sm_rt = boto3.Session().client('runtime.sagemaker', region_name=region)

## Read Data from S3

In [36]:
bucket_data = 'patrick-fraud-detection-ml-kaggle'
filename = 'creditcard.csv'

s3 = boto3.client('s3') 

obj = s3.get_object(Bucket=bucket_data, Key=filename) 
# get object and file (key) from bucket

df = pd.read_csv(obj['Body']) # 'Body' is a key word

In [39]:
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [40]:
# check number of samples
num_samples, num_features = df.shape

print(f"There are {num_samples} samples with {num_features} features each.")

There are 284807 samples with 31 features each.


In [41]:
# check data imbalance on target
count_neg_class = np.sum(df["Class"] == 0)
count_pos_class = np.sum(df["Class"] == 1)

print(f"There are {count_neg_class} negative samples ({np.round(100 * count_neg_class / num_samples, 2)} % of total data).")
print(f"There are {count_pos_class} positive samples ({np.round(100 * count_pos_class / num_samples, 2)} % of total data).")

There are 284315 negative samples (99.83 % of total data).
There are 492 positive samples (0.17 % of total data).


## Splitting the dataset

Split dataset into train and test. The train set is again split during AutoML job into train and validation.

In [42]:
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(df, test_size=0.2)

# Save to CSV files and upload to S3
train_file = "automl-train.csv"
train_data.to_csv(train_file, index=False, header=True, sep=',') # Need to keep column names
train_data_s3_path = sess.upload_data(path=train_file, key_prefix=prefix + "/train")
print("Train data uploaded to: " + train_data_s3_path)

test_file = "automl-test.csv"
test_file_no_target = "automl-test-no-target.csv"
test_data_no_target = test_data.drop(columns=["Class"])
test_data.to_csv(test_file, index=False, header=False, sep=',')
test_data_no_target.to_csv(test_file_no_target, index=False, header=False, sep=',')
test_data_s3_path = sess.upload_data(path=test_file, key_prefix=prefix + "/test")
test_data_s3_path_no_target = sess.upload_data(path=test_file_no_target, key_prefix=prefix + "/test")
print("Test data uploaded to: " + test_data_s3_path)
print("Test data no target uploaded to: " + test_data_s3_path_no_target)

Train data uploaded to: s3://sagemaker-eu-central-1-705561729853/sagemaker/fraud-detection-auto-ml/train/automl-train.csv
Test data uploaded to: s3://sagemaker-eu-central-1-705561729853/sagemaker/fraud-detection-auto-ml/test/automl-test.csv
Test data no target uploaded to: s3://sagemaker-eu-central-1-705561729853/sagemaker/fraud-detection-auto-ml/test/automl-test-no-target.csv


## Setting up Sagemaker AutoML Job

In [16]:
input_data_config = [{
      'DataSource': {
        'S3DataSource': {
          'S3DataType': 'S3Prefix',
          'S3Uri': 's3://{}/{}/input'.format(bucket,prefix)
        }
      },
      'TargetAttributeName': 'Class'  # the column we want to predict
    }
]

output_data_config = { 'S3OutputPath': 's3://{}/{}/output'.format(bucket,prefix) }

# Optional parameters

problem_type = 'BinaryClassification'

job_objective = { 'MetricName': 'F1' }

## Launching the Amazon SageMaker AutoPilot job

Let's now launch the job by calling the create_auto_ml_job API.

In [17]:
from time import gmtime, strftime, sleep
timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())

auto_ml_job_name = 'fraud-detection-' + timestamp_suffix
print('AutoMLJobName: ' + auto_ml_job_name)

sm.create_auto_ml_job(AutoMLJobName=auto_ml_job_name,
                      InputDataConfig=input_data_config,
                      OutputDataConfig=output_data_config,
                      AutoMLJobConfig={"CompletionCriteria": {"MaxCandidates": 20}},
                      AutoMLJobObjective=job_objective,
                      ProblemType=problem_type,
                      RoleArn=role)

AutoMLJobName: fraud-detection-11-04-46-52


{'AutoMLJobArn': 'arn:aws:sagemaker:eu-central-1:705561729853:automl-job/fraud-detection-11-04-46-52',
 'ResponseMetadata': {'RequestId': 'e02dc03a-508e-4152-8159-7cf8025776c3',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'e02dc03a-508e-4152-8159-7cf8025776c3',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '101',
   'date': 'Tue, 11 Apr 2023 04:46:54 GMT'},
  'RetryAttempts': 0}}

## Tracking Progress
We can now track the progess of the job.

In [18]:
%%time
job_run_status = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['AutoMLJobStatus']

print(job_run_status)

while job_run_status not in ('Failed', 'Completed', 'Stopped'):
    describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
    job_run_status = describe_response['AutoMLJobStatus']
    
    print (describe_response['AutoMLJobStatus'] + " - " + describe_response['AutoMLJobSecondaryStatus'])
    sleep(60)

InProgress
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - ModelTuning
InProgress - ModelTuning
InProgress - ModelTuning
InProgress - ModelTuning
InProgress - ModelTuning
InProgress - ModelTuning
InProgress - ModelTuning
InProgress - ModelTuning
InProgress - ModelTuning
InProgress - ModelTuning
InProgress - ModelTuning
InProgress - ModelTuning
InProgress - ModelTuning
InProgress - ModelTuning


KeyboardInterrupt: 

## Fetching the auto-generated notebooks
Once the 'AnalyzingData' step is complete, SageMaker AutoPilot generates two notebooks: 
1. Data exploration
2. Candidate definition.

In [19]:
job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
job_candidate_notebook = job['AutoMLJobArtifacts']['CandidateDefinitionNotebookLocation']
job_data_notebook = job['AutoMLJobArtifacts']['DataExplorationNotebookLocation']

print(job_candidate_notebook)
print(job_data_notebook)

s3://sagemaker-eu-central-1-705561729853/sagemaker/fraud-detection-auto-ml/output/fraud-detection-11-04-46-52/sagemaker-automl-candidates/fraud-detection-11-04-46-52-pr-1-7b0617c0121c411b883dc943f7869e/notebooks/SageMakerAutopilotCandidateDefinitionNotebook.ipynb
s3://sagemaker-eu-central-1-705561729853/sagemaker/fraud-detection-auto-ml/output/fraud-detection-11-04-46-52/sagemaker-automl-candidates/fraud-detection-11-04-46-52-pr-1-7b0617c0121c411b883dc943f7869e/notebooks/SageMakerAutopilotDataExplorationNotebook.ipynb


In [20]:
%%sh -s $job_candidate_notebook $job_data_notebook
aws s3 cp $1 .
aws s3 cp $2 .

download: s3://sagemaker-eu-central-1-705561729853/sagemaker/fraud-detection-auto-ml/output/fraud-detection-11-04-46-52/sagemaker-automl-candidates/fraud-detection-11-04-46-52-pr-1-7b0617c0121c411b883dc943f7869e/notebooks/SageMakerAutopilotCandidateDefinitionNotebook.ipynb to ./SageMakerAutopilotCandidateDefinitionNotebook.ipynb
download: s3://sagemaker-eu-central-1-705561729853/sagemaker/fraud-detection-auto-ml/output/fraud-detection-11-04-46-52/sagemaker-automl-candidates/fraud-detection-11-04-46-52-pr-1-7b0617c0121c411b883dc943f7869e/notebooks/SageMakerAutopilotDataExplorationNotebook.ipynb to ./SageMakerAutopilotDataExplorationNotebook.ipynb


## Inspecting the SageMaker Autopilot job with Amazon SageMaker Experiments
Once the 'ModelTuning' step starts, we can use the SageMaker Experiments SDK to list and view all jobs. Data is stored in a pandas dataframe, which makes it easy to filter it, compare it to other experiments, etc.

In [21]:
from sagemaker.analytics import ExperimentAnalytics

analytics = ExperimentAnalytics(
    sagemaker_session=sess, 
    experiment_name=auto_ml_job_name+'-aws-auto-ml-job'
)

df = analytics.dataframe()
df

Unnamed: 0,TrialComponentName,DisplayName,SourceArn,SageMaker.ImageUri,SageMaker.InstanceCount,SageMaker.InstanceType,SageMaker.VolumeSizeInGB,_tuning_objective_metric,eval_metric,learning_rate,...,enable_validation_split,input_channel_mode,job_name,label_col,max_dataset_size,max_subsampled_dataset_size,SageMaker.ImageUri - MediaType,SageMaker.ImageUri - Value,ds - MediaType,ds - Value
0,fraud-detection-11-04-46-52EtgHM-018-83744956-...,fraud-detection-11-04-46-52EtgHM-018-83744956-...,arn:aws:sagemaker:eu-central-1:705561729853:tr...,664544806723.dkr.ecr.eu-central-1.amazonaws.co...,1.0,ml.m5.12xlarge,50.0,validation:binary_f_beta,binary_f_beta,0.000686,...,,,,,,,,,,
1,fraud-detection-11-04-46-52EtgHM-016-60c6d19a-...,fraud-detection-11-04-46-52EtgHM-016-60c6d19a-...,arn:aws:sagemaker:eu-central-1:705561729853:tr...,664544806723.dkr.ecr.eu-central-1.amazonaws.co...,1.0,ml.m5.12xlarge,50.0,validation:binary_f_beta,binary_f_1,0.00355,...,,,,,,,,,,
2,fraud-detection-11-04-46-52EtgHM-020-93507b1b-...,fraud-detection-11-04-46-52EtgHM-020-93507b1b-...,arn:aws:sagemaker:eu-central-1:705561729853:tr...,664544806723.dkr.ecr.eu-central-1.amazonaws.co...,1.0,ml.m5.12xlarge,50.0,validation:binary_f_beta,binary_f_beta,0.001518,...,,,,,,,,,,
3,fraud-detection-11-04-46-52EtgHM-013-bc761f4a-...,fraud-detection-11-04-46-52EtgHM-013-bc761f4a-...,arn:aws:sagemaker:eu-central-1:705561729853:tr...,492215442770.dkr.ecr.eu-central-1.amazonaws.co...,1.0,ml.m5.12xlarge,50.0,validation:f1_binary,"accuracy,f1_binary,auc,balanced_accuracy,preci...",,...,,,,,,,,,,
4,fraud-detection-11-04-46-52EtgHM-015-ef5ac858-...,fraud-detection-11-04-46-52EtgHM-015-ef5ac858-...,arn:aws:sagemaker:eu-central-1:705561729853:tr...,492215442770.dkr.ecr.eu-central-1.amazonaws.co...,1.0,ml.m5.12xlarge,50.0,validation:f1_binary,"accuracy,f1_binary,auc,balanced_accuracy,preci...",,...,,,,,,,,,,
5,fraud-detection-11-04-46-52EtgHM-003-fda6791a-...,fraud-detection-11-04-46-52EtgHM-003-fda6791a-...,arn:aws:sagemaker:eu-central-1:705561729853:tr...,492215442770.dkr.ecr.eu-central-1.amazonaws.co...,1.0,ml.m5.12xlarge,50.0,validation:f1_binary,"accuracy,f1_binary,auc,balanced_accuracy,preci...",,...,,,,,,,,,,
6,fraud-detection-11-04-46-52EtgHM-002-e2cf2bfc-...,fraud-detection-11-04-46-52EtgHM-002-e2cf2bfc-...,arn:aws:sagemaker:eu-central-1:705561729853:tr...,492215442770.dkr.ecr.eu-central-1.amazonaws.co...,1.0,ml.m5.12xlarge,50.0,validation:f1_binary,"accuracy,f1_binary,auc,balanced_accuracy,preci...",,...,,,,,,,,,,
7,fraud-detection-11-04-46-52EtgHM-001-b993bea5-...,fraud-detection-11-04-46-52EtgHM-001-b993bea5-...,arn:aws:sagemaker:eu-central-1:705561729853:tr...,492215442770.dkr.ecr.eu-central-1.amazonaws.co...,1.0,ml.m5.12xlarge,50.0,validation:f1_binary,"accuracy,f1_binary,auc,balanced_accuracy,preci...",,...,,,,,,,,,,
8,fraud-detection-11-04-46-52EtgHM-017-234e8ee6-...,fraud-detection-11-04-46-52EtgHM-017-234e8ee6-...,arn:aws:sagemaker:eu-central-1:705561729853:tr...,664544806723.dkr.ecr.eu-central-1.amazonaws.co...,1.0,ml.m5.12xlarge,50.0,validation:binary_f_beta,binary_f_beta,3e-06,...,,,,,,,,,,
9,fraud-detection-11-04-46-52EtgHM-019-09a41b6d-...,fraud-detection-11-04-46-52EtgHM-019-09a41b6d-...,arn:aws:sagemaker:eu-central-1:705561729853:tr...,492215442770.dkr.ecr.eu-central-1.amazonaws.co...,1.0,ml.m5.12xlarge,50.0,validation:f1_binary,"accuracy,f1_binary,auc,balanced_accuracy,preci...",,...,,,,,,,,,,


## Listing all candidates explored by Amazon SageMaker AutoPilot
Once the 'ModelTuning' step is complete, we can list top candidates that were identified by SageMaker AutoPilot, and sort them by their final performance metric.

In [22]:
candidates = sm.list_candidates_for_auto_ml_job(AutoMLJobName=auto_ml_job_name, 
                                                SortBy='FinalObjectiveMetricValue')['Candidates']
index = 1
for candidate in candidates:
  print (str(index) + "  " 
         + candidate['CandidateName'] + "  " 
         + str(candidate['FinalAutoMLJobObjectiveMetric']['Value']))
  index += 1

1  fraud-detection-11-04-46-52EtgHM-011-218f95f5  0.8444399833679199
2  fraud-detection-11-04-46-52EtgHM-015-ef5ac858  0.8351699709892273
3  fraud-detection-11-04-46-52EtgHM-004-0aeeed95  0.6523600220680237
4  fraud-detection-11-04-46-52EtgHM-019-09a41b6d  0.644070029258728
5  fraud-detection-11-04-46-52EtgHM-014-2eb92b33  0.6104400157928467
6  fraud-detection-11-04-46-52EtgHM-005-8200efaa  0.485289990901947
7  fraud-detection-11-04-46-52EtgHM-008-c5fbd4ef  0.30254998803138733
8  fraud-detection-11-04-46-52EtgHM-009-f9c2449d  0.3001900017261505
9  fraud-detection-11-04-46-52EtgHM-007-526ae705  0.20699000358581543
10  fraud-detection-11-04-46-52EtgHM-006-59b99a28  0.20051999390125275


In [23]:
best_candidate = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['BestCandidate']
best_candidate_name = best_candidate['CandidateName']

print("Candidate name: " + best_candidate_name)

Candidate name: fraud-detection-11-04-46-52EtgHM-011-218f95f5


## Evaluate on Testset by Hosting an Endpoint
Let's now deploy the model as endpoint and then make predictions on the test set to see how well the model performs on the hold-out test set.

In [24]:
timestamp_suffix = strftime("%d-%H-%M-%S", gmtime())
model_name = best_candidate_name + timestamp_suffix + "-model"
model_arn = sm.create_model(
    Containers=best_candidate["InferenceContainers"], ModelName=model_name, ExecutionRoleArn=role
)

epc_name = best_candidate_name + timestamp_suffix + "-epc"
ep_config = sm.create_endpoint_config(
    EndpointConfigName=epc_name,
    ProductionVariants=[
        {
            "InstanceType": "ml.m5.2xlarge",
            "InitialInstanceCount": 1,
            "ModelName": model_name,
            "VariantName": "main",
        }
    ],
)

ep_name = best_candidate_name + timestamp_suffix + "-ep"
create_endpoint_response = sm.create_endpoint(EndpointName=ep_name, EndpointConfigName=epc_name)

In [25]:
sm.get_waiter("endpoint_in_service").wait(EndpointName=ep_name)

### Evaluate

In [47]:
tp = tn = fp = fn = count = 0

with open('automl-test.csv') as f:
    lines = f.readlines()
    for l in lines[1:]:   # Skip header
        l = l.split(',')  # Split CSV line into features
        label = l[-1]     # Store 0/1 label
        l = l[:-1]        # Remove label
        l = ','.join(l)   # Rebuild CSV line without label
                
        response = sm_rt.invoke_endpoint(EndpointName=ep_name, ContentType='text/csv', Accept='text/csv', Body=l)

        response = response['Body'].read().decode("utf-8")
        #print ("label %s response %s" %(label,response))

        if '1' in label:
            # Sample is positive
            if '1' in response:
                # True positive
                tp=tp+1
            else:
                # False negative
                fn=fn+1
        else:
            # Sample is negative
            if '0' in response:
                # True negative
                tn=tn+1
            else:
                # False positive
                fp=fp+1
        count = count+1
        if (count % 100 == 0):   
            sys.stdout.write(str(count)+' ')
            
print ("Done")

100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 2100 2200 2300 2400 2500 2600 2700 2800 2900 3000 3100 3200 3300 3400 3500 3600 3700 3800 3900 4000 4100 4200 4300 4400 4500 4600 4700 4800 4900 5000 5100 5200 5300 5400 5500 5600 5700 5800 5900 6000 6100 6200 6300 6400 6500 6600 6700 6800 6900 7000 7100 7200 7300 7400 7500 7600 7700 7800 7900 8000 8100 8200 8300 8400 8500 8600 8700 8800 8900 9000 9100 9200 9300 9400 9500 9600 9700 9800 9900 10000 10100 10200 10300 10400 10500 10600 10700 10800 10900 11000 11100 11200 11300 11400 11500 11600 11700 11800 11900 12000 12100 12200 12300 12400 12500 12600 12700 12800 12900 13000 13100 13200 13300 13400 13500 13600 13700 13800 13900 14000 14100 14200 14300 14400 14500 14600 14700 14800 14900 15000 15100 15200 15300 15400 15500 15600 15700 15800 15900 16000 16100 16200 16300 16400 16500 16600 16700 16800 16900 17000 17100 17200 17300 17400 17500 17600 17700 17800 17900 18000 18100 18200 18300 18400 1850

In [49]:
 # Confusion matrix
print ("%d %d" % (tn, fp))
print ("%d %d" % (fn, tp))

accuracy  = (tp+tn)/(tp+tn+fp+fn)
precision = tp/(tp+fp)
recall    = tp/(tp+fn)
f1        = (2*precision*recall)/(precision+recall)

print ("Accuracy: %.4f, Precision: %.4f, Recall: %.4f, F1: %.4f" % (accuracy, precision, recall, f1))

56830 7
3 121
Accuracy: 0.9998, Precision: 0.9453, Recall: 0.9758, F1: 0.9603
