
# Credit card fraud detector

In this solution we build the core of a credit card fraud detection system using Amazon SageMaker. We start by training an anomaly detection algorithm, then train two XGBoost models in a supervised setting. To deal with the highly imbalanced data common in fraud detection, the first model re-weights the data and the second re-samples it, using the popular SMOTE technique to oversample the rare fraud cases.

Our solution includes an example of calling a REST API to simulate a real deployment, using AWS Lambda to trigger both the anomaly detection and XGBoost models.

You can select Run->Run All from the menu to run all cells in Studio (or Cell->Run All in a SageMaker Notebook Instance).

Note: When running this notebook on SageMaker Studio, you should make sure the 'SageMaker JumpStart Data Science 1.0' image/kernel is used.

In [10]:
import sys

sys.path.insert(0, './src/')

In [40]:
import boto3
from zipfile import ZipFile
import yaml
import os

region_name = boto3.Session().region_name
print(f"Region : {region_name} ")

s3 = boto3.resource('s3', region_name=region_name)

# Define the path to your YAML file
file_path = 'credit-fraud-detection-aws/config.yml'

try:
    with open(file_path, 'r') as file:
        params = yaml.safe_load(file)

except FileNotFoundError:
    print(f"Error: The file {file_path} was not found.")
    raise  # later cells depend on `params`, so stop here
except Exception as e:
    print(f"An error occurred: {e}")
    raise


Region : us-east-1 
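
The later cells read `bucket_name`, `object_key`, and `role` from `params`, so a missing key only surfaces further down as a `KeyError`. A minimal sketch of an early check, assuming those three keys are the only ones the notebook needs; the helper name `validate_params` and the example values are hypothetical:

```python
# Keys that the later cells read from config.yml.
REQUIRED_KEYS = ("bucket_name", "object_key", "role")


def validate_params(params):
    """Return the config unchanged, or raise if a required key is missing."""
    if not isinstance(params, dict):
        raise ValueError("config.yml did not parse to a mapping")
    missing = [key for key in REQUIRED_KEYS if key not in params]
    if missing:
        raise KeyError(f"config.yml is missing keys: {', '.join(missing)}")
    return params


# Placeholder values for illustration only.
example = {
    "bucket_name": "my-bucket",
    "object_key": "creditcard.csv.zip",
    "role": "my-sagemaker-role",
}
validate_params(example)
```

Calling `validate_params(params)` right after loading the file would make a misconfigured `config.yml` fail at the point where it is read.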


In [22]:
bucket_name = params['bucket_name']
object_key = params['object_key']

# Create S3 object
obj = s3.Object(bucket_name, object_key)

# Download the archive to a local path unless it is already there
dataset_dir = 'credit-fraud-detection-aws/dataset'
zip_path = os.path.join(dataset_dir, 'creditcard.csv.zip')

os.makedirs(dataset_dir, exist_ok=True)

if not os.path.exists(zip_path):
    obj.download_file(zip_path)

In [27]:
with ZipFile('credit-fraud-detection-aws/dataset/creditcard.csv.zip', 'r') as zf:
    zf.extractall('credit-fraud-detection-aws/dataset')

In [28]:
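import numpy as np
import pandas as pd

# Read the CSV from the directory the archive was extracted into
data = pd.read_csv('credit-fraud-detection-aws/dataset/creditcard.csv', delimiter=',')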

In [29]:
data

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.166480,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.167170,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.379780,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.108300,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.50,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.206010,0.502292,0.219422,0.215153,69.99,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284802,172786.0,-11.881118,10.071785,-9.834783,-2.066656,-5.364473,-2.606837,-4.918215,7.305334,1.914428,...,0.213454,0.111864,1.014480,-0.509348,1.436807,0.250034,0.943651,0.823731,0.77,0
284803,172787.0,-0.732789,-0.055080,2.035030,-0.738589,0.868229,1.058415,0.024330,0.294869,0.584800,...,0.214205,0.924384,0.012463,-1.016226,-0.606624,-0.395255,0.068472,-0.053527,24.79,0
284804,172788.0,1.919565,-0.301254,-3.249640,-0.557828,2.630515,3.031260,-0.296827,0.708417,0.432454,...,0.232045,0.578229,-0.037501,0.640134,0.265745,-0.087371,0.004455,-0.026561,67.88,0
284805,172788.0,-0.240440,0.530483,0.702510,0.689799,-0.377961,0.623708,-0.686180,0.679145,0.392087,...,0.265245,0.800049,-0.163298,0.123205,-0.569159,0.546668,0.108821,0.104533,10.00,0


In [30]:
nonfrauds, frauds = data.groupby('Class').size()
print('Number of frauds: ', frauds)
print('Number of non-frauds: ', nonfrauds)
print('Percentage of fraudulent data:', 100.*frauds/(frauds + nonfrauds))

Number of frauds:  492
Number of non-frauds:  284315
Percentage of fraudulent data: 0.1727485630620034
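
With roughly 0.17% positives, the re-weighting approach mentioned in the introduction is usually expressed through XGBoost's `scale_pos_weight` hyperparameter. A minimal sketch, assuming the class counts printed above. Using the square root of the class ratio is one common, more conservative alternative to the raw ratio; it is an assumption here, not something the training cells in this notebook apply:

```python
from math import sqrt

# Class counts taken from the output above.
frauds = 492
nonfrauds = 284315

# Ratio of negatives to positives, and a damped version of it.
ratio = nonfrauds / frauds
scale_pos_weight = sqrt(ratio)

# Hypothetical hyperparameter set for a re-weighted model.
weighted_hyperparams = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "scale_pos_weight": scale_pos_weight,
}

print(f"class ratio: {ratio:.1f}, scale_pos_weight: {scale_pos_weight:.2f}")
```

A larger `scale_pos_weight` makes errors on fraud cases count more during training, at the cost of more false positives.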


In [31]:
feature_columns = data.columns[:-1]
label_column = data.columns[-1]

features = data[feature_columns].values.astype('float32')
labels = (data[label_column].values).astype('float32')

In [32]:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.1, random_state=42)
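
SMOTE, the re-sampling technique named in the introduction, creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. It is normally applied to the training split only, so the test set keeps the real class balance. The following is a minimal pure-Python sketch of the interpolation step on toy data, not a library implementation (in practice the `imbalanced-learn` package provides a `SMOTE` class). The function name `smote_sketch`, its parameters, and the toy points are hypothetical:

```python
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from a list of minority-class vectors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours of the base point,
        # excluding the base point itself.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: sq_dist(base, minority[j]),
        )[:k]
        neighbour = minority[rng.choice(neighbours)]
        # Random position along the segment from base to neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class: the four corners of the unit square.
toy_frauds = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(toy_frauds, n_new=6, k=2)
```

Each synthetic point lies on a line segment between two real minority points, so it stays inside the region the minority class already occupies.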

## Training

In [33]:
import os
import sagemaker

session = sagemaker.Session()

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


In [34]:
import io
from sklearn.datasets import dump_svmlight_file

buf = io.BytesIO()

# Serialize the training split in LIBSVM (svmlight) format into an in-memory buffer
dump_svmlight_file(X_train, y_train, buf)


In [35]:
key = 'fraud-dataset'
prefix = 'fraud-classifier'
subdir = 'base'
buf.seek(0)
boto3.resource('s3', region_name=region_name).Bucket(bucket_name).Object(os.path.join(prefix, 'train', subdir, key)).upload_fileobj(buf)

s3_train_data = 's3://{}/{}/train/{}/{}'.format(bucket_name, prefix, subdir, key)
print('Uploaded training data location: {}'.format(s3_train_data))

output_location = 's3://{}/{}/output'.format(bucket_name, prefix)
print('Training artifacts will be uploaded to: {}'.format(output_location))

Uploaded training data location: s3://sandboxstav/fraud-classifier/train/base/fraud-dataset
Training artifacts will be uploaded to: s3://sandboxstav/fraud-classifier/output


In [36]:
buf.close()

In [37]:
xgboost_image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost",
    region=boto3.Session().region_name,
    version="0.90-2",
    py_version="py3",
)

In [38]:
from math import sqrt
import numpy as np

hyperparams = {
    "max_depth":5,
    "subsample":0.8,
    "num_round":100,
    "eta":0.2,
    "gamma":4,
    "min_child_weight":6,
    "silent":0,
    "objective":'binary:logistic',
    "eval_metric":'auc'
}

In [41]:
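import boto3

iam = boto3.client('iam')

# list_roles is paginated, so walk every page when looking for the configured role
role_name = None
for page in iam.get_paginator('list_roles').paginate():
    for role in page['Roles']:
        if params['role'] in role['RoleName']:
            role_name = role['RoleName']

if role_name is None:
    raise ValueError(f"No IAM role matching {params['role']!r} was found.")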


In [42]:
from sagemaker.estimator import Estimator

# output_location and s3_train_data were defined when the training data was uploaded
clf = Estimator(
    image_uri=xgboost_image_uri,
    role=role_name,
    instance_count=1,
    instance_type='ml.m5.large',
    use_spot_instances=False,
    hyperparameters=hyperparams,
    output_path=output_location,
    sagemaker_session=session,
    base_job_name="xgbv2"
)
clf.fit({'train': s3_train_data})

INFO:sagemaker:Creating training-job with name: xgbv2-2025-11-20-15-58-00-259


2025-11-20 15:58:00 Starting - Starting the training job...
2025-11-20 15:58:22 Starting - Preparing the instances for training...
2025-11-20 15:58:42 Downloading - Downloading input data...
2025-11-20 15:59:33 Downloading - Downloading the training image......
2025-11-20 16:00:14 Training - Training image download completed. Training in progress..
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter eval_metric value auc to Json.
Returning the value itself
INFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
[16:00:19] 256326x30 matrix with 7688153 entries loaded from /opt/ml/input/data/train
INFO:r

In [None]:
from sagemaker.estimator import Estimator

# Same training job as above, but on managed spot instances;
# output_location and s3_train_data were defined when the training data was uploaded
clf = Estimator(
    image_uri=xgboost_image_uri,
    role=role_name,
    instance_count=1,
    instance_type='ml.m5.large',
    use_spot_instances=True,
    hyperparameters=hyperparams,
    output_path=output_location,
    sagemaker_session=session,
    base_job_name="xgbv2spot",
    volume_size=5,  # 5 GB
    max_run=3600,
    max_wait=3600*2,
)
clf.fit({'train': s3_train_data})

INFO:sagemaker.telemetry.telemetry_logging:SageMaker Python SDK will collect telemetry to help us better understand our user's needs, diagnose issues, and deliver additional features.
To opt out of telemetry, please disable via TelemetryOptOut parameter in SDK defaults config. For more information, refer to https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk.
INFO:sagemaker:Creating training-job with name: xgbv2spot-2025-11-20-16-02-18-648


2025-11-20 16:02:18 Starting - Starting the training job...
2025-11-20 16:02:50 Starting - Preparing the instances for training...
2025-11-20 16:03:08 Downloading - Downloading input data.

## Deployment


In [17]:
clf.deploy(initial_instance_count=1, instance_type='ml.t2.medium', endpoint_name = 'final-xgb')

INFO:sagemaker:Creating model with name: xgbv2spot-2025-11-16-13-36-13-960
INFO:sagemaker:Creating endpoint-config with name final-xgb
INFO:sagemaker:Creating endpoint with name final-xgb


----------!

<sagemaker.base_predictor.Predictor at 0x7efd576c2930>

In [120]:
import boto3

runtime = boto3.client("sagemaker-runtime")

# Example feature vector (replace with your actual features)
payload = "0.0,1.2,3.4,5.6,7.8,9.0,2.1,4.3,6.5,8.7,0.0,1.2,3.4,5.6,7.8,9.0,2.1,4.3,6.5,8.7,0.0,1.2,3.4,5.6,7.8,9.0,2.1,4.3,6.5,8.7"

response = runtime.invoke_endpoint(
    EndpointName="final-xgb",   # replace with your endpoint name
    ContentType="text/csv",     # send the features as CSV rather than JSON
    Body=payload
)

print(response['Body'].read().decode())


0.0007510025170631707
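
The introduction describes calling the model through a REST API backed by AWS Lambda. A minimal sketch of such a handler, assuming an API Gateway proxy-style event whose JSON body carries a `data` field with the comma-separated features, the `final-xgb` endpoint deployed above, and a 0.5 decision threshold. The event shape, the threshold, and the optional `runtime` argument (added so the function can be exercised without AWS) are assumptions, not part of the original notebook:

```python
import json

ENDPOINT_NAME = "final-xgb"  # endpoint created in the deployment cell
THRESHOLD = 0.5              # assumed cut-off for flagging a transaction


def lambda_handler(event, context, runtime=None):
    """Forward CSV features to the SageMaker endpoint and return a JSON verdict."""
    if runtime is None:
        # Imported lazily so the module can be loaded without boto3 installed.
        import boto3
        runtime = boto3.client("sagemaker-runtime")

    # Assumed request body: {"data": "<comma-separated feature values>"}
    body = json.loads(event["body"])
    payload = body["data"]

    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="text/csv",
        Body=payload,
    )
    # The endpoint returns the predicted fraud probability as plain text.
    score = float(response["Body"].read().decode())

    return {
        "statusCode": 200,
        "body": json.dumps({"fraud_score": score, "is_fraud": score >= THRESHOLD}),
    }
```

The score of about 0.00075 returned above would fall well below this assumed threshold and be reported as non-fraudulent. The threshold is a business decision and would normally be tuned on held-out data.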
