# AWS Credit Card Fraud Detection

In this solution we will build the core of a credit card fraud detection system using SageMaker. We will start by training an anomaly detection algorithm, then proceed to train two XGBoost models for supervised training. To deal with the highly unbalanced data common in fraud detection, our first model will use re-weighting of the data, and the second will use re-sampling, using the popular SMOTE technique for oversampling the rare fraud data.

Our solution includes an example of making calls to a REST API to simulate a real deployment, using AWS Lambda to trigger both the anomaly detection and XGBoost model.

You can select Run->Run All from the menu to run all cells in Studio (or Cell->Run All in a SageMaker Notebook Instance).

**Note**: When running this notebook on SageMaker Studio, you should make sure the 'SageMaker JumpStart Data Science 1.0' image/kernel is used.

In [2]:
import os
import zipfile
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest
import xgboost as xgb
import boto3
import joblib
from dotenv import load_dotenv
load_dotenv()

True

In [3]:
import sys
sys.path.insert(0, '.')

### Set up environment

Let's set up the environment by reading the AWS region, credentials, and S3 location from environment variables.

In [4]:
# Read the configuration from environment variables
aws_region = os.environ.get('AWS_REGION')
aws_access_key = os.getenv("AWS_ID_ACCESS_KEY")
aws_secret_key = os.getenv("AWS_SECRET_ACCESS_KEY")
s3_bucket = os.getenv("SOLUTIONS_S3_BUCKET")
s3_prefix = os.getenv("SOLUTION_NAME")

print(f"aws_region: {aws_region}")
# Never print credentials; only report whether they are set
print(f"aws_access_key set: {aws_access_key is not None}")
print(f"aws_secret_key set: {aws_secret_key is not None}")
print(f"s3_bucket: {s3_bucket}")
print(f"s3_prefix: {s3_prefix}")

aws_region: eu-west-1
aws_access_key set: True
aws_secret_key set: True
s3_bucket: credit-card-s3
s3_prefix: fraud-detection


In [5]:
DATASET_PATH = 'dataset'
os.makedirs(DATASET_PATH, exist_ok=True)
# os.makedirs(CHECKPOINTS_PATH, exist_ok=True)

In [6]:
# Initialize the S3 client
s3_client = boto3.client(
    's3',
    aws_access_key_id=aws_access_key,
    aws_secret_access_key=aws_secret_key,
    region_name=aws_region
)

In [7]:
# Download file from S3
s3_key = f"{s3_prefix}/dataset/creditcard.csv.zip"
local_zip_path = f"{DATASET_PATH}/creditcard.csv.zip"

print("Downloading...")
s3_client.download_file(s3_bucket, s3_key, local_zip_path)
print(f"Download complete: {local_zip_path}")

Downloading...
Download complete: dataset/creditcard.csv.zip


In [8]:
# Unzip file to DATASET_PATH
print("Unzipping...")
with zipfile.ZipFile(local_zip_path, 'r') as zip_ref:
    zip_ref.extractall(DATASET_PATH)
print(f"Files extracted to folder '{DATASET_PATH}'.")

Unzipping...
Files extracted to folder 'dataset'.


In [9]:
# (Optional) Remove the zip file
os.remove(local_zip_path)

## Investigate and process the data

Let's start by reading in the credit card fraud data set.

In [10]:
data = pd.read_csv(f"{DATASET_PATH}/creditcard.csv", delimiter=',')
data.head()

| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.16648 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.16717 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
| 2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.37978 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
| 3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.1083 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.5 | 0 |
| 4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.20601 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |


Let's take a peek at our data (we only show a subset of the columns in the table):

In [11]:
print(data.columns)
data[['Time', 'V1', 'V2', 'V27', 'V28', 'Amount', 'Class']].describe()

Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class'],
      dtype='object')


| | Time | V1 | V2 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|
| count | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 | 284807.0 |
| mean | 94813.859575 | 1.168375e-15 | 3.416908e-16 | -3.660091e-16 | -1.22739e-16 | 88.349619 | 0.001727 |
| std | 47488.145955 | 1.958696 | 1.651309 | 0.4036325 | 0.3300833 | 250.120109 | 0.041527 |
| min | 0.0 | -56.40751 | -72.71573 | -22.56568 | -15.43008 | 0.0 | 0.0 |
| 25% | 54201.5 | -0.9203734 | -0.5985499 | -0.07083953 | -0.05295979 | 5.6 | 0.0 |
| 50% | 84692.0 | 0.0181088 | 0.06548556 | 0.001342146 | 0.01124383 | 22.0 | 0.0 |
| 75% | 139320.5 | 1.315642 | 0.8037239 | 0.09104512 | 0.07827995 | 77.165 | 0.0 |
| max | 172792.0 | 2.45493 | 22.05773 | 31.6122 | 33.84781 | 25691.16 | 1.0 |


The dataset contains only numerical features because the original features have been transformed using PCA to protect user privacy. As a result, the dataset contains 28 PCA components, V1-V28, and two features that haven't been transformed, _Amount_ and _Time_. _Amount_ refers to the transaction amount, and _Time_ is the number of seconds elapsed between each transaction and the first transaction in the data.

The _Class_ column indicates whether or not a transaction is fraudulent. The vast majority of the data is non-fraudulent: only $492$ ($0.173\%$) of the 284,807 examples in the data are fraudulent.

In [12]:
nonfrauds, frauds = data.groupby('Class').size()
print('Number of frauds: ', frauds)
print('Number of non-frauds: ', nonfrauds)
print('Percentage of fraudulent data:', 100.*frauds/(frauds + nonfrauds))

Number of frauds:  492
Number of non-frauds:  284315
Percentage of fraudulent data: 0.1727485630620034
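
The same counts can also be expressed as a ratio of negative to positive examples, which is a common choice for XGBoost's `scale_pos_weight` parameter when re-weighting the classes. The following is a minimal sketch using the counts printed above; the helper name `imbalance_ratio` is hypothetical and not part of the original notebook.

```python
def imbalance_ratio(n_negative: int, n_positive: int) -> float:
    """Ratio of negative to positive examples (a common choice for scale_pos_weight)."""
    if n_positive <= 0:
        raise ValueError("need at least one positive example")
    return n_negative / n_positive


# Counts taken from the cell output above
ratio = imbalance_ratio(n_negative=284315, n_positive=492)
fraud_share = 100.0 * 492 / (492 + 284315)

print(f"non-frauds per fraud: {ratio:.1f}")
print(f"fraud share: {fraud_share:.3f}%")
```

A ratio this large is why the data is split before any re-weighting or re-sampling is applied.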


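Because the columns $V_i$ are the result of a PCA, they are centered: their means in the summary above are $0$ up to floating-point error. Their standard deviations are not equal to one, however (for example, about $1.96$ for V1 and $0.33$ for V28), so the components are not normalized to unit variance.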
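
A quick way to see how PCA output is scaled is to compute the mean and standard deviation of each component directly. The sketch below uses synthetic data (not the credit card dataset) and a plain NumPy PCA via SVD. It illustrates that principal components are centered, while their standard deviations depend on how much variance each component captures.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic correlated data: 1000 rows, 3 features with different scales
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.5, 0.2, 0.3]])
raw = rng.normal(size=(1000, 3)) @ mixing

# PCA: center the data, then project onto the right singular vectors
centered = raw - raw.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
components = centered @ vt.T

component_means = components.mean(axis=0)
component_stds = components.std(axis=0)

print("means:", np.round(component_means, 6))
print("stds: ", np.round(component_stds, 3))
```

On the real data, `data.describe()` gives the same information for V1-V28.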

In [13]:
feature_columns = data.columns[:-1]
label_column = data.columns[-1]

features = data[feature_columns].values.astype('float32')
labels = (data[label_column].values).astype('float32')

Next, we will prepare our data for loading and training.

## Training

We will split our dataset into a train set and a test set to evaluate the performance of our models. It's important to do so _before_ applying any technique meant to alleviate the class imbalance. This ensures that we don't leak information from the test set into the train set.

In [14]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.1, random_state=42
)
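
With so few fraud cases, a purely random split can leave the test set with a different fraud rate than the train set. One option, not used in the cell above, is the `stratify` argument of `train_test_split`, which preserves the class proportions in both splits. A minimal sketch on synthetic data (the 1% positive rate is an assumption chosen for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Synthetic stand-in: 10,000 rows with 1% positives
X = rng.normal(size=(10_000, 5)).astype("float32")
y = np.zeros(10_000, dtype="float32")
y[:100] = 1.0

# stratify=y keeps the positive rate the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.1, random_state=42, stratify=y
)

print("train positive rate:", float(y_tr.mean()))
print("test positive rate: ", float(y_te.mean()))
```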

> Note: If you are bringing your own data to this solution and they include categorical data, that have strings as values, you'd need to one-hot encode these values first using for example sklearn's [OneHotEncoder](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features), as XGBoost only supports numerical data.

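As an illustration of the one-hot encoding mentioned in the note, here is a minimal sketch with scikit-learn's `OneHotEncoder`. The `merchant_type` column and its values are hypothetical; the credit card dataset used here has no categorical columns.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column with string values
merchant_type = np.array([["grocery"], ["travel"], ["grocery"], ["online"]])

# handle_unknown="ignore" encodes unseen categories as all zeros at predict time
encoder = OneHotEncoder(handle_unknown="ignore")
# fit_transform returns a sparse matrix by default; densify it for display
encoded = encoder.fit_transform(merchant_type).toarray().astype("float32")

print(encoder.categories_[0])
print(encoded)
```

The encoded columns can then be concatenated with the numerical features before training.
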
## Unsupervised Learning

In [17]:
import os
import sagemaker


# sagemaker_iam_role = os.getenv("SAGEMAKER_IAM_ROLE")
sagemaker_session = sagemaker.Session()
sagemaker_iam_role = sagemaker.get_execution_role()
default_bucket = sagemaker_session.default_bucket()

data_location = 's3://{}/{}/'.format(default_bucket, s3_prefix)
base_job_name = "{}-rcf".format(s3_prefix)
output_path = 's3://{}/{}/output'.format(default_bucket, s3_prefix)


print(sagemaker_iam_role)
print(default_bucket)
print('Training artifacts will be uploaded to: {}'.format(output_path))
data_location

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
arn:aws:iam::582901115576:role/service-role/AmazonSageMaker-ExecutionRole-20250615T134459
sagemaker-eu-west-1-582901115576
Training artifacts will be uploaded to: s3://sagemaker-eu-west-1-582901115576/fraud-detection/output


's3://sagemaker-eu-west-1-582901115576/fraud-detection/'

In a fraud detection scenario, we commonly have very few labeled examples, and labeling fraud can take a very long time. We would therefore like to extract information from the unlabeled data at hand as well. _Anomaly detection_ is a form of unsupervised learning where we try to identify anomalous examples based solely on their feature characteristics. Random Cut Forest (RCF) is a state-of-the-art anomaly detection algorithm that is both accurate and scalable. We will train such a model on our training data and evaluate its performance on our test set.
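
To get a feel for tree-based anomaly scoring before launching a training job, the idea can be tried locally. The sketch below uses scikit-learn's `IsolationForest`, which is a different algorithm from Random Cut Forest and serves only as a local stand-in, on synthetic data. The tree count and sample size mirror the RCF hyperparameters used in this notebook, but the two models are not equivalent.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic data: a dense cluster of normal points and a few distant outliers
normal_points = rng.normal(loc=0.0, scale=1.0, size=(2000, 4))
outliers = rng.normal(loc=6.0, scale=1.0, size=(20, 4))

forest = IsolationForest(n_estimators=50, max_samples=512, random_state=0)
forest.fit(normal_points)

# score_samples is higher for normal points, so negate it to get
# an anomaly score where higher means more anomalous
normal_scores = -forest.score_samples(normal_points)
outlier_scores = -forest.score_samples(outliers)

print("mean anomaly score (normal): ", normal_scores.mean())
print("mean anomaly score (outlier):", outlier_scores.mean())
```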

In [16]:
from sagemaker import RandomCutForest

# specify general training job information
rcf = RandomCutForest(
    sagemaker_session=sagemaker_session,
    role=sagemaker_iam_role,
    instance_count=1,
    instance_type='ml.m4.xlarge',
    data_location=data_location,
    output_path=output_path,
    base_job_name=base_job_name,
    num_samples_per_tree=512,
    num_trees=50
)

Now we are ready to fit the model. The cell below should take around 5 minutes to complete.

In [None]:
rcf.fit(rcf.record_set(X_train))

INFO:sagemaker.image_uris:Same images used for training and inference. Defaulting to image scope: inference.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating training-job with name: fraud-detection-rcf-2025-06-17-01-03-46-997


2025-06-17 01:03:49 Starting - Starting the training job.

### Host Random Cut Forest

Once we have a trained model, we can deploy it and get predictions for our test set. SageMaker will spin up an instance and deploy the model. The whole process should take around 10 minutes; you will see a `-` printed as progress is made and an exclamation point when the process is finished.

In [None]:
rcf_predictor = rcf.deploy(
    model_name="{}-rcf".format(s3_prefix),
    endpoint_name="{}-rcf-endpoint".format(s3_prefix),
    initial_instance_count=1,
    instance_type='ml.m4.xlarge'
)

In [None]:
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

# Send CSV to the endpoint and parse the JSON response.
# In SageMaker Python SDK v2, content_type and accept are read-only properties
# derived from the serializer and deserializer, so they are not set directly.
rcf_predictor.serializer = CSVSerializer()
rcf_predictor.deserializer = JSONDeserializer()
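
For reference, the CSV-in, JSON-out exchange configured above can be sketched in plain Python. This sketch does not call the endpoint. The helper names `to_csv_payload` and `parse_scores` are hypothetical, and the example response body is made up to match the `{"scores": [{"score": ...}]}` shape that this notebook reads from the predictor.

```python
import json


def to_csv_payload(rows):
    """Serialize an iterable of numeric rows into a CSV request body."""
    return "\n".join(",".join(str(value) for value in row) for row in rows)


def parse_scores(response_body):
    """Extract anomaly scores from a JSON body shaped like {"scores": [{"score": ...}]}."""
    return [entry["score"] for entry in json.loads(response_body)["scores"]]


payload = to_csv_payload([[0.0, -1.36, 149.62], [1.0, 1.19, 2.69]])
# Made-up response body, used only to illustrate the parsing step
example_body = '{"scores": [{"score": 1.02}, {"score": 0.87}]}'
scores = parse_scores(example_body)

print(payload)
print(scores)
```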

### Test Random Cut Forest

With the model deployed, let's see how it performs in terms of separating fraudulent from legitimate transactions.

In [None]:
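def predict_rcf(current_predictor, data, rows=500):
    """Score `data` in batches of roughly `rows` rows and return a 1-D array of anomaly scores."""
    # Split the data into batches so that each request stays small
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = []
    for array in split_array:
        # Each response has the form {'scores': [{'score': ...}, ...]}
        array_preds = [s['score'] for s in current_predictor.predict(array)['scores']]
        predictions.append(array_preds)

    return np.concatenate([np.array(batch) for batch in predictions])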
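
The batching logic can be checked locally without a deployed endpoint. The sketch below repeats the same batching idea against a hypothetical `FakePredictor` that mimics the response shape and returns the row sum as a dummy score. It is only a stand-in for testing the batching, not a model.

```python
import numpy as np


class FakePredictor:
    """Hypothetical stand-in that mimics the endpoint's response shape."""

    def __init__(self):
        self.calls = 0

    def predict(self, array):
        self.calls += 1
        # Use the row sum as a dummy anomaly score
        return {"scores": [{"score": float(row.sum())} for row in array]}


def predict_in_batches(predictor, data, rows=500):
    """Same batching idea as predict_rcf: split, score each batch, concatenate."""
    n_batches = int(data.shape[0] / float(rows) + 1)
    scores = []
    for batch in np.array_split(data, n_batches):
        scores.extend(s["score"] for s in predictor.predict(batch)["scores"])
    return np.array(scores)


data = np.arange(1200 * 3, dtype="float32").reshape(1200, 3)
fake = FakePredictor()
scores = predict_in_batches(fake, data, rows=500)

print(scores.shape, fake.calls)
```

With 1200 rows and `rows=500`, the data is split into three batches, so the predictor is called three times and one score is returned per input row.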

In [None]:
positives = X_test[y_test == 1]
positives_scores = predict_rcf(rcf_predictor, positives)

negatives = X_test[y_test == 0]
negatives_scores = predict_rcf(rcf_predictor, negatives)

In [None]:
positives_scores

In [None]:
negatives_scores

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(color_codes=True)

In [None]:
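import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(8, 6))
# Normalize each histogram to a density so the rare fraud class stays visible
sns.histplot(positives_scores, label='fraud', bins=20, stat='density', ax=ax)
sns.histplot(negatives_scores, label='not-fraud', bins=20, stat='density', ax=ax)
ax.set_xlabel('anomaly score')
ax.legend()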

The unsupervised model already achieves some separation between the classes, with higher anomaly scores correlating with fraud.
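
The separation can also be quantified rather than judged from the plot alone. The sketch below computes the ROC AUC of the anomaly scores against the labels and applies a common heuristic that flags scores more than three standard deviations above the mean. It runs on synthetic scores, not on results from the endpoint, and the distributions chosen are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
# Synthetic anomaly scores: frauds tend to score higher than non-frauds
negatives = rng.normal(loc=1.0, scale=0.2, size=5000)
positives = rng.normal(loc=1.6, scale=0.3, size=50)

scores = np.concatenate([negatives, positives])
labels = np.concatenate([np.zeros_like(negatives), np.ones_like(positives)])

# The AUC measures how well the scores rank frauds above non-frauds
auc = roc_auc_score(labels, scores)

# Flag scores more than three standard deviations above the mean
threshold = scores.mean() + 3 * scores.std()
flagged = scores > threshold

print(f"ROC AUC: {auc:.3f}")
print(f"flagged: {int(flagged.sum())}, of which fraud: {int(labels[flagged].sum())}")
```

With the real endpoint, `positives_scores` and `negatives_scores` would take the place of the synthetic arrays.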

## Clean up

We will leave the unsupervised and base XGBoost endpoints running at the end of this notebook so we can handle incoming event streams using the Lambda function. The solution will automatically clean up the endpoints when it is deleted. However, make sure the prediction endpoints are deleted when you're done. You can do that on the Endpoints page of the Amazon SageMaker console, or you can run `predictor_name.delete_endpoint()` here.

In [None]:
# Run this cell to delete the RCF model and endpoint, then wait until the endpoint is gone
rcf_predictor.delete_model()
rcf_predictor.delete_endpoint()
sm_client = boto3.client('sagemaker', region_name=aws_region)
waiter = sm_client.get_waiter('endpoint_deleted')
waiter.wait(EndpointName="{}-rcf-endpoint".format(s3_prefix))



## Data Acknowledgements

The dataset used to demonstrate the fraud detection solution has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on https://www.researchgate.net/project/Fraud-detection-5 and the page of the [DefeatFraud](https://mlg.ulb.ac.be/wordpress/portfolio_page/defeatfraud-assessment-and-validation-of-deep-feature-engineering-and-learning-solutions-for-fraud-detection/) project.
We cite the following works:
* Dal Pozzolo, Andrea; Caelen, Olivier; Johnson, Reid A.; Bontempi, Gianluca. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015.
* Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Aël; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective. Expert Systems with Applications, 41(10), 4915-4928, 2014, Pergamon.
* Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy. IEEE Transactions on Neural Networks and Learning Systems, 29(8), 3784-3797, 2018, IEEE.
* Dal Pozzolo, Andrea. Adaptive Machine learning for credit card fraud detection. ULB MLG PhD thesis (supervised by G. Bontempi).
* Carcillo, Fabrizio; Dal Pozzolo, Andrea; Le Borgne, Yann-Aël; Caelen, Olivier; Mazzer, Yannis; Bontempi, Gianluca. Scarff: a scalable framework for streaming credit card fraud detection with Spark. Information Fusion, 41, 182-194, 2018, Elsevier.
* Carcillo, Fabrizio; Le Borgne, Yann-Aël; Caelen, Olivier; Bontempi, Gianluca. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization. International Journal of Data Science and Analytics, 5(4), 285-300, 2018, Springer International Publishing.