# Payment Fraud Detection

The data used in this notebook was obtained from Kaggle (Dal Pozzolo et al., 2015). Link to the dataset: https://www.kaggle.com/mlg-ulb/creditcardfraud/data

According to the data dictionary, the variables V1 to V28 are the result of a dimensionality reduction process (PCA), applied to protect the identity of the people involved. The Time variable represents the number of seconds elapsed between each transaction and the first transaction in the dataset.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

#As we will deploy our model in Amazon SageMaker, it is necessary to import the packages for that task
import sagemaker
import boto3
from sagemaker import get_execution_role #With get_execution_role we can obtain the role/permissions of the session

In [2]:
client = boto3.client('sagemaker-runtime')

In [3]:
sagemaker_role=sagemaker.get_execution_role()
sagemaker_session=sagemaker.Session()
bucket=sagemaker_session.default_bucket()
print(sagemaker_role)
print(sagemaker_session)
print(bucket)

arn:aws:iam::095482984955:role/service-role/AmazonSageMaker-ExecutionRole-20210117T214625
<sagemaker.session.Session object at 0x7f24565fe358>
sagemaker-us-east-1-095482984955


In [4]:
data_fraud=pd.read_csv('creditcard.csv')

In [5]:
print('Data Points: {}'.format(data_fraud.shape[0]))
print('Variables: {}'.format(data_fraud.shape[1]))

Data Points: 284807
Variables: 31


In [6]:
data_fraud.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [7]:
#To analyze the class imbalance
def percentage_fraudulent(df):
    '''Calculates the proportion of data points that are labeled as 1 (fraudulent).
       Parameter: the dataframe that has a 'Class' column
       Return: The proportion of fraudulent datapoints
    '''
    return df['Class'].mean()

In [8]:
print('Percentage of fraudulent transaction in the dataset: {}%'.format(percentage_fraudulent(data_fraud)*100))

Percentage of fraudulent transaction in the dataset: 0.1727485630620034%


Only 0.17% of the transactions in the dataset are fraudulent. The dataset is extremely imbalanced, and we should address the imbalance before estimating a model; otherwise most algorithms will predict 0 for almost every transaction.
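
To see why accuracy alone is misleading here, consider a trivial model that labels every transaction as legitimate. The sketch below uses only the two figures printed above (284807 rows and a fraud rate of 0.1727485630620034%); the helper name `accuracy_of_all_negative` is hypothetical and only serves the illustration.

```python
def accuracy_of_all_negative(n_total, fraud_rate):
    """Accuracy of a classifier that always predicts 0 (legitimate)."""
    n_fraud = round(n_total * fraud_rate)  # positives, all of them missed
    n_legit = n_total - n_fraud            # negatives, all of them correct
    return n_fraud, n_legit / n_total

# Figures taken from the notebook output above
n_fraud, baseline_accuracy = accuracy_of_all_negative(284807, 0.001727485630620034)
print('Fraudulent rows implied by the rate: {}'.format(n_fraud))
print('Accuracy of always predicting 0: {:.4f}'.format(baseline_accuracy))
```

Such a model would be right more than 99.8% of the time while detecting no fraud at all, so accuracy cannot be the only yardstick.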

In [9]:
from sklearn.model_selection import train_test_split
train_X,test_x,train_y,test_y=train_test_split(data_fraud.drop('Class',axis=1),data_fraud['Class'],random_state=1,train_size=0.7)
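
With so few positives, a purely random split can leave the train and test sets with slightly different fraud rates. scikit-learn's `train_test_split` accepts a `stratify` argument for this purpose (for example `stratify=data_fraud['Class']`). The idea can be sketched in plain Python as follows; `stratified_split` and the toy labels are hypothetical and are not part of the notebook's pipeline.

```python
import random

def stratified_split(labels, train_size=0.7, seed=1):
    """Return (train_idx, test_idx) with every class split in the same proportion."""
    rng = random.Random(seed)
    # Group row positions by class label
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    # Shuffle and cut each class separately so both parts keep the class ratio
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_size)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: 10 fraudulent rows among 1000 (not the real dataset)
toy_labels = [1] * 10 + [0] * 990
train_idx, test_idx = stratified_split(toy_labels)
train_pos = sum(toy_labels[i] for i in train_idx)
test_pos = sum(toy_labels[i] for i in test_idx)
print('Train: {} positives out of {}'.format(train_pos, len(train_idx)))
print('Test: {} positives out of {}'.format(test_pos, len(test_idx)))
```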

To estimate a logistic regression with SageMaker, we use the built-in LinearLearner algorithm.

In [10]:
from sagemaker import LinearLearner

prefix='fraud_detection'
output_path='s3://{}/{}'.format(bucket, prefix)

linear_learner=LinearLearner(role=sagemaker_role,
                            instance_count=1,
                            instance_type='ml.c4.xlarge',
                            predictor_type='binary_classifier',
                            output_path=output_path,
                            sagemaker_session=sagemaker_session)


In [11]:
data_record=linear_learner.record_set(train_X.to_numpy('float32'),labels=train_y.to_numpy('float32'))

In [12]:
data_record

(<class 'sagemaker.amazon.amazon_estimator.RecordSet'>, {'s3_data': 's3://sagemaker-us-east-1-095482984955/sagemaker-record-sets/LinearLearner-2021-02-23-02-27-54-742/.amazon.manifest', 'feature_dim': 30, 'num_records': 199364, 's3_data_type': 'ManifestFile', 'channel': 'train'})

In [13]:
%%time
linear_learner.fit(data_record)

Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.
Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.


2021-02-23 02:28:01 Starting - Starting the training job...
2021-02-23 02:28:25 Starting - Launching requested ML instancesProfilerReport-1614047281: InProgress
.........
2021-02-23 02:29:46 Starting - Preparing the instances for training.........
2021-02-23 02:31:29 Downloading - Downloading input data
2021-02-23 02:31:29 Training - Downloading the training image...
2021-02-23 02:31:49 Training - Training image download completed. Training in progress.
Docker entrypoint called with argument(s): train
Running default environment configuration script
[02/23/2021 02:31:56 INFO 140529613989696] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-input.json: {u'loss_insensitivity': u'0.01', u'epochs': u'15', u'feature_dim': u'auto', u'init_bias': u'0.0', u'lr_scheduler_factor': u'auto', u'num_calibration_samples': u'10000000', u'accuracy_top_k': u'3', u'_num_kv_servers': u'auto', u'use_bias': u'true', u'num_point_for_

Once the model is trained, we should deploy it so that we can make predictions on the test data.

In [14]:
%%time
predictor=linear_learner.deploy(initial_instance_count=1,
                     instance_type='ml.t2.medium')

Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.


-----------------------!
CPU times: user 401 ms, sys: 13.9 ms, total: 415 ms
Wall time: 11min 35s


In [15]:
test_record=test_x.to_numpy('float32')
print(predictor.predict(test_record[0]))

[label {
  key: "predicted_label"
  value {
    float32_tensor {
      values: 0.0
    }
  }
}
label {
  key: "score"
  value {
    float32_tensor {
      values: 0.0002751190622802824
    }
  }
}
]


In [16]:
list_predictions=[predictor.predict(batch) for batch in np.array_split(test_record,100)]

In [17]:
np.concatenate([[i.label['predicted_label'].float32_tensor.values[0] for i in chunks] for chunks in list_predictions])

array([0., 0., 0., ..., 0., 0., 0.])

In [18]:
list_predictions[0][1]

label {
  key: "predicted_label"
  value {
    float32_tensor {
      values: 0.0
    }
  }
}
label {
  key: "score"
  value {
    float32_tensor {
      values: 0.00016928077093325555
    }
  }
}

In [19]:
predictor.predict(test_record[0:2])[0].label['predicted_label'].float32_tensor.values[0]

0.0

In [20]:
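def evaluation_metrics(test_input,test_label,predictor):
    '''Evaluates a deployed predictor against labeled test data.
       Parameters: test_input (float32 array of features), test_label (0/1 labels),
                   predictor (deployed LinearLearner predictor)
       Return: dictionary with the confusion-matrix counts, precision, recall and accuracy
    '''
    #Request the predictions in 100 batches instead of sending the whole test set at once
    list_predictions=[predictor.predict(batch) for batch in np.array_split(test_input,100)]
    predictions_test=np.concatenate([[i.label['predicted_label'].float32_tensor.values[0] for i in chunks] for chunks in list_predictions])
    
    #Confusion-matrix counts (1 = fraudulent, 0 = legitimate)
    tp = np.logical_and(test_label, predictions_test).sum()
    fp = np.logical_and(1-test_label, predictions_test).sum()
    tn = np.logical_and(1-test_label, 1-predictions_test).sum()
    fn = np.logical_and(test_label, 1-predictions_test).sum()
    
    recall = tp / (tp + fn)       #share of actual frauds that were detected
    precision = tp / (tp + fp)    #share of flagged transactions that were actually fraudulent
    accuracy = (tp + tn) / (tp + fp + tn + fn)
           
    return {'TP': tp, 'FP': fp, 'FN': fn, 'TN': tn, 
            'Precision': precision, 'Recall': recall, 'Accuracy': accuracy}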

In [21]:
evaluation_metrics(test_record,test_y,predictor)

{'TP': 99,
 'FP': 26,
 'FN': 36,
 'TN': 85282,
 'Precision': 0.792,
 'Recall': 0.7333333333333333,
 'Accuracy': 0.9992743700478681}
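
The reported figures can be re-derived from the four counts alone, which also makes the cost of the missed frauds explicit. The sketch below is plain arithmetic on the output above; the helper `summarize` and the `miss_rate` key are hypothetical names introduced for this illustration.

```python
def summarize(tp, fp, fn, tn):
    """Precision, recall, accuracy and miss rate from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        'precision': tp / (tp + fp),
        'recall': tp / (tp + fn),
        'accuracy': (tp + tn) / total,
        'miss_rate': fn / (tp + fn),  # share of actual frauds labeled as legitimate
    }

# Counts reported for the first model
first_model = summarize(tp=99, fp=26, fn=36, tn=85282)
for name, value in first_model.items():
    print('{}: {:.4f}'.format(name, value))
```

Accuracy is above 99.9%, yet 36 of the 135 fraudulent test transactions (about 27%) go undetected.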

In [22]:
boto3.client('sagemaker').delete_endpoint(EndpointName=predictor.endpoint_name)


{'ResponseMetadata': {'RequestId': 'd2566912-518d-4166-bf23-45fb1153a09a',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'd2566912-518d-4166-bf23-45fb1153a09a',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Tue, 23 Feb 2021 02:46:17 GMT'},
  'RetryAttempts': 0}}

# As the task is to detect fraudulent transactions, recall matters more than accuracy, because we want to avoid false negatives (transactions labeled as legitimate when they were actually fraudulent)

The SageMaker Linear Learner algorithm has a parameter, binary_classifier_model_selection_criteria, that sets the metric used to pick the best-performing model. According to the documentation, when predictor_type is set to binary_classifier, it is the model evaluation criterion for the validation set (or for the training set if no validation set is provided).
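
One way to picture a criterion such as precision at a target recall is as a choice of score threshold: lower the threshold until the required share of frauds is caught, then report the precision at that point. The sketch below illustrates that idea on toy scores. It is not SageMaker's implementation, and the function name, scores, and labels are all hypothetical.

```python
def precision_at_target_recall(scores, labels, target_recall):
    """Return (threshold, precision, recall) for the highest threshold whose
    recall reaches target_recall. A row is predicted 1 if score >= threshold."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError('at least one positive label is required')
    ranked = sorted(zip(scores, labels), reverse=True)
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Only evaluate once all rows sharing this score have been counted
        last_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if last_of_tie and tp / n_pos >= target_recall:
            return score, tp / (tp + fp), tp / n_pos
    return None

# Toy data: 4 positives among 8 rows
toy_scores = [0.95, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
toy_labels = [1, 1, 0, 1, 0, 0, 1, 0]
at_75 = precision_at_target_recall(toy_scores, toy_labels, 0.75)
at_100 = precision_at_target_recall(toy_scores, toy_labels, 1.0)
print(at_75)
print(at_100)
```

In the toy data, raising the target recall from 0.75 to 1.0 forces a lower threshold and costs precision, which is the trade-off to expect when a model is selected this way.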

In [28]:
linear_recall = LinearLearner(role=sagemaker_role,
                              instance_count=1, 
                              instance_type='ml.c4.xlarge',
                              predictor_type='binary_classifier',
                              output_path=output_path,
                              sagemaker_session=sagemaker_session,
                              epochs=15,
                              binary_classifier_model_selection_criteria='precision_at_target_recall',
                              target_recall=0.9)


In [29]:
linear_recall.fit(data_record)

Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.
Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.


2021-02-23 03:06:40 Starting - Starting the training job...
2021-02-23 03:07:03 Starting - Launching requested ML instancesProfilerReport-1614049599: InProgress
......
2021-02-23 03:08:03 Starting - Preparing the instances for training.........
2021-02-23 03:09:30 Downloading - Downloading input data...
2021-02-23 03:10:05 Training - Training image download completed. Training in progress..
Docker entrypoint called with argument(s): train
Running default environment configuration script
[02/23/2021 03:10:09 INFO 139715668854592] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-input.json: {u'loss_insensitivity': u'0.01', u'epochs': u'15', u'feature_dim': u'auto', u'init_bias': u'0.0', u'lr_scheduler_factor': u'auto', u'num_calibration_samples': u'10000000', u'accuracy_top_k': u'3', u'_num_kv_servers': u'auto', u'use_bias': u'true', u'num_point_for_scaler': u'10000', u'_log_level': u'info', u'quantile': u'0.5', 

In [30]:
predictor_recall=linear_recall.deploy(instance_type='ml.t2.medium',initial_instance_count=1)

Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.


--------------------!

In [31]:
evaluation_metrics(test_record,test_y,predictor_recall)

{'TP': 115,
 'FP': 991,
 'FN': 20,
 'TN': 84317,
 'Precision': 0.10397830018083183,
 'Recall': 0.8518518518518519,
 'Accuracy': 0.988167550296689}

In [34]:
boto3.client('sagemaker').delete_endpoint(EndpointName=predictor_recall.endpoint_name)


{'ResponseMetadata': {'RequestId': '52179924-6753-4fd0-a14d-957a6f290280',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '52179924-6753-4fd0-a14d-957a6f290280',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Tue, 23 Feb 2021 03:27:11 GMT'},
  'RetryAttempts': 0}}
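
The next estimator also sets positive_example_weight_mult='balanced'. As we understand the Linear Learner documentation, this asks the algorithm to weight positive examples so that errors on both classes have an equal impact on the training loss. A rough sketch of such a weight follows, assuming it is simply the ratio of negative to positive examples (an assumption for illustration, not a confirmed detail of the algorithm). It uses the whole-dataset counts implied by the figures printed earlier (284807 rows, 0.1727485630620034% of them fraudulent, that is 492 rows); `balanced_positive_weight` is a hypothetical name.

```python
def balanced_positive_weight(n_total, n_positive):
    """Weight per positive example that equalizes the total weight of both classes.
    Assumes each negative example has weight 1."""
    n_negative = n_total - n_positive
    return n_negative / n_positive

weight = balanced_positive_weight(n_total=284807, n_positive=492)
print('Weight per fraudulent example: {:.1f}'.format(weight))
# Check: the positives now carry the same total weight as the negatives
print('Total positive weight: {:.0f}'.format(492 * weight))
print('Total negative weight: {}'.format(284807 - 492))
```

Under this assumption, each fraudulent example counts several hundred times more than a legitimate one during training.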

In [36]:
linear_balanced = LinearLearner(role=sagemaker_role,
                                instance_count=1, 
                                instance_type='ml.c4.xlarge',
                                predictor_type='binary_classifier',
                                output_path=output_path,
                                sagemaker_session=sagemaker_session,
                                epochs=15,
                                binary_classifier_model_selection_criteria='precision_at_target_recall',
                                target_recall=0.9,
                                positive_example_weight_mult='balanced')


In [37]:
linear_balanced.fit(data_record)

Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.
Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.


2021-02-23 03:28:27 Starting - Starting the training job...
2021-02-23 03:28:50 Starting - Launching requested ML instancesProfilerReport-1614050906: InProgress
......
2021-02-23 03:29:51 Starting - Preparing the instances for training.........
2021-02-23 03:31:14 Downloading - Downloading input data
2021-02-23 03:31:14 Training - Downloading the training image..
Docker entrypoint called with argument(s): train
Running default environment configuration script
[02/23/2021 03:31:38 INFO 140417266886464] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-input.json: {u'loss_insensitivity': u'0.01', u'epochs': u'15', u'feature_dim': u'auto', u'init_bias': u'0.0', u'lr_scheduler_factor': u'auto', u'num_calibration_samples': u'10000000', u'accuracy_top_k': u'3', u'_num_kv_servers': u'auto', u'use_bias': u'true', u'num_point_for_scaler': u'10000', u'_log_level': u'info', u'quantile': u'0.5', u'bias_lr_mult': u'auto', u'

In [38]:
balanced=linear_balanced.deploy(instance_type='ml.t2.medium',initial_instance_count=1)

Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.


---------------------!

In [39]:
evaluation_metrics(test_record,test_y,balanced)

{'TP': 117,
 'FP': 679,
 'FN': 18,
 'TN': 84629,
 'Precision': 0.14698492462311558,
 'Recall': 0.8666666666666667,
 'Accuracy': 0.9918425148929696}
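
To compare the three models on the same terms, the sketch below recomputes recall, false positive rate, and the number of flagged transactions per fraud caught from the confusion matrices reported above. The dictionary keys and the helper `compare` are hypothetical labels for this illustration; the counts are copied from the notebook outputs.

```python
# Confusion-matrix counts copied from the three evaluation outputs
results = {
    'default': dict(tp=99, fp=26, fn=36, tn=85282),
    'target_recall': dict(tp=115, fp=991, fn=20, tn=84317),
    'target_recall_balanced': dict(tp=117, fp=679, fn=18, tn=84629),
}

def compare(tp, fp, fn, tn):
    """Recall, false positive rate, and flagged transactions per fraud caught."""
    return {
        'recall': tp / (tp + fn),
        'false_positive_rate': fp / (fp + tn),
        'alerts_per_fraud_caught': (tp + fp) / tp,  # equals 1 / precision
    }

comparison = {name: compare(**counts) for name, counts in results.items()}
for name, metrics in comparison.items():
    print('{}: recall={:.3f}, FPR={:.4f}, alerts per fraud caught={:.2f}'.format(
        name, metrics['recall'], metrics['false_positive_rate'],
        metrics['alerts_per_fraud_caught']))
```

On this test set the balanced model catches the most frauds (117 of 135) while raising fewer false alarms than the unweighted target-recall model. Neither model reaches the 0.9 target recall on the test data, which is plausible because the criterion was applied to the training set (no validation set was provided).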