# Predicting Money Laundering

In this activity, you will gain hands-on experience deploying a machine learning model to the cloud to predict whether a cash-in or transfer bank transaction is a potential case of money laundering.

## Instructions

In [1]:
# Initial imports
import pandas as pd
from pathlib import Path
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split

### Load the Data into Pandas

In [2]:
# Load the CSV data into a DataFrame
# YOUR CODE GOES HERE!

# Display sample data
# YOUR CODE GOES HERE!

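| | typeofaction | sourceid | destinationid | amountofmoney | year | month | day | dayofweek | isfraud |
|---|---|---|---|---|---|---|---|---|---|
| 0 | cash-in | 30105 | 28942 | 494528 | 2019 | 7 | 19 | 4 | 1 |
| 1 | cash-in | 30105 | 8692 | 494528 | 2019 | 5 | 17 | 4 | 1 |
| 2 | cash-in | 30105 | 60094 | 494528 | 2019 | 7 | 20 | 5 | 1 |
| 3 | cash-in | 30105 | 20575 | 494528 | 2019 | 7 | 3 | 2 | 1 |
| 4 | cash-in | 30105 | 45938 | 494528 | 2019 | 5 | 26 | 6 | 1 |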
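
The loading step can be sketched as follows. The real file path is not given in this notebook, so this example reads a hypothetical in-memory CSV built from a few of the sample rows; in practice `pd.read_csv` would receive a `Path` pointing at the dataset file. The name `sample_df` is an assumption used only for illustration.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for the real dataset file.
csv_text = """typeofaction,sourceid,destinationid,amountofmoney,year,month,day,dayofweek,isfraud
cash-in,30105,28942,494528,2019,7,19,4,1
cash-in,30105,8692,494528,2019,5,17,4,1
cash-in,30105,60094,494528,2019,7,20,5,1
"""

# read_csv accepts a file path or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))

# Display the first rows and the inferred column types.
print(sample_df.head())
print(sample_df.dtypes)
```

Checking `dtypes` right after loading is a quick way to confirm that `typeofaction` is the only non-numeric column before encoding.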


## Preprocess Data

### Encode Categorical Data

Since the `typeofaction` column contains categorical data, use the `OneHotEncoder` class from Scikit-learn to transform this column's categories into a numerical representation.

> **Hint:** You can review how to use `OneHotEncoder` in [this article from the Scikit-learn User Guide](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features). You can also review Lesson 3 in Module 13 for additional information.
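
As a rough illustration of what one-hot encoding produces, the sketch below encodes a toy `typeofaction` column. The toy data and the names `toy_df` and `encoded_toy_df` are assumptions made for this example. It builds the column names from `enc.categories_` and converts the default sparse output with `.toarray()`, so it does not depend on the keyword arguments that changed between Scikit-learn releases.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data with the two categories seen in this dataset.
toy_df = pd.DataFrame({"typeofaction": ["cash-in", "transfer", "cash-in"]})

# The default encoder returns a sparse matrix; toarray() makes it dense.
enc = OneHotEncoder()
encoded = enc.fit_transform(toy_df[["typeofaction"]]).toarray()

# categories_ holds the sorted categories found for each input column.
columns = [f"typeofaction_{category}" for category in enc.categories_[0]]
encoded_toy_df = pd.DataFrame(encoded, columns=columns)

print(encoded_toy_df)
```

Each row has exactly one `1.0`, placed in the column that matches the row's original category.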

In [3]:
# Create a OneHotEncoder instance
# (Scikit-learn 1.2 and later rename the `sparse` argument to `sparse_output`)
enc = OneHotEncoder(sparse=False)

In [4]:
# Create a list of the columns with categorical variables
categorical_variables = # YOUR CODE GOES HERE!

In [5]:
# Use the fit_transform function from the OneHotEncoder to encode the data
encoded_data = # YOUR CODE GOES HERE!

In [6]:
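# Create a DataFrame with the encoded variables
# (Scikit-learn 1.0 and later provide `get_feature_names_out`;
# `get_feature_names` was removed in version 1.2)
encoded_df = pd.DataFrame(
    encoded_data,
    columns=enc.get_feature_names(categorical_variables)
)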

# Display sample data
encoded_df.head()

| | typeofaction_cash-in | typeofaction_transfer |
|---|---|---|
| 0 | 1.0 | 0.0 |
| 1 | 1.0 | 0.0 |
| 2 | 1.0 | 0.0 |
| 3 | 1.0 | 0.0 |
| 4 | 1.0 | 0.0 |


In [7]:
# Drop the 'typeofaction' column from the original DataFrame
# YOUR CODE GOES HERE!

# Display sample data
# YOUR CODE GOES HERE!

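| | sourceid | destinationid | amountofmoney | year | month | day | dayofweek | isfraud |
|---|---|---|---|---|---|---|---|---|
| 0 | 30105 | 28942 | 494528 | 2019 | 7 | 19 | 4 | 1 |
| 1 | 30105 | 8692 | 494528 | 2019 | 5 | 17 | 4 | 1 |
| 2 | 30105 | 60094 | 494528 | 2019 | 7 | 20 | 5 | 1 |
| 3 | 30105 | 20575 | 494528 | 2019 | 7 | 3 | 2 | 1 |
| 4 | 30105 | 45938 | 494528 | 2019 | 5 | 26 | 6 | 1 |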


Using the encoded data from the `typeofaction` column, we will add a new column called `operationtype` to the original DataFrame, where `1` represents a cash-in operation and `0` represents a transfer.

In [8]:
# Add the encoded 'typeofaction' data to the original DataFrame
df["operationtype"] = encoded_df["typeofaction_cash-in"]

# Display sample data
df.head()

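| | sourceid | destinationid | amountofmoney | year | month | day | dayofweek | isfraud | operationtype |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 30105 | 28942 | 494528 | 2019 | 7 | 19 | 4 | 1 | 1.0 |
| 1 | 30105 | 8692 | 494528 | 2019 | 5 | 17 | 4 | 1 | 1.0 |
| 2 | 30105 | 60094 | 494528 | 2019 | 7 | 20 | 5 | 1 | 1.0 |
| 3 | 30105 | 20575 | 494528 | 2019 | 7 | 3 | 2 | 1 | 1.0 |
| 4 | 30105 | 45938 | 494528 | 2019 | 5 | 26 | 6 | 1 | 1.0 |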
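
The drop-and-replace step can be sketched on toy data as follows. The frames `toy_df` and `toy_encoded_df` are hypothetical stand-ins for the notebook's DataFrames. Because the column has only two categories, a single indicator column is enough to carry the information.

```python
import pandas as pd

# Hypothetical stand-ins for the original and encoded DataFrames.
toy_df = pd.DataFrame({
    "typeofaction": ["cash-in", "transfer", "cash-in"],
    "amountofmoney": [494528, 1200, 30500],
})
toy_encoded_df = pd.DataFrame({
    "typeofaction_cash-in": [1.0, 0.0, 1.0],
    "typeofaction_transfer": [0.0, 1.0, 0.0],
})

# Drop the text column, then add the cash-in indicator in its place.
toy_df = toy_df.drop(columns=["typeofaction"])
toy_df["operationtype"] = toy_encoded_df["typeofaction_cash-in"]

print(toy_df)
```

The assignment aligns rows by index, so this pattern assumes both frames share the same default index, as they do when the encoded frame is built directly from the original one.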


### Create the features and target sets

The features set consists of all the columns from the original DataFrame except `isfraud`, which is the target.

In [9]:
# Creating the features set X
# YOUR CODE GOES HERE!

# Display sample data
# YOUR CODE GOES HERE!

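| | sourceid | destinationid | amountofmoney | year | month | day | dayofweek | operationtype |
|---|---|---|---|---|---|---|---|---|
| 0 | 30105 | 28942 | 494528 | 2019 | 7 | 19 | 4 | 1.0 |
| 1 | 30105 | 8692 | 494528 | 2019 | 5 | 17 | 4 | 1.0 |
| 2 | 30105 | 60094 | 494528 | 2019 | 7 | 20 | 5 | 1.0 |
| 3 | 30105 | 20575 | 494528 | 2019 | 7 | 3 | 2 | 1.0 |
| 4 | 30105 | 45938 | 494528 | 2019 | 5 | 26 | 6 | 1.0 |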


In [10]:
# Creating the target set y
# YOUR CODE GOES HERE!

# Display sample data
# YOUR CODE GOES HERE!

0    1
1    1
2    1
3    1
4    1
Name: isfraud, dtype: int64
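
One common way to separate features from the target is sketched below on a toy frame. The names `toy_df`, `X_toy`, and `y_toy` are assumptions for illustration, and other approaches (for example, selecting columns by name) would work equally well.

```python
import pandas as pd

# Toy frame with two feature columns and the target column.
toy_df = pd.DataFrame({
    "amountofmoney": [494528, 1200, 30500],
    "operationtype": [1.0, 0.0, 1.0],
    "isfraud": [1, 0, 1],
})

# drop() returns a new DataFrame, so toy_df itself keeps the target column.
X_toy = toy_df.drop(columns=["isfraud"])

# Selecting a single column yields a Series, like the output shown above.
y_toy = toy_df["isfraud"]

print(X_toy.head())
print(y_toy.head())
```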

### Split the features and target sets into training and testing datasets

In [11]:
# Split the preprocessed data into a training and testing dataset
X_train, X_test, y_train, y_test = # YOUR CODE GOES HERE!
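
A minimal sketch of the split on toy arrays follows. The `random_state` and `stratify` arguments are assumptions added here for reproducibility and class balance; the notebook does not say which settings were used. With the default settings, `train_test_split` holds out 25% of the rows for testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 8 samples, 2 features, balanced binary labels.
X_toy = np.arange(16).reshape(8, 2)
y_toy = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=1, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
```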

### Use Scikit-learn’s `StandardScaler` to scale the features data

In [12]:
# Create a StandardScaler instance
scaler = StandardScaler()

# Fit the StandardScaler
X_scaler = scaler.fit(X_train)

# Scale the data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
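
To see why the scaler is fitted on the training data only, consider the toy sketch below (the arrays are made up for illustration). The scaler learns each column's mean and standard deviation from the training rows and then applies the same transformation to the test rows, so no information from the test set leaks into preprocessing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training and testing features (two columns each).
X_tr = np.array([[100.0, 1.0], [200.0, 0.0], [300.0, 1.0], [400.0, 0.0]])
X_te = np.array([[250.0, 1.0]])

# Learn the per-column mean and standard deviation from the training rows.
toy_scaler = StandardScaler().fit(X_tr)

# Apply the same learned statistics to both sets.
X_tr_scaled = toy_scaler.transform(X_tr)
X_te_scaled = toy_scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0))  # each column is centered near 0
print(X_te_scaled)
```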

## Create a Machine Learning Model in SageMaker Studio

### Importing the required libraries

In [13]:
# Import Amazon SageMaker libraries and modules
import sagemaker
import sagemaker.amazon.common as smac
from sagemaker import get_execution_role
from sagemaker.predictor import csv_serializer, json_deserializer

# Import AWS Python SDK
import boto3

# Import support libraries
import io
import os
import json
import numpy as np

### Configure general settings for the SageMaker model

In [14]:
# Set the S3 bucket name
bucket = # YOUR CODE GOES HERE!

In [15]:
# Set a prefix for the data files
prefix = "money-laundering"

In [16]:
# Set the IAM execution role
role = get_execution_role()

### Upload the training and testing data to Amazon S3

#### Encode and upload the training data

In [17]:
# Encode the training data as Protocol Buffer
# (this cell encodes the unscaled X_train; X_train_scaled is not used here)
buf = io.BytesIO()
vectors = np.array(X_train).astype("float32")
labels = np.array(y_train).astype("float32")
smac.write_numpy_to_dense_tensor(buf, vectors, labels)
buf.seek(0)

# Upload encoded training data to Amazon S3
key = "linear_train.data"
boto3.resource("s3").Bucket(bucket).Object(os.path.join(prefix, "train", key)).upload_fileobj(buf)
s3_train_data = "s3://{}/{}/train/{}".format(bucket, prefix, key)
print("Training data uploaded to: {}".format(s3_train_data))

Training data uploaded to: s3://fintech-bootcamp-activities-jams-2021-02-11/money-laundering/train/linear_train.data
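
The relationship between the object key and the printed S3 URI can be sketched without touching AWS. The bucket name below is a hypothetical placeholder. This sketch uses `posixpath.join`, which always joins with forward slashes as S3 keys expect; `os.path.join` gives the same result on Linux-based environments such as SageMaker Studio.

```python
import posixpath

bucket = "example-bucket"  # hypothetical placeholder, not a real bucket
prefix = "money-laundering"
key = "linear_train.data"

# The object key is the path inside the bucket.
object_key = posixpath.join(prefix, "train", key)

# The S3 URI combines the bucket name and the object key.
s3_uri = "s3://{}/{}".format(bucket, object_key)

print(object_key)
print(s3_uri)
```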


#### Encode and upload the testing data

In [18]:
# Encode the testing data as Protocol Buffer
buf = io.BytesIO()
vectors = np.array(X_test).astype("float32")
labels = np.array(y_test).astype("float32")
smac.write_numpy_to_dense_tensor(buf, vectors, labels)
buf.seek(0)

# Upload encoded testing data to Amazon S3
key = "linear_test.data"
boto3.resource("s3").Bucket(bucket).Object(os.path.join(prefix, "test", key)).upload_fileobj(buf)
s3_test_data = "s3://{}/{}/test/{}".format(bucket, prefix, key)
print("Testing data uploaded to: {}".format(s3_test_data))

Testing data uploaded to: s3://fintech-bootcamp-activities-jams-2021-02-11/money-laundering/test/linear_test.data


### Specify the Amazon SageMaker session to use

In [19]:
# Save the current session in a variable
sess = sagemaker.Session()

### Create an instance of the machine learning model

In [20]:
# Import the get_image_uri function from the sagemaker library
from sagemaker.amazon.amazon_estimator import get_image_uri

In [21]:
# Import the container image
container = get_image_uri(boto3.Session().region_name, "linear-learner")

The method get_image_uri has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.


In [22]:
# Create an instance of the machine learning model
linear = sagemaker.estimator.Estimator(
    container,
    role,
    train_instance_count=1,
    train_instance_type="ml.m4.xlarge",
    output_path="s3://{}/{}/output".format(bucket, prefix),
    sagemaker_session=sess,
)

train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


### Define Linear Learner hyperparameters

In [23]:
# Get the dimension of the feature-input vector
feature_dim = X.shape[1]

In [24]:
# Define linear learner hyperparameters
linear.set_hyperparameters(
    feature_dim=feature_dim,
    mini_batch_size=200,
    predictor_type="binary_classifier"
)

## Fit the Machine Learning Model in SageMaker Studio

Use the `fit` function of the model to train it using the training and testing data stored in the Amazon S3 bucket.

In [25]:
# Fitting the linear learner model
linear.fit({"train": s3_train_data, "test": s3_test_data})

2021-03-09 20:38:00 Starting - Starting the training job...
2021-03-09 20:38:24 Starting - Launching requested ML instancesProfilerReport-1615322280: InProgress
.........
2021-03-09 20:39:45 Starting - Preparing the instances for training.........
2021-03-09 20:41:26 Downloading - Downloading input data
2021-03-09 20:41:26 Training - Downloading the training image...
2021-03-09 20:42:01 Uploading - Uploading generated training model
2021-03-09 20:42:01 Completed - Training job completed
Docker entrypoint called with argument(s): train
Running default environment configuration script
[03/09/2021 20:41:47 INFO 140001872426816] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-input.json: {u'loss_insensitivity': u'0.01', u'epochs': u'15', u'feature_dim': u'auto', u'init_bias': u'0.0', u'lr_scheduler_factor': u'auto', u'num_calibration_samples': u'10000000', u'accuracy_top_k': u'3', u'_num_kv_servers': u'auto', u'u

## Make Predictions With the Model in SageMaker Studio

### Deploy the model

In [26]:
# Deploy an instance of the linear-learner model to create a predictor
linear_predictor = linear.deploy(initial_instance_count=1, instance_type="ml.t2.medium")

-----------------!

### Setting configurations for the predictor

In [27]:
# Linear predictor configurations
linear_predictor.serializer = csv_serializer
linear_predictor.deserializer = json_deserializer

### Make Predictions Using Testing Data

#### Use the `predict` function of the predictor to make predictions using the testing data stored in Pandas

In [28]:
# Making some predictions using the test data
model_predictions = # YOUR CODE GOES HERE!

The csv_serializer has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
The json_deserializer has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


In [29]:
# Display sample predictions
model_predictions["predictions"][:3]

[{'score': 1.0, 'predicted_label': 1},
 {'score': 1.0, 'predicted_label': 1},
 {'score': 1.0, 'predicted_label': 1}]

#### Creating a list of the predicted values

In [30]:
# Create a list with the predicted values
y_predictions = [np.uint8(value["predicted_label"]) for value in model_predictions["predictions"]]

# Transforming the list into an array
y_predictions = np.array(y_predictions)

# Display sample data
y_predictions[:10]

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=uint8)
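
The response structure shown above can be unpacked without a live endpoint. The sketch below uses a hypothetical `mock_response` dictionary shaped like the sample predictions, and it also collects the scores, which the notebook itself does not use.

```python
import numpy as np

# Hypothetical response shaped like the predictor output shown above.
mock_response = {
    "predictions": [
        {"score": 0.97, "predicted_label": 1},
        {"score": 0.12, "predicted_label": 0},
        {"score": 1.0, "predicted_label": 1},
    ]
}

# Collect the hard labels and the raw scores into separate arrays.
labels = np.array(
    [item["predicted_label"] for item in mock_response["predictions"]],
    dtype=np.uint8,
)
scores = np.array([item["score"] for item in mock_response["predictions"]])

print(labels)
print(scores)
```

Keeping the scores alongside the labels makes it possible to try a different decision threshold later without calling the endpoint again.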

## Evaluate the Machine Learning Model

Use the `classification_report` function from Scikit-learn to assess how well the model predicts potential money laundering transactions.
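
As a hedged illustration of the function's interface, the sketch below calls `classification_report` on made-up labels where every prediction is `1`. The arrays are assumptions for this example, not the notebook's data. Passing `zero_division=0` reports precision as 0 for a class that is never predicted instead of raising a warning.

```python
from sklearn.metrics import classification_report

# Made-up labels: the "model" predicts class 1 for every sample.
y_true_toy = [0, 0, 1, 1, 1]
y_pred_toy = [1, 1, 1, 1, 1]

# Text report for display.
print(classification_report(y_true_toy, y_pred_toy, zero_division=0))

# Dictionary form, convenient for reading metrics programmatically.
report_dict = classification_report(
    y_true_toy, y_pred_toy, zero_division=0, output_dict=True
)
print(report_dict["1"]["precision"])
```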

In [31]:
# Import the classification report from Scikit-learn
from sklearn.metrics import classification_report

In [32]:
# Display classification report
# YOUR CODE GOES HERE!

Classification report
              precision    recall  f1-score   support

           0       0.00      0.00      0.00       225
           1       0.62      1.00      0.76       360

    accuracy                           0.62       585
   macro avg       0.31      0.50      0.38       585
weighted avg       0.38      0.62      0.47       585
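
The report can be read as follows: a recall of 1.00 for class `1` together with a recall of 0.00 for class `0` means that the model labeled every one of the 585 test transactions as fraud. The sketch below reproduces the class `1` row and the accuracy from the support counts alone; the variable names are chosen for illustration.

```python
# Support counts taken from the report above.
n_negative = 225  # true class 0
n_positive = 360  # true class 1

# Every sample was predicted as class 1.
true_positive = n_positive
false_positive = n_negative

precision_1 = true_positive / (true_positive + false_positive)
recall_1 = true_positive / n_positive
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = true_positive / (n_positive + n_negative)

print(round(precision_1, 2), round(recall_1, 2), round(f1_1, 2), round(accuracy, 2))
```

The accuracy of 0.62 therefore equals the share of fraudulent transactions in the test set, which is what a model that always answers `1` would achieve.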





## Delete the Endpoint to Avoid Additional AWS Resource Usage and Billing

Make sure that you delete all the Amazon SageMaker endpoints to prevent unwanted charges.

In [34]:
# Delete the Amazon SageMaker endpoint
sagemaker.Session().delete_endpoint(linear_predictor.endpoint)

The endpoint attribute has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
