# Predicting Money Laundering

In this activity, you'll gain hands-on experience using SageMaker Studio to prepare data for a machine learning application that predicts whether a cash-in or transfer bank transaction is potentially a money laundering operation.

## Instructions

In [1]:
# Initial imports
import pandas as pd
from pathlib import Path
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split

### Load the Data into Pandas

In [2]:
# Load the CSV data into a DataFrame
# YOUR CODE GOES HERE!

# Display sample data
# YOUR CODE GOES HERE!

Unnamed: 0,typeofaction,sourceid,destinationid,amountofmoney,year,month,day,dayofweek,isfraud
0,cash-in,30105,28942,494528,2019,7,19,4,1
1,cash-in,30105,8692,494528,2019,5,17,4,1
2,cash-in,30105,60094,494528,2019,7,20,5,1
3,cash-in,30105,20575,494528,2019,7,3,2,1
4,cash-in,30105,45938,494528,2019,5,26,6,1
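
As a point of reference, loading CSV data with Pandas might look like the sketch below. It is not the activity's solution: the in-memory text is a hypothetical stand-in for the real file (its rows are copied from the sample output above), and the name `sample_df` is an assumption made for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file: rows copied from the sample output above
csv_text = """typeofaction,sourceid,destinationid,amountofmoney,year,month,day,dayofweek,isfraud
cash-in,30105,28942,494528,2019,7,19,4,1
cash-in,30105,8692,494528,2019,5,17,4,1
cash-in,30105,60094,494528,2019,7,20,5,1
"""

# pd.read_csv accepts a file path or any file-like object;
# with a real file, pass a Path pointing to your CSV instead of the StringIO buffer
sample_df = pd.read_csv(io.StringIO(csv_text))

# Display the first rows
print(sample_df.head())
```

If the real file was saved with its index, `pd.read_csv` would show that index as an extra `Unnamed: 0` column unless `index_col=0` is passed.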


## Preprocess Data

### Encode Categorical Data

Since the `typeofaction` column contains categorical data, use the `OneHotEncoder` class from Scikit-learn to transform this column's categories into a numerical representation.

**Hint:** You can review how to use the `OneHotEncoder` class in [this article from the Scikit-learn User Guide](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features).

In [3]:
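# Create a OneHotEncoder instance that returns a dense array
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output` and 1.4 removed `sparse`;
# on versions older than 1.2, use OneHotEncoder(sparse=False) instead
enc = OneHotEncoder(sparse_output=False)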

In [4]:
# Create a list of the columns with categorical variables
categorical_variables = # YOUR CODE GOES HERE!

In [5]:
# Use the fit_transform function from the OneHotEncoder to encode the data
encoded_data = # YOUR CODE GOES HERE!

In [6]:
# Create a DataFrame with the encoded variables
encoded_df = # YOUR CODE GOES HERE!

# Display sample data
# YOUR CODE GOES HERE!

Unnamed: 0,typeofaction_cash-in,typeofaction_transfer
0,1.0,0.0
1,1.0,0.0
2,1.0,0.0
3,1.0,0.0
4,1.0,0.0
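
The encoding steps above can be sketched on a toy column as follows. This is a minimal illustration under stated assumptions, not the activity's solution: the rows are made up, all `toy_*` names are hypothetical, and the dense array is obtained with `.toarray()` so the sketch does not depend on the `sparse`/`sparse_output` parameter name.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (made up for illustration), using the same column name as the activity
toy_df = pd.DataFrame({"typeofaction": ["cash-in", "transfer", "cash-in", "transfer"]})

# By default the encoder returns a SciPy sparse matrix; convert it to a dense array
toy_enc = OneHotEncoder()
toy_encoded = toy_enc.fit_transform(toy_df[["typeofaction"]]).toarray()

# Build one column name per category learned by the encoder
toy_columns = [f"typeofaction_{category}" for category in toy_enc.categories_[0]]

# Wrap the encoded array in a DataFrame
toy_encoded_df = pd.DataFrame(toy_encoded, columns=toy_columns)
print(toy_encoded_df)
```

Each row has exactly one `1.0`, placed in the column that matches the row's original category.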


In [7]:
# Drop the 'typeofaction' column from the original DataFrame
# YOUR CODE GOES HERE!

# Display sample data
# YOUR CODE GOES HERE!

Unnamed: 0,sourceid,destinationid,amountofmoney,year,month,day,dayofweek,isfraud
0,30105,28942,494528,2019,7,19,4,1
1,30105,8692,494528,2019,5,17,4,1
2,30105,60094,494528,2019,7,20,5,1
3,30105,20575,494528,2019,7,3,2,1
4,30105,45938,494528,2019,5,26,6,1


Using the encoded data from the `typeofaction` column, add a new column called `operationtype` to the original DataFrame, where `1` represents a cash-in operation and `0` represents a transfer. Because there are only two categories, the single `typeofaction_cash-in` column carries all the information.

In [8]:
# Add the encoded 'typeofaction' data to the original DataFrame
df["operationtype"] = # YOUR CODE GOES HERE!

# Display sample data
# YOUR CODE GOES HERE!

Unnamed: 0,sourceid,destinationid,amountofmoney,year,month,day,dayofweek,isfraud,operationtype
0,30105,28942,494528,2019,7,19,4,1,1.0
1,30105,8692,494528,2019,5,17,4,1,1.0
2,30105,60094,494528,2019,7,20,5,1,1.0
3,30105,20575,494528,2019,7,3,2,1,1.0
4,30105,45938,494528,2019,5,26,6,1,1.0
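
Dropping the original categorical column and attaching the encoded one might look like the sketch below. The toy values and all `toy_*` names are assumptions made for illustration; only the column names come from the activity.

```python
import pandas as pd

# Toy data (made up for illustration)
toy_df = pd.DataFrame({
    "typeofaction": ["cash-in", "transfer", "cash-in"],
    "amountofmoney": [100, 250, 75],
})
toy_encoded_df = pd.DataFrame({
    "typeofaction_cash-in": [1.0, 0.0, 1.0],
    "typeofaction_transfer": [0.0, 1.0, 0.0],
})

# Drop the original categorical column
toy_df = toy_df.drop(columns=["typeofaction"])

# Keep only the cash-in indicator: 1.0 means cash-in, 0.0 means transfer
toy_df["operationtype"] = toy_encoded_df["typeofaction_cash-in"]
print(toy_df)
```

Pandas aligns the assignment on the index, so both DataFrames need matching indexes (here, the default `0..n-1`).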


### Create the Features and Target Sets

The features set consists of all the columns from the original DataFrame except `isfraud`, which constitutes the target set.

In [9]:
# Create the features set X
# YOUR CODE GOES HERE!

# Display sample data
# YOUR CODE GOES HERE!

Unnamed: 0,sourceid,destinationid,amountofmoney,year,month,day,dayofweek,operationtype
0,30105,28942,494528,2019,7,19,4,1.0
1,30105,8692,494528,2019,5,17,4,1.0
2,30105,60094,494528,2019,7,20,5,1.0
3,30105,20575,494528,2019,7,3,2,1.0
4,30105,45938,494528,2019,5,26,6,1.0


In [10]:
# Create the target set y
# YOUR CODE GOES HERE!

# Display sample data
# YOUR CODE GOES HERE!

0    1
1    1
2    1
3    1
4    1
Name: isfraud, dtype: int64
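
A minimal sketch of separating features from the target, assuming a toy DataFrame with made-up values (the `toy_*` names are hypothetical):

```python
import pandas as pd

# Toy data (made up for illustration)
toy_df = pd.DataFrame({
    "amountofmoney": [100, 250, 75],
    "operationtype": [1.0, 0.0, 1.0],
    "isfraud": [1, 0, 1],
})

# Features: every column except the target
toy_X = toy_df.drop(columns=["isfraud"])

# Target: the 'isfraud' column as a Series
toy_y = toy_df["isfraud"]

print(toy_X.head())
print(toy_y.head())
```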

### Split the Features and Target Sets into Training and Testing Datasets

In [11]:
# Split the preprocessed data into a training and testing dataset
X_train, X_test, y_train, y_test = # YOUR CODE GOES HERE!
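
The split can be sketched with `train_test_split` on toy data, as below. The values, the `toy_*` names, and the choice of `random_state=1` are assumptions made for illustration; by default the function holds out 25% of the rows for testing.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy features and target (made up for illustration)
toy_X = pd.DataFrame({
    "amountofmoney": list(range(100, 900, 100)),
    "operationtype": [1.0, 0.0] * 4,
})
toy_y = pd.Series([1, 0, 1, 0, 1, 0, 1, 0], name="isfraud")

# random_state makes the shuffle reproducible
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, random_state=1
)

print(toy_X_train.shape, toy_X_test.shape)
```

For an imbalanced target such as fraud labels, passing `stratify=toy_y` is one option for keeping the class proportions similar in both splits.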

### Use Scikit-learn's `StandardScaler` to Scale the Features Data

In [12]:
# Create a StandardScaler instance
scaler = StandardScaler()

# Fit the StandardScaler
X_scaler = # YOUR CODE GOES HERE!

# Scale the data
X_train_scaled = # YOUR CODE GOES HERE!
X_test_scaled = # YOUR CODE GOES HERE!
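
The scaling pattern can be sketched on toy arrays as follows. The numbers and `toy_*` names are made up for illustration. The point of the sketch is that the scaler is fitted on the training data only and then applied to both sets, so no information from the testing data leaks into the scaling parameters.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training and testing features (made up for illustration)
toy_X_train = np.array([[100.0, 1.0], [200.0, 0.0], [300.0, 1.0], [400.0, 0.0]])
toy_X_test = np.array([[250.0, 1.0]])

# Fit the scaler on the training data only; fit() returns the fitted scaler
toy_scaler = StandardScaler()
toy_X_scaler = toy_scaler.fit(toy_X_train)

# Apply the same learned mean and standard deviation to both sets
toy_X_train_scaled = toy_X_scaler.transform(toy_X_train)
toy_X_test_scaled = toy_X_scaler.transform(toy_X_test)

print(toy_X_train_scaled.mean(axis=0))  # approximately 0 for each column
print(toy_X_train_scaled.std(axis=0))   # approximately 1 for each column
```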