<a href="https://colab.research.google.com/github/henryjhu/Anomaly-Detection-in-Wire-Activities/blob/main/DSC_680_Supervised_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **DSC-680-Z1 Research Practicum** <BR> Machine Learning

## **Project Description**
A global bank sought new and innovative means of detecting and preventing fraud in its wire transactions. Its goal was to use machine learning to arrive at scenario detection rules that could be customized with parameters specific to each cohort of customers. The immediate challenge was that the data provided had no labels, so unsupervised machine learning techniques were employed first. Once labels had been identified, supervised machine learning techniques were employed with selected tuning parameters to increase the sensitivity and specificity for both classes.

<b>Purpose:</b><br>
Carry out supervised machine learning on the sample data.<br>
<b>University Name:</b> Utica College <br>
<b>Course Name:</b> DSC-680-Z1 Research Practicum <br>
<b>Student Name:</b> Henry J. Hu <br>
<b>Program Director Name:</b> Dr. McCarthy, Michael <br>
<b>Runtime Environment:</b> Google Colab<br>
<b>Programming Language:</b> Python <br>
<b>Sample Data Frame:</b>
Six random samples of labeled international wires belonging to 139 customers from 3 continents for the entire year of 2020.<br>
<b> Last Update:</b> August 7th, 2021

## **Installing Packages Into Google Colab**

In [1]:
!pip install -U imbalanced-learn

Collecting imbalanced-learn
  Downloading imbalanced_learn-0.8.0-py3-none-any.whl (206 kB)
Collecting scikit-learn>=0.24
  Downloading scikit_learn-0.24.2-cp37-cp37m-manylinux2010_x86_64.whl (22.3 MB)
Collecting threadpoolctl>=2.0.0
  Downloading threadpoolctl-2.2.0-py3-none-any.whl (12 kB)
Installing collected packages: threadpoolctl, scikit-learn, imbalanced-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 0.22.2.post1
    Uninstalling scikit-learn-0.22.2.post1:
      Successfully uninstalled scikit-learn-0.22.2.post1
  Attempting uninstall: imbalanced-learn
    Found existing installation: imbalanced-learn 0.4.3
    Uninstalling imbalanced-learn-0.4.3:
      Successfully uninstalled imbalanced-learn-0.4.3
Successfully installed imbalanced-learn-0.8.0 scikit-learn-0.24.2 threadpoolctl-2.2.0


## **Mounting Google Drive**

In [2]:
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)

Mounted at /content/gdrive/


## **Importing Libraries**

In [3]:
# Importing libraries
import io
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import xgboost as xgb
import sklearn
import traceback
import time
import pytz
from datetime import datetime
from sklearn import metrics
from scipy.stats import rankdata
from numpy import quantile, where, random
from random import sample
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import model_selection, preprocessing
from sklearn.datasets import make_blobs
from sklearn.metrics import classification_report, accuracy_score
from sklearn.neural_network import MLPRegressor
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import LocalOutlierFactor
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score, KFold
from sklearn.model_selection import cross_validate 
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import TimeSeriesSplit
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, 
                              ExtraTreesClassifier, IsolationForest, VotingClassifier, StackingClassifier)
from imblearn.over_sampling import SMOTE # https://github.com/vsmolyakov/experiments_with_python/blob/master/chp01/imbalanced_data.ipynb

## **Importing Data into Google Colab**

In [4]:
# Importing data
NN_103_output_df = pd.read_csv("gdrive/MyDrive/DSC-380/NN_103_score_df.cvs")
NN_202_output_df = pd.read_csv("gdrive/MyDrive/DSC-380/NN_202_score_df.cvs")
EU_103_output_df = pd.read_csv("gdrive/MyDrive/DSC-380/EU_103_score_df.cvs")
EU_202_output_df = pd.read_csv("gdrive/MyDrive/DSC-380/EU_202_score_df.cvs")
AS_103_output_df = pd.read_csv("gdrive/MyDrive/DSC-380/AS_103_score_df.cvs")
AS_202_output_df = pd.read_csv("gdrive/MyDrive/DSC-380/AS_202_score_df.cvs")

## **Re-Sampling**

In [5]:
NN_103_output_df = NN_103_output_df.sample(n=100000, random_state=42)
NN_202_output_df = NN_202_output_df.sample(n=100000, random_state=42)
EU_103_output_df = EU_103_output_df.sample(n=100000, random_state=42)
EU_202_output_df = EU_202_output_df.sample(n=100000, random_state=42)
AS_103_output_df = AS_103_output_df.sample(n=100000, random_state=42)
AS_202_output_df = AS_202_output_df.sample(n=100000, random_state=42)

## **Preparation of Training and Test Data Sets**

In [6]:
NN_103_output_df_x_train, NN_103_output_df_x_test, NN_103_output_df_y_train, NN_103_output_df_y_test = train_test_split(NN_103_output_df[['TRXN_MONTH','TRANSACTION_AMOUNT']],NN_103_output_df[['FRAUD_LABEL']],train_size=0.7, test_size=0.3, shuffle=True, random_state=42)

In [7]:
NN_202_output_df_x_train, NN_202_output_df_x_test, NN_202_output_df_y_train, NN_202_output_df_y_test = train_test_split(NN_202_output_df[['TRXN_MONTH','TRANSACTION_AMOUNT']],NN_202_output_df[['FRAUD_LABEL']],train_size=0.7, test_size=0.3, shuffle=True, random_state=42)

In [8]:
EU_103_output_df_x_train, EU_103_output_df_x_test, EU_103_output_df_y_train, EU_103_output_df_y_test = train_test_split(EU_103_output_df[['TRXN_MONTH','TRANSACTION_AMOUNT']],EU_103_output_df[['FRAUD_LABEL']],train_size=0.7, test_size=0.3, shuffle=True, random_state=42)

In [9]:
EU_202_output_df_x_train, EU_202_output_df_x_test, EU_202_output_df_y_train, EU_202_output_df_y_test = train_test_split(EU_202_output_df[['TRXN_MONTH','TRANSACTION_AMOUNT']],EU_202_output_df[['FRAUD_LABEL']],train_size=0.7, test_size=0.3, shuffle=True, random_state=42)

In [10]:
AS_103_output_df_x_train, AS_103_output_df_x_test, AS_103_output_df_y_train, AS_103_output_df_y_test = train_test_split(AS_103_output_df[['TRXN_MONTH','TRANSACTION_AMOUNT']],AS_103_output_df[['FRAUD_LABEL']],train_size=0.7, test_size=0.3, shuffle=True, random_state=42)

In [11]:
AS_202_output_df_x_train, AS_202_output_df_x_test, AS_202_output_df_y_train, AS_202_output_df_y_test = train_test_split(AS_202_output_df[['TRXN_MONTH','TRANSACTION_AMOUNT']],AS_202_output_df[['FRAUD_LABEL']],train_size=0.7, test_size=0.3, shuffle=True, random_state=42)
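The six splits above repeat the same call with the same settings. As a minimal sketch, the pattern can be written once as a loop over a dictionary of data frames. The frames below are synthetic stand-ins built by a hypothetical helper `make_fake_frame`, and the `stratify` argument is an added assumption (it keeps the fraud rate similar in both splits); neither is part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

FEATURES = ["TRXN_MONTH", "TRANSACTION_AMOUNT"]
TARGET = "FRAUD_LABEL"

def make_fake_frame(seed, n=1000):
    """Hypothetical helper: synthetic data with the notebook's column names."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "TRXN_MONTH": rng.integers(1, 13, size=n),
        "TRANSACTION_AMOUNT": rng.lognormal(8.0, 1.0, size=n),
        "FRAUD_LABEL": (rng.random(n) < 0.05).astype(int),
    })

# Stand-ins for the per-region, per-message-type data frames
frames = {"NN_103": make_fake_frame(1), "EU_103": make_fake_frame(2)}

splits = {}
for name, df in frames.items():
    x_train, x_test, y_train, y_test = train_test_split(
        df[FEATURES], df[TARGET],
        train_size=0.7, test_size=0.3,
        shuffle=True, random_state=42,
        stratify=df[TARGET],  # assumption: preserve the class ratio in each split
    )
    splits[name] = {"x_train": x_train, "x_test": x_test,
                    "y_train": y_train, "y_test": y_test}

print({name: s["x_train"].shape for name, s in splits.items()})
```

Keeping the splits in one dictionary also makes it straightforward to apply later steps to every cohort in a single loop.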

## **Over Sampling the Minority Class Using SMOTE**

In [12]:
# Initialize SMOTE
sm = SMOTE(random_state=42, n_jobs=-1)

# Apply SMOTE to North America
NN_103_sm_output_df_x_train, NN_103_sm_output_df_y_train = sm.fit_resample(NN_103_output_df_x_train, NN_103_output_df_y_train)
NN_103_sm_output_df_x_test, NN_103_sm_output_df_y_test = sm.fit_resample(NN_103_output_df_x_test, NN_103_output_df_y_test)
NN_202_sm_output_df_x_train, NN_202_sm_output_df_y_train = sm.fit_resample(NN_202_output_df_x_train, NN_202_output_df_y_train)
NN_202_sm_output_df_x_test, NN_202_sm_output_df_y_test = sm.fit_resample(NN_202_output_df_x_test, NN_202_output_df_y_test)

#Apply SMOTE to Europe
EU_103_sm_output_df_x_train, EU_103_sm_output_df_y_train = sm.fit_resample(EU_103_output_df_x_train, EU_103_output_df_y_train)
EU_103_sm_output_df_x_test, EU_103_sm_output_df_y_test = sm.fit_resample(EU_103_output_df_x_test, EU_103_output_df_y_test)
EU_202_sm_output_df_x_train, EU_202_sm_output_df_y_train = sm.fit_resample(EU_202_output_df_x_train, EU_202_output_df_y_train)
EU_202_sm_output_df_x_test, EU_202_sm_output_df_y_test = sm.fit_resample(EU_202_output_df_x_test, EU_202_output_df_y_test)

#Apply SMOTE to Asia
AS_103_sm_output_df_x_train, AS_103_sm_output_df_y_train = sm.fit_resample(AS_103_output_df_x_train, AS_103_output_df_y_train)
AS_103_sm_output_df_x_test, AS_103_sm_output_df_y_test = sm.fit_resample(AS_103_output_df_x_test, AS_103_output_df_y_test)
AS_202_sm_output_df_x_train, AS_202_sm_output_df_y_train = sm.fit_resample(AS_202_output_df_x_train, AS_202_output_df_y_train)
AS_202_sm_output_df_x_test, AS_202_sm_output_df_y_test = sm.fit_resample(AS_202_output_df_x_test, AS_202_output_df_y_test)

# unique, count = np.unique(NN_103_sm_output_df_y_train, return_counts=True)
# y_train_ct = {k:v for (k,v) in zip(unique, count)}
# y_train_ct
# NN_103_sm_output_df_x_train
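To illustrate what the oversampling step does, the sketch below generates synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbors, which is the core idea behind SMOTE. The function `smote_like` is a hypothetical, simplified stand-in written with NumPy only; it is not imbalanced-learn's implementation, and the data are synthetic.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=42):
    """Simplified SMOTE-style sampler (illustrative only, not imblearn's code)."""
    X_min = np.asarray(X_min, dtype=float)
    if len(X_min) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)           # a point is not its own neighbor
    k = min(k, len(X_min) - 1)
    neighbors = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)         # sample to start from
    picked = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                            # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Synthetic imbalanced training data: 95 majority rows, 5 minority rows
rng = np.random.default_rng(0)
X_major = rng.normal(0.0, 1.0, size=(95, 2))
X_minor = rng.normal(3.0, 0.5, size=(5, 2))

X_new = smote_like(X_minor, n_new=len(X_major) - len(X_minor))
X_bal = np.vstack([X_major, X_minor, X_new])
y_bal = np.concatenate([np.zeros(95, dtype=int), np.ones(5 + len(X_new), dtype=int)])

print(np.bincount(y_bal))  # class counts after balancing
```

In many workflows only the training split is resampled, so that metrics on the test split reflect the original class balance.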

## **Cross Validation Function**

In [13]:
##############################################################################################
#
# Purpose:
# Through cross validation, performs model fitting, training, and prediction for
# Random Forest and Artificial Neural Networks
#
##############################################################################################

In [14]:
def cross_val_train (clf_n, X_train, Y_train):

  # Minimal logger that prefixes each message with an Eastern Time timestamp
  class log:
    def_tz = pytz.timezone('America/New_York')

    @staticmethod
    def info(text):
      print(f'{datetime.now(log.def_tz).replace(microsecond=0)} : {text}')

  # Y train has to be a 1D array
  Y_train = np.ravel(Y_train)

  # Scale and center the data around the mean of 0
  scaler = preprocessing.StandardScaler().fit(X_train)
  X_train = scaler.transform(X_train)

  # Random Seed
  random.seed(42)

  clf1=MLPClassifier()
  clf2=RandomForestClassifier()

  # Best tuning hyperparameters were determined with two random iterations
  # Data frames used: NN_202_output_df_x_train, NN_202_output_df_y_train
  clf1_parameter_space = {
      'warm_start': [True],
      'max_iter': [10000],
      'hidden_layer_sizes': [(4,8),(8,16),(16,32)],
      'activation': ['logistic'],
      'solver': ['lbfgs'], # learning_rate_init is only used when solver is 'sgd' or 'adam'
      'alpha': [0.05,0.1,1.5,2.0,2.5], # A poorly chosen alpha makes every prediction the same label
      'learning_rate': ['adaptive'], # Only used when solver is 'sgd'
      'random_state': [42]
  }
  # Best tuning hyperparameters were determined with two random iterations
  # Data frames used: NN_202_output_df_x_train, NN_202_output_df_y_train
  clf2_parameter_space = {
      'bootstrap': [True, False],
      'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
      'max_features': ['auto', 'sqrt'],
      'min_samples_leaf': [1, 2, 4],
      'min_samples_split': [2, 5, 10],
      'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]
  }

  if clf_n == 1:
    model = clf1
    model_name = 'Artificial Neuro Networks'
    clf_parameter_space = clf1_parameter_space
  else: 
    model = clf2
    model_name = 'Random Forest'
    clf_parameter_space = clf2_parameter_space

  clf_cross = RandomizedSearchCV(model, clf_parameter_space, n_jobs=-1, cv=10, n_iter=1, verbose=True, refit=True) # Stratified cv fold cross validation
  clf_cross.fit(X_train, Y_train)

  log.info(f'Best parameters used for cross validation of {model_name}')
  log.info(clf_cross.best_params_)

  return clf_cross
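Because `cross_val_train` fits the `StandardScaler` inside the function, any data later passed to `.predict` on the returned object needs the same transformation. One way to keep the two steps together is scikit-learn's `Pipeline`, sketched below on synthetic data. The name `build_search`, the reduced parameter space, and the smaller `cv`/`max_iter` values are assumptions chosen to keep the example fast; this is not the notebook's method.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def build_search(random_state=42):
    """Hypothetical helper: scaler and classifier tuned as a single estimator."""
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("clf", MLPClassifier(solver="lbfgs", activation="logistic",
                              max_iter=500, random_state=random_state)),
    ])
    # Parameters of a pipeline step are addressed as "<step>__<parameter>"
    param_space = {
        "clf__hidden_layer_sizes": [(4, 8), (8, 16)],
        "clf__alpha": [0.05, 0.1],
    }
    return RandomizedSearchCV(pipe, param_space, n_iter=2, cv=3,
                              random_state=random_state, refit=True)

# Synthetic two-feature data on a large scale, loosely mimicking raw amounts
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)
X = X * 1000.0 + 5000.0

search = build_search().fit(X[:200], y[:200])
# The fitted scaler is applied automatically before the classifier predicts
pred = search.predict(X[200:])
print(pred.shape)
```

With this layout the scaler is also refitted inside each cross-validation fold, so the held-out fold does not influence the scaling statistics.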

## **Stacking - Level 1**

### **Gathering Predictions for Test Data Set X**

### ***Artificial Neural Networks Predictions***

#### ***North America***

In [15]:
NN_103_sm_output_ann_y_pred_test = cross_val_train(1,NN_103_sm_output_df_x_train, NN_103_sm_output_df_y_train).predict(NN_103_sm_output_df_x_test)
NN_103_sm_output_ann_y_pred_train = cross_val_train(1,NN_103_sm_output_df_x_train, NN_103_sm_output_df_y_train).predict(NN_103_sm_output_df_x_train)

Fitting 10 folds for each of 1 candidates, totalling 10 fits
2021-08-11 17:26:49-04:00 : Best parameters used for cross validation of Artificial Neuro Networks
2021-08-11 17:26:49-04:00 : {'warm_start': True, 'solver': 'lbfgs', 'random_state': 42, 'max_iter': 10000, 'learning_rate': 'adaptive', 'hidden_layer_sizes': (4, 8), 'alpha': 2.0, 'activation': 'logistic'}
Fitting 10 folds for each of 1 candidates, totalling 10 fits
2021-08-11 17:28:25-04:00 : Best parameters used for cross validation of Artificial Neuro Networks
2021-08-11 17:28:25-04:00 : {'warm_start': True, 'solver': 'lbfgs', 'random_state': 42, 'max_iter': 10000, 'learning_rate': 'adaptive', 'hidden_layer_sizes': (4, 8), 'alpha': 2.0, 'activation': 'logistic'}


In [16]:
NN_202_sm_output_ann_y_pred_test = cross_val_train(1,NN_202_sm_output_df_x_train, NN_202_sm_output_df_y_train).predict(NN_202_sm_output_df_x_test)
NN_202_sm_output_ann_y_pred_train = cross_val_train(1,NN_202_sm_output_df_x_train, NN_202_sm_output_df_y_train).predict(NN_202_sm_output_df_x_train)

Fitting 10 folds for each of 1 candidates, totalling 10 fits
2021-08-11 17:31:57-04:00 : Best parameters used for cross validation of Artificial Neuro Networks
2021-08-11 17:31:57-04:00 : {'warm_start': True, 'solver': 'lbfgs', 'random_state': 42, 'max_iter': 10000, 'learning_rate': 'adaptive', 'hidden_layer_sizes': (4, 8), 'alpha': 2.0, 'activation': 'logistic'}
Fitting 10 folds for each of 1 candidates, totalling 10 fits
2021-08-11 17:35:29-04:00 : Best parameters used for cross validation of Artificial Neuro Networks
2021-08-11 17:35:29-04:00 : {'warm_start': True, 'solver': 'lbfgs', 'random_state': 42, 'max_iter': 10000, 'learning_rate': 'adaptive', 'hidden_layer_sizes': (4, 8), 'alpha': 2.0, 'activation': 'logistic'}


#### ***Europe***

In [37]:
EU_103_sm_output_ann_y_pred_test = cross_val_train(1, EU_103_sm_output_df_x_train, EU_103_sm_output_df_y_train).predict(EU_103_sm_output_df_x_test)
EU_103_sm_output_ann_y_pred_train = cross_val_train(1, EU_103_sm_output_df_x_train, EU_103_sm_output_df_y_train).predict(EU_103_sm_output_df_x_train)

Fitting 10 folds for each of 1 candidates, totalling 10 fits
2021-08-11 20:37:28-04:00 : Best parameters used for cross validation of Artificial Neuro Networks
2021-08-11 20:37:28-04:00 : {'warm_start': True, 'solver': 'lbfgs', 'random_state': 42, 'max_iter': 10000, 'learning_rate': 'adaptive', 'hidden_layer_sizes': (4, 8), 'alpha': 2.0, 'activation': 'logistic'}
Fitting 10 folds for each of 1 candidates, totalling 10 fits
2021-08-11 20:39:11-04:00 : Best parameters used for cross validation of Artificial Neuro Networks
2021-08-11 20:39:11-04:00 : {'warm_start': True, 'solver': 'lbfgs', 'random_state': 42, 'max_iter': 10000, 'learning_rate': 'adaptive', 'hidden_layer_sizes': (4, 8), 'alpha': 2.0, 'activation': 'logistic'}


In [18]:
EU_202_sm_output_ann_y_pred_test = cross_val_train(1, EU_202_sm_output_df_x_train, EU_202_sm_output_df_y_train).predict(EU_202_sm_output_df_x_test)
EU_202_sm_output_ann_y_pred_train = cross_val_train(1, EU_202_sm_output_df_x_train, EU_202_sm_output_df_y_train).predict(EU_202_sm_output_df_x_train)

Fitting 10 folds for each of 1 candidates, totalling 10 fits
2021-08-11 17:43:50-04:00 : Best parameters used for cross validation of Artificial Neuro Networks
2021-08-11 17:43:50-04:00 : {'warm_start': True, 'solver': 'lbfgs', 'random_state': 42, 'max_iter': 10000, 'learning_rate': 'adaptive', 'hidden_layer_sizes': (4, 8), 'alpha': 2.0, 'activation': 'logistic'}
Fitting 10 folds for each of 1 candidates, totalling 10 fits
2021-08-11 17:48:51-04:00 : Best parameters used for cross validation of Artificial Neuro Networks
2021-08-11 17:48:51-04:00 : {'warm_start': True, 'solver': 'lbfgs', 'random_state': 42, 'max_iter': 10000, 'learning_rate': 'adaptive', 'hidden_layer_sizes': (4, 8), 'alpha': 2.0, 'activation': 'logistic'}


#### ***Asia***

In [38]:
AS_103_sm_output_ann_y_pred_test = cross_val_train(1, AS_103_sm_output_df_x_train, AS_103_sm_output_df_y_train).predict(AS_103_sm_output_df_x_test)
AS_103_sm_output_ann_y_pred_train = cross_val_train(1, AS_103_sm_output_df_x_train, AS_103_sm_output_df_y_train).predict(AS_103_sm_output_df_x_train)

Fitting 10 folds for each of 1 candidates, totalling 10 fits
2021-08-11 20:40:54-04:00 : Best parameters used for cross validation of Artificial Neuro Networks
2021-08-11 20:40:54-04:00 : {'warm_start': True, 'solver': 'lbfgs', 'random_state': 42, 'max_iter': 10000, 'learning_rate': 'adaptive', 'hidden_layer_sizes': (4, 8), 'alpha': 2.0, 'activation': 'logistic'}
Fitting 10 folds for each of 1 candidates, totalling 10 fits
2021-08-11 20:42:36-04:00 : Best parameters used for cross validation of Artificial Neuro Networks
2021-08-11 20:42:36-04:00 : {'warm_start': True, 'solver': 'lbfgs', 'random_state': 42, 'max_iter': 10000, 'learning_rate': 'adaptive', 'hidden_layer_sizes': (4, 8), 'alpha': 2.0, 'activation': 'logistic'}


In [39]:
AS_202_sm_output_ann_y_pred_test = cross_val_train(1, AS_202_sm_output_df_x_train, AS_202_sm_output_df_y_train).predict(AS_202_sm_output_df_x_test)
AS_202_sm_output_ann_y_pred_train = cross_val_train(1, AS_202_sm_output_df_x_train, AS_202_sm_output_df_y_train).predict(AS_202_sm_output_df_x_train)

Fitting 10 folds for each of 1 candidates, totalling 10 fits
2021-08-11 20:47:47-04:00 : Best parameters used for cross validation of Artificial Neuro Networks
2021-08-11 20:47:47-04:00 : {'warm_start': True, 'solver': 'lbfgs', 'random_state': 42, 'max_iter': 10000, 'learning_rate': 'adaptive', 'hidden_layer_sizes': (4, 8), 'alpha': 2.0, 'activation': 'logistic'}
Fitting 10 folds for each of 1 candidates, totalling 10 fits
2021-08-11 20:52:55-04:00 : Best parameters used for cross validation of Artificial Neuro Networks
2021-08-11 20:52:55-04:00 : {'warm_start': True, 'solver': 'lbfgs', 'random_state': 42, 'max_iter': 10000, 'learning_rate': 'adaptive', 'hidden_layer_sizes': (4, 8), 'alpha': 2.0, 'activation': 'logistic'}


### ***Random Forest Predictions***

#### ***North America***

In [40]:
NN_103_sm_output_rf_y_pred_test = cross_val_train(2,NN_103_sm_output_df_x_train, NN_103_sm_output_df_y_train).predict(NN_103_sm_output_df_x_test)
NN_103_sm_output_rf_y_pred_train = cross_val_train(2,NN_103_sm_output_df_x_train, NN_103_sm_output_df_y_train).predict(NN_103_sm_output_df_x_train)

Fitting 10 folds for each of 1 candidates, totalling 10 fits
2021-08-11 21:05:31-04:00 : Best parameters used for cross validation of Random Forest
2021-08-11 21:05:31-04:00 : {'n_estimators': 1000, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 70, 'bootstrap': False}
Fitting 10 folds for each of 1 candidates, totalling 10 fits
2021-08-11 21:18:04-04:00 : Best parameters used for cross validation of Random Forest
2021-08-11 21:18:04-04:00 : {'n_estimators': 1000, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 70, 'bootstrap': False}


In [56]:
NN_202_sm_output_rf_y_pred_test = cross_val_train(2,NN_202_sm_output_df_x_train, NN_202_sm_output_df_y_train).predict(NN_202_sm_output_df_x_test)
NN_202_sm_output_rf_y_pred_train = cross_val_train(2,NN_202_sm_output_df_x_train, NN_202_sm_output_df_y_train).predict(NN_202_sm_output_df_x_train)

Fitting 10 folds for each of 1 candidates, totalling 10 fits
2021-08-11 23:56:42-04:00 : Best parameters used for cross validation of Random Forest
2021-08-11 23:56:42-04:00 : {'n_estimators': 1000, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 70, 'bootstrap': False}
Fitting 10 folds for each of 1 candidates, totalling 10 fits
2021-08-12 00:09:54-04:00 : Best parameters used for cross validation of Random Forest
2021-08-12 00:09:54-04:00 : {'n_estimators': 1000, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 70, 'bootstrap': False}


#### ***Europe***

In [42]:
EU_103_sm_output_rf_y_pred_test = cross_val_train(2, EU_103_sm_output_df_x_train, EU_103_sm_output_df_y_train).predict(EU_103_sm_output_df_x_test)
EU_103_sm_output_rf_y_pred_train = cross_val_train(2, EU_103_sm_output_df_x_train, EU_103_sm_output_df_y_train).predict(EU_103_sm_output_df_x_train)

Fitting 10 folds for each of 1 candidates, totalling 10 fits
2021-08-11 21:56:21-04:00 : Best parameters used for cross validation of Random Forest
2021-08-11 21:56:21-04:00 : {'n_estimators': 1000, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 70, 'bootstrap': False}
Fitting 10 folds for each of 1 candidates, totalling 10 fits
2021-08-11 22:07:36-04:00 : Best parameters used for cross validation of Random Forest
2021-08-11 22:07:36-04:00 : {'n_estimators': 1000, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 70, 'bootstrap': False}


In [57]:
EU_202_sm_output_rf_y_pred_test = cross_val_train(2, EU_202_sm_output_df_x_train, EU_202_sm_output_df_y_train).predict(EU_202_sm_output_df_x_test)
EU_202_sm_output_rf_y_pred_train = cross_val_train(2, EU_202_sm_output_df_x_train, EU_202_sm_output_df_y_train).predict(EU_202_sm_output_df_x_train)

Fitting 10 folds for each of 1 candidates, totalling 10 fits
2021-08-12 00:20:31-04:00 : Best parameters used for cross validation of Random Forest
2021-08-12 00:20:31-04:00 : {'n_estimators': 1000, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 70, 'bootstrap': False}
Fitting 10 folds for each of 1 candidates, totalling 10 fits
2021-08-12 00:30:55-04:00 : Best parameters used for cross validation of Random Forest
2021-08-12 00:30:55-04:00 : {'n_estimators': 1000, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 70, 'bootstrap': False}


#### ***Asia***

In [58]:
AS_103_sm_output_rf_y_pred_test = cross_val_train(2, AS_103_sm_output_df_x_train, AS_103_sm_output_df_y_train).predict(AS_103_sm_output_df_x_test)
AS_103_sm_output_rf_y_pred_train = cross_val_train(2, AS_103_sm_output_df_x_train, AS_103_sm_output_df_y_train).predict(AS_103_sm_output_df_x_train)

Fitting 10 folds for each of 1 candidates, totalling 10 fits
2021-08-12 00:42:01-04:00 : Best parameters used for cross validation of Random Forest
2021-08-12 00:42:01-04:00 : {'n_estimators': 1000, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 70, 'bootstrap': False}
Fitting 10 folds for each of 1 candidates, totalling 10 fits
2021-08-12 00:53:02-04:00 : Best parameters used for cross validation of Random Forest
2021-08-12 00:53:02-04:00 : {'n_estimators': 1000, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 70, 'bootstrap': False}


In [45]:
AS_202_sm_output_rf_y_pred_test = cross_val_train(2, AS_202_sm_output_df_x_train, AS_202_sm_output_df_y_train).predict(AS_202_sm_output_df_x_test)
AS_202_sm_output_rf_y_pred_train = cross_val_train(2, AS_202_sm_output_df_x_train, AS_202_sm_output_df_y_train).predict(AS_202_sm_output_df_x_train)

Fitting 10 folds for each of 1 candidates, totalling 10 fits
2021-08-11 23:02:01-04:00 : Best parameters used for cross validation of Random Forest
2021-08-11 23:02:01-04:00 : {'n_estimators': 1000, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 70, 'bootstrap': False}
Fitting 10 folds for each of 1 candidates, totalling 10 fits
2021-08-11 23:12:36-04:00 : Best parameters used for cross validation of Random Forest
2021-08-11 23:12:36-04:00 : {'n_estimators': 1000, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 70, 'bootstrap': False}


## **Stacking - Level 2**

##### **Stacking Algorithm = XGBoost (North America) or Decision Tree (Europe and Asia)**

##### **Weights of 2x applied to Level 1 Random Forest's predictions before feeding them into the Level 2 model as inputs**

##### **Level 2 X Training Data = Combination of Level 1 ANN and Random Forest predictions from the "train" data**

##### **Level 2 X Test Data = Combination of Level 1 ANN and Random Forest predictions from the "test" data**
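The two-level layout described above can be sketched on synthetic data as follows. This is an illustration under stated assumptions rather than the notebook's exact procedure: `LogisticRegression` stands in for the ANN to keep the run short, and the training meta-features come from out-of-fold predictions (`cross_val_predict`), a common variant that keeps the level 2 model from seeing predictions made on rows the level 1 models were fitted on.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Level 1 models (LogisticRegression is a stand-in for the ANN)
level1 = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=50, random_state=42),
]

# Level 2 training inputs: out-of-fold predictions on the training rows
meta_train = np.column_stack(
    [cross_val_predict(model, X_train, y_train, cv=5) for model in level1])

# Level 2 test inputs: predictions from models fitted on the full training split
meta_test = np.column_stack(
    [model.fit(X_train, y_train).predict(X_test) for model in level1])

# Level 2 model combines the level 1 predictions
level2 = DecisionTreeClassifier(random_state=42).fit(meta_train, y_train)
final_pred = level2.predict(meta_test)
print(meta_train.shape, meta_test.shape, final_pred.shape)
```

scikit-learn's `StackingClassifier`, already imported in this notebook, packages the same idea behind a single estimator.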

#### ***North America***

In [59]:
NN_103_sm_output_ensemble_x_train = np.column_stack((NN_103_sm_output_ann_y_pred_train, NN_103_sm_output_rf_y_pred_train*2))
NN_103_sm_output_ensemble_x_train_df = pd.DataFrame(NN_103_sm_output_ensemble_x_train, columns = ['ANN','RF'])
NN_103_sm_output_ensemble_x_test = np.column_stack((NN_103_sm_output_ann_y_pred_test, NN_103_sm_output_rf_y_pred_test*2))
NN_103_sm_output_ensemble_x_test_df = pd.DataFrame(NN_103_sm_output_ensemble_x_test, columns = ['ANN','RF'])
NN_103_sm_output_df_y_train = np.ravel(NN_103_sm_output_df_y_train)
NN_103_sm_output_final_pred = xgb.XGBClassifier(objective= 'binary:logistic').fit(NN_103_sm_output_ensemble_x_train_df, NN_103_sm_output_df_y_train).predict(NN_103_sm_output_ensemble_x_test_df)

In [60]:
NN_202_sm_output_ensemble_x_train = np.column_stack((NN_202_sm_output_ann_y_pred_train, NN_202_sm_output_rf_y_pred_train*2))
NN_202_sm_output_ensemble_x_train_df = pd.DataFrame(NN_202_sm_output_ensemble_x_train, columns = ['ANN','RF'])
NN_202_sm_output_ensemble_x_test = np.column_stack((NN_202_sm_output_ann_y_pred_test, NN_202_sm_output_rf_y_pred_test*2))
NN_202_sm_output_ensemble_x_test_df = pd.DataFrame(NN_202_sm_output_ensemble_x_test, columns = ['ANN','RF'])
NN_202_sm_output_df_y_train = np.ravel(NN_202_sm_output_df_y_train)
NN_202_sm_output_final_pred = xgb.XGBClassifier(objective= 'binary:logistic').fit(NN_202_sm_output_ensemble_x_train_df, NN_202_sm_output_df_y_train).predict(NN_202_sm_output_ensemble_x_test_df)

#### ***Europe***

In [61]:
EU_103_sm_output_ensemble_x_train = np.column_stack((EU_103_sm_output_ann_y_pred_train, EU_103_sm_output_rf_y_pred_train*2))
EU_103_sm_output_ensemble_x_train_df = pd.DataFrame(EU_103_sm_output_ensemble_x_train, columns = ['ANN','RF'])
EU_103_sm_output_ensemble_x_test = np.column_stack((EU_103_sm_output_ann_y_pred_test, EU_103_sm_output_rf_y_pred_test*2))
EU_103_sm_output_ensemble_x_test_df = pd.DataFrame(EU_103_sm_output_ensemble_x_test, columns = ['ANN','RF'])
EU_103_sm_output_final_pred = DecisionTreeClassifier(random_state=42).fit(EU_103_sm_output_ensemble_x_train_df, EU_103_sm_output_df_y_train).predict(EU_103_sm_output_ensemble_x_test_df)

In [62]:
EU_202_sm_output_ensemble_x_train = np.column_stack((EU_202_sm_output_ann_y_pred_train, EU_202_sm_output_rf_y_pred_train*2))
EU_202_sm_output_ensemble_x_train_df = pd.DataFrame(EU_202_sm_output_ensemble_x_train, columns = ['ANN','RF'])
EU_202_sm_output_ensemble_x_test = np.column_stack((EU_202_sm_output_ann_y_pred_test, EU_202_sm_output_rf_y_pred_test*2))
EU_202_sm_output_ensemble_x_test_df = pd.DataFrame(EU_202_sm_output_ensemble_x_test, columns = ['ANN','RF'])
EU_202_sm_output_final_pred = DecisionTreeClassifier(random_state=42).fit(EU_202_sm_output_ensemble_x_train_df, EU_202_sm_output_df_y_train).predict(EU_202_sm_output_ensemble_x_test_df)

#### ***Asia***

In [63]:
AS_103_sm_output_ensemble_x_train = np.column_stack((AS_103_sm_output_ann_y_pred_train, AS_103_sm_output_rf_y_pred_train*2))
AS_103_sm_output_ensemble_x_train_df = pd.DataFrame(AS_103_sm_output_ensemble_x_train, columns = ['ANN','RF'])
AS_103_sm_output_ensemble_x_test = np.column_stack((AS_103_sm_output_ann_y_pred_test, AS_103_sm_output_rf_y_pred_test*2))
AS_103_sm_output_ensemble_x_test_df = pd.DataFrame(AS_103_sm_output_ensemble_x_test, columns = ['ANN','RF'])
AS_103_sm_output_final_pred = DecisionTreeClassifier(random_state=42).fit(AS_103_sm_output_ensemble_x_train_df, AS_103_sm_output_df_y_train).predict(AS_103_sm_output_ensemble_x_test_df)

In [64]:
AS_202_sm_output_ensemble_x_train = np.column_stack((AS_202_sm_output_ann_y_pred_train, AS_202_sm_output_rf_y_pred_train*2))
AS_202_sm_output_ensemble_x_train_df = pd.DataFrame(AS_202_sm_output_ensemble_x_train, columns = ['ANN','RF'])
AS_202_sm_output_ensemble_x_test = np.column_stack((AS_202_sm_output_ann_y_pred_test, AS_202_sm_output_rf_y_pred_test*2))
AS_202_sm_output_ensemble_x_test_df = pd.DataFrame(AS_202_sm_output_ensemble_x_test, columns = ['ANN','RF'])
AS_202_sm_output_final_pred = DecisionTreeClassifier(random_state=42).fit(AS_202_sm_output_ensemble_x_train_df, AS_202_sm_output_df_y_train).predict(AS_202_sm_output_ensemble_x_test_df)

## **Confusion Matrices and Accuracy Statistics**

**Accuracy** <br>
Accuracy is the most intuitive performance measure: the ratio of correctly predicted observations to total observations. High accuracy alone does not mean a model is good. It is a reliable measure only when the classes are roughly balanced and false positives and false negatives occur at similar rates, so other metrics are needed to evaluate the model. For example, an accuracy of 0.75 or greater means the model labels at least 75% of observations correctly.

Accuracy = (TP + TN) / (TP + FP + FN + TN)

**Precision (Positive Predictive Value)** <br>
Precision is the ratio of correctly predicted positive observations to all predicted positive observations, correct or not. It answers the question: of all wires labeled as fraud, how many are actually fraud? High precision corresponds to few false positives. For example, a precision of 0.75 means that three of every four wires flagged as fraud are truly fraud.

Precision = TP / (TP + FP)

**Recall (Sensitivity)** <br>
Recall is the ratio of correctly predicted positive observations to all truly positive observations, that is, true positives plus false negatives. It answers the question: of all the wires that are truly fraud, how many did we label as fraud? For example, a recall of 0.631 means the model catches 63.1% of fraudulent wires, which is above the 0.5 threshold used here as the bar for a good model.

Recall = TP / (TP + FN)

**F1 score** <br>
F1 Score is the harmonic mean of Precision and Recall, so it takes both false positives and false negatives into account. It is less intuitive than accuracy but usually more useful, especially with an uneven class distribution. Accuracy works best when false positives and false negatives have similar costs. If their costs differ greatly, it is better to look at both Precision and Recall.

F1 Score = 2*(Recall * Precision) / (Recall + Precision)
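As a worked check of the formulas above, the sketch below computes each metric from the four confusion-matrix counts. The helper name `binary_metrics` is hypothetical. The counts are those of the North America MT103 confusion matrix reported later in this notebook, read in scikit-learn's layout `[[TN, FP], [FN, TP]]`; specificity is included for comparison even though it is not defined above.

```python
def binary_metrics(tn, fp, fn, tp):
    """Metrics for the positive (fraud) class from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}

# Counts from the North America MT103 matrix: [[55, 29736], [0, 29791]]
m = binary_metrics(tn=55, fp=29736, fn=0, tp=29791)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

A recall of 1.0 next to a specificity close to zero shows that nearly every wire is being labeled as fraud, which accuracy and F1 alone do not make obvious.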

#### ***North America***

In [65]:
NN_103_output_df_y_train.shape # before SMOTE original y train

(70000, 1)

In [66]:
NN_103_output_df_y_train[NN_103_output_df_y_train['FRAUD_LABEL']>=1].shape # before SMOTE original y train, fraud-label filter

(549, 1)

In [67]:
NN_103_sm_output_df_y_train.shape # y train after SMOTE

(138902,)

In [92]:
NN_103_sm_output_df_y_train[NN_103_sm_output_df_y_train>=1].shape # y train after SMOTE, fraud-label filter

(69451,)

In [69]:
NN_103_sm_output_ann_y_pred_train.shape # ann y train level 1 predictions

(138902,)

In [70]:
NN_103_sm_output_rf_y_pred_train.shape # random forest y train level 1 predictions

(138902,)

In [71]:
NN_103_sm_output_ann_y_pred_train[NN_103_sm_output_ann_y_pred_train>=1].shape # ann y train level 1 prediction filter

(138902,)

In [72]:
NN_103_sm_output_rf_y_pred_train[NN_103_sm_output_rf_y_pred_train>=1].shape  # random forest y train level 1 prediction filter

(98,)

In [73]:
NN_103_sm_output_ensemble_x_train_df.shape # level 2 x train

(138902, 2)

In [74]:
NN_103_sm_output_ensemble_x_train_df.shape # level 2 x train 

(138902, 2)

In [75]:
NN_103_sm_output_df_y_train.shape # level 2 y train 

(138902,)

In [76]:
NN_103_sm_output_ensemble_x_train_df[NN_103_sm_output_ensemble_x_train_df['ANN']>=1].shape # level 2 x train ANN filter

(138902, 2)

In [77]:
NN_103_sm_output_ensemble_x_train_df[NN_103_sm_output_ensemble_x_train_df['RF']>=1].shape# level 2 x train RF filter

(98, 2)

In [78]:
NN_103_sm_output_ensemble_x_test_df.shape # level 2 x test 

(59582, 2)

In [79]:
NN_103_sm_output_ensemble_x_test_df[NN_103_sm_output_ensemble_x_test_df['ANN']>=1].shape # level 2 x test ANN filter

(59582, 2)

In [80]:
NN_103_sm_output_ensemble_x_test_df[NN_103_sm_output_ensemble_x_test_df['RF']>=1].shape # level 2 x test RF filter

(55, 2)

In [82]:
NN_103_sm_output_final_pred.shape # level 2 predictions

(59582,)

In [81]:
NN_103_sm_output_final_pred[NN_103_sm_output_final_pred>=1].shape # level 2 predictions filter

(59527,)
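
Each filter above counts the rows labeled or predicted as fraud in one array at a time. The sketch below shows one way to summarize the same check as a share of positive predictions. The helper `positive_share` and the toy lists are hypothetical stand-ins, not notebook variables; the helper only assumes labels coded so that fraud is any value of 1 or more.

```python
def positive_share(labels):
    """Return (positive_count, total, share) where positive means a label >= 1."""
    values = list(labels)
    total = len(values)
    positives = sum(1 for v in values if v >= 1)
    share = positives / total if total else 0.0
    return positives, total, share


# Toy predictions standing in for the level 1 outputs (assumed values).
toy_ann_pred = [1, 1, 1, 1, 1, 1, 1, 1]
toy_rf_pred = [0, 0, 0, 1, 0, 0, 0, 0]

for name, pred in [("ANN", toy_ann_pred), ("RF", toy_rf_pred)]:
    positives, total, share = positive_share(pred)
    print(f"{name}: {positives}/{total} predicted positive ({share:.1%})")
```

A share at or near 100% or 0% suggests that a learner is predicting a single class. The ANN training predictions above show this pattern: the `>=1` filter keeps all 138902 rows.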

#### MT103

In [83]:
print(confusion_matrix(NN_103_sm_output_df_y_test,NN_103_sm_output_final_pred))
print(classification_report(NN_103_sm_output_df_y_test,NN_103_sm_output_final_pred))

[[   55 29736]
 [    0 29791]]
              precision    recall  f1-score   support

           0       1.00      0.00      0.00     29791
           1       0.50      1.00      0.67     29791

    accuracy                           0.50     59582
   macro avg       0.75      0.50      0.34     59582
weighted avg       0.75      0.50      0.34     59582



#### MT202

In [84]:
print(confusion_matrix(NN_202_sm_output_df_y_test,NN_202_sm_output_final_pred))
print(classification_report(NN_202_sm_output_df_y_test,NN_202_sm_output_final_pred))

[[   15 29784]
 [    0 29799]]
              precision    recall  f1-score   support

           0       1.00      0.00      0.00     29799
           1       0.50      1.00      0.67     29799

    accuracy                           0.50     59598
   macro avg       0.75      0.50      0.33     59598
weighted avg       0.75      0.50      0.33     59598



#### ***Europe***

#### MT103

In [85]:
print(confusion_matrix(EU_103_sm_output_df_y_test,EU_103_sm_output_final_pred))
print(classification_report(EU_103_sm_output_df_y_test,EU_103_sm_output_final_pred))

[[    9 29847]
 [    0 29856]]
              precision    recall  f1-score   support

           0       1.00      0.00      0.00     29856
           1       0.50      1.00      0.67     29856

    accuracy                           0.50     59712
   macro avg       0.75      0.50      0.33     59712
weighted avg       0.75      0.50      0.33     59712



#### MT202

In [86]:
print(confusion_matrix(EU_202_sm_output_df_y_test,EU_202_sm_output_final_pred))
print(classification_report(EU_202_sm_output_df_y_test,EU_202_sm_output_final_pred))

[[27476  2402]
 [26962  2916]]
              precision    recall  f1-score   support

           0       0.50      0.92      0.65     29878
           1       0.55      0.10      0.17     29878

    accuracy                           0.51     59756
   macro avg       0.53      0.51      0.41     59756
weighted avg       0.53      0.51      0.41     59756



#### ***Asia***

#### MT103

In [87]:
print(confusion_matrix(AS_103_sm_output_df_y_test,AS_103_sm_output_final_pred))
print(classification_report(AS_103_sm_output_df_y_test,AS_103_sm_output_final_pred))

[[    9 29847]
 [    0 29856]]
              precision    recall  f1-score   support

           0       1.00      0.00      0.00     29856
           1       0.50      1.00      0.67     29856

    accuracy                           0.50     59712
   macro avg       0.75      0.50      0.33     59712
weighted avg       0.75      0.50      0.33     59712



#### MT202

In [88]:
print(confusion_matrix(AS_202_sm_output_df_y_test,AS_202_sm_output_final_pred))
print(classification_report(AS_202_sm_output_df_y_test,AS_202_sm_output_final_pred))

[[27476  2402]
 [26962  2916]]
              precision    recall  f1-score   support

           0       0.50      0.92      0.65     29878
           1       0.55      0.10      0.17     29878

    accuracy                           0.51     59756
   macro avg       0.53      0.51      0.41     59756
weighted avg       0.53      0.51      0.41     59756
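
Because the project goal is stated in terms of sensitivity and specificity, the six confusion matrices above can be reduced to those two rates. The following is a minimal sketch: the helper name `sensitivity_specificity` and the dictionary labels are hypothetical, the matrices are copied from the outputs above, and the scikit-learn layout `[[TN, FP], [FN, TP]]` is assumed.

```python
def sensitivity_specificity(matrix):
    """Return (sensitivity, specificity) from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    return sensitivity, specificity


# Confusion matrices as printed in the cells above.
reported = {
    "North America MT103": [[55, 29736], [0, 29791]],
    "North America MT202": [[15, 29784], [0, 29799]],
    "Europe MT103": [[9, 29847], [0, 29856]],
    "Europe MT202": [[27476, 2402], [26962, 2916]],
    "Asia MT103": [[9, 29847], [0, 29856]],
    "Asia MT202": [[27476, 2402], [26962, 2916]],
}

for name, matrix in reported.items():
    sens, spec = sensitivity_specificity(matrix)
    print(f"{name}: sensitivity={sens:.4f} specificity={spec:.4f}")
```

Sensitivity here is the class 1 recall in the reports above, and specificity is the class 0 recall.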



## **Merge Final Predictions with X Test Data Sets**

In [90]:
# Merge predictions to North America
NN_103_final_df = pd.DataFrame( np.column_stack((NN_103_sm_output_df_x_test.to_numpy(), NN_103_sm_output_final_pred)),  columns = ['TRANSACTION_MONTH','TRANSACTION_AMOUNT','FRAUD_LABEL'] )
NN_202_final_df = pd.DataFrame( np.column_stack((NN_202_sm_output_df_x_test.to_numpy(), NN_202_sm_output_final_pred)),  columns = ['TRANSACTION_MONTH','TRANSACTION_AMOUNT','FRAUD_LABEL'] )

# Merge predictions to Europe
EU_103_final_df = pd.DataFrame( np.column_stack((EU_103_sm_output_df_x_test.to_numpy(), EU_103_sm_output_final_pred)),  columns = ['TRANSACTION_MONTH','TRANSACTION_AMOUNT','FRAUD_LABEL'] )
EU_202_final_df = pd.DataFrame( np.column_stack((EU_202_sm_output_df_x_test.to_numpy(), EU_202_sm_output_final_pred)),  columns = ['TRANSACTION_MONTH','TRANSACTION_AMOUNT','FRAUD_LABEL'] )

# Merge predictions to Asia
AS_103_final_df = pd.DataFrame( np.column_stack((AS_103_sm_output_df_x_test.to_numpy(), AS_103_sm_output_final_pred)),  columns = ['TRANSACTION_MONTH','TRANSACTION_AMOUNT','FRAUD_LABEL'] )
AS_202_final_df = pd.DataFrame( np.column_stack((AS_202_sm_output_df_x_test.to_numpy(), AS_202_sm_output_final_pred)),  columns = ['TRANSACTION_MONTH','TRANSACTION_AMOUNT','FRAUD_LABEL'] )
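
The six assignments above repeat the same pattern. A minimal sketch of a reusable helper is shown below; `attach_predictions`, `FINAL_COLUMNS`, and the toy data are hypothetical and not part of the notebook. The sketch assumes, as the column list above implies, that each x test frame has exactly two feature columns.

```python
import numpy as np
import pandas as pd

# Column names taken from the merge cell above.
FINAL_COLUMNS = ['TRANSACTION_MONTH', 'TRANSACTION_AMOUNT', 'FRAUD_LABEL']


def attach_predictions(x_test, predictions):
    """Return a new DataFrame with the predictions appended as FRAUD_LABEL."""
    predictions = np.asarray(predictions)
    if len(x_test) != len(predictions):
        raise ValueError("x_test and predictions must have the same number of rows")
    return pd.DataFrame(
        np.column_stack((x_test.to_numpy(), predictions)),
        columns=FINAL_COLUMNS,
    )


# Toy stand-ins for one x test frame and its level 2 predictions (assumed values).
toy_x_test = pd.DataFrame({
    'TRANSACTION_MONTH': [1, 2, 3],
    'TRANSACTION_AMOUNT': [100.0, 250.5, 75.25],
})
toy_pred = np.array([0, 1, 0])

toy_final_df = attach_predictions(toy_x_test, toy_pred)
print(toy_final_df)
```

The explicit length check is a design choice in this sketch: it turns a row count mismatch between features and predictions into a descriptive error raised before any stacking is attempted.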

## **Backup Final Results for Rule Tuning**

In [91]:
NN_103_final_df.to_csv('/content/gdrive/MyDrive/DSC-380/NN_103_final_df.csv', index=False)
NN_202_final_df.to_csv('/content/gdrive/MyDrive/DSC-380/NN_202_final_df.csv', index=False)
EU_103_final_df.to_csv('/content/gdrive/MyDrive/DSC-380/EU_103_final_df.csv', index=False)
EU_202_final_df.to_csv('/content/gdrive/MyDrive/DSC-380/EU_202_final_df.csv', index=False)
AS_103_final_df.to_csv('/content/gdrive/MyDrive/DSC-380/AS_103_final_df.csv', index=False)
AS_202_final_df.to_csv('/content/gdrive/MyDrive/DSC-380/AS_202_final_df.csv', index=False)