<a href="https://colab.research.google.com/github/henryjhu/Anomaly-Detection-in-Wire-Activities/blob/main/DSC_680_Unsupervised_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **DSC-680-Z1 Research Practicum** <BR> Machine Learning

## **Project Description**
A global bank sought to find new and innovative means for detecting and preventing fraud in their wire transactions. Their goal is through machine learning to arrive at scenario detection rules which could be customized with parameters specific for each cohort of customers. The immediate challenge was that the data provided had no labels. Therefore, unsupervised machine learning techniques were first employed for this research project. After the labels have been identified, supervised machine learning techniques were then employed with selected tuning parameters to increase both the sensitivity and specificity percentages of both classes.

<b>Purpose:</b><br>
Carry out unsupervised machine learnings with the sample data.<br>
<b>Universtiy Name:</b> Utica College <br>
<b>Course Name:</b> DSC-680-Z1 Research Practicum <br>
<b>Student Name:</b> Henry J. Hu <br>
<b>Program Director Name:</b> Dr. McCarthy, Michael <br>
<b>Runtime Environment:</b> Google Colab<br>
<b>Programming Language:</b> Python <br>
<b>Sample Data Frame:</b>
A random sample of unlabeled international wires belonging to 139 customers from 3 continents for the entire year of 2020.<br>
<b> Last Update:</b> August 5th, 2021

## **Mounting Google Drive**

In [1]:
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)

Mounted at /content/gdrive/


## **Importing Libraries**

In [2]:
# Importing libraries
import io
import pandas as pd
import numpy as np
from numpy import quantile, where, random
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
import traceback
import time
from datetime import datetime
import pytz
from sklearn.preprocessing import StandardScaler
from sklearn import model_selection, preprocessing
from sklearn.neighbors import LocalOutlierFactor
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import IsolationForest, VotingClassifier, StackingClassifier
from sklearn.datasets import make_blobs
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import cross_val_score, KFold
from scipy.stats import rankdata

## **Importing Data Into Google Colab**

In [3]:
# Importing data and looking at head
input_data = pd.read_csv("gdrive/MyDrive/DSC-380/sample_df_4M.txt")
input_data.head()

Unnamed: 0,TRANSACTION_ID,TRANSACTION_TIME,TRXN_MONTH,CLIENT_ID,COUNTRY_NAME,COUNTRY_CODE,CONTINENT_NAME,CONTINENT_CODE,SWIFT_MSG_TYPE,AVG_TRXN_AMT,TRANSACTION_AMOUNT
0,3174204,2020-03-31 18:21:17,3,7116490843,United States of America,US,North America,NN,202,39246.109,6475.35
1,1237511,2020-02-07 00:24:34,2,6249255174,India-Republic of,IN,Asia,AS,202,26152.551,3335.49
2,5556094,2020-06-11 13:46:22,6,7117396344,Switzerland-Swiss Confederation,CH,Europe,EU,103,124854.38,8920000.0
3,2332371,2020-03-10 05:15:07,3,6249399616,United States of America,US,North America,NN,202,39246.109,1784.0
4,7295929,2020-07-31 17:23:36,7,7116490843,United States of America,US,North America,NN,202,29569.26,2446.45


## **Data Segregation**

In [4]:
NN_103_df = input_data[(input_data['CONTINENT_CODE']=='NN') & (input_data['SWIFT_MSG_TYPE']==103)]
NN_103_df.head()
NN_103_df.shape

(597651, 11)

In [5]:
NN_202_df = input_data[(input_data['CONTINENT_CODE']=='NN') & (input_data['SWIFT_MSG_TYPE']==202)]
NN_202_df.head()
NN_202_df.shape

(1311380, 11)

In [6]:
EU_103_df = input_data[(input_data['CONTINENT_CODE']=='EU') & (input_data['SWIFT_MSG_TYPE']==103)]
EU_103_df.head()
EU_103_df.shape

(627223, 11)

In [7]:
EU_202_df = input_data[(input_data['CONTINENT_CODE']=='EU') & (input_data['SWIFT_MSG_TYPE']==202)]
EU_202_df.head()
EU_202_df.shape

(320828, 11)

In [8]:
AS_103_df = input_data[(input_data['CONTINENT_CODE']=='EU') & (input_data['SWIFT_MSG_TYPE']==103)]
AS_103_df.head()
AS_103_df.shape

(627223, 11)

In [9]:
AS_202_df = input_data[(input_data['CONTINENT_CODE']=='EU') & (input_data['SWIFT_MSG_TYPE']==202)]
AS_202_df.head()
AS_202_df.shape

(320828, 11)

## **Unsupervised Ensemble Learner**

In [10]:
################################################################################
#
# Purpose: Function to rank the outlier scores
#
################################################################################

def rank_fun(arr):
    return rankdata(arr, method = 'dense')

In [11]:
####################################################################################################
#
# Purpose: Function to calculate outlier scores for a given input data set.
# Machine learning method: Ensemble learner of Local Outlier Factor and Isolaion Forest.
# Score to fraud label rule: A score smaller than the offset vaLue is fraud and greater 
# than or equal to the offset value is not fraud.
#
####################################################################################################

def ensemble_fun (df, n_neighbors_n=20, leaf_size=30, pairwise_n=2, n_estimators_n=100, random_state_n=42, contamination_n=0.05):

  # Include only the relavent independent varialbes
  X=df[['TRXN_MONTH','TRANSACTION_AMOUNT']]
  
  # Scale and center the data around the mean of 0
  scaling=StandardScaler()
  scaling.fit_transform(X)

  # Initialize the log variable
  class log:
    def_tz = pytz.timezone('America/New_York')
    def info(text):        
        print(f'{datetime.now(log.def_tz).replace(microsecond=0)} : {text}');

  # Initialize an enumerate list of estimators
  estimator_list = {
    # novelty=False because this is outlier detection.
    # pairwise_n = 2 for Euclidian distance and pairwise_n = 1 for Manhattan distance.
    'LOF':LocalOutlierFactor(novelty=False, n_neighbors=n_neighbors_n, algorithm='auto', leaf_size=30, 
                             metric='minkowski', p=pairwise_n, metric_params=None, contamination=contamination_n),
    'iForest':IsolationForest(n_estimators=n_estimators_n, random_state=random_state_n, max_samples=len(X), contamination=contamination_n)
  }

  # Input data frame size
  n_rows_in = X.shape[0]
  n_features_in = X.shape[1]

  # Initializing score arrayS
  ensemble_scores = np.zeros([n_rows_in, len(estimator_list)])
  final_scores = np.zeros([n_rows_in, 1])

  # Fitting individual models in the enumerate list
  log.info (f'Input data frame size: Rows = {n_rows_in}, Columns = {n_features_in}')

  for i, (clf_name, clf) in enumerate(estimator_list.items()):
    try:
        clf.fit(X)
        if clf_name == "LOF":
            log.info(f'Fitting {clf_name}')
            log.info(f'LOF offset_ = {clf.offset_}')
            ensemble_scores[:, i] = clf.negative_outlier_factor_
        else:
            log.info(f'Fitting {clf_name}')
            log.info(f'iForest offset_ = {clf.offset_}')
            ensemble_scores[:, i] = clf.score_samples(X)
    except:
            log.info(traceback.print_exc())
    else:    
            log.info(f'{clf_name} is fitted successfully with {len(ensemble_scores)} scores')  

  # Repalce NaN with 0's
  ensemble_scores=np.nan_to_num(ensemble_scores) 

  # Transforming the outlier scores into ranking values
  ensemble_scores = np.apply_along_axis(rank_fun, 0, ensemble_scores)

  # Normalize the ranking values of both algorithms between 0 and 1
  ensemble_scores = preprocessing.MinMaxScaler().fit_transform(ensemble_scores)

  # Select the maximum of two ranking values
  final_scores = np.max(ensemble_scores, axis = 1)

  # Make a copy of final score array
  pred_y = np.copy(final_scores) 

  score_min = np.min(pred_y)
  score_max = np.max(pred_y)

  log.info (f'Minimum Score = {score_min}, Maximum Score = {score_max}')

  # Labeling all scores < offset value as fraud and >= offset value as non-fraud
  offset = 0.1
  pred_y[pred_y < offset] = -999
  pred_y[pred_y >= offset] = 0.0
  pred_y[pred_y == -999] = 1.0

  log.info (f'Real offset: {offset}')

  fraud_ct = np.count_nonzero(pred_y == 1.0)
  fraud_pct = fraud_ct/n_rows_in

  log.info (f'Percentage of suspicious transactions: {fraud_pct}')

  df_arr=df.to_numpy() # Converting the input data frame to a data array
  df_arr_f=np.column_stack( (df_arr, pred_y)) # Adding the scores to the input data array

  # Converting the final combinded data array back to a data frame
  df_f = pd.DataFrame(df_arr_f, columns = ['TRANSACTION_ID','TRANSACTION_TIME','TRXN_MONTH','CLIENT_ID','COUNTRY_NAME','COUNTRY_CODE','CONTINENT_NAME',	'CONTINENT_CODE','SWIFT_MSG_TYPE','AVG_TRXN_AMT','TRANSACTION_AMOUNT','FRAUD_LABEL']) 

  # Output the final data frame size
  n_rows_o = df_f.shape[0]
  n_features_o = df_f.shape[1]
  log.info (f'Output data frame size: Rows = {n_rows_o}, Columns = {n_features_o}')

  return df_f

In [12]:
NN_103_score_df = ensemble_fun (NN_103_df, n_neighbors_n=20, leaf_size=30, pairwise_n=2, n_estimators_n=100, random_state_n=42, contamination_n='auto')
NN_103_score_df

2021-08-07 00:41:35-04:00 : Input data frame size: Rows = 597651, Columns = 2
2021-08-07 00:41:41-04:00 : Fitting LOF
2021-08-07 00:41:41-04:00 : LOF offset_ = -1.5
2021-08-07 00:41:41-04:00 : LOF is fitted successfully with 597651 scores
2021-08-07 00:42:04-04:00 : Fitting iForest
2021-08-07 00:42:04-04:00 : iForest offset_ = -0.5
2021-08-07 00:42:33-04:00 : iForest is fitted successfully with 597651 scores
2021-08-07 00:42:33-04:00 : Minimum Score = 0.006274848122326602, Maximum Score = 1.0
2021-08-07 00:42:33-04:00 : Real offset: 0.1
2021-08-07 00:42:33-04:00 : Percentage of suspicious transactions: 0.010405738466094761
2021-08-07 00:42:34-04:00 : Output data frame size: Rows = 597651, Columns = 12


Unnamed: 0,TRANSACTION_ID,TRANSACTION_TIME,TRXN_MONTH,CLIENT_ID,COUNTRY_NAME,COUNTRY_CODE,CONTINENT_NAME,CONTINENT_CODE,SWIFT_MSG_TYPE,AVG_TRXN_AMT,TRANSACTION_AMOUNT,FRAUD_LABEL
0,11840677,2020-12-09 13:52:20,12,7116485839,United States of America,US,North America,NN,103,105235,1.07625e+06,0
1,8097698,2020-08-26 13:31:42,8,7117321565,United States of America,US,North America,NN,103,112149,7035.95,0
2,9906741,2020-10-16 10:06:55,10,6249091671,United States of America,US,North America,NN,103,111656,508440,0
3,7130576,2020-07-29 09:16:55,7,6249205053,United States of America,US,North America,NN,103,116543,161700,0
4,1780716,2020-02-25 12:40:18,2,7117321565,United States of America,US,North America,NN,103,137211,262892,0
...,...,...,...,...,...,...,...,...,...,...,...,...
597646,11912924,2020-12-11 09:00:58,12,7116485839,United States of America,US,North America,NN,103,105235,1829.03,0
597647,2411411,2020-03-11 15:34:01,3,6249091671,United States of America,US,North America,NN,103,260830,107040,0
597648,8770765,2020-09-15 10:05:43,9,7117321565,United States of America,US,North America,NN,103,107629,580661,0
597649,10019468,2020-10-20 14:40:45,10,7116485839,United States of America,US,North America,NN,103,111656,9147.1,0


In [13]:
NN_202_score_df = ensemble_fun (NN_202_df, n_neighbors_n=20, leaf_size=30, pairwise_n=2, n_estimators_n=100, random_state_n=42, contamination_n='auto')
NN_202_score_df

2021-08-07 00:42:34-04:00 : Input data frame size: Rows = 1311380, Columns = 2
2021-08-07 00:42:53-04:00 : Fitting LOF
2021-08-07 00:42:53-04:00 : LOF offset_ = -1.5
2021-08-07 00:42:53-04:00 : LOF is fitted successfully with 1311380 scores
2021-08-07 00:43:53-04:00 : Fitting iForest
2021-08-07 00:43:53-04:00 : iForest offset_ = -0.5
2021-08-07 00:45:00-04:00 : iForest is fitted successfully with 1311380 scores
2021-08-07 00:45:00-04:00 : Minimum Score = 0.00031350593439579755, Maximum Score = 1.0
2021-08-07 00:45:00-04:00 : Real offset: 0.1
2021-08-07 00:45:00-04:00 : Percentage of suspicious transactions: 0.009917796519696808
2021-08-07 00:45:01-04:00 : Output data frame size: Rows = 1311380, Columns = 12


Unnamed: 0,TRANSACTION_ID,TRANSACTION_TIME,TRXN_MONTH,CLIENT_ID,COUNTRY_NAME,COUNTRY_CODE,CONTINENT_NAME,CONTINENT_CODE,SWIFT_MSG_TYPE,AVG_TRXN_AMT,TRANSACTION_AMOUNT,FRAUD_LABEL
0,3174204,2020-03-31 18:21:17,3,7116490843,United States of America,US,North America,NN,202,39246.1,6475.35,0
1,2332371,2020-03-10 05:15:07,3,6249399616,United States of America,US,North America,NN,202,39246.1,1784,0
2,7295929,2020-07-31 17:23:36,7,7116490843,United States of America,US,North America,NN,202,29569.3,2446.45,0
3,7561204,2020-08-10 11:15:14,8,6249399616,United States of America,US,North America,NN,202,26404.8,535.2,0
4,9336515,2020-09-30 15:41:35,9,7116490843,United States of America,US,North America,NN,202,29261.3,214944,0
...,...,...,...,...,...,...,...,...,...,...,...,...
1311375,1659882,2020-02-20 19:44:22,2,7117147073,United States of America,US,North America,NN,202,24951.2,8978.25,0
1311376,8958816,2020-09-21 11:30:30,9,7116485839,United States of America,US,North America,NN,202,29261.3,18187.4,0
1311377,5101009,2020-05-29 13:02:50,5,7116490843,United States of America,US,North America,NN,202,28071.2,10754.7,0
1311378,6491516,2020-07-08 14:32:18,7,7116485839,United States of America,US,North America,NN,202,29569.3,3468.51,0


In [14]:
EU_103_score_df = ensemble_fun (EU_103_df, n_neighbors_n=20, leaf_size=30, pairwise_n=2, n_estimators_n=100, random_state_n=42, contamination_n='auto')
EU_103_score_df

2021-08-07 00:45:01-04:00 : Input data frame size: Rows = 627223, Columns = 2
2021-08-07 00:45:09-04:00 : Fitting LOF
2021-08-07 00:45:09-04:00 : LOF offset_ = -1.5
2021-08-07 00:45:09-04:00 : LOF is fitted successfully with 627223 scores
2021-08-07 00:45:34-04:00 : Fitting iForest
2021-08-07 00:45:34-04:00 : iForest offset_ = -0.5
2021-08-07 00:46:07-04:00 : iForest is fitted successfully with 627223 scores
2021-08-07 00:46:07-04:00 : Minimum Score = 0.00633175296897441, Maximum Score = 1.0
2021-08-07 00:46:07-04:00 : Real offset: 0.1
2021-08-07 00:46:07-04:00 : Percentage of suspicious transactions: 0.00790946760562033
2021-08-07 00:46:08-04:00 : Output data frame size: Rows = 627223, Columns = 12


Unnamed: 0,TRANSACTION_ID,TRANSACTION_TIME,TRXN_MONTH,CLIENT_ID,COUNTRY_NAME,COUNTRY_CODE,CONTINENT_NAME,CONTINENT_CODE,SWIFT_MSG_TYPE,AVG_TRXN_AMT,TRANSACTION_AMOUNT,FRAUD_LABEL
0,5556094,2020-06-11 13:46:22,6,7117396344,Switzerland-Swiss Confederation,CH,Europe,EU,103,124854,8.92e+06,0
1,10345183,2020-10-29 19:53:30,10,7116359374,United Kingdom of Great Britain & Northern Ire...,GB,Europe,EU,103,97166.7,5.05614e+07,0
2,4995226,2020-05-27 14:33:52,5,6249091671,United Kingdom of Great Britain & Northern Ire...,GB,Europe,EU,103,121981,1.05256e+06,0
3,8910457,2020-09-18 10:46:20,9,7117258150,United Kingdom of Great Britain & Northern Ire...,GB,Europe,EU,103,102882,129.96,0
4,7511379,2020-08-07 10:05:42,8,6249020158,Belgium-Kingdom of,BE,Europe,EU,103,110260,1135.8,0
...,...,...,...,...,...,...,...,...,...,...,...,...
627218,2040396,2020-03-02 09:02:42,3,7116042352,Luxembourg-Grand Duchy of,LU,Europe,EU,103,211850,343522,0
627219,10743216,2020-11-09 13:49:20,11,6249091671,Luxembourg-Grand Duchy of,LU,Europe,EU,103,102641,22597.3,0
627220,557003,2020-01-17 14:30:58,1,7116378678,Switzerland-Swiss Confederation,CH,Europe,EU,103,132449,37107.2,0
627221,10956655,2020-11-16 10:07:39,11,6249091671,Luxembourg-Grand Duchy of,LU,Europe,EU,103,102641,2382.71,0


In [15]:
EU_202_score_df = ensemble_fun (EU_202_df, n_neighbors_n=20, leaf_size=30, pairwise_n=2, n_estimators_n=100, random_state_n=42, contamination_n='auto')
EU_202_score_df

2021-08-07 00:46:08-04:00 : Input data frame size: Rows = 320828, Columns = 2
2021-08-07 00:46:10-04:00 : Fitting LOF
2021-08-07 00:46:10-04:00 : LOF offset_ = -1.5
2021-08-07 00:46:10-04:00 : LOF is fitted successfully with 320828 scores
2021-08-07 00:46:22-04:00 : Fitting iForest
2021-08-07 00:46:22-04:00 : iForest offset_ = -0.5
2021-08-07 00:46:37-04:00 : iForest is fitted successfully with 320828 scores
2021-08-07 00:46:37-04:00 : Minimum Score = 0.01020377945960581, Maximum Score = 1.0
2021-08-07 00:46:37-04:00 : Real offset: 0.1
2021-08-07 00:46:37-04:00 : Percentage of suspicious transactions: 0.008450010597578765
2021-08-07 00:46:37-04:00 : Output data frame size: Rows = 320828, Columns = 12


Unnamed: 0,TRANSACTION_ID,TRANSACTION_TIME,TRXN_MONTH,CLIENT_ID,COUNTRY_NAME,COUNTRY_CODE,CONTINENT_NAME,CONTINENT_CODE,SWIFT_MSG_TYPE,AVG_TRXN_AMT,TRANSACTION_AMOUNT,FRAUD_LABEL
0,489247,2020-01-16 10:00:53,1,7116370821,Russian Federation,RU,Europe,EU,202,24619.2,53520,0
1,5629788,2020-06-15 10:05:49,6,6249020158,United Kingdom of Great Britain & Northern Ire...,GB,Europe,EU,202,21781.8,96002.8,0
2,4812774,2020-05-20 23:36:28,5,6249354289,Slovenia-Republic of,SI,Europe,EU,202,22197.6,12588.1,0
3,7339351,2020-08-03 10:53:32,8,7116055826,United Kingdom of Great Britain & Northern Ire...,GB,Europe,EU,202,20550.9,792.1,0
4,3345192,2020-04-06 00:01:33,4,6249328247,Latvia-Republic of,LV,Europe,EU,202,23141.3,2428.93,0
...,...,...,...,...,...,...,...,...,...,...,...,...
320823,6390086,2020-07-06 09:08:33,7,7116055826,Turkey-Republic of,TR,Europe,EU,202,22349.2,3568,0
320824,11023067,2020-11-17 16:31:11,11,7116005194,United Kingdom of Great Britain & Northern Ire...,GB,Europe,EU,202,22103.8,200700,0
320825,6765555,2020-07-16 23:43:37,7,6249184766,United Kingdom of Great Britain & Northern Ire...,GB,Europe,EU,202,22349.2,89209.5,0
320826,23280,2020-01-02 11:45:28,1,7116569119,Portugal-Portuguese Republic,PT,Europe,EU,202,24619.2,1784,0


In [16]:
AS_103_score_df = ensemble_fun (AS_103_df, n_neighbors_n=20, leaf_size=30, pairwise_n=2, n_estimators_n=100, random_state_n=42, contamination_n='auto')
AS_103_score_df

2021-08-07 00:46:37-04:00 : Input data frame size: Rows = 627223, Columns = 2
2021-08-07 00:46:45-04:00 : Fitting LOF
2021-08-07 00:46:45-04:00 : LOF offset_ = -1.5
2021-08-07 00:46:45-04:00 : LOF is fitted successfully with 627223 scores
2021-08-07 00:47:11-04:00 : Fitting iForest
2021-08-07 00:47:11-04:00 : iForest offset_ = -0.5
2021-08-07 00:47:44-04:00 : iForest is fitted successfully with 627223 scores
2021-08-07 00:47:44-04:00 : Minimum Score = 0.00633175296897441, Maximum Score = 1.0
2021-08-07 00:47:44-04:00 : Real offset: 0.1
2021-08-07 00:47:44-04:00 : Percentage of suspicious transactions: 0.00790946760562033
2021-08-07 00:47:44-04:00 : Output data frame size: Rows = 627223, Columns = 12


Unnamed: 0,TRANSACTION_ID,TRANSACTION_TIME,TRXN_MONTH,CLIENT_ID,COUNTRY_NAME,COUNTRY_CODE,CONTINENT_NAME,CONTINENT_CODE,SWIFT_MSG_TYPE,AVG_TRXN_AMT,TRANSACTION_AMOUNT,FRAUD_LABEL
0,5556094,2020-06-11 13:46:22,6,7117396344,Switzerland-Swiss Confederation,CH,Europe,EU,103,124854,8.92e+06,0
1,10345183,2020-10-29 19:53:30,10,7116359374,United Kingdom of Great Britain & Northern Ire...,GB,Europe,EU,103,97166.7,5.05614e+07,0
2,4995226,2020-05-27 14:33:52,5,6249091671,United Kingdom of Great Britain & Northern Ire...,GB,Europe,EU,103,121981,1.05256e+06,0
3,8910457,2020-09-18 10:46:20,9,7117258150,United Kingdom of Great Britain & Northern Ire...,GB,Europe,EU,103,102882,129.96,0
4,7511379,2020-08-07 10:05:42,8,6249020158,Belgium-Kingdom of,BE,Europe,EU,103,110260,1135.8,0
...,...,...,...,...,...,...,...,...,...,...,...,...
627218,2040396,2020-03-02 09:02:42,3,7116042352,Luxembourg-Grand Duchy of,LU,Europe,EU,103,211850,343522,0
627219,10743216,2020-11-09 13:49:20,11,6249091671,Luxembourg-Grand Duchy of,LU,Europe,EU,103,102641,22597.3,0
627220,557003,2020-01-17 14:30:58,1,7116378678,Switzerland-Swiss Confederation,CH,Europe,EU,103,132449,37107.2,0
627221,10956655,2020-11-16 10:07:39,11,6249091671,Luxembourg-Grand Duchy of,LU,Europe,EU,103,102641,2382.71,0


In [17]:
AS_202_score_df = ensemble_fun (AS_202_df, n_neighbors_n=20, leaf_size=30, pairwise_n=2, n_estimators_n=100, random_state_n=42, contamination_n='auto')
AS_202_score_df

2021-08-07 00:47:45-04:00 : Input data frame size: Rows = 320828, Columns = 2
2021-08-07 00:47:47-04:00 : Fitting LOF
2021-08-07 00:47:47-04:00 : LOF offset_ = -1.5
2021-08-07 00:47:47-04:00 : LOF is fitted successfully with 320828 scores
2021-08-07 00:47:58-04:00 : Fitting iForest
2021-08-07 00:47:58-04:00 : iForest offset_ = -0.5
2021-08-07 00:48:13-04:00 : iForest is fitted successfully with 320828 scores
2021-08-07 00:48:13-04:00 : Minimum Score = 0.01020377945960581, Maximum Score = 1.0
2021-08-07 00:48:13-04:00 : Real offset: 0.1
2021-08-07 00:48:13-04:00 : Percentage of suspicious transactions: 0.008450010597578765
2021-08-07 00:48:13-04:00 : Output data frame size: Rows = 320828, Columns = 12


Unnamed: 0,TRANSACTION_ID,TRANSACTION_TIME,TRXN_MONTH,CLIENT_ID,COUNTRY_NAME,COUNTRY_CODE,CONTINENT_NAME,CONTINENT_CODE,SWIFT_MSG_TYPE,AVG_TRXN_AMT,TRANSACTION_AMOUNT,FRAUD_LABEL
0,489247,2020-01-16 10:00:53,1,7116370821,Russian Federation,RU,Europe,EU,202,24619.2,53520,0
1,5629788,2020-06-15 10:05:49,6,6249020158,United Kingdom of Great Britain & Northern Ire...,GB,Europe,EU,202,21781.8,96002.8,0
2,4812774,2020-05-20 23:36:28,5,6249354289,Slovenia-Republic of,SI,Europe,EU,202,22197.6,12588.1,0
3,7339351,2020-08-03 10:53:32,8,7116055826,United Kingdom of Great Britain & Northern Ire...,GB,Europe,EU,202,20550.9,792.1,0
4,3345192,2020-04-06 00:01:33,4,6249328247,Latvia-Republic of,LV,Europe,EU,202,23141.3,2428.93,0
...,...,...,...,...,...,...,...,...,...,...,...,...
320823,6390086,2020-07-06 09:08:33,7,7116055826,Turkey-Republic of,TR,Europe,EU,202,22349.2,3568,0
320824,11023067,2020-11-17 16:31:11,11,7116005194,United Kingdom of Great Britain & Northern Ire...,GB,Europe,EU,202,22103.8,200700,0
320825,6765555,2020-07-16 23:43:37,7,6249184766,United Kingdom of Great Britain & Northern Ire...,GB,Europe,EU,202,22349.2,89209.5,0
320826,23280,2020-01-02 11:45:28,1,7116569119,Portugal-Portuguese Republic,PT,Europe,EU,202,24619.2,1784,0


## **Backup Data Frames for Supervised Machine Learning**

In [18]:
NN_103_score_df.to_csv('/content/gdrive/MyDrive/DSC-380/NN_103_score_df.cvs',index=False)
NN_202_score_df.to_csv('/content/gdrive/MyDrive/DSC-380/NN_202_score_df.cvs',index=False)
EU_103_score_df.to_csv('/content/gdrive/MyDrive/DSC-380/EU_103_score_df.cvs',index=False)
EU_202_score_df.to_csv('/content/gdrive/MyDrive/DSC-380/EU_202_score_df.cvs',index=False)
AS_103_score_df.to_csv('/content/gdrive/MyDrive/DSC-380/AS_103_score_df.cvs',index=False)
AS_202_score_df.to_csv('/content/gdrive/MyDrive/DSC-380/AS_202_score_df.cvs',index=False)