<a href="https://colab.research.google.com/github/henryjhu/Anomaly-Detection-in-Wire-Activities/blob/main/DSC_680_Unsupervised_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **DSC-680-Z1 Research Practicum** <BR> Machine Learning

## **Project Description**
The research practicum involves on-site experiential learning in a research setting. This setting may be in the private or public sector, it may include such locations as education, governmental, non-governmental, or general
research organization. The experience must provide students the opportunity to collect and analyze data, consider ethical implications of research, and draw empirically grounded conclusions.

<b>Purpose:</b><br>
Carry out unsupervised machine learnings with the sample data.<br>
<b>Universtiy Name:</b> Utica College <br>
<b>Course Name:</b> DSC-680-Z1 Research Practicum <br>
<b>Student Name:</b> Henry J. Hu <br>
<b>Program Director Name:</b> Dr. McCarthy, Michael <br>
<b>Runtime Environment:</b> Google Colab<br>
<b>Programming Language:</b> Python <br>
<b>Sample Data Frame:</b>
A random sample of international wires belonging to 139 customers from 3 continents for the entire year of 2020.<br>
<b> Last Update:</b> August 5th, 2021

## **Mounting Google Drive**

In [1]:
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)

Mounted at /content/gdrive/


## **Importing Libraries**

In [2]:
# Importing libraries
import io
import pandas as pd
import numpy as np
from numpy import quantile, where, random
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
import traceback
import time
from datetime import datetime
import pytz
from sklearn import model_selection, preprocessing
from sklearn.neighbors import LocalOutlierFactor
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import IsolationForest, VotingClassifier, StackingClassifier
from sklearn.datasets import make_blobs
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import cross_val_score, KFold
from scipy.stats import rankdata

## **Importing Data Into Google Colab**

In [3]:
# Importing data and looking at head
input_data = pd.read_csv("gdrive/MyDrive/DSC-380/sample_df_4M.txt")
input_data.head()

Unnamed: 0,TRANSACTION_ID,TRANSACTION_TIME,TRXN_MONTH,CLIENT_ID,COUNTRY_NAME,COUNTRY_CODE,CONTINENT_NAME,CONTINENT_CODE,SWIFT_MSG_TYPE,AVG_TRXN_AMT,TRANSACTION_AMOUNT
0,4349182.0,2020-05-06 09:27:19,5,6249012147,United States of America,US,North America,NN,202,12445770.0,1139.98
1,4919379.0,2020-05-26 08:16:53,5,6249328247,United Kingdom of Great Britain & Northern Ire...,GB,Europe,EU,103,16644110.0,10704000.0
2,11969294.0,2020-12-14 10:07:03,12,6249263302,United States of America,US,North America,NN,202,9503760.0,7582.0
3,2219769.0,2020-03-05 15:20:16,3,7116485839,United States of America,US,North America,NN,103,29508470.0,363.78
4,852566.0,2020-01-29 00:49:09,1,6249152840,Thailand-Kingdom of,TH,Asia,AS,202,250325.2,6935.62


## **Data Segregation**

In [4]:
NN_103_df = input_data[(input_data['CONTINENT_CODE']=='NN') & (input_data['SWIFT_MSG_TYPE']==103)]
NN_103_df.head()
NN_103_df.shape

(598565, 11)

In [5]:
NN_202_df = input_data[(input_data['CONTINENT_CODE']=='NN') & (input_data['SWIFT_MSG_TYPE']==202)]
NN_202_df.head()
NN_202_df.shape

(1310612, 11)

In [6]:
EU_103_df = input_data[(input_data['CONTINENT_CODE']=='EU') & (input_data['SWIFT_MSG_TYPE']==103)]
EU_103_df.head()
EU_103_df.shape

(629564, 11)

In [7]:
EU_202_df = input_data[(input_data['CONTINENT_CODE']=='EU') & (input_data['SWIFT_MSG_TYPE']==202)]
EU_202_df.head()
EU_202_df.shape

(319776, 11)

In [8]:
AS_103_df = input_data[(input_data['CONTINENT_CODE']=='EU') & (input_data['SWIFT_MSG_TYPE']==103)]
AS_103_df.head()
AS_103_df.shape

(629564, 11)

In [9]:
AS_202_df = input_data[(input_data['CONTINENT_CODE']=='EU') & (input_data['SWIFT_MSG_TYPE']==202)]
AS_202_df.head()
AS_202_df.shape

(319776, 11)

## **Unsupervised Ensemble Learner**

In [11]:
################################################################################
#
# Purpose: Function to rank the outlier scores
#
################################################################################

def rank_fun(arr):
    return rankdata(arr, method = 'dense')

In [12]:
####################################################################################################
#
# Purpose: Function to calculate outlier scores for a given input data set.
# Machine learning method: Ensemble learner of Local Outlier Factor and Isolaion Forest.
# Score to fraud label rule: A score smaller than the offset vaLue is fraud and greater 
# than or equal to the offset value is not fraud.
#
####################################################################################################

def ensemble_fun (df, n_neighbors_n=20, leaf_size=30, pairwise_n=2, n_estimators_n=100, random_state_n=42, contamination_n=0.05):

  # Include only the relavent independent varialbes
  X=df[['TRXN_MONTH','TRANSACTION_AMOUNT']]
  
  # Scale and center the data around the mean of 0
  from sklearn.preprocessing import StandardScaler
  scaling=StandardScaler()
  scaling.fit_transform(X)

  # Initialize the log variable
  class log:
    def_tz = pytz.timezone('America/New_York')
    def info(text):        
        print(f'{datetime.now(log.def_tz).replace(microsecond=0)} : {text}');

  # Initialize an enumerate list of estimators
  estimator_list = {
    # novelty=False because this is outlier detection.
    # pairwise_n = 2 for Euclidian distance and 1 for Manhattan distance.
    'LOF':LocalOutlierFactor(novelty=False, n_neighbors=n_neighbors_n, algorithm='auto', leaf_size=30, 
                             metric='minkowski', p=pairwise_n, metric_params=None, contamination=contamination_n),
    'iForest':IsolationForest(n_estimators=n_estimators_n, random_state=random_state_n, max_samples=len(X), contamination=contamination_n)
  }

  # Input data frame size
  n_rows_in = X.shape[0]
  n_features_in = X.shape[1]

  # Initializing score array
  ensemble_scores = np.zeros([n_rows_in, len(estimator_list)])
  final_scores = np.zeros([n_rows_in, 1])

  # Ensemble via score averaging
  log.info (f'Input data frame size: Rows = {n_rows_in}, Columns = {n_features_in}')

  for i, (clf_name, clf) in enumerate(estimator_list.items()):
    try:
        clf.fit(X)
        if clf_name == "LOF":
            log.info(f'Fitting {clf_name}')
            log.info(f'LOF offset_ = {clf.offset_}')
            ensemble_scores[:, i] = clf.negative_outlier_factor_
        else:
            log.info(f'Fitting {clf_name}')
            log.info(f'iForest offset_ = {clf.offset_}')
            ensemble_scores[:, i] = clf.score_samples(X)
    except:
            log.info(traceback.print_exc())
    else:    
            log.info(f'{clf_name} is fitted successfully with {len(ensemble_scores)} scores')  

  # Repalce NaN with 0's
  ensemble_scores=np.nan_to_num(ensemble_scores) 

  # Transforming the outlier scores into ranking values
  ensemble_scores = np.apply_along_axis(rank_fun, 0, ensemble_scores)

  # Normalize the ranking values of both algorithms between 0 and 1
  ensemble_scores = preprocessing.MinMaxScaler().fit_transform(ensemble_scores)

  # Select the maximum of two ranking values
  final_scores = np.max(ensemble_scores, axis = 1)
  # return np.sort(final_scores)

  # Make a copy of final score array
  pred_y = np.copy(final_scores) 

  score_min = np.min(pred_y)
  score_max = np.max(pred_y)

  log.info (f'Minimum Score = {score_min}, Maximum Score = {score_max}')

  # Labeling all scores < offset value as fraud and >=offset value as non-fraud
  offset = 0.1
  pred_y[pred_y < offset] = -999
  pred_y[pred_y >= offset] = 0.0
  pred_y[pred_y == -999] = 1.0

  log.info (f'Real offset: {offset}')

  fraud_ct = np.count_nonzero(pred_y == 1.0)
  fraud_pct = fraud_ct/n_rows_in

  log.info (f'Percentage of suspicious transactions: {fraud_pct}')

  df_arr=df.to_numpy() # Converting the input data frame to array
  df_arr_f=np.column_stack( (df_arr, pred_y)) # Add the scores to the input data array

  # Converting the combinded array back to a data frame
  df_f = pd.DataFrame(df_arr_f, columns = ['TRANSACTION_ID','TRANSACTION_TIME','TRXN_MONTH','CLIENT_ID','COUNTRY_NAME','COUNTRY_CODE','CONTINENT_NAME',	'CONTINENT_CODE','SWIFT_MSG_TYPE','AVG_TRXN_AMT','TRANSACTION_AMOUNT','FRAUD_LABEL']) 

  # Output data frame size
  n_rows_o = df_f.shape[0]
  n_features_o = df_f.shape[1]
  log.info (f'Output data frame size: Rows = {n_rows_o}, Columns = {n_features_o}')

  return df_f

In [13]:
NN_103_score_df = ensemble_fun (NN_103_df, n_neighbors_n=20, leaf_size=30, pairwise_n=2, n_estimators_n=100, random_state_n=42, contamination_n='auto')
NN_103_score_df

2021-08-05 21:11:30-04:00 : Input data frame size: Rows = 598565, Columns = 2
2021-08-05 21:11:35-04:00 : Fitting LOF
2021-08-05 21:11:35-04:00 : LOF offset_ = -1.5
2021-08-05 21:11:35-04:00 : LOF is fitted successfully with 598565 scores
2021-08-05 21:11:58-04:00 : Fitting iForest
2021-08-05 21:11:58-04:00 : iForest offset_ = -0.5
2021-08-05 21:12:27-04:00 : iForest is fitted successfully with 598565 scores
2021-08-05 21:12:27-04:00 : Minimum Score = 0.0039362188249064045, Maximum Score = 1.0
2021-08-05 21:12:27-04:00 : Real offset: 0.1
2021-08-05 21:12:27-04:00 : Percentage of suspicious transactions: 0.010558585951400433
2021-08-05 21:12:27-04:00 : Output data frame size: Rows = 598565, Columns = 12


Unnamed: 0,TRANSACTION_ID,TRANSACTION_TIME,TRXN_MONTH,CLIENT_ID,COUNTRY_NAME,COUNTRY_CODE,CONTINENT_NAME,CONTINENT_CODE,SWIFT_MSG_TYPE,AVG_TRXN_AMT,TRANSACTION_AMOUNT,FRAUD_LABEL
0,2.21977e+06,2020-03-05 15:20:16,3,7116485839,United States of America,US,North America,NN,103,2.95085e+07,363.78,0
1,2.2639e+06,2020-03-06 13:52:32,3,7117321565,United States of America,US,North America,NN,103,2.95085e+07,499520,0
2,5.65122e+06,2020-06-15 15:56:46,6,7117321565,United States of America,US,North America,NN,103,2.20932e+07,969381,0
3,4.3466e+06,2020-05-06 09:00:19,5,7116485839,United States of America,US,North America,NN,103,2.44622e+07,751.9,0
4,6.01167e+06,2020-06-26 09:00:56,6,7116485839,United States of America,US,North America,NN,103,2.20932e+07,17934.1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
598560,1.62415e+06,2020-02-20 05:55:58,2,6249060808,United States of America,US,North America,NN,103,2.26175e+07,1517.06,0
598561,1.16876e+07,2020-12-04 10:06:22,12,7117321565,United States of America,US,North America,NN,103,2.53072e+07,5.01419e+06,0
598562,1.11466e+07,2020-11-20 10:53:26,11,6249328247,United States of America,US,North America,NN,103,2.44221e+07,4461.78,0
598563,3.74413e+06,2020-04-17 20:38:31,4,7116053591,United States of America,US,North America,NN,103,2.44289e+07,19.52,0


In [14]:
NN_202_score_df = ensemble_fun (NN_202_df, n_neighbors_n=20, leaf_size=30, pairwise_n=2, n_estimators_n=100, random_state_n=42, contamination_n='auto')
NN_202_score_df

2021-08-05 21:12:28-04:00 : Input data frame size: Rows = 1310612, Columns = 2
2021-08-05 21:12:44-04:00 : Fitting LOF
2021-08-05 21:12:44-04:00 : LOF offset_ = -1.5
2021-08-05 21:12:44-04:00 : LOF is fitted successfully with 1310612 scores
2021-08-05 21:13:44-04:00 : Fitting iForest
2021-08-05 21:13:44-04:00 : iForest offset_ = -0.5
2021-08-05 21:14:49-04:00 : iForest is fitted successfully with 1310612 scores
2021-08-05 21:14:49-04:00 : Minimum Score = 0.0003795168627911874, Maximum Score = 1.0
2021-08-05 21:14:49-04:00 : Real offset: 0.1
2021-08-05 21:14:49-04:00 : Percentage of suspicious transactions: 0.009962521325914917
2021-08-05 21:14:50-04:00 : Output data frame size: Rows = 1310612, Columns = 12


Unnamed: 0,TRANSACTION_ID,TRANSACTION_TIME,TRXN_MONTH,CLIENT_ID,COUNTRY_NAME,COUNTRY_CODE,CONTINENT_NAME,CONTINENT_CODE,SWIFT_MSG_TYPE,AVG_TRXN_AMT,TRANSACTION_AMOUNT,FRAUD_LABEL
0,4.34918e+06,2020-05-06 09:27:19,5,6249012147,United States of America,US,North America,NN,202,1.24458e+07,1139.98,0
1,1.19693e+07,2020-12-14 10:07:03,12,6249263302,United States of America,US,North America,NN,202,9.50376e+06,7582,0
2,3.92412e+06,2020-04-24 04:12:42,4,7116480821,United States of America,US,North America,NN,202,1.2558e+07,4460,0
3,1.22986e+07,2020-12-22 11:30:41,12,6249263302,United States of America,US,North America,NN,202,9.50376e+06,2140.8,0
4,7.13839e+06,2020-07-29 11:00:25,7,6249209984,United States of America,US,North America,NN,202,1.04083e+07,4040.81,0
...,...,...,...,...,...,...,...,...,...,...,...,...
1310607,1.88389e+06,2020-02-27 15:00:29,2,7116005194,United States of America,US,North America,NN,202,9.49141e+06,19802.4,0
1310608,5.51978e+06,2020-06-10 17:12:20,6,7116490843,United States of America,US,North America,NN,202,1.03853e+07,737.56,0
1310609,3.16411e+06,2020-03-31 17:42:26,3,7116490843,United States of America,US,North America,NN,202,1.29805e+07,25878.5,0
1310610,3.6769e+06,2020-04-16 11:30:26,4,6249263302,United States of America,US,North America,NN,202,1.2558e+07,5832.82,0


In [15]:
EU_103_score_df = ensemble_fun (EU_103_df, n_neighbors_n=20, leaf_size=30, pairwise_n=2, n_estimators_n=100, random_state_n=42, contamination_n='auto')
EU_103_score_df

2021-08-05 21:14:50-04:00 : Input data frame size: Rows = 629564, Columns = 2
2021-08-05 21:14:58-04:00 : Fitting LOF
2021-08-05 21:14:58-04:00 : LOF offset_ = -1.5
2021-08-05 21:14:58-04:00 : LOF is fitted successfully with 629564 scores
2021-08-05 21:15:23-04:00 : Fitting iForest
2021-08-05 21:15:23-04:00 : iForest offset_ = -0.5
2021-08-05 21:15:54-04:00 : iForest is fitted successfully with 629564 scores
2021-08-05 21:15:54-04:00 : Minimum Score = 0.004759943060149855, Maximum Score = 1.0
2021-08-05 21:15:54-04:00 : Real offset: 0.1
2021-08-05 21:15:54-04:00 : Percentage of suspicious transactions: 0.00726693394158497
2021-08-05 21:15:55-04:00 : Output data frame size: Rows = 629564, Columns = 12


Unnamed: 0,TRANSACTION_ID,TRANSACTION_TIME,TRXN_MONTH,CLIENT_ID,COUNTRY_NAME,COUNTRY_CODE,CONTINENT_NAME,CONTINENT_CODE,SWIFT_MSG_TYPE,AVG_TRXN_AMT,TRANSACTION_AMOUNT,FRAUD_LABEL
0,4.91938e+06,2020-05-26 08:16:53,5,6249328247,United Kingdom of Great Britain & Northern Ire...,GB,Europe,EU,103,1.66441e+07,1.0704e+07,0
1,2.40986e+06,2020-03-11 15:01:04,3,7116378678,Switzerland-Swiss Confederation,CH,Europe,EU,103,1.83461e+07,1.2017e+06,0
2,5.42015e+06,2020-06-08 11:58:03,6,7116490816,Turkey-Republic of,TR,Europe,EU,103,1.88575e+07,9.812e+06,0
3,2.2572e+06,2020-03-06 12:01:37,3,6249340315,Switzerland-Swiss Confederation,CH,Europe,EU,103,1.83461e+07,8.92e+06,0
4,2.00572e+06,2020-02-28 17:47:55,2,7117321565,Switzerland-Swiss Confederation,CH,Europe,EU,103,1.73952e+07,485847,0
...,...,...,...,...,...,...,...,...,...,...,...,...
629559,4.41096e+06,2020-05-07 16:30:45,5,6249340315,Switzerland-Swiss Confederation,CH,Europe,EU,103,1.66441e+07,9.01472e+06,0
629560,7.72331e+06,2020-08-14 09:01:20,8,7116485839,Luxembourg-Grand Duchy of,LU,Europe,EU,103,1.78571e+07,53445.5,0
629561,6.07791e+06,2020-06-29 13:20:29,6,7117396344,Germany-Federal Republic of,DE,Europe,EU,103,1.88575e+07,8920,0
629562,4.55102e+06,2020-05-13 02:23:45,5,6249328247,Sweden-Kingdom of,SE,Europe,EU,103,1.66441e+07,306.85,0


In [16]:
EU_202_score_df = ensemble_fun (EU_202_df, n_neighbors_n=20, leaf_size=30, pairwise_n=2, n_estimators_n=100, random_state_n=42, contamination_n='auto')
EU_202_score_df

2021-08-05 21:15:55-04:00 : Input data frame size: Rows = 319776, Columns = 2
2021-08-05 21:15:57-04:00 : Fitting LOF
2021-08-05 21:15:57-04:00 : LOF offset_ = -1.5
2021-08-05 21:15:57-04:00 : LOF is fitted successfully with 319776 scores
2021-08-05 21:16:08-04:00 : Fitting iForest
2021-08-05 21:16:08-04:00 : iForest offset_ = -0.5
2021-08-05 21:16:22-04:00 : iForest is fitted successfully with 319776 scores
2021-08-05 21:16:22-04:00 : Minimum Score = 0.010295808286704537, Maximum Score = 1.0
2021-08-05 21:16:22-04:00 : Real offset: 0.1
2021-08-05 21:16:22-04:00 : Percentage of suspicious transactions: 0.008962523766636645
2021-08-05 21:16:23-04:00 : Output data frame size: Rows = 319776, Columns = 12


Unnamed: 0,TRANSACTION_ID,TRANSACTION_TIME,TRXN_MONTH,CLIENT_ID,COUNTRY_NAME,COUNTRY_CODE,CONTINENT_NAME,CONTINENT_CODE,SWIFT_MSG_TYPE,AVG_TRXN_AMT,TRANSACTION_AMOUNT,FRAUD_LABEL
0,7.4096e+06,2020-08-04 23:45:19,8,6249328247,United Kingdom of Great Britain & Northern Ire...,GB,Europe,EU,202,1.62311e+06,3058.58,0
1,1.21321e+07,2020-12-17 11:25:00,12,7116055826,Germany-Federal Republic of,DE,Europe,EU,202,1.84935e+06,6206.32,0
2,6.11815e+06,2020-06-30 10:43:28,6,7116055826,Turkey-Republic of,TR,Europe,EU,202,1.89936e+06,16948,0
3,6.45069e+06,2020-07-07 15:11:03,7,7116490843,Luxembourg-Grand Duchy of,LU,Europe,EU,202,1.53813e+06,10301.4,0
4,3.31884e+06,2020-04-03 10:16:05,4,7117147030,United Kingdom of Great Britain & Northern Ire...,GB,Europe,EU,202,1.67219e+06,6.53,0
...,...,...,...,...,...,...,...,...,...,...,...,...
319771,5.78821e+06,2020-06-19 00:31:13,6,7116109809,Netherlands-Kingdom of the,NL,Europe,EU,202,1.89936e+06,2299.58,0
319772,318756,2020-01-12 19:15:19,1,7116038656,Lithuania-Republic of,LT,Europe,EU,202,1.53003e+06,13290.8,0
319773,6.34118e+06,2020-07-02 22:05:45,7,6249354289,Turkey-Republic of,TR,Europe,EU,202,1.53813e+06,597.64,0
319774,1.76161e+06,2020-02-25 07:03:02,2,6249307755,Germany-Federal Republic of,DE,Europe,EU,202,1.41274e+06,169444,0


In [17]:
AS_103_score_df = ensemble_fun (AS_103_df, n_neighbors_n=20, leaf_size=30, pairwise_n=2, n_estimators_n=100, random_state_n=42, contamination_n='auto')
AS_103_score_df

2021-08-05 21:16:23-04:00 : Input data frame size: Rows = 629564, Columns = 2
2021-08-05 21:16:30-04:00 : Fitting LOF
2021-08-05 21:16:30-04:00 : LOF offset_ = -1.5
2021-08-05 21:16:30-04:00 : LOF is fitted successfully with 629564 scores
2021-08-05 21:16:56-04:00 : Fitting iForest
2021-08-05 21:16:56-04:00 : iForest offset_ = -0.5
2021-08-05 21:17:28-04:00 : iForest is fitted successfully with 629564 scores
2021-08-05 21:17:29-04:00 : Minimum Score = 0.004759943060149855, Maximum Score = 1.0
2021-08-05 21:17:29-04:00 : Real offset: 0.1
2021-08-05 21:17:29-04:00 : Percentage of suspicious transactions: 0.00726693394158497
2021-08-05 21:17:29-04:00 : Output data frame size: Rows = 629564, Columns = 12


Unnamed: 0,TRANSACTION_ID,TRANSACTION_TIME,TRXN_MONTH,CLIENT_ID,COUNTRY_NAME,COUNTRY_CODE,CONTINENT_NAME,CONTINENT_CODE,SWIFT_MSG_TYPE,AVG_TRXN_AMT,TRANSACTION_AMOUNT,FRAUD_LABEL
0,4.91938e+06,2020-05-26 08:16:53,5,6249328247,United Kingdom of Great Britain & Northern Ire...,GB,Europe,EU,103,1.66441e+07,1.0704e+07,0
1,2.40986e+06,2020-03-11 15:01:04,3,7116378678,Switzerland-Swiss Confederation,CH,Europe,EU,103,1.83461e+07,1.2017e+06,0
2,5.42015e+06,2020-06-08 11:58:03,6,7116490816,Turkey-Republic of,TR,Europe,EU,103,1.88575e+07,9.812e+06,0
3,2.2572e+06,2020-03-06 12:01:37,3,6249340315,Switzerland-Swiss Confederation,CH,Europe,EU,103,1.83461e+07,8.92e+06,0
4,2.00572e+06,2020-02-28 17:47:55,2,7117321565,Switzerland-Swiss Confederation,CH,Europe,EU,103,1.73952e+07,485847,0
...,...,...,...,...,...,...,...,...,...,...,...,...
629559,4.41096e+06,2020-05-07 16:30:45,5,6249340315,Switzerland-Swiss Confederation,CH,Europe,EU,103,1.66441e+07,9.01472e+06,0
629560,7.72331e+06,2020-08-14 09:01:20,8,7116485839,Luxembourg-Grand Duchy of,LU,Europe,EU,103,1.78571e+07,53445.5,0
629561,6.07791e+06,2020-06-29 13:20:29,6,7117396344,Germany-Federal Republic of,DE,Europe,EU,103,1.88575e+07,8920,0
629562,4.55102e+06,2020-05-13 02:23:45,5,6249328247,Sweden-Kingdom of,SE,Europe,EU,103,1.66441e+07,306.85,0


In [18]:
AS_202_score_df = ensemble_fun (AS_202_df, n_neighbors_n=20, leaf_size=30, pairwise_n=2, n_estimators_n=100, random_state_n=42, contamination_n='auto')
AS_202_score_df

2021-08-05 21:17:29-04:00 : Input data frame size: Rows = 319776, Columns = 2
2021-08-05 21:17:32-04:00 : Fitting LOF
2021-08-05 21:17:32-04:00 : LOF offset_ = -1.5
2021-08-05 21:17:32-04:00 : LOF is fitted successfully with 319776 scores
2021-08-05 21:17:43-04:00 : Fitting iForest
2021-08-05 21:17:43-04:00 : iForest offset_ = -0.5
2021-08-05 21:17:57-04:00 : iForest is fitted successfully with 319776 scores
2021-08-05 21:17:57-04:00 : Minimum Score = 0.010295808286704537, Maximum Score = 1.0
2021-08-05 21:17:57-04:00 : Real offset: 0.1
2021-08-05 21:17:57-04:00 : Percentage of suspicious transactions: 0.008962523766636645
2021-08-05 21:17:58-04:00 : Output data frame size: Rows = 319776, Columns = 12


Unnamed: 0,TRANSACTION_ID,TRANSACTION_TIME,TRXN_MONTH,CLIENT_ID,COUNTRY_NAME,COUNTRY_CODE,CONTINENT_NAME,CONTINENT_CODE,SWIFT_MSG_TYPE,AVG_TRXN_AMT,TRANSACTION_AMOUNT,FRAUD_LABEL
0,7.4096e+06,2020-08-04 23:45:19,8,6249328247,United Kingdom of Great Britain & Northern Ire...,GB,Europe,EU,202,1.62311e+06,3058.58,0
1,1.21321e+07,2020-12-17 11:25:00,12,7116055826,Germany-Federal Republic of,DE,Europe,EU,202,1.84935e+06,6206.32,0
2,6.11815e+06,2020-06-30 10:43:28,6,7116055826,Turkey-Republic of,TR,Europe,EU,202,1.89936e+06,16948,0
3,6.45069e+06,2020-07-07 15:11:03,7,7116490843,Luxembourg-Grand Duchy of,LU,Europe,EU,202,1.53813e+06,10301.4,0
4,3.31884e+06,2020-04-03 10:16:05,4,7117147030,United Kingdom of Great Britain & Northern Ire...,GB,Europe,EU,202,1.67219e+06,6.53,0
...,...,...,...,...,...,...,...,...,...,...,...,...
319771,5.78821e+06,2020-06-19 00:31:13,6,7116109809,Netherlands-Kingdom of the,NL,Europe,EU,202,1.89936e+06,2299.58,0
319772,318756,2020-01-12 19:15:19,1,7116038656,Lithuania-Republic of,LT,Europe,EU,202,1.53003e+06,13290.8,0
319773,6.34118e+06,2020-07-02 22:05:45,7,6249354289,Turkey-Republic of,TR,Europe,EU,202,1.53813e+06,597.64,0
319774,1.76161e+06,2020-02-25 07:03:02,2,6249307755,Germany-Federal Republic of,DE,Europe,EU,202,1.41274e+06,169444,0


## **Backup Data Frames for Supervised Machine Learning**

In [19]:
NN_103_score_df.to_csv('/content/gdrive/MyDrive/NN_103_score_df.cvs',index=False)
NN_202_score_df.to_csv('/content/gdrive/MyDrive/NN_202_score_df.cvs',index=False)
EU_103_score_df.to_csv('/content/gdrive/MyDrive/EU_103_score_df.cvs',index=False)
EU_202_score_df.to_csv('/content/gdrive/MyDrive/EU_202_score_df.cvs',index=False)
AS_103_score_df.to_csv('/content/gdrive/MyDrive/AS_103_score_df.cvs',index=False)
AS_202_score_df.to_csv('/content/gdrive/MyDrive/AS_202_score_df.cvs',index=False)