## Problem Statement
In the banking industry, detecting credit card fraud using machine learning is not just a trend; it is a necessity for banks, as they need to put proactive monitoring and fraud prevention mechanisms in place. Machine learning helps these institutions reduce time-consuming manual reviews, costly chargebacks and fees, and denial of legitimate transactions.

Suppose you are part of the analytics team working on a fraud detection model and its cost-benefit analysis. You need to develop a machine learning model to detect fraudulent transactions based on the historical transactional data of customers with a pool of merchants. You can learn more about transactional data and the creation of historical variables from the link attached here. You may find this helpful in the capstone project while building the fraud detection model. Based on your understanding of the model, you have to analyse the business impact of these fraudulent transactions and recommend the optimal ways that the bank can adopt to mitigate the fraud risks.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
#Importing Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
%matplotlib inline
from scipy.stats import randint

from sklearn import model_selection
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PowerTransformer
#from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import RandomizedSearchCV
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import ADASYN

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn import metrics
from sklearn.metrics import precision_recall_curve, confusion_matrix, accuracy_score

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.float_format', '{:.2f}'.format)



import warnings
warnings.filterwarnings('ignore')

In [None]:
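df_fraud_train=pd.read_csv('fraudTrain.csv')
# The notebook never shows the test data being loaded, although df_fraud_test is used below.
# The file name 'fraudTest.csv' is an assumption based on the train file name.
df_fraud_test=pd.read_csv('fraudTest.csv')
df_fraud_train.head(4)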

Unnamed: 0.1,Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,city,state,zip,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
0,0,2019-01-01 00:00:18,2703186189652095,"fraud_Rippin, Kub and Mann",misc_net,4.97,Jennifer,Banks,F,561 Perry Cove,Moravian Falls,NC,28654.0,36.08,-81.18,3495.0,"Psychologist, counselling",1988-03-09,0b242abb623afc578575680df30655b9,1325376018.0,36.01,-82.05,0.0
1,1,2019-01-01 00:00:44,630423337322,"fraud_Heller, Gutmann and Zieme",grocery_pos,107.23,Stephanie,Gill,F,43039 Riley Greens Suite 393,Orient,WA,99160.0,48.89,-118.21,149.0,Special educational needs teacher,1978-06-21,1f76529f8574734946361c461b024d99,1325376044.0,49.16,-118.19,0.0
2,2,2019-01-01 00:00:51,38859492057661,fraud_Lind-Buckridge,entertainment,220.11,Edward,Sanchez,M,594 White Dale Suite 530,Malad City,ID,83252.0,42.18,-112.26,4154.0,Nature conservation officer,1962-01-19,a1a22d70485983eac12b5b88dad1cf95,1325376051.0,43.15,-112.15,0.0
3,3,2019-01-01 00:01:16,3534093764340240,"fraud_Kutch, Hermiston and Farrell",gas_transport,45.0,Jeremy,White,M,9443 Cynthia Court Apt. 038,Boulder,MT,59632.0,46.23,-112.11,1939.0,Patent attorney,1967-01-12,6b849c168bdad6f867558c3793159a81,1325376076.0,47.03,-112.56,0.0


In [None]:
df_fraud_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31151 entries, 0 to 31150
Data columns (total 23 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             31151 non-null  int64  
 1   trans_date_trans_time  31151 non-null  object 
 2   cc_num                 31151 non-null  int64  
 3   merchant               31151 non-null  object 
 4   category               31151 non-null  object 
 5   amt                    31151 non-null  float64
 6   first                  31151 non-null  object 
 7   last                   31151 non-null  object 
 8   gender                 31151 non-null  object 
 9   street                 31151 non-null  object 
 10  city                   31150 non-null  object 
 11  state                  31150 non-null  object 
 12  zip                    31150 non-null  float64
 13  lat                    31150 non-null  float64
 14  long                   31150 non-null  float64
 15  ci

## Data Cleaning

In [None]:
#Dropping the first column from train as it is of no use
df_fraud_train.drop(df_fraud_train.columns[0],axis=1,inplace=True)

#Dropping the first column from test as it is of no use
df_fraud_test.drop(df_fraud_test.columns[0],axis=1,inplace=True)

#Checking the dataset head
df_fraud_train.head(3)

Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,city,state,zip,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
0,2019-01-01 00:00:18,2703186189652095,"fraud_Rippin, Kub and Mann",misc_net,4.97,Jennifer,Banks,F,561 Perry Cove,Moravian Falls,NC,28654.0,36.08,-81.18,3495.0,"Psychologist, counselling",1988-03-09,0b242abb623afc578575680df30655b9,1325376018.0,36.01,-82.05,0.0
1,2019-01-01 00:00:44,630423337322,"fraud_Heller, Gutmann and Zieme",grocery_pos,107.23,Stephanie,Gill,F,43039 Riley Greens Suite 393,Orient,WA,99160.0,48.89,-118.21,149.0,Special educational needs teacher,1978-06-21,1f76529f8574734946361c461b024d99,1325376044.0,49.16,-118.19,0.0
2,2019-01-01 00:00:51,38859492057661,fraud_Lind-Buckridge,entertainment,220.11,Edward,Sanchez,M,594 White Dale Suite 530,Malad City,ID,83252.0,42.18,-112.26,4154.0,Nature conservation officer,1962-01-19,a1a22d70485983eac12b5b88dad1cf95,1325376051.0,43.15,-112.15,0.0


In [None]:
#Checking the shape for train and test
df_fraud_train.shape, df_fraud_test.shape

((31151, 22), (3901, 22))

In [None]:
#Now we will check the data types for train data
df_fraud_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31151 entries, 0 to 31150
Data columns (total 22 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   trans_date_trans_time  31151 non-null  object 
 1   cc_num                 31151 non-null  int64  
 2   merchant               31151 non-null  object 
 3   category               31151 non-null  object 
 4   amt                    31151 non-null  float64
 5   first                  31151 non-null  object 
 6   last                   31151 non-null  object 
 7   gender                 31151 non-null  object 
 8   street                 31151 non-null  object 
 9   city                   31150 non-null  object 
 10  state                  31150 non-null  object 
 11  zip                    31150 non-null  float64
 12  lat                    31150 non-null  float64
 13  long                   31150 non-null  float64
 14  city_pop               31150 non-null  float64
 15  jo

In [None]:
#Now we will check the data types for test data
df_fraud_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3901 entries, 0 to 3900
Data columns (total 22 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   trans_date_trans_time  3901 non-null   object 
 1   cc_num                 3901 non-null   int64  
 2   merchant               3901 non-null   object 
 3   category               3901 non-null   object 
 4   amt                    3901 non-null   float64
 5   first                  3901 non-null   object 
 6   last                   3901 non-null   object 
 7   gender                 3901 non-null   object 
 8   street                 3901 non-null   object 
 9   city                   3901 non-null   object 
 10  state                  3901 non-null   object 
 11  zip                    3901 non-null   int64  
 12  lat                    3901 non-null   float64
 13  long                   3901 non-null   object 
 14  city_pop               3900 non-null   float64
 15  job 

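- The `trans_date_trans_time` and `dob` columns are stored as `object` rather than a datetime type, so we will convert them to datetime.
- `unix_time` is stored as a plain number, so we will convert it to a timestamp.
- The non-null counts above show that almost every column is complete. Only a handful of rows have missing values (for example, `city` has 31150 non-null entries out of 31151 in the train data, and `city_pop` has 3900 out of 3901 in the test data).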

In [None]:
#Converting the trans_date_trans_time and dob into date time format in train and test data
df_fraud_train['trans_date_trans_time']=pd.to_datetime(df_fraud_train['trans_date_trans_time'])
df_fraud_train['dob']=pd.to_datetime(df_fraud_train['dob'])

df_fraud_test['trans_date_trans_time']=pd.to_datetime(df_fraud_test['trans_date_trans_time'])
df_fraud_test['dob']=pd.to_datetime(df_fraud_test['dob'])


In [None]:
# Creating a new timestamp column from unix_time (seconds since the epoch) in train and test data
# pd.to_datetime maps missing values to NaT, whereas datetime.utcfromtimestamp raises ValueError on NaN

df_fraud_train['Unix_Time_Stamp']=pd.to_datetime(df_fraud_train['unix_time'], unit='s')

df_fraud_test['Unix_Time_Stamp']=pd.to_datetime(df_fraud_test['unix_time'], unit='s')

#Dropping unix_time from the train and test data
df_fraud_train.drop('unix_time', axis=1, inplace=True)
df_fraud_test.drop('unix_time', axis=1, inplace=True)

ValueError: ignored
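
The conversion can be checked on a single value with the standard library alone. The sketch below decodes the `unix_time` of the first sample row shown in the `head()` output earlier; the variable names are illustrative and not part of the notebook.

```python
from datetime import datetime, timezone

# unix_time of the first sample row shown in the head() output
first_unix_time = 1325376018

# Interpret the value as seconds since 1970-01-01 00:00:00 UTC
decoded = datetime.fromtimestamp(first_unix_time, tz=timezone.utc)
print(decoded.isoformat())  # 2012-01-01T00:00:18+00:00
```

The time of day matches `trans_date_trans_time` for that row (00:00:18), but the decoded year is 2012 while the transaction column reads 2019, so the two columns should not be assumed to be interchangeable.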

In [None]:
#Checking the null values in train data

100*df_fraud_train.isnull().mean()

trans_date_trans_time   0.00
cc_num                  0.00
merchant                0.00
category                0.00
amt                     0.00
first                   0.00
last                    0.00
gender                  0.00
street                  0.00
city                    0.00
state                   0.00
zip                     0.00
lat                     0.00
long                    0.00
city_pop                0.00
job                     0.00
dob                     0.00
trans_num               0.00
unix_time               0.00
merch_lat               0.00
merch_long              0.00
is_fraud                0.00
dtype: float64

In [None]:
#Checking the null values in test data

100*df_fraud_test.isnull().mean()

trans_date_trans_time   0.00
cc_num                  0.00
merchant                0.00
category                0.00
amt                     0.00
first                   0.00
last                    0.00
gender                  0.00
street                  0.00
city                    0.00
state                   0.00
zip                     0.00
lat                     0.00
long                    0.00
city_pop                0.03
job                     0.03
dob                     0.03
trans_num               0.03
unix_time               0.03
merch_lat               0.03
merch_long              0.03
is_fraud                0.03
dtype: float64

## Feature Engineering

In [None]:
#There are separate columns for first name and last name
#We will merge them into a single full-name column in train and test data
df_fraud_train['Customer_Full_Name']=df_fraud_train['first']+' '+df_fraud_train['last']

df_fraud_test['Customer_Full_Name']=df_fraud_test['first']+' '+df_fraud_test['last']

In [None]:
#Dropping the redundant first and last name col from train and test data
df_fraud_train.drop(['first','last'],axis=1, inplace=True)

df_fraud_test.drop(['first','last'],axis=1, inplace=True)

In [None]:
# Now we will create the age col by subtracting the birth year from the transaction year in train and test data
df_fraud_train['Customer_Age']=df_fraud_train['trans_date_trans_time'].dt.year - df_fraud_train['dob'].dt.year

df_fraud_test['Customer_Age']=df_fraud_test['trans_date_trans_time'].dt.year - df_fraud_test['dob'].dt.year
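
Subtracting calendar years counts a customer as one year older than they are whenever the transaction falls before that year's birthday. If exact age matters, a birthday-aware calculation is one option. The helper name `exact_age` below is hypothetical and not part of the notebook; the dates come from the first sample row.

```python
from datetime import date

def exact_age(dob: date, on: date) -> int:
    """Completed years between dob and on (hypothetical helper)."""
    # Subtract one if the birthday has not yet occurred in the year of `on`
    before_birthday = (on.month, on.day) < (dob.month, dob.day)
    return on.year - dob.year - int(before_birthday)

# First sample row: dob 1988-03-09, transaction on 2019-01-01
print(exact_age(date(1988, 3, 9), date(2019, 1, 1)))  # 30
print(2019 - 1988)                                    # 31 with the year-difference method
```

For bucketing into decade-wide age groups the one-year difference rarely matters, which may be why the simpler subtraction is used here.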

In [None]:
#Dropping the dob col from train and test data

df_fraud_train.drop('dob',axis=1, inplace=True)

df_fraud_test.drop('dob',axis=1, inplace=True)

NameError: ignored

In [None]:
# Now we will create few columns from trans_date_trans_time in train and test data for our further analysis
df_fraud_train['Day_Of_Week']=df_fraud_train['trans_date_trans_time'].dt.dayofweek
df_fraud_train['Day_Name']=df_fraud_train['trans_date_trans_time'].dt.day_name()
df_fraud_train['No_Of_Week']=df_fraud_train['trans_date_trans_time'].dt.isocalendar().week  # .dt.week was removed in pandas 2.0
df_fraud_train['No_Of_Day']=df_fraud_train['trans_date_trans_time'].dt.day
df_fraud_train['Day_Hrs']=df_fraud_train['trans_date_trans_time'].dt.hour
df_fraud_train['Month_Name']=df_fraud_train['trans_date_trans_time'].dt.month_name()
df_fraud_train['Year']=df_fraud_train['trans_date_trans_time'].dt.year

In [None]:
# Now the same we will apply for test data
df_fraud_test['Day_Of_Week']=df_fraud_test['trans_date_trans_time'].dt.dayofweek
df_fraud_test['Day_Name']=df_fraud_test['trans_date_trans_time'].dt.day_name()
df_fraud_test['No_Of_Week']=df_fraud_test['trans_date_trans_time'].dt.isocalendar().week  # .dt.week was removed in pandas 2.0
df_fraud_test['No_Of_Day']=df_fraud_test['trans_date_trans_time'].dt.day
df_fraud_test['Day_Hrs']=df_fraud_test['trans_date_trans_time'].dt.hour
df_fraud_test['Month_Name']=df_fraud_test['trans_date_trans_time'].dt.month_name()
df_fraud_test['Year']=df_fraud_test['trans_date_trans_time'].dt.year

In [None]:
# Now we will use the haversine formula to calculate the distance between the customer's base location and the location of the transaction
# Creating function to calculate the haversine distance (NumPy functions are used so it works on whole columns)

def haversine_dist(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between points given in decimal degrees."""
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))  # central angle in radians
    earth_radius_km = 6371  # mean radius of the earth in kilometers
    return earth_radius_km * c
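
A quick way to gain confidence in the formula is to check it against a distance that is known in closed form. The scalar sketch below re-implements the same haversine formula with the standard library only (the name `haversine_km` is illustrative) and checks it on a quarter of the equator, which should equal pi/2 times the Earth radius used.

```python
import math

EARTH_RADIUS_KM = 6371  # same mean radius as used in haversine_dist

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two points given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Identical points are 0 km apart
print(haversine_km(-81.18, 36.08, -81.18, 36.08))
# A quarter of the equator: (0E, 0N) to (90E, 0N)
print(haversine_km(0, 0, 90, 0), EARTH_RADIUS_KM * math.pi / 2)
```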

In [None]:
# Now we will calculate the distance of base location of customer and location of transaction for train data
df_fraud_train['Dist_Cust_to_merchant']=haversine_dist(df_fraud_train['long'],df_fraud_train['lat'],df_fraud_train['merch_long'],df_fraud_train['merch_lat'])

In [None]:
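# Now we will calculate the distance of base location of customer and location of transaction for test data
# The info() output above shows 'long' stored as object in the test data, so we coerce it to numeric first
df_fraud_test['long']=pd.to_numeric(df_fraud_test['long'], errors='coerce')
df_fraud_test['Dist_Cust_to_merchant']=haversine_dist(df_fraud_test['long'],df_fraud_test['lat'],df_fraud_test['merch_long'],df_fraud_test['merch_lat'])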

In [None]:
#Now we will create the Age bucket for train and test data
df_fraud_train['Age_Bucket']=pd.cut(df_fraud_train['Customer_Age'],[0,29,39,49,59,9999],labels=['<30','30-40','40-50','50-60','60>'])

#Creating the Age bucket for test data as well
df_fraud_test['Age_Bucket']=pd.cut(df_fraud_test['Customer_Age'],[0,29,39,49,59,9999],labels=['<30','30-40','40-50','50-60','60>'])

NameError: ignored
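
With its default `right=True`, `pd.cut` treats each bin as open on the left and closed on the right, so the edges `[0, 29, 39, 49, 59, 9999]` place age 29 in `<30` and age 30 in `30-40`. The dependency-free sketch below mimics that edge handling with `bisect` to make the boundaries explicit; `age_bucket` is a hypothetical helper, not the notebook's code.

```python
from bisect import bisect_left

EDGES = [29, 39, 49, 59]  # inner bin edges, each bin is (lower, upper]
LABELS = ['<30', '30-40', '40-50', '50-60', '60>']

def age_bucket(age):
    """Map an age to its label using left-open, right-closed intervals."""
    # Ages at or below 0, or above 9999, fall outside every bin (pd.cut returns NaN there)
    if not 0 < age <= 9999:
        return None
    return LABELS[bisect_left(EDGES, age)]

for age in (18, 29, 30, 59, 60):
    print(age, age_bucket(age))
```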

In [None]:
df_fraud_train.head(4)

In [None]:
#Checking the shape for train and test
df_fraud_train.shape, df_fraud_test.shape

## EDA

In [None]:
# Now we will analyze the target variable by checking the count of class 1 and class 0
#Checking for train data
sns.countplot(x=df_fraud_train['is_fraud'])
plt.show()

In [None]:
#Checking for test data
sns.countplot(x=df_fraud_test['is_fraud'])
plt.show()

In [None]:
#Checking the percentage of fraudulent vs non fraudulent transaction in train and test data
print((df_fraud_train['is_fraud'].value_counts(normalize=True)*100))

#Checking for test data
print((df_fraud_test['is_fraud'].value_counts(normalize=True)*100))

NameError: ignored

- In both the train and test data, fraudulent transactions make up a very small share of the records. This is expected, but it means we are dealing with class imbalance.
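
Class imbalance matters because accuracy alone becomes misleading: a model that never flags fraud can still look accurate. The sketch below uses made-up counts (995 legitimate and 5 fraudulent transactions, chosen only for illustration and not taken from this dataset) to show the effect.

```python
# Illustrative labels: 1 = fraud, 0 = legitimate (counts are made up)
y_true = [0] * 995 + [1] * 5

# A trivial "model" that predicts every transaction as legitimate
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.3f}")  # 0.995
print(f"recall   = {recall:.3f}")    # 0.000
```

High accuracy with zero recall is exactly the failure a fraud model must avoid, so recall, precision, and ROC AUC are more informative than accuracy here.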

In [None]:
# Now we will see the counts of gender in train data
df_fraud_train['gender'].value_counts(normalize=True).plot.bar()
plt.show()

- The training data contains more transactions from female customers than from male customers.

In [None]:
#Now we will see the distribution of amount column in training data
sns.histplot(x=df_fraud_train['amt'], stat='density', kde=True)  # distplot is deprecated in recent seaborn versions
plt.show()

- The data is skewed, so we will plot a boxplot to check whether the `amt` column contains outliers.

In [None]:
#Checking outliers for the numerical col
df_train_num=df_fraud_train[['amt','city_pop','Customer_Age','Day_Of_Week','No_Of_Week','No_Of_Day','Day_Hrs','Dist_Cust_to_merchant']]
df_train_num.describe(percentiles=[.25,.5,.75,.90,.95,.99])

- The `describe` output confirms that the `amt` data is skewed.

In [None]:
# Plotting the subplots to check the outliers analysis
plt.figure(figsize=(15,7))
plt.subplot(2,4,1)
sns.boxplot(data=df_fraud_train,x='is_fraud',y='amt',hue='is_fraud')
plt.legend(loc='lower center')

plt.subplot(2,4,2)
sns.boxplot(data=df_fraud_train,x='is_fraud',y='city_pop',hue='is_fraud')

plt.subplot(2,4,3)
sns.boxplot(data=df_fraud_train,x='is_fraud',y='Customer_Age',hue='is_fraud')

plt.subplot(2,4,4)
sns.boxplot(data=df_fraud_train,x='is_fraud',y='Day_Of_Week',hue='is_fraud')

plt.subplot(2,4,5)
sns.boxplot(data=df_fraud_train,x='is_fraud',y='No_Of_Week',hue='is_fraud')

plt.subplot(2,4,6)
sns.boxplot(data=df_fraud_train,x='is_fraud',y='No_Of_Day',hue='is_fraud')

plt.subplot(2,4,7)
sns.boxplot(data=df_fraud_train,x='is_fraud',y='Day_Hrs',hue='is_fraud')
plt.legend(loc='lower center')

plt.subplot(2,4,8)
sns.boxplot(data=df_fraud_train,x='is_fraud',y='Dist_Cust_to_merchant',hue='is_fraud')

plt.legend(loc='lower center')
plt.tight_layout()
plt.show()




- Non-fraudulent transactions show extreme values in the `amt` column. This is plausible, since some customers make genuinely large purchases.
- `city_pop` also has outliers, which is expected because population varies widely from city to city. None of the values look implausible.
- Fraudulent transactions are concentrated around midnight.
- The remaining features look reasonable, although a few are skewed.
- We will therefore keep the outliers, as they carry information that is needed for the later analysis.

### Bi-Variate Analysis

In [None]:
#Plotting the barplot for amt col with respect to target variable
sns.barplot(data=df_fraud_train,x='is_fraud',y='amt')
plt.show()

- The barplot shows the mean of `amt` per class: the average transaction amount is higher for fraudulent transactions than for non-fraudulent ones.

In [None]:
# Now we will analyse the gender col with respect to target variable
sns.countplot(data=df_fraud_train, x='gender',hue='is_fraud')
plt.show()

In [None]:
# Now we will plot the number of fraudulent transactions for each gender
df_fraud_train.groupby(by=['gender'])['is_fraud'].sum().plot.bar()

In [None]:
df_fraud_train.groupby(by=['gender'])['is_fraud'].sum()

- The number of fraudulent transactions differs little between the genders. The counts are small, but fraud occurs for both.

In [None]:
plt.figure(figsize=(12,5))
df_statewise=df_fraud_train.groupby(by=['state'])['is_fraud'].sum()
df_fraud_statewise=df_statewise.sort_values(ascending=False).head(10)
df_fraud_statewise.plot.bar()
plt.show()

- From the chart above, NY has the highest number of fraudulent transactions.

In [None]:
#Now we will analyse the fraudulent transactions with respect to category
plt.figure(figsize=(15,5))

df_category=df_fraud_train.groupby(by=['category'])['is_fraud'].sum()
df_fraud_category=df_category.sort_values(ascending=False)
df_fraud_category.plot.bar()
plt.show()

- The highest number of fraudulent transactions occurred in the `grocery_pos` category, followed by `shopping_net`.

In [None]:
#Now we will analyse the fraudulent transaction with respect to weekday
plt.figure(figsize=(10,5))
df_fraud_train.groupby(by=['Day_Name'])['is_fraud'].sum().sort_values(ascending=False).plot.bar()
plt.show()

- From the chart above, the most fraudulent transactions occurred on Saturday and Sunday, followed by Monday.

In [None]:
#In this graph will analyse the fraudulent transaction with respect to Month
plt.figure(figsize=(10,5))
df_fraud_train.groupby(by=['Month_Name'])['is_fraud'].sum().sort_values(ascending=False).plot.bar()
plt.show()

- Most of the fraudulent transactions happened in March and May.

In [None]:
plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
sns.histplot(x=df_fraud_train[df_fraud_train['is_fraud'] == 0]["Day_Hrs"], stat='density', kde=True)
plt.title('Non-Fraudulent Transaction')



plt.subplot(1,2,2)
sns.histplot(x=df_fraud_train[df_fraud_train['is_fraud'] == 1]["Day_Hrs"], stat='density', kde=True)
plt.title('Fraudulent Transaction')
plt.show()

- From the plots above, fraudulent transactions are concentrated in odd hours, which is expected.

In [None]:
#Now we will check the agebucket with respect to target variable
df_fraud_train[df_fraud_train['is_fraud']==1]['Age_Bucket'].value_counts(normalize=True).plot.bar()

## Data Preparation For Data Modelling

In [None]:
#Checking the head
df_fraud_train.head(6)

In [None]:
#Creating copy of training data and test data
df_fraud_train_model=df_fraud_train.copy()
df_fraud_test_model=df_fraud_test.copy()

In [None]:
#Checking the shape
df_fraud_train_model.shape, df_fraud_test_model.shape

In [None]:
#Plotting the heatmap to check the correlation
plt.figure(figsize=(15,7))
sns.heatmap(df_fraud_train_model.corr(numeric_only=True),annot=True)  # non-numeric columns are excluded from the correlation
plt.show()

In [None]:
# So now we will drop the highly correlated features from the dataset
#We build a square matrix with one row and column per numeric feature, whose elements are the absolute correlations between the features.
cor_matrix = df_fraud_train_model.corr(numeric_only=True).abs()
print(cor_matrix)

In [None]:
#The correlation matrix is symmetric about the diagonal and all the diagonal elements are 1,
#so we keep only the upper triangle (np.bool was removed from NumPy, so the built-in bool is used)
upper_tri = cor_matrix.where(np.triu(np.ones(cor_matrix.shape),k=1).astype(bool))
print(upper_tri)

In [None]:
#Selecting the columns whose absolute correlation with an earlier column is greater than 0.70 and collecting them in a list named 'to_drop'.
to_drop = [column for column in upper_tri.columns if any(upper_tri[column] > 0.70)]
print(to_drop)
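
To see why only the upper triangle is scanned, consider a small made-up correlation matrix. Each pair appears once above the diagonal, and the column listed later in the pair is the one marked for removal, so one member of each correlated pair survives. The feature names and values below are illustrative only.

```python
# Made-up absolute correlations for three features (symmetric, diagonal = 1)
features = ['lat', 'merch_lat', 'amt']
corr = [
    [1.00, 0.99, 0.02],
    [0.99, 1.00, 0.01],
    [0.02, 0.01, 1.00],
]
THRESHOLD = 0.70

to_drop_demo = []
for col in range(len(features)):
    # Entries above the diagonal in this column: pairs (row, col) with row < col
    if any(corr[row][col] > THRESHOLD for row in range(col)):
        to_drop_demo.append(features[col])

print(to_drop_demo)  # ['merch_lat']
```

Only `merch_lat` is flagged, while `lat` is kept because it has no entries above the diagonal in its own column.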

In [None]:
# Now finding the highly correlated col for test dataset
cor_matrix_1 = df_fraud_test_model.corr(numeric_only=True).abs()
upper_tri_1 = cor_matrix_1.where(np.triu(np.ones(cor_matrix_1.shape),k=1).astype(bool))
to_drop_1 = [column for column in upper_tri_1.columns if any(upper_tri_1[column] > 0.70)]
print(to_drop_1)


In [None]:
#Dropping the selected highly correlated columns list
df_fraud_train_model.drop(to_drop,axis=1,inplace=True)
df_fraud_test_model.drop(to_drop_1,axis=1,inplace=True)

In [None]:
#Checking the shape of train and test data
df_fraud_train_model.shape, df_fraud_test_model.shape

In [None]:
#Checking the columns which has to drop
df_fraud_train_model.columns

In [None]:
#Dropping the irrelevant train dataset columns
df_fraud_train_model.drop(['trans_date_trans_time', 'cc_num', 'merchant' ,'street', 'city', 'state', 'zip', 'lat', 'job',
                           'trans_num','Unix_Time_Stamp', 'Customer_Full_Name','Day_Of_Week', 'Day_Name', 'No_Of_Week', 'No_Of_Day','Day_Hrs','Month_Name','Year','Age_Bucket'],axis=1,inplace=True)
#Dropping the irrelevant test dataset columns
df_fraud_test_model.drop(['trans_date_trans_time', 'cc_num', 'merchant', 'street', 'city', 'state', 'zip', 'lat', 'job',
                           'trans_num','Unix_Time_Stamp', 'Customer_Full_Name','Day_Of_Week', 'Day_Name', 'No_Of_Week', 'No_Of_Day','Day_Hrs','Month_Name','Year','Age_Bucket'],axis=1,inplace=True)

In [None]:
df_fraud_train_model.shape, df_fraud_test_model.shape

In [None]:
# Now we will change the categorical columns to binary 0 and 1
df_fraud_train_model.head(2)


In [None]:
#We will create a dummy variable for some of the categorical variables and dropping the first one
dummy_1= pd.get_dummies(df_fraud_train_model[['category','gender']],drop_first=True)
dummy_2=pd.get_dummies(df_fraud_test_model[['category','gender']],drop_first=True)

#Adding the results to the master dataframe
df_fraud_train_model=pd.concat([df_fraud_train_model,dummy_1],axis=1)
df_fraud_test_model=pd.concat([df_fraud_test_model,dummy_2],axis=1)

In [None]:
#Dropping the repeated col
df_fraud_train_model.drop(['category','gender'],axis=1,inplace=True)
df_fraud_test_model.drop(['category','gender'],axis=1,inplace=True)

In [None]:
df_fraud_train_model.shape,df_fraud_test_model.shape

In [None]:
#Splitting the data into X and y
X_training=df_fraud_train_model.drop('is_fraud',axis=1)
y_training=df_fraud_train_model['is_fraud']

x_testing=df_fraud_test_model.drop('is_fraud',axis=1)
y_testing=df_fraud_test_model['is_fraud']

In [None]:
# Lets check the distribution before scaling
plt.figure(figsize=[15,8])
plt.subplot(2,2,1)
plt.title('Distribution of Amount', fontsize= 10, color = 'Red', fontweight = 100)
plt.hist(X_training['amt'])
plt.subplot(2,2,2)
plt.title('Distribution of Customer_Age', fontsize= 10, color = 'Red', fontweight = 100)
plt.hist(X_training['Customer_Age'])
plt.subplot(2,2,3)
plt.title('Distribution of distance Customer to merchant', fontsize= 10, color = 'Red', fontweight = 100)
plt.hist(X_training['Dist_Cust_to_merchant'])
plt.subplot(2,2,4)
plt.title('Distribution of city_pop', fontsize= 10, color = 'Red', fontweight = 100)
plt.hist(X_training['city_pop'])
plt.show()

- The features are skewed, so we will transform them towards a normal distribution before passing them to the model.

In [None]:
#Splitting the training data into train validation split
X_train, X_valid, y_train, y_valid=train_test_split(X_training,y_training, train_size=.70,stratify=y_training, random_state=42)

#Checking the shape
X_train.shape, X_valid.shape, y_train.shape, y_valid.shape

In [None]:
#Scaling the training data through Power transformation
num_cols=['amt','city_pop','Customer_Age','Dist_Cust_to_merchant']
scaler_PT=PowerTransformer()
#We will fit and transform the training data
X_train[num_cols]=scaler_PT.fit_transform(X_train[num_cols])

#And we will only transform the validation set
X_valid[num_cols]=scaler_PT.transform(X_valid[num_cols])
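
Fitting the transformer on the training split only and reusing it on the validation split keeps validation statistics out of training. The sketch below shows the same fit/transform separation with a plain standardizer instead of `PowerTransformer`, purely to keep it dependency-free; the class `SimpleStandardizer` and the sample values are hypothetical.

```python
from statistics import mean, pstdev

class SimpleStandardizer:
    """Minimal stand-in for a scikit-learn style scaler (illustrative only)."""

    def fit(self, values):
        # Parameters are learned from the data passed to fit() and nothing else
        self.mean_ = mean(values)
        self.std_ = pstdev(values)
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]

train_amt = [5.0, 10.0, 15.0]
valid_amt = [20.0]

scaler = SimpleStandardizer().fit(train_amt)  # learn from train only
print(scaler.transform(train_amt))
print(scaler.transform(valid_amt))            # reuse the train parameters
```

The validation value is scaled with the training mean and standard deviation, so it can land outside the range seen in training, which is the intended behaviour.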

In [None]:
# Lets check the distribution after scaling
plt.figure(figsize=[15,8])
plt.subplot(2,2,1)
plt.title('Distribution of Amount', fontsize= 10, color = 'Red', fontweight = 100)
plt.hist(X_train['amt'])
plt.subplot(2,2,2)
plt.title('Distribution of Customer_Age', fontsize= 10, color = 'Red', fontweight = 100)
plt.hist(X_train['Customer_Age'])
plt.subplot(2,2,3)
plt.title('Distribution of distance Customer to merchant', fontsize= 10, color = 'Red', fontweight = 100)
plt.hist(X_train['Dist_Cust_to_merchant'])
plt.subplot(2,2,4)
plt.title('Distribution of city_pop', fontsize= 10, color = 'Red', fontweight = 100)
plt.hist(X_train['city_pop'])
plt.show()

- After the transformation, the features are much closer to a normal distribution.

In [None]:
# We will apply two class imbalance techniques for our model
#Now we will apply SMOTE to balance the class
from collections import Counter
counter= Counter(y_train)
print('Before Applying SMOTE', counter)
smt=SMOTE(sampling_strategy={1:555555}, random_state=42)
x_training_smte,y_training_smte=smt.fit_resample(X_train,y_train)
counter = Counter(y_training_smte)
print('After Applying SMOTE', counter)


#We will apply ADASYN to balance the class
counter_1= Counter(y_train)
print('Before Applying ADASYN', counter_1)
ada = ADASYN(sampling_strategy={1:555555},random_state=42)
X_training_ada, y_training_ada = ada.fit_resample(X_train, y_train)
counter_1 = Counter(y_training_ada)
print('After Applying ADASYN', counter_1)


In [None]:
#checking the shape after class balance
x_training_smte.shape,y_training_smte.shape, X_training_ada.shape, y_training_ada.shape

In [None]:
X_train.shape,y_train.shape

In [None]:
#Collecting the feature names for the VIF check; features with a high VIF are candidates for dropping
col=x_training_smte.columns

In [None]:
# We will use the VIF to check for multicollinearity and remove redundant variables based on the VIF score
from statsmodels.stats.outliers_influence import variance_inflation_factor

#Creating the data frame that will contain all features and their respective VIFs
vif=pd.DataFrame()
vif['features'] = x_training_smte[col].columns
vif['VIF'] = [variance_inflation_factor(x_training_smte[col].values,i) for i in range (x_training_smte[col].shape[1])]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by= 'VIF', ascending = False)
vif

- All the VIF values are below 5, so we will not drop any features.
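
For reference, the VIF of feature j is 1 / (1 - R^2_j), where R^2_j comes from regressing that feature on the remaining ones. With only two features, R^2 is the squared correlation between them, which makes the common cut-off of 5 easier to interpret. The function name below is illustrative.

```python
import math

def vif_from_r2(r_squared):
    """Variance inflation factor for a feature whose regression on the others has this R^2."""
    return 1.0 / (1.0 - r_squared)

for r in (0.0, 0.5, 0.7, 0.9):
    # Two-feature case: R^2 equals the squared pairwise correlation
    print(f"correlation {r:.1f} -> VIF {vif_from_r2(r ** 2):.2f}")

# Correlation at which the two-feature VIF reaches 5
print(math.sqrt(1 - 1 / 5))
```

In the two-feature case a VIF of 5 corresponds to a pairwise correlation of roughly 0.89, so the threshold only flags fairly strong linear dependence.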

## Creating Model Algorithm & Hyperparameter Tuning

In [None]:
#Creating an instance of Algorithm for Classification models
#Logistic Regression
model_LR=LogisticRegression()

#Decision Tree
model_DT = DecisionTreeClassifier(random_state = 42)

#Random Forest
model_RF = RandomForestClassifier(oob_score= True,random_state=42)

#XGBoost
model_XGB = XGBClassifier()

#XGBoost Ray
#model_XGBR= RayXGBClassifier( n_jobs=4,random_state = 'seed')



In [None]:
#Creating an instance of Stratified K Fold CV and taking n_splits as 5
skfold=StratifiedKFold(n_splits=5)

#Creating hyper parameter tuning for logistic regression
params_LR = {
'solver': ['liblinear'],
'penalty' : ['l1','l2'],
'C' : np.logspace(-1, 5, 10),
'class_weight' : [None, 'balanced']}


#Creating hyper parameter tuning for Decision Tree
params_DT = {
'max_depth': [3,5,10,None],
'min_samples_leaf': [100,150,200],
'max_features' : randint(1,5),
'min_samples_split' : [100,200,300],
'criterion': ["gini","entropy"]}

#Creating hyper parameter tuning for Random Forest
params_RF = {
'max_depth': [3,5,10,None],
'n_estimators' : [10,15,20],
'max_features' : randint(1,6),
'min_samples_leaf': [100,150,200],
'min_samples_split' : [100,200,300],
'bootstrap' : [True],  # model_RF uses oob_score=True, which is only available with bootstrap=True
'criterion': ['gini','entropy']}

#Creating hyper parameter tuning for XGBoost
params_XGB = {
'learning_rate' : [0.05, 0.1],
'max_depth': [3,5],
#'min_child_weight' : [1,3,5],
'gamma' : [0.1,0.2],
#'max_features': randint(1,5),  # not an XGBoost parameter, so it had no effect
'tree_method': ['approx','hist']}  # XGBoost names this setting tree_method, not method
#'colsample_bytree': np.arange(0.4, 1.0, 0.1)}

#Creating hyper parameter tuning for XGBoost Ray
#params_XGBR= {
#'learning_rate' : [0.05, 0.1],
#'max_depth': [3,5],
#'gamma' : [0.1,0.2],
#'max_features': randint(1,5),
#'method':     ['hist']}

#XGBoost and XGBoost Ray take much longer to run, so we avoid using them for this problem statement
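
`RandomizedSearchCV` does not try every combination. It draws `n_iter` candidates (10 by default in scikit-learn) from the parameter space and cross-validates each one. The sketch below counts the full grid for the logistic regression space and draws a random subset using only the standard library; it mimics the idea and is not scikit-learn's sampler.

```python
import itertools
import random

# Same shape as params_LR; C mirrors np.logspace(-1, 5, 10)
param_space = {
    'solver': ['liblinear'],
    'penalty': ['l1', 'l2'],
    'C': [10 ** (-1 + 6 * i / 9) for i in range(10)],
    'class_weight': [None, 'balanced'],
}

keys = list(param_space)
grid = [dict(zip(keys, combo)) for combo in itertools.product(*param_space.values())]
print(len(grid))  # 40 combinations in the full grid

rng = random.Random(42)
candidates = rng.sample(grid, k=10)  # 10 candidates, as with n_iter=10
# With 5-fold cross-validation each candidate is fitted 5 times
print(len(candidates) * 5)  # 50 fits
```

Sampling a quarter of the grid trades some search coverage for a fourfold reduction in fitting time, which matters more for the larger tree-based spaces.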



In [None]:
# Creating function to plot ROC curve
def plot_roc(actual, proba):
    fpr, tpr, thresholds = metrics.roc_curve(actual, proba, drop_intermediate = False )
    auc_score = metrics.roc_auc_score(actual, proba)
    plt.figure(figsize=(7, 5))
    plt.plot( fpr, tpr, label='ROC Curve (area = %0.2f)' % auc_score )
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC')
    plt.legend(loc="lower right")
    plt.show()

    return None

In [None]:
#Creating a function to print various metrics for a classification model
def model_metrics(a, b):
  confusion = confusion_matrix(a, b)
  TP = confusion[1,1] # true positive
  TN = confusion[0,0] # true negatives
  FP = confusion[0,1] # false positives
  FN = confusion[1,0] # false negatives
  print ('Accuracy    : ', metrics.accuracy_score(a, b))
  print ('Sensitivity : ', TP / float(TP+FN))
  print ('Specificity : ', TN / float(TN+FP))
  print ('Precision   : ', TP / float(TP + FP))
  print ('Recall      : ', TP / float(TP + FN))
  print('F1_Score:',metrics.f1_score(a,b))
  print(confusion)


  return None
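
The quantities printed by `model_metrics` follow directly from the four confusion-matrix counts. The worked example below uses made-up counts to show the arithmetic and guards against a zero denominator, which can occur when a model predicts no positives. The helper `safe_div` is hypothetical.

```python
def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0

# Made-up confusion-matrix counts for illustration
TP, TN, FP, FN = 40, 900, 50, 10

sensitivity = safe_div(TP, TP + FN)  # also called recall
specificity = safe_div(TN, TN + FP)
precision = safe_div(TP, TP + FP)
f1 = safe_div(2 * precision * sensitivity, precision + sensitivity)

print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
print(f"precision={precision:.3f} f1={f1:.3f}")
```

Sensitivity and recall are the same quantity, which is why `model_metrics` prints identical values for both.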


### Logistic Regression Without Sampling

In [None]:
#Creating 1st model of Logistic Regression without Sampling
#Using the Randomized Search CV technique to get the best parameters for building the model
LR_RCV = RandomizedSearchCV(model_LR, params_LR, cv=skfold, scoring='roc_auc', n_jobs=-1, verbose=1, random_state=42)

#Fitting the model
LR_RCV.fit(X_train, y_train)
print(LR_RCV.best_estimator_)
print(LR_RCV.best_params_)
print(LR_RCV.best_score_)

In [None]:
#Getting the predicted values on the train set and validation set
y_train_pred = LR_RCV.best_estimator_.predict(X_train)
y_test_val_pred = LR_RCV.best_estimator_.predict(X_valid)

#Plotting the AUC ROC Curve for training data
plot_roc(y_train, y_train_pred)
print("Training_Data_Metrics")
print ('AUC for the Training_Data: ', metrics.roc_auc_score( y_train, y_train_pred))
model_metrics(y_train, y_train_pred)
print('*'*70)


#Plotting the AUC ROC Curve for Validation Data
plot_roc(y_valid, y_test_val_pred)
print("Validation_Data_metrics")
print ('AUC for the Validation_Data: ', metrics.roc_auc_score( y_valid, y_test_val_pred))
model_metrics(y_valid, y_test_val_pred)
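
The fit, predict and report steps in the cell above recur for every model in this notebook. One way to avoid the repetition, sketched here on synthetic data, is a small helper. `fit_and_report` is a hypothetical name introduced for this example; it is not defined elsewhere in the notebook, and it returns the metrics as a dictionary instead of plotting.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier


def fit_and_report(search, X_fit, y_fit, X_val, y_val):
    """Fit a search object and return the best estimator plus a metrics dict."""
    search.fit(X_fit, y_fit)
    best = search.best_estimator_
    report = {}
    for name, X_part, y_part in [("train", X_fit, y_fit), ("valid", X_val, y_val)]:
        pred = best.predict(X_part)
        report[name] = {
            "auc": metrics.roc_auc_score(y_part, pred),
            "recall": metrics.recall_score(y_part, pred),
            "precision": metrics.precision_score(y_part, pred, zero_division=0),
        }
    return best, report


# Synthetic, imbalanced stand-in for the fraud data (illustration only).
X_demo, y_demo = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.3, random_state=42
)

demo_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"max_depth": [3, 5, 7], "criterion": ["gini", "entropy"]},
    n_iter=3,
    cv=StratifiedKFold(n_splits=3),
    scoring="roc_auc",
    random_state=42,
)
best_model, report = fit_and_report(demo_search, X_tr, y_tr, X_va, y_va)
print(report)
```

The same call would work with the resampled training sets by passing them as `X_fit` and `y_fit` while keeping the untouched validation split.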

# Decision Tree Without Sampling

In [None]:
#Creating 2nd model of Decision Tree without Sampling
#Using the Randomized Search CV technique to get the best parameters for building the model
DT_RCV = RandomizedSearchCV(model_DT, params_DT, cv=skfold, scoring='roc_auc', n_jobs=-1, verbose=1, random_state=42)

#Fitting the model
DT_RCV.fit(X_train, y_train)
print(DT_RCV.best_estimator_)
print(DT_RCV.best_params_)
print(DT_RCV.best_score_)


#Getting the predicted values on the train set and validation set
y_train_pred = DT_RCV.best_estimator_.predict(X_train)
y_test_val_pred = DT_RCV.best_estimator_.predict(X_valid)

#Plotting the AUC ROC Curve for training data
plot_roc(y_train, y_train_pred)
print("Training_Data_Metrics")
print ('AUC for the Training_Data: ', metrics.roc_auc_score( y_train, y_train_pred))
model_metrics(y_train, y_train_pred)
print('*'*70)
print("Validation_Data_metrics")

#Plotting the AUC ROC Curve for Validation Data
plot_roc(y_valid, y_test_val_pred)
print ('AUC for the Validation_Data: ', metrics.roc_auc_score( y_valid, y_test_val_pred))
model_metrics(y_valid, y_test_val_pred)



# Random Forest Without Sampling

In [None]:
#Creating 3rd model of Random Forest without Sampling
#Using the Randomized Search CV technique to get the best parameters for building the model
RF_RCV = RandomizedSearchCV(model_RF, params_RF, cv=skfold, scoring='roc_auc', n_jobs=-1, verbose=1, random_state=42)

#Fitting the model
RF_RCV.fit(X_train, y_train)
print(RF_RCV.best_estimator_)
print(RF_RCV.best_params_)
print(RF_RCV.best_score_)


#Getting the predicted values on the train set and validation set
y_train_pred = RF_RCV.best_estimator_.predict(X_train)
y_test_val_pred = RF_RCV.best_estimator_.predict(X_valid)

#Plotting the AUC ROC Curve for training data
plot_roc(y_train, y_train_pred)
print("Training_Data_Metrics")
print ('AUC for the Training_Data: ', metrics.roc_auc_score( y_train, y_train_pred))
model_metrics(y_train, y_train_pred)
print('*'*70)
print("Validation_Data_metrics")

#Plotting the AUC ROC Curve for Validation Data
plot_roc(y_valid, y_test_val_pred)
print ('AUC for the Validation_Data: ', metrics.roc_auc_score( y_valid, y_test_val_pred))
model_metrics(y_valid, y_test_val_pred)

# XGBoost without Sampling

In [None]:
#Creating 4th model of XGBoost without Sampling

#Creating an instance of Stratified K Fold CV with n_splits=3, as XGBoost takes a long time to run
#skfold=StratifiedKFold(n_splits=3)
#Using the Randomized Search CV technique to get the best parameters for building the model
#XGB_RCV = RandomizedSearchCV(model_XGB, params_XGB, cv=skfold, scoring='roc_auc', n_jobs=-1, verbose=1, random_state=42)

#Fitting the model
#XGB_RCV.fit(X_train, y_train)
#print(XGB_RCV.best_estimator_)
#print(XGB_RCV.best_params_)
#print(XGB_RCV.best_score_)


#Getting the predicted values on the train set and validation set
#y_train_pred = XGB_RCV.best_estimator_.predict(X_train)
#y_test_val_pred = XGB_RCV.best_estimator_.predict(X_valid)

#Plotting the AUC ROC Curve for training data
#plot_roc(y_train, y_train_pred)
#print("Training_Data_Metrics")
#print ('AUC for the Training_Data: ', metrics.roc_auc_score( y_train, y_train_pred))
#model_metrics(y_train, y_train_pred)
#print('*'*70)
#print("Validation_Data_metrics")

#Plotting the AUC ROC Curve for Validation Data
#plot_roc(y_valid, y_test_val_pred)
#print ('AUC for the Validation_Data: ', metrics.roc_auc_score( y_valid, y_test_val_pred))
#model_metrics(y_valid, y_test_val_pred)

#### We applied XGBoost without sampling for model building, but the model takes too long to run, so we are excluding it from our project.

# Logistic Regression With SMOTE Sampling

In [None]:
#Creating 5th model of Logistic Regression with SMOTE Sampling
#Using the Randomized Search CV technique to get the best parameters for building the model
LR_RCV = RandomizedSearchCV(model_LR, params_LR, cv=skfold, scoring='roc_auc', n_jobs=-1, verbose=1, random_state=42)

#Fitting the model
LR_RCV.fit(x_training_smte, y_training_smte)
print(LR_RCV.best_estimator_)
print(LR_RCV.best_params_)
print(LR_RCV.best_score_)

#Getting the predicted values on the train set and validation set
y_train_pred = LR_RCV.best_estimator_.predict(x_training_smte)
y_test_val_pred = LR_RCV.best_estimator_.predict(X_valid)

#Plotting the AUC ROC Curve for training data
plot_roc(y_training_smte, y_train_pred)
print("Training_Data_Metrics")
print ('AUC for the Training_Data: ', metrics.roc_auc_score( y_training_smte, y_train_pred))
model_metrics(y_training_smte, y_train_pred)
print('*'*70)
print("Validation_Data_metrics")

#Plotting the AUC ROC Curve for Validation Data
plot_roc(y_valid, y_test_val_pred)
print ('AUC for the Validation_Data: ', metrics.roc_auc_score( y_valid, y_test_val_pred))
model_metrics(y_valid, y_test_val_pred)

# Decision Tree With SMOTE Sampling

In [None]:
#Creating 6th model of Decision Tree with SMOTE Sampling
#Using the Randomized Search CV technique to get the best parameters for building the model
DT_RCV = RandomizedSearchCV(model_DT, params_DT, cv=skfold, scoring='roc_auc', n_jobs=-1, verbose=1, random_state=42)

#Fitting the model
DT_RCV.fit(x_training_smte, y_training_smte)
print(DT_RCV.best_estimator_)
print(DT_RCV.best_params_)
print(DT_RCV.best_score_)

#Getting the predicted values on the train set and validation set
y_train_pred = DT_RCV.best_estimator_.predict(x_training_smte)
y_test_val_pred = DT_RCV.best_estimator_.predict(X_valid)

#Plotting the AUC ROC Curve for training data
plot_roc(y_training_smte, y_train_pred)
print("Training_Data_Metrics")
print ('AUC for the Training_Data: ', metrics.roc_auc_score( y_training_smte, y_train_pred))
model_metrics(y_training_smte, y_train_pred)
print('*'*70)
print("Validation_Data_metrics")

#Plotting the AUC ROC Curve for Validation Data
plot_roc(y_valid, y_test_val_pred)
print ('AUC for the Validation_Data: ', metrics.roc_auc_score( y_valid, y_test_val_pred))
model_metrics(y_valid, y_test_val_pred)

# Random Forest With SMOTE Sampling

In [None]:
#Creating 7th model of Random Forest with SMOTE Sampling
#Using the Randomized Search CV technique to get the best parameters for building the model
RF_RCV = RandomizedSearchCV(model_RF, params_RF, cv=skfold, scoring='roc_auc', n_jobs=-1, verbose=1, random_state=42)

#Fitting the model
RF_RCV.fit(x_training_smte, y_training_smte)
print(RF_RCV.best_estimator_)
print(RF_RCV.best_params_)
print(RF_RCV.best_score_)

#Getting the predicted values on the train set and validation set
y_train_pred = RF_RCV.best_estimator_.predict(x_training_smte)
y_test_val_pred = RF_RCV.best_estimator_.predict(X_valid)

#Plotting the AUC ROC Curve for training data
plot_roc(y_training_smte, y_train_pred)
print("Training_Data_Metrics")
print ('AUC for the Training_Data: ', metrics.roc_auc_score( y_training_smte, y_train_pred))
model_metrics(y_training_smte, y_train_pred)
print('*'*70)
print("Validation_Data_metrics")

#Plotting the AUC ROC Curve for Validation Data
plot_roc(y_valid, y_test_val_pred)
print ('AUC for the Validation_Data: ', metrics.roc_auc_score( y_valid, y_test_val_pred))
model_metrics(y_valid, y_test_val_pred)

- The XGBoost classifier takes too long to run, so we are not using it.

# Logistic Regression With ADASYN Sampling

In [None]:
#Creating 8th model of Logistic Regression with ADASYN Sampling
#Using the Randomized Search CV technique to get the best parameters for building the model
LR_RCV = RandomizedSearchCV(model_LR, params_LR, cv=skfold, scoring='roc_auc', n_jobs=-1, verbose=1, random_state=42)

#Fitting the model
LR_RCV.fit(X_training_ada, y_training_ada)
print(LR_RCV.best_estimator_)
print(LR_RCV.best_params_)
print(LR_RCV.best_score_)

#Getting the predicted values on the train set and validation set
y_train_pred = LR_RCV.best_estimator_.predict(X_training_ada)
y_test_val_pred = LR_RCV.best_estimator_.predict(X_valid)

#Plotting the AUC ROC Curve for training data
plot_roc(y_training_ada, y_train_pred)
print("Training_Data_Metrics")
print ('AUC for the Training_Data: ', metrics.roc_auc_score( y_training_ada, y_train_pred))
model_metrics(y_training_ada, y_train_pred)
print('*'*70)
print("Validation_Data_metrics")

#Plotting the AUC ROC Curve for Validation Data
plot_roc(y_valid, y_test_val_pred)
print ('AUC for the Validation_Data: ', metrics.roc_auc_score( y_valid, y_test_val_pred))
model_metrics(y_valid, y_test_val_pred)

# Decision Tree With ADASYN Sampling

In [None]:
#Creating 9th model of Decision Tree with ADASYN Sampling
#Using the Randomized Search CV technique to get the best parameters for building the model
DT_RCV = RandomizedSearchCV(model_DT, params_DT, cv=skfold, scoring='roc_auc', n_jobs=-1, verbose=1, random_state=42)

#Fitting the model
DT_RCV.fit(X_training_ada, y_training_ada)
print(DT_RCV.best_estimator_)
print(DT_RCV.best_params_)
print(DT_RCV.best_score_)

#Getting the predicted values on the train set and validation set
y_train_pred = DT_RCV.best_estimator_.predict(X_training_ada)
y_test_val_pred = DT_RCV.best_estimator_.predict(X_valid)

#Plotting the AUC ROC Curve for training data
plot_roc(y_training_ada, y_train_pred)
print("Training_Data_Metrics")
print ('AUC for the Training_Data: ', metrics.roc_auc_score( y_training_ada, y_train_pred))
model_metrics(y_training_ada, y_train_pred)
print('*'*70)
print("Validation_Data_metrics")

#Plotting the AUC ROC Curve for Validation Data
plot_roc(y_valid, y_test_val_pred)
print ('AUC for the Validation_Data: ', metrics.roc_auc_score( y_valid, y_test_val_pred))
model_metrics(y_valid, y_test_val_pred)

# Random Forest With ADASYN Sampling

In [None]:
#Creating 10th model of Random Forest with ADASYN Sampling
#Using the Randomized Search CV technique to get the best parameters for building the model
RF_RCV = RandomizedSearchCV(model_RF, params_RF, cv=skfold, scoring='roc_auc', n_jobs=-1, verbose=1, random_state=42)

#Fitting the model
RF_RCV.fit(X_training_ada, y_training_ada)
print(RF_RCV.best_estimator_)
print(RF_RCV.best_params_)
print(RF_RCV.best_score_)

#Getting the predicted values on the train set and validation set
y_train_pred = RF_RCV.best_estimator_.predict(X_training_ada)
y_test_val_pred = RF_RCV.best_estimator_.predict(X_valid)

#Plotting the AUC ROC Curve for training data
plot_roc(y_training_ada, y_train_pred)
print("Training_Data_Metrics")
print ('AUC for the Training_Data: ', metrics.roc_auc_score( y_training_ada, y_train_pred))
model_metrics(y_training_ada, y_train_pred)
print('*'*70)
print("Validation_Data_metrics")

#Plotting the AUC ROC Curve for Validation Data
plot_roc(y_valid, y_test_val_pred)
print ('AUC for the Validation_Data: ', metrics.roc_auc_score( y_valid, y_test_val_pred))
model_metrics(y_valid, y_test_val_pred)

# Evaluation Of The Model

###### As seen above, the Decision Tree and Random Forest algorithms performed very well on both the training and validation data after the classes were balanced with the SMOTE and ADASYN techniques.

- #### We will now apply the model trained with Random Forest and ADASYN to the test data set to check whether it still performs well.
- #### As seen above, Random Forest with ADASYN scored above 95%, and its recall on the validation data is approximately 95%, which is good enough to proceed.

In [None]:
#Splitting the data into X and y
X_train_final=df_fraud_train_model.drop('is_fraud',axis=1)
y_train_final=df_fraud_train_model['is_fraud']

X_test_final=df_fraud_test_model.drop('is_fraud',axis=1)
y_test_final=df_fraud_test_model['is_fraud']


In [None]:
#Before testing the model on the test data set we have to power transform the test data
#The transformer is fitted on the training data only; the test data is transformed with the fitted transformer
num_cols = ['amt', 'city_pop', 'Customer_Age', 'Dist_Cust_to_merchant']

#Fitting the power transformer on the training data and transforming it
X_train_final[num_cols] = scaler_PT.fit_transform(X_train_final[num_cols])

#Transforming the testing data with scaler_PT.transform (no refitting)
X_test_final[num_cols] = scaler_PT.transform(X_test_final[num_cols])



In [None]:
#We will apply ADASYN to balance the class before passing it to train the model
counter= Counter(y_train_final)
print('Before Applying ADASYN', counter)
ada = ADASYN(sampling_strategy={1:555555},random_state=42)
X_train_ada, y_train_ada = ada.fit_resample(X_train_final, y_train_final)
counter = Counter(y_train_ada)
print('After Applying ADASYN', counter)

###### Building the final model with the best parameters of Random Forest with ADASYN sampling on the whole training data set, and then applying it for prediction on the test data

In [None]:
#Creating a function to measure how long the model takes to train
def timer(start_time=None):
  if not start_time:
    start_time =  datetime.now()
    return start_time
  elif start_time:
    thour, temp_sec = divmod((datetime.now() - start_time).total_seconds(),3600)
    tmin, tsec = divmod(temp_sec, 60)
    print('\n Time taken : %i hours %i min and %i sec.' %(thour, tmin, round(tsec,2)))
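
The hours, minutes and seconds arithmetic inside `timer` can be checked in isolation. The sketch below pulls the two `divmod` steps into a helper and times a placeholder workload with `time.perf_counter`. `split_elapsed` is a name introduced here for illustration, and the summation merely stands in for a model fit.

```python
import time


def split_elapsed(total_seconds):
    """Split a duration in seconds into whole hours, minutes and seconds."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return int(hours), int(minutes), int(seconds)


start = time.perf_counter()
sum(range(100_000))  # placeholder workload standing in for model.fit(...)
elapsed = time.perf_counter() - start

print("Time taken : %i hours %i min and %i sec." % split_elapsed(elapsed))
print(split_elapsed(3725))  # 3725 s = 1 h, 2 min, 5 s
```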

In [None]:
#Applying Random Forest With ADASYN Sampling method on test data
#Creating Final model of Random Forest with ADASYN Sampling on whole training data

Final_model_RF = RandomForestClassifier(oob_score= True,criterion='entropy', max_features=4,min_samples_leaf=200, min_samples_split=300,
                                  n_estimators=10,random_state=42)
#Here starts the time
start_time = timer(None)
#Fitting the Final Model with best parameter on training data
Final_model_RF.fit(X_train_ada, y_train_ada)

#Here ends the time
timer(start_time)

#Getting the predicted values on the train set and test set
y_train_pred = Final_model_RF.predict(X_train_ada)
y_test_pred = Final_model_RF.predict(X_test_final)

#Plotting the AUC ROC Curve for training data
plot_roc(y_train_ada, y_train_pred)
print("Training_Data_Metrics")
print ('AUC for the Training_Data: ', metrics.roc_auc_score( y_train_ada, y_train_pred))
model_metrics(y_train_ada, y_train_pred)
print('*'*70)
print("Test_Data_metrics")

#Plotting the AUC ROC Curve for Test Data
plot_roc(y_test_final, y_test_pred)
print ('AUC for the Test_Data: ', metrics.roc_auc_score( y_test_final, y_test_pred))
model_metrics(y_test_final, y_test_pred)

- The final model, trained on the whole training data set and evaluated on the test data, gives the following results:

 ##### 1) AUC on Test Data - 96%

 ##### 2) Accuracy on Test Data - 98%

 ##### 3) Sensitivity on Test Data - 94% (also known as Recall)

 ##### 4) Precision on Test Data - 15%

 ##### 5) Recall on Test Data - 94%

 - As covered in the modules, accuracy is not always the right metric for classification problems on imbalanced data. The confusion matrix, precision, recall and F1 score are threshold dependent, whereas the AUC-ROC score evaluates the performance of the model across all classification thresholds.

 - In this problem statement the bank handles high-value transactions. To protect it from high-value fraud, we focus on high recall, that is, on detecting as many of the actual fraudulent transactions as possible.
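
The reported test precision and recall can be reproduced from three counts that appear in the cost-benefit cells of this notebook (2014 and 10945 transactions flagged as fraud, 131 frauds missed). Reading 2014 as true positives and 10945 as false positives is an inference; it is the assignment that matches the 94% recall and 15% precision listed above.

```python
# Test-set counts as read from the cost-benefit cells (assignment is inferred).
tp = 2014    # frauds the model flagged
fp = 10945   # legitimate transactions the model flagged
fn = 131     # frauds the model missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.4f}")
print(f"Recall   : {recall:.4f}")
print(f"F1 score : {f1:.4f}")
```

Under this reading, roughly five out of six alerts are false alarms, which is the price paid for the high recall the analysis prioritises.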

In [None]:
# Displaying the top features importance
feature_imp = pd.DataFrame({
    "Col_Name": X_train_ada.columns,
    "Feature_Imp": Final_model_RF.feature_importances_})

feature_imp.sort_values(by="Feature_Imp", ascending=False)

# Conclusion

- To save banks from high-value fraudulent transactions, we have to focus on a high recall in order to detect actual fraudulent transactions.
- Our model reaches a recall of 94% on the test data, which we consider good enough to deploy, and an AUC of 96%, which means it separates fraudulent from non-fraudulent transactions well.
- The most important features in the model are amt, Customer_Age, category_food_dining and city_pop.
- New York, Texas and Pennsylvania record the most fraud cases.
- The grocery_pos, shopping_net and misc_net categories are the most prone to fraud.
- The average amount of a fraudulent transaction is higher than that of a non-fraudulent transaction.


# Cost-Benefit Analysis

In [None]:
#For the cost-benefit questions, splitting the train and test data on the target variable
#Splitting the train data on the target variable
fraud_train_df=df_fraud_train[df_fraud_train['is_fraud']==1]
non_fraud_train_df=df_fraud_train[df_fraud_train['is_fraud']==0]

#Splitting the test data on the target variable
fraud_test_df=df_fraud_test[df_fraud_test['is_fraud']==1]
non_fraud_test_df=df_fraud_test[df_fraud_test['is_fraud']==0]

In [None]:
df_fraud_train[df_fraud_train['is_fraud']==0]['amt'].mean()

# Part I: Analyse the dataset and find the following figures:

- Average number of transactions per month
- Average number of fraudulent transactions per month
- Average amount per fraudulent transaction

In [None]:
#Average number of transactions per month on TRAINING DATA
Avg_Transactions_Per_Month_Training = (len(df_fraud_train)/12)
print('Avg No. Of Transactions per month on Training Data is : {}'.format(round(Avg_Transactions_Per_Month_Training, 2)))

#Average number of transactions per month on TEST DATA
Avg_Transactions_Per_Month_Test =  (len(df_fraud_test)/12)
print('Avg No. Of Transactions per month on Test Data is : {}'.format(round(Avg_Transactions_Per_Month_Test, 2)))

print('\n')
#Average number of fraudulent transactions per month on TRAINING DATA
Avg_No_Fraud_Transaction_Per_month_Training = (len(fraud_train_df)/12)
print('Avg No. Of Fraud Transactions per month on Training Data is : {}'.format(Avg_No_Fraud_Transaction_Per_month_Training))

#Average number of fraudulent transactions per month on TEST DATA
Avg_No_Fraud_Transaction_Per_month_Test = (len(fraud_test_df)/12)
print('Avg No. Of Fraud Transactions per month on Test Data is : {}'.format(Avg_No_Fraud_Transaction_Per_month_Test))

print('\n')
#Average amount per fraudulent transaction on TRAINING DATA
Avg_Amt_Per_Fraud_Transaction_Training = fraud_train_df['amt'].mean()
print('Avg Amt Per Fraud Transactions on Training is : {}'.format(round(Avg_Amt_Per_Fraud_Transaction_Training, 2)))

#Average amount per fraudulent transaction on TEST DATA
Avg_Amt_Per_Fraud_Transaction_Test = fraud_test_df['amt'].mean()
print('Avg Amt Per Fraud Transactions on Test is : {}'.format(round(Avg_Amt_Per_Fraud_Transaction_Test, 2)))
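
The cell above divides by a fixed 12 months. If the data carries a transaction timestamp, the number of months can instead be counted from the data. The sketch below does this on a tiny made-up frame; the column name `trans_time` and all values are assumptions for illustration, not columns confirmed in this notebook.

```python
import pandas as pd

# Toy data: six transactions spread over three calendar months (made up).
demo_df = pd.DataFrame({
    "trans_time": pd.to_datetime([
        "2020-01-05", "2020-01-20", "2020-02-11",
        "2020-03-02", "2020-03-15", "2020-03-30",
    ]),
    "amt": [10.0, 250.0, 35.5, 900.0, 12.0, 400.0],
    "is_fraud": [0, 1, 0, 1, 0, 1],
})

# Count the distinct calendar months actually present in the data.
n_months = demo_df["trans_time"].dt.to_period("M").nunique()

avg_tx_per_month = len(demo_df) / n_months
avg_fraud_per_month = demo_df["is_fraud"].sum() / n_months
avg_fraud_amt = demo_df.loc[demo_df["is_fraud"] == 1, "amt"].mean()

print("Months covered               :", n_months)
print("Avg transactions per month   :", avg_tx_per_month)
print("Avg fraud transactions/month :", avg_fraud_per_month)
print("Avg amount per fraud         :", round(avg_fraud_amt, 2))
```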

# Part II: Compare the cost incurred per month by the bank before and after the model deployment:

- Cost incurred per month before the model was deployed = Average amount per fraudulent transaction * Average number of fraudulent transactions per month

- Cost incurred per month after the model is built and deployed: use the test metrics from the model evaluation part and the calculations performed in Part I to compute the values given below.

 1) Let TF be the average number of transactions per month detected as fraudulent by the model, and let the cost of providing customer executive support per fraudulent transaction detected by the model be $1.5.

   - Total cost of providing customer support per month for fraudulent transactions detected by the model = 1.5 * TF

 2) Let FN be the average number of transactions per month that are fraudulent but not detected by the model.

   - Cost incurred due to these fraudulent transactions left undetected by the model = Average amount per fraudulent transaction * FN

 - Therefore, the cost incurred per month after the model is built and deployed = 1.5 * TF + Average amount per fraudulent transaction * FN
 - Final savings = Cost incurred before - Cost incurred after
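
The formulas above can be collected into one small function. This is a minimal sketch: `monthly_costs` and its argument names are introduced here for illustration, and the numbers in the example call are round placeholder values, not figures from the data set.

```python
def monthly_costs(avg_fraud_amt, frauds_per_month, tf, fn, support_cost=1.5):
    """Return (cost before, cost after, savings) per month.

    avg_fraud_amt    : average amount per fraudulent transaction
    frauds_per_month : average number of fraudulent transactions per month
    tf               : average number of transactions per month flagged as fraud
    fn               : average number of frauds per month the model misses
    support_cost     : customer support cost per flagged transaction
    """
    cost_before = avg_fraud_amt * frauds_per_month
    cost_after = support_cost * tf + avg_fraud_amt * fn
    return cost_before, cost_after, cost_before - cost_after


# Placeholder inputs: 110 frauds a month at 500 each, 100 alerts, 10 misses.
before, after, savings = monthly_costs(500.0, 110, tf=100, fn=10)
print("Cost before :", before)
print("Cost after  :", after)
print("Savings     :", savings)
```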


In [None]:
# Cost incurred per month before the model was deployed = Average amount per fraudulent transaction * Average number of fraudulent transactions per month
# Cost incurred per month before the model was deployed for TRAINING data
Avg_Amt_Per_Fraud_Transaction_Train = fraud_train_df['amt'].mean()
Avg_No_Fraud_Transaction_Per_month_Train = (len(fraud_train_df)/12)

Cost_incurred_per_month_before_the_model_was_deployed_Train = Avg_Amt_Per_Fraud_Transaction_Train * Avg_No_Fraud_Transaction_Per_month_Train
print('Cost incurred per month before the model was deployed for Train data is : {}'.format(round(Cost_incurred_per_month_before_the_model_was_deployed_Train, 2)))

print('\n')
# Cost incurred per month before the model was deployed for TEST data
Avg_Amt_Per_Fraud_Transaction_Test = fraud_test_df['amt'].mean()
Avg_No_Fraud_Transaction_Per_month_Test = (len(fraud_test_df)/12)

Cost_incurred_per_month_before_the_model_was_deployed_Test = Avg_Amt_Per_Fraud_Transaction_Test * Avg_No_Fraud_Transaction_Per_month_Test
print('Cost incurred per month before the model was deployed for Test data is : {}'.format(round(Cost_incurred_per_month_before_the_model_was_deployed_Test, 2)))

#### Total cost of providing customer support per month for fraudulent transactions detected by the model

In [None]:
# Average number of transactions per month detected as fraudulent by the model on TRAINING data
# TF counts every transaction the model flags as fraud, i.e. true positives plus false positives
TF_Train = (25364 + 540506)/12
print('TF_Train is : {}'.format(TF_Train))

# Cost of providing customer executive support per fraudulent transaction detected by the model = $1.5
Total_cost_providing_custmr_support_per_mon_fraud_transactions_detected_by_the_model_Train = 1.5 * TF_Train
print('Total cost of providing customer support per month for fraudulent transactions detected by the model on train data is : {}' .format(Total_cost_providing_custmr_support_per_mon_fraud_transactions_detected_by_the_model_Train))


print('\n')
# Average number of transactions per month detected as fraudulent by the model on TEST data
# TF counts every transaction the model flags as fraud, i.e. true positives plus false positives
TF_Test = (2014 + 10945)/12
print('TF_Test is : {}'.format(TF_Test))

# Cost of providing customer executive support per fraudulent transaction detected by the model = $1.5
Total_cost_providing_custmr_support_per_mon_fraud_transactions_detected_by_the_model_Test = 1.5 * TF_Test
print('Total cost of providing customer support per month for fraudulent transactions detected by the model on test data is : {}' .format(Total_cost_providing_custmr_support_per_mon_fraud_transactions_detected_by_the_model_Test))
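
The counts used in these cells are typed in by hand. As a sketch of an alternative, TF and FN can be read from scikit-learn's `confusion_matrix` directly, which avoids transcription slips. The label arrays below are toy values, and dividing by 12 simply mirrors the convention used in this notebook.

```python
from sklearn.metrics import confusion_matrix

# Toy ground truth and predictions (1 = fraud), for illustration only.
y_true_demo = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred_demo = [0, 1, 0, 1, 0, 1, 0, 1]

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

months = 12
tf_per_month = (tp + fp) / months  # everything the model flags as fraud
fn_per_month = fn / months         # frauds the model misses

print("Flagged as fraud (TP + FP):", tp + fp)
print("Missed frauds (FN)        :", fn)
print("TF per month              :", tf_per_month)
print("FN per month              :", fn_per_month)
```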


### Cost incurred due to these fraudulent transactions left undetected by the model

In [None]:
# Average number of transactions per month that are fraudulent but not detected by the model on TRAINING data
FN_Train = 14796/12
print('FN_Train is : {}'.format(FN_Train))

# Cost incurred due to these fraudulent transactions left undetected by the model = Average amount per fraudulent transaction * FN
Cost_fraud_trans_undetect_model_Train = Avg_Amt_Per_Fraud_Transaction_Training * FN_Train
print('Cost incurred due to these fraudulent transactions left undetected by the model for Train data is : {}'.format(Cost_fraud_trans_undetect_model_Train))

# the cost incurred per month after the model is built and deployed on TRAINING data
Cost_per_mon_aftr_model_deployed_Train= Total_cost_providing_custmr_support_per_mon_fraud_transactions_detected_by_the_model_Train + Cost_fraud_trans_undetect_model_Train
print('Cost incurred per month after the model is built and deployed on train data is : {}'.format(Cost_per_mon_aftr_model_deployed_Train))

print('\n')
# Average number of transactions per month that are fraudulent but not detected by the model on TEST data
FN_Test = 131/12
print('FN_Test is : {}'.format(FN_Test))

# Cost incurred due to these fraudulent transactions left undetected by the model on TEST data
Cost_fraud_trans_undetect_model_Test = Avg_Amt_Per_Fraud_Transaction_Test * FN_Test
print('Cost incurred due to these fraudulent transactions left undetected by the model for Test data is : {}'.format(Cost_fraud_trans_undetect_model_Test))

# Cost incurred per month after the model is built and deployed on TEST data
Cost_per_mon_aftr_model_deployed_Test = Total_cost_providing_custmr_support_per_mon_fraud_transactions_detected_by_the_model_Test + Cost_fraud_trans_undetect_model_Test
print('Cost incurred per month after the model is built and deployed on test data is : {}'.format(Cost_per_mon_aftr_model_deployed_Test))




# Final savings

In [None]:
# Final Savings on TRAINING data
Final_Savings_Train = Cost_incurred_per_month_before_the_model_was_deployed_Train - Cost_per_mon_aftr_model_deployed_Train
print('Final Savings on Train Data is : {}'.format(Final_Savings_Train))

print('\n')
# Final Savings on TEST data
Final_Savings_Test = Cost_incurred_per_month_before_the_model_was_deployed_Test - Cost_per_mon_aftr_model_deployed_Test
print('Final Savings on Test Data is : {}'.format(Final_Savings_Test))
