# Describe your fraud detection model in detail.

Fraud detection is a binary classification problem: the goal is to predict whether a transaction is fraudulent (1) or non-fraudulent (0). Building an effective fraud detection model requires exploratory data analysis to understand the data, followed by cleaning the data and preparing it for modelling. This model was built using a K-Nearest Neighbors classifier, a Decision Tree classifier and a Random Forest classifier.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Importing the dataset

In [4]:
data=pd.read_csv("Fraud.csv")
data

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.00,160296.36,M1979787155,0.00,0.00,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.00,19384.72,M2044282225,0.00,0.00,0,0
2,1,TRANSFER,181.00,C1305486145,181.00,0.00,C553264065,0.00,0.00,1,0
3,1,CASH_OUT,181.00,C840083671,181.00,0.00,C38997010,21182.00,0.00,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.00,29885.86,M1230701703,0.00,0.00,0,0
...,...,...,...,...,...,...,...,...,...,...,...
6362615,743,CASH_OUT,339682.13,C786484425,339682.13,0.00,C776919290,0.00,339682.13,1,0
6362616,743,TRANSFER,6311409.28,C1529008245,6311409.28,0.00,C1881841831,0.00,0.00,1,0
6362617,743,CASH_OUT,6311409.28,C1162922333,6311409.28,0.00,C1365125890,68488.84,6379898.11,1,0
6362618,743,TRANSFER,850002.52,C1685995037,850002.52,0.00,C2080388513,0.00,0.00,1,0
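
The introduction stresses exploratory analysis before modelling. A minimal sketch of the kind of checks that could be run at this point is given below. The `sample` frame is a hypothetical stand-in built from the five rows displayed above (a subset of columns); in the notebook the same calls would be applied to `data`.

```python
import pandas as pd

# Hypothetical mini-frame copied from the first five displayed rows;
# in the notebook this would be the full `data` DataFrame.
sample = pd.DataFrame({
    "step": [1, 1, 1, 1, 1],
    "type": ["PAYMENT", "PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "amount": [9839.64, 1864.28, 181.00, 181.00, 11668.14],
    "isFraud": [0, 0, 1, 1, 0],
    "isFlaggedFraud": [0, 0, 0, 0, 0],
})

# Missing values per column: tells us whether any cleaning is needed.
missing = sample.isnull().sum()

# Class balance of the target: fraud is usually a small minority class.
class_counts = sample["isFraud"].value_counts()

# Fraud rate per transaction type: shows which types carry fraud at all.
fraud_rate_by_type = sample.groupby("type")["isFraud"].mean()

print(missing, class_counts, fraud_rate_by_type, sep="\n\n")
```

On the full dataset, the class counts matter most, because a heavily imbalanced target makes plain accuracy a weak measure of model quality.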


# Remove unwanted attributes which are not used for prediction/detection

In [5]:
data=data.drop(columns=['nameOrig','nameDest'])
data

Unnamed: 0,step,type,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,170136.00,160296.36,0.00,0.00,0,0
1,1,PAYMENT,1864.28,21249.00,19384.72,0.00,0.00,0,0
2,1,TRANSFER,181.00,181.00,0.00,0.00,0.00,1,0
3,1,CASH_OUT,181.00,181.00,0.00,21182.00,0.00,1,0
4,1,PAYMENT,11668.14,41554.00,29885.86,0.00,0.00,0,0
...,...,...,...,...,...,...,...,...,...
6362615,743,CASH_OUT,339682.13,339682.13,0.00,0.00,339682.13,1,0
6362616,743,TRANSFER,6311409.28,6311409.28,0.00,0.00,0.00,1,0
6362617,743,CASH_OUT,6311409.28,6311409.28,0.00,68488.84,6379898.11,1,0
6362618,743,TRANSFER,850002.52,850002.52,0.00,0.00,0.00,1,0


The attributes used as predictors are step, type, amount, oldbalanceOrg, newbalanceOrig, oldbalanceDest and newbalanceDest. The attributes isFraud and isFlaggedFraud are the target labels.

The attributes nameOrig and nameDest are dropped because they do not contribute to the prediction process.

# Separation of Dependent and Independent Variables.

In [6]:
#independent variables
X=data.iloc[:,:-2].values
X

array([[1, 'PAYMENT', 9839.64, ..., 160296.36, 0.0, 0.0],
       [1, 'PAYMENT', 1864.28, ..., 19384.72, 0.0, 0.0],
       [1, 'TRANSFER', 181.0, ..., 0.0, 0.0, 0.0],
       ...,
       [743, 'CASH_OUT', 6311409.28, ..., 0.0, 68488.84, 6379898.11],
       [743, 'TRANSFER', 850002.52, ..., 0.0, 0.0, 0.0],
       [743, 'CASH_OUT', 850002.52, ..., 0.0, 6510099.11, 7360101.63]],
      dtype=object)

In [7]:
#dependent variable-1(for detecting isFraud)
y=data.iloc[:,-2].values
y

array([0, 0, 1, ..., 1, 1, 1], dtype=int64)

In [8]:
#dependent variable-2(for detecting isFlaggedFraud)
z=data.iloc[:,-1].values
z

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

# Encoding categorical data

In [9]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct=ColumnTransformer(transformers=[('encoder',OneHotEncoder(),[1])],remainder='passthrough')
X=np.array(ct.fit_transform(X))
print(X)

[[0.0 0.0 0.0 ... 160296.36 0.0 0.0]
 [0.0 0.0 0.0 ... 19384.72 0.0 0.0]
 [0.0 0.0 0.0 ... 0.0 0.0 0.0]
 ...
 [0.0 1.0 0.0 ... 0.0 68488.84 6379898.11]
 [0.0 0.0 0.0 ... 0.0 0.0 0.0]
 [0.0 1.0 0.0 ... 0.0 6510099.11 7360101.63]]
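
To make the layout of the encoded matrix easier to read, here is a small sketch of the same `ColumnTransformer` setup on a hypothetical three-row input. It shows that the one-hot columns come first, in alphabetical category order, followed by the passed-through columns. The `X_demo` values and the `sparse_threshold=0` setting are assumptions made for this demo only.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical three-row input laid out like X: [step, type, amount].
X_demo = np.array([[1, "PAYMENT", 9839.64],
                   [1, "TRANSFER", 181.0],
                   [1, "CASH_OUT", 181.0]], dtype=object)

# sparse_threshold=0 forces a dense result for this tiny demo
# (an assumption; the notebook cell relies on the default).
ct_demo = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [1])],
    remainder="passthrough",
    sparse_threshold=0,
)
X_encoded = ct_demo.fit_transform(X_demo)

# Categories learned by the encoder, in the order of the output columns.
categories = list(ct_demo.named_transformers_["encoder"].categories_[0])
n_dummy = len(categories)  # number of leading one-hot columns

print(categories)
print(X_encoded)
```

Knowing `n_dummy` is useful later, since it marks where the one-hot columns end and the numeric columns begin.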


# Splitting the dataset into the training and test set

In [10]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test,z_train,z_test=train_test_split(X,y,z,test_size=0.2,random_state=1)
print(X_test)

[[1.0 0.0 0.0 ... 31616.12 169508.66 145951.53]
 [0.0 0.0 0.0 ... 0.0 0.0 0.0]
 [0.0 0.0 0.0 ... 0.0 0.0 0.0]
 ...
 [0.0 1.0 0.0 ... 0.0 601603.76 955405.34]
 [0.0 0.0 0.0 ... 323169.93 0.0 0.0]
 [0.0 0.0 0.0 ... 0.0 0.0 0.0]]


# Feature Scaling

In [11]:
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
X_train[:,1:]=sc.fit_transform(X_train[:,1:])
X_test[:,1:]=sc.transform(X_test[:,1:])
print(X_train)

[[0.0 -0.7365698435409481 -0.08091833106514933 ... -0.2923588746637241
  -0.3240454946825773 -0.33352095618637007]
 [0.0 -0.7365698435409481 -0.08091833106514933 ... -0.2923588746637241
  -0.3240454946825773 -0.33352095618637007]
 [0.0 1.357644504079276 -0.08091833106514933 ... -0.2923588746637241
  -0.3240454946825773 -0.2910790614457731]
 ...
 [0.0 1.357644504079276 -0.08091833106514933 ... -0.2923588746637241
  2.194561180763331 2.030544387984998]
 [0.0 -0.7365698435409481 -0.08091833106514933 ... -0.2923588746637241
  -0.3240454946825773 -0.33352095618637007]
 [0.0 -0.7365698435409481 -0.08091833106514933 ... -0.2923588746637241
  -0.3240454946825773 -0.33352095618637007]]
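
The slice `[:, 1:]` standardizes every column except the first, which appears to include some of the one-hot columns (the second column of the output above takes only two distinct values after scaling). A possible alternative, sketched here on hypothetical data, is to scale only the numeric columns that follow the one-hot block. The value of `n_dummy` and the demo arrays are assumptions, not values from the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical encoded matrix: 2 one-hot columns followed by 2 numeric ones.
n_dummy = 2
X_train_demo = np.array([[1.0, 0.0, 100.0, 5.0],
                         [0.0, 1.0, 300.0, 7.0],
                         [1.0, 0.0, 500.0, 9.0]])
X_test_demo = np.array([[0.0, 1.0, 300.0, 7.0]])

sc_demo = StandardScaler()
# Fit on the training numeric columns only, then reuse the same
# mean and standard deviation for the test set.
X_train_demo[:, n_dummy:] = sc_demo.fit_transform(X_train_demo[:, n_dummy:])
X_test_demo[:, n_dummy:] = sc_demo.transform(X_test_demo[:, n_dummy:])

print(X_train_demo)
print(X_test_demo)
```

Tree-based models are largely insensitive to this choice, but distance-based models such as K-NN can be affected by how the indicator columns are scaled.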


# Detection of Fraud

We can detect fraud using various classification techniques, including:
1) K-NN.
2) Decision Tree Classifier.
3) Random Forest.


# 1) K-NN.

In [12]:
#Training the K-NN Model;
from sklearn.neighbors import KNeighborsClassifier
classifier=KNeighborsClassifier(n_neighbors=5,metric='minkowski',p=2)
classifier.fit(X_train,y_train)

In [58]:
#Predicting the testset result;
y_pred=classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1),y_test.reshape(len(y_test),1)),1))

[[0 0]
 [0 0]
 [0 0]
 ...
 [0 0]
 [0 0]
 [0 0]]


In [59]:
#Confusion Matrix;
from sklearn.metrics import confusion_matrix,accuracy_score
cm=confusion_matrix(y_test,y_pred)
print(cm)

[[209498      3]
 [   113    101]]


In [60]:
#Accuracy;
print(accuracy_score(y_test,y_pred)*100)

99.94468683689769


# 2) Decision Tree Classifier

In [61]:
#Training the Decision Tree Model;
from sklearn.tree import DecisionTreeClassifier
classifier= DecisionTreeClassifier(criterion='entropy',random_state=0)
classifier.fit(X_train,y_train)

In [62]:
#Predicting the testset result;
y_pred=classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1),y_test.reshape(len(y_test),1)),1))

[[0 0]
 [0 0]
 [0 0]
 ...
 [0 0]
 [0 0]
 [0 0]]


In [63]:
#Confusion Matrix;
from sklearn.metrics import confusion_matrix,accuracy_score
cm=confusion_matrix(y_test,y_pred)
print(cm)

[[209463     38]
 [    35    179]]


In [64]:
#Accuracy;
print(accuracy_score(y_test,y_pred)*100)

99.96519085425459


# 3) Random Forest

In [65]:
#Training the Random forest Model;
from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier(criterion='entropy',random_state=0)
classifier.fit(X_train,y_train)

In [66]:
#Predicting the testset result;
y_pred=classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1),y_test.reshape(len(y_test),1)),1))

[[0 0]
 [0 0]
 [0 0]
 ...
 [0 0]
 [0 0]
 [0 0]]


In [67]:
#Confusion Matrix;
from sklearn.metrics import confusion_matrix,accuracy_score
cm=confusion_matrix(y_test,y_pred)
print(cm)

[[209496      5]
 [    52    162]]


In [68]:
#Accuracy;
print(accuracy_score(y_test,y_pred)*100)

99.9728202560618
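
All three models score above 99.9% accuracy, but only 214 of the 209,715 test transactions are fraudulent, so accuracy alone separates the models poorly. The sketch below recomputes precision, recall and F1 for the fraud class directly from the three confusion matrices printed above. The helper name `fraud_metrics` is hypothetical, and the layout assumed is scikit-learn's (rows are true labels, columns are predicted labels).

```python
import numpy as np

# Confusion matrices copied from the outputs above.
reported = {
    "K-NN": np.array([[209498, 3], [113, 101]]),
    "Decision Tree": np.array([[209463, 38], [35, 179]]),
    "Random Forest": np.array([[209496, 5], [52, 162]]),
}

def fraud_metrics(cm):
    """Return (precision, recall, f1) for the positive (fraud) class."""
    tn, fp, fn, tp = cm.ravel()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

scores = {name: fraud_metrics(cm) for name, cm in reported.items()}
for name, (p, r, f1) in scores.items():
    print(f"{name:14s} precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

On these numbers K-NN recovers about 47% of the fraudulent test transactions (101 of 214), while the Decision Tree and the Random Forest recover about 84% and 76% respectively.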


# Detection of flagged fraud

# 1) K-NN

In [69]:
#Training the K-NN Model;
from sklearn.neighbors import KNeighborsClassifier
classifier=KNeighborsClassifier(n_neighbors=5,metric='minkowski',p=2)
classifier.fit(X_train,z_train)

In [70]:
#Predicting the testset result;
z_pred=classifier.predict(X_test)
print(np.concatenate((z_pred.reshape(len(z_pred),1),z_test.reshape(len(z_test),1)),1))

[[0 0]
 [0 0]
 [0 0]
 ...
 [0 0]
 [0 0]
 [0 0]]


In [71]:
#Confusion Matrix;
from sklearn.metrics import confusion_matrix,accuracy_score
cm=confusion_matrix(z_test,z_pred)
print(cm)

[[209715]]


In [72]:
#Accuracy;
print(accuracy_score(z_test,z_pred)*100)

100.0


# 2) Decision Tree Classifier

In [73]:
#Training the Decision Tree Model;
from sklearn.tree import DecisionTreeClassifier
classifier= DecisionTreeClassifier(criterion='entropy',random_state=0)
classifier.fit(X_train,z_train)

In [74]:
#Predicting the testset result;
z_pred=classifier.predict(X_test)
print(np.concatenate((z_pred.reshape(len(z_pred),1),z_test.reshape(len(z_test),1)),1))

[[0 0]
 [0 0]
 [0 0]
 ...
 [0 0]
 [0 0]
 [0 0]]


In [75]:
#Confusion Matrix;
from sklearn.metrics import confusion_matrix,accuracy_score
cm=confusion_matrix(z_test,z_pred)
print(cm)

[[209715]]


In [76]:
#Accuracy;
print(accuracy_score(z_test,z_pred)*100)

100.0
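
The 1x1 confusion matrix `[[209715]]` means that only a single class appears among the true and predicted labels of the test set, so the 100% accuracy does not show whether the model can recognize a flagged transaction. A minimal sketch of two checks that make this visible is given below. The demo arrays are hypothetical stand-ins for `z_test` and `z_pred`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels standing in for z_test / z_pred: no flagged case at all.
z_test_demo = np.zeros(10, dtype=int)
z_pred_demo = np.zeros(10, dtype=int)

# Count how many examples of each class the test labels contain.
classes, counts = np.unique(z_test_demo, return_counts=True)
class_counts = dict(zip(classes.tolist(), counts.tolist()))
print(class_counts)

# Passing labels=[0, 1] always yields a 2x2 matrix, so an empty
# positive row is visible instead of being silently dropped.
cm_demo = confusion_matrix(z_test_demo, z_pred_demo, labels=[0, 1])
print(cm_demo)
```

If the positive class turns out to be missing from the test set, the evaluation for isFlaggedFraud would need a split that keeps some flagged examples on both sides before any accuracy figure is meaningful.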


Here are key strategies that a company should implement to both prevent fraud and improve detection during such transitions:
1) Adopt strong authentication mechanisms.
2) Implement robust access control and monitoring.
3) Leverage machine learning for fraud detection.
4) Secure software development and infrastructure.

Here is how we can measure the effectiveness of the strategies that the company has implemented:
1) Key performance indicators (KPIs) to track.
2) Fraud detection rate.
3) False positive rate.
4) False negative rate.
5) Precision, recall and F1 score.
6) User and employee feedback.
7) Model performance evaluation.
8) Fraud simulation tests.
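
As an illustration of the rate-based KPIs named above, the sketch below computes the fraud detection rate, false positive rate and false negative rate from confusion matrix counts. The function name `fraud_kpis` is hypothetical, and the counts are those of the Random Forest confusion matrix reported earlier.

```python
def fraud_kpis(tn, fp, fn, tp):
    """Rate-based KPIs from confusion matrix counts (fraud = positive class)."""
    return {
        # Share of actual fraud cases that were caught.
        "fraud_detection_rate": tp / (tp + fn),
        # Share of legitimate transactions wrongly raised as fraud.
        "false_positive_rate": fp / (fp + tn),
        # Share of actual fraud cases that were missed.
        "false_negative_rate": fn / (fn + tp),
    }

# Counts taken from the Random Forest confusion matrix [[209496, 5], [52, 162]].
kpis = fraud_kpis(tn=209496, fp=5, fn=52, tp=162)
for name, value in kpis.items():
    print(f"{name}: {value:.6f}")
```

Tracking these rates over time, rather than a single accuracy figure, makes it easier to see whether a change in strategy reduces missed fraud without flooding reviewers with false alarms.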