# Fraud Detection with Binary Classification Models

### Prompt:
To identify online payment fraud, we train a machine learning model that classifies payments as fraudulent or non-fraudulent.

### Feature Explanation:
step: represents a unit of time where 1 step equals 1 hour\
type: type of online transaction\
amount: the amount of the transaction\
nameOrig: customer starting the transaction\
oldbalanceOrg: balance before the transaction\
newbalanceOrig: balance after the transaction\
nameDest: recipient of the transaction\
oldbalanceDest: initial balance of recipient before the transaction\
newbalanceDest: the new balance of recipient after the transaction\
isFraud: whether the transaction is fraudulent (the target label)

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.float_format', lambda x: '%.4f' % x)

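# NOTE: the path must point to the online-payments CSV with the columns listed
# above; 'Pumpkin_Seeds_Dataset.csv' appears to be a different data set.
df = pd.read_csv('Pumpkin_Seeds_Dataset.csv')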

FileNotFoundError: [Errno 2] No such file or directory: 'Pumpkin_Seeds_Dataset.csv'

In [3]:
df

NameError: name 'df' is not defined

In [4]:
df.info()

NameError: name 'df' is not defined

In [None]:
df.describe()

In [None]:
plt.figure(figsize=(12,12))
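# Restrict the correlation matrix to numeric columns; the name/type columns are strings.
sns.heatmap(df.corr(numeric_only=True), annot=True)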

In [None]:
sns.violinplot(x=df['step'])

In [None]:
plt.ticklabel_format(style="plain", axis='y')
sns.countplot(x=df['type'])

In [None]:
plt.ticklabel_format(style="plain", axis='x')
sns.violinplot(x=df['amount'])

In [None]:
plt.ticklabel_format(style="plain", axis='x')
sns.histplot(df['amount'])
plt.xlim(0,100000)

In [None]:
print(f"There are {df.query('amount > 1000000').shape[0]} transactions over 1,000,000.")

In [None]:
print(f"There are {df.query('amount > 10000000').shape[0]} transactions over 10,000,000.")

In [None]:
# Count how many transactions fall into each class.
df['isFraud'].value_counts()

In [None]:
# Count how many transactions the data set's own flag marks as fraud.
df['isFlaggedFraud'].value_counts()

Let's convert the transaction types to numerical codes so the models can use them.

In [None]:
dic = {'PAYMENT': 1, 'TRANSFER':2, "CASH_OUT":3, "DEBIT":4, "CASH_IN":5}
df["type"] = df["type"].map(dic)
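The integer codes above impose an arbitrary order on the transaction types (CASH_IN = 5 is not "more" than PAYMENT = 1), and a linear model such as logistic regression will treat that order as meaningful. One common alternative is one-hot encoding. The following is a minimal, dependency-free sketch on a few made-up rows; the helper name `one_hot` and the sample values are illustrative assumptions, not part of the original notebook.

```python
# Illustrative sketch: one-hot encode a categorical column without pandas.
# The category names match the mapping above; the sample rows are made up.
CATEGORIES = ["PAYMENT", "TRANSFER", "CASH_OUT", "DEBIT", "CASH_IN"]

def one_hot(value, categories=CATEGORIES):
    """Return a list with a 1 in the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown transaction type: {value!r}")
    return [1 if value == c else 0 for c in categories]

sample_types = ["TRANSFER", "CASH_IN", "PAYMENT"]
encoded = [one_hot(t) for t in sample_types]
for t, row in zip(sample_types, encoded):
    print(t, row)
```

On the real DataFrame, `pd.get_dummies(df, columns=['type'])` would produce equivalent indicator columns, applied to the original string values rather than the mapped integers.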

In [None]:
from sklearn.model_selection import train_test_split

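# Use the first nine columns as candidate features, dropping the two ID columns.
features = [c for c in df.columns[:9] if c not in ('nameOrig', 'nameDest')]
x = df[features]
y = df['isFraud']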

xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size=.33, random_state=1)

In [None]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score

clf = LogisticRegressionCV(cv=5, max_iter=500, random_state=0)
clf.fit(xTrain,yTrain)

In [None]:
preds = clf.predict(xTest)
print(f"Our accuracy score on the test data is {accuracy_score(yTest, preds):.3%}")

In [None]:
# Baseline: how well does the data set's own isFlaggedFraud column match isFraud?
flagAccuracy = accuracy_score(df['isFraud'], df['isFlaggedFraud'])
print(f"The accuracy of the data set's existing isFlaggedFraud flag is {flagAccuracy:.3%}")

Next, we will see whether a different model can beat the accuracy of the data set's existing fraud flag.
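Accuracy alone can be misleading when one class is rare. As a minimal sketch with made-up labels (the 1,000 transactions and 2 fraud cases below are hypothetical, not counts from this data set), a "model" that never predicts fraud still scores very high accuracy while catching no fraud at all:

```python
# Hypothetical labels: 1,000 transactions, 2 of them fraudulent (made-up numbers).
y_true = [0] * 998 + [1] * 2
y_pred = [0] * 1000  # a "model" that always predicts "not fraud"

def accuracy(true, pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(true, pred)) / len(true)

def recall(true, pred, positive=1):
    """Fraction of actual positives that were predicted positive."""
    actual_pos = sum(t == positive for t in true)
    hits = sum(t == positive and p == positive for t, p in zip(true, pred))
    return hits / actual_pos if actual_pos else 0.0

print(f"accuracy: {accuracy(y_true, y_pred):.3f}")
print(f"recall on fraud: {recall(y_true, y_pred):.3f}")
```

This is why the confusion matrices and the classification report later in the notebook are worth reading alongside the accuracy scores.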

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
confmatrix = confusion_matrix(yTest, preds)
cm = ConfusionMatrixDisplay(confmatrix, display_labels=[False, True])
cm.plot(cmap="YlGnBu_r")
plt.show()

## Let's try another model to improve accuracy

In [None]:
from sklearn.ensemble import RandomForestClassifier

Rclf = RandomForestClassifier(n_estimators=20, random_state=0, max_depth=6)
Rclf = Rclf.fit(xTrain,yTrain)

In [None]:
yRPred = Rclf.predict(xTest)
print(f"Our new accuracy score on the test data is {accuracy_score(yTest, yRPred):.6%}")

In [None]:
plt.figure(figsize=(15,7))
plt.title("Feature Importance", fontsize= 15)
# Label each bar with the column the forest was trained on
# (avoids overwriting the feature DataFrame `x`).
featureNames = list(xTrain.columns)
sns.barplot(x=featureNames, y=Rclf.feature_importances_)

In [None]:
confmatrix = confusion_matrix(yTest, yRPred)
cm = ConfusionMatrixDisplay(confmatrix, display_labels=[False, True])
cm.plot(cmap="YlGnBu_r")
plt.show()

Now we have only 1 false positive!
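To put a count like this in context, precision and recall can be derived from the four cells of the matrix. scikit-learn lays out a binary confusion matrix as `[[TN, FP], [FN, TP]]`. The counts below are made up for illustration and are not this notebook's results.

```python
# Hypothetical confusion matrix in scikit-learn's layout: rows are true labels,
# columns are predicted labels, i.e. [[TN, FP], [FN, TP]]. Counts are made up.
conf = [[9990, 1],
        [4, 5]]

(tn, fp), (fn, tp) = conf

precision = tp / (tp + fp)  # of the transactions flagged as fraud, how many were fraud
recall = tp / (tp + fn)     # of the actual fraud cases, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

A low false-positive count says little by itself; the false-negative cell shows how much fraud the model missed.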

In [None]:
from sklearn.metrics import RocCurveDisplay

ax = plt.gca()
rfcDisplay = RocCurveDisplay.from_estimator(Rclf, xTest, yTest, ax=ax)
clfDisplay = RocCurveDisplay.from_estimator(clf, xTest, yTest, ax=ax)
plt.show()

In [None]:
from sklearn.metrics import classification_report


# classification_report expects the true labels first, then the predictions.
print(classification_report(yTest, yRPred, target_names=['Not Fraud', 'Fraud']))
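Argument order matters here: scikit-learn's `classification_report` expects the true labels first and the predictions second. Passing them in the opposite order exchanges precision and recall for each class. A small sketch with made-up labels shows the effect; the helper `precision_recall` is illustrative only.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up labels: 3 actual fraud cases, and the model flags 2 transactions.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 1, 1, 0, 0, 0]

correct = precision_recall(y_true, y_pred)
swapped = precision_recall(y_pred, y_true)
print("correct order (precision, recall):", correct)
print("swapped order (precision, recall):", swapped)
```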

### We can conclude that our random forest model is pretty good at detecting fraud.