# Fraud prediction models and threshold analysis

In this notebook, I train logistic regression, SVC and XGBoost models and vary the decision threshold to see its impact on recall and F1 score. Both metrics are very important in fraud detection applications, where the priority is to reduce false negatives (frauds that go undetected).

# Load data

In [None]:
import numpy as np
import pandas as pd

In [None]:
df_test =  pd.read_csv('/kaggle/input/fraud-detection/fraudTest.csv')
df_train =  pd.read_csv('/kaggle/input/fraud-detection/fraudTrain.csv')

In [None]:
print(len(df_train), len(df_test))

I'm going to combine both dataframes, train and test, into a single one with a union (concatenation) operation.

In [None]:
df_complete = pd.concat([df_train, df_test])
len(df_complete)

In [None]:
df_complete.head()

# Data Understanding and Exploration

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df_complete.groupby('is_fraud').count()['cc_num'].plot.bar()

We can see that, as usual, fraudulent transactions are much rarer than legitimate ones. A model trained on such data can perform far better on the non-fraud class (0) than on the fraud class (1).
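
To make the effect of imbalance concrete, here is a minimal sketch on made-up labels (not the Kaggle data, and the 1% fraud rate is an assumption for illustration). A "model" that always predicts 0 gets high accuracy while catching no fraud at all, which is why recall on the fraud class matters more than accuracy here.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 990 legitimate (0) and 10 fraudulent (1) transactions
y_true = np.array([0] * 990 + [1] * 10)

# A trivial "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
fraud_recall = recall_score(y_true, y_pred)  # recall for the positive class (1)

print(f"accuracy: {accuracy:.2f}, fraud recall: {fraud_recall:.2f}")
```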

In [None]:
fraud = df_complete[df_complete['is_fraud'] == 1]
non_fraud = df_complete[df_complete['is_fraud'] == 0]

print(len(fraud), len(non_fraud))

In [None]:
df_complete.describe()

In [None]:
df_complete.info()

Checking the number of distinct values per column

In [None]:
df_complete.nunique()

Checking for NaN/null values and duplicated rows

In [None]:
df_complete.isna().sum().sum()

In [None]:
df_complete.duplicated().sum()

I'm not going to remove outliers, since unusual values can be important for detecting fraud.

In [None]:
# Correlation heatmap over the int64/float64 columns only
numeric_columns = df_complete.select_dtypes(include=['int64', 'float64'])
sns.heatmap(numeric_columns.corr())

The heatmap shows that 'is_fraud' correlates more strongly with 'amt' than with the other numeric variables.
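
A heatmap can be hard to read precisely, so a ranked list of correlations with the target is a useful complement. The sketch below uses a synthetic dataframe (the column names mirror the dataset, but the values are invented so that 'amt' depends on the label); on the real data the same chain could be applied to `df_complete`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: 'amt' is built to depend on the label
rng = np.random.default_rng(42)
n = 1000
is_fraud = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "amt": 50 + 300 * is_fraud + rng.normal(0, 20, size=n),
    "city_pop": rng.integers(100, 100_000, size=n),
    "is_fraud": is_fraud,
})

# Absolute correlation of every numeric column with the target, strongest first
corr_with_target = (
    toy.select_dtypes(include="number")
       .corr()["is_fraud"]
       .drop("is_fraud")
       .abs()
       .sort_values(ascending=False)
)
print(corr_with_target)
```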

# Data Preparation

### Balancing the dataset

In [None]:
df_balanced = pd.concat([fraud, non_fraud.sample(len(fraud), random_state= 42)])

In [None]:
df_balanced.shape

In [None]:
df_balanced.groupby('is_fraud').count()['cc_num'].plot.bar()
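
The balancing step above is random undersampling: it keeps every fraud row and draws an equally sized random sample of the non-fraud rows. The sketch below reproduces the idea on a made-up dataframe and adds a shuffle, which is my own addition so that the classes are not stored in two contiguous blocks.

```python
import pandas as pd

# Made-up data: 5 fraud rows and 95 non-fraud rows
toy = pd.DataFrame({"amt": range(100), "is_fraud": [1] * 5 + [0] * 95})

fraud_rows = toy[toy["is_fraud"] == 1]
non_fraud_rows = toy[toy["is_fraud"] == 0]

# Keep all fraud rows and sample the same number of non-fraud rows
balanced = pd.concat(
    [fraud_rows, non_fraud_rows.sample(len(fraud_rows), random_state=42)]
)

# Shuffle the rows and reset the index (optional)
balanced = balanced.sample(frac=1, random_state=42).reset_index(drop=True)

counts = balanced["is_fraud"].value_counts()
print(counts)
```

Undersampling discards most of the majority class, so it trades information for balance; that trade-off is worth keeping in mind when reading the metrics later on.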

### Dropping columns

In [None]:
# Dropping columns not relevant for this case
columns_dropped = ['Unnamed: 0',
                   'merchant', 
                   'cc_num',
                   'first', 
                   'last',
                   'gender',
                   'trans_num',
                   'unix_time',
                   'street',
                   'merch_lat',
                   'merch_long',
                   'job',
                   'zip',
                   ]

df_balanced.drop(columns = columns_dropped, inplace = True)

In [None]:
df_balanced.info()

### Feature Engineering: Managing datetimes

In [None]:
# First, I'm converting 'trans_date_trans_time' and 'dob' into datetime type
df_balanced['trans_date_trans_time'] = pd.to_datetime(df_balanced['trans_date_trans_time'])
df_balanced['dob'] = pd.to_datetime(df_balanced['dob'])

In [None]:
df_balanced.info()

In [None]:
# Now, we can use these datetime variables to extract relevant information
# about the transaction and the client, such as day hour and age

# Let's replace the transaction timestamp with its hour
df_balanced['trans_date_trans_time'] = df_balanced['trans_date_trans_time'].dt.hour

In [None]:
df_balanced = df_balanced.rename(columns = {'trans_date_trans_time': 'hour_transaction'})

In [None]:
# Function to get the time of day from a row's transaction hour
def get_tod(row):
    hour = row['hour_transaction']
    if 4 < hour <= 12:
        return 'morning'
    if 12 < hour <= 20:
        return 'afternoon'
    # Remaining hours: up to 4 or after 20
    return 'night'

In [None]:
df_balanced['hour_transaction'] = df_balanced.apply(get_tod, axis = 1)
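
Row-wise `apply` can be slow on large dataframes. As an alternative, here is a vectorized sketch with `np.select` that uses the same hour boundaries; the `hours` series is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up hours covering every boundary of the three buckets
hours = pd.Series([0, 4, 5, 12, 13, 20, 21, 23])

conditions = [
    (hours > 4) & (hours <= 12),   # morning
    (hours > 12) & (hours <= 20),  # afternoon
]
# Anything that matches no condition (hour <= 4 or hour > 20) is 'night'
time_of_day = np.select(conditions, ["morning", "afternoon"], default="night")

print(list(time_of_day))
```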

In [None]:
df_balanced.head()

In [None]:
# Now, about 'dob' (date of birth), we can get the age of the user
df_balanced['dob']= df_balanced['dob'].dt.year
df_balanced = df_balanced.rename(columns = {'dob': 'age'})

In [None]:
from datetime import datetime
df_balanced['age'] = datetime.now().year - df_balanced['age']
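
Computing the age from `datetime.now()` makes the feature depend on when the notebook is run. One alternative, sketched below on made-up rows, is to measure the age at the time of the transaction. This is an assumption about what is more useful, not what the notebook does, and on the real data it would have to run before 'trans_date_trans_time' is overwritten with the hour.

```python
import pandas as pd

# Made-up rows using the dataset's column names
toy = pd.DataFrame({
    "trans_date_trans_time": ["2019-01-01 00:00:18", "2020-06-21 12:14:25"],
    "dob": ["1988-03-09", "1968-03-19"],
})

trans_time = pd.to_datetime(toy["trans_date_trans_time"])
dob = pd.to_datetime(toy["dob"])

# Approximate age in years at transaction time (year difference only)
toy["age"] = trans_time.dt.year - dob.dt.year
print(toy["age"].tolist())
```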

In [None]:
# Analyzing how many frauds occur for each age group
df_balanced[df_balanced['is_fraud'] == 1].groupby('age').count()['is_fraud']

In [None]:
df_balanced.info()

### Label Encoding

Encoding the categorical columns as integers with scikit-learn's OrdinalEncoder

In [None]:
NUMERICAL_FEATURES = [i for i in df_balanced.columns if df_balanced[i].dtype == 'int64'\
                      or df_balanced[i].dtype =='int32' \
                      or df_balanced[i].dtype =='float64']
CATEGORICAL_FEATURES = [i for i in df_balanced.columns if df_balanced[i].dtype == 'object']

In [None]:
NUMERICAL_FEATURES

In [None]:
CATEGORICAL_FEATURES

In [None]:
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder()
encoder.fit(df_balanced[CATEGORICAL_FEATURES])

df_balanced[CATEGORICAL_FEATURES] = encoder.transform(df_balanced[CATEGORICAL_FEATURES])
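
To see what the encoder does, here is a small sketch on a made-up column. By default `OrdinalEncoder` sorts the categories of each column and assigns them the codes 0, 1, 2, and so on. Note that these integers imply an order that the categories may not really have, which can matter for linear and distance-based models.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up categorical column
toy = pd.DataFrame({"hour_transaction": ["night", "morning", "afternoon", "morning"]})

toy_encoder = OrdinalEncoder()
codes = toy_encoder.fit_transform(toy[["hour_transaction"]])

# Categories are sorted, and each one gets its position as the code
print(list(toy_encoder.categories_[0]))
print(codes.ravel().tolist())
```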

In [None]:
df_balanced.head()

### Correcting datatypes

In [None]:
df_balanced[['is_fraud', 'age']] = df_balanced[['is_fraud', 'age']].astype('float64')

### Scaling dataset

Since some of the models I'm trying, such as SVM, rely on distances between points, I'll scale the dataset.

In [None]:
sns.boxplot(df_balanced[NUMERICAL_FEATURES])

In [None]:
sns.boxplot(df_balanced[['amt']])

Not all of the features seem to follow a Gaussian (normal) distribution, so I'm using a min-max scaler rather than standardization.

In [None]:
# Using min max scaler
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df_scaled = scaler.fit_transform(df_balanced)
df_scaled = pd.DataFrame(df_scaled)
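
Min-max scaling maps each column to [0, 1] with x' = (x - min) / (max - min). The cell above fits the scaler on the whole balanced dataset. A common alternative, sketched here on synthetic data with hypothetical variable names, is to split first and fit the scaler on the training features only, so that the test set does not influence the minimum and maximum. The sketch also keeps the column names, which `fit_transform` drops.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the prepared dataframe
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "amt": rng.exponential(100, 200),
    "age": rng.integers(18, 90, 200),
    "is_fraud": rng.integers(0, 2, 200),
})
features = toy.drop(columns="is_fraud")
target = toy["is_fraud"]

f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)

# Fit on the training features only, then apply the same transform to both splits
toy_scaler = MinMaxScaler()
f_train_scaled = pd.DataFrame(
    toy_scaler.fit_transform(f_train), columns=features.columns, index=f_train.index
)
f_test_scaled = pd.DataFrame(
    toy_scaler.transform(f_test), columns=features.columns, index=f_test.index
)
print(f_train_scaled.describe().loc[["min", "max"]])
```

With this approach, test values can fall slightly outside [0, 1], which is expected.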

In [None]:
last_column = df_scaled.shape[1]-1

In [None]:
print(f"Not fraud: {df_scaled[df_scaled[last_column] == 0].count()[last_column]}")
print(f"Fraud: {df_scaled[df_scaled[last_column] == 1].count()[last_column]}")

In [None]:
df_scaled.rename(columns={last_column: 'is_fraud'}, inplace=True)
df_scaled.head()

# Modeling

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split

In [None]:
# X = feature values, all the columns except the last column
X = df_scaled.drop(columns = 'is_fraud')

# y = target values, last column of the data frame
y = df_scaled['is_fraud']

In [None]:
# Splitting train and test - hold out
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Logistic Regression model

In [None]:
# Training
model = LogisticRegression()
model.fit(x_train, y_train)

In [None]:
# Now let's define a function to get the predictions and set the threshold

def predict(model, test_set, threshold):
    predictions = model.predict(test_set)
    pred_threshold = model.predict_proba(test_set)
    test_set["prediction"] = predictions
    test_set["pred_threshold"] = (pred_threshold >= threshold)[:, 1].astype(float)
    return test_set

In [None]:
# Use 0.4 as threshold for LR model
predict(model, x_test, 0.4)

The 'prediction' column uses the standard threshold (0.5), and the 'pred_threshold' column holds the results with the changed threshold.
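
The function above writes its results into the test dataframe, which is why the prediction columns have to be dropped again before the next model. As a variation, here is a sketch of a function that leaves its input untouched and returns only the thresholded labels. `predict_with_threshold` and the toy data are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def predict_with_threshold(fitted_model, features, threshold=0.5):
    """Return 0/1 labels: 1 where P(class 1) >= threshold."""
    positive_proba = fitted_model.predict_proba(features)[:, 1]
    return (positive_proba >= threshold).astype(int)


# Synthetic binary classification problem
X_toy, y_toy = make_classification(n_samples=500, n_features=5, random_state=42)
toy_model = LogisticRegression().fit(X_toy, y_toy)

pred_default = predict_with_threshold(toy_model, X_toy, 0.5)
pred_lowered = predict_with_threshold(toy_model, X_toy, 0.4)

# Lowering the threshold can only turn 0s into 1s, never the reverse
print(pred_default.sum(), pred_lowered.sum())
```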

In [None]:
y_test = pd.DataFrame(y_test)

In [None]:
x_test["real"] = y_test["is_fraud"]

In [None]:
x_test.head(5)

In [None]:
# With 0.5 threshold
print(classification_report(x_test['real'], x_test['prediction']))

In [None]:
# With 0.4 threshold
print(classification_report(x_test['real'], x_test['pred_threshold']))

We can see an improvement in recall for frauds, which is now 0.94, while some other metrics, such as precision, got slightly worse.

The F1-score overall got better!
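
Instead of comparing only two thresholds, it can help to sweep several and tabulate the metrics. The sketch below does this on synthetic data with a logistic regression; the data, the threshold grid and the variable names are all assumptions for illustration. Recall can only stay the same or drop as the threshold rises, while precision usually moves the other way.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem
X_toy, y_toy = make_classification(n_samples=2000, n_features=8, random_state=42)
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

toy_model = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
proba = toy_model.predict_proba(X_eval)[:, 1]

rows = []
for threshold in [0.3, 0.4, 0.5, 0.6]:
    pred = (proba >= threshold).astype(int)
    rows.append({
        "threshold": threshold,
        "precision": precision_score(y_eval, pred, zero_division=0),
        "recall": recall_score(y_eval, pred),
        "f1": f1_score(y_eval, pred),
    })

sweep = pd.DataFrame(rows)
print(sweep)
```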

In [None]:
# Now let's define a function to plot the confusion matrix
def confusion_matrix_plot(test_set, pred_label, model):
    # Use the dataframe passed in, not the global x_test
    cm = confusion_matrix(test_set['real'], test_set[pred_label], labels=model.classes_)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                                  display_labels=model.classes_)
    disp.plot()
    plt.show()

In [None]:
confusion_matrix_plot(x_test, 'prediction', model)

In [None]:
confusion_matrix_plot(x_test, 'pred_threshold', model)

With the lower threshold, we got much better at catching fraudulent transactions, but worse at classifying non-fraudulent ones.
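
The numbers in the classification report can be derived directly from the four cells of the confusion matrix. The sketch below shows the arithmetic on made-up labels (not the notebook's results): recall = TP / (TP + FN), precision = TP / (TP + FP), and the recall of the non-fraud class is TN / (TN + FP).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: 4 non-fraud (0) and 6 fraud (1) transactions
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1, 1, 1, 1, 0])

# For binary labels [0, 1], ravel() returns the cells in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

fraud_recall = tp / (tp + fn)       # share of frauds that were caught
fraud_precision = tp / (tp + fp)    # share of fraud alerts that were real
non_fraud_recall = tn / (tn + fp)   # share of legitimate transactions left alone

print(tn, fp, fn, tp)
print(round(fraud_recall, 3), round(fraud_precision, 3), non_fraud_recall)
```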

# SVM Classifier Model

Now let's try an SVC model.

In [None]:
from sklearn.svm import SVC

In [None]:
model_SVM = SVC(probability = True, random_state=42)

In [None]:
model_SVM.fit(x_train, y_train)

In [None]:
# Dropping our last predictions
x_test = x_test.drop(columns = ['prediction', 'pred_threshold', 'real'])

In [None]:
# Use 0.4 as threshold for SVM model
predict(model_SVM, x_test, 0.4)

In [None]:
x_test["real"] = y_test["is_fraud"]

# With 0.5 threshold
print(classification_report(x_test['real'], x_test['prediction']))
# With 0.4 threshold
print(classification_report(x_test['real'], x_test['pred_threshold']))

In this case, the metrics are more balanced, with a small improvement in recall. The F1-score is similar to that of LR.

In [None]:
confusion_matrix_plot(x_test, 'pred_threshold', model_SVM)

# XGBoost Model

Finally, I will train an XGBoost model and compare it with LR and SVC.

In [None]:
from xgboost import XGBClassifier

In [None]:
# XGBoost classifier model
xgb = XGBClassifier(objective='binary:logistic')

In [None]:
xgb.fit(x_train, y_train)

In [None]:
# Drop again our last predictions
x_test = x_test.drop(columns = ['prediction', 'pred_threshold', 'real'])

In [None]:
# Experimenting with a 0.3 threshold for XGBoost model
predict(xgb, x_test, 0.3)

In [None]:
x_test["real"] = y_test["is_fraud"]
print(classification_report(x_test['real'], x_test['prediction']))
print(classification_report(x_test['real'], x_test['pred_threshold']))

XGBoost got much better results overall than the previous models. In this case, reducing the threshold to 0.3 made the model slightly worse, which suggests that its ROC (Receiver Operating Characteristic) curve behaves differently from those of the other models.
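
The ROC curve summarizes the true positive rate against the false positive rate across all thresholds. Here is a minimal sketch with `roc_curve` and `roc_auc_score`. It uses synthetic data and a logistic regression as a stand-in so that it runs without XGBoost; the same two calls should work with the `predict_proba` output of any of the fitted models in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem
X_toy, y_toy = make_classification(n_samples=2000, n_features=8, random_state=42)
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

toy_model = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
proba = toy_model.predict_proba(X_eval)[:, 1]

# One (fpr, tpr) point per candidate threshold, plus the area under the curve
fpr, tpr, thresholds = roc_curve(y_eval, proba)
auc = roc_auc_score(y_eval, proba)

print(f"AUC: {auc:.3f}, points on the curve: {len(thresholds)}")
```

Plotting `fpr` against `tpr` for each model on the same axes would show where lowering the threshold still buys recall and where it mostly adds false positives.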

In [None]:
confusion_matrix_plot(x_test, 'prediction', xgb)