This Notebook is for the Models of our Project.

Models include:
- Logistic Regression
- Random Forest
- KNN
- Neural Network

Below you can find the Pre-processing, Training, and Testing for Each model

At the end we will conclude with a comparison between each model and discuss results!

In [64]:
# imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

In [139]:
# Data collection
data = pd.read_csv('credit_card_fraud.csv', parse_dates=['trans_date_trans_time',])

X = data.drop(['is_fraud'], axis=1)
Y = data['is_fraud']

In [140]:
# Pre-processing --------------------------------------------------------

# changing data types
X['dob'] = pd.to_datetime(X['dob'])

# creating columns out of our original Dataset --------------------------

X['hour_of_transaction'] = X.trans_date_trans_time.dt.hour # hour of transaction
X['month_of_transaction'] = X.trans_date_trans_time.dt.month # month of transaction
X['dow_of_transaction'] = X.trans_date_trans_time.dt.day_name() # day of week of transaction
X['cust_age'] = (X['trans_date_trans_time'] - X['dob']).astype('timedelta64[Y]') # age of person during transaction

# encoding: 0 = normal time, 1 = odd time
X['Normal_transaction_time'] = 0
X.loc[X.hour_of_transaction < 5,'Normal_transaction_time'] = 1
X.loc[X.hour_of_transaction > 21,'Normal_transaction_time'] = 1

# one-hot encoding the categorical features
encoder = OneHotEncoder()
dow_encoded = encoder.fit_transform(X[['dow_of_transaction']])
dow_encoded_df = pd.DataFrame(dow_encoded.toarray(), columns=encoder.categories_[0])
X = pd.concat([X, dow_encoded_df], axis=1)

state_encoded = encoder.fit_transform(X[['state']])
state_encoded_df = pd.DataFrame(state_encoded.toarray(), columns=encoder.categories_[0])
X = pd.concat([X,state_encoded_df], axis=1)

merch_encoded = encoder.fit_transform(X[['merchant']])
merch_encoded_df = pd.DataFrame(merch_encoded.toarray(), columns=encoder.categories_[0])
X = pd.concat([X, merch_encoded_df], axis=1)

cat_encoded = encoder.fit_transform(X[['category']])
cat_encoded_df = pd.DataFrame(cat_encoded.toarray(), columns=encoder.categories_[0])
X = pd.concat([X, cat_encoded_df], axis=1)

city_encoded = encoder.fit_transform(X[['city']])
city_encoded_df = pd.DataFrame(city_encoded.toarray(), columns=encoder.categories_[0])
X = pd.concat([X, city_encoded_df], axis=1)


# Normalizing the features with varying features ------------------------------------------------------------

# min-max normalization since no real outliers for these features
X['cust_age'] = (X['cust_age'] - X['cust_age'].min()) / (X['cust_age'].max() - X['cust_age'].min())

# z-score normalization for values that are wide-spread such as amt and city population
X['amt'] = (X['amt'] - X['amt'].mean()) / X['amt'].std() 
X['city_pop'] = (X['city_pop'] - X['city_pop'].mean()) / X['city_pop'].std() 

# getting rid of unnecessary columns
X.drop(['trans_num', 'job','trans_date_trans_time', 'state', 'city', 'merchant', 'category', 'dow_of_transaction', 'dob'], axis=1, inplace=True)

Since this data set is heavely skewed in Non-Fraudulent transactions favor, we have done some research in how to address this.
We concluded that we can take the approach of doing under-sampling, over-sampling, and combining both.

Under-sampling: The number of samples taken from majority class (Not Fraud) will be equal to total number of samples of minority class (Fraud)
Over-sampling: Selecting random samples from the minority class (Fraud) and adding to the training data copies of the sample


Logistic Regression Model - Under Sampling

In [141]:
under_sample = RandomUnderSampler()
X_under, Y_under = under_sample.fit_resample(X,Y) # data set used for all under sampled models

X_train_u, X_test_u, Y_train_u, Y_test_u = train_test_split(X_under, Y_under, test_size = 0.2, random_state=42)

print('Training Data Shape   : ', X_train_u.shape)
print('Training Labels Shape : ', Y_train_u.shape)
print('Testing Data Shape    : ', X_test_u.shape)
print('Testing Labels Shape  : ', Y_test_u.shape)
print()

from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression()
lr_model.fit(X_train_u,Y_train_u)

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

pred_train_lr = lr_model.predict(X_train_u)
pred_test_lr  = lr_model.predict(X_test_u)

print('Logistic Regression Results with Under-Sampling:')
print()
print('Training Accuracy : ', accuracy_score(Y_train_u, pred_train_lr))
print('Testing  Accuracy : ', accuracy_score(Y_test_u, pred_test_lr))

# Checking f1 score, precision and recall
print('Training Set f1 score : ', f1_score(Y_train_u, pred_train_lr))
print('Testing  Set f1 score : ', f1_score(Y_test_u, pred_test_lr))
print()
print('Test set precision : ', precision_score(Y_train_u, pred_train_lr))
print('Test set recall    : ', recall_score(Y_test_u, pred_test_lr))




Training Data Shape   :  (2851, 913)
Training Labels Shape :  (2851,)
Testing Data Shape    :  (713, 913)
Testing Labels Shape  :  (713,)

Logistic Regression Results with Under-Sampling:

Training Accuracy :  0.8905647141353911
Testing  Accuracy :  0.879382889200561
Training Set f1 score :  0.8925619834710744
Testing  Set f1 score :  0.8753623188405797

Test set precision :  0.8894989704873026
Test set recall    :  0.9014925373134328


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Random Forest Model - Under Sampling

Hyperparameters include:
n_estimators: Determines the number of decision tress that are "grown" in random forest
max_depth: the maximum depth for each decision tree
random_state: helps randomize the data to generate diverse decision trees and will help in comparing later since each model has same random_state

Hyperparameters tested:

n_estimators=100, max_depth=10, random_state=42




In [142]:
from sklearn.ensemble import RandomForestClassifier


rf_classifier = RandomForestClassifier(n_estimators=200, max_depth=200, random_state=42)
rf_classifier.fit(X_train_u, Y_train_u)

pred_train_rf = rf_classifier.predict(X_train_u)
pred_test_rf = rf_classifier.predict(X_test_u)

print('Random Forest Classifier Results with Under-Sampling:')
print()

print('Training Set Accuracy : ', accuracy_score(Y_train_u, pred_train_rf))
print('Testing Set Accuracy  : ', accuracy_score(Y_test_u, pred_test_rf))





Random Forest Classifier Results with Under-Sampling:

Training Set Accuracy :  1.0
Testing Set Accuracy  :  0.9635343618513323


In [147]:
from sklearn.neural_network import MLPClassifier

nn_classifier = MLPClassifier(hidden_layer_sizes=(913,500,250,100,50,1), activation='relu', random_state=42)
nn_classifier.fit(X_train_u, Y_train_u)

pred_train_nn = nn_classifier.predict(X_train_u)
pred_test_nn = nn_classifier.predict(X_test_u)

print('Neural Network (MLP) Classifier Results with Under-Sampling:')
print()

print('Training Set Accuracy : ', accuracy_score(Y_train_u, pred_train_nn))
print('Testing Set Accuracy  : ', accuracy_score(Y_test_u, pred_test_nn))






Neural Network (MLP) Classifier Results with Under-Sampling:

Training Set Accuracy :  0.9817607856892319
Testing Set Accuracy  :  0.8611500701262272


In [148]:
over_sample = RandomOverSampler()
X_over, Y_over = over_sample.fit_resample(X,Y) # data set used for all over sampled models


X_train_o, X_test_o, Y_train_o, Y_test_o = train_test_split(X_over, Y_over, test_size = 0.2, random_state=42)

print('Training Data Shape   : ', X_train_o.shape)
print('Training Labels Shape : ', Y_train_o.shape)
print('Testing Data Shape    : ', X_test_o.shape)
print('Testing Labels Shape  : ', Y_test_o.shape)
print()

from sklearn.linear_model import LogisticRegression

lr_model_over = LogisticRegression()
lr_model_over.fit(X_train_o,Y_train_o)

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

pred_train_lr2 = lr_model_over.predict(X_train_o)
pred_test_lr2  = lr_model_over.predict(X_test_o)

print('Logistic Regression Results with Under-Sampling:')
print()
print('Training Accuracy : ', accuracy_score(Y_train_o, pred_train_lr2))
print('Testing  Accuracy : ', accuracy_score(Y_test_o, pred_test_lr2))

# Checking f1 score, precision and recall
print('Training Set f1 score : ', f1_score(Y_train_o, pred_train_lr2))
print('Testing  Set f1 score : ', f1_score(Y_test_o, pred_test_lr2))
print()
print('Test set precision : ', precision_score(Y_train_o, pred_train_lr2))
print('Test set recall    : ', recall_score(Y_test_o, pred_test_lr2))


Training Data Shape   :  (540520, 913)
Training Labels Shape :  (540520,)
Testing Data Shape    :  (135130, 913)
Testing Labels Shape  :  (135130,)



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Logistic Regression Results with Under-Sampling:

Training Accuracy :  0.8968770813290905
Testing  Accuracy :  0.8968992821727225
Training Set f1 score :  0.899475556004415
Testing  Set f1 score :  0.8997467042772438

Test set precision :  0.8770777206446122
Test set recall    :  0.9240160215196795


In [149]:
rf_classifier_o = RandomForestClassifier(n_estimators=200, max_depth=200, random_state=42)
rf_classifier_o.fit(X_train_o, Y_train_o)

pred_train_rf2 = rf_classifier.predict(X_train_o)
pred_test_rf2 = rf_classifier.predict(X_test_o)

print('Random Forest Classifier Results with Under-Sampling:')
print()

print('Training Set Accuracy : ', accuracy_score(Y_train_o, pred_train_rf2))
print('Testing Set Accuracy  : ', accuracy_score(Y_test_o, pred_test_rf2))

Random Forest Classifier Results with Under-Sampling:

Training Set Accuracy :  0.968252793606157
Testing Set Accuracy  :  0.9680011840449937
