This notebook covers the models for our project.

Models include:
- Logistic Regression
- Neural Network
- KNN
- Random Forest

Below you can find the pre-processing, training, and testing for each model.

At the end, we compare the models with each other and discuss the results.

In [56]:
# imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

In [57]:
# Data collection
data = pd.read_csv('credit_card_fraud.csv', parse_dates=['trans_date_trans_time'])

X = data.drop(['is_fraud'], axis=1)
Y = data['is_fraud']

In [58]:
X.head()

Unnamed: 0,trans_date_trans_time,merchant,category,amt,city,state,lat,long,city_pop,job,dob,trans_num,merch_lat,merch_long
0,2019-01-01 00:00:44,"Heller, Gutmann and Zieme",grocery_pos,107.23,Orient,WA,48.8878,-118.2105,149,Special educational needs teacher,1978-06-21,1f76529f8574734946361c461b024d99,49.159047,-118.186462
1,2019-01-01 00:00:51,Lind-Buckridge,entertainment,220.11,Malad City,ID,42.1808,-112.262,4154,Nature conservation officer,1962-01-19,a1a22d70485983eac12b5b88dad1cf95,43.150704,-112.154481
2,2019-01-01 00:07:27,Kiehn Inc,grocery_pos,96.29,Grenada,CA,41.6125,-122.5258,589,Systems analyst,1945-12-21,413636e759663f264aae1819a4d4f231,41.65752,-122.230347
3,2019-01-01 00:09:03,Beier-Hyatt,shopping_pos,7.77,High Rolls Mountain Park,NM,32.9396,-105.8189,899,Naval architect,1967-08-30,8a6293af5ed278dea14448ded2685fea,32.863258,-106.520205
4,2019-01-01 00:21:32,Bruen-Yost,misc_pos,6.85,Freedom,WY,43.0172,-111.0292,471,"Education officer, museum",1967-08-02,f3c43d336e92a44fc2fb67058d5949e3,43.753735,-111.454923


In [59]:
# Pre-processing --------------------------------------------------------

# changing data types
X['dob'] = pd.to_datetime(X['dob'])

# creating columns out of our original Dataset --------------------------

X['hour_of_transaction'] = X.trans_date_trans_time.dt.hour # hour of transaction
X['month_of_transaction'] = X.trans_date_trans_time.dt.month # month of transaction
X['dow_of_transaction'] = X.trans_date_trans_time.dt.day_name() # day of week of transaction
# age in whole years at transaction time (astype('timedelta64[Y]') is rejected by pandas 2.x)
X['cust_age'] = (X['trans_date_trans_time'] - X['dob']).dt.days // 365.2425

# odd-hour flag: despite the column name, 1 = odd time (before 05:00 or from 22:00 on), 0 = normal time
X['Normal_transaction_time'] = 0
X.loc[X.hour_of_transaction < 5,'Normal_transaction_time'] = 1
X.loc[X.hour_of_transaction > 21,'Normal_transaction_time'] = 1

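# one-hot encoding the categorical features
# NOTE: fit_transform returns a sparse matrix with one column per category value.
# Assigning it back to a single column stores a sparse row object in every cell
# (see the X.head() output below) instead of creating numeric indicator columns.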
encoder = OneHotEncoder()
X[['dow_of_transaction']] = encoder.fit_transform(X[['dow_of_transaction']])
X[['state']] = encoder.fit_transform(X[['state']])
X[['merchant']] = encoder.fit_transform(X[['merchant']])
X[['category']] = encoder.fit_transform(X[['category']])
X[['city']] = encoder.fit_transform(X[['city']])

# Normalizing the features with varying scales ---------------------------

# min-max normalization since there are no real outliers for this feature
X['cust_age'] = (X['cust_age'] - X['cust_age'].min()) / (X['cust_age'].max() - X['cust_age'].min())

# z-score normalization for widely spread values such as amt and city population
X['amt'] = (X['amt'] - X['amt'].mean()) / X['amt'].std()
X['city_pop'] = (X['city_pop'] - X['city_pop'].mean()) / X['city_pop'].std()


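# getting rid of unnecessary columns
# NOTE: 'dob' (a datetime column) and 'Unnamed: 0' (the row index saved in the CSV) are still in X after this step
X = X.drop(['trans_num', 'job','trans_date_trans_time'], axis=1)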

In [60]:
X.head()

Unnamed: 0,merchant,category,amt,city,state,lat,long,city_pop,dob,merch_lat,merch_long,hour_of_transaction,month_of_transaction,dow_of_transaction,cust_age,Normal_transaction_time
21,"(0, 175)\t1.0","(0, 11)\t1.0",0.963363,"(0, 70)\t1.0","(0, 2)\t1.0",37.7773,-119.0825,-0.363471,1927-09-09,36.819789,-119.670559,1,1,"(0, 5)\t1.0",0.973684,1
117,"(0, 415)\t1.0","(0, 4)\t1.0",0.496316,"(0, 70)\t1.0","(0, 2)\t1.0",37.7773,-119.0825,-0.363471,1927-09-09,38.179302,-119.614224,7,1,"(0, 5)\t1.0",0.973684,0
389,"(0, 66)\t1.0","(0, 7)\t1.0",0.438422,"(0, 70)\t1.0","(0, 2)\t1.0",37.7773,-119.0825,-0.363471,1927-09-09,37.385819,-120.061873,20,1,"(0, 5)\t1.0",0.973684,0
521,"(0, 94)\t1.0","(0, 4)\t1.0",0.234557,"(0, 70)\t1.0","(0, 2)\t1.0",37.7773,-119.0825,-0.363471,1927-09-09,38.733766,-118.876821,7,1,"(0, 6)\t1.0",0.973684,0
541,"(0, 178)\t1.0","(0, 0)\t1.0",-0.265211,"(0, 70)\t1.0","(0, 2)\t1.0",37.7773,-119.0825,-0.363471,1927-09-09,38.130066,-119.710773,10,1,"(0, 6)\t1.0",0.973684,0
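
The `(0, 175)\t1.0` style values in the table above are the printed form of sparse matrix rows: each encoded column holds one sparse object per cell rather than numeric indicator columns. The following is a minimal sketch, on a made-up four-row frame (the `toy` data is hypothetical), of what `OneHotEncoder` returns and of one way to expand a categorical column into separate columns with `pd.get_dummies`. It is an illustration, not the pre-processing used in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; column names mirror the notebook, values are made up.
toy = pd.DataFrame({
    "state": ["WA", "ID", "CA", "WA"],
    "amt": [107.23, 220.11, 96.29, 7.77],
})

# The encoder returns a 2-D (sparse) matrix: one row per sample,
# one column per distinct category value.
encoded = OneHotEncoder().fit_transform(toy[["state"]])
print(encoded.shape)  # (4, 3)

# Expanding into real indicator columns keeps every cell numeric.
dummies = pd.get_dummies(toy["state"], prefix="state", dtype=float)
expanded = pd.concat([toy.drop(columns="state"), dummies], axis=1)
print(list(expanded.columns))  # ['amt', 'state_CA', 'state_ID', 'state_WA']
```

For high-cardinality columns such as `merchant` and `city`, this expansion creates one column per distinct value, so the resulting frame can become very wide.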


This data set is heavily skewed toward non-fraudulent transactions, so we researched ways to address the imbalance.
We concluded that we can apply under-sampling, over-sampling, or a combination of both.

Under-sampling: Randomly keep only as many samples from the majority class (Not Fraud) as there are samples in the minority class (Fraud).
Over-sampling: Randomly select samples from the minority class (Fraud) and add copies of them to the training data.
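
The two definitions above can be illustrated with a small NumPy-only sketch. It is a hand-rolled illustration of the idea on made-up labels (95 Not Fraud, 5 Fraud), not the internals of `RandomUnderSampler` or `RandomOverSampler`; all variable names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Toy labels: 95 non-fraud (0) and 5 fraud (1); the counts are made up.
y = np.array([0] * 95 + [1] * 5)
X = np.arange(len(y)).reshape(-1, 1)  # one dummy feature per row

majority_idx = np.flatnonzero(y == 0)
minority_idx = np.flatnonzero(y == 1)

# Under-sampling: keep every fraud row and draw an equal number of
# non-fraud rows without replacement.
keep_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
under_idx = np.concatenate([keep_majority, minority_idx])
X_under, y_under = X[under_idx], y[under_idx]

# Over-sampling: keep every row and add randomly chosen copies of fraud
# rows (drawn with replacement) until both classes have the same count.
n_extra = len(majority_idx) - len(minority_idx)
extra_minority = rng.choice(minority_idx, size=n_extra, replace=True)
over_idx = np.concatenate([majority_idx, minority_idx, extra_minority])
X_over, y_over = X[over_idx], y[over_idx]

print(np.bincount(y_under))  # [5 5]
print(np.bincount(y_over))   # [95 95]
```

Under-sampling discards most of the majority class, while over-sampling repeats minority rows, so the two approaches trade lost information against duplicated samples.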


Logistic Regression Model - Under Sampling

In [61]:
under_sample = RandomUnderSampler(random_state=42) # fixed seed so the under-sampled set is reproducible
X_under, Y_under = under_sample.fit_resample(X,Y) # data set used for all under-sampled models

X_train_LR, X_test_LR, Y_train_LR, Y_test_LR = train_test_split(X_under, Y_under, test_size = 0.2, random_state=42)

print('Training Data Shape   : ', X_train_LR.shape)
print('Training Labels Shape : ', Y_train_LR.shape)
print('Testing Data Shape    : ', X_test_LR.shape)
print('Testing Labels Shape  : ', Y_test_LR.shape)

from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression()
lr_model.fit(X_train_LR,Y_train_LR)

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

pred_train = lr_model.predict(X_train_LR)
pred_test  = lr_model.predict(X_test_LR)

print('Training Accuracy : ', accuracy_score(Y_train_LR, pred_train))
print('Testing  Accuracy : ', accuracy_score(Y_test_LR, pred_test))

# Checking f1 score, precision and recall
print('Training Set f1 score : ', f1_score(Y_train_LR, pred_train))
print('Testing  Set f1 score : ', f1_score(Y_test_LR, pred_test))
print()
print('Test set precision : ', precision_score(Y_test_LR, pred_test))
print('Test set recall    : ', recall_score(Y_test_LR, pred_test))




Training Data Shape   :  (2851, 16)
Training Labels Shape :  (2851,)
Testing Data Shape    :  (713, 16)
Testing Labels Shape  :  (713,)


ValueError: setting an array element with a sequence.
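
The `ValueError` above most likely comes from the encoded columns: each of their cells holds a sparse row object, which NumPy cannot convert to a single float when `LogisticRegression.fit` builds its input array. The `dob` datetime column that is still in `X` would also need to be dropped or converted before fitting. Below is a minimal sketch of one possible fix, assuming the encoding is moved into a scikit-learn `ColumnTransformer` inside a `Pipeline`. The toy frame, its values, and the choice of `StandardScaler` for `amt` are assumptions made for illustration, not the notebook's actual data or final design.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame with the same kinds of columns as X (values made up).
toy_X = pd.DataFrame({
    "category": ["grocery_pos", "entertainment", "grocery_pos",
                 "misc_pos", "shopping_pos", "misc_pos"],
    "state": ["WA", "ID", "CA", "NM", "WY", "WA"],
    "amt": [107.23, 220.11, 96.29, 7.77, 6.85, 54.10],
    "hour_of_transaction": [0, 23, 7, 9, 2, 14],
})
toy_y = pd.Series([0, 1, 0, 0, 1, 0])  # made-up labels

# One-hot encode the categorical columns into separate indicator columns,
# scale the amount, and pass the remaining numeric columns through unchanged.
preprocess = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["category", "state"]),
        ("scale", StandardScaler(), ["amt"]),
    ],
    remainder="passthrough",
)

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

model.fit(toy_X, toy_y)
pred = model.predict(toy_X)
n_features = model.named_steps["preprocess"].transform(toy_X).shape[1]
print(pred.shape, n_features)
```

Because the encoder and scaler are fitted inside the pipeline, they learn their categories and statistics from the training split only, and `handle_unknown="ignore"` keeps prediction from failing on a merchant or city that was not seen during training.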