This notebook contains the models for our project.

The models are:
- Logistic Regression
- Neural Network
- KNN
- Random Forest

Below you can find the pre-processing, training, and testing for each model.

At the end, we compare the models and discuss the results.

In [25]:
# imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

In [17]:
# Data collection
data = pd.read_csv('credit_card_fraud.csv', parse_dates=['trans_date_trans_time'])

X = data.drop(['is_fraud'], axis=1)
Y = data['is_fraud']

In [8]:
X.head()

Unnamed: 0,trans_date_trans_time,merchant,category,amt,city,state,lat,long,city_pop,job,dob,trans_num,merch_lat,merch_long
0,2019-01-01 00:00:44,"Heller, Gutmann and Zieme",grocery_pos,107.23,Orient,WA,48.8878,-118.2105,149,Special educational needs teacher,1978-06-21,1f76529f8574734946361c461b024d99,49.159047,-118.186462
1,2019-01-01 00:00:51,Lind-Buckridge,entertainment,220.11,Malad City,ID,42.1808,-112.262,4154,Nature conservation officer,1962-01-19,a1a22d70485983eac12b5b88dad1cf95,43.150704,-112.154481
2,2019-01-01 00:07:27,Kiehn Inc,grocery_pos,96.29,Grenada,CA,41.6125,-122.5258,589,Systems analyst,1945-12-21,413636e759663f264aae1819a4d4f231,41.65752,-122.230347
3,2019-01-01 00:09:03,Beier-Hyatt,shopping_pos,7.77,High Rolls Mountain Park,NM,32.9396,-105.8189,899,Naval architect,1967-08-30,8a6293af5ed278dea14448ded2685fea,32.863258,-106.520205
4,2019-01-01 00:21:32,Bruen-Yost,misc_pos,6.85,Freedom,WY,43.0172,-111.0292,471,"Education officer, museum",1967-08-02,f3c43d336e92a44fc2fb67058d5949e3,43.753735,-111.454923


In [18]:
# Pre-processing --------------------------------------------------------

# getting rid of unnecessary columns
X = X.drop(['trans_num'], axis=1)

# changing data types
X['dob'] = pd.to_datetime(data['dob'])

# creating columns out of our original Dataset --------------------------

X['hour_of_transaction'] = data.trans_date_trans_time.dt.hour # hour of transaction
X['month_of_transaction'] = data.trans_date_trans_time.dt.month # month of transaction
X['dow_of_transaction'] = data.trans_date_trans_time.dt.day_name() # day of week of transaction
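# age in whole years at transaction time (approximate: one year = 365.2425 days);
# newer pandas versions reject .astype('timedelta64[Y]'), so the day count is divided instead
X['cust_age'] = (X['trans_date_trans_time'] - X['dob']).dt.days // 365.2425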

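# odd-hour flag: 1 = transaction before 05:00 or after 21:59, 0 = any other time
# (note: despite the column name, 1 marks an odd time, not a normal one)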
X['Normal_transaction_time'] = 0
X.loc[X.hour_of_transaction < 5,'Normal_transaction_time'] = 1
X.loc[X.hour_of_transaction > 21,'Normal_transaction_time'] = 1


X.head()

Unnamed: 0,trans_date_trans_time,merchant,category,amt,city,state,lat,long,city_pop,job,dob,merch_lat,merch_long,hour_of_transaction,month_of_transaction,dow_of_transaction,cust_age,Normal_transaction_time
0,2019-01-01 00:00:44,"Heller, Gutmann and Zieme",grocery_pos,107.23,Orient,WA,48.8878,-118.2105,149,Special educational needs teacher,1978-06-21,49.159047,-118.186462,0,1,Tuesday,40.0,1
1,2019-01-01 00:00:51,Lind-Buckridge,entertainment,220.11,Malad City,ID,42.1808,-112.262,4154,Nature conservation officer,1962-01-19,43.150704,-112.154481,0,1,Tuesday,56.0,1
2,2019-01-01 00:07:27,Kiehn Inc,grocery_pos,96.29,Grenada,CA,41.6125,-122.5258,589,Systems analyst,1945-12-21,41.65752,-122.230347,0,1,Tuesday,73.0,1
3,2019-01-01 00:09:03,Beier-Hyatt,shopping_pos,7.77,High Rolls Mountain Park,NM,32.9396,-105.8189,899,Naval architect,1967-08-30,32.863258,-106.520205,0,1,Tuesday,51.0,1
4,2019-01-01 00:21:32,Bruen-Yost,misc_pos,6.85,Freedom,WY,43.0172,-111.0292,471,"Education officer, museum",1967-08-02,43.753735,-111.454923,0,1,Tuesday,51.0,1
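
To make the two derived features above easier to check by hand, here is a minimal, library-free sketch of the odd-hour flag and the customer age. The helper names `is_odd_hour` and `age_in_years` are hypothetical and not part of the notebook, and the age here counts completed birthdays, which is an assumption that differs slightly from the day-count approach in the cell above (it agrees with the five rows shown).

```python
from datetime import date, datetime


def is_odd_hour(ts: datetime) -> int:
    """Return 1 for transactions before 05:00 or from 22:00 onward, else 0."""
    return int(ts.hour < 5 or ts.hour > 21)


def age_in_years(dob: date, on: date) -> int:
    """Whole years between dob and on, counting only completed birthdays."""
    had_birthday = (on.month, on.day) >= (dob.month, dob.day)
    return on.year - dob.year - (0 if had_birthday else 1)


# First row of the table above: transaction at 2019-01-01 00:00:44, dob 1978-06-21.
ts = datetime(2019, 1, 1, 0, 0, 44)
print(is_odd_hour(ts))                              # flag for a midnight transaction
print(age_in_years(date(1978, 6, 21), ts.date()))   # age in whole years
```

For the first row this gives a flag of 1 and an age of 40, matching the `Normal_transaction_time` and `cust_age` values in the table.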


In [23]:
print(Y.value_counts())
print(X.duplicated().sum())

0    337825
1      1782
Name: is_fraud, dtype: int64
0
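
As a quick check of how skewed these counts are, the short sketch below works out the fraud rate from the two numbers printed above. The "always predict not fraud" baseline is an illustration added here, not something the notebook evaluates.

```python
# Class counts copied from the Y.value_counts() output above.
counts = {0: 337825, 1: 1782}

total = sum(counts.values())
fraud_rate = counts[1] / total
# Accuracy of a model that labels every transaction as "not fraud".
baseline_accuracy = counts[0] / total

print(f"total transactions: {total}")
print(f"fraud rate: {fraud_rate:.2%}")
print(f"always-0 accuracy: {baseline_accuracy:.2%}")
```

Fraud makes up roughly 0.5% of the rows, so plain accuracy would look high even for a model that never flags fraud.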


In [None]:
Y

Since this data set is heavily skewed in favor of non-fraudulent transactions (337,825 non-fraud vs. 1,782 fraud, roughly 0.5% fraud), we researched how to address the imbalance.
We concluded that we can try under-sampling, over-sampling, and a combination of both.

Under-sampling: randomly keep only as many samples from the majority class (Not Fraud) as there are samples in the minority class (Fraud).
Over-sampling: randomly select samples from the minority class (Fraud), with replacement, and add copies of them to the training data.
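
The two strategies can be sketched without any library to show the mechanics. This is a simplified illustration on toy data, not the imblearn implementation used in the notebook; the function names and the `seed` parameter are assumptions made for the example.

```python
import random
from collections import Counter


def indices_by_class(labels):
    """Map each class label to the list of row positions that carry it."""
    groups = {}
    for position, label in enumerate(labels):
        groups.setdefault(label, []).append(position)
    return groups


def random_under_sample(rows, labels, seed=0):
    """Randomly keep n_min rows per class, where n_min is the smallest class size."""
    rng = random.Random(seed)
    groups = indices_by_class(labels)
    n_min = min(len(positions) for positions in groups.values())
    keep = sorted(p for positions in groups.values() for p in rng.sample(positions, n_min))
    return [rows[p] for p in keep], [labels[p] for p in keep]


def random_over_sample(rows, labels, seed=0):
    """Add randomly chosen copies (with replacement) until every class matches the largest."""
    rng = random.Random(seed)
    groups = indices_by_class(labels)
    n_max = max(len(positions) for positions in groups.values())
    keep = list(range(len(labels)))  # every original row is kept
    for positions in groups.values():
        keep.extend(rng.choices(positions, k=n_max - len(positions)))
    return [rows[p] for p in keep], [labels[p] for p in keep]


# Toy data: 95 "not fraud" rows (label 0) and 5 "fraud" rows (label 1).
rows = list(range(100))
labels = [0] * 95 + [1] * 5

x_under, y_under = random_under_sample(rows, labels)
x_over, y_over = random_over_sample(rows, labels)
print(Counter(y_under))
print(Counter(y_over))
```

On the toy data, under-sampling leaves 5 rows per class and over-sampling leaves 95 rows per class. Under-sampling discards most of the majority class, while over-sampling repeats minority rows, which is the trade-off between the two approaches.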


Logistic Regression Model - Under Sampling

In [35]:
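under_sample = RandomUnderSampler()  # no random_state is set, so the sampled rows differ between runs
X_under, Y_under = under_sample.fit_resample(X, Y)  # data set used for all under-sampled models

# hold out 20% of the balanced data for testing
X_train_LR, X_test_LR, Y_train_LR, Y_test_LR = train_test_split(X_under, Y_under, test_size=0.2)
X_test_LR.shape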


(713, 18)
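
The `(713, 18)` output is consistent with the shape of the test split, and the arithmetic below shows where those numbers would come from. It assumes that under-sampling keeps exactly as many non-fraud rows as fraud rows and that the test share is rounded up to a whole row; the variable names are illustrative only.

```python
import math

n_fraud = 1782                # minority class count from Y.value_counts()
n_balanced = 2 * n_fraud      # under-sampling keeps one non-fraud row per fraud row
test_size = 0.2

n_test = math.ceil(test_size * n_balanced)   # test rows, rounded up
n_train = n_balanced - n_test                # remaining rows go to training

# 13 original feature columns (after dropping trans_num) plus 5 derived columns
n_columns = 13 + 5

print(n_balanced, n_train, n_test, n_columns)
```

This yields 3,564 balanced rows, of which 2,851 go to training and 713 to testing, with 18 feature columns.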