# Fraud Detection

This dataset is taken from [Kaggle](https://www.kaggle.com/datasets/kartik2112/fraud-detection/data).

## Load Dataset From Local CSV File

In [1]:
from helpers import clean_dataset, encode_categories, get_data, xy_split

train, test = get_data()

print(f"Train dataset: {train.shape[0]:10,} rows x {train.shape[1]:2,} columns")
print(f"Test dataset:  {test.shape[0]:10,} rows x {test.shape[1]:2,} columns")


Train dataset:  1,296,675 rows x 22 columns
Test dataset:     555,719 rows x 22 columns


In [2]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1296675 entries, 0 to 1296674
Data columns (total 22 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   trans_date_trans_time  1296675 non-null  object 
 1   cc_num                 1296675 non-null  int64  
 2   merchant               1296675 non-null  object 
 3   category               1296675 non-null  object 
 4   amt                    1296675 non-null  float64
 5   first                  1296675 non-null  object 
 6   last                   1296675 non-null  object 
 7   gender                 1296675 non-null  object 
 8   street                 1296675 non-null  object 
 9   city                   1296675 non-null  object 
 10  state                  1296675 non-null  object 
 11  zip                    1296675 non-null  int64  
 12  lat                    1296675 non-null  float64
 13  long                   1296675 non-null  float64
 14  city_pop               

In [3]:
train.head(10)

Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,city,...,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
0,2019-01-01 00:00:18,2703186189652095,"fraud_Rippin, Kub and Mann",misc_net,4.97,Jennifer,Banks,F,561 Perry Cove,Moravian Falls,...,36.0788,-81.1781,3495,"Psychologist, counselling",1988-03-09,0b242abb623afc578575680df30655b9,1325376018,36.011293,-82.048315,0
1,2019-01-01 00:00:44,630423337322,"fraud_Heller, Gutmann and Zieme",grocery_pos,107.23,Stephanie,Gill,F,43039 Riley Greens Suite 393,Orient,...,48.8878,-118.2105,149,Special educational needs teacher,1978-06-21,1f76529f8574734946361c461b024d99,1325376044,49.159047,-118.186462,0
2,2019-01-01 00:00:51,38859492057661,fraud_Lind-Buckridge,entertainment,220.11,Edward,Sanchez,M,594 White Dale Suite 530,Malad City,...,42.1808,-112.262,4154,Nature conservation officer,1962-01-19,a1a22d70485983eac12b5b88dad1cf95,1325376051,43.150704,-112.154481,0
3,2019-01-01 00:01:16,3534093764340240,"fraud_Kutch, Hermiston and Farrell",gas_transport,45.0,Jeremy,White,M,9443 Cynthia Court Apt. 038,Boulder,...,46.2306,-112.1138,1939,Patent attorney,1967-01-12,6b849c168bdad6f867558c3793159a81,1325376076,47.034331,-112.561071,0
4,2019-01-01 00:03:06,375534208663984,fraud_Keeling-Crist,misc_pos,41.96,Tyler,Garcia,M,408 Bradley Rest,Doe Hill,...,38.4207,-79.4629,99,Dance movement psychotherapist,1986-03-28,a41d7549acf90789359a9aa5346dcb46,1325376186,38.674999,-78.632459,0
5,2019-01-01 00:04:08,4767265376804500,"fraud_Stroman, Hudson and Erdman",gas_transport,94.63,Jennifer,Conner,F,4655 David Island,Dublin,...,40.375,-75.2045,2158,Transport planner,1961-06-19,189a841a0a8ba03058526bcfe566aab5,1325376248,40.653382,-76.152667,0
6,2019-01-01 00:04:42,30074693890476,fraud_Rowe-Vandervort,grocery_net,44.54,Kelsey,Richards,F,889 Sarah Station Suite 624,Holcomb,...,37.9931,-100.9893,2691,Arboriculturist,1993-08-16,83ec1cc84142af6e2acf10c44949e720,1325376282,37.162705,-100.15337,0
7,2019-01-01 00:05:08,6011360759745864,fraud_Corwin-Collins,gas_transport,71.65,Steven,Williams,M,231 Flores Pass Suite 720,Edinburg,...,38.8432,-78.6003,6018,"Designer, multimedia",1947-08-21,6d294ed2cc447d2c71c7171a3d54967c,1325376308,38.948089,-78.540296,0
8,2019-01-01 00:05:18,4922710831011201,fraud_Herzog Ltd,misc_pos,4.27,Heather,Chase,F,6888 Hicks Stream Suite 954,Manor,...,40.3359,-79.6607,1472,Public affairs consultant,1941-03-07,fc28024ce480f8ef21a32d64c93a29f5,1325376318,40.351813,-79.958146,0
9,2019-01-01 00:06:01,2720830304681674,"fraud_Schoen, Kuphal and Nitzsche",grocery_pos,198.39,Melissa,Aguilar,F,21326 Taylor Squares Suite 708,Clarksville,...,36.522,-87.349,151785,Pathologist,1974-03-28,3b9014ea8fb80bd65de0b1463b00b00e,1325376361,37.179198,-87.485381,0


## Clean Dataset

The cleaning logic lives in a single function, `clean_dataset`, so the same steps can be applied to both the training and test datasets.

The following cleaning steps are applied to the dataset:

- Clean the merchant names.
- Split the timestamp into year, month, day, day name, and hour.
- Reduce the date of birth to just the year. This serves as a proxy for age.
- Compute the absolute difference between the cardholder's and the merchant's latitude and longitude.
- Remove all unused columns.
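
The steps above can be sketched roughly as follows. `helpers.clean_dataset` is not shown in this notebook, so `clean_dataset_sketch` is a hypothetical stand-in written to match the output columns below, not the actual implementation. It skips the merchant-name cleaning because `merchant` does not survive into the cleaned dataset.

```python
import pandas as pd


def clean_dataset_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for helpers.clean_dataset (assumed behavior)."""
    out = df.copy()

    # Split the transaction timestamp into calendar parts.
    ts = pd.to_datetime(out["trans_date_trans_time"])
    out["year"] = ts.dt.year
    out["month"] = ts.dt.month
    out["day"] = ts.dt.day
    out["day_name"] = ts.dt.day_name()
    out["hour"] = ts.dt.hour

    # Keep only the birth year as a rough proxy for age.
    out["dob_year"] = pd.to_datetime(out["dob"]).dt.year

    # Absolute gap between cardholder and merchant coordinates, in degrees.
    out["lat_delta"] = (out["lat"] - out["merch_lat"]).abs()
    out["long_delta"] = (out["long"] - out["merch_long"]).abs()

    # Drop everything else (names, address, card number, and so on).
    keep = [
        "category", "amt", "gender", "city_pop", "is_fraud",
        "year", "month", "day", "day_name", "hour",
        "dob_year", "lat_delta", "long_delta",
    ]
    return out[keep]
```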

In [4]:
clean_train = clean_dataset(train)
clean_test = clean_dataset(test)

clean_train.info()


<class 'pandas.core.frame.DataFrame'>
Index: 1296675 entries, 0 to 1296674
Data columns (total 13 columns):
 #   Column      Non-Null Count    Dtype  
---  ------      --------------    -----  
 0   category    1296675 non-null  object 
 1   amt         1296675 non-null  float64
 2   gender      1296675 non-null  object 
 3   city_pop    1296675 non-null  int64  
 4   is_fraud    1296675 non-null  int64  
 5   year        1296675 non-null  int32  
 6   month       1296675 non-null  int32  
 7   day         1296675 non-null  int32  
 8   day_name    1296675 non-null  object 
 9   hour        1296675 non-null  int32  
 10  dob_year    1296675 non-null  int32  
 11  lat_delta   1296675 non-null  float64
 12  long_delta  1296675 non-null  float64
dtypes: float64(3), int32(5), int64(2), object(3)
memory usage: 113.8+ MB


In [5]:
clean_train.head(10)

Unnamed: 0,category,amt,gender,city_pop,is_fraud,year,month,day,day_name,hour,dob_year,lat_delta,long_delta
0,misc_net,4.97,F,3495,0,2019,1,1,Tuesday,0,1988,0.067507,0.870215
1,grocery_pos,107.23,F,149,0,2019,1,1,Tuesday,0,1978,0.271247,0.024038
2,entertainment,220.11,M,4154,0,2019,1,1,Tuesday,0,1962,0.969904,0.107519
3,gas_transport,45.0,M,1939,0,2019,1,1,Tuesday,0,1967,0.803731,0.447271
4,misc_pos,41.96,M,99,0,2019,1,1,Tuesday,0,1986,0.254299,0.830441
5,gas_transport,94.63,F,2158,0,2019,1,1,Tuesday,0,1961,0.278382,0.948167
6,grocery_net,44.54,F,2691,0,2019,1,1,Tuesday,0,1993,0.830395,0.83593
7,gas_transport,71.65,M,6018,0,2019,1,1,Tuesday,0,1947,0.104889,0.060004
8,misc_pos,4.27,F,1472,0,2019,1,1,Tuesday,0,1941,0.015913,0.297446
9,grocery_pos,198.39,F,151785,0,2019,1,1,Tuesday,0,1974,0.657198,0.136381


## Modeling

### Categorical Data Encoding

Most scikit-learn models expect numeric input, so the categorical columns (`category`, `day_name`, and `gender`) must be encoded before training.

TODO: Try this with the boolean (OneHot) encoder.
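
As a rough illustration, one-hot encoding of this kind can be done with `pandas.get_dummies`. `encode_categories_sketch` is a hypothetical stand-in for `helpers.encode_categories`, whose source is not shown here. The sketch also shows one pitfall of encoding the two datasets separately: a value missing from one of them yields a different set of columns, which `reindex` can realign.

```python
import pandas as pd


def encode_categories_sketch(df, columns=("category", "day_name", "gender")):
    """Hypothetical stand-in for helpers.encode_categories (assumed behavior)."""
    # One boolean column per distinct value, e.g. gender_F and gender_M.
    return pd.get_dummies(df, columns=list(columns), dtype=bool)


# Tiny made-up frames: the "test" frame happens to contain no "F" rows.
train_part = pd.DataFrame({"amt": [4.97, 220.11], "gender": ["F", "M"]})
test_part = pd.DataFrame({"amt": [45.0], "gender": ["M"]})

enc_train = encode_categories_sketch(train_part, columns=["gender"])
enc_test = encode_categories_sketch(test_part, columns=["gender"])

# Give the test frame the training columns, filling the missing ones with False.
enc_test = enc_test.reindex(columns=enc_train.columns, fill_value=False)
```

Fitting a scikit-learn `OneHotEncoder` on the training data and reusing it on the test data is another way to keep the columns consistent.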

In [6]:
encoded_train = encode_categories(clean_train)
encoded_test = encode_categories(clean_test)

encoded_train.head(10)

Unnamed: 0,amt,city_pop,is_fraud,year,month,day,hour,dob_year,lat_delta,long_delta,...,category_travel,day_name_Friday,day_name_Monday,day_name_Saturday,day_name_Sunday,day_name_Thursday,day_name_Tuesday,day_name_Wednesday,gender_F,gender_M
0,4.97,3495,0,2019,1,1,0,1988,0.067507,0.870215,...,False,False,False,False,False,False,True,False,True,False
1,107.23,149,0,2019,1,1,0,1978,0.271247,0.024038,...,False,False,False,False,False,False,True,False,True,False
2,220.11,4154,0,2019,1,1,0,1962,0.969904,0.107519,...,False,False,False,False,False,False,True,False,False,True
3,45.0,1939,0,2019,1,1,0,1967,0.803731,0.447271,...,False,False,False,False,False,False,True,False,False,True
4,41.96,99,0,2019,1,1,0,1986,0.254299,0.830441,...,False,False,False,False,False,False,True,False,False,True
5,94.63,2158,0,2019,1,1,0,1961,0.278382,0.948167,...,False,False,False,False,False,False,True,False,True,False
6,44.54,2691,0,2019,1,1,0,1993,0.830395,0.83593,...,False,False,False,False,False,False,True,False,True,False
7,71.65,6018,0,2019,1,1,0,1947,0.104889,0.060004,...,False,False,False,False,False,False,True,False,False,True
8,4.27,1472,0,2019,1,1,0,1941,0.015913,0.297446,...,False,False,False,False,False,False,True,False,True,False
9,198.39,151785,0,2019,1,1,0,1974,0.657198,0.136381,...,False,False,False,False,False,False,True,False,True,False


### Prepare Test Dataset

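The test dataset has already been cleaned and encoded in the same way as the training dataset. The remaining step is to split each dataset into features (`x`) and the target (`y`).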
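
A split like this usually amounts to separating the label column from the rest. The sketch below assumes `is_fraud` is the target; `xy_split_sketch` is a hypothetical stand-in for `helpers.xy_split`, which is not shown in this notebook.

```python
import pandas as pd


def xy_split_sketch(df: pd.DataFrame, target: str = "is_fraud"):
    """Hypothetical stand-in for helpers.xy_split (assumed behavior)."""
    x = df.drop(columns=[target])  # every column except the label
    y = df[target]                 # the label column on its own
    return x, y


# Made-up two-row frame, just to show the shapes that come back.
demo = pd.DataFrame({"amt": [4.97, 107.23], "hour": [0, 0], "is_fraud": [0, 1]})
x_demo, y_demo = xy_split_sketch(demo)
```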

In [7]:
x_train, y_train = xy_split(encoded_train)
x_test, y_test = xy_split(encoded_test)

### Logistic Regression

- [LogisticRegression documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

Logistic Regression takes approximately 8s to run and reaches 99.55% accuracy on the test dataset. The solver also hits its iteration limit without converging. As the warning in the output notes, scaling the features or raising `max_iter` should help.
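
One way to apply that scaling is to put a `StandardScaler` in front of the model with `make_pipeline`. The sketch below runs on made-up data with two features on very different scales (loosely modeled on `amt` and `city_pop`). The `max_iter=1000` value is an arbitrary choice, and nothing here has been run against the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features: small amounts next to large population counts.
n_rows = 500
amt = rng.exponential(70.0, n_rows)
city_pop = rng.integers(100, 1_000_000, n_rows).astype(float)
x_synth = np.column_stack([amt, city_pop])

# Synthetic label that depends only on the amount (above or below the median).
y_synth = (amt > np.median(amt)).astype(int)

# The scaler is fitted on the training data and reused for any later predict().
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(x_synth, y_synth)

synth_accuracy = scaled_lr.score(x_synth, y_synth)
iterations_used = int(scaled_lr[-1].n_iter_[0])
```

In the notebook, the same pipeline would be fitted with `x_train` and `y_train` and evaluated with `x_test` and `y_test`.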

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

lr_model = LogisticRegression()
lr_model.fit(x_train, y_train)
lr_predictions = lr_model.predict(x_test)
lr_accuracy = accuracy_score(y_test, lr_predictions)

print(f"Logistic Regression model accuracy: {(lr_accuracy * 100):.2f}%")

Logistic Regression model accuracy: 99.55%


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### Random Forest

- [RandomForestClassifier documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

Random Forest Classifier takes approximately 2m 11s and is 99.88% accurate.

In [9]:
from sklearn.ensemble import RandomForestClassifier

rfc_model = RandomForestClassifier()
rfc_model.fit(x_train, y_train)
rfc_predictions = rfc_model.predict(x_test)
rfc_accuracy = accuracy_score(y_test, rfc_predictions)

print(f"Random Forest model accuracy: {(rfc_accuracy * 100):.2f}%")

Random Forest model accuracy: 99.88%


### Decision Tree Classifier

- [DecisionTreeClassifier documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)

Decision Tree Classifier takes approximately 8s and is 99.82% accurate.

In [10]:
from sklearn.tree import DecisionTreeClassifier

# 03:01
dtc_model = DecisionTreeClassifier()
dtc_model.fit(x_train, y_train)
dtc_predictions = dtc_model.predict(x_test)
dtc_accuracy = accuracy_score(y_test, dtc_predictions)

print(f"Decision Tree Classifier model accuracy: {(dtc_accuracy * 100):.2f}%")

Decision Tree Classifier model accuracy: 99.82%
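
Fraud is usually a small share of all transactions, so accuracy alone can flatter a model. The sketch below uses made-up labels with an assumed 1% fraud rate to show that a "model" which never predicts fraud still scores 99% accuracy while catching no fraud at all. Recall, the share of actual fraud cases that are flagged, exposes the difference.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 1,000 transactions, 10 of them fraudulent (assumed rate).
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# A trivial baseline that labels every transaction as legitimate.
y_never_fraud = np.zeros(1000, dtype=int)

baseline_accuracy = accuracy_score(y_true, y_never_fraud)
baseline_recall = recall_score(y_true, y_never_fraud)

print(f"Baseline accuracy: {baseline_accuracy * 100:.2f}%")
print(f"Baseline recall:   {baseline_recall * 100:.2f}%")
```

For the models above, `y_test.mean()` gives the actual fraud rate in the test dataset, and `recall_score(y_test, rfc_predictions)` (and likewise for the other models) gives a figure to compare alongside accuracy.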
