## Data Dictionary
<b>Training set for Credit Card Transactions</b>
<li>index - Unique Identifier for each row (read in as the <code>Unnamed: 0</code> column)
<li>trans_date_trans_time - Transaction DateTime
<li>cc_num - Credit Card Number of Customer
<li>merchant - Merchant Name
<li>category - Category of Merchant
<li>amt - Amount of Transaction
<li>first - First Name of Credit Card Holder
<li>last - Last Name of Credit Card Holder
<li>gender - Gender of Credit Card Holder
<li>street - Street Address of Credit Card Holder
<li>city - City of Credit Card Holder
<li>state - State of Credit Card Holder
<li>zip - Zip of Credit Card Holder
<li>lat - Latitude Location of Credit Card Holder
<li>long - Longitude Location of Credit Card Holder
<li>city_pop - Credit Card Holder's City Population
<li>job - Job of Credit Card Holder
<li>dob - Date of Birth of Credit Card Holder
<li>trans_num - Transaction Number
<li>unix_time - UNIX Time of transaction
<li>merch_lat - Latitude Location of Merchant
<li>merch_long - Longitude Location of Merchant
<li>is_fraud - Fraud Flag `(Target Class)`

## Importing necessary libraries

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## Importing data

In [2]:
train_data=pd.read_csv("fraudTrain.csv")
test_data=pd.read_csv("fraudTest.csv")
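
The `Unnamed: 0` column that appears in the outputs of this notebook is what pandas typically produces when a CSV was saved together with its row index and is read back without `index_col`. A minimal sketch of that behaviour, using a tiny in-memory CSV rather than the real files (the CSV text and the variable names are assumptions made for illustration):

```python
import io

import pandas as pd

# Hypothetical two-row CSV whose first column is a saved, unnamed row index.
csv_text = ",amt,is_fraud\n0,4.97,0\n1,107.23,0\n"

# Default read: the unnamed first column becomes a regular column "Unnamed: 0".
default_read = pd.read_csv(io.StringIO(csv_text))
print(default_read.columns.tolist())

# With index_col=0 the same column is used as the row index instead.
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed_read.columns.tolist())
```

Either approach works for this notebook, since `Unnamed: 0` is dropped later on; `index_col=0` would simply avoid carrying the extra column around.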

### Exploring Data

In [3]:
train_data.head()

Unnamed: 0.1,Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,...,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
0,0,2019-01-01 00:00:18,2703186189652095,"fraud_Rippin, Kub and Mann",misc_net,4.97,Jennifer,Banks,F,561 Perry Cove,...,36.0788,-81.1781,3495,"Psychologist, counselling",1988-03-09,0b242abb623afc578575680df30655b9,1325376018,36.011293,-82.048315,0
1,1,2019-01-01 00:00:44,630423337322,"fraud_Heller, Gutmann and Zieme",grocery_pos,107.23,Stephanie,Gill,F,43039 Riley Greens Suite 393,...,48.8878,-118.2105,149,Special educational needs teacher,1978-06-21,1f76529f8574734946361c461b024d99,1325376044,49.159047,-118.186462,0
2,2,2019-01-01 00:00:51,38859492057661,fraud_Lind-Buckridge,entertainment,220.11,Edward,Sanchez,M,594 White Dale Suite 530,...,42.1808,-112.262,4154,Nature conservation officer,1962-01-19,a1a22d70485983eac12b5b88dad1cf95,1325376051,43.150704,-112.154481,0
3,3,2019-01-01 00:01:16,3534093764340240,"fraud_Kutch, Hermiston and Farrell",gas_transport,45.0,Jeremy,White,M,9443 Cynthia Court Apt. 038,...,46.2306,-112.1138,1939,Patent attorney,1967-01-12,6b849c168bdad6f867558c3793159a81,1325376076,47.034331,-112.561071,0
4,4,2019-01-01 00:03:06,375534208663984,fraud_Keeling-Crist,misc_pos,41.96,Tyler,Garcia,M,408 Bradley Rest,...,38.4207,-79.4629,99,Dance movement psychotherapist,1986-03-28,a41d7549acf90789359a9aa5346dcb46,1325376186,38.674999,-78.632459,0


In [4]:
test_data.head()

Unnamed: 0.1,Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,...,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
0,0,2020-06-21 12:14:25,2291163933867244,fraud_Kirlin and Sons,personal_care,2.86,Jeff,Elliott,M,351 Darlene Green,...,33.9659,-80.9355,333497,Mechanical engineer,1968-03-19,2da90c7d74bd46a0caf3777415b3ebd3,1371816865,33.986391,-81.200714,0
1,1,2020-06-21 12:14:33,3573030041201292,fraud_Sporer-Keebler,personal_care,29.84,Joanne,Williams,F,3638 Marsh Union,...,40.3207,-110.436,302,"Sales professional, IT",1990-01-17,324cc204407e99f51b0d6ca0055005e7,1371816873,39.450498,-109.960431,0
2,2,2020-06-21 12:14:53,3598215285024754,"fraud_Swaniawski, Nitzsche and Welch",health_fitness,41.28,Ashley,Lopez,F,9333 Valentine Point,...,40.6729,-73.5365,34496,"Librarian, public",1970-10-21,c81755dbbbea9d5c77f094348a7579be,1371816893,40.49581,-74.196111,0
3,3,2020-06-21 12:15:15,3591919803438423,fraud_Haley Group,misc_pos,60.05,Brian,Williams,M,32941 Krystal Mill Apt. 552,...,28.5697,-80.8191,54767,Set designer,1987-07-25,2159175b9efe66dc301f149d3d5abf8c,1371816915,28.812398,-80.883061,0
4,4,2020-06-21 12:15:17,3526826139003047,fraud_Johnston-Casper,travel,3.19,Nathan,Massey,M,5783 Evan Roads Apt. 465,...,44.2529,-85.017,1126,Furniture designer,1955-07-06,57ff021bd3f328f8738bb535c302a31b,1371816917,44.959148,-85.884734,0


In [5]:
train_data.info(), test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1296675 entries, 0 to 1296674
Data columns (total 23 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   Unnamed: 0             1296675 non-null  int64  
 1   trans_date_trans_time  1296675 non-null  object 
 2   cc_num                 1296675 non-null  int64  
 3   merchant               1296675 non-null  object 
 4   category               1296675 non-null  object 
 5   amt                    1296675 non-null  float64
 6   first                  1296675 non-null  object 
 7   last                   1296675 non-null  object 
 8   gender                 1296675 non-null  object 
 9   street                 1296675 non-null  object 
 10  city                   1296675 non-null  object 
 11  state                  1296675 non-null  object 
 12  zip                    1296675 non-null  int64  
 13  lat                    1296675 non-null  float64
 14  long              

(None, None)

In [6]:
train_data.describe(), test_data.describe()

(         Unnamed: 0        cc_num           amt           zip           lat  \
 count  1.296675e+06  1.296675e+06  1.296675e+06  1.296675e+06  1.296675e+06   
 mean   6.483370e+05  4.171920e+17  7.035104e+01  4.880067e+04  3.853762e+01   
 std    3.743180e+05  1.308806e+18  1.603160e+02  2.689322e+04  5.075808e+00   
 min    0.000000e+00  6.041621e+10  1.000000e+00  1.257000e+03  2.002710e+01   
 25%    3.241685e+05  1.800429e+14  9.650000e+00  2.623700e+04  3.462050e+01   
 50%    6.483370e+05  3.521417e+15  4.752000e+01  4.817400e+04  3.935430e+01   
 75%    9.725055e+05  4.642255e+15  8.314000e+01  7.204200e+04  4.194040e+01   
 max    1.296674e+06  4.992346e+18  2.894890e+04  9.978300e+04  6.669330e+01   
 
                long      city_pop     unix_time     merch_lat    merch_long  \
 count  1.296675e+06  1.296675e+06  1.296675e+06  1.296675e+06  1.296675e+06   
 mean  -9.022634e+01  8.882444e+04  1.349244e+09  3.853734e+01 -9.022646e+01   
 std    1.375908e+01  3.019564e+05  1.

In [7]:
train_data.isnull().sum()

Unnamed: 0               0
trans_date_trans_time    0
cc_num                   0
merchant                 0
category                 0
amt                      0
first                    0
last                     0
gender                   0
street                   0
city                     0
state                    0
zip                      0
lat                      0
long                     0
city_pop                 0
job                      0
dob                      0
trans_num                0
unix_time                0
merch_lat                0
merch_long               0
is_fraud                 0
dtype: int64

In [8]:
test_data.isnull().sum()

Unnamed: 0               0
trans_date_trans_time    0
cc_num                   0
merchant                 0
category                 0
amt                      0
first                    0
last                     0
gender                   0
street                   0
city                     0
state                    0
zip                      0
lat                      0
long                     0
city_pop                 0
job                      0
dob                      0
trans_num                0
unix_time                0
merch_lat                0
merch_long               0
is_fraud                 0
dtype: int64

### Cleaning Data

In [9]:
# dropping duplicate rows

train_data.drop_duplicates(inplace=True)

In [10]:
test_data.drop_duplicates(inplace=True)

In [11]:
# parsing the timestamp and keeping the calendar date (time set to midnight) in its own column
train_data['trans_date_trans_time']=pd.to_datetime(train_data['trans_date_trans_time'])
train_data['trans_date']=train_data['trans_date_trans_time'].dt.normalize()
train_data['dob']=pd.to_datetime(train_data['dob'])

test_data['trans_date_trans_time']=pd.to_datetime(test_data['trans_date_trans_time'])
test_data['trans_date']=test_data['trans_date_trans_time'].dt.normalize()
test_data['dob']=pd.to_datetime(test_data['dob'])

In [12]:
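# computing age in whole years at the time of the transaction
# (the 'Y' timedelta unit is ambiguous and may be rejected by recent pandas versions,
# so the elapsed days are divided by 365 instead; leap days are ignored)
train_data['age'] = (train_data['trans_date'] - train_data['dob']).dt.days // 365
train_data['age'] = train_data['age'].astype(int)

test_data['age'] = (test_data['trans_date'] - test_data['dob']).dt.days // 365
test_data['age'] = test_data['age'].astype(int)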
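
The age feature is the gap between the transaction date and the date of birth, expressed in whole years. A small sketch of the same idea on two date pairs taken from the sample rows shown earlier, approximating a year as 365 days (that approximation and the `toy` frame are assumptions of this sketch):

```python
import pandas as pd

# Two date pairs copied from the first train and second test sample rows.
toy = pd.DataFrame({
    "trans_date": pd.to_datetime(["2019-01-01", "2020-06-21"]),
    "dob": pd.to_datetime(["1988-03-09", "1990-01-17"]),
})

# Subtracting datetime columns gives timedeltas; .dt.days converts them to day counts.
elapsed_days = (toy["trans_date"] - toy["dob"]).dt.days

# Whole years, treating every year as 365 days (leap days are ignored).
toy["age"] = (elapsed_days // 365).astype(int)
print(toy)
```

Because leap days are ignored, this approximation can report an age one year too high in the last days before a birthday.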

After these steps, `train_data` and `test_data` look like:

In [13]:
train_data.head()

Unnamed: 0.1,Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,...,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud,trans_date,age
0,0,2019-01-01 00:00:18,2703186189652095,"fraud_Rippin, Kub and Mann",misc_net,4.97,Jennifer,Banks,F,561 Perry Cove,...,3495,"Psychologist, counselling",1988-03-09,0b242abb623afc578575680df30655b9,1325376018,36.011293,-82.048315,0,2019-01-01,30
1,1,2019-01-01 00:00:44,630423337322,"fraud_Heller, Gutmann and Zieme",grocery_pos,107.23,Stephanie,Gill,F,43039 Riley Greens Suite 393,...,149,Special educational needs teacher,1978-06-21,1f76529f8574734946361c461b024d99,1325376044,49.159047,-118.186462,0,2019-01-01,40
2,2,2019-01-01 00:00:51,38859492057661,fraud_Lind-Buckridge,entertainment,220.11,Edward,Sanchez,M,594 White Dale Suite 530,...,4154,Nature conservation officer,1962-01-19,a1a22d70485983eac12b5b88dad1cf95,1325376051,43.150704,-112.154481,0,2019-01-01,56
3,3,2019-01-01 00:01:16,3534093764340240,"fraud_Kutch, Hermiston and Farrell",gas_transport,45.0,Jeremy,White,M,9443 Cynthia Court Apt. 038,...,1939,Patent attorney,1967-01-12,6b849c168bdad6f867558c3793159a81,1325376076,47.034331,-112.561071,0,2019-01-01,52
4,4,2019-01-01 00:03:06,375534208663984,fraud_Keeling-Crist,misc_pos,41.96,Tyler,Garcia,M,408 Bradley Rest,...,99,Dance movement psychotherapist,1986-03-28,a41d7549acf90789359a9aa5346dcb46,1325376186,38.674999,-78.632459,0,2019-01-01,32


In [14]:
test_data.head()

Unnamed: 0.1,Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,...,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud,trans_date,age
0,0,2020-06-21 12:14:25,2291163933867244,fraud_Kirlin and Sons,personal_care,2.86,Jeff,Elliott,M,351 Darlene Green,...,333497,Mechanical engineer,1968-03-19,2da90c7d74bd46a0caf3777415b3ebd3,1371816865,33.986391,-81.200714,0,2020-06-21,52
1,1,2020-06-21 12:14:33,3573030041201292,fraud_Sporer-Keebler,personal_care,29.84,Joanne,Williams,F,3638 Marsh Union,...,302,"Sales professional, IT",1990-01-17,324cc204407e99f51b0d6ca0055005e7,1371816873,39.450498,-109.960431,0,2020-06-21,30
2,2,2020-06-21 12:14:53,3598215285024754,"fraud_Swaniawski, Nitzsche and Welch",health_fitness,41.28,Ashley,Lopez,F,9333 Valentine Point,...,34496,"Librarian, public",1970-10-21,c81755dbbbea9d5c77f094348a7579be,1371816893,40.49581,-74.196111,0,2020-06-21,49
3,3,2020-06-21 12:15:15,3591919803438423,fraud_Haley Group,misc_pos,60.05,Brian,Williams,M,32941 Krystal Mill Apt. 552,...,54767,Set designer,1987-07-25,2159175b9efe66dc301f149d3d5abf8c,1371816915,28.812398,-80.883061,0,2020-06-21,32
4,4,2020-06-21 12:15:17,3526826139003047,fraud_Johnston-Casper,travel,3.19,Nathan,Massey,M,5783 Evan Roads Apt. 465,...,1126,Furniture designer,1955-07-06,57ff021bd3f328f8738bb535c302a31b,1371816917,44.959148,-85.884734,0,2020-06-21,65


Dropping unnecessary columns

In [15]:
train_data.drop(['Unnamed: 0','cc_num','merchant','trans_num','unix_time','first','last','street','trans_date_trans_time',
                'category','city','state','job'], inplace=True, axis=1)

In [16]:
test_data.drop(['Unnamed: 0','cc_num','merchant','trans_num','unix_time','first','last','street','trans_date_trans_time',
                'category','city','state','job'], inplace=True, axis=1)

In [17]:
train_data['distance_lat']=abs(round(train_data['merch_lat']-train_data['lat']))
test_data['distance_lat']=abs(round(test_data['merch_lat']-test_data['lat']))

In [18]:
train_data['distance_long']=abs(round(train_data['merch_long']-train_data['long']))
test_data['distance_long']=abs(round(test_data['merch_long']-test_data['long']))
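
The two cells above describe how far the merchant is from the cardholder as rounded absolute differences in degrees. A minimal sketch of that calculation on the coordinates of the first training row shown earlier (the `toy` frame is only for illustration):

```python
import pandas as pd

# Cardholder and merchant coordinates of the first training row.
toy = pd.DataFrame({
    "lat": [36.0788], "long": [-81.1781],
    "merch_lat": [36.011293], "merch_long": [-82.048315],
})

# Round the signed difference to whole degrees, then take the absolute value.
toy["distance_lat"] = (toy["merch_lat"] - toy["lat"]).round().abs()
toy["distance_long"] = (toy["merch_long"] - toy["long"]).round().abs()
print(toy[["distance_lat", "distance_long"]])
```

Since the differences are rounded to whole degrees, the resulting features are coarse: one degree of latitude is roughly 111 km, so most nearby merchants end up with 0.0 or 1.0.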

In [19]:
train_data.drop(['lat','long','merch_lat','dob','merch_long','trans_date'],inplace=True, axis=1)

In [20]:
test_data.drop(['lat','long','merch_lat','dob','merch_long','trans_date'],inplace=True, axis=1)

In [21]:
train_data.head()

Unnamed: 0,amt,gender,zip,city_pop,is_fraud,age,distance_lat,distance_long
0,4.97,F,28654,3495,0,30,0.0,1.0
1,107.23,F,99160,149,0,40,0.0,0.0
2,220.11,M,83252,4154,0,56,1.0,0.0
3,45.0,M,59632,1939,0,52,1.0,0.0
4,41.96,M,24433,99,0,32,0.0,1.0


In [22]:
test_data.head()

Unnamed: 0,amt,gender,zip,city_pop,is_fraud,age,distance_lat,distance_long
0,2.86,M,29209,333497,0,52,0.0,0.0
1,29.84,F,84002,302,0,30,1.0,0.0
2,41.28,F,11710,34496,0,49,0.0,1.0
3,60.05,M,32780,54767,0,32,0.0,0.0
4,3.19,M,49632,1126,0,65,1.0,1.0


Encoding the categorical `gender` column as a boolean (`True` for `M`, `False` for `F`)

In [23]:
train_data['gender']=pd.get_dummies(train_data['gender'],drop_first=True)
train_data['gender']

0          False
1          False
2           True
3           True
4           True
           ...  
1296670     True
1296671     True
1296672     True
1296673     True
1296674     True
Name: gender, Length: 1296675, dtype: bool

In [24]:
test_data['gender']=pd.get_dummies(test_data['gender'],drop_first=True)
test_data['gender'].head()

0     True
1    False
2    False
3     True
4     True
Name: gender, dtype: bool
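
With `drop_first=True`, `pd.get_dummies` drops the first category in sorted order (`F`) and keeps a single indicator column for `M`, which is why the encoded column reads `True` for male cardholders. A short sketch on made-up values, alongside an explicit comparison that gives the same result (the `gender` series and `is_male` name are assumptions for illustration):

```python
import pandas as pd

gender = pd.Series(["F", "F", "M", "M"], name="gender")

# drop_first=True removes the "F" column, leaving only the "M" indicator.
dummies = pd.get_dummies(gender, drop_first=True)
print(dummies.columns.tolist())

# An explicit comparison yields the same indicator and states its meaning directly.
is_male = gender.eq("M")
print(is_male.tolist())
```

The explicit comparison does not depend on category ordering, which can make the encoding easier to read back later.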

In [25]:
train_data.head()

Unnamed: 0,amt,gender,zip,city_pop,is_fraud,age,distance_lat,distance_long
0,4.97,False,28654,3495,0,30,0.0,1.0
1,107.23,False,99160,149,0,40,0.0,0.0
2,220.11,True,83252,4154,0,56,1.0,0.0
3,45.0,True,59632,1939,0,52,1.0,0.0
4,41.96,True,24433,99,0,32,0.0,1.0


### Splitting Data

In [26]:
x_train=train_data.drop(['is_fraud'], axis=1)
y_train=train_data['is_fraud']
x_test=test_data.drop(['is_fraud'], axis=1)
y_test=test_data['is_fraud']

In [27]:
x_train

Unnamed: 0,amt,gender,zip,city_pop,age,distance_lat,distance_long
0,4.97,False,28654,3495,30,0.0,1.0
1,107.23,False,99160,149,40,0.0,0.0
2,220.11,True,83252,4154,56,1.0,0.0
3,45.00,True,59632,1939,52,1.0,0.0
4,41.96,True,24433,99,32,0.0,1.0
...,...,...,...,...,...,...,...
1296670,15.56,True,84735,258,58,1.0,1.0
1296671,51.70,True,21790,100,40,0.0,1.0
1296672,105.93,True,88325,899,52,1.0,1.0
1296673,74.90,True,57756,1126,39,1.0,1.0


In [28]:
y_train

0          0
1          0
2          0
3          0
4          0
          ..
1296670    0
1296671    0
1296672    0
1296673    0
1296674    0
Name: is_fraud, Length: 1296675, dtype: int64

## Training the model

### 1.) Logistic Regression

In [29]:
from sklearn.linear_model import LogisticRegression
logR=LogisticRegression()

In [30]:
logR

**Predicting**

In [31]:
logR.fit(x_train,y_train)
y_pred=logR.predict(x_test)

In [32]:
from sklearn.metrics import accuracy_score
accuracy=accuracy_score(y_test, y_pred)
print('Accuracy:',accuracy)
print("Model Score on Training data",logR.score(x_train,y_train))
print("Model Score on Testing data",logR.score(x_test ,y_test))

Accuracy: 0.9956128906875598
Model Score on Training data 0.9937185493666493
Model Score on Testing data 0.9956128906875598
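
Accuracy counts every correct prediction equally, so when one class is rare it can stay high even if most of the rare class is missed. The sketch below uses made-up labels (not this dataset; the 0.5% fraud rate and the classifier's behaviour are assumptions) to show how `recall_score`, `precision_score`, and `confusion_matrix` from scikit-learn expose that:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Toy labels: 1000 transactions, 5 of them fraudulent.
y_true = np.zeros(1000, dtype=int)
y_true[:5] = 1

# A hypothetical classifier that flags only one of the five fraud cases.
y_hat = np.zeros(1000, dtype=int)
y_hat[0] = 1

print("Accuracy :", accuracy_score(y_true, y_hat))
print("Recall   :", recall_score(y_true, y_hat))
print("Precision:", precision_score(y_true, y_hat))

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true, y_hat))
```

The same calls could be applied to `y_test` and `y_pred` from the cell above to see how many fraudulent transactions each model actually catches.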


### 2.) Decision Tree Classifier

In [33]:
from sklearn.tree import DecisionTreeClassifier
dtc=DecisionTreeClassifier()

**Predicting**

In [34]:
dtc.fit(x_train,y_train)
y_pred=dtc.predict(x_test)

In [35]:
accuracy=accuracy_score(y_test, y_pred)
print('Accuracy:',accuracy)
print("Model Score on Training data",dtc.score(x_train,y_train))
print("Model Score on Testing data",dtc.score(x_test ,y_test))

Accuracy: 0.9931368191478067
Model Score on Training data 0.9999699230724738
Model Score on Testing data 0.9931368191478067


### 3.) Random Forest Classifier

In [36]:
from sklearn.ensemble import RandomForestClassifier
rfc=RandomForestClassifier()

**Predicting**

In [37]:
rfc.fit(x_train,y_train)
y_pred=rfc.predict(x_test)

In [38]:
y_pred

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [39]:
accuracy=accuracy_score(y_test, y_pred)
print('Accuracy:',accuracy)
print("Model Score on Training data",rfc.score(x_train,y_train))
print("Model Score on Testing data",rfc.score(x_test ,y_test))

Accuracy: 0.9959385948653906
Model Score on Training data 0.9999614398365049
Model Score on Testing data 0.9959385948653906
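
The three model sections repeat the same fit, predict, and score steps. A compact sketch of that pattern as a loop, run on synthetic data from `make_classification` so that it is self-contained (the data, the `random_state` values, `max_iter`, and `n_estimators` are assumptions and do not come from the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the notebook itself uses x_train/x_test from the CSV files.
X, y = make_classification(n_samples=600, n_features=7, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # .score() returns mean accuracy for scikit-learn classifiers.
    scores[name] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"{name}: train={scores[name][0]:.4f} test={scores[name][1]:.4f}")
```

Keeping the scores in one dictionary makes it easier to compare training and testing accuracy across models side by side.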
