# Machine Learning Project - Predict Credit Card Fraud 

Credit card fraud is one of the leading causes of identity theft around the world. In 2018 alone, over $24 billion was stolen through fraudulent credit card transactions. Financial institutions employ a wide variety of techniques to prevent fraud, one of the most common being Logistic Regression.

In this project, you are a Data Scientist working for a credit card company. You have access to a dataset (based on a synthetic financial dataset) that represents a typical set of credit card transactions. transactions_modified.csv is the original dataset containing 200k transactions. Your task is to use Logistic Regression to build a predictive model that determines whether a transaction is fraudulent.



In [1]:
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

Let’s begin by loading the data into a pandas DataFrame named transact. Take a peek at the dataset using .head(), and use .info() to examine how many rows there are and what datatypes the columns have.

In [2]:
transact = pd.read_csv('transactions_modified.csv')
transact

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.00,160296.36,M1979787155,0.00,0.00,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.00,19384.72,M2044282225,0.00,0.00,0,0
2,1,TRANSFER,181.00,C1305486145,181.00,0.00,C553264065,0.00,0.00,1,0
3,1,CASH_OUT,181.00,C840083671,181.00,0.00,C38997010,21182.00,0.00,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.00,29885.86,M1230701703,0.00,0.00,0,0
...,...,...,...,...,...,...,...,...,...,...,...
6362615,743,CASH_OUT,339682.13,C786484425,339682.13,0.00,C776919290,0.00,339682.13,1,0
6362616,743,TRANSFER,6311409.28,C1529008245,6311409.28,0.00,C1881841831,0.00,0.00,1,0
6362617,743,CASH_OUT,6311409.28,C1162922333,6311409.28,0.00,C1365125890,68488.84,6379898.11,1,0
6362618,743,TRANSFER,850002.52,C1685995037,850002.52,0.00,C2080388513,0.00,0.00,1,0


In [3]:
transact.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


In [4]:
transact.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
 #   Column          Dtype  
---  ------          -----  
 0   step            int64  
 1   type            object 
 2   amount          float64
 3   nameOrig        object 
 4   oldbalanceOrg   float64
 5   newbalanceOrig  float64
 6   nameDest        object 
 7   oldbalanceDest  float64
 8   newbalanceDest  float64
 9   isFraud         int64  
 10  isFlaggedFraud  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB


## How many transactions are fraudulent?

In [5]:
transact['isFraud'].value_counts()

0    6354407
1       8213
Name: isFraud, dtype: int64

8213 transactions are fraudulent.
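
The raw counts show how imbalanced the label is. As a minimal sketch, the two counts printed by value_counts() are hard-coded below into a small Series (an illustrative stand-in, so the CSV does not need to be reloaded) to compute the share of fraudulent transactions:

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({0: 6354407, 1: 8213}, name="isFraud")

# Share of each class. On the real DataFrame,
# transact['isFraud'].value_counts(normalize=True) gives the same ratios.
fraud_share = counts / counts.sum()
print(f"Fraudulent share: {fraud_share[1]:.4%}")
```

The fraud share comes out at roughly 0.13%, which is worth keeping in mind when reading any accuracy figure later on.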

Looking at the dataset, combined with our knowledge of credit card transactions in general, we can see that there are a few interesting columns to look at. We know that the amount of a given transaction is going to be important. Calculate summary statistics for this column.

In [6]:
transact['amount'].describe()

count    6.362620e+06
mean     1.798619e+05
std      6.038582e+05
min      0.000000e+00
25%      1.338957e+04
50%      7.487194e+04
75%      2.087215e+05
max      9.244552e+07
Name: amount, dtype: float64

We have a lot of information about the type of transaction we are looking at. Let’s create a new column called isPayment that assigns a 1 when type is “PAYMENT” or “DEBIT”, and a 0 otherwise.

In [7]:
transact['isPayment'] = 0

In [8]:
transact.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,isPayment
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0,0


In [None]:
# Use .loc to set the flag in place; chained indexing can assign to a temporary copy.
transact.loc[transact['type'].isin(['PAYMENT','DEBIT']), 'isPayment'] = 1

In [10]:
transact.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,isPayment
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0,1
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0,1
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0,1


Similarly, create a column called isMovement, which will capture if money moved out of the origin account. This column will have a value of 1 when type is either “CASH_OUT” or “TRANSFER”, and a 0 otherwise.

In [11]:
transact['isMovement'] = 0

In [12]:
transact.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,isPayment,isMovement
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0,1,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0,1,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0,0,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0,0,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0,1,0


In [None]:
# Use .loc to set the flag in place; chained indexing can assign to a temporary copy.
transact.loc[transact['type'].isin(['CASH_OUT','TRANSFER']), 'isMovement'] = 1

In [14]:
transact.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,isPayment,isMovement
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0,1,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0,1,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0,0,1
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0,0,1
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0,1,0
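
Both flag columns can also be built in a single step each, without first filling the column with zeros. The sketch below uses a tiny hypothetical DataFrame named demo (its rows are illustrative, not taken from the dataset) and relies on isin() returning a boolean Series that astype(int) converts to 1/0:

```python
import pandas as pd

# Tiny stand-in for the transactions DataFrame (values are illustrative).
demo = pd.DataFrame(
    {"type": ["PAYMENT", "TRANSFER", "CASH_OUT", "DEBIT", "CASH_IN"]}
)

# isin() returns True/False per row; astype(int) turns that into 1/0.
demo["isPayment"] = demo["type"].isin(["PAYMENT", "DEBIT"]).astype(int)
demo["isMovement"] = demo["type"].isin(["CASH_OUT", "TRANSFER"]).astype(int)
print(demo)
```

A type that is in neither list, such as CASH_IN here, ends up with 0 in both columns.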


With financial fraud, another key factor to investigate is the difference in value between the origin and destination accounts. Our theory is that a destination account whose balance differs significantly from the origin account's balance could be a sign of fraud. Let’s create a column called accountDiff with the absolute difference of the oldbalanceOrg and oldbalanceDest columns.

In [15]:
transact['accountDiff'] = abs(transact['oldbalanceDest'] - transact['oldbalanceOrg'])
transact['accountDiff'] 

0           170136.00
1            21249.00
2              181.00
3            21001.00
4            41554.00
              ...    
6362615     339682.13
6362616    6311409.28
6362617    6242920.44
6362618     850002.52
6362619    5660096.59
Name: accountDiff, Length: 6362620, dtype: float64
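
To check the calculation by hand, the sketch below rebuilds the two balance columns for the first five rows shown by head() earlier (the DataFrame name demo is an assumption for illustration) and applies the same absolute difference using the Series.abs() method:

```python
import pandas as pd

# Balances copied from the first five rows shown by head() earlier.
demo = pd.DataFrame({
    "oldbalanceOrg": [170136.0, 21249.0, 181.0, 181.0, 41554.0],
    "oldbalanceDest": [0.0, 0.0, 0.0, 21182.0, 0.0],
})

# Series.abs() is the pandas method equivalent of the built-in abs() used above.
demo["accountDiff"] = (demo["oldbalanceDest"] - demo["oldbalanceOrg"]).abs()
print(demo["accountDiff"].tolist())
```

The five values agree with the first five entries of the accountDiff output above. For example, row 3 gives |21182.00 - 181.00| = 21001.00.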

Before we can start training our model, we need to define our features and label columns. Our label column in this dataset is the isFraud field. Create a variable called features that holds the following fields:

    amount
    isPayment
    isMovement
    accountDiff

Also create a variable called label with the column isFraud.

In [16]:
features = transact[['amount','isPayment','isMovement','accountDiff']]

In [17]:
label = transact['isFraud']

Split the data into training and test sets using sklearn’s train_test_split() method. We’ll use the training set to train the model and the test set to evaluate the model. Use a test_size value of 0.3.

In [18]:
x_train,x_test,y_train,y_test = train_test_split(features,label,test_size=0.3)
x_train,x_test,y_train,y_test

(            amount  isPayment  isMovement  accountDiff
 3885020  899238.43          0           1   2162122.87
 2673950   60833.79          0           0   5446438.26
 4655337    6229.07          1           0         0.00
 5267794  148603.72          0           1   9332609.65
 226415    26281.46          0           1    113696.98
 ...            ...        ...         ...          ...
 555264     6909.11          0           1     23608.31
 5567846   95143.04          0           1    143183.24
 376113    87485.79          0           1   1065173.33
 3560982   22293.80          0           1   1455315.43
 5853637    2977.91          1           0         0.00
 
 [4453834 rows x 4 columns],
             amount  isPayment  isMovement  accountDiff
 358789    17601.23          0           0    241202.40
 762125    11855.30          1           0         0.00
 3564714  325523.42          0           0    979314.26
 1771118  291417.95          0           0    496401.32
 1231927  189898.
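
train_test_split() shuffles at random, so each run of the cell above produces a different split, and with so few fraud cases the fraud share can drift between the two sets. The sketch below uses synthetic data (the array names and the 2% positive rate are illustrative assumptions, not taken from the dataset) to show one way of fixing the seed with random_state and preserving the class ratio with stratify:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 4 features, 2% positive labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.zeros(1000, dtype=int)
y[:20] = 1

# random_state makes the split reproducible; stratify=y keeps the
# proportion of positives the same in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```

Whether stratification is appropriate for the real project is a judgment call. It is shown here only as an option that the split above does not use.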

Since sklearn’s Logistic Regression implementation uses regularization, we need to scale our feature data. Create a StandardScaler object, .fit_transform() it on the training features, and .transform() the test features.

In [19]:
scaler = StandardScaler()

In [20]:
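# fit_transform() returns a new array, so keep the result in a variable.
x_train_scaled = scaler.fit_transform(x_train)
x_train_scaled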

array([[ 1.18280959, -0.72535339,  1.13865193,  0.14834208],
       [-0.19614882, -0.72535339, -0.87823151,  0.96365495],
       [-0.28595944,  1.37863836, -0.87823151, -0.38839279],
       ...,
       [-0.15231318, -0.72535339,  1.13865193, -0.12396948],
       [-0.25953712, -0.72535339,  1.13865193, -0.0271189 ],
       [-0.29130675,  1.37863836, -0.87823151, -0.38839279]])

In [21]:
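# Reuse the statistics learned from the training set; do not refit on the test set.
x_test_scaled = scaler.transform(x_test)
x_test_scaled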

array([[-0.26725518, -0.72535339, -0.87823151, -0.32851564],
       [-0.27670574,  1.37863836, -0.87823151, -0.38839279],
       [ 0.23919704, -0.72535339, -0.87823151, -0.14528352],
       ...,
       [ 0.2742046 , -0.72535339,  1.13865193, -0.30943787],
       [-0.03775594, -0.72535339, -0.87823151, -0.38073345],
       [-0.29598166,  1.37863836, -0.87823151, -0.3861067 ]])
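
StandardScaler does not modify its input in place: fit_transform() and transform() return new arrays, which must be stored if they are to be used for training. A minimal sketch on synthetic data (all variable names ending in _demo or _scaled are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features on a large scale, standing in for amounts and balances.
rng = np.random.default_rng(0)
x_train_demo = rng.normal(loc=1000.0, scale=250.0, size=(200, 2))
x_test_demo = rng.normal(loc=1000.0, scale=250.0, size=(50, 2))

scaler = StandardScaler()
# Learn mean and standard deviation from the training data only, then apply them.
x_train_scaled = scaler.fit_transform(x_train_demo)
# The test data is transformed with the training statistics, never refit.
x_test_scaled = scaler.transform(x_test_demo)

print(x_train_scaled.mean(axis=0), x_train_scaled.std(axis=0))
```

After scaling, each training column has mean 0 and standard deviation 1 (up to floating-point error). The test columns will be close to that, but not exactly, because they were scaled with the training statistics.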

Create a LogisticRegression model with sklearn and .fit() it on the training data.

Fitting the model finds the best coefficients for our selected features so that it can more accurately predict our label. We will start with the default threshold of 0.5.

In [22]:
model = LogisticRegression()
model

LogisticRegression()

In [23]:
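# Note: x_train is the unscaled DataFrame from train_test_split; the scaler output is not used here.
model.fit(x_train,y_train)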

LogisticRegression()

In [24]:
model.score(x_train,y_train)

0.9986932606828185

In [25]:
model.score(x_test,y_test)

0.9986860758618306
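
An accuracy near 99.87% needs context: since only about 0.13% of transactions are fraudulent, a model that never flags fraud would score about the same. The sketch below uses synthetic imbalanced data from make_classification (not the real transactions; all names and parameters are illustrative assumptions) to compare accuracy against a majority-class baseline and to show how recall changes when the 0.5 threshold is lowered:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced data (roughly 2% positives).
X, y = make_classification(
    n_samples=5000, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.98, 0.02], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = LogisticRegression().fit(X_train, y_train)

# Baseline: always predict the majority class (0, "not fraud").
baseline_acc = accuracy_score(y_test, np.zeros_like(y_test))

# Probability of the positive class, cut at two different thresholds.
proba = clf.predict_proba(X_test)[:, 1]
recalls = {}
for threshold in (0.5, 0.2):
    pred = (proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_test, pred)
    print(threshold, accuracy_score(y_test, pred), recalls[threshold])
print("baseline accuracy:", baseline_acc)
```

Lowering the threshold can only flag more transactions, so recall never decreases, usually at the cost of more false alarms. For fraud detection, recall and precision on the fraud class are typically more informative than accuracy alone.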

In [26]:
model.coef_

array([[ 4.41205056e-07, -5.28300155e+00, -3.76135989e-01,
        -2.20668712e-07]])
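
The coefficients are listed in the same order as the columns of features: amount, isPayment, isMovement, accountDiff. One common way to read them is to exponentiate each one, which gives the multiplicative change in the odds of fraud for a one-unit increase in that feature. The sketch below hard-codes the values printed above. They come from one random split, so a rerun would give slightly different numbers.

```python
import numpy as np

feature_names = ["amount", "isPayment", "isMovement", "accountDiff"]
# Coefficients copied from the model.coef_ output above.
coef = np.array(
    [4.41205056e-07, -5.28300155e+00, -3.76135989e-01, -2.20668712e-07]
)

# exp(coefficient) is the factor by which the odds of fraud change for a
# one-unit increase in that feature, holding the other features fixed.
odds_ratios = np.exp(coef)
for name, ratio in zip(feature_names, odds_ratios):
    print(f"{name:12s} {ratio:.6f}")
```

Because model.fit() above was given x_train as returned by train_test_split, one unit of amount or accountDiff is a single currency unit, which is why those two coefficients look tiny. The strongly negative isPayment coefficient says that PAYMENT and DEBIT transactions have much lower odds of being labeled fraud in this fit.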