# Predict Credit Card Fraud
## Machine Learning Project
### Logistic Regression

This project investigates a synthetic financial dataset that represents a typical set of credit card transactions. The goal is to predict whether a transaction is fraudulent.

### Import Python Modules

In [1]:
import pandas as pd
import numpy as np
import seaborn
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [4]:
# Load the Data
transactions = pd.read_csv('transactions.csv')

In [5]:
transactions.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


In [6]:
transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
 #   Column          Dtype  
---  ------          -----  
 0   step            int64  
 1   type            object 
 2   amount          float64
 3   nameOrig        object 
 4   oldbalanceOrg   float64
 5   newbalanceOrig  float64
 6   nameDest        object 
 7   oldbalanceDest  float64
 8   newbalanceDest  float64
 9   isFraud         int64  
 10  isFlaggedFraud  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB


Explore how many transactions are fraudulent using `isFraud`.

In [8]:
transactions['isFraud'].sum()

8213
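
The raw count is easier to judge as a share of all transactions. Here is a minimal sketch, using a small hypothetical `toy` frame in place of `transactions` so that it runs on its own:

```python
import pandas as pd

# Hypothetical stand-in for `transactions`: 10 rows, 1 of them fraudulent
toy = pd.DataFrame({"isFraud": [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]})

fraud_count = int(toy["isFraud"].sum())
# The mean of a 0/1 column is the fraction of rows equal to 1
fraud_rate = toy["isFraud"].mean()
# Share of rows per class, indexed by the label value
class_share = toy["isFraud"].value_counts(normalize=True)

print(fraud_count, fraud_rate)
print(class_share)
```

Applied to the figures above, 8213 fraudulent rows out of 6362620 is about 0.13% of the data, so the two classes are heavily imbalanced.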

Create a summary statistic with `amount`.

In [9]:
transactions['amount'].describe()

count    6.362620e+06
mean     1.798619e+05
std      6.038582e+05
min      0.000000e+00
25%      1.338957e+04
50%      7.487194e+04
75%      2.087215e+05
max      9.244552e+07
Name: amount, dtype: float64

Create a column `isPayment` to separate `PAYMENT` and `DEBIT` transactions from the other types.
Assign a `1` when `type` is `PAYMENT` or `DEBIT`, and a `0` otherwise.

In [24]:
transactions['isPayment'] = transactions['type'].apply(\
                                    lambda x: 1 if x=='PAYMENT' or x=='DEBIT' else 0)
transactions.head(10)

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,isPayment,isMovement
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0,1,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0,1,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0,0,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0,0,1
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0,1,0
5,1,PAYMENT,7817.71,C90045638,53860.0,46042.29,M573487274,0.0,0.0,0,0,1,0
6,1,PAYMENT,7107.77,C154988899,183195.0,176087.23,M408069119,0.0,0.0,0,0,1,0
7,1,PAYMENT,7861.64,C1912850431,176087.23,168225.59,M633326333,0.0,0.0,0,0,1,0
8,1,PAYMENT,4024.36,C1265012928,2671.0,0.0,M1176932104,0.0,0.0,0,0,1,0
9,1,DEBIT,5337.77,C712410124,41720.0,36382.23,C195600860,41898.0,40348.79,0,0,1,0


Create a column `isMovement` and assign a `1` when `type` is either `CASH_OUT` or `TRANSFER` and a `0` otherwise.

In [25]:
transactions['isMovement'] = transactions['type'].apply(\
                                    lambda x: 1 if x=='CASH_OUT' or x=='TRANSFER' else 0)
transactions.head(10)

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,isPayment,isMovement
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0,1,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0,1,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0,0,1
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0,0,1
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0,1,0
5,1,PAYMENT,7817.71,C90045638,53860.0,46042.29,M573487274,0.0,0.0,0,0,1,0
6,1,PAYMENT,7107.77,C154988899,183195.0,176087.23,M408069119,0.0,0.0,0,0,1,0
7,1,PAYMENT,7861.64,C1912850431,176087.23,168225.59,M633326333,0.0,0.0,0,0,1,0
8,1,PAYMENT,4024.36,C1265012928,2671.0,0.0,M1176932104,0.0,0.0,0,0,1,0
9,1,DEBIT,5337.77,C712410124,41720.0,36382.23,C195600860,41898.0,40348.79,0,0,1,0
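
The two indicator columns can also be built without a row-by-row `lambda`. The sketch below is one possible alternative, not the approach used in this notebook. It relies on `Series.isin`, and the `toy` frame and its `OTHER` value are hypothetical placeholders:

```python
import pandas as pd

# Hypothetical stand-in for the `type` column of `transactions`
toy = pd.DataFrame({"type": ["PAYMENT", "TRANSFER", "CASH_OUT", "DEBIT", "OTHER"]})

# isin() returns a boolean Series; astype(int) turns True/False into 1/0
toy["isPayment"] = toy["type"].isin(["PAYMENT", "DEBIT"]).astype(int)
toy["isMovement"] = toy["type"].isin(["CASH_OUT", "TRANSFER"]).astype(int)

print(toy)
```

On a frame with millions of rows, a vectorized call like this is usually faster than `apply` with a Python function.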


Next, investigate the difference in value between the origin and destination account. Create a column `accountDiff` with the absolute difference of the `oldbalanceOrg` and `oldbalanceDest` columns.

In [27]:
transactions['accountDiff'] = abs(transactions['oldbalanceOrg']\
                                  -transactions['oldbalanceDest'])
transactions.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,isPayment,isMovement,accountDiff
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0,1,0,170136.0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0,1,0,21249.0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0,0,1,181.0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0,0,1,21001.0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0,1,0,41554.0


Select and split the data before training the model. Create a variable `features` consisting of the columns `amount`, `isPayment`, `isMovement`, and `accountDiff`, and a variable `label` holding the column `isFraud`.

In [30]:
features = transactions[['amount', 'isPayment', 'isMovement', 'accountDiff']]
# Use a 1-D Series; a single-column DataFrame triggers a DataConversionWarning in fit()
label = transactions['isFraud']

Split the data into a training set (70%) and a testing set (30%).

In [31]:
features_train, features_test, label_train, label_test = train_test_split(\
                                        features, label, train_size=0.7, test_size=0.3)
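
With so few fraudulent rows, a purely random split can leave the two sets with slightly different fraud rates, and it changes on every run. The sketch below shows the `stratify` and `random_state` parameters of `train_test_split` on hypothetical toy data. Neither parameter is used in the original cell, and the seed value is arbitrary:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 rows, 2% positive labels
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.zeros(1000, dtype=int)
y[:20] = 1

# stratify=y keeps the class ratio the same in both splits;
# random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=42
)

print(len(X_train), len(X_test))
print(y_train.mean(), y_test.mean())
```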

In [33]:
# Fit the scaler on the training set only, then apply the same scaling to the test set
scaler = StandardScaler()
features_train = scaler.fit_transform(features_train)
features_test = scaler.transform(features_test)

In [36]:
lr = LogisticRegression()
lr.fit(features_train, label_train)
print(lr.coef_)

[[ 0.21794987 -0.90785413  3.65989696 -0.64283808]]


Scoring the model runs the data through the fitted model and predicts which transactions are fraudulent. The score returned is the fraction of correct classifications, that is, the accuracy.

In [39]:
# Score the model using the training set of data
print(lr.score(features_train, label_train))

# Score the model using the testing set of data
print(lr.score(features_test, label_test))

0.998674400527725
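
Accuracy alone can be misleading on data this imbalanced. As a rough check, the sketch below computes the accuracy of a trivial classifier that never predicts fraud, using the counts reported earlier. It then uses hypothetical toy labels to show how recall exposes what accuracy hides:

```python
from sklearn.metrics import accuracy_score, recall_score

# Figures reported earlier in this notebook
n_total = 6362620
n_fraud = 8213

# Accuracy of a trivial model that labels every transaction as non-fraudulent
baseline_accuracy = 1 - n_fraud / n_total
print(round(baseline_accuracy, 6))

# Hypothetical toy labels: 8 legitimate, 2 fraudulent, and an all-zero prediction
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

toy_accuracy = accuracy_score(y_true, y_pred)  # share of correct labels
toy_recall = recall_score(y_true, y_pred)      # share of fraud cases caught
print(toy_accuracy, toy_recall)
```

The baseline comes out at roughly 0.9987, about the same as the scores above. Recall or precision on the fraud class (for example via `sklearn.metrics.classification_report`) would therefore say more about how well fraud is actually detected.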
0.9987264156380024


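The coefficients indicate how important each feature column is for prediction. Because the features were standardized, their magnitudes can be compared directly: `isMovement` has the largest absolute coefficient and is the most important factor, while `amount` has the smallest and is the least important.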

In [40]:
# Print the model coefficients
print(lr.coef_)

[[ 0.21794987 -0.90785413  3.65989696 -0.64283808]]
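
The printed array is easier to read when each value is paired with its feature name. The sketch below copies the coefficients printed above and ranks them by absolute size. The `odds_ratio` column is an added illustration: `exp(coef)` is the factor by which the odds of fraud change for a one-unit increase in the standardized feature.

```python
import numpy as np
import pandas as pd

# Feature order and coefficient values as printed above
feature_names = ["amount", "isPayment", "isMovement", "accountDiff"]
coef = np.array([0.21794987, -0.90785413, 3.65989696, -0.64283808])

coef_table = pd.DataFrame({
    "feature": feature_names,
    "coef": coef,
    # Multiplicative change in the odds of fraud per one-unit increase
    "odds_ratio": np.exp(coef),
})

# Rank features by absolute coefficient size, largest first
order = coef_table["coef"].abs().sort_values(ascending=False).index
coef_table = coef_table.loc[order].reset_index(drop=True)

print(coef_table)
```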


Use the model to process more transactions.

In [41]:
# New transaction data
transaction1 = np.array([12345.78, 0.0, 1.0, 54670.1])
transaction2 = np.array([98765.43, 1.0, 0.0, 8524.75])
transaction3 = np.array([543678.31, 1.0, 0.0, 510025.5])
your_transaction = np.array([345321.56, 0.0, 1.0, 200500.0])

In [43]:
# Combine the new transactions into a single array
sample_transactions = np.stack([transaction1, transaction2, transaction3, your_transaction])
sample_transactions

array([[1.2345780e+04, 0.0000000e+00, 1.0000000e+00, 5.4670100e+04],
       [9.8765430e+04, 1.0000000e+00, 0.0000000e+00, 8.5247500e+03],
       [5.4367831e+05, 1.0000000e+00, 0.0000000e+00, 5.1002550e+05],
       [3.4532156e+05, 0.0000000e+00, 1.0000000e+00, 2.0050000e+05]])

To use this logistic regression model on the new transaction data, the data must first be scaled with the same fitted `scaler` and its `.transform()` method, as was done for the test set earlier. Then predict the result with the model.

In [46]:
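# Note: this overwrites sample_transactions, so re-running this cell would scale already-scaled data
sample_transactions = scaler.transform(sample_transactions)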
lr.predict(sample_transactions)

array([0, 1, 0, 0])
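
One way to avoid forgetting the scaling step, or applying it twice, is to bundle the scaler and the model into a single estimator. The sketch below is not how this notebook is structured. It uses scikit-learn's `make_pipeline` on hypothetical toy data and checks that the result matches the manual scale-then-predict route:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with features on very different scales
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) * np.array([1000.0, 1.0])
y = (X[:, 1] > 0).astype(int)

# The pipeline fits the scaler and the model together on the training data
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

# New, unscaled rows are scaled inside the pipeline before prediction
X_new = np.array([[500.0, 2.0], [-250.0, -2.0]])
pipeline_pred = model.predict(X_new)

# Equivalent manual route: scale explicitly, then predict
scaler = StandardScaler().fit(X)
lr = LogisticRegression().fit(scaler.transform(X), y)
manual_pred = lr.predict(scaler.transform(X_new))

print(pipeline_pred, manual_pred)
```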

Calculate the probability of each transaction being fraudulent or not. The first column is the probability that a transaction is not fraudulent, and the second column is the probability that it is fraudulent. The model uses these probabilities to make the final classification decision.

In [49]:
print('Probability (not fraudulent) & Probability (fraudulent) \n', lr.predict_proba(sample_transactions))

Probability (not fraudulent) & Probability (fraudulent) 
 [[1. 0.]
 [0. 1.]
 [1. 0.]
 [1. 0.]]
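
For a binary logistic regression, the second column of `predict_proba` is the sigmoid of a linear score built from the coefficients and the intercept. The sketch below reproduces that calculation by hand on hypothetical toy data generated with `make_classification` (not the transactions dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data; not the transactions dataset
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
lr = LogisticRegression().fit(X, y)

# Linear score z = X @ coef + intercept, one value per row
z = X[:5] @ lr.coef_.ravel() + lr.intercept_[0]
# The sigmoid maps the score to the probability of class 1
manual_p1 = 1.0 / (1.0 + np.exp(-z))

proba = lr.predict_proba(X[:5])
# predict() returns class 1 when that probability exceeds 0.5
manual_labels = (manual_p1 > 0.5).astype(int)

print(np.round(proba, 4))
print(manual_labels)
```

Each row of `proba` sums to 1, and the predicted label is the class with the higher probability.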
