# Predict Credit Card Fraud


In [1]:
import os
os.chdir(r'C:\Users\Pedram\Documents\GitHub\Credit_Card_Fraud_Prediction')

The data on 1000 simulated credit card transactions is in the file `transaction_modified.csv`. We'll start by loading it into a pandas DataFrame named `transactions`.

In [2]:
import seaborn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Load the data
transactions = pd.read_csv('transaction_modified.csv')
print(transactions.head())
print(transactions.info())

   step      type      amount     nameOrig  oldbalanceOrg  newbalanceOrig  \
0   206  CASH_OUT    62927.08   C473782114           0.00            0.00   
1   380   PAYMENT    32851.57  C1915112886           0.00            0.00   
2   570  CASH_OUT  1131750.38  C1396198422     1131750.38            0.00   
3   184  CASH_OUT    60519.74   C982551468       60519.74            0.00   
4   162   CASH_IN    46716.01  C1759889425     7668050.60      7714766.61   

      nameDest  oldbalanceDest  newbalanceDest  isFraud  isPayment  \
0  C2096898696       649420.67       712347.75        0          0   
1   M916879292            0.00            0.00        0          1   
2  C1612235515       313070.53      1444820.92        1          0   
3  C1378644910        54295.32       182654.50        1          0   
4  C2059152908      2125468.75      2078752.75        0          0   

   isMovement  accountDiff  
0           1    649420.67  
1           0         0.00  
2           1    818679.85  


In [3]:
# How many fraudulent transactions?
print('The num of fraudulent transactions: ', transactions.isFraud.sum())

The num of fraudulent transactions:  282
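
The fraud count is easier to interpret as a share of all rows. The sketch below uses a hypothetical stand-in series (`is_fraud`) built from the count printed above and the 1000-row dataset size, rather than the real CSV.

```python
import pandas as pd

# Hypothetical stand-in for the `isFraud` column: 282 fraud rows out of 1000,
# matching the count printed above.
is_fraud = pd.Series([1] * 282 + [0] * 718, name="isFraud")

# normalize=True returns proportions instead of raw counts.
class_share = is_fraud.value_counts(normalize=True)
print(class_share)

# The mean of a 0/1 column is the positive-class rate.
fraud_rate = is_fraud.mean()
print(f"Fraud rate: {fraud_rate:.1%}")
```

On the real data, the same two calls could be applied to `transactions['isFraud']` directly.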


2. Summary statistics of the `amount` column give a general view of its distribution.

In [4]:
# Summary statistics on amount column
transactions['amount'].describe()

count    1.000000e+03
mean     5.373080e+05
std      1.423692e+06
min      0.000000e+00
25%      2.933705e+04
50%      1.265305e+05
75%      3.010378e+05
max      1.000000e+07
Name: amount, dtype: float64

3. A new column named `isPayment` is introduced. It takes the value `1` when `type` is either `PAYMENT` or `DEBIT`, and `0` otherwise.

In [5]:
transactions['isPayment'] = 0
transactions.loc[transactions['type'].isin(['PAYMENT', 'DEBIT']), 'isPayment'] = 1

In [6]:
payment = transactions[transactions.type == 'PAYMENT']
payment.describe()

| | step | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | isPayment | isMovement | accountDiff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 |
| mean | 247.906542 | 13442.746308 | 52711.04972 | 46573.490701 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 52711.04972 |
| std | 127.425429 | 14006.458934 | 105727.989882 | 102165.163351 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 105727.989882 |
| min | 1.0 | 206.69 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 25% | 177.25 | 3442.995 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 50% | 252.5 | 8719.545 | 3493.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3493.0 |
| 75% | 331.75 | 19097.3525 | 52815.6725 | 44299.2325 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 52815.6725 |
| max | 596.0 | 76894.32 | 610962.0 | 603792.35 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 610962.0 |


In [7]:
# Use a separate name so the DEBIT subset does not overwrite the PAYMENT subset
debit = transactions[transactions.type == 'DEBIT']
debit.describe()

| | step | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | isPayment | isMovement | accountDiff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 |
| mean | 213.0 | 11848.195 | 48101.151667 | 37769.358333 | 1196419.0 | 1208268.0 | 0.0 | 1.0 | 0.0 | 1148318.0 |
| std | 132.152942 | 20965.386989 | 41091.540004 | 35222.822343 | 1782576.0 | 1776041.0 | 0.0 | 0.0 | 0.0 | 1788929.0 |
| min | 11.0 | 449.19 | 0.0 | 0.0 | 57661.94 | 66760.35 | 0.0 | 1.0 | 0.0 | 57661.94 |
| 25% | 144.75 | 1597.6375 | 28783.25 | 26032.195 | 159295.3 | 175303.6 | 0.0 | 1.0 | 0.0 | 90189.31 |
| 50% | 217.5 | 3096.445 | 31489.5 | 29453.945 | 446190.7 | 473509.8 | 0.0 | 1.0 | 0.0 | 372360.0 |
| 75% | 325.5 | 7644.1825 | 77329.6825 | 35960.115 | 1218794.0 | 1219776.0 | 0.0 | 1.0 | 0.0 | 1170148.0 |
| max | 350.0 | 54188.96 | 105137.0 | 104687.81 | 4688481.0 | 4691393.0 | 0.0 | 1.0 | 0.0 | 4655318.0 |


In [8]:
# Double-check that the flag was set correctly:
# 214 PAYMENT rows + 6 DEBIT rows = 220
transactions.isPayment.sum()

220

4. Likewise, a column named `isMovement` is created to show whether funds moved out of the source account. It takes the value `1` when `type` is either `CASH_OUT` or `TRANSFER`, and `0` otherwise.

In [9]:
transactions['isMovement'] = 0
transactions.loc[transactions['type'].isin(['CASH_OUT', 'TRANSFER']), 'isMovement'] = 1

In [10]:
transactions.isMovement.sum()

605
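
Both indicator columns can also be built in a single expression each. The sketch below shows this alternative on a hypothetical five-row frame (`toy`) rather than the real data; it is meant to be equivalent to the two-step `loc` assignment used above.

```python
import pandas as pd

# Toy frame with hypothetical rows; only the `type` column matters here.
toy = pd.DataFrame({"type": ["PAYMENT", "DEBIT", "CASH_OUT", "TRANSFER", "CASH_IN"]})

# isin() yields a boolean Series; astype(int) maps True/False to 1/0.
toy["isPayment"] = toy["type"].isin(["PAYMENT", "DEBIT"]).astype(int)
toy["isMovement"] = toy["type"].isin(["CASH_OUT", "TRANSFER"]).astype(int)

print(toy)
```

A `CASH_IN` row gets `0` in both columns, so the two flags are not complements of each other.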

5. Another useful signal when investigating financial fraud is the difference in balance between the origin and destination accounts. Our working theory is that a destination account whose balance differs sharply from the origin account's balance may indicate fraud. We therefore create a column named `accountDiff` holding the absolute difference between the `oldbalanceOrg` and `oldbalanceDest` columns.

In [11]:
# Create accountDiff field
transactions['accountDiff'] = abs(transactions['oldbalanceDest'] - transactions['oldbalanceOrg'])

print(transactions.columns)

Index(['step', 'type', 'amount', 'nameOrig', 'oldbalanceOrg', 'newbalanceOrig',
       'nameDest', 'oldbalanceDest', 'newbalanceDest', 'isFraud', 'isPayment',
       'isMovement', 'accountDiff'],
      dtype='object')
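
As a quick check of the calculation, the sketch below recomputes `accountDiff` for the first three rows, using balance values copied from the `head()` output earlier in the notebook. The frame name `sample` is hypothetical.

```python
import pandas as pd

# First three rows of the two balance columns, copied from the head() output.
sample = pd.DataFrame({
    "oldbalanceOrg": [0.00, 0.00, 1131750.38],
    "oldbalanceDest": [649420.67, 0.00, 313070.53],
})

# Series.abs() is the pandas-native equivalent of the built-in abs() call.
sample["accountDiff"] = (sample["oldbalanceDest"] - sample["oldbalanceOrg"]).abs()
print(sample["accountDiff"].round(2).tolist())
```

The values agree, to two decimals, with the `accountDiff` entries shown for rows 0 to 2 in the loaded file.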


#### Feature Selection, Labeling, and Train/Test Split

6. Before training the model, we need to define the feature and label columns. The label column in this dataset is `isFraud`. A variable named `features` is built from the following columns:

- `amount`
- `isPayment`
- `isMovement`
- `accountDiff`


In [12]:
# Create features and label variables
features = transactions[['amount','isPayment','isMovement','accountDiff']]
label = transactions['isFraud']

7. Split the data into training and test sets using sklearn's `train_test_split()` function.

In [13]:
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(features, 
                                                    label, 
                                                    test_size=0.3)
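
The split above is random, so results can differ from run to run. The sketch below shows one way to make it repeatable and to keep the fraud ratio equal in both parts. It runs on hypothetical random data, and the `random_state=42` and `stratify` settings are assumptions for illustration, not settings used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1000 rows, 4 features, 282 positives (as in this dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.array([1] * 282 + [0] * 718)

# random_state fixes the shuffle; stratify=y preserves the class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(X_tr.shape, X_te.shape)
print("train fraud rate:", y_tr.mean(), "test fraud rate:", y_te.mean())
```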

#### Data Normalization

8. Because sklearn's `LogisticRegression` applies regularization by default, the feature data should be scaled so that the penalty treats all coefficients comparably. This is done with a `StandardScaler` object: `.fit_transform()` is applied to the training features, and `.transform()` is then applied to the test features.

In [14]:
# Normalize the features variables
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
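
To illustrate why the test set only gets `.transform()`, the sketch below scales a hypothetical one-column dataset. The scaler learns its mean and standard deviation from the training rows alone and reuses them for the test row. The names `demo_scaler`, `train`, and `test` are made up for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data, for illustration only.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = demo_scaler.transform(test)        # reuses the training mean and std

print(demo_scaler.mean_, demo_scaler.scale_)
print(test_scaled)
```

The test value is expressed in units of the training distribution, so it can fall well outside the range seen during training.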

9. A `LogisticRegression` model is created with sklearn and fitted to the training data using `.fit()`. The classification threshold is left at its default value of 0.5.

In [15]:
# Fit the model to the training data
model = LogisticRegression()
fraud_lr = model.fit(X_train, y_train)

10. The model's `.score()` method is run on the training and test data, and the respective scores are displayed. The test-set predictions are also saved so that a confusion matrix and F1 score can be computed.

When scoring, the trained model classifies each transaction as fraudulent or not and compares the result with the true label. The resulting score is the accuracy, that is, the fraction of correct classifications.

In [17]:
# Save the predicted outcomes for the test set
y_pred = fraud_lr.predict(X_test)

In [19]:
# Print out the confusion matrix
from sklearn.metrics import confusion_matrix
print('confusion matrix: ')
print(confusion_matrix(y_test, y_pred))

# Print F1 score here:
from sklearn.metrics import f1_score
print('f1 score :', f1_score(y_test, y_pred))


confusion matrix: 
[[201   0]
 [ 50  49]]
f1 score : 0.6621621621621622
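
The F1 score can be reproduced by hand from the confusion matrix printed above. The sketch below does the arithmetic with NumPy, assuming sklearn's usual layout of `[[TN, FP], [FN, TP]]` for binary labels 0 and 1.

```python
import numpy as np

# Confusion matrix printed above; sklearn's layout is [[TN, FP], [FN, TP]].
cm = np.array([[201, 0], [50, 49]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # 49 / 49
recall = tp / (tp + fn)      # 49 / 99
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.4f}")
```

Precision is perfect because there are no false positives, but the model misses about half of the fraudulent transactions, which pulls F1 down.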


In [15]:
# Score the model on the training and test data
print('Train accuracy :', fraud_lr.score(X_train, y_train))
print('Test accuracy :', fraud_lr.score(X_test, y_test))

Train accuracy : 0.8428571428571429
Test accuracy : 0.8433333333333334
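
Accuracy is easier to judge against a baseline that always predicts the majority class. The sketch below computes that baseline with sklearn's `DummyClassifier`, assuming a test set with the class counts implied by the confusion matrix above (201 legitimate, 99 fraudulent). The labels and placeholder features are reconstructed for illustration, not taken from the real split.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels reconstructed from the confusion matrix: 201 legitimate, 99 fraudulent.
y_true = np.array([0] * 201 + [1] * 99)
X_dummy = np.zeros((len(y_true), 1))  # features are ignored by this strategy

# "most_frequent" always predicts the majority class (here: not fraud).
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
baseline_acc = baseline.score(X_dummy, y_true)
print("Majority-class accuracy:", baseline_acc)
```

Under that assumption, a model that never flags fraud would already reach 67% accuracy, so the accuracy figures above are best read alongside recall and F1.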


11. The model's coefficients are printed to assess how much each feature column contributes to the prediction. Because the features were standardized, the coefficient magnitudes can be compared to see which feature matters most and which matters least.

In [16]:
# Print the model coefficients
fraud_lr.coef_

array([[ 2.30897274, -0.63733573,  2.08183223, -0.9664791 ]])

In [37]:
features.columns

Index(['amount', 'isPayment', 'isMovement', 'accountDiff'], dtype='object')

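Ranked by absolute coefficient magnitude, from most to least impactful: `amount`, `isMovement`, `accountDiff`, `isPayment`. The positive coefficients (`amount`, `isMovement`) push a prediction toward fraud, and the negative ones (`accountDiff`, `isPayment`) push it away.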
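
A ranking by importance should compare absolute values, since a large negative coefficient matters as much as a large positive one. The sketch below pairs the coefficients with the column names, both copied from the outputs above, and sorts them by magnitude. The variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Coefficients and column names copied from the outputs above.
coef = np.array([2.30897274, -0.63733573, 2.08183223, -0.9664791])
names = ["amount", "isPayment", "isMovement", "accountDiff"]

# Sort by absolute value: the sign gives direction, the magnitude gives strength.
ranking = pd.Series(coef, index=names).sort_values(key=np.abs, ascending=False)
print(ranking)
```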

#### Model Prediction

12. The model is now applied to additional transactions that have passed through our systems. Four made-up numpy arrays, under the comment "New transaction data", hold the feature values of the new sample transactions in the order `amount`, `isPayment`, `isMovement`, `accountDiff`.

In [17]:
# New transaction data
transaction1 = np.array([123456.78, 0.0, 1.0, 54670.1])
transaction2 = np.array([98765.43, 1.0, 0.0, 8524.75])
transaction3 = np.array([543678.31, 1.0, 0.0, 510025.5])
transaction4 = np.array([28178, 0, 0, 254])

# Combine new transactions into a single array
sample_transactions = np.stack((transaction1, transaction2, transaction3, transaction4))
sample_transactions

13. Because this Logistic Regression model was trained on scaled feature data, the feature data used for prediction must be scaled in the same way. The `.transform()` method of the fitted `StandardScaler` object is applied to the `sample_transactions` array.

In [24]:
# Normalize the new transactions
new_transactions_scaled = scaler.transform(sample_transactions)



14. Fraudulent transactions are identified by calling the model's `.predict()` method on the `new_transactions_scaled` array, and the results are printed.

To view the probabilities behind these predictions, the model's `.predict_proba()` method is used. The first column is the probability that a transaction is not fraudulent, and the second column is the probability that it is fraudulent; the model bases its classification on these values. With the threshold at 0.5, none of the new transactions is classified as fraudulent.

In [28]:
# Predict fraud on the new transactions
print(model.predict(new_transactions_scaled))
# Show probabilities on the new transactions
print(model.predict_proba(new_transactions_scaled))

[0 0 0 0]
[[0.60792627 0.39207373]
 [0.99803366 0.00196634]
 [0.99644741 0.00355259]
 [0.99201883 0.00798117]]
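
The 0.5 cutoff is only the default. The sketch below applies a different threshold to the fraud probabilities, which are copied from the `predict_proba` output above. The value `0.35` is a hypothetical choice for illustration, not a recommended setting.

```python
import numpy as np

# Probabilities copied from the predict_proba output (columns: not fraud, fraud).
proba = np.array([
    [0.60792627, 0.39207373],
    [0.99803366, 0.00196634],
    [0.99644741, 0.00355259],
    [0.99201883, 0.00798117],
])

threshold = 0.35  # hypothetical value, chosen only for illustration
flags = (proba[:, 1] >= threshold).astype(int)
print(flags)
```

Lowering the threshold flags the first transaction, trading more potential false positives for fewer missed fraud cases.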
