# Predict Credit Card Fraud

## Kaggle [Synthetic Financial Dataset For Fraud Detection](https://www.kaggle.com/datasets/ealaxi/paysim1)

In [4]:
import seaborn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [5]:
# Subset of transactions with 1000 entries (rather than 200,000)
transactions = pd.read_csv("Data/transactions_modified.csv")
columns = transactions.columns.tolist()
print(F"Columns:\n\n{columns}\n\n")
print(transactions.info())
transactions.head()

Columns:

['step', 'type', 'amount', 'nameOrig', 'oldbalanceOrg', 'newbalanceOrig', 'nameDest', 'oldbalanceDest', 'newbalanceDest', 'isFraud', 'isPayment', 'isMovement', 'accountDiff']


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   step            1000 non-null   int64  
 1   type            1000 non-null   object 
 2   amount          1000 non-null   float64
 3   nameOrig        1000 non-null   object 
 4   oldbalanceOrg   1000 non-null   float64
 5   newbalanceOrig  1000 non-null   float64
 6   nameDest        1000 non-null   object 
 7   oldbalanceDest  1000 non-null   float64
 8   newbalanceDest  1000 non-null   float64
 9   isFraud         1000 non-null   int64  
 10  isPayment       1000 non-null   int64  
 11  isMovement      1000 non-null   int64  
 12  accountDiff     1000 non-null   float64
dtypes: float64(6), int64(4), object(3)

| | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isPayment | isMovement | accountDiff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 206 | CASH_OUT | 62927.08 | C473782114 | 0.0 | 0.0 | C2096898696 | 649420.67 | 712347.75 | 0 | 0 | 1 | 649420.67 |
| 1 | 380 | PAYMENT | 32851.57 | C1915112886 | 0.0 | 0.0 | M916879292 | 0.0 | 0.0 | 0 | 1 | 0 | 0.0 |
| 2 | 570 | CASH_OUT | 1131750.38 | C1396198422 | 1131750.38 | 0.0 | C1612235515 | 313070.53 | 1444820.92 | 1 | 0 | 1 | 818679.85 |
| 3 | 184 | CASH_OUT | 60519.74 | C982551468 | 60519.74 | 0.0 | C1378644910 | 54295.32 | 182654.5 | 1 | 0 | 1 | 6224.42 |
| 4 | 162 | CASH_IN | 46716.01 | C1759889425 | 7668050.6 | 7714766.61 | C2059152908 | 2125468.75 | 2078752.75 | 0 | 0 | 0 | 5542581.85 |


In [6]:
print(F"Fraudulent Transactions: {transactions.isFraud[transactions.isFraud == 1].sum()}")

Fraudulent Transactions: 282


----
### Clean the Data

In [7]:
print(F"Amount column\nSummary Statistics:\n\n{transactions['amount'].describe()}")

Amount column
Summary Statistics:

count    1.000000e+03
mean     5.373080e+05
std      1.423692e+06
min      0.000000e+00
25%      2.933705e+04
50%      1.265305e+05
75%      3.010378e+05
max      1.000000e+07
Name: amount, dtype: float64


In [8]:
# Set the flag with one .loc call; the chained form df[col][mask] = 1 triggers SettingWithCopyWarning.
# Note: CASH_OUT rows already have isMovement == 1 in head() above, so this appears to duplicate isMovement.
transactions['isPayment'] = 0
transactions.loc[transactions['type'].isin(['CASH_OUT', 'TRANSFER']), 'isPayment'] = 1
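pandas raises `SettingWithCopyWarning` when an assignment uses chained indexing of the form `df[col][mask] = value`. The sketch below shows two ways to set a 0/1 flag without chaining. The toy frame and the column names `flag_loc` and `flag_astype` are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy frame standing in for the transactions data (values are made up)
toy = pd.DataFrame({"type": ["CASH_OUT", "PAYMENT", "TRANSFER", "DEBIT"]})

# Boolean mask selecting the rows to flag
mask = toy["type"].isin(["CASH_OUT", "TRANSFER"])

# Option 1: initialise the column, then set the matching rows in one .loc call
toy["flag_loc"] = 0
toy.loc[mask, "flag_loc"] = 1

# Option 2: convert the boolean mask directly to 0/1
toy["flag_astype"] = mask.astype(int)

print(toy)
```

Both options write to the frame itself rather than to a possibly temporary intermediate object, so no warning is expected.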


In [9]:
# Signed starting-balance difference; overwrites the accountDiff column loaded from the CSV
transactions['accountDiff'] = transactions['oldbalanceOrg'] - transactions['oldbalanceDest']
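In the `head()` output, the first row has `oldbalanceOrg` 0.0 and `oldbalanceDest` 649420.67, yet the loaded `accountDiff` is positive. That suggests the stored column was an absolute difference, while the expression in this cell is signed. A minimal sketch of the two variants, using the first three rows copied from that output (the use of `np.abs` is an assumption about how the stored column was built):

```python
import numpy as np
import pandas as pd

# First three rows of the head() output (balances only)
rows = pd.DataFrame({
    "oldbalanceOrg": [0.0, 0.0, 1131750.38],
    "oldbalanceDest": [649420.67, 0.0, 313070.53],
})

# Signed difference, as computed in the cell above
signed = rows["oldbalanceOrg"] - rows["oldbalanceDest"]

# Absolute difference, which matches the values stored in the CSV for these rows
absolute = np.abs(signed)

print(pd.DataFrame({"signed": signed, "absolute": absolute}))
```

Which variant is preferable depends on whether the direction of the imbalance should carry information; the scores reported in this notebook come from the signed version.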

----
### Select and Split the Data

In [10]:
features = transactions[["amount", "isPayment", "isMovement", "accountDiff"]]
label = transactions["isFraud"]

In [11]:
X_train, X_test, y_train, y_test = train_test_split(features, label, test_size=0.3)
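The split above is unseeded, so the scores will vary from run to run. The sketch below shows the `stratify` and `random_state` parameters of `train_test_split` on made-up data with the same size and fraud count as the subset. The seed value 42 and the random features are assumptions for illustration. The scores reported in this notebook come from the unseeded split, not from this one.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 1000 rows, 4 features, 282 positive labels
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.array([1] * 282 + [0] * 718)

# stratify=y keeps the fraud proportion similar in both parts;
# random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())
```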

#### Normalise the Data

In [12]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
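The scaler is fitted on the training data only and then reused for the test data. A small sketch, on made-up numbers, of what `StandardScaler` computes: each column is shifted by the training mean and divided by the training standard deviation (population form, `ddof=0`). The name `scaler_demo` is hypothetical, chosen to avoid clashing with the notebook's `scaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test values, two features
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
test = np.array([[4.0, 30.0]])

# Fit on the training data only
scaler_demo = StandardScaler().fit(train)

# Manual equivalent: (x - training mean) / training std
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(scaler_demo.transform(test))
print(manual)
```

Reusing the training statistics keeps information about the test set out of the preprocessing step.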

In [13]:
# Fit the model to the training data
model = LogisticRegression()
model.fit(X_train, y_train)

LogisticRegression()

In [14]:
print(F"Training score: {model.score(X_train, y_train)}")

Training score: 0.8585714285714285


In [15]:
print(F"Test score: {model.score(X_test, y_test)}")

Test score: 0.8433333333333334


In [16]:
# Model coefficients
print(F"Model coefficients:\n{model.coef_}")

Model coefficients:
[[2.16734789 1.68383637 1.68383637 1.34600829]]


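These correspond, in order, to `amount`, `isPayment`, `isMovement` and `accountDiff`. The `isPayment` and `isMovement` coefficients are identical, most likely because cell In [8] sets `isPayment` with the `CASH_OUT`/`TRANSFER` filter, which appears to make the two columns duplicates of each other.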
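Pairing the coefficients with their feature names makes the output easier to read. The sketch below copies the printed values by hand; in the notebook the same table could presumably be built from `model.coef_[0]`. The name `coef_table` is hypothetical.

```python
import pandas as pd

feature_names = ["amount", "isPayment", "isMovement", "accountDiff"]

# Values copied from the printed model coefficients
coefficients = [2.16734789, 1.68383637, 1.68383637, 1.34600829]

# Label each coefficient and sort from largest to smallest
coef_table = pd.Series(coefficients, index=feature_names).sort_values(ascending=False)
print(coef_table)
```

Because the features were standardised before fitting, each coefficient is the change in log-odds per one standard deviation of that feature, which makes the magnitudes roughly comparable.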

----
### Predict with the Model

In [17]:
# New transaction data
transaction1 = np.array([123456.78, 0.0, 1.0, 54670.1])
transaction2 = np.array([98765.43, 1.0, 0.0, 8524.75])
transaction3 = np.array([543678.31, 1.0, 0.0, 510025.5])
mytransaction = np.array([1565743.21, 0.0, 1.0, 14362432.31])

# Combine in single Numpy array
sample_transactions = np.stack((transaction1, transaction2, transaction3, mytransaction))
print(sample_transactions)

[[1.23456780e+05 0.00000000e+00 1.00000000e+00 5.46701000e+04]
 [9.87654300e+04 1.00000000e+00 0.00000000e+00 8.52475000e+03]
 [5.43678310e+05 1.00000000e+00 0.00000000e+00 5.10025500e+05]
 [1.56574321e+06 0.00000000e+00 1.00000000e+00 1.43624323e+07]]


In [18]:
# Scale the feature data
sample_transactions = scaler.transform(sample_transactions)
print(sample_transactions)

[[-2.92828464e-01 -1.21387736e+00  8.23806454e-01  1.02181885e-02]
 [-3.09126694e-01  8.23806454e-01 -1.21387736e+00 -9.08661714e-05]
 [-1.54492502e-02  8.23806454e-01 -1.21387736e+00  1.11946395e-01]
 [ 6.59193856e-01 -1.21387736e+00  8.23806454e-01  3.20662915e+00]]




In [19]:
fraud_prediction = model.predict(sample_transactions)
print(fraud_prediction)

[0 0 0 1]


The model flags my made-up transaction (the last row) as fraudulent.
Why?

In [20]:
probabilities = model.predict_proba(sample_transactions)
print(probabilities)

[[0.97810733 0.02189267]
 [0.97913645 0.02086355]
 [0.95527059 0.04472941]
 [0.07133959 0.92866041]]


The first column is the probability that a transaction is NOT fraudulent.

The second column is the probability that a transaction is fraudulent.
- The model gives mine a 93% probability of being fraud!
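For a binary `LogisticRegression` with default settings, the fraud probability is the logistic (sigmoid) function of a weighted sum of the scaled features, and `predict` returns 1 when that probability exceeds 0.5. The sketch below checks this on made-up data; the data, the names `clf`, `z`, `p_manual` and `p_sklearn` are all assumptions for illustration and not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up, roughly separable data with two features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

clf = LogisticRegression().fit(X, y)

# Linear score: weighted sum of features plus intercept
z = X @ clf.coef_[0] + clf.intercept_[0]

# Sigmoid turns the score into a probability of class 1
p_manual = 1.0 / (1.0 + np.exp(-z))
p_sklearn = clf.predict_proba(X)[:, 1]

print(np.allclose(p_manual, p_sklearn))
print(np.array_equal(clf.predict(X), (p_sklearn > 0.5).astype(int)))
```

Applied to the notebook's model, this means large positive scaled values on features with positive coefficients, such as the scaled `accountDiff` of about 3.2 in the last sample row, push the score and therefore the probability upwards.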

----
# Next step, `transactions.csv` with 199,999 rows of data ...

In [21]:
# Full dataset with 199,999 entries (rather than the 1000-entry subset)
transactions_real = pd.read_csv("Data/transactions.csv")
columns2 = transactions_real.columns.tolist()
print(F"Columns:\n\n{columns2}\n\n")
print(transactions_real.info())
transactions_real.head()

Columns:

['step', 'type', 'amount', 'nameOrig', 'oldbalanceOrg', 'newbalanceOrig', 'nameDest', 'oldbalanceDest', 'newbalanceDest', 'isFraud']


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 199999 entries, 0 to 199998
Data columns (total 10 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   step            199999 non-null  int64  
 1   type            199999 non-null  object 
 2   amount          199999 non-null  float64
 3   nameOrig        199999 non-null  object 
 4   oldbalanceOrg   199999 non-null  float64
 5   newbalanceOrig  199999 non-null  float64
 6   nameDest        199999 non-null  object 
 7   oldbalanceDest  199999 non-null  float64
 8   newbalanceDest  199999 non-null  float64
 9   isFraud         199999 non-null  int64  
dtypes: float64(5), int64(2), object(3)
memory usage: 15.3+ MB
None


| | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | CASH_OUT | 158007.12 | C424875646 | 0.0 | 0.0 | C1298177219 | 474016.32 | 1618631.97 | 0 |
| 1 | 236 | CASH_OUT | 457948.3 | C1342616552 | 0.0 | 0.0 | C1323169990 | 2720411.37 | 3178359.67 | 0 |
| 2 | 37 | CASH_IN | 153602.99 | C900876541 | 11160428.67 | 11314031.67 | C608741097 | 3274930.56 | 3121327.56 | 0 |
| 3 | 331 | CASH_OUT | 49555.14 | C177696810 | 10865.0 | 0.0 | C462716348 | 0.0 | 49555.14 | 0 |
| 4 | 250 | CASH_OUT | 29648.02 | C788941490 | 0.0 | 0.0 | C1971700992 | 56933.09 | 86581.1 | 0 |


In [22]:
fraud = transactions_real.isFraud[transactions_real.isFraud == 1].sum()
print(F"Full dataset\nFraudulent Transactions: {fraud}")
print(F"This is {round(fraud/len(transactions_real) * 100, 6)}%")

Full dataset
Fraudulent Transactions: 282
This is 0.141001%


In [23]:
print(F"In the modified dataset, {round((transactions.isFraud[transactions.isFraud == 1].sum())/len(transactions) * 100, 3)}% of transactions were considered fraudulent")

In the modified dataset, 28.2% of transactions were considered fraudulent
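The two fraud rates matter when reading accuracy scores. A rule that labels every transaction "not fraud" already reaches an accuracy of one minus the fraud rate. The arithmetic below uses only the counts printed in this notebook; the variable names are hypothetical.

```python
# Counts taken from the notebook outputs
n_full, n_subset, n_fraud = 199_999, 1_000, 282

# Share of fraudulent transactions in each dataset
full_rate = n_fraud / n_full
subset_rate = n_fraud / n_subset

# Accuracy of a trivial rule that labels every transaction "not fraud"
baseline_full = 1 - full_rate
baseline_subset = 1 - subset_rate

print(f"Fraud rate, full: {full_rate:.4%}  subset: {subset_rate:.1%}")
print(f"All-negative baseline, full: {baseline_full:.4%}  subset: {baseline_subset:.1%}")
```

On the subset, the test score of about 0.843 should therefore be compared with a baseline of 0.718. On the full dataset the baseline is far higher, so accuracy alone would say little about how well fraud is detected there.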
