# Credit Fraud Classification Tutorial

### 1. Basic data manipulation with pandas
### 2. Creating a classification model using TensorFlow and Keras
### 3. Addressing class imbalance using class weights
### 4. Evaluating the model with the F1 score

## Import Packages

In [1]:
import numpy as np #linear algebra and array manipulation
import pandas as pd #DataFrame, data manipulation
import tensorflow as tf #Deep Learning Framework
import tensorflow.keras as keras #built on top of tensorflow
import tensorflow.keras.layers as tfl #Neural Network layers

# Data Processing

## Load Data
In the right sidebar there is an Input section containing all of the data needed. Expand the input and copy the file path so you can load the data with pandas.

In [2]:
#copy file path
input_file_path = '/kaggle/input/creditcardfraud/creditcard.csv'

#use read_csv command to load the data into a DataFrame
credit_data = pd.read_csv(input_file_path)

In a new cell, you can display the DataFrame just by putting its name on the last line.

In [3]:
credit_data

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.166480,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.167170,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.379780,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.108300,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.50,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.206010,0.502292,0.219422,0.215153,69.99,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284802,172786.0,-11.881118,10.071785,-9.834783,-2.066656,-5.364473,-2.606837,-4.918215,7.305334,1.914428,...,0.213454,0.111864,1.014480,-0.509348,1.436807,0.250034,0.943651,0.823731,0.77,0
284803,172787.0,-0.732789,-0.055080,2.035030,-0.738589,0.868229,1.058415,0.024330,0.294869,0.584800,...,0.214205,0.924384,0.012463,-1.016226,-0.606624,-0.395255,0.068472,-0.053527,24.79,0
284804,172788.0,1.919565,-0.301254,-3.249640,-0.557828,2.630515,3.031260,-0.296827,0.708417,0.432454,...,0.232045,0.578229,-0.037501,0.640134,0.265745,-0.087371,0.004455,-0.026561,67.88,0
284805,172788.0,-0.240440,0.530483,0.702510,0.689799,-0.377961,0.623708,-0.686180,0.679145,0.392087,...,0.265245,0.800049,-0.163298,0.123205,-0.569159,0.546668,0.108821,0.104533,10.00,0


## DataFrames Operations
DataFrames have basic properties that are useful to know about: the DataFrame's shape, basic info about its columns, and a statistical summary of its columns. You don't need to memorize these; just know they exist and look up anything you don't know.

In [4]:
credit_data.shape

(284807, 31)

In [5]:
credit_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

In [6]:
credit_data.describe()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,...,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0
mean,94813.859575,1.168375e-15,3.416908e-16,-1.379537e-15,2.074095e-15,9.604066e-16,1.487313e-15,-5.556467e-16,1.213481e-16,-2.406331e-15,...,1.654067e-16,-3.568593e-16,2.578648e-16,4.473266e-15,5.340915e-16,1.683437e-15,-3.660091e-16,-1.22739e-16,88.349619,0.001727
std,47488.145955,1.958696,1.651309,1.516255,1.415869,1.380247,1.332271,1.237094,1.194353,1.098632,...,0.734524,0.7257016,0.6244603,0.6056471,0.5212781,0.482227,0.4036325,0.3300833,250.120109,0.041527
min,0.0,-56.40751,-72.71573,-48.32559,-5.683171,-113.7433,-26.16051,-43.55724,-73.21672,-13.43407,...,-34.83038,-10.93314,-44.80774,-2.836627,-10.2954,-2.604551,-22.56568,-15.43008,0.0,0.0
25%,54201.5,-0.9203734,-0.5985499,-0.8903648,-0.8486401,-0.6915971,-0.7682956,-0.5540759,-0.2086297,-0.6430976,...,-0.2283949,-0.5423504,-0.1618463,-0.3545861,-0.3171451,-0.3269839,-0.07083953,-0.05295979,5.6,0.0
50%,84692.0,0.0181088,0.06548556,0.1798463,-0.01984653,-0.05433583,-0.2741871,0.04010308,0.02235804,-0.05142873,...,-0.02945017,0.006781943,-0.01119293,0.04097606,0.0165935,-0.05213911,0.001342146,0.01124383,22.0,0.0
75%,139320.5,1.315642,0.8037239,1.027196,0.7433413,0.6119264,0.3985649,0.5704361,0.3273459,0.597139,...,0.1863772,0.5285536,0.1476421,0.4395266,0.3507156,0.2409522,0.09104512,0.07827995,77.165,0.0
max,172792.0,2.45493,22.05773,9.382558,16.87534,34.80167,73.30163,120.5895,20.00721,15.59499,...,27.20284,10.50309,22.52841,4.584549,7.519589,3.517346,31.6122,33.84781,25691.16,1.0


## Split data into X and y

Using .iloc treats the DataFrame like an array, and putting : selects an entire dimension, so .iloc[:,:] includes the entire DataFrame. Here the first 30 columns are the features (X) and the last column, Class, is the label (y).

In [7]:
X = credit_data.iloc[:,:30]
y = credit_data.iloc[:,30]
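
To make the indexing concrete, here is a minimal sketch on a tiny made-up DataFrame (the column names and values are hypothetical, not taken from the credit card data). It shows that `.iloc[:, :2]` keeps every row and the first two columns, while `.iloc[:, 2]` returns the third column as a Series.

```python
import pandas as pd

# Hypothetical toy data: two feature columns and one label column
toy = pd.DataFrame({
    'feature_a': [0.1, 0.2, 0.3],
    'feature_b': [1.0, 2.0, 3.0],
    'label': [0, 1, 0],
})

X_toy = toy.iloc[:, :2]  # all rows, columns 0 and 1
y_toy = toy.iloc[:, 2]   # all rows, column 2 only (a Series)

print(X_toy.shape)       # (3, 2)
print(y_toy.tolist())    # [0, 1, 0]
```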

# Create Model

## Make a Model Function
Making a function makes it easier to tune the model.

To create a model we can use the keras Sequential API. 
1. Create the sequential model and add layers. First add an Input() layer, then in this case use Dense() layers, which are normal fully connected neural network layers. For Dense layers you can specify the number of units and the activation function.
2. Compile the model. Here you specify your loss function (binary cross-entropy, 'binary_crossentropy', since this is a two-class problem) and the optimizer. Use 'adam', which is a gradient descent algorithm that is tweaked to improve convergence; don't worry about the details for now. You can also specify metrics such as 'accuracy'.

In [8]:
def create_model():
    
    model = keras.Sequential([
        tfl.Input(shape = (30,)),
        tfl.Dense(units = 512, activation = 'sigmoid'),
        tfl.Dense(units = 512, activation = 'sigmoid'),
        tfl.Dense(units = 1, activation = 'sigmoid')
    ])
    
    model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
    
    return model

## View Model Summary
The number of parameters is usually a good indicator of your model's complexity and how long it will take to train. It is worth checking, but initially it's usually better to pay attention to the general architecture (like the choice of activation function).

In [9]:
model = create_model()
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 512)               15872     
                                                                 
 dense_1 (Dense)             (None, 512)               262656    
                                                                 
 dense_2 (Dense)             (None, 1)                 513       
                                                                 
Total params: 279,041
Trainable params: 279,041
Non-trainable params: 0
_________________________________________________________________
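
The parameter counts in the summary can be checked by hand: a Dense layer has one weight for every input and unit pair, plus one bias per unit. The sketch below recomputes the counts shown above; `dense_params` is a helper name made up for this illustration.

```python
def dense_params(n_inputs, n_units):
    # One weight for every (input, unit) pair plus one bias per unit
    return n_inputs * n_units + n_units

layer_sizes = [30, 512, 512, 1]  # input features, then units per Dense layer
counts = [dense_params(n_in, n_out)
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

print(counts)       # [15872, 262656, 513]
print(sum(counts))  # 279041
```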


# Iteratively Train Model

## Fitting the Model
model.fit trains the model by going through the entire dataset and performing gradient descent. You can specify the number of epochs, which is the number of passes through the training data.

In [10]:
history = model.fit(X,y, epochs = 1)



## Evaluating The Model
The model looks really accurate; let's see how often it can recognize fraud.

To use only the fraudulent cases, we need to find where in the DataFrame (at which indexes) the Class is 1. A series of boolean values that tells you whether each index meets some condition is called a mask. We can apply the mask to the DataFrame to get a DataFrame of only the fraudulent cases.

In [11]:
fraud_mask = credit_data['Class'] == 1 #Set a condition, series of boolean values
fraud_data = credit_data[fraud_mask] #Get only fraud cases

#Split into X and y
X_fraud = fraud_data.iloc[:,:30] 
y_fraud = fraud_data.iloc[:,30]
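
The same masking pattern on a tiny made-up DataFrame may make the mechanics easier to see. The values here are hypothetical; only the `Class` column name mirrors the real data.

```python
import pandas as pd

# Hypothetical toy data with a 'Class' column like the real dataset
toy = pd.DataFrame({'Amount': [10.0, 250.0, 5.0, 99.0],
                    'Class': [0, 1, 0, 1]})

mask = toy['Class'] == 1   # boolean Series, one value per row
fraud_rows = toy[mask]     # keeps only the rows where the mask is True

print(mask.tolist())                  # [False, True, False, True]
print(fraud_rows.index.tolist())      # [1, 3] (original row labels are kept)
print(fraud_rows['Amount'].tolist())  # [250.0, 99.0]
```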

Use model.evaluate to get the loss and other metrics (accuracy) on the data.

In [12]:
model.evaluate(X_fraud,y_fraud)



[7.2602386474609375, 0.0]

0 percent accuracy. What's going on? Let's look at the model's predictions.

In [13]:
model.predict(X_fraud[:10])



array([[0.00070294],
       [0.00070288],
       [0.00070294],
       [0.00070294],
       [0.00070294],
       [0.00070294],
       [0.00070294],
       [0.00070294],
       [0.00070294],
       [0.00070294]], dtype=float32)

It looks like the model always predicts that the case is not fraudulent, so it only seems accurate because there is a major class imbalance.
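
Two quick checks make this concrete. The `describe()` output above gives a mean of 0.001727 for `Class`, so a model that always answers "not fraud" is already about 99.8% accurate. The loss reported on the fraud-only data is also consistent with the model outputting roughly 0.0007 for every fraudulent case, since binary cross-entropy for a positive label is the negative log of the predicted probability. A minimal sketch using only those two numbers from the outputs above:

```python
import math

fraud_rate = 0.001727        # mean of the Class column from describe()
predicted_prob = 0.00070294  # typical model output on the fraud cases

# Accuracy of a model that always predicts "not fraud"
always_negative_accuracy = 1 - fraud_rate
print(round(always_negative_accuracy, 4))  # 0.9983

# Binary cross-entropy for a true label of 1 is -log(p)
loss_on_fraud = -math.log(predicted_prob)
print(round(loss_on_fraud, 2))  # 7.26
```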

## Addressing Class Imbalance
Class weights are a common method of addressing imbalance. They weight the underrepresented class more heavily in the loss function.

In [14]:
class_totals = [len(y)-len(y_fraud),len(y_fraud)]
class_weight = {i:len(y)/(2*total) for i,total in enumerate(class_totals)}
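
To see what this formula does, here is a minimal sketch on made-up counts (1,000 samples with 10 positives; these numbers are hypothetical). Each class's weight is inversely proportional to its size, so both classes end up contributing the same total weight to the loss.

```python
n_samples = 1000                 # hypothetical dataset size
toy_class_totals = [990, 10]     # [negatives, positives]

# Same formula as above: total samples / (number of classes * class size)
toy_class_weight = {i: n_samples / (2 * total)
                    for i, total in enumerate(toy_class_totals)}
print(toy_class_weight[1])  # 50.0

# Total weight carried by each class: class size * class weight
total_weight = [total * toy_class_weight[i]
                for i, total in enumerate(toy_class_totals)]
print([round(w, 6) for w in total_weight])  # [500.0, 500.0]
```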

It seems like our model isn't able to learn how to detect fraudulent behavior. Let's rebuild it with a better activation function (ReLU). We will also use the AUC metric, which accounts for class imbalance. AUC will be between 0 and 1, and higher is better. (In actual code you wouldn't recreate the model function, you would just edit the original.)

In [15]:
AUC = keras.metrics.AUC(curve = 'PR') # get the auc metric, don't worry about what it is only its meaning for now

def create_model():
    
    model = keras.Sequential([
        tfl.Input(shape = (30,)),
        #When using relu, specify the kernel initializer to be 'he_uniform'
        tfl.Dense(64, activation = 'relu', kernel_initializer = 'he_uniform'),
        tfl.Dense(64, activation = 'relu', kernel_initializer = 'he_uniform'),
        tfl.Dense(1, activation = 'sigmoid')
    ])
    
    model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy',AUC]) #Use better metric AUC 
    
    return model

In [16]:
model = create_model() #creating the model again
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_3 (Dense)             (None, 64)                1984      
                                                                 
 dense_4 (Dense)             (None, 64)                4160      
                                                                 
 dense_5 (Dense)             (None, 1)                 65        
                                                                 
Total params: 6,209
Trainable params: 6,209
Non-trainable params: 0
_________________________________________________________________


In [17]:
model.fit(X,y,epochs = 10, class_weight = class_weight) #training with new model

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f7cf79d6350>

It looks like the loss and accuracy are jumping all over the place. This usually indicates a learning rate that's too high, so training is unable to converge.

## Update Model Function and Use Step Decay

In [18]:
def create_model(learning_rate):
    
    model = keras.Sequential([
        tfl.Input(shape = (30,)),
        tfl.Dense(64, activation = 'leaky_relu', kernel_initializer = 'he_uniform'), 
        tfl.Dense(128, activation = 'leaky_relu', kernel_initializer = 'he_uniform'), 
        tfl.Dense(256, activation = 'leaky_relu', kernel_initializer = 'he_uniform'), # Use more layers and units
        tfl.Dense(128, activation = 'leaky_relu', kernel_initializer = 'he_uniform'),
        tfl.Dense(64, activation = 'leaky_relu', kernel_initializer = 'he_uniform'),
        tfl.Dense(1, activation = 'sigmoid')
    ])
    
    optimizer = keras.optimizers.Adam(learning_rate = learning_rate) # Specify learning rate by creating adam optimizer object
    
    model.compile(loss = 'binary_crossentropy', optimizer = optimizer, metrics = ['accuracy',AUC])
    
    return model

Let's use a more complex model and also incorporate step decay, which decreases the learning rate as training progresses so that we take smaller steps when we get to more sensitive regions.
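
As a quick check of what the schedule in the next cell does, the sketch below evaluates the same expression at a few epochs. Keras passes zero-based epoch indices to the scheduler, so the learning rate starts at 1e-3 and drops by a factor of 10 every 10 epochs. The name `step_decay_fn` is made up for this illustration.

```python
def step_decay_fn(epoch):
    # Start at 1e-3 and divide by 10 after every 10 epochs
    return 10 ** (-int(epoch / 10) - 3)

for epoch in (0, 9, 10, 20, 39):
    print(epoch, step_decay_fn(epoch))
```

Because the scheduler sets the learning rate at the start of every epoch, the value passed to `create_model` acts only as a placeholder in this setup.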

In [19]:
step_decay = keras.callbacks.LearningRateScheduler(
    lambda epoch: 10**(- int(epoch/10) - 3)
)
model = create_model(1e-3)
model.fit(X,y,epochs = 40, callbacks = [step_decay])

Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40


<keras.callbacks.History at 0x7f7cd8707ee0>

It seems like the model accurately understands how to classify cases as fraudulent or non-fraudulent. Let's test how good our final model is.

# Model Evaluation

## Precision and Recall

So far we know that accuracy isn't always reliable when there is a major class imbalance, because the model can just guess the majority class. Using AUC instead helps us see how well our model can actually classify both classes reliably.

When we tested the model on only fraudulent cases we saw that it did extremely poorly. Instead of looking at the model's accuracy on the entire dataset, we can compute the accuracy on only the positive cases, which is called recall. However, recall can also be misleading if our model always guesses that the case is fraudulent. Therefore we can also look at only the cases where the model predicts a positive and compute the accuracy there, which is called precision.

Both precision and recall are valuable. In medicine, recall is usually valued more, because even if precision is a bit lower it's urgent to identify most positive cases. However, if we want to shoot an airplane down depending on whether it's from the right military, we would want to consider precision more.

$$
    recall = \frac{T_p}{T_p + F_n}
$$
<br>
$$
    precision = \frac{T_p}{T_p + F_p}
$$
![image.png](attachment:fffd6a4f-9a28-4592-88d5-756be26ea7e6.png)
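
The two formulas can be tried on a small made-up example. The sketch below uses hypothetical labels and a helper named `precision_recall` (not part of any library) to show the two failure modes described above: always predicting positive gives perfect recall but poor precision, and flagging very few cases gives the opposite.

```python
def precision_recall(y_true, y_pred):
    # Count true positives, false positives, and false negatives
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

y_true = [1, 1, 1, 0, 0, 0, 0, 0]  # hypothetical labels

# Always predicting positive: perfect recall, poor precision
print(precision_recall(y_true, [1] * 8))  # (0.375, 1.0)

# A cautious model that flags one case: perfect precision, poor recall
print(precision_recall(y_true, [1, 0, 0, 0, 0, 0, 0, 0]))
```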

## AUC

Since our model outputs probabilities instead of 1s and 0s, we need to set a threshold value, which means our accuracy, precision, and recall can change depending on the choice of threshold. A threshold of 0.5 isn't always the best choice. To understand the general effectiveness of the model in terms of precision and recall, we can plot their relationship across thresholds varying from 0 to 1. Generally high precision and recall means the curve is pushed towards the top right. We can measure this with the area underneath the curve, which is called the PR AUC (precision-recall area under the curve).

![image.png](attachment:f3938d9c-0c38-419c-bbe0-3631aaf9ed90.png)
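
The sketch below shows the idea on a handful of made-up scores: as the threshold rises, precision goes up while recall goes down. The scores, labels, and helper name are hypothetical. In practice a library routine such as scikit-learn's `precision_recall_curve` would be the usual way to do this sweep.

```python
def precision_recall_at(y_true, scores, threshold):
    # Turn probabilities into hard predictions, then count outcomes
    y_pred = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Convention assumed here: precision is 1.0 when nothing is flagged
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

y_true = [0, 0, 1, 0, 1, 1]               # hypothetical labels
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]   # hypothetical model outputs

for threshold in (0.2, 0.5, 0.8):
    print(threshold, precision_recall_at(y_true, scores, threshold))
```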

## F Score

Now that we have the trained model we don't need to use AUC anymore because we can just determine the best threshold. However, it's difficult to determine the best threshold by viewing precision and recall individually, so we can combine them into a single metric called the F beta score. The F beta score works like AUC and accuracy in the sense that values are constrained between 0 and 1 and higher is better. Given the precision and recall at a chosen threshold, the F beta score returns a single value, so to determine the best threshold we just need to find the one that maximizes it. This metric also requires a choice of beta: a higher beta gives more importance to recall and a lower beta gives more importance to precision, with a default value of 1 for equal importance. <br>
![image.png](attachment:15cbc467-dc74-4794-8101-afd1d4889acc.png)
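
Written out, the standard definition of the F beta score is:

$$
    F_\beta = \frac{(1 + \beta^2) \cdot precision \cdot recall}{\beta^2 \cdot precision + recall}
$$

The sketch below implements this formula directly so the effect of beta can be seen on one made-up pair of values. The function name `f_beta` and the numbers are hypothetical; for real evaluation the `fbeta_score` function from scikit-learn is used in the next section.

```python
def f_beta(precision, recall, beta=1.0):
    # Weighted harmonic mean of precision and recall
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom

# Hypothetical model with low precision but perfect recall
p, r = 0.5, 1.0
print(round(f_beta(p, r, beta=1), 4))    # 0.6667
print(round(f_beta(p, r, beta=2), 4))    # 0.8333 (recall counts more)
print(round(f_beta(p, r, beta=0.5), 4))  # 0.5556 (precision counts more)
```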

## Finding the Best Threshold

In [20]:
from sklearn.metrics import fbeta_score
best_threshold = -1
best_F1 = 0

#predict once outside the loop, since the probabilities don't depend on the threshold
#batch_size = len(y) means it'll do the prediction on the entire dataset at once, verbose = 0 means no progress bar
y_prob = model.predict(X, batch_size = len(y), verbose = 0).ravel() #flatten shape (n,1) to (n,)

for threshold in np.linspace(0,1,101):
    
    y_pred = y_prob > threshold #True corresponds with 1 and False corresponds with 0
    
    F1 = fbeta_score(y, y_pred, beta = 1)
    
    if best_F1 < F1:
        best_threshold = threshold
        best_F1 = F1
        
print('best threshold:', best_threshold, 'best F1:', best_F1)

best threshold: 0.5700000000000001 best F1: 0.8294314381270903


# Summary

1. basic data manipulation with pandas .iloc[row index, column index]
2. creating a classification model
    1. using a better activation function relu with he uniform initialization
    2. adding more layers and units to increase complexity
    3. using step decay during training
3. addressing class imbalance
    1. using class weights to weight positive and negative cases equally in the loss function
    2. using AUC metric to get a reliable metric for performance
4. evaluating the model
    1. using the F1 score to find the best threshold and get the best metric for final model performance