# Challenge

Using this credit card fraud dataset, develop an algorithm to predict fraud. Prioritize correctly finding fraud rather than correctly labeling non-fraudulent transactions.

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
from sklearn.metrics import recall_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn import ensemble
%matplotlib inline

Import the data and look at the columns.

In [5]:
raw = pd.read_csv('creditcard.csv')
print(raw.head())

   Time        V1        V2        V3        V4        V5        V6        V7  \
0   0.0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599   
1   0.0  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803   
2   1.0 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461   
3   1.0 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609   
4   2.0 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941   

         V8        V9  ...         V21       V22       V23       V24  \
0  0.098698  0.363787  ...   -0.018307  0.277838 -0.110474  0.066928   
1  0.085102 -0.255425  ...   -0.225775 -0.638672  0.101288 -0.339846   
2  0.247676 -1.514654  ...    0.247998  0.771679  0.909412 -0.689281   
3  0.377436 -1.387024  ...   -0.108300  0.005274 -0.190321 -1.175575   
4 -0.270533  0.817739  ...   -0.009431  0.798278 -0.137458  0.141267   

        V25       V26       V27       V28  Amount  Class  
0  0.128539 -0.189115

Let's see what kind of data we are working with, and how many are in each Class.

In [3]:
print(raw['Class'].value_counts())

0    284315
1       492
Name: Class, dtype: int64


We are working with serious class imbalance here: only 492 of the 284,807 transactions (about 0.17%) are fraudulent. This might make it difficult to classify the fraudulent transactions.
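To make the imbalance concrete, here is a small sketch that uses only the two counts printed above. The "always legitimate" model is a hypothetical baseline, not something fitted in this notebook; it shows why plain accuracy is a poor yardstick here.

```python
# Class counts as printed by raw['Class'].value_counts() above.
n_legit = 284315
n_fraud = 492
n_total = n_legit + n_fraud

fraud_rate = n_fraud / n_total

# Hypothetical baseline that labels every transaction as legitimate.
baseline_accuracy = n_legit / n_total
baseline_fraud_recall = 0 / n_fraud  # it never catches a single fraud

print(f'Fraud rate: {fraud_rate:.4%}')
print(f'Always-legitimate accuracy: {baseline_accuracy:.4%}')
print(f'Always-legitimate fraud recall: {baseline_fraud_recall:.1%}')
```

A model can therefore score above 99.8% accuracy while finding no fraud at all.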

Convert all to numeric and define X features and Y output.

In [6]:
raw = raw.apply(pd.to_numeric, errors="coerce")

In [7]:
X = raw.drop('Class', axis=1)
Y = raw['Class']

We want to maximize the number of true positives, aka the number of accurately predicted fraudulent transactions. We should also minimize the number of false negatives, or the number of fraudulent transactions that we missed. This means we should be measuring recall: TP / (TP + FN).
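As a quick illustration of the recall formula, here is a minimal helper. The function name and the counts are made up for the example; they do not come from the dataset.

```python
def recall(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN): the share of actual positives that were found."""
    actual_positives = tp + fn
    if actual_positives == 0:
        raise ValueError('recall is undefined when there are no actual positives')
    return tp / actual_positives

# Made-up counts for illustration only: 80 frauds caught, 20 missed.
example = recall(tp=80, fn=20)
print(example)

# Without the parentheses the expression means (TP / TP) + FN, which is not a rate.
wrong = 80 / 80 + 20
print(wrong)
```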

Let's split our data into training and validation groups.

In [8]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=77)

In [7]:
print(Y_train.value_counts())
print(Y_test.value_counts())

0    227451
1       394
Name: Class, dtype: int64
0    56864
1       98
Name: Class, dtype: int64


Both classes are represented in roughly the same proportion in each split (about 0.17% fraud in both).
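The split above came out proportional (394 / 227,845 and 98 / 56,962 are both about 0.17%), but with so few fraud cases that is not guaranteed for every random seed. A minimal sketch of how the `stratify` argument of `train_test_split` enforces it, using synthetic stand-in data rather than the real file:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 10,000 rows, 1% positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10_000, 3))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:100] = 1

# stratify=y_demo keeps the class ratio the same in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=77, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())
```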

Now, train an initial model so we can evaluate different scoring techniques on that model.

In [8]:
neighbors = KNeighborsClassifier(n_neighbors=5)
neighbors.fit(X_train,Y_train)
Y_pred = neighbors.predict(X_train)

crosstab = pd.crosstab(Y_train, Y_pred)
print(crosstab)

macro = recall_score(Y_train, Y_pred, average='macro')
micro = recall_score(Y_train, Y_pred, average='micro')
weighted = recall_score(Y_train, Y_pred, average='weighted')
binary = recall_score(Y_train, Y_pred, average='binary')
per_class = recall_score(Y_train, Y_pred, average=None)  # not named `raw`, which would overwrite the DataFrame

print(f'\nMacro Recall Score: {macro}')
print(f'Micro Recall Score: {micro}')
print(f'Weighted Recall Score: {weighted}')
print(f'Binary Recall Score: {binary}')
print(f'Raw Recall Scores: {per_class}')

col_0       0   1
Class            
0      227451   0
1         352  42

Macro Recall Score: 0.5532994923857868
Micro Recall Score: 0.9984550900831706
Weighted Recall Score: 0.9984550900831706
Binary Recall Score: 0.1065989847715736
Raw Recall Scores: [ 1.          0.10659898]


Macro calculates the metric for each label and takes their unweighted mean. It does not take label imbalance into account, so each label has an equal effect on the final score: positive and negative recall are weighted the same.

Weighted calculates the metric for each label and averages them weighted by support (the number of true instances for each label). This is similar to macro, but accounts for label imbalance. Here that means the more common outcome dominates the score, which misleads us into believing the score is near perfect.

Binary only reports results for the positive class, which is the class we care about here. In cross-validation the corresponding scorer is scoring='recall' rather than scoring='recall_binary', which does not exist, while the macro version is scoring='recall_macro'.

We will continue using the macro scoring metric.
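To see where each of the printed scores comes from, the averages can be reproduced by hand from the training crosstab. This is only a sketch of the arithmetic, not a replacement for `recall_score`:

```python
# Counts from the training crosstab above (rows = true class, columns = prediction).
tn, fp = 227451, 0
fn, tp = 352, 42

recall_legit = tn / (tn + fp)   # recall for class 0
recall_fraud = tp / (tp + fn)   # recall for class 1, i.e. average='binary'

support_legit = tn + fp
support_fraud = tp + fn
total = support_legit + support_fraud

macro = (recall_legit + recall_fraud) / 2
weighted = (support_legit * recall_legit + support_fraud * recall_fraud) / total
micro = (tn + tp) / total  # pooled over classes; equals accuracy for single-label data

print(macro, weighted, micro, recall_fraud)
```

Micro and weighted coincide here, and both are dominated by the 227,451 legitimate transactions.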

In [11]:
def run_score(model, X_train, Y_train, X_test, Y_test, score_type='macro'):
    
    # Predict on the training data and on the held-out validation data.
    Y_pred = model.predict(X_train)
    Y_pred_test = model.predict(X_test)
    cm = pd.crosstab(Y_test, Y_pred_test)
    
    train = recall_score(Y_train, Y_pred, average=score_type)
    validation = recall_score(Y_test, Y_pred_test, average=score_type)
    # Scorer strings are 'recall' for binary and 'recall_<type>' for macro, micro, and weighted.
    scoring = 'recall' if score_type == 'binary' else f'recall_{score_type}'
    test = cross_val_score(model, X_train, Y_train, cv=5, scoring=scoring).mean()
    
    print(f'Training Score: {train}')
    print(f'Cross-Val Testing Score: {test}')
    print(cm)
    print(f'Validation Score: {validation} \n\n\n')

    results = {'Training Score':train, 'Cross-Val Testing Score':test, 'Validation Score':validation}
    
    return results

In [12]:
runs = []

def run_neighbors(num):

    print(f'n_neighbors = {num}')
    
    neighbors = KNeighborsClassifier(n_neighbors = num)
    neighbors.fit(X_train,Y_train)

    results = run_score(neighbors, X_train, Y_train, X_test, Y_test, 'macro')
    results['K'] = num
    
    runs.append(results)
    
nums = {1,5,15,50,100}

for num in nums:
    run_neighbors(num)
    
runs.sort(key = lambda run: run['Validation Score'])
print(f'Best Run: {runs[-1]}')

n_neighbors = 1
Training Score: 1.0 

Cross-Val Testing Score: 0.5848103860280839 

col_0      0   1
Class           
0      56837  27
1         77  21
Validation Score: 0.6069054485891149 



n_neighbors = 100
Training Score: 0.5 

Cross-Val Testing Score: 0.5 

col_0      0
Class       
0      56864
1         98
Validation Score: 0.5 



n_neighbors = 5
Training Score: 0.5532994923857868 

Cross-Val Testing Score: 0.5190154021959593 

col_0      0  1
Class          
0      56864  0
1         93  5
Validation Score: 0.5255102040816326 



n_neighbors = 15
Training Score: 0.5063451776649747 

Cross-Val Testing Score: 0.5025641025641026 

col_0      0  1
Class          
0      56864  0
1         97  1
Validation Score: 0.5051020408163265 



n_neighbors = 50
Training Score: 0.5 

Cross-Val Testing Score: 0.5 

col_0      0
Class       
0      56864
1         98
Validation Score: 0.5 



Best Run: {'Training Score': 1.0, 'Cross-Val Testing Score': 0.58481038602808388, 'Validation Score':

With k values of 1, 5, 15, 50, and 100, K-nearest neighbors reached a best validation score of only **0.607**, using 1 neighbor. Every larger k scored lower, and k = 50 and k = 100 predicted no fraud at all, which gives a macro recall of exactly 0.5.
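One thing not tried above: K-nearest neighbors is distance-based, and `StandardScaler` was imported but never applied. The sketch below is an assumption about what a scaled variant could look like, run on synthetic stand-in data; the inflated column only loosely mimics an unscaled feature such as Amount or Time, and no result for the real dataset is implied.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real data (about 5% positives).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
# Inflate one column so it dominates raw Euclidean distances.
X_demo[:, 0] *= 10_000

knn_raw = KNeighborsClassifier(n_neighbors=5)
# The pipeline refits the scaler inside every cross-validation fold.
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

raw_score = cross_val_score(knn_raw, X_demo, y_demo, cv=5, scoring='recall_macro').mean()
scaled_score = cross_val_score(knn_scaled, X_demo, y_demo, cv=5, scoring='recall_macro').mean()

print(f'Unscaled: {raw_score:.3f}  Scaled: {scaled_score:.3f}')
```

Wrapping the scaler in a pipeline keeps validation-fold statistics out of the scaling step.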

In [12]:
runs = []

def run_logistic(alpha, pen):
    
    print(f'C = {alpha}\n')
    
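    # C is the inverse of the regularization strength; liblinear supports both 'l1' and 'l2'.
    lr = LogisticRegression(C = alpha, penalty = pen, solver = 'liblinear')
    fit = lr.fit(X_train, Y_train)

    results = run_score(fit, X_train, Y_train, X_test, Y_test, 'macro')
    results['C'] = alpha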
    
    runs.append(results)
    
alpha_nums = [0.0001, 0.001, 0.01, 0.1, 0.3, 0.5, 1, 100, 1000, 1e5, 1e7, 1e9]

for alpha in alpha_nums:
    run_logistic(alpha, 'l1')
        
runs.sort(key = lambda run: run['Validation Score'])
print(f'Best Run: {runs[-1]}')

C = 0.0001

Training Score: 0.5
Cross-Val Testing Score: 0.5
col_0      0
Class       
0      56864
1         98
Validation Score: 0.5 



C = 0.001

Training Score: 0.6800667525935803
Cross-Val Testing Score: 0.6546211782252276
col_0      0   1
Class           
0      56849  15
1         64  34
Validation Score: 0.6733374941141341 



C = 0.01

Training Score: 0.7892807480796835
Cross-Val Testing Score: 0.782993698798742
col_0      0   1
Class           
0      56854  10
1         41  57
Validation Score: 0.7907283974366337 



C = 0.1

Training Score: 0.8171973315298993
Cross-Val Testing Score: 0.8134317647671374
col_0      0   1
Class           
0      56853  11
1         37  61
Validation Score: 0.8111277677925419 



C = 0.3

Training Score: 0.8209978433018663
Cross-Val Testing Score: 0.8184766291237995
col_0      0   1
Class           
0      56851  13
1         36  62
Validation Score: 0.8162122227900728 



C = 0.5

Training Score: 0.8222646805591887
Cross-Val Testing Score: 0.

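The best macro recall score using lasso (L1-penalized) logistic regression was a validation score of **0.821**. This was achieved with essentially no regularization: C was 1e9, and C is the inverse of the regularization strength, so a very large C means a very weak penalty.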
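Since scikit-learn's `C` is the inverse of the regularization strength, small values mean a strong penalty. A minimal sketch on synthetic data (not the credit card data) shows the effect with an L1 penalty, which drives coefficients to exactly zero when the penalty is strong:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 4 informative and 4 noise features.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, n_redundant=0,
    flip_y=0.1, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

nonzero_counts = {}
for C in [0.001, 0.1, 1e9]:
    # liblinear is one of the solvers that supports the L1 penalty.
    model = LogisticRegression(C=C, penalty='l1', solver='liblinear')
    model.fit(X_demo, y_demo)
    nonzero_counts[C] = int(np.count_nonzero(model.coef_))
    print(f'C = {C:g}: {nonzero_counts[C]} non-zero coefficients')
```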

In [None]:
runs = []
for alpha in alpha_nums:
    run_logistic(alpha, 'l2')
        
runs.sort(key = lambda run: run['Validation Score'])
print(f'Best Run: {runs[-1]}')

C = 0.0001

Training Score: 0.647063035633022
Cross-Val Testing Score: 0.6317689324167493
col_0      0   1
Class           
0      56847  17
1         77  21
Validation Score: 0.6069933776830935 



C = 0.001

Training Score: 0.7485792834455982
Cross-Val Testing Score: 0.7459876175962858
col_0      0   1
Class           
0      56845  19
1         49  49
Validation Score: 0.7498329347214406 



C = 0.01

Training Score: 0.8259926492339612
Cross-Val Testing Score: 0.8070400647657154
col_0      0   1
Class           
0      56844  20
1         34  64
Validation Score: 0.8263547540569407 



C = 0.1

Training Score: 0.8310907741226665
Cross-Val Testing Score: 0.8083199177624291
col_0      0   1
Class           
0      56848  16
1         32  66
Validation Score: 0.8365940073271851 



C = 0.3

Training Score: 0.8285988668458004
Cross-Val Testing Score: 0.8083199177624291
col_0      0   1
Class           
0      56850  14
1         31  67
Validation Score: 0.8417136339623075 



C = 0.5

T

The best score using ridge (L2-penalized) logistic regression was a validation score of **0.847**. This was achieved with moderate regularization, at C = 0.5. This is slightly better than the lasso score of 0.821, so going forward I would choose ridge over lasso.

In [None]:
runs = []

def run_tree (depth_max, random, feature_max, min_split, score_type):

    print(f'Max Depth: {depth_max}, Random State = {random}, Max Feature = {feature_max}, Min Samples Split = {min_split}')
    
    decision_tree = tree.DecisionTreeClassifier(
        criterion = 'entropy',
        max_depth = depth_max,
        random_state = random,
        max_features = feature_max,
        min_samples_split = min_split
    )
    decision_tree.fit(X_train, Y_train)
    
    results = run_score(decision_tree, X_train, Y_train, X_test, Y_test, score_type)
    results['Max Depth'] = depth_max
    results['Random State'] = random
    results['Max Feature'] = feature_max
    results['Min Samples Split'] = min_split
    results['Score Type'] = score_type
    
    runs.append(results)
        
depth_maxs = [3,4,5]
randoms = [1]
feature_maxs = [4,5,6,7]
min_splits = [4,5,6,7]

for depth_max in depth_maxs:
    for random in randoms:
        for feature_max in feature_maxs:
            for min_split in min_splits:
                run_tree (depth_max, random, feature_max, min_split, 'macro')
                    
runs.sort(key = lambda run: run['Validation Score'])
print(f'Best Run: {runs[-1]}')

Max Depth: 3, Random State = 1, Max Feature = 4, Min Samples Split = 4
Training Score: 0.8222361029754454 

Cross-Val Testing Score: 0.809646968984123 

col_0      0   1
Class           
0      56847  17
1         43  55
Validation Score: 0.7804627654381955 



Max Depth: 3, Random State = 1, Max Feature = 4, Min Samples Split = 5
Training Score: 0.8222361029754454 

Cross-Val Testing Score: 0.809646968984123 

col_0      0   1
Class           
0      56847  17
1         43  55
Validation Score: 0.7804627654381955 



Max Depth: 3, Random State = 1, Max Feature = 4, Min Samples Split = 6
Training Score: 0.8222361029754454 

Cross-Val Testing Score: 0.809646968984123 

col_0      0   1
Class           
0      56847  17
1         43  55
Validation Score: 0.7804627654381955 



Max Depth: 3, Random State = 1, Max Feature = 4, Min Samples Split = 7
Training Score: 0.8222361029754454 

Cross-Val Testing Score: 0.809646968984123 

col_0      0   1
Class           
0      56847  17
1         

The best score using a single decision tree was a validation score of **0.867**, which is better than both the ridge (0.847) and lasso (0.821) scores. This was achieved with Max Depth = 5, Max Feature = 6, and Min Samples Split = 7. Max Depth and Min Samples Split were at the upper limits I set, so the score might improve further if I raised those limits. However, in the interest of time, I will keep them as is.
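For reference, the same parameter grid can be handed to `GridSearchCV` instead of nested loops. This is a sketch of an alternative, not what was run above: it uses synthetic stand-in data, and it picks the winner by cross-validated macro recall rather than by the held-out validation score.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the real data.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# Same values as the depth_maxs, feature_maxs, and min_splits lists above.
param_grid = {
    'max_depth': [3, 4, 5],
    'max_features': [4, 5, 6, 7],
    'min_samples_split': [4, 5, 6, 7],
}

search = GridSearchCV(
    DecisionTreeClassifier(criterion='entropy', random_state=1),
    param_grid,
    scoring='recall_macro',
    cv=5,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f'Best cross-validated macro recall: {search.best_score_:.3f}')
```

Selecting on the cross-validation score leaves the validation set untouched until the final check, at the cost of differing from the selection rule used in this notebook.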

In [None]:
runs = []

def run_forest(depth_max, feature_max, min_split, n, random, score_type):
    
    print(f'Max Depth: {depth_max}, Max Feature = {feature_max}, Number of Trees = {n}, Random State = {random}, Min Samples Split: {min_split}')
    
    rfc = ensemble.RandomForestClassifier(
        max_depth = depth_max,
        max_features = feature_max,
        min_samples_split = min_split,
        n_estimators = n,
        random_state = random
    )

    rfc.fit(X_train,Y_train)

    results = run_score(rfc, X_train, Y_train, X_test, Y_test, score_type)
    results['Max Depth'] = depth_max
    results['Random State'] = random
    results['Max Feature'] = feature_max
    results['Number of Trees'] = n
    results['Min Samples Split'] = min_split
    
    runs.append(results)
    print("\n")
        
depth_maxs = [3,4,5]
randoms = [1]
feature_maxs = [4,5,6,7]
min_splits = [4,5,6,7]
ns = [10, 50, 100]

for depth_max in depth_maxs:
    for feature_max in feature_maxs:
        for min_split in min_splits:
            for n in ns:
                for random in randoms:
                    run_forest (depth_max, feature_max, min_split, n, random, 'macro')
                        
runs.sort(key = lambda run: run['Validation Score'])
print(f'Best Run: {runs[-1]}')

Max Depth: 3, Max Feature = 4, Number of Trees = 10, Random State = 1, Min Samples Split: 4
Training Score: 0.8273386244154958 

Cross-Val Testing Score: 0.8134716572141582 

col_0      0   1
Class           
0      56851  13
1         40  58
Validation Score: 0.7958040595247666 





Max Depth: 3, Max Feature = 4, Number of Trees = 50, Random State = 1, Min Samples Split: 4
Training Score: 0.812118991122247 

Cross-Val Testing Score: 0.7982212663781469 

col_0      0   1
Class           
0      56852  12
1         43  55
Validation Score: 0.7805067299851849 





Max Depth: 3, Max Feature = 4, Number of Trees = 100, Random State = 1, Min Samples Split: 4
Training Score: 0.8019667068582876 

Cross-Val Testing Score: 0.7994892874482945 

col_0      0   1
Class           
0      56852  12
1         42  56
Validation Score: 0.7856087708015114 





Max Depth: 3, Max Feature = 4, Number of Trees = 10, Random State = 1, Min Samples Split: 5
Training Score: 0.8273386244154958 

Cross-Val Tes

As expected, random forest took the longest to run by far. I had to leave the computer running and come back to it multiple times to get through all the combinations of parameters. The best score using random forest was **0.862**, which was not as strong as the single decision tree score of 0.867.
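The run time is easier to understand once the work is counted. A rough sketch using the parameter lists from the cell above, assuming one fit on the training set plus five inside the 5-fold `cross_val_score` call for every combination:

```python
from itertools import product

depth_maxs = [3, 4, 5]
feature_maxs = [4, 5, 6, 7]
min_splits = [4, 5, 6, 7]
ns = [10, 50, 100]
randoms = [1]

grid = list(product(depth_maxs, feature_maxs, min_splits, ns, randoms))

# Each combination is fitted once on the training set and then
# five more times inside the 5-fold cross-validation in run_score.
fits_per_combination = 1 + 5
total_forest_fits = len(grid) * fits_per_combination
# Every forest fit trains n trees, so count individual trees as well.
total_trees = sum(n * fits_per_combination for _, _, _, n, _ in grid)

print(len(grid), total_forest_fits, total_trees)
```

That is 144 combinations, 864 forest fits, and 46,080 individual trees. Passing `n_jobs=-1` to `RandomForestClassifier` would spread tree building across CPU cores, though I have not timed that here.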

# Conclusion

Looking deeper into our best performing model (the decision tree), we can see that we had a training score of 0.891 with a cross-validation testing score of 0.858. This shows that overfitting is present but not strong, which is good. The validation score shows how the model generalizes to unseen data, and returns a score of **0.867**. Of the 56,962 validation observations, 56,927 were classified correctly: 26 of the 98 fraudulent transactions were missed (false negatives) and 9 of the 56,864 legitimate transactions were flagged as fraud (false positives). That is a recall of about 73% on fraud, which is the measure we care most about, paired with 99.98% on legitimate transactions. False positives are rare and far less costly than missed fraud, so the main room for improvement is the fraud recall.

Going forward, I would recommend further tuning the decision tree model to see if it can perform even better. Overall, I am happy with the final score of 0.867 and would feel confident recommending this model to a financial institution (or at least to my supervisor for review!).
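The figures quoted above can be checked by hand. The class totals (98 and 56,864) match the held-out validation split, so this sketch treats the 26 false negatives and 9 false positives as validation counts:

```python
# Counts quoted in the conclusion for the held-out validation set.
n_fraud, n_legit = 98, 56864
false_negatives = 26   # fraud labeled as legitimate
false_positives = 9    # legitimate labeled as fraud

true_positives = n_fraud - false_negatives
true_negatives = n_legit - false_positives

fraud_recall = true_positives / n_fraud
legit_recall = true_negatives / n_legit
macro_recall = (fraud_recall + legit_recall) / 2
accuracy = (true_positives + true_negatives) / (n_fraud + n_legit)

print(f'Fraud recall:      {fraud_recall:.4f}')
print(f'Legitimate recall: {legit_recall:.4f}')
print(f'Macro recall:      {macro_recall:.4f}')
print(f'Accuracy:          {accuracy:.4f}')
```

The macro recall of about 0.867 is the average of a near-perfect score on legitimate transactions and a fraud recall of about 0.735.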