# Credit Card Fraud Detection

### 1. Goal

Train, evaluate and optimize models to predict fraudulent credit card transactions using (i) data on credit card transactions from Kaggle and (ii) LogisticRegression and RandomForestClassifiers

### 2. Approach
1. Start with a logistic regression classifier, and then move on to use a random forest classifier
2. Undersample the number of non-fraudulent cases in the training set so as to reduce the skew
3. Apply GridSearchCV or RandomizedSearchCV to find the best hyperparameters for each model
4. Increase the train_test_split ratio from the default of 25% to 40%. If the recall rate of the test set doesn't dip, we know that we're not overfitting

### 3. Summary of experiments
![image](https://image.ibb.co/hyXJ7v/Screen_Shot_2017_07_15_at_07_16_23.png)

### 4. TL;DR / Key findings
1. A **"vanilla" random forest model (i.e. with default hyperparameters) trained on a balanced dataset (See [Iteration 7](#iteration_7)) performs the best.** (Well, random forest models optimized with RandomizedSearchCV perform better, but not by much)
2. This model is able to **accurately predict fraud cases with a 95% recall rate** (i.e. of the 492 fraud cases, it only misclassified 5% (47 cases) as false negatives). This is true even when the model is given only 60% of the data as training data.
3. Variables `V17`, `V12`, `V11`, `V14`, `V16` and `V10` (anonymized because banks) have the highest feature importance score in determining fraud cases (see [chart](#feature_importances))

### 5. Other observations from a machine learning standpoint

1. LogisticRegression models are a great starting point for building classification models, and can reach the same recall rate after some hyperparameter optimization with `GridSearchCV`
2. RandomForestClassifier models perform better than LogisticRegression out of the box, even without any tuning/optimisation
3. Undersampling is a useful technique for training models with highly skewed data (and `imblearn.under_sampling.RandomUnderSampler` from the `imbalanced-learn` library has a nice API that makes resampling as easy as calling a method)


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import imblearn

%matplotlib inline
pd.options.display.max_columns = 40

from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

### Load data

In [None]:
df = pd.read_csv('./data/creditcard.csv')

In [None]:
df.head()

In [None]:
df.describe()

### Exploratory data analysis

In [None]:
# Count the rows in each class (0 = not fraud, 1 = fraud)
df['Class'].value_counts()

In [None]:
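# Share of non-fraud rows = non-fraud count / total count
class_counts = df['Class'].value_counts()
non_fraud_percentage = class_counts[0] / float(class_counts.sum())
print(non_fraud_percentage)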

We have a highly unbalanced dataset, and this will make it hard to train our model to detect fraud.
If we wrote a function to always predict 0 (y=not_fraud), we would be correct **99.8%** of the time, but would not have detected any of the fraud cases.
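To make that concrete, here is a small sketch of the "always predict 0" baseline. It assumes the class counts reported for this dataset (284315 non-fraud, 492 fraud) rather than reading them from the CSV, and the variable names are made up for illustration.

```python
# Class counts as reported for this dataset (assumed here, not read from the CSV)
n_non_fraud = 284315
n_fraud = 492

# A "model" that always predicts 0 (not fraud)
true_negatives = n_non_fraud   # every non-fraud row is predicted correctly
true_positives = 0             # no fraud row is ever flagged
false_negatives = n_fraud      # every fraud row is missed

baseline_accuracy = (true_negatives + true_positives) / float(n_non_fraud + n_fraud)
baseline_recall = true_positives / float(true_positives + false_negatives)

print("accuracy: %.4f" % baseline_accuracy)
print("recall:   %.4f" % baseline_recall)
```

The accuracy looks excellent while the recall is zero, which is why accuracy alone tells us very little here.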

To deal with this, we have 3 options:
1. **Weighting**: Assign the under-represented class a higher weight. However, this is unlikely to be effective given the significant skew in the dataset.
2. **Thresholding**: Override the model's `.predict()` method to classify something as 0 or 1 based on a probability threshold (e.g. 0.90), rather than the probability with the higher value (e.g. 0.50001)
3. **Sampling**: For each training set, sample it in such a way that the instances of 0 and 1 are roughly equal

[Read more](https://stackoverflow.com/questions/26221312/dealing-with-the-class-imbalance-in-binary-classification/26244744#26244744)
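As a rough illustration of option 2 (thresholding), the sketch below turns fraud probabilities into 0/1 labels using a custom cut-off. The helper name `predict_with_threshold` and the sample probabilities are hypothetical. With a fitted scikit-learn classifier, the probabilities would typically come from `model.predict_proba(X)[:, 1]`.

```python
def predict_with_threshold(fraud_probabilities, threshold=0.5):
    """Label a row as fraud (1) only if its fraud probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in fraud_probabilities]

# Made-up probabilities, just to show the effect of moving the threshold
sample_probabilities = [0.20, 0.60, 0.95]

default_labels = predict_with_threshold(sample_probabilities, threshold=0.5)
strict_labels = predict_with_threshold(sample_probabilities, threshold=0.9)

print(default_labels)
print(strict_labels)
```

Raising the threshold flags fewer rows as fraud (fewer false positives, more false negatives), while lowering it does the opposite.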

### Preparing our data for modeling

In [None]:
# .ix has been removed from pandas, so we use .loc for label/boolean-based selection
X = df.loc[:, df.columns != 'Class']
y = df.loc[:, df.columns == 'Class'].values.ravel()

[`.ravel()`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ravel.html) is a method that helps us convert y (which is originally a column-vector) to a 1-dimensional array, so that scikit-learn won't throw a DataConversionWarning. The code will work without transforming it with `.values.ravel()` as well, but we'll have a warning message, which is not so nice. 
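Here is a tiny sketch of what `.ravel()` does to a column-vector. The array below is toy data, not the real `Class` column.

```python
import numpy as np

# A column-vector: 3 rows, 1 column (the shape we get from a one-column DataFrame's .values)
column_vector = np.array([[0], [1], [0]])

# .ravel() flattens it into a 1-dimensional array
flat = column_vector.ravel()

print(column_vector.shape)
print(flat.shape)
```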

In [None]:
### Split our data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state=0)

In [None]:
# Defining some utility methods for printing model metrics

def print_header(title):
    print("\n" + title + ":\n")
    
def print_metrics(model, X, y):
    expected = y
    predicted = model.predict(X)
    
    print_header('CONFUSION MATRIX')
    print(metrics.confusion_matrix(expected, predicted))
    
    print_header('CLASSIFICATION REPORT')
    print(metrics.classification_report(expected, predicted))

<a id='iteration_1'></a>
## Iteration 1: Logistic regression model (with no sampling or thresholding)

### Train our model

In [None]:
model_1 = LogisticRegression()

In [None]:
model_1.fit(X_train, y_train)

### Evaluate our model

In [None]:
# 1. .score()
train_score_1 = model_1.score(X_train, y_train)
test_score_1 = model_1.score(X_test, y_test)

print("training set score: %f" % train_score_1)
print("test set score:     %f" % test_score_1)

Given the skew of the data, even if we always predicted 'no fraud', we would get a score of 99.8%. As such, score is not a useful metric, and this is the last time we will use it to evaluate our model.

In [None]:
print_metrics(model_1, X, y)

Looking at the 2nd nested array (`[false_negatives, true_positives]`), we see that we've correctly predicted **322** fraudulent transactions, and we've misclassified **170** fraudulent transactions as non-fraudulent.

Looking at the precision score, we can see that **74% of our predictions of y=1 (fraud) were correct**.

Looking at the recall score, we can see that only **65% of the fraudulent cases in reality were correctly classified**.

Note: remember our helpful mnemonic:
- **pre**cision: a measure of our accuracy with our **pre**dictions as the baseline
- **re**call: a measure of our accuracy with the **re**ality as the baseline
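The mnemonic maps onto two small formulas, sketched below. The recall example uses the counts quoted above for model_1 (322 frauds caught, 170 missed). The precision example uses hypothetical toy counts, since the false-positive count isn't quoted in the text.

```python
def precision(true_positives, false_positives):
    # Of everything we PREdicted as fraud, how much really was fraud?
    return true_positives / float(true_positives + false_positives)

def recall(true_positives, false_negatives):
    # Of everything that was fraud in REality, how much did we catch?
    return true_positives / float(true_positives + false_negatives)

# Counts quoted above for model_1
model_1_recall = recall(true_positives=322, false_negatives=170)
print("recall: %.2f" % model_1_recall)

# Hypothetical toy counts, only to show the precision formula
toy_precision = precision(true_positives=3, false_positives=1)
print("toy precision: %.2f" % toy_precision)
```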

<a id='iteration_2'></a>
## Iteration 2: Logistic regression model (with undersampled data)

To improve the model's recall, we can undersample the data so that the proportion of y=0 and y=1 cases is 50-50, instead of 99.8-0.2.

**`imblearn`** (imbalanced-learn) is a nice library that has methods for doing this undersampling.
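To get a feel for what random undersampling does, here is a pure-Python sketch of the idea. It is not how `imblearn` implements it, and the function name and toy labels are made up for illustration.

```python
import random
from collections import Counter

def random_undersample_indices(labels, seed=0):
    """Keep a random subset of each class so every class matches the smallest one."""
    rng = random.Random(seed)
    indices_by_class = {}
    for index, label in enumerate(labels):
        indices_by_class.setdefault(label, []).append(index)

    minority_size = min(len(indices) for indices in indices_by_class.values())

    kept = []
    for indices in indices_by_class.values():
        kept.extend(rng.sample(indices, minority_size))
    return sorted(kept)

# Toy labels: 95 non-fraud rows and 5 fraud rows
toy_labels = [0] * 95 + [1] * 5
kept_indices = random_undersample_indices(toy_labels)
balanced_counts = Counter(toy_labels[i] for i in kept_indices)

print(balanced_counts)
```

Note that undersampling throws away most of the majority-class rows, which is the price we pay for a balanced training set.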

In [None]:
from imblearn.under_sampling import RandomUnderSampler

import collections

In [None]:
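# Newer imbalanced-learn releases use fit_resample() and expose the kept row indices
# as sample_indices_ (older releases used return_indices=True and fit_sample())
rus = RandomUnderSampler()
X_undersampled, y_undersampled = rus.fit_resample(X, y)
idx_resampled = rus.sample_indices_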
print('length of X and y:', len(X_undersampled), len(y_undersampled))
print('Count of y values:', collections.Counter(y_undersampled))

In [None]:
X_train_undersampled_25_percent_split, X_test_undersampled_25_percent_split, y_train_undersampled_25_percent_split,\
    y_test_undersampled_25_percent_split = train_test_split(X_undersampled, y_undersampled, random_state=0)

In [None]:
model_2 = LogisticRegression()
model_2.fit(X_train_undersampled_25_percent_split, y_train_undersampled_25_percent_split)

In [None]:
print_metrics(model_2, X, y)

<a id='iteration_3'></a>
## Iteration 3: Logistic regression model (with GridSearchCV)

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
logistic_regression_model = LogisticRegression()

param_grid = {'C': [0.01, 0.1, 1, 10],
              'class_weight': [{
                  0: 1, 
                  1: 2
              },
              {
                  0: 1, 
                  1: 1.2
              },
              {
                  0: 1, 
                  1: 1.4
              }
              ]}

model_3 = GridSearchCV(estimator=logistic_regression_model, param_grid=param_grid, cv=5)
model_3.fit(X_train, y_train)

print("Best estimator:", model_3.best_estimator_)
print("Best score:", model_3.best_score_)

In [None]:
print_metrics(model_3, X, y)

<a id='iteration_4'></a>
## Iteration 4: Logistic regression model (with undersampled data and GridSearchCV)

In [None]:
logistic_regression_model = LogisticRegression()

param_grid = {'C': [0.01, 0.1, 1, 10],
              'class_weight': [{
                  0: 1, 
                  1: 2
              },
              {
                  0: 1, 
                  1: 1.2
              },
              {
                  0: 1, 
                  1: 1.4
              }]}

model_4 = GridSearchCV(estimator=logistic_regression_model, param_grid=param_grid, cv=5)
model_4.fit(X_train_undersampled_25_percent_split, y_train_undersampled_25_percent_split)

print("Best estimator:", model_4.best_estimator_)
print("Best score:", model_4.best_score_)

In [None]:
print_metrics(model_4, X, y)

<a id='iteration_5'></a>
## Iteration 5: Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
model_5 = RandomForestClassifier(random_state=0)
model_5.fit(X_train, y_train)

In [None]:
print_metrics(model_5, X, y)

<a id="feature_importances"></a>
### Bonus step: View/plot feature\_importances\_ in a random forest classifier

In [None]:
model_5.feature_importances_

In [None]:
plt.plot(model_5.feature_importances_, 'o')
# One tick per feature: label with the columns of X (which excludes 'Class')
plt.xticks(range(X.shape[1]), X.columns.values, rotation=90);

<a id='iteration_6'></a>
## Iteration 6: Random Forest (with undersampling)

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
model_6 = RandomForestClassifier(random_state=0)
model_6.fit(X_train_undersampled_25_percent_split, y_train_undersampled_25_percent_split)

In [None]:
print_metrics(model_6, X, y)

<a id='iteration_7'></a>
## Iteration 7: Random Forest (with undersampled data, and 40% train_test_split ratio)

To ensure that we're not overfitting, let's try it with 40% train_test_split ratio (i.e. 40% of the data will be held off for testing/validating), instead of the default ratio of 25%
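One common way to check for overfitting is to compare recall on the training rows with recall on the held-out rows. The sketch below does this on synthetic data from `make_classification` (a stand-in, not the Kaggle dataset), and all the `demo_`-style names are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, roughly balanced stand-in for the undersampled data
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hold off 40% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.4, random_state=0)

demo_model = RandomForestClassifier(n_estimators=20, random_state=0)
demo_model.fit(X_tr, y_tr)

# Recall on rows the model has seen vs. rows it has never seen
train_recall = recall_score(y_tr, demo_model.predict(X_tr))
test_recall = recall_score(y_te, demo_model.predict(X_te))

print("train recall: %.3f, test recall: %.3f" % (train_recall, test_recall))
```

A large gap between the two numbers is the usual sign that the model has memorised the training rows rather than learned something general.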

In [None]:
X_train_undersampled_40_percent_split, X_test_undersampled_40_percent_split, y_train_undersampled_40_percent_split,\
    y_test_undersampled_40_percent_split = train_test_split(X_undersampled,
                                                            y_undersampled,
                                                            test_size=0.4,
                                                            random_state=0)

In [None]:
model_7 = RandomForestClassifier(random_state=0)
model_7.fit(X_train_undersampled_40_percent_split, y_train_undersampled_40_percent_split)

In [None]:
print_metrics(model_7, X, y)

We see that our recall score has dropped from 0.97 to **0.91**, which confirms our suspicion that our earlier score of 0.97 was due to overfitting! 😢😢

<a id='iteration_8'></a>
## Iteration 8: Random Forest (with undersampling, and 40% train_test_split ratio, and optimization with GridSearchCV)

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
random_forest_classifier_model = RandomForestClassifier(random_state=0)

To see which params we can tune, we can call the `.get_params()` method. As for what values to put in, this will require some reading and general googling :-)

Generally, random forests are tuned by tweaking the following hyperparameters:
- max_features
- n_estimators
- min_samples_leaf
- class_weight

In [None]:
# get_params is a method, so we need to call it to get the dict of hyperparameters
random_forest_classifier_model.get_params()

In [None]:
random_forest_classifier_model = RandomForestClassifier(random_state=0)

# 'auto' (an alias of 'sqrt' for classifiers) was removed in newer scikit-learn releases
param_grid = {'max_features': [None, 'sqrt', 'log2'],
              'n_estimators': [1, 2, 4, 8, 10, 20, 30, 50],
              'min_samples_leaf': [1,5,10,50],
              'class_weight': [{
                  0: 1, 
                  1: 1
              },
              {
                  0: 1, 
                  1: 1.5
              },
              {
                  0: 1, 
                  1: 2
              },
              {
                  0: 1, 
                  1: 2.5
              }
              ]}

model_8 = GridSearchCV(estimator=random_forest_classifier_model, 
                       param_grid=param_grid, cv=5)
model_8.fit(X_train_undersampled_40_percent_split, y_train_undersampled_40_percent_split)

print("Best estimator:", model_8.best_estimator_)
print("Best score:", model_8.best_score_)

In [None]:
print_metrics(model_8, X, y)

<a id='iteration_9'></a>
## Iteration 9: Random Forest (with undersampling, and 40% train_test_split ratio, and optimization with RandomizedSearchCV)

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint as sp_randint

In [None]:
rfc_model = RandomForestClassifier(random_state=0)
param_dist = {"max_depth": [3, None],
              "max_features": list(range(1, 12)),
              "min_samples_split": list(range(2, 12)),
              "min_samples_leaf": list(range(1, 12)),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

model_9 = RandomizedSearchCV(rfc_model, 
                             n_iter=30, 
                             param_distributions=param_dist, 
                             scoring='recall', 
                             cv=5,
                             n_jobs=-1)

model_9.fit(X_train_undersampled_40_percent_split, y_train_undersampled_40_percent_split)

In [None]:
print_metrics(model_9, X, y)

<a id='iteration_10'></a>
## Iteration 10: Random Forest (without undersampling, with a 25% train_test_split ratio and optimized with RandomizedSearchCV)

In [None]:
rfc_model = RandomForestClassifier(random_state=0)
param_dist = {"max_depth": [3, None],
              "max_features": list(range(1, 12)),
              "min_samples_split": list(range(2, 12)),
              "min_samples_leaf": list(range(1, 12)),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

model_10 = RandomizedSearchCV(rfc_model, 
                             n_iter=30, 
                             param_distributions=param_dist, 
                             scoring='recall', 
                             cv=5,
                             n_jobs=-1)

model_10.fit(X_train, y_train)

In [None]:
print_metrics(model_10, X, y)

### Conclusion:
1. Wahoo! model_6 gave us a 0.95 recall rate without any optimization at all! In other words, our model can **identify up to 95% of the fraud cases**, even when only given 60% of the data as training data. We achieved this by using the following techniques:
  - Resampling with `imblearn.under_sampling.RandomUnderSampler` to get a balanced training dataset with an equal number of fraud (492 cases) and non-fraud (also 492 cases).
  - Randomized search cross validation
  
2. LogisticRegression models are a great starting point for building classification models
3. RandomForestClassifier models perform better than LogisticRegression out of the box, even without any tuning/optimisation
4. Undersampling is a useful technique for training models with highly skewed data
5. GridSearchCV and RandomizedSearchCV allow us to search the hyperparameter space to find the best hyperparameters
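
As a closing sketch of point 5, here is a minimal grid search that ranks candidates by recall instead of the default accuracy. It runs on synthetic data from `make_classification` (not the Kaggle dataset), and the grid values and variable names are only illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, just to keep the example self-contained
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

# Try a few regularisation strengths and score each one by recall
demo_search = GridSearchCV(estimator=LogisticRegression(max_iter=1000),
                           param_grid={'C': [0.01, 0.1, 1, 10]},
                           scoring='recall',
                           cv=5)
demo_search.fit(X_demo, y_demo)

print("Best params:", demo_search.best_params_)
print("Best cross-validated recall:", demo_search.best_score_)
```

Passing `scoring='recall'` matters for a problem like fraud detection, because the search otherwise picks the hyperparameters with the best accuracy, which we saw is not a useful metric on skewed data.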