# Project: Final Report
Study by: Saurabh, Augustine, Gelesh, 
Indiana University

### Changes to Phase 1 after review with Professor

1. <a href='#context'>Adding data source, hyperlink, more context</a>
2. <a href='#evaluation'>Additonal Metrics: precision, recall, FNR, f1, f0.5</a>
3. <a href='#roc'>ROC Curve: ROC, AUC</a>
4. <a href='#smpl_vary'>Impact of different sampling ratios</a>
5. <a href='#feature_imp'>Variable importance:decision tree approach(not complete)</a>
6. <a href='#feature_eng'>Feature Engineering(Investigating)</a>

### Changes to Phase 2 report

1. <a href='#feature_imp'>Variable importance:decision tree approach</a>
2. <a href='#feature_sel'>Feature Selection</a>
3. <a href='#pipeline_full'>Full Pipeline</a>
4. <a href='#feature_eng'>Feature Engineering</a>
5. <a href='#stats'>Statistical Significance</a>

### Changes to Phase 3 report
2. <a href='#feature_eng1'>New Features(Feature Engineering)</a>
3. <a href='#stats1'>Final Statistical Test</a>

<a id='context'></a>
# Credit Card Fraud Detection

The dataset used for this Machine Learning project is obained from Kaggle. A detailed description about the data set could be found at https://www.kaggle.com/agpickersgill/credit-card-fraud-detection/data . 


The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on http://mlg.ulb.ac.be/BruFence and http://mlg.ulb.ac.be/ARTML.

The dataset contains transactions made by credit cards in September 2013 by european cardholders. This dataset presents transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions. The dataset is highly unbalanced, the positive class (frauds) account for 0.172% of all transactions.

It contains only numerical input variables which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, the dataset does not provide the original features or the background information about the data. Features V1, V2, ... V28 are the principal components obtained with PCA.  The only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction Amount, which can be used for example-dependant cost-senstive learning. Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise. 

# Setup
First, let's make sure this notebook works well in both python 2 and 3, import a few common modules, ensure MatplotLib plots figures inline.


In [5]:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

# Common imports
import numpy as np
import pandas as pd
from IPython.display import Image
import os

import warnings
warnings.simplefilter('ignore')


# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

# Explore Data
The data has been downloaded from the following website. 
https://www.kaggle.com/agpickersgill/credit-card-fraud-detection/data 

The data is stored as creditcard.csv file, to a local folder where this notebook file is stored.

In [6]:
# Read data into a panda dataframe

trainFile = data = pd.read_csv("../input/creditcard.csv") # pd.read_csv("creditcard.csv")
trainFile.info()

There are 30 columns and 284807 rows where column class is the target variable. It is a binary value, which can have either 0 (not fraud) or 1 (fraud) value. Column "Amount" is the amount of the transaction and "Time" is the time of the transaction. The rest of the features V1 to V28 are not described. There are no missing values in the data
set.

In [157]:
trainFile.head()

As shown above, the dataset has only one categorical variable "Class", which is the target variable. Features V1, V2, ... V28 are numerical values corresponding to the principal components obtained with PCA. Hence there is no scope for categorical pipeline in this case. Numerical pipelines are implemented as part of the ML pipelines, in later sections. 

## Let us split fradulent and non-fraudlent transactions

In [158]:
non_fraud = trainFile.loc[(trainFile["Class"] ==0)]
fraud = trainFile.loc[(trainFile["Class"] ==1)]
print ("Size of Fraud data:", fraud.shape)
print ("Size of non-fraud data: ", non_fraud.shape)
class_count = trainFile["Class"].value_counts()
class_count.plot(kind = 'bar')
plt.title("Transaction class histogram")
plt.xlabel("Class")
plt.ylabel("Count")

The data is highly unbalanced with respect of Class variable values i.e. Fraud transaction and Non-Fraud transactions. There are only 0.17% of the rows with value Class = 1.

## Strategy

** There could be multiple approach for this classification problem taking into consideration the highly unbalanced data. **

  1. OVER-SAMPLING: In this approach under-represented class are copied multiple times to match with the count of over
     represented data (Class 0 in this case)
               
  2. UNDER-SAMPLING: In this approach instances of over-represented class are deleted.
  
  3. Ratio Match: In this approach each classes are split in 50-50 ratio from the dataset

## Approach

1. We will use resampling by strategy 3 above i.e. Ratio matching and test this approach using a simple logistic regression classifier.

2. After fitting the model, several performance metrics would be tested and analysed.

3. We will repeat the best resampling, by tuning the parameters in the logistic regression classifier.

4. We will finally perform classifications model using other classification algorithms.

## Correlation matrix Visualization

In [159]:
import seaborn as sns
from sklearn.utils import shuffle
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import fbeta_score
from sklearn.metrics import roc_auc_score

### Correlations for all the data

In [160]:
trainFile["Class"].astype('float')
trainFile_corr = trainFile.corr()


In [161]:
sns.heatmap(trainFile_corr, cbar = True,  square = True, annot=False, fmt= '.2f',annot_kws={'size': 15},
           cmap= 'coolwarm')
plt.show()

** From the correlation matrix it can be observed that most of the data features are not correlated.This is because,most of the features (V1-V28) are the result of Principal Component Analysis (PCA) algorithm.** 

In [162]:
abs(trainFile_corr["Class"]).sort_values(ascending=False)

In [163]:
from pandas.tools.plotting import scatter_matrix

# Top three correlated inputs with housing_median_age
attributes = ["Class", "V17", "V14","V12","V10","V16","V3","V7"]

scatter_matrix(trainFile[attributes], figsize=(12, 8))

In [164]:
trainFile.plot(kind="scatter", x="V11", y="Class",
             alpha=0.1)

### Correlations for non fraud class.

As shown above the correlation of PCA values in general is of little significance. So here we did a correlation of data separately for both the classes. 


In [165]:
non_fraud_corr =  non_fraud.corr()
sns.heatmap(non_fraud_corr, cbar = True,  square = True, annot=False, fmt= '.2f',annot_kws={'size': 15},
           cmap= 'coolwarm')
plt.show()

### Correlations for fraud class.

The fraud transactions show significant correlations between several features which is contrast to the non-fraud class. 

In [166]:
fraud_corr =  fraud.corr()
sns.heatmap(fraud_corr, cbar = True,  square = True, annot=False, fmt= '.2f',annot_kws={'size': 15},
           cmap= 'coolwarm')
plt.show()

In [167]:
# Print Correlations above threshold of 0.15 for non fraud class
rows, cols = non_fraud.shape
flds = list(non_fraud.columns)

corr = non_fraud_corr.values

for i in range(cols):
    for j in range(i+1, cols):
        if abs(corr[i,j]) > 0.15:
            print (flds[i], ' ', flds[j], ' ', corr[i,j])

In [168]:
# Print Correlations above threshold of 0.8 for fraud class
rows, cols = fraud.shape
flds = list(fraud.columns)

corr = fraud_corr.values

for i in range(cols):
    for j in range(i+1, cols):
        if abs(corr[i,j]) > 0.8:
            print (flds[i], ' ', flds[j], ' ', corr[i,j])

In [169]:
%matplotlib inline
import matplotlib.pyplot as plt

fraud.hist( color='red', label='Fraud', bins=50, figsize=(20,15))
non_fraud.hist( color='blue', label='Non Fraud', bins=50, figsize=(20,15))

plt.show()

# Resampling imbalanced dataset with equal ratio of binary classes

** Here we transform the dataset to have the minority class count match the majority class count.**

In [170]:
# random_state=42

fraud_count = len(fraud)
# fraud_count
smpl_non_fraud = non_fraud.sample(n=fraud_count, random_state=42)
# len(smpl_non_fraud)
train_data=smpl_non_fraud.append(fraud, ignore_index=True)

train_data = shuffle(train_data)
train_data.reset_index(drop=True)

len(train_data)

In [171]:
train_data.info()

In [172]:
%matplotlib inline
sns.countplot(x='Class', data=train_data)

# Preprocessing pipeline

In [173]:
from sklearn.base import BaseEstimator, TransformerMixin

# Create a class to select numerical or categorical columns 
# since Scikit-Learn doesn't handle DataFrames yet
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

In [174]:
# Select features to use for modeling.

cc_num_attribs = list(train_data)[1:-1] # To select all features except Time

num_pipeline = Pipeline([
        ('selector', DataFrameSelector(cc_num_attribs)),
        ('std_scaler', StandardScaler()),
    ])


# Create a held out dataset 

Creating a held out dataset using the train_test_split(70 / 30)

In [175]:
X = train_data.loc[:,train_data.columns != 'Class']
y = train_data.loc[:,train_data.columns == 'Class']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 42)

In [176]:
cc_prepared = num_pipeline.fit_transform(X_train)
cc_prepared

# Base model with logistic regression

In [177]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(cc_prepared, y_train)

In [178]:
# let's try the full pipeline on the testing data set

test_prepared = num_pipeline.fit_transform(X_test)
y_pred = log_reg.predict(test_prepared)


<a id='evaluation'></a>
## Evaluation

We plan to evaluate the ML model by analyzing the true positive (TP), true negative (TN), false positive (FP) and false negative (FN) predictions using the following metrics.

1.	Precision : This shows the ability of the ML model to predict positive cases(fraud). This is expressed as TP / (TP+FP)	
2.	Recall : This measures the sensitivity of the model to predict positive cases(fraud), This is expressed as TP / (TP+FN). This is an important metric for this ML problem and our objective is to maximize this metric.   	
3.	False Negative Rate : This is shows the amount of fraud cases missed by the ML model. This is expressed as FN / (TP+FN). This is also an important metric for this ML problem and our objective is to minimize this metric.
4.	F1 Score : This is the harmonic mean of precision and recall. This is expressed as 2*TP / (2*TP+FP+FN)	
5.	F0.5 Score : This is the F-beta score, where beta is 0.5. The expression to calculate the F-beta is shown below.

6.	AUC : The AUC represents a model’s ability to discriminate between fraud and non-fraud classes. An area of 1.0 represents a model that made all predictions perfectly. 

Our objective is to detect maximum number of fraud transactions(true positives). Hence we intend to maximize the recall and minimize the False Negative Rate. 


In [179]:
# Let us calculate the False negative rate (FNR), Miss rate
def fnr(y_test, y_pred):
# from sklearn.metrics import confusion_matrix

    log_cm = confusion_matrix(y_test, y_pred)
    #tn, fp, fn, tp = confusion_matrix([0, 1, 0, 1], [1, 1, 1, 0]).ravel()
    tn, fp, fn, tp = log_cm.ravel()
    # (tn, fp, fn, tp)
    if isinstance(y_test, pd.DataFrame):
        true_pos = len(y_test.loc[(y_test["Class"] ==1)])
    if isinstance(y_test, np.ndarray):
        true_pos = np.count_nonzero(y_test == 1)
    
    #tp/true_pos # recall home_grown
    #tp/(tp+fp) # precision home_grown
    fnr = fn / true_pos
    return fnr

In [180]:
metrics = [precision_score, 
           recall_score,
           fnr,
           f1_score,
           lambda y_true, y_pred: fbeta_score(y_true, y_pred, beta=0.5),
           roc_auc_score]
metrics_names = ["Precision", 
                 "Recall", 
                 "False Negative",
                 "F1",
                 "F0.5",
                 "AUC"]

In [181]:
samples = [(test_prepared, y_test)]
models_names = ["Logistic, Ratio(1F:1NF)"]

In [182]:
def evaluate(models, metrics, samples, metrics_names, models_names):
    results = np.zeros((len(samples) * len(models), len(metrics)))
    samples_names = []
#     for m in models_names:
#         samples_names.extend([m + " Train", m + " Test"])
    for m_num, model in enumerate(models):
        for row, sample in enumerate(samples):
            for col, metric in enumerate(metrics):
                results[row + m_num * 2, col] = metric(sample[1], model.predict(sample[0]))
    results = pd.DataFrame(results, columns=metrics_names, index=models_names)
    return results

In [183]:
models = [log_reg]

In [184]:
res = evaluate(models, metrics, samples, metrics_names, models_names)
res

In [185]:
from sklearn.metrics import accuracy_score

log_acc = accuracy_score(y_test, y_pred)
log_acc

In [186]:
from sklearn.metrics import recall_score

log_recall = recall_score(y_test, y_pred)
log_recall

In [187]:
from sklearn.metrics import precision_score
log_pre = precision_score(y_test, y_pred)
log_pre

In [188]:
from sklearn.metrics import roc_auc_score
log_roc = roc_auc_score(y_test, y_pred)
log_roc

In [189]:
# Let us calculate the False negative rate (FNR), Miss rate
# def fnr(y_test, y_pred):
# from sklearn.metrics import confusion_matrix
log_cm = confusion_matrix(y_test, y_pred)
#tn, fp, fn, tp = confusion_matrix([0, 1, 0, 1], [1, 1, 1, 0]).ravel()
tn, fp, fn, tp = log_cm.ravel()
# (tn, fp, fn, tp)
true_pos = len(y_test.loc[(y_test["Class"] ==1)])
#tp/true_pos # recall home_grown
#tp/(tp+fp) # precision home_grown
fnr = fn / true_pos
fnr

In [190]:
import itertools

def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)


    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [191]:
# Plot confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(log_cm
                      , classes=class_names
                      , title='Confusion matrix for base model')
plt.show()

<a id='roc'></a>
## ROC Curve



In [192]:
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_test, y_pred)

In [193]:
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate', fontsize=16)
    plt.ylabel('True Positive Rate', fontsize=16)

In [194]:

plt.figure(figsize=(8, 6))
plot_roc_curve(fpr, tpr)
plt.grid()
plt.show()

<a id='smpl_vary'></a>
# Impact of different sampling ratios
Here we study the impact of the different sampling ratios on the model performance.

## Sampling ratio of 1Fraud : 10 Non-fraud


In [195]:

fraud_count = len(fraud)
# fraud_count
smpl_non_fraud = non_fraud.sample(n=fraud_count*10, random_state=42)
# len(smpl_non_fraud)
train_data=smpl_non_fraud.append(fraud, ignore_index=True)

train_data = shuffle(train_data)
train_data.reset_index(drop=True)

len(train_data)

In [196]:
train_data.info()

In [197]:
X = train_data.loc[:,train_data.columns != 'Class']
y = train_data.loc[:,train_data.columns == 'Class']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 42)

In [198]:
cc_prepared = num_pipeline.fit_transform(X_train)
test_prepared = num_pipeline.fit_transform(X_test)


In [199]:
log_reg.fit(cc_prepared, y_train)
y_pred = log_reg.predict(test_prepared)

In [200]:
log_pre = precision_score(y_test, y_pred)
log_pre

In [201]:
samples = [(test_prepared, y_test)]
models_names = ["Logistic, Ratio(1F:10NF)"]

In [202]:
models = [log_reg]

In [203]:
res_10 = evaluate(models, metrics, samples, metrics_names, models_names)
res = res.append(res_10)
res

## Sampling ratio of 1Fraud : 20 Non-fraud


In [204]:

fraud_count = len(fraud)
# fraud_count
smpl_non_fraud = non_fraud.sample(n=fraud_count*20, random_state=42)
# len(smpl_non_fraud)
train_data=smpl_non_fraud.append(fraud, ignore_index=True)

train_data = shuffle(train_data)
train_data.reset_index(drop=True)

len(train_data)

In [205]:
X = train_data.loc[:,train_data.columns != 'Class']
y = train_data.loc[:,train_data.columns == 'Class']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 42)

In [206]:
cc_prepared = num_pipeline.fit_transform(X_train)
test_prepared = num_pipeline.fit_transform(X_test)


In [207]:
log_reg.fit(cc_prepared, y_train)
y_pred = log_reg.predict(test_prepared)

In [208]:
log_pre = precision_score(y_test, y_pred)
log_pre

In [209]:
samples = [(test_prepared, y_test)]
models_names = ["Logistic, Ratio(1F:20NF)"]

In [210]:
res_20 = evaluate(models, metrics, samples, metrics_names, models_names)
res = res.append(res_20)
res

** So far the best results are obtained 
when using equal number of fraud and non fraud classes. So we proceed with equal sampling of data.**

# Train a Random Forest Classifier

In [211]:
from  sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier()
rf_clf.fit(cc_prepared, y_train)

In [212]:
samples = [(test_prepared, y_test)]
models_names = ["RandomForest, Ratio(1F:1NF)"]
models = [rf_clf]
res_rf = evaluate(models, metrics, samples, metrics_names, models_names)
res = res.append(res_rf)
res

# Objective evaluation via K-fold crossfold validation

## K-fold crossfold validation for Random Forest Classifier

In [213]:
from sklearn.model_selection import cross_val_score


score_randomforest = cross_val_score(rf_clf,test_prepared,y_test['Class'],scoring='recall',cv=10)
score_randomforest


In [214]:
validation_scores = pd.DataFrame(columns=['Model Name','mean','SD'])

validation_scores.loc[0] = ['RF',score_randomforest.mean(),score_randomforest.std()]

validation_scores

## K-fold crossfold validation for Logistic Classifier

In [215]:
score_LR = cross_val_score(log_reg,test_prepared,y_test['Class'],scoring='recall',cv=10)
score_LR
validation_scores.loc[1] = ['LC',score_LR.mean(),score_LR.std()]

In [216]:
validation_scores

<a id='stats'></a>
# Statistical Significance

In [217]:
def stat_test(control, treatment):
    #paired t-test; two-tailed p-value      A   ,    B
    (t_score, p_value) = stats.ttest_rel(control, treatment)

    if p_value > 0.05/2:  #Two sided 
        print('There is no significant difference between the two machine learning pipelines (Accept H0)')
    else:
        print('The two machine learning pipelines are different (reject H0) \n(t_score, p_value) = (%.2f, %.5f)'%(t_score, p_value) )
        if t_score > 0.0: #in the case of regression lower RMSE is better; A is lower 
            print('Machine learning pipeline A is better than B')
        else:
            print('Machine learning pipeline B is better than A')
    return p_value

In [218]:
from sklearn.model_selection import cross_val_score
from scipy import stats
# from sklearn.tree import DecisionTreeRegressor
# from sklearn.linear_model import LinearRegression

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

# A sampling based bakeoff using *K-fold cross-validation*: 
# it randomly splits the training set into K distinct subsets (k=30)
# this bakeoff framework can be used for regression or classification
#Control system is a linear regression based pipeline

kFolds=10

y_test_ctrl = y_test
# Logistic Regression as base
control = cross_val_score(log_reg, test_prepared, y_test_ctrl['Class'],
                             scoring='recall', cv=kFolds)

# control_acc = control.mean()
# # control = control.mean()
# display_scores(control)

# display_scores(lin_rmse_scores)
#Treatment system is a random forest based pipeline

treatment = cross_val_score(rf_clf, test_prepared, y_test['Class'],
                         scoring='recall', cv=kFolds)

treatment_acc = treatment.mean()

pval = stat_test(control, treatment)

pval


# Finetune model/pipeline hyperparameters


Let’s assume at this point that you now have a shortlist of promising models. You now need to fine-tune them. Let’s look at a few ways you can do that:

* GridSearch
* RandomSearch

## Finetune via GridSearch

In [61]:
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import GridSearchCV

param_grid = [
    # try 12 (3×4) combinations of hyperparameters
    {'n_estimators': [3, 10, 30], 'max_features': [5, 10, 20, 29]},
    # then try 8 (2×4) combinations with bootstrap set as False
    {'bootstrap': [False], 'n_estimators': [10, 20], 'max_features': [5, 10, 20,29]},
  ]

# train across 5 folds, that's a total of (12+8)*5=100 rounds of training 
grid_search = GridSearchCV(rf_clf, param_grid, cv=5,
                           scoring='recall')
grid_search.fit(cc_prepared, y_train['Class'])

The best hyperparameter combination found:

In [62]:
grid_search.best_params_

In [63]:
grid_search.best_estimator_

Let's look at the score of each hyperparameter combination tested during the grid search:

In [64]:
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(mean_score, params)

In [65]:
pd.DataFrame(grid_search.cv_results_)

<a id='feature_imp'></a>

## Input variable importance

In [66]:
feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances

In [67]:
attributes = list(X_train.columns.values)
attributes

In [68]:
sortedFeatures = sorted(zip(feature_importances,attributes), reverse=False)
sortedFeatures

In [69]:
np.array(sortedFeatures)[:, 0]

In [70]:
# Plot the feature importances of the forest
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.figure() 
plt.title("Feature importances")
sortedNames = np.array(sortedFeatures)[:, 1]
sortedImportances = np.array(sortedFeatures)[:, 0]

plt.title('Feature Importances')
plt.barh(range(len(sortedNames)), sortedImportances, color='b', align='center')
plt.yticks(range(len(sortedNames)), sortedNames)
plt.xlabel('Relative Importance')
plt.grid()
plt.show()

<a id='feature_sel'></a>
## Feature Selection

In [71]:
from sklearn.base import BaseEstimator, TransformerMixin

def indices_of_top_k(arr, k):
    return np.sort(np.argpartition(np.array(arr), -k)[-k:])

class BestFeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, feature_importances, k):
        self.feature_importances = feature_importances
        self.k = k
    def fit(self, X, y=None):
        self.feature_indices_ = indices_of_top_k(self.feature_importances, self.k)
        return self
    def transform(self, X):
        return X[:, self.feature_indices_]

In [72]:
k=5
top_k_feature_indices = indices_of_top_k(feature_importances, k)
top_k_feature_indices 

In [73]:
np.array(attributes)[top_k_feature_indices]

Let's double check that these are indeed the top k features:

In [74]:
sorted(zip(feature_importances, attributes), reverse=True)[:k]

** Let's create a new pipeline that runs the previously defined preparation pipeline, and adds top k feature selection: **

In [75]:
preparation_and_feature_selection_pipeline = Pipeline([
    ('preparation', num_pipeline),
    ('feature_selection', BestFeatureSelector(feature_importances, k))
])

In [76]:
trainFile_prepared_top_k_features = preparation_and_feature_selection_pipeline.fit_transform(X_train)

In [77]:
y_pred = preparation_and_feature_selection_pipeline.fit_transform(X_test)

## Comparison of different models

In [79]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from time import time
from sklearn import linear_model
from sklearn.ensemble import RandomForestClassifier
from matplotlib import pyplot

X=trainFile_prepared_top_k_features
Y=y_train



models = []
models.append(('LR', LogisticRegression(max_iter=1000)))
models.append(('LR_L1', LogisticRegression(C=1,penalty='l1',max_iter=1000) ))
models.append(('LDA', LinearDiscriminantAnalysis()))
#models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
#models.append(('NB', GaussianNB()))
#models.append(('SVM', SVC()))
models.append(('RF_10',RandomForestClassifier(n_estimators=10)))
models.append(('RF_100',RandomForestClassifier(n_estimators=100)))
#models.append(('RF_5.21',RandomForestClassifier(max_features=5,n_estimators=21)))

#models.append(('KNN_5', KNeighborsClassifier(n_neighbors=5,n_jobs=-1)))

#models.append(('CART', DecisionTreeClassifier()))
# evaluate each model in turn
results = []
names = []
scoring ='recall'#'roc_auc' #'recall' #'accuracy'
for name, model in models:
    kfold = KFold(n_splits=10, random_state=7)
    cv_results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
# boxplot algorithm comparison
fig = pyplot.figure(figsize=(8, 6))
fig.suptitle('Classification Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(names)
pyplot.grid()
pyplot.show()

<a id='pipeline_full'></a>
## Full pipeline for data prep, feature selection and modeling

In [80]:
k=20
prepare_select_and_predict_pipeline = Pipeline([
    ('preparation', num_pipeline),
    ('feature_selection', BestFeatureSelector(feature_importances, k)),
    ('rf', RandomForestClassifier(bootstrap= False, n_estimators = 10, max_features=k))
   
])

In [81]:
prepare_select_and_predict_pipeline.fit(X_train, y_train)

In [82]:
prepare_select_and_predict_pipeline.predict(X_test)

In [83]:

samples = [(X_test, y_test)]
models_names = ["RandomForest, Feautres=20, Ratio(1F:1NF)"]


In [84]:
models = [prepare_select_and_predict_pipeline]

In [None]:
res_fe = evaluate(models, metrics, samples, metrics_names, models_names)
res = res.append(res_fe)
res

In [None]:
# res.drop(res.tail(1).index,inplace=True)
# res

## Kitchen Sink with VotingClassifier

In [85]:
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier

vot_clf = VotingClassifier(
    estimators=[
                ('rf', RandomForestClassifier(bootstrap= False, n_estimators = 10, max_features=20)),
#                 ('svc', SVC()),
                ('lr', LogisticRegression()),
                ('CART', DecisionTreeClassifier()),
#                 ('LDA', LinearDiscriminantAnalysis())
                ], 
    voting='hard')


In [86]:
k=20
prepare_select_and_predict_pipeline = Pipeline([
    ('preparation', num_pipeline),
    ('feature_selection', BestFeatureSelector(feature_importances, k)),
    ('vot_clf', vot_clf)
   
])

In [87]:
prepare_select_and_predict_pipeline.fit(X_train, y_train)

In [88]:
prepare_select_and_predict_pipeline.predict(X_test)

In [89]:

samples = [(X_test, y_test)]
models_names = ["Voting(LR,RF,DT), Feautres=20(1F:1NF)"]


In [90]:
models = [prepare_select_and_predict_pipeline]

In [481]:
res_vot = evaluate(models, metrics, samples, metrics_names, models_names)
res = res.append(res_vot)
res

## Conclusions derived after Kitchen-sink analysis
Our objective is to maximize the recall score and minimize the False Negative Rate. From the results table, it is clear that the simple model based on logistic regression is performing better. This is confirmed through statistical significance test. Models based on SVC, LDA, Random Forest, Decision Tree etc are not performing better even after several steps of feature selections. The kitchen sink model based on a VotingClassifier (Several ensembles of models based on LR, SVC, LDA, Random Forest and Decision Tree) might have improved the accuracy but did not improve the recall score. Here we used the kitchen sink analysis for demonstration purposes only. Hence we propose a feature engineering method where we generate new features after clustering analysis.

<a id='feature_eng'></a>
# Feature Engineering (Experimental)

The original dataset contains 28 principal components identified as V1, V2, V3...,V28. The data is completely anonymized and hence domain specific feature engineering cannot be performed for this dataset. From pair plot analysis we could find that the feature "Time" is not significant in predicting the target variable. Also We could See that on a pair-wise comparison, the classes are forming clusters and some of them are even linearly seperable.

## Pair Plot Analysis

In [None]:
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

%matplotlib inline
sns.pairplot(trainFile, hue="Class", size=2);
# Dont Run it Again
#Image(filename='.\pair_plot.jpg', width=500)

In [591]:
Image(filename='.\image2.jpeg', width=500)

In [592]:
Image(filename='.\image1.jpeg', width=500)

In [219]:
non_fraud = trainFile.loc[(trainFile["Class"] ==0)]
non_fraud = non_fraud.drop(["Class","Time"],axis=1)

fraud = trainFile.loc[(trainFile["Class"] ==1)]
fraud = fraud.drop(["Class","Time"],axis=1)

nfrdStd = num_pipeline.fit_transform(non_fraud)
frdStd = num_pipeline.fit_transform(fraud)
#non_fraud

<a id='feature_eng1'></a>
## Feature Extraction ( Gelesh G Omathil )

Here we attempt to cluster the Fraud and non-fraud data into diffrent sets of clusters. Then we create features as distance to the centroids of the clusters.

In [220]:
from sklearn.cluster import KMeans
nfrdCentro = KMeans(n_clusters=6, random_state=0).fit(nfrdStd)
frdCentro = KMeans(n_clusters=4, random_state=0).fit(frdStd)


In [221]:
#kmeans = np.concatenate((nfrdCentro, frdCentro), axis=0)
kmeans = np.concatenate((nfrdCentro.cluster_centers_, frdCentro.cluster_centers_), axis=0)


In [222]:
#kmeans.labels_
#kmeans.cluster_centers_
import numpy as np
import scipy
#X=trainFile_0.values
dist_0=[]
dist_1=[]
dist_2=[]
dist_3=[]
dist_4=[]
dist_5=[]
dist_6=[]
dist_7=[]
dist_8=[]
dist_9=[]
isFraud=[]

for x in frdStd:
    dist_0.append(np.sqrt(np.sum((x-kmeans[0])**2,axis=0)))
    dist_1.append(np.sqrt(np.sum((x-kmeans[1])**2,axis=0)))
    dist_2.append(np.sqrt(np.sum((x-kmeans[2])**2,axis=0)))
    dist_3.append(np.sqrt(np.sum((x-kmeans[3])**2,axis=0)))
    dist_4.append(np.sqrt(np.sum((x-kmeans[4])**2,axis=0)))
    dist_5.append(np.sqrt(np.sum((x-kmeans[5])**2,axis=0)))
    dist_6.append(np.sqrt(np.sum((x-kmeans[6])**2,axis=0)))
    dist_7.append(np.sqrt(np.sum((x-kmeans[7])**2,axis=0)))
    dist_8.append(np.sqrt(np.sum((x-kmeans[8])**2,axis=0)))
    dist_9.append(np.sqrt(np.sum((x-kmeans[9])**2,axis=0)))
    isFraud.append(1)

distDf_frd = pd.DataFrame({
        "dist_0": dist_0,
        "dist_1": dist_1,
        "dist_2": dist_2,
        "dist_3": dist_3,
        "dist_4": dist_4,
        "dist_5": dist_5,
        "dist_6": dist_6,
        "dist_7": dist_7,
        "dist_8": dist_9,
        "dist_9": dist_9,
        "Class": isFraud})

ndist_0=[]
ndist_1=[]
ndist_2=[]
ndist_3=[]
ndist_4=[]
ndist_5=[]
ndist_6=[]
ndist_7=[]
ndist_8=[]
ndist_9=[]
nisFraud=[]

for x in nfrdStd:
    ndist_0.append(np.sqrt(np.sum((x-kmeans[0])**2,axis=0)))
    ndist_1.append(np.sqrt(np.sum((x-kmeans[1])**2,axis=0)))
    ndist_2.append(np.sqrt(np.sum((x-kmeans[2])**2,axis=0)))
    ndist_3.append(np.sqrt(np.sum((x-kmeans[3])**2,axis=0)))
    ndist_4.append(np.sqrt(np.sum((x-kmeans[4])**2,axis=0)))
    ndist_5.append(np.sqrt(np.sum((x-kmeans[5])**2,axis=0)))
    ndist_6.append(np.sqrt(np.sum((x-kmeans[6])**2,axis=0)))
    ndist_7.append(np.sqrt(np.sum((x-kmeans[7])**2,axis=0)))
    ndist_8.append(np.sqrt(np.sum((x-kmeans[8])**2,axis=0)))
    ndist_9.append(np.sqrt(np.sum((x-kmeans[9])**2,axis=0)))
    nisFraud.append(0)
    
distDf_nfrd = pd.DataFrame({
        "dist_0": ndist_0,
        "dist_1": ndist_1,
        "dist_2": ndist_2,
        "dist_3": ndist_3,
        "dist_4": ndist_4,
        "dist_5": ndist_5,
        "dist_6": ndist_6,
        "dist_7": ndist_7,
        "dist_8": ndist_8,
        "dist_9": ndist_9,
        "Class": nisFraud})

In [223]:
distDf_nfrd.describe()

In [224]:
distDf_frd.describe()

In [225]:
#dft_1=dft.loc[(dft["isFraud"] ==1)]
#print(len(dft_1))
#dft_0=dft.loc[(dft["isFraud"] ==0)]
distDf_nfrdSubSet=distDf_nfrd.sample(frac=0.03)
#print(len(dft_0))
trainFile=distDf_nfrdSubSet.append(distDf_frd, ignore_index=True)

#dft_1=dft_1.sample(frac=0.3)
#trainFile=dft_1.append(dft_0, ignore_index=True)
#trainFile=dft
trainFile= shuffle(trainFile)
#trainFile.reset_index(drop=True)
trainFile= shuffle(trainFile)
#trainFile.reset_index(drop=False)
trainFile= shuffle(trainFile)
dataY=trainFile[("Class")]
dataX=trainFile.drop(["Class"],axis=1)
#X,Y

In [226]:
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter('ignore')

%matplotlib inline

from pandas import read_csv
from matplotlib import pyplot
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from time import time
from sklearn import linear_model
from sklearn.ensemble import RandomForestClassifier

#from sklearn import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

Y=dataY.values
X=dataX.values


#X_train=X
#y_train=Y
# prepare models
models = []
models.append(('LR', LogisticRegression(max_iter=1000)))
models.append(('LR_L1', LogisticRegression(C=1,penalty='l1',max_iter=1000) ))
models.append(('LDA', LinearDiscriminantAnalysis()))
#models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
#models.append(('SVM', SVC()))
models.append(('RF_10',RandomForestClassifier(n_estimators=10)))
models.append(('RF_100',RandomForestClassifier(n_estimators=100)))
#models.append(('RF_5.21',RandomForestClassifier(max_features=5,n_estimators=21)))

#models.append(('KNN_5', KNeighborsClassifier(n_neighbors=5,n_jobs=-1)))

#models.append(('CART', DecisionTreeClassifier()))
# evaluate each model in turn
results = []
names = []
scoring ='recall'#'roc_auc' #'recall' #'accuracy'
for name, model in models:
    kfold = KFold(n_splits=10, random_state=7)
    cv_results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
# boxplot algorithm comparison
fig = pyplot.figure(figsize=(8, 8))
fig.suptitle('Classification Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(names)
pyplot.grid()
pyplot.show()

In [227]:
from sklearn.metrics import recall_score
from sklearn.model_selection import RandomizedSearchCV
param_distribs = {
        'penalty': ('l1','l2'),
        'C': (0.001,0.01,1,10,100),
    }

# scoring ='roc_auc' #'recall'
scoring ='recall'
log_reg = LogisticRegression(max_iter=10000)
rnd_search = RandomizedSearchCV(log_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring=scoring, random_state=42) #'neg_mean_squared_error'

In [228]:
X_train, X_test, y_train, y_test = train_test_split(dataX.values,dataY.values,test_size = 0.3, random_state = 42)


In [229]:
from sklearn.model_selection import GridSearchCV

param_grid = [
    # try 12 (3×4) combinations of hyperparameters
    {'n_estimators': [3,10,30,100], 'max_features': [3,4,5,6,7,8,9]},
    # then try 8 (2×4) combinations with bootstrap set as False
    {'bootstrap': [False], 'n_estimators':  [3,10,30,100], 'max_features': [3,4,5,6,7,8,9]},
  ]

# train across 5 folds, that's a total of (12+8)*5=100 rounds of training 
grid_search = GridSearchCV(rf_clf, param_grid, cv=5,
                           scoring='recall')
#X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size = 0.3, random_state = 42)
grid_search.fit(X_train, y_train)

print("BEST PARAMS")
print(grid_search.best_params_)
grid_search.best_estimator_

In [230]:
cvres = grid_search.cv_results_
print("mean_score  params")
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(mean_score, params)


In [231]:

pd.DataFrame(grid_search.cv_results_)

## Input features and their importance for best model

 We developed 10 new features after feature transformation. Their importance is shown in the plot below.

In [232]:

feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances

attributes = list(dataX.columns.values)
attributes

sortedFeatures = sorted(zip(feature_importances,attributes), reverse=False)
sortedFeatures

# Plot the feature importances of the forest
%matplotlib inline

import matplotlib
import matplotlib.pyplot as plt
plt.figure() 
plt.title("Feature importances")
sortedNames = np.array(sortedFeatures)[:, 1]
sortedImportances = np.array(sortedFeatures)[:, 0]

plt.title('Feature Importances')
plt.barh(range(len(sortedNames)), sortedImportances, color='b', align='center')
plt.yticks(range(len(sortedNames)), sortedNames)
plt.xlabel('Relative Importance')
plt.grid()
plt.show()

## Evaluation of the best model

In [233]:

grid_search.best_estimator_.fit(X_train, y_train)


samples = [(X_test, y_test)]
models_names = ["RandomForest, Feature Transform"]
models = [grid_search.best_estimator_]
res_ft = evaluate(models, metrics, samples, metrics_names, models_names)
res = res.append(res_ft)
res

In [234]:
from sklearn.metrics import roc_curve

y_pred = grid_search.best_estimator_.predict(X_test)
fpr, tpr, thresholds = roc_curve(y_test, y_pred)

In [235]:
plt.figure(figsize=(8, 6))
plot_roc_curve(fpr, tpr)
plt.grid()
plt.show()

In [236]:

# Plot confusion matrix
best_cm = confusion_matrix(y_test, y_pred)

class_names = [0,1]
plt.figure()
plot_confusion_matrix(best_cm
                      , classes=class_names
                      , title='Confusion matrix for best model')
plt.show()

<a id='stats1'></a>
## Statistical Significance of the best model 

In [237]:
from sklearn.model_selection import cross_val_score
from scipy import stats
# from sklearn.tree import DecisionTreeRegressor
# from sklearn.linear_model import LinearRegression

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

# A sampling based bakeoff using *K-fold cross-validation*: 
# it randomly splits the training set into K distinct subsets (k=30)
# this bakeoff framework can be used for regression or classification
#Control system is a linear regression based pipeline

kFolds=10
# Logistic Regression as base
control = cross_val_score(log_reg, test_prepared, y_test_ctrl['Class'],
                             scoring='recall', cv=kFolds)

# control_acc = control.mean()
# control = control.mean()
# display_scores(lin_rmse_scores)
# display_scores(control)


#Treatment system is a random forest based pipeline

treatment = cross_val_score(grid_search.best_estimator_, X_test, y_test,
                         scoring='recall', cv=kFolds)

# treatment_acc = treatment.mean()
# treatment = treatment.mean()
# display_scores(treatment)
# treatment = tree_rmse_scores = np.sqrt(-scores)
# display_scores(tree_rmse_scores)


pval = stat_test(control, treatment)

pval
    

## Discussion on best model

The results from base model(Logistic Regression) is below. 

Precision: 0.937500	
Recall: 0.895522 	
False Negative Rate: 0.104478	
F1 Score: 0.916031  	
F0.5 Score: 0.928793	
AUC: 0.923070

The results from best model(Random Forest) after feature transformation is below.

Precision: 1.000000	
Recall: 0.979866	
False Negative Rate: 0.020134 	
F1 Score: 0.989831  	
F0.5 Score: 0.995907	
AUC: 0.989933

To detect maximum number of fraud transacations, we need to have model that maximizes the recall and minimizes the false negative rate. As you can see from the results, the best model reduced the false negative rate from 10% to 2%, while improving the recall from 0.89 to 0.97. The feature transformation is key factor in this achievement.  		 	 	