# Exercise 15

# Fraud Detection

## Introduction

- Fraud Detection Dataset from Microsoft Azure: [data](http://gallery.cortanaintelligence.com/Experiment/8e9fe4e03b8b4c65b9ca947c72b8e463)

Fraud detection is one of the earliest industrial applications of data mining and machine learning. It is typically handled as a binary classification problem, but the classes are heavily imbalanced because fraud is rare relative to the overall volume of transactions. Moreover, when fraudulent transactions are discovered, the business typically blocks the affected accounts from transacting to prevent further losses.

In [23]:
%matplotlib inline
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import fbeta_score

In [2]:
import pandas as pd

url = 'https://raw.githubusercontent.com/albahnsen/PracticalMachineLearningClass/master/datasets/15_fraud_detection.csv.zip'
df = pd.read_csv(url, index_col=0)
df.head()

Unnamed: 0,accountAge,digitalItemCount,sumPurchaseCount1Day,sumPurchaseAmount1Day,sumPurchaseAmount30Day,paymentBillingPostalCode - LogOddsForClass_0,accountPostalCode - LogOddsForClass_0,paymentBillingState - LogOddsForClass_0,accountState - LogOddsForClass_0,paymentInstrumentAgeInAccount,ipState - LogOddsForClass_0,transactionAmount,transactionAmountUSD,ipPostalCode - LogOddsForClass_0,localHour - LogOddsForClass_0,Label
0,2000,0,0,0.0,720.25,5.064533,0.421214,1.312186,0.566395,3279.574306,1.218157,599.0,626.16465,1.259543,4.745402,0
1,62,1,1,1185.44,2530.37,0.538996,0.481838,4.40137,4.500157,61.970139,4.035601,1185.44,1185.44,3.981118,4.921349,0
2,2000,0,0,0.0,0.0,5.064533,5.096396,3.056357,3.155226,0.0,3.314186,32.09,32.09,5.00849,4.742303,0
3,1,1,0,0.0,0.0,5.064533,5.096396,3.331154,3.331239,0.0,3.529398,133.28,132.729554,1.324925,4.745402,0
4,1,1,0,0.0,132.73,5.412885,0.342945,5.563677,4.086965,0.001389,3.529398,543.66,543.66,2.693451,4.876771,0


In [6]:
df.shape, df.Label.sum(), df.Label.mean()

((138721, 16), 797, 0.0057453449730033666)
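With only 797 positives among 138,721 rows, accuracy on its own says little. As a rough illustration using just the counts printed above, a classifier that always predicts the negative class would already reach about 99.4% accuracy while catching no fraud at all:

```python
# Counts taken from the output above
n_rows = 138721
n_fraud = 797

# Accuracy of a trivial classifier that labels every transaction as non-fraud
baseline_accuracy = (n_rows - n_fraud) / n_rows
# Recall of that same classifier on the fraud class: it never flags anything
baseline_recall = 0 / n_fraud

print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"baseline recall:   {baseline_recall:.4f}")
```

Any model evaluated below should be read against this baseline rather than against 50%.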

# Exercise 15.1

Estimate a Logistic Regression and a Decision Tree

Evaluate using the following metrics:
* Accuracy
* F1-Score
* F_Beta-Score (Beta=10)

Comment on the results
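To make the role of Beta concrete, the following is a minimal sketch of the F-beta formula, $F_\beta = (1+\beta^2)\,\frac{P \cdot R}{\beta^2 P + R}$, written in plain Python. The helper `f_beta` and the toy labels are hypothetical and exist only for illustration; they are not part of the exercise data.

```python
def f_beta(y_true, y_pred, beta=1.0):
    """F-beta for the positive class (label 1), computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives, so the score is reported as zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical toy labels: 3 frauds, 1 caught, 1 false alarm
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]

for beta in (0.5, 1, 10):
    print(beta, round(f_beta(y_true, y_pred, beta), 4))
```

With Beta=10 the score sits very close to recall, which is why a large Beta suits problems where a missed fraud costs far more than a false alarm. In scikit-learn, `fbeta_score(y_true, y_pred, beta=10)` reports this positive-class value by default, while `average=None` returns one score per class.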

In [7]:
seed = 42
# Separate the target (Y) from the features (X) before any resampling
Y = df.Label
X = df.drop(['Label'], axis=1)
print("Y sample: ", Y.shape)
print("X sample: ", X.shape)

Y sample:  (138721,)
X sample:  (138721, 15)


In [19]:
# Hold out 30% of the data for validation
from sklearn.model_selection import train_test_split
size = 0.3
X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=size)
print ("Rows train: ", len(Y_train))
print ("Sum_Label train: ", sum(Y_train))

Rows train:  97104
Sum_Label train:  545


### DecisionTreeClassifier before balancing the training set

In [48]:
#Train DecisionTreeClassifier and metrics
regTree = DecisionTreeClassifier()
regTree.fit(X_train, Y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [46]:
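Y_pred = regTree.predict(X_validation)
print ("accuracy_score: ",accuracy_score(Y_validation, Y_pred))
print ("f1_score: ",f1_score(Y_validation, Y_pred))
# NOTE: the exercise asks for beta=10, but every fbeta_score call in this notebook uses beta=0.5.
# average=None returns one score per class: [class 0, class 1]
print ("fbeta_score: ",fbeta_score(Y_validation, Y_pred, average=None, beta=0.5))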

accuracy_score:  0.9888507100463753
f1_score:  0.14705882352941177
fbeta_score:  [0.99467738 0.14084507]


### LogisticRegression before balancing the training set

In [39]:
#Train LogisticRegression and metrics
logreg = LogisticRegression(solver='liblinear', max_iter=200)
logreg.fit(X_train, Y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=200, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)

In [44]:
Y_pred = logreg.predict(X_validation)
print ("accuracy_score: ",accuracy_score(Y_validation, Y_pred))
print ("f1_score: ",f1_score(Y_validation, Y_pred))
print ("fbeta_score: ",fbeta_score(Y_validation, Y_pred, average=None, beta=0.5))

accuracy_score:  0.993944782180359
f1_score:  0.0
fbeta_score:  [0.99514995 0.        ]


# Exercise 15.2

Under-sample the negative class using random-under-sampling

Which value of target_percentage did you choose?
How do the results change?

**Only apply under-sampling to the training set, evaluate using the whole test set**
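The procedure can be sketched as a single helper. This is a minimal sketch, assuming binary 0/1 labels with 1 as the minority class; the function name `under_sample_indices` and the toy labels are hypothetical. It returns row positions, so only the training rows are resampled and the test set stays untouched.

```python
import numpy as np

def under_sample_indices(y, target_percentage=0.5, seed=None):
    """Return positions that keep every positive and a random subset of negatives."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    # Negatives to keep so that positives make up target_percentage of the result
    n_neg_keep = int(len(pos) / target_percentage - len(pos))
    n_neg_keep = min(n_neg_keep, len(neg))  # cannot keep more negatives than exist
    keep_neg = rng.choice(neg, size=n_neg_keep, replace=False)
    return np.concatenate([pos, keep_neg])

# Hypothetical toy labels: 5 positives among 1000 rows
y_toy = np.zeros(1000, dtype=int)
y_toy[:5] = 1
idx = under_sample_indices(y_toy, target_percentage=0.25, seed=42)
print(len(idx), y_toy[idx].mean())
```

Because the result holds positions rather than index labels, it would be applied to pandas objects with `.iloc`, for example `X_train.iloc[idx]` and `Y_train.iloc[idx]`.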

In [49]:
print ("Rows train: ", len(Y_train))
print ("Sum_Label train: ", sum(Y_train))

Rows train:  97104
Sum_Label train:  545


In [83]:
# Get indices of Y_train = 0
non_fraud_indices = Y_train[Y_train == 0].index
len(non_fraud_indices)

96559

In [101]:
# Target share of fraud rows after under-sampling; derive how many non-fraud rows to keep
Percentage_balance = 0.5
n_samples_0_new = int(sum(Y_train) / Percentage_balance - sum(Y_train))
n_samples_0_new

545

In [102]:
#Random sample non fraud indices
random_indices = np.random.choice(non_fraud_indices, n_samples_0_new, replace=False)
len(random_indices)

545

In [103]:
#Find the indices of fraud samples
fraud_indices = Y_train[Y_train == 1].index
len(fraud_indices)

545

In [104]:
#Concat fraud indices with sample non-fraud ones
under_sample_indices = np.concatenate([fraud_indices,random_indices])
print(len(under_sample_indices))
under_sample_indices

1090


array([133225, 113967,  73255, ...,   1836,  67388,  61466], dtype=int64)

In [98]:
#Build the balanced training set from the selected index labels
Y_under_sample = Y_train.loc[under_sample_indices]
X_under_sample = X_train.loc[under_sample_indices]

# Results after balancing the training set
### DecisionTreeClassifier

In [109]:
#Train DecisionTreeClassifier and metrics
regTree = DecisionTreeClassifier()
regTree.fit(X_under_sample, Y_under_sample)
Y_pred = regTree.predict(X_validation)
print ("accuracy_score: ",accuracy_score(Y_validation, Y_pred))
print ("f1_score: ",f1_score(Y_validation, Y_pred))
print ("fbeta_score: ",fbeta_score(Y_validation, Y_pred, average=None, beta=0.5))

accuracy_score:  0.6589855107287887
f1_score:  0.023665382498624106
fbeta_score:  [0.90423167 0.01498571]


### LogisticRegression

In [107]:
#Train LogisticRegression and metrics
logreg = LogisticRegression(solver='liblinear', max_iter=200)
logreg.fit(X_under_sample, Y_under_sample)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=200, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)

In [108]:
Y_pred = logreg.predict(X_validation)
print ("accuracy_score: ",accuracy_score(Y_validation, Y_pred))
print ("f1_score: ",f1_score(Y_validation, Y_pred))
print ("fbeta_score: ",fbeta_score(Y_validation, Y_pred, average=None, beta=0.5))

accuracy_score:  0.5885095033279669
f1_score:  0.02159629777752385
fbeta_score:  [0.87528542 0.01364503]


# Exercise 15.3

Same analysis using random-over-sampling

In [125]:
def OverSampling(X, y, target_percentage=0.5, seed=None):
    # Assumes the minority class is the positive class (y == 1)
    y_arr = np.asarray(y)
    n_samples_0 = (y_arr == 0).sum()
    n_samples_1 = (y_arr == 1).sum()

    # Positives needed so that they make up target_percentage of the result
    n_samples_1_new = -target_percentage * n_samples_0 / (target_percentage - 1)
    print(n_samples_1_new)
    np.random.seed(seed)
    # Sample positions within the positives, with replacement
    filter_ = np.random.choice(int(n_samples_1), int(n_samples_1_new))
    # filter_ indexes the positives; map it to positions in the full set
    filter_ = np.nonzero(y_arr == 1)[0][filter_]

    # Keep every negative row alongside the resampled positives
    filter_ = np.concatenate((filter_, np.nonzero(y_arr == 0)[0]), axis=0)
    print(len(filter_))

    # Positional selection, since filter_ holds row positions rather than index labels
    return X.iloc[filter_], y.iloc[filter_]

X_over_sample, Y_over_sample = OverSampling(X_train, Y_train, target_percentage=0.5, seed=42)

96559.0
193118


In [116]:
n_samples_1_new =  -0.5 * (len(Y_train)-sum(Y_train)) / (0.5 - 1)
n_samples_1_new

96559.0

In [None]:
target_percentage = 0.5
X_over_sample, Y_over_sample = OverSampling(X_train, Y_train, target_percentage, 42)

# Exercise 15.4 (3 points)

Evaluate the results using SMOTE

Which parameters did you choose?
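For orientation, the core idea of SMOTE is to create synthetic minority rows by interpolating between a minority row and one of its nearest minority neighbours. The block below is a simplified NumPy sketch of that idea under stated assumptions, not the imbalanced-learn implementation; the function `smote_sketch`, its parameters `n_new` and `k`, and the toy data are hypothetical. In practice the `SMOTE` class from imbalanced-learn (with `fit_resample`) would be the usual choice, and its parameters are the ones the question above refers to.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=None):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    X_min = np.asarray(X_min, dtype=float)
    rng = np.random.default_rng(seed)
    k = min(k, len(X_min) - 1)  # needs at least two minority rows
    # Brute-force pairwise squared distances between minority rows
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    np.fill_diagonal(dist, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)  # starting row for each new sample
    pick = rng.integers(0, k, size=n_new)           # which of its k neighbours to use
    nn = neighbours[base, pick]
    lam = rng.random((n_new, 1))                    # interpolation weight in [0, 1)
    return X_min[base] + lam * (X_min[nn] - X_min[base])

# Hypothetical toy minority rows: 20 samples with 3 features
toy_rng = np.random.default_rng(0)
X_min_toy = toy_rng.normal(size=(20, 3))
synthetic = smote_sketch(X_min_toy, n_new=50, k=5, seed=0)
print(synthetic.shape)
```

As with under-sampling, synthetic rows should be generated from the training set only, so that the validation set still reflects the original class balance.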

# Exercise 15.5 (3 points)

Evaluate the results using the Adaptive Synthetic Sampling Approach for Imbalanced Learning (ADASYN)

http://www.ele.uri.edu/faculty/he/PDFfiles/adasyn.pdf
https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.ADASYN.html#rf9172e970ca5-1

In [None]:
## Create a results table covering every method used: one row per metric, one column per model
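One possible layout for such a table is sketched below with pandas. The scores are copied from the outputs printed earlier in this notebook; the column names are illustrative, and the F-beta row holds the positive-class value computed with beta=0.5, as in the cells above. Rows for over-sampling, SMOTE and ADASYN would be added as further columns once those experiments are run.

```python
import pandas as pd

# Scores copied from the outputs printed earlier in this notebook
# (fbeta is the positive-class value, computed with beta=0.5)
results = pd.DataFrame(
    {
        "Tree": [0.9888507100463753, 0.14705882352941177, 0.14084507],
        "LogReg": [0.993944782180359, 0.0, 0.0],
        "Tree (under-sampled)": [0.6589855107287887, 0.023665382498624106, 0.01498571],
        "LogReg (under-sampled)": [0.5885095033279669, 0.02159629777752385, 0.01364503],
    },
    index=["accuracy", "f1", "fbeta"],
)
print(results.round(4))
```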


# Exercise 15.6 (3 points)

Compare and comment on the results