# Exercise 15

# Fraud Detection

## Introduction

- Fraud Detection Dataset from Microsoft Azure: [data](http://gallery.cortanaintelligence.com/Experiment/8e9fe4e03b8b4c65b9ca947c72b8e463)

Fraud detection is one of the earliest industrial applications of data mining and machine learning. Fraud detection is typically handled as a binary classification problem, but the class population is unbalanced because instances of fraud are usually very rare compared to the overall volume of transactions. Moreover, when fraudulent transactions are discovered, the business typically takes measures to block the accounts from transacting to prevent further losses. 

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

In [2]:
url = 'https://raw.githubusercontent.com/albahnsen/PracticalMachineLearningClass/master/datasets/15_fraud_detection.csv.zip'
df = pd.read_csv(url, index_col=0)
df.head()

Unnamed: 0,accountAge,digitalItemCount,sumPurchaseCount1Day,sumPurchaseAmount1Day,sumPurchaseAmount30Day,paymentBillingPostalCode - LogOddsForClass_0,accountPostalCode - LogOddsForClass_0,paymentBillingState - LogOddsForClass_0,accountState - LogOddsForClass_0,paymentInstrumentAgeInAccount,ipState - LogOddsForClass_0,transactionAmount,transactionAmountUSD,ipPostalCode - LogOddsForClass_0,localHour - LogOddsForClass_0,Label
0,2000,0,0,0.0,720.25,5.064533,0.421214,1.312186,0.566395,3279.574306,1.218157,599.0,626.16465,1.259543,4.745402,0
1,62,1,1,1185.44,2530.37,0.538996,0.481838,4.40137,4.500157,61.970139,4.035601,1185.44,1185.44,3.981118,4.921349,0
2,2000,0,0,0.0,0.0,5.064533,5.096396,3.056357,3.155226,0.0,3.314186,32.09,32.09,5.00849,4.742303,0
3,1,1,0,0.0,0.0,5.064533,5.096396,3.331154,3.331239,0.0,3.529398,133.28,132.729554,1.324925,4.745402,0
4,1,1,0,0.0,132.73,5.412885,0.342945,5.563677,4.086965,0.001389,3.529398,543.66,543.66,2.693451,4.876771,0


In [3]:
df.shape, df.Label.sum(), df.Label.mean()

((138721, 16), 797, 0.0057453449730033666)
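
To make the imbalance concrete, the sketch below uses the counts printed above (138,721 transactions, 797 of them labelled as fraud) to compute the accuracy of a hypothetical baseline that never predicts fraud. The function name `always_negative_accuracy` is illustrative and not part of the exercise.

```python
def always_negative_accuracy(n_total, n_positive):
    """Accuracy of a classifier that labels every transaction as non-fraud."""
    return (n_total - n_positive) / n_total

# Counts taken from the df.shape and df.Label.sum() output above
n_total, n_fraud = 138721, 797
baseline = always_negative_accuracy(n_total, n_fraud)
print(f"fraud rate: {n_fraud / n_total:.4%}")
print(f"always-negative accuracy: {baseline:.4%}")
```

Such a baseline scores above 99% accuracy while catching no fraud at all, which is why accuracy alone is a weak metric for this dataset.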

In [4]:
df.dtypes

accountAge                                        int64
digitalItemCount                                  int64
sumPurchaseCount1Day                              int64
sumPurchaseAmount1Day                           float64
sumPurchaseAmount30Day                          float64
paymentBillingPostalCode - LogOddsForClass_0    float64
accountPostalCode - LogOddsForClass_0           float64
paymentBillingState - LogOddsForClass_0         float64
accountState - LogOddsForClass_0                float64
paymentInstrumentAgeInAccount                   float64
ipState - LogOddsForClass_0                     float64
transactionAmount                               float64
transactionAmountUSD                            float64
ipPostalCode - LogOddsForClass_0                float64
localHour - LogOddsForClass_0                   float64
Label                                             int64
dtype: object

# Exercise 15.1

Estimate a Logistic Regression, Decision Tree and Random Forest

Evaluate using the following metrics:
* Accuracy
* F1-Score
* F_Beta-Score (Beta=10)

Comment on the results

In [5]:
# specify seed for reproducible results
seed = 42

# define X and y
X = df.drop(['Label'], axis=1).values
y = df['Label'].values

# standardize features
scaler = StandardScaler()
# np.float was removed in NumPy 1.24; np.float64 is the equivalent dtype
X = scaler.fit_transform(X.astype(np.float64))

# split dataframe into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=seed)

In [6]:
# define models
models = {'Logistic Regression':LogisticRegression(solver='liblinear', random_state=seed), 
         'Decision Tree':DecisionTreeClassifier(random_state=seed), 
         'Random Forest':RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=seed)}

y_pred = pd.DataFrame(columns=models.keys())
results = []

# train, predict and evaluate each model
for model in models.keys():
    models[model].fit(X_train, y_train)
    y_pred[model] = models[model].predict(X_test)
    results.append({'Accuracy': metrics.accuracy_score(y_test, y_pred[model]),
                    'F1-Score': metrics.f1_score(y_test, y_pred[model]),
                    'Model': str(model),
                    'F-Beta Score': metrics.fbeta_score(y_test, y_pred[model], beta=10)}) #f-beta > 1 favors recall (punishing FN)

# store results on dataframe
results = pd.DataFrame(results)
results.sort_values('F-Beta Score', inplace=True, ascending=False)
results

Unnamed: 0,Accuracy,F-Beta Score,F1-Score,Model
1,0.988971,0.146696,0.135593,Decision Tree
2,0.994065,0.074089,0.127208,Random Forest
0,0.994113,0.0,0.0,Logistic Regression


In fraud detection, it is more important to catch fraudulent transactions than to avoid flagging legitimate ones. For that reason I use the F-Beta Score (beta=10), which weights recall far more heavily than precision, as the main metric for evaluating the models:

- **Decision Tree** performs best so far.
- **Decision Tree** and **Random Forest** detect some actual fraud (TP), but their performance is quite limited.
- **Logistic Regression** does not detect any true positives: its F1 and F-Beta scores are both 0.
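
As a minimal sketch of why beta=10 shifts the emphasis toward recall, the helper below evaluates the standard F-beta formula, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), on made-up precision and recall values. The function `f_beta` and the example numbers are illustrative assumptions, not results from this dataset.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score from precision and recall; returns 0.0 when both are zero."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom

# Hypothetical classifier: low precision, high recall
p, r = 0.05, 0.80
f1 = f_beta(p, r, beta=1)
f10 = f_beta(p, r, beta=10)
print(f"F1 = {f1:.3f}, F10 = {f10:.3f}")
```

With beta=10 the score stays much closer to recall even when precision is poor, so a model can have a low F1-Score and a considerably higher F-Beta Score.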



# Exercise 15.2

Under-sample the negative class using random-under-sampling

Which value did you choose for the target_percentage parameter?
How do the results change?

**Only apply under-sampling to the training set, evaluate using the whole test set**

In [7]:
def UnderSampling(X, y, target_percentage=0.5, seed=None):
    # Assuming minority class is the positive
    n_samples = y.shape[0]
    n_samples_0 = (y == 0).sum()
    n_samples_1 = (y == 1).sum()

    n_samples_0_new =  n_samples_1 / target_percentage - n_samples_1
    n_samples_0_new_per = n_samples_0_new / n_samples_0

    filter_ = y == 0

    np.random.seed(seed)
    rand_1 = np.random.binomial(n=1, p=n_samples_0_new_per, size=n_samples)
    
    filter_ = filter_ & rand_1
    filter_ = filter_ | (y == 1)
    filter_ = filter_.astype(bool)
    
    return X[filter_], y[filter_]

In [8]:
y_pred_undrsamp = pd.DataFrame(columns=models.keys())
results_undrsamp = []

for target_pct in np.arange(0.1,0.6,0.1):
    for model in models.keys():
        X_undrsamp, y_undrsamp = UnderSampling(X_train, y_train, target_percentage=target_pct, seed=seed)
        models[model].fit(X_undrsamp, y_undrsamp)
        y_pred_undrsamp[model] = models[model].predict(X_test)
        results_undrsamp.append({'Accuracy': metrics.accuracy_score(y_test, y_pred_undrsamp[model]),
                        'F1-Score': metrics.f1_score(y_test, y_pred_undrsamp[model]),
                        'F-Beta Score': metrics.fbeta_score(y_test, y_pred_undrsamp[model], beta=10),
                        'Target percentage': target_pct,
                        'Model': str(model)+'_UnderSampling'})

results_undrsamp = pd.DataFrame(results_undrsamp)
results_undrsamp.sort_values('F-Beta Score', inplace=True, ascending=False)
results_undrsamp

Unnamed: 0,Accuracy,F-Beta Score,F1-Score,Model,Target percentage
11,0.858471,0.523883,0.050918,Random Forest_UnderSampling,0.4
14,0.759401,0.522122,0.034519,Random Forest_UnderSampling,0.5
8,0.9091,0.492265,0.067997,Random Forest_UnderSampling,0.3
5,0.953961,0.45511,0.110492,Random Forest_UnderSampling,0.2
7,0.789894,0.454969,0.033171,Decision Tree_UnderSampling,0.3
13,0.629959,0.449326,0.022595,Decision Tree_UnderSampling,0.5
12,0.615542,0.425426,0.020928,Logistic Regression_UnderSampling,0.5
10,0.744768,0.422138,0.026933,Decision Tree_UnderSampling,0.4
4,0.846217,0.382563,0.035273,Decision Tree_UnderSampling,0.2
1,0.917462,0.351348,0.053458,Decision Tree_UnderSampling,0.1


When applying **under-sampling**, the best-performing model by F-Beta Score is Random Forest with target_percentage = 0.4.

In [9]:
# resample training data
X_res, y_res = UnderSampling(X_train, y_train, target_percentage=0.4, seed=seed)

print('Resampled dataset shape %s' % Counter(y_res))
print('y.shape = ',y_res.shape[0], 'y.mean() = ', y_res.mean())

Resampled dataset shape Counter({0: 806, 1: 552})
y.shape =  1358 y.mean() =  0.406480117820324
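
The resampled set holds 806 negatives rather than a round number because `UnderSampling` keeps each negative independently with a fixed probability, so only the expected count is controlled. The sketch below reproduces that arithmetic, assuming 552 positives (from the output above) and 96,552 negatives in the training split (the count reported by the over-sampling output later in the notebook). The helper name `expected_negatives` is an assumption for illustration.

```python
def expected_negatives(n_pos, target_percentage):
    """Negatives needed so that positives make up target_percentage of the sample."""
    return n_pos / target_percentage - n_pos

n_pos, n_neg = 552, 96552  # training-split class counts
target = 0.4
n_neg_expected = expected_negatives(n_pos, target)
keep_prob = n_neg_expected / n_neg  # per-negative probability of being kept

print(f"expected negatives kept: {n_neg_expected:.0f}")
print(f"keep probability per negative: {keep_prob:.5f}")
print(f"expected positive share: {n_pos / (n_pos + n_neg_expected):.2f}")
```

The observed 806 negatives and positive share of 0.406 are consistent with this expectation of 828 negatives and a share of 0.40; the gap comes from the random draw.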


In [10]:
# np.isclose avoids relying on exact float equality with the np.arange values
results_undrsamp = results_undrsamp[np.isclose(results_undrsamp['Target percentage'], 0.4)].drop('Target percentage', axis=1)
results_undrsamp

Unnamed: 0,Accuracy,F-Beta Score,F1-Score,Model
11,0.858471,0.523883,0.050918,Random Forest_UnderSampling
10,0.744768,0.422138,0.026933,Decision Tree_UnderSampling
9,0.845544,0.297781,0.027534,Logistic Regression_UnderSampling


# Exercise 15.3

Same analysis using random-over-sampling

In [11]:
import random
def OverSampling(X, y, target_percentage=0.5, seed=None):
    # Assuming minority class is the positive
    n_samples = y.shape[0]
    n_samples_0 = (y == 0).sum()
    n_samples_1 = (y == 1).sum()

    n_samples_1_new =  -target_percentage * n_samples_0 / (target_percentage- 1)

    np.random.seed(seed)
    filter_ = np.random.choice(X[y == 1].shape[0], int(n_samples_1_new))
    # filter_ is within the positives, change to be of all
    filter_ = np.nonzero(y == 1)[0][filter_]
    
    filter_ = np.concatenate((filter_, np.nonzero(y == 0)[0]), axis=0)
    
    return X[filter_], y[filter_]

In [12]:
y_pred_ovrsamp = pd.DataFrame(columns=models.keys())
results_ovrsamp = []

for target_pct in np.arange(0.1,0.6,0.1):
    for model in models.keys():
        X_ovrsamp, y_ovrsamp = OverSampling(X_train, y_train, target_percentage=target_pct, seed=seed)
        models[model].fit(X_ovrsamp, y_ovrsamp)
        y_pred_ovrsamp[model] = models[model].predict(X_test)
        results_ovrsamp.append({'Accuracy': metrics.accuracy_score(y_test, y_pred_ovrsamp[model]),
                        'F1-Score': metrics.f1_score(y_test, y_pred_ovrsamp[model]),
                        'F-Beta Score': metrics.fbeta_score(y_test, y_pred_ovrsamp[model], beta=10),
                        'Target percentage': target_pct,
                        'Model': str(model)+'_OverSampling'})

results_ovrsamp = pd.DataFrame(results_ovrsamp)
results_ovrsamp.sort_values('F-Beta Score', inplace=True, ascending=False)
results_ovrsamp

Unnamed: 0,Accuracy,F-Beta Score,F1-Score,Model,Target percentage
12,0.659178,0.427271,0.022602,Logistic Regression_OverSampling,0.5
9,0.877646,0.280641,0.031202,Logistic Regression_OverSampling,0.4
6,0.94668,0.190035,0.043122,Logistic Regression_OverSampling,0.3
4,0.989908,0.138787,0.139344,Decision Tree_OverSampling,0.2
1,0.988683,0.130375,0.119626,Decision Tree_OverSampling,0.1
7,0.990076,0.122533,0.12685,Decision Tree_OverSampling,0.3
10,0.989668,0.122449,0.122449,Decision Tree_OverSampling,0.4
13,0.989716,0.118386,0.119342,Decision Tree_OverSampling,0.5
8,0.993704,0.115086,0.176101,Random Forest_OverSampling,0.3
3,0.977846,0.104087,0.053388,Logistic Regression_OverSampling,0.2


When applying **random-over-sampling**, the best-performing model by F-Beta Score is Logistic Regression with target_percentage = 0.5.

In [13]:
# resample training data
X_res, y_res = OverSampling(X_train, y_train, target_percentage=0.5, seed=seed)

print('Resampled dataset shape %s' % Counter(y_res))
print('y.shape = ',y_res.shape[0], 'y.mean() = ', y_res.mean())

Resampled dataset shape Counter({1: 96552, 0: 96552})
y.shape =  193104 y.mean() =  0.5
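
The balanced counts above follow from the formula used in `OverSampling`, n1_new = p * n0 / (1 - p), which gives the number of positives needed for them to make up a fraction p of the resampled set. A small sketch of that calculation, with the hypothetical helper `oversampled_positives`:

```python
def oversampled_positives(n_neg, target_percentage):
    """Positives to draw (with replacement) so they form target_percentage of the data."""
    return int(target_percentage * n_neg / (1 - target_percentage))

n_neg = 96552  # negatives in the training split, from the Counter output above
for p in (0.25, 0.5):
    n_pos_new = oversampled_positives(n_neg, p)
    share = n_pos_new / (n_pos_new + n_neg)
    print(f"target={p:.2f} -> positives={n_pos_new}, achieved share={share:.4f}")
```

Because the 552 original positives are drawn with replacement, each one appears about 175 times on average at p = 0.5, so the resampled set contains many exact duplicates.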


In [14]:
# np.isclose avoids relying on exact float equality with the np.arange values
results_ovrsamp = results_ovrsamp[np.isclose(results_ovrsamp['Target percentage'], 0.5)].drop('Target percentage', axis=1)
results_ovrsamp

Unnamed: 0,Accuracy,F-Beta Score,F1-Score,Model
12,0.659178,0.427271,0.022602,Logistic Regression_OverSampling
13,0.989716,0.118386,0.119342,Decision Tree_OverSampling
14,0.993753,0.098685,0.155844,Random Forest_OverSampling


# Exercise 15.4 (3 points)

Evaluate the results using SMOTE

Which parameters did you choose?

In [16]:
def SMOTE(X, y, target_percentage=0.5, k=5, seed=None):
    # Calculate the NearestNeighbors
    from sklearn.neighbors import NearestNeighbors
    nearest_neighbour_ = NearestNeighbors(n_neighbors=k + 1)
    nearest_neighbour_.fit(X[y==1])
    nns = nearest_neighbour_.kneighbors(X[y==1], 
                                    return_distance=False)[:, 1:]
    # Assuming minority class is the positive
    n_samples = y.shape[0]
    n_samples_0 = (y == 0).sum()
    n_samples_1 = (y == 1).sum()
    
    # New samples
    n_samples_1_new =  int(-target_percentage * n_samples_0 / (target_percentage- 1) - n_samples_1)
    
    # A matrix to store the synthetic samples
    new = np.zeros((n_samples_1_new, X.shape[1]))
    
    # Create seeds
    np.random.seed(seed)
    seeds = np.random.randint(1, 1000000, 3)
    
    # Select examples to use as base
    np.random.seed(seeds[0])
    sel_ = np.random.choice(y[y==1].shape[0], n_samples_1_new)
    
    # Define random seeds (2 per example)
    np.random.seed(seeds[1])
    nn__=[]
    # Select one random neighbor for each example to use as base
    for i, sel in enumerate(sel_):
        nn__.append(np.random.choice(nns[sel]))
    
    np.random.seed(seeds[2])
    steps = np.random.uniform(size=n_samples_1_new)  

    # For each selected examples create one synthetic case
    for i, sel in enumerate(sel_):
        # Select neighbor
        nn_ = nn__[i]
        step = steps[i]
        # Create new sample
        new[i, :] = X[y==1][sel] - step * (X[y==1][sel] - X[y==1][nn_])
    
    X = np.vstack((X, new))
    y = np.append(y, np.ones(n_samples_1_new))
    
    return X, y

In [17]:
y_pred_smote = pd.DataFrame(columns=models.keys())
results_smote = []

for target_pct in [0.25, 0.5]:
    for k in [5, 15]:
        for model in models.keys():
            X_smote, y_smote = SMOTE(X_train, y_train, target_percentage=target_pct, k=k, seed=seed)
            models[model].fit(X_smote, y_smote)
            y_pred_smote[model] = models[model].predict(X_test)
            results_smote.append({'Accuracy': metrics.accuracy_score(y_test, y_pred_smote[model]),
                                'F1-Score': metrics.f1_score(y_test, y_pred_smote[model]),
                                'F-Beta Score': metrics.fbeta_score(y_test, y_pred_smote[model], beta=10),
                                'Target percentage': target_pct,
                                'Model': str(model)+'_SMOTE',
                                'K': k})

results_smote = pd.DataFrame(results_smote)
results_smote.sort_values('F-Beta Score', inplace=True, ascending=False)
results_smote

Unnamed: 0,Accuracy,F-Beta Score,F1-Score,K,Model,Target percentage
6,0.619122,0.434419,0.021483,5,Logistic Regression_SMOTE,0.5
9,0.602638,0.427175,0.02061,15,Logistic Regression_SMOTE,0.5
4,0.981089,0.172835,0.098511,15,Decision Tree_SMOTE,0.25
7,0.980104,0.156552,0.086093,5,Decision Tree_SMOTE,0.5
11,0.990389,0.15519,0.159664,15,Random Forest_SMOTE,0.5
5,0.991998,0.143361,0.173697,15,Random Forest_SMOTE,0.25
10,0.978927,0.140267,0.073918,15,Decision Tree_SMOTE,0.5
0,0.965279,0.137175,0.046205,5,Logistic Regression_SMOTE,0.25
3,0.964702,0.137047,0.045484,15,Logistic Regression_SMOTE,0.25
8,0.992047,0.135202,0.166247,5,Random Forest_SMOTE,0.5


When using **SMOTE**, the best-performing model by F-Beta Score is Logistic Regression with target_percentage = 0.5 and K = 5.
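
The core of the `SMOTE` function above is the interpolation `x - step * (x - neighbour)`, which places each synthetic sample on the line segment between a minority point and one of its k nearest minority neighbours. A minimal sketch of that single step on two made-up 2-D points follows; the helper `interpolate` and the coordinates are illustrative assumptions.

```python
import random

def interpolate(x, neighbour, step):
    """Synthetic point on the segment from x (step=0) to neighbour (step=1)."""
    return [xi - step * (xi - ni) for xi, ni in zip(x, neighbour)]

rng = random.Random(42)
x = [1.0, 2.0]          # a minority-class point (made up)
neighbour = [3.0, 6.0]  # one of its nearest minority neighbours (made up)
step = rng.random()     # uniform in [0, 1), like np.random.uniform in the function above
synthetic = interpolate(x, neighbour, step)
print(synthetic)
```

Since every synthetic point lies between two existing positives, SMOTE fills in the region spanned by the minority class instead of duplicating points as random over-sampling does.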

In [18]:
# Resample training data
X_res, y_res = SMOTE(X_train, y_train, target_percentage=0.50, k=5, seed=seed)

print('Resampled dataset shape %s' % Counter(y_res))
print('y.shape = ',y_res.shape[0], 'y.mean() = ', y_res.mean())

Resampled dataset shape Counter({0.0: 96552, 1.0: 96552})
y.shape =  193104 y.mean() =  0.5


In [19]:
results_smote = results_smote[(results_smote['Target percentage'] == 0.5) & (results_smote['K'] == 5)].drop(['Target percentage', 'K'], axis=1)
results_smote

Unnamed: 0,Accuracy,F-Beta Score,F1-Score,Model
6,0.619122,0.434419,0.021483,Logistic Regression_SMOTE
7,0.980104,0.156552,0.086093,Decision Tree_SMOTE
8,0.992047,0.135202,0.166247,Random Forest_SMOTE


# Exercise 15.5 (3 points)

Evaluate the results using Adaptive Synthetic Sampling Approach for Imbalanced
Learning (ADASYN)

http://www.ele.uri.edu/faculty/he/PDFfiles/adasyn.pdf
https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.ADASYN.html#rf9172e970ca5-1

In [20]:
from imblearn.over_sampling import ADASYN

# Resample training data
ada = ADASYN(sampling_strategy='minority', random_state=seed, n_neighbors=5, n_jobs=-1)
X_res, y_res = ada.fit_resample(X_train, y_train)

print('Resampled dataset shape %s' % Counter(y_res))
print('y.shape = ',y_res.shape[0], 'y.mean() = ', y_res.mean())

Resampled dataset shape Counter({1: 96604, 0: 96552})
y.shape =  193156 y.mean() =  0.5001346062250202
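
Unlike SMOTE, ADASYN does not spread synthetic samples evenly. Following the paper linked above, each minority point i receives a share proportional to r_i = delta_i / K, the fraction of its K nearest neighbours that belong to the majority class. The sketch below illustrates that allocation with made-up neighbour counts; the function `adasyn_allocation` is a simplified illustration and not imblearn's implementation.

```python
def adasyn_allocation(majority_neighbours, k, n_to_generate):
    """Synthetic samples per minority point, proportional to r_i = delta_i / k."""
    ratios = [delta / k for delta in majority_neighbours]
    total = sum(ratios)
    if total == 0:
        return [0] * len(ratios)
    # Normalise so the ratios sum to 1, then scale by the number of samples needed
    return [round(r / total * n_to_generate) for r in ratios]

# Hypothetical minority points with 0, 1, 4 and 5 majority neighbours out of k=5
allocation = adasyn_allocation([0, 1, 4, 5], k=5, n_to_generate=100)
print(allocation)
```

Points surrounded by majority-class neighbours get the most synthetic samples. Because each share is rounded to an integer, the total can differ slightly from the requested number, which may explain why the output above has 96,604 positives against 96,552 negatives.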


In [21]:
y_pred_ada = pd.DataFrame(columns=models.keys())
results_ada = []

# train, predict and evaluate each model
for model in models.keys():
    models[model].fit(X_res, y_res)
    y_pred_ada[model] = models[model].predict(X_test)
    results_ada.append({'Accuracy': metrics.accuracy_score(y_test, y_pred_ada[model]),
                        'F1-Score': metrics.f1_score(y_test, y_pred_ada[model]),
                        'Model': str(model)+'_ADASYN',
                        'F-Beta Score': metrics.fbeta_score(y_test, y_pred_ada[model], beta=10)}) #f-beta > 1 favors recall (punishing FN)

# store results on dataframe
results_ada = pd.DataFrame(results_ada)
results_ada

Unnamed: 0,Accuracy,F-Beta Score,F1-Score,Model
0,0.589903,0.426549,0.020208,Logistic Regression_ADASYN
1,0.980369,0.18063,0.099228,Decision Tree_ADASYN
2,0.992215,0.131153,0.164948,Random Forest_ADASYN


As before, F-Beta Score is the main metric, because catching fraudulent transactions matters more than avoiding false alarms:

- **Logistic Regression** performs best on F-Beta Score.
- **Decision Tree** and **Random Forest** reach high accuracy, but their low F-Beta scores suggest that they still miss most actual fraud.

# Exercise 15.6 (3 points)

Compare and comment on the results

In [22]:
df_final = results.merge(results_undrsamp, how='outer').merge(results_ovrsamp, how='outer').merge(results_smote, how='outer').merge(results_ada, how='outer').set_index('Model')
df_final.sort_values('F-Beta Score', inplace=True, ascending=False)
df_final

Model,Accuracy,F-Beta Score,F1-Score
Random Forest_UnderSampling,0.858471,0.523883,0.050918
Logistic Regression_SMOTE,0.619122,0.434419,0.021483
Logistic Regression_OverSampling,0.659178,0.427271,0.022602
Logistic Regression_ADASYN,0.589903,0.426549,0.020208
Decision Tree_UnderSampling,0.744768,0.422138,0.026933
Logistic Regression_UnderSampling,0.845544,0.297781,0.027534
Decision Tree_ADASYN,0.980369,0.18063,0.099228
Decision Tree_SMOTE,0.980104,0.156552,0.086093
Decision Tree,0.988971,0.146696,0.135593
Random Forest_SMOTE,0.992047,0.135202,0.166247


After training **three machine learning models** (Logistic Regression, Decision Tree and Random Forest) with **four resampling techniques** (random under-sampling, random over-sampling, SMOTE and ADASYN), the **best performance** was achieved by a Random Forest with under-sampling and target_percentage = 0.4 (F-Beta Score 0.524). **F-Beta Score** was used as the main metric because, in fraud detection, catching fraudulent transactions matters more than correctly labelling legitimate ones. The gain comes at a cost in accuracy, which drops from 0.994 for the Random Forest without resampling to 0.858.