# Exercise 15

# Fraud Detection

## Introduction

- Fraud Detection Dataset from Microsoft Azure: [data](http://gallery.cortanaintelligence.com/Experiment/8e9fe4e03b8b4c65b9ca947c72b8e463)

Fraud detection is one of the earliest industrial applications of data mining and machine learning. Fraud detection is typically handled as a binary classification problem, but the class population is unbalanced because instances of fraud are usually very rare compared to the overall volume of transactions. Moreover, when fraudulent transactions are discovered, the business typically takes measures to block the accounts from transacting to prevent further losses. 

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import fbeta_score
import seaborn as sns

In [2]:
import pandas as pd

url = 'https://raw.githubusercontent.com/albahnsen/PracticalMachineLearningClass/master/datasets/15_fraud_detection.csv.zip'
df = pd.read_csv(url, index_col=0)
df.head()

In [None]:
df.head()

In [None]:
df.shape, df.Label.sum(), df.Label.mean()

# Exercise 15.1

Estimate a Logistic Regression and a Decision Tree

Evaluate using the following metrics:
* Accuracy
* F1-Score
* F_Beta-Score (Beta=10)

Comment about the results
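To make the role of Beta concrete, here is a minimal sketch of the F-Beta formula computed directly from confusion-matrix counts. The helper name `fbeta_from_counts` and the counts are hypothetical and are not taken from the dataset; the point is only that Beta=10 weights recall far more heavily than precision.

```python
def fbeta_from_counts(tp, fp, fn, beta=1.0):
    """F-beta from confusion-matrix counts; beta > 1 weights recall more."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0

# Hypothetical counts: 90 of 100 frauds are caught, with 910 false alarms.
tp, fp, fn = 90, 910, 10
f1 = fbeta_from_counts(tp, fp, fn, beta=1)
f10 = fbeta_from_counts(tp, fp, fn, beta=10)
print(f"F1  = {f1:.4f}")
print(f"F10 = {f10:.4f}")
```

With these counts recall is 0.90 and precision is 0.09, so the F1-Score is low while the F-Beta score with Beta=10 stays close to the recall.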

In [None]:
seed = 42
##Train and Test before balancing test database
Y = df.Label
X = df.drop(['Label'], axis=1)
print("Y sample: ", Y.shape)
print("X sample: ", X.shape)

In [None]:
#Split the data into training and validation sets (70% / 30%)
from sklearn.model_selection import train_test_split
size = 0.30
X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=size, random_state=seed)
print ("Rows train: ", len(Y_train))
print ("Sum_Label train: ", sum(Y_train))

### DecisionTreeClassifier before balancing test database

In [None]:
#Train DecisionTreeClassifier and metrics
regTree = DecisionTreeClassifier()
regTree.fit(X_train, Y_train)

In [None]:
Y_pred = regTree.predict(X_validation)
print ("accuracy_score: ",accuracy_score(Y_validation, Y_pred))
print ("f1_score: ",f1_score(Y_validation, Y_pred))
print ("fbeta_score: ",fbeta_score(Y_validation, Y_pred, beta=10))

### LogisticRegression before balancing test database

In [None]:
#Train LogisticRegression and metrics
logreg = LogisticRegression(solver='liblinear', max_iter=200)
logreg.fit(X_train, Y_train)

In [None]:
Y_pred = logreg.predict(X_validation)
print ("accuracy_score: ",accuracy_score(Y_validation, Y_pred))
print ("f1_score: ",f1_score(Y_validation, Y_pred))
print ("fbeta_score: ",fbeta_score(Y_validation, Y_pred, beta=10))
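When commenting on these results, it may help to compare accuracy against a trivial baseline. The sketch below scores a classifier that always predicts "not fraud"; the function name and the counts are hypothetical, chosen only to illustrate why accuracy alone says little on an unbalanced dataset.

```python
def majority_baseline_scores(n_total, n_fraud):
    """Accuracy and recall of a classifier that always predicts label 0."""
    tn, fn = n_total - n_fraud, n_fraud  # every fraud is missed
    tp = fp = 0
    accuracy = (tp + tn) / n_total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall

# Hypothetical validation set: 10,000 transactions, 60 of them fraudulent.
acc, rec = majority_baseline_scores(10_000, 60)
print(f"accuracy = {acc:.4f}, recall = {rec:.1f}")
```

A model has to beat this baseline on recall-oriented metrics such as F1 or F-Beta, not just on accuracy, to be useful for fraud detection.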

# Exercise 15.2

Under-sample the negative class using random-under-sampling

Which value of target_percentage did you choose?
How do the results change?

**Only apply under-sampling to the training set, evaluate using the whole test set**
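The arithmetic behind target_percentage can be sketched as follows: if positives should make up a share p of the under-sampled training set, the number of negatives to keep is n_pos * (1 - p) / p. The helper `negatives_to_keep` and the class counts are hypothetical and only illustrate the formula.

```python
def negatives_to_keep(n_pos, n_neg, target_percentage):
    """Negatives to keep so positives make up target_percentage of the sample."""
    if not 0 < target_percentage < 1:
        raise ValueError("target_percentage must be in (0, 1)")
    wanted = round(n_pos * (1 - target_percentage) / target_percentage)
    return min(wanted, n_neg)  # cannot keep more negatives than exist

# Hypothetical training set: 100 frauds and 20,000 legitimate transactions.
n_pos, n_neg = 100, 20_000
for p in (0.1, 0.25, 0.5):
    kept = negatives_to_keep(n_pos, n_neg, p)
    print(p, kept, n_pos / (n_pos + kept))
```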

In [None]:
print ("Total rows in train dataset: ", len(Y_train))
print ("Sum frauds in train dataset: ", sum(Y_train))
non_fraud_indices = Y_train[Y_train == 0].index
print ("Sum non-frauds in train dataset: ", len(non_fraud_indices))

In [None]:
def GetInidicesUnderSampling(Y_train, target_percentage=0.5, seed=42):
    #Number of non-fraud rows to keep so that frauds make up target_percentage of the sample
    n_frauds = int(Y_train.sum())
    n_samples_0_new = int(n_frauds / target_percentage - n_frauds)

    #Take the non-fraud indices from the Y_train passed in, not from a global variable
    non_fraud_indices = Y_train[Y_train == 0].index
    n_samples_0_new = min(n_samples_0_new, len(non_fraud_indices))
    np.random.seed(seed)
    random_indices = np.random.choice(non_fraud_indices, n_samples_0_new, replace=False)

    #Find the indices of fraud samples
    fraud_indices = Y_train[Y_train == 1].index

    #Concatenate the fraud indices with the sampled non-fraud ones
    indices_train = np.concatenate([fraud_indices, random_indices])

    return indices_train

In [None]:
def UnderSampling(X, y, target_percentage=0.5, seed=42):
    # Assuming minority class is the positive
    n_samples = y.shape[0]
    n_samples_0 = (y == 0).sum()
    n_samples_1 = (y == 1).sum()

    # Negatives needed so that positives make up target_percentage of the sample
    n_samples_0_new = n_samples_1 / target_percentage - n_samples_1
    # Probability of keeping each negative, capped at 1 so it is a valid binomial p
    n_samples_0_new_per = min(1.0, n_samples_0_new / n_samples_0)

    np.random.seed(seed)
    # One Bernoulli draw per row: True means "keep this row if it is a negative"
    rand_1 = np.random.binomial(n=1, p=n_samples_0_new_per, size=n_samples).astype(bool)

    # Keep the randomly selected negatives plus every positive
    filter_ = ((y == 0) & rand_1) | (y == 1)

    return X[filter_], y[filter_]

# Test balanced model
### DecisionTreeClassifier under-sampling

In [None]:
#Train DecisionTreeClassifier and metrics
regTree = DecisionTreeClassifier()
eval_DTC = []

for target_percentage in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
    #indices_for_train = GetInidicesUnderSampling(Y_train, target_percentage)
    X_under_sample, Y_under_sample = UnderSampling(X_train, Y_train, target_percentage)
    regTree.fit(X_under_sample, Y_under_sample)
    Y_pred = regTree.predict(X_validation)
    eval_DTC.append([len(Y_under_sample),
                 target_percentage,
                 accuracy_score(Y_validation, Y_pred), 
                 f1_score(Y_validation, Y_pred), 
                 fbeta_score(Y_validation, Y_pred, beta=10)])

eval_DTC = pd.DataFrame(eval_DTC, columns=['number_records','target_percentage', 'Accuracy', 'F1 Score', 'FBeta Score'])
eval_DTC

### LogisticRegression under-sampling

In [None]:
#Train LogisticRegression and metrics
logreg = LogisticRegression(solver='liblinear', max_iter=200)
eval_logreg = []

for target_percentage in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
    #indices_for_train = GetInidicesUnderSampling(Y_train, target_percentage)
    X_under_sample, Y_under_sample = UnderSampling(X_train, Y_train, target_percentage)
    logreg.fit(X_under_sample, Y_under_sample)
    Y_pred = logreg.predict(X_validation)
    eval_logreg.append([len(Y_under_sample),
                 target_percentage,
                 accuracy_score(Y_validation, Y_pred), 
                 f1_score(Y_validation, Y_pred), 
                 fbeta_score(Y_validation, Y_pred, beta=10)])

eval_logreg = pd.DataFrame(eval_logreg, columns=['number_records','target_percentage', 'Accuracy', 'F1 Score', 'FBeta Score'])
eval_logreg

In [None]:
sns.lineplot(x="target_percentage", y="Accuracy", data=eval_DTC)
sns.lineplot(x="target_percentage", y="Accuracy", data=eval_logreg)

In [None]:
sns.lineplot(x="target_percentage", y="F1 Score", data=eval_DTC)
sns.lineplot(x="target_percentage", y="F1 Score", data=eval_logreg)

In [None]:
sns.lineplot(x="target_percentage", y="FBeta Score", data=eval_DTC)
sns.lineplot(x="target_percentage", y="FBeta Score", data=eval_logreg)

A target_percentage of 50% is selected because it is the point at which the FBeta Score and the F1 Score become stable. Accuracy keeps decreasing as target_percentage grows, because the test set consists mostly of negative cases.

The loss in accuracy comes mainly from false positives, that is, legitimate transactions that the model flags as fraud, as the following confusion matrix shows.

In [None]:
X_under_sample, Y_under_sample = UnderSampling(X_train, Y_train, 0.5)
logreg.fit(X_under_sample, Y_under_sample)
Y_pred = logreg.predict(X_validation)
# Confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(Y_validation, Y_pred)
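For binary labels 0 and 1, scikit-learn's `confusion_matrix` returns rows for the true class and columns for the predicted class, that is `[[TN, FP], [FN, TP]]`. The sketch below shows how that layout translates into accuracy, precision and recall; the helper `summarize_confusion` and the matrix values are hypothetical and are not results from this notebook.

```python
def summarize_confusion(cm):
    """Derive metrics from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_positives": fp,
    }

# Hypothetical matrix: most frauds are caught, at the cost of many false alarms.
cm = [[9000, 940], [6, 54]]
summary = summarize_confusion(cm)
for name, value in summary.items():
    print(name, round(value, 4))
```

A large FP count lowers accuracy and precision while leaving recall untouched, which matches the pattern described above.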

# Exercise 15.3

Same analysis using random-over-sampling

In [None]:
filter_ = np.random.choice(X_train[Y_train == 1].index, 10)
filter_
#X_train.iloc[[8626]]
X_train.loc[filter_]
#np.nonzero(Y_train == 1)[0][filter_]
X_train[Y_train == 1]
Y_train[137504]

In [None]:
def OverSampling(X, y, target_percentage=0.5, seed=42):
    # Assuming minority class is the positive
    n_samples_0 = (y == 0).sum()

    # Positives needed so that they make up target_percentage of the sample
    n_samples_1_new = target_percentage * n_samples_0 / (1 - target_percentage)

    np.random.seed(seed)
    # Draw positive row labels with replacement
    filter_ = np.random.choice(X[y == 1].index, int(n_samples_1_new))

    # Combine the drawn positives with every negative (label-based selection)
    filter_ = np.concatenate([filter_, X[y == 0].index])
    return X.loc[filter_], y.loc[filter_]
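The same idea can be checked on a toy example without pandas. This is a minimal sketch that mirrors the function above under one stated assumption: the positives are drawn with replacement and all negatives are kept. The names `positives_to_draw` and `random_over_sample` and the toy labels are hypothetical.

```python
import random

def positives_to_draw(n_neg, target_percentage):
    """Positives needed so they make up target_percentage of the sample."""
    return round(target_percentage * n_neg / (1 - target_percentage))

def random_over_sample(labels, target_percentage=0.5, seed=42):
    """Return row positions: positives drawn with replacement plus all negatives."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng = random.Random(seed)
    drawn = rng.choices(pos, k=positives_to_draw(len(neg), target_percentage))
    return drawn + neg

# Toy labels: 5 frauds among 100 transactions.
labels = [1] * 5 + [0] * 95
idx = random_over_sample(labels, target_percentage=0.5)
share = sum(labels[i] for i in idx) / len(idx)
print(len(idx), share)
```

Because the extra positives are exact copies of existing rows, over-sampling adds no new information; it only changes the weight the model gives to the minority class.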

### DecisionTreeClassifier random-over-sampling

In [None]:
#Train DecisionTreeClassifier and metrics
regTree = DecisionTreeClassifier()
eval_DTC = []

for target_percentage in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
    #indices_for_train = GetInidicesUnderSampling(Y_train, target_percentage)
    X_over_sample, Y_over_sample = OverSampling(X_train, Y_train, target_percentage)
    regTree.fit(X_over_sample, Y_over_sample)
    Y_pred = regTree.predict(X_validation)
    eval_DTC.append([len(Y_over_sample),
                 target_percentage,
                 accuracy_score(Y_validation, Y_pred), 
                 f1_score(Y_validation, Y_pred), 
                 fbeta_score(Y_validation, Y_pred, beta=10)])

eval_DTC = pd.DataFrame(eval_DTC, columns=['number_records','target_percentage', 'Accuracy', 'F1 Score', 'FBeta Score'])
eval_DTC

### LogisticRegression random-over-sampling

In [None]:
#Train LogisticRegression and metrics
logreg = LogisticRegression(solver='liblinear')
eval_logreg = []

for target_percentage in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
    #indices_for_train = GetInidicesUnderSampling(Y_train, target_percentage)
    X_over_sample, Y_over_sample = OverSampling(X_train, Y_train, target_percentage)
    logreg.fit(X_over_sample, Y_over_sample)
    Y_pred = logreg.predict(X_validation)
    eval_logreg.append([len(Y_over_sample),
                 target_percentage,
                 accuracy_score(Y_validation, Y_pred), 
                 f1_score(Y_validation, Y_pred), 
                 fbeta_score(Y_validation, Y_pred, beta=10)])

eval_logreg = pd.DataFrame(eval_logreg, columns=['number_records','target_percentage', 'Accuracy', 'F1 Score', 'FBeta Score'])
eval_logreg

In [None]:
sns.lineplot(x="target_percentage", y="Accuracy", data=eval_DTC)
sns.lineplot(x="target_percentage", y="Accuracy", data=eval_logreg)

In [None]:
sns.lineplot(x="target_percentage", y="F1 Score", data=eval_DTC)
sns.lineplot(x="target_percentage", y="F1 Score", data=eval_logreg)

In [None]:
sns.lineplot(x="target_percentage", y="FBeta Score", data=eval_DTC)
sns.lineplot(x="target_percentage", y="FBeta Score", data=eval_logreg)

The decision tree is barely affected by over-sampling, since its error metrics stay constant. In contrast, logistic regression shows high variation in accuracy, with F1 and FBeta scores that stabilize from 50% onward.

The balance for each model should therefore be:

- DecisionTreeClassifier: 20%
- LogisticRegression: 50%


# Exercise 15.4 (3 points)

Evaluate the results using SMOTE

Which parameters did you choose?
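For context on what `k_neighbors` controls, SMOTE builds each synthetic minority point by interpolating between a minority sample and one of its nearest minority neighbors. The following is a simplified pure-Python sketch of that idea, not the imbalanced-learn implementation; all function names and the toy points are hypothetical.

```python
import math
import random

def smote_point(x, neighbor, lam):
    """Point on the segment between x and neighbor, with lam in [0, 1]."""
    return [a + lam * (b - a) for a, b in zip(x, neighbor)]

def nearest_neighbors(points, i, k):
    """Positions of the k points closest to points[i], excluding itself."""
    dists = sorted((math.dist(points[i], p), j)
                   for j, p in enumerate(points) if j != i)
    return [j for _, j in dists[:k]]

def smote_sketch(minority, n_new, k=2, seed=42):
    """Generate n_new synthetic points from minority-class samples only."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))                    # random base sample
        j = rng.choice(nearest_neighbors(minority, i, k))   # one of its k neighbors
        synthetic.append(smote_point(minority[i], minority[j], rng.random()))
    return synthetic

# Toy minority class with four 2-D points.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
new_points = smote_sketch(minority, n_new=3)
print(new_points)
```

A larger `k_neighbors` lets synthetic points spread over a wider neighborhood, which can help or hurt depending on how much the classes overlap.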

In [None]:
from collections import Counter
from imblearn.over_sampling import SMOTE
X_smote, Y_smote = SMOTE(random_state=42,  k_neighbors=5, sampling_strategy="auto").fit_resample(X_train, Y_train)
print('Original dataset shape %s' % Counter(Y_train))
print('Resampled dataset shape auto %s' % Counter(Y_smote))
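One caveat when reusing the name target_percentage with imbalanced-learn: as far as the library documentation describes it, a float `sampling_strategy` is the desired ratio of minority to majority samples after resampling, not the share of positives in the resampled set. The hypothetical helpers below convert between the two conventions.

```python
def share_to_ratio(target_percentage):
    """Share of positives in the sample -> minority/majority ratio."""
    return target_percentage / (1 - target_percentage)

def ratio_to_share(sampling_strategy):
    """Minority/majority ratio -> share of positives in the sample."""
    return sampling_strategy / (1 + sampling_strategy)

for ratio in (0.1, 0.5, 1.0):
    print(f"sampling_strategy={ratio} -> share of positives={ratio_to_share(ratio):.3f}")
```

Under this reading, `sampling_strategy=1` corresponds to a 50% share of positives, and `sampling_strategy=0.5` to roughly one third.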

### DecisionTreeClassifier SMOTE

In [None]:
#Train DecisionTreeClassifier and metrics
regTree = DecisionTreeClassifier()
eval_DTC = []

for target_percentage in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]:
    X_smote, Y_smote = SMOTE(random_state=42,  k_neighbors=5, sampling_strategy=target_percentage).fit_resample(X_train, Y_train)
    regTree.fit(X_smote, Y_smote)
    Y_pred = regTree.predict(X_validation)
    eval_DTC.append([len(Y_smote),
                 target_percentage,
                 accuracy_score(Y_validation, Y_pred), 
                 f1_score(Y_validation, Y_pred), 
                 fbeta_score(Y_validation, Y_pred, beta=10)])

eval_DTC = pd.DataFrame(eval_DTC, columns=['number_records','target_percentage', 'Accuracy', 'F1 Score', 'FBeta Score'])
eval_DTC

### LogisticRegression SMOTE

In [None]:
#Train LogisticRegression and metrics
logreg = LogisticRegression(solver='liblinear')
eval_logreg = []

for target_percentage in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]:
    X_smote, Y_smote = SMOTE(random_state=42,  k_neighbors=5, sampling_strategy=target_percentage).fit_resample(X_train, Y_train)
    logreg.fit(X_smote, Y_smote)
    Y_pred = logreg.predict(X_validation)
    eval_logreg.append([len(Y_smote),
                 target_percentage,
                 accuracy_score(Y_validation, Y_pred), 
                 f1_score(Y_validation, Y_pred), 
                 fbeta_score(Y_validation, Y_pred, beta=10)])

eval_logreg = pd.DataFrame(eval_logreg, columns=['number_records','target_percentage', 'Accuracy', 'F1 Score', 'FBeta Score'])
eval_logreg

In [None]:
sns.lineplot(x="target_percentage", y="Accuracy", data=eval_DTC)
sns.lineplot(x="target_percentage", y="Accuracy", data=eval_logreg)

In [None]:
sns.lineplot(x="target_percentage", y="F1 Score", data=eval_DTC)
sns.lineplot(x="target_percentage", y="F1 Score", data=eval_logreg)

In [None]:
sns.lineplot(x="target_percentage", y="FBeta Score", data=eval_DTC)
sns.lineplot(x="target_percentage", y="FBeta Score", data=eval_logreg)

The decision tree is barely affected by over-sampling, since its error metrics stay almost constant. In contrast, logistic regression shows high variation in accuracy and FBeta Score, with an F1 Score that stabilizes from 50% onward, although it still drops slightly after 50%.

The balance for each model should therefore be:

- DecisionTreeClassifier: 20%
- LogisticRegression: 50%


# Exercise 15.5 (3 points)

Evaluate the results using Adaptive Synthetic Sampling Approach for Imbalanced
Learning (ADASYN)

http://www.ele.uri.edu/faculty/he/PDFfiles/adasyn.pdf
https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.ADASYN.html#rf9172e970ca5-1
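As a rough illustration of how ADASYN differs from SMOTE, the sketch below computes how many synthetic points each minority sample would receive. It follows the allocation step as described in the ADASYN paper: a minority sample with more majority-class points among its K nearest neighbors gets a larger share. This is a simplified pure-Python sketch, not the imbalanced-learn code; the function name, the toy points and the rounding are assumptions.

```python
import math

def adasyn_allocation(minority, majority, k=3, beta=1.0):
    """Synthetic points per minority sample, weighted by nearby majority points."""
    everyone = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    total_needed = (len(majority) - len(minority)) * beta
    ratios = []
    for idx, x in enumerate(minority):
        # k nearest neighbors among all other points, minority and majority alike
        others = [(math.dist(x, p), label)
                  for j, (p, label) in enumerate(everyone) if j != idx]
        nearest = sorted(others)[:k]
        ratios.append(sum(1 for _, label in nearest if label == 0) / k)
    norm = sum(ratios)
    if norm == 0:
        return [0] * len(minority)  # no minority point borders the majority class
    return [round(r / norm * total_needed) for r in ratios]

# Toy data: two minority points far from the majority, one surrounded by it.
minority = [[0.0, 0.0], [0.2, 0.0], [3.0, 3.0]]
majority = [[3.0, 3.2], [3.2, 3.0], [2.8, 3.0], [6.0, 6.0],
            [6.0, 6.2], [7.0, 7.0], [7.0, 7.2], [8.0, 8.0]]
allocation = adasyn_allocation(minority, majority)
print(allocation)
```

The minority point surrounded by majority samples receives most of the synthetic points, which is the adaptive behavior that distinguishes ADASYN from uniform SMOTE.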

In [None]:
#import library and sample the train dataset
from collections import Counter
from imblearn.over_sampling import ADASYN 
X_ada, y_ada = ADASYN(random_state=42, sampling_strategy=0.5).fit_resample(X_train, Y_train)
print('Original dataset shape %s' % Counter(Y_train))
print('Resampled dataset shape %s' % Counter(y_ada))

### DecisionTreeClassifier ADASYN

In [None]:
#Train DecisionTreeClassifier and metrics
regTree = DecisionTreeClassifier()
eval_DTC = []

for target_percentage in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]:
    X_ada, y_ada = ADASYN(random_state=42, sampling_strategy=target_percentage).fit_resample(X_train, Y_train)
    regTree.fit(X_ada, y_ada)
    Y_pred = regTree.predict(X_validation)
    eval_DTC.append([len(y_ada),
                 target_percentage,
                 accuracy_score(Y_validation, Y_pred), 
                 f1_score(Y_validation, Y_pred), 
                 fbeta_score(Y_validation, Y_pred, beta=10)])

eval_DTC = pd.DataFrame(eval_DTC, columns=['number_records','target_percentage', 'Accuracy', 'F1 Score', 'FBeta Score'])
eval_DTC

### LogisticRegression ADASYN

In [None]:
#Train LogisticRegression and metrics
logreg = LogisticRegression(solver='liblinear')
eval_logreg = []

for target_percentage in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]:
    X_ada, y_ada = ADASYN(random_state=42, sampling_strategy=target_percentage).fit_resample(X_train, Y_train)
    logreg.fit(X_ada, y_ada)
    Y_pred = logreg.predict(X_validation)
    eval_logreg.append([len(y_ada),
                 target_percentage,
                 accuracy_score(Y_validation, Y_pred), 
                 f1_score(Y_validation, Y_pred), 
                 fbeta_score(Y_validation, Y_pred, beta=10)])

eval_logreg = pd.DataFrame(eval_logreg, columns=['number_records','target_percentage', 'Accuracy', 'F1 Score', 'FBeta Score'])
eval_logreg

In [None]:
sns.lineplot(x="target_percentage", y="Accuracy", data=eval_DTC)
sns.lineplot(x="target_percentage", y="Accuracy", data=eval_logreg)

In [None]:
sns.lineplot(x="target_percentage", y="F1 Score", data=eval_DTC)
sns.lineplot(x="target_percentage", y="F1 Score", data=eval_logreg)

In [None]:
sns.lineplot(x="target_percentage", y="FBeta Score", data=eval_DTC)
sns.lineplot(x="target_percentage", y="FBeta Score", data=eval_logreg)

# Exercise 15.6 (3 points)

Compare and comment about the results

### DecisionTreeClassifier ADASYN

In [None]:
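#Re-sample explicitly with ADASYN at the selected ratio instead of reusing the data left over from the last loop iteration
target_percentage = 0.5
X_ada, y_ada = ADASYN(random_state=42, sampling_strategy=target_percentage).fit_resample(X_train, Y_train)

#Train DecisionTreeClassifier
regTreeADASYN = DecisionTreeClassifier()
regTreeADASYN.fit(X_ada, y_ada)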
Y_pred = regTreeADASYN.predict(X_validation)
eval_models =[]

eval_models.append(["DecisionTreeClassifier",
             "ADASYN",       
             len(y_ada),
             target_percentage,
             accuracy_score(Y_validation, Y_pred), 
             f1_score(Y_validation, Y_pred), 
             fbeta_score(Y_validation, Y_pred, beta=10)])

### LogisticRegression ADASYN

In [None]:
#Train LogisticRegression and metrics
logregADASYN = LogisticRegression(solver='liblinear')

logregADASYN.fit(X_ada, y_ada)
Y_pred = logregADASYN.predict(X_validation)

eval_models.append(["LogisticRegression",
             "ADASYN",
             len(y_ada),
             target_percentage,
             accuracy_score(Y_validation, Y_pred), 
             f1_score(Y_validation, Y_pred), 
             fbeta_score(Y_validation, Y_pred, beta=10)])

eval_models = pd.DataFrame(eval_models, columns=['model','Balanced by','number_records','target_percentage', 'Accuracy', 'F1 Score', 'FBeta Score'])
eval_models

The decision tree is barely affected by over-sampling, since its error metrics stay almost constant, with a maximum at 50%. In contrast, logistic regression shows high variation in accuracy and FBeta Score, with an F1 Score that stabilizes from 50% onward, although it still drops slightly after 50%.

The balance for each model should therefore be:

- DecisionTreeClassifier: 50%
- LogisticRegression: 50%