# Exercise 04

# Fraud Detection

## Introduction

- Fraud Detection Dataset from Microsoft Azure: [data](http://gallery.cortanaintelligence.com/Experiment/8e9fe4e03b8b4c65b9ca947c72b8e463)

Fraud detection is one of the earliest industrial applications of data mining and machine learning. Fraud detection is typically handled as a binary classification problem, but the class population is unbalanced because instances of fraud are usually very rare compared to the overall volume of transactions. Moreover, when fraudulent transactions are discovered, the business typically takes measures to block the accounts from transacting to prevent further losses. 

In [132]:
%matplotlib inline
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

In [133]:
import zipfile
with zipfile.ZipFile('../datasets/fraud_detection.csv.zip', 'r') as z:
    f = z.open('15_fraud_detection.csv')
    data = pd.read_csv(f, index_col=0)

In [134]:
data.head()

Unnamed: 0,accountAge,digitalItemCount,sumPurchaseCount1Day,sumPurchaseAmount1Day,sumPurchaseAmount30Day,paymentBillingPostalCode - LogOddsForClass_0,accountPostalCode - LogOddsForClass_0,paymentBillingState - LogOddsForClass_0,accountState - LogOddsForClass_0,paymentInstrumentAgeInAccount,ipState - LogOddsForClass_0,transactionAmount,transactionAmountUSD,ipPostalCode - LogOddsForClass_0,localHour - LogOddsForClass_0,Label
0,2000,0,0,0.0,720.25,5.064533,0.421214,1.312186,0.566395,3279.574306,1.218157,599.0,626.16465,1.259543,4.745402,0
1,62,1,1,1185.44,2530.37,0.538996,0.481838,4.40137,4.500157,61.970139,4.035601,1185.44,1185.44,3.981118,4.921349,0
2,2000,0,0,0.0,0.0,5.064533,5.096396,3.056357,3.155226,0.0,3.314186,32.09,32.09,5.00849,4.742303,0
3,1,1,0,0.0,0.0,5.064533,5.096396,3.331154,3.331239,0.0,3.529398,133.28,132.729554,1.324925,4.745402,0
4,1,1,0,0.0,132.73,5.412885,0.342945,5.563677,4.086965,0.001389,3.529398,543.66,543.66,2.693451,4.876771,0


In [135]:
data.shape, data.Label.sum(), data.Label.mean()

((138721, 16), 797, 0.0057453449730033666)

In [136]:
X = data.drop(['Label'], axis=1)
y = data['Label']

# Exercise 04.1

Estimate a Logistic Regression

Evaluate using the following metrics:
* Accuracy
* F1-Score
* F_Beta-Score (Beta=10)

Comment about the results
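
The F_Beta score listed above treats recall as Beta times as important as precision. As a minimal sketch (the helper `f_beta` and the toy labels are illustrative and not part of the exercise), it can be computed from the confusion counts; `sklearn.metrics.fbeta_score(y_true, y_pred, beta=10)` should give the same value:

```python
def f_beta(y_true, y_pred, beta=1.0):
    """F-beta for binary 0/1 labels; 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy example (hypothetical labels): 3 frauds, only 1 of them detected,
# no false alarms, so precision = 1 and recall = 1/3.
toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 0, 0, 0, 0, 0]

f1 = f_beta(toy_true, toy_pred, beta=1)    # balances precision and recall
f10 = f_beta(toy_true, toy_pred, beta=10)  # dominated by recall
print(f1, f10)
```

With Beta=10 the score stays close to recall, so a model that misses most frauds scores poorly even when its precision is perfect.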

In [137]:
# check for missing values
data.isnull().sum()

accountAge                                      0
digitalItemCount                                0
sumPurchaseCount1Day                            0
sumPurchaseAmount1Day                           0
sumPurchaseAmount30Day                          0
paymentBillingPostalCode - LogOddsForClass_0    0
accountPostalCode - LogOddsForClass_0           0
paymentBillingState - LogOddsForClass_0         0
accountState - LogOddsForClass_0                0
paymentInstrumentAgeInAccount                   0
ipState - LogOddsForClass_0                     0
transactionAmount                               0
transactionAmountUSD                            0
ipPostalCode - LogOddsForClass_0                0
localHour - LogOddsForClass_0                   0
Label                                           0
dtype: int64

In [138]:
# split the data into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=2)

In [139]:
# fit a logistic regression model and store the class predictions
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
Log1=logreg.fit(x_train, y_train)
Log1

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [140]:
# predict on the test set
y_pred=Log1.predict(x_test)

In [141]:
# performance metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy', accuracy_score(y_test, y_pred))
print('precision', precision_score(y_test, y_pred))
print('recall', recall_score(y_test, y_pred))
print('f1 score', f1_score(y_test, y_pred))

Accuracy 0.993829474352
precision 0.0
recall 0.0
f1 score 0.0


In [142]:
# number of positive predictions on the test set
y_pred.sum()

2

Because the target variable is so imbalanced, the model predicts very few positive cases (only 2 on the test set, neither of them correct), which is why precision, recall and the F1 score are all zero. The accuracy is high, but that does not mean the model detects fraud properly.
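
To put the accuracy in context, the sketch below computes the accuracy of a trivial rule that labels every transaction as legitimate, using the class counts printed earlier for the full dataset (797 frauds in 138721 rows). The variable names are illustrative, and the rate in the test split will differ slightly from the full-data rate.

```python
n_rows = 138721    # rows in the full dataset (from data.shape)
n_fraud = 797      # positive labels (from data.Label.sum())

# A rule that always predicts "no fraud" is right on every negative row
# and wrong on every positive row.
baseline_accuracy = (n_rows - n_fraud) / n_rows
baseline_recall = 0 / n_fraud   # it never catches a fraud

print(round(baseline_accuracy, 4), baseline_recall)
```

The logistic regression above scores roughly the same accuracy as this rule, which is why recall, F1 and F_Beta are the more informative metrics here.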

# Exercise 04.2

Under-sample the negative class using random under-sampling

Which value of target_percentage did you choose?
How do the results change?

**Only apply under-sampling to the training set, evaluate using the whole test set**

In [143]:
# define the under-sampling function
def UnderSampling(X, y, target_percentage=0.5, seed=None):
    # Assuming minority class is the positive
    n_samples = y.shape[0]
    n_samples_0 = (y == 0).sum()
    n_samples_1 = (y == 1).sum()

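    # number of negatives to keep so that positives make up target_percentage
    n_samples_0_new = n_samples_1 / target_percentage - n_samples_1
    # fraction of the original negatives that this represents
    n_samples_0_new_per = n_samples_0_new / n_samples_0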

    filter_ = y == 0

    np.random.seed(seed)
    rand_1 = np.random.binomial(n=1, p=n_samples_0_new_per, size=n_samples)
    
    filter_ = filter_ & rand_1
    filter_ = filter_ | (y == 1)
    filter_ = filter_.astype(bool)
    
    return X[filter_], y[filter_]

In [144]:
n_positivo_original = (y_train == 1).sum()
n_negativo_original = (y_train == 0).sum()
float(n_positivo_original)/(float(n_positivo_original)+float(n_negativo_original))


0.005622837370242215

Less than 1% of the training set is fraud. A target proportion of 30% positives is chosen.
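
The effect of `target_percentage` can be checked by hand. The sketch below uses hypothetical class counts (not the real training counts) and the same formula as `UnderSampling` to show how many negatives are kept on average; `negatives_to_keep` is an illustrative helper.

```python
def negatives_to_keep(n_pos, target_percentage):
    """Negatives needed so that positives are target_percentage of the sample."""
    return n_pos / target_percentage - n_pos


# Hypothetical counts for illustration only.
n_pos, n_neg = 600, 99400

n_neg_new = negatives_to_keep(n_pos, target_percentage=0.3)
keep_probability = n_neg_new / n_neg            # per-row sampling probability
resulting_share = n_pos / (n_pos + n_neg_new)   # share of positives afterwards

print(n_neg_new, keep_probability, resulting_share)
```

A higher `target_percentage` discards more negatives, which tends to raise recall but leaves the model with less information about legitimate transactions.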


In [145]:
# randomly build the balanced training set
X_u, y_u = UnderSampling(x_train, y_train, target_percentage=0.3)


In [146]:
n_positivo_original = (y_u == 1).sum()
n_negativo_original = (y_u == 0).sum()
float(n_positivo_original)/(float(n_positivo_original)+float(n_negativo_original))

0.2950075642965204
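
The share obtained above (about 0.295) is close to, but not exactly, 0.30 because `UnderSampling` keeps each negative row independently with a fixed probability. If an exact proportion is wanted, one alternative is to draw a fixed number of negatives without replacement. This is a sketch and not the function used in this notebook; `exact_under_sample_index` is a hypothetical helper.

```python
import numpy as np


def exact_under_sample_index(y, target_percentage=0.5, seed=None):
    """Return row positions: all positives plus a fixed-size draw of negatives."""
    y = np.asarray(y)
    rng = np.random.RandomState(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    # same formula as UnderSampling, rounded to a whole number of rows
    n_neg_new = int(round(len(pos_idx) / target_percentage - len(pos_idx)))
    n_neg_new = min(n_neg_new, len(neg_idx))
    kept_neg = rng.choice(neg_idx, size=n_neg_new, replace=False)
    return np.sort(np.concatenate([pos_idx, kept_neg]))


# Toy labels (hypothetical): 30 positives among 1000 rows.
y_toy = np.array([1] * 30 + [0] * 970)
idx = exact_under_sample_index(y_toy, target_percentage=0.3, seed=0)
share = y_toy[idx].mean()
print(len(idx), share)
```

With pandas objects the returned positions could be applied as `X.iloc[idx], y.iloc[idx]`.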

In [147]:
# fit the model on the balanced training set
logreg_undersampling = LogisticRegression()
logreg_under=logreg_undersampling.fit(X_u, y_u)
logreg_under

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [148]:
# predict on the full (unsampled) test set
y_pred_under=logreg_under.predict(x_test)

In [149]:
# performance metrics
print('Accuracy', accuracy_score(y_test, y_pred_under))
print('precision', precision_score(y_test, y_pred_under))
print('recall', recall_score(y_test, y_pred_under))
print('f1 score', f1_score(y_test, y_pred_under))


Accuracy 0.952971367608
precision 0.0316831683168
recall 0.22641509434
f1 score 0.0555877243775


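With under-sampling the model identifies fraud better: recall rises from 0 to about 0.23 and the F1 score from 0 to about 0.056. The price is a lower accuracy (0.953 instead of 0.994) and a precision of only about 3%.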

# Exercise 04.3

Now using random-over-sampling

In [150]:
# define the over-sampling function
def OverSampling(X, y, target_percentage=0.5, seed=None):
    # Assuming minority class is the positive
    n_samples = y.shape[0]
    n_samples_0 = (y == 0).sum()
    n_samples_1 = (y == 1).sum()

    # positives needed so that n_1 / (n_0 + n_1) equals target_percentage
    n_samples_1_new = target_percentage * n_samples_0 / (1 - target_percentage)

    np.random.seed(seed)
    filter_ = np.random.choice(X[y == 1].shape[0], int(n_samples_1_new))
    # filter_ is within the positives, change to be of all
    filter_ = np.nonzero(y == 1)[0][filter_]
    
    filter_ = np.concatenate((filter_, np.nonzero(y == 0)[0]), axis=0)
    
    return X.iloc[filter_], y.iloc[filter_]

In [152]:
# randomly build the balanced training set; for consistency with the
# previous exercise, the same proportion is used
X_o, y_o = OverSampling(x_train, y_train, target_percentage=0.3)

In [153]:
n_positivo_original = (y_o == 1).sum()
n_negativo_original = (y_o == 0).sum()
float(n_positivo_original)/(float(n_positivo_original)+float(n_negativo_original))

0.29999594024033777
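
The share obtained above is marginally below 0.30, most likely because `OverSampling` truncates the required number of positives to an integer. A small check of the formula with a hypothetical count (`positives_needed` is an illustrative helper):

```python
def positives_needed(n_neg, target_percentage):
    """Positives required so that they make up target_percentage of the sample."""
    return target_percentage * n_neg / (1 - target_percentage)


# Hypothetical count for illustration only.
n_neg = 1000

n_pos_exact = positives_needed(n_neg, target_percentage=0.3)  # about 428.57
n_pos_drawn = int(n_pos_exact)                                # truncated to 428
share = n_pos_drawn / (n_pos_drawn + n_neg)

print(n_pos_exact, n_pos_drawn, share)
```

The formula follows from solving n_1 / (n_0 + n_1) = target_percentage for n_1 while keeping every negative row.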

In [154]:
# fit the model on the balanced training set
logreg_overundersampling = LogisticRegression()
logreg_over=logreg_overundersampling.fit(X_o, y_o)
logreg_over

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [155]:
# predict on the full (unsampled) test set
y_pred_over=logreg_over.predict(x_test)

In [156]:
# performance metrics
print('Accuracy', accuracy_score(y_test, y_pred_over))
print('precision', precision_score(y_test, y_pred_over))
print('recall', recall_score(y_test, y_pred_over))
print('f1 score', f1_score(y_test, y_pred_over))

Accuracy 0.962400161472
precision 0.0333333333333
recall 0.183962264151
f1 score 0.0564399421129


# Exercise 04.4*
Evaluate the results using SMOTE

Which parameters did you choose?
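
For orientation, SMOTE creates synthetic minority rows by interpolating between a minority point and one of its nearest minority neighbours. The function below is a simplified sketch of that idea under stated assumptions (Euclidean distance, a full pairwise distance matrix, a hypothetical helper name `smote_sketch`); it is not the reference implementation. The `imbalanced-learn` package provides a `SMOTE` class that would normally be used instead.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=None):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    X_min = np.asarray(X_min, dtype=float)
    n = X_min.shape[0]
    if n < 2:
        raise ValueError("need at least two minority rows to interpolate")
    rng = np.random.RandomState(seed)
    k = min(k, n - 1)

    # pairwise Euclidean distances within the minority class
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)              # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.randint(0, n, size=n_new)        # random minority rows
    picked = neighbours[base, rng.randint(0, k, size=n_new)]
    gap = rng.uniform(0, 1, size=(n_new, 1))    # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Toy minority class (hypothetical points in 2-D).
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_synthetic = smote_sketch(X_minority, n_new=10, k=2, seed=0)
print(X_synthetic.shape)
```

The two parameters to choose are the number of neighbours `k` and the number of synthetic rows, which plays the same role as `target_percentage` in the previous exercises. As before, the synthetic rows would be added to the training set only.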

# Exercise 04.5

Estimate a Logistic Regression, GaussianNB, K-nearest neighbors and a Decision Tree **Classifiers**

Evaluate using the following metrics:
* Accuracy
* F1-Score
* F_Beta-Score (Beta=10)

Comment about the results

Combine the classifiers and comment
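
One simple way to combine the classifiers is a majority vote over their hard predictions. The sketch below uses hypothetical prediction vectors rather than fitted models, and `majority_vote` is an illustrative helper; scikit-learn's `VotingClassifier` offers a ready-made version.

```python
import numpy as np


def majority_vote(predictions):
    """Combine 0/1 predictions (one row per classifier) by strict majority."""
    predictions = np.asarray(predictions)
    votes_for_fraud = predictions.sum(axis=0)
    # ties (possible with an even number of classifiers) are resolved as 0
    return (votes_for_fraud > predictions.shape[0] / 2).astype(int)


# Hypothetical predictions of four classifiers on five transactions.
preds = [
    [0, 1, 0, 1, 0],   # e.g. logistic regression
    [0, 1, 1, 0, 0],   # e.g. Gaussian naive Bayes
    [0, 1, 0, 1, 1],   # e.g. k-nearest neighbours
    [0, 0, 0, 1, 0],   # e.g. decision tree
]
combined = majority_vote(preds)
print(combined)
```

Resolving ties as 0 favours precision; with rare fraud, resolving them as 1 (or lowering the vote threshold) would trade precision for recall.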

In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Exercise 04.6

Using the under-sampled dataset

Evaluate a RandomForestClassifier and compare the results

Change n_estimators=100. What happened?

In [160]:
from sklearn.ensemble import RandomForestClassifier
Random_forest_model = RandomForestClassifier()
Forest=Random_forest_model.fit(X_u, y_u)

In [161]:
# predict on the full (unsampled) test set
y_pred_random=Forest.predict(x_test)

In [163]:
# performance metrics
print('Accuracy', accuracy_score(y_test, y_pred_random))
print('precision', precision_score(y_test, y_pred_random))
print('recall', recall_score(y_test, y_pred_random))
print('f1 score', f1_score(y_test, y_pred_random))

Accuracy 0.922407081687
precision 0.0366355140187
recall 0.462264150943
f1 score 0.0678905438171


Model with n_estimators=100

In [164]:
Random_forest_model_100 = RandomForestClassifier(n_estimators=100)
Forest_100=Random_forest_model_100.fit(X_u, y_u)

In [165]:
# predict on the full (unsampled) test set
y_pred_random_100=Forest_100.predict(x_test)

In [166]:
# performance metrics
print('Accuracy', accuracy_score(y_test, y_pred_random_100))
print('precision', precision_score(y_test, y_pred_random_100))
print('recall', recall_score(y_test, y_pred_random_100))
print('f1 score', f1_score(y_test, y_pred_random_100))

Accuracy 0.925809521063
precision 0.0436799381523
recall 0.533018867925
f1 score 0.0807431225438


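The performance metrics improve with n_estimators = 100: recall rises from about 0.46 to 0.53, precision from about 0.037 to 0.044 and the F1 score from about 0.068 to 0.081. The forest now averages over more trees, each grown on a different bootstrap sample with random feature subsets, so its predictions are less variable. Since neither forest was fitted with a fixed random_state, part of the difference may also be run-to-run noise.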
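
Why more voters can help is easiest to see in an idealized setting. The sketch below computes the probability that a strict majority of independent voters is correct when each one is right with the same probability. Both the independence and the per-tree accuracy of 0.6 are assumptions made for illustration: the trees of a real forest are correlated, and scikit-learn's forest averages predicted probabilities rather than counting hard votes, so actual gains are smaller.

```python
from math import comb


def majority_correct_probability(n_voters, p_correct):
    """P(more than half of n independent voters are right), n_voters odd."""
    need = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p_correct ** k * (1 - p_correct) ** (n_voters - k)
        for k in range(need, n_voters + 1)
    )


# Hypothetical per-tree accuracy of 0.6 on a single transaction.
for n_trees in (1, 11, 101):
    print(n_trees, round(majority_correct_probability(n_trees, 0.6), 3))
```

Odd ensemble sizes are used here only to avoid ties in the vote.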