# Exercise 04

# Fraud Detection

## Introduction

- Fraud Detection Dataset from Microsoft Azure: [data](http://gallery.cortanaintelligence.com/Experiment/8e9fe4e03b8b4c65b9ca947c72b8e463)

Fraud detection is one of the earliest industrial applications of data mining and machine learning. It is typically handled as a binary classification problem, but the classes are highly imbalanced because fraudulent transactions are rare compared to the overall volume of transactions. Moreover, when fraudulent transactions are discovered, the business typically blocks the affected accounts from transacting to prevent further losses.

In [57]:
%matplotlib inline
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

In [58]:
import zipfile
with zipfile.ZipFile('../datasets/fraud_detection.csv.zip', 'r') as z:
    f = z.open('15_fraud_detection.csv')
    # The file is comma-separated; the first column holds the row index
    data = pd.read_csv(f, index_col=0)

In [59]:
data.head()

Unnamed: 0,accountAge,digitalItemCount,sumPurchaseCount1Day,sumPurchaseAmount1Day,sumPurchaseAmount30Day,paymentBillingPostalCode - LogOddsForClass_0,accountPostalCode - LogOddsForClass_0,paymentBillingState - LogOddsForClass_0,accountState - LogOddsForClass_0,paymentInstrumentAgeInAccount,ipState - LogOddsForClass_0,transactionAmount,transactionAmountUSD,ipPostalCode - LogOddsForClass_0,localHour - LogOddsForClass_0,Label
0,2000,0,0,0.0,720.25,5.064533,0.421214,1.312186,0.566395,3279.574306,1.218157,599.0,626.16465,1.259543,4.745402,0
1,62,1,1,1185.44,2530.37,0.538996,0.481838,4.40137,4.500157,61.970139,4.035601,1185.44,1185.44,3.981118,4.921349,0
2,2000,0,0,0.0,0.0,5.064533,5.096396,3.056357,3.155226,0.0,3.314186,32.09,32.09,5.00849,4.742303,0
3,1,1,0,0.0,0.0,5.064533,5.096396,3.331154,3.331239,0.0,3.529398,133.28,132.729554,1.324925,4.745402,0
4,1,1,0,0.0,132.73,5.412885,0.342945,5.563677,4.086965,0.001389,3.529398,543.66,543.66,2.693451,4.876771,0


In [60]:
data.shape, data.Label.sum(), data.Label.mean()

((138721, 16), 797, 0.0057453449730033666)
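
The counts printed above make the imbalance concrete. As a quick worked check (a sketch that uses only the numbers shown in that output, with hypothetical variable names), a model that always predicts "not fraud" would already reach roughly 99.4% accuracy, which is why accuracy alone says little on this dataset:

```python
# Numbers taken from the output of data.shape and data.Label.sum() above
n_rows = 138721
n_fraud = 797

fraud_rate = n_fraud / n_rows
# Accuracy of a hypothetical model that labels every transaction as non-fraud
baseline_accuracy = 1 - fraud_rate

print(round(fraud_rate, 6), round(baseline_accuracy, 6))
```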

In [110]:
X = data.drop(['Label'], axis=1)
y = data['Label']

# Exercise 04.1

Estimate a Logistic Regression

Evaluate using the following metrics:
* Accuracy
* F1-Score
* F_Beta-Score (Beta=10)

Comment about the results
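
To make the three metrics concrete, here is a minimal sketch that computes them from confusion counts on a hypothetical toy example (the labels and the helper `f_beta` are illustrative, not part of the exercise data). scikit-learn offers `accuracy_score`, `f1_score` and `fbeta_score` for the same purpose. With Beta=10, recall weighs far more than precision, so missed frauds are penalised much more than false alarms.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-beta from confusion counts; returns 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

# Hypothetical toy labels: 3 frauds among 10 transactions
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
f1 = f_beta(tp, fp, fn, beta=1)
f10 = f_beta(tp, fp, fn, beta=10)

print(accuracy, f1, f10)
```

In this toy case accuracy looks acceptable while both F-scores are low, because two of the three frauds are missed.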

In [138]:
feature_cols = X.columns.values.tolist()
bestX = X[feature_cols]

logreg = LogisticRegression()

# 10-fold cross-validated F1-score
resultsF1 = cross_val_score(logreg, bestX, y, cv=10, scoring='f1')

print("The mean F1-score is: ", resultsF1.mean())

# scoring='roc_auc' measures the area under the ROC curve, not accuracy
resultsAUC = cross_val_score(logreg, bestX, y, cv=10, scoring='roc_auc')

print("The mean ROC AUC is: ", resultsAUC.mean())

#resultsFB = cross_val_score(logreg, bestX, y, cv=10, scoring=metrics.make_scorer(metrics.fbeta_score,beta=2))

#print("The mean F-Beta is: ", resultsFB.mean())


  'precision', 'predicted', average, warn_for)


The mean F1-score is:  0.0
The mean ROC AUC is:  0.48366505689243466


#### As can be seen, the model performs poorly at predicting fraud. This is a consequence of the training data set being imbalanced.

# Exercise 04.2

Under-sample the negative class using random-under-sampling

Which value of target_percentage did you choose?
How do the results change?

**Only apply under-sampling to the training set, evaluate using the whole test set**
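
The order of operations matters here: split first, then re-sample only the training part, so the test set keeps the original class ratio. A minimal sketch on hypothetical labels (index arithmetic only, no model; it samples an exact number of negatives with `choice`, which is an assumption and not the method used in the cell below):

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical labels: 1,000 transactions, 2% fraud
y_demo = np.zeros(1000, dtype=int)
y_demo[:20] = 1
rng.shuffle(y_demo)

# 1) Split first (75% train / 25% test), before any re-sampling
idx = rng.permutation(len(y_demo))
train_idx, test_idx = idx[:750], idx[750:]

# 2) Under-sample the negatives among the training indices only
train_pos = train_idx[y_demo[train_idx] == 1]
train_neg = train_idx[y_demo[train_idx] == 0]
kept_neg = rng.choice(train_neg, size=len(train_pos), replace=False)
train_balanced = np.concatenate([train_pos, kept_neg])

print('train fraud share:', y_demo[train_balanced].mean())
print('test fraud share: ', y_demo[test_idx].mean())
```

A model would then be fitted on `train_balanced` and scored on every row in `test_idx`.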

In [63]:
def UnderSampling(X, y, target_percentage=0.5, seed=None):
    # Assuming minority class is the positive
    n_samples = y.shape[0]
    n_samples_0 = (y == 0).sum()
    n_samples_1 = (y == 1).sum()

    n_samples_0_new =  n_samples_1 / target_percentage - n_samples_1
    n_samples_0_new_per = n_samples_0_new / n_samples_0

    filter_ = y == 0

    np.random.seed(seed)
    rand_1 = np.random.binomial(n=1, p=n_samples_0_new_per, size=n_samples)
    
    filter_ = filter_ & rand_1
    filter_ = filter_ | (y == 1)
    filter_ = filter_.astype(bool)
    
    return X[filter_], y[filter_]
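
The expression for `n_samples_0_new` follows from requiring that positives make up `target_percentage` of the re-sampled data: n1 / (n1 + n0_new) = p gives n0_new = n1 / p - n1. A small worked check with hypothetical class counts (the helper name is illustrative):

```python
def negatives_to_keep(n_pos, target_percentage):
    """Negatives needed so that positives make up target_percentage of the sample."""
    return n_pos / target_percentage - n_pos

n_pos, n_neg = 600, 100000  # hypothetical class counts
for p in [0.1, 0.2, 0.3, 0.4, 0.5]:
    n_neg_new = negatives_to_keep(n_pos, p)
    keep_prob = n_neg_new / n_neg          # per-row keep probability for the binomial draw
    share = n_pos / (n_pos + n_neg_new)    # resulting share of positives
    print(p, round(n_neg_new), round(keep_prob, 4), round(share, 2))
```

Because `UnderSampling` keeps each negative row independently with probability `keep_prob`, the realised number of negatives is random and the target share is met only in expectation.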

In [64]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

for target_percentage in [0.1, 0.2, 0.3, 0.4, 0.5]:
    Xp, yp = UnderSampling(X_train, y_train, target_percentage, 1)
    
    feature_cols = Xp.columns.values.tolist()
    bestX = Xp[feature_cols]

    logreg = LogisticRegression()
    
    # Note: this cross-validates on the under-sampled training data; X_test is never scored
    results = cross_val_score(logreg, bestX, yp, cv=10, scoring='roc_auc')
    
    print('Target percentage', target_percentage, '---> AUC: ', results.mean())
    
    

Target percentage 0.1 ---> AUC:  0.5827339218902433
Target percentage 0.2 ---> AUC:  0.7032032753846844
Target percentage 0.3 ---> AUC:  0.6992821407304561
Target percentage 0.4 ---> AUC:  0.6917675340153595
Target percentage 0.5 ---> AUC:  0.7110848790591323


# Exercise 04.3

Now using random-over-sampling

In [83]:
import random
def OverSampling(X, y, target_percentage=0.5, seed=None):
    # Assuming minority class is the positive
    n_samples = y.shape[0]
    n_samples_0 = (y == 0).sum()
    n_samples_1 = (y == 1).sum()

    n_samples_1_new =  -target_percentage * n_samples_0 / (target_percentage- 1)

    np.random.seed(seed)
    filter_ = np.random.choice(X[y == 1].shape[0], int(n_samples_1_new))
    # filter_ is within the positives, change to be of all
    filter_ = np.nonzero(y == 1)[0][filter_]
    
    filter_ = np.concatenate((filter_, np.nonzero(y == 0)[0]), axis=0)
    
    return X.iloc[filter_], y.iloc[filter_]
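
The target count of positives comes from n1_new / (n1_new + n0) = p, which gives n1_new = p * n0 / (1 - p); the function writes the same quantity as -p * n0 / (p - 1). A short check with hypothetical counts (the helper name is illustrative):

```python
def positives_needed(n_neg, target_percentage):
    """Positives required so that they make up target_percentage of the sample."""
    return target_percentage * n_neg / (1 - target_percentage)

n_neg = 100000  # hypothetical number of negatives
for p in [0.1, 0.2, 0.3, 0.4, 0.5]:
    n_pos_new = positives_needed(n_neg, p)
    as_in_function = -p * n_neg / (p - 1)   # form used in OverSampling
    share = n_pos_new / (n_pos_new + n_neg)
    print(p, round(n_pos_new), round(as_in_function), round(share, 2))
```

Since `np.random.choice` draws with replacement by default, the same fraud rows are repeated many times to reach that count.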

In [84]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [85]:
Xp, yp = OverSampling(X_train, y_train, target_percentage=0.25)

In [86]:
for target_percentage in [0.1, 0.2, 0.3, 0.4, 0.5]:
    Xp, yp = OverSampling(X_train, y_train, target_percentage, 1)
    
    feature_cols = Xp.columns.values.tolist()
    bestX = Xp[feature_cols]

    logreg = LogisticRegression()
    
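    # Note: over-sampled duplicates can land in both the training and validation folds,
    # which tends to inflate this AUC; X_test is never scored
    results = cross_val_score(logreg, bestX, yp, cv=10, scoring='roc_auc')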
    
    print('Target percentage', target_percentage, '---> AUC: ', results.mean())

Target percentage 0.1 ---> AUC:  0.5602807306722446
Target percentage 0.2 ---> AUC:  0.6469435511297384
Target percentage 0.3 ---> AUC:  0.7357758791077718
Target percentage 0.4 ---> AUC:  0.7334482884607392
Target percentage 0.5 ---> AUC:  0.7297432131850267


# Exercise 04.4*
Evaluate the results using SMOTE

Which parameters did you choose?

In [91]:
def SMOTE(X, y, target_percentage=0.5, k=5, seed=None):
    
    # New samples
    n_samples = y.shape[0]
    n_samples_0 = (y == 0).sum()
    n_samples_1 = (y == 1).sum()
    
    n_samples_1_new =  int(-target_percentage * n_samples_0 / (target_percentage- 1) - n_samples_1)
    
    # A matrix to store the synthetic samples (sized from the X argument, not the global X_train)
    new = np.zeros((n_samples_1_new, X.shape[1]))
    
    # Create seeds
    np.random.seed(seed)
    seeds = np.random.randint(1, 1000000, 3)
    
    # Select examples to use as base
    np.random.seed(seeds[0])
    sel_ = np.random.choice(y[y==1].shape[0], n_samples_1_new)
    
    # Define random seeds (2 per example)
    np.random.seed(seeds[1])
    nn__ = np.random.choice(k, n_samples_1_new)
    np.random.seed(seeds[2])
    steps = np.random.uniform(size=n_samples_1_new)  

    # For each selected examples create one synthetic case
    for i, sel in enumerate(sel_):
        # Select neighbor (note: an index into the first k minority rows, not a true nearest neighbor of sel)
        nn_ = nn__[i]
        step = steps[i]
        # Create new sample
        new[i, :] = X[y==1].iloc[sel] - step * (X[y==1].iloc[sel] - X[y==1].iloc[nn_])
    
    X = np.vstack((X, new))
    y = np.append(y, np.ones(n_samples_1_new))
    
    return X, y
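
For comparison, here is a minimal NumPy sketch of the interpolation step in which the neighbour is actually one of the k nearest minority rows of the base sample. The function name, the toy data and the brute-force distance matrix are assumptions for illustration; it expects k to be smaller than the number of minority rows and is only practical for small minority classes.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=None):
    """Create n_new synthetic rows by moving minority rows toward one of their k nearest neighbours."""
    rng = np.random.RandomState(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between minority rows
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)               # a row is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.randint(0, len(X_min), size=n_new)            # base row per synthetic sample
    pick = rng.randint(0, neighbours.shape[1], size=n_new)   # which of the k neighbours to use
    nn = neighbours[base, pick]
    step = rng.uniform(size=(n_new, 1))                      # position along the segment
    return X_min[base] + step * (X_min[nn] - X_min[base])

# Hypothetical minority-class rows with two features
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                  [1.0, 1.0], [0.5, 0.5], [2.0, 2.0]])
synthetic = smote_sketch(X_min, n_new=10, k=3, seed=1)
print(synthetic.shape)
```

Each synthetic row lies on the segment between a minority row and one of its neighbours, so it stays within the bounding box of the minority class.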

In [93]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

for target_percentage in [0.25, 0.5]:
    
    Xp, yp = SMOTE(X_train, y_train, target_percentage,5, 1)
    
    bestX = Xp

    logreg = LogisticRegression()
    
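    # Note: synthetic rows derived from the same fraud cases can land in both the training and
    # validation folds, which tends to inflate this AUC; X_test is never scored
    results = cross_val_score(logreg, bestX, yp, cv=10, scoring='roc_auc')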
    
    print('Target percentage', target_percentage, '---> AUC: ', results.mean())
    

Target percentage 0.25 ---> AUC:  0.8096407315366954
Target percentage 0.5 ---> AUC:  0.8089055972309597


# Exercise 04.5

Estimate Logistic Regression, GaussianNB, K-nearest neighbors and Decision Tree **classifiers**

Evaluate using the following metrics:
* Accuracy
* F1-Score
* F_Beta-Score (Beta=10)

Comment about the results

Combine the classifiers and comment

In [121]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

models = {'LogisticRegression': LogisticRegression(),
          'DecisionTreeClassifier': DecisionTreeClassifier(),
          'GaussianNB': GaussianNB(),
          'KNeighborsClassifier': KNeighborsClassifier()}

In [125]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)


In [130]:
# Testing the accuracy of each model. 

y_pred = pd.DataFrame(index=X_test.index, columns=list(models))

for model in models.keys():
    models[model].fit(X_train, y_train)
    y_pred[model] = models[model].predict(X_test)
    
    print('Model: ', model,'---> Accuracy: ', str(accuracy_score(y_test, y_pred[model]))[:5], '---> F1: ', str(f1_score(y_test, y_pred[model]))[:5])


Model:  DecisionTreeClassifier ---> Accuracy:  0.988 ---> F1:  0.136
Model:  KNeighborsClassifier ---> Accuracy:  0.993 ---> F1:  0.084
Model:  LogisticRegression ---> Accuracy:  0.993 ---> F1:  0.0
Model:  GaussianNB ---> Accuracy:  0.923 ---> F1:  0.031




In [135]:
# Combining the classifiers. 
from sklearn.metrics import mean_squared_error
print('Total Result ---> MSE: ', mean_squared_error(y_test, y_pred.mean(axis=1)))

Total Result ---> MSE:  0.010212724546581701
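
The cell above averages the 0/1 predictions of the four models and scores that average with MSE. Another common reading of "combine the classifiers" is a majority vote, which yields hard labels that can be scored with accuracy or F1 like the individual models. A minimal sketch on hypothetical predictions (the arrays below are made up for illustration):

```python
import numpy as np

# Hypothetical hard predictions: rows are samples, columns are four classifiers
votes = np.array([[0, 0, 0, 0],
                  [1, 0, 0, 0],
                  [1, 1, 0, 0],
                  [1, 1, 1, 0],
                  [1, 1, 1, 1]])
y_true_demo = np.array([0, 0, 1, 1, 1])

vote_share = votes.mean(axis=1)              # same quantity as y_pred.mean(axis=1)
majority = (vote_share > 0.5).astype(int)    # strict majority; ties count as non-fraud
accuracy_demo = (majority == y_true_demo).mean()

print(vote_share, majority, accuracy_demo)
```

With an even number of models, ties are possible, so the tie-breaking rule (here: predict non-fraud) is a design choice that directly affects recall.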


# Exercise 04.6

Using the under-sampled dataset

Evaluate a RandomForestClassifier and compare the results

Change n_estimators to 100. What happened?

In [136]:
from sklearn.ensemble import RandomForestClassifier
rfclassificator = RandomForestClassifier()

In [137]:
rfclassificator.fit(X_train, y_train)
y_predRF = rfclassificator.predict(X_test)
    
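# Score the forest's own predictions (y_predRF). The recorded F1 below was produced with
# y_pred[model], i.e. the last model of the previous loop, so it does not describe the random forest.
print('Model: RandomForest ---> Accuracy: ', str(accuracy_score(y_test, y_predRF))[:5], '---> F1: ', str(f1_score(y_test, y_predRF))[:5])


Model: RandomForest ---> Accuracy:  0.993 ---> F1:  0.031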


In [None]:
rfclassificator.fit(X_train, y_train)
y_predRF = rfclassificator.predict(X_test)
    
# Both metrics are computed from the random forest predictions
print('Model: RandomForest ---> Accuracy: ', str(accuracy_score(y_test, y_predRF))[:5], '---> F1: ', str(f1_score(y_test, y_predRF))[:5])