# Exercise 04

# Fraud Detection

## Introduction

- Fraud Detection Dataset from Microsoft Azure: [data](http://gallery.cortanaintelligence.com/Experiment/8e9fe4e03b8b4c65b9ca947c72b8e463)

Fraud detection is one of the earliest industrial applications of data mining and machine learning. Fraud detection is typically handled as a binary classification problem, but the class population is unbalanced because instances of fraud are usually very rare compared to the overall volume of transactions. Moreover, when fraudulent transactions are discovered, the business typically takes measures to block the accounts from transacting to prevent further losses. 

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

In [2]:
import zipfile
with zipfile.ZipFile('../datasets/fraud_detection.csv.zip', 'r') as z:
    f = z.open('15_fraud_detection.csv')
    data = pd.read_csv(f, index_col=0)

In [3]:
data.head()

Unnamed: 0,accountAge,digitalItemCount,sumPurchaseCount1Day,sumPurchaseAmount1Day,sumPurchaseAmount30Day,paymentBillingPostalCode - LogOddsForClass_0,accountPostalCode - LogOddsForClass_0,paymentBillingState - LogOddsForClass_0,accountState - LogOddsForClass_0,paymentInstrumentAgeInAccount,ipState - LogOddsForClass_0,transactionAmount,transactionAmountUSD,ipPostalCode - LogOddsForClass_0,localHour - LogOddsForClass_0,Label
0,2000,0,0,0.0,720.25,5.064533,0.421214,1.312186,0.566395,3279.574306,1.218157,599.0,626.16465,1.259543,4.745402,0
1,62,1,1,1185.44,2530.37,0.538996,0.481838,4.40137,4.500157,61.970139,4.035601,1185.44,1185.44,3.981118,4.921349,0
2,2000,0,0,0.0,0.0,5.064533,5.096396,3.056357,3.155226,0.0,3.314186,32.09,32.09,5.00849,4.742303,0
3,1,1,0,0.0,0.0,5.064533,5.096396,3.331154,3.331239,0.0,3.529398,133.28,132.729554,1.324925,4.745402,0
4,1,1,0,0.0,132.73,5.412885,0.342945,5.563677,4.086965,0.001389,3.529398,543.66,543.66,2.693451,4.876771,0


In [4]:
data.shape, data.Label.sum(), data.Label.mean()

((138721, 16), 797, 0.0057453449730033666)
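
With 797 fraud labels in 138,721 rows, the positive rate is about 0.57%. As a reference point for the accuracy figures later in the notebook, the sketch below computes the accuracy a hypothetical rule that never flags fraud would obtain on the full dataset. The variable names are illustrative; the counts are copied from the output above.

```python
# Counts copied from the output above
n_rows = 138721
n_fraud = 797

# A hypothetical "never flag fraud" rule is correct on every non-fraud row
baseline_accuracy = (n_rows - n_fraud) / n_rows
print(round(baseline_accuracy, 4))
```

Any accuracy that is not clearly above this baseline says little about how well fraud is being detected.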

In [5]:
X = data.drop(['Label'], axis=1)
y = data['Label']

# Exercise 04.1

Estimate a Logistic Regression

Evaluate using the following metrics:
* Accuracy
* F1-Score
* F_Beta-Score (Beta=10)

Comment on the results
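
For reference, the F-beta score can be written in terms of confusion-matrix counts as (1 + beta^2) * TP / ((1 + beta^2) * TP + beta^2 * FN + FP). The sketch below evaluates it by hand; the helper name `fbeta_from_counts` and the counts are hypothetical and are not taken from this dataset.

```python
def fbeta_from_counts(tp, fp, fn, beta=1.0):
    """F-beta from confusion-matrix counts; beta > 1 weights recall more than precision."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0

# Hypothetical counts, for illustration only
tp, fp, fn = 30, 10, 170
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = fbeta_from_counts(tp, fp, fn, beta=1)
f10 = fbeta_from_counts(tp, fp, fn, beta=10)
print(precision, recall, f1, f10)
```

With beta=10 the score lands very close to recall, which suits fraud detection when a missed fraud is considered far more costly than a false alarm.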

In [6]:
X = data.drop('Label',axis=1)
y = data.Label

# train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)



In [7]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [18]:
# make predictions for testing set
y_pred_class = logreg.predict(X_test)

# calculate testing accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

0.9940024797439521


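#### The logistic regression model reaches about 99% accuracy, which is roughly what always predicting the majority class would score, since only about 0.57% of the transactions are fraudulent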

# Exercise 04.2

Under-sample the negative class using random under-sampling.

Which value of target_percentage did you choose?
How do the results change?

**Only apply under-sampling to the training set, evaluate using the whole test set**

In [21]:
def UnderSampling(X, y, target_percentage=0.5, seed=None):
    # Assuming minority class is the positive
    n_samples = y.shape[0]
    n_samples_0 = (y == 0).sum()
    n_samples_1 = (y == 1).sum()

    n_samples_0_new =  n_samples_1 / target_percentage - n_samples_1
    n_samples_0_new_per = n_samples_0_new / n_samples_0

    filter_ = y == 0

    np.random.seed(seed)
    rand_1 = np.random.binomial(n=1, p=n_samples_0_new_per, size=n_samples)
    
    filter_ = filter_ & rand_1
    filter_ = filter_ | (y == 1)
    filter_ = filter_.astype(bool)
    
    return X[filter_], y[filter_]

In [22]:
for target_percentage in [0.1, 0.2, 0.3, 0.4, 0.5]:
    X_u, y_u = UnderSampling(X_train, y_train, target_percentage, 111)
    print('Target percentage', target_percentage)
    print('y.shape = ',y_u.shape[0], 'y.mean() = ', y_u.mean())

Target percentage 0.1
y.shape =  5963 y.mean() =  0.09877578400134161
Target percentage 0.2
y.shape =  3000 y.mean() =  0.19633333333333333
Target percentage 0.3
y.shape =  1994 y.mean() =  0.29538615847542626
Target percentage 0.4
y.shape =  1494 y.mean() =  0.39424364123159306
Target percentage 0.5
y.shape =  1190 y.mean() =  0.49495798319327733
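
The means printed above are close to, but not exactly, the requested targets because `UnderSampling` keeps each negative independently with a fixed probability. If an exact class ratio is wanted, one option is to draw a fixed number of negatives without replacement. The sketch below assumes NumPy arrays as input; `under_sample_exact` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def under_sample_exact(X, y, target_percentage=0.5, seed=None):
    """Keep all positives and an exact number of randomly chosen negatives."""
    X = np.asarray(X)
    y = np.asarray(y)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)

    # Negatives to keep so that positives / total == target_percentage
    n_neg_keep = int(round(len(pos_idx) / target_percentage - len(pos_idx)))
    n_neg_keep = min(n_neg_keep, len(neg_idx))

    rng = np.random.default_rng(seed)
    keep_neg = rng.choice(neg_idx, size=n_neg_keep, replace=False)
    keep = np.sort(np.concatenate((pos_idx, keep_neg)))
    return X[keep], y[keep]

# Toy data: 20 positives among 1000 rows (illustrative only)
y_toy = np.array([1] * 20 + [0] * 980)
X_toy = np.arange(1000).reshape(-1, 1)
X_s, y_s = under_sample_exact(X_toy, y_toy, target_percentage=0.25, seed=0)
print(y_s.shape[0], y_s.mean())
```

With pandas inputs, `X_train.values` and `y_train.values` could be passed instead, at the cost of losing the column names.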


In [23]:
X_u, y_u = UnderSampling(X_train, y_train, 0.5, 111)

In [24]:
X_u.shape,y_u.shape

((1190, 15), (1190,))

In [25]:
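# treeclf is first created in Exercise 04.5; create it here so the cell runs top to bottom
treeclf = DecisionTreeClassifier()

logreg.fit(X_u, y_u)
treeclf.fit(X_u, y_u)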

y_pred_class_log = logreg.predict(X_test)
y_pred_class_tree = treeclf.predict(X_test)

In [26]:
metrics.accuracy_score(y_test, y_pred_class_log), metrics.accuracy_score(y_test, y_pred_class_tree)

(0.631758023125054, 0.6855627000374845)

In [27]:
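# scikit-learn metrics expect (y_true, y_pred).
# Note: y_pred_tree is assigned in Exercise 04.5 (tree fit on the full training set);
# the under-sampled tree's predictions are in y_pred_class_tree.
metrics.precision_score(y_test, y_pred_tree), metrics.recall_score(y_test, y_pred_tree)

(0.10894941634241245, 0.1346153846153846)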

In [28]:
metrics.f1_score(y_test, y_pred_tree)

0.12043010752688173

In [29]:
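# Caution: the arguments are in (y_pred, y_true) order, so beta=10 here weights
# precision rather than recall; fbeta_score(y_test, y_pred_tree, beta=10) gives the
# recall-weighted score the exercise asks for.
metrics.fbeta_score(y_pred_tree,y_test,beta=10)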

0.1091554732129072

# Exercise 04.3

Now use random over-sampling.

In [30]:
def OverSampling(X, y, target_percentage=0.5, seed=None):
    # Assuming minority class is the positive
    n_samples_0 = (y == 0).sum()

    # Positives needed so that positives / (positives + negatives) == target_percentage
    n_samples_1_new = int(target_percentage * n_samples_0 / (1 - target_percentage))

    # Positional indices of each class (y may carry a shuffled pandas index)
    pos_idx = np.flatnonzero(np.asarray(y) == 1)
    neg_idx = np.flatnonzero(np.asarray(y) == 0)

    # Draw positives with replacement, then keep every negative
    np.random.seed(seed)
    filter_ = np.concatenate((np.random.choice(pos_idx, n_samples_1_new), neg_idx))

    # Use .iloc because filter_ holds positions, not index labels
    return X.iloc[filter_], y.iloc[filter_]

In [31]:
for target_percentage in [0.1, 0.2, 0.3, 0.4, 0.5]:
    X_u, y_u = OverSampling(X_train, y_train, target_percentage, 1)
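
The expression used for `n_samples_1_new` follows from requiring n1 / (n1 + n0) = p, which rearranges to n1 = p * n0 / (1 - p). The small check below uses hypothetical class counts; `positives_needed` is an illustrative helper, and it rounds rather than truncates to avoid floating-point edge cases.

```python
def positives_needed(n_negatives, target_percentage):
    """Positives required so that positives / (positives + negatives) == target_percentage."""
    return int(round(target_percentage * n_negatives / (1 - target_percentage)))

# Hypothetical class count, for illustration only
n_neg = 9000
for p in [0.1, 0.25, 0.5]:
    n_pos = positives_needed(n_neg, p)
    print(p, n_pos, n_pos / (n_pos + n_neg))
```

Because positives are drawn with replacement, a high target percentage means each original fraud case is repeated many times, which can encourage overfitting to those cases.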

# Exercise 04.4*
Evaluate the results using SMOTE

Which parameters did you choose?

In [32]:
def SMOTE(X, y, target_percentage=0.5, k=5, seed=None):
    
    # New samples
    n_samples = y.shape[0]
    n_samples_0 = (y == 0).sum()
    n_samples_1 = (y == 1).sum()
    n_samples_1_new =  int(-target_percentage * n_samples_0 / (target_percentage- 1) - n_samples_1)
    
    # A matrix to store the synthetic samples
    new = np.zeros((n_samples_1_new, X.shape[1]))
    
    # Create seeds
    np.random.seed(seed)
    seeds = np.random.randint(1, 1000000, 3)
    
    # Select examples to use as base
    np.random.seed(seeds[0])
    sel_ = np.random.choice(y[y==1].shape[0], n_samples_1_new)
    
    # Define random seeds (2 per example)
    np.random.seed(seeds[1])
    nn__ = np.random.choice(k, n_samples_1_new)
    np.random.seed(seeds[2])
    steps = np.random.uniform(size=n_samples_1_new)  

    # For each selected examples create one synthetic case
    for i, sel in enumerate(sel_):
        # Select neighbor
        nn_ = nn__[i]
        step = steps[i]
        # Create new sample
        new[i, :] = X[y==1][sel] - step * (X[y==1][sel] - X[y==1][nn_])
    
    X = np.vstack((X, new))
    y = np.append(y, np.ones(n_samples_1_new))
    
    return X, y
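
Note that the function above picks each "neighbor" by row position: `np.random.choice(k, ...)` yields an index between 0 and k-1 into `X[y==1]`, so the partner is always one of the first k positive rows. SMOTE as usually described interpolates toward one of the k nearest minority-class neighbors of the selected sample. The following is a minimal sketch of that variant using brute-force distances in NumPy; `smote_knn` is a hypothetical helper, and it assumes the minority class is small enough for an n-by-n distance matrix.

```python
import numpy as np

def smote_knn(X_min, n_new, k=5, seed=None):
    """Create n_new synthetic rows by interpolating toward nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = X_min.shape[0]
    k = min(k, n - 1)  # a sample cannot be its own neighbor

    # Pairwise squared Euclidean distances between minority samples
    diff = X_min[:, None, :] - X_min[None, :, :]
    d2 = (diff ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)

    # Indices of the k nearest neighbors of every minority sample
    nn_idx = np.argsort(d2, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)   # which sample to start from
    pick = rng.integers(0, k, size=n_new)   # which of its k neighbors to use
    neigh = nn_idx[base, pick]
    steps = rng.uniform(size=(n_new, 1))    # position along the connecting segment
    return X_min[base] + steps * (X_min[neigh] - X_min[base])

# Toy minority class with two well-separated clusters (illustrative only)
X_min_toy = np.array([[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11]], dtype=float)
synthetic = smote_knn(X_min_toy, n_new=50, k=2, seed=0)
print(synthetic.shape)
```

On the toy data every synthetic point stays inside one of the two clusters, which is the behavior the nearest-neighbor step is meant to guarantee.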

In [33]:
import numpy as np
for target_percentage in [0.25, 0.5]:
    for k in [5, 15]:
        X_u, y_u = SMOTE(X_train.values, y_train, target_percentage, k, seed=111)
        print('Target percentage', target_percentage, 'k ', k)
        print('y.shape = ',y_u.shape[0], 'y.mean() = ', y_u.mean())

Target percentage 0.25 k  5
y.shape =  137934 y.mean() =  0.24999637507793582
Target percentage 0.25 k  15
y.shape =  137934 y.mean() =  0.24999637507793582
Target percentage 0.5 k  5
y.shape =  206902 y.mean() =  0.5
Target percentage 0.5 k  15
y.shape =  206902 y.mean() =  0.5


In [34]:
# SMOTE indexes X positionally, so pass the NumPy array rather than the DataFrame
X_u, y_u = SMOTE(X_train.values, y_train, 0.1, 5, seed=111)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

lru =LogisticRegression()
dtu = DecisionTreeClassifier()

lru.fit(X_u,y_u)
dtu.fit(X_u,y_u)

y_pred_class_log = lru.predict(X_test)
y_pred_class_tree = dtu.predict(X_test)

In [None]:
metrics.fbeta_score(y_test, y_pred_class_log, beta=10), metrics.fbeta_score(y_test, y_pred_class_tree, beta=10)

In [None]:
metrics.accuracy_score(y_test, y_pred_class_log), metrics.accuracy_score(y_test, y_pred_class_tree)

# Exercise 04.5

Estimate Logistic Regression, GaussianNB, K-nearest neighbors, and Decision Tree **classifiers**

Evaluate using the following metrics:
* Accuracy
* F1-Score
* F_Beta-Score (Beta=10)

Comment on the results

Combine the classifiers and comment
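
One simple way to combine classifiers is a majority vote over their 0/1 predictions. The sketch below assumes each classifier's predictions are available as a NumPy array of equal length; `majority_vote` and the prediction arrays are hypothetical. scikit-learn's `VotingClassifier` offers a built-in alternative.

```python
import numpy as np

def majority_vote(*predictions):
    """Combine 0/1 predictions from several classifiers by simple majority.

    Ties (possible with an even number of classifiers) resolve to 0.
    """
    votes = np.vstack(predictions)  # shape: (n_classifiers, n_samples)
    return (votes.mean(axis=0) > 0.5).astype(int)

# Hypothetical predictions from three classifiers on five samples
pred_a = np.array([0, 1, 1, 0, 1])
pred_b = np.array([0, 0, 1, 0, 1])
pred_c = np.array([1, 0, 1, 0, 0])
combined = majority_vote(pred_a, pred_b, pred_c)
print(combined)
```

For a rare positive class, a majority rule tends to lower recall, so flagging a transaction when any single classifier predicts fraud may be worth comparing as well.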

In [35]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

In [36]:
X = data.drop('Label',axis=1)
y = data.Label

# train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

In [37]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [38]:
# make predictions for testing set
y_pred_class = logreg.predict(X_test)

# calculate testing accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

0.9940024797439521


In [39]:
from sklearn.tree import DecisionTreeClassifier
treeclf = DecisionTreeClassifier()
treeclf.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [40]:
y_pred_tree = treeclf.predict(X_test)

# calculate testing accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_tree))

0.9882933018079063


In [41]:
# scikit-learn metrics expect (y_true, y_pred)
metrics.precision_score(y_test, y_pred_tree), metrics.recall_score(y_test, y_pred_tree)

(0.10714285714285714, 0.12980769230769232)

In [42]:
metrics.f1_score(y_test, y_pred_tree)

0.1173913043478261

In [43]:
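# Caution: the arguments are in (y_pred, y_true) order, so beta=10 here weights
# precision rather than recall; fbeta_score(y_test, y_pred_tree, beta=10) gives the
# recall-weighted score the exercise asks for.
metrics.fbeta_score(y_pred_tree,y_test,beta=10)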

0.10732840050377833

In [46]:
# Rows are true labels and columns are predictions when called as (y_true, y_pred)
metrics.confusion_matrix(y_test, y_pred_tree)

array([[34248,   225],
       [  181,    27]], dtype=int64)

In [47]:
# Rows are true labels and columns are predictions when called as (y_true, y_pred)
metrics.confusion_matrix(y_test, y_pred_class)

array([[34473,     0],
       [  208,     0]], dtype=int64)

# Exercise 04.6

Using the under-sampled dataset

Evaluate a RandomForestClassifier and compare the results

Change n_estimators to 100. What happens?