## Procedure

1. Clean and transform data
2. Exploratory Data Analysis (EDA)
3. Handle imbalanced classes
4. **Modeling & evaluation**

## Adjusting decision boundaries
Oversampling takes the observed rare samples and applies bootstrapping to generate new random data based on a distribution function. If cross-validation is applied after oversampling, the validation folds contain synthetic points derived from the training folds, so we are essentially overfitting the model to one specific artificial bootstrapping result.

I suspect there are better ways to handle class imbalance that I have not yet explored. In the interest of time, I will stop here.
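
One way to avoid the leak described above is to resample inside each cross-validation fold, so that only the training part is oversampled and the validation part keeps its original class balance. The following is a minimal standard-library sketch of that idea. The helper names (`oversample_minority`, `folds_with_oversampling`) and the toy labels are hypothetical, and plain duplication with replacement stands in for whatever resampling method was used earlier in this project.

```python
import random


def oversample_minority(indices, labels, rng):
    """Duplicate minority-class indices (with replacement) until classes balance."""
    pos = [i for i in indices if labels[i] == 1]
    neg = [i for i in indices if labels[i] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        return list(indices)  # nothing to resample from
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra


def folds_with_oversampling(labels, k, seed=0):
    """Yield (train_indices, validation_indices); only the training part is resampled."""
    rng = random.Random(seed)
    order = list(range(len(labels)))
    rng.shuffle(order)
    for fold in range(k):
        validation = order[fold::k]
        held_out = set(validation)
        train = [i for i in order if i not in held_out]
        yield oversample_minority(train, labels, rng), validation


# Toy labels: 4 positives among 40 samples (hypothetical data).
labels = [1] * 4 + [0] * 36
for train, validation in folds_with_oversampling(labels, k=4):
    train_pos = sum(labels[i] for i in train)
    print(len(train), train_pos, len(validation))
```

Because the validation indices are set aside before any duplication happens, no resampled copy of a validation point can end up in the training part of the same fold.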

In [24]:
import numpy as np
import pandas as pd
from datetime import datetime
import joblib  # sklearn.externals.joblib was removed from recent scikit-learn releases
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, f1_score

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)


In [2]:
X_train = joblib.load('../work/data/X_train')
y_train = joblib.load('../work/data/y_train')
X_train_resampled = joblib.load('../work/data/X_train_resampled')
y_train_resampled = joblib.load('../work/data/y_train_resampled')
X_test = joblib.load('../work/data/X_test')
y_test = joblib.load('../work/data/y_test')

## Logistic Regression
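Logistic regression draws a line (a hyperplane in higher dimensions) between the classes. The further a point lies from this boundary, the closer its score (the estimated probability of the positive class) gets to 0 or 1, depending on which side of the boundary it falls.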
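
To make the relationship between distance and score concrete, here is a small sketch that does not use scikit-learn. The weights, bias, and points are hypothetical values chosen for illustration, not parameters learned from this dataset.

```python
import math


def sigmoid(z):
    """Map a signed, distance-like value z to a score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_score(weights, bias, x):
    """Score = sigmoid(w . x + b); w . x + b = 0 is the decision boundary."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)


# Hypothetical 2-D model whose boundary is the line x1 + x2 = 0.
weights, bias = [1.0, 1.0], 0.0
for point in ([0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [-3.0, -3.0]):
    print(point, round(logistic_score(weights, bias, point), 3))
```

A point on the boundary scores exactly 0.5. Moving away on the positive side pushes the score toward 1, and moving away on the negative side pushes it toward 0.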

In [17]:
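class Model_Evaluation:
    """Helpers for fitting and scoring an estimator; used without instantiation."""

    @staticmethod
    def fit_model(estimator, X, y):
        # Fit the estimator in place and report the wall-clock training time.
        start = datetime.now()
        model = estimator
        model.fit(X, y)
        print("Time: ", datetime.now() - start)
        return model

    @staticmethod
    def score_model(model, X, y):
        # Accuracy alone is misleading on imbalanced data, so also report F1.
        prediction = model.predict(X)
        print("Accuracy Score: ", model.score(X, y))
        print("F1 Score: ", f1_score(y, prediction))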

In [25]:
# Logistic Regression on resampled data, with varying values of C

lr1 = Model_Evaluation.fit_model(LogisticRegression(C=0.01), X_train_resampled, y_train_resampled)
Model_Evaluation.score_model(lr1, X_test, y_test)

Time:  0:00:00.449208
Accuracy Score:  0.904200915687
F1 Score:  0.0105088495575


In [26]:
# Logistic Regression on resampled data, with varying values of C

lr1 = Model_Evaluation.fit_model(LogisticRegression(C=1), X_train_resampled, y_train_resampled)
Model_Evaluation.score_model(lr1, X_test, y_test)

Time:  0:00:00.450063
Accuracy Score:  0.904227690166
F1 Score:  0.0105117565698


In [27]:
# Logistic Regression on resampled data, with varying values of C

lr1 = Model_Evaluation.fit_model(LogisticRegression(C=100), X_train_resampled, y_train_resampled)
Model_Evaluation.score_model(lr1, X_test, y_test)

Time:  0:00:00.452244
Accuracy Score:  0.904227690166
F1 Score:  0.0105117565698


In [28]:
# Logistic Regression on non-resampled data

lr2 = Model_Evaluation.fit_model(LogisticRegression(C=0.01), X_train, y_train)
Model_Evaluation.score_model(lr2, X_test, y_test)

Time:  0:00:00.207569
Accuracy Score:  0.999357412514
F1 Score:  0.0


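scikit-learn emitted an `UndefinedMetricWarning` for this cell because the model never predicts the positive class.

precision = TP/(TP+FP): if the predictor never predicts the positive class, TP and FP are both 0, so precision is 0/0 and undefined.

recall = TP/(TP+FN): if the predictor never predicts the positive class, TP is 0, so recall is 0.

The F1 score combines precision and recall, so it ends up dividing 0/0 as well, and scikit-learn reports it as 0.0.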
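
The arithmetic above can be checked with a few lines of plain Python. This is an illustrative sketch rather than scikit-learn's implementation: the function name and the counts are hypothetical, and an undefined ratio is returned as `None` instead of being replaced by 0.0.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1); a ratio with a zero denominator is None."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    if precision is None or recall is None or (precision + recall) == 0:
        f1 = None
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# A model that never predicts the positive class: TP = 0, FP = 0, FN > 0.
print(precision_recall_f1(tp=0, fp=0, fn=12))   # (None, 0.0, None)
# A model that finds some positives (hypothetical counts).
print(precision_recall_f1(tp=3, fp=9, fn=9))    # (0.25, 0.25, 0.25)
```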

In [33]:
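# The model never predicts the positive class, so precision (and hence the F-score) is undefined.
# In scikit-learn, C is the inverse of regularization strength, so C=0.01 is a strong penalty:
# the model is underfit and falls back to always predicting the majority class.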

my_model = LogisticRegression(C=0.01)
my_model.fit(X_train, y_train)
prediction = my_model.predict(X_test)
print(set(y_test))
print(set(prediction))

{0, 1}
{0}
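
Since the model only ever predicts class 0 at the default cutoff, one option in the spirit of adjusting the decision boundary is to lower the probability threshold for the positive class. The sketch below assumes positive-class probabilities are already available (with scikit-learn they could come from `my_model.predict_proba(X_test)[:, 1]`). The probabilities and labels shown are made up for illustration and do not come from this dataset.

```python
def predict_with_threshold(probabilities, threshold=0.5):
    """Label a sample positive when its positive-class probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical positive-class probabilities and true labels (not from the notebook's data).
probabilities = [0.02, 0.10, 0.35, 0.04, 0.45, 0.01]
y_true = [0, 0, 1, 0, 1, 0]

for threshold in (0.5, 0.3):
    predicted = predict_with_threshold(probabilities, threshold)
    true_positives = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 1)
    print(threshold, set(predicted), true_positives)
```

Lowering the threshold trades precision for recall, so the cutoff would need to be chosen on validation data rather than on the test set.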


In [41]:
# Logistic Regression on non-resampled data

lr2 = Model_Evaluation.fit_model(LogisticRegression(C=1), X_train, y_train)
Model_Evaluation.score_model(lr2, X_test, y_test)

Time:  0:00:00.298089
Accuracy Score:  0.999357412514
F1 Score:  0.0769230769231


In [42]:
# Logistic Regression on non-resampled data

lr2 = Model_Evaluation.fit_model(LogisticRegression(C=100), X_train, y_train)
Model_Evaluation.score_model(lr2, X_test, y_test)

Time:  0:00:00.318385
Accuracy Score:  0.999330638036
F1 Score:  0.0740740740741


## Decision Tree
A decision tree recursively subdivides the instance space into finer and finer subregions until each one contains a single class (or is good enough). A new instance starts at the root node and takes the appropriate path until it reaches a leaf node. The leaf classifies it by checking the classes of the training instances that reached that leaf: the majority determines the class. For that leaf, the score is calculated by:

$$\frac{\text{majority instances}}{\text{majority instances} + \text{minority instances}}$$

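When using scikit-learn's `DecisionTreeClassifier`, consider setting `min_samples_leaf` to something like 5 or 10. Its default value of 1 allows leaves that hold a single training sample, which makes the tree prone to overfitting unless its growth is limited in some other way (for example with `max_depth`).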
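
The leaf score described above can be computed directly from the labels that reach a leaf. This is a small illustration with hypothetical labels and a hypothetical helper name, not scikit-learn's internal code.

```python
from collections import Counter


def leaf_prediction(leaf_labels):
    """Return (majority_class, score) for the training labels that reached a leaf."""
    counts = Counter(leaf_labels)
    majority_class, majority_count = counts.most_common(1)[0]
    return majority_class, majority_count / len(leaf_labels)


# Hypothetical leaf that received 8 negatives and 2 positives during training.
print(leaf_prediction([0] * 8 + [1] * 2))   # (0, 0.8)
# With min_samples_leaf=1 a leaf may hold a single sample, so its score is always 1.0.
print(leaf_prediction([1]))                 # (1, 1.0)
```

The second call shows why very small leaves are a concern: a score of 1.0 from a single sample says little about how confident the prediction should be.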

In [53]:
# Decision Tree on resampled data

dt1 = Model_Evaluation.fit_model(DecisionTreeClassifier(max_depth=5, min_samples_leaf=5, max_features=1), X_train_resampled, y_train_resampled)
Model_Evaluation.score_model(dt1, X_test, y_test)

Time:  0:00:00.054830
Accuracy Score:  0.856006854267
F1 Score:  0.00296625880608


In [79]:
# Decision Tree on resampled data

dt1 = Model_Evaluation.fit_model(DecisionTreeClassifier(max_depth=5), X_train_resampled, y_train_resampled)
Model_Evaluation.score_model(dt1, X_test, y_test)

Time:  0:00:00.536202
Accuracy Score:  0.857640097459
F1 Score:  0.00561062277913


In [80]:
# Decision Tree on non-resampled data

dt1 = Model_Evaluation.fit_model(DecisionTreeClassifier(max_depth=5), X_train, y_train)
Model_Evaluation.score_model(dt1, X_test, y_test)

Time:  0:00:00.161170
Accuracy Score:  0.999223540122
F1 Score:  0.0645161290323


## Random Forest

In [13]:
# Random Forest on resampled data

rf1 = Model_Evaluation.fit_model(RandomForestClassifier(), X_train_resampled, y_train_resampled)
Model_Evaluation.score_model(rf1, X_test, y_test)



Time:  0:00:05.435138
Accuracy Score:  0.998902246379
F1 Score:  0.046511627907


In [14]:
# Random Forest on non-resampled data

rf2 = Model_Evaluation.fit_model(RandomForestClassifier(), X_train, y_train)
Model_Evaluation.score_model(rf2, X_test, y_test)



Time:  0:00:01.827316
Accuracy Score:  0.999223540122
F1 Score:  0.0645161290323


## K-Nearest Neighbor
If, for example, `k=5`, then for every new instance its 5 nearest neighbors in the training data are selected and some function, such as a majority vote, is applied to their labels. To assign a score, divide the number of positive instances among the neighbors by `k` and return the fraction.
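
As a rough illustration of the scoring rule, here is a brute-force version in plain Python. The function names and the toy points are hypothetical, and Euclidean distance is an assumption. `KNeighborsClassifier` is the implementation actually used in the cells that follow.

```python
import math


def knn_score(train_points, train_labels, query, k=5):
    """Fraction of positive labels among the k training points closest to `query`."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return sum(nearest_labels) / len(nearest_labels)


def knn_predict(train_points, train_labels, query, k=5):
    """Majority vote: positive when more than half of the k neighbors are positive."""
    return 1 if knn_score(train_points, train_labels, query, k) > 0.5 else 0


# Hypothetical 2-D training set: positives cluster near (1, 1), negatives near (0, 0).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.9, 1.0), (1.0, 0.9), (1.1, 1.1), (0.3, 0.3)]
labels = [0, 0, 0, 1, 1, 1, 0]
print(knn_score(points, labels, (0.8, 0.8), k=5))
print(knn_predict(points, labels, (0.8, 0.8), k=5))
```

For the query point `(0.8, 0.8)`, three of the five nearest neighbors are positive, so the score is 3/5 and the majority vote is the positive class.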

In [15]:
# KNN on resampled data

kn1 = Model_Evaluation.fit_model(KNeighborsClassifier(n_neighbors=2), X_train_resampled, y_train_resampled)
Model_Evaluation.score_model(kn1, X_test, y_test)

Time:  0:02:40.271168
Accuracy Score:  0.992877988701
F1 Score:  0.029197080292


In [16]:
# KNN on non-resampled data

kn1 = Model_Evaluation.fit_model(KNeighborsClassifier(n_neighbors=2), X_train, y_train)
Model_Evaluation.score_model(kn1, X_test, y_test)

Time:  0:00:44.031340
Accuracy Score:  0.999303863557
F1 Score:  0.0


# Our best model: Decision Tree on non-resampled data, without GridSearch.

<img src='../work/best_model.png'>