## Procedure

1. Clean and transform data
2. Exploratory Data Analysis (EDA)
3. Handle imbalanced classes
4. **Modeling & evaluation**

Note: We shouldn't use classification accuracy (or error rate) to evaluate classifiers here because of the huge class imbalance in our dataset. Accuracy scores hard labels, which by default come from a 0.50 probability threshold, and that cutoff is not appropriate for our case.
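To see why accuracy misleads here, the following minimal sketch scores a trivial classifier that always predicts `failure=0`. It assumes the test-set class counts that appear in the logistic regression confusion matrix in this notebook (37,325 non-failures, 24 failures); the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Test-set class counts taken from the confusion matrix in this notebook:
# 33753 + 3572 = 37325 non-failures, 5 + 19 = 24 failures.
n_negative, n_positive = 37325, 24
y_true = np.concatenate([np.zeros(n_negative, dtype=int), np.ones(n_positive, dtype=int)])

# A "model" that never predicts a failure.
y_always_zero = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_always_zero)
# zero_division=0 (scikit-learn >= 0.22) silences the undefined-precision warning.
baseline_f1 = f1_score(y_true, y_always_zero, zero_division=0)

print(f"accuracy: {baseline_accuracy:.6f}")
print(f"F1:       {baseline_f1:.1f}")
```

The baseline reaches about 99.94% accuracy while catching no failures at all, so accuracy alone cannot separate a useful model from a useless one on this data.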

**We want probabilities of class memberships instead of just labels.**

Precision-recall curves are built from the predicted probability of an observation belonging to each class rather than from predicted labels: each point on the curve gives the precision and recall at one probability threshold. We will use precision-recall curves (instead of ROC curves) because we are dealing with class imbalance and precision-recall calculations do not make use of the true negatives. They are only concerned with the correct prediction of the minority class, because we are generally less interested in the model's ability to predict class 0 correctly (which we have a lot of).

<a href="https://www.codecogs.com/eqnedit.php?latex=\frac{True&space;Positives}{(True&space;Positives&space;&plus;&space;False&space;Positives)&space;}" target="_blank"><img src="https://latex.codecogs.com/gif.latex?\frac{True&space;Positives}{(True&space;Positives&space;&plus;&space;False&space;Positives)&space;}" title="\frac{True Positives}{(True Positives + False Positives) }" /></a>

The calculations do not make use of the true negatives. They are only concerned with the correct prediction of the minority class (`failure=1`).
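As a rough illustration of how a precision-recall curve is obtained from predicted probabilities, the sketch below uses scikit-learn's `precision_recall_curve` and `average_precision_score`. The data is synthetic (generated with `make_classification`) and only stands in for the real dataset; the model settings are assumptions, not the ones used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real data (about 5% positives).
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Column 1 of predict_proba holds the estimated probability of class 1.
proba_failure = clf.predict_proba(X_te)[:, 1]

# One (precision, recall) pair per candidate threshold, plus a final (1, 0) point.
precision, recall, thresholds = precision_recall_curve(y_te, proba_failure)
avg_precision = average_precision_score(y_te, proba_failure)

print(f"thresholds evaluated: {len(thresholds)}")
print(f"average precision: {avg_precision:.3f}")
```

Plotting `recall` against `precision` gives the curve, and average precision summarizes it in a single number that can be compared across models.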

In [48]:
import numpy as np
import pandas as pd
from datetime import datetime
import joblib  # formerly sklearn.externals.joblib, which was removed in scikit-learn 0.23
from sklearn.model_selection import train_test_split, GridSearchCV, ShuffleSplit
from sklearn.metrics import classification_report, confusion_matrix, f1_score

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

In [17]:
X_train = joblib.load('../work/data/X_train')
y_train = joblib.load('../work/data/y_train')
X_train_resampled = joblib.load('../work/data/X_train_resampled')
y_train_resampled = joblib.load('../work/data/y_train_resampled')
X_test = joblib.load('../work/data/X_test')
y_test = joblib.load('../work/data/y_test')

## Logistic Regression
Fits a line (a hyperplane in higher dimensions) that separates the classes. The further a point lies from this decision boundary, the closer its score (estimated probability) gets to 0 or 1, depending on which side it falls.
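The relationship between distance from the boundary and the score can be sketched with the logistic (sigmoid) function. The weights, bias, and points below are made-up values for illustration, not parameters fitted on this dataset.

```python
import numpy as np

def sigmoid(z):
    """Map a signed score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical fitted parameters for a two-feature model.
weights = np.array([1.0, -2.0])
bias = 0.5

points = np.array([
    [1.5, 1.0],    # exactly on the boundary: w.x + b = 0
    [3.0, 0.0],    # well onto the positive side
    [-2.0, 2.0],   # well onto the negative side
])

# Signed score w.x + b; dividing by ||w|| would give the geometric distance.
signed_scores = points @ weights + bias
probabilities = sigmoid(signed_scores)

for z, p in zip(signed_scores, probabilities):
    print(f"score={z:+.1f} -> P(class 1)={p:.3f}")
```

A point on the boundary gets a probability of 0.5, and the probability saturates toward 1 or 0 as the point moves away on either side.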

In [4]:
LogisticRegression().get_params().keys()

dict_keys(['C', 'class_weight', 'dual', 'fit_intercept', 'intercept_scaling', 'max_iter', 'multi_class', 'n_jobs', 'penalty', 'random_state', 'solver', 'tol', 'verbose', 'warm_start'])

In [5]:
# Fit model

start = datetime.now()

params = {
    'penalty': ['l1','l2'],
    'C': np.arange(1,14,3)
}

# liblinear supports both the l1 and l2 penalties in the grid
lr1 = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid=params, cv=ShuffleSplit(random_state=88), n_jobs=-1)
lr1.fit(X_train_resampled, y_train_resampled)

print("Time: ", datetime.now() - start)



Time:  0:00:52.155262


In [6]:
prediction = lr1.predict(X_test)

In [11]:
confusion = confusion_matrix(y_test, prediction)
pd.DataFrame(confusion, columns=['predicted failure=0', 'predicted failure=1'], index=['true failure=0', 'true failure=1'])

| | predicted failure=0 | predicted failure=1 |
|---|---|---|
| true failure=0 | 33753 | 3572 |
| true failure=1 | 5 | 19 |


In [14]:
lr1.score(X_test, y_test)

0.90422769016573401

Even though our model has an accuracy score of 90%, it over-predicts failures: 3,572 non-failures are flagged as failures (False Positives). More importantly, it has 5 False Negatives, which means we can miss important failures. If the cost of a False Negative is high, we would want to judge our models based on their Recall scores. If the cost of a False Positive is high, we would want to use Precision.

However, the instructions say we want to minimize both false positives and false negatives, so we will judge our models by the F1 Score, which balances Precision and Recall and, unlike accuracy, is not inflated by the large number of true negatives.
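As a worked check, precision, recall, F1, and accuracy can be computed by hand from the four counts in the confusion matrix above. This is only arithmetic on those reported counts; the variable names are illustrative.

```python
# Counts read from the confusion matrix above (positive class is failure=1).
tp, fp, fn, tn = 19, 3572, 5, 33753

precision = tp / (tp + fp)          # share of predicted failures that were real
recall = tp / (tp + fn)             # share of real failures that were caught
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
print(f"F1:        {f1:.4f}")
print(f"accuracy:  {accuracy:.4f}")
```

Recall is fairly high (19 of 24 failures are caught), but precision is tiny because of the 3,572 false alarms, and the harmonic mean pulls F1 down to roughly 0.01 even though accuracy is about 0.90.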

In [16]:
f1_score(y_test,prediction)

0.010511756569847857

In [18]:
# Try it on non-upsampled data

start = datetime.now()

params = {
    'penalty': ['l1','l2'],
    'C': np.arange(1,14,3)
}

# liblinear supports both the l1 and l2 penalties in the grid
lr2 = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid=params, cv=ShuffleSplit(random_state=88), n_jobs=-1)
lr2.fit(X_train, y_train)

print("Time: ", datetime.now() - start)



Time:  0:00:46.887099


In [19]:
prediction = lr2.predict(X_test)
print(lr2.score(X_test, y_test))
print(f1_score(y_test,prediction))

0.999357412514
0.0769230769231


The F1 score is actually better on non-upsampled data (0.077 vs. 0.011), although both are very low. Interesting! Let's see if it's consistent or if it's just a fluke.

## Write classes/functions to repeat on other models

In [30]:
class Model_Evaluation:
    """Groups the grid-search fitting and scoring helpers reused for every model."""

    @staticmethod
    def fit_model(my_model, my_params, X, y):
        # Time a grid search over my_params for the given estimator class
        start = datetime.now()
        model = GridSearchCV(my_model(), param_grid=my_params, cv=ShuffleSplit(random_state=88), n_jobs=-1)
        model.fit(X, y)
        print("Time: ", datetime.now() - start)
        return model

    @staticmethod
    def score_model(model, X, y):
        # Report accuracy and F1 for the fitted model on (X, y)
        prediction = model.predict(X)
        print("Accuracy Score: ", model.score(X, y))
        print("F1 Score: ", f1_score(y, prediction))

## Decision Tree
Recursively subdivides the instance space into finer and finer subregions until each one contains a single class (or is good enough). A new instance starts at the root node and follows the appropriate path until it reaches a leaf node. The leaf determines the classification by checking the classes of the training instances that reached it, and the majority determines the class. For that leaf, the score is calculated by:

<a href="https://www.codecogs.com/eqnedit.php?latex=\frac{majority&space;instances}{(majority&space;instances&space;&plus;&space;minority&space;instances)&space;}" target="_blank"><img src="https://latex.codecogs.com/gif.latex?\frac{majority&space;instances}{(majority&space;instances&space;&plus;&space;minority&space;instances)&space;}" title="\frac{majority instances}{(majority instances + minority instances) }" /></a>

When using scikit-learn's `DecisionTreeClassifier`, set `min_samples_leaf` to something like 5 or 10. Its default value of 1 lets the tree keep splitting until leaves hold a single training instance, which makes overfitting very likely.
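The effect of `min_samples_leaf` can be sketched on synthetic, deliberately noisy data (generated with `make_classification`, not the dataset used here). The helper `leaf_sizes` is a hypothetical convenience function that reads leaf sizes from the fitted tree's `tree_` attribute.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data; flip_y randomly reassigns about 10% of the labels.
X, y = make_classification(n_samples=1000, n_features=10, flip_y=0.1, random_state=0)

full_tree = DecisionTreeClassifier(min_samples_leaf=1, random_state=0).fit(X, y)
pruned_tree = DecisionTreeClassifier(min_samples_leaf=10, random_state=0).fit(X, y)

def leaf_sizes(tree):
    """Number of training samples that ended up in each leaf."""
    is_leaf = tree.tree_.children_left == -1   # leaves have no children
    return tree.tree_.n_node_samples[is_leaf]

# With min_samples_leaf=1 the tree can memorize the training set, noise included.
print("train accuracy, min_samples_leaf=1: ", full_tree.score(X, y))
print("smallest leaf,  min_samples_leaf=1: ", leaf_sizes(full_tree).min())
print("smallest leaf,  min_samples_leaf=10:", leaf_sizes(pruned_tree).min())

# Leaf scores are the class fractions within the leaf an instance falls into.
first_instance_scores = pruned_tree.predict_proba(X[:1])[0]
print("leaf scores for first instance:", first_instance_scores)
```

With larger leaves, each leaf score is a fraction over at least 10 training instances instead of a hard 0 or 1, which gives more usable probability estimates.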

In [50]:
DecisionTreeClassifier().get_params().keys()

dict_keys(['class_weight', 'criterion', 'max_depth', 'max_features', 'max_leaf_nodes', 'min_impurity_split', 'min_samples_leaf', 'min_samples_split', 'min_weight_fraction_leaf', 'presort', 'random_state', 'splitter'])

In [24]:
params = {
    'max_depth': np.arange(1,14,3),
    'min_samples_leaf': np.arange(1,14,3)
}

In [32]:
# Decision Tree on resampled data

dt1 = Model_Evaluation.fit_model(DecisionTreeClassifier, params, X_train_resampled, y_train_resampled)
Model_Evaluation.score_model(dt1, X_test, y_test)

Time:  0:02:05.669966
Accuracy Score:  0.972582933947
F1 Score:  0.0077519379845


In [33]:
# Decision Tree on non-resampled data

dt2 = Model_Evaluation.fit_model(DecisionTreeClassifier, params, X_train, y_train)
Model_Evaluation.score_model(dt2, X_test, y_test)

Time:  0:00:45.908535
Accuracy Score:  0.999357412514
F1 Score:  0.0




## Random Forest

In [45]:
params = {
    'n_estimators':[10,100],
    'max_depth':[10,40,None]
}

In [39]:
# Random Forest on resampled data

rf1 = Model_Evaluation.fit_model(RandomForestClassifier, params, X_train_resampled, y_train_resampled)
Model_Evaluation.score_model(rf1, X_test, y_test)

Time:  0:08:35.366464
Accuracy Score:  0.999009344293
F1 Score:  0.0


In [47]:
# Random Forest on non-resampled data

rf2 = Model_Evaluation.fit_model(RandomForestClassifier, params, X_train, y_train)
Model_Evaluation.score_model(rf2, X_test, y_test)

Time:  0:02:36.111837
Accuracy Score:  0.999357412514
F1 Score:  0.0




## K-Nearest Neighbors
If, for example, `k=5`, then for every new instance its 5 nearest neighbors (by distance) are found and some function such as majority vote is applied to their classes. To assign a score, divide the number of positive instances among the neighbors by `k` and return the fraction.
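The scoring rule described above can be sketched in a few lines of NumPy. The function `knn_score` and the toy data are hypothetical and only illustrate the idea; scikit-learn's `KNeighborsClassifier.predict_proba` reports comparable fractions when `weights='uniform'`.

```python
import numpy as np

def knn_score(X_train, y_train, x_new, k=5):
    """Fraction of positive labels among the k training points closest to x_new."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance to each training point
    nearest = np.argsort(distances)[:k]                  # indices of the k smallest distances
    return y_train[nearest].mean()

# Toy one-feature training set (illustrative values).
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [10.0], [11.0]])
y_toy = np.array([0, 0, 1, 1, 1, 0, 0])

score = knn_score(X_toy, y_toy, x_new=np.array([2.5]), k=5)
label = int(score >= 0.5)   # majority vote over the 5 neighbors
print(score, label)
```

Here the five nearest neighbors of 2.5 are the points at 0 through 4, three of which are positive, so the score is 3/5 and the majority vote assigns class 1.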

In [49]:
params = {
    'n_neighbors': np.arange(2,30,2)
}

In [50]:
# KNN on resampled data

kn1 = Model_Evaluation.fit_model(KNeighborsClassifier, params, X_train_resampled, y_train_resampled)
Model_Evaluation.score_model(kn1, X_test, y_test)

Time:  4:10:37.005268
Accuracy Score:  0.997108356315
F1 Score:  0.0181818181818


In [51]:
# KNN on non-resampled data

kn2 = Model_Evaluation.fit_model(KNeighborsClassifier, params, X_train, y_train)
Model_Evaluation.score_model(kn2, X_test, y_test)

Time:  2:48:13.990020
Accuracy Score:  0.999357412514
F1 Score:  0.0


