## Procedure

1. Clean and transform data
2. Exploratory Data Analysis (EDA)
3. **Handle imbalanced classes**
4. Modeling & evaluation
5. Conclusion
6. Clean code with classes & functions

Common ways to balance the training set:
* **Oversample the minority class.**
* Undersample the majority class.
* Synthesize new minority-class samples.
* Penalize model errors on the minority `failure` class more heavily.
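
The last option in the list, penalizing minority-class errors, is usually expressed as per-class weights. The sketch below computes weights inversely proportional to class frequency, the same heuristic scikit-learn applies for `class_weight='balanced'`. The helper name `balanced_class_weights` and the toy labels are assumptions made for illustration; this notebook does not use class weights.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Toy labels (made up): 90 "ok" rows and 10 "failure" rows.
toy_labels = ["ok"] * 90 + ["failure"] * 10
weights = balanced_class_weights(toy_labels)
# The rare "failure" class receives the larger weight.
```

Tree-based estimators such as `RandomForestClassifier` accept weights like these through their `class_weight` parameter.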

In the interest of time, I explore just one resampling technique: oversampling the minority class. Oversampling increases the number of minority-class members in the training set without discarding any of the original training data. However, because the added samples are copies of, or derived from, existing minority samples, it can make a model prone to overfitting.
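
As a rough illustration of the idea, here is a minimal sketch of the simplest form of oversampling: duplicating randomly chosen minority rows until the classes are the same size. The function name `random_oversample` and the toy data are hypothetical. The notebook itself uses SMOTE, which creates synthetic rows rather than exact copies.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=88):
    """Duplicate randomly chosen rows of each smaller class until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target_size = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All original rows belonging to this class.
        pool = [r for r, l in zip(rows, labels) if l == cls]
        # Sample with replacement to fill the gap (no-op for the largest class).
        extra = [rng.choice(pool) for _ in range(target_size - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

# Toy data (made up): 8 "ok" rows and 2 "failure" rows.
rows = [[i, i * 2] for i in range(10)]
labels = ["ok"] * 8 + ["failure"] * 2
new_rows, new_labels = random_oversample(rows, labels)
balanced_counts = Counter(new_labels)
```

Every original row is kept, which is why no information is lost. The repeated minority rows are also the reason a model can overfit to them.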

In [2]:
!pip install imblearn

Collecting imblearn
  Downloading https://files.pythonhosted.org/packages/81/a7/4179e6ebfd654bd0eac0b9c06125b8b4c96a9d0a8ff9e9507eb2a26d2d7e/imblearn-0.0-py2.py3-none-any.whl
Collecting imbalanced-learn (from imblearn)
  Downloading https://files.pythonhosted.org/packages/e5/4c/7557e1c2e791bd43878f8c82065bddc5798252084f26ef44527c02262af1/imbalanced_learn-0.4.3-py3-none-any.whl (166kB)
Collecting scikit-learn>=0.20 (from imbalanced-learn->imblearn)
  Downloading https://files.pythonhosted.org/packages/5e/82/c0de5839d613b82bddd088599ac0bbfbbbcbd8ca470680658352d2c435bd/scikit_learn-0.20.3-cp36-cp36m-manylinux1_x86_64.whl (5.4MB)
Installing collected packages: scikit-learn, imbalanced-learn, imblearn
  Found existing installation: scikit-learn 0.18.2
    Uninstalling scikit-learn-0.18.2:
      Successfully uninstalled scikit-

In [5]:
import pandas as pd
from datetime import datetime
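# Note: sklearn.externals.joblib was removed in scikit-learn 0.23; on newer versions use `import joblib` instead.
from sklearn.externals import joblib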
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ShuffleSplit

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE

import matplotlib.pyplot as plt

In [6]:
transformed_features = joblib.load('../work/data/transformed_features')
target = joblib.load('../work/data/target')

## Train-test split

In [7]:
X_train, X_test, y_train, y_test = train_test_split(transformed_features, target, test_size=.3, random_state=88)
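
Before resampling, it can help to confirm how imbalanced the training labels actually are. The sketch below computes class shares for a list of labels; the helper `class_shares` and the toy labels are assumptions for illustration, not values from this dataset.

```python
from collections import Counter

def class_shares(labels):
    """Return the fraction of samples that belongs to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

# Toy labels (made up): 95 negatives and 5 positives.
toy_y_train = [0] * 95 + [1] * 5
shares = class_shares(toy_y_train)
```

If the shares differ noticeably between the training and test sets, one option the notebook does not use is passing `stratify=target` to `train_test_split`, which keeps the class proportions the same in both splits.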

## Upsampling
Oversample the training data only, so that the test set keeps its original class distribution and contains no synthetic samples.

In [8]:
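# Balance the classes 1:1. `sampling_strategy` and `fit_resample` replace the
# deprecated `ratio` and `fit_sample` (available from imbalanced-learn 0.4).
sm = SMOTE(random_state=88, sampling_strategy=1.0)
X_train_resampled, y_train_resampled = sm.fit_resample(X_train, y_train)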
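
To make the SMOTE step above less of a black box, here is a simplified sketch of its core idea: pick a minority sample, pick one of its nearest minority neighbours, and place a new point somewhere on the line segment between them. This is not the imbalanced-learn implementation; the function `smote_like_sample`, its parameters, and the toy points are all assumptions for illustration.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def smote_like_sample(minority, k=2, seed=88):
    """Create one synthetic point between a minority sample and a near neighbour."""
    rng = random.Random(seed)
    base = rng.choice(minority)
    # Candidate neighbours: every other minority point, closest first.
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: squared_distance(base, p))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # interpolation factor in [0, 1)
    synthetic = [b + gap * (n - b) for b, n in zip(base, neighbour)]
    return base, neighbour, synthetic

# Toy minority-class points (made up).
minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [8.0, 8.0]]
base, neighbour, synthetic = smote_like_sample(minority)
```

Because each synthetic point is interpolated from existing minority samples rather than copied, the resampled training set contains new feature values, but they still lie within the region the original minority samples already cover.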

## Export data

In [9]:
joblib.dump(X_train_resampled, '../work/data/X_train_resampled')
joblib.dump(y_train_resampled, '../work/data/y_train_resampled')
joblib.dump(X_test, '../work/data/X_test')
joblib.dump(y_test, '../work/data/y_test')

['../work/data/y_test']