# Capstone - Partial Discharge
## Julian Sweet DSI-LA-6
## Notebook 4 - Decision Tree Modeling, Balanced Class

Decision trees are good for classification but tend to overfit. The hope is to improve on the performance of logistic regression.

In [1]:
import numpy as np
import pandas as pd

from scipy.signal import resample, stft
from sys import getsizeof
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from scipy.fftpack import fft

In [2]:
X_train_fft    = np.load('./npy_datasets/X_train_bal_fft.npy')
X_test_fft     = np.load('./npy_datasets/X_test_bal_fft.npy')
y_test_resamp  = np.load('./npy_datasets/y_test_resamp.npy')
y_train_resamp = np.load('./npy_datasets/y_train_resamp.npy')

In [3]:
X_train_fft.shape, X_test_fft.shape

((840, 256), (210, 256))

Modeling begins with a Random Forest classifier of 100 trees (`n_estimators=100`), grid-searching over several values of `min_samples_split`.

In [4]:
params1 = {
    'min_samples_split': [2,3,5,7,10,13]
}

gs1 = GridSearchCV(RandomForestClassifier(n_estimators=100), n_jobs = 6, verbose = 2,
                  param_grid = params1, 
                  return_train_score = True,
                  cv = 5)

In [5]:
gs1.fit(np.abs(X_train_fft), y_train_resamp)
gs1.best_params_

Fitting 5 folds for each of 6 candidates, totalling 30 fits


[Parallel(n_jobs=6)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done  30 out of  30 | elapsed:    4.5s finished


{'min_samples_split': 3}

In [6]:
gs1.score(np.abs(X_train_fft), y_train_resamp)

1.0

In [7]:
gs1.score(np.abs(X_test_fft), y_test_resamp)

0.7761904761904762

In [8]:
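# Assumption: this cell was meant to summarize gs1 on the held-out test set.
# Flattened confusion matrix (tn, fp, fn, tp) as fractions of all test samples.
C = confusion_matrix(y_test_resamp, gs1.predict(np.abs(X_test_fft))).ravel()
C / C.sum()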

The model is clearly overfit (training accuracy of 1.0 versus 0.776 on the test set), but it still does significantly better than the 50% accuracy of a naive baseline on balanced classes.

Next, the parameter range is widened, and the common "Grid Search" CV and "Random Forest" classifier are replaced with "Randomized Search" CV and an "Extra Trees" classifier.

In [9]:
params2 = {
    'min_samples_split' : np.arange(2,250,20)
    }

gs2 = RandomizedSearchCV(ExtraTreesClassifier(n_estimators=200), n_jobs = 6, verbose = 2,
                  param_distributions = params2, 
                  return_train_score = True,
                  cv = 5,
                  scoring = 'roc_auc'  
                  )

In [10]:
gs2.fit(np.abs(X_train_fft), y_train_resamp)
gs2.best_params_

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=6)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done  29 tasks      | elapsed:    1.9s
[Parallel(n_jobs=6)]: Done  50 out of  50 | elapsed:    3.0s finished


{'min_samples_split': 2}
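
The log above reports 10 candidates even though `params2` holds 13 values. That matches the default `n_iter=10` of `RandomizedSearchCV`, which draws candidates without replacement when every parameter is given as a list or array. The following is a minimal standard-library sketch of that sampling idea only, not scikit-learn's actual implementation; the seed value is an arbitrary assumption.

```python
import random

# The same 13 candidate values as np.arange(2, 250, 20) in params2.
candidates = list(range(2, 250, 20))

# Assumption: a fixed seed, used only so this sketch is repeatable.
rng = random.Random(0)

# Draw 10 distinct candidates, mirroring the default n_iter=10.
sampled = sorted(rng.sample(candidates, 10))

print(len(candidates), "possible values;", len(sampled), "sampled")
print(sampled)
```

Three of the 13 values are never tried in any one run, so a rerun without a fixed `random_state` may explore a different subset.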

In [11]:
gs2.score(np.abs(X_train_fft), y_train_resamp)

1.0

In [12]:
gs2.score(np.abs(X_test_fft), y_test_resamp)

0.8611791383219954
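
Because `gs2` was built with `scoring = 'roc_auc'`, `gs2.score` reports ROC AUC rather than accuracy, so 0.861 is not directly comparable with the 0.776 accuracy from `gs1`. ROC AUC can be read as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. Below is a minimal pure-Python sketch of that definition; the helper name `pairwise_auc` and the toy labels and scores are made up for illustration and are not the notebook's data.

```python
def pairwise_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities, for illustration only.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.6, 0.4, 0.5, 0.3, 0.2]

# 8 of the 9 positive/negative pairs are ordered correctly here.
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

The quadratic loop is fine for a sketch; on real data `roc_auc_score` (already imported above) is the practical choice.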

In [13]:
d = {'predictions': gs2.predict(np.abs(X_test_fft)), 'actual': y_test_resamp}

In [14]:
con = pd.DataFrame(data = d)
con.head(10)

   predictions  actual
0            1       1
1            1       1
2            1       1
3            1       1
4            1       1
5            1       1
6            1       1
7            1       1
8            1       1
9            0       1


In [15]:
C = confusion_matrix(con['actual'], con['predictions'])

In [17]:
C

array([[82, 23],
       [16, 89]])

In [18]:
tn, fp, fn, tp = confusion_matrix(con['actual'], con['predictions']).ravel()

In [19]:
(tn, fp, fn, tp)

(82, 23, 16, 89)

In [20]:
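# Normalize each row (actual class) so that it sums to 1.
# keepdims=True keeps the row sums as a column so they divide row-wise.
C / C.sum(axis=1, keepdims=True)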

array([[0.78095238, 0.21904762],
       [0.15238095, 0.84761905]])

In [None]:
Sensitivity = TP / (TP + FN) = 89 / 105 ≈ 0.848, which is just the row-normalized TP entry above.
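
The four counts from the confusion matrix are enough to derive the usual rates by hand. Here is a small sketch using the counts reported above for `gs2.predict`; the function name `rates` is only illustrative.

```python
def rates(tn, fp, fn, tp):
    """Return common rates derived from binary confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # recall, true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
    }


# Counts from the confusion matrix of gs2.predict on the test set.
default_rates = rates(tn=82, fp=23, fn=16, tp=89)
for name, value in default_rates.items():
    print(f"{name}: {value:.3f}")
```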

In [21]:
df = pd.DataFrame(gs2.predict_proba(np.abs(X_test_fft)))
df.head()

       0      1
0  0.385  0.615
1  0.445  0.555
2  0.160  0.840
3  0.420  0.580
4  0.405  0.595


In [22]:
biased_guess = df[1] >= .40

In [23]:
tn, fp, fn, tp = confusion_matrix(y_test_resamp, biased_guess).ravel()

In [24]:
(tn, fp, fn, tp)

(64, 41, 6, 99)

In [None]:
Sensitivity = 99 / 105 ≈ 0.943

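Lowering the decision threshold to 0.40 moves the operating point along the ROC curve: True Negatives decrease (82 to 64) and False Positives increase (23 to 41), but it has the benefit of greatly decreasing False Negatives (16 to 6) and increasing True Positives (89 to 99).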
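
The trade-off can be sketched by counting outcomes at several thresholds. The labels and probabilities below are made up for illustration; with the notebook's data one would pass `y_test_resamp` and `df[1]` instead. The helper name `confusion_counts` is hypothetical.

```python
def confusion_counts(labels, probs, threshold):
    """Return (tn, fp, fn, tp), predicting positive when prob >= threshold."""
    tn = fp = fn = tp = 0
    for y, p in zip(labels, probs):
        predicted = 1 if p >= threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


# Hypothetical data, not the notebook's test set.
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_probs = [0.90, 0.62, 0.45, 0.30, 0.55, 0.42, 0.20, 0.10]

# Lower thresholds trade false negatives for false positives.
for threshold in (0.5, 0.4, 0.3):
    tn, fp, fn, tp = confusion_counts(toy_labels, toy_probs, threshold)
    print(f"threshold={threshold}: tn={tn} fp={fp} fn={fn} tp={tp} "
          f"sensitivity={tp / (tp + fn):.2f}")
```

For a problem where a missed positive is the costly error, a lower threshold can be a reasonable choice, as long as the extra false positives are acceptable.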

Notebook #7 will now perform the same analysis, but with the larger anomalous / unbalanced dataset.