# Capstone - Partial Discharge
## Julian Sweet DSI-LA-6
## Notebook 5 - Decision Tree Modeling, Unbalanced Class

The unmanipulated dataset poses an anomaly detection problem, so this roughly 94% / 6% class split is the dataset of significance.

In [8]:
import numpy as np
import pandas as pd

from scipy.signal import resample, stft
from sys import getsizeof
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import roc_auc_score, confusion_matrix
from scipy.fftpack import fft

In [9]:
X_train_fft = np.load('./npy_datasets/X_train_unbal_fft.npy')
X_test_fft  = np.load('./npy_datasets/X_test_unbal_fft.npy')
y_test        = np.load('./npy_datasets/y_test.npy')
y_train       = np.load('./npy_datasets/y_train.npy')

In [10]:
params3 = {
    'min_samples_split' : np.arange(2,250,20)
    }
gs3 = RandomizedSearchCV(ExtraTreesClassifier(n_estimators = 600), n_jobs = 6, verbose = 2,
                  param_distributions = params3, 
                  return_train_score = True,
                  scoring = "accuracy",
                  cv = 5)

In [11]:
gs3.fit(X_train_fft, y_train)
gs3.best_params_

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=6)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done  29 tasks      | elapsed:   53.4s
[Parallel(n_jobs=6)]: Done  50 out of  50 | elapsed:  1.5min finished


{'min_samples_split': 2}

In [12]:
gs3.score(X_train_fft, y_train)

1.0

In [13]:
gs3.score(X_test_fft, y_test)

0.9403327596098681

In [14]:
pd.Series(y_train).value_counts(normalize = True)

0    0.939733
1    0.060267
dtype: float64

In [15]:
(0.9397590361445783-0.939733)*100

0.0026036144578300835

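The ExtraTrees test accuracy (0.9403) beats the naive majority-class baseline (about 0.9398) by only about 0.06 percentage points. Note that the 0.0026 computed above is the difference between two majority-class shares (0.93976 matches the test-set share, 0.939733 is the training share), not between model accuracy and baseline.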
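To make the baseline comparison concrete, here is a minimal sketch that computes majority-class accuracy for a label array and the model's margin over it. The helper name `majority_baseline_accuracy` is hypothetical, and the synthetic labels (1638 negatives, 105 positives) are an assumption taken from the test-set counts implied by this notebook's confusion matrix rather than loaded from `y_test.npy`.

```python
import numpy as np

def majority_baseline_accuracy(y):
    """Accuracy obtained by always predicting the most frequent class in y."""
    y = np.asarray(y)
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / y.size

# Synthetic stand-in for y_test (assumed class counts: 1638 negatives, 105 positives).
y_demo = np.array([0] * 1638 + [1] * 105)

baseline = majority_baseline_accuracy(y_demo)
model_accuracy = 0.9403327596098681  # test accuracy reported by gs3.score above

# Margin expressed in percentage points, not as a relative percentage.
margin_pct_points = (model_accuracy - baseline) * 100
print(f"baseline={baseline:.4f}, margin={margin_pct_points:.3f} percentage points")
```

In the notebook itself the same helper could be called on `y_test` directly, which avoids hard-coding baseline values by hand.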

In [16]:
params4 = {
    'min_samples_split' : np.arange(2,250,20)
    }
gs4 = RandomizedSearchCV(ExtraTreesClassifier(n_estimators = 600), n_jobs = 6, verbose = 2,
                  param_distributions = params4,
                  return_train_score = True,
                  scoring = "roc_auc",
                  cv = 5)

In [17]:
gs4.fit(X_train_fft, y_train)
gs4.best_params_

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=6)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done  29 tasks      | elapsed:   50.5s
[Parallel(n_jobs=6)]: Done  50 out of  50 | elapsed:  1.4min finished


{'min_samples_split': 22}

In [18]:
gs4.score(X_train_fft, y_train)

0.9996164445316988

In [19]:
gs4.score(X_test_fft, y_test)

0.8626373626373626

In [20]:
d = {'predictions': gs4.predict(X_test_fft), 'actual': y_test}
con = pd.DataFrame(data = d)
con.head(10)

   predictions  actual
0            0       1
1            0       0
2            0       0
3            0       0
4            0       0
5            0       0
6            0       0
7            0       0
8            0       0
9            0       0


In [21]:
tn, fp, fn, tp = confusion_matrix(con['actual'], con['predictions']).ravel()

In [22]:
(tn, fp, fn, tp)

(1637, 1, 105, 0)

Sensitivity is 0 (tp / (tp + fn) = 0 / 105): the model identifies none of the 105 positive cases, which is as bad as it gets.
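The sensitivity figure follows directly from the confusion-matrix counts. A small sketch, assuming the `(tn, fp, fn, tp)` ordering returned by `confusion_matrix(...).ravel()` for binary labels; the helper names `sensitivity` and `specificity` are hypothetical:

```python
def sensitivity(tn, fp, fn, tp):
    """True positive rate: tp / (tp + fn). Returns 0.0 if there are no positives."""
    positives = tp + fn
    return tp / positives if positives else 0.0

def specificity(tn, fp, fn, tp):
    """True negative rate: tn / (tn + fp). Returns 0.0 if there are no negatives."""
    negatives = tn + fp
    return tn / negatives if negatives else 0.0

# Counts from the default-threshold predictions above.
counts = (1637, 1, 105, 0)
print(f"sensitivity={sensitivity(*counts):.4f}, specificity={specificity(*counts):.4f}")
```

Reporting both rates side by side shows why high accuracy is misleading here: specificity is close to 1 while sensitivity is 0.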

In [23]:
df = pd.DataFrame(gs4.predict_proba(X_test_fft))
df.head()

          0         1
0  0.874661  0.125339
1  0.983549  0.016451
2  0.880123  0.119877
3  0.886166  0.113834
4  0.941134  0.058866


In [24]:
biased_guess = (df[1] >= .40)

In [25]:
tn, fp, fn, tp = confusion_matrix(y_test, biased_guess).ravel()

In [26]:
(tn, fp, fn, tp)

(1635, 3, 98, 7)

Sensitivity increased to about 0.07 (7 / 105).

Most significantly, false negatives decreased (105 to 98) and true positives increased (0 to 7). For an anomaly detection problem, that outweighs the small decrease in true negatives (1637 to 1635) and the increase in false positives (1 to 3).
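The trade-off described here can be explored across several cutoffs rather than a single one. The sketch below is illustrative only: `confusion_at_threshold` is a hypothetical helper, and the toy labels and probabilities are made up for the example, not taken from the notebook's data.

```python
import numpy as np

def confusion_at_threshold(y_true, proba_pos, threshold):
    """Return (tn, fp, fn, tp) when predicting class 1 iff proba_pos >= threshold."""
    y_true = np.asarray(y_true).astype(bool)
    pred = np.asarray(proba_pos) >= threshold
    tp = int(np.sum(pred & y_true))
    fp = int(np.sum(pred & ~y_true))
    fn = int(np.sum(~pred & y_true))
    tn = int(np.sum(~pred & ~y_true))
    return tn, fp, fn, tp

# Toy data (assumption for illustration only).
y_toy = np.array([0, 0, 0, 0, 1, 1, 0, 1])
p_toy = np.array([0.05, 0.10, 0.30, 0.45, 0.20, 0.42, 0.02, 0.60])

# Lowering the threshold trades false negatives for false positives.
for t in (0.5, 0.4, 0.2):
    tn, fp, fn, tp = confusion_at_threshold(y_toy, p_toy, t)
    recall = tp / (tp + fn)
    print(f"threshold={t:.2f}: tn={tn} fp={fp} fn={fn} tp={tp} recall={recall:.2f}")
```

In this notebook the helper would presumably be called with `y_test` and `gs4.predict_proba(X_test_fft)[:, 1]`. Choosing the cutoff on a held-out validation split, rather than on the test set, keeps the reported test metrics honest.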