# Data Modeling 02

In this notebook, we'll use Dask-ML to tune a classifier's hyperparameters. The eventual goal is to use `HyperbandSearchCV` so that we can train many models in parallel on the PRP; this local run uses `RandomizedSearchCV`.

In [1]:
import dask
import numpy as np
import matplotlib.pyplot as plt 

import dask.dataframe as dd
from dask_ml.model_selection import train_test_split, HyperbandSearchCV, RandomizedSearchCV, GridSearchCV
from dask_ml.linear_model import LogisticRegression

Now, let's read in our cleaned (and, for this local example, reduced) data and train a model on it.

In [42]:
X = dd.read_csv('../data/processed/primary_reduction_neighbors_15_components_3.csv')
y = dd.read_csv('../data/processed/primary_labels_neighbors_15_components_50.csv', header=None)

In [43]:
# y = y + 1

In [93]:
est = LogisticRegression(class_weight='balanced')

grid = RandomizedSearchCV(
    n_iter=15,
    estimator=est,
    param_distributions={
        'penalty' : ['l1', 'l2'],
        'C' : np.linspace(0.1, 100, 50)
    },
    scoring='balanced_accuracy',
)
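To make the search above concrete: a randomized search draws `n_iter` parameter combinations from `param_distributions` and cross-validates each one. The sketch below uses scikit-learn's `ParameterSampler` to list such draws for the same search space. It assumes that `dask_ml`'s `RandomizedSearchCV` samples candidates the same way, which this notebook does not verify, and `random_state=0` is added only for repeatability.

```python
import numpy as np
from sklearn.model_selection import ParameterSampler

# Same search space as the cell above: 2 penalties x 50 values of C = 100 combinations.
param_distributions = {
    "penalty": ["l1", "l2"],
    "C": np.linspace(0.1, 100, 50),
}

# Draw 15 candidates, as n_iter=15 would. Every entry is a plain list/array
# (no scipy distribution), so the draws are made without replacement.
candidates = list(ParameterSampler(param_distributions, n_iter=15, random_state=0))

# Show the first few sampled settings.
for params in candidates[:3]:
    print(params)
```

With 15 draws out of 100 combinations, most of the space is never evaluated, which is the trade-off randomized search makes against an exhaustive grid.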

In [94]:
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True)

In [91]:
X_train.head()

Unnamed: 0,0,1,2
10078,0.174511,2.703613,4.8925
161248,-1.322768,9.670191,6.098368
105800,0.420085,4.96871,7.503746
188474,1.650733,3.510981,8.245358
95689,2.627469,2.725965,6.955181


In [92]:
y_train.head()

Unnamed: 0,0
10078,6.0
161248,12.0
105800,11.0
188474,2.0
95689,11.0


In [46]:
# best_est = grid.fit(X_train.values, y_train.values)

In [47]:
# best_est.cv_results_

Great, now let's see what the best estimator was!

In [48]:
# best_est.best_score_

Now let's define a general class that wraps this hyperparameter tuning:

In [95]:
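class GeneClassifier:
    """Wrap an estimator and its search space for randomized hyperparameter tuning."""

    def __init__(self, est, params):
        self.est = est        # unfitted estimator to tune
        self.params = params  # dict mapping parameter names to candidate values

    def generate_model(self, X, y, n_iter=10):
        """Fit a randomized search over `n_iter` sampled parameter settings."""
        grid = RandomizedSearchCV(
            n_iter=n_iter,
            estimator=self.est,
            param_distributions=self.params,
            scoring='balanced_accuracy'
        )

        # fit() returns the fitted search object, stored for the accessors below
        self.grid = grid.fit(X, y)

    def best_score(self):
        """Best balanced accuracy found by the search."""
        return self.grid.best_score_

    def best_model(self):
        """Estimator refit with the best parameters."""
        return self.grid.best_estimator_

    def best_params(self):
        """Parameter setting that achieved the best score."""
        return self.grid.best_params_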

In [96]:
# No trailing comma after the closing brace: it would turn the dict into a one-element tuple.
param_distributions = {
    'penalty' : ['l1', 'l2'],
    'C' : np.linspace(0.1, 100, 50)
}

logistic_est = GeneClassifier(LogisticRegression(class_weight='balanced'), param_distributions)

In [97]:
logistic_est.generate_model(X_train.values, y_train.values, n_iter=2)
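For a quick local check of this search pattern without the project data, the sketch below runs the same kind of randomized search on synthetic data. Everything in it is a stand-in chosen for illustration: scikit-learn's `RandomizedSearchCV` replaces the `dask_ml` one, a `DecisionTreeClassifier` replaces the estimator, and the data, parameter names and values are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data: 300 rows, 6 features, 3 classes.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)

# Hypothetical search space for the stand-in estimator (12 combinations).
demo_params = {"max_depth": [2, 4, 8, 16], "min_samples_leaf": [1, 5, 10]}

search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_distributions=demo_params,
    n_iter=5,                      # evaluate 5 of the 12 combinations
    scoring="balanced_accuracy",   # same metric as in the notebook
    random_state=0,
)
search.fit(X_demo, y_demo)

# The same three attributes that GeneClassifier exposes through its accessors.
print(search.best_params_)
print(search.best_score_)
print(type(search.best_estimator_).__name__)
```

Because the labels here are already a 1-D array with three classes, this also exercises the multi-class path of the scoring function.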

Finally, let's quickly check the balanced accuracy on the test set.

In [84]:
from sklearn.metrics import balanced_accuracy_score

est = logistic_est.best_model()

In [86]:
type(est)

dask_ml.linear_model.glm.LogisticRegression

## LogisticRegression doesn't support multi-class

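`dask_ml`'s `LogisticRegression` handles only binary targets, while our labels take many values (6, 12, 11 and 2 in the sample above), so ignore the results above.

Now let's try this with a simple XGBClassifier (a gradient-boosted tree classifier).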

In [98]:
from xgboost import XGBClassifier

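params = {
    'eta' : np.linspace(0, 1, 20),                     # learning rate (step-size shrinkage)
    'gamma': np.linspace(0, 1000, 20),                 # minimum loss reduction needed to split a node
    'max_depth': np.linspace(0, 1000, 20, dtype=int),  # maximum tree depth
}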

xgb_est = GeneClassifier(XGBClassifier(), params)

The XGBClassifier from `dask_ml` requires a distributed Client, so we'll just use the plain `xgboost` classifier instead.

In [99]:
xgb_est.generate_model(X_train.values, y_train.values, n_iter=2)

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
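The repeated `column_or_1d` lines look like the tail of scikit-learn's `DataConversionWarning`, which is raised when `y` arrives as a column vector of shape `(n, 1)` rather than a 1-D array of shape `(n,)`. That reading is an inference, since the full warning text is not shown. A minimal sketch of the usual fix, using a small made-up NumPy array in place of `y_train.values`:

```python
import numpy as np

# Made-up labels shaped like a single-column frame's .values: (n_samples, 1).
y_column = np.array([[6.0], [12.0], [11.0], [2.0], [11.0]])

# ravel() flattens to (n_samples,), the shape estimators expect for y.
y_flat = y_column.ravel()

print(y_column.shape, y_flat.shape)
```

This sketch covers the in-memory NumPy case only; whether the same call works unchanged on the Dask array returned by `y_train.values` is not tested here.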


In [100]:
xgb_est.best_params()

{'max_depth': 526, 'gamma': 631.578947368421, 'eta': 0.42105263157894735}
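It helps to place these values in the search space. The sketch below rebuilds the three grids and finds where each reported value sits, using only numbers already shown in this notebook. The space has 20 * 20 * 20 = 8,000 combinations and `n_iter=2` tried only two of them, so the reported parameters are better read as a smoke-test result than as a tuned optimum.

```python
import numpy as np

# The three grids from the search space defined earlier.
grids = {
    "eta": np.linspace(0, 1, 20),
    "gamma": np.linspace(0, 1000, 20),
    "max_depth": np.linspace(0, 1000, 20, dtype=int),
}

# Best parameters as reported in the output above.
best = {"max_depth": 526, "gamma": 631.578947368421, "eta": 0.42105263157894735}

# Index of the grid point closest to each reported value.
positions = {
    name: int(np.argmin(np.abs(grids[name] - value)))
    for name, value in best.items()
}

# Total number of distinct combinations in the search space.
n_combinations = int(np.prod([len(g) for g in grids.values()]))

print(positions)
print(n_combinations)
```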

In [101]:
xgb_est.best_score()

0.70812660083081
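In scikit-learn-style searches, `best_score_` is the mean cross-validated score of the best candidate on the data passed to `fit`, so the number above describes the training split rather than the held-out test set. Balanced accuracy itself is the unweighted mean of per-class recall, which keeps frequent classes from dominating the score. A minimal sketch with made-up labels (not this notebook's data) that computes it by hand:

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Recall for one class: fraction of that class's rows predicted correctly.
    recalls = [
        np.mean(y_pred[y_true == label] == label)
        for label in np.unique(y_true)
    ]
    return float(np.mean(recalls))

# Made-up, imbalanced labels: four of class 0, two of class 1, one of class 2.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 0, 2]

plain = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))  # 5 correct out of 7
balanced = balanced_accuracy(y_true, y_pred)  # (3/4 + 1/2 + 1/1) / 3
print(plain, balanced)
```

On the real data, the corresponding check would be `balanced_accuracy_score(y_test, predictions)` with the function imported earlier, assuming the test labels and the model's predictions have first been brought into memory as 1-D arrays.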