# Data Modeling 02

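In this notebook, we'll use Dask to tune a classifier's hyperparameters so that we can train many models in parallel on the PRP. `HyperbandSearchCV` is imported below, but the cells in this notebook run the search with `RandomizedSearchCV`.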

In [1]:
import dask
import numpy as np
import matplotlib.pyplot as plt 

import dask.dataframe as dd
from dask_ml.model_selection import train_test_split, HyperbandSearchCV, RandomizedSearchCV, GridSearchCV
from dask_ml.linear_model import LogisticRegression

Now, let's read in our cleaned (and, for this local example, reduced) data and train a model on it.

In [2]:
X = dd.read_csv('../data/processed/primary_reduction_neighbors_15_components_3.csv')
y = dd.read_csv('../data/processed/primary_labels_neighbors_15_components_50.csv', header=None)

In [3]:
# y = y + 1

In [4]:
est = LogisticRegression(class_weight='balanced')

grid = RandomizedSearchCV(
    n_iter=15,
    estimator=est,
    param_distributions={
        'penalty' : ['l1', 'l2'],
        'C' : np.linspace(0.1, 100, 50)
    },
    scoring='balanced_accuracy',
)
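
To make the search above concrete, here is a minimal sketch of the idea behind a randomized search: instead of trying every combination in the grid, draw `n_iter` settings at random. This is an illustration in plain NumPy, not the `dask_ml` implementation; the helper `sample_candidates` and the seed are hypothetical, and the library's own sampler may differ in detail (for example, in how it treats duplicate draws).

```python
import numpy as np

# Same search space as the cell above.
param_distributions = {
    'penalty': ['l1', 'l2'],
    'C': np.linspace(0.1, 100, 50),
}

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility


def sample_candidates(space, n_iter, rng):
    """Draw n_iter settings, picking each parameter value uniformly at random."""
    candidates = []
    for _ in range(n_iter):
        candidates.append({
            name: values[rng.integers(len(values))]
            for name, values in space.items()
        })
    return candidates


candidates = sample_candidates(param_distributions, n_iter=15, rng=rng)

# Size of the full grid that an exhaustive search would have to cover.
n_total = len(param_distributions['penalty']) * len(param_distributions['C'])
print(f"trying {len(candidates)} of {n_total} possible combinations")
```

With 2 penalties and 50 values of `C`, the full grid has 100 combinations, so `n_iter=15` fits roughly a seventh of the candidates per cross-validation fold.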

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True)

In [6]:
X_train.head()

| Unnamed: 0 | 0 | 1 | 2 |
|---|---|---|---|
| 67575 | 1.960368 | 1.299117 | -0.212742 |
| 89917 | 4.305542 | 0.210048 | 7.261106 |
| 70364 | 3.158221 | 0.971542 | 0.213203 |
| 150005 | 1.889506 | 6.531239 | 4.297892 |
| 29714 | 3.637177 | 3.37771 | 3.172499 |


In [7]:
y_train.head()

| Unnamed: 0 | 0 |
|---|---|
| 67575 | 10.0 |
| 89917 | 7.0 |
| 70364 | 10.0 |
| 150005 | -1.0 |
| 29714 | 9.0 |
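
The labels shown above are floats and include `-1.0`, which is presumably what the commented-out `y = y + 1` was meant to address. Some classifiers (recent versions of `XGBClassifier`, as far as we know) expect integer class labels running from 0 to the number of classes minus 1. A minimal sketch of one way to re-encode labels, using made-up values shaped like the output above:

```python
import numpy as np

# Hypothetical labels, copied from the y_train.head() output above.
y_raw = np.array([10.0, 7.0, 10.0, -1.0, 9.0])

# np.unique returns the sorted distinct labels and, for each sample,
# the position of its label within that sorted array.
classes, y_encoded = np.unique(y_raw, return_inverse=True)

print(classes)    # [-1.  7.  9. 10.]
print(y_encoded)  # [3 1 3 0 2]

# Indexing into `classes` maps encoded predictions back to the original labels.
y_restored = classes[y_encoded]
```

Unlike adding 1, this also works when the label values have gaps, because each distinct label gets its own consecutive integer.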


In [8]:
# best_est = grid.fit(X_train.values, y_train.values)

In [9]:
# best_est.cv_results_

Great, now let's see what the best cross-validated score was!

In [10]:
# best_est.best_score_

Now let's define a generalized class to do this hyperparameter tuning for any estimator and search space.

In [11]:
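class GeneClassifier:
    """Pair an estimator with a search space and tune it by randomized search."""

    def __init__(self, est, params):
        self.est = est
        self.params = params
        self.grid = None  # set by generate_model()

    def generate_model(self, X, y, n_iter=10):
        """Fit a RandomizedSearchCV over self.params and keep the fitted search."""
        grid = RandomizedSearchCV(
            n_iter=n_iter,
            estimator=self.est,
            param_distributions=self.params,
            scoring='balanced_accuracy',
        )

        self.grid = grid.fit(X, y)

    # The accessors below are only valid after generate_model() has run.
    def best_score(self):
        return self.grid.best_score_

    def best_model(self):
        return self.grid.best_estimator_

    def best_params(self):
        return self.grid.best_params_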

In [12]:
param_distributions = {
    'penalty' : ['l1', 'l2'],
    'C' : np.linspace(0.1, 100, 50)
}

logistic_est = GeneClassifier(LogisticRegression(class_weight='balanced'), param_distributions)

In [13]:
# logistic_est.generate_model(X_train.values, y_train.values, n_iter=2)

Finally, let's quickly test the balanced accuracy on the test set.

In [14]:
from sklearn.metrics import balanced_accuracy_score

# est = logistic_est.best_model()
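
For reference, balanced accuracy is the average of the per-class recalls, so rare classes count as much as common ones. The sketch below reimplements that definition in NumPy on made-up labels; it is meant to illustrate the metric, not to replace `balanced_accuracy_score`.

```python
import numpy as np


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = []
    for label in np.unique(y_true):
        mask = y_true == label  # samples whose true class is `label`
        recalls.append(np.mean(y_pred[mask] == label))
    return float(np.mean(recalls))


# Toy labels (not from the dataset): class 0 dominates.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]

plain = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
balanced = balanced_accuracy(y_true, y_pred)
print(plain, balanced)
```

Here plain accuracy is 5/6, while balanced accuracy is (1.0 + 0.5) / 2 = 0.75, because the single miss halves the recall of the minority class.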

## LogisticRegression doesn't support multi-class

The labels here span more than two classes, so ignore the results above.

Now let's try this with a simple XGBClassifier (a gradient-boosted tree classifier).

In [17]:
from xgboost import XGBClassifier

params = {
    'eta' : np.linspace(0, 1, 20),
    'gamma': np.linspace(0, 1000, 20),
    'max_depth': np.linspace(0, 1000, 20, dtype=int),
}

xgb_est = GeneClassifier(XGBClassifier(eval_metric='mlogloss'), params)

Using the XGBClassifier from `dask_ml` requires a distributed Client, so we'll just use the plain `xgboost` classifier imported above instead.

In [None]:
xgb_est.generate_model(X_train.values, y_train.values.compute().ravel(), n_iter=50)
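
A note on `y_train.values.compute().ravel()` in the cell above: `y` was read as a one-column dataframe, so its values form a 2-D array of shape `(n, 1)`. `.compute()` materializes the Dask array as a NumPy array, and `.ravel()` flattens it to the 1-D shape that scikit-learn-style estimators usually expect for targets. A small sketch of the shape change, using made-up values in place of the real labels:

```python
import numpy as np

# Made-up stand-in for y_train.values.compute(): five rows, one column.
y_column = np.array([[10.0], [7.0], [10.0], [-1.0], [9.0]])
print(y_column.shape)  # (5, 1)

# ravel() flattens the column into a 1-D vector of labels.
y_flat = y_column.ravel()
print(y_flat.shape)  # (5,)
```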





In [None]:
xgb_est.best_params()

In [None]:
xgb_est.best_score()