# Data Modeling 02

In this notebook, we'll use Dask to tune a classifier with `HyperbandSearchCV` so that we can train many models in parallel on the PRP.
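Hyperband is built on successive halving: train many candidates on a small budget, keep the best fraction, and give the survivors more budget. The sketch below illustrates only that core idea in plain Python. It is not dask-ml's implementation, and `successive_halving`, `toy_score`, and the candidate `C` values are hypothetical names and numbers chosen for illustration.

```python
import math

def successive_halving(configs, score_fn, min_budget=1, eta=3):
    """Keep the best 1/eta of the configs each round while multiplying the budget by eta."""
    budget = min_budget
    survivors = list(configs)
    history = []  # (number of configs evaluated, budget per config) for each round
    while len(survivors) > 1:
        scored = sorted(survivors, key=lambda c: score_fn(c, budget), reverse=True)
        history.append((len(survivors), budget))
        keep = max(1, len(scored) // eta)
        survivors = scored[:keep]
        budget *= eta
    return survivors[0], history

# Hypothetical stand-in for "train for `budget` steps, then score the model".
# It simply prefers C close to 10; a real search would fit and evaluate a model here.
def toy_score(config, budget):
    return -abs(math.log10(config['C']) - 1.0) * (1 - 1 / (budget + 1))

configs = [{'C': 10 ** e} for e in (-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2)]
best, history = successive_halving(configs, toy_score)
print(best, history)
```

With nine candidates and `eta=3`, the loop evaluates nine configs at budget 1, then three at budget 3, so most of the compute goes to the candidates that looked promising early.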

In [31]:
import dask
import numpy as np
import matplotlib.pyplot as plt 

import dask.dataframe as dd
from dask_ml.model_selection import train_test_split, HyperbandSearchCV, RandomizedSearchCV, GridSearchCV
from dask_ml.linear_model import LogisticRegression

Now, let's read in our cleaned (and, for this local example, reduced) data and train a model on it.

In [62]:
X = dd.read_csv('../data/processed/primary_reduction_neighbors_15_components_3.csv')
y = dd.read_csv('../data/processed/primary_labels_neighbors_15_components_50.csv', header=None)

In [80]:
est = LogisticRegression()

grid = RandomizedSearchCV(
    n_iter=15,
    estimator=est,
    param_distributions={
        'penalty' : ['l1', 'l2'],
        'C' : np.linspace(0.1, 100, 50)
    },
    scoring='balanced_accuracy',
)
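The search space above has 2 penalties × 50 values of `C` = 100 combinations, of which `n_iter=15` are tried. As a rough illustration of what "randomized" means here, the following sketch rebuilds that space in plain Python and draws 15 distinct candidates. It is a stand-in for the idea only, not the sampler the library uses, and the seed is an arbitrary choice.

```python
import itertools
import random

# 50 evenly spaced C values from 0.1 to 100, matching np.linspace(0.1, 100, 50).
c_values = [0.1 + i * (100 - 0.1) / 49 for i in range(50)]
penalties = ['l1', 'l2']

# Every (penalty, C) pair the search could pick.
space = list(itertools.product(penalties, c_values))

# Draw 15 distinct candidates; the seed is arbitrary and only makes the draw repeatable.
rng = random.Random(0)
candidates = [{'penalty': p, 'C': c} for p, c in rng.sample(space, 15)]

print(len(space), len(candidates))
```

Because only 15 of the 100 combinations are evaluated, two runs with different seeds can land on different best parameters.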

In [81]:
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True)

In [82]:
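# fit() returns the fitted search object itself, so `best_est` is the whole search, not a single model
best_est = grid.fit(X_train.values, y_train.values)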

In [83]:
best_est.cv_results_

{'params': [{'penalty': 'l2', 'C': 49.030612244897966},
  {'penalty': 'l1', 'C': 73.49591836734695},
  {'penalty': 'l2', 'C': 26.604081632653067},
  {'penalty': 'l2', 'C': 75.53469387755102},
  {'penalty': 'l1', 'C': 28.642857142857146},
  {'penalty': 'l1', 'C': 22.5265306122449},
  {'penalty': 'l1', 'C': 77.57346938775511},
  {'penalty': 'l2', 'C': 93.88367346938776},
  {'penalty': 'l1', 'C': 4.177551020408163},
  {'penalty': 'l1', 'C': 36.79795918367348},
  {'penalty': 'l2', 'C': 89.8061224489796},
  {'penalty': 'l1', 'C': 6.216326530612245},
  {'penalty': 'l2', 'C': 40.87551020408164},
  {'penalty': 'l1', 'C': 89.8061224489796},
  {'penalty': 'l1', 'C': 100.0}],
 'mean_fit_time': array([ 9.71278294, 10.70838801, 10.27951299, 11.82670919, 12.62142772,
        12.74563172, 12.57570108, 11.91387925, 12.33058006, 12.26729521,
        11.38746392, 12.31507067, 11.13221544,  9.52779222,  7.88915701]),
 'std_fit_time': array([0.12342913, 0.35875605, 0.15985116, 0.72564082, 0.23096534,
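The raw `cv_results_` dictionary is hard to scan because its arrays are parallel: position `i` in each array refers to candidate `i` in `'params'`. A minimal sketch of pairing them up follows, using four entries copied from the output above (indices 0, 1, 2, and 14). The helper name `rows_sorted_by` is hypothetical.

```python
# Four candidates copied from the cv_results_ output above (indices 0, 1, 2 and 14).
cv_results_subset = {
    'params': [
        {'penalty': 'l2', 'C': 49.030612244897966},
        {'penalty': 'l1', 'C': 73.49591836734695},
        {'penalty': 'l2', 'C': 26.604081632653067},
        {'penalty': 'l1', 'C': 100.0},
    ],
    'mean_fit_time': [9.71278294, 10.70838801, 10.27951299, 7.88915701],
}

def rows_sorted_by(results, key):
    """Pair each candidate's params with results[key] and sort ascending by that value."""
    rows = zip(results['params'], results[key])
    return sorted(rows, key=lambda row: row[1])

for params, seconds in rows_sorted_by(cv_results_subset, 'mean_fit_time'):
    print(f"{params['penalty']}  C={params['C']:8.3f}  fit={seconds:6.2f}s")
```

The same helper should work for any other per-candidate array in the dictionary, for example a `'mean_test_score'` key, assuming the results follow the scikit-learn `cv_results_` layout.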
    

Now let's look at the best cross-validated score (balanced accuracy) that the search found.

In [85]:
best_est.best_score_

0.07142857142857142
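This score is worth a second look. Balanced accuracy is the mean of per-class recall, so a model that always predicts the same class scores 1/K on K classes, and the value above matches 1/14 to the printed precision. One possible reading, which the notebook does not confirm, is that the labels have 14 classes and the best model is predicting only one of them. The sketch below checks the arithmetic with hypothetical labels. The class count and sample counts are assumptions, and `balanced_accuracy` is a hand-written helper rather than a library call.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, which is how balanced accuracy is defined."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)

# Hypothetical labels: 14 classes with 5 samples each (the real class count is an assumption).
n_classes = 14
y_true = [c for c in range(n_classes) for _ in range(5)]

# A degenerate model that always predicts class 0.
y_pred = [0] * len(y_true)

score = balanced_accuracy(y_true, y_pred)
print(score)
```

If the real labels do behave this way, checking the distribution of predicted classes on the test split would be a reasonable next step before tuning further.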

Now let's define a generalized class to do this hyperparameter tuning.