# Hyperparameter tuning

## Dask

<img src="https://docs.dask.org/en/latest/_images/dask_horizontal.svg" width="400">

**Hardware**: 10 r5.8xlarge nodes (32 vCPUs, 256 GB RAM each)

In [1]:
from ml_utils import MLUtils

ml_utils = MLUtils(
    ml_task='tip',
    tool='dask',
    model='elastic_net',
)

In [2]:
from dask.distributed import Client
from dask_saturn import SaturnCluster

cluster = SaturnCluster(n_workers=10, scheduler_size='xlarge', worker_size='8xlarge', nthreads=32)
client = Client(cluster)
cluster

[2020-12-07 15:23:11] INFO - dask-saturn | Cluster is ready



# Load data and feature engineering

In [3]:
import pandas as pd
import numpy as np
import dask.dataframe as dd

In [4]:
%%time
tip_train = dd.read_parquet(f'{ml_utils.taxi_path}/data/ml/tip_train_sample', engine='pyarrow')
len(tip_train)

CPU times: user 90.9 ms, sys: 748 µs, total: 91.6 ms
Wall time: 3.82 s


10994502

In [5]:
tip_train.head()

| | id | pickup_datetime | dropoff_datetime | pickup_taxizone_id | dropoff_taxizone_id | pickup_weekday | pickup_weekofyear | pickup_hour | pickup_minute | pickup_week_hour | passenger_count | tip_fraction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 326fdd4d9a1843488a38d16a3bb6278b | 2016-07-16 18:24:40 | 2016-07-16 18:49:56 | 237.0 | 249.0 | 5 | 28 | 18 | 24 | 138 | 1.0 | 0.114286 |
| 1 | d58919163315476fbd3269d13c31173c | 2016-07-17 06:17:08 | 2016-07-17 06:53:45 | 132.0 | 239.0 | 6 | 28 | 6 | 17 | 150 | 1.0 | 0.224423 |
| 2 | caa9550ccbda4c1690514a10012e22ef | 2016-07-16 17:13:58 | 2016-07-16 17:21:27 | 161.0 | 163.0 | 5 | 28 | 17 | 13 | 137 | 1.0 | 0.221429 |
| 3 | 812739604c0f474995830e5bb0c5d272 | 2016-07-16 02:23:48 | 2016-07-16 03:03:08 | 148.0 | 75.0 | 5 | 28 | 2 | 23 | 122 | 1.0 | 0.208254 |
| 4 | 76ecb54bb45c49d293e81588a4e09720 | 2016-07-17 21:32:38 | 2016-07-17 22:00:32 | 138.0 | 87.0 | 6 | 28 | 21 | 32 | 165 | 5.0 | 0.235584 |


Let's take the same 10% sample we used in the single-node scikit-learn example.

In [6]:
sample = tip_train.sample(frac=0.1, replace=False, random_state=42)
len(sample)

1099448

# Run grid search

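- Use the `dask-ml` preprocessing and grid search classes (`Categorizer`, `DummyEncoder`, `StandardScaler`, `ColumnTransformer`, `GridSearchCV`)
- Still use `sklearn.linear_model.ElasticNet` for the model fitting itself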

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from dask_ml.compose import ColumnTransformer
from dask_ml.preprocessing import StandardScaler, DummyEncoder, Categorizer
from dask_ml.model_selection import GridSearchCV

features = ml_utils.tip_vars.features
y_col = ml_utils.tip_vars.y_col

pipeline = Pipeline(steps=[
    ('categorize', Categorizer(columns=ml_utils.tip_vars.categorical_feat)),
    ('onehot', DummyEncoder(columns=ml_utils.tip_vars.categorical_feat)),
    ('scale', ColumnTransformer(
        transformers=[('num', StandardScaler(), ml_utils.tip_vars.numeric_feat)], 
        remainder='passthrough',
    )),
    ('clf', ElasticNet(max_iter=100)),  # `normalize` (default False) was removed in scikit-learn 1.2
])

params = ml_utils.tip_vars.elastic_net_grid_search_params

grid_search = GridSearchCV(pipeline, params, cv=3, scoring='neg_mean_squared_error')

In [8]:
%%time
with ml_utils.time_fit():
    _ = grid_search.fit(sample[features], sample[y_col])
grid_search.best_score_

CPU times: user 2.29 s, sys: 211 ms, total: 2.5 s
Wall time: 21min 3s


-0.03564949121546809

In [9]:
grid_search.best_params_

{'clf__alpha': 0.5, 'clf__l1_ratio': 0.0}
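
The `clf__` prefix in these keys follows scikit-learn's `<step name>__<parameter>` convention for addressing parameters of a pipeline step. The actual grid lives in `ml_utils.tip_vars.elastic_net_grid_search_params` and is not shown in this notebook, so the grid below is a hypothetical stand-in, and the two-step pipeline is a simplified version of the one above. A minimal sketch of how such keys are expanded and applied:

```python
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Simplified pipeline; the final step is named 'clf' as in the notebook.
pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("clf", ElasticNet(max_iter=100)),
])

# Hypothetical grid (the real values are defined inside ml_utils).
params = {
    "clf__alpha": [0.5, 1.0],
    "clf__l1_ratio": [0.0, 0.5, 1.0],
}

# A grid search tries every combination of the listed values.
candidates = list(ParameterGrid(params))
print(len(candidates), "candidate settings")

# 'clf__alpha' is routed to the 'alpha' parameter of the 'clf' step.
pipeline.set_params(**{"clf__alpha": 0.5, "clf__l1_ratio": 0.0})
print(pipeline.named_steps["clf"])
```

With `cv=3`, each candidate is fit three times, so the number of model fits grows quickly with the size of the grid.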

## Save model

`GridSearchCV` automatically refits the pipeline with the best parameters on all of the data passed to `fit` and stores the result in `best_estimator_`.
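
To illustrate the refit behavior, here is a small sketch on synthetic data. It uses scikit-learn's `GridSearchCV` as a stand-in for the `dask_ml` class (both default to `refit=True`); the data, the grid values, and the variable names are made up for the example.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=200)

search = GridSearchCV(
    ElasticNet(max_iter=1000),
    {"alpha": [0.01, 0.1], "l1_ratio": [0.5, 1.0]},  # hypothetical grid
    cv=3,
    scoring="neg_mean_squared_error",
    refit=True,  # the default: refit the winning setting on all of X, y
)
search.fit(X, y)

# best_estimator_ is already fitted; search.predict delegates to it.
best = search.best_estimator_
same_predictions = np.allclose(search.predict(X), best.predict(X))

# Fitting the best parameters by hand on the full data gives the same model.
manual = ElasticNet(max_iter=1000, **search.best_params_).fit(X, y)
same_coefficients = np.allclose(manual.coef_, best.coef_)
print(same_predictions, same_coefficients)
```

This is why `best_estimator_` can be written out directly without a separate training step.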

In [10]:
ml_utils.write_model(grid_search.best_estimator_)

uploading model to 's3://saturn-titan/nyc-taxi/ml_results/models/tip__dask__elastic_net.pkl'
successfully uploaded model


## Predict on test set

If the test set were _really_ big, we could wrap the estimator in `dask_ml.wrappers.ParallelPostFit` to run the predictions in parallel. For now, we predict on a single node with scikit-learn.

In [11]:
%%time

tip_test = ml_utils.read_parquet_dir(f'{ml_utils.taxi_path}/data/ml/tip_test')
preds = tip_test[['id', y_col]].copy()
preds.columns = ['id', 'actual']
preds['predicted'] = grid_search.predict(tip_test[features])

CPU times: user 3min 37s, sys: 53.6 s, total: 4min 31s
Wall time: 5min 29s


In [12]:
preds.head()

| | id | actual | predicted |
|---|---|---|---|
| 0 | 2e8f402e4dc44f2fae8b9328a237c4d2 | 0.117647 | 0.218268 |
| 1 | 5f067a4121244f42bf460867c23b39c9 | 0.216842 | 0.218792 |
| 2 | 60e8442d3d434df4959261905a279f55 | 0.15 | 0.218514 |
| 3 | 2d1537ce2ed347778e078eaee7eacd44 | 0.10625 | 0.218924 |
| 4 | 13bb8a9ecbd04b559b7b9e40904026b0 | 0.0 | 0.211234 |


In [13]:
%%time
ml_utils.write_predictions(preds)

Writing predictions to 's3://saturn-titan/nyc-taxi/ml_results/predictions/tip__dask__elastic_net'
Done writing predictions
CPU times: user 10.3 s, sys: 2.47 s, total: 12.8 s
Wall time: 1min 36s


In [14]:
from sklearn.metrics import mean_squared_error

rmse = mean_squared_error(preds.actual, preds.predicted, squared=False)
ml_utils.write_metric_df('rmse', rmse)

| | ml_task | tool | model | metric | value | fit_seconds |
|---|---|---|---|---|---|---|
| 0 | tip | dask | elastic_net | rmse | 0.207701 | 1263.106061 |
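
The metric above is the root mean squared error, i.e. the square root of the mean of the squared differences between actual and predicted values. As a sketch, the same formula can be written out with NumPy on the five rows shown by `preds.head()`; this is only an illustration on a tiny subset, so its value differs from the full test-set RMSE. Note also that the `squared=False` argument was deprecated in scikit-learn 1.4 in favor of `sklearn.metrics.root_mean_squared_error`, so newer environments may need that function instead.

```python
import numpy as np

# The five rows from preds.head() above.
actual = np.array([0.117647, 0.216842, 0.15, 0.10625, 0.0])
predicted = np.array([0.218268, 0.218792, 0.218514, 0.218924, 0.211234])

# RMSE = sqrt(mean((actual - predicted)^2))
rmse_head = np.sqrt(np.mean((actual - predicted) ** 2))
print(f"RMSE on the five sample rows: {rmse_head:.4f}")
```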
