Scalable Machine Learning in Python 
===================
with Scikit-Learn and Dask 
===============
## 4 - Dask and Scikit Learn
**May 2017**

<a href=http://dask.pydata.org ><img src=https://www.continuum.io/sites/default/files/dask_stacked.png
 width=200 />
</a>

[http://bit.ly/scaleml-dask-wkshp](http://bit.ly/scaleml-dask-wkshp)


Cross-Validated Parameter Search
------------------------------------

In this section we present an open-ended problem, cross-validated parameter search, and encourage students to parallelize it with one of the previously mentioned techniques.  Any of `map`, `submit`, or a collection such as Spark or `dask.bag` will work.


### Requirements

*  scikit-learn
*  A parallel computing framework of your choice


### Application

We search a grid of candidate parameters to tune a machine learning model.  Each task depends on both a data split and a parameter sample, which is slightly more complex than a single `map`, so we use `submit`.  We train a support vector classifier on handwritten digits and use cross-validation to guard against over-fitting.
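
As a minimal sketch of the `submit` pattern, the cell below uses the standard library's `concurrent.futures` rather than any particular cluster framework. The `toy_score` function, the split ids, and the parameter values are hypothetical stand-ins, not part of the workshop code; a real task would fit and score a model instead.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def toy_score(split_id, params):
    """Hypothetical stand-in for fitting and scoring one model."""
    # Fake score that peaks at C == 1.0, so the example has a clear winner
    score = 1.0 / (1.0 + abs(params["C"] - 1.0))
    return split_id, params, score


splits = [0, 1]                                         # assumed split ids
param_samples = [{"C": 0.1}, {"C": 1.0}, {"C": 10.0}]   # assumed samples

with ThreadPoolExecutor(max_workers=4) as pool:
    # One submit call per (split, params) pair; each returns a future
    futures = [pool.submit(toy_score, s, p)
               for s, p in product(splits, param_samples)]
    # result() blocks until the corresponding task has finished
    results = [f.result() for f in futures]

best = max(results, key=lambda r: r[2])
print(len(results), best)
```

Because every pair is an independent task, the same loop structure carries over to other executors that expose a `submit` method.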

As before we start with a sequential solution.

### Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
%matplotlib inline

### Data

In [None]:
from sklearn.datasets import load_digits

digits = load_digits()  # Collect Data

plt.imshow(digits.data[75].reshape(8, 8),  # Example element
           interpolation='nearest', cmap='Greys');

### Utility functions

We use three utility functions that we provide in the `cv_params_demo.py` module. The `load_cv_split` function splits the data into a training and test set. `evaluate_one` fits the model and scores it over the data for a particular set of tuning parameters. `plot_results` visualizes the model score over the sampled parameter space.
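
The `cv_params_demo.py` module itself is not shown here. As a rough sketch only, the two data-handling helpers might look something like the following. The names `load_cv_split_sketch` and `evaluate_one_sketch`, the 25% test size, and the returned tuple layout are all assumptions made for illustration, not the module's actual implementation.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def load_cv_split_sketch(split_idx, test_size=0.25):
    """Hypothetical: one random train/test split, seeded by its index."""
    digits = load_digits()
    x_train, x_test, y_train, y_test = train_test_split(
        digits.data, digits.target,
        test_size=test_size, random_state=split_idx)
    return split_idx, (x_train, x_test, y_train, y_test)


def evaluate_one_sketch(model_class, params, split):
    """Hypothetical: fit on the training part, score on the test part."""
    split_idx, (x_train, x_test, y_train, y_test) = split
    model = model_class(**params).fit(x_train, y_train)
    # Return layout is an assumption: (split index, params, accuracy)
    return split_idx, params, model.score(x_test, y_test)


demo_params = {"C": 10.0, "gamma": 0.001, "tol": 1e-3}
demo_result = evaluate_one_sketch(SVC, demo_params, load_cv_split_sketch(0))
print(demo_result)
```

The important property for parallelization is that each call depends only on its arguments, so calls for different splits and parameter samples can run independently.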

In [None]:
from cv_params_demo import load_cv_split, evaluate_one, plot_results

### Parameters

`C`, `gamma`, and `tol` are all tunable parameters of the support vector classifier, representing the penalty parameter of the error term, the kernel coefficient, and the stopping tolerance, respectively. Although `scikit-learn` picks reasonable defaults for each of these, they can frequently be improved either with additional knowledge of the data or, as we do here, by randomly sampling the parameter space. We start with ten parameter samples, but can increase this after we've built our parallel solution.

In [None]:
CV_SPLIT_COUNT     = 3   # increase to 5 when parallel code is working
PARAM_SAMPLE_COUNT = 10  # increase to 40 when parallel code is working

In [None]:
from sklearn.model_selection import ParameterSampler

param_grid = {
    'C': np.logspace(-10, 10, 1001),
    'gamma': np.logspace(-10, 10, 1001),
    'tol': np.logspace(-4, -1, 4),
}

param_samples = ParameterSampler(param_grid, PARAM_SAMPLE_COUNT)

print(len(param_samples))
list(param_samples)

### Split data for cross-validation

For now, we'll build only three randomly chosen splits of the data for training and testing (`CV_SPLIT_COUNT = 3`). We can increase this number after we've built our parallel solution.

In [None]:
from cv_params_demo import load_cv_split

cv_splits = [load_cv_split(i) for i in range(CV_SPLIT_COUNT)]
idx, (x_train, x_test, y_train, y_test) = cv_splits[0]
x_train, y_train

### Sequential cross validated parameter search

The code below loops sequentially over the randomly created data splits and the parameter samples, producing a list of scored samples over the parameter space.

In [None]:
%%time

from sklearn.svm import SVC

results = []

for split in cv_splits:
    for params in param_samples:
        result = evaluate_one(SVC, params, split)
        results.append(result)

### Plot results

Which region of the parameter space is scoring well (higher is better)?  Is the number of samples we've computed sufficient to completely tune the model?

Sampling more parameter combinations would help to improve the intuition we can gain here.

In [None]:
from cv_params_demo import plot_results

plot_results(results)

## Exercise 4.1 Parallel cross validated parameter search

Try using some of the techniques we've used before (or other techniques altogether) to parallelize the above computation.  

Afterwards, increase the number of parameter samples to improve our understanding of the parameter space.

In [None]:
from distributed import Client

client = Client()

In [None]:
client

In [None]:
CV_SPLIT_COUNT     = 5   # increased from 3 now that the parallel code is working
PARAM_SAMPLE_COUNT = 40  # increased from 10 now that the parallel code is working

In [None]:
from sklearn.model_selection import ParameterSampler

param_grid = {
    'C': np.logspace(-10, 10, 1001),
    'gamma': np.logspace(-10, 10, 1001),
    'tol': np.logspace(-4, -1, 4),
}

param_samples = ParameterSampler(param_grid, PARAM_SAMPLE_COUNT)

print(len(param_samples))
list(param_samples)

In [None]:
cv_splits     = [load_cv_split(i) for i in range(CV_SPLIT_COUNT)]  # rebuild the splits with the larger CV_SPLIT_COUNT
param_samples = ParameterSampler(param_grid, PARAM_SAMPLE_COUNT)    # resample with the larger PARAM_SAMPLE_COUNT

In [None]:
%%time

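from dask import delayed, compute
from sklearn.svm import SVC

results = []

# Build one lazy task per (split, params) pair; nothing is executed yet
for split in cv_splits:
    for params in param_samples:
        result = delayed(evaluate_one)(SVC, params, split)
        results.append(result)

# compute() returns a tuple with one entry per argument,
# so the list of scored samples is final[0]
final = compute(results)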

In [None]:
plot_results(final[0])
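
The same kind of search can also be written with futures instead of `delayed`, using `Client.submit` and `Client.gather`. The sketch below is only an illustration: it uses a hypothetical `toy_evaluate` function and made-up parameter values in place of `evaluate_one`, and it starts a separate in-process client named `demo_client` so that it does not interfere with the `client` created above.

```python
from distributed import Client


def toy_evaluate(split_id, params):
    """Hypothetical stand-in for evaluate_one: returns a fake score."""
    score = 1.0 / (1.0 + abs(params["C"] - 1.0))
    return split_id, params, score


splits = [0, 1]                                         # assumed split ids
param_samples = [{"C": 0.1}, {"C": 1.0}, {"C": 10.0}]   # assumed samples

# processes=False keeps the scheduler and workers as threads in this process
demo_client = Client(processes=False)
try:
    # submit returns immediately with a future for each task
    futures = [demo_client.submit(toy_evaluate, s, p)
               for s in splits
               for p in param_samples]
    # gather blocks until all futures finish and preserves their order
    results = demo_client.gather(futures)
finally:
    demo_client.close()

print(results)
```

Unlike `delayed`, which builds the whole graph before `compute` is called, `submit` starts work as soon as each task is sent to the scheduler.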

## Read More

* [Dask SVD](http://matthewrocklin.com/blog/work/2015/06/26/Complex-Graphs)
* [Blog post on Grid Search with Dask](http://matthewrocklin.com/blog/work/2017/02/07/dask-sklearn-simple)
    * NOTE: The `dklearn` package has been reorganized and renamed
    * This piece is now [dask-searchcv](https://github.com/dask/dask-searchcv)
* [XGBoost and Dask Notebook](./X2 Dask XGBoost Example.ipynb)

In [None]:
client.restart()