<img src="http://dask.readthedocs.io/en/latest/_images/dask_horizontal.svg"
     align="right"
     width="30%"
     alt="Dask logo">


Model Parallelism with SKLearn and Dask
=======================


<img src="https://avatars2.githubusercontent.com/u/365630?v=3&s=400"
     align="right"
     width="25%"
     alt="SKLearn logo">


*How do we choose the right parameters for a machine learning pipeline?*

This notebook takes a standard example from the Scikit-Learn documentation and parallelizes it using a Dask-powered `GridSearchCV`, which is a drop-in replacement. We achieve a significant speedup on a cluster for an important problem just by changing an import.


### SKLearn example

Taken from: http://scikit-learn.org/stable/auto_examples/plot_digits_pipe.html#sphx-glr-auto-examples-plot-digits-pipe-py

In [None]:
# Code source: Gaël Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause


import numpy as np
import matplotlib.pyplot as plt

from sklearn import linear_model, decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

logistic = linear_model.LogisticRegression()

pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])

digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
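
The step names passed to `Pipeline` above (`'pca'` and `'logistic'`) matter later, because scikit-learn addresses a parameter of a pipeline step as `<step>__<parameter>`. The following is a minimal plain-Python sketch of that naming convention only; `split_param` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
def split_param(name):
    """Split 'step__param' into (step, param). Hypothetical helper, not sklearn API."""
    step, sep, param = name.partition("__")
    if not sep:
        raise ValueError(f"{name!r} has no '__' separator")
    return step, param


# The two names used in the grid search of this notebook
for name in ("pca__n_components", "logistic__C"):
    step, param = split_param(name)
    print(f"{name} -> step {step!r}, parameter {param!r}")
```

Read this way, `pca__n_components` targets `n_components` on the `'pca'` step and `logistic__C` targets `C` on the `'logistic'` step.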

In [None]:
from dask.distributed import Client

# Start a local Dask cluster that uses worker processes
client = Client(processes=True)
client

In [None]:
# Importing distributed.joblib registers the 'dask.distributed' joblib backend
import distributed.joblib
import sklearn.externals.joblib as joblib

In [None]:
%%time

n_components = [20, 40, 64]
Cs = np.logspace(-4, 4, 3)

# Parameters of pipelines can be set using ‘__’ separated parameter names:
estimator = GridSearchCV(pipe,
                         dict(pca__n_components=n_components,
                              logistic__C=Cs))

# Run the fits on the Dask cluster through the joblib backend
with joblib.parallel_backend('dask.distributed', scheduler_host=client.scheduler.address):
    estimator.fit(X_digits, y_digits)
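
Grid search parallelizes well because every parameter combination can be fitted independently of the others. As a rough sketch of how much work the grid in this notebook represents, the snippet below enumerates the combinations in plain Python. `expand_grid` is a hypothetical helper, and the fold count is an assumption, since the real number depends on the `cv` setting of `GridSearchCV`.

```python
from itertools import product

# Same grid as in the notebook; np.logspace(-4, 4, 3) yields 1e-4, 1e0, 1e4.
param_grid = {
    "pca__n_components": [20, 40, 64],
    "logistic__C": [10.0 ** e for e in (-4, 0, 4)],
}


def expand_grid(grid):
    """Return one dict per parameter combination. Hypothetical helper."""
    names = sorted(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


candidates = expand_grid(param_grid)

# Assumption: 3-fold cross-validation; adjust to match the actual `cv` value.
n_folds = 3
n_fits = len(candidates) * n_folds

print(len(candidates), "candidates,", n_fits, "independent fits")
```

Each of those fits is a separate task that a scheduler can hand to any available worker, which is why a larger grid benefits more from a cluster.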