Parallelize Pipeline-GridSearch with Dask Delayed
=================================================

In this exercise we parallelize hyper-parameter selection on a Scikit-Learn pipeline.  This is an example of a non-trivial parallel algorithm that we can write down with for loops and Dask Delayed.

We extend an [example taken from the Scikit-Learn documentation](http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html) that builds a pipeline to transform text data and train a classifier on it.  We recommend that you review that example by following the link before continuing.

## Data Preparation

We copy the first part of that example, which sets up the data and the parameters for the pipeline.

In [None]:
from pprint import pprint
from time import time
import logging

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# #############################################################################
# Load some categories from the training set
categories = [
    'alt.atheism',
    'talk.religion.misc',
]
# Uncomment the following to do the analysis on all the categories
#categories = None

print("Loading 20 newsgroups dataset for categories:")
print(categories)

data = fetch_20newsgroups(subset='train', categories=categories)
print("%d documents" % len(data.filenames))
print("%d categories" % len(data.target_names))
print()



In [None]:
# #############################################################################
# Define a pipeline combining a text feature extractor with a simple
# classifier
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])


# uncommenting more parameters will give better exploring power but will
# increase processing time in a combinatorial way
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    # 'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    # 'tfidf__use_idf': (True, False),
    # 'tfidf__norm': ('l1', 'l2'),
    'clf__max_iter': (5,),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    # 'clf__max_iter': (10, 50, 80),
}

## Unroll Pipeline-GridSearch into nested for loops

Normally with Scikit-Learn you would now construct a `GridSearchCV` object from the pipeline and the parameter grid, and call `fit` on it.  Scikit-Learn would then run the whole search on your local machine using its own internal machinery.
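
For reference, the conventional approach might look like the sketch below.  To stay self-contained it uses a tiny made-up corpus (`docs`, `labels`) and a reduced grid (`small_grid`) instead of the newsgroups data and the `parameters` dictionary defined above.  Those names and values are illustrative assumptions, not part of the original example.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: two classes, four documents each
docs = [
    "the cat sat on the mat", "a cat chased the mouse",
    "cats purr when happy", "my cat sleeps all day",
    "stocks fell sharply today", "the market rallied on earnings",
    "investors sold their shares", "bond yields rose again",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    # tol=None runs exactly max_iter epochs; random_state makes runs repeatable
    ("clf", SGDClassifier(max_iter=5, tol=None, random_state=0)),
])

# Reduced grid (assumption): 2 x 2 = 4 parameter combinations
small_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [1e-5, 1e-6],
}

# cv=2 keeps the toy folds large enough to contain both classes
search = GridSearchCV(toy_pipeline, small_grid, cv=2)
search.fit(docs, labels)

print(search.best_params_)
print(search.best_score_)
```

All of the splitting, fitting, and scoring happens inside the single call to `fit`, which is what the rest of this notebook unrolls by hand.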

Grid search is a common operation that people want to parallelize across a cluster.  We can do so by writing out the process explicitly as a deeply nested for loop: one loop for the train/test splits, then one loop for each parameter over which we want to iterate.

In [None]:
parameter_scores = []

for i in range(5):
    X_train, X_test, y_train, y_test = train_test_split(data.data, data.target)

    for max_df in [0.5, 0.75, 1.0]:
        for ngram_range in [(1, 1), (1, 2)]:
            vect = CountVectorizer(max_df=max_df, ngram_range=ngram_range)
            vect = vect.fit(X_train)
            X2_train = vect.transform(X_train)
            X2_test = vect.transform(X_test)
            for norm in ['l1', 'l2']:
                tfidf = TfidfTransformer(norm=norm)
                tfidf = tfidf.fit(X2_train)
                X3_train = tfidf.transform(X2_train)
                X3_test = tfidf.transform(X2_test)
                
                for max_iter in [5]:
                    for alpha in [0.00001, 0.000001]:
                        for penalty in ['l2', 'elasticnet']:
                            clf = SGDClassifier(max_iter=max_iter, alpha=alpha, penalty=penalty)
                            clf = clf.fit(X3_train, y_train)
                            
                            score = clf.score(X3_test, y_test)
                            params = {
                                'max_df': max_df,
                                'ngram_range': ngram_range,
                                'norm': norm,
                                'max_iter': max_iter,
                                'alpha': alpha,
                                'penalty': penalty
                            }
                            
                            parameter_scores.append((params, score))
                            
best = max(parameter_scores, 
           key=lambda param_score: param_score[1])

In [None]:
best
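
Because each stage is fitted inside the loop that sets its own parameters, the loops above reuse a fitted vectorizer and transformer across all of the inner iterations.  The short sketch below only counts how many fits each stage performs under that nesting.  The variable names are ours, and the grids are copied from the loops above.

```python
from itertools import product

n_splits = 5

# Parameter values taken from the nested loops above
vect_grid = list(product([0.5, 0.75, 1.0], [(1, 1), (1, 2)]))  # max_df, ngram_range
tfidf_grid = ["l1", "l2"]                                       # norm
clf_grid = list(product([5], [0.00001, 0.000001], ["l2", "elasticnet"]))

# Each stage is refitted once per combination of its own and all outer parameters
vect_fits = n_splits * len(vect_grid)
tfidf_fits = vect_fits * len(tfidf_grid)
clf_fits = tfidf_fits * len(clf_grid)

print(vect_fits, tfidf_fits, clf_fits)  # 30 60 240
```

A search that refitted the whole pipeline for every parameter combination would instead fit each of the three stages 240 times, so preserving this sharing is worth keeping in mind when parallelizing.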

## Exercise: Parallelize the computation above with Dask Delayed

1.  Use Dask Delayed to parallelize the code above.
2.  Check your graph using the `.visualize` method or the `dask.visualize` function.
3.  Start a Dask cluster using `KubeCluster`.
4.  Run your computation on the cluster.
5.  Is it faster or slower than the sequential version?
6.  Use the dashboard to determine which operations take the most time.
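
As a warm-up for step 1, here is a minimal sketch of the `dask.delayed` pattern on a toy problem.  It is not the solution to the exercise: `fit_stage` is a hypothetical stand-in for an expensive fit or transform step, and the parameter values are arbitrary.

```python
import dask
from dask import delayed


@delayed
def fit_stage(name, upstream=None):
    # Stand-in for an expensive fit/transform step (hypothetical)
    return name if upstream is None else f"{upstream}->{name}"


results = []
for max_df in [0.5, 1.0]:
    # One lazy "vect" task per max_df, shared by the inner loop
    vect = fit_stage(f"vect({max_df})")
    for alpha in [1e-5, 1e-6]:
        # Each "clf" task depends on the shared vect task
        clf = fit_stage(f"clf({alpha})", vect)
        results.append(clf)

# Nothing has run yet: results is a list of lazy Delayed objects.
# dask.compute executes the whole graph, running each shared task once.
(computed,) = dask.compute(results, scheduler="threads")
print(computed)
```

Here the local threaded scheduler is used so the sketch runs without a cluster.  Calling `dask.visualize(*results)` on the lazy objects should draw the task graph, assuming the optional Graphviz dependency is installed.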

In [None]:
%load solutions/03-grid-search.py

## Look at Dask-ML

Operations like these are already implemented in [Dask-ML](https://ml.dask.org), which provides a `GridSearchCV` in `dask_ml.model_selection` that mirrors the Scikit-Learn interface.