# Distributed Scikit-Learn with Dask 03

#### Objective:
- Demonstrate how to run distributed scikit-learn algorithms with Dask on a CML cluster
- Note: this is different from the Dask-ML library, which is the subject of the next notebook

#### For a comparison with Dask-ML see: https://tutorial.dask.org/08_machine_learning.html#Types-of-Scaling

Code from: https://examples.dask.org/machine-learning/scale-scikit-learn.html

In [1]:
import cdsw_dask_utils
import cdsw

# Run a Dask cluster with three workers and return an object containing
# a description of the cluster. 
# 
# Note that the scheduler will run in the current session, and the Dask
# dashboard will become available in the nine-dot menu at the upper
# right corner of the CDSW app.

cluster = cdsw_dask_utils.run_dask_cluster(
    n=3,
    cpu=1,
    memory=1,
    nvidia_gpu=0,
)

# Connect a Dask client to the scheduler address in the cluster
# description.
from dask.distributed import Client
client = Client(cluster["scheduler_address"])
client

Waiting for Dask scheduler to become ready...
Dask scheduler is ready
IDs ['0wwz1hv7uurjy7jd', 'c7wclzioorq7m9fh', '5ah93un6bffsyb0s']


Client   Scheduler: tcp://10.0.85.15:2323   Dashboard: http://10.0.85.15:8100/status
Cluster  Workers: 1   Cores: 16   Memory: 1000.00 MB


#### Dask Scheduler UI

In [None]:
import os

from IPython.display import HTML

# Environment variables that identify this CDSW session.
engine_id = os.environ.get('CDSW_ENGINE_ID')
cdsw_domain = os.environ.get('CDSW_DOMAIN')

# Render a clickable link to the session's read-only endpoint.
url = 'http://read-only-{}.{}'.format(engine_id, cdsw_domain)
HTML('<a target="_blank" rel="noopener noreferrer" href="{0}">{0}</a>'.format(url))

In [2]:
from pprint import pprint
from time import time
import logging

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

In [3]:
from dask.distributed import Client, progress
import dask.dataframe as dd

In [6]:
# Scale Up: set categories=None to use all the categories
categories = [
    'alt.atheism',
    'talk.religion.misc',
]

print("Loading 20 newsgroups dataset for categories:")
print(categories)

data = fetch_20newsgroups(subset='train', categories=categories)
print("%d documents" % len(data.filenames))
print("%d categories" % len(data.target_names))
print()

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


Loading 20 newsgroups dataset for categories:
['alt.atheism', 'talk.religion.misc']
857 documents
2 categories



In [7]:
pipeline = Pipeline([
    ('vect', HashingVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(max_iter=1000)),
])

In [8]:
parameters = {
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (0.00001, 0.000001),
    # 'clf__penalty': ('l2', 'elasticnet'),
    # 'clf__max_iter': (10, 50, 80),
}

In [9]:
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, cv=3, refit=False)
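
The size of the search follows directly from the `parameters` grid and `cv=3`. As a rough illustration using only the standard library (this is a sketch of the counting, not how scikit-learn enumerates candidates internally), the grid can be expanded by hand:

```python
from itertools import product

# Same grid as the `parameters` dict defined earlier in this notebook.
param_grid = {
    "tfidf__use_idf": (True, False),
    "tfidf__norm": ("l1", "l2"),
    "clf__alpha": (0.00001, 0.000001),
}
n_folds = 3  # matches cv=3 in the GridSearchCV call

# Cartesian product of all value tuples: one dict per candidate setting.
candidates = [
    dict(zip(param_grid, values))
    for values in product(*param_grid.values())
]
total_fits = len(candidates) * n_folds

print(f"{len(candidates)} candidates x {n_folds} folds = {total_fits} fits")
```

Each of these fits is an independent task, which is why a grid search is a natural fit for a parallel backend.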

#### To use the Dask distributed backend, which trains the model across a cluster of machines, perform the fit inside a `parallel_backend` context.

In [12]:
from joblib import parallel_backend

with parallel_backend('dask'):
    grid_search.fit(data.data, data.target)

[Parallel(n_jobs=-1)]: Using backend DaskDistributedBackend with 48 concurrent workers.


Fitting 3 folds for each of 8 candidates, totalling 24 fits


distributed.client - ERROR - Error in callback <function DaskDistributedBackend.apply_async.<locals>.callback_wrapper at 0x7fbee9a32510> of <Future: status: cancelled, key: _fit_and_score-batch-9182a6b64c5f40fcb48d5382bd753ea7>:
Traceback (most recent call last):
  File "/home/cdsw/.local/lib/python3.6/site-packages/distributed/client.py", line 287, in execute_callback
    fn(fut)
  File "/home/cdsw/.local/lib/python3.6/site-packages/joblib/_dask.py", line 260, in callback_wrapper
    result = future.result()
  File "/home/cdsw/.local/lib/python3.6/site-packages/distributed/client.py", line 224, in result
    raise result
concurrent.futures._base.CancelledError: _fit_and_score-batch-9182a6b64c5f40fcb48d5382bd753ea7
distributed.client - ERROR - Error in callback <function DaskDistributedBackend.apply_async.<locals>.callback_wrapper at 0x7fbe69ece950> of <Future: status: cancelled, key: _fit_and_score-batch-2974eab0a6a0469da36b3a91a00825c4>:
Traceback (most recent call last):
  File "/ho

distributed.client - ERROR - Error in callback <function DaskDistributedBackend.apply_async.<locals>.callback_wrapper at 0x7fbe74437a60> of <Future: status: cancelled, key: _fit_and_score-batch-77644c6852b244d0ae774557f8be3e38>:
Traceback (most recent call last):
  File "/home/cdsw/.local/lib/python3.6/site-packages/distributed/client.py", line 287, in execute_callback
    fn(fut)
  File "/home/cdsw/.local/lib/python3.6/site-packages/joblib/_dask.py", line 260, in callback_wrapper
    result = future.result()
  File "/home/cdsw/.local/lib/python3.6/site-packages/distributed/client.py", line 224, in result
    raise result
concurrent.futures._base.CancelledError: _fit_and_score-batch-77644c6852b244d0ae774557f8be3e38
distributed.client - ERROR - Error in callback <function DaskDistributedBackend.apply_async.<locals>.callback_wrapper at 0x7fbe74437620> of <Future: status: cancelled, key: _fit_and_score-batch-2c311939a72043109faa37ed2f90be62>:
Traceback (most recent call last):
  File "/ho

CancelledError: _fit_and_score-batch-a65945a1f5b44e85a720511f734180c6

distributed.client - ERROR - Error in callback <function DaskDistributedBackend.apply_async.<locals>.callback_wrapper at 0x7fbe74437950> of <Future: status: cancelled, key: _fit_and_score-batch-26b80f5d41d248cc9f9dd4d70040547b>:
Traceback (most recent call last):
  File "/home/cdsw/.local/lib/python3.6/site-packages/distributed/client.py", line 287, in execute_callback
    fn(fut)
  File "/home/cdsw/.local/lib/python3.6/site-packages/joblib/_dask.py", line 260, in callback_wrapper
    result = future.result()
  File "/home/cdsw/.local/lib/python3.6/site-packages/distributed/client.py", line 224, in result
    raise result
concurrent.futures._base.CancelledError: _fit_and_score-batch-26b80f5d41d248cc9f9dd4d70040547b
distributed.client - ERROR - Error in callback <function DaskDistributedBackend.apply_async.<locals>.callback_wrapper at 0x7fbe74437ea0> of <Future: status: cancelled, key: _fit_and_score-batch-397d8bb295aa494aa06606779ce71249>:
Traceback (most recent call last):
  File "/ho
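
The `parallel_backend` pattern used in the fit above can also be tried without a cluster. The following is a minimal sketch rather than this notebook's own code: it assumes a small synthetic dataset from `make_classification` instead of the newsgroups data, and it substitutes joblib's built-in `threading` backend for `dask` so that it runs on a single machine. With a connected `dask.distributed` `Client`, the backend name would be `'dask'` instead.

```python
from joblib import parallel_backend
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic problem so the sketch runs quickly (an assumption for
# illustration; the notebook itself uses the 20 newsgroups text data).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

search = GridSearchCV(
    SGDClassifier(max_iter=1000, random_state=0),
    {"alpha": (1e-5, 1e-6)},
    cv=3,
    n_jobs=2,
    refit=False,
)

# "threading" is a stand-in backend that needs no cluster. Everything
# joblib schedules inside this block is dispatched through that backend.
with parallel_backend("threading"):
    search.fit(X, y)

# With a single scoring metric, the best parameters are recorded even
# though refit=False means no final estimator is trained.
print(search.best_params_)
print(len(search.cv_results_["params"]), "candidates evaluated")
```

Because `refit=False`, there is no `best_estimator_` to predict with; the search only reports scores and the best parameter setting.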

In [None]:
# Stop the CDSW worker engines.
# Parameter:
#   worker_id (int, optional): IDs of the worker engines to stop.
#   If no ID is provided, all worker engines on the cluster are stopped.

cdsw.stop_workers()

#### Next, we will see how to use the Dask-ML library as an alternative to the approach above