# Distributed Scikit-Learn with Dask 03

#### Objective:
- Demonstrate how to run distributed scikit-learn algorithms with Dask on a CML cluster.
- Note: this approach differs from the Dask-ML library, which is the subject of the next notebook.

#### For a comparison with Dask-ML see: https://tutorial.dask.org/08_machine_learning.html#Types-of-Scaling

Code from: https://examples.dask.org/machine-learning/scale-scikit-learn.html

In [1]:
import cdsw_dask_utils
import cdsw

# Run a Dask cluster with three workers and return an object containing
# a description of the cluster. 
# 
# Note that the scheduler will run in the current session, and the Dask
# dashboard will become available in the nine-dot menu at the upper
# right corner of the CDSW app.

cluster = cdsw_dask_utils.run_dask_cluster(
    n=3,
    cpu=1,
    memory=1,
    nvidia_gpu=0,
)

# Connect a Dask client to the scheduler address in the cluster
# description.
from dask.distributed import Client
client = Client(cluster["scheduler_address"])
client

Waiting for Dask scheduler to become ready...
Dask scheduler is ready
IDs ['7i1070ace2lmgo7v', '7u6ia88b0myb90ns', '410o5c4kp00jmsi9']


Client
  Scheduler: tcp://10.0.66.253:2323
  Dashboard: http://10.0.66.253:8100/status
Cluster
  Workers: 5
  Cores: 80
  Memory: 11.00 GB


#### Dask Scheduler UI

In [2]:
import os

# Engine ID and domain of the current CDSW session.
engine_id = os.environ.get('CDSW_ENGINE_ID')
cdsw_domain = os.environ.get('CDSW_DOMAIN')

# Render a clickable link to the session's read-only URL.
from IPython.display import HTML
url = 'http://read-only-{}.{}'.format(engine_id, cdsw_domain)
HTML('<a target="_blank" rel="noopener noreferrer" href="{0}">{0}</a>'.format(url))
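
The link above is built from two environment variables, and `os.environ.get` returns `None` when a variable is unset, which would produce a URL such as `http://read-only-None.None`. The following is a minimal sketch of a guard against that case; the helper name `read_only_url` is hypothetical and not part of the CDSW API.

```python
import os


def read_only_url(engine_id, domain):
    """Build the read-only session URL, or return None if a value is missing."""
    if not engine_id or not domain:
        return None
    return "http://read-only-{}.{}".format(engine_id, domain)


# Outside a CDSW session both variables are typically unset.
url = read_only_url(os.environ.get("CDSW_ENGINE_ID"), os.environ.get("CDSW_DOMAIN"))
print(url if url else "CDSW_ENGINE_ID / CDSW_DOMAIN not set; no link available")
```

Returning `None` instead of raising keeps the notebook cell usable when it is run outside a CDSW session.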

In [3]:
from pprint import pprint
from time import time
import logging

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

In [4]:
from dask.distributed import Client, progress
import dask.dataframe as dd

In [5]:
# Scale Up: set categories=None to use all the categories
categories = [
    'alt.atheism',
    'talk.religion.misc',
]

print("Loading 20 newsgroups dataset for categories:")
print(categories)

data = fetch_20newsgroups(subset='train', categories=categories)
print("%d documents" % len(data.filenames))
print("%d categories" % len(data.target_names))
print()

Loading 20 newsgroups dataset for categories:
['alt.atheism', 'talk.religion.misc']
857 documents
2 categories



In [6]:
pipeline = Pipeline([
    ('vect', HashingVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(max_iter=1000)),
])

In [7]:
parameters = {
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (0.00001, 0.000001),
    # Uncomment to widen the search:
    # 'clf__penalty': ('l2', 'elasticnet'),
    # 'clf__max_iter': (10, 50, 80),
}

In [8]:
# Note: `iid` was deprecated in scikit-learn 0.22 and removed in 0.24; omit it on newer versions.
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, cv=3, refit=False, iid=False)
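
The size of the search follows directly from the parameter grid and the number of folds. As a rough illustration using only the standard library (this is not how scikit-learn enumerates candidates internally), the grid can be expanded by hand to see how many model fits the cluster will be asked to run:

```python
from itertools import product

# Same grid as in the notebook (commented-out entries omitted).
parameters = {
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (0.00001, 0.000001),
}
cv = 3  # number of cross-validation folds passed to GridSearchCV

# Every combination of one value per parameter is one candidate.
names = sorted(parameters)
candidates = [
    dict(zip(names, values))
    for values in product(*(parameters[name] for name in names))
]

n_fits = len(candidates) * cv
print("%d candidates x %d folds = %d fits" % (len(candidates), cv, n_fits))
```

Each of these fits is independent of the others, which is what makes the grid search a good match for a distributed backend.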

#### To use the Dask distributed backend, which trains the model on a cluster of machines, perform the fit inside a parallel_backend context.

In [9]:
from joblib import parallel_backend

# Inside this context, joblib sends the grid-search fits to the Dask cluster via the connected client.
with parallel_backend('dask'):
    grid_search.fit(data.data, data.target)

Fitting 3 folds for each of 8 candidates, totalling 24 fits


[Parallel(n_jobs=-1)]: Using backend DaskDistributedBackend with 96 concurrent workers.
[Parallel(n_jobs=-1)]: Done   8 out of  24 | elapsed:    3.9s remaining:    7.9s
[Parallel(n_jobs=-1)]: Done  24 out of  24 | elapsed:    4.7s finished
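
The key point of the `parallel_backend` context is that the code doing the parallel work does not name a backend; the surrounding context decides where the tasks run. The sketch below shows the same mechanism on a toy workload. It assumes `joblib` is installed and uses the built-in `'threading'` backend only so that it runs without a cluster; on a Dask cluster, `'dask'` would take its place as in the cell above. The function name `run_square_roots` is illustrative.

```python
from math import sqrt

from joblib import Parallel, delayed, parallel_backend


def run_square_roots():
    # The Parallel call itself never names a backend.
    return Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(5))


# 'threading' is used here only so the sketch runs without a cluster.
with parallel_backend('threading'):
    results = run_square_roots()

print(results)
```

scikit-learn relies on the same mechanism internally, which is why `GridSearchCV` needs no Dask-specific arguments.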


In [10]:
# Stop the CDSW worker engines.
# Parameter:
#   worker_id (int, optional): ID of the worker engine(s) to stop.
#   If no ID is provided, all worker engines on the cluster are stopped.

cdsw.stop_workers()

[<Response [204]>,
 <Response [204]>,
 <Response [204]>,
 <Response [204]>,
 <Response [204]>,
 <Response [204]>]

distributed.client - ERROR - Failed to reconnect to scheduler after 10.00 seconds, closing client
distributed.utils - ERROR - 
Traceback (most recent call last):
  File "/home/cdsw/.local/lib/python3.6/site-packages/distributed/utils.py", line 662, in log_errors
    yield
  File "/home/cdsw/.local/lib/python3.6/site-packages/distributed/client.py", line 1290, in _close
    await gen.with_timeout(timedelta(seconds=2), list(coroutines))
concurrent.futures._base.CancelledError
distributed.utils - ERROR - 
Traceback (most recent call last):
  File "/home/cdsw/.local/lib/python3.6/site-packages/distributed/utils.py", line 662, in log_errors
    yield
  File "/home/cdsw/.local/lib/python3.6/site-packages/distributed/client.py", line 1019, in _reconnect
    await self._close()
  File "/home/cdsw/.local/lib/python3.6/site-packages/distributed/client.py", line 1290, in _close
    await gen.with_timeout(timedelta(seconds=2), list(coroutines))
concurrent.futures._base.CancelledError
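
The errors above are logged because the client is still connected when the engines are stopped and then fails to reconnect to the scheduler. One possible way to get a quieter shutdown is to close the client before stopping the workers. The sketch below shows only the ordering, using stand-in objects so that it runs anywhere; in the notebook the real `client` and `cdsw.stop_workers` would be passed in. The helper name `shutdown` is hypothetical, and whether this fully suppresses the log messages on a given CDSW or distributed version is an assumption that has not been verified here.

```python
def shutdown(client, stop_workers):
    """Close the Dask client first, then stop the worker engines."""
    client.close()
    return stop_workers()


# Stand-ins that record the call order; they are not real Dask or CDSW objects.
calls = []


class FakeClient:
    def close(self):
        calls.append("client.close")


def fake_stop_workers():
    calls.append("stop_workers")
    return []


shutdown(FakeClient(), fake_stop_workers)
print(calls)
```

In the notebook this would be called as `shutdown(client, cdsw.stop_workers)`.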


#### Next, we will see how the Dask-ML library can be used as an alternative to the approach above.