In [None]:
import warnings
warnings.filterwarnings('ignore')

<img src="https://raw.githubusercontent.com/dask/dask/main/docs/source/images/dask_horizontal.svg"
     width="60%"
     alt="Dask logo" />

# Parallel and Distributed Machine Learning
So far we have seen how Dask makes data analysis scalable through parallelization with Dask DataFrames and Dask Arrays. Let's now see how [Dask-ML](https://ml.dask.org/) lets us do machine learning in a parallel and distributed manner. Machine learning is really just a special case of data analysis (one that automates analytical model building), so the 💪 Dask gains 💪 we've seen apply here as well!

> If you'd like a refresher on the difference between parallel and distributed computing, [here's a good discussion on StackExchange](https://cs.stackexchange.com/questions/1580/distributed-vs-parallel-computing). You can also check out [The Beginner's Guide to Distributed Computing](https://towardsdatascience.com/the-beginners-guide-to-distributed-computing-6d6833796318).

### What we'll cover:
1. Types of scaling problems in ML
2. Scale Scikit-Learn with Joblib+Dask (compute-bound)
3. Scale Scikit-Learn with Dask-ML (memory-bound)
4. Scale XGBoost with Dask

## Types of scaling problems in machine learning

There are two main types of scaling challenges you can run into in your machine learning workflow: scaling the **size of your data** and scaling the **size of your model**. That is:

1. **Memory-bound problems**: Data is larger than RAM, and sampling isn't an option.
2. **CPU-bound problems**: Data fits in RAM, but training takes too long. Many hyperparameter combinations, a large ensemble of many models, etc.

Here's a handy diagram for visualizing these problems:

In [1]:
from IPython.display import Image
Image(url="images/dask-zones.png", width=400)

In the bottom-left quadrant, your datasets are not too large (they fit comfortably in RAM) and your model is not too large either. When these conditions are met, you are much better off using something like scikit-learn, XGBoost, and similar libraries. You don't need to leverage multiple machines in a distributed manner with a library like Dask-ML. However, if you are in any of the other quadrants, distributed machine learning is the way to go.

Summarizing: 

* For in-memory problems, just use scikit-learn (or your favorite ML library).
* For large models, use `dask` and `joblib` together with your favorite scikit-learn estimator.
* For large datasets, use `dask_ml` estimators or XGBoost's built-in Dask interface (`xgboost.dask`).
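
To get a rough sense of which regime you are in, you can estimate the in-memory size of a dense array before loading it. The sketch below is plain arithmetic: `estimate_dense_bytes` and `fits_in_ram` are hypothetical helper names, the 16 GiB machine is an assumption, and the 50% headroom factor is an illustrative safety margin rather than a Dask recommendation.

```python
def estimate_dense_bytes(n_samples, n_features, itemsize=8):
    """Bytes needed for a dense array (itemsize=8 corresponds to float64)."""
    return n_samples * n_features * itemsize


def fits_in_ram(data_bytes, ram_bytes, headroom=0.5):
    """True if the data uses at most `headroom` of RAM (assumed safety margin)."""
    return data_bytes <= headroom * ram_bytes


ram = 16 * 1024**3  # assume a machine with 16 GiB of RAM

small = estimate_dense_bytes(10_000, 4)           # a 10,000 x 4 float64 array
large = estimate_dense_bytes(1_000_000_000, 100)  # a hypothetical billion-row dataset

print(small, fits_in_ram(small, ram))
print(large, fits_in_ram(large, ram))
```

Real workloads need extra room for intermediate copies made during training, so treat the result as a first approximation only.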

## Scikit-Learn Refresher

<img src="https://raw.githubusercontent.com/coiled/data-science-at-scale/master/images/scikit_learn_logo_small.svg" 
     width="30%"
     alt="sklearn logo" />

In this section, we'll quickly run through a typical Scikit-Learn workflow:

* Load some data (in this case, we'll generate it)
* Import the Scikit-Learn module for our chosen ML algorithm
* Create an estimator for that algorithm and fit it with our data
* Inspect the learned attributes
* Check the accuracy of our model


### Generate some random data

In [1]:
from sklearn.datasets import make_classification

# Generate data
X, y = make_classification(n_samples=10000, n_features=4, random_state=0)

In [3]:
# Let's take a look at X
X[:8]

array([[-0.77244139,  0.3607576 , -2.38110133,  0.08757   ],
       [ 1.14946035,  0.62254594,  0.37302939,  0.45965795],
       [-1.90879217, -1.1602627 , -0.27364545, -0.82766028],
       [-0.77694695,  0.31434299, -2.26231851,  0.06339125],
       [-1.17047054,  0.02212382, -2.17376797, -0.13421976],
       [ 0.79010037,  0.68530624, -0.44740487,  0.44692959],
       [ 1.68616989,  1.6329131 , -1.42072654,  1.04050557],
       [-0.93912893, -1.02270838,  1.10093827, -0.63714432]])

In [4]:
# Let's take a look at y
y[:8]

array([0, 0, 1, 0, 0, 0, 0, 1])

### Fitting an SVC

For this example, we will fit a [Support Vector Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).

In [6]:
from sklearn.svm import SVC

estimator = SVC(random_state=0)
estimator.fit(X, y)

We can inspect the learned attributes by taking a look at the `support_vectors_`:

In [7]:
estimator.support_vectors_[:4]

array([[-0.77244139,  0.3607576 , -2.38110133,  0.08757   ],
       [ 1.14946035,  0.62254594,  0.37302939,  0.45965795],
       [-0.77694695,  0.31434299, -2.26231851,  0.06339125],
       [ 0.79010037,  0.68530624, -0.44740487,  0.44692959]])

And we check the accuracy (measured here on the same data we trained on, so it is an optimistic estimate):

In [8]:
estimator.score(X, y)

0.905

### Hyperparameter Optimization

There are a few ways to learn the best *hyper*parameters while training. One is `GridSearchCV`.
As the name implies, this does a brute-force search over a grid of hyperparameter combinations. Scikit-learn provides tools to automatically find the best parameter combinations via cross-validation (which is the "CV" in `GridSearchCV`).
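
To make the "brute-force" part concrete, the sketch below expands a small grid with the standard library and counts the resulting fits. It only mimics how candidates are enumerated and is not how `GridSearchCV` is implemented; `example_grid` and its values are chosen for illustration.

```python
from itertools import product

example_grid = {"C": [0.001, 10.0], "kernel": ["rbf", "poly"]}
cv_folds = 2

# Every combination of one value per hyperparameter is a candidate
names = sorted(example_grid)
candidates = [
    dict(zip(names, values))
    for values in product(*(example_grid[name] for name in names))
]

# Each candidate is trained once per cross-validation fold
total_fits = len(candidates) * cv_folds

for candidate in candidates:
    print(candidate)
print(f"{len(candidates)} candidates x {cv_folds} folds = {total_fits} fits")
```

The number of candidates is the product of the list lengths, so every added hyperparameter multiplies the cost. With the default `refit=True`, scikit-learn also trains once more on the full data using the best candidate.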

In [9]:
from sklearn.model_selection import GridSearchCV

In [10]:
%%time
estimator = SVC(gamma='auto', random_state=0, probability=True)
param_grid = {
    'C': [0.001, 10.0],
    'kernel': ['rbf', 'poly'],
}

# Brute-force search over a grid of hyperparameter combinations
grid_search = GridSearchCV(estimator, param_grid, verbose=2, cv=2)
grid_search.fit(X, y)

Fitting 2 folds for each of 4 candidates, totalling 8 fits
[CV] END ................................C=0.001, kernel=rbf; total time=   7.9s
[CV] END ................................C=0.001, kernel=rbf; total time=   7.9s
[CV] END ...............................C=0.001, kernel=poly; total time=   4.4s
[CV] END ...............................C=0.001, kernel=poly; total time=   4.4s
[CV] END .................................C=10.0, kernel=rbf; total time=   2.2s
[CV] END .................................C=10.0, kernel=rbf; total time=   2.1s
[CV] END ................................C=10.0, kernel=poly; total time=   3.9s
[CV] END ................................C=10.0, kernel=poly; total time=   4.2s
CPU times: user 44 s, sys: 1.48 s, total: 45.5 s
Wall time: 46.1 s


In [11]:
grid_search.best_params_, grid_search.best_score_

({'C': 10.0, 'kernel': 'rbf'}, 0.9086000000000001)

## Compute Bound: Single-machine parallelism with Joblib
With Joblib, we can say that Scikit-Learn has *single-machine* parallelism.

**Any Scikit-Learn estimator that can operate in parallel exposes an `n_jobs` keyword**, which controls how many tasks run in parallel. Setting `n_jobs=-1` uses all available CPU cores.
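
Joblib's documentation describes negative values as counting back from the number of CPUs: `-1` uses all of them, `-2` all but one, and so on. The following sketch mirrors that convention in plain Python. `resolve_n_jobs` is a hypothetical helper written for illustration, not part of Joblib; for the real value on your machine, Joblib provides `joblib.effective_n_jobs`.

```python
import os


def resolve_n_jobs(n_jobs, n_cpus=None):
    """Translate an n_jobs setting into a worker count (illustrative only)."""
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    if n_jobs is None:
        return 1  # no parallelism requested
    if n_jobs == 0:
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs < 0:
        # -1 -> all CPUs, -2 -> all but one, never fewer than one worker
        return max(n_cpus + 1 + n_jobs, 1)
    return n_jobs


print(resolve_n_jobs(-1, n_cpus=8))
print(resolve_n_jobs(-2, n_cpus=8))
print(resolve_n_jobs(2, n_cpus=8))
```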

In [12]:
%%time
grid_search = GridSearchCV(estimator, param_grid, verbose=2, cv=2, n_jobs=-1)
grid_search.fit(X, y)

Fitting 2 folds for each of 4 candidates, totalling 8 fits
[CV] END ...............................C=0.001, kernel=poly; total time=   9.3s
[CV] END ...............................C=0.001, kernel=poly; total time=  12.0s
[CV] END .................................C=10.0, kernel=rbf; total time=   4.7s
[CV] END .................................C=10.0, kernel=rbf; total time=   5.5s
[CV] END ................................C=0.001, kernel=rbf; total time=  20.1s
[CV] END ................................C=0.001, kernel=rbf; total time=  20.2s
[CV] END ................................C=10.0, kernel=poly; total time=   6.9s
[CV] END ................................C=10.0, kernel=poly; total time=   5.6s
CPU times: user 8.86 s, sys: 296 ms, total: 9.15 s
Wall time: 34.9 s


Notice that the computation above is faster than before: about 35 s of wall time, compared with about 46 s for the serial run.

## Compute Bound: Multi-machine parallelism with Dask

In this section we'll see how Dask (plus Joblib and Scikit-Learn) gives us multi-machine parallelism, with Dask scheduling our training "jobs" across the machines in our cluster.

Dask can talk to Scikit-Learn (via Joblib) so that our *Dask cluster* is used to train a model. 



In [13]:
from dask.distributed import Client

# create a local Dask cluster (by default sized to this machine's available cores)
client = Client()
client

Client
Connection method: Cluster object
Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status

Cluster Info
Workers: 4, Total threads: 4, Total memory: 15.02 GiB
Status: running, Using processes: True

Scheduler Info
Comm: tcp://127.0.0.1:39431
Dashboard: http://127.0.0.1:8787/status
Started: Just now

Workers

| Comm | Dashboard | Nanny | Total threads | Memory | Local directory |
|---|---|---|---|---|---|
| tcp://127.0.0.1:38437 | http://127.0.0.1:33135/status | tcp://127.0.0.1:40857 | 1 | 3.75 GiB | /tmp/dask-scratch-space/worker-eum9rb1n |
| tcp://127.0.0.1:37409 | http://127.0.0.1:39681/status | tcp://127.0.0.1:43649 | 1 | 3.75 GiB | /tmp/dask-scratch-space/worker-rdie5jil |
| tcp://127.0.0.1:32859 | http://127.0.0.1:34609/status | tcp://127.0.0.1:34309 | 1 | 3.75 GiB | /tmp/dask-scratch-space/worker-fv7w254q |
| tcp://127.0.0.1:36877 | http://127.0.0.1:40867/status | tcp://127.0.0.1:35973 | 1 | 3.75 GiB | /tmp/dask-scratch-space/worker-2n6zsnjd |


**Note:** Click on Cluster Info to see more details about the cluster, including its configuration and other specs.

We can expand our problem by searching over more hyperparameters, and see how using `dask` as the backend can help us.

In [14]:
param_grid = {
    'C': [0.001, 0.1, 1.0, 2.5, 5, 10.0],
    'kernel': ['rbf', 'poly', 'linear'],
    'shrinking': [True, False],
}

grid_search = GridSearchCV(estimator, param_grid, verbose=2, cv=2, n_jobs=-1)

### Dask parallel backend

We can fit our estimator with multi-machine parallelism by quickly *switching to a Dask parallel backend* when using joblib. 

In [15]:
import joblib

In [16]:
%%time
with joblib.parallel_backend("dask", scatter=[X, y]):
    grid_search.fit(X, y)

Fitting 2 folds for each of 36 candidates, totalling 72 fits
[CV] END ..............C=0.001, kernel=poly, shrinking=False; total time=  13.5s
[CV] END ................C=0.001, kernel=rbf, shrinking=True; total time=  20.8s
[CV] END ...............C=0.001, kernel=rbf, shrinking=False; total time=  23.0s
[CV] END ................C=0.001, kernel=rbf, shrinking=True; total time=  24.3s
[CV] END ...............C=0.001, kernel=rbf, shrinking=False; total time=  16.1s
[CV] END ...............C=0.001, kernel=poly, shrinking=True; total time=  10.1s
[CV] END ...............C=0.001, kernel=poly, shrinking=True; total time=  14.1s
[CV] END ..............C=0.001, kernel=poly, shrinking=False; total time=  15.3s
[CV] END ............C=0.001, kernel=linear, shrinking=False; total time=   5.0s
[CV] END .............C=0.001, kernel=linear, shrinking=True; total time=  10.8s
[CV] END .............C=0.001, kernel=linear, shrinking=True; total time=  10.3s
[CV] END ..................C=0.1, kernel=rbf, sh

**What just happened?**

Dask-ML developers worked with the Scikit-Learn and Joblib developers to implement a Dask parallel backend. So internally, scikit-learn now talks to Joblib, and Joblib talks to Dask, and Dask is what handles scheduling all of those tasks on multiple machines.

The best parameters and best score:

In [17]:
grid_search.best_params_, grid_search.best_score_

({'C': 10.0, 'kernel': 'rbf', 'shrinking': True}, 0.9086000000000001)

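## But that was cheating...sort of

The cluster above was a `LocalCluster`, so every worker ran on the same machine. For true multi-machine parallelism we need a cluster that spans several machines, for example one launched in the cloud with [Coiled](https://docs.coiled.io/user_guide/getting_started.html):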

In [18]:
import coiled 
cluster = coiled.Cluster(
    name="intro-to-dask",
    n_workers=10,
    worker_memory="16GiB",
    package_sync=True,
)

ModuleNotFoundError: No module named 'coiled'

In [None]:
from distributed import Client
client = Client(cluster)

In [None]:
%%time
with joblib.parallel_backend("dask", scatter=[X, y]):
    grid_search.fit(X, y)

## Memory Bound: Parallel Machine Learning with Dask-ML

We have seen how to work with larger models, but sometimes you'll want to train on a larger-than-memory dataset. `dask-ml` implements estimators that work well on Dask Arrays and DataFrames that may be larger than your machine's RAM.

In [19]:
from dask_ml.datasets import make_regression
from dask_ml.linear_model import LinearRegression
from dask_ml.model_selection import train_test_split

In [20]:
# create synthetic regression data
X, y = make_regression(n_samples=10_000, chunks=100)

In [21]:
X

| | Array | Chunk |
|---|---|---|
| Bytes | 7.63 MiB | 78.12 kiB |
| Shape | (10000, 100) | (100, 100) |
| Dask graph | 100 chunks in 1 graph layer | |
| Data type | float64 numpy.ndarray | |
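
The numbers in the array summary follow directly from the shape and the chunk size. A quick arithmetic check, assuming `float64` values (8 bytes each) and row-wise chunks of 100 rows as requested with `chunks=100`:

```python
n_rows, n_cols = 10_000, 100
rows_per_chunk = 100
itemsize = 8  # bytes per float64 value

n_chunks = n_rows // rows_per_chunk
chunk_bytes = rows_per_chunk * n_cols * itemsize
total_bytes = n_rows * n_cols * itemsize

print(f"chunks: {n_chunks}")
print(f"per chunk: {chunk_bytes / 1024:.2f} KiB")
print(f"total: {total_bytes / 1024**2:.2f} MiB")
```

Chunk-wise operations typically need only a few chunks in a worker's memory at once, which is what lets the same code scale to arrays larger than RAM.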


In [22]:
# create train/test splits
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=21, test_size=0.3, convert_mixed_types=True)

In [23]:
# instantiate model
lr = LinearRegression()

### Exercise:
Can you fit this parallel Dask-ML `LinearRegression()` model on the training data?

In [24]:
# %load solutions/ml-ex-1.py
lr.fit(X_train, y_train)

### Exercise:
Can you make predictions with this `LinearRegression()` model?

In [25]:
# %load solutions/ml-ex-2.py
y_pred = lr.predict(X_test)

In [26]:
# scores on all rows, training data included; lr.score(X_test, y_test) gives a held-out estimate
lr.score(X, y)

0.9999887951615408
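
For regressors, `score` conventionally reports the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. This is the scikit-learn convention, which the Dask-ML estimator is assumed to follow here. A minimal plain-Python sketch with made-up numbers (`r2_score_simple` is a hypothetical helper):

```python
def r2_score_simple(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


targets = [1.0, 2.0, 3.0, 4.0]

perfect = r2_score_simple(targets, targets)      # predictions match exactly
baseline = r2_score_simple(targets, [2.5] * 4)   # always predict the mean

print(perfect, baseline)
```

A value close to 1, like the one above, means the predictions explain almost all of the variance in the targets, which is expected for synthetic data generated from a linear model.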

## Training XGBoost in Parallel

Dask-ML implements some of the most popular machine learning algorithms for parallel processing, but not all of them.

For XGBoost, the maintainers of Dask and XGBoost took a different approach: they built a Dask backend for XGBoost, so you can run XGBoost in parallel with Dask straight from your normal XGBoost library.

Running an XGBoost model with the distributed Dask backend requires minimal changes to your regular XGBoost code:

```python
import xgboost as xgb

# Create the XGBoost DMatrix for our training and testing splits
dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)
dtest = xgb.dask.DaskDMatrix(client, X_test, y_test)

# Set model parameters (XGBoost defaults, except min_child_weight, whose default is 1)
params = {
    "max_depth": 6,
    "gamma": 0,
    "eta": 0.3,
    "min_child_weight": 30,
    "objective": "reg:squarederror",
    "grow_policy": "depthwise"
}

# train the model
output = xgb.dask.train(
    client, params, dtrain, num_boost_round=4,
    evals=[(dtrain, 'train')]
)

# make predictions
y_pred = xgb.dask.predict(client, output, dtest)
```

See [this step-by-step tutorial](https://coiled.io/blog/dask-xgboost-python-example/) if you're interested in learning more.

## Extra resources:

- [Dask-ML documentation](https://ml.dask.org/)
- [Getting started with Coiled](https://docs.coiled.io/user_guide/getting_started.html)

# LightGBM

In [9]:
from distributed import Client, LocalCluster

# create a local Dask cluster with 2 workers and connect a client to it
cluster = LocalCluster(n_workers=2)
client = Client(cluster)

Perhaps you already have a cluster running?
Hosting the HTTP server on port 34597 instead
Perhaps you already have a cluster running?
Hosting the HTTP server on port 46561 instead


In [10]:
import lightgbm as lgb
dask_model = lgb.DaskLGBMClassifier(client=client)

In [11]:
from dask import array as da
import numpy as np

In [12]:
X = da.random.random((1000, 10), chunks=(500, 10))
y = da.random.random((1000,), chunks=(500,))

In [13]:
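def custom_l2_obj(y_true, y_pred):
    """Squared-error objective: gradient and Hessian of 0.5 * (y_pred - y_true)**2."""
    grad = y_pred - y_true           # first derivative w.r.t. y_pred
    hess = np.ones(len(y_true))      # second derivative is constant 1
    return grad, hess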
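
The custom objective above returns the first and second derivatives of the squared-error loss $\tfrac{1}{2}(\hat{y} - y)^2$ with respect to the prediction $\hat{y}$: the gradient is $\hat{y} - y$ and the Hessian is the constant 1. Here is a quick numerical check of that claim using central finite differences in plain Python. The helper names and sample values are illustrative only.

```python
def half_squared_error(target, pred):
    return 0.5 * (pred - target) ** 2


def numeric_grad(target, pred, eps=1e-3):
    # central difference approximation of dL/dpred
    up = half_squared_error(target, pred + eps)
    down = half_squared_error(target, pred - eps)
    return (up - down) / (2 * eps)


def numeric_hess(target, pred, eps=1e-3):
    # central difference approximation of d2L/dpred2
    up = half_squared_error(target, pred + eps)
    mid = half_squared_error(target, pred)
    down = half_squared_error(target, pred - eps)
    return (up - 2 * mid + down) / eps ** 2


target, pred = 0.3, 0.8
grad_ok = abs(numeric_grad(target, pred) - (pred - target)) < 1e-6
hess_ok = abs(numeric_hess(target, pred) - 1.0) < 1e-4

print(grad_ok, hess_ok)
```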

In [14]:
dask_model = lgb.DaskLGBMRegressor(
    objective=custom_l2_obj
)

In [15]:
dask_model.fit(X, y)



Finding random open ports for workers
[LightGBM] [Info] Trying to bind port 36441...
[LightGBM] [Info] Binding port 36441 succeeded
[LightGBM] [Info] Listening...
[LightGBM] [Info] Trying to bind port 41993...
[LightGBM] [Info] Binding port 41993 succeeded
[LightGBM] [Info] Listening...
[LightGBM] [Info] Connected to rank 1
[LightGBM] [Info] Connected to rank 0
[LightGBM] [Info] Local rank: 0, total number of machines: 2
[LightGBM] [Info] Local rank: 1, total number of machines: 2
[LightGBM] [Info] Using self-defined objective function
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000151 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1670
[LightGBM] [Info] Number of data points in the train set: 500, number of used features: 10
[LightGBM] [Info] Using self-defined objective function
[LightGBM] [Info] Using self-defined objective function
[LightGBM] [Info] Auto-choosing col-wise multi-threading, t

In [17]:
pred = dask_model.predict(X)

In [18]:
pred_local = pred.compute()

In [19]:
actual_local = y.compute()

In [20]:
from sklearn.metrics import mean_squared_error

In [21]:
mean_squared_error(actual_local, pred_local)

0.006862700144845762