# Topics covered
* Distributed machine learning in Dask
    * Hyperparameter search
    * Distributed prediction
* Scaling in Dask using PBS

<img src="ml-dimensions.png"
     alt="Markdown Monster icon"
     style="float: left; margin-right: 10px;" />
     


 * Model training is usually compute bound
 * Large datasets are usually memory bound

In [155]:

import numpy as np
import pandas as pd
from distributed import Client
from sklearn.datasets import make_circles, make_classification
from sklearn.linear_model import LogisticRegression



## Hyperparameters

Every machine learning model has some values, called hyperparameters, that are specified before training begins. These values help adapt the model to the data, and the hyperparameters selected can significantly influence the accuracy of the model.
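To make the effect of a hyperparameter concrete, here is a minimal sketch that trains the same `LogisticRegression` model with three values of the regularization strength `C` and records the test accuracy for each. The synthetic dataset, the `_demo` variable names, and the three `C` values are assumptions chosen only for illustration; the scores you see will depend on the data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small synthetic problem (illustrative only, not the dataset used in this notebook)
X_demo, y_demo = make_classification(
    n_samples=2_000, n_features=20, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# C is fixed before training starts, which is what makes it a hyperparameter
scores = {}
for C in (1e-4, 1.0, 1e4):
    clf = LogisticRegression(C=C, max_iter=1_000)
    clf.fit(X_tr, y_tr)
    scores[C] = clf.score(X_te, y_te)

print(scores)
```

Comparing the entries of `scores` shows how much a single setting can move the result, which is the motivation for searching over hyperparameters systematically.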

In [156]:
# Generate a noisy two-circle dataset and plot a random sample of 4,000 points

X, y = make_circles(n_samples=30_000, random_state=0, noise=0.09)

pd.DataFrame({0: X[:, 0], 1: X[:, 1], "class": y}).sample(4_000).plot.scatter(
    x=0, y=1, alpha=0.2, c="class", cmap="bwr"
);

from sklearn.utils import check_random_state

# Append four random (uninformative) features to the two circle coordinates
rng = check_random_state(42)
random_feats = rng.uniform(-1, 1, size=(X.shape[0], 4))
X = np.hstack((X, random_feats))
X.shape



TypeError: got an unexpected keyword argument 'n_classes'

In [153]:
from sklearn.model_selection import train_test_split

# shuffle expects a boolean, not the string 'False'
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

In [149]:
# Hyperparameter setting 1: SAG solver
est = LogisticRegression(C=10, solver="sag", penalty="l2")
est.fit(X, y)
est.score(X, y)

ValueError: Unknown label type: continuous. Maybe you are trying to fit a classifier, which expects discrete classes on a regression target with continuous values.

In [136]:
# Hyperparameter setting 2: L-BFGS solver
est = LogisticRegression(C=10, solver="lbfgs", penalty="l2")
est.fit(X, y)
est.score(X, y)

0.49993333333333334

## Local cluster
With Dask you can create a local cluster with one or more workers. In this case we create a cluster with one worker that has 4 threads and a 2 GB memory limit. Because the cluster is local, the worker runs on the same node (machine) as the client.

In [95]:

client = Client(processes=False, threads_per_worker=4,
                n_workers=1, memory_limit='2GB')
client

Perhaps you already have a cluster running?
Hosting the HTTP server on port 39429 instead


| Component | Field | Value |
|---|---|---|
| Client | Connection method | Cluster object |
| Client | Cluster type | distributed.LocalCluster |
| Client | Dashboard | http://150.203.163.92:39429/status |
| Cluster | Dashboard | http://150.203.163.92:39429/status |
| Cluster | Workers | 1 |
| Cluster | Total threads | 4 |
| Cluster | Total memory | 1.86 GiB |
| Cluster | Status | running |
| Cluster | Using processes | False |
| Scheduler | Comm | inproc://150.203.163.92/556264/33 |
| Scheduler | Workers | 1 |
| Scheduler | Dashboard | http://150.203.163.92:39429/status |
| Scheduler | Total threads | 4 |
| Scheduler | Started | Just now |
| Scheduler | Total memory | 1.86 GiB |
| Worker | Comm | inproc://150.203.163.92/556264/36 |
| Worker | Total threads | 4 |
| Worker | Dashboard | http://150.203.163.92:38587/status |
| Worker | Memory | 1.86 GiB |
| Worker | Nanny | None |
| Worker | Local directory | /tmp/dask-worker-space/worker-li6sydp5 |
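Once a client exists, work can be sent to the cluster. The following is a minimal sketch, assuming a throwaway single-worker cluster like the one above; `square` is a hypothetical function used only for illustration, and `dashboard_address=None` is passed so this demonstration cluster does not start a dashboard.

```python
from distributed import Client


def square(x):
    """Toy task to run on the worker threads."""
    return x * x


# One in-process worker with 4 threads, mirroring the configuration above
with Client(
    processes=False,
    n_workers=1,
    threads_per_worker=4,
    dashboard_address=None,
) as demo_client:
    futures = demo_client.map(square, range(8))   # schedule 8 tasks
    results = demo_client.gather(futures)         # collect results in order
    n_threads = sum(demo_client.nthreads().values())

print(results)
print(n_threads)
```

Using the client as a context manager shuts the temporary cluster down when the block exits, so it does not interfere with the `client` created above.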


In [137]:
# Split the data into training and test sets

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=5_000, random_state=42)

In [138]:
# Standardize the features using statistics computed on the training set only

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
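The scaling step can be sketched in plain NumPy to show what it computes: each feature is shifted by the training mean and divided by the training standard deviation, and the same training statistics are reused for the test set. The arrays below are randomly generated stand-ins, and all names are assumptions for illustration.

```python
import numpy as np

demo_rng = np.random.default_rng(0)
train_demo = demo_rng.normal(loc=5.0, scale=2.0, size=(1_000, 3))
test_demo = demo_rng.normal(loc=5.0, scale=2.0, size=(200, 3))

# Statistics come from the training data only
mean = train_demo.mean(axis=0)
std = train_demo.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train_demo - mean) / std
test_scaled = (test_demo - mean) / std  # reuses the training mean and std

print(train_scaled.mean(axis=0))  # close to 0 for every feature
print(train_scaled.std(axis=0))   # close to 1 for every feature
```

The scaled test set will generally not have exactly zero mean and unit variance, because it is transformed with statistics it did not contribute to. That is intended: it avoids leaking test information into preprocessing.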

In [139]:
import numpy as np
from sklearn.neural_network import MLPClassifier

# create the classifier
model = MLPClassifier()

In [140]:
# List the parameters we want to search
params = {
    "hidden_layer_sizes": [
        (24, ),
        (12, 12),
        (6, 6, 6, 6),
        (4, 4, 4, 4, 4, 4),
        (12, 6, 3, 3),
    ],
    "activation": ["relu", "logistic", "tanh"],
    "alpha": np.logspace(-6, -3, num=1000),  # 1000 log-spaced values, approximating a continuous range
    "batch_size": [16, 32, 64, 128, 256, 512],
}
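A randomized search draws candidate configurations from a space like this one rather than trying every combination. The sketch below is a simplified picture of that sampling step, not the search library's implementation: `sample_config` is a hypothetical helper, and the `alpha` grid is rebuilt in plain Python so the example has no dependencies.

```python
import random

# Same search space as above, written without NumPy
space = {
    "hidden_layer_sizes": [
        (24,), (12, 12), (6, 6, 6, 6), (4, 4, 4, 4, 4, 4), (12, 6, 3, 3),
    ],
    "activation": ["relu", "logistic", "tanh"],
    "alpha": [10 ** (-6 + 3 * i / 999) for i in range(1000)],  # 1e-6 .. 1e-3
    "batch_size": [16, 32, 64, 128, 256, 512],
}


def sample_config(search_space, rng):
    """Pick one value per hyperparameter, independently and uniformly."""
    return {name: rng.choice(values) for name, values in search_space.items()}


rng_demo = random.Random(0)
candidates = [sample_config(space, rng_demo) for _ in range(3)]
for candidate in candidates:
    print(candidate)

# Size of the full grid: 5 * 3 * 1000 * 6
n_combinations = 1
for values in space.values():
    n_combinations *= len(values)
print(n_combinations)  # 90000
```

With 90,000 possible combinations, an exhaustive grid search is impractical here, which is why a sampling-based search is used.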

In [141]:
from dask_ml.model_selection import HyperbandSearchCV

In [142]:
# For quick response
n_examples = 4 * len(X_train)
n_params = 8

# In practice, HyperbandSearchCV is most useful for longer searches
# n_examples = 15 * len(X_train)
# n_params = 15

In [143]:
max_iter = n_params  # number of times partial_fit will be called
chunks = n_examples // n_params  # number of examples each call sees

max_iter, chunks

(8, 12500)
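The arithmetic behind these two values can be written as a small function. This is a sketch of the rule used in the cell above, where `n_examples` is read as the number of examples the longest-trained model should see and `n_params` as the number of configurations to sample; `hyperband_inputs` is a hypothetical helper name.

```python
def hyperband_inputs(n_examples, n_params):
    """Return (max_iter, chunk_size) using the rule from the cell above."""
    max_iter = n_params                   # number of partial_fit calls
    chunk_size = n_examples // n_params   # examples seen per partial_fit call
    return max_iter, chunk_size


n_train = 25_000  # len(X_train): 30,000 samples minus the 5,000 held out for testing

quick = hyperband_inputs(4 * n_train, 8)
longer = hyperband_inputs(15 * n_train, 15)

print(quick)   # (8, 12500)
print(longer)  # (15, 25000)
```

The first result matches the output above; the second corresponds to the commented-out settings for a longer search.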

In [144]:
import dask.array as da
X_train2 = da.from_array(X_train, chunks=chunks)
y_train2 = da.from_array(y_train, chunks=chunks)
X_train2

| | Array | Chunk |
|---|---|---|
| Bytes | 390.62 kiB | 195.31 kiB |
| Shape | (25000, 2) | (12500, 2) |
| Dask graph | 2 chunks in 1 graph layer | |
| Data type | float64 numpy.ndarray | |
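The numbers in this summary follow directly from the array shape and the chunk size. As a minimal check, the sketch below builds a placeholder array of zeros with the same shape and dtype as in the output (an assumption made only to reproduce the layout) and inspects how Dask splits it.

```python
import numpy as np
import dask.array as da

demo = np.zeros((25_000, 2))                  # float64, same shape as in the output
demo_da = da.from_array(demo, chunks=12_500)  # at most 12,500 elements per axis per chunk

print(demo_da.chunks)      # ((12500, 12500), (2,))
print(demo_da.numblocks)   # (2, 1)

total_kib = demo_da.nbytes / 1024             # 25000 * 2 * 8 bytes
chunk_kib = 12_500 * 2 * 8 / 1024             # one (12500, 2) float64 chunk
print(total_kib, chunk_kib)                   # 390.625 195.3125
```

Each chunk is what a single `partial_fit` call receives during the search, so the chunk size controls how many examples a model sees per call.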


In [145]:
search = HyperbandSearchCV(
    model,
    params,
    max_iter=max_iter,
    patience=True,
)

In [146]:
search.metadata["partial_fit_calls"]

26
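To see where a number like this can come from, the sketch below counts `partial_fit` calls for a textbook Hyperband schedule. It assumes an aggressiveness (often written as eta) of 3 and a plain successive-halving rule in which each round keeps one third of the models and triples their training budget. The function names are hypothetical and the library's internal bookkeeping may differ in detail.

```python
import math


def hyperband_brackets(max_iter, eta=3):
    """Return (n_models, initial_calls) for each bracket, most aggressive first."""
    s_max = math.floor(math.log(max_iter, eta))
    budget = (s_max + 1) * max_iter
    brackets = []
    for s in range(s_max, -1, -1):
        n_models = math.ceil(budget / max_iter * eta**s / (s + 1))
        initial_calls = int(max_iter * eta**-s)
        brackets.append((n_models, initial_calls))
    return brackets


def successive_halving_calls(n_models, initial_calls, max_iter, eta=3):
    """Count partial_fit calls when 1/eta of the models survive each round."""
    total, already_done, i = 0, 0, 0
    while True:
        survivors = n_models // eta**i
        if survivors < 1:
            break
        target = min(initial_calls * eta**i, max_iter)
        total += survivors * (target - already_done)  # only the extra calls
        already_done = target
        if target >= max_iter:
            break
        i += 1
    return total


brackets = hyperband_brackets(max_iter=8)
total_calls = sum(
    successive_halving_calls(n, r, max_iter=8) for n, r in brackets
)
print(brackets)     # [(3, 2), (2, 8)]
print(total_calls)  # 26
```

Under these assumptions the schedule has two brackets, one starting 3 models at 2 calls each and one training 2 models for the full 8 calls, and the total agrees with the value reported above.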

In [147]:
%%time
search.fit(X_train2, y_train2, classes=[0, 1, 2, 3])

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (3,) + inhomogeneous part.