Dask's machine learning library, called **Dask-ml**, needs to be installed separately from the Dask. So, before moving forward, we need to install it from the command line as follows:

```bash
pip install dask-ml
```

Or you can run the same command inside jupyter notebook as follows:

```python
!pip install dask-ml
```

Let's start with how we can make use of the parallelization when training Scikit-learn models with Dask.

# Parallelism in scikit-learn

Scikit-learn library already offers a parallelization capability for some of its models. In particular, if you see `n_jobs` parameter in the documentation of a model, then it implies that you can set that parameter and scikit-learn dispatches the job to the cores of your computer. Scikit-learn does this using a library called [joblib](https://joblib.readthedocs.io/en/latest/).

However, this parallelization in Scikit-learn is restricted to a single machine. Dask extends this and brings parallelization over a cluster. As we'll see in this checkpoint, if we use Dask as the backend to the joblib library, we can run Scikit-learn models over many machines in parallel.

However as we said before, some models in Scikit-learn can't be trained on very large training data that doesn't fit into memory. For those situations, we'll talk about Dask's own machine learning module later in this checkpoint.

# Using Scikit-learn with Dask

We'll start our discussion with how we can use Dask to enhance the parallelization capabilities of Scikit-learn. In doing this, we'll be using the [Credit Card Fraud Detection](https://www.kaggle.com/mlg-ulb/creditcardfraud) dataset from Kaggle as we did in the previous checkpoints. However, this time we'll build several machine learning models using Scikit-learn and Dask together. We'll see that using Dask, we can parallelize some of scikit-learn's machine learning models into several cores (or even machines on a cluster) when training. 

As we did in the previous checkpoints, we suggest starting a Dask client before running your code. By doing this, you'll be setting some configurations like the number of workers as well as you'll be able to monitor the execution of your codes.

In our example, we set the number of workers to be four, number of threads per worker to be two and the memory limit for each worker to be 2GB.

In [2]:
import warnings
warnings.filterwarnings("ignore")

from dask.distributed import Client, progress

client = Client(n_workers=4, threads_per_worker=2, memory_limit='2GB')
client

0,1
Client  Scheduler: tcp://127.0.0.1:50743  Dashboard: http://127.0.0.1:50744/status,Cluster  Workers: 4  Cores: 8  Memory: 8.00 GB


To be able to load the dataset, we use Dask datarames. In the following cell, we load the dataset using the `.read_csv()` function of the Dask dataframe:

In [3]:
# Dataframes implement the Pandas API
import dask.dataframe as dd

# This loads the data into Dask dataframe
df = dd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/creditcard.csv', dtype={'Time': 'float64'})

Let's check what we have:

In [4]:
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


## Distributing the training

As we mentioned before, Scikit-learn can distribute the training of the models using [joblib](http://joblib.readthedocs.io/) library. However, this parallelization only supports a single machine. That is to say, we can only use the CPU cores of the machine that we run our models on. Dask can extend Scikit-learn's single machine parallelism to multiple machines. 

In demonstrating how we can parallelize the training of the Scikit-learn models using Dask, we train several classifiers on [Credit Card Fraud Detection](https://www.kaggle.com/mlg-ulb/creditcardfraud) dataset from Kaggle. Even more, we use four fold cross-validation to evaluate the performances of the models and then we do grid search cross-validation to tune the hyperparameters of the models. 

As stated in the official documentation of Dask, using Scikit-learn models with Dask support:

> "...is most useful for training large models on medium sized datasets. You may have a large model when searching over many hyper-parameters, or when using an ensemble method with many individual estimators. For too small datasets, training times will typically be small enough that cluster-wide parallelism isn't helpful. For too large datasets (larger than a single machine's memory), the Scikit-learn estimators may not be able to cope."

We start with importing the libraries we'll use:

In [5]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate, GridSearchCV
from sklearn.metrics import roc_auc_score
import joblib
from dask_ml.model_selection import train_test_split
import pandas as pd
import warnings

warnings.filterwarnings("ignore")

Next, we select the target variable and the features. We use the `Class` column as our target variable. It takes value 1 when the transaction is a fraud and 0 elsewhere. Our aim is to predict whether a given transaction is fraud or not. This problem is a classification problem and that is why we use classifiers.

Note that the variables in this dataset except `Time`, `Class` and `Amount` are actually the principal components of some real variables that aren't exposed to us due to the confidentiality and privacy concerns. In the following models, we use the first three principal components as our feature set. This is for the sake of demonstration and speed. In the assignments, you'll be working with all of the variables.

In the following cell, we also use a familiar function from Sickit-learn: `train_test_split()`. However we use it from Dask-ml's `model_selection` module. Similar to what Scikit-learn's method does, this method also randomly divide a dataset into train and test sets. However, Dask's `train_test_split()` can work in parallel!

In [6]:
# This is our feature set
X = df[["V1", "V2", "V3", "Amount"]]

# This is our target variable
Y = df["Class"]

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

# Since our data can fit into memory
# we persist them to the RAM.
X_train.persist()
X_test.persist()
y_train.persist()
y_test.persist()

Dask Series Structure:
npartitions=3
    int64
      ...
      ...
      ...
Name: Class, dtype: int64
Dask Name: split, 3 tasks

When we use Dask as the backend of the joblib library, we need to put our machine learning code inside the context manager of joblib. That is, we need to put the code we want to parallelize inside the following with statement:

```python
with joblib.parallel_backend('dask'):
    Scikit-learn code here
```

For example, in the cell below, we train a random forest classifier using Dask as the parallelization library. In the following, we distribute the task across the cores of our computer. However, this code can be parallelized over multiple machines.

Note that, we do four fold cross-validation in this code. So, we actually train four models and evaluate them on a different hold-out group. This alone is something that lays itself parallelization quite well. That being said, Dask can also parallelize the training of a single random forest classifier.

In [7]:
rf_model = RandomForestClassifier()

with joblib.parallel_backend('dask'):
    scores = cross_validate(rf_model, X_train.compute(), y_train.compute(), cv=4)
    
scores

{'fit_time': array([7.6613791 , 6.18501663, 6.21940088, 7.33158302]),
 'score_time': array([0.11823916, 0.13610721, 0.13948917, 0.12369299]),
 'test_score': array([0.99766515, 0.99840247, 0.99859555, 0.99847267])}

Next, we move one step further and create a [`GridSearchCV`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) object to tune the `max_depth` hyperparameter of the random forest classifier:

In [8]:
# Random forest classifier
rf_params = {"max_depth": [2, 4, 8, 16]}

rf_model = RandomForestClassifier()

grid_search_rf = GridSearchCV(rf_model,
                           param_grid=rf_params,
                           return_train_score=True,
                           iid=True,
                           cv=4,
                           n_jobs=-1, 
                           scoring='roc_auc')

As you can recall, when we do grid search with scikit-learn, we fit the model on the data as follows:

```python
grid_search.fit(X, y)
```

However, this time we want to distribute the training and cross validation over a cluster with Dask. In order to do this, we need to use the context manager provided by joblib library using the `with` statement of Python.

Now, let's run the grid search over the hyperparameters of the random forest model:

In [16]:
with joblib.parallel_backend('dask'):
    grid_search_rf.fit(X_train.compute(), y_train.compute())

We trained 16 different combinations of the model with four different values of the `max_depth` parameter and with four fold cross validation. Now, let's find out the best value for the `max_depth` parameter and get the AUC score on the test set:

In [19]:
print("The best value is: ", grid_search_rf.best_params_)
print("The test AUC score is: ", grid_search_rf.score(X_test.compute(), y_test.compute()))

The best value is:  {'max_depth': 8}
The test AUC score is:  0.9535513483491163


For more on training scikit-learn models with distributed joblib, see the [dask-ml documentation](http://dask-ml.readthedocs.io/en/latest/joblib.html).

## Training on Large Datasets

So far, we only used our good old Scikit-learn library using Dask for the parallelization. However, most of the models in Scikit-learn are designed to work only on the data in the memory. Hence, training with large datasets that exceed the available memory isn't feasible with the approach we discussed so far.

Here, we show that [Dask-ml](https://ml.dask.org/preprocessing.html#) library provides models that can be trained on very large datasets. All of the algorithms implemented in Dask-ml work well on larger than memory datasets. That being said, the model sets that are implemented in Dask-ml isn't very large. Hence, we can only find a restricted set of models in this library. To demonstrate the usage of Dask-ml, we'll use the same dataset above. But, you can try the following with a larger dataset that doesn't fit into the memory of your computer.

The model we'll use is the logistic regression from Dask-ml's `linear_model` module. We start by importing the estimator:

In [21]:
from dask_ml.linear_model import LogisticRegression

Next, we create the logistic regression model and fit it to the training data. Note that when calling the `.fit()` function, we converted Dask dataframes X_train and y_train to Dask arrays by calling their `.values` attributes. **This is because, right now Dask-ml's glm based estimators just work with Dask arrays**.

In [22]:
lr = LogisticRegression()
lr.fit(X_train.values.compute(), y_train.values.compute())

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1.0, max_iter=100, multi_class='ovr',
                   n_jobs=1, penalty='l2', random_state=None, solver='admm',
                   solver_kwargs=None, tol=0.0001, verbose=0, warm_start=False)

Now, let's get the training and test scores:

In [23]:
preds_train = lr.predict(X_train.values.compute())
preds_test = lr.predict(X_test.values.compute())

print("Training score is: ", roc_auc_score(preds_train, y_train.values.compute()))
print("Test score is: ", roc_auc_score(preds_test, y_test.values.compute()))

Training score is:  0.8398388183067783
Test score is:  0.8069194422938428


And that's all, we can close the client connection. For more information, you can read the official documentation of Dask-ml's linear regression [here](https://ml.dask.org/modules/generated/dask_ml.linear_model.LinearRegression.html#dask_ml.linear_model.LinearRegression).

In [24]:
client.close()