# LightGBM + Dask

<table>
    <tr>
        <td>
            <img src="./_img/lightgbm.svg" width="300">
        </td>
        <td>
            <img src="./_img/dask-horizontal.svg" width="300">
        </td>
    </tr>
</table>

This notebook shows how to use `lightgbm.dask` to train a LightGBM model on data stored as a [Dask Array](https://docs.dask.org/en/latest/array.html).

To explore other topics in greater depth, see the other notebooks.

<hr>

## Set up a local Dask cluster

Create a cluster with 3 workers. Since this is a `LocalCluster`, those workers are just 3 local processes.

In [None]:
from dask.distributed import Client, LocalCluster

n_workers = 3
cluster = LocalCluster(n_workers=n_workers)

client = Client(cluster)
client.wait_for_workers(n_workers)

print(f"View the dashboard: {cluster.dashboard_link}")

Click the link above to view a diagnostic dashboard while you run the training code below.

<hr>

## Get some training data

This example uses `sklearn.datasets.make_regression()` to generate a dataset as NumPy arrays, then uses `dask.array.from_array()` to turn those into Dask Arrays.

That's done only for convenience. `lightgbm.dask` expects your data to be Dask Arrays or Dask DataFrames, however they were created.

In [None]:
import dask.array as da
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=10000, random_state=42)
dX = da.from_array(X, chunks=(1000, X.shape[1]))
dy = da.from_array(y, chunks=1000)
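To see how the `chunks` arguments split the data, you can inspect the block layout of each array. The sketch below is self-contained and uses zero-filled stand-in arrays with the same shapes as the `make_regression()` output (100 features is that function's default), so the stand-in values are an assumption for illustration only. It shows that both arrays end up with 10 row chunks of 1,000 rows each. Keeping the row chunks of the features and the target aligned is the safe choice here, on the assumption that each chunk of features gets paired with the matching chunk of labels during training.

```python
import dask.array as da
import numpy as np

# Stand-ins with the same shapes as make_regression(n_samples=10000),
# which produces 100 features by default. Values are placeholders.
X = np.zeros((10_000, 100))
y = np.zeros(10_000)

dX = da.from_array(X, chunks=(1000, X.shape[1]))
dy = da.from_array(y, chunks=1000)

# Number of blocks along each dimension.
print(dX.numblocks)  # (10, 1)
print(dy.numblocks)  # (10,)

# The row chunks of features and target line up one-to-one.
print(dX.chunks[0] == dy.chunks[0])  # True
```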

Right now, the Dask Arrays `dX` and `dy` are lazy. Before training, you can force the cluster to compute them by calling `.persist()`, then block until that computation finishes by calling `wait()` on them.

Doing this is optional, but it makes data loading a one-time cost, so later steps that use these arrays don't have to rebuild them.

In [None]:
from dask.distributed import wait

dX = dX.persist()
dy = dy.persist()
_ = wait([dX, dy])
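The lazy behavior described above can be seen on a tiny array without a cluster. This is a minimal sketch, assuming only `dask.array` and NumPy are installed. The toy array and variable names are illustrative and not part of this notebook's dataset.

```python
import dask.array as da
import numpy as np

# Toy data for illustration only.
small = da.from_array(np.arange(10), chunks=5)

# This only builds a task graph; nothing has been computed yet.
doubled_sum = (small * 2).sum()
print(type(doubled_sum).__name__)  # Array

# .compute() runs the graph and returns a concrete value.
result = doubled_sum.compute()
print(result)  # 90

# .persist() also runs the graph, but hands back a Dask Array
# whose chunks are already materialized.
persisted = small.persist()
print(isinstance(persisted, da.Array))  # True
```

Without a `Client`, Dask runs this on its default local scheduler. With the client created earlier in this notebook, the same calls would run on the cluster's workers.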

<hr>

## Train a model

With the data set up on the workers, train a model. `lightgbm.dask.DaskLGBMRegressor` has an interface that tries to stay as close as possible to the non-Dask scikit-learn interface to LightGBM (`lightgbm.sklearn.LGBMRegressor`).

In [None]:
from lightgbm.dask import DaskLGBMRegressor

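dask_reg = DaskLGBMRegressor(
    client=client,  # Dask client connected to the cluster created above
    max_depth=5,
    objective="regression_l1",  # L1 loss, i.e. optimize mean absolute error
    learning_rate=0.1,
    tree_learner="data",  # data-parallel distributed training
    n_estimators=100,
    min_child_samples=1,
)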

dask_reg.fit(
    X=dX,
    y=dy,
)

<hr>

## Evaluate the model

The `.predict()` method takes in a Dask collection and returns a Dask Array.

In [None]:
preds = dask_reg.predict(dX)
# preds is still lazy, so this prints the array's metadata, not its values
print(preds)

Before calculating the mean absolute error (MAE) of these predictions, compute some summary statistics on the target variable. This is necessary to understand what "good" values of MAE look like.

In [None]:
# da.percentile() expects percentiles on a 0-100 scale, not fractions
p = [1, 10, 25, 50, 75, 90, 99]
dy_percentiles = da.percentile(dy, p).compute()

for percentile, value in zip(p, dy_percentiles):
    print(f"{percentile}%: {round(value, 2)}")
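Another way to judge what a "good" MAE looks like is to compare it against a naive baseline that always predicts the median of the target, since the median is the constant prediction that minimizes MAE. The sketch below is plain Python with made-up numbers. The `mae` helper and both lists are hypothetical and exist only to illustrate the comparison.

```python
from statistics import median


def mae(y_true, y_pred):
    """Mean absolute error: the average of |y_true - y_pred|."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up target values and model predictions, for illustration only.
y_true = [-120.0, -40.0, 0.0, 35.0, 125.0]
model_pred = [-110.0, -50.0, 5.0, 30.0, 115.0]

# Naive baseline: predict the median of the target for every sample.
baseline_pred = [median(y_true)] * len(y_true)

baseline_mae = mae(y_true, baseline_pred)
model_mae = mae(y_true, model_pred)

print(baseline_mae)  # 64.0
print(model_mae)     # 8.0
```

A model whose MAE is well below the baseline's is capturing real structure in the data. One whose MAE is close to the baseline's is doing little better than guessing the median.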

The metrics functions from `dask-ml` match those from `scikit-learn`, but take in and return Dask collections. You can use these functions to perform model evaluation without the evaluation data or predictions needing to be pulled down to the machine running this notebook. Pretty cool, right?

In [None]:
from dask_ml.metrics.regression import mean_absolute_error

# Argument order follows scikit-learn: (y_true, y_pred)
mean_absolute_error(dy, preds)

<hr>

## Next Steps

Learn more: https://lightgbm.readthedocs.io/en/latest/Python-API.html#dask-api.

Ask a question, report a bug, or submit a feature request: https://github.com/microsoft/LightGBM/issues.

Contribute: https://github.com/microsoft/LightGBM/issues?q=is%3Aissue+is%3Aopen+label%3Adask.