# LightGBM + Dask

<table>
    <tr>
        <td>
            <img src="./_img/lightgbm.svg" width="300">
        </td>
        <td>
            <img src="./_img/dask-horizontal.svg" width="300">
        </td>
    </tr>
</table>

This notebook shows how to use `lightgbm.dask` to train a LightGBM model on data stored as a [Dask DataFrame](https://docs.dask.org/en/latest/dataframe.html) or [Dask Array](https://docs.dask.org/en/latest/array.html).

It uses `LocalCluster` to run on this machine only. If you want to try with a distributed cluster on AWS Fargate, see [the AWS notebook](./aws.ipynb).

In [7]:
import dask.array as da
import os

from dask.distributed import Client, LocalCluster, wait

from lightgbm.dask import DaskLGBMRegressor

Create a cluster with 3 workers. Since this is a `LocalCluster`, those workers are just 3 local processes.

In [8]:
n_workers = 3
cluster = LocalCluster()
client = Client(cluster)
client.wait_for_workers(n_workers)

print(f"View the dashboard: {cluster.dashboard_link}")

Perhaps you already have a cluster running?
Hosting the HTTP server on port 36503 instead


View the dashboard: http://127.0.0.1:36503/status


Click the link above to view a diagnostic dashboard while you run the training code below.

In [9]:
num_rows = 1e6
num_features = 1e2
num_partitions = 10
rows_per_chunk = num_rows / num_partitions

data = da.random.random(
    (num_rows, num_features),
    (rows_per_chunk, num_features)
)

labels = da.random.random(
    (num_rows, 1),
    (rows_per_chunk, 1)
)

Right now, the Dask Arrays `data` and `labels` are lazy. Before training, you can force the cluster to compute them by running `.persist()` and then wait for that computation to finish by `wait()`-ing on them.

In [10]:
data = data.persist()
labels = labels.persist()
_ = wait(data)
_ = wait(labels)

With the data set up on the workers, train a model. `lightgbm.dask.DaskLGBMRegressor` has an interface that tries to stay as close as possible to the non-Dask scikit-learn interface to LightGBM (`lightgbm.sklearn.LGBMRegressor`).

In [11]:
dask_reg = DaskLGBMRegressor(
    silent=False,
    max_depth=5,
    random_state=708,
    objective="regression_l2",
    learning_rate=0.1,
    tree_learner="data",
    n_estimators=10,
    min_child_samples=1,
    n_jobs=-1
)

dask_reg.fit(
    client=client,
    X=data,
    y=labels,
)

DaskLGBMRegressor(local_listen_port=12400,
                  machines='127.0.0.1:12400,127.0.0.1:12401,127.0.0.1:12402,127.0.0.1:12403',
                  max_depth=5, min_child_samples=1, n_estimators=10,
                  num_machines=4, num_threads=2, objective='regression_l2',
                  random_state=708, silent=False, time_out=120,
                  tree_learner='data')

The model produced by this training run is an instance of `DaskLGBMRegressor`. To get a regular non-Dask model (which can be pickled and saved), run `.to_local()`.

In [12]:
local_model = dask_reg.to_local()
type(local_model)

lightgbm.sklearn.LGBMRegressor

You can visualize this model by looking at a data frame representation of it.

In [17]:
local_model.booster_.trees_to_dataframe()

Unnamed: 0,tree_index,node_depth,node_index,left_child,right_child,parent_index,split_feature,split_gain,threshold,decision_type,missing_direction,missing_type,value,weight,count
0,0,1,0-S0,0-S1,0-S2,,Column_48,1.044720,0.587599,<=,left,,0.500732,0.0,1000000
1,0,2,0-S1,0-S6,0-S16,0-S0,Column_29,1.207930,0.621069,<=,left,,0.500804,587338.0,587338
2,0,3,0-S6,0-S7,0-S9,0-S1,Column_92,1.062580,0.382854,<=,left,,0.500692,365018.0,365018
3,0,4,0-S7,0-S8,0-S20,0-S6,Column_49,1.636070,0.819036,<=,left,,0.500909,139139.0,139139
4,0,5,0-S8,0-L0,0-L9,0-S7,Column_73,1.245770,0.217072,<=,left,,0.501070,114127.0,114127
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
603,9,6,9-L10,,,9-S15,,,,,,,-0.000333,17658.0,17658
604,9,6,9-L16,,,9-S15,,,,,,,0.001127,4387.0,4387
605,9,5,9-S13,9-L13,9-L14,9-S12,Column_51,0.853396,0.055757,<=,left,,-0.005640,278.0,278
606,9,6,9-L13,,,9-S13,,,,,,,-0.028840,15.0,15
