# LightGBM + Dask

This notebook shows how to use `lightgbm.dask` to train a LightGBM model on data stored as a [Dask DataFrame](https://docs.dask.org/en/latest/dataframe.html) or [Dask Array](https://docs.dask.org/en/latest/array.html).

It uses a `LocalCluster`, so everything runs on this machine. To try a distributed cluster on AWS Fargate instead, see [the AWS notebook](./aws.ipynb).

In [1]:
import dask.array as da
from dask.distributed import Client, LocalCluster, wait
from lightgbm.dask import DaskLGBMRegressor
from sklearn.datasets import make_regression

## Set up a Dask cluster

Create a cluster with 3 workers. Since this is a `LocalCluster`, those workers are just 3 local processes.

In [2]:
n_workers = 3
cluster = LocalCluster(n_workers=n_workers)
client = Client(cluster)
client.wait_for_workers(n_workers)

print(f"View the dashboard: {cluster.dashboard_link}")

View the dashboard: http://127.0.0.1:8787/status


Click the link above to view a diagnostic dashboard while you run the training code below.

## Get some training data

In [5]:
X, y = make_regression(n_samples=10000, random_state=42)
dX = da.from_array(X, chunks=(1000, X.shape[1]))
dy = da.from_array(y, chunks=1000)
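
The `chunks` arguments above control how the rows are partitioned: Dask cuts each axis into blocks of the requested size, and the final block is smaller if the size does not divide the axis evenly. As a rough illustration of that arithmetic only, here is a plain-Python sketch; `chunk_sizes` is a hypothetical helper written for this notebook, not a Dask function.

```python
from typing import List


def chunk_sizes(n_rows: int, rows_per_chunk: int) -> List[int]:
    """Row count of each block when n_rows is cut into blocks of rows_per_chunk."""
    if n_rows < 0 or rows_per_chunk <= 0:
        raise ValueError("n_rows must be >= 0 and rows_per_chunk must be > 0")
    full_blocks, remainder = divmod(n_rows, rows_per_chunk)
    sizes = [rows_per_chunk] * full_blocks
    if remainder:
        # The last block holds whatever rows are left over.
        sizes.append(remainder)
    return sizes


# 10,000 rows in blocks of 1,000 rows gives ten equal blocks.
sizes = chunk_sizes(10_000, 1_000)
print(len(sizes), sizes[0])
```

Both arguments have to be integers, which is why shapes and chunk sizes should be computed with integer arithmetic rather than float literals.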

In [None]:
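# Shapes and chunk sizes must be integers, so avoid float literals like 1e5.
num_rows = 100_000
num_features = 100
num_partitions = 10
rows_per_chunk = num_rows // num_partitions

# Random features and labels, split row-wise into num_partitions chunks.
data = da.random.random((num_rows, num_features), chunks=(rows_per_chunk, num_features))
labels = da.random.random((num_rows,), chunks=(rows_per_chunk,))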

Right now, the Dask Arrays `dX` and `dy` are lazy. Before training, you can force the cluster to compute them by calling `.persist()`, then block until that computation finishes by calling `wait()` on them.

In [6]:
dX = dX.persist()
dy = dy.persist()
_ = wait([dX, dy])

## Train a model

With the data set up on the workers, train a model. `lightgbm.dask.DaskLGBMRegressor` has an interface that tries to stay as close as possible to the non-Dask scikit-learn interface to LightGBM (`lightgbm.sklearn.LGBMRegressor`).

In [7]:
dask_reg = DaskLGBMRegressor(
    client=client,
    silent=False,
    max_depth=5,
    random_state=708,
    objective="regression_l2",
    learning_rate=0.1,
    tree_learner="data",
    n_estimators=10,
    min_child_samples=1,
    n_jobs=-1,
)

dask_reg.fit(
    X=dX,
    y=dy,
)

DaskLGBMRegressor(client=<Client: 'tcp://127.0.0.1:39759' processes=3 threads=3, memory=2.08 GB>,
                  local_listen_port=12400,
                  machines='127.0.0.1:12400,127.0.0.1:12401,127.0.0.1:12402',
                  max_depth=5, min_child_samples=1, n_estimators=10,
                  num_machines=3, num_threads=1, objective='regression_l2',
                  random_state=708, silent=False, time_out=120,
                  tree_learner='data')

## Evaluate the model

In [None]:
# predict() returns a lazy Dask Array with one prediction per row of dX.
preds = dask_reg.predict(dX)
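
One simple way to score a regressor is the root mean squared error (RMSE) between predictions and labels. The sketch below uses small NumPy arrays with made-up values so that it runs on its own; `rmse` is a hypothetical helper, not part of LightGBM or Dask.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error between two arrays of the same shape."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy values for illustration only; these are not outputs of the model above.
y_true = np.array([3.0, -1.0, 2.0, 0.0])
y_pred = np.array([2.0, -1.0, 4.0, 0.0])
print(rmse(y_true, y_pred))
```

With Dask Arrays, the same arithmetic (for example `((preds - dy) ** 2).mean()`) stays lazy until `.compute()` is called on the result.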

The model produced by this training run is an instance of `DaskLGBMRegressor`. To get a regular non-Dask model (which can be pickled and saved), run `.to_local()`.

In [8]:
local_model = dask_reg.to_local()
type(local_model)

lightgbm.sklearn.LGBMRegressor

You can inspect this model by looking at a data frame representation of its trees, with one row per node.

In [9]:
local_model.booster_.trees_to_dataframe()

| | tree_index | node_depth | node_index | left_child | right_child | parent_index | split_feature | split_gain | threshold | decision_type | missing_direction | missing_type | value | weight | count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0-S0 | 0-S2 | 0-S1 | | Column_52 | 53967200.0 | -0.087022 | <= | left | | -1.380680 | 0.0 | 10000 |
| 1 | 0 | 2 | 0-S2 | 0-S6 | 0-S4 | 0-S0 | Column_94 | 21943500.0 | 0.014280 | <= | left | | -9.071060 | 4793.0 | 4793 |
| 2 | 0 | 3 | 0-S6 | 0-S16 | 0-S8 | 0-S2 | Column_23 | 10207000.0 | -0.664254 | <= | left | | -15.824600 | 2401.0 | 2401 |
| 3 | 0 | 4 | 0-S16 | 0-L0 | 0-S29 | 0-S6 | Column_58 | 2463690.0 | 0.026021 | <= | left | | -27.260700 | 589.0 | 589 |
| 4 | 0 | 5 | 0-L0 | | | 0-S16 | | | | | | | -33.522770 | 304.0 | 304 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 605 | 9 | 6 | 9-L6 | | | 9-S22 | | | | | | | -1.614494 | 121.0 | 121 |
| 606 | 9 | 6 | 9-L23 | | | 9-S22 | | | | | | | 6.909675 | 212.0 | 212 |
| 607 | 9 | 5 | 9-S15 | 9-L11 | 9-L16 | 9-S10 | Column_94 | 1242230.0 | 0.788213 | <= | left | | 11.775900 | 1175.0 | 1175 |
| 608 | 9 | 6 | 9-L11 | | | 9-S15 | | | | | | | 10.042623 | 915.0 | 915 |
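
To read this output: each row is one node. Split nodes (an `S` in `node_index`) send a sample to `left_child` when its value for `split_feature` satisfies `decision_type` against `threshold`, and to `right_child` otherwise. Leaf nodes (an `L` in `node_index`) have no children and carry their output in `value`. The sketch below walks a single hypothetical tree that reuses those column names. The tree, the feature values, and the `leaf_value` helper are all made up for illustration, and the sketch ignores missing-value handling (`missing_direction`) and categorical splits.

```python
# Hypothetical two-split tree keyed by node_index, using the table's column names.
TOY_TREE = {
    "0-S0": {"split_feature": "Column_0", "threshold": 0.5, "decision_type": "<=",
             "left_child": "0-L0", "right_child": "0-S1"},
    "0-S1": {"split_feature": "Column_1", "threshold": -1.0, "decision_type": "<=",
             "left_child": "0-L1", "right_child": "0-L2"},
    "0-L0": {"value": -2.0},
    "0-L1": {"value": 0.5},
    "0-L2": {"value": 3.0},
}


def leaf_value(tree, sample, root="0-S0"):
    """Follow the splits from the root and return the value of the leaf reached."""
    node = tree[root]
    # Leaves are the nodes without a left_child entry.
    while node.get("left_child") is not None:
        if node["decision_type"] != "<=":
            raise NotImplementedError("this sketch only handles numerical '<=' splits")
        go_left = sample[node["split_feature"]] <= node["threshold"]
        node = tree[node["left_child"] if go_left else node["right_child"]]
    return node["value"]


# Column_0 = 0.9 fails the first split (go right); Column_1 = -3.0 passes the second (go left).
print(leaf_value(TOY_TREE, {"Column_0": 0.9, "Column_1": -3.0}))  # 0.5
```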
