# LightGBM Classifier Training With And Without Dask Test

This test generates a large binary classification dataset and trains a LightGBM model on it twice: once with Dask and once without. It then verifies that:
  * The accuracy did not degrade when training with Dask.
  * The Dask training run was faster.

## General Configurations

In [1]:
# Path to store the generated data:
DATA_PATH = "./data"

# Number of samples of generated data (number of rows in the data table):
N_SAMPLES = 2_000_000

# Number of features of the generated data (number of columns in the data table):
N_FEATURES = 40

# The fraction of the data to hold out as the testing set:
TRAIN_TEST_SPLIT = 0.33

# The number of parquet partitions to split the generated data into:
N_PARTITIONS = 20

# The maximum accuracy drop to accept in the Dask run:
ACC_EROR = 0.1
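
For orientation, the settings above imply roughly the following table sizes. This is a minimal sketch that uses only the constants from the configuration cell. The rounding of the test-set size is an assumption (the exact rule is up to `train_test_split`), and the per-partition figures are averages, because the partition values are drawn at random.

```python
# Constants copied from the configuration cell above.
N_SAMPLES = 2_000_000
TRAIN_TEST_SPLIT = 0.33
N_PARTITIONS = 20

# Approximate split sizes (assumption: the test share is rounded to whole rows).
n_test = round(N_SAMPLES * TRAIN_TEST_SPLIT)
n_train = N_SAMPLES - n_test

# Average rows per partition value in each set (actual counts vary slightly).
avg_train_rows = n_train / N_PARTITIONS
avg_test_rows = n_test / N_PARTITIONS

print(f"train rows: {n_train:,}, test rows: {n_test:,}")
print(f"avg rows per partition: train {avg_train_rows:,.0f}, test {avg_test_rows:,.0f}")
```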

## 1. Generate Data

1. Generate binary classification data.
2. Turn the data into a `pandas.DataFrame`, naming the columns `feature_{i}` and adding the partitioning column (`year`).
3. Split the data into training and testing sets.
4. Save both sets as parquet datasets partitioned by `year`.

In [2]:
import os
import shutil

import numpy as np
import pandas as pd

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split


def generate_data(
    output_path: str,
    n_samples: int, 
    n_features: int, 
    test_size: float, 
    n_partitions: int,
):
    # Generate data:
    x, y = make_classification(
        n_samples=n_samples, 
        n_features=n_features, 
        n_informative=int(n_features / 4) + 1, 
        n_classes=2, 
        n_redundant=int(n_features / 10) + 1,
    )
    
    # Create a dataframe:
    data = pd.DataFrame(
        data=x, 
        columns=[f"feature_{i}" for i in range(n_features)]
    )
    data["year"] = np.random.randint(2000, 2000 + n_partitions, size=n_samples)
    data["label"] = y
    
    # Split into train and test sets:
    train_set, test_set = train_test_split(data, test_size=test_size)
    
    # Save to parquets:
    train_set.to_parquet(f"{output_path}/train", partition_cols=["year"])
    test_set.to_parquet(f"{output_path}/test", partition_cols=["year"])
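
To make the arguments of `generate_data` concrete, the following sketch works through the feature breakdown and the partition values for the configured `N_FEATURES = 40` and `N_PARTITIONS = 20`. It assumes that `make_classification` adds no repeated features (its default), so every column that is neither informative nor redundant is noise. The `year=<value>` directory names follow the hive-style layout that partitioned parquet writes use.

```python
# Constants copied from the configuration cell.
N_FEATURES = 40
N_PARTITIONS = 20

# Same formulas as in generate_data():
n_informative = int(N_FEATURES / 4) + 1
n_redundant = int(N_FEATURES / 10) + 1

# Assumption: no repeated features, so the rest are noise columns.
n_noise = N_FEATURES - n_informative - n_redundant

# np.random.randint excludes the upper bound, so the years are 2000..2019.
years = list(range(2000, 2000 + N_PARTITIONS))

# One sub-directory per partition value under both "train" and "test".
partition_dirs = [f"year={year}" for year in years]

print(n_informative, n_redundant, n_noise)
print(partition_dirs[0], "...", partition_dirs[-1])
```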

Generate the data (requires write permission to the local directory).

In [3]:
# Delete past generated data (in case there was a failure):
if os.path.exists(DATA_PATH):
    shutil.rmtree(os.path.abspath(DATA_PATH))

# Generate new data:
generate_data(
    output_path=DATA_PATH,
    n_samples=N_SAMPLES, 
    n_features=N_FEATURES, 
    test_size=TRAIN_TEST_SPLIT,
    n_partitions=N_PARTITIONS,
)

## 2. Training Code

1. Read the data into a pandas (dask) `DataFrame`.
2. Split the data into `x` and `y`.
3. Initialize a `LGBMClassifier` (`DaskLGBMClassifier`) model.
4. Run training on the training set.
5. Run evaluation on the testing set.

Accuracy score will be logged as a result.
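
The logged metric is the fraction of test rows whose predicted label matches the true label. The sketch below shows that computation in plain Python on a toy example. The `accuracy` helper and the sample labels are hypothetical and only illustrate the metric; the training code itself relies on `sklearn.metrics.accuracy_score`.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Return the fraction of positions where the prediction equals the label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy of an empty set")
    matches = sum(1 for true, pred in zip(y_true, y_pred) if true == pred)
    return matches / len(y_true)


# Toy labels (hypothetical): four of the five predictions are correct.
example_true = [0, 1, 1, 0, 1]
example_pred = [0, 1, 0, 0, 1]
example_score = accuracy(example_true, example_pred)
print(example_score)
```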

In [4]:
# mlrun: start-code

In [5]:
import pandas as pd
import dask.dataframe  # Load the submodule explicitly; `import dask` alone does not guarantee it.
import lightgbm as lgb
from sklearn.metrics import accuracy_score
import mlrun
import mlrun.frameworks.lgbm as mlrun_lgbm


@mlrun.handler(outputs=["accuracy_score"])
def train(context: mlrun.MLClientCtx, data_path: str):
    # Check for a dask client:
    dask_function = context.get_param("dask_function", None)
    dask_client = mlrun.import_function(dask_function).client if dask_function else None
    
    # Get the data:
    read_parquet_function = dask.dataframe.read_parquet if dask_client else pd.read_parquet
    train_set = read_parquet_function(f"{data_path}/train")
    if dask_client:
        train_set = dask_client.persist(train_set)
    test_set = read_parquet_function(f"{data_path}/test")
    if dask_client:
        test_set = dask_client.persist(test_set)
    
    # Split into x and y:
    y_train = train_set.label
    x_train = train_set.drop(columns=["label"])
    y_test = test_set.label
    x_test = test_set.drop(columns=["label"])
    
    # Initialize a model:
    model = lgb.DaskLGBMClassifier(client=dask_client) if dask_client else lgb.LGBMClassifier()
    
    # Train:
    model.fit(x_train, y_train)
    
    # Predict:
    y_pred = model.predict(x_test)
    
    # Evaluate:
    if dask_client:
        y_test = y_test.compute()
        y_pred = y_pred.compute()
    return accuracy_score(y_test, y_pred)

In [6]:
# mlrun: end-code

## 3. Create a Project

1. Create the MLRun project.
2. Use MLRun's `code_to_function` to create an MLRun function of the training code.

In [7]:
import os
import shutil
import time

import mlrun

In [8]:
# Create the project:
project = mlrun.get_or_create_project(
    name="dask-lightgbm-test", 
    context="./", 
    user_project=True
)

> 2022-12-08 13:12:56,988 [info] loaded project dask-lightgbm-test from MLRun DB


In [9]:
# Create the training function:
train_function = mlrun.code_to_function(
    name="train",
    kind="job",
    image="mlrun/ml-models",
    handler="train",
)
train_function.apply(mlrun.auto_mount())

# Assign the function to the project:
project.set_function(train_function)

<mlrun.runtimes.kubejob.KubejobRuntime at 0x7f6f4fd28f50>

## 4. Run Without Dask

Run the training without Dask while timing it and storing the accuracy score.
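
The timing pattern used here is a plain wall-clock measurement around the whole `run` call, so it includes the time to schedule and start the job as well as the training itself. A minimal sketch of the same idea as a reusable helper follows. The `timed` context manager and the stand-in workload are hypothetical, and it uses `time.perf_counter()` rather than the notebook's `time.time()`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(results: dict, key: str):
    """Store the wall-clock duration of the with-block in results[key]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[key] = time.perf_counter() - start


timings = {}
with timed(timings, "example"):
    # Stand-in workload; in the notebook this would be train_function.run(...).
    total = sum(range(100_000))

print(f"example took {timings['example']:.4f} seconds")
```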

In [10]:
without_dask_time = time.time()
without_dask_run = train_function.run(
    name="without_dask",
    params={
        "data_path": os.path.abspath(DATA_PATH),
    },
)
without_dask_time = time.time() - without_dask_time
without_dask_score = without_dask_run.status.results['accuracy_score']

> 2022-12-08 13:13:03,764 [info] starting run without_dask uid=2f69739abc4f4a2585f041e2c60559fb DB=http://mlrun-api:8080
> 2022-12-08 13:13:03,942 [info] Job is running in the background, pod: without-dask-xgph2
> 2022-12-08 13:15:50,765 [info] run executed, status=completed
final state: completed


| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| dask-lightgbm-test-guyl | ...c60559fb | 0 | Dec 08 13:13:10 | completed | without_dask | v3io_user=guyl, kind=job, owner=guyl, mlrun/client_version=1.2.0, host=without-dask-xgph2 | | data_path=/User/dask/data, dask_function=None | accuracy_score=0.9535681818181818 | |





> 2022-12-08 13:15:54,693 [info] run executed, status=completed


## 5. Run With Dask

1. Create the Dask function.
2. Configure it.
3. Run the training with Dask while timing it and storing the accuracy score.

In [11]:
# Create the dask function:
dask_function = mlrun.new_function(name="my_dask", kind="dask", image="mlrun/ml-models")

# Configure the dask function specs:
dask_function.spec.remote = True
dask_function.spec.replicas = 5
dask_function.spec.service_type = 'NodePort'
dask_function.with_limits(mem="6G")
dask_function.spec.nthreads = 5
dask_function.apply(mlrun.auto_mount())

# Assign the function to the project:
project.set_function(dask_function)

# Save:
dask_function.save()

'db://dask-lightgbm-test-guyl/my_dask'
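
As a rough capacity estimate for the cluster configured above, the sketch below multiplies out the settings. It assumes that `nthreads` is the thread count per worker and that the `6G` memory limit applies to each worker pod; both are assumptions about how the Dask runtime interprets the spec, not something the recorded output confirms.

```python
# Values copied from the Dask function configuration above.
REPLICAS = 5        # dask_function.spec.replicas
NTHREADS = 5        # dask_function.spec.nthreads (assumed to be per worker)
MEM_LIMIT_GB = 6    # dask_function.with_limits(mem="6G") (assumed to be per worker)

# Aggregate resources available once all workers are up.
total_threads = REPLICAS * NTHREADS
total_memory_gb = REPLICAS * MEM_LIMIT_GB

print(f"{REPLICAS} workers, {total_threads} threads, up to {total_memory_gb} GB memory")
```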

In [12]:
dask_function.client

> 2022-12-08 13:16:05,945 [info] trying dask client at: tcp://mlrun-my-dask-9c07be63-2.default-tenant:8786
> 2022-12-08 13:16:05,982 [info] using remote dask scheduler (mlrun-my-dask-9c07be63-2) at: tcp://mlrun-my-dask-9c07be63-2.default-tenant:8786


Mismatched versions found

+-------------+--------+-----------+---------+
| Package     | client | scheduler | workers |
+-------------+--------+-----------+---------+
| blosc       | 1.7.0  | 1.10.6    | None    |
| cloudpickle | 2.0.0  | 2.2.0     | None    |
| lz4         | 3.1.0  | 3.1.10    | None    |
| msgpack     | 1.0.3  | 1.0.4     | None    |
| toolz       | 0.11.2 | 0.12.0    | None    |
| tornado     | 6.1    | 6.2       | None    |
+-------------+--------+-----------+---------+
Notes: 
-  msgpack: Variation is ok, as long as everything is above 0.6


Connection method: Direct
Dashboard: http://mlrun-my-dask-9c07be63-2.default-tenant:8787/status

Comm: tcp://10.200.81.217:8786
Dashboard: http://10.200.81.217:8787/status
Workers: 0
Total threads: 0
Started: Just now
Total memory: 0 B


In [13]:
with_dask_time = time.time()
with_dask_run = train_function.run(
    name="with_dask",
    params={
        "data_path": os.path.abspath(DATA_PATH),
        "dask_function": "db://" + dask_function.uri,
    },
)
with_dask_time = time.time() - with_dask_time
with_dask_score = with_dask_run.status.results['accuracy_score']

> 2022-12-08 13:16:06,080 [info] starting run without_dask uid=94b09f30ae274ab5bb87daff273664cf DB=http://mlrun-api:8080
> 2022-12-08 13:16:06,319 [info] Job is running in the background, pod: without-dask-j5lcf
> 2022-12-08 13:16:17,011 [info] trying dask client at: tcp://mlrun-my-dask-9c07be63-2.default-tenant:8786
> 2022-12-08 13:16:17,037 [info] using remote dask scheduler (mlrun-my-dask-9c07be63-2) at: tcp://mlrun-my-dask-9c07be63-2.default-tenant:8786
remote dashboard: default-tenant.app.dev6.lab.iguazeng.com:30198
Finding random open ports for workers
> 2022-12-08 13:17:00,996 [info] run executed, status=completed
Parameter n_jobs will be ignored.
final state: completed


| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| dask-lightgbm-test-guyl | ...273664cf | 0 | Dec 08 13:16:14 | completed | without_dask | v3io_user=guyl, kind=job, owner=guyl, mlrun/client_version=1.2.0, host=without-dask-j5lcf | | data_path=/User/dask/data, dask_function=db://dask-lightgbm-test-guyl/my_dask | accuracy_score=0.9530681818181819 | |





> 2022-12-08 13:17:05,262 [info] run executed, status=completed


## 6. Compare Runtimes

1. Print the collected results.
2. Verify that the Dask run took less time and that its accuracy score is either better than, or within `ACC_EROR` of, the score of the run without Dask.

In [14]:
# Delete the generated data:
shutil.rmtree(os.path.abspath(DATA_PATH))

In [15]:
# Print the test's collected results:
print(
    "Without dask:\n"
    f"\t{without_dask_time:.2f} Seconds\n"
    f"\tAccuracy: {without_dask_score}"
)
print(
    "With dask:\n"
    f"\t{with_dask_time:.2f} Seconds\n"
    f"\tAccuracy: {with_dask_score}\n"
)

# Verification:
assert with_dask_time < without_dask_time
assert (
    with_dask_score > without_dask_score or 
    without_dask_score - with_dask_score < ACC_EROR
)

# Summary message:
print(f"Overall x{without_dask_time / with_dask_time:.2f} faster!")

Without dask:
	170.97 Seconds
	Accuracy: 0.9535681818181818
With dask:
	59.22 Seconds
	Accuracy: 0.9530681818181819

Overall x2.89 faster!
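
The two checks can be reproduced by hand from the printed numbers. The sketch below recomputes the speedup and the accuracy gap from the recorded output. The test-set size of 660,000 rows is an assumption derived from the configuration (0.33 of 2,000,000 rows), used only to express the gap as a row count.

```python
# Values copied from the recorded output above.
without_dask_time, with_dask_time = 170.97, 59.22
without_dask_score = 0.9535681818181818
with_dask_score = 0.9530681818181819
ACC_EROR = 0.1

# Assumption: the test set holds 0.33 * 2_000_000 rows.
N_TEST_ROWS = 660_000

# Speedup of the Dask run over the run without Dask.
speedup = without_dask_time / with_dask_time

# Accuracy lost by the Dask run, as a fraction and as a number of test rows.
accuracy_gap = without_dask_score - with_dask_score
gap_in_rows = round(accuracy_gap * N_TEST_ROWS)

print(f"speedup: x{speedup:.2f}")
print(f"accuracy gap: {accuracy_gap:.4f} ({gap_in_rows} rows), allowed: {ACC_EROR}")
```

Under that assumption the gap of about 0.0005 corresponds to a few hundred test rows, far inside the accepted margin of 0.1.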
