# Generic Scikit-Learn Classifier With Dask

Run any scikit-learn compatible classifier or list of classifiers

### steps

1. **Generate a scikit-learn model configuration** using the `model_pkg_class` parameter
   * input a package and class name, for example, `sklearn.linear_model.LogisticRegression`  
   * mlrun will find the class and instantiate a copy using default parameters  
   * You can modify both the model class and the fit methods
2. **Get a sample of data** from a data source
   * select a random sample of rows using a negative integer
   * select consecutive rows using a positive integer
3. **Split the data** into train, validation, and test sets 
   * the test set is saved as an artifact and never seen again until testing
4. **Train the model** 
5. **pickle / serialize the model**
   * models can be pickled or saved as json
6. **Evaluate the model**
   * a custom evaluator can be provided, see function doc for details


Train a sklearn classifier with Dask
    
    :param context:                 Function context.
    :param dataset:                 Raw data file.
    :param model_pkg_class:         Model to train, e.g, "sklearn.ensemble.RandomForestClassifier", 
                                    or json model config.
    :param label_column:            (label) Ground-truth y labels.
    :param train_validation_size:   (0.75) Train validation set proportion out of the full dataset.
    :param sample:                  (1.0) Select sample from dataset (n-rows/% of total), randomzie rows as default.
    :param models_dest:             (models) Models subfolder on artifact path.
    :param test_set_key:            (test_set) Mlrun db key of held out data in artifact store.
    :param plots_dest:              (plots) Plot subfolder on artifact path.
    :param dask_key:                (dask key) Key of dataframe in dask client "datasets" attribute.
    :param dask_persist:            (False) Should the data be persisted (through the `client.persist`)
    :param scheduler_key:           (scheduler) Dask scheduler configuration, json also logged as an artifact.
    :param file_ext:                (parquet) format for test_set_key hold out data
    :param random_state:            (42) sklearn seed



### TODO

1. Add cross validation methods
2. Improve dask efficiency by calling dask data frame (not from pandas)
3. Log dataset artifact as dask data frame 
4. Add values imputer (instead of drop na)

### Save and Config

In [1]:
import mlrun
import yaml

with open("item.yaml") as item_file:
    items = yaml.load(item_file, Loader=yaml.FullLoader)

_, artifact_path = mlrun.set_environment(artifact_path="./")

# create job function object from notebook code
fn = mlrun.code_to_function(
    items["name"],
    kind=items["spec"]["kind"],
    handler=items["spec"]["handler"],
    filename=items["spec"]["filename"],
    image=items["spec"]["image"],
    description=items["description"],
    categories=items["categories"],
    labels=items["labels"],
)

# add metadata (for templates and reuse)
fn.export("sklearn_classifier_dask.yaml")

> 2021-02-18 10:03:18,942 [info] function spec saved to path: sklearn_classifier_dask.yaml


<mlrun.runtimes.kubejob.KubejobRuntime at 0x7f05ffaeaa90>

### Init Dask

#### init a dask cluster and set dask specs

In [2]:
fn.apply(mlrun.platforms.auto_mount())
DATA_URL = "/User/iris.csv"

In [3]:
!curl -L "https://s3.wasabisys.com/iguazio/data/iris/iris_dataset.csv" > {DATA_URL}

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2776  100  2776    0     0  16426      0 --:--:-- --:--:-- --:--:-- 16426


In [4]:
# create a dask test cluster (dask function)
dask_cluster = mlrun.new_function("dask_tests", kind="dask", image="mlrun/ml-models")
dask_cluster.apply(mlrun.mount_v3io())
dask_cluster.spec.remote = True
dask_cluster.with_requests(mem="8G")
dask_cluster.save()

> 2021-02-18 10:03:19,788 [info] using in-cluster config.


'1907624df6a69852a02e55316d74c4050305b29e'

#### init dask client 
copy the scheduler address to **DASK_CLIENT** param in the following cell, this will make the function use the dask cluster.

In [5]:
dask_cluster.client

> 2021-02-18 10:03:25,327 [info] to get a dashboard link, use NodePort service_type
> 2021-02-18 10:03:25,328 [info] trying dask client at: tcp://mlrun-dask-tests-b9d87c48-f.default-tenant:8786
> 2021-02-18 10:03:25,360 [info] using remote dask scheduler (mlrun-dask-tests-b9d87c48-f) at: tcp://mlrun-dask-tests-b9d87c48-f.default-tenant:8786



+-------------+--------+-----------+---------+
| Package     | client | scheduler | workers |
+-------------+--------+-----------+---------+
| blosc       | 1.7.0  | 1.10.2    | None    |
| distributed | 2.30.0 | 2.30.1    | None    |
| lz4         | 3.1.0  | 3.1.3     | None    |
| msgpack     | 1.0.0  | 1.0.2     | None    |
| toolz       | 0.11.1 | 0.10.0    | None    |
| tornado     | 6.0.4  | 6.0.3     | None    |
+-------------+--------+-----------+---------+
Notes: 
-  msgpack: Variation is ok, as long as everything is above 0.6


0,1
Client  Scheduler: tcp://mlrun-dask-tests-b9d87c48-f.default-tenant:8786  Dashboard: http://mlrun-dask-tests-b9d87c48-f.default-tenant:8787/status,Cluster  Workers: 0  Cores: 0  Memory: 0 B


### Set Parameters

In [6]:
task_params = {
    "params": {
        "sample": 1,
        "train_val_split": 0.75,
        "random_state": 42,
        "n_jobs": -1,
        "plots_dest": "plots-p",
        "models_dest": "sklearn-clfmodel",
    }
}


models = [
    "sklearn.ensemble.RandomForestClassifier",
    "sklearn.ensemble.AdaBoostClassifier",
    "sklearn.linear_model.LogisticRegression",
]

### Test and Run

In [7]:
import os
from sklearn_classifier_dask import train_model

outputs = []
for model in models:
    task_copy = task_params.copy()
    task_copy.update(
        {
            "params": {
                "model_pkg_class": model,
                "label_column": "label",
                "dask_function": "db://default/dask_tests",
            }
        }
    )

    # customize specific model parameters
    if "RandomForestClassifier" in model:
        task_copy["params"].update({"CLASS_max_depth": 5})

    if "LogisticRegression" in model:
        task_copy["params"].update({"CLASS_solver": "liblinear"})

    if "AdaBoostClassifier" in model:
        task_copy["params"].update(
            {"CLASS_n_estimators": 200, "CLASS_learning_rate": 0.01}
        )

    name = model.replace(".", "_")
    output = fn.run(
        mlrun.NewTask(**task_copy),
        handler=train_model,
        name=name,
        inputs={"dataset": DATA_URL},
        artifact_path=os.path.join(artifact_path, model),
        local=False,
    )

    outputs.append({name: output.outputs})

> 2021-02-18 10:03:26,304 [info] starting run sklearn_ensemble_RandomForestClassifier uid=922ffa8a474643b1b914e632663b3913 DB=http://mlrun-api:8080
> 2021-02-18 10:03:26,455 [info] Job is running in the background, pod: sklearn-ensemble-randomforestclassifier-mx95q
> 2021-02-18 10:03:31,828 [info] using in-cluster config.
> 2021-02-18 10:03:31,828 [info] to get a dashboard link, use NodePort service_type
> 2021-02-18 10:03:31,829 [info] trying dask client at: tcp://mlrun-dask-tests-b9d87c48-f.default-tenant:8786
> 2021-02-18 10:03:31,851 [info] using remote dask scheduler (mlrun-dask-tests-b9d87c48-f) at: tcp://mlrun-dask-tests-b9d87c48-f.default-tenant:8786
> 2021-02-18 10:03:31,851 [info] Read Data
> 2021-02-18 10:03:31,869 [info] Prep Data
> 2021-02-18 10:03:36,231 [info] Split and Train
> 2021-02-18 10:03:37,482 [info] Evaluate
> 2021-02-18 10:03:39,175 [info] Log artifacts
> 2021-02-18 10:03:39,518 [info] Done!
> 2021-02-18 10:03:39,553 [info] run executed, status=completed
final 

project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
default,...663b3913,0,Feb 18 10:03:31,completed,sklearn_ensemble_RandomForestClassifier,v3io_user=eyalskind=jobowner=eyalshost=sklearn-ensemble-randomforestclassifier-mx95qclass=sklearn.ensemble.RandomForestClassifier,dataset,model_pkg_class=sklearn.ensemble.RandomForestClassifierlabel_column=labeldask_function=db://default/dask_testsCLASS_max_depth=5,micro=0.9989612188365651macro=0.998483560090703precision-2=1.0precision-0=1.0precision-1=0.9090909090909091recall-2=1.0recall-0=0.9285714285714286recall-1=1.0f1-2=1.0f1-0=0.962962962962963f1-1=0.9523809523809523,ROCAUCClassificationReportConfusionMatrixFeatureImportancesmodelstandard_scalerlabel_encodertest_set


to track results use .show() or .logs() or in CLI: 
!mlrun get run 922ffa8a474643b1b914e632663b3913 --project default , !mlrun logs 922ffa8a474643b1b914e632663b3913 --project default
> 2021-02-18 10:03:45,658 [info] run executed, status=completed
> 2021-02-18 10:03:45,658 [info] starting run sklearn_ensemble_AdaBoostClassifier uid=62fd15e0a45e49a7a6e8c266c66d4e5e DB=http://mlrun-api:8080
> 2021-02-18 10:03:45,847 [info] Job is running in the background, pod: sklearn-ensemble-adaboostclassifier-7fnf6
> 2021-02-18 10:03:50,828 [info] using in-cluster config.
> 2021-02-18 10:03:50,829 [info] to get a dashboard link, use NodePort service_type
> 2021-02-18 10:03:50,829 [info] trying dask client at: tcp://mlrun-dask-tests-b9d87c48-f.default-tenant:8786
> 2021-02-18 10:03:50,851 [info] using remote dask scheduler (mlrun-dask-tests-b9d87c48-f) at: tcp://mlrun-dask-tests-b9d87c48-f.default-tenant:8786
> 2021-02-18 10:03:50,851 [info] Read Data
> 2021-02-18 10:03:50,869 [info] Prep Data
> 2021-0

project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
default,...c66d4e5e,0,Feb 18 10:03:50,completed,sklearn_ensemble_AdaBoostClassifier,v3io_user=eyalskind=jobowner=eyalshost=sklearn-ensemble-adaboostclassifier-7fnf6class=sklearn.ensemble.AdaBoostClassifier,dataset,model_pkg_class=sklearn.ensemble.AdaBoostClassifierlabel_column=labeldask_function=db://default/dask_testsCLASS_n_estimators=200CLASS_learning_rate=0.01,micro=0.9065096952908587macro=0.930850151695027precision-1=1.0precision-2=0.6666666666666666precision-0=0.8recall-1=1.0recall-2=0.6recall-0=0.8421052631578947f1-1=1.0f1-2=0.631578947368421f1-0=0.8205128205128205,ROCAUCClassificationReportConfusionMatrixFeatureImportancesmodelstandard_scalerlabel_encodertest_set


to track results use .show() or .logs() or in CLI: 
!mlrun get run 62fd15e0a45e49a7a6e8c266c66d4e5e --project default , !mlrun logs 62fd15e0a45e49a7a6e8c266c66d4e5e --project default
> 2021-02-18 10:04:05,056 [info] run executed, status=completed
> 2021-02-18 10:04:05,057 [info] starting run sklearn_linear_model_LogisticRegression uid=cea75b1f16f449938c590fb044024b32 DB=http://mlrun-api:8080
> 2021-02-18 10:04:05,205 [info] Job is running in the background, pod: sklearn-linear-model-logisticregression-8wr6m
> 2021-02-18 10:04:10,303 [info] using in-cluster config.
> 2021-02-18 10:04:10,303 [info] to get a dashboard link, use NodePort service_type
> 2021-02-18 10:04:10,303 [info] trying dask client at: tcp://mlrun-dask-tests-b9d87c48-f.default-tenant:8786
> 2021-02-18 10:04:10,328 [info] using remote dask scheduler (mlrun-dask-tests-b9d87c48-f) at: tcp://mlrun-dask-tests-b9d87c48-f.default-tenant:8786
> 2021-02-18 10:04:10,328 [info] Read Data
> 2021-02-18 10:04:10,348 [info] Prep Data


project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
default,...44024b32,0,Feb 18 10:04:10,completed,sklearn_linear_model_LogisticRegression,v3io_user=eyalskind=jobowner=eyalshost=sklearn-linear-model-logisticregression-8wr6mclass=sklearn.linear_model.LogisticRegression,dataset,model_pkg_class=sklearn.linear_model.LogisticRegressionlabel_column=labeldask_function=db://default/dask_testsCLASS_solver=liblinear,micro=0.9387119113573407macro=0.9337594951121522precision-1=1.0precision-2=0.7692307692307693precision-0=0.6428571428571429recall-1=0.9166666666666666recall-2=0.6666666666666666recall-0=0.8181818181818182f1-1=0.9565217391304348f1-2=0.7142857142857142f1-0=0.7200000000000001,ROCAUCClassificationReportConfusionMatrixFeatureImportancesmodelstandard_scalerlabel_encodertest_set


to track results use .show() or .logs() or in CLI: 
!mlrun get run cea75b1f16f449938c590fb044024b32 --project default , !mlrun logs cea75b1f16f449938c590fb044024b32 --project default
> 2021-02-18 10:04:24,443 [info] run executed, status=completed
