# MLflow Experiment

This notebook shows how to run multiple ML experiments efficiently using `dutil` and `mlflow`:
- make the dependencies between pipeline tasks explicit
- record and visualize metrics from multiple runs (MLflow)
- cache the outputs of all pipeline steps (data and models) on disk (`dutil.pipeline`)
- run the pipeline with different parameters (Papermill)

Where to look:
- the experiment pipeline is defined in `mlflow_experiment.py`
- a summary of the metrics is shown in the MLflow UI: run `mlflow ui` in the shell
- notebooks are run with different parameters via Papermill in `mlflow_experiment_papermill.ipynb`
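
As a rough sketch of the Papermill step, the snippet below builds a small parameter grid and passes each combination to `papermill.execute_notebook`. The two swept parameters come from the parameters cell of this notebook; the notebook file names and the helper `run_all` are placeholders, so this is an assumption about how a driver could look, not a copy of `mlflow_experiment_papermill.ipynb`.

```python
import itertools

# Parameters to sweep; the names match this notebook's parameters cell.
grid = {
    "include_country": [True, False],
    "adjust_for_country": [True, False],
}

# One dict per combination of values.
param_sets = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]


def run_all(param_sets, input_path="experiment.ipynb"):
    """Execute the notebook once per parameter set (file names are hypothetical)."""
    import papermill as pm  # imported lazily so the grid can be built without papermill

    for i, params in enumerate(param_sets):
        pm.execute_notebook(input_path, f"experiment_run_{i}.ipynb", parameters=params)
```

Calling `run_all(param_sets)` would write one executed notebook per combination. Papermill injects the values in a new cell placed right after the cell tagged `parameters`, so they override the defaults defined there.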

Limitations:
- currently, `dutil.pipeline` supports only the "threads" Dask scheduler

## Setup

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score, mean_absolute_error, make_scorer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold, GridSearchCV, cross_validate
import dutil.pipeline as dpipe
from loguru import logger
from pprint import pprint

import mlflow_experiment as experiment

## Experiment

The pipeline is constructed in `mlflow_experiment.py`.

In [2]:
# --- Global Notebook Parameters ---
fversion = 0
mversion = 0
include_country = True
adjust_for_country = True
target = 'e'
test_ratio = 0.3
n_folds = 2
_n_jobs = 1

In [3]:
# Parameters
include_c = False
adjust_for_country = True


In [4]:
experiment.params.update_many(dict(
    fversion=fversion,
    mversion=mversion,
    include_country=include_country,
    adjust_for_country=adjust_for_country,
    target=target,
    test_ratio=test_ratio,
    n_folds=n_folds,
    _n_jobs=_n_jobs,
))

### Linear Regression

In [5]:
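with experiment.params.context(dict(
    model_name='lr',
    _model=Pipeline([
        # Note: with the default strategy='mean', fill_value is ignored;
        # it only takes effect with strategy='constant'.
        ('t', SimpleImputer(fill_value=0)),
        ('e', LinearRegression()),
    ]),
)):
    model, results = experiment.run_experiment()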

2020-12-18 16:05:12.519 | INFO     | dutil.pipeline._cached:new_foo:314 - Task load_data_3.pickle: skip (cache exists)


2020-12-18 16:05:12.520 | INFO     | dutil.pipeline._cached:new_foo:314 - Task load_data_1.pickle: skip (cache exists)


2020-12-18 16:05:12.520 | INFO     | dutil.pipeline._cached:new_foo:314 - Task load_data_2.pickle: skip (cache exists)


2020-12-18 16:05:12.522 | INFO     | dutil.pipeline._cached:new_foo:314 - Task mlpipe_make_x_y_df_1|9686102406375020340_df_2|3176941632375591712_df_3|16702583620649360787_include_country|True_adjust_for_country|True_target|e_fversion|0.pickle: skip (cache exists)


2020-12-18 16:05:12.523 | INFO     | dutil.pipeline._cached:new_foo:314 - Task mlpipe_split_x_y_X|68681996546272477_y|16896747200748431878_test_ratio|0.3.pickle: skip (cache exists)


2020-12-18 16:05:12.525 | INFO     | dutil.pipeline._cached:new_foo:314 - Task mlpipe_crossval_model_model_name|lr_X|6317714561027695988_y|805566526682016355_n_folds|2_mversion|0_n_jobs|1.pickle: skip (cache exists)


2020-12-18 16:05:12.527 | DEBUG    | dutil.pipeline._cached:load:196 - Task mlpipe_crossval_model_model_name|lr_X|6317714561027695988_y|805566526682016355_n_folds|2_mversion|0_n_jobs|1.pickle: data has been loaded from cache


2020-12-18 16:05:12.527 | INFO     | mlflow_experiment:run_experiment:159 - Experiment run is finished
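
The log lines above show each task being skipped because a cache file, named after the task and its argument values, already exists. The idea can be sketched with the standard library alone; the `cached` decorator and the `split_x_y` function below are hypothetical illustrations and do not reproduce the `dutil.pipeline` API or its key format.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def cached(cache_dir):
    """Decorator: store a function's result on disk, keyed by its name and keyword arguments."""
    cache_dir = Path(cache_dir)

    def decorator(func):
        def wrapper(**kwargs):
            # Build a stable key from the sorted keyword arguments.
            digest = hashlib.sha256(pickle.dumps(sorted(kwargs.items()))).hexdigest()[:16]
            path = cache_dir / f"{func.__name__}_{digest}.pickle"
            if path.exists():
                # Cache hit: skip the computation and load the stored result.
                return pickle.loads(path.read_bytes())
            result = func(**kwargs)
            path.write_bytes(pickle.dumps(result))
            return result
        return wrapper
    return decorator


calls = []  # records real executions so cache hits are visible


@cached(tempfile.mkdtemp())
def split_x_y(test_ratio):
    calls.append(test_ratio)
    return {"test_ratio": test_ratio}


first = split_x_y(test_ratio=0.3)   # computed and written to disk
second = split_x_y(test_ratio=0.3)  # loaded from the cache file
third = split_x_y(test_ratio=0.2)   # different argument, so recomputed
```

Because the key depends on the argument values, changing a parameter invalidates only the tasks that receive it, while unchanged upstream tasks keep hitting the cache.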


In [6]:
pprint(model)
print()
pprint(results)

Pipeline(steps=[('t', SimpleImputer(fill_value=0)), ('e', LinearRegression())])

{'fit_time': array([0.00358796, 0.00284004]),
 'score_time': array([0.00155425, 0.00142932]),
 'test_mae': array([-1.66666667, -4.        ]),
 'test_r2': array([-10.55555556, -15.        ]),
 'train_mae': array([-2.22044605e-16, -1.11022302e-16]),
 'train_r2': array([1., 1.])}
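
The MAE values above are negative. Assuming they come from scikit-learn scorers, which follow a greater-is-better convention and therefore negate error metrics, the sign has to be flipped before reporting. The snippet below does this and averages each metric across the two folds, using the numbers printed above.

```python
from statistics import mean

# Fold scores copied from the cross-validation output above.
results = {
    "test_mae": [-1.66666667, -4.0],
    "test_r2": [-10.55555556, -15.0],
    "train_r2": [1.0, 1.0],
}

# Assumption: MAE was negated by the scorer, so flip the sign back.
mean_test_mae = mean(-score for score in results["test_mae"])
mean_test_r2 = mean(results["test_r2"])
mean_train_r2 = mean(results["train_r2"])

print(f"test MAE: {mean_test_mae:.3f}, test R2: {mean_test_r2:.3f}, train R2: {mean_train_r2:.3f}")
```

A train R2 of 1 next to a strongly negative test R2 suggests that the model fits the training folds exactly but does not generalize on this data.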


### Random Forest

In [7]:
with experiment.params.context(dict(
    model_name='rf',
    _model=Pipeline([
        # Note: fill_value only takes effect with strategy='constant'.
        ('t', SimpleImputer(fill_value=0)),
        ('e', RandomForestRegressor()),
    ]),
)):
    model, results = experiment.run_experiment()

2020-12-18 16:05:12.589 | INFO     | dutil.pipeline._cached:new_foo:314 - Task load_data_1.pickle: skip (cache exists)


2020-12-18 16:05:12.590 | INFO     | dutil.pipeline._cached:new_foo:314 - Task load_data_2.pickle: skip (cache exists)


2020-12-18 16:05:12.591 | INFO     | dutil.pipeline._cached:new_foo:314 - Task load_data_3.pickle: skip (cache exists)


2020-12-18 16:05:12.595 | INFO     | dutil.pipeline._cached:new_foo:314 - Task mlpipe_make_x_y_df_1|9686102406375020340_df_2|3176941632375591712_df_3|16702583620649360787_include_country|True_adjust_for_country|True_target|e_fversion|0.pickle: skip (cache exists)


2020-12-18 16:05:12.597 | INFO     | dutil.pipeline._cached:new_foo:314 - Task mlpipe_split_x_y_X|68681996546272477_y|16896747200748431878_test_ratio|0.3.pickle: skip (cache exists)


2020-12-18 16:05:12.599 | INFO     | dutil.pipeline._cached:new_foo:314 - Task mlpipe_crossval_model_model_name|rf_X|6317714561027695988_y|805566526682016355_n_folds|2_mversion|0_n_jobs|1.pickle: skip (cache exists)


2020-12-18 16:05:12.603 | DEBUG    | dutil.pipeline._cached:load:196 - Task mlpipe_crossval_model_model_name|rf_X|6317714561027695988_y|805566526682016355_n_folds|2_mversion|0_n_jobs|1.pickle: data has been loaded from cache


2020-12-18 16:05:12.604 | INFO     | mlflow_experiment:run_experiment:159 - Experiment run is finished


In [8]:
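# Parameters whose names start with an underscore (here `_model` and `_n_jobs`)
# are absent from the get_params() output below.
with experiment.params.context(dict(
    model_name='rf',
    _model=RandomForestRegressor(),
)):
    print(experiment.params.get_params())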

{'fversion': 0, 'mversion': 0, 'include_country': True, 'adjust_for_country': True, 'target': 'e', 'test_ratio': 0.3, 'n_folds': 2, 'model_name': 'rf'}
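
The printed dictionary omits `_model` and `_n_jobs`, which suggests that underscore-prefixed parameters are kept out of the public parameter view. A minimal sketch of a parameter store with this behavior follows; the `Params` class only mirrors the method names used in this notebook (`update_many`, `context`, `get_params`) and is an illustration, not the code in `mlflow_experiment.py`.

```python
from contextlib import contextmanager


class Params:
    """Illustrative parameter store; not the implementation in mlflow_experiment.py."""

    def __init__(self):
        self._values = {}

    def update_many(self, new_values):
        self._values.update(new_values)

    @contextmanager
    def context(self, overrides):
        # Apply overrides for the duration of the with-block, then restore.
        saved = dict(self._values)
        self._values.update(overrides)
        try:
            yield self
        finally:
            self._values = saved

    def get_params(self):
        # Assumption: names starting with "_" are excluded from the public view.
        return {k: v for k, v in self._values.items() if not k.startswith("_")}


params = Params()
params.update_many({"n_folds": 2, "_n_jobs": 1})
with params.context({"model_name": "rf", "_model": object()}):
    inside = params.get_params()
outside = params.get_params()
```

Here `inside` contains `n_folds` and `model_name` only, and `outside` is back to `n_folds` alone once the `with` block exits.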
