# MLflow Experiment

This notebook shows how to run multiple ML experiments efficiently using `dutil` and `mlflow`:
- make the dependencies between pipeline tasks explicit
- run the pipeline with different parameters
- record and visualize metrics from multiple runs
- cache intermediate data and models
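The caching idea in the list above can be illustrated with a minimal sketch. It is not `dutil`'s implementation: the `cached` decorator, `CACHE_DIR`, and `split_data` are hypothetical names, and the only assumption carried over is that a task's result is stored under a name derived from the task and its parameters, so a rerun with the same parameters loads the result instead of recomputing it.

```python
import hashlib
import pickle
import tempfile
from functools import wraps
from pathlib import Path

# Hypothetical cache location; a real pipeline would use a persistent folder.
CACHE_DIR = Path(tempfile.mkdtemp(prefix="pipeline_cache_"))


def cached(func):
    """Cache a function's result on disk, keyed by its name and keyword arguments."""
    @wraps(func)
    def wrapper(**kwargs):
        # Build a stable key from the sorted keyword arguments.
        key_src = repr(sorted(kwargs.items())).encode()
        key = hashlib.sha256(key_src).hexdigest()[:16]
        path = CACHE_DIR / f"{func.__name__}_{key}.pickle"
        if path.exists():
            # Cache hit: load the stored result instead of recomputing.
            with path.open("rb") as fh:
                return pickle.load(fh)
        result = func(**kwargs)
        with path.open("wb") as fh:
            pickle.dump(result, fh)
        return result
    return wrapper


calls = []  # records every real computation, for demonstration only


@cached
def split_data(n_rows, test_ratio):
    calls.append((n_rows, test_ratio))
    n_test = int(n_rows * test_ratio)
    return {"train": n_rows - n_test, "test": n_test}


first = split_data(n_rows=10, test_ratio=0.3)   # computed and saved
second = split_data(n_rows=10, test_ratio=0.3)  # loaded from cache
third = split_data(n_rows=10, test_ratio=0.5)   # new parameters, recomputed
```

Changing any parameter produces a new cache key, so only the affected task is recomputed.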

Related files and commands:
- Experiment pipeline definition: `mlflow_experiment.py`
- Metrics summary in the MLflow UI: run `mlflow ui` in the shell
- Notebook runs with different parameters via Papermill: `mlflow_experiment_papermill.ipynb`
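As a rough sketch of the Papermill step, the snippet below expands a parameter grid and executes a notebook once per combination. The input notebook name `mlflow_experiment.ipynb`, the `runs` output folder, and the helper names `build_runs` and `execute_runs` are assumptions for illustration; only `papermill.execute_notebook` is the library's own call. Papermill injects the given values after the notebook cell tagged `parameters`.

```python
from itertools import product


def build_runs(grid):
    """Expand a dict of parameter lists into one parameter dict per combination."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


def execute_runs(runs, notebook="mlflow_experiment.ipynb", out_dir="runs"):
    """Execute the notebook once per parameter set (requires papermill)."""
    import os
    import papermill as pm  # imported lazily so build_runs works without it

    os.makedirs(out_dir, exist_ok=True)
    for i, params in enumerate(runs):
        # Each run writes its own executed copy of the notebook.
        pm.execute_notebook(notebook, f"{out_dir}/run_{i}.ipynb", parameters=params)


runs = build_runs({"include_country": [True, False], "test_ratio": [0.2, 0.3]})
```

Calling `execute_runs(runs)` would then produce one executed notebook per entry in `runs`.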

Limitations:
- `dutil` currently supports only the "threads" Dask scheduler

## Setup

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score, mean_absolute_error, make_scorer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold, GridSearchCV, cross_validate
import dutil.pipeline as dpipe
from loguru import logger
from pprint import pprint

import mlflow_experiment as experiment

## Experiment

The pipeline is constructed in `mlflow_experiment.py`.

In [2]:
# --- Global Notebook Parameters ---
fversion = 0
mversion = 0
include_country = True
adjust_for_country = True
target = 'e'
test_ratio = 0.3
n_folds = 2
_n_jobs = 1

In [3]:
experiment.params.update_many(dict(
    fversion=fversion,
    mversion=mversion,
    include_country=include_country,
    adjust_for_country=adjust_for_country,
    target=target,
    test_ratio=test_ratio,
    n_folds=n_folds,
    _n_jobs=_n_jobs,
))

### Linear Regression

In [4]:
with experiment.params.context(dict(
    model_name='lr',
    _model=Pipeline((
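        # fill_value only takes effect with strategy='constant'; the default
        # strategy is 'mean', so missing values are mean-imputed here
        ('t', SimpleImputer(fill_value=0)),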
        ('e', LinearRegression()),
    )),
)):
    model, results = experiment.run_experiment()

2020-11-27 12:45:39.701 | DEBUG    | dutil.pipeline._cached:dump:207 - Task load_data_2.pickle: data has been saved to cache
2020-11-27 12:45:39.701 | DEBUG    | dutil.pipeline._cached:dump:207 - Task load_data_1.pickle: data has been saved to cache
2020-11-27 12:45:39.702 | DEBUG    | dutil.pipeline._cached:dump:207 - Task load_data_3.pickle: data has been saved to cache
2020-11-27 12:45:39.703 | INFO     | dutil.pipeline._cached:new_foo:326 - Task load_data_2.pickle: data has been computed and saved to cache
2020-11-27 12:45:39.703 | INFO     | dutil.pipeline._cached:new_foo:326 - Task load_data_1.pickle: data has been computed and saved to cache
2020-11-27 12:45:39.704 | INFO     | dutil.pipeline._cached:new_foo:326 - Task load_data_3.pickle: data has been computed and saved to cache
2020-11-27 12:45:39.706 | DEBUG    | dutil.pipeline._cached:__cached_hash__:225 - Task load_data_1.pickle: hash has been computed from data
2020-11-27 12:45:39.707 | DEBUG    | dutil.pipeline._cached:__

INFO: 'mlpipe_' does not exist. Creating a new experiment
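The message above is what MLflow prints when an experiment is created on first use. How `mlflow_experiment.py` logs its runs is not shown here, so the following is only a sketch under that caveat: `flatten_cv_results` and `log_run` are hypothetical helpers, the scores are made-up placeholders, and the MLflow calls used (`set_experiment`, `start_run`, `log_params`, `log_metrics`) are the library's standard tracking functions.

```python
def flatten_cv_results(results):
    """Turn per-fold score sequences into flat metric names such as 'test_r2_fold0'."""
    flat = {}
    for name, scores in results.items():
        for fold, score in enumerate(scores):
            flat[f"{name}_fold{fold}"] = float(score)
    return flat


def log_run(experiment_name, params, results):
    """Log one run's parameters and per-fold metrics to MLflow (requires mlflow)."""
    import mlflow  # imported lazily so the helper above works without it

    mlflow.set_experiment(experiment_name)  # creates the experiment if it is missing
    with mlflow.start_run():
        mlflow.log_params(params)
        mlflow.log_metrics(flatten_cv_results(results))


# Placeholder scores, not results from this notebook.
metrics = flatten_cv_results({"test_r2": [0.5, 0.25], "test_mae": [-1.0, -2.0]})
```

Each call to `log_run` adds one row to the table shown by `mlflow ui`.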


In [5]:
pprint(model)
print()
pprint(results)

Pipeline(steps=[('t', SimpleImputer(fill_value=0)), ('e', LinearRegression())])

{'fit_time': array([0.00814676, 0.00339603]),
 'score_time': array([0.00221658, 0.0025723 ]),
 'test_mae': array([-1.66666667, -4.        ]),
 'test_r2': array([-10.55555556, -15.        ]),
 'train_mae': array([-2.22044605e-16, -1.11022302e-16]),
 'train_r2': array([1., 1.])}
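The per-fold arrays are easier to read once averaged. The sketch below copies the scores printed above into plain lists and averages them; `summarize` is a hypothetical helper. The MAE values are negative, which matches scikit-learn's higher-is-better scorer convention of negating errors; the scorer definition itself lives in `mlflow_experiment.py` and is not shown, so treating the sign this way is an assumption.

```python
from statistics import mean

# Per-fold scores copied from the output above (fit and score times omitted).
results = {
    "test_mae": [-1.66666667, -4.0],
    "test_r2": [-10.55555556, -15.0],
    "train_mae": [-2.22044605e-16, -1.11022302e-16],
    "train_r2": [1.0, 1.0],
}


def summarize(results):
    """Average each metric over folds, reporting MAE as a positive error."""
    summary = {}
    for name, scores in results.items():
        avg = mean(scores)
        # Negated-error convention: flip the sign back for readability.
        summary[name] = -avg if name.endswith("_mae") else avg
    return summary


summary = summarize(results)
# Difference between train and test R^2; a large gap suggests overfitting.
gap = summary["train_r2"] - summary["test_r2"]
```

A perfect training fit combined with strongly negative test R² indicates that the model does not generalize on this small dataset.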


### Random Forests

In [6]:
with experiment.params.context(dict(
    model_name='rf',
    _model=Pipeline((
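        # fill_value only takes effect with strategy='constant'; the default
        # strategy is 'mean', so missing values are mean-imputed here
        ('t', SimpleImputer(fill_value=0)),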
        ('e', RandomForestRegressor()),
    )),
)):
    model, results = experiment.run_experiment()

2020-11-27 12:45:45.950 | INFO     | dutil.pipeline._cached:new_foo:311 - Task load_data_1.pickle: skip (cache exists)
2020-11-27 12:45:45.951 | INFO     | dutil.pipeline._cached:new_foo:311 - Task load_data_2.pickle: skip (cache exists)
2020-11-27 12:45:45.952 | INFO     | dutil.pipeline._cached:new_foo:311 - Task load_data_3.pickle: skip (cache exists)
2020-11-27 12:45:45.955 | INFO     | dutil.pipeline._cached:new_foo:311 - Task mlpipe_make_x_y_df_1|9686102406375020340_df_2|11371928299187938700_df_3|16702583620649360787_include_country|True_adjust_for_country|True_target|e_fversion|0.pickle: skip (cache exists)
2020-11-27 12:45:45.957 | INFO     | dutil.pipeline._cached:new_foo:311 - Task mlpipe_split_x_y_X|68681996546272477_y|16896747200748431878_test_ratio|0.3.pickle: skip (cache exists)
2020-11-27 12:45:45.959 | DEBUG    | dutil.pipeline._cached:load:193 - Task mlpipe_split_x_y_X|68681996546272477_y|16896747200748431878_test_ratio|0.3.pickle: data has been loaded from cache
2020-

In [7]:
with experiment.params.context(dict(
    model_name='rf',
    _model=RandomForestRegressor(),
)):
    print(experiment.params.get_params())

{'fversion': 0, 'mversion': 0, 'include_country': True, 'adjust_for_country': True, 'target': 'e', 'test_ratio': 0.3, 'n_folds': 2, 'model_name': 'rf'}
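In the output above, `_model` and `_n_jobs` are missing from `get_params()`, and `model_name` is present only inside the `context` block. A toy re-creation of that behavior is sketched below. The `Params` class is hypothetical and is not `dutil`'s code; it only assumes that names with a leading underscore are treated as private and that `context` restores the previous values on exit.

```python
from contextlib import contextmanager


class Params:
    """Toy parameter store; names starting with '_' are treated as private."""

    def __init__(self):
        self._values = {}

    def update_many(self, values):
        self._values.update(values)

    @contextmanager
    def context(self, overrides):
        """Apply overrides temporarily and restore the previous state on exit."""
        saved = dict(self._values)
        self._values.update(overrides)
        try:
            yield self
        finally:
            self._values = saved

    def get_params(self):
        """Return only the public parameters (no leading underscore)."""
        return {k: v for k, v in self._values.items() if not k.startswith("_")}


params = Params()
params.update_many({"n_folds": 2, "_n_jobs": 1})
with params.context({"model_name": "rf", "_model": object()}):
    inside = params.get_params()   # public values, including the override
outside = params.get_params()      # override is gone after the block
```

One plausible reason for such a split is that execution details like a job count or a model object should not affect cache keys or logged parameters.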
