# MLflow Experiment

This notebook shows how to run multiple ML experiments efficiently using `dutil` and `mlflow`:
- make the dependencies between pipeline tasks explicit
- run the pipeline with different parameters
- record and visualize metrics from multiple runs
- cache intermediate data and models

About:
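- The pipeline is defined in `mlflow_experiment.py`
- To see a summary of the MLflow metrics, run `mlflow ui` in a shell from this folder
- To run the notebook with papermill, pass each parameter with its own `-p name value` flag, for example `papermill mlflow_experiment.ipynb output.ipynb -p fversion 1 -p adjust_for_country 0` (`output.ipynb` is an arbitrary name for the executed copy)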
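
Running the pipeline with different parameters usually means sweeping over a small grid. The following is a minimal sketch that builds one papermill command per parameter combination. The helper `papermill_commands` and the `out_*.ipynb` naming scheme are hypothetical and not part of this project; the sketch also assumes the notebook's parameter cell carries the `parameters` tag that papermill looks for.

```python
import itertools
import shlex


def papermill_commands(notebook, grid):
    """Build one papermill command per combination of parameter values."""
    names = sorted(grid)
    commands = []
    for values in itertools.product(*(grid[n] for n in names)):
        suffix = "_".join(f"{n}{v}" for n, v in zip(names, values))
        # Hypothetical naming scheme for the executed copies of the notebook.
        cmd = ["papermill", notebook, f"out_{suffix}.ipynb"]
        for n, v in zip(names, values):
            # papermill expects each parameter as "-p NAME VALUE".
            cmd += ["-p", n, str(v)]
        commands.append(cmd)
    return commands


grid = {"fversion": [0, 1], "adjust_for_country": [0, 1]}
commands = papermill_commands("mlflow_experiment.ipynb", grid)
for cmd in commands:
    print(shlex.join(cmd))
```

Each command list can be passed to `subprocess.run` as is. Because intermediate results are cached, runs that share upstream parameters should be able to reuse the tasks that have already been computed.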

Limitations:
- Currently, `dutil` supports only the "threads" Dask scheduler

## Setup

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score, mean_absolute_error, make_scorer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold, GridSearchCV, cross_validate
import dutil.pipeline as dpipe
from loguru import logger
from pprint import pprint

import mlflow_experiment as experiment

## Experiment

The pipeline is constructed in `mlflow_experiment.py`

In [2]:
# --- Global Notebook Parameters ---
fversion = 0
mversion = 0
include_c = False
adjust_for_country = True
target = 'e'
test_ratio = 0.3
n_folds = 2

In [3]:
experiment.params.update_many(dict(
    fversion=fversion,
    mversion=mversion,
    include_c=include_c,
    adjust_for_country=adjust_for_country,
    target=target,
    test_ratio=test_ratio,
    n_folds=n_folds,
))

### Linear Regression

In [4]:
with experiment.params.context(dict(
    model_name='lr',
    _model=Pipeline((
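        # Note: fill_value only takes effect with strategy='constant';
        # with the default strategy='mean' it is ignored.
        ('t', SimpleImputer(fill_value=0)),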
        ('e', LinearRegression()),
    )),
)):
    model, results = experiment.run_experiment()

2020-11-25 11:08:58.996 | DEBUG    | dutil.pipeline._cached:dump:200 - Task load_data_3.pickle: data has been saved to cache
2020-11-25 11:08:58.997 | DEBUG    | dutil.pipeline._cached:dump:200 - Task load_data_1.pickle: data has been saved to cache
2020-11-25 11:08:58.998 | DEBUG    | dutil.pipeline._cached:dump:200 - Task load_data_2.pickle: data has been saved to cache
2020-11-25 11:08:58.999 | INFO     | dutil.pipeline._cached:new_foo:319 - Task load_data_3.pickle: data has been computed and saved to cache
2020-11-25 11:08:59.000 | INFO     | dutil.pipeline._cached:new_foo:319 - Task load_data_1.pickle: data has been computed and saved to cache
2020-11-25 11:08:59.002 | INFO     | dutil.pipeline._cached:new_foo:319 - Task load_data_2.pickle: data has been computed and saved to cache
2020-11-25 11:08:59.004 | DEBUG    | dutil.pipeline._cached:__cached_hash__:218 - Task load_data_1.pickle: hash has been computed from data
2020-11-25 11:08:59.004 | DEBUG    | dutil.pipeline._cached:__

INFO: 'mlpipe_' does not exist. Creating a new experiment


In [5]:
pprint(model)
print()
pprint(results)

Pipeline(steps=[('t', SimpleImputer(fill_value=0)), ('e', LinearRegression())])

{'fit_time': array([0.00727439, 0.00400782]),
 'score_time': array([0.00246429, 0.00274873]),
 'test_mae': array([-2.25, -4.  ]),
 'test_r2': array([-19.5, -15. ]),
 'train_mae': array([-2.22044605e-16, -1.11022302e-16]),
 'train_r2': array([1., 1.])}
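
The negative `test_mae` values above are consistent with scikit-learn's convention that scorers report higher-is-better values, so error metrics come back negated. The sketch below averages the per-fold values, flips the sign of the MAE entries, and shows one way such a summary could be recorded with MLflow. The helpers `summarize_cv` and `log_to_mlflow` are hypothetical; how `mlflow_experiment.py` actually logs its metrics is not shown in this notebook.

```python
from statistics import mean


def summarize_cv(results):
    """Average each per-fold metric; flip the sign of negated error metrics."""
    summary = {}
    for key, values in results.items():
        avg = mean(float(v) for v in values)
        # Assumption: '*_mae' entries come from a scorer created with
        # greater_is_better=False, which reports the negated error.
        if key.endswith("_mae"):
            avg = -avg
        summary[key] = avg
    return summary


def log_to_mlflow(params, summary, experiment_name):
    """Record one run; mlflow is imported lazily so summarize_cv works without it."""
    import mlflow

    mlflow.set_experiment(experiment_name)
    with mlflow.start_run():
        mlflow.log_params(params)
        mlflow.log_metrics(summary)


# Per-fold values copied from the output above (timings omitted).
results = {
    "test_mae": [-2.25, -4.0],
    "test_r2": [-19.5, -15.0],
    "train_r2": [1.0, 1.0],
}
summary = summarize_cv(results)
print(summary)
```

Calling `log_to_mlflow(params, summary, experiment_name)` with a plain dict of parameters would then make the run appear in `mlflow ui` next to the others.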


### Random Forests

In [7]:
with experiment.params.context(dict(
    model_name='rf',
    _model=Pipeline((
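        # Note: fill_value only takes effect with strategy='constant';
        # with the default strategy='mean' it is ignored.
        ('t', SimpleImputer(fill_value=0)),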
        ('e', RandomForestRegressor()),
    )),
)):
    model, results = experiment.run_experiment()

2020-11-25 11:09:21.658 | INFO     | dutil.pipeline._cached:new_foo:304 - Task load_data_1.pickle: skip (cache exists)
2020-11-25 11:09:21.659 | INFO     | dutil.pipeline._cached:new_foo:304 - Task load_data_2.pickle: skip (cache exists)
2020-11-25 11:09:21.660 | INFO     | dutil.pipeline._cached:new_foo:304 - Task load_data_3.pickle: skip (cache exists)
2020-11-25 11:09:21.662 | INFO     | dutil.pipeline._cached:new_foo:304 - Task mlpipe_make_x_y_df_1|9686102406375020340_df_2|1608211821468367081_df_3|16702583620649360787_include_c|False_adjust_for_country|True_target|e_fversion|0.pickle: skip (cache exists)
2020-11-25 11:09:21.663 | INFO     | dutil.pipeline._cached:new_foo:304 - Task mlpipe_split_x_y_X|17713854509878100511_y|16896747200748431878_test_ratio|0.3.pickle: skip (cache exists)
2020-11-25 11:09:21.667 | DEBUG    | dutil.pipeline._cached:load:186 - Task mlpipe_split_x_y_X|17713854509878100511_y|16896747200748431878_test_ratio|0.3.pickle: data has been loaded from cache
2020-
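
The log above shows every upstream task being skipped because a cache file keyed on the task name and its inputs already exists. The following is a minimal standard-library sketch of that idea, not `dutil`'s implementation: the helper `cached`, the toy tasks, and the key format are all hypothetical. Dependencies are made explicit by passing one task's output as another task's input.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def cached(cache_dir, name, func, **kwargs):
    """Run func(**kwargs) once per distinct argument set; reuse the pickle after that."""
    # Key the cache file on the task name plus a hash of the pickled arguments.
    digest = hashlib.sha256(pickle.dumps(sorted(kwargs.items()))).hexdigest()[:16]
    path = Path(cache_dir) / f"{name}_{digest}.pickle"
    if path.exists():
        print(f"Task {name}: skip (cache exists)")
        return pickle.loads(path.read_bytes())
    result = func(**kwargs)
    path.write_bytes(pickle.dumps(result))
    print(f"Task {name}: data has been computed and saved to cache")
    return result


calls = []  # records how often the expensive task really runs


def load_data(n):
    calls.append(n)
    return list(range(n))


def split(data, test_ratio):
    cut = round(len(data) * (1 - test_ratio))
    return data[:cut], data[cut:]


with tempfile.TemporaryDirectory() as tmp:
    for _ in range(2):  # the second pass is served from the cache
        data = cached(tmp, "load_data", load_data, n=10)
        train, test = cached(tmp, "split", split, data=data, test_ratio=0.3)
```

Changing `test_ratio` here would recompute only `split`, because the key for `load_data` is unchanged. That is the property that makes repeated experiments cheap.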

In [8]:
with experiment.params.context(dict(
    model_name='rf',
    _model=RandomForestRegressor(),
)):
    print(experiment.params.get_params())

{'fversion': 0, 'mversion': 0, 'include_c': False, 'adjust_for_country': True, 'target': 'e', 'test_ratio': 0.3, 'n_folds': 2, 'model_name': 'rf'}
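
The output above suggests two behaviours: `params.context` adds `model_name` only for the duration of the `with` block, and the underscore-prefixed `_model` entry does not appear in `get_params()`, presumably so that model objects stay out of the logged parameters. The sketch below shows how such a parameter store could be built. The class `Params` is a hypothetical illustration under those assumptions, not the implementation used in `mlflow_experiment.py`.

```python
from contextlib import contextmanager


class Params:
    """Hypothetical parameter store with temporary, scoped overrides."""

    def __init__(self):
        self._values = {}

    def update_many(self, new_values):
        self._values.update(new_values)

    @contextmanager
    def context(self, overrides):
        saved = dict(self._values)
        self._values.update(overrides)
        try:
            yield self
        finally:
            # Restore the previous state even if the body raises.
            self._values = saved

    def get_params(self):
        # Hide underscore-prefixed entries such as model objects.
        return {k: v for k, v in self._values.items() if not k.startswith("_")}


params = Params()
params.update_many(dict(fversion=0, target="e"))
with params.context(dict(model_name="rf", _model=object())):
    inside = params.get_params()
outside = params.get_params()
print(inside)
print(outside)
```

Scoping overrides this way keeps each model's settings from leaking into the next experiment in the notebook.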
