For this hands-on, we will use the [Power Plant dataset](https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant), where the goal is to predict the net hourly electrical energy output (EP; the `PE` column in the data below) of a plant.

In [7]:
from datetime import datetime

import mlflow
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

pd.set_option("display.max_columns", None)

In [8]:
!pip freeze

alembic==1.4.1
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
asttokens==2.0.5
attrs==21.4.0
backcall==0.2.0
beautifulsoup4==4.11.1
bleach==5.0.0
certifi==2021.10.8
cffi==1.15.0
charset-normalizer==2.0.12
click==8.1.3
cloudpickle==2.0.0
colorama==0.4.4
databricks-cli==0.16.6
debugpy==1.6.0
decorator==5.1.1
defusedxml==0.7.1
docker==5.0.3
entrypoints==0.4
executing==0.8.3
fastjsonschema==2.15.3
Flask==2.1.2
gitdb==4.0.9
GitPython==3.1.27
greenlet==1.1.2
idna==3.3
importlib-metadata==4.11.3
ipykernel==6.13.0
ipython==8.3.0
ipython-genutils==0.2.0
ipywidgets==7.7.0
itsdangerous==2.1.2
jedi==0.18.1
Jinja2==3.1.2
joblib==1.1.0
jsonschema==4.5.1
jupyter==1.0.0
jupyter-client==7.3.1
jupyter-console==6.4.3
jupyter-core==4.10.0
jupyterlab-pygments==0.2.2
jupyterlab-widgets==1.1.0
Mako==1.2.0
MarkupSafe==2.1.1
matplotlib-inline==0.1.3
mistune==0.8.4
mlflow==1.20.2
nbclient==0.6.3
nbconvert==6.5.0
nbformat==5.4.0
nest-asyncio==1.5.5
notebook==6.4.11
numpy==1.22.3
oauthlib==3.2.0
packaging==21.3

In [2]:
df = pd.read_csv("../data/power_plants.csv")
df.head()

Unnamed: 0,AT,V,AP,RH,PE
0,14.96,41.76,1024.07,73.17,463.26
1,25.18,62.96,1020.04,59.08,444.37
2,5.11,39.4,1012.16,92.14,488.56
3,20.86,57.32,1010.24,76.64,446.48
4,10.82,37.5,1009.23,96.62,473.9


# MLflow Tracking

## Model training

In [9]:
def train_model(train_df, max_depth=2):
    # Split data
    X = train_df[["AT", "V", "AP", "RH"]]
    y = train_df["PE"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Fit model
    model = RandomForestRegressor(max_depth=max_depth)
    model.fit(X_train, y_train)

    # Evaluate the model
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    rmse = mean_squared_error(y_test, y_pred, squared=False)
    print(f"Test mse = {mse}, Test RMSE = {rmse}, Random forest max depth = {max_depth}")
    return model, mse, rmse

In [6]:
_ = train_model(df, max_depth=2)

Test mse = 37.70306642963242, Test RMSE = 6.140282276054776, Random forest max depth = 2
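
The RMSE reported above is the square root of the MSE, so it is expressed in the same unit as `PE`. A quick check in plain Python, using the two values printed by the previous cell:

```python
import math

# Values copied from the output of the previous cell
mse = 37.70306642963242
rmse = 6.140282276054776

# RMSE is the square root of MSE
rmse_from_mse = math.sqrt(mse)
print(rmse_from_mse)
print(math.isclose(rmse_from_mse, rmse, rel_tol=1e-6))
```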


- Test different values of `max_depth` for the random forest

In [10]:
for max_depth in range(2, 7, 2):
    _ = train_model(df, max_depth=max_depth)

Test mse = 37.79798798984541, Test RMSE = 6.148006830660276, Random forest max depth = 2
Test mse = 19.940726005759334, Test RMSE = 4.465504003554283, Random forest max depth = 4
Test mse = 14.57855904194579, Test RMSE = 3.8181879264836858, Random forest max depth = 6


## Experiment tracking

### Some vocabulary:
- **run**: a single execution of the model training code. Each run can record different kinds of information (model parameters, metrics, tags, artifacts, etc.).
- **experiment**: the primary unit of organization and access control for MLflow runs; all MLflow runs belong to an experiment. Experiments let you visualize, search for, and compare runs, as well as download run artifacts and metadata for analysis in other tools.

In [11]:
!ls

mlflow_tracking_hands_on.ipynb


In [12]:
experiment_name = "ep_prediction_with_random_forest"
mlflow.set_experiment(experiment_name)

INFO: 'ep_prediction_with_random_forest' does not exist. Creating a new experiment


In [None]:
!ls

### Basic logging
- Log the model hyper-parameters, the metrics and the model itself

In [15]:
def train_model(train_df, max_depth=2):
    with mlflow.start_run():
        # Split data
        X = train_df[["AT", "V", "AP", "RH"]]
        y = train_df["PE"]
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

        # Fit model
        model = RandomForestRegressor(max_depth=max_depth)
        model.fit(X_train, y_train)
        ## mlflow: log model & its hyper-parameters
        mlflow.log_param("max_depth", max_depth)
        mlflow.sklearn.log_model(model, "model")

        # Evaluate the model
        y_pred = model.predict(X_test)
        mse = mean_squared_error(y_test, y_pred)
        rmse = mean_squared_error(y_test, y_pred, squared=False)
        ## mlflow: log metrics
        mlflow.log_metrics({"testing_mse": mse, "testing_rmse": rmse})
        print(f"Test mse = {mse}, Test RMSE = {rmse}, Random forest max depth = {max_depth}")

- Run the function with mlflow tracking

In [14]:
for max_depth in range(2, 7, 2):
    _ = train_model(df, max_depth=max_depth)

Test mse = 38.43902347797478, Test RMSE = 6.199921247723617, Random forest max depth = 2
Test mse = 20.083811340187975, Test RMSE = 4.481496551397532, Random forest max depth = 4
Test mse = 14.549285665756692, Test RMSE = 3.81435258802286, Random forest max depth = 6


### Visualize experiments with MLflow tracking UI

To open the [MLflow Tracking UI](https://www.mlflow.org/docs/latest/tracking.html#tracking-ui), either run ```mlflow ui``` (it must be executed from the *notebooks* folder) or start an *mlflow server* (used in the following section).

### Where mlflow saves the data

#### Some vocabulary:
- **Backend store**: for MLflow entities (runs, parameters, metrics, tags, notes, metadata, etc)
- **Artifact store**: for artifacts (files, models, images, in-memory objects, etc)
- For more information, [check the official documentation](https://www.mlflow.org/docs/latest/tracking.html#where-runs-are-recorded)

#### Without prior configuration
- When no prior configuration is set, MLflow creates an *mlruns* folder where the data is saved

In [16]:
!ls

mlflow_tracking_hands_on.ipynb
mlruns


- MLflow created a new *mlruns* folder where it stores the information for each run

In [17]:
!tree mlruns

Folder PATH listing for volume Windows
Volume serial number is 96A4-2B2A
C:\USERS\SHUDA\PROJECT\MLFLOW_HANDS_ON\NOTEBOOKS\MLRUNS
+---.trash
+---0
+---1
    +---6d3d81e179a64bdf979601b9298f2095
    ¦   +---artifacts
    ¦   ¦   +---model
    ¦   +---metrics
    ¦   +---params
    ¦   +---tags
    +---9eb04ffe0ae34bc98c2fad2e75e8dfb5
    ¦   +---artifacts
    ¦   ¦   +---model
    ¦   +---metrics
    ¦   +---params
    ¦   +---tags
    +---e7f57382489d4b8b8a449dd6c7dcae8c
        +---artifacts
        ¦   +---model
        +---metrics
        +---params
        +---tags
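
The tree above shows one folder per experiment id and, inside it, one folder per run id with `artifacts`, `metrics`, `params` and `tags` subfolders. The sketch below reads the parameters of a run straight from such a folder. It assumes that each file in `params` is named after the parameter and holds its value as text; this is an assumption about the local file store layout, which is an internal detail and may differ between MLflow versions. The `read_params` helper and the demo folder are hypothetical and are only built here so that the snippet runs on its own.

```python
import tempfile
from pathlib import Path


def read_params(run_dir):
    """Return {param_name: value} from a run folder (assumed file-store layout)."""
    params_dir = Path(run_dir) / "params"
    return {p.name: p.read_text().strip() for p in sorted(params_dir.iterdir()) if p.is_file()}


# Build a tiny stand-in for mlruns/<experiment_id>/<run_id>/ (demo data only)
demo_root = Path(tempfile.mkdtemp())
run_dir = demo_root / "1" / "demo_run_id"
for sub in ("artifacts", "metrics", "params", "tags"):
    (run_dir / sub).mkdir(parents=True)
(run_dir / "params" / "max_depth").write_text("4")

params = read_params(run_dir)
print(params)
```

For anything beyond a quick inspection, the tracking API (for example `mlflow.search_runs`, used later in this notebook) is the safer way to read run data.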


#### With prior configuration
- Set the **Backend store** to an SQLite database located at */tmp/mlruns.db* and the **Artifact store** to a folder located at */tmp/mlruns*. For more information on the other available options (S3, Blob storage, etc.), check [the official documentation](https://www.mlflow.org/docs/latest/tracking.html#where-runs-are-recorded).
- To run the MLflow server, you need to execute the following command in your terminal:
```mlflow server --backend-store-uri sqlite:////tmp/mlruns.db --default-artifact-root /tmp/mlruns```
- Set the tracking URI in the notebook: ```mlflow.set_tracking_uri('http://127.0.0.1:5000')```
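
The four slashes in `sqlite:////tmp/mlruns.db` are not a typo: the URI follows the SQLAlchemy convention of `sqlite:///` followed by the database path, and the path is absolute here, so it starts with another `/`. A small sketch (the `sqlite_uri` helper is hypothetical, not part of MLflow):

```python
def sqlite_uri(db_path):
    """Build an SQLite URI in the SQLAlchemy style: 'sqlite:///' + path."""
    return f"sqlite:///{db_path}"


absolute = sqlite_uri("/tmp/mlruns.db")  # absolute path: four slashes in total
relative = sqlite_uri("mlruns.db")       # relative path: three slashes
print(absolute)
print(relative)
```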

In [18]:
mlflow.set_tracking_uri('http://127.0.0.1:5000')

In [20]:
# Create the experiment in the new database
experiment_name = "ep_prediction_with_random_forest"
mlflow.set_experiment(experiment_name)

### Logging with autolog

- Autolog logs the model parameters, training metrics, model binary, etc. **BUT not the test metrics**; they need to be logged manually

In [21]:
def train_model(train_df, max_depth=2):
    training_timestamp = datetime.now().strftime('%Y-%m-%d, %H:%M:%S')
    with mlflow.start_run(run_name=f"model_{training_timestamp}"):

        mlflow.autolog()
        
        # Split data
        X = train_df[["AT", "V", "AP", "RH"]]
        y = train_df["PE"]
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

        # Fit model
        model = RandomForestRegressor(max_depth=max_depth)
        model.fit(X_train, y_train)

        # Evaluate the model
        y_pred = model.predict(X_test)
        mse = mean_squared_error(y_test, y_pred)
        rmse = mean_squared_error(y_test, y_pred, squared=False)
        ## mlflow: log metrics
        mlflow.log_metrics({"testing_mse": mse, "testing_rmse": rmse})
        print(f"Test mse = {mse}, Test RMSE = {rmse}, Random forest max depth = {max_depth}")

In [None]:
for max_depth in range(2, 7, 2):
    _ = train_model(df, max_depth=max_depth)


### Search runs

- [In the UI directly](https://www.mlflow.org/docs/latest/search-syntax.html#search)
- [Programmatically with search_runs](https://www.mlflow.org/docs/latest/search-syntax.html#programmatically-searching-runs)

- Get the id of the experiment where we want to search runs

In [None]:
mlflow.get_experiment_by_name(experiment_name)

In [None]:
experiment_id = mlflow.get_experiment_by_name(experiment_name).experiment_id
experiment_id

- Get all runs for the experiment

In [None]:
mlflow.search_runs(experiment_id)

- Filter runs by max_depth and mse and order them by mse

In [None]:
max_depth = 4
mlflow.search_runs(
    experiment_id,
    filter_string=f"params.max_depth = '{max_depth}' AND metrics.testing_mse <= 40",
    order_by=['metrics.testing_mse asc']
)
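
In the filter string, the parameter value is quoted while the metric threshold is not: MLflow stores parameters as strings and metrics as numbers. A minimal sketch of a helper that builds the same filter (the `build_filter` name is made up for this example):

```python
def build_filter(max_depth, max_mse):
    """Build a search_runs filter: params compare as quoted strings, metrics as numbers."""
    return f"params.max_depth = '{max_depth}' AND metrics.testing_mse <= {max_mse}"


filter_string = build_filter(max_depth=4, max_mse=40)
print(filter_string)
```

The result can be passed as the `filter_string` argument of `mlflow.search_runs`.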

### Load a saved model

- [More information on the other formats of model_uri](https://www.mlflow.org/docs/latest/python_api/mlflow.sklearn.html#mlflow.sklearn.load_model)

#### With the result of search_runs

In [None]:
run = mlflow.search_runs(
    experiment_id,
    filter_string=f"params.max_depth = '{max_depth}' AND metrics.testing_mse <= 30",
    order_by=["metrics.testing_mse asc"]
).iloc[0]
run

In [None]:
run.artifact_uri

In [None]:
model = mlflow.sklearn.load_model(model_uri=f"{run.artifact_uri}/model")
model
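
Besides the artifact path used above, `load_model` also accepts a `runs:/<run_id>/<path>` URI, where the path is relative to the run's artifacts. A minimal sketch of building it (the `runs_model_uri` helper is hypothetical; the run id is one of the ids from the *mlruns* listing earlier and is used purely as an example value):

```python
def runs_model_uri(run_id, artifact_path="model"):
    """Build a 'runs:/<run_id>/<artifact_path>' model URI."""
    return f"runs:/{run_id}/{artifact_path}"


model_uri = runs_model_uri("6d3d81e179a64bdf979601b9298f2095")
print(model_uri)
```

With the `run` row returned by `search_runs`, this would be `runs_model_uri(run.run_id)`, and the result can be passed to `mlflow.sklearn.load_model`.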

In [None]:
model.predict(df[:5][["AT", "V", "AP", "RH"]])