<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Logging-Machine-Learning-Models-with-mlflow" data-toc-modified-id="Logging-Machine-Learning-Models-with-mlflow-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Logging Machine Learning Models with mlflow</a></span><ul class="toc-item"><li><span><a href="#Access-Data" data-toc-modified-id="Access-Data-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Access Data</a></span></li><li><span><a href="#Understand-Data" data-toc-modified-id="Understand-Data-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Understand Data</a></span></li><li><span><a href="#Preprocess-Data" data-toc-modified-id="Preprocess-Data-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Preprocess Data</a></span></li><li><span><a href="#Modelling" data-toc-modified-id="Modelling-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Modelling</a></span><ul class="toc-item"><li><span><a href="#Set-mlflow-experiment" data-toc-modified-id="Set-mlflow-experiment-1.4.1"><span class="toc-item-num">1.4.1&nbsp;&nbsp;</span>Set mlflow-experiment</a></span></li><li><span><a href="#Evaluation" data-toc-modified-id="Evaluation-1.4.2"><span class="toc-item-num">1.4.2&nbsp;&nbsp;</span>Evaluation</a></span></li></ul></li></ul></li></ul></div>

# Logging Machine Learning Models with mlflow 

* Author: Johannes Maucher
* Last Update: 04.04.2022

**Goal:** This notebook demonstrates how to log Machine Learning models with [mlflow](https://mlflow.org/docs/latest/index.html). For this purpose, different [linear regression models from scikit-learn](https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression-and-classification) are trained and logged.



## Access Data

In this example, structured data is available from a .csv file. The data has been collected by a U.S. insurance company. For 1338 clients the following features are contained:
* age
* sex
* Body-Mass-Index (BMI)
* Number of children
* smoker (yes/no)
* living region
* annual charges

In [1]:
#!conda install -y matplotlib

In [2]:
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np

In [3]:
data="../Data/insurance.csv"
insurancedf=pd.read_csv(data,na_values=[" ","null"])
insurancedf.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552
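
The `na_values` argument used above tells pandas which additional strings to treat as missing values. The following is a minimal sketch on a small, made-up CSV snippet (the three rows are illustrative and not taken from `insurance.csv`). It shows how `" "` and `"null"` are read as `NaN` and how missing entries can be counted per column.

```python
import io

import pandas as pd

# Hypothetical three-row CSV, used only for illustration
csv_text = "age,bmi,region\n19,27.9,southwest\n18,null,southeast\n28, ,northwest\n"

demo_df = pd.read_csv(io.StringIO(csv_text), na_values=[" ", "null"])

# Number of missing entries per column
missing_per_column = demo_df.isna().sum()
print(missing_per_column)
```

Applied to the real data, `insurancedf.isna().sum()` gives the same kind of overview.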


## Understand Data

Typical procedures applied for data-understanding are: 
* calculation of descriptive statistics
* visualisation

Numeric features:

In [4]:
insurancedf.describe()

Unnamed: 0,age,bmi,children,charges
count,1338.0,1338.0,1338.0,1338.0
mean,39.207025,30.663397,1.094918,13270.422265
std,14.04996,6.098187,1.205493,12110.011237
min,18.0,15.96,0.0,1121.8739
25%,27.0,26.29625,0.0,4740.28715
50%,39.0,30.4,1.0,9382.033
75%,51.0,34.69375,2.0,16639.912515
max,64.0,53.13,5.0,63770.42801


Categorical Features:

In [5]:
catFeats=insurancedf.select_dtypes("object").columns
for cf in catFeats:
    print("\nFeature %s :"%cf)
    print(insurancedf[cf].value_counts())
    


Feature sex :
male      676
female    662
Name: sex, dtype: int64

Feature smoker :
no     1064
yes     274
Name: smoker, dtype: int64

Feature region :
southeast    364
southwest    325
northwest    325
northeast    324
Name: region, dtype: int64


## Preprocess Data

Non-numeric features must be transformed to a numeric representation:

In [6]:
from sklearn.preprocessing import LabelEncoder
for cf in catFeats:
    insurancedf[cf] = LabelEncoder().fit_transform(insurancedf[cf].values)

In [7]:
insurancedf.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,0,27.9,0,1,3,16884.924
1,18,1,33.77,1,0,2,1725.5523
2,28,1,33.0,3,0,2,4449.462
3,33,1,22.705,0,0,1,21984.47061
4,32,1,28.88,0,0,1,3866.8552
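
`LabelEncoder` assigns integer codes to the sorted class names, which is why `southwest` became 3 and `northwest` became 1 in the table above. The sketch below makes this mapping explicit for the `region` values. Note that integer codes impose an artificial order on nominal features. One-hot encoding, e.g. with `pd.get_dummies`, is a common alternative for linear models; it is shown here only as an option and is not part of the workflow in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

regions = ["southwest", "southeast", "southeast", "northwest", "northeast"]

enc = LabelEncoder().fit(regions)
# classes_ holds the sorted class names; the position is the integer code
mapping = {str(label): code for code, label in enumerate(enc.classes_)}
print(mapping)

# Optional alternative: one binary column per region
one_hot = pd.get_dummies(pd.Series(regions, name="region"))
print(one_hot.columns.tolist())
```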


## Modelling
In this example a model shall be learned that estimates the annual charges of a person from the other 6 features. Since the learned model shall also be evaluated, the set of all labeled data has to be split into 2 disjoint sets: one for training and one for testing.

### Set mlflow-experiment

In [8]:
#!pip install mlflow

In [9]:
import mlflow
import mlflow.sklearn

If the following code-cell is executed for the first time, a new directory `mlruns` is created. Within this directory, mlflow creates one subdirectory for the experiment `Linear Regression Experiment`. The name of this subdirectory is **the integer-index of this experiment**. The subdirectory's `meta.yaml`-file defines the experiment-name as well as the experiment-index.

As long as the experiment is not changed, all runs (models) will be stored in this experiment's directory.

In [11]:
mlflow.set_experiment("Linear Regression Experiment")

<Experiment: artifact_location='file:///Users/johannes/gitprojects/dsmmlbook/mlbook/mlflowExperiments/mlruns/1', experiment_id='1', lifecycle_stage='active', name='Linear Regression Experiment', tags={}>

In [12]:
from sklearn.model_selection import train_test_split

In [13]:
X=insurancedf.values[:,:-1]
y=insurancedf.values[:,-1]

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=234)
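
As a quick sanity check of the split, the sketch below uses a stand-in array with the same number of rows as the insurance data (1338) and the same `test_size` and `random_state` as above. The array content is an assumption made for illustration; only the sizes and the disjointness of the two sets are of interest.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: one row id per client instead of the real feature matrix
row_ids = np.arange(1338)

ids_train, ids_test = train_test_split(row_ids, test_size=0.3, random_state=234)
print(len(ids_train), len(ids_test))

# The two sets share no rows and together cover all rows
overlap = set(ids_train) & set(ids_test)
print(len(overlap))
```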

In scikit-learn a model is learned by calling the `fit(X,y)`-method of the corresponding algorithm-class. The arguments $X$ and $y$ are the matrix of input-samples and the vector of target values (here: the annual charges), respectively.

In [16]:
from sklearn.linear_model import LinearRegression, Ridge, ElasticNet
name="OLS"
linreg=LinearRegression() # 1. Model: Simple Linear Regression
#name="Ridge"
#alpha=0.5
#linreg=Ridge(alpha=alpha) # 2. Model: Ridge Regression
#name="ElasticNet"
#alpha=0.0
#l1_ratio=0.5
#linreg=ElasticNet(alpha=alpha,l1_ratio=l1_ratio)
linreg.fit(X_train,y_train)

LinearRegression()
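
To make the `fit(X,y)` contract concrete, here is a minimal sketch on synthetic data (the data and the coefficients 2 and 1 are assumptions chosen for illustration). `X` must be a two-dimensional array with one row per sample and `y` a one-dimensional array of targets. After fitting, the learned parameters are available as `coef_` and `intercept_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data following y = 2*x + 1
X_demo = np.arange(10, dtype=float).reshape(-1, 1)  # shape (10, 1)
y_demo = 2.0 * X_demo[:, 0] + 1.0                   # shape (10,)

demo_model = LinearRegression().fit(X_demo, y_demo)
print(demo_model.coef_, demo_model.intercept_)
```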

### Evaluation
Once the model has been learned it can be applied for predictions:

In [17]:
ypredTest=linreg.predict(X_test)
ypredTrain=linreg.predict(X_train)

In [18]:
for pred, target in zip(ypredTest[:5],y_test[:5]):
    print("Prediction: {0:2.2f} \t Target: {1:2.2f}".format(pred,target))

Prediction: 5266.10 	 Target: 4237.13
Prediction: 6748.86 	 Target: 9644.25
Prediction: 4008.05 	 Target: 4719.52
Prediction: 3439.80 	 Target: 21984.47
Prediction: 3654.77 	 Target: 5693.43


In [19]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [20]:
def eval_metrics(actual, pred):
    rmse = np.sqrt(mean_squared_error(actual, pred))
    mae = mean_absolute_error(actual, pred)
    r2 = r2_score(actual, pred)
    return rmse, mae, r2
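
The three metrics can also be written out directly, which helps when interpreting them: RMSE and MAE are measured in the unit of the target (here: charges), while R2 compares the model's squared error with that of always predicting the mean target. The following NumPy-only sketch uses three made-up values; the helper name `eval_metrics_manual` is hypothetical. For this input it yields MAE = 1, RMSE ≈ 1.291 and R2 = 0.375.

```python
import numpy as np


def eval_metrics_manual(actual, pred):
    """Compute RMSE, MAE and R2 from their definitions."""
    actual = np.asarray(actual, dtype=float)
    pred = np.asarray(pred, dtype=float)
    err = pred - actual
    rmse = np.sqrt(np.mean(err ** 2))                # root of the mean squared error
    mae = np.mean(np.abs(err))                       # mean absolute error
    ss_res = np.sum(err ** 2)                        # squared error of the model
    ss_tot = np.sum((actual - actual.mean()) ** 2)   # squared error of the mean predictor
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2


rmse_demo, mae_demo, r2_demo = eval_metrics_manual([3, 5, 7], [2, 5, 9])
print(rmse_demo, mae_demo, r2_demo)
```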

In [21]:
rmseTrain,maeTrain,r2Train=eval_metrics(y_train,ypredTrain)
print("  RMSE on Training: %s" % rmseTrain)
print("  MAE on Training: %s" % maeTrain)
print("  R2 on Training: %s" % r2Train)

  RMSE on Training: 5939.452942890092
  MAE on Training: 4173.832343645823
  R2 on Training: 0.7616880346385798


In [22]:
rmseTest,maeTest,r2Test=eval_metrics(y_test,ypredTest)
print("  RMSE on Test: %s" % rmseTest)
print("  MAE on Test: %s" % maeTest)
print("  R2 on Test: %s" % r2Test)

  RMSE on Test: 6290.007277198498
  MAE on Test: 4274.3718973638715
  R2 on Test: 0.7232869725193736


### Log this model to the current mlflow-experiment

The code-cell below creates and activates a new `run` if no run is currently active.
The model itself, the applied hyperparameters and the attained performance metrics are then logged to this active run within the current mlflow-experiment.
Note that executing these code-cells repeatedly within the same `run` appends the new values to the run's `metrics`-files.

In [24]:
mlflow.log_param("Modeltype", name)
#mlflow.log_param("alpha", alpha)
#mlflow.log_param("l1_ratio", l1_ratio)
mlflow.log_metric("rmse Train", rmseTrain)
mlflow.log_metric("r2 Train", r2Train)
mlflow.log_metric("mae Train", maeTrain)
mlflow.log_metric("rmse Test", rmseTest)
mlflow.log_metric("r2 Test", r2Test)
mlflow.log_metric("mae Test", maeTest)
mlflow.sklearn.log_model(linreg, "model")

ModelInfo(artifact_path='model', flavors={'python_function': {'model_path': 'model.pkl', 'loader_module': 'mlflow.sklearn', 'python_version': '3.10.4', 'env': 'conda.yaml'}, 'sklearn': {'pickled_model': 'model.pkl', 'sklearn_version': '1.0.2', 'serialization_format': 'cloudpickle'}}, model_uri='runs:/3bef5741797649429c57682c164b5baa/model', model_uuid='eaa52651b895431fa54f91ca4026cf5c', run_id='3bef5741797649429c57682c164b5baa', saved_input_example_info=None, signature_dict=None, utc_time_created='2022-04-04 14:43:30.164734')

Log data:

In [25]:
pd.DataFrame(X_train).to_csv("InsuranceTrainX.csv")
pd.DataFrame(y_train).to_csv("InsuranceTrainY.csv")
pd.DataFrame(X_test).to_csv("InsuranceTestX.csv")
pd.DataFrame(y_test).to_csv("InsuranceTestY.csv")

In [26]:
mlflow.log_artifact("InsuranceTrainX.csv")
mlflow.log_artifact("InsuranceTrainY.csv")
mlflow.log_artifact("InsuranceTestX.csv")
mlflow.log_artifact("InsuranceTestY.csv")

Get some information on the currently active run:

In [27]:
r=mlflow.active_run()
r.info

<RunInfo: artifact_uri='file:///Users/johannes/gitprojects/dsmmlbook/mlbook/mlflowExperiments/mlruns/1/3bef5741797649429c57682c164b5baa/artifacts', end_time=None, experiment_id='1', lifecycle_stage='active', run_id='3bef5741797649429c57682c164b5baa', run_uuid='3bef5741797649429c57682c164b5baa', start_time=1649083402144, status='RUNNING', user_id='johannes'>

Terminate the currently active run:

In [28]:
mlflow.end_run()

Get some information on the currently active experiment:

In [29]:
exp=mlflow.get_experiment("1")
exp

<Experiment: artifact_location='file:///Users/johannes/gitprojects/dsmmlbook/mlbook/mlflowExperiments/mlruns/1', experiment_id='1', lifecycle_stage='active', name='Linear Regression Experiment', tags={}>

In [30]:
exp.name

'Linear Regression Experiment'

In [31]:
r=mlflow.active_run() # returns None if no run is active
if r is None:
    print("There is no active run")
else:
    print(r.info)

There is no active run


## Analyse and Reload Logged Models
In this notebook it has been shown how to log scikit-learn ML experiments with mlflow. In Notebook [mlflowAnalyseSklearn.ipynb](mlflowAnalyseSklearn.ipynb) it is demonstrated how to load logged models and apply them for predictions. 

## Logging of Keras Models
mlflow supports not only scikit-learn. Models from almost all common ML-libraries, e.g. Tensorflow, Keras or PyTorch, can be logged and managed as well. In the notebook [mlflowKerasReutersClassification.ipynb](mlflowKerasReutersClassification.ipynb) a Keras model for text-classification is learned and logged with mlflow, and the notebook [mlflowAnalyseModelsKeras.ipynb](mlflowAnalyseModelsKeras.ipynb) demonstrates how to reload logged Keras models from mlflow.

## Viewing tracked Experiments

The command

```mlflow ui```

starts the mlflow tracking UI server. The command has to be entered in a terminal, in the directory that contains the `mlruns` directory. The UI can then be viewed at `http://localhost:5000` or `http://127.0.0.1:5000`.