# MLOps with datalab

## 1. Introduction

### 1.1 What is MLOps?

**MLOps** stands for `Machine Learning Operations`. It is a set of best practices that aims to increase automation and improve the efficiency of model development and deployment.

### 1.2 Why do we need MLOps? Is Git not enough?

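Putting a machine learning model into production is difficult. Git versions the code, but a model also depends on its training data, its hyperparameters and the resulting model artifacts, which need to be tracked as well. Production involves many complex aspects, such as:
- data collection/ingestion,
- data preparation (e.g. cleaning, feature engineering),
- model development,
- model training,
- model tuning,
- model deployment,
- model monitoring,
- model explainability,
- etc.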

The figure below shows the different aspects of MLOps:

![ml_technical_debt.PNG](img/ml_technical_debt.PNG)

### 1.3 ML Operations

We need to address the following MLOps principles:

- **Model tracking**: track all the elements needed to reproduce the model, such as code, hyperparameters and training data.
- **Model review**: test the model and produce a quality assurance report. Evaluate production-specific inference properties, such as model response times.
- **Model governance**: manage model versions, model artifacts and their transitions through the lifecycle (e.g. staging, production, archived).
- **Model deployment**: automate the process of deploying registered models (e.g. permissions, cluster creation, API management).
- **Model monitoring**: monitor the state of the model production server (e.g. number of requests, response time, serving data anomalies).
- **Model retraining**: create alerts and automation to take corrective action in case of **model drift** caused by differences between training and inference data or by `data evolution`.
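As a rough illustration of the drift check mentioned in the last item, the sketch below compares the mean of one feature in recent serving data against the training data and flags a shift larger than a chosen number of training standard deviations. The function name `mean_shift_detected`, the threshold and the sample values are assumptions made for this example; real monitoring setups usually rely on richer statistical tests.

```python
from statistics import mean, stdev


def mean_shift_detected(train_values, serving_values, threshold=3.0):
    """Return True when the serving mean is more than `threshold`
    training standard deviations away from the training mean."""
    train_mean = mean(train_values)
    train_std = stdev(train_values)
    if train_std == 0:
        # constant training feature: any change in the mean counts as drift
        return mean(serving_values) != train_mean
    shift = abs(mean(serving_values) - train_mean) / train_std
    return shift > threshold


# hypothetical values of one numeric feature at training time and at inference time
train_attack = [49, 62, 82, 52, 64, 84, 48, 63, 83, 30]
similar_attack = [50, 60, 80, 55, 65]
drifted_attack = [150, 160, 170, 155, 165]

print(mean_shift_detected(train_attack, similar_attack))  # no alert expected
print(mean_shift_detected(train_attack, drifted_attack))  # alert expected
```

A check like this could feed the alerting step: when it returns `True`, the retraining pipeline is triggered or a human is notified.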


### 1.4 Continuous X

- **CI (Continuous integration)**: track model code, training data (e.g. feature engineering/selection) and hyperparameter optimization.
- **CD (Continuous delivery)**: deliver not only an executable package, but also the complete pipeline describing how the model is trained.
- **CT (Continuous training)**: models need to be retrained automatically, because evolving data makes a model decay. Data validation is essential at this step, because data drift can be caused either by genuine evolution or by errors.
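To make the data validation step more concrete, here is a minimal sketch that checks a batch of new records before it is used for retraining. The column names, value ranges and the helper `validate_rows` are assumptions for illustration only; they are not taken from the tutorial's dataset schema.

```python
def validate_rows(rows, required_columns, numeric_ranges):
    """Return a list of human-readable problems found in `rows`.

    rows: list of dicts, one per record
    required_columns: column names every record must provide
    numeric_ranges: {column: (min_value, max_value)} inclusive bounds
    """
    problems = []
    for index, row in enumerate(rows):
        for column in required_columns:
            if row.get(column) is None:
                problems.append(f"row {index}: missing value for '{column}'")
        for column, (low, high) in numeric_ranges.items():
            value = row.get(column)
            if value is not None and not (low <= value <= high):
                problems.append(f"row {index}: '{column}'={value} outside [{low}, {high}]")
    return problems


# hypothetical batch of new training records
new_batch = [
    {"attack": 49, "defense": 49, "legendary": False},
    {"attack": -5, "defense": 63, "legendary": False},    # likely an ingestion error
    {"attack": 110, "defense": None, "legendary": True},  # missing value
]
problems = validate_rows(
    new_batch,
    required_columns=["attack", "defense", "legendary"],
    numeric_ranges={"attack": (0, 300), "defense": (0, 300)},
)
for problem in problems:
    print(problem)
```

If the list of problems is not empty, the retraining run can be skipped, so that an ingestion error is not mistaken for real data evolution.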

## 2. Illustrate MLOps with an application example

### 2.1 The context

If you are a Pokemon Go player, when you capture a new Pokemon, you may want to know whether it is good or not. To make this easier, we would like to train a classification model that tells us whether a Pokemon is legendary or not.

In this tutorial, we will use a random forest classifier to implement this model.

### 2.2 Prepare the environment and install the dependencies

Launch a Jupyter service in datalab.

**Don't forget to assign the admin role in the Kubernetes tab.**

![jupyter_datalab.PNG](img/jupyter_datalab.PNG)

Start a terminal in Jupyter and clone the repository:

```shell
git clone https://github.com/pengfei99/MLOPS.git
```

Then install the dependencies:

```shell
pip install -r requirements.txt
```



```python
import sys
import os

import mlflow.sklearn
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
```

# Phase 1. Train a model the old-school way

```python
# calculate the accuracy from the confusion matrix:
# correctly classified samples (the diagonal) divided by all samples
def get_model_accuracy(cf_matrix):
    diagonal_sum = cf_matrix.trace()
    sum_of_all_elements = cf_matrix.sum()
    return diagonal_sum / sum_of_all_elements


def train_model(data_url: str, n_estimator: int, max_depth: int, min_samples_split: int):
    print(f"data source: {data_url}")
    feature_data, label_data = prepare_data(data_url)
    # keep 80% of the data for training and 20% for testing
    train_X, test_X, train_y, test_y = train_test_split(feature_data, label_data, train_size=0.8, test_size=0.2,
                                                        random_state=0)

    # create a random forest classifier
    rf_clf = RandomForestClassifier(n_estimators=n_estimator, max_depth=max_depth,
                                    min_samples_split=min_samples_split,
                                    n_jobs=2, random_state=0)
    # train the model with the training data
    rf_clf.fit(train_X, train_y)
    # predict the testing data
    predicts_val = rf_clf.predict(test_X)

    # generate a confusion matrix and derive the accuracy from it
    cm = confusion_matrix(test_y, predicts_val)
    model_accuracy = get_model_accuracy(cm)
    print("RandomForest model with hyper-parameters: (n_estimator=%f, max_depth=%f, min_samples_split=%f):" % (
        n_estimator, max_depth, min_samples_split))
    print("accuracy: %f" % model_accuracy)


def prepare_data(data_url):
    # read the data as a DataFrame
    try:
        input_df = pd.read_csv(data_url, index_col=0)
    except Exception as e:
        print(f"Unable to read data from the given path, check your data location. Error: {e}")
        # re-raise: without the data, the rest of the function cannot run
        raise
    # prepare the data for the ML model: the label is the `legendary` column,
    # the features are the remaining numeric columns
    label = input_df.legendary
    feature = input_df.drop(['legendary', 'generation', 'total'], axis=1).select_dtypes(exclude=['object'])
    return feature, label
```

```python
np.random.seed(40)
# set the training data path
data_url = "https://minio.lab.sspcloud.fr/pengfei/sspcloud-demo/pokemon-cleaned.csv"

# set the hyper parameters
n_estimator = 50
max_depth = 30
min_samples_split = 2

# train the model
train_model(data_url, n_estimator, max_depth, min_samples_split)
```

```text
data source: https://minio.lab.sspcloud.fr/pengfei/sspcloud-demo/pokemon-cleaned.csv
RandomForest model with hyper-parameters: (n_estimator=50.000000, max_depth=30.000000, min_samples_split=2.000000):
accuracy: 0.925000
```
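The accuracy reported above is the trace of the confusion matrix divided by the sum of all its entries. The sketch below works this out on a small matrix in plain Python. The counts are hypothetical, chosen only so that the result matches the 0.925 printed above; the real confusion matrix of this run is not shown in the tutorial.

```python
def accuracy_from_confusion_matrix(matrix):
    """Accuracy = correctly classified samples / all samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))  # the diagonal (trace)
    total = sum(sum(row) for row in matrix)
    return correct / total


# hypothetical 2x2 matrix: rows are true labels, columns are predicted labels
# (the convention used by scikit-learn's confusion_matrix)
example_cm = [
    [140, 4],  # true "not legendary": 140 predicted correctly, 4 wrongly
    [8, 8],    # true "legendary": 8 missed, 8 predicted correctly
]
example_accuracy = accuracy_from_confusion_matrix(example_cm)
print(f"accuracy: {example_accuracy:.3f}")  # (140 + 8) / 160
```

Note that with an imbalanced label such as `legendary`, a high accuracy can hide many missed legendary Pokemon, as the second row of this hypothetical matrix shows.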


# Phase 2. Train a model with model tracking tools (CI)

In this tutorial, we use mlflow as our model tracking tool. Before you start phase 2, you need to launch the [mlflow](https://mlflow.org/) service in datalab:

![mlflow_datalab.PNG](img/mlflow_datalab.PNG)

Once the mlflow service is launched, you need to create an experiment (project) if you don't have one. The name of the experiment is important, because we need it to set up the mlflow context. The following code is an example of how to track your model and upload the information to the mlflow server:
 1. create a mlflow context
 2. track the training data source
 3. track the hyperparameters
 4. track the metric
 5. track the model binary

```python
def train_model_with_mlflow_tracking(mlflow_experiment_name: str, mlflow_run_name: str, data_url: str, n_estimator: int,
                                     max_depth: int,
                                     min_samples_split: int):
    # Step 1: prepare the data
    feature_data, label_data = prepare_data(data_url)
    train_X, test_X, train_y, test_y = train_test_split(feature_data, label_data, train_size=0.8, test_size=0.2,
                                                        random_state=0)
    # Step 2: set up the mlflow context
    mlflow.set_experiment(mlflow_experiment_name)
    with mlflow.start_run(run_name=mlflow_run_name):
        # create a random forest classifier
        rf_clf = RandomForestClassifier(n_estimators=n_estimator, max_depth=max_depth,
                                        min_samples_split=min_samples_split,
                                        n_jobs=2, random_state=0)
        # train the model with the training data
        rf_clf.fit(train_X, train_y)
        # predict the testing data
        predicts_val = rf_clf.predict(test_X)

        # generate a confusion matrix and derive the accuracy from it
        cm = confusion_matrix(test_y, predicts_val)
        model_accuracy = get_model_accuracy(cm)
        print("RandomForest model with hyper-parameters: (n_estimator=%f, max_depth=%f, min_samples_split=%f):" % (
            n_estimator, max_depth,
            min_samples_split))
        print("accuracy: %f" % model_accuracy)
        # log the data source and the model hyper-parameters to the mlflow server
        mlflow.log_param("data_url", data_url)
        mlflow.log_param("n_estimator", n_estimator)
        mlflow.log_param("max_depth", max_depth)
        mlflow.log_param("min_samples_split", min_samples_split)

        # log shap feature explanation extension. This will generate a graph of feature importance of the model
        # mlflow.shap.log_explanation(rf_clf.predict, test_X.sample(70))

        # log the model accuracy to the mlflow server
        mlflow.log_metric("model_accuracy", model_accuracy)

        # log the model to the mlflow server
        mlflow.sklearn.log_model(rf_clf, "model")
```

To make the model training more flexible, we also converted the above Jupyter notebook into a Python script. For the full code, please check [here](tutorials/pokemon/train_model.py).

We also wrote a small bash script that runs the Python script with specific environment variables. **You need to change the configuration, such as MLFLOW_TRACKING_URI, to point to your own mlflow server for the script to work.**

```shell
#! /bin/bash
export MLFLOW_S3_ENDPOINT_URL='https://minio.lab.sspcloud.fr'
export MLFLOW_TRACKING_URI='https://user-pengfei-866801.kub.sspcloud.fr/'
export MLFLOW_EXPERIMENT_NAME="pokemon"

run_name="default"
data_url="https://minio.lab.sspcloud.fr/pengfei/sspcloud-demo/pokemon-cleaned.csv"

# set the hyper parameters
n_estimator="50"
max_depth="30"
min_samples_split="2"

root_path="/home/jovyan/work/MLOPS"

python ${root_path}/tutorials/pokemon/train_model.py ${MLFLOW_EXPERIMENT_NAME} ${run_name} ${data_url} ${n_estimator} ${max_depth} ${min_samples_split}
```
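The script above passes six positional arguments to `train_model.py`. The actual script is not reproduced in this tutorial; as a hedged sketch, a script receiving these arguments might parse them with `argparse` as shown below. The function name `parse_training_args` is hypothetical.

```python
import argparse


def parse_training_args(argv=None):
    """Parse the six positional arguments passed by the launch script."""
    parser = argparse.ArgumentParser(description="Train the pokemon classifier")
    parser.add_argument("experiment_name", help="mlflow experiment name")
    parser.add_argument("run_name", help="mlflow run name")
    parser.add_argument("data_url", help="location of the training CSV file")
    parser.add_argument("n_estimator", type=int, help="number of trees in the forest")
    parser.add_argument("max_depth", type=int, help="maximum depth of each tree")
    parser.add_argument("min_samples_split", type=int, help="minimum samples needed to split a node")
    return parser.parse_args(argv)


# same values as in the bash script
args = parse_training_args([
    "pokemon",
    "default",
    "https://minio.lab.sspcloud.fr/pengfei/sspcloud-demo/pokemon-cleaned.csv",
    "50",
    "30",
    "2",
])
print(args.experiment_name, args.run_name, args.n_estimator, args.max_depth, args.min_samples_split)
```

Converting the hyperparameters with `type=int` matters because the shell always passes them as strings, while `RandomForestClassifier` expects integers.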

```shell
sh tutorials/pokemon/bash_command/local_run.sh
```

After you run the above command, all the tracking information of the model is uploaded to the target mlflow server.

The figure below shows the architecture:

![local_training_archi.png](img/local_training_archi.png)

But we still need to set up the Python `virtual environment` and git clone the code. Can we do better?

Yes, we can. mlflow offers a launching API that builds the virtual environment and fetches the code automatically.

We only need to set up two config files:
- MLproject (**This file must be at the root path of your git repo, otherwise mlflow can't run the workflow**)
- conda.yaml

Below is our [MLproject](MLproject):
```yaml
name: pokemon-legendary-estimator

conda_env: tutorials/pokemon/conda.yaml

entry_points:
  main:
    parameters:
      remote_server_uri: {type: str, default: http://pengfei.org:8000}
      experiment_name: {type: str, default: test-1}
      run_name: {type: str, default: default}
      data_url: {type: str, default: https://minio.lab.sspcloud.fr/pengfei/sspcloud-demo/pokemon-cleaned.csv}
      n_estimator: {type: int, default: 10}
      max_depth: {type: int, default: 5}
      min_samples_split: {type: int, default: 2}
    command: "python tutorials/pokemon/train_model.py {experiment_name} {run_name} {data_url} {n_estimator} {max_depth} {min_samples_split}"
```

And here is our [conda.yaml](tutorials/pokemon/conda.yaml):

```yaml
name: pokemon-legendary-estimator
channels:
  - defaults
dependencies:
  - python=3.8
  - pip
  - pip:
    - scikit-learn==1.1.2
    - mlflow>=1.28.0
    - pandas>=1.2.2
    - numpy>=1.20.1
    - shap>=0.39.0
    - matplotlib>=3.4.1
    - boto3==1.17.19
```

Now we can train a model without manually cloning the code or installing its dependencies, using the bash script below. **You need to change the configuration, such as MLFLOW_TRACKING_URI, to point to your own mlflow server for the script to work.**

```shell
#! /bin/bash
export MLFLOW_S3_ENDPOINT_URL='https://minio.lab.sspcloud.fr'
export MLFLOW_TRACKING_URI='https://user-pengfei-866801.kub.sspcloud.fr/'
export MLFLOW_EXPERIMENT_NAME="pokemon"

run_name="default"
data_url="https://minio.lab.sspcloud.fr/pengfei/sspcloud-demo/pokemon-cleaned.csv"

# set the hyper parameters
n_estimator="50"
max_depth="30"
min_samples_split="2"

mlflow run https://github.com/pengfei99/MLOPS.git -P remote_server_uri=${MLFLOW_TRACKING_URI} \
-P experiment_name=${MLFLOW_EXPERIMENT_NAME} \
-P data_url=${data_url} \
-P n_estimator=${n_estimator} -P max_depth=${max_depth} -P min_samples_split=${min_samples_split}
```
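The `-P` flags above override the defaults declared in the MLproject file, and the resulting values fill the `{placeholders}` of the `command` entry. The sketch below imitates that merge with plain string formatting. It is only an illustration of the idea, not mlflow's actual implementation, and `render_command` is a hypothetical helper.

```python
# command template and defaults copied from the MLproject file
COMMAND_TEMPLATE = (
    "python tutorials/pokemon/train_model.py {experiment_name} {run_name} "
    "{data_url} {n_estimator} {max_depth} {min_samples_split}"
)
DEFAULTS = {
    "remote_server_uri": "http://pengfei.org:8000",
    "experiment_name": "test-1",
    "run_name": "default",
    "data_url": "https://minio.lab.sspcloud.fr/pengfei/sspcloud-demo/pokemon-cleaned.csv",
    "n_estimator": 10,
    "max_depth": 5,
    "min_samples_split": 2,
}


def render_command(overrides):
    """Merge -P style overrides into the defaults and fill in the template."""
    params = {**DEFAULTS, **overrides}
    # placeholders missing from the template (remote_server_uri) are simply unused
    return COMMAND_TEMPLATE.format(**params)


# the overrides passed by the bash script; run_name is not passed, so its default applies
rendered = render_command({
    "experiment_name": "pokemon",
    "n_estimator": 50,
    "max_depth": 30,
    "min_samples_split": 2,
})
print(rendered)
```

Because the bash script does not pass `run_name`, the default value `default` from the MLproject file is used for it.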

As you can see, we let mlflow download the code and create the virtual environment for us.

```shell
sh tutorials/pokemon/bash_command/remote_run.sh
```

# Phase 3. Train many models in parallel (CI)

After Phase 2, we can track all the information needed to reproduce a model. However, comparing many hyperparameter combinations takes a long time if we can only train the models one by one. If we train the models in parallel, we can shorten the development time.
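To give an idea of how quickly the number of combinations grows, the sketch below enumerates a small hyperparameter grid; each entry could become one training task run in parallel. The candidate values and the helper `build_grid` are assumptions for illustration and are not taken from the tutorial's workflow definition.

```python
from itertools import product

# hypothetical candidate values for each hyperparameter
search_space = {
    "n_estimator": [10, 50, 100],
    "max_depth": [5, 30],
    "min_samples_split": [2, 4],
}


def build_grid(space):
    """Return one dict per combination of the candidate values."""
    names = list(space)
    combinations = product(*(space[name] for name in names))
    return [dict(zip(names, values)) for values in combinations]


grid = build_grid(search_space)
print(f"{len(grid)} models to train")
for params in grid[:3]:
    print(params)
```

With only three hyperparameters and two or three values each, there are already 12 models to train, which is why running them one after another quickly becomes the bottleneck.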

To do so, we need to launch a new service called [argo workflow](https://argoproj.github.io/argo-workflows/) in datalab:

![argo_datalab.PNG](img/argo_datalab.PNG)

When you launch the argo-workflow service, pay attention to the `service account` value: this service needs to launch new pods in the k8s cluster, which requires special rights, and the service account grants these rights.

Now let's check the workflow specification ([workflow.yaml](tutorials/pokemon/argo_workflow/workflow.yaml)).

It can be divided into three parts:
1. Workflow configuration
2. Workflow DAG planning
3. Task logic implementation of the workflow DAG

**We need to configure the workflow parameters** before launching the workflow, such as:
- minio credentials
- mlflow uri
- model git repo uri
- etc.

## 3.1 Installation of the argo workflow client

```shell
sudo sh tutorials/pokemon/bash_command/argo_deb_install.sh
```

```shell
# check the client version
argo version
```

## 3.2 Run the workflow

```shell
argo submit tutorials/pokemon/argo_workflow/workflow.yaml --watch
```

You can also check the workflow progress via the argo workflow web interface.

The figure below shows the architecture of what just happened:
![multi_model_training_archi_overview.png](img/multi_model_training_archi_overview.png)

# Phase 4. Model management

## 4.1 Model evaluation
Now that we have trained many models, we need to find the best one and deploy it into production.

mlflow allows us to compare the accuracy of models trained with different hyperparameters. The figure below is an example:

![mlflow_model_eval.PNG](img/mlflow_model_eval.PNG)
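The same comparison can be expressed in code. The sketch below picks the run with the highest `model_accuracy` from a list of run summaries. The run names and accuracy values are made up for illustration, and `pick_best_run` is a hypothetical helper; in practice the summaries would come from the mlflow server.

```python
def pick_best_run(runs, metric="model_accuracy"):
    """Return the run summary with the highest value for `metric`."""
    if not runs:
        raise ValueError("no runs to compare")
    return max(runs, key=lambda run: run["metrics"][metric])


# hypothetical run summaries (parameters and logged metric)
runs = [
    {"run_name": "rf-10-5", "params": {"n_estimator": 10, "max_depth": 5},
     "metrics": {"model_accuracy": 0.91}},
    {"run_name": "rf-50-30", "params": {"n_estimator": 50, "max_depth": 30},
     "metrics": {"model_accuracy": 0.925}},
    {"run_name": "rf-100-5", "params": {"n_estimator": 100, "max_depth": 5},
     "metrics": {"model_accuracy": 0.915}},
]

best = pick_best_run(runs)
print(best["run_name"], best["params"], best["metrics"]["model_accuracy"])
```

Accuracy is the only metric logged in this tutorial, so it is the only selection criterion here; a real selection would often also weigh properties such as response time or model size.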

## 4.2 Model delivery

After we find our target model, we can publish it to our model registry with a version number and a state:
- production
- staging
- archived

The figure below shows an example of the model registry:
![mlflow_model_version.PNG](img/mlflow_model_version.PNG)


## 4.3 Consuming the model

Once the model is published in the model registry, it can be consumed by using the mlflow API. In the code below, we fetch a registered model by its version and use it to predict a few sample rows.



```python
def prepare_sample_data():
    # get the data
    data_url = "https://minio.lab.sspcloud.fr/pengfei/sspcloud-demo/pokemon-cleaned.csv"
    input_df = pd.read_csv(data_url, index_col=0)

    # prepare sample data for testing the model:
    # 5 legendary and 5 normal pokemon, with the same feature columns as in training
    legendary_pokemon = input_df[input_df["legendary"] == True]
    legendary_pokemon_sample = legendary_pokemon.sample(5).drop(['legendary', 'generation', 'total'],
                                                                axis=1).select_dtypes(
        exclude=['object'])
    normal_pokemon = input_df[input_df["legendary"] == False]
    normal_pokemon_sample = normal_pokemon.sample(5).drop(['legendary', 'generation', 'total'], axis=1).select_dtypes(
        exclude=['object'])
    return legendary_pokemon_sample, normal_pokemon_sample


# fetch the trained model from the mlflow server by using its version
def fetch_model(server_uri: str, experiment_name: str, model_version: str):
    os.environ["MLFLOW_TRACKING_URI"] = server_uri
    # the URI has the form models:/<registered model name>/<version>; we assume here that
    # the model was registered under the same name as the experiment
    model = mlflow.pyfunc.load_model(model_uri=f"models:/{experiment_name}/{model_version}")
    return model


def test_model(server_uri: str, experiment_name: str, model_version: str):
    # step 1: prepare the sample data
    legendary_sample, normal_sample = prepare_sample_data()

    # step 2: fetch the model
    model = fetch_model(server_uri, experiment_name, model_version)
    # step 3: predict the sample data
    print(model.predict(legendary_sample))
    print(model.predict(normal_sample))
```

```python
# set up the mlflow server url and experiment name
mlflow_server_uri = "https://user-pengfei-42041.kub.sspcloud.fr/"
experiment_name = "pokemon"
version = '5'

test_model(mlflow_server_uri, experiment_name, version)
```

```text
NoCredentialsError: Unable to locate credentials
```
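`NoCredentialsError` is raised by the AWS client library (botocore) when it cannot find any credentials. In this setup the model artifacts are stored in MinIO, an S3-compatible store, so loading the model needs S3 credentials in addition to the mlflow server URI. A minimal sketch of a pre-flight check is shown below; the helper name `missing_s3_settings` is hypothetical, and depending on how your MinIO credentials are issued, a session token (`AWS_SESSION_TOKEN`) may also be required.

```python
import os

# environment variables read by the S3 client used for model artifacts
REQUIRED_S3_VARIABLES = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "MLFLOW_S3_ENDPOINT_URL",
)


def missing_s3_settings(environ=None):
    """Return the names of the required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_S3_VARIABLES if not environ.get(name)]


missing = missing_s3_settings()
if missing:
    print("Set these variables before loading the model:", ", ".join(missing))
else:
    print("S3 settings found, the model artifacts should be reachable.")
```

Running this check before `test_model` turns the credentials error into an explicit message about which variable is missing.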