# MLOps Demo

MLOps tools can help you train, compare, and select a better model.

## Step 1: Train a model without any MLOps tools

The first step is to train a simple model (supervised learning, classification; algorithm: random forest) to predict whether a Pokémon is legendary.

In [1]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

In [2]:
# calculate the accuracy from a confusion matrix:
# correct predictions (the diagonal) divided by all predictions
def get_model_accuracy(cm):
    # the parameter is named `cm` so it does not shadow sklearn's confusion_matrix
    diagonal_sum = cm.trace()
    sum_of_all_elements = cm.sum()
    return diagonal_sum / sum_of_all_elements


def train_model(data_url: str, n_estimator: int, max_depth: int, min_samples_split: int):
    print(data_url)
    feature_data, label_data = prepare_data(data_url)
    train_X, test_X, train_y, test_y = train_test_split(feature_data, label_data, train_size=0.8, test_size=0.2,
                                                        random_state=0)
    print(len(test_X))
   
    # create a random forest classifier
    rf_clf = RandomForestClassifier(n_estimators=n_estimator, max_depth=max_depth,
                                    min_samples_split=min_samples_split,
                                    n_jobs=2, random_state=0)
    # train the model with training_data
    rf_clf.fit(train_X, train_y)
    # predict testing data
    predicts_val = rf_clf.predict(test_X)

    # build the confusion matrix (rows: true labels, columns: predictions)
    cm = confusion_matrix(test_y, predicts_val)
    model_accuracy = get_model_accuracy(cm)
    print("RandomForest model (n_estimator=%f, max_depth=%f, min_samples_split=%f):" % (n_estimator, max_depth,
                                                                                        min_samples_split))
    print("accuracy: %f" % model_accuracy)


def prepare_data(data_url):
    # read the CSV into a DataFrame, using the first column as the index
    try:
        input_df = pd.read_csv(data_url, index_col=0)
    except Exception as e:
        print("Unable to read data from the given path, check your data location. Error: %s" % e)
        # re-raise: without data there is nothing to train on
        raise
    # Prepare data for the ML model: the label is the `legendary` column,
    # the features are the remaining numeric columns
    label = input_df.legendary
    feature = input_df.drop(['legendary', 'generation', 'total'], axis=1).select_dtypes(exclude=['object'])
    return feature, label


In [4]:
np.random.seed(40)
# raw data 
data_url = "https://minio.lab.sspcloud.fr/pengfei/sspcloud-demo/pokemon-cleaned.csv"

# prepare hyper parameters
n_estimator = 50
max_depth = 30
min_samples_split = 2

train_model(data_url, n_estimator, max_depth, min_samples_split)

https://minio.lab.sspcloud.fr/pengfei/sspcloud-demo/pokemon-cleaned.csv
160
RandomForest model (n_estimator=50.000000, max_depth=30.000000, min_samples_split=2.000000):
accuracy: 0.925000
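
The accuracy printed above is the share of test rows that fall on the diagonal of the confusion matrix. With 160 test rows, 0.925 corresponds to 148 correct predictions. The sketch below repeats that arithmetic on a hypothetical 2x2 matrix: its cell values are invented for illustration and are not the matrix produced by this run. It also computes recall for the legendary class, since accuracy alone can look good when one class is rare.

```python
import numpy as np

# Hypothetical confusion matrix (NOT the real one from the run above).
# Rows are true labels, columns are predictions: index 0 = not legendary, 1 = legendary.
cm = np.array([
    [145, 3],   # true "not legendary": 145 predicted correctly, 3 predicted legendary
    [9, 3],     # true "legendary": 9 missed, 3 predicted correctly
])

# Accuracy: correct predictions (diagonal) over all predictions.
accuracy = cm.trace() / cm.sum()

# Recall for the legendary class: correct legendary predictions over all true legendary rows.
legendary_recall = cm[1, 1] / cm[1].sum()

print("accuracy: %f" % accuracy)                  # accuracy: 0.925000
print("legendary recall: %f" % legendary_recall)  # legendary recall: 0.250000
```

In this made-up case the model reaches 92.5% accuracy while finding only a quarter of the legendary Pokémon, which is why it can be worth looking at the whole confusion matrix and not just its trace.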


## Step 2: Train a model with model tracking tools

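The script `remote-run.sh` points MLflow at the tracking server and the S3 (MinIO) endpoint, then runs the training project directly from its Git repository: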

```sh
#!/bin/bash
# S3-compatible (MinIO) endpoint that MLflow uses for artifact storage
export MLFLOW_S3_ENDPOINT_URL='https://minio.lab.sspcloud.fr'
# Tracking server that records parameters and metrics
export MLFLOW_TRACKING_URI='https://user-pengfei-531016.kub.sspcloud.fr/'
export MLFLOW_EXPERIMENT_NAME="pokemon"

# Run the project straight from Git, passing each parameter with -P
mlflow run https://github.com/pengfei99/mlflow-pokemon-example.git \
  -P remote_server_uri="$MLFLOW_TRACKING_URI" \
  -P experiment_name="$MLFLOW_EXPERIMENT_NAME" \
  -P data_url=https://minio.lab.sspcloud.fr/pengfei/mlflow-demo/pokemon-partial.csv \
  -P n_estimator=50 -P max_depth=30 -P min_samples_split=2
```

In [5]:
! sh ../bash_command/remote-run.sh

2022/05/23 15:33:49 INFO mlflow.projects.utils: === Fetching project from https://github.com/pengfei99/mlflow-pokemon-example.git into /tmp/tmptu08et6u ===
2022/05/23 15:33:52 INFO mlflow.utils.conda: Conda environment mlflow-00d6eb3c61cb6060bb0061cb1ebb0b8779fc3e55 already exists.
2022/05/23 15:33:52 INFO mlflow.projects.utils: === Created directory /tmp/tmplwx500qh for downloading remote URIs passed to arguments of type 'path' ===
2022/05/23 15:33:52 INFO mlflow.projects.backend.local: === Running command 'source activate mlflow-00d6eb3c61cb6060bb0061cb1ebb0b8779fc3e55 1>&2 && python pokemon.py https://user-pengfei-531016.kub.sspcloud.fr/ pokemon default https://minio.lab.sspcloud.fr/pengfei/mlflow-demo/pokemon-partial.csv 50 30 2' in run with ID 'c3e442c52e524be193626a374bd48adf' === 
87
RandomForest model (n_estimator=50.000000, max_depth=30.000000, min_samples_split=2.000000):
accuracy: 0.942529
2022/05/23 15:34:07 INFO mlflow.projects: === Run (ID 'c3e442c52e524be193626a374bd48ad

## Step 3: Train many models in parallel

In [7]:
! kubectl apply -f ../argo_workflow/workflow.yaml

workflow.argoproj.io/pokemon-model-training-workflow-v1 created
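
The content of `workflow.yaml` is not shown in this notebook. One plausible way to train many models in parallel is to fan out over a grid of hyperparameters and launch one `mlflow run` per combination. The sketch below only builds those command lines and does not execute them. The grid values and the helper `build_run_commands` are hypothetical, and the real workflow may be organised differently.

```python
from itertools import product

PROJECT_URI = "https://github.com/pengfei99/mlflow-pokemon-example.git"

# Hypothetical search grid; the values are illustrative only.
param_grid = {
    "n_estimator": [10, 50],
    "max_depth": [10, 30],
    "min_samples_split": [2, 4],
}


def build_run_commands(project_uri, grid):
    """Return one `mlflow run` argument list per hyperparameter combination."""
    names = list(grid)
    commands = []
    for values in product(*(grid[name] for name in names)):
        cmd = ["mlflow", "run", project_uri]
        for name, value in zip(names, values):
            cmd += ["-P", f"{name}={value}"]  # same -P name=value form as remote-run.sh
        commands.append(cmd)
    return commands


commands = build_run_commands(PROJECT_URI, param_grid)
for cmd in commands:
    print(" ".join(cmd))
```

Each command is independent of the others, so a scheduler such as Argo Workflows could run them as separate parallel steps.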
