# Train, compare, and register models

This notebook provides a quick overview of training ML models using the [MLRun](https://www.mlrun.org/) MLOps orchestration framework.

Make sure you have reviewed the basics in the MLRun [**Quick Start Tutorial**](./01-mlrun-basics.html).

Tutorial steps:
- [**Define an MLRun project and a training function**](#define-project)
- [**Run the function, log the artifacts and model**](#run-function)
- [**Hyper-parameter tuning and model/experiment comparison**](#hyper-param)
- [**Build and test the model serving functions**](#model-serving)


## MLRun installation and configuration

Before running this notebook, make sure the `mlrun` and `scikit-learn` packages are installed (`pip install mlrun scikit-learn~=1.0`) and that you have configured access to the MLRun service.

In [None]:
# Install MLRun if it is not already installed. Run this only once, then restart the notebook.
%pip install mlrun

<a id="define-project"></a>
## Define an MLRun project and a training function

You should create, load, or use (get) an **{ref}`MLRun Project <Projects>`** that holds all your functions and assets.

**Get or create a new project:**

The `get_or_create_project()` method tries to load the project from the MLRun DB. If the project does not exist, it creates a new one.

In [1]:
import mlrun
project = mlrun.get_or_create_project("tutorial", context="./", user_project=True)

> 2022-09-01 07:35:22,164 [info] Username was normalized to match the required pattern for project name: {'username': 'Davesh', 'normalized_username': 'davesh'}
> 2022-09-01 07:35:22,165 [info] Username was normalized to match the required pattern for project name: {'username': 'Davesh', 'normalized_username': 'davesh'}
> 2022-09-01 07:35:22,198 [info] Username was normalized to match the required pattern for project name: {'username': 'Davesh', 'normalized_username': 'davesh'}
> 2022-09-01 07:35:37,619 [info] loaded project tutorial from None or context and saved in MLRun DB
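With `user_project=True`, the current username is appended to the project name, which is why later outputs in this notebook show the project as `tutorial-davesh`. The sketch below only illustrates that naming idea. `user_project_name` is a hypothetical helper, and lowercasing is the only normalization visible in the log above; MLRun may apply further rules.

```python
def user_project_name(base: str, username: str) -> str:
    """Hypothetical helper: append a lowercased username to the base project name."""
    # Assumption: normalization is shown here as lowercasing only.
    return f"{base}-{username.lower()}"


print(user_project_name("tutorial", "Davesh"))  # tutorial-davesh
```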


**Add (auto) MLOps to your training function:**

Training functions generate models and various model statistics. You'll want to store the models along with all the relevant data,
metadata, and measurements. MLRun can apply this MLOps functionality automatically ("Auto-MLOps") when you call the framework-specific `apply_mlrun()` method.

In the training function below note the **single** custom line you need to add to your code:

```python
apply_mlrun(model=model, model_name="my_model", x_test=x_test, y_test=y_test)
```

`apply_mlrun()` manages the training process and automatically logs the framework-specific model object along with its details, data, metadata, and metrics.
It accepts the model object and various optional parameters. When you specify the `x_test` and `y_test` data, it generates plots and calculations to evaluate the model.
Metadata and parameters are recorded automatically (from the MLRun `context` object) and don't need to be specified.
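As a rough mental model of why `apply_mlrun()` is called before `model.fit()`, the sketch below wraps a model's `fit` method so that an evaluation step runs automatically once training finishes. This is a conceptual illustration only, not MLRun's implementation: `MajorityClassifier` and `apply_logging` are hypothetical names, and the real `apply_mlrun()` records far more than a single metric.

```python
from typing import Dict, List


class MajorityClassifier:
    """Toy stand-in for a scikit-learn estimator (hypothetical, for illustration)."""

    def fit(self, X: List[list], y: List[int]) -> "MajorityClassifier":
        # Remember the most frequent label seen during training
        self.label_ = max(set(y), key=y.count)
        return self

    def predict(self, X: List[list]) -> List[int]:
        return [self.label_ for _ in X]


def apply_logging(model, x_test, y_test, log: Dict[str, float]) -> None:
    """Wrap model.fit so that test accuracy is recorded after training."""
    original_fit = model.fit

    def fit_and_log(X, y):
        result = original_fit(X, y)
        predictions = model.predict(x_test)
        correct = sum(p == t for p, t in zip(predictions, y_test))
        log["accuracy"] = correct / len(y_test)
        return result

    # Override fit on this instance only; the class itself is untouched
    model.fit = fit_and_log


results: Dict[str, float] = {}
model = MajorityClassifier()
apply_logging(model, x_test=[[0], [1], [2], [3]], y_test=[1, 1, 0, 1], log=results)
model.fit([[0], [1], [2]], [1, 1, 0])
print(results)  # {'accuracy': 0.75}
```

The training code itself stays unchanged: the caller still just calls `fit()`, and the logging happens as a side effect of the wrapper.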

**Function code:**

Run the following cell to generate the `trainer.py` file (or copy it manually):

In [2]:
%%writefile trainer.py

from sklearn import ensemble
from sklearn.model_selection import train_test_split

import mlrun
from mlrun.frameworks.sklearn import apply_mlrun


def train(
    dataset: mlrun.DataItem,  # data inputs are of type DataItem (abstract the data source)
    label_column: str = "Win",
    n_estimators: int = 100,
    learning_rate: float = 0.1,
    max_depth: int = 3,
    model_name: str = "worldcup_classifier",
):
    # Get the input dataframe (Use DataItem.as_df() to access any data source)
    df = dataset.as_df()

    # Initialize the x & y data
    X = df.drop(label_column, axis=1)
    y = df[label_column]

    # Train/Test split the dataset
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Pick an ideal ML model
    model = ensemble.GradientBoostingClassifier(
        n_estimators=n_estimators, learning_rate=learning_rate, max_depth=max_depth
    )

    # -------------------- The only line you need to add for MLOps -------------------------
    # Wraps the model with MLOps (test set is provided for analysis & accuracy measurements)
    apply_mlrun(model=model, model_name=model_name, x_test=X_test, y_test=y_test)
    # --------------------------------------------------------------------------------------

    # Train the model
    model.fit(X_train, y_train)


Overwriting trainer.py


**Create a serverless function object from the code above, and register it in the project:**

In [3]:
trainer = project.set_function("trainer.py", name="trainer", kind="job", image="mlrun/mlrun", handler="train")


<a id="run-function"></a>
## Run the training function and log the artifacts and model

**Create a dataset for training:**

In [4]:
import pandas as pd
import mlrun
    

def preprocess(data):
    teams = set(list(data['Home Team Name'].unique()) + (list(data['Away Team Name'].unique())))
    stages = list(data['Stage'].unique())
    victories, finals, goals, victories_tur, goals_tur = create_zero_dict(teams, 5)
    year = 1930
    
    for i, row in data.iterrows():
        if row['Stage'] == 'Final' or row['Stage'] == 'finals':
            finals[row['Home Team Name']] += 1
            finals[row['Away Team Name']] += 1
        if year != row['Year']:
            victories_tur, goals_tur = create_zero_dict(teams, 2)
        data = insert_to_data(data, i, row, ['finals', 'victory current tournament', 'victory', 'goals', 'goals current tournament'], 
                              [finals, victories_tur, victories, goals, goals_tur])

        goals, goals_tur = update_goals(data, row, [goals, goals_tur])
        if row['Win'] == 1:
            victories[row['Home Team Name']] += 1
            victories_tur[row['Home Team Name']] += 1
        elif row['Win'] == 2:
            victories[row['Away Team Name']] += 1
            victories_tur[row['Away Team Name']] += 1
        
    data = data[data.Year > 1950]
    data = data[data.Win > 0]
    return data

def create_zero_dict(teams, num_of_dict):
    
    return [dict(zip(teams, [0]*len(teams))) for i in range(num_of_dict)]

def insert_to_data(data, index, row, fields, dictionaries):
    for field, dictionary in zip(fields, dictionaries):
        data.at[index, f'Home {field}'] = dictionary[row['Home Team Name']]
        data.at[index, f'Away {field}'] = dictionary[row['Away Team Name']]
    return data
    
def update_goals(data, row, goals_dictionaries):
    for dictionary in goals_dictionaries:
        dictionary[row['Home Team Name']] += row['Home Team Goals']
        dictionary[row['Away Team Name']] += row['Away Team Goals']
    return goals_dictionaries



data = pd.read_csv('./WorldCupMatches.csv', encoding='UTF-8')
data.dropna(inplace=True)
data = preprocess(data)
teams = set(list(data['Home Team Name'].unique()) + (list(data['Away Team Name'].unique())))
stages = list(data['Stage'].unique())
data['Home Team Name'] = pd.Categorical(data['Home Team Name'], categories=list(teams))
data['Away Team Name'] = pd.Categorical(data['Away Team Name'], categories=list(teams))
data['Stage'] = pd.Categorical(data['Stage'], categories=list(stages))
away_team = pd.get_dummies(data['Away Team Name'], prefix='Away Team Name')
home_team = pd.get_dummies(data['Home Team Name'], prefix='Home Team Name')
stage = pd.get_dummies(data['Stage'], prefix='Stage')

data.drop(columns=['Home Team Goals', 'Away Team Goals', 'Away Team Name', 'Home Team Name', 'Stage', 'Attendance'], inplace=True)
# Column order is unchanged: stage, then away-team, then home-team indicators
data = pd.concat([data, stage, away_team, home_team], axis=1)
data.dropna(inplace=True)

data.to_csv('./worldcup-dataset.csv')

**Run the function (locally) using the generated dataset:**

In [5]:
trainer_run = project.run_function(
    "trainer", 
    inputs={"dataset": "worldcup-dataset.csv"}, 
    params = {"n_estimators": 100, "learning_rate": 1e-1, "max_depth": 3},
    local=True
)

> 2022-09-01 07:35:37,938 [info] starting run trainer-train uid=40d0e2aa10df43fe9ab1ce51e45dbe5e DB=http://mlrun-api:8080


| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| tutorial-davesh | ...e45dbe5e | 0 | Sep 01 07:35:38 | completed | trainer-train | v3io_user=Davesh<br>kind=<br>owner=Davesh<br>host=jupyter-davids-55f4d7f589-q7jwr | dataset | n_estimators=100<br>learning_rate=0.1<br>max_depth=3 | accuracy=0.773109243697479<br>f1_score=0.8556149732620321<br>precision_score=0.8333333333333334<br>recall_score=0.8791208791208791 | feature-importance<br>test_set<br>confusion-matrix<br>roc-curves<br>calibration-curve<br>model |





> 2022-09-01 07:35:43,505 [info] run executed, status=completed


<br>

**View the auto generated results and artifacts:**

In [6]:
trainer_run.outputs

{'accuracy': 0.773109243697479,
 'f1_score': 0.8556149732620321,
 'precision_score': 0.8333333333333334,
 'recall_score': 0.8791208791208791,
 'feature-importance': 'v3io:///projects/tutorial-davesh/artifacts/trainer-train/0/feature-importance.html',
 'test_set': 'store://artifacts/tutorial-davesh/trainer-train_test_set:40d0e2aa10df43fe9ab1ce51e45dbe5e',
 'confusion-matrix': 'v3io:///projects/tutorial-davesh/artifacts/trainer-train/0/confusion-matrix.html',
 'roc-curves': 'v3io:///projects/tutorial-davesh/artifacts/trainer-train/0/roc-curves.html',
 'calibration-curve': 'v3io:///projects/tutorial-davesh/artifacts/trainer-train/0/calibration-curve.html',
 'model': 'store://artifacts/tutorial-davesh/worldcup_classifier:40d0e2aa10df43fe9ab1ce51e45dbe5e'}
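As a quick sanity check on these results, the F1 score is the harmonic mean of precision and recall, so it can be recomputed from the other two values. The sketch below uses only the numbers shown in the output above and does not call MLRun.

```python
# Precision and recall as reported in the run outputs above
precision = 0.8333333333333334
recall = 0.8791208791208791

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 6))  # 0.855615
```

The result agrees with the logged `f1_score` of `0.8556149732620321`.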

In [None]:
trainer_run.artifact('feature-importance').show()

**Export model files + metadata into a zip:** (requires MLRun 1.1.0 or later)

You can `export()` the model package (files + metadata) into a zip, and load it on a remote system/cluster (by simply running `model = project.import_artifact(key, path)`). 

In [7]:
trainer_run.artifact('model').meta.export("model.zip")

<a id="hyper-param"></a>
## Hyper-parameter tuning and model/experiment comparison

Run a `GridSearch` over several parameters, and select the best run by the highest `f1_score` (the `max.f1_score` selector used below). <br>
(Read more about MLRun [Hyper-Param and Iterative jobs](https://docs.mlrun.org/en/stable/hyper-params.html).)

For basic usage, you can run the hyperparameter tuning job with these arguments:
* `hyperparams` for the hyperparameters options and values of choice.
* `selector` for specifying how to select the best model.
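To get a feel for the size of a grid search, the sketch below expands the `hyperparams` dictionary used later in this notebook into every parameter combination. It is plain Python and does not call MLRun; the order in which MLRun numbers its iterations is not assumed to match the order produced here.

```python
from itertools import product

# Same grid as the hyper-parameter run later in this notebook
hyperparams = {
    "n_estimators": [10, 100, 1000],
    "learning_rate": [1e-1, 1e-3],
    "max_depth": [2, 8],
}

# A grid search tries every combination: the Cartesian product of the value lists
names = list(hyperparams)
combinations = [dict(zip(names, values)) for values in product(*hyperparams.values())]

print(len(combinations))  # 3 * 2 * 2 = 12
print(combinations[0])
```

Each added value multiplies the number of child runs, so it helps to keep the lists short when every iteration trains a full model.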

**Running a remote function:**

To run the hyper-param task over the cluster, the input data must be available to the job, either in object storage or in the MLRun versioned artifact store.

The following line logs (and uploads) the dataframe as a project artifact:

In [8]:
dataset_artifact = project.log_dataset("worldcup-dataset", df=data, index=False)

Run the function over the remote Kubernetes cluster (`local` is not set):

In [9]:
hp_tuning_run = project.run_function(
    "trainer", 
    inputs={"dataset": dataset_artifact.uri}, 
    hyperparams={
        "n_estimators": [10, 100, 1000], 
        "learning_rate": [1e-1, 1e-3], 
        "max_depth": [2, 8]
    }, 
    selector="max.f1_score", 
)

> 2022-09-01 07:35:44,267 [info] starting run trainer-train uid=b814387bc6b34cb2abbd9063080c7f4d DB=http://mlrun-api:8080
> 2022-09-01 07:35:44,516 [info] Job is running in the background, pod: trainer-train-r67jq
> 2022-09-01 07:36:38,327 [info] best iteration=4, used criteria max.f1_score
> 2022-09-01 07:36:39,085 [info] run executed, status=completed
final state: completed


| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| tutorial-davesh | ...080c7f4d | 0 | Sep 01 07:35:48 | completed | trainer-train | v3io_user=Davesh<br>kind=job<br>owner=Davesh<br>mlrun/client_version=0.0.0+unstable | dataset | | best_iteration=4<br>accuracy=0.7647058823529411<br>f1_score=0.8666666666666666<br>precision_score=0.7647058823529411<br>recall_score=1.0 | feature-importance<br>test_set<br>confusion-matrix<br>roc-curves<br>calibration-curve<br>model<br>iteration_results<br>parallel_coordinates |





> 2022-09-01 07:36:40,709 [info] run executed, status=completed


<br>

**View Hyper-param results and the selected run in the MLRun UI:**

![hprun](../_static/images/tutorial/hprun.png)

Interactive Parallel Coordinates Plot:

![pcp](../_static/images/tutorial/pcp.png)

<br>

**List the generated models and compare the different runs:**

In [10]:
hp_tuning_run.outputs

{'best_iteration': 4,
 'accuracy': 0.7647058823529411,
 'f1_score': 0.8666666666666666,
 'precision_score': 0.7647058823529411,
 'recall_score': 1.0,
 'feature-importance': 'v3io:///projects/tutorial-davesh/artifacts/4/feature-importance.html',
 'test_set': 'store://artifacts/tutorial-davesh/trainer-train_test_set:b814387bc6b34cb2abbd9063080c7f4d',
 'confusion-matrix': 'v3io:///projects/tutorial-davesh/artifacts/4/confusion-matrix.html',
 'roc-curves': 'v3io:///projects/tutorial-davesh/artifacts/4/roc-curves.html',
 'calibration-curve': 'v3io:///projects/tutorial-davesh/artifacts/4/calibration-curve.html',
 'model': 'store://artifacts/tutorial-davesh/worldcup_classifier:b814387bc6b34cb2abbd9063080c7f4d',
 'iteration_results': 'v3io:///projects/tutorial-davesh/artifacts/iteration_results.csv',
 'parallel_coordinates': 'v3io:///projects/tutorial-davesh/artifacts/parallel_coordinates.html'}

In [11]:
# list the models in the project (can apply filters)
models = project.list_models()
for model in models:
    print(f"uri: {model.uri}, metrics: {model.metrics}")

uri: store://models/tutorial-davesh/worldcup_classifier#0:40d0e2aa10df43fe9ab1ce51e45dbe5e, metrics: {'accuracy': 0.773109243697479, 'f1_score': 0.8556149732620321, 'precision_score': 0.8333333333333334, 'recall_score': 0.8791208791208791}
uri: store://models/tutorial-davesh/worldcup_classifier#1:b814387bc6b34cb2abbd9063080c7f4d, metrics: {'accuracy': 0.7478991596638656, 'f1_score': 0.8484848484848485, 'precision_score': 0.7850467289719626, 'recall_score': 0.9230769230769231}
uri: store://models/tutorial-davesh/worldcup_classifier#2:b814387bc6b34cb2abbd9063080c7f4d, metrics: {'accuracy': 0.773109243697479, 'f1_score': 0.8571428571428572, 'precision_score': 0.826530612244898, 'recall_score': 0.8901098901098901}
uri: store://models/tutorial-davesh/worldcup_classifier#3:b814387bc6b34cb2abbd9063080c7f4d, metrics: {'accuracy': 0.7394957983193278, 'f1_score': 0.8324324324324325, 'precision_score': 0.8191489361702128, 'recall_score': 0.8461538461538461}
uri: store://models/tutorial-davesh/wor
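The `max.f1_score` selector can be read as "take the iteration with the largest `f1_score`". The sketch below applies that rule to the F1 scores visible in this notebook's output, assuming the `#N` suffix in each model URI is the iteration number. `select_best` is a hypothetical helper, not an MLRun function, and only the iterations shown above are included.

```python
def select_best(iterations, selector):
    """Pick one iteration using a '<max|min>.<metric>' selector string."""
    op, metric = selector.split(".", 1)
    choose = {"max": max, "min": min}[op]
    return choose(iterations, key=lambda it: it[metric])


# F1 scores for iterations 1-3 (model list) and iteration 4 (run outputs)
iterations = [
    {"iter": 1, "f1_score": 0.8484848484848485},
    {"iter": 2, "f1_score": 0.8571428571428572},
    {"iter": 3, "f1_score": 0.8324324324324325},
    {"iter": 4, "f1_score": 0.8666666666666666},
]

best = select_best(iterations, "max.f1_score")
print(best["iter"])  # 4
```

This matches the `best_iteration` value of 4 reported by the hyper-parameter run.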

In [12]:
# to view the full model object use:
# print(models[0].to_yaml())

In [None]:
# compare the runs (generate interactive parallel coordinates plot and a table)
project.list_runs(name="trainer-train", iter=True).compare()

<a id="model-serving"></a>
## Build and test the model serving functions

MLRun serving can produce managed, real-time, serverless pipelines composed of various data processing and ML tasks. The pipelines use the Nuclio real-time serverless engine, which can be deployed anywhere. For more details and examples, see [MLRun Serving Graphs](https://docs.mlrun.org/en/stable/serving/serving-graph.html).

**Create a model serving function from our [code](serving.py)  [(view)](serving.html)**

In [None]:
serving_fn = mlrun.new_function("serving", image="mlrun/mlrun", kind="serving")
serving_fn.add_model(
    "worldcup-classifier",
    model_path=hp_tuning_run.outputs["model"],
    class_name="mlrun.frameworks.sklearn.SklearnModelServer",
)

In [None]:
# create a mock (simulator of the real-time function)
server = serving_fn.to_mock_server()

# a single input row with 187 feature values
my_data = {"inputs": [[1] * 187]}
server.test("/v2/models/worldcup-classifier/infer", body=my_data)
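The request body is a dictionary with an `inputs` list holding one inner list per row to score. The sketch below builds such a body and checks the row width before sending. `build_infer_body` is a hypothetical helper, and the feature count of 187 is an assumption taken from the test input above rather than read from the model.

```python
import json

N_FEATURES = 187  # assumption: taken from the [1] * 187 test input above


def build_infer_body(rows, n_features=N_FEATURES):
    """Return an {'inputs': [...]} body after checking every row's length."""
    for i, row in enumerate(rows):
        if len(row) != n_features:
            raise ValueError(f"row {i} has {len(row)} values, expected {n_features}")
    return {"inputs": [list(row) for row in rows]}


body = build_infer_body([[1] * N_FEATURES])
payload = json.dumps(body)  # JSON text, as it would be sent over HTTP
print(len(body["inputs"]), len(body["inputs"][0]))  # 1 187
```

Checking the width on the client side turns a shape mismatch into a clear error message instead of a failure inside the model server.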

## Done!

Congratulations! You've completed Part 2 of the MLRun getting-started tutorial.
Proceed to [**Part 3: Model serving**](03-model-serving.html) to learn how to deploy and serve your model using a serverless function.