# 5.0: Prerequisites: Training an ML Model for Model Serving

This notebook trains the model that the model-serving tutorial uses. Run the whole notebook to produce a trained model.

<a id="gs-tutorial-2-prerequisites"></a>

## Prerequisites

The following steps are a continuation of the previous part of this getting-started tutorial and rely on the generated outputs.
Therefore, make sure to first run [part 1](01-mlrun-basics.ipynb) of the tutorial.

<a id="gs-tutorial-2-step-setup"></a>

## Step 1: Setup, Configuration, and Data Preparation

<a id="gs-tutorial-2-mlrun-envr-init"></a>

### Initializing Your MLRun Environment

Use the `get_or_create_project` MLRun method to create a new project or fetch it from the DB/repository if it already exists.
Set the `project` and `user_project` parameters to the same values that you used in the call to this method in the [Part 1: MLRun Basics](./01-mlrun-basics.ipynb#gs-tutorial-1-mlrun-envr-init) tutorial notebook.

In [1]:
from os import path
import mlrun

# Set the base project name
project_name_base = 'realtime-pipelines'

# Initialize the MLRun project object
project = mlrun.get_or_create_project(project_name_base, context="./", user_project=True)

# Display the current project name
project_name = project.metadata.name
print(f'Full project name: {project_name}')

> 2022-06-22 14:51:35,078 [info] loaded project realtime-pipelines from MLRun DB
Full project name: realtime-pipelines-xingsheng
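
The full project name in the output above suggests how `user_project=True` affects naming: the user name appears to be appended to the base name with a hyphen. The following minimal sketch reproduces that convention. `full_project_name` is a hypothetical helper for illustration, not an MLRun function, and the exact rule that MLRun applies is an assumption here.

```python
def full_project_name(base_name: str, username: str) -> str:
    """Hypothetical helper: join the base project name and the user name with a hyphen."""
    return f"{base_name}-{username}"


# Values taken from the cell and its output above
print(full_project_name("realtime-pipelines", "xingsheng"))
```

Because the name depends on the user, two users who run the same notebook would each get their own project under this convention.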


<a id="gs-tutorial-2-mark-mlrun-code-start"></a>

In [2]:
import pandas as pd

# Fetch and clean a dataset through ingestion
def prep_data(source_url, label_column):
    df = pd.read_csv(source_url)
    df[label_column] = df[label_column].astype('category').cat.codes    
    return df, df.shape[0]
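
To make the label-encoding step in `prep_data` concrete, this small sketch shows what `astype('category').cat.codes` does to a column of string labels. The sample values are illustrative and are not taken from the tutorial data set.

```python
import pandas as pd

# Illustrative label values (an assumption, not the tutorial data)
labels = pd.Series(["setosa", "virginica", "versicolor", "setosa"])

# Converting to a categorical type collects the distinct values in sorted order
categorical = labels.astype("category")
categories = list(categorical.cat.categories)

# Each code is the position of the value in the sorted category list
codes = categorical.cat.codes.tolist()

print(categories)
print(codes)
```

The codes are plain integers, which is the form that most scikit-learn estimators expect for a target column.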

In [3]:
# mlrun: start-code

In [4]:
import mlrun
def prep_data(context, source_url: mlrun.DataItem, label_column='label'):

    # Convert the DataItem to a pandas DataFrame
    df = source_url.as_df()
    df[label_column] = df[label_column].astype('category').cat.codes    
    
    # Log the DataFrame size after the run
    context.log_result('num_rows', df.shape[0])

    # Store the dataset in your artifacts database
    context.log_dataset('cleaned_data', df=df, index=False, format='csv')

In [5]:
# mlrun: end-code

In [6]:
# Convert the local prep_data function to an MLRun project function
data_prep_func = mlrun.code_to_function(name='prep_data', kind='job', image='mlrun/mlrun')

In [7]:
# Set the source-data URL
source_url = mlrun.get_sample_path("data/iris/iris.data.raw.csv")

In [8]:
# Run the `data_prep_func` MLRun function locally
prep_data_run = data_prep_func.run(name='prep_data',
                                   handler=prep_data,
                                   inputs={'source_url': source_url},
                                   local=True)

> 2022-05-25 20:14:46,032 [info] starting run prep_data uid=dc78f7c70a4e42fc86855f2297955bfb DB=http://mlrun-api:8080


| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
|---|---|---|---|---|---|---|---|---|---|---|
| realtime-pipeline-xingsheng | ...97955bfb | 0 | May 25 20:14:46 | completed | prep_data | v3io_user=xingsheng, kind=, owner=xingsheng, host=jupyter-xingsheng-5595874567-6xplh | source_url | | num_rows=150 | cleaned_data |

> 2022-05-25 20:14:48,502 [info] run executed, status=completed


<a id="gs-tutorial-2-step-create-a-training-function"></a>

## Step 2: Creating a Training Function

In [9]:
# mlrun: start-code

In [10]:
from sklearn import ensemble
from sklearn.model_selection import train_test_split
from mlrun.frameworks.sklearn import apply_mlrun
import mlrun

In [11]:
def train_iris(dataset: mlrun.DataItem, label_column: str):
    
    # Initialize our dataframes
    df = dataset.as_df()
    X = df.drop(label_column, axis=1)
    y = df[label_column]

    # Train/Test split Iris data-set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Pick an ideal ML model
    model = ensemble.RandomForestClassifier()
    
    # Wrap the model with MLRun features; pass the test set for analysis and accuracy measurements
    apply_mlrun(model=model, model_name='my_model', x_test=X_test, y_test=y_test)
    
    # Train our model
    model.fit(X_train, y_train)

<a id="gs-tutorial-2-mark-mlrun-code-end"></a>

In [12]:
# mlrun: end-code
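
For comparison, here is a minimal sketch of the same split-and-fit flow in plain scikit-learn, with the accuracy computed by hand instead of being logged through `apply_mlrun`. It assumes that scikit-learn's bundled Iris data is an acceptable stand-in for the cleaned tutorial data set, and it fixes the classifier's `random_state` (which the tutorial code does not) only to make the result repeatable.

```python
from sklearn import ensemble
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumption: the bundled Iris data stands in for the cleaned tutorial data set
X, y = load_iris(return_X_y=True)

# Same split settings as in train_iris: 20% of the rows are held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the classifier on the training rows only
model = ensemble.RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)

# Score the model on rows it has not seen during training
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"accuracy on the held-out test set: {accuracy:.2f}")
```

Holding out the test rows before fitting is what makes the reported accuracy a measure of generalization rather than of memorization.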

### Converting the Code to an MLRun Function

In [13]:
train_iris_func = mlrun.code_to_function(name='train_iris',
                                         handler='train_iris',
                                         kind='job',
                                         image='mlrun/mlrun')

<a id="gs-tutorial-2-persistent-volume-mount"></a>

### Running the Function Locally


In [14]:
# Our dataset location (uri)
dataset = project.get_artifact_uri('prep_data_cleaned_data')
print(dataset)

store://artifacts/realtime-pipeline-xingsheng/prep_data_cleaned_data


In [15]:
#train_iris_func.spec.image_pull_policy = "Always"

In [16]:
train_run = train_iris_func.run(inputs={'dataset': dataset},
                                params={'label_column': 'label'},
                                local=True)

> 2022-05-25 20:15:00,634 [info] starting run train-iris-train_iris uid=678b04e8bb8a4dadb4116498ec0beb92 DB=http://mlrun-api:8080


| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
|---|---|---|---|---|---|---|---|---|---|---|
| realtime-pipeline-xingsheng | ...ec0beb92 | 0 | May 25 20:15:00 | completed | train-iris-train_iris | v3io_user=xingsheng, kind=, owner=xingsheng, host=jupyter-xingsheng-5595874567-6xplh | dataset | label_column=label | accuracy=1.0, f1_score=1.0, precision_score=1.0, recall_score=1.0, auc-micro=1.0, auc-macro=1.0, auc-weighted=1.0 | feature-importance, test_set, confusion-matrix, roc-curves, model |

> 2022-05-25 20:15:02,311 [info] run executed, status=completed


<a id='gs-run-ingest-func'></a>

### Reviewing the Run Output

You can view extensive run information and artifacts from Jupyter Notebook and the MLRun dashboard, as well as browse the project artifacts from the dashboard.

The following code displays the store URI of the model artifact from the training-job outputs.

In [17]:
train_run.outputs['model']

'store://artifacts/realtime-pipeline-xingsheng/my_model:678b04e8bb8a4dadb4116498ec0beb92'
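
The store URIs printed in this notebook follow a visible pattern: a `store` scheme, an artifact kind, the project name, the artifact key, and optionally a suffix after a colon. The sketch below splits a URI into those parts with the standard library. `parse_store_uri` is a hypothetical helper rather than an MLRun API, and the interpretation of the suffix (here it matches the uid of the training run) is an assumption based only on the outputs shown above.

```python
from urllib.parse import urlparse


def parse_store_uri(uri: str) -> dict:
    """Hypothetical helper: split a store URI into kind, project, key, and suffix."""
    parsed = urlparse(uri)
    if parsed.scheme != "store":
        raise ValueError(f"not a store URI: {uri}")
    # The path looks like "/<project>/<key>[:<suffix>]"
    project, _, key_part = parsed.path.lstrip("/").partition("/")
    key, _, suffix = key_part.partition(":")
    return {
        "kind": parsed.netloc,
        "project": project,
        "key": key,
        "suffix": suffix or None,
    }


model_uri = "store://artifacts/realtime-pipeline-xingsheng/my_model:678b04e8bb8a4dadb4116498ec0beb92"
dataset_uri = "store://artifacts/realtime-pipeline-xingsheng/prep_data_cleaned_data"

print(parse_store_uri(model_uri))
print(parse_store_uri(dataset_uri))
```

In practice you pass the URI string as-is to other MLRun functions; the parsing here is only meant to show what the string encodes.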

<a id="gs-tutorial-2-step-test-model"></a>

## Step 3: Testing Your Model

Now that you have a trained model, you can test it.
Run a task that uses the [`test_classifier` marketplace function](https://github.com/mlrun/functions/tree/master/test_classifier) to evaluate the selected trained model against the test data set that the training task (`train_run`) returned in the previous step.

<a id="gs-tutorial-2-add-test-function"></a>

### Adding a Test Function

Run the following code to import the `test_classifier` marketplace function code and create a related `test` function object.

In [18]:
test = mlrun.import_function('hub://test_classifier')

<a id="gs-tutorial-2-run-model-testing-job"></a>

### Running a Model-Testing Job

Configure parameters for the test function (`params`), and provide the selected trained model from the `train_run` job as an input artifact (`inputs`).

In [19]:
test_run = test.run(name="test",
                    params={"label_column": "label"},
                    inputs={"models_path": train_run.outputs['model'],
                            "test_set": train_run.outputs['test_set']})

> 2022-05-25 20:15:02,501 [info] starting run test uid=81f4b669979b4a379128371d2f8b121e DB=http://mlrun-api:8080
> 2022-05-25 20:15:02,687 [info] Job is running in the background, pod: test-q96ww
> 2022-05-25 20:15:08,873 [info] run executed, status=completed
final state: completed


| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
|---|---|---|---|---|---|---|---|---|---|---|
| realtime-pipeline-xingsheng | ...2f8b121e | 0 | May 25 20:15:06 | completed | test | v3io_user=xingsheng, kind=job, owner=xingsheng, mlrun/client_version=1.0.0, host=test-q96ww | models_path, test_set | label_column=label | accuracy=1.0, test-error=0.0, auc-micro=1.0, auc-weighted=1.0, f1-score=1.0, precision_score=1.0, recall_score=1.0 | confusion-matrix, feature-importances, precision-recall-multiclass, roc-multiclass, test_set_preds |

> 2022-05-25 20:15:11,995 [info] run executed, status=completed
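
The `accuracy` and `test-error` results of the test run appear to be complementary (1.0 and 0.0). As a rough illustration of how such metrics relate, the sketch below computes both from a small set of made-up labels and predictions. The `accuracy` helper and the sample values are hypothetical, and the assumption that the test error equals one minus the accuracy is inferred from the results above rather than from the `test_classifier` source.

```python
def accuracy(y_true, y_pred):
    """Hypothetical helper: fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels and predictions: four of the five predictions are correct
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 2, 1, 1]

acc = accuracy(y_true, y_pred)
test_error = 1 - acc  # assumed relationship between the two metrics

print(f"accuracy={acc:.2f} test-error={test_error:.2f}")
```

A perfect score on a small test set, as in the run above, is plausible for the Iris data but says little about how the model behaves on harder data.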


<a id="gs-test-model-run-output-review"></a>

<a id="gs-tutorial-2-done"></a>