# Using MLRun's Builtin Development Functions

If you have the data, MLRun has the code. MLRun's [Functions Marketplace](https://www.mlrun.org/marketplace)
offers many training functions that take advantage of MLRun's quality-of-life features. This guide uses one of the
marketplace's main training functions, `auto-trainer`, as an example.

[**Auto Trainer**](https://www.mlrun.org/marketplace/functions/master/auto_trainer/latest/documentation/) is a highly
customizable MLRun training function. Its many parameters let you experiment until you find the solution that best
fits your needs.

The marketplace has more functions. Check the **Model Training** category to see for yourself:

<img src="./model_training_category_checked.png" alt="Model Training category checked"/>

If you wish to use your own training function, see
[how to apply MLRun to existing code]().

This guide covers the three most common handlers of any training function: `train`, `evaluate` and `predict`, using the
Auto Trainer as the example.

## Training

The main and default handler of any training function is called `"train"`. In the Auto Trainer, this handler trains an
ML model using SciKit-Learn's API, meaning the function follows the structure below:

1. **Get the data**: Get the dataset passed to a local path.
2. **Split the data into datasets**: Split the given data into a training set and a testing set.
3. **Get the model**: Initialize a model instance out of a given class or load a provided model. The supported classes
  are anything based on `sklearn.base.BaseEstimator`, `xgboost.XGBModel` or `lightgbm.LGBMModel`, including custom code.
4. **Train**: Call the model's `fit` method to train it on the training set.
5. **Test**: Test the model on the testing set.
6. **Log**: Calculate the metrics and produce the artifacts to log the results and plots.

MLRun orchestrates all the steps above. The training is done with the shortcut function `apply_mlrun`, which
enables automatic logging and further features.

### Parameters

* `context` - MLRun context.

**Model Parameters**

*Parameters to initialize a new model object or load a logged one for retraining.*

* `model_class`: `str` - The class of the model to initialize. Can be a module path like
  `"sklearn.linear_model.LogisticRegression"` or a custom model passed through the custom objects parameters below.
  Only one of `model_class` and `model_path` can be given.
* `model_path`: `str` - A `ModelArtifact` URI to load and retrain. Only one of `model_class` and `model_path` can be
  given.
* `model_kwargs`: `dict` - Additional parameters to pass onto the initialization of the model object (the model's class
  `__init__` method).
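
As a rough illustration of how a module path such as `"sklearn.linear_model.LogisticRegression"` can be turned into a
model object, the sketch below resolves a dotted path with `importlib` and forwards `model_kwargs` to the class's
`__init__` method. The helper name `create_model` is hypothetical and not part of MLRun, and the Auto Trainer's own
implementation may differ. A standard-library class stands in for a model class so the snippet runs without extra
packages.

```python
import importlib


def create_model(model_class: str, model_kwargs: dict = None):
    """Import a class from a dotted module path and initialize it."""
    # Split "package.module.ClassName" into the module path and the class name.
    module_path, class_name = model_class.rsplit(".", 1)
    module = importlib.import_module(module_path)
    cls = getattr(module, class_name)
    # Forward the extra keyword arguments to the class's __init__ method.
    return cls(**(model_kwargs or {}))


# A standard-library class stands in for a model class here (an assumption
# made only to keep the example self-contained).
fraction = create_model("fractions.Fraction", {"numerator": 3, "denominator": 4})
print(fraction)  # 3/4
```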

**Data parameters**

*Parameters to get a dataset and prepare it for training, splitting into training and testing if required.*

* `dataset`: `Union[str, list, dict]` - The dataset to train the model on.
  * Can be passed as part of `inputs` to be parsed as `mlrun.DataItem`, meaning it supports either a URI or a
    FeatureVector.
  * Can be passed as part of `params`, meaning it can be a `list` or a `dict`.
* `drop_columns`: `Union[str, int, List[str], List[int]]` - The columns to drop from the dataset. Can be passed as
  strings representing the column names or integers representing the column numbers.
* `test_set`: `Union[str, list, dict]` - The test set to test the model with after training. Note that only one of
  `test_set` or `train_test_split_size` is expected.
  * Can be passed as part of `inputs` to be parsed as `mlrun.DataItem`, meaning it supports either a URI or a
    FeatureVector.
  * Can be passed as part of `params`, meaning it can be a `list` or a `dict`.
* `train_test_split_size`: `float` = `0.2` - The proportion of the dataset to include in the test split. The size of the
  training set is the complement of this value. Must be between 0.0 and 1.0. Defaults to 0.2.
* `label_columns`: `Union[str, int, List[str], List[int]]` - The target label(s) of the column(s) in the dataset. Can
  be passed as strings representing the column names or integers representing the column numbers.
* `random_state`: `int` - Random state (seed) for `train_test_split`.
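
To make the roles of `train_test_split_size` and `random_state` concrete, here is a minimal standard-library sketch
of a seeded split. It is only a stand-in for scikit-learn's `train_test_split`: the function name `split_rows` is
hypothetical, and the exact rows that land in each set will differ from what scikit-learn produces.

```python
import math
import random


def split_rows(rows: list, test_size: float = 0.2, random_state: int = None):
    """Shuffle the rows and split them into a training and a testing set."""
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be between 0.0 and 1.0")
    shuffled = list(rows)
    # A seeded generator makes the shuffle, and so the split, reproducible.
    random.Random(random_state).shuffle(shuffled)
    # The test set takes the requested proportion; training gets the complement.
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


train_set, test_set = split_rows(list(range(20)), test_size=0.25, random_state=7)
print(len(train_set), len(test_set))  # 15 5
```

Running the function twice with the same `random_state` returns the same split, which is what makes experiments
comparable across runs.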

**Train parameters**

*Parameters to pass to the `fit` method of the model object.*

* `train_kwargs`: `dict` - Additional parameters to pass onto the `fit` method.

**Logging parameters**

*Parameters to control the automatic logging feature of MLRun. You can adjust the logging outputs as desired. If these
parameters are not passed, a default list of artifacts is produced and a default list of metrics is calculated.*

* `model_name`: `str` = `"model"` - The model's name to use for storing the model artifact. Defaults to `"model"`.
* `tag`: `str` - The model's tag to log with.
* `sample_set`: `Union[str, list, dict]` - A sample set of inputs for the model. Its statistics are logged along with
  the model for model monitoring. If not given, the training set is used instead.
  * Can be passed as part of `inputs` to be parsed as `mlrun.DataItem`, meaning it supports either a URI or a
    FeatureVector.
  * Can be passed as part of `params`, meaning it can be a `list` or a `dict`.
* `_artifacts`: `Dict[str, Union[list, dict]]` - Additional artifacts to produce post training. See the
  `ArtifactsLibrary` of the desired framework to see the available list of artifacts.
* `_metrics`: `Union[List[str], Dict[str, Union[list, dict]]]` - Additional metrics to calculate post training. See how
  to pass metrics and custom metrics in the `MetricsLibrary` of the desired framework.
* `apply_mlrun_kwargs`: `dict` - Framework-specific `apply_mlrun` keyword arguments. Refer to the framework of choice
  to learn more ([SciKit-Learn](), [XGBoost]() or [LightGBM]()).

**Custom objects parameters**

*Parameters to include custom objects such as a custom model class, metric code and artifact plan. Keep in mind that the
model artifact created is logged with the custom objects, so if `model_path` is used, the custom objects used to
train the model do not need to be passed again; they are loaded automatically.*

* `custom_objects_map`: `Union[str, Dict[str, Union[str, List[str]]]]` - A map of all the custom objects required for
  loading, training and testing the model. Can be passed as a dictionary or a JSON file path. Each key is a path to a
  Python file and its value is the custom object name to import from it. If multiple objects need to be imported from
  the same Python file, a list can be given. For example:
  ```python
  {
      "/.../custom_model.py": "MyModel",
      "/.../custom_objects.py": ["object1", "object2"]
  }
  ```
  All the paths are accessed relative to the given `custom_objects_directory`, meaning each Python file is read from
  `custom_objects_directory/<MAP KEY>`. If the model path given is of a store object, the custom objects map is
  read from the logged custom object map artifact of the model. **Notice**: The custom objects are imported in the
  order they appear in this dictionary (or JSON). If a custom object depends on another, make sure to
  put it below the one it relies on.
* `custom_objects_directory`: Path to the directory with all the python files required for the custom objects. Can be
  passed as a zip file as well (will be extracted during the start of the run).
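
The sketch below shows one way a `custom_objects_map` could be flattened into an ordered list of imports, treating a
single name and a list of names uniformly and preserving the dictionary order that the notice above relies on. The
function `list_custom_imports`, the file names and the directory name are all hypothetical; this is not MLRun's
implementation, only an illustration of the map's structure.

```python
import json
from pathlib import PurePosixPath


def list_custom_imports(custom_objects_map, custom_objects_directory: str) -> list:
    """Return (file path, object name) pairs in the order they would be imported."""
    if isinstance(custom_objects_map, str):
        # A string is treated as a path to a JSON file holding the map.
        with open(custom_objects_map) as json_file:
            custom_objects_map = json.load(json_file)
    imports = []
    for py_file, names in custom_objects_map.items():
        # A single name and a list of names are handled the same way.
        names = [names] if isinstance(names, str) else names
        # Each file is looked up inside the custom objects directory.
        path = str(PurePosixPath(custom_objects_directory) / py_file)
        imports.extend((path, name) for name in names)
    return imports


# Hypothetical file and directory names, used only for illustration.
custom_objects_map = {
    "custom_model.py": "MyModel",
    "custom_objects.py": ["object1", "object2"],
}
for path, name in list_custom_imports(custom_objects_map, "my_objects"):
    print(path, name)
```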

> Notice: The additional-argument parameters `model_kwargs`, `train_kwargs` and `apply_mlrun_kwargs` can
  also be passed in the global `kwargs` with the matching prefixes: `"MODEL_"`, `"TRAIN_"`, `"MLRUN_"`.
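
The prefix convention can be pictured as a simple grouping step. The following is a minimal sketch of how prefixed
keys might be sorted back into `model_kwargs`, `train_kwargs` and `apply_mlrun_kwargs`. The helper
`split_prefixed_kwargs` is a hypothetical name and the Auto Trainer's actual parsing may differ.

```python
def split_prefixed_kwargs(kwargs: dict) -> dict:
    """Group prefixed keyword arguments by their target dictionary."""
    prefixes = {
        "MODEL_": "model_kwargs",
        "TRAIN_": "train_kwargs",
        "MLRUN_": "apply_mlrun_kwargs",
    }
    grouped = {target: {} for target in prefixes.values()}
    for key, value in kwargs.items():
        for prefix, target in prefixes.items():
            if key.startswith(prefix):
                # Strip the prefix to recover the original argument name.
                grouped[target][key[len(prefix):]] = value
                break
    return grouped


grouped = split_prefixed_kwargs(
    {"MODEL_max_depth": 8, "TRAIN_sample_weight": None, "random_state": 7}
)
print(grouped["model_kwargs"])  # {'max_depth': 8}
print(grouped["train_kwargs"])  # {'sample_weight': None}
```

In this sketch, keys without a known prefix (such as `random_state`) are left out of all three groups.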

### Outputs

* **Trained model** - The trained model will be logged as a `ModelArtifact` with all the following artifacts registered
  to it.
* **Test dataset** - The test set used to test the model post training will be logged as a `DatasetArtifact`.
* **Plots** - Informative plots regarding the model, like the confusion matrix and feature importance, are drawn and
  logged as `PlotArtifact`s.
* **Results** - The values of all the metrics calculated on the testing set.

### Example

First, we will import the Auto Trainer from the Functions Marketplace using MLRun's `import_function` function:

```python
import mlrun

auto_trainer = mlrun.import_function("hub://auto_trainer")
```

Assuming we have a dataset (in the example below we provide a URL to a CSV file), all that remains is to run the
Auto Trainer with the training handler, passing our desired parameters:

```python
dataset_url = "https://s3.wasabisys.com/iguazio/data/function-marketplace-data/xgb_trainer/classifier-data.csv"

training_run = auto_trainer.run(
    handler="train",
    inputs={"dataset": dataset_url},
    params={
        # Model parameters:
        "model_class": "sklearn.ensemble.RandomForestClassifier",
        "model_kwargs": {"max_depth": 8},  # Could be also passed as "MODEL_max_depth": 8
        # Dataset parameters:
        "train_test_split_size": 0.3,
        "random_state": 7,
    }
)
```

We can review the function's outputs and view its artifacts:

```python
training_run.outputs
```

```python
training_run.artifact('confusion-matrix').show()
```

## Evaluating

The `"evaluate"` handler tests the model on a given testing set and logs its results. Evaluation is a common phase in
every model's life cycle and should be repeated periodically on updated testing sets to confirm that the model still
performs well. The function uses SciKit-Learn's API for evaluation, meaning it follows the structure below:

1. **Get the data**: Get the testing dataset passed to a local path.
2. **Get the model**: Get the model object out of the `ModelArtifact` URI.
3. **Predict**: Call the model's `predict` (and `predict_proba` if needed) method to test it on the testing set.
4. **Log**: Test the model on the testing set and log the results and artifacts.

MLRun orchestrates all the steps above. The evaluation is done with the shortcut function `apply_mlrun`, which
enables automatic logging and further features.
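
To give a feel for what the **Log** step calculates, here is a small standard-library sketch that derives the
accuracy and the confusion-matrix counts from true and predicted labels. The function `evaluate_predictions` is a
hypothetical helper written for this illustration. In practice MLRun calculates and logs the metrics for you, and
its metric set is not limited to these two.

```python
from collections import Counter


def evaluate_predictions(y_true: list, y_pred: list) -> dict:
    """Compute accuracy and confusion-matrix counts for predicted labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    # Count every (true label, predicted label) pair.
    confusion = Counter(zip(y_true, y_pred))
    correct = sum(count for (true, pred), count in confusion.items() if true == pred)
    return {"accuracy": correct / len(y_true), "confusion": dict(confusion)}


results = evaluate_predictions([0, 1, 1, 0, 1], [0, 1, 0, 0, 1])
print(results["accuracy"])  # 0.8
print(results["confusion"][(1, 0)])  # 1
```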

### Parameters

* `context` - MLRun context.

**Model Parameters**

*Parameters to load a logged model.*

* `model_path`: `str` - A `ModelArtifact` URI to load.

**Data parameters**

*Parameters to get a dataset and prepare it for evaluation.*

* `dataset`: `Union[str, list, dict]` - The dataset to evaluate the model on.
  * Can be passed as part of `inputs` to be parsed as `mlrun.DataItem`, meaning it supports either a URI or a
    FeatureVector.
  * Can be passed as part of `params`, meaning it can be a `list` or a `dict`.
* `drop_columns`: `Union[str, int, List[str], List[int]]` - The columns to drop from the dataset. Can be passed as
  strings representing the column names or integers representing the column numbers.
* `label_columns`: `Union[str, int, List[str], List[int]]` - The target label(s) of the column(s) in the dataset. Can
  be passed as strings representing the column names or integers representing the column numbers.
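
Since `drop_columns` and `label_columns` accept either names or positions, a normalization step along the following
lines is one way to picture how both forms map onto the same columns. The helper `resolve_columns` and the column
names are hypothetical, and the Auto Trainer's own handling may differ.

```python
def resolve_columns(columns, header: list) -> list:
    """Translate column names or positions into a list of column names."""
    if columns is None:
        return []
    if isinstance(columns, (str, int)):
        columns = [columns]  # A single value is treated as a one-item list.
    # Integers are read as positions in the header, strings as names.
    return [header[c] if isinstance(c, int) else c for c in columns]


# Hypothetical dataset header, used only for illustration.
header = ["age", "height", "weight", "label"]
drop = resolve_columns([0, 1], header)
labels = resolve_columns("label", header)
features = [name for name in header if name not in drop + labels]
print(drop, labels, features)  # ['age', 'height'] ['label'] ['weight']
```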

**Predict parameters**

*Parameters to pass to the `predict` method of the model object.*

* `predict_kwargs`: `dict` - Additional parameters to pass onto the `predict` method.

**Logging parameters**

*Parameters to control the automatic logging feature of MLRun. You can adjust the logging outputs as desired. If these
parameters are not passed, a default list of artifacts is produced and a default list of metrics is calculated.*

* `_artifacts`: `Dict[str, Union[list, dict]]` - Additional artifacts to produce after evaluation. See the
  `ArtifactsLibrary` of the desired framework to see the available list of artifacts.
* `_metrics`: `Union[List[str], Dict[str, Union[list, dict]]]` - Additional metrics to calculate after evaluation. See
  how to pass metrics and custom metrics in the `MetricsLibrary` of the desired framework.
* `apply_mlrun_kwargs`: `dict` - Framework-specific `apply_mlrun` keyword arguments. Refer to the framework of choice
  to learn more ([SciKit-Learn](), [XGBoost]() or [LightGBM]()).

**Custom objects parameters**

*Parameters to include custom objects for the evaluation, such as custom metric code and artifact plans. Keep in mind
that the custom objects used to train the model do not need to be passed again; they are loaded automatically.*

* `custom_objects_map`: `Union[str, Dict[str, Union[str, List[str]]]]` - A map of all the custom objects required for
  loading and evaluating the model. Can be passed as a dictionary or a JSON file path. Each key is a path to a
  Python file and its value is the custom object name to import from it. If multiple objects need to be imported from
  the same Python file, a list can be given. For example:
  ```python
  {
      "/.../custom_metric.py": "MyMetric",
      "/.../custom_plans.py": ["plan1", "plan2"]
  }
  ```
  All the paths are accessed relative to the given `custom_objects_directory`, meaning each Python file is read from
  `custom_objects_directory/<MAP KEY>`. If the model path given is of a store object, the custom objects map is
  read from the logged custom object map artifact of the model. **Notice**: The custom objects are imported in the
  order they appear in this dictionary (or JSON). If a custom object depends on another, make sure to
  put it below the one it relies on.
* `custom_objects_directory`: Path to the directory with all the python files required for the custom objects. Can be
  passed as a zip file as well (will be extracted during the start of the run).

> Notice: The additional-argument parameters `predict_kwargs` and `apply_mlrun_kwargs` can also be passed in the
global `kwargs` with the matching prefixes: `"PREDICT_"`, `"MLRUN_"`.

### Outputs

* **Evaluated model** - The evaluated model's `ModelArtifact` is updated with all the following artifacts registered
  to it.
* **Test dataset** - The test set used to evaluate the model is logged as a `DatasetArtifact`.
* **Plots** - Informative plots regarding the model, like the confusion matrix and feature importance, are drawn and
  logged as `PlotArtifact`s.
* **Results** - The values of all the metrics calculated on the testing set.

### Example

We will evaluate the model trained in the previous example using the `training_run` object. We will use the same
function and, for convenience, the same dataset:

```python
evaluation_run = auto_trainer.run(
    handler="evaluate",
    inputs={"dataset": dataset_url},
    params={
        "model": training_run.outputs["model"],  # Take the model from the previous training run.
    },
)
```

Reviewing the evaluation outputs we can see the evaluation results and artifacts:

```python
evaluation_run.outputs
```

## Predicting

The `"predict"` handler runs a manual prediction using the provided model (sometimes referred to as
"batch_predict"). Calling predict manually is typically used to test the model on specific samples, or to serve the
model manually by running a prediction once enough requests have been collected. The function is simple and
straightforward:

1. **Get the model**: Get the model object out of the `ModelArtifact` URI.
2. **Predict**: Call the model's `predict` (and `predict_proba` if needed) method and return its raw prediction as a
  logged dataset.

Getting the model is done with our shortcut function `apply_mlrun`.
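
The "predict once enough requests were collected" pattern mentioned above can be sketched as a small buffer in front
of any object with a `predict` method. Both classes below are hypothetical and exist only for illustration:
`BatchPredictor` is not an MLRun class, and `SignModel` is a toy stand-in for a trained model.

```python
class BatchPredictor:
    """Buffer incoming samples and predict once enough were collected."""

    def __init__(self, model, batch_size: int):
        self.model = model
        self.batch_size = batch_size
        self.pending = []

    def add(self, sample: list):
        """Queue a sample; return predictions when the batch is full, else None."""
        self.pending.append(sample)
        if len(self.pending) < self.batch_size:
            return None
        # Hand the full batch to the model and start a fresh buffer.
        batch, self.pending = self.pending, []
        return self.model.predict(batch)


class SignModel:
    """Toy stand-in model: predicts 1 when the feature sum is positive."""

    def predict(self, samples: list) -> list:
        return [int(sum(sample) > 0) for sample in samples]


predictor = BatchPredictor(SignModel(), batch_size=2)
print(predictor.add([0.5, 1.0]))   # None
print(predictor.add([-2.0, 0.1]))  # [1, 0]
```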

### Parameters

* `context` - MLRun context.

**Model Parameters**

*Parameters to load a logged model.*

* `model_path`: `str` - A `ModelArtifact` URI to load.

**Data parameters**

*Parameters to get a dataset and prepare it for prediction.*

* `dataset`: `Union[str, list, dict]` - The dataset to predict on.
  * Can be passed as part of `inputs` to be parsed as `mlrun.DataItem`, meaning it supports either a URI or a
    FeatureVector.
  * Can be passed as part of `params`, meaning it can be a `list` or a `dict`.
* `drop_columns`: `Union[str, int, List[str], List[int]]` - The columns to drop from the dataset. Can be passed as
  strings representing the column names or integers representing the column numbers.
* `label_columns`: `Union[str, int, List[str], List[int]]` - The target label(s) to assign to the logged prediction.
  Can be passed as strings representing the column names or integers representing the column numbers.

**Predict parameters**

*Parameters to pass to the `predict` method of the model object.*

* `predict_kwargs`: `dict` - Additional parameters to pass onto the `predict` method.

> Notice: The additional-argument parameters `predict_kwargs` and `apply_mlrun_kwargs` can also be passed in the
global `kwargs` with the matching prefixes: `"PREDICT_"`, `"MLRUN_"`.

### Outputs

The prediction the model yielded is logged as a `DatasetArtifact`.

### Example

We will run a prediction on a single sample using the model from the training run:

```python
sample = [-0.265115184, -1.932260063, 0.303991713, -1.863833476, -1.045634803]

predicting_run = auto_trainer.run(
    handler="predict",
    params={
        "dataset": sample,  # Notice dataset is in `params` and not in `inputs` now as we pass a raw sample.
        "model": training_run.outputs["model"],  # Take the model from the previous training run.
    },
)
```

We can get the prediction by:

```python
predicting_run.artifact('prediction').show()
```
