# MLflow quickstart tutorial

MLflow is a tool for tracking machine learning experiments (it offers much more, which we won't use here):
https://mlflow.org/docs/latest/quickstart.html  

In this tutorial, we'll cover:
1. Basic concepts of experiment tracking
2. Setup process used for this project
3. Our extension of MLflow for experiments in this project
  - `start_run` wrapper
  - `log_args` helper
4. How to define functions and what to log
5. Sample experiment

### 1. Basic concepts

As described on its website:
> MLflow Tracking is an API and UI for logging parameters, code versions, metrics, and output files when running your machine learning code and for later visualizing the results. You can use MLflow Tracking in any environment (for example, a standalone script or a notebook) to log results to local files or to a server, then compare multiple runs. Teams can also use it to compare results from different users.

MLflow Tracking is organized around the concept of runs, which are executions of some piece of data science code. Each run records the following information and is a part of some experiment:

> **Code Version**
Git commit hash used to execute the run, if it was executed from an MLflow Project. We don't use MLflow Projects, but our custom wrapper for `start_run` logs the commit hash automatically.

> **Start & End Time**
Start and end time of the run.

> **Source**
Name of the file executed to launch the run, or the project name and entry point for the run if the run was executed from an MLflow Project.

> **Parameters**
Key-value input parameters of your choice. Both keys and values are strings.

> **Metrics**
Key-value metrics where the value is numeric. Each metric can be updated throughout the course of the run (for example, to track how your model’s loss function is converging), and MLflow will record and let you visualize the metric’s full history.

> **Artifacts**
Output files in any format. For example, you can record images (for example, PNGs), models (for example, a pickled scikit-learn model), or even data files (for example, a Parquet file) as artifacts.

##### Useful MLflow functions

Managing experiments:

> **mlflow.create_experiment()** creates a new experiment and returns its ID. Runs can be launched under the experiment by passing the experiment ID to mlflow.start_run.

Managing runs:

> **mlflow.start_run()** returns the currently active run (if one exists), or starts a new run and returns a mlflow.ActiveRun object usable as a context manager for the current run. You do not need to call start_run explicitly: calling one of the logging functions with no active run will automatically start a new one.

> **mlflow.end_run()** ends the currently active run, if any, taking an optional run status.

> **mlflow.active_run()** returns a mlflow.entities.Run object corresponding to the currently active run, if any.

Logging experiments' inputs and results:

> **mlflow.log_param()** logs a key-value parameter in the currently active run. The keys and values are both strings.

> **mlflow.log_metric()** logs a key-value metric. The value must always be a number. MLflow will remember the history of values for each metric.

> **mlflow.log_artifact()** logs a local file as an artifact, optionally taking an artifact_path to place it in within the run’s artifact URI. Run artifacts can be organized into directories, so you can place the artifact in a directory this way.

> **mlflow.log_artifacts()** logs all the files in a given directory as artifacts, again taking an optional artifact_path.


### 2. Setup used in our project

MLflow is in requirements-dev.txt: `pip install -r ../requirements-dev.txt`.
Our wrapper can be imported like this:

In [1]:
from experiments import start_run, log_args

To use MLflow and Jupyter, we have 2 tmux terminals running constantly on the server:
- `lab` which runs `jupyter lab` on `localhost:8888`
- `mlflow` which runs `mlflow ui` on `localhost:5000`

To use both of these resources, you have to connect to the server over SSH with two tunnels, one for port 8888 and one for port 5000.

MLflow uses its default configuration:
- runs are saved to the `mlruns` folder, which is ignored by git

### 3. Experiments wrapper
Our wrapper consists of 2 files: 
- `start_run` wrapper
- `log_args` helper

##### `experiments.start_run`
- enforces that a run is started in a named experiment (by default, MLflow would log it to the `Default` experiment)
- logs version information (git commit hash) that can be used to reproduce the project
- provides a cleaner interface than `mlflow.start_run`
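
The wrapper's source is not reproduced in this tutorial. As a rough sketch of the version-logging idea only, the snippet below reads the current commit hash by calling `git rev-parse HEAD`. The function name `current_commit_hash` is hypothetical, and the real wrapper may obtain the hash differently.

```python
import subprocess
from typing import Optional


def current_commit_hash(repo_dir: str = ".") -> Optional[str]:
    """Return the HEAD commit hash of the repository at repo_dir, or None."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or repo_dir is not inside a git repository
        return None
    return result.stdout.strip()


commit = current_commit_hash()
print(commit if commit is not None else "not inside a git repository")
```

A wrapper could then attach the value to the run, for instance with `mlflow.set_tag`, so that every run records the code version it was produced with.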

##### `experiments.log_args`
- decorator for functions that makes it easy to log their parameter values to MLflow
- logs all passed parameters except those whose names start with an underscore
- does NOT support logging artifacts; this needs to be done manually
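
To make the behaviour described above concrete, here is a minimal sketch of how such a decorator could work. It is not the project's implementation: the name `log_args_sketch`, the injected `log_param` callable, and the choice to also log parameters left at their default values are all assumptions.

```python
import functools
import inspect
from typing import Any, Callable


def log_args_sketch(log_param: Callable[[str, Any], None]):
    """Build a decorator that reports call arguments through `log_param`."""

    def decorator(func):
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Map positional and keyword arguments onto parameter names.
            bound = signature.bind(*args, **kwargs)
            bound.apply_defaults()  # assumption: defaults are logged too
            for name, value in bound.arguments.items():
                if not name.startswith("_"):  # skip underscore-prefixed parameters
                    log_param(name, value)
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Collect the "logged" values in a dict so the sketch runs without MLflow.
logged = {}


@log_args_sketch(logged.__setitem__)
def train(n_estimators, max_depth=3, _verbose=False):
    return n_estimators * max_depth


result = train(8, _verbose=True)
print(logged)  # {'n_estimators': 8, 'max_depth': 3}
print(result)  # 24
```

Inside an active run, passing `mlflow.log_param` in place of `logged.__setitem__` would send the same values to MLflow.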

### 4. Example - wrapper to define experiment functions

In [2]:
# all examples are inspired by: 
# https://scikit-learn.org/stable/auto_examples/feature_selection/plot_rfe_with_cross_validation.html#sphx-glr-auto-examples-feature-selection-plot-rfe-with-cross-validation-py
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedShuffleSplit, ParameterGrid
import mlflow

In [3]:
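def train_and_score_cv(colsample_bytree, n_estimators, max_depth, random_state):
    """Cross-validate an XGBClassifier and log the mean and std of its scores to the active MLflow run.

    NOTE: relies on the global arrays X and y, which are defined in section 5.
    """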
    sss = StratifiedShuffleSplit(n_splits=4, test_size=0.25, random_state=random_state)
    scores = np.zeros(sss.get_n_splits(X, y))
    for i, (train_index, test_index) in enumerate(sss.split(X, y)):
        # split dataset into train and test
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        # build and train
        model = XGBClassifier(
            objective='multi:softmax',
            num_class=14,
            learning_rate=0.03,
            subsample=0.9,
            colsample_bytree=colsample_bytree,
            reg_alpha=0.01,
            reg_lambda=0.01,
            min_child_weight=10,
            n_estimators=n_estimators,
            max_depth=max_depth,
            nthread=1,
        )
        model.fit(X_train, y_train)
        # predict and score
        scores[i] = model.score(X_test, y_test)
    mlflow.log_metric('mean_score', np.mean(scores))
    mlflow.log_metric('std_score', np.std(scores))
    return np.mean(scores)

### 5. Example - running sample experiment

In [4]:
# some sample data for the demonstration
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1], [-1, -1], [-2, -1], [1, 1], [2, 1], [-1, -1], [-2, -1], [1, 1], [2, 1], [-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2])

In [5]:
# our experiment tests model performance with all possible combinations of these parameters
params_to_test = {
    'colsample_bytree': [0.25, 0.5, 0.75],
    'n_estimators': [2, 8, 32],
    'max_depth': [2, 3, 4],
}
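
"All possible combinations" means the Cartesian product of the three value lists, so this grid produces 3 × 3 × 3 = 27 runs. The sketch below rebuilds the same set of combinations with the standard library's `itertools.product`, purely to illustrate what the grid contains; the ordering produced by scikit-learn's `ParameterGrid` is not guaranteed to match.

```python
from itertools import product

params_to_test = {
    "colsample_bytree": [0.25, 0.5, 0.75],
    "n_estimators": [2, 8, 32],
    "max_depth": [2, 3, 4],
}

# Fix the key order, then pair every value of one parameter with every value of the others.
names = sorted(params_to_test)
value_lists = [params_to_test[name] for name in names]
combinations = [dict(zip(names, values)) for values in product(*value_lists)]

print(len(combinations))  # 27
print(combinations[0])    # {'colsample_bytree': 0.25, 'max_depth': 2, 'n_estimators': 2}
```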

In [6]:
# before starting runs, we need to create an experiment
experiment_name = "Demo - testing hyperparameters for XGBClassifier"
mlflow.create_experiment(name=experiment_name)

1

In [7]:
params_combinations = list(ParameterGrid(params_to_test))  # generate all combinations of parameters
for params in params_combinations:
    with start_run(experiment=experiment_name, source_name="MLflow-Tutorial"):
        # we use our custom wrapper here, not mlflow.start_run, to log commit hash and other important stuff automatically
        mean_score = train_and_score_cv(random_state=2222, **params)
        print(f"Mean score = {mean_score}")

Mean score = 0.5
Mean score = 0.5
Mean score = 0.5
Mean score = 0.5
Mean score = 0.5
Mean score = 0.5
Mean score = 0.5
Mean score = 0.5
Mean score = 0.5
Mean score = 0.5
Mean score = 0.5
Mean score = 0.5
Mean score = 0.5
Mean score = 0.5
Mean score = 0.5
Mean score = 0.5
Mean score = 0.5
Mean score = 0.5
Mean score = 0.5
Mean score = 0.5
Mean score = 0.5
Mean score = 0.5
Mean score = 0.5
Mean score = 0.5
Mean score = 0.5
Mean score = 0.5
Mean score = 0.5


Run results are available in the MLflow UI at `localhost:5000`.