# Workshop 03 - MLflow

#### Table of Contents
- [Introduction](#introduction)
- [Machine Learning Lifecycle](#machine-learning-lifecycle)
- [Provided Files](#provided-files)
- [Using MLflow](#using-mlflow)

## Introduction
MLflow is an open-source platform for managing the machine learning lifecycle, including experimentation, reproducibility, deployment, and a central model registry. It currently offers four components:
<div style="width: 100%;"><img src="img/mlflow_components.jpg"/></div>

- **MLflow Tracking --** keeps track of runs by saving metrics, parameters, tags and artifacts, and lets us visualize and compare them in a browser. It also stores files that describe the environment in which the run was executed (MLmodel, conda.yaml, model code).
- **MLflow Projects --** is a format for packaging data science code in a reusable and reproducible way. It uses artifacts recorded at the tracking step.
- **MLflow Models --** is a standard format for packaging models. The format defines a convention that lets you save a model in different flavors (e.g. Python function, R function, Scikit-learn, TensorFlow, Spark MLlib…) that can be understood by different downstream tools.
- **MLflow Model Registry --** is a centralized model store. It provides model lineage (which run produced the model), model versioning, stage transitions (for example from staging to production) and annotations.

## Machine Learning Lifecycle
Before we start working with MLflow and PyTorch, here is a high-level overview of the machine learning lifecycle.
<div style="width: 100%;"><img src="img/ml_lifecycle.jpg"/></div>

- **Business Problem --** This is the first step of the machine learning workflow. It may take from a few days to a few weeks to complete, depending on the use case and the complexity of the problem. At this stage, data scientists meet with subject-matter experts (SMEs) to gain an understanding of the problem, interview key stakeholders, collect information, and set the overall expectations of the project.
- **Data Sourcing & ETL --** Once the business problem is understood in detail, the information gained during the interviews is used to source the correct data for training your model(s), generally from an enterprise database.
- **Exploratory Data Analysis (EDA) --** EDA is where you analyze the raw data. Your goal is to explore and assess the quality of the data: identify missing values, inspect feature distributions, check correlations, etc.
- **Data Preparation --** Now it is time to prepare the data for model training. This includes things like dividing the data into training and testing sets, feature encoding (e.g., one-hot encoding, target encoding), feature engineering and selection, etc.
- **Model Training & Selection --** This is the step everyone is excited about. It involves training a variety of models, tuning hyperparameters, model ensembling, evaluating performance metrics, model analysis (e.g., AUC, confusion matrix, residuals), and finally selecting the single best model to be deployed in production for business use.
- **Deployment & Monitoring --** This is the final step, which is mostly concerned with MLOps. It includes things like packaging your final model, creating a Docker image, writing a scoring script, and then making it all work together. Finally, you publish it as an API that can be used to obtain predictions on new data coming through your pipeline.
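
To make the Data Preparation step a little more concrete, here is a minimal, standard-library sketch of a train/test split and one-hot encoding. The toy rows, the `train_test_split` and `one_hot` helpers, and the 80/20 split are all assumptions made for illustration; a real project would more likely use a library such as scikit-learn or pandas.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=1):
    """Shuffle a copy of `rows` and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def one_hot(value, categories):
    """Encode `value` as a 0/1 list with one slot per known category."""
    return [1 if value == category else 0 for category in categories]


# Hypothetical toy data: (colour, label) pairs.
rows = [("red", 0), ("green", 1), ("blue", 1), ("red", 0), ("green", 0),
        ("blue", 1), ("red", 1), ("green", 0), ("blue", 0), ("red", 1)]
categories = sorted({colour for colour, _ in rows})  # ['blue', 'green', 'red']

train_rows, test_rows = train_test_split(rows)
train_features = [one_hot(colour, categories) for colour, _ in train_rows]

print(len(train_rows), len(test_rows))  # 8 2
print(one_hot("green", categories))     # [0, 1, 0]
```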

## Provided Files

`MLproject`

In [None]:
name: MNIST Classifier

entry_points:
  main:
    parameters:
      batch_size_train: {type: int, default: 64}
      batch_size_test: {type: int, default: 10000}
      n_epochs: {type: int, default: 3}
      learning_rate: {type: float, default: 0.01}
      momentum: {type: float, default: 0.5}
      log_interval: {type: int, default: 10}
      random_seed: {type: int, default: 1}
    command: 'python main.py --batch_size_train={batch_size_train} --batch_size_test={batch_size_test}
                                      --n_epochs={n_epochs} --learning_rate={learning_rate} --momentum={momentum}
                                      --log_interval={log_interval} --random_seed={random_seed}'
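
The `command` entry is a template: each `{name}` placeholder is filled with the value of the matching parameter, falling back to the declared default when no value is supplied. The sketch below mimics that substitution with plain string formatting. It is a simplified model for illustration only, not MLflow's actual implementation (which also validates types and quotes values for the shell), and `render_command` is a hypothetical helper.

```python
# Defaults copied from the MLproject file above.
DEFAULTS = {
    "batch_size_train": 64,
    "batch_size_test": 10000,
    "n_epochs": 3,
    "learning_rate": 0.01,
    "momentum": 0.5,
    "log_interval": 10,
    "random_seed": 1,
}

# The command template from the MLproject file, joined onto one line.
COMMAND = (
    "python main.py --batch_size_train={batch_size_train} "
    "--batch_size_test={batch_size_test} --n_epochs={n_epochs} "
    "--learning_rate={learning_rate} --momentum={momentum} "
    "--log_interval={log_interval} --random_seed={random_seed}"
)


def render_command(overrides=None):
    """Fill the template, letting `overrides` replace declared defaults."""
    params = {**DEFAULTS, **(overrides or {})}
    return COMMAND.format(**params)


# No overrides: every placeholder takes its default value.
print(render_command())
# Two overrides: only n_epochs and learning_rate change.
print(render_command({"n_epochs": 5, "learning_rate": 0.05}))
```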

`main.py`

In [None]:
import tempfile

import torch
import torch.nn.functional as F
import torch.optim as optim
import torchvision
import mlflow
import click

from net import Net


def load_data(batch_size_train, batch_size_test):
    tmpdir = tempfile.mkdtemp()

    train_loader = torch.utils.data.DataLoader(
        torchvision.datasets.MNIST(
            tmpdir, train=True, download=True,
            transform=torchvision.transforms.Compose([
                torchvision.transforms.ToTensor(),
                torchvision.transforms.Normalize(
                    (0.1307,), (0.3081,))
            ])
        ),
        batch_size=batch_size_train, shuffle=True
    )

    test_loader = torch.utils.data.DataLoader(
        torchvision.datasets.MNIST(
            tmpdir, train=False, download=True,
            transform=torchvision.transforms.Compose([
                torchvision.transforms.ToTensor(),
                torchvision.transforms.Normalize(
                    (0.1307,), (0.3081,))
            ])
        ),
        batch_size=batch_size_test, shuffle=True
    )

    mlflow.log_artifacts(f'{tmpdir}/MNIST', 'mnist_data')
    return train_loader, test_loader


def train(network, optimizer, train_loader, epoch, log_interval):
    tmpdir = tempfile.mkdtemp()

    network.train()
    for batch, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = network(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()

        if batch % log_interval == 0:
            mlflow.log_metric(f'Training Loss - Epoch {epoch}', loss.item(), batch * len(data))

            n_complete = batch * len(data)
            n_total = len(train_loader.dataset)
            pct_complete = 100. * batch / len(train_loader)
            print(f'Train Epoch: {epoch} [{n_complete}/{n_total} ({pct_complete:.0f}%)]\tLoss: {loss.item():.6f}')

            torch.save(network.state_dict(), f'{tmpdir}/model.pth')
            torch.save(optimizer.state_dict(), f'{tmpdir}/optimizer.pth')

    mlflow.log_artifacts(tmpdir, 'model_state')


def test(network, test_loader, epoch):
    network.eval()
    test_loss = 0
    correct = 0

    with torch.no_grad():
        for data, target in test_loader:
            output = network(data)
            # Sum (rather than average) the per-sample losses; the total is divided by n_total below.
            test_loss += F.nll_loss(output, target, reduction='sum').item()
            pred = output.data.max(1, keepdim=True)[1]
            correct += pred.eq(target.data.view_as(pred)).sum()

    n_total = len(test_loader.dataset)
    test_loss /= n_total
    pct_correct = 100. * correct / n_total

    mlflow.log_metric('Test Accuracy', pct_correct.item(), epoch)
    mlflow.log_metric('Average Test Loss', test_loss, epoch)
    print(f'\nTest set: Avg. loss: {test_loss:.4f}, Accuracy: {correct}/{n_total} ({pct_correct:.0f}%)\n')


@click.command(help='Trains a neural network to classify images from the MNIST data set')
@click.option('--batch_size_train', default=64)
@click.option('--batch_size_test', default=10000)
@click.option('--n_epochs', default=3)
@click.option('--learning_rate', default=0.01)
@click.option('--momentum', default=0.5)
@click.option('--log_interval', default=10)
@click.option('--random_seed', default=1)
def train_network(batch_size_train, batch_size_test, n_epochs, learning_rate, momentum, log_interval, random_seed):
    with mlflow.start_run():
        train_loader, test_loader = load_data(batch_size_train, batch_size_test)

        torch.backends.cudnn.enabled = False
        torch.manual_seed(random_seed)

        network = Net()
        optimizer = optim.SGD(network.parameters(), lr=learning_rate, momentum=momentum)

        test(network, test_loader, 0)
        for epoch in range(1, n_epochs + 1):
            train(network, optimizer, train_loader, epoch, log_interval)
            test(network, test_loader, epoch)


if __name__ == '__main__':
    train_network()
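
The `test` function above accumulates the negative log-likelihood of every sample, divides by the dataset size to get the average loss, and counts a prediction as correct when the highest-scoring class matches the target. The following standard-library sketch reproduces that arithmetic on two made-up samples so the logged metrics are easier to interpret. The logits and labels are hypothetical, and the helpers only imitate what `F.log_softmax` and `F.nll_loss` compute; they are not PyTorch's implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw class scores."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_sum for v in logits]


def nll(log_probs, target):
    """Negative log-likelihood of the target class for one sample."""
    return -log_probs[target]


# Hypothetical logits for two samples over three classes, with true labels.
batch = [([2.0, 1.0, 0.1], 0), ([0.5, 2.5, 0.3], 2)]

total_loss = 0.0
correct = 0
for logits, target in batch:
    log_probs = log_softmax(logits)
    total_loss += nll(log_probs, target)  # summed, as in test()
    predicted = max(range(len(log_probs)), key=log_probs.__getitem__)
    correct += int(predicted == target)

average_loss = total_loss / len(batch)   # 'Average Test Loss'
accuracy = 100.0 * correct / len(batch)  # 'Test Accuracy' in percent

print(f"Avg. loss: {average_loss:.4f}, Accuracy: {correct}/{len(batch)} ({accuracy:.0f}%)")
```

The second sample is misclassified with high confidence, so it contributes most of the average loss even though the accuracy only drops to 50%.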

`net.py`

In [None]:
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    # noinspection PyTypeChecker
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)
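
The `320` in `fc1` and in `x.view(-1, 320)` follows from the image size and the layer settings. The sketch below walks through that arithmetic, assuming the standard 28x28 MNIST input and the defaults used in `Net` (stride 1, no padding for the convolutions; 2x2 max pooling). `conv_out` and `pool_out` are hypothetical helpers written for this illustration.

```python
def conv_out(size, kernel_size, stride=1, padding=0):
    """Output side length of a square convolution."""
    return (size + 2 * padding - kernel_size) // stride + 1


def pool_out(size, kernel_size):
    """Output side length of non-overlapping max pooling."""
    return size // kernel_size


side = 28                              # MNIST images are 28x28 pixels
side = pool_out(conv_out(side, 5), 2)  # conv1 (28 -> 24), pool (24 -> 12)
side = pool_out(conv_out(side, 5), 2)  # conv2 (12 -> 8), pool (8 -> 4)

conv2_channels = 20                    # out_channels of conv2
flattened = conv2_channels * side * side

print(side, flattened)  # 4 320
```

If you change a kernel size, the number of channels, or the input resolution, the same calculation gives the new input size for `fc1`.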

## Using MLflow

Now we can finally begin working with MLflow. You first need to create an experiment which will be used to organize the runs of your project. Navigate to the `03 - MLflow` directory in your terminal, and run the following to create a new experiment.

In [None]:
mlflow experiments create --experiment-name mnist

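Next, start a local server that hosts an interactive, web-based UI for monitoring and comparing the results of your experiment runs. The command keeps running in the foreground, so launch it in a separate terminal; by default the UI is served at http://127.0.0.1:5000.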

In [None]:
mlflow ui

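Then you can run your experiment. The command below does so using the default values for all parameters defined in the `MLproject` file. The `--no-conda` flag runs the project in your current environment instead of creating a new conda environment; newer MLflow releases use `--env-manager=local` for the same purpose. To record the run under the `mnist` experiment created above rather than the default experiment, add `--experiment-name mnist`.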

In [None]:
mlflow run . --no-conda

To run your experiment with non-default parameter values, pass one `-P` flag per parameter, each followed by a `name=value` pair. Try re-running your experiment, changing the values of one or more parameters defined in the `MLproject` file.

In [None]:
mlflow run . -P <PARAMETER_NAME>=<PARAMETER_VALUE> --no-conda
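
When you want to compare several parameter combinations, typing each command by hand gets tedious. As a rough sketch, the script below builds one `mlflow run` command per combination of a small grid, reusing only the flags shown above. The grid values and the `build_commands` helper are assumptions made for illustration, and the script only prints the commands rather than executing them.

```python
import itertools
import shlex

# Hypothetical grid over two parameters declared in the MLproject file.
grid = {"learning_rate": [0.01, 0.1], "momentum": [0.5, 0.9]}


def build_commands(grid):
    """Return one argv list per combination of parameter values."""
    names = list(grid)
    commands = []
    for values in itertools.product(*(grid[name] for name in names)):
        argv = ["mlflow", "run", "."]
        for name, value in zip(names, values):
            argv += ["-P", f"{name}={value}"]  # one -P flag per parameter
        argv.append("--no-conda")
        commands.append(argv)
    return commands


commands = build_commands(grid)
for argv in commands:
    print(shlex.join(argv))
```

To actually launch the runs, each `argv` list could be passed to `subprocess.run(argv, check=True)` from the project directory.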