```
From: https://github.com/ksatola
Version: 0.0.1

TODOs
1. 

```

# MLflow

## Table of contents

- [Jupyter Lab and Python Environment Setup](#toc00)
- [MLflow and machine learning engineering practices](#toc01)
- [Simple MLflow example](#toc02)
- [MLflow project to train a classifier](#toc03)

Resources:
- https://mlflow.org/
- https://scikit-learn.org/stable/datasets/toy_dataset.html
- https://devopscube.com/run-docker-in-docker/

---
<a id='toc00'></a>

## Jupyter Lab and Python Environment Setup
```
# -----
# In WSL Terminal
IMAGE_NAME='ksatola/ubuntu-python-dev-base' \
CONTAINER_NAME='ubuntu-python-dev-base-mlflow'

# !!! In order to be able to connect to Docker on WSL, the second line (starting with -v) is needed
# https://devopscube.com/run-docker-in-docker/
docker run -d -t -P \
    -v /var/run/docker.sock:/var/run/docker.sock \
    --name $CONTAINER_NAME \
    --mount src='/home/ksatola/git',target='/root/git',type=bind \
    $IMAGE_NAME

# Connect to the container with VSC with Remote Explorer

# -----
# In the container Terminal

# -----
# Install miniconda (can be only done manually) - MLflow needs it
# https://docs.conda.io/en/latest/miniconda.html

# Download the latest shell script
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

# Make the miniconda installation script executable
chmod +x Miniconda3-latest-Linux-x86_64.sh

# Run miniconda installation script
./Miniconda3-latest-Linux-x86_64.sh

# Reload your shell
exec "$SHELL" # Or just restart your terminal

# Cleanup
rm Miniconda3-latest-Linux-x86_64.sh

# Create and activate a conda environment
#conda create -n newenv


# -----
# MLFlow Installation
pip install mlflow \
    scikit-learn \
    matplotlib \
    pandas_datareader

# -----
# Install Docker for MLflow experiments
# https://askubuntu.com/questions/1030179/package-docker-ce-has-no-installation-candidate-in-18-04

apt update
apt upgrade

apt-get install -y \
    apt-transport-https \
    ca-certificates \
    curl \
    gnupg \
    lsb-release \
    software-properties-common

echo \
  "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu \
  $(lsb_release -cs) stable" | tee /etc/apt/sources.list.d/docker.list > /dev/null

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg

apt update
    
#apt install -y software-properties-common
#curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add -

add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu `lsb_release -cs` test"
apt update
apt install -y docker-ce
docker -v

# Check if there is connection to the WSL Docker
docker images


# -----
# Docker compose
# Docker Compose is yet another useful Docker tool. It allows users to launch, execute, communicate, 
# and close containers with a single coordinated command. Essentially, Docker Compose is used 
# for defining and running multi-container Docker applications.
# https://phoenixnap.com/kb/install-docker-compose-on-ubuntu-20-04
apt update
apt upgrade

apt install docker-compose
ln -s /usr/local/bin/docker-compose /usr/bin/docker-compose
#chmod +x /usr/local/bin/docker-compose

/usr/local/bin/docker-compose --version

#"PATH=$PATH:/home/user/.local/bin" docker-compose
export PATH="/usr/local/bin/:$PATH"

#apt install curl

#curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
#chmod +x /usr/local/bin/docker-compose

docker-compose --version

# -----

pip install jupyterlab

# Run in the container
jupyter lab --no-browser --allow-root

# -----
# Install Git
# https://git-scm.com/book/en/v2/Getting-Started-Installing-Git
apt update
apt install git-all

```

---
<a id='toc01'></a>

## MLflow and machine learning engineering practices

Implementing a product based on machine learning can be a laborious task. There is a general need to `reduce the friction between different steps of the machine learning development life cycle and between the teams of data scientists and engineers` that are involved in the process.

Machine learning practitioners such as data scientists and machine learning engineers operate with different systems, standards, and tools. While data scientists spend most of their time developing models in tools such as Jupyter Notebook, when running in production, the model is deployed in the context of a software application with an environment that's more demanding in terms of scale and reliability.

**MLflow** is an open source platform (created at Databricks) for the machine learning (ML) life cycle, with a focus on `reproducibility`, `training`, and `deployment`. It is based on an open interface design and is able to work with any language or platform, with clients in Python and Java, and is accessible through a REST API. Scalability is also an important benefit that an ML developer can leverage with MLflow.

**MLflow** enables an everyday practitioner in one platform to manage the ML life cycle, from iteration on model development up to deployment in a reliable and scalable environment that is compatible with modern software system requirements.

**MLflow** modules are software components that deliver the core features that aid in the different phases of the ML life cycle. MLflow features are delivered through modules, extensible components that organize related features in the platform. The following are the built-in modules in MLflow:
- **MLflow Tracking:** Provides a mechanism and UI to handle metrics and artifacts generated by ML executions (training and inference).
- **MLflow Projects:** A package format to standardize ML projects. MLflow Projects support three environments: Conda, Docker, and the local system.
- **MLflow Models:** A standard format for packaging models so that they can be deployed to different types of environments, both on-premises and in the cloud.
- **MLflow Model Registry:** A module that manages models in MLflow and their life cycle, including state.

---
<a id='toc02'></a>

## Simple MLflow example

In [1]:
# Load dataset

from sklearn import datasets
from sklearn.model_selection import train_test_split

dataset = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.4)

In [2]:
# Train a model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

clf = LogisticRegression()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        21
           1       0.82      1.00      0.90        18
           2       1.00      0.81      0.89        21

    accuracy                           0.93        60
   macro avg       0.94      0.94      0.93        60
weighted avg       0.95      0.93      0.93        60



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [3]:
# The same with MLflow
import mlflow
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Log the experiment in the local directory (see mlruns folder)
mlflow.sklearn.autolog()
with mlflow.start_run():
    clf = LogisticRegression()
    clf.fit(X_train, y_train)
    
    y_pred = clf.predict(X_test)

    print(classification_report(y_test, y_pred))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


              precision    recall  f1-score   support

           0       1.00      1.00      1.00        21
           1       0.82      1.00      0.90        18
           2       1.00      0.81      0.89        21

    accuracy                           0.93        60
   macro avg       0.94      0.94      0.93        60
weighted avg       0.95      0.93      0.93        60



In [4]:
import mlflow
print('MLflow version: {}'.format(mlflow.__version__))

import sklearn
print('sklearn version: {}'.format(sklearn.__version__))

import matplotlib
print('matplotlib version: {}'.format(matplotlib.__version__))

MLflow version: 1.20.2
sklearn version: 1.0
matplotlib version: 3.4.3


---
<a id='toc03'></a>

## MLflow project to train a classifier
The task in this illustrative project is to create a basic MLflow project and produce a working baseline ML model to predict, based on market signals over a certain number of days, whether the stock market will go up or down. 

We will use the Yahoo Finance quotes for the BTC-USD pair (https://finance.yahoo.com/quote/BTC-USD/) over a period of 3 months. We will train a model to predict whether the quote will go up or not on a given day. A REST API will be made available for predictions through MLflow.

In [5]:
# Create mlflow_projects folder
# Create project name folder (stockpred) inside the mlflow_projects

In [6]:
import os

_path = '~/git/MLOps/notebooks/mlflow_projects/'
_path

'~/git/MLOps/notebooks/mlflow_projects/'

In [7]:
# Add MLProject file
_mlproject_file = os.path.join(_path, "stockpred/MLProject")
_mlproject_file

'~/git/MLOps/notebooks/mlflow_projects/stockpred/MLProject'

In [8]:
%%writefile {_mlproject_file}
name: stockpred

#conda_env: conda.yaml
#docker_env:
#  image:  ksatola/miniconda3

entry_points:
  main:
    command: "python train.py"

Overwriting /root/git/MLOps/notebooks/mlflow_projects/stockpred/MLProject


In [9]:
_dockerfile = os.path.join(_path, "stockpred/Dockerfile")
_dockerfile

'~/git/MLOps/notebooks/mlflow_projects/stockpred/Dockerfile'

The Docker image is based on the open source package **Miniconda**, a free minimal installer with a minimal set of packages for data science that allows us to control the details of the packages that we need in our environment.

In [10]:
%%writefile {_dockerfile}
#FROM continuumio/miniconda:4.5.4
FROM continuumio/miniconda3

RUN pip install mlflow \
    && pip install numpy \
    && pip install scipy \
    && pip install pandas \
    && pip install scikit-learn \
    && pip install cloudpickle \
    && pip install "pandas_datareader>=0.8.0"

#RUN pip install mlflow==1.2.0 \
#    && pip install numpy==1.14.3 \
#    && pip install scipy \
#    && pip install pandas==0.22.0 \
#    && pip install scikit-learn==0.20.4 \
#    && pip install cloudpickle \
#    && pip install "pandas_datareader>=0.8.0"


Overwriting /root/git/MLOps/notebooks/mlflow_projects/stockpred/Dockerfile


In [11]:
import warnings

import numpy as np
import datetime
import pandas_datareader.data as web
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

import mlflow.sklearn

### Get data
The format of the data acquired is the classic format for financial securities in exchange APIs. For every day of the period, we retrieve the following data: the highest value of the stock, the lowest, opening, and close values of the stock, as well as the volume. The final column represents the adjusted close value, the value after dividends.

In [12]:
def acquire_training_data():
    start = datetime.datetime(2019, 7, 1)
    end = datetime.datetime(2019, 9, 30)
    df = web.DataReader("BTC-USD", 'yahoo', start, end)
    return df

### Make the data usable by scikit-learn
Transform the raw data into a feature vector using the rolling window technique. The feature vector for each day becomes the deltas between the current and previous window days. In this case, we use the previous day's market movement (1 for a stock going up, 0 otherwise).
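
To make the alignment between windows and labels concrete, here is a minimal pure-Python sketch of the same idea, assuming made-up prices and a window of 3 days. The helper names `label_movements` and `make_windows` are hypothetical and are not part of the project code, which relies on NumPy strides instead.

```python
def label_movements(opens, closes):
    """Return 1 for each day that closed above its open, 0 otherwise."""
    return [1 if close - open_ > 0 else 0 for open_, close in zip(opens, closes)]


def make_windows(labels, window):
    """Pair each run of `window` consecutive labels with the label of the following day."""
    features, targets = [], []
    for i in range(len(labels) - window):
        features.append(labels[i:i + window])
        targets.append(labels[i + window])
    return features, targets


# Made-up prices, for illustration only
opens = [10, 11, 12, 11, 13, 12]
closes = [11, 10, 13, 12, 12, 14]

labels = label_movements(opens, closes)
X, y = make_windows(labels, window=3)

print(labels)  # [1, 0, 1, 1, 0, 1]
print(X)       # [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
print(y)       # [1, 0, 1]
```

Each row of `X` holds the movements of the days preceding the day whose movement is stored at the same position in `y`, so the first `window` days produce no sample.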

In [13]:
def digitize(n):
    if n > 0:
        return 1
    return 0

In [14]:
def rolling_window(a, window):
    """
        Takes np.array 'a' and size 'window' as parameters
        Outputs an np.array with all the ordered sequences of values of 'a' of size 'window'
        e.g. Input: ( np.array([1, 2, 3, 4, 5, 6]), 4 )
             Output: 
                     array([[1, 2, 3, 4],
                           [2, 3, 4, 5],
                           [3, 4, 5, 6]])
    """
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

In [15]:
def prepare_training_data(data):
    """
        Return a prepared pandas DataFrame
        input : Dataframe with expected schema
    """
    data['Delta'] = data['Close'] - data['Open']
    data['to_predict'] = data['Delta'].apply(lambda d: digitize(d))
    return data

### Train and store your model in **MLflow**
The `mlflow.sklearn.log_model(clf, "model_random_forest")` method takes care of persisting the model upon training. We are explicitly asking **MLflow** to log the model and the metrics that we find relevant. This flexibility in the items to log allows one program to log multiple models into **MLflow**.

In [16]:
# Add train.py file
_train_file = os.path.join(_path, "stockpred/train.py")
_train_file

'~/git/MLOps/notebooks/mlflow_projects/stockpred/train.py'

In [20]:
%%writefile {_train_file}
import warnings

import numpy as np
import datetime
import pandas_datareader.data as web
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

import mlflow.sklearn


def acquire_training_data():
    start = datetime.datetime(2019, 7, 1)
    end = datetime.datetime(2019, 9, 30)
    df = web.DataReader("BTC-USD", 'yahoo', start, end)
    return df

def digitize(n):
    if n > 0:
        return 1
    return 0


def rolling_window(a, window):
    """
        Takes np.array 'a' and size 'window' as parameters
        Outputs an np.array with all the ordered sequences of values of 'a' of size 'window'
        e.g. Input: ( np.array([1, 2, 3, 4, 5, 6]), 4 )
             Output: 
                     array([[1, 2, 3, 4],
                           [2, 3, 4, 5],
                           [3, 4, 5, 6]])
    """
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)


def prepare_training_data(data):
    """
        Return a prepared pandas DataFrame
        input : Dataframe with expected schema
    """
    data['Delta'] = data['Close'] - data['Open']
    data['to_predict'] = data['Delta'].apply(lambda d: digitize(d))
    return data

WINDOW_SIZE = 14

if __name__ == "__main__":
    #mlflow.set_tracking_uri("file:////root/git/MLOps/notebooks/mlflow_projects/stockpred")
    #with mlflow.run(uri='/root/git/MLOps/notebooks/mlflow_projects/stockpred/', experiment_name='aaa') as run:
    #with mlflow.start_run(run_name='myrun') as run:
    
    mlflow.set_tracking_uri("http://localhost:5000")
    
    with mlflow.start_run() as run:
        training_data = acquire_training_data()
        prepared_training_data_df = prepare_training_data(training_data)

        btc_mat = prepared_training_data_df.to_numpy()

        X = rolling_window(btc_mat[:, 7], WINDOW_SIZE)[:-1, :]
        Y = prepared_training_data_df['to_predict'].to_numpy()[WINDOW_SIZE:]

        X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=4284, stratify=Y)

        params = {
            "bootstrap": True,
            "criterion": "gini",
            "min_samples_split": 2,
            "min_weight_fraction_leaf": 0.0,
            "n_estimators": 50,
            "random_state": 4284,
            "verbose": 0,
        }

        clf = RandomForestClassifier(**params)
        clf.fit(X_train, y_train)
        predicted = clf.predict(X_test)

        mlflow.log_params(params)
        #mlflow.sklearn.log_model(clf, artifact_path="sklearn-model")
        #model_uri = "runs:/{}/sklearn-model".format(run.info.run_id)

        #mv = mlflow.register_model(model_uri, "RandomForestRegressionModel")
        #print("Name: {}".format(mv.name))
        #print("Version: {}".format(mv.version))

        print(classification_report(y_test, predicted))

        mlflow.sklearn.log_model(clf, "model_random_forest")
        mlflow.log_metric("precision_label_0", precision_score(y_test, predicted, pos_label=0))
        mlflow.log_metric("recall_label_0", recall_score(y_test, predicted, pos_label=0))
        mlflow.log_metric("f1score_label_0", f1_score(y_test, predicted, pos_label=0))
        mlflow.log_metric("precision_label_1", precision_score(y_test, predicted, pos_label=1))
        mlflow.log_metric("recall_label_1", recall_score(y_test, predicted, pos_label=1))
        mlflow.log_metric("f1score_label_1", f1_score(y_test, predicted, pos_label=1))


Overwriting /root/git/MLOps/notebooks/mlflow_projects/stockpred/train.py
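
As a rough sanity check on the dataset sizes that `train.py` works with, the sketch below counts the samples. It assumes that the data source returns exactly one row per calendar day (the actual row count may differ slightly) and that scikit-learn rounds a fractional test size up.

```python
import datetime
import math

WINDOW_SIZE = 14  # same value as in train.py
TEST_SIZE = 0.25  # same value as passed to train_test_split

start = datetime.date(2019, 7, 1)
end = datetime.date(2019, 9, 30)

# Assumption: one row per calendar day, both endpoints included
n_days = (end - start).days + 1

# Every sample needs WINDOW_SIZE preceding days, so the first WINDOW_SIZE days yield no sample
n_samples = n_days - WINDOW_SIZE

# Assumption: the test share is rounded up and the remainder goes to training
n_test = math.ceil(TEST_SIZE * n_samples)
n_train = n_samples - n_test

print(n_days, n_samples, n_train, n_test)  # 92 78 58 20
```

Under these assumptions the test set holds 20 samples, which agrees with the total support of 20 in the classification report printed by `train.py`.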


### Test MLflow

In [18]:
os.chdir('/root/git/MLOps/notebooks/mlflow_projects/stockpred')
os.getcwd()

'/root/git/MLOps/notebooks/mlflow_projects/stockpred'

```
# Start MLflow server in the Terminal
cd /root/git/MLOps/notebooks/mlflow_projects/stockpred

mlflow server \
    --backend-store-uri sqlite:///mlflow.db \
    --default-artifact-root ./mlruns \
    --host 0.0.0.0
```

In [19]:
!python train.py

              precision    recall  f1-score   support

           0       0.61      1.00      0.76        11
           1       1.00      0.22      0.36         9

    accuracy                           0.65        20
   macro avg       0.81      0.61      0.56        20
weighted avg       0.79      0.65      0.58        20
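
The figures in this report can be reproduced by hand. The sketch below assumes the confusion counts that the report implies: all 11 samples of class 0 predicted as 0, and only 2 of the 9 samples of class 1 predicted as 1. These counts are inferred from the rounded report, not read from a logged artifact, and the helper `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Class 0: 11 correct, plus 7 samples of class 1 wrongly predicted as 0
p0, r0, f0 = precision_recall_f1(tp=11, fp=7, fn=0)

# Class 1: 2 correct, 7 missed, no false alarms
p1, r1, f1_1 = precision_recall_f1(tp=2, fp=0, fn=7)

accuracy = (11 + 2) / 20

print(round(p0, 2), round(r0, 2), round(f0, 2))    # 0.61 1.0 0.76
print(round(p1, 2), round(r1, 2), round(f1_1, 2))  # 1.0 0.22 0.36
print(accuracy)                                    # 0.65
```

Read this way, the baseline almost always predicts class 0: it catches every "down" day but only 2 of the 9 "up" days.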



```
# In the Terminal

cd /root/git/MLOps/notebooks/mlflow_projects/stockpred
mlflow server --backend-store-uri sqlite:///:memory --default-artifact-root ./mlruns

# Serve model (builds conda environment for this)
mlflow models serve -m ./mlruns/0/4f280fc7c70a432ba9e6902d2d634425/artifacts/model_random_forest/

```

### Build your Docker image
```
cd ~/git/MLOps/notebooks/mlflow_projects/stockpred

docker build \
    -f 'Dockerfile' \
    -t ksatola/miniconda3 .
    
#docker push ksatola/miniconda3
```

### Run MLflow experiment inside Docker container
```
#docker login

#mlflow ui --backend-store-uri file:///~/git/MLOps/notebooks/mlflow_projects/stockpred/mlruns
#mlflow.set_tracking_uri("file:///~/git/MLOps/notebooks/mlflow_projects/stockpred/mlruns")

#mlflow run .
```

In [None]:
#IMAGE_NAME='ksatola/miniconda:4.5.4' \
IMAGE_NAME='ksatola/miniconda3' \
CONTAINER_NAME='ksatola-mlflow-exp-test'

# !!! In order to be able to connect to Docker on WSL, the second line (starting with -v) is needed
# https://devopscube.com/run-docker-in-docker/
docker run -d -t -P \
    -v /var/run/docker.sock:/var/run/docker.sock \
    --name $CONTAINER_NAME \
    --mount src='/root/git',target='/root/git',type=bind \
    $IMAGE_NAME
    
docker run -d -t -P \
    -v /var/run/docker.sock:/var/run/docker.sock \
    --name $CONTAINER_NAME \
    $IMAGE_NAME
    
docker run \
    --mount src='/root/git',target='/root/git',type=bind \
    -it ksatola/miniconda3

In [None]:
#export MLFLOW_TRACKING_URI='~/git/MLOps/notebooks/mlflow_projects/stockpred/mlruns'
#export MLFLOW_TRACKING_URI='~/git/MLOps/notebooks/mlflow_projects/stockpred'

#export MLFLOW_CONDA_HOME

#mlflow ui --backend-store-uri $MLFLOW_TRACKING_URI

#mlflow run "https://github.com/PacktPublishing/Machine-Learning-Engineering-with-MLflow/tree/master/Chapter01/stockpred"

In [None]:
import random

import mlflow
import mlflow.pyfunc

class RandomPredictor(mlflow.pyfunc.PythonModel):

    def __init__(self):
        pass

    # Implement the heuristic model in MLflow
    def predict(self, context, model_input):
        return model_input.apply(lambda column: random.randint(0,1))

# Save the model in MLflow
model_path = "random_model"
baseline_model = RandomPredictor()
mlflow.pyfunc.save_model(path=model_path, python_model=baseline_model)

In [None]:
Run your mlflow job:
mlflow run .

Start the serving API:
mlflow models serve -m ./mlruns/0/b9ee36e80a934cef9cac3a0513db515c/artifacts/random_model/

Test the API of your model.
You have access to a very simple Flask server that can run your model. You can test the execution by running a curl command against your server:
curl http://127.0.0.1:5000/invocations -H 'Content-Type:application/json' -d '{"data":[[1,1,1,1,0,1,1,1,0,1,1,1,0,0]]}'
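
The same request can also be built from Python. This is a minimal sketch using only the standard library, assuming the model server from the previous step listens on `127.0.0.1:5000` and accepts the `{"data": [...]}` payload shown in the curl command. The helper name `build_invocation_request` is hypothetical, and the network call itself is left commented out so that the snippet runs without a server.

```python
import json
import urllib.request


def build_invocation_request(rows, url="http://127.0.0.1:5000/invocations"):
    """Build a POST request carrying `rows` in the same JSON shape as the curl example."""
    body = json.dumps({"data": rows}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# One window of 14 previous-day movements, copied from the curl command
window = [1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0]
request = build_invocation_request([window])

print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# With the server running, the prediction could be fetched like this:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(response.read().decode("utf-8"))
```

The window has 14 entries because the model was trained with `WINDOW_SIZE = 14`, so each input row must carry the same number of features.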

## Data Science Workbench
In order to address common frictions for developing models in data science we need to provide data scientists and practitioners with a standardized environment in which they can develop and manage their work. A data science workbench should allow you to quick-start a project, and the availability of an environment with a set of starting tools and frameworks allows data scientists to rapidly jump-start a project.

<img src="images/ds_workbench.png" alt="" style="width: 600px;"/>

We need the following core features in our data science workbench:
- **Dependency Management:** Having dependency management built into your local environment helps in handling reproducibility issues and preventing library conflicts between different environments. This is generally achieved by using environment managers such as Docker or having environment management frameworks available in your programming language. MLflow provides this through the support of Docker- or Conda-based environments.
- **Data Management:** Managing data in a local environment can be complex and daunting if you have to handle huge datasets. Having a standardized definition of how you handle data in your local projects allows others to freely collaborate on your projects and understand the structures available.
- **Model Management:** Having the different models organized and properly stored provides an easy structure to be able to work through many ideas at the same time and persist the ones that have potential. MLflow helps support this through the model format abstraction and Model Registry component to manage models.
- **Deployment:** Having a development environment aligned with the production environment where the model will be serviced requires deliberation in the local environment. The production environment needs to be ready to receive a model from a model developer, with the least possible friction. This smooth deployment workflow is only possible if the local environment is engineered correctly.
- **Experimentation Management:** Tweaking parameters is the most common thing that a machine learning practitioner does. Being able to keep abreast of the different versions and specific parameters can quickly become cumbersome for the model developer.
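
To illustrate why experimentation management gets cumbersome quickly, the sketch below enumerates the runs implied by a small parameter grid. The parameter names mirror those used in `train.py`, but the candidate values are made up for illustration.

```python
import itertools

# Hypothetical search space: the keys come from train.py, the values are invented
search_space = {
    "n_estimators": [50, 100, 200],
    "criterion": ["gini", "entropy"],
    "min_samples_split": [2, 4],
}

names = list(search_space)
runs = [
    dict(zip(names, combination))
    for combination in itertools.product(*search_space.values())
]

print(len(runs))  # 12
print(runs[0])    # {'n_estimators': 50, 'criterion': 'gini', 'min_samples_split': 2}
```

Three parameters with two or three candidate values each already yield 12 distinct runs, each with its own metrics and model to keep track of.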

We will have the following components in the architecture of our development environment:
- **Docker/Docker Compose:** Docker will be used to handle each of the main component dependencies of the architecture, and Docker Compose will be used as a coordinator between different containers of software pieces. The advantage of having each component of the workbench architecture in Docker is that neither element’s libraries will conflict with the other.
- **JupyterLab:** The de facto environment to develop data science code and analytics in the context of machine learning.
- **MLflow:** MLflow is the cornerstone of the workbench, providing facilities for experiment tracking, model management, a model registry, and a deployment interface.
- **PostgreSQL database:** The PostgreSQL database is part of the architecture at this stage, as the storage layer for MLflow for backend metadata. Other relational databases could be used as the MLflow backend for metadata, but we will use PostgreSQL.

```
# Copy the contents of the project from:
# https://github.com/PacktPublishing/Machine-Learning-Engineering-with-MLflow/tree/master/Chapter03/gradflow

# In the WSL Ubuntu
cd ~/git/gradflow

# Start your local environment
make

# Inspect the created environments (should be 3 containers)
docker ps

# Open: http://localhost:8888/lab/workspaces/auto-L
# Open: http://localhost:5000/#/

# Tear down the environment
make down

```
The usual ports used by your workbench are as follows:
- Jupyter serves on port 8888
- MLflow serves on port 5000
- PostgreSQL serves on port 5432
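
A quick way to confirm that the three services are reachable is to probe their ports. This is a minimal sketch using only the standard library, assuming the containers publish the ports listed above on `127.0.0.1`; the helper `is_port_open` is hypothetical, and a successful TCP connection only shows that something is listening, not that the service is healthy.

```python
import socket

# Ports as listed above
WORKBENCH_PORTS = {"Jupyter": 8888, "MLflow": 5000, "PostgreSQL": 5432}


def is_port_open(host, port, timeout=0.5):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


for name, port in WORKBENCH_PORTS.items():
    status = "up" if is_port_open("127.0.0.1", port) else "not reachable"
    print(f"{name} (port {port}): {status}")
```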

In [None]:
cd ~/git/Machine-Learning-Engineering-with-MLflow-master/Chapter04/gradflow/
make
make down

In [None]:
cd ~/git/Machine-Learning-Engineering-with-MLflow-master/Chapter05/gradflow/
make
make down