<a href="https://colab.research.google.com/github/paudan/sds2022_mlflow_workshop/blob/main/LC4_MLFlow_Model_Registry_Housing_Exe.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center><a target="_blank" href="https://www.sds2022.ch/"><img src="https://drive.google.com/uc?id=1S7k7kTXs9qIylw3C7LA9rHkLycjlY8te" width="500" style="background:none; border:none; box-shadow:none;" /></a> </center>

<center><a target="_blank" href="http://www.sit.academy"><img src="https://drive.google.com/uc?id=1x9_jQgLhozCSWDSaOdVxKmxOEAe_OLgV" width="250" style="background:none; border:none; box-shadow:none;" /></a> </center>

_____

<center> <h1> Live Coding  </h1> </center>

<p style="margin-bottom:1cm;"></p>

_____

<center>SIT Academy, 2022</center>



# MLflow Workshop - Sequence 4 - Model Registry API - Submitting your own runs

The MLflow Model Registry offers a centralized model store, a set of APIs, and a UI for collaboratively managing the full lifecycle of an MLflow model.

It provides model lineage (which MLflow experiment and run produced the model), model versioning, stage transitions (for example from staging to production or archiving), and annotations.

## Install dependencies

In [1]:
!pip -q install mlflow boto3 pyngrok

We need to define an environment variable `MLFLOW_TRACKING_URI` which points to your remote tracking server:

In [7]:
# Local version
# %env MLFLOW_TRACKING_URI=https://6eb0-104-155-206-193.ngrok.io
# Shared server
%env MLFLOW_TRACKING_URI=https://7fc2-34-74-234-173.ngrok.io
#e.g. 
#%env MLFLOW_TRACKING_URI=https://56cb-35-201-159-29.ngrok.io/

env: MLFLOW_TRACKING_URI=https://7fc2-34-74-234-173.ngrok.io


**❗ Make sure the URL above is a valid NGROK link! If you run your own tracking server, replace it with your own URL.**
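A small check can catch an unset variable or a leftover template placeholder before any run is submitted. The sketch below is only an illustration: the helper `check_tracking_uri` is hypothetical (it is not part of MLflow), and the URL is a made-up example.

```python
import os
from urllib.parse import urlparse


def check_tracking_uri(uri):
    """Return the URI without a trailing slash, or raise if it looks unusable."""
    if not uri or "<" in uri or ">" in uri:
        raise ValueError("MLFLOW_TRACKING_URI is unset or still contains a placeholder")
    parsed = urlparse(uri)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Expected an http(s) URL, got: {uri!r}")
    return uri.rstrip("/")


# Hypothetical URL, for illustration only.
os.environ["MLFLOW_TRACKING_URI"] = check_tracking_uri("https://example-1234.ngrok.io/")
print(os.environ["MLFLOW_TRACKING_URI"])
```

Setting the variable through `os.environ` has the same effect for the current Python process as the `%env` magic used above.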

## If you are using a (non-proxied) S3 bucket...

...make sure to set up the credentials first!
(Required on BOTH server-side and client-side.)

Also make sure you have `boto3` installed.

In [3]:
%env AWS_ACCESS_KEY_ID=jwhttmo4qhgfncxxdftqlwpxfaia
%env AWS_SECRET_ACCESS_KEY=jyhyd7bxmbax3w3mzfi55tscfk5t34rotzj3nbg2e4nwpofxwuh2y

%env MLFLOW_S3_ENDPOINT_URL=https://gateway.eu1.storjshare.io/

env: AWS_ACCESS_KEY_ID=jwhttmo4qhgfncxxdftqlwpxfaia
env: AWS_SECRET_ACCESS_KEY=jyhyd7bxmbax3w3mzfi55tscfk5t34rotzj3nbg2e4nwpofxwuh2y
env: MLFLOW_S3_ENDPOINT_URL=https://gateway.eu1.storjshare.io/
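Because the credentials are needed on both sides, it can help to verify that they are set without echoing their values. A minimal sketch, assuming the three variables used in this notebook are the ones that matter for your setup; the helper `missing_s3_settings` and the example environment are hypothetical.

```python
import os

# The three variables set in this notebook. The endpoint URL is only needed
# for S3-compatible stores that are not AWS itself.
S3_SETTINGS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "MLFLOW_S3_ENDPOINT_URL")


def missing_s3_settings(env=None):
    """Return the names of settings that are unset or empty (values are never printed)."""
    env = os.environ if env is None else env
    return [name for name in S3_SETTINGS if not env.get(name)]


# Illustration with a made-up environment instead of the real one.
example_env = {
    "AWS_ACCESS_KEY_ID": "dummy-id",
    "MLFLOW_S3_ENDPOINT_URL": "https://s3.example.invalid/",
}
print(missing_s3_settings(example_env))  # ['AWS_SECRET_ACCESS_KEY']
```

Calling `missing_s3_settings()` without arguments inspects the real environment of the current process.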


Now let's do a test run:

In [4]:
!mlflow run --env-manager=local https://github.com/SIT-Academy/sds2022_mlflow_workshop.git#src/mlproject_simple_run/ -P alpha=0.2

2022/06/22 14:30:57 INFO mlflow.projects.utils: === Fetching project from https://github.com/SIT-Academy/sds2022_mlflow_workshop.git#src/mlproject_simple_run/ into /tmp/tmpq8s02y21 ===
2022/06/22 14:31:01 INFO mlflow.projects.utils: === Created directory /tmp/tmp_w2raeso for downloading remote URIs passed to arguments of type 'path' ===
2022/06/22 14:31:01 INFO mlflow.projects.backend.local: === Running command 'python mlflow_regression_housing.py -r all 0.2 0.5' in run with ID '0ffcd3b0064e4849a114369b7ab692e8' === 
['mlflow_regression_housing.py', '-r', 'all', '0.2', '0.5']
Elasticnet model (alpha=0.200000, l1_ratio=0.500000):
  RMSE: 1184313.9608613164
  MAE: 608519.8997720616
  R2: 0.590015037204457
2022/06/22 14:31:17 INFO mlflow.projects: === Run (ID '0ffcd3b0064e4849a114369b7ab692e8') succeeded ===
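The run output above reports RMSE, MAE and R2. As a reminder of what these numbers mean, here is a small hand-written sketch with NumPy on made-up toy values; the function name `regression_metrics` is hypothetical, and the script further down computes the same quantities with scikit-learn.

```python
import numpy as np


def regression_metrics(actual, pred):
    """Compute RMSE, MAE and R2 from their definitions."""
    actual = np.asarray(actual, dtype=float)
    pred = np.asarray(pred, dtype=float)
    residuals = actual - pred
    rmse = float(np.sqrt(np.mean(residuals ** 2)))          # root of the mean squared error
    mae = float(np.mean(np.abs(residuals)))                 # mean absolute error
    ss_res = float(np.sum(residuals ** 2))                  # unexplained variation
    ss_tot = float(np.sum((actual - actual.mean()) ** 2))   # total variation around the mean
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2


# Toy values: two predictions are off by 1, two are exact.
rmse, mae, r2 = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
print(rmse, mae, r2)  # roughly 0.707, 0.5, 0.9
```

RMSE and MAE are in the units of the target (here: prices), so lower is better, while R2 is unitless and approaches 1 for a good fit.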


### Exercise: 


*   Try adjusting parameters and extending what is already there! How do your new models compare to the rest?
*   Add new entities to runs (e.g. `mlflow.set_tag('some_tag', 'some_value')`), or try logging new artifacts (like plots, e.g. `mlflow.log_figure(fig, 'my_plot.png')`).
*   Submit your own runs & models!



In [8]:
%%writefile mlflow_regression_housing.py

import os
import warnings
import sys

import pandas as pd
import numpy as np

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

import mlflow
import mlflow.sklearn

import logging
logging.basicConfig(level=logging.WARN)
logger = logging.getLogger(__name__)

########
# Command-line args - passed in by MLFlow
print(sys.argv)
model_type = sys.argv[1]
alpha = float(sys.argv[2])
l1_ratio = float(sys.argv[3])
########

"""# Utilities for Data and Metrics"""

def prepare_data():
    #id = 1eNTyJc4jXJMkLPXW0eY6LL7_P9YN1GWO
    warnings.filterwarnings("ignore")
    np.random.seed(42)

    # Read the home price csv file from the URL
    orig_url = "https://drive.google.com/file/d/1eNTyJc4jXJMkLPXW0eY6LL7_P9YN1GWO/view"
    file_id = orig_url.split('/')[-2]
    data_path='https://drive.google.com/uc?export=download&id=' + file_id
    
    try:
        data = pd.read_csv(data_path)
    except Exception:
        logger.exception(
            "Unable to download the training & test CSV, check your internet connection.")
        raise  # without the data there is nothing to train on
    
    # Prices are stored as strings with commas as thousands separators; convert them to numbers
    data["price"] = data["price"].str.replace(',', '')
    data["price"] = pd.to_numeric(data["price"])
    data = data.drop(columns=["Unnamed: 0", "zip"])
    data = data.dropna()

    y = data["price"]
    X = data.drop(columns="price")
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    return X_train, X_test, y_train, y_test


def eval_metrics(actual, pred):
    rmse = np.sqrt(mean_squared_error(actual, pred))
    mae = mean_absolute_error(actual, pred)
    r2 = r2_score(actual, pred)
    return rmse, mae, r2

"""# Load Dataset"""

X_train, X_test, y_train, y_test = prepare_data()

data = {
    'X_train': X_train,
    'X_test': X_test,
    'y_train': y_train,
    'y_test': y_test
}

data['X_train'].head()

data['y_train'].head()

"""# Utilities for Modeling and Tracking Experiments"""

def train_elasticnet(data, alpha=0.5, l1_ratio=0.5):

    # Train and track experiment
    with mlflow.start_run():

        categorical_features = ['type', 'floor', 'city', 'canton']
        continuous_features = ['room_num', 'area_m2', 'floors_num', 'year_built', 'last_refurbishment', 'lat', 'lon']

        numeric_transformer = Pipeline(steps=[("scaler", StandardScaler())])

        categorical_transformer = Pipeline(steps=[("onehot", OneHotEncoder(handle_unknown="ignore"))])

        preprocessor = ColumnTransformer(transformers=[
            ("num", numeric_transformer, continuous_features),
            ("cat", categorical_transformer, categorical_features)])

        # Execute ElasticNet
        lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
        pipeline_lr = Pipeline([("col_transformer", preprocessor), ("estimator", lr)])
        pipeline_lr.fit(data['X_train'], data['y_train'])

        # Evaluate Metrics
        predicted_prices = pipeline_lr.predict(data['X_test'])
        (rmse, mae, r2) = eval_metrics(data['y_test'], predicted_prices)

        # Print out metrics
        print("Elasticnet model (alpha=%f, l1_ratio=%f):" % (alpha, l1_ratio))
        print("  RMSE: %s" % rmse)
        print("  MAE: %s" % mae)
        print("  R2: %s" % r2)

        # Log parameter, metrics, and model to MLflow
        mlflow.set_tag('model_type', 'linear')
        mlflow.log_param('Model', 'ElasticNet')  
        mlflow.log_param("alpha", alpha)
        mlflow.log_param("l1_ratio", l1_ratio)
        
        mlflow.log_metric("rmse", rmse)
        mlflow.log_metric("r2", r2)
        mlflow.log_metric("mae", mae)

        mlflow.sklearn.log_model(pipeline_lr, "model")

def train_random_forest(data, n_trees=100, max_depth=None):

    # Train and track experiment   
    with mlflow.start_run():

        categorical_features = ['type', 'floor', 'city', 'canton']
        continuous_features = ['room_num', 'area_m2', 'floors_num', 'year_built', 'last_refurbishment', 'lat', 'lon']

        numeric_transformer = Pipeline(steps=[("scaler", StandardScaler())])

        categorical_transformer = Pipeline(steps=[("onehot", OneHotEncoder(handle_unknown="ignore"))])

        preprocessor = ColumnTransformer(transformers=[
            ("num", numeric_transformer, continuous_features),
            ("cat", categorical_transformer, categorical_features)])
        
        # Execute RF
        rf = RandomForestRegressor(n_estimators=n_trees, max_depth=max_depth, random_state=42)
        pipeline_rf = Pipeline([("col_transformer", preprocessor), ("estimator", rf)])
        pipeline_rf.fit(data['X_train'], data['y_train'])

        # Evaluate Metrics
        predicted_prices = pipeline_rf.predict(data['X_test'])
        (rmse, mae, r2) = eval_metrics(data['y_test'], predicted_prices)

        # Print out metrics
        print("Random Forest model (n_estimators={}, max_depth={}):".format(n_trees, max_depth))
        print("  RMSE: %s" % rmse)
        print("  MAE: %s" % mae)
        print("  R2: %s" % r2)

        # Log parameter, metrics, and model to MLflow
        mlflow.set_tag('model_type', 'ensemble')
        mlflow.log_param('Model', 'Random Forest')  
        mlflow.log_param("n_estimators", n_trees)
        mlflow.log_param("max_depth", max_depth)
        
        mlflow.log_metric("rmse", rmse)
        mlflow.log_metric("r2", r2)
        mlflow.log_metric("mae", mae)

        mlflow.sklearn.log_model(pipeline_rf, "model")

# Dispatch on the requested model type (short aliases are accepted)
if model_type in ("elastic_net", "en"):
    train_elasticnet(data, alpha, l1_ratio)
elif model_type in ("random_forest", "rf"):
    train_random_forest(data, n_trees=500, max_depth=10)
elif model_type == "all":
    train_random_forest(data, n_trees=100, max_depth=None)
    train_elasticnet(data, alpha, l1_ratio)
else:
    raise ValueError("Unknown model type: %r" % model_type)

Overwriting mlflow_regression_housing.py


In [9]:
!python mlflow_regression_housing.py elastic_net 0.5 1 

['mlflow_regression_housing.py', 'elastic_net', '0.5', '1']
Elasticnet model (alpha=0.500000, l1_ratio=1.000000):
  RMSE: 920032.4310590102
  MAE: 509132.6768334975
  R2: 0.7525767521581358
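
The script reads its arguments positionally from `sys.argv`, which fails with an `IndexError` when one is missing and accepts any string as a model type. One possible alternative is `argparse` from the standard library. The sketch below is an assumption about how such a command line could look, not the workshop's implementation; the function name `parse_args` and the defaults for `alpha` and `l1_ratio` are chosen for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the model type and the ElasticNet hyperparameters."""
    parser = argparse.ArgumentParser(description="Train a housing price model")
    parser.add_argument(
        "model_type",
        choices=["elastic_net", "en", "random_forest", "rf", "all"],
        help="which model(s) to train",
    )
    # Optional positionals with illustrative defaults.
    parser.add_argument("alpha", type=float, nargs="?", default=0.5)
    parser.add_argument("l1_ratio", type=float, nargs="?", default=0.5)
    return parser.parse_args(argv)


# Passing a list mimics the command line; parse_args() without arguments reads sys.argv.
args = parse_args(["random_forest", "0.5", "1"])
print(args.model_type, args.alpha, args.l1_ratio)  # random_forest 0.5 1.0
```

With `choices`, an unknown model type is rejected with a usage message before any data is downloaded or any run is started.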
