# 💼 Model Management 💼 
Well, it seems you've learnt how to run **Machine Learning Experiments** with **Experiment Tracking**, and after running those experiments you've found some models you'd like to use. But how do you **store** them? How do you keep track of their versions? Do we really need a versioning system for our models?

<img src="https://c.tenor.com/XHyzk7O2ndIAAAAS/what-meme.gif" alt="how" width="200"></img>

## Let's find out 🔎

From your ML experiments, let's say you got a bunch of models trained on slightly different data distributions, hyperparameters, optimizers, etc., and you just store them manually, one by one, each in its own folder,  
something like this:

![folder_struct](./images/folder_struct.jpg)

(btw this is how I usually store my models 😉)

and you just manually add more if you want ...

Hmmmm there's a bit of a problem here 🤔

* There's no consistency in the naming
* No tracking of model versions
* No tracking of the training and validation data used
* Visually kinda seems a bit messy :(

This might not be much of an issue for small projects like your school/college projects, but for companies and startups it is indeed a big problem, as their ML systems need to be continuously monitored and retrained so they can adapt to the behaviour of new data.

# Model Management with MLflow 🤖

Continuing from the [previous](./mlflow-experiment-tracking-intro.ipynb) notebook

In [2]:
%%time
import pickle
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

In [9]:
import xgboost as xgb

In [1]:
%%time
import mlflow
mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("nyc-taxi-experiment")

<Experiment: artifact_location='./mlruns/1', experiment_id='1', lifecycle_stage='active', name='nyc-taxi-experiment', tags={}>

In [3]:
def read_dataframe(filename):
    df = pd.read_parquet(filename)

    df.lpep_dropoff_datetime = pd.to_datetime(df.lpep_dropoff_datetime)
    df.lpep_pickup_datetime = pd.to_datetime(df.lpep_pickup_datetime)

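    # trip duration in minutes, computed with the vectorised .dt accessor
    df['duration'] = df.lpep_dropoff_datetime - df.lpep_pickup_datetime
    df['duration'] = df.duration.dt.total_seconds() / 60

    # keep trips between 1 and 60 minutes; copy so later assignments don't hit a view
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].astype(str)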
    
    return df

In [4]:
df_train = read_dataframe('./data/green_tripdata_2021-01.parquet')
df_val = read_dataframe('./data/green_tripdata_2021-02.parquet')

In [5]:
df_train['PU_DO'] = df_train['PULocationID'] + '_' + df_train['DOLocationID']
df_val['PU_DO'] = df_val['PULocationID'] + '_' + df_val['DOLocationID']

In [6]:
categorical = ['PU_DO']  # combined pickup/drop-off pair instead of the two separate location IDs
numerical = ['trip_distance']

dv = DictVectorizer()

train_dicts = df_train[categorical + numerical].to_dict(orient='records')
X_train = dv.fit_transform(train_dicts)

val_dicts = df_val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dicts)

In [7]:
target = 'duration'
y_train = df_train[target].values
y_val = df_val[target].values

# How to log models? 📒

## Using `log_artifact()`

In [25]:
with mlflow.start_run():
    mlflow.set_tag("tutorial", "management")
    
    train = xgb.DMatrix(X_train, label=y_train)
    valid = xgb.DMatrix(X_val, label=y_val)

    best_params = {
        'learning_rate': 0.09585355369315604,
        'max_depth': 30,
        'min_child_weight': 1.060597050922164,
        'objective': 'reg:squarederror',  # current name of the deprecated 'reg:linear' objective
        'reg_alpha': 0.018060244040060163,
        'reg_lambda': 0.011658731377413597,
        'seed': 42
    }

    mlflow.log_params(best_params)

    booster = xgb.train(
        params=best_params,
        dtrain=train,
        num_boost_round=10,
        evals=[(valid, 'validation')],
        early_stopping_rounds=10
    )
    booster.save_model("models/xgb_model.json")
    
    # track the model with log_artifact: upload the file at `local_path`
    # into the `artifact_path` folder under this run's artifact URI
    mlflow.log_artifact(local_path="models/xgb_model.json", artifact_path="mlflow_model")

    y_pred = booster.predict(valid)
    # take the square root explicitly: `squared=False` is deprecated in newer scikit-learn releases
    rmse = mean_squared_error(y_val, y_pred) ** 0.5
    mlflow.log_metric("rmse", rmse)

[0]	validation-rmse:19.48425
[1]	validation-rmse:17.95634
[2]	validation-rmse:16.59114
[3]	validation-rmse:15.37412
[4]	validation-rmse:14.29011
[5]	validation-rmse:13.32800
[6]	validation-rmse:12.47570
[7]	validation-rmse:11.72140
[8]	validation-rmse:11.05888
[9]	validation-rmse:10.47583


If you open the current run in the MLflow UI, you will see your model stored under the run's artifact URI, where it can be accessed later.
![log_artifact_op](./images/log_artifact_op.jpg)

Okay, so:
* The model gets saved inside each run
* It is saved along with information about the hyperparameters
* The saved model can be accessed and downloaded later  

But what if we have to share it with someone who has no idea which package versions we used in this project?

![another option](https://www.memecreator.org/static/images/memes/5217550.jpg)

## Using `log_model()`
(specific to each framework)

In [26]:
with mlflow.start_run():
    mlflow.set_tag("tutorial", "management")
    
    train = xgb.DMatrix(X_train, label=y_train)
    valid = xgb.DMatrix(X_val, label=y_val)

    best_params = {
        'learning_rate': 0.09585355369315604,
        'max_depth': 30,
        'min_child_weight': 1.060597050922164,
        'objective': 'reg:squarederror',  # current name of the deprecated 'reg:linear' objective
        'reg_alpha': 0.018060244040060163,
        'reg_lambda': 0.011658731377413597,
        'seed': 42
    }

    mlflow.log_params(best_params)

    booster = xgb.train(
        params=best_params,
        dtrain=train,
        num_boost_round=10,
        evals=[(valid, 'validation')],
        early_stopping_rounds=10
    )
    
    # track the model with the framework-specific log_model
    mlflow.xgboost.log_model(booster, artifact_path="mlflow_model")
    # simple, right?

    y_pred = booster.predict(valid)
    # take the square root explicitly: `squared=False` is deprecated in newer scikit-learn releases
    rmse = mean_squared_error(y_val, y_pred) ** 0.5
    mlflow.log_metric("rmse", rmse)

[0]	validation-rmse:19.48425
[1]	validation-rmse:17.95634
[2]	validation-rmse:16.59114
[3]	validation-rmse:15.37412
[4]	validation-rmse:14.29011
[5]	validation-rmse:13.32800
[6]	validation-rmse:12.47570
[7]	validation-rmse:11.72140
[8]	validation-rmse:11.05888
[9]	validation-rmse:10.47583




If you check this run, then along with the logged parameters and metric you will also see information about your environment and the packages used, in the artifact section.

![log_model_op1](./images/log_model_op1.jpg)

It also provides snippets showing how to load the model and make predictions:
![log_model_op2.jpg](./images/log_model_op2.jpg)

Conda environment info :
![log_model_op3.jpg](./images/log_model_op3.jpg)

and so on....

Well, it looks like we can store a model in each run along with everything we logged, such as `hyperparameters`, `metrics`, `dependencies`, etc. But we still haven't figured out model versioning, right?  
Well, that's where the **Model Registry** comes in, which we will learn about in the next notebook.

See ya then 

![thanks](https://c.tenor.com/35hmBwYHYikAAAAM/the-office-bow.gif)