# MLflow's Model Registry

## Interacting with the MLflow tracking server
The `MlflowClient` object allows us to interact with:
- an MLflow Tracking Server that creates and manages experiments and runs.
- an MLflow Registry Server that creates and manages registered models and model versions.

To instantiate it, we need to pass a tracking URI and/or a registry URI.
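
As a minimal sketch of passing both URIs, the hypothetical helper `make_client` below forwards a tracking URI and an optional registry URI to `MlflowClient`. The default SQLite URI mirrors the one used in this notebook; our understanding is that leaving `registry_uri` unset makes MLflow reuse the tracking URI for the registry, but treat that as an assumption to check against your MLflow version.

```python
def make_client(tracking_uri="sqlite:///mlflow.db", registry_uri=None):
    """Build an MlflowClient from a tracking URI and an optional registry URI."""
    # Imported inside the function so the helper can be defined
    # even in an environment where MLflow is not installed yet.
    from mlflow.tracking import MlflowClient

    # registry_uri=None is assumed to fall back to the tracking URI.
    return MlflowClient(tracking_uri=tracking_uri, registry_uri=registry_uri)


# Typical use (requires MLflow to be installed):
# client = make_client()
```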

In [7]:
from mlflow.tracking import MlflowClient

MLFLOW_TRACKING_URI = "sqlite:///mlflow.db"
client = MlflowClient(tracking_uri=MLFLOW_TRACKING_URI)
client.search_experiments()

[<Experiment: artifact_location='/home/kylepaul/notebooks/mlops-zoom-camp-2022/session_2/mlruns/1', creation_time=1685346804195, experiment_id='1', last_update_time=1685346804195, lifecycle_stage='active', name='nyc-taxi-experiment', tags={}>,
 <Experiment: artifact_location='mlflow-artifacts:/0', creation_time=1685346718761, experiment_id='0', last_update_time=1685346718761, lifecycle_stage='active', name='Default', tags={}>]

In [8]:
client.create_experiment(name="my-cool-experiment")

'2'

In [13]:
from mlflow.entities import ViewType

runs = client.search_runs(
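    experiment_ids=['1'],                  # list of experiment IDs to search
    filter_string="metrics.rmse < 5.9",    # keep only runs with RMSE below 5.9
    run_view_type=ViewType.ACTIVE_ONLY,    # skip deleted runs
    max_results=5,
    order_by=["metrics.rmse ASC"]          # best (lowest) RMSE first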
)
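
The query above asks for active runs with an RMSE below 5.9, best first, capped at five results. The following is a rough pure-Python sketch of the same filter, sort, and limit semantics over plain dictionaries. It does not touch MLflow: `select_runs` is a hypothetical helper, the run IDs are shortened, and the record with RMSE 6.10 is made up to show the filter rejecting a run.

```python
def select_runs(runs, max_rmse, max_results):
    """Keep runs with rmse < max_rmse, sort ascending by rmse, cap the result size."""
    kept = [r for r in runs if r["rmse"] < max_rmse]
    kept.sort(key=lambda r: r["rmse"])
    return kept[:max_results]


runs_demo = [
    {"run_id": "9ac35b42", "rmse": 5.8636},
    {"run_id": "made-up", "rmse": 6.10},  # hypothetical run, fails the filter
    {"run_id": "cf1747fb", "rmse": 5.8289},
    {"run_id": "8264fb3e", "rmse": 5.8091},
]

best = select_runs(runs_demo, max_rmse=5.9, max_results=5)
for r in best:
    print(f"run id: {r['run_id']}, rmse: {r['rmse']:.4f}")
```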

In [11]:
runs

[<Run: data=<RunData: metrics={'rmse': 5.80905971329649}, params={'learning_rate': '0.42573168654483656',
  'max_depth': '10',
  'min_child_weight': '3.2482026447444605',
  'objective': 'reg:linear',
  'reg_alpha': '0.04266778935139865',
  'reg_lambda': '0.011203537317262537',
  'seed': '42'}, tags={'mlflow.log-model.history': '[{"run_id": "8264fb3eb8fb47f69ce47e08af58922b", '
                              '"artifact_path": "models_mlflow", '
                              '"utc_time_created": "2023-05-30 '
                              '17:49:53.806500", "flavors": {"python_function": '
                              '{"loader_module": "mlflow.xgboost", '
                              '"python_version": "3.10.9", "data": "model.xgb", '
                              '"env": {"conda": "conda.yaml", "virtualenv": '
                              '"python_env.yaml"}}, "xgboost": {"xgb_version": '
                              '"1.7.5", "data": "model.xgb", "model_class": '
                  

In [14]:
for run in runs:
    print(f"run id: {run.info.run_id}, rmse: {run.data.metrics['rmse']:.4f}")

run id: 8264fb3eb8fb47f69ce47e08af58922b, rmse: 5.8091
run id: 8ece72949f984fa3972c265e1b56a8c2, rmse: 5.8091
run id: cf1747fba5e24f5d95d856cdc9c569a3, rmse: 5.8289
run id: 9ac35b42b97b4425995e2fdf5aa75778, rmse: 5.8636


## Interacting with the Model Registry
In this section we will use the `MlflowClient` instance to:

- Register a new version of the model `nyc-taxi-regressor`.
- Retrieve the latest versions of the model `nyc-taxi-regressor` and check that a new version was created (version 2 in the output below).
- Transition that version to `Staging` and add annotations to it.
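
The last step in the list combines two client calls, so it can be convenient to wrap them together. Here is a minimal sketch, assuming a hypothetical helper `stage_and_annotate` that accepts any object exposing `transition_model_version_stage` and `update_model_version`. The `RecordingClient` class is a made-up stand-in that only records calls, so the example runs without a tracking server.

```python
from datetime import date


def stage_and_annotate(client, name, version, stage="Staging", on_date=None):
    """Move a model version to `stage`, then record that fact in its description."""
    on_date = on_date or date.today()
    client.transition_model_version_stage(
        name=name, version=version, stage=stage, archive_existing_versions=False
    )
    description = f"The model version {version} was transitioned to {stage} on {on_date}"
    client.update_model_version(name=name, version=version, description=description)
    return description


class RecordingClient:
    """Hypothetical stand-in for MlflowClient that only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def transition_model_version_stage(self, **kwargs):
        self.calls.append(("transition", kwargs))

    def update_model_version(self, **kwargs):
        self.calls.append(("update", kwargs))


fake = RecordingClient()
note = stage_and_annotate(fake, "nyc-taxi-regressor", 2, on_date=date(2023, 5, 31))
print(note)
```

Deriving the description from the same `version` and `stage` arguments keeps the annotation from drifting away from the transition it describes.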

In [16]:
import mlflow
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)

In [17]:
run_id = "cf1747fba5e24f5d95d856cdc9c569a3"
model_uri = f"runs:/{run_id}/model"
mlflow.register_model(model_uri=model_uri, name="nyc-taxi-regressor")

Registered model 'nyc-taxi-regressor' already exists. Creating a new version of this model...
2023/05/31 02:35:16 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation. Model name: nyc-taxi-regressor, version 2
Created version '2' of model 'nyc-taxi-regressor'.


<ModelVersion: aliases=[], creation_timestamp=1685475316872, current_stage='None', description=None, last_updated_timestamp=1685475316872, name='nyc-taxi-regressor', run_id='cf1747fba5e24f5d95d856cdc9c569a3', run_link=None, source='/home/kylepaul/notebooks/mlops-zoom-camp-2022/session_2/mlruns/1/cf1747fba5e24f5d95d856cdc9c569a3/artifacts/model', status='READY', status_message=None, tags={}, user_id=None, version=2>

In [19]:
client.search_registered_models()

[<RegisteredModel: aliases={}, creation_timestamp=1685475228645, description='', last_updated_timestamp=1685475316872, latest_versions=[<ModelVersion: aliases=[], creation_timestamp=1685475316872, current_stage='None', description=None, last_updated_timestamp=1685475316872, name='nyc-taxi-regressor', run_id='cf1747fba5e24f5d95d856cdc9c569a3', run_link=None, source='/home/kylepaul/notebooks/mlops-zoom-camp-2022/session_2/mlruns/1/cf1747fba5e24f5d95d856cdc9c569a3/artifacts/model', status='READY', status_message=None, tags={}, user_id=None, version=2>], name='nyc-taxi-regressor', tags={}>]

In [22]:
model_name = "nyc-taxi-regressor"
latest_versions = client.get_latest_versions(name=model_name)
latest_versions

[<ModelVersion: aliases=[], creation_timestamp=1685475316872, current_stage='None', description=None, last_updated_timestamp=1685475316872, name='nyc-taxi-regressor', run_id='cf1747fba5e24f5d95d856cdc9c569a3', run_link=None, source='/home/kylepaul/notebooks/mlops-zoom-camp-2022/session_2/mlruns/1/cf1747fba5e24f5d95d856cdc9c569a3/artifacts/model', status='READY', status_message=None, tags={}, user_id=None, version=2>]

In [23]:
for version in latest_versions:
    print(f"version: {version.version}, stage: {version.current_stage}")

version: 2, stage: None


In [25]:
model_version = 2
new_stage = "Staging"
client.transition_model_version_stage(
    name=model_name,
    version=model_version,
    stage=new_stage,
    archive_existing_versions=False
)

<ModelVersion: aliases=[], creation_timestamp=1685475316872, current_stage='Staging', description=None, last_updated_timestamp=1685475689482, name='nyc-taxi-regressor', run_id='cf1747fba5e24f5d95d856cdc9c569a3', run_link=None, source='/home/kylepaul/notebooks/mlops-zoom-camp-2022/session_2/mlruns/1/cf1747fba5e24f5d95d856cdc9c569a3/artifacts/model', status='READY', status_message=None, tags={}, user_id=None, version=2>

In [26]:
from datetime import datetime
date = datetime.today().date()

client.update_model_version(
    name=model_name,
    version=model_version,
    description=f"The model version {model_version} was transitioned to {new_stage} on {date}"
)

<ModelVersion: aliases=[], creation_timestamp=1685475316872, current_stage='Staging', description='The model version 2 was transitioned to Staging on 2023-05-31', last_updated_timestamp=1685475758439, name='nyc-taxi-regressor', run_id='cf1747fba5e24f5d95d856cdc9c569a3', run_link=None, source='/home/kylepaul/notebooks/mlops-zoom-camp-2022/session_2/mlruns/1/cf1747fba5e24f5d95d856cdc9c569a3/artifacts/model', status='READY', status_message=None, tags={}, user_id=None, version=2>

## Comparing versions and selecting the new "Production" model
In this last section, we will retrieve models registered in the model registry and compare their performance on an unseen test set. The idea is to simulate the scenario in which a deployment engineer has to interact with the model registry to decide whether or not to update the model version that is in production.

These are the steps:

- Load the `test dataset`, which corresponds to the `NYC Green Taxi` data from the month of March 2021.
- Download the `DictVectorizer` that was fitted using the training data and saved to MLflow as an `artifact`, and load it with `pickle`.
- Preprocess the test set using the `DictVectorizer` so we can properly feed the regressors.
- Make `predictions` on the test set using the model versions that are currently in the `Staging` and `Production` stages, and compare their performance.
- Based on the results, update the `Production` model version accordingly.

**Note: the model registry doesn't actually deploy the model to production when you transition a model version to the "Production" stage; it just assigns a label to that model version. You should complement the registry with some CI/CD code that does the actual deployment.**
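
To make the note concrete, here is a minimal sketch of what such complementary code might look like: a check that asks the registry which version carries the `Production` label and calls a deployment callback when that version differs from the one already deployed. The names `production_version`, `sync_deployment`, `FakeRegistry`, and the `deploy` callback are all hypothetical; only `get_latest_versions(name, stages=...)` is an MLflow client call, and a real pipeline would need error handling and rollback on top of this.

```python
from types import SimpleNamespace


def production_version(client, name):
    """Return the version number labelled Production in the registry, or None."""
    versions = client.get_latest_versions(name=name, stages=["Production"])
    return int(versions[0].version) if versions else None


def sync_deployment(client, name, deployed_version, deploy):
    """Call deploy(name, version) only when the Production label has moved."""
    current = production_version(client, name)
    if current is not None and current != deployed_version:
        deploy(name, current)
        return current
    return deployed_version


class FakeRegistry:
    """Hypothetical stand-in for MlflowClient, holding one Production version."""

    def __init__(self, production=None):
        self.production = production

    def get_latest_versions(self, name, stages=None):
        if self.production is None:
            return []
        return [SimpleNamespace(name=name, version=self.production)]


deployed = []
registry = FakeRegistry(production=1)
on_deploy = lambda n, v: deployed.append((n, v))

live = sync_deployment(registry, "nyc-taxi-regressor", None, on_deploy)
live = sync_deployment(registry, "nyc-taxi-regressor", live, on_deploy)  # no change, no redeploy
print(deployed)
```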

In [32]:
from sklearn.metrics import mean_squared_error
import pandas as pd


def read_dataframe(filename):
    df = pd.read_parquet(filename)

    df['duration'] = df.lpep_dropoff_datetime - df.lpep_pickup_datetime
    df.duration = df.duration.apply(lambda td: td.total_seconds() / 60)

    # Copy the filtered frame so the column assignment below does not act on a view.
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].astype(str)
    
    return df


def preprocess(df, dv):
    df['PU_DO'] = df['PULocationID'] + '_' + df['DOLocationID']
    categorical = ['PU_DO']
    numerical = ['trip_distance']
    train_dicts = df[categorical + numerical].to_dict(orient='records')
    return dv.transform(train_dicts)


def test_model(name, stage, X_test, y_test):
    model = mlflow.pyfunc.load_model(f"models:/{name}/{stage}")
    y_pred = model.predict(X_test)
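    # Take the square root explicitly: the `squared=False` argument is
    # deprecated in recent scikit-learn releases.
    rmse = float(mean_squared_error(y_test, y_pred) ** 0.5)
    return {"rmse": rmse}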
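
`test_model` reports the root mean squared error (RMSE) of the predicted trip durations. As a reminder of what that number means, here is a small standard-library sketch of the same metric. The `rmse` helper and the two sample trips are made up for illustration and are not taken from the dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the average squared difference."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical durations in minutes: errors of -1 and 7 give squared errors 1 and 49.
actual = [8.0, 20.0]
predicted = [9.0, 13.0]
score = rmse(actual, predicted)
print(score)
```

Because the errors are squared before averaging, the single 7-minute miss dominates the score, which is why RMSE penalizes large mistakes more than small ones.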

In [33]:
df = read_dataframe('./data/green_tripdata_2021-03.parquet')
df

| | VendorID | lpep_pickup_datetime | lpep_dropoff_datetime | store_and_fwd_flag | RatecodeID | PULocationID | DOLocationID | passenger_count | trip_distance | fare_amount | ... | mta_tax | tip_amount | tolls_amount | ehail_fee | improvement_surcharge | total_amount | payment_type | trip_type | congestion_surcharge | duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2021-03-01 00:05:42 | 2021-03-01 00:14:03 | N | 1.0 | 83 | 129 | 1.0 | 1.56 | 7.50 | ... | 0.5 | 0.00 | 0.0 | | 0.3 | 8.80 | 1.0 | 1.0 | 0.0 | 8.350000 |
| 1 | 2 | 2021-03-01 00:21:03 | 2021-03-01 00:26:17 | N | 1.0 | 243 | 235 | 1.0 | 0.96 | 6.00 | ... | 0.5 | 0.00 | 0.0 | | 0.3 | 7.30 | 2.0 | 1.0 | 0.0 | 5.233333 |
| 2 | 2 | 2021-03-01 00:02:06 | 2021-03-01 00:22:26 | N | 1.0 | 75 | 242 | 1.0 | 9.93 | 28.00 | ... | 0.5 | 2.00 | 0.0 | | 0.3 | 31.30 | 1.0 | 1.0 | 0.0 | 20.333333 |
| 3 | 2 | 2021-03-01 00:24:03 | 2021-03-01 00:31:43 | N | 1.0 | 242 | 208 | 1.0 | 2.57 | 9.50 | ... | 0.5 | 0.00 | 0.0 | | 0.3 | 10.80 | 2.0 | 1.0 | 0.0 | 7.666667 |
| 4 | 1 | 2021-03-01 00:11:10 | 2021-03-01 00:14:46 | N | 1.0 | 41 | 151 | 1.0 | 0.80 | 5.00 | ... | 0.5 | 1.85 | 0.0 | | 0.3 | 8.15 | 1.0 | 1.0 | 0.0 | 3.600000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 83822 | 2 | 2021-03-31 22:07:00 | 2021-03-31 22:13:00 | | | 41 | 75 | | 1.48 | 8.46 | ... | 0.0 | 1.44 | 0.0 | | 0.3 | 10.20 | | | | 6.000000 |
| 83823 | 2 | 2021-03-31 22:56:00 | 2021-03-31 23:13:00 | | | 95 | 95 | | 0.09 | 54.25 | ... | 0.0 | 0.00 | 0.0 | | 0.3 | 57.30 | | | | 17.000000 |
| 83824 | 2 | 2021-03-31 22:36:00 | 2021-03-31 22:45:00 | | | 95 | 95 | | 0.66 | 8.11 | ... | 0.0 | 0.00 | 0.0 | | 0.3 | 8.41 | | | | 9.000000 |
| 83825 | 2 | 2021-03-31 23:35:00 | 2021-04-01 00:00:00 | | | 37 | 14 | | 9.58 | 36.83 | ... | 0.0 | 0.00 | 0.0 | | 0.3 | 39.88 | | | | 25.000000 |
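
The `duration` column produced by `read_dataframe` can be checked by hand. The sketch below recomputes it for the first row of the output using only the standard library, and applies the same 1 to 60 minute filter. `trip_minutes` and `keep_trip` are hypothetical helpers that mirror the pandas logic and are not part of the notebook.

```python
from datetime import datetime


def trip_minutes(pickup, dropoff, fmt="%Y-%m-%d %H:%M:%S"):
    """Trip duration in minutes, computed from two timestamp strings."""
    delta = datetime.strptime(dropoff, fmt) - datetime.strptime(pickup, fmt)
    return delta.total_seconds() / 60


def keep_trip(minutes):
    """Same bounds as read_dataframe: keep trips lasting 1 to 60 minutes."""
    return 1 <= minutes <= 60


# First row of the DataFrame shown above: 8 minutes 21 seconds.
first = trip_minutes("2021-03-01 00:05:42", "2021-03-01 00:14:03")
print(round(first, 6), keep_trip(first))
```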


In [35]:
run_id = "8264fb3eb8fb47f69ce47e08af58922b"

client.transition_model_version_stage(
    name=model_name,
    version=1,
    stage="Staging",
    archive_existing_versions=False
)

client.update_model_version(
    name=model_name,
    version=1,
    description=f"The model version 1 was transitioned to Staging on {date}"
)

<ModelVersion: aliases=[], creation_timestamp=1685475228716, current_stage='Staging', description='The model version 1 was transitioned to Staging on 2023-05-31', last_updated_timestamp=1685476593663, name='nyc-taxi-regressor', run_id='8264fb3eb8fb47f69ce47e08af58922b', run_link='', source='/home/kylepaul/notebooks/mlops-zoom-camp-2022/session_2/mlruns/1/8264fb3eb8fb47f69ce47e08af58922b/artifacts/models_mlflow', status='READY', status_message=None, tags={}, user_id=None, version=1>

In [36]:
client.download_artifacts(run_id=run_id, path='preprocessor', dst_path='.')


'/home/kylepaul/notebooks/mlops-zoom-camp-2022/session_2/preprocessor'

In [37]:
import pickle

with open("preprocessor/preprocessor.b", "rb") as f_in:
    dv = pickle.load(f_in)

In [38]:
X_test = preprocess(df, dv)
target = "duration"
y_test = df[target].values

In [40]:
client.transition_model_version_stage(
    name=model_name,
    version=1,
    stage="Production",
    archive_existing_versions=False
)

%time test_model(name=model_name, stage="Production", X_test=X_test, y_test=y_test)

CPU times: user 3.42 s, sys: 34.7 ms, total: 3.46 s
Wall time: 1.09 s


{'rmse': 6.6181393247183715}

In [None]:
%time test_model(name=model_name, stage="Staging", X_test=X_test, y_test=y_test)
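
The final step, updating the `Production` version based on the comparison, is not shown in the cells above. Here is a minimal sketch of one way to do it, assuming hypothetical helpers `pick_stage` and `promote_if_better` and a made-up recording client. The Production RMSE is the value measured above; the Staging RMSE of 6.50 is invented purely for illustration, since the source does not show that result.

```python
def pick_stage(staging_rmse, production_rmse, min_improvement=0.0):
    """Return the stage whose model should serve: Staging only if it is clearly better."""
    if production_rmse - staging_rmse > min_improvement:
        return "Staging"
    return "Production"


def promote_if_better(client, name, staging_version, staging_rmse, production_rmse):
    """Promote the Staging version to Production when its RMSE is lower."""
    if pick_stage(staging_rmse, production_rmse) != "Staging":
        return False
    client.transition_model_version_stage(
        name=name,
        version=staging_version,
        stage="Production",
        archive_existing_versions=True,  # move the previous Production version to Archived
    )
    return True


class RecordingClient:
    """Hypothetical stand-in for MlflowClient that only records transitions."""

    def __init__(self):
        self.transitions = []

    def transition_model_version_stage(self, **kwargs):
        self.transitions.append(kwargs)


fake = RecordingClient()
promoted = promote_if_better(
    fake,
    name="nyc-taxi-regressor",
    staging_version=2,
    staging_rmse=6.50,                   # made-up value for illustration
    production_rmse=6.6181393247183715,  # measured above for version 1
)
print(promoted, fake.transitions)
```

A non-zero `min_improvement` can guard against swapping models over differences that are within noise.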