# Introduction
Welcome to this Jupyter notebook dedicated to understanding the core concepts of MLOps using MLflow!

#### Objective
In the rapidly evolving world of Machine Learning (ML), it's vital to maintain a systematic approach to model development, deployment, and monitoring. This approach is commonly referred to as MLOps. Our primary aim is to delve deep into some of its pivotal components, such as model experiment tracking and model registries.

#### What Will We Cover?
1. **MLOps Overview**: A brief on why MLOps is critical and its primary components.
2. **Experiment Tracking with MLflow**: We'll train two distinct models and demonstrate how to log their parameters, metrics, and details for reproducibility and comparison.
3. **Model Registries**: How to register models, differentiate between stages (like Production and Staging), and manage various versions of models.
4. **Inference**: Using registered models to make predictions, distinguishing between models in different stages.

#### Why MLflow?
MLflow is a versatile open-source platform that streamlines the machine learning lifecycle, including experimentation, reproducibility, and deployment. It's known for its simplicity and integrative approach, making it a good fit for both beginners and seasoned ML professionals.

By the end of this notebook, you'll have a hands-on understanding of how MLflow facilitates MLOps and why it's an essential tool in today's ML toolkit.

Install dependencies. Only run the first line if you are running this on the Intel Developer Cloud's Jupyter Environment.

In [1]:
# !source /opt/intel/oneapi/setvars.sh #comment out if not running on Intel Developer Cloud Jupyter
# !pip install mlflow
# !pip install scikit-learn
# !pip install numpy

 

## 1. Introduction to Model Experiment Tracking and Model Registries
### Model Experiment Tracking
Experiment tracking is the process of keeping a record of the experiments in the machine learning lifecycle. It helps in:

- Keeping track of various model versions.
- Monitoring metrics across different experiments.
- Reproducing and collaborating on results.

### Model Registries
A model registry maintains a centralized hub of ML models, making it easier to:

- Store and version models.
- Share and collaborate on models.
- Deploy and monitor models in different environments like staging and production.

## 2. Training and Logging Runs with MLflow
First, let's import the necessary libraries and load the data:

The code loads the Iris dataset, a popular dataset containing measurements for iris flowers. It then splits the data into features (X) and target labels (y). Finally, it divides the dataset into training and testing subsets using a 70-30 split, where 70% of the data is allocated for training and 30% for testing.

In [2]:
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Load data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
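
The split above is drawn at random on every execution, so the accuracy values logged later can differ from run to run. One possible way to make the split repeatable is sketched below. The seed value `42` and the use of `stratify` are assumptions added for illustration and are not part of the original cell.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# random_state fixes the shuffle; stratify keeps class proportions equal
# in both subsets. Both arguments are illustrative additions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (105, 4) (45, 4)
class_counts = Counter(y_test.tolist())
print(class_counts)  # 15 test samples per class
```

Fixing the seed fits the reproducibility goal of experiment tracking: two people who run the notebook get the same train and test subsets.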


Now, let's train two models and log their runs:

The code below trains two machine learning models: a Random Forest classifier and a Logistic Regression classifier, using training data X_train and y_train. During each model's training process, an MLflow run is initiated to log the model's details. For both models, after training, their performance is evaluated on a test set, and the accuracy is computed. This accuracy, along with other model details like type and version, is logged into MLflow. Additionally, the trained model itself is saved within MLflow. At the end of each training block, the run ID is extracted and stored for potential future reference or operations.

In [3]:
# Training Random Forest Model and Logging with MLflow
with mlflow.start_run(run_name="Random Forest Run") as run:
    rf_model = RandomForestClassifier()
    rf_model.fit(X_train, y_train)
    
    # Evaluating the model
    accuracy = rf_model.score(X_test, y_test)
    
    # Logging details with MLflow
    mlflow.log_param("model_type", "Random Forest")
    mlflow.log_metric("accuracy", accuracy)
    mlflow.sklearn.log_model(rf_model, "model")
    mlflow.set_tags({"version": "1.0", "type": "tree_based"})

rf_run_id = run.info.run_id

# Training Logistic Regression Model and Logging with MLflow
with mlflow.start_run(run_name="Logistic Regression Run") as run:
    lr_model = LogisticRegression(max_iter=1000)
    lr_model.fit(X_train, y_train)
    
    # Evaluating the model
    accuracy = lr_model.score(X_test, y_test)
    
    # Logging details with MLflow
    mlflow.log_param("model_type", "Logistic Regression")
    mlflow.log_metric("accuracy", accuracy)
    mlflow.sklearn.log_model(lr_model, "model")
    mlflow.set_tags({"version": "1.0", "type": "linear"})

lr_run_id = run.info.run_id



## 3. Registering Models with MLflow
Let's register the models.

The code below creates an MLflow client for interacting with the tracking server and model registry. It then registers both models, the Random Forest and the Logistic Regression, under the same registered name, "IrisModel", so they become versions 1 and 2 of that model. After registering the Random Forest model from its run ID, the code transitions that version to the "Production" stage. The Logistic Regression model is registered the same way from its own run ID and moved to "Staging".

In [4]:
# Initializing the MLflow client
client = mlflow.tracking.MlflowClient()

# Registering Random Forest Model using the actual run ID
rf_registered_model = mlflow.register_model(f"runs:/{rf_run_id}/model", "IrisModel")

# Transitioning it to production
client.transition_model_version_stage(
    name="IrisModel", 
    version=rf_registered_model.version, 
    stage="Production"
)

# Registering Logistic Regression Model using the actual run ID
lr_registered_model = mlflow.register_model(f"runs:/{lr_run_id}/model", "IrisModel")

# Transitioning it to staging
client.transition_model_version_stage(
    name="IrisModel", 
    version=lr_registered_model.version, 
    stage="Staging"
)

Successfully registered model 'IrisModel'.
2023/10/20 19:59:29 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation. Model name: IrisModel, version 1
Created version '1' of model 'IrisModel'.
Registered model 'IrisModel' already exists. Creating a new version of this model...
2023/10/20 19:59:29 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation. Model name: IrisModel, version 2
Created version '2' of model 'IrisModel'.


<ModelVersion: aliases=[], creation_timestamp=1697857169395, current_stage='Staging', description=None, last_updated_timestamp=1697857169423, name='IrisModel', run_id='3904be5576f746e3bfcae906244af20e', run_link=None, source='file:///home/uad6b15e0ae3d5e407195ab5f044a50f/Eduardo/mlruns/0/3904be5576f746e3bfcae906244af20e/artifacts/model', status='READY', status_message=None, tags={}, user_id=None, version=2>

Now let's review the models that we have registered. The code below creates an MlflowClient, which handles interactions with MLflow's tracking server and model registry, and retrieves a list of all registered models. For each registered model, it prints the name and then iterates over latest_versions, which holds the most recent version in each stage rather than every version ever created. For each of those versions it fetches and displays the version number, the associated run ID, and the current deployment stage.

In [5]:
import mlflow
from mlflow.tracking import MlflowClient

# Initialize the client
client = MlflowClient()

# List all registered models
registered_models = client.search_registered_models()

# Print details of each registered model
for rm in registered_models:
    print("Name:", rm.name)
    
    for version in rm.latest_versions:
        model_version_details = client.get_model_version(rm.name, version.version)
        
        print("Version:", version.version)
        print("Run ID:", model_version_details.run_id)
        print("Stage:", model_version_details.current_stage)
        print("------")
    
    print("----------------------------")

Name: IrisModel
Version: 2
Run ID: 3904be5576f746e3bfcae906244af20e
Stage: Staging
------
Version: 1
Run ID: 416035e966234b1cb32c375df6a4ccf7
Stage: Production
------
----------------------------


## 4. Inference using Registered Models
Let's load and use a registered model for inference.

The code below imports the mlflow.pyfunc module, names the "IrisModel" entry in MLflow's model registry, and loads whichever version of that model is currently in the "Production" stage. It then passes two sample iris measurements to the loaded model and prints the resulting predictions. In essence, it shows how to retrieve a model from the registry and use it for inference on new data.

In [6]:
import mlflow.pyfunc

# Define the name of the registered model
model_name = "IrisModel"

# Load the model in 'Production' stage from the model registry
model_production = mlflow.pyfunc.load_model(model_uri=f"models:/{model_name}/Production")

# Assume you have some new sample data for prediction (modify this according to your data)
sample_data = [[5.1, 3.5, 1.4, 0.2],  # Example iris data
               [6.7, 3.1, 4.7, 1.5]]

# Perform inference
predictions = model_production.predict(sample_data)

print(predictions)

[0 1]
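
The printed values are class indices. A minimal sketch of mapping them back to species names follows. It hard-codes the `[0 1]` output shown above and relies on the `target_names` array that ships with scikit-learn's Iris dataset.

```python
import numpy as np
from sklearn.datasets import load_iris

# Class indices copied from the output above.
predictions = np.array([0, 1])

# target_names is indexed by class label: 0, 1, 2.
target_names = load_iris().target_names
labels = target_names[predictions]

print(labels)  # ['setosa' 'versicolor']
```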


# Conclusion
Through this workshop:

1. We understood the importance of Model Experiment Tracking and how it helps streamline ML processes, reproduce results, and collaborate.
2. We delved into Model Registries and saw their utility in managing, versioning, and deploying models.
3. We trained two models and logged their runs with MLflow, tagging them appropriately.
4. We registered our models in different stages: production and staging. This helps in differentiating models ready for live environments vs those still under evaluation.
5. Lastly, we loaded the registered Production model for inference, showcasing how easy it is to fetch and use models from a centralized repository.