# Part 1: Evaluating

The Part 1 Data Prep notebook handles data preparation and quality checking steps. 

The Part 1 Model Training notebook builds a model and writes metrics to MLflow. 

This notebook will handle the following steps:
- Load the test data.
- Load the model registered to staging in the training step.
- Use the trained model to predict on the test data and generate model evaluation metrics.
- If no production model exists, register the staged model to production as the baseline.
- If a production model exists, compare its evaluation metrics with those of the newly trained model, and promote the new model to production only if it scores higher.
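
The decision in the last two steps can be sketched as a small pure function. This is a minimal illustration rather than the notebook's actual code: `decide_promotion` and its return strings are hypothetical names, and the metric is assumed to be one where higher is better (this notebook uses AUC).

```python
from typing import Optional


def decide_promotion(staged_metric: float, production_metric: Optional[float]) -> str:
    """Return the action to take for the staged model.

    Hypothetical helper: assumes a higher metric value means a better model.
    """
    if production_metric is None:
        # No production model yet: the staged model becomes the baseline.
        return "promote_as_baseline"
    if staged_metric > production_metric:
        # Strictly better than production: replace it.
        return "promote"
    # Equal or worse: keep the current production model.
    return "keep_production"


print(decide_promotion(0.91, None))
print(decide_promotion(0.91, 0.88))
print(decide_promotion(0.88, 0.88))
```

In this sketch a tie keeps the current production model, because the comparison is strict.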


## Requirements
This tutorial requires Databricks Runtime for Machine Learning.

In [None]:
# Multiple people may be running this workshop at the same time.  We want each
# participant to have their own set of files.  To create your own file storage area,
# put your name below:

your_name = ""

try:
    run_name = dbutils.widgets.get("run_name")
except Exception:
    # Widget not defined: fall back to the name entered above
    run_name = your_name.strip()
run_name = "no_name" if run_name == "" else run_name

In [None]:
# We need to know whether this notebook is running as part of Continuous Integration
# or Continuous Deployment. Look for a widget flag that tells us.
try:
    devops_action = dbutils.widgets.get("devops_action")
except Exception:
    # Widget not defined (for example, an interactive run)
    devops_action = "unknown"
devops_action = devops_action.strip().upper()
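
Both parameter cells above follow the same pattern: try the widget, then fall back to a default. A minimal sketch of that pattern as a reusable function follows. `read_param` is a hypothetical helper, not part of `dbutils`; it takes the lookup function as an argument so that it can also be exercised outside Databricks, where `dbutils` is not defined.

```python
from typing import Callable


def read_param(getter: Callable[[str], str], name: str, default: str) -> str:
    """Return the stripped value from getter(name), or default if it is missing or blank."""
    try:
        value = getter(name)
    except Exception:
        # The lookup failed (for example, the widget is not defined).
        return default
    value = value.strip()
    return value if value else default


# Stand-in for dbutils.widgets.get, backed by a plain dict for illustration.
fake_widgets = {"devops_action": " integration "}

action = read_param(fake_widgets.__getitem__, "devops_action", "unknown").upper()
name = read_param(fake_widgets.__getitem__, "run_name", "no_name")
print(action, name)
```

On Databricks, `dbutils.widgets.get` could be passed as the `getter`.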

### Load the prepared data

In [None]:
import pandas as pd

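if devops_action in ("INTEGRATION", "UNKNOWN"):
  # CI or interactive run: read the prepared CSV for this run_name
  data = pd.read_csv(f"/dbfs/tutorials/wine-data/{run_name}/wine-quality-all-prepped.csv")
  # Drop the unnamed index column left over from saving the CSV
  data = data.drop(["Unnamed: 0"], axis=1)
elif devops_action == "DEPLOYMENT":
  # CD run: read the Delta table and convert it to a pandas DataFrame
  data = spark.read.format("delta").load("dbfs:/tutorials/wine-data/delta")
  data = data.toPandas()
else:
  # Fail early instead of leaving `data` undefined for the cells below
  raise ValueError(f"Unexpected devops_action: {devops_action!r}")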

# TODO: add the corresponding parameter to the CD pipeline,
# and the matching parameter handling to this notebook

### Split dataset and use test dataset to measure trained model
Split the input data into 3 sets:
- Train (60% of the dataset used to train the model)
- Validation (20% of the dataset used to tune the hyperparameters)
- Test (20% of the dataset used to report the true performance of the model on an unseen dataset)

The split uses the same seed (`random_state=123`) as the training notebook, so identical input data yields the same test set. Only the test set is used in this evaluation notebook.

In [None]:
from sklearn.model_selection import train_test_split

X = data.drop(["quality"], axis=1)
y = data.quality

# Split out the training data
X_train, X_rem, y_train, y_rem = train_test_split(X, y, train_size=0.6, random_state=123)

# Split the remaining data equally into validation and test
X_val, X_test, y_val, y_test = train_test_split(X_rem, y_rem, test_size=0.5, random_state=123)
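
To see why the fixed seed matters, the two-step split can be run twice on the same input and compared. This is a minimal sketch on synthetic data: `split_60_20_20`, `X_demo`, and `y_demo` are hypothetical names made up for the illustration, and the reproducibility holds only when the input rows and their order are identical between runs.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_60_20_20(X, y, seed=123):
    """Split into 60% train, 20% validation, 20% test using a fixed seed."""
    X_train, X_rem, y_train, y_rem = train_test_split(
        X, y, train_size=0.6, random_state=seed
    )
    X_val, X_test, y_val, y_test = train_test_split(
        X_rem, y_rem, test_size=0.5, random_state=seed
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Synthetic stand-in for the wine data: 100 rows, one feature, binary label.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.arange(100) % 2

first = split_60_20_20(X_demo, y_demo)
second = split_60_20_20(X_demo, y_demo)

sizes = tuple(len(part) for part in first[:3])
same_test_rows = np.array_equal(first[2], second[2])
print(sizes, same_test_rows)
```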

### Load the staged model
If the training notebook succeeds, it registers a model to staging. Load the model for comparison against the current production model.

In [None]:
import mlflow
import mlflow.pyfunc
from sklearn.metrics import roc_auc_score
from mlflow.tracking import MlflowClient

# to create your own version of the model, uncomment the next line, and comment the line after
# model_name = f"wine_quality-{run_name}"
model_name = "wine_quality"
staged_model = mlflow.pyfunc.load_model(f"models:/{model_name}/staging")

staged_model_auc = roc_auc_score(y_test, staged_model.predict(X_test))
print(f'Current staged model AUC on test data: {staged_model_auc}')
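
`roc_auc_score` ranks examples by score, so it is most informative when given continuous scores or probabilities rather than thresholded labels. Whether `staged_model.predict` returns probabilities depends on how the model was wrapped in the training notebook, which is not shown here. The toy numbers below are made up purely to illustrate the difference.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and scores, for illustration only.
y_true = [0, 0, 1, 1, 1, 0]
scores = [0.1, 0.6, 0.55, 0.9, 0.7, 0.2]

# Thresholding at 0.5 discards the ranking information within each class.
hard_labels = [1 if s >= 0.5 else 0 for s in scores]

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, hard_labels)
print(f"AUC from scores: {auc_from_scores:.3f}")
print(f"AUC from hard labels: {auc_from_labels:.3f}")
```

If the two models being compared return different kinds of output, their AUC values are not directly comparable.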

### Load the current production model (if any)

In [None]:
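try:
    production_model = mlflow.pyfunc.load_model(f"models:/{model_name}/production")
except Exception as err:
    # Treat any load failure as "no production model", but print the cause so
    # that unrelated errors (for example, connectivity problems) stay visible.
    production_model = None
    print(f"No current model in production ({err})")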

### Compare staged model to production model (if exists), keep better model in production

In [None]:
client = MlflowClient()

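def get_stage_version(model_name, stage_name):
    """Return the latest version number in the given stage, or '0' if there is none."""
    # With no stage filter, get_latest_versions returns the latest version for each stage
    latest_versions = client.get_latest_versions(model_name)
    versions = [v.version for v in latest_versions if v.current_stage == stage_name]
    return versions[0] if versions else '0'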

prod_version = get_stage_version(model_name, "Production")
staging_version = get_stage_version(model_name, "Staging")

if production_model is not None:
    prod_model_auc = roc_auc_score(y_test, production_model.predict(X_test))
    print(f'Current production model AUC on test data: {prod_model_auc}')

    if staged_model_auc > prod_model_auc:
        print("Staged model outperforms current production model.")
        print("Archiving old production model")
        client.transition_model_version_stage(
            name=model_name,
            version=prod_version,
            stage="Archived",
            )
        print("Promoting staging to production")
        client.transition_model_version_stage(
            name=model_name,
            version=staging_version,
            stage="Production",
            )
        
    else:
        raise Exception("Staged model does not outperform current prod, exiting")
        
else:
    print("No production model found, promoting staging to production")
    client.transition_model_version_stage(
        name=model_name,
        version=staging_version,
        stage="Production",
        )
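
The lookup in `get_stage_version` can be tried without a model registry by feeding it stand-in objects. The sketch below is a variant under stated assumptions, not the notebook's code: `pick_stage_version` is a hypothetical name, the `SimpleNamespace` objects merely mimic the `version` and `current_stage` attributes read above, and it returns `None` instead of `'0'` when no version is in the requested stage, so that a missing version cannot be mistaken for a real one.

```python
from types import SimpleNamespace
from typing import Iterable, Optional


def pick_stage_version(versions: Iterable, stage_name: str) -> Optional[str]:
    """Return the version whose current_stage matches stage_name, or None."""
    matches = [v.version for v in versions if v.current_stage == stage_name]
    return matches[0] if matches else None


# Stand-ins for the objects returned by the registry (illustrative values only).
fake_versions = [
    SimpleNamespace(version="3", current_stage="Production"),
    SimpleNamespace(version="4", current_stage="Staging"),
]

print(pick_stage_version(fake_versions, "Staging"))
print(pick_stage_version(fake_versions, "Archived"))
```

A caller can then check for `None` before requesting a stage transition.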

The Models page now shows the best-performing model version in stage "Production".

You can now refer to the model using the path "models:/wine_quality/production".

In [None]:
model = mlflow.pyfunc.load_model(f"models:/{model_name}/production")

# Sanity-check: This should match the AUC logged by MLflow
print(f'AUC: {roc_auc_score(y_test, model.predict(X_test))}')