# Part 1: Model Training

The Part 0 notebook sets up a raw file to be used for model training and validation.

The Part 1 Data Prep notebook handles data preparation and quality checking steps. 

This notebook focuses on model training. It uses the data produced by the data prep notebook, which runs as an earlier job in the workflow.

The notebook will handle the following steps:
- Split the prepared data into training and validation datasets.
- Build a simple classifier to predict wine quality based on the available features in the data.
- Register the model in MLflow as the baseline model that we'll try to beat by changing parts of the model development workflow.

When these steps complete, we have a baseline model that can predict the quality of Portuguese wines from their measured physicochemical properties.

## Requirements
This tutorial requires Databricks Runtime for Machine Learning.

In [None]:
# Multiple people may be running this workshop at the same time.  We want each
# participant to have their own set of files.  To create your own file storage area,
# put your name below:

your_name = ""

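try:
  # When run as a workflow job, the name comes from the "run_name" widget.
  run_name = dbutils.widgets.get("run_name")
except Exception:
  # No widget is defined (interactive run): fall back to the name typed above.
  run_name = your_name.strip()
run_name = "no_name" if run_name == "" else run_name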
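
The fallback logic in the cell above can also be written as a small pure function, which makes it easy to check outside a Databricks job. This is only a sketch: `resolve_run_name` and its arguments are hypothetical names, not part of the workshop code, and `widget_value` stands in for whatever `dbutils.widgets.get("run_name")` would return.

```python
from typing import Optional


def resolve_run_name(widget_value: Optional[str], your_name: str) -> str:
    """Prefer the job-supplied name, then the typed name, then 'no_name'."""
    # widget_value is None when no widget is defined (hypothetical stand-in
    # for the exception raised by dbutils.widgets.get in an interactive run).
    candidate = widget_value if widget_value is not None else your_name.strip()
    return candidate if candidate != "" else "no_name"


print(resolve_run_name(None, "  alice "))   # alice
print(resolve_run_name("job_42", "alice"))  # job_42
print(resolve_run_name(None, ""))           # no_name
```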

## Load the prepared data

In [None]:
import pandas as pd

data = pd.read_csv(f"/dbfs/tutorials/wine-data/{run_name}/wine-quality-all-prepped.csv")
data = data.drop(["Unnamed: 0"], axis=1)

## Split dataset for training baseline model
Split the input data into 3 sets:
- Train (60% of the dataset used to train the model)
- Validation (20% of the dataset used to tune the hyperparameters)
- Test (20% of the dataset used to report the true performance of the model on an unseen dataset)

The validation dataset is not used in this notebook. The baseline model built below is evaluated on the test dataset.

In [None]:
from sklearn.model_selection import train_test_split

X = data.drop(["quality"], axis=1)
y = data.quality

# Split out the training data
X_train, X_rem, y_train, y_rem = train_test_split(X, y, train_size=0.6, random_state=123)

# Split the remaining data equally into validation and test
X_val, X_test, y_val, y_test = train_test_split(X_rem, y_rem, test_size=0.5, random_state=123)
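
The two calls above produce a 60/20/20 split because the second call divides the remaining 40% in half. The arithmetic can be checked with exact fractions; this is a small illustrative sketch and the variable names are not part of the notebook.

```python
from fractions import Fraction

train = Fraction(6, 10)            # train_size=0.6 in the first split
remainder = 1 - train              # share left for validation + test
test = remainder * Fraction(1, 2)  # test_size=0.5 in the second split
val = remainder - test             # the other half of the remainder

print(train, val, test)  # 3/5 1/5 1/5
```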

## Build a baseline model
This task seems well suited to a random forest classifier, since the output is binary and there may be interactions between multiple variables.

The following code builds a simple classifier using scikit-learn. It uses MLflow to keep track of the model accuracy, and to save the model for later use.

In [None]:
%sh
mkdir -p /Workspace/Shared/wine_quality/experiments

In [None]:
import mlflow
import mlflow.pyfunc
import mlflow.sklearn
import numpy as np
import sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from mlflow.models.signature import infer_signature
from mlflow.utils.environment import _mlflow_conda_env
import cloudpickle
import time

# The predict method of sklearn's RandomForestClassifier returns a binary classification (0 or 1).
# The following code creates a wrapper class, SklearnModelWrapper, whose predict method uses
# predict_proba to return the probability that each observation belongs to the positive class.

class SklearnModelWrapper(mlflow.pyfunc.PythonModel):
  def __init__(self, model):
    self.model = model
    
  def predict(self, context, model_input):
    return self.model.predict_proba(model_input)[:,1]

mlflow.set_experiment(f"/Shared/wine_quality/experiments/{run_name}")
# mlflow.start_run creates a new MLflow run to track the performance of this model.
# Within the context, you call mlflow.log_param to keep track of the parameters used, and
# mlflow.log_metric to record metrics such as AUC.
with mlflow.start_run(run_name='untuned_random_forest'):
  n_estimators = 10
  model = RandomForestClassifier(n_estimators=n_estimators, random_state=np.random.RandomState(123))
  model.fit(X_train, y_train)

  # predict_proba returns [prob_negative, prob_positive], so slice the output with [:, 1]
  predictions_test = model.predict_proba(X_test)[:,1]
  auc_score = roc_auc_score(y_test, predictions_test)
  mlflow.log_param('n_estimators', n_estimators)
  # Use the area under the ROC curve as a metric.
  mlflow.log_metric('auc', auc_score)
  wrappedModel = SklearnModelWrapper(model)
  # Log the model with a signature that defines the schema of the model's inputs and outputs. 
  # When the model is deployed, this signature will be used to validate inputs.
  signature = infer_signature(X_train, wrappedModel.predict(None, X_train))
  
  # MLflow contains utilities to create a conda environment used to serve models.
  # The necessary dependencies are added to a conda.yaml file which is logged along with the model.
  conda_env =  _mlflow_conda_env(
        additional_conda_deps=None,
        additional_pip_deps=["cloudpickle=={}".format(cloudpickle.__version__), "scikit-learn=={}".format(sklearn.__version__)],
        additional_conda_channels=None,
    )
  mlflow.pyfunc.log_model("model", python_model=wrappedModel, conda_env=conda_env, signature=signature)
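
To see why the wrapper returns `predict_proba(...)[:, 1]` rather than `predict(...)`, it can help to compare the two outputs side by side. The following is a minimal sketch on synthetic data (an assumption, not the wine dataset); `X_demo`, `y_demo`, and `demo_model` are illustrative names only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary data (assumption): the label depends on the first feature.
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 4)
y_demo = (X_demo[:, 0] > 0.5).astype(int)

demo_model = RandomForestClassifier(n_estimators=10, random_state=0)
demo_model.fit(X_demo, y_demo)

# predict returns hard class labels (0 or 1).
hard_labels = demo_model.predict(X_demo[:5])
# predict_proba returns one column per class; column 1 is P(label == 1).
positive_probs = demo_model.predict_proba(X_demo[:5])[:, 1]

print(hard_labels)
print(positive_probs)
```

Continuous scores like these are what `roc_auc_score` needs to rank observations; hard labels would discard that ranking information.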

Examine the learned feature importances output by the model as a sanity-check.

In [None]:
feature_importances = pd.DataFrame(model.feature_importances_, index=X_train.columns.tolist(), columns=['importance'])
feature_importances.sort_values('importance', ascending=False)

As illustrated by the boxplots shown previously, both alcohol and density are important in predicting quality.

You logged the Area Under the ROC Curve (AUC) to MLflow. Click **Experiment** at the upper right to display the Experiment Runs sidebar. 

The model achieved an AUC of 0.854.

A random classifier would have an AUC of 0.5, and higher AUC values are better. For more information, see [Receiver Operating Characteristic Curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve).
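
One way to read AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as half. The sketch below computes that quantity directly in plain Python on made-up labels and scores; `pairwise_auc` is a hypothetical helper written for illustration, and the notebook itself relies on `roc_auc_score`.

```python
import random


def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


random.seed(123)
labels = [random.randint(0, 1) for _ in range(1000)]

perfect_scores = [float(y) for y in labels]        # scores separate the classes exactly
random_scores = [random.random() for _ in labels]  # scores ignore the labels

print(pairwise_auc(labels, perfect_scores))  # 1.0
print(pairwise_auc(labels, random_scores))   # close to 0.5
```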

## Register the model in MLflow Model Registry

By registering this model in Model Registry, you can easily reference the model from anywhere within Databricks.

The following section shows how to do this programmatically, but you can also register a model using the UI. See "[Create or register a model using the UI](https://docs.microsoft.com/azure/databricks/applications/machine-learning/manage-model-lifecycle/index#create-or-register-a-model-using-the-ui)".

In [None]:
run_id = mlflow.search_runs(filter_string='tags.mlflow.runName = "untuned_random_forest"').iloc[0].run_id

# uncomment when incorporating hyperparameter search code in Part 4
# best_run = mlflow.search_runs(order_by=['metrics.auc DESC']).iloc[0]
# print(f'AUC of Best Run: {best_run["metrics.auc"]}')
# run_id = best_run.run_id

In [None]:
# If you see the error "PERMISSION_DENIED: User does not have any permission level assigned to the registered model", 
# the cause may be that a model already exists with the name "wine_quality". Try using a different name.

# to create your own version of the model, uncomment the next line, and comment the line after
# model_name = f"wine_quality-{run_name}"
model_name = "wine_quality"
model_version = mlflow.register_model(f"runs:/{run_id}/model", model_name)

# Registering the model takes a few seconds, so add a small delay
time.sleep(15)
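
A fixed delay works for a workshop, but polling the registration status is a common alternative. The sketch below is one possible shape for that, not the notebook's method: `wait_until_ready` is a hypothetical helper, and the status strings are assumed to match MLflow's model version statuses (`PENDING_REGISTRATION`, `READY`, `FAILED_REGISTRATION`). In the notebook, assuming an `MlflowClient` instance named `client`, `get_status` could be `lambda: client.get_model_version(model_name, model_version.version).status`; here a fake status source is used so the example runs anywhere.

```python
import time


def wait_until_ready(get_status, timeout_s=60.0, poll_s=1.0):
    """Poll get_status() until it returns "READY"; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == "READY":
            return status
        if status == "FAILED_REGISTRATION":
            raise RuntimeError("model registration failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"model still {status!r} after {timeout_s}s")
        time.sleep(poll_s)


# Stand-in for the registry: two pending polls, then ready.
fake_statuses = iter(["PENDING_REGISTRATION", "PENDING_REGISTRATION", "READY"])
result = wait_until_ready(lambda: next(fake_statuses), timeout_s=5.0, poll_s=0.01)
print(result)  # READY
```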

You should now see the model in the Models page. To display the Models page, click the Models icon in the left sidebar. 

Next, transition this model to staging and load it into this notebook from Model Registry.

In [None]:
from mlflow.tracking import MlflowClient

client = MlflowClient()
client.transition_model_version_stage(
  name=model_name,
  version=model_version.version,
  stage="Staging",
)

The Models page now shows the model version in stage "Staging".

You can now refer to the model using the path "models:/wine_quality/staging" (or "models:/wine_quality-{run_name}/staging" if you registered the model under your own name).

In [None]:
model = mlflow.pyfunc.load_model(f"models:/{model_name}/staging")

# Sanity-check: This should match the AUC logged by MLflow
print(f'AUC: {roc_auc_score(y_test, model.predict(X_test))}')