# Overview
In this section, you will load the same dataset used in the other two exercises into a dataframe in Databricks, train a scikit-learn logistic regression model, and then export that model from the MLflow model registry to the Vertex AI model registry so that it can be used for predictions.

## Setup
1. First, create a new Databricks notebook in Python
2. You may need to start the cluster you created earlier, and attach the new notebook to the cluster
3. Install some additional libraries onto the Databricks cluster

In [None]:
%pip install category_encoders
%pip install databricks_registry_webhooks
%pip install db-dtypes
%pip install google-cloud-mlflow
%pip install google-cloud-aiplatform
%pip install google-cloud-bigquery
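
Once the installs finish, it can help to confirm that each package is actually visible to the notebook's Python environment. The sketch below is one way to do that with the standard library. The helper name `installed_versions` is hypothetical, and the distribution names are taken from the `%pip install` lines above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names copied from the %pip install cell.
REQUIRED = [
    "category_encoders",
    "databricks_registry_webhooks",
    "db-dtypes",
    "google-cloud-mlflow",
    "google-cloud-aiplatform",
    "google-cloud-bigquery",
]

for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as not installed points to an install that failed or a notebook attached to a different cluster.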

4. You will need to restart the cluster after these packages have been installed
5. Next, import the required libraries. Remember that if the cluster turns off, you'll need to rerun these imports

In [None]:
from pyspark import *
from pyspark.mllib import *
from pyspark.mllib.linalg import SparseVector
from pyspark.sql import Row
from pyspark.sql.functions import col,isnan, when, count
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_curve, roc_auc_score, classification_report, accuracy_score, confusion_matrix 
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder
from category_encoders import *
import mlflow
import bamboolib as bam
import os
import urllib.request
import google.oauth2.id_token
import google.auth.transport.requests
from databricks_registry_webhooks import RegistryWebhooksClient, HttpUrlSpec
from google.protobuf.json_format import MessageToJson
#from google.cloud import aiplatform as vertex_ai
from mlflow import deployments

7. Next, define some variables & helper functions

In [None]:
# auth helper - fetches the ID token needed when making a request to deploy the model
def make_authorized_get_request(endpoint, audience):
    """
    Fetch a Google-signed ID token for calling the given endpoint.
    Note that no request is sent here; the caller attaches the token itself.
    Args:
        endpoint: The endpoint the token will be used with.
        audience: The audience to use when validating the JWT.
    Returns:
        The ID token (a JWT string).
    """
    # Define the request (built for reference only; it is never sent).
    req = urllib.request.Request(endpoint)

    # Get the ID token from the environment.
    auth_req = google.auth.transport.requests.Request()
    id_token = google.oauth2.id_token.fetch_id_token(auth_req, audience)
    
    return id_token
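
Despite its name, the helper above returns only the ID token and does not send a request. As a minimal sketch of how such a token is typically attached to a call, the hypothetical helper `build_authorized_request` below adds it as a `Bearer` header using only the standard library. The URL and token are placeholders, and the request is built but never sent.

```python
import urllib.request


def build_authorized_request(endpoint, id_token):
    """Build (but do not send) a GET request carrying the ID token as a Bearer header."""
    req = urllib.request.Request(endpoint, method="GET")
    req.add_header("Authorization", f"Bearer {id_token}")
    return req


# Placeholder values for illustration only.
example = build_authorized_request(
    "https://us-central1-my-project.cloudfunctions.net/deploy", "placeholder-token"
)
print(example.get_method(), example.full_url)
print(example.get_header("Authorization"))
```

The same `Bearer` format is what the webhook later in this section passes in its `authorization` field.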

8. Depending on how your GCP / Databricks project environment has been configured to communicate, you may need to generate an authorization token so that Databricks can read in data from BigQuery

* https://github.com/GoogleCloudDataproc/spark-bigquery-connector#how-do-i-authenticate-outside-gce--dataproc
* If you go back to the Google Cloud console and click "Activate Cloud Shell" in the upper right-hand corner, you can generate a token by running the following gcloud command:
`gcloud auth application-default print-access-token`
![](./gcloud_auth.png)

9. Copy the printed access token, go back to your Databricks notebook, and paste it into a new variable

In [None]:
gcpAccessToken="your-access-token"
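
Pasting a token into a notebook cell works, but the value ends up saved with the notebook. As an alternative sketch, the token could be read from an environment variable. The variable name `GCP_ACCESS_TOKEN` and the helper `load_access_token` are assumptions for illustration, not something this exercise sets up for you.

```python
import os


def load_access_token(env=None, var_name="GCP_ACCESS_TOKEN"):
    """Return the access token stored in an environment variable, stripped of whitespace."""
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise ValueError(f"{var_name} is not set; generate a token with gcloud first")
    return token


# Demonstration with a stand-in mapping instead of the real environment.
demo_env = {"GCP_ACCESS_TOKEN": "  your-access-token\n"}
gcpAccessToken = load_access_token(demo_env)
print(gcpAccessToken)
```

Stripping whitespace guards against the stray newline that often comes along when a token is copied from a terminal.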

10. In order to move models using the mlflow google cloud plugin, you'll need to specify a Google Cloud Storage staging bucket for artifacts to be stored as part of the webhook function. If you don't have an existing bucket you want to use, you can create one. 
* In the GCP cloud console, navigate to Cloud Storage
* Select Create Bucket
* Make sure to use the same region where we've been creating all of our other resources (e.g. us-central1)
* Keep the default settings 
* Navigate to the bucket, and create a folder called "models"
![](./bucket_models.png)

11. Next, go back to your Databricks notebook and define some variables. Make sure to use your own project and the same region that you have been using to store models in the Vertex AI model registry.

In [None]:
# CHANGE THESE TO REGISTER DIFFERENT MODEL USING MLFLOW API
bucket_uri = "gs://bq-logit-regression-demo"
models_uri = f"{bucket_uri}/models/"

# VERTEX AI SETTINGS
REGION = "us-central1"
PROJECT_ID = "leedeb-experimentation"
FUNCTION_NAME = "deploy"
CLOUD_FUNCTION_URL = f"https://{REGION}-{PROJECT_ID}.cloudfunctions.net/{FUNCTION_NAME}"
CREDENTIAL_PATH = "/dbfs/FileStore/tables/vertexdatabricks/leedeb_experimentation_dbd17a0139eb.json"
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = CREDENTIAL_PATH

# WHEN STREAMING DATASET IN SPECIFY MODEL STAGE TO LOAD IN
# model_stage="Production"

# set registry GCS bucket URI
mlflow.set_registry_uri(models_uri)
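
Because the Cloud Function URL and the models path are derived from the values above, a typo in any one of them may only show up later as a failed request. The following sketch validates the inputs before building the strings. The helper names `build_models_uri` and `build_function_url` are hypothetical, and the example values mirror the ones in the cell above.

```python
def build_models_uri(bucket_uri, folder="models"):
    """Return the folder URI inside a gs:// bucket, with exactly one trailing slash."""
    if not bucket_uri.startswith("gs://") or len(bucket_uri) <= len("gs://"):
        raise ValueError(f"expected a gs:// bucket URI, got {bucket_uri!r}")
    return f"{bucket_uri.rstrip('/')}/{folder.strip('/')}/"


def build_function_url(region, project_id, function_name):
    """Return the HTTPS URL in the same form as CLOUD_FUNCTION_URL above."""
    for label, value in (("region", region), ("project_id", project_id),
                         ("function_name", function_name)):
        if not value or "/" in value or " " in value:
            raise ValueError(f"invalid {label}: {value!r}")
    return f"https://{region}-{project_id}.cloudfunctions.net/{function_name}"


print(build_models_uri("gs://bq-logit-regression-demo"))
print(build_function_url("us-central1", "leedeb-experimentation", "deploy"))
```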

## Prepare data

12. Use the Spark BigQuery connector to read the data from BigQuery into a Spark dataframe. Make sure to update the full_table_name variable for your project!

In [None]:
# read in data from BQ into spark df
full_table_name="leedeb-experimentation.bq_databricks_vertex.training_data"

df = spark.read.format("bigquery") \
        .option("gcpAccessToken", gcpAccessToken) \
        .option("inferSchema" , "true") \
        .option("table",full_table_name).load()

df.createOrReplaceTempView("training_data")
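
The table name used here has the form `project.dataset.table`. A quick check like the sketch below can catch a malformed name before Spark reports a less obvious error. `split_table_name` is a hypothetical helper, and it assumes a plain project ID with no dots in it.

```python
def split_table_name(full_table_name):
    """Split 'project.dataset.table' into its three parts, rejecting anything else."""
    parts = full_table_name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"expected 'project.dataset.table', got {full_table_name!r}"
        )
    project, dataset, table = parts
    return project, dataset, table


project, dataset, table = split_table_name(
    "leedeb-experimentation.bq_databricks_vertex.training_data"
)
print(project, dataset, table)
```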

13. Since we'll be using scikit-learn, convert the Spark dataframe to a pandas dataframe.

In [None]:
train_data=df.toPandas()
train_data

14. Preprocess the data for model training

In [None]:
# separate the features (X) from the label (y)
X = train_data.drop(columns='label')
y = train_data.label.values

# use binary encoding to encode two categorical features
enc = BinaryEncoder().fit(X)

# transform the dataset
numeric_dataset = enc.transform(X)

np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(numeric_dataset, y)
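
To make the encoding step less opaque, the sketch below illustrates the general idea behind binary encoding for a single column: each category is assigned an integer, and that integer is written out as binary digits, one digit per output column. This is a simplified pure-Python illustration, not the `category_encoders` implementation. The function name `binary_encode` and the choice to number categories from 1 in order of first appearance are assumptions.

```python
def binary_encode(values):
    """Encode a list of categories as rows of binary digits (most significant first)."""
    # Assign ordinals 1, 2, 3, ... in order of first appearance.
    ordinals = {}
    for value in values:
        if value not in ordinals:
            ordinals[value] = len(ordinals) + 1

    # Number of binary digits needed to represent the largest ordinal.
    width = max(1, len(ordinals).bit_length())

    rows = []
    for value in values:
        code = ordinals[value]
        rows.append([(code >> shift) & 1 for shift in range(width - 1, -1, -1)])
    return rows


colors = ["red", "green", "blue", "red"]
for category, bits in zip(colors, binary_encode(colors)):
    print(category, bits)
```

With many categories, this needs far fewer columns than one-hot encoding: roughly the base-2 logarithm of the category count instead of one column per category.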

## Train the model
15. Train the model and log the run to MLflow

In [None]:
with mlflow.start_run() as run:
    mlflow.sklearn.autolog()
    
    # define model
    scaler = StandardScaler()
    lr = LogisticRegression()
    model = Pipeline([('standardize', scaler),
                        ('log_reg', lr)])

    # fit the model
    model.fit(X_train, y_train)
    
    # mlflow.end_run()
    run_id = mlflow.active_run().info.run_id
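
For reference, the pipeline above amounts to two steps. Writing $\mu_j$ and $\sigma_j$ for the mean and standard deviation of feature $j$ in the training data, and $w$ and $b$ for the learned weights and intercept (notation introduced here for illustration), standardization followed by the logistic model can be written as:

```latex
z_j = \frac{x_j - \mu_j}{\sigma_j}, \qquad
P(y = 1 \mid x) = \frac{1}{1 + e^{-(w^\top z + b)}}
```

Assuming the labels are 0 and 1, `model.predict_proba` returns this probability in its second column, and `model.predict` reports class 1 when it exceeds 0.5.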

## Evaluate the model
16. Evaluate the performance 

In [None]:
# evaluate the model - training score
y_train_hat = model.predict(X_train)
y_train_hat_probs = model.predict_proba(X_train)[:,1]

train_accuracy = accuracy_score(y_train, y_train_hat)*100
train_auc_roc = roc_auc_score(y_train, y_train_hat_probs)*100

print('Confusion matrix:\n', confusion_matrix(y_train, y_train_hat))

print('Training AUC: %.4f %%' % train_auc_roc)

print('Training accuracy: %.4f %%' % train_accuracy)

In [None]:
# evaluate the model - testing score
y_test_hat = model.predict(X_test)
y_test_hat_probs = model.predict_proba(X_test)[:,1]

test_accuracy = accuracy_score(y_test, y_test_hat)*100
test_auc_roc = roc_auc_score(y_test, y_test_hat_probs)*100

print('Confusion matrix:\n', confusion_matrix(y_test, y_test_hat))

print('Testing AUC: %.4f %%' % test_auc_roc)

print('Testing accuracy: %.4f %%' % test_accuracy) 

In [None]:
# check precision and recall
print(classification_report(y_test, y_test_hat, digits=6))
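
The numbers in the classification report can be derived directly from the confusion matrix. As a minimal sketch, the hypothetical helper below assumes a binary problem and scikit-learn's layout, where rows are true labels and columns are predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]`. The example counts are made up for illustration and are not results from this exercise.

```python
def metrics_from_confusion(matrix):
    """Compute accuracy, precision, and recall for the positive class."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up counts, not results from this exercise.
example = metrics_from_confusion([[50, 10], [5, 35]])
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```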

## Register model in MLflow & prepare for deployment to Vertex
17. In order to use the model for online prediction, we need to import the model into the Vertex AI model registry. First, the model needs to be registered in the MLflow model registry. We then create a webhook for use with the MLflow Google Cloud AI Platform plugin. When the model version is transitioned to the Production stage in MLflow, the webhook triggers the deployment of the model to the Vertex AI model registry and to an endpoint in Vertex AI.

In [None]:
# get the mlflow run id
run_id = run.info.run_id
run_id

In [None]:
# register the model in mlflow model registry
model_name_mlflow = 'manually_registered_sklearn'
artifact_path = 'model'
# runs:/ URIs take the path relative to the run's artifact root, so no "artifacts/" prefix
model_uri = f"runs:/{run_id}/{artifact_path}"
model_details = mlflow.register_model(model_uri=model_uri, name=model_name_mlflow)

In [None]:
import time
from mlflow.tracking.client import MlflowClient
from mlflow.entities.model_registry.model_version_status import ModelVersionStatus
 
def wait_until_ready(model_name, model_version):
  client = MlflowClient()
  for _ in range(10):
    model_version_details = client.get_model_version(
      name=model_name,
      version=model_version,
    )
    status = ModelVersionStatus.from_string(model_version_details.status)
    print("Model status: %s" % ModelVersionStatus.to_string(status))
    if status == ModelVersionStatus.READY:
      break
    time.sleep(1)
  
wait_until_ready(model_details.name, model_details.version)
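
Note that the loop above simply stops after ten attempts, whether or not the model version became ready. A variant that fails loudly can be sketched as follows. `poll_until` is a hypothetical generic helper, and the status check is passed in as a function so that the sketch does not depend on MLflow.

```python
import time


def poll_until(check, attempts=10, delay=1.0, sleep=time.sleep):
    """Call check() until it returns True; raise TimeoutError if it never does."""
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(delay)
    raise TimeoutError(f"condition not met after {attempts} attempts")


# Stand-in for a status check: reports ready on the third call.
statuses = iter(["PENDING_REGISTRATION", "PENDING_REGISTRATION", "READY"])
attempts_used = poll_until(lambda: next(statuses) == "READY", delay=0)
print(f"ready after {attempts_used} attempts")
```

In the notebook, `check` could wrap the `client.get_model_version(...)` call and compare the resulting status to `ModelVersionStatus.READY`.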

In [None]:
# add a version specific description
from mlflow.tracking import MlflowClient
client = MlflowClient()
client.update_model_version(
  name=model_details.name,
  version=model_details.version,
  description="This classification model version was built using data from BigQuery."
)

In [None]:
token_id  = make_authorized_get_request(CLOUD_FUNCTION_URL, CLOUD_FUNCTION_URL)

In [None]:
http_url_spec = HttpUrlSpec(
  url=CLOUD_FUNCTION_URL,
  authorization=f"Bearer {token_id}"
)
http_webhook = RegistryWebhooksClient().create_webhook(
  model_name=model_details.name,
  events=["MODEL_VERSION_TRANSITIONED_STAGE"],
  http_url_spec=http_url_spec,
  description="Testing deploy model",
  status="TEST_MODE"
)




In [None]:
http_webhook

In [None]:
http_webhook = RegistryWebhooksClient().update_webhook(
  id=http_webhook.id,
  status="ACTIVE"
)
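
On the receiving side, the Cloud Function has to decide whether an incoming event should trigger a deployment. The sketch below shows one way that decision could look. The payload field names (`event`, `model_name`, `version`, `to_stage`) are assumptions about what the webhook sends and should be checked against the Databricks webhook documentation. `should_deploy` is a hypothetical helper, not part of any library.

```python
import json


def should_deploy(payload, target_stage="Production"):
    """Return True only for stage-transition events that move a version to target_stage."""
    if payload.get("event") != "MODEL_VERSION_TRANSITIONED_STAGE":
        return False
    return payload.get("to_stage") == target_stage


# Example body with assumed field names, for illustration only.
body = json.loads("""
{
  "event": "MODEL_VERSION_TRANSITIONED_STAGE",
  "model_name": "manually_registered_sklearn",
  "version": "1",
  "to_stage": "Production"
}
""")
print(should_deploy(body))
```

Filtering on the target stage keeps other transitions, such as a move to Staging or Archived, from starting a deployment.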

In [None]:
# transition the model version to the Production stage in mlflow
time.sleep(20)

client.transition_model_version_stage(
  name=model_details.name,
  version=model_details.version,
  stage='Production',
)

In [None]:
import mlflow.pyfunc

model_version_uri = "models:/{model_name}/production".format(model_name=model_name_mlflow)

print("Loading PRODUCTION model stage with name: '{model_uri}'".format(model_uri=model_version_uri))

In [None]:
config = {
    "machine_type": "n1-standard-2",
    "description": 'Serving a logistic regression model',

}

In [None]:
# TODO: Add better configuration
from mlflow import deployments

# use a separate name so the MlflowClient stored in `client` is not overwritten
deploy_client = deployments.get_deploy_client("google_cloud")
model_version_uri = f"models:/{model_name_mlflow}/production"
deployment = deploy_client.create_deployment(
    name=model_name_mlflow,
    model_uri=model_version_uri,
    config=config)