# 06a - Vertex AI > Model Monitoring

Other notebooks in this series include end-to-end workflows that serve trained ML models on Vertex AI endpoints. This notebook extends an endpoint by enabling model monitoring, which runs on a schedule and checks selected model features for two kinds of deviation:
- Training-serving skew: the feature distribution at serving time differs from the feature distribution in the training data
- Prediction drift: the feature distribution at serving time changes over time

Monitoring is set up with a threshold for each feature, and an alert is created when the calculated difference exceeds that threshold:
- for numerical features the difference is calculated with [Jensen-Shannon divergence](https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence)
- for categorical features the difference is calculated with [L-infinity distance](https://en.wikipedia.org/wiki/Chebyshev_distance)
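
To make the two distance measures concrete, here is a minimal pure-Python sketch. The histograms, category counts, and helper names are hypothetical and chosen only for illustration. How Vertex AI bins numerical features internally is not described in this notebook, so treat this as a sketch of the math rather than of the service's implementation.

```python
import math

def normalize(counts):
    """Turn raw counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2, so the result lies in [0, 1])."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def l_infinity_distance(p, q):
    """Largest absolute difference between matching category frequencies."""
    return max(abs(pi - qi) for pi, qi in zip(p, q))

# Hypothetical histograms of one numerical feature (same bins in both samples)
training_hist = normalize([50, 30, 15, 5])
serving_hist = normalize([20, 30, 30, 20])

# Hypothetical frequencies of one categorical feature (same categories in both)
training_cats = normalize([70, 20, 10])
serving_cats = normalize([50, 30, 20])

THRESHOLD = 0.001  # same very low demonstration value used later in this notebook

js = js_divergence(training_hist, serving_hist)
linf = l_infinity_distance(training_cats, serving_cats)
print(f"numerical feature:   JS divergence = {js:.4f}, alert = {js > THRESHOLD}")
print(f"categorical feature: L-infinity    = {linf:.4f}, alert = {linf > THRESHOLD}")
```

Both measures are 0 for identical distributions and grow as the distributions move apart, which is why a single threshold per feature is enough to trigger an alert.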



### Prerequisites:
-  02a - Vertex AI - AutoML in GCP Console (no code)
    - or any other notebook that creates a Vertex AI Endpoint
    - `02a` is used here because it has an endpoint set up for prediction and explanation

### Overview:
- Find Existing Endpoint
- Prediction from endpoint using Python API
- Start Monitoring Job for Skew and Drift
    - Setup Monitoring Client
    - Setup Monitoring Job
- Run Prediction with Training Data
    - Review Alerts
- Run Predictions with Test Data
    - Review Alerts
- Extended Run of Predictions with Noise
    - Review Alerts and Distributions
- Pause and Delete Monitoring Job

### Resources:
- [Python Client for Vertex AI](https://googleapis.dev/python/aiplatform/latest/aiplatform.html)
- [Model Monitoring Documentation](https://cloud.google.com/vertex-ai/docs/model-monitoring/overview)
- [Blog: Monitor Models with Vertex AI](https://cloud.google.com/blog/topics/developers-practitioners/monitor-models-training-serving-skew-vertex-ai)

---
## Vertex AI - Conceptual Flow

<img src="architectures/slides/06a_arch.png">

---
## Vertex AI - Workflow

<img src="architectures/slides/06a_console.png">

---
## Setup

inputs:

In [7]:
REGION = 'us-central1'
PROJECT_ID='ma-mx-presales-lab'
DATANAME = 'fraud'
NOTEBOOK = '06a'

# Model Training
VAR_TARGET = 'Class'
VAR_OMIT = 'transaction_id' # add more variables to the string with space delimiters

packages:

In [8]:
import google.cloud.aiplatform_v1 as vertex
from datetime import datetime
import copy
import time

from google.cloud import bigquery
from google.protobuf import json_format
from google.protobuf.duration_pb2 import Duration
from google.protobuf.struct_pb2 import Value
import json
import numpy as np

clients:

In [9]:
#aiplatform.init(project=PROJECT_ID, location=REGION)
bigquery = bigquery.Client()
client_options = {"api_endpoint": f"{REGION}-aiplatform.googleapis.com"}
parent = f"projects/{PROJECT_ID}/locations/{REGION}"

parameters:

In [10]:
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
BUCKET = "vertex-ai-mlops-bucket"
URI = f"gs://{BUCKET}/{DATANAME}/models/{NOTEBOOK}"
DIR = f"temp/{NOTEBOOK}"

environment:

In [11]:
!rm -rf {DIR}
!mkdir -p {DIR}

---
## Endpoint

Setup Client:

In [12]:
endpointClient = vertex.EndpointServiceClient(client_options = client_options)

Find Endpoint:

In [13]:
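endpoint_prefix = '02a'
endpoint = None  # stays None when no endpoint display name matches the prefix
for e in endpointClient.list_endpoints(parent = parent):
    if e.display_name.startswith(endpoint_prefix):
        endpoint = e
if endpoint is None:
    # fail with a clear message instead of a NameError on the print below
    raise LookupError(f"No endpoint with a display name starting with '{endpoint_prefix}' found in {parent}")
print(endpoint.display_name)
print(endpoint.name)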

---
## Prediction

### Prepare a record for prediction: instance and parameters lists

In [8]:
pred = bigquery.query(query = f"SELECT * FROM {DATANAME}.{DATANAME}_prepped WHERE splits='TEST' LIMIT 10").to_dataframe()
pred.head(4)

| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class | transaction_id | splits |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7148 | 1.156386 | 0.193513 | 0.24222 | 0.660729 | 0.236144 | 0.311471 | -0.08842 | 0.057844 | 1.123405 | ... | -0.051662 | -0.262183 | 0.47787 | 0.556403 | -0.046953 | -0.021878 | 0.0 | 0 | 0eddc3ef-a61b-4fba-a3ab-0ed9a726dcf0 | TEST |
| 1 | 76311 | -0.186529 | 0.545755 | 2.432618 | 3.266129 | -0.784549 | 3.167033 | -2.460489 | -1.830983 | 0.389492 | ... | -0.40038 | -1.26528 | 1.231 | 0.749402 | 0.147862 | 0.187856 | 0.0 | 0 | b1111e03-a559-4eb4-ab32-e3aea0072ef7 | TEST |
| 2 | 125139 | 1.879049 | 0.212473 | -0.085529 | 3.554091 | 0.205505 | 1.188395 | -0.672662 | 0.375249 | -0.494351 | ... | 0.131433 | 0.256023 | -0.13545 | 0.048878 | 0.003082 | -0.042219 | 0.0 | 0 | 0a0f4b69-01ee-436e-ae52-02237cd6433e | TEST |
| 3 | 51632 | 1.26405 | 0.182193 | 0.02091 | 0.47806 | -0.037823 | -0.490973 | 0.16669 | -0.130607 | -0.1572 | ... | -0.167644 | 0.075563 | 0.698539 | 0.556361 | -0.052595 | -0.011799 | 0.0 | 0 | ed678d6e-8dea-4d45-92b7-74e7eba22402 | TEST |


In [9]:
newob = pred[pred.columns[~pred.columns.isin(VAR_OMIT.split()+[VAR_TARGET, 'splits'])]].to_dict(orient='records')[0]
newob['Time'] = str(newob['Time'])
newob

{'Time': '7148',
 'V1': 1.1563856546850502,
 'V2': 0.19351304694375798,
 'V3': 0.24222013113132398,
 'V4': 0.660729271767453,
 'V5': 0.236144478904119,
 'V6': 0.311470701249739,
 'V7': -0.0884201179751894,
 'V8': 0.0578444447798684,
 'V9': 1.12340519250933,
 'V10': -0.415125337823525,
 'V11': 2.61390267388756,
 'V12': -1.11995029301977,
 'V13': 1.83257479990526,
 'V14': 1.8000032869791698,
 'V15': -0.9204892527570009,
 'V16': -0.7715317776122379,
 'V17': 1.00872209269001,
 'V18': -0.8199387457522109,
 'V19': -0.5106310991757079,
 'V20': -0.201218353967519,
 'V21': -0.107305238248846,
 'V22': 0.153991860997963,
 'V23': -0.0516623078695162,
 'V24': -0.262182735306937,
 'V25': 0.47786970630759795,
 'V26': 0.556402927216063,
 'V27': -0.0469529718093107,
 'V28': -0.0218776763871274,
 'Amount': 0.0}

### Get Predictions: Python Client

Client Setup:

In [10]:
predictorClient = vertex.PredictionServiceClient(client_options = client_options)

Instance Input:

In [11]:
instances = [json_format.ParseDict(newob, Value())]

Get Prediction:

In [12]:
prediction = predictorClient.predict(endpoint = endpoint.name, instances = instances)
prediction

predictions {
  struct_value {
    fields {
      key: "classes"
      value {
        list_value {
          values {
            string_value: "0"
          }
          values {
            string_value: "1"
          }
        }
      }
    }
    fields {
      key: "scores"
      value {
        list_value {
          values {
            number_value: 0.9181194305419922
          }
          values {
            number_value: 0.08188050240278244
          }
        }
      }
    }
  }
}
deployed_model_id: "4955340576712032256"
model: "projects/715288179162/locations/us-central1/models/5979693987659776000"
model_display_name: "02a_202215193614"
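
The response pairs each class label with a score at the same position. As a small sketch of how to read it, the snippet below picks the highest-scoring class. It assumes the first entry of `prediction.predictions` has already been converted to a plain Python dict (that conversion step is not shown here); the values are copied from the output above, and `top_class` is a hypothetical helper.

```python
# Plain-dict stand-in for one prediction entry (values copied from the response above)
result = {
    "classes": ["0", "1"],
    "scores": [0.9181194305419922, 0.08188050240278244],
}

def top_class(result):
    """Return the (class, score) pair with the highest score."""
    return max(zip(result["classes"], result["scores"]), key=lambda pair: pair[1])

label, score = top_class(result)
print(f"predicted class: {label} (score {score:.4f})")
```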

## Start Monitoring

Set up a BigQuery view that contains only the training data:

In [13]:
query = f"""
CREATE OR REPLACE VIEW `{PROJECT_ID}.{DATANAME}.{DATANAME}_prepped_trainingView` AS
SELECT * EXCEPT(splits, {VAR_OMIT.replace(' ',',')}) FROM `{PROJECT_ID}.{DATANAME}.{DATANAME}_prepped`
WHERE splits = 'TRAIN'
"""
createView = bigquery.query(query)
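
The view query turns the space-delimited `VAR_OMIT` string into the column list of an `EXCEPT(...)` clause. The sketch below shows that rendering in isolation so it can be checked without running a query. `except_clause` and the `customer_id` column are hypothetical and used only for illustration.

```python
def except_clause(var_omit, extra=("splits",)):
    """Build the column list for a BigQuery `SELECT * EXCEPT(...)` clause.

    var_omit is a space-delimited string of column names, as in VAR_OMIT.
    """
    columns = list(extra) + var_omit.split()
    return ", ".join(columns)

# One omitted column, as configured in this notebook
print(f"SELECT * EXCEPT({except_clause('transaction_id')})")

# Two omitted columns; 'customer_id' is a hypothetical second column
print(f"SELECT * EXCEPT({except_clause('transaction_id customer_id')})")
```

Using `str.split()` rather than `replace(' ', ',')` is a small design choice: it tolerates repeated or trailing spaces in the string, which would otherwise produce empty column names in the query.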

Get a list of column names (features) in the training data:

In [14]:
query = f"SELECT column_name, data_type FROM {DATANAME}.INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = '{DATANAME}_prepped_trainingView' and column_name != '{VAR_TARGET}'"
schema = bigquery.query(query).to_dataframe()
features = schema.column_name.tolist()
features

['Time',
 'V1',
 'V2',
 'V3',
 'V4',
 'V5',
 'V6',
 'V7',
 'V8',
 'V9',
 'V10',
 'V11',
 'V12',
 'V13',
 'V14',
 'V15',
 'V16',
 'V17',
 'V18',
 'V19',
 'V20',
 'V21',
 'V22',
 'V23',
 'V24',
 'V25',
 'V26',
 'V27',
 'V28',
 'Amount']

### Setup Monitoring Client

In [15]:
monitorClient = vertex.JobServiceClient(client_options = client_options)

### Setup Monitoring Job

Links to Python API:
- [aiplatform.gapic.JobServiceClient](https://googleapis.dev/python/aiplatform/latest/aiplatform_v1/job_service.html#google.cloud.aiplatform_v1.services.job_service.JobServiceClient)
    - [.create_model_deployment_monitoring_job](https://googleapis.dev/python/aiplatform/latest/aiplatform_v1/job_service.html#google.cloud.aiplatform_v1.services.job_service.JobServiceClient.create_model_deployment_monitoring_job)
        - [model_deployment_monitoring_job = aiplatform.gapic.types.ModelDeploymentMonitoringJob](https://googleapis.dev/python/aiplatform/latest/aiplatform_v1/types.html#google.cloud.aiplatform_v1.types.ModelDeploymentMonitoringJob)
            - [model_deployment_monitoring_objective_configs = aiplatform.gapic.types.ModelDeploymentMonitoringObjectiveConfig](https://googleapis.dev/python/aiplatform/latest/aiplatform_v1/types.html#google.cloud.aiplatform_v1.types.ModelDeploymentMonitoringObjectiveConfig)
                - [objective_config = aiplatform.gapic.types.ModelMonitoringObjectiveConfig](https://googleapis.dev/python/aiplatform/latest/aiplatform_v1/types.html#google.cloud.aiplatform_v1.types.ModelMonitoringObjectiveConfig)
                    - [training_dataset = aiplatform.gapic.types.ModelMonitoringObjectiveConfig.TrainingDataset](https://googleapis.dev/python/aiplatform/latest/aiplatform_v1/types.html#google.cloud.aiplatform_v1.types.ModelMonitoringObjectiveConfig.TrainingDataset)
                    - [training_prediction_skew_detection_config = aiplatform.gapic.types.ModelMonitoringObjectiveConfig.TrainingPredictionSkewDetectionConfig](https://googleapis.dev/python/aiplatform/latest/aiplatform_v1/types.html#google.cloud.aiplatform_v1.types.ModelMonitoringObjectiveConfig.TrainingPredictionSkewDetectionConfig)
                    - [prediction_drift_detection_config = aiplatform.gapic.types.ModelMonitoringObjectiveConfig.PredictionDriftDetectionConfig](https://googleapis.dev/python/aiplatform/latest/aiplatform_v1/types.html#google.cloud.aiplatform_v1.types.ModelMonitoringObjectiveConfig.PredictionDriftDetectionConfig)
            - [logging_sampling_strategy = aiplatform.gapic.types.SamplingStrategy](https://googleapis.dev/python/aiplatform/latest/aiplatform_v1/types.html#google.cloud.aiplatform_v1.types.SamplingStrategy)
                - [random_sample_config = aiplatform.gapic.types.SamplingStrategy.RandomSampleConfig](https://googleapis.dev/python/aiplatform/latest/aiplatform_v1/types.html#google.cloud.aiplatform_v1.types.SamplingStrategy.RandomSampleConfig)
            - [model_deployment_monitoring_schedule_config = aiplatform.gapic.types.ModelDeploymentMonitoringScheduleConfig](https://googleapis.dev/python/aiplatform/latest/aiplatform_v1/types.html#google.cloud.aiplatform_v1.types.ModelDeploymentMonitoringScheduleConfig)
            - [model_monitoring_alert_config = aiplatform.gapic.types.ModelMonitoringAlertConfig](https://googleapis.dev/python/aiplatform/latest/aiplatform_v1/types.html#google.cloud.aiplatform_v1.types.ModelMonitoringAlertConfig)
                - [email_alert_config = aiplatform.gapic.types.ModelMonitoringAlertConfig.EmailAlertConfig]()

In [16]:
USER_EMAIL = "foo@foobar.com" # send alerts here
MONITOR_INTERVAL = 3600 # seconds, intervals are rounded up to the nearest hour
SKEW_DEFAULT_THRESHOLD_VALUE = 0.001 # very low for demonstration
DRIFT_DEFAULT_THRESHOLD_VALUE = 0.001 # very low for demonstration
SAMPLE_RATE = 1 # fraction of prediction requests to log, in (0, 1]; 1 logs every request
FEATURES_TO_MONITOR = features[-5:]

skew_thresholds, drift_thresholds = {}, {}
for feature in FEATURES_TO_MONITOR:
    skew_thresholds[feature] = vertex.types.ThresholdConfig(value = SKEW_DEFAULT_THRESHOLD_VALUE)
    drift_thresholds[feature] = vertex.types.ThresholdConfig(value = DRIFT_DEFAULT_THRESHOLD_VALUE)
skew_config = vertex.types.ModelMonitoringObjectiveConfig.TrainingPredictionSkewDetectionConfig(
    skew_thresholds = skew_thresholds
)    
drift_config = vertex.types.ModelMonitoringObjectiveConfig.PredictionDriftDetectionConfig(
    drift_thresholds = drift_thresholds
)
    
training_dataset = vertex.types.ModelMonitoringObjectiveConfig.TrainingDataset(
    target_field = VAR_TARGET,
    bigquery_source = vertex.types.BigQuerySource(input_uri = f"bq://{PROJECT_ID}.{DATANAME}.{DATANAME}_prepped_trainingView")
)


objective_config = vertex.types.ModelMonitoringObjectiveConfig(
    training_dataset = training_dataset,
    training_prediction_skew_detection_config = skew_config,
    prediction_drift_detection_config = drift_config
)

# list of models deployed to endpoint
models = [m.id for m in endpoint.deployed_models]

objective_template = vertex.types.ModelDeploymentMonitoringObjectiveConfig(
    objective_config = objective_config
)

objective_configs = []
for model_id in models:
    objective_config = copy.deepcopy(objective_template)
    objective_config.deployed_model_id = model_id
    objective_configs.append(objective_config)

random_sampling = vertex.types.SamplingStrategy.RandomSampleConfig(sample_rate = SAMPLE_RATE)
sampling_config = vertex.types.SamplingStrategy(random_sample_config = random_sampling)
schedule_config = vertex.types.ModelDeploymentMonitoringScheduleConfig(monitor_interval = Duration(seconds = MONITOR_INTERVAL))
alerting_config = vertex.types.ModelMonitoringAlertConfig(
    email_alert_config = vertex.types.ModelMonitoringAlertConfig.EmailAlertConfig(user_emails = [USER_EMAIL])
)
predict_schema = ""
analysis_schema = ""

monitorJob = vertex.types.ModelDeploymentMonitoringJob(
    display_name = f'{NOTEBOOK}_{DATANAME}',
    endpoint = endpoint.name,
    model_deployment_monitoring_objective_configs = objective_configs,
    logging_sampling_strategy = sampling_config,
    model_deployment_monitoring_schedule_config = schedule_config,
    model_monitoring_alert_config = alerting_config,
    predict_instance_schema_uri = predict_schema,
    analysis_instance_schema_uri = analysis_schema
)
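
The cell above applies one default threshold to every monitored feature. As a sketch of the same selection logic in plain Python, the snippet below rebuilds the feature list shown earlier, takes the last five features, and allows a per-feature override. The `overrides` dict and its value are hypothetical; in the notebook each value would be wrapped in `vertex.types.ThresholdConfig`.

```python
# Feature names as listed earlier in this notebook: Time, V1..V28, Amount
features = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount"]

FEATURES_TO_MONITOR = features[-5:]  # last five features
DEFAULT_THRESHOLD = 0.001            # very low, for demonstration only

# Hypothetical per-feature override: tolerate more movement in Amount
overrides = {"Amount": 0.01}

thresholds = {f: overrides.get(f, DEFAULT_THRESHOLD) for f in FEATURES_TO_MONITOR}
for feature, value in thresholds.items():
    print(f"{feature}: {value}")
```

Features that are not in the thresholds dict are not given a threshold here, so widening `FEATURES_TO_MONITOR` is the way to bring more features into the skew and drift checks in this setup.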

Run Job:

In [17]:
response = monitorClient.create_model_deployment_monitoring_job(
    parent = f"projects/{PROJECT_ID}/locations/{REGION}", 
    model_deployment_monitoring_job = monitorJob
)

When the job starts and completes its initial run, it:
- Sends a confirmation email to the alert address provided
- Creates a BigQuery dataset and table for Model Monitoring data
- Adds Model Monitoring to the Endpoint in the Vertex AI Console

|Email Alert|BigQuery Dataset Setup|Endpoint Monitoring|
:---:|:---:|:---:
![](./architectures/notebooks/06a_screenshots/email_start.png)|![](./architectures/notebooks/06a_screenshots/bq_start.png)|![](./architectures/notebooks/06a_screenshots/endpoint_start.png)

## Run Predictions with Training Data for 5 Minutes

Get Training Data:

In [19]:
training = bigquery.query(query = f"SELECT * EXCEPT({VAR_TARGET}) FROM {DATANAME}.{DATANAME}_prepped_trainingView").to_dataframe()
training['Time'] = training['Time'].astype(str)

In [20]:
runMinutes = 5
end = time.time() + 60 * runMinutes

while time.time() < end:
    newob = training.sample(n=1).to_dict(orient='records')[0]
    prediction = predictorClient.predict(endpoint = endpoint.name, instances = [json_format.ParseDict(newob, Value())])

The predictions run at about 1,500 per minute and are logged to the BigQuery dataset:

![](./architectures/notebooks/06a_screenshots/bq_predictions.png)
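
To measure the request rate rather than estimate it, the time-bounded loop can count its own calls. The sketch below uses a hypothetical `fake_predict` stand-in so it runs without an endpoint; in the notebook the call would be `predictorClient.predict(...)`, and the rate would be limited by network latency rather than by Python.

```python
import time

def fake_predict(instance):
    """Hypothetical stand-in for predictorClient.predict; makes no network call."""
    return {"classes": ["0", "1"], "scores": [0.5, 0.5]}

def run_for(seconds, predict, make_instance):
    """Call predict repeatedly until the time budget is spent; return the call count."""
    end = time.monotonic() + seconds
    count = 0
    while time.monotonic() < end:
        predict(make_instance())
        count += 1
    return count

seconds = 0.2  # short budget for the sketch; the notebook uses 5 minutes
count = run_for(seconds, fake_predict, lambda: {"Amount": 0.0})
print(f"{count} requests in {seconds}s, about {count / seconds * 60:.0f} per minute")
```

`time.monotonic()` is used here instead of `time.time()` so the loop length is not affected by system clock adjustments during a long run.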

### Wait for next Monitoring Job Run (less than an hour)
After the job runs, check for alerts:
- An email about the alerts is sent to the alert address
- The alerts are added to Model Monitoring for the Endpoint in the Vertex AI Console

|Email Alert|Endpoint Monitoring Alert|
:---:|:---:
![](./architectures/notebooks/06a_screenshots/email_alert1.png)|![](./architectures/notebooks/06a_screenshots/endpoint_alert1.png)

## Run Predictions with Test Data for 5 Minutes

Get Test Data:

In [21]:
test = bigquery.query(query = f"SELECT * EXCEPT(splits, {VAR_TARGET}, {VAR_OMIT.replace(' ',',')}) FROM {DATANAME}.{DATANAME}_prepped WHERE splits = 'TEST'").to_dataframe()
test['Time'] = test['Time'].astype(str)

In [22]:
runMinutes = 5
end = time.time() + 60 * runMinutes

while time.time() < end:
    newob = test.sample(n=1).to_dict(orient='records')[0]
    prediction = predictorClient.predict(endpoint = endpoint.name, instances = [json_format.ParseDict(newob, Value())])

The predictions run at about 1,500 per minute and are appended to the table in the BigQuery dataset.

### Wait for next Monitoring Job Run (less than an hour)
After the job runs, check for alerts:
- An email about the alerts is sent to the alert address
- The alerts are added to Model Monitoring for the Endpoint in the Vertex AI Console

|Email Alert|Endpoint Monitoring Alert|
:---:|:---:
![](./architectures/notebooks/06a_screenshots/email_alert2.png)|![](./architectures/notebooks/06a_screenshots/endpoint_alert2.png)

## Extended Predictions Run with Noise: 8 Hours, Progressive Drift

In [57]:
runHours = 8
runMinutes = 60 * runHours 
end = time.time() + 60 * runMinutes

while time.time() < end:
    newob = test.sample(n=1).to_dict(orient='records')[0]
    # add noise here
    if (runHours*np.random.uniform(0,1)) <= (runHours-((end-time.time())/3600)): # random (0,runHours) <= hoursElapsed
        newob['Amount'] = newob['Amount'] + np.abs(np.random.normal(0, 250.12))
    prediction = predictorClient.predict(endpoint = endpoint.name, instances = [json_format.ParseDict(newob, Value())])
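
The noise condition in the loop compares `runHours * U`, with `U` drawn uniformly from (0, 1), against the hours elapsed so far. That makes the probability of adding noise equal to `elapsed / runHours`: close to 0 at the start and 1 at the end, which is what produces the progressive drift. The sketch below checks this with a small simulation; `noise_probability` and `simulate_noise_rate` are hypothetical helpers written for this illustration.

```python
import random

def noise_probability(elapsed_hours, run_hours):
    """P(run_hours * U <= elapsed_hours) for U ~ Uniform(0, 1), clipped to [0, 1]."""
    return min(max(elapsed_hours / run_hours, 0.0), 1.0)

def simulate_noise_rate(elapsed_hours, run_hours, trials=100_000, seed=0):
    """Estimate the share of requests that get noise at a given elapsed time."""
    rng = random.Random(seed)
    hits = sum(run_hours * rng.uniform(0, 1) <= elapsed_hours for _ in range(trials))
    return hits / trials

run_hours = 8
for elapsed in (0, 2, 4, 6, 8):
    exact = noise_probability(elapsed, run_hours)
    simulated = simulate_noise_rate(elapsed, run_hours)
    print(f"after {elapsed}h: exact {exact:.2f}, simulated {simulated:.3f}")
```

Because the noise is `abs()` of a normal draw, it only ever increases `Amount`, so the serving distribution shifts to the right as more requests receive it.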

### Evaluate Drift In Console:

|Amount (at start)|Amount (8 hours later)|
:---:|:---:
![](./architectures/notebooks/06a_screenshots/amount_1.png)|![](./architectures/notebooks/06a_screenshots/amount_2.png)

### Monitoring The Endpoint

![](./architectures/notebooks/06a_screenshots/endpoint_monitor2.png)

## Delete Monitoring Job

In [229]:
response.name

'projects/715288179162/locations/us-central1/modelDeploymentMonitoringJobs/8463010268224421888'
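
The job name is a resource path made of alternating collection and ID segments. A short sketch for pulling the parts out of it, for example to log the job ID on its own, follows. `parse_resource_name` is a hypothetical helper and not part of the Vertex AI client library.

```python
def parse_resource_name(name):
    """Split 'collection/id/collection/id/...' into a {collection: id} dict."""
    parts = name.split("/")
    if len(parts) % 2 != 0:
        raise ValueError(f"expected collection/id pairs, got: {name}")
    return dict(zip(parts[0::2], parts[1::2]))

# Job name copied from the output above
job_name = "projects/715288179162/locations/us-central1/modelDeploymentMonitoringJobs/8463010268224421888"
parsed = parse_resource_name(job_name)
print(parsed["locations"], parsed["modelDeploymentMonitoringJobs"])
```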

In [230]:
pause = monitorClient.pause_model_deployment_monitoring_job(
    name = response.name
)

In [231]:
remove = monitorClient.delete_model_deployment_monitoring_job(
    name = response.name
)

In [232]:
remove.__dict__

{'_retry': <google.api_core.retry.Retry at 0x7ffb4a2b0110>,
 '_result': ,
 '_exception': None,
 '_result_set': True,
 '_polling_thread': None,
 '_done_callbacks': [],
 '_operation': name: "projects/715288179162/locations/us-central1/operations/2313307112618328064"
 metadata {
   type_url: "type.googleapis.com/google.cloud.aiplatform.v1.DeleteOperationMetadata"
   value: "\n\032\022\013\010\341\266\337\220\006\020\220\210\233`\032\013\010\341\266\337\220\006\020\220\210\233`"
 }
 done: true
 response {
   type_url: "type.googleapis.com/google.protobuf.Empty"
 },
 '_refresh': functools.partial(<bound method OperationsClient.get_operation of <google.api_core.operations_v1.operations_client.OperationsClient object at 0x7ffb480e45d0>>, 'projects/715288179162/locations/us-central1/operations/2313307112618328064', metadata=None),
 '_cancel': functools.partial(<bound method OperationsClient.cancel_operation of <google.api_core.operations_v1.operations_client.OperationsClient object at 0x7ffb48

## ToDo:
- Update the Monitoring Job
- Add feature attribution
- Add Batch Prediction
- FS integration