# 05i - Vertex AI > Training > Hyperparameter Tuning Jobs - With Custom Container

### 05 Series Overview
Where a model gets trained is where it consumes computing resources.  With Vertex AI, you have choices for configuring the computing resources available at training.  This notebook is an example of an execution environment.  When it was set up there were choices for machine type and accelerators (GPUs).  

In the `05` notebook, the model training happened directly in the notebook.  The models were then imported to Vertex AI and deployed to an endpoint for online predictions. 

In this `05a-05i` series of demonstrations, the same model is trained using managed computing resources in Vertex AI as custom training jobs.  These jobs will be demonstrated as:

-  Custom Job from a python script (`05a`), python source distribution (`05b`), and custom container (`05c`)
-  Training Pipeline that trains and saves models from a python script (`05d`), python source distribution (`05e`), and custom container (`05f`)
-  Hyperparameter Tuning Jobs from a python script (`05g`), python source distribution (`05h`), and custom container (`05i`)

### This Notebook (`05i`): An extension of `05c` with Hyperparameter Tuning - And Tensorboard HParams  
This notebook trains the same Tensorflow Keras model from `05` by first modifying and saving the training code as a Python module on a custom container (same as `05c`).  While this example fits nicely in a single script, larger examples will benefit from the flexibility offered by source distributions or module storage and this notebook gives an example of making the shift. 

The training code is stored directly on the custom container as part of the Docker build process.  This build process uses a pre-built container as the base image and adds both packages and the training code as a Python module.  This container is specified in the setup of a custom training job and also assigned compute resources for executing the training in a managed service.  This is done with the [Vertex AI Python SDK](https://googleapis.dev/python/aiplatform/latest/aiplatform.html#) using the class [`aiplatform.CustomJob()`](https://googleapis.dev/python/aiplatform/latest/aiplatform.html#google.cloud.aiplatform.CustomJob).

The Custom Job is then used as the input for a Vertex AI > Training > Hyperparameter Tuning Job.  The tuning job runs the trials (up to a set number in parallel), collects the metric(s) reported by each trial, and uses the selected search algorithm to choose the parameter values for the next trials.  This is done with the [Vertex AI Python SDK](https://googleapis.dev/python/aiplatform/latest/aiplatform.html#) using the class [`aiplatform.HyperparameterTuningJob()`](https://googleapis.dev/python/aiplatform/latest/aiplatform.html#google.cloud.aiplatform.HyperparameterTuningJob).

The training can be reviewed with Vertex AI's managed Tensorboard under Experiments > Experiments, or by clicking on the `05i...` job under Training > Hyperparameter Tuning Jobs and then clicking the 'Open Tensorboard' link.  **Click on the HParams tab in Tensorboard to review the hyperparameters and metrics.**

<img src="architectures/overview/Training.png">

### Prerequisites:
-  01 - BigQuery - Table Data Source
-  Understanding:
    -  05 - Vertex AI > Notebooks - Models Built in Notebooks with Tensorflow
        -  Contains a more granular review of the Tensorflow model training

### Overview:
- Setup Environment
- Setup Vertex AI > Experiments for Tensorboard
- Training
    - Assemble a Python file/script for training
    - Create a Custom Container containing the Python Script
    - Store the Custom Container in Artifact Registry
    - Setup the Vertex AI > Training > Custom Job
    - Setup the Vertex AI > Training > Hyperparameter Tuning Job
    - Run the Vertex AI > Training > Hyperparameter Tuning Job
    - Review the metrics across Hyperparameter Tuning Job trials and pick the best model
- Serving
    - Upload the chosen model to Vertex AI > Models
    - Create an Endpoint with Vertex AI > Endpoints
    - Deploy the Model to the Endpoint
- Prediction
    - Prepare a record for prediction
    - Get Predictions with Python Client
    - Get Predictions with REST
    - Get Predictions with gcloud CLI

### Resources:
- [Vertex AI Custom Container For Training](https://cloud.google.com/vertex-ai/docs/training/containers-overview)

---
## Vertex AI - Conceptual Flow

<img src="architectures/slides/05i_arch.png">

---
## Vertex AI - Workflow

<img src="architectures/slides/05i_console.png">

---
## Setup

inputs:

In [2]:
REGION = 'us-central1'
PROJECT_ID='ma-mx-presales-lab'
DATANAME = 'fraud'
NOTEBOOK = '05i'

# Resources
BASE_IMAGE = 'gcr.io/deeplearning-platform-release/tf-cpu.2-3'
DEPLOY_IMAGE ='us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-7:latest'
TRAIN_COMPUTE = 'n1-standard-4'
DEPLOY_COMPUTE = 'n1-standard-4'

# Model Training
VAR_TARGET = 'Class'
VAR_OMIT = 'transaction_id' # add more variables to the string with space delimiters
EPOCHS = 10
BATCH_SIZE = 100

packages:

In [3]:
from google.cloud import aiplatform
from datetime import datetime

from google.cloud import bigquery
from google.protobuf import json_format
from google.protobuf.struct_pb2 import Value
import json
import numpy as np

clients:

In [4]:
aiplatform.init(project=PROJECT_ID, location=REGION)
bigquery = bigquery.Client()

parameters:

In [5]:
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
BUCKET = "vertex-ai-mlops-bucket"
URI = f"gs://{BUCKET}/{DATANAME}/models/{NOTEBOOK}"
DIR = f"temp/{NOTEBOOK}"

In [6]:
# Give service account roles/storage.objectAdmin permissions
# Console > IAM > Select Account <projectnumber>-compute@developer.gserviceaccount.com > edit - give role
SERVICE_ACCOUNT = !gcloud config list --format='value(core.account)' 
SERVICE_ACCOUNT = SERVICE_ACCOUNT[0]
SERVICE_ACCOUNT

'825075454589-compute@developer.gserviceaccount.com'

environment:

In [7]:
!rm -rf {DIR}
!mkdir -p {DIR}

---
## Get Vertex AI Experiments Tensorboard Instance Name
[Vertex AI Experiments](https://cloud.google.com/vertex-ai/docs/experiments/tensorboard-overview) provides managed [Tensorboard](https://www.tensorflow.org/tensorboard) instances in which you can track Tensorboard Experiments (a training run or hyperparameter tuning sweep).  

The training job will show up as an experiment for the Tensorboard instance and have the same name as the training job ID.

This code checks to see if a Tensorboard Instance has been created in the project, retrieves it if so, creates it otherwise:

In [10]:
# `tb` is used later when the tuning job is run (tensorboard = tb.resource_name), so this cell must be executed
tb = aiplatform.Tensorboard.list(filter=f'display_name={DATANAME}')
if tb:
    tb = tb[0]
else:
    tb = aiplatform.Tensorboard.create(display_name = DATANAME, labels = {'notebook':f'{DATANAME}'})

tb.resource_name

---
## Training

### Assemble Python File for Training

Create the main Python trainer file as `trainer/train.py`:

In [11]:
!mkdir -p {DIR}/source/trainer

In [12]:
%%writefile {DIR}/source/trainer/train.py

# package import
from tensorflow.python.framework import dtypes
from tensorflow_io.bigquery import BigQueryClient
import tensorflow as tf
from tensorboard.plugins.hparams import api as hp
from google.cloud import bigquery
import argparse
import os
import sys
import hypertune

# import argument to local variables
parser = argparse.ArgumentParser()
# the passed param, dest: a name for the param, default: value used when the argument is absent, type: type to convert to, help: description of argument
parser.add_argument('--epochs', dest = 'epochs', default = 10, type = int, help = 'Number of Epochs')
parser.add_argument('--batch_size', dest = 'batch_size', default = 32, type = int, help = 'Batch Size')
parser.add_argument('--var_target', dest = 'var_target', type=str)
parser.add_argument('--var_omit', dest = 'var_omit', type=str, nargs='*')
parser.add_argument('--project_id', dest = 'project_id', type=str)
parser.add_argument('--dataname', dest = 'dataname', type=str)
parser.add_argument('--region', dest = 'region', type=str)
parser.add_argument('--notebook', dest = 'notebook', type=str)
# hyperparameters
parser.add_argument('--lr',dest='learning_rate', required=True, type=float, help='Learning Rate')
parser.add_argument('--m',dest='momentum', required=True, type=float, help='Momentum')
args = parser.parse_args()

# setup tensorboard hparams
#    "lr": aiplatform.hyperparameter_tuning.DoubleParameterSpec(min=0.001, max=0.1, scale="log"),
#    "m": aiplatform.hyperparameter_tuning.DoubleParameterSpec(min=1e-7, max=0.9, scale="linear")
HP_LEARNING_RATE = hp.HParam('learning_rate',hp.RealInterval(0.0, 1.0))
HP_MOMENTUM = hp.HParam('momentum', hp.RealInterval(0.0,1.0))
hparams = {
    HP_LEARNING_RATE: args.learning_rate,
    HP_MOMENTUM: args.momentum
}

# built in parameters for data source:
PROJECT_ID = args.project_id
DATANAME = args.dataname
REGION = args.region
NOTEBOOK = args.notebook

# clients
bigquery = bigquery.Client(project = PROJECT_ID)

# get schema from bigquery source
query = f"SELECT * FROM {DATANAME}.INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = '{DATANAME}_prepped'"
schema = bigquery.query(query).to_dataframe()

# get number of classes from bigquery source
nclasses = bigquery.query(query = f'SELECT DISTINCT {args.var_target} FROM {DATANAME}.{DATANAME}_prepped WHERE {args.var_target} is not null').to_dataframe()
nclasses = nclasses.shape[0]

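# Make a list of columns to omit (each value is split on whitespace so a space-delimited string also works)
OMIT = [col for item in (args.var_omit or []) for col in item.split()] + ['splits']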

# use schema to prepare a list of columns to read from BigQuery
selected_fields = schema[~schema.column_name.isin(OMIT)].column_name.tolist()

# all the columns in this data source are either float64 or int64
output_types = [dtypes.float64 if x=='FLOAT64' else dtypes.int64 for x in schema[~schema.column_name.isin(OMIT)].data_type.tolist()]

# remap input data to Tensorflow inputs of features and target
def transTable(row_dict):
    target=row_dict.pop(args.var_target)
    target = tf.one_hot(tf.cast(target,tf.int64), nclasses)
    target = tf.cast(target, tf.float32)
    return(row_dict, target)

# function to setup a bigquery reader with Tensorflow I/O
def bq_reader(split):
    reader = BigQueryClient()

    training = reader.read_session(
        parent = f"projects/{PROJECT_ID}",
        project_id = PROJECT_ID,
        table_id = f"{DATANAME}_prepped",
        dataset_id = DATANAME,
        selected_fields = selected_fields,
        output_types = output_types,
        row_restriction = f"splits='{split}'",
        requested_streams = 3
    )
    
    return training

train = bq_reader('TRAIN').parallel_read_rows().prefetch(1).map(transTable).shuffle(args.batch_size*10).batch(args.batch_size)
validate = bq_reader('VALIDATE').parallel_read_rows().prefetch(1).map(transTable).batch(args.batch_size)
test = bq_reader('TEST').parallel_read_rows().prefetch(1).map(transTable).batch(args.batch_size)

# Logistic Regression

# model input definitions
feature_columns = {header: tf.feature_column.numeric_column(header) for header in selected_fields if header != args.var_target}
feature_layer_inputs = {header: tf.keras.layers.Input(shape = (1,), name = header) for header in selected_fields if header != args.var_target}

# feature columns to a Dense Feature Layer
feature_layer_outputs = tf.keras.layers.DenseFeatures(feature_columns.values())(feature_layer_inputs)

# batch normalization then Dense with softmax activation to nclasses
layers = tf.keras.layers.BatchNormalization()(feature_layer_outputs)
layers = tf.keras.layers.Dense(nclasses, activation = tf.nn.softmax)(layers)

# the model
model = tf.keras.Model(
    inputs = feature_layer_inputs,
    outputs = layers
)
opt = tf.keras.optimizers.SGD(learning_rate = hparams[HP_LEARNING_RATE], momentum = hparams[HP_MOMENTUM]) #SGD or Adam
loss = tf.keras.losses.CategoricalCrossentropy()
model.compile(
    optimizer = opt,
    loss = loss,
    metrics = ['accuracy', tf.keras.metrics.AUC(curve='PR')]
)

# setup tensorboard logs and train
log_dir=os.environ['AIP_TENSORBOARD_LOG_DIR']
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir = log_dir, histogram_freq = 1)
hparams_callback = hp.KerasCallback(log_dir + 'train/', hparams)
history = model.fit(train, epochs = args.epochs, callbacks = [tensorboard_callback, hparams_callback], validation_data = validate)

# output the model save files
model.save(os.getenv("AIP_MODEL_DIR"))

# report hypertune info back to Vertex AI Training > Hyperparameter Tuning Job
hpt = hypertune.HyperTune()
hpt.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag = 'loss',
    metric_value = history.history['loss'][-1],
    global_step = 1)

Writing temp/05i/source/trainer/train.py
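
For each trial, the tuning service passes the chosen hyperparameter values to the container as extra command-line arguments named after the keys of the parameter spec (`lr` and `m` in this notebook), which is why the script declares both as required. The sketch below shows that hand-off with a trimmed copy of the parser; the argument list `trial_argv` is hypothetical and its values are illustrative, not taken from a real trial.

```python
import argparse

# Trimmed copy of the parser in trainer/train.py (only a few arguments kept)
parser = argparse.ArgumentParser()
parser.add_argument('--epochs', dest='epochs', default=10, type=int)
parser.add_argument('--var_omit', dest='var_omit', type=str, nargs='*')
parser.add_argument('--lr', dest='learning_rate', required=True, type=float)
parser.add_argument('--m', dest='momentum', required=True, type=float)

# Hypothetical argv for one trial: fixed job arguments plus the tuned values
trial_argv = ['--epochs=10', '--var_omit=transaction_id', '--lr=0.01', '--m=0.5']
args = parser.parse_args(trial_argv)

# dest= controls the attribute name, so --lr lands in args.learning_rate
print(args.epochs, args.var_omit, args.learning_rate, args.momentum)
```

Because `--var_omit` uses `nargs='*'`, the parsed value is always a list, even when a single column name is passed.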


### Create Custom Container
- https://cloud.google.com/vertex-ai/docs/training/create-custom-container
- https://cloud.google.com/vertex-ai/docs/training/pre-built-containers
- https://cloud.google.com/vertex-ai/docs/general/deep-learning
    - https://cloud.google.com/deep-learning-containers/docs/choosing-container

#### Choose a Base Image

In [13]:
BASE_IMAGE # Defined above in Setup

'gcr.io/deeplearning-platform-release/tf-cpu.2-3'

#### Create the Dockerfile
A basic Dockerfile that takes the base image, copies the training code in, and defines an entrypoint (the Python module to run when the container starts).  Add RUN entries to pip install additional packages.

In this case, hyperparameter tuning [reports metrics to Vertex AI](https://cloud.google.com/vertex-ai/docs/training/using-hyperparameter-tuning#report-metrics) using the [cloudml-hypertune Python package](https://github.com/GoogleCloudPlatform/cloudml-hypertune), which is missing from the base image and is therefore installed in the Dockerfile.  

In [14]:
dockerfile = f"""
FROM {BASE_IMAGE}
WORKDIR /
# Install Additional Packages
RUN pip install cloudml-hypertune
## Copies the trainer code to the docker image
COPY trainer /trainer
## Sets up the entry point to invoke the trainer
ENTRYPOINT ["python", "-m", "trainer.train"]
"""
with open(f'{DIR}/source/Dockerfile', 'w') as f:
    f.write(dockerfile)

#### Setup Artifact Registry

The container will need to be stored in Artifact Registry, Container Registry or Docker Hub in order to be used by Vertex AI Training jobs.  This notebook will set up Artifact Registry and push a container built locally (in this notebook) to it. 

https://cloud.google.com/artifact-registry/docs/docker/store-docker-container-images#gcloud

##### Enable Artifact Registry API:
Check whether the API is enabled; if not, enable it:

In [15]:
services = !gcloud services list --format="json" --available --filter=name:artifactregistry.googleapis.com
services = json.loads("".join(services))

if (services[0]['config']['name'] == 'artifactregistry.googleapis.com') and (services[0]['state'] == 'ENABLED'):
    print(f"Artifact Registry is Enabled for This Project: {PROJECT_ID}")
else:
    print(f"Enabling Artifact Registry for this Project: {PROJECT_ID}")
    !gcloud services enable artifactregistry.googleapis.com

Artifact Registry is Enabled for This Project: ma-mx-presales-lab


##### Create A Repository
Check whether the repository already exists; if not, create it:

In [16]:
repositories = !gcloud artifacts repositories list --format="json" --filter=REPOSITORY:{PROJECT_ID}
repositories = json.loads("".join(repositories[2:]))

if len(repositories) > 0:
    print(f'There is already a repository named {PROJECT_ID}')
else:
    print(f'Creating a repository named {PROJECT_ID}')
    !gcloud  artifacts repositories create {PROJECT_ID} --repository-format=docker --location={REGION} --description="Vertex AI Training Custom Containers"

There is already a repository named ma-mx-presales-lab


##### Configure Local Docker to Use GCLOUD CLI

In [17]:
!gcloud auth configure-docker {REGION}-docker.pkg.dev --quiet


{
  "credHelpers": {
    "gcr.io": "gcloud",
    "us.gcr.io": "gcloud",
    "eu.gcr.io": "gcloud",
    "asia.gcr.io": "gcloud",
    "staging-k8s.gcr.io": "gcloud",
    "marketplace.gcr.io": "gcloud",
    "us-central1-docker.pkg.dev": "gcloud"
  }
}
Adding credentials for: us-central1-docker.pkg.dev
gcloud credential helpers already registered correctly.


#### Build The Custom Container (local to notebook)

In [18]:
IMAGE_URI=f"{REGION}-docker.pkg.dev/{PROJECT_ID}/{PROJECT_ID}/{NOTEBOOK}_{DATANAME}:latest"
IMAGE_URI

'us-central1-docker.pkg.dev/ma-mx-presales-lab/ma-mx-presales-lab/05i_fraud:latest'

In [19]:
!docker build {DIR}/source/. -t $IMAGE_URI

Sending build context to Docker daemon  9.216kB
Step 1/5 : FROM gcr.io/deeplearning-platform-release/tf-cpu.2-3
 ---> 1fdfb6e767fe
Step 2/5 : WORKDIR /
 ---> Using cache
 ---> f8eac7fa0565
Step 3/5 : RUN pip install cloudml-hypertune
 ---> Running in c11608bf3123
Collecting cloudml-hypertune
  Downloading cloudml-hypertune-0.1.0.dev6.tar.gz (3.2 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: cloudml-hypertune
  Building wheel for cloudml-hypertune (setup.py): started
  Building wheel for cloudml-hypertune (setup.py): finished with status 'done'
  Created wheel for cloudml-hypertune: filename=cloudml_hypertune-0.1.0.dev6-py2.py3-none-any.whl size=3987 sha256=20c6fc29fba89d18c0c09f4d581ab3a8f10768cdd563af1d27a13895ea954d71
  Stored in directory: /root/.cache/pip/wheels/a7/ff/87/e7bed0c2741fe219b3d6da67c2431d7f7fedb183032e00f81e
Successfully built cloudml-hypertune
Installing collected packa

#### Test The Custom Container (local to notebook)

In [20]:
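# flag names must match the argparse definitions in train.py (--lr and --m are required; the values here are examples); AIP_TENSORBOARD_LOG_DIR and AIP_MODEL_DIR must also be set in the container
#!docker run {IMAGE_URI} --project_id {PROJECT_ID} --dataname {DATANAME} --var_target {VAR_TARGET} --var_omit {VAR_OMIT} --lr 0.01 --m 0.5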

#### Push The Custom Container To Artifact Registry

In [21]:
!docker push $IMAGE_URI

The push refers to repository [us-central1-docker.pkg.dev/ma-mx-presales-lab/ma-mx-presales-lab/05i_fraud]
latest: digest: sha256:5f6b848fc51806d765b829d7e991fbcbaa4d7c3741a263e4283551898551e060 size: 4922


### Setup Training Job

In [22]:
CMDARGS = [
    "--epochs=" + str(EPOCHS),
    "--batch_size=" + str(BATCH_SIZE),
    "--var_target=" + VAR_TARGET,
    "--var_omit=" + VAR_OMIT,
    "--project_id=" + PROJECT_ID,
    "--dataname=" + DATANAME,
    "--region=" + REGION,
    "--notebook=" + NOTEBOOK
]

MACHINE_SPEC = {
    "machine_type": TRAIN_COMPUTE,
    "accelerator_count": 0
}

WORKER_POOL_SPEC = [
    {
        "replica_count": 1,
        "machine_spec": MACHINE_SPEC,
        "container_spec": {
            "image_uri": IMAGE_URI,
            "command": [],
            "args": CMDARGS
        }
    }
]

In [23]:
customJob = aiplatform.CustomJob(
    display_name = f'{NOTEBOOK}_{DATANAME}_{TIMESTAMP}',
    worker_pool_specs = WORKER_POOL_SPEC,
    base_output_dir = f"{URI}/{TIMESTAMP}",
    staging_bucket = f"{URI}/{TIMESTAMP}",
    labels = {'notebook':f'{NOTEBOOK}'}
)

### Setup Hyperparameter Tuning Job

In [24]:
METRIC_SPEC = {
    "loss": "minimize"
}

PARAMETER_SPEC = {
    "lr": aiplatform.hyperparameter_tuning.DoubleParameterSpec(min=0.001, max=0.1, scale="log"),
    "m": aiplatform.hyperparameter_tuning.DoubleParameterSpec(min=1e-7, max=0.9, scale="linear")
}
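
The `scale` argument controls how the range between `min` and `max` is treated during the search: on a `"log"` scale each order of magnitude gets equal weight, which suits a learning rate spanning 0.001 to 0.1. The sketch below only illustrates that effect with plain random draws. It is not how the service picks values (with `search_algorithm = None` the job uses the service's default algorithm), and `sample_double` is a hypothetical helper, not part of the SDK.

```python
import math
import random

def sample_double(min_value, max_value, scale, rng):
    """Draw one value from [min_value, max_value] on a linear or log scale."""
    if scale == "log":
        # Uniform in log space: every decade is equally likely
        low, high = math.log(min_value), math.log(max_value)
        return math.exp(rng.uniform(low, high))
    if scale == "linear":
        return rng.uniform(min_value, max_value)
    raise ValueError(f"unsupported scale: {scale}")

rng = random.Random(0)  # fixed seed so the sketch is repeatable
lr_samples = [sample_double(0.001, 0.1, "log", rng) for _ in range(1000)]
m_samples = [sample_double(1e-7, 0.9, "linear", rng) for _ in range(1000)]

# On a log scale about half of the draws fall below the geometric midpoint 0.01;
# on a linear scale only about 9% of the range [0.001, 0.1] lies below 0.01
share_below = sum(v < 0.01 for v in lr_samples) / len(lr_samples)
print(round(share_below, 2))
```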

In [25]:
htJob = aiplatform.HyperparameterTuningJob(
    display_name = f'{NOTEBOOK}_{DATANAME}_{TIMESTAMP}',
    custom_job = customJob,
    metric_spec = METRIC_SPEC,
    parameter_spec = PARAMETER_SPEC,
    max_trial_count = 20,
    parallel_trial_count = 5,
    search_algorithm = None,
    labels = {'notebook':f'{NOTEBOOK}'}
)

### Run Training Job

In [26]:
htJob.run(
    service_account = SERVICE_ACCOUNT,
    tensorboard = tb.resource_name
)

INFO:google.cloud.aiplatform.jobs:Creating HyperparameterTuningJob
INFO:google.cloud.aiplatform.jobs:HyperparameterTuningJob created. Resource name: projects/825075454589/locations/us-central1/hyperparameterTuningJobs/4747405885469360128
INFO:google.cloud.aiplatform.jobs:To use this HyperparameterTuningJob in another session:
INFO:google.cloud.aiplatform.jobs:hpt_job = aiplatform.HyperparameterTuningJob.get('projects/825075454589/locations/us-central1/hyperparameterTuningJobs/4747405885469360128')
INFO:google.cloud.aiplatform.jobs:View HyperparameterTuningJob:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/4747405885469360128?project=825075454589
INFO:google.cloud.aiplatform.jobs:View Tensorboard:
https://us-central1.tensorboard.googleusercontent.com/experiment/projects+825075454589+locations+us-central1+tensorboards+6243765894325993472+experiments+4747405885469360128
INFO:google.cloud.aiplatform.jobs:HyperparameterTuningJob projects/825075454589/locations/

In [27]:
# final loss of each trial; trials that did not succeed get a placeholder loss of 1 so they are not picked as the best
losses = [trial.final_measurement.metrics[0].value if trial.state.name == 'SUCCEEDED' else 1 for trial in htJob.trials]
losses

[0.004580758512020111,
 0.004569855984300375,
 0.0036965347826480865,
 0.0035011235158890486,
 0.005478507373481989,
 0.004580274224281311,
 0.0033361068926751614,
 0.0038007372058928013,
 0.003558974713087082,
 0.0035873777233064175,
 0.0037931501865386963,
 0.0035744199994951487,
 0.003583424026146531,
 0.0033743027597665787,
 0.003488971386104822,
 0.0035540463868528605,
 0.003455731552094221,
 0.0035074602346867323,
 0.0034617020282894373,
 0.0034086769446730614]

In [28]:
best = htJob.trials[losses.index(min(losses))]
best

id: "7"
state: SUCCEEDED
parameters {
  parameter_id: "lr"
  value {
    number_value: 0.1
  }
}
parameters {
  parameter_id: "m"
  value {
    number_value: 0.5374021865996481
  }
}
final_measurement {
  step_count: 1
  metrics {
    metric_id: "loss"
    value: 0.0033361068926751614
  }
}
start_time {
  seconds: 1649211852
  nanos: 315811020
}
end_time {
  seconds: 1649212464
}
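
The selection above gives non-succeeded trials a placeholder loss of 1. An alternative is to filter those trials out before ranking, sketched below. `TrialSummary` is a hypothetical stand-in for the fields read from each entry of `htJob.trials`, and the values are illustrative rather than copied from the job.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TrialSummary:
    """Hypothetical stand-in for one tuning trial."""
    id: str
    state: str
    loss: Optional[float]  # None when the trial reported no final measurement

def best_trial(trials: List[TrialSummary]) -> TrialSummary:
    """Return the succeeded trial with the lowest loss."""
    succeeded = [t for t in trials if t.state == "SUCCEEDED" and t.loss is not None]
    if not succeeded:
        raise ValueError("no succeeded trials to choose from")
    return min(succeeded, key=lambda t: t.loss)

# Illustrative trial summaries (not real job output)
trials = [
    TrialSummary(id="1", state="SUCCEEDED", loss=0.0046),
    TrialSummary(id="2", state="FAILED", loss=None),
    TrialSummary(id="7", state="SUCCEEDED", loss=0.0033),
]

chosen = best_trial(trials)
print(chosen.id, chosen.loss)
```

Raising an error when nothing succeeded avoids silently selecting a failed trial, which the placeholder approach would do if every trial failed.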

In [30]:
!mkdir $DIR/all_logs

In [31]:
!gsutil cp -r {URI}/{TIMESTAMP}/* $DIR/all_logs


==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m cp ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Copying gs://vertex-ai-mlops-bucket/fraud/models/05i/20220406020648/1/logs/train/events.out.tfevents.1649211294.51b6e0148ebc.1.723.v2...
Copying gs://vertex-ai-mlops-bucket/fraud/models/05i/20220406020648/1/logs/train/events.out.tfevents.1649211294.51b6e0148ebc.1.739.v2...
Copying gs://vertex-ai-mlops-bucket/fraud/models/05i/20220406020648/1/logs/train/events.out.tfevents.1649211298.51b6e0148ebc.profile-empty...
Copying gs://vertex-ai-mlops-bucket/fraud/models/05i/20220406020648/1/logs/train/plugins/profile/2022_04_06_02_14_57/51b6e0148ebc.input_pipeline.pb...
Copying gs://vertex-ai-mlops-bucket/fraud/models/05i/20220406020648/1/logs/train/plugins/profile/2022_04_06_02_14_57/51b6e0148ebc.kernel_stats.pb...
Copying gs://vertex-ai-mlops-bu

In [29]:
%load_ext tensorboard

In [32]:
%tensorboard --logdir $DIR/all_logs

In [33]:
!tensorboard dev upload --logdir $DIR/all_logs \
  --name "Vertex hyperparameter tuning with custom container" \
  --description "Training results from Notebook 05i: hyperparameter tuning" \
  --one_shot

2022-04-06 03:09:19.271814: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64
2022-04-06 03:09:19.271870: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-04-06 03:09:19.271895: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (tensorflow-2-7-20220125-120050): /proc/driver/nvidia/version does not exist

New experiment created. View your TensorBoard at: https://tensorboard.dev/experiment/XSMdMKilTZCahz52xdRF9g/

[2022-04-06T03:09:19] Started scanning logdir.
[2022-04-06T03:09:40] Total uploaded: 1200 scalars, 1240 tensors (494.0 kB), 20 binary objects (927.3 kB)
[2022-04-06T03:09:40] Done scanning logdir.


Done

---
## Serving

### Upload The Model

In [72]:
model = aiplatform.Model.upload(
    display_name = f'{NOTEBOOK}_{DATANAME}_{TIMESTAMP}',
    serving_container_image_uri = DEPLOY_IMAGE,
    artifact_uri = f"{URI}/{TIMESTAMP}/{best.id}/model",
    labels = {'notebook':f'{NOTEBOOK}'}
)

INFO:google.cloud.aiplatform.models:Creating Model
INFO:google.cloud.aiplatform.models:Create Model backing LRO: projects/715288179162/locations/us-central1/models/8122263918195245056/operations/6548951358253301760
INFO:google.cloud.aiplatform.models:Model created. Resource name: projects/715288179162/locations/us-central1/models/8122263918195245056
INFO:google.cloud.aiplatform.models:To use this Model in another session:
INFO:google.cloud.aiplatform.models:model = aiplatform.Model('projects/715288179162/locations/us-central1/models/8122263918195245056')


In [73]:
model.display_name

'05i_fraud_20220313140425'

### Create An Endpoint

In [74]:
endpoint = aiplatform.Endpoint.create(
    display_name = f'{NOTEBOOK}_{DATANAME}_{TIMESTAMP}',
    labels = {'notebook':f'{NOTEBOOK}'}
)

INFO:google.cloud.aiplatform.models:Creating Endpoint
INFO:google.cloud.aiplatform.models:Create Endpoint backing LRO: projects/715288179162/locations/us-central1/endpoints/2951168373787983872/operations/6166145389926809600
INFO:google.cloud.aiplatform.models:Endpoint created. Resource name: projects/715288179162/locations/us-central1/endpoints/2951168373787983872
INFO:google.cloud.aiplatform.models:To use this Endpoint in another session:
INFO:google.cloud.aiplatform.models:endpoint = aiplatform.Endpoint('projects/715288179162/locations/us-central1/endpoints/2951168373787983872')


In [75]:
endpoint.display_name

'05i_fraud_20220313140425'

### Deploy Model To Endpoint

In [76]:
endpoint.deploy(
    model = model,
    deployed_model_display_name = f'{NOTEBOOK}_{DATANAME}_{TIMESTAMP}',
    traffic_percentage = 100,
    machine_type = DEPLOY_COMPUTE,
    min_replica_count = 1,
    max_replica_count = 1
)

INFO:google.cloud.aiplatform.models:Deploying Model projects/715288179162/locations/us-central1/models/8122263918195245056 to Endpoint : projects/715288179162/locations/us-central1/endpoints/2951168373787983872
INFO:google.cloud.aiplatform.models:Deploy Endpoint model backing LRO: projects/715288179162/locations/us-central1/endpoints/2951168373787983872/operations/3860302380713115648
INFO:google.cloud.aiplatform.models:Endpoint model deployed. Resource name: projects/715288179162/locations/us-central1/endpoints/2951168373787983872


---
## Prediction

### Prepare a record for prediction: instance and parameters lists

In [77]:
pred = bigquery.query(query = f"SELECT * FROM {DATANAME}.{DATANAME}_prepped WHERE splits='TEST' LIMIT 10").to_dataframe()

In [78]:
pred.head(4)

| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class | transaction_id | splits |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7148 | 1.156386 | 0.193513 | 0.24222 | 0.660729 | 0.236144 | 0.311471 | -0.08842 | 0.057844 | 1.123405 | ... | -0.051662 | -0.262183 | 0.47787 | 0.556403 | -0.046953 | -0.021878 | 0.0 | 0 | 0eddc3ef-a61b-4fba-a3ab-0ed9a726dcf0 | TEST |
| 1 | 76311 | -0.186529 | 0.545755 | 2.432618 | 3.266129 | -0.784549 | 3.167033 | -2.460489 | -1.830983 | 0.389492 | ... | -0.40038 | -1.26528 | 1.231 | 0.749402 | 0.147862 | 0.187856 | 0.0 | 0 | b1111e03-a559-4eb4-ab32-e3aea0072ef7 | TEST |
| 2 | 125139 | 1.879049 | 0.212473 | -0.085529 | 3.554091 | 0.205505 | 1.188395 | -0.672662 | 0.375249 | -0.494351 | ... | 0.131433 | 0.256023 | -0.13545 | 0.048878 | 0.003082 | -0.042219 | 0.0 | 0 | 0a0f4b69-01ee-436e-ae52-02237cd6433e | TEST |
| 3 | 51632 | 1.26405 | 0.182193 | 0.02091 | 0.47806 | -0.037823 | -0.490973 | 0.16669 | -0.130607 | -0.1572 | ... | -0.167644 | 0.075563 | 0.698539 | 0.556361 | -0.052595 | -0.011799 | 0.0 | 0 | ed678d6e-8dea-4d45-92b7-74e7eba22402 | TEST |


In [79]:
newob = pred[pred.columns[~pred.columns.isin(VAR_OMIT.split()+[VAR_TARGET, 'splits'])]].to_dict(orient='records')[0]
#newob

In [80]:
instances = [json_format.ParseDict(newob, Value())]
parameters = json_format.ParseDict({}, Value())

### Get Predictions: Python Client

In [81]:
prediction = endpoint.predict(instances=instances, parameters=parameters)
prediction

Prediction(predictions=[[0.999886394, 0.000113604474]], deployed_model_id='5686172761555206144', explanations=None)

In [82]:
prediction.predictions[0]

[0.999886394, 0.000113604474]

In [83]:
np.argmax(prediction.predictions[0])

0
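
The model ends in a softmax layer over one-hot encoded targets, so each prediction is a list with one probability per class and the list index is the `Class` value. The sketch below reads the prediction shown above without NumPy. Treating index 1 as the fraud class and using `FRAUD_THRESHOLD` as a cut-off are assumptions for illustration; neither is defined in the notebook.

```python
# Softmax output copied from the prediction above: one probability per class
probabilities = [0.999886394, 0.000113604474]

# The index of the largest probability is the predicted class (pure-Python argmax)
predicted_class = max(range(len(probabilities)), key=lambda i: probabilities[i])
confidence = probabilities[predicted_class]

# Hypothetical cut-off on the class-1 probability (assumed here to be the fraud class)
FRAUD_THRESHOLD = 0.5
flag_as_fraud = probabilities[1] >= FRAUD_THRESHOLD

print(predicted_class, confidence, flag_as_fraud)
```

With two classes and a 0.5 cut-off this matches the argmax result; a lower cut-off would flag more records at the cost of more false positives.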

### Get Predictions: REST

In [84]:
with open(f'{DIR}/request.json','w') as file:
    file.write(json.dumps({"instances": [newob]}))

In [85]:
!curl -X POST \
-H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
-H "Content-Type: application/json; charset=utf-8" \
-d @{DIR}/request.json \
https://{REGION}-aiplatform.googleapis.com/v1/{endpoint.resource_name}:predict

{
  "predictions": [
    [
      0.999886394,
      0.000113604474
    ]
  ],
  "deployedModelId": "5686172761555206144",
  "model": "projects/715288179162/locations/us-central1/models/8122263918195245056",
  "modelDisplayName": "05i_fraud_20220313140425"
}


### Get Predictions: gcloud (CLI)

In [86]:
!gcloud beta ai endpoints predict {endpoint.name.rsplit('/',1)[-1]} --region={REGION} --json-request={DIR}/request.json

Using endpoint [https://us-central1-prediction-aiplatform.googleapis.com/]
[[0.999886394, 0.000113604474]]


---
## Remove Resources
see notebook "99 - Cleanup"