## Get started with AI Platform Optimizer

When I first heard of the optimizer, I assumed it was AutoML, but it isn't quite that: it only searches for the best hyper-parameters of a model. Why do we need it? If you are an ML expert, you would probably tune the parameters by hand based on your data, but this product is aimed at people who don't want to tune models themselves and just want a better result...

So the main logic is: first we define the hyper-parameter space, then an algorithm searches that space for the best-suited parameters. Note that `best` is not literally true here: no algorithm can guarantee that it finds the real optimum!

Many strategies can be used, such as grid search, random search, and Bayesian optimization. The Google Cloud implementation uses [Bayesian optimization](https://cloud.google.com/ai-platform/training/docs/hyperparameter-tuning-overview#how_hyperparameter_tuning_works); that page is a high-level overview, and if you are curious about the detailed logic, you can find it [here](https://cloud.google.com/blog/products/gcp/hyperparameter-tuning-cloud-machine-learning-engine-using-bayesian-optimization).
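
To make the difference between the first two strategies concrete, here is a minimal sketch of grid search and random search over a bounded space. It does not touch the optimizer service at all: `toy_score` is a made-up stand-in for "train a model and return its accuracy", and the parameter names and bounds are only assumptions for illustration.

```python
import itertools
import random


def toy_score(learning_rate, batch_size):
    """Hypothetical stand-in for 'train a model and return its accuracy'."""
    # Peaks at learning_rate=0.01 and batch_size=64; purely illustrative.
    return 1.0 - abs(learning_rate - 0.01) * 10 - abs(batch_size - 64) / 1000


def grid_search(space):
    """Try every combination of the listed candidate values."""
    names = list(space)
    candidates = (dict(zip(names, combo))
                  for combo in itertools.product(*(space[n] for n in names)))
    return max(candidates, key=lambda params: toy_score(**params))


def random_search(bounds, n_trials, seed=0):
    """Sample each parameter uniformly between its min and max."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {
            "learning_rate": rng.uniform(*bounds["learning_rate"]),
            "batch_size": rng.randint(*bounds["batch_size"]),
        }
        score = toy_score(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


grid_best = grid_search({"learning_rate": [0.001, 0.01, 0.1],
                         "batch_size": [32, 64, 128]})
random_best, random_best_score = random_search(
    {"learning_rate": (0.0001, 1.0), "batch_size": (16, 256)}, n_trials=50)
print("grid:", grid_best)
print("random:", random_best, round(random_best_score, 4))
```

Grid search only ever sees the values you list, while random search covers the whole range but may miss the peak. Bayesian optimization tries to do better by using the results of earlier trials to choose the next one.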

While writing the code, I found that the optimizer needs you to provide the min and max values for each hyper-parameter, which lets you shrink the search space using what you already know. But I have to say that this adds a lot of code... Maybe that's why GCP is for developers :)
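
The search-space configuration later in this notebook gives `learning_rate` the bounds 0.0001 and 1.0 together with `unit_log_scale`. As far as I understand it, that means the range is mapped onto [0, 1] logarithmically, so each power of ten gets an equal share of the interval. The sketch below shows that mapping; the two function names are my own, not part of any Google library.

```python
import math


def to_unit_log_scale(value, min_value, max_value):
    """Map a strictly positive value in [min_value, max_value] to [0, 1] on a log scale."""
    return ((math.log(value) - math.log(min_value))
            / (math.log(max_value) - math.log(min_value)))


def from_unit_log_scale(unit, min_value, max_value):
    """Inverse mapping: turn a point in [0, 1] back into a parameter value."""
    return math.exp(math.log(min_value)
                    + unit * (math.log(max_value) - math.log(min_value)))


lo, hi = 0.0001, 1.0
for unit in (0.0, 0.25, 0.5, 0.75, 1.0):
    # Each quarter of the unit interval covers one power of ten.
    print(unit, from_unit_log_scale(unit, lo, hi))
```

With these bounds, the midpoint of the unit interval corresponds to a learning rate of about 0.01, which is consistent with the first trial in the listing near the end of this notebook (its suggested `learning_rate` is roughly 0.01).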

In [0]:
# install the libraries we need
! pip install -U google-api-python-client google-cloud --quiet
! pip install -U google-cloud-storage  --quiet
! pip install -U requests --quiet
# note: `google-cloud-api-python` is not available on PyPI, so it is not installed here


In [0]:
# configure the project
! gcloud config set project cloudtutorial-278306

Updated property [core/project].


In [0]:
# authenticate the notebook
from google.colab import auth
auth.authenticate_user()

## Set up parameters for the optimizer

In [0]:
import json
import time
import datetime
from googleapiclient import errors


In [0]:
USER = 'lugq'

STUDY_ID = "{}_study_{}".format(USER, datetime.datetime.now().strftime('%Y%m%d'))
REGION = 'us-central1'

PROJECT_ID = "cloudtutorial-278306"

In [0]:
def study_parent():
  return "projects/{}/locations/{}".format(PROJECT_ID, REGION)

def study_name(study_id):
  return 'projects/{}/locations/{}/studies/{}'.format(PROJECT_ID, REGION, study_id)

def trial_parent(study_id):
  return study_name(study_id)

def trial_name(study_id, trial_id):
  return 'projects/{}/locations/{}/studies/{}/trials/{}'.format(PROJECT_ID, REGION, study_id, trial_id)

def operation_name(operation_id):
  return 'projects/{}/locations/{}/operations/{}'.format(PROJECT_ID, REGION, operation_id)


print('USER: {}'.format(USER))
print('PROJECT_ID: {}'.format(PROJECT_ID))
print('REGION: {}'.format(REGION))
print('STUDY_ID: {}'.format(STUDY_ID))

USER: lugq
PROJECT_ID: cloudtutorial-278306
REGION: us-central1
STUDY_ID: lugq_study_20200528


In [0]:
# next, read the API discovery document from a bucket and build the client
from google.cloud import storage
from googleapiclient import discovery

# this is a public bucket that hosts the optimizer API document; it is not our own bucket.
optimizer_bucket = 'caip-optimizer-public'
optimizer_file = 'api/ml_public_google_rest_v1.json'

def read_api_document():
  client = storage.Client(PROJECT_ID)
  bucket = client.get_bucket(optimizer_bucket)
  blob = bucket.get_blob(optimizer_file)
  return blob.download_as_string()

ml = discovery.build_from_document(service=read_api_document())
print('build client')

build client


## Define parameters space

Next we define the hyper-parameter space to optimize. For now this is set up for the built-in algorithms; I still need to find out how to do it for a user-defined algorithm.

In [0]:
# this is the JSON configuration of the search space
learning_rate_space = {
    'parameter': 'learning_rate', 
    'type': 'double',
    'double_value_spec':{
        'min_value': .0001,
        'max_value': 1.0
    },
    'scale_type': 'unit_log_scale',    # how the range is scaled when sampling this parameter
    'parent_categorical_values':{
        'values': ['linear']
    }
}

# model config
param_model_type = {
    'parameter': 'model_type',
    'type': 'categorical',
    'categorical_value_spec': {'values': ['linear']},
    'child_parameter_specs': [learning_rate_space]
}

# metrics
metric_accuracy = {
    'metric': 'accuracy',
    'goal': "maximize"
}

# define our study config
study_config = {
    'algorithm': 'algorithm_unspecified',    # left unspecified so the service picks the algorithm
    'parameters': [param_model_type,],
    'metrics': [metric_accuracy,]
}

study = {'study_config': study_config}

# let's see what we have built... so many nested parameters
print(json.dumps(study, indent=2, sort_keys=True))

{
  "study_config": {
    "algorithm": "algorithm_unspecified",
    "metrics": [
      {
        "goal": "maximize",
        "metric": "accuracy"
      }
    ],
    "parameters": [
      {
        "categorical_value_spec": {
          "values": [
            "linear"
          ]
        },
        "child_parameter_specs": [
          {
            "double_value_spec": {
              "max_value": 1.0,
              "min_value": 0.0001
            },
            "parameter": "learning_rate",
            "parent_categorical_values": {
              "values": [
                "linear"
              ]
            },
            "scale_type": "unit_log_scale",
            "type": "double"
          }
        ],
        "parameter": "model_type",
        "type": "categorical"
      }
    ]
  }
}
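
Writing these nested dictionaries by hand is error-prone, so here is a rough sketch of how they could be wrapped in small builder functions. The helper names (`double_param`, `categorical_param`, `make_study`) are hypothetical and not part of the Google client; the dictionary keys are copied from the configuration above.

```python
import json


def double_param(name, min_value, max_value,
                 scale_type="unit_log_scale", parent_values=None):
    """Build a double parameter spec with the same keys as the dict above."""
    spec = {
        "parameter": name,
        "type": "double",
        "double_value_spec": {"min_value": min_value, "max_value": max_value},
        "scale_type": scale_type,
    }
    if parent_values:
        # Only relevant when this parameter is a child of a categorical one.
        spec["parent_categorical_values"] = {"values": list(parent_values)}
    return spec


def categorical_param(name, values, children=None):
    """Build a categorical parameter spec, optionally with child parameters."""
    spec = {
        "parameter": name,
        "type": "categorical",
        "categorical_value_spec": {"values": list(values)},
    }
    if children:
        spec["child_parameter_specs"] = list(children)
    return spec


def make_study(parameters, metric, goal="maximize",
               algorithm="algorithm_unspecified"):
    """Wrap parameters and a single metric into the study request body."""
    return {
        "study_config": {
            "algorithm": algorithm,
            "parameters": list(parameters),
            "metrics": [{"metric": metric, "goal": goal}],
        }
    }


built_study = make_study(
    [categorical_param(
        "model_type", ["linear"],
        children=[double_param("learning_rate", 0.0001, 1.0,
                               parent_values=["linear"])])],
    metric="accuracy")
print(json.dumps(built_study, indent=2, sort_keys=True))
```

This should print the same JSON as the hand-written version, and a typo in a key now only has to be fixed in one place.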


In [0]:
# now we can create our study object
# With this many configuration dictionaries, it is easy to get a 400 error
# for an unrecognized parameter... I think this should be wrapped in a function
# with ordinary arguments, so we don't have to write so much JSON.

req = ml.projects().locations().studies().create(parent=study_parent(),
                                                 studyId=STUDY_ID, body=study)

try:
  print(req.execute())
except errors.HttpError as e:
  # e.resp.status is an int, so compare against 409 rather than the string '409'
  if e.resp.status == 409:
    print('study already created')
  else:
    raise

{'name': 'projects/574974437586/locations/us-central1/studies/lugq_study_20200528', 'studyConfig': {'metrics': [{'goal': 'MAXIMIZE', 'metric': 'accuracy'}], 'parameters': [{'parameter': 'model_type', 'type': 'CATEGORICAL', 'categoricalValueSpec': {'values': ['linear']}, 'childParameterSpecs': [{'parameter': 'learning_rate', 'type': 'DOUBLE', 'doubleValueSpec': {'minValue': 0.0001, 'maxValue': 1}, 'scaleType': 'UNIT_LOG_SCALE', 'parentCategoricalValues': {'values': ['linear']}}]}]}, 'state': 'ACTIVE', 'createTime': '2020-05-28T05:19:34Z'}


## Configuration

The most important thing I found about the optimizer is that it has no preprocessing step of its own: we just load data from a bucket. So the optimizer step assumes preprocessing has already finished and that the final result has been saved as a file in a bucket, or even in BigQuery.

This is also a slow way to get the whole thing working: the job runs in a remote container rather than on a local server, so even the preparation takes at least 2 minutes, and when the code hits an error it takes about 4 minutes to find out...

In [0]:
# now that the study object has been created, we configure
# where the output goes, which is the bucket...
output_bucket = 'first_bucket_lugq'
output_dir = 'optimizer_test'

# Here I think I have to change the data so the label is the first column and there is no header row...
training_data_path = "gs://{}/sklearn_tutorial/data_label_first.csv".format(output_bucket)


print("Where to get data:", training_data_path)

Where to get data: gs://first_bucket_lugq/sklearn_tutorial/data_label_first.csv
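
For the data change mentioned in the comment above (label in the first column, no header row), a small sketch using only the standard library could look like this. The column names and the function name `label_first_no_header` are made up for the example; the real file lives in the bucket and would have to be downloaded and uploaded separately.

```python
import csv
import io


def label_first_no_header(src, dst, label_column):
    """Copy CSV rows from src to dst with the label first and the header dropped."""
    reader = csv.DictReader(src)
    feature_columns = [c for c in reader.fieldnames if c != label_column]
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        writer.writerow([row[label_column]] + [row[c] for c in feature_columns])


# Tiny in-memory example with hypothetical column names.
raw = io.StringIO("f1,f2,label\n0.5,1.2,1\n0.1,3.4,0\n")
out = io.StringIO()
label_first_no_header(raw, out, label_column="label")
print(out.getvalue())
```

The same function works with real file handles opened via `open(path, newline="")` instead of the `io.StringIO` objects.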


In [0]:
# next we evaluate the model for each trial:
# each training job writes a summary file into the bucket, which we turn into
# the trial's `final_measurement`, and we have to write the logging helpers ourselves.
import logging
import math
import subprocess
import os
import yaml

logging.basicConfig(level=logging.INFO)

training_job_name_pattern = '{}_condition_parameters_{}_{}'

# as we use a built-in algorithm, we need to define its image location...
image_urls = {'linear': 'gcr.io/cloud-ml-algos/linear_learner_cpu:latest'}
_step_count = 'step_count'
_accuracy = 'accuracy'

def evaluate_trials(trials):
  trials_by_job_id = {}
  measurement_by_trial_id = {}

  # Submits an AI Platform Training job for each trial.
  for trial in trials:
    trial_id = int(trial['name'].split('/')[-1])
    model_type = get_suggest_param_value(trial, 'model_type', 'stringValue')
    learning_rate = get_suggest_param_value(trial, 'learning_rate', 'floatValue')

    job_id = generate_training_job_id(model_type, trial_id)

    # keep the trial config so we can match results back to the trial
    trials_by_job_id[job_id] = {
        'trial_id': trial_id,
        'model_type': model_type,
        'learning_rate': learning_rate,
    }

    # then we can submit our job...
    submit_job(job_id, trial_id, model_type, learning_rate)

  # wait for the training jobs to finish;
  # we only move on once every job has reached a terminal state.
  while not is_job_complete(trials_by_job_id.keys()):
    time.sleep(5)

  # get the training result
  logging.info("start to get the job metrics...")
  metrics_by_job_id = get_job_metrics(trials_by_job_id.keys())

  # build a measurement (accuracy) for each trial
  logging.info("Metrics collected: %s", metrics_by_job_id)
  for job_id, metric in metrics_by_job_id.items():
    trial_info = trials_by_job_id[job_id]
    measurement = create_measurement(trial_info['trial_id'],
                                     trial_info['model_type'],
                                     trial_info['learning_rate'],
                                     metric)
    measurement_by_trial_id[trial_info['trial_id']] = measurement

  # return only after every job has been processed, not inside the loop
  return measurement_by_trial_id


def get_suggest_param_value(trial, parameter, value_type):
  # get the suggested value for the given parameter name and value type.
  param_found = [p for p in trial['parameters'] if p['parameter'] == parameter]
  if param_found:
    logging.info("Get suggestion value: %s", param_found[0][value_type])
    return param_found[0][value_type]
  else:
    return None


def generate_training_job_id(model_type, trial_id):
  # define a name of job
  return training_job_name_pattern.format(STUDY_ID, model_type, trial_id)


def get_training_state(job_id):
  # get the current job state through the gcloud command line
  cmd = ['gcloud', 'ai-platform', 'jobs', 'describe', job_id,
         '--project', PROJECT_ID,
         '--format', 'json']
  try:
    output = subprocess.check_output(cmd, stderr=subprocess.STDOUT, timeout=3)
    state = json.loads(output)['state']
    logging.info("Get training status: %s", state)
    return state
  except subprocess.CalledProcessError as e:
    logging.error(e.output)
    return None
  


def is_job_complete(jobs):
  # check whether all the given jobs have reached a terminal state.
  all_done = True
  for job in jobs:
    if get_training_state(job) not in ['SUCCEEDED', 'FAILED', 'CANCELLED']:
      all_done = False

  # only report completion when it is actually true
  if all_done:
    logging.info("All jobs have been done.")
  return all_done


def get_job_dir(job_id):
  # much like in auto-ml training, each model gets its own separate folder...
  return os.path.join('gs://', output_bucket, output_dir, job_id)


def linear_command(job_id, learning_rate):
  # build the gcloud command that submits a linear training job
  return ['gcloud', 'ai-platform', 'jobs', 'submit', 'training', job_id,
          '--scale-tier', 'BASIC',
          '--region', REGION,
          '--master-image-uri', image_urls['linear'],
          '--project', PROJECT_ID,
          # '--job-dir', get_job_dir(job_id),   # --job-dir must be set, otherwise the job fails.
          # I'm not sure why, but passing the full per-job gs:// path always raised
          # "ValueError: Bucket names must start and end with a number or letter."
          '--job-dir', "gs://{}/sklearn_tutorial/output".format(output_bucket),   # must be a URL starting with gs://
          '--',
          '--preprocess',
          '--model_type=classification',    # common training settings such as batch size follow
          '--batch_size=250',
          '--max_steps=1000',
          '--learning_rate={}'.format(learning_rate),
          '--training_data_path={}'.format(training_data_path)
          ]


def get_accuracy(job_id):
  # read the trained model's accuracy, as reported by the built-in algorithm
  client = storage.Client(PROJECT_ID)
  bucket = client.get_bucket(output_bucket)
  # NOTE: this path assumes the job wrote to get_job_dir(job_id). linear_command
  # passes a different --job-dir, so the blob is most likely missing here and the
  # function returns None (which shows up as the "empty metric" error in the output).
  blob_name = os.path.join(output_dir, job_id, 'model/deployment_config.yaml')
  blob = storage.Blob(blob_name, bucket)

  # fetch the file content from the bucket and parse its labels
  try:
    blob.reload()
    content = blob.download_as_string()
    labels = yaml.safe_load(content)['labels']
    accuracy = float(labels['accuracy']) / 100
    step_count = int(labels['global_step'])
    return {_step_count: step_count, _accuracy: accuracy}
  except Exception:
    return None


def get_job_metrics(jobs):
  # collect the metrics of all the given jobs
  accuracies_by_job_id = {}
  for job in jobs:
    accuracies_by_job_id[job] = get_accuracy(job)

  logging.info("Get accuracy dictionary: %s", accuracies_by_job_id)
  return accuracies_by_job_id


def create_measurement(trial_id, model_type, learning_rate, metric):
  # build the measurement for a job; returns None when there is no usable metric
  if metric is None:
    logging.error("Get a empty metric!!!!")
  else:
    if not metric[_accuracy]:
      return None
    else:
      measurement = {
        _step_count: metric[_step_count],
        'metrics': [{'metric': _accuracy, 'value': metric[_accuracy]},]}
    
      # logging.info("Get measurement:", str(measurement))
      return measurement
        

def submit_job(job_id, trial_id, model_type, learning_rate):
  # this is to submit job with command line.
  try:
    if model_type == 'linear':
      subprocess.check_output(linear_command(job_id, learning_rate), stderr=subprocess.STDOUT)
    else:
      logging.error('not support')
  except subprocess.CalledProcessError as e:
    logging.error(e.output)
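
One thing the loop in `evaluate_trials` does not handle is a job that never reaches a terminal state: `while not is_job_complete(...)` would then poll forever. Below is a minimal sketch of polling with a deadline. `wait_until` is a hypothetical helper of mine, and the demo replaces the real job check with a fake sequence of states so it can run anywhere.

```python
import time


def wait_until(predicate, timeout_s, poll_interval_s=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll predicate() until it returns True or timeout_s elapses.

    Returns True if the predicate succeeded, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)


# Demo with a fake job that succeeds on the third check.
states = iter(["RUNNING", "RUNNING", "SUCCEEDED"])
finished = wait_until(lambda: next(states) == "SUCCEEDED",
                      timeout_s=10, poll_interval_s=0.01)
print(finished)
```

In this notebook the call would look something like `wait_until(lambda: is_job_complete(job_ids), timeout_s=1800)`, where the 30-minute limit is only a guess at a sensible value.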


In [0]:
client_id = 'client12'
suggestion_count_per_request = 1
max_trial_id_to_stop = 1

In [0]:
current_trial_id = 0
while current_trial_id < max_trial_id_to_stop:
  # Request trials
  resp = ml.projects().locations().studies().trials().suggest(
    parent=trial_parent(STUDY_ID), 
    body={'client_id': client_id, 'suggestion_count': suggestion_count_per_request}).execute()
  op_id = resp['name'].split('/')[-1]

  # Polls the suggestion long-running operations.
  get_op = ml.projects().locations().operations().get(name=operation_name(op_id))
  while True:
      operation = get_op.execute()
      if 'done' in operation and operation['done']:
        break
      time.sleep(1)

  # Fetches the suggested trials.
  trials = []
  for suggested_trial in get_op.execute()['response']['trials']:
    trial_id = int(suggested_trial['name'].split('/')[-1])
    trial = ml.projects().locations().studies().trials().get(name=trial_name(STUDY_ID, trial_id)).execute()
    if trial['state'] not in ['COMPLETED', 'INFEASIBLE']:
      print("Trial {}: {}".format(trial_id, trial))
      trials.append(trial)

  # Evaluates trials - Submit model training jobs using AI Platform Training built-in algorithms.
  measurement_by_trial_id = evaluate_trials(trials)
  
  # I even forgot to update the counter at first...
  current_trial_id += 1
  

Trial 14: {'name': 'projects/574974437586/locations/us-central1/studies/lugq_study_20200528/trials/14', 'state': 'ACTIVE', 'parameters': [{'parameter': 'model_type', 'stringValue': 'linear'}, {'parameter': 'learning_rate', 'floatValue': 0.011290392048492156}], 'startTime': '2020-05-28T08:59:40Z', 'clientId': 'client12'}


ERROR:root:Get a empty metric!!!!


In [0]:
# Completes trials.
for trial in trials:
  trial_id = int(trial['name'].split('/')[-1])
  current_trial_id = trial_id
  measurement = measurement_by_trial_id[trial_id]
  print(("=========== Complete Trial: [{0}] =============").format(trial_id))
  if measurement:
    # Completes trial by reporting final measurement.
    ml.projects().locations().studies().trials().complete(
      name=trial_name(STUDY_ID, trial_id), 
      body={'final_measurement' : measurement}).execute()
  else:
    # Marks trial as `infeasible` when the final measurement is missing.
    ml.projects().locations().studies().trials().complete(
      name=trial_name(STUDY_ID, trial_id), 
      body={'trial_infeasible' : True}).execute()



In [0]:
# final step is to list the trial results
resp = ml.projects().locations().studies().trials().list(parent=trial_parent(STUDY_ID)).execute()
print(json.dumps(resp, indent=2, sort_keys=True))


{
  "trials": [
    {
      "clientId": "client1",
      "name": "projects/574974437586/locations/us-central1/studies/lugq_study_20200528/trials/1",
      "parameters": [
        {
          "parameter": "model_type",
          "stringValue": "linear"
        },
        {
          "floatValue": 0.010000000000000005,
          "parameter": "learning_rate"
        }
      ],
      "startTime": "2020-05-28T06:14:20Z",
      "state": "ACTIVE"
    },
    {
      "clientId": "client1",
      "name": "projects/574974437586/locations/us-central1/studies/lugq_study_20200528/trials/2",
      "parameters": [
        {
          "parameter": "model_type",
          "stringValue": "linear"
        },
        {
          "floatValue": 0.0036307811788057714,
          "parameter": "learning_rate"
        }
      ],
      "startTime": "2020-05-28T06:14:20Z",
      "state": "ACTIVE"
    },
    {
      "clientId": "client2",
      "name": "projects/574974437586/locations/us-central1/studies/lugq_study_

## Show results

Now that the whole optimizer flow has run, how do we check the result? First, we can look at the model checkpoint and the metrics that were produced, starting from the [job summary](https://console.cloud.google.com/ai-platform/jobs?project=cloudtutorial-278306); the logs are there as well. The job can also output a **TensorBoard summary**, which gives information about the model structure and so on. Finally, the output is saved in the bucket, so we can also inspect the [bucket objects](https://console.cloud.google.com/storage/browser/first_bucket_lugq/sklearn_tutorial/?forceOnBucketsSortingFiltering=false&project=cloudtutorial-278306).
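
Besides the console, the trial listing above can also be used to pick the best trial in code. The sketch below assumes that completed trials carry a `finalMeasurement` field with a `metrics` list, following the camelCase style of the listing; none of the trials shown above are completed, so that shape is an assumption, and the sample data is made up.

```python
def best_trial(trials, metric="accuracy", maximize=True):
    """Return the completed trial with the best value for `metric`, or None."""
    scored = []
    for trial in trials:
        if trial.get("state") != "COMPLETED":
            continue  # active or infeasible trials have no final result
        for m in trial.get("finalMeasurement", {}).get("metrics", []):
            if m.get("metric") == metric:
                scored.append((m["value"], trial))
    if not scored:
        return None
    pick = max if maximize else min
    return pick(scored, key=lambda pair: pair[0])[1]


# Made-up trials, shaped like the list response (assumed fields).
sample_trials = [
    {"name": "studies/demo/trials/1", "state": "COMPLETED",
     "finalMeasurement": {"metrics": [{"metric": "accuracy", "value": 0.81}]}},
    {"name": "studies/demo/trials/2", "state": "COMPLETED",
     "finalMeasurement": {"metrics": [{"metric": "accuracy", "value": 0.86}]}},
    {"name": "studies/demo/trials/3", "state": "ACTIVE"},
]
winner = best_trial(sample_trials)
print(winner["name"])
```

With the real response, this would be called as `best_trial(resp.get('trials', []))`.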

## Final words

I have to say there is far too much code to write, configure, and test to get the whole thing working. If possible, I won't use this product until the logic is wrapped into a package that is easy to use.

The main goal is to find good hyper-parameters that suit the data, so why not use other algorithms to find near-optimal parameters? What we did here also only yields near-optimal parameters, and those alternatives would not be as painful as the current solution...