# Hyperparameter tuning a TensorFlow model with Cloud ML Engine

This notebook demonstrates how to configure a set of hyperparameter tuning experiments for a TensorFlow model and then run them on Cloud ML Engine as a collection of parallel experiments (trials). Cloud ML Engine uses a black-box Bayesian optimization algorithm to discover better-performing hyperparameters.

In [0]:
import os
#@markdown Enter your GCP Project ID:
PROJECT = "" #@param {type: "string"}
#@markdown Enter your GCP Storage Bucket ID:
BUCKET = "" #@param {type: "string"}
#@markdown OPTIONAL: Replace with your GCP Storage Bucket Region:
REGION = "us-central1" #@param {type:"string"}

MODEL_TYPE='cnn_batch_norm'       # convolutional neural network

# do not change these
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
os.environ['MODEL_TYPE'] = MODEL_TYPE
os.environ['TFVERSION'] = '1.10'  # TensorFlow version

def start_tensorboard(logdir, url_file):
  get_ipython().system_raw('tensorboard --logdir {} --host 0.0.0.0 --port 6006 &'.format(logdir))
  get_ipython().system_raw('lt --port 6006 >> {} 2>&1 &'.format(url_file))
  get_ipython().system('cat {}'.format(url_file))

def stop_tensorboard(url_file):
  get_ipython().system_raw("ps -Af  | grep -E 'tensorboard|lt --port' | awk '{print $2}' | xargs -I % kill -9 %")
  get_ipython().system_raw("rm {}".format(url_file))

# Authenticate to GCP; this import only succeeds inside Google Colab.
try:
  from google.colab import auth
  auth.authenticate_user()
  print("Authenticated")
except Exception as e:
  print("Failed to authenticate: {}".format(e))

In [0]:
%%bash
git clone https://github.com/osipov/training-data-analyst.git
cp -r training-data-analyst/bootcamps/imagereco/fashionmodel .

gcloud config set project $PROJECT
gcloud config set compute/region $REGION

The hyperparam.yaml file specifies a goal for hyperparameter tuning as well as the range, type, and scale of the hyperparameter values to explore. Cloud MLE uses a smart search algorithm to discover better-performing values within the specified constraints. In other words, it does not try out every single value. The algorithm, based on Bayesian optimization, uses the information gained during the search to adaptively choose which hyperparameter values to explore next.
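
To get a feel for why trying every value is impractical, the following sketch counts the combinations in the search space defined by the hyperparam.yaml file below. The numeric ranges contain far too many values to enumerate, so the sketch assumes a coarse grid of 10 points per numeric parameter. That grid size and the helper name grid_size are illustrative assumptions, not something Cloud MLE uses.

```python
from functools import reduce
from operator import mul

# Number of options per categorical parameter in hyperparam.yaml.
CATEGORICAL_OPTIONS = {"ksize1": 4, "ksize2": 4, "nfil1": 4, "nfil2": 4}

# Numeric (DOUBLE or INTEGER) parameters in hyperparam.yaml.
NUMERIC_PARAMS = ["learning_rate", "train_batch_size", "train_steps", "dprob"]


def grid_size(categorical_options, numeric_params, points_per_numeric=10):
    """Count the trials needed to cover a discretized search space exhaustively.

    points_per_numeric is a hypothetical grid resolution chosen only for
    illustration; the real numeric ranges contain far more values.
    """
    categorical = reduce(mul, categorical_options.values(), 1)
    numeric = points_per_numeric ** len(numeric_params)
    return categorical * numeric


# 4 * 4 * 4 * 4 categorical combinations times 10 ** 4 numeric grid points.
total_trials = grid_size(CATEGORICAL_OPTIONS, NUMERIC_PARAMS)
print(total_trials)
```

Even with this coarse grid, an exhaustive search would need millions of training runs, which is why an adaptive search strategy is used instead.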

The following configuration uses a single trial to reduce the amount of time that hyperparameter tuning takes during the workshop. When using Cloud MLE in production, you will commonly use as many parallel trials as your GCP account quotas and project budget permit.

Notice that in the following hyperparam.yaml file the kernel sizes (ksize1, ksize2) and the numbers of filters (nfil1, nfil2) are configured as categorical values with 4 possible options each. This is done just to illustrate how you can use categorical values for hyperparameter tuning. In practice, you could instead use an integer range with a linear or a log scale, as is done for the number of training steps (train_steps).
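
The difference between a linear and a log scale can be sketched as follows. The two helper functions are hypothetical and only illustrate the idea of spreading values evenly across a range versus evenly across orders of magnitude; they are not the service's internal implementation.

```python
import math


def from_unit_linear(u, min_value, max_value):
    """Map u in [0, 1] to [min_value, max_value] with even spacing."""
    return min_value + u * (max_value - min_value)


def from_unit_log(u, min_value, max_value):
    """Map u in [0, 1] to [min_value, max_value] with even spacing in log space.

    Both bounds must be strictly positive for the logarithm to be defined.
    """
    if min_value <= 0 or max_value <= 0:
        raise ValueError("log scaling requires strictly positive bounds")
    log_min, log_max = math.log10(min_value), math.log10(max_value)
    return 10 ** (log_min + u * (log_max - log_min))


# Midpoint of the learning_rate range from hyperparam.yaml under each scale.
linear_mid = from_unit_linear(0.5, 0.00001, 0.01)  # about 0.005
log_mid = from_unit_log(0.5, 0.00001, 0.01)        # about 0.000316
```

On a linear scale, half of the range lies above 0.005, so small learning rates are rarely explored. On a log scale, each order of magnitude gets an equal share of the range, which is usually a better fit for parameters such as learning_rate.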


In [0]:
%%writefile hyperparam.yaml
trainingInput:
  scaleTier: BASIC_GPU
  hyperparameters:
    goal: MAXIMIZE
    maxTrials: 1
    maxParallelTrials: 1
    hyperparameterMetricTag: accuracy
    params:
    - parameterName: learning_rate
      type: DOUBLE
      minValue: 0.00001
      maxValue: 0.01
      scaleType: UNIT_LOG_SCALE      
    - parameterName: train_batch_size
      type: INTEGER
      minValue: 128
      maxValue: 1024
      scaleType: UNIT_LOG_SCALE
    - parameterName: train_steps
      type: INTEGER
      minValue: 400
      maxValue: 4000
      scaleType: UNIT_LOG_SCALE        
    - parameterName: dprob
      type: DOUBLE
      minValue: 0.1
      maxValue: 0.4
      scaleType: UNIT_LINEAR_SCALE        
    - parameterName: ksize1
      type: CATEGORICAL
      categoricalValues: ["3", "5", "7", "11"]  
    - parameterName: ksize2
      type: CATEGORICAL
      categoricalValues: ["3", "5", "7", "11"]          
    - parameterName: nfil1
      type: CATEGORICAL
      categoricalValues: ["10", "15", "20", "25"]  
    - parameterName: nfil2
      type: CATEGORICAL
      categoricalValues: ["10", "15", "20", "25"]        

**Next, start a Cloud ML Engine hyperparameter tuning job on BASIC_GPU instances** using the hyperparam.yaml file for configuration.

Notice that when starting a hyperparameter tuning job, the hyperparameters that would otherwise be passed on the command line (e.g. learning_rate) are omitted. Instead, Cloud MLE provides the corresponding values for every trial of the hyperparameter tuning job.
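
On the trainer side, the per-trial values arrive as command-line arguments named after each parameterName. A minimal sketch of how a trainer could parse them with argparse is shown below. The function name build_parser and all default values are assumptions made for illustration; the actual trainer/task.py in the fashionmodel package may be organized differently.

```python
import argparse


def build_parser():
    """Build a parser for the flags in the submit command and hyperparam.yaml."""
    parser = argparse.ArgumentParser()
    # Flags given explicitly after the bare "--" in the gcloud command.
    parser.add_argument("--output_dir", required=True)
    parser.add_argument("--model", default="cnn_batch_norm")
    # gcloud forwards --job-dir to the trainer when it is set.
    parser.add_argument("--job-dir", default=None)
    # Flags supplied per trial; names match parameterName in hyperparam.yaml.
    # The defaults are placeholders so the trainer also runs without tuning.
    parser.add_argument("--learning_rate", type=float, default=0.001)
    parser.add_argument("--train_batch_size", type=int, default=512)
    parser.add_argument("--train_steps", type=int, default=1000)
    parser.add_argument("--dprob", type=float, default=0.25)
    # Categorical values are declared as strings in the YAML file,
    # so they are converted to integers here.
    for name in ("ksize1", "ksize2", "nfil1", "nfil2"):
        parser.add_argument("--" + name, type=int, default=5)
    return parser


# Example: the arguments a single trial might receive (values are made up).
args = build_parser().parse_args([
    "--output_dir", "gs://my-bucket/fashion/trained_cnn_batch_norm",
    "--learning_rate", "0.0005",
    "--ksize1", "7",
])
print(args.learning_rate, args.ksize1, args.train_steps)
```

Declaring a default for every tunable flag keeps the same entry point usable for both regular training jobs and hyperparameter tuning jobs.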

In [0]:
%%bash
OUTDIR=gs://${BUCKET}/fashion/trained_${MODEL_TYPE}
JOBNAME=fashion_${MODEL_TYPE}_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gcloud ml-engine jobs submit training $JOBNAME \
   --region=$REGION \
   --module-name=trainer.task \
   --package-path=${PWD}/fashionmodel/trainer \
   --job-dir=$OUTDIR \
   --staging-bucket=gs://$BUCKET \
   --scale-tier=BASIC_GPU \
   --runtime-version=$TFVERSION \
   --config=hyperparam.yaml \
   -- \
   --output_dir=$OUTDIR \
   --model=$MODEL_TYPE

Just like with regular training jobs, you can monitor hyperparameter tuning from the [Jobs](https://console.cloud.google.com/mlengine/jobs) section of the Cloud ML Engine service. Once the hyperparameter tuning job finishes, it should have discovered values for train_steps, learning_rate, and train_batch_size that can train a model to roughly 88-89% accuracy.
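
The results can also be inspected programmatically. The sketch below picks the best trial from a job description, such as the JSON printed by gcloud ml-engine jobs describe with --format=json. It assumes the description contains a trainingOutput.trials list whose entries carry trialId, hyperparameters, and finalMetric.objectiveValue. The helper name best_trial is hypothetical and the sample values are made up for illustration; they are not results from this notebook.

```python
def best_trial(job, goal="MAXIMIZE"):
    """Return the trial with the best objective value, or None if none reported."""
    trials = job.get("trainingOutput", {}).get("trials", [])
    # Skip trials that did not report a final metric (for example, failed ones).
    scored = [t for t in trials if "objectiveValue" in t.get("finalMetric", {})]
    if not scored:
        return None
    pick = max if goal == "MAXIMIZE" else min
    return pick(scored, key=lambda t: t["finalMetric"]["objectiveValue"])


# Made-up job description with the assumed structure.
sample_job = {
    "trainingOutput": {
        "trials": [
            {"trialId": "1",
             "hyperparameters": {"learning_rate": "0.0021", "ksize1": "5"},
             "finalMetric": {"trainingStep": "1000", "objectiveValue": 0.87}},
            {"trialId": "2",
             "hyperparameters": {"learning_rate": "0.0004", "ksize1": "3"},
             "finalMetric": {"trainingStep": "1000", "objectiveValue": 0.89}},
            {"trialId": "3", "hyperparameters": {"learning_rate": "0.0090"}},
        ]
    }
}

winner = best_trial(sample_job)
print(winner["trialId"], winner["hyperparameters"])
```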

## Monitoring training with TensorBoard

Models trained during hyperparameter tuning can also be monitored using TensorBoard. Execute the next cell to launch TensorBoard. Once you open the TensorBoard link, notice that each model is prefixed with a number, e.g. 1/eval. Since model.py uses the trial ID as part of the model output directory, every model trained during tuning has a unique prefix in TensorBoard. This makes it easy to compare the performance (e.g. accuracy) of different models in the same dashboard. You can experiment with training multiple models by changing maxTrials in the hyperparam.yaml file earlier in this notebook and starting new hyperparameter tuning jobs. Be careful not to exceed your quota!
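
One common way to obtain the trial ID is to read it from the TF_CONFIG environment variable, which Cloud MLE sets for each trial. The sketch below shows the idea under that assumption. The helper name trial_output_dir is hypothetical, and the code in the fashionmodel package may implement this differently.

```python
import json
import os


def trial_output_dir(output_dir, tf_config=None):
    """Append the trial ID from TF_CONFIG to output_dir, if one is present.

    tf_config is the raw JSON string; when omitted it is read from the
    TF_CONFIG environment variable, defaulting to an empty JSON object.
    """
    if tf_config is None:
        tf_config = os.environ.get("TF_CONFIG", "{}")
    trial = json.loads(tf_config).get("task", {}).get("trial", "")
    if not trial:
        # Not a hyperparameter tuning job: keep the directory unchanged.
        return output_dir
    # Build the path by hand because output_dir is usually a gs:// URL.
    return output_dir.rstrip("/") + "/" + str(trial)


# Example with a made-up TF_CONFIG value for the first trial.
example_dir = trial_output_dir(
    "gs://my-bucket/fashion/trained_cnn_batch_norm",
    tf_config='{"task": {"type": "master", "index": 0, "trial": "1"}}',
)
print(example_dir)
```

Writing each trial to its own subdirectory keeps checkpoints from different trials from overwriting each other and produces the numbered prefixes seen in TensorBoard.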

In [0]:
!npm install -g localtunnel
start_tensorboard('gs://{}/fashion/trained_{}'.format(BUCKET, MODEL_TYPE), 'url')

In [0]:
%sx read -p 'Press Enter in the input box to stop TensorBoard '
stop_tensorboard('url')
print("Stopped")

<pre>
# Copyright 2017 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
</pre>