### Hyperparameter Tuning

The CloudML service supports hyperparameter tuning. Any command-line argument that your training program exposes can be tuned. To enable tuning, add a "hyperparameters" section to the cloud run input.

In [4]:
import os

bucket = 'gs://' + datalab_project_id() + '-sampledata'
package_path = os.path.join(bucket, 'census', 'model', 'trainer-0.1.tar.gz')
train_data_path = os.path.join(bucket, 'census', 'preprocessed', 'features_train')
eval_data_path = os.path.join(bucket, 'census', 'preprocessed', 'features_eval')
metadata_path = os.path.join(bucket, 'census', 'preprocessed', 'metadata.yaml')
output_path = os.path.join(bucket, 'census', 'hptuning')
summary_dir_pattern = os.path.join(bucket, 'census', 'hptuning', '*', 'summaries')
eval_dir_pattern = os.path.join(bucket, 'census', 'hptuning', '*', 'eval')

The following sample tunes three hyperparameters, "hidden1", "hidden2", and "hidden3", which set the sizes of the three hidden layers. It submits 15 trials, at most 5 of them in parallel; the CloudML service chooses the parameter values for each trial.

In [5]:
%ml train --cloud
package_uris: $package_path
python_module: trainer.task
scale_tier: BASIC
region: us-west1
args:
  train_data_paths:
    - $train_data_path
  eval_data_paths:
    - $eval_data_path
  metadata_path: $metadata_path
  output_path: $output_path
  max_steps: 3000
hyperparameters:
  goal: MINIMIZE
  max_trials: 15
  max_parallel_trials: 5
  params:
    - parameter_name: hidden1
      type: INTEGER
      min_value: 100
      max_value: 300
      scale_type: UNIT_LINEAR_SCALE
    - parameter_name: hidden2
      type: INTEGER
      min_value: 50
      max_value: 100
      scale_type: UNIT_LINEAR_SCALE
    - parameter_name: hidden3
      type: INTEGER
      min_value: 10
      max_value: 50
      scale_type: UNIT_LINEAR_SCALE
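
For the service to vary "hidden1", "hidden2", and "hidden3", the training program has to accept them as command-line arguments. The sketch below shows one way a trainer might do that with `argparse`. It is an illustration only: the function name `parse_hidden_layer_args` and the default values are assumptions, and the actual `trainer.task` module in the sample package may parse its arguments differently.

```python
import argparse


def parse_hidden_layer_args(argv=None):
    """Parse the three tunable layer sizes (hypothetical helper, not from the sample)."""
    parser = argparse.ArgumentParser()
    # Defaults are placeholders chosen inside the ranges given in the tuning config.
    parser.add_argument('--hidden1', type=int, default=200)
    parser.add_argument('--hidden2', type=int, default=75)
    parser.add_argument('--hidden3', type=int, default=30)
    # parse_known_args ignores the other flags (train_data_paths, output_path, ...).
    args, _unknown = parser.parse_known_args(argv)
    return args


if __name__ == '__main__':
    # Simulate the flags one trial might receive.
    args = parse_hidden_layer_args(['--hidden1', '150', '--hidden2', '60',
                                    '--hidden3', '20', '--max_steps', '3000'])
    print(args.hidden1, args.hidden2, args.hidden3)
```

Using `parse_known_args` keeps this helper independent of the remaining arguments, so it can sit alongside whatever parsing the trainer already does.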

The training results appear in the "trainingOutput" field of the job. The field is empty at first and fills in as trials finish.

In [17]:
%ml jobs --name trainer_task_160922_014919
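
Once trials have finished, the "trainingOutput" field can also be inspected programmatically. The sketch below picks the best finished trial from such a structure. It assumes the field contains a `trials` list whose entries carry `trialId`, `hyperparameters`, and `finalMetric.objectiveValue`; the helper name `best_trial` is hypothetical, and the sample values are made up for illustration and are not results from this job.

```python
def best_trial(training_output, goal='MINIMIZE'):
    """Return the finished trial with the best objective value, or None."""
    # Trials that are still running may not have a finalMetric yet.
    finished = [t for t in training_output.get('trials', [])
                if 'finalMetric' in t]
    if not finished:
        return None
    pick = min if goal == 'MINIMIZE' else max
    return pick(finished, key=lambda t: t['finalMetric']['objectiveValue'])


# Illustrative data only; the numbers are invented for this example.
sample_output = {
    'trials': [
        {'trialId': '1',
         'hyperparameters': {'hidden1': '120', 'hidden2': '80', 'hidden3': '40'},
         'finalMetric': {'trainingStep': '3000', 'objectiveValue': 0.21}},
        {'trialId': '2',
         'hyperparameters': {'hidden1': '250', 'hidden2': '60', 'hidden3': '25'},
         'finalMetric': {'trainingStep': '3000', 'objectiveValue': 0.18}},
        {'trialId': '3',
         'hyperparameters': {'hidden1': '180', 'hidden2': '95', 'hidden3': '15'}},
    ]
}

if __name__ == '__main__':
    best = best_trial(sample_output, goal='MINIMIZE')
    print(best['trialId'], best['hyperparameters'])
```

The `goal` argument mirrors the `goal` setting in the tuning configuration, so the same helper works whether the objective is minimized or maximized.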

The "--trials" flag tells Datalab to draw a parallel coordinates plot of the trial results.

In [21]:
%ml jobs --name trainer_task_160922_014919 --trials

"%ml summary" compares the runs using their TensorFlow event files.

In [22]:
%ml summary --dir $summary_dir_pattern --name loss --step

In [20]:
%ml summary --dir $eval_dir_pattern --name error --step
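
The `--dir` patterns used above contain a `*`, which is meant to stand for one subdirectory per trial. The sketch below illustrates which paths such a pattern selects. It uses Python's `fnmatch` only as an approximation of wildcard matching; how Datalab actually expands the pattern is not shown in this notebook, and the bucket name and directory list are hypothetical.

```python
import fnmatch

# Hypothetical bucket name; the notebook derives the real one from the project id.
pattern = 'gs://my-project-sampledata/census/hptuning/*/summaries'

# Hypothetical directory listing, assuming one subdirectory per trial.
candidates = [
    'gs://my-project-sampledata/census/hptuning/1/summaries',
    'gs://my-project-sampledata/census/hptuning/2/summaries',
    'gs://my-project-sampledata/census/hptuning/1/eval',
    'gs://my-project-sampledata/census/hptuning/summaries',
]

# Keep only the paths that fit the wildcard pattern (case-sensitive match).
matches = [p for p in candidates if fnmatch.fnmatchcase(p, pattern)]

if __name__ == '__main__':
    for path in matches:
        print(path)
```

Under these assumptions, only the per-trial "summaries" directories are selected, which is why a single `%ml summary` call can overlay the curves from every trial.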