### Hyperparameter Tuning

The CloudML service supports hyperparameter tuning. Any command-line argument that your training program exposes can be tuned. To enable tuning, add a "hyperparameters" section to the cloud run input.
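
For a parameter to be tunable, the training program has to accept it as a program argument. The sketch below shows one way a trainer might do that with `argparse`. It is an illustration only and not the actual `trainer.task` module: the flag names match the parameters tuned later in this section, while the default values and the `parse_args` helper are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse trainer flags; hypothetical helper, not the real trainer.task."""
    parser = argparse.ArgumentParser()
    # Flags the tuning service could set for each trial (defaults are made up).
    parser.add_argument('--hidden', type=int, default=20)
    parser.add_argument('--learning_rate', type=float, default=0.01)
    # Ordinary, non-tuned argument.
    parser.add_argument('--output_dir', default='/tmp/output')
    # Ignore flags this sketch does not know about instead of failing.
    args, _ = parser.parse_known_args(argv)
    return args


if __name__ == '__main__':
    # Simulate the arguments one trial might receive.
    args = parse_args(['--hidden', '32', '--learning_rate', '0.005'])
    print(args.hidden, args.learning_rate, args.output_dir)
```

Typed flags with explicit defaults let the same program run unchanged with or without tuning.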

In [3]:
import os

bucket = 'gs://' + datalab_project_id() + '-sampledata'
package_path = os.path.join(bucket, 'iris', 'model', 'trainer-0.1.tar.gz')
train_data_path = os.path.join(bucket, 'iris', 'preprocessed', 'features_train.tfrecord.gz')
eval_data_path = os.path.join(bucket, 'iris', 'preprocessed', 'features_eval.tfrecord.gz')
metadata_path = os.path.join(bucket, 'iris', 'preprocessed', 'metadata.yaml')
output_path = os.path.join(bucket, 'iris', 'hptuning')
summary_dir_pattern = os.path.join(bucket, 'iris', 'hptuning', '*')
eval_dir_pattern = os.path.join(bucket, 'iris', 'hptuning', '*', 'eval_one_pass')

The following sample tunes two hyperparameters, "hidden" and "learning_rate". The job runs up to 15 trials (max_trials), at most 3 at a time (max_parallel_trials), and the CloudML service sets the parameter values for each trial.

In [5]:
%mlalpha train --cloud
package_uris: $package_path
python_module: trainer.task
scale_tier: STANDARD_1
region: us-central1
args:
  train_data_paths:
    - $train_data_path
  eval_data_paths:
    - $eval_data_path
  metadata_path: $metadata_path
  output_dir: $output_path
hyperparameters:
  goal: MAXIMIZE
  max_trials: 15
  max_parallel_trials: 3
  params:
    - parameter_name: hidden
      type: INTEGER
      min_value: 10
      max_value: 50
      scale_type: UNIT_LINEAR_SCALE
    - parameter_name: learning_rate
      type: DOUBLE
      min_value: 0.0001
      max_value: 0.1
      scale_type: UNIT_LOG_SCALE      
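
To give some intuition for the two scale types above, the sketch below maps a position in the unit interval [0, 1] back to a parameter value, linearly for "hidden" and logarithmically for "learning_rate". It only illustrates what linear and log scaling of a range mean; it is not the service's search algorithm, and both helper functions are hypothetical.

```python
import math


def from_unit_linear(u, min_value, max_value):
    """Map u in [0, 1] linearly onto [min_value, max_value]."""
    return min_value + u * (max_value - min_value)


def from_unit_log(u, min_value, max_value):
    """Map u in [0, 1] onto [min_value, max_value], evenly in log space."""
    log_min, log_max = math.log(min_value), math.log(max_value)
    return math.exp(log_min + u * (log_max - log_min))


if __name__ == '__main__':
    for u in (0.0, 0.25, 0.5, 0.75, 1.0):
        hidden = int(round(from_unit_linear(u, 10, 50)))
        learning_rate = from_unit_log(u, 0.0001, 0.1)
        print(u, hidden, learning_rate)
```

On a log scale the midpoint of the learning rate range is about 0.00316 rather than about 0.05, so small learning rates get as much attention as large ones.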

The training results appear in the "trainingOutput" field of the job. The field is initially empty and fills in as trials finish.
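
Once "trainingOutput" has some finished trials, picking the best one is a small exercise. The sketch below assumes the field is a dictionary with a `trials` list whose entries carry `trialId`, `hyperparameters`, and `finalMetric.objectiveValue`; that layout is an assumption, so check it against the actual job output. The sample values are made up for illustration and are not real results.

```python
def best_trial(training_output, goal='MAXIMIZE'):
    """Return the finished trial with the best objective value, or None."""
    # Only trials that have reported a final metric can be compared.
    trials = [t for t in training_output.get('trials', [])
              if 'finalMetric' in t]
    if not trials:
        return None

    def objective(trial):
        return trial['finalMetric']['objectiveValue']

    if goal == 'MAXIMIZE':
        return max(trials, key=objective)
    return min(trials, key=objective)


# Made-up sample in the assumed layout; not output from a real job.
sample_output = {
    'trials': [
        {'trialId': '1',
         'hyperparameters': {'hidden': '20', 'learning_rate': '0.01'},
         'finalMetric': {'trainingStep': '1000', 'objectiveValue': 0.90}},
        {'trialId': '2',
         'hyperparameters': {'hidden': '40', 'learning_rate': '0.003'},
         'finalMetric': {'trainingStep': '1000', 'objectiveValue': 0.95}},
        {'trialId': '3',
         'hyperparameters': {'hidden': '15', 'learning_rate': '0.05'}},
    ]
}

if __name__ == '__main__':
    best = best_trial(sample_output)
    print(best['trialId'], best['hyperparameters'])
```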

In [1]:
%mlalpha jobs --name trainer_task_160927_174839

Adding "--trials" to the previous command plots a parallel coordinates graph once at least one trial of the hyperparameter tuning job has finished. The "--trials" flag only works for hyperparameter tuning jobs.

In [2]:
%mlalpha jobs --name trainer_task_160927_174839 --trials

Once some trials have finished, we can inspect their TensorFlow events. This takes a while because the event data is read from GCS rather than from local disk.

In [4]:
%mlalpha summary --dir $summary_dir_pattern --name loss --step

In [5]:
%mlalpha summary --dir $eval_dir_pattern --name accuracy --step