## Training

Training requires a Python package, distributed as a tarball, that contains your TensorFlow-based training program. CloudML provides several general-purpose model trainers, but this sample uses a package built to train the Iris sample.

### Local Training

First, copy the package to the local file system.

In [1]:
!gsutil cp gs://cloud-datalab/sampledata/ml/iris/trainer-0.3.tar.gz /content/datalab/tmp/ml/iris

Copying gs://cloud-datalab/sampledata/ml/iris/trainer-0.3.tar.gz...
Downloading ...//content/datalab/tmp/ml/iris/trainer-0.3.tar.gz: 7.32 KiB/7.32 KiB    


Run "%ml train" to generate the training cell template.

In [None]:
%%ml train

Fill in the required fields and run the cell. <br>
Datalab simulates the CloudML service by creating master, worker, and ps (parameter server) processes to perform distributed training. In the cloud these run on different VMs, but here all of them run in the local container VM.<br>
You can set replica_count to 0 to skip a job type, such as ps, but master is required.<br>
The output contains links to each process's log and refreshes every 3 seconds to show the last few lines of the logs. Use a local run to quickly validate your training program and parameters before submitting a large-scale training job to the cloud.<br>
If training gets stuck for any reason, click "Reset Session" to reset the kernel. All training processes will be cleaned up. <br><br>
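The replica layout described above can be pictured as a small cluster description. The following is a minimal sketch, not Datalab's actual implementation: the function name `build_local_cluster`, the `localhost` addresses, and the starting port are all hypothetical, and the dictionary shape only resembles the kind of cluster description used for distributed TensorFlow. It illustrates that a job type with a replica count of 0 is left out, while master is always required.

```python
def build_local_cluster(master=1, worker=1, ps=1, base_port=2222):
    """Map each job type to a list of hypothetical localhost addresses."""
    if master < 1:
        raise ValueError("at least one master replica is required")
    cluster = {}
    port = base_port
    for job_type, count in (("master", master), ("worker", worker), ("ps", ps)):
        if count == 0:
            continue  # a replica count of 0 drops this job type entirely
        cluster[job_type] = ["localhost:%d" % (port + i) for i in range(count)]
        port += count
    return cluster


# One master and one worker, no parameter server.
print(build_local_cluster(master=1, worker=1, ps=0))
```

All addresses point at the same host, which mirrors the local case where every process runs in one container VM.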

Note that we replaced "scale_tier: BASIC" with "scale_tier: CUSTOM" and set "worker_count" and "parameter_server_count" explicitly.


In [2]:
%ml train
package_uris: /content/datalab/tmp/ml/iris/trainer-0.3.tar.gz
python_module: trainer.task
scale_tier: CUSTOM
worker_count: 1
parameter_server_count: 1
args:
  train_data_paths:
    - /content/datalab/tmp/ml/iris/preprocessed/features_train
  eval_data_paths:
    - /content/datalab/tmp/ml/iris/preprocessed/features_eval
  metadata_path: /content/datalab/tmp/ml/iris/preprocessed/metadata.yaml
  output_path: /content/datalab/tmp/ml/iris/model
  max_steps: 1000

Note that after training completes, you can increase "max_steps" and run the cell again. Training will resume from the previous checkpoint.
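To see why "max_steps" has to grow before a rerun does anything, here is a small arithmetic sketch. It assumes that "max_steps" is a limit on the total step count stored in the checkpoint rather than a per-run count; the helper `steps_to_run` is hypothetical and is not part of the trainer package.

```python
def steps_to_run(checkpoint_step, max_steps):
    """Return how many more steps a resumed run would perform."""
    return max(0, max_steps - checkpoint_step)


print(steps_to_run(0, 1000))     # fresh run: 1000
print(steps_to_run(1000, 1000))  # rerun with the same limit: 0
print(steps_to_run(1000, 2000))  # rerun after raising the limit: 1000
```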

Check the output of the training. The "model" directory contains the model files (last checkpoint, graph metadata, etc.). The "summaries" directory contains summary events.

In [3]:
!ls /content/datalab/tmp/ml/iris/model

eval  logdir  model  summaries
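The same check can be done from Python. This is a minimal sketch using only the standard library; the helper name `missing_outputs` is hypothetical, and the default list of expected directories is taken from the description above.

```python
import os


def missing_outputs(output_path, expected=("model", "summaries")):
    """Return the expected subdirectories that are absent from output_path."""
    present = set(
        name for name in os.listdir(output_path)
        if os.path.isdir(os.path.join(output_path, name))
    )
    return [name for name in expected if name not in present]


output_path = "/content/datalab/tmp/ml/iris/model"
if os.path.isdir(output_path):
    # An empty list means both directories were written.
    print(missing_outputs(output_path))
```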


You can start TensorBoard to view training results.

In [4]:
%tensorboard start --logdir /content/datalab/tmp/ml/iris/model/

Shut down the TensorBoard server. The pid below is from this session; use the pid of your own server.

In [5]:
%tensorboard stop --pid 121454

Let's train another model, this time with learning_rate set to 0.001. learning_rate is an argument defined by the training program in the package; its default value is 0.01.

In [6]:
%ml train
package_uris: /content/datalab/tmp/ml/iris/trainer-0.3.tar.gz
python_module: trainer.task
scale_tier: BASIC
args:
  train_data_paths:
    - /content/datalab/tmp/ml/iris/preprocessed/features_train
  eval_data_paths:
    - /content/datalab/tmp/ml/iris/preprocessed/features_eval
  metadata_path: /content/datalab/tmp/ml/iris/preprocessed/metadata.yaml
  output_path: /content/datalab/tmp/ml/iris/model_lr
  max_steps: 1000
  learning_rate: 0.001
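For readers unfamiliar with how a training program exposes an argument like learning_rate, the sketch below shows one common way to do it with `argparse`. It is an assumption for illustration, not the source of `trainer.task`: the function name `parse_trainer_args` is hypothetical, and only the argument name and its default of 0.01 come from the text above.

```python
import argparse


def parse_trainer_args(argv):
    """Parse a learning_rate flag, ignoring any other arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning_rate", type=float, default=0.01)
    # parse_known_args tolerates flags this sketch does not define.
    args, _unknown = parser.parse_known_args(argv)
    return args


print(parse_trainer_args([]).learning_rate)
print(parse_trainer_args(["--learning_rate", "0.001"]).learning_rate)
```

When the flag is omitted the default applies, which is why the earlier run trained with 0.01.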

### Cloud Training

Cloud training is similar, but adds the "--cloud" flag and uses GCS paths instead of local paths. <br>
We will use the preprocessed files created by cloud preprocessing in the previous "Preprocess" notebook.

Define variables that will be used later.

In [16]:
import os

bucket = 'gs://' + datalab_project_id() + '-sampledata'
package_path = os.path.join(bucket, 'iris', 'model', 'trainer-0.3.tar.gz')
train_data_path = os.path.join(bucket, 'iris', 'preprocessed', 'features_train')
eval_data_path = os.path.join(bucket, 'iris', 'preprocessed', 'features_eval')
metadata_path = os.path.join(bucket, 'iris', 'preprocessed', 'metadata.yaml')
output_path = os.path.join(bucket, 'iris', 'trained')
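The cell above builds GCS paths with `os.path.join`, which yields forward slashes on a Linux host. As a hedged alternative, the sketch below uses `posixpath.join` so that the separator is always "/" regardless of the host operating system. The helper name `gcs_join` and the bucket name are hypothetical.

```python
import posixpath


def gcs_join(bucket, *parts):
    """Join path components with '/' regardless of the host OS."""
    return posixpath.join(bucket, *parts)


bucket = "gs://my-project-sampledata"  # hypothetical bucket name
print(gcs_join(bucket, "iris", "preprocessed", "features_train"))
# gs://my-project-sampledata/iris/preprocessed/features_train
```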

In [17]:
!gsutil cp gs://cloud-datalab/sampledata/ml/iris/trainer-0.3.tar.gz $package_path

Copying gs://cloud-datalab/sampledata/ml/iris/trainer-0.3.tar.gz [Content-Type=application/gzip]...
Copying     ...tomated-sampledata/iris/model/trainer-0.3.tar.gz: 7.32 KiB/7.32 KiB    


In [18]:
%ml train --cloud
package_uris: $package_path
python_module: trainer.task
scale_tier: BASIC
region: us-central1
args:
  train_data_paths:
    - $train_data_path
  eval_data_paths:
    - $eval_data_path
  metadata_path: $metadata_path
  output_path: $output_path

View the job status as described in the output. You can also run "%ml jobs --filter state!=SUCCEEDED" to see all ML jobs in the project that have not succeeded, which includes the active ones.

In [24]:
%ml jobs --name trainer_task_160901_052526
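To make the effect of a "state!=SUCCEEDED" filter concrete, the following sketch applies the same condition to a hand-written list of jobs. The job names are hypothetical and the list is made up for illustration; it is not output from the service. The point is that such a filter keeps failed or cancelled jobs as well as running ones.

```python
# Hypothetical job records; only the "state" values matter here.
jobs = [
    {"name": "job_a", "state": "RUNNING"},
    {"name": "job_b", "state": "SUCCEEDED"},
    {"name": "job_c", "state": "FAILED"},
    {"name": "job_d", "state": "QUEUED"},
]


def not_succeeded(job_list):
    """Return the names of jobs whose state is anything but SUCCEEDED."""
    return [job["name"] for job in job_list if job["state"] != "SUCCEEDED"]


print(not_succeeded(jobs))
```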

View the trained model once the state is 'SUCCEEDED':

In [21]:
!gsutil ls $output_path

gs://cloud-ml-test-automated-sampledata/iris/trained/eval/
gs://cloud-ml-test-automated-sampledata/iris/trained/logdir/
gs://cloud-ml-test-automated-sampledata/iris/trained/model/
gs://cloud-ml-test-automated-sampledata/iris/trained/summaries/


TensorBoard accepts GCS paths, so it works with the output of cloud training too.