## Training

Training requires a Python package, delivered as a tarball, that contains your TensorFlow-based training program. Although CloudML provides several general-purpose model trainers, this sample uses a package built to train the Iris sample.

### Local Training

First, copy the package to the local file system.

In [14]:
!gsutil cp gs://cloud-datalab/sampledata/ml/iris/trainer-0.1.tar.gz /content/datalab/ml/iris/

Copying gs://cloud-datalab/sampledata/ml/iris/trainer-0.1.tar.gz...
Downloading file:///content/datalab/ml/iris/trainer-0.1.tar.gz:  6.98 KiB/6.98 KiB    


Run "%ml train" to generate the training cell template.

In [None]:
%%ml train
trainer_uri: REQUIRED_Fill_In_Gcs_or_Local_Path
module_name: REQUIRED_Fill_In
master_spec:
  replica_count: 1
worker_spec:
  replica_count: 1
ps_spec:
  replica_count: 1
train_data_paths:
  - Fill_In_Gcs_or_Local_Path
eval_data_paths:
  - Fill_In_Gcs_or_Local_Path
metadata_path: REQUIRED_Fill_In_Gcs_or_Local_Path
output_path: REQUIRED_Fill_In_Gcs_or_Local_Path
job_args: Your_Program_Args_Goes_Here


Fill in the required fields and run the cell. <br>
Datalab simulates the CloudML service by creating master, worker, and ps processes to perform distributed training. In the cloud these run on different VMs, but here all of them run in the local container VM.<br>
You can set replica_count to 0 to skip a job type, such as ps. The master is always required.<br>
The output of the training cell contains links to the output logs of each process. It refreshes every 3 seconds to show the last few lines of the logs. You can use a local run to quickly validate your training program and parameters before submitting the job to the cloud for large-scale training.<br>
If the training gets stuck for any reason, click "Reset Session" to reset the kernel. All training processes will be cleaned up.
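A training program usually needs to know which of these roles it is running as. The sketch below assumes the processes receive a `TF_CONFIG` environment variable holding a JSON object with `cluster` and `task` entries, which is the convention used for Cloud ML training. Whether the trainer in this package reads it this way is not shown here, and the helper name `describe_task` and the host addresses are made up for illustration.

```python
import json
import os


def describe_task(environ=None):
    """Return (job_type, task_index, cluster) parsed from TF_CONFIG.

    Hypothetical helper: assumes TF_CONFIG is a JSON object with
    "cluster" and "task" entries.
    """
    environ = os.environ if environ is None else environ
    config = json.loads(environ.get("TF_CONFIG", "{}"))
    task = config.get("task", {})
    cluster = config.get("cluster", {})
    # Without TF_CONFIG, behave like a single, non-distributed master.
    return task.get("type", "master"), int(task.get("index", 0)), cluster


# Illustrative environment for one worker; the addresses are placeholders.
example_env = {
    "TF_CONFIG": json.dumps({
        "cluster": {
            "master": ["localhost:2222"],
            "worker": ["localhost:2223"],
            "ps": ["localhost:2224"],
        },
        "task": {"type": "worker", "index": 0},
    })
}

job_type, index, cluster = describe_task(example_env)
print(job_type, index, sorted(cluster))
```

A job type whose replica_count is 0 would simply be absent from the cluster description in this sketch.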


In [17]:
%%ml train
trainer_uri: /content/datalab/ml/iris/trainer-0.1.tar.gz
module_name: trainer.task
master_spec:
  replica_count: 1
worker_spec:
  replica_count: 1
ps_spec:
  replica_count: 1
train_data_paths:
  - /content/datalab/ml/iris/preprocessed/features_train-00000-of-00001
eval_data_paths:
  - /content/datalab/ml/iris/preprocessed/features_eval-00000-of-00001
metadata_path: /content/datalab/ml/iris/preprocessed/metadata.yaml
output_path: /content/datalab/ml/iris/model

Check the output of the training. The "model" directory contains the model files (the last checkpoint, graph metadata, and so on), and the "summaries" directory contains summary events.

In [9]:
!ls /content/datalab/ml/iris/model

eval  logdir  model  summaries


You can start TensorBoard to view training results.

In [25]:
%tensorboard start --logdir /content/datalab/ml/iris/model/

Shut down the TensorBoard server.

In [26]:
%tensorboard stop --pid 129581

Train another model, this time for 3000 steps. "max_steps" is an argument defined by the training program in the package.

In [18]:
%%ml train
trainer_uri: /content/datalab/ml/iris/trainer-0.1.tar.gz
module_name: trainer.task
master_spec:
  replica_count: 1
worker_spec:
  replica_count: 1
ps_spec:
  replica_count: 1
train_data_paths:
  - /content/datalab/ml/iris/preprocessed/features_train-00000-of-00001
eval_data_paths:
  - /content/datalab/ml/iris/preprocessed/features_eval-00000-of-00001
metadata_path: /content/datalab/ml/iris/preprocessed/metadata.yaml
output_path: /content/datalab/ml/iris/model3000
job_args:
  - '--max_steps'
  - '3000'
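The entries under job_args are passed to the training program as command-line arguments. The trainer's real argument handling lives inside the package and is not shown here; the following is only a minimal sketch of how a program could accept "--max_steps" with `argparse`. The function name `parse_job_args` and the default of 1000 are placeholders, not values taken from the package.

```python
import argparse


def parse_job_args(argv):
    """Parse trainer arguments; a hypothetical stand-in for the real trainer."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--max_steps",
        type=int,
        default=1000,  # placeholder default, not the package's actual value
        help="Maximum number of training steps.",
    )
    # parse_known_args tolerates extra flags the sketch does not define.
    args, unknown = parser.parse_known_args(argv)
    return args, unknown


# The same two items that appear under job_args in the cell above.
args, unknown = parse_job_args(["--max_steps", "3000"])
print(args.max_steps)
```

Each flag and its value are separate list items in job_args, which matches how `argparse` expects them to arrive in `argv`.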

### Cloud Training

Cloud training is similar, but you add the "--cloud" flag and use GCS paths everywhere instead of local paths. <br>
You also need a project that is whitelisted for CloudML. Use "%projects set project-id" to select it.
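Because a cloud job cannot read files from the local container, it can help to check the configuration for leftover local paths before submitting. The sketch below is a hypothetical helper (`find_local_paths` is not part of Datalab) that takes the training configuration as a plain dictionary using the field names from the template above and reports every path that does not start with "gs://".

```python
PATH_FIELDS = [
    "trainer_uri",
    "train_data_paths",
    "eval_data_paths",
    "metadata_path",
    "output_path",
]


def find_local_paths(config):
    """Return (field, path) pairs whose path is not a gs:// URI."""
    offenders = []
    for field in PATH_FIELDS:
        value = config.get(field, [])
        # Some fields hold a single string, others a list of strings.
        paths = [value] if isinstance(value, str) else list(value)
        for path in paths:
            if not path.startswith("gs://"):
                offenders.append((field, path))
    return offenders


# Example configuration with one path that was left pointing at local disk.
config = {
    "trainer_uri": "gs://cloud-ml-test-automated-sampledata/iris/model/trainer-0.1.tar.gz",
    "train_data_paths": [
        "gs://cloud-ml-test-automated-sampledata/iris/preprocessing/features_train-00000-of-00004",
    ],
    "metadata_path": "/content/datalab/ml/iris/preprocessed/metadata.yaml",
    "output_path": "gs://cloud-ml-test-automated-sampledata/iris/trainedmodel",
}

for field, path in find_local_paths(config):
    print("Not a GCS path:", field, path)
```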

In [28]:
!gsutil cp /content/datalab/ml/iris/trainer-0.1.tar.gz gs://cloud-ml-test-automated-sampledata/iris/model/trainer-0.1.tar.gz

Copying file:///content/datalab/ml/iris/trainer-0.1.tar.gz [Content-Type=application/x-tar]...
Uploading   ...tomated-sampledata/iris/model/trainer-0.1.tar.gz: 6.98 KiB/6.98 KiB    


In [6]:
%%ml train --cloud
trainer_uri: gs://cloud-ml-test-automated-sampledata/iris/model/trainer-0.1.tar.gz
module_name: trainer.task
master_spec:
  replica_count: 1
worker_spec:
  replica_count: 1
ps_spec:
  replica_count: 1
train_data_paths:
  - gs://cloud-ml-test-automated-sampledata/iris/preprocessing/features_train-00000-of-00004
  - gs://cloud-ml-test-automated-sampledata/iris/preprocessing/features_train-00001-of-00004
  - gs://cloud-ml-test-automated-sampledata/iris/preprocessing/features_train-00002-of-00004
  - gs://cloud-ml-test-automated-sampledata/iris/preprocessing/features_train-00003-of-00004
eval_data_paths:
  - gs://cloud-ml-test-automated-sampledata/iris/preprocessing/features_eval-00000-of-00003
  - gs://cloud-ml-test-automated-sampledata/iris/preprocessing/features_eval-00001-of-00003
  - gs://cloud-ml-test-automated-sampledata/iris/preprocessing/features_eval-00002-of-00003
metadata_path: gs://cloud-ml-test-automated-sampledata/iris/preprocessing/metadata.yaml
output_path: gs://cloud-ml-test-automated-sampledata/iris/trainedmodel

View the job status. (You can also run "%ml jobs --active" to see all active ML jobs in the project.)

In [7]:
%ml jobs --name trainer_task_160818_071848

View the trained model:

In [8]:
!gsutil ls gs://cloud-ml-test-automated-sampledata/iris/trainedmodel

gs://cloud-ml-test-automated-sampledata/iris/trainedmodel/eval/
gs://cloud-ml-test-automated-sampledata/iris/trainedmodel/logdir/
gs://cloud-ml-test-automated-sampledata/iris/trainedmodel/model/
gs://cloud-ml-test-automated-sampledata/iris/trainedmodel/summaries/


TensorBoard accepts GCS paths, so it works with cloud training too.