## Training

Training requires a Python package, packaged as a tarball, that contains your TensorFlow-based training program. CloudML provides several general-purpose training packages, but this sample uses a package created specifically for the Census sample.

### Local Training

First, copy the package to the local file system.

In [19]:
!gsutil cp gs://cloud-datalab/sampledata/ml/census/trainer-0.1.tar.gz /content/datalab/ml/census/

Copying gs://cloud-datalab/sampledata/ml/census/trainer-0.1.tar.gz...
Downloading ...:///content/datalab/ml/census/trainer-0.1.tar.gz: 6.49 KiB/6.49 KiB    


Run "%ml train" to generate the training cell template.

In [None]:
%%ml train
trainer_uri: REQUIRED_Fill_In_Gcs_or_Local_Path
module_name: REQUIRED_Fill_In
master_spec:
  replica_count: 1
worker_spec:
  replica_count: 1
ps_spec:
  replica_count: 1
train_data_paths:
  - Fill_In_Gcs_or_Local_Path
eval_data_paths:
  - Fill_In_Gcs_or_Local_Path
metadata_path: REQUIRED_Fill_In_Gcs_or_Local_Path
output_path: REQUIRED_Fill_In_Gcs_or_Local_Path
job_args: Your_Program_Args_Goes_Here

Fill in the required fields and run the cell. <br>
Datalab simulates the CloudML service by creating master, worker, and ps (parameter server) processes to perform distributed training. In the cloud these run on different VMs; here all of them run in the local container VM.<br>
Set replica_count to 0 to disable a job type, such as ps. The master is required. In this case, we enable only the master.<br>
The output contains links to the processes' output logs and refreshes every 3 seconds to show the last few lines of each log. Use a local run to quickly validate your training program and parameters before submitting the job to the cloud for large-scale training.<br>
If training gets stuck for any reason, click "Reset Session" to reset the kernel. All training processes will be cleaned up.
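
The replica rules above can be sketched in plain Python. This is only an illustration: `enabled_job_types` is a hypothetical helper, not part of Datalab or CloudML, and the dictionary simply mirrors the `*_spec` fields of the cell template. Treating a missing spec as disabled is an assumption of this sketch.

```python
# Hypothetical helper, not part of Datalab or CloudML. It only mirrors the
# replica_count rules described in the text: 0 disables a job type, and
# the master must stay enabled.
JOB_TYPES = ("master", "worker", "ps")


def enabled_job_types(spec):
    """Return the job types whose replica_count is greater than 0."""
    enabled = [
        job for job in JOB_TYPES
        # Assumption of this sketch: a missing spec counts as disabled.
        if spec.get(job + "_spec", {}).get("replica_count", 0) > 0
    ]
    if "master" not in enabled:
        raise ValueError("master is required: give master_spec a replica_count of at least 1")
    return enabled


# Master only, matching the local run in this section.
local_spec = {
    "master_spec": {"replica_count": 1},
    "worker_spec": {"replica_count": 0},
    "ps_spec": {"replica_count": 0},
}
print(enabled_job_types(local_spec))
```

Running a check like this before the training cell makes a disabled master fail fast instead of surfacing later in the process logs.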

In [21]:
%%ml train
trainer_uri: /content/datalab/ml/census/trainer-0.1.tar.gz
module_name: trainer.task
master_spec:
  replica_count: 1
worker_spec:
  replica_count: 0
ps_spec:
  replica_count: 0
train_data_paths:
  - /content/datalab/ml/census/preprocessed/features_train-00000-of-00001
eval_data_paths:
  - /content/datalab/ml/census/preprocessed/features_eval-00000-of-00001
metadata_path: /content/datalab/ml/census/preprocessed/metadata.yaml
output_path: /content/datalab/ml/census/model

Check the training output.

In [3]:
!ls /content/datalab/ml/census/model

eval  logdir  model  summaries


You can start TensorBoard to view training results.

In [4]:
%tensorboard start --logdir /content/datalab/ml/census/model

Shut down the TensorBoard server when you are done with it.

In [5]:
%tensorboard stop --pid 244

### Cloud Training

Cloud training is similar, but add the "--cloud" flag and use GCS paths instead of local paths throughout. <br>
You also need a project that is whitelisted for CloudML; use "%projects set project-id" to select it.
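
Because a cloud job needs GCS paths in every path field, a quick check before submitting can catch a leftover local path. The sketch below is an assumption-laden illustration: `non_gcs_paths` and `PATH_FIELDS` are hypothetical names, not Datalab APIs, and the config dictionary is a hand-written stand-in for the cell body.

```python
# Hypothetical pre-flight check, not part of Datalab: list every path in a
# training config that is not a GCS ("gs://") path.
PATH_FIELDS = (
    "trainer_uri",
    "train_data_paths",
    "eval_data_paths",
    "metadata_path",
    "output_path",
)


def non_gcs_paths(config):
    """Return (field, path) pairs whose path does not start with gs://."""
    problems = []
    for field in PATH_FIELDS:
        value = config.get(field, [])
        # A field holds either a single path or a list of paths.
        paths = [value] if isinstance(value, str) else value
        for path in paths:
            if not path.startswith("gs://"):
                problems.append((field, path))
    return problems


# Example config with one local path left over from the local run.
config = {
    "trainer_uri": "gs://cloud-ml-test-automated-sampledata/census/model/trainer-0.1.tar.gz",
    "train_data_paths": [
        "/content/datalab/ml/census/preprocessed/features_train-00000-of-00001",
    ],
    "output_path": "gs://cloud-ml-test-automated-sampledata/census/trainedmodel",
}
for field, path in non_gcs_paths(config):
    print("%s is not a GCS path: %s" % (field, path))
```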

In [6]:
!gsutil cp /content/datalab/ml/census/trainer-0.1.tar.gz gs://cloud-ml-test-automated-sampledata/census/model/trainer-0.1.tar.gz

Copying file:///content/datalab/ml/census/trainer-0.1.tar.gz [Content-Type=application/x-tar]...
Uploading   ...mated-sampledata/census/model/trainer-0.1.tar.gz: 6.49 KiB/6.49 KiB    


Start training using the Cloud Dataflow output from the "2. Preprocessing" notebook.

In [8]:
%%ml train --cloud
trainer_uri: gs://cloud-ml-test-automated-sampledata/census/model/trainer-0.1.tar.gz
module_name: trainer.task
master_spec:
  replica_count: 1
worker_spec:
  replica_count: 1
ps_spec:
  replica_count: 1
train_data_paths:
  - gs://cloud-ml-test-automated-sampledata/census/preprocessed/features_train-00000-of-00008
  - gs://cloud-ml-test-automated-sampledata/census/preprocessed/features_train-00001-of-00008
  - gs://cloud-ml-test-automated-sampledata/census/preprocessed/features_train-00002-of-00008
  - gs://cloud-ml-test-automated-sampledata/census/preprocessed/features_train-00003-of-00008
  - gs://cloud-ml-test-automated-sampledata/census/preprocessed/features_train-00004-of-00008
  - gs://cloud-ml-test-automated-sampledata/census/preprocessed/features_train-00005-of-00008
  - gs://cloud-ml-test-automated-sampledata/census/preprocessed/features_train-00006-of-00008
  - gs://cloud-ml-test-automated-sampledata/census/preprocessed/features_train-00007-of-00008
eval_data_paths:
  - gs://cloud-ml-test-automated-sampledata/census/preprocessed/features_eval-00000-of-00009
  - gs://cloud-ml-test-automated-sampledata/census/preprocessed/features_eval-00001-of-00009
  - gs://cloud-ml-test-automated-sampledata/census/preprocessed/features_eval-00002-of-00009
  - gs://cloud-ml-test-automated-sampledata/census/preprocessed/features_eval-00003-of-00009
  - gs://cloud-ml-test-automated-sampledata/census/preprocessed/features_eval-00004-of-00009
  - gs://cloud-ml-test-automated-sampledata/census/preprocessed/features_eval-00005-of-00009
  - gs://cloud-ml-test-automated-sampledata/census/preprocessed/features_eval-00006-of-00009
  - gs://cloud-ml-test-automated-sampledata/census/preprocessed/features_eval-00007-of-00009
  - gs://cloud-ml-test-automated-sampledata/census/preprocessed/features_eval-00008-of-00009
metadata_path: gs://cloud-ml-test-automated-sampledata/census/preprocessed/metadata.yaml
output_path: gs://cloud-ml-test-automated-sampledata/census/trainedmodel
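
The shard names in the cell above follow a `-NNNNN-of-NNNNN` pattern, so the path lists can be generated rather than typed by hand. A minimal sketch: `shard_paths` is a hypothetical helper, and the shard counts (8 for training, 9 for evaluation) are taken from this particular run and may differ for yours.

```python
def shard_paths(prefix, num_shards):
    """Return paths of the form <prefix>-00000-of-0000N, one per shard."""
    return [
        "%s-%05d-of-%05d" % (prefix, index, num_shards)
        for index in range(num_shards)
    ]


base = "gs://cloud-ml-test-automated-sampledata/census/preprocessed/"
train_paths = shard_paths(base + "features_train", 8)
eval_paths = shard_paths(base + "features_eval", 9)

# Print the lists in the YAML form used by the training cell.
print("train_data_paths:")
for path in train_paths:
    print("  - " + path)
print("eval_data_paths:")
for path in eval_paths:
    print("  - " + path)
```

The printed lines can be pasted into the `train_data_paths` and `eval_data_paths` fields of the training cell.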

View the trained model:

In [10]:
!gsutil ls gs://cloud-ml-test-automated-sampledata/census/trainedmodel

gs://cloud-ml-test-automated-sampledata/census/trainedmodel/eval/
gs://cloud-ml-test-automated-sampledata/census/trainedmodel/logdir/
gs://cloud-ml-test-automated-sampledata/census/trainedmodel/model/
gs://cloud-ml-test-automated-sampledata/census/trainedmodel/summaries/
