## Training

Training requires a Python package, distributed as a tarball, that contains your TensorFlow-based training program. Although CloudML offers several kinds of general-purpose model training, for this sample we will use a package created specifically to train the Census model.
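Before submitting a package, it can help to confirm that the module you plan to name in "python_module" is actually inside the tarball. The sketch below lists the `.py` files in a gzipped tarball using only the standard library. The demo archive and its layout (`trainer-0.3/trainer/task.py` and so on) are hypothetical stand-ins built in memory; the real trainer package may be organized differently.

```python
import io
import tarfile


def list_package_modules(tarball):
    """Return the sorted names of the .py files inside a gzipped tarball.

    `tarball` is a binary file-like object, e.g. open(path, "rb").
    """
    with tarfile.open(fileobj=tarball, mode="r:gz") as archive:
        return sorted(member.name for member in archive.getmembers()
                      if member.isfile() and member.name.endswith(".py"))


def make_demo_package():
    """Build a tiny in-memory tarball with a hypothetical layout."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name in ("trainer-0.3/setup.py",
                     "trainer-0.3/trainer/__init__.py",
                     "trainer-0.3/trainer/task.py"):
            data = b"# placeholder\n"
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


modules = list_package_modules(make_demo_package())
print(modules)
```

To inspect a real package, pass `open(path, "rb")` for the downloaded file instead of the demo archive.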

### Local Training

First, copy the package to the local file system.

In [5]:
!gsutil cp gs://cloud-datalab/sampledata/ml/census/trainer-0.3.tar.gz /content/datalab/tmp/ml/census/

Copying gs://cloud-datalab/sampledata/ml/census/trainer-0.3.tar.gz...
Downloading ...content/datalab/tmp/ml/census/trainer-0.3.tar.gz: 6.36 KiB/6.36 KiB    


Run "%ml train" to generate the training cell template.

In [None]:
%%ml train

Fill in the required fields and run the cell. <br>
Datalab simulates the CloudML service by creating master, worker, and ps processes to perform distributed training. In the cloud these run on different VMs, but here all of them run in the local container VM.<br>
You can set replica_count to 0 to disable a job type, such as ps, but master is required. In this case, we only enable master.<br>
The training output consists of links to each process's output log and is refreshed every 3 seconds to show the last few lines of the logs. You can use a local run to quickly validate your training program and parameters before submitting the job to the cloud for large-scale training.<br>
If the training gets stuck for any reason, click "Reset Session" to reset the kernel. All training processes will be cleaned up.<br><br>

Note that we replaced "scale_tier: BASIC" with "scale_tier: CUSTOM" and set "worker_count" and "parameter_server_count" explicitly.

In [6]:
%ml train
package_uris: /content/datalab/tmp/ml/census/trainer-0.3.tar.gz
python_module: trainer.task
scale_tier: CUSTOM
worker_count: 1
parameter_server_count: 1
args:
  train_data_paths:
    - /content/datalab/tmp/ml/census/preprocessed/features_train
  eval_data_paths:
    - /content/datalab/tmp/ml/census/preprocessed/features_eval
  metadata_path: /content/datalab/tmp/ml/census/preprocessed/metadata.yaml
  output_path: /content/datalab/tmp/ml/census/model
  hidden1: 100
  hidden2: 60
  hidden3: 30
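To make the master, worker, and ps roles in this configuration more concrete, here is a minimal sketch of how replica counts could map to a cluster description in the `TF_CONFIG` style that distributed TensorFlow programs commonly read. This is an illustration only: the notebook does not say how Datalab wires the processes together, and the helper name `build_tf_config`, the `localhost` addresses, and the port numbers are all assumptions.

```python
import json


def build_tf_config(worker_count, ps_count, task_type="master", task_index=0,
                    base_port=2222):
    """Sketch a TF_CONFIG-style dict for one master plus optional workers and ps.

    Hostnames and ports are hypothetical; every process is placed on localhost
    to mirror a local run.
    """
    counts = [("master", 1), ("worker", worker_count), ("ps", ps_count)]
    cluster = {}
    port = base_port
    for job, count in counts:
        if count == 0:
            continue  # a count of 0 leaves that job type out of the cluster
        cluster[job] = ["localhost:%d" % (port + i) for i in range(count)]
        port += count
    if task_type not in cluster or task_index >= len(cluster[task_type]):
        raise ValueError("no such task: %s/%d" % (task_type, task_index))
    return {"cluster": cluster,
            "task": {"type": task_type, "index": task_index}}


# One master, one worker, one parameter server, as in the cell above.
config = build_tf_config(worker_count=1, ps_count=1)
print(json.dumps(config, indent=2, sort_keys=True))
```

Passing `ps_count=0` drops the `ps` entry entirely, which corresponds to disabling that job type.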

Check the training output.

In [7]:
!ls /content/datalab/tmp/ml/census/model

eval  logdir  model  summaries
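The same check can be scripted, which is handy when validating several runs. The sketch below reports which of the four entries shown in the listing above are missing from an output directory. The helper name `missing_outputs` is made up for this example, and the demo runs against a temporary directory rather than the real model path.

```python
import os
import tempfile


# Entry names taken from the `ls` output above.
EXPECTED_OUTPUTS = ("eval", "logdir", "model", "summaries")


def missing_outputs(output_path, expected=EXPECTED_OUTPUTS):
    """Return the expected entries that are absent from output_path."""
    if os.path.isdir(output_path):
        present = set(os.listdir(output_path))
    else:
        present = set()  # a missing directory means everything is missing
    return [name for name in expected if name not in present]


# Demonstrate on a throwaway directory that lacks "summaries".
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("eval", "logdir", "model"):
        os.mkdir(os.path.join(demo_dir, name))
    demo_missing = missing_outputs(demo_dir)

print(demo_missing)
```

For the local run above, you would call `missing_outputs('/content/datalab/tmp/ml/census/model')` and expect an empty list.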


You can start TensorBoard to view training results.

In [12]:
%tensorboard start --logdir /content/datalab/tmp/ml/census/model

Shut down the TensorBoard server when you are done with it.

In [13]:
%tensorboard stop --pid 19769

Let's train another model with larger hidden layer sizes.

In [8]:
%ml train
package_uris: /content/datalab/tmp/ml/census/trainer-0.3.tar.gz
python_module: trainer.task
scale_tier: CUSTOM
worker_count: 1
parameter_server_count: 1
args:
  train_data_paths:
    - /content/datalab/tmp/ml/census/preprocessed/features_train
  eval_data_paths:
    - /content/datalab/tmp/ml/census/preprocessed/features_eval
  metadata_path: /content/datalab/tmp/ml/census/preprocessed/metadata.yaml
  output_path: /content/datalab/tmp/ml/census/largermodel
  hidden1: 200
  hidden2: 100
  hidden3: 50
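To get a rough feel for how much larger this second model is, the sketch below counts weights and biases for the two sets of hidden layer sizes. It assumes a plain fully connected network, which the trainer package may or may not use, and the input and output sizes are hypothetical placeholders because the notebook does not state them.

```python
def dense_parameter_count(input_size, hidden_sizes, output_size):
    """Count weights and biases in a fully connected network."""
    sizes = [input_size] + list(hidden_sizes) + [output_size]
    # Each layer has fan_in * fan_out weights plus fan_out biases.
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes[:-1], sizes[1:]))


INPUT_SIZE = 50   # hypothetical number of input features
OUTPUT_SIZE = 2   # hypothetical number of output units

small = dense_parameter_count(INPUT_SIZE, [100, 60, 30], OUTPUT_SIZE)
large = dense_parameter_count(INPUT_SIZE, [200, 100, 50], OUTPUT_SIZE)
print("small: %d parameters, large: %d parameters" % (small, large))
```

Under these assumptions the larger configuration has more than twice as many parameters, so it may take longer to train and is not guaranteed to evaluate better.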

### Cloud Training

Cloud training is similar, but you add the "--cloud" flag and use GCS paths instead of local paths. <br>
You also need a project that is whitelisted for CloudML; use "%projects set project-id" to select it.

Define variables that will be used later.

In [2]:
import os

bucket = 'gs://' + datalab_project_id() + '-sampledata'
package_path = os.path.join(bucket, 'census', 'model', 'trainer-0.3.tar.gz')
train_data_path = os.path.join(bucket, 'census', 'preprocessed', 'features_train')
eval_data_path = os.path.join(bucket, 'census', 'preprocessed', 'features_eval')
metadata_path = os.path.join(bucket, 'census', 'preprocessed', 'metadata.yaml')
output_path = os.path.join(bucket, 'census', 'trained')
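One caveat about the cell above: `os.path.join` uses the host's path separator, so it only produces valid GCS paths on systems where that separator is `/`, as in the Datalab container. A more portable sketch uses `posixpath.join`, which always joins with forward slashes. The helper name `gcs_join` and the bucket `gs://my-project-sampledata` are hypothetical.

```python
import posixpath


def gcs_join(bucket, *parts):
    """Join path components under a gs:// bucket using forward slashes only."""
    if not bucket.startswith("gs://"):
        raise ValueError("expected a gs:// URL, got %r" % (bucket,))
    return posixpath.join(bucket, *parts)


demo_bucket = "gs://my-project-sampledata"  # hypothetical bucket name
demo_package = gcs_join(demo_bucket, "census", "model", "trainer-0.3.tar.gz")
demo_output = gcs_join(demo_bucket, "census", "trained")
print(demo_package)
print(demo_output)
```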

In [4]:
!gsutil cp gs://cloud-datalab/sampledata/ml/census/trainer-0.3.tar.gz $package_path

Copying gs://cloud-datalab/sampledata/ml/census/trainer-0.3.tar.gz [Content-Type=application/gzip]...
Copying     ...mated-sampledata/census/model/trainer-0.3.tar.gz: 6.36 KiB/6.36 KiB    


Start training using the Cloud Dataflow output from the "2. Preprocessing" notebook. Here we choose one set of hidden layer sizes; later we will show how to sweep hyperparameter values with the CloudML hyperparameter tuning feature.

In [17]:
%ml train --cloud
package_uris: $package_path
python_module: trainer.task
scale_tier: BASIC
region: us-west1
args:
  train_data_paths:
    - $train_data_path
  eval_data_paths:
    - $eval_data_path
  metadata_path: $metadata_path
  output_path: $output_path
  hidden1: 200
  hidden2: 100
  hidden3: 50    

View the job status as described in the output. You can also run "%ml jobs --filter state!=SUCCEEDED" to see all active ML jobs in that project.

In [23]:
%ml jobs --name trainer_task_160901_200438

View the trained model:

In [26]:
!gsutil ls gs://cloud-ml-test-automated-sampledata/census/trained

gs://cloud-ml-test-automated-sampledata/census/trained/eval/
gs://cloud-ml-test-automated-sampledata/census/trained/logdir/
gs://cloud-ml-test-automated-sampledata/census/trained/model/
gs://cloud-ml-test-automated-sampledata/census/trained/summaries/
