<a name="about"></a>
About this notebook
======

This notebook assumes you have run the local Census Regression notebook and have not deleted the LOCAL_ROOT folder. Here we train a TensorFlow model using the Google Cloud Machine Learning Engine training service. This notebook does not assume that the notebook "2. Census Regression Cloud Preprocessing" was executed.

<a name="setup"></a>
Setting things up
=====

In [1]:
import datalab_structured_data as sd

Let's look at the versions of structured_data (SD) and TensorFlow (TF) we have. Make sure TF is 1.0.0 and SD is 0.0.1.

In [2]:
import os

import tensorflow as tf
from tensorflow.python.lib.io import file_io

import datalab.ml as ml

print('tf ' + str(tf.__version__))
print('sd ' + str(sd.__version__))

tf 1.0.0
sd 0.0.1
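
The version requirement above can also be enforced in code rather than checked by eye. The following is a minimal sketch, not part of the original notebook; `require_version` is a hypothetical helper name, and it compares the version strings for exact equality only.

```python
def require_version(name, actual, expected):
    """Raise if the installed version differs from the one this notebook expects.

    `require_version` is a hypothetical helper, not part of datalab or TensorFlow.
    """
    if str(actual) != expected:
        raise RuntimeError(
            '%s version is %s, but this notebook was written for %s'
            % (name, actual, expected))
    return actual

# Assumed usage in this notebook, after the imports in the cell above:
#   require_version('tf', tf.__version__, '1.0.0')
#   require_version('sd', sd.__version__, '0.0.1')
```

Failing early here is cheaper than discovering a version mismatch after a cloud job has been submitted.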


This notebook writes files during training. Specify the root folders to use: LOCAL_ROOT for the local files and CLOUD_ROOT for the GCS bucket.

In [3]:
LOCAL_ROOT = './census_regression_workspace' # This should be the same as what was used in the local census notebook
CLOUD_ROOT = 'gs://' + datalab_project_id() + '-census-regression-datalab'

if not file_io.file_exists(LOCAL_ROOT):
  raise ValueError('LOCAL_ROOT not found. Did you run the local notebook?')
!gsutil mb {CLOUD_ROOT}

Creating gs://cloud-ml-dev-census-regression-datalab/...
ServiceException: 409 Bucket cloud-ml-dev-census-regression-datalab already exists.


First, copy the CSV files and the preprocessing output to GCS.

In [4]:
!gsutil -m cp {os.path.join(LOCAL_ROOT, '*_data.csv')} {CLOUD_ROOT}
!gsutil cp {os.path.join(LOCAL_ROOT, 'schema.json')} {CLOUD_ROOT}
!gsutil cp {os.path.join(LOCAL_ROOT, 'transforms.json')} {CLOUD_ROOT}
!gsutil -m cp -r {os.path.join(LOCAL_ROOT, 'preprocess')} {CLOUD_ROOT}

Copying file://./census_regression_workspace/predict_data.csv [Content-Type=text/csv]...
Copying file://./census_regression_workspace/train_data.csv [Content-Type=text/csv]...
Copying file://./census_regression_workspace/eval_data.csv [Content-Type=text/csv]...
/ [3/3 files][200.1 KiB/200.1 KiB] 100% Done                                    
Operation completed over 3 objects/200.1 KiB.                                    
Copying file://./census_regression_workspace/schema.json [Content-Type=application/json]...
/ [1 files][  1.4 KiB/  1.4 KiB]                                                
Operation completed over 1 objects/1.4 KiB.                                      
Copying file://./census_regression_workspace/transforms.json [Content-Type=application/json]...
/ [1 files][  1.0 KiB/  1.0 KiB]                                                
Operation completed over 1 objects/1.0 KiB.                                      
Copying file://./census_regression_workspace/preprocess/schem

In [5]:
!gsutil ls {CLOUD_ROOT}/preprocess

gs://cloud-ml-dev-census-regression-datalab/preprocess/numerical_analysis.json
gs://cloud-ml-dev-census-regression-datalab/preprocess/schema.json
gs://cloud-ml-dev-census-regression-datalab/preprocess/vocab_AGEP.csv
gs://cloud-ml-dev-census-regression-datalab/preprocess/vocab_COW.csv
gs://cloud-ml-dev-census-regression-datalab/preprocess/vocab_ESP.csv
gs://cloud-ml-dev-census-regression-datalab/preprocess/vocab_ESR.csv
gs://cloud-ml-dev-census-regression-datalab/preprocess/vocab_FOD1P.csv
gs://cloud-ml-dev-census-regression-datalab/preprocess/vocab_HINS4.csv
gs://cloud-ml-dev-census-regression-datalab/preprocess/vocab_INDP.csv
gs://cloud-ml-dev-census-regression-datalab/preprocess/vocab_JWMNP.csv
gs://cloud-ml-dev-census-regression-datalab/preprocess/vocab_JWTR.csv
gs://cloud-ml-dev-census-regression-datalab/preprocess/vocab_MAR.csv
gs://cloud-ml-dev-census-regression-datalab/preprocess/vocab_POWPUMA.csv
gs://cloud-ml-dev-census-regression-datalab/preprocess/vocab_PUMA.csv

<a name="local_preprocessing"></a>
Training using the ML Engine
=====

In [6]:
!gsutil -m rm -r {CLOUD_ROOT}/training

CommandException: 1 files/objects could not be removed.


In [7]:
train_csv = ml.CsvDataSet(
  file_pattern=os.path.join(CLOUD_ROOT, 'train_data.csv'),
  schema_file=os.path.join(CLOUD_ROOT, 'schema.json'))
eval_csv = ml.CsvDataSet(
  file_pattern=os.path.join(CLOUD_ROOT, 'eval_data.csv'),
  schema_file=os.path.join(CLOUD_ROOT, 'schema.json'))
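
The cells in this notebook build GCS paths with `os.path.join`, which uses the separator of the host operating system. That produces forward slashes on a POSIX machine, but it would not on Windows. A minimal sketch of a platform-independent alternative follows; `gcs_join` is a hypothetical helper name and `gs://my-bucket` is a placeholder bucket, neither taken from the original notebook.

```python
import posixpath


def gcs_join(root, *parts):
    """Join path components under a gs:// root using forward slashes only."""
    return posixpath.join(root, *parts)


# 'gs://my-bucket' is a placeholder, not the bucket used in this notebook.
print(gcs_join('gs://my-bucket', 'preprocess', 'schema.json'))
# gs://my-bucket/preprocess/schema.json
```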

In [8]:
ctc = ml.CloudTrainingConfig(
  region='us-central1',
  scale_tier='STANDARD_1'  # See https://cloud.google.com/ml/reference/rest/v1beta1/projects.jobs#ScaleTier
  )

In [9]:
job = sd.cloud_train(
  train_dataset=train_csv,
  eval_dataset=eval_csv,
  transforms=os.path.join(CLOUD_ROOT, 'transforms.json'),
  preprocess_output_dir=os.path.join(CLOUD_ROOT, 'preprocess'),
  output_dir=os.path.join(CLOUD_ROOT, 'training'),
  model_type='dnn_regression',
  max_steps=2000,
  layer_sizes=[5, 5, 5],
  cloud_training_config=ctc,
)
job.describe()

Building package and uploading to gs://cloud-ml-dev-census-regression-datalab/training/staging/sd.tar.gz
Job request send. View status of job at
https://console.developers.google.com/ml/jobs?project=cloud-ml-dev
createTime: '2017-02-23T18:27:16Z'
jobId: structured_data_train_170223_182715
state: QUEUED
trainingInput:
  args:
  - --train_data_paths=gs://cloud-ml-dev-census-regression-datalab/train_data.csv
  - --eval_data_paths=gs://cloud-ml-dev-census-regression-datalab/eval_data.csv
  - --output_path=gs://cloud-ml-dev-census-regression-datalab/training
  - --preprocess_output_dir=gs://cloud-ml-dev-census-regression-datalab/preprocess
  - --transforms_file=gs://cloud-ml-dev-census-regression-datalab/transforms.json
  - --model_type=dnn_regression
  - --max_steps=2000
  - --train_batch_size=100
  - --eval_batch_size=100
  - --min_eval_frequency=100
  - --learning_rate=0.01
  - --epsilon=0.0005
  - --layer_size1=5
  - --layer_size2=5
  - --layer_size3=5
  packageUris:
  - gs://cloud-ml-d
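
In the job description above, the `layer_sizes=[5, 5, 5]` argument appears as three numbered flags, `--layer_size1`, `--layer_size2` and `--layer_size3`. The sketch below only illustrates that mapping; how the structured_data package builds the flags internally is not shown in this notebook, and `layer_size_flags` is a hypothetical name.

```python
def layer_size_flags(layer_sizes):
    """Expand a list of hidden-layer widths into numbered command-line flags."""
    return ['--layer_size%d=%d' % (i, size)
            for i, size in enumerate(layer_sizes, start=1)]


print(layer_size_flags([5, 5, 5]))
# ['--layer_size1=5', '--layer_size2=5', '--layer_size3=5']
```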

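When training is done, {CLOUD_ROOT}/training should contain the folders train, model, evaluation_model, etc. The listing below was captured while the job was still queued, so it shows only the staging folder.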
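
Because the job runs asynchronously, it can help to wait for it to finish before listing the output folder. The following is a minimal polling sketch under stated assumptions: `wait_for_job` is a hypothetical helper, and the way to read the current state from the `job` object depends on the installed `datalab.ml` version, so the state lookup is passed in as a callable rather than assumed.

```python
import time

# Job states after which the training service no longer changes the job.
TERMINAL_STATES = ('SUCCEEDED', 'FAILED', 'CANCELLED')


def wait_for_job(get_state, poll_seconds=30, timeout_seconds=3600,
                 sleep=time.sleep):
    """Poll get_state() until it returns a terminal state or the timeout passes.

    get_state is a hypothetical zero-argument callable that returns the job's
    current state string, for example 'QUEUED' or 'RUNNING'.
    """
    waited = 0
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_seconds:
            raise RuntimeError(
                'Job still in state %s after %d seconds' % (state, waited))
        sleep(poll_seconds)
        waited += poll_seconds
```

A return value of `SUCCEEDED` means the folders listed above should be present. Any other terminal state is a reason to check the job logs in the console first.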

In [10]:
!gsutil ls {CLOUD_ROOT}/training

gs://cloud-ml-dev-census-regression-datalab/training/staging/


Cleaning things up
=====

If you want to delete the files you made on GCS, uncomment and run the next cell.

In [11]:
#!gsutil rm -fr {CLOUD_ROOT}