Table of Contents
===

<a href="#about">About this notebook</a> <br />
<a href="#setup">Setting things up</a>

Local Experience
1. <a href="#local_preprocessing">Local preprocessing starting from csv files</a>
1. <a href="#local_training">Local training</a>
1. <a href="#local_prediction">Local prediction</a>
1. <a href="#local_batch_prediction">Local batch prediction</a>


<a name="about"></a>
About this notebook
======

This notebook uses the datalab structured data package to build and run a TensorFlow regression model locally. The data comes from the US Census Bureau's 2014 American Community Survey for the state of South Dakota.

The notebooks that follow give examples of running preprocessing, training, and prediction with the Google Cloud Machine Learning Engine services. Note that the cloud versions of preprocessing, training, and prediction take longer than the local versions. The performance advantage of the cloud applies to very large datasets; you won't see it with this sample because the data is small and the run time is dominated by setup overhead.

<a name="setup"></a>
Setting things up
=====

In [1]:
import datalab_structured_data as sd

Let's look at the versions of structured_data and TF we have. Make sure TF is 1.0.0 and SD is 0.0.1.

In [2]:
import os
import json
import numpy as np
import pandas as pd

import tensorflow as tf
from tensorflow.python.lib.io import file_io

import datalab.ml as ml

print('tf ' + str(tf.__version__))
print('sd ' + str(sd.__version__))

tf 1.0.0
sd 0.0.1


This notebook writes files during preprocessing, training, and prediction. Please specify the root folder you wish to use.

In [3]:
LOCAL_ROOT = './census_regression_workspace'
if not file_io.file_exists(LOCAL_ROOT):
  file_io.recursive_create_dir(LOCAL_ROOT)

The data comes from the <a href="http://www2.census.gov/programs-surveys/acs/data/pums/2014/1-Year/csv_psd.zip">US Census</a>, but we already have a copy in a public bucket on GCS. The raw data has many columns, and some of them are not useful. The next few cells download the data, keep the interesting columns, and split the data into training, eval, and prediction sets. Click <a href="http://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMSDataDict14.txt">here</a> for a description of each column.

In [7]:
!gsutil cp gs://cloud-ml-data/census/ss14psd.csv {LOCAL_ROOT}
csv_columns = ('RT', 'SERIALNO', 'SPORDER', 'PUMA', 'ST', 'ADJINC', 'PWGTP',
                 'AGEP', 'CIT', 'CITWP', 'COW', 'DDRS', 'DEAR', 'DEYE', 'DOUT',
                 'DPHY', 'DRAT', 'DRATX', 'DREM', 'ENG', 'FER', 'GCL', 'GCM',
                 'GCR', 'HINS1', 'HINS2', 'HINS3', 'HINS4', 'HINS5', 'HINS6',
                 'HINS7', 'INTP', 'JWMNP', 'JWRIP', 'JWTR', 'LANX', 'MAR',
                 'MARHD', 'MARHM', 'MARHT', 'MARHW', 'MARHYP', 'MIG', 'MIL',
                 'MLPA', 'MLPB', 'MLPCD', 'MLPE', 'MLPFG', 'MLPH', 'MLPI',
                 'MLPJ', 'MLPK', 'NWAB', 'NWAV', 'NWLA', 'NWLK', 'NWRE', 'OIP',
                 'PAP', 'RELP', 'RETP', 'SCH', 'SCHG', 'SCHL', 'SEMP', 'SEX',
                 'SSIP', 'SSP', 'WAGP', 'WKHP', 'WKL', 'WKW', 'WRK', 'YOEP',
                 'ANC', 'ANC1P', 'ANC2P', 'DECADE', 'DIS', 'DRIVESP', 'ESP',
                 'ESR', 'FHICOVP', 'FOD1P', 'FOD2P', 'HICOV', 'HISP', 'INDP',
                 'JWAP', 'JWDP', 'LANP', 'MIGPUMA', 'MIGSP', 'MSP', 'NAICSP',
                 'NATIVITY', 'NOP', 'OC', 'OCCP', 'PAOC', 'PERNP', 'PINCP',
                 'POBP', 'POVPIP', 'POWPUMA', 'POWSP', 'PRIVCOV', 'PUBCOV',
                 'QTRBIR', 'RAC1P', 'RAC2P', 'RAC3P', 'RACAIAN', 'RACAS',
                 'RACBLK', 'RACNH', 'RACNUM', 'RACPI', 'RACSOR', 'RACWHT',
                 'RC', 'SCIENGP', 'SCIENGRLP', 'SFN', 'SFR', 'SOCP', 'VPS',
                 'WAOB', 'FAGEP', 'FANCP', 'FCITP', 'FCITWP', 'FCOWP',
                 'FDDRSP', 'FDEARP', 'FDEYEP', 'FDISP', 'FDOUTP', 'FDPHYP',
                 'FDRATP', 'FDRATXP', 'FDREMP', 'FENGP', 'FESRP', 'FFERP',
                 'FFODP', 'FGCLP', 'FGCMP', 'FGCRP', 'FHINS1P', 'FHINS2P',
                 'FHINS3C', 'FHINS3P', 'FHINS4C', 'FHINS4P', 'FHINS5C',
                 'FHINS5P', 'FHINS6P', 'FHINS7P', 'FHISP', 'FINDP', 'FINTP',
                 'FJWDP', 'FJWMNP', 'FJWRIP', 'FJWTRP', 'FLANP', 'FLANXP',
                 'FMARHDP', 'FMARHMP', 'FMARHTP', 'FMARHWP', 'FMARHYP',
                 'FMARP', 'FMIGP', 'FMIGSP', 'FMILPP', 'FMILSP', 'FOCCP',
                 'FOIP', 'FPAP', 'FPERNP', 'FPINCP', 'FPOBP', 'FPOWSP',
                 'FPRIVCOVP', 'FPUBCOVP', 'FRACP', 'FRELP', 'FRETP', 'FSCHGP',
                 'FSCHLP', 'FSCHP', 'FSEMP', 'FSEXP', 'FSSIP', 'FSSP', 'FWAGP',
                 'FWKHP', 'FWKLP', 'FWKWP', 'FWRKP', 'FYOEP', 'pwgtp1',
                 'pwgtp2', 'pwgtp3', 'pwgtp4', 'pwgtp5', 'pwgtp6', 'pwgtp7',
                 'pwgtp8', 'pwgtp9', 'pwgtp10', 'pwgtp11', 'pwgtp12',
                 'pwgtp13', 'pwgtp14', 'pwgtp15', 'pwgtp16', 'pwgtp17',
                 'pwgtp18', 'pwgtp19', 'pwgtp20', 'pwgtp21', 'pwgtp22',
                 'pwgtp23', 'pwgtp24', 'pwgtp25', 'pwgtp26', 'pwgtp27',
                 'pwgtp28', 'pwgtp29', 'pwgtp30', 'pwgtp31', 'pwgtp32',
                 'pwgtp33', 'pwgtp34', 'pwgtp35', 'pwgtp36', 'pwgtp37',
                 'pwgtp38', 'pwgtp39', 'pwgtp40', 'pwgtp41', 'pwgtp42',
                 'pwgtp43', 'pwgtp44', 'pwgtp45', 'pwgtp46', 'pwgtp47',
                 'pwgtp48', 'pwgtp49', 'pwgtp50', 'pwgtp51', 'pwgtp52',
                 'pwgtp53', 'pwgtp54', 'pwgtp55', 'pwgtp56', 'pwgtp57',
                 'pwgtp58', 'pwgtp59', 'pwgtp60', 'pwgtp61', 'pwgtp62',
                 'pwgtp63', 'pwgtp64', 'pwgtp65', 'pwgtp66', 'pwgtp67',
                 'pwgtp68', 'pwgtp69', 'pwgtp70', 'pwgtp71', 'pwgtp72',
                 'pwgtp73', 'pwgtp74', 'pwgtp75', 'pwgtp76', 'pwgtp77',
                 'pwgtp78', 'pwgtp79', 'pwgtp80')

# If you change categorical_columns, target_column, or key_column, the training
# transforms.json file also needs to change. 
categorical_columns = ['AGEP', 'COW', 'ESP', 'ESR', 'FOD1P', 'HINS4', 'INDP',
                       'JWMNP', 'JWTR', 'MAR', 'POWPUMA', 'PUMA', 'RAC1P', 'SCHL',
                       'SCIENGRLP', 'SEX', 'WKW']
# WAGP will be the target column. Feel free to change the target to any of the other income-like columns:
# INTP, OIP, PAP, PERNP, PINCP, RETP, SEMP, SSIP, SSP, etc.
target_column = 'WAGP'
key_column = 'SERIALNO'

all_raw_data = pd.read_csv(os.path.join(LOCAL_ROOT, 'ss14psd.csv'),
                           header=None,
                           names=csv_columns,
                           dtype=str)

csv_columns_to_keep = [key_column] + [target_column] + categorical_columns

# Keep only some of the columns
all_data = all_raw_data[csv_columns_to_keep]

# Replace values containing whitespace (blank fields) with NaN
all_data = all_data.replace(r'\s+', np.nan, regex=True)

# Replace NaN with 0
all_data = all_data.fillna(0)

# Drop any rows whose target is still blank, then convert the income unit from $1 to $1000.
all_data = all_data.loc[ all_data[target_column] != ' ']
all_data[target_column] = all_data[target_column].astype(float)/1000.0

# Keep only rows whose target falls in a non-extreme range ($10,000 to $150,000).
all_data = all_data.loc[ np.logical_and(all_data[target_column] > 10.0, all_data[target_column] < 150.0) ]




Copying gs://cloud-ml-data/census/ss14psd.csv...
/ [1 files][  7.8 MiB/  7.8 MiB]                                                
Operation completed over 1 objects/7.8 MiB.                                      


In [8]:
np.random.seed(1234321)
random_numbers = np.random.rand(len(all_data))

# Split all_data into 80%, 10%, and 10% sets.
train_data = all_data[random_numbers < 0.8]
eval_data = all_data[ np.logical_and(random_numbers >= 0.8, random_numbers < 0.9)]
predict_data = all_data[random_numbers >= 0.9]

# remove target column from prediction set
del predict_data[target_column]
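
As a quick sanity check on a split like the one above, the following sketch applies the same thresholding to a synthetic DataFrame and confirms that every row lands in exactly one of the three sets. The `demo` frame and its `value` column are made up for illustration; only the seed and the thresholds come from the cell above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for all_data; the column name is hypothetical.
demo = pd.DataFrame({'value': np.arange(1000)})

np.random.seed(1234321)
random_numbers = np.random.rand(len(demo))

train_mask = random_numbers < 0.8
eval_mask = np.logical_and(random_numbers >= 0.8, random_numbers < 0.9)
predict_mask = random_numbers >= 0.9

# Count how many of the three sets each row belongs to (should always be 1).
membership = (train_mask.astype(int) + eval_mask.astype(int)
              + predict_mask.astype(int))
split_sizes = {
    'train': int(train_mask.sum()),
    'eval': int(eval_mask.sum()),
    'predict': int(predict_mask.sum()),
}
print(split_sizes)
```

Because the split is random, the set sizes are only approximately 80%, 10%, and 10% of the rows.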

A schema file describes each column of the csv files. The train, eval, and prediction csv files are assumed to share the same schema, except that the prediction file omits the target column. The schema file is a valid BigQuery table schema file, which allows BigQuery to be used later in cloud preprocessing. Only three BigQuery types are supported: STRING (for categorical columns) and INTEGER and FLOAT (for numerical columns).

In [9]:
# Save the data to a file.
train_data.to_csv(os.path.join(LOCAL_ROOT, 'train_data.csv'),
                  header=False,
                  index=False)
eval_data.to_csv(os.path.join(LOCAL_ROOT, 'eval_data.csv'),
                  header=False,
                  index=False)
predict_data.to_csv(os.path.join(LOCAL_ROOT, 'predict_data.csv'),
                  header=False,
                  index=False)

# Also write a BigQuery schema file for the csv files.
schema = (
  [
    {'name': key_column, 'type': 'STRING', 'mode': 'NULLABLE'},
    {'name': target_column, 'type': 'FLOAT', 'mode': 'NULLABLE'}
  ] + [{'name': name, 'type': 'STRING', 'mode': 'NULLABLE'} for name in categorical_columns]
)
file_io.write_string_to_file(os.path.join(LOCAL_ROOT, 'schema.json'),
                             json.dumps(schema, indent=2))
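
Since only STRING, INTEGER, and FLOAT are supported, it can help to check a schema before handing it to preprocessing. The sketch below is one possible check, not part of the package; the helper name `unsupported_columns` and the `BIRTHDAY` column are hypothetical.

```python
import json

SUPPORTED_TYPES = {'STRING', 'INTEGER', 'FLOAT'}

def unsupported_columns(schema):
    """Return the names of columns whose type is not one of the supported types."""
    return [col['name'] for col in schema
            if col.get('type', '').upper() not in SUPPORTED_TYPES]

# A small schema in the same shape as the one written above.
# BIRTHDAY is a made-up column used only to trigger the check.
example_schema = json.loads('''
[
  {"name": "SERIALNO", "type": "STRING", "mode": "NULLABLE"},
  {"name": "WAGP", "type": "FLOAT", "mode": "NULLABLE"},
  {"name": "SEX", "type": "STRING", "mode": "NULLABLE"},
  {"name": "BIRTHDAY", "type": "TIMESTAMP", "mode": "NULLABLE"}
]
''')

bad = unsupported_columns(example_schema)
print(bad)
```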

<a name="local_preprocessing"></a>
Local preprocessing starting from csv files
=====

In [10]:
!rm -fr {LOCAL_ROOT}/preprocess

In [11]:
train_csv = ml.CsvDataSet(
  file_pattern=os.path.join(LOCAL_ROOT, 'train_data.csv'),
  schema_file=os.path.join(LOCAL_ROOT, 'schema.json'))
eval_csv = ml.CsvDataSet(
  file_pattern=os.path.join(LOCAL_ROOT, 'eval_data.csv'),
  schema_file=os.path.join(LOCAL_ROOT, 'schema.json'))

In [12]:
sd.local_preprocess(
  dataset=train_csv,
  output_dir=os.path.join(LOCAL_ROOT, 'preprocess'),
)

Starting local preprocessing.
Local preprocessing done.


The output of preprocessing is a numerical_analysis file that contains analysis of the numerical columns, and a vocab file for each categorical column. The files produced by preprocessing are consumed in training, and you should not have to worry about them. Just for fun, let's look at them.

In [13]:
!ls  {LOCAL_ROOT}/preprocess

numerical_analysis.json  vocab_HINS4.csv    vocab_RAC1P.csv
schema.json		 vocab_INDP.csv     vocab_SCHL.csv
vocab_AGEP.csv		 vocab_JWMNP.csv    vocab_SCIENGRLP.csv
vocab_COW.csv		 vocab_JWTR.csv     vocab_SERIALNO.csv
vocab_ESP.csv		 vocab_MAR.csv	    vocab_SEX.csv
vocab_ESR.csv		 vocab_POWPUMA.csv  vocab_WKW.csv
vocab_FOD1P.csv		 vocab_PUMA.csv
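
Conceptually, a vocabulary for a categorical column is the set of distinct values seen in that column. The sketch below illustrates that idea with pandas on a tiny made-up sample. It is not the package's implementation, and the actual contents and format of the vocab files are not shown in this notebook; the helper `build_vocab` is hypothetical.

```python
import pandas as pd

# Tiny made-up sample; the values are illustrative, not taken from the census file.
sample = pd.DataFrame({
    'SEX': ['1', '2', '2', '1', '2'],
    'MAR': ['1', '5', '1', '3', '1'],
})

def build_vocab(frame, column):
    """Return the distinct values of a column, most frequent first."""
    counts = frame[column].value_counts()
    return list(counts.index)

vocabs = {name: build_vocab(sample, name) for name in sample.columns}
print(vocabs)
```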


<a name="local_training"></a>
Local training
===========

The files in the preprocessing output folder are consumed by the trainer. Training also requires a transform config file that describes which transforms to apply to the data. Only the key and target transforms are required; any other column not listed in the transforms gets a default transform.


In [14]:
transforms = {
  "WAGP": {"transform": "target"},
  "SERIALNO": {"transform": "key"},
  "AGEP": {"transform": "embedding", "embedding_dim": 2}, # Age
  "COW": {"transform": "one_hot"}, # Class of worker
  "ESP": {"transform": "embedding", "embedding_dim": 2}, # Employment status of parents
  "ESR": {"transform": "one_hot"}, # Employment status
  "FOD1P": {"transform": "embedding", "embedding_dim": 3}, # Field of degree
  "HINS4": {"transform": "one_hot"}, # Medicaid
  "INDP": {"transform": "embedding", "embedding_dim": 5}, # Industry
  "JWMNP": {"transform": "hash_embedding", "hash_bucket_size": 20, "embedding_dim": 2}, # Travel time to work
  "JWTR": {"transform": "one_hot"}, # Means of transportation to work
  "MAR": {"transform": "one_hot"}, # Marital status
  "POWPUMA": {"transform": "hash_one_hot", "hash_bucket_size": 4}, # Place of work
  "PUMA": {"transform": "one_hot"}, # Area code
  "RAC1P": {"transform": "one_hot"}, # Race code
  "SCHL": {"transform": "one_hot"}, # School
  "SCIENGRLP": {"transform": "one_hot"}, # Science
  "SEX": {"transform": "one_hot"},
  "WKW": {"transform": "one_hot"}, # Weeks worked
}
file_io.write_string_to_file(os.path.join(LOCAL_ROOT, 'transforms.json'),
                             json.dumps(transforms, indent=2))
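
To make the transform names above more concrete, here is a rough NumPy sketch of what a one-hot encoding and a hashed one-hot encoding compute. This is an illustration under stated assumptions, not the package's implementation: the function names are hypothetical, and MD5 is used only because it is in the standard library, so bucket assignments will not match what training produces.

```python
import hashlib
import numpy as np

def one_hot(value, vocab):
    """Encode value as a vector with a single 1 at its vocabulary index."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(value)] = 1.0
    return vec

def hash_one_hot(value, hash_bucket_size):
    """Encode value by hashing it into one of hash_bucket_size buckets."""
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    bucket = int(digest, 16) % hash_bucket_size
    vec = np.zeros(hash_bucket_size)
    vec[bucket] = 1.0
    return vec

vocab = ['1', '2', '3']        # hypothetical vocabulary for a small column
a = one_hot('2', vocab)        # length equals the vocabulary size
b = hash_one_hot('00100', 4)   # length is fixed at 4, whatever the vocabulary size
print(a, b)
```

The hashed variant trades exactness for a fixed width: distinct values can collide in the same bucket, but no vocabulary is needed. The embedding variants go one step further and learn a dense vector of `embedding_dim` values per index during training.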

In [15]:
!rm -fr {LOCAL_ROOT}/training

In [16]:
sd.local_train(
  train_dataset=train_csv,
  eval_dataset=eval_csv,
  transforms=os.path.join(LOCAL_ROOT, 'transforms.json'),
  preprocess_output_dir=os.path.join(LOCAL_ROOT, 'preprocess'),
  output_dir=os.path.join(LOCAL_ROOT, 'training'),
  model_type='dnn_regression',
  max_steps=2000,
  layer_sizes=[5, 5, 5]
)

Starting local training.

INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_keep_checkpoint_max': 5, '_tf_random_seed': None, '_task_type': None, '_environment': 'local', '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f6670658d50>, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1
}
, '_task_id': 0, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_evaluation_master': '', '_keep_checkpoint_every_n_hours': 10000, '_master': ''}
Instructions for updating:
Monitors are deprecated. Please use tf.train.SessionRunHook.
Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into ./census_regression_workspace/training/train/model.ckpt.
INFO:tensorflow:loss = 1927.57, step = 1
Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Starting evaluation at 2017-02-23-18:06:00
INFO:tensorflow:Evaluation [1/100]
INFO:tensorflow:Evaluation [2/100]
INFO:tensorflow:Evaluation [3/100]
INFO:tensorflow:Finished evaluation at 2017-02-23-18:06:01
INFO:tensorflow:Saving dict for global step 1: global_step = 1, loss = 1565.84
INFO:tensorflow:Validation (step 100): loss = 1565.84, global_step = 1
INFO:tensorflow:global_step/sec: 20.9157
INFO:tensorflow:loss = 407.338, step = 101
INFO:tensorflow:global_step/sec: 247.967
INFO:tensorflow:loss = 318.402, step = 201
INFO:tensorflow:global_step/sec: 235.649
INFO:tensorflow:loss = 291.46, step = 301
INFO:tensorflow:global_step/sec: 236.687
INFO:tensorflow:loss = 192.56, step = 401
INFO:tensorflow:global_step/sec: 240.448
INFO:tensorflow:loss = 367.718, step = 501
INFO:tensorflow:global_step/sec: 231.424
INFO:tensorflow:loss = 189.296, step = 601
INFO:tensorflow:global_step/sec: 232.624
INFO:tensorflow:loss = 219.035, step = 701
INFO:tensorflow:global_step/sec: 243.817
INFO:tensorflow:loss = 386.178, step = 801
INFO:tensorflow:global_step/sec: 224.913
INFO:tensorflow:loss = 324.573, step = 901
INFO:tensorflow:global_step/sec: 256.383
INFO:tensorflow:loss = 292.532, step = 1001
INFO:tensorflow:global_step/sec: 234.221
INFO:tensorflow:loss = 200.719, step = 1101
INFO:tensorflow:global_step/sec: 228.768
INFO:tensorflow:loss = 200.692, step = 1201
INFO:tensorflow:global_step/sec: 237.978
INFO:tensorflow:loss = 276.001, step = 1301
INFO:tensorflow:global_step/sec: 235.607
INFO:tensorflow:loss = 152.407, step = 1401
INFO:tensorflow:global_step/sec: 239.374
INFO:tensorflow:loss = 265.338, step = 1501
INFO:tensorflow:global_step/sec: 231.299
INFO:tensorflow:loss = 231.588, step = 1601
INFO:tensorflow:global_step/sec: 238.814
INFO:tensorflow:loss = 288.931, step = 1701
INFO:tensorflow:global_step/sec: 231.536
INFO:tensorflow:loss = 249.24, step = 1801
INFO:tensorflow:global_step/sec: 230.892
INFO:tensorflow:loss = 227.053, step = 1901
INFO:tensorflow:Saving checkpoints for 2000 into ./census_regression_workspace/training/train/model.ckpt.
INFO:tensorflow:Loss for final step: 149.588.
Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Starting evaluation at 2017-02-23-18:06:13
INFO:tensorflow:Evaluation [1/100]
INFO:tensorflow:Evaluation [2/100]
INFO:tensorflow:Evaluation [3/100]
INFO:tensorflow:Finished evaluation at 2017-02-23-18:06:14
INFO:tensorflow:Saving dict for global step 2000: global_step = 2000, loss = 399.458
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: ./census_regression_workspace/training/train/export/intermediate_evaluation_models/1487873178087/saved_model.pb
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: ./census_regression_workspace/training/train/export/intermediate_prediction_models/1487873181183/saved_model.pb


Local training done.


In one run of the notebook, the loss on the evaluation set was ~400 (an RMSE of ~$20,000, since the target is in units of $1000). For you the loss might be different. Let's check whether this loss is decent by comparing it with a model that always returns the mean value.

In [17]:
eval_mean = eval_data[target_column].mean()
loss = ((eval_data[target_column] -  eval_mean)*(eval_data[target_column] -  eval_mean)).mean()
print(loss)

560.563578606


In one run of the notebook, the eval loss using the mean as a prediction was ~560. Our model is better than simply guessing the mean, but not by much. Try playing with the transforms, or changing the layer sizes, learning rate, or training steps to reduce the loss. Try adding other columns from the dataset to the model, or try predicting a different income-like column. You might find that predicting income from just demographic-like information is challenging.
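
To put the two losses on the same footing, the sketch below converts them to RMSE and computes the fraction of eval-set variance the model accounts for. It assumes the reported loss is a mean squared error, as the RMSE figure quoted earlier implies; the two numbers are copied from the outputs of this run and will differ in yours.

```python
import math

model_mse = 399.458            # eval loss at step 2000 in the training log
baseline_mse = 560.563578606   # loss from always predicting the eval mean

# Targets are in units of $1000, so the RMSE values are too.
model_rmse = math.sqrt(model_mse)
baseline_rmse = math.sqrt(baseline_mse)

# Fraction of the eval-set variance that the model accounts for.
explained = 1.0 - model_mse / baseline_mse

print('model RMSE:    %.2f' % model_rmse)
print('baseline RMSE: %.2f' % baseline_rmse)
print('explained:     %.3f' % explained)
```

In this run the model's RMSE is roughly $20,000 against roughly $23,700 for the mean baseline, which is a modest improvement.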

<a name="local_prediction"></a>
Local prediction
================

Local predict uses the model produced by training. The input data can be a list of csv strings or a Pandas DataFrame, but the schema must match the dataset used for training, except that the target column is omitted. That is, if the training dataset had the values "id,target,value1,value2", the prediction data must be in the form "id,value1,value2".

In [18]:
sd.local_predict(
  training_ouput_dir=os.path.join(LOCAL_ROOT, 'training'),
  data=['490,64,2,0,1,0,2,8090,015,01,1,00590,00500,1,18,0,2,1',
        '1225,32,5,0,4,5301,2,9680,015,01,1,00100,00100,1,21,2,1,1',
        '1226,30,1,0,1,0,2,8680,020,01,1,00100,00100,1,16,0,2,1']
)

Starting local prediction.
Local prediction done.


Unnamed: 0,key_from_input,predicted_target
0,490,20.630501
1,1225,59.4538
2,1226,20.864326


In [19]:
sd.local_predict(
  training_ouput_dir=os.path.join(LOCAL_ROOT, 'training'),
  data=pd.DataFrame(
    [[490,64,2,0,1,0,2,8090,"015","01",1,"00590","00500",1,18,0,2,1],
     [1225,32,5,0,4,5301,2,9680,"015","01",1,"00100","00100",1,21,2,1,1],
     [1226,30,1,0,1,0,2,8680,"020","01",1,"00100","00100",1,16,0,2,1]])
)

Starting local prediction.
Local prediction done.


Unnamed: 0,key_from_input,predicted_target
0,490,20.630501
1,1225,59.4538
2,1226,20.864326
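
If your rows start out in the training layout, the prediction layout described above can be produced by dropping the target column and keeping the rest in order. The sketch below shows this with pandas; the frame and its column names (`id`, `target`, `value1`, `value2`) are the hypothetical ones from the description, not census columns.

```python
import pandas as pd

# Hypothetical training-style rows with columns id, target, value1, value2.
training_like = pd.DataFrame(
    [['7', 12.5, 'a', 'x'],
     ['8', 30.0, 'b', 'y']],
    columns=['id', 'target', 'value1', 'value2'])

# Drop the target and keep the remaining columns in their original order.
prediction_frame = training_like.drop('target', axis=1)

# One csv line per row, without header or index.
csv_lines = prediction_frame.to_csv(header=False, index=False).splitlines()
print(csv_lines)
```

Either `prediction_frame` or `csv_lines` is then in the shape that the two cells above pass as `data`.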


<a name="local_batch_prediction"></a>
Local batch prediction
============

Local batch prediction runs prediction on batched input data. This is ideal if the input dataset is very large or you have limited main memory. However, for very large datasets, it is better to run batch prediction using the Google Cloud Machine Learning Engine services. Two output formats are supported, csv and json, and the output may also be sharded. Batch prediction can also run in evaluation mode, which is prediction on data that contains the target column. As with local_predict, the input data must match the schema used for training.

In [20]:
!rm -fr {LOCAL_ROOT}/predict_out

In [21]:
sd.local_batch_predict(
  training_ouput_dir=os.path.join(LOCAL_ROOT, 'training'),
  prediction_input_file=os.path.join(LOCAL_ROOT, 'eval_data.csv'),
  output_dir=os.path.join(LOCAL_ROOT, 'predict_out'),
  output_format='json',
  mode='evaluation'
)

Starting local batch prediction.
Local batch prediction done.


In [22]:
!ls {LOCAL_ROOT}/predict_out

errors-00000-of-00001.txt  predictions-00000-of-00001.json


In [23]:
!cat {LOCAL_ROOT}/predict_out/errors*

In [24]:
!head {LOCAL_ROOT}/predict_out/predictions-00000*

{"target_from_input": 40.0,"predicted_target": 26.240245819091797,"key_from_input": "11735"}
{"target_from_input": 20.0,"predicted_target": 35.266021728515625,"key_from_input": "12794"}
{"target_from_input": 70.0,"predicted_target": 34.8465461730957,"key_from_input": "23402"}
{"target_from_input": 45.0,"predicted_target": 54.92388153076172,"key_from_input": "30351"}
{"target_from_input": 32.0,"predicted_target": 45.367523193359375,"key_from_input": "30470"}
{"target_from_input": 17.0,"predicted_target": 35.007110595703125,"key_from_input": "32995"}
{"target_from_input": 39.0,"predicted_target": 27.419919967651367,"key_from_input": "40409"}
{"target_from_input": 35.0,"predicted_target": 36.686256408691406,"key_from_input": "49363"}
{"target_from_input": 38.0,"predicted_target": 57.945594787597656,"key_from_input": "54409"}
{"target_from_input": 28.0,"predicted_target": 34.59382247924805,"key_from_input": "68049"}
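
Because evaluation mode writes both the true and the predicted target, the output can be scored directly. The sketch below parses JSON lines in the format shown above and computes an RMSE; for brevity it embeds only the first three rows of the output rather than reading the file, so the resulting number describes those three rows and not the whole eval set.

```python
import json
import math

# First three rows of the predictions file shown above.
json_lines = '''\
{"target_from_input": 40.0,"predicted_target": 26.240245819091797,"key_from_input": "11735"}
{"target_from_input": 20.0,"predicted_target": 35.266021728515625,"key_from_input": "12794"}
{"target_from_input": 70.0,"predicted_target": 34.8465461730957,"key_from_input": "23402"}
'''

# Each non-empty line is a standalone JSON object.
records = [json.loads(line) for line in json_lines.splitlines() if line.strip()]

squared_errors = [(r['target_from_input'] - r['predicted_target']) ** 2
                  for r in records]
mse = sum(squared_errors) / len(squared_errors)
rmse = math.sqrt(mse)
print('rows: %d, RMSE: %.2f' % (len(records), rmse))
```

To score the full output, replace the embedded string with the lines read from the `predictions-*.json` files.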


Let us run batch prediction again, this time using data that does not have a target column.

In [5]:
!rm -fr {LOCAL_ROOT}/predict_out

In [11]:
sd.local_batch_predict(
  training_ouput_dir=os.path.join(LOCAL_ROOT, 'training'),
  prediction_input_file=os.path.join(LOCAL_ROOT, 'predict_data.*'),
  output_dir=os.path.join(LOCAL_ROOT, 'predict_out'),
  output_format='csv',
  mode='prediction'
)

Starting local batch prediction.
Local batch prediction done.


In [8]:
!ls {LOCAL_ROOT}/predict_out

csv_header.json  errors-00000-of-00001.txt  predictions-00000-of-00001.csv


In [9]:
!cat {LOCAL_ROOT}/predict_out/csv_header.json

[
  {
    "type": "STRING", 
    "mode": "NULLABLE", 
    "name": "key_from_input"
  }, 
  {
    "type": "STRING", 
    "mode": "NULLABLE", 
    "name": "predicted_target"
  }
]


In [10]:
!head {LOCAL_ROOT}/predict_out/predictions-00000*

2892,55.0839347839
3137,57.2630462646
10412,30.8309402466
14981,8.9282541275
32166,12.1347923279
32166,28.7190818787
32881,50.8943595886
36459,25.4952888489
37563,38.2368469238
42471,24.4276676178
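
The csv output has no header row, so the column names live in csv_header.json. One way to combine the two with pandas is sketched below; it embeds a few lines of the outputs shown above instead of reading the files, and it reads the key as a string while letting pandas parse the prediction as a number.

```python
import io
import json
import pandas as pd

# Contents copied from the csv_header.json and predictions output shown above.
header_json = '''
[
  {"type": "STRING", "mode": "NULLABLE", "name": "key_from_input"},
  {"type": "STRING", "mode": "NULLABLE", "name": "predicted_target"}
]
'''
predictions_csv = '''\
2892,55.0839347839
3137,57.2630462646
10412,30.8309402466
'''

# Take the column names from the header file, in order.
column_names = [field['name'] for field in json.loads(header_json)]

predictions = pd.read_csv(io.StringIO(predictions_csv),
                          header=None,
                          names=column_names,
                          dtype={'key_from_input': str})
print(predictions)
```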


Cleaning Things up
-------

As everything was written to LOCAL_ROOT, we can simply remove this folder. If you want to delete those files, uncomment and run the next cell.

In [None]:
#!rm -fr {LOCAL_ROOT}