# Census Regression


# About this notebook

This notebook uses the Datalab structured data package to build and run a TensorFlow regression model locally.

The notebooks that follow give examples of running preprocessing, training, and prediction using the Google Cloud Machine Learning Engine services. Note that running the cloud versions of preprocessing, training, and prediction takes longer than running the local versions. The performance advantage of the cloud applies to very large data sets; you don't see it with this sample because the data is small and the run time is dominated by setup overhead.

# Setting things up

In [1]:
import mltoolbox.regression.dnn as sd

Let's look at the versions of structured_data and TF we have. Both should be 1.0.0.

In [2]:
import os
import json
import numpy as np
import pandas as pd

import tensorflow as tf
from tensorflow.python.lib.io import file_io

import datalab.ml as ml

print('tf ' + str(tf.__version__))
print('sd ' + str(sd.__version__))

tf 1.0.0
sd 1.0.0


This notebook will write files during preprocessing, training, and prediction. Please give a root folder you wish to use.

In [3]:
# Edit LOCAL_ROOT if you want to save files to a different location.
# Otherwise, don't change anything in this cell. If the folder
# already exists, it will be deleted.
LOCAL_ROOT = './census_regression_workspace'

# No need to edit anything else in this cell. But if you do, you
# might need to change the global variables in the cloud notebooks.
LOCAL_PREPROCESSING_DIR = os.path.join(LOCAL_ROOT, 'preprocessing')
LOCAL_TRAINING_DIR = os.path.join(LOCAL_ROOT, 'training')
LOCAL_BATCH_PREDICTION_DIR = os.path.join(LOCAL_ROOT, 'batch_prediction')

LOCAL_TRAIN_FILE = os.path.join(LOCAL_ROOT, 'train.csv')
LOCAL_EVAL_FILE = os.path.join(LOCAL_ROOT, 'eval.csv')
LOCAL_PREDICT_FILE = os.path.join(LOCAL_ROOT, 'predict.csv')

LOCAL_SCHEMA_FILE = os.path.join(LOCAL_ROOT, 'schema.json')
LOCAL_FEATURES_FILE = os.path.join(LOCAL_ROOT, 'features.json')

if file_io.file_exists(LOCAL_ROOT):
  file_io.delete_recursively(LOCAL_ROOT)  
  
file_io.recursive_create_dir(LOCAL_ROOT)  

The data comes from the <a href="http://www2.census.gov/programs-surveys/acs/data/pums/2014/1-Year/csv_psd.zip">US Census</a>, but we already have a copy in a public bucket on GCS. The raw data has many columns, and some of them are not useful. The next few cells download the data, keep the interesting columns, and split the data into training, eval, and prediction sets. Click <a href="http://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMSDataDict14.txt">here</a> for a description of each column.

In [4]:
!gsutil cp gs://cloud-ml-data/census/ss14psd.csv {LOCAL_ROOT}
csv_columns = ('RT', 'SERIALNO', 'SPORDER', 'PUMA', 'ST', 'ADJINC', 'PWGTP',
                 'AGEP', 'CIT', 'CITWP', 'COW', 'DDRS', 'DEAR', 'DEYE', 'DOUT',
                 'DPHY', 'DRAT', 'DRATX', 'DREM', 'ENG', 'FER', 'GCL', 'GCM',
                 'GCR', 'HINS1', 'HINS2', 'HINS3', 'HINS4', 'HINS5', 'HINS6',
                 'HINS7', 'INTP', 'JWMNP', 'JWRIP', 'JWTR', 'LANX', 'MAR',
                 'MARHD', 'MARHM', 'MARHT', 'MARHW', 'MARHYP', 'MIG', 'MIL',
                 'MLPA', 'MLPB', 'MLPCD', 'MLPE', 'MLPFG', 'MLPH', 'MLPI',
                 'MLPJ', 'MLPK', 'NWAB', 'NWAV', 'NWLA', 'NWLK', 'NWRE', 'OIP',
                 'PAP', 'RELP', 'RETP', 'SCH', 'SCHG', 'SCHL', 'SEMP', 'SEX',
                 'SSIP', 'SSP', 'WAGP', 'WKHP', 'WKL', 'WKW', 'WRK', 'YOEP',
                 'ANC', 'ANC1P', 'ANC2P', 'DECADE', 'DIS', 'DRIVESP', 'ESP',
                 'ESR', 'FHICOVP', 'FOD1P', 'FOD2P', 'HICOV', 'HISP', 'INDP',
                 'JWAP', 'JWDP', 'LANP', 'MIGPUMA', 'MIGSP', 'MSP', 'NAICSP',
                 'NATIVITY', 'NOP', 'OC', 'OCCP', 'PAOC', 'PERNP', 'PINCP',
                 'POBP', 'POVPIP', 'POWPUMA', 'POWSP', 'PRIVCOV', 'PUBCOV',
                 'QTRBIR', 'RAC1P', 'RAC2P', 'RAC3P', 'RACAIAN', 'RACAS',
                 'RACBLK', 'RACNH', 'RACNUM', 'RACPI', 'RACSOR', 'RACWHT',
                 'RC', 'SCIENGP', 'SCIENGRLP', 'SFN', 'SFR', 'SOCP', 'VPS',
                 'WAOB', 'FAGEP', 'FANCP', 'FCITP', 'FCITWP', 'FCOWP',
                 'FDDRSP', 'FDEARP', 'FDEYEP', 'FDISP', 'FDOUTP', 'FDPHYP',
                 'FDRATP', 'FDRATXP', 'FDREMP', 'FENGP', 'FESRP', 'FFERP',
                 'FFODP', 'FGCLP', 'FGCMP', 'FGCRP', 'FHINS1P', 'FHINS2P',
                 'FHINS3C', 'FHINS3P', 'FHINS4C', 'FHINS4P', 'FHINS5C',
                 'FHINS5P', 'FHINS6P', 'FHINS7P', 'FHISP', 'FINDP', 'FINTP',
                 'FJWDP', 'FJWMNP', 'FJWRIP', 'FJWTRP', 'FLANP', 'FLANXP',
                 'FMARHDP', 'FMARHMP', 'FMARHTP', 'FMARHWP', 'FMARHYP',
                 'FMARP', 'FMIGP', 'FMIGSP', 'FMILPP', 'FMILSP', 'FOCCP',
                 'FOIP', 'FPAP', 'FPERNP', 'FPINCP', 'FPOBP', 'FPOWSP',
                 'FPRIVCOVP', 'FPUBCOVP', 'FRACP', 'FRELP', 'FRETP', 'FSCHGP',
                 'FSCHLP', 'FSCHP', 'FSEMP', 'FSEXP', 'FSSIP', 'FSSP', 'FWAGP',
                 'FWKHP', 'FWKLP', 'FWKWP', 'FWRKP', 'FYOEP', 'pwgtp1',
                 'pwgtp2', 'pwgtp3', 'pwgtp4', 'pwgtp5', 'pwgtp6', 'pwgtp7',
                 'pwgtp8', 'pwgtp9', 'pwgtp10', 'pwgtp11', 'pwgtp12',
                 'pwgtp13', 'pwgtp14', 'pwgtp15', 'pwgtp16', 'pwgtp17',
                 'pwgtp18', 'pwgtp19', 'pwgtp20', 'pwgtp21', 'pwgtp22',
                 'pwgtp23', 'pwgtp24', 'pwgtp25', 'pwgtp26', 'pwgtp27',
                 'pwgtp28', 'pwgtp29', 'pwgtp30', 'pwgtp31', 'pwgtp32',
                 'pwgtp33', 'pwgtp34', 'pwgtp35', 'pwgtp36', 'pwgtp37',
                 'pwgtp38', 'pwgtp39', 'pwgtp40', 'pwgtp41', 'pwgtp42',
                 'pwgtp43', 'pwgtp44', 'pwgtp45', 'pwgtp46', 'pwgtp47',
                 'pwgtp48', 'pwgtp49', 'pwgtp50', 'pwgtp51', 'pwgtp52',
                 'pwgtp53', 'pwgtp54', 'pwgtp55', 'pwgtp56', 'pwgtp57',
                 'pwgtp58', 'pwgtp59', 'pwgtp60', 'pwgtp61', 'pwgtp62',
                 'pwgtp63', 'pwgtp64', 'pwgtp65', 'pwgtp66', 'pwgtp67',
                 'pwgtp68', 'pwgtp69', 'pwgtp70', 'pwgtp71', 'pwgtp72',
                 'pwgtp73', 'pwgtp74', 'pwgtp75', 'pwgtp76', 'pwgtp77',
                 'pwgtp78', 'pwgtp79', 'pwgtp80')

# If you change categorical_columns, target_column, or key_column, the training
# transforms.json file also needs to change. 
categorical_columns = ['AGEP', 'COW', 'ESP', 'ESR', 'FOD1P', 'HINS4', 'INDP',
                       'JWMNP', 'JWTR', 'MAR', 'POWPUMA', 'PUMA', 'RAC1P', 'SCHL',
                       'SCIENGRLP', 'SEX', 'WKW']
# WAGP will be the target column. Feel free to change the target to any of the other income-like columns:
# INTP, OIP, PAP, PERNP, PINCP, RETP, SEMP, SSIP, SSP, etc.
target_column = 'WAGP'
key_column = 'SERIALNO'

all_raw_data = pd.read_csv(os.path.join(LOCAL_ROOT, 'ss14psd.csv'),
                           header=None,
                           names=csv_columns,
                           dtype=str)

csv_columns_to_keep = [key_column] + [target_column] + categorical_columns

# Keep only some of the columns
all_data = all_raw_data[csv_columns_to_keep]

# Replace whitespace with NaN (raw string avoids an invalid-escape warning)
all_data = all_data.replace(r'\s+', np.nan, regex=True)

# Replace NaN with 0
all_data = all_data.fillna(0)

# Drop any rows whose target is still blank, then convert the income unit from $1 to $1000.
all_data = all_data.loc[ all_data[target_column] != ' ']
all_data[target_column] = all_data[target_column].astype(float)/1000.0

# Keep rows with a non-extreme target range.
all_data = all_data.loc[ np.logical_and(all_data[target_column] > 10.0, all_data[target_column] < 150.0) ]


Copying gs://cloud-ml-data/census/ss14psd.csv...
/ [1 files][  7.8 MiB/  7.8 MiB]                                                
Operation completed over 1 objects/7.8 MiB.                                      


In [5]:
np.random.seed(1234321)
random_numbers = np.random.rand(len(all_data))

# Split all_data into 80%, 10%, and 10% sets.
train_data = all_data[random_numbers < 0.8]
eval_data = all_data[ np.logical_and(random_numbers >= 0.8, random_numbers < 0.9)]
predict_data = all_data[random_numbers >= 0.9]

# remove target column from prediction set
del predict_data[target_column]
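
The split above draws one uniform random number per row and compares it against fixed thresholds. As a rough illustration of that idea, here is a minimal sketch that uses only the standard library `random` module instead of NumPy. The helper name `split_rows` and the made-up rows are assumptions for this example, not part of the notebook.

```python
import random


def split_rows(rows, seed=1234321, train_frac=0.8, eval_frac=0.1):
    """Assign each row to train/eval/predict using one uniform draw per row."""
    rng = random.Random(seed)
    train, evaluation, predict = [], [], []
    for row in rows:
        r = rng.random()  # uniform in [0, 1)
        if r < train_frac:
            train.append(row)
        elif r < train_frac + eval_frac:
            evaluation.append(row)
        else:
            predict.append(row)
    return train, evaluation, predict


# Hypothetical data: 10,000 row ids standing in for the census rows.
rows = list(range(10000))
train, evaluation, predict = split_rows(rows)
print(len(train), len(evaluation), len(predict))
```

Because every row gets its own draw, the three sets are disjoint and together cover all rows, but their sizes are only approximately 80%, 10%, and 10%.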

A schema file is used to describe each column of the csv files. It is assumed that the train, eval, and prediction csv files all have the same schema, except that the prediction file omits the target column. The schema file is a valid BigQuery table schema file, which allows BigQuery to be used later in cloud preprocessing. Only three BigQuery types are supported: STRING (for categorical columns) and INTEGER and FLOAT (for numerical columns).
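
If you edit the schema by hand, it may help to check it against the three supported types before running preprocessing. The following is only a sketch: `check_schema` is a hypothetical helper, and it assumes type names are written in upper case as they are in this notebook.

```python
import json

# The three BigQuery types this notebook says are supported.
SUPPORTED_TYPES = {'STRING', 'INTEGER', 'FLOAT'}


def check_schema(schema):
    """Return a list of problems found in a BigQuery-style schema list."""
    problems = []
    seen = set()
    for i, column in enumerate(schema):
        name = column.get('name')
        col_type = column.get('type')
        if not name:
            problems.append('column %d has no name' % i)
        elif name in seen:
            problems.append('duplicate column name: %s' % name)
        else:
            seen.add(name)
        if col_type not in SUPPORTED_TYPES:
            problems.append('column %s has unsupported type %r' % (name, col_type))
    return problems


example_schema = json.loads('''[
  {"name": "SERIALNO", "type": "STRING"},
  {"name": "WAGP", "type": "FLOAT"},
  {"name": "AGEP", "type": "STRING"}
]''')
# A deliberately wrong schema, for illustration only.
bad_schema = example_schema + [{"name": "SEX", "type": "BOOLEAN"}]

print(check_schema(example_schema))
print(check_schema(bad_schema))
```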

In [6]:
# Save the data to a file.
train_data.to_csv(LOCAL_TRAIN_FILE,
                  header=False,
                  index=False)
eval_data.to_csv(LOCAL_EVAL_FILE,
                 header=False,
                 index=False)
predict_data.to_csv(LOCAL_PREDICT_FILE,
                    header=False,
                    index=False)

# Also write a BigQuery schema file for the csv files.
schema = (
  [
    {'name': key_column, 'type': 'STRING'},
    {'name': target_column, 'type': 'FLOAT'}
  ] + [{'name': name, 'type': 'STRING'} for name in categorical_columns]
)
file_io.write_string_to_file(LOCAL_SCHEMA_FILE,
                             json.dumps(schema, indent=2))

# Local preprocessing starting from csv files

In [7]:
!rm -fr {LOCAL_PREPROCESSING_DIR}

In [8]:
train_csv = ml.CsvDataSet(
  file_pattern=LOCAL_TRAIN_FILE,
  schema_file=LOCAL_SCHEMA_FILE)
eval_csv = ml.CsvDataSet(
  file_pattern=LOCAL_EVAL_FILE,
  schema_file=LOCAL_SCHEMA_FILE)

In [9]:
job = sd.analyze(
  cloud=False,
  dataset=train_csv,
  output_dir=LOCAL_PREPROCESSING_DIR,
)
job.wait()

Job 3af9dcdf-965c-41f2-b10a-7ee6b0030e24 completed

The output of preprocessing is a numerical_analysis file that contains an analysis of the numerical columns, and a vocab file for each categorical column. The files produced by preprocessing are consumed in training, and you should not have to worry about them. Just for fun, let's look at them.

In [10]:
!ls {LOCAL_PREPROCESSING_DIR}

numerical_analysis.json  vocab_HINS4.csv    vocab_RAC1P.csv
schema.json		 vocab_INDP.csv     vocab_SCHL.csv
vocab_AGEP.csv		 vocab_JWMNP.csv    vocab_SCIENGRLP.csv
vocab_COW.csv		 vocab_JWTR.csv     vocab_SERIALNO.csv
vocab_ESP.csv		 vocab_MAR.csv	    vocab_SEX.csv
vocab_ESR.csv		 vocab_POWPUMA.csv  vocab_WKW.csv
vocab_FOD1P.csv		 vocab_PUMA.csv
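
To get a feel for what a vocabulary is for, here is a conceptual sketch of building one from a categorical column and using it for one-hot encoding. This is not the package's implementation: the actual vocab file format and the handling of unseen values may differ, and `build_vocab`, `one_hot`, and the sample codes are made up for illustration.

```python
def build_vocab(values):
    """Return the distinct values of a column in first-seen order."""
    vocab = []
    seen = set()
    for v in values:
        if v not in seen:
            seen.add(v)
            vocab.append(v)
    return vocab


def one_hot(value, vocab):
    """Encode value as a 0/1 vector; a value not in vocab maps to all zeros."""
    return [1 if value == v else 0 for v in vocab]


# Made-up MAR (marital status) codes, kept as strings like the csv data.
mar_column = ['1', '5', '1', '3', '5']
vocab = build_vocab(mar_column)
print(vocab)
print(one_hot('5', vocab))
```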


# Local Training

The files in the output folder of preprocessing are consumed by the trainer. Training requires a transform config that describes which transforms to apply to the data. The key and target transforms are the only required ones; a default transform is applied to every other column that is not listed.
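
Before training, it can be useful to sanity-check a transform config against the rule above. This is a sketch under assumptions: `check_features` is a hypothetical helper, and treating "required" as "exactly one key and exactly one target" is our reading, not something the package documents here.

```python
def check_features(features, column_names):
    """Return a list of problems with a transform config dict."""
    problems = []
    # Assumption: exactly one column carries each required transform.
    for required in ('key', 'target'):
        matches = [name for name, spec in features.items()
                   if spec.get('transform') == required]
        if len(matches) != 1:
            problems.append('expected exactly one %s transform, found %d'
                            % (required, len(matches)))
    # Every configured column should exist in the schema.
    for name in features:
        if name not in column_names:
            problems.append('unknown column: %s' % name)
    return problems


columns = ['SERIALNO', 'WAGP', 'AGEP', 'SEX']
good = {
    'SERIALNO': {'transform': 'key'},
    'WAGP': {'transform': 'target'},
    'AGEP': {'transform': 'embedding', 'embedding_dim': 2},
    # SEX is not listed, so it would get the default transform.
}
missing_key = {'WAGP': {'transform': 'target'}}

print(check_features(good, columns))
print(check_features(missing_key, columns))
```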


In [11]:
features = {
  "WAGP": {"transform": "target"},
  "SERIALNO": {"transform": "key"},
  "AGEP": {"transform": "embedding", "embedding_dim": 2},  # Age
  "COW": {"transform": "one_hot"},  # Class of worker
  "ESP": {"transform": "embedding", "embedding_dim": 2},  # Employment status of parents
  "ESR": {"transform": "one_hot"},  # Employment status
  "FOD1P": {"transform": "embedding", "embedding_dim": 3},  # Field of degree
  "HINS4": {"transform": "one_hot"},  # Medicaid
  "INDP": {"transform": "embedding", "embedding_dim": 5},  # Industry
  "JWMNP": {"transform": "embedding", "embedding_dim": 2},  # Travel time to work
  "JWTR": {"transform": "one_hot"},  # Means of transportation to work
  "MAR": {"transform": "one_hot"},  # Marital status
  "POWPUMA": {"transform": "one_hot"},  # Place of work
  "PUMA": {"transform": "one_hot"},  # Area code
  "RAC1P": {"transform": "one_hot"},  # Race code
  "SCHL": {"transform": "one_hot"},  # School
  "SCIENGRLP": {"transform": "one_hot"},  # Science
  "SEX": {"transform": "one_hot"},
  "WKW": {"transform": "one_hot"},  # Weeks worked
}

# Write the features to a file so the cloud notebook can use the same values.
file_io.write_string_to_file(LOCAL_FEATURES_FILE,
                             json.dumps(features, indent=2))

In [12]:
!rm -fr {LOCAL_TRAINING_DIR}

In [13]:
job = sd.train(
  cloud=False,
  train_dataset=train_csv,
  eval_dataset=eval_csv,
  features=features,
  analysis_output_dir=LOCAL_PREPROCESSING_DIR,
  output_dir=LOCAL_TRAINING_DIR,
  max_steps=2000,
  layer_sizes=[5, 5, 5]
)
job.wait()

INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_keep_checkpoint_max': 5, '_tf_random_seed': None, '_task_type': None, '_environment': 'local', '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7ff5c19d9e10>, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1
}
, '_task_id': 0, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_evaluation_master': '', '_keep_checkpoint_every_n_hours': 10000, '_master': ''}
Instructions for updating:
Monitors are deprecated. Please use tf.train.SessionRunHook.
Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Create CheckpointSaverHook.
W tensorflow/cor

Job 0d2b4e4c-6a5f-4915-a4b8-2bb4f8062387 completed

In one run of the notebook, the loss on the evaluation set was ~400 (an RMSE of ~$20,000, since the target is in units of $1000). Your loss might be different. Let us check whether this loss is decent by comparing it with a model that simply returns the mean value.

In [14]:
eval_mean = eval_data[target_column].mean()
loss = ((eval_data[target_column] -  eval_mean)*(eval_data[target_column] -  eval_mean)).mean()
print(loss)

560.563578606


In one run of the notebook, the eval loss using the mean as a prediction was ~560. Our model is better than simply guessing the mean, but not by much. Try playing with the transforms, or changing the layer sizes, learning rate, or training steps to reduce the loss. Try adding other columns from the dataset to the model, or try predicting a different income-like column. You might find that predicting income from just demographic-like information is challenging.
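
To make the comparison concrete, the two losses can be converted into dollar amounts. The sketch below assumes the loss is a mean squared error on a target measured in units of $1000, which matches the baseline computation above. The function name `rmse_dollars` is made up for this example.

```python
import math


def rmse_dollars(mse_in_thousands_squared):
    """Convert an MSE on a target in units of $1000 to an RMSE in dollars."""
    return math.sqrt(mse_in_thousands_squared) * 1000.0


model_loss = 400.0             # approximate eval loss quoted above
baseline_loss = 560.563578606  # mean-predictor loss printed above

print(rmse_dollars(model_loss))
print(rmse_dollars(baseline_loss))
# Fraction of the baseline MSE that the model removes.
print(1.0 - model_loss / baseline_loss)
```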

# Local prediction

Local predict uses the model produced by training. The input data can be a csv string or Pandas DataFrame, but the schema must match the data set used for training, except the target column is missing. That is, if the training dataset had the values "id,target,value1,value2", the prediction data must be in the form "id,value1,value2".
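
If you already have rows in the training layout, one way to get prediction rows is to drop the target field. Here is a small sketch using the standard `csv` module. `drop_column` is a hypothetical helper, and index 1 is the target position in this notebook's csv files (key first, target second).

```python
import csv
import io


def drop_column(csv_line, column_index):
    """Remove one field from a single CSV line and return the new line."""
    row = next(csv.reader([csv_line]))
    del row[column_index]
    out = io.StringIO()
    # lineterminator='' keeps the result on one line with no trailing newline.
    csv.writer(out, lineterminator='').writerow(row)
    return out.getvalue()


training_row = 'id,target,value1,value2'
prediction_row = drop_column(training_row, 1)
print(prediction_row)
```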

In [15]:
sd.predict(
  cloud=False,
  training_output_dir=LOCAL_TRAINING_DIR,
  data=['490,64,2,0,1,0,2,8090,015,01,1,00590,00500,1,18,0,2,1',
        '1225,32,5,0,4,5301,2,9680,015,01,1,00100,00100,1,21,2,1,1',
        '1226,30,1,0,1,0,2,8680,020,01,1,00100,00100,1,16,0,2,1']
)

Starting local prediction.
Local prediction done.


Unnamed: 0,SERIALNO,predicted_target
0,490,22.044622
1,1225,66.338898
2,1226,21.513874


In [16]:
sd.predict(
  cloud=False,
  training_output_dir=LOCAL_TRAINING_DIR,
  data=pd.DataFrame(
    [[490,64,2,0,1,0,2,8090,"015","01",1,"00590","00500",1,18,0,2,1],
     [1225,32,5,0,4,5301,2,9680,"015","01",1,"00100","00100",1,21,2,1,1],
     [1226,30,1,0,1,0,2,8680,"020","01",1,"00100","00100",1,16,0,2,1]])
)

Starting local prediction.
Local prediction done.


Unnamed: 0,SERIALNO,predicted_target
0,490,22.044622
1,1225,66.338898
2,1226,21.513874


# Local batch prediction

Local batch prediction runs prediction on batched input data. This is ideal if the input dataset is very large or you have limited main memory. However, for very large datasets, it is better to run batch prediction using the Google Cloud Machine Learning Engine services. Two output formats are supported: csv and json. The output may also be sharded. Another feature of batch prediction is the option to run evaluation, that is, prediction on data that contains the target column. As with local prediction, the input data must match the schema used for training.

In [17]:
!rm -fr {LOCAL_BATCH_PREDICTION_DIR}

In [18]:
job = sd.batch_predict(
  cloud=False,
  training_output_dir=LOCAL_TRAINING_DIR,
  prediction_input_file=LOCAL_EVAL_FILE,
  output_dir=LOCAL_BATCH_PREDICTION_DIR,
  output_format='json',
  mode='evaluation'
)
job.wait()

Job ef73cec9-e2fd-4e85-9ff1-ec53d3bb8454 completed

In [19]:
!ls {LOCAL_BATCH_PREDICTION_DIR}

errors-00000-of-00001.txt  predictions-00000-of-00001.json


In [20]:
!cat {LOCAL_BATCH_PREDICTION_DIR}/errors*

In [21]:
!head {LOCAL_BATCH_PREDICTION_DIR}/predictions-00000*

{"SERIALNO": "11735","target_from_input": 40.0,"predicted_target": 22.862749099731445}
{"SERIALNO": "12794","target_from_input": 20.0,"predicted_target": 27.07731056213379}
{"SERIALNO": "23402","target_from_input": 70.0,"predicted_target": 39.61308670043945}
{"SERIALNO": "30351","target_from_input": 45.0,"predicted_target": 54.14183807373047}
{"SERIALNO": "30470","target_from_input": 32.0,"predicted_target": 48.34682083129883}
{"SERIALNO": "32995","target_from_input": 17.0,"predicted_target": 28.33970832824707}
{"SERIALNO": "40409","target_from_input": 39.0,"predicted_target": 23.186382293701172}
{"SERIALNO": "49363","target_from_input": 35.0,"predicted_target": 44.507320404052734}
{"SERIALNO": "54409","target_from_input": 38.0,"predicted_target": 55.829681396484375}
{"SERIALNO": "68049","target_from_input": 28.0,"predicted_target": 31.43916893005371}
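
Since evaluation mode writes both the input target and the prediction, the output is enough to compute an error metric. The sketch below computes an RMSE from newline-delimited JSON records, using the field names and the first three records shown in the output above. `rmse_from_json_lines` is a hypothetical helper, not part of the package.

```python
import json
import math


def rmse_from_json_lines(lines):
    """Compute the RMSE from an iterable of JSON strings, one record per line."""
    squared_errors = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        error = record['predicted_target'] - record['target_from_input']
        squared_errors.append(error * error)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# The first three records from the output above.
sample_lines = [
    '{"SERIALNO": "11735","target_from_input": 40.0,"predicted_target": 22.862749099731445}',
    '{"SERIALNO": "12794","target_from_input": 20.0,"predicted_target": 27.07731056213379}',
    '{"SERIALNO": "23402","target_from_input": 70.0,"predicted_target": 39.61308670043945}',
]
rmse = rmse_from_json_lines(sample_lines)
print(rmse)  # in units of $1000
```

To use it on the real output, pass an open file object for the predictions file instead of `sample_lines`.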


Let us run batch prediction again, this time using data that does not have a target column.

In [22]:
!rm -fr {LOCAL_BATCH_PREDICTION_DIR}

In [23]:
job = sd.batch_predict(
  cloud=False,
  training_output_dir=LOCAL_TRAINING_DIR,
  prediction_input_file=LOCAL_PREDICT_FILE,
  output_dir=LOCAL_BATCH_PREDICTION_DIR,
  output_format='csv',
  mode='prediction'
)
job.wait()

Job 94dd4ae5-bfbc-4af2-8007-b062efc90770 completed

In [24]:
!ls {LOCAL_BATCH_PREDICTION_DIR}

csv_schema.json  errors-00000-of-00001.txt  predictions-00000-of-00001.csv


In [25]:
!cat {LOCAL_BATCH_PREDICTION_DIR}/csv*

[
  {
    "type": "STRING", 
    "mode": "NULLABLE", 
    "name": "SERIALNO"
  }, 
  {
    "type": "FLOAT", 
    "mode": "NULLABLE", 
    "name": "predicted_target"
  }
]


In [26]:
!head {LOCAL_BATCH_PREDICTION_DIR}/predictions-00000*

2892,64.9897003174
3137,55.2213287354
10412,51.7473716736
14981,19.0005187988
32166,18.5371932983
32166,35.0706214905
32881,52.2031517029
36459,19.4232711792
37563,26.6305561066
42471,27.6680927277
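
The csv output has no header row, so the column names come from csv_schema.json. The sketch below combines the two, using the schema and the first three prediction rows shown above as inline strings. `read_predictions` is a hypothetical helper. It returns a list rather than a dict keyed by SERIALNO because, as the output above shows, a SERIALNO can appear more than once.

```python
import csv
import json

schema_text = '''[
  {"type": "STRING", "mode": "NULLABLE", "name": "SERIALNO"},
  {"type": "FLOAT", "mode": "NULLABLE", "name": "predicted_target"}
]'''

predictions_text = '''2892,64.9897003174
3137,55.2213287354
10412,51.7473716736
'''


def read_predictions(schema_text, csv_text):
    """Parse headerless CSV text into dicts using names and types from the schema."""
    schema = json.loads(schema_text)
    names = [column['name'] for column in schema]
    # Only FLOAT is converted here; every other type is kept as a string.
    casts = [float if column['type'] == 'FLOAT' else str for column in schema]
    rows = []
    for fields in csv.reader(csv_text.splitlines()):
        rows.append({name: cast(value)
                     for name, cast, value in zip(names, casts, fields)})
    return rows


rows = read_predictions(schema_text, predictions_text)
print(rows[0])
```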


# Cleaning things up

As everything was written to LOCAL_ROOT, we can simply remove this folder. If you want to delete those files, uncomment and run the next cell.

In [27]:
#!rm -fr {LOCAL_ROOT}