<a name="about"></a>
About this notebook
======

This notebook assumes you have run the local Census Regression notebook and have not deleted the LOCAL_ROOT folder. In this notebook, we will use BigQuery to preprocess the data files.

<a name="setup"></a>
Setting things up
=====

In [1]:
import datalab_structured_data as sd

Let's look at the versions of structured_data (SD) and TensorFlow (TF) we have. Make sure TF is 1.0.0 and SD is 0.0.1.

In [2]:
import os

import tensorflow as tf
from tensorflow.python.lib.io import file_io

import datalab.ml as ml

print('tf ' + str(tf.__version__))
print('sd ' + str(sd.__version__))

tf 1.0.0
sd 0.0.1
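
The version requirement above can also be enforced rather than checked by eye. The following is a minimal sketch, assuming an exact string match is what you want; `require_version` is a hypothetical helper written for this notebook and is not part of TensorFlow or structured_data.

```python
def require_version(name, actual, expected):
    """Raise if a package version differs from the one this notebook expects.

    Hypothetical helper: compares versions as plain strings, nothing more.
    """
    if str(actual) != expected:
        raise RuntimeError(
            '%s is %s, but this notebook expects %s' % (name, actual, expected))
    return True


# In the notebook you would pass tf.__version__ and sd.__version__ instead
# of these literal strings.
require_version('tf', '1.0.0', '1.0.0')
require_version('sd', '0.0.1', '0.0.1')
```

Failing early here is cheaper than discovering a version mismatch partway through preprocessing.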


This notebook writes files during preprocessing. Choose the root folders you wish to use: LOCAL_ROOT for the existing local files and CLOUD_ROOT for the GCS bucket.

In [3]:
LOCAL_ROOT = './census_regression_workspace' # This should be the same as what was used in the local census notebook
CLOUD_ROOT = 'gs://' + datalab_project_id() + '-census-regression-datalab'

if not file_io.file_exists(LOCAL_ROOT):
  raise ValueError('LOCAL_ROOT not found. Did you run the local notebook?')
!gsutil mb {CLOUD_ROOT}

Creating gs://cloud-ml-dev-census-regression-datalab/...
ServiceException: 409 Bucket cloud-ml-dev-census-regression-datalab already exists.
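
The 409 in the output above only means the bucket already exists from an earlier run, so it is safe to continue. If you would rather make the step repeatable without the error, here is a minimal sketch. It assumes `gsutil ls -b` exits with a non-zero code when the bucket is missing; `ensure_bucket` and its `run` parameter are hypothetical names, not part of datalab or gsutil.

```python
import subprocess


def ensure_bucket(bucket_url, run=subprocess.call):
    """Create bucket_url with gsutil unless it already exists.

    Hypothetical helper. `run` takes an argument list and returns an exit
    code; it is a parameter so the logic can be exercised without gsutil.
    Assumes `gsutil ls -b` exits non-zero for a missing bucket.
    """
    # Ask gsutil about the bucket itself rather than its contents.
    if run(['gsutil', 'ls', '-b', bucket_url]) == 0:
        return 'exists'
    # The bucket was not visible, so try to create it.
    if run(['gsutil', 'mb', bucket_url]) != 0:
        raise RuntimeError('could not create ' + bucket_url)
    return 'created'
```

In the notebook this would be called as `ensure_bucket(CLOUD_ROOT)`. Note that the listing step can also fail for reasons other than a missing bucket (for example, missing permissions), in which case the create step fails too and the helper raises.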


First, copy the CSV files and the schema file to GCS.

In [4]:
!gsutil cp {os.path.join(LOCAL_ROOT, '*_data.csv')} {CLOUD_ROOT}
!gsutil cp {os.path.join(LOCAL_ROOT, 'schema.json')} {CLOUD_ROOT}

Copying file://./census_regression_workspace/eval_data.csv [Content-Type=text/csv]...
Copying file://./census_regression_workspace/predict_data.csv [Content-Type=text/csv]...
Copying file://./census_regression_workspace/train_data.csv [Content-Type=text/csv]...
- [3 files][200.1 KiB/200.1 KiB]                                                
Operation completed over 3 objects/200.1 KiB.                                    
Copying file://./census_regression_workspace/schema.json [Content-Type=application/json]...
/ [1 files][  1.4 KiB/  1.4 KiB]                                                
Operation completed over 1 objects/1.4 KiB.                                      


<a name="local_preprocessing"></a>
Preprocessing with BigQuery starting from CSV files on GCS
=====

In [5]:
!gsutil rm -fr {CLOUD_ROOT}/preprocess

Removing gs://cloud-ml-dev-census-regression-datalab/preprocess/numerical_analysis.json#1487874068723727...
Removing gs://cloud-ml-dev-census-regression-datalab/preprocess/schema.json#1487874111183352...
Removing gs://cloud-ml-dev-census-regression-datalab/preprocess/vocab_AGEP.csv#1487874073942074...
Removing gs://cloud-ml-dev-census-regression-datalab/preprocess/vocab_COW.csv#1487874076399053...
/ [4 objects]                                                                   
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m -o ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Removing gs://cloud-ml-dev-census-regression-datalab/preprocess/vocab_ESP.csv#1487874078769735...
Removing gs://cloud-ml-dev-census-regression-datalab/preprocess/vocab_ESR.csv#1487874081184682...
Removing gs://cloud-ml-dev-census-regression-datalab/preprocess/vocab_

In [None]:
train_csv = ml.CsvDataSet(
  file_pattern=os.path.join(CLOUD_ROOT, 'train_data.csv'),
  schema_file=os.path.join(CLOUD_ROOT, 'schema.json'))

In [7]:
sd.cloud_preprocess(
  dataset=train_csv,
  output_dir=os.path.join(CLOUD_ROOT, 'preprocess'),
)

Starting cloud preprocessing.
Track BigQuery status at
https://bigquery.cloud.google.com/queries/cloud-ml-dev
Running numerical analysis...done.
Running categorical analysis...done.
Cloud preprocessing done.


The output of preprocessing is a numerical_analysis.json file that contains the analysis of the numerical columns, one vocab file for each categorical column, and a schema.json file. Training consumes these files, so you should not have to handle them yourself. Just for fun, let's look at them.

In [8]:
!gsutil ls  {CLOUD_ROOT}/preprocess

gs://cloud-ml-dev-census-regression-datalab/preprocess/numerical_analysis.json
gs://cloud-ml-dev-census-regression-datalab/preprocess/schema.json
gs://cloud-ml-dev-census-regression-datalab/preprocess/vocab_AGEP.csv
gs://cloud-ml-dev-census-regression-datalab/preprocess/vocab_COW.csv
gs://cloud-ml-dev-census-regression-datalab/preprocess/vocab_ESP.csv
gs://cloud-ml-dev-census-regression-datalab/preprocess/vocab_ESR.csv
gs://cloud-ml-dev-census-regression-datalab/preprocess/vocab_FOD1P.csv
gs://cloud-ml-dev-census-regression-datalab/preprocess/vocab_HINS4.csv
gs://cloud-ml-dev-census-regression-datalab/preprocess/vocab_INDP.csv
gs://cloud-ml-dev-census-regression-datalab/preprocess/vocab_JWMNP.csv
gs://cloud-ml-dev-census-regression-datalab/preprocess/vocab_JWTR.csv
gs://cloud-ml-dev-census-regression-datalab/preprocess/vocab_MAR.csv
gs://cloud-ml-dev-census-regression-datalab/preprocess/vocab_POWPUMA.csv
gs://cloud-ml-dev-census-regression-datalab/preprocess/vocab_PUMA.csv
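
The listing above suggests a naming pattern of `vocab_<COLUMN>.csv`, one file per categorical column. As a rough sketch, and assuming that pattern holds (it is inferred from the listing, not from any documented contract), the column names can be recovered from the paths. `vocab_columns` is a hypothetical helper and `my-bucket` is a placeholder bucket name.

```python
import posixpath
import re


def vocab_columns(paths):
    """Return the column names encoded in vocab_<COLUMN>.csv file names.

    Hypothetical helper; the naming pattern is an assumption based on the
    directory listing, and other files are simply skipped.
    """
    columns = []
    for path in paths:
        match = re.match(r'^vocab_(.+)\.csv$', posixpath.basename(path))
        if match:
            columns.append(match.group(1))
    return columns


# A shortened listing with a placeholder bucket name.
listing = [
    'gs://my-bucket/preprocess/numerical_analysis.json',
    'gs://my-bucket/preprocess/schema.json',
    'gs://my-bucket/preprocess/vocab_AGEP.csv',
    'gs://my-bucket/preprocess/vocab_COW.csv',
]
print(vocab_columns(listing))
```

Comparing this list with the columns you expect to be categorical is a quick sanity check before training.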

In [9]:
!gsutil cat  {CLOUD_ROOT}/preprocess/schema.json

[
  {
    "type": "STRING",
    "name": "SERIALNO",
    "mode": "NULLABLE"
  },
  {
    "type": "FLOAT",
    "name": "WAGP",
    "mode": "NULLABLE"
  },
  {
    "type": "STRING",
    "name": "AGEP",
    "mode": "NULLABLE"
  },
  {
    "type": "STRING",
    "name": "COW",
    "mode": "NULLABLE"
  },
  {
    "type": "STRING",
    "name": "ESP",
    "mode": "NULLABLE"
  },
  {
    "type": "STRING",
    "name": "ESR",
    "mode": "NULLABLE"
  },
  {
    "type": "STRING",
    "name": "FOD1P",
    "mode": "NULLABLE"
  },
  {
    "type": "STRING",
    "name": "HINS4",
    "mode": "NULLABLE"
  },
  {
    "type": "STRING",
    "name": "INDP",
    "mode": "NULLABLE"
  },
  {
    "type": "STRING",
    "name": "JWMNP",
    "mode": "NULLABLE"
  },
  {
    "type": "STRING",
    "name": "JWTR",
    "mode": "NULLABLE"
  },
  {
    "type": "STRING",
    "name": "MAR",
    "mode": "NULLABLE"
  },
  {
    "type": "STRING",
    "name": "POWPUM

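The schema is a JSON list of fields, each with a `type`, a `name`, and a `mode`. A small sketch of grouping column names by type is shown here, using the first three fields from the output above as inline sample data; `columns_by_type` is a hypothetical helper. Type alone does not tell you which columns got a vocab file: SERIALNO is a STRING column, yet the earlier listing shows no vocab_SERIALNO.csv.

```python
import json

# The first three fields of the schema shown above, inlined as sample data.
schema_text = '''
[
  {"type": "STRING", "name": "SERIALNO", "mode": "NULLABLE"},
  {"type": "FLOAT", "name": "WAGP", "mode": "NULLABLE"},
  {"type": "STRING", "name": "AGEP", "mode": "NULLABLE"}
]
'''


def columns_by_type(schema):
    """Group column names by their declared type (hypothetical helper)."""
    grouped = {}
    for field in schema:
        grouped.setdefault(field['type'], []).append(field['name'])
    return grouped


grouped = columns_by_type(json.loads(schema_text))
print(grouped)
```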
Cleaning things up
=====

If you want to delete the files you made on GCS, uncomment and run the next cell.

In [10]:
#!gsutil rm -fr {CLOUD_ROOT}