# About this notebook

This notebook assumes you have run the local Census Regression notebook and have not deleted the LOCAL_ROOT folder. In this notebook, we will use BigQuery to analyze the training data files.

# Setting things up

In [1]:
import mltoolbox.regression.dnn as sd

In [2]:
import os
import tensorflow as tf
from tensorflow.python.lib.io import file_io
import datalab.ml as ml

This notebook writes files during preprocessing. Set the root folders you wish to use in the next cell.

In [3]:
LOCAL_ROOT = './census_regression_workspace' # This should be the same as what was used in the local notebook
CLOUD_ROOT = 'gs://' + datalab_project_id() + '-census-regression-datalab'

# No need to edit anything else in this cell.
LOCAL_PREPROCESSING_DIR = os.path.join(LOCAL_ROOT, 'preprocessing')
CLOUD_PREPROCESSING_DIR = os.path.join(CLOUD_ROOT, 'cloud_preprocessing')

LOCAL_TRAIN_FILE = os.path.join(LOCAL_ROOT, 'train.csv')
CLOUD_TRAIN_FILE = os.path.join(CLOUD_ROOT, 'train.csv')

LOCAL_SCHEMA_FILE = os.path.join(LOCAL_ROOT, 'schema.json')
CLOUD_SCHEMA_FILE = os.path.join(CLOUD_ROOT, 'schema.json')

if not file_io.file_exists(LOCAL_ROOT):
  raise ValueError('LOCAL_ROOT not found. Did you run the local notebook?')
  
!gsutil mb {CLOUD_ROOT}

Creating gs://cloud-ml-dev-census-regression-datalab/...
ServiceException: 409 Bucket cloud-ml-dev-census-regression-datalab already exists.
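
The cell above builds `gs://` paths with `os.path.join`, which works on Linux hosts because the path separator there is `/`. As a minimal sketch, assuming you might run similar code on a platform with a different separator, a small helper based on `posixpath` keeps the forward slashes. The name `gcs_join` and the bucket name are hypothetical and used only for illustration.

```python
import posixpath


def gcs_join(root, *parts):
    """Join GCS path components with '/' regardless of the host OS.

    gcs_join is a hypothetical helper, not part of datalab or mltoolbox.
    """
    return posixpath.join(root, *parts)


# 'gs://example-bucket' is a placeholder bucket name.
example_root = 'gs://example-bucket'
example_train_file = gcs_join(example_root, 'train.csv')
example_preprocessing_dir = gcs_join(example_root, 'cloud_preprocessing')
print(example_train_file)
print(example_preprocessing_dir)
```

On a Linux host this produces the same paths as `os.path.join`, so it is only a portability precaution.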


First, let us copy the CSV and schema files to GCS.

In [4]:
!gsutil cp {LOCAL_TRAIN_FILE} {CLOUD_TRAIN_FILE}
!gsutil cp {LOCAL_SCHEMA_FILE} {CLOUD_SCHEMA_FILE}

Copying file://./census_regression_workspace/train.csv [Content-Type=text/csv]...
/ [1 files][162.9 KiB/162.9 KiB]                                                
Operation completed over 1 objects/162.9 KiB.                                    
Copying file://./census_regression_workspace/schema.json [Content-Type=application/json]...
/ [1 files][  998.0 B/  998.0 B]                                                
Operation completed over 1 objects/998.0 B.                                      


# Analysis with BigQuery starting from CSV files on GCS

In [5]:
!gsutil -m rm -fr {CLOUD_PREPROCESSING_DIR}

Removing gs://cloud-ml-dev-census-regression-datalab/cloud_preprocessing/numerical_analysis.json#1488558995585167...
Removing gs://cloud-ml-dev-census-regression-datalab/cloud_preprocessing/schema.json#1488559041254855...
Removing gs://cloud-ml-dev-census-regression-datalab/cloud_preprocessing/vocab_AGEP.csv#1488559000284624...
Removing gs://cloud-ml-dev-census-regression-datalab/cloud_preprocessing/vocab_COW.csv#1488559002435525...
Removing gs://cloud-ml-dev-census-regression-datalab/cloud_preprocessing/vocab_ESP.csv#1488559004980695...
Removing gs://cloud-ml-dev-census-regression-datalab/cloud_preprocessing/vocab_ESR.csv#1488559007060838...
Removing gs://cloud-ml-dev-census-regression-datalab/cloud_preprocessing/vocab_FOD1P.csv#1488559009571115...
Removing gs://cloud-ml-dev-census-regression-datalab/cloud_preprocessing/vocab_HINS4.csv#1488559011731415...
Removing gs://cloud-ml-dev-census-regression-datalab/cloud_preprocessing/vocab_INDP.csv#1488559013674325...
Removing gs://cloud-ml-

In [6]:
train_csv = ml.CsvDataSet(
  file_pattern=CLOUD_TRAIN_FILE,
  schema_file=CLOUD_SCHEMA_FILE
)

In [7]:
job = sd.analyze(
  cloud=True,
  dataset=train_csv,
  output_dir=CLOUD_PREPROCESSING_DIR,
)
job.wait()

Track BigQuery status at
https://bigquery.cloud.google.com/queries/cloud-ml-dev
Running numerical analysis...done.
Running categorical analysis...done.


Job c912fd7c-19d9-4443-b565-665708d34e44 completed

The output of preprocessing is a numerical_analysis.json file that contains the analysis of the numerical columns, and a vocab file for each categorical column. These files are consumed during training, so you should not need to work with them directly. Just for fun, let's look at them.

In [8]:
!gsutil ls {CLOUD_PREPROCESSING_DIR}

gs://cloud-ml-dev-census-regression-datalab/cloud_preprocessing/numerical_analysis.json
gs://cloud-ml-dev-census-regression-datalab/cloud_preprocessing/schema.json
gs://cloud-ml-dev-census-regression-datalab/cloud_preprocessing/vocab_AGEP.csv
gs://cloud-ml-dev-census-regression-datalab/cloud_preprocessing/vocab_COW.csv
gs://cloud-ml-dev-census-regression-datalab/cloud_preprocessing/vocab_ESP.csv
gs://cloud-ml-dev-census-regression-datalab/cloud_preprocessing/vocab_ESR.csv
gs://cloud-ml-dev-census-regression-datalab/cloud_preprocessing/vocab_FOD1P.csv
gs://cloud-ml-dev-census-regression-datalab/cloud_preprocessing/vocab_HINS4.csv
gs://cloud-ml-dev-census-regression-datalab/cloud_preprocessing/vocab_INDP.csv
gs://cloud-ml-dev-census-regression-datalab/cloud_preprocessing/vocab_JWMNP.csv
gs://cloud-ml-dev-census-regression-datalab/cloud_preprocessing/vocab_JWTR.csv
gs://cloud-ml-dev-census-regression-datalab/cloud_preprocessing/vocab_MAR.csv
gs://cloud-ml-dev-census-regression
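
The listing suggests that each vocab file is named `vocab_<COLUMN>.csv`. As a minimal sketch, assuming that naming pattern holds for all vocab files, the column names can be recovered from a list of paths like the one above. The function `vocab_columns` is a hypothetical helper, and the sample paths are copied from the output shown here.

```python
import posixpath
import re

# Assumed pattern, inferred from the listing: vocab_<COLUMN>.csv
VOCAB_PATTERN = re.compile(r'^vocab_(.+)\.csv$')


def vocab_columns(paths):
    """Return the column names encoded in vocab file paths.

    Paths that do not match the assumed pattern are skipped.
    """
    columns = []
    for path in paths:
        match = VOCAB_PATTERN.match(posixpath.basename(path))
        if match:
            columns.append(match.group(1))
    return columns


prefix = 'gs://cloud-ml-dev-census-regression-datalab/cloud_preprocessing/'
sample_listing = [
    prefix + 'numerical_analysis.json',
    prefix + 'schema.json',
    prefix + 'vocab_AGEP.csv',
    prefix + 'vocab_COW.csv',
    prefix + 'vocab_ESP.csv',
]
print(vocab_columns(sample_listing))
```

In a notebook, the listing could instead be captured with IPython's `files = !gsutil ls {CLOUD_PREPROCESSING_DIR}` assignment syntax and passed to the helper.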

In [9]:
!gsutil cat {CLOUD_PREPROCESSING_DIR}/schema.json

[
  {
    "type": "STRING",
    "name": "SERIALNO"
  },
  {
    "type": "FLOAT",
    "name": "WAGP"
  },
  {
    "type": "STRING",
    "name": "AGEP"
  },
  {
    "type": "STRING",
    "name": "COW"
  },
  {
    "type": "STRING",
    "name": "ESP"
  },
  {
    "type": "STRING",
    "name": "ESR"
  },
  {
    "type": "STRING",
    "name": "FOD1P"
  },
  {
    "type": "STRING",
    "name": "HINS4"
  },
  {
    "type": "STRING",
    "name": "INDP"
  },
  {
    "type": "STRING",
    "name": "JWMNP"
  },
  {
    "type": "STRING",
    "name": "JWTR"
  },
  {
    "type": "STRING",
    "name": "MAR"
  },
  {
    "type": "STRING",
    "name": "POWPUMA"
  },
  {
    "type": "STRING",
    "name": "PUMA"
  },
  {
    "type": "STRING",
    "name": "RAC1P"
  },
  {
    "type": "STRING",
    "name": "SCHL"
  },
  {
    "type": "STRING",
    "name": "SCIENGRLP"
  },
  {
    "type": "STRING",
    "name": "SEX"
  },
  {
    "type"

# Cleaning things up

If you want to delete the files you made on GCS, uncomment and run the next cell.

In [10]:
#!gsutil rm -fr {CLOUD_ROOT}