<a name="about"></a>
About this notebook
======

This notebook assumes you have run the local Iris classification notebook and have not deleted the LOCAL_ROOT folder. In this notebook, we will use BigQuery to preprocess the data files.

<a name="setup"></a>
Setting things up
=====

In [1]:
import datalab_structured_data as sd

Let's check which versions of structured_data (SD) and TensorFlow (TF) are installed. Make sure TF is 1.0.0 and SD is 0.0.1.

In [12]:
import os

import tensorflow as tf
from tensorflow.python.lib.io import file_io

import datalab.ml as ml

print('tf ' + str(tf.__version__))
print('sd ' + str(sd.__version__))

tf 1.0.0
sd 0.0.1


This notebook writes files during preprocessing. Specify the root folders to use: LOCAL_ROOT for the files produced by the local notebook, and CLOUD_ROOT for the GCS bucket.

In [13]:
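LOCAL_ROOT = './iris_notebook_workspace' # This should be the same as what was used in the local iris notebook
# datalab_project_id() is not imported here; it is assumed to be provided by the Datalab environment.
# The project id and the suffix are joined without a separator, so the bucket is named
# '<project id>iris-classification-datalab'.
CLOUD_ROOT = 'gs://' + datalab_project_id() + 'iris-classification-datalab'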

if not file_io.file_exists(LOCAL_ROOT):
  raise ValueError('LOCAL_ROOT not found. Did you run the local notebook?')
!gsutil mb {CLOUD_ROOT}



Creating gs://cloud-ml-deviris-classification-datalab/...
ServiceException: 409 Bucket cloud-ml-deviris-classification-datalab already exists.


First, copy the training CSV file and its schema file to GCS.

In [14]:
!gsutil cp {os.path.join(LOCAL_ROOT, 'train.csv')} {CLOUD_ROOT}
!gsutil cp {os.path.join(LOCAL_ROOT, 'schema.json')} {CLOUD_ROOT}

Copying file://./iris_notebook_workspace/train.csv [Content-Type=text/csv]...
/ [1 files][  3.8 KiB/  3.8 KiB]                                                
Operation completed over 1 objects/3.8 KiB.                                      
Copying file://./iris_notebook_workspace/schema.json [Content-Type=application/json]...

Operation completed over 1 objects/573.0 B.                                      


<a name="local_preprocessing"></a>
Preprocessing with BigQuery starting from CSV files on GCS
=====

In [15]:
!gsutil rm -fr {CLOUD_ROOT}/preprocess

Removing gs://cloud-ml-deviris-classification-datalab/preprocess/numerical_analysis.json#1487959460975199...
Removing gs://cloud-ml-deviris-classification-datalab/preprocess/schema.json#1487959464566198...
Removing gs://cloud-ml-deviris-classification-datalab/preprocess/vocab_flower.csv#1487959463775178...

Operation completed over 3 objects.                                              


In [16]:
train_csv = ml.CsvDataSet(
  file_pattern=os.path.join(CLOUD_ROOT, 'train.csv'),
  schema_file=os.path.join(CLOUD_ROOT, 'schema.json'))

In [17]:
sd.cloud_preprocess(
  dataset=train_csv,
  output_dir=os.path.join(CLOUD_ROOT, 'preprocess'),
)

Starting cloud preprocessing.
Track BigQuery status at
https://bigquery.cloud.google.com/queries/cloud-ml-dev
Running numerical analysis...done.
Running categorical analysis...done.
Cloud preprocessing done.


Preprocessing produces a numerical_analysis file that contains the analysis of the numerical columns, and a vocab file for each categorical column. These files are consumed during training, so you should not normally need to inspect them. Just for fun, let's look at them.

In [18]:
!gsutil ls  {CLOUD_ROOT}/preprocess

gs://cloud-ml-deviris-classification-datalab/preprocess/numerical_analysis.json
gs://cloud-ml-deviris-classification-datalab/preprocess/schema.json
gs://cloud-ml-deviris-classification-datalab/preprocess/vocab_flower.csv


In [19]:
!gsutil cat  {CLOUD_ROOT}/preprocess/schema.json

[
  {
    "type": "STRING",
    "mode": "NULLABLE",
    "name": "flower"
  },
  {
    "type": "INTEGER",
    "mode": "REQUIRED",
    "name": "key"
  },
  {
    "type": "FLOAT",
    "mode": "NULLABLE",
    "name": "sepal_length"
  },
  {
    "type": "FLOAT",
    "mode": "NULLABLE",
    "name": "sepal_width"
  },
  {
    "type": "FLOAT",
    "mode": "NULLABLE",
    "name": "petal_length"
  },
  {
    "type": "FLOAT",
    "mode": "NULLABLE",
    "name": "petal_width"
  }
]
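
The schema above is what tells a preprocessing step which columns are numerical and which are categorical. As a rough sketch, the split could be derived from the "type" field as follows. The helper name split_columns is hypothetical, and the rule used here (INTEGER and FLOAT are numerical, STRING is categorical) is an assumption; structured_data may treat some columns, such as key, differently.

```python
import json

# The same fields as the schema printed above, written compactly.
SCHEMA_JSON = """
[
  {"type": "STRING",  "mode": "NULLABLE", "name": "flower"},
  {"type": "INTEGER", "mode": "REQUIRED", "name": "key"},
  {"type": "FLOAT",   "mode": "NULLABLE", "name": "sepal_length"},
  {"type": "FLOAT",   "mode": "NULLABLE", "name": "sepal_width"},
  {"type": "FLOAT",   "mode": "NULLABLE", "name": "petal_length"},
  {"type": "FLOAT",   "mode": "NULLABLE", "name": "petal_width"}
]
"""


def split_columns(schema):
    """Hypothetical helper: split column names by BigQuery type.

    Assumption: INTEGER and FLOAT columns are numerical, STRING columns
    are categorical. Other types are ignored.
    """
    numerical = [col['name'] for col in schema
                 if col['type'] in ('INTEGER', 'FLOAT')]
    categorical = [col['name'] for col in schema
                   if col['type'] == 'STRING']
    return numerical, categorical


schema = json.loads(SCHEMA_JSON)
numerical, categorical = split_columns(schema)
print('numerical:', numerical)
print('categorical:', categorical)
```

Under this rule, flower is the only categorical column, which matches the single vocab_flower.csv file in the preprocessing output.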

Cleaning things up
=====

If you want to delete the files you made on GCS, uncomment and run the next cell.

In [20]:
#!gsutil rm -fr {CLOUD_ROOT}