# About this notebook

This notebook assumes you have run the local Iris classification notebook ("1 Local End to End") and have not deleted the LOCAL_ROOT folder. In this notebook, we will use BigQuery to analyze the data files for training.

# Setting things up

In [1]:
import mltoolbox.classification.dnn as sd

In [2]:
import os
import tensorflow as tf
from tensorflow.python.lib.io import file_io
import google.datalab.ml as ml

This notebook writes files during preprocessing, both locally and on GCS. Set the root folders you wish to use in the next cell.

In [3]:
LOCAL_ROOT = './iris_notebook_workspace' # This should be the same as what was used in the local iris notebook
CLOUD_ROOT = 'gs://' + datalab_project_id() + '-iris-classification-datalab' # Feel free to change this line.

# No need to edit anything else in this cell.
LOCAL_PREPROCESSING_DIR = os.path.join(LOCAL_ROOT, 'preprocessing')
CLOUD_PREPROCESSING_DIR = os.path.join(CLOUD_ROOT, 'cloud_preprocessing') 

LOCAL_TRAIN_FILE = os.path.join(LOCAL_ROOT, 'train.csv')
CLOUD_TRAIN_FILE = os.path.join(CLOUD_ROOT, 'train.csv')


LOCAL_SCHEMA_FILE = os.path.join(LOCAL_ROOT, 'schema.json')
CLOUD_SCHEMA_FILE = os.path.join(CLOUD_ROOT, 'schema.json')

if not file_io.file_exists(LOCAL_ROOT):
  raise ValueError('LOCAL_ROOT not found. Did you run the local notebook?')
  
!gsutil mb {CLOUD_ROOT}

Creating gs://cloud-ml-dev-iris-classification-datalab/...


First, let us copy the training CSV file and the schema file to GCS.

In [4]:
!gsutil cp {LOCAL_TRAIN_FILE} {CLOUD_TRAIN_FILE}
!gsutil cp {LOCAL_SCHEMA_FILE} {CLOUD_SCHEMA_FILE}

Copying file://./iris_notebook_workspace/train.csv [Content-Type=text/csv]...
/ [1 files][  3.8 KiB/  3.8 KiB]                                                
Operation completed over 1 objects/3.8 KiB.                                      
Copying file://./iris_notebook_workspace/schema.json [Content-Type=application/json]...
/ [1 files][  341.0 B/  341.0 B]                                                
Operation completed over 1 objects/341.0 B.                                      


# Analysis with BigQuery starting from csv files on GCS

In [5]:
!gsutil rm -fr {CLOUD_PREPROCESSING_DIR}

CommandException: 1 files/objects could not be removed.


In [6]:
train_csv = ml.CsvDataSet(
  file_pattern=CLOUD_TRAIN_FILE,
  schema_file=CLOUD_SCHEMA_FILE)

In [7]:
sd.analyze(
  dataset=train_csv,
  output_dir=CLOUD_PREPROCESSING_DIR,
  cloud=True
)

Track BigQuery status at
https://bigquery.cloud.google.com/queries/cloud-ml-dev
Running numerical analysis...done.
Running categorical analysis...done.
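
The two passes reported above (numerical and categorical) follow from the column types declared in the schema file. The sketch below is only an illustration of that split, not the mltoolbox implementation. It assumes the schema is a BigQuery-style JSON list of `{"name", "type"}` entries; the column names in `EXAMPLE_SCHEMA` and the helper `split_columns` are hypothetical.

```python
import json

# Hypothetical schema in a BigQuery-style list-of-fields layout.
# The column names are assumptions made for illustration only.
EXAMPLE_SCHEMA = json.loads("""
[
  {"name": "key", "type": "INTEGER"},
  {"name": "flower", "type": "STRING"},
  {"name": "sepal_length", "type": "FLOAT"},
  {"name": "sepal_width", "type": "FLOAT"}
]
""")

NUMERIC_TYPES = {"INTEGER", "FLOAT"}


def split_columns(schema):
  """Return (numerical, categorical) column names from a schema list."""
  numerical = [f["name"] for f in schema if f["type"].upper() in NUMERIC_TYPES]
  categorical = [f["name"] for f in schema if f["type"].upper() == "STRING"]
  return numerical, categorical


numerical_cols, categorical_cols = split_columns(EXAMPLE_SCHEMA)
print(numerical_cols)
print(categorical_cols)
```

Note that this naive split treats the `key` column as numerical. A real pipeline would likely handle key and target columns separately.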


The output of preprocessing is a stats.json file that contains the analysis of the numerical columns, a vocab file for each categorical column, and a schema.json file. Training consumes these files, so you should not normally need to inspect them. Just for fun, let's look at them.

In [8]:
!gsutil ls  {CLOUD_PREPROCESSING_DIR}

gs://cloud-ml-dev-iris-classification-datalab/cloud_preprocessing/schema.json
gs://cloud-ml-dev-iris-classification-datalab/cloud_preprocessing/stats.json
gs://cloud-ml-dev-iris-classification-datalab/cloud_preprocessing/vocab_flower.csv


In [9]:
!gsutil cat  {CLOUD_PREPROCESSING_DIR}/vocab_flower.csv

Iris-setosa
Iris-versicolor
Iris-virginica


In [10]:
!gsutil cat {CLOUD_PREPROCESSING_DIR}/stats.json
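
For intuition about what the analysis files hold, here is a minimal local sketch that computes per-column statistics for numerical columns and a vocabulary for each categorical column. It is not what `sd.analyze` runs (with `cloud=True` the analysis is done in BigQuery), and the actual layout of stats.json may differ. The sample rows, column names, and the helper `analyze_rows` are all hypothetical.

```python
import csv
import io

# Hypothetical rows standing in for train.csv; the values are made up.
SAMPLE_CSV = """1,Iris-setosa,5.1,3.5
2,Iris-versicolor,7.0,3.2
3,Iris-virginica,6.3,3.3
4,Iris-setosa,4.9,3.0
"""
COLUMNS = ["key", "flower", "sepal_length", "sepal_width"]


def analyze_rows(text, columns, numerical, categorical):
  """Compute min/max/mean per numerical column and a sorted vocab per categorical column."""
  rows = list(csv.DictReader(io.StringIO(text), fieldnames=columns))
  stats = {}
  for name in numerical:
    values = [float(row[name]) for row in rows]
    stats[name] = {
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }
  # One vocabulary (the set of distinct values) per categorical column.
  vocabs = {name: sorted({row[name] for row in rows}) for name in categorical}
  return stats, vocabs


stats, vocabs = analyze_rows(
    SAMPLE_CSV, COLUMNS,
    numerical=["sepal_length", "sepal_width"],
    categorical=["flower"])
print(stats)
print(vocabs)
```

The vocabulary computed for `flower` here plays the same role as vocab_flower.csv above: it fixes the set of labels the model can predict.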


# Cleaning things up

If you want to delete the files you made on GCS, uncomment and run the next cell.

In [11]:
#!gsutil -m rm -rf {CLOUD_ROOT}