### Preprocess

Once you have gathered your data and decided how to preprocess it (that is, a featureset class is already defined), you can run the preprocessing. One way to do so is with DataFlow. If your data is large, DataFlow can run in the cloud in a distributed fashion; otherwise, you can run DataFlow locally. <br><br>

CloudML provides a preprocessing DataFlow transform that plugs easily into a pipeline.

Datalab provides a generated code template through the "%ml preprocess" command, so you don't have to author your DataFlow pipeline from scratch.

Preprocessing requires a featureset class. We defined one in the previous "1.Feature" notebook, but we need to define it again in this notebook's scope.
Note that we choose to scale all numeric feature columns to the range [-1, 1].

In [19]:
import google.cloud.ml.features as features

class IrisFeatures(object):
  """This class is generated from command line:
        %ml features
        path: /content/datalab/ml/data_train.csv
        headers: key,species,sepal_length,sepal_width,petal_length,petal_width
        target: species
        id: key
        Please modify it as appropriate!!!
  """
  csv_columns = ('key','species','sepal_length','sepal_width','petal_length','petal_width')
  species = features.target('species').classification()
  key = features.key('key')
  measurements = [
      features.numeric('petal_width').max_abs_scale(1),
      features.numeric('sepal_length').max_abs_scale(1),
      features.numeric('petal_length').max_abs_scale(1),
      features.numeric('sepal_width').max_abs_scale(1),
  ]
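
To make the [-1, 1] scaling concrete, here is a minimal sketch of max-abs scaling in plain Python. It is an illustration only: the helper `max_abs_scale` below is hypothetical, the sample values are made up, and the actual `features.numeric(...).max_abs_scale(1)` implementation in the CloudML SDK may differ in its details.

```python
def max_abs_scale(values, scale=1.0):
    """Scale values so that the largest absolute value maps to `scale`.

    Hypothetical helper for illustration; not the CloudML SDK implementation.
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        # All values are zero, so there is nothing to scale.
        return [0.0 for _ in values]
    return [scale * v / max_abs for v in values]


# Made-up sample measurements, not taken from the Iris files.
petal_widths = [0.2, 1.3, 2.5, 1.8]
scaled = max_abs_scale(petal_widths)
print(scaled)
```

Every scaled value falls within [-1, 1], and the sign of each value is preserved.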

Run "%ml preprocess", and it generates the input cell for you to fill out.

In [None]:
%ml preprocess

Fill in the cell input.

In [None]:
%%ml preprocess
train_data_path: /content/datalab/ml/iris/data_train.csv
eval_data_path: /content/datalab/ml/iris/data_eval.csv
data_format: CSV
output_dir: /content/datalab/ml/iris/preprocessed
feature_set_class_name: IrisFeatures


Run it to get the generated code below. You can run the pipeline directly (it is a local pipeline) or extend it with more DataFlow transforms.

In [17]:

# header
"""
Following code is generated from command line:
%%ml preprocess
train_data_path: /content/datalab/ml/iris/data_train.csv
eval_data_path: /content/datalab/ml/iris/data_eval.csv
data_format: CSV
output_dir: /content/datalab/ml/iris/preprocessed
feature_set_class_name: IrisFeatures

Please modify as appropriate!!!
"""

# imports
import apache_beam as beam
import google.cloud.ml as ml
import google.cloud.ml.io as io
import os

# defines
feature_set = IrisFeatures()
OUTPUT_DIR = '/content/datalab/ml/iris/preprocessed'
pipeline = beam.Pipeline('DirectPipelineRunner')


# preprocessing
training_data = beam.io.TextFileSource(
    '/content/datalab/ml/iris/data_train.csv',
    strip_trailing_newlines=True,
    coder=io.CsvCoder.from_feature_set(feature_set, feature_set.csv_columns))
train = pipeline | beam.Read('ReadTrainingData', training_data)
eval_data = beam.io.TextFileSource(
    '/content/datalab/ml/iris/data_eval.csv',
    strip_trailing_newlines=True,
    coder=io.CsvCoder.from_feature_set(feature_set, feature_set.csv_columns))
eval = pipeline | beam.Read('ReadEvalData', eval_data)
(metadata, train_features, eval_features) = ((train, eval) |
    ml.Preprocess('Preprocess', feature_set))
metadata | io.SaveMetadata(os.path.join(OUTPUT_DIR, "metadata.yaml"))
train_features | beam.Write('WriteTraining', beam.io.TextFileSink(
    os.path.join(OUTPUT_DIR, 'features_train')))

eval_features | beam.Write('WriteEval', beam.io.TextFileSink(
    os.path.join(OUTPUT_DIR, 'features_eval')))

# run pipeline
pipeline.run()


<apache_beam.runners.direct_runner.DirectPipelineResult at 0x7f1085dbe390>

Run the pipeline without modification and check the output. The output is in an example JSON format that TensorFlow can consume directly.

In [6]:
!ls /content/datalab/ml/iris/preprocessed

features_eval-00000-of-00001  features_train-00000-of-00001  metadata.yaml
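
Beyond listing the directory, it can help to look at a few records. The sketch below reads the first lines of an output shard as raw text, so it makes no assumption about the record format. The helper `peek_lines` is hypothetical; the path is the local training shard listed above.

```python
import itertools
import os


def peek_lines(path, n=3):
    """Return the first n lines of a text file without reading the rest."""
    with open(path) as f:
        return [line.rstrip('\n') for line in itertools.islice(f, n)]


# Training shard from the listing above; skipped if the pipeline has not run yet.
shard = '/content/datalab/ml/iris/preprocessed/features_train-00000-of-00001'
if os.path.exists(shard):
    for line in peek_lines(shard):
        print(line)
```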


You can also generate a Cloud DataFlow pipeline by adding "--cloud" to "%ml preprocess". <br>
To run it in the cloud, you need to: <br>
1. Sign in using the sign-in button in the upper right, if you have not done so. <br>
2. Set a default project by running '%projects set Your-Project-Id'.

In [7]:
%projects set cloud-ml-test-automated

You also need to copy the data to GCS so that Cloud DataFlow can read it.

In [11]:
%%storage create --bucket gs://cloud-ml-test-automated-sampledata

In [13]:
!gsutil cp /content/datalab/ml/iris/data_train.csv gs://cloud-ml-test-automated-sampledata/iris/data_train.csv 
!gsutil cp /content/datalab/ml/iris/data_eval.csv gs://cloud-ml-test-automated-sampledata/iris/data_eval.csv 

Copying file:///content/datalab/ml/iris/data_train.csv [Content-Type=text/csv]...
Uploading   ...ml-test-automated-sampledata/iris/data_train.csv: 3.83 KiB/3.83 KiB    
Copying file:///content/datalab/ml/iris/data_eval.csv [Content-Type=text/csv]...
Uploading   ...-ml-test-automated-sampledata/iris/data_eval.csv: 974 B/974 B    


Below is the input used to generate the pipeline code.

In [None]:
%%ml preprocess --cloud
train_data_path: gs://cloud-ml-test-automated-sampledata/iris/data_train.csv 
eval_data_path: gs://cloud-ml-test-automated-sampledata/iris/data_eval.csv 
data_format: CSV
output_dir: gs://cloud-ml-test-automated-sampledata/iris/preprocessed
feature_set_class_name: IrisFeatures

After you run the generated code below, you can go to the Developer Console to see the DataFlow job. For example, https://pantheon.corp.google.com/dataflow?project=cloud-ml-test-automated.

In [32]:

# header
"""
Following code is generated from command line:
%%ml preprocess --cloud
train_data_path: gs://cloud-ml-test-automated-sampledata/iris/data_train.csv 
eval_data_path: gs://cloud-ml-test-automated-sampledata/iris/data_eval.csv 
data_format: CSV
output_dir: gs://cloud-ml-test-automated-sampledata/iris/preprocessed
feature_set_class_name: IrisFeatures

Please modify as appropriate!!!
"""

# imports
import apache_beam as beam
import google.cloud.ml as ml
import google.cloud.ml.io as io
import os

# defines
feature_set = IrisFeatures()
OUTPUT_DIR = 'gs://cloud-ml-test-automated-sampledata/iris/preprocessing'
options = {
    'staging_location': os.path.join(OUTPUT_DIR, 'tmp', 'staging'),
    'temp_location': os.path.join(OUTPUT_DIR, 'tmp'),
    'job_name': 'preprocess-irisfeatures-160818-050916',
    'project': 'cloud-ml-test-automated',
    'extra_packages': ['gs://cloud-datalab/deploy/cloudml.tar.gz'],
    'teardown_policy': 'TEARDOWN_ALWAYS',
    'no_save_main_session': True
}
opts = beam.pipeline.PipelineOptions(flags=[], **options)
pipeline = beam.Pipeline('DataflowPipelineRunner', options=opts)


# preprocessing
training_data = beam.io.TextFileSource(
    'gs://cloud-ml-test-automated-sampledata/iris/data_train.csv',
    strip_trailing_newlines=True,
    coder=io.CsvCoder.from_feature_set(feature_set, feature_set.csv_columns))
train = pipeline | beam.Read('ReadTrainingData', training_data)
eval_data = beam.io.TextFileSource(
    'gs://cloud-ml-test-automated-sampledata/iris/data_eval.csv',
    coder=io.CsvCoder.from_feature_set(feature_set, feature_set.csv_columns))
eval = pipeline | beam.Read('ReadEvalData', eval_data)
(metadata, train_features, eval_features) = ((train, eval) |
    ml.Preprocess('Preprocess', feature_set))
metadata | io.SaveMetadata(os.path.join(OUTPUT_DIR, "metadata.yaml"))
train_features | beam.Write('WriteTraining', beam.io.TextFileSink(
    os.path.join(OUTPUT_DIR, 'features_train')))

eval_features | beam.Write('WriteEval', beam.io.TextFileSink(
    os.path.join(OUTPUT_DIR, 'features_eval')))

# run pipeline
pipeline.run()




<DataflowPipelineResult <Job
 id: u'2016-08-17_23_58_18-1351193934937065117'
 projectId: u'cloud-ml-test-automated'
 steps: []
 tempFiles: []
 type: TypeValueValuesEnum(JOB_TYPE_BATCH, 1)> at 0x7f1085dbe290>

In [34]:
!gsutil list gs://cloud-ml-test-automated-sampledata/iris/preprocessing

gs://cloud-ml-test-automated-sampledata/iris/preprocessing/features_eval-00000-of-00003
gs://cloud-ml-test-automated-sampledata/iris/preprocessing/features_eval-00001-of-00003
gs://cloud-ml-test-automated-sampledata/iris/preprocessing/features_eval-00002-of-00003
gs://cloud-ml-test-automated-sampledata/iris/preprocessing/features_train-00000-of-00004
gs://cloud-ml-test-automated-sampledata/iris/preprocessing/features_train-00001-of-00004
gs://cloud-ml-test-automated-sampledata/iris/preprocessing/features_train-00002-of-00004
gs://cloud-ml-test-automated-sampledata/iris/preprocessing/features_train-00003-of-00004
gs://cloud-ml-test-automated-sampledata/iris/preprocessing/metadata.yaml
gs://cloud-ml-test-automated-sampledata/iris/preprocessing/tmp/
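
The cloud run writes each output as several shards named `<prefix>-NNNNN-of-NNNNN`. As a rough sanity check, the sketch below parses such names and reports any shard indexes that are missing. The helper `missing_shards` is hypothetical and relies only on the naming pattern visible in the listing above; it does not call GCS, so you would pass it the paths yourself.

```python
import re
from collections import defaultdict

# Matches names such as "features_train-00002-of-00004".
SHARD_RE = re.compile(r'^(?P<prefix>.+)-(?P<index>\d{5})-of-(?P<total>\d{5})$')


def missing_shards(paths):
    """Map each shard prefix to the sorted list of shard indexes not found."""
    seen = defaultdict(set)
    totals = {}
    for path in paths:
        name = path.rstrip('/').rsplit('/', 1)[-1]
        match = SHARD_RE.match(name)
        if not match:
            continue  # Not a shard, e.g. metadata.yaml or tmp/.
        prefix = match.group('prefix')
        seen[prefix].add(int(match.group('index')))
        totals[prefix] = int(match.group('total'))
    result = {}
    for prefix, total in totals.items():
        gaps = sorted(set(range(total)) - seen[prefix])
        if gaps:
            result[prefix] = gaps
    return result


base = 'gs://cloud-ml-test-automated-sampledata/iris/preprocessing/'
listing = (
    [base + 'features_eval-%05d-of-00003' % i for i in range(3)] +
    [base + 'features_train-%05d-of-00004' % i for i in range(4)] +
    [base + 'metadata.yaml', base + 'tmp/']
)
print(missing_shards(listing))
```

An empty result means every expected shard is present for each prefix.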
