### Preprocess

Once you have gathered your data and decided how to preprocess it (that is, a featureset class is already defined), you can preprocess the data. One way to do so is with DataFlow. If your data is large, DataFlow can run in the cloud in a distributed fashion. If it is small, you can run DataFlow locally.

CloudML provides a preprocessing DataFlow transform that plugs easily into a pipeline.

Datalab provides a generated code template through the "%mlalpha preprocess" command, so you don't have to author your DataFlow pipeline from scratch.

Preprocessing requires a featureset class. We defined one in the previous "1.Feature" notebook, but we need to define it again here so that it exists in this notebook's scope.
Note that we choose to scale all numeric feature columns to the range [-1, 1].

In [1]:
import google.cloud.ml.features as features

class IrisFeatures(object):
  """This class is generated from command line:
        %ml features
        path: /content/datalab/tmp/ml/iris/data_train.csv
        headers: key,species,sepal_length,sepal_width,petal_length,petal_width
        target: species
        id: key
        Please modify it as appropriate!!!
  """
  csv_columns = ('key','species','sepal_length','sepal_width','petal_length','petal_width')
  species = features.target('species').discrete()
  key = features.key('key')
  measurements = [
      features.numeric('petal_width').max_abs_scale(1),
      features.numeric('sepal_length').max_abs_scale(1),
      features.numeric('petal_length').max_abs_scale(1),
      features.numeric('sepal_width').max_abs_scale(1),
  ]
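
To make the [-1, 1] scaling concrete, here is a minimal plain-Python sketch of max-abs scaling. It assumes that `max_abs_scale(1)` divides each value by the largest absolute value seen in the column; the SDK's exact behavior is not shown in this notebook, so treat the function and the sample values below as illustrative only.

```python
def max_abs_scale(values, scale=1.0):
    """Scale values into [-scale, scale] by dividing by the largest absolute value."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        # Every value is zero, so there is nothing to scale.
        return [0.0 for _ in values]
    return [scale * v / max_abs for v in values]


# Hypothetical petal_width values, not read from the Iris CSV file.
sample_petal_width = [0.2, 1.3, 2.5, 1.8]
scaled = max_abs_scale(sample_petal_width)
print(scaled)
```

In a real pipeline the maximum would normally be computed on the training data and then reused for the evaluation data, so that both sets share the same scale.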

Run "%mlalpha preprocess" and it generates an input cell for you to fill out.

In [None]:
%mlalpha preprocess

### Local Preprocessing

Fill in the cell input so that it looks like this:
```
%%mlalpha preprocess
train_data_path: /content/datalab/tmp/ml/iris/data_train.csv
eval_data_path: /content/datalab/tmp/ml/iris/data_eval.csv
data_format: CSV
output_dir: /content/datalab/tmp/ml/iris/preprocessed
feature_set_class_name: IrisFeatures
```
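
The cell body above is a list of `key: value` lines. As a rough sketch of how such a body could be checked before running the command, the snippet below parses it into a dictionary and reports missing keys. The parser and the `REQUIRED_KEYS` tuple are assumptions made for illustration; they are not Datalab's own implementation.

```python
def parse_cell_body(text):
    """Parse 'key: value' lines into a dict, skipping blank lines."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only, so values such as gs:// paths stay intact.
        key, sep, value = line.partition(':')
        if not sep:
            raise ValueError('expected "key: value", got %r' % line)
        config[key.strip()] = value.strip()
    return config


# Assumed set of keys, taken from the example cell above.
REQUIRED_KEYS = ('train_data_path', 'eval_data_path', 'data_format',
                 'output_dir', 'feature_set_class_name')

cell_body = """
train_data_path: /content/datalab/tmp/ml/iris/data_train.csv
eval_data_path: /content/datalab/tmp/ml/iris/data_eval.csv
data_format: CSV
output_dir: /content/datalab/tmp/ml/iris/preprocessed
feature_set_class_name: IrisFeatures
"""

config = parse_cell_body(cell_body)
missing = [key for key in REQUIRED_KEYS if key not in config]
print(missing)
```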

Then run it. The next cell shows the generated code. You can run the pipeline directly (it is a local pipeline) or extend it with more DataFlow transforms.

In [2]:

# header
"""
Following code is generated from command line:
%%mlalpha preprocess
train_data_path: /content/datalab/tmp/ml/iris/data_train.csv
eval_data_path: /content/datalab/tmp/ml/iris/data_eval.csv
data_format: CSV
output_dir: /content/datalab/tmp/ml/iris/preprocessed
feature_set_class_name: IrisFeatures

Please modify as appropriate!!!
"""

# imports
import apache_beam as beam
import google.cloud.ml as ml
import google.cloud.ml.io as io
import os

# defines
feature_set = IrisFeatures()
OUTPUT_DIR = '/content/datalab/tmp/ml/iris/preprocessed'
pipeline = beam.Pipeline('DirectPipelineRunner')


# preprocessing
training_data = beam.io.TextFileSource(
    '/content/datalab/tmp/ml/iris/data_train.csv',
    strip_trailing_newlines=True,
    coder=io.CsvCoder.from_feature_set(feature_set, feature_set.csv_columns))
train = pipeline | beam.Read('ReadTrainingData', training_data)

eval_data = beam.io.TextFileSource(
    '/content/datalab/tmp/ml/iris/data_eval.csv',
    strip_trailing_newlines=True,
    coder=io.CsvCoder.from_feature_set(feature_set, feature_set.csv_columns))
eval = pipeline  | beam.Read('ReadEvalData', eval_data)

(metadata, train_features, eval_features) = ((train, eval) | 'Preprocess'
    >> ml.Preprocess(feature_set, input_format='csv',
                  format_metadata={'headers': feature_set.csv_columns}))

(metadata        | 'SaveMetadata'
    >> io.SaveMetadata(os.path.join(OUTPUT_DIR, 'metadata.yaml')))

(train_features  | 'SaveTrain'
    >> io.SaveFeatures(os.path.join(OUTPUT_DIR, 'features_train'), shard_name_template=''))

(eval_features   | 'SaveEval'
    >> io.SaveFeatures(os.path.join(OUTPUT_DIR, 'features_eval'), shard_name_template=''))

# run pipeline
pipeline.run()




<apache_beam.runners.direct_runner.DirectPipelineResult at 0x7f9cb47c5410>

Run the pipeline without modification and check the output. The output is in compressed TFRecord format, which TensorFlow can consume directly.

In [3]:
!ls /content/datalab/tmp/ml/iris/preprocessed

features_eval.tfrecord.gz  features_train.tfrecord.gz  metadata.yaml
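
If you want a quick sanity check of the output without loading TensorFlow, the sketch below counts the records in a TFRecord file. It relies on the TFRecord framing (a little-endian uint64 length, a 4-byte CRC of the length, the payload, and a 4-byte CRC of the payload) and assumes the `.gz` files are plain gzip streams of that framing. The helper names are hypothetical, and the CRCs are skipped rather than verified.

```python
import gzip
import struct


def count_tfrecords(stream):
    """Count records in an uncompressed TFRecord byte stream (CRCs not verified)."""
    count = 0
    while True:
        header = stream.read(8)
        if not header:
            return count  # clean end of file
        if len(header) < 8:
            raise ValueError('truncated record header')
        (length,) = struct.unpack('<Q', header)
        stream.read(4)  # skip the CRC of the length field
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError('truncated record payload')
        stream.read(4)  # skip the CRC of the payload
        count += 1


def count_gzipped_tfrecords(path):
    """Open a gzip-compressed TFRecord file and count its records."""
    with gzip.open(path, 'rb') as stream:
        return count_tfrecords(stream)
```

For example, `count_gzipped_tfrecords('/content/datalab/tmp/ml/iris/preprocessed/features_train.tfrecord.gz')` should return the number of training rows, provided the assumptions above hold for the files this pipeline writes.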


### Cloud Preprocessing
You can also generate a Cloud DataFlow pipeline: just add "--cloud" to "%mlalpha preprocess".
To run it in the cloud, you need to:

1. Sign in using the sign-in button at the top right, if you have not done so already.
2. Set a default project by running '%projects set Your-Project-Id'.
3. Put your data in Cloud Storage.

Define variables that will be used later.

In [4]:
import os

bucket = 'gs://' + datalab_project_id() + '-sampledata'
train_data_path = os.path.join(bucket, 'iris', 'data_train.csv')
eval_data_path = os.path.join(bucket, 'iris', 'data_eval.csv')
output_dir = os.path.join(bucket, 'iris', 'preprocessed')
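
The cell above builds Cloud Storage paths with `os.path.join`, which uses the local operating system's separator. That works where the separator is `/`, but on Windows it would insert backslashes. A small sketch of a portable alternative using `posixpath` follows; `gcs_join` and the `my-project` project ID are placeholders introduced for this example.

```python
import posixpath


def gcs_join(bucket, *parts):
    """Join Cloud Storage path components with '/' regardless of the local OS."""
    return posixpath.join(bucket, *parts)


# 'my-project' is a placeholder project ID, not a real one.
example_bucket = 'gs://' + 'my-project' + '-sampledata'
example_train_path = gcs_join(example_bucket, 'iris', 'data_train.csv')
print(example_train_path)
```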

Create the GCS bucket and copy the sample data into it.

In [10]:
%%storage create --bucket $bucket

In [11]:
!gsutil cp gs://cloud-datalab/sampledata/ml/iris/data_train.csv $train_data_path
!gsutil cp gs://cloud-datalab/sampledata/ml/iris/data_train.csv $eval_data_path 

Copying gs://cloud-datalab/sampledata/ml/iris/data_train.csv [Content-Type=text/csv]...
/ [1 files][  3.8 KiB/  3.8 KiB]                                                
Operation completed over 1 objects/3.8 KiB.                                      
Copying gs://cloud-datalab/sampledata/ml/iris/data_train.csv [Content-Type=text/csv]...
/ [1 files][  3.8 KiB/  3.8 KiB]                                                
Operation completed over 1 objects/3.8 KiB.                                      


Below is the cell input used to generate the pipeline code.
```
%%mlalpha preprocess --cloud
train_data_path: $train_data_path
eval_data_path: $eval_data_path
data_format: CSV
output_dir: $output_dir
feature_set_class_name: IrisFeatures
```
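
The `$train_data_path`, `$eval_data_path` and `$output_dir` placeholders refer to the Python variables defined earlier, and the generated code contains their resolved values. The sketch below imitates that substitution with the standard library's `string.Template`. It is only an analogy: how Datalab resolves the variables internally is not shown here, and the `gs://my-project-sampledata` paths are made-up placeholders.

```python
from string import Template

cell_body = Template(
    'train_data_path: $train_data_path\n'
    'eval_data_path: $eval_data_path\n'
    'data_format: CSV\n'
    'output_dir: $output_dir\n'
    'feature_set_class_name: IrisFeatures\n'
)

# Placeholder values standing in for the notebook variables.
variables = {
    'train_data_path': 'gs://my-project-sampledata/iris/data_train.csv',
    'eval_data_path': 'gs://my-project-sampledata/iris/data_eval.csv',
    'output_dir': 'gs://my-project-sampledata/iris/preprocessed',
}

# substitute() raises KeyError if a placeholder has no matching variable.
resolved = cell_body.substitute(variables)
print(resolved)
```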

This generates a Cloud DataFlow pipeline. Run the code to start a DataFlow job in the cloud.

In [5]:

# header
"""
Following code is generated from command line:
%%mlalpha preprocess --cloud
train_data_path: $train_data_path
eval_data_path: $eval_data_path
data_format: CSV
output_dir: $output_dir
feature_set_class_name: IrisFeatures

Please modify as appropriate!!!
"""

# imports
import apache_beam as beam
import google.cloud.ml as ml
import google.cloud.ml.io as io
import os

# defines
feature_set = IrisFeatures()
OUTPUT_DIR = 'gs://cloud-ml-test-automated-sampledata/iris/preprocessed'
import datetime
options = {
    'staging_location': os.path.join(OUTPUT_DIR, 'tmp', 'staging'),
    'temp_location': os.path.join(OUTPUT_DIR, 'tmp'),
    'job_name': 'preprocess-irisfeatures' + '-' + datetime.datetime.now().strftime('%y%m%d-%H%M%S'),
    'project': 'cloud-ml-test-automated',
    'extra_packages': ['gs://cloud-ml/sdk/cloudml-0.1.6-alpha.tar.gz'],
    'teardown_policy': 'TEARDOWN_ALWAYS',
    'no_save_main_session': True
}
opts = beam.pipeline.PipelineOptions(flags=[], **options)
pipeline = beam.Pipeline('DataflowPipelineRunner', options=opts)


# preprocessing
training_data = beam.io.TextFileSource(
    'gs://cloud-ml-test-automated-sampledata/iris/data_train.csv',
    strip_trailing_newlines=True,
    coder=io.CsvCoder.from_feature_set(feature_set, feature_set.csv_columns))
train = pipeline | beam.Read('ReadTrainingData', training_data)

eval_data = beam.io.TextFileSource(
    'gs://cloud-ml-test-automated-sampledata/iris/data_eval.csv',
    strip_trailing_newlines=True,
    coder=io.CsvCoder.from_feature_set(feature_set, feature_set.csv_columns))
eval = pipeline  | beam.Read('ReadEvalData', eval_data)

(metadata, train_features, eval_features) = ((train, eval) | 'Preprocess'
    >> ml.Preprocess(feature_set, input_format='csv',
                  format_metadata={'headers': feature_set.csv_columns}))

(metadata        | 'SaveMetadata'
    >> io.SaveMetadata(os.path.join(OUTPUT_DIR, 'metadata.yaml')))

(train_features  | 'SaveTrain'
    >> io.SaveFeatures(os.path.join(OUTPUT_DIR, 'features_train'), shard_name_template=''))

(eval_features   | 'SaveEval'
    >> io.SaveFeatures(os.path.join(OUTPUT_DIR, 'features_eval'), shard_name_template=''))

# run pipeline
pipeline.run()




<DataflowPipelineResult <Job
 id: u'2016-09-28_20_35_53-16610951082666315627'
 projectId: u'cloud-ml-test-automated'
 steps: []
 tempFiles: []
 type: TypeValueValuesEnum(JOB_TYPE_BATCH, 1)> at 0x7f9ca475f590>
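
The generated code builds `job_name` from a fixed prefix plus a timestamp, so each run gets a distinct name. The sketch below reproduces that naming scheme in a small helper and checks the result against a pattern of lowercase letters, digits and hyphens. The helper name and the pattern are assumptions for illustration; check the DataFlow documentation for the authoritative job naming rules.

```python
import datetime
import re

# Assumed rule: starts with a letter, then lowercase letters, digits or
# hyphens, and does not end with a hyphen.
JOB_NAME_PATTERN = re.compile(r'^[a-z]([-a-z0-9]*[a-z0-9])?$')


def make_job_name(feature_set_class_name, now=None):
    """Build 'preprocess-<class name>-<timestamp>' like the generated code does."""
    now = now or datetime.datetime.now()
    prefix = 'preprocess-' + feature_set_class_name.lower()
    return prefix + '-' + now.strftime('%y%m%d-%H%M%S')


# A fixed timestamp keeps the example reproducible.
job_name = make_job_name('IrisFeatures', datetime.datetime(2016, 9, 28, 20, 35, 53))
print(job_name)  # preprocess-irisfeatures-160928-203553
```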

After you run the generated pipeline code, you can view the DataFlow job in the Developer Console at https://console.cloud.google.com/dataflow (select the right project). Then run the following to make sure the files were generated.

In [6]:
!gsutil ls $output_dir

gs://cloud-ml-test-automated-sampledata/iris/preprocessed/features_eval.tfrecord.gz
gs://cloud-ml-test-automated-sampledata/iris/preprocessed/features_train.tfrecord.gz
gs://cloud-ml-test-automated-sampledata/iris/preprocessed/metadata.yaml
gs://cloud-ml-test-automated-sampledata/iris/preprocessed/tmp/
