### Preprocess

Once you have gathered your data and decided how to preprocess it (that is, a featureset class is already defined), you can run the preprocessing. One way to do this is with DataFlow. If your data is large, DataFlow can run in the cloud in a distributed fashion; if it is small, you can run DataFlow locally.

CloudML provides a preprocessing DataFlow transform that can be plugged directly into a pipeline.

Datalab provides a generated code template through the "%ml preprocess" command, so you don't have to author your DataFlow pipeline from scratch.

Preprocessing requires a featureset class. We defined one in the previous "1.Feature" notebook, but it must be defined again within the scope of this notebook.
Note that all numeric feature columns are scaled to [-1, 1] here: the .identity() transform has been removed, so each column uses the default transform (scaling to [-1, 1]).

In [1]:
import google.cloud.ml.features as features

class CensusFeatures(object):
  """This class is generated from command line:
        %ml features
        path: /content/datalab/tmp/ml/census/data_train.csv
        headers: SERIALNO,PUMA,NP,ACCESS,ACR,AGS,BATH,BDSP,BLD,BROADBND,BUS,COMPOTHX,CONP,DIALUP,DSL,ELEP,FIBEROP,FS,FULP,GASP,HANDHELD,HFL,INSP,LAPTOP,MHP,MODEM,MRGI,MRGP,MRGT,MRGX,OTHSVCEX,REFR,RMSP,RNTM,RNTP,RWAT,SATELLITE,SINK,SMP,STOV,TEL,TEN,TOIL,VALP,VEH,WATP,YBL,FES,FPARC,GRNTP,HHL,HHT,HINCP,HUGCL,HUPAC,HUPAOC,HUPARC,KIT,LNGI,MULTG,MV,NOC,NPF,NPP,NR,NRC,PARTNER,PLM,PSF,R18,R60,R65,RESMODE,SMOCP,SMX,SRNT,SSMC,SVAL,TAXP,WIF,WKEXREL,WORKSTAT
        target: HINCP
        id: SERIALNO
        Please modify it as appropriate!!!
  """
  csv_columns = ('SERIALNO','PUMA','NP','ACCESS','ACR','AGS','BATH','BDSP','BLD','BROADBND','BUS','COMPOTHX','CONP','DIALUP','DSL','ELEP','FIBEROP','FS','FULP','GASP','HANDHELD','HFL','INSP','LAPTOP','MHP','MODEM','MRGI','MRGP','MRGT','MRGX','OTHSVCEX','REFR','RMSP','RNTM','RNTP','RWAT','SATELLITE','SINK','SMP','STOV','TEL','TEN','TOIL','VALP','VEH','WATP','YBL','FES','FPARC','GRNTP','HHL','HHT','HINCP','HUGCL','HUPAC','HUPAOC','HUPARC','KIT','LNGI','MULTG','MV','NOC','NPF','NPP','NR','NRC','PARTNER','PLM','PSF','R18','R60','R65','RESMODE','SMOCP','SMX','SRNT','SSMC','SVAL','TAXP','WIF','WKEXREL','WORKSTAT')
  target = features.target('HINCP').continuous()
  key = features.key('SERIALNO')
  inputs = [
      features.numeric('CONP'),
      features.numeric('WATP'),
      features.numeric('FS'),
      features.numeric('SMX'),
      features.numeric('PSF'),
      features.numeric('STOV'),
      features.numeric('MULTG'),
      features.numeric('WKEXREL'),
      features.numeric('BATH'),
      features.numeric('INSP'),
      features.numeric('ACR'),
      features.numeric('NPF'),
      features.numeric('YBL'),
      features.numeric('HFL'),
      features.numeric('TAXP'),
      features.numeric('GASP'),
      features.numeric('GRNTP'),
      features.numeric('MODEM'),
      features.numeric('AGS'),
      features.numeric('FIBEROP'),
      features.numeric('RESMODE'),
      features.numeric('SATELLITE'),
      features.numeric('DIALUP'),
      features.numeric('TEL'),
      features.numeric('TEN'),
      features.numeric('R18'),
      features.numeric('BUS'),
      features.numeric('HUPAC'),
      features.numeric('SMOCP'),
      features.numeric('HANDHELD'),
      features.numeric('HUPARC'),
      features.numeric('ELEP'),
      features.numeric('RMSP'),
      features.numeric('R60'),
      features.numeric('VEH'),
      features.numeric('NP'),
      features.numeric('NR'),
      features.numeric('SRNT'),
      features.numeric('RNTM'),
      features.numeric('OTHSVCEX'),
      features.numeric('RNTP'),
      features.numeric('MRGI'),
      features.numeric('WIF'),
      features.numeric('LAPTOP'),
      features.numeric('REFR'),
      features.numeric('TOIL'),
      features.numeric('DSL'),
      features.numeric('FPARC'),
      features.numeric('MRGX'),
      features.numeric('FES'),
      features.numeric('HHT'),
      features.numeric('MRGT'),
      features.numeric('BLD'),
      features.numeric('SMP'),
      features.numeric('MRGP'),
      features.numeric('WORKSTAT'),
      features.numeric('MHP'),
      features.numeric('FULP'),
      features.numeric('HUGCL'),
      features.numeric('SSMC'),
      features.numeric('PUMA'),
      features.numeric('LNGI'),
      features.numeric('VALP'),
      features.numeric('NRC'),
      features.numeric('BDSP'),
      features.numeric('HUPAOC'),
      features.numeric('KIT'),
      features.numeric('ACCESS'),
      features.numeric('R65'),
      features.numeric('NOC'),
      features.numeric('MV'),
      features.numeric('COMPOTHX'),
      features.numeric('SVAL'),
      features.numeric('RWAT'),
      features.numeric('BROADBND'),
      features.numeric('PARTNER'),
      features.numeric('PLM'),
      features.numeric('HHL'),
      features.numeric('NPP'),
      features.numeric('SINK'),
  ]
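
The default numeric transform noted before the class definition scales each column into [-1, 1]. The exact formula the SDK applies is not shown in this notebook, so the sketch below simply assumes a linear min-max rescaling. `scale_to_unit_range` is a hypothetical helper written for illustration and is not part of the Cloud ML SDK.

```python
def scale_to_unit_range(values):
    """Linearly rescale numbers so the minimum maps to -1 and the maximum to 1.

    Hypothetical helper for illustration; not part of the Cloud ML SDK.
    Assumes plain min-max scaling, which may differ from the SDK's default.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map it to 0 by convention.
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


scaled = scale_to_unit_range([0, 5, 10])
print(scaled)  # [-1.0, 0.0, 1.0]
```

Under this assumption, the smallest value in a column becomes -1, the largest becomes 1, and everything else falls proportionally in between.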


Run %ml preprocess on its own to generate an input cell for you to fill out.

In [None]:
%ml preprocess

Fill in the generated input cell.

In [None]:
%ml preprocess
train_data_path: /content/datalab/tmp/ml/census/data_train.csv
eval_data_path: /content/datalab/tmp/ml/census/data_eval.csv
data_format: CSV
output_dir: /content/datalab/tmp/ml/census/preprocessed
feature_set_class_name: CensusFeatures

### Local Preprocessing

Running the cell generates the code below. You can run the pipeline as is (it is a local pipeline) or extend it with more DataFlow transforms.

```

# header
"""
Following code is generated from command line:
%ml preprocess
train_data_path: /content/datalab/tmp/ml/census/data_train.csv
eval_data_path: /content/datalab/tmp/ml/census/data_eval.csv
data_format: CSV
output_dir: /content/datalab/tmp/ml/census/preprocessed
feature_set_class_name: CensusFeatures

Please modify as appropriate!!!
"""

# imports
import apache_beam as beam
import google.cloud.ml as ml
import google.cloud.ml.io as io
import os

# defines
feature_set = CensusFeatures()
OUTPUT_DIR = '/content/datalab/tmp/ml/census/preprocessed'
pipeline = beam.Pipeline('DirectPipelineRunner')


# preprocessing
training_data = beam.io.TextFileSource(
    '/content/datalab/tmp/ml/census/data_train.csv',
    strip_trailing_newlines=True,
    coder=io.CsvCoder.from_feature_set(feature_set, feature_set.csv_columns))
train = pipeline | beam.Read('ReadTrainingData', training_data)

eval_data = beam.io.TextFileSource(
    '/content/datalab/tmp/ml/census/data_eval.csv',
    strip_trailing_newlines=True,
    coder=io.CsvCoder.from_feature_set(feature_set, feature_set.csv_columns))
eval = pipeline | beam.Read('ReadEvalData', eval_data)

(metadata, train_features, eval_features) = ((train, eval) |
    ml.Preprocess('Preprocess', feature_set, input_format='csv',
                  format_metadata={'headers': feature_set.csv_columns}))

(metadata      | 'SaveMetadata'
    >> io.SaveMetadata(os.path.join(OUTPUT_DIR, 'metadata.yaml')))

(train_features | 'SaveTrain'
    >> io.SaveFeatures(os.path.join(OUTPUT_DIR, 'features_train'), shard_name_template=''))

(eval_features     | 'SaveEval'
    >> io.SaveFeatures(os.path.join(OUTPUT_DIR, 'features_eval'), shard_name_template=''))

# run pipeline
pipeline.run()

```

In [4]:
!ls /content/datalab/tmp/ml/census/preprocessed

features_eval  features_train  info  metadata.yaml
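
The same check can be done from Python. The sketch below assumes the four entries shown in the listing above are the ones to expect; `missing_outputs` is a hypothetical helper, not part of Datalab or the Cloud ML SDK.

```python
import os

# Entries taken from the directory listing above.
EXPECTED_OUTPUTS = ('features_eval', 'features_train', 'info', 'metadata.yaml')


def missing_outputs(output_dir, expected=EXPECTED_OUTPUTS):
    """Return the expected entries that are absent from output_dir, sorted."""
    if os.path.isdir(output_dir):
        present = set(os.listdir(output_dir))
    else:
        # A missing directory means nothing has been written yet.
        present = set()
    return sorted(name for name in expected if name not in present)


missing = missing_outputs('/content/datalab/tmp/ml/census/preprocessed')
print('missing outputs: %s' % missing)
```

An empty list means every expected entry was written by the local pipeline.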


### Cloud Preprocessing
You can also generate a Cloud DataFlow pipeline by adding "--cloud" to "%ml preprocess".

To run it in the cloud, you need to:

1. Sign in using the sign-in button at the top right, if you have not done so already.
2. Set a default project by running '%projects set Your-Project-Id'.
3. Put your data in Cloud Storage.

Define variables that will be used later.

In [3]:
import os

bucket = 'gs://' + datalab_project_id() + '-sampledata'
train_data_path = os.path.join(bucket, 'census', 'data_train.csv')
eval_data_path = os.path.join(bucket, 'census', 'data_eval.csv')
output_dir = os.path.join(bucket, 'census', 'preprocessed')
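
`os.path.join` yields forward-slash paths in the Linux environment Datalab runs in, but it would insert backslashes on Windows. As a minimal sketch, the same paths can be built with `posixpath.join`, which always uses `/`. The helper `gcs_path` and the bucket name below are hypothetical placeholders chosen for illustration.

```python
import posixpath


def gcs_path(bucket, *parts):
    """Join a 'gs://...' bucket URL and path parts using forward slashes."""
    return posixpath.join(bucket, *parts)


# Placeholder bucket; in the notebook it is built from datalab_project_id().
example_bucket = 'gs://my-project-sampledata'
example_train_path = gcs_path(example_bucket, 'census', 'data_train.csv')
print(example_train_path)  # gs://my-project-sampledata/census/data_train.csv
```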

In [7]:
%%storage create --bucket $bucket

In [6]:
!gsutil cp gs://cloud-datalab/sampledata/ml/census/data_train.csv $train_data_path
!gsutil cp gs://cloud-datalab/sampledata/ml/census/data_eval.csv $eval_data_path

Copying gs://cloud-datalab/sampledata/ml/census/data_train.csv [Content-Type=text/csv]...
Copying     ...-test-automated-sampledata/census/data_train.csv: 4.24 MiB/4.24 MiB    
Copying gs://cloud-datalab/sampledata/ml/census/data_eval.csv [Content-Type=text/csv]...
Copying     ...l-test-automated-sampledata/census/data_eval.csv: 482.12 KiB/482.12 KiB    


Below is the filled-in input cell that generates the cloud pipeline code.

In [None]:
%ml preprocess --cloud
train_data_path: $train_data_path
eval_data_path: $eval_data_path 
data_format: CSV
output_dir: $output_dir
feature_set_class_name: CensusFeatures

It will generate code like the following:

```

# header
"""
Following code is generated from command line:
%ml preprocess --cloud
train_data_path: $train_data_path
eval_data_path: $eval_data_path 
data_format: CSV
output_dir: $output_dir
feature_set_class_name: CensusFeatures

Please modify as appropriate!!!
"""

# imports
import apache_beam as beam
import google.cloud.ml as ml
import google.cloud.ml.io as io
import os

# defines
feature_set = CensusFeatures()
OUTPUT_DIR = 'gs://cloud-ml-test-automated-sampledata/census/preprocessed'
import datetime
options = {
    'staging_location': os.path.join(OUTPUT_DIR, 'tmp', 'staging'),
    'temp_location': os.path.join(OUTPUT_DIR, 'tmp'),
    'job_name': 'preprocess-censusfeatures' + '-' + datetime.datetime.now().strftime('%y%m%d-%H%M%S'),
    'project': 'cloud-ml-test-automated',
    'extra_packages': ['gs://cloud-ml/sdk/cloudml-0.1.4.tar.gz'],
    'teardown_policy': 'TEARDOWN_ALWAYS',
    'no_save_main_session': True
}
opts = beam.pipeline.PipelineOptions(flags=[], **options)
pipeline = beam.Pipeline('DataflowPipelineRunner', options=opts)


# preprocessing
training_data = beam.io.TextFileSource(
    'gs://cloud-ml-test-automated-sampledata/census/data_train.csv',
    strip_trailing_newlines=True,
    coder=io.CsvCoder.from_feature_set(feature_set, feature_set.csv_columns))
train = pipeline | beam.Read('ReadTrainingData', training_data)

eval_data = beam.io.TextFileSource(
    'gs://cloud-ml-test-automated-sampledata/census/data_eval.csv',
    strip_trailing_newlines=True,
    coder=io.CsvCoder.from_feature_set(feature_set, feature_set.csv_columns))
eval = pipeline | beam.Read('ReadEvalData', eval_data)

(metadata, train_features, eval_features) = ((train, eval) |
    ml.Preprocess('Preprocess', feature_set, input_format='csv',
                  format_metadata={'headers': feature_set.csv_columns}))

(metadata      | 'SaveMetadata'
    >> io.SaveMetadata(os.path.join(OUTPUT_DIR, 'metadata.yaml')))

(train_features | 'SaveTrain'
    >> io.SaveFeatures(os.path.join(OUTPUT_DIR, 'features_train'), shard_name_template=''))

(eval_features     | 'SaveEval'
    >> io.SaveFeatures(os.path.join(OUTPUT_DIR, 'features_eval'), shard_name_template=''))

# run pipeline
pipeline.run()
```
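
The `job_name` option in the generated code appends a timestamp to what appears to be the lowercased feature set class name, so repeated runs get distinct job names. A sketch of that construction follows. `make_job_name` is a hypothetical helper, and its optional `now` parameter is added here only to make the result reproducible.

```python
import datetime


def make_job_name(feature_set_class_name, now=None):
    """Build a job name such as 'preprocess-censusfeatures-160914-153000'.

    Hypothetical helper; mirrors the 'job_name' expression in the
    generated code, assuming the class name is lowercased.
    """
    now = now or datetime.datetime.now()
    return ('preprocess-' + feature_set_class_name.lower() + '-' +
            now.strftime('%y%m%d-%H%M%S'))


name = make_job_name('CensusFeatures', datetime.datetime(2016, 9, 14, 15, 30, 0))
print(name)  # preprocess-censusfeatures-160914-153000
```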

After you run the generated pipeline code above, you can view the DataFlow job in the Developer Console: https://pantheon.corp.google.com/dataflow (select the right project). Then run the following to confirm that the files were generated.

In [10]:
!gsutil list $output_dir

gs://cloud-ml-test-automated-sampledata/census/preprocessed/features_eval
gs://cloud-ml-test-automated-sampledata/census/preprocessed/features_train
gs://cloud-ml-test-automated-sampledata/census/preprocessed/info
gs://cloud-ml-test-automated-sampledata/census/preprocessed/metadata.yaml
gs://cloud-ml-test-automated-sampledata/census/preprocessed/tmp/
