<a name="about"></a>
About this notebook
======

This notebook assumes you have run the local Census Regression notebook and have not deleted the LOCAL_ROOT folder. In this notebook, we run batch prediction with a pre-trained TensorFlow model using Google Cloud Machine Learning Engine services. It does not assume that the notebook "4. Census Regression Cloud Prediction" was executed.

<a name="setup"></a>
Setting things up
=====

In [1]:
import datalab_structured_data as sd

Let's look at the versions of structured_data (SD) and TensorFlow (TF) we have. Make sure TF is 1.0.0 and SD is 0.0.1.

In [2]:
import os

import tensorflow as tf
from tensorflow.python.lib.io import file_io

print('tf ' + str(tf.__version__))
print('sd ' + str(sd.__version__))

tf 1.0.0
sd 0.0.1
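
Rather than comparing the printed versions by eye, the check can fail loudly. The following is a minimal sketch; the helper name `assert_versions` is hypothetical and not part of either package.

```python
def assert_versions(actual, expected):
    """Raise RuntimeError if any package version differs from the expected one.

    actual, expected: dicts mapping a package label to a version string.
    """
    mismatches = {
        name: (actual.get(name), wanted)
        for name, wanted in expected.items()
        if actual.get(name) != wanted
    }
    if mismatches:
        details = ', '.join(
            '%s is %s, expected %s' % (name, got, wanted)
            for name, (got, wanted) in sorted(mismatches.items())
        )
        raise RuntimeError('Version mismatch: ' + details)
```

In this notebook it could be called as `assert_versions({'tf': tf.__version__, 'sd': sd.__version__}, {'tf': '1.0.0', 'sd': '0.0.1'})`.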


This notebook writes files during prediction. Set the local and cloud root folders you wish to use.

In [3]:
LOCAL_ROOT = './census_regression_workspace' # This should be the same as what was used in the local census notebook
CLOUD_ROOT = 'gs://' + datalab_project_id() + '-census-regression-datalab'

if not file_io.file_exists(LOCAL_ROOT):
  raise ValueError('LOCAL_ROOT not found. Did you run the local notebook?')
!gsutil mb {CLOUD_ROOT}

Creating gs://cloud-ml-dev-census-regression-datalab/...
ServiceException: 409 Bucket cloud-ml-dev-census-regression-datalab already exists.
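
The 409 above only means the bucket already existed, so it is harmless here. If you would rather avoid the error message, one option is to ask `gsutil ls -b` whether the bucket exists before calling `gsutil mb`. This is a sketch under assumptions: the helper `ensure_bucket` is hypothetical, and a non-zero exit code from `gsutil ls -b` is treated as "bucket missing" even though it could also indicate a permissions problem.

```python
import subprocess


def ensure_bucket(bucket_url, run=subprocess.call):
    """Create bucket_url only if `gsutil ls -b` cannot find it.

    run: callable that takes a command list and returns its exit code
         (injectable so the logic can be exercised without gsutil).
    Returns True if a create command was issued, False otherwise.
    """
    # Assumption: exit code 0 means the bucket exists and is visible to us.
    if run(['gsutil', 'ls', '-b', bucket_url]) == 0:
        return False
    run(['gsutil', 'mb', bucket_url])
    return True
```

In the notebook this would be called as `ensure_bucket(CLOUD_ROOT)`.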


First, let us copy the CSV files and the output of training to GCS.

In [4]:
!gsutil -m cp {os.path.join(LOCAL_ROOT, '*_data.csv')} {CLOUD_ROOT}
!gsutil -m cp -r {os.path.join(LOCAL_ROOT, 'training')} {CLOUD_ROOT}

Copying file://./census_regression_workspace/train_data.csv [Content-Type=text/csv]...
Copying file://./census_regression_workspace/predict_data.csv [Content-Type=text/csv]...
Copying file://./census_regression_workspace/eval_data.csv [Content-Type=text/csv]...
/ [3/3 files][200.1 KiB/200.1 KiB] 100% Done                                    
Operation completed over 3 objects/200.1 KiB.                                    
Copying file://./census_regression_workspace/training/model/variables/variables.data-00000-of-00001 [Content-Type=application/octet-stream]...
Copying file://./census_regression_workspace/training/model/saved_model.pb [Content-Type=application/octet-stream]...
Copying file://./census_regression_workspace/training/evaluation_model/variables/variables.index [Content-Type=application/octet-stream]...
Copying file://./census_regression_workspace/training/model/assets.extra/schema.json [Content-Type=application/json]...
Copying file://./census_regression_workspace/training/

In [5]:
!gsutil ls {CLOUD_ROOT}/training

gs://cloud-ml-dev-census-regression-datalab/training/evaluation_model/
gs://cloud-ml-dev-census-regression-datalab/training/model/
gs://cloud-ml-dev-census-regression-datalab/training/staging/
gs://cloud-ml-dev-census-regression-datalab/training/train/


<a name="local_preprocessing"></a>
ML Engine Batch Prediction
=====

Batch prediction has two modes. In 'evaluation' mode, the input data must match the training schema exactly, meaning the target column must exist in the data. In 'prediction' mode, the input data files must match the training schema except that the target column is missing. Note that batch prediction can be slow on small datasets because it takes a while for a Dataflow job to start.
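
To make the difference between the two modes concrete, here is a small sketch that guesses the mode from the column count of the first CSV row. It is not part of structured_data; the helper name `infer_batch_mode` and the example column names are hypothetical, and it assumes the CSV files have no header row.

```python
import csv


def infer_batch_mode(lines, schema_columns, target_column):
    """Guess the batch prediction mode from the first CSV row.

    lines: iterable of CSV text lines (assumed to have no header row).
    schema_columns: column names of the training schema, target included.
    target_column: name of the target column in the training schema.
    """
    if target_column not in schema_columns:
        raise ValueError('target column %r is not in the schema' % target_column)
    first_row = next(csv.reader(lines))
    if len(first_row) == len(schema_columns):
        return 'evaluation'  # every schema column, including the target
    if len(first_row) == len(schema_columns) - 1:
        return 'prediction'  # every schema column except the target
    raise ValueError(
        'row has %d columns; expected %d or %d'
        % (len(first_row), len(schema_columns), len(schema_columns) - 1)
    )
```

For example, with a hypothetical three-column schema `['age', 'education', 'income']` and target `'income'`, a row with three values maps to 'evaluation' and a row with two values maps to 'prediction'. A column count alone cannot tell which column is missing, so treat the result as a sanity check only.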

In [6]:
!gsutil -m rm -r {CLOUD_ROOT}/batch_prediction

CommandException: 1 files/objects could not be removed.


In [None]:
sd.cloud_batch_predict(
  training_ouput_dir=os.path.join(CLOUD_ROOT, 'training'),
  prediction_input_file=os.path.join(CLOUD_ROOT, 'eval_data.csv'),
  output_dir=str(os.path.join(CLOUD_ROOT, 'batch_prediction')),
  mode='evaluation',
  output_format='json'
)


Building package and uploading to gs://cloud-ml-dev-census-regression-datalab/batch_prediction/staging/sd.tar.gz
Starting cloud batch prediction.
gs://cloud-ml-dev-census-regression-datalab/eval_data.csv
<type 'unicode'>
Dataflow Job submitted, see Job structured-data-batch-prediction-20170223194250 at https://console.developers.google.com/dataflow?project=cloud-ml-dev



Using fallback coder for typehint: Any.



When prediction is done, {CLOUD_ROOT}/batch_prediction should contain the prediction files and an errors file (which should be empty).

In [8]:
!gsutil ls  {CLOUD_ROOT}/batch_prediction

gs://cloud-ml-dev-census-regression-datalab/batch_prediction/staging/


Cleaning things up
=====

If you want to delete the files you made on GCS, uncomment and run the next cell.

In [None]:
#!gsutil rm -fr {CLOUD_ROOT}