<h1> Feature Engineering </h1>

In this notebook, you will learn how to incorporate feature engineering into your pipeline.
<ul>
<li> Working with feature columns </li>
<li> Adding feature crosses in TensorFlow </li>
<li> Reading data from BigQuery </li>
<li> Creating datasets using Dataflow </li>
<li> Using a wide-and-deep model </li>
</ul>

In [1]:
!sudo chown -R jupyter:jupyter /home/jupyter/training-data-analyst

In [2]:
!pip install --user apache-beam[gcp]==2.24.0 
!pip install --user httplib2==0.12.0 

Collecting apache-beam[gcp]==2.24.0
  Downloading apache_beam-2.24.0-cp37-cp37m-manylinux2010_x86_64.whl (8.5 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m8.5/8.5 MB[0m [31m47.5 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25hCollecting avro-python3!=1.9.2,<1.10.0,>=1.8.1
  Downloading avro-python3-1.9.2.1.tar.gz (37 kB)
  Preparing metadata (setup.py) ... [?25ldone
Collecting oauth2client<4,>=2.0.1
  Downloading oauth2client-3.0.0.tar.gz (77 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m77.2/77.2 KB[0m [31m8.1 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25ldone
Collecting pyarrow<0.18.0,>=0.15.1
  Downloading pyarrow-0.17.1-cp37-cp37m-manylinux2014_x86_64.whl (63.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m63.8/63.8 MB[0m [31m15.5 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25hCollecting fastavro<0.24,>=0.21.4
  Downloading fastavro-0.23.6-cp37-cp37m-manylinux2010_

NOTE: In the output of the above cell you may ignore any WARNINGS or ERRORS related to the following:  "apache-beam", "pyarrow", "tensorflow-transform", "tensorflow-model-analysis", "tensorflow-data-validation", "joblib", "google-cloud-storage" etc.

If you get any related errors mentioned above please rerun the above cell.

**Note**: Restart your kernel to use updated packages.

In [1]:
import tensorflow as tf
import apache_beam as beam
import shutil
print(tf.__version__)

2.6.3


<h2> 1. Environment variables for project and bucket </h2>

1. Your project id is the *unique* string that identifies your project (not the project name). You can find this from the GCP Console dashboard's Home page.  My dashboard reads:  <b>Project ID:</b> cloud-training-demos 
2.  Cloud training often involves saving and restoring model files. Therefore, we should <b>create a single-region bucket</b>. If you don't have a bucket already, I suggest that you create one from the GCP console (because it will dynamically check whether the bucket name you want is available) 
<b>Change the cell below</b> to reflect your Project ID and bucket name.


In [5]:
!gcloud compute regions list

NAME                     CPUS  DISKS_GB  ADDRESSES  RESERVED_ADDRESSES  STATUS  TURNDOWN_DATE
asia-east1               0/24  0/4096    0/8        0/8                 UP
asia-east2               0/24  0/4096    0/8        0/8                 UP
asia-northeast1          0/24  0/4096    0/8        0/8                 UP
asia-northeast2          0/24  0/4096    0/8        0/8                 UP
asia-northeast3          0/24  0/4096    0/8        0/8                 UP
asia-south1              0/24  0/4096    0/8        0/8                 UP
asia-south2              0/24  0/4096    0/8        0/8                 UP
asia-southeast1          0/24  0/4096    0/8        0/8                 UP
asia-southeast2          0/24  0/4096    0/8        0/8                 UP
australia-southeast1     4/24  200/4096  1/8        0/8                 UP
australia-southeast2     0/24  0/4096    0/8        0/8                 UP
europe-central2          0/24  0/4096    0/8        0/8                 UP
europe

In [6]:
import os
PROJECT = 'qwiklabs-gcp-01-bf4a5bbdb309'    # CHANGE THIS
BUCKET = 'qwiklabs-gcp-01-bf4a5bbdb309' # REPLACE WITH YOUR BUCKET NAME. Use a regional bucket in the region you selected.
REGION = 'asia-southeast1' # Choose an available region for Cloud AI Platform

In [7]:
# for bash
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
os.environ['TFVERSION'] = '2.6' 

## ensure we're using python3 env
os.environ['CLOUDSDK_PYTHON'] = 'python3.7'

In [8]:
%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

## ensure we predict locally with our current Python environment
gcloud config set ml_engine/local_python `which python`

Updated property [core/project].
Updated property [compute/region].
Updated property [ml_engine/local_python].


<h2> 2. Specifying query to pull the data </h2>

Let's pull out a few extra columns from the timestamp.

In [9]:
def create_query(phase, EVERY_N):
  if EVERY_N == None:
    EVERY_N = 4 #use full dataset
    
  #select and pre-process fields
  base_query = """
#legacySQL
SELECT
  (tolls_amount + fare_amount) AS fare_amount,
  DAYOFWEEK(pickup_datetime) AS dayofweek,
  HOUR(pickup_datetime) AS hourofday,
  pickup_longitude AS pickuplon,
  pickup_latitude AS pickuplat,
  dropoff_longitude AS dropofflon,
  dropoff_latitude AS dropofflat,
  passenger_count*1.0 AS passengers,
  CONCAT(STRING(pickup_datetime), STRING(pickup_longitude), STRING(pickup_latitude), STRING(dropoff_latitude), STRING(dropoff_longitude)) AS key
FROM
  [nyc-tlc:yellow.trips]
WHERE
  trip_distance > 0
  AND fare_amount >= 2.5
  AND pickup_longitude > -78
  AND pickup_longitude < -70
  AND dropoff_longitude > -78
  AND dropoff_longitude < -70
  AND pickup_latitude > 37
  AND pickup_latitude < 45
  AND dropoff_latitude > 37
  AND dropoff_latitude < 45
  AND passenger_count > 0
  """
  
  # add subsampling criteria by modding with hashkey
  if phase == 'train': 
    query = "{} AND ABS(HASH(pickup_datetime)) % {} < 2".format(base_query,EVERY_N)
  elif phase == 'valid': 
    query = "{} AND ABS(HASH(pickup_datetime)) % {} == 2".format(base_query,EVERY_N)
  elif phase == 'test':
    query = "{} AND ABS(HASH(pickup_datetime)) % {} == 3".format(base_query,EVERY_N)
  return query
    
print(create_query('valid', 100)) #example query using 1% of data


#legacySQL
SELECT
  (tolls_amount + fare_amount) AS fare_amount,
  DAYOFWEEK(pickup_datetime) AS dayofweek,
  HOUR(pickup_datetime) AS hourofday,
  pickup_longitude AS pickuplon,
  pickup_latitude AS pickuplat,
  dropoff_longitude AS dropofflon,
  dropoff_latitude AS dropofflat,
  passenger_count*1.0 AS passengers,
  CONCAT(STRING(pickup_datetime), STRING(pickup_longitude), STRING(pickup_latitude), STRING(dropoff_latitude), STRING(dropoff_longitude)) AS key
FROM
  [nyc-tlc:yellow.trips]
WHERE
  trip_distance > 0
  AND fare_amount >= 2.5
  AND pickup_longitude > -78
  AND pickup_longitude < -70
  AND dropoff_longitude > -78
  AND dropoff_longitude < -70
  AND pickup_latitude > 37
  AND pickup_latitude < 45
  AND dropoff_latitude > 37
  AND dropoff_latitude < 45
  AND passenger_count > 0
   AND ABS(HASH(pickup_datetime)) % 100 == 2


Try the query above in https://bigquery.cloud.google.com/table/nyc-tlc:yellow.trips if you want to see what it does (ADD LIMIT 10 to the query!)

<h2> 3. Preprocessing Dataflow job from BigQuery </h2>

This code reads from BigQuery and saves the data as-is on Google Cloud Storage.  We can do additional preprocessing and cleanup inside Dataflow, but then we'll have to remember to repeat that preprocessing during inference. It is better to use tf.transform which will do this book-keeping for you, or to do preprocessing within your TensorFlow model. We will look at this in future notebooks. For now, we are simply moving data from BigQuery to CSV using Dataflow.

While we could read from BQ directly from TensorFlow (See: https://www.tensorflow.org/api_docs/python/tf/contrib/cloud/BigQueryReader), it is quite convenient to export to CSV and do the training off CSV.  Let's use Dataflow to do this at scale.

Because we are running this on the Cloud, you should go to the GCP Console (https://console.cloud.google.com/dataflow) to look at the status of the job. It will take several minutes for the preprocessing job to launch.

In [10]:
%%bash
if gsutil ls | grep -q gs://${BUCKET}/taxifare/ch4/taxi_preproc/; then
  gsutil -m rm -rf gs://$BUCKET/taxifare/ch4/taxi_preproc/
fi

First, let's define a function for preprocessing the data

In [11]:
import datetime

####
# Arguments:
#   -rowdict: Dictionary. The beam bigquery reader returns a PCollection in
#     which each row is represented as a python dictionary
# Returns:
#   -rowstring: a comma separated string representation of the record with dayofweek
#     converted from int to string (e.g. 3 --> Tue)
####
def to_csv(rowdict):
    days = ['null', 'Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']
    CSV_COLUMNS = 'fare_amount,dayofweek,hourofday,pickuplon,pickuplat,dropofflon,dropofflat,passengers,key'.split(',')
    rowdict['dayofweek'] = days[rowdict['dayofweek']]
    rowstring = ','.join([str(rowdict[k]) for k in CSV_COLUMNS])
    return rowstring


####
# Arguments:
#   -EVERY_N: Integer. Sample one out of every N rows from the full dataset.
#     Larger values will yield smaller sample
#   -RUNNER: 'DirectRunner' or 'DataflowRunner'. Specify to run the pipeline
#     locally or on Google Cloud respectively. 
# Side-effects:
#   -Creates and executes dataflow pipeline. 
#     See https://beam.apache.org/documentation/programming-guide/#creating-a-pipeline
####
def preprocess(EVERY_N, RUNNER):
    job_name = 'preprocess-taxifeatures' + '-' + datetime.datetime.now().strftime('%y%m%d-%H%M%S')
    print('Launching Dataflow job {} ... hang on'.format(job_name))
    OUTPUT_DIR = 'gs://{0}/taxifare/ch4/taxi_preproc/'.format(BUCKET)

    # dictionary of pipeline options
    options = {
        'staging_location': os.path.join(OUTPUT_DIR, 'tmp', 'staging'),
        'temp_location': os.path.join(OUTPUT_DIR, 'tmp'),
        'job_name': 'preprocess-taxifeatures' + '-' + datetime.datetime.now().strftime('%y%m%d-%H%M%S'),
        'project': PROJECT,
        'runner': RUNNER,
        'num_workers' : 4,
        'max_num_workers' : 5
    }
    # instantiate PipelineOptions object using options dictionary
    opts = beam.pipeline.PipelineOptions(flags=[], **options)
  
    # instantantiate Pipeline object using PipelineOptions
    with beam.Pipeline(options=opts) as p:
        for phase in ['train', 'valid']:
            query = create_query(phase, EVERY_N) 
            outfile = os.path.join(OUTPUT_DIR, '{}.csv'.format(phase))
            (
              p | 'read_{}'.format(phase) >> beam.io.Read(beam.io.BigQuerySource(query=query))
                | 'tocsv_{}'.format(phase) >> beam.Map(to_csv)
                | 'write_{}'.format(phase) >> beam.io.Write(beam.io.WriteToText(outfile))
            )
    print("Done")

Now, let's run pipeline locally. This takes upto <b>5 minutes</b>.  You will see a message "Done" when it is done.

In [12]:
preprocess(50*10000, 'DirectRunner') 



Launching Dataflow job preprocess-taxifeatures-220219-030817 ... hang on




Done


In [13]:
%%bash
gsutil ls gs://$BUCKET/taxifare/ch4/taxi_preproc/

gs://qwiklabs-gcp-01-bf4a5bbdb309/taxifare/ch4/taxi_preproc/train.csv-00000-of-00005
gs://qwiklabs-gcp-01-bf4a5bbdb309/taxifare/ch4/taxi_preproc/train.csv-00001-of-00005
gs://qwiklabs-gcp-01-bf4a5bbdb309/taxifare/ch4/taxi_preproc/train.csv-00002-of-00005
gs://qwiklabs-gcp-01-bf4a5bbdb309/taxifare/ch4/taxi_preproc/train.csv-00003-of-00005
gs://qwiklabs-gcp-01-bf4a5bbdb309/taxifare/ch4/taxi_preproc/train.csv-00004-of-00005
gs://qwiklabs-gcp-01-bf4a5bbdb309/taxifare/ch4/taxi_preproc/valid.csv-00000-of-00002
gs://qwiklabs-gcp-01-bf4a5bbdb309/taxifare/ch4/taxi_preproc/valid.csv-00001-of-00002


## 4. Run Beam pipeline on Cloud Dataflow

Run pipeline on cloud on a larger sample size.

In [14]:
%%bash
if gsutil ls | grep -q gs://${BUCKET}/taxifare/ch4/taxi_preproc/; then
  gsutil -m rm -rf gs://$BUCKET/taxifare/ch4/taxi_preproc/
fi

The following step will take <b>10-15 minutes.</b> Monitor job progress on the [Cloud Console in the Dataflow](https://console.cloud.google.com/dataflow) section.
__Note__: If the error occurred regarding enabling of `Dataflow API` then disable and re-enable the `Dataflow API` and re-run the below cell.

In [16]:
preprocess(50*100, 'DataflowRunner') 

Launching Dataflow job preprocess-taxifeatures-220219-031258 ... hang on




Done


Once the job completes, observe the files created in Google Cloud Storage

In [17]:
%%bash
gsutil ls -l gs://$BUCKET/taxifare/ch4/taxi_preproc/

  49169783  2022-02-19T03:18:41Z  gs://qwiklabs-gcp-01-bf4a5bbdb309/taxifare/ch4/taxi_preproc/train.csv-00000-of-00001
    116172  2022-02-19T03:08:53Z  gs://qwiklabs-gcp-01-bf4a5bbdb309/taxifare/ch4/taxi_preproc/train.csv-00000-of-00005
    116800  2022-02-19T03:08:53Z  gs://qwiklabs-gcp-01-bf4a5bbdb309/taxifare/ch4/taxi_preproc/train.csv-00001-of-00005
    114694  2022-02-19T03:08:53Z  gs://qwiklabs-gcp-01-bf4a5bbdb309/taxifare/ch4/taxi_preproc/train.csv-00002-of-00005
    108807  2022-02-19T03:08:53Z  gs://qwiklabs-gcp-01-bf4a5bbdb309/taxifare/ch4/taxi_preproc/train.csv-00003-of-00005
    116320  2022-02-19T03:08:53Z  gs://qwiklabs-gcp-01-bf4a5bbdb309/taxifare/ch4/taxi_preproc/train.csv-00004-of-00005
  24666705  2022-02-19T03:18:21Z  gs://qwiklabs-gcp-01-bf4a5bbdb309/taxifare/ch4/taxi_preproc/valid.csv-00000-of-00001
    114309  2022-02-19T03:08:52Z  gs://qwiklabs-gcp-01-bf4a5bbdb309/taxifare/ch4/taxi_preproc/valid.csv-00000-of-00002
    107712  2022-02-19T03:08:52Z  gs://qwiklabs-

In [18]:
%%bash
# print first 7 lines of first shard of train.csv
gsutil cat "gs://$BUCKET/taxifare/ch4/taxi_preproc/train.csv-00000-of-*" | head -7

2.5,Sun,0,-73.970518,40.794138,-73.967905,40.792987,4.0,2009-11-15 00:11:00.000000-73.970540.794140.793-73.9679
2.5,Thu,0,-73.986587,40.734404,-73.986598,40.734413,1.0,2014-04-03 00:31:17.000000-73.986640.734440.7344-73.9866
2.5,Thu,0,-74.005759,40.708276,-73.987198,40.721558,1.0,2012-12-20 00:19:59.000000-74.005840.708340.7216-73.9872
2.5,Thu,0,-73.98294830322266,40.755123138427734,-73.98302459716797,40.75524139404297,1.0,2015-02-26 00:41:23.000000-73.982940.755140.7552-73.983
2.5,Fri,0,-73.99102,40.751169,-73.990161,40.750829,1.0,2011-09-09 00:35:35.000000-73.99140.751240.7508-73.9902
2.5,Fri,0,-73.99537,40.725034,-73.975603,40.681622,1.0,2012-06-01 00:21:44.000000-73.995440.72540.6816-73.9756
2.5,Wed,0,-73.93214,40.800442,-73.932032,40.800695,3.0,2009-10-14 00:32:00.000000-73.932140.800440.8007-73.932


## 5. Develop model with new inputs

Download the first shard of the preprocessed data to enable local development.

In [19]:
%%bash
[ -d sample ] && { rm -rf sample; }
mkdir sample
gsutil cat "gs://$BUCKET/taxifare/ch4/taxi_preproc/train.csv-00000-of-*" > sample/train.csv
gsutil cat "gs://$BUCKET/taxifare/ch4/taxi_preproc/valid.csv-00000-of-*" > sample/valid.csv

We have two new inputs in the INPUT_COLUMNS, three engineered features, and the estimator involves bucketization and feature crosses.

In [20]:
%%bash
grep -A 20 "INPUT_COLUMNS =" taxifare/trainer/model.py

INPUT_COLUMNS = [
    # Define features
    tf.feature_column.categorical_column_with_vocabulary_list('dayofweek', vocabulary_list = ['Sun', 'Mon', 'Tues', 'Wed', 'Thu', 'Fri', 'Sat']),
    tf.feature_column.categorical_column_with_identity('hourofday', num_buckets = 24),

    # Numeric columns
    tf.feature_column.numeric_column('pickuplat'),
    tf.feature_column.numeric_column('pickuplon'),
    tf.feature_column.numeric_column('dropofflat'),
    tf.feature_column.numeric_column('dropofflon'),
    tf.feature_column.numeric_column('passengers'),

    # Engineered features that are created in the input_fn
    tf.feature_column.numeric_column('latdiff'),
    tf.feature_column.numeric_column('londiff'),
    tf.feature_column.numeric_column('euclidean')
]

# Build the estimator
def build_estimator(model_dir, nbuckets, hidden_units):
    """


In [21]:
%%bash
grep -A 50 "build_estimator" taxifare/trainer/model.py

def build_estimator(model_dir, nbuckets, hidden_units):
    """
     Build an estimator starting from INPUT COLUMNS.
     These include feature transformations and synthetic features.
     The model is a wide-and-deep model.
  """

    # Input columns
    (dayofweek, hourofday, plat, plon, dlat, dlon, pcount, latdiff, londiff, euclidean) = INPUT_COLUMNS

    # Bucketize the lats & lons
    latbuckets = np.linspace(38.0, 42.0, nbuckets).tolist()
    lonbuckets = np.linspace(-76.0, -72.0, nbuckets).tolist()
    b_plat = tf.feature_column.bucketized_column(plat, latbuckets)
    b_dlat = tf.feature_column.bucketized_column(dlat, latbuckets)
    b_plon = tf.feature_column.bucketized_column(plon, lonbuckets)
    b_dlon = tf.feature_column.bucketized_column(dlon, lonbuckets)

    # Feature cross
    ploc = tf.feature_column.crossed_column([b_plat, b_plon], nbuckets * nbuckets)
    dloc = tf.feature_column.crossed_column([b_dlat, b_dlon], nbuckets * nbuckets)
    pd_pair = tf.feature_column.cr

In [22]:
%%bash
grep -A 15 "add_engineered(" taxifare/trainer/model.py

def add_engineered(features):
    # this is how you can do feature engineering in TensorFlow
    lat1 = features['pickuplat']
    lat2 = features['dropofflat']
    lon1 = features['pickuplon']
    lon2 = features['dropofflon']
    latdiff = (lat1 - lat2)
    londiff = (lon1 - lon2)

    # set features for distance with sign that indicates direction
    features['latdiff'] = latdiff
    features['londiff'] = londiff
    dist = tf.sqrt(latdiff * latdiff + londiff * londiff)
    features['euclidean'] = dist
    return features

--
    features = add_engineered(feature_placeholders.copy())
    return tf.estimator.export.ServingInputReceiver(features, feature_placeholders)

# Create input function to load data into datasets
def read_dataset(filename, mode, batch_size = 512):
    def _input_fn():
        def decode_csv(value_column):
            columns = tf.compat.v1.decode_csv(value_column, record_defaults = DEFAULTS)
            features = dict(zip(CSV_COLUMNS, columns))
            label

Try out the new model on the local sample (this takes <b>5 minutes</b>) to make sure it works fine.

In [23]:
%%bash
rm -rf taxifare.tar.gz taxi_trained
export PYTHONPATH=${PYTHONPATH}:${PWD}/taxifare
python -m trainer.task \
  --train_data_paths=${PWD}/sample/train.csv \
  --eval_data_paths=${PWD}/sample/valid.csv  \
  --output_dir=${PWD}/taxi_trained \
  --train_steps=10 \
  --job-dir=/tmp

INFO:tensorflow:Using config: {'_model_dir': '/home/jupyter/training-data-analyst/courses/machine_learning/feateng/taxi_trained/', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 30, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 3, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Using config: {'_model_dir': '/home/ju

In [24]:
%%bash
ls taxi_trained/export/exporter/

1645241025


You can use ```saved_model_cli``` to look at the exported signature. Note that the model doesn't need any of the engineered features as inputs. It will compute latdiff, londiff, euclidean from the provided inputs, thanks to the ```add_engineered``` call in the serving_input_fn.

In [25]:
%%bash
model_dir=$(ls ${PWD}/taxi_trained/export/exporter | tail -1)
saved_model_cli show --dir ${PWD}/taxi_trained/export/exporter/${model_dir} --all


MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['predict']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['dayofweek'] tensor_info:
        dtype: DT_STRING
        shape: (-1)
        name: Placeholder_5:0
    inputs['dropofflat'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1)
        name: Placeholder_2:0
    inputs['dropofflon'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1)
        name: Placeholder_3:0
    inputs['hourofday'] tensor_info:
        dtype: DT_INT32
        shape: (-1)
        name: Placeholder_6:0
    inputs['passengers'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1)
        name: Placeholder_4:0
    inputs['pickuplat'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1)
        name: Placeholder:0
    inputs['pickuplon'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1)
        name: Placeholder_1:0
  The given SavedModel SignatureDef contains the fo

In [26]:
%%writefile /tmp/test.json
{"dayofweek": "Sun", "hourofday": 17, "pickuplon": -73.885262, "pickuplat": 40.773008, "dropofflon": -73.987232, "dropofflat": 40.732403, "passengers": 2}

Writing /tmp/test.json


In [27]:
%%bash
model_dir=$(ls ${PWD}/taxi_trained/export/exporter)
gcloud ai-platform local predict \
  --model-dir=${PWD}/taxi_trained/export/exporter/${model_dir} \
  --json-instances=/tmp/test.json

PREDICTIONS
[0.16815726459026337]


If the signature defined in the model is not serving_default then you must specify it via --signature-name flag, otherwise the command may fail.
Instructions for updating:
non-resource variables are not supported in the long term
2022-02-19 03:24:18.245509: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
Instructions for updating:
This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.loader.load or tf.compat.v1.saved_model.load. There will be a new function for importing SavedModels in Tensorflow 2.0.
Instructions for updating:
This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.loader.load or tf.compat.v1.saved_model.load. There will be a new function for importing SavedModels in Tensorflow 2.0.



## 6. Train on cloud

This will take <b> 10-15 minutes </b> even though the prompt immediately returns after the job is submitted. Monitor job progress on the [Cloud Console, in the AI Platform](https://console.cloud.google.com/mlengine) section and wait for the training job to complete.


In [28]:
%%bash
OUTDIR=gs://${BUCKET}/taxifare/ch4/taxi_trained
JOBNAME=lab4a_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ai-platform jobs submit training $JOBNAME \
  --region=$REGION \
  --module-name=trainer.task \
  --package-path=${PWD}/taxifare/trainer \
  --job-dir=$OUTDIR \
  --staging-bucket=gs://$BUCKET \
  --scale-tier=BASIC \
  --runtime-version 2.3 \
  --python-version 3.5 \
  -- \
  --train_data_paths="gs://$BUCKET/taxifare/ch4/taxi_preproc/train*" \
  --eval_data_paths="gs://${BUCKET}/taxifare/ch4/taxi_preproc/valid*"  \
  --train_steps=5000 \
  --output_dir=$OUTDIR

gs://qwiklabs-gcp-01-bf4a5bbdb309/taxifare/ch4/taxi_trained asia-southeast1 lab4a_220219_032536
jobId: lab4a_220219_032536
state: QUEUED


CommandException: 1 files/objects could not be removed.
Job [lab4a_220219_032536] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ai-platform jobs describe lab4a_220219_032536

or continue streaming the logs with the command

  $ gcloud ai-platform jobs stream-logs lab4a_220219_032536


The RMSE is now 8.33249, an improvement over the 9.3 that we were getting ... of course, we won't know until we train/validate on a larger dataset. Still, this is promising. But before we do that, let's do hyper-parameter tuning.

<b>Use the Cloud Console link to monitor the job and wait till the job is done.</b>

Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License