
## tf.estimator

We will create a machine learning model using tf.estimator and evaluate its performance. The dataset is rather small (7700 samples), so we can do it all in memory. We will also simply pass the raw data in as-is.

In [1]:
!pip install datalab



In [2]:
import datalab.bigquery as bq
import tensorflow as tf
import pandas as pd
import numpy as np
import shutil

print(tf.__version__)

1.13.1



Read data created in the previous chapter.

In [0]:
csv_columns = ['fare_amount','pickuplon','pickuplat','dropofflon','dropofflat','passengers','key']
features = csv_columns[1:-1]  # everything except the label (first) and the key (last)
label = csv_columns[0]

df_train = pd.read_csv('./taxi-train.csv',header = None, names = csv_columns)
df_valid = pd.read_csv('./taxi-valid.csv',header = None, names = csv_columns)

### Input function to read from pandas DataFrame into tf.constant

In [0]:
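def make_input_fn(df, num_epochs):
  """Return an input function that feeds `df` to an estimator in batches."""
  return tf.estimator.inputs.pandas_input_fn(
    x = df,                   # all columns; the feature columns select the ones the model uses
    y = df[label],            # target column (fare_amount)
    batch_size = 128,
    num_epochs = num_epochs,  # number of passes over the data
    shuffle = True,
    queue_capacity = 1000,
    num_threads = 1
  )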
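
To see how `batch_size` and `num_epochs` determine the number of training steps, here is a minimal sketch in plain Python. It assumes that batches are drawn from one continuous stream of rows, so a batch may span an epoch boundary; `count_steps` is a hypothetical helper, not part of TensorFlow, and the row count is the approximate figure quoted in the introduction.

```python
import math

def count_steps(num_rows, batch_size, num_epochs):
    """Number of batches needed to consume num_rows * num_epochs rows.

    Assumes rows from consecutive epochs are batched as one stream,
    so only the very last batch can be smaller than batch_size.
    """
    return math.ceil(num_rows * num_epochs / batch_size)

# Assumed row count: the introduction says roughly 7700 samples.
steps = count_steps(num_rows=7700, batch_size=128, num_epochs=10)
print(steps)
```

Under that assumption, the `model.ckpt-608` checkpoint that appears in the evaluation log further down would correspond to a training set of just under 7,800 rows, which is in line with the "7700 samples" estimate.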

### Create feature columns for estimator

In [0]:
def make_feature_cols():
  input_columns = [tf.feature_column.numeric_column(k) for k in features]
  return input_columns


### Linear Regression with tf.estimator framework

In [17]:
tf.logging.set_verbosity(tf.logging.INFO)

outdir = 'taxi_trained'
shutil.rmtree(outdir, ignore_errors = True) # start fresh each time

model = tf.estimator.LinearRegressor(
      feature_columns = make_feature_cols(), model_dir = outdir)

model.train(input_fn = make_input_fn(df_train, num_epochs = 10))

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'taxi_trained', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f01c5ab5128>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Calling model_fn.
Instructions for updating:
Use tf.cast instead.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create C

<tensorflow_estimator.python.estimator.canned.linear.LinearRegressor at 0x7f01c5aa8ef0>

Evaluate on the validation data (we should defer using the test data to after we have selected a final model).

In [19]:
def print_rmse(model, name, df):
  # average_loss is the mean squared error per example, so its square root is the RMSE
  metrics = model.evaluate(input_fn = make_input_fn(df,1))
  print('RMSE on {} dataset = {}'.format(name, np.sqrt(metrics['average_loss'])))

print_rmse(model, 'validation', df_valid)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-04-10T05:17:39Z
INFO:tensorflow:Graph was finalized.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from taxi_trained/model.ckpt-608
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2019-04-10-05:17:39
INFO:tensorflow:Saving dict for global step 608: average_loss = 108.82821, global_step = 608, label/mean = 11.666427, loss = 12942.783, prediction/mean = 11.719884
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 608: taxi_trained/model.ckpt-608
RMSE on validation dataset = 10.432075500488281


This is nowhere near our benchmark (RMSE of $6 or so on this data), but it serves to demonstrate what TensorFlow code looks like.
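
The RMSE reported above is simply the square root of the `average_loss` value in the evaluation log. The short check below reproduces that arithmetic with the standard library only; `rmse_from_average_loss` is a hypothetical helper name introduced for illustration.

```python
import math

def rmse_from_average_loss(average_loss):
    """Convert a mean squared error into a root mean squared error."""
    return math.sqrt(average_loss)

# average_loss copied from the evaluation log at global step 608.
rmse = rmse_from_average_loss(108.82821)
print(round(rmse, 3))  # 10.432
```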

Let's use this model for prediction.

In [20]:
import itertools
# Re-create the estimator with the same model_dir so that it loads the saved checkpoint
model = tf.estimator.LinearRegressor(
  feature_columns = make_feature_cols(), model_dir = outdir)

preds_iter = model.predict(input_fn=make_input_fn(df_valid, 1))
# Print the first five predictions
print([pred['predictions'][0] for pred in itertools.islice(preds_iter, 5)])

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'taxi_trained', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f01c4fb5978>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameter

This explains why the RMSE was so high -- the model essentially predicts the same amount for every trip. Would a more complex model help? Let's try using a deep neural network. The code to do this is quite straightforward as well.
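
To illustrate why a near-constant prediction yields a large RMSE, the sketch below scores a model that always predicts the mean fare. For such a model the RMSE equals the (population) standard deviation of the fares, so it can do no better than the spread of the data. The fare values are hypothetical, chosen only for illustration; they are not taken from the taxi dataset.

```python
import math
import statistics

def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical fares in dollars (not from the real dataset).
fares = [4.5, 6.0, 8.5, 12.0, 52.0]

# A "model" that predicts the same amount, the mean fare, for every trip.
mean_fare = statistics.mean(fares)
constant_rmse = rmse(fares, [mean_fare] * len(fares))

print(constant_rmse)
print(statistics.pstdev(fares))  # same value, up to rounding
```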

### Deep Neural Network regression

In [21]:
tf.logging.set_verbosity(tf.logging.INFO)
shutil.rmtree(outdir, ignore_errors = True) # start fresh each time
model = tf.estimator.DNNRegressor(hidden_units=[32,8,2],
  feature_columns = make_feature_cols(), model_dir = outdir)

model.train(input_fn=make_input_fn(df_train, num_epochs = 100));
print_rmse(model,'validation', df_valid)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'taxi_trained', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f01c719d518>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was fi

We are not beating our benchmark with either model ... what's up? Well, we may be using TensorFlow for Machine Learning, but we are not yet using it well. That's what the rest of this course is about!


But, for the record, let's say we had to choose between the two models. We'd choose the one with the lower validation error. Finally, we'd measure the RMSE on the test data with this chosen model.
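
The selection rule just described can be sketched as follows. The linear model's validation RMSE is the value printed earlier; the DNN value is a hypothetical placeholder, because the notebook output for that cell is truncated. `pick_best` is an illustrative helper, not part of tf.estimator.

```python
def pick_best(validation_rmse):
    """Return the name of the model with the lowest validation RMSE."""
    return min(validation_rmse, key=validation_rmse.get)

validation_rmse = {
    "linear": 10.43,  # from the validation run above
    "dnn": 11.0,      # hypothetical placeholder, not a measured result
}

best = pick_best(validation_rmse)
print(best)
# Only now would the test data be used, and only for the chosen model.
```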

### Benchmark dataset

In [40]:
import datalab.bigquery as bq
import numpy as np
import pandas as pd

import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="./bigquery_user.json"

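def create_query(phase, every_n):
  """
  Build the BigQuery SQL for one split of the taxi data.

  phase: 1 = train, 2 = valid
  every_n: if None, split by hash bucket mod 4; otherwise keep only rows
           whose hash bucket mod every_n equals phase (about 1 in every_n rows)
  """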
  base_query = """
#standardSQL
SELECT
  (tolls_amount + fare_amount) AS fare_amount,
  GENERATE_UUID() AS key,
  EXTRACT(DAYOFWEEK FROM pickup_datetime)*1.0 AS dayofweek,
  EXTRACT(HOUR FROM pickup_datetime)*1.0 AS hourofday,
  pickup_longitude AS pickuplon,
  pickup_latitude AS pickuplat,
  dropoff_longitude AS dropofflon,
  dropoff_latitude AS dropofflat,
  passenger_count*1.0 AS passengers
FROM
  `bigquery-public-data.new_york.tlc_yellow_trips_2016`
WHERE
  trip_distance > 0
  AND fare_amount >= 2.5
  AND pickup_longitude > -78
  AND pickup_longitude < -70
  AND dropoff_longitude > -78
  AND dropoff_longitude < -70
  AND pickup_latitude > 37
  AND pickup_latitude < 45
  AND dropoff_latitude > 37
  AND dropoff_latitude < 45
  AND passenger_count > 0
  """
  
  if every_n is None:
    if phase < 2:
      # Training
      query = "{0}AND MOD(ABS(FARM_FINGERPRINT(STRING(pickup_datetime))),4) < 2".format(base_query)
    else:
      # Validation
      query = "{0}AND MOD(ABS(FARM_FINGERPRINT(STRING(pickup_datetime))),4) = {1}".format(base_query, phase)
  else:
    query = "{0}AND MOD(ABS(FARM_FINGERPRINT(STRING(pickup_datetime))), {1} ) = {2}".format(base_query, every_n, phase)

  return query



query = create_query(2, 100000)
print(query)
df = bq.Query(query).to_dataframe()


#standardSQL
SELECT
  (tolls_amount + fare_amount) AS fare_amount,
  GENERATE_UUID() AS key,
  EXTRACT(DAYOFWEEK FROM pickup_datetime)*1.0 AS dayofweek,
  EXTRACT(HOUR FROM pickup_datetime)*1.0 AS hourofday,
  pickup_longitude AS pickuplon,
  pickup_latitude AS pickuplat,
  dropoff_longitude AS dropofflon,
  dropoff_latitude AS dropofflat,
  passenger_count*1.0 AS passengers
FROM
  `bigquery-public-data.new_york.tlc_yellow_trips_2016`
WHERE
  trip_distance > 0
  AND fare_amount >= 2.5
  AND pickup_longitude > -78
  AND pickup_longitude < -70
  AND dropoff_longitude > -78
  AND dropoff_longitude < -70
  AND pickup_latitude > 37
  AND pickup_latitude < 45
  AND dropoff_latitude > 37
  AND dropoff_latitude < 45
  AND passenger_count > 0
  AND MOD(ABS(FARM_FINGERPRINT(STRING(pickup_datetime))), 100000 ) = 2


RequestException: ignored

In [0]:
print_rmse(model, 'benchmark', df)
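
The `MOD(ABS(FARM_FINGERPRINT(...)), n)` clauses in `create_query` split the data by hashing the pickup timestamp, so a given row always lands in the same split. The sketch below mimics that idea locally. It uses MD5 from the standard library as a stand-in hash, which is an assumption made for illustration: it does not produce the same buckets as BigQuery's `FARM_FINGERPRINT`. Treating bucket 3 as a held-out test split is also an assumption; the query only defines buckets 0 and 1 as training and bucket 2 as validation.

```python
import hashlib

def bucket(key, num_buckets):
    """Map a string key to a stable bucket in [0, num_buckets)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()  # stand-in for FARM_FINGERPRINT
    return int(digest, 16) % num_buckets

def assign_split(pickup_datetime):
    """Mirror the mod-4 rule: buckets 0 and 1 train, bucket 2 validation."""
    b = bucket(pickup_datetime, 4)
    if b < 2:
        return "train"
    if b == 2:
        return "valid"
    return "held_out"  # assumed to be reserved for testing

# Hypothetical timestamps, for illustration only.
for ts in ["2016-01-01 00:00:00", "2016-01-01 00:05:00", "2016-01-01 00:10:00"]:
    print(ts, assign_split(ts))
```

Because the assignment depends only on the key, rerunning the split never moves a row between training and validation. Passing a large `every_n`, as in `create_query(2, 100000)`, uses the same mechanism to keep roughly one row in every 100,000.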