<h1> Machine Learning using tf.estimator </h1>

In this notebook, you will create a machine learning model using the tf.estimator API and evaluate the model's performance. In this lab, the training dataset is small enough (roughly 7300 training examples) to fit in memory, so the data can be passed to the machine learning model as a Pandas DataFrame.

Note that to train and validate the machine learning model, this notebook tries to use the `taxi-train.csv` and `taxi-valid.csv` files from an earlier lab. You can check that you still have those files by going to **View > Table of Contents** in the menu bar and then choosing the **Files** tab, as shown in the following screenshot. If you don't see them, try clicking the Refresh button in the Files tab.

Don't worry if you have trouble finding those files. They may have been deleted automatically if your Colab runtime restarted after you worked on the earlier lab, which created the files from a SQL query against the data warehouse. The next cell of the notebook checks whether your runtime is missing the `taxi-*.csv` files and, if so, tries to download them from GitHub. Alternatively, you can go back to the earlier lab available [here](bit.ly/d1-create-datasets) and re-create the files.

![](https://i.imgur.com/MP0EKTk.png)


---
Before you start, **make sure that you are logged in with your student account**. Otherwise you may incur Google Cloud charges for using this notebook. 

---

In [0]:
import shutil
import numpy as np
import pandas as pd
import tensorflow as tf

from google.cloud import bigquery

#@markdown Remember to uncheck "Reset all runtimes before running"

#@markdown As you know, resetting the runtime will delete any files you may have on your notebook file system.
#@markdown ![](https://i.imgur.com/9dgw0h0.png)

#@markdown Enter  your GCP Project ID:
PROJECT = "osipov-archive-backup-coldline" #@param {type: "string"}
#@markdown Next, use Shift-Enter to run this cell and complete authentication.

try:
  from google.colab import auth
  auth.authenticate_user()
  print("AUTHENTICATED")
except Exception:
  # Covers both a missing google.colab module and a failed sign-in.
  print("FAILED to authenticate")
  
bq = bigquery.Client(project=PROJECT)
  
print(tf.__version__)

# Copy taxi-*.csv files from github if they are missing from the runtime.
!wget -nc https://github.com/osipov/training-data-analyst/raw/master/bootcamps/serverless_ml/taxi-11k-datasets.zip  
!unzip -n taxi-11k-datasets.zip  

Here you prepare Pandas DataFrames to store the training and validation examples. The training and validation datasets contain roughly 7300 and 1600 examples, respectively, so they easily fit in memory.




In [0]:
# In the CSV files, the label is the first column, followed by the features and then the key
CSV_COLUMNS = ['fare_amount', 'pickuplon','pickuplat','dropofflon','dropofflat','passengers', 'key']
FEATURES = CSV_COLUMNS[1:-1]
LABEL = CSV_COLUMNS[0]

df_train = pd.read_csv('./taxi-train.csv', header = None, names = CSV_COLUMNS)
df_valid = pd.read_csv('./taxi-valid.csv', header = None, names = CSV_COLUMNS)

In [0]:
df_train.describe()

In [0]:
df_valid.describe()

<h2> Input function to read from Pandas Dataframe into tf.constant </h2>

In [0]:
def make_input_fn(df, num_epochs):
  return tf.estimator.inputs.pandas_input_fn(
    x = df,
    y = df[LABEL],
    batch_size = 128,
    num_epochs = num_epochs,
    shuffle = True,
    queue_capacity = 1000,
    num_threads = 1
  )
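
To make the roles of `batch_size`, `num_epochs`, and `shuffle` concrete, here is a minimal pure-Python sketch of epoch-wise mini-batching. It is not how `pandas_input_fn` is implemented (that function feeds a queue, so its batches may straddle epoch boundaries). The helper name `iter_batches` and its `seed` parameter are hypothetical and introduced only for illustration.

```python
import random


def iter_batches(rows, batch_size, num_epochs, shuffle=True, seed=0):
    """Yield lists of rows, batch_size at a time, for num_epochs passes.

    Hypothetical helper for illustration only. It batches within each
    epoch, so the last batch of an epoch may be smaller than batch_size.
    """
    rng = random.Random(seed)
    for _ in range(num_epochs):
        order = list(rows)
        if shuffle:
            rng.shuffle(order)  # a new random order on every pass
        for start in range(0, len(order), batch_size):
            yield order[start:start + batch_size]


# 10 toy "rows", batches of 4, 2 epochs: each epoch yields sizes 4, 4, 2.
batch_sizes = [len(b) for b in iter_batches(range(10), 4, 2)]
print(batch_sizes)
```

With roughly 7300 training rows and `batch_size = 128`, one pass over the data works out to about 57 batches.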

### Create feature columns for estimator

In [0]:
def make_feature_cols():
  input_columns = [tf.feature_column.numeric_column(k) for k in FEATURES]
  return input_columns
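
Conceptually, each numeric feature column names one input to read from the features supplied by the input function. That is why columns such as `key` or the label can be present in the DataFrame without being fed to the model. The sketch below mimics that selection in plain Python; `select_features` is a hypothetical helper, not part of TensorFlow, and the row values are made up.

```python
FEATURES = ['pickuplon', 'pickuplat', 'dropofflon', 'dropofflat', 'passengers']


def select_features(row, feature_names):
    """Return the values of feature_names from row, in a fixed order."""
    return [float(row[name]) for name in feature_names]


# A made-up row with the same column names as the CSV files.
row = {
    'fare_amount': 12.5, 'pickuplon': -73.99, 'pickuplat': 40.75,
    'dropofflon': -73.97, 'dropofflat': 40.76, 'passengers': 1,
    'key': 'example-key',
}
dense = select_features(row, FEATURES)
print(dense)  # the label and the key are not part of the model input
```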

<h3> Linear Regression with tf.estimator  </h3>

In [0]:
tf.logging.set_verbosity(tf.logging.INFO)

OUTDIR = 'taxi_trained'
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time

model = tf.estimator.LinearRegressor(
      feature_columns = make_feature_cols(), model_dir = OUTDIR)

model.train(input_fn = make_input_fn(df_train, num_epochs = 10))
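
As a rough picture of what a linear regressor learns, the sketch below fits `y = w * x + b` to toy data with plain full-batch gradient descent on the mean squared error. This is a stand-in chosen for brevity: `tf.estimator.LinearRegressor` uses a different optimizer (FTRL by default) and five input features rather than one. The function name, learning rate, step count, and data are illustrative assumptions.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = float(len(xs))
    for _ in range(steps):
        # Residuals of the current fit.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of mean((w*x + b - y)^2) with respect to w and b.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1, so the fit should approach w=2, b=1.
w, b = fit_line([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
print('w = {:.3f}, b = {:.3f}'.format(w, b))
```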

Evaluate on the validation data. In general, you should not look at the test data until after you have selected a final model.
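
The evaluation cell takes the square root of the `average_loss` metric. For the regression estimators used here, `average_loss` is the mean squared error per example, so its square root is the RMSE, expressed in the same units as the label (dollars). A minimal sketch of the same calculation on made-up numbers, with `rmse` as a hypothetical helper:

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(p - y) ** 2 for y, p in zip(labels, predictions)]
    mean_squared_error = sum(squared_errors) / float(len(squared_errors))
    return math.sqrt(mean_squared_error)


# Made-up fares and predictions; the errors are 3 and 4 dollars.
value = rmse([10.0, 20.0], [13.0, 16.0])
print(value)  # square root of (9 + 16) / 2
```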

In [0]:
def print_rmse(model, name, df):
  # average_loss is the mean squared error, so its square root is the RMSE
  metrics = model.evaluate(input_fn = make_input_fn(df, 1))
  print('RMSE on {} dataset = {}'.format(name, np.sqrt(metrics['average_loss'])))

print_rmse(model, 'validation', df_valid)

This is nowhere near our benchmark metric (RMSE of $6 or so on this data), but it serves to demonstrate what TensorFlow code looks like.  Let's use this model for prediction.

In [0]:
import itertools

# Read saved model and use it for prediction
model = tf.estimator.LinearRegressor(
      feature_columns = make_feature_cols(), model_dir = OUTDIR)
preds_iter = model.predict(input_fn = make_input_fn(df_valid, 1))
print([pred['predictions'][0] for pred in itertools.islice(preds_iter, 5)])

This explains why the RMSE was so high -- the model essentially predicts the same amount for every trip.  Would a more complex model help? Let's try using a deep neural network.  The tf.estimator API makes it quite straightforward.
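
The observation about near-constant predictions can be made precise. A model that predicts the same value for every trip cannot do better than the spread of the fares themselves: if the constant is the mean label, the RMSE equals the (population) standard deviation of the labels. The toy fares below are made up for illustration, and `rmse_of_constant` is a hypothetical helper.

```python
import math


def rmse_of_constant(labels, constant):
    """RMSE obtained by predicting the same constant for every label."""
    mse = sum((constant - y) ** 2 for y in labels) / float(len(labels))
    return math.sqrt(mse)


fares = [4.0, 6.0, 8.0, 10.0, 22.0]  # made-up fares in dollars
mean_fare = sum(fares) / len(fares)

# Population standard deviation of the labels.
std_fare = math.sqrt(sum((y - mean_fare) ** 2 for y in fares) / len(fares))

# The two numbers agree: predicting the mean gives RMSE == std deviation.
print('{:.4f} {:.4f}'.format(rmse_of_constant(fares, mean_fare), std_fare))
```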

<h3> Deep Neural Network regression </h3>

In [0]:
tf.logging.set_verbosity(tf.logging.INFO)
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time

# Notice the use of the DNNRegressor model
model = tf.estimator.DNNRegressor(hidden_units = [32, 8, 2],
      feature_columns = make_feature_cols(), model_dir = OUTDIR)

model.train(input_fn = make_input_fn(df_train, num_epochs = 100));
print_rmse(model, 'validation', df_valid)
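
For a sense of scale, the network defined above is small. Assuming fully connected layers with one bias per unit, five numeric inputs, and a single output unit for the regression target, the trainable parameters for `hidden_units = [32, 8, 2]` can be counted as follows (`count_dense_params` is a hypothetical helper, not a TensorFlow function):

```python
def count_dense_params(num_inputs, hidden_units, num_outputs=1):
    """Count weights and biases in a stack of fully connected layers."""
    layer_sizes = [num_inputs] + list(hidden_units) + [num_outputs]
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix plus biases
    return total


# 5 inputs -> 32 -> 8 -> 2 -> 1 output
num_params = count_dense_params(5, [32, 8, 2])
print(num_params)  # 192 + 264 + 18 + 3 = 477
```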

So what was the point of spending all this time learning about deep neural networks if they can't even beat the naive benchmark model?  Well, you started using TensorFlow for Machine Learning, but you have not yet learned how to do it well.  That's what the rest of this session is about!

<h2> Benchmark dataset </h2>

Remember that if you had to choose between two models you should choose the one with the lower validation error. Next, you should measure the metric on the test data with the selected model. Let's start the process using the dataset in the data warehouse.

In [0]:
def create_query(phase, EVERY_N):
  """
  phase: 1=train 2=valid
  """
  base_query = """
    SELECT
      (tolls_amount + fare_amount) AS fare_amount,
      
      CONCAT( STRING(pickup_datetime), 
              CAST(pickup_longitude AS STRING), 
              CAST(pickup_latitude AS STRING),
              CAST(dropoff_latitude AS STRING), 
              CAST(dropoff_longitude AS STRING)) AS key,

      EXTRACT(DAYOFWEEK FROM pickup_datetime)*1.0 AS dayofweek,
      EXTRACT(HOUR FROM pickup_datetime)*1.0 AS hourofday,
      pickup_longitude AS pickuplon,
      pickup_latitude AS pickuplat,
      dropoff_longitude AS dropofflon,
      dropoff_latitude AS dropofflat,
      passenger_count*1.0 AS passengers
    FROM
      `nyc-tlc.yellow.trips`
    WHERE
      {}
      AND trip_distance > 0
      AND fare_amount >= 2.5
      AND pickup_longitude > -78
      AND pickup_longitude < -70
      AND dropoff_longitude > -78
      AND dropoff_longitude < -70
      AND pickup_latitude > 37
      AND pickup_latitude < 45
      AND dropoff_latitude > 37
      AND dropoff_latitude < 45
      AND passenger_count > 0
  """
  if EVERY_N is None:
    if phase < 2:
      # training
      selector = "MOD(ABS(FARM_FINGERPRINT(STRING(pickup_datetime))), 4) < 2"
    else:
      # validation
      selector = "MOD(ABS(FARM_FINGERPRINT(STRING(pickup_datetime))), 4) = 2"
  else:
    selector = "MOD(ABS(FARM_FINGERPRINT(STRING(pickup_datetime))), %d) = %d" % (EVERY_N, phase)
    
  query = base_query.format(selector)

  return query

sql = create_query(2, 100000)
df = bq.query(sql).to_dataframe()
df.describe()
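
The `WHERE` clause built by `create_query` selects rows by hashing `pickup_datetime` and taking the result modulo a number. This gives a repeatable split: the same row always lands in the same bucket. The sketch below imitates the idea in Python. `hashlib.md5` is only a stand-in for BigQuery's `FARM_FINGERPRINT` (the two produce different values), and `bucket_of`, `phase_of`, and the timestamps are hypothetical.

```python
import hashlib


def bucket_of(value, num_buckets):
    """Map a string to a stable bucket in [0, num_buckets)."""
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % num_buckets


def phase_of(value):
    """Mimic the branch without EVERY_N: buckets 0-1 train, 2 valid."""
    bucket = bucket_of(value, 4)
    if bucket < 2:
        return 'train'
    if bucket == 2:
        return 'valid'
    return 'unused'  # bucket 3 is selected by neither query


# Made-up pickup timestamps, one per minute of a single day.
timestamps = ['2014-01-01 %02d:%02d:00' % (h, m)
              for h in range(24) for m in range(60)]
counts = {'train': 0, 'valid': 0, 'unused': 0}
for ts in timestamps:
    counts[phase_of(ts)] += 1
print(counts)  # expect roughly half train and a quarter valid
```

Because the bucket depends only on the hashed value, rerunning the split assigns every row to the same phase, which keeps training and validation data from mixing between runs.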

In [0]:
print_rmse(model, 'benchmark', df)

RMSE on the benchmark dataset is <b>10.6</b> (your results will vary because of random seeds).

This is not only far worse than our original benchmark of 6.00, but it doesn't even beat our distance-based rule's RMSE of 8.02.

<h3>Recap</h3>

In this notebook you learned how to write a simple TensorFlow model, but you have yet to make your model perform well. Remember that the current implementation assumes that the training and validation datasets fit in the memory of the node where you are executing your code. Next, you will learn how to scale up your model implementation to support large datasets that can range up to petabytes of data.

Copyright 2017 Counter Factual .AI LLC. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License