
## Refactoring to add batching and feature-creation

We continue reading the same small dataset, but refactor our ML pipeline in two small but significant ways:

1. Refactor the input to read data in batches.
2. Refactor the feature creation so that it is not one-to-one with inputs.

The Pandas function in the previous notebook also batched, but only after it had read the whole dataset into memory; on a large dataset, this won't be an option.
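To make the contrast concrete, here is a minimal sketch in plain Python (not the TensorFlow code used in this notebook) of reading CSV rows in fixed-size batches without ever holding the whole file in memory. The helper name `iter_batches` and the tiny in-memory sample are hypothetical, chosen only for illustration.

```python
import csv
import io
from itertools import islice

def iter_batches(lines, batch_size):
    """Yield lists of parsed CSV rows, at most batch_size rows at a time."""
    reader = csv.reader(lines)
    while True:
        # Pull only the next batch_size rows from the underlying iterator
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch

# Hypothetical stand-in for a large file on disk; any iterable of lines works
sample = io.StringIO("1.0,a\n2.0,b\n3.0,c\n4.0,d\n5.0,e\n")
batch_sizes = [len(batch) for batch in iter_batches(sample, batch_size=2)]
print(batch_sizes)  # [2, 2, 1]
```

Only one batch is materialized at a time, which is the property the Pandas approach lacked.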


In [0]:
import tensorflow as tf
import numpy as np
import shutil

### 1. Refactor the input

Read the same data as before, but this time make the input more general and performant. Instead of using Pandas, we will use TensorFlow's Dataset API.

In [0]:
csv_columns = ['fare_amount', 'pickuplon','pickuplat','dropofflon','dropofflat','passengers', 'key']
label_column = 'fare_amount'
defaults = [[0.0], [-74.0], [40.0], [-74.0], [40.7], [1.0], ['nokey']]

def read_dataset(filename, mode, batch_size = 512):
  def _input_fn():
    def decode_csv(value_column):
      # Parse one CSV line into one tensor per column, filling blanks from defaults
      columns = tf.decode_csv(value_column, record_defaults = defaults)
      features = dict(zip(csv_columns, columns))
      # Separate the label from the input features
      label = features.pop(label_column)
      return features, label

    # Create list of files that match pattern
    file_list = tf.gfile.Glob(filename)

    # Create dataset from file list
    dataset = tf.data.TextLineDataset(file_list).map(decode_csv)
    if mode == tf.estimator.ModeKeys.TRAIN:
      num_epochs = None # indefinitely
      dataset = dataset.shuffle(buffer_size = 10 * batch_size)
    else:
      num_epochs = 1 # end-of-input after this

    # Repeat for num_epochs, then group rows into batches
    dataset = dataset.repeat(num_epochs).batch(batch_size)
    return dataset.make_one_shot_iterator().get_next()
  return _input_fn


def get_train():
  return read_dataset('./taxi-train.csv', mode = tf.estimator.ModeKeys.TRAIN)

def get_valid():
  return read_dataset('./taxi-valid.csv', mode = tf.estimator.ModeKeys.EVAL)

def get_test():
  return read_dataset('./taxi-test.csv', mode = tf.estimator.ModeKeys.EVAL)
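The parsing step inside `decode_csv` can be illustrated without TensorFlow. The sketch below is a plain-Python analogy, not the TensorFlow implementation: ordinary values stand in for tensors, the names `decode_line`, `columns`, `label_name`, and `plain_defaults` are hypothetical, and the sample line is made up. It shows how each field is paired with its column name, how an empty field falls back to its default, and how the label is popped out of the feature dictionary.

```python
columns = ['fare_amount', 'pickuplon', 'pickuplat', 'dropofflon',
           'dropofflat', 'passengers', 'key']
label_name = 'fare_amount'
# Scalar versions of the record defaults; each default also fixes the column type
plain_defaults = [0.0, -74.0, 40.0, -74.0, 40.7, 1.0, 'nokey']

def decode_line(line):
    """Split one CSV line into a (features, label) pair."""
    fields = line.split(',')
    # Convert each field to the type of its default; empty fields use the default
    values = [type(d)(f) if f != '' else d
              for f, d in zip(fields, plain_defaults)]
    features = dict(zip(columns, values))
    label = features.pop(label_name)
    return features, label

# Made-up row with a missing dropofflat value
features, label = decode_line('12.5,-73.98,40.75,-73.97,,2,row0')
print(label)                   # 12.5
print(features['dropofflat'])  # 40.7
```

After the call, `features` no longer contains `fare_amount`, so the model never sees the label as an input.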

### 2. Refactor the way features are created

For now, pass the inputs through unchanged (as in the previous notebook). However, refactoring this way will enable us to break the one-to-one relationship between inputs and features.

In [0]:
input_columns = [
    tf.feature_column.numeric_column('pickuplon'),
    tf.feature_column.numeric_column('pickuplat'),
    tf.feature_column.numeric_column('dropofflat'),
    tf.feature_column.numeric_column('dropofflon'),
    tf.feature_column.numeric_column('passengers'),
]

def add_more_features(feats):
  # Nothing to add
  return feats

feature_cols = add_more_features(input_columns)
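As a rough idea of what breaking the one-to-one relationship could look like, the plain-Python sketch below derives three extra features from the four coordinate inputs. This is an assumption for illustration only: the function `add_derived_features` and the feature names `latdiff`, `londiff`, and `euclidean` are hypothetical and are not part of this notebook's pipeline, which still passes inputs through unchanged.

```python
import math

def add_derived_features(row):
    """Return a copy of row with hypothetical derived features added."""
    out = dict(row)
    out['latdiff'] = row['pickuplat'] - row['dropofflat']
    out['londiff'] = row['pickuplon'] - row['dropofflon']
    # Straight-line distance in degrees; a crude proxy for trip length
    out['euclidean'] = math.hypot(out['latdiff'], out['londiff'])
    return out

# Made-up input row with the five numeric inputs used above
row = {'pickuplon': -74.0, 'pickuplat': 40.0,
       'dropofflon': -74.03, 'dropofflat': 40.04, 'passengers': 1.0}
enriched = add_derived_features(row)
print(sorted(set(enriched) - set(row)))
```

Five inputs become eight features here, so the number of features no longer has to match the number of input columns.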

## Create and train the model

Note that we train on num_steps * batch_size examples.
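As a quick worked example of that arithmetic, using the values that appear in this notebook (steps = 100 in the call to model.train and the default batch_size = 512 in read_dataset):

```python
steps = 100        # value passed to model.train
batch_size = 512   # default batch size in read_dataset
examples_seen = steps * batch_size
print(examples_seen)  # 51200
```

Because the training dataset repeats indefinitely, this count includes repeated rows whenever the file holds fewer than 51,200 examples.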

In [9]:
tf.logging.set_verbosity(tf.logging.INFO)
outdir = 'taxi_trained'
shutil.rmtree(outdir, ignore_errors = True)
model = tf.estimator.LinearRegressor(
  feature_columns = feature_cols, model_dir = outdir)

model.train(input_fn = get_train(), steps = 100)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'taxi_trained', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fc736baacf8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Calling model_fn.
Instructions for updating:
Use tf.cast instead.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create C

<tensorflow_estimator.python.estimator.canned.linear.LinearRegressor at 0x7fc736baaf60>