# TensorFlow Estimators

```{admonition} Attribution
This notebook builds on Chapter 14: *Going Deeper – The Mechanics of TensorFlow* of {cite}`RaschkaMirjalili2019`.
```

In this notebook, we will work with TensorFlow Estimators. The `tf.estimator` API encapsulates the underlying steps in machine learning tasks, such as training, prediction (inference), and evaluation. Estimators are more encapsulated but also more scalable when compared to the previous approaches that we have covered above. Also, the `tf.estimator` API adds support for running models on multiple platforms without requiring major code changes, which makes them more suitable for the so-called "production phase" in industry applications. 

TensorFlow comes with a selection of off-the-shelf estimators for common machine learning and deep learning architectures that are useful for comparison studies, for example, to quickly assess whether a certain approach is applicable to a particular dataset or problem. Besides using pre-made Estimators, we can also create an Estimator by converting a Keras model to an Estimator.

In [18]:
import tensorflow as tf

print(tf.__version__)
print(tf.config.list_physical_devices('GPU'))

2.7.0
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


## Working with feature columns

In machine learning and deep learning applications, we can encounter various
different types of features: continuous, unordered categorical (nominal), and ordered categorical (ordinal). Note that while numeric data can be either continuous or discrete, in the context of the TensorFlow API, "numeric" data specifically refers to continuous data of the floating point type.

 Sometimes, feature sets are comprised of a mixture of different feature types. While TensorFlow Estimators were designed to handle all these different types of features, we must specify how each feature should be interpreted by the Estimator.

### Auto MPG dataset

```{figure} ../../img/feature_cols.png 
---
width: 60em
name: feature_cols
---
Assigning types to feature columns from the Auto MPG dataset.

```

To demonstrate the use of TF Estimators, we use the [Auto MPG dataset](https://archive.ics.uci.edu/ml/datasets/auto+mpg). We are going to treat five features from the Auto MPG dataset (*number of cylinders*, *displacement*, *horsepower*, *weight*, and *acceleration*) as numeric (i.e. continuous) features. The *model year* can be regarded as an ordered categorical feature. Lastly, the *manufacturing origin* can be regarded as an unordered categorical feature with three possible discrete values, 1, 2, and 3, which correspond to the US, Europe, and Japan, respectively. {numref}`feature_cols` above shows how we will treat these feature columns. 

In [19]:
import pandas as pd
import numpy as np

dataset_path = tf.keras.utils.get_file(
    "auto-mpg.data",
    "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
)

column_names = [
    "MPG", "Cylinders", "Displacement",
    "Horsepower", "Weight", "Acceleration",
    "ModelYear", "Origin"
]

# Load dataset; drop missing values
df = pd.read_csv(dataset_path, 
    names=column_names, 
    na_values="?", 
    comment="\t", 
    sep=" ", 
    skipinitialspace=True)

print("Shape:", df.shape)
print("No. of missing values:")
print(df.isna().sum())

# For simplicity drop rows with missing values.
df = df.dropna().reset_index(drop=True)
df.tail()

Shape: (398, 8)
No. of missing values:
MPG             0
Cylinders       0
Displacement    0
Horsepower      6
Weight          0
Acceleration    0
ModelYear       0
Origin          0
dtype: int64


Unnamed: 0,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,ModelYear,Origin
387,27.0,4,140.0,86.0,2790.0,15.6,82,1
388,44.0,4,97.0,52.0,2130.0,24.6,82,2
389,32.0,4,135.0,84.0,2295.0,11.6,82,1
390,28.0,4,120.0,79.0,2625.0,18.6,82,1
391,31.0,4,119.0,82.0,2720.0,19.4,82,1


Splitting the dataset and standardizing numerical columns:

In [20]:
import sklearn
import sklearn.model_selection

df_train, df_test = sklearn.model_selection.train_test_split(df, train_size=0.8)
train_stats = df_train.describe().transpose()
train_stats

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
MPG,313.0,23.603834,7.863563,9.0,17.5,23.0,29.0,46.6
Cylinders,313.0,5.412141,1.69637,3.0,4.0,4.0,8.0,8.0
Displacement,313.0,190.944089,103.351455,68.0,98.0,144.0,260.0,455.0
Horsepower,313.0,103.099042,37.381433,46.0,75.0,92.0,120.0,225.0
Weight,313.0,2959.71246,859.175108,1613.0,2205.0,2745.0,3574.0,4997.0
Acceleration,313.0,15.704792,2.704675,8.0,14.0,15.5,17.4,24.8
ModelYear,313.0,75.968051,3.660111,70.0,73.0,76.0,79.0,82.0
Origin,313.0,1.587859,0.812234,1.0,1.0,1.0,2.0,3.0


In [21]:
numeric_column_names = [
    'Cylinders', 
    'Displacement', 
    'Horsepower', 
    'Weight', 
    'Acceleration'
]

df_train_norm, df_test_norm = df_train.copy(), df_test.copy()
for col_name in numeric_column_names:
    train_mean = train_stats.loc[col_name, 'mean']
    train_std  = train_stats.loc[col_name, 'std']
    df_train_norm.loc[:, col_name] = (df_train_norm.loc[:, col_name] - train_mean) / train_std
    df_test_norm.loc[:, col_name] = (df_test_norm.loc[:, col_name] - train_mean) / train_std
    
df_train_norm.tail()

Unnamed: 0,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,ModelYear,Origin
264,30.0,-0.832448,-0.899301,-0.938943,-0.936611,0.294012,78,1
71,15.0,1.525527,1.093898,1.254659,1.085096,-1.184909,72,1
202,32.0,-0.832448,-1.025086,-0.885441,-1.128655,0.478877,76,3
74,14.0,1.525527,1.229358,1.254659,1.300419,-0.630313,72,1
189,22.0,0.34654,0.329516,-0.082903,0.318081,-0.112691,76,1


### Numeric features

In the following code, we will use TensorFlow's `feature_column` function
to transform the 5 continuous features into the feature column data structure that
TensorFlow Estimators can work with:

In [29]:
numeric_features = []
for col_name in numeric_column_names:
    feature_column = tf.feature_column.numeric_column(key=col_name)
    numeric_features.append(feature_column)

print(numeric_features)

[NumericColumn(key='Cylinders', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), NumericColumn(key='Displacement', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), NumericColumn(key='Horsepower', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), NumericColumn(key='Weight', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), NumericColumn(key='Acceleration', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)]


### Bucketized features

Next, let's group the rather fine-grained model year information into buckets to
simplify the learning task for the model that we are going to train later. Note that we assign `boundaries=[73, 76, 79]` which results in left-closed partitioning of the real line into 4 intervals `(-∞, 73)`, `[73, 76)`, `[76, 79)`, and `[79, +∞)`.

In [28]:
feature_year = tf.feature_column.numeric_column(key="ModelYear")
bucketized_column = tf.feature_column.bucketized_column(
    source_column=feature_year,
    boundaries=[73, 76, 79]
)

# For consistency, we create list of bucketized features
bucketized_features = [] 
bucketized_features.append(bucketized_column)
print(bucketized_features)

[BucketizedColumn(source_column=NumericColumn(key='ModelYear', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), boundaries=(73, 76, 79))]


### Categorical indicator features

Next, we will proceed with defining a list for the unordered categorical feature,
`Origin`. Here we use `categorical_column_with_vocabulary_list` in `tf.feature_column` and provide a list of all possible category names as input. 

```{tip}
If the list of possible categories is too large, we can use `categorical_column_with_vocabulary_list` and provide a file that contains all the categories/words so that we do not have to store a list of all possible words in memory.  If the features are already associated
with an index of categories in the range `[0, num_categories)`, then we can use the
`categorical_column_with_identity` function. However,
in this case, the feature `Origin` is given as integer values `1`, `2`, `3` (as opposed to `0`, `1`, `2`), which does not match the requirement for categorical indexing.
```

In [25]:
print(df.Origin.unique())

[1 3 2]


In [30]:
feature_origin = tf.feature_column.categorical_column_with_vocabulary_list(
    key='Origin',
    vocabulary_list=[1, 2, 3]
)

```{margin}
Refer to the [official TensorFlow docs](https://www.tensorflow.org/api_docs/python/tf/feature_column) for other implemented feature columns such as hashed columns and crossed columns.
```

Certain Estimators, such as `DNNClassifier` and `DNNRegressor`, only accept so-called
"dense columns." Therefore, the next step is to convert the existing categorical feature column to such a dense column. There are two ways to do this: using an embedding column via `embedding_column` or an indicator column via `indicator_column`. We use the latter which converts the categorical indices to one-hot encoded vectors to convert the categorical column to a dense format:

In [32]:
indicator_column = tf.feature_column.indicator_column(feature_origin)

# For consistency, we create list of nominal features
categorical_indicator_features = []
categorical_indicator_features.append(indicator_column)
print(categorical_indicator_features)

[IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='Origin', vocabulary_list=(1, 2, 3), dtype=tf.int64, default_value=-1, num_oov_buckets=0))]


## Machine learning with pre-made estimators

### Input functions

We have to define an **input function** that 
processes the data and returns a TensorFlow dataset consisting of a tuple 
that contains the input features and the targets. Note that the features 
must be a dictionary format such that the keys match 
the names (or keys) of feature columns.

In [42]:
def train_input_fn(df_train, batch_size=8):
    df = df_train.copy()
    x_train, y_train = df, df.pop('MPG')
    dataset = tf.data.Dataset.from_tensor_slices((dict(x_train), y_train))

    # Shuffle, batch, and repeat the examples
    return dataset.shuffle(1000).batch(batch_size).repeat()

# Inspection
ds = train_input_fn(df_train_norm)
batch = next(iter(ds))
print('Keys:', batch[0].keys())
print('Batch Model Years:', batch[0]['ModelYear'])
print('Batch MPGs (targets):', batch[1].numpy())

Keys: dict_keys(['Cylinders', 'Displacement', 'Horsepower', 'Weight', 'Acceleration', 'ModelYear', 'Origin'])
Batch Model Years: tf.Tensor([71 74 78 76 72 74 72 75], shape=(8,), dtype=int64)
Batch MPGs (targets): [19.  16.  17.7 26.  13.  14.  12.  16. ]


Input function for evaluation:

In [43]:
def eval_input_fn(df_eval, batch_size=8):
    df = df_eval.copy()
    x_eval, y_eval = df, df.pop('MPG')
    dataset = tf.data.Dataset.from_tensor_slices((dict(x_eval), y_eval))

    # Shuffle, batch, and repeat the examples
    return dataset.shuffle(1000).batch(batch_size).repeat()

# Inspection
ds = eval_input_fn(df_test_norm)
batch = next(iter(ds))
print('Keys:', batch[0].keys())
print('Batch Model Years:', batch[0]['ModelYear'])
print('Batch MPGs (targets):', batch[1].numpy())

Keys: dict_keys(['Cylinders', 'Displacement', 'Horsepower', 'Weight', 'Acceleration', 'ModelYear', 'Origin'])
Batch Model Years: tf.Tensor([80 78 76 81 72 73 70 71], shape=(8,), dtype=int64)
Batch MPGs (targets): [41.5 21.6 29.  32.4 14.  12.  22.  23. ]


### Initializing the Estimator

Since predicting MPG values
is a typical regression problem, we will use `tf.estimator.DNNRegressor`. When
instantiating the regression Estimator, we will provide the list of feature columns
and specify the number of hidden units that we want to have in each hidden layer
using the argument `hidden_units`.

In [47]:
regressor = tf.estimator.DNNRegressor(
    feature_columns=(
        numeric_features + 
        bucketized_features + 
        categorical_indicator_features
    ),
    hidden_units=[32, 10],
    model_dir='models/autompg-dnnregressor/')

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'models/autompg-dnnregressor/', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


The other argument, `model_dir`, that we have provided specifies the directory
for saving model parameters. One of the advantages of Estimators is that they
automatically checkpoint the model during training, so that in case the training of
the model crashes for an unexpected reason, we can easily load
the last saved checkpoint and continue training from there. The checkpoints will also
be saved in the directory specified by `model_dir`.

### Training

The `.train()` method expects two arguments. The argument `input_fn` expects a callable that returns a batch of training examples. The `steps` which is the total number of SGD updates (or calls to the input function) is calculated as follows:

In [50]:
EPOCHS = 30
BATCH_SIZE = 8
total_steps = EPOCHS * int(np.ceil(len(df_train) / BATCH_SIZE))
print('Training Steps:', total_steps)

regressor.train(
    input_fn=lambda: train_input_fn(df_train_norm, batch_size=BATCH_SIZE),
    steps=total_steps
)

Training Steps: 1200
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


2022-02-07 11:42:32.791662: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-02-07 11:42:32.791686: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2022-02-07 11:42:32.800969: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
2022-02-07 11:42:32.897032: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
2022-02-07 11:42:32.905102: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
2022-02-07 11:42:32.90

INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
INFO:tensorflow:Saving checkpoints for 0 into models/autompg-dnnregressor/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
INFO:tensorflow:loss = 572.394, step = 0


2022-02-07 11:42:33.006831: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
2022-02-07 11:42:33.032947: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
2022-02-07 11:42:33.038938: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
2022-02-07 11:42:33.147668: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.


INFO:tensorflow:global_step/sec: 130.477
INFO:tensorflow:loss = 648.3938, step = 100 (0.767 sec)
INFO:tensorflow:global_step/sec: 128.294
INFO:tensorflow:loss = 560.362, step = 200 (0.779 sec)
INFO:tensorflow:global_step/sec: 124.645
INFO:tensorflow:loss = 762.4479, step = 300 (0.802 sec)
INFO:tensorflow:global_step/sec: 133.88
INFO:tensorflow:loss = 473.31888, step = 400 (0.747 sec)
INFO:tensorflow:global_step/sec: 138.36
INFO:tensorflow:loss = 530.813, step = 500 (0.722 sec)
INFO:tensorflow:global_step/sec: 146.483
INFO:tensorflow:loss = 530.0683, step = 600 (0.683 sec)
INFO:tensorflow:global_step/sec: 140.863
INFO:tensorflow:loss = 735.958, step = 700 (0.710 sec)
INFO:tensorflow:global_step/sec: 135.441
INFO:tensorflow:loss = 643.9301, step = 800 (0.738 sec)
INFO:tensorflow:global_step/sec: 136.189
INFO:tensorflow:loss = 615.75024, step = 900 (0.734 sec)
INFO:tensorflow:global_step/sec: 153.958
INFO:tensorflow:loss = 491.15604, step = 1000 (0.649 sec)
INFO:tensorflow:global_step/sec

<tensorflow_estimator.python.estimator.canned.dnn.DNNRegressorV2 at 0x17e9fb430>

```{note}

Recall that `model_dir` saves the checkpoints of the model during training. The last model can be loaded using the `warm_start_from` argument as follows:

```python
reloaded_regressor = tf.estimator.DNNRegressor(
    feature_columns=all_feature_columns,
    hidden_units=[32, 10],
    warm_start_from='models/autompg-dnnregressor/',
    model_dir='models/autompg-dnnregressor/'
)
```


### Evaluation

To evaluate performance, we use the `.evaluate` method:

In [54]:
eval_results = regressor.evaluate(
    input_fn=lambda: eval_input_fn(df_test_norm, batch_size=8)
)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2022-02-07T11:49:42
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from models/autompg-dnnregressor/model.ckpt-1200
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


2022-02-07 11:49:42.384064: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-02-07 11:49:42.384087: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2022-02-07 11:49:42.399882: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
2022-02-07 11:49:42.407763: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
2022-02-07 11:49:42.414157: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
2022-02-07 11:49:42.42

KeyboardInterrupt: 

In [None]:
print(eval_results)

In [17]:
pred_res = regressor.predict(input_fn=lambda: eval_input_fn(df_test_norm, batch_size=8))

print(next(iter(pred_res)))

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from models/autompg-dnnregressor/model.ckpt-80000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
{'predictions': array([4.7905107], dtype=float32)}


In [18]:
tf.get_logger().setLevel('ERROR')

In [19]:
boosted_tree = tf.estimator.BoostedTreesRegressor(
    feature_columns=all_feature_columns,
    n_batches_per_layer=20,
    n_trees=200)

boosted_tree.train(
    input_fn=lambda:train_input_fn(df_train_norm, batch_size=BATCH_SIZE))

eval_results = boosted_tree.evaluate(
    input_fn=lambda:eval_input_fn(df_test_norm, batch_size=8))

print(eval_results)

print('Average-Loss {:.4f}'.format(eval_results['average_loss']))


{'average_loss': 7.3268905, 'label/mean': 22.655697, 'loss': 7.2563562, 'prediction/mean': 22.736225, 'global_step': 24000}
Average-Loss 7.3269
