In this tutorial, you are going to apply Tensorflow on csv structured data.
* Load a csv file using Pandas.
* Build an input pipeline to batch and shuffle the rows using `tf.data` APIs.
* Map from columns in the csv file to features used to train the model via `feature_column`.
* Build, train, and evaluate a model using `TF2.Keras`.

In [0]:
!pip install -q tf-nightly tfds-nightly sklearn tensorflow-hub

In [1]:
import tensorflow as tf
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import tensorflow_hub as hub

print("Tensorflow Version: {}".format(tf.__version__))
print("GPU {} available.".format("is" if tf.config.experimental.list_physical_devices("GPU") else "not"))

Tensorflow Version: 2.2.0-dev20200313
GPU is available.


# The Dataset

We will use a [dataset](https://archive.ics.uci.edu/ml/datasets/heart+Disease) provided by the Cleveland Clinic Foundation for Heart Disease. We are going to predict whether a patient has heart disease using other features. Each row represents a patient.

|Column|Description|Feature Type|Data Type|
|--|--|--|--|
|Age|Age in years|Numerical|integer|
|Sex|(1 = male; 0 = female)|Categorical|integer|
|CP|Chest pain type (0, 1, 2, 3, 4)|Categorical|integer|
|Trestbpd|Resting blood pressure (in mm Hg on admission to the hospital)|Numerical|integer|
|Chol|Serum cholestoral in mg/dl|Numerical|integer|
|FBS|(fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)|Categorical|integer|
|RestECG|Resting electrocardiographic results (0, 1, 2)|Categorical|integer|
|Thalach|Maximum heart rate achieved|Numerical|integer|
|Exang|Exercise induced angina (1 = yes; 0 = no)|Categorical|integer|
|Oldpeak|ST depression induced by exercise relative to rest|Numerical|float|
|Slope|The slope of the peak exercise ST segment|Numerical|integer|
|CA|Number of major vessels (0-3) colored by flourosopy|Numerical|integer|
|Thal|3 = normal; 6 = fixed defect; 7 = reversable defect|Categorical|string|
|Target|Diagnosis of heart disease (1 = true; 0 = false)|Classification|integer|

Refer to Tensorflow.org (2020).

## Use Pandas to Create a Dataframe

In [2]:
URL = "https://storage.googleapis.com/applied-dl/heart.csv"
dataframe = pd.read_csv(URL)
dataframe.head(n=10)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,1,145,233,1,2,150,0,2.3,3,0,fixed,0
1,67,1,4,160,286,0,2,108,1,1.5,2,3,normal,1
2,67,1,4,120,229,0,2,129,1,2.6,2,2,reversible,0
3,37,1,3,130,250,0,0,187,0,3.5,3,0,normal,0
4,41,0,2,130,204,0,2,172,0,1.4,1,0,normal,0
5,56,1,2,120,236,0,0,178,0,0.8,1,0,normal,0
6,62,0,4,140,268,0,2,160,0,3.6,3,2,normal,1
7,57,0,4,120,354,0,0,163,1,0.6,1,0,normal,0
8,63,1,4,130,254,0,2,147,0,1.4,2,1,reversible,1
9,53,1,4,140,203,1,2,155,1,3.1,3,0,reversible,0


## Split into Sub-datasets

In [3]:
train, test = train_test_split(dataframe, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)
print("Train {}, Val {}, Test {}".format(len(train), len(val), len(test)))

Train 193, Val 49, Test 61


In [4]:
print("Keys: {}".format(list(dict(val).keys())))

Keys: ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target']


## Create the Input Pipeline using `tf.data`

Next, we will load this dataframe with `tf.data`. This will enable us to use feature columns as a bridge to map from columns in Pandas dataframe to Tensorflow features. This dataframe is small and works for accessing directly from the memory using the `tf.data.Dataset.from_tensor_slices` API.

In [0]:
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
  dataframe = dataframe.copy()  # not to change the origin dataframe
  labels = dataframe.pop('target')
  ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
  if shuffle:
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.batch(batch_size)
  return ds

In [0]:
batch_size = 5

train_ds = df_to_dataset(dataframe=train, shuffle=True, batch_size=batch_size)
val_ds = df_to_dataset(dataframe=val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(dataframe=test, shuffle=False, batch_size=batch_size)

## Use the Pipeline

In [7]:
for feature_batch, label_batch in val_ds.take(1):
  print("Each feature:", feature_batch.keys())
  print("A batch of age:", feature_batch['age'])
  print("A batch of label:", label_batch)

Each feature: dict_keys(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal'])
A batch of age: tf.Tensor([58 41 64 41 65], shape=(5,), dtype=int64)
A batch of label: tf.Tensor([1 0 0 0 0], shape=(5,), dtype=int64)


# Types of Feature Columns

Tensorflow provides many types of feature columns. Here we demo several types of feature columns and show how to transfrom a column from the dataframe. **The output of a feature column becomes the input to the model.**

In [0]:
# we use this batch to demostrate several types of feature columns
example_batch = next(iter(train_ds))[0]

In [0]:
def demo(feature_column):
  # Before applying the batch data into the layer built on a feature column, 
  # you have to wrap it with a `DenseFeatures`.
  feature_layer = tf.keras.layers.DenseFeatures(feature_columns=feature_column)
  print(feature_layer(example_batch).numpy())

## Numeric Columns

A numeric column is the simplest type of column. It is used to represent real-valued features. When using this type of column, the model will receive the value from the dataframe unchanged.

In [10]:
age = tf.feature_column.numeric_column(key="age")

# apply the batch data to the feature column
demo(age)



To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

[[43.]
 [44.]
 [54.]
 [54.]
 [55.]]


## Bucketized Columns

Often, you don't want to feed a number directly into the model, instead, split its value into different categories based on numeric ranges. For example, lots of numbers are aggregated in several ranges. The number outside the above groups gets worse predictions. In such a requirement, you can use a bucketized column.

In [11]:
# note that `bucketized_column` takes a `numeric_column` as the input
age_buckets = tf.feature_column.bucketized_column(
  source_column=age, 
  boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
demo(age_buckets)



To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

[[0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]]


The output of the bucketized column is the one-hot encoding vector. The number `1` indicates the value located in the corresponding range.

## Categorical Columns

In this dataset, lots of columns are represented as a string, like `Sex`, `CP`, `tha1` etc. Such column values you can't feed into the model directly, you have to transform them into a representing vector, for example, a one-hot encoding vector which you would see from the bucketized column.

In Tensorflow, you can create an `indicator_column` from a vocabulary list.

In [12]:
thal = tf.feature_column.categorical_column_with_vocabulary_list(
  key="thal", vocabulary_list=["fixed", "normal", "reversible"])

thal_encoding = tf.feature_column.indicator_column(
  categorical_column=thal)

# represented as a one-hot encoding
demo(thal_encoding)



To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

[[0. 0. 1.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]


## Embedding Columns

Suppose a column has multiple labels instead of a few possible strings or the number of the categories grow as the time passing by, the one-hot encoding from the indicator column is infeasible to train a model. You can use `embedding_column` instead. The `embedding_column` represents the categories in a dense and low-dimensional vector that can be trained over the data. **That means the similar categories would be closer or similar to each other.**

In [13]:
# use a vector with 8 dimensions to represent the categories
tha1_embedding = tf.feature_column.embedding_column(
  categorical_column=thal, dimension=8)

demo(tha1_embedding)



To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

[[-0.02072956 -0.6034192  -0.28819227  0.23554352  0.31263772  0.11758582
  -0.04030164 -0.16724992]
 [ 0.37743947 -0.16297846 -0.43093836 -0.28096676  0.10568228 -0.00959399
  -0.5185137   0.24998617]
 [ 0.37743947 -0.16297846 -0.43093836 -0.28096676  0.10568228 -0.00959399
  -0.5185137   0.24998617]
 [-0.02072956 -0.6034192  -0.28819227  0.23554352  0.31263772  0.11758582
  -0.04030164 -0.16724992]
 [ 0.37743947 -0.16297846 -0.43093836 -0.28096676  0.10568228 -0.00959399
  -0.5185137   0.24998617]]


## Hashed Feature Column

Another way to digest the categorical column is by creating a hashed feature column that represents a categorical column with a large number of values. This column takes an input string to calculate a hashed value and use this value to select one from the `hash_bucket_size` bucket to encode the string.

When using this column, you don't require to provide the size of the vocabularies. In general, the size of the hash bucket size is smaller than the vocabulary size.

In [14]:
tha1_hashed = tf.feature_column.categorical_column_with_hash_bucket(
  key='thal', hash_bucket_size=1000)

demo(tf.feature_column.indicator_column(categorical_column=tha1_hashed))



To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


## Crossed Feature Column

Combining features into a single feature, known as `feature crosses`, enable the model to learn separate weights for each combination of features. Note that not all combination is built for training (sometimes these combinations would be very large). Instead, it is backed by a `hashed column` that you can decide how large the table is.

**Note that the features used to build the crossing features must be the categorical columns.**

In [15]:
crossed_feature = tf.feature_column.crossed_column(
  keys=[age_buckets, thal], hash_bucket_size=1000)

demo(tf.feature_column.indicator_column(crossed_feature))



To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [16]:
demo(tf.feature_column.embedding_column(crossed_feature, dimension=8))



To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

[[-0.32155547  0.676893   -0.34823462 -0.47435904 -0.0615655  -0.45151976
   0.61088735  0.08262596]
 [ 0.26656768  0.460454   -0.15344864  0.4372799  -0.1085922  -0.28644493
   0.57062525  0.06734061]
 [-0.63851583 -0.53523624 -0.15856667 -0.00788063  0.4307777  -0.35051027
   0.19064869  0.46556914]
 [-0.04328196 -0.21871363  0.11917761 -0.07049485 -0.01515157 -0.19899617
   0.1439738  -0.06276788]
 [-0.5075828   0.10969061 -0.25707173 -0.5866927   0.44999468 -0.0087176
  -0.23546302  0.11422282]]


The overall concept is:

![](https://docs.google.com/drawings/d/e/2PACX-1vTdzmwY9KfkInv_OsXIR7Jzcn58FyDDRxwLMcNocgI2Bp6RepzL_ijamFTLDE33T_Kv1f6lTSM9oP9E/pub?w=960&h=720)

# Choose Which Columns to Use

In [0]:
feature_columns = []

# numeric cols
for header in ["age", "trestbps", "chol", "thalach", "oldpeak", "slope", "ca"]:
  feature_columns.append(tf.feature_column.numeric_column(key=header))

# bucketized cols
_age = tf.feature_column.numeric_column(key="age")
age_buckets = tf.feature_column.bucketized_column(
  source_column=_age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
feature_columns.append(age_buckets)

# indicator cols
thal = tf.feature_column.categorical_column_with_vocabulary_list(
  key="thal", vocabulary_list=["fixed", "normal", "reversible"])
thal_onehot = tf.feature_column.indicator_column(categorical_column=thal)
feature_columns.append(thal_onehot)

# embedding cols
tha1_embedding = tf.feature_column.embedding_column(
  categorical_column=thal, dimension=8)
feature_columns.append(tha1_embedding)

# crossed cols
crossed_feature = tf.feature_column.crossed_column(
  keys=[thal, age_buckets], hash_bucket_size=1000)
crossed_feature = tf.feature_column.indicator_column(
  categorical_column=crossed_feature)
feature_columns.append(crossed_feature)

## Create a Feature Layer

In [0]:
# create a DenseFeatures layer to input them to the Keras model
feature_layer = tf.keras.layers.DenseFeatures(feature_columns, trainable=True)

In [0]:
batch_size = 32

# create new pipelines for training, validating, and testing
train_ds = df_to_dataset(dataframe=train, batch_size=batch_size)
val_ds = df_to_dataset(dataframe=val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(dataframe=test, shuffle=False, batch_size=batch_size)

# Building, Compiling, and Training the Model

In [0]:
model = tf.keras.Sequential(layers=[
  feature_layer,
  tf.keras.layers.Dense(units=128, activation='relu'),
  tf.keras.layers.Dense(units=128, activation='relu'),
  tf.keras.layers.Dense(units=1)
])

In [0]:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

loss_function = tf.keras.losses.BinaryCrossentropy(from_logits=True)

loss_metrics = tf.keras.metrics.BinaryCrossentropy(from_logits=True)
accuracy_metrics = tf.keras.metrics.BinaryAccuracy()

val_loss_metrics = tf.keras.metrics.BinaryCrossentropy(from_logits=True)
val_acc_metrics = tf.keras.metrics.BinaryAccuracy()

In [0]:
@tf.function
def train_step(data, labels):
  with tf.GradientTape() as tape:
    predictions = model(data)
    losses = loss_function(y_true=labels, y_pred=predictions)
  
  # update loss and accuracy
  loss_metrics(labels, predictions)
  accuracy_metrics(labels, predictions)

  # update the model
  gradients = tape.gradient(losses, model.trainable_variables)
  optimizer.apply_gradients(zip(gradients, model.trainable_variables))

In [0]:
@tf.function
def val_step(data, labels):
  predictions = model(data)
  losses = loss_function(y_true=labels, y_pred=predictions)

  # update the loss and accuracy
  val_loss_metrics(labels, predictions)
  val_acc_metrics(labels, predictions)

In [24]:
EPOCHS = 10

for epoch in range(EPOCHS):
  loss_metrics.reset_states()
  accuracy_metrics.reset_states()
  val_loss_metrics.reset_states()
  val_acc_metrics.reset_states()

  for batch, (data, labels) in enumerate(train_ds):
    train_step(data, labels)

  for batch, (data, labels) in enumerate(val_ds):
    val_step(data, labels)

  print("Epoch {}, Loss {:.6f}, Accuracy {:.4f}, Val_Loss {:.6f}, Val_Acc {:.4f}".format(
    epoch + 1, 
    loss_metrics.result(), accuracy_metrics.result(), 
    val_loss_metrics.result(), val_acc_metrics.result()
  ))



To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

Epoch 1, Loss 1.639729, Accuracy 0.6741, Val_Loss 1.992476, Val_Acc 0.7261
Epoch 2, Loss 1.253586, Accuracy 0.6920, Val_Loss 1.022261, Val_Acc 0.5634
Epoch 3, Loss 1.075468, Accuracy 0.5893, Val_Loss 0.511441, Val_Acc 0.8612
Epoch 4, Loss 1.351238, Accuracy 0.4955, Val_Loss 0.894110, Val_Acc 0.7261
Epoch 5, Loss 0.548300, Accuracy 0.7946, Val_Loss 1.032243, Val_Acc 0.3208
Epoch 6, Loss 0.709480, Accuracy 0.6875, Val_Loss 1.085726, Val_Acc 0.7261
Epoch 7, Loss 1.042960, Accuracy 0.6161, Val_Loss 0.431756, Val_Acc 0.7574
Epoch 8, Loss 0.896738, Accuracy 0.6339, Val_Loss 0.551521, Val_Acc 0.8336
Epoch 9, Loss 0.767495, Accuracy 0.6295, Val_Loss 0.955928, Val_Acc 0.7261
Epoch 10, Loss 0.842166,

# Evaluation

Before evaluating the model, you have to add the input layer to the model. This step is similar to the previous steps of building the pipeline.

While building the test dataset (`test_ds`), splitting columns into the features and a target. Here we only require to add a layer consisting of the feature columns, that is, without the target column. No worry about the target column it would also be sent into the evaluation function as `y`.

Such an operation also helps to export the model for the prediction.

In [25]:
inputs = {key: tf.keras.layers.Input(shape=(), name=key, dtype=dataframe[key].dtype) for key in dataframe.keys()}
del inputs['target']  # target is already included in the `test_ds`
inputs

{'age': <tf.Tensor 'age:0' shape=(None,) dtype=int64>,
 'ca': <tf.Tensor 'ca:0' shape=(None,) dtype=int64>,
 'chol': <tf.Tensor 'chol:0' shape=(None,) dtype=int64>,
 'cp': <tf.Tensor 'cp:0' shape=(None,) dtype=int64>,
 'exang': <tf.Tensor 'exang:0' shape=(None,) dtype=int64>,
 'fbs': <tf.Tensor 'fbs:0' shape=(None,) dtype=int64>,
 'oldpeak': <tf.Tensor 'oldpeak:0' shape=(None,) dtype=float64>,
 'restecg': <tf.Tensor 'restecg:0' shape=(None,) dtype=int64>,
 'sex': <tf.Tensor 'sex:0' shape=(None,) dtype=int64>,
 'slope': <tf.Tensor 'slope:0' shape=(None,) dtype=int64>,
 'thal': <tf.Tensor 'thal:0' shape=(None,) dtype=string>,
 'thalach': <tf.Tensor 'thalach:0' shape=(None,) dtype=int64>,
 'trestbps': <tf.Tensor 'trestbps:0' shape=(None,) dtype=int64>}

You can wrap the model into a bigger model using the `tensorflow_hub.KerasLayer` API to treat it as a layer. Instead, you can simply pass the inputs into the model to regard them as the placeholders.

In [0]:
def build_model(inputs):
  y = hub.KerasLayer(model, trainable=False)(inputs)
  return y

# you can wrap the model using a KerasLayer
outputs = build_model(inputs)

In [0]:
# you can simply pass the inputs to the model
outputs = model(inputs)

In [0]:
bind_model = tf.keras.Model(inputs=inputs, outputs=outputs)
bind_model.compile(loss=loss_function, 
                   metrics=[tf.keras.metrics.BinaryAccuracy()])

In [33]:
final_loss, final_acc = bind_model.evaluate(test_ds)
print("Test Loss: {:.6f}, Test Accuracy: {:.4f}".format(final_loss, final_acc))

Test Loss: 0.446674, Test Accuracy: 0.7869
