# Structured data classification

This tutorial is mainly based on the Keras tutorial ["Structured data classification from scratch"](https://keras.io/examples/structured_data/structured_data_classification_from_scratch/) by François Chollet and ["Classify structured data using Keras preprocessing layers"](https://www.tensorflow.org/tutorials/structured_data/preprocessing_layers) by TensorFlow.

## Setup

In [42]:
import numpy as np
import pandas as pd

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

tf.__version__

'2.7.1'

## Data


Here's the description of each feature:

Column| Description| Feature Type
------------|--------------------|----------------------
Age | Age in years | Numerical
Sex | (1 = male; 0 = female) | Categorical
CP | Chest pain type (0, 1, 2, 3, 4) | Categorical
Trestbps | Resting blood pressure (in mm Hg on admission) | Numerical
Chol | Serum cholesterol in mg/dl | Numerical
FBS | Fasting blood sugar > 120 mg/dl (1 = true; 0 = false) | Categorical
RestECG | Resting electrocardiogram results (0, 1, 2) | Categorical
Thalach | Maximum heart rate achieved | Numerical
Exang | Exercise induced angina (1 = yes; 0 = no) | Categorical
Oldpeak | ST depression induced by exercise relative to rest | Numerical
Slope | Slope of the peak exercise ST segment | Numerical
CA | Number of major vessels (0-3) colored by fluoroscopy | Both numerical & categorical
Thal | normal; fixed defect; reversible defect | Categorical (string)
Target | Diagnosis of heart disease (1 = true; 0 = false) | Target

### Data import
Let's download the data and load it into a Pandas dataframe:

In [43]:
file_url = "http://storage.googleapis.com/download.tensorflow.org/data/heart.csv"
df = pd.read_csv(file_url)

In [44]:
df.head()

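| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0 | fixed | 0 |
| 1 | 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3 | normal | 1 |
| 2 | 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2 | reversible | 0 |
| 3 | 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0 | normal | 0 |
| 4 | 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0 | normal | 0 |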


In [None]:
df.info()

The following features are *continuous numerical features*:

- `age`
- `trestbps`
- `chol`
- `thalach`
- `oldpeak`
- `slope`

The following features are *categorical features* encoded as integers:

- `sex`
- `cp`
- `fbs`
- `restecg`
- `exang`
- `ca`

The following feature is a *categorical feature* encoded as a string:

- `thal`

### Initial data preparation

- Define outcome variable as y_label


In [45]:
y_label = 'target'

- Correct data format

In [46]:
# Convert to string
df['thal'] = df['thal'].astype("string")

In [47]:
# Convert to categorical
cat_convert = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'ca']

for i in cat_convert:
    df[i] = df[i].astype("category")

- Make lists of feature variables (without our label)

In [48]:
# Make list of all numerical data (except label)
list_num = df.drop(columns=[y_label]).select_dtypes(include=[np.number]).columns.tolist()

# Make list of all categorical data (except label)
list_cat = df.drop(columns=[y_label]).select_dtypes(include=['category']).columns.tolist()
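
To see how `select_dtypes` separates the two groups, here is a minimal sketch on a made-up frame. The values and the names `toy`, `toy_num`, and `toy_cat` are illustrative assumptions, not part of the heart dataset.

```python
import numpy as np
import pandas as pd

# Toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "age": [63, 67, 37],
    "oldpeak": [2.3, 1.5, 3.5],
    "sex": [1, 1, 0],
    "target": [0, 1, 0],
})

# Mark the integer-coded column as categorical.
toy["sex"] = toy["sex"].astype("category")

# Drop the label, then select columns by dtype.
features = toy.drop(columns=["target"])
toy_num = features.select_dtypes(include=[np.number]).columns.tolist()
toy_cat = features.select_dtypes(include=["category"]).columns.tolist()

print(toy_num)  # ['age', 'oldpeak']
print(toy_cat)  # ['sex']
```

Converting a column to `category` removes it from the `np.number` selection, so the conversion has to happen before the lists are built. A column with the `string` dtype, such as `thal`, ends up in neither list.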

### Data splitting

Let's split the data into a training and validation set:

In [50]:
df_val = df.sample(frac=0.2, random_state=1337)
df_train = df.drop(df_val.index)

In [51]:
print(
    "Using %d samples for training and %d for validation"
    % (len(df_train), len(df_val))
)

Using 242 samples for training and 61 for validation
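
The split pattern used above can be checked on a small example. This is a minimal sketch with a made-up ten-row frame (`toy`, `val`, and `train` are hypothetical names): `sample` draws 20% of the rows at random, and `drop` keeps every row whose index was not sampled.

```python
import pandas as pd

# Toy frame with ten rows; the values are made up for illustration.
toy = pd.DataFrame({"x": range(10), "target": [0, 1] * 5})

# Draw 20% of the rows at random (reproducible through random_state).
val = toy.sample(frac=0.2, random_state=1337)

# Keep all rows that were not sampled for validation.
train = toy.drop(val.index)

print(len(train), len(val))  # 8 2
```

Because `drop` works on the index labels, the approach assumes that the index is unique, which is the case for the default integer index created by `pd.read_csv`.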


### Transform to Tensors

- Let's generate [`tf.data.Dataset`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) objects for our training and validation dataframes
- The utility function converts each training and validation set into a tf.data.Dataset, then shuffles and batches the data.

In [65]:
# Define a function to create our tensors

def dataframe_to_dataset(dataframe, shuffle=True, batch_size=32):
    df = dataframe.copy()
    # Separate the label column from the features
    labels = df.pop(y_label)
    ds = tf.data.Dataset.from_tensor_slices((dict(df), labels))
    # Shuffle only when requested
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    # Prefetch upcoming batches while the current one is being processed
    ds = ds.prefetch(batch_size)
    return ds

- Now, call `dataframe_to_dataset` on the training data to check the format of the data that the input pipeline returns. A small batch size keeps the output readable:

In [55]:
batch_size = 5

train_ds = dataframe_to_dataset(df_train, batch_size=batch_size)

- Let's take a look at the data:

In [56]:
[(train_features, label_batch)] = train_ds.take(1)

print('Every feature:', list(train_features.keys()))
print('A batch of ages:', train_features['age'])
print('A batch of targets:', label_batch )

Every feature: ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']
A batch of ages: tf.Tensor([45 41 39 65 69], shape=(5,), dtype=int64)
A batch of targets: tf.Tensor([0 0 0 1 1], shape=(5,), dtype=int64)


- As the output demonstrates, the training set returns a dictionary of column names (from the DataFrame) that map to column values from rows.

- Batching (combining several samples into one element) already happens inside `dataframe_to_dataset`. Unless `batch_size` is set explicitly, the function uses a mini-batch size of 32.
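
The dictionary structure described above can be reproduced on a toy frame. The following is a minimal sketch, assuming TensorFlow 2.x is installed; the frame `toy` and its values are made up, and shuffling is left out so that the first batch is predictable.

```python
import pandas as pd
import tensorflow as tf

# Toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "age": [63, 67, 37, 41],
    "chol": [233, 286, 250, 204],
    "target": [0, 1, 0, 0],
})
labels = toy.pop("target")

# Each element is a (features, label) pair; features is a dict keyed by column name.
ds = tf.data.Dataset.from_tensor_slices((dict(toy), labels)).batch(2)

features, label_batch = next(iter(ds))
print(sorted(features.keys()))   # ['age', 'chol']
print(features["age"].numpy())   # [63 67]
print(label_batch.numpy())       # [0 1]
```

Every value in the dictionary is a tensor whose first dimension is the batch size, which is why a single batch of `age` holds as many entries as there are samples in the batch.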

## Feature preprocessing

- Next, we define utility functions to do the feature preprocessing operations.

In this tutorial, you will use the following preprocessing layers to demonstrate how to perform preprocessing, structured data encoding, and feature engineering:

- tf.keras.layers.Normalization: Performs feature-wise normalization of input features.

- tf.keras.layers.CategoryEncoding: Turns integer categorical features into one-hot, multi-hot, or tf-idf dense representations.

- tf.keras.layers.StringLookup: Turns string categorical values into integer indices.

- tf.keras.layers.IntegerLookup: Turns integer categorical values into integer indices.


### Numerical preprocessing function

- Define a new utility function that returns a layer which applies feature-wise normalization to numerical features using the `Normalization` preprocessing layer:

In [58]:
# Define numerical preprocessing function
def get_normalization_layer(name, dataset):
    
    # Create a Normalization layer for our feature
    normalizer = layers.Normalization(axis=None)

    # Prepare a dataset that only yields our feature
    feature_ds = dataset.map(lambda x, y: x[name])

    # Learn the statistics of the data
    normalizer.adapt(feature_ds)

    # Normalize the input feature
    return normalizer

- Next, test the new function by calling it on the `age` feature of the training data:

In [62]:
test_age_feature = train_features['age']

test_age_layer = get_normalization_layer('age', train_ds)

test_age_layer(test_age_feature)

<tf.Tensor: shape=(5,), dtype=float32, numpy=
array([-1.0903668, -1.542778 , -1.7689835,  1.1716888,  1.6241   ],
      dtype=float32)>
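
As a rough sketch of what the adapted layer computes, the snippet below standardizes a feature with NumPy. It assumes the layer applies `(x - mean) / sqrt(variance)` with the statistics learned during `adapt`; Keras additionally guards against division by zero, which is omitted here. The ages are made up, and `standardize` is a hypothetical helper, not a Keras function.

```python
import numpy as np

def standardize(values, mean, variance):
    """Shift by the mean and scale by the standard deviation."""
    return (values - mean) / np.sqrt(variance)

# Made-up ages that play the role of the data seen during `adapt`.
ages = np.array([45.0, 41.0, 39.0, 65.0, 69.0])

# "Adapting" here simply means computing the mean and (population) variance.
mean, variance = ages.mean(), ages.var()

normalized = standardize(ages, mean, variance)
print(normalized.round(3))
print(normalized.mean().round(6), normalized.std().round(6))
```

In this sketch the statistics come from the five values themselves, so the result has a mean of about 0 and a standard deviation of about 1. In the tutorial the statistics come from the whole training set, so a single batch will generally not be centered exactly.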

### Categorical preprocessing functions

- Define another utility function that returns a layer which maps values from a vocabulary to integer indices and multi-hot encodes the features, using the `tf.keras.layers.StringLookup`, `tf.keras.layers.IntegerLookup`, and `tf.keras.layers.CategoryEncoding` preprocessing layers:

In [59]:
def get_category_encoding_layer(name, dataset, dtype, max_tokens=None):
  
  # Create a layer that turns strings into integer indices.
  if dtype == 'string':
    index = layers.StringLookup(max_tokens=max_tokens)
  # Otherwise, create a layer that turns integer values into integer indices.
  else:
    index = layers.IntegerLookup(max_tokens=max_tokens)

  # Prepare a `tf.data.Dataset` that only yields the feature.
  feature_ds = dataset.map(lambda x, y: x[name])

  # Learn the set of possible values and assign them a fixed integer index.
  index.adapt(feature_ds)

  # Encode the integer indices.
  encoder = layers.CategoryEncoding(num_tokens=index.vocabulary_size())

  # Apply multi-hot encoding to the indices. The lambda function captures the
  # layer, so you can use them, or include them in the Keras Functional model later.
  return lambda feature: encoder(index(feature))

- Test the `get_category_encoding_layer` function by calling it on the `thal` feature to turn its values into a multi-hot encoded tensor:

In [63]:
test_thal_feature = train_features['thal']

test_thal_layer = get_category_encoding_layer(name='thal',
                                              dataset=train_ds,
                                              dtype='string')
test_thal_layer(test_thal_feature)

<tf.Tensor: shape=(6,), dtype=float32, numpy=array([0., 1., 1., 0., 0., 0.], dtype=float32)>
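
The output above has shape `(6,)` rather than one row per sample, which is consistent with the 1-D batch being collapsed into a single multi-hot vector. The following pure-Python sketch loosely mirrors the two steps (lookup, then multi-hot encoding). It assumes the default lookup settings, where index 0 is reserved for unknown values and the remaining values are ordered by frequency. `build_vocabulary` and `multi_hot` are hypothetical helpers and the values are made up; this is not the Keras implementation.

```python
from collections import Counter

import numpy as np

def build_vocabulary(values, oov_token="[UNK]"):
    """Reserve index 0 for unknown values, then order values by frequency."""
    counts = Counter(values)
    return [oov_token] + [token for token, _ in counts.most_common()]

def multi_hot(batch, vocabulary):
    """Return one vector with a 1 at the index of every value present in the batch."""
    lookup = {token: i for i, token in enumerate(vocabulary)}
    encoded = np.zeros(len(vocabulary), dtype=np.float32)
    for value in batch:
        # Values that are not in the vocabulary fall back to index 0.
        encoded[lookup.get(value, 0)] = 1.0
    return encoded

# Made-up training values that play the role of the data seen during `adapt`.
train_values = ["normal", "normal", "normal", "reversible", "reversible", "fixed"]
vocab = build_vocabulary(train_values)

print(vocab)                                                 # ['[UNK]', 'normal', 'reversible', 'fixed']
print(multi_hot(["normal", "reversible", "normal"], vocab))  # [0. 1. 1. 0.]
```

A batch that contains only the two most frequent values sets exactly the positions 1 and 2, which matches the pattern of the tensor shown above.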

### Data preprocessing

Next, we will:

- Apply the preprocessing utility functions defined earlier on our numerical and categorical features 
- Add all the feature inputs to a list.

- Earlier, we used a small batch size to demonstrate the input pipeline. 
- Let's now create a new input pipeline with a larger batch size of 256:

In [66]:
batch_size = 256

# Build the training and validation pipelines used in the following cells
train_ds = dataframe_to_dataset(df_train, shuffle=True, batch_size=batch_size)
val_ds = dataframe_to_dataset(df_val, shuffle=True, batch_size=batch_size)


- Normalize the numerical features and add them to one list of inputs called `encoded_features`:

In [None]:
all_inputs = []
encoded_features = []

# Numerical features.
for feature in list_num:
  numeric_col = keras.Input(shape=(1,), name=feature)
  normalization_layer = get_normalization_layer(feature, train_ds)
  encoded_numeric_col = normalization_layer(numeric_col)
  all_inputs.append(numeric_col)
  encoded_features.append(encoded_numeric_col)

In [None]:
encoded_features

- Concatenate all encoded features into a single tensor:

In [None]:
all_features = layers.concatenate(encoded_features)

- Note that a `for` loop could also be used to automate some of the steps above.

## Model

Now we can build the model: 

1. The first layer has 32 units.
1. We use [layers.Dropout()](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout) to prevent overfitting.
1. The output layer has 1 unit, since the classification task is binary.
1. `keras.Model` groups the layers into an object with training and inference features.


In [None]:
# First layer
x = layers.Dense(32, activation="relu")(all_features)
# Dropout to prevent overfitting
x = layers.Dropout(0.5)(x)
# Output layer
output = layers.Dense(1, activation="sigmoid")(x)

# Group all layers 
model = keras.Model(all_inputs, output)
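
To make the role of the dropout layer more concrete, here is a minimal NumPy sketch of dropout with rescaling. It assumes the usual behavior: during training, each unit is zeroed with probability `rate` and the surviving units are scaled by `1 / (1 - rate)`, while at inference the input passes through unchanged. The `dropout` function is a hypothetical helper, not the Keras implementation.

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Zero a random subset of units during training and rescale the rest."""
    if not training:
        # At inference time the activations pass through unchanged.
        return x
    # Keep each unit with probability (1 - rate).
    keep_mask = rng.random(x.shape) >= rate
    # Rescale so that the expected sum of the activations stays the same.
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones(8)

dropped = dropout(activations, rate=0.5, training=True, rng=rng)
unchanged = dropout(activations, rate=0.5, training=False, rng=rng)

print(dropped)    # every entry is either 0.0 or 2.0
print(unchanged)  # identical to the input
```

With a rate of 0.5, as in the model above, roughly half of the 32 hidden units are silenced in each training step, which discourages the network from relying on any single unit.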

In [None]:
model.compile(optimizer="adam", 
              loss ="binary_crossentropy", 
              metrics=["accuracy"])
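
The loss and metric chosen above can be sketched in a few lines of NumPy. This is a simplified illustration, assuming the standard definitions of the sigmoid function, binary cross-entropy, and accuracy with a 0.5 threshold. The helper functions and the example values are made up, and the Keras implementations differ in numerical details.

```python
import numpy as np

def sigmoid(z):
    """Map a raw score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels."""
    # Clip to avoid log(0).
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(np.mean(losses))

# Made-up raw outputs of the last layer (before the sigmoid) and true labels.
logits = np.array([2.0, -1.0, 0.5])
labels = np.array([1.0, 0.0, 1.0])

probs = sigmoid(logits)
loss = binary_crossentropy(labels, probs)

# A prediction counts as class 1 when its probability exceeds 0.5.
accuracy = float(np.mean((probs > 0.5) == (labels == 1.0)))

print(probs.round(3))
print(round(loss, 4), accuracy)
```

The loss shrinks as the predicted probabilities move toward the true labels, whereas accuracy only changes when a prediction crosses the threshold. This is one reason why the loss is used for optimization and accuracy for reporting.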

Let's visualize our connectivity graph:

In [None]:
# `rankdir='LR'` is to make the graph horizontal.
keras.utils.plot_model(model, show_shapes=True, rankdir="LR")

## Training

In [None]:
model.fit(train_ds, epochs=10, validation_data=val_ds)

## Predictions

In [None]:
sample = {
    "age": 60,
    "sex": 1,
    "cp": 1,
    "trestbps": 145,
    "chol": 233,
    "fbs": 1,
    "restecg": 2,
    "thalach": 150,
    "exang": 0,
    "oldpeak": 2.3,
    "slope": 3,
    "ca": 0,
    "thal": "fixed",
}

input_dict = {name: tf.convert_to_tensor([value]) for name, value in sample.items()}

predictions = model.predict(input_dict)

print(
    "This particular patient had a %.1f percent probability "
    "of having a heart disease, as evaluated by our model." % (100 * predictions[0][0],)
)