<a href="https://colab.research.google.com/github/nigoda/machine_learning/blob/main/15_Classify_structure_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Classify Structured Data**

This notebook shows how to build a classifier for structured data (e.g. tabular data in a CSV file). We use Keras to define the model, and feature columns as a bridge to map from the columns in the CSV to the features used to train the model. The notebook covers the following steps:



*   Load a CSV file using pandas
*   Build an input pipeline to batch and shuffle the rows using tf.data
*   Map from columns in the CSV to feature columns.
*   Build, train, and evaluate a model using Keras.

*Dataset: Cleveland Clinic Foundation heart disease dataset*

### **Import TensorFlow and other libraries**

We will use scikit-learn (`sklearn`) to split the data into training and test sets.

In [None]:
!pip install scikit-learn



In [None]:
from __future__ import absolute_import, division, print_function, unicode_literals

import numpy as np
import pandas as pd

!pip install tensorflow==2.0.0-beta1
import tensorflow as tf

from tensorflow import feature_column
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split



Use Pandas to create a dataframe

In [None]:
URL = "https://storage.googleapis.com/download.tensorflow.org/data/heart.csv"
dataframe = pd.read_csv(URL)
dataframe.head()

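|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0 | fixed | 0 |
| 1 | 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3 | normal | 1 |
| 2 | 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2 | reversible | 0 |
| 3 | 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0 | normal | 0 |
| 4 | 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0 | normal | 0 |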


Split the dataframe into train, validation, and test

In [None]:
train, test = train_test_split(dataframe, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)
print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')

193 train examples
49 validation examples
61 test examples
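
The three counts above follow from applying a 20% hold-out twice. The sketch below reproduces the arithmetic in plain Python. It assumes the CSV has 303 rows (the sum of the three counts) and that a fractional `test_size` is rounded up to a whole number of rows; `split_sizes` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test share up."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test

n_rows = 303  # assumption: 193 + 49 + 61 rows in heart.csv

# First split: hold out the test set.
n_train_val, n_test = split_sizes(n_rows, 0.2)
# Second split: hold out the validation set from what remains.
n_train, n_val = split_sizes(n_train_val, 0.2)

print(n_train, 'train examples')
print(n_val, 'validation examples')
print(n_test, 'test examples')
```

Because the second split is taken from the remaining 80%, the validation set is roughly 16% of the full dataset rather than 20%.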


### **Create an input pipeline using tf.data**

Next, we will wrap the dataframes with `tf.data`. This will enable us to use feature columns as a bridge to map from the columns in the Pandas dataframe to the features used to train the model. If we were working with a very large CSV file (so large that it does not fit into memory), we would use `tf.data` to read it from disk directly.

In [None]:
# A utility method to create a tf.data dataset from a Pandas Dataframe
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
  dataframe = dataframe.copy()
  labels = dataframe.pop('target')
  ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
  if shuffle:
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.batch(batch_size)
  return ds

Let's walk through what this function does:

1.   Copy the input dataframe so that the caller's dataframe is not modified.
2.   Pop the label column with the `pop` method, which returns the `target` column and removes it from the dataframe.
3.   Create a dataset from tensor slices, built from a dictionary representation of the dataframe and the label column.
4.   Shuffle the dataset if `shuffle` is `True`.
5.   Group the rows into batches of the specified size and return the dataset.
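
To make the shuffle-then-batch order concrete, here is a minimal sketch that applies the same two steps to a plain Python list instead of a `tf.data` dataset. `shuffle_and_batch` and its `seed` parameter are hypothetical names used only for this illustration.

```python
import random

def shuffle_and_batch(rows, batch_size, shuffle=True, seed=None):
    """Mimic the shuffle-then-batch order of df_to_dataset on a plain list."""
    rows = list(rows)  # copy so the caller's data is left untouched
    if shuffle:
        random.Random(seed).shuffle(rows)
    # Slice into consecutive batches; the last one may be smaller.
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]

# 193 row indices stand in for the training dataframe.
batches = shuffle_and_batch(range(193), batch_size=5, seed=0)
print(len(batches), len(batches[0]), len(batches[-1]))
```

With 193 rows and a batch size of 5, the final batch holds only the 3 leftover rows. This mirrors the default behaviour of `Dataset.batch`, which keeps the remainder unless `drop_remainder=True` is passed.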



In [None]:
batch_size = 5 # A small batch size is used for demonstration purposes
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)

### **Understand the input pipeline**

Now that we have created the input pipeline, let's call it to see the format of the data it returns. We have used a small batch size to keep the output readable.

In [None]:
for feature_batch, label_batch in train_ds.take(1):
  print('Every feature:',list(feature_batch.keys()))
  print('A batch of ages: ', feature_batch['age'])
  print('A batch of target: ', label_batch)

Every feature: ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']
A batch of ages:  tf.Tensor([54 56 58 50 58], shape=(5,), dtype=int32)
A batch of target:  tf.Tensor([0 0 1 0 0], shape=(5,), dtype=int32)


### **Demonstrate several types of feature columns**

TensorFlow provides many types of feature columns. In this section, we will create several types of feature columns, and demonstrate how they transform a column from the dataframe.

In [None]:
# We will use this batch to demonstrate several types of feature columns.
example_batch = next(iter(train_ds))[0]

In [None]:
# A utility method to create a feature column and to transform a batch of data.
def demo(feature_column):
  feature_layer = layers.DenseFeatures(feature_column)
  print(feature_layer(example_batch).numpy())

### **Numeric columns**

The output of a feature column becomes the input to the model (using the `demo` function defined above, we will be able to see exactly how each column from the dataframe is transformed). A `numeric column` is the simplest type of column. It is used to represent real-valued features. When using this column, your model will receive the column value from the dataframe unchanged.

In [None]:
age = feature_column.numeric_column("age")
demo(age)

[[54.]
 [56.]
 [58.]
 [50.]
 [58.]]


### **Bucketized columns**
Often, you don't want to feed a number directly into the model, but instead split its value into different categories based on numerical ranges. Consider raw data that represents a person's age. Instead of representing age as a numeric column, we could split the age into several buckets using a `bucketized column`. Notice that the one-hot values below describe which age range each row matches.

In [None]:
age_buckets = feature_column.bucketized_column(age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
demo(age_buckets)

[[0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]]
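
The mapping from an age to its one-hot position can be sketched in plain Python. The example assumes each bucket includes its left boundary and excludes its right one, so 10 boundaries yield 11 buckets; `bucketize` is a hypothetical helper, not a TensorFlow function.

```python
from bisect import bisect_right

boundaries = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]

def bucketize(value, boundaries):
    """One-hot encode value into len(boundaries) + 1 buckets.

    Each bucket includes its left boundary and excludes its right one.
    """
    index = bisect_right(boundaries, value)
    one_hot = [0.0] * (len(boundaries) + 1)
    one_hot[index] = 1.0
    return one_hot

# The ages from the example batch shown earlier.
for age_value in [54, 56, 58, 50, 58]:
    print(age_value, bucketize(age_value, boundaries))
```

Ages 54 and 50 both fall in the bucket starting at 50 (position 7), while 56 and 58 fall in the bucket starting at 55 (position 8), which matches the rows printed above.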


### **Categorical columns**

In this dataset, `thal` is represented as a string (e.g. 'fixed', 'normal', or 'reversible'). We cannot feed strings directly to a model. Instead, we must first map them to numeric values. The categorical vocabulary columns provide a way to represent strings as one-hot vectors (much like you have seen above with age buckets). The vocabulary can be passed as a list using `categorical_column_with_vocabulary_list`, or loaded from a file using `categorical_column_with_vocabulary_file`.


In [None]:
thal = feature_column.categorical_column_with_vocabulary_list(
    'thal',['fixed','normal','reversible'])
thal_one_hot = feature_column.indicator_column(thal)
demo(thal_one_hot)

[[0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 1. 0.]]
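
As a rough illustration of the vocabulary lookup, the sketch below builds the same kind of one-hot row in plain Python. `one_hot` is a hypothetical helper, and returning an all-zero row for a string outside the vocabulary is a choice made for this sketch.

```python
vocabulary = ['fixed', 'normal', 'reversible']

def one_hot(value, vocabulary):
    """Return a one-hot list for value; all zeros if value is not in the vocabulary."""
    return [1.0 if value == word else 0.0 for word in vocabulary]

for value in ['normal', 'fixed', 'reversible', 'unknown']:
    print(value, one_hot(value, vocabulary))
```

The position of the 1 is the index of the string in the vocabulary list, so the order of the list determines the layout of the vector.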


### **Embedding columns**

Suppose that instead of having just a few possible strings, we have thousands (or more) of values per category. For a number of reasons, as the number of categories grows large, it becomes infeasible to train a neural network using one-hot encodings. We can use an embedding column to overcome this limitation. Instead of representing the data as a one-hot vector of many dimensions, an `embedding column` represents that data as a lower-dimensional, dense vector in which each cell can contain any number, not just 0 or 1. The size of the embedding (8, in the example below) is a parameter that must be tuned.

Key point: using an embedding column is best when a categorical column has many possible values. We are using one here for demonstration purposes, so you have a complete example you can modify for a different dataset in the future.

In [None]:
# Notice the input to the embedding column is the categorical column
# we previously created
thal_embedding = feature_column.embedding_column(thal, dimension=8)
demo(thal_embedding)

[[-0.30729973 -0.39098382 -0.35466993 -0.66526955  0.2597974  -0.23162203
   0.6121564  -0.15612991]
 [ 0.12556003 -0.22928041 -0.3129179   0.16608681  0.19093558  0.6859717
  -0.5167214  -0.22028807]
 [-0.00785702  0.62338656  0.52729785  0.0894713   0.30477864 -0.11983582
   0.19623291  0.09423055]
 [-0.00785702  0.62338656  0.52729785  0.0894713   0.30477864 -0.11983582
   0.19623291  0.09423055]
 [-0.30729973 -0.39098382 -0.35466993 -0.66526955  0.2597974  -0.23162203
   0.6121564  -0.15612991]]
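
Conceptually, an embedding column is a lookup table with one dense vector per category. The sketch below shows that lookup in plain Python. The weights here are random numbers drawn with an arbitrary seed, which is an assumption for illustration; in the notebook they are trainable variables that are updated during training. `embedding_table` and `embed` are hypothetical names.

```python
import random

vocabulary = ['fixed', 'normal', 'reversible']
dimension = 8

# Assumption: random weights stand in for the trainable embedding variables.
rng = random.Random(0)
embedding_table = {
    word: [rng.uniform(-1.0, 1.0) for _ in range(dimension)]
    for word in vocabulary
}

def embed(value):
    """Look up the dense vector for a categorical value."""
    return embedding_table[value]

# A hypothetical batch in which two rows share a category.
batch = ['normal', 'fixed', 'reversible', 'reversible', 'normal']
vectors = [embed(value) for value in batch]

print(len(vectors), len(vectors[0]))
print(vectors[2] == vectors[3])
```

Rows with the same category receive the same vector, which is why some rows in the output above are identical.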


### **Hashed feature columns**

Another way to represent a categorical column with a large number of values is to use a `categorical_column_with_hash_bucket`. This feature column calculates a hash value of the input, then selects one of the `hash_bucket_size` buckets to encode a string. When using this column, you do not need to provide the vocabulary, and you can choose to make the number of hash buckets significantly smaller than the number of actual categories to save space.

Key point: An important downside of this technique is that there may be collisions in which different strings are mapped to the same bucket. In practice, this can work well for some datasets regardless.

In [None]:
thal_hashed = feature_column.categorical_column_with_hash_bucket(
    'thal', hash_bucket_size = 1000)
demo(feature_column.indicator_column(thal_hashed)) 

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
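
The idea of hashing a string into a fixed number of buckets can be sketched with the standard library. SHA-256 is used here only as a stand-in for a stable hash; it is an assumption of this sketch, so the bucket indices will not match the ones TensorFlow computes. `hash_bucket` is a hypothetical helper.

```python
import hashlib

def hash_bucket(value, num_buckets):
    """Map a string to a bucket index in [0, num_buckets) with a stable hash."""
    digest = hashlib.sha256(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % num_buckets

values = ['fixed', 'normal', 'reversible']

# With many buckets, collisions are unlikely but still possible.
large = {v: hash_bucket(v, 1000) for v in values}
print(large)

# With fewer buckets than distinct values, at least two values must collide.
small = {v: hash_bucket(v, 2) for v in values}
print(small, len(set(small.values())) < len(values))
```

No vocabulary is needed because the index is computed from the string itself. The price is that two different strings can land in the same bucket, and the model then cannot tell them apart.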


### **Crossed feature columns**

Combining features into a single feature, better known as a `feature cross`, enables a model to learn separate weights for each combination of features. Here, we will create a new feature that is the cross of age and thal. Note that `crossed_column` does not build the full table of all possible combinations (which could be very large). Instead, it is backed by a hashed column, so you can choose how large the table is.


In [None]:
crossed_feature = feature_column.crossed_column([age_buckets, thal], hash_bucket_size=1000)
demo(feature_column.indicator_column(crossed_feature))

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
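
A feature cross can be pictured as hashing the combination of an age bucket and a thal value into one of a fixed number of slots. The sketch below does this in plain Python. The key format and the SHA-256 hash are assumptions made for illustration and do not reproduce TensorFlow's internal hashing; `crossed_bucket` is a hypothetical helper.

```python
import hashlib
from bisect import bisect_right

boundaries = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]

def crossed_bucket(age_value, thal_value, num_buckets=1000):
    """Hash the (age bucket, thal) pair into one of num_buckets slots."""
    age_bucket = bisect_right(boundaries, age_value)
    key = f'{age_bucket}_X_{thal_value}'  # assumed key format, for illustration only
    digest = hashlib.sha256(key.encode('utf-8')).hexdigest()
    return int(digest, 16) % num_buckets

# Ages 52 and 54 share an age bucket, so they share a crossed slot.
print(crossed_bucket(52, 'normal'), crossed_bucket(54, 'normal'))
# Changing either input changes the key that is hashed.
print(crossed_bucket(54, 'normal'), crossed_bucket(54, 'fixed'))
```

The full table would have 11 age buckets times 3 thal values, or 33 combinations. Hashing lets you cap the width at `hash_bucket_size` regardless of how many combinations exist, at the cost of possible collisions.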


## **Choose which columns to use**

We have seen how to use several types of feature columns. Now we will use them to train the model. The goal of this section is to show you the complete code (i.e. the mechanics) needed to work with feature columns. We have selected a few columns arbitrarily to train our model below.

Key point: If your aim is to build an accurate model, try a larger dataset of your own, and think carefully about which features are the most meaningful to include, and how they should be represented.


In [None]:
feature_columns = []

# numeric cols
for header in ['age','trestbps','chol','thalach','oldpeak','slope','ca']:
  feature_columns.append(feature_column.numeric_column(header))

# bucketized cols
age_buckets = feature_column.bucketized_column(age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
feature_columns.append(age_buckets)

# indicator cols
thal = feature_column.categorical_column_with_vocabulary_list(
    'thal', ['fixed','normal','reversible'])
thal_one_hot = feature_column.indicator_column(thal)
feature_columns.append(thal_one_hot)

# embedding cols
thal_embedding = feature_column.embedding_column(thal, dimension=8)
feature_columns.append(thal_embedding)

# crossed cols
crossed_feature = feature_column.crossed_column([age_buckets, thal], hash_bucket_size=1000) 
crossed_feature = feature_column.indicator_column(crossed_feature)
feature_columns.append(crossed_feature)

### **Create a feature layer**
Now that we have defined our feature columns, we will use a `DenseFeatures` layer to input them to our Keras model.


In [None]:
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)

Earlier, we used a small batch size to demonstrate how feature columns work. We now create a new input pipeline with a larger batch size.

In [None]:
batch_size = 32
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)

### **Create, compile, and train the model**


In [None]:
# Create a baseline model with logistic regression
model = tf.keras.Sequential([
    feature_layer,
    layers.Dense(1, activation='sigmoid')
])
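
A single `Dense` unit with a sigmoid activation computes a weighted sum of its inputs plus a bias, then squashes the result into a probability. That is the logistic regression model. Here is a minimal sketch of the forward pass in plain Python; the feature values, weights, and bias are hypothetical numbers chosen only to illustrate the computation.

```python
import math

def sigmoid(z):
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(features, weights, bias):
    """Weighted sum of the features followed by a sigmoid."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical values, not taken from the trained model.
features = [0.5, -1.0, 2.0]
weights = [0.8, 0.1, -0.3]
bias = 0.2

probability = predict(features, weights, bias)
print(round(probability, 3), probability >= 0.5)
```

Training adjusts the weights and the bias so that the predicted probability is close to 1 for rows with `target` equal to 1 and close to 0 otherwise.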

In [None]:
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'],
              run_eagerly=True)

model.fit(train_ds,
          validation_data = val_ds,
          epochs = 5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f0b68f92f28>

In [None]:
loss, accuracy = model.evaluate(test_ds)
print("Accuracy", accuracy)

Accuracy 0.36065573


## **Build a neural network based model**


In [None]:
model_nn = tf.keras.Sequential([
   feature_layer,
   layers.Dense(128, activation='relu'),
   layers.Dense(128, activation='relu'),
   layers.Dense(1, activation='sigmoid')
])

model_nn.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'],
              run_eagerly=True)

model_nn.fit(train_ds,
          validation_data = val_ds,
          epochs = 5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f0b66bbbeb8>

In [None]:
print(model_nn.summary())

Model: "sequential_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_features_20 (DenseFeat multiple                  24        
_________________________________________________________________
dense_24 (Dense)             multiple                  131840    
_________________________________________________________________
dense_25 (Dense)             multiple                  16512     
_________________________________________________________________
dense_26 (Dense)             multiple                  129       
Total params: 148,505
Trainable params: 148,505
Non-trainable params: 0
_________________________________________________________________
None
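
The parameter counts in the summary can be checked by hand from the feature columns chosen earlier. The sketch below redoes the arithmetic; the width assigned to each feature column is read off the notebook's configuration, and `dense_params` is a hypothetical helper.

```python
# Width of each feature column's output, as configured in the notebook.
feature_widths = {
    'numeric': 7,         # age, trestbps, chol, thalach, oldpeak, slope, ca
    'age_buckets': 11,    # 10 boundaries give 11 buckets
    'thal_one_hot': 3,    # vocabulary of 3 strings
    'thal_embedding': 8,  # embedding dimension
    'crossed': 1000,      # hash_bucket_size
}
input_width = sum(feature_widths.values())

def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out

embedding_params = 3 * 8  # one 8-dimensional vector per vocabulary entry
layer_params = [
    embedding_params,                # DenseFeatures layer
    dense_params(input_width, 128),  # first hidden layer
    dense_params(128, 128),          # second hidden layer
    dense_params(128, 1),            # output layer
]
print(input_width, layer_params, sum(layer_params))
```

The first hidden layer accounts for most of the parameters because the crossed column alone contributes 1000 of the 1029 input values.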


In [None]:
loss, accuracy = model_nn.evaluate(test_ds)
print("Accuracy", accuracy)

Accuracy 0.6557377


Key point: You will typically see the best results with deep learning on much larger and more complex datasets. When working with a small dataset like this one, we recommend using a decision tree or random forest as a strong baseline. The goal of this exercise is not to train an accurate model, but to demonstrate the mechanics of working with structured data, so you have code to use as a starting point when working with your own datasets in the future.