# Build, train and evaluate models with TensorFlow Decision Forests

TensorFlow Decision Forests (TF-DF) is a library for the training, evaluation, interpretation and inference of Decision Forest models.

In this tutorial, you will learn how to:

1. Train a multi-class classification Random Forest on a dataset containing numerical, categorical and missing features.
2. Evaluate the model on a test dataset.
3. Prepare the model for [TensorFlow Serving](https://www.tensorflow.org/tfx/guide/serving).
4. Examine the overall structure of the model and the importance of each feature.
5. Re-train the model with a different learning algorithm (Gradient Boosted Decision Trees).
6. Use a different set of input features.
7. Change the hyperparameters of the model.
8. Preprocess the features.
9. Train a model for regression.
10. Train a model for ranking.
Detailed documentation is available in the [user manual](https://github.com/tensorflow/decision-forests/documentation). The [example](https://github.com/tensorflow/decision-forests/examples) directory contains other end-to-end examples.

In [1]:
!pip install tensorflow_decision_forests
!pip install wurlitzer

Collecting wurlitzer
  Downloading wurlitzer-3.0.2-py3-none-any.whl (7.3 kB)
Installing collected packages: wurlitzer
Successfully installed wurlitzer-3.0.2


## Imports

In [2]:
import tensorflow_decision_forests as tfdf

import os
import numpy as np
import pandas as pd
import tensorflow as tf
import math
from wurlitzer import sys_pipes

In [3]:
# Check the version of TensorFlow Decision Forests
print("Found TensorFlow Decision Forests v" + tfdf.__version__)

Found TensorFlow Decision Forests v0.1.9


## Training a Random Forest model

In this section, we `train`, `evaluate`, `analyse` and `export` a multi-class classification **Random Forest** trained on the *Palmer's Penguins* dataset. The label (**`species`**) takes three values.

### Load the dataset

This dataset is very small (344 examples) and is stored as a .csv-like file, so use **`Pandas`** to load it.

`Note: Pandas is practical because you don't have to type in the names of the input features to load them. For larger datasets (>1M examples), reading the files with the TensorFlow Dataset API may be better suited.`

Let's download the dataset as a csv file (the header is already included) and load it:

In [6]:
# Download the dataset (create the target directory first, otherwise wget cannot write the file)
!mkdir -p ./data
!wget -q https://storage.googleapis.com/download.tensorflow.org/data/palmer_penguins/penguins.csv -O ./data/penguins.csv

# Load a dataset into a Pandas Dataframe
dataset_df = pd.read_csv("./data/penguins.csv")

# Display the first 5 examples
dataset_df.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,Adelie,Torgersen,,,,,,2007
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007


The dataset contains a mix of `numerical` (e.g. **`bill_depth_mm`**) and `categorical` (e.g. **`island`**) features, as well as `missing values`. TF-DF supports all of these natively (*unlike neural-network-based models*), so there is no need for preprocessing such as one-hot encoding, normalization or an extra `is_present` feature.
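
To see where the missing values sit before training, the five rows displayed above can be scanned column by column. The following is a minimal sketch that uses only the Python standard library; the helper name `count_missing` and the constant `SAMPLE_CSV` are hypothetical and not part of TF-DF or Pandas.

```python
import csv
import io

# The five rows displayed above, copied as-is. Row 3 has empty measurements.
SAMPLE_CSV = """\
Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,Adelie,Torgersen,,,,,,2007
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007
"""


def count_missing(csv_text):
    """Returns {column name: number of empty cells} for a CSV string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            # An empty string is how a missing value appears in the raw file.
            if value == "":
                counts[name] += 1
    return counts


missing_per_column = count_missing(SAMPLE_CSV)
print(missing_per_column)
```

On a Pandas dataframe, the same count is available with `dataset_df.isna().sum()`. Either way, the rows with missing values can be passed to TF-DF unchanged.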

Labels are a bit different: Keras metrics expect integers. The label (**`species`**) is stored as a string, so let's convert it into an integer.

### Data Preprocessing

1. Encode the categorical label into an integer.
2. Split the dataset into a training and a testing dataset.
3. Convert the pandas dataframe (**`pd.DataFrame`**) into TensorFlow datasets (**`tf.data.Dataset`**).

In [7]:
# Details:
# This stage is necessary if your classification label is represented as a
# string. Note: Keras expects classification labels to be integers.

# Name of the label column.
label = "species"

classes = dataset_df[label].unique().tolist()
print(f"Label classes: {classes}")

dataset_df[label] = dataset_df[label].map(classes.index)

Label classes: ['Adelie', 'Gentoo', 'Chinstrap']
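
Because the label is now an integer, the `classes` list is the only record of which integer stands for which species. Its order comes from the order in which the species first appear in the file, so it is worth keeping alongside the model. The sketch below shows the round trip, assuming the class order printed above; `encode_label` and `decode_label` are hypothetical helper names.

```python
# Class order as printed by the cell above: index i encodes classes[i].
classes = ["Adelie", "Gentoo", "Chinstrap"]


def encode_label(name, classes):
    """Maps a species name to its integer label."""
    return classes.index(name)


def decode_label(index, classes):
    """Maps an integer label back to the species name."""
    return classes[index]


names = ["Gentoo", "Adelie", "Chinstrap"]
encoded = [encode_label(name, classes) for name in names]
decoded = [decode_label(index, classes) for index in encoded]
print(encoded, decoded)
```

The same `decode_label` lookup can be applied to the index of the highest predicted probability to report a species name instead of an integer.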


In [21]:
def split_dataset(dataset, test_ratio=0.30):
    """Splits a pandas dataframe in two"""
    test_indices = np.random.random_sample(len(dataset)) < test_ratio
    return dataset[~test_indices], dataset[test_indices]


train_ds_pd, test_ds_pd = split_dataset(dataset_df)
print(f"{len(train_ds_pd)} examples in training and {len(test_ds_pd)} examples for testing")

246 examples in training and 98 examples for testing
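
Note that `split_dataset` draws an independent random number for each example, so the split sizes change from run to run and only approximate `test_ratio` (here 98 of 344 examples, about 28%). For a reproducible split, the random generator can be seeded. The following is a minimal sketch with the standard library; `split_indices` and the seed value are hypothetical, and it returns row positions rather than dataframes.

```python
import random


def split_indices(num_examples, test_ratio=0.30, seed=1234):
    """Assigns each row position to train or test, reproducibly."""
    rng = random.Random(seed)  # Local generator: does not touch global state.
    train_idx, test_idx = [], []
    for i in range(num_examples):
        # Same rule as split_dataset: one uniform draw per example.
        if rng.random() < test_ratio:
            test_idx.append(i)
        else:
            train_idx.append(i)
    return train_idx, test_idx


# 344 = 246 + 98, the total number of examples reported above.
train_idx, test_idx = split_indices(344)
print(f"{len(train_idx)} examples in training and {len(test_idx)} examples for testing")
```

The resulting positions could then be used with `dataset_df.iloc[train_idx]` and `dataset_df.iloc[test_idx]`. With the same seed, repeated runs produce the same split.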


In [22]:
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label=label)
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_ds_pd, label=label)

### Training the Model

In [26]:
# Specify the model
model_1 = tfdf.keras.RandomForestModel()

# Optionally, add evaluation metrics
model_1.compile(
    metrics=['accuracy']
)

# Train the model
# `sys_pipes` is optional. It enables the display of the training logs
with sys_pipes():
    model_1.fit(x=train_ds)

2021-09-29 18:59:36.639946: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)




[INFO kernel.cc:746] Start Yggdrasil model training
[INFO kernel.cc:747] Collect training examples
[INFO kernel.cc:392] Number of batches: 4
[INFO kernel.cc:393] Number of examples: 246
[INFO kernel.cc:769] Dataset:
Number of records: 246
Number of columns: 8

Number of columns by type:
	NUMERICAL: 5 (62.5%)
	CATEGORICAL: 3 (37.5%)

Columns:

NUMERICAL: 5 (62.5%)
	0: "bill_depth_mm" NUMERICAL num-nas:1 (0.406504%) mean:17.1257 min:13.1 max:21.2 sd:1.96878
	1: "bill_length_mm" NUMERICAL num-nas:1 (0.406504%) mean:44.0747 min:32.1 max:58 sd:5.33557
	2: "body_mass_g" NUMERICAL num-nas:1 (0.406504%) mean:4241.12 min:2700 max:6300 sd:831.97
	3: "flipper_length_mm" NUMERICAL num-nas:1 (0.406504%) mean:201.498 min:172 max:231 sd:14.3577
	6: "year" NUMERICAL mean:2008.04 min:2007 max:2009 sd:0.818165

CATEGORICAL: 3 (37.5%)
	4: "island" CATEGORICAL has-dict vocab-size:4 zero-ood-items most-frequent:"Biscoe" 122 (49.5935%)
	5: "sex" CATEGORICAL num-nas:9 (3.65854%) has-dict vocab-size:3 zero-oo