# Build, train and evaluate models with TensorFlow Decision Forests

TensorFlow Decision Forests (TF-DF) is a library for the training, evaluation, interpretation and inference of Decision Forest models.

In this tutorial, you will learn how to:

1. Train a multi-class classification Random Forest on a dataset containing numerical, categorical and missing features.
2. Evaluate the model on a test dataset.
3. Prepare the model for [TensorFlow Serving](https://www.tensorflow.org/tfx/guide/serving).
4. Examine the overall structure of the model and the importance of each feature.
5. Re-train the model with a different learning algorithm (Gradient Boosted Decision Trees).
6. Use a different set of input features.
7. Change the hyperparameters of the model.
8. Preprocess the features.
9. Train a model for regression.
10. Train a model for ranking.

Detailed documentation is available in the [user manual](https://github.com/tensorflow/decision-forests/documentation). The [example](https://github.com/tensorflow/decision-forests/examples) directory contains other end-to-end examples.

In [1]:
!pip install tensorflow_decision_forests
!pip install wurlitzer

Collecting wurlitzer
  Downloading wurlitzer-3.0.2-py3-none-any.whl (7.3 kB)
Installing collected packages: wurlitzer
Successfully installed wurlitzer-3.0.2


## Imports

In [2]:
import tensorflow_decision_forests as tfdf

import os
import numpy as np
import pandas as pd
import tensorflow as tf
import math
from wurlitzer import sys_pipes

In [3]:
# Check the version of TensorFlow Decision Forests
print("Found TensorFlow Decision Forests v" + tfdf.__version__)

Found TensorFlow Decision Forests v0.1.9


## Training a Random Forest model

In this section, we train, evaluate, analyse and export a multi-class classification **Random Forest** trained on the *Palmer's Penguins* dataset.

### Load the dataset and convert it into a tf.data.Dataset

This dataset is very small (300 examples) and is stored as a CSV-like file, so **Pandas** is a convenient way to load it.

**Note:** Pandas is practical because you don't have to type in the names of the input features to load them. For larger datasets (>1M examples), reading the files with the TensorFlow Dataset API may be better suited.
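The idea behind the note above, reading a large file in pieces instead of loading it all at once, can be sketched in plain Python. This is only an illustration built on the standard-library `csv` module. It is not the TensorFlow Dataset API, and the `iter_batches` helper and the sample rows are hypothetical.

```python
import csv
import io
from itertools import islice


def iter_batches(lines, batch_size):
    """Yield lists of row dicts, holding at most `batch_size` rows in memory."""
    reader = csv.DictReader(lines)
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


# Tiny in-memory stand-in for a large CSV file (illustrative values only).
sample = io.StringIO(
    "species,body_mass_g\n"
    "Adelie,3750.0\n"
    "Adelie,3800.0\n"
    "Adelie,3250.0\n"
)

batch_sizes = [len(batch) for batch in iter_batches(sample, batch_size=2)]
print(batch_sizes)
```

With a real file, `lines` would be an open file handle, so only one batch of rows is held in memory at a time.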

Let's download the dataset as a CSV file and load it:

In [4]:
# Download the dataset
!wget -q https://storage.googleapis.com/download.tensorflow.org/data/palmer_penguins/penguins.csv -O /tmp/penguins.csv

# Load the dataset into a Pandas DataFrame
dataset_df = pd.read_csv("/tmp/penguins.csv")

# Display the first 5 examples
dataset_df.head()

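|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---------|--------|----------------|---------------|-------------------|-------------|-----|------|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |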
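The preview above shows that row 3 has no measurements, which is the kind of missing feature mentioned in the introduction. As a minimal sketch, the snippet below counts empty fields per column using only the standard library. The five rows are copied from the preview (without the index column), and `count_missing` is a hypothetical helper. On the real DataFrame, `dataset_df.isna().sum()` would give the equivalent per-column count.

```python
import csv
import io

# First five rows from the preview; empty fields are missing values.
PREVIEW = "\n".join([
    "species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year",
    "Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007",
    "Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007",
    "Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007",
    "Adelie,Torgersen,,,,,,2007",
    "Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007",
])


def count_missing(csv_text):
    """Return a dict mapping each column name to its number of empty fields."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            if value == "":
                counts[name] += 1
    return counts


missing = count_missing(PREVIEW)
print(missing)
```

In this sample, the missing values span both numerical columns (such as `bill_length_mm`) and a categorical one (`sex`).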
