<a href="https://colab.research.google.com/github/linhoangce/tensorflow_learning_stuff/blob/main/decision_forests.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Migrating from NN

Decision Forests (DF) work differently from Neural Networks (NN): DFs generally do not train with backpropagation or in mini-batches. Therefore, TF-DF pipelines have a few differences from other TensorFlow pipelines.


## Dataset and Features

### Validation Dataset

Unlike the standard Neural Network training paradigm, TF-DF models do not need a validation dataset to monitor overfitting or to stop training early. If we already have a train/validation/test split and use the validation split for one of those reasons, it is safe to train our TF-DF model on train+validation (unless the validation split is also used for something else, such as hyperparameter tuning).


* NN:
<code>

model.fit(train_ds, validation_data=val_ds)

</code>

With TF-DF, simply combine both splits; no separate validation dataset is needed.

* DF:
<code>

model.fit(train_ds.concatenate(val_ds))

</code>


**Rationale**: The TF-DF framework is composed of multiple algorithms. Some of them do not use a validation dataset (e.g., Random Forest), while others do (e.g., Gradient Boosted Trees). Algorithms that do might benefit from different types and sizes of validation datasets. Therefore, if a validation dataset is needed, it is extracted automatically from the training dataset.
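
The automatic extraction described above can be pictured with a plain-Python sketch. This is not TF-DF's internal code: the helper name `split_off_validation`, the 10% ratio, and the seed are illustrative assumptions. It only shows the idea that the learner, rather than the user, carves a validation subset out of the combined training data.

```python
import random

def split_off_validation(examples, validation_ratio=0.1, seed=1234):
    """Hypothetical helper: hold out a fraction of the examples for validation."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    num_valid = round(len(examples) * validation_ratio)
    valid_idx = set(indices[:num_valid])
    train = [ex for i, ex in enumerate(examples) if i not in valid_idx]
    valid = [ex for i, ex in enumerate(examples) if i in valid_idx]
    return train, valid

# Train and validation examples are passed in together,
# mirroring model.fit(train_ds.concatenate(val_ds)).
combined = list(range(1000))
train_part, valid_part = split_off_validation(combined)
print(len(train_part), len(valid_part))
```

Because the split is driven by a fixed seed, calling the helper twice on the same data yields the same partition.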

### Dataset I/O

**Train for exactly 1 epoch**

In [None]:
import tensorflow as tf
import tensorflow_decision_forests as tfdf

model = tfdf.keras.RandomForestModel()
train_ds = ...
### NN
# Number of epochs in Keras
model.fit(train_ds, epochs=5)

# Number of epochs in the dataset
train_ds = train_ds.repeat(5)
model.fit(train_ds)


### DF
model.fit(train_ds)

**Rationale**: Users of neural networks often train a model for N steps (which may involve looping over the dataset more than once), due to the nature of SGD. TF-DF trains by reading the whole dataset and then running the training at the end. One epoch is needed to read the full dataset, and any extra steps will result in unnecessary data I/O as well as slower training.

**Do not shuffle the dataset**

Datasets do not need to be shuffled (unless the input_fn is reading only a sample of the dataset).

**Rationale**: TF-DF shuffles access to the data internally after reading the full dataset into memory. TF-DF algorithms are deterministic (if the user does not change the random seed). Enabling shuffling will only make the algorithm non-deterministic. Shuffling does make sense if the input dataset is ordered and the input_fn is only going to read a sample of it (the sample should be random). However, this will make the training procedure non-deterministic.

In [None]:
### NN
train_ds = train_ds.shuffle(5)
model.fit(train_ds)

### DF
model.fit(train_ds)

**Do not tune the batch size**

The batch size does not affect model quality, so there is no need to include it in hyperparameter sweeps.

**Rationale**: Because TF-DF trains on the full dataset after reading it, the batch size only influences the speed of dataset I/O.



### Large Datasets

Unlike NNs, which can loop over mini-batches of a large dataset indefinitely, DFs require a finite dataset that fits in memory for their training procedures. The size of the dataset has performance and memory implications.

There are diminishing returns for increasing the size of the dataset, and DF-algorithms arguably need fewer examples for convergence than large NN models. Instead of scaling the number of training steps (as in a NN), we can try scaling the amount of data to see where the compute tradeoff makes sense. Therefore, **it is a good idea to first try training on a (small) subset of the dataset.**

The alternative solution is to use *distributed training*. Distributed training is a great way to increase the size of the dataset if multiple machines are available. While all the distributed algorithms are able to distribute the computation, not all of them are able to distribute the RAM usage.

**How many examples to use**

**It should fit in memory on the machine the model is training on:**

* Note that this is not the same as the size of the examples on disk.

* As a rule of thumb, one numerical or categorical value uses **4 bytes** of memory. So, a dataset with 100 features and 25 million examples will take ~10GB (= 100 * 25 * 10^6 * 4 bytes) of memory.

* Categorical-set features (e.g., tokenized text) take more memory (4 bytes per token + 12 bytes per feature).
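
The rule of thumb above can be turned into a quick back-of-the-envelope calculation. The function names below are hypothetical, and the categorical-set estimate assumes that the "4 bytes per token + 12 bytes per feature" cost is paid once per example for each such feature, which is one reading of the rule rather than a documented formula.

```python
BYTES_PER_VALUE = 4  # rule of thumb: one numerical or categorical value

def estimate_dense_memory_bytes(num_examples, num_features):
    """Rough memory use of numerical/categorical features, in bytes."""
    return num_examples * num_features * BYTES_PER_VALUE

def estimate_categorical_set_memory_bytes(num_examples, avg_tokens_per_value):
    """Rough memory use of ONE categorical-set feature, in bytes.

    Assumption: 4 bytes per token plus a 12-byte overhead per example.
    """
    return num_examples * (4 * avg_tokens_per_value + 12)

dense = estimate_dense_memory_bytes(num_examples=25_000_000, num_features=100)
print(f"{dense / 1e9:.1f} GB")  # 10.0 GB, matching the example in the text
```

Comparing such an estimate with the RAM available on the training machine gives an early signal of whether subsampling or distributed training is needed.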

**Consider what your training time budget is:**

* While generally faster than NNs on smaller datasets (e.g., < 100k examples), DF training algorithms do not scale linearly with the dataset size; rather, they scale as ~O(features x num_examples x log(num_examples)) in most cases.

* The training time depends on the hyperparameters. The most impactful parameters are: (1) the number of trees (`num_trees`), (2) the example sampling rate (`subsample` for GBT), and (3) the attribute sampling rate (`num_candidate_attributes_ratio`).

* Categorical-set features are more expensive than other features. The cost is controlled by the `categorical_set_split_greedy_sampling` parameter.

* Sparse Oblique features (disabled by default) give good results but are expensive to compute.
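
To get a feel for the ~O(features x num_examples x log(num_examples)) growth mentioned above, the following sketch compares relative costs for two dataset sizes. The function `relative_training_cost` is a hypothetical helper; it returns a unitless quantity and ignores hyperparameters and constant factors, so only ratios between its values are meaningful.

```python
import math

def relative_training_cost(num_features, num_examples):
    """Unitless cost proportional to features * examples * log(examples)."""
    return num_features * num_examples * math.log(num_examples)

small = relative_training_cost(num_features=100, num_examples=10_000)
large = relative_training_cost(num_features=100, num_examples=100_000)
ratio = large / small
print(f"10x more examples -> about {ratio:.1f}x the cost")
```

Under this model, going from 10k to 100k examples costs roughly 12.5 times as much rather than 10 times, which is the super-linear behavior the text describes.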

**Rules of thumb for scaling up data**

We suggest starting with a small slice of the data (< 10k examples), which should allow a TF-DF model to train in seconds or a few minutes in most cases. Then we can increase the data at a fixed rate (e.g., 40% more each time), stopping when validation set performance no longer improves or the dataset no longer fits in memory.
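
The scaling procedure above can be sketched as a simple loop. Everything here is illustrative: `scale_up` and its parameters are hypothetical names, `max_examples` stands in for the memory limit, and `evaluate` is a placeholder for "train on this many examples and return a validation score".

```python
def scale_up(evaluate, start=10_000, growth_percent=40, max_examples=100_000):
    """Grow the training subset until the score stops improving or the cap is hit.

    `evaluate` is a hypothetical callback: it trains on `size` examples and
    returns a validation score (higher is better).
    """
    best_size, best_score = None, float("-inf")
    size = start
    while size <= max_examples:
        score = evaluate(size)
        if score <= best_score:
            break  # no improvement: stop scaling
        best_size, best_score = size, score
        size = size * (100 + growth_percent) // 100  # integer math avoids rounding drift
    return best_size, best_score

# Stand-in for real training: a score that saturates at 30,000 examples.
def fake_evaluate(size):
    return min(size, 30_000) / 30_000

print(scale_up(fake_evaluate))
```

With the saturating stand-in, the loop stops at the first size whose score fails to beat the previous best, which is the stopping rule described in the text.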

## Model Determinism

The TF-DF training algorithm is deterministic, i.e., training twice on the same dataset will give the exact same model. This is different from neural networks trained with TensorFlow. To preserve this determinism, users should ensure that dataset reads are deterministic as well.
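
One way to check that dataset reads are deterministic is to fingerprint the examples in the order they are read and compare two passes. The helper below is a hypothetical, framework-independent sketch; in practice the iterable would be the real input pipeline rather than an in-memory list.

```python
import hashlib

def read_fingerprint(examples):
    """Hash examples in read order; any change in order or content alters the digest."""
    digest = hashlib.sha256()
    for example in examples:
        digest.update(repr(example).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()

rows = [(5.1, "a"), (4.9, "b"), (6.2, "a")]
first_pass = read_fingerprint(rows)
second_pass = read_fingerprint(rows)
reordered = read_fingerprint(list(reversed(rows)))
print(first_pass == second_pass, first_pass == reordered)
```

Matching fingerprints across two reads suggest the pipeline is deterministic; a mismatch points to shuffling or another source of nondeterminism upstream of training.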

## Training Configuration

**Specify a task (e.g., classification, ranking) instead of a loss (e.g., binary cross-entropy)**

In [None]:
### NN
model = tf.keras.Sequential()

model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dense(1)) # One output for binary classification

model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer='Adam',
              metrics=['accuracy'])

### DF
# The loss is automatically determined from the task
model = tfdf.keras.GradientBoostedTreesModel(task=tfdf.keras.Task.CLASSIFICATION)

# Optional if we want to report the accuracy
model.compile(metrics=['accuracy'])


**Hyperparameters are semantically stable**

All the hyperparameters have default values. Those values are reasonable first candidates to try. Default hyperparameter values are guaranteed to never change. For this reason, new hyperparameters or algorithm improvements are disabled by default.

Users who wish to use the latest algorithms but do not want to optimize the hyperparameters themselves can use the "hyperparameter templates" provided by TF-DF.

In [3]:
# Model with default hyperparameters
model = tfdf.keras.GradientBoostedTreesModel()

# List the hyperparameters (with default values)
# and hyperparameter templates of the GBT learning algorithm
?tfdf.keras.GradientBoostedTreesModel

Use /tmp/tmpb49rf7sf as temporary training directory


In [7]:
# Use a hyperparameter template
model = tfdf.keras.GradientBoostedTreesModel(hyperparameter_template='better_default')

Resolve hyper-parameter template "better_default" to "better_default@v1" -> {'growing_strategy': 'BEST_FIRST_GLOBAL'}.
Use /tmp/tmpq0icqrzk as temporary training directory


In [8]:
# Change one of the hyperparameters
model = tfdf.keras.GradientBoostedTreesModel(num_trees=500)

Use /tmp/tmp8aibunee as temporary training directory


In [9]:
# List all the learning algorithms available
tfdf.keras.get_all_models()

[tensorflow_decision_forests.keras.RandomForestModel,
 tensorflow_decision_forests.keras.GradientBoostedTreesModel,
 tensorflow_decision_forests.keras.CartModel,
 tensorflow_decision_forests.keras.DistributedGradientBoostedTreesModel]

## Model debugging