# Categorical

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/google/yggdrasil-decision-forests/blob/main/documentation/public/docs/tutorial/categorical_feature.ipynb)

The way a feature is treated depends on its [semantic](utilities/#Semantic), such as numerical, categorical, boolean, or text. If the semantic is not specified, it is inferred automatically. For example, float and integer features are detected as numerical, while strings are detected as categorical.
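The inference rule described above can be pictured with a small pure-Python sketch. The helper `infer_semantic` and the `"UNDETERMINED"` fallback are hypothetical names introduced for illustration only. This is not how YDF implements detection; it only mirrors the rule that floats and integers map to numerical and strings map to categorical.

```python
def infer_semantic(values):
    """Toy illustration of the inference rule; not YDF's actual logic."""
    if not values:
        return "UNDETERMINED"
    # bool is a subclass of int in Python, so exclude it explicitly.
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "NUMERICAL"
    if all(isinstance(v, str) for v in values):
        return "CATEGORICAL"
    # Anything else (booleans, mixed types, ...) is left out of this sketch.
    return "UNDETERMINED"


columns = {
    "feature_1": ["red", "red", "blue", "green"],
    "feature_2": [5, 6, 7, 6],
    "feature_3": [0.5, 1.5, 2.0, 3.25],
}
inferred = {name: infer_semantic(values) for name, values in columns.items()}
print(inferred)
```

In practice, the detected semantics should be read from the model description rather than guessed from the Python types.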


A categorical feature represents a type or class in a finite set of possible values without ordering. As an example, consider the color `RED` in the set {`RED`, `BLUE`, `GREEN`}.
Categorical features can be strings or integers. Missing values are represented by "" (empty string).
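If the raw data marks missing entries with `None` or `NaN` instead, one possible way to follow the empty-string convention is to normalize the column with pandas before training. This is a minimal sketch; the column names and values are made up for illustration, and whether this step is required for a given dataset is an assumption.

```python
import pandas as pd

raw = pd.DataFrame({
    "feature_1": ["red", "red", "blue", "green"],
    "feature_2": ["hot", "hot", "cold", None],  # None marks a missing value here
})

# Replace missing entries in the string column with the empty string.
raw["feature_2"] = raw["feature_2"].fillna("")
cleaned_values = raw["feature_2"].tolist()
print(cleaned_values)
```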

Let's train a model on categorical string features.

In [2]:
import ydf
import pandas as pd

In [3]:
dataset = pd.DataFrame({
    "label": [True, False, True, False],
    "feature_1": ["red", "red", "blue", "green"],
    "feature_2": ["hot", "hot", "cold", ""],
})

model = ydf.RandomForestLearner(label="label").train(dataset)

Train model on 4 examples
Model trained in 0:00:00.008941


We can see the features are detected as categorical in the **Dataspec** tab.

In [4]:
model.describe()

Sometimes, you might want to force a feature's semantic to be categorical.

In the next example, "feature_1" and "feature_2" are integers, so both would be detected automatically as numerical.
However, we want "feature_1" to be treated as categorical, so we declare its semantic explicitly.

In the model description, notice that "feature_1" is categorical, while "feature_2" is numerical.

In [5]:
dataset = pd.DataFrame({
    "label": [True, False, True, False],
    "feature_1": [1, 2, 2, 1],
    "feature_2": [5, 6, 7, 6],
})

# Note: include_all_columns=True allows the model to use all the
# columns as features, not just the ones in "features".
model = ydf.RandomForestLearner(
    label="label",
    features=[ydf.Feature("feature_1", ydf.Semantic.CATEGORICAL)],
    include_all_columns=True,
).train(dataset)

model.describe()

Train model on 4 examples
Model trained in 0:00:00.004352
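Since string columns are inferred as categorical, another option worth considering is to cast the integer column to strings before training instead of declaring the semantic. The sketch below only shows the pandas side. That the resulting model behaves the same as the one trained with an explicit `ydf.Semantic.CATEGORICAL` is an assumption, not something verified here.

```python
import pandas as pd

dataset = pd.DataFrame({
    "label": [True, False, True, False],
    "feature_1": [1, 2, 2, 1],
    "feature_2": [5, 6, 7, 6],
})

# Cast the integer codes to strings so the column is no longer numeric.
dataset["feature_1"] = dataset["feature_1"].astype(str)
casted_values = dataset["feature_1"].tolist()
print(casted_values)
```

The casted dataset can then be passed to a learner as in the first example, and the **Dataspec** tab of `model.describe()` can be used to check which semantic was detected.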
