# Pandas Dataframe
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/google/yggdrasil-decision-forests/blob/main/documentation/public/docs/tutorial/pandas.ipynb)


## Setup

In [None]:
pip install ydf pandas -U

## Pandas

YDF can train directly on [Pandas](https://pandas.pydata.org/) dataframes. YDF tries to infer column semantics automatically. For more fine-grained control, YDF offers advanced options for specifying column semantics.

In [8]:
import ydf
import pandas as pd
import numpy as np

# Create a small dataframe with different column types.
df = pd.DataFrame({
    "feature_1": [1, 2, 3, 1] * 20,  # A numerical feature
    "feature_2": ["X", "X", "Y", "Y"] * 20,  # A categorical feature
    "feature_3": [True, False, True, False] * 20,  # A boolean feature
    "label": [True, True, False, False] * 20,  # The labels
})
df.head()

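|   | feature_1 | feature_2 | feature_3 | label |
|---|-----------|-----------|-----------|-------|
| 0 | 1 | X | True | True |
| 1 | 2 | X | False | True |
| 2 | 3 | Y | True | False |
| 3 | 1 | Y | False | False |
| 4 | 1 | X | True | True |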
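
Because YDF infers semantics from how each column is represented, it can help to look at the pandas dtypes before training. The sketch below rebuilds the same dataframe and checks each feature with helpers from `pandas.api.types`; the variable names `is_int`, `is_text`, and `is_bool` are only illustrative.

```python
import pandas as pd

# Same toy dataframe as in the cell above.
df = pd.DataFrame({
    "feature_1": [1, 2, 3, 1] * 20,
    "feature_2": ["X", "X", "Y", "Y"] * 20,
    "feature_3": [True, False, True, False] * 20,
    "label": [True, True, False, False] * 20,
})

# Storage type of each column, as pandas sees it.
print(df.dtypes)

# Per-column checks on the representation.
is_int = pd.api.types.is_integer_dtype(df["feature_1"])
is_text = pd.api.types.is_string_dtype(df["feature_2"])
is_bool = pd.api.types.is_bool_dtype(df["feature_3"])
print(is_int, is_text, is_bool)
```

A column whose dtype does not match its meaning, such as an enum stored as integers, is a candidate for an explicit semantic.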


We can directly train a model on this dataframe.

In [4]:
# Train a model.
model = ydf.RandomForestLearner(label="label").train(df)

Train model on 80 examples
Model trained in 0:00:00.003959


In [5]:
model.describe()

## Train a model on a subset of features

By default, the model uses all the available columns as input features.
You can restrict YDF to a subset of them with the `features` argument.

Train a model on `feature_1` and `feature_2` only.

In [7]:
model = ydf.RandomForestLearner(
    label="label",
    features=["feature_1", "feature_2"]
).train(df)

print("Model input features:", model.input_feature_names())

Train model on 80 examples
Model trained in 0:00:00.003908
Model input features: ['feature_1', 'feature_2']
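
A similar restriction can be applied on the pandas side by selecting columns before training. This is a sketch of an alternative, not something the cell above does: the name `subset` is hypothetical, and the label column has to be kept in the selection.

```python
import pandas as pd

# Same toy dataframe as earlier in this tutorial.
df = pd.DataFrame({
    "feature_1": [1, 2, 3, 1] * 20,
    "feature_2": ["X", "X", "Y", "Y"] * 20,
    "feature_3": [True, False, True, False] * 20,
    "label": [True, True, False, False] * 20,
})

# Keep the two chosen features plus the label; feature_3 is dropped.
subset = df[["feature_1", "feature_2", "label"]]
print(list(subset.columns))

# `subset` could then be passed to `.train(...)` in place of `df` (not run here).
```

The `features` argument leaves the dataframe untouched, which is convenient when the same dataframe feeds several models with different inputs.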


## Override the feature semantics

To consume a feature, the model needs to know how to interpret it. This interpretation is called the feature's "semantic".
YDF supports four types of feature semantics:

- **Numerical**: For quantities or measures.
- **Categorical**: For categories or enums.
- **Boolean**: A special kind of categorical feature with only two categories, True and False.
- **Categorical-set**: For sets of categories, tags, or bag of words.

YDF automatically determines the semantic of a feature from its representation. For example, float and int values are automatically detected as numerical.

For example, here are the semantics of the model trained above:

In [9]:
model.input_features()

[InputFeature(name='feature_1', semantic=<Semantic.NUMERICAL: 1>, column_idx=0),
 InputFeature(name='feature_2', semantic=<Semantic.CATEGORICAL: 2>, column_idx=1)]
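
The idea of detecting a semantic from the representation can be pictured with a toy function. This is a simplified illustration and not YDF's implementation: `guess_semantic` is a hypothetical helper, it returns plain strings, and it covers only the three cases that appear in this tutorial's outputs.

```python
def guess_semantic(values):
    """Toy rule of thumb; hypothetical helper, not YDF's actual logic."""
    if not values:
        return "UNKNOWN"
    # bool is tested before int because, in Python, bool is a subclass of int.
    if all(isinstance(v, bool) for v in values):
        return "BOOLEAN"
    if all(isinstance(v, (int, float)) for v in values):
        return "NUMERICAL"
    if all(isinstance(v, str) for v in values):
        return "CATEGORICAL"
    return "UNKNOWN"


# The feature columns of the toy dataset, as plain Python lists.
columns = {
    "feature_1": [1, 2, 3, 1] * 20,
    "feature_2": ["X", "X", "Y", "Y"] * 20,
    "feature_3": [True, False, True, False] * 20,
}

guessed = {name: guess_semantic(vals) for name, vals in columns.items()}
print(guessed)
```

A rule of this kind cannot tell an integer quantity from an integer-coded enum, which is why an override is sometimes needed.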

In some cases, it is useful to force a specific semantic. For instance, if an enum value is represented with integers, the feature should be forced to categorical so the model does not treat the codes as ordered quantities:

In [11]:
model = ydf.RandomForestLearner(
    label="label",
    features=[ydf.Feature("feature_1", ydf.Semantic.CATEGORICAL)],
    include_all_columns=True  # Use all the features; not just the ones in "features".
).train(df)

model.input_features()

Train model on 80 examples
Model trained in 0:00:00.004236


[InputFeature(name='feature_1', semantic=<Semantic.CATEGORICAL: 2>, column_idx=0),
 InputFeature(name='feature_2', semantic=<Semantic.CATEGORICAL: 2>, column_idx=2),
 InputFeature(name='feature_3', semantic=<Semantic.BOOLEAN: 5>, column_idx=3)]
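
Another route, sketched here as an assumption rather than as the approach used above, is to change the representation in pandas by casting the integer codes to strings. Since the string column `feature_2` is detected as categorical in the outputs above, the converted column would be expected to behave the same way, but this is not verified here. The name `df_str` is hypothetical.

```python
import pandas as pd

# Only the integer-coded column and the label are needed for this sketch.
df = pd.DataFrame({
    "feature_1": [1, 2, 3, 1] * 20,
    "label": [True, True, False, False] * 20,
})

# Work on a copy so the original integer column stays available.
df_str = df.copy()
df_str["feature_1"] = df_str["feature_1"].astype(str)

# The values are now the strings "1", "2" and "3".
print(sorted(df_str["feature_1"].unique()))
```

Passing `ydf.Feature(..., ydf.Semantic.CATEGORICAL)` keeps the data unchanged and states the intent in the training code, so it is usually the clearer option.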