# Pandas
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/google/yggdrasil-decision-forests/blob/main/documentation/public/docs/tutorial/pandas.ipynb)


## Setup

In [1]:
pip install ydf -U



## Pandas

YDF can train directly on pandas DataFrames and infers the semantic of each column (for example numerical, categorical, or boolean) automatically. For finer control, the learner accepts options that specify column semantics explicitly.

In [2]:
# Load libraries
import ydf  # Yggdrasil Decision Forests
import pandas as pd
import numpy as np

# Create a small dataframe with different column types.
df = pd.DataFrame(
    {"col_cat_1": ["a", "b", "c"]*20,
     "col_cat_2": ["x", "x", "x", "y", "y", "y"]*9 + ["q", "q", "w", "w", "r", "r"],
     "col_int": list(range(60)),
     "col_float": np.linspace(0,1,60),
     "col_bool": [True, False]*30
})
df.head()

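|   | col_cat_1 | col_cat_2 | col_int | col_float | col_bool |
|---|-----------|-----------|---------|-----------|----------|
| 0 | a | x | 0 | 0.0 | True |
| 1 | b | x | 1 | 0.016949 | False |
| 2 | c | x | 2 | 0.033898 | True |
| 3 | a | y | 3 | 0.050847 | False |
| 4 | b | y | 4 | 0.067797 | True |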
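
Before training, it can help to look at the dtypes pandas assigned to each column, since the inferred semantics depend on them. The sketch below uses a hypothetical helper, `guess_semantic`, that mirrors only the mapping visible in this tutorial (strings as categorical, integers and floats as numerical, booleans as boolean). It is an approximation for illustration, not YDF's actual inference logic.

```python
import numpy as np
import pandas as pd
from pandas.api import types as pdt

# Same dataframe as in the cell above.
df = pd.DataFrame({
    "col_cat_1": ["a", "b", "c"] * 20,
    "col_cat_2": ["x", "x", "x", "y", "y", "y"] * 9 + ["q", "q", "w", "w", "r", "r"],
    "col_int": list(range(60)),
    "col_float": np.linspace(0, 1, 60),
    "col_bool": [True, False] * 30,
})


def guess_semantic(series):
    """Hypothetical helper: a rough guess, NOT YDF's real inference code."""
    # Booleans are checked first because pandas also reports them as numeric.
    if pdt.is_bool_dtype(series):
        return "BOOLEAN"
    if pdt.is_numeric_dtype(series):
        return "NUMERICAL"
    return "CATEGORICAL"


guessed = {name: guess_semantic(df[name]) for name in df.columns}
print(df.dtypes)
print(guessed)
```

The semantics YDF actually chose are recorded in the model's data specification, which remains the authoritative place to check them.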


We can directly train a model on this dataframe.

In [3]:
model_1 = ydf.RandomForestLearner(label="col_cat_1", num_trees=10).train(df)
# See the data specification in the dataspec tab.
model_1.describe()

[INFO 23-10-31 18:13:26.2598 UTC dataset.cc:299] max_vocab_count = -1 for column col_cat_1, the dictionary will not be pruned by size.
[INFO 23-10-31 18:13:26.2622 UTC abstract_learner.cc:127] No input feature explicitly specified. Using all the available input features.
[INFO 23-10-31 18:13:26.2623 UTC abstract_learner.cc:141] The label "col_cat_1" was removed from the input feature set.
[INFO 23-10-31 18:13:26.2623 UTC random_forest.cc:416] Training random forest on 60 example(s) and 4 feature(s).
[INFO 23-10-31 18:13:26.2694 UTC random_forest.cc:802] Training of tree  1/10 (tree index:0) done accuracy:0.0952381 logloss:32.6109
[INFO 23-10-31 18:13:26.2696 UTC random_forest.cc:802] Training of tree  10/10 (tree index:8) done accuracy:0.166667 logloss:17.4497
[INFO 23-10-31 18:13:26.2708 UTC random_forest.cc:882] Final OOB metrics: accuracy:0.166667 logloss:17.4497
'Type: "RANDOM_FOREST"\nTask: CLASSIFICATION\nLabel: "col_cat_1"\n\nInput Features (4):\n\tcol_cat_2\n\tcol_int\n\tcol_float\n\tcol_bool\n\nNo weights\n\nVariable Importance: INV_MEAN_MIN_DEPTH:\n    1.   "col_int"  0.532151 ################\n    2. "col_float"  0.411677 #########\n    3. "col_cat_2"  0.254803 \n    4.  "col_bool"  0.239653 \n\nVariable Importance: NUM_AS_ROOT:\n    1.   "col_int"  6.000000 ################\n    2. "col_float"  3.000000 ######\n    3. "col_cat_2"  1.000000 \n\nVariable Importance: NUM_NODES:\n    1. "col_float" 29.000000 ################\n    2.   "col_int" 26.000000 #############\n    3.  "col_bool" 12.000000 \n    4. "col_cat_2" 11.000000 \n\nVariable Importance: SUM_SCORE:\n    1.  "col_bool" 73.616851 ################\n    2. "col_float" 70.501937 ##############\n    3.   "col_int" 59.014978 ##########\n    4. "col_cat_2" 32.388708 \n\n\n\nWinner takes all: true\nOut-of-bag evaluation: accuracy:0.166667 logloss:17.4497\nNumber of trees: 10\nTotal n

[INFO 23-10-31 18:13:26.2721 UTC abstract_model.cc:881] Model self evaluation:
Number of predictions (without weights): 60
Number of predictions (with weights): 60
Task: CLASSIFICATION
Label: col_cat_1

Accuracy: 0.166667  CI95[W][0.0933069 0.266291]
LogLoss: : 17.4497
ErrorRate: : 0.833333

Default Accuracy: : 0.333333
Default LogLoss: : 1.09861
Default ErrorRate: : 0.666667

Confusion Table:
truth\prediction
       <OOD>   a  b   c
<OOD>      0   0  0   0
    a      0   5  5  10
    b      0  12  4   4
    c      0  18  1   1
Total: 60

One vs other classes:



## Feature Selection

YDF offers many ways to customize which features to use and how to use them.

When specifying the learner, we can manually select a subset of the features.

In [4]:
model_2 = ydf.RandomForestLearner(label="col_cat_1", num_trees=10, features=["col_int", "col_bool"]).train(df)
print(model_2)



Model: RANDOM_FOREST
Task: CLASSIFICATION
Class: ydf.RandomForestModel
Use `model.describe()` for more details



[INFO 23-10-31 18:13:26.2825 UTC dataset.cc:299] max_vocab_count = -1 for column col_cat_1, the dictionary will not be pruned by size.
[INFO 23-10-31 18:13:26.2829 UTC abstract_learner.cc:127] No input feature explicitly specified. Using all the available input features.
[INFO 23-10-31 18:13:26.2829 UTC abstract_learner.cc:141] The label "col_cat_1" was removed from the input feature set.
[INFO 23-10-31 18:13:26.2829 UTC random_forest.cc:416] Training random forest on 60 example(s) and 2 feature(s).
[INFO 23-10-31 18:13:26.2853 UTC random_forest.cc:802] Training of tree  1/10 (tree index:0) done accuracy:0.142857 logloss:30.8946
[INFO 23-10-31 18:13:26.2855 UTC random_forest.cc:802] Training of tree  10/10 (tree index:9) done accuracy:0.15 logloss:21.5379
[INFO 23-10-31 18:13:26.2866 UTC random_forest.cc:882] Final OOB metrics: accuracy:0.15 logloss:21.5379


### Forcing a semantic

We can also force a semantic on a given feature. Here, we force the integer column to be treated as categorical. Setting `include_all_columns=True` ensures that columns not explicitly listed in `features` are still used as input features.

Not every semantic can be forced on every column: categorical features must be integer or string columns, while numerical features must be float or integer columns.

**Note**: Internally, YDF converts all numerical columns to 32-bit floats. It is therefore not necessary to perform conversions between numerical formats.
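
One consequence of a 32-bit float representation can be sketched with NumPy alone (no YDF call is involved here): a 32-bit float has a 24-bit significand, so integers above 2^24 are not always represented exactly. This may matter for columns that hold large identifiers and are treated as numerical.

```python
import numpy as np

# Small values, the 2**24 boundary, and the first integer past it.
ints = np.array([0, 59, 2**24, 2**24 + 1], dtype=np.int64)

# Round-trip through float32 and compare with the original integers.
as_f32 = ints.astype(np.float32)
round_trip = as_f32.astype(np.int64)
exact = round_trip == ints

print(round_trip)
print(exact)
```

For the small integers used in this tutorial the conversion is lossless. For columns with very large integer values, a categorical semantic may be the safer choice.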

In [5]:
model_3 = ydf.RandomForestLearner(
    label="col_cat_1",
    num_trees=10,  # Compute only 10 trees
    features=[ydf.Feature("col_int", semantic=ydf.Semantic.CATEGORICAL)],
    include_all_columns=True  # Include all columns, not just the ones listed in "features"
).train(df)
print(model_3)



Model: RANDOM_FOREST
Task: CLASSIFICATION
Class: ydf.RandomForestModel
Use `model.describe()` for more details



[INFO 23-10-31 18:13:26.2941 UTC dataset.cc:299] max_vocab_count = -1 for column col_cat_1, the dictionary will not be pruned by size.
[INFO 23-10-31 18:13:26.2947 UTC abstract_learner.cc:127] No input feature explicitly specified. Using all the available input features.
[INFO 23-10-31 18:13:26.2948 UTC abstract_learner.cc:141] The label "col_cat_1" was removed from the input feature set.
[INFO 23-10-31 18:13:26.2948 UTC random_forest.cc:416] Training random forest on 60 example(s) and 4 feature(s).
[INFO 23-10-31 18:13:26.2974 UTC random_forest.cc:802] Training of tree  1/10 (tree index:0) done accuracy:0.0952381 logloss:32.6109
[INFO 23-10-31 18:13:26.2976 UTC random_forest.cc:802] Training of tree  10/10 (tree index:7) done accuracy:0.101695 logloss:21.2695
[INFO 23-10-31 18:13:26.2987 UTC random_forest.cc:882] Final OOB metrics: accuracy:0.101695 logloss:21.2695
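
Since categorical features must be integer or string columns, a float column that actually holds category codes would need to be converted in pandas before a categorical semantic is forced on it. A minimal sketch, using a hypothetical `store_id` column that is not part of the tutorial dataframe:

```python
import pandas as pd

# Hypothetical column: category codes that were loaded as floats.
codes = pd.DataFrame({"store_id": [94043.0, 10001.0, 94043.0]})

# Option 1: cast to integers (only safe when every value is integral and not missing).
codes["store_id_int"] = codes["store_id"].astype("int64")

# Option 2: cast to strings, going through integers to avoid a trailing ".0".
codes["store_id_str"] = codes["store_id_int"].astype(str)

print(codes.dtypes)
print(codes["store_id_str"].tolist())
```

Either converted column could then be listed in `features` with `semantic=ydf.Semantic.CATEGORICAL`, as done for `col_int` above.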


## Fine-grained semantics

YDF builds a dictionary for each categorical feature so that it can be processed quickly. Models often generalize better when rare values are subsumed into a single "Out-of-dictionary" (OOD) value. YDF provides sensible defaults: each value appearing fewer than 5 times is considered OOD, and there can be at most 2000 non-OOD values. These defaults can be changed in the learner constructor.

In [6]:
model_4 = ydf.RandomForestLearner(
    label="col_cat_1", 
    num_trees=10,  # Compute only 10 trees
    max_vocab_count=300,  # Allow at most 300 non-OOD values.
    min_vocab_frequency=3,  # Any value appearing less than 3 times is considered OOD.
    features=[ydf.Feature("col_int", semantic=ydf.Semantic.CATEGORICAL)],
    include_all_columns=True
).train(df)
print(model_4)



Model: RANDOM_FOREST
Task: CLASSIFICATION
Class: ydf.RandomForestModel
Use `model.describe()` for more details



[INFO 23-10-31 18:13:26.3056 UTC dataset.cc:299] max_vocab_count = -1 for column col_cat_1, the dictionary will not be pruned by size.
[INFO 23-10-31 18:13:26.3061 UTC abstract_learner.cc:127] No input feature explicitly specified. Using all the available input features.
[INFO 23-10-31 18:13:26.3061 UTC abstract_learner.cc:141] The label "col_cat_1" was removed from the input feature set.
[INFO 23-10-31 18:13:26.3061 UTC random_forest.cc:416] Training random forest on 60 example(s) and 4 feature(s).
[INFO 23-10-31 18:13:26.3084 UTC random_forest.cc:802] Training of tree  1/10 (tree index:1) done accuracy:0.136364 logloss:31.1286
[INFO 23-10-31 18:13:26.3086 UTC random_forest.cc:802] Training of tree  10/10 (tree index:8) done accuracy:0.133333 logloss:19.1921
[INFO 23-10-31 18:13:26.3098 UTC random_forest.cc:882] Final OOB metrics: accuracy:0.133333 logloss:19.1921
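
To build intuition for `min_vocab_frequency` and `max_vocab_count`, the pruning rule described above can be emulated in plain pandas. The helpers `build_vocab` and `replace_ood` are hypothetical names introduced for this sketch; they follow the rule as stated in the text and are not YDF's implementation.

```python
import pandas as pd

# Same values as the "col_cat_2" column of the tutorial dataframe.
col_cat_2 = pd.Series(["x", "x", "x", "y", "y", "y"] * 9 + ["q", "q", "w", "w", "r", "r"])


def build_vocab(series, min_vocab_frequency=5, max_vocab_count=2000):
    """Hypothetical helper: keep frequent values, capped at max_vocab_count.

    Assumes max_vocab_count is a positive integer.
    """
    counts = series.value_counts()  # Sorted by descending count.
    frequent = counts[counts >= min_vocab_frequency]
    return list(frequent.index[:max_vocab_count])


def replace_ood(series, vocab):
    """Hypothetical helper: map every value outside the vocabulary to "<OOD>"."""
    return series.where(series.isin(vocab), "<OOD>")


vocab = build_vocab(col_cat_2)
pruned = replace_ood(col_cat_2, vocab)
print(sorted(vocab))
print(pruned.value_counts().to_dict())
```

With the default threshold of 5, the values `q`, `w` and `r` (two occurrences each) fall out of the dictionary. Lowering `min_vocab_frequency` to 3, as in the cell above, would not bring them back, whereas a value of 1 would.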


Fine-grained semantics can even be specified on individual features.

In [7]:
explicit_features = [
    ydf.Feature("col_cat_1", 
                min_vocab_frequency=1,  # No minimum frequency for elements of this feature.
                semantic=ydf.Semantic.CATEGORICAL  # Required when setting min_vocab_frequency.
               ),
    "col_cat_2",  # It is not necessary to provide detailed semantics for all features.
    "col_bool"
]
model_explicit_semantics = ydf.RandomForestLearner(
    label="col_int", 
    num_trees=10,  # Compute only 10 trees
    min_vocab_frequency=3,  # Any value appearing less than 3 times is considered OOD by default.
    features=explicit_features,
    include_all_columns=False
).train(df)
print(model_explicit_semantics)



Model: RANDOM_FOREST
Task: CLASSIFICATION
Class: ydf.RandomForestModel
Use `model.describe()` for more details



[INFO 23-10-31 18:13:26.3164 UTC dataset.cc:299] max_vocab_count = -1 for column col_int, the dictionary will not be pruned by size.
[INFO 23-10-31 18:13:26.3169 UTC abstract_learner.cc:127] No input feature explicitly specified. Using all the available input features.
[INFO 23-10-31 18:13:26.3169 UTC abstract_learner.cc:141] The label "col_int" was removed from the input feature set.
[INFO 23-10-31 18:13:26.3169 UTC random_forest.cc:416] Training random forest on 60 example(s) and 3 feature(s).
[INFO 23-10-31 18:13:26.3194 UTC random_forest.cc:802] Training of tree  1/10 (tree index:0) done accuracy:0 logloss:36.0437
[INFO 23-10-31 18:13:26.3199 UTC random_forest.cc:802] Training of tree  10/10 (tree index:9) done accuracy:0 logloss:36.0437
[INFO 23-10-31 18:13:26.3208 UTC random_forest.cc:882] Final OOB metrics: accuracy:0 logloss:36.0437
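
The out-of-bag accuracy of 0 in the log above is plausibly explained by the label rather than by the feature settings: the printed model reports a classification task, and `col_int` holds 60 distinct values, so every class appears exactly once. A quick pandas check, independent of YDF:

```python
import pandas as pd

# Same values as the "col_int" column used as the label above.
labels = pd.Series(range(60), name="col_int")

num_classes = labels.nunique()
max_examples_per_class = int(labels.value_counts().max())

print(num_classes, max_examples_per_class)
```

With a single example per class, an example left out of a tree's bootstrap sample carries a label that tree has never seen, so it cannot be predicted correctly. The cell above is therefore best read as a demonstration of the feature syntax, not as a useful model.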


## Advanced: Creating a VerticalDataset

Internally, YDF stores the training dataset in a data structure called `VerticalDataset`. Normally, the VerticalDataset is created automatically during training, but it can also be created explicitly. This is useful when the same dataset is reused several times, since it avoids converting the dataset from pandas on every call.

In [8]:
vds = ydf.create_vertical_dataset(
    df,
    # Columns and their semantics can be specified the same way
    # features are specified for learners
    columns=["col_cat_1", "col_int", "col_bool"]
)
vds.memory_usage()  # Prints memory usage in bytes.

540

A VerticalDataset also contains a **DataSpecification**, which collects all information about the dataset that is used during training: Semantics for each column, dictionary of categorical features, statistical information about numerical features and more.

In [9]:
vds.data_spec()  # Print the data spec.

columns {
  type: CATEGORICAL
  name: "col_cat_1"
  categorical {
    number_of_unique_values: 4
    items {
      key: "c"
      value {
        index: 3
        count: 20
      }
    }
    items {
      key: "b"
      value {
        index: 2
        count: 20
      }
    }
    items {
      key: "a"
      value {
        index: 1
        count: 20
      }
    }
    items {
      key: "<OOD>"
      value {
        index: 0
        count: 0
      }
    }
  }
  count_nas: 0
}
columns {
  type: NUMERICAL
  name: "col_int"
  numerical {
    mean: 29.5
    min_value: 0
    max_value: 59
    standard_deviation: 17.318102282486574
  }
  count_nas: 0
}
columns {
  type: BOOLEAN
  name: "col_bool"
  count_nas: 0
  boolean {
    count_true: 30
    count_false: 30
  }
}
created_num_rows: 60
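
The statistics in the data spec above can be cross-checked with pandas and NumPy. The sketch below recomputes them for the three columns of the VerticalDataset. It suggests that the reported `standard_deviation` corresponds to the population standard deviation (`ddof=0`), which differs from the pandas default (`ddof=1`). This is an observation from the printed numbers, not a statement about YDF's internals.

```python
import numpy as np
import pandas as pd

# The three columns that were placed in the VerticalDataset.
df = pd.DataFrame({
    "col_cat_1": ["a", "b", "c"] * 20,
    "col_int": list(range(60)),
    "col_bool": [True, False] * 30,
})

numerical = {
    "mean": float(df["col_int"].mean()),
    "min_value": int(df["col_int"].min()),
    "max_value": int(df["col_int"].max()),
    # np.std uses ddof=0 (population standard deviation) by default.
    "standard_deviation": float(np.std(df["col_int"].to_numpy())),
}
# pandas uses ddof=1 (sample standard deviation) by default.
sample_std = float(df["col_int"].std())

categorical_counts = df["col_cat_1"].value_counts().to_dict()
boolean = {
    "count_true": int(df["col_bool"].sum()),
    "count_false": int((~df["col_bool"]).sum()),
}

print(numerical)
print(sample_std)
print(categorical_counts)
print(boolean)
```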