# In C++ [Standalone]

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/google/yggdrasil-decision-forests/blob/main/documentation/public/docs/tutorial/cpp_standalone.ipynb)

## Setup

In [None]:
pip install ydf -U

## What is C++ Standalone?

Once trained, YDF models can be integrated into your C++ software using one of
two solutions:

-   **YDF Lib:** Copy your model data into your binary (or copy it in a
    directory accessible by your binary) and load it using the YDF library. This
    approach lets you change the model without recompiling your library, as
    detailed in the
    [In C++ tutorial](https://ydf.readthedocs.io/en/latest/tutorial/cpp/).

-   **YDF Standalone (this tutorial):** Compile your model into a
    dependency-free .h file that you include directly in your code. This
    solution generates significantly smaller code (up to 700x reduction
    observed), has no YDF dependency, which improves portability, and offers a
    simpler API.

## How to use C++ Standalone?

YDF models can be integrated in two ways:

-   **Direct Code Generation:** Call `model.to_standalone_cc()` to generate the
    source code. This option is simple and great for experimentation.

-   **Build Rule Integration:** For production, save your model (e.g., in
    Google3) and use a `cc_ydf_standalone_model` Blaze/Bazel rule. This option
    automatically calls `to_standalone_cc` during compilation, simplifying
    model updates and option testing.

Both methods are demonstrated in this tutorial.

## Import libraries

In [2]:
import pandas as pd
import ydf

## Training a small model

First, we train a small YDF model on the Adult dataset.

In [3]:
# Download a classification dataset and load it as a Pandas DataFrame.
ds_path = "https://raw.githubusercontent.com/google/yggdrasil-decision-forests/main/yggdrasil_decision_forests/test_data/dataset"
train_ds = pd.read_csv(f"{ds_path}/adult_train.csv")

model = ydf.GradientBoostedTreesLearner(label="income", num_trees=2).train(
    train_ds
)
# Note: Only train 2 trees to make the generated code smaller.

model.describe()

Train model on 22792 examples
Model trained in 0:00:00.025254


## Direct Code Generation

Let's generate the model .h file. It contains the following symbols.

-   **`Instance` struct:** An input example. Each input feature is an attribute
    (e.g., *age*, *workclass*).
-   **`Predict` function:** A thread-safe function that consumes an *Instance*
    and returns a label class (for classification).
-   **`Label`:** The label values. In this case, this is a binary classification
    model with two labels `Label::kLt50K` and `Label::kGt50K`.
-   **Categorical enums:** An enum class for each of the categorical input
    features e.g. *FeatureWorkclass*, *FeatureEducation*.

In [4]:
print(model.to_standalone_cc())

#ifndef YDF_MODEL_YDF_MODEL_H_
#define YDF_MODEL_YDF_MODEL_H_

#include <stdint.h>
#include <cstring>
#include <array>
#include <algorithm>
#include <bitset>
#include <cassert>

namespace ydf_model {

enum class Label : uint32_t {
  kLt50K = 0,
  kGt50K = 1,
};

enum class FeatureWorkclass : uint32_t {
  kOutOfVocabulary = 0,
  kPrivate = 1,
  kSelfEmpNotInc = 2,
  kLocalGov = 3,
  kStateGov = 4,
  kSelfEmpInc = 5,
  kFederalGov = 6,
  kWithoutPay = 7,
};

enum class FeatureEducation : uint32_t {
  kOutOfVocabulary = 0,
  kHsGrad = 1,
  kSomeCollege = 2,
  kBachelors = 3,
  kMasters = 4,
  kAssocVoc = 5,
  k11th = 6,
  kAssocAcdm = 7,
  k10th = 8,
  k7th8th = 9,
  kProfSchool = 10,
  k9th = 11,
  k12th = 12,
  kDoctorate = 13,
  k5th6th = 14,
  k1st4th = 15,
  kPreschool = 16,
};

enum class FeatureMaritalStatus : uint32_t {
  kOutOfVocabulary = 0,
  kMarriedCivSpouse = 1,
  kNeverMarried = 2,
  kDivorced = 3,
  kWidowed = 4,
  kSeparated = 5,
  kMarriedSpouseAbsent = 6,
  kMarriedAfSp

In your C++ code, call the model as:

```c++
#include "ydf_model.h"

void f() {
  using namespace ydf_model;
  const Label prediction = Predict(Instance{
      .age = 39,
      .workclass = FeatureWorkclass::kStateGov,
      .fnlwgt = 77516,
      .education = FeatureEducation::kBachelors,
      .education_num = 13,
      .marital_status = FeatureMaritalStatus::kNeverMarried,
      .occupation = FeatureOccupation::kAdmClerical,
      .relationship = FeatureRelationship::kNotInFamily,
      .race = FeatureRace::kWhite,
      .sex = FeatureSex::kMale,
      .capital_gain = 2174,
      .capital_loss = 0,
      .hours_per_week = 40,
      .native_country = FeatureNativeCountry::kUnitedStates,
  });
  if (prediction == Label::kLt50K) {
    // ...
  } else if (prediction == Label::kGt50K) {
    // ...
  }
}
```

By default, `Predict` returns a class for a classification model. Instead, the
function can return a probability (or probabilities for a multi-class model) or
scores (e.g., logits) with the `classification_output` argument. For example:

-   `model.to_standalone_cc(classification_output='PROBABILITY')`: Returns a
    probability (`float`) or probabilities (`std::array` of `float`).
-   `model.to_standalone_cc(classification_output='SCORE')`: Returns scores.
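
The three output modes are related. As a rough sketch, assuming the score of a binary model is a logit and the probability is its sigmoid (a common convention for gradient boosted trees, not checked against the generated code here), the conversion could look like the code below. `ScoreToProbability`, `ProbabilityToLabel`, and the 0.5 threshold are illustrative choices, not symbols of the generated header.

```cpp
#include <stdint.h>
#include <cassert>
#include <cmath>

// Mirrors the Label enum of the generated header shown earlier.
enum class Label : uint32_t {
  kLt50K = 0,
  kGt50K = 1,
};

// Assumption: the score is a logit, so the probability of the positive class
// is the sigmoid of the score.
inline float ScoreToProbability(float score) {
  return 1.0f / (1.0f + std::exp(-score));
}

// Assumption: the positive class is selected when its probability reaches 0.5.
inline Label ProbabilityToLabel(float probability) {
  assert(probability >= 0.0f && probability <= 1.0f);
  return probability >= 0.5f ? Label::kGt50K : Label::kLt50K;
}
```

Exporting probabilities or scores instead of classes lets the caller pick its own decision threshold.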

Categorical feature values are created from the corresponding enum class, e.g.,
`FeatureRelationship::kNotInFamily`. Categorical values can also be created from
a string, e.g., `FeatureRelationshipFromString("Not-in-family")`, although this
is less efficient and can lead to a larger binary. The `*FromString` symbols are
generated if the model is exported with `categorical_from_string=True`.

**Note:** If a string does not match an existing categorical value, the
`kOutOfVocabulary` value is returned.

```c++
{
  .age = 39,
  .workclass = FeatureWorkclassFromString("State-gov"),
  .fnlwgt = 77516,
  .education = FeatureEducationFromString("Bachelors"),
  ...
}
```
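
To make the out-of-vocabulary fallback concrete, here is a minimal sketch of a string lookup. It is not the generated implementation: `WorkclassFromStringSketch` and `kWorkclassVocabulary` are hypothetical names, and the strings are the Adult dataset spellings assumed to correspond to the enum values.

```cpp
#include <stdint.h>
#include <array>
#include <cassert>
#include <string_view>
#include <utility>

// Mirrors the FeatureWorkclass enum of the generated header shown earlier.
enum class FeatureWorkclass : uint32_t {
  kOutOfVocabulary = 0,
  kPrivate = 1,
  kSelfEmpNotInc = 2,
  kLocalGov = 3,
  kStateGov = 4,
  kSelfEmpInc = 5,
  kFederalGov = 6,
  kWithoutPay = 7,
};

// Hypothetical vocabulary table (assumed spellings, see the text above).
inline constexpr std::array<std::pair<std::string_view, FeatureWorkclass>, 7>
    kWorkclassVocabulary = {{
        {"Private", FeatureWorkclass::kPrivate},
        {"Self-emp-not-inc", FeatureWorkclass::kSelfEmpNotInc},
        {"Local-gov", FeatureWorkclass::kLocalGov},
        {"State-gov", FeatureWorkclass::kStateGov},
        {"Self-emp-inc", FeatureWorkclass::kSelfEmpInc},
        {"Federal-gov", FeatureWorkclass::kFederalGov},
        {"Without-pay", FeatureWorkclass::kWithoutPay},
    }};

// Returns the matching enum value, or kOutOfVocabulary when nothing matches.
// The comparison is exact and case-sensitive.
inline FeatureWorkclass WorkclassFromStringSketch(std::string_view name) {
  for (const auto& [key, value] : kWorkclassVocabulary) {
    if (key == name) return value;
  }
  return FeatureWorkclass::kOutOfVocabulary;
}

inline void WorkclassLookupExample() {
  assert(WorkclassFromStringSketch("State-gov") == FeatureWorkclass::kStateGov);
}
```

In this sketch a misspelled or differently cased string silently maps to `kOutOfVocabulary`, which is why the enum values are the safer choice when the category is known at compile time.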

If you look at the content of the `Predict` function, you will see a for-loop
over the trees and a while-loop over the nodes. This is called the "routing"
algorithm, and it is a simple and generally efficient way to generate
predictions with a decision forest.
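
The shape of such a routing loop can be sketched on a toy forest. Everything below is hypothetical (the `Node` layout, the thresholds, the leaf values, and the name `RoutingScore`); the generated code may store and traverse nodes differently.

```cpp
#include <stdint.h>
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical node layout. The nodes of a tree are stored in depth-first
// order: the negative child directly follows its parent, and the positive
// child is `positive_offset` nodes further.
struct Node {
  bool is_leaf;
  float leaf_value;         // Only meaningful for leaves.
  uint8_t feature;          // Index of the tested feature (non-leaves).
  float threshold;          // Condition: features[feature] >= threshold.
  uint8_t positive_offset;  // Jump applied when the condition is true.
};

constexpr std::array<Node, 6> kNodes = {{
    // Tree 0
    {false, 0.0f, 0, 30.0f, 2},
    {true, -0.5f, 0, 0.0f, 0},
    {true, 0.75f, 0, 0.0f, 0},
    // Tree 1
    {false, 0.0f, 1, 40.0f, 2},
    {true, -0.25f, 0, 0.0f, 0},
    {true, 0.25f, 0, 0.0f, 0},
}};

// Index of the root node of each tree.
constexpr std::array<std::size_t, 2> kRoots = {0, 3};

inline float RoutingScore(const std::array<float, 2>& features) {
  float score = 0.0f;
  for (const std::size_t root : kRoots) {  // Loop over the trees.
    std::size_t node = root;
    while (!kNodes[node].is_leaf) {        // Loop over the nodes.
      const Node& n = kNodes[node];
      node += (features[n.feature] >= n.threshold) ? n.positive_offset : 1;
      assert(node < kNodes.size());
    }
    score += kNodes[node].leaf_value;      // Accumulate the leaf outputs.
  }
  return score;
}
```

Because the forest lives in a data table, the same small loop serves every tree, which keeps the code size mostly independent of the number of nodes.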

Other algorithms are available with the `algorithm` argument. For example, the
code generated with `algorithm="IF_ELSE"` will be a succession of nested
if-else statements.
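
As an illustration of the if-else style, here is a hand-written toy forest with two trees. The feature names, thresholds, leaf values, and the name `IfElseScore` are made up for this sketch and do not come from the trained model.

```cpp
#include <cassert>

// Toy forest with two hypothetical trees, each encoded as if-else statements.
inline float IfElseScore(float age, float hours_per_week) {
  float score = 0.0f;

  // Tree 0: two levels of nested conditions.
  if (age >= 30.0f) {
    if (hours_per_week >= 40.0f) {
      score += 0.75f;
    } else {
      score += 0.25f;
    }
  } else {
    score -= 0.5f;
  }

  // Tree 1: a single condition.
  if (hours_per_week >= 40.0f) {
    score += 0.25f;
  } else {
    score -= 0.25f;
  }

  return score;
}

inline void IfElseExample() {
  // Tree 0 contributes 0.75 and tree 1 contributes 0.25.
  assert(IfElseScore(39.0f, 40.0f) == 1.0f);
}
```

In this style the tree structure is encoded in the control flow rather than in a data table, so the amount of code grows with the number of nodes.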

*In the following cell, check the content of the `Predict` function at the
bottom*

In [5]:
print(model.to_standalone_cc(algorithm="IF_ELSE"))

#ifndef YDF_MODEL_YDF_MODEL_H_
#define YDF_MODEL_YDF_MODEL_H_

#include <stdint.h>
#include <cstring>
#include <array>
#include <algorithm>
#include <bitset>
#include <cassert>

namespace ydf_model {

enum class Label : uint32_t {
  kLt50K = 0,
  kGt50K = 1,
};

enum class FeatureWorkclass : uint32_t {
  kOutOfVocabulary = 0,
  kPrivate = 1,
  kSelfEmpNotInc = 2,
  kLocalGov = 3,
  kStateGov = 4,
  kSelfEmpInc = 5,
  kFederalGov = 6,
  kWithoutPay = 7,
};

enum class FeatureEducation : uint32_t {
  kOutOfVocabulary = 0,
  kHsGrad = 1,
  kSomeCollege = 2,
  kBachelors = 3,
  kMasters = 4,
  kAssocVoc = 5,
  k11th = 6,
  kAssocAcdm = 7,
  k10th = 8,
  k7th8th = 9,
  kProfSchool = 10,
  k9th = 11,
  k12th = 12,
  kDoctorate = 13,
  k5th6th = 14,
  k1st4th = 15,
  kPreschool = 16,
};

enum class FeatureMaritalStatus : uint32_t {
  kOutOfVocabulary = 0,
  kMarriedCivSpouse = 1,
  kNeverMarried = 2,
  kDivorced = 3,
  kWidowed = 4,
  kSeparated = 5,
  kMarriedSpouseAbsent = 6,
  kMarriedAfSp

The data type (dtype) of numerical features in your training dataset affects
your compiled model's size. A model trained with int16 or int8 numerical
features will be smaller than one trained with int32 or float values. In the
next example, we'll cast the training dataset to a smaller data type to get a
smaller model.

**Note:** To be effective, all the numerical features need to be cast.
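
One plausible reason for the size difference, offered here as an assumption rather than a description of the generated code, is that split thresholds can be stored in the same narrow type as the feature. The hypothetical helper `ThresholdTableBytes` below only compares the storage needed for one threshold per node.

```cpp
#include <stdint.h>
#include <cassert>
#include <cstddef>

// Bytes needed to store one threshold of type T per node.
// Hypothetical layout, used only to compare sizes.
template <typename T>
constexpr std::size_t ThresholdTableBytes(std::size_t num_nodes) {
  return num_nodes * sizeof(T);
}

inline void CompareThresholdSizes() {
  constexpr std::size_t kNumNodes = 1000;
  // int16_t takes 2 bytes per threshold.
  assert(ThresholdTableBytes<int16_t>(kNumNodes) == 2000);
  // float is wider than int16_t on common platforms (typically 4 bytes).
  assert(ThresholdTableBytes<int16_t>(kNumNodes) <
         ThresholdTableBytes<float>(kNumNodes));
}
```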

In [6]:
# Before casting
train_ds.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education_num      int64
marital_status    object
occupation        object
relationship      object
race              object
sex               object
capital_gain       int64
capital_loss       int64
hours_per_week     int64
native_country    object
income            object
dtype: object

In [7]:
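casted_train_ds = train_ds.copy()
for col in casted_train_ds.columns:
  if casted_train_ds[col].dtype in ["int32", "int64"]:
    # Caution: astype does not check ranges. Values outside the int16 range
    # [-32768, 32767] wrap around silently, so check each column first.
    casted_train_ds[col] = casted_train_ds[col].astype("int16")
# After casting
casted_train_ds.dtypes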

age                int16
workclass         object
fnlwgt             int16
education         object
education_num      int16
marital_status    object
occupation        object
relationship      object
race              object
sex               object
capital_gain       int16
capital_loss       int16
hours_per_week     int16
native_country    object
income            object
dtype: object

## Build Rule Integration

Instead of manually saving the result of `model.to_standalone_cc()` to a file,
you can use the `cc_ydf_standalone_model` Blaze/Bazel rule. The steps are:

1.  Save the model with `model.save(...)` in a new directory in your source code
    (e.g., in Google3).

```python
model.save("my_project/ydf_model_data")
```

2.  Create a BUILD file with a `filegroup` in the model directory:

*File: my_project/ydf_model_data/BUILD*

```python
filegroup(name = "ydf_model_data", srcs = glob(["**"]))
```

3.  In your library's `BUILD`, create a `cc_ydf_standalone_model` build rule.

*File: my_project/BUILD*

```python
load("//third_party/yggdrasil_decision_forests/serving/embed:embed.bzl", "cc_ydf_standalone_model")

cc_ydf_standalone_model(
  name = "ydf_model", # Rule name, .h filename, and namespace in the .h file.
  data = "//my_project/ydf_model_data",
  # Compilation options here.
  classification_output = "PROBABILITY",
)
```

4.  In your `cc_binary` or `cc_library`, add `":ydf_model"` as a dependency.

*File: my_project/BUILD*

```python
cc_binary(
  name = "main",
  srcs = ["main.cc"],
  deps = [":ydf_model"],
)
```

5.  In your C++ code, include and call the model:

```c++
#include "my_project/ydf_model.h"

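using namespace ydf_model;

// `f1`, `f2`, and `FeatureF2` are placeholder names. Because the build rule
// above sets classification_output = "PROBABILITY", Predict returns
// probabilities rather than a Label.
const auto prediction = Predict(Instance{.f1 = 5, .f2 = FeatureF2::kRed});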
```