# 🧭 Getting started
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/google/yggdrasil-decision-forests/blob/main/documentation/public/docs/tutorial/getting_started.ipynb)


Decision Forests (DFs) are a family of machine learning algorithms used for classification, regression, ranking, uplifting and anomaly detection. As the name implies, DFs are constructed from a collection of decision trees. The two most popular DF training algorithms today are [Random Forests](https://en.wikipedia.org/wiki/Random_forest) and [Gradient Boosted Decision Trees](https://en.wikipedia.org/wiki/Gradient_boosting).

**Yggdrasil Decision Forests** (YDF) is a comprehensive library for training, evaluating, interpreting, and serving these models. YDF is available through several interfaces, including Python, C++, and a command-line interface (CLI). It's also integrated into TensorFlow as TensorFlow Decision Forests. This notebook walks you through the Python API, which is the recommended way to get started with YDF.

For the complete API Reference and more tutorials, check out the [YDF website](https://ydf.readthedocs.io/).

## Install YDF

In [None]:
pip install ydf -U

## Import libraries

In [1]:
import ydf  # Yggdrasil Decision Forests
import pandas as pd  # Used for loading and manipulating small datasets

## Download and load dataset

We'll use the classic "Adult" dataset for this tutorial. The task is binary classification: predict whether an individual's income is `>50K` or `<=50K` from the other numerical and categorical features. The dataset also contains missing values, which YDF handles automatically.

In [2]:
ds_path = "https://raw.githubusercontent.com/google/yggdrasil-decision-forests/main/yggdrasil_decision_forests/test_data/dataset"

# Download and load the dataset into Pandas DataFrames
train_ds = pd.read_csv(f"{ds_path}/adult_train.csv")
test_ds = pd.read_csv(f"{ds_path}/adult_test.csv")

# Display the first 5 rows of the training data
train_ds.head(5)

| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44 | Private | 228057 | 7th-8th | 4 | Married-civ-spouse | Machine-op-inspct | Wife | White | Female | 0 | 0 | 40 | Dominican-Republic | <=50K |
| 1 | 20 | Private | 299047 | Some-college | 10 | Never-married | Other-service | Not-in-family | White | Female | 0 | 0 | 20 | United-States | <=50K |
| 2 | 40 | Private | 342164 | HS-grad | 9 | Separated | Adm-clerical | Unmarried | White | Female | 0 | 0 | 37 | United-States | <=50K |
| 3 | 30 | Private | 361742 | Some-college | 10 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 4 | 67 | Self-emp-inc | 171564 | HS-grad | 9 | Married-civ-spouse | Prof-specialty | Wife | White | Female | 20051 | 0 | 30 | England | >50K |


## Train a model

Let's train a Gradient Boosted Trees model using the default hyper-parameters.

In [3]:
model = ydf.GradientBoostedTreesLearner(label="income").train(train_ds)

Train model on 22792 examples
Model trained in 0:00:01.791216


**Key points**

-   YDF distinguishes between learning algorithms (called **learners**, like
    `GradientBoostedTreesLearner`) and trained **models**. You'll see the
    benefits of this distinction in more advanced examples.
-   The only required parameter for a learner is the `label`. All other
    hyper-parameters have sensible defaults.
-   Since we didn't specify the input features, all columns except for the
    label are automatically used as inputs. YDF detects feature types (e.g.,
    numerical, categorical) and handles them appropriately, including those
    with missing values.
-   By default, learners train a classification model. You can specify other
    tasks, such as regression or ranking, with the `task` parameter (e.g.,
    `task=ydf.Task.REGRESSION`).
-   Training logs can be viewed live by setting `verbose=2` in the learner.
    After training, you can access them with `model.describe()`.
-   A validation dataset was not provided. In this scenario, learners like
    `GradientBoostedTreesLearner` automatically set aside a portion of the
    training data for validation. Other learners, like `RandomForestLearner`,
    don't require a validation set and use all the data for training.
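To make the last point concrete, here is a rough sketch of what "setting aside a portion of the training data" means. It is only an illustration of a random hold-out split, not YDF's internal procedure. The `holdout_split` helper and the 10% ratio are assumptions made for this example, so check the learner's hyper-parameter documentation for the actual behavior.

```python
import random


def holdout_split(num_examples, validation_ratio=0.1, seed=0):
    """Split example indices into (train, validation) index lists.

    Hypothetical helper: the ratio and the shuffling strategy are
    illustrative assumptions, not YDF's actual implementation.
    """
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)  # Seeded shuffle for reproducibility
    num_valid = int(num_examples * validation_ratio)
    return indices[num_valid:], indices[:num_valid]


# 22792 is the number of training examples reported in the training log.
train_idx, valid_idx = holdout_split(22792)
print(f"{len(train_idx)} training examples, {len(valid_idx)} validation examples")
```

The validation examples are never used to fit the trees. They are only used to monitor the model's quality during training.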

## Inspect the model

The `model.describe()` method provides an overview of your model, including:

-   **Model**: The model's task, input features, and size.
-   **Dataspec**: Statistics about each input feature.
-   **Training**: The training and validation loss and performance metrics.
-   **Variable Importance**: A ranking of the features that are most influential for the model.
-   **Structure**: A plot of the first tree of the model.
-   **Tuning**: Logs from hyper-parameter tuning (if enabled).

In [4]:
model.describe()

## Make predictions

To get predictions, simply use the `model.predict()` method. It returns the predictions as a NumPy array.

In [5]:
model.predict(test_ds)

array([0.01860435, 0.36130956, 0.83858865, ..., 0.03087652, 0.08280362,
       0.00970956], shape=(9769,), dtype=float32)
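The predictions above are floating-point scores rather than class names. The sketch below turns such scores into labels, assuming each value is the predicted probability of the `>50K` class. That is an assumption about the class order, which you may want to confirm with `model.label_classes()`. The `to_labels` helper and the 0.5 threshold are illustrative choices, not part of the YDF API.

```python
def to_labels(probabilities, classes=("<=50K", ">50K"), threshold=0.5):
    """Map positive-class probabilities to class names.

    Assumes `classes` is ordered as (negative, positive) and that each
    probability refers to the positive class.
    """
    negative, positive = classes
    return [positive if p >= threshold else negative for p in probabilities]


# A few of the prediction values shown in the output above.
sample = [0.01860435, 0.36130956, 0.83858865, 0.03087652]
print(to_labels(sample))
```

The helper accepts any iterable of floats, so it also works on the NumPy array returned by `model.predict()`. Raising the threshold trades recall for precision on the `>50K` class.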

Methods like `train()` and `predict()` are flexible and accept data in various formats, including Pandas DataFrames, dictionaries of lists or NumPy arrays, TensorFlow Datasets, or even file paths.

In [6]:
# Prediction with a dictionary
model.predict({
    'age': [39],
    'workclass': ['State-gov'],
    'fnlwgt': [77516],
    'education': ['Bachelors'],
    'education_num': [13],
    'marital_status': ['Never-married'],
    'occupation': ['Adm-clerical'],
    'relationship': ['Not-in-family'],
    'race': ['White'],
    'sex': ['Male'],
    'capital_gain': [2174],
    'capital_loss': [0],
    'hours_per_week': [40],
    'native_country': ['United-States'],
    'income': ['<=50K'],
})

array([0.01860435], dtype=float32)
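The dictionary format is column-oriented: each key is a feature name and each value is a list with one entry per example. If your examples arrive as one record per row instead (for instance, parsed from JSON), a small conversion like the sketch below produces that layout. The `records_to_columns` helper is hypothetical and assumes every record has the same set of keys.

```python
def records_to_columns(records):
    """Convert a list of row dictionaries into a dictionary of column lists."""
    if not records:
        return {}
    names = list(records[0])
    for record in records:
        # Columns would become misaligned if a record had different keys.
        if set(record) != set(names):
            raise ValueError("All records must have the same keys")
    return {name: [record[name] for record in records] for name in names}


records = [
    {"age": 39, "workclass": "State-gov", "hours_per_week": 40},
    {"age": 50, "workclass": "Private", "hours_per_week": 13},
]
columns = records_to_columns(records)
print(columns)
```

The resulting dictionary has the same shape as the one passed to `model.predict()` above. A real call would need all the model's input features, not just the three shown here.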

## Evaluate model

While the internal validation set gives us a good idea of the model's quality, we should also evaluate its performance on the unseen test dataset.

In [7]:
evaluation = model.evaluate(test_ds)

# Query individual evaluation metrics
print(f"Test accuracy: {evaluation.accuracy}")

# Show the full evaluation report
print("Full evaluation report:")
evaluation

Test accuracy: 0.8737844201044119
Full evaluation report:


| Label \ Pred | <=50K | >50K |
|---|---|---|
| <=50K | 6961 | 451 |
| >50K | 782 | 1575 |
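The reported accuracy can be recomputed by hand from the confusion matrix, which is a useful sanity check on how to read it: rows are true labels and columns are predictions. The sketch below also derives precision and recall for the `>50K` class. These two values are computed here for illustration and are not taken from the YDF report.

```python
# Confusion matrix from the report above: confusion[true_label][predicted_label].
confusion = {
    "<=50K": {"<=50K": 6961, ">50K": 451},
    ">50K": {"<=50K": 782, ">50K": 1575},
}

total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[label][label] for label in confusion)
accuracy = correct / total

# Treat ">50K" as the positive class.
true_positives = confusion[">50K"][">50K"]
false_positives = confusion["<=50K"][">50K"]
false_negatives = confusion[">50K"]["<=50K"]
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)

print(f"examples={total} accuracy={accuracy:.4f}")
print(f"precision={precision:.4f} recall={recall:.4f}")
```

The total of 9769 examples matches the number of predictions returned by `model.predict(test_ds)`, and the accuracy matches the value printed by `evaluation.accuracy`.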


## Analyze model

With `model.analyze(ds)`, you can gain deeper insights into your model's behavior. For instance, [Partial Dependence Plots](https://christophm.github.io/interpretable-ml-book/pdp.html) (PDP) show how the model's predictions change as a single feature's value changes.

In [8]:
model.analyze(test_ds, sampling=0.1)
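To build intuition for what a PDP measures, here is a minimal sketch of the underlying computation: for each candidate value of a feature, overwrite that feature in every example and average the model's predictions. This is not how `model.analyze()` is implemented. The `partial_dependence` helper and the `toy_predict` model are hypothetical stand-ins so that the example runs on its own.

```python
def partial_dependence(predict, examples, feature, grid):
    """Average prediction when `feature` is forced to each value in `grid`."""
    curve = []
    for value in grid:
        # Copy each example, overriding only the feature under study.
        modified = [{**example, feature: value} for example in examples]
        predictions = predict(modified)
        curve.append(sum(predictions) / len(predictions))
    return curve


def toy_predict(examples):
    """Hypothetical model: the score grows with age and hours worked."""
    return [
        min(1.0, 0.01 * e["age"] + 0.005 * e["hours_per_week"])
        for e in examples
    ]


examples = [
    {"age": 20, "hours_per_week": 20},
    {"age": 40, "hours_per_week": 40},
    {"age": 67, "hours_per_week": 30},
]
grid = [20, 40, 60]
curve = partial_dependence(toy_predict, examples, "age", grid)
for value, mean_prediction in zip(grid, curve):
    print(f"age={value}: mean prediction={mean_prediction:.2f}")
```

Plotting `curve` against `grid` gives the partial dependence curve for `age`. The other features keep their observed values, so the curve shows the average effect of `age` alone.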

## Benchmark model speed

For applications where inference speed is critical, you can use `model.benchmark(ds)` to measure its performance.

In [9]:
model.benchmark(test_ds)

Single-thread inference time per example: 0.718 us (microseconds)
Details: 4190901 predictions in 0.000 seconds

Multi-thread inference time per example: 0.059 us (microseconds)
Details: 36663057 predictions in 0.000 seconds using 24 threads

* Measured with the C++ serving API. See model.to_cpp().
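Per-example latencies in microseconds can be hard to interpret, so it may help to convert them into throughput. The short calculation below uses only the numbers printed above. The variable names and the notion of "parallel efficiency" are added here for illustration.

```python
# Values copied from the benchmark output above.
single_thread_us = 0.718  # Microseconds per example, one thread
multi_thread_us = 0.059   # Microseconds per example, all threads
num_threads = 24


def predictions_per_second(microseconds_per_example):
    """Convert a per-example latency in microseconds into a throughput."""
    return 1e6 / microseconds_per_example


single_throughput = predictions_per_second(single_thread_us)
multi_throughput = predictions_per_second(multi_thread_us)
speedup = single_thread_us / multi_thread_us
efficiency = speedup / num_threads

print(f"Single-thread: {single_throughput:,.0f} predictions/s")
print(f"Multi-thread:  {multi_throughput:,.0f} predictions/s")
print(f"Speedup: {speedup:.1f}x on {num_threads} threads ({efficiency:.0%} efficiency)")
```

On this machine, the multi-threaded run is roughly 12 times faster than the single-threaded one, which is about half of the ideal 24x speedup.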

These numbers are measured with the underlying C++ serving API, so they do not include the overhead introduced by the Python API. To benchmark or serve the model directly in C++, use `model.to_cpp()` to generate standalone C++ code.

In [10]:
print(model.to_cpp())

// Automatically generated code running an Yggdrasil Decision Forests model in
// C++. This code was generated with "model.to_cpp()".
//
// Date of generation: 2025-06-25 11:39:07.368064
// YDF Version: 0.12.0
//
// How to use this code:
//
// 1. Copy this code in a new .h file.
// 2. If you use Bazel/Blaze, use the following dependencies:
//      //third_party/absl/status:statusor
//      //third_party/absl/strings
//      //third_party/yggdrasil_decision_forests/api:serving
// 3. In your existing code, include the .h file. Make predictions as follows:
//   // Load the model (to do only once).
//   namespace ydf = yggdrasil_decision_forests;
//   const auto model = ydf::exported_model_123::Load(<path to model>);
//   // Run the model
//   predictions = model.Predict();
// 4. By default, the "Predict" function takes no inputs and creates fake
//   examples. In practice, you want to add your input data as arguments to
//   "Predict" and call "examples->Set..." functions accordingly.
// 

## Save model

Finally, let's save our trained model so we can use it later without retraining.

In [11]:
model.save("/tmp/my_model")

Loading it back is just as easy:

In [12]:
loaded_model = ydf.load_model("/tmp/my_model")

loaded_model.describe()

## Conclusion

That's it! You now know the basics of using YDF. 😊

To learn more, check out the other tutorials on [ydf.readthedocs.io](https://ydf.readthedocs.io/). For example, you can discover how to:

-   Train models for ranking, regression, or uplifting using the `task` argument.
-   Find the nearest neighbors between examples with `model.distance()`.
-   Enforce monotonic constraints on your features with the `features` argument.
-   Convert the model to a TensorFlow SavedModel for serving with `model.to_tensorflow_saved_model()`.
-   Use feature selection to improve training time and model quality.
-   Train on billions of examples with distributed training.