# 🧭 Getting started
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/google/yggdrasil-decision-forests/blob/main/documentation/public/docs/tutorial/getting_started.ipynb)


Decision Forests (DFs) are machine learning algorithms for classification, regression, uplifting, and ranking. As the name suggests, DFs are built from decision trees. Today, the two most popular DF training algorithms are [Random Forests](https://en.wikipedia.org/wiki/Random_forest) and [Gradient Boosted Decision Trees](https://en.wikipedia.org/wiki/Gradient_boosting).

**Yggdrasil Decision Forests** (YDF) is a library to train, evaluate, understand, and serve decision forest models. YDF is available in Python, in C++, as a CLI, and in TensorFlow under the name TensorFlow Decision Forests. This notebook demonstrates the Python API, which is the recommended way to use YDF.

For the API Reference and other tutorials, check the [YDF website](https://ydf.readthedocs.io/).

## Install YDF

In [None]:
pip install ydf -U

## Import libraries

In [13]:
import ydf  # Yggdrasil Decision Forests
import pandas as pd  # We use Pandas to load small datasets

## Download and load dataset

We use the Adult binary classification dataset. The objective is to predict the value of the `income` column, which is either `<=50K` or `>50K`, using the other numerical and categorical columns. This dataset contains missing values.

In [14]:
ds_path = "https://raw.githubusercontent.com/google/yggdrasil-decision-forests/main/yggdrasil_decision_forests/test_data/dataset"

# Download and load the dataset as Pandas DataFrames
train_ds = pd.read_csv(f"{ds_path}/adult_train.csv")
test_ds = pd.read_csv(f"{ds_path}/adult_test.csv")

# Print the first 5 training examples
train_ds.head(5)

|   | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44 | Private | 228057 | 7th-8th | 4 | Married-civ-spouse | Machine-op-inspct | Wife | White | Female | 0 | 0 | 40 | Dominican-Republic | <=50K |
| 1 | 20 | Private | 299047 | Some-college | 10 | Never-married | Other-service | Not-in-family | White | Female | 0 | 0 | 20 | United-States | <=50K |
| 2 | 40 | Private | 342164 | HS-grad | 9 | Separated | Adm-clerical | Unmarried | White | Female | 0 | 0 | 37 | United-States | <=50K |
| 3 | 30 | Private | 361742 | Some-college | 10 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 4 | 67 | Self-emp-inc | 171564 | HS-grad | 9 | Married-civ-spouse | Prof-specialty | Wife | White | Female | 20051 | 0 | 30 | England | >50K |


## Train a model

Let's train a gradient boosted trees model using default values for all the hyper-parameters.

In [15]:
model = ydf.GradientBoostedTreesLearner(label="income").train(train_ds)

Train model on 22792 examples
Model trained in 0:00:03.698584


**Remarks**

- YDF distinguishes between learning algorithms (a.k.a. **learners**, such as `GradientBoostedTreesLearner`) and **models**. Later, in more advanced examples, you will see why this separation is useful :).
- The only required parameter for a learner is `label`. Other parameters have good default values.
- We did not specify input features, so all the columns other than the label are used as input features. Feature types (e.g. numerical, categorical, boolean, text, possibly with missing values) are detected automatically and ingested.
- By default, learners train classification models. Other tasks (e.g., regression, ranking, uplifting) can be configured with the `task` parameter, e.g. `task=ydf.Task.REGRESSION`.
- Training logs can be shown during training with the `verbose=2` argument, or after training with `model.describe()`. This is useful for debugging and understanding the training process.
- A validation dataset was not specified. In this case, learners such as `GradientBoostedTreesLearner` will extract data from the training dataset that can be used for validation. Other learners such as `RandomForestLearner` do not require a validation dataset and will use all the data for training.
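
The last remark can be pictured with a small sketch. It is not YDF's implementation: it only illustrates how a learner might set aside part of the training examples for validation. The helper name `split_train_valid`, the 10% hold-out ratio and the seed are assumptions chosen for illustration.

```python
import random


def split_train_valid(num_examples, valid_ratio=0.1, seed=1234):
    """Randomly assigns each example index to the training or validation part.

    Hypothetical helper: the ratio and the sampling scheme are assumptions,
    not a description of what YDF does internally.
    """
    rng = random.Random(seed)
    train_idx, valid_idx = [], []
    for i in range(num_examples):
        if rng.random() < valid_ratio:
            valid_idx.append(i)
        else:
            train_idx.append(i)
    return train_idx, valid_idx


# 22792 is the number of training examples reported in the training log above.
train_idx, valid_idx = split_train_valid(22792)
print(f"{len(train_idx)} examples for training, {len(valid_idx)} for validation")
```

The validation part is never used to grow trees. It is only used to monitor the model during training, which is why learners that do not need such monitoring can train on all the data.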


## Look at the model

With `model.describe()`, we can look at:

- **Model**: The model task, input features and size.
- **Dataspec**: The types and statistics of all the input features.
- **Training**: The training and validation loss and metrics.
- **Tuning** (only if hyper-parameter tuning is enabled): The tuning logs.
- **Variable importance**: What features matter most to the model.
- **Structure**: The trees in the model.

In [16]:
model.describe()

## Make predictions

`model.predict(ds)` applies the model to a dataset and returns the predictions as a NumPy array.

In [17]:
model.predict(test_ds)

array([0.01860435, 0.36130956, 0.83858865, ..., 0.03087652, 0.08280362,
       0.00970956], dtype=float32)
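
Each value above is a single number per example. The sketch below assumes these values are the probabilities of the second label class (`>50K`), and that a 0.5 threshold is an acceptable decision rule. Both are assumptions made for illustration, and the class order is taken from the confusion matrix shown later in this notebook.

```python
# First three probabilities copied from the prediction output above.
probabilities = [0.01860435, 0.36130956, 0.83858865]

# Assumption: index 0 is the negative class and index 1 the positive class.
classes = ["<=50K", ">50K"]
threshold = 0.5  # Assumed decision threshold.

# A probability at or above the threshold selects the positive class.
predicted_labels = [classes[int(p >= threshold)] for p in probabilities]
print(predicted_labels)  # ['<=50K', '<=50K', '>50K']
```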

Methods that consume datasets, such as `train` and `predict`, support multiple dataset formats, such as Pandas DataFrames, dictionaries of lists or NumPy arrays, TensorFlow Datasets, and even file paths!

In [18]:
# Prediction with a dictionary
model.predict({
    'age': [39],
    'workclass': ['State-gov'],
    'fnlwgt': [77516],
    'education': ['Bachelors'],
    'education_num': [13],
    'marital_status': ['Never-married'],
    'occupation': ['Adm-clerical'],
    'relationship': ['Not-in-family'],
    'race': ['White'],
    'sex': ['Male'],
    'capital_gain': [2174],
    'capital_loss': [0],
    'hours_per_week': [40],
    'native_country': ['United-States'],
    'income': ['<=50K'],
})

array([0.01860435], dtype=float32)
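
The dictionary format is columnar: one key per feature, and one list entry per example. When examples arrive as row records instead (a list of dictionaries), they have to be pivoted first. The helper `rows_to_columns` below is a hypothetical name introduced for this sketch. It is not part of YDF.

```python
def rows_to_columns(rows):
    """Pivots a list of row dictionaries into a dictionary of column lists."""
    if not rows:
        return {}
    expected_keys = set(rows[0])
    columns = {name: [] for name in rows[0]}
    for row in rows:
        # All rows must describe the same features, otherwise the column
        # lists would end up with different lengths.
        if set(row) != expected_keys:
            raise ValueError(f"Inconsistent feature names in row: {row}")
        for name, value in row.items():
            columns[name].append(value)
    return columns


rows = [
    {"age": 39, "workclass": "State-gov", "hours_per_week": 40},
    {"age": 20, "workclass": "Private", "hours_per_week": 20},
]
columns = rows_to_columns(rows)
print(columns)
```

The resulting dictionary has the same shape as the one passed to `model.predict` above, with two examples instead of one.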

## Evaluate model

While the validation dataset above provides an indication of the model's quality, we also want to evaluate the model on the test dataset.

In [19]:
evaluation = model.evaluate(test_ds)

# Query individual evaluation metrics
print(f"test accuracy: {evaluation.accuracy}")

# Show the full evaluation report
print("Full evaluation report:")
evaluation

test accuracy: 0.8738867847271983
Full evaluation report:


| Label \ Pred | <=50K | >50K |
|---|---|---|
| <=50K | 6962 | 782 |
| >50K | 450 | 1575 |
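
The reported accuracy can be recomputed by hand from the confusion matrix above, which is a useful sanity check on how to read it (rows are true labels, columns are predictions). The sketch below also derives precision and recall for the `>50K` class. Treating `>50K` as the positive class is a choice made for this illustration.

```python
# Confusion matrix copied from the evaluation report: (label, prediction) -> count.
confusion = {
    ("<=50K", "<=50K"): 6962,
    ("<=50K", ">50K"): 782,
    (">50K", "<=50K"): 450,
    (">50K", ">50K"): 1575,
}

total = sum(confusion.values())
correct = sum(count for (label, pred), count in confusion.items() if label == pred)
accuracy = correct / total

# Precision and recall, treating ">50K" as the positive class.
true_pos = confusion[(">50K", ">50K")]
false_pos = confusion[("<=50K", ">50K")]
false_neg = confusion[(">50K", "<=50K")]
precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)

print(f"examples: {total}")
print(f"accuracy: {accuracy:.4f}")
print(f"precision: {precision:.4f}, recall: {recall:.4f}")
```

The accuracy obtained this way, 8537 / 9769, agrees with the `test accuracy` value printed by the evaluation.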


## Analyze model

With `model.analyze(ds)`, we can understand how the model behaves. For example, [Partial Dependence Plots](https://christophm.github.io/interpretable-ml-book/pdp.html) (PDPs) show how the model's predictions react to changes in feature values.

In [20]:
model.analyze(test_ds, sampling=0.1)
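
To make the idea of a partial dependence plot concrete, here is a minimal sketch of the computation on a toy model. It is not how `model.analyze` is implemented. The function `partial_dependence`, the scoring rule in `toy_predict` and the three examples are all made up for illustration.

```python
def partial_dependence(predict, examples, feature, grid):
    """Average prediction when `feature` is forced to each value in `grid`."""
    curve = []
    for value in grid:
        # Overwrite the feature in every example and keep other features as-is.
        modified = [{**example, feature: value} for example in examples]
        predictions = [predict(example) for example in modified]
        curve.append(sum(predictions) / len(predictions))
    return curve


def toy_predict(example):
    """Hypothetical stand-in for a trained model, returning a score in [0, 1]."""
    bonus = 0.3 if example["education_num"] >= 13 else 0.0
    return min(1.0, 0.01 * example["hours_per_week"] + bonus)


examples = [
    {"hours_per_week": 20, "education_num": 10},
    {"hours_per_week": 40, "education_num": 13},
    {"hours_per_week": 50, "education_num": 9},
]
grid = [20, 40, 60]
pdp = partial_dependence(toy_predict, examples, "hours_per_week", grid)
for value, average in zip(grid, pdp):
    print(f"hours_per_week={value}: average prediction {average:.2f}")
```

Plotting the grid values against the averaged predictions gives the PDP curve for that feature. Here the curve rises with `hours_per_week`, because the toy model's score increases with it.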

## Benchmark model speed

In applications where model speed is critical, we can use `model.benchmark(ds)` to evaluate the speed of the model.


In [21]:
model.benchmark(test_ds)

Inference time per example and per cpu core: 0.891 us (microseconds)
Estimated over 345 runs over 3.004 seconds.
* Measured with the C++ serving API. Check model.to_cpp() for details.
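
The per-example figure can be turned into more familiar quantities with a little arithmetic. The sketch below uses the time reported above and the number of test examples implied by the confusion matrix in the evaluation section (9769). It assumes a single core and ignores any fixed overhead per call.

```python
time_per_example_us = 0.891  # Reported by model.benchmark above.
num_test_examples = 6962 + 782 + 450 + 1575  # Sum of the confusion matrix cells.

# Throughput on one core, in examples per second.
examples_per_second = 1_000_000 / time_per_example_us

# Estimated time to score the whole test set once, in milliseconds.
batch_time_ms = num_test_examples * time_per_example_us / 1000

print(f"throughput: {examples_per_second:,.0f} examples per second per core")
print(f"test set ({num_test_examples} examples): {batch_time_ms:.2f} ms")
```

At this speed, one core scores roughly 1.1 million examples per second, so the whole test set takes under 10 milliseconds.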

The benchmark measures the speed of the model when using the C++ API. The Python API will be slower due to the overhead of the Python interpreter. If you are not familiar with the C++ API, you can use the `model.to_cpp()` method to generate C++ code that you can run to evaluate the model's speed.

In [22]:
print(model.to_cpp())

// Automatically generated code running an Yggdrasil Decision Forests model in
// C++. This code was generated with "model.to_cpp()".
//
// Date of generation: 2023-12-19 15:29:09.343331
// YDF Version: 0.0.8
//
// How to use this code:
//
// 1. Copy this code in a new .h file.
// 2. If you use Bazel/Blaze, use the following dependencies:
//      //third_party/absl/status:statusor
//      //third_party/absl/strings
//      //external/ydf_cc/yggdrasil_decision_forests/api:serving
// 3. In your existing code, include the .h file and do:
//   // Load the model (to do only once).
//   namespace ydf = yggdrasil_decision_forests;
//   const auto model = ydf::exported_model_123::Load(<path to model>);
//   // Run the model
//   predictions = model.Predict();
// 4. By default, the "Predict" function takes no inputs and creates fake
//   examples. In practice, you want to add your input data as arguments to
//   "Predict" and call "examples->Set..." functions accordingly.
// 4. (Bonus)
//   All

## Save model

Finally, we save the model for later use.


In [23]:
model.save("/tmp/my_model")

The saved model can then be loaded with:

In [24]:
loaded_model = ydf.load_model("/tmp/my_model")

print(f"This is a {loaded_model.name()} model.")

This is a GRADIENT_BOOSTED_TREES model.


## Conclusion

That's it. You now know the basic capabilities of YDF 😊.

To learn more about YDF, check the other tutorials on [ydf.readthedocs.io](https://ydf.readthedocs.io/). For instance, learn how to:

- Train ranking, regression or uplifting models with the `task` argument.
- Measure distance and find the nearest neighbor between examples with `model.distance`.
- Enforce monotonic constraints on your features with the `features` argument.
- Run the models in a webpage in JavaScript with `model.to_javascript()`.
- Convert the model into a TensorFlow SavedModel and run it in TensorFlow Serving with `model.to_tensorflow_saved_model()`.
- Train a model on billions of training examples using distributed training.
