TreeEnsemble operator (#5874)
### Description
Edited:

This proposes a new operator `TreeEnsemble` that supersedes the
pre-existing `TreeEnsembleRegressor` and `TreeEnsembleClassifier`
operators.

It will require a bump to `ai.onnx.ml` opset 5. Further details can be
found in #5851.

A summary of the updates:
1. TreeEnsemble supports double outputs.
2. Adds a `'SET_MEMBER'` node mode to encode set membership.
3. Type errors are raised if split values do not have the same type as
the input and if the `nodes_*` attributes do not have the same length
(and likewise for `leaf_*`).
4. Integer input types are dropped.
- With the remaining attributes only being represented in floating
point, this can be replicated by simply using a Cast standard operator
before the tree regressor with no behaviour change.
5. `base_values` is dropped.
- This attribute simply specified an offset added after target values
are aggregated. This can be implemented by using the Add standard
operator.
6. The general encoding has been changed to reduce redundancy. Before,
all nodes contained fields like `truenodeids` which are only relevant
for interior nodes and not leaves. Since leaves will account for at
least roughly half the nodes in a binary decision tree, this is highly
wasteful. Therefore, this representation has fields for `nodes_*` for
interior nodes and `leaf_*` for leaf nodes.
- The relationship between leaf and target is now strictly such that a
leaf can have one target (and a target may continue to be contributed by
many leaves). This nuance is discussed
[here](#5851 (comment)).
7. Enumerations are held in integer attributes rather than strings
(`aggregate_function`, `post_transform`, `nodes_modes`).
8. The use of treeids and nodeids is dropped in favour of using the
index into the `nodes_*` and `leaf_*` attributes to define the tree
structure directly with no indirection. A `tree_roots` field has been
added to denote the roots of each decision tree in the ensemble.
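
The index-based encoding in points 6 and 8 can be illustrated with a minimal pure-Python sketch. It is not the ONNX reference implementation: `find_leaf`, `predict_row`, and the example `attrs` dictionary are hypothetical, only the `BRANCH_LEQ` mode and `SUM` aggregation are handled, and NaN inputs are ignored.

```python
BRANCH_LEQ = 0  # enum value documented for nodes_modes

# Hypothetical two-node tree: node 0 tests feature 0, node 1 tests feature 1.
attrs = {
    "tree_roots": [0],
    "nodes_featureids": [0, 1],
    "nodes_splits": [0.5, 2.0],
    "nodes_modes": [BRANCH_LEQ, BRANCH_LEQ],
    "nodes_truenodeids": [0, 1],   # node 0 -> leaf 0, node 1 -> leaf 1
    "nodes_trueleafs": [1, 1],
    "nodes_falsenodeids": [1, 2],  # node 0 -> node 1, node 1 -> leaf 2
    "nodes_falseleafs": [0, 1],
    "leaf_targetids": [0, 0, 0],
    "leaf_weights": [10.0, 20.0, 30.0],
    "n_targets": 1,
}

def find_leaf(x, root, a):
    """Walk from a root index in nodes_* to an index into leaf_*."""
    idx = root
    while True:
        if a["nodes_modes"][idx] != BRANCH_LEQ:
            raise NotImplementedError("sketch only handles BRANCH_LEQ")
        if x[a["nodes_featureids"][idx]] <= a["nodes_splits"][idx]:
            nxt, is_leaf = a["nodes_truenodeids"][idx], a["nodes_trueleafs"][idx]
        else:
            nxt, is_leaf = a["nodes_falsenodeids"][idx], a["nodes_falseleafs"][idx]
        if is_leaf:
            return nxt  # index into leaf_*
        idx = nxt       # index into nodes_*

def predict_row(x, a):
    """Sum the weight of the leaf reached in every tree into its target."""
    out = [0.0] * a["n_targets"]
    for root in a["tree_roots"]:
        leaf = find_leaf(x, root, a)
        out[a["leaf_targetids"][leaf]] += a["leaf_weights"][leaf]
    return out
```

No tree ids or node ids appear anywhere: each child reference is either a position in `nodes_*` or a position in `leaf_*`, and the matching `*leafs` flag says which.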

The `TreeEnsembleRegressor` can be implemented by directly using this
operator.
The `TreeEnsembleClassifier` can be implemented by using this operator
and then computing the top class for each input by applying an ArgMax
operation for each output before using `LabelEncoder/GatherND` to
produce the requisite label.
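
The classifier recipe above can be sketched in plain Python, assuming the per-class scores have already been produced by `TreeEnsemble`. The function `classify` and the label list are hypothetical stand-ins for the `ArgMax` and `LabelEncoder`/`GatherND` nodes, not the operators themselves.

```python
def classify(scores, class_labels):
    """For each row of scores, pick the index of the highest score
    (the ArgMax step) and map it to a label (the lookup step)."""
    labels = []
    for row in scores:
        # max() returns the first maximal index, so ties go to the lowest index.
        top = max(range(len(row)), key=row.__getitem__)
        labels.append(class_labels[top])
    return labels
```

For example, `classify([[0.1, 0.7, 0.2]], ["cat", "dog", "bird"])` selects index 1 and returns `["dog"]`.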

As per the reference implementation tests, this representation can
perform the same operations as before while adding new capability for
set membership.

### Motivation and Context
Addresses #5851.

Signed-off-by: Aditya Goel <agoel4512@gmail.com>
Signed-off-by: Aditya Goel <48102515+adityagoel4512@users.noreply.github.com>
adityagoel4512 committed Feb 2, 2024
1 parent d229258 commit 3cd21a9
Showing 25 changed files with 1,992 additions and 340 deletions.
2 changes: 1 addition & 1 deletion docs/AddNewOp.md
@@ -66,7 +66,7 @@ Once the criteria of proposing new operator/function has been satisfied, you wil
1. Write a detailed description about the operator, and its expected behavior. Pretty much, the description should be clear enough to avoid confusion between implementors.
2. Add an example in the description to illustrate the usage.
3. Add reference to the source of the operator in the corresponding framework in the description (if possible).
- 4. Write the mathematic formula or a pseudocode in the description. The core algorithm needs to be very clear.
+ 4. Write the mathematical formula or a pseudocode in the description. The core algorithm needs to be very clear.
2. Write a reference implementation in Python, this reference implementation should cover all the expected behavior of the operator. Only in extremely rare case, we will waive this requirement.
3. Operator version: check out our
[versioning doc](/docs/Versioning.md#operator-versioning)
121 changes: 121 additions & 0 deletions docs/Changelog-ml.md
@@ -1226,3 +1226,124 @@ This version of the operator has been available since version 4 of the 'ai.onnx.
<dd>Output type is determined by the specified 'values_*' attribute.</dd>
</dl>

## Version 5 of the 'ai.onnx.ml' operator set
### <a name="ai.onnx.ml.TreeEnsemble-5"></a>**ai.onnx.ml.TreeEnsemble-5**</a>

Tree Ensemble operator. Returns the regressed values for each input in a batch.
Inputs have dimensions `[N, F]` where `N` is the input batch size and `F` is the number of input features.
Outputs have dimensions `[N, num_targets]` where `N` is the batch size and `num_targets` is the number of targets, which is a configurable attribute.

The encoding of the attributes is split between the interior nodes and the leaves of the trees. Notably, attributes with the prefix `nodes_*` are associated with interior nodes, and attributes with the prefix `leaf_*` are associated with leaves.
The attributes `nodes_*` must all have the same length and encode a sequence of tuples, as defined by taking all the `nodes_*` fields at a given position.

All fields prefixed with `leaf_*` represent tree leaves, and similarly define tuples of leaves and must have identical length.

This operator can be used to implement both the previous `TreeEnsembleRegressor` and `TreeEnsembleClassifier` nodes.
The `TreeEnsembleRegressor` node maps directly to this node but requires changing how the nodes are represented.
The `TreeEnsembleClassifier` node can be implemented by adding an `ArgMax` node after this node to determine the top class.
To encode class labels, a `LabelEncoder` or `GatherND` operator may be used.

#### Version

This version of the operator has been available since version 5 of the 'ai.onnx.ml' operator set.

#### Attributes

<dl>
<dt><tt>aggregate_function</tt> : int (default is 1)</dt>
<dd>Defines how to aggregate leaf values within a target. <br>One of 'AVERAGE' (0), 'SUM' (1), 'MIN' (2) or 'MAX' (3), defaults to 'SUM' (1)</dd>
<dt><tt>leaf_targetids</tt> : list of ints (required)</dt>
<dd>The index of the target that this leaf contributes to (this must be in range `[0, n_targets)`).</dd>
<dt><tt>leaf_weights</tt> : tensor (required)</dt>
<dd>The weight for each leaf.</dd>
<dt><tt>membership_values</tt> : tensor</dt>
<dd>Members to test membership of for each set membership node. List all of the members to test against in the order that the 'BRANCH_MEMBER' mode appears in `nodes_modes`, delimited by `NaN`s. Will have the same number of sets of values as nodes with mode 'BRANCH_MEMBER'. This may be omitted if the tree ensemble doesn't contain any 'BRANCH_MEMBER' nodes.</dd>
<dt><tt>n_targets</tt> : int</dt>
<dd>The total number of targets.</dd>
<dt><tt>nodes_falseleafs</tt> : list of ints (required)</dt>
<dd>For each node, 1 if the false branch is a leaf and 0 if it is an interior node. To represent a tree that is a leaf (only has one node), one can do so by having a single `nodes_*` entry with true and false branches referencing the same `leaf_*` entry.</dd>
<dt><tt>nodes_falsenodeids</tt> : list of ints (required)</dt>
<dd>If `nodes_falseleafs` is false at an entry, this represents the position of the false branch node. This position can be used to index into a `nodes_*` entry. If `nodes_falseleafs` is true, it is an index into the `leaf_*` attributes.</dd>
<dt><tt>nodes_featureids</tt> : list of ints (required)</dt>
<dd>Feature id for each node.</dd>
<dt><tt>nodes_hitrates</tt> : tensor</dt>
<dd>Popularity of each node, used for performance and may be omitted.</dd>
<dt><tt>nodes_missing_value_tracks_true</tt> : list of ints</dt>
<dd>For each node, define whether to follow the true branch (if attribute value is 1) or false branch (if attribute value is 0) in the presence of a NaN input feature. This attribute may be left undefined and the default value is false (0) for all nodes.</dd>
<dt><tt>nodes_modes</tt> : tensor (required)</dt>
<dd>The comparison operation performed by the node. This is encoded as an enumeration of 0 ('BRANCH_LEQ'), 1 ('BRANCH_LT'), 2 ('BRANCH_GTE'), 3 ('BRANCH_GT'), 4 ('BRANCH_EQ'), 5 ('BRANCH_NEQ'), and 6 ('BRANCH_MEMBER'). Note this is a tensor of type uint8.</dd>
<dt><tt>nodes_splits</tt> : tensor (required)</dt>
<dd>Thresholds to do the splitting on for each node with mode that is not 'BRANCH_MEMBER'.</dd>
<dt><tt>nodes_trueleafs</tt> : list of ints (required)</dt>
<dd>For each node, 1 if the true branch is a leaf and 0 if it is an interior node. To represent a tree that is a leaf (only has one node), one can do so by having a single `nodes_*` entry with true and false branches referencing the same `leaf_*` entry.</dd>
<dt><tt>nodes_truenodeids</tt> : list of ints (required)</dt>
<dd>If `nodes_trueleafs` is false at an entry, this represents the position of the true branch node. This position can be used to index into a `nodes_*` entry. If `nodes_trueleafs` is true, it is an index into the `leaf_*` attributes.</dd>
<dt><tt>post_transform</tt> : int (default is 0)</dt>
<dd>Indicates the transform to apply to the score. <br>One of 'NONE' (0), 'SOFTMAX' (1), 'LOGISTIC' (2), 'SOFTMAX_ZERO' (3) or 'PROBIT' (4), defaults to 'NONE' (0)</dd>
<dt><tt>tree_roots</tt> : list of ints (required)</dt>
<dd>Index into `nodes_*` for the root of each tree. The tree structure is derived from the branching of each node.</dd>
</dl>
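
As an illustration of the `membership_values` layout described above, the following sketch splits a flat, NaN-delimited list into one set per 'BRANCH_MEMBER' node. The helper name `split_membership_values` is hypothetical, and accepting an optional trailing `NaN` is an assumption of this sketch rather than something the attribute description states.

```python
import math

def split_membership_values(values):
    """Split a flat NaN-delimited list into one frozenset per member node.

    The i-th returned set belongs to the i-th node whose mode is
    'BRANCH_MEMBER', in the order those nodes appear in nodes_modes.
    """
    sets, current = [], []
    for v in values:
        if math.isnan(v):
            # A NaN closes the set currently being collected.
            sets.append(frozenset(current))
            current = []
        else:
            current.append(v)
    if current:
        # Assumption: tolerate input that has no trailing NaN delimiter.
        sets.append(frozenset(current))
    return sets
```

A node with mode 'BRANCH_MEMBER' would then take its true branch when the tested feature value is in the corresponding set.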

#### Inputs

<dl>
<dt><tt>X</tt> : T</dt>
<dd>Input of shape [Batch Size, Number of Features]</dd>
</dl>

#### Outputs

<dl>
<dt><tt>Y</tt> : T</dt>
<dd>Output of shape [Batch Size, Number of targets]</dd>
</dl>

#### Type Constraints

<dl>
<dt><tt>T</tt> : tensor(float), tensor(double), tensor(float16)</dt>
<dd>The input type must be a tensor of a numeric type.</dd>
</dl>

### <a name="ai.onnx.ml.TreeEnsembleClassifier-5"></a>**ai.onnx.ml.TreeEnsembleClassifier-5** (deprecated)</a>

This operator is DEPRECATED. Please use TreeEnsemble, which provides similar functionality.
In order to determine the top class, the ArgMax node can be applied to the output of TreeEnsemble.
To encode class labels, use a LabelEncoder operator.
Tree Ensemble classifier. Returns the top class for each of N inputs.<br>
The attributes named 'nodes_X' form a sequence of tuples, associated by
index into the sequences, which must all be of equal length. These tuples
define the nodes.<br>
Similarly, all fields prefixed with 'class_' are tuples of votes at the leaves.
A leaf may have multiple votes, where each vote is weighted by
the associated class_weights index.<br>
One and only one of classlabels_strings or classlabels_int64s
will be defined. The class_ids are indices into this list.
All fields ending with <i>_as_tensor</i> can be used instead of the
same parameter without the suffix if the element type is double and not float.

#### Version

This version of the operator has been deprecated since version 5 of the 'ai.onnx.ml' operator set.

### <a name="ai.onnx.ml.TreeEnsembleRegressor-5"></a>**ai.onnx.ml.TreeEnsembleRegressor-5** (deprecated)</a>

This operator is DEPRECATED. Please use TreeEnsemble instead which provides the same
functionality.<br>
Tree Ensemble regressor. Returns the regressed values for each input in N.<br>
All args with nodes_ are fields of a tuple of tree nodes, and
it is assumed they are the same length, and an index i will decode the
tuple across these inputs. Each node id can appear only once
for each tree id.<br>
All fields prefixed with target_ are tuples of votes at the leaves.<br>
A leaf may have multiple votes, where each vote is weighted by
the associated target_weights index.<br>
All fields ending with <i>_as_tensor</i> can be used instead of the
same parameter without the suffix if the element type is double and not float.
All trees must have their node ids start at 0 and increment by 1.<br>
Mode enum is BRANCH_LEQ, BRANCH_LT, BRANCH_GTE, BRANCH_GT, BRANCH_EQ, BRANCH_NEQ, LEAF

#### Version

This version of the operator has been deprecated since version 5 of the 'ai.onnx.ml' operator set.
