TreeEnsemble operator (#5874)

### Description

Edited: This proposes a new operator `TreeEnsemble` that supersedes the pre-existing `TreeEnsembleRegressor` and `TreeEnsembleClassifier` operators. It will require a bump to `ai.onnx.ml` opset 5. Further details can be found in #5851.

A summary of the updates:

1. TreeEnsemble supports double outputs.
2. Adds a `'SET_MEMBER'` node mode to encode set membership.
3. Type errors are raised if split values do not have the same type as the input and if the `nodes_*` attributes do not have the same length (and likewise for `leaf_*`).
4. Integer input types are dropped.
   - With the remaining attributes only being represented in floating point, this can be replicated by simply using a Cast standard operator before the tree regressor with no behaviour change.
5. `base_values` is dropped.
   - This attribute simply specified an offset added after target values are aggregated. This can be implemented by using the Add standard operator.
6. The general encoding has been changed to reduce redundancy. Before, all nodes contained fields like `truenodeids` which are only relevant for interior nodes and not leaves. Since leaves will account for at least roughly half the nodes in a binary decision tree, this is highly wasteful. Therefore, this representation has fields for `nodes_*` for interior nodes and `leaf_*` for leaf nodes.
   - The relationship between leaf and target is now strictly such that a leaf can have one target (and a target may continue to be contributed by many leaves). This nuance is discussed [here](#5851 (comment)).
7. Enumerations are held in integer attributes rather than strings (`aggregate_function`, `post_transform`, `nodes_modes`).
8. The use of treeids and nodeids is dropped in favour of using the index into the `nodes_*` and `leaf_*` attributes to define the tree structure directly with no indirection. A `tree_roots` field has been added to denote the roots of each decision tree in the ensemble.
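
To make the index-based encoding in points 6 and 8 concrete, here is a minimal pure-Python sketch of how such an ensemble could be walked. It is not the reference implementation from this PR. The attribute values and the helper names `find_leaf` and `predict` are hypothetical, chosen only for illustration. It assumes every node uses `BRANCH_LEQ`, a `SUM` aggregate, no post transform, and no NaN handling.

```python
# Hypothetical single-tree ensemble (values invented for illustration):
#   node 0: x[0] <= 0.5 ? leaf 0 : node 1
#   node 1: x[1] <= 2.0 ? leaf 1 : leaf 2
nodes_featureids = [0, 1]
nodes_splits = [0.5, 2.0]
nodes_truenodeids = [0, 1]   # index into leaf_* when nodes_trueleafs is 1
nodes_trueleafs = [1, 1]
nodes_falsenodeids = [1, 2]  # index into nodes_* when nodes_falseleafs is 0
nodes_falseleafs = [0, 1]
leaf_targetids = [0, 0, 0]
leaf_weights = [10.0, 20.0, 30.0]
tree_roots = [0]
n_targets = 1


def find_leaf(x, root):
    """Follow branches from `root` until a leaf index is reached."""
    idx, is_leaf = root, False
    while not is_leaf:
        # Assumption: every node is BRANCH_LEQ; other modes are omitted here.
        if x[nodes_featureids[idx]] <= nodes_splits[idx]:
            idx, is_leaf = nodes_truenodeids[idx], bool(nodes_trueleafs[idx])
        else:
            idx, is_leaf = nodes_falsenodeids[idx], bool(nodes_falseleafs[idx])
    return idx


def predict(x):
    """SUM the weight of the leaf reached in each tree into its target."""
    out = [0.0] * n_targets
    for root in tree_roots:
        leaf = find_leaf(x, root)
        out[leaf_targetids[leaf]] += leaf_weights[leaf]
    return out


print(predict([0.0, 0.0]))
print(predict([1.0, 3.0]))
```

Because the `*leafs` flags say whether an id refers to `nodes_*` or `leaf_*`, no separate tree or node id lookup is needed during traversal.
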
The `TreeEnsembleRegressor` can be implemented by directly using this operator. The `TreeEnsembleClassifier` can be implemented by using this operator and then computing the top class for each input by applying an ArgMax operation to each output before using `LabelEncoder/GatherND` to produce the requisite label. As per the reference implementation tests, this representation can continue to perform the same operations as before while adding new capability for set membership.

### Motivation and Context

Addresses #5851.

Signed-off-by: Aditya Goel <agoel4512@gmail.com>
Signed-off-by: Aditya Goel <48102515+adityagoel4512@users.noreply.github.com>
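
The classifier recipe in the description (take the top-scoring target, then look up its label) can be sketched in plain Python as follows. This only mimics the arithmetic of an ArgMax followed by a label lookup. It does not build an ONNX graph, and the scores, labels, and the `top_labels` helper are hypothetical.

```python
def top_labels(scores, class_labels):
    """For each row of per-target scores, return the label of the largest score."""
    result = []
    for row in scores:
        # ArgMax step: index of the highest score in this row.
        top = max(range(len(row)), key=row.__getitem__)
        # Label lookup step, standing in for LabelEncoder/GatherND.
        result.append(class_labels[top])
    return result


# Hypothetical TreeEnsemble output of shape [N=2, n_targets=3].
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
class_labels = ["cat", "dog", "bird"]
predicted = top_labels(scores, class_labels)
print(predicted)
```
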
onnx · Feb 2, 2024 · 3cd21a9
1 parent d229258
commit 3cd21a9
Showing 25 changed files with 1,992 additions and 340 deletions.
diff --git a/docs/AddNewOp.md b/docs/AddNewOp.md
@@ -66,7 +66,7 @@ Once the criteria of proposing new operator/function has been satisfied, you wil
     1. Write a detailed description about the operator, and its expected behavior. Pretty much, the description should be clear enough to avoid confusion between implementors.
     2. Add an example in the description to illustrate the usage.
     3. Add reference to the source of the operator in the corresponding framework in the description (if possible).
-    4. Write the mathematic formula or a pseudocode in the description. The core algorithm needs to be very clear.
+    4. Write the mathematical formula or a pseudocode in the description. The core algorithm needs to be very clear.
 2. Write a reference implementation in Python, this reference implementation should cover all the expected behavior of the operator. Only in extremely rare case, we will waive this requirement.
 3. Operator version: check out our
 [versioning doc](/docs/Versioning.md#operator-versioning)

diff --git a/docs/Changelog-ml.md b/docs/Changelog-ml.md
@@ -1226,3 +1226,124 @@ This version of the operator has been available since version 4 of the 'ai.onnx.
 <dd>Output type is determined by the specified 'values_*' attribute.</dd>
 </dl>
 
+## Version 5 of the 'ai.onnx.ml' operator set
+### <a name="ai.onnx.ml.TreeEnsemble-5"></a>**ai.onnx.ml.TreeEnsemble-5**</a>
+
+  Tree Ensemble operator.  Returns the regressed values for each input in a batch.
+      Inputs have dimensions `[N, F]` where `N` is the input batch size and `F` is the number of input features.
+      Outputs have dimensions `[N, num_targets]` where `N` is the batch size and `num_targets` is the number of targets, which is a configurable attribute.
+
+      The encoding of this attribute is split along interior nodes and the leaves of the trees. Notably, attributes with the prefix `nodes_*` are associated with interior nodes, and attributes with the prefix `leaf_*` are associated with leaves.
+      The attributes `nodes_*` must all have the same length and encode a sequence of tuples, as defined by taking all the `nodes_*` fields at a given position.
+
+      All fields prefixed with `leaf_*` represent tree leaves, and similarly define tuples of leaves and must have identical length.
+
+      This operator can be used to implement both the previous `TreeEnsembleRegressor` and `TreeEnsembleClassifier` nodes.
+      The `TreeEnsembleRegressor` node maps directly to this node and requires changing how the nodes are represented.
+      The `TreeEnsembleClassifier` node can be implemented by adding a `ArgMax` node after this node to determine the top class.
+      To encode class labels, a `LabelEncoder` or `GatherND` operator may be used.
+
+#### Version
+
+This version of the operator has been available since version 5 of the 'ai.onnx.ml' operator set.
+
+#### Attributes
+
+<dl>
+<dt><tt>aggregate_function</tt> : int (default is 1)</dt>
+<dd>Defines how to aggregate leaf values within a target. <br>One of 'AVERAGE' (0), 'SUM' (1), 'MIN' (2) or 'MAX' (3), defaults to 'SUM' (1)</dd>
+<dt><tt>leaf_targetids</tt> : list of ints (required)</dt>
+<dd>The index of the target that this leaf contributes to (this must be in range `[0, n_targets)`).</dd>
+<dt><tt>leaf_weights</tt> : tensor (required)</dt>
+<dd>The weight for each leaf.</dd>
+<dt><tt>membership_values</tt> : tensor</dt>
+<dd>Members to test membership of for each set membership node. List all of the members to test against in the order that the 'BRANCH_MEMBER' mode appears in `nodes_modes`, delimited by `NaN`s. Will have the same number of sets of values as nodes with mode 'BRANCH_MEMBER'. This may be omitted if the ensemble doesn't contain any 'BRANCH_MEMBER' nodes.</dd>
+<dt><tt>n_targets</tt> : int</dt>
+<dd>The total number of targets.</dd>
+<dt><tt>nodes_falseleafs</tt> : list of ints (required)</dt>
+<dd>1 if false branch is leaf for each node and 0 if an interior node. To represent a tree that is a leaf (only has one node), one can do so by having a single `nodes_*` entry with true and false branches referencing the same `leaf_*` entry</dd>
+<dt><tt>nodes_falsenodeids</tt> : list of ints (required)</dt>
+<dd>If `nodes_falseleafs` is false at an entry, this represents the position of the false branch node. This position can be used to index into a `nodes_*` entry. If `nodes_falseleafs` is true, it is an index into the leaf_* attributes.</dd>
+<dt><tt>nodes_featureids</tt> : list of ints (required)</dt>
+<dd>Feature id for each node.</dd>
+<dt><tt>nodes_hitrates</tt> : tensor</dt>
+<dd>Popularity of each node, used for performance and may be omitted.</dd>
+<dt><tt>nodes_missing_value_tracks_true</tt> : list of ints</dt>
+<dd>For each node, define whether to follow the true branch (if attribute value is 1) or false branch (if attribute value is 0) in the presence of a NaN input feature. This attribute may be left undefined and the default value is false (0) for all nodes.</dd>
+<dt><tt>nodes_modes</tt> : tensor (required)</dt>
+<dd>The comparison operation performed by the node. This is encoded as an enumeration of 0 ('BRANCH_LEQ'), 1 ('BRANCH_LT'), 2 ('BRANCH_GTE'), 3 ('BRANCH_GT'), 4 ('BRANCH_EQ'), 5 ('BRANCH_NEQ'), and 6 ('BRANCH_MEMBER'). Note this is a tensor of type uint8.</dd>
+<dt><tt>nodes_splits</tt> : tensor (required)</dt>
+<dd>Thresholds to do the splitting on for each node with mode that is not 'BRANCH_MEMBER'.</dd>
+<dt><tt>nodes_trueleafs</tt> : list of ints (required)</dt>
+<dd>1 if true branch is leaf for each node and 0 an interior node. To represent a tree that is a leaf (only has one node), one can do so by having a single `nodes_*` entry with true and false branches referencing the same `leaf_*` entry</dd>
+<dt><tt>nodes_truenodeids</tt> : list of ints (required)</dt>
+<dd>If `nodes_trueleafs` is false at an entry, this represents the position of the true branch node. This position can be used to index into a `nodes_*` entry. If `nodes_trueleafs` is true, it is an index into the leaf_* attributes.</dd>
+<dt><tt>post_transform</tt> : int (default is 0)</dt>
+<dd>Indicates the transform to apply to the score. <br>One of 'NONE' (0), 'SOFTMAX' (1), 'LOGISTIC' (2), 'SOFTMAX_ZERO' (3) or 'PROBIT' (4), defaults to 'NONE' (0)</dd>
+<dt><tt>tree_roots</tt> : list of ints (required)</dt>
+<dd>Index into `nodes_*` for the root of each tree. The tree structure is derived from the branching of each node.</dd>
+</dl>
+
+#### Inputs
+
+<dl>
+<dt><tt>X</tt> : T</dt>
+<dd>Input of shape [Batch Size, Number of Features]</dd>
+</dl>
+
+#### Outputs
+
+<dl>
+<dt><tt>Y</tt> : T</dt>
+<dd>Output of shape [Batch Size, Number of targets]</dd>
+</dl>
+
+#### Type Constraints
+
+<dl>
+<dt><tt>T</tt> : tensor(float), tensor(double), tensor(float16)</dt>
+<dd>The input type must be a tensor of a numeric type.</dd>
+</dl>
+
+### <a name="ai.onnx.ml.TreeEnsembleClassifier-5"></a>**ai.onnx.ml.TreeEnsembleClassifier-5** (deprecated)</a>
+
+  This operator is DEPRECATED. Please use TreeEnsemble which provides similar functionality.
+      In order to determine the top class, the ArgMax node can be applied to the output of TreeEnsemble.
+      To encode class labels, use a LabelEncoder operator.
+      Tree Ensemble classifier. Returns the top class for each of N inputs.<br>
+      The attributes named 'nodes_X' form a sequence of tuples, associated by
+      index into the sequences, which must all be of equal length. These tuples
+      define the nodes.<br>
+      Similarly, all fields prefixed with 'class_' are tuples of votes at the leaves.
+      A leaf may have multiple votes, where each vote is weighted by
+      the associated class_weights index.<br>
+      One and only one of classlabels_strings or classlabels_int64s
+      will be defined. The class_ids are indices into this list.
+      All fields ending with <i>_as_tensor</i> can be used instead of the
+      same parameter without the suffix if the element type is double and not float.
+
+#### Version
+
+This version of the operator has been deprecated since version 5 of the 'ai.onnx.ml' operator set.
+
+### <a name="ai.onnx.ml.TreeEnsembleRegressor-5"></a>**ai.onnx.ml.TreeEnsembleRegressor-5** (deprecated)</a>
+
+  This operator is DEPRECATED. Please use TreeEnsemble instead which provides the same
+      functionality.<br>
+      Tree Ensemble regressor.  Returns the regressed values for each input in N.<br>
+      All args with nodes_ are fields of a tuple of tree nodes, and
+      it is assumed they are the same length, and an index i will decode the
+      tuple across these inputs.  Each node id can appear only once
+      for each tree id.<br>
+      All fields prefixed with target_ are tuples of votes at the leaves.<br>
+      A leaf may have multiple votes, where each vote is weighted by
+      the associated target_weights index.<br>
+      All fields ending with <i>_as_tensor</i> can be used instead of the
+      same parameter without the suffix if the element type is double and not float.
+      All trees must have their node ids start at 0 and increment by 1.<br>
+      Mode enum is BRANCH_LEQ, BRANCH_LT, BRANCH_GTE, BRANCH_GT, BRANCH_EQ, BRANCH_NEQ, LEAF
+
+#### Version
+
+This version of the operator has been deprecated since version 5 of the 'ai.onnx.ml' operator set.
+