
[Feature request] New encoding for TreeEnsemble* operators #5851

Closed
adityagoel4512 opened this issue Jan 10, 2024 · 15 comments
Labels
enhancement Request for new feature or operator


@adityagoel4512
Contributor

adityagoel4512 commented Jan 10, 2024

System information

ONNX 1.15

What is the problem that this feature solves?

Limitations and inefficiencies in decision tree encoding.

The proposal would enable:

  1. encoding SET_MEMBERSHIP as an operator type
  2. reducing some redundancy in the specification of the operators
  3. adding support for a double-precision output type.

Alternatives considered

No response

Describe the feature

Currently, the TreeEnsemble* operators do not have the ability to represent set membership as a concept directly.

Upstream libraries such as LightGBM frequently produce set membership operators, especially in the context of categorical variables. Currently, converter frameworks represent this by chaining equalities and connecting the true output edge of each to the same node (i.e. we no longer have a tree). We should capture this explicitly in the standard rather than relying on runtimes and converter frameworks to reach consensus outside the scope of the standard. One approach might be to add a SET_MEMBERSHIP node mode and encode the possible members in a new attribute.
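To make the difference concrete, here is a small pure-Python sketch contrasting today's chained-equality emulation with a dedicated set-membership node. The function and leaf names are hypothetical illustrations, not part of any spec:

```python
def eval_chained_eq(x, members):
    # Today's workaround: a chain of BRANCH_EQ nodes whose "true"
    # edges all point at the same leaf, so membership is tested
    # one equality at a time.
    for m in members:
        if x == m:
            return "in_set_leaf"
    return "not_in_set_leaf"

def eval_set_membership(x, members):
    # Proposed: a single SET_MEMBERSHIP node whose members are
    # stored in one dedicated attribute.
    return "in_set_leaf" if x in set(members) else "not_in_set_leaf"

members = [2.0, 5.0, 7.0]
for x in [2.0, 3.0, 5.0, 7.0, 9.0]:
    assert eval_chained_eq(x, members) == eval_set_membership(x, members)
```

Both forms reach the same leaf; the single-node form just needs one attribute lookup instead of a chain of interior nodes.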

Secondly, the encoding of the remaining features contains some redundancies and could do with redesign.

  • node_hitrates and node_hitrates_as_tensor can be dropped. They are not used in any converter frameworks or runtimes, and the existing description is ambiguous about exactly how they influence performance, which makes the attributes impossible to use in practice.
  • nodes_missing_value_tracks_true is redundant. You can arrange true/false branches accordingly (NaN compares false for all node_modes except 'BRANCH_NEQ', where it compares true). It ends up in the hot path when making predictions in onnxruntime (see https://github.com/microsoft/onnxruntime/blob/cf78d01546ca059a2ab487e01626e38029a3e8fd/onnxruntime/core/providers/cpu/ml/tree_ensemble_common.h#L738-L761) unnecessarily, so removing this attribute could simplify inference code and improve performance.
  • Even if the former is not implemented, the number of node_modes options can be reduced to a minimal orthogonal set: 'BRANCH_LEQ', 'BRANCH_LT', 'BRANCH_EQ', and 'LEAF' are all that is necessary. This would make implementing the operator more efficient without losing any expressivity.
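The last bullet rests on the fact that every other comparison mode can be rewritten as one of the minimal modes with the true/false children swapped. A quick sketch (the helper names here are illustrative only, not spec attributes):

```python
def branch(mode, x, t):
    # The minimal orthogonal split modes from the bullet above.
    if mode == "BRANCH_LEQ":
        return x <= t
    if mode == "BRANCH_LT":
        return x < t
    if mode == "BRANCH_EQ":
        return x == t
    raise ValueError(mode)

# Map each non-minimal mode to (minimal_mode, swap_children).
REWRITE = {
    "BRANCH_GT":  ("BRANCH_LEQ", True),  # x > t   ==  not (x <= t)
    "BRANCH_GTE": ("BRANCH_LT",  True),  # x >= t  ==  not (x < t)
    "BRANCH_NEQ": ("BRANCH_EQ",  True),  # x != t  ==  not (x == t)
}

checks = {
    "BRANCH_GT":  lambda x, t: x > t,
    "BRANCH_GTE": lambda x, t: x >= t,
    "BRANCH_NEQ": lambda x, t: x != t,
}

for mode, op in checks.items():
    minimal, swap = REWRITE[mode]
    for x, t in [(1.0, 2.0), (2.0, 2.0), (3.0, 2.0)]:
        taken = branch(minimal, x, t)
        assert (not taken if swap else taken) == op(x, t)
```

One caveat, which is where nodes_missing_value_tracks_true interacts with this rewrite: for a NaN input the swap flips which branch a missing value takes, since every comparison with NaN is false before the negation and true after it.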

Finally, the operators currently only support float output. This frequently leads to discrepancies with libraries like LightGBM and XGBoost that are unacceptable in many deployment contexts. It looks like adding double support was proposed in a 2023 roadmap meeting but never introduced: https://github.com/onnx/steering-committee/blob/main/roadmap/2023-docs/01-tree-ensembles.pdf. Adding it now would eliminate these discrepancies.

Will this influence the current api (Y/N)?

Yes, it would require a new ai.onnx.ml opset.

Feature Area

Operator definitions.

Are you willing to contribute it (Y/N)

Yes

Notes

I'm happy to contribute this and see it through but would like to gather any opinions first.

adityagoel4512 added the enhancement (Request for new feature or operator) label Jan 10, 2024
@cbourjau
Contributor

cbourjau commented Jan 10, 2024

Thanks for tackling this issue! Here are a few thoughts:

  • nodes_missing_value_tracks_true: I could imagine that there are some optimizations that benefit from this. For example, if a tree is exclusively using a single split type then that information is no longer needed in each node. This allows for smaller Node structures which may improve caching in a much more significant way than the additional cost of nodes_missing_value_tracks_true. Arguably, this would be a task for an optimizer step rather than the standard, though.
  • Hitrates: While difficult to use, I think there is merit in keeping them as an optional attribute for the future. I don't think there is any value in having them as a tensor though. I always understood the hitrate as "higher is more likely", but never bothered to look too deeply into them.
  • The base_values, base_values_as_tensor, post_transform are unnecessary. They can be expressed using subsequent standard operators.
  • Input types should match the split value types defined as a tensor in nodes_values_as_tensor
  • ProtoBuf has no compression. It is rather wasteful to store the split types as strings; we could just as well store them in a u8 tensor. The "nodes_<...>" table contains entries for every leaf, making it much larger than necessary. Instead of having a dedicated row for each LEAF, we may consider adding two columns indicating whether the child indices point to leaves or nodes. This would make the nodes_treeids attribute unnecessary. Edit: Apologies! I believe the nodes_treeids are redundant, but that does not follow from this argument.
  • I'm having some trouble understanding the "A leaf may have multiple votes, where each vote is weighted by the associated target_weights index" part in the specs. How can this feature be used in a way that requires those votes to be aggregated at runtime rather than build time? Depending on the answer, this attribute may be removed.
  • The target_nodeids appears to be redundant.

@adityagoel4512
Contributor Author

I could imagine that there are some optimizations that benefit from this. For example, if a tree is exclusively using a single split type then that information is no longer needed in each node. This allows for smaller Node structures which may improve caching in a much more significant way than the additional cost of nodes_missing_value_tracks_true.

Do you mean the node type by "that information"? While implementations may vary, it's worth pointing out that the missing value true and split type are packed into a single byte in ORT (see https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/cpu/ml/tree_ensemble_aggregator.h#L99-L103). I'm not sure I understand this point in light of this.

@xadupre
Contributor

xadupre commented Jan 11, 2024

Instead of removing the attribute nodes_missing_value_tracks_true, we could make it optional and define its value when missing. The runtime implementation can make it disappear by changing the node ids; onnxruntime already reorders the nodes so that the left child is the next id, to be faster. So instead of changing the spec, we could write an onnx function that optimizes the tree to remove any unnecessary attribute. We would still need to update the specifications to support double input (maybe float16 as well?), encode node_modes with integers, and support a new mode "value IN set of values".

@cbourjau
Contributor

Do you mean the node type by "that information"? While implementations may vary, it's worth pointing out that the missing value true and split type are packed into a single byte in ORT (see https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/cpu/ml/tree_ensemble_aggregator.h#L99-L103). I'm not sure I understand this point in light of this.

I see, thanks for pointing that out! The split function is then later inlined in a switch on that mode flag. I suppose there would not be a noticeable difference by only having a single split type in that case.

@cbourjau
Contributor

We would still need to update the specifications to support double input (maybe float16 as well?)

Note that double input is already supported today, but the output is always a 32-bit float. I don't think there is much merit in offering float16 as input and split values. Upcasting the input to float32 is not free, but the outputs would remain unchanged. If a use case for trees with 16-bit inputs arises, it will be simple to add in a forward-compatible way.

@xadupre
Contributor

xadupre commented Jan 15, 2024

float16 as split values may be useful to reduce the memory footprint, but I agree; I have not seen any feature request for that type.

@adityagoel4512
Contributor Author

Instead of removing the attribute nodes_missing_value_tracks_true, we could make it optional and define its value when missing. The runtime implementation can make it disappear by changing the node ids.

At the risk of coupling the operator definition to onnxruntime details again: I actually tried this out a while ago, but the reordering interferes with the fast path where all conditions in a tree are the same. I suppose we can leave the attribute in, since it helps with clarity for converter library writers.

@cbourjau
Contributor

Instead of removing the attribute nodes_missing_value_tracks_true, we could make it optional and define its value when missing. The runtime implementation can make it disappear by changing the node ids.

At the risk of coupling the operator definition to onnxruntime details again: I actually tried this out a while ago, but the reordering interferes with the fast path where all conditions in a tree are the same. I suppose we can leave the attribute in, since it helps with clarity for converter library writers.

Maybe we are talking past each other and are both in favor of keeping nodes_missing_value_tracks_true as an (optional?) attribute?

My take is that nodes_missing_value_tracks_true has a pretty trivial default: if it is not defined, every node uses "missing tracks false", because every comparison with NaN is false. It seems rather trivial for runtimes to implement this, unless I'm missing something?
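The default described above follows directly from IEEE-754 comparison semantics, which a few lines of Python illustrate:

```python
import math

nan = math.nan

# Every ordered comparison and equality with NaN is false, so a
# missing feature value naturally falls through to the "false"
# branch of a BRANCH_LEQ / BRANCH_LT / BRANCH_EQ split:
assert not (nan <= 1.0)
assert not (nan < 1.0)
assert not (nan == 1.0)

# Inequality is the one exception: it compares true, which is why
# BRANCH_NEQ was singled out earlier in this thread.
assert nan != 1.0
```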

@adityagoel4512
Contributor Author

adityagoel4512 commented Jan 19, 2024

Edited response:

  • I'm having some trouble understanding the "A leaf may have multiple votes, where each vote is weighted by the associated target_weights index" part in the specs. How can this feature be used in a way that requires those votes to be aggregated at runtime rather than build time? Depending on the answer, this attribute may be removed.

My understanding (@xadupre please correct me if I'm wrong) is that a leaf can correspond to more than one target; you could share the prediction at a leaf across multiple targets. Since aggregation is done within the target and not within the leaf, you couldn't aggregate these targets at build time. That said, I don't know how (or if) this is used in practice; if we denormalised this, the representation of the tree could be much simpler (each leaf corresponds to exactly one target in [0, n_targets)).

@xadupre
Contributor

xadupre commented Jan 22, 2024

You are right. The format allows multiple targets per node, but the implementation does not support that, and I never had to change that assumption. scikit-learn, lightgbm and xgboost are the same: classification and regression usually have one dimension (real-valued regression or binary classification). The implementation is then extended to support more dimensions or more classes by multiplying the trees and keeping one dimension per target.

@adityagoel4512
Contributor Author

adityagoel4512 commented Jan 22, 2024

I think, given we will be making many changes, we should aim to be precise about what is supported (it also simplifies the encoding).

If we associate each leaf with a single target, we can continue to support the many-to-one relation from leaf nodes to targets (a target may be composed from many leaves). Allowing many targets to be composed from one leaf complicates the encoding with no clear use case; if a user wants leaves to map to many targets in the same way, this can be done with other standard operators (map each leaf to a different target, then take the row of leaf outputs and aggregate them as desired).

@adityagoel4512
Contributor Author

adityagoel4512 commented Jan 25, 2024

@xadupre, is there any need to retain and upgrade the TreeEnsembleClassifier? I believe all of the functionality is implementable directly through the TreeEnsembleRegressor (combined with other standard operators).

  • The TreeEnsembleClassifier returns the top class for each batch entry and the score for each class. The top class can clearly be determined directly by applying an ArgMax to each batch entry of the scores output.
  • Classification is not really distinct at prediction time. This is reflected in the fact that the scores for both the regressor and the classifier are computed using shared logic in both the reference implementation and onnxruntime.
  • classlabels_strings and classlabels_int64s define the possible ways to encode the output labels. Again, the label can be recovered from the top class index using a label encoder or a gather/indexing operation.

Is there something I'm missing here? If not, I think a reasonable argument could be made to write the classifier as a function op instead in this update with a view to deprecating the "regressor" and "classifier" long term and just having a "TreeEnsemble" operator.
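The decomposition sketched above can be written down in a few lines of numpy. The scores and labels here are made-up illustration data, and `class_labels` stands in for whatever a LabelEncoder or Gather would hold:

```python
import numpy as np

# Scores as a regressor-style output: one row per batch entry,
# one column per class (values are illustrative).
scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2]])
class_labels = np.array(["cat", "dog", "bird"])  # hypothetical labels

top = np.argmax(scores, axis=1)   # the ArgMax step
labels = class_labels[top]        # the LabelEncoder / Gather step

assert top.tolist() == [1, 0]
assert labels.tolist() == ["dog", "cat"]
```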

@cbourjau
Contributor

If possible, I think it would be better to only have a single TreeEnsemble node. This would also give us the maximal freedom for how to deprecate TreeEnsembleRegressor and TreeEnsembleClassifier.

@xadupre
Contributor

xadupre commented Jan 27, 2024

That makes sense. I would also remove the attribute classlabels_strings: it can be replaced by a LabelEncoder, and the model would be smaller anyway. Maybe we then mark TreeEnsembleRegressor and TreeEnsembleClassifier as deprecated and introduce a new TreeEnsemble operator. That would make it easier for converters, as they could keep the existing code without breaking changes and have more time to switch to the new operator. Once that is done, we could also do the same with LinearRegressor and LinearClassifier, SVM...

github-merge-queue bot pushed a commit that referenced this issue Feb 2, 2024
### Description
Edited:

This proposes a new operator `TreeEnsemble` that supersedes the
pre-existing `TreeEnsembleRegressor` and `TreeEnsembleClassifier`
operators.

It will require a bump to `ai.onnx.ml` opset 5. Further details can be
found in #5851.

A summary of the updates:
1. TreeEnsemble supports double outputs.
2. Adds a `'SET_MEMBER'` node mode to encode set membership.
3. Type errors are raised if split values do not have the same type as
the input and if the `nodes_*` attributes do not have the same length
(and likewise for `leaf_*`).
4. Integer input types are dropped.
- With the remaining attributes only being represented in floating
point, this can be replicated by simply using a Cast standard operator
before the tree regressor with no behaviour change.
5. `base_values` is dropped.
- This attribute simply specified an offset added after target values
are aggregated. This can be implemented by using the Add standard
operator.
6. The general encoding has been changed to reduce redundancy. Before,
all nodes contained fields like `truenodeids` which are only relevant
for interior nodes and not leaves. Since leaves will account for at
least roughly half the nodes in a binary decision tree, this is highly
wasteful. Therefore, this representation has fields for `nodes_*` for
interior nodes and `leaf_*` for leaf nodes.
- The relationship between leaf and target is now strictly such that a
leaf can have one target (and a target may continue to be contributed by
many leaves). This nuance is discussed
[here](#5851 (comment)).
7. Enumerations are held in integer attributes rather than strings
(`aggregate_function`, `post_transform`, `nodes_modes`).
8. The use of treeids and nodeids is dropped in favour of using the
index into the `nodes_*` and `leaf_*` attributes to define the tree
structure directly with no indirection. A `tree_roots` field has been
added to denote the roots of each decision tree in the ensemble.

The `TreeEnsembleRegressor` can be implemented by directly using this
operator.
The `TreeEnsembleClassifier` can be implemented by using this operator
and then computing the top class for each input by applying an ArgMax
operation for each output before using `LabelEncoder/GatherND` to
produce the requisite label.
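A minimal traversal of the new layout might look like the sketch below. The attribute names and the leaf-flag convention follow the summary above but are simplified here (one feature, one tiny tree), so treat this as an illustration rather than the exact operator contract:

```python
# One interior node splitting feature 0 at 2.5; both children are
# leaves, stored in the separate leaf_* arrays.
nodes_featureids   = [0]
nodes_splits       = [2.5]
nodes_trueleafs    = [True]   # does the true child point into leaf_*?
nodes_truenodeids  = [0]
nodes_falseleafs   = [True]
nodes_falsenodeids = [1]
leaf_weights = [10.0, 20.0]
tree_roots   = [0]            # index of each tree's root in nodes_*

def predict(x):
    total = 0.0
    for root in tree_roots:
        node = root
        while True:
            go_true = x[nodes_featureids[node]] <= nodes_splits[node]
            is_leaf = nodes_trueleafs[node] if go_true else nodes_falseleafs[node]
            child = nodes_truenodeids[node] if go_true else nodes_falsenodeids[node]
            if is_leaf:
                total += leaf_weights[child]
                break
            node = child
    return total

assert predict([1.0]) == 10.0
assert predict([3.0]) == 20.0
```

Because child indices carry an "is leaf" flag, leaves never occupy rows in the interior-node table, and tree_roots replaces the per-node tree ids.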

As the reference implementation tests show, this representation can perform the same operations as before while adding a new capability: set membership.

### Motivation and Context
Addresses #5851.

Signed-off-by: Aditya Goel <agoel4512@gmail.com>
Signed-off-by: Aditya Goel <48102515+adityagoel4512@users.noreply.github.com>
@adityagoel4512
Contributor Author

Closed by #5874
