[Feature request] New encoding for TreeEnsemble* operators #5851
Comments
Thanks for tackling this issue! Here are a few thoughts:
Do you mean the node type by "that information"? While implementations may vary, it's worth pointing out that the missing-value-tracks-true flag and the split type are packed into a single byte in ORT (see https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/cpu/ml/tree_ensemble_aggregator.h#L99-L103). I'm not sure I understand this point in light of that.
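For readers unfamiliar with that code, here is a minimal sketch (not ORT's actual struct layout) of the idea of packing a split mode and a missing-value flag into one byte:

```python
# Illustrative only: pack a node's split mode and its
# "missing value tracks true" flag into a single byte, in the spirit
# of the onnxruntime aggregator code linked above (the real layout
# in ORT may differ).
NODE_MODES = {"BRANCH_LEQ": 0, "BRANCH_LT": 1, "BRANCH_EQ": 2, "LEAF": 3}

def pack_node_flags(mode: str, missing_tracks_true: bool) -> int:
    # low 4 bits: node mode; bit 4: missing-value flag
    return NODE_MODES[mode] | (int(missing_tracks_true) << 4)

def unpack_node_flags(packed: int) -> tuple[str, bool]:
    mode = {v: k for k, v in NODE_MODES.items()}[packed & 0x0F]
    return mode, bool(packed & 0x10)

assert unpack_node_flags(pack_node_flags("BRANCH_LT", True)) == ("BRANCH_LT", True)
```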
Instead of removing attribute …
I see, thanks for pointing that out! The split function is then later inlined in a …
Note that double input is already supported today, but the output is always 32-bit. I don't think there is much merit in offering float16 as input and split values. Upcasting the input to float32 is not free, but the outputs would remain unchanged. If a use case for trees with 16-bit inputs arises, it will be simple to add in a forward-compatible way.
float16 split values may be useful to reduce the memory footprint, but I agree: I have not seen any feature request for that type.
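As a concrete illustration of the upcasting point above, a float16 (or double) input can be fed to today's float32 tree by putting a plain Cast in front of it. A minimal sketch with `onnx.helper`; the tree node's own `nodes_*`/`target_*` attributes are omitted, so this is wiring only, not a checkable model:

```python
from onnx import TensorProto, helper

# Upcast a float16 input to float32 before the tree operator instead of
# teaching the tree operator about 16-bit splits.
cast = helper.make_node("Cast", ["X_fp16"], ["X_fp32"], to=TensorProto.FLOAT)
tree = helper.make_node(
    "TreeEnsembleRegressor", ["X_fp32"], ["Y"],
    domain="ai.onnx.ml", n_targets=1,
    # ... nodes_* / target_* attributes omitted for brevity ...
)
graph = helper.make_graph(
    [cast, tree], "upcast_then_tree",
    [helper.make_tensor_value_info("X_fp16", TensorProto.FLOAT16, [None, 3])],
    [helper.make_tensor_value_info("Y", TensorProto.FLOAT, [None, 1])],
)
```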
At the risk of coupling the operator definition with onnxruntime details again: I actually tried this out a while ago, but this reordering interferes with the fast path where all conditions in a tree are the same. I suppose we can leave the attribute in since it helps with clarity for converter library writers.
Maybe we are talking past each other and are both in favor of keeping the attribute. My take is that …
Edited response:
My understanding (@xadupre please correct me if I'm wrong) is that a leaf can correspond to more than one target. You could share the prediction at a leaf across multiple targets. Since aggregation is done within the target and not within the leaf, you couldn't aggregate these targets at build time. That said, I don't know how/if this is even used in practice: if we denormalised this, the representation of the tree could be much simpler (each leaf corresponds to one target that is …).
You are right. The format allows multiple targets per node, but the implementation does not support that and I never had to change that assumption. scikit-learn, lightgbm, and xgboost are the same. Classification or regression usually has one dimension (real-valued regression or binary classification). The implementation is then extended to support more dimensions or more classes by multiplying the trees and keeping one dimension for the target.
I think, given we will be making many changes, we should aim to be precise about what is supported (it also simplifies the encoding). If we associate a leaf with a target, we can continue to support the 1-to-N relation between leaf nodes and targets (a target may be composed from many leaves). Allowing many targets to be composed from one leaf complicates the encoding with no clear use case: if a user wants leaves to map to many targets in the same way, this could be done with other standard operators as well (map each leaf to a different target and then take the row of leaf outputs and aggregate them as desired).
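To spell out that workaround with made-up names: if the tree op emits one value per leaf "slot" (one target per leaf), a graph can still fan a leaf out to several final targets with a plain matrix product afterwards. The arithmetic shown with numpy:

```python
import numpy as np

# Per-example outputs with one column per leaf "slot"
# (as produced by a tree op where each leaf has exactly one target).
leaf_values = np.array([[0.5, -1.0, 2.0],
                        [0.0,  3.0, 1.0]])

# Fixed mapping sending each leaf's value to one or more final targets;
# in a model this could simply be a MatMul node after the tree op.
leaf_to_target = np.array([[1.0, 0.0],   # leaf slot 0 -> target 0
                           [1.0, 1.0],   # leaf slot 1 -> targets 0 and 1
                           [0.0, 1.0]])  # leaf slot 2 -> target 1

targets = leaf_values @ leaf_to_target
print(targets)  # shape (2, 2): per-target aggregated outputs
```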
@xadupre, is there any need to retain and upgrade the TreeEnsembleClassifier? I believe all of the functionality is implementable directly through the TreeEnsembleRegressor (combined with other standard operators).
Is there something I'm missing here? If not, I think a reasonable argument could be made to write the classifier as a function op in this update, with a view to deprecating the "regressor" and "classifier" long term and just having a `TreeEnsemble`.
If possible, I think it would be better to only have a single `TreeEnsemble` operator.
That makes sense. I would also remove the attribute classlabels_strings. It can be replaced by a LabelEncoder, and the model would be smaller. Maybe we then mark TreeEnsembleRegressor and TreeEnsembleClassifier as deprecated and introduce a new TreeEnsemble. That would make it easier for converters, as they could keep the existing code without any breaking change and have more time to switch to the new operator. Once that is done, we could also do the same with LinearRegressor and LinearClassifier, SVM...
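A rough sketch of the classifier-as-composition idea discussed above: per-class scores from a tree ensemble go through ArgMax, and a LabelEncoder recovers the string label (class ids and label strings below are placeholders):

```python
from onnx import helper

# Given per-class scores "scores" (one column per class) produced by a
# tree ensemble, recover the predicted label without a dedicated
# TreeEnsembleClassifier.
argmax = helper.make_node("ArgMax", ["scores"], ["class_index"], axis=1, keepdims=0)
label = helper.make_node(
    "LabelEncoder", ["class_index"], ["class_label"],
    domain="ai.onnx.ml",
    keys_int64s=[0, 1, 2],                                  # placeholder ids
    values_strings=["setosa", "versicolor", "virginica"],   # placeholder labels
)
```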
### Description

Edited: This proposes a new operator `TreeEnsemble` that supersedes the pre-existing `TreeEnsembleRegressor` and `TreeEnsembleClassifier` operators. It will require a bump to `ai.onnx.ml` opset 5. Further details can be found in #5851. A summary of the updates:

1. `TreeEnsemble` supports double outputs.
2. Adds a `'SET_MEMBER'` node mode to encode set membership.
3. Type errors are raised if split values do not have the same type as the input, and if the `nodes_*` attributes do not have the same length (and likewise for `leaf_*`).
4. Integer input types are dropped.
   - With the remaining attributes only being represented in floating point, this can be replicated by simply using a Cast standard operator before the tree regressor with no behaviour change.
5. `base_values` is dropped.
   - This attribute simply specified an offset added after target values are aggregated. This can be implemented by using the Add standard operator.
6. The general encoding has been changed to reduce redundancy. Before, all nodes contained fields like `truenodeids` which are only relevant for interior nodes and not leaves. Since leaves account for roughly half the nodes (or more) in a binary decision tree, this is highly wasteful. Therefore, this representation has `nodes_*` fields for interior nodes and `leaf_*` fields for leaf nodes.
   - The relationship between leaf and target is now strictly such that a leaf can have one target (and a target may continue to be contributed to by many leaves). This nuance is discussed [here](#5851 (comment)).
7. Enumerations are held in integer attributes rather than strings (`aggregate_function`, `post_transform`, `nodes_modes`).
8. The use of `treeids` and `nodeids` is dropped in favour of using the index into the `nodes_*` and `leaf_*` attributes to define the tree structure directly with no indirection. A `tree_roots` field has been added to denote the roots of each decision tree in the ensemble (see the sketch following this description).

The `TreeEnsembleRegressor` can be implemented by directly using this operator. The `TreeEnsembleClassifier` can be implemented by using this operator and then computing the top class for each input by applying an ArgMax operation for each output before using `LabelEncoder`/`GatherND` to produce the requisite label. As per the reference implementation tests, this representation can perform the same operations as before while adding new capability in set membership.

### Motivation and Context

Addresses #5851.

Signed-off-by: Aditya Goel <agoel4512@gmail.com>
Signed-off-by: Aditya Goel <48102515+adityagoel4512@users.noreply.github.com>
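The sketch referenced in the description: a tiny pure-Python traversal of one tree under the proposed layout, with interior nodes and leaves in separate arrays and `tree_roots` pointing at each tree's root. The way leaf children are flagged here (`nodes_trueleafs`/`nodes_falseleafs`) is an assumption of this sketch rather than a quotation of the final spec:

```python
# Purely illustrative traversal of the proposed layout: interior nodes
# and leaves live in separate arrays, and tree_roots points at each
# tree's root. All interior splits here use BRANCH_LEQ for brevity.
tree_roots = [0]

# Interior-node arrays.
nodes_featureids   = [0, 1]
nodes_splits       = [1.5, -0.5]
nodes_truenodeids  = [1, 0]
nodes_falsenodeids = [2, 1]
nodes_trueleafs    = [0, 1]   # assumed flag: 1 => child index points into the leaf_* arrays
nodes_falseleafs   = [1, 1]

# Leaf arrays: exactly one target per leaf, as proposed.
leaf_targetids = [0, 0, 0]
leaf_weights   = [0.1, 0.7, -0.3]

def predict(x, n_targets=1):
    out = [0.0] * n_targets
    for root in tree_roots:
        node, is_leaf = root, False
        while not is_leaf:
            go_true = x[nodes_featureids[node]] <= nodes_splits[node]  # BRANCH_LEQ
            if go_true:
                node, is_leaf = nodes_truenodeids[node], bool(nodes_trueleafs[node])
            else:
                node, is_leaf = nodes_falsenodeids[node], bool(nodes_falseleafs[node])
        out[leaf_targetids[node]] += leaf_weights[node]  # SUM aggregation
    return out

print(predict([1.0, -1.0]))  # follows root -> interior node 1 -> leaf 0
```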
Closed by #5874
System information
ONNX 1.15
What is the problem that this feature solves?
Limitations and inefficiencies in decision tree encoding.
The proposal would enable a direct encoding of set membership, a leaner node encoding with redundant attributes removed, and double-precision outputs.
Alternatives considered
No response
Describe the feature
Currently, the TreeEnsemble* operators cannot represent set membership directly.
Upstream libraries such as LightGBM frequently produce set membership operators, especially in the context of categorical variables. Currently, converter frameworks represent this by chaining equalities and connecting the true output edge of each one to the same node (i.e. we no longer have a tree). We should capture this explicitly in the standard rather than relying on runtimes and converter frameworks to reach consensus outside the scope of the standard. One approach might be to add a `SET_MEMBERSHIP` node mode and encode the possible members in a new attribute (a small sketch contrasting the two encodings is given at the end of this description).

Secondly, the encoding of the remaining features contains some redundancies and could do with a redesign:

- `node_hitrates` and `node_hitrates_as_tensor` can be dropped. They are not used in any converter frameworks or runtimes, and the existing description is ambiguous about exactly how they influence performance, which makes the attributes impossible to use in practice.
- `nodes_missing_value_tracks_true` is redundant. You can arrange true/false branches accordingly (NaN compares false for all `node_modes` except 'BRANCH_NEQ', where it compares true). It ends up on the hot path when making predictions in onnxruntime (see https://github.com/microsoft/onnxruntime/blob/cf78d01546ca059a2ab487e01626e38029a3e8fd/onnxruntime/core/providers/cpu/ml/tree_ensemble_common.h#L738-L761) unnecessarily, so removing this attribute could simplify inference code and improve performance.
- The `node_modes` options can be reduced to a minimal orthogonal set: 'BRANCH_LEQ', 'BRANCH_LT', 'BRANCH_EQ', and 'LEAF' are all that is necessary. This will make implementing the operator more efficient without losing any expressivity.

Finally, the operators currently only support float output. This frequently leads to discrepancies with libraries like LightGBM and XGBoost that are unacceptable in many deployment contexts. It looks like adding double support was proposed in a 2023 roadmap meeting but never introduced: https://github.com/onnx/steering-committee/blob/main/roadmap/2023-docs/01-tree-ensembles.pdf. Adding double output support would eliminate this discrepancy.
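The sketch mentioned above: how a categorical split such as x1 in {2, 5, 9} has to be encoded today versus with a set-membership node mode (plain Python, purely illustrative, with node modes shown as comments):

```python
# Current workaround: chained equality tests, every "true" edge jumping
# to the same target subtree (so the structure is a DAG, not a tree).
def categorical_split_chained(x1):
    if x1 == 2:  # BRANCH_EQ
        return "in_set_subtree"
    if x1 == 5:  # BRANCH_EQ
        return "in_set_subtree"
    if x1 == 9:  # BRANCH_EQ
        return "in_set_subtree"
    return "not_in_set_subtree"

# Proposed: a single SET_MEMBERSHIP node whose members are stored in a
# new attribute.
def categorical_split_set_membership(x1, members=(2, 5, 9)):
    return "in_set_subtree" if x1 in members else "not_in_set_subtree"

assert categorical_split_chained(5) == categorical_split_set_membership(5)
```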
Will this influence the current api (Y/N)?
Yes, it would require a new ai.onnx.ml opset.
Feature Area
Operator definitions.
Are you willing to contribute it (Y/N)
Yes
Notes
I'm happy to contribute this and see it through but would like to gather any opinions first.