Skip to content

Conversation

@lbh930
Copy link

@lbh930 lbh930 commented Nov 29, 2025

Description

Modified ImmutableLabelInfo to sort labels lexicographically before assigning IDs. This ensures that the mapping from Label to ID is deterministic and doesn't depend on the iteration order of HashMap. Updated the regression testsTestSGDLinear, TestFMClassification, TestClassificationEnsembles that relied on the previous non-deterministic ID assignment. This made sure the comparisons in the tests are not order-dependent.

Motivation

NonDex detected test flakiness in 4 tests:

  • org.tribuo.classification.mnb.TestMNB.testSingleClassTraining
  • org.tribuo.classification.SerializationTest.load431Protobufs
  • org.tribuo.classification.sgd.fm.TestFMClassification.loadProtobufModel
  • org.tribuo.classification.sgd.linear.TestSGDLinear.testSingleClassTraining

This PR is verified by NonDex to fix them all. The root cause for all of them was that the order of label IDs depended on HashMap iteration order, which has no guarantee of determinism. When the ID assignment changed, the resulting model parameters and predictions varied. By sorting the labels, the ID assignment will be consistent and make model training deterministic.

@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Nov 29, 2025
@Craigacp
Copy link
Member

We had to do this for the regression infos some time ago to fix a nasty indexing bug there, however it required lots of juggling in the models to fix as different indices cause problems. I don't think this could cause similar problems, but the iteration order here of the new style ones is guaranteed to be in increasing string sort order, and old ones won't be (as they still will deserialize into a HashMap not a LinkedHashMap and they can't be fixed to be in string sort order as that would break existing models). I'll need to think more about if this is safe and if we can construct some tests to check that it is safe.

If we do do this then you should use the TreeSet idiom we have in ImmutableRegressionInfo to construct the sorted keys, it's a bit shorter and I'd prefer to keep things consistent - https://github.com/oracle/tribuo/blob/main/Regression/Core/src/main/java/org/tribuo/regression/ImmutableRegressionInfo.java#L81.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

OCA Verified All contributors have signed the Oracle Contributor Agreement.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants