Issue 481 - implement support for missing values with XGBoost #482

Merged
merged 7 commits into from
Jan 31, 2024

patrick-le-shopify
Contributor

This PR attempts to implement support for missing values with XGBoost (#481).

The main logic change pertains to the initialization of the FeatureVector. As of now, the DenseFeatureVector has a default value of 0.0, and feature scores are filled in with actual values only when they are present. Effectively, missing features are given a value of 0.0. This happens "upstream," so by the time the XGBoost model (NaiveAdditiveDecisionTree) is invoked, no feature value is missing any longer.

The implementation here adds a new class called SparseFeatureVector that gives features a default value of Float.NaN, enabling the XGBoost model to actually follow the branches meant for missing features.
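
To illustrate why the default value matters, here is a minimal sketch, not the plugin's actual code. The class names SketchFeatureVector and SketchSplit, and the single-split "tree", are hypothetical stand-ins. The sketch assumes the usual XGBoost convention that a split has a "yes" child (feature < threshold), a "no" child, and a separate "missing" child, with NaN treated as missing.

```java
// Hypothetical sketch only: these classes are illustrative stand-ins,
// not the plugin's DenseFeatureVector / SparseFeatureVector / NaiveAdditiveDecisionTree.
final class MissingValueDemo {
    public static void main(String[] args) {
        // One split on feature 0 at threshold 0.5 with three distinct leaf values.
        SketchSplit split = new SketchSplit(0, 0.5f, 1.0f, 2.0f, 3.0f);

        // Default 0.0: the missing feature looks like a real 0.0 and takes the "yes" branch.
        System.out.println(split.eval(SketchFeatureVector.dense(1)));  // prints 1.0

        // Default NaN: the missing feature is recognised and takes the "missing" branch.
        System.out.println(split.eval(SketchFeatureVector.sparse(1))); // prints 3.0
    }
}

final class SketchFeatureVector {
    private final float[] scores;
    private final float defaultScore;

    private SketchFeatureVector(int size, float defaultScore) {
        this.scores = new float[size];
        this.defaultScore = defaultScore;
        // Every slot starts at the default; only present features are overwritten later.
        for (int i = 0; i < size; i++) {
            scores[i] = defaultScore;
        }
    }

    // Assumed to mirror the existing behaviour: missing features read as 0.0.
    static SketchFeatureVector dense(int size) {
        return new SketchFeatureVector(size, 0.0f);
    }

    // Assumed to mirror the new behaviour: missing features read as NaN.
    static SketchFeatureVector sparse(int size) {
        return new SketchFeatureVector(size, Float.NaN);
    }

    void set(int featureId, float score) { scores[featureId] = score; }
    float get(int featureId) { return scores[featureId]; }
    float defaultScore() { return defaultScore; }
}

final class SketchSplit {
    private final int featureId;
    private final float threshold;
    private final float yesValue;
    private final float noValue;
    private final float missingValue;

    SketchSplit(int featureId, float threshold, float yesValue, float noValue, float missingValue) {
        this.featureId = featureId;
        this.threshold = threshold;
        this.yesValue = yesValue;
        this.noValue = noValue;
        this.missingValue = missingValue;
    }

    float eval(SketchFeatureVector vector) {
        float score = vector.get(featureId);
        // NaN never compares true with <, so it must be checked explicitly.
        if (Float.isNaN(score)) {
            return missingValue;
        }
        return score < threshold ? yesValue : noValue;
    }
}
```

With a 0.0 default the "missing" branch is unreachable, which is the behaviour this PR sets out to change.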

@patrick-le-shopify
Contributor Author

Tagging @styrmis @wrigleyDan for awareness.

@patrick-le-shopify changed the title from "Issue 481" to "Issue 481 - implement support for missing values with XGBoost" on Jan 24, 2024
Contributor

@styrmis left a comment

Thank you for this! I like the approach—I've added just one comment. It was the explain output that in the end clarified the behaviour of the model inference in the plugin for me, so I think it would be worth maintaining the explicit reporting of default values used in that output if possible.

@@ -259,7 +259,7 @@ public Explanation explain(LeafReaderContext context, int doc) throws IOExceptio
         }
         featureString += ":";
         if (!explain.isMatch()) {
-            subs.add(Explanation.noMatch(featureString + " [no match, default value 0.0 used]"));
+            subs.add(Explanation.noMatch(featureString + " [no match, default value used]"));
Contributor

As the feature vector implementations can return their default value (which may be NaN), could we report this here?

Without this information, determining the default would require careful inspection of the source, whereas reporting it in the explain output would be more user friendly.

Contributor Author

Good call! I've made changes to add what the default value is to the explanation.
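
For readers following along, a rough sketch of what reporting the default might look like. The exact wording and code of the merged change are not shown in this thread, so the helper name noMatchDescription, the message format, and the example feature label are all assumptions. It builds only the description string, without depending on Lucene.

```java
// Hypothetical sketch: not the merged code, just one way to include the
// vector's default value in the "no match" description.
final class ExplainMessageSketch {
    // featureString is assumed to already end with ":" as in the diff above.
    static String noMatchDescription(String featureString, float defaultValue) {
        return featureString + " [no match, default value " + defaultValue + " used]";
    }

    public static void main(String[] args) {
        // "Feature 0(title_match):" is a made-up label for illustration.
        System.out.println(noMatchDescription("Feature 0(title_match):", 0.0f));
        System.out.println(noMatchDescription("Feature 0(title_match):", Float.NaN));
    }
}
```

The description string could then be passed to Explanation.noMatch in place of the fixed message, so the output shows 0.0 or NaN depending on which feature vector is in use.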

Contributor

@wrigleyDan left a comment

Looks good to me, thanks for that contribution!

@wrigleyDan merged commit d2bbe8b into o19s:main on Jan 31, 2024
1 check passed