Clarify how we handle missing values, NaN, zeros... #135

nomoa · 2018-02-16T14:19:12Z

While trying to add support for the XGBoost missing direction, I realized that the way we handle missing values is not very clear, in either the code or the docs.

During logging we allow users to set missing_as_zero, which emits zeros instead of nothing. After that, it's up to the user to properly configure their training algorithm to handle these values. E.g. XGBoost supports missing values and will emit a model with an additional "is missing?" decision besides the threshold check.
Today the model parser for XGBoost completely ignores the missing branch. This basically assumes the features were logged with missing_as_zero.
Concerning ranklib, its DataPoint natively supports missing values by using NaN. But in the plugin glue code we force all values to zero in the reset() method.
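To make the effect of ignoring the missing branch concrete, here is a minimal Python sketch (not the plugin's Java parser). It assumes a single split node shaped like XGBoost's JSON model dump (`split`, `split_condition`, `yes`, `no`, `missing`, `children`, `leaf`); the tree values, the feature name `f0`, and the `eval_tree` helper are made up for illustration.

```python
import math

# One split node in the shape of XGBoost's JSON dump. The values are illustrative.
# XGBoost takes the "yes" child when value < split_condition.
TREE = {
    "nodeid": 0, "split": "f0", "split_condition": 0.5,
    "yes": 1, "no": 2, "missing": 2,
    "children": [
        {"nodeid": 1, "leaf": -1.0},
        {"nodeid": 2, "leaf": 2.0},
    ],
}


def eval_tree(node, features, honor_missing=True):
    """Walk the tree. `features` maps feature name -> float; absent or NaN means missing."""
    while "leaf" not in node:
        value = features.get(node["split"], math.nan)
        if math.isnan(value) and honor_missing:
            # Follow the direction the model learned for missing values.
            next_id = node["missing"]
        else:
            if math.isnan(value):
                # Assumed stand-in for a parser that ignores the missing branch:
                # the value is treated as if it had been logged as zero.
                value = 0.0
            next_id = node["yes"] if value < node["split_condition"] else node["no"]
        node = next(child for child in node["children"] if child["nodeid"] == next_id)
    return node["leaf"]


with_branch = eval_tree(TREE, {}, honor_missing=True)
without_branch = eval_tree(TREE, {}, honor_missing=False)
```

With this toy tree the two calls reach different leaves for the same missing feature, which is the mismatch that arises when a model trained with real missing values is evaluated as if everything had been logged with missing_as_zero.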

Things we should fix regardless:

~~when logging we should fail if we emit NaN for a non-missing value (bogus feature/query)~~ Fail when trying to log NaN values #136 .

Evaluate:

- Start by adding a boolean missing(int featureIdx) method to the FeatureVector interface.
- Eventually fix (or add a new impl of) our decision tree implementation so that it supports missing values. Sadly, the xgboost format has no way to tell us whether the missing branch needs to be checked (which would allow an optimization).
- Doc: clarify how we handle missing values in the various ranker implementations we support, so that users can decide properly whether to log features with missing_as_zero.
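The first item could look roughly like the sketch below. It is written in Python for brevity, although the plugin itself is Java, and every name in it (`DenseFeatureVector`, `set_feature_score`, `get_feature_score`) is hypothetical rather than the plugin's actual API. It assumes scores live in a zero-initialized array and that a separate presence mask records which features were actually set.

```python
import math


class DenseFeatureVector:
    """Hypothetical feature vector that tracks missing features separately from zeros."""

    def __init__(self, size):
        self._scores = [0.0] * size      # zero-initialized, like a fresh float array
        self._present = [False] * size   # presence mask: True once a score is set

    def set_feature_score(self, idx, score):
        # Reject NaN for a non-missing value (a bogus feature or query).
        if math.isnan(score):
            raise ValueError(f"feature {idx} produced NaN")
        self._scores[idx] = score
        self._present[idx] = True

    def get_feature_score(self, idx):
        # Missing features still read as 0.0, so existing rankers keep working.
        return self._scores[idx]

    def missing(self, idx):
        return not self._present[idx]

    def reset(self):
        # Clear both the scores and the presence mask between documents.
        for i in range(len(self._scores)):
            self._scores[i] = 0.0
            self._present[i] = False
```

Under these assumptions, a ranker that does not care about missing values keeps reading zeros, while a tree implementation that does care can call `missing(idx)` before applying the threshold check.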


nomoa · 2018-02-16T17:26:28Z

edit: removed the comment on ranklib DataPoint. The default constructor does nothing, so the float array is initialized with zeros, which is consistent with the reset() method.

ebernhardson · 2018-02-28T00:20:37Z

I realized while looking over some feature values that the query explorer query also works slightly differently from the others with respect to missing_as_zero. Query explorer appears to always match the document, but emits a score of 0.0 if the provided query doesn't match the document. This is perhaps slightly complicated by classic_idf not depending on the query terms (it should match everything), while others like raw_ttf do depend on the terms.

The result of this is basically that missing_as_zero doesn't combine with query explorer in the same way as a match query, even if the query explorer is exploring that same match query.

subsetpark · 2022-09-06T18:32:41Z

Hi there,

This is an older ticket but seems to be the canonical discussion point for dealing with missing values.

We are currently building our ESLTR pipeline, and we log some values which can be missing. For instance, if the data behind a logged feature is behind a feature flag, and the account/session being logged is outside of that flag, that feature will be missing in the logs: it will have an entry in the _ltr output but no value.

The model we're training is built with XGBoost, so we are currently representing that feature as NaN for the observation in question.

I have two questions about current best practices for a scenario like ours:

1. Generally, is this ticket still a priority? Does ESLTR still intend to handle missing values as distinct from 0s?
2. Perhaps more interestingly: given the current state of affairs, what is the best way to represent missing values for ESLTR? Intuitively, it seems problematic to simply treat them as 0s, because a 0 for some binary feature means something other than "this particular feature is irrelevant for this observation".

Other tickets here have alluded to using some other sentinel value, for instance the maximum float value, or perhaps -1. But I'm curious: does the ESLTR team have any current recommendations for how to express missing features as distinct from negative features? Or, alternatively, is the distinction not important? Should they be treated the same as negative features?
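For reference, the representation described above (an entry with a name but no value becoming NaN) can be sketched as follows. This is only an illustration of the options being asked about, not a recommendation from the maintainers. It assumes each logged entry is a list of `{"name": ..., "value": ...}` objects in which a missing feature has no `value` key; the feature names and the `log_entry_to_row` helper are made up.

```python
import math


def log_entry_to_row(entry, feature_names, missing=math.nan):
    """Turn one logged entry into a training row, filling absent values with `missing`."""
    by_name = {feature["name"]: feature for feature in entry}
    return [
        float(by_name.get(name, {}).get("value", missing))
        for name in feature_names
    ]


# Hypothetical logged entry: "flagged_feature" was logged without a value.
entry = [
    {"name": "title_match", "value": 9.5},
    {"name": "flagged_feature"},
]
names = ["title_match", "flagged_feature"]

nan_row = log_entry_to_row(entry, names)                  # missing kept distinct as NaN
zero_row = log_entry_to_row(entry, names, missing=0.0)    # missing collapsed into 0
```

The NaN row keeps "missing" distinguishable from a genuine 0 at training time, but as noted earlier in this thread the plugin's XGBoost parser ignores the missing branch, so a model trained that way may be evaluated differently at query time.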

purbon added enhancement discuss labels Feb 16, 2018

nomoa mentioned this issue May 17, 2018

Selective features for models at query time #156

Merged

nomoa mentioned this issue Oct 18, 2019

Sparse feature support for xgboost models #248

Closed

nathancday mentioned this issue Jan 5, 2021

XGBoost models as first class citizens #353

Open

lechipatrick mentioned this issue Jan 3, 2023

xgboost to handle missing values #452

Closed

styrmis mentioned this issue Jan 22, 2024

Implement support for missing values with XGBoost #481

Open
