Support for `missing` attribute #19

liumy601 · 2021-12-25T15:17:00Z

Hi vruusmann,

Sorry to bother you again. I've been struggling with this inconsistency problem for several months. After checking the xgboost4j docs, I saw that versions after 0.9 include some fixes for the missing-value problem, so I upgraded xgboost4j-spark to 1.2.0 with Spark 3. However, I still get inconsistent predictions.

As you can see, I have only one categorical feature, hour, and it doesn't contain missing values. However, if I remove the categorical feature and use only numeric features, the predictions are consistent.

Do you have any clues?

vruusmann · 2021-12-25T17:11:29Z

> i only have one categorical feature hour which doesn't contain missing values

What is your definition of a "missing value"? A Java null reference, Double.NaN (or Float.NaN value), or something else?

The JPMML-XGBoost library has been very thoroughly tested with continuous/categorical/missing/invalid data for 6+ years, without a single major issue. So, again, I must assume that the problem resides somewhere in your application code.

Please prepare & share a minimal reproducible example - a CSV data file plus an Apache Spark script (Scala or PySpark), which I can run and explore locally.

vruusmann · 2021-12-25T17:13:21Z

This project contains an integration test that uses sparse categorical data:
https://github.com/jpmml/jpmml-sparkml-xgboost/blob/master/src/test/resources/XGBoostAudit.scala

This test is 100% reproducible.

liumy601 · 2021-12-26T15:43:05Z

> This project contains an integration test that uses sparse categorical data: https://github.com/jpmml/jpmml-sparkml-xgboost/blob/master/src/test/resources/XGBoostAudit.scala
>
> This test is 100% reproducible.

I've tried SparseToDenseTransformer before and saw that it fixes the inconsistency caused by sparse vectors.
But my dataset is large and has over 28,000 feature dimensions, so the XGBoost model then fails to run because of memory problems.

liumy601 · 2021-12-26T15:43:28Z

> > i only have one categorical feature hour which doesn't contain missing values
>
> What is your definition of a "missing value"? A Java null reference, Double.NaN (or Float.NaN value), or something else?
>
> The JPMML-XGBoost library has been very thoroughly tested with continuous/categorical/missing/invalid data for 6+ years, without a single major issue. So, again, I must assume that the problem resides somewhere in your application code.
>
> Please prepare & share a minimal reproducible example - a CSV data file plus an Apache Spark script (Scala or PySpark), which I can run and explore locally.

I set the missing value to 0. In xgboost4j-spark 1.2.0, if I set missing to any other value, I get an "XGBoost training failed" error.

vruusmann · 2021-12-26T16:32:08Z

> i set missing value to 0

The DataField element for the "hour" column does not convey any information about the fact that in your case, the 0 value should be regarded as a missing value (and not as a numeric zero value).

How can the PMML engine make correct predictions if it is missing this critical piece of information?
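To illustrate why this piece of information matters, here is a minimal sketch of a single tree split. It is a simplified model and not the actual XGBoost or JPMML implementation; the function name `route` and its parameters are hypothetical. It assumes the usual convention that a missing value follows the node's default branch, while any other value is compared against the threshold.

```python
import math


def route(value, threshold, default_left, missing=float("nan")):
    """Pick the branch of one split node (hypothetical, simplified model)."""
    # None and NaN are always treated as missing.
    is_missing = value is None or (isinstance(value, float) and math.isnan(value))
    # If a non-NaN sentinel is configured, values equal to it are missing too.
    if not is_missing and not math.isnan(missing):
        is_missing = value == missing
    if is_missing:
        # Missing values follow the default direction learned for this node.
        return "left" if default_left else "right"
    # Regular values are compared against the split threshold.
    return "left" if value < threshold else "right"


# The same input (hour = 0) takes different branches depending on
# whether 0 is declared as the missing-value sentinel.
print(route(0, threshold=12, default_left=False))             # left
print(route(0, threshold=12, default_left=False, missing=0))  # right
```

If the training side treats 0 as missing but the scoring side treats it as a numeric zero, the two can take different branches at every split on that feature, which would show up as inconsistent predictions.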

Take the PMML document, and insert the following DataField/Value child element manually:

```xml
<DataField name="hour" optype="categorical" dataType="integer">
  <!-- THIS -->
  <Value property="missing" value="0"/>
</DataField>
```
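For larger documents, the same manual edit could be scripted. The following is a minimal sketch using only Python's standard `xml.etree.ElementTree` module; the helper name `mark_missing` is hypothetical and is not part of any JPMML library. It assumes the PMML namespace can be read from the root element, and that appending the `Value` element as the last child of `DataField` is acceptable for your document.

```python
import xml.etree.ElementTree as ET


def mark_missing(pmml_text, field_name, missing_value):
    """Add <Value property="missing" .../> to the named DataField (hypothetical helper)."""
    root = ET.fromstring(pmml_text)
    # Read the namespace from the root tag, e.g. "{http://www.dmg.org/PMML-4_4}PMML".
    ns = root.tag[1:root.tag.index("}")] if root.tag.startswith("{") else ""
    prefix = "{%s}" % ns if ns else ""
    if ns:
        # Serialize with a default namespace instead of generated "ns0:" prefixes.
        ET.register_namespace("", ns)
    for field in root.iter(prefix + "DataField"):
        if field.get("name") != field_name:
            continue
        # Skip the field if the same declaration is already present.
        already_declared = any(
            v.get("property") == "missing" and v.get("value") == missing_value
            for v in field.findall(prefix + "Value")
        )
        if not already_declared:
            ET.SubElement(
                field, prefix + "Value",
                {"property": "missing", "value": missing_value},
            )
    return ET.tostring(root, encoding="unicode")


sample = (
    '<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">'
    '<DataDictionary>'
    '<DataField name="hour" optype="categorical" dataType="integer"/>'
    '</DataDictionary>'
    '</PMML>'
)
print(mark_missing(sample, "hour", "0"))
```

The function works on strings, so reading and writing the PMML file is left to the caller. Calling it twice with the same arguments does not add a second `Value` element.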

vruusmann · 2021-12-26T16:38:05Z

It would be nice to automate the generation of extra DataField/Value@property="missing" etc elements.

Here are some related feature requests: jpmml/jpmml-sparkml#14 and jpmml/jpmml-sparkml#25

Newer XGBoost versions also store this information in model dumps. Here's a related Scikit-Learn issue: jpmml/jpmml-sklearn#166

liumy601 · 2021-12-28T16:21:41Z

Hi vruusmann,

Unfortunately, after I added the extra DataField/Value@property="missing" elements, the inconsistency still exists, which is frustrating.
I've tried both xgboost4j-spark 0.82 and 1.2.0, and both give inconsistent predictions.
I'm out of ideas now.

vruusmann transferred this issue from jpmml/jpmml-sparkml Dec 25, 2021

liumy601 closed this as completed Dec 26, 2021

liumy601 reopened this Dec 26, 2021

vruusmann closed this as completed Dec 26, 2021

vruusmann changed the title ~~inconsistent predict problem between jpmml and xgboost4j-spark~~ Support for missing attribute Dec 26, 2021

vruusmann reopened this Dec 26, 2021
