
Converting xgboost splits data format in pmml #70

Closed
dwayne298 opened this issue Oct 13, 2021 · 4 comments

@dwayne298

When using the xgboost package, I found some discrepancies between predictions made in R and those made from the PMML file. Details of the issue can be found here:

dmlc/xgboost#7294

It seems to be an issue where xgboost uses float32 but (for example) base R uses double. In the example above, R holds the value 0.1234561702374036, but xgboost uses the float32 value 0.123456173. So the split in the tree, which is what gets used in the PMML, is < 0.123456173.
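The demotion can be reproduced with a short Python sketch (the variable names are mine; the standard struct module is used to emulate XGBoost's float32 storage):

```python
import struct

def to_float32(x: float) -> float:
    """Round-trip a Python float (IEEE 754 double) through single precision."""
    return struct.unpack('f', struct.pack('f', x))[0]

r_value = 0.1234561702374036   # the 64-bit value seen by base R
split = to_float32(r_value)    # the 32-bit value stored by XGBoost

# In 32-bit arithmetic the demoted input equals the threshold, so the
# "< split" test is False; comparing the raw double instead gives True.
print(to_float32(r_value) < split)  # False
print(r_value < split)              # True
```

This is exactly the branch discrepancy described above: the two comparisons disagree, so the double-precision comparison takes a different path through the tree.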

I was wondering if it is possible to change the split that appears in the PMML file so that this discrepancy doesn't occur. For example, r2pmml could find the lowest double value that converts to the float32 value 0.123456173 and assign that as the split point (e.g. < 0.1234561702374036 would fix the issue in this case).

I appreciate this might be an impossible request but any feedback would be welcome.

@dwayne298
Author

I raise this because it's not always possible to convert the input data to float32, so a PMML file supporting other data types would be extremely useful.

@vruusmann
Member

> It seems to be an issue where xgboost uses float32 but (for example) base R uses double

The R2PMML/JPMML-R stack uses the JPMML-XGBoost library for parsing and converting XGBoost model files. All numeric information (e.g. split thresholds) comes straight from the XGBoost model file and is represented using the 32-bit floating-point data type (Java's java.lang.Float, which is translated to PMML's float).

It doesn't matter that the surrounding R layer defaults to 64-bit FP.

> So the split in the tree and that gets used in the pmml is < 0.123456173.

That's the exact split threshold, as a 32-bit FP value.

In Java, the java.lang.Float#toString() method automatically truncates the string representation to the smallest number of significant digits that still uniquely identifies the value. In Python and R, the str(<float32 value>) function doesn't do this, and prints extra digits that convey no extra information.

Why insert noise into PMML documents?

> I was wondering if it is possible to change the split that appears in the pmml file so that this discrepancy doesn't occur.

You still appear to be confused: both XGBoost and (J)PMML make correct predictions.

It is your R wrapper that messes up model input values, which leads to the selection of an incorrect tree branch.

> For example, r2pmml finds the lowest value of double that would convert to float32 of 0.123456173 and assigns that as the split point (e.g. < 0.1234561702374036 would fix the issue in this case).

Nope, 0.123456173 is the correct split threshold value.

If you add extra digits to it (such as 0.1234561702374036), then you simply overflow the "natural precision" of the 32-bit floating-point data type. For example, if in Java you run System.out.println(new Float("0.1234561702374036")), all that "extra precision" (i.e. the 02374036 suffix) is truncated.
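The same truncation can be demonstrated in Python (a sketch using the standard struct module; this is an illustration, not JPMML code): both decimal strings from this thread demote to the identical 32-bit pattern, so the extra digits carry no information at float32 precision.

```python
import struct

def float32_bits(s: str) -> bytes:
    # Encode a decimal string as an IEEE 754 single-precision bit pattern
    return struct.pack('f', float(s))

# The short threshold and the long R value map to the same float32:
print(float32_bits("0.123456173") == float32_bits("0.1234561702374036"))  # True
```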

> I raise this as it's not always possible to convert the input data into float32, and hence a pmml file with other data types would be extremely useful.

Any decent PMML engine should be able to represent 32-bit FP values and evaluate such PMML documents correctly.

You can always take a PMML document and transform it in any way you want (e.g. rounding split thresholds). In R, you could load the correct PMML document using some XML package, and make it incorrect if you so desire.
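For illustration, a minimal sketch of such a transformation in Python, using the standard xml.etree.ElementTree module instead of an R XML package. SimplePredicate/value is where tree split thresholds live in PMML; the function name, file paths, and rounding precision are mine. As noted above, rounding the thresholds changes the model and can make its predictions diverge from XGBoost:

```python
import xml.etree.ElementTree as ET

def round_split_thresholds(in_path: str, out_path: str, digits: int = 6) -> int:
    """Round every SimplePredicate 'value' attribute in a PMML file.

    Returns the number of thresholds rewritten. Warning: this alters
    the model's behaviour, as discussed in the thread above.
    """
    tree = ET.parse(in_path)
    changed = 0
    for elem in tree.getroot().iter():
        # Match the element name regardless of the PMML namespace
        if elem.tag.endswith('SimplePredicate') and 'value' in elem.attrib:
            try:
                elem.set('value', repr(round(float(elem.get('value')), digits)))
                changed += 1
            except ValueError:
                pass  # non-numeric value (e.g. a categorical split)
    tree.write(out_path, xml_declaration=True, encoding='UTF-8')
    return changed
```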

@dwayne298
Author

I may be misunderstanding, but the data feeding into xgboost is 64-bit FP with the value 0.1234561702374036; it's xgboost that converts it to the float32 value 0.123456173. So if I take the same data into another program, it has the 64-bit FP value 0.1234561702374036, and if I don't convert it to float32, I will get the wrong results.

Anyway, as you suggest - a decent PMML engine should be able to represent 32-bit FP values.

Thanks for your response and work on r2pmml!

@vruusmann
Member

> the data feeding into xgboost is the 64-bit FP with value 0.1234561702374036 - it's xgboost that converts it to float32 value 0.123456173

Correct - the XGBoost code only deals with 32-bit FP values.

The same applies if the XGBoost model is converted to the PMML representation. The PMML document contains an instruction "take user input, convert it to a 32-bit FP, and make comparisons in 32-bit mode".
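That instruction can be sketched in Python (an illustration of the semantics described above, not actual (J)PMML evaluator code; struct emulates the 64-to-32-bit cast):

```python
import struct

def demote(x: float) -> float:
    # Lossy cast from 64-bit down to 32-bit precision, as the PMML
    # float data type requires before any comparison is made.
    return struct.unpack('f', struct.pack('f', x))[0]

def less_than_split(user_input: float, threshold: float) -> bool:
    # Take user input, convert it to 32-bit FP, compare in 32-bit mode.
    return demote(user_input) < demote(threshold)

# The 64-bit R value and the split threshold agree once both are demoted,
# so the "<" branch is (correctly) not taken:
print(less_than_split(0.1234561702374036, 0.123456173))  # False
```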

> So if I take the data feeding into xgboost into another program it has the 64-bit FP value 0.1234561702374036 and if I don't convert it into float32 then I will get the wrong results.

This other program of yours is broken.

You must "demote" (cast, losing precision) input from 64-bit to 32-bit. You cannot "promote" (cast, adding precision) 32-bit thresholds to 64-bit thresholds.

The fact that str(<float32 value>) prints you some extra digits does not mean that this is a valid promotion.
