
Converting xgboost splits data format in pmml #70

Closed
dwayne298 opened this issue Oct 13, 2021 · 4 comments

@dwayne298

When using the xgboost package, I found some discrepancies between predictions made in R and those made from the PMML file. Details of the issue can be found here:

dmlc/xgboost#7294

It seems to be an issue where xgboost uses float32 but (for example) base R uses double. In the example above, R holds the value 0.1234561702374036, but xgboost uses the float32 value 0.123456173. So the split in the tree, which is what gets used in the PMML, is < 0.123456173.
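The demotion can be reproduced with a short Python sketch (the variable names are mine; the standard struct module is used to emulate XGBoost's float32 storage):

```python
import struct

def to_float32(x: float) -> float:
    """Round-trip a Python float (IEEE 754 double) through single precision."""
    return struct.unpack('f', struct.pack('f', x))[0]

r_value = 0.1234561702374036   # the 64-bit value seen by base R
split = to_float32(r_value)    # the 32-bit value stored by XGBoost

# In 32-bit arithmetic the demoted input equals the threshold, so the
# "< split" test is False; comparing the raw double instead gives True.
print(to_float32(r_value) < split)  # False
print(r_value < split)              # True
```

This is exactly the branch discrepancy described above: the two comparisons disagree, so the double-precision comparison takes a different path through the tree.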

I was wondering if it is possible to change the split that appears in the PMML file so that this discrepancy doesn't occur. For example, r2pmml could find the lowest double value that converts to the float32 value 0.123456173 and assign that as the split point (e.g. < 0.1234561702374036 would fix the issue in this case).

I appreciate this might be an impossible request but any feedback would be welcome.

@dwayne298
Author

I raise this because it's not always possible to convert the input data to float32, so a PMML file supporting other data types would be extremely useful.

@vruusmann
Member

> It seems to be an issue where xgboost uses float32 but (for example) base R uses double

The R2PMML/JPMML-R stack uses the JPMML-XGBoost library for parsing and converting XGBoost model files. All numeric information (e.g. split thresholds) comes straight from the XGBoost model file and is represented using the 32-bit floating-point data type (Java's java.lang.Float, which is translated to PMML's float).

It doesn't matter that the surrounding R layer defaults to 64-bit FP.

> So the split in the tree and that gets used in the pmml is < 0.123456173.

That's the exact split threshold, as a 32-bit FP value.

In Java, the java.lang.Float#toString() method automatically truncates the string representation to the smallest number of significant digits that still uniquely identifies the value. In Python and R, the str(<float32 value>) function doesn't do this, and prints extra digits that convey no extra information.

Why insert noise into PMML documents?

> I was wondering if it is possible to change the split that appears in the pmml file so that this discrepancy doesn't occur.

You still appear to be confused: both XGBoost and (J)PMML make correct predictions.

It is your R wrapper that messes up model input values, which leads to the selection of an incorrect tree branch.

> For example, r2pmml finds the lowest value of double that would convert to float32 of 0.123456173 and assigns that as the split point (e.g. < 0.1234561702374036 would fix the issue in this case).

Nope, 0.123456173 is the correct split threshold value.

If you add extra digits to it (such as 0.1234561702374036), then you simply overflow the "natural precision" of the 32-bit floating-point data type. For example, if in Java you run System.out.println(new Float("0.1234561702374036")), all that "extra precision" (i.e. the 02374036 suffix) is truncated.
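The same truncation can be demonstrated in Python (a sketch using the standard struct module; this is an illustration, not JPMML code): both decimal strings from this thread demote to the identical 32-bit pattern, so the extra digits carry no information at float32 precision.

```python
import struct

def float32_bits(s: str) -> bytes:
    # Encode a decimal string as an IEEE 754 single-precision bit pattern
    return struct.pack('f', float(s))

# The short threshold and the long R value map to the same float32:
print(float32_bits("0.123456173") == float32_bits("0.1234561702374036"))  # True
```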

> I raise this as it's not always possible to convert the input data into float32, and hence a pmml file with other data types would be extremely useful.

Any decent PMML engine should be able to represent 32-bit FP values and evaluate such PMML documents correctly.

You can always take a PMML document and transform it in any way you want (e.g. rounding split thresholds). In R, you could load the correct PMML document using some XML package, and make it incorrect if you so desire.
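For illustration, a minimal sketch of such a transformation in Python, using the standard xml.etree.ElementTree module instead of an R XML package. SimplePredicate/value is where tree split thresholds live in PMML; the function name, file paths, and rounding precision are mine. As noted above, rounding the thresholds changes the model and can make its predictions diverge from XGBoost:

```python
import xml.etree.ElementTree as ET

def round_split_thresholds(in_path: str, out_path: str, digits: int = 6) -> int:
    """Round every SimplePredicate 'value' attribute in a PMML file.

    Returns the number of thresholds rewritten. Warning: this alters
    the model's behaviour, as discussed in the thread above.
    """
    tree = ET.parse(in_path)
    changed = 0
    for elem in tree.getroot().iter():
        # Match the element name regardless of the PMML namespace
        if elem.tag.endswith('SimplePredicate') and 'value' in elem.attrib:
            try:
                elem.set('value', repr(round(float(elem.get('value')), digits)))
                changed += 1
            except ValueError:
                pass  # non-numeric value (e.g. a categorical split)
    tree.write(out_path, xml_declaration=True, encoding='UTF-8')
    return changed
```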

@dwayne298
Author

I may be misunderstanding, but the data feeding into xgboost is 64-bit FP with the value 0.1234561702374036; it's xgboost that converts it to the float32 value 0.123456173. So if I take the same data into another program, it has the 64-bit FP value 0.1234561702374036, and if I don't convert it to float32, I will get the wrong results.

Anyway, as you suggest - a decent PMML engine should be able to represent 32-bit FP values.

Thanks for your response and work on r2pmml!

@vruusmann
Member

> the data feeding into xgboost is the 64-bit FP with value 0.1234561702374036 - it's xgboost that converts it to float32 value 0.123456173

Correct - the XGBoost code only deals with 32-bit FP values.

The same applies if the XGBoost model is converted to the PMML representation. The PMML document contains an instruction "take user input, convert it to a 32-bit FP, and make comparisons in 32-bit mode".
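That instruction can be sketched in Python (an illustration of the semantics described above, not actual (J)PMML evaluator code; struct emulates the 64-to-32-bit cast):

```python
import struct

def demote(x: float) -> float:
    # Lossy cast from 64-bit down to 32-bit precision, as the PMML
    # float data type requires before any comparison is made.
    return struct.unpack('f', struct.pack('f', x))[0]

def less_than_split(user_input: float, threshold: float) -> bool:
    # Take user input, convert it to 32-bit FP, compare in 32-bit mode.
    return demote(user_input) < demote(threshold)

# The 64-bit R value and the split threshold agree once both are demoted,
# so the "<" branch is (correctly) not taken:
print(less_than_split(0.1234561702374036, 0.123456173))  # False
```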

> So if I take the data feeding into xgboost into another program it has the 64-bit FP value 0.1234561702374036 and if I don't convert it into float32 then I will get the wrong results.

This other program of yours is broken.

You must "demote" (cast, losing precision) input from 64-bit to 32-bit. You cannot "promote" (cast, adding precision) 32-bit thresholds to 64-bit thresholds.

The fact that str(<float32 value>) prints you some extra digits does not mean that this is a valid promotion.
