Converting xgboost splits data format in pmml #70
Comments
I raise this as it's not always possible to convert the input data into float32, and hence a PMML file with other data types would be extremely useful.
The R2PMML/JPMML-R stack uses the JPMML-XGBoost library for parsing and converting XGBoost model files. All numeric information (e.g. split thresholds) comes straight from the XGBoost model file, and is represented using the 32-bit floating point data type (Java's float). It doesn't matter that the surrounding R layer defaults to 64-bit FP.
That's the exact split threshold, as a 32-bit FP value. In Java, the Why insert noise into PMML documents?
You still appear confused - both XGBoost and (J)PMML make correct predictions. It is your R wrapper that messes up model input values, which leads to the selection of an incorrect tree branch.
Nope, If you add extra digits to it (such as
Any decent PMML engine should be able to represent 32-bit FP values, and evaluate such PMML documents correctly. You can always take a PMML document, and transform it in any way you want (e.g. rounding split thresholds). In R, you could load the correct PMML document using some XML package, and make it incorrect if you so desire.
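As a rough illustration of the "load the document and transform it" idea mentioned above, here is a minimal sketch in Python rather than R. It assumes the PMML 4.4 namespace and a heavily trimmed, hypothetical fragment; `rewrite_thresholds` is an illustrative helper, not part of r2pmml or JPMML. As noted in the comment, rounding a threshold makes the document deviate from the original XGBoost model.

```python
import xml.etree.ElementTree as ET

PMML_NS = "http://www.dmg.org/PMML-4_4"  # assumption: the document uses PMML 4.4

# Hypothetical, heavily trimmed fragment; a real PMML document has many more elements.
FRAGMENT = f"""<PMML xmlns="{PMML_NS}" version="4.4">
  <TreeModel>
    <Node>
      <SimplePredicate field="x" operator="lessThan" value="0.123456173"/>
    </Node>
  </TreeModel>
</PMML>"""

def rewrite_thresholds(pmml_text, transform):
    """Apply `transform` to the numeric value of every SimplePredicate element."""
    ET.register_namespace("", PMML_NS)  # keep the default namespace on output
    root = ET.fromstring(pmml_text)
    for predicate in root.iter(f"{{{PMML_NS}}}SimplePredicate"):
        old_value = float(predicate.get("value"))
        predicate.set("value", repr(transform(old_value)))
    return ET.tostring(root, encoding="unicode")

# Example: round every split threshold to 6 decimal places.
rounded = rewrite_thresholds(FRAGMENT, lambda v: round(v, 6))
print(rounded)
```

The transformed document no longer carries the exact 32-bit thresholds, so its predictions can differ from XGBoost's for inputs near a split.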
I may be misunderstanding but the data feeding into xgboost is the 64-bit FP with value Anyway, as you suggest - a decent PMML engine should be able to represent 32-bit FP values. Thanks for your response and work on r2pmml!
Correct - the XGBoost code only deals with 32-bit FP values. The same applies if the XGBoost model is converted to the PMML representation. The PMML document contains an instruction "take user input, convert it to a 32-bit FP, and make comparisons in 32-bit mode".
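The instruction described above (cast the input to 32-bit, then compare in 32-bit mode) can be sketched in a few lines of Python. This is only an illustration using the value from this thread; `to_float32` is a hypothetical helper built on the standard `struct` module, and it does not reflect how JPMML or XGBoost implement the cast internally.

```python
import struct

def to_float32(x):
    """Round a 64-bit Python float to the nearest 32-bit float (returned as a double)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# 64-bit input value from the issue report.
x = 0.1234561702374036

# The split threshold stored in the model is the 32-bit version of that value.
threshold = to_float32(x)

# Correct evaluation: demote the input to 32-bit first, then compare.
correct = to_float32(x) < threshold

# Broken evaluation: compare the raw 64-bit input against the 32-bit threshold.
naive = x < threshold

print(format(threshold, ".9g"), correct, naive)
```

The two comparisons disagree for this input, so the broken variant selects a different tree branch than XGBoost does.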
This other program of yours is broken. You must "demote" (cast, losing precision) input from 64-bit to 32-bit. You cannot "promote" (cast, adding precision) 32-bit thresholds to 64-bit thresholds. The fact that
When using the xgboost package, I found some discrepancies between predictions in R and those made using the PMML file. Details of the issue can be found here:
dmlc/xgboost#7294
It seems to be an issue where xgboost uses float32 but (for example) base R uses double. In the example above, R has the value of 0.1234561702374036 but xgboost uses the float32 value 0.123456173. So the split in the tree that gets used in the pmml is < 0.123456173.

I was wondering if it is possible to change the split that appears in the pmml file so that this discrepancy doesn't occur. For example, r2pmml finds the lowest value of double that would convert to float32 of 0.123456173 and assigns that as the split point (e.g. < 0.1234561702374036 would fix the issue in this case).

I appreciate this might be an impossible request but any feedback would be welcome.
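To make the request concrete, the following Python sketch shows one way the "lowest double that converts to a given float32" could be computed. It is an assumption-laden illustration, not something r2pmml does: the helper names are hypothetical, the threshold is assumed to be a positive, finite float32 value, and `math.nextafter` requires Python 3.9 or later.

```python
import math
import struct

def to_float32(x):
    """Round a 64-bit Python float to the nearest 32-bit float (returned as a double)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def prev_float32(t):
    """Largest float32 strictly below t; assumes t is a positive, finite float32 value."""
    bits = struct.unpack("<I", struct.pack("<f", t))[0]
    return struct.unpack("<f", struct.pack("<I", bits - 1))[0]

def lowest_double_rounding_to(t):
    """Smallest double whose float32 rounding equals t (same assumptions as above)."""
    # The midpoint of two adjacent float32 values is exactly representable as a double.
    midpoint = (prev_float32(t) + t) / 2.0
    # Whether the midpoint itself rounds to t depends on the tie-breaking rule,
    # so check it and step one double upwards if it does not.
    if to_float32(midpoint) == t:
        return midpoint
    return math.nextafter(midpoint, math.inf)

x = 0.1234561702374036   # 64-bit value from the example
t = to_float32(x)        # the 32-bit split threshold
lo = lowest_double_rounding_to(t)

print(repr(lo), lo <= x)
```

Under these assumptions, `v < lo` evaluated in 64-bit mode gives the same answer as `to_float32(v) < t`. Note that the value proposed above, 0.1234561702374036, is just one double in that range and not necessarily its lower bound.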