Variable disappearance in xgboost model #56
TLDR: If you try to use the PMML document for scoring, do you get correct predictions or not? It is a conversion feature (not a bug) that the PMML document only contains information about features that are truly needed for scoring. All no-op features are automatically excluded.
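A quick way to see this pruning behaviour for yourself (a minimal sketch; assumes the R xgboost package, and the toy data and variable names here are invented for illustration): a feature that no tree ever splits on carries no scoring information, so a converter can safely drop it.

```r
library(xgboost)

set.seed(42)
# 'noise' is constant, so no tree will ever split on it
X <- cbind(signal = rnorm(100), noise = rep(0, 100))
y <- as.numeric(X[, "signal"] > 0)

bst <- xgboost(data = X, label = y, nrounds = 5,
               objective = "binary:logistic", verbose = 0)

# The importance table lists only features actually used by the trees;
# 'noise' does not appear, so it is a no-op feature for scoring purposes
xgb.importance(model = bst)
```

The same logic applies to a 26-variable model where only 14 variables ever get used in splits.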
The thing is that the PMML model does not contain the 14 most important variables, but simply the first 14 variables from the list, like var1, var2, var3, ..., var14.
I don't like how you're constructing a feature matrix object here: data = data.matrix(train[,leave] %>% select(-target_bin)). Define a proper … I suspect that your … Use the syntax provided in the README file (one central …).
This genDMatrix thing works!
The PMML model now contains all 26 variables, but somehow its probability output differs from the one I get using the predict function: …
But I still get a different output.
I suspect that the use of the … This is how my integration tests are generated: … The above code suggests that the predict function should work fine with a predict(xgb_bl1, newdata = dtest) call.
The … Perhaps this function should be rewritten in Java to make it scale to bigger datasets (IIRC, the current R implementation didn't scale beyond 10k data rows).
But it's hard to understand … And am I right that when I want to pass a test dataset to the final PMML model, I should first convert it using …? BTW, I didn't see any differences with or without …
Keep your features in one … There's no need to append …
It's because your dataset contains some categorical features. The … TLDR: the XGBoost package specifies a very difficult data input/output interface. My …
Another point: the conversion to PMML is not broken. It simply points out that you were using the XGBoost package incorrectly (because your dataset is a mix of continuous and categorical features).
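To make the continuous/categorical pitfall concrete (a base-R sketch; the toy data frame is invented for illustration): data.matrix() silently replaces a factor with its integer level codes, which XGBoost then treats as an ordinary numeric ordering, whereas one-hot encoding via model.matrix() preserves the categorical semantics.

```r
df <- data.frame(x = c(1.5, 2.0, 0.7),
                 col = factor(c("red", "blue", "red")))

# data.matrix() turns 'col' into its integer level codes (blue = 1, red = 2),
# so XGBoost would treat "red > blue" as a meaningful numeric ordering
m_codes <- data.matrix(df)

# model.matrix() one-hot encodes the factor instead; dropping the
# intercept (- 1) gives one dummy column per factor level
m_onehot <- model.matrix(~ . - 1, data = df)
colnames(m_onehot)  # "x" "colblue" "colred"
```

This is one plausible reason why predictions on factor-containing data would diverge between the R model and the converted PMML.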
Do you mean something like this?
Should be …
Sure, just misspelled.
So what? You're supplying a label directly to the …
Yes, I declare the label in the xgboost call, but I still get an error:
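I can't see the elided snippet above, but the usual pattern (a sketch with invented toy data; assumes the R xgboost package) is to attach the label to the xgb.DMatrix exactly once and not pass label= again to the training call:

```r
library(xgboost)

X <- matrix(rnorm(40), ncol = 4)
y <- rep(c(0, 1), 5)

# Attach the label to the DMatrix once...
dtrain <- xgb.DMatrix(data = X, label = y)

# ...and do not pass `label =` again to the training function
bst <- xgb.train(params = list(objective = "binary:logistic"),
                 data = dtrain, nrounds = 2)

# The same DMatrix can be scored directly
pred <- predict(bst, newdata = dtrain)
```

Supplying the label both inside the DMatrix and as a separate argument is a common source of errors with this interface.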
Hi, Villu. After a couple of tests I've just found out that if I drop all the factor variables, leaving only numeric ones, then I get the same prediction results from both the R and PMML models. But if I use factor variables, the results differ.
Categorical variables should use R's … Do your categorical columns use the …? Did you bother to look into my R-XGBoost to PMML integration tests (linked above)? They use categorical features, and their results are fully reproducible between R-XGBoost and (J)PMML.
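One related thing worth checking (a base-R sketch; the data is invented): the test data must be encoded against exactly the same factor levels, in the same order, as the training data, otherwise the integer codes and dummy columns will not line up between the two datasets.

```r
train_col <- factor(c("red", "blue", "green"))

# Re-encode the test column against the *training* levels, so that
# level codes and one-hot columns line up between train and test
test_col <- factor(c("green", "red"), levels = levels(train_col))

as.integer(test_col)  # green = 2, red = 3 under the training codebook
```

A mismatch here would also produce the symptom described above: numeric-only models agree, factor-containing models diverge.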
Yes, they are factors. And I have certainly read your test examples.
Now that's an interesting claim - I need to check this behaviour myself. I do intend to write a small technical article about "Converting R-XGBoost to PMML" fairly soon. I will use this issue and all its comments as reference material.
So, for now, I'll leave only numerics and 0/1 factors. I hope to solve the factor behaviour soon.
Hi.
I'm trying to convert my xgboost model to PMML format. The conversion goes OK, but instead of 26 variables I get only 14. The most interesting part is that they're the first 14 variables from the list. E.g., if I drop any variable from that top 14, the next, 15th, variable takes its place. I'm totally confused. Could someone suggest a solution? Or maybe it's a bug?