Variable disappearance in xgboost model #56

Closed
zakirovde opened this issue Mar 20, 2019 · 19 comments

@zakirovde

Hi.

I'm trying to convert my xgboost model to the PMML format. Conversion goes OK, but instead of 26 variables I get only 14. The most interesting part is that they are the first 14 variables from the list. E.g., if I drop any variable from that top 14, the next (15th) variable takes its place. I'm totally confused. Could someone suggest a solution? Or maybe it's a bug?

# target_bin is the binary target variable
# leave is the vector of column indices to keep:
#   leave <- which(names(train) %in% c('target_bin', 'var1', ...'var26'))

# The model itself
xgb_bl1 <- xgboost(data = data.matrix(train[,leave] %>% select(-target_bin)),
                   label = data.matrix(train[,leave]$target_bin),
                   eta = .15,
                   max_depth = 5,
                   nround = 300,
                   subsample = 0.65,
                   colsample_bytree = 0.35,
                   objective = "binary:logistic"
                   )
leave.fmap = genFMap(train[,leave] %>% select(-target_bin))
r2pmml(xgb_bl1, "xgb_virtu.pmml", fmap = leave.fmap, response_name = "target_bin", response_levels = c("0", "1"))
@vruusmann
Member

Conversion goes OK, but instead of 26 variables I get only 14

TLDR: If you try to use the PMML document for scoring, do you get correct predictions or not?

It is a conversion feature (not a bug) that the PMML document contains information only about features that are actually needed for scoring. All no-op features are automatically excluded.

@zakirovde
Author

Conversion goes OK, but instead of 26 variables I get only 14

TLDR: If you try to use the PMML document for scoring, do you get correct predictions or not?

The thing is that the PMML model does not contain the 14 most important variables, but simply the first 14 variables from the list: var1, var2, var3, ..., var14.
So I get wrong predictions.
Is there a way to turn off that feature, so the PMML model contains all the variables I want it to have?

@vruusmann
Member

vruusmann commented Mar 20, 2019

I don't like how you're constructing a feature matrix object here:

data = data.matrix(train[,leave] %>% select(-target_bin))

Define a proper data.frame object in one place (not once for the xgboost() function call, and another time for the genFMap() function call - they might give different results), and create a proper DMatrix object based on it using the r2pmml::genDMatrix() function.

I suspect that your data.frame objects are not consistent, and that the manual data.matrix() call reorders data columns one more time.

Use the syntax provided in the README file (one central data.frame, then feed it to the r2pmml::genDMatrix() and r2pmml::genFMap() functions). This should work. Once you've verified this claim locally, only then start making your own hacks.
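For reference, the README-style workflow could be sketched like this (an illustration only, reusing `train`, `leave` and the hyperparameters from the code above; the point is that genFMap() and genDMatrix() are fed the same data.frame, so column order and categorical encoding stay consistent):

```r
library(xgboost)
library(r2pmml)

# One central data.frame with features only; the label is kept separately
train_X <- train[, leave]
train_X$target_bin <- NULL
train_y <- as.numeric(train$target_bin) - 1  # 0/1 label vector

# Feature map and DMatrix are both derived from the SAME data.frame
train_fmap <- genFMap(train_X)
train_dmatrix <- genDMatrix(df_y = train_y, df_X = train_X)

xgb_bl1 <- xgboost(data = train_dmatrix,
                   eta = 0.15, max_depth = 5, nround = 300,
                   subsample = 0.65, colsample_bytree = 0.35,
                   objective = "binary:logistic")

r2pmml(xgb_bl1, "xgb_virtu.pmml", fmap = train_fmap,
       response_name = "target_bin", response_levels = c("0", "1"))
```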

@zakirovde
Author

zakirovde commented Mar 21, 2019

I don't like how you're constructing a feature matrix object here:

This genDMatrix thing works!

label <- as.numeric(train$target_bin) - 1
data <- as.matrix(train[, leave])
mode(data) <- "double"
dtrain <- genDMatrix(df_y = label, df_X = data)
xgb_bl1 <- xgboost(data = dtrain,
                   eta = .15,
                   max_depth = 5,
                   nround = 300,
                   subsample = 0.65,
                   colsample_bytree = 0.35,
                   objective = "binary:logistic"
                   )

The PMML model contains all 26 variables, but somehow its probability output differs from the one I get using the predict function:
predict(xgb_bl1, data.matrix(test[,leave]))
I'm not sure that this is the correct use of predict, so I even tried this:

dtest <- genDMatrix(df_X = test[,leave], df_y = NULL)
#if I don't declare df_y, then I get an error
predict(xgb_bl1, dtest)

But I still get a different output.
Maybe I'm passing the test dataset to the PMML model in a wrong way? I just send it in JSON format. But still, comparing predict results in both the genDMatrix and the original form, I get different results.

@vruusmann
Member

vruusmann commented Mar 21, 2019

I suspect that the use of the data.matrix() function is the problem here - perhaps it reorders columns based on some internal logic? This leads to a situation where the column ordering is not consistent between the training and test/predict runs.

This is how my integration tests are generated:
https://github.com/jpmml/jpmml-r/blob/master/src/test/R/xgboost.R

The above code suggests that the predict() function should work fine with a DMatrix that contains only feature columns. Perhaps you need to spell out the name of the newdata argument?

predict(xgb_bl1, newdata = dtest)

@vruusmann
Member

The r2pmml::genDMatrix() function is not particularly efficient or elegant, but in my experience it works much more reliably than an in-memory data.frame -> matrix -> DMatrix conversion workflow.

Perhaps this function should be rewritten in Java to make it scale to bigger datasets (IIRC, the current R implementation doesn't scale beyond 10k data rows).

@zakirovde
Author

The r2pmml::genDMatrix() function is not particularly efficient or elegant, but in my experience it works much more reliably than an in-memory data.frame -> matrix -> DMatrix conversion workflow.

But it's hard to understand the behaviour of the r2pmml::genDMatrix() function, because, as I understand it, we can't see a table view of a dataset converted by this function. And the strangest thing is that when I create such a dataset:
dtrain <- genDMatrix(df_y = label, df_X = train[,leave])
leave is the list of 26 variables, so we should get a DMatrix of 26 variables and one target variable.
After that I train an xgboost model on that dataset and try to check feature importances:
xgb.importance(colnames(train[,leave]), model = xgb_bl1)
I get an error: Error in View : feature_names has less elements than there are features used in the model
Just as a test I ran this code:
xgb.importance(colnames(train[,1:200]), model = xgb_bl1)
And then I get a list of the 131 most important variables! Where did it get such a number, when I passed only 27 variables (including one target) to the DMatrix training dataset?

And am I right that when I want to pass the test dataset to the final PMML model, I should first convert it using genDMatrix, and only then pass it to the model?

BTW, I didn't see any difference with or without the use of newdata.

@vruusmann
Member

vruusmann commented Mar 22, 2019

so we should get a DMatrix of 26 variables and one target variable.

Keep your features in one data.frame object, and the label column in another. IIRC, the xgboost() function allows you to specify X and y separately; if so, pass X as a DMatrix (generated using the r2pmml::genDMatrix() function), and y as a suitable vector object type.

There's no need to append y to the feature matrix.

Where did it get such a number, when I passed only 27 variables (including one target) to the DMatrix training dataset?

It's because your dataset contains some categorical features. The r2pmml::genDMatrix() function expands a single categorical feature column into multiple binary feature columns (one for each category level). If you inspect the feature map produced by the r2pmml::genFMap() function, you will find a data.frame with 131 columns as well.

TLDR: The XGBoost package specifies a very difficult data input/output interface. My genFMap() and genDMatrix() utility functions provide a slow but correct way of working with it. Unless you understand the internals of the XGBoost package very well, do not attempt any shortcuts (such as using data.matrix instead of DMatrix). Also, shortcuts that are valid with continuous features stop working when there are categorical features in the dataset.
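The expansion can be illustrated with base R's model.matrix(), which performs a comparable one-hot encoding (the toy data below is invented for illustration; genDMatrix's internal encoding is analogous, not identical):

```r
# A toy data.frame with one continuous and one categorical column
df <- data.frame(
  age = c(25, 32, 47),
  city = factor(c("London", "New York", "London"))
)

# Dummy coding: each factor level becomes its own 0/1 column, so the
# 2 input columns expand to 3 feature columns (age, cityLondon, cityNew York)
X <- model.matrix(~ . - 1, data = df)
ncol(X)  # 3, not 2
```

This is how a 26-column data.frame can surface as 131 feature columns in the trained model.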

@vruusmann
Member

Another point - the conversion to PMML is not broken. It simply points out that you were using the XGBoost package incorrectly (because your dataset is a mix of continuous and categorical features).

@zakirovde
Author

Keep your features in one data.frame object, and the label column in another. IIRC, the xgboost() function allows you to specify X and y separately; if so, pass X as a DMatrix (generated using the r2pmml::genDMatrix() function), and y as a suitable vector object type.

There's no need to append y to the feature matrix.

Do you mean something like this?

dtrain <- genDMatrix(df_y = NULL, df_X = train[,leave])
label <- as.numeric(train$target_bin)-1
xgb_bl1 <- xgboost(data = data, label=label,
                  eta = .15,
                  max_depth = 5, 
                  nround=300, 
                  subsample = 0.65,
                  colsample_bytree = 0.35,
                  objective = "binary:logistic"
                  )
dtest <- genDMatrix(df_X = test[,leave], df_y = NULL)
predict(xgb_bl1, newdata=dtest)

@vruusmann
Member

dtrain <- genDMatrix(df_y = NULL, df_X = train[,leave])
xgb_bl1 <- xgboost(data = data, label=label, ..)

Should be xgboost(data = dtrain, label = label, ..)

@zakirovde
Author

dtrain <- genDMatrix(df_y = NULL, df_X = train[,leave])
xgb_bl1 <- xgboost(data = data, label=label, ..)

Should be xgboost(data = dtrain, label = label, ..)

Sure, just a typo.
Unfortunately, in this case xgboost ignores the label:

Warning message:
In xgb.get.DMatrix(data, label, missing, weight) :
  xgboost: label will be ignored.

@vruusmann
Member

Unfortunately, in this case xgboost ignores the label:

So what? You're supplying the label directly to the xgboost() function using the label argument.

@zakirovde
Author

Unfortunately, in this case xgboost ignores the label:

So what? You're supplying the label directly to the xgboost() function using the label argument.

Yes, I declare the label in the xgboost() call, but still get the warning:

dtrain <- genDMatrix(df_y = NULL, df_X = train[,leave])
xgb_bl1 <- xgboost(data = dtrain, label=train$target_bin,
.....
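The warning arises because xgboost() ignores its label argument whenever `data` is already a DMatrix. One way around this, sketched here under the assumption that `dtrain` was built without a label (as in the code above), is to attach the label to the DMatrix itself via xgboost's setinfo():

```r
library(xgboost)

# Attach the 0/1 label directly to the DMatrix; the label argument of
# xgboost() has no effect when data is already a DMatrix
setinfo(dtrain, "label", as.numeric(train$target_bin) - 1)

xgb_bl1 <- xgboost(data = dtrain,
                   eta = 0.15,
                   max_depth = 5,
                   nround = 300,
                   subsample = 0.65,
                   colsample_bytree = 0.35,
                   objective = "binary:logistic")
```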

@zakirovde
Author

Hi, Villu.

After a couple of tests I've found that if I drop all the factor variables, leaving only numeric ones, then I get the same prediction results from both the R and PMML models. But if I use factor variables, then the results differ.
Do you know if I should prepare categorical variables in some special way, or is there a mistake in the r2pmml package?
My factor variables are either 0/1 ones, or more complicated ones, like city numbers (e.g., 1 is London, 2 is New York, etc.).

@vruusmann
Member

Do you know if I should prepare categorical variables some special way or there is a mistake in rpmml package's work?

Categorical variables should use R's factor data type. Both r2pmml::genFMap() and r2pmml::genDMatrix() then use this information to generate proper feature map and DMatrix objects.

Do your categorical columns use the factor data type? Or do they use character?

Did you bother to look into my R-XGBoost to PMML integration tests (linked above)? They use categorical features, and their results are fully reproducible between R-XGBoost and (J)PMML.
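One quick way to verify this, sketched with the `train` data.frame and `leave` index vector from this thread, is to inspect the column classes and convert any character columns to factors before generating the feature map and DMatrix:

```r
# Inspect the class of every candidate feature column
sapply(train[, leave], class)

# Convert character columns to factors, so that genFMap()/genDMatrix()
# can expand them into binary feature columns
char_cols <- sapply(train, is.character)
train[char_cols] <- lapply(train[char_cols], factor)
```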

@zakirovde
Author

Yes, they are factors. And I have certainly read your test examples.
The thing is, when I use, e.g., a factor variable which can only be equal to 0 or 1, then I get a wrong answer from the PMML model. But when I turn it into a numeric with train$var <- as.numeric(train$var) - 1, then I get the correct answer. That's why I'm so confused.

@vruusmann
Member

The thing is, when I use, e.g., a factor variable which can only be equal to 0 or 1, then I get a wrong answer from the PMML model.

Now that's an interesting claim - I need to check this behaviour myself.

I do intend to write a small technical article about "Converting R-XGBoost to PMML" fairly soon. Will use this issue and all its comments as a reference material.

@zakirovde
Author

So, for now, I'll keep only numerics and 0/1 factors. I hope the factor behaviour gets solved soon.
Thanks a lot for your brilliant packages and quick replies!
