Variable disappearance in xgboost model #56

Closed
zakirovde opened this issue Mar 20, 2019 · 19 comments

@zakirovde

Hi.

I'm trying to convert my xgboost model to the PMML format. Conversion goes OK, but instead of 26 variables I get only 14. The most interesting part is that they are the first 14 variables from the list. E.g., if I drop any variable from that top 14, the next (15th) variable takes its place. I'm totally confused. Could someone suggest a solution? Or maybe it's a bug?

# target_bin is the binary target variable
# leave is the vector of column indices to keep:
#   leave <- which(names(train) %in% c('target_bin', 'var1', ...'var26'))

# The model itself
xgb_bl1 <- xgboost(data = data.matrix(train[,leave] %>% select(-target_bin)),
                   label = data.matrix(train[,leave]$target_bin),
                   eta = .15,
                   max_depth = 5,
                   nround = 300,
                   subsample = 0.65,
                   colsample_bytree = 0.35,
                   objective = "binary:logistic"
                   )
leave.fmap = genFMap(train[,leave] %>% select(-target_bin))
r2pmml(xgb_bl1, "xgb_virtu.pmml", fmap = leave.fmap, response_name = "target_bin", response_levels = c("0", "1"))
@vruusmann
Member

Conversion goes OK, but instead of 26 variables I get only 14

TLDR: If you try to use the PMML document for scoring, do you get correct predictions or not?

It is a conversion feature (not a bug) that the PMML document contains information only about features that are actually needed for scoring. All no-op features are automatically excluded.

@zakirovde
Author

Conversion goes OK, but instead of 26 variables I get only 14

TLDR: If you try to use the PMML document for scoring, do you get correct predictions or not?

The thing is that the PMML model does not contain the 14 most important variables, but simply the first 14 variables from the list: var1, var2, var3, ..., var14.
So I get wrong predictions.
Is there a way to turn off that feature, so the PMML model contains all the variables I want it to have?

@vruusmann
Member

vruusmann commented Mar 20, 2019

I don't like how you're constructing a feature matrix object here:

data = data.matrix(train[,leave] %>% select(-target_bin))

Define a proper data.frame object in one place (not once for the xgboost() function call, and another time for the genFMap() function call - they might give different results), and create a proper DMatrix object based on it using the r2pmml::genDMatrix() function.

I suspect that your data.frame objects are not consistent, and that the manual data.matrix() call reorders data columns one more time.

Use the syntax provided in the README file (one central data.frame, then feed it to the r2pmml::genDMatrix() and r2pmml::genFMap() functions). This should work. Once you've verified this claim locally, only then start making your own hacks.
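For reference, the README-style workflow could be sketched like this (an illustration only, reusing `train`, `leave` and the hyperparameters from the code above; the point is that genFMap() and genDMatrix() are fed the same data.frame, so column order and categorical encoding stay consistent):

```r
library(xgboost)
library(r2pmml)

# One central data.frame with features only; the label is kept separately
train_X <- train[, leave]
train_X$target_bin <- NULL
train_y <- as.numeric(train$target_bin) - 1  # 0/1 label vector

# Feature map and DMatrix are both derived from the SAME data.frame
train_fmap <- genFMap(train_X)
train_dmatrix <- genDMatrix(df_y = train_y, df_X = train_X)

xgb_bl1 <- xgboost(data = train_dmatrix,
                   eta = 0.15, max_depth = 5, nround = 300,
                   subsample = 0.65, colsample_bytree = 0.35,
                   objective = "binary:logistic")

r2pmml(xgb_bl1, "xgb_virtu.pmml", fmap = train_fmap,
       response_name = "target_bin", response_levels = c("0", "1"))
```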

@zakirovde
Author

zakirovde commented Mar 21, 2019

I don't like how you're constructing a feature matrix object here:

This genDMatrix thing works!

label <- as.numeric(train$target_bin) - 1
data <- as.matrix(train[, leave])
mode(data) <- "double"
dtrain <- genDMatrix(df_y = label, df_X = data)
xgb_bl1 <- xgboost(data = dtrain,
                   eta = .15,
                   max_depth = 5,
                   nround = 300,
                   subsample = 0.65,
                   colsample_bytree = 0.35,
                   objective = "binary:logistic"
                   )

The PMML model contains all 26 variables, but somehow its probability output differs from the one I get using the predict function:
predict(xgb_bl1, data.matrix(test[,leave]))
I'm not sure that this is the correct use of predict, so I even tried this:

dtest <- genDMatrix(df_X = test[,leave], df_y = NULL)
#if I don't declare df_y, then I get an error
predict(xgb_bl1, dtest)

But I still get a different output.
Maybe I'm passing the test dataset to the PMML model in a wrong way? I just send it in JSON format. But still, comparing predict results in both the genDMatrix and the original form, I get different results.

@vruusmann
Member

vruusmann commented Mar 21, 2019

I suspect that the use of the data.matrix() function is the problem here - perhaps it reorders columns based on some internal logic? This leads to a situation where the column ordering is not consistent between the training and test/predict runs.

This is how my integration tests are generated:
https://github.com/jpmml/jpmml-r/blob/master/src/test/R/xgboost.R

The above code suggests that the predict() function should work fine with a DMatrix that contains only feature columns. Perhaps you need to spell out the name of the newdata argument?

predict(xgb_bl1, newdata = dtest)

@vruusmann
Member

The r2pmml::genDMatrix() function is not particularly efficient or elegant, but in my experience it works much more reliably than an in-memory data.frame -> matrix -> DMatrix conversion workflow.

Perhaps this function should be rewritten in Java to make it scale to bigger datasets (IIRC, the current R implementation doesn't scale beyond 10k data rows).

@zakirovde
Author

The r2pmml::genDMatrix() function is not particularly efficient or elegant, but in my experience it works much more reliably than an in-memory data.frame -> matrix -> DMatrix conversion workflow.

But it's hard to understand the behaviour of the r2pmml::genDMatrix() function, because, as I understand it, we can't see a table view of a dataset converted by this function. And the strangest thing is that when I create such a dataset:
dtrain <- genDMatrix(df_y = label, df_X = train[,leave])
leave is the list of 26 variables, so we should get a DMatrix of 26 variables and one target variable.
After that I train an xgboost model on that dataset and try to check feature importances:
xgb.importance(colnames(train[,leave]), model = xgb_bl1)
I get an error: Error in View : feature_names has less elements than there are features used in the model
Just as a test I ran this code:
xgb.importance(colnames(train[,1:200]), model = xgb_bl1)
And then I get a list of the 131 most important variables! Where did it get such a number, when I passed only 27 variables (including one target) to the DMatrix training dataset?

And am I right that when I want to pass the test dataset to the final PMML model, I should first convert it using genDMatrix, and only then pass it to the model?

BTW, I didn't see any difference with or without the use of newdata.

@vruusmann
Member

vruusmann commented Mar 22, 2019

so we should get a DMatrix of 26 variables and one target variable.

Keep your features in one data.frame object, and the label column in another. IIRC, the xgboost() function allows you to specify X and y separately; if so, pass X as a DMatrix (generated using the r2pmml::genDMatrix() function), and y as a suitable vector object type.

There's no need to append y to the feature matrix.

Where did it get such a number, when I passed only 27 variables (including one target) to the DMatrix training dataset?

It's because your dataset contains some categorical features. The r2pmml::genDMatrix() function expands a single categorical feature column into multiple binary feature columns (one for each category level). If you inspect the feature map produced by the r2pmml::genFMap() function, you will find a data.frame with 131 columns as well.

TLDR: The XGBoost package specifies a very difficult data input/output interface. My genFMap() and genDMatrix() utility functions provide a slow but correct way of working with it. Unless you understand the internals of the XGBoost package very well, do not attempt any shortcuts (such as using data.matrix instead of DMatrix). Also, shortcuts that are valid with continuous features stop working when there are categorical features in the dataset.
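The expansion can be illustrated with base R's model.matrix(), which performs a comparable one-hot encoding (the toy data below is invented for illustration; genDMatrix's internal encoding is analogous, not identical):

```r
# A toy data.frame with one continuous and one categorical column
df <- data.frame(
  age = c(25, 32, 47),
  city = factor(c("London", "New York", "London"))
)

# Dummy coding: each factor level becomes its own 0/1 column, so the
# 2 input columns expand to 3 feature columns (age, cityLondon, cityNew York)
X <- model.matrix(~ . - 1, data = df)
ncol(X)  # 3, not 2
```

This is how a 26-column data.frame can surface as 131 feature columns in the trained model.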

@vruusmann
Member

Another point - the conversion to PMML is not broken. It simply points out that you were using the XGBoost package incorrectly (because your dataset is a mix of continuous and categorical features).

@zakirovde
Author

Keep your features in one data.frame object, and the label column in another. IIRC, the xgboost() function allows you to specify X and y separately; if so, pass X as a DMatrix (generated using the r2pmml::genDMatrix() function), and y as a suitable vector object type.

There's no need to append y to the feature matrix.

Do you mean something like this?

dtrain <- genDMatrix(df_y = NULL, df_X = train[,leave])
label <- as.numeric(train$target_bin)-1
xgb_bl1 <- xgboost(data = data, label=label,
                  eta = .15,
                  max_depth = 5, 
                  nround=300, 
                  subsample = 0.65,
                  colsample_bytree = 0.35,
                  objective = "binary:logistic"
                  )
dtest <- genDMatrix(df_X = test[,leave], df_y = NULL)
predict(xgb_bl1, newdata=dtest)

@vruusmann
Member

dtrain <- genDMatrix(df_y = NULL, df_X = train[,leave])
xgb_bl1 <- xgboost(data = data, label=label, ..)

Should be xgboost(data = dtrain, label = label, ..)

@zakirovde
Author

dtrain <- genDMatrix(df_y = NULL, df_X = train[,leave])
xgb_bl1 <- xgboost(data = data, label=label, ..)

Should be xgboost(data = dtrain, label = label, ..)

Sure, just a typo.
Unfortunately, in this case xgboost ignores the label:

Warning message:
In xgb.get.DMatrix(data, label, missing, weight) :
  xgboost: label will be ignored.

@vruusmann
Member

Unfortunately, in this case xgboost ignores the label:

So what? You're supplying the label directly to the xgboost() function using the label argument.

@zakirovde
Author

Unfortunately, in this case xgboost ignores the label:

So what? You're supplying the label directly to the xgboost() function using the label argument.

Yes, I declare the label in the xgboost() call, but still get the warning:

dtrain <- genDMatrix(df_y = NULL, df_X = train[,leave])
xgb_bl1 <- xgboost(data = dtrain, label=train$target_bin,
.....
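The warning arises because xgboost() ignores its label argument whenever `data` is already a DMatrix. One way around this, sketched here under the assumption that `dtrain` was built without a label (as in the code above), is to attach the label to the DMatrix itself via xgboost's setinfo():

```r
library(xgboost)

# Attach the 0/1 label directly to the DMatrix; the label argument of
# xgboost() has no effect when data is already a DMatrix
setinfo(dtrain, "label", as.numeric(train$target_bin) - 1)

xgb_bl1 <- xgboost(data = dtrain,
                   eta = 0.15,
                   max_depth = 5,
                   nround = 300,
                   subsample = 0.65,
                   colsample_bytree = 0.35,
                   objective = "binary:logistic")
```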

@zakirovde
Author

Hi, Villu.

After a couple of tests I've found that if I drop all the factor variables, leaving only numeric ones, then I get the same prediction results from both the R and PMML models. But if I use factor variables, then the results differ.
Do you know if I should prepare categorical variables in some special way, or is there a mistake in the r2pmml package?
My factor variables are either 0/1 ones, or more complicated ones, like city numbers (e.g., 1 is London, 2 is New York, etc.).

@vruusmann
Member

Do you know if I should prepare categorical variables some special way or there is a mistake in rpmml package's work?

Categorical variables should use R's factor data type. Both r2pmml::genFMap() and r2pmml::genDMatrix() then use this information to generate proper feature map and DMatrix objects.

Do your categorical columns use the factor data type? Or do they use character?

Did you bother to look into my R-XGBoost to PMML integration tests (linked above)? They use categorical features, and their results are fully reproducible between R-XGBoost and (J)PMML.
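One quick way to verify this, sketched with the `train` data.frame and `leave` index vector from this thread, is to inspect the column classes and convert any character columns to factors before generating the feature map and DMatrix:

```r
# Inspect the class of every candidate feature column
sapply(train[, leave], class)

# Convert character columns to factors, so that genFMap()/genDMatrix()
# can expand them into binary feature columns
char_cols <- sapply(train, is.character)
train[char_cols] <- lapply(train[char_cols], factor)
```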

@zakirovde
Author

Yes, they are factors. And I have certainly read your test examples.
The thing is, when I use, e.g., a factor variable which can only be equal to 0 or 1, then I get a wrong answer from the PMML model. But when I turn it into a numeric with train$var <- as.numeric(train$var) - 1, then I get the correct answer. That's why I'm so confused.

@vruusmann
Member

The thing is, when I use, e.g., a factor variable which can only be equal to 0 or 1, then I get a wrong answer from the PMML model.

Now that's an interesting claim - I need to check this behaviour myself.

I do intend to write a small technical article about "Converting R-XGBoost to PMML" fairly soon. Will use this issue and all its comments as a reference material.

@zakirovde
Author

So, for now, I'll keep only numerics and 0/1 factors. I hope the factor behaviour gets solved soon.
Thanks a lot for your brilliant packages and quick replies!
