
update feature_fraction_bynode #2381

Merged: 4 commits merged into master from node-ff on Sep 12, 2019

Conversation

@guolinke (Collaborator) commented Sep 5, 2019

I just noticed that feature_fraction has an alias, colsample_bytree. Therefore, reusing it for column sampling by node is not straightforward.
The following is the new definition of feature_fraction_bynode:

// alias = sub_feature_bynode, colsample_bynode
// check = >0.0
// check = <=1.0
// desc = LightGBM will randomly select part of features on each tree node if feature_fraction smaller than 1.0. For example, if you set it to 0.8, LightGBM will select 80% of features at each tree node.
// desc = can be used to deal with over-fitting
// desc = Note: unlike feature_fraction, this cannot speed up training
// desc = Note: if both feature_fraction and feature_fraction_bynode are smaller than 1.0, the final fraction of each node is feature_fraction * feature_fraction_bynode
double feature_fraction_bynode = 1.0;
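
For illustration only (this snippet is not part of the PR diff), a minimal R sketch of how the new parameter could be set once this lands; the 0.8 values and the mtcars toy data are my own placeholders. Per the note above, the effective per-node fraction would be the product 0.8 * 0.8 = 0.64.

library(lightgbm)

# Toy data, purely for illustration.
X <- as.matrix(mtcars[, -1])
dtrain <- lgb.Dataset(X, label = mtcars$mpg)

params <- list(objective = "regression",
               feature_fraction = 0.8,         # column sampling per tree
               feature_fraction_bynode = 0.8)  # column sampling per tree node;
                                               # effective per-node fraction: 0.8 * 0.8 = 0.64

fit <- lgb.train(params, dtrain, nrounds = 10, verbose = 0)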

ping @BlindApe for the changes.

@guolinke (Collaborator, PR author) commented Sep 5, 2019

Now the behavior is the same as xgb's.

@StrikerRUS (Collaborator) left a review comment

As usual, minor style comments from me 😄

Two resolved review threads on include/LightGBM/config.h (outdated)
@StrikerRUS mentioned this pull request on Sep 10, 2019
@guolinke (Collaborator, PR author):

@StrikerRUS can this be merged?

@StrikerRUS (Collaborator):

@guolinke: "can this be merged?"

Yeah, I think so. I like the new way of setting feature_fraction_bynode more than the previous one and find it more intuitive.

@StrikerRUS merged commit ad8e8cc into master on Sep 12, 2019
@StrikerRUS deleted the node-ff branch on September 12, 2019 (12:53)
@mayer79 (Contributor) commented Sep 14, 2019

A great feature. Since per-node feature sampling is one of the core elements that make random forests shine, at least in high-dimensional settings, I played with the diamonds data set from ggplot2 in R and tried to predict log(price) from a handful of features.

The results are a bit unsettling:

  • a "real" RF implementation (ranger) with per-node feature sampling (mtry) -> test R-squared = 0.99

  • xgb in random forest mode, with and without colsample_bynode -> both R-squared 0.99

  • same for lgb -> without feature_fraction_bynode, R-squared is 0.99; with feature_fraction_bynode, only 0.91.

What could be the reason for this sudden drop in the lgb version? (Of course, we cannot compare the results directly due to the different parametrizations.)

library(tidyverse)
library(xgboost)
library(lightgbm)
library(ranger)

# Function to measure performance
perf <- function(y, pred) {
  res <- y - pred
  c(r2 = 1 - var(res) / var(y),
    rmse = sqrt(mean(res^2)),
    mae = mean(abs(res)))
}

#==============================
# DATA PREP
#==============================

diamonds <- diamonds %>% 
  mutate_if(is.ordered, as.numeric) %>% 
  mutate(log_price = log(price),
         log_carat = log(carat))

# Train/test split
set.seed(3928272)
.in <- sample(c(FALSE, TRUE), nrow(diamonds), replace = TRUE, prob = c(0.15, 0.85))

x <- c("log_carat", "cut", "color", "clarity", "depth", "table")
y <- "log_price"

train <- list(y = diamonds[[y]][.in], 
              X = as.matrix(diamonds[.in, x]))
test <- list(y = diamonds[[y]][!.in],
             X = as.matrix(diamonds[!.in, x]))
trainDF <- diamonds[.in, c(y, x)]
testDF <- diamonds[!.in, c(y, x)]

# For XGBoost
dtrain_xgb <- xgb.DMatrix(train$X, label = train$y)
watchlist <- list(train = dtrain_xgb)

# For lgb
dtrain_lgb <- lgb.Dataset(train$X, label = train$y)

#==============================
# MODELLING
#==============================
feature_fraction <- 1/3
mtry <- trunc(length(x) * feature_fraction)


#==============================
# A "real" rf
#==============================

system.time(fit_ranger <- ranger(reformulate(x, y), 
                                 data = trainDF, 
                                 num.trees = 100, 
                                 min.node.size = 5,
                                 mtry = mtry,
                                 seed = 837363)) # 1 sec
pred <- predict(fit_ranger, testDF)$predictions
perf(test$y, pred) # 0.989 R-squared

#==============================
# xgb without feature frac
#==============================

param_xgb <- list(max_depth = 20,
                  learning_rate = 1,
                  nthread = 4,
                  objective = "reg:squarederror",
                  eval_metric = "rmse",
                  subsample = 0.63,
            #      colsample_bynode = feature_fraction,
                  lambda = 0)

system.time(fit_xgb <- xgb.train(param_xgb,
                                 dtrain_xgb,
                                 watchlist = watchlist,
                                 nrounds = 1,
                                 num_parallel_tree = 100,
                                 verbose = 0)) # 5-8 sec
pred <- predict(fit_xgb, test$X)
perf(test$y, pred) # R-squared: 0.989 with feature frac, 0.989 without


#==============================
# lgb
#==============================

param_lgb <- list(max_depth = 20,
                  learning_rate = 1,
                  boosting = "rf",
                  nthread = 4,
                  min_data_in_leaf = 5,
                  num_leaves = 1000,
                  objective = "regression",
                  bagging_freq = 1,
           #       feature_fraction_bynode = feature_fraction,
                  bagging_fraction = 0.63,
                  metric = "rmse")

system.time(fit_lgb <- lgb.train(param_lgb,
                                 dtrain_lgb,
                                 nrounds = 100,
                                 verbose = 0)) # 2 sec
pred <- predict(fit_lgb, test$X)
perf(test$y, pred) # R-squared: 0.990 without feature frac, 0.913 with

@guolinke (Collaborator, PR author):

@mayer79 did you try different seeds?

@mayer79 (Contributor) commented Sep 14, 2019

Not yet. With this data set, even an R-squared of 0.98 would be quite bad, so a value of 0.91 is extreme.

@guolinke (Collaborator, PR author):

@mayer79 could you provide the data file? CSV format would be better.

@mayer79 (Contributor) commented Sep 14, 2019

It is shipped along with ggplot2 in R. The raw source is https://github.com/tidyverse/ggplot2/blob/master/data-raw/diamonds.csv

@guolinke (Collaborator, PR author):

@mayer79
I checked your data: there are only 6 features, so with a fraction of 0.33 only one feature is chosen at each node.
In that case tree growth stops easily, which causes the bad result.
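
To spell the arithmetic out (my own illustration; it assumes the per-node feature count is obtained by truncating num_features * fraction, which is what the reported behaviour suggests rather than anything confirmed here):

n_features <- 6
fraction   <- 0.33
trunc(n_features * fraction)  # = 1: a single random candidate feature per split,
                              # so the split variable is effectively chosen at random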

BTW, when I change xgb's column sample rate to 0.33 as well, its R-squared drops to roughly 0.87:

> param_xgb <- list(max_depth = 20,
+                   learning_rate = 1,
+                   nthread = 4,
+                   objective = "reg:squarederror",
+                   eval_metric = "rmse",
+                   subsample = 0.63,
+                   colsample_bynode = 0.33,
+                   lambda = 0)
>
> system.time(fit_xgb <- xgb.train(param_xgb,
+                                  dtrain_xgb,
+                                  watchlist = watchlist,
+                                  nrounds = 1,
+                                  num_parallel_tree = 100,
+                                  verbose = 0)) # 5-8 sec
   user  system elapsed
   2.88    0.26    1.66
> pred <- predict(fit_xgb, test$X)
> perf(test$y, pred) # R-squared: 0.989 with feature frac, 0.989 without
       r2      rmse       mae
0.8687868 0.3649375 0.2843431

@guolinke (Collaborator, PR author):

@mayer79
So I think a quick improvement would be to force at least 2 features to be chosen at each node. Otherwise it is like random learning, since the single feature used at each split is picked at random.

@mayer79 (Contributor) commented Sep 14, 2019

@guolinke: It indeed seems like a rounding issue. I was using the floating-point value 1/3 as the rate, which leads xgb to sample 6 * 1/3 = 2 features but lgb to sample only one. I would not force sampling of at least 2 features, but rather keep your implementation as it is.

@mayer79 (Contributor) commented Sep 15, 2019

Maybe rounding up the number of sampled columns would be an idea.
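
A small R sketch, purely to contrast the two remedies discussed above (forcing at least two candidate features vs. rounding up) for this 6-feature case; neither is claimed to be what LightGBM implements:

n_features <- 6
fraction   <- 0.33

trunc(n_features * fraction)           # 1: truncation, matching the observed behaviour
max(2L, trunc(n_features * fraction))  # 2: "force at least 2 features" idea
ceiling(n_features * fraction)         # 2: "round up" idea

Rounding up would also guarantee that at least one feature is always selected, even for very small fractions.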

The lock bot locked this conversation as resolved and limited it to collaborators on Mar 10, 2020