
Dart - very poor accuracy #126

Closed · gugatr0n1c opened this issue Dec 16, 2016 · 60 comments

@gugatr0n1c commented Dec 16, 2016

When I use dart as the booster, I always get very poor performance in terms of the L2 metric on my regression task, whether I use a small drop_rate like 0.01 or a big one like 0.3.

When I use dart in xgboost on the same dataset with similar settings (same learning rate, similar num_trees), dart always gives me an accuracy boost (small, but consistent).

But here the accuracy is badly degraded, as if there were a bug, not merely dart being unsuitable for my task.

Can anyone confirm that dart improves accuracy on regression tasks?

My settings are as follows (part of the Python code for a random search over parameters):

import numpy as np
import lightgbm as lg

lr = np.random.choice([0.01, 0.005, 0.0025])
list_count = np.random.choice([250, 500, 750, 1000])
min_in_leaf = np.random.choice([25, 50, 100])
subF = np.random.choice([0.15, 0.22, 0.3, 0.5, 0.66, 0.75])
subR = np.random.choice([0.66, 0.75, 0.83, 0.9])
max_depth = np.random.choice([9, 11, 15, 25, 45, 100, -1])
dart_rate = np.random.choice([0, 0, 0, 0.01, 0.03, 0.1])
max_bin = np.random.choice([63, 127, 255, 511])
lambda_l1 = np.random.choice([0, 1., 10., 100.])
lambda_l2 = np.random.choice([0, 1., 10., 100.])

iterace = 10000
if only_testing:
    min_in_leaf = 25
    iterace = 10

boost_type = 'gbdt'
if dart_rate > 0:
    boost_type = 'dart'

params = {
    'task': 'train',
    'boosting_type': boost_type,
    'objective': 'regression',
    'metric': 'l2',
    'max_depth': int(max_depth),
    'num_leaves': int(list_count),
    'min_data_in_leaf': int(min_in_leaf),
    'learning_rate': lr,
    'feature_fraction': subF,
    'bagging_fraction': subR,
    'bagging_freq': 1,
    'verbose': 0,
    'nthread': nthread,
    'drop_rate': dart_rate,
    'max_bin': max_bin,
    'lambda_l1': lambda_l1,
    'lambda_l2': lambda_l2
}

model = lg.train(
    params,
    lg.Dataset(matrix_learn, target_learn),
    num_boost_round=iterace,
    valid_sets=[lg.Dataset(matrix_test, target_test)],
    early_stopping_rounds=50
)

@gugatr0n1c (Author):

What works in XGBoost is probably not just drop_rate but also skip_drop: http://xgboost.readthedocs.io/en/latest/tutorials/dart.html

Can you please support this as well?

@guolinke (Collaborator):

@wxchan

https://github.com/dmlc/xgboost/blob/master/src/gbm/gbtree.cc#L547-L711

can you check the implementation in XGBoost and figure out why?

@wxchan (Contributor) commented Dec 17, 2016

@guolinke I read that code before. It's not hard to add, but I have no idea where those parameters came from.

@wxchan (Contributor) commented Dec 17, 2016

@gugatr0n1c

  1. Can you show how you set up skip_rate etc. for xgboost?
  2. Do you know where xgboost got the ideas for skip_rate, sample_type and normalize_type?
  3. Can you show accuracy comparisons between xgboost and lightgbm (both gbdt & dart) for your task?

@guolinke (Collaborator):

@wxchan, I am not familiar with XGBoost. The following is my guess:

  1. weight_drop holds the weights of all trees, not just the dropped ones, right? If yes, I think the line you refer to is a bug: it should be weight_drop[idx_drop[i]] *= factor. If this is a bug, I am not sure why XGBoost can perform better....
  2. It seems the tree-dropping logic in XGBoost is based on sum_weight, whereas LightGBM's is purely random. Maybe we can adopt this.

@wxchan (Contributor) commented Dec 17, 2016

@guolinke I can add that option. I will do some investigation on this.

@gugatr0n1c (Author):

@wxchan
ad 1] Usually I run a random search over parameters; for xgboost on my datasets the best setting is:
skip_rate = 0.3
drop_rate = 0.2
I use the defaults for the other two dart parameters.

ad 2] Actually, dart for xgboost was added in response to my issue there:
dmlc/xgboost#809
It seems that marugari added skip_rate as his own idea. For the other two settings I do not know the source.

ad 3]
My task is a regression problem with 500k rows and 400 features; RMSE on cross validation:
random forest from scikit-learn: 5.46
deepnet (5 hidden layers, prelu, adam, xavier) from mxnet: 5.43
xgboost without dart: 5.421
xgboost with dart: 5.419
lightgbm without dart: 5.418
lightgbm with dart: 5.5 - not a chance to beat the random forest

For all methods I did some random search over parameters, so the methods should be comparable in terms of RMSE. Speed is best for the deepnet - but it is a different algorithm (and also depends on settings and hardware).

@wxchan (Contributor) commented Dec 18, 2016

@gugatr0n1c
Can you test two things on your dataset (see the sketch after this list):

  1. For lightgbm dart, set drop_rate to a very small number, such as drop_rate = 1/num_iter; because your num_iter is big, each tree may be dropped too many times.
  2. For xgboost dart, set learning_rate = 1. For dart, the learning rate is a different concept than in gbdt; in the original paper it is fixed to 1.
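For suggestion 1, a minimal parameter sketch (the values are illustrative, reusing the regression setup from this issue):

num_iter = 10000
params_dart = {
    'boosting_type': 'dart',
    'objective': 'regression',
    'metric': 'l2',
    'learning_rate': 0.01,
    # tie drop_rate to the iteration budget so each tree is expected
    # to be dropped roughly once over the whole run
    'drop_rate': 1.0 / num_iter,
}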

@marugari commented Dec 18, 2016

It's not so difficult to add skip_rate and normalize_type. If there is demand, I will open a PR.

@gugatr0n1c (Author):

@wxchan
ad 1] I tried 0.005, 0.0005, and 0.0001 and got similarly "bad" results... What is strange is that early stopping halts training very early (hundreds of iterations, compared to several thousand without dart).

ad 2] I can do this tomorrow from work.

@wxchan (Contributor) commented Dec 18, 2016

@marugari where did you get the idea for skip_rate and normalize_type?

Another question: when sample_type = weighted in xgboost, dparam.rate_drop * weight_drop.size() * weight_drop[i] / sum_weight (from here) can be bigger than 1; the probability of that tree being selected is then computed as a number bigger than 1 but is effectively capped at 1. Won't that make the expected number of dropped trees smaller than rate_drop implies?
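A quick numeric illustration of that concern (the weights below are made up): once a per-tree value exceeds 1 it is effectively capped, so the expected number of dropped trees falls short of rate_drop * num_trees.

import numpy as np

rate_drop = 0.3
weights = np.array([30.0, 1.0, 1.0, 1.0])               # one dominant tree
p = rate_drop * len(weights) * weights / weights.sum()  # the 'weighted' formula
print(p)                         # [1.09, 0.036, 0.036, 0.036]; first entry > 1
print(np.minimum(p, 1.0).sum())  # ~1.11 expected drops vs. 0.3 * 4 = 1.2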

@marugari:

@wxchan Regarding skip_rate, all we have to do is leave idx_drop empty.
In order to use learning_rate < 1 in dart, I propose the following:

/* dart.hpp L:103 */
if (normalize_type == 1) {
  shrinkage_rate_ = learning_rate_ / (1.0 + learning_rate_);
} else {
  shrinkage_rate_ = learning_rate_ / (drop_index_.size() + learning_rate_);
}
/* dart.hpp L:119 */
if (normalize_type == 1) {
  models_[curr_tree]->Shrinkage(-k / learning_rate_);
} else {
  models_[curr_tree]->Shrinkage(-1 / learning_rate_);
}

My sample_type handling is very rough, since I think it is not involved in convergence.
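In Python pseudocode, the two normalization branches proposed above amount to roughly this (a paraphrase of the sketch, not the actual dart.hpp code; normalize_type == 1 plays the role of "tree" normalization):

def shrinkage_rate(learning_rate, num_dropped, normalize_type):
    # 'tree' normalization: weight the new tree against one unit tree;
    # otherwise normalize against the k dropped trees.
    if normalize_type == 1:
        return learning_rate / (1.0 + learning_rate)
    return learning_rate / (num_dropped + learning_rate)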

@wxchan (Contributor) commented Dec 18, 2016

@marugari Sorry, what I meant was: is there a paper covering skip_rate and normalize_type? They don't appear in the original dart paper.

@marugari:

@wxchan Sorry, they are my own extensions, not published.

@wxchan (Contributor) commented Dec 18, 2016

@marugari I see, thanks.

@wxchan (Contributor) commented Dec 19, 2016

If you want to use skip_rate, you can arrange a callback that changes drop_rate.

Just my opinion on this: I think parameters added here should be tested first. I am not saying they are wrong - they may be wonderful ideas - but I want to be convinced before adding them here.

skip_rate and sample_type have been added to my dart branch.

@gugatr0n1c (Author):

@wxchan If I set learning_rate to 1 in xgboost, accuracy is very bad as well, even worse than here...

@wxchan (Contributor) commented Dec 19, 2016

@gugatr0n1c In the original paper, they only use shrinkage_rate = 1/(1+num_drop_trees); as marugari said, he sets shrinkage_rate_ = learning_rate_ / (drop_index_.size() + learning_rate_); I think this, rather than skip_rate or sample_type, is the main difference.

But according to the code in xgboost:

float factor = 1.0 * num_drop / (num_drop + lr);
for (size_t i = 0; i < idx_drop.size(); ++i) {
  weight_drop[i] *= factor;
}
for (size_t i = 0; i < size_new_trees; ++i) {
  weight_drop.push_back(1.0 / (num_drop + lr));
}

If num_drop is a reasonable integer, num_drop + lr should not differ much between lr = 1 and lr = 0.1. A lot of math is involved in deciding these weights; I actually don't fully get it.

Another difference: in the original paper, the strategy is to drop at least one tree each round, i.e. num_drop >= 1, which is what I implemented too. xgboost does not have this 'at least one' part (meanwhile, it has skip_drop, which sets num_drop = 0 directly). This means that, with lr = 0.1 for example, the weights of new trees can be very large (1/(0+0.1) = 10). I think that is not correct.
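To make the arithmetic concrete, here is a small Python paraphrase of that commit step (not the actual xgboost code; it uses the corrected indexing guolinke suggested above):

def commit_dart_weights(weight_drop, idx_drop, num_new_trees, lr):
    """Shrink the dropped trees and append weights for the new trees."""
    k = len(idx_drop)
    factor = k / (k + lr)              # shrink factor for the dropped trees
    for i in idx_drop:
        weight_drop[i] *= factor       # index only the *dropped* trees
    weight_drop += [1.0 / (k + lr)] * num_new_trees
    return weight_drop

# With no dropped trees (k = 0) and lr = 0.1 the new tree gets weight
# 1 / (0 + 0.1) = 10 - the 'very large weight' concern described above.
print(commit_dart_weights([], [], 1, 0.1))   # -> [10.0]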

@gugatr0n1c (Author) commented Dec 19, 2016

As I wrote, I did not use the sample_type parameter. I only used drop_rate and skip_rate (together with learning_rate << 1) in xgboost, and it worked nicely. I cannot say now what the influence of skip_rate was, but when I used random search for hyperparameter tuning, it always wanted to set skip_rate to something non-zero. That said, yes, I agree the main trick is to adapt the shrinkage_rate as written.

To be honest, if I can use the code below to simulate skip_rate, then it is not necessary to add this parameter - maybe just mention it as a callback example? Actually, it is an even better solution, because I can then randomize drop_rate together with skip_rate (which even xgboost doesn't support) during the learning process via the logic that generates drop_rate_list.

# simulate skip_rate = 0.33
drop_rate_list = np.random.choice([0., 0.3, 0.3], iteration_count).tolist()
lg.train(
    params,
    data_train,
    num_boost_round=iteration_count,
    valid_sets=data_valid,
    early_stopping_rounds=50,
    callbacks=[lg.reset_parameter(drop_rate=drop_rate_list)]
)

So I believe there are two nice things to do: 1] allow changing drop_rate in a callback - this should work now, right? 2] change the shrinkage_rate as marugari wrote.

@marugari:

@wxchan Although the weight_drop of new trees can be large, their leaf values are scaled by learning_rate.
I think weight_drop * learning_rate = 1.0.

@wxchan (Contributor) commented Dec 19, 2016

@marugari Oh sorry, I didn't see the if (num_drop == 0) branch. My fault. I also didn't know xgboost applies learning_rate elsewhere; I thought it was applied only in CommitModel.

Then I think it's reasonable to change the shrinkage_rate and see if it works. It is effectively a combination of the gbdt shrinkage rate and the dart normalization weight.

@guolinke (Collaborator):

@gugatr0n1c can you test again with the latest code?

@gugatr0n1c (Author):

@guolinke I tried, but got a segmentation fault; I recompiled from github twice.

boosting_type = 'gbdt' is OK
boosting_type = 'dart' returns Segmentation fault

@guolinke (Collaborator):

@wxchan can you take a look at this?

@wxchan (Contributor) commented Dec 20, 2016

@guolinke it seems to be caused by the c_api, because the cmd-line version is fine. I took a look at the segfault log: it happens in Dart::Init; it seems gbdt_config_ is still null after GBDT::ResetTrainingData.

@guolinke (Collaborator):

@gugatr0n1c @wxchan fixed, can you try it again?

@gugatr0n1c (Author):

@guolinke training seems to be working; I will let you know about accuracy.

@gugatr0n1c (Author):

@guolinke it seems not to be working properly. It almost does not converge (even worse than before). It seems to me that the first tree is built with a very large learning_rate, and subsequent trees then have huge problems converging.

Tried dart_rate = 0.3, 0.05, 0.001 with learning_rate = 0.004 and 0.1 - almost the same result on all settings.

@guolinke (Collaborator) commented Dec 20, 2016

@gugatr0n1c OK, I see.
@wxchan, I also tried this; the training error is small. It seems the learning rate is still wrong - I think it is still 1 at the first iteration.

@guolinke (Collaborator) commented Dec 20, 2016

@wxchan
I think xgboost applies the learning rate here, not in GBTree. So the learning rate of DART is not 1.0.

Update: sorry, I was wrong. xgboost does use lr/(k+lr).

@guolinke (Collaborator):

@gugatr0n1c can you try xgboost with the latest code? @wxchan just fixed a bug in xgboost.

@wxchan (Contributor) commented Dec 20, 2016

@guolinke but the weight of the first tree is 1/(num_drop+lr) = 1/lr, which cancels out the lr.

@guolinke (Collaborator):

@wxchan, I see.
Maybe we can run some experiments on these different settings.
I think we can choose our own algorithm rather than being limited to the xgboost implementation.

@wxchan (Contributor) commented Dec 20, 2016

I think:

weight of new tree: lr/(k+1)
weight of dropped tree: lr/(k+1) * k/(k+1)

would be better. In this case the tree normalization is the same as before; the new tree is just shrunk by lr.

@guolinke (Collaborator) commented Dec 20, 2016

@wxchan I think this may help.

@marugari commented Dec 21, 2016

New trees and the sum of the dropped trees have similar leaf values.
It seems improper to shrink just the new trees.

@guolinke (Collaborator):

@marugari
In your implementation: if no trees are dropped, the new tree is added with weight lr, so the total delta weight is lr.

If trees are dropped, the new tree is added with weight lr/(k+lr) and the k dropped trees are shrunk by k/(k+lr). The total delta weight is lr/(k+lr) - k * lr/(k+lr) * w_j, where w_j is the previous weight of dropped tree j. w_j can be lr * (k/(k+lr))^i (if skipped) or lr/(k+lr) * (k/(k+lr))^i (if not skipped), where i is the drop count of that tree.

I think lr/(k+lr) - k * lr/(k+lr) * w_j is far smaller than lr, since lr/(k+lr) < lr and a positive number is subtracted from it. And when k == 0 (meaning no trees are dropped), lr/(k+lr) equals 1, which is not consistent with your handling of the no-drop case.

BTW, in the original paper, the delta weight is:
1/(k+1) - k/((k+1)*(k+1)) = 1/((k+1)*(k+1))

Do you think making the delta weight approximately lr/((k+1)*(k+1)) would be better?
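As a quick numeric check of that last identity (assuming, as in the paper's normalization, each dropped tree carries weight 1/(k+1)):

k = 3
delta = 1 / (k + 1) - k / ((k + 1) * (k + 1))
print(delta, 1 / ((k + 1) * (k + 1)))   # both 0.0625, i.e. 1/((k+1)^2)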

@marugari:

That was a little inaccurate, I'm sorry...

New trees and the sum of the dropped trees scaled by their previous weights have similar leaf values. Thus the normalization factor is independent of the previous weights.

If the weights of the dropped trees are small, the new trees also have small leaf values.
I don't think the delta weight is a suitable characteristic.

@guolinke (Collaborator):

@marugari I think I understand your idea now.
You let the weight of the new tree approximate the sum of the dropped trees, and use skip_rate to add a new tree and increase the total weight?

It seems skip_rate == 0.5 gives the best accuracy.

My idea is to let the total weight increase at every iteration. I think this may save half of the iterations while achieving the same accuracy.

@marugari:

@guolinke You are right.
In the case of a small learning_rate, the skipping iterations are the main source of loss reduction.

@guolinke (Collaborator):

I just pushed a commit to the dart branch: 518bafd

It adds skip_drop and xgboost_dart_mode.
I also ran some experiments, and xgboost_dart_mode seems to work well on small datasets.

@gugatr0n1c can you try this with your data?

@gugatr0n1c (Author):

I'd love to test this, but I have some Christmas travelling now, so it will take some time..

@guolinke (Collaborator):

@gugatr0n1c take your time, and merry Christmas!

@wxchan, do you have any comments?

@wxchan (Contributor) commented Dec 21, 2016

It's better now. I think you can merge it into master and update the docs for it.

@guolinke (Collaborator):

@wxchan, I am not sure whether we should keep xgboost_dart_mode. And should we give it a better name?

@gugatr0n1c (Author):

@guolinke merry Christmas to you as well.

I have done some tests:
for xDart (xgboost_dart_mode = True) I get great accuracy - this is great for my datasets, beating 'gbdt'
for Dart (xgboost_dart_mode = False) still bad accuracy... (but only limited tests with a very small learning_rate)

But here both xDart and Dart train very slowly: 15 minutes ('gbdt') vs. several hours for xDart. Some iterations are very fast, ~100 ms, but some take about 15-20 s (related to skip_drop?). DART in xgboost is not that much slower than 'gbdt'; is there room to speed this up?

@marugari:

@gugatr0n1c Really?
Yes, XGBoost-DART is very slow (especially late in training) because it doesn't use the prediction buffer.
However, LightGBM-(x)DART seems to run as fast as gbdt for me.

@guolinke (Collaborator):

@gugatr0n1c what are your values of drop_rate and skip_drop? This speed seems abnormal.

@defaultRobot:

Hi everyone, I have recently trained dart and gbdt models on the same training data with some noise. For predictions, in general, the dart model is a little better than the gbdt model. I set the same learning_rate = 0.0112 for both. I didn't try xgboost, so I can't comment on it.
@gugatr0n1c I just don't know how to use mxnet to implement the "deepnet (5 hidden layers, prelu, adam, xavier)"; can you provide some examples? Thank you very much!

@gugatr0n1c (Author):

@guolinke my settings are:
learning_rate = 0.006
up to 10k iterations with early_stopping = 50 (stopped around 5k-7k iterations with 'gbdt')
drop_rate = skip_drop = 0.3
using xDART

The first ~100 iterations are fast, though even there I can see that some iterations are very fast and some slower. After 1k iterations it is like 100 ms vs. 20 s.

@marugari I haven't tested this in detail in xgboost, but I never had the impression that the late-training difference was this big..

@anddelu this is very much off-topic here... send me your email via Twitter PM... @gugatr0n1c

@guolinke (Collaborator) commented Dec 23, 2016

@gugatr0n1c
I know what is happening now, and I will try to optimize the speed.
The reason is that after many iterations have been trained, many trees can be dropped at once, which increases the cost of recomputing the scores.
A quick workaround is to reset drop_rate to a smaller value during training.
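A minimal sketch of that workaround, reusing the reset_parameter callback shown earlier in the thread (the schedule values are illustrative, and params/data_train/data_valid come from the earlier snippets):

# decay drop_rate after a warm-up so late iterations drop fewer trees
n_rounds = 10000
drop_schedule = [0.3 if i < 1000 else 0.05 for i in range(n_rounds)]

model = lg.train(
    params,
    data_train,
    num_boost_round=n_rounds,
    valid_sets=data_valid,
    early_stopping_rounds=50,
    callbacks=[lg.reset_parameter(drop_rate=drop_schedule)]
)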

@guolinke (Collaborator):

@wxchan if xgboost_dart_mode is better, I think we can just keep it and remove our version.
What do you think?

@wxchan (Contributor) commented Dec 23, 2016

@guolinke It may depend on the task. Our version is based on the original paper; I think it's better to keep it for now.

@guolinke (Collaborator):

@gugatr0n1c I added a parameter max_drop to limit the number of dropped trees. This should help with speed, though I am not sure how it will affect accuracy. If max_drop <= 0, there is no limit (see the sketch below).
I will close this issue for now; feel free to open a new issue if it does not speed things up or loses too much accuracy.
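Usage is a one-liner (the cap of 50 is illustrative, not a recommendation from this thread):

params['max_drop'] = 50   # cap trees dropped per iteration; <= 0 means no limit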

@gugatr0n1c (Author):

@guolinke thanks, I will run some tests after New Year.

@github-actions (bot):

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

github-actions bot locked as resolved and limited conversation to collaborators Aug 24, 2023