
Dart - very poor accuracy #126

Closed · gugatr0n1c opened this issue Dec 16, 2016 · 60 comments

@gugatr0n1c commented Dec 16, 2016

When I use dart as the booster, I always get very poor performance in terms of the L2 metric on my regression task, whether I use a small drop_rate like 0.01 or a big one like 0.3.

When I use dart in xgboost on the same dataset with similar settings (same learning rate, similar num_trees), dart always gives me an accuracy boost (small, but consistent).

But here the accuracy is badly degraded, as if there were a bug, not merely dart being unsuitable for my task.

Can anyone confirm that dart improves accuracy on regression tasks?

My settings are as follows (part of the Python code for a random search over parameters):

import numpy as np
import lightgbm as lg

lr = np.random.choice([0.01, 0.005, 0.0025])
list_count = np.random.choice([250, 500, 750, 1000])
min_in_leaf = np.random.choice([25, 50, 100])
subF = np.random.choice([0.15, 0.22, 0.3, 0.5, 0.66, 0.75])
subR = np.random.choice([0.66, 0.75, 0.83, 0.9])
max_depth = np.random.choice([9, 11, 15, 25, 45, 100, -1])
dart_rate = np.random.choice([0, 0, 0, 0.01, 0.03, 0.1])
max_bin = np.random.choice([63, 127, 255, 511])
lambda_l1 = np.random.choice([0, 1., 10., 100.])
lambda_l2 = np.random.choice([0, 1., 10., 100.])

iterace = 10000
if only_testing:
    min_in_leaf = 25
    iterace = 10

boost_type = 'gbdt'
if dart_rate > 0:
    boost_type = 'dart'

params = {
    'task': 'train',
    'boosting_type': boost_type,
    'objective': 'regression',
    'metric': 'l2',
    'max_depth': int(max_depth),
    'num_leaves': int(list_count),
    'min_data_in_leaf': int(min_in_leaf),
    'learning_rate': lr,
    'feature_fraction': subF,
    'bagging_fraction': subR,
    'bagging_freq': 1,
    'verbose': 0,
    'nthread': nthread,
    'drop_rate': dart_rate,
    'max_bin': max_bin,
    'lambda_l1': lambda_l1,
    'lambda_l2': lambda_l2
}

model = lg.train(
    params,
    lg.Dataset(matrix_learn, target_learn),
    num_boost_round=iterace,
    valid_sets=[lg.Dataset(matrix_test, target_test)],
    early_stopping_rounds=50
)

@gugatr0n1c (Author):

What works in XGBoost is probably not just drop_rate but also skip_drop: http://xgboost.readthedocs.io/en/latest/tutorials/dart.html

Can you please support this as well?

@guolinke (Collaborator):

@wxchan

https://github.com/dmlc/xgboost/blob/master/src/gbm/gbtree.cc#L547-L711

can you check the implementation in XGBoost and figure out why?

@wxchan (Contributor) commented Dec 17, 2016

@guolinke I read that code before. It's not hard to add, but I have no idea where those parameters came from.

@wxchan (Contributor) commented Dec 17, 2016

@gugatr0n1c

  1. Can you show how you set up skip_rate etc. for xgboost?
  2. Do you know where xgboost got the ideas for skip_rate, sample_type and normalize_type?
  3. Can you show accuracy comparisons between xgboost and lightgbm (both gbdt & dart) for your task?

@guolinke (Collaborator):

@wxchan, I am not familiar with XGBoost. The following is my guess:

  1. weight_drop holds the weights of all trees, not just the dropped ones, right? If yes, I think the line you refer to is a bug: it should be weight_drop[idx_drop[i]] *= factor. If this is a bug, I am not sure why XGBoost can perform better....
  2. It seems the tree-dropping logic in XGBoost is based on sum_weight, whereas LightGBM's is purely random. Maybe we can adopt this.

@wxchan (Contributor) commented Dec 17, 2016

@guolinke I can add that option. I will do some investigation on this.

@gugatr0n1c (Author):

@wxchan
ad 1] Usually I run a random search over parameters; for xgboost on my datasets the best setting is:
skip_rate = 0.3
drop_rate = 0.2
I use the defaults for the other two dart parameters.

ad 2] Actually, dart for xgboost was added in response to my issue there:
dmlc/xgboost#809
It seems that marugari added skip_rate as his own idea. For the other two settings I do not know the source.

ad 3]
My task is a regression problem with 500k rows and 400 features; RMSE on cross validation:
random forest from scikit-learn: 5.46
deepnet (5 hidden layers, prelu, adam, xavier) from mxnet: 5.43
xgboost without dart: 5.421
xgboost with dart: 5.419
lightgbm without dart: 5.418
lightgbm with dart: 5.5 - not a chance to beat the random forest

For all methods I did some random search over parameters, so the methods should be comparable in terms of RMSE. Speed is best for the deepnet - but it is a different algorithm (and also depends on settings and hardware).

@wxchan (Contributor) commented Dec 18, 2016

@gugatr0n1c
Can you test two things on your dataset (see the sketch after this list):

  1. For lightgbm dart, set drop_rate to a very small number, such as drop_rate = 1/num_iter; because your num_iter is big, each tree may be dropped too many times.
  2. For xgboost dart, set learning_rate = 1. For dart, the learning rate is a different concept than in gbdt; in the original paper it is fixed to 1.
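For suggestion 1, a minimal parameter sketch (the values are illustrative, reusing the regression setup from this issue):

num_iter = 10000
params_dart = {
    'boosting_type': 'dart',
    'objective': 'regression',
    'metric': 'l2',
    'learning_rate': 0.01,
    # tie drop_rate to the iteration budget so each tree is expected
    # to be dropped roughly once over the whole run
    'drop_rate': 1.0 / num_iter,
}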

@marugari commented Dec 18, 2016

It's not so difficult to add skip_rate and normalize_type. If there is demand, I will open a PR.

@gugatr0n1c (Author):

@wxchan
ad 1] I tried 0.005, 0.0005, and 0.0001 and got similarly "bad" results... What is strange is that early stopping halts training very early (hundreds of iterations, compared to several thousand without dart).

ad 2] I can do this tomorrow from work.

@wxchan (Contributor) commented Dec 18, 2016

@marugari where did you get the idea for skip_rate and normalize_type?

Another question: when sample_type = weighted in xgboost, dparam.rate_drop * weight_drop.size() * weight_drop[i] / sum_weight (from here) can be bigger than 1; the probability of that tree being selected is then computed as a number bigger than 1 but is effectively capped at 1. Won't that make the expected number of dropped trees smaller than rate_drop implies?
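A quick numeric illustration of that concern (the weights below are made up): once a per-tree value exceeds 1 it is effectively capped, so the expected number of dropped trees falls short of rate_drop * num_trees.

import numpy as np

rate_drop = 0.3
weights = np.array([30.0, 1.0, 1.0, 1.0])               # one dominant tree
p = rate_drop * len(weights) * weights / weights.sum()  # the 'weighted' formula
print(p)                         # [1.09, 0.036, 0.036, 0.036]; first entry > 1
print(np.minimum(p, 1.0).sum())  # ~1.11 expected drops vs. 0.3 * 4 = 1.2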

@marugari:

@wxchan Regarding skip_rate, all we have to do is leave idx_drop empty.
In order to use learning_rate < 1 in dart, I propose the following:

/* dart.hpp L:103 */
if (normalize_type == 1) {
  shrinkage_rate_ = learning_rate_ / (1.0 + learning_rate_);
} else {
  shrinkage_rate_ = learning_rate_ / (drop_index_.size() + learning_rate_);
}
/* dart.hpp L:119 */
if (normalize_type == 1) {
  models_[curr_tree]->Shrinkage(-k / learning_rate_);
} else {
  models_[curr_tree]->Shrinkage(-1 / learning_rate_);
}

My sample_type handling is very rough, since I think it is not involved in convergence.
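In Python pseudocode, the two normalization branches proposed above amount to roughly this (a paraphrase of the sketch, not the actual dart.hpp code; normalize_type == 1 plays the role of "tree" normalization):

def shrinkage_rate(learning_rate, num_dropped, normalize_type):
    # 'tree' normalization: weight the new tree against one unit tree;
    # otherwise normalize against the k dropped trees.
    if normalize_type == 1:
        return learning_rate / (1.0 + learning_rate)
    return learning_rate / (num_dropped + learning_rate)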

@wxchan (Contributor) commented Dec 18, 2016

@marugari Sorry, what I meant was: is there a paper covering skip_rate and normalize_type? They don't appear in the original dart paper.

@marugari:

@wxchan Sorry, they are my own extensions, not published.

@wxchan (Contributor) commented Dec 18, 2016

@marugari I see, thanks.

@wxchan (Contributor) commented Dec 19, 2016

If you want to use skip_rate, you can arrange a callback that changes drop_rate.

Just my opinion on this: I think parameters added here should be tested first. I am not saying they are wrong - they may be wonderful ideas - but I want to be convinced before adding them here.

skip_rate and sample_type have been added to my dart branch.

@gugatr0n1c (Author):

@wxchan If I set learning_rate to 1 in xgboost, accuracy is very bad as well, even worse than here...

@wxchan (Contributor) commented Dec 19, 2016

@gugatr0n1c In the original paper, they only use shrinkage_rate = 1/(1+num_drop_trees); as marugari said, he sets shrinkage_rate_ = learning_rate_ / (drop_index_.size() + learning_rate_); I think this, rather than skip_rate or sample_type, is the main difference.

But according to the code in xgboost:

float factor = 1.0 * num_drop / (num_drop + lr);
for (size_t i = 0; i < idx_drop.size(); ++i) {
  weight_drop[i] *= factor;
}
for (size_t i = 0; i < size_new_trees; ++i) {
  weight_drop.push_back(1.0 / (num_drop + lr));
}

If num_drop is a reasonable integer, num_drop + lr should not differ much between lr = 1 and lr = 0.1. A lot of math is involved in deciding these weights; I actually don't fully get it.

Another difference: in the original paper, the strategy is to drop at least one tree each round, i.e. num_drop >= 1, which is what I implemented too. xgboost does not have this 'at least one' part (meanwhile, it has skip_drop, which sets num_drop = 0 directly). This means that, with lr = 0.1 for example, the weights of new trees can be very large (1/(0+0.1) = 10). I think that is not correct.
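To make the arithmetic concrete, here is a small Python paraphrase of that commit step (not the actual xgboost code; it uses the corrected indexing guolinke suggested above):

def commit_dart_weights(weight_drop, idx_drop, num_new_trees, lr):
    """Shrink the dropped trees and append weights for the new trees."""
    k = len(idx_drop)
    factor = k / (k + lr)              # shrink factor for the dropped trees
    for i in idx_drop:
        weight_drop[i] *= factor       # index only the *dropped* trees
    weight_drop += [1.0 / (k + lr)] * num_new_trees
    return weight_drop

# With no dropped trees (k = 0) and lr = 0.1 the new tree gets weight
# 1 / (0 + 0.1) = 10 - the 'very large weight' concern described above.
print(commit_dart_weights([], [], 1, 0.1))   # -> [10.0]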

@gugatr0n1c (Author) commented Dec 19, 2016

As I wrote, I did not use the sample_type parameter. I only used drop_rate and skip_rate (together with learning_rate << 1) in xgboost, and it worked nicely. I cannot say now what the influence of skip_rate was, but when I used random search for hyperparameter tuning, it always wanted to set skip_rate to something non-zero. That said, yes, I agree the main trick is to adapt the shrinkage_rate as written.

To be honest, if I can use the code below to simulate skip_rate, then it is not necessary to add this parameter - maybe just mention it as a callback example? Actually, it is an even better solution, because I can then randomize drop_rate together with skip_rate (which even xgboost doesn't support) during the learning process via the logic that generates drop_rate_list.

# simulate skip_rate = 0.33
drop_rate_list = np.random.choice([0., 0.3, 0.3], iteration_count).tolist()
lg.train(
    params,
    data_train,
    num_boost_round=iteration_count,
    valid_sets=data_valid,
    early_stopping_rounds=50,
    callbacks=[lg.reset_parameter(drop_rate=drop_rate_list)]
)

So I believe there are two nice things to do: 1] allow changing drop_rate in a callback - this should work now, right? 2] change the shrinkage_rate as marugari wrote.

@marugari:

@wxchan Although the weight_drop of new trees can be large, their leaf values are scaled by learning_rate.
I think weight_drop * learning_rate = 1.0.

@wxchan (Contributor) commented Dec 19, 2016

@marugari Oh sorry, I didn't see the if (num_drop == 0) branch. My fault. I also didn't know xgboost applies learning_rate elsewhere; I thought it was applied only in CommitModel.

Then I think it's reasonable to change the shrinkage_rate and see if it works. It is effectively a combination of the gbdt shrinkage rate and the dart normalization weight.

@guolinke (Collaborator):

@gugatr0n1c can you test again with the latest code?

@gugatr0n1c (Author):

@guolinke I tried, but got a segmentation fault; I recompiled from github twice.

boosting_type = 'gbdt' is OK
boosting_type = 'dart' returns Segmentation fault

@guolinke (Collaborator):

@wxchan can you take a look at this?

@wxchan (Contributor) commented Dec 20, 2016

@guolinke it seems to be caused by the c_api, because the cmd-line version is fine. I took a look at the segfault log: it happens in Dart::Init; it seems gbdt_config_ is still null after GBDT::ResetTrainingData.

@guolinke (Collaborator):

@gugatr0n1c @wxchan fixed, can you try it again?

@gugatr0n1c (Author):

@guolinke training seems to be working; I will let you know about accuracy.

@gugatr0n1c (Author):

@guolinke it seems not to be working properly. It almost does not converge (even worse than before). It seems to me that the first tree is built with a very large learning_rate, and subsequent trees then have huge problems converging.

Tried dart_rate = 0.3, 0.05, 0.001 with learning_rate = 0.004 and 0.1 - almost the same result on all settings.

@guolinke (Collaborator) commented Dec 20, 2016

@gugatr0n1c OK, I see.
@wxchan, I also tried this; the training error is small. It seems the learning rate is still wrong - I think it is still 1 at the first iteration.

@guolinke (Collaborator) commented Dec 20, 2016

@wxchan
I think xgboost applies the learning rate here, not in GBTree. So the learning rate of DART is not 1.0.

Update: sorry, I was wrong. xgboost does use lr/(k+lr).

@guolinke (Collaborator):

@gugatr0n1c can you try xgboost with the latest code? @wxchan just fixed a bug in xgboost.

@wxchan (Contributor) commented Dec 20, 2016

@guolinke but the weight of the first tree is 1/(num_drop+lr) = 1/lr, which cancels out the lr.

@guolinke (Collaborator):

@wxchan, I see.
Maybe we can run some experiments on these different settings.
I think we can choose our own algorithm rather than being limited to the xgboost implementation.

@wxchan (Contributor) commented Dec 20, 2016

I think:

weight of new tree: lr/(k+1)
weight of dropped tree: lr/(k+1) * k/(k+1)

would be better. In this case the tree normalization is the same as before; the new tree is just shrunk by lr.

@guolinke (Collaborator) commented Dec 20, 2016

@wxchan I think this may help.

@marugari commented Dec 21, 2016

New trees and the sum of the dropped trees have similar leaf values.
It seems improper to shrink just the new trees.

@guolinke (Collaborator):

@marugari
In your implementation: if no trees are dropped, the new tree is added with weight lr, so the total delta weight is lr.

If trees are dropped, the new tree is added with weight lr/(k+lr) and the k dropped trees are shrunk by k/(k+lr). The total delta weight is lr/(k+lr) - k * lr/(k+lr) * w_j, where w_j is the previous weight of dropped tree j. w_j can be lr * (k/(k+lr))^i (if skipped) or lr/(k+lr) * (k/(k+lr))^i (if not skipped), where i is the drop count of that tree.

I think lr/(k+lr) - k * lr/(k+lr) * w_j is far smaller than lr, since lr/(k+lr) < lr and a positive number is subtracted from it. And when k == 0 (meaning no trees are dropped), lr/(k+lr) equals 1, which is not consistent with your handling of the no-drop case.

BTW, in the original paper, the delta weight is:
1/(k+1) - k/((k+1)*(k+1)) = 1/((k+1)*(k+1))

Do you think making the delta weight approximately lr/((k+1)*(k+1)) would be better?
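As a quick numeric check of that last identity (assuming, as in the paper's normalization, each dropped tree carries weight 1/(k+1)):

k = 3
delta = 1 / (k + 1) - k / ((k + 1) * (k + 1))
print(delta, 1 / ((k + 1) * (k + 1)))   # both 0.0625, i.e. 1/((k+1)^2)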

@marugari:

That was a little inaccurate, I'm sorry...

New trees and the sum of the dropped trees scaled by their previous weights have similar leaf values. Thus the normalization factor is independent of the previous weights.

If the weights of the dropped trees are small, the new trees also have small leaf values.
I don't think the delta weight is a suitable characteristic.

@guolinke (Collaborator):

@marugari I think I understand your idea now.
You let the weight of the new tree approximate the sum of the dropped trees, and use skip_rate to add a new tree and increase the total weight?

It seems skip_rate == 0.5 gives the best accuracy.

My idea is to let the total weight increase at every iteration. I think this may save half of the iterations while achieving the same accuracy.

@marugari:

@guolinke You are right.
In the case of a small learning_rate, the skipping iterations are the main source of loss reduction.

@guolinke (Collaborator):

I just pushed a commit to the dart branch: 518bafd

It adds skip_drop and xgboost_dart_mode.
I also ran some experiments, and xgboost_dart_mode seems to work well on small datasets.

@gugatr0n1c can you try this with your data?

@gugatr0n1c (Author):

I'd love to test this, but I have some Christmas travelling now, so it will take some time..

@guolinke (Collaborator):

@gugatr0n1c take your time, and merry Christmas!

@wxchan, do you have any comments?

@wxchan (Contributor) commented Dec 21, 2016

It's better now. I think you can merge it into master and update the docs for it.

@guolinke (Collaborator):

@wxchan, I am not sure whether we should keep xgboost_dart_mode. And should we give it a better name?

@gugatr0n1c (Author):

@guolinke merry Christmas to you as well.

I have done some tests:
for xDart (xgboost_dart_mode = True) I get great accuracy - this is great for my datasets, beating 'gbdt'
for Dart (xgboost_dart_mode = False) still bad accuracy... (but only limited tests with a very small learning_rate)

But here both xDart and Dart train very slowly: 15 minutes ('gbdt') vs. several hours for xDart. Some iterations are very fast, ~100 ms, but some take about 15-20 s (related to skip_drop?). DART in xgboost is not that much slower than 'gbdt'; is there room to speed this up?

@marugari:

@gugatr0n1c Really?
Yes, XGBoost-DART is very slow (especially late in training) because it doesn't use the prediction buffer.
However, LightGBM-(x)DART seems to run as fast as gbdt for me.

@guolinke (Collaborator):

@gugatr0n1c what are your values of drop_rate and skip_drop? This speed seems abnormal.

@defaultRobot:

Hi everyone, I have recently trained dart and gbdt models on the same training data with some noise. For predictions, in general, the dart model is a little better than the gbdt model. I set the same learning_rate = 0.0112 for both. I didn't try xgboost, so I can't comment on it.
@gugatr0n1c I just don't know how to use mxnet to implement the "deepnet (5 hidden layers, prelu, adam, xavier)"; can you provide some examples? Thank you very much!

@gugatr0n1c (Author):

@guolinke my settings are:
learning_rate = 0.006
up to 10k iterations with early_stopping = 50 (stopped around 5k-7k iterations with 'gbdt')
drop_rate = skip_drop = 0.3
using xDART

The first ~100 iterations are fast, though even there I can see that some iterations are very fast and some slower. After 1k iterations it is like 100 ms vs. 20 s.

@marugari I haven't tested this in detail in xgboost, but I never had the impression that the late-training difference was this big..

@anddelu this is very much off-topic here... send me your email via Twitter PM... @gugatr0n1c

@guolinke (Collaborator) commented Dec 23, 2016

@gugatr0n1c
I know what is happening now, and I will try to optimize the speed.
The reason is that after many iterations have been trained, many trees can be dropped at once, which increases the cost of recomputing the scores.
A quick workaround is to reset drop_rate to a smaller value during training.
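A minimal sketch of that workaround, reusing the reset_parameter callback shown earlier in the thread (the schedule values are illustrative, and params/data_train/data_valid come from the earlier snippets):

# decay drop_rate after a warm-up so late iterations drop fewer trees
n_rounds = 10000
drop_schedule = [0.3 if i < 1000 else 0.05 for i in range(n_rounds)]

model = lg.train(
    params,
    data_train,
    num_boost_round=n_rounds,
    valid_sets=data_valid,
    early_stopping_rounds=50,
    callbacks=[lg.reset_parameter(drop_rate=drop_schedule)]
)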

@guolinke (Collaborator):

@wxchan if xgboost_dart_mode is better, I think we can just keep it and remove our version.
What do you think?

@wxchan (Contributor) commented Dec 23, 2016

@guolinke It may depend on the task. Our version is based on the original paper; I think it's better to keep it for now.

@guolinke (Collaborator):

@gugatr0n1c I added a parameter max_drop to limit the number of dropped trees. This should help with speed, though I am not sure how it will affect accuracy. If max_drop <= 0, there is no limit (see the sketch below).
I will close this issue for now; feel free to open a new issue if it does not speed things up or loses too much accuracy.
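Usage is a one-liner (the cap of 50 is illustrative, not a recommendation from this thread):

params['max_drop'] = 50   # cap trees dropped per iteration; <= 0 means no limit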

@gugatr0n1c (Author):

@guolinke thanks, I will run some tests after New Year.

@github-actions (bot):

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

github-actions bot locked as resolved and limited conversation to collaborators Aug 24, 2023