Support multi-output regression/classification #524

Closed
miaotianyi opened this issue May 17, 2017 · 26 comments

Comments

@miaotianyi

Currently, LightGBM only supports single-output problems. It would be interesting if LightGBM could support multi-output tasks (multi-output regression, multi-label classification, etc.), like those handled by multi-task lasso.

I've seen a similar request on xgboost, but it hasn't been implemented yet.

@Laurae2
Contributor

Laurae2 commented May 17, 2017

I think it would require rewriting the whole algorithm from scratch for LightGBM, as it was optimized for the single-output case.

The same applies to xgboost: it would require rewriting the whole algorithm from scratch, which is not feasible in its current state unless someone is ready to work on it.

@wxchan
Contributor

wxchan commented May 18, 2017

Is there any introductory website or paper about multi-output tasks, other than dividing them into multiple binary/regression tasks?

@Laurae2
Contributor

Laurae2 commented May 18, 2017

@wxchan multi-output tasks require an objective that handles multiple outputs.

They also require multi-split support for decision trees (multiple cut points instead of a single cut point).

I think you can check this as a starting point; it's explained very simply: http://users.math.yale.edu/users/gw289/CpSc-445-545/Slides/CPSC445%20-%20Topic%2005%20-%20Classification%20&%20Decision%20Trees.pdf

@miaotianyi
Author

miaotianyi commented May 20, 2017

@wxchan I believe GBDT can adapt from multi-class to multi-label classification (where the labels aren't mutually exclusive) without too much additional computational cost.

In multi-label classification, the target y is an n_samples × n_labels matrix, and each column is a binary vector.

Traditionally, at the leaf node of a classification tree, the prediction is generated by an average of one-hot class probability vectors (which represent the classes of the samples belonging to the leaf). Ex. mean([0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 1, 0, 0], ...). When we use it for multi-label classification, the probability vectors will be different. Ex. mean([0, 1, 0, 1], [0, 1, 1, 0], [0, 0, 0, 1], [0, 1, 0, 1], ...).

(I know gradient boosting involves more complex maths, but that's the basic idea.)
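
To make the idea concrete, here is a minimal sketch (assuming NumPy; the label matrix and leaf membership are made up for illustration) of how a leaf's prediction changes in the multi-label case:

```python
import numpy as np

# Hypothetical multi-hot label rows for the samples that fall into one leaf.
# In multi-class, each row would be one-hot; in multi-label, rows may contain
# several 1s because labels are not mutually exclusive.
leaf_labels = np.array([
    [0, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 1, 0, 1],
])

# The leaf prediction is still just the column-wise mean: an estimate of
# P(label | leaf) for each label independently.
leaf_prediction = leaf_labels.mean(axis=0)
print(leaf_prediction)  # [0.   0.75 0.25 0.75]
```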

For GBDT, some other modifications may be required:

  1. the impurity function: For multi-label classification, the impurity functions mentioned in that document would have to change. A multi-class impurity function across n classes would be modified into the sum of n 2-class impurity functions, one per label (a sketch follows this list).

e.g.
original entropy function: $-\sum_{c\in\mathcal{C}}{p(c)\log p(c)}$
new entropy function for multi-label classification: $-\sum_{l\in\mathcal{L}}\left[p(l)\log p(l) + (1-p(l))\log (1-p(l))\right]$

  2. the objective function for gradient boosting: Not certain yet, since metrics like cross entropy also apply to multi-label problems. This may be something interesting to explore.
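
A minimal sketch of the modified impurity from item 1, assuming NumPy (the per-label probabilities are illustrative): the multi-label entropy is simply the sum of per-label binary entropies.

```python
import numpy as np

def multilabel_entropy(p):
    """Sum of 2-class entropies, one per label.

    p: array of P(label = 1) estimates at a node, one entry per label.
    """
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)  # avoid log(0)
    return -np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))

# Illustrative node probabilities for 4 labels (e.g. the leaf above).
print(multilabel_entropy([0.0, 0.75, 0.25, 0.75]))
```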

@Laurae2 I don't see the necessity of multi-split support. Theoretically, any split of one parent node into more than two child nodes can be equivalently represented by a sequence of binary splits. (Did I understand you correctly?)

The class_weight parameter will still be useful. A label with a larger weight would thus be considered more important during the evaluation of each split.

I suppose, for multi-label classification, an implementation within a single lightgbm model could train more efficiently and consume less memory. Relatively speaking, multi-output regression might not be as useful.

@Laurae2
Contributor

Laurae2 commented May 21, 2017

@TennielMiao multi-output classification is doable in xgboost/LightGBM; it is actually what is done in multiclass problems, just not in an efficient manner. Also, it returns everything, while you might be interested in only a specific number of outputs (especially for classification). This is why softprob and softmax are separate in xgboost (one gives you the raw values, while the other processes them before handing them to you).

It requires modifying the objective/loss function as you described. For instance, if you were to optimize the F1 or F2 score, you would have to put an optimizer in the metric part that finds the best threshold for each class at each iteration. For the loss function, you would have to find a proxy that is continuous and a local statistic (unlike the F1/F2 score, which requires discrete inputs over a global statistic).
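
A minimal sketch of the threshold search described above, assuming scikit-learn and NumPy (the predicted probabilities and labels are synthetic): for each class, scan candidate thresholds and keep the one maximizing F1.

```python
import numpy as np
from sklearn.metrics import f1_score

def best_f1_thresholds(y_true, y_prob, grid=np.linspace(0.05, 0.95, 19)):
    """For each label, return the probability threshold maximizing F1.

    y_true: (n_samples, n_labels) binary matrix
    y_prob: (n_samples, n_labels) predicted probabilities
    """
    thresholds = []
    for c in range(y_true.shape[1]):
        scores = [f1_score(y_true[:, c], (y_prob[:, c] >= t).astype(int))
                  for t in grid]
        thresholds.append(grid[int(np.argmax(scores))])
    return np.array(thresholds)

# Synthetic probabilities and labels, purely for illustration.
rng = np.random.default_rng(0)
y_prob = rng.random((200, 3))
y_true = (y_prob + rng.normal(0.0, 0.3, y_prob.shape) > 0.5).astype(int)
print(best_f1_thresholds(y_true, y_prob))
```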

For proper multi-output classification, if you can use multi-way splits instead of binary splits, you need a lower tree depth, which in turn requires fewer splits. Since the sum of losses from binary splits is only an approximation of the loss from a multi-way split (mathematically, if you consider a graph with chained losses), the representation of a multi-way split is not always identical to that of a sequence of binary splits (the more you split, the higher the odds of ending up with something different).

As for the speed, there are two major cases:

  • O(n) for n classes: using n models for n classes/outputs is the easiest to implement. If you have 10,000 classes, then you have 10,000 models to train (a sketch of this one-vs-all loop follows this list).
  • O(log(n)) for n classes: using 1 model for n classes/outputs is harder to implement and not trivial. It would also mean 10,000 classes would train 2,500x faster (theoretically) than a one-vs-all or one-vs-one classifier/regressor.
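
A minimal sketch of the O(n) one-vs-all approach from the first bullet, assuming the lightgbm Python package and a multi-hot label matrix Y of shape (n_samples, n_labels); the data here is synthetic:

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.random((500, 10))
Y = (rng.random((500, 3)) > 0.5).astype(int)  # one binary column per label

# O(n): one independent binary model per label/output.
models = [
    lgb.LGBMClassifier(n_estimators=50).fit(X, Y[:, c])
    for c in range(Y.shape[1])
]

# Stack the per-label positive-class probabilities into an (n, n_labels) matrix.
probs = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
print(probs.shape)  # (500, 3)
```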

For the class_weight you mentioned: @guolinke did you implement the class_weight-like parameter in a previous PR? #314

@chivee
Collaborator

chivee commented May 26, 2017

@TennielMiao The main bottleneck for the implementation of multi-output classification (or regression) is memory. It would be highly inefficient, since we would need to maintain the residual values of all samples over all features.

@guolinke
Collaborator

Just found a related paper at ICML 2017: "Gradient Boosted Decision Trees for High Dimensional Sparse Output".

@huanzhang12 is one of the authors.

@Laurae2
Contributor

Laurae2 commented May 26, 2017

The excerpt:

In this paper, we study the gradient boosted decision trees (GBDT) when the output space is high dimensional and sparse. For example, in multilabel classification, the output space is a L-dimensional 0/1 vector, where L is number of labels that can grow to millions and beyond in many modern applications. We show that GBDT can easily run out of memory or encounter near-forever running time in this regime, and propose a new GBDT variant, called GBDT-SPARSE to resolve this problem by employing L_0 regularization. We then discuss in detail how to utilize this sparsity to conduct GBDT training, including splitting the nodes, computing the sparse residual, and predicting in sublinear time. Finally, we apply our algorithm to extreme multilabel classification problems, and show that the proposed GBDT-SPARSE achieves an order of magnitude improvements in model size and prediction time over existing methods, while yielding similar performance.

That method seems way faster than O(n), so it will be interesting to see what GBDT-SPARSE uses to reach that speed.

@marugari

marugari commented Jul 3, 2017

@huanzhang12
Contributor

@marugari Thanks for posting the link here!
Let me know if you have any questions regarding our paper.

@marugari

@huanzhang12 Do the trees for the top-k labels share the same (feature_id, threshold)?
Why does the following algorithm fail?

Q <- index set of the top-k values of p_s
for q in Q do
  for j = 1...D do          # loop over features
    for i = 1...N do        # loop over samples
      evaluate the split (feature j, threshold at sample i) for label q
  best[q] <- (f_best, t_best)

If this discussion is not suitable for this issue, I will send an e-mail.

@huanzhang12
Contributor

@marugari Yes, please send me an email if you have specific questions on our paper.

@guolinke guolinke added this to the v3.0 milestone Aug 3, 2017
@marugari

I have made a prototype of #524 (comment).

https://github.com/marugari/LightGBM/tree/fast_multi
https://github.com/marugari/Notebooks/blob/master/LightGBM-Fast_Multi.ipynb

@guolinke
Collaborator

@marugari
Thanks very much. It seems to be a solution that trains on the top-k worst classes?

@huanzhang12
Contributor

@marugari Sorry for my late reply. I will look into this.

@marugari

@guolinke
Yes. This algorithm focuses on residuals rather than gains, unlike the paper.

@huanzhang12
It seems that the classes whose gains improve easily often don't match the classes that have large losses.
This causes poor prediction accuracy in my implementation.

@albertauyeung

Hi all, I have a question, and I am not sure if it is related to this discussion. When I train a multi-class model, from the log I can see that the number of trained trees is usually equal to the number of classes times the number of iterations. Does that mean LGB is training something like a one-vs-all classifier? If so, can we not take the output for each class and somehow achieve multi-label classification? Please correct me if I am wrong. Thanks.

@marugari

marugari commented Sep 5, 2017

@albertauyeung
LGB uses softmax (or softprob) and minimizes cross-entropy.
However, the number of trees per iteration equals the number of classes, because the TreeLearner doesn't support multi-output.
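
This is easy to verify, assuming the lightgbm Python package (the synthetic data is illustrative): the total tree count equals num_class × num_iterations.

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.random((300, 5))
y = rng.integers(0, 4, 300)  # 4 classes

model = lgb.LGBMClassifier(objective="multiclass", n_estimators=10).fit(X, y)

# One tree per class per boosting iteration: 4 classes * 10 iterations = 40.
print(model.booster_.num_trees())  # 40
```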

@marugari

marugari commented Nov 3, 2017

Multiclass classification in TensorFlow Boosted Trees.
Does this work well? (I don't believe so.)
https://arxiv.org/abs/1710.11547v1

@shoegazerstella

Any update on multi-label classification?

@guolinke
Collaborator

A temporary solution: use the sklearn multi-output wrappers https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputRegressor.html#sklearn.multioutput.MultiOutputRegressor and
https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html#sklearn.multioutput.MultiOutputClassifier
You can use LGBMRegressor or LGBMClassifier as their base estimators.
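
A minimal sketch of this workaround, assuming scikit-learn and the lightgbm Python package (the data is synthetic); the wrapper simply clones the base estimator and fits one copy per output:

```python
import numpy as np
from sklearn.multioutput import MultiOutputClassifier, MultiOutputRegressor
from lightgbm import LGBMClassifier, LGBMRegressor

rng = np.random.default_rng(0)
X = rng.random((500, 10))
Y_cls = (rng.random((500, 3)) > 0.5).astype(int)   # multi-label targets
Y_reg = rng.random((500, 3))                       # multi-output targets

# One LightGBM model is fitted per column of Y, behind a single estimator API.
clf = MultiOutputClassifier(LGBMClassifier(n_estimators=50)).fit(X, Y_cls)
reg = MultiOutputRegressor(LGBMRegressor(n_estimators=50)).fit(X, Y_reg)

print(clf.predict(X).shape, reg.predict(X).shape)  # (500, 3) (500, 3)
```

Note that this is still the O(n) one-vs-all strategy discussed earlier, not a single multi-output tree learner.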

@StrikerRUS
Collaborator

Closed in favor of #2302. We decided to keep all feature requests in one place.

You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.

@StatMixedML
Contributor

StatMixedML commented Oct 14, 2022

There is a new feature in XGBoost that allows modelling of multiple outputs:

https://xgboost.readthedocs.io/en/stable/tutorials/multioutput.html

Are there any plans to also include this in LightGBM? It would be great, since I could then also implement a multivariate probabilistic framework, similar to Multi-Target XGBoostLSS Regression, which models multiple targets and their dependencies in a probabilistic regression setting.
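
For reference, the linked tutorial boils down to passing a 2-D target directly to the estimator. A minimal sketch, assuming XGBoost >= 1.6 (where native multi-output regression support was added) and synthetic data:

```python
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.random((500, 10))
Y = rng.random((500, 3))  # 3 regression targets

# XGBoost accepts a 2-D target directly; by default it builds one tree
# per target per boosting iteration.
model = XGBRegressor(tree_method="hist", n_estimators=50).fit(X, Y)
print(model.predict(X).shape)  # (500, 3)
```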

@jameslamb
Collaborator

Thanks for the @, @StatMixedML.

Also linking this related conversation from XGBoost that you've been contributing on: dmlc/xgboost#2087

And these related LightGBM conversations:

To answer your question directly... I'm not aware of anyone currently working on adding this support to LightGBM. It has been almost a year since LightGBM's last substantive release, so the small team of mostly-volunteer maintainers here is currently focused on trying to get a year of other improvements and bugfixes out in a new major release (#5153). If you're interested in attempting to add multi-output support to LightGBM we can try to support you with reviews and advice, but at this point can't commit to more than that.

@StatMixedML
Contributor

@jameslamb Kindly asking if there is an update on this?

@jameslamb
Collaborator

Please don't post "is there an update" types of comments here.

We'd welcome your help if you'd like to try to contribute this. Otherwise, you can subscribe to feature requests here to be notified of activity on them.
