
Add regression tree #2905

Closed (118 commits)

Conversation

@RishabhGarg108 (Member) commented Apr 5, 2021

This PR attempts to add regression tree support to mlpack. Relevant discussion #2619.
This is going to be a long PR, so I will hopefully divide it into multiple parts.

The following checklist will broadly keep track of the PR.

  • MAD gain
  • MSE gain
  • All categorical splitter
  • Best numeric splitter
  • Random numeric splitter
  • Add DecisionTreeRegressor class

@RishabhGarg108 (Member, Author) commented Apr 7, 2021

Hello @rcurtin, this is a follow-up from our discussion on IRC; I will move forward as discussed there. Another thing I want to clarify is how we should modify the constructors for the class. The current implementation was written for classification, so it requires numClasses as an argument, but that doesn't apply to the regression case.

There are two approaches I have in mind to solve it.

  1. One is to add 6 new constructors without the numClasses argument, and in those constructors call the corresponding Train() function.
  2. Another is to set numClasses = 0 by default. If the user needs to do classification, they will need to set it to some non-zero value, and in the default case it does regression. Inside the constructors, we can add an if statement that calls the corresponding Train() function based on whether numClasses is zero or not (see the sketch below).
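For illustration, a minimal sketch of approach 2, with all names hypothetical (the real constructors also take the data, the responses, and several other parameters):

  #include <cstddef>
  #include <iostream>

  class TreeSketch
  {
   public:
    // numClasses defaults to 0, which selects the regression path; any
    // non-zero value selects the classification path.
    explicit TreeSketch(const std::size_t numClasses = 0)
    {
      if (numClasses == 0)
        TrainRegression();
      else
        TrainClassification(numClasses);
    }

   private:
    void TrainRegression() { std::cout << "regression\n"; }
    void TrainClassification(const std::size_t) { std::cout << "classification\n"; }
  };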

From the end user's perspective, both are almost the same, but which is better from the perspective of design principles and the existing codebase?

PS: I have accumulated some other ideas too, which I compiled here. They may provide better insight into my thinking.

@rcurtin (Member) commented Apr 8, 2021

Thanks for the clear writeup! In fact it is sufficiently comprehensive that I don't have anything to add. I agree with your preference---creating a separate class. Where possible, I would suggest sharing functionality between DecisionTree<> and DecisionTreeRegressor<>. So, it may make sense to factor out the overload of Train() that actually trains the tree into a generic standalone function that can be used by both classes. 👍
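A minimal sketch of that factoring idea (all names hypothetical; the real Train() overloads work on Armadillo matrices and take many more parameters):

  #include <cstddef>
  #include <vector>

  // The shared split/recursion logic lives in one free function template;
  // both tree classes call it with their own response types and fitness
  // functions.
  template<typename ResponsesType, typename FitnessFunction>
  double TrainTreeImpl(const std::vector<double>& data,
                       const ResponsesType& responses,
                       FitnessFunction fitness)
  {
    // Real code would find the best split and recurse; here we just score
    // the node so the sketch stays small.
    (void) data;
    return fitness(responses);
  }

  struct DecisionTreeSketch
  {
    // Classification: responses are class labels.
    double Train(const std::vector<double>& data,
                 const std::vector<std::size_t>& labels)
    {
      return TrainTreeImpl(data, labels,
          [](const std::vector<std::size_t>& l)
          { return static_cast<double>(l.size()); });
    }
  };

  struct DecisionTreeRegressorSketch
  {
    // Regression: responses are real values.
    double Train(const std::vector<double>& data,
                 const std::vector<double>& responses)
    {
      return TrainTreeImpl(data, responses,
          [](const std::vector<double>& r)
          { return static_cast<double>(r.size()); });
    }
  };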

Review thread on the diff:

   * @param minimum The minimum number of elements in a leaf.
   */
  template<bool UseWeights, typename ResponsesType, typename WeightVecType>
  void CalculateStatistics(const ResponsesType& responses,

Member commented:

Nice, this refactoring looks great! I have a few comments and they mostly have to do with naming and API simplicity. I think they would all be simple refactorings. Mostly the thought I have here is "how can we future-proof this API for other split types?" and also "how can we make it as general as possible, so we don't have to change it later?"

The first suggestion would be to prefix these functions' names with something like Binary, since these three functions (CalculateStatistics(), UpdateStatistics(), and the new overload of Evaluate()) are specific to the case where we are looking to find the best binary split by scanning an entire array.

Next, I see that you're calling Evaluate() with two different indices to get the left and right gain. But I wonder if it would be simpler to just return a std::tuple<double, double> (i.e. both gains at once), since the strategy we are using here is restricted to a binary split. (It seems possible but very nontrivial to generalize to a more-than-binary split, and we don't need it for our purposes anyway...) Another way to achieve the same thing would be to take two double&s that you set to the left and right gains in the function. That might "look" more like other functions inside of mlpack.

If you did return both gains at once, then actually it would be possible to simplify further and combine UpdateStatistics() with this new function that computes both gains.

So, for instance, we might have two functions like BinaryScanInitialize() (I believe this actually does not need any parameters---more comments below) and BinaryGains(const ResponsesType& responses, const WeightVecType& weights, const size_t splitIndex, double& leftGain, double& rightGain).
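For concreteness, a sketch of what that second suggested signature might look like (placeholder computation only; a real version would use the running statistics accumulated during the scan):

  #include <cstddef>
  #include <vector>

  // Hypothetical BinaryGains(): returns both gains at once through output
  // references instead of calling Evaluate() twice.
  template<typename ResponsesType, typename WeightVecType>
  void BinaryGains(const ResponsesType& responses,
                   const WeightVecType& weights,
                   const std::size_t splitIndex,
                   double& leftGain,
                   double& rightGain)
  {
    // Placeholder values; a real implementation would compute both gains
    // from statistics accumulated up to splitIndex.
    (void) weights;
    leftGain = -static_cast<double>(splitIndex);
    rightGain = -static_cast<double>(responses.size() - splitIndex);
  }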

Those are just some ideas... let me know what you think. Like any design, part of it is personal preference, so, don't feel obligated to take every suggestion.

Member Author replied:

Take a look at https://github.com/RishabhGarg108/mlpack-1/blob/ad64ad07d717c2b2d19be7bac82be03a2329071b/src/mlpack/methods/decision_tree/best_binary_numeric_split_impl.hpp#L407, lines 407 to 417. First we update the statistics, then we check whether we can skip the gain computation for a particular index, and only then do we evaluate the gain and do the rest.

Now, if we want to combine the UpdateStatistics() and Evaluate() methods, we would have to move the skip check into that function too, and it would be ugly to continue the loop from inside it: the function would have to set a flag when the skip condition holds, and the loop would need an extra condition to continue based on that flag.
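A simplified, self-contained version of that loop shape (updateStatistics/evaluate are stand-in stubs, not the actual mlpack code):

  #include <algorithm>
  #include <cstddef>
  #include <vector>

  double ScanForBestSplit(const std::vector<double>& sortedData,
                          const std::vector<double>& responses)
  {
    auto updateStatistics = [](std::size_t) { /* fold point into running sums */ };
    auto evaluate = [&](std::size_t i) { return -responses[i]; };

    double bestGain = -1e300;
    for (std::size_t index = 1; index < sortedData.size(); ++index)
    {
      updateStatistics(index);

      // Equal adjacent values give no new split point, so skip the gain
      // computation; merging this check into a combined update-and-evaluate
      // function would require threading a flag back out of it.
      if (sortedData[index] == sortedData[index - 1])
        continue;

      bestGain = std::max(bestGain, evaluate(index));
    }
    return bestGain;
  }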

So, I think we can keep them separated. Let me know if I overlooked something or if there is some other way to achieve this.

One thing that we can definitely do is to return a tuple from the Evaluate method. 👍

Also, I am not that good at naming functions. Can you suggest the exact names you would like for these functions? :)

Member replied:

Ahh, the only thing I was thinking is that the update on line 407 could be removed, but then the function that computes the gain would need to allow advancing the index by many points at once (as described here). But I do think either way is fine, so, up to you.

For names I might suggest:

  • BinaryScanInitialize()
  • BinaryStep() for UpdateStatistics() (since it 'step's one index at a time)
  • BinaryGains() to compute the left and right gains

Let me know what you think. 👍

Member Author replied:

The names make sense. Thanks :)

src/mlpack/methods/decision_tree/mse_gain.hpp: review thread resolved (outdated)

Review thread on the diff:

  if (UseWeights)
  {
    const WType w = weights[index];

Member commented:

If the user does not pass index as 1 greater than the previous index that was used when UpdateStatistics() was called, this could give an incorrect result. I wonder if it might be better to internally store lastIndex (initialized to 0), and loop over all values between lastIndex + 1 and index (inclusive) to update the value, instead of just index.
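A sketch of that suggestion (hypothetical class; the real statistics are more involved):

  #include <cstddef>
  #include <vector>

  class RunningStats
  {
   public:
    RunningStats() : lastIndex(0), weightedSum(0.0) { }

    // Fast-forward: fold in every point between the last updated index and
    // the requested one, so a caller that skips ahead still gets a correct
    // result.
    void UpdateTo(const std::vector<double>& responses,
                  const std::vector<double>& weights,
                  const std::size_t index)
    {
      for (std::size_t i = lastIndex + 1; i <= index; ++i)
        weightedSum += weights[i] * responses[i];
      lastIndex = index;
    }

    double WeightedSum() const { return weightedSum; }

   private:
    std::size_t lastIndex;
    double weightedSum;
  };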

Member Author replied:

I think it is just fine because we don't expect the user to call this function directly.

Moreover, iterating this way allows us to check whether the data value has changed since the last index, which lets us skip computing the gain for indexes where the value doesn't change. I don't deny that this could also be done the way you are suggesting, but something similar would still be needed there to skip those indexes. So, I think this is okay the way it is now. Let me know if it doesn't make sense :)

Member replied:

Yeah, agreed, there are a couple ways to do it---either you have two functions, one of which takes a step but doesn't return the gain, and the other of which takes a step and returns the gain (this is the way you have it now), or you have one function that allows 'fast forwarding', e.g., taking possibly multiple steps and returning the gain of the last one.

(By 'step' there I mean 'increase the index'.)

Up to you which you want to go with---personally, I think just one function is cleaner, but, I can see advantages and disadvantages to both approaches. (They are pretty minor tradeoffs though.)

src/mlpack/methods/decision_tree/mse_gain.hpp: review thread resolved
@RishabhGarg108 changed the title from "[WIP] Add regression tree" to "Add regression tree" on Jul 7, 2021
@rcurtin (Member) left a comment:

Everything is looking good to me here! I have a couple small comments---some little style issues, but I think there is an SFINAE bug (it should be easy to fix).

It looks like there is a merge conflict---can you try merging master into this branch? Alternately, I know the history is a little messed up in this branch already, so it might be worth creating a new branch, cherry-picking the relevant commits from this branch, and then opening a new PR. Either way should work, I think. 👍

HISTORY.md: review thread resolved
Review thread on the diff:

   */
  template<bool UseWeights, typename VecType, typename ResponsesType,
           typename WeightVecType>
  typename std::enable_if<

Member commented:

This one isn't marked static, but I think it should be. However, when you do this, I expect you will get an ambiguous function call compilation error, because MSEGain can match both this overload and the other one. Thus, you need to use std::enable_if<> with the other overload too, with the negated conditional: std::enable_if<!HasBinaryScanInitialize ... || !HasBinaryStep ..., double>.

Note also you can use SFINAE as default-valued arguments to the function, like this:

  SplitIfBetter(
      const double bestGain,
      const VecType& data,
      const ResponsesType& responses,
      const WeightVecType& weights,
      const size_t minimumLeafSize,
      const double minimumGainSplit,
      double& splitInfo,
      AuxiliarySplitInfo& /* aux */,
      typename std::enable_if<..., void>::type* = 0);

But, to my knowledge they both work the same way, so it doesn't make a difference which you want to use.
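A self-contained sketch of the complementary-condition pattern (HasFastPath stands in for the HasBinaryScanInitialize/HasBinaryStep detection; all names hypothetical). Because the two conditions are negations of each other, exactly one overload is viable for any F, so marking both static never produces an ambiguous call:

  #include <type_traits>
  #include <utility>

  // Detect whether F has a member function BinaryStep().
  template<typename F, typename = void>
  struct HasFastPath : std::false_type { };

  template<typename F>
  struct HasFastPath<F, std::void_t<decltype(std::declval<F&>().BinaryStep())>>
      : std::true_type { };

  struct SplitSketch
  {
    // Used when F provides the incremental scan machinery.
    template<typename F>
    static typename std::enable_if<HasFastPath<F>::value, double>::type
    SplitIfBetter(F& fitness)
    {
      fitness.BinaryStep();
      return 1.0;
    }

    // Fallback for fitness functions without it; the negated condition
    // keeps this overload out of consideration in the first case.
    template<typename F>
    static typename std::enable_if<!HasFastPath<F>::value, double>::type
    SplitIfBetter(F& /* fitness */)
    {
      return 0.0;
    }
  };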

Member Author replied:

Hey @rcurtin, I added the static keyword to the function and it compiled. But when I added the negated condition to the regular overload, I got an error that "no matching overload of SplitIfBetter could be found". Can you please take a look at what is going wrong here?

@rcurtin (Member) commented Jul 15, 2021

Awesome---let's continue this in #3011.
