
Increasing the number of trees in the random forest model does not improve performance #1887

Closed
pensivevoice opened this issue Apr 30, 2019 · 13 comments

@pensivevoice

commented Apr 30, 2019

Issue description

Increasing the number of trees in the random forest model does not improve performance. It is as if all the trees in the forest are equal. Is there a setting to turn on randomness that I am missing?
RandomForestNumTreesPotentialBug.txt

Your environment

  • version of mlpack: 3.0.4
  • operating system: Windows 10
  • compiler: Visual Studio 2015
  • version of dependencies (Boost/Armadillo): 1_66_0/7.800.2
  • any other environment information you think is relevant:

Steps to reproduce

Create one random forest model with numTrees = 2 and another with numTrees = 10. Keep all other parameters constant. Training performance (e.g. accuracy) does not change.
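For reference, the reproduction can be sketched roughly as follows. This is a hedged sketch against the mlpack 3.x `RandomForest` API, not the attached program; the data and label filenames are placeholders.

```cpp
// Minimal sketch of the reproduction, assuming the mlpack 3.x
// RandomForest API; "data.csv" and "labels.csv" are placeholder files.
#include <mlpack/core.hpp>
#include <mlpack/methods/random_forest/random_forest.hpp>

using namespace mlpack::tree;

int main()
{
  arma::mat data;
  arma::Row<size_t> labels;
  mlpack::data::Load("data.csv", data, true);
  mlpack::data::Load("labels.csv", labels, true);
  const size_t numClasses = 2;

  // Two forests differing only in the number of trees.
  RandomForest<> rfSmall(data, labels, numClasses, /* numTrees */ 2);
  RandomForest<> rfLarge(data, labels, numClasses, /* numTrees */ 10);

  // In the buggy version, both report identical training accuracy.
  arma::Row<size_t> predSmall, predLarge;
  rfSmall.Classify(data, predSmall);
  rfLarge.Classify(data, predLarge);
  std::cout << "2 trees:  "
            << arma::accu(predSmall == labels) / (double) labels.n_elem << "\n"
            << "10 trees: "
            << arma::accu(predLarge == labels) / (double) labels.n_elem << "\n";
}
```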

Expected behavior

Training performance should improve when the number of trees increases since the more complex model is better able to overfit the training data.

Actual behavior

Training performance does not change.


@MuLx10

Member

commented May 3, 2019

@pensivevoice thanks for opening the issue. I think there might be a bug. I will let you know if I find any.

@MuLx10

Member

commented May 4, 2019

I guess #1891 resolves the issue.

@rcurtin

Member

commented May 4, 2019

Right, thanks for pointing it out @MuLx10. I should have updated this too. I ended up having some time to look into this yesterday after an IRC discussion and I think I found what the issue was. Sorry if I stepped on your toes a bit while you were debugging.

@pensivevoice If you can test that #1891 fixes the issue, it would be really helpful. 👍

@pensivevoice

Author

commented May 7, 2019

Thanks so much for looking into it. Changing the number of trees now makes a difference, but not in the expected way. Training performance follows an inverted-U shape: it peaks at some number of trees instead of continuing to increase. The peak accuracy is also low. This is not what you would expect from a random forest model with enough trees and a small minimum leaf size.

@pensivevoice

Author

commented May 7, 2019

In case it helps, this is the output I obtained after running the attached program. Here Type indicates training data (as opposed to test data), Trees is the number of trees in the forest, MLSz is the minimum leaf size parameter, Acc stands for accuracy, and Prec for precision.
RandomForestNumTreesBug2.txt

Type, Trees, MLSz, Acc, Prec, Recall, F1
Train, 2, 1, 0.876, 0.864, 0.954, 0.906486
Train, 10, 1, 0.873, 0.834, 0.997, 0.908171
Train, 20, 1, 0.868, 0.827, 1.000, 0.905172
Train, 40, 1, 0.880, 0.840, 1.000, 0.913043
Train, 100, 1, 0.878, 0.838, 1.000, 0.911722
Train, 200, 1, 0.878, 0.838, 1.000, 0.911722

@rcurtin

Member

commented May 8, 2019

You're right, there is something else wrong here. I am currently digging into the issue.

@pensivevoice

Author

commented May 8, 2019

It was pointed out to me that I should use the MultipleRandomDimensionSelect policy instead of RandomDimensionSelect. With that change, the random forest model seems to work as expected. I am attaching a sample C++ program that illustrates how to use the random forest model so that it performs as most users would expect. The results now look good. Once again, thanks for looking into this issue.

Dataset german.csv has 23 features; floor(sqrt(23)) = 4 dimensions will be considered at each split.

Type, Trees, MLSz, Acc, Prec, Recall, F1
Train, 2, 1, 0.923000, 0.971039, 0.904762, 0.936730
Train, 10, 1, 0.985000, 0.978227, 0.998413, 0.988217
Train, 20, 1, 0.999000, 0.998415, 1.000000, 0.999207
Train, 40, 1, 0.998000, 0.996835, 1.000000, 0.998415
Train, 100, 1, 0.999000, 0.998415, 1.000000, 0.999207
Train, 200, 1, 0.998000, 0.996835, 1.000000, 0.998415
RandomForestExample.txt
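For readers hitting the same behavior, the workaround described above can be sketched as follows. This is a hedged sketch: the exact `RandomForest` constructor signature (in particular whether a dimension selector instance can be passed, and in which position) varies across mlpack versions, so check the headers of your release before relying on it.

```cpp
// Sketch of using MultipleRandomDimensionSelect so that roughly sqrt(d)
// random dimensions are considered at each split, as most random forest
// implementations do by default. Assumes a recent mlpack RandomForest API;
// the constructor's parameter list may differ in older releases.
#include <cmath>
#include <mlpack/core.hpp>
#include <mlpack/methods/random_forest/random_forest.hpp>

using namespace mlpack::tree;

void TrainForest(const arma::mat& data, const arma::Row<size_t>& labels,
                 const size_t numClasses)
{
  // With 23 features, consider floor(sqrt(23)) = 4 dimensions per split.
  const size_t dims = (size_t) std::sqrt((double) data.n_rows);
  MultipleRandomDimensionSelect selector(dims);

  RandomForest<GiniGain, MultipleRandomDimensionSelect> rf(
      data, labels, numClasses, /* numTrees */ 100, /* minimumLeafSize */ 1,
      /* minimumGainSplit */ 1e-7, /* maximumDepth */ 0, selector);
}
```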

@rcurtin

Member

commented May 16, 2019

Hey @pensivevoice, sorry for the slow response. I've been looking into it for some time and I think a fix is prepared in #1891. You can give that a shot and let me know if it works. Once that is merged we'll release 3.1.1 with the fix.

And yeah, using MultipleRandomDimensionSelect with a large number of dimensions should help in the existing code. However the PR has some additional fixes that I think will help further. 👍

@rcurtin rcurtin removed the s: unanswered label May 16, 2019

@pensivevoice

Author

commented May 17, 2019

Hi @MuLx10, @rcurtin. Thanks for making time to deal with this issue. Here are my results.

Validation: The mlpack results were validated against scikit-learn on a binary classification task using 100 bootstrap datasets and their complements (the samples not drawn into the bootstrap set). I trained on each bootstrap dataset and used its complement as the test set. The hyperparameters were: 50 trees, a minimum leaf size of 2, and the number of dimensions considered per split set to the square root of the number of features. For this last parameter I used MultipleRandomDimensionSelect(sqrt(number_of_features)) in mlpack. The mean squared error on each test set was the performance measure.

Results: Plotting mlpack versus scikit-learn MSE produced points along the 45-degree line. It was very interesting that both implementations had a "hard time" on the same datasets. The difference of the means was 3e-4 and all differences were within 1e-3; in summary, no statistically significant difference was detected between the implementations. A speed-up of 2.7x was also observed: what took 2.7 units of time before now takes 1 unit.
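The bootstrap/out-of-bag split described above can be sketched in plain C++ (standard library only; in the actual experiment the drawn indices would select columns of the data matrix):

```cpp
// Sketch of a bootstrap / out-of-bag split: draw n indices with
// replacement for training; every index never drawn forms the
// complement ("out-of-bag" set), used here as the test set.
#include <cstddef>
#include <random>
#include <set>
#include <utility>
#include <vector>

std::pair<std::vector<std::size_t>, std::vector<std::size_t>>
BootstrapSplit(std::size_t n, std::mt19937& rng)
{
  std::uniform_int_distribution<std::size_t> dist(0, n - 1);
  std::vector<std::size_t> train(n);
  std::set<std::size_t> drawn;
  for (auto& i : train)
  {
    i = dist(rng);      // sample with replacement
    drawn.insert(i);
  }

  std::vector<std::size_t> test;
  for (std::size_t i = 0; i < n; ++i)
    if (!drawn.count(i))
      test.push_back(i);  // never drawn -> out-of-bag
  return { train, test };
}
```

On average about 1/e (roughly 37%) of the samples end up out-of-bag, so each of the 100 repetitions yields a reasonably sized test set.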

@rcurtin

Member

commented May 17, 2019

@pensivevoice nice results, glad that the fixes helped. Any chance you could share the plot?

@pensivevoice

Author

commented May 17, 2019

[Plot: test-set MSE, mlpack versus scikit-learn]

@rcurtin

Member

commented May 23, 2019

With #1891 merged, and your feedback, I think that this issue is solved. Once I see that the builds are passing I will release mlpack 3.1.1 with the fix. :) Thanks again for the report! 👍

@rcurtin rcurtin closed this May 23, 2019

@rcurtin rcurtin added the s: fixed label May 23, 2019

@pensivevoice

Author

commented May 23, 2019

Thanks for fixing it!
