-
My random forest is being used to calculate predictions in a Shiny app, but the forest object is very large. I'm not sure how the terminal-node quantities are stored on the backend in
-
So we do have ways of extracting terminal node information from the forest object on the R side. When we care less about the time it takes to create a model, but want rapid prediction, we typically save terminal node information in a fatter R-side object. This lets us restore the topology of the forest together with the terminal node quantities, without having to re-send the training data down the forest during model restoration and prediction. The downside is that the forest object is large; the upside is that terminal node restoration occurs on an as-needed, just-in-time basis. If you only have one individual you want to send down the forest, we restore only the path from the root to the terminal node associated with that individual. Once we know the terminal node identifier, we access the previously saved information for that terminal node and calculate the predicted value.

What is clear here is that we need both the forest topology and the terminal node predicted values. We don't have a way of storing the split information (i.e. the topology) and the terminal node quantities separately in the way you suggest.

This discussion somewhat reminds me of an issue where a user was complaining about slow prediction times. It may not help you, but here is the link:

I think it would be worth pursuing the anonymous forest object approach above and then reevaluating your prediction times. It would also help if you told us something about your data and what family it comprises. Dimensions like n, p, ntree, and the family would help.
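A minimal sketch of what the reply describes, assuming the randomForestSRC package: the `terminal.qualts`/`terminal.quants` options ask `rfsrc()` to store terminal-node information in the (fatter) R-side forest object, and `rfsrc.anonymous()` grows a forest that does not retain the training data. The dataset, formula, and `ntree` value here are placeholders, not taken from the original thread.

```r
library(randomForestSRC)

## Fatter forest object: save terminal-node qualitative and quantitative
## information at grow time, so prediction restores terminal nodes
## just-in-time without re-sending the training data down the forest.
o <- rfsrc(Ozone ~ ., data = na.omit(airquality),
           ntree = 500,
           terminal.qualts = TRUE,
           terminal.quants = TRUE)

## Sending one individual down the forest only restores the paths from
## the roots to that individual's terminal nodes.
p <- predict(o, newdata = na.omit(airquality)[1, , drop = FALSE])
p$predicted

## Anonymous forest: the returned object carries the topology and
## terminal-node summaries but not the original training rows, which can
## shrink what the Shiny app needs to hold.
a  <- rfsrc.anonymous(Ozone ~ ., data = na.omit(airquality), ntree = 500)
pa <- predict(a, newdata = na.omit(airquality)[1, , drop = FALSE])
pa$predicted
```

Timing both `predict()` calls (e.g. with `system.time()`) against your current setup would show whether the anonymous-forest approach improves your prediction latency.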