
Continuous Treatment Memory Error #558

Closed
ginward opened this issue Oct 31, 2019 · 20 comments

@ginward
Contributor

ginward commented Oct 31, 2019

Description of the bug
I am estimating a causal forest on a large dataset (around 2-3 GB, with 1.6 million observations, 8 independent variables, and 1 dependent variable) with 4000 trees. When I use a binary treatment W, the forest trains fine. However, when I switch to a continuous treatment W, R crashes.
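
For reference, a minimal sketch of the two calls (the names X, Y, W.binary, and W.continuous are illustrative, not my actual code):

library(grf)

# X: ~1.6M x 8 covariate matrix, Y: outcome vector
# Binary treatment: this trains without problems
forest.binary <- causal_forest(X, Y, W.binary, num.trees = 4000)

# Continuous treatment: this crashes R on the same data
forest.continuous <- causal_forest(X, Y, W.continuous, num.trees = 4000)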

I am running it on an HPC with 32 CPUs and 12 GB of RAM per CPU.

The error message is (I masked some of my memory addresses with *):

*** Error in `/PATH/lib/R/bin/exec/R': break adjusted to free malloc space: 0x000******** ***
======= Backtrace: =========
/lib64/libc.so.6(+0x82257)[0x7f5*********]
/lib64/libc.so.6(+0x82cea)[0x7f5*********]
/lib64/libc.so.6(__libc_malloc+0xc7)[0x7f5*********]
/usr/local/Cluster-Apps/dmtcp/dmtcp-2.6.0-intel-17.0.4/lib/dmtcp/libdmtcp_alloc.so(malloc+0x22)[0x7f5*********]
/PATH/lib/R/bin/exec/../../../libstdc++.so.6(_Znwm+0x15)[0x7f5*********]
/PATH/lib/R/library/grf/libs/grf.so(_ZNSt6vectorImSaImEE17_M_default_appendEm+0xc3)[0x7f5*********]
/PATH/lib/R/library/grf/libs/grf.so(_ZNK3grf4Tree15find_leaf_nodesERKNS_4DataERKSt6vectorImSaImEE+0xb7)[0x7f5*********]
/PATH/lib/R/library/grf/libs/grf.so(_ZNK3grf11TreeTrainer21repopulate_leaf_nodesERKSt10unique_ptrINS_4TreeESt14default_deleteIS2_EERKNS_4DataERKSt6vectorImSaImEEb+0xe0)[0x7f5*********]
/PATH/lib/R/library/grf/libs/grf.so(_ZNK3grf11TreeTrainer5trainERKNS_4DataERNS_13RandomSamplerERKSt6vectorImSaImEERKNS_11TreeOptionsE+0x408)[0x7f5*********]
/PATH/lib/R/library/grf/libs/grf.so(_ZNK3grf13ForestTrainer14train_ci_groupERKNS_4DataERNS_13RandomSamplerERKNS_13ForestOptionsE+0x135)[0x7f5*********]
/PATH/lib/R/library/grf/libs/grf.so(_ZNK3grf13ForestTrainer11train_batchEmmRKNS_4DataERKNS_13ForestOptionsE+0x1a9)[0x7f5*********]
/PATH/lib/R/library/grf/libs/grf.so(_ZNSt17_Function_handlerIFSt10unique_ptrINSt13__future_base12_Result_baseENS2_8_DeleterEEvENS1_12_Task_setterIS0_INS1_7_ResultISt6vectorIS0_IN3grf4TreeESt14default_deleteISA_EESaISD_EEEES3_ENSt6thread8_InvokerISt5tupleIJMNS9_13ForestTrainerEKFSF_mmRKNS9_4DataERKNS9_13ForestOptionsEEPKSL_mmSO_SP_EEEESF_EEE9_M_invokeERKSt9_Any_data+0x67)[0x7f5*********]
/PATH/lib/R/library/grf/libs/grf.so(+0x526e9)[0x7f5*********]
/lib64/libpthread.so.0(+0x620b)[0x7f5*********]
/PATH/lib/R/library/grf/libs/grf.so(_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZNSt13__future_base17_Async_state_implINS1_IS2_IJMN3grf13ForestTrainerEKFSt6vectorISt10unique_ptrINS5_4TreeESt14default_deleteIS9_EESaISC_EEmmRKNS5_4DataERKNS5_13ForestOptionsEEPKS6_mmSH_SI_EEEESE_EC4EOSQ_EUlvE_EEEEE6_M_runEv+0x116)[0x7f5*********]
/PATH/lib/R/bin/exec/../../../libstdc++.so.6(+0xc819d)[0x7f5*********]
/usr/local/Cluster-Apps/dmtcp/dmtcp-2.6.0-intel-17.0.4/lib/dmtcp/libdmtcp.so(+0x35eeb)[0x7f5*********]
/lib64/libpthread.so.0(+0x7ea5)[0x7f5*********]
/usr/local/Cluster-Apps/dmtcp/dmtcp-2.6.0-intel-17.0.4/lib/dmtcp/libdmtcp.so(+0x35cad)[0x7f5*********]
/lib64/libc.so.6(clone+0x6d)[0x7f5*********]

Does continuous treatment consume more memory than binary treatment?

GRF version
development

@erikcs
Member

erikcs commented Oct 31, 2019

If you decrease the HPC memory to something insufficient and run with binary W, do you get a different error message than the malloc error above?

@erikcs
Member

erikcs commented Oct 31, 2019

For easier bug analysis, could you please post a simulated example that replicates your error message?

@ginward
Contributor Author

ginward commented Nov 7, 2019

Thanks @erikcs, I will try to generate a simulated example.

But just curious - does continuous treatment generally take more memory than binary treatment?

@ginward
Contributor Author

ginward commented Nov 7, 2019

I tested on a small dataset, and it seems that a binary treatment takes about half the time a continuous treatment takes to run.

I think it probably takes half the memory as well?

I wonder which part of the code is slowing down the performance? @erikcs

@erikcs
Member

erikcs commented Nov 7, 2019

I don't see how the type of W should matter for speed, when what is being passed to the causal forest training is always treated as continuous?

@erikcs erikcs added the bug label Nov 14, 2019
@ginward
Contributor Author

ginward commented Nov 28, 2019

So I have tried the code on machines with different amounts of RAM.

It fails on HPC nodes with 300 GB or 1 TB of RAM.

When I raise the RAM to around 3 TB, it finally runs through. I am not sure what the bottleneck is here - I still need to investigate. But this is probably a memory issue. Not sure if it is related to #187. It always gets stuck in the training phase on machines with less than 1 TB of RAM, and it only happens with continuous treatment.

I have not been able to create a simulated example yet - I guess I would need to simulate a very large dataset. Will keep trying though.

On another note, I also use DMTCP to checkpoint my jobs on the HPC. Not sure if that is related to the issue.

Because 3 TB is actually a lot of RAM ... I could barely find a machine to do the job.

@erikcs
Member

erikcs commented Nov 28, 2019

Have you tried running the forest with num.trees = 1, ci.group.size = 1 and seen how much memory that requires? The memory required should grow roughly linearly with the number of trees, and 4,000 trees with N = 1,600,000 may be in terabyte territory.
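
A rough way to gauge the per-tree footprint (a sketch, using the same data as above; object.size only measures the final fitted object, so peak memory during training may be higher):

library(grf)

# Train a single tree with no CI grouping
small.forest <- causal_forest(X, Y, W, num.trees = 1, ci.group.size = 1)

# Size of the fitted forest object; scale by num.trees for a rough total
print(object.size(small.forest), units = "auto")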

@erikcs
Member

erikcs commented Dec 15, 2019

Could you try the same experiment but with a larger min.node.size, for example 200? (If for some reason there are many more splits with continuous W, the number of nodes could cause the tree vectors to grow unreasonably large.)
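
That is, something along these lines (illustrative, same data and names as above):

# Larger leaves mean fewer splits per tree, so the tree vectors stay smaller
forest.continuous <- causal_forest(X, Y, W.continuous,
                                   num.trees = 4000,
                                   min.node.size = 200)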

@ginward
Contributor Author

ginward commented Dec 16, 2019

Could you try the same experiment but with a larger min.node.size, for example 200? (If for some reason there are many more splits with continuous W, the number of nodes could cause the tree vectors to grow unreasonably large.)

By setting min.node.size larger, does it limit the trees' depth? @erikcs

@ginward
Contributor Author

ginward commented Dec 16, 2019

Actually, it probably limits the number of nodes.

@ginward
Contributor Author

ginward commented Dec 16, 2019

It is actually possible that there are too many nodes, because when I tried to plot the trees with the best-tree function provided in #281, it gives me the following error:

> entry_best_tree=find_best_tree(entry.forest, type = "causal", cost = 1)
Error in asNamespace(ns) : node stack overflow
Calls: find_best_tree ... get_r_loss -> get_r_loss -> r_loss -> mean -> mean.default
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Execution halted

The fact that there is a stack overflow when traversing the trees suggests that there might be far too many nodes in the trees being plotted.

@ginward
Contributor Author

ginward commented Dec 16, 2019

Anyways, I have submitted a new job with min.node.size equal to 100. Let's wait and see what happens.

@ginward
Contributor Author

ginward commented Dec 16, 2019

Looks like raising min.node.size to 100 worked! Maybe the problem is that the number of nodes grows unreasonably large when the treatment is continuous ...

@erikcs
Member

erikcs commented Dec 17, 2019

Ok, thanks for checking

@erikcs erikcs mentioned this issue Jan 9, 2020
@tianshengwang

tianshengwang commented Jan 19, 2020

@ginward
Thanks for developing the best_tree version for instrumental_forest (https://gist.github.com/ginward/451043145ef914f57af5a7272cf02489), but when trying to get the best tree from causal_forest or instrumental_forest on large claims data with this version, I got errors similar to the ones you mentioned above. It seems it has not been fixed yet?

There were 50 or more warnings (use warnings() to see the first 50)

warnings()
Warning messages: 1: In mean.default(Tau.hat[samples]) : argument is not numeric or logical: returning NA

Here I use a simulation to demonstrate the error, which shows it only works for regression_forest, but not for causal_forest or instrumental_forest:

# Simulate data
n <- 5000; p <- 10
X <- matrix(rnorm(n * p), n, p)
Y <- X[, 1] * rnorm(n)
W <- rbinom(n, 1, 0.4 + 0.2 * (X[, 1] > 0))
Z <- rbinom(n, 1, 0.5)

# Fit the three forest types
forest_regression   <- grf::regression_forest(X, Y, min.node.size = 200, seed = 12345)
forest_causal       <- grf::causal_forest(X, Y, W, min.node.size = 200, seed = 12345)
forest_instrumental <- grf::instrumental_forest(X, Y, W, Z, min.node.size = 200, seed = 12345)

# Find the best tree (this works)
best_tree_info <- find_best_tree(forest_regression, "regression")
best_tree_info$best_tree

# Find the best tree (these fail with the warnings above)
best_tree_info <- find_best_tree(forest_causal, "causal")
best_tree_info$best_tree

best_tree_info <- find_best_tree(forest_instrumental, "instrumental")
best_tree_info$best_tree

thanks
Tian

@ginward
Contributor Author

ginward commented Jan 19, 2020

@tianshengwang Have you tried setting the minimum node size to, say, 500?

I think continuous treatment expands the trees too much and creates too many nodes, thus causing a stack overflow.

@tianshengwang

@tianshengwang Have you tried setting the minimum node size to, say, 500?

I think continuous treatment expands the trees too much and creates too many nodes, thus causing a stack overflow.

You mean increase the min.node.size? I used 500, 1000, and 2000; it still doesn't work for causal_forest or instrumental_forest.

@ginward
Contributor Author

ginward commented Jan 19, 2020

@tianshengwang It works on my side.

There were 50 or more warnings (use warnings() to see the first 50)
> best_tree_info$best_tree
[1] 922

I have ignored the warnings.

@tianshengwang

@tianshengwang It works on my side.

There were 50 or more warnings (use warnings() to see the first 50)
> best_tree_info$best_tree
[1] 922

I have ignored the warnings.

Got it, thanks!

@jtibshirani
Member

Closing this out because we didn't identify a bug and you seem to have found a way forward.
