
Continuous Treatment Memory Error #558

Closed
ginward opened this issue Oct 31, 2019 · 20 comments

@ginward
Contributor

ginward commented Oct 31, 2019

Description of the bug
I am estimating a causal forest on a large dataset (around 2-3 GB, with 1.6 million observations, 8 independent variables, and 1 dependent variable) with 4000 trees. When I use a binary treatment W, the forest trains fine. However, when I switch to a continuous treatment W, R crashes.
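
For reference, a minimal sketch of the two calls (the names X, Y, W.binary, and W.continuous are illustrative, not my actual code):

library(grf)

# X: ~1.6M x 8 covariate matrix, Y: outcome vector
# Binary treatment: this trains without problems
forest.binary <- causal_forest(X, Y, W.binary, num.trees = 4000)

# Continuous treatment: this crashes R on the same data
forest.continuous <- causal_forest(X, Y, W.continuous, num.trees = 4000)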

I am running it on an HPC with 32 CPUs and 12 GB of RAM per CPU.

The error message is (I masked some of my memory addresses with *):

*** Error in `/PATH/lib/R/bin/exec/R': break adjusted to free malloc space: 0x000******** ***
======= Backtrace: =========
/lib64/libc.so.6(+0x82257)[0x7f5*********]
/lib64/libc.so.6(+0x82cea)[0x7f5*********]
/lib64/libc.so.6(__libc_malloc+0xc7)[0x7f5*********]
/usr/local/Cluster-Apps/dmtcp/dmtcp-2.6.0-intel-17.0.4/lib/dmtcp/libdmtcp_alloc.so(malloc+0x22)[0x7f5*********]
/PATH/lib/R/bin/exec/../../../libstdc++.so.6(_Znwm+0x15)[0x7f5*********]
/PATH/lib/R/library/grf/libs/grf.so(_ZNSt6vectorImSaImEE17_M_default_appendEm+0xc3)[0x7f5*********]
/PATH/lib/R/library/grf/libs/grf.so(_ZNK3grf4Tree15find_leaf_nodesERKNS_4DataERKSt6vectorImSaImEE+0xb7)[0x7f5*********]
/PATH/lib/R/library/grf/libs/grf.so(_ZNK3grf11TreeTrainer21repopulate_leaf_nodesERKSt10unique_ptrINS_4TreeESt14default_deleteIS2_EERKNS_4DataERKSt6vectorImSaImEEb+0xe0)[0x7f5*********]
/PATH/lib/R/library/grf/libs/grf.so(_ZNK3grf11TreeTrainer5trainERKNS_4DataERNS_13RandomSamplerERKSt6vectorImSaImEERKNS_11TreeOptionsE+0x408)[0x7f5*********]
/PATH/lib/R/library/grf/libs/grf.so(_ZNK3grf13ForestTrainer14train_ci_groupERKNS_4DataERNS_13RandomSamplerERKNS_13ForestOptionsE+0x135)[0x7f5*********]
/PATH/lib/R/library/grf/libs/grf.so(_ZNK3grf13ForestTrainer11train_batchEmmRKNS_4DataERKNS_13ForestOptionsE+0x1a9)[0x7f5*********]
/PATH/lib/R/library/grf/libs/grf.so(_ZNSt17_Function_handlerIFSt10unique_ptrINSt13__future_base12_Result_baseENS2_8_DeleterEEvENS1_12_Task_setterIS0_INS1_7_ResultISt6vectorIS0_IN3grf4TreeESt14default_deleteISA_EESaISD_EEEES3_ENSt6thread8_InvokerISt5tupleIJMNS9_13ForestTrainerEKFSF_mmRKNS9_4DataERKNS9_13ForestOptionsEEPKSL_mmSO_SP_EEEESF_EEE9_M_invokeERKSt9_Any_data+0x67)[0x7f5*********]
/PATH/lib/R/library/grf/libs/grf.so(+0x526e9)[0x7f5*********]
/lib64/libpthread.so.0(+0x620b)[0x7f5*********]
/PATH/lib/R/library/grf/libs/grf.so(_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZNSt13__future_base17_Async_state_implINS1_IS2_IJMN3grf13ForestTrainerEKFSt6vectorISt10unique_ptrINS5_4TreeESt14default_deleteIS9_EESaISC_EEmmRKNS5_4DataERKNS5_13ForestOptionsEEPKS6_mmSH_SI_EEEESE_EC4EOSQ_EUlvE_EEEEE6_M_runEv+0x116)[0x7f5*********]
/PATH/lib/R/bin/exec/../../../libstdc++.so.6(+0xc819d)[0x7f5*********]
/usr/local/Cluster-Apps/dmtcp/dmtcp-2.6.0-intel-17.0.4/lib/dmtcp/libdmtcp.so(+0x35eeb)[0x7f5*********]
/lib64/libpthread.so.0(+0x7ea5)[0x7f5*********]
/usr/local/Cluster-Apps/dmtcp/dmtcp-2.6.0-intel-17.0.4/lib/dmtcp/libdmtcp.so(+0x35cad)[0x7f5*********]
/lib64/libc.so.6(clone+0x6d)[0x7f5*********]

Does continuous treatment consume more memory than binary treatment?

GRF version
development

@erikcs
Member

erikcs commented Oct 31, 2019

If you decrease the HPC memory to something insufficient and run with binary W, do you get a different error message than the malloc error above?

@erikcs
Member

erikcs commented Oct 31, 2019

For easier bug analysis, could you please post a simulated example that replicates your error message?

@ginward
Contributor Author

ginward commented Nov 7, 2019

Thanks @erikcs, I will try to generate a simulated example.

But just curious - does continuous treatment generally take more memory than binary treatment?

@ginward
Contributor Author

ginward commented Nov 7, 2019

I tested on a small dataset, and it seems that a binary treatment takes about half the time a continuous treatment takes to run.

I think it probably takes half the memory as well?

I wonder which part of the code is slowing down the performance? @erikcs

@erikcs
Member

erikcs commented Nov 7, 2019

I don't see how the type of W should matter for speed, when what is being passed to the causal forest training is always treated as continuous?

@erikcs erikcs added the bug label Nov 14, 2019
@ginward
Contributor Author

ginward commented Nov 28, 2019

So I have tried the code on machines with different amounts of RAM.

It fails on HPC nodes with 300 GB or 1 TB of RAM.

When I raise the RAM to around 3 TB, it finally runs through. I am not sure what the bottleneck is here - I still need to investigate. But this is probably a memory issue. Not sure if it is related to #187. It always gets stuck in the training phase on machines with less than 1 TB of RAM, and it only happens with continuous treatment.

I have not been able to create a simulated example yet - I guess I would need to simulate a very large dataset. Will keep trying though.

On another note, I also use DMTCP to checkpoint my jobs on the HPC. Not sure if that is related to the issue.

Because 3 TB is actually a lot of RAM ... I could barely find a machine to do the job.

@erikcs
Member

erikcs commented Nov 28, 2019

Have you tried running the forest with num.trees = 1, ci.group.size = 1 and seen how much memory that requires? The memory required should grow roughly linearly with the number of trees, and 4,000 trees with N = 1,600,000 may be in terabyte territory.
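
A rough way to gauge the per-tree footprint (a sketch, using the same data as above; object.size only measures the final fitted object, so peak memory during training may be higher):

library(grf)

# Train a single tree with no CI grouping
small.forest <- causal_forest(X, Y, W, num.trees = 1, ci.group.size = 1)

# Size of the fitted forest object; scale by num.trees for a rough total
print(object.size(small.forest), units = "auto")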

@erikcs
Member

erikcs commented Dec 15, 2019

Could you try the same experiment but with a larger min.node.size, for example 200? (If for some reason there are many more splits with continuous W, the number of nodes could cause the tree vectors to grow unreasonably large.)
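
That is, something along these lines (illustrative, same data and names as above):

# Larger leaves mean fewer splits per tree, so the tree vectors stay smaller
forest.continuous <- causal_forest(X, Y, W.continuous,
                                   num.trees = 4000,
                                   min.node.size = 200)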

@ginward
Contributor Author

ginward commented Dec 16, 2019

Could you try the same experiment but with a larger min.node.size, for example 200? (If for some reason there are many more splits with continuous W, the number of nodes could cause the tree vectors to grow unreasonably large.)

By setting min.node.size larger, does it limit the trees' depth? @erikcs

@ginward
Contributor Author

ginward commented Dec 16, 2019

Actually, it probably limits the number of nodes.

@ginward
Contributor Author

ginward commented Dec 16, 2019

It is actually possible that there are too many nodes, because when I tried to plot the trees with the best-tree function provided in #281, it gives me the following error:

> entry_best_tree=find_best_tree(entry.forest, type = "causal", cost = 1)
Error in asNamespace(ns) : node stack overflow
Calls: find_best_tree ... get_r_loss -> get_r_loss -> r_loss -> mean -> mean.default
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Execution halted

The fact that there is a stack overflow when traversing the trees suggests that there might be far too many nodes in the trees being plotted.

@ginward
Contributor Author

ginward commented Dec 16, 2019

Anyways, I have submitted a new job with min.node.size equal to 100. Let's wait and see what happens.

@ginward
Contributor Author

ginward commented Dec 16, 2019

Looks like raising min.node.size to 100 worked! Maybe the problem is that the number of nodes grows unreasonably large when the treatment is continuous ...

@erikcs
Member

erikcs commented Dec 17, 2019

Ok, thanks for checking

@erikcs erikcs mentioned this issue Jan 9, 2020
@tianshengwang

tianshengwang commented Jan 19, 2020

@ginward
Thanks for developing the best_tree version for instrumental_forest (https://gist.github.com/ginward/451043145ef914f57af5a7272cf02489), but when trying to get the best tree from causal_forest or instrumental_forest on large claims data with this version, I got errors similar to the ones you mentioned above. It seems it has not been fixed yet?

There were 50 or more warnings (use warnings() to see the first 50)

warnings()
Warning messages: 1: In mean.default(Tau.hat[samples]) : argument is not numeric or logical: returning NA

Here I use a simulation to demonstrate the error, which shows it only works for regression_forest, but not for causal_forest or instrumental_forest:

# Simulate data
n <- 5000; p <- 10
X <- matrix(rnorm(n * p), n, p)
Y <- X[, 1] * rnorm(n)
W <- rbinom(n, 1, 0.4 + 0.2 * (X[, 1] > 0))
Z <- rbinom(n, 1, 0.5)

# Fit the three forest types
forest_regression   <- grf::regression_forest(X, Y, min.node.size = 200, seed = 12345)
forest_causal       <- grf::causal_forest(X, Y, W, min.node.size = 200, seed = 12345)
forest_instrumental <- grf::instrumental_forest(X, Y, W, Z, min.node.size = 200, seed = 12345)

# Find the best tree (this works)
best_tree_info <- find_best_tree(forest_regression, "regression")
best_tree_info$best_tree

# Find the best tree (these fail with the warnings above)
best_tree_info <- find_best_tree(forest_causal, "causal")
best_tree_info$best_tree

best_tree_info <- find_best_tree(forest_instrumental, "instrumental")
best_tree_info$best_tree

thanks
Tian

@ginward
Contributor Author

ginward commented Jan 19, 2020

@tianshengwang Have you tried setting the minimum node size to, say, 500?

I think continuous treatment expands the trees too much and creates too many nodes, thus causing a stack overflow.

@tianshengwang

@tianshengwang Have you tried setting the minimum node size to, say, 500?

I think continuous treatment expands the trees too much and creates too many nodes, thus causing a stack overflow.

You mean increase the min.node.size? I used 500, 1000, and 2000; it still doesn't work for causal_forest or instrumental_forest.

@ginward
Contributor Author

ginward commented Jan 19, 2020

@tianshengwang It works on my side.

There were 50 or more warnings (use warnings() to see the first 50)
> best_tree_info$best_tree
[1] 922

I have ignored the warnings.

@tianshengwang

@tianshengwang It works on my side.

There were 50 or more warnings (use warnings() to see the first 50)
> best_tree_info$best_tree
[1] 922

I have ignored the warnings.

Got it, thanks!

@jtibshirani
Member

Closing this out because we didn't identify a bug and you seem to have found a way forward.
