
Question: How can I use the package to build a single causal tree? #548

Closed
ferlocar opened this issue Oct 26, 2019 · 14 comments

@ferlocar

I'm trying to build a single causal tree using the following code:

model <- causal_forest(X_train, y_train, z_train, num.trees = 1)

However, I noticed that the causal_forest method has a sample.fraction parameter, which defines the fraction of the data used to build each tree (0.5 by default). Since I want to use the entire data set to build the causal tree, I set this parameter to 1, but when I run the following code:

model <- causal_forest(X_train, y_train, z_train, num.trees = 1, sample.fraction=1)

I get the following error message:

"Error in causal_train(data$default, data$sparse, outcome.index, treatment.index, :
When confidence intervals are enabled, the sampling fraction must be less than 0.5."

Could you please tell me how to disable confidence intervals in order to build a tree using the entire sample? Thanks in advance!

@erikcs (Member) commented Oct 26, 2019

You can disable confidence intervals with ci.group.size=1 (see the last paragraph of the reference on variance estimates).
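For concreteness, a minimal sketch of that call (the variable names are the asker's; the grf package is assumed to be loaded):

library(grf)

# ci.group.size = 1 disables the confidence-interval machinery,
# which in turn allows sample.fraction = 1.
model <- causal_forest(X_train, y_train, z_train,
                       num.trees = 1,
                       sample.fraction = 1,
                       ci.group.size = 1)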

@ferlocar (Author)

Thanks for your fast response.

I tried that, and it removed the error message, but then the predictions no longer work. The following code:

predict(model, X_test)$predictions

returns NaN for all predictions. I also tried passing estimate.variance = FALSE to the predict method, but I got the same result.

Any idea why?

@erikcs (Member) commented Oct 26, 2019

By default, causal_forest estimates Y.hat and W.hat with separate regression forests using out-of-bag (OOB) predictions, but with sample.fraction=1 all of these OOB predictions will be NaN, and so your causal_forest predictions are NaN as well.

You can avoid this by instead supplying Y.hat and W.hat to causal_forest. (e.g. W.hat = predict(regression_forest(X, W))$predictions; Y.hat = predict(regression_forest(X, Y))$predictions)
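Spelled out with the asker's variable names (a sketch; the first-stage forests keep their default sample.fraction = 0.5, so their OOB predictions are well defined):

# First-stage nuisance estimates, fitted separately so the main
# forest no longer needs its own internal OOB regression forests:
W.hat <- predict(regression_forest(X_train, z_train))$predictions
Y.hat <- predict(regression_forest(X_train, y_train))$predictions
model <- causal_forest(X_train, y_train, z_train,
                       Y.hat = Y.hat, W.hat = W.hat,
                       num.trees = 1, sample.fraction = 1, ci.group.size = 1)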

@ferlocar (Author)

Thanks again for your response.

Could you please elaborate on why Y.hat and W.hat are required to make predictions?

My understanding about a causal tree is that (assuming I use the entire sample to build the tree) a fraction of the data is used to learn the tree structure and the remaining fraction is used to fill the leaves in the tree (I'll call this the prediction fraction).

Then, the prediction for 'observation i' corresponds to the estimated average treatment effect of the observations in the prediction fraction that are also in the leaf of 'observation i'. The reference on predictions also seems to agree with this. So, what's the role of Y.hat and W.hat in making predictions?

My concern about using Y.hat as you propose is that I would be using the entire sample to estimate the regression forest, and I'm not sure how that interacts with 'honesty'.

Again, I appreciate your quick and helpful responses!

@susanathey (Collaborator)

You may do better using the https://github.com/susanathey/causaltree package for your use case. The residualizing comes in if you have an observational study, but you wouldn't necessarily want to use it for a single tree with a randomized experiment. You can set Y.hat and W.hat to constants.
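In a randomized experiment with a known assignment probability, that amounts to something like the following sketch (0.5 stands in for the known treatment probability; mean(y_train) is one reasonable constant baseline):

model <- causal_forest(X_train, y_train, z_train,
                       W.hat = 0.5,            # known randomization probability
                       Y.hat = mean(y_train),  # constant outcome baseline
                       num.trees = 1, sample.fraction = 1, ci.group.size = 1)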

@erikcs (Member) commented Oct 26, 2019

Could you please elaborate on why Y.hat and W.hat are required to make predictions?

As @susanathey mentions above, this is not related to honesty but to orthogonalization. (W.hat and Y.hat can be estimated with an arbitrary estimator, not necessarily a regression forest.)
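For instance, a linear first stage would also work (a sketch; note these are in-sample fits, whereas cross-fitting is generally preferred for orthogonalization):

df <- as.data.frame(X_train)
# Hypothetical linear nuisance models in place of regression forests:
Y.hat <- predict(lm(y_train ~ ., data = df))
W.hat <- predict(glm(z_train ~ ., data = df, family = binomial), type = "response")
model <- causal_forest(X_train, y_train, z_train, Y.hat = Y.hat, W.hat = W.hat)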

@ferlocar (Author)

@susanathey Thanks, I'll check that out!
@erikcs Thanks for the orthogonalization reference, that clears up my question.

By the way, I'm really impressed with the package and the fast responses. Thanks a lot for this!

@sudonghua91

You can avoid this by instead supplying Y.hat and W.hat to causal_forest. (e.g. W.hat = predict(regression_forest(X, W))$predictions; Y.hat = predict(regression_forest(X, Y))$predictions)

@erikcs I followed your advice, but the predictions are still all NaN... Why? Thanks.

@erikcs (Member) commented Aug 23, 2021

Hi @sudonghua91, could you please give some more details? If you train a forest with, for example, only 1 tree, then some OOB (out-of-bag) predictions may be NaN by construction.
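A quick way to see this on simulated data (a sketch; with one tree and the default sample.fraction = 0.5, roughly half the observations are in bag and therefore get NaN OOB predictions):

n <- 100
p <- 5
X <- matrix(rnorm(n * p), n, p)
W <- rbinom(n, 1, 0.5)
Y <- X[, 1] * W + rnorm(n)
cf <- causal_forest(X, Y, W, W.hat = 0.5, Y.hat = 0,
                    num.trees = 1, ci.group.size = 1)
# NaN for every observation the single tree saw during training:
head(predict(cf)$predictions)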

@sudonghua91 commented Aug 23, 2021

Hi @erikcs, thanks for your response.
As I understand it, sample.fraction is used to split the whole sample into a training sample and a hold-out sample. Now I am trying to do the split before passing the training sample to causal_forest, so I want to set sample.fraction=1 and keep only honesty.fraction=0.5. But I need 20,000 trees, so setting ci.group.size=1 cannot help in this case. I am wondering how else I can set sample.fraction=1? Thanks.

@erikcs (Member) commented Aug 24, 2021

With sample.fraction=1 you can predict on a test set with predict(forest, X.test), but OOB predictions on X.train via predict(forest) will naturally be all NaN.

@sudonghua91

@erikcs Thanks. Actually, I did predict on a test set, but the predictions are still NaN (you can of course try it yourself). Btw, does predict(forest) cause overfitting?

@erikcs (Member) commented Aug 25, 2021

@erikcs Thanks. Actually, I did predict on a test set, but the predictions are still NaN (you can of course try it yourself). Btw, does predict(forest) cause overfitting?

# Simulated data: the true treatment effect is pmax(X1, 0).
n <- 500
p <- 10
X <- matrix(rnorm(n * p), n, p)
X.test <- matrix(rnorm(n * p), n, p)
W <- rbinom(n, 1, 0.5)               # randomized treatment assignment
Y <- pmax(X[, 1], 0) * W + rnorm(n)
# Constant W.hat/Y.hat skip the first-stage forests, and
# ci.group.size = 1 allows sample.fraction = 1.
cf <- causal_forest(X, Y, W, W.hat = 0.5, Y.hat = 0, ci.group.size = 1, sample.fraction = 1)
head(predict(cf, X.test)$predictions)
# [1] 0.6489251 0.7139381 0.3271824 0.3057034 0.3641508 0.3184822
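For contrast (not part of the reply above, but following directly from it): OOB predictions on the training sample from this same forest are all NaN, since with sample.fraction = 1 no observation is ever out of bag.

head(predict(cf)$predictions)
# [1] NaN NaN NaN NaN NaN NaN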

@sudonghua91

@erikcs well thanks! I got it.
