
Error in if (sum(weights[treat == tval1_0] > 0) < 1 || sum(weights[treat != : missing value where TRUE/FALSE needed #64

Open
mitchellcameron123 opened this issue Jun 20, 2024 · 6 comments

Comments

@mitchellcameron123

Hi,

I am trying to perform GBM weighting on this dataset (attached). The dataset is publicly available; I have attached it as a CSV.

coffee_data.csv

My code is:
coffee_formula <- as.formula(as.factor(certified) ~ age_hh + agesq + nonfarmincome_access + logtotal_land + depratio + badweat + edu + gender + years_cofeproduction + access_credit)

mod <- weightit(coffee_formula, data = coffee_data, method = "gbm", estimand = "ATE", shrinkage = 0.1, interaction.depth = 1:2, bag.fraction = 0.8, criterion = "smd.mean", n.trees = 10000)

This results in the error:
Error in if (sum(weights[treat == tval1_0] > 0) < 1 || sum(weights[treat != :
missing value where TRUE/FALSE needed

I can make this error stop by choosing fewer trees or adjusting the other parameters, but there doesn't seem to be any logic to when the error appears. Using browser() and looking inside the col_w_smd() function, it appears that one of the weights is NaN; I am not sure why, but that seems to cause the error.

I have tried:

  • Updating RStudio
  • Updating R
  • Uninstalling/reinstalling all relevant packages
  • Different ways of loading the data
  • Specifying each variable's data type individually
  • Using a data.frame vs. a tibble
  • A fair bit of stuff, really.

I am hoping for some guidance here because I have no idea what to do. Thank you very much.

@ngreifer
Owner

Thank you so much for this report and sorry about the bug. This occurred because some propensity scores were estimated to be 0 or 1 in some trees of the GBM model, which yielded non-finite weights that were not correctly processed by col_w_smd(). This issue has been fixed in the development version, which you can install using

remotes::install_github("ngreifer/WeightIt")

My solution was to truncate extreme propensity scores and improve the robustness of the function for computing weights. In practice, any tree with such extreme propensity scores will not be chosen as optimal because the weights will be extreme.
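To make the failure mode concrete, here is an illustrative sketch (not WeightIt's internal code) of how ATE weights are formed from propensity scores, and why scores of exactly 0 or 1 produce non-finite weights that can then break downstream balance computations. The truncation threshold `eps` is an arbitrary choice for illustration:

```r
# Hypothetical example: ATE inverse-probability weights.
# Treated units get 1/ps; control units get 1/(1 - ps).
ps    <- c(0.2, 0.5, 1, 0)   # last two propensity scores are degenerate
treat <- c(1,   0,   0, 1)   # control with ps = 1 and treated with ps = 0

w <- ifelse(treat == 1, 1 / ps, 1 / (1 - ps))
w  # the degenerate units get Inf weights

# Truncating extreme propensity scores keeps every weight finite:
eps <- 1e-8
ps_trunc <- pmin(pmax(ps, eps), 1 - eps)
w_safe <- ifelse(treat == 1, 1 / ps_trunc, 1 / (1 - ps_trunc))
all(is.finite(w_safe))  # TRUE
```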

@mitchellcameron123
Author

mitchellcameron123 commented Jun 25, 2024 via email

@ngreifer
Owner

Hi Mitchell,

Feel free to send me whatever would be helpful. Results for GBM are only random if you set bag.fraction to something less than 1. The default and recommended behavior is to set it to 1, in which case there is no randomness unless cross-validation is used to select the tuning parameter. When there is no randomness, you don't need to set a seed at all and your problem is avoided.
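A sketch of what this means in practice, reusing the formula and data names from the original post: with bag.fraction = 1 (the default), repeated fits should produce identical weights without any seed being set, so long as cross-validation is not used to select the tuning parameter.

```r
# Sketch assuming coffee_formula and coffee_data from the original post.
library(WeightIt)

m1 <- weightit(coffee_formula, data = coffee_data, method = "gbm",
               estimand = "ATE", bag.fraction = 1, n.trees = 5000,
               criterion = "smd.mean")
m2 <- weightit(coffee_formula, data = coffee_data, method = "gbm",
               estimand = "ATE", bag.fraction = 1, n.trees = 5000,
               criterion = "smd.mean")

all.equal(m1$weights, m2$weights)  # should be TRUE: no randomness
```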

@mitchellcameron123
Author

mitchellcameron123 commented Jun 25, 2024 via email

@ngreifer
Owner

Hi Mitchell,

I didn't receive an attachment. I don't think email attachments work in GitHub issues; you need to upload the attachment directly into the issue on GitHub rather than including it in an email reply. But I think I understand the problem anyway: if the seed differs between runs with different tuning parameters, any differences in performance could be due either to the different samples drawn at each iteration or to the different values of the tuning parameters. Ordering the resulting specifications by performance therefore doesn't accurately tell you which specification is best in general. Holding the samples drawn at each iteration constant (i.e., by using the same seed each time the model is fit with a different tuning parameter specification) allows you to isolate the variability in performance due to the different specifications.

It's true that the original literature on GBM and McCaffrey et al (2004) recommend using a bag fraction less than 1, but I don't think that advice applies anymore. It relies on the idea that a machine learning model should seek to avoid overfitting, but in propensity score analysis, overfitting is not a problem because balance achieved in the training sample, not good predictive performance in the test sample, is the criterion of interest. Overfitting severe enough to cause perfect separation is already controlled by other parameters, including the number of trees and the shrinkage parameter. So I don't think one needs to introduce additional randomness to prevent overfitting.

twang, the package developed by the team that wrote McCaffrey et al (2004) to implement the method, has always used a bag.fraction of 1 as the default, and it is not even named as a manipulable parameter in the user guide, suggesting that they recommend not changing it.

All that said, I'll take your suggestion on board. R has facilities to retain and reset a seed, so it's possible to ensure that each fit of the model uses the same seed.
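The retain-and-reset idea can be sketched as follows. This is a hypothetical helper (the function name and seed value are illustrative, not part of WeightIt): it saves the caller's RNG state, fixes the same seed for every candidate fit, and restores the original state afterward, so repeated tuning runs draw the same bagged samples.

```r
# Hypothetical sketch: run an expression under a fixed seed, then
# restore whatever RNG state the caller had before.
fit_with_fixed_seed <- function(expr, seed = 12345) {
  # Save the caller's RNG state, if one exists
  old_seed <- if (exists(".Random.seed", envir = globalenv()))
    get(".Random.seed", envir = globalenv()) else NULL
  # Restore it when this function exits, even on error
  on.exit({
    if (!is.null(old_seed))
      assign(".Random.seed", old_seed, envir = globalenv())
  })
  set.seed(seed)  # same seed for every candidate specification
  eval(expr)
}
```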

@mitchellcameron123
Author

Hi,

Yes, I believe we are on the same page. I've attached the file anyway (probably redundant now) as a txt file as GitHub does not seem to allow R files.

For Noah.txt

Thank you very much for all of your work on this package :)
