Implement sample size planning functions #27

Open
bsiepe opened this issue Feb 7, 2024 · 2 comments
@bsiepe

bsiepe commented Feb 7, 2024

Hello Phil,

As we briefly discussed via email, it could be useful to have functionality for sample size planning in simulation studies, i.e., for achieving a desired Monte Carlo Standard Error (MCSE). These functions should allow users to specify their performance measure of interest and the desired precision, and then return the number of repetitions needed to achieve that precision. The calculations can be based on the formulas we provide in Siepe et al. (2023).

We (Samuel Pawel, František Bartoš, and I) would like to contribute to this functionality.

Our Suggestions:

  • implement helper functions plan_*, where * stands for performance measures such as bias or coverage as implemented in the SimDesign summary functions
  • let users either specify a 'worst-case' scenario (in case of performance measures with known SE) or an empirical variance of the estimates (based on previous simulation studies or a pilot simulation)
  • the plan_*() functions can be used within the Summarise() function. Users can then run a pilot simulation study to obtain the empirical variance and return the required sample size for each condition/method.
  • this would seamlessly build upon existing infrastructure. We could create a wiki/vignette that explains the idea.

Sketch of what such a function could look like:

For a performance measure with known SE:

plan_EDR <- function(target_mcse,
                     target_edr = 0.5) {
  # For a proportion-type measure such as the EDR,
  # MCSE = sqrt(p * (1 - p) / n_rep), so solving for n_rep:
  n_rep <- target_edr * (1 - target_edr) / target_mcse^2
  n_rep
}
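
For instance, with the worst case target_edr = 0.5 and a target MCSE of 0.01, this gives 0.5 * 0.5 / 0.01^2 = 2500 repetitions.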

For a performance measure with unknown SE:

plan_bias <- function(target_mcse,
                      empirical_var) {
  # MCSE of the bias estimate = sqrt(var / n_rep), so solving for n_rep:
  n_rep <- empirical_var / target_mcse^2
  n_rep
}
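
For instance, an empirical variance of 0.04 and a target MCSE of 0.005 give 0.04 / 0.005^2 = 1600 repetitions.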

Depending on your feedback, we will open a pull request suggesting these functions soon.

Best
Björn

@philchalmers
Owner

philchalmers commented Feb 10, 2024

Hi Björn,

This sounds quite reasonable to me.

  • implement helper functions plan_*, where * stands for performance measures such as bias or coverage as implemented in the SimDesign summary functions

Agreed, this is a nice convention to use in the package, and the respective functions (bias() and plan_bias()) could be linked to in the package documentation.

  • let users either specify a 'worst-case' scenario (in case of performance measures with known SE) or an empirical variance of the estimates (based on previous simulation studies or a pilot simulation)

I like this idea, but I worry about the use of empirical estimates for the purpose of simulation planning. Because the empirical variance estimates are themselves a function of the replication size, one could quite easily over- or under-estimate the requisite number of replications to obtain the desired precision, particularly if the initial replication size was too low. Ideally, some type of confidence interval should be included for this situation: either the complete vector of observations used to obtain the empirical variance estimates is passed to the function (so that internal uncertainty quantifiers can be applied, even if from the large-sample normal family or via bootstrapping), or the user supplies a standard error indicating the degree of precision in the empirical variances. I'd be fine with either.
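
As a rough sketch of the first option (the name plan_bias_raw and its arguments are hypothetical, not an existing function), the raw vector of estimates could be bootstrapped to put an interval around the planned replication count:

plan_bias_raw <- function(target_mcse, estimates, n_boot = 1000) {
  # Point estimate of the required number of replications
  n_rep <- var(estimates) / target_mcse^2
  # Bootstrap the empirical variance to reflect its own sampling error
  boot_var <- replicate(n_boot, var(sample(estimates, replace = TRUE)))
  ci <- quantile(boot_var, c(0.025, 0.975)) / target_mcse^2
  list(n_rep = n_rep, lower = unname(ci[1]), upper = unname(ci[2]))
}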

  • the plan_*() functions can be used within the Summarise() function. Users can then run a pilot simulation study to obtain the empirical variance and return the required sample size for each condition/method.

It's unclear to me why this would be necessary within a Summarise() call. As SimDesign stores the results information, one could simply extract the analysis results and pass these to plan_* in raw form (see above), or reduce them manually. Basically, if this were to be constructed with Summarise() support, then utilizing the raw results data would be the most ideal path.
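
For example, assuming the pilot run stored its raw results (store_results = TRUE in runSimulation()) and that analyse() returned a column named est (a placeholder here), something along these lines would work:

pilot <- runSimulation(design, replications = 100,
                       generate = Generate, analyse = Analyse,
                       summarise = Summarise, store_results = TRUE)
raw <- SimExtract(pilot, what = 'results')  # raw analyse() output
plan_bias(target_mcse = 0.005, empirical_var = var(raw$est))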

  • this would seamlessly build upon existing infrastructure. We could create a wiki/vignette that explains the idea.

A vignette would be great! Though let me know your thoughts about my above points before proceeding. Thanks!

@bsiepe
Author

bsiepe commented Feb 21, 2024

Hello Phil,

Thank you for your thoughts and willingness to include the idea in your package.

  1. Yes, linking the functions via the documentation is a good idea.
  2. That is a good point. We will include both a point estimate for the sample size and a lower/upper bound based on an uncertainty estimate for the empirical variance. The latter can be obtained in the two ways you mentioned. This way, users will notice that the implied sample size may be imprecise if they only used a few replications for their pilot run.
  3. Sorry if we were unclear. Indeed, the plan_*() functions could be used both within and outside of a Summarise() call if users pass the raw results data. We just thought that using the functions within a Summarise() call might be a useful workflow for users incrementally building their simulation study, but it is not necessary (a sketch follows after this list).
  4. Great, we suggest first creating a pull request for the functionality itself and then adding a vignette once you have reviewed and possibly accepted our suggestions.
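
As a sketch of the optional Summarise() workflow from point 3 (the names est and theta are placeholders for whatever analyse() and the design actually return):

Summarise <- function(condition, results, fixed_objects) {
  # 'results' contains the raw analyse() output for this condition
  c(bias = bias(results$est, parameter = condition$theta),
    n_rep_needed = plan_bias(target_mcse = 0.005,
                             empirical_var = var(results$est)))
}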

We will prepare a pull request incorporating this functionality. Please bear with us, as this may take some time.
