From a443480e5882191207574eb2d6fdd390de6d264b Mon Sep 17 00:00:00 2001
From: Johannes Titz
Date: Fri, 10 Jan 2020 10:43:45 +0100
Subject: [PATCH] add html vignette

---
 vignettes/fastpos.html | 425 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 425 insertions(+)
 create mode 100644 vignettes/fastpos.html

diff --git a/vignettes/fastpos.html b/vignettes/fastpos.html
new file mode 100644
index 0000000..e0ef561
--- /dev/null
+++ b/vignettes/fastpos.html

Introduction to fastpos


Johannes Titz


January, 2020


The R package fastpos provides a fast algorithm to calculate the required sample size for a Pearson correlation to stabilize within a sequential framework (Schönbrodt & Perugini, 2013, 2018). Essentially, one wants to find the sample size at which one can be sure that 1−α percent of many studies will fall into a specified corridor of stability around an assumed population correlation and stay inside that corridor when more participants are added. For instance: how many participants per study are required so that, out of 100k studies, 90% fall into the region between .4 and .6 (a Pearson correlation) and do not leave this region again as more participants are added (under the assumption that the population correlation is .5)? This sample size is also referred to as the critical point of stability for the given parameters.


This approach is related to accuracy in parameter estimation (AIPE, e.g. Maxwell, Kelley, & Rausch, 2008) and as such can be seen as an alternative to power analysis. Unlike AIPE, the concept of stability incorporates the idea of sequentially adding participants to a study. Although the approach is young, it has already attracted considerable interest in the psychological research community, evident in over 600 citations of the original publication (Schönbrodt & Perugini, 2013). To date there has been no easy way to use sequential stability for individual sample size planning, because the problem has no analytical solution and a simulation approach is computationally expensive. The package fastpos overcomes this limitation by speeding up the calculation of correlations. For typical parameters, the theoretical speedup should be around 250. An empirical benchmark for a typical scenario even shows a speedup of about 500, paving the way for wider use of the stability approach.


If you have found this page, I assume you either want to (1) calculate the critical point of stability for your own study or (2) explore the method in general. If this is the case, read on and you should find what you are looking for. Let us first load the package and set a seed for reproducibility:

library(fastpos)
set.seed(19950521)

In most cases you will just need the function find_critical_pos, which gives you the critical point of stability for your specific parameters.


Let us reproduce one example from Schönbrodt and Perugini’s work (this should take only a couple of seconds on a modern CPU):

find_critical_pos(rho = .7, sample_size_min = 20, sample_size_max = 1000,
                  n_studies = 10000)
#>     rho_pop 80% 90% 95% sample_size_min sample_size_max lower_limit upper_limit n_studies n_not_breached precision precision_rel
#> 1 0.6997798  63  95 130              20            1000         0.6         0.8     10000              0       0.1         FALSE

The result is very close to Schönbrodt and Perugini’s table (see https://github.com/nicebread/corEvol).


Note that find_critical_pos throws a warning if at least one study did not reach the corridor of stability within the maximum sample size. This also happened in Schönbrodt and Perugini’s work, but quite seldom. Still, it should be avoided for a proper estimate of the point of stability.

find_critical_pos(rho = .7, sample_size_min = 20, sample_size_max = 400,
                  n_studies = 10000)
#> Warning in find_critical_pos(rho = 0.7, sample_size_min = 20, sample_size_max = 400, : 3 simulation[s] did not reach the corridor of
#>             stability.
#> Increase sample_size_max and rerun the simulation.
#>     rho_pop 80% 90% 95% sample_size_min sample_size_max lower_limit upper_limit n_studies n_not_breached precision precision_rel
#> 1 0.6998704  65  97 131              20             400         0.6         0.8     10000              3       0.1         FALSE

In this case, do what the warning suggests and increase the maximum sample size. Note that larger maximum sample sizes are more resource intensive because the correlations are calculated in reverse (from the maximum sample size downwards). Thus, you should increase the maximum sample size only when some studies did not reach the corridor of stability.
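To make the reverse calculation more concrete, here is a plain-R sketch of the idea (a slow, simplified stand-in for the package’s C++ routine; the variable names, the corridor [.4, .6], and the data-generation step are mine): walking down from the maximum sample size, the first sample size whose running correlation lies outside the corridor marks the end of instability, so the point of stability is that sample size plus one.

```r
set.seed(2)
n_max <- 1000
rho <- 0.5
# one simulated "study": bivariate data with population correlation rho
x <- rnorm(n_max)
y <- rho * x + sqrt(1 - rho^2) * rnorm(n_max)
pos <- 20  # if the corridor is never left, stability holds from the minimum on
for (n in n_max:20) {
  r <- cor(x[1:n], y[1:n])
  if (r < 0.4 || r > 0.6) {  # outside the corridor of stability
    pos <- n + 1
    break
  }
}
# pos == n_max + 1 would mean the corridor was not reached within n_max
# (fastpos reports such studies as not breached / NA)
pos
```

The reverse direction allows stopping at the first breach; a forward scan would have to check every sample size up to n_max to be sure the corridor is never left again.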


If you need different confidence levels, just specify them:

find_critical_pos(rho = .7, sample_size_min = 20, sample_size_max = 1000,
                  n_studies = 10000, confidence_levels = c(.6, .85))
#>     rho_pop 60% 85% sample_size_min sample_size_max lower_limit upper_limit n_studies n_not_breached precision precision_rel
#> 1 0.7006911  38  78              20            1000         0.6         0.8     10000              0       0.1         FALSE

This has no effect on resource consumption because the time-consuming part is simulating the distribution, not calculating its quantiles.


If you need a different precision level or even relative precision, specify it:

find_critical_pos(rho = c(.5, .7), sample_size_min = 20, sample_size_max = 2500,
                  n_studies = 10000, precision = .10, precision_rel = TRUE)
#> Warning in find_critical_pos(rho = c(0.5, 0.7), sample_size_min = 20, sample_size_max = 2500, : 9 simulation[s] did not reach the corridor of
#>             stability.
#> Increase sample_size_max and rerun the simulation.
#>     rho_pop 80% 90%     95% sample_size_min sample_size_max lower_limit upper_limit n_studies n_not_breached precision
#> 1 0.4990567 596 856 1119.05              20            2500        0.45        0.55     10000              9       0.1
#> 2 0.6997279 137 199  261.00              20            2500        0.63        0.77     10000              0       0.1
#>   precision_rel
#> 1          TRUE
#> 2          TRUE

As you can see in the output, the corridor limits were set relative to the population correlation (±10% of the population correlation, as specified by precision = .10).
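The relative limits can be recomputed directly; the values match the lower_limit and upper_limit columns above (the variable names here are mine):

```r
rho <- c(0.5, 0.7)
precision <- 0.10
# relative precision: limits are rho plus/minus precision * rho
lower <- rho * (1 - precision)  # 0.45, 0.63
upper <- rho * (1 + precision)  # 0.55, 0.77
```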


If you want to dig deeper, you can have a look at the functions that find_critical_pos builds upon. simulate_pos is the workhorse of the package. It calls a C++ function to calculate correlations sequentially, and it does this pretty quickly (but you know that already, right?). A more bare-bones approach is to create a population with create_pop and pass it to simulate_pos:

pop <- create_pop(0.5, 1000000)
pos <- simulate_pos(x_pop = pop[, 1],
                    y_pop = pop[, 2],
                    number_of_studies = 1000,
                    sample_size_min = 20,
                    sample_size_max = 1000,
                    replace = TRUE,
                    lower_limit = 0.4,
                    upper_limit = 0.6)
hist(pos, xlim = c(0, 1000), xlab = "Point of stability",
     main = "Histogram of points of stability for rho = .5+-.1")

quantile(pos, c(.8, .9, .95), na.rm = TRUE)
#>   80%   90%   95% 
#> 141.0 214.1 270.0

Note that simulate_pos does not warn you when the corridor is not reached; instead, it returns NA for that study. Pay careful attention when you work with this function directly, and adjust the maximum sample size as needed.
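For instance, you can count the NA values yourself before taking quantiles (the toy vector below stands in for a real simulate_pos result; its values are invented for illustration):

```r
# hypothetical points of stability; NA marks a study whose running
# correlation never settled inside the corridor within sample_size_max
pos <- c(140, 88, NA, 305, 67)
sum(is.na(pos))                              # studies that did not stabilize
quantile(pos, c(.8, .9, .95), na.rm = TRUE)  # quantiles over the rest
```

If the NA count is not zero, the quantiles underestimate the true critical points of stability, which is exactly why find_critical_pos warns in that situation.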


create_pop creates the population matrix by using mvrnorm. This is much simpler than Schönbrodt and Perugini’s approach, but the results do not seem to differ. If you are interested in how population parameters (e.g. skewness) affect the point of stability, you should refer to the population-generating functions in Schönbrodt and Perugini’s work instead.
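Under that assumption about create_pop’s internals, a rough equivalent with MASS::mvrnorm looks like this (the bivariate-normal setup below is my own sketch, not the package’s actual code):

```r
library(MASS)  # ships with R as a recommended package

rho <- 0.5
# with unit variances the covariance matrix equals the correlation matrix
sigma <- matrix(c(1, rho, rho, 1), nrow = 2)
pop <- mvrnorm(n = 1e6, mu = c(0, 0), Sigma = sigma)
cor(pop[, 1], pop[, 2])  # close to .5
```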


As you can see, there is not really much to the sequential definition of stability, except for calculating billions of correlations. This is done quite fast with the help of Rcpp.


Let us reproduce Schönbrodt and Perugini’s quite famous and oft-cited table of the critical points of stability for a precision of 0.1. We set the maximum sample size a bit higher to reduce the number of studies in which the corridor is never reached, and we use 10k studies so that it runs fairly quickly.

find_critical_pos(rho = seq(.1, .7, .1), sample_size_max = 1000,
                  n_studies = 10000)
#> Warning in find_critical_pos(rho = seq(0.1, 0.7, 0.1), sample_size_max = 1000, : 30 simulation[s] did not reach the corridor of
#>             stability.
#> Increase sample_size_max and rerun the simulation.
#>     rho_pop   80% 90%    95% sample_size_min sample_size_max lower_limit upper_limit n_studies n_not_breached precision
#> 1 0.1005987 253.0 365 479.00              20            1000         0.0         0.2     10000             11       0.1
#> 2 0.1998043 234.0 335 433.00              20            1000         0.1         0.3     10000             10       0.1
#> 3 0.3004318 214.2 304 393.00              20            1000         0.2         0.4     10000              6       0.1
#> 4 0.4004198 184.0 267 354.05              20            1000         0.3         0.5     10000              2       0.1
#> 5 0.4986327 145.0 208 276.00              20            1000         0.4         0.6     10000              1       0.1
#> 6 0.5991916 104.0 150 200.00              20            1000         0.5         0.7     10000              0       0.1
#> 7 0.6999457  65.0  96 128.05              20            1000         0.6         0.8     10000              0       0.1
#>   precision_rel
#> 1         FALSE
#> 2         FALSE
#> 3         FALSE
#> 4         FALSE
#> 5         FALSE
#> 6         FALSE
#> 7         FALSE

You can obviously parallelize the process, which is especially useful if you want to run many simulations. For instance, if you increase the number of studies to 100k (as in the original article), it takes less than a minute on a modern CPU with several cores; on my i7-2640 with 4 cores, about 30 s. Overall, this is a substantial speedup compared to the original implementation: a rough benchmark at https://github.com/johannes-titz/fastpos shows a speedup of about 500 for a typical scenario.
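One way to parallelize is over the population correlations with the base parallel package (a sketch, not part of fastpos; mclapply forks and therefore only parallelizes on Unix-alikes, and the core count is up to you):

```r
library(parallel)
library(fastpos)

rhos <- seq(.1, .7, .1)
# one independent simulation per population correlation
res <- mclapply(rhos, function(r) {
  find_critical_pos(rho = r, sample_size_max = 1000, n_studies = 1e5)
}, mc.cores = detectCores())
do.call(rbind, res)  # reassemble into one result table
```

Since each rho is simulated independently, the runs need no communication and the speedup is close to linear in the number of cores used.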


If you are interested in this package, there is still some work to do and I would be happy if you would like to contribute. Specifically, I would like to use RcppParallel to speed up the simulation directly in C++. This is mostly of academic interest, as the functions are already fast enough to find the critical point of stability for an individual study in a few seconds for most use cases. Indeed, I hope the package will be used this way, quite similar to a power analysis for significance testing.


References

Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample size planning for statistical power and accuracy in parameter estimation. Annual Review of Psychology, 59, 537–563. https://doi.org/10.1146/annurev.psych.59.103006.093735

Schönbrodt, F. D., & Perugini, M. (2013). At what sample size do correlations stabilize? Journal of Research in Personality, 47, 609–612. https://doi.org/10.1016/j.jrp.2013.05.009

Schönbrodt, F. D., & Perugini, M. (2018). Corrigendum to “At what sample size do correlations stabilize?” [J. Res. Pers. 47 (2013) 609–612]. Journal of Research in Personality, 74, 194. https://doi.org/10.1016/j.jrp.2018.02.010