Plotting
Testing
Conclusion

This is a quick attempt to reproduce the results from:

Zachary R. Smith and Craig S. Wells’s Central Limit Theorem and Sample Size

We begin with three probability distributions: Normal (dnorm(mean = 50, sd = 10))), Uniform (dunif(10,30)), Bimodal, and then we will consider a family of skewed distributions.

From each of the three populations, 10,000 samples of size 5, 10, 15, 20, 25, and 30 were randomly drawn, rounding to the nearest integer, as is often observed in educational and psychological measures. The mean for each sample was then computed, using four decimal places. Twenty replications were performed for each condition (i.e., the process was repeated 20 times for each sample size).

After the sampling distribution for the mean was constructed for a replication, a one sample Kolmogorov-Smirnov (KS) test was used to test whether the sample mean followed a normal distribution. This test was used on each sample size taken from all distributions. The proportion of replications in which the distribution of the sample means were concluded to depart normality according to the KS-test was recorded.

So lets start sampling in the manner described.

sample_normal <- function(sample_size, N = 20, mean = 50, sd = 10) {
  replicate(N, mean(round(rnorm(sample_size, mean, sd))));
}

sample_unif <- function(sample_size, N=20, lower_bound = 10, upper_bound = 30) {
  replicate(N, mean(round(runif(sample_size, lower_bound, upper_bound))))
}

sample_bimodal <- function(sample_size, N=20, theta = 0.75) {
  samples = c();
  for (i in 1:N) {
    w = rbinom(sample_size, 1, theta);
    samples <- append(samples, mean(round(w*rnorm(sample_size, 50, 10) + (1 - w)*rnorm(sample_size, 85, 10))));
  }
  samples
}

fleisman <- function(n, a, b, c, d, mean = 50, sd = 10) {
  x <- rnorm(n, mean, sd)
  a + x*(b + x*(c + d*x))
}
coefs <- list(list(-0.399, 0.930, 0.399, -0.036),
              list(-0.221, 0.866, 0.221, 0.027),
              list(-0.161, 0.819, 0.161, 0.049),
              list(-0.119, 0.789, 0.119, 0.062));


sample_fleisman <- function(sample_size, k, N=20, theta = 0.75) {
  a <- coefs[[k]][[1]]
  b <- coefs[[k]][[2]]
  c <- coefs[[k]][[3]]
  d <- coefs[[k]][[4]]
  replicate(N, mean(round(fleisman(sample_size, a, b, c, d))));
}

Plotting

Before continuing on, we can plot some histograms to see what the data looks like.

qplot(sample_bimodal(5, N=200), geom="histogram", fill=I("#008FD5"), ylab="Bimodal(5, N=200)") +
  scale_color_fivethirtyeight() +
  theme_fivethirtyeight() +
  theme(axis.title = element_text()) + xlab("Bimodal(5, N=200)") + ylab("density")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(sample_bimodal(10, N=200), geom="histogram", fill=I("#008FD5")) +
  scale_color_fivethirtyeight() +
  theme_fivethirtyeight() +
  theme(axis.title = element_text()) + xlab("Bimodal(10, N=200)") + ylab("density")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(sample_bimodal(30, N=200), geom="histogram", fill=I("#008FD5")) +
  scale_color_fivethirtyeight() +
  theme_fivethirtyeight() +
  theme(axis.title = element_text()) + xlab("Bimodal(30, N=200)") + ylab("density")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(sample_fleisman(5, 4, N=200), geom="histogram", fill=I("#FF2700")) +
  scale_color_fivethirtyeight() +
  theme_fivethirtyeight() +
  theme(axis.title = element_text()) + xlab("Fleisman4(5, N=200)") + ylab("density")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Testing

Perfect, now we are in a state to imitate the testing procedure:

Samples of 5 through 300 at intervals of 5 were drawn with no rounding from each of the four skewed distributions over 10,000 repetitions. After the transformation, the sampling means were rounded to the fourth decimal place. Once the sampling distribution for the mean is complete, the one sample Kolmogorov-Smirnov (KS) test was applied.

We will go up to 50 at intervals of 5, using a real test for normality. We intentionally avoid handling the multiple comparisons problem (expecting around 5% of the cases to be exceptional).

results <- data.frame(uniform = rep(NA,10),
                      normal = rep(NA,10),
                      bimodal = rep(NA,10),
                      fleisman1 = rep(NA,10),
                      fleisman2 = rep(NA,10),
                      fleisman3 = rep(NA,10),
                      fleisman4 = rep(NA,10))
# sidak significance = 1 - (1 - 0.05)**(1e-4);
significance_level <- 0.05;
for (i in 1:10) {
  print(i)
  results$uniform[i] <- length(Filter(function(p) {p < significance_level}, replicate(10000, shapiro.test(sample_unif(5*i))$p)))
  results$normal[i] <- length(Filter(function(p) {p < significance_level}, replicate(10000, shapiro.test(sample_normal(5*i))$p)))
  results$bimodal[i] <- length(Filter(function(p) {p < significance_level}, replicate(10000, shapiro.test(sample_bimodal(5*i))$p)))
  results$fleisman1[i] <- length(Filter(function(p) {p < significance_level}, replicate(10000, shapiro.test(sample_fleisman(5*i, 1))$p)))
  results$fleisman2[i] <- length(Filter(function(p) {p < significance_level}, replicate(10000, shapiro.test(sample_fleisman(5*i, 2))$p)))
  results$fleisman3[i] <- length(Filter(function(p) {p < significance_level}, replicate(10000, shapiro.test(sample_fleisman(5*i, 3))$p)))
  results$fleisman4[i] <- length(Filter(function(p) {p < significance_level}, replicate(10000, shapiro.test(sample_fleisman(5*i, 4))$p)))
}

## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

So how many “extreme” cases (seemingly non-normal samples) do we have?

kable(results)

uniform	normal	bimodal	fleisman1	fleisman2	fleisman3	fleisman4
421	548	662	1278	1086	1111	1080
470	553	543	859	874	800	771
486	511	508	746	725	690	717
469	466	522	718	625	652	648
441	484	531	661	605	596	631
484	473	498	627	610	576	591
483	475	459	625	613	559	583
501	524	524	599	551	584	585
537	481	493	569	561	588	561
456	484	515	606	588	571	553

This is nothing at all what Smith and Wells report…

Conclusion

The central limit theorem works, the world is safe, and we can all sleep well again. AND DO NOT CITE THIS DAMN PAPER!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

2019-06-18-central-limit-theorem.md

2019-06-18-central-limit-theorem.md

Plotting

Testing

Conclusion

Files

2019-06-18-central-limit-theorem.md

Latest commit

History

2019-06-18-central-limit-theorem.md

File metadata and controls

Plotting

Testing

Conclusion