This is a quick attempt to reproduce the results from Smith and Wells:

We begin with three probability distributions: Normal (dnorm(mean = 50, sd = 10)), Uniform (dunif(10, 30)), and Bimodal; we will then consider a family of skewed distributions.

From each of the three populations, 10,000 samples of size 5, 10, 15, 20, 25, and 30 were randomly drawn, with values rounded to the nearest integer, as is often the case for educational and psychological measures. The mean of each sample was then computed and recorded to four decimal places. Twenty replications were performed for each condition (i.e., the process was repeated 20 times for each sample size).

After the sampling distribution of the mean was constructed for a replication, a one-sample Kolmogorov-Smirnov (KS) test was used to test whether the sample means followed a normal distribution. This test was applied to every sample size drawn from each distribution. The proportion of replications in which the distribution of the sample means was judged to depart from normality according to the KS test was recorded.
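For reference, a one-sample KS test of a vector of sample means against a normal distribution can be run in R roughly like this. This is a minimal sketch, not the original code; using the mean and standard deviation estimated from the sample means themselves is our assumption, since the reference parameters are not specified above.

# Assumed setup: build one sampling distribution of the mean, then test it
# against a normal whose parameters are estimated from the sample means.
# (Ties introduced by rounding will trigger a warning from ks.test.)
sample_means <- replicate(10000, mean(round(rnorm(5, mean = 50, sd = 10))))
ks.test(sample_means, "pnorm", mean(sample_means), sd(sample_means))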

So let's start sampling in the manner described.

# Draw N samples of `sample_size` from a Normal(mean, sd) population, rounding
# each observation to the nearest integer, and return the N sample means.
sample_normal <- function(sample_size, N = 20, mean = 50, sd = 10) {
  replicate(N, mean(round(rnorm(sample_size, mean, sd))))
}

# Same, for a Uniform(lower_bound, upper_bound) population.
sample_unif <- function(sample_size, N = 20, lower_bound = 10, upper_bound = 30) {
  replicate(N, mean(round(runif(sample_size, lower_bound, upper_bound))))
}

# Same, for a bimodal mixture: with probability theta an observation comes from
# Normal(50, 10), otherwise from Normal(85, 10).
sample_bimodal <- function(sample_size, N = 20, theta = 0.75) {
  replicate(N, {
    w <- rbinom(sample_size, 1, theta)
    mean(round(w * rnorm(sample_size, 50, 10) + (1 - w) * rnorm(sample_size, 85, 10)))
  })
}

# Fleishman-style polynomial transform a + b*x + c*x^2 + d*x^3 of a normal draw.
fleisman <- function(n, a, b, c, d, mean = 50, sd = 10) {
  x <- rnorm(n, mean, sd)
  a + x*(b + x*(c + d*x))
}

# The four sets of Fleishman coefficients for the skewed distributions.
coefs <- list(list(-0.399, 0.930, 0.399, -0.036),
              list(-0.221, 0.866, 0.221, 0.027),
              list(-0.161, 0.819, 0.161, 0.049),
              list(-0.119, 0.789, 0.119, 0.062))


# Draw N samples of `sample_size` using the k-th set of Fleishman coefficients,
# rounding each observation, and return the N sample means.
sample_fleisman <- function(sample_size, k, N = 20) {
  a <- coefs[[k]][[1]]
  b <- coefs[[k]][[2]]
  c <- coefs[[k]][[3]]
  d <- coefs[[k]][[4]]
  replicate(N, mean(round(fleisman(sample_size, a, b, c, d))))
}
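As a quick sanity check (our addition, not in the original), the samplers can be called directly; each returns 20 sample means:

set.seed(42)            # arbitrary seed, only to make this check reproducible
sample_normal(5)        # 20 means of rounded Normal(50, 10) samples of size 5
sample_unif(10)         # 20 means of rounded Uniform(10, 30) samples of size 10
sample_fleisman(10, 4)  # 20 means from the 4th Fleishman distribution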

Plotting

Before continuing, we can plot a few histograms to see what the data look like.
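The plotting calls below (and the kable call further down) assume the usual packages are attached; a minimal setup block:

library(ggplot2)   # qplot
library(ggthemes)  # theme_fivethirtyeight, scale_color_fivethirtyeight
library(knitr)     # kable, used later for the results table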

qplot(sample_bimodal(5, N=200), geom="histogram", fill=I("#008FD5")) +
  scale_color_fivethirtyeight() +
  theme_fivethirtyeight() +
  theme(axis.title = element_text()) + xlab("Bimodal(5, N=200)") + ylab("density")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(sample_bimodal(10, N=200), geom="histogram", fill=I("#008FD5")) +
  scale_color_fivethirtyeight() +
  theme_fivethirtyeight() +
  theme(axis.title = element_text()) + xlab("Bimodal(10, N=200)") + ylab("density")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(sample_bimodal(30, N=200), geom="histogram", fill=I("#008FD5")) +
  scale_color_fivethirtyeight() +
  theme_fivethirtyeight() +
  theme(axis.title = element_text()) + xlab("Bimodal(30, N=200)") + ylab("density")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(sample_fleisman(5, 4, N=200), geom="histogram", fill=I("#FF2700")) +
  scale_color_fivethirtyeight() +
  theme_fivethirtyeight() +
  theme(axis.title = element_text()) + xlab("Fleisman4(5, N=200)") + ylab("density")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Testing

Perfect, now we are in a position to imitate the testing procedure:

Samples of size 5 through 300, at intervals of 5, were drawn with no rounding from each of the four skewed distributions over 10,000 repetitions. After the transformation, the sample means were rounded to the fourth decimal place. Once the sampling distribution of the mean was complete, the one-sample Kolmogorov-Smirnov (KS) test was applied.

We will only go up to 50 at intervals of 5, using a real test for normality (Shapiro-Wilk). We intentionally avoid handling the multiple comparisons problem, so we expect around 5% of the replications to be flagged as non-normal by chance alone; the short check below quantifies that baseline.
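Under the null hypothesis, the number of rejections out of 10,000 tests at the 0.05 level follows a Binomial(10000, 0.05) distribution, so about 500 flagged replications are expected by chance. The 95% range below is our addition, not from the paper:

# Expected number of rejections among 10,000 tests at the 0.05 level under the null,
# plus a central 95% interval for that count.
10000 * 0.05                                         # 500
qbinom(c(0.025, 0.975), size = 10000, prob = 0.05)   # roughly 458 and 543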

# Count how many of 10,000 replications are flagged as non-normal by the
# Shapiro-Wilk test at the chosen significance level, for sample sizes 5, 10, ..., 50.
results <- data.frame(uniform = rep(NA, 10),
                      normal = rep(NA, 10),
                      bimodal = rep(NA, 10),
                      fleisman1 = rep(NA, 10),
                      fleisman2 = rep(NA, 10),
                      fleisman3 = rep(NA, 10),
                      fleisman4 = rep(NA, 10))
# A Sidak-corrected level for 10,000 tests would be 1 - (1 - 0.05)^(1/10000), about 5.1e-6;
# we deliberately keep the uncorrected 0.05.
significance_level <- 0.05
for (i in 1:10) {
  print(i)
  results$uniform[i]   <- sum(replicate(10000, shapiro.test(sample_unif(5 * i))$p.value) < significance_level)
  results$normal[i]    <- sum(replicate(10000, shapiro.test(sample_normal(5 * i))$p.value) < significance_level)
  results$bimodal[i]   <- sum(replicate(10000, shapiro.test(sample_bimodal(5 * i))$p.value) < significance_level)
  results$fleisman1[i] <- sum(replicate(10000, shapiro.test(sample_fleisman(5 * i, 1))$p.value) < significance_level)
  results$fleisman2[i] <- sum(replicate(10000, shapiro.test(sample_fleisman(5 * i, 2))$p.value) < significance_level)
  results$fleisman3[i] <- sum(replicate(10000, shapiro.test(sample_fleisman(5 * i, 3))$p.value) < significance_level)
  results$fleisman4[i] <- sum(replicate(10000, shapiro.test(sample_fleisman(5 * i, 4))$p.value) < significance_level)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

So how many “extreme” cases (seemingly non-normal samples) do we have?

kable(results)

| uniform | normal | bimodal | fleisman1 | fleisman2 | fleisman3 | fleisman4 |
|--------:|-------:|--------:|----------:|----------:|----------:|----------:|
|     421 |    548 |     662 |      1278 |      1086 |      1111 |      1080 |
|     470 |    553 |     543 |       859 |       874 |       800 |       771 |
|     486 |    511 |     508 |       746 |       725 |       690 |       717 |
|     469 |    466 |     522 |       718 |       625 |       652 |       648 |
|     441 |    484 |     531 |       661 |       605 |       596 |       631 |
|     484 |    473 |     498 |       627 |       610 |       576 |       591 |
|     483 |    475 |     459 |       625 |       613 |       559 |       583 |
|     501 |    524 |     524 |       599 |       551 |       584 |       585 |
|     537 |    481 |     493 |       569 |       561 |       588 |       561 |
|     456 |    484 |     515 |       606 |       588 |       571 |       553 |

This is nothing at all like what Smith and Wells report…

Conclusion

The central limit theorem works, the world is safe, and we can all sleep well again. AND DO NOT CITE THIS DAMN PAPER!