## Sampling

Install and load packages.

In [None]:
# User agent for Linux binaries
options(HTTPUserAgent = sprintf("R/%s R (%s)", getRversion(), paste(getRversion(), R.version["platform"], R.version["arch"], R.version["os"])))

# Set P3M as CRAN repo
options(repos = c(CRAN = "https://packagemanager.posit.co/cran/__linux__/focal/latest"))

options(scipen=999)

install.packages(c("infer", "moderndive", "dplyr", "ggplot2"))

In [None]:
library(infer)
library(moderndive)
library(dplyr)
library(ggplot2)

Load the `house_prices` data.

In [None]:
data(house_prices)

What is the average price of homes in this dataset?

In [None]:
round(mean(house_prices$price),2)

What if I take a random sample of only 50 houses?

In [None]:
round(mean(sample(house_prices$price, 50)),2)
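
Because `sample()` draws at random, the mean above changes every time the cell is run. The sketch below illustrates that variability. It uses a simulated right-skewed population as a stand-in for `house_prices$price`; the `population` vector and its parameters are assumptions made for illustration, not the real data.

```r
set.seed(42)  # fix the random draws so the sketch is reproducible

# Hypothetical stand-in for house_prices$price: 20,000 right-skewed values
population <- rlnorm(20000, meanlog = 13, sdlog = 0.5)

# Draw five independent samples of 50 and compute each sample mean
sample_means <- replicate(5, mean(sample(population, 50)))

round(mean(population), 2)  # the population mean
round(sample_means, 2)      # five different estimates of it
```

Swapping `population` for `house_prices$price` should give the same picture with the real data.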

Recall what a *95% confidence interval* means: if we repeated the sampling many times and computed an interval from each sample, about 95% of those intervals would contain the true mean.
The remaining 5% or so are intervals that miss the true mean. This is not a Type I error, which is a hypothesis-testing term for rejecting a true null hypothesis.
The number of repetitions isn't important - it's only theoretical (in practice we build the interval from a single sample). The histogram below makes the repetition concrete: it draws 1000 samples of 50 houses and marks the 2.5th and 97.5th percentiles of the sample means (red dashed lines) and their overall mean (blue line).
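
The coverage idea can be checked by simulation. This is a minimal sketch, assuming a simulated population in place of `house_prices$price` and a standard t-interval from base R's `t.test()`; the `population` vector and its parameters are hypothetical.

```r
set.seed(123)

# Hypothetical stand-in population (not the real house_prices data)
population <- rlnorm(20000, meanlog = 13, sdlog = 0.5)
true_mean <- mean(population)

# For each of 1000 samples of 50, build a 95% t-interval and
# record whether it contains the true mean
covered <- replicate(1000, {
  ci <- t.test(sample(population, 50), conf.level = 0.95)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]
})

mean(covered)  # proportion of intervals that captured the true mean
```

The proportion should land near 0.95, though skewed data and a modest sample size can pull it slightly lower.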

In [None]:
rep_sample_n(house_prices, size = 50, reps = 1000) |>
  group_by(replicate) |>
  summarize(mean_price = mean(price)) |>
  ggplot() +
  geom_histogram(aes(x = mean_price)) +
  stat_summary(
    aes(x = mean(mean_price), y = mean_price),
    fun.data = \(x) data.frame(xintercept = quantile(x, c(.025, .975))),
    geom = "vline",
    color = "red",
    linetype = "dashed"
  ) +
  stat_summary(
    aes(x = mean(mean_price), y = mean_price),
    fun.data = \(x) data.frame(xintercept = mean(x)),
    geom = "vline",
    color = "blue",
    linewidth = 1
  )

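The histogram above approximates the *sampling distribution* of the mean for samples of 50 houses. The sampling distribution gets narrower (tighter) as the sample size increases: its standard deviation, called the standard error, shrinks roughly in proportion to 1/sqrt(n).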

In [None]:
options(repr.plot.width = 12, repr.plot.height = 8)
bind_rows(
  rep_sample_n(house_prices, size = 50, reps = 1000) |>
    mutate(sample = "50"),
  rep_sample_n(house_prices, size = 100, reps = 1000) |>
    mutate(sample = "100"),
  rep_sample_n(house_prices, size = 500, reps = 1000) |>
    mutate(sample = "500"),
  rep_sample_n(house_prices, size = 1000, reps = 1000) |>
    mutate(sample = "1000")
) |>
  group_by(sample, replicate) |>
  # Drop the grouping so `sample` can be converted to a factor in the next step
  summarize(mean_price = mean(price), .groups = "drop") |>
  mutate(sample = factor(sample, levels = c("50", "100", "500", "1000"))) |>
  ggplot() +
  geom_histogram(aes(x = mean_price)) +
  stat_summary(
    aes(x = mean(mean_price), y = mean_price),
    fun.data = \(x) data.frame(xintercept = quantile(x, c(.025, .975))),
    geom = "vline",
    color = "red",
    linetype = "dashed"
  ) +
  stat_summary(
    aes(x = mean(mean_price), y = mean_price),
    fun.data = \(x) data.frame(xintercept = mean(x)),
    geom = "vline",
    color = "blue",
    linewidth = 1
  ) +
  facet_wrap(~sample, nrow = 1)
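
The narrowing seen across the panels can also be checked numerically. This sketch compares the empirical spread of the sample means with the textbook standard error, sigma / sqrt(n). It again assumes a simulated stand-in population rather than the real `house_prices` data, and the names `empirical_se` and `theoretical_se` are introduced here for illustration.

```r
set.seed(2024)

# Hypothetical stand-in population (not the real house_prices data)
population <- rlnorm(20000, meanlog = 13, sdlog = 0.5)
sizes <- c(50, 100, 500, 1000)

# Empirical spread of 1000 sample means at each sample size
empirical_se <- sapply(sizes, function(n) {
  sd(replicate(1000, mean(sample(population, n))))
})

# Theoretical standard error, sigma / sqrt(n)
theoretical_se <- sd(population) / sqrt(sizes)

data.frame(n = sizes, empirical_se, theoretical_se)
```

The two columns should track each other closely, with the empirical value falling as the sample size grows.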