# Survey Design - Sample Size

When conducting a survey, your primary concern should be not the meaning of the results but the __reliability of results__. An important factor for drawing general conclusions from a survey is its sample size. Unlike A/B tests, where a power analysis is used, decisions on minimum required __sample sizes for surveys are based on confidence intervals__, i.e. the level of uncertainty one is willing to accept when reporting population estimates such as means or proportions.

Depending on a survey's target KPI there are different ways to estimate a minimum required sample size. However, they do share similar input parameters:
- __Margin of error:__
    - The margin of error is the maximum deviation from the true value (which we try to estimate with our survey) that we are willing to accept. It's typically set to $ME \leq 0.05$ (at most 5%) and is half the width of the confidence interval.
- __Alpha:__
    - Type I error rate, i.e. the probability of finding an effect when there is none (false positive). It's used to define the confidence level, which represents the degree of certainty we aim to have in the range of possible true values given by our population estimate and its margin of error. Typically, you'd choose a 5% Type I error rate for 95% confidence (100-5), thus set $\alpha = 0.05$.
- __Size of population:__
    - Oftentimes we don't know exactly how large the underlying population of our survey is, but if we have an idea of its size (and it's not infinite) we can use this information to correct our sample size estimate:
    
    $n_{corr} = \frac{nN}{n+N-1}$,
    
    where $n$ is the estimated minimum required sample size and $N$ is the size of the population.
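
To get a feel for the finite population correction, the sketch below applies it to a few population sizes. The helper name `fpc_correct`, the uncorrected sample size of 385 and the population sizes are assumptions chosen purely for illustration.

```r
# Finite population correction: n_corr = n * N / (n + N - 1)
# `fpc_correct` is a hypothetical helper name used only in this sketch.
fpc_correct <- function(n, N){
    # An infinite population needs no correction
    if (is.infinite(N)){
        return(n)
    }
    n * N / (n + N - 1)
}

n_uncorrected <- 385                           # assumed uncorrected sample size
population_sizes <- c(500, 2000, 100000, Inf)  # assumed population sizes

# Apply the correction per population size and round up to whole participants
n_corrected <- ceiling(sapply(population_sizes, fpc_correct, n = n_uncorrected))
print(data.frame(N = population_sizes, n_corr = n_corrected))
```

The correction only makes a noticeable difference when the required sample is a sizeable share of the population; for very large populations the corrected value approaches the uncorrected one.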

## Continuous Outcome
_Normal Distributions (Approximately)_

The __minimum required sample size__ for survey designs with a continuous outcome can be approximated as follows:

$n = \left(\frac{Z \cdot S}{Err}\right)^2$,

where
- $Z$ is the z-score, i.e. the critical value from the normal distribution representing your confidence level (e.g. 1.96 for 95% confidence)
- $S$ is the population standard deviation
- $Err$ is the margin of error, expressed in the units of the outcome (e.g. 0.05 points on a 5-point scale)

The problem with this method is that the __population standard deviation is rarely known in advance__ unless previous results from similar studies are available. As a workaround, one could use the six-sigma rule for bell-shaped distributions, which says that about 99.7% of observations lie within a span of 6 standard deviations (3 to each side of the mean). Thus, the standard deviation is approximated as the range of observations divided by 6, i.e.

$S = \frac{Max-Min}{6}$ 
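
As a quick sanity check of this rule of thumb, the sketch below computes the share of a normal distribution that falls within three standard deviations of the mean and the resulting guess for $S$ on a 1 to 5 rating scale. The scale bounds are assumed example values.

```r
# Share of a normal distribution within 3 standard deviations of the mean
coverage <- pnorm(3) - pnorm(-3)
print(round(coverage, 4))

# Range-based guess for the standard deviation on a 1-5 scale
# (assumed example bounds, not taken from real survey data)
scale_min <- 1
scale_max <- 5
sd_guess <- (scale_max - scale_min) / 6
print(round(sd_guess, 3))
```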


```r
# Sample size function for surveys with continuous outcome
n.cont <- function(sd = NULL, min = 1, max = 5, alpha = 0.05, err = 0.05, N = Inf){
    # Estimates single sample size for surveys where outcome
    # is continuous (e.g. a mean rating)
    #
    # Args:
    # - sd: Expected standard deviation of outcome (if NULL, approximated as (max - min) / 6)
    # - min: Minimum of outcome variable (default set to 1 for 5-point Likert scale)
    # - max: Maximum of outcome variable (default set to 5 for 5-point Likert scale)
    # - alpha: Type I error rate (typically set to 0.01, 0.05 or 0.10)
    # - err: Margin of error in units of the outcome (typically set between 0.10 and 0.01)
    # - N: Population size for finite populations (Inf = no correction)
    #
    # Returns:
    # - n: Number of required participants

    # Fall back to the range-based approximation if no sd is given
    if (is.null(sd)){
        sd <- (max - min) / 6
    }

    # Critical z-value for the chosen confidence level
    z <- qnorm(1 - alpha / 2)

    # Estimate sample size for infinite populations ...
    n <- (z * sd / err)^2

    # ... and correct it for finite populations
    if (!is.null(N) && is.finite(N)){
        n <- (n * N) / (n + N - 1)
    }

    # Return sample size, rounded up to whole participants
    return(ceiling(n))
}


# Estimate sample size
n_participants <- n.cont()
print(paste0('Number of required survey participants: ', n_participants))


# Estimate sendout volume in order to achieve required response n
expected_response <- 0.2
n_sendouts <- n_participants / expected_response
print(paste0('Number of required sendouts: ', n_sendouts))
```

```
[1] "Number of required survey participants: 683"
[1] "Number of required sendouts: 3415"
```
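
The same formula can be read in reverse: solving it for $Err$ gives $Err = Z \cdot S / \sqrt{n}$, the margin of error you can expect from a given number of responses. The sketch below is only an illustration; `moe.cont` is a hypothetical helper, and the response counts and the range-based standard deviation for a 1 to 5 scale are assumed values.

```r
# Margin of error achieved with n responses (infinite population assumed)
# `moe.cont` is a hypothetical helper name used only in this sketch.
moe.cont <- function(n, sd = (5 - 1) / 6, alpha = 0.05){
    # Critical z-value for the chosen confidence level
    z <- qnorm(1 - alpha / 2)
    z * sd / sqrt(n)
}

responses <- c(100, 300, 683, 1000)  # assumed numbers of completed surveys
moe <- moe.cont(responses)
print(data.frame(n = responses, margin_of_error = round(moe, 3)))
```

This can help judge what precision is still attainable if the response rate turns out lower than planned.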


## Proportions
_Binomial Distributions_

The __minimum required sample size__ for survey designs with binary outcomes can be approximated as follows:

$n = \left(\frac{Z}{Err}\right)^2 p (1-p)$,

where
- $Z$ is the z-score i.e. critical value from the normal distribution representing your confidence level (e.g. 1.96 for 95% confidence)
- $p$ is the expected proportion of the characteristic of interest
- $Err$ is the margin of error (e.g. 0.05 for 5%)

Similar to the standard deviation in the equation for continuous outcomes, we have to decide on a value that's oftentimes unknown in advance, $p$. In the __most conservative setting where we have no idea about the true value of $p$__ we'd choose $p = 0.5$, because this maximizes $p(1-p)$ and therefore the required sample size. If $p > 0.5$ you can equivalently estimate your survey's sample size based on the proportion NOT having the characteristic of interest, $1-p$, since $p(1-p)$ is symmetric around 0.5.
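
To see why $p = 0.5$ is the conservative choice, the following sketch evaluates the formula for a handful of expected proportions at 95% confidence and a 5% margin of error. The chosen values of $p$ are illustrative assumptions.

```r
# Required sample size for different expected proportions
p_values <- c(0.1, 0.3, 0.5, 0.7, 0.9)  # assumed example proportions
z <- qnorm(1 - 0.05 / 2)                # critical value for 95% confidence
err <- 0.05                             # margin of error

n_required <- ceiling((z / err)^2 * p_values * (1 - p_values))
print(data.frame(p = p_values, n = n_required))
```

The required sample size peaks at $p = 0.5$ and is identical for $p$ and $1-p$.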

Apparently, this type of sample size calculation is the one __predominantly used in survey sample size calculators on the web__.

```r
# Sample size function for surveys with binary outcome
n.prop <- function(p = 0.5, alpha = 0.05, err = 0.05, N = Inf){
    # Estimates single sample size for surveys where outcome
    # is represented as proportion
    #
    # Args:
    # - p: Expected proportion of outcome (0.5 = maximum uncertainty)
    # - alpha: Type I error rate (typically set to 0.01, 0.05 or 0.10)
    # - err: Margin of error (typically set between 0.1 and 0.01)
    # - N: Population size for finite populations (Inf = no correction)
    #
    # Returns:
    # - n: Number of required participants

    # Critical z-value for the chosen confidence level
    z <- qnorm(1 - alpha / 2)

    # Estimate sample size for infinite populations ...
    n <- z^2 * p * (1 - p) / err^2

    # ... and correct it for finite populations
    if (!is.null(N) && is.finite(N)){
        n <- (n * N) / (n + N - 1)
    }

    # Return sample size, rounded up to whole participants
    return(ceiling(n))
}


# Estimate sample size
n_participants <- n.prop()
print(paste0('Number of required survey participants: ', n_participants))


# Estimate sendout volume in order to achieve required response n
expected_response <- 0.2
n_sendouts <- n_participants / expected_response
print(paste0('Number of required sendouts: ', n_sendouts))
```

```
[1] "Number of required survey participants: 385"
[1] "Number of required sendouts: 1925"
```
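
Because $Err$ enters the formula squared, tightening the margin of error gets expensive quickly: halving it roughly quadruples the required sample. The sketch below illustrates this for $p = 0.5$ and 95% confidence; the helper `n_for_margin` and the list of margins are assumptions made for this example.

```r
# Required sample size as a function of the margin of error
# `n_for_margin` is a hypothetical helper name used only in this sketch.
n_for_margin <- function(err, p = 0.5, alpha = 0.05){
    # Critical z-value for the chosen confidence level
    z <- qnorm(1 - alpha / 2)
    ceiling(z^2 * p * (1 - p) / err^2)
}

margins <- c(0.10, 0.05, 0.03, 0.01)  # assumed margins of error
n_by_margin <- n_for_margin(margins)
print(data.frame(err = margins, n = n_by_margin))
```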


## References
- [Appropriate Sample Size in Survey Research](https://www.opalco.com/wp-content/uploads/2014/10/Reading-Sample-Size1.pdf)
- [Sample Size Estimation](http://www.columbia.edu/~mvp19/RMC/M6/M6.doc)
- [Sample Size For Single Proportions](https://www.stat.auckland.ac.nz/~wild/ChanceEnc/Ch08.psampsize.pdf)
- [Determining Sample Size](https://pdfs.semanticscholar.org/aee4/333d7f9f2d8ae8ad2f193dba35548901d370.pdf)
