# Probability Sampling

Probability Sampling Basics

1. Construct a list of all units in the population, known as a **sampling frame**
2. Determine the probability of selection for every unit on the list (known and non-zero)
3. Select units from the list at random, with sampling rates for different subgroups determined by the probabilities of selection
4. Attempt to measure randomly selected units
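
Steps 1 to 3 can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the frame, the two subgroups `A` and `B`, the sampling rates, and the helper `draw_sample` are all hypothetical.

```python
import random

# Step 1: hypothetical sampling frame of 1,000 units in two subgroups.
frame = [{"id": i, "group": "A" if i < 800 else "B"} for i in range(1000)]

# Step 2: known, non-zero selection probabilities per subgroup (illustrative).
rates = {"A": 0.05, "B": 0.20}

def draw_sample(frame, rates, seed=0):
    """Step 3: select units at random within each subgroup at its rate."""
    rng = random.Random(seed)
    sample = []
    for group, rate in rates.items():
        units = [u for u in frame if u["group"] == group]
        n_group = round(rate * len(units))
        for unit in rng.sample(units, n_group):
            # Keep the selection probability with the unit for later analysis.
            sample.append({**unit, "p_select": n_group / len(units)})
    return sample

sample = draw_sample(frame, rates)
```

Storing `p_select` alongside each selected unit reflects the key requirement of step 2: the probability of selection is known for every sampled unit.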

Non-Probability Sampling

Units have no known probabilities of selection, which makes it difficult to make inferences about the population.

1. Opt-in Web Surveys
2. Quota Sampling
3. Snowball sampling
4. Convenience sampling (collecting data from people on the street or in classes, with no probabilities of selection)

Main problem - no statistical basis for making inference about the target population, high potential for bias

**Why Probability Sampling?**

Random selection of population units protects us against bias from the sample selection mechanism and allows us to make population inferences based on sampling distributions.

**Big Idea**

With careful sample design, probability samples yield representative, realistic, random samples from larger populations, and such samples have important properties. 

**Simple Random Sampling (SRS)**, and links to i.i.d (independent and identically distributed) data

SRS is a way of obtaining a representative selection of values from a population.

- With SRS we start with a known list of $N$ population units, and randomly select $n$ units from the list. 
- Every unit has **equal probability of selection** = $\frac{n}{N}$
- All possible samples of size $n$ are equally likely
- Estimates of means, proportions, and totals based on SRS are unbiased (equal to the population values on average)

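- Can be with replacement or without replacement
- Without replacement, each unit's probability of being in the sample is exactly $\frac{n}{N}$; with replacement, each unit is selected $\frac{n}{N}$ times on average (probability $\frac{1}{N}$ on each of the $n$ draws)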
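
For a very small population, the SRS properties listed above can be checked exactly by enumerating every possible sample drawn without replacement. The five population values below are made up for illustration.

```python
from fractions import Fraction
from itertools import combinations

population = [3, 7, 8, 12, 20]   # hypothetical values, N = 5
N, n = len(population), 2

# Every possible SRS of size n without replacement; each is equally likely.
samples = list(combinations(range(N), n))

# Share of samples that contain unit 0: the unit's probability of selection.
inclusion = Fraction(sum(0 in s for s in samples), len(samples))

# Average of the sample mean over all possible samples.
sample_means = [Fraction(sum(population[i] for i in s), n) for s in samples]
mean_of_means = sum(sample_means) / len(samples)
population_mean = Fraction(sum(population), N)
```

Here `inclusion` equals `Fraction(n, N)`, and `mean_of_means` equals `population_mean`, which is what "unbiased" means: individual sample means differ from the population mean, but they are right on average.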

- SRS is rarely used in practice: collecting data from $n$ randomly sampled units in a large population can be expensive

SRS is usually reserved for cases where the sample is small and collecting the data is inexpensive.

SRS Connection to I.I.D data

- Recall: i.i.d observations are independent and identically distributed
- SRS will generate i.i.d data for a given variable, in theory...

### SRS **Example**

- Customer Service database, N=2500 email requests in 2018
- Director wants to estimate the mean email response time
- Exact calculation requires manual review of each email thread
- Asks analytics team: sample, process and analyze $n = 100$ emails

**Naive Approach**

- Process the first 100 emails on the list
- Estimated mean could be biased if customer service representatives learn or get better over time at responding more quickly. 
- First 100 observations may come from a small group of staff, not fully representative, independent, or identically distributed. 
- No random selection according to specific probabilities

**Better Approach**

- Number emails 1 to 2500 and randomly select 100 using a random number generator
- Every email has a known probability of selection = $\frac{100}{2500}$
- Produces an unbiased estimate of the mean response time, as with any SRS estimate of a mean
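
The contrast between the two approaches can be illustrated with simulated data. The response times below are invented: they assume, as in the naive-approach discussion, that responses get faster over the year. Only `N = 2500` and `n = 100` come from the example.

```python
import random

rng = random.Random(2018)
N, n = 2500, 100

# Hypothetical response times (hours), trending down from about 10 to about 4
# over the year, plus some noise.
response_hours = [10 - 6 * i / (N - 1) + rng.uniform(-1, 1) for i in range(N)]
true_mean = sum(response_hours) / N

# Naive approach: process the first 100 emails on the list.
naive_mean = sum(response_hours[:n]) / n

# Better approach: number the emails and select 100 at random,
# without replacement.
selected = rng.sample(range(N), n)
srs_mean = sum(response_hours[i] for i in selected) / n
```

Under this assumed trend, `naive_mean` sits well above `true_mean` because it only sees the slow early responses, while `srs_mean` lands close to it.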

## Non-Probability Sampling

**No probabilities of selection govern the sampled units**

Features of Non-Probability samples
- Probabilities of selection **cannot be determined** for sampled units
- **No random selection** of individual units
- Samples can be divided into groups (strata) or clusters, but **clusters are not randomly sampled** in earlier stages
- Data collection is often very **cheap** relative to probability sampling

**Example**

- Studies of **volunteers**, as in clinical trials (the researcher has no control over who ends up in the study, so probabilities of selection are unknown)
- **Opt-in** / Intercept web surveys
- **Snowball** sampling - sample grows by people telling others
- **Convenience** samples, such as when a university samples people from a specific class or during a specific time of day
- **Quota** samples - certain targets that you try to meet to get a certain sample size. 

### Common Feature
Probabilities of selection **cannot be determined** a priori (before you actually begin the study). This is the crucial difference between probability sampling and non-probability sampling.

**What is the problem?**

- In a non-probability sample there is **no statistical basis for making inference** about larger population from which sample selected, because we don't control the probabilities of selection. 
- **Knowing probabilities of selection** (along with population strata and randomly sampled clusters) is what makes inference possible
    - With them we can estimate features of the sampling distribution, that is, what we would see if we took many random samples using the same design

- Sampled units are **not selected at random**: the probabilities of being included in a given sample are unknown and no random selection is used (strong risk of **sampling bias**)
- Sampled units **not generally representative** of larger target population of interest
- **"Big Data"** (information from millions of tweets) - **often from non-probability samples**

Thus we have no statistical basis for making conclusions about the larger population given these design features. 

**So what can we do?**

- Many data sets arise from non-probability samples

- Elliott and Valliant (2017), writing in the journal *Statistical Science*, give a technical deep dive into the many estimation approaches available for non-probability sample data sets.

- **Two Possible Approaches**
    - Pseudo-Randomization (with some work, we treat the non-probability sample like a probability sample)
    - Calibration (weight the non-probability sample to look more like the population you are interested in)

- **Pseudo-Randomization Approach**
    - **Combine non-probability sample with a probability sample** that collected similar measurements ("stack" the data sets together)
    - **Estimate probability of being included in non-probability sample** as a function of auxiliary information available in both samples
    - **Treat estimated probabilities of selection as "known"** for non-probability sample, use probability sampling methods for analysis
   
In the stacked data, an indicator records whether each unit is in the probability sample or the non-probability sample. Logistic regression is then used to estimate the probability of being in the non-probability sample as a function of the other variables (age, gender, etc.).
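
The stacking and logistic-regression step might look like the sketch below. It is a simplified illustration under several assumptions: the two samples are simulated, age is the only auxiliary variable, and `fit_logistic` is a hand-written Newton's-method fit rather than a library routine. A full analysis would also account for the probability sample's own selection probabilities, which is omitted here.

```python
import math
import random

rng = random.Random(1)

# Hypothetical data: age is measured in both samples.
prob_ages = [rng.uniform(18, 80) for _ in range(500)]             # probability sample
nonprob_ages = [rng.triangular(18, 80, 18) for _ in range(500)]   # skews young

# "Stack" the two samples and add the sample-membership indicator.
x = [(a - 50) / 10 for a in prob_ages + nonprob_ages]   # centered, scaled age
z = [0] * len(prob_ages) + [1] * len(nonprob_ages)      # 1 = non-probability sample

def fit_logistic(x, z, iterations=25):
    """Fit P(z = 1 | x) = 1 / (1 + exp(-(b0 + b1 * x))) by Newton's method."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, zi in zip(x, z):
            p = 1 / (1 + math.exp(-(b0 + b1 * xi)))
            w = p * (1 - p)
            g0 += zi - p            # gradient of the log-likelihood
            g1 += (zi - p) * xi
            h00 += w                # negative Hessian entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * step = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

b0, b1 = fit_logistic(x, z)

def inclusion_propensity(age):
    """Estimated probability of being in the non-probability sample, given age."""
    return 1 / (1 + math.exp(-(b0 + b1 * (age - 50) / 10)))
```

Because the simulated non-probability sample skews young, the fitted slope `b1` is negative and `inclusion_propensity(20)` is larger than `inclusion_propensity(70)`. These estimated probabilities are what would then be treated as "known" in the analysis.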

- **"Calibration" Approach**
    - **Compute weights for responding units** in a non-probability sample so that the weighted sample mirrors a known population
        - For example, compare the shares of males and females in the sample with those in the population, then adjust each unit's weight up or down to match the population shares.
        - This matters most when the imbalanced characteristic is correlated with a variable we ultimately want to make population statements about.
        - The more the weighted sample resembles the population on a characteristic strongly correlated with our variables of interest, the closer we get to unbiased statements about the population.
        - **Limitation**: if the weighting factor is not related to the variable(s) of interest, weighting will not reduce possible sampling bias
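
A minimal sketch of the male/female weighting described above, using made-up numbers: the population shares, the sample composition, and the 0/1 `supports` outcome are all assumptions for illustration.

```python
from collections import Counter

# Hypothetical known population shares.
population_share = {"male": 0.49, "female": 0.51}

# Hypothetical non-probability sample of (sex, supports) pairs:
# 70 males (50 supporters) and 30 females (10 supporters).
sample = ([("male", 1)] * 50 + [("male", 0)] * 20
          + [("female", 1)] * 10 + [("female", 0)] * 20)

# Weight = population share / sample share, so over-represented groups
# are weighted down and under-represented groups are weighted up.
counts = Counter(sex for sex, _ in sample)
weights = {sex: population_share[sex] / (counts[sex] / len(sample))
           for sex in counts}

unweighted = sum(y for _, y in sample) / len(sample)
weighted = (sum(weights[sex] * y for sex, y in sample)
            / sum(weights[sex] for sex, _ in sample))
```

With these numbers the unweighted estimate of support is 0.60 and the weighted estimate is 0.52. The weights change the answer only because support differs between males and females in this invented sample; if it did not, the weighting would leave the estimate unchanged, which is the limitation noted above.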
        
**Example**

An API is used to extract information from several hundred thousand tweets, and an indicator of support for President Trump is computed for each.

- **Probability** of a tweet being selected **cannot be determined**.  
- **Twitter users not a random sample** of a larger population
- **Lots of data**
    - High Potential for sampling bias
    - lack of representation: may only capture people with strong opinions

**What's next?**

- **Sampling distributions and sampling variance** 
    - How to estimate features of these distributions based on only one probability sample. 
- **Examples of making population inferences**
    - Based on type of sample selected
- Introduce **model-based** approaches to analyzing data

## Probability Sampling: Sampling Variance & Sampling Distributions

### What is a Sampling Distribution?

Example: **Normal Distribution** (bell curve)

- **Assume** values on variable of interest would follow certain distributions _if we could measure entire population_

- When we select probability samples to make inferential statements about larger populations, we refer to a **sampling distribution**
- **Sampling Distribution** is a distribution of **survey estimates** we would see if we selected many random samples using **same sample design** and **computed an estimate from each**

#### Key Properties
- **Hypothetical**: what would happen if we had the luxury of drawing thousands of probability samples and measuring each of them?
- Generally very **different in appearance from distribution of values on a single variable of interest...**
- With a **large enough probability sample size**, the sampling distribution of estimates will look like a **normal distribution**, regardless of what estimates are being computed. This is the **Central Limit Theorem (CLT)**.

**CLT**: the larger our sample size gets, the more the distribution of estimates across repeated samples tends towards normality.
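
The "thousands of hypothetical samples" idea can be simulated directly. The sketch below assumes a made-up, strongly right-skewed population and uses the sample mean as the estimate; the `skewness` and `sampling_distribution` helpers are illustrative names.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical, strongly right-skewed population of 10,000 values.
population = [rng.expovariate(1.0) for _ in range(10_000)]

def skewness(values):
    """Standardized third moment; 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

def sampling_distribution(n, draws=2_000):
    """Means of `draws` simple random samples of size n (same design each time)."""
    return [statistics.fmean(rng.sample(population, n)) for _ in range(draws)]

means_small = sampling_distribution(n=5)
means_large = sampling_distribution(n=100)
```

Comparing `skewness(population)`, `skewness(means_small)`, and `skewness(means_large)` shows the skew shrinking toward zero as the sample size grows: the distribution of estimates looks increasingly normal even though the variable itself does not.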

### What is a Sampling Variance?

- **Sampling Variance** is the variability in the estimates described by the sampling distribution. 
- Because we select a **sample** and do not measure everyone in a **population**, a survey estimate based on a **single sample will not** be exactly equal to the population quantity of interest (we are selecting cases at random)



A sampling distribution is the distribution of all possible estimates that would arise from hypothetical repeated sampling. Larger sample sizes result in a sampling distribution with less variance, meaning that estimates are more precise.
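
The link between sample size and sampling variance can be checked by simulation. The sketch below uses a made-up population and compares the empirical variance of the sample mean with the standard textbook result for SRS without replacement, $\left(1 - \frac{n}{N}\right)\frac{S^2}{n}$, which is not derived in these notes.

```python
import random
import statistics

rng = random.Random(7)
N = 2_000
population = [rng.gauss(50, 10) for _ in range(N)]   # hypothetical values
S2 = statistics.variance(population)                 # population variance (N - 1 divisor)

def empirical_sampling_variance(n, draws=3_000):
    """Variance of the sample mean across repeated SRS draws of size n."""
    means = [statistics.fmean(rng.sample(population, n)) for _ in range(draws)]
    return statistics.pvariance(means)

def theoretical_sampling_variance(n):
    """Variance of the SRS (without replacement) mean: (1 - n/N) * S^2 / n."""
    return (1 - n / N) * S2 / n

# Map each sample size to (empirical, theoretical) sampling variance.
results = {n: (empirical_sampling_variance(n), theoretical_sampling_variance(n))
           for n in (25, 100, 400)}
```

In `results`, both the simulated and the theoretical variance fall as `n` goes from 25 to 400, which is the sense in which larger samples give more precise estimates.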