# Understanding and Visualizing Data with Python

## Week 1

### Variable Types
* Quantitative Variables
    * Continuous
    * Discrete
* Categorical (or Qualitative) Variables
    * Ordinal - groups have an order or ranking
    * Nominal

## Week 2

### Quantitative data
* Histograms
    - Shape: right-skewed (tails on the right)
    - Center
    - Spread
    - Outliers
* Numerical Summaries (Summary Statistics)
    - 5 Number Summary
        * Min
        * 1st Quartile (25%, Q1)
        * Median (50%)
        * 3rd Quartile (75%, Q3)
        * Max <br>
        Others: <br>
        IQR = Q3 - Q1 (Inter Quartile Range) <br>
        Range = Max - Min
    - Left-skewed histogram: mean is **less than** median, because the extreme values in the left tail pull the mean down.
    - Range is not robust to outliers. IQR is robust to outliers.
    - Standard deviation: roughly the average distance of the values from their mean.
* Empirical Rule
    - ($\mu-\sigma$, $\mu+\sigma$): 68%
    - ($\mu-2\sigma$, $\mu+2\sigma$): 95%
    - ($\mu-3\sigma$, $\mu+3\sigma$): 99.7%
    - Standardize: z-score = (Obs - mean) / SD
* Boxplots
    - 5 Number Summary
    - pro: Boxplots can help identify outliers.
    - con: Boxplots can hide gaps and clusters.
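
As a minimal sketch of the numerical summaries, z-scores, and Empirical Rule listed above, the following uses only Python's standard `statistics` module. The `data` values are made up for illustration, and the quartiles use the `inclusive` method; other tools may follow slightly different quartile conventions.

```python
import statistics
from statistics import NormalDist

# Hypothetical right-skewed sample (values chosen only for illustration).
data = [2, 3, 3, 4, 5, 5, 6, 7, 8, 30]

# Five-number summary: min, Q1, median, Q3, max.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
five_number = (min(data), q1, median, q3, max(data))

iqr = q3 - q1                        # robust to the outlier (30)
value_range = max(data) - min(data)  # not robust: driven by the outlier

mean = statistics.mean(data)
sd = statistics.stdev(data)          # sample standard deviation

# z-score: how many standard deviations an observation lies from the mean.
z_scores = [(x - mean) / sd for x in data]

# Empirical Rule: share of a normal distribution within k SDs of the mean.
coverage = {k: NormalDist().cdf(k) - NormalDist().cdf(-k) for k in (1, 2, 3)}

print("five-number summary:", five_number)
print("IQR:", iqr, "range:", value_range)
print("mean:", mean, "median:", median)
print("normal coverage within 1, 2, 3 SDs:", coverage)
```

Because the sample is right-skewed, the mean ends up above the median, which mirrors the left-skewed case described in the notes.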

## Week 3
### Multivariate Data
* Multivariate Categorical Data
    - Histogram
    - Boxplot

* Multivariate Quantitative Data
    - Scatterplot
        * `Pearson correlation` (R or $\rho$): a value between -1 and 1 indicating the strength and direction of the linear relationship between two variables.
        * **Correlation Does Not Imply Causation**.
        * Outliers
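
The Pearson correlation mentioned above can be computed directly from its definition. This is a small sketch with made-up data; `pearson_r`, `hours`, and `scores` are hypothetical names introduced only for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired observations.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 70, 72]

r = pearson_r(hours, scores)
print("Pearson r:", round(r, 3))
```

A value near 1 here only describes a strong positive linear association in this toy data; it says nothing about causation.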

## Week 4
### Populations vs. Samples
Well-defined target population 
#### Probability Sampling
1. Simple Random Sampling (SRS)
    * randomly select n units from N population units
    * equal probability of selection = n/N
    * all possible samples of size n are equally likely
    * estimates are unbiased on average
    * can be **with replacement** or **without replacement**
    * for both: probability of selection = n/N
    * SRS is rarely used in practice: too expensive
2. Complex Sampling for Larger Populations
    * strata (stratification)
    * clusters
    * simple random sampling within each cluster
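
Simple random sampling can be sketched with Python's `random` module. The population size `N`, sample size `n`, and the unit IDs are hypothetical; the repeated-sampling loop is only a rough check that each unit is selected with probability close to n/N when sampling without replacement.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

N = 1000                     # population size (hypothetical)
n = 50                       # sample size (hypothetical)
population = list(range(N))  # unit IDs 0..N-1

srs_without = random.sample(population, n)   # without replacement: no repeats
srs_with = random.choices(population, k=n)   # with replacement: repeats possible

selection_prob = n / N

# Rough check: how often is unit 0 selected across repeated samples?
reps = 5000
hits = sum(0 in random.sample(population, n) for _ in range(reps))
observed_rate = hits / reps

print("theoretical selection probability:", selection_prob)
print("observed selection rate for unit 0:", observed_rate)
```
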
#### Non-Probability Sampling
1. Properties
    * **Probabilities of selection cannot be determined for sampled units**
    * no random selection
    * clusters are not randomly selected
    * not expensive compared to SRS
2. Examples
    * volunteers
    * opt-in web surveys
    * snowball sampling
    * convenience samples
    * quota samples
3. Cons
    * strong risk of sampling bias
    * no statistical basis for making inference
4. What can we do?
    * Pseudo-Randomization Approach
        - combine non-probability sample with a probability sample
        - estimate probability of being included in non-probability sample
        - treat estimated probabilities of selection as "known" for non-probability sample
    * Calibration Approach
        - compute weights for responding units in non-probability sample that allow the weighted sample to mirror a known population
        - downweight/upweight
        - limitation: if weighting factor not related to variables of interest -> will not reduce bias
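
The calibration approach can be illustrated with a simple post-stratification sketch. Everything here is assumed for illustration: the age groups, the "known" `population_share` values, and the respondent data. Groups that are over-represented in the sample get weights below 1 (downweighted), and under-represented groups get weights above 1 (upweighted).

```python
from collections import Counter

# Known population distribution of the calibration variable (hypothetical).
population_share = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}

# Hypothetical non-probability sample: (age_group, outcome y in {0, 1}).
respondents = (
    [("18-34", 1)] * 5 + [("18-34", 0)] * 1
    + [("35-54", 1)] * 1 + [("35-54", 0)] * 2
    + [("55+", 0)] * 1
)

n = len(respondents)
group_counts = Counter(group for group, _ in respondents)
sample_share = {g: c / n for g, c in group_counts.items()}

# Calibration weight: population share divided by sample share.
weight = {g: population_share[g] / sample_share[g] for g in sample_share}

unweighted_mean = sum(y for _, y in respondents) / n
weighted_mean = (
    sum(weight[g] * y for g, y in respondents)
    / sum(weight[g] for g, _ in respondents)
)

print("weights:", weight)
print("unweighted mean:", unweighted_mean)
print("weighted mean:", round(weighted_mean, 4))
```

The weighted estimate moves here only because the outcome differs by age group; if the weighting variable were unrelated to the outcome, the weights would leave the estimate essentially unchanged, which is the limitation noted above.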

### Probability Samples
#### Probability Samples -> Sampling Distributions
* Sampling Distributions
    - distribution of survey estimates we would see if we selected many random samples using same sampling design and computed an estimate from each.
    - key properties:
        * hypothetical
        * large sample size -> normal distribution (**Central Limit Theorem**)
* Sampling Variance
    - variability in the estimates described by the sampling distribution
    - sampling errors randomly vary
    - variability of these sampling errors describes the variance of the sampling distribution
    - larger samples -> less sampling variance
* Why is Sampling Variance Important?
    - sampling theory allows us to estimate the variance of sampling distribution based on **only one sample**
* Sampling Distributions of Other Common Statistics
    - (**Central Limit Theorem**) <span style='color:maroon'>Given large enough samples, sampling distributions of most statistics of interest tend to normality (regardless of how the input variables are distributed)</span>.
    - Pearson Correlations: between -1 and 1. 
    - Non-Normal Sampling Distributions
        * **Not all** statistics have normal sampling distributions
        * In these cases, more specialized procedures needed to make population inferences (e.g., Bayesian methods).
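
A sampling distribution is hypothetical, but it can be approximated by simulation. The sketch below assumes a made-up, strongly skewed population and draws many simple random samples at two sample sizes; `sample_means` is a helper name introduced only for this example.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical skewed population (exponential values, far from normal).
population = [random.expovariate(1.0) for _ in range(20_000)]
population_mean = statistics.fmean(population)

def sample_means(n, reps=2000):
    """Draw `reps` simple random samples of size n; return each sample's mean."""
    return [statistics.fmean(random.sample(population, n)) for _ in range(reps)]

means_small = sample_means(10)
means_large = sample_means(200)

# Sampling variance: variability of the estimates across repeated samples.
var_small = statistics.variance(means_small)
var_large = statistics.variance(means_large)

print("population mean:", round(population_mean, 3))
print("mean of sample means (n=200):", round(statistics.fmean(means_large), 3))
print("sampling variance, n=10:", round(var_small, 4))
print("sampling variance, n=200:", round(var_large, 4))
```

Plotting a histogram of `means_large` would be one way to see the roughly normal shape the Central Limit Theorem describes, even though the population itself is skewed.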


### Inference in Practice
#### Making Population Inference Based on Only One Probability Sample
Key Assumption: Normality (sampling distribution of statistics is normal *if the sample size is large*)
1. Confidence Interval Estimate for Parameters of Interest
    - To form a confidence interval: best estimates +/- margin of error
    - 95% confidence interval: **expect 95% of intervals will cover true population value**
2. Hypothesis Testing about Parameters of Interest
These two inferential procedures are valid if **probability sampling** was used.
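
To make "best estimate +/- margin of error" concrete, here is a sketch of a 95% confidence interval for a mean, followed by a simulation of how often such intervals cover the true value. The normal population with `TRUE_MEAN` and `TRUE_SD`, the sample size of 100, and the helper `confidence_interval` are all assumptions made for illustration.

```python
import math
import random
import statistics
from statistics import NormalDist

random.seed(7)  # fixed seed so the sketch is reproducible

TRUE_MEAN, TRUE_SD = 50.0, 10.0   # hypothetical population parameters
z = NormalDist().inv_cdf(0.975)   # multiplier for 95% confidence, about 1.96

def confidence_interval(sample, z_value):
    """Estimate +/- z * standard error, relying on approximate normality."""
    estimate = statistics.fmean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    margin = z_value * standard_error
    return estimate - margin, estimate + margin

# One sample, one interval: what we have in practice.
one_sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(100)]
low, high = confidence_interval(one_sample, z)

# Many hypothetical samples: share of intervals covering the true mean.
reps = 1000
covered = 0
for _ in range(reps):
    s = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(100)]
    lo_bound, hi_bound = confidence_interval(s, z)
    covered += lo_bound <= TRUE_MEAN <= hi_bound
coverage = covered / reps

print("one 95% interval:", (round(low, 2), round(high, 2)))
print("share of intervals covering the true mean:", coverage)
```
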

#### Inference for Non-Probability Samples
Non-probability samples do not let us rely on sampling theory for making population inferences based on expected sampling distributions. <br>
For any of the following estimation techniques for non-probability samples, we need to have common variables in the two data sets.
1. Quasi-Randomization (or pseudo-randomization)
    * combine non-probability sample with prior data from a probability sample that collects the same features
    * stack the two datasets
    * code: if member of non-probability sample, label as 1; if member of probability sample, label as 0
    * fit **logistic regression model**: predict label with common variables
        - weighting non-probability cases by 1
        - weighting probability cases by their survey weights 
    * big idea: 
        1. can predict probability of being in non-probability sample
        2. invert predicted probabilities for non-probability samples, treat as survey weight (i.e., inverse of probability of being selected)
    * issue: how to estimate sampling variance?
        - no settled approach yet; replication methods are one option
    
2. Population Modeling
    * Big idea: use predictive modeling to predict aggregate sample quantities (usually totals) on key variables of interest for population units not included in the non-probability sample
    * compute estimates of interest using estimated totals <br>
In summary, 
* leverage other auxiliary information
* predict values
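
The quasi-randomization steps above can be sketched without a modeling library under one simplifying assumption: the only common variable is a single categorical variable (here a hypothetical `age_group`). In that special case, a weighted logistic regression with one indicator per category reproduces the weighted share of label-1 cases in each category, so the sketch computes those shares directly instead of fitting a model. All data values are made up.

```python
from collections import defaultdict

# Hypothetical records: (age_group, weight used when fitting the model).
# Non-probability cases are weighted by 1; probability cases by survey weight.
nonprob_sample = [("young", 1.0)] * 8 + [("old", 1.0)] * 2
prob_sample = [("young", 100.0)] * 3 + [("old", 200.0)] * 3

# Stack the two datasets: label 1 = non-probability, 0 = probability sample.
stacked = (
    [(g, 1, w) for g, w in nonprob_sample]
    + [(g, 0, w) for g, w in prob_sample]
)

# Weighted share of label-1 cases per group (stand-in for the fitted model).
weight_label1 = defaultdict(float)
weight_total = defaultdict(float)
for group, label, w in stacked:
    weight_total[group] += w
    if label == 1:
        weight_label1[group] += w

predicted_prob = {g: weight_label1[g] / weight_total[g] for g in weight_total}

# Invert predicted probabilities to get pseudo survey weights.
pseudo_weights = [1.0 / predicted_prob[g] for g, _ in nonprob_sample]

print("predicted probabilities:", predicted_prob)
print("pseudo-weights:", pseudo_weights)
```

With several or continuous common variables, an actual logistic regression fit would be needed; this shortcut only holds for the single-categorical case assumed here.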


#### Complex Samples
Complex samples: any probability sample where design involves more than Simple Random Sampling (SRS). <br>
Aim of design features such as stratification: choose a sampling scheme that reduces sampling variance. <br>
Proportion Allocation
* Features of Complex Samples
    - Stratification: eliminates between-stratum variance in means (or totals) of a variable from the sampling variance
        * failing to account for stratification in analysis -> inference too conservative, confidence intervals too wide
    - Clustering: random sampling of larger clusters of population elements, possibly across multiple stages
        * reduces cost of data collection; failing to account for clustering in analysis -> inference too liberal, confidence intervals too narrow
    - Weighting: complex samples are still probability samples
        * unequal probabilities of selection for different units 
        * weights = inverse of probability of selection
            - If my probability is 1/100 -> my weight is 100
            - I represent myself and 99 others in the population
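
The weighting idea above can be sketched as follows: each sampled unit gets a weight equal to the inverse of its selection probability, and weighted sums estimate population quantities. The sample values and selection probabilities are hypothetical.

```python
# Hypothetical sample: (observed value, probability of selection).
sample = [(10, 1 / 100), (12, 1 / 100), (20, 1 / 50), (30, 1 / 50)]

# Weight = inverse of the probability of selection.
weights = [1 / p for _, p in sample]

# Weighted estimates: each unit stands in for `weight` population units.
estimated_population_size = sum(weights)
estimated_total = sum(w * y for w, (y, _) in zip(weights, sample))
weighted_mean = estimated_total / estimated_population_size

# Ignoring the unequal probabilities gives a different answer.
unweighted_mean = sum(y for y, _ in sample) / len(sample)

print("weights:", weights)
print("estimated population size:", estimated_population_size)
print("weighted mean:", round(weighted_mean, 3))
print("unweighted mean:", unweighted_mean)
```

Units selected with probability 1/100 count twice as much as those selected with probability 1/50, which is why the weighted mean differs from the unweighted one.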


