Estimation & Sampling

1) Modeling
![para_vs_nonpara](para_vs_nonpara.gif)
![para_vs_non_para_chart](para_vs_non_para_chart.jpg)
* **Parametric** - assumes data comes from a type of probability distribution and makes inferences about the parameters
    * example: $Normal(\mu,\sigma^2), Poisson(\lambda)$
    * makes use of common sample statistics:
        * $\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i$
        * $s^2=\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$
* **Non-parametric** - makes no assumption about the underlying probability distribution from which the variables arise
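
The two sample statistics above can be computed directly from their definitions. This is a minimal sketch on made-up numbers (the `data` list is purely illustrative), cross-checked against Python's `statistics` module.

```python
import statistics

# Made-up observations, used only for illustration
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(data)
x_bar = sum(data) / n                               # sample mean
s2 = sum((x - x_bar) ** 2 for x in data) / (n - 1)  # sample variance (n - 1 denominator)

# The standard library implements the same two formulas
print(x_bar, statistics.mean(data))
print(s2, statistics.variance(data))
```
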

2) Inference
* **Method of Moments (MOM)** - a parameter estimation strategy that equates population moments (functions of the parameters) to the corresponding sample moments and solves for the parameters
![mom](mom.png)
    * What is a moment? $E[X]$ - first moment, $E[X^2]$ - second moment, $E[X^3]$ - third moment
    * Example: Assumes data comes from Binomial distribution
        * $X_i$~$Binomial(N,\underline{p})$
        * $E[X_i]=Np$ - first population moment; set it equal to the sample mean $\bar{x}$ computed from **sample data**
        * $\hat{p}=\frac{\bar{x}}{N}$ - solve for the estimate of parameter $p$ based on the first moment
    * Example: Assumes data comes from Uniform distribution
        * $X_i$~$Uniform(-\theta,\underline{\theta})$
        * $E[X_i]=\frac{-\theta+\theta}{2}=0$ - the first moment does not depend on $\theta$, so $\theta$ **cannot** be estimated from the sample mean
        * $Var(X_i) = E[X_i^2] = \frac{(2\theta)^2}{12}=\frac{\theta^2}{3}$ - use the second moment instead
        * $s^2 =\frac{\hat{\theta}^2}{3} \Rightarrow \hat{\theta}=\sqrt{3s^2}$ - match the sample variance and solve for $\theta$
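
The two method-of-moments examples above can be checked by simulation. This is a rough sketch, not a definitive implementation: the true values `p_true` and `theta_true`, the sample sizes, and the seed are all made up for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Binomial(N, p): each observation is the number of successes in N Bernoulli trials
N, p_true = 10, 0.3
binom_sample = [sum(random.random() < p_true for _ in range(N)) for _ in range(5000)]
p_hat = statistics.mean(binom_sample) / N          # p_hat = x_bar / N

# Uniform(-theta, theta): the mean carries no information, so match the variance
theta_true = 2.0
unif_sample = [random.uniform(-theta_true, theta_true) for _ in range(5000)]
s2 = statistics.variance(unif_sample)
theta_hat = (3 * s2) ** 0.5                        # solve s^2 = theta^2 / 3

print(p_hat, theta_hat)
```

Both estimates should land close to the assumed true values, with the gap shrinking as the sample size grows.
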
* **Maximum Likelihood (MLE)** - a parameter estimation strategy that chooses the parameter values maximizing the likelihood $L(\theta\mid x_1,\dots,x_n)$ of the observed data
![mle](mle.png)
    * What is a likelihood function?
        * Assume $x_1,x_2,\dots,x_n$ are independent and identically distributed random variables (same probability distribution and mutually independent)
        * The joint density of the data, viewed as a function of $\theta$, is the **likelihood, $L$**:
            * $\begin{align} L(\theta\mid x_1,\dots,x_n) 
                & = f(x_1\mid \theta)f(x_2\mid \theta)f(x_3\mid \theta)\cdots f(x_n\mid \theta) \\
                & = f(x_1,x_2,\dots,x_n\mid \theta) \\
                & = \prod_{i=1}^n f(x_i\mid \theta) \\
                \end{align}$
        * **Log likelihood** (makes the calculus easier)
            * $logL(\theta\mid x_1,\dots,x_n) = \sum_{i=1}^n log[f(x_i\mid \theta)]$
        * Maximizing the log likelihood gives us the parameter estimate
            * $\hat{\theta}_{mle} = argmax_{\theta \in \Theta}$ $logL(\theta\mid x_1,\dots,x_n)$
    * Example: Assume data comes from Binomial distribution
        * $X_i$~$Binomial(N,\underline{p})$
        * $\begin{align} 
            & f(x_i\mid p) = \binom{N}{x_i}p^{x_i}(1-p)^{N-x_i} \\
            & L(p\mid x) = \prod_{i=1}^n \binom{N}{x_i}p^{x_i}(1-p)^{N-x_i} \\
            & logL(p\mid x) = \sum_{i=1}^n \bigg[log \binom{N}{x_i} + x_i log(p) + (N-x_i)log(1-p)\bigg] \\
            & \frac{\partial logL(p\mid x)}{\partial p} = \sum_{i=1}^n \bigg[\frac{x_i}{\hat{p}}-\frac{N-x_i}{1-\hat{p}}\bigg] = 0 \\
            & \hat{p} = \frac{\sum_{i=1}^n x_i}{nN} = \frac{\bar{x}}{N} \\
            \end{align}$
    * Example: In logistic regression, each observation can be treated as a Bernoulli trial with feature vector $x_i$, observed response $y_i \in \{0,1\}$, and success probability $p(x_i)$
        * Pick coefficients that maximize the joint likelihood
        * $L(B_0,B\mid x_1,\dots,x_n) = \prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}$
        * No closed-form solution
        * Use a numerical method such as gradient descent on the negative log likelihood (equivalently, gradient ascent on the log likelihood)
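
Both MLE examples above can be sketched numerically. The first part checks that a grid search over the Binomial log likelihood lands on $\bar{x}/N$. The second fits a one-feature logistic regression by gradient ascent on the log likelihood. The data, grid resolution, learning rate `lr`, and iteration count are all assumptions chosen for illustration.

```python
import math

# --- Binomial(N, p): numeric maximizer vs. closed form x_bar / N ---
N = 10
xs = [3, 5, 4, 2, 6]  # made-up success counts

def binom_loglik(p):
    return sum(math.log(math.comb(N, x)) + x * math.log(p) + (N - x) * math.log(1 - p)
               for x in xs)

grid = [i / 1000 for i in range(1, 1000)]   # candidate values of p in (0, 1)
p_grid = max(grid, key=binom_loglik)
p_closed = sum(xs) / len(xs) / N

# --- Logistic regression: no closed form, so climb the log likelihood ---
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

X = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]   # made-up feature values
Y = [0, 0, 1, 0, 1, 0, 1, 1]                   # made-up responses (not separable)

b0, b1, lr = 0.0, 0.0, 0.1
for _ in range(20000):
    # Gradient of the log likelihood: sum over i of (y_i - p(x_i)) * [1, x_i]
    g0 = sum(y - sigmoid(b0 + b1 * x) for x, y in zip(X, Y))
    g1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in zip(X, Y))
    b0 += lr * g0 / len(X)
    b1 += lr * g1 / len(X)

print(p_grid, p_closed, b0, b1)
```

At convergence the gradient is close to zero, which is the numerical counterpart of setting the derivative of the log likelihood to zero in the Binomial derivation.
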
* **MOM vs MLE**:
    * MOM is the older method; MLE is now generally preferred because its estimators are typically more efficient
    * Advantages of MOM:
        * Fairly simple
        * Useful if MLE is computationally intractable
        * Can serve as a stepping stone to MLE by providing a first approximation to the solutions of the likelihood equations
* **Maximum a posteriori (MAP)** (most likely class) - finds the mode of the posterior distribution
![map](map.png)
    * Similar to MLE, but assumes a prior $g$ over $\Theta$
    * MLE: $\hat{\theta}_{mle} = argmax_{\theta \in \Theta}$ $logL(\theta\mid x_1,\dots,x_n)$
    * Combine the prior $g$ with the likelihood via Bayes' rule to obtain the posterior
    * Posterior: $f(\theta\mid x)=\frac{f(x\mid \theta)g(\theta)}{\int_{\upsilon \in \Theta}f(x\mid \upsilon)g(\upsilon)d\upsilon}$
    * MAP: $\hat{\theta}_{map} = argmax_{\theta \in \Theta}$ $\frac{f(x\mid \theta)g(\theta)}{\int_{\upsilon \in \Theta}f(x\mid \upsilon)g(\upsilon)d\upsilon} = argmax_{\theta \in \Theta} f(x\mid \theta)g(\theta)$ (the denominator does not depend on $\theta$, so it can be dropped)
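
To make the MLE vs MAP contrast concrete, here is a small sketch for Bernoulli data. The Beta($a$, $b$) prior, the counts `k` and `n`, and the grid search are assumptions introduced for illustration; they are not part of the notes above. With this prior the MAP estimate has the known closed form $\frac{k+a-1}{n+a+b-2}$, which the grid search should approximate.

```python
import math

k, n = 7, 10       # made-up data: 7 successes in 10 Bernoulli trials
a, b = 2.0, 2.0    # assumed Beta(a, b) prior g(theta)

def log_lik(theta):
    return k * math.log(theta) + (n - k) * math.log(1 - theta)

def log_prior(theta):
    # Beta log density up to an additive constant (constants do not move the argmax)
    return (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)

grid = [i / 1000 for i in range(1, 1000)]
theta_mle = max(grid, key=log_lik)                              # maximizes f(x | theta)
theta_map = max(grid, key=lambda t: log_lik(t) + log_prior(t))  # maximizes f(x | theta) g(theta)
map_closed = (k + a - 1) / (n + a + b - 2)

print(theta_mle, theta_map, map_closed)
```

The prior pulls the MAP estimate from the MLE toward the prior mode of 0.5; with a flat prior ($a=b=1$) the two coincide.
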
* **Kernel Density Estimation (KDE)** - non-parametric way to estimate PDF of a random variable
![kde](kde.png)
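    * A histogram's shape depends on the choice of bin width and bin edges, which presents a problem
    * Instead of summing rectangles, KDE sums a smooth kernel (commonly Gaussian) centered at each data point
    * The bandwidth plays the role of the bin width: too small a bandwidth overfits (spiky estimate), too large a bandwidth underfits (oversmoothed estimate)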
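
A Gaussian KDE can be sketched in a few lines as $\hat{f}(x)=\frac{1}{nh}\sum_{i=1}^{n}K\big(\frac{x-x_i}{h}\big)$ with $K$ the standard normal density. The data, the three bandwidths `h`, and the helper names below are made up for illustration.

```python
import math

def gaussian_kde(data, h):
    """Return f_hat(x) = 1/(n*h) * sum_i K((x - x_i)/h), K = standard normal density."""
    norm = len(data) * h * math.sqrt(2 * math.pi)
    def f_hat(x):
        return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data) / norm
    return f_hat

def area(f, lo=-20.0, hi=30.0, steps=5000):
    """Midpoint-rule integral of f over [lo, hi]; a valid density integrates to about 1."""
    dx = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * dx) for i in range(steps)) * dx

data = [1.0, 1.2, 1.9, 2.1, 2.3, 4.8, 5.0, 5.3]   # made-up observations

# Small h gives a spiky estimate, large h an oversmoothed one
areas = {}
for h in (0.05, 0.5, 3.0):
    f = gaussian_kde(data, h)
    areas[h] = area(f)
    print(h, round(f(2.0), 4), round(f(3.5), 4), round(areas[h], 4))
```

Comparing the estimated density at a point inside a cluster (`2.0`) with one in the gap between clusters (`3.5`) shows how the bandwidth trades detail for smoothness.
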

3) Sampling
* Statistical Data Discovery Steps
    1. Begin with a question or hypothesis
    2. Design an experiment
    3. Collect sample data
    4. Check the results / Make inference
    5. Repeat? Redesign?
* Obtaining Good Data
    * A sample should be representative of the population
    * If a sample represents the population poorly, any inference drawn from it is unreliable (junk in = junk out)
    * **Random sampling** is often the best way to obtain a representative sample
* Sampling methods
    * **Simple Random sampling (SRS)** - each subject has an equal chance of being in the sample
    ![srs](https://faculty.elgin.edu/dkernler/statistics/ch01/images/srs.gif)
    * **Systematic sampling** - subjects are selected from an ordered sampling frame at a fixed periodic interval $k$, where $k$ is the population size divided by the desired sample size
    ![systemic_samp](https://faculty.elgin.edu/dkernler/statistics/ch01/images/sys-sample1.gif)
    * **Stratified sampling** - the population is divided into groups, called strata, and subjects are drawn via SRS from each stratum independently
    ![stratified_samp](https://faculty.elgin.edu/dkernler/statistics/ch01/images/strata-sample.gif)
    * **Cluster sampling** - used when the population falls into groups (clusters) that are mutually homogeneous yet internally heterogeneous. The population is divided into clusters, $n$ clusters are randomly selected, and the subjects in the selected clusters form the sample
    ![cluster_samp](https://qph.fs.quoracdn.net/main-qimg-d81ad85a22e2f8ada09aeea1d13421f9)
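
The four sampling methods above can be sketched with Python's `random` module. The population of 100 subject IDs, the two strata, the ten clusters, and the proportional allocation in the stratified draw are all hypothetical choices made for illustration.

```python
import random

random.seed(42)

population = list(range(100))                               # hypothetical subject IDs
strata = {"A": population[:60], "B": population[60:]}       # hypothetical strata
clusters = [population[i:i + 10] for i in range(0, 100, 10)]  # hypothetical clusters

# Simple random sampling: every subject has the same chance of selection
srs = random.sample(population, 10)

# Systematic sampling: random start, then every k-th subject (k = population / sample size)
k = len(population) // 10
start = random.randrange(k)
systematic = population[start::k]

# Stratified sampling: SRS within each stratum (10% of each stratum, an assumption)
stratified = [s for group in strata.values()
              for s in random.sample(group, len(group) // 10)]

# Cluster sampling: pick whole clusters at random, keep every subject inside them
chosen = random.sample(clusters, 2)
cluster_sample = [s for c in chosen for s in c]

print(srs, systematic, stratified, cluster_sample, sep="\n")
```
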
* Population Inference
    * We want to answer a question about the population
    * We randomly select subjects as our sample
    * We obtain the sample mean statistic $\bar{x}$
    * We make inference about the population mean $\mu$ to obtain understanding of the population
* **Central Limit Theorem (CLT)**
    ![CLT_diagram](https://cdn-images-1.medium.com/max/1024/1*tJoyMMcdILCO8PQJ6d5RRA.jpeg)
    * Given certain conditions, the **mean** of a sufficiently large number of independent and identically distributed (i.i.d.) random variables will be approximately normally distributed, regardless of the underlying distribution
    ![CLT](https://upload.wikimedia.org/wikipedia/commons/thumb/7/7b/IllustrationCentralTheorem.png/400px-IllustrationCentralTheorem.png)
    * In other words, if we repeatedly **draw enough i.i.d. samples** from the underlying distribution and average each one, the averages should follow an approximately normal distribution
    ![CLT_from_dist](https://i.pinimg.com/originals/e7/e2/a9/e7e2a937489492d074c841b41b0e96e1.png)
    * For large $n$, the sample mean is approximately normally distributed: $\bar{X}$~$Normal(\mu,\frac{\sigma^2}{n})$
    * From any normally distributed random variable, we can derive a standard normal variable: $Z = \frac{\bar{X}-\mu}{\frac{\sigma}{\sqrt{n}}}$
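
The CLT statements above can be checked with a quick simulation. This is a sketch under stated assumptions: the underlying distribution is Exponential(1) (so $\mu=\sigma=1$), and the sample size `n`, the number of repetitions `reps`, and the seed are made up.

```python
import math
import random
import statistics

random.seed(1)

n, reps = 50, 4000
mu, sigma = 1.0, 1.0   # mean and sd of Exponential(rate=1), a strongly skewed distribution

# Each entry is the mean of one i.i.d. sample of size n
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(reps)
]

mean_of_means = statistics.fmean(sample_means)   # should be near mu
sd_of_means = statistics.stdev(sample_means)     # should be near sigma / sqrt(n)

# Standardize: Z = (X_bar - mu) / (sigma / sqrt(n)); about 95% should fall within +/- 1.96
z = [(m - mu) / (sigma / math.sqrt(n)) for m in sample_means]
within = sum(abs(v) < 1.96 for v in z) / reps

print(mean_of_means, sd_of_means, sigma / math.sqrt(n), within)
```
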
* **Confidence Intervals (CI)** - an interval estimate for the **population parameter**
    * e.g. Average height of people aged between 20 and 24 years old in your country.
    * **Confidence level** refers to the **percentage of all possible samples** that can be expected to include the true population parameter. 
        * Confidence level is typically stated at 95%, but any level can be used, e.g. 50%, 90%, 99%
        ![confidence_level](https://qph.fs.quoracdn.net/main-qimg-fb825a5584d571ed78ef869a50793a83.webp)
        * For example, suppose all possible samples were selected from the same population, and a confidence interval were computed for each sample. A 95% confidence level implies that **95% of the confidence intervals would include the true population parameter**.
        ![confidence_level2](https://qph.fs.quoracdn.net/main-qimg-a9bd1376510c8289a0daf15f5bcd376f)
    * A 95% confidence interval for the mean is given by: $(\bar{x}-1.96\frac{\sigma}{\sqrt{n}}, \bar{x}+1.96\frac{\sigma}{\sqrt{n}})$ or $\bar{x}\pm1.96\frac{\sigma}{\sqrt{n}}$ where $\sigma$ is the population standard deviation
    * Since we usually don't know $\sigma$, estimate the CI with the sample standard deviation $s$:
        * $n\geq30$: $\bar{x}\pm1.96\frac{s}{\sqrt{n}}$
        * $n<30$: $\bar{x}\pm t_{(\frac{\alpha}{2},n-1)}\frac{s}{\sqrt{n}}$, where $\alpha = 1 - \text{confidence level}$ and $n-1$ is the degrees of freedom
    * e.g. At a confidence level of 95%, the interval estimate for the population mean, $\mu$, is (88.28, 115.72)
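
The interval formulas above can be sketched as two small helper functions (`z_interval` and `t_interval` are illustrative names). The inputs are hypothetical: $\bar{x}=102$, $s=42$, $n=36$ were chosen only because they give a standard error of 7 and so reproduce an interval like the one quoted above; the notes do not state the underlying data. The standard library has no $t$ quantile function, so the critical value is passed in from a $t$ table.

```python
import math
from statistics import NormalDist

def z_interval(x_bar, s, n, level=0.95):
    """Large-sample interval: x_bar +/- z * s / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # about 1.96 for level = 0.95
    half = z * s / math.sqrt(n)
    return x_bar - half, x_bar + half

def t_interval(x_bar, s, n, t_crit):
    """Small-sample interval: x_bar +/- t_crit * s / sqrt(n), t_crit from a t table."""
    half = t_crit * s / math.sqrt(n)
    return x_bar - half, x_bar + half

# Hypothetical summary statistics (assumptions, see lead-in)
lo, hi = z_interval(102.0, 42.0, 36)
print(round(lo, 2), round(hi, 2))

# n = 10 -> 9 degrees of freedom; 2.262 is the tabulated t value for alpha/2 = 0.025
t_lo, t_hi = t_interval(50.0, 5.0, 10, t_crit=2.262)
print(t_lo, t_hi)
```
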
* **Resampling** - drawing repeated samples from the given data (different methodologies)
    * **Bootstrapping** - estimates sampling distribution of an estimator by random sampling **with replacement**
    ![bootstrap](http://slideplayer.com/8340508/26/images/6/Non-parametric+bootstrap.jpg)
        * Why use Bootstrapping? To assess the accuracy of a sample estimate through estimated standard errors and confidence intervals for the population parameter
        * When to use Bootstrapping?
            * When the theoretical distribution of the statistic is complicated or unknown
            * When the sample size is too small
            * When estimating the variance of a statistic using a small pilot sample for power calculations
        * Real World: $F \rightarrow \textbf{X}=(x_1,x_2,\dots,x_n) \rightarrow \hat{\theta}=s(\textbf{x})$
            * $F$ - unknown probability model
            * $\textbf{X}$ - observed data
            * $\hat{\theta}$ - statistic of interest
        * Bootstrap World: $\textbf{X} \rightarrow \hat{F} \rightarrow \textbf{x}^*=(x_1^*,x_2^*,\dots,x_n^*) \rightarrow \hat{\theta}^*=s(\textbf{x}^*)$
            * $\hat{F}$ - estimated probability model
            * $\textbf{x}^*$ - bootstrap sample
            * $\hat{\theta}^*$ - bootstrap replication
        * Steps for Bootstrapping: (Bootstrap variance estimation)
        * An observed sample: $\hat{\theta} = t(\hat{F}_n^1)$
            1. Draw a sample of size $n$ with replacement from the observed data: $X_1^*,\dots,X_n^*$~$\hat{F}_n$
            2. Compute the estimate of $\theta$ from the bootstrap sample: $\hat{\theta}^* = t(X_1^*,\dots,X_n^*)$
            3. Repeat steps 1 & 2, $K$ times, to get: $\hat{\theta}_1^*,\dots,\hat{\theta}_K^*$
            4. Obtain standard errors, confidence intervals, variance: 
                * Variance: $v_{boot} = \frac{1}{K}\sum_{k=1}^K (\hat{\theta}_k^*-\frac{1}{K}\sum_{\tau=1}^K \hat{\theta}_{\tau}^*)^2$
                * Standard error: $\hat{se}_{boot}=\sqrt{v_{boot}}$
                * Percentile method: $C_n = (\theta_{\frac{\alpha}{2}}^*, \theta_{1-\frac{\alpha}{2}}^*)$
                * Normal interval: $\hat{\theta} \pm z_{\frac{\alpha}{2}}\hat{se}_{boot}$
    * **Jackknifing** - a resampling technique that re-estimates the parameter on each of the $n$ subsamples obtained by omitting the $i$-th observation
        ![jackknife](http://slideplayer.com/8340508/26/images/4/Jack-Knife+Empirical+Data+Jack+knife+sample+3+Jack+knife+sample+1.jpg)
        * Useful for Variance and Bias estimation
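        * Estimate of the population mean with the $i$-th observation omitted: $\bar{x}_{(i)} = \frac{1}{n-1}\sum_{j \neq i} x_j$
        * Mean of sampling distribution is avg of $n$ estimates: $\bar{x}_{(\cdot)} = \frac{1}{n}\sum_{i=1}^{n}\bar{x}_{(i)}$
        * Estimate of variance from distribution: $\widehat{Var}_{jack} = \frac{n-1}{n}\sum_{i=1}^{n}(\bar{x}_{(i)}-\bar{x}_{(\cdot)})^2$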
    * **Cross-validation** - a resampling technique where subsets of the data are held out as validation sets; the model is fit on the remaining data and then used to predict the held-out set
    ![cv](https://d3ansictanv2wj.cloudfront.net/emlm_0302-6a388b903f6e1e04c95e718940eff039.png)
    * **Permutation tests** - a statistical significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under **rearrangements of the labels on the observed data points**
    ![methodologies](https://i.stack.imgur.com/FfXIT.jpg)
        * Is a subset of non-parametric statistics
        * Also called randomization test or re-randomization test or exact test
        * Example: Suppose two groups $A$ and $B$ have sample means $\bar{x}_A$ and $\bar{x}_B$, and we want to test, at the 5% significance level, whether they come from the same distribution
            * The permutation test is designed to determine whether the observed difference between the sample means is large enough to reject the null hypothesis, $H_0$, that the two groups have identical probability distributions
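
The bootstrap, jackknife, and permutation test described above can be sketched with the standard library alone. The two groups `a` and `b` are made-up numbers, and `K`, `alpha`, and the function names are illustrative choices. The permutation test below samples random relabelings (a Monte Carlo approximation) instead of enumerating every rearrangement.

```python
import random
import statistics

random.seed(0)

def mean(xs):
    return sum(xs) / len(xs)

def bootstrap(data, stat, K=2000, alpha=0.05):
    """Bootstrap standard error and percentile interval for stat(data)."""
    n = len(data)
    reps = sorted(stat(random.choices(data, k=n)) for _ in range(K))  # resample with replacement
    se = statistics.pstdev(reps)                    # sqrt(v_boot), dividing by K
    lo = reps[round(alpha / 2 * K)]
    hi = reps[round((1 - alpha / 2) * K) - 1]
    return se, (lo, hi)

def jackknife(data, stat):
    """Leave-one-out estimates, their mean, and the jackknife variance estimate."""
    n = len(data)
    loo = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    loo_mean = sum(loo) / n
    var = (n - 1) / n * sum((t - loo_mean) ** 2 for t in loo)
    return loo_mean, var

def permutation_test(a, b, K=5000):
    """Approximate two-sided p-value for a difference in means via random relabeling."""
    observed = abs(mean(a) - mean(b))
    pooled = a + b
    count = 0
    for _ in range(K):
        random.shuffle(pooled)                      # rearrange the group labels
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        count += diff >= observed - 1e-12
    return (count + 1) / (K + 1)

a = [12.1, 14.3, 13.8, 15.2, 14.9, 13.5]   # made-up group A
b = [10.2, 11.1, 9.8, 10.9, 11.5, 10.4]    # made-up group B

se, (ci_lo, ci_hi) = bootstrap(a, mean)
jk_mean, jk_var = jackknife(a, mean)
p_value = permutation_test(a, b)
print(se, (ci_lo, ci_hi), jk_mean, jk_var, p_value)
```

For the sample mean, the jackknife variance equals $s^2/n$ exactly, which gives a quick sanity check on the implementation. A `p_value` below 0.05 would lead to rejecting $H_0$ at the 5% level.
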