### Parametric and Nonparametric Models

A **statistical model** is a set of distributions ${F}$.

A **parametric model** is a set ${F}$ that can be parameterized by a finite number of parameters.

For example, if the data come from a Normal distribution, the model is
$$ {F} = \left\{ f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2\sigma^2} (x -\mu)^2\right), \mu \in \mathbb{R}, \sigma >0 \right\} $$

In general, a **parametric model** takes the form
$$ {F} = \{f(x; \theta): \theta \in \Theta\} $$
where $\theta$ is an unknown parameter (or vector of parameters) that takes values in the parameter space $\Theta$.

If $\theta$ is a vector and we are interested in only one component of $\theta$, the remaining parameters are called nuisance parameters.

**Nonparametric Model**
A nonparametric model is a set ${F}$ that cannot be parameterized by a finite number of parameters.

**Parametric Model Notation**
If ${F} = \{f(x; \theta): \theta \in \Theta\}$ is a parametric model, we write:

**Probability:**
$$ P_{\theta}(X \in A) = \int_A f(x; \theta) \, dx $$
**Expectation:**
$$ E_{\theta}(X) = \int x \cdot f(x; \theta) \, dx $$
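
The two integrals above can be checked numerically. The following is a minimal sketch, assuming the Normal model with the illustrative values $\mu = 1$, $\sigma = 2$ and a simple midpoint rule; the helper names `normal_pdf` and `integrate` are hypothetical and not part of the notes.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density f(x; mu, sigma) of the Normal model."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(h, lo, hi, steps=20_000):
    """Midpoint-rule approximation of the integral of h over [lo, hi]."""
    width = (hi - lo) / steps
    return width * sum(h(lo + (i + 0.5) * width) for i in range(steps))

# Illustrative parameter values (an assumption, not taken from the notes).
mu, sigma = 1.0, 2.0

# P_theta(X in A) for the set A = [mu - sigma, mu + sigma].
prob = integrate(lambda x: normal_pdf(x, mu, sigma), mu - sigma, mu + sigma)

# E_theta(X), with the real line truncated to mu +/- 10 sigma.
mean = integrate(lambda x: x * normal_pdf(x, mu, sigma),
                 mu - 10 * sigma, mu + 10 * sigma)

print(f"P(|X - mu| <= sigma) ~ {prob:.4f}")
print(f"E(X) ~ {mean:.4f}")
```

Under these assumptions `prob` should be close to 0.6827 (the familiar one-sigma probability) and `mean` should be close to $\mu = 1$.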

### Concepts in Inference

Let $X_1, ..., X_n$ be $n$ iid data points from some distribution $F$.

**Point Estimator**
A point estimator $\hat{\theta}_n$ of a parameter $\theta$ is some function
$$ \hat{\theta}_n = g(X_1, ..., X_n) $$

**Bias**
We define the bias of $\hat{\theta}_n$ as:
$$ \text{bias}(\hat{\theta}_n) = E_{\theta}(\hat{\theta}_n) - \theta $$
We say that $\hat{\theta}_n$ is unbiased if $E_{\theta}(\hat{\theta}_n) = \theta$.

**Consistent Estimator**
A point estimator $\hat{\theta}_n$ of a parameter $\theta$ is consistent if $\hat{\theta}_n \xrightarrow{P} \theta$.

**Sampling Distribution**
The distribution of $\hat{\theta}_n$ is called the sampling distribution.

**Standard Error**
The standard deviation of $\hat{\theta}_n$ is called the standard error, denoted by se.
$$ \text{se} = \text{se}(\hat{\theta}_n) = \sqrt{V(\hat{\theta}_n)} $$

Often the standard error depends on the unknown distribution and cannot be computed exactly, but it can usually be estimated. The estimated standard error is denoted by $\hat{se}$.

**Mean Squared Error (MSE)** measures the quality of a point estimator.

- Definition: $\text{MSE} = E_{\theta}\left[(\hat{\theta}_n - \theta)^2\right]$
- Bias-variance decomposition: $\text{MSE} = \text{bias}(\hat{\theta}_n)^2 + V_{\theta}(\hat{\theta}_n)$
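
Bias, standard error, and MSE can be approximated by simulating the sampling distribution. The sketch below is only an illustration: it assumes $N(0, 1)$ data with $n = 10$ and uses the variance estimator with divisor $n$ (a hypothetical choice of estimator, picked because it is known to be biased).

```python
import random
import statistics

def variance_divisor_n(xs):
    """Plug-in variance estimate with divisor n (not n - 1)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

rng = random.Random(0)                 # fixed seed for reproducibility
n, reps, true_var = 10, 20_000, 1.0    # illustrative settings

# Draw many samples and record the estimate from each one:
# this approximates the sampling distribution of the estimator.
estimates = [
    variance_divisor_n([rng.gauss(0.0, 1.0) for _ in range(n)])
    for _ in range(reps)
]

bias = statistics.fmean(estimates) - true_var
var = statistics.pvariance(estimates)
se = var ** 0.5
mse = statistics.fmean([(t - true_var) ** 2 for t in estimates])

print(f"bias ~ {bias:.3f}, se ~ {se:.3f}, MSE ~ {mse:.3f}")
print(f"bias^2 + variance ~ {bias ** 2 + var:.3f}")
```

For this estimator the theoretical bias is $-\sigma^2/n = -0.1$, so the simulated bias should land near that value, and the last two printed quantities should agree up to rounding.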


### Confidence Sets

**Confidence Interval**

A $(1 - \alpha)$ confidence interval for a parameter $\theta$ is an interval $C_n = (a, b)$, where $a = a(X_1, \ldots, X_n)$ and $b = b(X_1, \ldots, X_n)$ are functions of the data such that
$$ P_{\theta}(\theta \in C_n) \geq 1 - \alpha, \quad \forall \theta \in \Theta $$

The value $(1 - \alpha)$ is the coverage of the confidence interval.

Note that $C_n$ is random and $\theta$ is fixed: the probability statement is about the interval, not about the parameter.

**Confidence Set**
If $\theta$ is a vector, then we use a confidence set instead of an interval.


**Theorem: Normal-based Confidence Interval**

Suppose that $\hat{\theta}_n \approx N(\theta, \hat{se}^2)$.
Let $\Phi$ be the CDF of a standard Normal distribution, and let $z_{\alpha/2} = \Phi^{-1}(1 - \alpha/2)$, so that
$$ P(Z > z_{\alpha/2}) = \frac{\alpha}{2} \quad \text{and} \quad P(-z_{\alpha/2} < Z < z_{\alpha/2}) = 1 - \alpha $$
where $Z \sim N(0, 1)$.

Let $C_n = (\hat{\theta}_n - z_{\alpha/2} \cdot \hat{se}, \ \hat{\theta}_n + z_{\alpha/2} \cdot \hat{se})$.

Then $P_{\theta}(\theta \in C_n) \rightarrow 1 - \alpha$. For a 95% interval, $\alpha = 0.05$ and $z_{\alpha/2} \approx 1.96$.
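
The coverage claim can be explored by simulation. The sketch below is one possible illustration, assuming Bernoulli($p$) data with $p = 0.3$, the sample proportion as $\hat{\theta}_n$, and $\hat{se} = \sqrt{\hat{p}(1 - \hat{p})/n}$; the function name `normal_interval` is hypothetical.

```python
import random
from statistics import NormalDist

def normal_interval(theta_hat, se_hat, alpha=0.05):
    """Normal-based interval theta_hat +/- z_{alpha/2} * se_hat."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return theta_hat - z * se_hat, theta_hat + z * se_hat

rng = random.Random(1)          # fixed seed for reproducibility
p, n, reps = 0.3, 200, 5_000    # illustrative settings

hits = 0
for _ in range(reps):
    # Sample proportion from n Bernoulli(p) draws.
    p_hat = sum(rng.random() < p for _ in range(n)) / n
    se_hat = (p_hat * (1.0 - p_hat) / n) ** 0.5
    lo, hi = normal_interval(p_hat, se_hat)
    hits += lo < p < hi         # does the random interval trap the fixed p?

coverage = hits / reps
print(f"empirical coverage ~ {coverage:.3f}")
```

The interval changes from sample to sample while $p$ stays fixed, so `coverage` estimates $P_{\theta}(\theta \in C_n)$; with these settings it should be in the neighborhood of 0.95.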


### Hypothesis Testing

### Parametric Inference

### Parametric Models and Inference

**Parametric models** are written in the form $F = \{f(x; \theta): \theta \in \Theta\}$.

Here, $\Theta \subset \mathbb{R}^k$ is the parameter space, and $\theta = (\theta_1, \ldots, \theta_k)$ is the parameter.

**The Problem of Inference** reduces to the problem of estimating the parameter $\theta$.

**Parameter of Interest and Nuisance Parameter**
We are often interested in a function $T(\theta)$.

For example, if $X \sim N(\mu, \sigma^2)$, the parameter is $\theta = (\mu, \sigma)$.

If the goal is to estimate $\mu = T(\theta)$, then $\mu$ is called the parameter of interest, and $\sigma$ is called the nuisance parameter.

### Method of Moments Estimator

Suppose that the parameter $\theta = (\theta_1, \dots, \theta_k)$ has $k$ components.
For $1 \leq j \leq k$, define the $j$-th moment
$$ \alpha_j \equiv \alpha_j(\theta) \equiv E_{\theta}(X^j) = \int x^j \, dF_{\theta}(x) $$

and the $j$-th sample moment
$$ \hat{\alpha}_j = \frac{1}{n}\sum_{i=1}^{n} X_i^j $$

**The method of moments estimator** $\hat{\theta}_n$ is defined to be the value of $\theta$
such that
\begin{align*}
\alpha_1(\hat{\theta}_n) &= \hat{\alpha}_1 \\
\alpha_2(\hat{\theta}_n) &= \hat{\alpha}_2 \\
& \dots \\
\alpha_k(\hat{\theta}_n) &= \hat{\alpha}_k
\end{align*}
This defines a system of $k$ equations with $k$ unknowns.
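
As a worked illustration, consider the Normal model with $\theta = (\mu, \sigma)$, so $k = 2$. Here $\alpha_1 = \mu$ and $\alpha_2 = \mu^2 + \sigma^2$, and solving the two equations gives $\hat{\mu} = \hat{\alpha}_1$ and $\hat{\sigma}^2 = \hat{\alpha}_2 - \hat{\alpha}_1^2$. The sketch below assumes simulated data with $\mu = 3$, $\sigma = 2$; the function names are hypothetical.

```python
import random

def sample_moment(xs, j):
    """j-th sample moment: (1/n) * sum of x_i^j."""
    return sum(x ** j for x in xs) / len(xs)

def normal_method_of_moments(xs):
    """Solve alpha_1 = mu and alpha_2 = mu^2 + sigma^2 for (mu, sigma)."""
    a1 = sample_moment(xs, 1)
    a2 = sample_moment(xs, 2)
    sigma2 = max(a2 - a1 ** 2, 0.0)   # guard against tiny negative rounding
    return a1, sigma2 ** 0.5

rng = random.Random(2)                # fixed seed for reproducibility
data = [rng.gauss(3.0, 2.0) for _ in range(50_000)]

mu_hat, sigma_hat = normal_method_of_moments(data)
print(f"mu_hat ~ {mu_hat:.3f}, sigma_hat ~ {sigma_hat:.3f}")
```

With a sample this large, both estimates should be close to the values used to generate the data.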

**Theorem**
Let $\hat{\theta}_n$ denote the method of moments estimator. Under appropriate conditions on the model:
- The estimator $\hat{\theta}_n$ exists with probability tending to 1.
- The estimator is consistent: $\hat{\theta}_n \xrightarrow{P} \theta$ (convergence in probability).
- The estimator is asymptotically Normal:
$$ \sqrt{n}(\hat{\theta}_n - \theta) \xrightarrow{d} N(0, \Sigma) $$

where $\Sigma = g E_{\theta}(Y Y^T) g^T$, $Y = (X, X^2, \ldots, X^k)^T$, $g = (g_1, \ldots, g_k)^T$, and $g_j = \partial \alpha_j^{-1}(\theta) / \partial \theta$.

### Maximum Likelihood

Let $X_1, \ldots, X_n$ be i.i.d. random variables with probability density function (pdf) $f(x; \theta)$.

The **Likelihood function** is defined as:
$$ L_n(\theta) = \prod_{i=1}^{n} f(X_i; \theta) $$

The **Log-likelihood function** is defined as:
$$ l_n(\theta) = \log L_n(\theta) = \sum_{i=1}^{n} \log f(X_i; \theta) $$

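The likelihood function $L_n$ is the joint pdf of the data, but treated as a function of the parameter $\theta$.

The likelihood function is not a pdf in $\theta$: in general, $L_n(\theta)$ does not integrate to 1 with respect to $\theta$.

**Maximum Likelihood Estimator (MLE)**
The maximum likelihood estimator (MLE), denoted by $\hat{\theta}_n$, is the value of $\theta$ that maximizes $L_n(\theta)$. Because the logarithm is increasing, the same value maximizes the log-likelihood $l_n(\theta)$, which is often easier to work with.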
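
To make the definition concrete, the sketch below maximizes a log-likelihood numerically and compares the result with the known closed form. It assumes a Bernoulli($p$) model, for which the MLE is the sample proportion, and uses a golden-section search as one simple choice of optimizer; the function names and the simulated data are illustrative assumptions.

```python
import math
import random

def log_likelihood(p, xs):
    """Bernoulli log-likelihood: sum of log f(x_i; p)."""
    s = sum(xs)
    return s * math.log(p) + (len(xs) - s) * math.log(1.0 - p)

def golden_section_max(h, lo, hi, tol=1e-7):
    """Maximize a unimodal function h on [lo, hi] by golden-section search."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if h(c) > h(d):
            b = d        # the maximizer lies in [a, d]
        else:
            a = c        # the maximizer lies in [c, b]
    return (a + b) / 2.0

rng = random.Random(3)   # fixed seed for reproducibility
data = [1 if rng.random() < 0.3 else 0 for _ in range(500)]

p_numeric = golden_section_max(lambda p: log_likelihood(p, data), 1e-6, 1 - 1e-6)
p_closed = sum(data) / len(data)   # closed-form Bernoulli MLE

print(f"numeric MLE ~ {p_numeric:.5f}, closed form = {p_closed:.5f}")
```

The two values should agree to several decimal places. In models without a closed-form MLE, only the numerical route is available, and a more robust optimizer than this sketch would usually be preferred.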

### Properties of MLE

Under certain conditions on the model, the MLE $\hat{\theta}_n$ is a desirable estimator because of the following properties.

**Consistency**:

The MLE is consistent, meaning that it converges in probability to the true value of the parameter:
$$ \hat{\theta}_n \xrightarrow{P} \theta $$
where $\theta$ is the true value of the parameter.

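**Kullback-Leibler Distance**:

The Kullback-Leibler distance between two pdfs $f$ and $g$ is given by:
$$ D(f, g) = \int f(x) \log\left(\frac{f(x)}{g(x)}\right) dx $$
It is related to the MLE as follows. Write $\theta_*$ for the true value of the parameter and $D(\theta_*, \theta)$ for $D(f(\cdot; \theta_*), f(\cdot; \theta))$. Maximizing $l_n(\theta)$ is equivalent to maximizing
$$ M_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log \frac{f(X_i; \theta)}{f(X_i; \theta_*)} $$
and, by the law of large numbers, $M_n(\theta)$ converges to $-D(\theta_*, \theta)$, which is maximized at $\theta = \theta_*$.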
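
The integral defining $D(f, g)$ can be evaluated numerically. The sketch below assumes two Normal densities, $f = N(0, 1)$ and $g = N(1, 2^2)$, and compares a midpoint-rule approximation with the standard closed form for two Normals, $\log(\sigma_g/\sigma_f) + \frac{\sigma_f^2 + (\mu_f - \mu_g)^2}{2\sigma_g^2} - \frac{1}{2}$. The function names and parameter values are illustrative assumptions.

```python
import math

def normal_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))

def kl_numeric(mu_f, sigma_f, mu_g, sigma_g, steps=40_000):
    """Midpoint-rule approximation of D(f, g) = integral of f * log(f / g)."""
    lo, hi = mu_f - 12 * sigma_f, mu_f + 12 * sigma_f   # f is negligible outside
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * width
        log_f = normal_logpdf(x, mu_f, sigma_f)
        log_g = normal_logpdf(x, mu_g, sigma_g)
        total += math.exp(log_f) * (log_f - log_g)
    return total * width

def kl_closed_form(mu_f, sigma_f, mu_g, sigma_g):
    """Closed-form Kullback-Leibler distance between two Normal densities."""
    return (math.log(sigma_g / sigma_f)
            + (sigma_f ** 2 + (mu_f - mu_g) ** 2) / (2.0 * sigma_g ** 2)
            - 0.5)

d_fg = kl_numeric(0.0, 1.0, 1.0, 2.0)
d_gf = kl_numeric(1.0, 2.0, 0.0, 1.0)
print(f"D(f, g) ~ {d_fg:.4f}  (closed form {kl_closed_form(0.0, 1.0, 1.0, 2.0):.4f})")
print(f"D(g, f) ~ {d_gf:.4f}  (closed form {kl_closed_form(1.0, 2.0, 0.0, 1.0):.4f})")
```

For these values $D(f, g)$ is about 0.443 while $D(g, f)$ is about 1.307, which illustrates that the Kullback-Leibler distance is not symmetric and so is not a distance in the metric sense.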

**Equivariance**:

The MLE is equivariant: if $\hat{\theta}_n$ is the MLE of $\theta$, and $\tau = g(\theta)$ is a one-to-one transformation of $\theta$, then the MLE of $\tau$ is $\hat{\tau}_n = g(\hat{\theta}_n)$.
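
This property can be checked numerically by maximizing the same likelihood under two parameterizations. The sketch below assumes a Bernoulli model with hypothetical data (37 successes in 100 trials) and the one-to-one transformation $\tau = g(p) = \log(p / (1 - p))$; the ternary-search helper is just one simple way to maximize a unimodal function.

```python
import math

def argmax_unimodal(h, lo, hi, iters=200):
    """Ternary search for the maximizer of a unimodal function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if h(m1) < h(m2):
            lo = m1      # the maximizer lies to the right of m1
        else:
            hi = m2      # the maximizer lies to the left of m2
    return (lo + hi) / 2.0

successes, n = 37, 100   # hypothetical data

def loglik_p(p):
    """Bernoulli log-likelihood as a function of p."""
    return successes * math.log(p) + (n - successes) * math.log(1.0 - p)

def g(p):
    """Log-odds transformation tau = g(p)."""
    return math.log(p / (1.0 - p))

def loglik_tau(tau):
    """The same log-likelihood, re-expressed in terms of tau."""
    p = 1.0 / (1.0 + math.exp(-tau))   # inverse of g
    return loglik_p(p)

p_hat = argmax_unimodal(loglik_p, 1e-6, 1.0 - 1e-6)
tau_hat = argmax_unimodal(loglik_tau, -10.0, 10.0)

print(f"g(p_hat) ~ {g(p_hat):.5f}, tau_hat ~ {tau_hat:.5f}")
```

Maximizing directly over $\tau$ and transforming the maximizer over $p$ should give the same value up to numerical tolerance.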

**Asymptotic Normality**:

The MLE is asymptotically Normal:
$$ (\hat{\theta}_n - \theta) / \hat{se} \xrightarrow{d} N(0, 1) $$

where $\hat{se}$ is the estimated standard error of $\hat{\theta}_n$.

**Score Function and Fisher Information**:

The score function is defined as:
$$ s(X; \theta) = \frac{\partial}{\partial\theta} \log f(X; \theta) $$

The Fisher information for a sample of size $n$ is:
$$ I_n(\theta) = V_{\theta}\left(\sum_{i=1}^{n} s(X_i; \theta)\right) $$
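
The variance in this definition can be approximated by simulation. The sketch below assumes a Bernoulli($p$) model, whose score is $s(x; p) = x/p - (1 - x)/(1 - p)$ and whose Fisher information for $n$ i.i.d. observations is $n / (p(1 - p))$; the settings $p = 0.3$, $n = 50$ are illustrative.

```python
import random
import statistics

def score(x, p):
    """Score of one Bernoulli observation: d/dp log f(x; p)."""
    return x / p - (1 - x) / (1.0 - p)

rng = random.Random(4)           # fixed seed for reproducibility
p, n, reps = 0.3, 50, 20_000     # illustrative settings

# For each simulated sample, record the sum of the n scores.
totals = []
for _ in range(reps):
    xs = [1 if rng.random() < p else 0 for _ in range(n)]
    totals.append(sum(score(x, p) for x in xs))

fisher_sim = statistics.pvariance(totals)   # variance of the summed score
fisher_exact = n / (p * (1.0 - p))          # known Bernoulli result
mean_score = statistics.fmean(totals)       # the score has mean zero

print(f"simulated I_n ~ {fisher_sim:.1f}, exact I_n = {fisher_exact:.1f}")
print(f"mean of summed score ~ {mean_score:.3f}")
```

The simulated variance should be within a few percent of the exact value, and the average of the summed score should be near zero, since the score has expectation zero at the true parameter.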

**Asymptotic Optimality (Efficiency)**:
Among well-behaved estimators, the MLE is asymptotically optimal (efficient): roughly, it has the smallest variance, at least for large samples. Under certain regularity conditions, let $se = \sqrt{1/I_n(\theta)}$. Then:
$$ (\hat{\theta}_n - \theta) / se \xrightarrow{d} N(0, 1) $$

Similarly, using the estimated Fisher information $I_n(\hat{\theta}_n)$, let $\hat{se} = \sqrt{1/I_n(\hat{\theta}_n)}$. Then:
$$ (\hat{\theta}_n - \theta) / \hat{se} \xrightarrow{d} N(0, 1) $$
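
A simulation can illustrate this limit. The sketch below assumes an Exponential model with rate $\lambda$, $f(x; \lambda) = \lambda e^{-\lambda x}$, for which the MLE is $\hat{\lambda}_n = 1/\bar{X}_n$ and $I_n(\lambda) = n/\lambda^2$. It checks how often the standardized MLE falls within $\pm z_{0.025}$; the model and all settings are illustrative assumptions.

```python
import random
from statistics import NormalDist

rng = random.Random(5)                         # fixed seed for reproducibility
true_rate, n, reps, alpha = 2.0, 200, 4_000, 0.05
z = NormalDist().inv_cdf(1.0 - alpha / 2.0)    # about 1.96

inside = 0
for _ in range(reps):
    xs = [rng.expovariate(true_rate) for _ in range(n)]
    rate_hat = n / sum(xs)                     # MLE: 1 / sample mean
    fisher_hat = n / rate_hat ** 2             # I_n evaluated at the MLE
    se_hat = (1.0 / fisher_hat) ** 0.5         # estimated standard error
    standardized = (rate_hat - true_rate) / se_hat
    inside += abs(standardized) < z

fraction_inside = inside / reps
print(f"fraction within +/- {z:.2f}: {fraction_inside:.3f}")
```

If the Normal approximation is adequate at this sample size, `fraction_inside` should be near $1 - \alpha = 0.95$, which is the same statement as saying that $\hat{\lambda}_n \pm z_{\alpha/2} \, \hat{se}$ has approximately 95% coverage.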

**Approximation to Bayes Estimator**:

Under certain conditions, the MLE is approximately equal to the Bayes estimator.
