# Sufficient Statistics & The Exponential Family

## Exposition

Sufficient statistics are an intuitive way of reducing the data down to what is necessary for computation. For an unrealistic example, say we have amassed $10^{10}$ data points and know the data follow a normal distribution. Then we do not need to store the $10^{10}$ points; we could instead compute and keep only the sample mean and standard deviation. Sufficiency is thus primarily a tool, akin to filtering and dimension reduction, for reducing our data to what is necessary for a particular distribution. **Notice**, however, that sufficiency requires us to **know the underlying distribution beforehand**. For this reason, sufficiency is primarily of interest to wacky theoretical statisticians and not to scientists in practice, who enjoy non-parametric inference.
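
The normal example can be sketched in code. The snippet below is a minimal illustration, not a recommended implementation: it assumes the data arrive in chunks and keeps only the running triple $(n, \sum x_i, \sum x_i^2)$, from which the sample mean and standard deviation are recovered. The helper names `update` and `mean_and_std` are made up for this sketch.

```python
import math
import random

def update(stats, chunk):
    """Fold a chunk of observations into the running (n, sum, sum of squares)."""
    n, s1, s2 = stats
    for x in chunk:
        n += 1
        s1 += x
        s2 += x * x
    return n, s1, s2

def mean_and_std(stats):
    """Recover the sample mean and the divide-by-n standard deviation."""
    n, s1, s2 = stats
    mean = s1 / n
    var = s2 / n - mean * mean
    return mean, math.sqrt(var)

random.seed(0)
data = [random.gauss(3.0, 2.0) for _ in range(10_000)]

# Process the data in chunks of 1000, never revisiting earlier chunks.
stats = (0, 0.0, 0.0)
for start in range(0, len(data), 1_000):
    stats = update(stats, data[start:start + 1_000])
stream_mean, stream_std = mean_and_std(stats)

# Reference values computed from the full data set, for comparison only.
full_mean = sum(data) / len(data)
full_std = math.sqrt(sum((x - full_mean) ** 2 for x in data) / len(data))

print(stream_mean, full_mean)
print(stream_std, full_std)
```

The sum-of-squares formula is chosen here for brevity; it can lose precision when the mean is large relative to the spread, so a more careful version would likely use a numerically stable update.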


## Definitions

### Sufficient Statistic: 

- Definition 1 (Wasserman - confusing): A statistic $T(x^n)$, where $x^n$ is a vector of $n$ observations, is **sufficient** for a parameter $\theta$ if $P(T(x^n)|\theta)=cP(T(y^n)|\theta)$ implies that $P(x^n|\theta)=dP(y^n|\theta)$ for some constants $c,d\in\mathbb{R}$ that can depend on $x^n$ and $y^n$ but not $\theta$. 


- Definition 2 (Keener): Suppose that a random variable $X:\Omega\rightarrow \mathcal{X}$ has distribution from a family $\mathcal{P}=\{P_\theta:\theta\in\Omega\}$. Then $T=T(X)$ is a **sufficient** statistic for $\mathcal{P}$ (or for $X$ or for $\theta$) if for every $t$ and $\theta$, the conditional distribution of $X$ under $P_\theta$ given $T=t$ does not depend on $\theta$. 


- Definition 3 (Rao): A statistic $T$ (a random variable, i.e. a measurable map from the observations/data to the reals) is said to be **sufficient** for the family of measures $P_\theta$ iff one of the following equivalent conditions holds:
    - $P(A|T=t)$ is independent of $\theta$ for every $A$ measurable under $P$.
    - $\mathbb{E}[Y|T=t]$ is independent of $\theta$ for every random variable $Y$ such that $\mathbb{E}(Y)$ exists.
    - The conditional distribution of every random variable $Y$ given $T=t$, which always exists, is independent of $\theta$.
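
To make the "conditional distribution does not depend on $\theta$" idea concrete, here is a small sketch under the assumption that $X_1,\dots,X_n$ are i.i.d. Bernoulli($\theta$) and $T=\sum X_i$ (an example chosen for illustration, not taken from the references). It enumerates every binary sequence with a given total and computes $P(X=x|T=t)$ for two different values of $\theta$; `conditional_given_sum` is a hypothetical helper name.

```python
from itertools import product
from math import comb

def conditional_given_sum(theta, n, t):
    """P(X = x | sum(X) = t) for i.i.d. Bernoulli(theta), over all x with sum t."""
    joint = {
        x: theta ** sum(x) * (1 - theta) ** (n - sum(x))
        for x in product((0, 1), repeat=n)
        if sum(x) == t
    }
    total = sum(joint.values())  # this is P(T = t)
    return {x: p / total for x, p in joint.items()}

n, t = 5, 2
cond_a = conditional_given_sum(0.2, n, t)
cond_b = conditional_given_sum(0.7, n, t)

# Both conditionals should be uniform over the comb(n, t) arrangements.
for x in cond_a:
    print(x, cond_a[x], cond_b[x])
```

Both conditional distributions come out uniform over the $\binom{n}{t}$ arrangements, whatever $\theta$ was used, which is what Definition 2 asks for in this case.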
    

### Minimally Sufficient Statistic:

- Definition (Wasserman): A statistic $T$ is minimal sufficient if 1) it is sufficient and 2) it is a function of every other sufficient statistic.

### Complete Statistic:

- Definition (Keener): A statistic $T$ is complete for a family $\mathcal{P}=\{P_\theta:\theta\in\Omega\}$ if 

$$\mathbb{E}[f(T)|\theta]=c\;\;\;\forall \theta$$ 

implies $f(T)=c$ almost surely under every $P_\theta$.
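
A quick way to get a feel for the definition is to look at a statistic that fails it. As an illustrative assumption (not an example from Keener), take $X_1,X_2$ i.i.d. $\mathcal{N}(\theta,1)$, $T=(X_1,X_2)$, and $f(T)=X_1-X_2$. Then

$$\mathbb{E}[X_1-X_2|\theta]=\theta-\theta=0\;\;\;\forall \theta,$$

yet $X_1-X_2\sim\mathcal{N}(0,2)$ is not the constant $0$, so this $T$ is not complete.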

### Fisher-Neyman Factorization Theorem:

- Statement (Wasserman): $T$ is sufficient iff there are functions $g(t,\theta)$ and $h(x^n)$ such that $f(x^n|\theta)=g(T(x^n),\theta)h(x^n)$.
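
The factorization can be checked numerically. The sketch below assumes an i.i.d. Poisson($\theta$) sample with $T=\sum x_i$, for which one possible choice is $g(t,\theta)=\theta^t e^{-n\theta}$ and $h(x^n)=1/\prod x_i!$. The sample values and function names are made up for illustration, and `g` takes the fixed sample size `n` as an extra argument.

```python
from math import exp, factorial, prod

def joint_pmf(xs, theta):
    """Joint Poisson(theta) probability of the sample xs."""
    return prod(exp(-theta) * theta ** x / factorial(x) for x in xs)

def g(t, theta, n):
    """Part of the likelihood that sees the data only through t = sum(xs)."""
    return theta ** t * exp(-n * theta)

def h(xs):
    """Part of the likelihood that is free of theta."""
    return 1.0 / prod(factorial(x) for x in xs)

xs = [2, 0, 3, 1]
ys = [1, 1, 1, 3]  # a different sample with the same total, 6
thetas = [0.5, 1.0, 2.5]

# f(x^n | theta) should match g(T(x^n), theta) * h(x^n) for each theta.
factorization_gaps = [
    abs(joint_pmf(xs, th) - g(sum(xs), th, len(xs)) * h(xs)) for th in thetas
]

# Samples sharing the same T have a likelihood ratio that is constant in theta.
ratios = [joint_pmf(xs, th) / joint_pmf(ys, th) for th in thetas]

print(factorization_gaps)
print(ratios)
```

The constant ratio in the last step is $h(x^n)/h(y^n)$, which ties the factorization back to Wasserman's definition of sufficiency above.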

### Rao-Blackwell Theorem:

- Statement (Wasserman): Let $\hat{\theta}$ be an estimator and let $T$ be a sufficient statistic. Define a new estimator by 

$$\tilde{\theta}=\mathbb{E}[\hat{\theta}|T].$$

Then for every $\theta$, we have $R(\theta,\tilde{\theta})\leq R(\theta,\hat{\theta})$ where the risk $R$ is the mean squared error.
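
A small simulation can illustrate the risk reduction. This is a sketch under stated assumptions: $X_1,\dots,X_n$ are i.i.d. Bernoulli($\theta$), the crude estimator is $\hat{\theta}=X_1$, and $T=\sum X_i$. By symmetry $\mathbb{E}[X_1|T]=T/n$, so the Rao-Blackwellized estimator is the sample mean. The function name, seed, and parameter values are arbitrary choices for this example.

```python
import random

def simulate_mse(theta, n, reps, seed=0):
    """Monte Carlo mean squared error of X_1 and of E[X_1 | T] = T / n."""
    rng = random.Random(seed)
    se_crude = 0.0
    se_rb = 0.0
    for _ in range(reps):
        xs = [1 if rng.random() < theta else 0 for _ in range(n)]
        crude = xs[0]      # unbiased, but uses a single observation
        rb = sum(xs) / n   # conditional expectation of xs[0] given the total
        se_crude += (crude - theta) ** 2
        se_rb += (rb - theta) ** 2
    return se_crude / reps, se_rb / reps

mse_crude, mse_rb = simulate_mse(theta=0.3, n=10, reps=20_000)
print(mse_crude, mse_rb)
```

For these settings the exact risks are $\theta(1-\theta)=0.21$ and $\theta(1-\theta)/n=0.021$, so the simulated values should land close to those numbers.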

### Exponential Family: 

- Definition 1 (Wasserman): We say that $\{f(x|\theta):\theta\in\Theta\}$ is a **one-parameter exponential family** if there are functions $\eta (\theta)$, $B(\theta)$, $T(x)$, and $h(x)$ such that 

$$f(x|\theta)=h(x)e^{\eta(\theta)T(x)-B(\theta)}.$$
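
As a quick illustration (a standard example, worked here as a sketch), the Poisson($\theta$) mass function can be rearranged into this form:

$$f(x|\theta)=\frac{\theta^x e^{-\theta}}{x!}=\frac{1}{x!}e^{x\log\theta-\theta},$$

so one may take $\eta(\theta)=\log\theta$, $T(x)=x$, $B(\theta)=\theta$, and $h(x)=1/x!$.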


- Definition 2 (Keener): The family of densities $\{P(x|\eta):\eta\in \Xi\}$ is called an **$s$-parameter exponential family in canonical form** where 

$$\Xi=\{\eta:A(\eta)<\infty\}$$

is called the **natural parameter space**. Keener defines for $\eta\in\mathbb{R}^s$ the function 

$$A(\eta)=\log\int\limits_{\mathbb{R}^n} \exp{\left[\sum\limits_{i=1}^s \eta_i T_i(x)\right]} h(x)d\mu(x)$$

where $h:\mathbb{R}^n\rightarrow \mathbb{R}$ is non-negative and $T_1,\dots,T_s$ are measurable functions from $\mathbb{R}^n$ to $\mathbb{R}$. For finite $A(\eta)$, Keener defines the density by

$$P(x|\eta)=\exp{\left[\sum\limits_{i=1}^s \eta_i T_i(x) - A(\eta)\right]} h(x).$$
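
The construction above can be tried numerically. The sketch below makes several assumptions for illustration: $s=1$, $n=1$, $\mu$ is Lebesgue measure on $\mathbb{R}$, $h$ is the standard normal density, and $T(x)=x$. In that case the closed form is $A(\eta)=\eta^2/2$, which is finite for every $\eta$, so $\Xi=\mathbb{R}$. The integral is approximated by a midpoint rule on the truncated range $[-20,20]$, and `integrate` and `make_density` are helper names invented for this example.

```python
import math

def h(x):
    """Base density: standard normal."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def T(x):
    """Sufficient statistic for this one-parameter family."""
    return x

def integrate(fn, lo=-20.0, hi=20.0, steps=40_000):
    """Midpoint-rule approximation of the integral of fn over [lo, hi]."""
    dx = (hi - lo) / steps
    return sum(fn(lo + (i + 0.5) * dx) for i in range(steps)) * dx

def A(eta):
    """Log of the normalizing integral of exp(eta * T(x)) * h(x)."""
    return math.log(integrate(lambda x: math.exp(eta * T(x)) * h(x)))

def make_density(eta):
    """Return x -> exp(eta * T(x) - A(eta)) * h(x), computing A(eta) once."""
    a = A(eta)
    return lambda x: math.exp(eta * T(x) - a) * h(x)

eta = 1.5
log_normalizer = A(eta)                      # closed form: eta**2 / 2
total_mass = integrate(make_density(eta))    # should be close to 1

print(log_normalizer, eta ** 2 / 2)
print(total_mass)
```

The resulting density is that of $\mathcal{N}(\eta,1)$, so this family is the normal location family written in canonical form.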

## Problems

**1.** Find a sufficient statistic for $\theta$ where

a) $X_1,\dots,X_n\sim \text{Unif}(-\theta,\theta)$

b) $X_1,\dots,X_n\sim \mathcal{N}(\theta,\sigma^2)$ for a known $\sigma>0$

c) $X_1,\dots,X_n\sim \text{Gamma}(\frac{\theta}{2},\frac{1}{2})$

d) $X_1,\dots,X_n\sim \text{Beta}(\theta, \theta)$
    
- answer

**2.** Find an example of a minimally sufficient statistic that is not complete. Can you find an example of the converse?

- answer

**3. (Rao Chapter 2 - Problem 21)**: Consider the probability space $(\Omega, \mathcal{B},P)$. Let $Y$ be a space of points $y$, $\mathcal{A}$ a $\sigma$-algebra of sets of $Y$, and $T:\Omega\rightarrow Y$ a function such that $A\in\mathcal{A}\Rightarrow T^{-1}(A)\in\mathcal{B}$. Let $f(\omega,y)$ be a real-valued $\mathcal{B}\times \mathcal{A}$ measurable function on $\Omega\times Y$. Then show that 

$$\mathbb{E}\left[f(\omega,T(\omega))|T(\omega)=y\right]=\mathbb{E}\left[f(\omega,y)|T(\omega)=y\right].$$

- answer

**4. (Rao Chapter 2 - Problem 22)**: Let $X$ be a vector valued random variable whose distribution depends on a vector parameter $\theta$. Further, let $T$ be a statistic such that the distribution of $T$ only depends on $\phi$, a function of $\theta$. Then $T$ is said to be **inference sufficient** for $\phi$ if the conditional distribution of $X$ given $T$ depends only on functions of $\theta$ which are independent of $\phi$. Let $X_1,\dots, X_n$ be $n$ independent observations from $\mathcal{N}(\mu,\sigma^2)$. Show that $\sum (X_i-\bar{X})^2$ is inference sufficient for the parameter $\sigma^2$.

- answer

**5. (Rao Chapter 5 - Problem 6.1)** Consider a power series distribution with probability

$$P(X=r)=\frac{a_r}{f(\theta)}\theta^r,\;\;\;\;r=c,c+1,\dots,\infty.$$

Let $x_1,\dots,x_n$ be a sample of size $n$ with total $T=x_1+\cdots +x_n$. Show that $T$ is a complete sufficient statistic for $\theta$ and that the distribution of $T$ is of the same form

$$P(T=t)=\frac{b_t\theta^t}{f(\theta)^n},\;\;\;\;t=nc,nc+1,\dots,\infty.$$

- answer

**6. (Keener Chapter 2 - Problem 1)** Consider independent Bernoulli trials with success probability $p$ and let $X$ be the number of failures before the first success. Then $P(X=x)=p(1-p)^x$, for $x=0,1,\dots$, and $X$ has the geometric distribution with parameter $p$.

a) Show that the geometric distributions form an exponential family.

b) Write the densities for the family in canonical form, identifying the canonical parameter $\eta$, and the function $A(\eta)$.

c) Find the mean of the geometric distribution using a differential identity.

d) Suppose that $X_1,\dots,X_n$ are i.i.d. from a geometric distribution. Show that the joint distributions form an exponential family, and find the mean and variance of $T$.

- answer

**7. (Keener Chapter 2 - Problem 2)** Determine the canonical parameter space $\Xi$, and find densities for the one-parameter exponential family with $\mu$ Lebesgue measure on $\mathbb{R}^2$, $h(x,y)=\exp{[-(x^2+y^2)/2]}/(2\pi)$, and $T(x,y)=xy$.

- answer

**8. (Keener Chapter 2 - Problem 3)** Suppose that $X_1,\dots, X_n$ are independent random variables and that for $i=1,\dots,n$, $X_i$ has a Poisson distribution with mean $\lambda_i=\exp{(\alpha + \beta t_i)}$, where $t_1,\dots, t_n$ are observed constants and $\alpha$ and $\beta$ are unknown parameters. Show that the joint distributions for $X_1,\dots,X_n$ form a two-parameter exponential family and identify the statistics $T_1$ and $T_2$.

- answer


# Sources
- "Theoretical Statistics" by Keener - Chapters 2, 3, 5
- "All of Statistics" by Wasserman - Chapter 9
- "Linear Statistical Inference and Its Applications" by Rao - Chapters 2, 5