# Sufficient Statistics & The Exponential Family

## Exposition

Sufficient statistics are an intuitive way of reducing the data down to what is necessary for computation. For an unrealistic example, say we have amassed $10^{10}$ data points and know the data follows a normal distribution; then we do not need to store all $10^{10}$ points, since the sample mean and sample standard deviation (together with $n$) carry everything the data can say about the parameters. Thus sufficiency is primarily a tool akin to filtering and dimension reduction for reducing our data to what is necessary for a particular distribution. **Notice**, however, that sufficiency requires us to **know the underlying distribution beforehand**. For this reason, sufficiency is primarily of interest to wacky theoretical statisticians and not to scientists in practice who enjoy non-parametric inference.
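As a rough illustration of the storage claim above, here is a small sketch in plain Python. The data are simulated and the helper names (`normal_loglik_full`, `normal_loglik_from_summary`) are made up for this note. It evaluates the normal log-likelihood once from every observation and once from only $n$, $\sum x_i$, and $\sum x_i^2$.

```python
import math
import random

def normal_loglik_full(xs, mu, sigma):
    """Log-likelihood of i.i.d. N(mu, sigma^2) data, touching every observation."""
    n = len(xs)
    sq = sum((x - mu) ** 2 for x in xs)
    return -n * math.log(sigma * math.sqrt(2 * math.pi)) - sq / (2 * sigma ** 2)

def normal_loglik_from_summary(n, s1, s2, mu, sigma):
    """Same log-likelihood, using only n, s1 = sum(x), and s2 = sum(x^2)."""
    sq = s2 - 2 * mu * s1 + n * mu ** 2  # expands sum((x - mu)^2)
    return -n * math.log(sigma * math.sqrt(2 * math.pi)) - sq / (2 * sigma ** 2)

# Simulated data (an assumption of this sketch, not real measurements).
rng = random.Random(0)
xs = [rng.gauss(1.5, 2.0) for _ in range(1000)]

# The three numbers we would keep instead of the raw data.
n, s1, s2 = len(xs), sum(xs), sum(x * x for x in xs)

for mu, sigma in [(0.0, 1.0), (1.5, 2.0), (-3.0, 0.5)]:
    full = normal_loglik_full(xs, mu, sigma)
    summary = normal_loglik_from_summary(n, s1, s2, mu, sigma)
    print(mu, sigma, full, summary)
```

The two columns agree up to floating-point rounding for any $(\mu,\sigma)$ we try, which is the practical content of sufficiency here.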


## Key Definitions & Theorems

### Sufficient Statistic: 

- Definition 1 (Wasserman - confusing): A statistic $T(x^n)$, where $x^n$ is a vector of $n$ observations, is **sufficient** for a parameter $\theta$ if $P(T(x^n)|\theta)=cP(T(y^n)|\theta)$ implies that $P(x^n|\theta)=dP(y^n|\theta)$ for some constants $c,d\in\mathbb{R}$ that can depend on $x^n$ and $y^n$ but not $\theta$. 


- Definition 2 (Keener): Suppose that a random variable $X:\Omega\rightarrow \mathcal{X}$ has distribution from a family $\mathcal{P}=\{P_\theta:\theta\in\Omega\}$. Then $T=T(X)$ is a **sufficient** statistic for $\mathcal{P}$ (or for $X$ or for $\theta$) if for every $t$ and $\theta$, the conditional distribution of $X$ under $P_\theta$ given $T=t$ does not depend on $\theta$. 


- Definition 3 (Rao): A statistic $T$ (a random variable mapping observations/data to some $\sigma$-algebra of the reals) is said to be **sufficient** for the family of measures $P_\theta$ iff one of the following equivalent conditions holds:
    - $P(A|T=t)$ is independent of $\theta$ for every $A$ measurable under $P$.
    - $\mathbb{E}[Y|T=t]$ is independent of $\theta$ for every random variable $Y$ such that $\mathbb{E}(Y)$ exists.
    - The conditional distribution of every random variable $Y$ given $T=t$, which always exists, is independent of $\theta$.
    

### Minimally Sufficient Statistic:

- Definition (Wasserman): A statistic $T$ is minimal sufficient if 1) it is sufficient and 2) it is a function of every other sufficient statistic.

### Complete Statistic:

- Definition (Keener): A statistic $T$ is complete for a family $\mathcal{P}=\{P_\theta:\theta\in\Omega\}$ if 

$$\mathbb{E}[f(T)|\theta]=c\;\;\;\forall \theta$$ 

implies $f(T)=c$ almost surely under every $P_\theta$.

### Fisher-Neyman Factorization Theorem:

- Statement (Wasserman): $T$ is sufficient iff there are functions $g(t,\theta)$ and $h(x)$ such that $f(x^n|\theta)=g(t(x^n),\theta)h(x^n)$.
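To make the factorization concrete, here is a minimal sketch for i.i.d. Bernoulli($p$) data, where $f(x^n|p)=p^{t}(1-p)^{n-t}$ with $t=\sum x_i$ and $h(x^n)=1$. The Bernoulli model and the function names are my own choice of illustration, not something taken from Wasserman.

```python
import math

def bernoulli_likelihood(xs, p):
    """Joint pmf of i.i.d. Bernoulli(p) observations xs (a sequence of 0s and 1s)."""
    out = 1.0
    for x in xs:
        out *= p if x == 1 else (1 - p)
    return out

def g(t, n, p):
    """Factor that sees the data only through t = sum(xs); n is fixed and h(x^n) = 1."""
    return p ** t * (1 - p) ** (n - t)

xs = (1, 0, 0, 1, 1, 0)
ys = (0, 1, 1, 0, 0, 1)  # a different sample with the same total
grid = [0.1, 0.25, 0.5, 0.9]

for p in grid:
    # The likelihood equals g(t, p) * 1, and two samples with equal t share it.
    print(p, bernoulli_likelihood(xs, p), g(sum(xs), len(xs), p), bernoulli_likelihood(ys, p))
```

Since the likelihood only ever sees $t$, two samples with the same total are indistinguishable as far as $p$ is concerned.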

### Mark's Magical Minimal Theorem:

- Statement: Suppose we have a sample $x^n=\{x_1,\dots,x_n\}$ and density function $f(x|\theta)$. If there exists a function $T$ such that, for any two samples $x^n=\{x_1,\dots,x_n\}$ and $y^n=\{y_1,\dots,y_n\}$ from the distribution and a constant $\alpha\in\mathbb{R}\setminus\{0\}$, the ratio

$$\frac{f(x^n|\theta)}{\alpha f(y^n|\theta)}$$

does not depend on $\theta$ iff $T(x^n)=T(y^n)$, then $T(x^n)$ is minimally sufficient for $\theta$. 

### Rao-Blackwell Theorem:

- Statement (Wasserman): Let $\hat{\theta}$ be an estimator and let $T$ be a sufficient statistic. Define a new estimator by 

$$\tilde{\theta}=\mathbb{E}[\hat{\theta}|T].$$

Then for every $\theta$, we have $R(\theta,\tilde{\theta})\leq R(\theta,\hat{\theta})$ where the risk $R$ is the mean squared error.
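A small exact check of the theorem, under assumptions of my own choosing: i.i.d. Bernoulli($p$) data with $n=5$, the crude estimator $\hat{\theta}=X_1$, and $T=\sum X_i$. The sketch enumerates all $2^n$ outcomes, so nothing is simulated. The helper names are hypothetical.

```python
from itertools import product

def bernoulli_prob(xs, p):
    """Probability of the 0/1 sequence xs under i.i.d. Bernoulli(p)."""
    t = sum(xs)
    return p ** t * (1 - p) ** (len(xs) - t)

def rao_blackwellize(estimator, n, p):
    """Return {t: E[estimator(X) | sum(X) = t]} by brute-force enumeration of {0,1}^n."""
    num, den = {}, {}
    for xs in product((0, 1), repeat=n):
        t, w = sum(xs), bernoulli_prob(xs, p)
        num[t] = num.get(t, 0.0) + w * estimator(xs)
        den[t] = den.get(t, 0.0) + w
    return {t: num[t] / den[t] for t in den}

def mse(estimator, n, p):
    """Exact mean squared error of estimator for the parameter p."""
    return sum(bernoulli_prob(xs, p) * (estimator(xs) - p) ** 2
               for xs in product((0, 1), repeat=n))

def crude(xs):
    """Unbiased but wasteful: look at the first observation only."""
    return xs[0]

n, p = 5, 0.3
cond = rao_blackwellize(crude, n, p)  # works out to t / n

def improved(xs):
    """The Rao-Blackwellized estimator E[X_1 | T]."""
    return cond[sum(xs)]

print(cond)
print(mse(crude, n, p), mse(improved, n, p))
```

In this setup the conditional expectation comes out as $t/n$, the sample mean, and the risk drops from $p(1-p)$ to $p(1-p)/n$. Rerunning with a different `p` gives the same `cond`, which is sufficiency showing up in the computation.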

### Exponential Family: 

- Definition 1 (Wasserman): We say that $\{f(x|\theta):\theta\in\Theta\}$ is a **one-parameter exponential family** if there are functions $\eta (\theta)$, $B(\theta)$, $T(x)$, and $h(x)$ such that 

$$f(x|\theta)=h(x)e^{\eta(\theta)T(x)-B(\theta)}.$$


- Definition 2 (Keener): The family of densities $\{P(x|\eta):\eta\in \Xi\}$ is called an **$s$-parameter exponential family in canonical form** where 

$$\Xi=\{\eta:A(\eta)<\infty\}$$

is called the **natural parameter space**. Keener defines for $\eta\in\mathbb{R}^s$ the function 

$$A(\eta)=\log\int\limits_{\mathbb{R}^n} \exp{\left[\sum\limits_{i=1}^s \eta_i T_i(x)\right]} h(x)d\mu(x)$$

where $h:\mathbb{R}^n\rightarrow \mathbb{R}$ is non-negative and $T_1,\dots,T_s$ are measurable functions from $\mathbb{R}^n$ to $\mathbb{R}$. For finite $A(\eta)$, Keener defines the density by

$$P(x|\eta)=\exp{\left[\sum\limits_{i=1}^s \eta_i T_i(x) - A(\eta)\right]} h(x).$$
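As a sanity check on Definition 1, the Poisson($\lambda$) pmf can be written in that form with $h(x)=1/x!$, $\eta(\lambda)=\log\lambda$, $T(x)=x$, and $B(\lambda)=\lambda$. The Poisson family is my own choice of example here. The sketch below just confirms numerically that the two expressions agree.

```python
import math

def poisson_pmf(x, lam):
    """Textbook Poisson pmf: exp(-lam) * lam^x / x!."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

def exp_family_form(x, lam):
    """Same pmf written as h(x) * exp(eta(lam) * T(x) - B(lam))."""
    h = 1.0 / math.factorial(x)
    eta = math.log(lam)  # natural parameter
    T = x                # sufficient statistic for a single observation
    B = lam              # normalizing term
    return h * math.exp(eta * T - B)

for lam in (0.5, 2.0, 7.3):
    for x in range(5):
        print(lam, x, poisson_pmf(x, lam), exp_family_form(x, lam))
```

In canonical form the same family has $\eta=\log\lambda$ and $A(\eta)=e^{\eta}$.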

### Full Rank Exponential Family:

- Definition 1 (Keener): An exponential family with densities $p_\theta(x)=h(x)\exp\{\eta(\theta)\cdot T(x) - B(\theta)\}$, $\theta\in\Omega$, is said to be of **full rank** if the interior of $\eta(\Omega)$ is not empty and if $T_1,\dots,T_s$ do not satisfy a linear constraint of the form $v\cdot T=c$. *In my words*, the exponential family is full rank if the image $\eta(\Omega)$ contains an open set in $\mathbb{R}^s$ (so the natural parameters are free to vary in all $s$ directions) and no nontrivial linear combination of $T_1,\dots,T_s$ is constant.

### Exponential Completeness Theorems:

- Theorem 1 (Keener): In an exponential family of full rank, $T$ is complete.

- Theorem 2 (Keener): If $T$ is complete and sufficient, then $T$ is minimal sufficient.

### Differential Identities for Exponential Families

- Theorem 2.4 (Keener): Let $\Xi_f$ be the set of values for $\eta\in\mathbb{R}^s$ where

$$\int |f(x)|\exp\left[\sum\limits_{i=1}^s \eta_i T_i(x)\right]h(x)d\mu(x)<\infty.$$

Then the function 

$$g(\eta)=\int f(x)\exp\left[\sum\limits_{i=1}^s \eta_i T_i(x)\right]h(x)d\mu(x)$$

is continuous and has continuous partial derivatives of all orders for $\eta\in\Xi^\circ_f$. Furthermore, these derivatives can be computed by differentiation under the integral sign.

This is an interesting theorem mathematically speaking if you enjoyed analysis and while the proof is excluded from the book, I would like to give some intuition for why this makes sense by starting with a very simplistic example. 

- Define $\Omega=\left\{a|\int a|f(x)|\;dx<\infty\right\}$ similar to $\Xi$. Then let us ask whether, and prove that, the following function is continuous and has continuous derivatives of all orders (smooth). Let $g:\Omega\rightarrow \mathbb{R}$ such that $g(a)=\int af(x)\;dx$. Notice that 

$$g(a)-g(b)=\int a f(x)\;dx-\int b f(x)\;dx=(a-b)\int f(x)\;dx.$$

Let $M=\left|\int f(x)\;dx\right|<\infty$. If $M=0$ then $g$ is constant; otherwise, for $\epsilon >0$, let $\delta =\frac{\epsilon}{M}$, so that $|a-b|<\delta$ gives $|g(a)-g(b)|=|a-b|M<\epsilon$ and $g$ is continuous over $\Omega$. For the derivative, notice that 

$$g'(a)=\lim\limits_{t\rightarrow 0} \frac{g(a+t)-g(a)}{t}=\lim\limits_{t\rightarrow 0} \frac{\int (a+t)f(x)\;dx-\int af(x)\;dx}{t}=\int f(x)\;dx$$

which exists and is finite. Showing that the derivative is continuous is then trivial, as $g'(a)-g'(b)=\int f(x)\;dx-\int f(x)\;dx=0$ for all $a,b\in\Omega$. While yes, this example is very simplistic, the pattern for extending this to the multivariable setting and a broader class of functions is not divergent (pun intended). The main tactic in the extension as far as I can see is harnessing the convexity of $e^x$ to prove continuity and examining the Taylor series to prove it is analytic as demonstrated here: https://thewindingnumber.blogspot.com/2019/05/whats-with-e-1x-on-smooth-non-analytic.html.
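One standard consequence of differentiating under the integral sign: taking $f\equiv 1$ gives $g(\eta)=e^{A(\eta)}$, and differentiating $\int \exp[\eta\cdot T(x)-A(\eta)]h(x)\,d\mu(x)=1$ yields $\partial A/\partial\eta_i=\mathbb{E}_\eta[T_i]$. Below is a rough numerical check for the Poisson family in canonical form ($T(x)=x$, $h(x)=1/x!$, so $A(\eta)=e^\eta$). The truncation point `K` and the finite-difference step are assumptions of this sketch.

```python
import math

K = 200  # truncation point for the infinite sum (an assumption of this sketch)

def A(eta):
    """Log-normalizer A(eta) = log sum_x exp(eta * x) / x! for the Poisson family."""
    return math.log(sum(math.exp(eta * x - math.lgamma(x + 1)) for x in range(K)))

def mean_T(eta):
    """E[T] = E[X] under the canonical density exp(eta * x - A(eta)) / x!."""
    a = A(eta)
    return sum(x * math.exp(eta * x - math.lgamma(x + 1) - a) for x in range(K))

def dA(eta, step=1e-5):
    """Central finite-difference approximation to A'(eta)."""
    return (A(eta + step) - A(eta - step)) / (2 * step)

for eta in (-1.0, 0.0, 1.2):
    # All three columns should agree: A'(eta), E[T], and the closed form e^eta.
    print(eta, dA(eta), mean_T(eta), math.exp(eta))
```

The finite-difference derivative of $A$, the directly computed mean, and the closed form $e^{\eta}$ line up to within the approximation error, which is the differential identity at work.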


## Problems

**1.** Find a sufficient statistic for $\theta$ (and a minimally sufficient statistic if possible) where

a) $X_1,\dots,X_n\sim \text{Unif}(-\theta,\theta)$

b) $X_1,\dots,X_n\sim \mathcal{N}(\theta,\sigma)$ for a known $\sigma>0$

c) $X_1,\dots,X_n\sim \text{Gamma}(\theta,\frac{1}{2})$

d) $X_1,\dots,X_n\sim \text{Beta}(\theta, \theta)$
    
- **For part a)**, we have that 

$$P(X_1,\dots,X_n|\theta)=\left(\frac{1}{2\theta}\right)^n \mathbb{I}\left\{-\theta\leq X_1,\dots,X_n\leq \theta\right\}=\left(\frac{1}{2\theta}\right)^n \mathbb{I}\left\{\max\limits_{i\in\{1,\dots,n\}} |X_i|\leq \theta\right\}.$$

Thus by the factorization theorem, we know the maximum data point in absolute value is sufficient. To see that it is minimally sufficient, note that we have 

$$\frac{P(X_1,\dots,X_n|\theta)}{\alpha P(Y_1,\dots,Y_n|\theta)}=\frac{\left(\frac{1}{2\theta}\right)^n\mathbb{I}\left\{X_1,\dots,X_n\in[-\theta,\theta]\right\}}{\alpha\left(\frac{1}{2\theta}\right)^n\mathbb{I}\left\{Y_1,\dots,Y_n\in[-\theta,\theta]\right\}}=\frac{\mathbb{I}\left\{\max\limits_{i\in\{1,\dots,n\}} |X_i|\leq \theta\right\}}{\alpha\mathbb{I}\left\{\max\limits_{i\in\{1,\dots,n\}} |Y_i|\leq \theta\right\}}.$$

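which is constant with respect to $\theta$ if and only if $\max\limits_{i\in\{1,\dots,n\}} |X_i|=\max\limits_{i\in\{1,\dots,n\}} |Y_i|$, that is, if and only if $T(x^n)=T(y^n)$ for $T(x^n)=\max\limits_{i\in\{1,\dots,n\}} |x_i|$. Hence $T$ is minimally sufficient by the minimal sufficiency theorem above. I should note that the $\alpha$ here is just to ensure that we do not have a case with $\frac{0}{0}$. 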
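A quick numerical sketch of part a), with made-up samples and a hypothetical helper `unif_likelihood`: samples sharing the same $\max|x_i|$ have identical likelihoods for every $\theta$, while a sample with a different maximum gives a ratio that changes with $\theta$.

```python
def unif_likelihood(xs, theta):
    """Joint density of i.i.d. Unif(-theta, theta) observations."""
    inside = max(abs(x) for x in xs) <= theta
    return (1.0 / (2.0 * theta)) ** len(xs) if inside else 0.0

xs = [0.2, -1.7, 0.9]
ys = [1.7, 0.1, -0.4]  # same max |x_i| = 1.7 as xs
zs = [0.2, -1.1, 0.9]  # smaller max |x_i| = 1.1
thetas = [0.5, 1.0, 1.5, 2.0, 3.0]

for theta in thetas:
    # xs and ys always match; xs and zs differ for theta in [1.1, 1.7).
    print(theta, unif_likelihood(xs, theta), unif_likelihood(ys, theta), unif_likelihood(zs, theta))
```

At $\theta=1.5$ the likelihood of `xs` is zero while that of `zs` is positive, yet at $\theta=2$ they coincide, so the ratio is not free of $\theta$ when the maxima differ.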

**For part b)**, we have the likelihood is given by 

$$P(X_1,\dots,X_n|\theta)=\left(\frac{1}{\sqrt{2\pi}\sigma}\right)^ne^{-\frac{\sum\limits_{i=1}^n (X_i-\theta)^2}{2\sigma^2}}\propto  e^{-\frac{\sum\limits_{i=1}^n (X_i-\theta)^2}{2\sigma^2}}=e^{-\frac{\sum\limits_{i=1}^n X_i^2}{2\sigma^2}}\times e^{\frac{\theta\sum\limits_{i=1}^n X_i}{\sigma^2}-\frac{n\theta^2}{2\sigma^2}}$$

which is exactly of the form of the one-parameter exponential family as $\sigma$ is constant here with respect to $\theta$ (hence the alpha fish usage) where $T=\sum\limits_{i=1}^n X_i$. As the exponential family here is trivially full rank (rank one), we have that $T$ is also complete and so by theorem 2 from above, this implies that $T$ is minimal sufficient.

**For part c)**, the likelihood is given by

$$P(X_1,\dots,X_n|\theta)=\left(\frac{\left(\frac{1}{2}\right)^\theta}{\Gamma(\theta)}\right)^n (X_1\cdots X_n)^{\theta - 1}e^{-\frac{1}{2}\sum\limits_{i=1}^n X_i}=e^{-\frac{1}{2}\sum\limits_{i=1}^n X_i}\times e^{(\theta - 1)\sum\limits_{i=1}^n \log(X_i)-n\log(2^\theta\Gamma(\theta))}$$



which again is a one-parameter exponential family with $T=\sum\limits_{i=1}^n \log(X_i)$. Again as this is full rank, we have that $T$ is also minimal sufficient by the same theorem.
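To see the role of $T=\sum\log(X_i)$ numerically, here is a sketch assuming the shape/rate parameterization that the likelihood above uses (shape $\theta$, rate $\frac{1}{2}$). The two toy samples are chosen to have the same product, hence the same $T$, so their log-likelihood difference should not move with $\theta$.

```python
import math

def gamma_loglik(xs, theta):
    """Log-likelihood of i.i.d. Gamma(shape=theta, rate=1/2) observations."""
    n = len(xs)
    return (-n * (theta * math.log(2) + math.lgamma(theta))
            + (theta - 1) * sum(math.log(x) for x in xs)
            - 0.5 * sum(xs))

xs = [1.0, 4.0]
ys = [2.0, 2.0]  # same product as xs, hence the same sum of logs

# Difference of log-likelihoods at several values of theta.
diffs = [gamma_loglik(xs, th) - gamma_loglik(ys, th) for th in (0.5, 1.0, 2.5, 10.0)]
print(diffs)
```

Every entry of `diffs` equals $-\frac{1}{2}(5-4)=-0.5$ up to rounding: the only surviving term comes from $h$, which does not involve $\theta$.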

**For part d)**, we have

$$P(X_1,\dots,X_n|\theta)=e^{(\theta - 1)\sum\limits_{i=1}^n \log(X_i-X_i^2) + n\left[\log(\Gamma(2\theta)) - 2\log(\Gamma(\theta))\right]}$$

so $T=\sum\limits_{i=1}^n \log(X_i-X_i^2)$ is minimal sufficient. This particular example is a tad bit confusing because you may suspect that the exponential family is not full rank as the two shape parameters are the same. This is simply a misunderstanding of the definition: yes, the exponential family is not 2-dimensional; however, it is a 1-dimensional family with a single $\eta(\theta)=\theta-1$, whose range $(-1,\infty)$ has nonempty interior, and a single non-constant $T$, so it is full rank.

**2.** Find an example of a minimally sufficient statistic that is not complete. Can you find an example of the converse?

- I figured at first that the statistic $T=\max\limits_{i\in\{1,\dots,n\}} |X_i|$ from the Uniform distribution above wouldn't be complete; however, notice that 

$$P\left(\max\limits_{i\in\{1,\dots,n\}} |X_i|\leq x\right)=P(|X_i|\leq x)^n=\left(\frac{x}{\theta}\right)^n\mathbb{I}\{x\in[0,\theta]\}$$

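so $T$ has density $\frac{nt^{n-1}}{\theta^n}$ on $[0,\theta]$. Requiring the expectation to vanish for every $\theta$ and then differentiating the integral with respect to $\theta$ gives

$$\mathbb{E}[g(T)]=\int\limits_0^\theta g(t)\frac{nt^{n-1}}{\theta^n}dt=0\;\;\forall\theta\iff \int\limits_0^\theta g(t)t^{n-1}dt=0\;\;\forall\theta\implies g(\theta)\theta^{n-1}=0\implies g(\theta)=0$$

for almost every $\theta>0$,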

which implies that $T$ is complete. Thus, I have no clue how to draft an example of this type at the moment...
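One standard textbook candidate, offered here as a pointer rather than a worked answer, is $X_1,\dots,X_n\sim\text{Unif}(\theta,\theta+1)$. There $T=(X_{(1)},X_{(n)})$ is minimal sufficient, but the range $X_{(n)}-X_{(1)}$ has a distribution that does not involve $\theta$, with mean $\frac{n-1}{n+1}$. So $g(T)=X_{(n)}-X_{(1)}-\frac{n-1}{n+1}$ has expectation zero for every $\theta$ without being zero, and $T$ is not complete. The Monte Carlo sketch below (seed, sample size, and replication count are arbitrary choices) checks that the mean range ignores $\theta$.

```python
import random

def mean_range(theta, n, reps, seed):
    """Monte Carlo estimate of E[max(X) - min(X)] for X_i ~ Unif(theta, theta + 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        xs = [theta + rng.random() for _ in range(n)]
        total += max(xs) - min(xs)
    return total / reps

n = 5
target = (n - 1) / (n + 1)  # theoretical mean of the range, free of theta

# Same seed for every theta, so only the location shift changes.
estimates = {theta: mean_range(theta, n, reps=20000, seed=1)
             for theta in (0.0, 7.0, -3.0)}
print(target, estimates)
```

All three estimates sit near $\frac{n-1}{n+1}$ regardless of $\theta$, which is the ancillarity that breaks completeness.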

**3. (Rao Chapter 2 - Problem 21)**: Consider the probability space $(\Omega, \mathcal{B},P)$. Let $Y$ be a space of points $y$, $\mathcal{A}$ a $\sigma$-algebra of sets of $Y$, and $T:\Omega\rightarrow Y$ a function such that $A\in\mathcal{A}\Rightarrow T^{-1}(A)\in\mathcal{B}$. Let $f(\omega,y)$ be a real-valued $\mathcal{B}\times \mathcal{A}$ measurable function on $\Omega\times Y$. Then show that 

$$\mathbb{E}\left[f(\omega,T(\omega))|T(\omega)=y\right]=\mathbb{E}\left[f(\omega,y)|T(\omega)=y\right].$$

- answer Mark...

**4. (Rao Chapter 2 - Problem 22)**: Let $X$ be a vector valued random variable whose distribution depends on a vector parameter $\theta$. Further, let $T$ be a statistic such that the distribution of $T$ only depends on $\phi$, a function of $\theta$. Then $T$ is said to be **inference sufficient** for $\phi$ if the conditional distribution of $X$ given $T$ depends only on functions of $\theta$ which are independent of $\phi$. Let $X_1,\dots, X_n$ be $n$ independent observations from $\mathcal{N}(\mu,\sigma^2)$. Show that $\sum (X_i-\bar{X})^2$ is inference sufficient for the parameter $\sigma^2$.

- answer Mark...

**4. (Rao Chapter 5 - Problem 6.1)** Consider a power series distribution with probability 

$$P(X=r)=\frac{a_r}{f(\theta)}\theta^r,\;\;\;\;r=c,c+1,\dots,\infty.$$

Let $x_1,\dots,x_n$ be a sample of size $n$ with a total $T=x_1+\cdots +x_n$. Show that $T$ is a complete sufficient statistic for $\theta$ and the distribution of $T$ is of the same form 

$$P(T=t)=\frac{b_t\theta^t}{f(\theta)^n},\;\;\;\;t=nc,nc+1,\dots,\infty.$$

- cool problem - answer...

**5. (Keener Chapter 2 - Problem 1)** Consider independent Bernoulli trials with success probability $p$ and let $X$ be the number of failures before the first success. Then $P(X=x)=p(1-p)^x$, for $x=0,1,\dots$, and $X$ has the geometric distribution with parameter $p$. 

a) Show that the geometric distributions form an exponential family.

b) Write the densities for the family in canonical form, identifying the canonical parameter $\eta$, and the function $A(\eta)$.

c) Find the mean of the geometric distribution using a differential identity.

d) Suppose that $X_1,\dots,X_n$ are i.i.d. from a geometric distribution. Show that the joint distributions form an exponential family, and find the mean and variance of $T$.
    
- answer

**6. (Keener Chapter 2 - Problem 2)** Determine the canonical parameter space $\Xi$, and find densities for the one-parameter exponential family with $\mu$ Lebesgue measure on $\mathbb{R}^2$, $h(x,y)=\exp{[-(x^2+y^2)/2]}/(2\pi)$, and $T(x,y)=xy$.

- answer

**7. (Keener Chapter 2 - Problem 3)** Suppose that $X_1,\dots, X_n$ are independent random variables and that for $i=1,\dots,n$, $X_i$ has a Poisson distribution with mean $\lambda_i=\exp{(\alpha + \beta t_i)}$, where $t_1,\dots, t_n$ are observed constants and $\alpha$ and $\beta$ are unknown parameters. Show that the joint distributions for $X_1,\dots,X_n$ form a two-parameter exponential family and identify the statistics $T_1$ and $T_2$.

- answer


# Sources
- "Theoretical Statistics" by Keener - Chapters 2, 3, 5
- "All of Statistics" by Wasserman - Chapter 9
- "Linear Statistical Inference and Its Applications" by Rao - Chapter 2, 5
- "Statistical Inference" by Casella et al 
- https://www.math.kth.se/matstat/gru/Statistical%20inference/Lecture5.pdf
- https://stats.stackexchange.com/questions/53107/meaning-of-completeness-of-a-statistic