# Data Reduction

In [1]:
from sympy import *
import sympy as sym
from sympy.abc import e, x, X, y, z  # nb: e here is a plain symbol, not Euler's number (use exp or E for that)

In [2]:
Xi = sym.Symbol("X_i") 
Xi

X_i

In [3]:
fact = sym.Symbol("!")
fact

!

In [4]:
lamda = sym.Symbol("lamda", real=True)  # define the lambda symbol. nb: lambda is a reserved word in Python, so use lamda.
lamda

lamda

In [5]:
sigma = sym.Symbol("sigma", real=True)
sigma

sigma

In [6]:
theta = sym.Symbol("theta", real=True)
theta

theta

### Data reduction
To summarise the information in the sample by transforming the sample values.

### Definitions:
- $\theta$ the parameter: the unknown quantity that we want to find or approximate
- $X$ a random variable with some distribution
- $x$ a particular realisation
- $T$ a statistic
- $S$ a statistic
- $\hat{X}$ 
- $\hat{Y}$ 
- $\bar{X}$ 
- $\bar{Y}$ 

In [7]:
# likelihood function for each different distribution

# factorise in such a way that we only have function of the statistic and the parameter.

### Sufficient statistic (slide 9)
$T$ is an *m*-dimensional sufficient statistic for the parameter $\theta$ of the family $P = \{ P_{\theta}(X) \}$.

A statistic that retains all the information about a parameter is said to be sufficient for that parameter.

A statistic $T$ is sufficient for $\theta$ if the conditional distribution of $X = (X_1, \dots, X_n)$ given $T = t$ does **NOT** depend on $\theta$, for all values of $t$.

Essentially, all the information about $\theta$ contained in the x-values is contained in the sufficient statistic $t$. Nothing more can be gained by looking at the individual x-values.
- Sufficiency means that $P(X_1 = x_1, \dots, X_n = x_n \mid T = t)$ is a function of $x$ and $t$ only (not of $\theta$).
- If two samples $x$ and $y$ from the same population satisfy $T(x) = T(y)$, then inference on $\theta$ is the same whether we observe $X = x$ or $Y = y$.

### Sufficiency (slide 13)
Let $X_1, X_2, X_3$ be a sample of size 3 from the Bernoulli$(p)$ distribution. Consider the two statistics $S = X_1 + X_2 + X_3$ and $T = X_1 X_2 + X_3$.

Show that $S$ is sufficient for $p$ and $T$ is not.
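One way to check this claim is to enumerate all eight outcomes and compute the conditional distribution of the sample given each statistic. The sketch below does this with sympy. The helper names (`joint_pmf`, `conditional_pmfs`, `is_free_of_p`) are illustrative and do not come from the slides.

```python
from itertools import product
import sympy as sym

p = sym.Symbol("p", positive=True)

def joint_pmf(xs):
    """P(X1=x1, X2=x2, X3=x3) for independent Bernoulli(p) draws."""
    k = sum(xs)
    return p**k * (1 - p) ** (len(xs) - k)

def conditional_pmfs(stat):
    """Return {x: P(X = x | stat(X) = stat(x))} over all 8 outcomes."""
    outcomes = list(product([0, 1], repeat=3))
    result = {}
    for xs in outcomes:
        t = stat(xs)
        # P(stat = t): add up the joint pmf over outcomes with the same statistic
        p_t = sum(joint_pmf(ys) for ys in outcomes if stat(ys) == t)
        result[xs] = sym.simplify(joint_pmf(xs) / p_t)
    return result

def is_free_of_p(stat):
    """True if no conditional probability involves p."""
    return all(not expr.has(p) for expr in conditional_pmfs(stat).values())

def stat_S(xs):
    return xs[0] + xs[1] + xs[2]

def stat_T(xs):
    return xs[0] * xs[1] + xs[2]

s_sufficient = is_free_of_p(stat_S)
t_sufficient = is_free_of_p(stat_T)
print(s_sufficient, t_sufficient)
```

For $T$, the conditional probability of $(0,0,0)$ given $T = 0$ works out to $(1-p)/(1+p)$, which still involves $p$. That is why the check fails for $T$.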

## Factorisation criteria (slide 21)
Use the factorisation criterion to find a sufficient statistic for the parameter when $X_1, X_2, \dots, X_n$ are independent random variables with distribution:

### Question 1

$N (\mu,1)$

### Solution

Denoting $T = \sum_{i=1}^{n} X_i$, you can factorise $L(X,\mu)$ with 

$$h(X) = \text{exp} \Bigg(- \dfrac{1}{2} \sum_{i=1}^{n} X_i^2 \Bigg)$$

$$g(T,\mu) = \text{exp} \Bigg( - \dfrac{n}{2} \mu^2 \Bigg) \text{exp}(T\mu) \dfrac{1}{(\sqrt{2\pi})^n}$$

In [8]:
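# the first thing we do is compute the likelihood
mu = sym.Symbol("mu", real=True)  # the unknown mean
expr = exp(-(x - mu)**2 / 2) / sqrt(2 * pi)  # the density function of a N(mu, 1) distribution
expr

sqrt(2)*exp(-(-mu + x)**2/2)/(2*sqrt(pi))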

In [9]:
# take the product of the density function and evaluate it at all the samples


In [10]:
# 

In [11]:
# anything that doesn't depend on mu we're going to take outside

In [12]:
# now we have our factorisation

In [13]:
# a function that only depends on my data, it doesn't depend on the parameter mu

In [14]:
# a function that only depends on t and mu, it doesn't depend on the data except through t 
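The steps sketched in the cells above can be carried out for a small fixed sample size. The block below is a minimal sketch, assuming $n = 3$ purely for illustration. It builds the $N(\mu, 1)$ likelihood, writes down $h$ and $g$ from the solution, and checks that their product reproduces the likelihood. The names `normal_density`, `likelihood` and `ratio` are my own.

```python
import sympy as sym

n = 3  # small fixed sample size, chosen only for illustration
mu = sym.Symbol("mu", real=True)
xs = sym.symbols(f"x1:{n + 1}", real=True)  # (x1, x2, x3)

def normal_density(x, mean):
    """Density of N(mean, 1) at x."""
    return sym.exp(-(x - mean) ** 2 / 2) / sym.sqrt(2 * sym.pi)

# 1) likelihood: product of the density evaluated at every sample value
likelihood = sym.Mul(*[normal_density(x, mu) for x in xs])

# 2) the two factors from the solution, with T = sum of the observations
T = sum(xs)
h = sym.exp(-sym.Rational(1, 2) * sum(x**2 for x in xs))  # depends on the data only
g = sym.exp(-sym.Rational(n, 2) * mu**2) * sym.exp(T * mu) / sym.sqrt(2 * sym.pi) ** n

# 3) expand the squares in the exponents, then simplify;
#    the ratio L / (h * g) reduces to 1 when the factorisation is right
ratio = sym.simplify(sym.expand(likelihood / (h * g)))
print(ratio)
```

Because `g` involves the data only through `T`, the factorisation criterion gives $T = \sum X_i$ as sufficient for $\mu$.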

### Question 2: 

$N(0, \sigma^2)$

### Solution

Denoting $T = \sum_{i=1}^{n} X_i^2$, you can factorise $L(X, \sigma^2)$ with

$$h(X) = 1, \quad g(T,\sigma^2) = \text{exp} \bigg( - \dfrac{1}{2\sigma^2} T \bigg) \dfrac{1}{(\sqrt{2 \pi} \sigma)^n}$$

### Question 3: 

Uniform $(\theta, \theta + 1)$

### Solution
For a point $x$ and a set $A$, we use the notation

$$I_A(x) = I(x \in A) = \begin{cases} 1 & \text{if } x \text{ is in } A, \\ 0 & \text{if } x \text{ is not in } A \end{cases}$$

Then

$$L(X, \theta) = \prod_{i=1}^{n} I_{(\theta, \theta+1)} (x_i) = I_{(\theta, \infty)} (x_{(1)}) \, I_{(-\infty, \theta + 1)} (x_{(n)}) = I_{(x_{(n)} - 1, \infty)} (\theta) \, I_{(-\infty, x_{(1)})} (\theta)$$

since all the $x_i$ lie in $(\theta, \theta + 1)$ exactly when $\theta < x_{(1)}$ and $x_{(n)} < \theta + 1$.

Hence $T = {X_{(1)} \choose X_{(n)}}$ can be taken as a sufficient vector statistic.
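As a quick numerical illustration, the likelihood computed from the whole sample can be compared with the version that uses only the smallest and largest observation. This is a sketch with made-up observations. The function names are hypothetical, and exact fractions are used so that boundary comparisons are not affected by floating-point rounding.

```python
from fractions import Fraction

def likelihood_from_sample(xs, theta):
    """Likelihood of Uniform(theta, theta + 1): product of the indicators."""
    return int(all(theta < x < theta + 1 for x in xs))

def likelihood_from_statistic(x_min, x_max, theta):
    """The same likelihood written using only T = (x_(1), x_(n))."""
    return int(x_max - 1 < theta < x_min)

# hypothetical observations: 2.3, 2.9, 2.5, 2.75
sample = [Fraction(23, 10), Fraction(29, 10), Fraction(5, 2), Fraction(11, 4)]
x_min, x_max = min(sample), max(sample)

# theta values 1.5, 1.55, ..., 3.45
grid = [Fraction(3, 2) + Fraction(k, 20) for k in range(40)]

agree = all(
    likelihood_from_sample(sample, th) == likelihood_from_statistic(x_min, x_max, th)
    for th in grid
)
# theta values on the grid where the likelihood equals one
support = [th for th in grid if likelihood_from_statistic(x_min, x_max, th) == 1]
print(agree, float(min(support)), float(max(support)))
```

The two functions agree at every grid point, and the likelihood is non-zero only for $\theta$ between $x_{(n)} - 1$ and $x_{(1)}$.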

In [15]:
# note: the LaTeX `cases` environment draws only the left-hand brace (Bmatrix draws both)

### Question 4: 

Poisson$(\lambda)$

**hint**: use the property of Poisson random variables that

$$\sum_{i=1}^{n} X_i \sim \text{Poisson}(n \lambda)$$

### Solution

Denoting $T = \sum_{i=1}^{n} X_i$ you can factorise $L(X, \lambda)$ with 

$$g(T, \lambda) = \text{exp}(-n \lambda) \lambda^T$$

and

$$h(X) = \dfrac{1}{\prod_{i=1}^{n} X_i !}$$

According to the factorisation criterion, $T$ is sufficient.

Now using the definition **and** noting that $T = \sum_{i=1}^{n} X_i \sim \text{Poisson}(n \lambda)$ we have:

$$P(X = x \mid T = t) = \dfrac{P(X = x \cap T = t)}{P(T = t)} =
\begin{cases}
0 & \text{if } \sum_{i=1}^{n} x_i \neq t \\
\dfrac{P(X = x)}{P(\sum_{i=1}^{n} X_i = t)} & \text{if } \sum_{i=1}^{n} x_i = t
\end{cases}
$$

When $\sum_{i=1}^{n} x_i = t$ we have $P(X = x) = \dfrac{e^{-n\lambda} \lambda^{t}}{\prod_{i=1}^{n} x_i !}$, and since $T = \sum_{i=1}^{n} X_i \sim \text{Poisson}(n \lambda)$ we have $P(T = t) = \dfrac{e^{-n\lambda} (n\lambda)^{t}}{t!}$. The latter expression on the right is therefore equal to $\dfrac{t!}{n^t \prod_{i=1}^{n} x_i !}$, which does not depend on $\lambda$. Hence $T = \sum_{i=1}^{n} X_i$ is sufficient according to the original definition of sufficiency.
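The conditional probability above can be checked on a concrete sample. This is a minimal sketch: the counts `(2, 0, 3)` are arbitrary, and `poisson_pmf` and `conditional_prob` are illustrative helper names.

```python
import sympy as sym

lam = sym.Symbol("lamda", positive=True)

def poisson_pmf(k, rate):
    """P(K = k) for K ~ Poisson(rate)."""
    return sym.exp(-rate) * rate**k / sym.factorial(k)

def conditional_prob(xs):
    """P(X = xs | T = sum(xs)) for i.i.d. Poisson(lam) observations."""
    n, t = len(xs), sum(xs)
    joint = sym.Mul(*[poisson_pmf(x, lam) for x in xs])
    # T ~ Poisson(n * lam), as in the hint
    return sym.simplify(joint / poisson_pmf(t, n * lam))

sample = (2, 0, 3)  # arbitrary illustrative counts
cond = conditional_prob(sample)

# closed form from the text: t! / (n^t * prod x_i!)
n, t = len(sample), sum(sample)
closed_form = sym.factorial(t) / (
    sym.Integer(n) ** t * sym.Mul(*[sym.factorial(x) for x in sample])
)
print(cond, closed_form)
```

Both expressions reduce to the same rational number, with no $\lambda$ left in it.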

In [16]:
# 1) first we want to compute the likelihood
expr = exp(-lamda) * lamda**Xi / factorial(Xi)  # the poisson probability mass function
expr

lamda**X_i*exp(-lamda)/factorial(X_i)

In [17]:
# 2) take the product of the distribution 

In [18]:
# 3) factorize

In [19]:
# we have a function that depends only on t and lambda

In [20]:
# we have a h(x)

In [21]:
# now we apply the definition of sufficiency -> P(X=x | T=t) does not depend on lambda

In [22]:
# an ordered vector of heights is a statistic, but there is no data reduction.
# likelihood ratios:
# if we compute the likelihood ratio, we check whether it simplifies to a constant (free of the parameter).

### Minimal Sufficient Statistic (slide 31)

Find a minimal sufficient statistic for the parameter when $X_1,X_2,...,X_n$ are independent random variables each with distribution:

### Question 1: 

Poisson$(\lambda)$ 

### Solution

$$\dfrac{L(x,\lambda)}{L(y,\lambda)} = \lambda^{\sum_{i=1}^{n} x_i - \sum_{i=1}^{n} y_i} \dfrac{\prod_{i=1}^{n} y_i !} {\prod_{i=1}^{n} x_i !}$$

This does not depend on $\lambda$ if and only if $\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$.

Hence $T = \sum_{i=1}^{n} X_i$ is minimal sufficient.
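The likelihood-ratio argument can be illustrated with three made-up samples, two of which share the same total. This is only a sketch. The samples and the helper `poisson_likelihood` are assumptions for illustration.

```python
import sympy as sym

lam = sym.Symbol("lamda", positive=True)

def poisson_likelihood(sample):
    """Likelihood of i.i.d. Poisson(lam) counts."""
    return sym.Mul(*[sym.exp(-lam) * lam**k / sym.factorial(k) for k in sample])

x = (1, 2, 3)
y = (4, 0, 2)  # same total as x
z = (1, 1, 1)  # different total

same_total = sym.simplify(poisson_likelihood(x) / poisson_likelihood(y))
different_total = sym.simplify(poisson_likelihood(x) / poisson_likelihood(z))
print(same_total, different_total)
```

With equal totals the ratio is a constant. With different totals a power of $\lambda$ survives.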

In [23]:
# we have the likelihood function of (X, lambda) over the likelihood function of (Y, lambda)

In [24]:
# 

### Question 2: 

$N(0, \sigma^2)$

### Solution

$$\dfrac{L(x,\sigma^2)}{L(y,\sigma^2)} = \text{exp} \bigg( - \dfrac{1}{2\sigma^2} \bigg( \sum_{i=1}^{n} x_i^2 - \sum_{i=1}^{n} y_i^2 \bigg) \bigg)$$

This does not depend on $\sigma^2$ if and only if $\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i^2$.

Hence $T(X) = \sum_{i=1}^{n} X_i^2$ is minimal sufficient.
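A symbolic version of this ratio for $n = 2$ is sketched below. The choice $n = 2$ and the substitution $y = (x_2, -x_1)$, a different sample with the same sum of squares, are assumptions made only to keep the example small.

```python
import sympy as sym

sigma = sym.Symbol("sigma", positive=True)
x1, x2, y1, y2 = sym.symbols("x1 x2 y1 y2", real=True)

def normal_likelihood(sample):
    """Likelihood of i.i.d. N(0, sigma^2) observations."""
    return sym.Mul(
        *[sym.exp(-v**2 / (2 * sigma**2)) / (sym.sqrt(2 * sym.pi) * sigma) for v in sample]
    )

# likelihood ratio -> likelihood of x / likelihood of y
ratio = sym.simplify(normal_likelihood((x1, x2)) / normal_likelihood((y1, y2)))

# a second sample with the same sum of squares as (x1, x2)
matched = sym.simplify(ratio.subs({y1: x2, y2: -x1}))
print(ratio)
print(matched)
```

In general the ratio still contains `sigma`. Once the sums of squares agree, it collapses to 1.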

In [25]:
# first compute the likelihood of X

In [26]:
var = sigma**2  # variance (sigma itself is the standard deviation)
var

sigma**2

In [27]:
# normalising constant of the N(0, sigma^2) density
expr = 1 / (sqrt(2 * pi * var))
expr

sqrt(2)/(2*sqrt(pi)*Abs(sigma))

In [28]:
# product?

In [29]:
# likelihood ratio -> likelihood of X / likelihood of Y

### Question 3: 

Gamma$(\alpha)$, with density $f(x, \alpha) = \dfrac{1}{\Gamma (\alpha)} \text{ exp}(-x)x^{\alpha - 1}, x > 0$

Here the Gamma function is defined as $\Gamma (\alpha) = \int_{0}^{\infty}e^{-x}x^{\alpha - 1} dx$ and has the property $\Gamma(\alpha + 1) = \alpha \Gamma (\alpha)$. In particular, $\Gamma(n+1)=n!$ holds for a natural number *n*.

### Solution

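Similarly,

$$\dfrac{L(x,\alpha)}{L(y,\alpha)} = \text{exp} \bigg( \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} x_i \bigg) \bigg( \dfrac{\prod_{i=1}^{n} x_i}{\prod_{i=1}^{n} y_i} \bigg)^{\alpha - 1}$$

does not depend on $\alpha$ if and only if $\prod_{i=1}^{n} x_i = \prod_{i=1}^{n} y_i$, so $T = \prod_{i=1}^{n}X_i$ is minimal sufficient. We can also take $\tilde{T} = \sum_{i=1}^{n} \log X_i$ as minimal sufficient.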

### Question 4: 
Uniform $(0, \theta)$

### Solution

We have:

$$\dfrac{L(X, \theta)}{L(Y, \theta)} = \dfrac{I_{(x_{(n)}, \infty)}(\theta)}{I_{(y_{(n)}, \infty)}(\theta)}$$

This has to be considered as a function of $\theta$ for fixed $x_{(n)}$ and $y_{(n)}$. Assume that $x_{(n)} \neq y_{(n)}$ and, to be specific, first let $x_{(n)} > y_{(n)}$. Then the ratio $\dfrac{L(X, \theta)}{L(Y, \theta)}$ is:
- not defined if $\theta \leq y_{(n)}$
- equal to zero when $\theta \in (y_{(n)}, x_{(n)}]$
- equal to one when $\theta > x_{(n)}$

In other words, the ratio's value depends on the position of $\theta$ on the real axis, that is, it is a function of $\theta$. Similar conclusions are reached if $x_{(n)} < y_{(n)}$ (do it yourself). Hence the ratio does not depend on $\theta$ if and only if $x_{(n)} = y_{(n)}$. This implies that $T = X_{(n)}$ is minimal sufficient.
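The three regions listed above can be tabulated on a grid of $\theta$ values. This is a small sketch with hypothetical maxima. `likelihood_ratio` is an illustrative name, and `None` stands for "not defined" (zero denominator).

```python
from fractions import Fraction

def likelihood_ratio(x_max, y_max, theta):
    """I(theta > x_max) / I(theta > y_max); None where the denominator is zero."""
    if theta <= y_max:
        return None
    return 1 if theta > x_max else 0

thetas = [Fraction(k, 4) for k in range(1, 21)]  # 0.25, 0.5, ..., 5.0

x_max, y_max = Fraction(3), Fraction(2)  # hypothetical maxima with x_(n) > y_(n)

# set of values the ratio takes as theta moves along the grid
unequal = {likelihood_ratio(x_max, y_max, th) for th in thetas}
equal = {likelihood_ratio(x_max, x_max, th) for th in thetas}
print(unequal, equal)
```

When the maxima differ, the ratio takes both the values 0 and 1 where it is defined, so it changes with $\theta$. When they coincide, it equals 1 wherever it is defined.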

**question 5**: Uniform $(\theta, \theta + 1)$

**question 6**: Uniform $(\theta_1, \theta_2)$

### One parameter exponential family densities example (slide 35)

### One parameter exponential family of densities (slide 38)
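Show that the following densities belong to the one parameter exponential family of densities, that is, that each can be written as $f(x, \theta) = a(\theta) b(x) \text{exp} \{ c(\theta) d(x) \}$, and identify the minimal sufficient statistic $T = \sum_{i=1}^{n} d(X_i)$ for each of the distributions.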

### Question 1: 
Poisson $(\theta): f(x, \theta) = \dfrac{e^{-\theta} \theta^x}{x!}, x \in \{ 0,1,2,... \}, \theta > 0$

### Solution

$$f(x,\theta) = \dfrac{e^{-\theta}\theta^{x}}{x!} = e^{-\theta} \dfrac{1}{x!} e^{x \ln \theta}$$

Hence, $a(\theta) = e^{-\theta}, b(x) = \dfrac{1}{x!}, c(\theta) = \ln \theta$ and $d(x) = x$

Therefore $T = \sum_{i=1}^{n} X_i$ is a minimal sufficient statistic for $\theta$

In [35]:
expr = exp(-theta) * theta**x / factorial(x)  # the poisson probability mass function
expr

theta**x*exp(-theta)/factorial(x)

In [31]:
# we want to separate x and theta

In [32]:
# first take the log, 

In [33]:
# the exponential

### Question 2: 
Bernoulli $(\theta): f(x, \theta) = \theta^x (1 - \theta)^{1-x}, x \in \{\ 0,1 \}\, \theta \in (0,1)$ 

### Solution

$$f(x,\theta) = \theta^x (1 - \theta)^{1-x} = \text{exp} \{ \ln (\theta^x(1-\theta)^{1-x}) \} = (1- \theta) \text{exp} \bigg\{ x \ln \bigg( \dfrac{\theta}{1 - \theta} \bigg) \bigg\}$$

Hence $a(\theta) = 1 - \theta, b(x) = 1, c(\theta) = \ln \bigg( \dfrac{\theta}{1 - \theta} \bigg)$ and $d(x) = x$

Therefore $T = \sum_{i=1}^{n} X_i$ is a minimal sufficient statistic for $\theta$.

### Question 3: 
Normal $N(\theta,1): f(x, \theta) = \dfrac{1}{\sqrt{2 \pi }} e^{- \dfrac{1}{2}(x - \theta)^2}, x \in R, \theta \in R$

### Solution

$$
\begin{aligned}
f(x, \theta) &= \dfrac{1}{\sqrt{2 \pi}} e^{-\dfrac{1}{2}(x-\theta)^2} \\
&= \text{exp} \{\ -\dfrac{x^2}{2} + x\theta - \dfrac{\theta^2}{2} - \dfrac{1}{2} \ln(2 \pi) \}\ \\
&= \text{exp} \{\ -\dfrac{\theta^2}{2} - \dfrac{1}{2} \ln (2 \pi) \}\ \text{exp} \{\ - \dfrac{x^2}{2} \}\ \text{exp} \{\ \theta x \}\ \\
\end{aligned}
$$

Hence, $a(\theta) = \text{exp} \{\ - \dfrac{\theta^2}{2} - \dfrac{1}{2} \ln (2 \pi) \}\, b(x) = \text{exp} \{\ - \dfrac{x^2}{2} \}\, c(\theta) = \theta$ and $d(x) = x$.

Therefore, $T= \sum_{i=1}^{n} X_i$ is a minimal sufficient statistic for $\theta$.

### Question 4: 
Normal $N(0, \theta^2): f(x, \theta) = \dfrac{1}{\sqrt{2 \pi \theta^2 }} e^{-\dfrac{x^2}{2 \theta^2}}, x \in R, \theta^2 > 0$

### Solution

$$f(x,\theta) = \dfrac{1}{\sqrt{2 \pi \theta^2}} e^{- \dfrac{x^2}{2 \theta^2}}$$

Note that the parameter of interest here is $\theta^2$.

Hence, $a(\theta^2) = \dfrac{1}{\sqrt{2 \pi \theta^2}}, b(x)=1, c(\theta^2) = -\dfrac{1}{2\theta^2}$ and $d(x) = x^2$

Therefore, $T = \sum_{i=1}^{n} X_i^2$ is a minimal sufficient statistic for $\theta^2$
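The four decompositions in this section can be spot-checked numerically by rebuilding each density as $a(\theta) b(x) \text{exp}\{c(\theta) d(x)\}$ and comparing it with the original at a few points. This is a sketch rather than a proof. The evaluation points and the helper `max_abs_error` are my own choices, picked so that $x \in \{0, 1\}$ and $\theta \in (0, 1)$ are valid for all four families.

```python
import sympy as sym

x = sym.Symbol("x")
theta = sym.Symbol("theta", positive=True)

# (density, a(theta), b(x), c(theta), d(x)) for each family in this section
families = {
    "Poisson": (sym.exp(-theta) * theta**x / sym.factorial(x),
                sym.exp(-theta), 1 / sym.factorial(x), sym.log(theta), x),
    "Bernoulli": (theta**x * (1 - theta) ** (1 - x),
                  1 - theta, sym.Integer(1), sym.log(theta / (1 - theta)), x),
    "N(theta, 1)": (sym.exp(-(x - theta) ** 2 / 2) / sym.sqrt(2 * sym.pi),
                    sym.exp(-theta**2 / 2 - sym.log(2 * sym.pi) / 2),
                    sym.exp(-x**2 / 2), theta, x),
    "N(0, theta^2)": (sym.exp(-x**2 / (2 * theta**2)) / sym.sqrt(2 * sym.pi * theta**2),
                      1 / sym.sqrt(2 * sym.pi * theta**2), sym.Integer(1),
                      -1 / (2 * theta**2), x**2),
}

def max_abs_error(density, a, b, c, d, points):
    """Largest |f - a*b*exp(c*d)| over the given (x, theta) points."""
    rebuilt = a * b * sym.exp(c * d)
    errors = []
    for xv, tv in points:
        values = {x: xv, theta: tv}
        errors.append(abs(float(density.subs(values)) - float(rebuilt.subs(values))))
    return max(errors)

# (x, theta) pairs valid for every family above
points = [(0, sym.Rational(1, 4)), (1, sym.Rational(1, 2)), (1, sym.Rational(3, 4))]
errors = {name: max_abs_error(*parts, points) for name, parts in families.items()}
print(errors)
```

In each case $d(x)$ is the function whose sum over the sample gives the minimal sufficient statistic: $x$ for the first three families and $x^2$ for $N(0, \theta^2)$.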

In [37]:
expr = 1 / sqrt(2*pi * theta**2)  # a(theta^2): the normalising constant of the N(0, theta^2) density
expr  # nb: sympy does its own simplification

sqrt(2)/(2*sqrt(pi)*Abs(theta))