# Exercise 8: Bias and Consistency
 >__Created__:  25 June 2018, ESHEP 18, Maratea, Italy,  Harrison B. Prosper
 
 
## Introduction
An important property of maximum likelihood estimates is invariance with respect to a change of variables. Suppose that
$y = g(s)$. Given the maximum likelihood estimate (MLE), $\hat{s}$, of $s$, the
MLE of $y$ is simply $\hat{y} = g(\hat{s})$. This property is very convenient because
it makes it possible to maximize the likelihood using any convenient parameterization, e.g. $s$, and obtain the MLE
of the parameter of interest, e.g. $y$, by simply plugging $\hat{s}$ into $g(s)$.
For example, $\hat{s} = N$ is the MLE of the Poisson mean $s$, where 
\begin{align}
	\textrm{Poisson}(N, s) &= \frac{e^{-s} s^N}{N!}.
\end{align}
It is readily verified that $\hat{s} = N$ is an unbiased estimate of the Poisson mean $s$. The MLE of its standard deviation, however, is biased. 
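The invariance property can be checked numerically. The following is a minimal sketch, not part of the original exercise: it maximizes the Poisson log-likelihood on a brute-force grid, once in terms of $s$ and once in terms of $y = \sqrt{s}$, and compares the two maxima. The observed count `N = 7`, the grid ranges, and the helper names `poisson_log_likelihood` and `argmax_on_grid` are assumptions chosen for illustration.

```python
import math

def poisson_log_likelihood(N, s):
    """log Poisson(N, s) = -s + N log(s) - log(N!)."""
    return -s + N * math.log(s) - math.lgamma(N + 1)

def argmax_on_grid(func, lo, hi, npoints=20001):
    """Return the grid point in [lo, hi] at which func is largest (brute force)."""
    best_x, best_f = lo, func(lo)
    for i in range(1, npoints):
        x = lo + (hi - lo) * i / (npoints - 1)
        f = func(x)
        if f > best_f:
            best_x, best_f = x, f
    return best_x

N = 7  # hypothetical observed count

# maximize the likelihood as a function of s
s_hat = argmax_on_grid(lambda s: poisson_log_likelihood(N, s), 0.01, 30.0)

# maximize the same likelihood as a function of y = sqrt(s), i.e. s = y**2
y_hat = argmax_on_grid(lambda y: poisson_log_likelihood(N, y * y),
                       0.1, math.sqrt(30.0))

print("s_hat = %.3f (N = %d)" % (s_hat, N))
print("y_hat = %.4f, sqrt(s_hat) = %.4f" % (y_hat, math.sqrt(s_hat)))
```

Up to the resolution of the grid, the maximum found in the $y$ parameterization should coincide with $\sqrt{\hat{s}}$.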

The standard deviation of a Poisson distribution is $y = \sqrt{s}$. Therefore, the MLE is $\hat{y} = \sqrt{\hat{s}} = \sqrt{N}$. As shown below, this estimate is indeed biased, but it is 
consistent, that is, the bias goes to zero as $s \rightarrow \infty$.

## Bias and Consistency

By definition, the __bias__ is given by
 \begin{align}
	\textrm{bias} 	&\equiv \langle  \hat{s}  - s \rangle, \nonumber\\
				&= \langle  \hat{s} \rangle - s
\end{align}
where $\langle \hat{s} \rangle$ is the mean (or expected) value of the estimates. Note that bias, as is true
of many statistical quantities, is a property not of an individual, here the estimate $\hat{s}$, but rather of the ensemble to
which the individual is presumed to belong. An __estimator__, that is, a procedure that yields an estimate, is __consistent__ if the bias goes
to zero as more and more data are included in the likelihood function. In practice, the expected value $\langle \hat{s} \rangle$ is approximated using a Monte Carlo method in which the estimating procedure (the estimator) is run a large number of times and the resulting estimates are 
averaged$^*$.
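The Monte Carlo approximation of the bias can be sketched in a few lines. This is an illustrative, ROOT-free sketch rather than the notebook's own implementation: the Poisson sampler uses Knuth's multiplication method from the Python standard library, and the names `sample_poisson` and `monte_carlo_bias`, the seed, the true mean `s_true = 2.0`, and the ensemble size are all assumptions.

```python
import math
import random

def sample_poisson(s, rng):
    """Draw one Poisson(s) count with Knuth's multiplication method (fine for small s)."""
    limit = math.exp(-s)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def monte_carlo_bias(estimator, s_true, target, n_toys, rng):
    """Average estimator(N) over n_toys toy experiments and subtract the true value."""
    total = 0.0
    for _ in range(n_toys):
        total += estimator(sample_poisson(s_true, rng))
    return total / n_toys - target

rng = random.Random(12345)   # fixed seed so the toy ensemble is reproducible
s_true = 2.0                 # hypothetical true Poisson mean
n_toys = 50000               # number of toy experiments in the ensemble

bias_s = monte_carlo_bias(lambda N: float(N), s_true, s_true, n_toys, rng)
bias_y = monte_carlo_bias(math.sqrt, s_true, math.sqrt(s_true), n_toys, rng)

print("estimated bias of s_hat = N       : %+.3f" % bias_s)
print("estimated bias of y_hat = sqrt(N) : %+.3f" % bias_y)
```

The same function works for any estimator that maps a count to an estimate, so a bias-corrected estimator can be passed in place of `math.sqrt`.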

Let $\hat{s}$ be the MLE of $s$, which we shall take to be unbiased. *A priori*, we would expect $\sqrt{N}$ to be a biased estimate of the standard
deviation because $\sqrt{N}$  is a nonlinear function of the estimate $\hat{s} = N$.  First note that $\hat{y} = g(\hat{s}) = g(s + \hat{s} - s)$, that is, 
$\hat{y} = g(s + h)$, where $h = \hat{s} - s$ is the (unknown) error. For small $h$, we can write
\begin{align}
\hat{y} 	&= g(\hat{s}),\nonumber\\
			&= g(s + h),\nonumber\\
			&= g(s) + g^\prime(s) h + \frac{1}{2} g^{\prime\prime} (s) h^2 + O(h^3),\nonumber\\
			&\approx y + g^\prime(s) h + \frac{1}{2} g^{\prime\prime} (s) h^2.
\end{align}
Now take the ensemble average of $\hat{y}$,
\begin{align}
\langle \hat{y} \rangle	
		&\approx y + g^\prime(s) \langle h \rangle + 
\frac{1}{2} g^{\prime\prime}(s) \langle h^2  \rangle, \nonumber\\
		&= y  + 
\frac{1}{2} g^{\prime\prime} (s) \textrm{V}.
\end{align}
In the above, we have taken $\langle h \rangle$ to be zero, that is, we have assumed $\hat{s}$ to be unbiased.

We can draw the following conclusions from this result. First, the bias in $\hat{y}$ is approximately $\frac{1}{2} g^{\prime\prime} (s) \textrm{V}$, where
$\textrm{V} = \langle h^2 \rangle$ is the variance of $\hat{s}$. For the Poisson example, $\textrm{V} = s$ and $g^{\prime\prime} (s) = -1/(4 s^{3/2})$,
which yields a negative bias in the estimate $\hat{y} = \sqrt{N}$ of approximately
$-1/(8 \sqrt{s})$. Second, we confirm that the MLE of
the standard deviation of a Poisson distribution is consistent; its bias is $\propto 1/ \sqrt{s}$ and therefore vanishes in the
limit $s \rightarrow \infty$. Third, the bias cannot be determined exactly because it typically depends on unknown parameters, here the Poisson mean $s$, for which only an estimate is available.
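The second-order result $-1/(8\sqrt{s})$ can be compared with the bias obtained by summing $\sqrt{N}$ over the Poisson distribution directly. The sketch below is an illustration under stated assumptions: the helper names `poisson_pmf` and `exact_bias_sqrt`, the truncation point of the sum, and the list of $s$ values are choices made here, not taken from the exercise.

```python
import math

def poisson_pmf(N, s):
    """Poisson(N, s), evaluated in log space to avoid overflow at large N."""
    return math.exp(-s + N * math.log(s) - math.lgamma(N + 1))

def exact_bias_sqrt(s, n_max=None):
    """<sqrt(N)> - sqrt(s), with the sum over N truncated far into the tail."""
    if n_max is None:
        n_max = int(s + 20.0 * math.sqrt(s) + 50)   # assumed cutoff
    mean_y = sum(math.sqrt(N) * poisson_pmf(N, s) for N in range(n_max + 1))
    return mean_y - math.sqrt(s)

for s in (1.0, 4.0, 25.0, 100.0):
    approx = -1.0 / (8.0 * math.sqrt(s))
    print("s = %6.1f  summed bias = %+.4f  second-order approx = %+.4f"
          % (s, exact_bias_sqrt(s), approx))
```

Since the expansion assumes a small error $h$, the two columns are expected to agree best at large $s$ and to differ most at small $s$.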

Particle physicists tend to favor unbiased estimates and often correct the MLEs for bias. For the
Poisson example, the obvious bias corrected estimate is
\begin{align}
\hat{y}_\textrm{cor} &= \sqrt{N} + \frac{1}{8\sqrt{N}}. 
\end{align}
You should ask whether correcting for bias makes sense. One reason why it may not is that it may waste data. Ideally, given two estimates $\hat{y}$ and $\hat{y}_\textrm{cor}$, one would hope that 
$\langle (\hat{y}_\textrm{cor} - y)^2 \rangle \leq \langle (\hat{y} - y)^2 \rangle$.
Unfortunately, bias corrections can violate this condition, that is, yield, on average, less precise results. However, the size of the violation, and where it occurs, depend on the degree of nonlinearity of the function $y = g(s)$. As shown below, for the Poisson standard deviation, a bias correction may not be helpful for values of $s < 1.5$, but is helpful for values that are greater.
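The mean squared error condition can also be evaluated without random sampling by summing over the Poisson distribution. The sketch below is hedged accordingly: it applies the correction exactly as written in the formula above, $\sqrt{N} + 1/(8\sqrt{N})$, and it assumes that the estimate is left uncorrected when $N = 0$, where the correction term is undefined. The helper names and the truncation of the sum are illustrative choices.

```python
import math

def poisson_pmf(N, s):
    """Poisson(N, s), evaluated in log space to avoid overflow at large N."""
    return math.exp(-s + N * math.log(s) - math.lgamma(N + 1))

def corrected(N):
    """sqrt(N) + 1/(8 sqrt(N)); left uncorrected at N = 0 (an assumption)."""
    y = math.sqrt(N)
    return y + 1.0 / (8.0 * y) if N > 0 else y

def mse(estimator, s):
    """<(estimator(N) - sqrt(s))^2>, with the sum over N truncated in the tail."""
    n_max = int(s + 20.0 * math.sqrt(s) + 50)   # assumed cutoff
    y = math.sqrt(s)
    return sum((estimator(N) - y) ** 2 * poisson_pmf(N, s)
               for N in range(n_max + 1))

def mse_ratio(s):
    """MSE of the corrected estimate divided by MSE of the uncorrected one."""
    return mse(corrected, s) / mse(math.sqrt, s)

for s in (0.2, 1.0, 2.0, 4.0, 100.0):
    print("s = %6.1f  MSE_cor / MSE = %.3f" % (s, mse_ratio(s)))
```

A ratio above 1 marks a value of $s$ at which this particular correction makes the estimate less precise on average; a different form of the correction term would generally give different ratios.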


$^*$This is often described as running "toy" experiments.

In [1]:
import os, sys
import ROOT
%jsroot off

Welcome to JupyROOT 6.12/06


### Setup Poisson Experiments

In [15]:
rand  = ROOT.TRandom3()
Nexp  = 25000
smin  = 0.0
smax  = 4.0
nstep = 20
step  = (smax-smin)/nstep
print("number of experiments in ensemble: %10d" % Nexp)
print("range of Poisson mean s: %8.2f to %8.2f" % (smin, smax))
print("number of steps in s:    %5d" % nstep)

number of experiments in ensemble:      25000
range of Poisson mean s:     0.00 to     4.00
number of steps in s:       20


### Run Experiments
For each value of the Poisson mean $s$, run a large number of experiments and accumulate statistics on them. The last column is the ratio of the mean squared error (MSE) of the bias-corrected estimate to that of the uncorrected estimate; a ratio greater than 1 means that the corrected estimate is, on average, less precise. Because the MSE typically scales inversely with the sample size, this ratio is roughly the factor by which the sample size would need to be increased for the corrected estimate to be as accurate as the uncorrected one.

In [22]:
print('%6s %9s %8s %8s %8s %8s %8s %11s' %
      ('s', 'y=sqrt(s)', '<y_hat>', 'bias_u', 'bias_c',
       'RMS_u', 'RMS_c', 'MSE_c/MSE_u'))

sqrt = ROOT.TMath.Sqrt

# loop over Poisson mean s
for ii in range(nstep+1):
    
    # true mean: scan s in steps of `step`; the final iteration
    # jumps to s = 100 to probe the large-s regime
    if ii < nstep:
        s = smin + (ii+1) * step
    else:
        s = 100.0
    
    # true standard deviation
    y = sqrt(s)
    
    # for uncorrected estimate of sqrt(s)
    y_u   = 0.0
    MSE_u = 0.0

    # for corrected estimate of sqrt(s)
    y_c   = 0.0
    MSE_c = 0.0
    
    # loop over experiments
    for jj in range(Nexp):
        
        # run experiment
        N = rand.Poisson(s)
        
        # MLE estimate of s
        s_hat = N
        
        # MLE estimate of sqrt(s)
        y_hat = sqrt(s_hat)
        
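        # bias corrected estimate of sqrt(s)
        # NOTE: the term added below is 1/(8*s_hat), whereas the correction
        # quoted in the text is 1/(8*sqrt(s_hat)); the numbers in the table
        # that follows are consistent with this line as written.
        y_hat_cor = y_hat
        if s_hat > 0: y_hat_cor += 1.0/(8*s_hat)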

        # accumulate statistics
        y_u   += y_hat
        MSE_u +=(y_hat - y)**2

        y_c   += y_hat_cor
        MSE_c +=(y_hat_cor - y)**2
        
    # analyze ensemble of results
    y_u    /= Nexp
    MSE_u  /= Nexp; RMS_u = sqrt(MSE_u)

    y_c    /= Nexp
    MSE_c  /= Nexp; RMS_c = sqrt(MSE_c)

    # compute bias
    bias_u  = y_u - y
    bias_c  = y_c - y

    print('%6.2f %9.2f %8.2f %8.3f %8.3f %8.2f %8.2f %11.2f' %
          (s, y, y_u, bias_u, bias_c, RMS_u, RMS_c, MSE_c/MSE_u))

     s y=sqrt(s)  <y_hat>   bias_u   bias_c    RMS_u    RMS_c MSE_c/MSE_u
  0.20      0.45     0.19   -0.257   -0.236     0.48     0.51        1.12
  0.40      0.63     0.35   -0.278   -0.241     0.59     0.62        1.10
  0.60      0.77     0.51   -0.266   -0.218     0.64     0.67        1.08
  0.80      0.89     0.65   -0.243   -0.187     0.66     0.69        1.06
  1.00      1.00     0.78   -0.219   -0.158     0.67     0.69        1.05
  1.20      1.10     0.88   -0.214   -0.151     0.68     0.69        1.03
  1.40      1.18     0.99   -0.190   -0.126     0.67     0.67        1.01
  1.60      1.26     1.09   -0.178   -0.113     0.66     0.66        1.00
  1.80      1.34     1.18   -0.160   -0.096     0.65     0.65        0.98
  2.00      1.41     1.27   -0.148   -0.085     0.64     0.63        0.97
  2.20      1.48     1.35   -0.129   -0.069     0.63     0.62        0.96
  2.40      1.55     1.42   -0.125   -0.066     0.62     0.61        0.95
  2.60      1.61     1.50   -0.112   -

## Summary
One should not be surprised if a maximum likelihood estimate is biased when the data are few and counts are low.
But we should expect any sensible procedure for constructing estimates to be consistent. After all, it 
would be a waste of resources to collect more data if results were unlikely to improve. Consistency is a more important requirement
than whether or not a procedure for constructing estimates (an estimator) is biased, especially if a bias correction would make estimates appreciably less precise.

Given two consistent procedures for estimating a parameter, how might one choose which to use? One way
is to choose the procedure with the stronger consistency, that is, the procedure whose results converge faster
to the true value of a parameter as more data are analyzed.
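
As an illustration of comparing convergence, the sketch below estimates a Poisson mean from `n_obs` repeated counts in two ways: with the sample mean, and with the unbiased sample variance, which also converges to $s$ because the variance of a Poisson distribution equals its mean. This pairing of estimators is an example chosen here, not one taken from the exercise; the sampler, the seed, `s_true = 4.0`, and the sample sizes are likewise assumptions.

```python
import math
import random

def sample_poisson(s, rng):
    """Draw one Poisson(s) count with Knuth's multiplication method (fine for small s)."""
    limit = math.exp(-s)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def rms_errors(s_true, n_obs, n_toys, rng):
    """RMS error of two estimators of s, each built from n_obs Poisson counts."""
    sq_err_mean, sq_err_var = 0.0, 0.0
    for _ in range(n_toys):
        counts = [sample_poisson(s_true, rng) for _ in range(n_obs)]
        mean = sum(counts) / float(n_obs)                           # sample mean
        var = sum((c - mean) ** 2 for c in counts) / (n_obs - 1.0)  # sample variance
        sq_err_mean += (mean - s_true) ** 2
        sq_err_var += (var - s_true) ** 2
    return math.sqrt(sq_err_mean / n_toys), math.sqrt(sq_err_var / n_toys)

rng = random.Random(2018)   # fixed seed for reproducibility
s_true = 4.0                # hypothetical true Poisson mean

for n_obs in (10, 40, 160):
    rms_mean, rms_var = rms_errors(s_true, n_obs, 2000, rng)
    print("n_obs = %4d  RMS(sample mean) = %.3f  RMS(sample variance) = %.3f"
          % (n_obs, rms_mean, rms_var))
```

The estimator whose RMS error is smaller at a given sample size is the one that makes better use of the same data.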
 