In [1]:
# __future__ imports must be the first statement in the cell
# (this one is only needed on Python 2)
from __future__ import print_function

import math
import scipy as sp
import numpy as np
import numpy.polynomial as np_poly
import matplotlib.pyplot as plt
%matplotlib inline

# required for interactive plotting
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets

from IPython.display import Math
from IPython.display import Latex

initialization  

$ \newcommand{\E}[1]{\mathbb{E}\left[#1\right]}$  
$ \newcommand{\V}[1]{\mathbb{V}\left[#1\right]}$
$ \newcommand{\P}{\mathbb{P}}$

Bias-Variance Decomposition
===============

$$
\operatorname{MSE}(\hat{\theta})=
\V{\hat{\theta}}+
\left(\operatorname{Bias}(\hat{\theta},\theta)\right)^2.
$$

Proof:

\begin{array}{llr}
\operatorname{MSE}(\hat{\theta})
&\equiv \E{(\hat{\theta}-\theta)^2}
\\
&=
 \E{\left(\hat{\theta}-\E{\hat\theta}+\E{\hat\theta}-\theta\right)^2}\\
&= \E{ \left(
            \hat{\theta}-\E{\hat\theta}
        \right)^2
        +2 \left( (\hat{\theta}-\E{\hat\theta})(\E{\hat\theta}-\theta) 
           \right)
        +\left( 
            \E{\hat\theta}-\theta
         \right)^2
     }
\\
&=  \E{\left( 
          \hat{\theta}-\E{\hat\theta}
        \right)^2
     }
     +2 \mathbb{E}
        \Big[
            (\hat{\theta}-\E{\hat\theta})
            \overbrace{ (\E{\hat\theta}-\theta)}^
                      {\begin{smallmatrix}
                            \text{This is} \\
                            \text{a constant,} \\
                            \text{so it can be} \\ 
                            \text{pulled out.}
                       \end{smallmatrix}
                      }
         \Big]
       + \mathbb{E}
         \Big[
            \overbrace{ \left(\E{\hat\theta}-\theta\right)^2}^
                      {
                         \begin{smallmatrix}
                             \text{This is a} \\
                             \text{constant, so its} \\
                             \text{expected value} \\
                             \text{is itself.}
                         \end{smallmatrix}
                       }
         \Big]
\\
& = \E{
        \left( 
            \hat{\theta}-\E{\hat\theta}
         \right)^2
     }
     +2( \overbrace
             {\E{\hat\theta}-\theta}^
             {\begin{smallmatrix}
                 \text{That first} \\
                 \text{constant, now} \\
                 \text{pulled out.}
               \end{smallmatrix}
              }
        )
        \underbrace{\E{
                        \hat{\theta}-\E{\hat\theta}
                        }
                   }_
                   {=\E{\hat\theta}-\E{\hat\theta}=0}
     +\left(
          \E{\hat\theta}-\theta
       \right)^2
\\
& = \E{
        \left(
            \hat{\theta}-\E{\hat\theta}
        \right)^2
     }
     +\left(
         \E{\hat\theta}-\theta
      \right)^2
\\
& = \V{\hat\theta}+ \operatorname{Bias}(\hat\theta,\theta)^2
\end{array}
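
The decomposition can be checked numerically. The sketch below is only an illustration: the estimator (the variance estimator that divides by $n$), the $N(0, 2^2)$ data, the sample size, and all variable names are assumptions chosen for the example, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 4.0            # true parameter: the variance of N(0, 2^2) (illustrative)
n, trials = 10, 200_000

# Draw `trials` independent samples of size n and apply the biased
# variance estimator (divisor n) to each sample.
samples = rng.normal(loc=0.0, scale=np.sqrt(theta), size=(trials, n))
theta_hat = samples.var(axis=1, ddof=0)

mse = np.mean((theta_hat - theta) ** 2)   # empirical MSE
variance = theta_hat.var()                # empirical V[theta_hat]
bias = theta_hat.mean() - theta           # empirical Bias(theta_hat, theta)

print(mse, variance + bias ** 2)
print(bias, -theta / n)                   # theoretical bias of this estimator
```

Because the identity is purely algebraic, it also holds exactly for the empirical moments, so the two numbers on the first printed line agree up to floating-point rounding.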

Regression
======

Mean squared error =
$ \frac{\text{Residual Sum of squares}}
       {\text{#Degrees of Freedom}}$
                
[#Degrees of Freedom][steeltorrie1960] = 
\begin{cases}
    n-p   & \text{for p regressors}\\
    n-p-1 & \text{if an intercept is used}\\
\end{cases}
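
As a rough illustration of this definition, the sketch below fits an ordinary least-squares model with an intercept and divides the residual sum of squares by $n-p-1$. The synthetic data, the coefficients, and the noise level are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

n, p = 50, 2                                   # n observations, p regressors
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([2.0, -3.0]) + rng.normal(scale=0.5, size=n)

# Prepend a column of ones so that the fit includes an intercept.
design = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

residuals = y - design @ coef
rss = np.sum(residuals ** 2)                   # residual sum of squares
dof = n - p - 1                                # an intercept is used
mse = rss / dof

print(coef)
print(mse)      # estimates the noise variance 0.5**2 under these assumptions
```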

[steeltorrie1960]: google.com "Steel, R.G.D, and Torrie, J. H., Principles and Procedures of Statistics with Special Reference to the Biological Sciences., McGraw Hill, 1960, page 288."

todo: 
http://www.wikiwand.com/en/Errors_and_residuals_in_statistics

Mean
===

\begin{array}{llr}
\overline{X} &= \frac{1}{n} \sum_{i=1}^{n} X_i\\
\E{\overline{X}} &= \mu &\color{gray}{\text{  True Mean}}\\
\operatorname{MSE}\left(\overline{X}\right)
&= \E{\left( 
        \overline{X} - \mu
      \right)^2
   }
\\
&= \left(
     \frac{\sigma}{\sqrt{n}}
   \right)^2
\\
&= \frac{\sigma^2}{n}
\end{array}

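For a Gaussian distribution, the sample mean is the best unbiased estimator of $\mu$; that is, it has the lowest MSE among all unbiased estimators. This does not hold for every distribution: for a uniform distribution, for example, the sample mean is not the best unbiased estimator.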
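
A small simulation can illustrate both claims. This is a sketch under assumptions: the comparison estimator for the uniform case is the midrange $(\min + \max)/2$, which is not mentioned above but is a standard unbiased estimator of the centre of a uniform distribution; the sample size and trial count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 10, 200_000

def mse(estimates, truth):
    """Empirical mean squared error of an array of estimates."""
    return np.mean((estimates - truth) ** 2)

# Gaussian data with mu = 0, sigma = 1: theory predicts sigma^2 / n = 0.1.
gauss = rng.normal(size=(trials, n))
mse_mean_gauss = mse(gauss.mean(axis=1), 0.0)

# Uniform(0, 1) data with true mean 0.5: compare the sample mean
# with the midrange, which is also unbiased for the mean here.
unif = rng.uniform(size=(trials, n))
mse_mean_unif = mse(unif.mean(axis=1), 0.5)
mse_midrange_unif = mse((unif.min(axis=1) + unif.max(axis=1)) / 2, 0.5)

print(mse_mean_gauss)
print(mse_mean_unif, mse_midrange_unif)
```

Under these assumptions the midrange should show a clearly lower MSE than the sample mean on the uniform data.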



todo:  
http://www.wikiwand.com/en/Best_unbiased_estimator

Variance
=====

Corrected sample variance  
\begin{array}{llr}
S^2_{n-1}
&= \frac{1}{n-1}
   \sum_{i=1}^n
   \left(
       X_i-\overline{X}\,
   \right)^2
\\
&= \frac{1}{n-1}
   \sum_{i=1}^n
   \left(
       X_i^2 - 2X_i\overline{X} + \overline{X}^2
   \right)
\\
&= \frac{1}{n-1}
   \left(
       \sum_{i=1}^n X_i^2
       - 2\overline{X} \sum_{i=1}^n X_i
       + \sum_{i=1}^n \overline{X}^2
   \right)
\\
&= \frac{1}{n-1}
   \left(
       \sum_{i=1}^n X_i^2
       - 2\overline{X} \left( n \overline{X} \right)
       + n \overline{X}^2
   \right)
\\
&= \frac{1}{n-1}
   \left(
       \sum_{i=1}^n
       X_i^2-n\overline{X}^2
   \right).
\end{array}
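
The first and last lines of this derivation can be compared numerically against NumPy's built-in estimator. The data below are arbitrary and only serve as a sketch.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=5.0, scale=2.0, size=1_000)
n = x.size

# First line of the derivation: sum of squared deviations from the mean.
definition = np.sum((x - x.mean()) ** 2) / (n - 1)

# Last line of the derivation: sum of squares minus n times the squared mean.
shortcut = (np.sum(x ** 2) - n * x.mean() ** 2) / (n - 1)

# NumPy equivalent: ddof=1 selects the n-1 divisor.
library = np.var(x, ddof=1)

print(definition, shortcut, library)
```

The shortcut form subtracts two large, nearly equal numbers when the mean is large relative to the spread, so the first form is usually the safer one in floating-point arithmetic.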

This is unbiased since its expected value is $\sigma^2$ [?].  
Hence it is also called the unbiased sample variance.  
Its [MSE][mood1974] is:  
\begin{array}{ll}
\operatorname{MSE}(S^2_{n-1})
&= \frac{1}{n}
   \left(
       \mu_4-\frac{n-3}{n-1}\sigma^4
   \right)
\\
&= \frac{1}{n}
   \left(
       \gamma_2+\frac{2n}{n-1}
   \right)\sigma^4
\end{array}
where  
\begin{array}{ll}
\mu_4 & \text{Fourth central moment}\\
\gamma_2 = \mu_4/\sigma^4 -3 & \text{excess kurtosis}
\end{array}

When $n$ is large,
\begin{array}{llr}
\operatorname{MSE}(S^2_{n-1})
&\approx
\frac{1}{n}
\left(
    \gamma_2^{\prime} - 1
\right)
\sigma^4\\
\text{where }
& \gamma_2^{\prime}
& \text{ is kurtosis}
\end{array}
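
The closed-form MSE can be compared with a Monte Carlo estimate. The helper name `mse_unbiased_variance` and the two test distributions (standard normal with $\gamma_2 = 0$, and Uniform(0, 1) with $\sigma^2 = 1/12$, $\gamma_2 = -6/5$) are assumptions picked for this sketch.

```python
import numpy as np

def mse_unbiased_variance(n, sigma2, excess_kurtosis):
    """Closed-form MSE of S^2_{n-1}: (gamma_2 + 2n/(n-1)) * sigma^4 / n."""
    return (excess_kurtosis + 2 * n / (n - 1)) * sigma2 ** 2 / n

rng = np.random.default_rng(4)
n, trials = 10, 200_000

# Standard normal: sigma^2 = 1, excess kurtosis 0.
s2_normal = rng.normal(size=(trials, n)).var(axis=1, ddof=1)
sim_normal = np.mean((s2_normal - 1.0) ** 2)

# Uniform(0, 1): sigma^2 = 1/12, excess kurtosis -6/5.
s2_uniform = rng.uniform(size=(trials, n)).var(axis=1, ddof=1)
sim_uniform = np.mean((s2_uniform - 1 / 12) ** 2)

print(sim_normal, mse_unbiased_variance(n, 1.0, 0.0))
print(sim_uniform, mse_unbiased_variance(n, 1 / 12, -1.2))
```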

[mood1974]: google.com "Mood, A.; Graybill, F.; Boes, D. (1974). Introduction to the Theory of Statistics (3rd ed.). McGraw-Hill. p. 229."

Other estimators
-----------------

\begin{array}{ll}
\text{If  } S^2_a
&=\frac{n-1}{a} S^2_{n-1}
\\
&=
\frac{1}{a}
\sum_{i=1}^n
\left(
    X_i-\overline{X}
\right)^2
\end{array}

Then the MSE is

\begin{array}{lll}
\operatorname{MSE}(S^2_a)
&=\E{
    \left(
        \frac{n-1}{a}
        S^2_{n-1}-\sigma^2
    \right)^2 
} \\
&=
\frac{n-1}{n a^2}
\left[
    (n-1)\gamma_2+n^2+n
\right]\sigma^4
\\
& \hspace{30pt}
-\frac{2(n-1)}{a}
\sigma^4+\sigma^4
\end{array}

This is minimized when
\begin{array}{ll}
a
&=
\frac{(n-1)\gamma_2+n^2+n}{n}
\\
&= n+1+\frac{n-1}{n}\gamma_2.
\end{array}

* For a normal distribution
  * $\gamma_2=0$
  * MSE is minimized when dividing by $a=n+1$
* For a Bernoulli distribution
  * with $p=1/2$, $\gamma_2 = -2$
  * MSE is minimized when $a = n - 1 + 2/n$
* Since $\gamma_2 \ge -2$ for every distribution, the optimal $a$ always exceeds $n-1$. So irrespective of the value of $\gamma_2$, we can get a better estimate (one with lower MSE) by scaling down the unbiased estimator.
* This is an example of a shrinkage estimator, where one shrinks the estimator towards zero.
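
The formulas above translate directly into code. In this sketch the function names `optimal_divisor` and `mse_scaled_variance` are hypothetical helpers, and $n = 10$, $\sigma^2 = 1$ are arbitrary choices.

```python
import numpy as np

def optimal_divisor(n, excess_kurtosis):
    """Divisor a that minimizes MSE(S^2_a): n + 1 + (n-1)/n * gamma_2."""
    return n + 1 + (n - 1) / n * excess_kurtosis

def mse_scaled_variance(a, n, sigma2, excess_kurtosis):
    """Closed-form MSE of S^2_a = (1/a) * sum (X_i - Xbar)^2."""
    bracket = (n - 1) * excess_kurtosis + n ** 2 + n
    return sigma2 ** 2 * ((n - 1) / (n * a ** 2) * bracket - 2 * (n - 1) / a + 1)

n = 10
a_normal = optimal_divisor(n, 0.0)        # normal data: n + 1
a_bernoulli = optimal_divisor(n, -2.0)    # Bernoulli(1/2) data: n - 1 + 2/n

# For normal data, compare the divisors n-1 (unbiased), n, and n+1 (optimal).
for a in (n - 1, n, a_normal):
    print(a, mse_scaled_variance(a, n, 1.0, 0.0))
```

For normal data the printed MSE should decrease as the divisor moves from $n-1$ to $n$ to $n+1$.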

todo:  
http://www.wikiwand.com/en/Shrinkage_estimator

todo:  
http://www.wikiwand.com/en/Analysis_of_variance

Applications
=======

* Minimizing MSE is a key criterion in selecting estimators: 
  * see [minimum mean-square error](http://www.wikiwand.com/en/Minimum_mean-square_error)
  * Among unbiased estimators, minimizing the MSE is equivalent to minimizing the variance
  * The estimator that does this is the [minimum variance unbiased estimator][mvue].
  * However, a biased estimator may have lower MSE; see [estimator bias](http://www.wikiwand.com/en/Estimator_bias).
  
* In statistical modelling
  * MSE measures the average squared difference between the actual observations and the values predicted by the model
  * It is used to determine the extent to which the model fits the data
  * It also indicates whether removing some explanatory variables, which simplifies the model, is possible without significantly harming the model's predictive ability.
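
The last point can be sketched with a toy regression. Everything here is assumed for illustration: the data are synthetic, `x2` is constructed to be irrelevant, and `residual_mse` is a hypothetical helper that fits least squares with an intercept and returns RSS divided by $n-p-1$.

```python
import numpy as np

def residual_mse(X, y):
    """Fit OLS with an intercept and return RSS / (n - p - 1)."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ coef) ** 2)
    return rss / (n - p - 1)

rng = np.random.default_rng(5)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                    # plays no role in generating y
y = 1.0 + 2.0 * x1 + rng.normal(scale=0.5, size=n)

mse_full = residual_mse(np.column_stack([x1, x2]), y)
mse_without_x2 = residual_mse(x1[:, None], y)   # drop the irrelevant variable
mse_without_x1 = residual_mse(x2[:, None], y)   # drop the relevant variable

print(mse_full, mse_without_x2, mse_without_x1)
```

Dropping `x2` should leave the MSE almost unchanged, while dropping `x1` should inflate it considerably, which is the kind of comparison the bullet describes.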

[mvue]: http://www.wikiwand.com/en/Minimum_variance_unbiased_estimator "Minimum Variance Unbiased Estimator"

todo:  
1. http://www.wikiwand.com/en/Minimum_mean-square_error
1. http://www.wikiwand.com/en/Minimum_variance_unbiased_estimator
1. http://www.wikiwand.com/en/Estimator_bias
1. Lehmann, E. L.; Casella, George (1998). Theory of Point Estimation (2nd ed.). New York: Springer. ISBN 0-387-98502-6. MR 1639875.

Loss Function
========

* Squared error loss is one of the most widely used loss functions in statistics, though its widespread use stems more from mathematical convenience than considerations of actual loss in applications.
* Carl Friedrich Gauss, who introduced the use of mean squared error, was aware of its arbitrariness and agreed with objections to it on these grounds. [ref][lehmann1998]

[lehmann1998]: google.com "Lehmann, E. L.; Casella, George (1998). Theory of Point Estimation (2nd ed.). New York: Springer. ISBN 0-387-98502-6. MR 1639875."

Criticisms
----------

1. [Berger][berger1985]:
  * Mean squared error is the negative of the expected value of one specific utility function, 
    the quadratic utility function,
    which may not be the appropriate utility function to use under a given set of circumstances.
  * There are, however,
    some scenarios where MSE can serve as a good approximation to a 
    loss function occurring naturally in an application

1. Sensitivity to outliers:
  * Like variance, MSE has the disadvantage of heavily weighting outliers. This is a result of squaring each term, which effectively weights large errors more heavily than small ones. [Bermejo and Cabestany][sergio2001]
  * This property, undesirable in many applications, has led researchers to use alternatives such as the [mean absolute error][MAE], or those based on the median.
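
A tiny example makes the outlier effect concrete. The numbers are made up for illustration, and `mse` and `mae` are small helper functions defined here rather than library calls.

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

y_small_errors = y_true + 0.5        # every prediction is off by 0.5
y_one_outlier = y_true.copy()
y_one_outlier[-1] += 10.0            # a single prediction is off by 10

print(mse(y_true, y_small_errors), mae(y_true, y_small_errors))
print(mse(y_true, y_one_outlier), mae(y_true, y_one_outlier))
```

Moving from the first case to the second multiplies the MAE by 4 but the MSE by 80, because the single large error is squared.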


[berger1985]: http://www.ams.org/mathscinet-getitem?mr=0804611 "Berger, James O. (1985). '2.4.2 Certain Standard Loss Functions'. Statistical decision theory and Bayesian Analysis (2nd ed.). New York: Springer-Verlag. p. 60. ISBN 0-387-96098-8. MR 0804611"

[sergio2001]: google.com "Sergio Bermejo, Joan Cabestany (2001) 'Oriented principal component analysis for large margin classifiers', Neural Networks, 14 (10), 1447–1461."

[MAE]: http://www.wikiwand.com/en/Mean_absolute_error "wiki: Mean Absolute Error"

todo:  
http://www.wikiwand.com/en/Mean_absolute_error