Hi all,

I have seen some kernels here that estimate the median of the future time series from the median of the historical data. Some kernels in previous competitions, like [web traffic prediction](https://www.kaggle.com/c/web-traffic-time-series-forecasting), also got acceptable results with this approach. But there the goal was to minimize an MAE-based error function, while here we should minimize an MSE-based error function.

So, I have a theory. I think for MAE-based error functions, median estimation is the right approach, but for MSE-based functions, mean estimation is the better approach.
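To make the idea concrete, here is a tiny hand-checkable sketch. The five values are made up for illustration (they are not competition data), and the helper names `mse` and `mae` are only assumptions for this example:

```python
from statistics import mean, median

y = [1, 2, 3, 4, 100]  # made-up sample with one large outlier

def mse(values, m):
    """Mean squared error of the constant estimate m."""
    return sum((v - m) ** 2 for v in values) / len(values)

def mae(values, m):
    """Mean absolute error of the constant estimate m."""
    return sum(abs(v - m) for v in values) / len(values)

avg, med = mean(y), median(y)  # 22 and 3
print('MSE  mean:', mse(y, avg), ' median:', mse(y, med))
print('MAE  mean:', mae(y, avg), ' median:', mae(y, med))
```

On this sample the mean gives an MSE of 1522.0 against 1883.0 for the median, while the median gives an MAE of 20.2 against 31.2 for the mean.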

Here I show it with one numerical simulation and then prove it analytically in the next section.

So, let's make a signal with slow and fast oscillations, a trend, and some randomness:



In [None]:
import pandas as pd
import pylab as pl
import seaborn as sns
from scipy.stats import mode

def analyzer(n, Seed, Res, Plot):
    pl.seed(Seed)
    x = pl.arange(n)
    y = pl.zeros(n)  # let's make up our signal
    y += pl.sin(pl.pi * x / 100)  # slow oscillation
    y += pl.sin(pl.pi * x / 5)  # fast oscillation
    y += .01*x  # some trend
    y += .5*pl.cumsum(pl.randn(n))  # some randomness from a random walk
    y += pl.exp(pl.randn(n))*pl.randn(n)  # some outliers

    Avg = pl.mean(y)
    Med = pl.median(y)
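    bins = pl.linspace(y.min(), y.max(), 20)  # histogram bin edges
    m = mode(pl.digitize(y, bins))  # index of the most populated bin
    Mod = bins[m[0]]-(bins[1]-bins[0])/2  # center of that bin, as a rough mode estimate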

    # DataFrame.append was removed in pandas 2.0, so build the three rows and concatenate
    rows = [[Seed, name, ((y-est)**2).mean(), abs(y-est).mean()]
            for name, est in (('Avg', Avg), ('Med', Med), ('Mod', Mod))]
    new = pd.DataFrame(rows, columns=ResCols)
    Res = new if Res.empty else pd.concat([Res, new], ignore_index=True)

    if Plot:
        pl.figure(figsize=(10,10))
        pl.subplot(2,1,1)
        pl.hist(y)
        pl.xlabel('y values')
        pl.ylabel('histogram')

        pl.subplot(2,1,2)
        pl.plot(x, y)
        pl.plot(x[[0,-1]], [Avg, Avg])
        pl.plot(x[[0,-1]], [Med, Med])
        pl.plot(x[[0,-1]], [Mod, Mod])
        pl.legend(('data','Mean','Median','Mode'))
        pl.xlabel('samples')
        pl.ylabel('y values')

    return Res


n = 500
Seed = 1
ResCols = ('Seed','Estimate','MSE','MAE')
Res = pd.DataFrame([],columns=ResCols)
Res = analyzer(n, Seed, Res, Plot=True)
print(Res)


The table shows that for MSE the mean is the better estimate and for MAE the median does better; the mode does not do well in either case.

Let's run some statistics:

In [None]:
Res = pd.DataFrame([],columns=ResCols)
for Seed in range(100):
    Res = analyzer(n, Seed, Res, Plot=False)

Res2=Res.pivot(columns='Estimate',index='Seed')
print(all(Res2[('MAE','Med')]<=Res2[('MAE','Avg')]))
print(all(Res2[('MSE','Med')]>=Res2[('MSE','Avg')]))


So, it's clear that for MSE the mean estimate works best and for MAE the median is the best estimate: the two checks above compare them seed by seed over all 100 runs.

And of course, the difference between the median and mean depends on the distribution of the values of our time series.
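As a rough illustration of that dependence, the sketch below compares the gap between mean and median on a symmetric sample and on a skewed one. The two distributions (standard normal and log-normal) and the sample size are my own assumptions, picked only to contrast a symmetric shape with a long right tail:

```python
import random
from statistics import mean, median

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10000

symmetric = [rng.gauss(0, 1) for _ in range(n)]        # bell-shaped, symmetric
skewed = [rng.lognormvariate(0, 1) for _ in range(n)]  # long right tail

gap_symmetric = abs(mean(symmetric) - median(symmetric))
gap_skewed = abs(mean(skewed) - median(skewed))
print('symmetric |mean - median|:', gap_symmetric)
print('skewed    |mean - median|:', gap_skewed)
```

For the symmetric sample the two estimates nearly coincide, so the choice between them matters little; for the skewed sample they sit clearly apart, and the error function decides which one to use.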

Here are some figures:

In [None]:
pl.figure(figsize=(10,10))
ax=pl.subplot(2,2,1)
sns.violinplot(x='Estimate', y='MSE', data=Res, ax=ax)
sns.pointplot(x='Estimate', y='MSE', data=Res, ax=ax, color='y', markers=".")

ax=pl.subplot(2,2,2)
sns.violinplot(x='Estimate', y='MAE', data=Res, ax=ax)
sns.pointplot(x='Estimate', y='MAE', data=Res, ax=ax, color='y', markers=".")

pl.subplot(2,2,3)
pl.plot(Res2[('MSE','Avg')],Res2[('MSE','Med')], '.')
pl.plot([0,Res.MSE.max()],[0,Res.MSE.max()],'r')
pl.xlabel('Average')  # x axis holds the MSE of the mean estimate
pl.ylabel('Median')  # y axis holds the MSE of the median estimate
pl.title('MSE')

pl.subplot(2,2,4)
pl.plot(Res2[('MAE','Avg')],Res2[('MAE','Med')], '.')
pl.plot([0,Res.MAE.max()],[0,Res.MAE.max()],'r')
pl.xlabel('Average')  # x axis holds the MAE of the mean estimate
pl.ylabel('Median')  # y axis holds the MAE of the median estimate
pl.title('MAE')


The figures show slight differences in the errors for median and mean estimations, and as I said it depends on the distribution of the signal.

And here is the proof that I promised:

We are going to determine the best constant estimate m of the signal for a given error function. Let's first do it for MSE.

$MSE = \frac{1}{n}\sum_{i=1}^n {{\big(y_i - m\big)}^2} $

by expanding this formula we get:

$MSE = \frac{\sum_{i=1}^n y_i^2}{n} + \frac{nm^2}{n} - 2m\frac{\sum_{i=1}^n y_i}{n} $

Writing $\mathbb{E}[y^2]$ for the mean of the $y_i^2$ and $Avg$ for the mean of the $y_i$, this is a function of m that we are going to minimize with respect to m, i.e. find the m that has the lowest MSE:

$MSE(m) = \mathbb{E}[y^2] + m^2 - 2m \, Avg $

the extremum values of this function are where the derivative is zero:

$\frac{\mathrm d}{\mathrm d m}  MSE  = 2m-2Avg $ 

Setting this to zero gives m = Avg, and since the second derivative is 2 > 0, this extremum is a minimum. So clearly the best m is the average!
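A quick numerical sanity check of this result: the sketch below scans a grid of candidate values of m and picks the one with the lowest MSE. The sample and the grid step of 0.01 are assumptions chosen so the numbers are easy to follow:

```python
from statistics import mean

y = [2, 4, 4, 5, 10]  # made-up sample; its mean is 5

def mse(values, m):
    """Mean squared error of the constant estimate m."""
    return sum((v - m) ** 2 for v in values) / len(values)

# Candidate estimates 0.00, 0.01, ..., 10.00
grid = [i / 100 for i in range(1001)]
best_m = min(grid, key=lambda m: mse(y, m))
print('grid minimizer:', best_m, ' mean:', mean(y))
```

The grid minimizer lands on the sample mean, as the derivative argument predicts.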

Let's do the same thing for MAE:

$MAE = \frac{1}{n} \sum_{i=1}^n {\big|y_i - m\big|} $

Its derivative (for m not equal to any $y_i$) is:

$\frac{\mathrm d}{\mathrm d m}  MAE  = -\frac{1}{n} \Big( \frac{y_1-m}{|y_1-m|} + \frac{y_2-m}{|y_2-m|} + \dots + \frac{y_n-m}{|y_n-m|} \Big) $

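and these $\frac{y_i-m}{|y_i-m|}$ terms are +1s (for $y_i > m$) and -1s (for $y_i < m$). Therefore, the derivative is zero only if we have the same number of +1s and -1s, i.e. as many samples above m as below it, which implies m should be the median of the series! Since MAE is convex in m, this point is a minimum.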
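The counting argument can be checked directly. In the sketch below, each sample below m pushes the slope of MAE up by 1/n and each sample above m pushes it down by 1/n; the helper name `mae_slope` and the sample are assumptions made for this illustration:

```python
from statistics import median

y = [1, 2, 3, 4, 100]  # made-up sample; its median is 3

def mae_slope(values, m):
    """Slope of MAE at m, valid when m is not equal to any sample."""
    below = sum(1 for v in values if v < m)  # each contributes +1
    above = sum(1 for v in values if v > m)  # each contributes -1
    return (below - above) / len(values)

print(mae_slope(y, 2.5))  # left of the median: negative slope
print(mae_slope(y, 3.5))  # right of the median: positive slope
```

The slope changes sign as m crosses the median, so MAE decreases up to the median and increases after it, no matter how far away the outlier at 100 sits.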