# How to select an objective function using information theory

### Timothy O. Hodson<sup>1</sup>, Tyler J. Smith<sup>2</sup>, Lucy M. Marshall<sup>3</sup>, and Thomas M. Over<sup>1</sup>
    
<sup>1</sup> U.S. Geological Survey Central Midwest Water Science Center, Urbana, Illinois; <sup>2</sup> Clarkson University, Potsdam, New York; <sup>3</sup> University of New South Wales, Sydney

[![GitHub tag (latest by date)](https://img.shields.io/github/v/tag/hytest-org/workflow-hodson-2022-objective-benchmark)](https://github.com)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/hytest-org/workflow-hodson-2022-objective-benchmark/blob/main/01-objective-benchmark-demo.ipynb)

## Abstract
Science tests competing theories or models by evaluating the similarity of their predictions against observational experience. 
Thus, how we measure similarity fundamentally determines what we learn. 
In modeling and machine learning, similarity metrics are used as objective functions.
A classic example is mean squared error, which is the optimal measure of similarity when errors are independent, identically distributed (IID), and normal.
In many cases, however, the error distribution is neither normal nor IID, so it is left to the scientist to determine an appropriate objective.
Here, we review how information theory can guide that selection, then apply the approach with a simple hydrologic model.

## Introduction
Science seeks to create useful representations of reality in the form of theories, hypotheses, or models.
What sets science apart from other pursuits is that it rigorously tests those representations against observational experience:
all other things being equal, favor that which "best" fits the evidence.
An analogous process occurs when calibrating a numerical model or evaluating among competing models.
To select the "best" model, experiment by varying the model while keeping the test data and objective fixed.
If mean squared error (MSE) is the objective, compute the MSE between the test data and the model predictions, then select the model with the lowest MSE.
But why choose MSE and not another objective function?
The answer: MSE is the optimal measure when errors are normally distributed and IID.
But for many problems, the true error distribution is complex or unknown.

To select the "best" objective for those cases, the experiment is essentially the same except the objective is varied while the model and data are held fixed.
Now, select the objective indicating the greatest similarity between data and model.
Different objectives have different scales, however, so they are "normalized"
such that each integrates to one, thereby representing them as probability distributions.
The normalized form of MSE is the normal distribution, for example.
When used to evaluate model fit, the probability distribution is called a likelihood function
and its output the likelihood
(for additional review, see Hodson, 2022).
So, to select among objectives, compute their likelihoods, then favor the most likely.
The convention is to work with the natural logarithm of the likelihood, denoted as $\ell$.
The logarithm does not change how the models rank
but simplifies the math by converting products to sums.
Likelihoods multiply, so log likelihoods add.
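
The ranking step can be sketched as follows. This is a minimal illustration on made-up data: the two candidate predictions, `pred_a` and `pred_b`, and the helper names are hypothetical. It assumes zero-mean normal errors with the scale set to its maximum-likelihood value. Under that assumption the log likelihood is a monotone function of MSE, so both criteria favor the same model.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(10.0, 2.0, size=500)           # synthetic "observations"
pred_a = y + rng.normal(0.0, 0.5, size=500)   # hypothetical model A (small error)
pred_b = y + rng.normal(0.0, 1.5, size=500)   # hypothetical model B (large error)


def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)


def normal_loglik(y, y_hat):
    # zero-mean normal errors; sigma set to its maximum-likelihood value
    e = y - y_hat
    sigma = np.sqrt(np.mean(e ** 2))
    # log likelihoods of independent observations add
    return norm.logpdf(e, loc=0.0, scale=sigma).sum()


for name, pred in [("A", pred_a), ("B", pred_b)]:
    print(name, round(mse(y, pred), 3), round(normal_loglik(y, pred), 1))
```

With sigma fixed this way, the log likelihood reduces to $-\frac{n}{2}\left(\ln(2\pi\,\mathrm{MSE}) + 1\right)$, which is why the two rankings agree.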


## Uncertainty to information
Thus far, the selection problem is posed in terms of probability theory,
but information theory gives an equally valid interpretation.
The former seeks the most likely representation, whereas the latter seeks that with the best compression
(Cover and Thomas, 2006).
Their equivalence is shown later,
but first, we review three fundamental concepts from information theory.
The first is the entropy $H(D)$, which is the expected information in each new observation of the data $D$.
The second is the conditional entropy $H(D|M)$,
which is the additional information needed to represent $D$ when it is encoded with some model $M$
(think of it as the information in the model error).
Finally, the difference between these terms gives the mutual information
$$
I(D; M) = H(D) - H(D | M) \text{,}
$$
which measures how much information $M$ encodes about $D$.
The "best" model or, in this case, objective maximizes $I$.
Hereafter, model and objective are used interchangeably,
but in a formal sense, the objective is a model representing the error, or information loss, between $M$ and $D$.
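
For intuition, the three quantities can be computed directly for a small discrete case. The sketch below uses a hypothetical two-by-two joint distribution `p_dm` (not taken from the streamflow data) and reports everything in bits.

```python
import numpy as np

# hypothetical joint distribution p(d, m): rows index D, columns index M
p_dm = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

p_d = p_dm.sum(axis=1)       # marginal distribution of D
p_m = p_dm.sum(axis=0)       # marginal distribution of M
p_d_given_m = p_dm / p_m     # column j holds p(d | m = j)

H_D = -np.sum(p_d * np.log2(p_d))                    # entropy of D
H_D_given_M = -np.sum(p_dm * np.log2(p_d_given_m))   # conditional entropy
I_DM = H_D - H_D_given_M                             # mutual information

print(H_D, H_D_given_M, I_DM)
```

Here $D$ alone carries 1 bit per observation; knowing $M$ leaves roughly 0.72 bits unexplained, so $M$ encodes the remaining 0.28 bits.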

If the data are fixed, the information in the data, $H(D)$, is a constant $C$.
Now, the connection to probability theory:
as the number of observations $n$ goes to infinity,
the average log likelihood equals the negative conditional entropy equals the mutual information up to a constant,
$$
I_C = - H(D|M) = \ell / n \text{,}
$$
where the natural logarithm gives units of nats.
Dividing by $\ln(2)$ converts the result to bits.
In other words, the most likely objective function represents the error with the fewest bits.
For finite $n$, the average negative log likelihood, $-\ell/n$, still gives an unbiased estimate of the conditional entropy,
when assessed "out of sample,"
meaning the test data were not used to calibrate the model.
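
As a rough numerical check of this relation, the sketch below draws synthetic normal errors, fits the scale on one sample, and evaluates the log likelihood on a second, held-out sample. The sample sizes, seed, and variable names are arbitrary choices for illustration. Because the errors are continuous, the reference value is the differential entropy of the normal distribution, $\frac{1}{2}\ln(2\pi e \sigma^2)$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
sigma_true = 2.0
train = rng.normal(0.0, sigma_true, size=200_000)  # used to fit the error model
test = rng.normal(0.0, sigma_true, size=200_000)   # held out for evaluation

sigma_hat = train.std()
ll = norm.logpdf(test, loc=0.0, scale=sigma_hat).sum()

nats = -ll / len(test)      # average negative log likelihood
bits = nats / np.log(2)     # same quantity expressed in bits
analytic_nats = 0.5 * np.log(2 * np.pi * np.e * sigma_true**2)

print(nats, analytic_nats, bits)
```

The out-of-sample estimate and the analytic entropy should agree closely, with the gap shrinking as the held-out sample grows.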

So to benchmark an objective, use the log likelihood to estimate the information content of the model error.
Log likelihoods for several objectives are implemented below.


In [1]:
# compute likelihood
import numpy as np
from scipy.stats import pearsonr
import scipy.stats


def normal_ll(y, y_hat, transform=None, gradient=1):
    '''Log likelihood for the normal distribution with change of variable
    
    The normal distribution is the formal likelihood for the mean squared error (MSE).
    

    Parameters
    ----------
    y : array_like
        Observations.
    y_hat : array_like
        Predictions.
    transform : function
        Change of variable transformation.
    gradient : float or array_like
        Derivative of the transform, evaluated at the observations.
        
    Proof
    -----
    https://www.statlect.com/probability-distributions/normal-distribution
    '''
    if transform is not None:
        y = transform(y)
        y_hat = transform(y_hat)
        
    e = y - y_hat
    n = len(e)
    sigma = e.std()
    log_gradient = np.sum(np.log(np.abs(gradient)))
    ll = -n * np.log(sigma) - n/2*np.log(2*np.pi) - 1/(2*sigma**2) * (e**2).sum() + log_gradient
    return ll


def laplace_ll(y, y_hat, transform=None, gradient=1):
    '''Log likelihood for Laplace distribution with change of variable
    
    The laplace distribution is the formal likelihood for the mean absolute
    error (MAE).
    
    Parameters
    ----------
    y : array_like
        Observations.
    y_hat : array_like
        Predictions.
    transform : function
        Change of variable transformation.
    gradient : float or array_like
        Derivative of the transform, evaluated at the observations.
    '''
    if transform is not None:
        y = transform(y)
        y_hat = transform(y_hat)
        
    e = np.abs(y - y_hat)  # works for NumPy arrays as well as pandas objects
    n = len(e)
    b = e.mean()
    log_gradient = np.sum(np.log(np.abs(gradient)))
    ll = -n * np.log(2*b) - 1/b * e.sum() + log_gradient
    return ll
                                   

def msre_ll(y, y_hat):
    '''Log likelihood for mean squared square-root error
    
    Parameters
    ----------
    y : array_like
    y_hat : array_like
    '''
    # d/dy sqrt(y) = 1/(2*sqrt(y)); only its absolute value enters the likelihood
    return normal_ll(y, y_hat, transform=lambda x: np.sqrt(x), gradient=1/(2*np.sqrt(y)))


def mare_ll(y, y_hat):
    '''Log likelihood for mean absolute square-root error
    
    Parameters
    ----------
    y : array_like
    y_hat : array_like
    '''
    # d/dy sqrt(y) = 1/(2*sqrt(y)); only its absolute value enters the likelihood
    return laplace_ll(y, y_hat, transform=lambda x: np.sqrt(x), gradient=1/(2*np.sqrt(y)))


def lognormal_ll(y, y_hat):
    '''Lognormal log likelihood
    
    The lognormal distribution is the formal likelihood for the mean squared
    log error (MSLE).
    
    Parameters
    ----------
    y : array_like
    y_hat : array_like
    '''
    return normal_ll(y, y_hat, transform=lambda x: np.log(x), gradient=1/y)


def mspe_ll(y, y_hat):
    '''Log likelihood for mean squared percentage error
    
    Parameters
    ----------
    y : array_like
    y_hat : array_like
    
    '''
    #return normal_ll(y, y_hat, transform=lambda x: x/y, gradient=1/np.abs(y))
    return normal_ll(y, y_hat, transform=lambda x: x/y, gradient=-1/(y**2)) 


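def nse_ll(y, y_hat, group='gage_id'):
    '''Log likelihood for normalized squared error (NSE)
    
    NSE is equivalent to the Nash–Sutcliffe model efficiency coefficient.
    
    Parameters
    ----------
    y : pandas.Series
        Observations.
    y_hat : pandas.Series
        Predictions.
    group : string
        Name of the grouping key (for example, the streamgage identifier).
    '''
    # normalize by the standard deviation of the observations at each gage
    sigma_o = y.groupby(group).transform(lambda x: x.std())
    return normal_ll(y, y_hat, transform=lambda x: x/sigma_o, gradient=1/sigma_o)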


def loglaplace_ll(y, y_hat):
    '''Log likelihood for log Laplace distribution
    
    Parameters
    ----------
    y : array_like
    y_hat : array_like
    '''
    return laplace_ll(y, y_hat, transform=lambda x: np.log(x), gradient=1/y)


def uniform_ll(y, y_hat):
    '''Log likelihood for uniform distribution.
    
    The uniform log likelihood minimizes the maximum error.
    
    Parameters
    ----------
    y : array_like
    y_hat : array_like
    '''
    e = np.abs(y - y_hat)
    n = len(e)
    #ll = -n * np.log(e.max()-e.min()) # standard formulation
    ll = -n * np.log(e.max() - 0)
    return ll


def bernoulli_ll(y, y_hat, groupby=None):
    '''TODO and use within zi_ll
    
    Parameters
    ----------
    y : array_like
    y_hat : array_like
    '''
    pass



def zi_ll(y, y_hat, ll=normal_ll, threshold=0.01, groupby=None):
    '''Zero-inflated log likelihood.
    
    Parameters
    ----------
    y : array_like
    y_hat : array_like
    ll : function
        Log likelihood applied where both y and y_hat exceed the threshold.
    threshold : float
        Values at or below this are treated as zero.
    groupby : string
        Optional groupby term (testing)
    '''
    y_o = y <= threshold
    y_hat_o = y_hat <= threshold
    
    if groupby is None:
        n1 = (y_o & y_hat_o).sum() # correct zero-flow prediction
        n2 = (y_o ^ y_hat_o).sum() # incorrect zero-flow prediction 
    else:
        n1 = (y_o & y_hat_o).groupby(groupby).sum() # correct zero-flow prediction
        n2 = (y_o ^ y_hat_o).groupby(groupby).sum() # incorrect zero-flow prediction

    n3 = (~y_o & ~y_hat_o) # correct flow predictions
    
    # fraction of correctly predicted zero flows
    rho = np.where( (n1+n2) == 0, 0, n1 / (n1 + n2))
    n_rho = 1-rho
    
    # n1 * np.log(rho) + n2 * np.log(1-rho)
    ll_zero = n1[rho!=0] * np.log(rho[rho!=0]) + n2[n_rho!=0]* np.log(n_rho[n_rho!=0])
    
    return ll_zero.sum() + ll(y[n3], y_hat[n3])


def zilognormal_ll(y, y_hat):
    '''Log likelihood for zero-inflated lognormal.
    
    Parameters
    ----------
    y : array_like
    y_hat : array_like
    '''
       
    return zi_ll(y, y_hat, ll=lognormal_ll, threshold=0.01)


def ziloglaplace_ll(y, y_hat):
    '''Log likelihood for zero-inflated log-Laplace.
    
    Parameters
    ----------
    y : array_like
    y_hat : array_like
    '''
    return zi_ll(y, y_hat, ll=loglaplace_ll, threshold=0.01)
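
The change-of-variable term used above can be sanity-checked against a density that SciPy already provides. The sketch below is self-contained and uses synthetic data with a multiplicative error; the variable names are illustrative. It compares a normal log likelihood of the log-transformed error plus the log-Jacobian, $\sum \ln(1/y)$, with `scipy.stats.lognorm` evaluated with its median set to the prediction.

```python
import numpy as np
from scipy.stats import norm, lognorm

rng = np.random.default_rng(1)
y_hat = rng.uniform(1.0, 100.0, size=1000)            # synthetic predictions
y = y_hat * np.exp(rng.normal(0.0, 0.3, size=1000))   # multiplicative error

e = np.log(y) - np.log(y_hat)
sigma = np.sqrt(np.mean(e**2))

# normal log likelihood of the transformed error plus the log-Jacobian,
# where d(log y)/dy = 1/y
ll_change_of_variable = norm.logpdf(e, scale=sigma).sum() + np.log(1.0 / y).sum()

# the same quantity from SciPy's lognormal density with median y_hat
ll_lognorm = lognorm.logpdf(y, s=sigma, scale=y_hat).sum()

print(ll_change_of_variable, ll_lognorm)
```

The two values agree to floating-point precision. Dropping the Jacobian term would leave a likelihood in transformed units that cannot be compared with objectives defined on the original scale.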


## Weights

Given the estimated conditional entropies $\hat H_i$ for a set of $m$ models,
the "weight" of evidence for each model $w_i$ is
$$
w_i = \frac{ x^{-\hat H_i} }{ \sum^{m}_{j=1} x^{-\hat H_j}  }
$$
where the base $x$ is 2 for bits or $e$ for nats (Burnham and Anderson, 2002).

In [2]:
def compute_weights(series, base=np.e):
    '''Compute posterior weights
    
    Parameters
    ----------
    series : array_like
        Log likelihoods
    base: float
        Base of the logarithm used to compute log likelihood
    '''
    s = base**series
    return s/s.sum()
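
One practical caveat: raising the base directly to large-magnitude exponents can underflow to zero for every model, leaving the ratio undefined. A common workaround, sketched below under the hypothetical name `stable_weights`, is to subtract the largest exponent before exponentiating, which leaves the weights unchanged. The entropies in the example are the rounded values from the benchmark table in this document.

```python
import numpy as np
import pandas as pd


def stable_weights(neg_entropy, base=np.e):
    '''Weights from negative entropies, shifted to avoid underflow.'''
    x = np.asarray(neg_entropy, dtype=float)
    x = x - x.max()            # the largest term becomes base**0 = 1
    s = np.power(base, x)
    return s / s.sum()


bits = pd.Series({
    'MSPE': 23.54, 'U': 18.17, 'MSE': 11.62, 'NSE': 11.2, 'MAE': 9.49,
    'MSLE': 7.47, 'MARE': 7.34, 'ZMSLE': 7.18, 'MALE': 7.04, 'ZMALE': 6.95,
})
weights = pd.Series(stable_weights(-bits, base=2), index=bits.index)
print(weights.round(2))

# exponents this large would underflow without the shift
extreme = stable_weights([-1e7, -1e7 - 1])
```

The shift cancels in the ratio, so the result matches the unshifted formula whenever the latter is computable.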

## Benchmark demonstration
As demonstration, we benchmark the entropies of several objective functions used with streamflow models.
The test data are streamflow observations from 1,385 streamgages in the conterminous U.S. (Russell and others, 2020);
roughly 14 million observations.
Streamflow can be zero or negative, which is undefined for certain objectives, like those with log transformations.
For the demonstration, flows below 0.00028 cubic meters per second (0.01 cfs) were thresholded and treated as the "zero flow" state.
Different thresholds may yield slightly different results,
particularly for objectives that are more sensitive at low flow, like those using logs.

The model is simple: 
predict streamflow at a particular time and location by scaling the nearest observation by the ratio of the two drainage areas.
So when predicting flow in a large river using observations from a smaller one, scale up the observations.
By nature, the predictions are out of sample, so neither cross validation nor bias adjustment is necessary. 
Though the demonstration is almost trivial, the same procedure can be applied with machine learning or physics-based simulations.
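
The scaling step of that model can be sketched in a few lines. The function name, argument names, and numbers below are illustrative only; they are not the code used to produce the dataset.

```python
import numpy as np


def drainage_area_ratio(q_donor, area_donor, area_target):
    '''Scale donor-gage flow by the ratio of target to donor drainage area.'''
    return np.asarray(q_donor, dtype=float) * (area_target / area_donor)


q_small = np.array([1.0, 2.5, 0.4])   # made-up flows at the nearest (donor) gage
q_large = drainage_area_ratio(q_small, area_donor=50.0, area_target=200.0)
print(q_large)   # the area ratio is 4, so each flow is scaled up fourfold
```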

In [3]:
# load data
import pandas as pd
import xarray as xr
import numpy as np

ds = xr.open_dataset('GAGESII_reference_NNDAR.nc')
df = ds.to_dataframe()
df = df.dropna(how='any')
#df = pd.read_parquet()
df[df < 0.01] = 0.01

In [4]:
# step 1: create a table of objective functions
objectives = {
    'U' : {'name':'uniformly distributed error', 'f':uniform_ll},
    'MSE' : {'name':'mean squared error', 'f':normal_ll},
    'NSE' : {'name':'normalized squared error', 'f':nse_ll},
    'MAE' : {'name': 'mean absolute error', 'f':laplace_ll},
    'MSPE' : {'name': 'mean squared percent error', 'f':mspe_ll},
    'MSLE' : {'name':'mean squared log error*', 'f':lognormal_ll},
    'MALE' : {'name':'mean absolute log error*', 'f':loglaplace_ll},
    'ZMSLE' : {'name':'zero-inflated MSLE', 'f':zilognormal_ll},
    'ZMALE' : {'name':'zero-inflated MALE', 'f':ziloglaplace_ll},
    'MARE' : {'name':'mean absolute square root error', 'f':mare_ll},
}

obj_df = pd.DataFrame.from_dict(objectives, orient='index')

# step 2: compute the information in each objective function
for index, row in obj_df.iterrows():
    # the negative log likelihood per observation is the information in the
    # error, in nats; dividing by ln(2) converts it to bits
    obj_df.loc[index, 'bits'] = - row.f(df.obs, df.NNDAR)/len(df)/np.log(2)

# step 3: compute weights
obj_df['weight'] = compute_weights(-obj_df.bits, base=2)

# step 4: format output table
# sort ascending by weight, so the last row is the best objective (rank 1)
table = obj_df[['name','bits','weight']].sort_values('weight').round(2)

table['rank'] = np.arange(len(table), 0, -1)
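
The same steps can be tried on synthetic data where the answer is known in advance. The sketch below is a self-contained illustration, separate from the streamflow benchmark: it generates predictions with a multiplicative (lognormal) error, then scores the normal (MSE) and lognormal (MSLE) likelihoods in bits. The helper names are hypothetical, and for brevity the scale parameter is fit on the same data being scored.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
n = 50_000
y_hat = rng.lognormal(mean=2.0, sigma=1.0, size=n)   # synthetic predictions
y = y_hat * np.exp(rng.normal(0.0, 0.5, size=n))     # multiplicative error


def bits_normal(y, y_hat):
    e = y - y_hat
    sigma = np.sqrt(np.mean(e**2))
    return -norm.logpdf(e, scale=sigma).sum() / len(y) / np.log(2)


def bits_lognormal(y, y_hat):
    e = np.log(y) - np.log(y_hat)
    sigma = np.sqrt(np.mean(e**2))
    # subtracting sum(log y) applies the log-Jacobian of the log transform
    ll = norm.logpdf(e, scale=sigma).sum() - np.log(y).sum()
    return -ll / len(y) / np.log(2)


bits = {'MSE': bits_normal(y, y_hat), 'MSLE': bits_lognormal(y, y_hat)}
best = min(bits, key=bits.get)
print(bits, best)
```

Because the synthetic error really is lognormal, MSLE should encode it in fewer bits than MSE, which is the behavior the benchmark relies on.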

Results are shown in the table below.
The best objective represents the error ($\hat H$) in the fewest bits.
For this particular benchmark, zero-inflated mean absolute log error (ZMALE) performed best,
with an entropy of 6.95 bits.
For comparison, MSE was 11.62 bits.
The magnitudes are unimportant here, only the differences.
The data and model were fixed;
all that changed was how we quantified it.
The real information content is unchanged.
Relative to ZMALE, the excess bits in the other objectives are noise.
So, MSE measures at least 40 percent noise,
and NSE, the de facto standard in hydrologic modeling, at least 38 percent.
Noisier objectives result in models that require more iterations of calibration, more data, and more storage space
(better model, better data compression).
A well-known example is stochastic gradient descent, where the noisier gradient requires more iterations to reach the solution.
In that case, each iteration completes faster, so the solution can be reached quicker overall.
A poorly chosen objective incurs a similar penalty but without a benefit.

Despite their near-ubiquitous use as a basis for learning,
objective functions are rarely, if ever, benchmarked in many problem domains.
We hope this simple demonstration helps make the practice commonplace.
After all, how well machines, and scientists, learn and think can depend a lot on how they measure similarity.

In [12]:
#print(table.to_latex())
table

| objective | name | bits | weight | rank |
|---|---|---|---|---|
| MSPE | mean squared percent error | 23.54 | 0.0 | 10 |
| U | uniformly distributed error | 18.17 | 0.0 | 9 |
| MSE | mean squared error | 11.62 | 0.01 | 8 |
| NSE | normalized squared error | 11.2 | 0.01 | 7 |
| MAE | mean absolute error | 9.49 | 0.04 | 6 |
| MSLE | mean squared log error* | 7.47 | 0.15 | 5 |
| MARE | mean absolute square root error | 7.34 | 0.17 | 4 |
| ZMSLE | zero-inflated MSLE | 7.18 | 0.19 | 3 |
| MALE | mean absolute log error* | 7.04 | 0.21 | 2 |
| ZMALE | zero-inflated MALE | 6.95 | 0.22 | 1 |


In [8]:
# compute minimum noise in MSE
mse_bits = table.loc['MSE', 'bits']
nse_bits = table.loc['NSE', 'bits']
zmale_bits = table.loc['ZMALE', 'bits']
print(f'MSE is at least {round(100*(mse_bits - zmale_bits) / mse_bits)} percent noise.')

MSE is at least 40 percent noise.


In [9]:
# compute minimum noise in NSE
print(f'NSE is at least {round(100*(nse_bits - zmale_bits) / nse_bits)} percent noise.')

NSE is at least 38 percent noise.


## Data availability
The streamflow data are from Russell et al. (2020) and are available at https://doi.org/10.5066/P9XT4WSP.
The Python code used to compute the likelihoods is available in a demonstration notebook at https://github.com/thodson-usgs/objective-manuscript.

## Disclaimer
Any use of trade, firm, or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government.

## Acknowledgements
The authors thank Hoshin V. Gupta for his encouragement and some inspiring discussion while waiting for his flight after HydroML 2022.
Funding for this research was provided by the Hydro-terrestrial Earth Systems Testbed (HyTEST) project of the U.S. Geological Survey Integrated Water Prediction program.

## References 

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19 (6): 716–723. https://doi.org/10.1109/tac.1974.1100705

Burnham, K.P. and Anderson, D.R. (2002). Model selection and multimodel inference: A Practical Information-Theoretic Approach. 2nd Edition, Springer-Verlag, New York.

Cover, T.M., and Thomas, J. A. (2006). Elements of information theory, 2nd edition, John Wiley & Sons.

Hodson, T.O. (2022). Root-mean-square error (RMSE) or mean absolute error (MAE): when to use them or not, Geoscientific Model Development, 15, 5481–5487. https://doi.org/10.5194/gmd-15-5481-2022

Russell, A.M., Over, T.M., and Farmer, W.H. (2020). Cross-validation results for five statistical methods of daily streamflow estimation at 1,385 reference streamgages in the conterminous United States, Water Years 1981-2017: U.S. Geological Survey data release. https://doi.org/10.5066/P9XT4WSP

Schwarz, G.E. (1978). Estimating the dimension of a model. Annals of Statistics, 6 (2), 461–464. https://doi.org/10.1214/aos/1176344136

Smith, T., Sharma, A., Marshall, L., Mehrotra, R., and Sisson, S. (2010). Development of a formal likelihood function for improved Bayesian inference of ephemeral catchments, Water Resources Research, 46. https://doi.org/10.1029/2010wr009514