# Threshold weighted scores
## Introduction
For a long time, people have wanted to be able to evaluate the performance of forecasts of extreme events using verification methods that can't be hedged. The threshold weighted scores in `scores` allow us to do this using consistent scoring functions.

Threshold weighted scores measure the performance of forecasts with a focus on a range of decision thresholds (e.g., performance of temperature forecasts above 40°C). They do this using [consistent scoring functions](./Consistent_Scores.ipynb), which do not allow hedging. This makes them suitable for measuring the performance of predicting extremes, or any other range of decision thresholds of interest. This is particularly important because common scoring functions such as the [mean squared error (MSE)](./Mean_Squared_Error.ipynb) evaluate performance across all decision thresholds, which may hide how well a forecast predicts extremes.

The following threshold weighted scores are available in `scores`:

  - `scores.continuous.tw_squared_error`
  - `scores.continuous.tw_absolute_error`
  - `scores.continuous.tw_quantile_score`
  - `scores.continuous.tw_expectile_score`
  - `scores.continuous.tw_huber_loss`


There are two types of threshold weighting that are supported: rectangular and trapezoidal.

### Rectangular weighting
To specify a rectangular weight, set `interval_where_positive=None` and set `interval_where_one` to be the interval where the weight is 1. For example, if `interval_where_one=(0, 10)` then a weight of 1 is applied to decision thresholds satisfying 0 ≤ threshold < 10, and a weight of 0 is applied otherwise. Interval endpoints can be `-numpy.inf` or `numpy.inf`.
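
As a minimal sketch of the weight described above (an illustration only, not the code `scores` uses internally), the hypothetical helper `rectangular_weight` below returns the weight applied to a single decision threshold:

```python
import math


def rectangular_weight(threshold, interval_where_one):
    """Illustrative rectangular weight: 1 on [lower, upper), 0 elsewhere.

    `rectangular_weight` is a hypothetical helper for this tutorial,
    not part of the `scores` API.
    """
    lower, upper = interval_where_one
    return 1.0 if lower <= threshold < upper else 0.0


# Weight for interval_where_one=(0, 10)
inside = rectangular_weight(5, (0, 10))
at_upper_endpoint = rectangular_weight(10, (0, 10))

# An infinite endpoint gives a one-sided weight
above_40 = rectangular_weight(45, (40, math.inf))
```

The upper endpoint is excluded, so a threshold of exactly 10 receives a weight of 0 in this sketch.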

### Trapezoidal weighting
To specify a trapezoidal weight, specify `interval_where_positive` and `interval_where_one` using desired endpoints. For example, if `interval_where_positive=(-2, 10)` and `interval_where_one=(2, 4)` then a weight of 1 is applied to decision thresholds satisfying 2 ≤ threshold < 4. The weight increases linearly from 0 to 1 on the interval [-2, 2) and decreases linearly from 1 to 0 on the interval [4, 10], and is 0 otherwise. An endpoint of `interval_where_positive` can only be infinite if the corresponding `interval_where_one` endpoint is infinite. Corresponding endpoints of `interval_where_positive` and `interval_where_one` must differ except when they are infinite.
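
The shape of this weight can be sketched in a few lines of plain Python. The helper `trapezoidal_weight` is hypothetical and only illustrates the description above; it is not how `scores` computes the score.

```python
import math


def trapezoidal_weight(threshold, interval_where_one, interval_where_positive):
    """Illustrative trapezoidal weight for a single decision threshold.

    Hypothetical helper for this tutorial, not part of the `scores` API.
    """
    one_lower, one_upper = interval_where_one
    pos_lower, pos_upper = interval_where_positive
    if one_lower <= threshold < one_upper:
        return 1.0  # flat top of the trapezoid
    if pos_lower < threshold < one_lower:
        # rising edge: 0 at pos_lower, 1 at one_lower
        return (threshold - pos_lower) / (one_lower - pos_lower)
    if one_upper <= threshold < pos_upper:
        # falling edge: 1 at one_upper, 0 at pos_upper
        return (pos_upper - threshold) / (pos_upper - one_upper)
    return 0.0


# The example from the text
rising = trapezoidal_weight(0, (2, 4), (-2, 10))   # halfway up the rising edge
top = trapezoidal_weight(3, (2, 4), (-2, 10))      # on the flat top
falling = trapezoidal_weight(7, (2, 4), (-2, 10))  # halfway down the falling edge

# Matching infinite upper endpoints give a one-sided ramp
ramp = trapezoidal_weight(35, (40, math.inf), (30, math.inf))
```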

This implementation is based on [Taggart, R. (2022)](https://doi.org/10.1002/qj.4206) who showed that many scoring functions can be decomposed as a sum of other scoring functions.
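
For the squared error, this decomposition can be written down explicitly. The notation here is assumed for illustration: $x$ is the forecast, $y$ the observation, $\theta$ a decision threshold and $w$ the threshold weight.

```latex
(x - y)^2 = \int_{\min(x, y)}^{\max(x, y)} 2\,\lvert y - \theta \rvert \, d\theta ,
\qquad
S_w(x, y) = \int_{\min(x, y)}^{\max(x, y)} 2\, w(\theta)\, \lvert y - \theta \rvert \, d\theta .
```

Setting $w(\theta) = 1$ for every threshold recovers the squared error, while a rectangular weight restricts the integral to thresholds inside `interval_where_one`.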

## Example 1 - Rectangular Weighting
Suppose we have forecasts of the mean (i.e., expected value) rainfall amount, and we want to evaluate their performance for decision thresholds above 40mm. Let's generate the same synthetic data as in the [Consistent Scores tutorial](./Consistent_Scores.ipynb) and calculate a threshold weighted MSE (twMSE).

In [29]:
from scores.continuous import (
    mse,
    murphy_score,
    murphy_thetas,
    tw_absolute_error,
    tw_expectile_score,
    tw_huber_loss,
    tw_quantile_score,
    tw_squared_error,
)
from scipy.stats import skewnorm
import numpy as np
import xarray as xr
import plotly.graph_objects as go
from plotly.subplots import make_subplots

np.random.seed(100)

In [30]:
# Generate some synthetic rainfall observations between 0 and 50mm
N = 1000
obs = xr.DataArray(data=50 * np.random.random(N), dims=["time"], coords={"time": np.arange(0, N)})
obs = obs.clip(min=0)  # don't allow negative rainfall

# Generate synthetic forecasts by adding noise to each observation
fcst1 = 0.9 * obs + skewnorm.rvs(4, size=N)  # fcst1 has a low bias
fcst1 = fcst1.clip(min=0)  # don't allow negative rainfall
fcst2 = 1.1 * obs - skewnorm.rvs(4, size=N)  # fcst2 has a high bias
fcst2 = fcst2.clip(min=0)  # don't allow negative rainfall

In [31]:
# First, if we calculate the MSE of fcst1 and fcst2, we will see that they have similar predictive performance.
print(f"fcst1 MSE = {mse(fcst1, obs).item()}")
print(f"fcst2 MSE = {mse(fcst2, obs).item()}")

fcst1 MSE = 5.397562928167134
fcst2 MSE = 5.3763094346565685


The MSEs of the two forecasts are similar.

Let's apply a rectangular weight that gives equal weight to decision thresholds above 40mm, and gives zero weight to thresholds below 40mm and calculate the twMSE.

In [32]:
fcst1_tw_mse_upper = tw_squared_error(fcst1, obs, interval_where_one=(40, np.inf))
fcst2_tw_mse_upper = tw_squared_error(fcst2, obs, interval_where_one=(40, np.inf))
print(f"fcst1 twMSE = {fcst1_tw_mse_upper.item()}")
print(f"fcst2 twMSE = {fcst2_tw_mse_upper.item()}")

fcst1 twMSE = 2.2395142870695333
fcst2 twMSE = 3.2119291100148106


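Although their MSEs are similar, `fcst1` has a clearly lower (better) twMSE than `fcst2` for decision thresholds above 40mm. Why is this? We can visualise what's going on with a [Murphy Diagram](./Murphy_Diagrams.ipynb).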

In [33]:
# Generate a list of thresholds of interest
thetas = murphy_thetas([fcst1, fcst2], obs, "expectile")

# Calculate the average elementary score for the mean (0.5 expectile) for each threshold theta
ms1 = 2 * murphy_score(fcst1, obs, thetas, functional="expectile", alpha=0.5)
ms2 = 2 * murphy_score(fcst2, obs, thetas, functional="expectile", alpha=0.5)

fig = make_subplots(rows=2, cols=1)
fig.add_trace(
    go.Scatter(x=ms1.theta, y=ms1.total, mode="lines", name="fcst1", line=dict(color="#1b9e77")), row=1, col=1
)
fig.add_trace(
    go.Scatter(
        x=ms1.theta,
        y=ms1.total.where(ms1.theta >= 40),
        mode="lines",
        line=dict(color="rgba(27,158,119, 1)"),
        fillcolor="rgba(27,158,119, 0.5)",
        fill="tozeroy",
        showlegend=False,
    ),
    row=1,
    col=1,
)
fig.add_trace(
    go.Scatter(x=ms2.theta, y=ms2.total, mode="lines", name="fcst2", line=dict(color="#7570b3")), row=1, col=1
)
fig.add_trace(
    go.Scatter(
        x=ms2.theta,
        y=ms2.total.where(ms2.theta >= 40),
        mode="lines",
        fill="tozeroy",
        line=dict(color="#7570b3"),
        fillcolor="rgba(117,112,179, 0.5)",
        showlegend=False,
    ),
    row=1,
    col=1,
)

fig.add_trace(
    go.Scatter(x=[0, 40, 40, 55], y=[0, 0, 1, 1], mode="lines", name="weight", line=dict(color="black")), row=2, col=1
)
fig.update_layout(
    xaxis_title="Rainfall (mm)",
    yaxis_title="Economic Regret",
    width=800,
    height=600,
    legend=dict(x=0.01, y=0.99, xanchor="left", yanchor="top"),
    margin=dict(l=50,r=20,b=20,t=20)
)
fig.update_yaxes(title_text="Weight", row=2, col=1)
fig.update_xaxes(title_text="Rainfall (mm)", row=2, col=1)
fig

- Our rectangular weight is shown in the lower subplot. It shows a weighting of 1 for rainfall thresholds 40mm and above, and a weight of 0 otherwise.
- The area under the lines across all decision thresholds equals the MSE (due to the scaling factor that we added to the Murphy score calculation).
- The area under the curves where rainfall is 40mm and above is proportional to the twMSE.
- `fcst2` (purple) has a larger shaded area than `fcst1` (green) and so has a higher (worse) twMSE.


Let's now apply a rectangular weight with a value of 1 for all decision thresholds below 40mm, and a value of zero for all thresholds 40mm and higher.

In [34]:
fcst1_tw_mse_lower = tw_squared_error(fcst1, obs, interval_where_one=(-np.inf, 40))
fcst2_tw_mse_lower = tw_squared_error(fcst2, obs, interval_where_one=(-np.inf, 40))
print(f"fcst1 twMSE = {fcst1_tw_mse_lower.item()}")
print(f"fcst2 twMSE = {fcst2_tw_mse_lower.item()}")

fcst1 twMSE = 3.1580486410975848
fcst2 twMSE = 2.1643803246417592


If we add the twMSE value for thresholds above 40mm to the twMSE value for thresholds below 40mm, we can see that the sum equals the MSE.

In [35]:
fcst1_tw_mse_upper + fcst1_tw_mse_lower

In [36]:
mse(fcst1, obs)
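
To see why the two pieces add up, here is a minimal sketch for a single forecast and observation pair. It assumes the squared error can be written as the integral of `2 * |obs - theta|` over thresholds `theta` between the forecast and the observation, with a rectangular weight restricting that integral to `[lower, upper)`. The helper `tw_squared_error_rect` is hypothetical and is not the `scores` implementation.

```python
import math


def tw_squared_error_rect(fcst, obs, lower, upper):
    """Squared error restricted to decision thresholds in [lower, upper).

    Hypothetical helper: integrates 2 * |obs - theta| over the thresholds
    that lie both between fcst and obs and inside [lower, upper).
    """
    lo = max(min(fcst, obs), lower)
    hi = min(max(fcst, obs), upper)
    if lo >= hi:
        return 0.0  # no thresholds between fcst and obs fall in the interval
    # obs is never strictly inside (lo, hi), so the integral has this closed form
    return abs((hi - obs) ** 2 - (lo - obs) ** 2)


fcst_value, obs_value = 45.0, 35.0
upper_part = tw_squared_error_rect(fcst_value, obs_value, 40.0, math.inf)
lower_part = tw_squared_error_rect(fcst_value, obs_value, -math.inf, 40.0)

print(upper_part, lower_part)   # 75.0 25.0
print(upper_part + lower_part)  # 100.0, the squared error (45 - 35) ** 2
```

Because the two weights partition the thresholds at 40mm, the two integrals cover the whole range between forecast and observation exactly once.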

## Example 2 - Trapezoidal Weighting

If we want to gradually increase the weights between 30mm and 40mm, we add in some trapezoidal weighting.

To do this, we add the arg `interval_where_positive=(30, np.inf)` to the function call.

In [37]:
fcst1_tw_mse_upper = tw_squared_error(
    fcst1, obs, interval_where_one=(40, np.inf), interval_where_positive=(30, np.inf)
)
fcst2_tw_mse_upper = tw_squared_error(
    fcst2, obs, interval_where_one=(40, np.inf), interval_where_positive=(30, np.inf)
)
print(f"fcst1 twMSE = {fcst1_tw_mse_upper.item()}")
print(f"fcst2 twMSE = {fcst2_tw_mse_upper.item()}")

fcst1 twMSE = 3.2650463986170224
fcst2 twMSE = 3.90266301088566


We can also visualise this on a Murphy Diagram.

In [38]:
def weight_function(x):
    if x < 30:
        return 0
    elif 30 <= x <= 40:
        return (x - 30) / 10
    else:
        return 1


y_values1 = np.array([weight_function(x) for x in ms1.theta])
y_values2 = np.array([weight_function(x) for x in ms2.theta])

fig = make_subplots(rows=2, cols=1)
fig.add_trace(
    go.Scatter(x=ms1.theta, y=ms1.total, mode="lines", name="fcst1", line=dict(color="#1b9e77")), row=1, col=1
)
fig.add_trace(
    go.Scatter(
        x=ms1.theta,
        y=ms1.total * y_values1,
        mode="lines",
        line=dict(width=0),
        fillcolor="rgba(27,158,119, 0.5)",
        fill="tozeroy",
        showlegend=False,
    ),
    row=1,
    col=1,
)
fig.add_trace(
    go.Scatter(x=ms2.theta, y=ms2.total, mode="lines", name="fcst2", line=dict(color="#7570b3")), row=1, col=1
)
fig.add_trace(
    go.Scatter(
        x=ms2.theta,
        y=ms2.total * y_values2,
        mode="lines",
        fill="tozeroy",
        line=dict(width=0),
        fillcolor="rgba(117,112,179, 0.5)",
        showlegend=False,
    ),
    row=1,
    col=1,
)

fig.add_trace(
    go.Scatter(x=[0, 30, 40, 55], y=[0, 0, 1, 1], mode="lines", name="weight", line=dict(color="black")), row=2, col=1
)
fig.update_layout(
    xaxis_title="Rainfall (mm)",
    yaxis_title="Economic Regret",
    width=800,
    height=600,
    margin=dict(l=50,r=20,b=20,t=20),
    legend=dict(x=0.01, y=0.99, xanchor="left", yanchor="top"),
)
fig.update_yaxes(title_text="Weight", row=2, col=1)
fig.update_xaxes(title_text="Rainfall (mm)", row=2, col=1)
fig

We can see that the weight function scales the shaded area under each curve between 30mm and 40mm, with the weight increasing linearly from 0 to 1 over that range.
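
The element-by-element `weight_function` used for the plot can also be written as a single vectorised expression. This is a sketch using `np.clip`; the name `ramp_weight` and its parameters are hypothetical.

```python
import numpy as np


def ramp_weight(theta, start=30.0, full=40.0):
    """Weight that is 0 below `start`, 1 above `full`, and linear in between.

    Hypothetical helper for plotting; not part of the `scores` API.
    """
    theta = np.asarray(theta, dtype=float)
    return np.clip((theta - start) / (full - start), 0.0, 1.0)


weights = ramp_weight([20, 30, 35, 40, 50])
print(weights)  # [0.  0.  0.5 1.  1. ]
```

Passing an array of thresholds returns the matching array of weights, which can be multiplied directly with the Murphy score totals.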

## Other Threshold Weighted Scores

In this tutorial, we demonstrated the threshold weighted squared error function and used it to calculate the twMSE. The other threshold weighted scores work the same way. Here is how they can be called.

### Threshold Weighted Absolute Error

In [39]:
tw_absolute_error(
    fcst1, obs, interval_where_one=(40, np.inf), interval_where_positive=(30, np.inf)
)

### Threshold Weighted Quantile Score
We need to use the `alpha` argument to specify the quantile level.

In [40]:
tw_quantile_score(
    fcst1, obs, alpha=0.9, interval_where_one=(40, np.inf), interval_where_positive=(30, np.inf)
)

### Threshold Weighted Expectile Score
We need to use the `alpha` argument to specify the expectile level.

In [41]:
tw_expectile_score(
    fcst1, obs, alpha=0.9, interval_where_one=(40, np.inf), interval_where_positive=(30, np.inf)
)

### Threshold Weighted Huber Loss

We need to specify the Huber parameter using the `huber_param` argument.

In [42]:
tw_huber_loss(
    fcst1, obs, huber_param=2, interval_where_one=(40, np.inf), interval_where_positive=(30, np.inf)
)

## Things to try next
- Play around with different `interval_where_one` and `interval_where_positive` values.
- Rather than providing `float` endpoints for `interval_where_one` and `interval_where_positive`, provide `xr.DataArray` objects whose values vary across one or more dimensions.
- If you require more complicated weightings than rectangular and trapezoidal weightings, you can use `scores`' consistent scores. See the [Consistent Scoring Rules tutorial](./Consistent_Scores.ipynb).  

### Reference
[Taggart, R. (2022). Evaluation of point forecasts for extreme events using consistent scoring functions. Quarterly Journal of the Royal Meteorological Society, 148(742), 306-320.](https://doi.org/10.1002/qj.4206) 