# Block bootstrapping technique
Block bootstrapping is a resampling technique commonly used in time series analysis to account for the autocorrelation structure in the data. By resampling contiguous blocks rather than individual points, it preserves the correlation within each block and so represents the underlying process more faithfully. This technique can be used to calculate confidence intervals, which quantify the uncertainty in an estimate. When comparing verification metrics of two forecast systems, these confidence intervals let us assess whether the difference between the systems is statistically significant, and so judge more reliably which system performs better.

The block bootstrapping function available in `scores` includes an optional argument that specifies whether the resampling should be done circularly. Circular block bootstrapping treats the data as circular, so that a block can wrap around from the end of the series back to the beginning. This method is particularly useful for cyclical or seasonal time series data, where the end of the series naturally follows the beginning. It also helps maintain the continuity and dependence structure of the data.
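
To make the wrap-around idea concrete, here is a minimal one-dimensional sketch written with NumPy only. It is an illustration of the general technique, not the implementation used by `scores`; the function name `circular_block_resample` and its arguments are hypothetical.

```python
import numpy as np


def circular_block_resample(series, block_size, rng):
    """Draw one circular block bootstrap resample of a 1-D series (illustrative only)."""
    series = np.asarray(series)
    n = series.size
    n_blocks = int(np.ceil(n / block_size))
    # Each block may start anywhere, including close to the end of the series
    starts = rng.integers(0, n, size=n_blocks)
    # Indices that run past the end wrap back to the beginning (modulo n)
    idx = (starts[:, None] + np.arange(block_size)[None, :]) % n
    # Join the blocks and trim to the original length
    return series[idx.ravel()[:n]]


rng = np.random.default_rng(0)
series = np.arange(10)
sample = circular_block_resample(series, block_size=4, rng=rng)
print(sample)
```

Within each block the values are consecutive elements of the original series, and a block that starts near the end continues from index 0 rather than being truncated.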

Similar to the **[Diebold Mariano (DM) test](./Diebold_Mariano_Test_Statistic.ipynb)**, the block bootstrapping technique does not require forecast errors to be Gaussian. Since it is a non-parametric resampling method, it does not rely on specific distributional assumptions. This characteristic, along with its ability to handle serially correlated scores, makes block bootstrapping a suitable alternative to traditional statistical techniques for calculating confidence intervals.

## References
Gilleland, E. (2020) Bootstrap methods for statistical inference. Part I: comparative forecast verification for continuous variables. *Journal of Atmospheric and Oceanic Technology*, 37, 2117-2134.

Wilks, D.S. (2011) *Statistical methods in the atmospheric sciences* (Vol. 100). Academic Press.


Here we show an example of using block bootstrapping to calculate the confidence that one forecast system performed better than another.

In [1]:
import xarray as xr
import numpy as np
import pandas as pd
import plotly.graph_objs as go
from scores.processing import block_bootstrap
from scores.continuous import mse

Let's generate some synthetic observations and two corresponding synthetic forecasts.

In [2]:
# Create synthetic observations (here temperature) on a 4 x 4 grid (16 points) over 365 days in 2023
n_time = 365
n_x = n_y = 4
n_lead_day = 7
np.random.seed(42)
obs = xr.DataArray(
    data=np.random.randn(n_time, n_x, n_y).cumsum(axis=0),
    dims=["time", "x", "y"],
    coords={"time": pd.date_range("2023-01-01", "2023-12-31"), "x": np.arange(n_x), "y": np.arange(n_y)},
    name="temperature",
)

fcst = xr.DataArray(data=[1] * n_lead_day, dims="lead_day", coords={"lead_day": np.arange(1, n_lead_day + 1)})
fcst = fcst * obs

# Create two synthetic forecasts that are based on the observations + noise
# Here, we consider a larger noise for forecast system B
fcst_a = fcst + np.random.randn(n_lead_day, n_time, n_x, n_y) * 0.5
fcst_b = fcst + np.random.randn(n_lead_day, n_time, n_x, n_y) * 0.7

Now let's resample the observations and forecasts using the circular block bootstrapping technique with 1000 iterations. Here we use a block size of 2 for the `x` and `y` dimensions and 30 for the `time` dimension. In your case, make sure that the block sizes are longer than the dependence length in your data. Additionally, we exclude the `"lead_day"` dimension from our bootstrapping process.

In [3]:
fcst_a_boot, fcst_b_boot, obs_boot = block_bootstrap(
    [fcst_a, fcst_b, obs],
    blocks={"x": 2, "y": 2, "time": 30},
    exclude_dims=[["lead_day"], ["lead_day"], []],
    n_iteration=1000,
    circular=True,
)
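
As a rough aid for the block size advice above, one option is to look at how quickly the sample autocorrelation of a series (for example, the forecast errors at a single grid point) decays. The sketch below is a heuristic under stated assumptions: the helper `first_lag_below` is hypothetical, and the threshold of 0.2 is an arbitrary choice rather than a recommendation from `scores`.

```python
import numpy as np


def first_lag_below(series, threshold=0.2, max_lag=100):
    """Return the first lag whose sample autocorrelation is below `threshold`, or None."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    for lag in range(1, min(max_lag, x.size - 1) + 1):
        # Standard (biased) sample autocorrelation at this lag
        acf = np.dot(x[:-lag], x[lag:]) / denom
        if acf < threshold:
            return lag
    return None


rng = np.random.default_rng(0)
noise = rng.standard_normal(5000)

# Hypothetical AR(1) series with coefficient 0.9 to mimic serially correlated errors
ar1 = np.empty_like(noise)
ar1[0] = noise[0]
for t in range(1, noise.size):
    ar1[t] = 0.9 * ar1[t - 1] + noise[t]

lag_noise = first_lag_below(noise)
lag_ar1 = first_lag_below(ar1)
print(lag_noise, lag_ar1)
```

The uncorrelated series drops below the threshold immediately, while the autocorrelated series takes many more lags. A block size comfortably larger than the returned lag is a reasonable starting point, to be checked against knowledge of the data.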

Now let's calculate the mean squared error (MSE) for each system:

In [4]:
fcst_a_mse = mse(fcst_a_boot, obs_boot, preserve_dims=["lead_day", "iteration"])
fcst_b_mse = mse(fcst_b_boot, obs_boot, preserve_dims=["lead_day", "iteration"])

In [5]:
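# Difference in MSE for each lead day and bootstrap iteration (negative values favour system A)
diff = fcst_a_mse - fcst_b_mse
mean = diff.mean(dim="iteration")
# The 5th and 95th percentiles across iterations bound a 90% confidence interval
lower_band = diff.quantile(0.05, dim="iteration")
upper_band = diff.quantile(0.95, dim="iteration")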
xr.Dataset(
    {
        "mean": mean,
        "ci_upper": upper_band.drop_vars("quantile"),
        "ci_lower": lower_band.drop_vars("quantile"),
    }
)

Here, we can also visualise the mean difference in MSE of these two systems along with 90% confidence bands. This figure shows that forecast system A performed better than forecast system B ($MSE_A < MSE_B$) at the 10% significance level, as expected given its smaller noise.

In [6]:
fig = go.Figure()

fig.add_trace(go.Scatter(x=mean.lead_day.values, y=mean.values, mode="lines", name="Mean difference"))
fig.add_trace(
    go.Scatter(
        x=mean.lead_day.values, y=upper_band.values, fill=None, mode="lines", line_color="lightgrey", showlegend=False
    )
)
fig.add_trace(
    go.Scatter(
        x=mean.lead_day.values,
        y=lower_band.values,
        fill="tonexty",
        mode="lines",
        line_color="lightgrey",
        name="90% Confidence intervals",
    )
)

fig.update_layout(
    title="Difference in the mean MSE of forecast systems<br>A and B (MSE<sub>A</sub> - MSE<sub>B</sub>)",
    xaxis_title="Lead day",
    yaxis_title="MSE difference",
    width=600,
    height=500,
)
fig.update_layout(legend=dict(x=0, y=1))
fig.show()
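
Another way to summarise the bootstrap output is the fraction of iterations in which system A has the lower MSE, which can be read as an approximate confidence that A outperformed B. The sketch below uses NumPy and a small made-up array in place of the real MSE differences; the helper name `fraction_negative` and the values in `diff_demo` are hypothetical and are not results from this notebook.

```python
import numpy as np


def fraction_negative(diff, axis=-1):
    """Fraction of bootstrap iterations in which the score difference is below zero."""
    return (np.asarray(diff) < 0).mean(axis=axis)


# Hypothetical stand-in for MSE_A - MSE_B with shape (lead_day, iteration)
diff_demo = np.array(
    [
        [-0.30, -0.10, 0.05, -0.20],
        [-0.20, -0.25, -0.10, -0.30],
    ]
)
frac = fraction_negative(diff_demo)
print(frac)  # [0.75 1.  ]
```

With the `diff` array computed earlier, the equivalent `xarray` expression would be `(diff < 0).mean(dim="iteration")`, which gives one value per lead day.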