## Simulated Correlated Time Series

To test the efficacy of Triple Collocation (TC), we simulated correlated time series using the [CoSMoS](https://cran.r-project.org/web/packages/CoSMoS/vignettes/vignette.html) package in R (see [simulation.R](R/simulation.R)). The package uses spatiotemporal correlation structures to generate series that are correlated in time and space. We chose the Clayton-Weibull (CW) spatiotemporal correlation structure, in which correlations decay exponentially as the separation between elements in space and time increases. It is given by

$\rho_{\rm CW}(\delta, \tau) = \Big(\exp\Big(\theta\Big(\frac{\delta}{b_{\rm S}}\Big)^{c_{\rm S}}\Big) + \exp\Big(\theta\Big(\frac{\tau}{b_{\rm T}}\Big)^{c_{\rm T}}\Big) - 1 \Big)^{-\frac{1}{\theta}}$,

where $\rho_{\rm CW}$ is the correlation structure, $\delta$ is the separation in space, $\tau$ is the separation in time, $\theta$ is the copula parameter, $b_{\rm S}$ and $b_{\rm T}$ are the space and time scale parameters, and $c_{\rm S}$ and $c_{\rm T}$ are the space and time shape parameters, respectively. Since we look at the time series of each spatial pixel individually, $\delta = 0$, so the first exponential equals 1 and the CW equation simplifies to

$\rho_{\rm CW}(\tau) = \Big(\exp\Big(\theta\Big(\frac{\tau}{b_{\rm T}}\Big)^{c_{\rm T}}\Big)\Big)^{-\frac{1}{\theta}}$.
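
Because $(e^{\theta x})^{-1/\theta} = e^{-x}$, the reduced expression no longer depends on $\theta$ and is equivalent to $\exp\big(-(\tau/b_{\rm T})^{c_{\rm T}}\big)$. The following is a minimal numerical check of that identity. The function names are hypothetical, the values $b_{\rm T} = 3$ and $c_{\rm T} = 0.7$ mirror those passed to the correlation function later in this notebook, and the $\theta$ values are arbitrary.

```python
import math


def cw_temporal(tau, theta, b_t, c_t):
    """Temporal-only Clayton-Weibull correlation (the delta = 0 case)."""
    return math.exp(theta * (tau / b_t) ** c_t) ** (-1.0 / theta)


def weibull_corr(tau, b_t, c_t):
    """Equivalent closed form with the copula parameter cancelled out."""
    return math.exp(-((tau / b_t) ** c_t))


# Scale and shape as used later in the notebook; theta values are arbitrary.
b_t, c_t = 3.0, 0.7
for theta in (0.5, 2.0, 10.0):
    for tau in (0, 1, 5, 20):
        assert math.isclose(
            cw_temporal(tau, theta, b_t, c_t), weibull_corr(tau, b_t, c_t)
        )

# Correlation at a few time lags, rounded for display.
print([round(weibull_corr(tau, b_t, c_t), 3) for tau in (0, 1, 5, 20)])
```

In other words, for a single pixel the copula parameter only matters once spatial separation is involved.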

Additionally, the marginal distribution from which to sample was chosen to be a normal distribution. We then ran the simulation three times independently for a 5x5 grid of spatial points and 5000 temporal points. Each of the three simulations used a normal distribution with different properties (i.e., mean and variance). Finally, we saved the output (which were matrices in R) from [CoSMoS](https://cran.r-project.org/web/packages/CoSMoS/vignettes/vignette.html) to an HDF5 file for use in this notebook.

In [None]:
import numpy as np
import h5py
import holoviews as hv

hv.extension("bokeh")

In [None]:
f = h5py.File("../Data/examples/simulation.h5", "r")
# The dataset names indicate the variance of each marginal distribution (i.e., 0.3, 0.2, 0.15)
sim3 = np.array(f["sim3"])
sim2 = np.array(f["sim2"])
sim15 = np.array(f["sim15"])

print(sim15.shape, sim2.shape, sim3.shape)

# Let's stack the three different simulations along a third dimension and adjust dimensions
#   to have ndata x nsim x npix
sim = np.stack((sim15, sim2, sim3), axis=2).transpose((1, 2, 0))
print(sim.shape)

Now that we have read in the simulated data, let's estimate the error variances using TC. Since [Zwieback et al. (2012)](https://npg.copernicus.org/articles/19/69/2012/) showed that we only need 500 or so samples to estimate the variances within 10%, we will compute the variances for our simulations using 500 up to 5000 samples to see if including more samples makes a significant difference.

> Note that we will use the multi-dimensional EC function to do this. As shown in the [random dataset example](Examples/random_dataset.ipynb), the EC function is equivalent to the TC function when only three datasets are used. Using the multi-dimensional EC function is also much faster, since it parallelizes along the spatial dimensions.
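
For orientation, the sketch below shows the textbook covariance-based TC estimator, in which the error variance of system $x$ is $C_{xx} - C_{xy} C_{xz} / C_{yz}$. It is not the `ec_covar_multi` implementation. The helper names and the synthetic data (a shared signal plus independent Gaussian errors) are assumptions made purely for illustration.

```python
import random


def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (n - 1)


def tc_error_variances(x, y, z):
    """Classical covariance-based TC estimate of the three error variances."""
    c_xy, c_xz, c_yz = covariance(x, y), covariance(x, z), covariance(y, z)
    return (
        covariance(x, x) - c_xy * c_xz / c_yz,
        covariance(y, y) - c_xy * c_yz / c_xz,
        covariance(z, z) - c_xz * c_yz / c_xy,
    )


# Hypothetical data: one shared signal observed by three systems whose
# independent errors have variances 0.15, 0.2, and 0.3.
rng = random.Random(0)
n_samples = 20000
truth = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
x = [t + rng.gauss(0.0, 0.15 ** 0.5) for t in truth]
y = [t + rng.gauss(0.0, 0.2 ** 0.5) for t in truth]
z = [t + rng.gauss(0.0, 0.3 ** 0.5) for t in truth]

estimates = tc_error_variances(x, y, z)
print([round(v, 3) for v in estimates])
```

The estimator relies on the cross-covariances containing only the shared signal, which is exactly the assumption that breaks down when errors are correlated.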

In [None]:
%run ../TC/EC_function.ipynb

In [None]:
# It takes a little while to run, about 60s or so
evar = np.zeros((3, 25, 4500))
for i in range(500, 5000):
    # Only need the diagonal, so extract it
    evar[:, :, i - 500] = np.diagonal(
        ec_covar_multi(sim[0:i, ...], corr_sets=[1, 2, 3])
    ).T

Now, let's plot this to see if the variances are estimated well and if including more samples improves our estimates.

In [None]:
fig_samplesi = hv.Curve(
    zip(np.arange(500, 5000), np.nanmedian(evar[0, :, :], axis=0)), label="i"
).opts(
    color="green",
    height=300,
    width=800,
    xlabel="Number of Samples",
    ylabel="Estimated Error Variance",
    xlim=(500, 5700),
    ylim=(0.08, 0.37),
)
fig_samplesj = hv.Curve(
    zip(np.arange(500, 5000), np.nanmedian(evar[1, :, :], axis=0)), label="j"
).opts(color="orange")
fig_samplesk = hv.Curve(
    zip(np.arange(500, 5000), np.nanmedian(evar[2, :, :], axis=0)), label="k"
).opts(color="red")

fig_truthi = hv.Curve(zip([500, 5000], [0.15, 0.15]), label="i Truth").opts(
    color="lime"
)
fig_truthj = hv.Curve(zip([500, 5000], [0.2, 0.2]), label="j Truth").opts(color="gold")
fig_truthk = hv.Curve(zip([500, 5000], [0.3, 0.3]), label="k Truth").opts(
    color="darkred"
)

fig_samplesi * fig_samplesj * fig_samplesk * fig_truthi * fig_truthj * fig_truthk

So, the estimates align fairly well with the truth values, and the agreement improves as more samples are included, as we would expect. Therefore, TC estimates each observation system's error variance well, provided the systems have independent errors.

Now, let's use TC on a simulation where the errors of different observation systems are correlated. To do this, we ran another [CoSMoS](https://cran.r-project.org/web/packages/CoSMoS/vignettes/vignette.html) simulation using the Clayton-Weibull spatiotemporal correlation structure, a normal marginal distribution, 5x5 spatial pixels, and 40000 time steps. To create correlated observation systems from this simulation, we split it into three observing systems. This is done by distributing the first three elements in the time series of each pixel (i.e., indices 0, 1, 2), one to each observing system. This is then repeated for the next three elements in the time series, and the next three, and so on until each observing system has 500 time elements.

Additionally, to show how decreasing correlation affects the error variance estimates, we can distribute the simulated data to each observing system in a similar way. Instead of the first three elements, we take every other element until we have three (i.e., indices 0, 2, 4) and repeat until each system has 500 elements. This can be repeated with ever wider spacing (e.g., indices 0, 3, 6; indices 0, 4, 8; etc.) to make less and less correlated observing systems.
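
The distribution scheme just described can be sketched as an index generator. The function name and its arguments are hypothetical. The sketch only lists which time indices each system would receive for a given spacing and does not touch the simulated data.

```python
def system_indices(spacing, n_samples=500, n_systems=3):
    """Time indices assigned to each observing system for a given spacing.

    System s starts at index s * spacing, and consecutive samples within a
    system are n_systems * spacing apart.
    """
    step = n_systems * spacing
    return [
        list(range(s * spacing, s * spacing + step * n_samples, step))
        for s in range(n_systems)
    ]


# Show the first three indices of each system for the first few spacings.
for spacing in (1, 2, 3):
    first = [idx[:3] for idx in system_indices(spacing)]
    print(spacing, first)
```

With 500 samples per system, a spacing of $k$ needs $1499k + 1$ time steps, so 40000 steps allow spacings up to 26.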

In [None]:
sim_long = np.array(f["sim"])

sim_corr = np.zeros(
    (3, sim_long.shape[0], 500, int(np.floor(sim_long.shape[1] / 3 / 500)))
)

for i in range(int(np.floor(sim_long.shape[1] / 3 / 500))):
    obs1 = sim_long[:, slice(0, 500 * (i + 1) * 3, (3 * (i + 1)))]
    obs2 = sim_long[:, slice(1 * (i + 1), 500 * (i + 1) * 3 + 1 * (i + 1), 3 * (i + 1))]
    obs3 = sim_long[:, slice(2 * (i + 1), 500 * (i + 1) * 3 + 2 * (i + 1), 3 * (i + 1))]
    sim_corr[:, :, :, i] = np.stack((obs1, obs2, obs3), axis=0)

sim_corr.shape

Now that we have our correlated observing system data from the simulation, we can estimate the error variances using TC.

In [None]:
evar_corr = np.diagonal(
    ec_covar_multi(sim_corr.transpose((2, 0, 1, 3)), corr_sets=[1, 2, 3])
).transpose((2, 0, 1))
evar_med = np.nanmedian(evar_corr, axis=1)

Using the median error variance of all the pixels, we can plot how the error variance estimates change with decreasing correlation (i.e., increasing separation of distributed time elements). Since we know the temporal distance between each element, we can directly calculate the correlation using the chosen parameters in the simulation and use this in our plot.

First, let's make a simple function to calculate the correlation.

In [None]:
def cw_stcs(tau, theta, bt, ct):
    """Temporal Clayton-Weibull correlation (delta = 0) at time lag tau."""
    rho = (np.exp(theta * (tau / bt) ** ct)) ** (-1 / theta)
    return rho

Then, we can plot.

In [None]:
rho = cw_stcs(np.arange(np.floor(sim_long.shape[1] / 3 / 500)) + 1, 2, 3, 0.7)

fig_var_corr_i = hv.Curve(zip(rho, evar_med[0, :]), label="i").opts(
    color="green",
    height=300,
    width=800,
    xlabel="CW correlation",
    ylabel="Estimated Error Variance",
    logx=True,
)
fig_var_corr_j = hv.Curve(zip(rho, evar_med[1, :]), label="j").opts(color="orange")
fig_var_corr_k = hv.Curve(zip(rho, evar_med[2, :]), label="k").opts(color="red")

# np.repeat needs an integer repeat count, so cast the floored value to int
fig_truth = hv.Curve(
    zip(rho, np.repeat(0.15, int(np.floor(sim_long.shape[1] / 3 / 500)))), label="Truth"
).opts(color="black")

(fig_var_corr_i * fig_var_corr_j * fig_var_corr_k * fig_truth).opts(
    legend_position="bottom_left"
)

As we can see, decreasing the correlation between observing systems (i.e., increasing the distance between distributed points) results in error variance estimates that are closer to the truth values. Notice that $i$ and $k$ yield better estimates of the error variance at higher correlations than $j$ does. Correlation is defined between pairs of systems, and the simulated data was distributed with $i$ getting the first index, $j$ the second, and $k$ the third. The pair $ik$ is therefore separated by twice the distance in time and is less correlated than $ij$ or $jk$. As a result, $j$ is more highly correlated with both $i$ and $k$ than $i$ is with $k$, which leads to the less accurate estimates of the error variance in $j$ seen in the plot.
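
One way to see why $j$ fares worst is to plug the expected covariances into the classical covariance-based TC formula. The sketch below is an idealized expectation, not the output of `ec_covar_multi`. It assumes that all three systems share the same variance (0.15), that the covariance between two systems equals that variance times the CW correlation at their temporal separation, and that TC reduces to the classical estimator. The helper names are hypothetical, and the scale and shape values (3 and 0.7) are the ones passed to `cw_stcs` in this notebook.

```python
import math


def cw_corr(tau, b_t=3.0, c_t=0.7):
    """Temporal CW correlation; theta cancels when delta = 0."""
    return math.exp(-((tau / b_t) ** c_t))


def expected_tc_variances(spacing, variance=0.15):
    """Idealized classical-TC estimates for systems i, j, k at one spacing."""
    c_near = variance * cw_corr(spacing)      # covariance of the ij and jk pairs
    c_far = variance * cw_corr(2 * spacing)   # covariance of the ik pair
    est_i = variance - c_near * c_far / c_near  # C_ii - C_ij * C_ik / C_jk
    est_j = variance - c_near * c_near / c_far  # C_jj - C_ij * C_jk / C_ik
    est_k = variance - c_far * c_near / c_near  # C_kk - C_ik * C_jk / C_ij
    return est_i, est_j, est_k


for spacing in (1, 5, 26):
    print(spacing, [round(v, 4) for v in expected_tc_variances(spacing)])
```

Under these assumptions $i$ and $k$ are biased only by the weaker $ik$ correlation, while $j$ is biased by the ratio of the squared near-pair correlation to the far-pair correlation, which is larger at every spacing.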

To check that the correlations in our simulated data match those predicted by the CW function, we can calculate the Pearson correlation and plot it against the value derived from the CW function.

In [None]:
pearson_coef = np.zeros(
    (3, 3, sim_long.shape[0], int(np.floor(sim_long.shape[1] / 3 / 500)))
)

for i in range(int(np.floor(sim_long.shape[1] / 3 / 500))):
    for j in range(sim_long.shape[0]):
        pearson_coef[:, :, j, i] = np.corrcoef(sim_corr[:, j, :, i])

# Use the median correlation of all pixels
pearson_med = np.median(pearson_coef, axis=2)

In [None]:
rho = cw_stcs(
    np.stack(
        (
            (np.arange(np.floor(sim_long.shape[1] / 3 / 500))) + 1,
            ((np.arange(np.floor(sim_long.shape[1] / 3 / 500))) + 1) * 2,
        ),
        axis=1,
    ),
    2,
    3,
    0.7,
)

fig_corrij = hv.Curve(zip(rho[:, 0], pearson_med[0, 1, :]), label="ij").opts(
    color="green",
    height=500,
    width=500,
    xlabel="CW correlation",
    ylabel="Estimated Correlation (Pearson)",
)
fig_corrik = hv.Curve(zip(rho[:, 1], pearson_med[0, 2, :]), label="ik").opts(
    color="orange"
)
fig_corrjk = hv.Curve(zip(rho[:, 0], pearson_med[1, 2, :]), label="jk").opts(
    color="red"
)

fig_one2one = hv.Curve(zip([0, 0.7], [0, 0.7])).opts(color="black", line_dash="dashed")

(fig_corrij * fig_corrik * fig_corrjk * fig_one2one).opts(
    legend_position="bottom_right"
)

From this plot, we can see that the correlations are close to what we expect. Together with the previous plot, this shows that TC only gives reliable estimates of the error variance when the errors of the observing systems are minimally correlated.

Now, let's apply Extended Collocation (EC) to see if adding independent data sets allows us to get better estimates of the error (co)variances when we do have correlation between observing systems.

Since EC requires at least three independent observing systems, we will add two independent systems to the previous set of three correlated systems, giving us a total of five systems. The two added systems are completely independent of each other and of the correlated three, so we meet the requirement of at least three independent systems. These two systems will be two of the simulated independent systems from the beginning of this notebook. With these two new observing systems, we can better estimate the error variance and covariance of the three correlated systems.
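
The grouping can be pictured with a small sketch. It reflects one reading of the `corr_sets` argument, in which systems sharing a label are allowed to have correlated errors, and it does not reproduce the internals of `ec_covar_multi`. Both helper functions are hypothetical.

```python
def shared_error_mask(corr_sets):
    """True where two systems share a label, i.e. where their error
    covariance is treated as unknown rather than assumed zero."""
    return [[a == b for b in corr_sets] for a in corr_sets]


def count_independent_groups(corr_sets):
    """Number of mutually independent groups of systems."""
    return len(set(corr_sets))


labels = [0, 0, 0, 1, 2]  # three correlated systems plus two independent ones
for row in shared_error_mask(labels):
    print("".join("x" if flag else "." for flag in row))
print(count_independent_groups(labels))
```

The printed mask has a 3x3 block for the correlated systems and single diagonal entries for the two added systems, which gives three independent groups in total.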

In [None]:
sim_EC_corr = np.zeros((5, 25, 500, int(np.floor(sim_long.shape[1] / 3 / 500))))
sim_EC_corr[:3, :, :, :] = sim_corr
sim_EC_corr[3, :, :, :] = (
    sim2[:, :500]
    .reshape(25, 500, 1)
    .repeat(int(np.floor(sim_long.shape[1] / 3 / 500)), 2)
)
sim_EC_corr[4, :, :, :] = (
    sim3[:, :500]
    .reshape(25, 500, 1)
    .repeat(int(np.floor(sim_long.shape[1] / 3 / 500)), 2)
)

# The first three systems are correlated, so we indicate such by giving each associated index the same value in corr_sets
ecovar = ec_covar_multi(sim_EC_corr.transpose((2, 0, 1, 3)), corr_sets=[0, 0, 0, 1, 2])
ecovar_med = np.nanmedian(ecovar, axis=2)

print(
    "Estimated values of l for each correlation of ijk (Expected value = 0.2):",
    ecovar_med[3, 3, :],
)
print(
    "Estimated values of m for each correlation of ijk (Expected value = 0.3):",
    ecovar_med[4, 4, :],
)

From the EC error estimation, we can see that the two independent data sets had their error variances estimated near their expected values. `l` is slightly below the expectation, and `m` is slightly elevated. These biases could be reduced by including more than the 500 samples used here, as shown above, but they are close to being within 10% of the expected value. Finally, let's plot the estimated error (co)variances of the three correlated systems versus the CW correlation to check that EC effectively estimated and accounted for the correlations.

In [None]:
fig_vari = hv.Curve(zip(rho[:, 0], ecovar_med[0, 0, :]), label="i").opts(
    color="green",
    height=300,
    width=800,
    xlabel="CW correlation",
    ylabel="Estimated Error (Co)Variance",
    logx=True,
)
fig_varj = hv.Curve(zip(rho[:, 0], ecovar_med[1, 1, :]), label="j").opts(color="orange")
fig_vark = hv.Curve(zip(rho[:, 0], ecovar_med[2, 2, :]), label="k").opts(color="red")

fig_covij = hv.Curve(
    zip(rho[:, 0], np.mean([ecovar_med[0, 1, :], ecovar_med[1, 0, :]], axis=0)),
    label="ij covar",
).opts(color="green")
fig_covij = fig_covij * hv.Curve(
    zip(rho[:, 0], np.mean([ecovar_med[0, 1, :], ecovar_med[1, 0, :]], axis=0)),
    label="ij covar",
).opts(color="orange", line_dash="dashed")

fig_covik = hv.Curve(
    zip(rho[:, 1], np.mean([ecovar_med[0, 2, :], ecovar_med[2, 0, :]], axis=0)),
    label="ik covar",
).opts(color="green")
fig_covik = fig_covik * hv.Curve(
    zip(rho[:, 1], np.mean([ecovar_med[0, 2, :], ecovar_med[2, 0, :]], axis=0)),
    label="ik covar",
).opts(color="red", line_dash="dashed")

fig_covjk = hv.Curve(
    zip(rho[:, 0], np.mean([ecovar_med[1, 2, :], ecovar_med[2, 1, :]], axis=0)),
    label="jk covar",
).opts(color="orange")
fig_covjk = fig_covjk * hv.Curve(
    zip(rho[:, 0], np.mean([ecovar_med[1, 2, :], ecovar_med[2, 1, :]], axis=0)),
    label="jk covar",
).opts(color="red", line_dash="dashed")

# np.repeat needs an integer repeat count, so cast the floored value to int
fig_truth = hv.Curve(
    zip(rho[:, 0], np.repeat(0.15, int(np.floor(sim_long.shape[1] / 3 / 500)))),
    label="Var Truth",
).opts(color="black")
fig_truthcov = hv.Curve(
    zip(np.sort(rho.flatten()), np.sort(rho.flatten()) * 0.15), label="Covar Truth"
).opts(color="black", line_dash="dashed")

(
    fig_vari
    * fig_varj
    * fig_vark
    * fig_truth
    * fig_covij
    * fig_covik
    * fig_covjk
    * fig_truthcov
).opts(legend_position="top_left")

Great! So, EC appropriately accounted for the correlations and estimated the error (co)variances accurately. Therefore, correlated errors can be present in the data sets as long as there are at least three mutually independent systems (or groups of systems).