## Randomized Data Example

To test the general implementation of Triple Collocation (TC) and Extended Collocation (EC), we will use a randomly generated time series, following an approach similar to Section 6 of [Zwieback+2012](https://npg.copernicus.org/articles/19/69/2012/).

To get our "true" time series values $\boldsymbol{t}$, we will randomly generate uniform values between 0 and 10, and then convolve this with a boxcar filter.
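
One detail of this filter is worth a quick look: `np.convolve` with mode `'same'` zero-pads the input, so a width-5 boxcar averages padded zeros into the first and last two samples. The minimal sketch below shows this on a constant series; the variable names are illustrative only and are not used elsewhere in this notebook.

```python
import numpy as np

# Constant "signal" of ones, so any deviation from 1 is purely an edge effect
signal = np.ones(10)
boxcar = np.ones(5) / 5  # width-5 moving-average kernel

# 'same' keeps the output length equal to the input length by zero-padding
smoothed = np.convolve(signal, boxcar, 'same')
print(smoothed)  # interior samples stay at 1; the two samples at each end are lower
```

With thousands of samples, these four edge values should have little influence on the covariances computed later.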

In [1]:
import numpy as np
import holoviews as hv
hv.extension('bokeh')

nsamples = 10_000  # integer sample count
t_original = np.random.uniform(0, 10, int(nsamples))
t = np.convolve(t_original, np.ones(5)/5, 'same')

# Let's plot this true data to see what it looks like
fig_convolve = hv.Curve(zip(np.arange(0, nsamples), t), label='Convolved').opts(height=300, width=800, color='blue', xlabel='Sample', ylabel='t')
fig_original = hv.Curve(zip(np.arange(0, nsamples), t_original), label='Original').opts(color='red')
display(fig_original * fig_convolve)

Now, we will make a set of observations from the true data. Since TC assumes that the observation errors are additive, we will add some uncorrelated random noise to the true data along with some affine transformations. For the uncorrelated noise, we will assume:

$\Sigma = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 3 \end{pmatrix}$,

where $\Sigma$ is the error covariance matrix of the observations.

Additionally, we will apply affine transformations to the data to show that a bias in the observations does not affect the error variance estimates. For this example, we will use the transformations:

$\begin{align} \alpha_i = 0; \beta_i = 1 \\ \alpha_j = 2; \beta_j = 3 \\ \alpha_k = 4; \beta_k = 5 \end{align}$,

where $\boldsymbol{X}_i = \alpha_i + \beta_i \boldsymbol{t} + \boldsymbol{\varepsilon}_i$ and $\boldsymbol{X}_i$ and $\boldsymbol{\varepsilon}_i$ are the observations and random errors from a given system.
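
To see why the offsets and scalings drop out, a short derivation helps. This is the standard covariance form of TC, assuming the errors are zero-mean, mutually uncorrelated, and uncorrelated with $\boldsymbol{t}$. Whether `tc_covar` uses exactly these expressions is an assumption, since its source is not shown here. Writing $\sigma_{ij}$ for the covariance of $\boldsymbol{X}_i$ and $\boldsymbol{X}_j$, and $\sigma_t^2$ for the variance of $\boldsymbol{t}$:

$\begin{align} \sigma_{ii} &= \beta_i^2 \sigma_t^2 + \sigma_{\varepsilon_i}^2 \\ \sigma_{ij} &= \beta_i \beta_j \sigma_t^2 \quad (i \ne j) \\ \sigma_{\varepsilon_i}^2 &= \sigma_{ii} - \frac{\sigma_{ij} \sigma_{ik}}{\sigma_{jk}} \end{align}$

The offsets $\alpha_i$ never appear because a covariance is unchanged by adding a constant, and the scalings cancel in the ratio, since $\sigma_{ij} \sigma_{ik} / \sigma_{jk} = \beta_i^2 \sigma_t^2$.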

In [2]:
Sigma = np.array([[1, 0, 0], [0, 2, 0], [0, 0, 3]])
errors = np.random.multivariate_normal([0, 0, 0], Sigma, int(nsamples))

X = np.zeros((int(nsamples), 3))
X[:, 0] = 0 + 1 * t + errors[:, 0]
X[:, 1] = 2 + 3 * t + errors[:, 1]
X[:, 2] = 4 + 5 * t + errors[:, 2]

# Let's plot this observed data over the true data to see what it looks like
fig_Xi = hv.Scatter(zip(np.arange(0, nsamples), X[:, 0]), label='i').opts(color='green', marker='square')
fig_Xj = hv.Scatter(zip(np.arange(0, nsamples), X[:, 1]), label='j').opts(color='orange', marker='triangle')
fig_Xk = hv.Scatter(zip(np.arange(0, nsamples), X[:, 2]), label='k').opts(color='red')
display(fig_Xi * fig_Xj * fig_Xk * fig_convolve)


Now that we have our observations, let's perform the TC estimation. To do this, let's load the TC function we have already made, and check the `help` to get the input and output format.

In [3]:
%run ../TC/TC_function.ipynb

In [4]:
help(tc_covar)

Help on function tc_covar in module __main__:

tc_covar(X)
    Uses the covariance method of Triple Collocation (TC) to estimate the error variances for the three collocated inputs.
    
    Parameters: X : ndarray, shape(N, 3)
                    The three collocated inputs.
    
    Returns:    evar : ndarray, shape(3)
                    The estimated error variance of the three collocated inputs.



So, it takes our input observation array as is and outputs the estimated error variance of the observations. Let's give it a test to see if it works on our simulated data.
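
Since the body of `tc_covar` is not shown here, the following is a minimal sketch of what a covariance-based TC estimator can look like. The name `tc_covar_sketch` is hypothetical, and the assumption is that the estimator is built from the sample covariance matrix in the standard way; the actual `tc_covar` may differ in detail.

```python
import numpy as np

def tc_covar_sketch(X):
    """Illustrative covariance-form TC (not the notebook's tc_covar).

    X has shape (N, 3); returns the three estimated error variances.
    """
    X = np.asarray(X, dtype=float)
    if X.ndim != 2 or X.shape[1] != 3:
        raise ValueError("X must have shape (N, 3)")

    # Sample covariance matrix between the three systems (columns)
    Q = np.cov(X, rowvar=False)

    # Each error variance is the system's own variance minus the part
    # explained by the shared signal, inferred from the cross-covariances
    return np.array([
        Q[0, 0] - Q[0, 1] * Q[0, 2] / Q[1, 2],
        Q[1, 1] - Q[0, 1] * Q[1, 2] / Q[0, 2],
        Q[2, 2] - Q[0, 2] * Q[1, 2] / Q[0, 1],
    ])

# Small demonstration on synthetic data built like the example in this notebook
rng = np.random.default_rng(0)
n = 10_000
t_demo = np.convolve(rng.uniform(0, 10, n), np.ones(5) / 5, 'same')
noise = rng.normal(0.0, np.sqrt([1.0, 2.0, 3.0]), size=(n, 3))
X_demo = np.array([0.0, 2.0, 4.0]) + np.array([1.0, 3.0, 5.0]) * t_demo[:, None] + noise
print(tc_covar_sketch(X_demo))
```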

In [5]:
evar = tc_covar(X)
print('Expected values:', np.diagonal(Sigma))
print('Estimated values:', evar)

Expected values: [1 2 3]
Estimated values: [1.00186307 1.92088603 3.18247   ]


Great! The estimated values are very close to the expected values. Let's try this whole process again, but this time with multiple samples of different sizes. This way we can see how the number of samples in our data set influences the estimated variances. To create these samples, we have created a simple function to output data sets given the number of samples, covariance matrix, and affine transformation parameters.

In [6]:
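def generate_sample(nsamples, Sigma, alpha, beta):
    """Simulate observations X[:, i] = alpha[i] + beta[i] * t + errors[:, i]."""
    nsamples = int(nsamples)
    nsystems = len(Sigma)

    # "True" signal: uniform random values smoothed with a width-5 boxcar filter
    t_original = np.random.uniform(0, 10, nsamples)
    t = np.convolve(t_original, np.ones(5)/5, 'same')

    # Zero-mean Gaussian errors with covariance matrix Sigma
    errors = np.random.multivariate_normal(np.zeros(nsystems), Sigma, nsamples)

    # Apply each system's affine transformation and add its error
    X = np.zeros((nsamples, nsystems))
    for i in range(nsystems):
        X[:, i] = alpha[i] + beta[i] * t + errors[:, i]

    return X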

In [7]:
evar = np.zeros((9996, 3))
# Need to have at least 5 samples as we use a boxcar filter of width 5
for i in range(5, 10001):
    X = generate_sample(i, Sigma, alpha=[0, 2, 4], beta=[1, 3, 5])
    evar[i-5, :] = tc_covar(X)

In [8]:
fig_samplesi = hv.Scatter(zip(np.arange(5, 10001), evar[:, 0]), label='i').opts(color='green', marker='square', height=300, width=800, 
                                                                               xlabel='Number of Samples', ylabel='Estimated Error Variance',
                                                                               xlim=(0, 10000), ylim=(0, 5))
fig_truevari = hv.Curve(zip(np.array([0, 10001]), np.repeat(np.diagonal(Sigma)[0], 2)), label='True i').opts(color='lime')

fig_samplesj = hv.Scatter(zip(np.arange(5, 10001), evar[:, 1]), label='j').opts(color='orange', marker='triangle')
fig_truevarj = hv.Curve(zip(np.array([0, 10001]), np.repeat(np.diagonal(Sigma)[1], 2)), label='True j').opts(color='gold')

fig_samplesk = hv.Scatter(zip(np.arange(5, 10001), evar[:, 2]), label='k').opts(color='red')
fig_truevark = hv.Curve(zip(np.array([0, 10001]), np.repeat(np.diagonal(Sigma)[2], 2)), label='True k').opts(color='darkred')

display(fig_samplesk * fig_samplesj * fig_samplesi * fig_truevari * fig_truevarj * fig_truevark)

As expected, the spread of the estimated error variances decreases as the number of samples increases. This result agrees with [Zwieback+2012](https://npg.copernicus.org/articles/19/69/2012/), who estimate that at least 500 samples are needed for the estimated variance to be within 10% of its true value on average.
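
To put a number on this spread, one option is to repeat the experiment many times at a few fixed sample sizes and average the relative error. The sketch below does this with a hypothetical stand-in estimator, `tc_sketch`, which assumes the standard covariance form of TC rather than calling `tc_covar`. The trial count and sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

def tc_sketch(X):
    """Covariance-form TC for three systems (illustrative stand-in for tc_covar)."""
    Q = np.cov(X, rowvar=False)
    return np.array([Q[0, 0] - Q[0, 1] * Q[0, 2] / Q[1, 2],
                     Q[1, 1] - Q[0, 1] * Q[1, 2] / Q[0, 2],
                     Q[2, 2] - Q[0, 2] * Q[1, 2] / Q[0, 1]])

def simulate(n, true_var, alpha, beta):
    """Boxcar-smoothed uniform truth plus independent Gaussian errors."""
    t = np.convolve(rng.uniform(0, 10, n), np.ones(5) / 5, 'same')
    errors = rng.normal(0.0, np.sqrt(true_var), size=(n, 3))
    return alpha + beta * t[:, None] + errors

true_var = np.array([1.0, 2.0, 3.0])
alpha = np.array([0.0, 2.0, 4.0])
beta = np.array([1.0, 3.0, 5.0])

ntrials = 200  # arbitrary number of repeated experiments per sample size
mean_rel_err = {}
for n in (50, 500, 5000):
    est = np.array([tc_sketch(simulate(n, true_var, alpha, beta))
                    for _ in range(ntrials)])
    # Mean absolute relative error of each system's variance estimate
    mean_rel_err[n] = np.mean(np.abs(est - true_var) / true_var, axis=0)
    print(n, np.round(mean_rel_err[n], 3))
```

Each printed row gives the mean absolute relative error per system at one sample size, which can be compared directly with a 10% target.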

Using this 500 value as the minimum number of samples we would typically want, let's see what happens when we violate assumption 2 of TC (i.e., we add correlation between observing systems).
To do this we will simply add some off diagonal terms to the error covariance matrix giving:

$\Sigma = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 2 & 0 \\ 1 & 0 & 3 \end{pmatrix}$.

In [9]:
Sigma_offdiag = np.array([[1, 0, 1], [0, 2, 0], [1, 0, 3]])
X = generate_sample(500, Sigma_offdiag, alpha=[0, 2, 4], beta=[1, 3, 5])
evar = tc_covar(X)
print('Expected values if not correlated:', np.diagonal(Sigma_offdiag))
print('Estimated values:', evar)

Expected values if not correlated: [1 2 3]
Estimated values: [ 0.78239025  3.25239688 -1.01074947]


So, as we can see, adding correlation between the errors can cause serious discrepancies in our estimated values, most notably a negative estimate of the error variance for system $k$, which is not physically meaningful. To account for this, we need a more generalized version of TC. This generalization is EC, which uses additional observing systems to account for potential correlation in the errors of the data sets.
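
The size of these discrepancies can be anticipated with a short calculation. As a sketch, assume the covariance form of TC, $\hat{\sigma}_{\varepsilon_i}^2 = \sigma_{ii} - \sigma_{ij} \sigma_{ik} / \sigma_{jk}$ (whether `tc_covar` uses exactly this form is an assumption), and let the errors of systems $i$ and $k$ have covariance $\sigma_{\varepsilon_i, \varepsilon_k}$. Then $\sigma_{ik} = \beta_i \beta_k \sigma_t^2 + \sigma_{\varepsilon_i, \varepsilon_k}$, where $\sigma_t^2$ is the variance of $\boldsymbol{t}$, and in the large-sample limit:

$\begin{align} \hat{\sigma}_{\varepsilon_i}^2 &= \sigma_{\varepsilon_i}^2 - \frac{\beta_i}{\beta_k} \sigma_{\varepsilon_i, \varepsilon_k} = 1 - \frac{1}{5} = 0.8 \\ \hat{\sigma}_{\varepsilon_k}^2 &= \sigma_{\varepsilon_k}^2 - \frac{\beta_k}{\beta_i} \sigma_{\varepsilon_i, \varepsilon_k} = 3 - 5 = -2 \end{align}$

Under these assumptions, a negative estimate for system $k$ is the expected outcome rather than a sampling fluke. The estimate for system $j$ is inflated because the enlarged $\sigma_{ik}$ sits in the denominator of its correction term. With only 500 samples, the printed values differ from these limits by sampling noise.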

> While the covariances in the errors caused a negative variance estimate here, error variance estimates can also be negative when no correlations are present. This happens when two observing systems have error variances roughly an order of magnitude larger than that of the third. These larger values dominate the TC calculation and cause a poor estimate of the smaller variance. Therefore, any negative error variance estimate should be flagged as invalid, with the understanding that the true error of that system is likely much smaller than that of the other observing systems.
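
One simple way to apply this flagging is to mask negative estimates before using them. The helper below is a hypothetical sketch, not part of the TC or EC functions loaded in this notebook. Replacing invalid values with NaN is one possible convention among several.

```python
import numpy as np

def flag_negative_variances(evar):
    """Return (flagged, invalid): estimates with negatives set to NaN, plus a boolean mask."""
    evar = np.asarray(evar, dtype=float)
    invalid = evar < 0                         # True where the estimate is unphysical
    flagged = np.where(invalid, np.nan, evar)  # keep valid values, blank out the rest
    return flagged, invalid

# Estimates printed above for the correlated-error example
evar_example = np.array([0.78239025, 3.25239688, -1.01074947])
flagged, invalid = flag_negative_variances(evar_example)
print(invalid)  # only the third entry is True
print(flagged)  # third entry replaced by nan
```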

In [10]:
%run ../TC/EC_function.ipynb

To show that our EC function outputs the same result as the TC function, let's test it on this last example that had correlated data.

In [11]:
ecovar_matrix, _ = ec_covar(X, corr_sets=[0,1,2])
print('Expected values if not correlated:', np.diagonal(Sigma_offdiag))
print('Estimated values:', np.diagonal(ecovar_matrix))

Expected values if not correlated: [1 2 3]
Estimated values: [ 0.78239025  3.25239688 -1.01074947]


Great! So, the EC function returned the exact same estimates as the TC function. Its only additional requirement was for us to specify, through `corr_sets`, which data sets we thought had correlated errors. For this example, we gave every system its own label and so assumed all were independent, even though we knew the first and last were correlated. Now, let's try adding an additional observing system that is independent of the other three to see if we can recover covariances in the errors. To do so, we will add another row and column to the previous error covariance matrix giving:

$\Sigma = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 2 & 0 & 0 \\ 1 & 0 & 3 & 0 \\ 0 & 0 & 0 & 2 \end{pmatrix}$,

and use affine transformation parameters of $\alpha_l = 1$ and $\beta_l = 2$.
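
The `corr_sets` argument used in the next cell appears to act as a list of group labels, one per observing system, where systems sharing a label are treated as having correlated errors. That reading is an assumption based on how the argument is used in this notebook. The hypothetical helper below only illustrates the reading and is not part of `ec_covar`.

```python
from itertools import combinations

def correlated_pairs(corr_sets):
    """List index pairs of systems that share a group label (assumed to mean correlated errors)."""
    return [(a, b) for a, b in combinations(range(len(corr_sets)), 2)
            if corr_sets[a] == corr_sets[b]]

print(correlated_pairs([0, 1, 2]))     # [] : every system has its own label
print(correlated_pairs([0, 1, 0, 2]))  # [(0, 2)] : first and third systems share a label
```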

In [12]:
Sigma_offdiag = np.array([[1, 0, 1, 0], [0, 2, 0, 0], [1, 0, 3, 0], [0, 0, 0, 2]])
X = generate_sample(500, Sigma_offdiag, alpha=[0, 2, 4, 1], beta=[1, 3, 5, 2])

ecovar_matrix, _ = ec_covar(X, corr_sets=[0, 1, 0, 2])
print('Expected values error covariance matrix: \n', Sigma_offdiag)
print('Estimated error covariance matrix: \n', ecovar_matrix)

Expected values error covariance matrix: 
 [[1 0 1 0]
 [0 2 0 0]
 [1 0 3 0]
 [0 0 0 2]]
Estimated error covariance matrix: 
 [[0.94460279 0.         0.64021659 0.        ]
 [0.         2.08540386 0.         0.        ]
 [0.82703071 0.         2.37581586 0.        ]
 [0.         0.         0.         2.14477865]]


As we can see, adding this additional data set allowed us to obtain better estimates of the error variances and covariance. While not all estimates are within 10% of the true value, additional samples would increase the accuracy, similar to what is shown above.

> Note that the two off-diagonal error covariance estimates are not equal. This difference is a result of how these terms are estimated. The error covariance term is determined as $\sigma_{\varepsilon_i, \varepsilon_j} = \sigma_{ij} - \frac{\sigma_{ik} \sigma_{jl}}{\sigma_{kl}}$, where $\sigma_{ij}$ is the covariance of observing systems $i$ and $j$ (and similarly for the other index pairs). With this formulation, it is theoretically expected that $\sigma_{\varepsilon_i, \varepsilon_j} = \sigma_{\varepsilon_j, \varepsilon_i}$. However, while $\sigma_{ij} = \sigma_{ji}$, in practice $\sigma_{ik} \sigma_{jl} \ne \sigma_{jk} \sigma_{il}$, because covariances computed from a finite number of noisy samples deviate slightly from their theoretical values. Therefore, $\sigma_{\varepsilon_i, \varepsilon_j} \ne \sigma_{\varepsilon_j, \varepsilon_i}$, leading to the differences in the off-diagonal values. When quoting $\sigma_{\varepsilon_i, \varepsilon_j}$ instead of the whole error covariance matrix, we recommend averaging the corresponding off-diagonal values.
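
The asymmetry and the recommended averaging can be sketched directly from the formula in the note. The function name `ec_offdiag_sketch` and the synthetic data below are illustrative assumptions. This is not the notebook's `ec_covar`, whose internals are not shown here.

```python
import numpy as np

def ec_offdiag_sketch(Q, i, j, k, l):
    """Error covariance of systems i and j in both index orders, plus their average.

    Q is the sample covariance matrix of the observations; k and l are
    systems assumed to have errors independent of all others.
    """
    e_ij = Q[i, j] - Q[i, k] * Q[j, l] / Q[k, l]
    e_ji = Q[j, i] - Q[j, k] * Q[i, l] / Q[k, l]
    return e_ij, e_ji, 0.5 * (e_ij + e_ji)

# Synthetic four-system data set built like the example in this notebook
rng = np.random.default_rng(1)
n = 500
Sigma = np.array([[1, 0, 1, 0], [0, 2, 0, 0], [1, 0, 3, 0], [0, 0, 0, 2]], dtype=float)
alpha = np.array([0.0, 2.0, 4.0, 1.0])
beta = np.array([1.0, 3.0, 5.0, 2.0])

t = np.convolve(rng.uniform(0, 10, n), np.ones(5) / 5, 'same')
errors = rng.multivariate_normal(np.zeros(4), Sigma, n)
X = alpha + beta * t[:, None] + errors

Q = np.cov(X, rowvar=False)
e_ij, e_ji, e_mean = ec_offdiag_sketch(Q, i=0, j=2, k=1, l=3)
print(e_ij, e_ji, e_mean)  # two slightly different estimates and their average
```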

Finally, while we only show an EC example for four observing systems, EC can be expanded to include any number of observing systems. Adding more systems allows more error correlations between data sets to be estimated, with the only requirement being that at least three of the observing systems are independent.