# Extended Collocation

Extended collocation (EC) generalizes triple collocation (TC) to an arbitrary number (three or more) of measurement/observing systems, none of which is assumed to observe the true value. Like TC, the method makes four assumptions about the observing systems:

1. The signal and random errors are stationary (i.e., the mean of each is constant with time).
2. No cross-correlation of errors (i.e., measurement system errors are independent of each other).
3. Error orthogonality (i.e., the measurement system errors are independent of the true value).
4. No error autocorrelation (i.e., each system's errors are uncorrelated in time).

However, by allowing for more than three observing systems, EC generalizes assumption (2) to:

2. Cross-correlation of errors is only allowed between observing systems if each correlated observing system is also a member of at least one triplet of observing systems that has no cross-correlation of errors.

In other words, at least three observing systems must have no cross-correlation of errors and the systems whose errors are cross-correlated must be known.

Just like TC, EC commonly assumes an affine error model relating the observation to the true value and error:

$$\boldsymbol{X}_i = \alpha_i + \beta_i \boldsymbol{t} + \boldsymbol{\varepsilon}_i,$$

where $\boldsymbol{X}_i$ are the measured values from the $i$th collocated measurement system, $\alpha_i$ and $\beta_i$ are that system's additive and multiplicative calibration coefficients, $\boldsymbol{t}$ are the true values being measured, and $\boldsymbol{\varepsilon}_i$ are the additive random errors.
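To make the error model concrete, the following minimal sketch generates synthetic collocated data from it. All numerical values (sample count, biases, error standard deviations, signal statistics) are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

N = 1000                                 # number of collocated samples (hypothetical)
alpha = np.array([0.0, 0.5, -0.2])       # additive biases (hypothetical)
beta = np.array([1.0, 0.8, 1.2])         # multiplicative biases (hypothetical)
sigma_eps = np.array([0.3, 0.5, 0.4])    # error standard deviations (hypothetical)

# "True" signal and independent, zero-mean random errors for each system
t = rng.normal(loc=10.0, scale=2.0, size=N)
eps = rng.normal(scale=sigma_eps, size=(N, 3))

# Affine error model, one column per observing system:
# X[:, i] = alpha[i] + beta[i] * t + eps[:, i]
X = alpha + beta * t[:, None] + eps
print(X.shape)  # (1000, 3)
```

The resulting array has the `(N, M)` layout, with time along the first axis and observing systems along the second.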

### Covariance Notation

Keeping with the covariance notation used for TC, solving for the covariance of two measurement systems gives

$${\rm Cov}(\boldsymbol{X}_i, \boldsymbol{X}_j) = {\rm E}(\boldsymbol{X}_i \boldsymbol{X}_j) - {\rm E}(\boldsymbol{X}_i){\rm E}(\boldsymbol{X}_j) = \beta_i \beta_j \sigma_{\boldsymbol{t}}^2 + \beta_i {\rm Cov}(\boldsymbol{t}, \boldsymbol{\varepsilon}_j) + \beta_j {\rm Cov}(\boldsymbol{t}, \boldsymbol{\varepsilon}_i) + {\rm Cov}(\boldsymbol{\varepsilon}_i, \boldsymbol{\varepsilon}_j).$$

With the assumptions above, this simplifies to

$${\rm Cov}(\boldsymbol{X}_i, \boldsymbol{X}_j) = \begin{cases} \beta_i^2 \sigma_{\boldsymbol{t}}^2 + \sigma_{\varepsilon_i}^2, & {\rm for}\ i = j \\ \beta_i \beta_j \sigma_{\boldsymbol{t}}^2, & {\rm for}\ i \ne j\ {\rm where}\ \sigma_{\varepsilon_i, \varepsilon_j} = 0 \\ \beta_i \beta_j \sigma_{\boldsymbol{t}}^2 + \sigma_{\varepsilon_i, \varepsilon_j}, & {\rm for}\ i \ne j\ {\rm where}\ \sigma_{\varepsilon_i, \varepsilon_j} \ne 0  \end{cases},$$

since assumption (3) gives ${\rm Cov}(\boldsymbol{t}, \boldsymbol{\varepsilon}_i) = 0$.
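As a rough numerical check of the simplified covariance for systems with independent errors, the sketch below compares the sample covariance of synthetic data against $\beta_i \beta_j \sigma_{\boldsymbol{t}}^2$ (plus $\sigma_{\varepsilon_i}^2$ on the diagonal). The parameter values are hypothetical, and the agreement is only approximate because of sampling noise.

```python
import numpy as np

rng = np.random.default_rng(1)

N = 200_000                              # large sample to limit sampling noise
beta = np.array([1.0, 0.8, 1.2])         # hypothetical multiplicative biases
sigma_t = 2.0                            # hypothetical signal standard deviation
sigma_eps = np.array([0.3, 0.5, 0.4])    # hypothetical error standard deviations

t = rng.normal(scale=sigma_t, size=N)
eps = rng.normal(scale=sigma_eps, size=(N, 3))   # mutually independent errors
X = beta * t[:, None] + eps

# Sample covariance of the observations (columns are observing systems)
sample_cov = np.cov(X, rowvar=False)

# Covariance predicted by the error model with no error cross-correlation
model_cov = np.outer(beta, beta) * sigma_t**2 + np.diag(sigma_eps**2)

print(np.abs(sample_cov - model_cov).max())  # small compared with the entries of model_cov
```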

Solving this system of equations when $\sigma_{\varepsilon_i, \varepsilon_j} \ne 0$ is only possible with more than three observing systems, at least three of which must have mutually uncorrelated errors ($\sigma_{\varepsilon_i, \varepsilon_j} = 0$) to satisfy assumption (2). The resulting solution gives the following estimates of the error variances and covariances:

$$\sigma_{\varepsilon_i}^2 = \sigma_{i}^2 - \frac{\sigma_{ik} \sigma_{il}}{\sigma_{kl}},\ {\rm where}\ \sigma_{\varepsilon_i, \varepsilon_k} = \sigma_{\varepsilon_i, \varepsilon_l} = \sigma_{\varepsilon_k, \varepsilon_l} = 0,$$

$$\sigma_{\varepsilon_i, \varepsilon_j} = \sigma_{ij} - \frac{\sigma_{ik} \sigma_{jl}}{\sigma_{kl}},\ {\rm where}\ \sigma_{\varepsilon_i, \varepsilon_k} = \sigma_{\varepsilon_j, \varepsilon_l} = \sigma_{\varepsilon_k, \varepsilon_l} = 0\ {\rm and}\ \sigma_{\varepsilon_i, \varepsilon_j} \ne 0,$$

where $\sigma_{ij} = {\rm Cov}(\boldsymbol{X}_i, \boldsymbol{X}_j)$.
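The two estimators can be checked on a covariance matrix built directly from the error model, which avoids sampling noise. The sketch below assumes four systems where only systems 2 and 3 (zero-based) have cross-correlated errors; the helper names `err_var` and `err_covar` and all numerical values are hypothetical.

```python
import numpy as np

beta = np.array([1.0, 0.8, 1.2, 0.9])    # hypothetical multiplicative biases
sigma_t2 = 4.0                           # hypothetical signal variance

# True error covariance matrix: systems 2 and 3 share correlated errors
err_cov_true = np.diag([0.09, 0.25, 0.16, 0.36])
err_cov_true[2, 3] = err_cov_true[3, 2] = 0.1

# Observation covariance implied by the affine error model
cov = np.outer(beta, beta) * sigma_t2 + err_cov_true


def err_var(cov, i, k, l):
    """Error variance of system i, using systems k and l with errors independent of i and each other."""
    return cov[i, i] - cov[i, k] * cov[i, l] / cov[k, l]


def err_covar(cov, i, j, k, l):
    """Error covariance of systems i and j, with errors of (i, k), (j, l), and (k, l) uncorrelated."""
    return cov[i, j] - cov[i, k] * cov[j, l] / cov[k, l]


# Systems 0 and 1 are independent of everything, so they serve as k and l
var_2 = err_var(cov, 2, 0, 1)
var_3 = err_var(cov, 3, 0, 1)
cov_23 = err_covar(cov, 2, 3, 0, 1)
print(var_2, var_3, cov_23)  # recovers 0.16, 0.36, and 0.1 up to floating-point rounding
```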

![Diagram showing the TC calculation with distributions](TC_diagram/TC_diagram.svg)

### Unbiased SNR

As shown by [McColl et al. 2014](https://doi.org/10.1002/2014GL061322), an unbiased signal-to-noise ratio (SNR) can be derived using the TC and, by extension, EC method. This SNR shows how each system's error variance relates to its signal variance. It is the ratio of the scaled signal variability $\beta_i^2\sigma_t^2$ to the error variance $\sigma_{\varepsilon_i}^2$:

$${\rm SNR} = \frac{\beta_i^2\sigma_t^2}{\sigma_{\varepsilon_i}^2}.$$

The $i = j$ case of the covariance equations above gives $\beta_i^2\sigma_t^2 = \sigma_i^2 - \sigma_{\varepsilon_i}^2$, so we can substitute and reduce:

$${\rm SNR} = \frac{\sigma_i^2 - \sigma_{\varepsilon_i}^2}{\sigma_{\varepsilon_i}^2} = \frac{\sigma_i^2}{\sigma_{\varepsilon_i}^2} - 1.$$
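The reduced form can be verified against the definition using a covariance matrix built from the error model. This is a minimal sketch with hypothetical parameter values; the conversion to decibels at the end is a common reporting convention added here as an assumption, not something this section requires.

```python
import numpy as np

beta = np.array([1.0, 0.8, 1.2])            # hypothetical multiplicative biases
sigma_t2 = 4.0                              # hypothetical signal variance
sigma_eps2 = np.array([0.09, 0.25, 0.16])   # hypothetical error variances

# Observation covariance for three systems with independent errors
cov = np.outer(beta, beta) * sigma_t2 + np.diag(sigma_eps2)

# Error variance of each system i from the other two systems (k, l)
triplets = [(0, 1, 2), (1, 0, 2), (2, 0, 1)]
err_var = np.array(
    [cov[i, i] - cov[i, k] * cov[i, l] / cov[k, l] for i, k, l in triplets]
)

# Reduced form: total variance over error variance, minus one
snr = np.diag(cov) / err_var - 1

# Definition: scaled signal variance over error variance
snr_definition = beta**2 * sigma_t2 / sigma_eps2

print(np.allclose(snr, snr_definition))  # True
snr_db = 10 * np.log10(snr)              # optional decibel form (assumed convention)
```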

Note that just like the error variances, this SNR estimate will be biased (i.e., larger than reality) if cross-correlated errors exist between the measurement system triplets in the EC estimate.

In [None]:
def ec_covar(X, corr_sets, return_snr=False):
    """
    Uses the covariance method of Extended Collocation (EC) to
    estimate the error variances for the three or more collocated
    inputs. Additionally, error covariances can be determined for
    M-3 inputs, where M is the number of observing systems.

    Parameters
    ----------
    X : ndarray, shape(N, M)
        The M collocated inputs.
    corr_sets : array-like, shape(M)
        An array-like input indicating which input observing systems
        are cross-correlated and which are independent. Independent
        systems should have unique values, while correlated systems
        should have matching values. For an M = 4 example where the
        last two systems are the only correlated systems, corr_sets
        could be set to [0, 1, 2, 2].
    return_snr : bool, optional
        If True, also return the estimated unbiased SNR of the
        observing systems.

    Returns
    -------
    ecovar_matrix : ndarray, shape(M, M)
        The estimated error covariance matrix of the observing systems.
        Observing systems with cross-correlated errors will have non-zero
        off diagonal terms. Observing system indices are the same as the
        input.
    snr : ndarray, shape(M), optional
        The estimated unbiased snr of the observing systems.
    """
    import numpy as np

    corr_sets = np.array(corr_sets)

    # Error checking X for shape and size
    if X.ndim != 2:
        raise ValueError(
            f"X must be a 2D array-like input. Current number of dimensions: {X.ndim}"
        )
    M = X.shape[1]
    if M < 3:
        raise ValueError(
            "X must have a second dimension (axis 1) of length 3 or more. "
            f"Current second dimension length: {M}"
        )

    # Error checking corr_sets for size and unique values
    if len(corr_sets) != M:
        raise ValueError(
            f"corr_sets must have a length of {M}. Current length: {len(corr_sets)}"
        )
    u, u_idc, u_cnt = np.unique(corr_sets, return_index=True, return_counts=True)
    if len(u) < 3:
        raise ValueError(
            "corr_sets must have at least 3 unique elements. "
            f"Current number of unique elements: {len(u)}"
        )

    # Compute the covariance matrix of the input data
    covar = np.cov(X, rowvar=False)

    # Compute the error covariance matrix from the data covariance matrix
    # Allocate from covar (floating point) so integer inputs are not truncated
    ecovar_matrix = np.zeros_like(covar)
    snr = np.zeros_like(covar, shape=M)
    for i in range(M):
        # Use the first two independent data sets from the current
        # ith (assumes there can be more than two)
        k, l = u_idc[u != corr_sets[i]][0:2]
        ecovar_matrix[i, i] = covar[i, i] - covar[i, k] * covar[i, l] / covar[k, l]
        snr[i] = covar[i, i] / ecovar_matrix[i, i] - 1

        # Estimate error covariance for those that are cross-correlated
        if u_cnt[u == corr_sets[i]][0] > 1:
            # Determine which other data sets are dependent on the current one
            dep_mask = corr_sets == corr_sets[i]
            dep_mask[i] = False

            # np.where returns a tuple of index arrays; take the one for axis 0
            for j in np.where(dep_mask)[0]:
                ecovar_matrix[i, j] = (
                    covar[i, j] - covar[i, k] * covar[j, l] / covar[k, l]
                )

    if return_snr:
        return ecovar_matrix, snr
    else:
        return ecovar_matrix

### Multi-Dimensional Function

The above function only works for a single spatial point within the collocated set of observing systems. Since most observing systems will either have a gridded map or a collection of spatial points, we can expand this function to calculate the error covariance matrix of the observations in a single function rather than having to loop over each spatial point. This simply requires expanding the computations along additional dimensions and computing the observational data covariance matrix using a function other than `cov`. In our case, we will use `einsum`, which can perform the needed matrix multiplication and summation for calculating the covariance matrix from:

$${\rm Cov}(\boldsymbol{X}_i, \boldsymbol{X}_j) = \frac{1}{N-1}(\boldsymbol{X}_i - \bar{X}_i)(\boldsymbol{X}_j - \bar{X}_j)^T,$$

where $N$ is the number of points in the time series.
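The sketch below illustrates this `einsum` covariance on random data with two trailing spatial dimensions and compares one grid point against `np.cov`. The array sizes and the variable name `grid` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

N, M = 50, 4          # time steps and observing systems (hypothetical sizes)
grid = (3, 2)         # trailing spatial dimensions (hypothetical)
X = rng.normal(size=(N, M) + grid)

# Remove the time mean of each system at each spatial point
deviation = X - X.mean(axis=0)

# Sum over time (a) for every pair of systems (b, c); the ellipsis
# carries the spatial dimensions through unchanged
covar = np.einsum("ab...,ac...->bc...", deviation, deviation) / (N - 1)
print(covar.shape)  # (4, 4, 3, 2)

# Compare against np.cov at a single spatial point
reference = np.cov(X[:, :, 1, 0], rowvar=False)
print(np.allclose(covar[:, :, 1, 0], reference))  # True
```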

Also for this expanded function, we will add an optional input that gives a user more control over how the independent measuring systems are combined. In the function for a single collocated point above, this is determined by which system within each correlated pair comes first in the input. However, a user may need more control when multiple correlated systems are input. For example, assume we have five measuring systems with two correlated pairs. Let's call them `A-E`, where `A` and `D` are correlated and `B` and `E` are correlated. With the above function, we could input the data sets in order as `(AD)(BE)C`, where the parentheses help distinguish correlated pairs. This would result in independent system calculations of `ABC`, `DBC`, and `AEC` to derive the needed error variances. However, what if we weren't sure that `AE` or `DB` were independent, but were sure `DE` are independent? With the above function, we couldn't account for this in the calculation. Therefore, we include an optional input allowing a user to specify which other two measuring systems should be considered independent from the one currently being calculated.
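To illustrate the `(AD)(BE)C` example, the sketch below reproduces the default selection of the two independent systems for each input, using the same `np.unique` indexing as the single-point function, and then writes out one possible override that relies on `DE` rather than `DB` or `AE`. The names `labels`, `default_pairs`, and `custom_pairs` are hypothetical.

```python
import numpy as np

# Input order (AD)(BE)C with matching values marking correlated systems
labels = np.array(["A", "D", "B", "E", "C"])
corr_sets = np.array([0, 0, 1, 1, 2])

# Index of the first system in each correlation group
u, u_idc = np.unique(corr_sets, return_index=True)

# Default: first member of the first two groups that differ from system i's group
default_pairs = [
    tuple(int(idx) for idx in u_idc[u != corr_sets[i]][:2])
    for i in range(len(labels))
]
for i, (k, l) in enumerate(default_pairs):
    print(labels[i], "->", labels[k], labels[l])
# A -> B C
# D -> B C
# B -> A C
# E -> A C
# C -> A B

# Hypothetical override: pair D with E (and E with D) instead of D with B or E with A
custom_pairs = [[2, 4], [3, 4], [0, 4], [1, 4], [0, 2]]
for i, (k, l) in enumerate(custom_pairs):
    print(labels[i], "->", labels[k], labels[l])
```

A list shaped like `custom_pairs`, with one pair of indices per input system, matches the `(M, 2)` layout such an optional input needs.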

In [None]:
def ec_covar_multi(X, corr_sets, return_snr=False, indep_idc=None):
    """
    Uses the covariance method of Extended Collocation (EC) to estimate the
    error variances for the three or more collocated inputs at all of their
    collocated spatial points. Additionally, error covariances can be
    determined for M-3 inputs, where M is the number of observing systems.

    Parameters
    ----------
    X : ndarray, shape(N, M, ...)
        The M collocated inputs with N time series elements.
    corr_sets : array_like, shape(M)
        An array-like input indicating which input observing systems
        are cross-correlated and which are independent. Independent systems
        should have unique values, while correlated systems should have
        matching values. For an M = 4 example where the last two systems
        are the only correlated systems, corr_sets could be set to [0, 1, 2, 2].
    return_snr : bool, optional
        If True, also return the estimated unbiased SNR of the observing systems.
    indep_idc : array_like, shape(M, 2), optional
        An optional array-like input indicating which two measuring systems
        should be considered independent from the one currently being calculated.
        This input contains M pairs of indices corresponding to the other two
        independent systems to be used in the error (co)variance calculation for
        the current ith out of M input. For an M = 5 example where the last two
        pairs of systems are correlated systems (i.e., corr_sets could be set
        to [0, 1, 1, 2, 2]), indep_idc could be set to
        [[1, 3], [0, 3], [0, 4], [0, 1], [0, 2]]. If not given, defaults to using
        the first two independent data sets as input in X (e.g. with M = 5 like
        above, [[1, 3], [0, 3], [0, 3], [0, 1], [0, 1]])

    Returns
    -------
    ecovar_matrix : ndarray, shape(M, M, ...)
        The estimated error covariance matrix of the observing systems at each
        collocated spatial point. Observing systems with cross-correlated errors
        will have non-zero off diagonal terms. Observing system indices are
        the same as the input.
    snr : ndarray, shape(M, ...), optional
        The estimated unbiased snr of the observing systems at each collocated
        spatial point.
    """
    import numpy as np

    # Convert to numpy array for later boolean indexing
    corr_sets = np.array(corr_sets)

    # Error checking X for shape and size
    if X.ndim < 2:
        raise ValueError(
            "X must be a 2D or greater array-like input. "
            f"Current number of dimensions: {X.ndim}"
        )
    M = X.shape[1]
    if M < 3:
        raise ValueError(
            "X must have a second dimension (axis 1) of length 3 or more. "
            f"Current second dimension length: {M}"
        )

    # Error checking corr_sets for size and unique values
    if len(corr_sets) != M:
        raise ValueError(
            f"corr_sets must have a length of {M}. Current length: {len(corr_sets)}"
        )
    u, u_idc, u_cnt = np.unique(corr_sets, return_index=True, return_counts=True)
    if len(u) < 3:
        raise ValueError(
            "corr_sets must have at least 3 unique elements. "
            f"Current number of unique elements: {len(u)}"
        )

    # Error checking indep_idc for shape and max/min values
    if indep_idc is not None:
        if np.array(indep_idc).shape != (M, 2):
            raise ValueError(
                f"indep_idc must have a shape of ({M}, 2). "
                f"Current shape: {np.array(indep_idc).shape}"
            )
        if (np.min(indep_idc) < 0) or (np.max(indep_idc) > (M - 1)):
            raise ValueError(
                f"indep_idc must have values between 0 and {M - 1}. "
                f"Current min and max values: {np.min(indep_idc)}, {np.max(indep_idc)}"
            )

    # Compute the covariance matrix of the input data
    deviation = X - X.sum(axis=0) / X.shape[0]
    # Use einsum subscript string for X with variable length. Since the first two
    # dimensions are the time series and observing systems, this leaves the
    # remaining spatial dimensions the same.
    covar = np.einsum("ab...,ac...->bc...", deviation, deviation) / (X.shape[0] - 1)

    # Compute the error covariance matrix from the data covariance matrix
    # Allocate from covar (floating point) so integer inputs are not truncated
    ecovar_matrix = np.zeros_like(covar)
    snr = np.zeros_like(covar, shape=covar.shape[1:])
    for i in range(M):
        if indep_idc is not None:
            k, l = indep_idc[i]
        else:
            # Use the first two independent data sets from the current
            # ith (assumes there can be more than two)
            k, l = u_idc[u != corr_sets[i]][0:2]

        ecovar_matrix[i, i, ...] = (
            covar[i, i, ...] - covar[i, k, ...] * covar[i, l, ...] / covar[k, l, ...]
        )
        snr[i, ...] = covar[i, i, ...] / ecovar_matrix[i, i, ...] - 1

        # Estimate error covariance for those that are cross-correlated
        if u_cnt[u == corr_sets[i]][0] > 1:
            # Determine which other data sets are dependent on the current one
            dep_mask = corr_sets == corr_sets[i]
            dep_mask[i] = False

            # np.where returns a tuple of index arrays; take the one for axis 0
            for j in np.where(dep_mask)[0]:
                ecovar_matrix[i, j, ...] = (
                    covar[i, j, ...]
                    - covar[i, k, ...] * covar[j, l, ...] / covar[k, l, ...]
                )

    if return_snr:
        return ecovar_matrix, snr
    else:
        return ecovar_matrix