# Extended Collocation Uncertainty Analysis

In the [Triple Collocation (TC) notebook](2_TC_application.ipynb#TC-Discussion), we saw that the ET error variances estimated with TC may be affected by error cross-correlations between data sets. Here, we will look into how expanding TC to include additional data sets can help estimate these cross-correlations.

In [None]:
import holoviews as hv
import hvplot.xarray
import panel as pn
import cartopy.crs as ccrs
import numpy as np
import xarray as xr
from xarray_einstats import linalg
from scipy.stats import percentileofscore
import itertools
import warnings
import os

First, we will run the Extended Collocation (EC) notebook to create our EC function. (See [the notebook](../TC/EC_function.ipynb) for details on the EC method.)

In [None]:
%run ../TC/EC_function.ipynb

## Combine Data Sets in Xarray
Next, we need to load in our data sets and limit them to a common date range. Since we need at least four data sets to utilize EC, we will restrict all data sets to the range beginning at the fourth oldest start date and ending at the fourth most recent end date. This choice saves memory while still retaining the largest amount of usable data. For combinations of data sets with a more restricted common date range, due to one data set covering a shorter period, we will limit the date range further at the time of the EC computation.

In [None]:
files = ['../Data/ssebop/ssebop_aet_regridded.nc',
         '../Data/gleam/gleam_aet.nc',
         '../Data/era5/era5_aet_regridded.nc',
         '../Data/nldas/nldas_aet_regridded.nc',
         '../Data/terraclimate/terraclimate_aet_regridded.nc',        
         '../Data/wbet/wbet_aet_regridded.nc',
         ]
dataset_name = ['SSEBop', 'GLEAM', 'ERA5', 'NLDAS', 'TerraClimate', 'WBET']

date_ranges = {}
for file, name in zip(files, dataset_name):
    ds_temp = xr.open_dataset(file, engine='netcdf4', chunks={'lon': -1, 'lat': -1, 'time': -1})
    date_ranges[name] = [ds_temp.time.min().values, ds_temp.time.max().values]

# Take the fourth oldest start and fourth most recent end dates
date_range = [np.sort(np.array(list(date_ranges.values()))[:, 0])[3],
              np.sort(np.array(list(date_ranges.values()))[:, 1])[-4]]
date_range
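
The index arithmetic above can be checked on a small stand-alone example. This is a minimal sketch using only the standard library; the data set labels `A` to `F` and their date ranges in `toy_ranges` are made up for illustration and are not the real coverage of the ET products.

```python
from datetime import date

# Hypothetical (start, end) coverage for six data sets; NOT the real ranges.
toy_ranges = {
    'A': (date(1980, 1, 1), date(2020, 12, 1)),
    'B': (date(1985, 1, 1), date(2022, 12, 1)),
    'C': (date(1990, 1, 1), date(2019, 12, 1)),
    'D': (date(2000, 1, 1), date(2021, 12, 1)),
    'E': (date(2003, 1, 1), date(2018, 12, 1)),
    'F': (date(1958, 1, 1), date(2023, 12, 1)),
}

# Sort start and end dates independently, oldest first.
starts = sorted(start for start, _ in toy_ranges.values())
ends = sorted(end for _, end in toy_ranges.values())

# Index 3 is the fourth oldest start; index -4 is the fourth most recent end.
toy_range = (starts[3], ends[-4])
print(toy_range[0].isoformat(), toy_range[1].isoformat())  # 1990-01-01 2020-12-01
```

Any group of four data sets has a common range that starts no earlier than the fourth oldest start and ends no later than the fourth most recent end, so trimming to this range does not discard data that a four-member combination could use.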

Using the date range, we can now combine all of the data sets into a single `xarray.Dataset` for easy computations.

In [None]:
def preprocess(ds):
    """
    Keep only the specified time range for each file.
    """
    return ds.sel(time=slice(date_range[0], date_range[1]))

ds = xr.open_mfdataset(files, engine='netcdf4', preprocess=preprocess, combine='nested', concat_dim='dataset_name')
ds = ds.assign_coords({'dataset_name': dataset_name})
ds.dataset_name.attrs['description'] = 'Dataset name'

# Need time as first index for TC computation
ds = ds.transpose('time', ...)
# The data set is less than 1GiB, so let's read it into memory vs keeping as a dask array
ds = ds.compute()
ds

## EC Estimation

As stated above, since we have data sets with different date ranges, we will need to trim the date ranges here before computing the EC error covariance matrix. This is slightly complicated, so let's make the date range selection its own function.

In [None]:
def common_date_range(ds, combo):
    """Return the common date slice of the datasets."""
    old_common_date = []
    recent_common_date = []
    for name in combo:
        old_common_date.append(date_ranges[name][0])
        recent_common_date.append(date_ranges[name][1])
    
    return slice(np.max(old_common_date), np.min(recent_common_date))
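
To illustrate what this function returns, here is a minimal stand-alone sketch of the same latest-start, earliest-end logic. The helper `toy_common_range` and the date ranges in `toy_ranges` are hypothetical and use plain `datetime.date` values rather than the `numpy` dates stored in `date_ranges`.

```python
from datetime import date

# Hypothetical (start, end) coverage; NOT the real ranges of the ET products.
toy_ranges = {
    'A': (date(1980, 1, 1), date(2020, 12, 1)),
    'B': (date(1985, 1, 1), date(2022, 12, 1)),
    'C': (date(1990, 1, 1), date(2019, 12, 1)),
    'D': (date(2000, 1, 1), date(2021, 12, 1)),
}


def toy_common_range(ranges, combo):
    """Return the (start, end) period covered by every data set in combo."""
    starts = [ranges[name][0] for name in combo]
    ends = [ranges[name][1] for name in combo]
    # The overlap begins at the latest start and finishes at the earliest end.
    return max(starts), min(ends)


start, end = toy_common_range(toy_ranges, ['A', 'B', 'C', 'D'])
print(start.isoformat(), end.isoformat())  # 2000-01-01 2019-12-01
```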

Like the TC error variance estimates in the [TC notebook](2_TC_application.ipynb#TC-Estimation), we will compute the error covariance matrix for all 90 possible combinations of four data sets (i.e., Quadruple Collocation or QC), since the computation is fast. The 90 combinations come from the 15 possible correlated pairs among the six data sets, each combined with the 6 possible pairs drawn from the remaining four data sets. Additionally, since we only want the subset of the error covariance matrix corresponding to the correlated data sets (i.e., the two data sets with non-zero covariances) and not the whole covariance matrix, the resulting subset of the EC covariance matrices will be reasonably small in memory (~375MiB).

In [None]:
# Generate a list of the combinations, need two correlated data sets, then two additional ones
combos = list(itertools.combinations(dataset_name, 2))
combos = [list(corr_combo + indep_combo) 
          for corr_combo in combos 
          for indep_combo in combos 
          if ((corr_combo[0] not in indep_combo) and (corr_combo[1] not in indep_combo))]
combos[0:10]
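
As a quick sanity check on the count of 90, the same construction can be run on its own with only `itertools`. This is a minimal sketch that copies the data set names from above; the variable names `pairs` and `toy_combos` are illustrative.

```python
import itertools

names = ['SSEBop', 'GLEAM', 'ERA5', 'NLDAS', 'TerraClimate', 'WBET']

# 15 candidate correlated pairs: C(6, 2).
pairs = list(itertools.combinations(names, 2))

# For each correlated pair, attach every pair drawn from the remaining
# four data sets: C(4, 2) = 6 independent pairs per correlated pair.
toy_combos = [list(corr + indep)
              for corr in pairs
              for indep in pairs
              if set(corr).isdisjoint(indep)]

print(len(pairs), len(toy_combos))  # 15 90
```

The order within each combination matters: the first two names are the pair treated as correlated, and the last two are the additional data sets.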

Now that we have our data set combinations, let's compute the EC error covariance matrices and extract the subset. We will do this for each season independently along with the full year. Additionally, we will normalize the subset of the covariance matrices to get the error cross-correlation matrices. (The season and full year will be denoted with the monthly abbreviations contained within the season or `All`, respectively.)

In [None]:
# We want to ignore all of the sqrt warnings with negative values
warnings.filterwarnings("ignore", category=RuntimeWarning)

# Create list of seasons
seasons = ['All'] + list(np.unique(ds.time.dt.season))

ec_covar_est = []
ec_covar_est_season = []


for combo in combos:
    for season in seasons:
        if season == 'All':
            ds_season = ds
        else:
            ds_season = ds.isel(time=(ds.time.dt.season == season))

        ds_combo = ds_season.sel(time=common_date_range(ds, combo), dataset_name=combo)
        
        ec_covar = ec_covar_multi(ds_combo.aet.data, corr_sets=[1, 1, 2, 3])
        
        # Extract the subset for our correlated pair (i.e., the first two indices)
        ec_covar = ec_covar[0:2, 0:2, ...]

        # We want the covariance matrix to be symmetric. Therefore, average
        # the matrix with its transpose, since the estimates of sig_01 and
        # sig_10 can differ, as discussed in our random example.
        covar = (ec_covar + np.swapaxes(ec_covar, 0, 1)) / 2
        # Compute the cross-correlation matrix
        d = np.moveaxis(np.diagonal(covar), -1, 0)
        rho = covar / np.sqrt(d[:, None, ...])
        rho /= np.sqrt(d[None, :, ...])
        # We only want the off diagonal as the diagonals will be 1
        rho = np.diagonal(rho, 1).squeeze()
       
        ec_covar_est_season.append(xr.Dataset(data_vars={'covar': (['dataset_combo', 'season', 'dataset_idx_1', 'dataset_idx_2', 'lat', 'lon'],
                                                                   covar[None, None, ...]),
                                                         'rho': (['dataset_combo', 'season', 'lat', 'lon'],
                                                                 rho[None, None, ...])},
                                              coords={'dataset_combo': [' '.join(combo)], 'season': [season],
                                                      'dataset_idx_1': [0, 1], 'dataset_idx_2': [0, 1], 
                                                      'lat': ds.lat, 'lon': ds.lon}))
    ec_covar_est.append(xr.concat(ec_covar_est_season, dim='season'))
    ec_covar_est_season = []

ec_covar_est = xr.concat(ec_covar_est, dim='dataset_combo')

# Convert layout to be by covariance pair vs long dataset_combo list
covar_pairs = [' '.join(combo) for combo in list(itertools.combinations(dataset_name, 2))]
covar_est_by_dataset_pair = []
est_pair = []
for covar_pair in covar_pairs:
    idx_loc = np.char.startswith(ec_covar_est.dataset_combo.data, covar_pair)
    dataset_loc = np.where(idx_loc)[0]
    covar_pair_datasets = ec_covar_est.isel(dataset_combo=dataset_loc)

    est_pair.append([combinations.replace(covar_pair, '').strip() for combinations in ec_covar_est.dataset_combo.data[dataset_loc]])

    covar_est_by_dataset_pair.append(xr.Dataset(data_vars={'covar': (['est_idx', 'season', 'covar_pair_idx_1',
                                                                      'covar_pair_idx_2', 'lat', 'lon'], 
                                                                     covar_pair_datasets.covar.data),
                                                           'rho': (['est_idx', 'season', 'lat', 'lon'],
                                                                     covar_pair_datasets.rho.data)},
                                     coords={'covar_pair': covar_pair, 'est_idx': np.arange(len(dataset_loc)),
                                             'season': seasons, 'covar_pair_idx_1': [0, 1], 'covar_pair_idx_2':[0, 1],
                                             'lat': ec_covar_est.lat, 'lon': ec_covar_est.lon}))

del ec_covar_est

covar_est_by_dataset_pair = xr.concat(covar_est_by_dataset_pair, dim='covar_pair')

covar_est_by_dataset_pair = covar_est_by_dataset_pair.assign_coords(est_pair=(['covar_pair', 'est_idx'], np.array(est_pair)))

covar_est_by_dataset_pair.covar.attrs['description'] = 'EC error covariance matrix estimate for the data sets in covar_pair.'
covar_est_by_dataset_pair.covar.attrs['units'] = 'mm2.month-2'
covar_est_by_dataset_pair.covar_pair.attrs['description'] = 'Correlated data set pair used in EC evaluation.'
covar_est_by_dataset_pair.covar_pair_idx_1.attrs['description'] = ('Index of the correlated data set in the covariance matrix '
                                                                   'along the first dimension as contained in covar_pair.')
covar_est_by_dataset_pair.covar_pair_idx_2.attrs['description'] = ('Index of the correlated data set in the covariance matrix '
                                                                   'along the second dimension as contained in covar_pair.')
covar_est_by_dataset_pair.est_idx.attrs['description'] = 'Index of the other two data sets used in the EC evaluation as contained in est_pair.'
covar_est_by_dataset_pair.season.attrs['description'] = ('Season of the year given by the first letter of each month within '
                                                         'the season. The full year is given by "All".')
covar_est_by_dataset_pair.est_pair.attrs['description'] = 'Names of the other two data sets used in the EC evaluation.'

covar_est_by_dataset_pair
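
The symmetrization and normalization steps inside the loop above can be illustrated on a tiny array. This is a minimal sketch with made-up numbers: `toy_covar` stands in for the `[0:2, 0:2, ...]` subset returned by the EC function and holds two pixels, one well behaved and one with a negative variance estimate.

```python
import numpy as np

# Hypothetical EC output subset with shape (2, 2, n_lat=1, n_lon=2).
toy_covar = np.empty((2, 2, 1, 2))
toy_covar[:, :, 0, 0] = [[4.0, 1.2], [0.8, 1.0]]   # well-behaved pixel
toy_covar[:, :, 0, 1] = [[4.0, 1.0], [1.0, -1.0]]  # negative variance estimate

# Average the matrix with its transpose so that sig_01 == sig_10.
covar = (toy_covar + np.swapaxes(toy_covar, 0, 1)) / 2

# Variances on the diagonal, moved to the front: shape (2, n_lat, n_lon).
d = np.moveaxis(np.diagonal(covar), -1, 0)

# Normalize by sqrt(var_i * var_j); sqrt of a negative variance yields NaN.
with np.errstate(invalid='ignore'):
    rho = covar / np.sqrt(d[:, None, ...]) / np.sqrt(d[None, :, ...])

rho_01 = rho[0, 1]  # off-diagonal element, shape (n_lat, n_lon)
print(rho_01)
```

The first pixel gives a cross-correlation of 0.5 (covariance 1.0 divided by the square root of 4.0 times 1.0), while the second pixel is `NaN` because one of its variance estimates is negative.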

Now, let's see how the resulting error covariance and cross-correlation estimates look. (Since the covariance matrix was made symmetric, we can simply extract the covariance from either off-diagonal element of the matrix.)

In [None]:
def ec_plt(covar_pair='SSEBop GLEAM', est_idx=0, season='All'):

    ec_data = covar_est_by_dataset_pair.sel(covar_pair=covar_pair, est_idx=est_idx, season=season,
                                            covar_pair_idx_1=1, covar_pair_idx_2=0)
    
    est_pairs = str(covar_est_by_dataset_pair.est_pair.sel(covar_pair=covar_pair, est_idx=est_idx).data)
    plt = (ec_data.covar.hvplot(geo=True, coastline=True, clim=(-500, 500), cmap='RdBu',
                               title='Error Covariance (other EC datasets: '+est_pairs+')'
                              ).opts(frame_width=500)
           + ec_data.rho.hvplot(geo=True, coastline=True, clim=(-1, 1), cmap='RdBu',
                                title='Error Cross-Correlation (other EC datasets: '+est_pairs+')'
                               ).opts(frame_width=500))

    return plt

covar_pair_widget = pn.widgets.Select(name="covar_pair", value="SSEBop GLEAM", options=list(covar_est_by_dataset_pair.covar_pair.values))
est_idx_widget = pn.widgets.IntSlider(name="est_idx", start=0, end=5, step=1, value=0)
season_widget = pn.widgets.Select(name="season", value="All", options=['All', 'DJF', 'MAM', 'JJA', 'SON'])

bound_plot = pn.bind(ec_plt, covar_pair=covar_pair_widget, season=season_widget, est_idx=est_idx_widget)

pn.Column(covar_pair_widget, est_idx_widget, season_widget, bound_plot)

## EC Discussion

Looking at the error covariances, we can see a large variation in how the covariance estimates change with the other independent pairs. For example, the covariance of SSEBop and WBET seems to change only in intensity across CONUS while keeping relatively similar spatial patterns, while the covariance of GLEAM and ERA5 changes in both intensity and spatial pattern. Additionally, certain regions of each covariance map show consistently positive covariances, while others show consistently negative covariances. Only one data set pair, ERA5 and TerraClimate, shows consistently net positive error covariances. This likely indicates that these two data sets have errors that are indeed cross-correlated.

Interestingly though, we also find some data sets that have consistently net negative covariances, for example SSEBop and TerraClimate along with ERA5 and WBET. This is highly unexpected from an input data perspective, as it seems to indicate that as the error in one data set increases, the error in the other decreases. Typically, we would expect the data sets to have net positive covariances due to commonalities in the input data or the calculation method propagating the same errors into the data sets. Regional negative covariances could be reasonable, as one measurement system could be optimized for a geographic region while the other struggles in that region. However, for SSEBop and TerraClimate, almost the entire map shows negative covariances. This is related to the fact that these two data sets have the largest error variances compared to the other four data sets, which causes issues when estimating their covariances.

As for the error cross-correlations, there are large swaths of `NaN` values in certain data set covariance pairs. This is due to the `NaN`s in the error variance estimates as discussed in the [TC notebook](2_TC_application.ipynb#TC-Discussion), since the error cross-correlation is calculated as:

$$\rho_{\varepsilon_i, \varepsilon_l} = \frac{\sigma_{\varepsilon_i, \varepsilon_l}}{\sqrt{\sigma_{\varepsilon_i}^2\sigma_{\varepsilon_l}^2}}$$

where $\rho_{\varepsilon_i, \varepsilon_l}$ is the error cross-correlation, $\sigma_{\varepsilon_i, \varepsilon_l}$ is the error covariance, and $\sigma_{\varepsilon_i}^2$ and $\sigma_{\varepsilon_l}^2$ are the error variances. Again, this issue is related to SSEBop and TerraClimate having error variances approximately an order of magnitude larger than those of the other data sets. This causes negative error variance (and covariance) estimates, and the square root of a negative variance results in `NaN`s across the cross-correlation maps. Therefore, we will likely need to average over the data set combinations to get a single map of the error covariance and cross-correlation for each covariance pair. Additionally, this will make looking at the covariance estimates less overwhelming, as it will give one covariance estimate per covariance pair.

In [None]:
mean_covar_est = covar_est_by_dataset_pair.covar.mean(dim='est_idx', skipna=True, keep_attrs=True)
mean_covar_est.name = 'mean_covar'
mean_covar_est.attrs['description'] = 'Mean EC error covariance estimate for all possible combinations with other datasets.'
median_covar_est = covar_est_by_dataset_pair.covar.median(dim='est_idx', skipna=True, keep_attrs=True)
median_covar_est.name = 'median_covar'
median_covar_est.attrs['description'] = 'Median EC error covariance estimate for all possible combinations with other datasets.'
std_covar_est = covar_est_by_dataset_pair.covar.std(dim='est_idx', ddof=1, skipna=True, keep_attrs=True)
std_covar_est.name = 'std_of_covar_std'
std_covar_est.attrs['description'] = ('Standard deviation of the EC error covariance estimate for all '
                                      'possible combinations with other datasets.')

mean_rho_est = covar_est_by_dataset_pair.rho.mean(dim='est_idx', skipna=True, keep_attrs=True)
mean_rho_est.name = 'mean_rho'
mean_rho_est.attrs['description'] = 'Mean EC error cross-correlation estimate for all possible combinations with other datasets.'
median_rho_est = covar_est_by_dataset_pair.rho.median(dim='est_idx', skipna=True, keep_attrs=True)
median_rho_est.name = 'median_rho'
median_rho_est.attrs['description'] = 'Median EC error cross-correlation estimate for all possible combinations with other datasets.'
std_rho_est = covar_est_by_dataset_pair.rho.std(dim='est_idx', ddof=1, skipna=True, keep_attrs=True)
std_rho_est.name = 'std_of_rho_std'
std_rho_est.attrs['description'] = ('Standard deviation of the EC error cross-correlation estimate for all '
                                    'possible combinations with other datasets.')

count_covar_est = np.isfinite(covar_est_by_dataset_pair.covar).sum(dim='est_idx')
count_covar_est.name = 'counts'
count_covar_est.attrs['description'] = ('Number of datasets used in the average EC error covariance '
                                        'estimates (i.e., number of finite values in a given pixel).')
count_covar_est.attrs['units'] = 'counts'

# Compile these DataSets into one and save for use in notebook 4_Bias
ec_est_averages = xr.merge([mean_covar_est, median_covar_est, std_covar_est, mean_rho_est,
                            median_rho_est, std_rho_est, count_covar_est], join='exact')

if not os.path.isfile('../Data/compiled_EC_avg_covar_errs.nc'):
    _ = ec_est_averages.to_netcdf(path='../Data/compiled_EC_avg_covar_errs.nc', format='NETCDF4', engine='netcdf4')

Now that we have our median error covariance and cross-correlation estimates, let's plot them and see how they look.

In [None]:
# Select the covariance element from the matrix
ec_est_averages_co = ec_est_averages.sel(covar_pair_idx_1=1, covar_pair_idx_2=0)

plt = ec_est_averages_co.median_covar.hvplot(
        groupby=['covar_pair', 'season'], geo=True, coastline=True, cmap='RdBu',
        clim=(-500,500), title='Median Error Covariance'
      ).opts(frame_width=500) + \
      ec_est_averages_co.median_rho.hvplot(
          groupby=['covar_pair', 'season'], geo=True, coastline=True, cmap='RdBu', 
          clim=(-1,1), title='Median Error Cross-Correlation'
      ).opts(frame_width=500) + \
      np.abs(ec_est_averages_co.median_covar/ec_est_averages_co.std_of_covar_std).hvplot(
          groupby=['covar_pair', 'season'], geo=True, coastline=True, cmap='RdBu',
          clim=(0.01,100), title='SNR of Error Covariance', logz=True
      ).opts(frame_width=500) + \
      np.abs(ec_est_averages_co.median_rho/ec_est_averages_co.std_of_rho_std).hvplot(
          groupby=['covar_pair', 'season'], geo=True, coastline=True, cmap='RdBu',
          clim=(0.01,100), title='SNR of Error Cross-Correlation', logz=True
      ).opts(frame_width=500)

pn.panel(plt.cols(2), widget_location='top')
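
For a single pixel, the summary statistics and the signal-to-noise ratio (SNR) plotted above reduce to a few `numpy` calls. This is a minimal sketch with made-up values: `estimates` stands in for the six covariance estimates along `est_idx`, and the `nan*` functions are used here to mimic the `skipna=True` behavior of the `xarray` reductions.

```python
import numpy as np

# Hypothetical covariance estimates (mm2.month-2) for one pixel from the six
# independent pairs; NaN marks an estimate that is missing for that pixel.
estimates = np.array([10.0, 12.0, np.nan, 8.0, np.nan, 14.0])

mean = np.nanmean(estimates)                # ignores NaN entries
median = np.nanmedian(estimates)
std = np.nanstd(estimates, ddof=1)          # sample standard deviation
count = int(np.isfinite(estimates).sum())   # number of usable estimates

# SNR as plotted above: |median| relative to the spread across estimates.
snr = abs(median) / std
print(mean, median, count, round(std, 3), round(snr, 3))
```

Here four of the six estimates are finite, the mean and median are both 11, and the sample standard deviation is about 2.58, which gives an SNR of roughly 4.3. An SNR well below 1 would indicate that the estimates disagree by more than the size of the median itself.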

From these plots, we can see that some data sets may be correlated with each other. To visualize this clearly, we can plot the pixels of each covariance map as a histogram, with one histogram per independent data set pair plus one each for the mean and median maps, where one count is a pixel. This will allow us to see if a whole map shows a net error covariance or if the covariances are evenly distributed around zero, which could indicate minimal cross-correlation of errors.

In [None]:
# Select the covariance element from the matrix
covar_est_by_dataset_pair = covar_est_by_dataset_pair.sel(covar_pair_idx_1=1, covar_pair_idx_2=0)

def histogram_plts(covar_pair='SSEBop GLEAM', season='All'):
    da_pair = []
    for i in range(6):
        da_pair.append(covar_est_by_dataset_pair.covar.sel(covar_pair=covar_pair, est_idx=i, season=season))
        da_pair[i].name = covar_est_by_dataset_pair.est_pair.sel(covar_pair=covar_pair, est_idx=i).data.item()
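    # The mean and median still hold the full 2x2 matrix, so select the covariance element
    da_mean = mean_covar_est.sel(covar_pair=covar_pair, season=season,
                                 covar_pair_idx_1=1, covar_pair_idx_2=0)
    da_median = median_covar_est.sel(covar_pair=covar_pair, season=season,
                                     covar_pair_idx_1=1, covar_pair_idx_2=0)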
    da_mean.name = 'Mean'
    da_median.name = 'Median'

    plt = da_pair[0].hvplot.hist(bins=50, bin_range=(-500,500), title='EC Error Covariance Distribution of '+covar_pair, 
                           xlabel='Error Covariance (mm2.month-2)', ylabel='Counts', alpha=1, normed=True)

    for i in range(1, 6):
        plt *= da_pair[i].hvplot.hist(bins=50, bin_range=(-500,500), alpha=1, normed=True)
        
    plt *= da_mean.hvplot.hist(bins=50, bin_range=(-500,500), alpha=1, normed=True)
    plt *= da_median.hvplot.hist(bins=50, bin_range=(-500,500), alpha=1, normed=True)

    return plt

def percent_table(covar_pair='SSEBop GLEAM', season='All'):
    da_pair = []
    for i in range(6):
        da_pair.append(covar_est_by_dataset_pair.covar.sel(covar_pair=covar_pair, est_idx=i, season=season))
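    # The mean and median still hold the full 2x2 matrix, so select the covariance element
    da_mean = mean_covar_est.sel(covar_pair=covar_pair, season=season,
                                 covar_pair_idx_1=1, covar_pair_idx_2=0)
    da_median = median_covar_est.sel(covar_pair=covar_pair, season=season,
                                     covar_pair_idx_1=1, covar_pair_idx_2=0)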

    percentiles = ([percentileofscore(da.data.flatten(), 0, kind='strict', 
                                     nan_policy='omit') for da in da_pair]
                   + [percentileofscore(da_mean.data.flatten(), 0, kind='strict',  nan_policy='omit')]
                   + [percentileofscore(da_median.data.flatten(), 0, kind='strict',  nan_policy='omit')])
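    # The trailing space pads 'Neutral ' to 8 characters so that the fixed-width
    # string array can hold 'Positive' and 'Negative' without truncation
    correlation_str = np.array(['Neutral '] * len(percentiles))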
    correlation_str[np.array(percentiles) <= 35] = 'Positive'
    correlation_str[np.array(percentiles) >= 65] = 'Negative'
    table = hv.Table({'Independent Pair': list(covar_est_by_dataset_pair.est_pair.sel(covar_pair=covar_pair).data) + ['Mean', 'Median'], 
                      'Percentile of 0': np.round(percentiles, 2),
                      'Net Correlation': correlation_str}, ['Independent Pair', 'Percentile of 0', 'Net Correlation']).opts(width=350)

    return table


covar_pair_widget = pn.widgets.Select(name="covar_pair", value="SSEBop GLEAM", options=list(covar_est_by_dataset_pair.covar_pair.values))
season_widget = pn.widgets.Select(name="season", value="All", options=['All', 'DJF', 'MAM', 'JJA', 'SON'])

bound_plot = pn.bind(histogram_plts, covar_pair=covar_pair_widget, season=season_widget)
bound_table = pn.bind(percent_table, covar_pair=covar_pair_widget, season=season_widget)

pn.Column(covar_pair_widget, season_widget, pn.Row(bound_plot, bound_table))
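
The classification in the table can be sketched for a single flattened map. This is a minimal stand-alone example: the helper `net_correlation` and the values in `toy_map` are hypothetical, and the percentage of pixels strictly below zero is computed directly with `numpy`, which is what `percentileofscore(..., 0, kind='strict', nan_policy='omit')` reports.

```python
import numpy as np


def net_correlation(values, low=35.0, high=65.0):
    """Return (percentile of 0, label) for a flattened covariance map."""
    values = np.asarray(values, dtype=float)
    valid = values[~np.isnan(values)]  # drop NaN pixels
    # Percentage of pixels with a covariance strictly below zero.
    pct = 100.0 * np.count_nonzero(valid < 0) / valid.size
    if pct <= low:
        label = 'Positive'   # most pixels lie above zero
    elif pct >= high:
        label = 'Negative'   # most pixels lie below zero
    else:
        label = 'Neutral'
    return pct, label


# Hypothetical covariance map with one missing pixel.
toy_map = np.array([-5.0, 3.0, 7.0, np.nan, 12.0, 20.0, -1.0, 9.0, 15.0, 4.0, 6.0])
pct, label = net_correlation(toy_map)
print(pct, label)  # 20.0 Positive
```

A low percentile of zero means that most of the distribution sits above zero, which is why values at or below 35 are labeled as a net positive correlation and values at or above 65 as a net negative one.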

From these results, we can see that some data sets do indeed have net positive covariances. For example, SSEBop and NLDAS, along with ERA5 and TerraClimate, have strong net positive covariances. This indicates that these data sets likely share some modeling assumption or input data that causes them to have correlated errors.

Interestingly though, as we noticed above in the unaggregated maps, we also find some data sets that have net negative covariances, for example SSEBop and TerraClimate along with ERA5 and WBET. As stated above, this, while initially unexpected, likely indicates that one data set may have been optimized for a specific geographic region, while the other struggles in that region. However, for SSEBop and TerraClimate, almost the entire map shows negative covariances. This is related to the fact that these two data sets have the largest error variances compared to the other four data sets, which causes issues when estimating their covariances.