# Triple Collocation Uncertainty Analysis

Now that we have all of our monthly ET datasets spatially collocated, we are ready to perform a Triple Collocation (TC) analysis on the common date ranges.

In [None]:
import hvplot.xarray
import panel as pn
import cartopy.crs as ccrs
import numpy as np
import xarray as xr
import itertools
import warnings

First, we will run the Extended Collocation notebook to create our TC (or EC) function, which processes every spatial point simultaneously.

In [None]:
%run TC/EC_function.ipynb

## Combine Datasets in Xarray
Next, we need to load our datasets and limit them to a common date range. Since we need at least three datasets to use TC, we will restrict the date range to begin at the third-oldest start date and end at the third-most-recent end date. This choice saves memory while retaining the largest amount of usable data. For triplets with a more restricted date range, due to one dataset covering a shorter period, we will limit the date range further at the time of the TC computation.

In [None]:
files = ['ssebop/ssebop_aet_regridded.nc',
         'gleam/gleam_aet.nc',
         'era5/era5_aet_regridded.nc',
         'nldas/nldas_aet_regridded.nc',
         'terraclimate/terraclimate_aet_regridded.nc',        
         'wbet/wbet_aet_regridded.nc',
         ]
dataset_name = ['SSEBop', 'GLEAM', 'ERA5', 'NLDAS', 'TerraClimate', 'WBET']
dataset_abrv = ['S', 'G', 'E', 'N', 'T', 'W']

date_ranges = np.zeros((2, len(files)), dtype='datetime64[ns]')
for i, file in enumerate(files):
    # Use a name that does not shadow the built-in `set`
    ds_file = xr.open_dataset(file, engine='netcdf4', chunks={'lon': -1, 'lat': -1, 'time': -1})
    date_ranges[:, i] = [ds_file.time.min().values, ds_file.time.max().values]

# Take the third oldest start and third most recent end dates
date_range = [np.sort(date_ranges[0, :])[2], np.sort(date_ranges[1, :])[3]]
date_range
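
To make the selection rule concrete, here is a small standalone sketch with made-up start and end dates. The dates and variable names are illustrative assumptions, not the actual coverage of the six datasets, and plain `datetime.date` objects stand in for the `datetime64` values used above. The sorting logic is the same: the third-oldest start is index 2 of the ascending starts, and the third-most-recent end is index -3 of the ascending ends (which equals index 3 when there are six datasets).

```python
from datetime import date

# Hypothetical coverage for six datasets (illustrative values only)
starts = [date(2000, 1, 1), date(1980, 1, 1), date(1979, 1, 1),
          date(1979, 1, 1), date(1958, 1, 1), date(1895, 1, 1)]
ends = [date(2020, 12, 1), date(2018, 12, 1), date(2021, 12, 1),
        date(2019, 12, 1), date(2020, 12, 1), date(2018, 12, 1)]

# Ascending sort: index 2 is the third-oldest start,
# index -3 is the third-most-recent end
third_oldest_start = sorted(starts)[2]
third_most_recent_end = sorted(ends)[-3]

print(third_oldest_start, third_most_recent_end)
```

Indexing the end dates with `-3` rather than `3` keeps the rule valid if the number of datasets changes.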

Using the date range, we can now combine all of the datasets into a single `xarray` Dataset for easy computation.

In [None]:
def preprocess(ds):
    """
    Keep only the specified time range for each file.
    """
    return ds.sel(time=slice(date_range[0], date_range[1]))

ds = xr.open_mfdataset(files, engine='netcdf4', preprocess=preprocess, combine='nested', concat_dim='dataset_name')
ds = ds.assign_coords({'dataset_name': dataset_name})
ds.dataset_name.attrs['description'] = 'Dataset name'

# Need time as first index for TC computation
ds = ds.transpose('time', ...)
# The data set is less than 1 GiB, so let's read it into memory vs keeping it as a dask array.
# `compute()` returns a new object, so assign the result back to `ds`.
ds = ds.compute()
ds

## Time Series Exploration
In case we want to explore the time series of each pixel later, let's make an interactive figure where we can select the latitude and longitude and plot the time series.

In [None]:
def create_timseries(lat=40, lon=-90):
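    # Blank map with a single marked pixel showing the selected location
    # (named `loc_map` to avoid shadowing the built-in `map`)
    loc_map = ds.aet.isel(time=0, dataset_name=0)
    loc_map = loc_map * 0
    loc_map.loc[dict(lat=lat, lon=lon)] = 1
    plt = loc_map.hvplot(geo=True, coastline=True, clim=(0, 1), title='Time Series Location (Red dot indicates current pixel)', colorbar=False, cmap='kr')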
    plt = plt + ds.aet.sel(lat=lat, lon=lon, method='nearest').hvplot(groupby='dataset_name',
                                                                      title="Datasets' ET Time Series").overlay().opts(legend_position='right')

    return plt

lat_widget = pn.widgets.FloatSlider(name='lat', start=ds.lat.min().item(), end=ds.lat.max().item(), step=0.25, value=ds.lat.min().item())
lon_widget = pn.widgets.FloatSlider(name='lon', start=ds.lon.min().item(), end=ds.lon.max().item(), step=0.25, value=ds.lon.min().item())

bound_plot = pn.bind(create_timseries, lat=lat_widget, lon=lon_widget)

pn.Column(lat_widget, lon_widget, bound_plot)

## TC Estimation

Time to compute the TC uncertainty estimates. To do that, we first need to decide on datasets that have "independent" errors in order to group them together into TC sets.

Here is a table of the data and method used for calculating each ET data set:

| Dataset | SSEBop | GLEAM v3b | ERA5 | NLDAS | TerraClimate | WBET |
| ------  |  ----  | -----     | ---- | ----- | ----         | ---- |
| Resolution | 1 km | 0.25 deg | 0.1 deg (9 km) | 0.125 deg | 0.04166 deg | 800 m |
| Measurement System | Satellite | Satellite | Reanalysis | Land Surface Model | Satellite + Water Balance | Water Balance + Satellite |
| Calculation Method | "hot"/"cold" reference pixels | Priestley and Taylor equation; Gash's analytical model; Soil moisture based | Reanalysis | Land Surface Model | Penman-Monteith equation + Thornthwaite-Mather WBM | Water Balance, Meteorological/Climate regression, ensemble averaging |
| Input Data | **SRTM** elevn; **PRISM** Ta; **MODIS** Ts, emissivity, albedo, and NDVI; **GDAS** ETo | **CERES** radiation; **TMPA** precip; **AIRS** Ta; **GLOBSNOW** snow-water equiv; **CCI** vegetation optical depth; **GLDAS** and **CCI** Soil moisture; **MODIS** GVCF (global vegetation continuous fields); **IGBP-DIS** soil properties; **CGLFRD** lightning flash rate for rainfall inference | **CHTESSEL** Land surface model using model cycle Cy45r1 (2018) | **NARR** (North American Regional Reanalysis) atmospheric forcing data; **PRISM** precip | **WorldClim** Ta, vapor, precip, solar radiation, wind (Uses **MODIS** Ts, cloud cover; **SRTM** elevn); **CRU** Ts4.0, Tmax, Tmin, vapor, precip, Ta; **JRA-55** Ta, vapor, precip, radiation, wind | **PRISM** precip, mean Ta, max Ta, min Ta; **USGS** water use irrigation, national elevation dataset, NWIS gage II discharge; **EROS** land cover (1938-1999); **Landsat** NLCD land cover (2000-2018); **gridMET** wind; **Koppen-Geiger** climate classification; **Fenneman & Johnson** physiographic province classification; **EPA** level III ecoregions; **STATSGO2** soil saturated hydraulic conductivity, porosity, field capacity, thickness, available water capacity |

From this table, we can group the datasets into measurement systems that "should" be independent:

1) Satellite
2) Reanalysis
3) Land Surface Model (LSM)
4) Water Balance

Since TerraClimate and WBET are both mixed water balance and satellite datasets, we will treat them as just water balance for now. This gives us 12 possible combinations of datasets with one member per measurement system. However, since the computation is fast and the resulting TC error estimates will be small in memory (~500 KiB), we will just compute all 20 three-dataset combinations and can filter them later.

In [None]:
# Generate a list of the combinations
combos = list(itertools.combinations(dataset_abrv, 3))
combos = [list(combo) for combo in combos]
combos
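
As a quick check of the counts quoted above, the following sketch enumerates all triplets and keeps only those whose members come from three different measurement systems. The `system` mapping is an assumption that encodes the grouping from the list above (with TerraClimate and WBET treated as water balance); it is not part of the TC function itself.

```python
import itertools

# Measurement-system group for each abbreviation, following the grouping above
# (TerraClimate and WBET are treated as water balance)
system = {'S': 'satellite', 'G': 'satellite', 'E': 'reanalysis',
          'N': 'lsm', 'T': 'water_balance', 'W': 'water_balance'}

all_combos = [list(c) for c in itertools.combinations('SGENTW', 3)]

# Keep triplets whose three members come from three different systems
independent_combos = [c for c in all_combos
                      if len({system[a] for a in c}) == 3]

print(len(all_combos), len(independent_combos))  # prints "20 12"
```

The same list comprehension could be applied later to the `dataset_combo` coordinate to filter the computed estimates.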

Since we have datasets with different date ranges, we will need to trim the date ranges here before computing the TC error variances. This is slightly complicated, so let's make the date range selection its own function.

In [None]:
def common_date_range(ds, combo):
    """Return the common date slice of the datasets."""
    old_common_date = []
    recent_common_date = []
    for abrv in combo:
        idx = [j for j in range(len(ds['dataset_name'])) if abrv == ds['dataset_name'][j]][0]
        old_common_date.append(date_ranges[0, idx])
        recent_common_date.append(date_ranges[1, idx])
    
    return slice(np.max(old_common_date), np.min(recent_common_date))
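
The function above takes the latest start date and the earliest end date among the triplet members. A minimal standalone sketch of that intersection is shown below, using a hypothetical `coverage` dictionary with made-up dates. The check for an empty overlap is an addition in this sketch and is not present in `common_date_range`.

```python
from datetime import date

# Hypothetical (start, end) coverage per abbreviation (illustrative values only)
coverage = {'S': (date(2000, 1, 1), date(2020, 12, 1)),
            'G': (date(1980, 1, 1), date(2018, 12, 1)),
            'E': (date(1979, 1, 1), date(2021, 12, 1))}

def overlap(combo):
    """Return the (start, end) window shared by every dataset in combo."""
    start = max(coverage[a][0] for a in combo)  # latest start
    end = min(coverage[a][1] for a in combo)    # earliest end
    if start > end:
        raise ValueError(f'No common dates for {combo}')
    return start, end

window = overlap(['S', 'G', 'E'])
print(window)
```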

Now that we have the ability to select the common date range, let's compute the TC error standard deviations.

In [None]:
# We want to ignore all of the sqrt and log warnings with negative values
warnings.filterwarnings("ignore", category=RuntimeWarning)

# Override the name to the abbreviation for easier indexing
ds['dataset_name'] = dataset_abrv

tc_est = []
ndates_per_combo = []
for combo in combos:
    ds_combo = ds.sel(time=common_date_range(ds, combo), dataset_name=combo)
    ndates_per_combo.append(len(ds_combo.time))
    
    tc_covar, snr_temp = ec_covar_multi(ds_combo.aet.data, corr_sets=[1, 2, 3])

    tc_est.append(xr.Dataset(data_vars={'error': (['dataset_combo', 'combo_idx', 'lat', 'lon'], 
                                                  np.sqrt(np.diagonal(tc_covar)).transpose((2, 0, 1))[None, ...]),
                                        'snr': (['dataset_combo', 'combo_idx', 'lat', 'lon'], (10 ** np.log10(snr_temp[None, ...])))},
                             coords={'dataset_combo': [''.join(combo)], 'combo_idx': [0, 1, 2], 'lat': ds.lat, 'lon': ds.lon}))

tc_est = xr.concat(tc_est, dim='dataset_combo')

tc_est.error.attrs['description'] = 'TC error estimate for the dataset_combo triplet.'
tc_est.snr.attrs['description'] = 'TC unbiased SNR estimate for the dataset_combo triplet.'
tc_est.dataset_combo.attrs['description'] = ('Dataset combination used in TC evaluation '
                                             '(abbreviations: T=TerraClimate, E=ERA5, '
                                             'N=NLDAS, G=GLEAM, W=WBET, S=SSEBop).')
tc_est.combo_idx.attrs['description'] = 'String index of "dataset_combo" coordinate associated with the dataset.'
tc_est.error.attrs['units'] = 'mm.month-1'
tc_est = tc_est.compute()

# Reset the name back from the abbreviation
ds['dataset_name'] = dataset_name

tc_est

Let's see how the resulting error estimates look. As a reminder, the dataset abbreviations are:
**T=TerraClimate, E=ERA5, N=NLDAS, G=GLEAM, W=WBET, S=SSEBop**. We can also look at the unbiased SNR estimates that we generated.

Since we suppressed the `sqrt` and `log` runtime warnings, we can expect to see `NaN`s throughout the maps wherever the TC calculation resulted in negative values. This works as intended: any negative error variance should be flagged as invalid, which is done here with `NaN`.

In [None]:
plt = tc_est.error.hvplot(groupby=['dataset_combo', 'combo_idx'], geo=True, 
                          coastline=True, clim=(0, 50)).opts(frame_width=500) + \
tc_est.snr.hvplot(groupby=['dataset_combo', 'combo_idx'], geo=True,
                                    coastline=True, clim=(0.1, 50), cnorm='log').opts(frame_width=500)

pn.panel(plt, widget_location='top')

## TC Discussion

Some of the datasets (mainly GLEAM and ERA5, and to a lesser extent NLDAS and WBET) have large swaths of `NaN` values caused by negative variances. This is typically caused by one of two things: (1) covariances in the errors, which are assumed to be absent, or (2) two datasets having error variances approximately an order of magnitude larger than the third.

Let's begin exploring this by checking which combinations of datasets regularly result in large `NaN` regions. To do this, we can simply compute the ratio of the number of finite values produced by the TC estimate to the number of finite values in the original data. If certain dataset combinations typically result in a small fraction, it is likely that these datasets have correlated errors.

In [None]:
nfinite = np.isfinite(ds.aet.sel(time='2010-07')).sum(dim=['lat', 'lon']).compute()
nNaN = np.isnan(ds.aet.sel(time='2010-07')).sum(dim=['lat', 'lon']).compute()
nfinite['dataset_name'] = dataset_abrv
nNaN['dataset_name'] = dataset_abrv

for combo in tc_est.dataset_combo.data:
    temp = tc_est.error.sel(dataset_combo=combo)
    tc_nfinite = np.isfinite(temp).sum(dim=['lat', 'lon'])

    expand_combo = [i for i in combo]
    min_ref_finite = nfinite.sel(dataset_name=expand_combo).min().data

    # This was printed and manually sorted below for investigative purposes.
    # print(expand_combo, tc_nfinite.data/min_ref_finite)
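
The ratio computed in the loop above can be illustrated on a toy grid. The sketch below uses small hypothetical 2x3 arrays stored as nested lists (not the real data) and a helper named `n_finite`, which is an assumption of this example. The reference count is the smallest number of finite cells among the three inputs, mirroring `min_ref_finite` above.

```python
import math

nan = float('nan')

def n_finite(grid):
    """Count finite cells in a 2-D list of floats."""
    return sum(math.isfinite(v) for row in grid for v in row)

# Hypothetical 2x3 grids: input data for three datasets and one TC error map
inputs = {'S': [[1.0, 2.0, nan], [3.0, 4.0, 5.0]],
          'G': [[1.0, nan, nan], [3.0, 4.0, 5.0]],
          'E': [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]}
tc_error = [[0.5, nan, nan], [nan, 0.7, 0.6]]

# Reference count: the dataset with the fewest finite cells limits the triplet
min_ref_finite = min(n_finite(g) for g in inputs.values())
fraction = n_finite(tc_error) / min_ref_finite
print(f'finite fraction: {fraction:.2f}')  # prints "finite fraction: 0.75"
```

A fraction near 1 means TC produced a valid estimate almost everywhere the inputs had data; a small fraction points to widespread negative variances.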

From this fraction comparison, we can see that GLEAM does not pair well with ERA5 or TerraClimate, ERA5 does not pair well with TerraClimate, and NLDAS does not pair well with SSEBop or WBET. (WBET's bad pairings are slightly more ambiguous, so we will leave them for now.) This means that any triplet containing two of GLEAM, ERA5, and TerraClimate fails, and NLDAS fails when combined with SSEBop or WBET. Therefore, the only viable combinations are SSEBop-GLEAM-WBET, SSEBop-ERA5-WBET, and SSEBop-TerraClimate-WBET.

Let's explore the failing combinations by seeing how the variances and covariances of the data compare. This may clarify these poor pairings. In the plots below, reading left to right and top to bottom, the panels show each dataset's variance followed by the covariance between each pair of datasets. If the variance of a dataset is significantly smaller, or a covariance significantly larger, than those of the other datasets, we can expect potentially negative error variance estimates. This is due to how the error variance is calculated:

\begin{equation}
\sigma_{\varepsilon_i}^2 = \sigma_{i}^2 - \frac{\sigma_{ij} \sigma_{ik}}{\sigma_{jk}},
\end{equation}

where $\sigma_{ij} = {\rm Cov}(\boldsymbol{X}_i, \boldsymbol{X}_j)$, $\sigma_{i}^2 = \sigma_{ii}$, and $\boldsymbol{X}_i$ is data set $i$ from the data set triplet.
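
To see why this formula recovers the error variance, consider the simplified case where each dataset is the true signal plus an independent, zero-mean error. Then every cross-covariance equals the signal variance, the ratio $\sigma_{ij}\sigma_{ik}/\sigma_{jk}$ also equals the signal variance, and subtracting it from $\sigma_i^2$ leaves only the error variance. The sketch below checks this on synthetic data. The Gaussian signal, the error standard deviations, and the helper names `cov` and `tc_error_var` are all assumptions of this example; it is not the `ec_covar_multi` function used in this notebook.

```python
import random

random.seed(0)
n = 50_000
true_err_std = [0.3, 0.5, 0.7]  # assumed error standard deviations

# Common "true" signal plus independent zero-mean errors for each dataset
truth = [random.gauss(0.0, 1.0) for _ in range(n)]
x = [[t + random.gauss(0.0, s) for t in truth] for s in true_err_std]

def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)

def tc_error_var(x, i, j, k):
    """Error variance of dataset i from the triplet (i, j, k)."""
    return cov(x[i], x[i]) - cov(x[i], x[j]) * cov(x[i], x[k]) / cov(x[j], x[k])

est = [tc_error_var(x, 0, 1, 2),
       tc_error_var(x, 1, 0, 2),
       tc_error_var(x, 2, 0, 1)]
for s, e in zip(true_err_std, est):
    print(f'true variance {s**2:.2f}, TC estimate {e:.2f}')
```

With a finite sample the estimates only approximate the true variances, and for a small true error variance the sampling noise alone can push an estimate below zero.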

In [None]:
einsum_subscripts = 'abde,acde->bcde'

def cov_plots(dataset_combo='SGE'):

    ds['dataset_name'] = dataset_abrv
    combo = [i for i in dataset_combo]
    ds_combo = ds.sel(time=common_date_range(ds, combo), dataset_name=combo)
    deviation = ds_combo.aet - ds_combo.aet.sum(axis=0) / ds_combo.aet.shape[0]
    covar = np.einsum(einsum_subscripts, deviation, deviation)  / (ds_combo.aet.shape[0] - 1)
    covar_ds = xr.DataArray(data=covar, coords={'dataset1': combo, 'dataset2': combo, 'lat': ds.lat, 'lon': ds.lon})

    plt = covar_ds.isel(dataset1=0, dataset2=0).hvplot(geo=True, coastline=True).opts(frame_width=400)
    plt = plt + covar_ds.isel(dataset1=1, dataset2=1).hvplot(geo=True, coastline=True).opts(frame_width=400)
    plt = plt + covar_ds.isel(dataset1=2, dataset2=2).hvplot(geo=True, coastline=True).opts(frame_width=400)
    plt = plt + covar_ds.isel(dataset1=0, dataset2=1).hvplot(geo=True, coastline=True).opts(frame_width=400)
    plt = plt + covar_ds.isel(dataset1=0, dataset2=2).hvplot(geo=True, coastline=True).opts(frame_width=400)
    plt = plt + covar_ds.isel(dataset1=1, dataset2=2).hvplot(geo=True, coastline=True).opts(frame_width=400)

    ds['dataset_name'] = dataset_name
    
    return plt.cols(2)

dataset_combo_widget = pn.widgets.Select(name="dataset_combo", value="SGE", options=list(tc_est.dataset_combo.values))

bound_plot = pn.bind(cov_plots, dataset_combo=dataset_combo_widget)

pn.Row(bound_plot, dataset_combo_widget)

As expected, this does seem to be the case. While identifying the issue in the data is doable (relative variance/covariance issues), translating that into which TC assumption is not met is a more challenging problem. As a reminder, the assumptions are:

1. The signal and random errors are stationary (i.e., the mean of each is constant with time).
2. All data sets represent exactly the same ET state (i.e., the three data sets have the same spatial resolution and sampling intervals).
3. No cross-correlation of errors (i.e., measurement system errors are independent of each other).
4. Error orthogonality (i.e., the measurement system errors are independent of the true value).
5. No error autocorrelation (i.e., the errors are not correlated in time).

Of these assumptions, it is possible that each is influencing the result. First, it is likely that our signal and random errors are not stationary. We have performed some stationarity tests off-hand and found that the signal is not always stationary (this is expected since ET has likely changed with time). Additionally, the error likely has a non-stationary seasonal component. As discussed in [Gruber et al. (2016)](http://dx.doi.org/10.1016/j.jag.2015.09.002), this is not an issue if the datasets all have the same non-stationarity effect. However, determining this is difficult and likely not the major contributing factor to the error variance.

Next, representativeness can bias one or two of the error variances in the triplet if the spatial representativeness is very different between the data sets [(Gruber et al. 2016)](http://dx.doi.org/10.1016/j.jag.2015.09.002). In our case, the data sets all originally had their own native resolution, which we degraded to match the GLEAM resolution. Therefore, it is possible that data sets with high native resolution may be penalizing the data sets with lower resolution. Of the six data sets, SSEBop and WBET have similar native resolutions, with TerraClimate close to them as well. ERA5 and NLDAS are also similar in resolution, being roughly twice as fine as GLEAM and about 10x coarser than SSEBop and WBET.

| Dataset | SSEBop | GLEAM v3b | ERA5 | NLDAS | TerraClimate | WBET |
| ------  |  ----  | -----     | ---- | ----- | ----         | ---- |
| Resolution | 0.01 deg (1 km) | 0.25 deg (22.5 km) | 0.1 deg (9 km) | 0.125 deg (11.25 km) | 0.04166 deg (3.75 km) | 0.009 deg (800 m) |

While this does help us understand why combinations of SSEBop and WBET resulted in better TC error variance estimates, it does not explain why the lower resolution data sets led to poor results. Mainly, we should keep this issue in mind as we further investigate expanding TC to Extended Collocation (EC).

Finally, the largest issue in our error variance estimates is likely the inclusion of datasets with cross-correlated errors. Meeting this assumption has been shown to be the most influential in getting correct error variance estimates, more so than error orthogonality and error autocorrelation [(Yilmaz & Crow 2014)](https://doi.org/10.1175/JHM-D-13-0158.1). From a basic assumption standpoint, as discussed above, we would have expected SSEBop and GLEAM to have issues along with TerraClimate and WBET, as they are generated from similar measurement systems. Additionally, we would expect ERA5 and NLDAS to be relatively independent of the others, since they are unique measurement systems. However, we see almost the opposite, as they seem to have the most problems. This could be because their forcing data, which we did not break down into its individual parts, is the same data used in other measurement systems. Therefore, this TC application demonstrates that basic TC is likely not a reliable measure of error variances, as knowing which data sets are truly independent is an almost insurmountable task. However, we can expand TC to EC and see if coupling data sets results in different error variance estimates.
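
The effect of cross-correlated errors on the TC formula can be sketched analytically. The example below assumes the simplest additive model, where each dataset equals the true signal plus an error, and builds the exact (population) covariance matrix rather than sampling data. The signal variance, error variances, error covariance, and the helper names `tc_error_vars` and `covariance_matrix` are all illustrative assumptions. Under this model, an error covariance $c$ between two datasets lowers each of their TC estimates by exactly $c$, so the estimate turns negative whenever $c$ exceeds the true error variance.

```python
def tc_error_vars(c):
    """TC error variances for a 3x3 covariance matrix c (nested lists)."""
    out = []
    for i in range(3):
        j, k = [m for m in range(3) if m != i]
        out.append(c[i][i] - c[i][j] * c[i][k] / c[j][k])
    return out

def covariance_matrix(signal_var, err_var, err_cov):
    """Population covariance of X_i = T + e_i, where err_cov is the error
    covariance between the second and third datasets (indices 1 and 2)."""
    c = [[signal_var] * 3 for _ in range(3)]
    for i in range(3):
        c[i][i] += err_var[i]
    c[1][2] += err_cov
    c[2][1] += err_cov
    return c

err_var = [4.0, 1.0, 1.0]  # assumed true error variances
independent = tc_error_vars(covariance_matrix(10.0, err_var, 0.0))
correlated = tc_error_vars(covariance_matrix(10.0, err_var, 2.0))
print(independent)  # recovers [4.0, 1.0, 1.0]
print(correlated)   # second and third estimates become -1.0
```

In the correlated case the first dataset's estimate is inflated (about 5.67 instead of 4), while the two correlated datasets receive negative variances, which would appear as `NaN` after the square root.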

Next, we combine the error standard deviation estimates to find the mean and standard deviation for each dataset across its triplets, excluding the `NaN`s. First, rearrange the data so that it is organized by dataset.

In [None]:
tc_est_by_dataset = []
for abrv, name in zip(dataset_abrv, dataset_name):
    idx_loc = np.char.find(tc_est.dataset_combo.data, abrv)
    dataset_loc = np.where(idx_loc != -1)[0]
    idx_loc = idx_loc[dataset_loc]
    combos_single_dataset = tc_est.isel(dataset_combo=dataset_loc)
    tc_est_dataset = []
    for i in range(len(combos_single_dataset.dataset_combo.values)):
        tc_est_single_dataset = combos_single_dataset.isel(dataset_combo=i, combo_idx=idx_loc[i])
        tc_est_single_dataset = tc_est_single_dataset.drop_vars(['dataset_combo', 'combo_idx'])

        tc_est_dataset.append(xr.Dataset(data_vars=tc_est_single_dataset,
                                coords={'dataset_name': name, 'est_idx': i, 'lat': ds.lat, 'lon': ds.lon}))
    
    tc_est_dataset = xr.concat(tc_est_dataset, dim='est_idx')
    
    tc_est_by_dataset.append(tc_est_dataset)

tc_est_by_dataset = xr.concat(tc_est_by_dataset, dim='dataset_name')

tc_est_by_dataset.dataset_name.attrs['description'] = 'Dataset names.'
tc_est_by_dataset.est_idx.attrs['description'] = 'Index of the TC triplet set from which the estimate was calculated.'

tc_est_by_dataset = tc_est_by_dataset.compute().chunk(-1)
tc_est_by_dataset

Now let's compute and plot the means and standard deviations.

In [None]:
mean_tc_est = tc_est_by_dataset.error.mean(dim='est_idx', skipna=True, keep_attrs=True)
std_tc_est = tc_est_by_dataset.error.std(dim='est_idx', ddof=1, skipna=True, keep_attrs=True)
count_tc_est = np.isfinite(tc_est_by_dataset.error).sum(dim='est_idx')
count_tc_est.attrs['units'] = 'counts'

plt = mean_tc_est.hvplot(groupby=['dataset_name'], geo=True, coastline=True, 
                         clim=(0,50), title='Mean Error').opts(frame_width=500) + \
      std_tc_est.hvplot(groupby=['dataset_name'], geo=True, coastline=True,
                        title='Std of Error').opts(frame_width=500) + \
      count_tc_est.hvplot(groupby=['dataset_name'], geo=True, coastline=True,
                          title='Number of data points used in calculation').opts(frame_width=500)

pn.panel(plt.cols(2), widget_location='top')
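
For a single pixel, the `NaN`-skipping statistics computed above reduce to the following sketch. The helper `nan_mean_std` and the sample values are hypothetical; returning `NaN` for the standard deviation when fewer than two finite estimates exist is a choice made in this sketch.

```python
import math
import statistics

nan = float('nan')

def nan_mean_std(values):
    """Mean and sample std (ddof=1) of the finite entries, plus their count."""
    finite = [v for v in values if math.isfinite(v)]
    mean = statistics.fmean(finite) if finite else nan
    std = statistics.stdev(finite) if len(finite) > 1 else nan
    return mean, std, len(finite)

# Hypothetical error estimates for one pixel from four triplets
mean, std, count = nan_mean_std([10.0, nan, 12.0, 14.0])
print(mean, std, count)
```

The count matters when reading the maps: a mean based on one or two surviving triplets is far less trustworthy than one based on all ten triplets that contain a given dataset.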