# EC Application

The `2_TC_application.ipynb` notebook showed that the ET error variances estimated with TC may be affected by error cross-correlations. Here, we look into how extending TC to include additional data sets can help estimate this cross-correlation.

In [1]:
import holoviews as hv
import hvplot.xarray
import panel as pn
import cartopy.crs as ccrs
import numpy as np
import xarray as xr
from xarray_einstats import linalg
from scipy.stats import percentileofscore
import itertools
import warnings

First, we will run the Extended Collocation notebook to create our EC function.

In [2]:
%run TC/EC_function.ipynb

## Combine Data Sets in Xarray
Next, we need to load our data sets and limit them to a common date range. Since we need at least three data sets to utilize EC, we will restrict all data sets to begin at the third oldest starting date and end at the third most recent ending date. This choice saves memory while still using the largest amount of data. When a combination includes a data set with a shorter record, we will limit the date range further at the time of the EC computation.

In [4]:
files = ['Data/ssebop/ssebop_aet_regridded.nc',
         'Data/gleam/gleam_aet.nc',
         'Data/era5/era5_aet_regridded.nc',
         'Data/nldas/nldas_aet_regridded.nc',
         'Data/terraclimate/terraclimate_aet_regridded.nc',        
         'Data/wbet/wbet_aet_regridded.nc',
         ]
dataset_name = ['SSEBop', 'GLEAM', 'ERA5', 'NLDAS', 'TerraClimate', 'WBET']
dataset_abrv = ['S', 'G', 'E', 'N', 'T', 'W']

date_ranges = np.zeros((2, len(files)), dtype='datetime64[ns]')
for i, file in enumerate(files):
    ds_i = xr.open_dataset(file, engine='netcdf4', chunks={'lon': -1, 'lat': -1, 'time': -1})
    date_ranges[:, i] = [ds_i.time.min().values, ds_i.time.max().values]

# Take the third oldest start and third most recent end dates
date_range = [np.sort(date_ranges[0, :])[2], np.sort(date_ranges[1, :])[3]]
date_range

[numpy.datetime64('1958-01-01T00:00:00.000000000'),
 numpy.datetime64('2022-12-01T00:00:00.000000000')]
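
As a minimal sketch of the selection rule described above, the example below uses made-up start and end dates (the `starts` and `ends` arrays are hypothetical, not the real data set ranges) to show how sorting picks out the third oldest start and the third most recent end.

```python
import numpy as np

# Hypothetical start and end dates for six data sets (illustrative only).
starts = np.array(['2010-01', '1990-01', '2000-01', '2015-01', '1995-01', '2005-01'],
                  dtype='datetime64[M]')
ends = np.array(['2018-12', '2021-12', '2016-12', '2019-12', '2017-12', '2020-12'],
                dtype='datetime64[M]')

# After an ascending sort, index 2 is the third oldest start date.
third_oldest_start = np.sort(starts)[2]

# After an ascending sort, index -3 is the third most recent end date.
# With six data sets this is the same element as index 3.
third_recent_end = np.sort(ends)[-3]

print(third_oldest_start, third_recent_end)
```

With this rule, at least three data sets fully cover the selected range, which is the minimum needed for a collocation estimate.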

Using the date range, we can now combine all of the data sets into a single `Xarray` `DataSet` for easy computations.

In [5]:
def preprocess(ds):
    """
    Keep only the specified time range for each file.
    """
    return ds.sel(time=slice(date_range[0], date_range[1]))

ds = xr.open_mfdataset(files, engine='netcdf4', preprocess=preprocess, combine='nested', concat_dim='dataset_name')
ds = ds.assign_coords({'dataset_name': dataset_name})
ds.dataset_name.attrs['description'] = 'Dataset name'

# Need time as the first dimension for the EC computation
ds = ds.transpose('time', ...)
# The data set is less than 1GiB, so let's read it into memory vs keeping as a dask array
ds = ds.compute()
ds

As stated above, since we have data sets with different date ranges, we will need to trim the date ranges before computing the EC error covariance matrix. This is slightly complicated, so let's make the date range selection its own function.

In [6]:
def common_date_range(ds, combo):
    """Return the common date slice of the datasets."""
    old_common_date = []
    recent_common_date = []
    for abrv in combo:
        idx = [j for j in range(len(ds['dataset_name'])) if abrv == ds['dataset_name'][j]][0]
        old_common_date.append(date_ranges[0, idx])
        recent_common_date.append(date_ranges[1, idx])
    
    return slice(np.max(old_common_date), np.min(recent_common_date))
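
The idea behind this function can be sketched on its own: the shared period of several records runs from the latest start date to the earliest end date. The `ranges` dictionary and the `overlap` helper below are hypothetical stand-ins used only for illustration.

```python
import numpy as np

# Hypothetical (start, end) dates keyed by a data set label.
ranges = {
    'A': (np.datetime64('1990-01-01'), np.datetime64('2020-12-01')),
    'B': (np.datetime64('2000-01-01'), np.datetime64('2022-12-01')),
    'C': (np.datetime64('1995-01-01'), np.datetime64('2018-12-01')),
}

def overlap(names):
    """Return the slice of dates shared by every data set in names."""
    latest_start = max(ranges[name][0] for name in names)
    earliest_end = min(ranges[name][1] for name in names)
    return slice(latest_start, earliest_end)

common = overlap(['A', 'B', 'C'])
print(common.start, common.stop)
```

The returned `slice` can be passed to a label-based time selection, which is how the notebook's function is used later.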

Like the TC error variance estimates in `2_TC_application.ipynb`, we will compute the error covariance matrix for all 90 possible combinations of EC with four data sets (i.e., Quadruple Collocation or QC), since the computation is fast. Additionally, since we only want the error covariance for a given pair and not the whole covariance matrix, the resulting EC error covariances will be small in memory (~100 KiB).

In [22]:
# Generate a list of the combinations, need two correlated data sets, then two additional ones
combos = list(itertools.combinations(dataset_abrv, 2))
combos = [list(corr_combo + indep_combo) for corr_combo in combos for indep_combo in combos if ((corr_combo[0] not in indep_combo) and (corr_combo[1] not in indep_combo))]
combos[0:10]

[['S', 'G', 'E', 'N'],
 ['S', 'G', 'E', 'T'],
 ['S', 'G', 'E', 'W'],
 ['S', 'G', 'N', 'T'],
 ['S', 'G', 'N', 'W'],
 ['S', 'G', 'T', 'W'],
 ['S', 'E', 'G', 'N'],
 ['S', 'E', 'G', 'T'],
 ['S', 'E', 'G', 'W'],
 ['S', 'E', 'N', 'T']]
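
As a quick sanity check on the count of 90, the sketch below rebuilds the combinations with a set-intersection test (a slightly different but equivalent filter to the one in the cell above) and compares the total against the closed-form count: 15 correlated pairs times 6 independent pairs drawn from the remaining four data sets.

```python
import itertools
from math import comb

abrv = ['S', 'G', 'E', 'N', 'T', 'W']

# All unordered pairs of data sets.
pairs = list(itertools.combinations(abrv, 2))

# Keep a (correlated pair, independent pair) only if they share no data set.
quads = [corr + indep for corr in pairs for indep in pairs
         if not set(corr) & set(indep)]

expected = comb(6, 2) * comb(4, 2)  # 15 correlated pairs x 6 independent pairs
print(len(quads), expected)
```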

Now that we have our data set combinations, let's compute the EC error covariance matrices and extract the error covariance. We will do this for each season independently, along with the full year. Each season is labeled by the first letters of the months it contains (e.g., `DJF`), and the full year is labeled `All`.

In [23]:
# We want to ignore all of the sqrt and log warnings with negative values
warnings.filterwarnings("ignore", category=RuntimeWarning)

# Override the name to the abbreviation for easier indexing
ds['dataset_name'] = dataset_abrv

# Create list of seasons
seasons = ['All'] + list(np.unique(ds.time.dt.season))

ec_covar_est = []
ec_covar_est_season = []


for combo in combos:
    for season in seasons:
        if season == 'All':
            ds_season = ds
        else:
            ds_season = ds.isel(time=(ds.time.dt.season == season))

        ds_combo = ds_season.sel(time=common_date_range(ds, combo), dataset_name=combo)
        
        ec_covar, _ = ec_covar_multi(ds_combo.aet.data, corr_sets=[1, 1, 2, 3])
    
        # We only want the error covariance between the first two data sets, so extract the off-diagonal elements.
        # Average the two, since sig_01 != sig_10 in practice, as discussed in our random example.
        covar = (ec_covar[0, 1, ...] + ec_covar[1, 0, ...]) / 2
        
        ec_covar_est_season.append(xr.Dataset(data_vars={'covar': (['dataset_combo', 'season', 'lat', 'lon'], 
                                                            covar[None, None, ...])},
                                              coords={'dataset_combo': [''.join(combo)], 'season': [season], 
                                                      'lat': ds.lat, 'lon': ds.lon}))
    ec_covar_est.append(xr.concat(ec_covar_est_season, dim='season'))
    ec_covar_est_season = []

ec_covar_est = xr.concat(ec_covar_est, dim='dataset_combo')

# Convert layout to be by covariance pair vs long dataset_combo list
abrv_combos = [''.join(combo) for combo in list(itertools.combinations(dataset_abrv, 2))]
covar_est_by_dataset_pair = []
est_pair = []
for abrv_combo in abrv_combos:
    idx_loc = np.char.find([dataset[0:2] for dataset in ec_covar_est.dataset_combo.data], abrv_combo)
    dataset_loc = np.where(idx_loc != -1)[0]
    covar_pair_datasets = ec_covar_est.isel(dataset_combo=dataset_loc)

    est_pair.append([combo[2:] for combo in ec_covar_est.dataset_combo.data[dataset_loc]])

    covar_est_by_dataset_pair.append(xr.Dataset(data_vars={'covar': (['est_idx', 'season', 'lat', 'lon'], 
                                                          covar_pair_datasets.covar.data)},
                                     coords={'covar_pair': abrv_combo, 'est_idx': np.arange(len(dataset_loc)),
                                             'season': seasons, 'lat': ec_covar_est.lat, 'lon': ec_covar_est.lon}))

covar_est_by_dataset_pair = xr.concat(covar_est_by_dataset_pair, dim='covar_pair')

covar_est_by_dataset_pair = covar_est_by_dataset_pair.assign_coords(est_pair=(['covar_pair', 'est_idx'], np.array(est_pair)))
covar_est_by_dataset_pair

covar_est_by_dataset_pair.covar.attrs['description'] = 'EC error covariance estimate for the data sets in covar_pair.'
covar_est_by_dataset_pair.covar.attrs['units'] = 'mm2.month-2'
covar_est_by_dataset_pair.covar_pair.attrs['description'] = ('Correlated data set pair used in EC evaluation '
                                                  '(abbreviations: T=TerraClimate, E=ERA5, '
                                                  'N=NLDAS, G=GLEAM, W=WBET, S=SSEBop).')
covar_est_by_dataset_pair.est_idx.attrs['description'] = 'Index of the other two data sets used in the EC quadruplet as contained in est_pair.'
covar_est_by_dataset_pair.season.attrs['description'] = 'Season of the year given by the first letter of each month within the season. The full year is given by "All".'
covar_est_by_dataset_pair.est_pair.attrs['description'] = 'Abbreviations of the other two data sets used in the EC quadruplet.'

covar_est_by_dataset_pair = covar_est_by_dataset_pair.chunk(-1)

# Reset the name back from the abbreviation
ds['dataset_name'] = dataset_name

covar_est_by_dataset_pair = covar_est_by_dataset_pair.compute()
covar_est_by_dataset_pair
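
The regrouping at the end of the cell above can be hard to follow. A minimal pure-Python sketch of the same idea, assuming each combination name is a four-letter string whose first two letters are the correlated pair and whose last two are the independent pair (the `by_pair` dictionary is a hypothetical stand-in for the `covar_pair` and `est_pair` coordinates):

```python
import itertools
from collections import defaultdict

abrv = ['S', 'G', 'E', 'N', 'T', 'W']
pairs = list(itertools.combinations(abrv, 2))

# Four-letter names such as 'SGEN': correlated pair 'SG', independent pair 'EN'.
combo_names = [''.join(corr + indep) for corr in pairs for indep in pairs
               if not set(corr) & set(indep)]

# Group the independent pairs under the correlated pair they were used with.
by_pair = defaultdict(list)
for name in combo_names:
    by_pair[name[:2]].append(name[2:])

print(by_pair['SG'])
```

Each correlated pair ends up with six independent pairs, which is why `est_idx` runs from 0 to 5.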

## EC Discussion

Since looking at each possible combination of correlated pairs with uncorrelated pairs may be overwhelming, we will instead average the covariance from each estimation and plot them to visualize how correlated data sets may be.

In [24]:
mean_covar_est = covar_est_by_dataset_pair.covar.mean(dim='est_idx', skipna=True, keep_attrs=True)
mean_covar_est.name = 'err_covar'
median_covar_est = covar_est_by_dataset_pair.covar.median(dim='est_idx', skipna=True, keep_attrs=True)
median_covar_est.name = 'err_covar'
std_covar_est = covar_est_by_dataset_pair.covar.std(dim='est_idx', ddof=1, skipna=True, keep_attrs=True)
std_covar_est.name = 'Std of error covar'
count_covar_est = np.isfinite(covar_est_by_dataset_pair.covar).sum(dim='est_idx')
count_covar_est.name = 'Counts'
count_covar_est.attrs['units'] = 'counts'

plt = mean_covar_est.hvplot(groupby=['covar_pair', 'season'], geo=True, coastline=True, 
                            clim=(-300,500), title='Mean Error Covariance').opts(frame_width=500) + \
      median_covar_est.hvplot(groupby=['covar_pair', 'season'], geo=True, coastline=True, 
                              clim=(-300,500), title='Median Error Covariance').opts(frame_width=500) + \
      std_covar_est.hvplot(groupby=['covar_pair', 'season'], geo=True, coastline=True,
                           clim=(0,350), title='Std of Error Covariance').opts(frame_width=500) + \
      count_covar_est.hvplot(groupby=['covar_pair', 'season'], geo=True, coastline=True,
                             title='Number of data points used in calculation').opts(frame_width=500)

pn.panel(plt.cols(2), widget_location='top')
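
To make the NaN handling in these summary statistics concrete, here is a small NumPy sketch on a made-up array (the `est` values are hypothetical). It is meant to mirror the `skipna=True` and `ddof=1` choices above using NumPy's NaN-aware reductions rather than xarray.

```python
import numpy as np

# Hypothetical estimates with shape (est_idx, lat, lon): three estimates on a 1x2 grid.
est = np.array([[[10.0, np.nan]],
                [[20.0, 4.0]],
                [[30.0, 8.0]]])

mean = np.nanmean(est, axis=0)          # NaN estimates are skipped per pixel
median = np.nanmedian(est, axis=0)
std = np.nanstd(est, axis=0, ddof=1)    # sample standard deviation
count = np.isfinite(est).sum(axis=0)    # number of estimates contributing per pixel

print(mean, median, std, count)
```

The count map matters when reading the other three: a pixel with only two finite estimates has a far less reliable spread than one with all six.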

From these plots, we can see that some data sets may be correlated with each other. To visualize this more clearly, we can plot histograms of the pixel values for each independent data set pair, along with the mean and median across pairs, where one count is one pixel. This will allow us to see whether a whole map shows a net error covariance or whether the covariances are evenly distributed around zero, which could indicate minimal cross-correlation of errors.

In [16]:
def histogram_plts(covar_pair='SG', season='All'):
    da0 = covar_est_by_dataset_pair.covar.sel(covar_pair=covar_pair, est_idx=0, season=season)
    da1 = covar_est_by_dataset_pair.covar.sel(covar_pair=covar_pair, est_idx=1, season=season)
    da2 = covar_est_by_dataset_pair.covar.sel(covar_pair=covar_pair, est_idx=2, season=season)
    da3 = covar_est_by_dataset_pair.covar.sel(covar_pair=covar_pair, est_idx=3, season=season)
    da4 = covar_est_by_dataset_pair.covar.sel(covar_pair=covar_pair, est_idx=4, season=season)
    da5 = covar_est_by_dataset_pair.covar.sel(covar_pair=covar_pair, est_idx=5, season=season)
    da_mean = mean_covar_est.sel(covar_pair=covar_pair, season=season)
    da_median = median_covar_est.sel(covar_pair=covar_pair, season=season)
    da0.name = covar_est_by_dataset_pair.est_pair.sel(covar_pair=covar_pair, est_idx=0).data.item()
    da1.name = covar_est_by_dataset_pair.est_pair.sel(covar_pair=covar_pair, est_idx=1).data.item()
    da2.name = covar_est_by_dataset_pair.est_pair.sel(covar_pair=covar_pair, est_idx=2).data.item()
    da3.name = covar_est_by_dataset_pair.est_pair.sel(covar_pair=covar_pair, est_idx=3).data.item()
    da4.name = covar_est_by_dataset_pair.est_pair.sel(covar_pair=covar_pair, est_idx=4).data.item()
    da5.name = covar_est_by_dataset_pair.est_pair.sel(covar_pair=covar_pair, est_idx=5).data.item()
    da_mean.name = 'Mean'
    da_median.name = 'Median'

    plt = (da0.hvplot.hist(bins=50, bin_range=(-500,500), title='EC Error Covariance Distribution of '+covar_pair, 
                           xlabel='Error Covariance (mm2.month-2)', ylabel='Counts', alpha=1, normed=True)
           * da1.hvplot.hist(bins=50, bin_range=(-500,500), alpha=1, normed=True)
           * da2.hvplot.hist(bins=50, bin_range=(-500,500), alpha=1, normed=True)
           * da3.hvplot.hist(bins=50, bin_range=(-500,500), alpha=1, normed=True)
           * da4.hvplot.hist(bins=50, bin_range=(-500,500), alpha=1, normed=True)
           * da5.hvplot.hist(bins=50, bin_range=(-500,500), alpha=1, normed=True)
           * da_mean.hvplot.hist(bins=50, bin_range=(-500,500), alpha=1, normed=True)
           * da_median.hvplot.hist(bins=50, bin_range=(-500,500), alpha=1, normed=True))

    return plt

def percent_table(covar_pair='SG', season='All'):
    da0 = covar_est_by_dataset_pair.covar.sel(covar_pair=covar_pair, est_idx=0, season=season)
    da1 = covar_est_by_dataset_pair.covar.sel(covar_pair=covar_pair, est_idx=1, season=season)
    da2 = covar_est_by_dataset_pair.covar.sel(covar_pair=covar_pair, est_idx=2, season=season)
    da3 = covar_est_by_dataset_pair.covar.sel(covar_pair=covar_pair, est_idx=3, season=season)
    da4 = covar_est_by_dataset_pair.covar.sel(covar_pair=covar_pair, est_idx=4, season=season)
    da5 = covar_est_by_dataset_pair.covar.sel(covar_pair=covar_pair, est_idx=5, season=season)
    da_mean = mean_covar_est.sel(covar_pair=covar_pair, season=season)
    da_median = median_covar_est.sel(covar_pair=covar_pair, season=season)

    percentiles = ([percentileofscore(da.data.flatten(), 0, kind='strict', 
                                     nan_policy='omit') for da in [da0, da1, da2, da3, da4, da5]]
                   + [percentileofscore(da_mean.data.flatten(), 0, kind='strict',  nan_policy='omit')]
                   + [percentileofscore(da_median.data.flatten(), 0, kind='strict',  nan_policy='omit')])
    correlation_str = np.array(['Neutral '] * len(percentiles))
    correlation_str[np.array(percentiles) <= 35] = 'Positive'
    correlation_str[np.array(percentiles) >= 65] = 'Negative'
    table = hv.Table({'Independent Pair': list(covar_est_by_dataset_pair.est_pair.sel(covar_pair=covar_pair).data) + ['Mean', 'Median'], 
                      'Percentile of 0': np.round(percentiles, 2),
                      'Net Correlation': correlation_str}, ['Independent Pair', 'Percentile of 0', 'Net Correlation']).opts(width=350)

    return table


covar_pair_widget = pn.widgets.Select(name="covar_pair", value="SG", options=list(covar_est_by_dataset_pair.covar_pair.values))
season_widget = pn.widgets.Select(name="season", value="All", options=['All', 'DJF', 'MAM', 'JJA', 'SON'])

bound_plot = pn.bind(histogram_plts, covar_pair=covar_pair_widget, season=season_widget)
bound_table = pn.bind(percent_table, covar_pair=covar_pair_widget, season=season_widget)

pn.Column(covar_pair_widget, season_widget, pn.Row(bound_plot, bound_table))
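
The classification in the table can be illustrated without the plotting machinery. The sketch below is an assumption-laden stand-in: `classify_net_sign` is a hypothetical helper, and the percentage of finite values strictly below zero is intended to mirror what `percentileofscore(..., 0, kind='strict', nan_policy='omit')` reports. The 35 and 65 thresholds are the ones used in the cell above.

```python
import numpy as np

def classify_net_sign(values, low=35.0, high=65.0):
    """Return (percent of finite values strictly below zero, label)."""
    finite = values[np.isfinite(values)]
    pct_below_zero = 100.0 * np.count_nonzero(finite < 0) / finite.size
    if pct_below_zero <= low:
        label = 'Positive'   # most pixels have covariance above zero
    elif pct_below_zero >= high:
        label = 'Negative'   # most pixels have covariance below zero
    else:
        label = 'Neutral'
    return pct_below_zero, label

# Hypothetical pixel values: ten finite covariances, two of them negative.
covar = np.array([-5.0, 12.0, 30.0, 8.0, np.nan, 41.0, -2.0, 17.0, 3.0, 25.0, 9.0])
pct, label = classify_net_sign(covar)
print(pct, label)
```

A low percentile of zero means most of the distribution sits above zero, hence the "Positive" label for low values.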

As a reminder, the data set abbreviations are: **S=SSEBop, G=GLEAM, E=ERA5, N=NLDAS, T=TerraClimate, W=WBET**.

From these results, we can see that some data set pairs do indeed have net positive covariances. For example, SSEBop and NLDAS, as well as ERA5 and TerraClimate, have strong net positive covariances. This indicates that these data sets likely share some modeling assumption or input data that causes their errors to be correlated.

Interestingly, we also find some pairs with net negative covariances, for example SSEBop and TerraClimate, as well as ERA5 and WBET. This is highly unexpected from an input data perspective, as it seems to indicate that as the error in one data set increases, the error in the other decreases. Typically, we would expect net positive covariances, because commonalities in the input data or calculation method propagate the same errors into both data sets. However, these histograms give an aggregated view of the maps. It is possible that the net negative covariances result from one data set having lower error variances than the other in a particular geographical region.

For convenience, we include a table summarizing each data set below.

| Data set | SSEBop | GLEAM v3b | ERA5 | NLDAS | TerraClimate | WBET |
| ------  |  ----  | -----     | ---- | ----- | ----         | ---- |
| Resolution | 0.01 deg (1 km) | 0.25 deg (22.5 km) | 0.1 deg (9 km) | 0.125 deg (11.25 km) | 0.04166 deg (3.75 km) | 0.009 deg (800 m) |
| Measurement System | Satellite | Satellite | Reanalysis | Land Surface Model | Satellite + Water Balance | Water Balance + Satellite |
| Calculation Method | "hot" /"cold" reference pixels | Priestley and Taylor equation; Gash's analytical model; Soil moisture based | Reanalysis | Land Surface Model | Penman-Monteith equation + Thornthwaite-Mather WBM | Water Balance, Meteorological/Climate regression, ensemble averaging |
| Input Data | **SRTM** elevn; **PRISM** Ta; **MODIS** Ts, emissivity, albedo, and NDVI; **GDAS** ETo | **CERES** radiation; **TMPA** precip; **AIRS** Ta; **GLOBSNOW** snow-water equiv; **CCI** vegetation optical depth; **GLDAS** and **CCI** Soil moisture; **MODIS** GVCF (global vegetation continuous fields); **IGBP-DIS** soil properties; **CGLFRD** lightning flash rate for rainfall inference | **CHTESSEL** Land surface model using model cycle Cy45r1 (2018) | **NARR** (North American Regional Reanalysis) atmospheric forcing data; **PRISM** precip | **WorldClim** Ta, vapor, precip, solar radiation, wind (Uses **MODIS** Ts, cloud cover; **SRTM** elevn); **CRU** Ts4.0, Tmax, Tmin, vapor, precip, Ta; **JRA-55** Ta, vapor, precip, radiation, wind | **PRISM** precip, mean Ta, max Ta, min Ta; **USGS** water use irrigation, national elevation dataset, NWIS gage II discharge; **EROS** land cover (1938-1999); **Landsat** NLCD land cover (2000-2018); **gridMET** wind; **Koppen-Geiger** climate classification; **Fenneman & Johnson** physiographic province classification; **EPA** level III ecoregions; **STATSGO2** soil saturated hydraulic conductivity, porosity, field capacity, thickness, available water capacity |