# TempEst-NEXT Validation Suite

This Notebook provides a standard suite of model tests for TempEst-NEXT.  This serves two purposes:

1. Reproducibility of research.  This notebook is used to generate final manuscript figures relating to model performance. The notebook itself is provided with the published model and dataset, allowing exact reproducibility (and easy modification) of the analysis.
2. Efficient, consistent testing.  After modifying the model, running this notebook is a quick way to make sure everything still works and to assess performance impacts.

For validation, we use two pre-retrieved datasets as well as some automatic retrieval of new data.  The two pre-retrieved datasets are a "development" set of ~900 USGS gages (nominally 1,000; 900ish with overlapping data coverage) and a "test" set of ~400 (nominal) USGS gages paired with daymet meteorology, 3DEP topography, NLCD land cover, etc.  The development set was used for model development and tuning, while the test set is reserved for final validation (i.e., here).

## Assessed Model Characteristics

The goal is to assess several model performance characteristics.  Forecasting is used for a handful of tests, but most analysis focuses on hindcasting for computational convenience.  It is assumed that any performance discrepancies in forecasting vs hindcasting would be apparent in the tests that cover both, and thus that not every analysis needs to test forecasting.

In this notebook, forecasting is primarily tested by using archived weather forecasts to predict what the forecast would have been for time periods where observations are available, which is necessary for automatic testing.  There is code to run a "real" forecast, but this cannot be automatically evaluated because observations are not available, so the user is left to go back and check once observations are available.

1. Calibrated model hindcasting and forecasting performance, using TempEst-NEWT like a typical single-watershed model
2. General ungaged hindcasting and forecasting performance
3. Ungaged-region hindcasting performance
4. Ungaged-elevation hindcasting performance
5. Ungaged-time-period hindcasting performance
6. Disturbance hindcasting performance
7. Small-stream hindcasting performance?

## Tests

The following tests are used to assess the above performance characteristics.

- Calibrated testing: train a model on the first 70% of observations for each stream, then evaluate performance for predicting the last 30%.  This uses the development dataset.
  - Use meteorology estimates (daymet) for training and testing: hindcast test.  Because the model architecture does not actually use "today's" weather (up through yesterday only), this is also a 24-hour forecast test.
  - Use weather forecast archives (HRRR, GFS/GEFS) for training and testing: forecast test.  Test forecast periods of 2, 3 (HRRR), 7, 10, 14, 17 (GFS/GEFS) days.  Note forecast period is 1 day *past* the last day of the weather forecast, since NEWT does not depend on day-of weather.
  - Use meteorology estimates for training and forecasts for testing: evaluate the impact of heterogeneous datasets.  Use 2-day HRRR-forced forecast only.
- Gagewise cross-validation: partition development dataset gages into *k* equal sets.  Train a model on all partitions but one, and evaluate performance for predicting the excluded partition.  This tests general ungaged performance, not accounting for any potential impact of having used the same dataset for model tuning.  Hindcasting (met estimate) only.
- Test set validation: train a model on the development set, and evaluate performance for predicting the test set.  This tests general ungaged performance for a fully-independent dataset.
  - Meteorology estimates (hindcast)
  - Weather forecast archives (forecast) for a range of lead times
  - Train on estimates/test on forecasts (forecast with heterogeneous data)
- Extrapolation hindcasting tests: partition the combined development and testing sets along some characteristic of interest, and use a model trained on one group to predict the other group.  This tests the ability of TempEst-NEXT to extrapolate in terms of specific characteristics.  All hindcasting.
  - Regional: partition the CONUS into contiguous regions and run leave-one-out cross-validation over the regions.
  - Elevation: train a model on the lower elevations and predict higher elevations.  Partial dependency plots and previous research suggest there is a major shift in watershed dynamics around 2300 m, and it is difficult to extrapolate past that barrier.
  - Time (walk-forward validation): train a model up to a given year, then predict the next year.  This tests whether the model can extrapolate forward in time.
- Regime-shift hindcasting: identify a set of watersheds for which the observed thermal regime has shifted significantly.  Train the model on everything else, then try to predict the disturbed watersheds and see how the model performs.  This assesses whether TempEst-NEXT is capable of capturing regime shifts.
- Small-stream hindcasting: if possible, use the model to predict temperatures at very small (e.g., first-order headwaters, centimeters to a few meters wide) streams where some observations are available, just to see if it works there.

# Data Preparation

In [1]:
import NEXT
import NEWT
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

There are some major outliers that are either erroneous (negative temperatures) or wildly unrepresentative (hot springs) that we remove to produce realistic performance estimates.

In [None]:
dev_data = pd.read_csv("DevData.csv", dtype={"id": "str"})
dev_data["date"] = pd.to_datetime(dev_data["date"])
dev_data = dev_data[(dev_data["temperature"] > -0.5) & (dev_data["temperature"] < 40)]
# dev_data["day"] = dev_data["date"].dt.day_of_year
# gsamp = pd.read_csv("GageSample.csv",
#                    dtype={"id": "str"})
# def idfix(data):
#     data = data[data["id"].apply(lambda x: x.startswith("USGS"))]
#     data["id"] = data["id"].apply(lambda x: x.split("_")[1])
#     return data
# lcov = idfix(pd.read_csv("LandCover.csv"))
# area = idfix(pd.read_csv("Area.csv"))
# topo = idfix(pd.read_csv("Topography.csv"))
# dev_data = dev_data.merge(lcov, on="id").merge(area, on="id").merge(topo, on="id").merge(gsamp[["id", "lat", "lon"]], on="id")
# dev_data.to_csv("DevData.csv", index=False)
# test_data = whatever

# Calibrated Tests

In [41]:
def cut_dev(gid, start=None):
    idd = dev_data[dev_data["id"] == gid]
    if idd["temperature"].mean() > 35 or idd["temperature"].mean() < 0:
        return (None, None)  # bad data or major outlier
    if start is None:
        cut = round(len(idd) * 0.7)
        if cut >= 365:
            return (idd.iloc[cut:], idd["date"].iloc[cut+1])
        else:
            return (None, None)  # dataset too small
    else:
        return idd[idd["date"] >= start]

def cal_val(gid, cal_fn = cut_dev, val_fn = cut_dev):
    (train, cutoff) = cal_fn(gid)
    try:
        if cutoff is not None:
            test = cal_fn(gid, cutoff)
            model = NEWT.Watershed.from_data(train)
            return model.run_series(test)
    except Exception as e:
        print(e)

## Hindcast

In [42]:
with warnings.catch_warnings(action="ignore"):
    preds = pd.concat([cal_val(x) for x in dev_data["id"].unique()])

'NoneType' object has no attribute 'generate_ts'
'NoneType' object has no attribute 'generate_ts'
'NoneType' object has no attribute 'generate_ts'
'NoneType' object has no attribute 'generate_ts'
'NoneType' object has no attribute 'generate_ts'
'NoneType' object has no attribute 'generate_ts'
'NoneType' object has no attribute 'generate_ts'
'NoneType' object has no attribute 'generate_ts'
'NoneType' object has no attribute 'generate_ts'
'NoneType' object has no attribute 'generate_ts'
'NoneType' object has no attribute 'generate_ts'


In [51]:
preds = preds[abs(preds["anom"]) < 10]  # I don't know why one of the models is predicting colossal anomalies, but it is.

Overall performance summary below.  Note that stationarity, in particular, does absurdly well as a comparison point (same temperature today as yesterday).  As far as I'm aware, this comparison has not been run for most previous models.  It would be interesting to see how much of a lag is required for NEWT to outperform stationarity.  This does suggest that, if you have observations, "same as yesterday" is probably a better bet than most non-data-assimilating models.

Interestingly, when a massive outlier that was predicting anomalies in the thousands of degrees is removed, global performance is very similar to gagewise performance.  Note that huge anomaly sensitivity isn't representative of any real use case, since in a calibrated model that would be corrected for and the (smoothed) coefficient estimation model won't predict such high sensitivity.  (If it did happen, it would be fairly obvious that ~3000 C is not a reasonable estimate.)

In [48]:
with warnings.catch_warnings(action="ignore"):
    print(preds.groupby("id").apply(NEWT.analysis.perf_summary).describe())

               R2        RMSE         NSE  StationaryNSE  ClimatologyNSE  \
count  943.000000  943.000000  943.000000     943.000000      943.000000   
mean     0.921120    1.567262    0.919433       0.976682        0.951546   
std      0.097137    0.485112    0.100530       0.017775        0.051659   
min      0.088175    0.121245    0.080547       0.855635        0.287647   
25%      0.920707    1.311004    0.919830       0.969289        0.935658   
50%      0.944844    1.529255    0.944183       0.980622        0.958143   
75%      0.961753    1.757990    0.960944       0.988834        0.979532   
max      0.988151    7.747312    0.987592       0.999138        1.000000   

       AnomalyNSE       Pbias        Bias     MaxMiss  
count  833.000000  943.000000  943.000000  943.000000  
mean     0.152678    0.071368    0.006446    1.292032  
std      3.812618    0.761416    0.085490    0.881240  
min    -77.248323   -9.045063   -1.071342    0.015025  
25%      0.309765   -0.133994   -0.

In [49]:
NEWT.analysis.perf_summary(preds)

Unnamed: 0,R2,RMSE,NSE,StationaryNSE,ClimatologyNSE,AnomalyNSE,Pbias,Bias,MaxMiss
0,0.959498,1.573854,0.959488,0.983768,0.607717,0.902755,0.033256,0.004434,1.905588


## Forecast

### HRRR

### GFS/GEFS

## Heterogeneous Forecast

# Gagewise Cross-Validation

Well, something extremely weird is happening with the anomaly predictions.  It's not the coefficients themselves, which are consistently "sane" (max sensitivity coefficient is 0.95).  Something weird must be happening with the weather anomaly analysis itself.  The problem is not air temperature dailies, which range from -26 to +40.  So there must be some sort of issue with air temperature itself.  Not that either: observed tmax ranges from -28 to +45.  So what on Earth is going on?

Even though tmax is reasonable, the anomaly jumps from a sane 4.1 at the 80.5 percentile to ~600M at 81%.  More precisely, the jump is at about 80.73%.  How on Earth are ~20% of the anomaly predictions suddenly jumping into the millions of degrees?  We know that it's happening for the large majority of watersheds, since even the 75th percentile NSE is solidly negative.  That also implies that the cause is something dynamic.

There seem to be two major possibilities:

- Threshold engine is going haywire
- Convolution is somehow broken

It looks like threshold issues could conceivably result if the cutoff and minimum temperatures are quite close.  Let's take a closer look at the coefficients.

Some of the gaps are quite small (as low as 0.01), but some erroneous coefficient in the hundreds would not explain predictions in the millions, and regardless, that occurs rarely.  That, in itself, is not the problem.

In [None]:
# Modbuilder: data -> (ws -> prediction)
def make_modbuilder(use_clim, lookback):
    def next_modbuilder(data):
        nx = NEXT.NEXT.from_data(data)
        def prd(x):
            mod = nx.make_newt(x, reset=True, use_climate=use_clim, climyears=lookback).get_newt()
            mod.dynamic_engine = None
            return mod.run_series(data)
        return prd
        # return lambda x: nx.make_newt(x, reset=True, use_climate=use_clim, climyears=lookback).get_newt().coefs_to_df()
        # return lambda x: nx.run(x, reset=True, use_climate=use_clim, climyears=lookback)
    return next_modbuilder

In [None]:
clim=False
lookback=5
with warnings.catch_warnings(action="ignore"):
    # kfr = NEWT.analysis.kfold(dev_data, make_modbuilder(clim, lookback), output=f"results/kfold_coefficients{'_withclim_lookback' + str(lookback) if clim else ''}.csv")
    kfr = NEWT.analysis.kfold(dev_data, make_modbuilder(clim, lookback), output=f"results/kfold_nothreshold.csv")
    # kfr = NEWT.analysis.kfold(dev_data, make_modbuilder(clim, lookback), output=f"results/kfold{'_withclim_lookback' + str(lookback) if clim else ''}.csv")

In [None]:
kfr.groupby("id").apply(NEWT.analysis.perf_summary, include_groups=False).describe()

In [None]:
kfr.describe()

# Test Set Validation

## Hindcast

## Forecast

### HRRR

### GFS/GEFS

## Heterogeneous Forecast

# Extrapolation Tests

## Regional

## Elevation

## Walk-Forward

# Regime Shift/Disturbance

# Small Stream

# True Forecast