# Coefficient Estimation for NEWT

The optimal model involves eight seasonality coefficients and a baseline air temperature sensitivity, for a total of nine coefficients to estimate.  We don't need to estimate anything for the relationship between seasonal temperature and sensitivity on a given day, since a global GAM handles that.

However, those nine coefficients are correlated, which raises problems for uncertainty analysis and, in general, suggests inefficiency.  So the first thing we're going to do is principal component analysis (PCA) on the coefficients.

In [None]:
import pandas as pd
import pandas.plotting as pdp
import numpy as np
import matplotlib.pyplot as plt
import rtseason as rts
import seaborn as sns
import scipy
import os
from sklearn.decomposition import PCA
import pygam  # https://pygam.readthedocs.io/en/latest/notebooks/tour_of_pygam.html
from pygam import LinearGAM, s, te, l, f
from NEWT import Watershed, kfold, perf_summary, statics, analysis, engines
from NEWT.make_coefficients import build_training_data
import NEXT.data as ndata
import NEWT
from math import ceil
anomweights = np.array([0.132, 0.401, 0.162, 0.119, 0.056, 0.13 ])
bp = "/scratch/dphilippus/notebooks/next_validation/"

In [None]:
# sillymod: use one generic model.  Otherwise, fit each model to itself.
def dummy_modbuilder(data, sillymod):
    data = data.groupby("date")[["temperature", "tmax", "vp"]].mean().assign(date = lambda x: x.index)
    def runner(ws):
        try:
            if sillymod:
                return Watershed.from_data(data).run_series(ws)
            else:
                return Watershed.from_data(ws).run_series(ws)
        except Exception as e:
            print(e)
    return runner

In [None]:
data = pd.read_csv(bp + "DevDataBuffers.csv", dtype={"id": "str"}, parse_dates=["date"])
data = data[(data["temperature"] > -0.5) & (data["temperature"] < 40)]
data

In [None]:
data.drop(columns="date").describe()

## Add Reach Buffer Data 

For each site, we want 1-km-upstream, 15-m-buffer canopy cover and mean direction.  Conveniently, flow direction maps exactly onto slope aspect of the river itself, since by definition it flows precisely downhill.  Runtime depends on period of record, but is currently averaging 15 seconds per site (suggesting ~4 hours for all of them).

## A Silly Kfold Test

First we need to test the cross-validation setup, so we have a dummy model to test it with.  One option is to use a model that's just trained on everything.  The other option is to use a model that's trained on each watershed to predict itself.

## Prepare Coefficients

Since the kfold testing seems to be working, let's prepare model coefficients.  To recap, we need to provide seasonality coefficients, tmax and vp sensitivities, and tmax and vp dailies (for static/spin-up).  We can also set up dynamic and yearly modification engines, and will eventually separate estimators into static (at start) and climate/dynamic (through a climate modification engine), but that can come later.

A brief test (since deleted) established that a simple sinusoid (annual-period, variable-phase sine) does a solid job capturing the vp and tmax annual cycles (median R2 of 0.95 and 0.92, respectively), so those coefficients are suitable here.
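For reference, an annual-period, variable-phase sine can be fit by ordinary least squares on a sine/cosine pair.  The sketch below is not the deleted test: the function name `fit_annual_sine`, the 365-day period, and the synthetic series are all assumptions made for illustration.

```python
import numpy as np

def fit_annual_sine(day, value):
    """Fit value ~ mean + a*sin(w*day) + b*cos(w*day), assuming a 365-day period."""
    day = np.asarray(day, dtype=float)
    value = np.asarray(value, dtype=float)
    w = 2 * np.pi / 365.0
    # A variable-phase sine is linear in (sin, cos), so least squares suffices.
    A = np.column_stack([np.ones_like(day), np.sin(w * day), np.cos(w * day)])
    coef, *_ = np.linalg.lstsq(A, value, rcond=None)
    fitted = A @ coef
    amplitude = np.hypot(coef[1], coef[2])
    # value = mean + amplitude * sin(w*day + phase)
    phase = np.arctan2(coef[2], coef[1])
    r2 = 1 - np.sum((value - fitted) ** 2) / np.sum((value - value.mean()) ** 2)
    return {"mean": coef[0], "amplitude": amplitude, "phase": phase, "r2": r2}

# Synthetic daily tmax-like series (hypothetical values, not the NEWT data).
rng = np.random.default_rng(0)
day = np.arange(1, 366)
series = 15 + 10 * np.sin(2 * np.pi * (day - 110) / 365.0) + rng.normal(scale=1.0, size=day.size)
sine_fit = fit_annual_sine(day, series)
print(sine_fit)
```

The mean, amplitude, and phase are the three numbers such a fit would contribute per variable.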

In [None]:
def make_coefficients(grp):
    if len(grp["day"].unique()) < 181:
        return None
    anomilized = NEWT.watershed.anomilize(grp).sort_values("date")[["date", "st_anom", "at_anom"]]  # st_anom, at_anom
    ssn = rts.ThreeSine.from_data(grp[["day", "temperature"]]).to_df().drop(columns=["RMSE", "R2"])
    for i in range(6):
        anomilized[f"delta{i}"] = anomilized["at_anom"].shift(i)
    anomilized = anomilized.dropna()
    X = anomilized.loc[:, "delta0":"delta5"]
    y = anomilized["st_anom"]
    prd = X @ anomweights
    ssn["sensitivity"] = y.abs().mean() / prd.abs().mean()
    return ssn
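As a quick check of the sensitivity step in `make_coefficients`, here is a minimal sketch on synthetic data.  The anomaly series and the "true" sensitivity of 0.6 are hypothetical; only the lag weights are copied from `anomweights` above.

```python
import numpy as np
import pandas as pd

# Lag weights copied from anomweights in this notebook.
weights = np.array([0.132, 0.401, 0.162, 0.119, 0.056, 0.13])

rng = np.random.default_rng(7)
# Hypothetical air temperature anomalies (C).
at_anom = pd.Series(rng.normal(scale=3.0, size=500))

# Columns delta0..delta5 hold the anomaly lagged by 0..5 days.
lags = pd.concat({f"delta{i}": at_anom.shift(i) for i in range(6)}, axis=1).dropna()
weighted = lags.to_numpy() @ weights

# Build a stream anomaly with a known (hypothetical) sensitivity.
true_sensitivity = 0.6
st_anom = true_sensitivity * weighted

# Same ratio-of-mean-magnitudes estimator as in make_coefficients.
sensitivity = np.abs(st_anom).mean() / np.abs(weighted).mean()
print(sensitivity)
```

With no noise, the ratio recovers the known sensitivity.  With real data it measures the relative magnitude of the stream and weighted air anomalies rather than a regression slope.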

In [None]:
coefs = data.groupby("id").apply(make_coefficients, include_groups=False).droplevel(1)
coefs.describe()

In [None]:
coefs.corr()

## PCA

In [None]:
offset = coefs.mean()
scale = coefs.std()

In [None]:
co_norm = (coefs - offset) / scale
co_norm

There are some nontrivial cross-correlations, so let's see what the principal component axes look like.

In [None]:
pca = PCA()
fit = pca.fit(co_norm)
evr = fit.explained_variance_ratio_
print(evr)
print(np.cumsum(evr))

Of the 9 principal components, the first 8 capture 96% of the variance, the first 7 capture 92%, the first 6 capture 86%, and so on.  But that's not the point: we're looking to predict uncorrelated variables, not compress the space.

In [None]:
sns.lineplot(x=np.arange(1, len(evr)+1), y=np.cumsum(evr))

In [None]:
pcax = pd.DataFrame(fit.components_,
            columns=coefs.columns)  # shape: components x features. So a row is a PC and a column is a feature.
pcax

So what's actually going into this stuff?  From most important to least:

0. Distributed across pretty much everything except Fall, Spring, and Summer days.
1. Heavy on the three days omitted from component 0 (Fall, Spring, and Summer), plus some weight on Intercept, Amplitude, and FallWinter.
2. Heavy on SpringDay and sensitivity with some Amplitude.
3. Heavy on SpringSummer and on the Summer, Spring, and Fall days.
4. SummerDay, SpringSummer, and sensitivity
5. FallWinter, with a little Fall and Winter days
6. Amplitude and FallDay
7. WinterDay, SpringDay, and Intercept
8. Intercept, SummerDay, and FallWinter

Note that all components and all features have roughly equal total absolute weights, just distributed differently.  Total non-absolute weights are quite variable.

In [None]:
sns.heatmap(pcax.T.abs(), cmap="viridis")

As a sanity check, multiplying the normalized coefficients by the transposed PCA components should give us 9 uncorrelated variables.  It works!  And we invert it by multiplying by the (non-transposed) PCA components, then undoing the normalization.

In [None]:
comps = co_norm @ np.transpose(fit.components_)
comps.columns = [f"pca{x}" for x in range(9)]
comps.corr()

In [None]:
fit.components_

In [None]:
coefs_rc = comps @ fit.components_
coefs_rc.columns = coefs.columns
coefs_rc = coefs_rc * scale + offset
coefs_rc - coefs
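The same round trip can be checked in isolation.  This is a minimal sketch on synthetic data: the three-column frame and every `demo_`-prefixed name are hypothetical stand-ins for `coefs`, `offset`, `scale`, and `fit`.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Hypothetical correlated "coefficients": three columns sharing one latent factor.
latent = rng.normal(size=(200, 1))
raw = pd.DataFrame(
    latent @ np.array([[1.0, 0.8, -0.5]]) + 0.5 * rng.normal(size=(200, 3)),
    columns=["a", "b", "c"],
)
demo_offset, demo_scale = raw.mean(), raw.std()
normed = (raw - demo_offset) / demo_scale

demo_pca = PCA().fit(normed)

# Forward: normalized data times transposed components gives uncorrelated scores.
scores = normed.to_numpy() @ demo_pca.components_.T
score_corr = np.corrcoef(scores, rowvar=False)
off_diag = score_corr - np.diag(np.diag(score_corr))

# Inverse: scores times (non-transposed) components, then undo the normalization.
recovered = (scores @ demo_pca.components_) * demo_scale.to_numpy() + demo_offset.to_numpy()
max_err = np.abs(recovered - raw.to_numpy()).max()

print(np.abs(off_diag).max(), max_err)
```

Both printed values should be at floating-point noise level, because the component matrix is orthonormal and its transpose is therefore its inverse.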

# Model

We'll use a GAM.  This lets us test a broad range of relationships while keeping things quick and interpretable, and it has precedent in the literature (the PNW model from Siegel et al. at the NOAA Northwest Fisheries Science Center).

## Generalized Code

In [None]:
preds = NEWT.coef_est.preprocess(data).drop(columns=["level_1_x", "level_1_y", "date", "day"])
preds.columns

In [None]:
mdata = comps.merge(preds, on="id", how="left")

In [None]:
X = mdata.iloc[:, 10:]
allvar = X.columns
Y = mdata.loc[:, "pca0":"pca8"]

In [None]:
allvar

In [None]:
# Reference linear model
Xnp = np.concatenate((np.ones((len(X), 1)), X.to_numpy()), axis=1)
lfit = np.linalg.lstsq(Xnp, Y, rcond=None)[0]
prd = Xnp @ lfit
pd.concat([
    pd.DataFrame({"R2": np.corrcoef(Y.iloc[:, i], prd[:, i])[0, 1]**2}, index=[Y.columns[i]])
    for i in range(len(Y.columns))
])

In [None]:
varname = {
    'intercept': "Mean Air Temperature (C)",
    'prcp': "Mean Precipitation (mm/day)",
    'cold_prcp': "Mean Subfreezing Precip (mm/day)",
    'frozen': "Proportion of Days Below Freezing",
    'srad': "Mean Solar Radiation (W/m2)",
    'water': "Water Land Cover Fraction",
    'developed': "Developed Land Cover Fraction",
    'barren': "Barren Land Cover Fraction",
    'forest': "Forest Land Cover Fraction",
    'shrubland': "Shrubland Land Cover Fraction",
    'herbaceous': "Herbaceous Land Cover Fraction",
    'cultivated': "Cultivated Land Cover Fraction",
    'wetland': "Wetland Land Cover Fraction",
    'ice_snow': "Ice/Snow Land Cover Fraction",
    'canopy': "Last-km Riparian Canopy Cover (%)",
    'flowdir': "Mean Last-km Flow Direction (deg)", 
    'area': "Watershed Area (m2)",
    'elev': "Mean Elevation (m)",
    'elev_min': "Min Elevation (m)",
    'slope': "Watershed Mean Slope (m/m)",
    'asp_north': "Slope Aspect Mean North Component",
    'asp_east': "Slope Aspect Mean East Component",
    'lat': "Latitude (deg N)",
    'lon': "Longitude (deg E)",
    'prcp_sd': "Std Dev Precipitation (mm/day)",
    'srad_sd': "Std Dev Solar Radiation (W/m2)",
    'vp': 'Mean Vapor Pressure (Pa)',
    'vp_sd': "Std Dev Vapor Pressure (Pa)",
    'prcp_phi': "Mean Date of Precip (Julian day)",
    'prcp_index': "Precip Seasonality Index",
    'tmax': 'Mean Daily Max Air Temp (C)',
    'tmax_phi': 'Mean Date of Air Temp (Julian day)',
    'tmax_index': 'Air Temp Seasonality Index',
    "Intercept": "3S Intercept (C)",
    "Amplitude": "3S Amplitude (C)",
    "SpringSummer": "3S Spring/Summer Coef. (C)",
    "FallWinter": "3S Autumn/Winter Coef. (C)",
    "SpringDay": "3S Spring Date (Julian day)",
    "SummerDay": "3S Summer Date (Julian day)",
    "FallDay": "3S Autumn Date (Julian day)",
    "WinterDay": "3S Winter Date (Julian day)",
    "sensitivity": "Air Temp. Anomaly Sensitivity (C/C)"
}
rename = lambda vrs: [varname[v] for v in vrs]

After PCA, linear performance is much worse: the best-predicted component only reaches an R2 of 0.59.  We'll see how the GAM does.

In [47]:
eqmk = lambda N: sum([s(i) for i in range(1, N)], start=s(0))
without = lambda l, v: [i for i in l if i != v]

def gam_gcv(X, Y, eq, lam):
    return LinearGAM(eq, lam=lam).fit(X, Y).statistics_["GCV"]

def tune_lam(X, Y, lams, counter=0, tolerance=0.001, maxdepth=100, debug=False):
    # Lams: 3 lambda candidates (low, mid, high)
    # Tolerance: minimum change in GCV to proceed.
    print("|", end="")
    eq = eqmk(len(X.columns))
    (l, m, r) = lams
    left = gam_gcv(X, Y, eq, l)
    mid = gam_gcv(X, Y, eq, m)
    right = gam_gcv(X, Y, eq, r)
    if debug:
        print(f"{l} = {left} | {m} = {mid} | {r} = {right}")
    # First, we compute the improvement, the new set, and the reference
    # option.  After that, we'll decide whether to keep iterating.
    # Spacing is computed by powers of 10.
    if (delta := mid - left) > 0:  # low is good
        ref = l
        nl = (l/10, l, l * (m/l)**(1/2))
    elif (delta := mid - right) > 0:  # high is good
        ref = r
        nl = (m * (r/m)**(1/2), r, r*10)
    else:
        delta = mid - min([left, right])
        ref = m
        nl = (m * (l/m)**(1/2), m, m * (r/m)**(1/2))
    # Now we decide whether to proceed or return.
    if delta > tolerance and counter < maxdepth:
        return tune_lam(X, Y, nl, counter+1, tolerance, maxdepth, debug)
    else:
        return ref

def tune_gcv(varset, lam, gcv, X, Y, tolerance=0.0001, debug=False):
    # Iteratively tune a GAM to optimize GCV.  In each round, we optimize lambda and drop one variable.
    # Proceed only if we reduce GCV.  Otherwise, return the current arrangement.
    varset = varset.copy()
    if gcv is None:
        gcv = gam_gcv(X[varset], Y, eqmk(len(varset)), lam)
    # First pass: tune lambda.
    lam = tune_lam(X[varset], Y, (lam/3, lam, lam*3))
    # Next: drop one variable.
    best_gcv = gcv
    best_var = None
    if debug:
        print(gcv)
    for v in varset:
        new_vars = without(varset, v)
        new_gcv = gam_gcv(X[new_vars], Y, eqmk(len(new_vars)), lam)
        if debug:
            print(v, new_gcv)
        if new_gcv < best_gcv:
            best_gcv = new_gcv
            best_var = v
    if best_var is None or gcv - best_gcv < tolerance:
        return (varset, lam, gcv)
    else:
        varset.remove(best_var)
        print(f"Lambda: {lam:.2f} | Dropped: {best_var} | GCV: {best_gcv:.4f}")
        return tune_gcv(varset, lam, best_gcv, X, Y, tolerance, debug)  # keep the debug flag in later rounds

def pca_pdps(gams, savebase, do_x=True, Npca=9, pca=fit.components_, scale=scale, offset=offset, cols=coefs.columns,
            rtn=False, hgt=3):
    """
    Generate PDPs, then invert PCA so they correspond to actual coefficients.
    Save resulting PDPs by y and by x in savebase.
    GAMs should be a list of tuples: (gam, x names, simple x names, PCA index)
    GAMs must be in order of PCA.
    
    How does the math work?
    A given PC corresponds to a variety of coefficients.
    If we multiply one X-PC (single column) by the corresponding PCA row,
    we get the contribution of that X, to that PC, distributed across all relevant Y.
    Now, if we take every PC-PDP for that X (with zeroes where NA) and multiply
    by the PCAs, we get the total contribution of that X to every Y.
    That makes a data frame of nX * nY.
    If we concatenate those together, we can group them by X or Y, and generate plots accordingly.
    
    To do that, we first have to extract X/PC pairs from all the PDPs.
    
    This is modified from pdps_by_x.
    
    Run separately with do_x=True (by-X) and False (by-Y) to avoid running out of memory.
    
    If rtn is True, it will just return the first plot, for debugging purposes.
    """
    xes = {}
    for (gam, xns, sxns, index) in gams:
        terms = [t for t in gam.terms if not t.isintercept]
        for (i, sxn) in enumerate(sxns):
            if not sxn in xes:
                xes[sxn] = {"xlab": xns[i], "ys": []}
            gxg = gam.generate_X_grid(term=i, n=100)
            X = gxg[:, terms[i].feature]
            pdep = gam.partial_dependence(term=i, X=gxg)
            xes[sxn]["ys"].append((
                index, X, pdep
            ))
    # Now we have a dictionary {simple x name: {"xlab", "ys": [(index, X, Y)]}}.  Hopefully the X are always the same?
    results = []
    for (sxn, vals) in xes.items():
        xlab = vals["xlab"]
        yinfo = vals["ys"]
        ref_x = yinfo[0][1]
        Y = np.zeros((len(ref_x), Npca))
        for yset in yinfo:
            if not (yset[1] == ref_x).all():
                raise ValueError(f"Mismatched X grids!  {sxn}, {yset[0]}")
            Y[:, yset[0]] = yset[2]
        # Now we have an X column (ref_x) and a Y matrix.
        # Don't add offset, because we just want partial dependency.
        Y_adj = pd.DataFrame(Y @ pca, columns=cols) * scale
        df = pd.concat([pd.Series(ref_x, name="X"), Y_adj], axis=1)
        df["X_name"] = xlab
        df["sxn"] = sxn
        results.append(df)
    results = pd.concat(results)
    # Now results is a data frame with X, [ys...], X_name, sxn
    # Plots by X variable.
    if do_x:
        for sxn, df in results.groupby("sxn"):
            xname = df["X_name"].iloc[0]
            ynames = df.columns[1:]
            fig = sns.relplot(df.melt(id_vars=["X", "X_name", "sxn"]), x="X", y="value",
                              col="variable", col_wrap=3, facet_kws={"sharey": False},
                              kind="line", height=hgt)
            fig.set_xlabels(xname)
            for ax in fig.axes:
                vn = varname[ax.title.get_text().split(" = ")[1]]
                ax.set_ylabel(vn)
                ax.set_title("")
            if rtn:
                return fig
            fig.savefig(savebase + f"X_{sxn}.png", dpi=1000)
            plt.close()
    # Plots by Y variable.
    else:
        for syn, df in results.melt(id_vars=["X", "X_name", "sxn"]).groupby("variable"):
            yname = varname[syn]
            # N = len(df["X_name"].unique())
            # wrap = int(np.sqrt(N)) + 1
            wrap = 4
            fig = sns.relplot(df, x="X", y="value", col="X_name", col_wrap=wrap,
                              facet_kws={"sharex": False}, kind="line", height=hgt)
            fig.set_ylabels(yname)
            for ax in fig.axes:
                vn = ax.title.get_text().split(" = ")[1]
                ax.set_xlabel(vn)
                ax.set_title("")
            if rtn:
                return fig
            fig.savefig(savebase + f"Y_{syn}.png", dpi=1000)
            plt.close()
    
def pdps(gam, xy0=True, names=None, ylab=None, save=None):
    nt = len(gam.terms) - 1
    Ny = 1 if nt < 3 else (2 if nt < 9 else 3)
    Nx = nt // Ny + (1 if nt % Ny > 0 else 0)
    _, axes = plt.subplots(Ny, Nx, figsize=(12,8))
    for i, term in enumerate(gam.terms):
        if term.isintercept:
            continue
        ax = axes[i // Nx, i % Nx] if Ny > 1 else axes[i]
        if i == 0 and xy0:
            XX = gam.generate_X_grid(term=i, meshgrid=True)
            Z = gam.partial_dependence(term=i, X=XX, meshgrid=True)  # same term as the grid above
            co = ax.contourf(XX[1], XX[0], Z)
            
        else:
            XX = gam.generate_X_grid(term=i)
            pdep, confi = gam.partial_dependence(term=i, X=XX, width=0.95)
    
            ax.plot(XX[:, term.feature], pdep)
            ax.plot(XX[:, term.feature], confi, c='r', ls='--')
        ax.set_xlabel(repr(term) if names is None else names[i])
        if ylab is not None and i % Nx == 0:
            ax.set_ylabel(ylab)
    plt.tight_layout()
    if save is not None:
        plt.savefig(bp + save)

        
def pdps_by_x(named_gams, savebase):
    # Generate PDPs, as above, but by x-variable, not y-variable.
    # named_gams should be a list of tuples: (gam, xnames, simple_xnames, yname).
    # They will be saved as savebase + simple_xname + .png.
    xes = {}
    for (gam, xns, sxns, yn) in named_gams:
        terms = [t for t in gam.terms if not t.isintercept]
        for (i, sxn) in enumerate(sxns):
            if not sxn in xes:
                xes[sxn] = {"xlab": xns[i], "ys": []}
            gxg = gam.generate_X_grid(term=i)
            X = gxg[:, terms[i].feature]
            pdep, confi = gam.partial_dependence(term=i, X=gxg, width=0.95)
            xes[sxn]["ys"].append((
                yn, X, pdep, confi
            ))
    for (sxn, vals) in xes.items():
        ys = vals["ys"]
        nt = len(ys)
        nrow = int(nt ** 0.5)
        ncol = ceil(nt / nrow)
        _, axes = plt.subplots(nrow, ncol, figsize=(12, 8))
        for (i, Y) in enumerate(ys):
            ax = axes[i // ncol, i % ncol] if nrow > 1 else axes[i] if nt > 1 else axes
            xs = Y[1]
            ax.plot(xs, Y[2])
            ax.plot(xs, Y[3], c='r', ls='--')
            ax.set_ylabel(Y[0])
            if i // ncol == nrow - 1:
                ax.set_xlabel(vals["xlab"])
        plt.tight_layout()
        plt.savefig(savebase + sxn + ".png", dpi=1200)
        plt.close()  # release the figure so memory doesn't pile up across x-variables

def get_pd(gam, n):
    XX = gam.generate_X_grid(term=n)
    y = gam.partial_dependence(term=n, X=XX)
    # Index by the term's feature column, which need not equal the term index.
    return (XX[:, gam.terms[n].feature], y)
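The bracket logic in `tune_lam` is easier to follow on a toy objective.  The sketch below is a simplified, iterative restatement, not the function above: `bracket_search` and `toy_gcv` are hypothetical names, and the objective is a made-up curve with its minimum at lambda = 100 rather than a real GCV score.

```python
import math

def bracket_search(objective, lams, tolerance=1e-3, max_iter=100):
    """Move a (low, mid, high) bracket on a log scale toward lower objective values."""
    lo, mid, hi = lams
    ref = mid
    for _ in range(max_iter):
        f_lo, f_mid, f_hi = objective(lo), objective(mid), objective(hi)
        if f_lo < f_mid:      # low end is better: recentre the bracket on it
            delta, ref = f_mid - f_lo, lo
            nxt = (lo / 10, lo, math.sqrt(lo * mid))
        elif f_hi < f_mid:    # high end is better: recentre the bracket on it
            delta, ref = f_mid - f_hi, hi
            nxt = (math.sqrt(mid * hi), hi, hi * 10)
        else:                 # middle is best: delta <= 0, so the search stops here
            delta, ref = f_mid - min(f_lo, f_hi), mid
            nxt = (math.sqrt(lo * mid), mid, math.sqrt(mid * hi))
        if delta <= tolerance:
            return ref
        lo, mid, hi = nxt
    return ref

def toy_gcv(lam):
    # Hypothetical stand-in for GCV, minimized at lam = 100.
    return (math.log10(lam) - 2) ** 2

best_lam = bracket_search(toy_gcv, (10 / 3, 10, 30))
print(best_lam)
```

One design consequence worth noting: as soon as the middle candidate is the best of the three, the improvement is non-positive and the search returns, so the result is only as precise as the bracket spacing at that point.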

## PCA0

In [None]:
(var0, lam0, gcv0) = tune_gcv(list(allvar), 10, None, X, Y["pca0"])
print(var0, lam0, gcv0)
eq0 = eqmk(len(var0))
gam0 = LinearGAM(eq0, lam=lam0).fit(X[var0], Y["pca0"])
gam0.summary()

In [None]:
pdps(gam0, False, rename(var0), "PCA 0", "results/PCA_PDPs/GAM0.png")

## PCA1

In [None]:
(var1, lam1, gcv1) = tune_gcv(list(allvar), 10, None, X, Y["pca1"])
print(var1, lam1, gcv1)
eq1 = eqmk(len(var1))
gam1 = LinearGAM(eq1, lam=lam1).fit(X[var1], Y["pca1"])
gam1.summary()

In [None]:
pdps(gam1, False, rename(var1), "PCA 1", "results/PCA_PDPs/GAM1.png")

## PCA2

In [None]:
(var2, lam2, gcv2) = tune_gcv(list(allvar), 12, None, X, Y["pca2"])
print(var2, lam2, gcv2)
eq2 = eqmk(len(var2))
gam2 = LinearGAM(eq2, lam=lam2).fit(X[var2], Y["pca2"])
gam2.summary()

In [None]:
pdps(gam2, False, rename(var2), "PCA 2", "results/PCA_PDPs/GAM2.png")

## PCA3

In [None]:
(var3, lam3, gcv3) = tune_gcv(list(allvar), 10, None, X, Y["pca3"])
print(var3, lam3, gcv3)
eq3 = eqmk(len(var3))
gam3 = LinearGAM(eq3, lam=lam3).fit(X[var3], Y["pca3"])
gam3.summary()

In [None]:
pdps(gam3, False, rename(var3), "PCA 3", "results/PCA_PDPs/GAM3.png")

## PCA4

In [None]:
(var4, lam4, gcv4) = tune_gcv(list(allvar), 10, None, X, Y["pca4"])
print(var4, lam4, gcv4)
eq4 = eqmk(len(var4))
gam4 = LinearGAM(eq4, lam=lam4).fit(X[var4], Y["pca4"])
gam4.summary()

In [None]:
pdps(gam4, False, rename(var4), "PCA 4", "results/PCA_PDPs/GAM4.png")

## PCA5

In [None]:
(var5, lam5, gcv5) = tune_gcv(list(allvar), 10, None, X, Y["pca5"])
print(var5, lam5, gcv5)
eq5 = eqmk(len(var5))
gam5 = LinearGAM(eq5, lam=lam5).fit(X[var5], Y["pca5"])
gam5.summary()

In [None]:
pdps(gam5, False, rename(var5), "PCA 5", "results/PCA_PDPs/GAM5.png")

## PCA6

In [None]:
(var6, lam6, gcv6) = tune_gcv(list(allvar), 10, None, X, Y["pca6"])
print(var6, lam6, gcv6)
eq6 = eqmk(len(var6))
gam6 = LinearGAM(eq6, lam=lam6).fit(X[var6], Y["pca6"])
gam6.summary()

In [None]:
pdps(gam6, False, rename(var6), "PCA 6", "results/PCA_PDPs/GAM6.png")

## PCA7

In [None]:
(var7, lam7, gcv7) = tune_gcv(list(allvar), 10, None, X, Y["pca7"])
print(var7, lam7, gcv7)
eq7 = eqmk(len(var7))
gam7 = LinearGAM(eq7, lam=lam7).fit(X[var7], Y["pca7"])
gam7.summary()

In [None]:
pdps(gam7, False, rename(var7), "PCA 7", "results/PCA_PDPs/GAM7.png")

## PCA8

In [None]:
(var8, lam8, gcv8) = tune_gcv(list(allvar), 10, None, X, Y["pca8"])
print(var8, lam8, gcv8)
eq8 = eqmk(len(var8))
gam8 = LinearGAM(eq8, lam=lam8).fit(X[var8], Y["pca8"])
gam8.summary()

In [None]:
pdps(gam8, False, rename(var8), "PCA 8", "results/PCA_PDPs/GAM8.png")

## Plot GAMs By XVar

In [None]:
# (gam, xnames, simple_xnames, yname)
savebase = bp + "results/PCA_PDPs/PDPbyX_"
allgams = [
    (gam0, rename(var0), var0, "PCA0"),
    (gam1, rename(var1), var1, "PCA1"),
    (gam2, rename(var2), var2, "PCA2"),
    (gam3, rename(var3), var3, "PCA3"),
    (gam4, rename(var4), var4, "PCA4"),
    (gam5, rename(var5), var5, "PCA5"),
    (gam6, rename(var6), var6, "PCA6"),
    (gam7, rename(var7), var7, "PCA7"),
    (gam8, rename(var8), var8, "PCA8")
]
pdps_by_x(allgams, savebase)

## Plot Reconstructed GAMs

In [50]:
savebase = bp + "results/PCA_PDPs/RxcPDPs_"
allgams = [
    (gam0, rename(var0), var0, 0),
    (gam1, rename(var1), var1, 1),
    (gam2, rename(var2), var2, 2),
    (gam3, rename(var3), var3, 3),
    (gam4, rename(var4), var4, 4),
    (gam5, rename(var5), var5, 5),
    (gam6, rename(var6), var6, 6),
    (gam7, rename(var7), var7, 7),
    (gam8, rename(var8), var8, 8)
]
pca_pdps(allgams, savebase, do_x=False)
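The back-transformation inside `pca_pdps` comes down to one matrix product.  Here is a minimal sketch with made-up numbers: the 3x3 `components` matrix, `coef_scale`, and the partial-dependence columns are hypothetical stand-ins for `fit.components_`, `scale`, and the GAM outputs.

```python
import numpy as np

n_grid, n_pc = 5, 3
# Hypothetical orthonormal component matrix (rows = PCs, columns = coefficients).
components = np.linalg.qr(np.random.default_rng(0).normal(size=(3, 3)))[0].T
coef_scale = np.array([2.0, 0.5, 10.0])  # hypothetical per-coefficient std devs

# Partial dependence of each PC on one predictor, on a shared x grid.
# A PC whose GAM dropped this predictor keeps a column of zeros.
pd_pc = np.zeros((n_grid, n_pc))
pd_pc[:, 0] = np.linspace(-1, 1, n_grid)      # effect on PC 0
pd_pc[:, 2] = np.linspace(0.5, -0.5, n_grid)  # effect on PC 2

# PC-space effects -> normalized coefficients -> original units.
# No offset is added: a partial dependence is a deviation, not a level.
pd_coef = (pd_pc @ components) * coef_scale

# Equivalent view: each PC's effect spread along its own loading row, then summed.
by_hand = sum(np.outer(pd_pc[:, k], components[k]) for k in range(n_pc)) * coef_scale

print(pd_coef.round(3))
```

Each column of `pd_coef` is then the total effect of that predictor on one original coefficient, which is what the reconstructed plots show.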

# Print GAMs

In [None]:
# Run this to build list...
print("var_sets = [")
# for (vr, nm) in [("itx", "Intercept"), ("amp", "Amplitude"), ("ssu", "SpringSummer"),
#                  ("fw", "FallWinter"), ("spd", "SpringDay"), ("sud", "SummerDay"),
#                  ("fad", "FallDay"), ("wid", "WinterDay"),
#                  # ("atc", "at_coef")
#                  ("tcmax", "threshold_coef_max"), ("tcmin", "threshold_coef_min"), ("tcc", "threshold_act_cutoff")
#                 ]:
for pca in range(9):
    vr = str(pca)
    nm = "PCA" + vr
    vrs = eval("var" + vr)
    eq = eval("eq" + vr)
    lam = eval("lam" + vr)
    print(f'    {{"name": "{nm}", "vars": {vrs}, "eq": {eq}, "lam": {lam:.0f}}},')
    # print(f"var_{vr} = {eval('var_' + vr)}")
    # print(f"eq_{vr} = {eval('eq_' + vr)}")
    # print(f"lam_{vr} = {eval('lam_' + vr)}")
print("]")

# Cross-Validation Test

Right, we've got some approaches laid out.  Let's test them!  Export the GAM setups and hop over to the validation notebook.

Oddly, the "smart GAMs" are doing slightly worse than the "naive GAMs", though global R2 is much better.  Mainly, RMSE is marginally higher (2.3 vs 2.2 C median).  The plots show the usual mix of some near-perfect fits and some wildly off, with everything in between.

One possibility is that the relatively high lambdas, or likewise the aggressive paring-down, hinder cross-validation performance.  Performance characteristics shown suggest that the amplitude terms or the weather sensitivity may be faring poorly.  It's also possible that the intercept-normalizing of the amplitude coefficients was a net negative.

Allowing more flexibility helped but did not fully address the problem.  Likewise for non-normalizing.

I do wonder if the threshold behavior might be hurting rather than helping, since it seems rather hard to predict threshold coefficients.  The last resort would be that we really need the point-area data.

Excluding thresholds makes performance worse, though it does make the model ~3x faster.  The other last resort is to see what happens if we do include elevation.

Or, I may have been too aggressive about excluding covariates.

- Initial test: R2 0.94 (global 0.86), RMSE 2.3 (2.9) C, NSE 0.88 (0.86), bias 2.8% (2.0%) = 0.34 (0.26) C, max miss 3.0 (14.1) C
- More flexible: R2 0.94 (0.86), RMSE 2.3 (2.9) C, NSE 0.88 (0.86), bias 2.1% (2.0%) = 0.27 (0.27) C, max miss 2.9 (12.3) C
- Non-normalized: R2 0.94 (0.87), RMSE 2.3 (2.8) C, NSE 0.88, bias 2.5% (1.9%) = 0.35 (0.25) C, max miss 3.0 (12.2) C
- No threshold: R2 0.94 (0.87), RMSE 2.3 (2.8) C, NSE 0.88 (0.87), bias 2.6% (2.0%) = 0.35 (0.26) C, max miss 3.3 (12.3) C
- Smarter GAMs (with threshold): R2 0.94 (0.88), RMSE 2.1 (2.7) C, NSE 0.90 (0.88), bias 2.2% (2.0%) = 0.29 (0.26) C, max miss 3.0 (9.3) C

Now we're talking!  And the major problem does seem to be anomaly prediction, but this version is good enough for now.

The anomaly NSE is actually surprisingly good, at 0.50 (better than TE2), but, oddly, that's worse than the stationary ("same as yesterday") baseline.  In TempEst 2, stationary NSE was ~0.2.  Not sure what happened there.  Though, of course, for ungaged watersheds we don't *have* an observation for yesterday, so it's still an improvement, and considerably better than climatology (NSE = 0).
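To make the two baselines concrete, here is a minimal sketch that scores "same as yesterday" and climatology on a synthetic anomaly series.  The AR(1) series and its 0.8 persistence are hypothetical and are not meant to reproduce the numbers above.

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - SSE / variance of obs about its mean."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    return 1 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

rng = np.random.default_rng(1)
# Hypothetical day-to-day anomaly series with AR(1) persistence of 0.8.
n, phi = 2000, 0.8
anom = np.zeros(n)
for t in range(1, n):
    anom[t] = phi * anom[t - 1] + rng.normal()

obs = anom[1:]
stationary = anom[:-1]                        # "same as yesterday"
climatology = np.full_like(obs, obs.mean())   # always predict the mean anomaly

nse_stationary = nse(obs, stationary)
nse_climatology = nse(obs, climatology)
print(nse_stationary, nse_climatology)
```

Climatology scores exactly zero by construction, while the stationary baseline's score rises with day-to-day persistence, so a strongly autocorrelated anomaly series makes "same as yesterday" hard to beat.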