The reviewer asked us to assess the robustness of our linear response functions (LRFs) using a bootstrapping procedure. Here is their comment:

> avoiding causality statements without further physical analysis and (2) assessing the robustness of the upper-tropospheric features by bootstrapping the linear response functions. For instance, one could calculate the level-by-level variance of the linear response function about different states that have similar thermodynamic properties (e.g. latitude by latitude).

Let's focus on the tropics, since that is where the instability is.

In [None]:
import numpy as np
import torch
import xarray as xr

from jacobian import *
from src.data import open_data
import common

In [None]:
def get_model(path="../../models/265/5.pkl"):
    # open model
    model = torch.load(path)
    model.eval()
    
    return model

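def bootstrap_samples(tropics, n):
    """Yield n single-column samples drawn with replacement from `tropics`."""
    sample_dims = ['time', 'x', 'y']
    dim_name = 'sample'
    # Draw n random (time, x, y) coordinates with replacement. Sharing the
    # 'sample' dimension makes xarray select points, not an outer product.
    indexers = {dim: xr.DataArray(np.random.choice(tropics[dim].values, n), dims=[dim_name])
                for dim in sample_dims}
    samples_dataset = tropics.sel(**indexers)

    for i in range(n):
        # The draws above are already random, so iterate over them in order
        # instead of resampling the sample index a second time.
        sample = (samples_dataset
                  .isel(sample=i)
                  .expand_dims(['y', 'x'], [-2, -1])
                  .compute())
        yield sample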
        
        
def get_jacobian(model, sample):
    necessary_variables = sample[model.input_names]
    jac = saliency_map_one_location(model, necessary_variables)
    return jac


# compute bootstrap statistics
from collections import defaultdict

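def apply_list_jacobian(func, seq):
    """Apply `func` across samples for every block of the nested Jacobian dict.

    `seq` is a list of Jacobians indexed as jac[output_key][input_key].
    The same key set is used for inputs and outputs.
    """
    keys = seq[0].keys()
    output = defaultdict(dict)
    for ink in keys:
        for outk in keys:
            output[outk][ink] = func([it[outk][ink] for it in seq])
    return output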


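def mean(seq):
    # element-wise mean over a list of tensors
    n = len(seq)
    return sum(seq)/n

def std(seq):
    # element-wise population standard deviation (divides by n, not n - 1)
    n = len(seq)
    mu = mean(seq)
    variance = sum((it - mu)**2 for it in seq)/n
    return torch.sqrt(variance)


def std_error(seq):
    # standard error of the mean: std / sqrt(n)
    n = len(seq)
    return std(seq)/torch.sqrt(torch.tensor(n).float())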

from toolz import curry

@curry
def quantile(seq, q):
    arr = torch.stack(seq).detach().numpy()
    ans = np.quantile(arr, q, axis=0)
    return torch.tensor(ans)

In [None]:
# bootstrap sample size
n = 100

model = get_model(path="../../models/265/5.pkl")
training = open_data('training')
training['region'] = common.get_regions(training.y)
tropics = training.isel(y=slice(30,34))
samples = bootstrap_samples(tropics, n)
jacobians = [get_jacobian(model, sample) for sample in samples]

Compute some statistics across the bootstrapped samples:

In [None]:
mean_jacobian = apply_list_jacobian(mean, jacobians)
std_jacobian = apply_list_jacobian(std, jacobians)
se_jacobian = apply_list_jacobian(std_error, jacobians)
# 2.5th and 97.5th percentiles, i.e. the bounds of the central 95% interval
q25_jacobian = apply_list_jacobian(quantile(q=.025), jacobians)
q975_jacobian = apply_list_jacobian(quantile(q=.975), jacobians)
p = training.p.values[0].ravel()

In [None]:
plot((se_jacobian, p));

I am not sure the standard error is the right quantity to plot here. The question is what the significance threshold should be for each panel.

I think the quantiles of the bootstrap samples are more relevant. Here is the 97.5th percentile of the LRF:

In [None]:
plot((q975_jacobian, p));

and the 2.5th percentile:

In [None]:
plot((q25_jacobian, p));
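One possible way to turn the two percentile maps into a per-panel significance threshold is to flag the entries whose 2.5th to 97.5th percentile interval excludes zero. The sketch below is only an illustration on synthetic data: `sign_robust_mask` and the `demo_*` arrays are hypothetical names, and it assumes each Jacobian block has been stacked into a NumPy array of shape `(n_samples, n_out, n_in)`.

```python
import numpy as np


def sign_robust_mask(samples, lower_q=0.025, upper_q=0.975):
    """Flag entries whose sample interval [lower_q, upper_q] excludes zero.

    samples: array of shape (n_samples, n_out, n_in), one Jacobian block
    per bootstrap sample (an assumed layout, not the notebook's nested dict).
    """
    lower = np.quantile(samples, lower_q, axis=0)
    upper = np.quantile(samples, upper_q, axis=0)
    # The sign is consistent when both bounds lie on the same side of zero.
    return (lower > 0) | (upper < 0)


# Synthetic stand-in for 100 sampled Jacobian blocks on 4 levels.
rng = np.random.default_rng(0)
n_samples, n_levels = 100, 4
signal = np.zeros((n_levels, n_levels))
signal[0, 0] = 5.0    # strongly positive entry
signal[1, 1] = -5.0   # strongly negative entry
demo_samples = signal + rng.normal(size=(n_samples, n_levels, n_levels))

demo_mask = sign_robust_mask(demo_samples)
print(demo_mask)
```

Plotting the mean LRF with the unflagged entries masked out would show in a single panel which features keep their sign across samples.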

We can see that the sign reverses between these plots. I assume this is due to diurnal variability. Perhaps I should condition these LRFs on the time of day, or only consider the LRF averaged over the diurnal cycle. Nonetheless, the magnitude of the upper-level sensitivity remains comparable.
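A rough sketch of the conditioning idea, on synthetic data only. It assumes the time coordinate of each sample is in fractional days (this has not been checked against the dataset), and the names `diurnal_composite` and `diurnal_mean` are hypothetical helpers, not part of `jacobian`.

```python
import numpy as np


def diurnal_composite(times_in_days, samples, n_bins=4):
    """Average per-sample Jacobians within time-of-day bins.

    times_in_days: shape (n_samples,), ASSUMED to be in fractional days.
    samples: shape (n_samples, n_out, n_in).
    Returns a dict mapping bin index -> mean Jacobian for that bin.
    """
    time_of_day = np.mod(times_in_days, 1.0)
    bin_index = np.minimum((time_of_day * n_bins).astype(int), n_bins - 1)
    return {
        b: samples[bin_index == b].mean(axis=0)
        for b in range(n_bins)
        if np.any(bin_index == b)
    }


def diurnal_mean(composites):
    """Average the bin composites so each time of day gets equal weight."""
    return np.mean(list(composites.values()), axis=0)


# Synthetic samples whose sign follows a daily cycle, plus small noise.
rng = np.random.default_rng(0)
times = rng.uniform(100.0, 110.0, size=400)
phase = np.cos(2 * np.pi * times)
demo = phase[:, None, None] * np.ones((400, 3, 3)) + 0.1 * rng.normal(size=(400, 3, 3))

composites = diurnal_composite(times, demo, n_bins=4)
cycle_mean = diurnal_mean(composites)
```

If the sign reversal really is diurnal, the percentile interval within each bin should be much narrower than the pooled one, while the diurnally averaged LRF shows what survives the cycle.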