# WARNING

The linear regression results below worked when I generated the report for Chris, but they are no longer reliably reproducible. I think Ridge regression is actually necessary to generate stable linear response functions!

# WARNING

In [None]:
%pylab inline

import sys
sys.path.insert(0, "../")

import numpy as np
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
import xarray as xr
from xnoah.data_matrix import unstack_cat, stack_cat
import pandas as pd

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA

from matplotlib import gridspec
from matplotlib.colors import SymLogNorm

from lib.models import WeightedOutput, get_lrf, weights_to_np

In [None]:
!conda list

In this notebook I analyze the NgAqua data, which includes the radiative component. In the future I will change this analysis to look only at the convective component of the signal, $Q_{1c}= Q_1 - Q_{rad}$.

In [None]:



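def compute_dp(p):
    """Pressure thickness of each layer, given level pressures p (surface first)."""
    # add a ghost level below the first level and p=0 above the last
    p_ghosted = np.hstack((2*p[0]-p[1], p, 0))
    # layer interfaces lie halfway between adjacent levels
    p_interface = (p_ghosted[1:] + p_ghosted[:-1])/2
    dp = -np.diff(p_interface)

    return dp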



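def plot_lrf(lrf, p, input_vars, output_vars,
             width_ratios=(1, 1, .3, .3),
             figsize=(10, 5),
             image_kwargs=None):
    """Plot one pane of the linear response function per (input, output) pair."""
    p = np.asarray(p)
    image_kwargs = image_kwargs or {}
    ni, no = len(input_vars), len(output_vars)
    print("Making figure with", ni, "by", no, "panes")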

    fig = plt.figure(figsize=figsize)
    # one row per output variable, one column per input variable
    grid = gridspec.GridSpec(no, ni, width_ratios=width_ratios)
    axs = np.empty((no, ni), dtype=object)

    for j, input_var in enumerate(input_vars):
        for i, output_var in enumerate(output_vars):
            ax = axs[i, j] = fig.add_subplot(grid[i, j])
            lrf_pane = np.asarray(lrf.loc[input_var][output_var])
            if lrf_pane.shape[0] == 1:
                ax.plot(lrf_pane.ravel(), p)
                ax.invert_yaxis()
            else:
                # compute pressure difference
                dp = compute_dp(p)
                im = ax.pcolormesh(p, p, (lrf_pane).T/dp, **image_kwargs)
                # make colorbar
                cbar = plt.colorbar(im, orientation='vertical', pad=.02)
                # invert y axis for pressure coordinates
                ax.invert_yaxis()
                ax.invert_xaxis()
                
            # turn off axes ylabels
            if j > 0:
                ax.set_yticks([])
            if i < no-1:
                ax.set_xticks([])
                
            # Add variable names
            if j == 0:
                ax.set_ylabel(output_var)
                
            if i == 0:
                ax.set_title(input_var)

            
    return fig, axs
    

In [None]:
X = xr.open_dataset("X.nc")#.pipe(mysel)
Y = xr.open_dataset("Y.nc")#.pipe(mysel)
w = xr.open_dataarray("w.nc")
p = xr.open_dataset("stat.nc").p

# mu = X.mean(['time', 'x', 'y'])

In [None]:
d = WeightedOutput.quickfit("X.nc", "Y.nc", "w.nc")
x, y, mod = map(d.get, ['x', 'y', 'mod'])
mod.score(x,y)

Let's plot the estimated linear response functions, normalized by layer thickness. As in Kuang (2010?), each column in the plots below is divided by the thickness of the corresponding layer.
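To make the normalization concrete, here is a minimal sketch of the layer-thickness computation on a made-up pressure grid. The pressure values and the all-ones matrix are hypothetical; the function mirrors the ghost-point construction used in `compute_dp` in this notebook (one extrapolated level below the first level, and p = 0 above the last).

```python
import numpy as np

def layer_thickness(p):
    """Pressure thickness of each layer, for p ordered from surface to top."""
    p = np.asarray(p, dtype=float)
    # ghost level below the first level (linear extrapolation) and p=0 at the top
    p_ghosted = np.hstack((2 * p[0] - p[1], p, 0.0))
    # interfaces sit halfway between neighbouring levels
    p_interface = (p_ghosted[1:] + p_ghosted[:-1]) / 2
    # pressure decreases upward, so negate the difference to get positive thickness
    return -np.diff(p_interface)

p_demo = np.array([1000.0, 900.0, 700.0, 400.0, 100.0])  # hPa, hypothetical levels
dp_demo = layer_thickness(p_demo)
print("layer thickness:", dp_demo)

# hypothetical square response matrix, used only to show the broadcasting
lrf_demo = np.ones((p_demo.size, p_demo.size))
lrf_normalized = lrf_demo.T / dp_demo  # column k is divided by dp_demo[k]
```

Because the division broadcasts along the last axis, thin layers get their columns boosted relative to thick ones, which is what the plotting code relies on.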

In [None]:
lrf = get_lrf(mod, x, y)
plot_lrf(lrf, p, ['QT', 'SL', 'SHF', 'LHF'], ['Q1', 'Q2']);

These linear response functions are nuts! They don't look anything like Zhiming Kuang's.

The point that stands out in these plots is at a height of 19800 m, which corresponds to vertical index 28. Let's look at what the data is like around there.

In [None]:
def unprep(x):
    x_d = unstack_cat(x, "features").unstack("samples")
    # need to fix the fact that x_d['z'] has dtype object
    x_d['z'] = np.asarray(x_d['z'], dtype=float)
    return x_d

x_d = unprep(x)
y_d = unprep(y)

In [None]:
x_d.isel(z=28, time=-21).QT.plot()

In [None]:
x_d.QT.isel(time=-21, y=8).plot()

In [None]:
qt = x_d.QT.mean(['x', 'y', 'time'])
semilogx(qt, qt.z)
plt.plot(qt[28], qt.z[28], '*', markersize=20)

My guess is that 19800 m is the highest level at which $q_t$ is still dynamically influenced by, and correlated with, convection in some way. Because it is the highest such level, it is also the least moist, so the corresponding coefficients in the linear response function must be very large. I am not sure why Zhiming Kuang did not have this problem. Did he constrain his analysis to the troposphere?

Overall, this indicates that we need a more effective way to normalize the data.

# Plotting weighted by standard deviation of the inputs

The coefficient in the matrix is not, by itself, the important quantity. What matters is the coefficient times the typical deviation of the signal, so here I weight the matrix by the standard deviation of the input fields.
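A small synthetic example may help show why raw coefficients can mislead. Everything below is made up for illustration: two inputs with very different amplitudes contribute equally to the output, yet the fitted coefficient of the low-amplitude input is far larger. Multiplying each coefficient by the standard deviation of its input puts the two on a comparable footing.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# two hypothetical inputs: one with order-one amplitude, one that is tiny
x_big = rng.normal(scale=1.0, size=n)
x_small = rng.normal(scale=1e-4, size=n)
X_demo = np.column_stack([x_big, x_small])

# each input contributes a roughly unit-variance signal to the output
y_demo = 1.0 * x_big + 1e4 * x_small + 0.1 * rng.normal(size=n)

# ordinary least squares fit (the true model has no intercept)
coef, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

# response to a one-standard-deviation perturbation of each input
sig = X_demo.std(axis=0)
coef_sig_weighted = coef * sig

print("raw coefficients:     ", coef)
print("sigma-weighted coeffs:", coef_sig_weighted)
```

The same reasoning applies row by row to a full response matrix: a level where the input barely varies can carry a huge coefficient while contributing little to the predicted tendency.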

In [None]:
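def xarray_std_to_df(std):
    # stack the per-variable standard deviations into one "features" axis
    input_std = stack_cat(std, "features", ['z'])

    return pd.Series(input_std.data, index=input_std.indexes['features'])


def weight_lrf(lrf, sig):
    # scale each row of the LRF (one row per input feature) by that input's std
    sig = xarray_std_to_df(sig)
    sig_weighted_lrf = lrf.apply(lambda x: x*sig)
    return sig_weighted_lrf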

In [None]:

sig_weighted_lrf = weight_lrf(lrf, X.std(['x', 'y', 'time']))
plot_lrf(sig_weighted_lrf, p, ['QT', 'SL', 'SHF', 'LHF'], ['Q1', 'Q2']);

This shows that the top of the domain actually doesn't contribute much, but the signal in the lower part of the atmosphere is still very strange.

# Remove LHF and SHF from the analysis

It appears that the latent heat flux (LHF) and sensible heat flux (SHF) have very consistent structures.

**Are the QT and SL matrices above strange because the coherent signals are all captured by SHF and LHF?**

We can test this by performing an analysis with and without the heat fluxes.
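The logic of this with-and-without comparison can be sketched on synthetic data. In the made-up example below, a flux-like input `s` is partly determined by profile-like inputs `q`, so `s` alone explains some variance but adds little once `q` is included. The variable names, the data, and the `r2` helper are all hypothetical stand-ins for the notebook's `WeightedOutput` fits.

```python
import numpy as np

def r2(X, y):
    """Coefficient of determination of a least-squares fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - resid.var() / y.var()

rng = np.random.default_rng(1)
n = 2000
q = rng.normal(size=(n, 3))                  # hypothetical profile variables
s = q[:, 0] + 0.5 * rng.normal(size=n)       # hypothetical flux, correlated with q
y = q @ np.array([1.0, -0.5, 0.25]) + 0.3 * rng.normal(size=n)

score_s = r2(s[:, None], y)                  # flux only
score_q = r2(q, y)                           # profiles only
score_all = r2(np.column_stack([q, s]), y)   # both together

print(f"s only: {score_s:.3f}, q only: {score_q:.3f}, both: {score_all:.3f}")
```

If the combined score is barely above the profile-only score, the flux carries little information beyond what the profiles already provide, even though it has some skill on its own.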

## Q1, Q2 ~ LHF + SHF

In [None]:
from lib.models import weights_to_np, prepvar, WeightedOutput


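def fit_linear_response_model(mod=None, input_vars=['SHF', 'LHF']):
    # build a fresh estimator per call rather than sharing one default instance
    if mod is None:
        mod = LinearRegression()
    # X, Y and w are the datasets loaded at the top of the notebook
    x = prepvar(X[input_vars], sample_dims=['x', 'time', 'y'], feature_dims=[])

    y = prepvar(Y, sample_dims=['x', 'time', 'y'])

    mod = WeightedOutput(mod, weights_to_np(w, y.features))
    mod.fit(x, y)
    return x, y, mod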

In [None]:
xmat, ymat, mod = fit_linear_response_model(input_vars = ['LHF', 'SHF'])
lrf = get_lrf(mod, xmat, ymat)
mod.score(xmat, ymat)

In [None]:
lrf

The R2 is *much* lower now. Here are the corresponding linear response functions.

In [None]:
fig, axs = plt.subplots(2, 2, figsize=(3, 5))

for i, in_var in enumerate(['SHF', 'LHF']):
    for j, out_var in enumerate(['Q1', 'Q2']):
        resp = np.asarray(lrf.loc[in_var][out_var]).ravel()
        ax = axs[j, i]
        ax.plot(resp, p)
        ax.invert_yaxis()

        # label only the left column and the top row
        if i == 0:
            ax.set_ylabel(out_var)
        else:
            ax.set_yticks([])
        if j == 0:
            ax.set_title(in_var)

## Q1, Q2 ~ QT + SL

In [None]:
xmat, ymat, mod = fit_linear_response_model(input_vars = ['QT', 'SL'])
lrf = get_lrf(mod, xmat, ymat)
sig_weighted_lrf = weight_lrf(lrf, X[['QT', 'SL']].std(['x', 'y', 'time']))
mod.score(xmat, ymat)

It appears that adding SHF and LHF does not improve the skill by very much.

In [None]:
assert np.isnan(np.asarray(sig_weighted_lrf)).sum() == 0

In [None]:
plot_lrf(sig_weighted_lrf, p, ['QT', 'SL'], ['Q1', 'Q2'],
         width_ratios=[1,1]);
#          image_kwargs=dict(norm=SymLogNorm(5)));

The linear response functions appear visually identical to those from the fit that included SHF and LHF.

# Discussion

1. The linear response functions need to be weighted by sigma; otherwise the plots are visually dominated by spurious features. Why doesn't this show up in Kuang et al.?

2. The LRFs are actually quite skillful at predicting the data, but they do not look as good by the eyeball norm.

3. These linear response functions indicate that the boundary layer is effectively controlling convection in these simulations. 

## Addendum: Ridge Regression

In [None]:

def plot_ridge_alpha(mod):
    xmat, ymat, mod = fit_linear_response_model(mod=mod, input_vars = ['QT', 'SL'])
    lrf = get_lrf(mod, xmat, ymat)
    sig_weighted_lrf = weight_lrf(lrf, X[['QT', 'SL']].std(['x', 'y', 'time']))
    score = mod.score(xmat, ymat)
    plot_lrf(sig_weighted_lrf, p, ['QT', 'SL'], ['Q1', 'Q2'],
             width_ratios=[1,1]);
    print(f"model: {mod};\n R2={score}")

In [None]:
from sklearn.feature_selection import VarianceThreshold

In [None]:
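# NOTE: the `normalize` argument was removed from Ridge in scikit-learn 1.2,
# so this call needs an older release.
plot_ridge_alpha(make_pipeline(VarianceThreshold(.001), Ridge(0.01, normalize=True)))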

The R2 of this fit is only slightly different, but the answer is visually very different!
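One way to see how two fits can have nearly the same R2 and very different coefficients is a toy problem with strongly correlated inputs. The sketch below uses made-up data and a closed-form ridge solution written in NumPy (not scikit-learn's `Ridge`, whose scaling conventions differ); `ridge_fit`, `r2`, and the `alpha` values are illustrative choices, not the settings used above.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution without intercept: (X'X + alpha*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def r2(X, y, coef):
    """Coefficient of determination of the prediction X @ coef."""
    resid = y - X @ coef
    return 1.0 - np.sum(resid**2) / np.sum((y - y.mean())**2)

rng = np.random.default_rng(2)
n = 500
base = rng.normal(size=n)
# two nearly identical (collinear) hypothetical inputs
X_demo = np.column_stack([base, base + 1e-3 * rng.normal(size=n)])
y_demo = base + 0.5 * rng.normal(size=n)

coef_ols = ridge_fit(X_demo, y_demo, alpha=0.0)    # ordinary least squares
coef_ridge = ridge_fit(X_demo, y_demo, alpha=1.0)  # mild regularization

r2_ols = r2(X_demo, y_demo, coef_ols)
r2_ridge = r2(X_demo, y_demo, coef_ridge)

print("OLS coefficients:  ", coef_ols, " R2 =", r2_ols)
print("ridge coefficients:", coef_ridge, " R2 =", r2_ridge)
```

With collinear inputs, the least-squares problem barely constrains the difference between the two coefficients, so the unregularized solution can contain large values that nearly cancel. The ridge penalty removes that poorly constrained direction at almost no cost in R2, which is consistent with the behavior seen in the plots above.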