In this notebook, I compare the performance of two linear models for $Q_{1c}$ and $Q_2$, both of which have units of K/day. In both cases, the inputs are the concatenated profiles of $s_L$ (K) and $q_T$ (g/kg). This produces a matrix $X$ of inputs, where each row is a 68-dimensional vector for a given horizontal location and time point. Concatenating $Q_{1c}$ and $Q_2$ produces a matrix of outputs, $Y$, which also has 68 columns. For the tropics region of NGAqua, both $X$ and $Y$ have 821248 rows.

We will call the first linear regression **LR**. It consists of the following steps:
1. Discard any column which has a variance of less than 0.001. This removes the columns corresponding to the upper-atmospheric moisture field, which is essentially 0. Removing these columns is necessary to avoid an ill-conditioned linear regression.
2. Perform linear least squares regression (with a constant profile added).
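
The two **LR** steps above can be sketched in plain NumPy. This is only an illustration on synthetic data: the helper names `fit_lr` and `predict_lr` are hypothetical, and the actual object returned by `get_linear_model` may be built differently.

```python
import numpy as np

def fit_lr(X, Y, var_threshold=1e-3):
    """Least squares with an intercept, after dropping near-constant columns."""
    # Step 1: keep only the columns whose variance reaches the threshold
    keep = X.var(axis=0) >= var_threshold
    # Step 2: append a column of ones (the "constant profile") and solve
    A = np.column_stack([X[:, keep], np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
    return keep, coef

def predict_lr(X, keep, coef):
    """Apply the fitted coefficients to new inputs."""
    A = np.column_stack([X[:, keep], np.ones(len(X))])
    return A @ coef

# Synthetic stand-in for the real inputs (assumption: not NGAqua data)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
X[:, 3] *= 1e-3  # a nearly constant column, like the upper-level moisture
Y = X[:, :3] @ rng.normal(size=(3, 2)) + 1.0

keep, coef = fit_lr(X, Y)
Y_hat = predict_lr(X, keep, coef)
```

Dropping the nearly constant column before solving keeps the design matrix well conditioned, which is the motivation given in step 1.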

The next linear model we fit, which we call **MCR(n)**, prefilters the inputs $X$ by projecting them onto the first $n$ MCA modes. It consists of these steps:
1. Normalize each column by removing its mean and dividing by the mass-weighted vertical average of the standard deviation of the corresponding physical variable ($q_T$ or $s_L$).
2. Weight each column by the square root of the layer mass.
3. Perform MCA with $n$ modes, and compute the matrix $P_n$ which projects $X$ onto these $n$ modes.
4. Project the inputs onto the MCA modes.
5. Perform linear least squares regression (with a constant profile added) of the full outputs $Y$ onto the outputs of the previous step (an $821248\times n$ matrix).
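
Steps 3 to 5 can be sketched as follows. This sketch assumes that MCA here means taking the singular value decomposition of the cross-covariance matrix between inputs and outputs, and that the columns have already been normalized and mass-weighted (steps 1 and 2). The function names `fit_mcr` and `predict_mcr` are hypothetical and are not the API of `get_mca_mod`.

```python
import numpy as np

def fit_mcr(X, Y, n_modes):
    """Project X onto its leading MCA modes, then regress Y on the scores."""
    x_mean = X.mean(axis=0)
    Xc, Yc = X - x_mean, Y - Y.mean(axis=0)
    # Step 3 (assumed form of MCA): SVD of the input/output cross-covariance
    C = Xc.T @ Yc / (len(X) - 1)
    U, _, _ = np.linalg.svd(C, full_matrices=False)
    P = U[:, :n_modes]          # columns are the leading input modes
    # Step 4: project the centered inputs onto the retained modes
    scores = Xc @ P
    # Step 5: least squares of Y on the scores, with a constant column
    A = np.column_stack([scores, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
    return x_mean, P, coef

def predict_mcr(X, x_mean, P, coef):
    """Project new inputs onto the stored modes and apply the regression."""
    scores = (X - x_mean) @ P
    return np.column_stack([scores, np.ones(len(X))]) @ coef

# Synthetic stand-in data (assumption: not the NGAqua profiles)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
Y = X @ rng.normal(size=(6, 6)) + 0.1 * rng.normal(size=(400, 6))

# Training mean squared error when keeping 2 modes versus all 6
train_mse = {n: np.mean((predict_mcr(X, *fit_mcr(X, Y, n)) - Y) ** 2)
             for n in (2, 6)}
```

When all modes are kept, the projection is just a rotation of the inputs, so the fit coincides with ordinary least squares on the full inputs.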


In [None]:
import xarray as xr
import numpy as np
import pandas as pd

import joblib  # sklearn.externals.joblib has been removed from recent scikit-learn releases
from lib.models import get_linear_model, get_mca_mod
from lib.util import weighted_r2_score
import holoviews as hv
hv.extension('bokeh')

Load data and get the models

In [None]:
data = joblib.load("../data/ml/ngaqua/data.pkl")

_, weight_out = data['w']

x_train, y_train = data['train']
x_test, y_test = data['test']


# get objects for fitting linear and MCA models
lm = get_linear_model(data)
mcr  = get_mca_mod(data)

In [None]:
data['train'][0].shape

Define a function which computes the mass-weighted R2 of a model's predictions.

In [None]:
def score_model(mod, x, y):
    """Return the mass-weighted R2 of the predictions of `mod` for inputs `x`."""
    pred = mod.predict(x)
    return weighted_r2_score(y, pred, weight_out)
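
The implementation of `weighted_r2_score` in `lib.util` is not shown in this notebook. One plausible definition, given here only as an assumption, weights each output column's squared errors by its mass weight before forming the usual ratio of residual to total sums of squares. The name `weighted_r2_sketch` is hypothetical.

```python
import numpy as np

def weighted_r2_sketch(y_true, y_pred, weight):
    """Column-weighted R2: 1 - weighted SSE / weighted total sum of squares."""
    y_true, y_pred, weight = map(np.asarray, (y_true, y_pred, weight))
    # weight has one entry per output column and broadcasts over the rows
    sse = (weight * (y_true - y_pred) ** 2).sum()
    sst = (weight * (y_true - y_true.mean(axis=0)) ** 2).sum()
    return 1.0 - sse / sst

# Small synthetic check (assumption: random data, arbitrary weights)
rng = np.random.default_rng(0)
y_demo = rng.normal(size=(50, 3))
w_demo = np.array([1.0, 2.0, 3.0])
mean_pred = np.broadcast_to(y_demo.mean(axis=0), y_demo.shape)

r2_perfect = weighted_r2_sketch(y_demo, y_demo, w_demo)
r2_mean = weighted_r2_sketch(y_demo, mean_pred, w_demo)
```

Under this definition, a perfect prediction scores 1 and predicting each column's mean scores 0, as with the unweighted R2.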

The R2 of the linear model is 28%

In [None]:
lm.fit(x_train, y_train)
print("LM Cross Validation R2 = ",
      score_model(lm, x_test, y_test),
     "Training R2 = ", score_model(lm, x_train, y_train))

Let's look at the R2 of the maximum component regression for different numbers of components

In [None]:
for n in [1,2,5,10,20,30,40,50,60,68]:
    # set the number of components to keep
    mcr.mca.set_params(n_components=n)
    mcr.fit(x_train, y_train)
    input_var_explained = mcr.mca.explained_var_
    print(f"MCR(n_comp={n})\n",
          "\n % Input variance explained", input_var_explained,
          "\n Training R2", score_model(mcr, x_train, y_train),
          "\n Cross Validation R2 = ", score_model(mcr, x_test, y_test),"\n"*2
          )

As we retain more components, the fraction of the input variance explained by MCA goes from 0 to 1. Surprisingly, the R2 computed on the training and testing data increases only slowly. For example, the model with 20 MCA modes explains 0.98 of the input variance but only 0.20 of the output variance. **MCR** achieves its maximum cross-validation performance only when at least 50 modes are kept.

We can interpret this in two ways:

1. Modes which account for only 2% of the input variance are extremely important.
2. The relationship between the input modes and output modes is highly nonlinear, and adding more degrees of freedom helps ameliorate this nonlinearity.
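
The first interpretation can be illustrated with a toy example. This is a sketch on synthetic data, and it uses two raw input coordinates ordered by variance rather than actual MCA modes: the output depends only on the coordinate that carries about 2% of the input variance, so a regression on the dominant coordinate alone explains almost nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x1 = rng.normal(scale=np.sqrt(0.98), size=n)  # about 98% of the input variance
x2 = rng.normal(scale=np.sqrt(0.02), size=n)  # about 2% of the input variance
X = np.column_stack([x1, x2])
y = 5.0 * x2 + 0.01 * rng.normal(size=n)      # output driven by x2 only

def r2_of_fit(A, y):
    """R2 of a least squares fit of y on the columns of A plus an intercept."""
    A1 = np.column_stack([A, np.ones(len(A))])
    coef, *_ = np.linalg.lstsq(A1, y, rcond=None)
    resid = y - A1 @ coef
    return 1 - resid.var() / y.var()

r2_one = r2_of_fit(X[:, :1], y)  # keep only the high-variance coordinate
r2_two = r2_of_fit(X, y)         # keep both coordinates
```

Here `r2_one` is close to 0 and `r2_two` is close to 1, even though the first coordinate alone accounts for nearly all of the input variance.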



# Strongly nonlinear response of convection to the MCA modes



Maybe we can learn something by looking at scatter plots of the **MCR(2)** model's predictions versus the actual outputs.

In [None]:
from lib.plots.model_evaluation import scatter_plot_z
from lib.util import dict_to_xr, output_to_xr

In [None]:
mcr.mca.set_params(n_components=2)
mcr.fit(x_train, y_train)

pred = output_to_xr(mcr.predict(x_test), y_test.coords)
truth = output_to_xr(y_test, y_test.coords)
lr = output_to_xr(lm.predict(x_test), y_test.coords)



In [None]:
plot_data = dict_to_xr({'MCR(2)': pred, 'truth': truth, "LR": lr})
scatter_plot_z(plot_data.Q1c, "truth", ["MCR(2)", "LR"], "variable", engine='points')\
.redim.values(z=[6555])\
.layout("model")

As we can see, at z=6555 m, the relationship between the MCR(2) prediction of $Q_{1c}$ and the actual $Q_{1c}$ is much more nonlinear than for the LR prediction, but it seems we should still be able to fit it well using a nonlinear model. This fits with the notion that low-dimensional nonlinear systems can be embedded in higher-dimensional spaces, where linear models tend to perform better. It also makes sense physically, because convection probably responds more nonlinearly to column water vapor than to the humidity at a particular height.

## Training a single layer perceptron on the MCA scores

Let's see how a nonlinear model that uses the first two MCA modes as inputs performs.

In [None]:
from lib.torch_models import TorchRegressor, single_layer_perceptron
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

import torch
from torch.autograd import Variable
from torch.nn import MSELoss
mse = MSELoss()
w = Variable(torch.FloatTensor(weight_out))

def loss_function(output, y):
    return mse(output.mul(w.sqrt()), y.mul(w.sqrt()))

slp = TorchRegressor(single_layer_perceptron, loss_fn=loss_function,
                     optim_kwargs=dict(lr=.002),
                     num_epochs=4)
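
The `loss_function` above multiplies both the prediction and the target by `w.sqrt()` before taking the mean squared error. Because squaring the difference turns the square root back into `w`, this is the same as weighting each squared error by the layer mass weight. The NumPy check below uses random stand-in arrays in place of the torch tensors, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.uniform(0.5, 2.0, size=5)   # stand-in for weight_out
output = rng.normal(size=(8, 5))    # stand-in for the network output
y = rng.normal(size=(8, 5))         # stand-in for the targets

# Form used in loss_function: scale both arguments by sqrt(w), then take the MSE
scaled_mse = np.mean((output * np.sqrt(w) - y * np.sqrt(w)) ** 2)

# Equivalent form: weight each squared error by w, then average
weighted_mse = np.mean(w * (output - y) ** 2)
```

The two quantities agree up to floating point rounding, so the network is trained on a mass-weighted squared error.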

In [None]:
mca_mlp = get_mca_mod(data, mod=make_pipeline(StandardScaler(), slp))

# only train on a subset of the data
inds = np.random.choice(x_train.shape[0], 100000, replace=False)

mca_mlp.fit(x_train.data[inds], y_train.data[inds])
mca_mlp_pred = output_to_xr(mca_mlp.predict(x_test), y_test.coords)

Now let's plot the R2 value of the nonlinear regression on top of MCA.

In [None]:
%%opts Curve[invert_axes=True]
def r2_to_xr(truth, pred):
    """Compute the unweighted R2 at each height, pooling over x, y, and time."""
    dims = ['x', 'y', 'time']
    sse = ((pred - truth)**2).sum(dims)
    ss = ((truth.mean(dims) - truth)**2).sum(dims)
    return 1 - sse/ss

r2_plot_data = dict_to_xr({
    'mca_mlp': r2_to_xr(truth.Q1c, mca_mlp_pred.Q1c),
    'LR': r2_to_xr(truth.Q1c, lr.Q1c),
    'MCR': r2_to_xr(truth.Q1c, pred.Q1c)
}, dim_name="model")


hv.Dataset(r2_plot_data).to.curve("z").overlay("model").redim.label(Q1c="R2 of Q1c")

As we can see, all of the models perform much better in the middle of the atmosphere than elsewhere, and make large errors below 2000 m and in the stratosphere.

Also, the nonlinear MCA_MLP regression performs nearly as well as the linear model, which has many more inputs.

Now let's look at the scatter plot again.

In [None]:

plot_data = dict_to_xr({'MCR(2)_SLP':mca_mlp_pred, 'MCR(2)': pred, 'truth': truth, "LR": lr})
scatter_plot_z(plot_data.Q1c, "truth", ['MCR(2)_SLP', "MCR(2)", "LR"], "variable", engine='points')\
.layout("model").cols(2)