This notebook shows results for principal components regression (PCR) of the NGAqua data.

In [None]:
import xarray as xr
import numpy as np
import pandas as pd

from toolz import valmap
from toolz.curried import get

from tqdm import tqdm
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from lib.models import get_pcr_mod, mse, weighted_r2_score
from lib.util import dict_to_xr
import holoviews as hv
hv.extension('bokeh')

In [None]:
data = joblib.load("../data/ml/ngaqua/data.pkl")
ntrain = 10000

_, weight_out = data['w']

x_train, y_train = data['train']
x_test, y_test = data['test']

# training indices (random sample for speed)
train_inds = np.random.choice(x_train.shape[0], ntrain, replace=False)


def score_model(mod, x, y):
    pred = mod.predict(x)
    return weighted_r2_score(y, pred, weight_out)

Make the PCR model

In [None]:
pcr = get_pcr_mod(data)
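
`get_pcr_mod` lives in `lib.models` and is not shown in this notebook, so the following is only a minimal sketch of what a principal components regression does, not the project's implementation. It projects the centered inputs onto the leading principal directions and then fits ordinary least squares on those scores. The helper names `fit_pcr` and `predict_pcr` and the synthetic data are hypothetical.

```python
import numpy as np


def fit_pcr(x, y, n_components):
    """Fit a PCR model: PCA on x, then least squares on the retained scores."""
    x_mean = x.mean(axis=0)
    y_mean = y.mean(axis=0)
    # Rows of vt are the principal directions of the centered inputs.
    _, _, vt = np.linalg.svd(x - x_mean, full_matrices=False)
    components = vt[:n_components]
    scores = (x - x_mean) @ components.T
    # Linear map from the retained scores to the centered outputs.
    coef, *_ = np.linalg.lstsq(scores, y - y_mean, rcond=None)
    return x_mean, y_mean, components, coef


def predict_pcr(model, x):
    """Project x onto the retained components and apply the linear map."""
    x_mean, y_mean, components, coef = model
    return ((x - x_mean) @ components.T) @ coef + y_mean


# Synthetic stand-in data (an assumption; not the NGAqua data).
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 8))
true_coef = rng.normal(size=(8, 3))
y = x @ true_coef + 0.01 * rng.normal(size=(500, 3))

# In-sample mean squared error for an increasing number of components.
errors = {}
for n in [1, 4, 8]:
    model = fit_pcr(x, y, n)
    errors[n] = float(np.mean((predict_pcr(model, x) - y) ** 2))
```

Retaining all components makes PCR equivalent to ordinary least squares on the original inputs, so the number of components acts as the regularization knob that the loop in the next cell sweeps over.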

In [None]:

cv_data = {}


for n in tqdm([1,2,5,10,20,30,40,50,60,68]):
#     print(f"Fitting model for {n} components")
    # set the number of components to keep
    pcr.pca.set_params(n_components=n)
    pcr.fit(x_train[train_inds], y_train[train_inds])
    cross_val_mse = mse(pcr.predict(x_test), y_test, dims=['samples'])
    
    cv_data[n] = {'test_score': score_model(pcr, x_test, y_test),
                 'train_score': score_model(pcr, x_train, y_train),
                 'mse_profile': cross_val_mse}
    

In [None]:
%%opts NdOverlay[legend_position='top_left']
df = pd.DataFrame({'test': valmap(get('test_score'), cv_data),
                   'train': valmap(get('train_score'), cv_data)}).reset_index()
hv.Table(pd.melt(df, id_vars='index')).to.curve("index", "value").overlay()\
.redim.range(value=(0,.4))\
.redim.label(index="n", value="R2 score")


PCR performs poorly when only a small number of components is retained. How much of the input variance do the leading components explain?

In [None]:
hv.Curve(pcr.pca.explained_variance_ratio_.cumsum(), kdims=['n'], vdims=['Cumulative fraction of explained variance'])\
.redim.range(n=(0,20))

The first few modes explain much of the variance, so, as with MCA-based regression, the main problem is likely the nonlinear relationship between the modes and the outputs rather than the importance of extremely low-variance modes.
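
To make this reasoning concrete, here is a toy sketch with synthetic data (an assumption; it does not use the NGAqua data or `lib.models`). The leading principal component carries almost all of the input variance, yet a linear fit on it explains almost none of the output, because the output depends on that mode nonlinearly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# One dominant input mode z plus two weak noise directions.
z = rng.normal(size=n)
x = np.column_stack([3.0 * z,
                     0.1 * rng.normal(size=n),
                     0.1 * rng.normal(size=n)])

# The output depends on the dominant mode, but only nonlinearly.
y = z ** 2

# PCA via SVD of the centered inputs.
xc = x - x.mean(axis=0)
_, s, vt = np.linalg.svd(xc, full_matrices=False)
explained_ratio = s ** 2 / np.sum(s ** 2)

# Linear regression (with intercept) of y on the leading PC score.
score = xc @ vt[0]
design = np.column_stack([np.ones(n), score])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
resid = y - design @ coef
r2 = 1.0 - resid.var() / y.var()

print(f"variance explained by first mode: {explained_ratio[0]:.3f}")
print(f"R2 of linear fit on first mode:   {r2:.3f}")
```

In this toy setup, adding more components would not help either, since the remaining directions are pure noise. The limitation is the linear map itself.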

Now, what do the vertical structures of the errors look like? To find out, let's first collect all the MSE profiles into one data array.

In [None]:
cross_val_mse = valmap(get('mse_profile'), cv_data)

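# concatenate the per-n profiles along a new dimension "n"
# (explicit lists are passed because dict views are not sequences)
mse_data = xr.concat(
    list(cross_val_mse.values()),
    dim=pd.Index(list(cross_val_mse.keys()), name='n'),
).unstack("features")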
mse_data
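
The concatenation pattern used above can be sketched on toy profiles. Passing a named `pd.Index` as `dim` creates a new dimension and labels it at the same time. The values here are made up, and the toy arrays omit the stacked "features" index that the real profiles carry.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical error profiles on a shared vertical coordinate z,
# keyed by the number of retained components.
z = np.array([0.0, 1.0, 2.0])
profiles = {
    1: xr.DataArray([3.0, 2.0, 1.0], dims=["z"], coords={"z": z}),
    5: xr.DataArray([1.5, 1.0, 0.5], dims=["z"], coords={"z": z}),
}

# The named index becomes both the new dimension and its coordinate.
stacked = xr.concat(
    list(profiles.values()),
    dim=pd.Index(list(profiles.keys()), name="n"),
)
print(stacked)
```

Selecting a single run afterwards is then a label lookup, for example `stacked.sel(n=5)`.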

The plot below overlays the MSE profiles for all runs with up to 40 retained components.

In [None]:
%%opts Curve[invert_axes=True] NdOverlay[show_legend=False]
m = hv.Dataset(mse_data.sel(n=slice(0,40))).to.curve("z").overlay("n").layout("variable").redim.label(Q1c="MSE")
m

It seems that the errors are larger for Q1c than they are for Q2. Interestingly, including more components decreases the error in the troposphere, but does not decrease the error for Q1c in the stratosphere. This suggests that Q1c there is inherently unpredictable.