In this notebook, I will try the principal component binning approach to convective parametrization. This approach gives rise to a natural stochastic scheme, but for the time being, I will restrict my focus to how well the scheme describes the mean. If it cannot capture the mean, then the scheme is not doing a good job of capturing the distribution of Q1 and Q2.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
import numpy as np
# sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone package
import joblib
import xarray as xr

Load the data

In [None]:
data = joblib.load("../data/ml/ngaqua/data.pkl")
ntrain = 10000

scale_in, scale_out = data['scale']
weight_in, weight_out = data['w']
x_train, y_train = data['train']
x_test, y_test = data['test']

p = xr.open_dataset("../data/raw/ngaqua/stat.nc").p

This is how the mean prediction performs. Not well, obviously. It seems to have particular difficulty capturing the second baroclinic mode.

In [None]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.decomposition import PCA

# make a transformer which scales the data
input_scaler = FunctionTransformer(lambda x: x * np.sqrt(weight_in) / scale_in)


from lib.util import output_to_xr, dict_to_xr, swap_coord


def plot_preds(predictions):
    predictions = {k: output_to_xr(v, y_test.coords)
                   for k, v in predictions.items()}

    preds_xr = dict_to_xr(predictions, dim_name="model")\
    .pipe(lambda x: swap_coord(x, z=p))
    height = len(predictions) * 2
    axs = preds_xr.isel(x=0, y=8).Q1c.plot(col='model', col_wrap=1, cmap="inferno", vmin=-20, vmax=100,
                                           figsize=(8,height))
    plt.gca().invert_yaxis()
    

In [None]:

preprocessor = make_pipeline(input_scaler, PCA(n_components=2, whiten=True))

mod = make_pipeline(preprocessor, KNeighborsRegressor(20))
mod.fit(x_train, y_train)

y_pred = mod.predict(x_test)

In [None]:

predictions = {'true': y_test, 'pca(2) | knn(n=20)': y_pred}
    
plot_preds(predictions)

We can see that the mean of the k nearest neighbors method does not perform particularly well, and mostly just captures the first baroclinic heating mode. What is its R2 score?

In [None]:
from lib.models import weighted_r2_score

In [None]:
weighted_r2_score(y_test, y_pred, weight=weight_out)

As we can see, this performs much worse than the neural network or even the linear model. The R2 score on the testing data is negative. This is probably because the predicted heating is much too smooth in time. Maybe a binning approach will work better.
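
To make the meaning of a negative score concrete, here is a minimal sketch of a weighted R2. The implementation of `weighted_r2_score` in `lib.models` is not shown in this notebook, so the function name `weighted_r2_sketch` and the choice of one weight per output column are assumptions on my part. With this definition, predicting the mean gives a score of zero, and anything worse than that is negative.

```python
import numpy as np


def weighted_r2_sketch(y_true, y_pred, weight):
    """Hypothetical weighted R2: 1 - weighted SSE / weighted total variance.

    `weight` holds one entry per output column (an assumption).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    weight = np.asarray(weight, dtype=float)

    # residual and total sums of squares for each output column
    ss_res = ((y_true - y_pred) ** 2).sum(axis=0)
    ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)

    return 1.0 - (weight * ss_res).sum() / (weight * ss_tot).sum()


rng = np.random.RandomState(0)
y = rng.randn(500, 3)                 # synthetic targets, not the notebook data
w = np.array([1.0, 2.0, 0.5])

r2_perfect = weighted_r2_sketch(y, y, w)
r2_mean = weighted_r2_sketch(y, np.tile(y.mean(axis=0), (500, 1)), w)
r2_bad = weighted_r2_sketch(y, -y, w)  # anti-correlated prediction
print(r2_perfect, r2_mean, r2_bad)
```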

# Binning approach

First we need to decide what bins we should be using.

In [None]:
x_scores = preprocessor.transform(x_train)

In [None]:
plt.hexbin(x_scores[:,0], x_scores[:,1])

This hexbin plot shows that we should maybe lay down a uniform grid with size 0.1 in PC1 and PC2.
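
The choice of bin edges matters for how many samples land in each bin. The sketch below compares a uniform grid of spacing 0.1 with percentile-based edges, using `np.digitize` to assign bins. The array `scores` is a synthetic stand-in for one whitened PC score (an assumption), not the notebook data.

```python
import numpy as np

rng = np.random.RandomState(0)
scores = rng.randn(10000)  # stand-in for one whitened PC score (assumption)

# uniform grid with spacing 0.1 covering the data range
uniform_edges = np.arange(np.floor(scores.min()), np.ceil(scores.max()) + 0.1, 0.1)

# percentile-based edges: roughly the same number of samples in every bin
percentile_edges = np.percentile(scores, np.arange(0.5, 100, 0.5))

uniform_counts = np.bincount(np.digitize(scores, uniform_edges),
                             minlength=len(uniform_edges) + 1)
percentile_counts = np.bincount(np.digitize(scores, percentile_edges),
                                minlength=len(percentile_edges) + 1)

# the uniform grid leaves empty or sparse bins in the tails,
# while each percentile bin holds about 0.5% of the samples
print(uniform_counts.min(), uniform_counts.max())
print(percentile_counts.min(), percentile_counts.max())
```

Sparse bins give noisy bin averages, which is one reason to prefer percentile edges when the scores are far from uniformly distributed.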

In [None]:
from collections import defaultdict
from toolz import valmap
from functools import reduce
from sklearn.base import BaseEstimator, RegressorMixin

class Binner2D(BaseEstimator, RegressorMixin):
    """Regressor which returns the average value over a grid in the input features
    
    This is very similar to a nearest neighbors lookup in theory.
    
    If input data falls in a bin that was never observed in the training dataset,
    the average over all training samples in the same bin of the first column is returned.
    """
    
    def columns_bins(self, x):
        xmin, xmax = x.min(), x.max()
#         this failed
#         return np.linspace(xmin, xmax, 200)    

        return np.percentile(x, np.arange(.5,100,.5))
        
    def get_bin_membership(self, x):
        # build a list: recent NumPy versions reject generators in stacking functions
        return np.column_stack([np.digitize(x[:,i], self.bin_grids_[i])
                                for i in range(x.shape[1])])
    
    
    def fit(self, x, y=None):
        x, y = np.asarray(x), np.asarray(y)
        self.bin_grids_ = [self.columns_bins(x[:,i]) for i in range(x.shape[1])]
        
        self.bin_membership_ = self.get_bin_membership(x)
        
        # store indexes in a dict
        # this provides very fast lookup
        self.bins_ = defaultdict(list)
        for i in range(x.shape[0]):
            memb = tuple(self.bin_membership_[i].flat)
            self.bins_[memb].append(i)
            
            
        # take the mean of the output in each bin
        self.bin_out_avg_ = valmap(lambda inds: y[inds].mean(axis=0), self.bins_)
        self.bin_counts_ = valmap(len, self.bins_)
        
        
        # Take average over PC1
        # This will be used as the output for the empty bins
        pc1_avg = {}
        # np.digitize returns indices 0..len(bins), so include the last index too
        for i in range(len(self.bin_grids_[0]) + 1):
            non_empty_bins = []
            for k, v in self.bins_.items():
                if k[0] == i:
                    non_empty_bins.extend(v)
                
            pc1_avg[i] = y[non_empty_bins].mean(axis=0)
            
            
        self.pc1_avg = pc1_avg
             
        
        return self
    
    
    def query_avg(self, memb):
        memb = tuple(memb.flat)
        
        if memb in self.bin_out_avg_:
            return self.bin_out_avg_[memb]
        else:
            return self.pc1_avg[memb[0]]
    
    def predict(self, x):
        membership = self.get_bin_membership(x)
     
        return np.vstack([self.query_avg(memb)[None,:] for memb in membership])

    

In [None]:
bin2d = make_pipeline(input_scaler, PCA(n_components=2), Binner2D())
bin2d.fit(x_train, y_train)

In [None]:
predictions['pca(2) | bin2d'] = bin2d.predict(x_test)

In [None]:
plot_preds(predictions)

As we can see, the bin averaging gives very similar results to k nearest neighbors. Of the two, k nearest neighbors is simpler to implement. It is also easier to use k nearest neighbors for data outside of the training dataset, because it will always return the nearest neighbors. On the other hand, the binning approach requires more ad hoc approaches when the testing data is far from the training data.
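
The difference in out-of-sample behavior can be seen in a toy example. This is a sketch on synthetic data: `knn_mean` is a brute-force NumPy stand-in for `KNeighborsRegressor`, and the dictionary `table` is a simplified stand-in for the lookup inside `Binner2D`. All names here are hypothetical.

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.uniform(0, 1, size=(200, 2))   # synthetic 2D features (assumption)
y = x.sum(axis=1)                      # synthetic target (assumption)
far_point = np.array([10.0, 10.0])     # well outside the training range


def knn_mean(point, k=5):
    """Brute-force k nearest neighbors average."""
    dist = np.linalg.norm(x - point, axis=1)
    nearest = np.argsort(dist)[:k]
    return y[nearest].mean()


# nearest neighbors always returns something, even far from the data
knn_pred = knn_mean(far_point)

# bin lookup: a table keyed by bin indices has no entry for unseen bins
edges = np.linspace(0, 1, 11)
table = {}
for row, target in zip(x, y):
    key = tuple(int(b) for b in np.digitize(row, edges))
    table.setdefault(key, []).append(target)

far_key = tuple(int(b) for b in np.digitize(far_point, edges))
found = far_key in table   # False here, so a fallback rule is needed
```

In `Binner2D` the fallback rule is the average over the matching PC1 bin, which is exactly the kind of ad hoc choice that nearest neighbors avoids.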

In summary, it is not clear whether this method failed because the principal component basis is poor. If so, the performance could improve with a better basis for the inputs (e.g. MCA). Let's see if this is the case.

## MCA KNN regression

In [None]:
from lib.mca import MCARegression

In [None]:
mca_scale = [np.sqrt(w)/sc for w, sc in zip(data['w'], data['scale'])]


mod = make_pipeline(StandardScaler(), KNeighborsRegressor(20))
mca_knn = MCARegression(mod=mod, scale=mca_scale, n_components=2)
mca_knn.fit(x_train, y_train)


This class uses far too much memory to process all the testing data in one go, so I make the predictions for different slices, and then concatenate these predictions.

In [None]:
def split_slices(n, k):
    """python generator for splitting an array into chunks of size k"""

    starts = range(0, n, k)

    for start in starts:
        end = min(start + k, n)
        yield slice(start, end)



n = x_test.shape[0]
output = np.vstack([mca_knn.predict(x_test[sl]) for sl in split_slices(n, 1000)])

predictions['mca(2) | knn(20)'] = output

What is the R2 of the MCA scheme?

In [None]:
weighted_r2_score(y_test, output, weight=weight_out)

Okay, at least the quantity is positive, but it is still nowhere near as good as the neural networks, which were getting a cross-validation R2 of around 0.4 to 0.5.

In [None]:
plot_preds(predictions)

The predicted heating is much more concentrated over the times when the heating is actually positive. Overall, none of these techniques perform well when using the bin average or k nearest neighbors approach, so I am not hopeful that the averages will be much better. All in all, I think the single layer perceptron is the most viable model. Perhaps we could describe the stochasticity of the residual from that model using an approach like the one here.
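
One way such a residual scheme might look is sketched below: store the training residuals by bin, then draw one at random from the bin that a new input falls into. Everything here is an assumption for illustration. The data are synthetic, a single input score is binned instead of two, and `sample_residual` is a hypothetical helper that does not exist in `lib`.

```python
import numpy as np
from collections import defaultdict

rng = np.random.RandomState(0)

# synthetic stand-ins (assumptions): one input score and a residual
# whose spread grows with the score
score = rng.uniform(0, 1, size=5000)
residual = rng.randn(5000) * score

# group the training residuals by bin of the score
edges = np.linspace(0, 1, 11)[1:-1]   # 9 interior edges, hence 10 bins
membership = np.digitize(score, edges)
bins = defaultdict(list)
for i, b in enumerate(membership):
    bins[int(b)].append(residual[i])


def sample_residual(new_score):
    """Draw one stored residual at random from the bin containing new_score."""
    b = int(np.digitize(new_score, edges))
    return rng.choice(bins[b])


low = np.array([sample_residual(0.05) for _ in range(2000)])
high = np.array([sample_residual(0.95) for _ in range(2000)])

# the sampled residuals inherit the bin-dependent spread
print(low.std(), high.std())
```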

# Summary

1. The first 2 MCA modes give much better predictive value than the first 2 PCA modes.
2. KNearestNeighbors and two-dimensional binning give similar answers.
3. These non-parametric (i.e. lookup-table-like) methods perform substantially worse than the parametric neural network approaches, and are more expensive during the prediction phase.