In this notebook I explore algorithms for clustering the samples, which I can ultimately use to develop stochastic convective parameterizations.

I can model the transitions between clusters using a Markov process, just as Eurika Kaiser does.
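
As a minimal sketch of the Markov idea, the transition probabilities between clusters can be estimated by counting consecutive pairs of cluster labels. The function name `transition_matrix` and the toy label sequence are hypothetical, and the sketch assumes the labels come from a single location and are ordered in time, which the sample array in this notebook does not guarantee.

```python
import numpy as np

def transition_matrix(labels, n_states):
    """Estimate a row-stochastic transition matrix from a label sequence."""
    labels = np.asarray(labels)
    counts = np.zeros((n_states, n_states))
    # Count transitions from labels[t] to labels[t + 1]
    np.add.at(counts, (labels[:-1], labels[1:]), 1)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows for states that are never left stay all zero
    return np.divide(counts, row_sums,
                     out=np.zeros_like(counts), where=row_sums > 0)

# Hypothetical label sequence for one grid point, ordered in time
labels = np.array([0, 0, 1, 2, 1, 0, 0, 2])
P = transition_matrix(labels, n_states=3)
```

Entry `P[i, j]` is the estimated probability of moving from cluster `i` to cluster `j` in one time step.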

In [None]:
%matplotlib inline

In [None]:
import xarray as xr
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt
from matplotlib.colors import SymLogNorm

from sklearn.cluster import KMeans

In [None]:
def rms(x, axis=-1):
    return np.sqrt((x**2).mean(axis=axis))

In [None]:
import holoviews as hv
hv.extension('bokeh')

In [None]:
data_dict = joblib.load("../data/ml/ngaqua/data.pkl")
data_dict.keys()

In [None]:
def get_inputs(data):
    return data['train'][1]/data['scale'][1] * np.sqrt(data['w'][1])

In [None]:
y = get_inputs(data_dict)
y.shape

# K-Means clustering

Clustering is slow on the full dataset, so we fit on a random subset of 10,000 samples.

In [None]:
np.random.seed(30)
idx = np.random.choice(y.shape[0], 10000, replace=False)

In [None]:
kmeans = KMeans(n_clusters=90).fit(y[idx])

In [None]:
clusters = pd.DataFrame(kmeans.cluster_centers_.T,
                        index=y.indexes['features'])

clusters.head()

Let's plot all the cluster centers.

In [None]:
%%opts Curve[invert_axes=True]{+framewise}

plotme = pd.melt(clusters.reset_index(), id_vars=["variable", "z"], var_name="mode")

tab = hv.Table(plotme, kdims=["variable", "z", "mode"], vdims=["value"])
tab.to.curve("z")

In [None]:
y_pred = np.take(kmeans.cluster_centers_, kmeans.predict(y), axis=0)
r2_score(y, y_pred)

Many of the clusters in k-means are just scaled versions of each other.
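
One way to check this claim is to compare the directions of the cluster centers: two centers that differ only by a positive scale factor have a cosine similarity of 1. The helper `cosine_similarity_matrix` and the toy centers below are hypothetical; in this notebook the input would be `kmeans.cluster_centers_`.

```python
import numpy as np

def cosine_similarity_matrix(centers):
    """Pairwise cosine similarity between the rows of `centers`."""
    norms = np.linalg.norm(centers, axis=1, keepdims=True)
    # Avoid dividing by zero for an all-zero center
    unit = centers / np.where(norms > 0, norms, 1.0)
    return unit @ unit.T

# Toy centers: the second row is 3x the first, the third is orthogonal to both
centers = np.array([[1.0, 2.0, 0.0],
                    [3.0, 6.0, 0.0],
                    [0.0, 0.0, 1.0]])
sim = cosine_similarity_matrix(centers)
```

Off-diagonal entries of `sim` close to 1 flag pairs of centers that share a shape and differ mainly in amplitude.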

Let's look at the number of members in each cluster.

In [None]:
bincounts = np.bincount(kmeans.predict(y))
hv.Curve(np.sort(bincounts)[::-1], kdims="cluster number", vdims=["population"])

There is no clear cutoff here. 

## Weighted clustering

In the simulation, only a few sample points have very strong precipitation, but they contribute strongly to the total amount of precipitation.

If we weight the samples with the strength of precipitation, the k-means algorithm will focus more on the events with large precipitation. 
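
For comparison, scikit-learn's `KMeans.fit` also accepts a `sample_weight` argument, which weights each sample's contribution to the objective without changing the samples themselves. The sketch below uses synthetic data as a stand-in for `y`; the array sizes, the number of clusters, and the random seeds are arbitrary assumptions. The cells that follow take a different route and rescale the samples by a power of the weights before fitting.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for the scaled samples: many weak points, a few strong ones
weak = rng.normal(0.0, 0.1, size=(500, 4))
strong = rng.normal(5.0, 0.5, size=(20, 4))
X = np.vstack([weak, strong])

# RMS amplitude of each sample, normalized to unit mean, used as the weight
w = np.sqrt((X**2).mean(axis=1))
w /= w.mean()

# Strong samples count more in the k-means objective
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X, sample_weight=w)
labels = km.predict(X)
```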

In [None]:
sample_weights = rms(y.data, axis=1).compute()
# sample_weights = np.abs(np.sum(y.data, axis=1)).compute()

sample_weights /= sample_weights.mean()
# Column vector so the weights broadcast across features
sample_weights = sample_weights.reshape(-1, 1)

In [None]:
kmeans_w = KMeans(30).fit(y[idx]*sample_weights[idx]**.1)

In [None]:

hv.Curve(np.sort(rms(kmeans_w.cluster_centers_)), label="Weighted",
         kdims=["cluster label"], vdims=["RMS"]) \
*hv.Curve(np.sort(rms(kmeans.cluster_centers_)), label="Not Weighted",
        extents=(0,None,30,None))

What is the population of these RMS-weighted clusters?

In [None]:
rms_cent = rms(kmeans_w.cluster_centers_)
# Predict on the same scaled inputs that kmeans_w was fit on
pop = np.bincount(kmeans_w.predict(y * sample_weights**.1),
                  minlength=kmeans_w.n_clusters)
df = pd.DataFrame({'pop': pop, 'rms': rms_cent})

In [None]:
%%opts Scatter[logy=True width=500]
hv.Scatter(df)

## Other clustering algorithms

Scikit-learn's affinity propagation is too slow even for relatively small numbers of samples. DBSCAN probably runs very slowly as well.

# Feature selection

Let's use orthogonal matching pursuit and Lasso to help make sparse representations of the data.

Let's use the principal components (PCs) as a dictionary.
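
The orthogonal matching pursuit variant is not run in this notebook, so here is a minimal sketch of what it could look like, using synthetic data as a stand-in for `y`. The array sizes, the number of components, and the choice `n_nonzero_coefs=2` are arbitrary assumptions. Each output feature is regressed on the whitened PCs and allowed to use at most two of them.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
# Synthetic stand-in for y: 200 samples, 12 features driven by 5 latent factors
latent = rng.normal(size=(200, 5))
Y = latent @ rng.normal(size=(5, 12)) + 0.01 * rng.normal(size=(200, 12))

# Whitened PC time series act as the dictionary
pcs = PCA(n_components=5, whiten=True).fit_transform(Y)

# Allow at most 2 PCs per output feature
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=2).fit(pcs, Y)

# coef_ has shape (n_features_of_Y, n_components)
nonzero_per_feature = (omp.coef_ != 0).sum(axis=1)
```

Unlike Lasso, which controls sparsity indirectly through the penalty strength, this fixes the number of active PCs per feature directly.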

In [None]:
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso, OrthogonalMatchingPursuit

In [None]:
y_pcs = PCA(n_components=30, whiten=True).fit_transform(y)

In [None]:
lasso = Lasso(.2).fit(y_pcs[idx], y[idx])

In [None]:
%%opts Curve[invert_axes=True]{+framewise}
coef = pd.DataFrame(lasso.coef_,
                    index=y.indexes['features'])

plotme = pd.melt(coef.reset_index(), id_vars=["variable", "z"], var_name="mode")

tab = hv.Table(plotme, kdims=["variable", "z", "mode"], vdims=["value"])
tab.to.curve("z")

We can see that these components are now sparsified.