Spectral Clustering Example.

The image loaded here is a cropped portion of the ``MERCATOR_LC80210392016114LGN00_B10.TIF`` LANDSAT image included as a public [datashader example](http://datashader.org/topics/landsat.html).

In addition to `dask-ml`, we'll use `intake` and `rasterio` to read the data and `holoviews` to plot the figures.
I'm just working on my laptop, so we could use either the threaded or distributed scheduler. I'll use the distributed scheduler for the diagnostics.

In [None]:
import rasterio
import holoviews as hv
from holoviews.operation.datashader import regrid
import dask.array as da
from dask_ml.cluster import SpectralClustering
from dask.distributed import Client
hv.extension('bokeh')

In [None]:
import intake
cat = intake.open_catalog('../catalog.yml')
list(cat)

In [None]:
xa = cat.landsat_sample.read_chunked()
#xa = cat.midwest_mosaic.read_chunked()
xa = xa.squeeze(dim='concat_dim', drop=True)[0]
xa

In [None]:
# Rescale for the clustering algorithm
xa = xa.astype(float)
xa = (xa - xa.mean()) / xa.std()
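
The rescaling above is a plain z-score: subtract the mean, divide by the standard deviation. As a minimal sketch of what that does, here is the same arithmetic on a tiny NumPy array. The `band` values are made up for illustration and are not taken from the Landsat image.

```python
import numpy as np

# Hypothetical pixel values standing in for one raster band
band = np.array([[10.0, 20.0], [30.0, 40.0]])

# Same arithmetic as the notebook cell: center on the mean, scale by the std
scaled = (band - band.mean()) / band.std()

print(scaled.mean())  # approximately 0
print(scaled.std())   # approximately 1
```

After rescaling, the values have mean 0 and standard deviation 1 (up to floating-point error), so the distances used for clustering do not depend on the band's original units.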

In [None]:
%%opts Image (cmap='viridis')
regrid(hv.Image(xa.T))

In [None]:
flat_input = xa.stack(z=('y', 'x'))
flat_input

In [None]:
client = Client(processes=False)
client

We'll reshape the image into the layout that dask-ml and scikit-learn expect: `(n_samples, n_features)`, where `n_features` is 1 in this case. Then we'll persist it in memory. The dataset is still small at this point. The large dataset, which dask helps us manage, is the intermediate `n_samples x n_samples` array that spectral clustering operates on. For our 2,500 x 2,500 pixel subset, that's ~50

In [None]:
X = flat_input.expand_dims(dim='e', axis=1).values.astype('float')
X.shape
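
To see why the `n_samples x n_samples` array is the expensive part, here is a rough NumPy sketch on a toy image. It assumes an RBF affinity with a made-up `gamma` and builds the full matrix densely. That is only meant to show the quadratic growth, and is not a description of how dask-ml computes things internally.

```python
import numpy as np

# Hypothetical 4 x 5 "image" standing in for the real raster
image = np.arange(20, dtype=float).reshape(4, 5)

# Flatten to (n_samples, n_features), with a single feature per pixel
X_small = image.reshape(-1, 1)

# Dense RBF affinity: exp(-gamma * squared distance) for every pixel pair.
# Broadcasting (n, 1) against (1, n) works here because there is one feature.
gamma = 1.0  # assumed value, for illustration only
sq_dists = (X_small - X_small.T) ** 2
affinity = np.exp(-gamma * sq_dists)

print(X_small.shape)   # (20, 1)
print(affinity.shape)  # (20, 20)
```

Twenty pixels already give a 20 x 20 matrix. The number of entries grows with the square of the pixel count, which is why the intermediate array dwarfs the input.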

In [None]:
X = da.from_array(X, chunks=100_000)
X = client.persist(X)

And we'll fit the estimator.

In [None]:
clf = SpectralClustering(n_clusters=4, random_state=0,
                         gamma=None,
                         kmeans_params={'init_max_iter': 5},
                         persist_embedding=True)

In [None]:
%time clf.fit(X)

In [None]:
labels = clf.assign_labels_.labels_.compute()
labels.shape

In [None]:
def unstack_output(flat_input, output_values):
    """Unstack this output into the original input shape
    
    Parameters
    ----------
    flat_input : DataArray
        Flattened DataArray used as the input to the ML pipeline
    output_values : np.ndarray
        Output values from the ML pipeline, with the same shape as `flat_input`

    Returns
    -------
    output : DataArray
        DataArray with the same shape as the original input
    """
    dims = flat_input.dims
    output = flat_input.copy()
    output.values = output_values
    
    for dim in dims:
        output = output.unstack(dim=dim)
        
    return output

In [None]:
output = unstack_output(flat_input, labels)
output
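
The idea behind `unstack_output` is a round trip: flatten the image to one value per pixel, compute one label per pixel, then put the labels back on the original grid. Here is a NumPy-only analogue of that round trip, with a thresholding step standing in for the clustering. The array and the threshold are hypothetical, and this sketch does not carry coordinates the way the xarray version does.

```python
import numpy as np

# Hypothetical 3 x 4 image
image = np.arange(12, dtype=float).reshape(3, 4)

# Flatten in row-major order: one entry per pixel
flat = image.reshape(-1)

# Stand-in for cluster labels: one integer per pixel
labels_small = (flat > 5).astype(int)

# Restore the original 2-D layout
label_image = labels_small.reshape(image.shape)

print(label_image.shape)  # (3, 4)
```

Because the flatten and the reshape use the same ordering, each label lands back on the pixel it was computed from.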

In [None]:
%%opts Image (cmap='viridis')
hv.Image(xa).relabel('Image') + hv.Image(output).relabel('Clustered')