# K-Means Clustering on Temperature PDFs

This notebook shows how `elm` (Ensemble Learning Models) can be used for ensemble approaches to K-Means. A `Pipeline` of preprocessors, normalizers and transformers runs before K-Means, and the input is loaded from a multi-file NetCDF dataset.

The ensemble is fit to spatial time series arrays of temperature.  The approach is based on the clustering of log probabilities of temperature anomalies in Loikith et al. (2013):

```text
Classifying reanalysis surface temperature probability density functions (PDFs) over North America with cluster analysis
P. C. Loikith, B. R. Lintner, J. Kim, H. Lee, J. D. Neelin, and D. E. Waliser

GEOPHYSICAL RESEARCH LETTERS, VOL. 40, 3710–3714, doi:10.1002/grl.50688, 2013
```

### Setup with `xarray`, `numpy` and `matplotlib` imports

In [None]:
%matplotlib inline
import glob
import os
import random
import copy
import sys

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import xarray as xr

### Configure `dask-distributed` scheduler from environment

In [None]:
DASK_SCHEDULER = os.environ.get('DASK_SCHEDULER', None)

In [None]:
from distributed import Client
from distributed import local_client
import dask

client = Client(DASK_SCHEDULER) if DASK_SCHEDULER else Client()

### Download the MERRA2 data source

This is a global dataset of temperature and other weather variables from the MERRA2 project.  If you want to run the notebook locally and have not yet downloaded the MERRA2 data it uses, create a bash script like the following one to download the data set from [http://goldsmr4.gesdisc.eosdis.nasa.gov](http://goldsmr4.gesdisc.eosdis.nasa.gov):

```bash
SCRIPT_DIR=`dirname $(readlink -e "$0")`
SPACIAL_URLS_FILE="$SCRIPT_DIR/MERRA2_M2T1NXSLV_west_pacific_urls.dat"
COOKIE_FILE="$HOME/.urs_cookies"
DEST_DIR="/mnt/efs"
USAGE="$0:
    Data downloader for NASA SBIR project.

usage: $0 [-h] [--cookie-file COOKIE_FILE] [--dest-dir DEST_DIR] [--spacial-urls SPACIAL_URLS_FILE]

arguments:
    -h, --help: show this help message and exit
    -c, --cookie-file: (Optional) path to cookie file. Default: $HOME/.urs_cookies
    -d, --dest-dir: (Optional) path to destination (download) directory. Default: /mnt/efs
    -s, --spacial-urls: (Optional) path to data file containing spacial data URLs. Default: $SPACIAL_URLS_FILE
"
while [ $# -gt 0 ]; do
    case "$1" in
        -c|--cookie-file)
            COOKIE_FILE="$2"
            shift; shift
            ;;
        -d|--dest-dir)
            DEST_DIR="$2"
            shift; shift
            ;;
        -s|--spacial-urls)
            SPACIAL_URLS_FILE="$2"
            shift; shift
            ;;
        -h|--help)
            echo "$USAGE"
            exit 0
            ;;
        *)
            echo "Error: Invalid option \"$1\""
            echo "$USAGE"
            exit 1
            ;;
    esac
done
[ ! -f "$COOKIE_FILE" ] && echo "Error: Non-existent cookie file provided" && echo "$USAGE" && exit 1
[ ! -d "$DEST_DIR" ] && echo "Error: Non-existent destination dir provided" && echo "$USAGE" && exit 1
[ ! -f "$SPACIAL_URLS_FILE" ] && echo "Error: Non-existent spacial URLs file provided" && echo "$USAGE" && exit 1

echo COOKIE_FILE = "$COOKIE_FILE"
echo DEST_DIR = "$DEST_DIR"
echo SPACIAL_URLS_FILE = "$SPACIAL_URLS_FILE"
echo

## Time period: Download the July and August data for all years
##
## Total wall clock time: 39m 4s
## Downloaded: 2773 files, 3.5G in 26m 56s (2.23 MB/s)
## real 39m4.178s
## user 0m5.155s
## sys  0m21.308s
time sudo wget \
    --no-verbose \
    --load-cookies "$COOKIE_FILE" \
    --save-cookies "$COOKIE_FILE" \
    --auth-no-challenge=on \
    --keep-session-cookies \
    --recursive \
    --continue \
    --no-parent \
    --no-clobber \
    --relative \
    --accept '*0[78][0-9][0-9].nc4' \
    --directory-prefix="$DEST_DIR" \
    "http://goldsmr4.gesdisc.eosdis.nasa.gov/data/MERRA2/M2SDNXSLV.5.12.4/"

## Total wall clock time: 3m 25s
## Downloaded: 113 files, 2.0G in 2m 39s (12.7 MB/s)
## real 3m24.644s
## user 0m1.314s
## sys  0m11.405s
time sudo wget \
    --no-verbose \
    --load-cookies "$COOKIE_FILE" \
    --save-cookies "$COOKIE_FILE" \
    --auth-no-challenge=on \
    --keep-session-cookies \
    --recursive \
    --continue \
    --no-parent \
    --no-clobber \
    --relative \
    --accept '*0[78].nc4' \
    --directory-prefix="$DEST_DIR" \
    "http://goldsmr4.gesdisc.eosdis.nasa.gov/data/MERRA2_MONTHLY/M2IMNXASM.5.12.4/"
```

### Some constants related to the data source

*You will likely need to change the `MONTHLY_PATTERN` and `PATTERN` paths below if running the notebook locally*

In [None]:
FIRST_YEAR, LAST_YEAR = 1980, 2015 # the time domain of the input data

# Glob file matching patterns
PATTERN = '/mnt/efs/goldsmr4.gesdisc.eosdis.nasa.gov/data/MERRA2/M2SDNXSLV.5.12.4/{:04d}/{:02d}/*.nc4'

MONTHLY_PATTERN = '/mnt/efs/goldsmr4.gesdisc.eosdis.nasa.gov/data/MERRA2_MONTHLY/M2IMNXASM.5.12.4/*/MERRA2_100.instM_2d_asm_Nx.*{:02d}.nc4'

# The name of the NetCDF variable (xarray.DataArray) we will use
TEMP_BAND = 'T2MMEAN'

MONTH = 7 # just working with July

YEARS = range(FIRST_YEAR, LAST_YEAR + 1)
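
As a quick illustration of how the glob patterns above are filled in, the following sketch formats a shortened pattern for July of each year. The `DATA_ROOT` and `EXAMPLE_PATTERN` names and the shortened path are assumptions made for this example only; the real notebook uses `PATTERN`.

```python
# Hypothetical shortened stand-in for PATTERN; the real root directory is set above.
DATA_ROOT = '/mnt/efs/example'
EXAMPLE_PATTERN = DATA_ROOT + '/{:04d}/{:02d}/*.nc4'

FIRST_YEAR, LAST_YEAR = 1980, 2015
MONTH = 7

# One glob pattern per year: the year is zero-padded to 4 digits, the month to 2.
patterns = [EXAMPLE_PATTERN.format(year, MONTH)
            for year in range(FIRST_YEAR, LAST_YEAR + 1)]

print(len(patterns))
print(patterns[0])
```

Each resulting string can be passed to `glob.glob` to list the daily files for that year and month.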

### Check that our file pattern matches some NetCDF files

In [None]:
g = glob.glob(PATTERN.format(2000, 7))
g[:2]

### Use `earthio.load_array` to check one file

In [None]:
from earthio import load_array
example = load_array(g[0])

### An `ElmStore` is returned 

`ElmStore` inherits the `data_vars` attribute from `xarray.Dataset`:

In [None]:
example.data_vars 

### Using `ElmStore` attributes

The attribute `T2MMEAN` exists (an `xarray.DataArray`) because `T2MMEAN` was a "variable" name in the NetCDF we opened.

In [None]:
example.T2MMEAN.shape

### Using summary statistics on a `DataArray`

In [None]:
example.T2MMEAN.values.var()

### Working with large temperature data sets, we can drop the other variables

Later when we use [`xarray.open_mfdataset`](http://xarray.pydata.org/en/stable/generated/xarray.open_mfdataset.html) we can provide this list of variables to drop before concatenating temperature grids in time.

In [None]:
DROP_VARIABLES = [k for k in example.data_vars if k != TEMP_BAND]

### Helper function for date components of file name

In [None]:
def split_fname(f):
    parts = f.split('.')
    dt = parts[-2]
    return int(dt[:4]), int(dt[4:6]), int(dt[6:8])
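
A short usage sketch of the helper: it assumes the date is the second-to-last dot-separated field of the file name, written as `YYYYMMDD`. The file name below is illustrative only and should be checked against the files you actually downloaded.

```python
def split_fname(f):
    # The date is the second-to-last dot-separated field: YYYYMMDD
    parts = f.split('.')
    dt = parts[-2]
    return int(dt[:4]), int(dt[4:6]), int(dt[6:8])

# Illustrative file name only (an assumption for this example)
example_name = 'MERRA2_100.statD_2d_slv_Nx.19800702.nc4'
year, month, day = split_fname(example_name)
print(year, month, day)
```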

### Plotting monthly mean temperature
Later we will use this `DataArray` of monthly means to calculate residuals of the time series of grids (the difference between each time series point and the long-term July mean for that pixel).

In [None]:

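def month_means():
    # Use the synchronous scheduler while reading the monthly files
    dask.set_options(get=dask.async.get_sync)
    pat = MONTHLY_PATTERN.format(MONTH)
    # Concatenate the monthly grids in time, then average over time
    return xr.open_mfdataset(pat,
                             lock=True,
                             drop_variables=DROP_VARIABLES,
                             concat_dim='time').mean(dim='time')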

MONTH_MEANS = month_means()
MONTH_MEANS.T2M # This is the only attribute we will use (average temperature K)
                # T2M is a name of a 'variable' in the underlying NetCDF files
(MONTH_MEANS.T2M - 273.15).plot.pcolormesh()
plt.title('Average temperatures for July (C)');

### Using `xarray.open_mfdataset`

In [None]:
help(xr.open_mfdataset)

### Defining a `sampler` function

A `sampler` given to `elm`'s fitting and prediction functions should return either:
 * `X`, an `earthio.ElmStore` or `xarray.Dataset`, or
 * a tuple of `(X, y, sample_weight)`, where `X` is as described above and `y` and `sample_weight` are each either `None` or a 1-dimensional `numpy.ndarray`.
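
The two allowed return forms can be handled uniformly. The helper below is a hypothetical sketch written for this notebook, not part of `elm`; it only illustrates the contract described above, using a plain dictionary as a stand-in for an `ElmStore`.

```python
def unpack_sample(result):
    """Return (X, y, sample_weight) for either allowed sampler return form."""
    if isinstance(result, tuple):
        if len(result) != 3:
            raise ValueError('Expected a tuple of (X, y, sample_weight)')
        return result
    # A bare X: no labels and no weights
    return result, None, None

# Stand-in for an ElmStore or xarray.Dataset; any object works for this sketch.
fake_X = {'T2MMEAN': [[280.0, 281.5], [279.0, 283.0]]}

print(unpack_sample(fake_X)[1:])
print(unpack_sample((fake_X, None, None))[0] is fake_X)
```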

In [None]:

def sampler(month, days, **kwargs):
    with local_client() as lc:
        dask.set_options(get=lc.get)
        print('Sample - Month: {} Days: {}'.format(month, days))
        files = []
        for year in YEARS:
            pattern = PATTERN.format(year, month)
            fs = glob.glob(pattern)
            dates = [split_fname(f) for f in fs]
            keep = [idx for idx, d in enumerate(dates)
                    if d[1] == month and d[2] in days]
            files.extend(fs[idx] for idx in keep)
        print('Sample {} files'.format(len(files)))
        X = xr.open_mfdataset(files, lock=True, drop_variables=DROP_VARIABLES, concat_dim='time')
        X.attrs['sample_kwargs'] = {'month': month, 'days': days}
        X.attrs['band_order'] = [TEMP_BAND]
        X.attrs['old_dims'] = [getattr(X, TEMP_BAND).dims[1:]]
        X.attrs['old_coords'] = {k: v for k, v in X.coords.items()
                                 if k in ('lon', 'lat',)}
        return make_residuals(X)

### Our sampler of files calls this `make_residuals` function:

In [None]:
from earthio import ElmStore
def make_residuals(X, y=None, sample_weight=None, **kwargs):
    """Residuals of a spatial time series relative to the long-term monthly mean

    For each spatial point return (values - mean), where values is the
    time series for a given (lon, lat) point and mean is the long-term
    mean for that point taken from MONTH_MEANS.

    Parameters:
        X: ElmStore or xarray.Dataset with "month" (integer) and
           "days" (list of ints) in X.attrs['sample_kwargs']
        y: passed through
        sample_weight: passed through
        kwargs: ignored
    Returns:
        (Xnew, y, sample_weight)
    """
    month = X.sample_kwargs['month']
    days = X.sample_kwargs['days']
    band_arr = getattr(X, TEMP_BAND)
    date = pd.DatetimeIndex(tuple(pd.Timestamp(v) for v in band_arr.time.values))
    arr = np.empty(band_arr.values.shape)
    for year in YEARS:
        for day in days:
            idxes = np.where((date.day == day)&(date.year == year)&(date.month == month))[0]
            slc = (idxes,
                   slice(None),
                   slice(None)
                   )
            one_day = band_arr.values[slc]
            arr[slc] = (one_day - MONTH_MEANS.T2M.values)
            assert np.abs(arr[slc].sum()) > 0
    data_arr = xr.DataArray(arr, coords=band_arr.coords, dims=band_arr.dims, attrs=band_arr.attrs)
    Xnew = ElmStore({TEMP_BAND: data_arr}, attrs=X.attrs, add_canvas=False)
    del X
    return (Xnew, y, sample_weight)
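
The core of the residual calculation is a broadcast subtraction of a `(lat, lon)` mean grid from a `(time, lat, lon)` array. The following is a minimal sketch with toy numbers; `series` and `long_term_mean` are made-up stand-ins for the temperature band and `MONTH_MEANS.T2M.values`.

```python
import numpy as np

# Toy data: 3 time steps on a 2 x 2 (lat, lon) grid, temperatures in K
series = np.array([[[280.0, 290.0], [300.0, 270.0]],
                   [[282.0, 291.0], [299.0, 268.0]],
                   [[278.0, 289.0], [301.0, 272.0]]])

# Stand-in for MONTH_MEANS.T2M.values: one long-term mean per grid cell
long_term_mean = np.array([[280.0, 290.0], [300.0, 270.0]])

# Broadcasting subtracts the (lat, lon) mean from every time step
residuals = series - long_term_mean

print(residuals.shape)
print(residuals[:, 0, 0])
```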

### Temperature time series sampling

Using the `sampler` function for the time series of July 2 data for each year.

In [None]:
s, _, _ = sampler(7, [2]) # 2nd of July
# we have only one DataArray - T2MMEAN
s.T2MMEAN.std(axis=0).plot.pcolormesh(y='lat', x='lon')
plt.title('Standard deviation of temperature residuals in time (C)')
plt.show()

### `elm.pipeline.steps` - Preprocessing and Transforms
The next cells create some preprocessing and transformation steps we will use in the `elm.pipeline.Pipeline` a few cells later.
Notes:
 * `elm.pipeline.steps.ModifySample` is a step that lets you call any function.  Your function should have the signature `func(X, y=None, sample_weight=None, **kwargs)` and should return either a tuple of `(X, y, sample_weight)` or just `X`, where in either case `X` is an `earthio.ElmStore` or `xarray.Dataset`.
 * `steps.Transform` takes a transformer instance like `IncrementalPCA()` and the keyword argument `partial_fit_batches`, which controls only the number of `partial_fit` operations for the `Transform` step itself.

In [None]:
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA
from elm.pipeline import steps, Pipeline
fixed_bins = steps.TSProbs(band=TEMP_BAND, num_bins=152, bin_size=0.5, log_probs=True)

fixed_bins  # The repr below shows which parameters may be adjusted for 
            # time series binning
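
To give a feel for what time series binning with log probabilities involves, here is a rough NumPy sketch. It is not the `steps.TSProbs` implementation: it assumes fixed-width bins centered on zero that span `num_bins * bin_size` in total, and it adds a small `eps` (an assumption of this sketch) so that empty bins do not produce `log(0)`.

```python
import numpy as np

def log_probs_per_pixel(residuals, num_bins, bin_size, eps=1e-12):
    """Bin each pixel's time series into fixed-width bins; return log probabilities.

    residuals: array of shape (time, n_pixels)
    returns:   array of shape (n_pixels, num_bins)
    """
    half_width = num_bins * bin_size / 2.0
    edges = np.linspace(-half_width, half_width, num_bins + 1)
    out = np.empty((residuals.shape[1], num_bins))
    for pixel in range(residuals.shape[1]):
        counts, _ = np.histogram(residuals[:, pixel], bins=edges)
        probs = counts / max(counts.sum(), 1)
        out[pixel] = np.log(probs + eps)  # eps avoids log(0) for empty bins
    return out

rng = np.random.RandomState(0)
toy = rng.normal(scale=3.0, size=(200, 5))   # 200 time steps, 5 pixels
features = log_probs_per_pixel(toy, num_bins=152, bin_size=0.5)
print(features.shape)
```

Each pixel ends up as one row of `num_bins` features, which is the kind of matrix that PCA and K-Means can then work on.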

### Fitting to multiple samples

The `sampler` function we created above is called with two arguments: an integer for the month and a list of days.  To fit to multiple samples, we need to pass in an `args_list`, which in this case is a list of 2-tuples that are unpacked as the arguments to each call to `sampler`.

In [None]:
july_days = list(range(1, 32))
np.random.shuffle(july_days)
num_blocks = 4
block_size = len(july_days) // num_blocks
args_list = [(7, july_days[block: block_size + block])
             for block in range(0, len(july_days), block_size)
             if len(july_days) - block >= block_size]

args_list # 4 samples of 7 random July days each
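
The block arithmetic above can be checked with a standard-library-only sketch. It uses a fixed seed (an addition for repeatability, not something the notebook does) and confirms that 31 days split into 4 blocks of 7, with the 3 leftover days dropped.

```python
import random

days = list(range(1, 32))          # July has 31 days
random.Random(0).shuffle(days)     # fixed seed so the example is repeatable
num_blocks = 4
block_size = len(days) // num_blocks   # 31 // 4 == 7

# Step through the shuffled days in strides of block_size and
# drop the final partial block (the 3 leftover days).
blocks = [(7, days[start:start + block_size])
          for start in range(0, len(days), block_size)
          if len(days) - start >= block_size]

print(len(blocks), [len(b[1]) for b in blocks])
```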

## Ensemble varying `n_clusters`

The next cell modifies the `ensemble_kwargs` argument to `elm.pipeline.train.train_step`.

 * `ensemble_init_func`: A function that takes 
   * `cls` (e.g. `sklearn.cluster.KMeans`), 
   * `model_kwargs` (the `cls`'s `__init__` keyword args), 
   * `**kwargs`, the `ensemble_kwargs` passed to `train_step`
 * `model_selection`: A function that determines how `Pipeline` instances are passed from one generation of the ensemble to the next.
   * The model selection function should have the signature `func(models, best_idxes=None, **kwargs)`.
   * `best_idxes` are indices in Pareto-sorted order from best to worst fitness, where fitness is determined by the `scoring` function given to `Pipeline` (see below in this notebook).
   * `kwargs` will include `model_selection_kwargs` as well as `generation`, the current ensemble generation, and `ngen`, from the keywords given to `fit_ensemble`.
 * `model_selection_kwargs`: Passed to `model_selection`.
 * The scoring in this example uses `elm.model_selection.kmeans.kmeans_aic` to compute the Akaike Information Criterion (AIC) for K-Means, a score that rewards goodness of fit while penalizing larger numbers of clusters.  Scoring defaults to `model.score` if available.
 * `score_weights` sets the weights for fitness sorting (`-1` here, to minimize AIC) as a list as long as the sequence of scores returned by the scoring function.
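
To make the scoring and sorting idea concrete, the sketch below uses one common textbook form of an AIC-style score for K-Means: the within-cluster sum of squares plus twice the number of free parameters (`k` clusters times `d` dimensions). This is an assumption for illustration; `kmeans_aic` in `elm` may differ in detail. With a single score, fitness sorting reduces to a plain sort on `weight * score`.

```python
def kmeans_aic_sketch(points, centers, labels):
    """AIC-style score: within-cluster sum of squares + 2 * k * d."""
    rss = 0.0
    for point, label in zip(points, labels):
        center = centers[label]
        rss += sum((p - c) ** 2 for p, c in zip(point, center))
    k, d = len(centers), len(centers[0])
    return rss + 2 * k * d

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]

# Candidate A: two clusters; candidate B: one cluster per point
score_a = kmeans_aic_sketch(points, [(0.0, 0.5), (10.0, 10.5)], [0, 0, 1, 1])
score_b = kmeans_aic_sketch(points, list(points), [0, 1, 2, 3])

# A weight of -1 means "minimize": sort by weight * score, largest first
scores = [score_a, score_b]
weight = -1
best_idxes = sorted(range(len(scores)), key=lambda i: weight * scores[i], reverse=True)
print(scores, best_idxes)
```

Candidate B fits the points perfectly but pays a larger penalty for its four clusters, so the two-cluster candidate sorts first.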

### `elm.pipeline.Pipeline`

To create a `Pipeline`, give a list of steps where each step is a class from `elm.pipeline.steps`, except the final step, which may be any estimator with a fit, transform, predict interface like most `scikit-learn` models.  Each step can be a tuple where the first item gives the step a name.

There are two keyword arguments to `Pipeline`:
 * `scoring`: a function called after each model fitting to score the model.  Here K-Means is scored with the Akaike Information Criterion.
 * `scoring_kwargs`: should have the key `score_weights` and may also give the keys `needs_proba` or `needs_threshold`.


In [None]:
from elm.model_selection.kmeans import kmeans_aic, kmeans_model_averaging
pipe = Pipeline([('binning', fixed_bins), 
                 ('pca', steps.Transform(IncrementalPCA(), partial_fit_batches=2)),
                 ('kmeans', MiniBatchKMeans())],
                 scoring=kmeans_aic, 
                 scoring_kwargs=dict(score_weights=[-1]))


### Define `model_selection`, `ensemble_init_func` and other ensemble controls

Here we define `ensemble_init_func` a function for initializing the ensemble with `Pipeline` instances of varying parameters and `model_selection`, a function which takes a list of `(tag, pipeline instance)` tuples and `best_idxes` (Pareto sorting indices) to modify the ensemble members carried forward to the next generation of `fit_ensemble`.

In [None]:
n_clusters_choices = range(4, 11)
n_components_choices = range(3, 6)
TOP_N = 4

_num = 0

def _next_name():
    global _num
    _num += 1
    return 'new_model_{}'.format(_num)

def new_pipe(pipe, num_bins, bin_size):
    n_clusters = np.random.choice(n_clusters_choices)
    n_components = np.random.choice(n_components_choices)      
    print('New ensemble member - n_clusters: {} n_components: {} num_bins {}'.format(n_clusters, n_components, num_bins))
    return pipe.new_with_params(kmeans__n_clusters=n_clusters,
                                pca__n_components=n_components,
                                binning__num_bins=num_bins,
                                binning__bin_size=bin_size)

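def model_selection(models, best_idxes=None, **kwargs):
    # Keep the TOP_N best (tag, pipeline) tuples
    top = [models[idx] for idx in best_idxes[:TOP_N]]
    if kwargs['generation'] < kwargs['ngen']:
        # Refill the ensemble with new variations on the best member
        tag, best = top[0]
        new = [(_next_name(), new_pipe(best,
                        best.get_params()['binning__num_bins'], 
                        best.get_params()['binning__bin_size']))
               for idx in range(len(models) - len(top))]
        return top + new
    # Otherwise return only the best members (a list, like the branch above)
    return top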
        
def ensemble_init_func(pipe, **kwargs):
    print('Calling ensemble_init_func with {} {} '.format(pipe, kwargs))
    models = []
    for num_bins, bin_size in zip([152, 76, 38], [0.5, 1, 2]):
        for _ in range(5):
            models.append(new_pipe(pipe, num_bins, bin_size))
    return models
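
The keyword names passed to `new_with_params` above appear to follow the scikit-learn convention for nested parameters, `<step name>__<parameter name>`. The helper below is a hypothetical sketch (not part of `elm` or scikit-learn) that shows how such names map onto the named steps of the pipeline.

```python
def group_params(**params):
    """Group 'step__param' keyword arguments by step name."""
    grouped = {}
    for key, value in params.items():
        # Split on the first double underscore only
        step, _, param = key.partition('__')
        grouped.setdefault(step, {})[param] = value
    return grouped

grouped = group_params(kmeans__n_clusters=6,
                       pca__n_components=4,
                       binning__num_bins=76,
                       binning__bin_size=1)
print(grouped)
```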



### `fit_ensemble` keyword arguments:

 * `ngen` is the number of generations; the `model_selection` function is called after each generation.
 * `ensemble_init_func` is a function returning a list of `elm.pipeline.Pipeline` instances
 * `model_selection` is a function that takes a list of trained models and returns a list of models
 * `model_selection_kwargs` are keyword arguments passed to `model_selection`
 * `models_share_sample=True`, the default, means to fit all ensemble members to one sample at a time in each generation, cycling to the next sample in the next generation.
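
One way to picture the sample cycling described for `models_share_sample=True` is a simple modulo over `args_list`. This is only an illustration of the idea under that assumption, not the scheduling code used by `elm`; `sample_for_generation` and `toy_args_list` are names invented for the sketch.

```python
def sample_for_generation(args_list, generation):
    """Pick the sampler arguments for a generation, wrapping around if needed."""
    return args_list[generation % len(args_list)]

# Toy args_list: (month, days) tuples as used by the sampler
toy_args_list = [(7, [1, 2]), (7, [3, 4]), (7, [5, 6])]
ngen = len(toy_args_list)

# With ngen == len(args_list), every sample is used exactly once
schedule = [sample_for_generation(toy_args_list, g) for g in range(ngen)]
print(schedule)
```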

In [None]:
ensemble_kwargs = {
    'ngen': len(args_list),
    'ensemble_init_func': ensemble_init_func,
    'model_selection': model_selection,
    'model_selection_kwargs': {},
    'models_share_sample': True,
}

### Call `fit_ensemble`

In [None]:
pipe.fit_ensemble(sampler=sampler, args_list=args_list, client=client, **ensemble_kwargs)

### Call `predict_many` with a `serialize` callable

By default `predict_many` has the keyword argument `to_cube=True`, which converts the 1-D prediction from the scikit-learn estimator to a 2-D raster with the coordinates of the input data.
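
The 1-D to 2-D conversion can be pictured as a reshape. The sketch below assumes the flat predictions are stored in row-major `(lat, lon)` order, which is an assumption of this example rather than a statement about how `elm` orders pixels.

```python
import numpy as np

n_lat, n_lon = 3, 4

# Flat cluster labels, one per (lat, lon) pixel, in row-major order
flat_labels = np.arange(n_lat * n_lon) % 3

# Reshape back to a raster; each row is one latitude, each column one longitude
raster = flat_labels.reshape(n_lat, n_lon)
print(raster)
```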

Here we are calling `predict_many` with:
 * The `ensemble` keyword, which controls which models are used in prediction. By default it would use `pipe.ensemble`, a list of 15 elements in this case, but we limit it to the first two members.
 * The `sampler` and `args_list` given to `fit_ensemble`. The `args_list` or `sampler` may differ between fitting and prediction as long as the sampler function produces a sample with a consistent number of dimensions.
 * `serialize`, a function that serializes each prediction to avoid storing a large number of arrays in memory. The return value of `predict_many` is a list of `ElmStore`s (y predictions) if `serialize` is `None`, otherwise a list of the outputs of the given `serialize` callable.
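
The role of `serialize` can be sketched with plain Python stand-ins. `predict_many_sketch` below is a hypothetical function written for this illustration; the loop order and the toy models are assumptions, and only the idea (return the callable's small output instead of the full prediction) comes from the description above.

```python
def predict_many_sketch(models, sampler, args_list, serialize=None):
    """Call every model on every sample; optionally serialize each prediction."""
    outputs = []
    for tag, predict in models:
        for args in args_list:
            X = sampler(*args)
            y = predict(X)
            if serialize is None:
                outputs.append(y)   # keep the full prediction in memory
            else:
                outputs.append(serialize(y=y, X=X, tag=tag))  # keep only the summary
    return outputs

# Toy stand-ins: the "sample" is a list of day numbers, the "models" are plain functions
toy_sampler = lambda month, days: days
toy_models = [('model_0', lambda X: [d % 2 for d in X]),
              ('model_1', lambda X: [d % 3 for d in X])]
toy_args = [(7, [1, 2, 3]), (7, [4, 5, 6])]

summaries = predict_many_sketch(toy_models, toy_sampler, toy_args,
                                serialize=lambda y, X, tag: (tag, len(y)))
print(summaries)
```

With two models and two samples, the sketch produces four outputs, one per (model, sample) pair.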

In [None]:
def serialize(y=None, X=None, tag=None, elm_predict_path='./predict'):
    '''Example serialize callable for predict_many'''
    # X is an ElmStore from the Pipeline and we have
    # kept "sample_kwargs" in attrs
    y.predict.plot.pcolormesh(levels=np.arange(np.max(y.predict.values)))
    plt.title('Climatic regions based on {}'.format(X.sample_kwargs))
    plt.show()
    return True
pred = pipe.predict_many(ensemble=pipe.ensemble[:2], sampler=sampler, args_list=args_list, serialize=serialize)

### Confirming we have 15 ensemble members

In [None]:
len(pipe.ensemble)

### Showing the return values of `serialize` that reduced memory footprint

In [None]:
pred

### Serializing the `Pipeline` for prediction later

In [None]:
from sklearn.externals import joblib
joblib.dump(pipe, 'pipe.pkl')

In [None]:
pipe = joblib.load('pipe.pkl')