## Carbon Monitoring Project

In [None]:
import holoviews as hv
import pandas as pd
hv.extension('bokeh')

This notebook aims to visualize the data used in the carbon monitoring project [nee_data_fusion](https://github.com/greyNearing/nee_data_fusion/) using Python tools.

The goals of this notebook are to:

* examine the measurements from each site
* generate a visualization or global model that predicts one site from all the other sites
* generate and explain the idea behind the model

To run this notebook, you will need `RSIF_2007_2016_05N_01L.mat` in the `examples` directory which you can download from https://gentinelab.eee.columbia.edu/content/datasets

## Loading FluxNet data ``extract_fluxnet.m``

[FluxNet](http://fluxnet.fluxdata.org/) is a worldwide collection of sensor stations that record a number of local variables relating to atmospheric conditions, solar flux and soil moisture. The data in the [nee_data_fusion](https://github.com/greyNearing/nee_data_fusion/) repository is stored as a collection of CSV files, with the site name encoded in each filename.

This cell defines functions to

* read in the data from all sites
* do some data munging (i.e., date parsing, `NaN` replacement)

In [None]:
import dask
import intake

cat = intake.open_catalog('../catalog.yml')

In [None]:
keep = ['P_ERA', 'TA_ERA', 'PA_ERA', 'SW_IN_ERA', 'LW_IN_ERA', 'WS_ERA',
        'VPD_ERA', 'SWC_F_MDS_1', 'SWC_F_MDS_2', 'SWC_F_MDS_3',
        'TS_F_MDS_1', 'TS_F_MDS_2', 'TS_F_MDS_3', 'TIMESTAMP']

# training columns: everything in `keep` except the timestamp, plus day of year and site
train = [col for col in keep if col != 'TIMESTAMP'] + ['DOY', 'site']

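def read_and_clean_file(s3_path, predict="NEE_CUT_USTAR50"):
    """Load one site's daily file and return the training columns plus the target `y`.

    Returns None when the file lacks required columns.
    """
    df = cat.fluxnet_daily(s3_path=s3_path).to_dask()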
    
    for col in keep:
        if col not in df.columns:
            if 'SWC_F' in col or 'TS_F' in col:
                df = df.assign(**{col: 0})
    
    if not (set(df.columns) >= set(keep)) or predict not in df.columns:
        print(s3_path, 'is missing required columns')
        return

    df[keep + [predict]] = df[keep + [predict]].fillna(0)
    df = df.assign(DOY=df.TIMESTAMP.dt.dayofyear)

    X = df[train]
    X = X.assign(y=df[predict])

    return X

The cell above sets up a helper function that loads and cleans a single data file.
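
To make the cleaning steps concrete, here is a minimal sketch of the same logic on an in-memory pandas frame rather than a Dask frame loaded from the catalog. The frame `toy`, the shortened column list `required`, and all values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of one site's daily file; the values are made up.
toy = pd.DataFrame({
    "TIMESTAMP": pd.to_datetime(["2010-01-01", "2010-01-02", "2010-12-31"]),
    "TA_ERA": [1.5, np.nan, -3.0],
    "NEE_CUT_USTAR50": [0.2, 0.1, np.nan],
})

# Shortened stand-in for the notebook's `keep` list.
required = ["TA_ERA", "SWC_F_MDS_1", "TIMESTAMP"]

# Soil columns that a site does not report are added as all-zero columns.
for col in required:
    if col not in toy.columns and ("SWC_F" in col or "TS_F" in col):
        toy = toy.assign(**{col: 0})

# Remaining gaps in the numeric columns are filled with 0.
numeric_cols = ["TA_ERA", "SWC_F_MDS_1", "NEE_CUT_USTAR50"]
toy[numeric_cols] = toy[numeric_cols].fillna(0)

# Day of year is derived from the parsed timestamp.
toy = toy.assign(DOY=toy["TIMESTAMP"].dt.dayofyear)
print(toy)
```

Note that filling gaps with 0 treats a missing measurement as a real zero, which is a simplification worth keeping in mind when reading the results.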

## Read and clean data
This will take a few minutes if the data is not cached yet. First we get a list of all the files in the S3 bucket; then we iterate over those files and cache, read, and munge the data in each one. The per-file cleaning is necessary because the columns in each file don't necessarily match the columns in the other files, so each file has to be cleaned before we concatenate across sites.

In [None]:
from boto.s3.connection import S3Connection

conn = S3Connection()
bucket = conn.get_bucket('earth-data')
s3_paths = [f.key for f in bucket.list('carbon_flux/nee_data_fusion/FLX')]

In [None]:
datasets = []
for s3_path in s3_paths:
    dd = read_and_clean_file(s3_path)
    if dd is not None:
        datasets.append(dd)

In [None]:
metadata = cat.fluxnet_metadata().read()

## Merge data

Once the data are loaded in, they can be concatenated across sites and joined with the metadata (lat, lon, vegetation) relating to each site.

In [None]:
X = dask.dataframe.concat(datasets).compute()
X.columns

In order to use the categorical `igbp` field (vegetation), we will create a matrix of dummies where each column corresponds to one of the `igbp` types, the rows correspond to observations and all the cells are filled with 0 or 1. 

In [None]:
onehot_metadata = pd.get_dummies(metadata, columns=['igbp'])
onehot_metadata['igbp'] = metadata['igbp']
onehot_metadata.head()
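
As a small-scale illustration of the dummies matrix, the sketch below applies the same two steps to a made-up three-site table. The names `toy_meta` and `toy_onehot` and the site labels are hypothetical. Depending on the pandas version, the indicator cells may display as `True`/`False` rather than 1/0.

```python
import pandas as pd

# Hypothetical three-site metadata table; site names and classes are made up.
toy_meta = pd.DataFrame({
    "site": ["A", "B", "C"],
    "igbp": ["ENF", "GRA", "ENF"],
})

# One indicator column per vegetation class, named `igbp_<class>`.
toy_onehot = pd.get_dummies(toy_meta, columns=["igbp"])

# get_dummies drops the encoded column, so restore the label for later plotting.
toy_onehot["igbp"] = toy_meta["igbp"]
print(toy_onehot)
```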

We'll merge this dummies + metadata matrix with our concatenated matrix from above. 

In [None]:
df = pd.merge(X, onehot_metadata, on='site')
df.info()
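
One detail of this merge is worth spelling out: `pd.merge` defaults to an inner join, so observations whose site has no metadata row are dropped silently. The toy frames below (`obs`, `meta`, made-up values) sketch that behavior.

```python
import pandas as pd

# Hypothetical observations; site "Z" has no metadata entry.
obs = pd.DataFrame({"site": ["A", "A", "B", "Z"], "y": [0.1, 0.2, 0.3, 0.4]})

# Hypothetical metadata; site "C" has no observations.
meta = pd.DataFrame({"site": ["A", "B", "C"], "lat": [45.0, -3.0, 60.0]})

# Default how="inner": only sites present in both frames survive.
merged = pd.merge(obs, meta, on="site")
print(merged)
```

Comparing the row count before and after the merge is a quick way to check whether any site was lost.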

## Sample the data
To speed up the rest of the computations, we'll take a sample (10%) of the observations. We'll also remove some variables that we don't want to use in the linear regression.

In [None]:
show = df.sample(frac=0.10)
sites = pd.Categorical(show['site']).codes
dropped = {}
for col in ['DOY', 'site', 'igbp', 'lat', 'lon']:
    dropped[col] = show[col].copy()
    show.pop(col)
        
print("{} observations and {} variables".format(*show.shape))
print("Generating a prediction with these variables: \n  {}".format(
    "\n  ".join(list(
        show.columns
    ))
))
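
The sample above changes on every run. If reproducible results are wanted, one option (an assumption, not something the cell above does) is to pass `random_state` to `sample`. The sketch uses a made-up frame `toy` and also shows that `pop` both removes a column and returns it.

```python
import pandas as pd

# Hypothetical frame with ten observations from five sites.
toy = pd.DataFrame({"site": list("AABBCCDDEE"), "y": range(10)})

# A fixed seed makes the 50% sample repeatable across runs.
sampled = toy.sample(frac=0.5, random_state=0)

# Integer site codes, computed before the column is removed.
site_codes = pd.Categorical(sampled["site"]).codes

# pop() removes the column from `sampled` and returns it in one step.
held_out = {"site": sampled.pop("site")}
print(sampled.shape, list(sampled.columns))
```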

These variables are sufficient to create the linear models at every site. However, the site information is hidden from the visualization algorithm.

* Good sanity check:
    - latitude encodes some structure, while longitude does not

## Visualization

Linear models work well *at one site*, but their performance across sites is confounded by

* lat/lon
* day of year
* environment type

We want to generate a visualization that accounts for these four variables (latitude, longitude, day of year, and environment type) and helps build some understanding.

That is, these observations lie on some manifold. We want to learn the structure of that manifold, and visualize each observation on that manifold.

This work attempts to find similar observations - observations that have a similar structure between the independent variables (e.g., `P_ERA`) and dependent variables (the carbon flux measurement `y`).

UMAP (Uniform Manifold Approximation and Projection) is a dimensionality-reduction tool suited to this task; it has firm mathematical grounding and is pleasant to use.

In [None]:
import umap
reduct = umap.UMAP(verbose=True, n_epochs=None)#, n_neighbors=30)

In [None]:
reduct.fit(show.values)

In [None]:
embedding = reduct.embedding_
embedding

In [None]:
cols = ['lat', 'lon', 'igbp']
s = pd.DataFrame(dropped)
s['x0'] = embedding[:, 0]
s['x1'] = embedding[:, 1]
for col in cols:
    if col in show:
        s[col] = show[col]
    elif col not in s:
        # report any column that is missing from both frames
        print(col)

We can explore this manifold by coloring a scatter plot according to different variables that we believe should have structure in this space.

In [None]:
from panel.interact import interact
import colorcet as cc

color_by_columns = ['lat', 'lon', 'DOY', 'site', 'igbp']

@interact(color=color_by_columns)
def create_figure(color='lat'):
    opts = {'plot': {'color_index': color, 'show_legend': False,
                     'width': 600, 'height': 600, 'colorbar': True,
                     'tools': ['hover']},
            'style': {'cmap': 'magma', 'legend': False}
    }
    if color == 'DOY':
        opts['style']['cmap'] = cc.cm['cyclic_mrybm_35_75_c68']
    if color == 'igbp':
        opts['style']['cmap'] = 'Category20'
        opts['plot']['legend_position'] ='right'
        opts['plot']['show_legend'] = True
    if color == 'site':
        opts['style']['cmap'] = 'Category20'
        opts['plot']['colorbar'] = False
        opts['plot']['width'] = 525

    chart = hv.Scatter(
        s, kdims=['x0', 'x1'], vdims=[color, 'site'], extents=(-15,-15,15,15)
    ).opts(**opts).relabel('Colored by: {}'.format(color))
    return chart

create_figure

### Taking a closer look at vegetation

We can specify a more custom color map for vegetation and rename the categories with more specific labels. 

In [None]:
igbp_vegetation = {
    'ENF': '01 - Evergreen Needleleaf forest',
    'EBF': '02 - Evergreen Broadleaf forest',
    'DNF': '03 - Deciduous Needleleaf forest',
    'DBF': '04 - Deciduous Broadleaf forest',
    'MF' : '05 - Mixed forest',
    'CSH': '06 - Closed shrublands',
    'OSH': '07 - Open shrublands',
    'WSA': '08 - Woody savannas',
    'SAV': '09 - Savannas',
    'GRA': '10 - Grasslands',
    'WET': '11 - Permanent wetlands',
    'CRO': '12 - Croplands',
}

# https://lpdaac.usgs.gov/about/news_archive/modisterra_land_cover_types_yearly_l3_global_005deg_cmg_mod12c1
lpdaac_palette = [
    '#008000', '#00FF00', '#99CC00', '#99FF99', '#339966', '#993366',
    '#FFCC99', '#CCFFCC', '#FFCC00', '#FF9900', '#006699', '#FFFF00'
]

In [None]:
%%opts Scatter [width=800, height=600] (color=Cycle(lpdaac_palette), size=1, muted_alpha=0)

s['vegetation'] = s['igbp'].apply(lambda x: igbp_vegetation[x])
ds = hv.Dataset(s, ['x0', 'vegetation'], ['x1', 'site'])
grouped = ds.to(hv.Scatter, kdims=['x0', 'x1'], extents=(-15,-15,15,15), vdims=['site'])

grouped.overlay('vegetation').options(legend_position='right')

Next, we isolate each vegetation type so that any site-specific eccentricities stand out. In these plots, let's **color by site ID**.

In [None]:
grouped.options(color_index='site', cmap='Category20', show_legend=False, size=1, alpha=0.8).layout().cols(3)

Next we will train a model to predict carbon flux globally. 

## Setup Dask
With Dask, we can distribute tasks over cores and run computations in parallel.

In [None]:
import dask
import dask.array as da
import dask.dataframe as dd
from distributed import Client

client = Client()
client

## Prediction
Linear models work well *at one site*, but their performance across sites is confounded by

* lat/lon
* day of year
* environment type

In [None]:
import numpy as np
from sklearn.preprocessing import StandardScaler

assert 'site' not in show.columns
y = show['y'].values
X = pd.DataFrame({col: show[col].values 
                  for col in show.columns 
                  if col != 'y'})
print("X.shape =", X.shape)
assert 'y' not in X.columns

# transform data matrix so 0 mean, unit variance for each feature
X = StandardScaler().fit_transform(X.values)
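
For reference, with its default settings `StandardScaler` subtracts each column's mean and divides by that column's population standard deviation. The NumPy sketch below reproduces that arithmetic on a made-up array (`features` is a hypothetical name) and checks the result.

```python
import numpy as np

# Hypothetical feature matrix: 3 observations, 2 features on different scales.
features = np.array([[1.0, 10.0],
                     [2.0, 20.0],
                     [3.0, 60.0]])

# Column-wise z-score; np.std defaults to the population formula (ddof=0).
standardized = (features - features.mean(axis=0)) / features.std(axis=0)

print(standardized.mean(axis=0))  # approximately 0 for each column
print(standardized.std(axis=0))   # approximately 1 for each column
```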

In [None]:
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.linear_model import LinearRegression
from dateutil import rrule
from datetime import datetime, timedelta

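def fit_and_predict(X_train, y_train, X_test, nbrs=False):
    """Fit a linear regression and predict `X_test`.

    With `nbrs=True`, train only on the 3 nearest training neighbors
    of each test point; otherwise train on all of `X_train`.
    """
    if nbrs:
        _nbrs = NearestNeighbors(n_neighbors=3, algorithm='ball_tree').fit(X_train)
        distances, indices = _nbrs.kneighbors(X_test)
    else:
        indices = np.arange(len(X_train), dtype=int)

    # flatten the index array into one list of training rows
    X_train_filtered = X_train[indices.ravel()]
    y_train_filtered = y_train[indices.ravel()]

    model = LinearRegression()
    model.fit(X_train_filtered, y_train_filtered)
    return model.predict(X_test)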
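
The difference between the global fit and the neighbor-restricted fit can be sketched without scikit-learn. The example below swaps in brute-force NumPy distances for `NearestNeighbors` and `np.linalg.lstsq` for `LinearRegression`; the helper names and the noiseless toy data are made up. As in `fit_and_predict`, the neighbors of all test points are pooled into a single training set.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 2))
y_train = 2.0 * X_train[:, 0] - X_train[:, 1] + 0.5  # noiseless toy target
X_test = np.array([[0.0, 0.0], [1.0, -1.0]])

def add_intercept(X):
    """Prepend a column of ones so the fit includes an intercept."""
    return np.column_stack([np.ones(len(X)), X])

def linear_fit_predict(X_fit, y_fit, X_new):
    """Ordinary least squares via np.linalg.lstsq (hypothetical helper)."""
    coef, *_ = np.linalg.lstsq(add_intercept(X_fit), y_fit, rcond=None)
    return add_intercept(X_new) @ coef

# Global fit: use every training row.
y_hat_global = linear_fit_predict(X_train, y_train, X_test)

# Local fit: keep only the 3 nearest training rows of each test point.
dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
nearest = np.argsort(dists, axis=1)[:, :3]   # shape (n_test, 3)
rows = nearest.ravel()                       # pooled across test points
y_hat_local = linear_fit_predict(X_train[rows], y_train[rows], X_test)

print(y_hat_global, y_hat_local)
```

Because the toy target is exactly linear, both fits should agree here; on real data the local fit trades stability for sensitivity to nearby observations.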

In [None]:
def prediction_stats(train_idx, test_idx, X, y, doy=None, predict_each='season', nbrs=False):
    start = datetime(2000, 1, 1)
    end = start + timedelta(days=365)
    
    if predict_each == 'month':
        get_time_id = lambda dt: dt.month
    elif predict_each == 'year':
        get_time_id = lambda dt: 1
    elif predict_each == 'season':
        seasons = {'spring': [3, 4, 5],
                   'summer': [6, 7, 8],
                   'fall': [9, 10, 11],
                   'winter': [12, 1, 2]}
        seasons = {month: season_id
                   for season_id, months in enumerate(seasons.values())
                   for month in months}
        get_time_id = lambda dt: seasons[dt.month] 
    else:
        # literal braces must be doubled so that str.format does not treat them as fields
        msg = "predict_each should be in {{'year', 'month', 'season'}}, got '{}'"
        raise ValueError(msg.format(predict_each))
    
    # from https://stackoverflow.com/questions/153584/how-to-iterate-over-a-timespan-after-days-hours-weeks-and-months-in-python
    # key each date by its 1-based day of year (1..366) so that the keys match `DOY`
    time_partitions = {(dt - start).days + 1: get_time_id(dt)
                       for dt in rrule.rrule(rrule.DAILY, dtstart=start, until=end)}
    
    # visit each distinct partition once, not once per calendar day
    partition_ids = sorted(set(time_partitions.values()))

    preds = []
    for time_partition in partition_ids:
        if len(partition_ids) > 1:
            time_idx = [i for i, day in enumerate(doy) if time_partitions[day] == time_partition]
            
            # get the test set specific to this time instance
            time_test_idx = np.intersect1d(test_idx, time_idx)
        else:
            time_test_idx = test_idx 

        if len(time_test_idx) == 0:
            continue
            
        y_hat = fit_and_predict(X[train_idx], y[train_idx], X[time_test_idx], nbrs=nbrs)
        y_test = y[time_test_idx]
        preds += [{'predicted': y_hat,
                   'actual': y_test,
                   'time_partition': time_partition,
                   'corrcoef': np.corrcoef(y_hat, y_test)[0][1]}]
    actual = [p['actual'] for p in preds]
    predicted = [p['predicted'] for p in preds]
    actual = np.concatenate(actual).flat[:]
    predicted = np.concatenate(predicted).flat[:]
    return {'time_partitions': preds,
            'actual': actual,
            'predicted': predicted,
            'corrcoef': np.corrcoef(actual, predicted)[0][1]}


from sklearn.model_selection import LeaveOneGroupOut
sep = LeaveOneGroupOut()
train_idx, test_idx = list(sep.split(X, y, sites))[0]
_ = prediction_stats(train_idx, test_idx, X, y, doy=dropped['DOY'])
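
The season partitioning inside `prediction_stats` boils down to mapping a 1-based day of year to a season through its calendar month. Below is a minimal standard-library sketch of that mapping, assuming the same leap reference year (2000) and the same month groupings; `SEASON_OF_MONTH` and `season_of_doy` are hypothetical names.

```python
from datetime import date, timedelta

# Month-to-season lookup with the same groupings as `prediction_stats`.
SEASON_OF_MONTH = {
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
    12: "winter", 1: "winter", 2: "winter",
}

def season_of_doy(doy, year=2000):
    """Return the season for a 1-based day of year in the given year."""
    day = date(year, 1, 1) + timedelta(days=doy - 1)
    return SEASON_OF_MONTH[day.month]

for doy in (1, 60, 61, 366):
    print(doy, season_of_doy(doy))
```

Using a leap year as the reference means day 366 is always defined, at the cost of shifting dates after February by one day for observations from non-leap years.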

In [None]:
from sklearn.model_selection import LeaveOneGroupOut
sep = LeaveOneGroupOut()
corrs = []

futures = []
n_splits = sep.get_n_splits(X, y, sites)
X_future = client.scatter(X)
y_future = client.scatter(y)
doy_future = client.scatter(dropped['DOY'])
for i, (train_index, test_index) in enumerate(sep.split(X, y, sites)):
    futures += [{'site_id': i,
                 'train_index': train_index,
                 'test_index': test_index,
                 'stats': client.submit(prediction_stats,
                                        train_index,
                                        test_index,
                                        X_future,
                                        y_future,
                                        doy=doy_future)}]
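
Each task submitted above corresponds to one leave-one-site-out split: all observations from a single site form the test set, and every other site forms the training set. The NumPy sketch below illustrates how such splits are built on made-up group labels; it is a simplified stand-in for what `LeaveOneGroupOut` yields, not its implementation.

```python
import numpy as np

# Hypothetical site codes for six observations from three sites.
groups = np.array([0, 0, 1, 2, 2, 2])

splits = []
for group in np.unique(groups):
    test_idx = np.flatnonzero(groups == group)    # rows from the held-out site
    train_idx = np.flatnonzero(groups != group)   # rows from every other site
    splits.append((train_idx, test_idx))

for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

The number of splits equals the number of distinct sites, so the loop above scales with the number of sites in the sample.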

In [None]:
results = client.gather(futures)

In [None]:
out = [{'site_id': result['site_id'], **result['stats']}
       for result in results]

In [None]:
df = pd.DataFrame(out)
df.head()

In [None]:
%%opts VLine [show_legend=False] VLine (color='red')
# bracket access avoids any clash with DataFrame attributes; dropna removes undefined correlations
corrs = df['corrcoef'].dropna()
frequencies, edges = np.histogram(corrs, 20)

c1 = hv.Histogram((frequencies, edges), extents=(-1, None, 1, None))
c2 = hv.VLine(np.mean(corrs), label='mean')
c3 = hv.VLine(np.median(corrs), label='median')
c1 * c2# * c3

In [None]:
np.mean(corrs)
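
The histogram cell filters out `NaN` correlations before averaging. One plausible source of those values (an assumption, not verified against this dataset) is a site whose predicted or actual series is constant, since the Pearson correlation is undefined when either input has zero variance. The toy arrays below sketch that case.

```python
import numpy as np

actual = np.array([1.0, 2.0, 3.0, 4.0])

# A perfectly linear relationship gives a correlation of about 1.
r_perfect = np.corrcoef(actual, 2 * actual + 1)[0, 1]

# A constant series has zero variance, so the correlation is undefined (NaN).
with np.errstate(invalid="ignore", divide="ignore"):
    r_constant = np.corrcoef(actual, np.zeros(4))[0, 1]

# Drop undefined values before summarizing, as the histogram cell does.
toy_corrs = np.array([r_perfect, r_constant, -0.5])
valid = toy_corrs[~np.isnan(toy_corrs)]
print(valid.mean())
```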