## Carbon Monitoring Project

In [None]:
import holoviews as hv
import pandas as pd
hv.extension('bokeh')

This notebook aims to visualize the data used in the carbon monitoring project [nee_data_fusion](https://github.com/greyNearing/nee_data_fusion/) using Python tools.

The goals of this notebook are to:

* examine the measurements from each site
* generate a visualization or global model that predicts one site from every other site
* generate and explain the model idea

To run this notebook, download `RSIF_2007_2016_05N_01L.mat` from https://gentinelab.eee.columbia.edu/content/datasets and place it in the `examples` directory.

## Loading FluxNet data ``extract_fluxnet.m``

[FluxNet](http://fluxnet.fluxdata.org/) is a worldwide collection of sensor stations that record a number of local variables relating to atmospheric conditions, solar flux, and soil moisture. The data in the [nee_data_fusion](https://github.com/greyNearing/nee_data_fusion/) repository are stored as a collection of CSV files, with each site's name encoded in its filename.
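
Since the site name is carried by the filename, it has to be recovered from the path before observations can be grouped by site. The sketch below assumes file names follow a `FLX_<SITE>_...` pattern; that pattern, the helper `site_from_path`, and the example path are illustrative assumptions rather than code from the repository.

```python
import posixpath


def site_from_path(path):
    """Return the site ID from a path whose file name looks like FLX_<SITE>_...

    Hypothetical helper: the FLX_<SITE>_ naming pattern is an assumption.
    """
    name = posixpath.basename(path)
    parts = name.split("_")
    if len(parts) < 2 or parts[0] != "FLX":
        raise ValueError(f"unexpected file name: {name!r}")
    return parts[1]


# Illustrative path only; the real bucket contents may differ.
example = "carbon_flux/nee_data_fusion/FLX_US-Ha1_FLUXNET2015_FULLSET_DD_1991-2012_1-3.csv"
print(site_from_path(example))
```

Raising on an unexpected name, rather than returning a guess, keeps a mislabeled file from silently ending up under the wrong site.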

The following cells set up the code to

* read in the data from all sites
* do some data munging (i.e., date parsing, `NaN` replacement)

In [None]:
import dask
import intake

cat = intake.open_catalog('../catalog.yml')

In [None]:
keep = ['P_ERA', 'TA_ERA', 'PA_ERA', 'SW_IN_ERA', 'LW_IN_ERA', 'WS_ERA',
        'VPD_ERA', 'SWC_F_MDS_1', 'SWC_F_MDS_2', 'SWC_F_MDS_3',
        'TS_F_MDS_1', 'TS_F_MDS_2', 'TS_F_MDS_3', 'TIMESTAMP']

train = [*filter(lambda x: x!= 'TIMESTAMP', keep), 'DOY', 'site']

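def read_and_clean_file(s3_path, predict="NEE_CUT_USTAR50"):
    """Load one site's daily file and return its features plus the target `y`.

    Returns None if the file lacks a required column.
    """
    df = cat.fluxnet_daily(s3_path=s3_path).to_dask()

    # Soil moisture and soil temperature layers are optional: fill absent ones with 0.
    for col in keep:
        if col not in df.columns:
            if 'SWC_F' in col or 'TS_F' in col:
                df = df.assign(**{col: 0})

    # Skip files that lack any other required column or the prediction target.
    if not (set(df.columns) >= set(keep)) or predict not in df.columns:
        print(s3_path, 'is missing required columns')
        return None

    # Replace missing values with 0 and derive the day of year from the timestamp.
    df[keep + [predict]] = df[keep + [predict]].fillna(0)
    df = df.assign(DOY=df.TIMESTAMP.dt.dayofyear)

    X = df[train]
    X = X.assign(y=df[predict])

    return X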

The cell above sets up a helper function that loads and cleans a single data file.
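
To make the cleaning steps concrete, here is a minimal sketch that applies the same three operations (zero-filling absent soil columns, replacing `NaN` with 0, deriving `DOY`) to a tiny in-memory table. It uses plain pandas instead of the dask dataframe returned by the catalog, and all values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of one site's daily file; real files have many more columns.
raw = pd.DataFrame({
    "TIMESTAMP": pd.to_datetime(["2010-01-01", "2010-01-02", "2010-12-31"]),
    "TA_ERA": [1.5, np.nan, -3.0],
    "NEE_CUT_USTAR50": [0.2, 0.1, np.nan],
})

required = ["TA_ERA", "SWC_F_MDS_1"]
target = "NEE_CUT_USTAR50"

# Soil columns absent at a site are added as zeros, mirroring the helper.
for col in required:
    if col not in raw.columns and ("SWC_F" in col or "TS_F" in col):
        raw = raw.assign(**{col: 0})

# Replace missing values with 0, then derive the day of year.
clean = raw.copy()
clean[required + [target]] = clean[required + [target]].fillna(0)
clean = clean.assign(DOY=clean["TIMESTAMP"].dt.dayofyear)
print(clean)
```

Filling with 0 treats a missing measurement as a genuine zero reading, which is a modeling choice worth keeping in mind when interpreting the embedding later.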

## Read and clean data
This will take a few minutes if the data is not cached yet. First we get a list of all the files in the S3 bucket, then we iterate over those files and cache, read, and munge the data in each one.

In [None]:
from boto.s3.connection import S3Connection

conn = S3Connection()
bucket = conn.get_bucket('earth-data')
s3_paths = [f.key for f in bucket.list('carbon_flux/nee_data_fusion/FLX')]

In [None]:
datasets = []
for s3_path in s3_paths:
    dd = read_and_clean_file(s3_path)
    if dd is not None:
        datasets.append(dd)

In [None]:
metadata = cat.fluxnet_metadata().read()

## Merge data

Once the data are loaded in, they need to be joined with the metadata relating to each site.
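
The join can be sketched on made-up data as follows. The site IDs, coordinates, and flux values are hypothetical; only the column names `site`, `lat`, `lon`, and `igbp` are taken from this notebook.

```python
import pandas as pd

# Hypothetical observations: several rows per site.
obs = pd.DataFrame({"site": ["A", "A", "B"], "y": [0.1, 0.2, -0.3]})

# Hypothetical metadata: exactly one row per site.
meta = pd.DataFrame({
    "site": ["A", "B"],
    "lat": [45.0, -3.0],
    "lon": [7.0, -60.0],
    "igbp": ["ENF", "EBF"],
})

# One-hot encode the vegetation class, but keep the original label for plotting.
onehot = pd.get_dummies(meta, columns=["igbp"])
onehot["igbp"] = meta["igbp"]

# Each observation picks up the metadata row of its site.
merged = pd.merge(obs, onehot, on="site")
print(merged)
```

Note that `pd.merge` performs an inner join by default, so observations from a site with no metadata row would be dropped from the result.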

In [None]:
X = dask.dataframe.concat(datasets).compute()
X.columns

In [None]:
onehot_metadata = pd.get_dummies(metadata, columns=['igbp'])
onehot_metadata['igbp'] = metadata['igbp']

In [None]:
df = pd.merge(X, onehot_metadata, on='site')

In [None]:
show = df.sample(frac=0.10)
sites = pd.Categorical(show['site']).codes
dropped = {}
for col in ['DOY', 'site', 'igbp', 'lat', 'lon']:
    dropped[col] = show[col].copy()
    show.pop(col)
        
print("{} observations and {} variables".format(*show.shape))
print("Generating a prediction with these variables: \n  {}".format(
    "\n  ".join(list(
        show.columns
    ))
))

These variables are sufficient to fit a linear model at each site. However, the columns `DOY`, `site`, `igbp`, `lat`, and `lon` are withheld from the visualization algorithm.

* Good sanity check:
    - latitude encodes some structure, longitude does not

## Visualization

Linear models work well *at one site*, but the relationship they capture is confounded by

* latitude and longitude
* day of year
* environment type

We want a visualization that accounts for these four variables and helps build understanding.
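
The following sketch illustrates the point on synthetic data only; it does not use the FluxNet measurements. It assumes two hypothetical sites whose flux responds to a single driver with different slopes, and compares a least-squares fit per site with one fit pooled across both.

```python
import numpy as np

rng = np.random.default_rng(0)


def fit_rmse(x, y):
    """Fit y ~ a*x + b by least squares and return the RMSE of the fit."""
    A = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sqrt(np.mean((A @ coef - y) ** 2)))


# Two hypothetical sites with different driver-to-flux relationships.
x_a = rng.uniform(0, 1, 200)
x_b = rng.uniform(0, 1, 200)
y_a = 2.0 * x_a + rng.normal(0, 0.05, 200)
y_b = -1.0 * x_b + 0.5 + rng.normal(0, 0.05, 200)

# Worst per-site error versus the error of a single model for both sites.
per_site_rmse = max(fit_rmse(x_a, y_a), fit_rmse(x_b, y_b))
pooled_rmse = fit_rmse(np.concatenate([x_a, x_b]), np.concatenate([y_a, y_b]))
print(per_site_rmse, pooled_rmse)
```

In this toy setup the pooled model fits far worse than either per-site model, because the site identity changes the relationship itself. A global model therefore needs some representation of what distinguishes the sites.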

That is, these observations lie on some manifold. We want to learn the structure of that manifold, and visualize each observation on that manifold.

This work attempts to find similar observations: those with a similar relationship between the independent variables (e.g., `P_ERA`) and the dependent variable (the carbon flux measurement `y`).

UMAP is a tool for this, and has firm mathematical grounding (plus, it's nice to use).

In [None]:
import umap
reduct = umap.UMAP(verbose=True, n_epochs=None)  # optionally: n_neighbors=30

In [None]:
reduct.fit(show.values)

In [None]:
embedding = reduct.embedding_
embedding

In [None]:
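cols = ['lat', 'lon', 'igbp']
# Start from the columns withheld from UMAP, then attach the 2-D embedding.
s = pd.DataFrame(dropped)
s['x0'] = embedding[:, 0]
s['x1'] = embedding[:, 1]
for col in cols:
    if col in show:
        s[col] = show[col]
    elif col not in s:
        # Report any column that is available from neither source.
        print(col)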

In [None]:
from bokeh.models import Select
from bokeh.layouts import row, widgetbox
from bokeh.palettes import Category20
from bokeh.plotting import curdoc
from holoviews.ipython.display_hooks import display
import colorcet as cc

colors = ['lat', 'lon', 'DOY', 'site', 'igbp']

def create_figure(color='lat', **kwargs):
    # Default options: continuous colormap with a colorbar and a hover tool.
    opts = {'plot': {'color_index': color, 'show_legend': False,
                     'width': 600, 'height': 600, 'colorbar': True,
                     'tools': ['hover']},
            'style': {'cmap': 'magma', 'legend': False}}
    if color == 'DOY':
        # Day of year wraps around, so use a cyclic colormap.
        opts['style']['cmap'] = cc.cm['cyclic_mrybm_35_75_c68']
    if color == 'igbp':
        # Vegetation classes are categorical: show a legend.
        opts['style']['cmap'] = 'Category20'
        opts['plot']['legend_position'] = 'right'
        opts['plot']['show_legend'] = True
    if color == 'site':
        # Sites are categorical, so a colorbar is not meaningful.
        opts['style']['cmap'] = 'Category20'
        opts['plot']['colorbar'] = False
        opts['plot']['width'] = 700

    opts.update(**kwargs)
    chart = hv.Scatter(
        s, kdims=['x0', 'x1'], vdims=[color, 'site'], extents=(-15,-15,15,15)
    ).opts(plot=opts['plot'], style=opts['style'])
    return display(chart)

from ipywidgets import interactive

w = interactive(create_figure, color=colors)
w

## Taking a closer look at vegetation

In [None]:
igbp_vegetation = {
    'ENF': '01 - Evergreen Needleleaf forest',
    'EBF': '02 - Evergreen Broadleaf forest',
    'DNF': '03 - Deciduous Needleleaf forest',
    'DBF': '04 - Deciduous Broadleaf forest',
    'MF' : '05 - Mixed forest',
    'CSH': '06 - Closed shrublands',
    'OSH': '07 - Open shrublands',
    'WSA': '08 - Woody savannas',
    'SAV': '09 - Savannas',
    'GRA': '10 - Grasslands',
    'WET': '11 - Permanent wetlands',
    'CRO': '12 - Croplands',
}

In [None]:
s['vegetation'] = s['igbp'].apply(lambda x: igbp_vegetation[x])

In [None]:
ds = hv.Dataset(s, ['x0', 'vegetation'], ['x1', 'site'])
grouped = ds.to(hv.Scatter, kdims=['x0', 'x1'], extents=(-15,-15,15,15), vdims=['site'])

In [None]:
# https://lpdaac.usgs.gov/about/news_archive/modisterra_land_cover_types_yearly_l3_global_005deg_cmg_mod12c1
lpdaac_palette = [
    '#008000', '#00FF00', '#99CC00', '#99FF99', '#339966', '#993366',
    '#FFCC99', '#CCFFCC', '#FFCC00', '#FF9900', '#006699', '#FFFF00'
]

In [None]:
%%opts Scatter [width=800, height=600] (color=Cycle(lpdaac_palette), size=1, muted_alpha=0)
grouped.overlay('vegetation').options(legend_position='right')

Isolate each vegetation type so that any site-specific eccentricities stand out. In this view, let's **color by site ID**.

In [None]:
grouped.options(color_index='site', cmap='Category20', show_legend=False, size=1, alpha=0.8).layout().cols(3)

Take a look at the [Carbon Flux Prediction](Carbon_Flux_Prediction.ipynb) notebook for more information on how to train a predictive model.