## Carbon Monitoring Project

In [1]:
import holoviews as hv
import pandas as pd
hv.extension('bokeh')

This notebook aims to visualize the data used in the carbon monitoring project [nee_data_fusion](https://github.com/greyNearing/nee_data_fusion/) using Python tools.

The goals of this notebook:

* examine the measurements from each site
* generate some visualization or global model to predict one site from every other site.
* generate and explain model idea

To run this notebook, you will need to symlink the `data` directory of the [nee_data_fusion](https://github.com/greyNearing/nee_data_fusion/) to `flux_data` in the `examples` directory of `EarthML`. In addition, you will need `RSIF_2007_2016_05N_01L.mat` in the `examples` directory which you can download from https://gentinelab.eee.columbia.edu/content/datasets

### Loading FluxNet data ``extract_fluxnet.m``

[FluxNet](http://fluxnet.fluxdata.org/) is a worldwide collection of sensor stations that record a number of local variables relating to atmospheric conditions, solar flux and soil moisture. The data is the [nee_data_fusion](https://github.com/greyNearing/nee_data_fusion/) repository is expressed as a collection of CSV files where the site names are expressed in the filenames.

This cell defines functions to

* read in the data from all sites
* do some data munging (i.e., date parsing, `NaN` replacement)

In [2]:
import numpy as np
import datetime
import os

In [3]:
# DAILIES_DIR = '../../nee_data_fusion/data/in_situ/fluxnet_daily/'
# METADATA_CSV = '../../nee_data_fusion/data/in_situ/extracted/allflux_metadata.txt'
DAILIES_DIR = 'flux_data/dailies/'
METADATA_CSV = 'allflux_metadata.txt'

sites = [fname.split('_')[1] for fname in os.listdir(DAILIES_DIR)]
metadata = pd.read_csv(METADATA_CSV, header=None, names=['site', 'lat', 'lon', 'igbp', 'network'], 
                       usecols=['site', 'lat', 'lon', 'igbp'], index_col='site')

# get all the igbp codes for these sites
igbp_codes = metadata.loc[sites].igbp.unique()

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  # This is added back by InteractiveShellApp.init_path()


In [4]:
# Any missing metadata?
metadata.loc[sites].isnull().values.any()

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  


True

In [5]:
def _parse_days(integer):
    """ `integer` date as `20180704` to represent July 4th, 2018"""
    x = str(integer)
    d = {'year': int(x[:4]), 'month': int(x[4:6]), 'day': int(x[6:])}
    day_of_year = datetime.datetime(d['year'], d['month'], d['day']).timetuple().tm_yday
    return day_of_year

def clean(df, timestamp_col="TIMESTAMP", site='', keep=[], drop=[], predict=''):
    """
    Clean the dataset
    
    * Replace NaN's and any number less than -9990 with 0s
    * drop columns specified in `drop`
    * pull out prediction and feature matrices
    * Parse timestamp and pull out day of year ("DOY")
    """
    limit = -9990
    for i in range(50):
        df = df.replace(limit - i, np.nan)
    
    to_drop = [col for col in drop if col in df.columns]
    df.drop(columns=to_drop, inplace=True)
    for col in keep:
        if col not in df.columns:
            if 'SWC_F' in col or 'TS_F' in col:
                df[col] = 0
    
    df = df.fillna(0)
    df['DOY'] = df['TIMESTAMP'].apply(_parse_days)  
    df.pop('TIMESTAMP')
    X = df[keep]
    y = df[predict]
    return X, y

def load_fluxnet_site(site, one=False):
    """
    The main function to load data
    
    Parameters
    ----------
    site : str
        e.g., "US-CA1"
    one : bool, optional
        Whether to preform a dirty hack and create "one" dataframe that
        includes the prediction variable in the feature matrix.
        
    Returns
    -------
    X : pd.DataFrame
        Feature matrix. If ``one``, this will include the prediction variable
        and be the only thing returned.
    y : pd.DataFrame
        The prediction variable. Not returned if ``one``.
    """
    #dataRaw(dataRaw <= -9990) = 0/0 (is NaN?)
    #NaN -> zero
    prefix = 'FLX_{site}_FLUXNET'.format(site=site)
    filenames = [fname for fname in os.listdir(DAILIES_DIR)
                if fname.startswith(prefix)]
    if len(filenames) != 1:
        raise FileNotFoundError
    filename = filenames[0]
    path = '{directory}{filename}'.format(directory=DAILIES_DIR, filename=filename)
    
    raw_daily = pd.read_csv(path)    
    
    keep =  ['P_ERA',
             'TA_ERA',
             'PA_ERA',
             'SW_IN_ERA',
             'LW_IN_ERA',
             'WS_ERA',
             'SWC_F_MDS_1', 'SWC_F_MDS_2', 'SWC_F_MDS_3',
             'TS_F_MDS_1', 'TS_F_MDS_2', 'TS_F_MDS_3',
             'VPD_ERA',
             'DOY']
    drop = ["GPP_DT_VUT_USTAR50",
            "GPP_DT_CUT_USTAR50",
            "LE_F_MDS",
            "H_F_MDS"]
    predict = ["NEE_CUT_USTAR50",
               "NEE_VUT_USTAR50"]
    
    X, y = clean(raw_daily, keep=keep, drop=drop, predict=predict[0])
    X['site'] = site  # some metadata
    X['y'] = y
    return X

## Setup Dask
Dask is required to read in the CSVs and do preprocessing *quickly*.

In [6]:
import dask
import dask.array as da
import dask.dataframe as dd
from distributed import Client

client = Client()
client

0,1
Client  Scheduler: tcp://127.0.0.1:55234  Dashboard: http://127.0.0.1:8787/status,Cluster  Workers: 8  Cores: 8  Memory: 17.18 GB


## Read in data

In [7]:
futures = client.map(load_fluxnet_site, sites, one=True)

In [8]:
succeeded = [f for f in futures if not f.exception()]
failed = [f for f in futures if f.exception()]

In [9]:
dfs = client.gather(succeeded)

## Merge data

Once the data are loaded in, they need to be joined with the metadata relating to each site.

In [10]:
df = pd.concat(dfs)
df.columns

Index(['P_ERA', 'TA_ERA', 'PA_ERA', 'SW_IN_ERA', 'LW_IN_ERA', 'WS_ERA',
       'SWC_F_MDS_1', 'SWC_F_MDS_2', 'SWC_F_MDS_3', 'TS_F_MDS_1', 'TS_F_MDS_2',
       'TS_F_MDS_3', 'VPD_ERA', 'DOY', 'site', 'y'],
      dtype='object')

In [11]:
# create a little dataframe for mapping categorical variables

# TODO: it looks like this is one hot encoding. Use pd.get_dummies instead?
# if not, there's another TODO:
# TODO: it looks like this is creating an identity matrix. Use np.eye?
a = np.zeros((len(igbp_codes), len(igbp_codes)), int)
np.fill_diagonal(a, 1)
categorical_igbp_mapper = pd.DataFrame(index=igbp_codes, columns=igbp_codes, data=a)
categorical_igbp_mapper.rename_axis('igbp', inplace=True)

# add metadata to the big dataframe
onehot_metadata = pd.get_dummies(metadata, columns=['igbp'])
onehot_metadata['igbp'] = metadata['igbp']
assert onehot_metadata.index.name == 'site'
onehot_metadata['site'] = onehot_metadata.index

In [12]:
df = pd.merge(df, onehot_metadata, on='site')

In [13]:
# set this to False to not include vegetation type in the calculation
include_veg = True

show = df.sample(frac=0.10)
sites = pd.Categorical(show['site']).codes
dropped = {}
for col in ['DOY', 'site', 'igbp', 'lat', 'lon']:
    dropped[col] = show[col].copy()
    show.pop(col)
    
if not include_veg:
    for col in igbp_codes:
        dropped[col] = show[col].copy()
        show.pop(col)
        
print("{} observations and {} variables".format(*show.shape))
print("Generating a prediction with these variables: \n  {}".format(
    "\n  ".join(list(
        show.columns
    ))
))

53254 observations and 29 variables
Generating a prediction with these variables: 
  P_ERA
  TA_ERA
  PA_ERA
  SW_IN_ERA
  LW_IN_ERA
  WS_ERA
  SWC_F_MDS_1
  SWC_F_MDS_2
  SWC_F_MDS_3
  TS_F_MDS_1
  TS_F_MDS_2
  TS_F_MDS_3
  VPD_ERA
  y
  igbp_BSV
  igbp_CRO
  igbp_CSH
  igbp_DBF
  igbp_DNF
  igbp_EBF
  igbp_ENF
  igbp_GRA
  igbp_MF
  igbp_OSH
  igbp_SAV
  igbp_SNO
  igbp_WAT
  igbp_WET
  igbp_WSA


These variables are sufficient to create the linear models at every site. However, the site information is hidden from the visualization algorithm.

* Good sanity checks:
    - latitude encoded some structure, longitude does not

## Visualization

Linear models work well *at one site* but this is confounded by

* lat/lon
* day of year
* environment type

We want to generate some visualization that accounts for these 4 variables and helps generate some understanding.

That is, these observations lie on some manifold. We want to learn the structure of that manifold, and visualize each observation on that manifold.

This work attempts to find similar observations - observations that have a similar structure between the independent variables (e.g., `P_ERA`) and dependent variables (the carbon flux measurement `y`).

UMAP is a tool for this, and has firm mathematical grounding (plus, it's nice to use).

In [14]:
import umap
reduct = umap.UMAP(verbose=True, n_epochs=None)#, n_neighbors=30)

UMAP(a=None, angular_rp_forest=False, b=None, init='spectral',
   learning_rate=1.0, local_connectivity=1.0, metric='euclidean',
   metric_kwds=None, min_dist=0.1, n_components=2, n_epochs=None,
   n_neighbors=15, negative_sample_rate=5, random_state=None,
   repulsion_strength=1.0, set_op_mix_ratio=1.0, spread=1.0,
   target_metric='categorical', target_metric_kwds=None,
   target_n_neighbors=-1, target_weight=0.5, transform_queue_size=4.0,
   transform_seed=42, verbose=True)


In [15]:
reduct.fit(show.values)

Construct fuzzy simplicial set
	 0  /  16
	 1  /  16
	 2  /  16
Construct embedding
	completed  0  /  200 epochs
	completed  20  /  200 epochs
	completed  40  /  200 epochs
	completed  60  /  200 epochs
	completed  80  /  200 epochs
	completed  100  /  200 epochs
	completed  120  /  200 epochs
	completed  140  /  200 epochs
	completed  160  /  200 epochs
	completed  180  /  200 epochs


UMAP(a=None, angular_rp_forest=False, b=None, init='spectral',
   learning_rate=1.0, local_connectivity=1.0, metric='euclidean',
   metric_kwds=None, min_dist=0.1, n_components=2, n_epochs=None,
   n_neighbors=15, negative_sample_rate=5, random_state=None,
   repulsion_strength=1.0, set_op_mix_ratio=1.0, spread=1.0,
   target_metric='categorical', target_metric_kwds=None,
   target_n_neighbors=-1, target_weight=0.5, transform_queue_size=4.0,
   transform_seed=42, verbose=True)

In [16]:
embedding = reduct.embedding_
embedding

array([[-4.3887696 ,  0.526364  ],
       [ 1.0208132 , -1.886086  ],
       [-7.206009  ,  7.306581  ],
       ...,
       [ 2.5887456 , -5.287872  ],
       [ 7.234993  ,  0.21064246],
       [-1.4652796 ,  3.5141547 ]], dtype=float32)

In [17]:
cols = ['lat', 'lon', 'igbp']
s = pd.DataFrame(dropped)
s['x0'] = embedding[:, 0]
s['x1'] = embedding[:, 1]
for col in cols:
    if col in show:
        s[col] = show[col]
    else:
        if not col in s:
            print(col)

In [18]:
from bokeh.models import Select
from bokeh.layouts import row, widgetbox
from bokeh.palettes import Category20
from bokeh.plotting import curdoc
from holoviews.ipython.display_hooks import display
import colorcet as cc

colors = ['lat', 'lon', 'DOY', 'site', 'igbp']

def create_figure(color='lat', **kwargs):
    opts = {'plot': {'color_index': color, 'show_legend': False,
                     'width': 600, 'height': 600, 'colorbar': True,
                     'tools': ['hover']},
            'style': {'cmap': 'magma', 'legend': False}
}
    if color == 'DOY':
        opts['style']['cmap'] = cc.cm['cyclic_mrybm_35_75_c68']
    if color == 'igbp':
        opts['style']['cmap'] = 'Category20'
        opts['plot']['legend_position'] ='right'
        opts['plot']['show_legend'] = True
    if color == 'site':
        opts['style']['cmap'] = 'Category20'
        opts['plot']['colorbar'] = False
        opts['plot']['width'] = 700

    opts.update(**kwargs)
    chart = hv.Scatter(
        s, kdims=['x0', 'x1'], vdims=[color, 'site'], extents=(-15,-15,15,15)
    ).opts(plot=opts['plot'], style=opts['style'])
    return display(chart)

from ipywidgets import interactive

w = interactive(create_figure, color=colors)
w

## Taking a closer look at vegetation

In [19]:
igbp_vegetation = {
    'ENF': '01 - Evergreen Needleleaf forest',
    'EBF': '02 - Evergreen Broadleaf forest',
    'DNF': '03 - Deciduous Needleleaf forest',
    'DBF': '04 - Deciduous Broadleaf forest',
    'MF' : '05 - Mixed forest',
    'CSH': '06 - Closed shrublands',
    'OSH': '07 - Open shrublands',
    'WSA': '08 - Woody savannas',
    'SAV': '09 - Savannas',
    'GRA': '10 - Grasslands',
    'WET': '11 - Permanent wetlands',
    'CRO': '12 - Croplands',
}

In [20]:
s['vegetation'] = s['igbp'].apply(lambda x: igbp_vegetation[x])

In [21]:
ds = hv.Dataset(s, ['x0', 'vegetation'], ['x1', 'site'])
grouped = ds.to(hv.Scatter, kdims=['x0', 'x1'], extents=(-15,-15,15,15), vdims=['site'])

In [22]:
# https://lpdaac.usgs.gov/about/news_archive/modisterra_land_cover_types_yearly_l3_global_005deg_cmg_mod12c1
lpdaac_palette = [
    '#008000', '#00FF00', '#99CC00', '#99FF99', '#339966', '#993366',
    '#FFCC99', '#CCFFCC', '#FFCC00', '#FF9900', '#006699', '#FFFF00'
]

In [23]:
%%opts Scatter [width=800, height=600] (color=Cycle(lpdaac_palette), size=1, muted_alpha=0)
grouped.overlay('vegetation').options(legend_position='right')

Isolate each vegetation type so that any site eccentricities are made clear. In this, let's **color by site ID**

In [24]:
grouped.options(color_index='site', cmap='Category20', show_legend=False, size=1, alpha=0.8).layout().cols(3)

## Prediction
Linear models work well *at one site* but this is confounded by

* lat/lon
* day of year
* environment type

In [25]:
from sklearn.preprocessing import StandardScaler

assert 'site' not in show.columns
y = show['y'].values
X = pd.DataFrame({col: show[col].values 
                  for col in show.columns 
                  if col != 'y'})
print(X.shape)
assert 'y' not in X.columns

# transform data matrix so 0 mean, unit variance for each feature
X = StandardScaler().fit_transform(X.values)

(53254, 28)


In [51]:
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.linear_model import LinearRegression
from dateutil import rrule
from datetime import datetime, timedelta

def fit_and_predict(X_train, y_train, X_test):
    # TODO: find neighbors
    model = LinearRegression()
    model.fit(X_train, y_train)
    return model.predict(X_test)
    

* **TODO:** Select better method of finding neighbors
    * smart way of figuring out neighbors? Probably all observations within a certain radius
    * use the embedding (only if radius method doesn't work, requires reading UMAP paper)

In [58]:
def prediction_stats(train_idx, test_idx, X, y, doy=None):
    start = datetime(2000, 1, 1)
    end = start + timedelta(days=365)
    
    # from https://stackoverflow.com/questions/153584/how-to-iterate-over-a-timespan-after-days-hours-weeks-and-months-in-python
    days = [{'month': dt.month, 'day': (dt - start).days}
                   for dt in rrule.rrule(rrule.DAILY, dtstart=start, until=end)]
    months = {(dt - start).days: dt.month
               for dt in rrule.rrule(rrule.DAILY, dtstart=start, until=end)}
    months[366] = 12
    test_days = doy[test_idx]
    preds = []
    for month in set(d['month'] for d in days):
        time_idx = [i for i, day in enumerate(doy) if months[day] == month]
        
        # select a subset of the test set for this particular time
        time_test_idx = [i for i in test_idx if i in time_idx]
        
        if len(time_test_idx) == 0:
            continue
        y_hat = fit_and_predict(X[train_idx], y[train_idx], X[time_test_idx])
        y_test = y[time_test_idx]
        preds += [{'predicted': y_hat,
                   'actual': y_test,
                   'month': month,
                   'corrcoef': np.corrcoef(y_hat, y_test)[0][1]}]
    return preds

In [61]:
from sklearn.model_selection import LeaveOneGroupOut
sep = LeaveOneGroupOut()
corrs = []

futures = []
n_splits = sep.get_n_splits(X, y, sites)
X_future = client.scatter(X)
y_future = client.scatter(y)
doy_future = client.scatter(dropped['DOY'])
for i, (train_index, test_index) in enumerate(sep.split(X, y, sites)):
    futures += [{'site_id': i,
                 'train_index': train_index,
                 'test_index': test_index,
                 'stats': client.submit(prediction_stats,
                                        train_index,
                                        test_index,
                                        X_future,
                                        y_future,
                                        doy=doy_future)}]


In [None]:
results = client.gather(futures)

In [69]:
out = [{'site_id': result['site_id'], **stat}
       for result in results
       for stat in result['stats']]

In [78]:
df = pd.DataFrame(out)
df.head()

Unnamed: 0,actual,corrcoef,month,predicted,site_id
0,"[-6.1902, -4.8638900000000005, -6.29438, -5.03...",0.702115,1,"[-0.9407044776811352, -0.7633363136186352, -2....",0
1,"[-4.92984, -5.93506, -6.05462, -3.54733, -0.87...",0.851639,2,"[-1.6565247901811353, -1.3635560401811353, -1....",0
2,"[-4.418130000000001, -5.12997, -3.03883, -5.85...",-0.092674,3,"[-1.1080628761186353, -1.1304017433061353, -1....",0
3,"[-2.35481, -5.93506, -3.85037, -2.92802, -4.54...",-0.472338,4,"[-0.3652650245561352, 0.3660582176313648, -0.7...",0
4,"[-4.34129, -0.792329, -6.05462, -1.24862, -2.0...",-0.438046,5,"[-0.05654920424363519, 0.11862169419386481, 0....",0


In [85]:
corrs = df.corrcoef[~np.isnan(df.corrcoef.values)]
frequencies, edges = np.histogram(corrs, 20)

c1 = hv.Histogram((frequencies, edges), label='corr. coeffs')
c2 = hv.VLine(np.mean(corrs), label='mean')
c3 = hv.VLine(np.median(corrs), label='median')
c1 #* c2 * c3