## Carbon Monitoring Project

In [1]:
import holoviews as hv
import pandas as pd
hv.extension('bokeh')



This notebook aims to visualize the data used in the carbon monitoring project [nee_data_fusion](https://github.com/greyNearing/nee_data_fusion/) using Python tools.

The goals of this notebook:

* examine the measurements from each site
* generate some visualization or global model to predict one site from every other site.
* generate and explain model idea

To run this notebook, you will need to symlink the `data` directory of the [nee_data_fusion](https://github.com/greyNearing/nee_data_fusion/) repository to `flux_data` in the `examples` directory of `EarthML`. In addition, you will need `RSIF_2007_2016_05N_01L.mat` in the `examples` directory, which you can download from https://gentinelab.eee.columbia.edu/content/datasets

### Loading FluxNet data ``extract_fluxnet.m``

[FluxNet](http://fluxnet.fluxdata.org/) is a worldwide collection of sensor stations that record a number of local variables relating to atmospheric conditions, solar flux and soil moisture. The data in the [nee_data_fusion](https://github.com/greyNearing/nee_data_fusion/) repository is stored as a collection of CSV files, with the site names encoded in the filenames.

This cell defines functions to

* read in the data from all sites
* do some data munging (e.g., date parsing, `NaN` replacement)
* do some feature engineering (e.g., turning dates into features that represent their cyclic structure)

In [2]:
import numpy as np
import datetime
import os

In [3]:
DAILIES_DIR = '../../nee_data_fusion/data/in_situ/fluxnet_daily/'
LATLON_CSV = '../../nee_data_fusion/data/in_situ/extracted/allflux_metadata.txt'

sites = [fname.split('_')[1] for fname in os.listdir(DAILIES_DIR)]
latlon = pd.read_csv(LATLON_CSV, header=None, names=['site', 'lat', 'lon', 'igbp', 'network'], index_col='site')

In [4]:
igbp_codes = latlon.loc[sites].igbp.unique()

In [5]:
def ring(X, column="DOY"):
    """
    Turn an ordinal feature into a ring. e.g., day of year (1 to 365)
    into two features that represent a circle (sin and cos)
    
    Parameters
    ----------
    X : DataFrame
        Must have column ``column``
    """
    scaled = X[column].values.copy() / X[column].max()
    radians = scaled * 2 * np.pi
    return np.sin(radians), np.cos(radians)

def site_lat_lon(df, site):
    """
    Get a site's lat/lon from the global variables defined above
    
    Parameters
    ----------
    df : DataFrame
        lat and lon of each ``site``
    site : str
        name of a site in ``latlon.csv``. e.g., "US-CA1"
    """
    try:
        location = df.loc[site]
        return float(location['lat']), float(location['lon'])
    except Exception:  # site missing from the metadata table, or lat/lon not parsable
        return None, None

def site_veg(df, site):
    """
    Get a site's IGBP vegetation code from the site metadata table
    
    Parameters
    ----------
    df : DataFrame
        site metadata, indexed by site, with an ``igbp`` column
    site : str
        name of a site in ``latlon.csv``. e.g., "US-CA1"
    """
    location = df.loc[site]
    return location['igbp']

def _parse_days(integer):
    """ `integer` date as `20180704` to represent July 4th, 2018"""
    x = str(integer)
    d = {'year': int(x[:4]), 'month': int(x[4:6]), 'day': int(x[6:])}
    day_of_year = datetime.datetime(d['year'], d['month'], d['day']).timetuple().tm_yday
    return day_of_year

def clean(df, timestamp_col="TIMESTAMP", site='', keep=[], drop=[], predict=''):
    """
    Clean the dataset
    
    * Replace NaNs and sentinel values (-9990 down to -10039) with 0s
    * drop columns specified in `drop`
    * pull out prediction and feature matrices
    * Parse timestamp and pull out day of year ("DOY")
    """
    limit = -9990
    for i in range(50):
        df = df.replace(limit - i, np.nan)
    
    to_drop = [col for col in drop if col in df.columns]
    df.drop(columns=to_drop, inplace=True)
    for col in keep:
        if col not in df.columns:
            if 'SWC_F' in col or 'TS_F' in col:
                df[col] = 0
    
    df = df.fillna(0)
    df['DOY'] = df[timestamp_col].apply(_parse_days)
    df.pop(timestamp_col)
    X = df[keep]
    y = df[predict]
    return X, y

def load_fluxnet_site(site, one=False):
    """
    The main function to load data
    
    Parameters
    ----------
    site : str
        e.g., "US-CA1"
    one : bool, optional
        Whether to perform a dirty hack and create "one" dataframe that
        includes the prediction variable in the feature matrix.
        
    Returns
    -------
    X : pd.DataFrame
        Feature matrix. If ``one``, this will include the prediction variable
        and be the only thing returned.
    y : pd.DataFrame
        The prediction variable. Not returned if ``one``.
    """
    #dataRaw(dataRaw <= -9990) = 0/0 (is NaN?)
    #NaN -> zero
    prefix = 'FLX_{site}_FLUXNET'.format(site=site)
    filenames = [fname for fname in os.listdir(DAILIES_DIR)
                if fname.startswith(prefix)]
    if len(filenames) != 1:
        raise FileNotFoundError
    filename = filenames[0]
    
    raw_daily = pd.read_csv('{directory}{filename}'.format(directory=DAILIES_DIR, filename=filename))    
    
    lat, lon = site_lat_lon(latlon, site)
    raw_daily['lat'] = lat
    raw_daily['lon'] = lon
    veg = site_veg(latlon, site)
    raw_daily['igbp'] = veg
    
    # set all igbp cols to 0
    for igbp in igbp_codes:
        raw_daily[igbp] = 0
    
    # set just the relevant one to 1
    raw_daily[veg] = 1
    
    
    keep =  ['P_ERA',
             'TA_ERA',
             'PA_ERA',
             'SW_IN_ERA',
             'LW_IN_ERA',
             'WS_ERA',
             'SWC_F_MDS_1', 'SWC_F_MDS_2', 'SWC_F_MDS_3',
             'TS_F_MDS_1', 'TS_F_MDS_2', 'TS_F_MDS_3',
             'VPD_ERA',
             'DOY', 'lat', 'lon', 'igbp', *igbp_codes]
    drop = ["GPP_DT_VUT_USTAR50",
            "GPP_DT_CUT_USTAR50",
            "LE_F_MDS",
            "H_F_MDS"]
    predict = ["NEE_CUT_USTAR50",
               "NEE_VUT_USTAR50"]
    X, y = clean(raw_daily, keep=keep, drop=drop, predict=predict[0])
    
    X['DOY_ring_feature1'], X['DOY_ring_feature2'] = ring(X)
    
    # This is a quick and dirty hack; I should absolutely not
    # add the prediction variable to the feature matrix. But,
    # for the purposes of this notebook it worked.
    if one:
        X['site'] = site
        X['y'] = y
        return X
    return X, y
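
As a quick illustration of why `ring` replaces the raw day of year with two features, the minimal sketch below places three days on the unit circle and compares their distances. It assumes a fixed 365-day period, whereas `ring` above scales by the column maximum; `ring_encode` and the variable names are illustrative and not part of the notebook.

```python
import numpy as np

def ring_encode(day_of_year, period=365.0):
    """Map day of year onto the unit circle as (sin, cos) features."""
    radians = 2 * np.pi * np.asarray(day_of_year, dtype=float) / period
    return np.sin(radians), np.cos(radians)

# Jan 1, Jul 2 and Dec 31 of a non-leap year.
days = np.array([1, 183, 365])
sin_doy, cos_doy = ring_encode(days)
points = np.column_stack([sin_doy, cos_doy])

# Euclidean distance in ring space: Dec 31 sits next to Jan 1,
# while early July lies on the opposite side of the circle.
dist_jan1_dec31 = np.linalg.norm(points[0] - points[2])
dist_jan1_jul2 = np.linalg.norm(points[0] - points[1])
```

On the raw `DOY` scale, Jan 1 and Dec 31 are 364 units apart; on the ring they are neighbours, which is the structure a distance-based method should see.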

## Setup Dask
Dask is required to read in the CSVs and do preprocessing *quickly*.

In [6]:
import dask
import dask.array as da
import dask.dataframe as dd
from distributed import Client

client = Client()
client

0,1
Client  Scheduler: tcp://127.0.0.1:49358  Dashboard: http://127.0.0.1:8787/status,Cluster  Workers: 8  Cores: 8  Memory: 17.18 GB


## Read in data

In [7]:
futures = client.map(load_fluxnet_site, sites, one=True)

In [8]:
succeeded = [f for f in futures if not f.exception()]
failed = [f for f in futures if f.exception()]

In [9]:
dfs = client.gather(succeeded)
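
The split into `succeeded` and `failed` matters because `load_fluxnet_site` raises `FileNotFoundError` when a site does not have exactly one matching file. The sketch below shows the same pattern (submit, partition by exception, collect) using the standard library's `concurrent.futures` as a stand-in for the `distributed` client. `load_site` and the site names other than `US-CA1` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def load_site(site):
    """Hypothetical loader: fails for sites that have no data file."""
    if site.startswith("missing"):
        raise FileNotFoundError(site)
    return {"site": site, "rows": 3}

site_names = ["US-CA1", "missing-1", "US-CA2"]

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(load_site, name) for name in site_names]
    # Future.exception() blocks until the task finishes, then returns
    # the raised exception, or None if the task succeeded.
    succeeded = [f for f in futures if f.exception() is None]
    failed = [f for f in futures if f.exception() is not None]
    results = [f.result() for f in succeeded]
```

With the `distributed` client used in this notebook, `client.gather(succeeded)` plays the role of the final list comprehension.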

In [10]:
df = pd.concat(dfs)
print("shape =", df.values.shape)
list(df.columns)

shape = (532544, 34)


['P_ERA',
 'TA_ERA',
 'PA_ERA',
 'SW_IN_ERA',
 'LW_IN_ERA',
 'WS_ERA',
 'SWC_F_MDS_1',
 'SWC_F_MDS_2',
 'SWC_F_MDS_3',
 'TS_F_MDS_1',
 'TS_F_MDS_2',
 'TS_F_MDS_3',
 'VPD_ERA',
 'DOY',
 'lat',
 'lon',
 'igbp',
 'CRO',
 'GRA',
 'DBF',
 'SAV',
 'OSH',
 'ENF',
 'DNF',
 'WET',
 'EBF',
 'CSH',
 'MF',
 'WSA',
 'SNO',
 'DOY_ring_feature1',
 'DOY_ring_feature2',
 'site',
 'y']

In [11]:
# set this to False to not include vegetation type in the calculation
include_veg = True

show = df.sample(frac=0.10)
dropped = {}
for col in ['DOY', 'site', 'lon', 'lat', 'DOY_ring_feature1', 'DOY_ring_feature2', 'igbp']:
    dropped[col] = show[col].copy()
    show.pop(col)
if include_veg is False:
    for col in igbp_codes:
        dropped[col] = show[col].copy()
        show.pop(col)
print("{} observations and {} variables".format(*show.shape))
print("Generating a prediction with these variables: \n  {}".format(
    "\n  ".join(list(
        show.columns
    ))
))

53254 observations and 27 variables
Generating a prediction with these variables: 
  P_ERA
  TA_ERA
  PA_ERA
  SW_IN_ERA
  LW_IN_ERA
  WS_ERA
  SWC_F_MDS_1
  SWC_F_MDS_2
  SWC_F_MDS_3
  TS_F_MDS_1
  TS_F_MDS_2
  TS_F_MDS_3
  VPD_ERA
  CRO
  GRA
  DBF
  SAV
  OSH
  ENF
  DNF
  WET
  EBF
  CSH
  MF
  WSA
  SNO
  y


These variables are sufficient to create the linear models at every site. However, the site information is hidden from the visualization algorithm.

* Good sanity checks:
    - latitude encodes some structure, longitude does not

## Visualization

Linear models work well *at one site* but this is confounded by

* lat/lon
* day of year
* environment type

We want to generate a visualization that accounts for these four variables (latitude, longitude, day of year, and environment type) and helps build some understanding.

That is, these observations lie on some manifold. We want to learn the structure of that manifold, and visualize each observation on that manifold.

This will have the effect of finding observations that share a similar structure between the independent variables (e.g., `P_ERA`) and the dependent variable (the carbon flux measurement `y`).

UMAP is a tool for this, and has firm mathematical grounding (plus, it's nice to use).

* **TODO**
    * Add an option to have the plot color be environment type
    * Run the visualization with and without including the environment type variable
    * Add circular colormap for day of year (and `DOY_ring_feature{1, 2}`)
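
One possible approach to the circular-colormap item, sketched under assumptions: recover the angle from the two ring features with `arctan2` and rescale it to a fraction of the cycle, which can then be coloured with a cyclic colormap such as the `hsv` map already used for `DOY` below. The helper name `ring_to_fraction` and the fixed 365-day period are illustrative assumptions.

```python
import numpy as np

def ring_to_fraction(sin_feature, cos_feature):
    """Recover the position on the circle as a fraction of one full turn."""
    angle = np.arctan2(sin_feature, cos_feature)  # in (-pi, pi]
    return (angle % (2 * np.pi)) / (2 * np.pi)

# Four days spread over the year, encoded the same way as the ring features.
doy = np.array([1, 92, 183, 274])
radians = 2 * np.pi * doy / 365.0
fraction = ring_to_fraction(np.sin(radians), np.cos(radians))
```

Because the colormap is cyclic, fractions near 0 and near 1 receive nearly the same colour, so late December and early January look alike in the plot.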

In [12]:
import umap
reduct = umap.UMAP(verbose=True, n_epochs=None)#, n_neighbors=30)

UMAP(a=None, angular_rp_forest=False, b=None, init='spectral',
   learning_rate=1.0, local_connectivity=1.0, metric='euclidean',
   metric_kwds=None, min_dist=0.1, n_components=2, n_epochs=None,
   n_neighbors=15, negative_sample_rate=5, random_state=None,
   repulsion_strength=1.0, set_op_mix_ratio=1.0, spread=1.0,
   target_metric='categorical', target_metric_kwds=None,
   target_n_neighbors=-1, target_weight=0.5, transform_queue_size=4.0,
   transform_seed=42, verbose=True)


In [13]:
reduct.fit(show.values)

Construct fuzzy simplicial set
	 0  /  16
	 1  /  16
	 2  /  16
Construct embedding
	completed  0  /  200 epochs
	completed  20  /  200 epochs
	completed  40  /  200 epochs
	completed  60  /  200 epochs
	completed  80  /  200 epochs
	completed  100  /  200 epochs
	completed  120  /  200 epochs
	completed  140  /  200 epochs
	completed  160  /  200 epochs
	completed  180  /  200 epochs


UMAP(a=None, angular_rp_forest=False, b=None, init='spectral',
   learning_rate=1.0, local_connectivity=1.0, metric='euclidean',
   metric_kwds=None, min_dist=0.1, n_components=2, n_epochs=None,
   n_neighbors=15, negative_sample_rate=5, random_state=None,
   repulsion_strength=1.0, set_op_mix_ratio=1.0, spread=1.0,
   target_metric='categorical', target_metric_kwds=None,
   target_n_neighbors=-1, target_weight=0.5, transform_queue_size=4.0,
   transform_seed=42, verbose=True)

In [14]:
embedding = reduct.embedding_
embedding

array([[-3.6512446,  6.6927996],
       [ 0.8228712, -4.32449  ],
       [ 1.6963229, -2.5155354],
       ...,
       [-1.9341797,  3.5845287],
       [-3.0901997,  7.422512 ],
       [ 2.2594516, -6.1193476]], dtype=float32)

In [15]:
cols = ['lat', 'lon', 'DOY_ring_feature1', 'DOY_ring_feature2', 'igbp']
s = pd.DataFrame(dropped)
s['x0'] = embedding[:, 0]
s['x1'] = embedding[:, 1]
for col in cols:
    if col in show:
        s[col] = show[col]
    else:
        assert col in s

In [42]:
from bokeh.models import Select
from bokeh.layouts import row, widgetbox
from bokeh.palettes import Category20
from bokeh.plotting import curdoc
from holoviews.ipython.display_hooks import display
#from IPython.display import display

colors = ['lat', 'lon', 'DOY', 'site', 'DOY_ring_feature1', 'DOY_ring_feature2', 'igbp']

def create_figure(color='lat', **kwargs):
    opts = {'plot': {'color_index': 2, 'show_legend': False,
                     'width': 600, 'height': 600, 'colorbar': True,
                     'tools': ['hover']},
            'style': {'cmap': 'magma', 'legend': False}}
    if color == 'DOY':
        opts['style']['cmap'] = 'hsv'
    if color in ['site', 'igbp']:
        opts['style']['cmap'] = 'Category20'
        if color == 'site':
            opts['plot']['colorbar'] = False
            opts['plot']['width'] = 700
        if color == 'igbp':
            opts['plot']['show_legend'] = True
    opts.update(**kwargs)
    chart = hv.Scatter(
        s, kdims=['x0', 'x1'], vdims=[color, 'site'], extents=(-15,-15,15,15)
    ).opts(plot=opts['plot'], style=opts['style'])
    return display(chart)

from ipywidgets import interactive

w = interactive(create_figure, color=colors)
w

interactive(children=(Dropdown(description='color', options=('lat', 'lon', 'DOY', 'site', 'DOY_ring_feature1',…

## Taking a closer look at vegetation

In [17]:
igbp_vegetation = {
    'ENF': '01 - Evergreen Needleleaf forest',
    'EBF': '02 - Evergreen Broadleaf forest',
    'DNF': '03 - Deciduous Needleleaf forest',
    'DBF': '04 - Deciduous Broadleaf forest',
    'MF': '05 - Mixed forest',
    'CSH': '06 - Closed shrublands',
    'OSH': '07 - Open shrublands',
    'WSA': '08 - Woody savannas',
    'SAV': '09 - Savannas',
    'GRA': '10 - Grasslands',
    'WET': '11 - Permanent wetlands',
    'CRO': '12 - Croplands',
}

In [18]:
s['vegetation'] = s['igbp'].apply(lambda x: igbp_vegetation[x])

In [19]:
ds = hv.Dataset(s, ['x0', 'vegetation'], ['x1', 'site'])
grouped = ds.to(hv.Scatter, kdims=['x0', 'x1'], extents=(-15,-15,15,15), vdims=['site'])

In [20]:
# https://developers.google.com/earth-engine/image_visualization
gee_palette = [
  '#152106', '#225129', '#369b47', '#30eb5b', '#387242',  # forest
  '#6a2325', '#c3aa69', '#b76031', '#d9903d', '#91af40',  # shrub, grass
  '#111149',  # wetlands
  '#cdb33b',  # croplands
]

In [21]:
# https://lpdaac.usgs.gov/about/news_archive/modisterra_land_cover_types_yearly_l3_global_005deg_cmg_mod12c1
lpdaac_palette = [
    '#008000', '#00FF00', '#99CC00', '#99FF99', '#339966', '#993366',
    '#FFCC99', '#CCFFCC', '#FFCC00', '#FF9900', '#006699', '#FFFF00'
]

In [22]:
%%opts Scatter [width=800, height=600] (color=Cycle(lpdaac_palette), size=1, muted_alpha=0)
grouped.overlay('vegetation').options(legend_position='right')

Isolate each vegetation type and color by site id so that any site eccentricities are made clear.

In [36]:
grouped.options(color_index='site', cmap='Category20', show_legend=False, size=1, alpha=0.8).layout().cols(3)

## End of modifications
`----------------------------------------------------`


The rest of this notebook is copy-pasted from `Carbon-FLux.ipynb`. I haven't modified it yet.

`----------------------------------------------------`

### Adding RSIF data ``collocate_data_types.m``

RSIF is 'Reconstructed Solar Induced Fluorescence': a quantity derived from a vegetation signal that expresses solar energy flux (power) per square meter arriving at the Earth's surface.

In [None]:
import scipy.io
rsif = scipy.io.loadmat('RSIF_2007_2016_05N_01L.mat')


The goal now is to sample the RSIF time series at the station positions, then normalize the time dimension (resample so that the data sets share the same temporal sampling), and finally perform a linear regression analysis on the combined data.

Note that the MATLAB file referenced above was downloaded from https://gentinelab.eee.columbia.edu/content/datasets

In [None]:
%%opts Image [width=700 height=500 clipping_colors={'NaN': 'gray'}] Points (marker='x' color='cyan')

mdata = rsif['RSIF'] # NaNs outside of land area.
def rsif_image(day):
    return hv.Image(mdata[:,:,day], kdims=['lon', 'lat'], vdims=['RSIF'], bounds=(-180,-90,180,90))

rsif_dmap = hv.DynamicMap(rsif_image, kdims=['day']).redim.values(day=range(mdata.shape[2]))
raw_site_positions = {site:site_lat_lon(latlon, site) for site in sites}
site_positions = {site:(lon,lat) for (site, (lat,lon)) in raw_site_positions.items() if None not in (lat,lon)}
rsif_dmap * hv.Points(site_positions.values())

* Algorithm might be useful: Smoothing?
* Mismatched temporospatial satellite imagery.

1. Timestamp per lat/lon. Comes with triples. 
2. Polar: e.g 1pm local time.
3. Global (processed product?)

https://science.nasa.gov/earth-science/earth-science-data/data-processing-levels-for-eosdis-data-products

1. Level 0: Raw data (direct sensor flux). Highest resolution. Minimal model - good for ML.
2. Level 1: Sensor geometry.
3. Level 2: spatial regridding  
4. Level 3: check: physical variable. Primary science product.
5. Level 4: check: model added value.



## Sampling the RSIF signal at the sites

In [None]:
tables = []
for day in range(mdata.shape[2]):
    image = rsif_dmap.select(day=day)
    table = image.sample(samples=site_positions.values())
    tables.append(table.add_dimension('site', 0, site_positions.keys()))
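
Conceptually, sampling the gridded signal at a site amounts to looking up the grid cell that contains the site's coordinates. The sketch below shows that lookup for a regular global grid with the same bounds as the `hv.Image` above, `(-180, -90, 180, 90)`. The 0.5-degree resolution, the assumption that row 0 is the northern edge, and the helper `nearest_cell` are illustrative; the actual layout of the `RSIF` array should be checked before relying on this.

```python
import numpy as np

def nearest_cell(lon, lat, n_rows, n_cols, bounds=(-180.0, -90.0, 180.0, 90.0)):
    """Return (row, col) of the grid cell containing (lon, lat).

    Assumes row 0 is the northern edge and column 0 the western edge.
    """
    west, south, east, north = bounds
    col = int(np.floor((lon - west) * n_cols / (east - west)))
    row = int(np.floor((north - lat) * n_rows / (north - south)))
    # Clamp so that points exactly on the east/south edge stay in range.
    return min(max(row, 0), n_rows - 1), min(max(col, 0), n_cols - 1)

# Hypothetical 0.5-degree global grid filled with a synthetic field.
n_rows, n_cols = 360, 720
field = np.arange(n_rows * n_cols, dtype=float).reshape(n_rows, n_cols)

row, col = nearest_cell(lon=-122.0, lat=37.5, n_rows=n_rows, n_cols=n_cols)
value = field[row, col]
```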

In [None]:
tabulated = hv.HoloMap({day:tables[day] for day in range(mdata.shape[2])}, kdims=['Day'])
tabulated

## Viewing the RSIF time series per site

In [None]:
site_RSIF = tabulated.table().to(hv.Curve, 'Day', 'RSIF').drop_dimension(['lat', 'lon'])
site_RSIF

### biweekly_averaging.m

'RSIF' is put onto the same time sampling as the fluxnet data, adding an 'RSIF' field to the fluxnet dataframe for each site.
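
A minimal sketch of this kind of temporal alignment in pandas, assuming a daily series that is averaged over consecutive 14-day windows. The window length, the start date, and the synthetic values are assumptions for illustration; the exact procedure in `biweekly_averaging.m` is not reproduced here.

```python
import numpy as np
import pandas as pd

# Synthetic daily series standing in for one site's RSIF samples.
days = pd.date_range("2007-01-01", periods=28, freq="D")
daily = pd.DataFrame({"RSIF": np.arange(28, dtype=float)}, index=days)

# Average over consecutive 14-day windows, labelled by window start.
biweekly = daily.resample("14D").mean()
```

Applying the same `resample` call to the fluxnet dataframe (indexed by parsed timestamps) would put both data sets on a shared time axis, after which they can be joined on the index.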

### main_lr_fluxnet.m


Finding the informative variables.

Main step:

```
model{s} = stepwiselm(X,y,''constant'',''Criterion'',''aic'',''Upper'',''linear'')
```

Using [stepwiselm](https://uk.mathworks.com/help/stats/stepwiselm.html):

```
mdl = stepwiselm(X,y,modelspec) creates a linear model of the responses y to the predictor variables in the data matrix X, using stepwise regression to add or remove predictors. modelspec is the starting model for the stepwise procedure.
```
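
To make the idea concrete in Python, here is a simplified sketch of stepwise selection by AIC. It starts from an intercept-only model and greedily adds the linear term that lowers AIC the most, stopping when no addition helps. Unlike `stepwiselm`, it never removes a term once added, and the function names `ols_aic` and `forward_stepwise` and the synthetic data are illustrative assumptions.

```python
import numpy as np

def ols_aic(design, y):
    """AIC (up to an additive constant) of a least-squares fit with Gaussian errors."""
    n, k = design.shape
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    return n * np.log(rss / n) + 2 * k

def forward_stepwise(X, y):
    """Greedy forward selection of columns of X, starting from a constant model."""
    n, p = X.shape
    intercept = np.ones((n, 1))
    selected = []
    best_aic = ols_aic(intercept, y)
    while len(selected) < p:
        scores = {}
        for j in range(p):
            if j in selected:
                continue
            design = np.column_stack([intercept, X[:, selected + [j]]])
            scores[j] = ols_aic(design, y)
        candidate = min(scores, key=scores.get)
        if scores[candidate] >= best_aic:
            break  # no remaining predictor improves AIC
        selected.append(candidate)
        best_aic = scores[candidate]
    return selected

# Synthetic check: only columns 0 and 2 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)
selected = forward_stepwise(X, y)
```

The informative columns are picked up first; a pure-noise column can still be added occasionally, since AIC only penalises each extra term by 2.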

Where ``allData`` is the table (effectively the dataframe):

```
cols= [3:11,14:22];
X = allData(:,cols,s);
y = allData(:,12,s);
```

* 3:11 : 'LAT','LON','P_ERA','TA_ERA','PA_ERA','SW_IN_ERA','LW_IN_ERA','WS_ERA','LE_F_MDS'
* 12: 'H_F_MDS' (Sensible heat flux, gapfilled using MDS method)

To check: NEE is the desired prediction target, and it is in column 13.

### main_ann_fluxnet.m

Main step for site ``s`` in ``trainANN`` using [``train``](https://uk.mathworks.com/help/nnet/ref/train.html) in the neural network toolbox:

```
% net = fitnet(parms.nodes,parms.trainFcn);
net = feedforwardnet(parms.nodes,parms.trainFcn);
[net,~] = train(net,x,t);
```

```
Xdex= [2:10,15,18,21,24];
Ydex = 13;
```

