# Pollen reference data for simulations

The objective of this notebook is to process the pandas dataframe generated in the notebook `/home/andrew/Documents/phd/data-proc/pollen-timeseries/pyogeo/notebooks/pollen-analysis/pollen-analysis.ipynb` (see [here](pollen-analysis.html) for a static reference) and to generate, for each study site, time series at annual temporal resolution for comparison with the outputs of simulation model runs.

A copy of the aforementioned dataframe is stored in the file `data/0_pollen_timeseries.pkl` and will be the starting point for the following analysis:

1. Interpolate land cover proportion data at the temporal resolution provided by the European Pollen Database and produce outputs for each study site at annual resolution.

2. Calculate first and second time derivatives (i.e. slopes and their rates of change) for this interpolated data.

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

## 0. Discard data we don't need for this processing step

First load `data/0_pollen_timeseries.pkl` into a `pandas.DataFrame` and inspect available study sites:

In [None]:
all_sites = pd.read_pickle("data/0_pollen_timeseries.pkl")

def print_included_sites(sites_df):
    """Print the unique site codes found in the `sitecode` index level."""
    print("All included study sites:")
    for s in sites_df.index.get_level_values("sitecode").unique():
        print("- {0}".format(s))
        
print_included_sites(all_sites)

In the version of `0_pollen_timeseries.pkl` loaded on 22/1/19, there were additional sites (bajondillo, puerto_de_los_tornos, etc.) which are not currently under investigation. For the remainder of this analysis I will consider only the sites explicitly discussed during my upgrade:

In [None]:
included_sites = ['monte_areo_mire', 'atxuri', 'charco_da_candieira', 
                  'navarres', 'algendar', 'san_rafael']

all_sites = all_sites.loc[included_sites]

print_included_sites(all_sites)

Looking at the index names, we notice the inclusion of the `e_` index, which identifies an individual sediment core in the EPD. Let's look at a summary of how site codes relate to sediment core numbers:

In [None]:
sites_summary = all_sites.reset_index()[all_sites.index.names]
sites_summary.columns = sites_summary.columns.droplevel(1)
sites_summary = sites_summary.groupby(by=["sitecode", "e_"]).count()
sites_summary.columns = ['no_samples']
sites_summary
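
The same kind of summary can also be produced by grouping directly on the index levels. The following is a minimal sketch on a made-up frame (the site names, core numbers and values are hypothetical, not taken from the EPD data), assuming the index carries `sitecode`, `e_` and `agebp` levels as above:

```python
import pandas as pd

# Toy index mimicking the (sitecode, e_, agebp) levels; values are made up.
idx = pd.MultiIndex.from_tuples(
    [("site_a", 1, 100.0), ("site_a", 1, 250.0), ("site_b", 7, 90.0)],
    names=["sitecode", "e_", "agebp"])
frame = pd.DataFrame({"forest": [0.2, 0.3, 0.7]}, index=idx)

# size() counts the rows (samples) in each (site, core) group.
samples_per_core = frame.groupby(level=["sitecode", "e_"]).size()
print(samples_per_core)

# One core per site means each sitecode appears exactly once in the summary.
cores_per_site = samples_per_core.groupby(level="sitecode").size()
print((cores_per_site == 1).all())
```

Counting the cores per site programmatically avoids relying on a visual check of the summary table.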

Each site has only one core associated with it, so the `e_` index can be dropped without losing any information:

In [None]:
all_sites = all_sites.reset_index().drop("e_", axis=1).set_index(['sitecode', 'agebp'])
all_sites.head()

Notice that the data type of the `agebp` index is a float. Check if this can be safely converted to an integer value:

In [None]:
index_values = np.array(all_sites.index.get_level_values("agebp").unique())
rounded_index_values = np.rint(index_values)
rounded_index_dif = index_values - rounded_index_values
print("Largest difference between raw and rounded values: " +
      str(np.abs(rounded_index_dif).max()))

A largest difference of zero demonstrates that `agebp` can be made an integer index without losing any information:

In [None]:
all_sites.index = all_sites.index.set_levels(
    all_sites.index.levels[1].map(
        lambda ix: np.rint(ix).astype("int")), level="agebp")
all_sites.head()

Finally, note that for the purpose of deriving pollen proportion time series, we don't actually need the `pcount` columns:

In [None]:
all_sites = all_sites.drop("pcount", axis=1)
all_sites.head()

## 1. Interpolate data to achieve annual temporal resolution

In this section we develop functions to create a new `DataFrame` based on `all_sites` -- `interp_df` -- which will hold interpolated data derived from `all_sites` at annual temporal resolution

### 1.1 Develop a function to create an interpolated DataFrame for a single site

As an example, use Algendar

In [None]:
algendar = all_sites.loc['algendar']
algendar.head()

Determine first and last `agebp` values:

In [None]:
earliest_date = algendar.index.max()
latest_date = algendar.index.min()

print("Earliest date: {0} yr BP\nLatest date: {1} yr BP".format(earliest_date,
                                                                 latest_date))

Derive new index based on this range

In [None]:
new_index_vals = np.arange(latest_date, earliest_date+1)
print(new_index_vals)

In [None]:
algendar_interp = pd.DataFrame(algendar.iloc[0:0], index=new_index_vals)
algendar_interp.index.name = "agebp"
algendar_interp.head()

Loop through the DataFrame at EPD resolution and assign those rows for which there is data in the EPD to the correct row in the interpolated dataframe

In [None]:
for i, row in algendar.iterrows():
    algendar_interp.loc[i] = row  
    
algendar_interp.head()

Interpolate missing data in each of the `pprop` columns

In [None]:
for c in algendar_interp.columns.get_level_values("lct"):
    algendar_interp["pprop", c] = algendar_interp["pprop", c] \
                                    .interpolate(method="linear")
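
As a point of comparison, the load-then-interpolate steps can be sketched more compactly with `reindex` followed by `interpolate`. This is a minimal illustration on made-up data (the column names `forest` and `shrubland` and all values are hypothetical), not a replacement for the cells above:

```python
import numpy as np
import pandas as pd

# Toy single-site data: proportions sampled at irregular ages (yr BP).
sparse = pd.DataFrame(
    {"forest": [0.2, 0.6, 0.5], "shrubland": [0.8, 0.4, 0.5]},
    index=pd.Index([10, 14, 16], name="agebp"),
)

# Build the annual index and reindex in one step; years without a sample
# become rows of NaN.
annual_index = pd.Index(
    np.arange(sparse.index.min(), sparse.index.max() + 1), name="agebp")
annual = sparse.reindex(annual_index)

# Linear interpolation fills the gaps column by column. With an evenly
# spaced (1 year) index, method="linear" matches interpolation against age.
annual = annual.interpolate(method="linear")

print(annual)
```

Because every column is interpolated linearly between the same pair of sampled rows, rows that total 1 in the input should also total 1 after interpolation, up to floating point error.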

Confirm no entry is less than 0

In [None]:
if algendar_interp.min().min() < 0:
    raise ValueError("Negative proportions are invalid:\n" \
                     + str(algendar_interp.min()))

In [None]:
algendar_interp.head()

Renormalise land cover proportions to ensure that each row totals 1.0

In [None]:
def renormalise_prop_row(row, tolerance=0):
    """Ensure LCT proportions add up to 1, normalise if not."""
    tot = row.sum()
    if abs(tot-1) > tolerance:
        return row/tot
    return row

algendar_interp = algendar_interp.apply(renormalise_prop_row, axis=1)

In [None]:
algendar_interp.head()

In [None]:
algendar_totals = algendar_interp.sum(axis=1)
# Compare with a tolerance: exact equality with 1 is fragile for floats
assert np.isclose(algendar_totals, 1).all()
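
The row-wise renormalisation can also be expressed without `apply`. The sketch below uses made-up proportions (the column names and values are hypothetical) and divides each row by its own total:

```python
import numpy as np
import pandas as pd

# Toy proportions whose rows do not quite total 1 (illustrative values).
props = pd.DataFrame(
    {"forest": [0.30, 0.50], "shrubland": [0.60, 0.55]},
    index=pd.Index([10, 11], name="agebp"),
)

# Divide every row by its own total; axis=0 aligns the totals on the row index.
row_totals = props.sum(axis=1)
normalised = props.div(row_totals, axis=0)

# Compare against 1 with a tolerance rather than with ==, since floating
# point division does not always yield totals of exactly 1.0.
assert np.allclose(normalised.sum(axis=1), 1.0)
print(normalised)
```

Unlike `renormalise_prop_row` with a non-zero `tolerance`, this version rescales every row, including rows that are already within tolerance of 1.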

Put the above logic together in a function

In [None]:
def interpolate_site_pprop(site_pprop_df):
    """Take EPD resolution data and interpolate to annual resolution.
    
    Args:
        site_pprop_df (:obj:`pandas.DataFrame`): EPD time resolution data for
            a single study site
            
    Returns:
        :obj:`pandas.DataFrame`: A new `DataFrame` with the same columns as 
            the input, but interpolated so it has an annual temporal 
            resolution.
    """
    # Infer earliest and latest dates in site DataFrame
    earliest_date = site_pprop_df.index.max()
    latest_date = site_pprop_df.index.min()
    
    # Derive new index based on this range
    new_index_vals = np.arange(latest_date, earliest_date+1)

    # Create new DataFrame with correct columns and index but no data
    interp_df = pd.DataFrame(site_pprop_df.iloc[0:0], index=new_index_vals)
    interp_df.index.name = "agebp"

    # Load data from EPD into new DataFrame
    for i, row in site_pprop_df.iterrows():
        interp_df.loc[i] = row

    # Interpolate missing data in each of the pprop columns
    for c in interp_df.columns.get_level_values("lct"):
        interp_df["pprop", c] = interp_df["pprop", c] \
                                    .interpolate(method="linear")

    # Confirm no entry is less than 0
    if interp_df.min().min() < 0:
        raise ValueError("Negative proportions are invalid:\n" \
                         + str(interp_df.min()))

    # Renormalise land cover proportions to ensure that each row totals 1.0
    def renormalise_prop_row(row, tolerance=0):
        """Ensure LCT proportions add up to 1, normalise if not."""
        tot = row.sum()
        if abs(tot-1) > tolerance:
            return row/tot
        return row

    interp_df = interp_df.apply(renormalise_prop_row, axis=1)
    
    # Confirm land cover proportions are sufficiently close to 1.0
    interp_df_totals = interp_df.sum(axis=1)
    
    off_total = (interp_df_totals - 1).abs() > 0.001
    if off_total.any():
        raise ValueError("Not all LCT proportions add up to 1:\n" \
                        + str(interp_df[off_total]))
        
    return interp_df

Test on San Rafael data

In [None]:
san_raf_interp = interpolate_site_pprop(all_sites.loc["san_rafael"])
san_raf_interp.head()

### 1.2 Create interpolated DataFrame for all sites

The aim here is to loop through all the sites in the `all_sites` DataFrame's `sitecode` index, create an interpolated version using `interpolate_site_pprop` and join them all together. This can be done with standard `pandas` methods for [concatenating objects](https://pandas.pydata.org/pandas-docs/stable/merging.html#concatenating-objects)


In [None]:
sites = all_sites.index.get_level_values("sitecode").unique()
all_sites_interp = pd.concat([interpolate_site_pprop(all_sites.loc[s]) 
                              for s in sites], keys=sites)
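
To illustrate what `keys` does in `pd.concat`, here is a minimal sketch with two made-up per-site frames (the site names and values are hypothetical). The `names` argument is used here to label the index levels explicitly:

```python
import pandas as pd

# Two toy per-site frames indexed by age (illustrative values and site names).
site_a = pd.DataFrame({"forest": [0.2, 0.3]},
                      index=pd.Index([10, 11], name="agebp"))
site_b = pd.DataFrame({"forest": [0.7, 0.6]},
                      index=pd.Index([5, 6], name="agebp"))

# keys= adds an outer index level; names= labels both levels explicitly.
combined = pd.concat([site_a, site_b], keys=["site_a", "site_b"],
                     names=["sitecode", "agebp"])

print(combined.index.names)
print(combined.loc["site_b"])
```

Selecting with `combined.loc["site_b"]` then recovers the frame for a single site, which is how the per-site data is retrieved later in this notebook.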

Save resulting DataFrame to a pickle for easy subsequent retrieval

In [None]:
all_sites_interp.to_pickle("data/1_all_sites_interp.pkl")

## 2. Calculate time derivatives for each study site's pollen proportions

Reload interpolated data from disk

In [None]:
all_sites_interp = pd.read_pickle("data/1_all_sites_interp.pkl")

Generally a gradient is given by 

$\text{Grad} = \frac{\Delta f}{\Delta t}$

However, because $\Delta t$ is always 1 in this case (the resolution of the interpolated DataFrame is 1 year), the gradient is simply the difference between each cell and the previous one in the same column. Hence first derivatives can be calculated as follows, with second derivatives obtained by applying the same differencing to the first derivatives:

In [None]:
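def make_gradient_column_group(df, src_group, tgt_group):
    """Add `tgt_group` columns holding row-to-row differences of `src_group`.

    Note: `df` is modified in place and also returned.
    """
    for site in df.index.get_level_values("sitecode").unique():
        # unique() avoids repeating each LCT once `df` holds several groups
        for lct in df.columns.get_level_values("lct").unique():
            df.loc[site, (tgt_group, lct)] \
                = df.loc[site, (src_group, lct)].diff().values
    return df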

all_sites_derivs = make_gradient_column_group(all_sites_interp, 
                                              src_group="pprop", 
                                              tgt_group="pprop_1deriv")

all_sites_derivs = make_gradient_column_group(all_sites_derivs, 
                                              src_group="pprop_1deriv", 
                                              tgt_group="pprop_2deriv")
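
The per-site differencing can be illustrated on a small made-up series. The sketch below assumes a `(sitecode, agebp)` index like that of the interpolated data and uses `groupby(...).diff()` as an alternative to looping over sites; the site names and values are hypothetical:

```python
import pandas as pd

# Toy annual series for two hypothetical sites, indexed by (sitecode, agebp).
index = pd.MultiIndex.from_product(
    [["site_a", "site_b"], [0, 1, 2, 3]], names=["sitecode", "agebp"])
pprop = pd.Series([0.0, 0.1, 0.3, 0.6, 0.5, 0.5, 0.4, 0.2], index=index)

# Differencing within each site prevents the first row of one site from
# being compared with the last row of the previous site.
first = pprop.groupby(level="sitecode").diff()
second = first.groupby(level="sitecode").diff()

print(pd.DataFrame({"pprop": pprop, "d1": first, "d2": second}))
```

Each site loses one value to `NaN` per differencing step. Note also that rows are ordered by increasing `agebp`, so these differences are taken with respect to age; a rate expressed with respect to forward time would have the opposite sign for the first derivative.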

In [None]:
all_sites_derivs.head()

In [None]:
all_sites_derivs.to_pickle("data/2_all_sites_with_derivatives.pkl")

## 3. Write zipped CSV files for each study site

In [None]:
all_sites_derivs = pd.read_pickle("data/2_all_sites_with_derivatives.pkl")

In [None]:
import os
from zipfile import ZipFile, ZIP_DEFLATED
from io import BytesIO
zip_dir = "data/3_single_sites_with_derivatives"
if not os.path.isdir(zip_dir):
    os.makedirs(zip_dir)

In [None]:
sites = all_sites_derivs.index.get_level_values("sitecode").unique()
col_groups = all_sites_derivs.columns.get_level_values(0).unique()

for site in sites:
    with ZipFile(os.path.join(zip_dir, site + "_pollen_timeseries.zip"), 'w', 
                 ZIP_DEFLATED) as z:
        for cols in col_groups:
            string_buffer = BytesIO()
            all_sites_derivs.loc[site, cols].to_csv(string_buffer, 
                                                    float_format='%.15f')
            z.writestr(cols + '.csv', string_buffer.getvalue())
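
A quick way to check archives like these is to read a member back and compare it with the source data. The following is a minimal in-memory sketch on a made-up frame (the member name `pprop.csv` mirrors the naming above, while the column name and values are hypothetical):

```python
import io
from zipfile import ZipFile, ZIP_DEFLATED

import pandas as pd

# Toy frame standing in for one column group of one site (illustrative values).
frame = pd.DataFrame({"forest": [0.25, 0.5]},
                     index=pd.Index([10, 11], name="agebp"))

# Write the CSV text into an in-memory zip archive. Encoding the string
# explicitly keeps the example independent of the pandas version.
archive = io.BytesIO()
with ZipFile(archive, "w", ZIP_DEFLATED) as z:
    csv_text = frame.to_csv(float_format="%.15f")
    z.writestr("pprop.csv", csv_text.encode("utf-8"))

# Read the member back and check that the values survive the round trip.
with ZipFile(archive) as z:
    restored = pd.read_csv(io.BytesIO(z.read("pprop.csv")), index_col="agebp")

print(restored)
```

With `float_format='%.15f'`, values are written to 15 decimal places, so any comparison against the original data should allow for that rounding.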

## 4. [Experimental] Write HDF5 files for each study site

In [None]:
all_sites_derivs = pd.read_pickle("data/2_all_sites_with_derivatives.pkl")

Make directory for outputs if it doesn't exist

In [None]:
import os
hdf5_dir = "data/3_single_sites_with_derivatives"
if not os.path.isdir(hdf5_dir):
    os.makedirs(hdf5_dir)

In [None]:
for site in all_sites_derivs.index.get_level_values("sitecode").unique():
    all_sites_derivs.loc[site].to_hdf(
        os.path.join(hdf5_dir, site + "_pollen_timeseries.h5"), key="data")