# Make Land Cover Type (LCT) time series from EPD data

The purpose of this notebook is to:
1. Load pollen abundance time-series data extracted from the European Pollen Database for a selection of sites I am studying in the development of my PhD thesis.
2. Explore, consider the limitations of, and clean that data.
3. Support the systematic assignment of pollen types identified in the empirical data to the categorical land-cover types which will be represented in my simulation models. This is a form of modelling in itself, and serves as an abstraction couched in terms of the notion of a plant functional type. That is, plant _species_ which are postulated to be functionally identical as far as the model is concerned are assigned to the same plant functional group. This will be achieved using regular expressions to embellish the data in a pandas DataFrame.
4. Produce, for each of my empirical study sites, time-series of the proportion of landscape occupied for each of the functional groups represented in the model for the duration of time for which there is abundance data for each study site. This will be presented in the form of a `.csv` file and a plot for each study site. These files will also include first and second derivatives of pollen abundance percentage at each time step.

The only input required to run this notebook is a path to the file `site_pollen_abundance_ts.csv` which is output from [`epd-query`](https://github.com/lanecodes/epd-query).

In [None]:
from dataclasses import dataclass
from pathlib import Path
import os
import sys
import re
from typing import Dict, List

import unidecode

import pandas as pd
import numpy as np

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

from aslib import AgroSuccessLct

In [None]:
pwd = Path.cwd().name
in_pollen_abundance = pwd == 'pollen-abundance'
TMP_DIR = Path('../tmp') if in_pollen_abundance else Path('tmp')
OUTPUT_DIR = Path('../outputs') if in_pollen_abundance else Path('outputs')
PLOTS_DIR = OUTPUT_DIR / 'plots'
# parents=True also creates OUTPUT_DIR if it does not exist yet
PLOTS_DIR.mkdir(parents=True, exist_ok=True)

## 1. Load pollen data from file

In [None]:
epd_data = pd.read_csv(TMP_DIR / 'site_pollen_abundance_ts.csv')

In [None]:
epd_data.head()

In [None]:
epd_data.groupby(['sitename', 'e_']).size()

## 2. Explore, consider the limitations of, and clean pollen core data

### Check numbers of samples in each core, narrow core selection

Load data for one of the three Navarres cores

In [None]:
nav1 = epd_data[epd_data['e_'] == 469]

In [None]:
def summarise_core(df, name):
    print(f'{df["sample_"].unique().size} samples in core {name}')
    print(f'{len(df.index)} records across all samples in core {name}')
    print(f'Top 10 varcodes in core {name}:')
    print(df.groupby('varcode')['varcode'].count().nlargest(10))    

In [None]:
summarise_core(nav1, 'nav1')

We see that the top `varcode` values for this sediment core include `conc.spk`, which corresponds to [pollen spike](https://quantpalaeo.wordpress.com/2017/07/28/pollen-spikes/), and `...`, which is unspecified.

More troublingly, Navarres core 469 (NAVA1) has only 15 samples.

In [None]:
nav2 = epd_data[epd_data['e_'] == 470]
summarise_core(nav2, 'nav2')

NAVA2 has only 30 samples

In [None]:
nav3 =  epd_data[epd_data['e_'] == 471]
summarise_core(nav3, 'nav3')

NAVA3 has 191 samples

Going forward, I'll prefer NAVA3 over NAVA1 and NAVA2 since it contains more samples. If I find something which makes NAVA3 seem unreliable, I may reconsider. For now, drop NAVA1 and NAVA2 from the `epd_data` dataframe.

In [None]:
epd_data = epd_data[~epd_data['e_'].isin([469, 470])]
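
The core-by-core comparison above could also be made for all sites in one step. The following is a minimal sketch on invented toy data, assuming the same `sitename`, `e_` and `sample_` columns as `epd_data`; the names `toy_epd`, `samples_per_core` and `best_core` are hypothetical and not used elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-in for `epd_data`; all values are invented for illustration.
toy_epd = pd.DataFrame({
    'sitename': ['Site A'] * 5 + ['Site B'] * 2,
    'e_': [1, 1, 2, 2, 2, 3, 3],
    'sample_': [10, 11, 20, 21, 22, 30, 30],
})

# Number of distinct samples in each core (entity) at each site.
samples_per_core = (
    toy_epd.groupby(['sitename', 'e_'])['sample_'].nunique()
    .rename('n_samples')
    .reset_index()
)

# For each site, keep the row describing the core with the most samples.
best_core = samples_per_core.loc[
    samples_per_core.groupby('sitename')['n_samples'].idxmax()
]
print(best_core)
```

Sample count is only one possible selection criterion, so a result like this would still need to be weighed against anything that makes a core seem unreliable.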

### Look at the top five pollen-contributing species for each study site, remove pollen spike

In [None]:
def print_top_species(epd_data):
    for ssite in epd_data['sitename'].unique():
        print('\n'+ssite)
        df = epd_data[epd_data['sitename']==ssite]
        df = df.groupby(['var_', 'varcode', 'varname']).agg({'count' : 'sum'})
        print(df.sort_values(by='count', ascending=False).head(5))
    del df

print_top_species(epd_data)

Navarres alone seems to have a lot of pollen spike in it. Also Monte Areo mire and Charco da Candieira have Lycopodium spike added. See [here](https://palynology.wordpress.com/2012/10/07/pollen-spike/) for background on pollen spike. To keep analyses between sites consistent, I will exclude these. 

In [None]:
def remove_varcodes(df: pd.DataFrame, varcodes: List[str]) -> pd.DataFrame:
    """Remove rows corresponding to specified varcodes from epd DF."""
    return df[~df['varcode'].isin(varcodes)]


def remove_varcodes_test_df():
    return pd.DataFrame({
        'varcode': ['goodvar1', 'badvar1', 'badvar2', 'goodvar2'],
        'count': np.random.randint(0, 4000, size=4)
    })


def test_remove_varcodes(test_df):
    res_df = remove_varcodes(test_df, ['badvar1', 'badvar2'])
    assert res_df.iloc[0]['varcode'] == 'goodvar1'
    assert res_df.iloc[1]['varcode'] == 'goodvar2'
    assert len(res_df.index) == 2
    
test_remove_varcodes(remove_varcodes_test_df())   

In [None]:
exclude_pollen_spike = True
if exclude_pollen_spike:
    epd_data = remove_varcodes(
        epd_data, ['Spi/tab', 'Lyc(ad)', 'Lyc(ct)', 'Lyc']
    )

Also note that San Rafael has a significant proportion of Botryococcus in its samples. This is a genus of green algae. Since it doesn't correspond to any _land_ plant species, we exclude it.

In [None]:
aquatic_plant_codes = [
    'Bry',
    'Zyg-T',
    'Spr-T',
    'Pot',      # Potamogeton, aquatic plant
    'Clo.i-T',  # Closterium idiosporum, green algae
    'Spi.cf.s', # Spirogyra cf. scrobiculata, green algae
    'Trl.s',    # Trilete spore(s),  not from modern terrestrial plant
]

exclude_non_land_plants = True
if exclude_non_land_plants:
    epd_data = remove_varcodes(epd_data, aquatic_plant_codes)

Lots of moss (Sphagnum) was identified in, e.g., Atxuri. Exclude this.

In [None]:
exclude_mosses = True
if exclude_mosses:
    epd_data = remove_varcodes(epd_data, ['Sph'])

Fungal spores such as Glomus turn up in Navarres. Exclude these.

In [None]:
fungal_species_codes = [
    'Glomus',
    'Pos',  # Polyadosporites, fungal spore http://www.redalyc.org/html/454/45437346003/index.html
]

exclude_fungi = True
if exclude_fungi:
    epd_data = remove_varcodes(epd_data, fungal_species_codes)

Remove records corresponding to pollen which could not be identified

In [None]:
unrecognised_species_codes = [
    'Ind.unkn',  # found in navarres
    'T16C',
]

In [None]:
exclude_unrecognised = True
if exclude_unrecognised:
    epd_data = remove_varcodes(epd_data, unrecognised_species_codes)

At this point, `epd_data` contains entries for all:
1. sediment cores
2. samples (depths/ ages)
3. species (careful to exclude pollen spike)

In [None]:
epd_data

#### Give each site an easily typed `sitecode` to refer to as an index

It will be convenient to be able to refer to sites by an index. To make these easy to type, create a `sitecode` column which transliterates any non-ASCII characters to ASCII, replaces spaces with underscores and converts names to lower case.

In [None]:
print(epd_data['sitename'].unique())

In [None]:
epd_data['sitecode'] = (
    epd_data['sitename']
    .apply(unidecode.unidecode)
    .str.replace(' ', '_')
    .str.lower()
)

print(epd_data.sitecode.unique())

In [None]:
epd_data.head()

#### Drop unnecessary columns

I plan to use `sitecode` as an index going forward because it's natural to think in terms of study sites. This means I don't need the other site-level fields in the dataframe I take forward in my analyses. So that this information can easily be retrieved if need be when debugging, I save it to disk before removing the extra columns.

In [None]:
site_meta_fields = ['sitecode', 'sitename', 'site_', 'sigle', 'e_', 'chron_']
site_meta = epd_data.groupby(site_meta_fields).size().rename('num_records')
site_meta.to_csv(OUTPUT_DIR / 'site_metadata.csv', encoding='utf8', header=True)
epd_data = epd_data.drop(
    [x for x in site_meta_fields if x != 'sitecode'], axis=1
)
epd_data

`sample_` (a database key from the EPD) is also redundant at this point, since we can identify each sample from its `agebp`. Similarly each variable (pollen species) is uniquely identified by its `varcode` so we can also drop `var_`.
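
The claim that `agebp` identifies each sample could be checked before the column is dropped. Below is a minimal sketch on invented toy data, assuming the `sitecode`, `sample_` and `agebp` columns; `sample_age_is_one_to_one` is a hypothetical helper, not part of this notebook.

```python
import pandas as pd

# Toy stand-in for `epd_data`; all values are invented for illustration.
toy = pd.DataFrame({
    'sitecode': ['a', 'a', 'a', 'b', 'b'],
    'sample_': [1, 1, 2, 7, 8],
    'agebp': [100.0, 100.0, 250.0, 100.0, 300.0],
})


def sample_age_is_one_to_one(df: pd.DataFrame) -> bool:
    """Check that, within each site, samples and ages map one-to-one."""
    # Each sample should carry exactly one age...
    ages_per_sample = df.groupby(['sitecode', 'sample_'])['agebp'].nunique()
    # ...and each age should belong to exactly one sample.
    samples_per_age = df.groupby(['sitecode', 'agebp'])['sample_'].nunique()
    return bool((ages_per_sample == 1).all() and (samples_per_age == 1).all())


print(sample_age_is_one_to_one(toy))
```

If such a check failed for the real data, dropping `sample_` would merge distinct samples that share an age.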

In [None]:
epd_data = epd_data.drop(['sample_', 'var_'], axis=1)

In [None]:
epd_data.head()

Check `agebp` and `count` can be converted to `int` without loss of data, and do the conversion

In [None]:
def convert_field_to_int(df: pd.DataFrame, field: str) -> pd.DataFrame:
    """Convert named float field to int if no data would be lost."""
    assert (~df[field].isna()).all(), f'missing data found in {field}'
    assert ((df[field] - df[field].astype(int)) == 0).all(), (
     f'casting {field} to int caused loss of data'
    )
    df[field] = df[field].astype(int)
    return df

In [None]:
epd_data = (
    epd_data
    .pipe(lambda df: convert_field_to_int(df, 'agebp'))
    .pipe(lambda df: convert_field_to_int(df, 'count'))
)

In [None]:
epd_data

Create a unique index

In [None]:
epd_data = epd_data.set_index(['sitecode', 'agebp', 'varcode']).sort_index()

Find index is unexpectedly not unique

In [None]:
epd_data.index.is_unique

Check which sites duplicates are coming from

In [None]:
epd_data[epd_data.index.duplicated()].groupby(level=['sitecode']).count()

In [None]:
pct_affected = 107 / len(epd_data.loc['charco_da_candieira'].index) * 100
print(f'{round(pct_affected, 2):.2f}% of charco_da_candieira entries are duplicates')

As less than 1% of Charco da Candieira sample/species combinations are affected, we will simply assume that where multiple entries exist for a species in a single sample, the correct count is obtained by summing the duplicates. No other site's data are affected by this issue.
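
To make the summing rule concrete, here is a minimal sketch on invented toy data with the same index levels as `epd_data`. The names `toy`, `n_duplicates` and `summed` are hypothetical.

```python
import pandas as pd

# Toy data in which ('s', 100, 'Pin') appears twice; values are invented.
toy = pd.DataFrame({
    'sitecode': ['s', 's', 's', 's'],
    'agebp': [100, 100, 100, 200],
    'varcode': ['Pin', 'Pin', 'Que', 'Pin'],
    'count': [3, 4, 5, 6],
}).set_index(['sitecode', 'agebp', 'varcode']).sort_index()

# Rows whose index repeats an earlier row's index.
n_duplicates = int(toy.index.duplicated().sum())

# Collapse repeated index entries by summing their counts.
summed = toy['count'].groupby(level=['sitecode', 'agebp', 'varcode']).sum()
print(summed)
```

The number of rows removed by the aggregation equals the number of duplicated index entries, which is the property the assertions in the cells below rely on.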

In [None]:
initial_index_len = len(epd_data.index)

In [None]:
varcode_to_varname_df = (
    epd_data.reset_index(level='varcode')[['varcode', 'varname']]
    .drop_duplicates()
    .set_index('varcode')
)

In [None]:
epd_data = (
    epd_data['count']
    .groupby(level=['sitecode', 'agebp', 'varcode']).sum()
    .to_frame()
    .join(varcode_to_varname_df)
)

In [None]:
assert initial_index_len - len(epd_data.index) == 107
assert epd_data.index.is_unique

Rename `count` to avoid an understandable but irritating namespace collision with the `pd.Series.count` method.

In [None]:
epd_data = epd_data.rename(columns={'count': 'pcount'})

In [None]:
epd_data

In [None]:
epd_data.loc['navarres'].head()

In [None]:
epd_data.groupby(level=['sitecode', 'agebp']).sum()

`epd_data` is now prepped and ready to use for subsequent analyses. Serialise a csv file so it can be retrieved without rerunning the above cells.

In [None]:
epd_data.to_csv(OUTPUT_DIR / 'clean_epd_data.csv')

## 3. Relate identified pollen species with model-dependent plant functional types

In [None]:
try:
    epd_data
except NameError:
    epd_data = (
        pd.read_csv(OUTPUT_DIR / 'clean_epd_data.csv')
        .set_index(['sitecode', 'agebp', 'varcode'])
    )   

Retrieve a list of unique `varname`-s found amongst the sediment cores analysed thus far in the notebook.

In [None]:
unique_species = (
    epd_data.reset_index()[['varname', 'varcode']].drop_duplicates()
    .set_index('varcode')
)
unique_species.head()

### Find the most common species for each study site

The objective is to ensure that approximately 90% of counted pollen is assigned to one of the following land cover type groups:

- Shrubland: includes grasses (Poaceae, formerly Gramineae, family), and juniper (genus Juniperus, belongs to cypress family Cupressaceae).
- Pine forest: anything belonging to the Pinus genus
- Deciduous forest: Beech family, Fagaceae and Chestnut (Castanea genus)
- Oak forest: anything belonging to the Quercus genus

Find percentage of each study site's total contributed by each species. These are the species whose mapping to land cover types are most important.

In [None]:
(
    epd_data.groupby(['sitecode', 'varcode'])['pcount'].sum().to_frame()
    .pipe(lambda df: df.join(df.groupby('sitecode')['pcount']
                             .sum().rename('site_total')))
    .assign(species_pct=lambda df: df['pcount'] / df['site_total'] * 100)
    .drop(columns='site_total')
    .groupby('sitecode')['species_pct'].nlargest(10)
    .reset_index(0, drop=True)
    .to_frame()
    .join(unique_species)
)

### Identify land-cover types with pollen species

The aim in this section is to construct a dictionary whose keys are land cover types included in my simulation models, and whose values are regular expressions which match the names of species contributing to those land cover types. This dictionary will then be used to say: if _this_ pattern is found in a species name, map it to _this_ land cover type.
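
The idea of such a dictionary can be sketched as follows. The patterns here are hypothetical and far simpler than the ones actually used, and `toy_lct_regex` and `toy_lookup` are illustrative names that do not exist elsewhere in this notebook.

```python
import re
from typing import Dict, Optional

# Hypothetical patterns for illustration only.
toy_lct_regex: Dict[str, str] = {
    'pine_forest': r'^pinus\b',
    'oak_forest': r'^quercus\b',
    'shrubland': r'^(gramineae|poaceae|juniperus)\b',
}


def toy_lookup(species_name: str) -> Optional[str]:
    """Return the land cover type whose pattern matches, or None."""
    matches = [
        lct for lct, pattern in toy_lct_regex.items()
        if re.match(pattern, species_name, re.IGNORECASE)
    ]
    if len(matches) > 1:
        raise ValueError(
            f'{species_name} matched multiple land cover types: {matches}'
        )
    return matches[0] if matches else None


print(toy_lookup('Pinus pinaster-type'))
print(toy_lookup('Botryococcus'))
```

Raising an error on multiple matches, rather than picking one, forces ambiguous patterns to be resolved explicitly.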

The following land cover types are included in simulations, but not in the land cover type categories used in this notebook:

1. Water/Quarry
2. Burnt
3. Depleted agricultural land
4. Barley
5. Wheat
6. Transition forest

Land cover types 1-3 above don't produce any pollen. Barley and wheat produce grass pollen. This belongs to the Poaceae (formerly known as Gramineae) family, and is assumed to contribute to 'Shrubland'. There is no depleted agricultural land, barley or wheat land cover present at the beginning of a simulation, as these are anthropogenically induced land cover types.

I don't map pollen to the 'Transition forest' land cover type because this type is a mixture of pine and oak forest. When comparing simulation outputs to empirical pollen abundance, I will assume transition forest simulation cells contribute half a cell of pine forest pollen and half a cell of oak forest pollen. When generating Neutral Landscape models from pollen abundance, I will assume that no cells start off as transition forest, and allow transition forest cells to be introduced by a model 'burn in' period. An alternative proposal might be to change my modelling approach such that I effectively integrate out the transition forest state, so we create the possibility of transitioning directly between pine and oak, subject to the kind of environmental conditions which would support transition forest.
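
The half-and-half assumption for transition forest can be sketched as below. The cell counts are invented, the label `transition_forest` is an assumed name for that land cover type, and `split_transition_forest` is a hypothetical helper rather than part of the simulation code.

```python
import pandas as pd

# Hypothetical numbers of simulation cells in each land cover type.
cell_counts = pd.Series({
    'shrubland': 40,
    'pine_forest': 20,
    'oak_forest': 10,
    'deciduous_forest': 10,
    'transition_forest': 20,
})


def split_transition_forest(counts: pd.Series) -> pd.Series:
    """Share transition forest cells equally between pine and oak.

    Returns the resulting land cover proportions as percentages.
    """
    out = counts.drop('transition_forest').astype(float)
    half = counts['transition_forest'] / 2
    out['pine_forest'] += half
    out['oak_forest'] += half
    return out / out.sum() * 100


print(split_transition_forest(cell_counts))
```

With these toy numbers the 20 transition forest cells add 10 cells each to pine and oak, giving percentages that are directly comparable with the four pollen-derived land cover types.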

#### Map land cover types to species

In [None]:
from taxa import POLLEN_LCT_MAPS, compose_regexs, SpeciesGroup

Write regex to species mapping to csv and LaTeX files

In [None]:
def sg_list_to_df(sgs: List[SpeciesGroup]):
    """Convert list of SpeciesGroup objects to a dataframe."""
    df = pd.DataFrame([x.__dict__ for x in sgs])
    return df.fillna(np.nan)   

species_regex_df = (
    pd.concat({k: sg_list_to_df(v) for k, v in POLLEN_LCT_MAPS.items()})
    .reset_index(1, drop=True)
    .pipe(lambda df: df.set_index(pd.Index(df.index, name='Functional type')))
    .rename(columns={'regex': 'Regular expression', 'desc': 'Description',
                     'note': 'Note'})
    .set_index('Description', append=True)
    .sort_index()
)

In [None]:
species_regex_df.head()

In [None]:
species_regex_df.to_csv(OUTPUT_DIR / 'species_regex.csv')
species_regex_df.drop(columns='Note').to_latex(
    OUTPUT_DIR / 'species_regex.tex',
    longtable=True
)

Define a function which, given a species name, returns the land cover type it maps to (or `None` if it matches none).

In [None]:
def get_lct(species_name: str, pol_lct_dict: Dict[str, str],
            verbose=False) -> str:
    """Given a species name, map it to a land cover type.
    
    Throw a ValueError if species name matches more than one land cover type.
    """
    lcts = []
    for lct_name, regex in pol_lct_dict.items():
        if re.match(regex, species_name, re.IGNORECASE):
            lcts.append(lct_name)
            if verbose:
                print(regex + ' matches ' + species_name)
    
    if len(lcts) > 1:
        raise ValueError('Species name {0} matched multiple land cover type '
                         'regex strings: {1}'.format(species_name, lcts))
    if len(lcts) == 0:
        return None

    return lcts[0]


def test_get_lct():
    regex_to_lct_map = {lct_name: compose_regexs(
        [x.regex for x in species_group_list]
    ) for lct_name, species_group_list in POLLEN_LCT_MAPS.items()}
    
    def species_name_test(species_name, expected_lct):
        determined_lct = get_lct(species_name, regex_to_lct_map)
        assert  determined_lct == expected_lct, (
            f"expected '{expected_lct}' to be lct for species "
            f"'{species_name}' but got '{determined_lct}' instead."
        )
        
    species_name_test('Quercus ilex-type', 'oak_forest')
    species_name_test('Quercus', 'oak_forest')
    species_name_test('Pinus pinaster-type', 'pine_forest')
    species_name_test('Pinus', 'pine_forest')
    species_name_test('Rumex crispus-type', 'shrubland')
    species_name_test('Compositae subf. Cichorioideae', 'shrubland')
    species_name_test('Erica arborea-type', 'shrubland')
    species_name_test('Ericaceae', 'shrubland')
    species_name_test('Polypodium vulgare-type', 'shrubland')
    
test_get_lct()

Apply `get_lct` to each species included in the chronology

In [None]:
regex_to_lct_map = {lct_name: compose_regexs([x.regex 
                                              for x in species_group_list])
                    for lct_name, species_group_list in POLLEN_LCT_MAPS.items()}
unique_species['lct'] = unique_species.varname.apply(
    lambda x: get_lct(x, regex_to_lct_map)
)

In [None]:
mapped_species = unique_species[unique_species['lct'].notnull()]
mapped_species.to_csv(TMP_DIR / 'species_to_landcover_mapping.csv', index=False)

For each study site, find the percentage of pollen contributed by each species to each sample

In [None]:
epd_data = (
    epd_data
    .join(epd_data
          .groupby(level=['sitecode', 'agebp'])['pcount'].sum()
          .rename('sample_tot'))
    .assign(species_pct=lambda df: df['pcount'] / df['sample_tot'] * 100)
    .drop(columns='sample_tot')
)

# Use the absolute difference so totals below 100 are also caught
assert ((epd_data.groupby(level=['sitecode', 'agebp'])['species_pct'].sum()
         - 100).abs() < 0.00001).all(), (
    'site/ sample percentage totals should equal 100'
)

In [None]:
epd_data

Add `lct` to index via `unique_species`

In [None]:
epd_data = (
    epd_data
    .join(unique_species.drop(columns='varname'))
    .assign(lct=lambda df: df['lct'].fillna('not_specified'))
    .set_index('lct', append=True)
    .swaplevel(3, 2)
    .sort_index()
)

In [None]:
epd_data

#### Evaluate proportion of pollen, for each study site, accounted for by land-cover type mapping

Aggregate `epd_data` from species level to land cover type level 

In [None]:
total_lct_pct_df = (
    epd_data
    .groupby(level=['sitecode', 'lct'])['pcount'].sum()
    .rename('lct_total_count').to_frame()
    .pipe(lambda df: df.join(df.groupby(level='sitecode')['lct_total_count']
                             .sum().rename('site_total')))
    .assign(site_lct_pct=lambda df: (df['lct_total_count'] 
                                     / df['site_total'] * 100))
    .loc[:, 'site_lct_pct']
    .unstack()
    .loc[:, ['shrubland', 'pine_forest', 'oak_forest', 'deciduous_forest', 'not_specified']]
)

# Use the absolute difference so totals below 100 are also caught
assert ((total_lct_pct_df.sum(1) - 100).abs() < 0.00001).all(), (
    'per-site total lct contributions should total 100%'
)

Write total percentage of pollen corresponding to each land cover type to outputs to facilitate subsequent plotting.

In [None]:
total_lct_pct_df.to_csv(OUTPUT_DIR / 'site_total_lct_pct.csv')

In [None]:
site_name_map = {
    'algendar': 'Algendar',
    'atxuri': 'Atxuri',
    'charco_da_candieira': 'Charco da\nCandieira',
    'monte_areo_mire': 'Monte Areo\nmire',
    'navarres': 'Navarrés',
    'san_rafael': 'San Rafael'
}

lct_name_map = {
    'shrubland': 'Shrubland',
    'pine_forest': 'Pine',
    'oak_forest': 'Oak',
    'deciduous_forest': 'Deciduous',
    'not_specified': 'Not allocated'
}

In [None]:
font = {'weight' : 'normal',
        'size'   : 13}
matplotlib.rc('font', **font)

In [None]:
lct_color_list = [x.color.hex_code for x 
                  in [AgroSuccessLct.SHRUBLAND, AgroSuccessLct.PINE,
                  AgroSuccessLct.OAK, AgroSuccessLct.DECIDUOUS]] + ['w']
f, ax = plt.subplots(figsize=(9, 5))
plot_df = total_lct_pct_df.rename(site_name_map, axis=0).rename(lct_name_map, axis=1)
plot_df.plot(kind='bar', stacked=True, ax=ax, color=lct_color_list, linewidth=.5, edgecolor='k')
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.45), frameon=False)
plt.xticks(rotation=45)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.set_xlabel(None);
ax.tick_params(axis=u'x', length=0)
ax.set_ylabel('% pollen allocated to PFT');
plt.tight_layout()
plt.savefig(PLOTS_DIR / 'pct_pollen_allocated_lct.pdf', bbox_inches='tight')#, pad_inches=4)

The proportion of pollen not falling into one of the groups represented in the model above is deemed acceptable, i.e. at least 90% of pollen for simulated study sites is attributed to a modelled land cover type.
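
The 90% criterion could also be checked programmatically instead of being read off the plot. This is a minimal sketch on invented percentages laid out like `total_lct_pct_df`; `toy_total_lct_pct`, `allocated_pct` and `below_threshold` are hypothetical names.

```python
import pandas as pd

# Invented site-level percentages with the same columns as total_lct_pct_df.
toy_total_lct_pct = pd.DataFrame(
    {
        'shrubland': [50.0, 30.0],
        'pine_forest': [20.0, 30.0],
        'oak_forest': [15.0, 20.0],
        'deciduous_forest': [10.0, 5.0],
        'not_specified': [5.0, 15.0],
    },
    index=pd.Index(['site_a', 'site_b'], name='sitecode'),
)

# Percentage of pollen attributed to a modelled land cover type.
allocated_pct = 100 - toy_total_lct_pct['not_specified']

# Sites falling short of the 90% target.
below_threshold = allocated_pct[allocated_pct < 90]
print(below_threshold)
```

An empty `below_threshold` would mean every site meets the target.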

Summarise `epd_data` to land cover type level

In [None]:
lct_data = (
    epd_data
    .groupby(level=['sitecode', 'agebp', 'lct'])['species_pct'].sum()
    .unstack().replace(np.nan, 0)
)

In [None]:
lct_data.to_csv(TMP_DIR / 'site_uninterpolated_lct_ts.csv')

In [None]:
lct_data

## 4. Interpolate data to achieve annual temporal resolution

In this section we develop functions to create a new `DataFrame`, `interp_lct_data`, which holds data derived from `lct_data` and linearly interpolated to annual temporal resolution.
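
The core of the approach, reindexing to every year and then interpolating linearly, can be sketched on a toy two-sample series. The values are invented and the names `toy` and `annual` are hypothetical.

```python
import numpy as np
import pandas as pd

# Two invented samples, four years apart.
toy = pd.DataFrame(
    {'pine_forest': [40.0, 60.0], 'oak_forest': [60.0, 40.0]},
    index=pd.Index([100, 104], name='agebp'),
)

# Reindexing inserts a row of NaN for each missing year; interpolate then
# fills those rows on a straight line between neighbouring samples.
annual = (
    toy.reindex(np.arange(toy.index.min(), toy.index.max() + 1))
    .interpolate(method='linear')
)
print(annual)
```

Because the interpolation is linear and applied to every column, rows that total 100% at the sampled ages also total 100% at the interpolated ages, up to floating point error.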

In [None]:
del lct_data

In [None]:
try:
    lct_data
except NameError:
    lct_data = (
        pd.read_csv(TMP_DIR /  'site_uninterpolated_lct_ts.csv')
        .set_index(['sitecode', 'agebp'])
    )   

In [None]:
lct_data

### Develop a function to create an interpolated DataFrame for a single site

As an example, use Algendar

In [None]:
lct_data

In [None]:
algendar = lct_data.loc[pd.IndexSlice['algendar', :], :]

In [None]:
algendar.head()

In [None]:
algendar.tail()

In [None]:
def interpolate_lct_data(site_lct_df: pd.DataFrame):
    """Interpolate LCT percentage DF for site to annual time steps.
    
    Resulting rows are normalised to ensure total for each sample equals
    100%.
    
    Input DF should have index levels ('sitecode', 'agebp').
    """
    site_lct_df = site_lct_df.copy().reset_index(level='sitecode')
    earliest, latest = site_lct_df.index.max(), site_lct_df.index.min()
    if len(site_lct_df['sitecode'].unique()) > 1:
        raise ValueError(
            'It is only appropriate to interpolate values for a single site'
        )
    site = site_lct_df.iloc[0]['sitecode']

    return (
        site_lct_df
        .reindex(np.arange(latest, earliest + 1))
        .assign(sitecode=site)
        .set_index('sitecode', append=True).swaplevel()
        .interpolate(method='linear')
        .pipe(_check_all_positive)
        .transform(_normalise_lct_row, axis=1)
    )

In [None]:
def _check_all_positive(df: pd.DataFrame) -> pd.DataFrame:
    """Check all values in input dataframe are positive.
    
    Raise value error if there are negative values, else return
    original dataframe unchanged.
    """
    if not (df >= 0).all(axis=None):
        raise ValueError('Negative values found in DataFrame')
    return df

In [None]:
def _normalise_lct_row(row: pd.Series, tolerance: float=0) -> pd.Series:
    """Ensure each sample's LCT percentages total 100%.
    
    Normalise if necessary.
    """
    tot = row.sum()
    if abs(tot - 100) > tolerance:
        return (row / tot) * 100
    return row
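
In [None]:
# A minimal sketch of what row normalisation does, using invented values.
# The names `toy` and `normalised` are hypothetical. Dividing each row by its
# own total is a vectorised approximation of applying `_normalise_lct_row`
# with `tolerance=0` to every row.

```python
import pandas as pd

# First row totals 90% (e.g. after dropping unallocated pollen),
# second row already totals 100%.
toy = pd.DataFrame(
    {
        'shrubland': [45.0, 50.0],
        'pine_forest': [30.0, 25.0],
        'oak_forest': [15.0, 25.0],
    },
    index=pd.Index([100, 101], name='agebp'),
)

# Rescale each row so that its values total 100%.
normalised = toy.div(toy.sum(axis=1), axis=0) * 100
print(normalised)
```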

In [None]:
def test_interpolate_lct_data(algendar_lct_df):
    res_df = (
        algendar_lct_df
        .drop(columns='not_specified')
        .pipe(interpolate_lct_data)
    )
    assert res_df.iloc[0].name[1] == 2262
    assert res_df.iloc[-1].name[1] == 8961
    assert len(res_df.index) == 8961 - 2262 + 1
    assert ((res_df.sum(1) - 100).abs() < 0.00001).all()

In [None]:
test_interpolate_lct_data(algendar)

### Create interpolated DataFrame for all sites

Interpolate land cover proportion data at the temporal resolution provided by the European Pollen Database and produce outputs for each study site at annual resolution.

The aim here is to loop through all the sites in the `lct_data` DataFrame's `sitecode` index level and create an interpolated version of each using `interpolate_lct_data`.

In [None]:
interp_lct_data = (
    lct_data
    .drop(columns='not_specified')
    .groupby('sitecode')
    .apply(interpolate_lct_data)
    .droplevel(0)
)

In [None]:
interp_lct_data

Save resulting DataFrame to file for easy subsequent retrieval

In [None]:
interp_lct_data.to_csv(TMP_DIR / 'site_interpolated_lct_ts.csv')

### Calculate time derivatives for each study site's pollen proportions

Calculate first and second time derivatives (i.e. slopes) for this interpolated data.

Reload interpolated data from disk

In [None]:
try:
    interp_lct_data
except NameError:
    interp_lct_data = (
        pd.read_csv(TMP_DIR / 'site_interpolated_lct_ts.csv')
        .set_index(['sitecode', 'agebp'])
    )   

Generally a gradient is given by 

$\text{Grad} = \frac{\Delta f}{\Delta t}$

However, because in this case $\Delta t$ is always 1 (the resolution of the interpolated DataFrame is 1 year), the gradient is simply given by the difference between each cell and the previous one in the same column. Hence first derivatives can be calculated as follows:

In [None]:
deriv_dict = dict()
deriv_dict['pct'] = interp_lct_data
# Difference within each site so that no value is differenced across the
# boundary between two sites
deriv_dict['d1_pct'] = interp_lct_data.groupby(level='sitecode').diff()
deriv_dict['d2_pct'] = deriv_dict['d1_pct'].groupby(level='sitecode').diff()
deriv_lct_data = pd.concat(deriv_dict, axis=1)
del deriv_dict
# Flatten two-level column names, e.g. ('d1_pct', 'oak_forest') becomes
# 'd1_pct_oak_forest'
deriv_lct_data.columns = ['_'.join(col).strip() 
                          for col in deriv_lct_data.columns.values]

In [None]:
deriv_lct_data.dropna()
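
As a worked example of the differencing, consider the following sketch on invented annual values for two sites. The names `toy`, `d1` and `d2` are hypothetical.

```python
import pandas as pd

# Invented annual percentages for two sites, three years each.
toy = pd.DataFrame(
    {'pine_forest': [10.0, 12.0, 16.0, 1.0, 4.0, 9.0]},
    index=pd.MultiIndex.from_product(
        [['site_a', 'site_b'], [100, 101, 102]],
        names=['sitecode', 'agebp'],
    ),
)

# First differences within each site; the first year of each site is NaN.
d1 = toy.groupby(level='sitecode').diff()
# Second differences; the first two years of each site are NaN.
d2 = d1.groupby(level='sitecode').diff()
print(pd.concat({'pct': toy, 'd1_pct': d1, 'd2_pct': d2}, axis=1))
```

Note that rows are ordered by increasing `agebp`, so a positive first difference here means the percentage increases going further back in time.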

Finally, write time series files for each study site to the outputs directory

In [None]:
for site, df in deriv_lct_data.groupby('sitecode'):
    site_dir = OUTPUT_DIR / site
    site_dir.mkdir(exist_ok=True)
    df.droplevel(0).to_csv(site_dir / 'lct_pct_ts.csv')