# Reconstructing the land-cover of prehistoric landscapes

The purpose of this notebook is to:
1. Load pollen abundance time-series data extracted from the European Pollen Database for a selection of sites I am studying in the development of my PhD thesis.
2. Explore, consider the limitations of, and clean that data.
3. Support the systematic assignment of pollen types identified in the empirical data to the categorical land-cover types which will be represented in my simulation models. This is a form of modelling in itself, and serves as an abstraction couched in terms of the notion of a plant functional type. That is, plant _species_ which are postulated to be functionally identical as far as the model is concerned are assigned to the same plant functional group. This will be achieved using regular expressions to embellish the data in a pandas dataframe.
4. Apply the Landscape Reconstruction Algorithm (LRA) to the pollen abundance data to infer the _proportion_ of landscape occupied by each plant functional group.
5. Produce, for each of my empirical study sites, time-series of the proportion of landscape occupied for each of the functional groups represented in the model for the duration of time for which there is abundance data for each study site. This will be presented in the form of a `.csv` file and a plot for each study site. 

The only input required to run this notebook is a path to the file `site_pollen_abundance_ts.csv` which is output from [`epd-query`](https://github.com/lanecodes/epd-query).

In [None]:
from pathlib import Path
import os
import sys
import re
from typing import Dict, List

import unidecode

import matplotlib as mpl
from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection

mpl.rcParams['font.family'] = 'CMU Sans Serif'
import matplotlib.pyplot as plt
%matplotlib inline
#import seaborn as sns
from ipywidgets import interact, fixed

import pandas as pd
import numpy as np

In [None]:
TMP_DIR = Path('../tmp')
OUTPUT_DIR = Path('../outputs')

## 1. Load pollen data from file

In [None]:
epd_data = pd.read_csv(TMP_DIR / 'site_pollen_abundance_ts.csv')

In [None]:
epd_data.head()

In [None]:
epd_data.groupby(['sitename', 'e_']).size()

## 2. Explore, consider the limitations of, and clean pollen core data

### Check numbers of samples in each core, narrow core selection

Load data for one of the three Navarres cores

In [None]:
nav_dat = epd_data[epd_data['e_'] == 469]

In [None]:
nav_dat.sort_values(by='count', ascending=False)

We see that much of the pollen recorded in the database for this sediment core corresponds to [pollen spike](https://quantpalaeo.wordpress.com/2017/07/28/pollen-spikes/), with `varcode` values of `conc.spk`. 

In [None]:
nav_dat['sample_'].unique()

More troublingly, Navarres core 469 (NAVA1) has only 15 samples.

In [None]:
nav2 = epd_data[epd_data['e_'] == 470]
nav2['sample_'].unique()

In [None]:
nav3 =  epd_data[epd_data['e_'] == 471]
nav3['sample_'].unique()

Going forward, I'll prefer NAVA3 over NAVA1 and NAVA2 since it contains more samples. If I find something which makes NAVA3 seem unreliable, I may reconsider. For now, drop NAVA1 and NAVA2 from the `epd_data` dataframe.

In [None]:
epd_data = epd_data[~epd_data['e_'].isin([469, 470])]
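
The core-by-core inspection above can also be done in one step by counting the distinct samples in each core. A minimal sketch on toy data, assuming only the `sitename`, `e_` and `sample_` columns used above; the values are invented for illustration and are not taken from the EPD:

```python
import pandas as pd

# Toy stand-in for epd_data; values are invented for illustration only.
toy = pd.DataFrame({
    'sitename': ['Navarres'] * 6,
    'e_': [469, 469, 470, 471, 471, 471],
    'sample_': [1, 1, 7, 20, 21, 22],
})

# Number of distinct samples in each core (EPD entity), largest first.
samples_per_core = (
    toy.groupby(['sitename', 'e_'])['sample_'].nunique()
    .sort_values(ascending=False)
)
print(samples_per_core)
```

Run against the real `epd_data`, the same expression would rank every core at every site, which makes the choice of core explicit rather than something read off by eye.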

### Look at the top pollen-contributing species for each study site, remove pollen spike

In [None]:
def print_top_species(epd_data):
    for ssite in epd_data['sitename'].unique():
        print('\n'+ssite)
        df = epd_data[epd_data['sitename']==ssite]
        df = df.groupby(['var_', 'varcode', 'varname']).agg({'count' : 'sum'})
        print(df.sort_values(by='count', ascending=False).head(5))
    del df

print_top_species(epd_data)

Navarres alone seems to have a lot of pollen spike in it. Also Monte Areo mire and Charco da Candieira have Lycopodium spike added. See [here](https://palynology.wordpress.com/2012/10/07/pollen-spike/) for background on pollen spike. To keep analyses between sites consistent, I will exclude these. 

In [None]:
def remove_varcodes(df: pd.DataFrame, varcodes: List[str]) -> pd.DataFrame:
    """Remove rows corresponding to specified varcodes from epd DF."""
    return df[~df['varcode'].isin(varcodes)]


def remove_varcodes_test_df():
    return pd.DataFrame({
        'varcode': ['goodvar1', 'badvar1', 'badvar2', 'goodvar2'],
        'count': np.random.randint(0, 4000, size=4)
    })


def test_remove_varcodes(test_df):
    res_df = remove_varcodes(test_df, ['badvar1', 'badvar2'])
    assert res_df.iloc[0]['varcode'] == 'goodvar1'
    assert res_df.iloc[1]['varcode'] == 'goodvar2'
    assert len(res_df.index) == 2
    
test_remove_varcodes(remove_varcodes_test_df())   

In [None]:
exclude_pollen_spike = True
if exclude_pollen_spike:
    epd_data = remove_varcodes(
        epd_data, ['Spi/tab', 'Lyc(ad)', 'Lyc(ct)', 'Lyc']
    )

Also note that San Rafael has a significant proportion of Botryococcus in its samples. This is a type of green algae. Since this doesn't correspond to any _land_ plant species, we exclude it, along with other aquatic plants and algae.

In [None]:
aquatic_plant_codes = [
    'Bry',
    'Zyg-T',
    'Spr-T',
    'Pot',      # Potamogeton, aquatic plant
    'Clo.i-T',  # Closterium idiosporum, green algae
    'Spi.cf.s', # Spirogyra cf. scrobiculata, green algae
    'Trl.s',    # Trilete spore(s),  not from modern terrestrial plant
]

exclude_non_land_plants = True
if exclude_non_land_plants:
    epd_data = remove_varcodes(epd_data, aquatic_plant_codes)

There is a lot of moss (Sphagnum) in, e.g., Atxuri. Exclude this.

In [None]:
exclude_mosses = True
if exclude_mosses:
    epd_data = remove_varcodes(epd_data, ['Sph'])

Fungal spores such as Glomus turn up in Navarres. Exclude these.

In [None]:
fungal_species_codes = [
    'Glomus',
    'Pos',  # Polyadosporites, fungal spore http://www.redalyc.org/html/454/45437346003/index.html
]

exclude_fungi = True
if exclude_fungi:
    epd_data = remove_varcodes(epd_data, fungal_species_codes)

Remove records corresponding to pollen which could not be identified

In [None]:
unrecognised_species_codes = [
    'Ind.unkn',  # found in navarres
    'T16C',
]

In [None]:
exclude_unrecognised = True
if exclude_unrecognised:
    epd_data = remove_varcodes(epd_data, unrecognised_species_codes)

At this point, `epd_data` contains entries for all:
1. sediment cores
2. samples (depths/ ages)
3. species (careful to exclude pollen spike)

In [None]:
epd_data

#### Give each site an easily typed `sitecode` to refer to as an index

It will be convenient to be able to refer to sites as an index. To make these easy to type, create a `sitecode` column which transliterates any non-ASCII characters to ASCII, replaces spaces with underscores, and lower-cases the result.

In [None]:
print(epd_data['sitename'].unique())

In [None]:
epd_data['sitecode'] = (
    epd_data['sitename']
    .apply(unidecode.unidecode)
    .str.replace(' ', '_')
    .str.lower()
)

print(epd_data.sitecode.unique())
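
For reference, much the same transformation can be sketched with the standard library alone. This is an alternative, not the approach used in this notebook: `make_sitecode` is a hypothetical helper, and it relies on Unicode decomposition rather than `unidecode`.

```python
import unicodedata


def make_sitecode(sitename: str) -> str:
    """Hypothetical stdlib-only helper (not part of this notebook)."""
    # Split accented characters into base character + combining mark.
    decomposed = unicodedata.normalize('NFKD', sitename)
    # Drop anything which is not plain ASCII (e.g. the combining marks).
    ascii_only = decomposed.encode('ascii', 'ignore').decode('ascii')
    return ascii_only.replace(' ', '_').lower()


print(make_sitecode('Navarrés'))
print(make_sitecode('Charco da Candieira'))
```

Characters with no Unicode decomposition (for example `ø` or `ß`) are silently dropped by this approach rather than transliterated, which is a reason to prefer `unidecode` for site names in general.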

In [None]:
epd_data.head()

#### Drop unnecessary columns

I plan to use `sitecode` as an index going forward because it's natural to think in terms of study sites. This means the other site-level fields are not needed in the dataframe I take forward in my analyses. So that this information can easily be retrieved if need be when debugging, I save it to disk before removing the extra columns.

In [None]:
site_meta_fields = ['sitecode', 'sitename', 'site_', 'sigle', 'e_', 'chron_']
site_meta = epd_data.groupby(site_meta_fields).size().rename('num_records')
site_meta.to_csv(TMP_DIR / 'site_metadata.csv', encoding='utf8', header=True)
epd_data = epd_data.drop(
    [x for x in site_meta_fields if x != 'sitecode'], axis=1
)
epd_data

`sample_` (a database key from the EPD) is also redundant at this point, since we can identify each sample from its `agebp`. Similarly, each variable (pollen species) is uniquely identified by its `varcode`, so we can also drop `var_`.

In [None]:
epd_data = epd_data.drop(['sample_', 'var_'], axis=1)

In [None]:
epd_data.head()

Check `agebp` and `count` can be converted to `int` without loss of data, and do the conversion

In [None]:
def convert_field_to_int(df: pd.DataFrame, field: str) -> pd.DataFrame:
    """Convert named float field to int if no data would be lost."""
    assert df[field].notna().all(), f'missing data found in {field}'
    assert (df[field] == df[field].astype(int)).all(), (
        f'casting {field} to int would cause loss of data'
    )
    # Return a modified copy rather than mutating the caller's dataframe.
    return df.assign(**{field: df[field].astype(int)})

In [None]:
epd_data = (
    epd_data
    .pipe(lambda df: convert_field_to_int(df, 'agebp'))
    .pipe(lambda df: convert_field_to_int(df, 'count'))
)

In [None]:
epd_data

Create a unique index

In [None]:
epd_data = epd_data.set_index(['sitecode', 'agebp', 'varcode']).sort_index()

Unexpectedly, the index turns out not to be unique:

In [None]:
epd_data.index.is_unique

Check which sites duplicates are coming from

In [None]:
epd_data[epd_data.index.duplicated()].groupby(level=['sitecode']).count()

In [None]:
pct_affected = 107 / len(epd_data.loc['charco_da_candieira'].index) * 100
print(f'{pct_affected:.2f}% of charco_da_candieira entries are duplicates')

As less than 1% of Charco da Candieira sample/ species combinations are affected, we will simply assume that where multiple entries are recorded for a species in a single sample, the correct count is obtained by summing the duplicates. No other site's data are affected by this issue.

In [None]:
initial_index_len = len(epd_data.index)

In [None]:
varcode_to_varname_df = (
    epd_data.reset_index(level='varcode')[['varcode', 'varname']]
    .drop_duplicates()
    .set_index('varcode')
)

In [None]:
epd_data = (
    epd_data['count']
    .groupby(level=['sitecode', 'agebp', 'varcode']).sum()
    .to_frame()
    .join(varcode_to_varname_df)
)

In [None]:
assert initial_index_len - len(epd_data.index) == 107
assert epd_data.index.is_unique
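
The logic of the de-duplication above can be seen on a toy example. This is a sketch with invented values (the site codes, varcodes and counts are hypothetical); it shows that summing within each index key removes exactly as many rows as `index.duplicated()` flags.

```python
import pandas as pd

# Toy data: ('site_a', 100, 'Pin') appears twice. Values are invented.
toy = pd.DataFrame({
    'sitecode': ['site_a', 'site_a', 'site_a', 'site_b'],
    'agebp': [100, 100, 100, 50],
    'varcode': ['Pin', 'Pin', 'Que', 'Pin'],
    'count': [10, 5, 7, 3],
}).set_index(['sitecode', 'agebp', 'varcode']).sort_index()

# duplicated() flags every repeat of a key after its first occurrence.
n_dupes = int(toy.index.duplicated().sum())

# Summing within each key collapses the repeats into a single row.
summed = toy['count'].groupby(level=['sitecode', 'agebp', 'varcode']).sum()

print(n_dupes)
print(summed)
```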

Rename `count` to avoid an understandable but irritating namespace collision with the `pd.Series.count` method.

In [None]:
epd_data = epd_data.rename(columns={'count': 'pcount'})

In [None]:
epd_data

In [None]:
epd_data.loc['navarres'].head()

In [None]:
epd_data.groupby(level=['sitecode', 'agebp']).sum()

`epd_data` is now prepped and ready to use for subsequent analyses. Serialise a csv file so it can be retrieved without rerunning the above cells.

In [None]:
epd_data.to_csv(TMP_DIR / 'clean_epd_data.csv')

## 3. TODO Relate identified pollen species with model-dependent plant functional types

In [None]:
try:
    epd_data
except NameError:
    epd_data = (
        pd.read_csv(TMP_DIR / 'clean_epd_data.csv')
        .set_index(['sitecode', 'agebp', 'varcode'])
    )   

Retrieve a list of unique `varname`-s found amongst the sediment cores analysed thus far in the notebook.

In [None]:
unique_species = (
    epd_data.reset_index()[['varname', 'varcode']].drop_duplicates()
    .set_index('varcode')
)
unique_species.head()

### Find the most common species for each study site

The objective is to ensure that approximately 90% of counted pollen is assigned to one of the following land cover type groups:

- Shrubland: includes grasses (Poaceae, formerly Gramineae, family), and juniper (genus Juniperus, belongs to cypress family Cupressaceae).
- Pine forest: anything belonging to the Pinus genus
- Deciduous forest: Beech family, Fagaceae and Chestnut (Castanea genus)
- Oak forest: anything belonging to the Quercus genus

Find percentage of each study site's total contributed by each species. These are the species whose mapping to land cover types are most important.

In [None]:
(
    epd_data.groupby(['sitecode', 'varcode'])['pcount'].sum().to_frame()
    .pipe(lambda df: df.join(df.groupby('sitecode')['pcount']
                             .sum().rename('site_total')))
    .assign(species_pct=lambda df: df['pcount'] / df['site_total'] * 100)
    .drop(columns='site_total')
    .groupby('sitecode')['species_pct'].nlargest(10)
    .reset_index(0, drop=True)
    .to_frame()
    .join(unique_species)
)

### Identify land-cover types with pollen species

The aim in this section is to construct a dictionary whose keys are land cover types included in my simulation models, and whose values are regular expressions which match the names of species contributing to those land cover types. This dictionary will then be used to say: if _this_ pattern is found in a species name, map it to _this_ land cover type.

The following land cover types are included in simulations, but not in the land cover type categories used in this notebook:

1. Water/Quarry
2. Burnt
3. Depleted agricultural land
4. Barley
5. Wheat
6. Transition forest

Land cover types 1-3 above don't produce any pollen. Barley and wheat produce grass pollen. This belongs to the Poaceae (formerly known as Gramineae) family, and is assumed to contribute to 'Shrubland'. There is no depleted agricultural land, barley or wheat land cover present at the beginning of a simulation, as these are anthropogenically induced land cover types.

I don't map pollen to the 'Transition forest' land cover type because this type is a mixture of pine and oak forest. When comparing simulation outputs to empirical pollen abundance, I will assume transition forest simulation cells contribute half a cell of pine forest pollen and half a cell of oak forest pollen. When generating Neutral Landscape models from pollen abundance, I will assume that no cells start off as transition forest, and allow transition forest cells to be introduced by a model 'burn in' period. An alternative proposal might be to change my modelling approach such that I effectively integrate out the transition forest state, so we create the possibility of transitioning directly between pine and oak, subject to the kind of environmental conditions which would support transition forest.
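
The half-and-half treatment of transition forest can be sketched as follows. This is an illustration only: `effective_pollen_cover` and the cell counts are hypothetical, and it assumes that only pollen-producing land cover types are passed in.

```python
from typing import Dict


def effective_pollen_cover(cell_counts: Dict[str, int]) -> Dict[str, float]:
    """Hypothetical helper: share transition forest between pine and oak.

    Each transition forest cell is counted as half a cell of pine forest
    and half a cell of oak forest. Returns proportions summing to 1.
    """
    counts: Dict[str, float] = dict(cell_counts)
    transition = counts.pop('transition_forest', 0)
    counts['pine_forest'] = counts.get('pine_forest', 0) + 0.5 * transition
    counts['oak_forest'] = counts.get('oak_forest', 0) + 0.5 * transition
    total = sum(counts.values())
    return {lct: n / total for lct, n in counts.items()}


# Invented simulation output: numbers of cells in each land cover type.
simulated = {
    'shrubland': 40,
    'pine_forest': 20,
    'oak_forest': 20,
    'deciduous_forest': 10,
    'transition_forest': 10,
}
proportions = effective_pollen_cover(simulated)
print(proportions)
```

With these numbers, the 10 transition forest cells raise pine and oak from 20 to 25 cells each, so both contribute a quarter of the pollen-producing landscape.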

#### Map land cover types to species

In [None]:
def compose_regexs(regexs: List[str]) -> str:
    """Join list of regex patterns.
    
    Resulting pattern will match any one of the input patterns supplied in the
    list.
    """
    return  '|'.join(regexs)


def test_compose_regexs():
    test_patterns = [r'^foo.*', r'.*bar.*']
    test_str1 = 'foo blah blah'  # match
    test_str2 = 'blah bar foo'  # match
    test_str3 = 'blah foo blah'  # no match
    
    regex = compose_regexs(test_patterns)
    assert re.search(regex, test_str1)
    assert re.search(regex, test_str2)
    assert re.search(regex, test_str3) is None
    
test_compose_regexs()

In [None]:
POLLEN_LCT_MAPS = {
    'shrubland': [
        # grasses
        r'.*Poaceae|.*Gramineae|.*Cerealia.',
        # juniper
        r'.*Juniperus',
        # and cypress family
        r'.*Cupressaceae',
        # quillwort (prolific in Sanabria Marsh)
        r'.*Isoetes',
        # Goosefoot family (prolific in e.g. San Rafael)
        r'.*Chenopodiaceae',
        # Mugwort genus (prolific in e.g. San Rafael)
        r'.*Artemisia',
        # flowering plants in the same family as lettuce, dandelions etc
        r'.*Cichorioideae',
        # family of shrubby plants
        r'.*Asteroideae',
        # sedge family (superficially resemble grasses), see e.g. Atxuri
        r'.*Cyperaceae',
        # heather
        r'.*Calluna vulgaris',
        # heather family
        r'.*Erica(ceae|-type|\s)',
        #r'(.*Erica(ceae|-type|\s)|.*Erica-type.*|.*Erica arborea-type.*)',
        #  celery, carrot, parsley family
        r'.*Umbelliferae',
        # celery and marshwort genus
        r'.*Apium',
        # box plant (shrubby tree)
        r'.*Buxus',
        # genus of flowering plants, buttercup genus
        r'.*Ranunculus',
        # dock/ sorrel genus
        r'.*Rumex',
        # bracken/ ferns. Associated with pine forest??
        r'.*Pteridium|.*Polypodium|.*Filicales',
        # genus of gymnosperm shrubs
        r'.*Ephedra',
        # flowering plants found in wet regions
        r'.*Sparganium|.*Typha angustifolia',
        # plantain/ fleawort genus
        r'.*Plantago',
        # olive genus
        r'.*Olea',
    ],
    'pine_forest': [
        # PINE FOREST
        r'\s?Pinus\s?',
    ],
    'deciduous_forest': [
        # Chestnut
        r'.*Castanea',
        # Birch
        r'.*Betula',
        # Beech family
        r'.*Fagaceae',
        # Beech genus
        r'.*Fagus',
        # Alder genus
        r'.*Alnus',
        # Hazel
        r'.*Corylus',
        # Willow
        r'.*Salix',
        # Hornbeam
        r'.*Carpinus',
    ],
    'oak_forest': [
        # OAK FOREST
        r'\s?Quercus\s?',
    ], 
}

Define a function which, given a species name, returns the land cover type it maps to (or `None` if no pattern matches).

In [None]:
def get_lct(species_name: str, pol_lct_dict: Dict[str, str],
            verbose=False) -> str:
    """Given a species name, map it to a land cover type.
    
    Throw a ValueError if species name matches more than one land cover type.
    """
    lcts = []
    for lct_name, regex in pol_lct_dict.items():
        if re.match(regex, species_name, re.IGNORECASE):
            lcts.append(lct_name)
            if verbose:
                print(regex + ' matches ' + species_name)
    
    if len(lcts) > 1:
        raise ValueError('Species name {0} matched multiple land cover type '
                         'regex strings: {1}'.format(species_name, lcts))
    if len(lcts) == 0:
        return None

    return lcts[0]


def test_get_lct():
    regex_to_lct_map = {k: compose_regexs(v) 
                        for k, v in POLLEN_LCT_MAPS.items()}
    
    def species_name_test(species_name, expected_lct):
        determined_lct = get_lct(species_name, regex_to_lct_map)
        assert  determined_lct == expected_lct, (
            f"expected '{expected_lct}' to be lct for species "
            f"'{species_name}' but got '{determined_lct}' instead."
        )
        
    species_name_test('Quercus ilex-type', 'oak_forest')
    species_name_test('Quercus', 'oak_forest')
    species_name_test('Pinus pinaster-type', 'pine_forest')
    species_name_test('Pinus', 'pine_forest')
    species_name_test('Rumex crispus-type', 'shrubland')
    species_name_test('Compositae subf. Cichorioideae', 'shrubland')
    species_name_test('Erica arborea-type', 'shrubland')
    species_name_test('Ericaceae', 'shrubland')
    species_name_test('Polypodium vulgare-type', 'shrubland')
    
test_get_lct()
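
One subtlety worth keeping in mind is that `get_lct` uses `re.match`, which only looks for a match at the start of the string. Patterns prefixed with `.*` can therefore match a genus anywhere in a name, whereas the pine and oak patterns only match names which begin with the genus (optionally preceded by a single space). A small sketch, in which the names `'cf. Pinus'` and `'cf. Juniperus'` are made up for illustration and are not taken from the data:

```python
import re

pine_pattern = r'\s?Pinus\s?'
juniper_pattern = r'.*Juniperus'

# re.match anchors at the start of the string.
starts_with_genus = re.match(pine_pattern, 'Pinus pinaster-type', re.IGNORECASE)
genus_not_first = re.match(pine_pattern, 'cf. Pinus', re.IGNORECASE)

# re.search scans the whole string, so it does find the genus.
found_by_search = re.search(pine_pattern, 'cf. Pinus', re.IGNORECASE)

# A leading '.*' lets re.match skip over any prefix.
wildcard_prefix = re.match(juniper_pattern, 'cf. Juniperus', re.IGNORECASE)

print(bool(starts_with_genus), bool(genus_not_first),
      bool(found_by_search), bool(wildcard_prefix))
```

Whether the stricter behaviour for pine and oak is desirable depends on how qualified names appear in `varname`. It is worth checking the unmapped species for any which contain `Pinus` or `Quercus` but do not start with them.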

Apply `get_lct` to each species included in the chronology

In [None]:
regex_to_lct_map = {k: compose_regexs(v) 
                    for k, v in POLLEN_LCT_MAPS.items()}
unique_species['lct'] = unique_species.varname.apply(
    lambda x: get_lct(x, regex_to_lct_map)
)

In [None]:
mapped_species = unique_species[unique_species['lct'].notnull()]
# Keep the `varcode` index in the file so the mapping can be joined on it later.
mapped_species.to_csv(TMP_DIR / 'species_to_landcover_mapping.csv')

For each study site, find the percentage of pollen contributed by each species to each sample

In [None]:
epd_data = (
    epd_data
    .join(epd_data
          .groupby(level=['sitecode', 'agebp'])['pcount'].sum()
          .rename('sample_tot'))
    .assign(species_pct=lambda df: df['pcount'] / df['sample_tot'] * 100)
    .drop(columns='sample_tot')
)

# Use the absolute deviation so totals above *or* below 100 are caught.
assert ((epd_data.groupby(level=['sitecode', 'agebp'])['species_pct'].sum()
         - 100).abs() < 0.00001).all(), (
    'site/ sample percentage totals should equal 100'
)

In [None]:
epd_data

Add `lct` to index via `unique_species`

In [None]:
epd_data = (
    epd_data
    .join(unique_species.drop(columns='varname'))
    .assign(lct=lambda df: df['lct'].fillna('not_specified'))
    .set_index('lct', append=True)
    .swaplevel(3, 2)
    .sort_index()
)

In [None]:
epd_data

#### Evaluate proportion of pollen, for each study site, accounted for by land-cover type mapping

Aggregate `epd_data` from species level to land cover type level 

In [None]:
total_lct_pct_df = (
    epd_data
    .groupby(level=['sitecode', 'lct'])['pcount'].sum()
    .rename('lct_total_count').to_frame()
    .pipe(lambda df: df.join(df.groupby(level='sitecode')['lct_total_count']
                             .sum().rename('site_total')))
    .assign(site_lct_pct=lambda df: (df['lct_total_count'] 
                                     / df['site_total'] * 100))
    .loc[:, 'site_lct_pct']
    .unstack()
    .loc[:, ['shrubland', 'pine_forest', 'oak_forest', 'deciduous_forest', 'not_specified']]
)

# Use the absolute deviation so totals above *or* below 100 are caught.
assert ((total_lct_pct_df.sum(axis=1) - 100).abs() < 0.00001).all(), (
    'per-site total lct contributions should total 100%'
)

In [None]:
total_lct_pct_df

In [None]:
f, ax = plt.subplots()
total_lct_pct_df.plot(kind='bar', stacked=True, ax=ax)
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))

The proportion of pollen not falling into one of the groups represented in the model is deemed acceptable, i.e. at least 90% of pollen for simulated study sites is attributed to a modelled land cover type.
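
The 90% criterion can be checked programmatically rather than by eye. A minimal sketch on an invented table with the same layout as `total_lct_pct_df`; the site names and percentages are hypothetical:

```python
import pandas as pd

# Invented percentages laid out like total_lct_pct_df (rows sum to 100).
toy_pct = pd.DataFrame(
    {
        'shrubland': [55.0, 30.0],
        'pine_forest': [20.0, 40.0],
        'oak_forest': [15.0, 12.0],
        'deciduous_forest': [6.0, 6.0],
        'not_specified': [4.0, 12.0],
    },
    index=pd.Index(['site_a', 'site_b'], name='sitecode'),
)

# Percentage of pollen attributed to a modelled land cover type.
mapped_pct = 100 - toy_pct['not_specified']

# Sites falling short of the 90% target.
below_target = mapped_pct[mapped_pct < 90]
print(below_target)
```

An empty `below_target` would mean every site meets the target. Here `site_b` is flagged because only 88% of its pollen is mapped.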

Summarise `epd_data` to land cover type level

In [None]:
lct_data = (
    epd_data
    .groupby(level=['sitecode', 'agebp', 'lct'])['species_pct'].sum()
    .unstack().replace(np.nan, 0)
)

In [None]:
lct_data

### Plot pollen data

In [None]:
try:
    pol_df
except NameError:
    pol_df = pd.read_pickle('pol_df.pickle')
    print('pol_df read from file.')

#### For print

In [None]:
def plot_print_chronology(sitename, earliest, latest, figlabel=None, save=False):
    # Extract pollen percentages for the specified site.
    df = pol_df.loc[sitename, :]['group_pct']
    # Keep only samples dated between `latest` and `earliest` years before present.
    df = df[(df.index <= earliest) & (df.index >= latest)]
    
    def tweak_pct_ticks(axis, pct_vals):
        max_pct = int(round(pct_vals.max()*1.1))
        
        def get_increments(maximum):
            # Round up to the next multiple of 4 so that tick values are integers.
            while maximum % 4 != 0:
                maximum += 1
            return [maximum // 4 * i for i in range(5)]
        
        increments = get_increments(max_pct)
        axis.set_xlim(0, increments.pop())
        axis.xaxis.set_ticks(increments)
        
    def make_under_line_polygon(xx, yy, e, l):
        line_vertices = np.column_stack((xx, yy))
        leftmost_corners = np.array([[0, e], [0,l]])
        vertices = np.concatenate((line_vertices, leftmost_corners))
        return Polygon(vertices, closed=True)
    
    pollen_line_colour = '#145D85'
    
    f, axes = plt.subplots(1, len(df.columns), sharey=True)
    for i, group in enumerate(df.columns):
        xx = df[group].values
        yy = df.index.values
        axes[i].plot(xx,yy, color=pollen_line_colour)
        axes[i].set_title(group.title())
        axes[i].set_ylim([latest, earliest])
        tweak_pct_ticks(axes[i], xx)
        
        poly = make_under_line_polygon(xx, yy, earliest, latest)
        p = PatchCollection([poly], alpha=0.4)
        p.set_color(pollen_line_colour)
        axes[i].add_collection(p)
        
        if i == 0:
            axes[i].set_ylabel('yrs BP', fontsize=13)
            if figlabel:
                xticks = axes[i].get_xticks()
                yticks = axes[i].get_yticks()
                xtick_scale = xticks[1]-xticks[0]
                ytick_scale = yticks[1]-yticks[0]

                axes[i].text(-1.15*xtick_scale, latest-0.5*ytick_scale, 
                             figlabel,
                             fontdict = {'weight': 'bold',
                                         'size': 16}
                            )
    
    plt.gca().invert_yaxis()
    plt.subplots_adjust(hspace=0, wspace=0)
    f.text(0.51, 0.02, '% contribution to total pollen sample', ha='center', fontsize=13)
    #plt.suptitle(sitename, y=1.05, fontsize=12)
    
    if save:
        d = os.path.join('plots')
        if not os.path.exists(d):
            os.makedirs(d)

        plt.savefig(os.path.join('plots',
                                 (sitename.replace(' ', '_')+'_'
                                 +str(earliest)+'-'+str(latest)+'.pdf')))

In [None]:
for s in pol_df.index.get_level_values(0).unique():
    print(s)
    plot_print_chronology(s, 15000, 0)

Of these, to my eye, San Rafael looks the most interesting (like there's a lot going on). 

On the other hand, what's going on in Navarres at 6000 years ago with sprouters?

#### Interactive

In [None]:
from bokeh.io import push_notebook, show, output_notebook
from bokeh.layouts import gridplot, widgetbox, column# container for bokeh figure objects
from bokeh.models.widgets import Dropdown
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
output_notebook()

In [None]:
def plot_interactive_chronology(sitename):
    df = pol_df.loc[sitename, :]['group_pct'] #extract pollen percents for specified site
    
    # create a column data source for the plots to share
    source = ColumnDataSource(data=df.reset_index().to_dict('list'))
    
    # container for bokeh figure objects
    plots = [] 
    time_range=None
    
    TOOLS = "ypan,ywheel_zoom"
    
    def get_width(base, factor, plot_num):
        # function to increase width of first plot, since this ends up narrowed
        # due to being the only one with yaxis labels.
        if plot_num > 0:
            return base
        else:
            return int(round(base*(1+factor)))
    
    for i, group in enumerate(df.columns):
        p = figure(tools=TOOLS, plot_width=get_width(150, .25, i), 
                   plot_height=500, y_range=time_range,
                   title=group.title())
        p.line(group, 'agebp', source=source)
        if i == 0:
            p.y_range.flipped = True
            time_range = p.y_range
        else:
            p.yaxis.major_label_text_font_size = '0pt'
                    
        plots.append(p)
   
    p = gridplot([plots])
    t = show(p, notebook_handle=True)
                
    return t

In [None]:
for s in pol_df.index.get_level_values(0).unique():
    print(s)

In [None]:
pol_df.head()

In [None]:
pol_df.loc['Charco da Candieira', :]['count_group_tot']

Earliest date for Charco da Candieira:

In [None]:
def print_daterange(sitename):
    df = pol_df.loc[sitename, :]['group_count']
    latest = df.index.min()
    earliest = df.index.max()
    print('earliest date: {0} yr BP'.format(earliest))
    print('latest date: {0} yr BP'.format(latest))

In [None]:
for s in pol_df.index.get_level_values(0).unique():
    print(s)
    print_daterange(s)
    print('\n')

In [None]:
plot_interactive_chronology(u'Algendar')

In [None]:
pol_df.head()

#### Points of particular interest in time series (discussed in upgrade report)

##### San Rafael 4000 - 8000 yrs BP
Big variation in grasses, shrubs and sprouters around the time it is thought agriculture reached Iberia (6500 yrs BP).

In [None]:
plot_print_chronology(u'San Rafael', 8500, 1000, figlabel='A', save=True)

##### Navarres 6000 - 7000 yrs BP
~ 200 year oscillation in percentages of grass and seeders 6400 - 6800 yrs BP, followed by sudden and sustained increase in sprouters after 6400 yrs BP

In [None]:
plot_print_chronology(u'Navarrés', 10500, 3000, figlabel='B', save=True)

## 4. Apply the LRA to infer land-cover proportion from pollen abundance

## 5. TODO Output plant functional group time-series for each study site

### TEMP Time-series of proportion of total pollen abundance for each plant functional group
- NOTE at present (May 18) I've not implemented the LRA yet so will output pollen _abundance_ between species, rather than using the LRA's method of correcting for the variance in pollen produced by different species.
- This is to get a preliminary model off the ground and should be corrected for as a priority.
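
To illustrate why the correction matters, here is a deliberately simplified sketch of the general idea of weighting counts by relative pollen productivity. It is _not_ the LRA: it ignores pollen dispersal and deposition entirely, `productivity_corrected_proportions` is a hypothetical helper, and the productivity values are invented rather than taken from any published estimates.

```python
from typing import Dict


def productivity_corrected_proportions(
    counts: Dict[str, float], relative_productivity: Dict[str, float]
) -> Dict[str, float]:
    """Hypothetical helper: down-weight prolific pollen producers.

    Each count is divided by the taxon's relative pollen productivity and
    the results are renormalised to sum to 1. Dispersal is ignored.
    """
    weighted = {
        taxon: counts[taxon] / relative_productivity[taxon] for taxon in counts
    }
    total = sum(weighted.values())
    return {taxon: w / total for taxon, w in weighted.items()}


# Invented counts and productivities, for illustration only.
toy_counts = {'pine_forest': 600, 'oak_forest': 300, 'shrubland': 100}
toy_productivity = {'pine_forest': 6.0, 'oak_forest': 3.0, 'shrubland': 1.0}

corrected = productivity_corrected_proportions(toy_counts, toy_productivity)
print(corrected)
```

In this toy case the raw pollen proportions are 0.6, 0.3 and 0.1, but after weighting all three groups occupy an equal share. This is the kind of bias which uncorrected abundance data carries into the preliminary model.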

Load pollen chronologies for study sites, and mappings to land cover classes

In [None]:
if 'epd_data' not in locals():
    import matplotlib 
    import matplotlib.pyplot as plt
    %matplotlib inline
    import pandas as pd
    import numpy as np
    # made in section 2 above
    epd_data = pd.read_pickle('epd_data.pkl')
    
if 'mapped_species' not in locals():
    # made in section 3 above
    mapped_species = pd.read_csv('species_to_landcover_mapping.csv')

In [None]:
print(epd_data.head())
print(epd_data[epd_data.pcount > 2000])

Note the very high counts for san_rafael. Are these realistic?

In [None]:
print(epd_data.shape)

In [None]:
epd_data = (
    epd_data.reset_index()
    .merge(mapped_species.drop('varname', axis=1), on='varcode', how='left')
    .dropna()
)
print(epd_data.shape)
print(epd_data.head())

In [None]:
pollen_abundance = epd_data.groupby(['sitecode', 'e_', 'agebp', 'lct']).sum().unstack(3)
pollen_abundance = pollen_abundance.fillna(0)
pollen_abundance.loc[:,('pcount', 'total')] = pollen_abundance.sum(axis=1)

In [None]:
pollen_abundance.head()

Convert abundance to proportion

In [None]:
for c in pollen_abundance.pcount.columns:
    if c != 'total':
        pollen_abundance.loc[:, ('pprop', c)] = (
            pollen_abundance.loc[:, ('pcount', c)]
            / pollen_abundance.loc[:, ('pcount', 'total')]
        )

In [None]:
pollen_abundance.head()

Have a quick look at the data to check it seems reasonable

In [None]:
fig, ax = plt.subplots()
pollen_abundance.loc[('navarres',  471), 'pprop'].plot(ax=ax)
plt.legend(loc='upper right')

Write processed pollen proportion data to disk

In [None]:
pollen_abundance.to_pickle('pollen_timeseries.pkl')

The above are pollen _proportion_ time series. These can be used as the proportions feeding into a Neutral Landscape Model (NLM). See the `/home/andrew/Dropbox/codes/python/notebooks/modified_random_clusters/implement_modified_random_clusters.html` notebook for details.

#### To move across to MRC notebook
Supposing I start simulating Navarres from 7000 yrs BP, that gives me the following starting proportions:

In [None]:
# Helper function to find the nearest value to a given value in a numpy array
def find_nearest(array, value):
    array = np.asarray(array)
    idx = (np.abs(array - value)).argmin()
    return array[idx]

In [None]:
nav_dat = pollen_abundance.loc[('navarres',  471), 'pprop']
nav_initial = nav_dat.loc[find_nearest(nav_dat.index.values, 7000)]
print(nav_initial)
# print(find_nearest(nav_dat.index.values, 7000))

## Jan 2019 -- Add time derivatives to timeseries dataframe

In [None]:
pollen_abundance = pd.read_pickle('pollen_timeseries.pkl')

In [None]:
pollen_abundance.head()

In [None]:
pollen_abundance.shape

In [None]:
pollen_abundance['pprop'].xs('albufera_alcudia', level='sitecode').head()

The next step is to work out how to calculate, for each core, the pollen proportion slope with respect to the agebp index. This can be gathered as a new dataframe with the same MultiIndex as `pollen_abundance['pprop']`. This can then be joined back into `pollen_abundance` as `pollen_abundance['pprop_prime']`. The gradient of this will give `pollen_abundance['pprop_prime_prime']`
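
One possible way to do this is with `numpy.gradient`, which accepts the sample coordinates and so copes with unevenly spaced ages. The sketch below uses an invented two-site table laid out like `pollen_abundance['pprop']`, but indexed only by `sitecode` and `agebp` for brevity. It assumes every core has at least two samples, since `numpy.gradient` raises an error otherwise.

```python
import numpy as np
import pandas as pd

# Invented proportions with unevenly spaced ages; not real data.
toy = pd.DataFrame(
    {
        'pine_forest': [0.10, 0.20, 0.40, 0.20, 0.30],
        'oak_forest': [0.90, 0.80, 0.60, 0.80, 0.70],
    },
    index=pd.MultiIndex.from_tuples(
        [('site_a', 0), ('site_a', 100), ('site_a', 300),
         ('site_b', 0), ('site_b', 50)],
        names=['sitecode', 'agebp'],
    ),
)

pieces = []
for sitecode, core in toy.groupby(level='sitecode'):
    ages = core.index.get_level_values('agebp').to_numpy(dtype=float)
    # Slope of each column with respect to age, within this core only.
    grads = {col: np.gradient(core[col].to_numpy(), ages)
             for col in core.columns}
    pieces.append(pd.DataFrame(grads, index=core.index))

# Same MultiIndex as the input, so it can be joined back on.
pprop_prime = pd.concat(pieces)
print(pprop_prime)
```

Because the gradient is taken with respect to `agebp`, a positive slope means the proportion is larger further back in time, i.e. it has _decreased_ towards the present. Applying the same loop to `pprop_prime` would give the second derivative.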

In [None]:
help(pollen_abundance.index)

## Appendices

### Rough working

#### Correlations between variables

Let's look at how the counts of different groups correlate with each other within each study site through time.

In [None]:
top_epd_data.head()

In [None]:
df['group_pct']['grass'].loc['Sanabria Marsh'].head()

In [None]:
import seaborn as sns

In [None]:
sns.pairplot(df.loc['Sanabria Marsh', :]['count_group_tot'])

Are there clusters we can find in the counts of different pollen? Could investigate using KNN.

In [None]:
pollen_conts.head()

In [None]:
pollen_conts['pollen_pct']['mean'].sort_values(ascending=False)

In [None]:
nav_pine = nav_dat[nav_dat.varname=='Pinus']

In [None]:
nav_pine['count'].hist()

In [None]:
nav_pine['pollen_pct'].hist()

In [None]:
nav_dat[nav_dat.varname=='Concentration spikes'].pollen_pct.plot()

### TODO General theory to look up

Looking at understanding pollen spikes
https://quantpalaeo.wordpress.com/2017/07/28/pollen-spikes/

Calculating deposition rates
http://www.europeanpollendatabase.net/wiki/lib/exe/fetch.php?media=epd_age-depth.pdf

Lithology (depth) and C14 (time) data, or (equivalently??) the `depthcm` and `age` columns from the `agebasis` table, could be used to calculate sediment deposition rates.
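
As a starting point, the deposition rate between two dated depths is simply the change in depth divided by the change in age. A minimal sketch, in which `deposition_rates` is a hypothetical helper and the depth/age pairs are invented. It assumes age increases strictly with depth, and a proper age-depth model would be needed for anything beyond a rough estimate.

```python
from typing import List, Sequence, Tuple


def deposition_rates(
    depth_cm: Sequence[float], age_bp: Sequence[float]
) -> List[Tuple[float, float]]:
    """Hypothetical helper: piecewise-linear deposition rates.

    Returns one (interval midpoint age in yr BP, rate in cm per year) pair
    for each pair of consecutive dated depths.
    """
    pairs = sorted(zip(depth_cm, age_bp))
    rates = []
    for (d0, a0), (d1, a1) in zip(pairs, pairs[1:]):
        rates.append(((a0 + a1) / 2, (d1 - d0) / (a1 - a0)))
    return rates


# Invented age-depth control points, for illustration only.
toy_depth_cm = [0.0, 50.0, 150.0]
toy_age_bp = [0.0, 1000.0, 5000.0]

rates = deposition_rates(toy_depth_cm, toy_age_bp)
print(rates)
```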

In [None]:
epd.ssites.append(762)
epd.ssites.append(1260)
epd.ssites.append(76)
epd.ssites.append(560)

In [None]:
epd.ssites