# Find top deciduous species

While developing the AgroSuccess simulation model I needed to quantify how many years different species' seeds remain viable in the [soil seed bank](https://en.wikipedia.org/wiki/Soil_seed_bank).

To do this I use the soil seed bank longevity (`SoilSeedBank`) field from the BROT database:

G. Pausas, Juli; Tavşanoğlu, Çağatay (2018): BROT 2.0: A functional trait database for Mediterranean Basin plants. figshare. Collection. https://doi.org/10.6084/m9.figshare.c.3843841.v1

In [None]:
import functools
import re
from typing import Dict

import pytest

import pandas as pd

from taxa import POLLEN_LCT_MAPS, compose_regexs

In [None]:
BROT_URLS = {
    k: 'https://ndownloader.figshare.com/files/' + v
    for k, v in {'data': '11194784', 'synonymous': '11194793',
                 'sources': '11194787', 'taxa': '11194790'}.items()
}

In [None]:
read_brot_data_csv = functools.partial(pd.read_csv, encoding='latin1')
read_brot_sources_csv = functools.partial(pd.read_csv, encoding='utf-8')

In [None]:
seed_bank_df = (
    read_brot_data_csv(BROT_URLS['data'])
    .pipe(lambda df: df[df['Trait'] == 'SoilSeedBank'])
    .drop(columns='Trait')
)
seed_bank_df.head()

In [None]:
taxa_regex_dict = {
    k: compose_regexs(x.regex for x in v)
    for k, v in POLLEN_LCT_MAPS.items()
    if k in ['deciduous_forest', 'pine_forest', 'oak_forest']
}
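
`compose_regexs` is imported from the local `taxa` module and its implementation is not shown in this notebook. As a rough sketch of what combining several patterns into one could look like, the hypothetical helper `compose_patterns` below joins them into a single alternation. The helper name and the example patterns are assumptions for illustration, not the actual `taxa` code.

```python
import re
from typing import Iterable


def compose_patterns(patterns: Iterable[str]) -> str:
    """Hypothetical stand-in: join patterns into one alternation."""
    # Wrap each pattern in a non-capturing group so `|` applies to the
    # whole pattern rather than to its last component.
    return '|'.join(f'(?:{p})' for p in patterns)


# Invented patterns, not taken from POLLEN_LCT_MAPS.
combined = compose_patterns(['Pinus .*', 'Abies .*'])

pine_matches = bool(re.match(combined, 'pinus pinaster', re.IGNORECASE))
oak_matches = bool(re.match(combined, 'Quercus ilex', re.IGNORECASE))
print(combined, pine_matches, oak_matches)
```

With one combined pattern per land cover type, each taxon name only needs to be tested once per group.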

In [None]:
class MultipleMatchError(Exception):
    pass

In [None]:
def string_to_group(string: str, regex_dict: Dict[str, str]) -> str:
    """Match input string to a group.
    
    Returns the name of the matching group, or `None` if no group matches.
    Raises a `MultipleMatchError` if `string` matches more than one group.
    
    Parameters
    ----------
    string: String to assign to a group
    regex_dict: k: v pairs where k is the name of a group and v is a regex
        pattern which matches any of the strings which belong to the group
    """
    matches = [group for group, regex in regex_dict.items()
               if re.match(regex, string, re.IGNORECASE)]
    if len(matches) > 1:
        raise MultipleMatchError(f"'{string}' matched multiple patterns")
    if len(matches) == 1:
        return matches[0]
    return None

In [None]:
def test_string_to_group():
    assert string_to_group('foo',
                           {'group1': '.*foo', 'group2': '.*bar'}) == 'group1'
    assert string_to_group('bar',
                           {'group1': '.*foo', 'group2': '.*bar'}) == 'group2'
    # A string matching both patterns should raise rather than return
    with pytest.raises(MultipleMatchError):
        string_to_group('foo bar',
                        {'group1': '.*foo', 'group2': '.*bar'})

test_string_to_group()
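
The patterns in the test start with `.*` because `re.match` only matches at the beginning of a string. A minimal sketch of the difference between `re.match` and `re.search`, using made-up taxon strings rather than BROT records:

```python
import re

# Hypothetical pattern and strings, chosen only for illustration.
pattern = 'quercus'

# re.match anchors at the start of the string.
anchored = re.match(pattern, 'Quercus ilex', re.IGNORECASE)
# The same pattern fails when the genus is not at the start...
not_at_start = re.match(pattern, 'cf. Quercus ilex', re.IGNORECASE)
# ...whereas re.search scans the whole string.
searched = re.search(pattern, 'cf. Quercus ilex', re.IGNORECASE)

print(anchored is not None, not_at_start is not None, searched is not None)
```

Since `string_to_group` uses `re.match`, any pattern meant to match text later in a taxon name needs a leading `.*` or similar.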

In [None]:
taxa_to_lct = functools.partial(string_to_group, regex_dict=taxa_regex_dict)

In [None]:
seed_bank_df = (
    seed_bank_df.assign(lct=lambda df: df['Taxon'].apply(taxa_to_lct))
    .pipe(lambda df: df[~df['lct'].isna()])
)
seed_bank_df

The deciduous species in BROT (alder, hazel and beech) are all 'transient', meaning:

```
no soil seed bank; seeds germinate in the first favorable season after dispersal. Normally seed bank longevity ≤ 1 yr (no persistent seed bank).
```

## Identify papers to read

In [None]:
seed_bank_df

Number of species discussed in papers, by source and land cover type.

In [None]:
(
    seed_bank_df
    .groupby(by=['SourceID', 'lct'])['Data']
    .count().unstack()
    .fillna(0).astype(int)
)
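
To make the shape of that table concrete, here is the same `groupby`/`unstack` chain applied to a small made-up frame. The `SourceID` values and records below are invented for illustration and do not come from BROT.

```python
import pandas as pd

# Hypothetical records; the SourceID, lct and Data values are invented.
toy_df = pd.DataFrame({
    'SourceID': ['A2001', 'A2001', 'A2001', 'B2005'],
    'lct': ['pine_forest', 'pine_forest', 'oak_forest', 'oak_forest'],
    'Data': ['transient', 'persistent', 'transient', 'transient'],
})

counts = (
    toy_df
    .groupby(by=['SourceID', 'lct'])['Data']
    .count().unstack()       # one row per source, one column per lct
    .fillna(0).astype(int)   # missing combinations become 0
)
print(counts)
```

Note that `count()` tallies records, so a source that reports the same taxon more than once is counted more than once. Calling `nunique()` on the `Taxon` column instead would count distinct taxa.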

Initially I planned to use `CCatalonia2008` as a reference, but it turns out that this key just refers to an email address (`Espelta, J. M., Rodrigo, A. (anselm.rodrigo@uab.es)`). I have not had good experiences trying to get information out of UAB in the past. Specifically, I was not able to obtain Espelta's PhD thesis either from him or from the UAB library, even though it was the canonical source of a dataset discussed in `Zavala2000`. I'll choose a different source to provide evidence for oak species' seed longevity.

In [None]:
papers_to_read = {
    'pine_forest': [
        ('Reyes2002b',
         'Discusses Pinus pinaster, suggests persistent seeds',
         None),
        ('Vega2008',
         'Discusses Pinus pinaster, suggests transient seeds',
         None),
    ],
    'oak_forest': [
        ('CCatalonia2008',
         'Discusses 3 oak species including ilex, as well as deciduous',
         'No paper listed, only email address'),
        ('Trabaud1997',
         'Claims to discuss three oak species including Q. ilex, uses measurement',
         None),
    ],
    'deciduous_forest': [
        ('Olano2002',
         ('Discusses two different deciduous species, data determined using '
          'high accuracy measurement'),
         None),
    ],
}

In [None]:
paper_df = (
    pd.DataFrame(
        [(k, paper[0], paper[1])
         for k, paper_list in papers_to_read.items()
         for paper in paper_list
         # Keep only papers with no recorded reason for exclusion
         if paper[2] is None],
        columns=['lct', 'SourceID', 'Reason']).set_index('SourceID')
    .join(
        read_brot_sources_csv(BROT_URLS['sources']).set_index('ID'),
        how='left'
    )
)
paper_df
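
Because this is a left join, a `SourceID` with no counterpart in the sources table is kept with missing values rather than dropped. A small sketch of how such rows could be detected, using invented IDs and a made-up reference string (only the `ID` and `FullSource` column names are taken from this notebook):

```python
import pandas as pd

# Hypothetical papers and sources tables for illustration only.
papers = pd.DataFrame(
    {'lct': ['pine_forest', 'oak_forest']},
    index=pd.Index(['Known2001', 'Missing1999'], name='SourceID'))
sources = pd.DataFrame(
    {'FullSource': ['Author, A. (2001). A made-up reference.']},
    index=pd.Index(['Known2001'], name='ID'))

joined = papers.join(sources, how='left')

# Rows whose FullSource is NaN had no match in the sources table.
unmatched = joined[joined['FullSource'].isna()].index.tolist()
print(unmatched)
```

An empty `unmatched` list would indicate that every selected paper has a full reference available.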

In [None]:
for i, row in paper_df.iterrows():
    print(f"{i}: {row['FullSource']}\n")