# Extract succession pathway data published by Millington et al. 2009

In their paper, [Millington, Wainwright and Perry (2009)](https://doi.org/10.1016/j.envsoft.2009.03.013) describe a landscape fire-succession model which represents the various ways in which landscape vegetation can evolve under different environmental conditions. An important component of this model is a representation of the different pathways along which a particular patch of the landscape might evolve, contingent on other environmental variables arising endogenously within the model. For example, during the course of secondary succession following a wildfire, a patch of shrubland might (all else being equal) transform into a deciduous forest under hydric (wet) conditions, or a pine forest under xeric (dry) conditions. Alternatively, succession pathways might be disturbance-mediated: if fires are infrequent, incumbent resprouting oak trees may regenerate into an oak forest once more, whereas frequent fire may favour pine species whose seeds lie dormant in the soil awaiting stand-clearing fires to reduce light competition.

These succession pathways (among many other possibilities) are represented in Fig. 2 of the referenced [paper](https://doi.org/10.1016/j.envsoft.2009.03.013). They are also provided as a table in the paper's supplementary materials in a file called `1-s2.0-S1364815209000863-mmc1.doc`. In this short notebook we perform some rudimentary data cleansing to extract this data and record it in a more easily machine-readable `.csv` format.

In [1]:
import pandas as pd

Using LibreOffice (in headless mode), convert the supplementary materials document into an `.html` document which can be easily consumed by standard data analysis tools.

In [None]:
# Name of the supplementary materials file, expected in the working directory
millington_table_src = "1-s2.0-S1364815209000863-mmc1.doc"

In [22]:
%%bash
soffice --convert-to "html:XHTML Writer File:UTF8" 1-s2.0-S1364815209000863-mmc1.doc

1-s2.0-S1364815209000863-mmc1.doc
calculating-disturbance-frequencies.pdf
development-notes.org~
extract_Millington_succession_pathway.bup.html
extract_Millington_succession_pathway.html
extract_Millington_succession_pathway.ipynb
landcover-properties.csv
make_cypher.py
make_cypher.py~
make_cypher.pyc
Millington_succession.csv
Millington-Thesis-Transition-Table.pdf
model-specification.odp
test
#test_make_cypher.py#
test_make_cypher.py
test_make_cypher.py~
test_make_cypher.pyc
transition_table_to_cypher.py
transition_table_to_cypher.py~
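
The same conversion could also be driven from Python rather than a `%%bash` cell. The following is a minimal sketch, not the method used in this notebook: the helper names `build_convert_command` and `convert_to_html` are hypothetical, and it assumes the `soffice` executable is on the `PATH`. The filter string is the one used in the cell above.

```python
import shutil
import subprocess


def build_convert_command(doc_path, out_dir="."):
    """Return the argument list for a headless LibreOffice HTML conversion."""
    return [
        "soffice",
        "--headless",
        "--convert-to", "html:XHTML Writer File:UTF8",
        "--outdir", out_dir,
        doc_path,
    ]


def convert_to_html(doc_path, out_dir="."):
    """Run the conversion if soffice is installed; return True on success."""
    if shutil.which("soffice") is None:
        # LibreOffice is not available, so there is nothing to run
        return False
    result = subprocess.run(build_convert_command(doc_path, out_dir))
    return result.returncode == 0
```

Keeping the command construction separate from its execution makes it possible to inspect the exact command line without LibreOffice being installed.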


Read in output `.html` file as a string

In [25]:
with open('1-s2.0-S1364815209000863-mmc1.html', 'r') as f:
    html_string = f.read()

print(html_string[:500])
print('\n...\n')
print(html_string[-500:])

IOError: [Errno 2] No such file or directory: '1-s2.0-S1364815209000863-mmc1.html'

While not immediately obvious from the above sample, the HTML stored in `html_string` contains a `<table>` element holding our data. The `pandas` function `read_html` extracts every table it finds in the document and returns them as a list of dataframes.

In [7]:
df_list = pd.read_html(html_string)

In [8]:
print(df_list[0].head())
print('\n...\n')
print(df_list[0].tail())

           0           1       2     3    4          5      6   7   8
0  Start (S)  Succession  Aspect  Pine  Oak  Deciduous  Water  D  T
1          1           1       0     1    0          0      0   1   0
2          1           1       0     1    0          0      1   1   0
3          1           1       0     1    0          0      2   1   0
4          1           1       1     1    0          0      0   1   0

...

      0  1  2  3  4  5  6  7  8
747  11  0  1  0  1  1  2  5  2
748  11  0  1  1  0  0  2  5  2
749  11  0  1  1  0  1  2  5  2
750  11  0  1  1  1  0  2  5  2
751  11  0  1  1  1  1  2  5  2


This table encodes all the information which is needed by a simulation model to determine the succession trajectory of a patch of landscape under a particular set of conditions (see [Millington, Wainwright and Perry (2009)](https://doi.org/10.1016/j.envsoft.2009.03.013) for details). Briefly:

- **Start**: The code of the land cover class (given in the paper) which a patch of vegetation is in at present. Includes e.g. Pine, Oak, Shrubland, or burnt.
- **Succession**: Whether the cell is undergoing secondary succession (1) or regeneration succession (0). This is determined by whether or not there were mature resprouters present in the cell prior to disturbance. If mature resprouters were present the cell undergoes regeneration succession, otherwise it undergoes secondary succession.
- **Aspect**: The (relative) abundance of light available to a cell. This is relevant because of a known tendency for pine saplings to struggle in low light conditions.
- **Pine, Oak, and Deciduous**: columns specify whether Pine, Oak and/or Deciduous seeds are present in a cell.
- **Water**: The (relative) abundance of water available in the cell.
- $\Delta D$: The *direction of transition*; i.e. the land-cover class which the cell is on track to transition into.
- $\Delta T$: The *time required to complete transition*; i.e. given a particular transition trajectory, the length of time the cell will sit in its current state before transitioning.
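
To illustrate how a simulation model might consult this table, the sketch below indexes rows by their seven condition columns and returns the corresponding $\Delta D$ and $\Delta T$. It is an illustration only: the function names `build_lookup` and `next_transition` are hypothetical, and the sample data are just the first four data rows printed in the output above.

```python
# Column order: start, succession, aspect, pine, oak, deciduous, water, D, T
SAMPLE_ROWS = [
    (1, 1, 0, 1, 0, 0, 0, 1, 0),
    (1, 1, 0, 1, 0, 0, 1, 1, 0),
    (1, 1, 0, 1, 0, 0, 2, 1, 0),
    (1, 1, 1, 1, 0, 0, 0, 1, 0),
]


def build_lookup(rows):
    """Map each tuple of seven condition values to its (D, T) outcome."""
    table = {}
    for row in rows:
        conditions, outcome = row[:7], row[7:]
        table[conditions] = outcome
    return table


def next_transition(table, start, succession, aspect, pine, oak, deciduous, water):
    """Return (D, T) for the given conditions, or None if no row matches."""
    key = (start, succession, aspect, pine, oak, deciduous, water)
    return table.get(key)


lookup = build_lookup(SAMPLE_ROWS)
print(next_transition(lookup, 1, 1, 0, 1, 0, 0, 2))
```

A dictionary keyed on the full set of conditions assumes each combination appears at most once in the table; if duplicates existed, later rows would silently overwrite earlier ones.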

As a final data cleansing step, we rename the columns so that the data can be stored in ASCII format, drop the header row that was read in as data, and write the result to a CSV file named `Millington_succession.csv`.

In [10]:
df.head()

       start  succession  aspect  pine  oak  deciduous  water  delta_D  delta_T
0  Start (S)  Succession  Aspect  Pine  Oak  Deciduous  Water        D        T
1          1           1       0     1    0          0      0        1        0
2          1           1       0     1    0          0      1        1        0
3          1           1       0     1    0          0      2        1        0
4          1           1       1     1    0          0      0        1        0


In [13]:
df = df_list[0].copy()
df.columns = ['start', 'succession', 'aspect', 'pine', 'oak', 'deciduous', 'water', 'delta_D', 'delta_T']
# Drop the header row that read_html parsed as data; drop() returns a new frame, so reassign
df = df.drop(0)
df.to_csv('Millington_succession.csv', index=False, encoding='ascii')
print(df.head())

  start succession aspect pine oak deciduous water delta_D delta_T
1     1          1      0    1   0         0     0       1       0
2     1          1      0    1   0         0     1       1       0
3     1          1      0    1   0         0     2       1       0
4     1          1      1    1   0         0     0       1       0
5     1          1      1    1   0         0     1       1       0


Remove intermediate `.html` file

In [14]:
%%bash
rm '1-s2.0-S1364815209000863-mmc1.html'

**Query**: what do land cover classes 7, 8, 10 and 11 given in the supplementary materials correspond to? They all transition *to* shrubland (land cover class 5) and so are similar to the 'burnt' class, but differ in succession pathway and in the time spent in the class before transitioning to shrubland.

## Interpreting land cover classes

The land cover classes whose transitions are specified in the table are described in James Millington's PhD thesis. They are described as follows:

In [15]:
state_vals = pd.DataFrame({
    'state_num': range(1,12),
    'Land-cover': ['Pine', 'Transition Forest', 'Deciduous', 
                   'Holm Oak', 'Pasture', 'Holm Oak with Pasture',
                   'Cropland', 'Shrubland', 'Water/Quarry', 'Urban',
                   'Burnt']
    })

In [16]:
print(state_vals)

               Land-cover  state_num
0                    Pine          1
1       Transition Forest          2
2               Deciduous          3
3                Holm Oak          4
4                 Pasture          5
5   Holm Oak with Pasture          6
6                Cropland          7
7               Shrubland          8
8            Water/Quarry          9
9                   Urban         10
10                  Burnt         11


Mapping these to the land cover types which are relevant for my PhD we find:

In [17]:
map_df = pd.DataFrame({
    'code': ['WaterQuarry', 'Burnt', 'Barley', 'Wheat', 'DAL', 'Pine',
             'TransForest', 'Deciduous', 'Oak', 'Shrubland'],
    'state_num': [9, 11, 7, 7, 7, 1, 2, 3, 4, 5]
    })
print(map_df)

          code  state_num
0  WaterQuarry          9
1        Burnt         11
2       Barley          7
3        Wheat          7
4          DAL          7
5         Pine          1
6  TransForest          2
7    Deciduous          3
8          Oak          4
9    Shrubland          5


Note that Barley, Wheat and DAL (Depleted Agricultural Land) all correspond to the same state in the Millington2009 model: cropland. This part of the model may need refining for my purposes, but for the time being I will assume that all three of these land cover types behave in the same way for succession purposes.
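
One consequence of this many-to-one mapping is worth keeping in mind: a left merge against the mapping repeats every cropland row once per matching code. The sketch below shows the effect on a small hypothetical table; the `delta_T` values are invented for illustration and are not taken from the supplementary data.

```python
import pandas as pd

# Subset of the mapping: three codes share state 7 (cropland)
mapping = pd.DataFrame({
    'code': ['Barley', 'Wheat', 'DAL', 'Pine'],
    'state_num': [7, 7, 7, 1],
})

# Hypothetical transition rows, one starting in cropland and one in pine
table = pd.DataFrame({'start': [7, 1], 'delta_T': [2, 0]})

merged = pd.merge(table, mapping, how='left',
                  left_on='start', right_on='state_num')

# The single cropland row now appears three times, once per code
print(merged[['start', 'code', 'delta_T']])
```

Row counts after such a merge can therefore exceed the row count of the original table, which is expected here rather than a sign of duplicated source data.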

In [18]:
print(len(df.index))

751


In [19]:
# Attach the PhD land cover code for the starting state, then for the target state
tmp_df = pd.merge(df, map_df, how='left', left_on='start', right_on='state_num')
tmp_df = tmp_df.rename(columns={'code': 'start_code'})
tmp_df = pd.merge(tmp_df, map_df, how='left', left_on='delta_D', right_on='state_num')
tmp_df = tmp_df.rename(columns={'code': 'end_code'})
print('No. records before dropping rows not corresponding to PhD states: ' + str(len(tmp_df.index)))
# Keep only rows whose start and end states both map onto a PhD land cover code
tmp_df = tmp_df[(tmp_df.start_code.notnull() & tmp_df.end_code.notnull())]
print('No. records after dropping rows not corresponding to PhD states: ' + str(len(tmp_df.index)))
df = tmp_df[['start_code', 'end_code', 'succession', 'aspect', 'pine', 'oak', 'deciduous', 'water', 'delta_T']]

No. records before dropping rows not corresponding to PhD states: 751
No. records after dropping rows not corresponding to PhD states: 0


In [20]:
print(df[['start_code', 'end_code']].drop_duplicates())

Empty DataFrame
Columns: [start_code, end_code]
Index: []
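
An empty result like the one above is consistent with a dtype mismatch, although this is an assumption rather than something verified here. Because the header row was parsed as data, every column of the table may hold strings such as `'1'`, while `state_num` holds integers. Depending on the pandas version, merging such columns either finds no matches or raises an error. The sketch below uses a hypothetical two-row stand-in for the table and shows that casting to integers before merging restores the matches.

```python
import pandas as pd

# Hypothetical stand-in for the succession table, with string-valued columns
raw = pd.DataFrame({'start': ['1', '11'], 'delta_D': ['1', '5']})

mapping = pd.DataFrame({
    'code': ['Pine', 'Burnt', 'Shrubland'],
    'state_num': [1, 11, 5],
})

# The key column is not numeric, so it cannot match the integer state_num
print(pd.api.types.is_integer_dtype(raw['start']))

# Cast every column to int so that the merge keys are comparable
typed = raw.astype(int)
merged = pd.merge(typed, mapping, how='left',
                  left_on='start', right_on='state_num')
print(merged[['start', 'code']])
```

Checking `df.dtypes` before merging is a cheap way to confirm or rule out this explanation.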


NOTE: there's a potential issue here because no state appears to transition TO Deciduous. Presumably I've dropped something important here. Something to check up on.

In [17]:
print(df.aspect.unique())

[0 1]


## Build Cypher query files
The next step is to convert the data in the above table into Cypher files which can be loaded into the database. Each of these files will have the same basic structure:
1. Header (file description comments, priority)
2. Query creating succession trajectory
3. Sequence of several queries establishing all combinations of environmental conditions which CAUSE that succession trajectory to take place. 

I will now develop functions which will create each of these components.
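
As a rough picture of how the three components might fit together, here is a minimal sketch. It is not the contents of `make_cypher.py`: all function names are hypothetical, the node labels and relationship types are borrowed from the query output shown further down, the header omits the priority field, and condition values are written out as raw table codes rather than descriptive strings.

```python
import datetime


def make_header(start_code, end_code, today=None):
    """Component 1: comment lines describing the file."""
    today = today or datetime.date.today()
    return (
        "// Succession trajectory: {0} -> {1}\n"
        "// Generated on {2}\n"
    ).format(start_code, end_code, today)


def make_trajectory_query(start_code, end_code):
    """Component 2: create the trajectory between two land cover types."""
    return (
        'MATCH (src:LandCoverType {{code:"{0}", model_ID:$model_ID}}),\n'
        '      (tgt:LandCoverType {{code:"{1}", model_ID:$model_ID}})\n'
        'MERGE (src)<-[:SOURCE]-(traj:SuccessionTrajectory '
        '{{model_ID:$model_ID}})-[:TARGET]->(tgt);\n'
    ).format(start_code, end_code)


def make_condition_query(start_code, end_code, cond):
    """Component 3: link one set of conditions to the trajectory."""
    props = ", ".join("{0}:{1}".format(k, v) for k, v in sorted(cond.items()))
    return (
        'MATCH (:LandCoverType {{code:"{0}", model_ID:$model_ID}})'
        '<-[:SOURCE]-(traj:SuccessionTrajectory {{model_ID:$model_ID}})'
        '-[:TARGET]->(:LandCoverType {{code:"{1}", model_ID:$model_ID}})\n'
        'MERGE (ec:EnvironCondition {{{2}}})\n'
        'MERGE (ec)-[:CAUSES]->(traj);\n'
    ).format(start_code, end_code, props)


def make_cypher_file(start_code, end_code, conditions, today=None):
    """Assemble header, trajectory query and one query per condition set."""
    parts = [make_header(start_code, end_code, today),
             make_trajectory_query(start_code, end_code)]
    parts.extend(make_condition_query(start_code, end_code, c)
                 for c in conditions)
    return "\n".join(parts)


example = make_cypher_file("Pine", "Pine", [{"succession": 1, "aspect": 0}])
print(example)
```

In this sketch each condition query places its `MATCH` clause before the `MERGE` clauses, which avoids having to separate a `MERGE` from a following `MATCH` with a `WITH` clause.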

In [18]:
import datetime
str(datetime.date.today())

'2018-05-01'

In [19]:
from make_cypher import get_env_cond_query

In [21]:
get_env_cond_query(df.iloc[10])

'\n    MERGE \n      (ec:EnvironCondition {succession:"secondary", \n                             aspect:"north", \n                             pine:{2},\n                             oak:{3},\n                             deciduous:{4},\n                             water:"hydric",\n                             delta_t:"0"})\n    MATCH \n      (:LandCoverType {code:"Pine", model_ID:$model_ID})\n      <-[:SOURCE]-(traj:SuccessionTrajectory {model_ID:$model_ID})-[:TARGET]->\n      (:LandCoverType {code:"Pine", model_ID:$model_ID}) \n    MERGE \n      (ec)-[:CAUSES]->(traj);\n    '

In [23]:
pd.to_pickle(df, 'traj.pkl')