Final check for internal documentation.
Small changes and additions to both the GitHub main readme (README.md) & documentation (index.rst).
winandh committed Oct 2, 2019
1 parent 29070f4 commit 89f35fe
Showing 16 changed files with 132 additions and 176 deletions.
23 changes: 12 additions & 11 deletions README.md
Original file line number Diff line number Diff line change
@@ -8,20 +8,21 @@

An object-oriented Python package for species distribution modelling using deep learning.
The package allows for a more intuitive and easy exploration of biodiversity patterns by
modelling preferences for a great number of environmental variables.

Instructions for installing and using the sdmdl package can be found [here](docs/index.rst).

### Case study

The functionality of this package and the estimates of environmental preferences it
obtains is demonstrated by way of several use cases:
obtains is demonstrated by way of a use case on domesticated crops and their wild progenitors.

* Habitat suitability predictions with which to scale visual object identifications
* Domesticated crops and their wild progenitors
* Mycorrhizal associations
* Secondary woodiness
The raw uninterpreted results of this case study can be found [here](https://zenodo.org/record/3460718#.XYuBJEYzaCo).

The project builds on the previous results obtained by:
### Acknowledgments

- [MAXENT modelling](https://github.com/naturalis/trait-geo-diverse-ungulates) by Elke Hendrix.
- [Deep learning](https://github.com/naturalis/trait-geo-diverse-dl) by Mark Rademaker.
- [Comparative analysis of abiotic niches in Ungulates](https://github.com/naturalis/trait-geo-diverse-ungulates) by E. Hendrix.
- [Ecological Niche Modelling Using Deep Learning](https://github.com/naturalis/trait-geo-diverse-dl) by M. Rademaker.

### Package layout

@@ -30,7 +31,7 @@ The project builds on the previous results obtained by:
- [README.md](README.md) - the README file, which you are now reading
- [requirements.txt](requirements.txt) - prerequisites to install this package, used by pip
- [setup.py](setup.py) - installer script
- [data](data)/ - contains (some of) the data for the use cases - **marked for deletion**
- [docs](docs)/ - contains project documentation files
- [data](data)/ - contains some files that are (currently) required for data preprocessing - **marked for deletion**
- [docs](docs)/ - contains documentation on package installation and usage
- [sdmdl](sdmdl)/ - the library code itself
- [tests](tests)/ - unit tests
21 changes: 4 additions & 17 deletions data/README.md
@@ -1,19 +1,6 @@
This folder structure holds the input data for the SDM analyses.
# data folder.

### Occurrences
Contains an empty 'config.yml' file. Once an sdmdl object has been created, this file will contain the file paths to
the detected occurrence tables and raster layers, and a number of model hyperparameters.

The substructure will include a folder ([filtered](filtered)) with files with
occurrences, one file per species. The files are formatted as per
[this](https://github.com/naturalis/trait-geo-diverse-ungulates/blob/master/data/filtered/Aepyceros_melampus.csv)
example. The data are collected from GBIF as DarwinCore archives
(store the DOI for each query!) from which we retain the following columns:

1. gbif_id
2. taxon_name
3. decimal_latitude
4. decimal_longitude

### GIS data

There will also be a folder ([GIS](GIS)) with GIS layers as input for the niche
modelling. The resolution will be 5 arcminutes. Which layers is to be determined.
The configuration file can be edited using a simple text editor (e.g. Notepad++).
98 changes: 2 additions & 96 deletions data/gis/README.md
@@ -1,97 +1,3 @@
# GIS datasets
## Climate
Both the [Bioclim](http://worldclim.org/version2) dataset and the [ENVIREM](https://deepblue.lib.umich.edu/data/concern/data_sets/gt54kn05f) dataset are used as climatic variables.
![](images/bioclim.PNG)
### Datasets Bioclim
1. BIO1 Annual Mean Temperature
2. BIO2 Mean Diurnal Range (Mean of monthly (max temp - min temp))
3. BIO3 Isothermality (BIO2/BIO7) (* 100)
4. BIO4 Temperature Seasonality (standard deviation *100)
5. BIO5 Max Temperature of Warmest Month
6. BIO6 Min Temperature of Coldest Month
7. BIO7 Temperature Annual Range (BIO5-BIO6)
8. BIO8 Mean Temperature of Wettest Quarter
9. BIO9 Mean Temperature of Driest Quarter
10. BIO10 Mean Temperature of Warmest Quarter
11. BIO11 Mean Temperature of Coldest Quarter
12. BIO12 Annual Precipitation
13. BIO13 Precipitation of Wettest Month
14. BIO14 Precipitation of Driest Month
15. BIO15 Precipitation Seasonality (Coefficient of Variation)
16. BIO16 Precipitation of Wettest Quarter
17. BIO17 Precipitation of Driest Quarter
18. BIO18 Precipitation of Warmest Quarter
19. BIO19 Precipitation of Coldest Quarter
# gis folder.

### Datasets ENVIREM
1. annualPET Annual potential evapotranspiration
2. aridityIndexThornthwaite Thornthwaite aridity index
3. climaticMoistureIndex Metric of relative wetness and aridity
4. continentality Difference between the average temp. of the warmest and the coldest month
5. embergerQ Emberger’s pluviothermic quotient
6. growingDegDays0 Sum of months with temperatures greater than 0 degrees
7. growingDegDays5 Sum of months with temperatures greater than 5 degrees
8. maxTempColdestMonth Maximum temp. of the coldest month
9. minTempWarmestMonth Minimum temp. of the warmest month
10. monthCountByTemp10 Sum of months with temperatures greater than 10 degrees
11. PETColdestQuarter Mean monthly PET of coldest quarter
12. PETDriestQuarter Mean monthly PET of driest quarter
13. PETseasonality Monthly variability in potential evapotranspiration
14. PETWarmestQuarter Mean monthly PET of warmest quarter
15. PETWettestQuarter Mean monthly PET of wettest quarter
16. thermInd Compensated thermicity index

## Topography
Median elevation variables were extracted from the [Harmonized World Soil Database ](http://www.fao.org/soils-portal/soil-survey/soil-maps-and-databases/harmonized-world-soil-database-v12/en/) and have a spatial resolution of 30 arcseconds. The topographic wetness index and the terrain roughness index are extracted from the [ENVIREM](https://deepblue.lib.umich.edu/data/concern/data_sets/gt54kn05f) dataset and have a spatial resolution of 30 arcseconds.
### Datasets
1. Slope
2. Aspect
![](images/slope.PNG)
## Soil
The soil characteristics are extracted from the [Land-Atmosphere Interaction Research Group](http://globalchange.bnu.edu.cn/research/soilw) with a spatial resolution of 5 arcminutes.

1. Bulk density
2. Clay percentage
3. pH CaCl2
4. Organic carbon

![](images/ph.PNG)

## Ecoregions
Separate raster maps representing the world's terrestrial ecoregions were created from [The Nature Conservancy's](http://maps.tnc.org/gis_data.html) world ecoregions shapefile.
1. Boreal Forests and Taiga
2. Deserts and Xeric Shrublands
3. Flooded Grasslands and Savannas
4. Inland Water
5. Mangroves
6. Mediterranean Forests, Woodlands and Scrub
7. Montane Grasslands and Shrublands
8. Rock and Ice
9. Temperate Broadleaf and Mixed Forests
10. Temperate Conifer Forests
11. Temperate Grasslands, Savannas and Shrublands
12. Tropical and Subtropical Coniferous Forests
13. Tropical and Subtropical Dry Broadleaf Forests
14. Tropical and Subtropical Grasslands, Savannas and Shrublands
15. Tropical and Subtropical Moist Broadleaf Forests
16. Tundra

## Ecoregion attributes
Additional attribute metrics per ecoregion from The World Atlas of Conservation, extracted as shapefiles from [Databasin](https://databasin.org/maps/new#datasets=43478f840ac84173979b22631c2ed672) and rasterized.
1. Habitat fragmentation
2. Human accessibility
3. Human appropriation
4. Mammal species richness
5. Plant species richness

## Species co-occurrence
Species occurrence raster maps were created for 124 species; the list of species can be found in [Taxa list](https://github.com/naturalis/trait-geo-diverse-dl/blob/master/data_GIS_extended/data/SQL_filtered_gbif/taxa_list.txt)


# Stacked raster datasets
## env_stacked
The [env_stacked](env_stacked) folder contains the environmental variable rasters stacked into a single GeoTiff; the file itself is not uploaded to GitHub as its size is too large. In addition, the folder contains a text file with the variable descriptions for each of the 186 bands in the GeoTiff.

## stacked raster clips
A clip was made of the GeoTiff for each species, based on its IUCN range.
However, this clip was not uploaded to GitHub as file sizes were too large.
Contains the 'world_locations_to_predict.csv' table that is used during the data preparations.
4 changes: 2 additions & 2 deletions data/gis/layers/README.md
@@ -1,3 +1,3 @@
# Layers folder that contains two subfolders and one empty map.
# layers folder.

The empty map is used during the data preparations.
Contains an empty map raster layer that is used during the data preparations.
5 changes: 3 additions & 2 deletions data/gis/layers/non-scaled/README.md
@@ -1,3 +1,4 @@
#non-scaled layer folder.
# non-scaled folder.

Any layers that do not need to be scaled (values already normalized) can be dragged and dropped into this folder.
Any raster layers (with the .tif file extension) that do not need to be scaled (categorical values or values that are
already normalized) can be dragged and dropped into this folder.
4 changes: 2 additions & 2 deletions data/gis/layers/scaled/README.md
@@ -1,3 +1,3 @@
# Scaled layers.
# scaled folder.

Any layers that need to be scaled (values not normalized) can be dragged and dropped into this folder.
Any raster layers (with the .tif file extension) that need to be scaled can be dragged and dropped into this folder.
12 changes: 3 additions & 9 deletions data/occurrences/README.md
@@ -1,9 +1,3 @@
- This directory is a placeholder for files with occurrence data.
- We are **not** going to store large sets of occurrence data for our use cases here,
they go in a [separate repository](https://github.com/rvosa/sdmdl-angiosperm-data).
- We are **not** going to redistribute this directory structure with the published
package, as we [exclude](https://github.com/naturalis/sdmdl/blob/e7d347a9b0ace43856770ee2dd7a48677194497a/setup.py#L18)
it from the package.
- We have
[marked](https://github.com/naturalis/sdmdl/blob/e95d908da3d9159f9b6e098f23dc9befd10fe863/README.md#L33)
this folder structure for **deletion**, hence, it is likely to disappear soon.
# occurrences folder.

Any occurrence tables (with the .csv or .xls file extension) can be dragged and dropped into this folder.
1 change: 0 additions & 1 deletion data/results/README.md

This file was deleted.

75 changes: 60 additions & 15 deletions docs/index.rst
@@ -14,21 +14,39 @@ the SDMDL package works as follows:
model.train()
model.predict()
Installation
---------------------------------------------------------

The installation of the sdmdl package can be performed in three steps:

1. Install `GDAL <https://gdal.org/download.html>`_ separately, as it is an external dependency that cannot be installed through pip alone.
This step also includes the installation of the GDAL Python package.
2. Install the package locally, by using this code snippet:

.. code:: bash

   cd directory_path_of_repository_root
   pip install .

Requirements
---------------------------------------------------------

To create an sdmdl object and subsequently train deep learning models, a few requirements need to be met.

1. Several input files (simply obtainable by copying or cloning the git repo).
2. `A set of environmental rasters <https://link.to.rasters/>`_ (.tif) which will serve as the source of data for the deep learning process.
2. `A set of environmental rasters <https://zenodo.org/record/3460541#.XYtaaEYzaCo>`_ (.tif) which will serve as the source of data for the deep learning process.
This project distinguishes between two types of environmental layers:

i. Scaled layers, that need to be scaled during the process of preparing the data.
ii. Non-scaled layers, that are already normalized or are categorical (0 = not present while 1 = present).
ii. Non-scaled layers, that are already normalized or are categorical (e.g. 0 = not present while 1 = present).

**Note:**
all environmental layers need to have the same affine transformation and resolution to be usable for data preparation.
This includes the file 'empty_land_map.tif' that is included in the git repo, which means that the affine transformation
and resolution of the input rasters need to match those of 'empty_land_map.tif'.

3. `A set of occurrences <https://link.to.occurrences/>`_ (.csv or .xls) that will serve as training examples of where the species currently occurs.
3. `A set of occurrences <https://zenodo.org/record/3460530#.XYtV3UYzaCo>`_ (.csv or .xls) that will serve as training examples of where the species currently occurs.
To be detectable as occurrence files, these tables need to have two required columns:

i. 'decimalLatitude' or 'decimallatitude' holding the latitude for each occurrence.
@@ -45,35 +63,35 @@ To create an sdmdl object and subsequently train deep learning models a few requ
2. Non-scaled tif layers should be inserted into the ''root/data/gis/layers/non-scaled''
3. Occurrences should be inserted into the ''root/data/occurrences''
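Before dropping a table into ''root/data/occurrences'', the required columns can be checked with a few lines of standard-library Python. This is a sketch only: the sample table and helper name are illustrative, and the assumption that the second required column is the analogous longitude column ('decimalLongitude' or 'decimallongitude') is not confirmed by this page.

```python
import csv
import io

# Illustrative occurrence table (not shipped with sdmdl); only the latitude
# column naming is stated in the docs, the longitude naming is assumed to
# follow the same pattern.
SAMPLE_CSV = """taxon_name,decimalLatitude,decimalLongitude
Aepyceros melampus,-23.98,31.55
Aepyceros melampus,-24.01,31.49
"""

def has_required_columns(csv_text):
    """Return True if the table header carries the columns sdmdl detects."""
    header = next(csv.reader(io.StringIO(csv_text)))
    has_lat = 'decimalLatitude' in header or 'decimallatitude' in header
    has_lon = 'decimalLongitude' in header or 'decimallongitude' in header
    return has_lat and has_lon

print(has_required_columns(SAMPLE_CSV))  # True
```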

Configuration
---------------------------------------------------------

If these locations are not convenient, they can be changed using the config.yml file in "root/data".
Config.yml is initialized the first time an sdmdl object is created and holds relevant information on:

1. Detected raster files.
2. Detected occurrence files
2. Detected occurrence files.
3. Model parameters:

a. **integer** random_seed, makes random processes repeatable.
b. **integer** pseudo_freq: number of sampled (pseudo) absences.
a. **integer** random_seed: makes a randomized process deterministic.
b. **integer** pseudo_freq: number of sampled pseudo absences.
c. **integer** batchsize: number of data points given to the model during training at once.
d. **integer** epoch: number of (training) iterations over the whole data set.
e. **integer** model_layers: number of nodes per layer. Adding extra items to the list makes the model deeper.
f. **float** model_dropout: dropout deactivates a percentage of nodes during training (0 = no nodes are turned off and 1 = all nodes are turned off)
f. **float** model_dropout: dropout deactivates a percentage of nodes during training (0 = no nodes are turned off and 1 = all nodes are turned off).
g. **boolean** Verbose: if True, prints progress bars.

**Note:** changes to the config file are not updated automatically
so any changes are not detected by the sdmdl object, for changes to take effect a new sdmdl objects needs to be created.
**Note:** changes to the config file are not detected automatically; for changes to take effect, a new sdmdl object needs to be created.
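As a sketch of what reading these hyperparameters looks like, the snippet below parses a flat 'key: value' fragment. The real config.yml is YAML and would normally be read with a YAML parser; the key names come from the list above, but the values and the flat layout are illustrative assumptions.

```python
# Hypothetical config.yml fragment; key names match the hyperparameters
# listed above, the values are made-up examples.
SAMPLE_CONFIG = """random_seed: 42
pseudo_freq: 2000
batchsize: 75
epoch: 150
"""

def read_flat_params(text):
    """Parse flat 'key: value' lines into a dict of integers (a stand-in
    for a real YAML parser, good enough for this flat fragment)."""
    params = {}
    for line in text.splitlines():
        if ':' in line:
            key, _, value = line.partition(':')
            params[key.strip()] = int(value.strip())
    return params

params = read_flat_params(SAMPLE_CONFIG)
print(params['pseudo_freq'])  # 2000
```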

Example
---------------------------------------------------------

Once these steps are completed, the model is ready for use:

**Step 1:** create an sdmdl object:

.. code:: python

   model = sdmdl(location_of_repo)

or

.. code:: python

   model = sdmdl(location_of_repo, location_of_data, location_of_occurrences)
   model = sdmdl('directory_path_of_repository_root')
**Step 2:** prepare data:

@@ -93,5 +111,32 @@ Once these steps are completed the model is ready for use:
model.predict()
**Step 5:** remove temporary data:

.. code:: python

   model.clean()
Outputs
---------------------------------------------------------

The output of step 2 consists of the following files:
- several temporary files that are used as inputs for steps 3 and 4.

The output of step 3 consists of the following files:
- Performance metrics for each species, which can be found at 'root/results/_DNN_performance/DNN_eval.txt'.
- Model files that save the final state of the best performing model after training. For each species two files can
be found at 'root/results/species_name/'; the files are named after their respective species and have the file
extensions .h5 and .json.
- A feature impact graph that shows the importance of individual variables. This graph is included for every species and
can be found at 'root/results/species_name/'; the file is named after its respective species followed by 'feature
importance' and has the file extension .png.

The output of step 4 consists of the following files:
- A prediction map that shows the global predicted distribution of a species on a scale from 0 to 1, indicating the
probability of presence. One illustration is included for every species and can be found at 'root/results/species_name/';
the file is named after its respective species followed by 'predicted map color' and has the file extension .png.
- A raster file with the global predicted distribution of a species (on a scale from 0 to 1). This file is included to allow
further analysis using the species distribution. One raster file is included for every species and can
be found at 'root/results/species_name/'; the file is named after its respective species followed by 'predicted map'
and has the file extension .tif.
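The per-species output paths described above can be sketched as follows. The directory layout, suffixes, and extensions come from this documentation, but the exact file-name formatting (e.g. the separator between the species name and the suffix) is an assumption:

```python
import os

def expected_result_files(root, species_name):
    """Build the per-species result paths described in the docs. The exact
    file-name formatting is assumed; only the directory layout and the
    suffixes/extensions are taken from the documentation."""
    species_dir = os.path.join(root, 'results', species_name)
    return {
        'model_weights': os.path.join(species_dir, species_name + '.h5'),
        'model_architecture': os.path.join(species_dir, species_name + '.json'),
        'feature_importance': os.path.join(species_dir, species_name + ' feature importance.png'),
        'prediction_image': os.path.join(species_dir, species_name + ' predicted map color.png'),
        'prediction_raster': os.path.join(species_dir, species_name + ' predicted map.tif'),
    }

paths = expected_result_files('root', 'Arachis_hypogaea')
print(paths['prediction_raster'])
```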
4 changes: 2 additions & 2 deletions sdmdl/sdmdl/data_prep/prediction_data.py
@@ -60,8 +60,8 @@ def prepare_prediction_df(self):

def create_prediction_df(self):

"""Creates (global) prediction dataset by extracting all environmental variables at each occurrence combination
in the 'world_locations_to_predict.csv' file.
"""Creates (global) prediction dataset by extracting all environmental variables at each terrestrial location
dictated by the 'world_locations_to_predict.csv' file.
:param self: a class instance of PredictionData
3 changes: 2 additions & 1 deletion sdmdl/sdmdl/data_prep/presence_map.py
@@ -6,7 +6,8 @@

# PresenceMap could include functionality on creating the empty map from scratch.
# This would require parameters in the config file for resolution and affine projection / spatial extent
# The empty land map raster can be generated from a simple shapefile.
# The empty land map raster can be generated from a simple shapefile (which could potentially be obtained from a remote
# location, so it will not have to be included in the data folder).

class PresenceMap:

4 changes: 2 additions & 2 deletions sdmdl/sdmdl/data_prep/raster_stack.py
@@ -41,7 +41,7 @@ def create_raster_stack(self):
# resolutions are detected and used by the model, THIS ALSO INCLUDES THE FILE 'empty_land_map.tif'. This
# means that the input raster files should match the spatial extent (Longitude_max = 180, Longitude_
# min = -180, Latitude_max = 90, Latitude_min = -60) of the already existing 'empty_land_map.tif' included
# in the repository. Alternatively the included 'empty_land_map.tif' can be edited to match the spatial
# extent of the users raster files.
# in the repository. Alternatively the included 'empty_land_map.tif' can be edited to match the affine
# transformation of the users raster files.

es.stack(self.gh.variables, self.gh.stack + '/stacked_env_variables.tif')
2 changes: 1 addition & 1 deletion sdmdl/sdmdl/gis.py
@@ -39,7 +39,7 @@ def __init__(self, root):

def validate_gis(self):

"""Validates if certain required files and locations are present.
"""Validates if certain required files and directories are present.
:return: None. Set instance variables equal to the required file and directory paths.
If one of the required files is not found return error.
2 changes: 1 addition & 1 deletion sdmdl/sdmdl/occurrences.py
@@ -36,7 +36,7 @@ def __init__(self, root):
def validate_occurrences(self):

"""Validates the presence of any .csv or .xls files recursively. Additionally collects some basic statistics on
the occurrences.
the species.
:return: None. Sets path and name instance variables to a list of file names and species names that have been
recursively found in self.root, also sets length instance variable to the number of species/files that have