Final check for internal documentation.
Small changes and additions to both the GitHub main readme (README.md) & documentation (index.rst).
winandh committed Oct 2, 2019
1 parent 29070f4 commit 89f35fe
Showing 16 changed files with 132 additions and 176 deletions.
23 changes: 12 additions & 11 deletions README.md
Original file line number Diff line number Diff line change
@@ -8,20 +8,21 @@

An object-oriented Python package for species distribution modelling using deep learning.
The package allows for a more intuitive and easy exploration of biodiversity patterns by
modelling preferences for a great number of environmental variables.

Instructions for installing and using the sdmdl package can be found [here](docs/index.rst).

### Case study

The functionality of this package and the estimates of environmental preferences it
obtains is demonstrated by way of several use cases:
obtains is demonstrated by way of a use case on domesticated crops and their wild progenitors.

* Habitat suitability predictions with which to scale visual object identifications
* Domesticated crops and their wild progenitors
* Mycorrhizal associations
* Secondary woodiness
The raw uninterpreted results of this case study can be found [here](https://zenodo.org/record/3460718#.XYuBJEYzaCo).

The project builds on the previous results obtained by:
### Acknowledgments

- [MAXENT modelling](https://github.com/naturalis/trait-geo-diverse-ungulates) by Elke Hendrix.
- [Deep learning](https://github.com/naturalis/trait-geo-diverse-dl) by Mark Rademaker.
- [Comparative analysis of abiotic niches in Ungulates](https://github.com/naturalis/trait-geo-diverse-ungulates) by E. Hendrix.
- [Ecological Niche Modelling Using Deep Learning](https://github.com/naturalis/trait-geo-diverse-dl) by M. Rademaker.

### Package layout

@@ -30,7 +31,7 @@ The project builds on the previous results obtained by:
- [README.md](README.md) - the README file, which you are now reading
- [requirements.txt](requirements.txt) - prerequisites to install this package, used by pip
- [setup.py](setup.py) - installer script
- [data](data)/ - contains (some of) the data for the use cases - **marked for deletion**
- [docs](docs)/ - contains project documentation files
- [data](data)/ - contains some files that are (currently) required for data preprocessing - **marked for deletion**
- [docs](docs)/ - contains documentation on package installation and usage
- [sdmdl](sdmdl)/ - the library code itself
- [tests](tests)/ - unit tests
21 changes: 4 additions & 17 deletions data/README.md
@@ -1,19 +1,6 @@
This folder structure holds the input data for the SDM analyses.
# data folder.

### Occurrences
Contains an empty 'config.yml' file. Once an sdmdl object has been created, this file will contain the file paths to
the detected occurrence tables and raster layers, and a number of model hyperparameters.

The substructure will include a folder ([filtered](filtered)) with files with
occurrences, one file per species. The files are formatted as per
[this](https://github.com/naturalis/trait-geo-diverse-ungulates/blob/master/data/filtered/Aepyceros_melampus.csv)
example. The data are collected from GBIF as DarwinCore archives
(store the DOI for each query!) from which we retain the following columns:

1. gbif_id
2. taxon_name
3. decimal_latitude
4. decimal_longitude

### GIS data

There will also be a folder ([GIS](GIS)) with GIS layers as input for the niche
modelling. The resolution will be 5 arcminutes. Which layers is to be determined.
The configuration file can be edited using a simple text editor (e.g. Notepad++).
98 changes: 2 additions & 96 deletions data/gis/README.md
@@ -1,97 +1,3 @@
# GIS datasets
## Climate
Both the [Bioclim](http://worldclim.org/version2) dataset and the [ENVIREM](https://deepblue.lib.umich.edu/data/concern/data_sets/gt54kn05f) dataset are used as climatic variables.
![](images/bioclim.PNG)
### Datasets Bioclim
1. BIO1 Annual Mean Temperature
2. BIO2 Mean Diurnal Range (Mean of monthly (max temp - min temp))
3. BIO3 Isothermality (BIO2/BIO7) (* 100)
4. BIO4 Temperature Seasonality (standard deviation *100)
5. BIO5 Max Temperature of Warmest Month
6. BIO6 Min Temperature of Coldest Month
7. BIO7 Temperature Annual Range (BIO5-BIO6)
8. BIO8 Mean Temperature of Wettest Quarter
9. BIO9 Mean Temperature of Driest Quarter
10. BIO10 Mean Temperature of Warmest Quarter
11. BIO11 Mean Temperature of Coldest Quarter
12. BIO12 Annual Precipitation
13. BIO13 Precipitation of Wettest Month
14. BIO14 Precipitation of Driest Month
15. BIO15 Precipitation Seasonality (Coefficient of Variation)
16. BIO16 Precipitation of Wettest Quarter
17. BIO17 Precipitation of Driest Quarter
18. BIO18 Precipitation of Warmest Quarter
19. BIO19 Precipitation of Coldest Quarter
# gis folder.

### Datasets ENVIREM
1. annualPET Annual potential evapotranspiration
2. aridityIndexThornthwaite Thornthwaite aridity index
3. climaticMoistureIndex Metric of relative wetness and aridity
4. continentality Difference between the average temp. of the warmest and the coldest month
5. embergerQ Emberger’s pluviothermic quotient
6. growingDegDays0 Sum of months with temperatures greater than 0 degrees
7. growingDegDays5 Sum of months with temperatures greater than 5 degrees
8. maxTempColdestMonth Maximum temp. of the coldest month
9. minTempWarmestMonth Minimum temp. of the warmest month
10. monthCountByTemp10 Sum of months with temperatures greater than 10 degrees
11. PETColdestQuarter Mean monthly PET of coldest quarter
12. PETDriestQuarter Mean monthly PET of driest quarter
13. PETseasonality Monthly variability in potential evapotranspiration
14. PETWarmestQuarter Mean monthly PET of warmest quarter
15. PETWettestQuarter Mean monthly PET of wettest quarter
16. thermInd Compensated thermicity index

## Topography
Median elevation variables were extracted from the [Harmonized World Soil Database ](http://www.fao.org/soils-portal/soil-survey/soil-maps-and-databases/harmonized-world-soil-database-v12/en/) and have a spatial resolution of 30 arcseconds. The topographic wetness index and the terrain roughness index are extracted from the [ENVIREM](https://deepblue.lib.umich.edu/data/concern/data_sets/gt54kn05f) dataset and have a spatial resolution of 30 arcseconds.
### Datasets
1. Slope
2. Aspect
![](images/slope.PNG)
## Soil
The soil characteristics are extracted from the [Land-Atmosphere Interaction Research Group](http://globalchange.bnu.edu.cn/research/soilw) with a spatial resolution of 5 arcminutes.

1. Bulk density
2. Clay percentage
3. pH CaCl2
4. Organic carbon

![](images/ph.PNG)

## Ecoregions
Separate raster maps representing the world's terrestrial ecoregions were created from [The Nature Conservancy's](http://maps.tnc.org/gis_data.html) world ecoregions shapefile.
1. Boreal Forests and Taiga
2. Deserts and Xeric Shrublands
3. Flooded Grasslands and Savannas
4. Inland Water
5. Mangroves
6. Mediterranean Forests, Woodlands and Scrub
7. Montane Grasslands and Shrublands
8. Rock and Ice
9. Temperate Broadleaf and Mixed Forests
10. Temperate Conifer Forests
11. Temperate Grasslands, Savannas and Shrublands
12. Tropical and Subtropical Coniferous Forests
13. Tropical and Subtropical Dry Broadleaf Forests
14. Tropical and Subtropical Grasslands, Savannas and Shrublands
15. Tropical and Subtropical Moist Broadleaf Forests
16. Tundra

## Ecoregion attributes
Additional attribute metrics per ecoregion from The World Atlas of Conservation, extracted as shapefiles from [Databasin](https://databasin.org/maps/new#datasets=43478f840ac84173979b22631c2ed672) and rasterized.
1. Habitat fragmentation
2. Human accessibility
3. Human appropriation
4. Mammal species richness
5. Plant species richness

## Species co-occurrence
Species occurrence raster maps were created for 124 species; the list of species can be found in [Taxa list](https://github.com/naturalis/trait-geo-diverse-dl/blob/master/data_GIS_extended/data/SQL_filtered_gbif/taxa_list.txt)


# Stacked raster datasets
## env_stacked
The [env_stacked](env_stacked) folder contains the environmental variable rasters stacked into a single GeoTiff; the file itself is not uploaded to GitHub as its size is too large. In addition, the folder contains a text file with the variable descriptions for each of the 186 bands in the GeoTiff.

## stacked raster clips
A clip was made of the GeoTiff for each species, based on its IUCN range.
However, this clip was not uploaded to GitHub as file sizes were too large.
Contains the 'world_locations_to_predict.csv' table that is used during the data preparations.
4 changes: 2 additions & 2 deletions data/gis/layers/README.md
@@ -1,3 +1,3 @@
# Layers folder that contains two subfolders and one empty map.
# layers folder.

The empty map is used during the data preparations.
Contains an empty map raster layer that is used during the data preparations.
5 changes: 3 additions & 2 deletions data/gis/layers/non-scaled/README.md
@@ -1,3 +1,4 @@
#non-scaled layer folder.
# non-scaled folder.

Any layers that do not need to be scaled (values already normalized) can be dragged and dropped into this folder.
Any raster layers (with the .tif file extension) that do not need to be scaled (categorical values or values that are
already normalized) can be dragged and dropped into this folder.
4 changes: 2 additions & 2 deletions data/gis/layers/scaled/README.md
@@ -1,3 +1,3 @@
# Scaled layers.
# scaled folder.

Any layers that need to be scaled (values not normalized) can be dragged and dropped into this folder.
Any raster layers (with the .tif file extension) that need to be scaled can be dragged and dropped into this folder.
12 changes: 3 additions & 9 deletions data/occurrences/README.md
@@ -1,9 +1,3 @@
- This directory is a placeholder for files with occurrence data.
- We are **not** going to store large sets of occurrence data for our use cases here,
they go in a [separate repository](https://github.com/rvosa/sdmdl-angiosperm-data).
- We are **not** going to redistribute this directory structure with the published
package, as we [exclude](https://github.com/naturalis/sdmdl/blob/e7d347a9b0ace43856770ee2dd7a48677194497a/setup.py#L18)
it from the package.
- We have
[marked](https://github.com/naturalis/sdmdl/blob/e95d908da3d9159f9b6e098f23dc9befd10fe863/README.md#L33)
this folder structure for **deletion**, hence, it is likely to disappear soon.
# occurrences folder.

Any occurrence tables (with the .csv or .xls file extension) can be dragged and dropped into this folder.
1 change: 0 additions & 1 deletion data/results/README.md

This file was deleted.

75 changes: 60 additions & 15 deletions docs/index.rst
@@ -14,21 +14,39 @@ the SDMDL package works as follows:
model.train()
model.predict()
Installation
---------------------------------------------------------

The installation of the sdmdl package can be performed in three steps:

1. Install `GDAL <https://gdal.org/download.html>`_ separately, as it is an external dependency that cannot be installed through pip alone.
This step also includes the installation of the GDAL Python package.
2. Install the package locally, by using this code snippet:

.. code:: bash

   cd directory_path_of_repository_root
   pip install .

Requirements
---------------------------------------------------------

To create an sdmdl object and subsequently train deep learning models, a few requirements need to be met.

1. Several input files (simply obtainable by copying or cloning the git repo).
2. `A set of environmental rasters <https://link.to.rasters/>`_ (.tif) which will serve as the source of data for the deep learning process.
2. `A set of environmental rasters <https://zenodo.org/record/3460541#.XYtaaEYzaCo>`_ (.tif) which will serve as the source of data for the deep learning process.
This project distinguishes between two types of environmental layers:

i. Scaled layers, that need to be scaled during the process of preparing the data.
ii. Non-scaled layers, that are already normalized or are categorical (0 = not present while 1 = present).
ii. Non-scaled layers, that are already normalized or are categorical (e.g. 0 = not present while 1 = present).

**Note:**
all environmental layers need to have the same affine transformation and resolution to be usable for data preparation.
This includes the file 'empty_land_map.tif' that is included in the git repo, which means that the affine transformation
and resolution of the input rasters need to match those of 'empty_land_map.tif'.

3. `A set of occurrences <https://link.to.occurrences/>`_ (.csv or .xls) that will serve as training examples of where the species currently occurs.
3. `A set of occurrences <https://zenodo.org/record/3460530#.XYtV3UYzaCo>`_ (.csv or .xls) that will serve as training examples of where the species currently occurs.
To be detectable as occurrence files, these tables need to have two required columns:

i. 'decimalLatitude' or 'decimallatitude' holding the latitude for each occurrence.
@@ -45,35 +63,35 @@ To create an sdmdl object and subsequently train deep learning models a few requ
2. Non-scaled tif layers should be inserted into the ''root/data/gis/layers/non-scaled''
3. Occurrences should be inserted into the ''root/data/occurrences''
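Before dropping a table into ''root/data/occurrences'', the required columns can be checked with a few lines of standard-library Python. This is a sketch only: the sample table and helper name are illustrative, and the assumption that the second required column is the analogous longitude column ('decimalLongitude' or 'decimallongitude') is not confirmed by this page.

```python
import csv
import io

# Illustrative occurrence table (not shipped with sdmdl); only the latitude
# column naming is stated in the docs, the longitude naming is assumed to
# follow the same pattern.
SAMPLE_CSV = """taxon_name,decimalLatitude,decimalLongitude
Aepyceros melampus,-23.98,31.55
Aepyceros melampus,-24.01,31.49
"""

def has_required_columns(csv_text):
    """Return True if the table header carries the columns sdmdl detects."""
    header = next(csv.reader(io.StringIO(csv_text)))
    has_lat = 'decimalLatitude' in header or 'decimallatitude' in header
    has_lon = 'decimalLongitude' in header or 'decimallongitude' in header
    return has_lat and has_lon

print(has_required_columns(SAMPLE_CSV))  # True
```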

Configuration
---------------------------------------------------------

If these locations are not convenient, they can be changed using the config.yml file in "root/data".
Config.yml is initialized the first time an sdmdl object is created and holds relevant information on:

1. Detected raster files.
2. Detected occurrence files
2. Detected occurrence files.
3. Model parameters:

a. **integer** random_seed, makes random processes repeatable.
b. **integer** pseudo_freq: number of sampled (pseudo) absences.
a. **integer** random_seed: makes a randomized process deterministic.
b. **integer** pseudo_freq: number of sampled pseudo absences.
c. **integer** batchsize: number of data points given to the model during training at once.
d. **integer** epoch: number of (training) iterations over the whole data set.
e. **integer** model_layers: number of nodes per layer. Adding extra items to the list makes the model deeper.
f. **float** model_dropout: dropout deactivates a percentage of nodes during training (0 = no nodes are turned off and 1 = all nodes are turned off)
f. **float** model_dropout: dropout deactivates a percentage of nodes during training (0 = no nodes are turned off and 1 = all nodes are turned off).
g. **boolean** Verbose: if True, prints progress bars.

**Note:** changes to the config file are not updated automatically
so any changes are not detected by the sdmdl object, for changes to take effect a new sdmdl objects needs to be created.
**Note:** changes to the config file are not detected automatically; for changes to take effect, a new sdmdl object needs to be created.
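As a sketch of what reading these hyperparameters looks like, the snippet below parses a flat 'key: value' fragment. The real config.yml is YAML and would normally be read with a YAML parser; the key names come from the list above, but the values and the flat layout are illustrative assumptions.

```python
# Hypothetical config.yml fragment; key names match the hyperparameters
# listed above, the values are made-up examples.
SAMPLE_CONFIG = """random_seed: 42
pseudo_freq: 2000
batchsize: 75
epoch: 150
"""

def read_flat_params(text):
    """Parse flat 'key: value' lines into a dict of integers (a stand-in
    for a real YAML parser, good enough for this flat fragment)."""
    params = {}
    for line in text.splitlines():
        if ':' in line:
            key, _, value = line.partition(':')
            params[key.strip()] = int(value.strip())
    return params

params = read_flat_params(SAMPLE_CONFIG)
print(params['pseudo_freq'])  # 2000
```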

Example
---------------------------------------------------------

Once these steps are completed, the model is ready for use:

**Step 1:** create an sdmdl object:

.. code:: python

   model = sdmdl(location_of_repo)

or

.. code:: python

   model = sdmdl(location_of_repo, location_of_data, location_of_occurrences)
   model = sdmdl('directory_path_of_repository_root')
**Step 2:** prepare data:

@@ -93,5 +111,32 @@ Once these steps are completed the model is ready for use:
model.predict()
**Step 5:** remove temporary data:

.. code:: python

   model.clean()
Outputs
---------------------------------------------------------

The output of step 2 consists of the following files:
- several temporary files that are used as inputs for steps 3 and 4.

The output of step 3 consists of the following files:
- Performance metrics for each species, which can be found at 'root/results/_DNN_performance/DNN_eval.txt'.
- Model files that save the final state of the best performing model after training. For each species two files can
be found at 'root/results/species_name/'; the files are named after their respective species and have the file
extensions .h5 and .json.
- A feature impact graph that shows the importance of individual variables. This graph is included for every species and
can be found at 'root/results/species_name/'; the file is named after its respective species followed by 'feature
importance' and has the file extension .png.

The output of step 4 consists of the following files:
- A prediction map that shows the global predicted distribution of a species on a scale from 0 to 1, indicating the
probability of presence. One illustration is included for every species and can be found at 'root/results/species_name/';
the file is named after its respective species followed by 'predicted map color' and has the file extension .png.
- A raster file with the global predicted distribution of a species (on a scale from 0 to 1). This file is included to allow
further analysis using the species distribution. One raster file is included for every species and can
be found at 'root/results/species_name/'; the file is named after its respective species followed by 'predicted map'
and has the file extension .tif.
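The per-species output paths described above can be sketched as follows. The directory layout, suffixes, and extensions come from this documentation, but the exact file-name formatting (e.g. the separator between the species name and the suffix) is an assumption:

```python
import os

def expected_result_files(root, species_name):
    """Build the per-species result paths described in the docs. The exact
    file-name formatting is assumed; only the directory layout and the
    suffixes/extensions are taken from the documentation."""
    species_dir = os.path.join(root, 'results', species_name)
    return {
        'model_weights': os.path.join(species_dir, species_name + '.h5'),
        'model_architecture': os.path.join(species_dir, species_name + '.json'),
        'feature_importance': os.path.join(species_dir, species_name + ' feature importance.png'),
        'prediction_image': os.path.join(species_dir, species_name + ' predicted map color.png'),
        'prediction_raster': os.path.join(species_dir, species_name + ' predicted map.tif'),
    }

paths = expected_result_files('root', 'Arachis_hypogaea')
print(paths['prediction_raster'])
```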
4 changes: 2 additions & 2 deletions sdmdl/sdmdl/data_prep/prediction_data.py
@@ -60,8 +60,8 @@ def prepare_prediction_df(self):

def create_prediction_df(self):

"""Creates (global) prediction dataset by extracting all environmental variables at each occurrence combination
in the 'world_locations_to_predict.csv' file.
"""Creates (global) prediction dataset by extracting all environmental variables at each terrestrial location
dictated by the 'world_locations_to_predict.csv' file.
:param self: a class instance of PredictionData
3 changes: 2 additions & 1 deletion sdmdl/sdmdl/data_prep/presence_map.py
@@ -6,7 +6,8 @@

# PresenceMap could include functionality on creating the empty map from scratch.
# This would require parameters in the config file for resolution and affine projection / spatial extent
# The empty land map raster can be generated from a simple shapefile.
# The empty land map raster can be generated from a simple shapefile (which could potentially be obtained from a remote
# location, so it will not have to be included in the data folder).

class PresenceMap:

4 changes: 2 additions & 2 deletions sdmdl/sdmdl/data_prep/raster_stack.py
@@ -41,7 +41,7 @@ def create_raster_stack(self):
# resolutions are detected and used by the model, THIS ALSO INCLUDES THE FILE 'empty_land_map.tif'. This
# means that the input raster files should match the spatial extent (Longitude_max = 180, Longitude_
# min = -180, Latitude_max = 90, Latitude_min = -60) of the already existing 'empty_land_map.tif' included
# in the repository. Alternatively the included 'empty_land_map.tif' can be edited to match the spatial
# extent of the users raster files.
# in the repository. Alternatively the included 'empty_land_map.tif' can be edited to match the affine
# transformation of the users raster files.

es.stack(self.gh.variables, self.gh.stack + '/stacked_env_variables.tif')
2 changes: 1 addition & 1 deletion sdmdl/sdmdl/gis.py
@@ -39,7 +39,7 @@ def __init__(self, root):

def validate_gis(self):

"""Validates if certain required files and locations are present.
"""Validates if certain required files and directories are present.
:return: None. Set instance variables equal to the required file and directory paths.
If one of the required files is not found return error.
2 changes: 1 addition & 1 deletion sdmdl/sdmdl/occurrences.py
@@ -36,7 +36,7 @@ def __init__(self, root):
def validate_occurrences(self):

"""Validates the presence of any .csv or .xls files recursively. Additionally collects some basic statistics on
the occurrences.
the species.
:return: None. Sets path and name instance variables to a list of file names and species names that have been
recursively found in self.root, also sets length instance variable to the number of species/files that have