Niche Models
============
Introduction
------------
This R Markdown file performs the steps needed to calculate niche
overlap between the extant ungulates. We implement two different approaches
to characterize the niche of each species:
1. Maxent species distribution models, and
2. the Outlying Mean Index (OMI)
The following libraries and scripts need to be loaded:
```{r libraries}
library(raster, quietly = T)
library(knitr, quietly = T)
library(maxent, quietly = T)
library(maps, quietly = T)
library(rJava, quietly = T)
library(maptools, quietly = T)
library(jsonlite, quietly = T)
library(caret, quietly = T)
library(ENMeval, quietly = T)
library(repmis, quietly = T)
library(CoordinateCleaner, quietly = T)
library(dismo, quietly = T)
library(virtualspecies, quietly = T)
library(sp, quietly = T)
library(rgeos, quietly = T)
library(ape, quietly = T)
library(adehabitatMA, quietly = T)
library(ade4, quietly = T)
library(SDMTools, quietly = T)
library(factoextra, quietly = T)
library(ecospat, quietly = T)
library(phytools, quietly = T)
library(rphylopic, quietly = T)
library(whisker, quietly = T)
library(rgdal, quietly=T)
library(dplyr, quietly=T)
# our local code. the goal is that this will eventually be properly portable
# code that lives inside the ./script folder of the repository
source("../script/MaxEnt_function.R")
```
In addition, here we define the global variables that we will reuse throughout
the code:
```{r globals}
# this is the location of the root directory of the git repository in the local
# file system. normally you would be running the code from within rstudio, using
# a project that is initialized within that directory, so the working directory
# should automatically be set.
REPO_HOME <- paste(getwd(), "/../", sep = "")
```
Environmental data
------------------
The environmental data used in this research are based on climatic variables,
topography, and soil characteristics. Climatic information about the present
was extracted from the widely used Bioclim dataset, which includes 19
bioclimatic layers. The datasets contain information such as 'precipitation
in the driest quarter' or 'maximum temperatures of the coldest month' and are
constructed based on monthly remote sensing data between 1950 and 2000 (Hijmans
et al., 2005, Title et al., 2018). The dataset can be downloaded directly with
the getData() function from the `raster` package. The spatial resolution `res`
can be set to 30 arc-seconds or to 2.5, 5, or 10 arc-minutes.
```{r getdata}
# This function from the raster package loads bioclimatic layers. It first
# attempts to do this from files in data/GIS/wc5, but if these are not found it
# will attempt to download them and cache them locally. Since the files are
# small enough to commit to github we will normally have these files already.
# This could change if we move to a resolution that is finer than 5 arcminutes,
# as the layer files will then grow above 15MB.
gis.layers <- raster::getData(
"worldclim",
var = "bio",
res = 5,
path = paste(REPO_HOME, "/data/GIS", sep = ""),
download = T
)
```
The other environmental datasets that we used are the new ENVIREM variables that
give additional climatic information to the Bioclim datasets. We used median
elevation variables from the Harmonized World Soil Database (HWSD), which are
based on NASA’s Shuttle Radar Topographic Mission to calculate worldwide slope
and aspect. We used indirect height measures such as slope and aspect because
the height variables are directly correlated with the temperature Bioclim
datasets. To capture soil characteristics, we used organic carbon, pH (CaCl2), bulk
density and clay percentage datasets obtained from the land-atmosphere
interaction research group at Sun Yat-sen University.
```{r add_layers}
# The locations of the TIFF files with the layers described directly above
files.names <- list.files(paste(REPO_HOME, "/data/GIS/5_deg", sep = ""))
# Turn the file names into layer names: strip the prefix, which might include
# the resolution, and strip the file extension
gis.layers.names <- files.names
gis.layers.names <- gsub('current_5arcmin_', '', gis.layers.names)
gis.layers.names <- gsub('\\.tif$', '', gis.layers.names)
# Combine the layer names with those we've already read from BIOCLIM
gis.layers.names <- c(names(gis.layers),gis.layers.names)
# Iterate over files
for (i in 1:length(files.names)) {
# Stack with previously read layers
gis.layers <- stack(
gis.layers,
# Read as raster
raster(
# Construct file name
paste(REPO_HOME, "/data/GIS/5_deg/", files.names[i], sep = "")
)
)
}
# Apply all names
names(gis.layers) <- gis.layers.names
rm(gis.layers.names, files.names, i)
```
Taxon data
----------
In a project that includes many taxa, such as this one, it is helpful if we can
work incrementally by having separate input and output files per taxon. To set
this up we first read in the list of taxa from `/data/filtered/taxa.txt` and
create corresponding output directories:
```{r load_taxa}
# read listing
taxa <- scan(paste(REPO_HOME, "/data/filtered/taxa.txt", sep = ""), sep = "\n", what = character())
# make output directories, if needed
for (i in 1:length(taxa)) {
taxon.dir.name <- sprintf("%s/results/per_species/%s/", REPO_HOME, taxa[i])
if (!dir.exists(taxon.dir.name)) {
dir.create(taxon.dir.name, recursive = T)
}
}
rm(i, taxon.dir.name)
```
When calculating the species distribution models, a dataframe containing all
the occurrence points is needed; it is created and filled below:
```{r combine_occurences}
# create an empty dataframe
occurrences.df = data.frame(
matrix(
vector(), 0, 4,
dimnames=list(
c(),
c("gbif_id", "taxon_name", "decimal_latitude", "decimal_longitude")
)
),
stringsAsFactors=F
)
# loop to add the occurrence data to the empty dataframe
for ( i in 1:length(taxa) ) {
csv.file <- sprintf("%s/data/filtered/%s.csv", REPO_HOME, taxa[i])
csv <- read.csv(csv.file, header = T)
occurrences.df <- rbind(occurrences.df, csv)
}
rm(csv.file,csv,i)
```
Below, the raw occurrence data are plotted on a map, which goes to
`/results/per_species/<taxon>/occurrences.png`:
```{r plot_points}
data(wrld_simpl)
for (i in 1:length(taxa)) {
# construct the name of the occurrences file for the current taxon
taxon <- taxa[i]
occ.map.file <- sprintf("%s/results/per_species/%s/occurrences.png", REPO_HOME, taxon)
# generate the map
if (!file.exists(occ.map.file)) {
taxon.name <- gsub( "_", " ", taxon)
# select the longitude and latitude of the focal taxon from the data frame
taxon.occurrences <- dplyr::filter(occurrences.df, taxon_name == taxon.name)
lon <- dplyr::select(taxon.occurrences, decimal_longitude)$decimal_longitude
lat <- dplyr::select(taxon.occurrences, decimal_latitude)$decimal_latitude
# plot a simple map ±5 degrees long and lat around the occurrences
png(file = occ.map.file)
plot(
wrld_simpl,
axes = TRUE,
xlim = c(min(lon) - 5, max(lon) + 5),
ylim = c(min(lat) - 5, max(lat) + 5),
col = "gray93",
main = sprintf("Filtered occurrences for %s", taxon.name)
)
points(lon, lat, col = "red", pch = 20, cex = 0.75)
box()
dev.off()
# clean up
rm(taxon.name,taxon.occurrences,lat,lon)
}
rm(occ.map.file, taxon)
}
rm(i, wrld_simpl)
```
Maxent: species distribution modeling
-------------------------------------
For this research we used the Maximum Entropy (MaxEnt) machine learning
algorithm version 3.3.3 to construct SDMs for all available species. Previous
research has demonstrated that the MaxEnt technique performs well when using
presence-only data to estimate the relationships between environmental
predictors and the occurrences of species (Elith et al., 2011, Phillips et al.,
2006, Tognelli et al., 2009, Conolly et al., 2012). This widely used machine
learning algorithm handles complex interactions between response and predictor
variables efficiently and works well even with few occurrence points (Elith et
al., 2011, Elith et al., 2006, Wisz et al., 2008).
Before the maxent model is constructed, the environmental rasters are cropped
to the extent of the occurrence points of the focal species plus a buffer
around that extent. Buffers that are too small can underestimate edge effects,
while buffers that are too large risk drowning favorable environmental
conditions in noise. We therefore set the width of the buffer in proportion to
the maximum distance between any two occurrence points, which accounts for both
widely and narrowly distributed species.
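The sketch below illustrates the idea; `lon` and `lat` are assumed to hold the
occurrence coordinates of the focal species, and the proportionality constant
is illustrative (the actual logic lives in `Maxent_function()`):
```{r crop_sketch, eval=FALSE}
# largest pairwise distance between occurrences (in degrees)
max.dist <- max(dist(cbind(lon, lat)))
# buffer width proportional to it; 0.5 is an illustrative constant
buffer <- 0.5 * max.dist
occ.extent <- extent(
  min(lon) - buffer, max(lon) + buffer,
  min(lat) - buffer, max(lat) + buffer
)
cropped.env <- crop(gis.layers, occ.extent)
```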
The cropped environmental data are checked for collinearity, because only
uncorrelated environmental raster layers should be used in the SDM (Dormann et
al., 2013, Raes & Aguirre-Gutierrez, 2018). To remove correlated layers, the
removeCollinearity function from the virtualspecies R package version 1.4-4 was
used (Leroy et al., 2016). Environmental variables with a Pearson correlation
coefficient above 0.7 were grouped and one variable per group was randomly
chosen, resulting in a raster stack of uncorrelated environmental rasters.
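A minimal sketch of this filtering step, assuming `cropped.env` is the cropped
stack from the previous sketch:
```{r collinearity_sketch, eval=FALSE}
# group layers whose pairwise Pearson's r exceeds 0.7 and keep one per group
selected.vars <- virtualspecies::removeCollinearity(
  cropped.env,
  multicollinearity.cutoff = 0.7,
  select.variables = TRUE, # return one randomly chosen layer name per group
  plot = FALSE
)
uncorrelated.env <- raster::subset(cropped.env, selected.vars)
```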
Afterwards the occurrence dataset is split into k-fold partitions: a training
dataset containing 75% of the data and a test dataset containing the remaining
25%. The maxent model is constructed with the maxent function from the dismo R
package (Hijmans & Elith, 2013). The function extracts abiotic environmental
data for the training occurrence locations and for 1000 randomly sampled
background locations, resulting in a maxent model object that can be used to
predict which other locations are suitable.
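A hedged sketch of the partitioning and fitting, with `occ.xy` standing in for
a two-column matrix of occurrence coordinates:
```{r maxent_fit_sketch, eval=FALSE}
fold <- kfold(occ.xy, k = 4) # 4 folds: 75% training, 25% testing
train <- occ.xy[fold != 1, ]
test <- occ.xy[fold == 1, ]
# 1000 randomly sampled background locations within the clipped environment
bg <- randomPoints(uncorrelated.env, n = 1000)
spmod <- dismo::maxent(x = uncorrelated.env, p = train, a = bg)
```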
To assess model performance we used the area under the receiver operating
characteristic (ROC) curve, also known as the AUC, which is widely used to
estimate the discriminatory ability of models (Fourcade et al., 2014). The AUC
is interpreted as the probability that a randomly chosen presence point (the
location of a species occurrence) receives a higher predicted suitability than
a randomly chosen absence point (Raes & Aguirre-Gutierrez, 2018). For this
comparison, 1000 absence points were randomly sampled within the study area.
The test occurrence datasets were used to evaluate model performance with the
evaluate function of the dismo R package (Hijmans & Elith, 2013). Only SDMs
with an AUC value higher than 0.7 were used in the further steps, since such
models are generally accepted as useful (Swets et al., 2000).
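Continuing the sketch above, the evaluation step could look as follows:
```{r maxent_eval_sketch, eval=FALSE}
eval <- dismo::evaluate(p = test, a = bg, model = spmod, x = uncorrelated.env)
eval@auc > 0.7 # TRUE for the models we retain
```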
The script below loops through each species in the list, assesses the niche
model, and classifies it into one of two options:
1. "valid"
2. "invalid"
The Maxent function used in this loop can be viewed and adapted in our GitHub
repository in this [script](../script/MaxEnt_function.R).
```{r maxent1}
# first we create two empty lists: one for models with low accuracy (training
# AUC at or below the null-model threshold computed below) and one for models
# with high accuracy (training AUC above that threshold)
list_species_model_low_accuracy <- list()
list_species_model_high_accuracy <- list()
# iterate over the taxa; indices 135 and 151 are skipped
for (i in c(1:134, 136:150, 152:154)) {
name <- taxa[i]
valid.model.file <- sprintf("%s/results/per_species/%s/valid_maxent_model.rda", REPO_HOME, name)
invalid.model.file <- sprintf("%s/results/per_species/%s/invalid_maxent_model.rda", REPO_HOME, name)
# check if we have the model cached
if (!file.exists(valid.model.file) && !file.exists(invalid.model.file)) {
message(name)
# read occurrences
csv.file <- sprintf("%s/data/filtered/%s.csv", REPO_HOME, name)
species_occurrence <- read.csv(csv.file, header = T)
# Run maxent:
# - clip environment to ± 9 degrees lat/long relative to the occurrences
# - remove collinear layers
# - prepare k-folds cross-validation partition
# - run dismo::maxent()
species_model <- Maxent_function(species_occurrence, gis.layers)
# select species model from the output list
spmod <- species_model[[1]]
# select the training AUC (row 5 of the maxent results matrix)
trainingAUC <- spmod@results[[5, 1]]
## null model
# - select occurrence points within the clip environment
# - remove the focal species from the dataframe with all species
# - randomly sample as many points as the focal species has occurrences
# - run the maxent model 100 times
taxon.name <- gsub("_", " ", name)
occurrence_samples <- nrow(species_occurrence)
visited_areas <- subset(occurrences.df, occurrences.df$taxon_name != taxon.name)
modelenvironment <- species_model[[2]]
polygonextent <- species_model[[4]]
AUCvalues <- nullModel_without_spatial(visited_areas, modelenvironment, polygonextent, occurrence_samples, rep = 100)
auc <- sort(unlist(AUCvalues), decreasing = FALSE)
# the 95th of the 100 sorted null AUC values is the one-sided 95% threshold
confidence.interval <- auc[95]
# populate two stacks, one with the models with a low accuracy and one with a high accuracy
if (trainingAUC > confidence.interval) {
model <- species_model[[1]]
list_species_model_high_accuracy[[name]] <- model
save(model, file = valid.model.file)
}
else {
model <- species_model[[1]]
list_species_model_low_accuracy[[name]] <- model
save(model, file = invalid.model.file)
}
# valid model was previously constructed, load it
} else if (file.exists(valid.model.file)) {
env <- new.env()
nm <- load(valid.model.file, env)[[1]]
list_species_model_high_accuracy[[name]] <- env[[nm]]
# invalid model was previously constructed, load it
} else if (file.exists(invalid.model.file)) {
env <- new.env()
nm <- load(invalid.model.file, env)[[1]]
list_species_model_low_accuracy[[name]] <- env[[nm]]
}
}
```
```{r list_aucvalues}
# collect the training AUC values of the valid and invalid models into one CSV
output_AUC_valid <- data.frame(matrix(ncol = 3, nrow = length(list_species_model_high_accuracy)))
colnames(output_AUC_valid) <- c("taxon", "trainingAUC", "validation")
AUC.csv <- paste(REPO_HOME, "/results/maxent/AUCvalues.csv", sep = "")
for (i in 1:length(list_species_model_high_accuracy)) {
open_species_model <- list_species_model_high_accuracy[[i]]
name <- names(list_species_model_high_accuracy[i])
taxon.name <- gsub("_", " ", name)
trainingAUC <- open_species_model@results[[5, 1]]
output_AUC_valid[i,] <- c(taxon.name, trainingAUC, "valid")
}
output_AUC_invalid <- data.frame(matrix(ncol = 3, nrow = length(list_species_model_low_accuracy)))
colnames(output_AUC_invalid) <- c("taxon", "trainingAUC", "validation")
for (i in 1:length(list_species_model_low_accuracy)) {
open_species_model <- list_species_model_low_accuracy[[i]]
name <- names(list_species_model_low_accuracy[i])
taxon.name <- gsub("_", " ", name)
trainingAUC <- open_species_model@results[[5, 1]]
output_AUC_invalid[i,] <- c(taxon.name, trainingAUC, "invalid")
}
combined_auc <- rbind(output_AUC_invalid, output_AUC_valid)
write.csv(combined_auc, file = AUC.csv)
```
The models are now stacked in a list but can be viewed and examined as follows:
```{r plot_importance_and_response}
# Select the first model in the list
open_species_model <- list_species_model_high_accuracy[[1]]
# name of the dataset plotted
name <- paste("Variable importance for", names(list_species_model_high_accuracy[1]))
# plot the variable importance
plot(open_species_model, main = name)
# plot the response curve to see what the range of important values is per abiotic dataset
response(open_species_model)
# the model views can also be saved to files in a loop:
for (i in 1:length(list_species_model_high_accuracy)) {
open_model <- list_species_model_high_accuracy[[i]]
name <- names(list_species_model_high_accuracy[i])
name2 <- paste("Variable importance")
variable.map.file <- sprintf("%s/results/per_species/%s/valid_maxent_variable_importance.png", REPO_HOME, name)
if (!file.exists(variable.map.file)) {
png(file = variable.map.file)
plot(open_model, main = name2)
dev.off()
}
response.map.file <- sprintf("%s/results/per_species/%s/valid_maxent_response_curve.png", REPO_HOME, name)
if (!file.exists(response.map.file)) {
png(file = response.map.file, res = 80, units = "px")
response(open_model)
dev.off()
}
}
```
The outcome of the species distribution models can be used to predict which
areas are suitable for each species. The loop below creates prediction rasters
for the models with a high prediction accuracy.
```{r plot_suitability}
# combine all prediction rasters, either by doing the projection or by reading from a file
prediction_rasters <- stack()
taxon.names <- names(list_species_model_high_accuracy)
realms <- readOGR(paste(REPO_HOME, "/data/GIS/Realms/newRealms.shp", sep = ""))
realms <- spTransform(realms, CRS("+proj=longlat +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +no_defs"))
plot(realms)
# again skip indices 135 and 151
for (i in c(1:134, 136:150, 152:154)) {
# file name: load if exists, create otherwise
valid.model.prediction <- sprintf("%s/results/per_species/%s/valid_maxent_prediction.rda", REPO_HOME, taxon.names[i])
if (file.exists(valid.model.prediction)) {
# load file
message(sprintf("loading prediction for taxon %s from file %s", taxon.names[i], valid.model.prediction))
env <- new.env()
nm <- load(valid.model.prediction, envir = env)[[1]]
prediction_rasters <- stack(prediction_rasters, env[[nm]])
}
else {
message(sprintf("computing prediction for taxon %s, will write to file %s", taxon.names[i], valid.model.prediction))
open_species_model <- list_species_model_high_accuracy[[i]]
# select non correlated layers from model environment in the world environment
noncorrelated <- names(open_species_model@presence)
subsetworldEnv <- subset(gis.layers, noncorrelated)
# make projection of the whole earth based on the maxent SDM
species.pred <- dismo::predict(open_species_model, subsetworldEnv)
# limit the predictions to zoogeographic regions (Realms)
occurrence.csv <- sprintf("%s/data/filtered/%s.csv", REPO_HOME, taxon.names[i])
occ <- read.csv(occurrence.csv, header = T)
bindlonglat <- as.data.frame(cbind(occ[, c("decimal_longitude", "decimal_latitude")]))
points <- bindlonglat
points$decimal_longitude <- as.numeric(as.character(points$decimal_longitude))
points$decimal_latitude <- as.numeric(as.character(points$decimal_latitude))
coordinates(points) <- ~ decimal_longitude + decimal_latitude
proj4string(points) <- CRS("+proj=longlat +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +no_defs")
polys.sub <- realms[!is.na(sp::over(realms, sp::geometry(points))), ]
polys.sub <- gUnaryUnion(polys.sub)
# set all the values outside the polygon to 0
maskedprediction <- mask(species.pred, polys.sub, updatevalue = 0, updateNA = TRUE)
# remove the 10% lowest prediction values under the occurrence points
projection.values <- extract(maskedprediction, points, method = "simple")
sort.projection.values <- sort(unlist(projection.values), decreasing = FALSE)
percentage.number <- round(0.1 * nrow(occ), digits = 0)
threshold.value <- sort.projection.values[percentage.number]
maskedprediction[maskedprediction < threshold.value] <- 0
# save file, stack projection contents
save(maskedprediction, file = valid.model.prediction)
prediction_rasters <- stack(prediction_rasters, maskedprediction)
}
}
names(prediction_rasters) <- taxon.names[c(1:134, 136:150, 152:154)]
```
Now that we have the prediction rasters we can write them to files:
```{r write_predictions}
for (i in 1:length(names(prediction_rasters))) {
taxon.name <- names(prediction_rasters)[[i]]
prediction.map.file <- sprintf("%s/results/per_species/%s/prediction_map.png", REPO_HOME, taxon.name)
if (!file.exists(prediction.map.file)) {
png(file = prediction.map.file)
plot(prediction_rasters[[i]])
dev.off()
}
prediction.occurence.map.file <- sprintf("%s/results/per_species/%s/prediction_occurence_map.png", REPO_HOME, taxon.name)
if (!file.exists(prediction.occurence.map.file)) {
# select the correct occurrence dataset
csv.file <- sprintf("%s/data/filtered/%s.csv", REPO_HOME, taxon.name)
occ <- read.csv(csv.file, header = T)
lon <- occ$decimal_longitude
lat <- occ$decimal_latitude
png(file = prediction.occurence.map.file)
plot(prediction_rasters[[i]])
points(lon, lat, col = "red", pch = 20, cex = 0.2)
box()
dev.off()
}
}
```
Write a README file for each taxon to the GitHub repository:
```{r readme_files}
# generate a README.md per taxon from a whisker template
for ( i in 1:length(taxa)) {
name <- taxa[i]
readme.file <- sprintf('%s/results/per_species/%s/README.md', REPO_HOME, taxa[i])
if(!file.exists(readme.file)) {
taxa_remove <- gsub("_", " ", name)
template <-
'# {{taxa_name}}
![](image_taxa.png)
## Distribution of occurrence points
The following map shows the distribution of the filtered occurrence points for {{taxa_name}} used in the Maxent model.
![](occurrences.png)
## Variable importance
The variable importance graph shows the relative importance of the abiotic raster layers in the Maxent model for {{taxa_name}}.
![](valid_maxent_variable_importance.png)
## Response curve
The response curve graphs show the response of the Maxent model for {{taxa_name}} to different values in the abiotic raster layers.
![](valid_maxent_response_curve.png)
## Prediction map
The first map shows the predicted suitable areas on earth based on the niche preferences for {{taxa_name}} calculated in the Maxent model. The second map shows the suitable area map with the original occurrence points.
![](prediction_map.png)
![](prediction_occurence_map.png)
'
data <- list( taxa_name = taxa_remove)
text <- whisker.render(template, data)
writeLines(text, readme.file)
}
}
```
To assess whether the predicted suitable niche spaces differ between species, we
calculate their niche overlap. We use Schoener's D, since it has been suggested
to be the index best suited to maxent SDM outputs (Rödder & Engler, 2011). The
index is based on the model prediction maps and ranges from 0 (no overlap) to 1
(complete overlap).
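For reference: if $p_{X,i}$ and $p_{Y,i}$ are the suitability scores of species
$X$ and $Y$ in grid cell $i$, each surface normalized so that its scores sum to
1 over the study area, then Schoener's D is

$$D(p_X, p_Y) = 1 - \frac{1}{2} \sum_i \left| p_{X,i} - p_{Y,i} \right|$$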
```{r calc_dist}
# location of the triangular niche overlap matrix file
overlap.file <- paste(REPO_HOME, "/results/maxent/overlap.csv", sep = "")
if (!file.exists(overlap.file)) {
  # file doesn't exist, calculate Schoener's D and store the file
  overlap <- calc.niche.overlap(prediction_rasters, stat = "D")
write.table(overlap, file = overlap.file, sep = ",", quote = F)
} else {
# file exists, read it
overlap <- read.table(overlap.file, sep = ",")
}
```
Having calculated the Schoener's D overlap values, we can now convert them into
a symmetrical distance matrix (1 - D) and perform neighbor-joining clustering to
visualize the distances in niche space.
```{r cluster_dist}
dendrogram.file <- paste(REPO_HOME, "/results/maxent/dendrogram.tree", sep = "")
if (!file.exists(dendrogram.file)) {
# create a distance matrix, i.e. the inverse of the overlap
dmatrix <- as.dist(1 - overlap)
  # build an unrooted neighbor-joining tree
  tree <- nj(dmatrix)
write.tree(tree, file = dendrogram.file)
} else {
tree <- read.tree(file = dendrogram.file)
}
```
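The resulting tree can then be inspected directly in R, for example:
```{r plot_tree_sketch, eval=FALSE}
# draw the unrooted neighbor-joining tree of niche distances
plot(tree, type = "unrooted", cex = 0.5)
```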