tue_biome_classification.Rmd

---
title: "Biome classification"
output: 
  html_document:
    theme: readable
---

```{r global_options, include=FALSE}
knitr::opts_chunk$set(fig.width=12, fig.height=8, eval = FALSE,
                      echo=TRUE, warning=FALSE, message=FALSE,
                      tidy = TRUE, collapse = TRUE,
                      results = 'hold')
```


# Background
The vast majority of species occurrence information available, via 'big data' aggregators as GBIF are georeferenced point locations consisting of geographic coordinates. However, most methods for ancestral area estimation require species occurrences in a limited number of discrete geographic units. A manual classification of species based on expert knowledge or graphical-user-interface based GIS software are limited in the amount of data that can be processed and often hard to reproduce. SpeciesgeocodeR implements an easy-to-use function to classify species occurrence to discrete areas, accounting for issues in data quality. You can find detailed tutorials on the software [here](https://github.com/azizka/speciesgeocodeR/wiki) and articles disrobing the method [here](https://academic.oup.com/sysbio/article/66/2/145/2670075/SpeciesGeoCoder-Fast-Categorization-of-Species) and [here](http://www.biorxiv.org/content/early/2015/11/24/032755).


# Objectives
After this exercise you will be able to assign species occurrences to predefined areas in an automated way, taking into account caveats on data quality.

# Exercise
1. Load the cleaned occurrence file and classify each species as occurring or not occurring in the 14 WWF bioregions. (`read_csv`, `SpGeoCod`, `WwfLoad`)
2. Write the statistical and graphical summary to the working directory and check out the results. Which species and which occurrence records could not be classified? (`SpGeoCod`)
3. Write the area classification the working directory in a format suitable for downstream analyses with BioGeoBEARS. (`WriteOut`)
4. Control the occurrence threshold option of the SpGeoOut function. Which effect do you observe, which fraction seems plausible? (`SpGeoCod`)

# Possible questions for your project
* Which biome is most diverse for your group?
* How many species occur in multiple biomes?
* Which biomes are most commonly shared by species?
* Which fraction of records can not be classified to any biome?
* Which ecoregion is most important?

# Library setup
```{r}
require(speciesgeocodeR)
require(tidyverse)
require(rgdal)
```

# Tutorial

## 1. Biome classification
```{r}
# load data
dat <- read_csv("inst/occurrence_records_clean.csv")%>%
  data.frame()

# Load Olson et al 2001 biomes
biom <- WWFload(x = "inst")
names(biom)

# Classify species
class <- SpGeoCod(x = dat, y = biom, areanames = "BIOME")

```

# 2. Write results to disk
```{r}
# summary graphs
WriteOut(class, type = "graphs")

# summary tables
WriteOut(class, type = "stats")

# Per species maps, dont do this for very large groups with more than 100 species
#WriteOut(class, type = "maps")

```

## 3. Format for downstream analyses
```{r}
# BioGeoBEARS
WriteOut(class, type = "BioGeoBEARS")

```

## 4. Occurrence threshold
It might be advisable to only classify a species to a habitat if more than one or two occurrences are in this habitat. This is especially the case, if the quality of the occurrence data is unclear. You can use the `òcc.thresh` argument of `SpGeoCod` for this.

```{r}
# At least 20% of the records
class_thresh <- SpGeoCod(x = dat, y = biom, areanames = "BIOME", occ.thresh = 10)

class$polygon_table
class_thresh$polygon_table

```