src/05_occurrence_indicators_preprocessing.Rmd

---
title: "Occurrence indicators: pre-processing"
author:
- Damiano Oldoni
- Toon Van Daele
- Tim Adriaens
date: "`r Sys.Date()`"
output:
  html_document:
    toc: true
    toc_depth: 3
    toc_float: true
    number_sections: true
---

This document describes the pre-processing of occurrence data of alien species. The outputs of this pipeline will be used as input for modelling.

# Setup

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE)
```

Load libraries:

```{r load_libraries, message = FALSE}
library(dplyr) # To do data science
library(readr)
library(magrittr)
library(purrr)
library(stringr)
library(tidyr)
library(tidylog) # To provide feedback on dplyr functions
library(progress) # To add progress bars
library(here) # To find files
library(lubridate) # To work with dates
library(rgbif) # To get taxa from publisehd unified checklist
```

# Get data

We get occurrence data as obtained in `trias-project/occ-processing` repository. 

## Occurrence data

### Alien species

We load occurrence data related to alien species in Belgium:

```{r read_data_cube, message = FALSE, warning = FALSE}
df <- read_csv(
  file = "https://zenodo.org/records/10527772/files/be_alientaxa_cube.csv?download=1",
  col_types = cols(
    year = col_double(),
    eea_cell_code = col_character(),
    taxonKey = col_double(),
    n = col_double(),
    min_coord_uncertainty = col_double()
  ),
  na = ""
) %>% 
  write_csv("./data/output/UAT_processing/be_alientaxa_cube.csv")
```

Nota: these data have been also published on [Zenodo](https://doi.org/10.5281/zenodo.3632749).

We remove columns we will never use:

```{r remove_coords_uncertainty}
df <-
  df %>%
  select(-min_coord_uncertainty)
```

and rename column `n` to `obs`:

```{r rename_n_to_obs}
df <-
  df %>%
  rename(obs = n)
```

### Baseline

We get occurrence data of non alien species grouped at class level, which will be used for correcting the research bias effort:

```{r get_data_baseline}
df_bl <- read_csv(
  file = "https://zenodo.org/records/10527772/files/be_classes_cube.csv?download=1",
  col_types = cols(
    year = col_double(),
    eea_cell_code = col_character(),
    classKey = col_double(),
    n = col_double(),
    min_coord_uncertainty = col_double()
  ),
  na = ""
)
```

Nota: these data have been also published on [Zenodo](https://doi.org/10.5281/zenodo.3632749).

We remove columns we will never use:

```{r remove_coords_uncertainty_baseline}
df_bl <-
  df_bl %>%
  select(-min_coord_uncertainty)
```

and rename column `n` to `cobs`:

```{r rename_n_to_obs_bl}
df_bl <-
  df_bl %>%
  rename(cobs = n)
```

## Checklist data

We read input data based on [Global Register of Introduced and Invasive Species - Belgium](https://www.gbif.org/dataset/6d9e952f-948c-4483-9807-575348147c7e), typically referred to as the _unified checklist_. These data contain taxonomic information as well distribution, description and species profiles. We limit to taxa introduced in Belgium (`locationId` : `ISO_3166:BE`)

```{r unified_checklist_tidy_df}
data_file <- here::here(
  "data",
  "output",
  "UAT_processing",
  "data_input_checklist_indicators.tsv"
)
taxa_df <-
  read_tsv(data_file,
           na = "",
           guess_max = 5000
  )
taxa_df <-
  taxa_df %>%
  filter(locationId == "ISO_3166:BE")
```

## Remove some taxa

### Taxa in unified checklist without occurrences

There is a group of taxa defined in the unified checklist which are not present in the occurrence data, i.e. the presence of these taxa, although confirmed by authoritative's checklists, is still not sustained by observations:

```{r taxa_without_occs}
alien_taxa_without_occs <-
  taxa_df %>%
  filter(!nubKey %in% df$taxonKey) %>%
  distinct(
    key,
    nubKey,
    canonicalName,
    first_observed,
    last_observed,
    class,
    kingdom
  ) %>%
  arrange(nubKey)
alien_taxa_without_occs
```

where `key` is the unique GBIF identifier of the taxon as published in the unified checklist, while `nubKey` is the unique GBIF identifier of the taxon in the GBIF Backbone Taxonomy.
We save this data.frame as `alien_taxa_without_occurrences.tsv` in `data/output`:

```{r save_alien_taxa_without_occs}
write_tsv(
  alien_taxa_without_occs,
  here::here("data", "output", "alien_taxa_without_occs.tsv"),
  na = ""
)
```

## Grid cells intersecting protected areas

We get information about intersection of grid cells with Belgian protected areas by reading file `intersect_EEA_ref_grid_protected_areas.tsv` in `data/output`:

```{r read_protected_areas}
df_prot_areas <- read_tsv(
  here::here(
    "data",
    "interim",
    "intersect_EEA_ref_grid_protected_areas.tsv"
  ),
  na = ""
)
```

Preview:

```{r preview_prot_areas}
df_prot_areas %>%
  head()
```

We are interested to areas included in Natura2000. We remove columns we are not interested to:

```{r select_cols_prot_areas}
df_prot_areas <-
  df_prot_areas %>%
  select(
    CELLCODE,
    natura2000
  )
```

#  Pre-processing

## Add informative columns

For better interpretation of the data, we retrieve scientific names (without authorship) of the species whose data we are going to analyze. We also retrieve the class of each alien species to link occurrence data with the related baseline data:

```{r get_canonical_names}
taxon_key <-
  df %>%
  distinct(taxonKey) %>%
  pull()
pb <- progress_bar$new(total = length(taxon_key))
spec_names <- map_df(
  taxon_key,
  function(k) {
    pb$tick()
    name_usage(key = k)$data
  }
) %>%
  select(
    taxonKey = key,
    canonicalName,
    scientificName,
    kingdomKey, classKey
  ) %>%
  mutate(canonicalName = ifelse(
    is.na(canonicalName), scientificName, canonicalName
  ))
```

In the rare case multiple taxa share the same scientific name, authorship is added:

```{r add_authorship_if_needed}
spec_names <-
  spec_names %>%
  group_by(canonicalName) %>%
  add_tally() %>%
  ungroup() %>%
  mutate(canonicalName = if_else(n > 1,
                                 scientificName,
                                 canonicalName)) %>%
  select(-c(n, scientificName))
```

We also retrieve kingdom and class:

```{r get_kingodm_class}
class_key <-
  spec_names %>%
  distinct(classKey) %>%
  filter(!is.na(classKey)) %>%
  pull()
pb <- progress_bar$new(total = length(class_key))
kingdom_class <- map_df(
  class_key,
  function(x) {
    pb$tick()
    name_usage(key = x)$data
  }
) %>%
  select(classKey, class, kingdomKey, kingdom)
```

and we add them:

```{r add_kingdom_class}
# add class
spec_names <-
  spec_names %>%
  left_join(kingdom_class %>%
              distinct(.data$classKey, .data$class),
            by = "classKey"
  )

# add kingdom
spec_names <-
  spec_names %>%
  left_join(kingdom_class %>%
              distinct(.data$kingdomKey, .data$kingdom),
            by = "kingdomKey"
  ) 
```

Preview:

```{r preview_species}
spec_names %>% head(10)
```

Taxa not assigned to any class:

```{r taxa_no_class}
spec_names %>%
  filter(is.na(classKey))
```

## Extract x-y coordinates from cellcodes

Extract coordinates from column `eea_cell_code`. For example, the coordinates of cellcode `1kmE3895N3116` are `x = 3895` and `y = 3116`:

```{r extract_x_y}
df_xy <-
  df %>%
  distinct(eea_cell_code) %>%
  bind_cols(
    tibble(
      x = unlist(str_extract_all(unique(df$eea_cell_code),
                                 pattern = "(?<=E)\\d+"
      )),
      y = unlist(str_extract_all(unique(df$eea_cell_code),
                                 pattern = "(?<=N)\\d+"
      ))
    ) %>%
      mutate_all(as.integer)
  )
```

Preview:

```{r x_y_preview}
df_xy %>% head()
```

Do the same for baseline data:

```{r remove corrupted eea_cell_code}
df_bl <- df_bl %>% 
  mutate(cellcode_length = nchar(eea_cell_code))

table(df_bl$cellcode_length, useNA = "ifany")

corrupt_bl_eea_cell_codes <- df_bl %>% 
  filter(cellcode_length != 13) %>% 
  write_csv("./data/interim/corrupt_bl_eea_cell_codes.csv")

df_bl <- df_bl %>% 
  filter(cellcode_length == 13)
```

```{r extract_x_y_baseline}
df_bl_xy <-
  df_bl %>%
  distinct(eea_cell_code) %>%
  bind_cols(
    tibble(
      x = unlist(str_extract_all(unique(df_bl$eea_cell_code),
                                 pattern = "(?<=E)\\d+"
      )),
      y = unlist(str_extract_all(unique(df_bl$eea_cell_code),
                                 pattern = "(?<=N)\\d+"
      ))
    ) %>%
      mutate_all(as.integer)
  )
```

Preview:

```{r x_y_preview_baseline}
df_bl_xy %>% head()
```

## Apply temporal and spatial restraints

Select alien species introduced after 1950 (`year_of_introduction`: 1950) or whose date of introduction is not known:

```{r select_aliens_after_year_of_introduction}
year_of_introduction <- 1950

recent_alien_species <-
  taxa_df %>%
  # lower limit on date of first introduction if present
  filter(first_observed >= year_of_introduction | is.na(first_observed)) %>%
  # remove duplicates due to other columns as pathways
  distinct(nubKey, first_observed, last_observed) %>%
  
  # get classkey info and filter out taxa without data
  inner_join(spec_names, by = c("nubKey" = "taxonKey")) %>%
  
  distinct(
    taxonKey = nubKey,
    .data$canonicalName,
    .data$first_observed,
    .data$last_observed,
    .data$class,
    .data$kingdom,
    .data$classKey,
    .data$kingdomKey
  )
```

Number of taxa recently introduced or whose introduction year is not specified:

```{r n_recent_taxa}
nrow(recent_alien_species)
```

List of alien species excluded as first introduction date is earlier than 1950:

```{r recent_alien_species}
old_introductions_taxa <-
  taxa_df %>%
  filter(first_observed < year_of_introduction) %>%
  anti_join(recent_alien_species %>%
              select(taxonKey),
            by = c("nubKey" = "taxonKey")
  ) %>%
  distinct(
    nubKey,
    first_observed,
    last_observed
  ) %>%
  inner_join(spec_names, by = c("nubKey" = "taxonKey")) %>%
  select(
    taxonKey = nubKey,
    canonicalName,
    first_observed,
    last_observed,
    class,
    kingdom,
    everything()
  )

old_introductions_taxa
```

We save them in file `taxa_introduced_in_BE_before_1950.tsv` in  `data/output`:

```{r save_old_introduced_taxa_as_tsv_file}
write_tsv(old_introductions_taxa,
          here::here(
            "data",
            "output",
            paste0(
              "taxa_introduced_in_BE_before_",
              year_of_introduction,
              ".tsv"
            )
          ),
          na = ""
)
```

We also limit the time series analysis on observations from 1950, start of state of invasion researches:

```{r define_year_range}
first_year <- 1950
```

Some observations can be published with erroneous coordinates (see [#trias-project/occ-processing#13](https://github.com/trias-project/occ-processing/issues/13)) or can be assigned to a grid cell out of Belgium and should therefore removed.

```{r define_x_y_range_values}
be_cellcodes <- df_prot_areas$CELLCODE
```

Apply spatial and temporal restraints defined above to both occurrence data:

```{r apply_restraints_time_space_occ_data}
df <-
  df %>%
  # recently introduced taxa
  filter(taxonKey %in% recent_alien_species$taxonKey) %>%
  left_join(df_xy, by = "eea_cell_code") %>%
  # recent history
  filter(year > first_year) %>%
  # in Belgium
  filter(eea_cell_code %in% be_cellcodes)
```

and baseline data:

```{r apply_restraints_time_space_baseline_data}
df_bl <-
  df_bl %>%
  left_join(df_bl_xy, by = "eea_cell_code") %>%
  # recent history
  filter(year > first_year) %>%
  # in Belgium
  filter(eea_cell_code %in% be_cellcodes)
```

## Add information about presence in protected areas

We add whether the grid cell intersects any of the Natura2000 Belgian protected areas.

```{r add_natura2000_col}
df <-
  df %>%
  left_join(df_prot_areas,
            by = c("eea_cell_code" = "CELLCODE")
  )
```

Preview:

```{r preview_df_with_natura2000_col}
df %>% head()
```

We do the same for baseline data:

```{r add_natura2000_col_baseline}
df_bl <-
  df_bl %>%
  left_join(df_prot_areas,
            by = c("eea_cell_code" = "CELLCODE")
  )
```

Preview:

```{r preview_df_bl_with_natura2000_col}
df_bl %>% head()
```

## Create time series

For each species, define cells with at least one observation:

```{r distinct_cells_taxa}
df_cc <- 
  df %>%
  group_by(taxonKey) %>%
  distinct(eea_cell_code) %>%
  ungroup()
```

For each species, identify the first year with at least one observation:

```{r begin_year_per_species}
df_begin_year <- 
  df %>%
  group_by(taxonKey) %>%
  summarize(begin_year = min(year))
```

For each species, combine `begin_year` and unique `eea_cell_code` as found above: 

```{r combine_begin_year_cells_taxa}
df_cc <- 
  df_cc %>%
  left_join(df_begin_year, by = "taxonKey") %>%
  select(taxonKey, begin_year, eea_cell_code)
```

Preview:

```{r preview_begin_year_cells_combo}
df_cc %>% head()
```

For each cell (`eea_cell_code`) and species (`taxonKey`) we can now create a time series:

```{r make_ts, cache=TRUE}
make_time_series <- function(eea_cell_code, taxonKey, begin_year, last_year ) {
  expand_grid(eea_cell_code = eea_cell_code,
              taxonKey = taxonKey,
              year = seq(from = begin_year, to = last_year))
  
}

# create timeseries slots
df_ts <- pmap_dfr(df_cc, 
                  .f = make_time_series, 
                  last_year = year(Sys.Date())
)

## Add data

# add occurrence data
df_ts <- 
  df_ts %>%
  left_join(df %>% select(taxonKey, year, eea_cell_code, obs), 
            by = c("taxonKey", "year", "eea_cell_code"))

# add membership to protected areas
df_ts <- 
  df_ts %>%
  left_join(df_prot_areas %>% select(CELLCODE, natura2000),
            by = c("eea_cell_code" = "CELLCODE"))
# add classKey
df_ts <- 
  df_ts %>%
  left_join(spec_names %>% 
              select(taxonKey, classKey), 
            by = "taxonKey")
```

## Research effort correction

To correct the effect of research effort of an alien speies, we calculate the number of observations at class level excluding the observations of the alien species itself:

```{r correct_cobs_research_effort_bias}
# add baseline data (at class level) diminished by obs of specific alien taxon
df_ts <- 
  df_ts %>%
  left_join(df_bl %>%
              select(year, eea_cell_code, classKey, cobs),
            by = c("year", "eea_cell_code", "classKey")) %>%
  mutate(cobs = cobs - obs)

# replace NAs with 0
df_ts <-
  df_ts %>%
  replace_na(list(cobs = 0, obs = 0))
```

Preview:

```{r preview_df_ts}
df_ts %>% head(n = 30)
```

Add column for presence (1) or absence (0)

```{r add_presence_absence}
df_ts <- 
  df_ts %>%
  mutate(pa_cobs = if_else(cobs > 0, 1, 0),
         pa_obs = if_else(obs > 0, 1, 0))
```

Arrange order columns:

```{r set_order_columns}
df_ts <-
  df_ts %>%
  select(taxonKey, 
         year, 
         eea_cell_code, 
         obs, 
         pa_obs, 
         cobs, 
         pa_cobs,
         classKey,
         natura2000)
```

# Save data

## Save timeseries and taxonomic metadata

Save `df_ts` as it will be used as start point of modelling pipelines:

```{r set creds}
Sys.setenv("AWS_DEFAULT_REGION" = "eu-west-1")

if(Sys.getenv("S3_BUCKET") != ""){
  UAT_bucket <- Sys.getenv("S3_BUCKET")
}else{
  UAT_bucket <- Sys.getenv("UAT_bucket")
}
```

```{r connect to bucket, eval=FALSE}
source("./src/connect_to_bucket.R")
connect_to_bucket(bucket_name = UAT_bucket)
```

```{r save_data_as_tsv}
# upload data to S3 bucket
aws.s3::s3save(df_ts, 
               object = "df_timeseries.Rdata", 
               bucket = UAT_bucket, 
  opts = list(show_progress = TRUE,
              multipart = TRUE))
```

Create time series data
input:  "df_timeseries.RData" and "grid.RData" from bucket
output: full_timeseries.RData

```{r}
alienSpecies::createTimeseries(
  # read grid.RData from bucket
  shapeData = alienSpecies::loadShapeData("grid.RData")$utm1_bel_with_regions, 
  bucket = UAT_bucket
)
```

We save also the taxonomic information:

```{r save_spec_names, eval = FALSE}
write_tsv(spec_names,
          path = here::here("data", 
                            "interim", 
                            "timeseries_taxonomic_info.tsv"),
          na = ""
)
```