# Data preprocessing

In [12]:
library(tidyverse) # for general data manipulation and visualization
library(readr)     # for reading csv files
library(jsonlite)  # for accessing web API

In [13]:
options(repr.plot.width=5, repr.plot.height=5)

### Data loading and inspection

For our purposes, I use a valuable mammalian brain size allometry [dataset](https://datadryad.org/stash/dataset/doi:10.5061/dryad.2r62k7s) available on Dryad.

Citation: 

Burger, Joseph Robert; George, Menshian Ashaki; Leadbetter, Claire; Shaikh, Farhin (2019), Data from: The allometry of brain size in mammals, Dryad, Dataset, https://doi.org/10.5061/dryad.2r62k7s

In [14]:
data<-read_csv("data/doi_10/BrainAllometry_Supplement_Data.csv")

Parsed with column specification:
cols(
  Binomial = col_character(),
  order = col_character(),
  family = col_character(),
  genus = col_character(),
  species = col_character(),
  Sample_size.brain = col_character(),
  Sample_size.body = col_character(),
  Sex = col_character(),
  Mean_brain_mass_g = col_double(),
  Mean_body_mass_g = col_double(),
  BrainReference1 = col_character(),
  BrainReference2 = col_character(),
  Brain.resid = col_double(),
  T_resid = col_double()
)
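The column specification printed above is what `read_csv` guessed from the file. As a minimal sketch, the same types can be declared up front with `col_types`, which keeps them stable and silences the message. The two-row file below is made up for illustration and only mimics three of the real columns; the value `>2,<12` shows why `Sample_size.brain` has to stay a character column.

```r
library(readr)

# A made-up two-row file that mimics three columns of the real dataset.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Binomial,Sample_size.brain,Mean_brain_mass_g",
  "Genus_one,\">2,<12\",1.06",
  "Genus_two,1,0.42"
), tmp)

# Declare the column types explicitly instead of letting readr guess them.
toy <- read_csv(
  tmp,
  col_types = cols(
    Binomial = col_character(),
    Sample_size.brain = col_character(),
    Mean_brain_mass_g = col_double()
  )
)
```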


What's inside it? Whoa, it's pretty cool: records for more than 1,500 species of mammals.

In [15]:
glimpse(data)

Rows: 1,552
Columns: 14
$ Binomial          <chr> "Chrysochloris_stuhlmanni", "Microgale_cowani", "Mi…
$ order             <chr> "Afrosoricida", "Afrosoricida", "Afrosoricida", "Af…
$ family            <chr> "Chrysochloridae", "Tenrecidae", "Tenrecidae", "Ten…
$ genus             <chr> "Chrysochloris", "Microgale", "Microgale", "Oryzori…
$ species           <chr> "stuhlmanni", "cowani", "dobsoni", "hova", "telfair…
$ Sample_size.brain <chr> ">2,<12", "1", "1", "1", "2", "1", "1", NA, "1", "1…
$ Sample_size.body  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Sex               <chr> "Both", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Mean_brain_mass_g <dbl> 1.06, 0.42, 0.56, 0.58, 0.62, 0.70, 0.79, 0.80, 0.8…
$ Mean_body_mass_g  <dbl> 50.05, 15.20, 32.60, 44.20, 87.50, 49.00, 50.40, 69…
$ BrainReference1   <chr> "Boddy2012", "Boddy2012", "Boddy2012", "Boddy2012",…
$ BrainReference2   <chr> "Stephanetal.1981; Mace.etal.1981", "Stephanetal.19…
$ Brain.resid       <dbl> 0.

Simplify the data frame: keep only the taxonomy columns and the two mass columns, and rename the mass columns to make them easier to use.

In [16]:
data <- data %>% 
select(
    order,
    family,
    genus,
    species,
    brain_mass = Mean_brain_mass_g,
    body_mass = Mean_body_mass_g)
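For readers less familiar with `select()`, here is a small sketch of what the call above does: columns that are not listed are dropped, and `new_name = old_name` renames a column in the same step. The toy data frame and its values are made up for illustration.

```r
library(dplyr)

# Made-up data with the same kind of columns as the real dataset.
toy <- data.frame(
  genus = c("Aaa", "Bbb"),
  Mean_brain_mass_g = c(1.5, 2.5),
  Mean_body_mass_g = c(100, 200),
  T_resid = c(0.1, -0.1)
)

# Keep three columns, renaming two of them; T_resid is dropped.
slim <- toy %>%
  select(genus, brain_mass = Mean_brain_mass_g, body_mass = Mean_body_mass_g)

names(slim)
```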

### Look up the common names from GBIF
Unfortunately, the dataset doesn't include common names for the animals. We'll take advantage of the web service of the Global Biodiversity Information Facility (GBIF) to look them up.

The GBIF API is documented [here](https://www.gbif.org/developer/summary). Given the genus and species of an animal, I first get its usage key (`usageKey`) from GBIF, and then use that key to look up the associated vernacular names.

In [17]:
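# Match a binomial name against GBIF and return its usage key
# (the result may be NULL when GBIF reports no match).
gbif_usagekey <- function(genus, species) {
    api_call <- paste0("http://api.gbif.org/v1/species/match?verbose=true&name=", genus, "%20", species)
    r <- fromJSON(api_call)
    r$usageKey
}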

gbif_usagekey_v <- Vectorize(gbif_usagekey) # vectorize it so that it can be applied to a dataframe

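# Look up an English vernacular (common) name for a species; NA if none is found.
gbif_vernacular_name <- function(genus, species) {
    key <- gbif_usagekey(genus, species)
    if (is.null(key)) {
        return(NA_character_) # no usage key, so there is nothing to look up
    }
    api_call <- paste0("http://api.gbif.org/v1/species/", key, "/vernacularNames")
    r0 <- fromJSON(api_call)
    if (length(r0$results)==0) {
        return(NA_character_)
    }
    r <- r0$results %>% filter(language=="eng")
    if (nrow(r)==0) {
        return(NA_character_) # names exist, but none of them is English
    }
    res <- r$vernacularName[1] # GBIF can return more than one name. For our purpose, we'll arbitrarily pick the first one.
    print(res) # show progress while the queries run
    res
}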

gbif_vernacular_name_v <- Vectorize(gbif_vernacular_name) # vectorize it so that it can be applied to a dataframe
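To see the parsing step without touching the network, here is a small sketch that runs the same filter on a hand-written mock response. The JSON below is not real GBIF output; it only mimics the two fields the function above relies on (`vernacularName` and `language`), and the names in it are invented.

```r
library(jsonlite)
library(dplyr)

# Hand-written mock of a /vernacularNames response; NOT real GBIF output.
mock_json <- '{
  "results": [
    {"vernacularName": "Beispielname", "language": "deu"},
    {"vernacularName": "Example Mole", "language": "eng"},
    {"vernacularName": "Another Example Mole", "language": "eng"}
  ]
}'

r0 <- fromJSON(mock_json)

# Keep the English entries and arbitrarily pick the first one.
english <- r0$results %>% filter(language == "eng")
first_name <- english$vernacularName[1]

# An empty result list parses to a zero-length list, which is what the
# length() check in the lookup function tests for.
empty <- fromJSON('{"results": []}')
```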

Make the queries:

In [11]:
data2 <- data %>% mutate(common_name=gbif_vernacular_name_v(genus, species))
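The query above makes two HTTP requests per row, so a single failed request can abort the whole run. One possible safeguard, sketched here under assumptions, is a wrapper that turns errors into `NA` and optionally pauses between requests. `safe_lookup` and `fake_lookup` are hypothetical names introduced for this illustration, and `fake_lookup` stands in for the network call so that the sketch runs offline.

```r
# Wrap any lookup function so that a failed request yields NA instead of
# aborting the whole run. `lookup` is a placeholder for a function that
# takes a genus and a species and returns a single name.
safe_lookup <- function(lookup, genus, species, pause = 0) {
  mapply(function(g, s) {
    Sys.sleep(pause)  # optional pause between requests
    tryCatch(as.character(lookup(g, s)),
             error = function(e) NA_character_)
  }, genus, species, USE.NAMES = FALSE)
}

# Stand-in for a network lookup: fails for one made-up species.
fake_lookup <- function(genus, species) {
  if (species == "missing") stop("no match")
  paste("Common", genus)
}

out <- safe_lookup(fake_lookup, c("Aaa", "Bbb"), c("one", "missing"))
```

With the real function, the call would look like `mutate(common_name = safe_lookup(gbif_vernacular_name, genus, species, pause = 0.1))`; the pause length is a guess, not a documented GBIF requirement.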

In [None]:
data2 %>% filter(is.na(common_name))

In [None]:
data3 <- data2 %>% filter(!is.na(common_name))

In [None]:
write.csv(data2, "data/brain_size_allometry_with_common_names.csv", row.names = FALSE) # keep the default quoting so that common names containing commas stay in one field