# Processing an eDNA dataset to Darwin Core
## Reading the original dataset
### List all dataset files

In [1]:
list.files("../dataset", full.names = TRUE)

### Read the ASV table

`../dataset/seqtab.txt` contains the ASV table: one row per ASV and one column per sample, with each cell holding the number of reads of that ASV in that sample.

In [2]:
library(dplyr)

seqtab <- read.table("../dataset/seqtab.txt", sep = "\t", header = TRUE)
head(seqtab)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




,asv,EE0493,EE0495
,<chr>,<int>,<int>
1,asv.1,0,0
2,asv.2,14,2447
3,asv.3,0,0
4,asv.4,0,0
5,asv.5,40587,1857
6,asv.6,7,10
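
Before going further it can help to confirm that the table has the shape described above. The following is a minimal base R sketch on a toy table; `seqtab_toy` and the check names are illustrative and not part of the original workflow.

```r
# Toy stand-in for the ASV table; the real one is read from ../dataset/seqtab.txt
seqtab_toy <- data.frame(
    asv = c("asv.1", "asv.2", "asv.3"),
    EE0493 = c(0L, 14L, 0L),
    EE0495 = c(0L, 2447L, 0L)
)

# Every ASV identifier should appear exactly once
asv_is_unique <- !any(duplicated(seqtab_toy$asv))

# All columns except `asv` should hold non-negative whole-number read counts
count_columns <- setdiff(names(seqtab_toy), "asv")
counts_are_valid <- all(vapply(
    seqtab_toy[count_columns],
    function(x) is.numeric(x) && all(x >= 0) && all(x == round(x)),
    logical(1)
))

# Total reads per sample, useful for spotting empty samples
reads_per_sample <- colSums(seqtab_toy[count_columns])
```

The same checks should work on `seqtab` itself once `seqtab_toy` is swapped out.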


### Read the taxonomy file

`../dataset/taxonomy.txt` contains a taxon name for each ASV.

In [32]:
taxonomy <- read.table("../dataset/taxonomy.txt", sep = "\t", header = TRUE)
head(taxonomy)

,asv,taxonomy
,<chr>,<chr>
1,asv.1,Eukaryota
2,asv.2,Clausocalanus_furcatus
3,asv.3,Eurotatoria
4,asv.4,Arthropoda
5,asv.5,Eukaryota
6,asv.6,Farranula_gibbula


These names originate from the reference database and will have to be matched to the World Register of Marine Species (WoRMS) later.

### Read the sample metadata

`../dataset/samples.xlsx` is an Excel file with the sample metadata.

In [4]:
samples <- readxl::read_excel("../dataset/samples.xlsx")
samples

name,size,event_begin,area_name,area_longitude,area_latitude,area_uncertainty,parent_area_name,dna
<chr>,<dbl>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>
EE0493,1450,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,7.23
EE0495,1500,02/04/2023,Settlement beach,46.20605,9.400901,98,Aldabra Atoll,15.83


## Joining the tables

At this point we could start quality control on the individual tables, but if we first join and map the tables to Darwin Core occurrence terms, the quality control code will be easier to read.

### Event fields

Let's start with the sample table. It contains sample identifiers, sampling dates, coordinates, coordinate uncertainty, locality, and higher geography, all of which can be mapped to Darwin Core terms.

In [20]:
event <- samples %>%
    select(
        eventID = name,
        materialSampleID = name,
        eventDate = event_begin,
        locality = area_name,
        decimalLongitude = area_longitude,
        decimalLatitude = area_latitude,
        coordinateUncertaintyInMeters = area_uncertainty,
        higherGeography = parent_area_name,
        sampleSizeValue = size
    ) %>%
    mutate(sampleSizeUnit = "ml")
event

eventID,materialSampleID,eventDate,locality,decimalLongitude,decimalLatitude,coordinateUncertaintyInMeters,higherGeography,sampleSizeValue,sampleSizeUnit
<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<chr>
EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,1450,ml
EE0495,EE0495,02/04/2023,Settlement beach,46.20605,9.400901,98,Aldabra Atoll,1500,ml
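
The `eventDate` values are still in day/month/year form, while Darwin Core recommends ISO 8601 dates. A minimal base R sketch of the conversion, assuming every date in the sheet follows the `DD/MM/YYYY` pattern seen above (`event_begin` and `event_date_iso` are illustrative names):

```r
# Dates as they appear in the sample sheet (day/month/year)
event_begin <- c("24/04/2023", "02/04/2023")

# Parse with an explicit format, then print as ISO 8601 (YYYY-MM-DD)
event_date_iso <- format(as.Date(event_begin, format = "%d/%m/%Y"), "%Y-%m-%d")
event_date_iso
```

Giving the format explicitly avoids day and month being swapped for ambiguous dates such as `02/04/2023`.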


### Occurrence fields

Next is the ASV table. This table is in a wide format, with ASVs as rows and samples as columns. We will convert it to a long format with one row per occurrence, drop combinations with zero reads, and store the number of sequence reads as `organismQuantity`. We will use the sample identifier as `eventID` and the combination of sample identifier and ASV identifier as `occurrenceID`.

In [37]:
library(tidyr)

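occurrence <- seqtab %>%
    # Wide to long: every column except `asv` is a sample
    gather(eventID, organismQuantity, -asv) %>%
    # Keep only ASVs that were actually detected in a sample
    filter(organismQuantity > 0) %>%
    mutate(
        occurrenceID = paste0(eventID, "_", asv),
        organismQuantityType = "sequence reads"
    )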
occurrence

asv,eventID,organismQuantity,occurrenceID
<chr>,<chr>,<int>,<chr>
asv.2,EE0493,14,EE0493_asv.2
asv.5,EE0493,40587,EE0493_asv.5
asv.6,EE0493,7,EE0493_asv.6
asv.7,EE0493,29367,EE0493_asv.7
asv.8,EE0493,72378,EE0493_asv.8
asv.9,EE0493,35970,EE0493_asv.9
asv.10,EE0493,10,EE0493_asv.10
asv.11,EE0493,4375,EE0493_asv.11
asv.12,EE0493,6853,EE0493_asv.12
asv.14,EE0493,2299,EE0493_asv.14


We can now add the taxonomic names to our occurrence table.

In [36]:
taxonomy <- taxonomy %>%
    select(asv, verbatimIdentification = taxonomy)

In [38]:
occurrence <- occurrence %>%
    left_join(taxonomy, by = "asv")
occurrence

asv,eventID,organismQuantity,occurrenceID,verbatimIdentification
<chr>,<chr>,<int>,<chr>,<chr>
asv.2,EE0493,14,EE0493_asv.2,Clausocalanus_furcatus
asv.5,EE0493,40587,EE0493_asv.5,Eukaryota
asv.6,EE0493,7,EE0493_asv.6,Farranula_gibbula
asv.7,EE0493,29367,EE0493_asv.7,Eukaryota
asv.8,EE0493,72378,EE0493_asv.8,Metazoa
asv.9,EE0493,35970,EE0493_asv.9,Eukaryota
asv.10,EE0493,10,EE0493_asv.10,Metazoa
asv.11,EE0493,4375,EE0493_asv.11,Eukaryota
asv.12,EE0493,6853,EE0493_asv.12,Eukaryota
asv.14,EE0493,2299,EE0493_asv.14,Eukaryota
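
A `left_join()` silently leaves `NA` for ASVs without a taxonomy row and multiplies rows for ASVs listed twice. A small sketch of how both cases could be checked, using toy tables in which `asv.99` is a made-up ASV without a taxonomy entry:

```r
library(dplyr)

occurrence_toy <- tibble(
    asv = c("asv.2", "asv.5", "asv.99"),
    eventID = "EE0493"
)
taxonomy_toy <- tibble(
    asv = c("asv.2", "asv.5"),
    verbatimIdentification = c("Clausocalanus_furcatus", "Eukaryota")
)

# ASVs in the occurrence table that have no taxonomy row
missing_taxonomy <- occurrence_toy %>%
    anti_join(taxonomy_toy, by = "asv")

# ASVs listed more than once in the taxonomy table would duplicate occurrences
duplicated_asvs <- taxonomy_toy %>%
    count(asv) %>%
    filter(n > 1)
```

Both results should be empty for a clean dataset; here `missing_taxonomy` contains the made-up `asv.99`.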


### Joining event and occurrence fields

In [40]:
occurrence <- event %>%
    left_join(occurrence, by = "eventID")
occurrence

eventID,materialSampleID,eventDate,locality,decimalLongitude,decimalLatitude,coordinateUncertaintyInMeters,higherGeography,sampleSizeValue,sampleSizeUnit,asv,organismQuantity,occurrenceID,verbatimIdentification
<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<chr>,<chr>,<int>,<chr>,<chr>
EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,1450,ml,asv.2,14,EE0493_asv.2,Clausocalanus_furcatus
EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,1450,ml,asv.5,40587,EE0493_asv.5,Eukaryota
EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,1450,ml,asv.6,7,EE0493_asv.6,Farranula_gibbula
EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,1450,ml,asv.7,29367,EE0493_asv.7,Eukaryota
EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,1450,ml,asv.8,72378,EE0493_asv.8,Metazoa
EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,1450,ml,asv.9,35970,EE0493_asv.9,Eukaryota
EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,1450,ml,asv.10,10,EE0493_asv.10,Metazoa
EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,1450,ml,asv.11,4375,EE0493_asv.11,Eukaryota
EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,1450,ml,asv.12,6853,EE0493_asv.12,Eukaryota
EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,1450,ml,asv.14,2299,EE0493_asv.14,Eukaryota
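
Because the join starts from the event table, a sample without any detected ASV would still produce one row, with `NA` in the occurrence columns. A sketch of how such rows could be spotted, using toy tables where `EE0500` is a hypothetical sample with no reads:

```r
library(dplyr)

event_toy <- tibble(
    eventID = c("EE0493", "EE0495", "EE0500"),
    locality = c("Ile esprit", "Settlement beach", "Hypothetical site")
)
occurrence_toy <- tibble(
    eventID = c("EE0493", "EE0493", "EE0495"),
    occurrenceID = c("EE0493_asv.2", "EE0493_asv.5", "EE0495_asv.2")
)

joined <- event_toy %>%
    left_join(occurrence_toy, by = "eventID")

# Events that ended up without any occurrence
events_without_occurrences <- joined %>%
    filter(is.na(occurrenceID)) %>%
    pull(eventID)
```

Whether such rows should be dropped or kept depends on how the dataset is published, so the sketch only reports them.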


## Quality control

### Taxon matching

Let's first match the taxa with WoRMS using the `obistools` package. Before matching, we replace the underscores in the scientific names with spaces.

In [45]:
taxon_names <- stringr::str_replace_all(occurrence$verbatimIdentification, "_", " ")
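
Names such as `undef_Bacteria_bacteria` contain more than one underscore, so it matters whether only the first or every underscore is replaced. A small sketch of the difference, using base R `sub()` and `gsub()` instead of `stringr` purely to keep the example free of dependencies:

```r
name <- "undef_Bacteria_bacteria"

# Replaces only the first underscore
first_only <- sub("_", " ", name, fixed = TRUE)

# Replaces every underscore
all_replaced <- gsub("_", " ", name, fixed = TRUE)

first_only    # "undef Bacteria_bacteria"
all_replaced  # "undef Bacteria bacteria"
```

`stringr::str_replace()` and `stringr::str_replace_all()` differ in the same way.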

In [50]:
matched <- obistools::match_taxa(taxon_names, ask = FALSE) %>%
    select(scientificName, scientificNameID)
matched

433 names, 0 without matches, 10 with multiple matches



,scientificName,scientificNameID
,<chr>,<chr>
106,Clausocalanus furcatus,urn:lsid:marinespecies.org:taxname:104503
163,,
168,Farranula gibbula,urn:lsid:marinespecies.org:taxname:346477
163.1,,
265,Metazoa,urn:lsid:marinespecies.org:taxname:1486573
163.2,,
265.1,Metazoa,urn:lsid:marinespecies.org:taxname:1486573
163.3,,
163.4,,
163.5,,


In [51]:
occurrence <- bind_cols(occurrence, matched)
occurrence

,eventID,materialSampleID,eventDate,locality,decimalLongitude,decimalLatitude,coordinateUncertaintyInMeters,higherGeography,sampleSizeValue,sampleSizeUnit,asv,organismQuantity,occurrenceID,verbatimIdentification,scientificName,scientificNameID
,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>
106,EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,1450,ml,asv.2,14,EE0493_asv.2,Clausocalanus_furcatus,Clausocalanus furcatus,urn:lsid:marinespecies.org:taxname:104503
163,EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,1450,ml,asv.5,40587,EE0493_asv.5,Eukaryota,,
168,EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,1450,ml,asv.6,7,EE0493_asv.6,Farranula_gibbula,Farranula gibbula,urn:lsid:marinespecies.org:taxname:346477
163.1,EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,1450,ml,asv.7,29367,EE0493_asv.7,Eukaryota,,
265,EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,1450,ml,asv.8,72378,EE0493_asv.8,Metazoa,Metazoa,urn:lsid:marinespecies.org:taxname:1486573
163.2,EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,1450,ml,asv.9,35970,EE0493_asv.9,Eukaryota,,
265.1,EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,1450,ml,asv.10,10,EE0493_asv.10,Metazoa,Metazoa,urn:lsid:marinespecies.org:taxname:1486573
163.3,EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,1450,ml,asv.11,4375,EE0493_asv.11,Eukaryota,,
163.4,EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,1450,ml,asv.12,6853,EE0493_asv.12,Eukaryota,,
163.5,EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,1450,ml,asv.14,2299,EE0493_asv.14,Eukaryota,,
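
`bind_cols()` pairs rows purely by position, so it is only correct as long as the match result has exactly one row per occurrence, in the same order. An alternative that does not depend on row order is to keep the cleaned name as a column and join on it. The sketch below uses toy tables; `matched_toy` is a hand-written stand-in for a match result, not real `match_taxa()` output, and `taxon_name` is an illustrative column name.

```r
library(dplyr)

occurrence_toy <- tibble(
    occurrenceID = c("EE0493_asv.2", "EE0493_asv.5", "EE0493_asv.7"),
    taxon_name = c("Clausocalanus furcatus", "Eukaryota", "Eukaryota")
)

# Hypothetical match result with one row per unique name
matched_toy <- tibble(
    taxon_name = c("Clausocalanus furcatus", "Eukaryota"),
    scientificName = c("Clausocalanus furcatus", NA),
    scientificNameID = c("urn:lsid:marinespecies.org:taxname:104503", NA)
)

# Join on the name instead of relying on row position
occurrence_matched <- occurrence_toy %>%
    left_join(matched_toy, by = "taxon_name")
```

Matching only the unique names would also reduce the number of lookups when many occurrences share the same name.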


In [53]:
# Count unmatched records per verbatim name, most frequent first
occurrence %>%
    filter(is.na(scientificNameID)) %>%
    count(verbatimIdentification, sort = TRUE)

verbatimIdentification,n
<chr>,<int>
Eukaryota,8664
undef_Eukaryota,447
,283
undef_Oomycota,30
Navicula,9
undef_Bacteria_bacteria,7
Clathria_genus,5
Nitzschia_sp._BOLD:AAO7110,4
Aerococcus_urinae,3
Holothuria,3


Normally these names have to be resolved one by one, but for this exercise we will only fix the most common cases. For example, records annotated only as eukaryotes, or with no name at all, can be given the scientificName `Incertae sedis` and the scientificNameID `urn:lsid:marinespecies.org:taxname:12`.

In [56]:
occurrence <- occurrence %>%
    mutate(
        # TRUE for names that are empty or only say "eukaryote"
        unresolved = verbatimIdentification %in% c("Eukaryota", "undef_Eukaryota", ""),
        scientificName = if_else(unresolved, "Incertae sedis", scientificName),
        scientificNameID = if_else(unresolved, "urn:lsid:marinespecies.org:taxname:12", scientificNameID)
    ) %>%
    select(-unresolved)
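
When more names need a manual fix, a small lookup table can be easier to maintain than a growing list of conditions. The following is a sketch of that idea on toy data; `manual_fixes`, `fixedName`, and `fixedID` are illustrative names. Unlike the code above, `coalesce()` only fills values that are still `NA` and never overwrites an existing match.

```r
library(dplyr)

occurrence_toy <- tibble(
    verbatimIdentification = c("Eukaryota", "undef_Eukaryota", "Metazoa", "Navicula"),
    scientificName = c(NA, NA, "Metazoa", NA),
    scientificNameID = c(NA, NA, "urn:lsid:marinespecies.org:taxname:1486573", NA)
)

# One row per verbatim name that gets a manual replacement
manual_fixes <- tibble(
    verbatimIdentification = c("Eukaryota", "undef_Eukaryota"),
    fixedName = "Incertae sedis",
    fixedID = "urn:lsid:marinespecies.org:taxname:12"
)

occurrence_fixed <- occurrence_toy %>%
    left_join(manual_fixes, by = "verbatimIdentification") %>%
    mutate(
        # Keep existing matches, fill gaps from the lookup table
        scientificName = coalesce(scientificName, fixedName),
        scientificNameID = coalesce(scientificNameID, fixedID)
    ) %>%
    select(-fixedName, -fixedID)
```

In this toy example `Navicula` stays unmatched because it has no entry in `manual_fixes`.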

In [57]:
# Re-check which names are still unmatched after the manual fix
occurrence %>%
    filter(is.na(scientificNameID)) %>%
    count(verbatimIdentification, sort = TRUE)

verbatimIdentification,n
<chr>,<int>
undef_Oomycota,30
Navicula,9
undef_Bacteria_bacteria,7
Clathria_genus,5
Nitzschia_sp._BOLD:AAO7110,4
Aerococcus_urinae,3
Holothuria,3
Pinnularia_acrosphaeria,3
Austrosciara_hyalipennis,2
Cladonema_digitatum,2


In [58]:
occurrence

eventID,materialSampleID,eventDate,locality,decimalLongitude,decimalLatitude,coordinateUncertaintyInMeters,higherGeography,sampleSizeValue,sampleSizeUnit,asv,organismQuantity,occurrenceID,verbatimIdentification,scientificName,scientificNameID
<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>
EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,1450,ml,asv.2,14,EE0493_asv.2,Clausocalanus_furcatus,Clausocalanus furcatus,urn:lsid:marinespecies.org:taxname:104503
EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,1450,ml,asv.5,40587,EE0493_asv.5,Eukaryota,Incertae sedis,urn:lsid:marinespecies.org:taxname:12
EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,1450,ml,asv.6,7,EE0493_asv.6,Farranula_gibbula,Farranula gibbula,urn:lsid:marinespecies.org:taxname:346477
EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,1450,ml,asv.7,29367,EE0493_asv.7,Eukaryota,Incertae sedis,urn:lsid:marinespecies.org:taxname:12
EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,1450,ml,asv.8,72378,EE0493_asv.8,Metazoa,Metazoa,urn:lsid:marinespecies.org:taxname:1486573
EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,1450,ml,asv.9,35970,EE0493_asv.9,Eukaryota,Incertae sedis,urn:lsid:marinespecies.org:taxname:12
EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,1450,ml,asv.10,10,EE0493_asv.10,Metazoa,Metazoa,urn:lsid:marinespecies.org:taxname:1486573
EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,1450,ml,asv.11,4375,EE0493_asv.11,Eukaryota,Incertae sedis,urn:lsid:marinespecies.org:taxname:12
EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,1450,ml,asv.12,6853,EE0493_asv.12,Eukaryota,Incertae sedis,urn:lsid:marinespecies.org:taxname:12
EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,1450,ml,asv.14,2299,EE0493_asv.14,Eukaryota,Incertae sedis,urn:lsid:marinespecies.org:taxname:12


### Location