# Processing an eDNA dataset to Darwin Core
## Reading the original dataset
### List all dataset files

In [1]:
list.files("../dataset", full.names = TRUE)

### Read the ASV table

`../dataset/seqtab.txt` contains the ASV table: one row per ASV and one column per sample, with each cell holding the number of reads of that ASV in that sample.

In [2]:
library(dplyr)

seqtab <- read.table("../dataset/seqtab.txt", sep = "\t", header = TRUE)
head(seqtab)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

,asv,EE0493,EE0495
,<chr>,<int>,<int>
1,asv.1,0,0
2,asv.2,14,2447
3,asv.3,0,0
4,asv.4,0,0
5,asv.5,40587,1857
6,asv.6,7,10
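
Before going further it can help to confirm that the table has the shape described above. The following is a minimal sketch, not part of the original workflow: `seqtab_preview` is a hypothetical stand-in that only rebuilds the six rows shown in the preview, and the same checks would be run on the full `seqtab` in practice.

```r
# Stand-in for the first six rows of seqtab shown above (assumption: the
# real table has the same layout, just many more rows)
seqtab_preview <- data.frame(
    asv = paste0("asv.", 1:6),
    EE0493 = c(0L, 14L, 0L, 0L, 40587L, 7L),
    EE0495 = c(0L, 2447L, 0L, 0L, 1857L, 10L)
)

# Every ASV identifier should appear only once
asv_is_unique <- !any(duplicated(seqtab_preview$asv))

# All columns except `asv` should hold non-negative read counts
sample_cols <- setdiff(names(seqtab_preview), "asv")
counts_ok <- all(sapply(
    seqtab_preview[sample_cols],
    function(x) is.numeric(x) && all(x >= 0)
))

# Total number of reads per sample, useful for spotting empty samples
reads_per_sample <- colSums(seqtab_preview[sample_cols])
reads_per_sample
```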


### Read the taxonomy file

`../dataset/taxonomy.txt` contains a taxon name for each ASV.

In [3]:
taxonomy <- read.table("../dataset/taxonomy.txt", sep = "\t", header = TRUE)
head(taxonomy)

,asv,taxonomy
,<chr>,<chr>
1,asv.1,Eukaryota
2,asv.2,Clausocalanus_furcatus
3,asv.3,Eurotatoria
4,asv.4,Arthropoda
5,asv.5,Eukaryota
6,asv.6,Farranula_gibbula


These names originate from the reference database and will have to be matched to WoRMS later.

### Read the sample metadata

The sample metadata is stored in an Excel file, `../dataset/samples.xlsx`.

In [4]:
samples <- readxl::read_excel("../dataset/samples.xlsx")
samples

name,size,event_begin,area_name,area_longitude,area_latitude,area_uncertainty,parent_area_name,dna
<chr>,<dbl>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>
EE0493,1450,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,7.23
EE0495,1500,02/04/2023,Settlement beach,46.20605,9.400901,98,Aldabra Atoll,15.83
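
A quick look at the coordinates can catch problems early. The sketch below is an illustration under stated assumptions rather than part of the original notebook: `samples_preview` is a hypothetical copy of the two rows above, and the hemisphere check assumes that all sampling sites belong to Aldabra Atoll, which lies south of the equator. Positive latitudes would then suggest a missing minus sign, which should be confirmed with the data provider before changing anything.

```r
# Stand-in for the coordinate columns of the samples table shown above
samples_preview <- data.frame(
    name = c("EE0493", "EE0495"),
    area_longitude = c(46.22536, 46.20605),
    area_latitude = c(9.42518, 9.400901)
)

# Basic range checks: longitude in [-180, 180], latitude in [-90, 90]
lon_in_range <- with(samples_preview, area_longitude >= -180 & area_longitude <= 180)
lat_in_range <- with(samples_preview, area_latitude >= -90 & area_latitude <= 90)

# Hemisphere check (assumption: all sites are in the southern hemisphere).
# Samples listed here have a positive latitude and deserve a second look.
north_of_equator <- samples_preview$name[samples_preview$area_latitude > 0]
north_of_equator
```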


## Joining the tables

At this point we could start quality control on the individual tables, but if we first join and map the tables to Darwin Core occurrence terms, the quality control code will be easier to read.

### Event fields

Let's start with the sample table. It holds sample identifiers, sampling dates, coordinates, coordinate uncertainty, locality, higher geography, and sample volume, all of which can be mapped to Darwin Core terms.

In [20]:
event <- samples %>%
    select(
        eventID = name,
        materialSampleID = name,
        eventDate = event_begin,
        locality = area_name,
        decimalLongitude = area_longitude,
        decimalLatitude = area_latitude,
        coordinateUncertaintyInMeters = area_uncertainty,
        higherGeography = parent_area_name,
        sampleSizeValue = size
    ) %>%
    mutate(sampleSizeUnit = "ml")
event

eventID,materialSampleID,eventDate,locality,decimalLongitude,decimalLatitude,coordinateUncertaintyInMeters,higherGeography,sampleSizeValue,sampleSizeUnit
<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<chr>
EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,1450,ml
EE0495,EE0495,02/04/2023,Settlement beach,46.20605,9.400901,98,Aldabra Atoll,1500,ml
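
Darwin Core recommends that `eventDate` follows ISO 8601, while the values above are still in the source format. The following sketch shows one way to convert them with base R. It assumes the source dates are day/month/year, which is consistent with `24/04/2023`; the vector `event_dates` is a hypothetical stand-in for `event$eventDate`.

```r
# Stand-in for event$eventDate as shown above
event_dates <- c("24/04/2023", "02/04/2023")

# Parse as day/month/year (assumption) and format as ISO 8601 (YYYY-MM-DD).
# Values that do not match the format become NA and can be inspected.
parsed <- as.Date(event_dates, format = "%d/%m/%Y")
iso_dates <- format(parsed, "%Y-%m-%d")
iso_dates
```

In the notebook this could be applied inside `mutate()`, for example by overwriting `eventDate` with the converted values.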


### Occurrence fields

Next is the ASV table. This table is in wide format, with ASVs as rows and samples as columns. We convert it to long format, with one row per occurrence (an ASV detected in a sample with at least one read) and the number of sequence reads as `organismQuantity`. The sample identifier becomes the `eventID`, and the combination of sample identifier and ASV identifier becomes the `occurrenceID`.

In [28]:
library(tidyr)

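# Reshape to long format: one row per ASV and sample, keeping only detections
seqtab %>%
    gather(eventID, organismQuantity, 2:3) %>%  # columns 2 and 3 are the samples
    filter(organismQuantity > 0) %>%
    mutate(occurrenceID = paste0(eventID, "_", asv))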

asv,eventID,organismQuantity,occurrenceID
<chr>,<chr>,<int>,<chr>
asv.2,EE0493,14,EE0493_asv.2
asv.5,EE0493,40587,EE0493_asv.5
asv.6,EE0493,7,EE0493_asv.6
asv.7,EE0493,29367,EE0493_asv.7
asv.8,EE0493,72378,EE0493_asv.8
asv.9,EE0493,35970,EE0493_asv.9
asv.10,EE0493,10,EE0493_asv.10
asv.11,EE0493,4375,EE0493_asv.11
asv.12,EE0493,6853,EE0493_asv.12
asv.14,EE0493,2299,EE0493_asv.14
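
The reshaping and joining steps can be sketched end to end on a small example. This is an illustration, not the notebook's own code: it uses `tidyr::pivot_longer()`, the current replacement for the superseded `gather()`, and the three `*_preview` data frames are hypothetical stand-ins built from the previews above. Note that `pivot_longer()` orders rows by ASV rather than by sample, so the row order differs from the `gather()` output.

```r
library(dplyr)
library(tidyr)

# Stand-ins for the tables shown above (first rows only)
seqtab_preview <- data.frame(
    asv = paste0("asv.", 1:6),
    EE0493 = c(0L, 14L, 0L, 0L, 40587L, 7L),
    EE0495 = c(0L, 2447L, 0L, 0L, 1857L, 10L)
)
event_preview <- data.frame(
    eventID = c("EE0493", "EE0495"),
    locality = c("Ile esprit", "Settlement beach")
)
taxonomy_preview <- data.frame(
    asv = paste0("asv.", 1:6),
    taxonomy = c("Eukaryota", "Clausocalanus_furcatus", "Eurotatoria",
                 "Arthropoda", "Eukaryota", "Farranula_gibbula")
)

# Wide to long: every column except `asv` is treated as a sample
occurrence <- seqtab_preview %>%
    pivot_longer(-asv, names_to = "eventID", values_to = "organismQuantity") %>%
    filter(organismQuantity > 0) %>%
    mutate(occurrenceID = paste0(eventID, "_", asv))

# Occurrences whose sample has no metadata row (should be empty)
orphans <- anti_join(occurrence, event_preview, by = "eventID")

# One table with occurrence, event, and taxonomy columns
joined <- occurrence %>%
    left_join(event_preview, by = "eventID") %>%
    left_join(taxonomy_preview, by = "asv")
joined
```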


## Quality control

### Taxon matching

Let's first match the taxa with WoRMS. This can be done using the `obistools` package. Before matching, we replace the first underscore in each name with a space, which turns names such as `Clausocalanus_furcatus` into `Clausocalanus furcatus`.

In [5]:
taxonomy <- taxonomy %>%
    mutate(taxonomy = stringr::str_replace(taxonomy, "_", " "))
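
Note that `stringr::str_replace()` only replaces the first match in each string, while `stringr::str_replace_all()` replaces every match. The sketch below shows the difference on two names; the raw forms are assumptions inferred from the output further down, not values copied from the file. Whether all underscores should become spaces depends on the reference database's naming scheme, so this is something to decide per dataset.

```r
library(stringr)

# Hypothetical raw names (assumed forms before any replacement)
raw_names <- c("Clausocalanus_furcatus", "undef_Bacteria_bacteria")

# Replaces only the first underscore in each name
first_only <- str_replace(raw_names, "_", " ")

# Replaces every underscore in each name
all_replaced <- str_replace_all(raw_names, "_", " ")

first_only
all_replaced
```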

In [6]:
matched <- obistools::match_taxa(taxonomy$taxonomy, ask = FALSE)

668 names, 0 without matches, 11 with multiple matches



In [7]:
taxonomy <- bind_cols(taxonomy, matched)

In [15]:
taxonomy %>%
    filter(is.na(scientificNameID)) %>%
    group_by(taxonomy) %>%
    summarize(n = n()) %>%
    arrange(desc(n))

taxonomy,n
<chr>,<int>
undef Oomycota,31
Cypretta maya,21
undef Bacteria_bacteria,19
Navicula,8
Clathria genus,6
Aerococcus urinae,5
Paravannella minima,4
Calonectria colhounii,3
Culex impudicus,3
Nitzschia sp._BOLD:AAO7110,3


Normally each of these names has to be resolved individually, but for this exercise we only fix the most common cases. For example, ASVs annotated only as eukaryotes can be given the scientificName `Incertae sedis` and the scientificNameID `urn:lsid:marinespecies.org:taxname:12`.

In [14]:
taxonomy <- taxonomy %>%
    mutate(
        scientificName = case_when(taxonomy %in% c("Eukaryota", "undef Eukaryota", "") ~ "Incertae sedis", .default = scientificName),
        scientificNameID = case_when(taxonomy %in% c("Eukaryota", "undef Eukaryota", "") ~ "urn:lsid:marinespecies.org:taxname:12", .default = scientificNameID)
    )
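
When more names need manual fixes, repeating the condition for every column becomes hard to maintain. One alternative, sketched below under stated assumptions, is to keep the fixes in a small lookup table and join it in. The names `manual_fixes`, `fixedName`, and `fixedID` are hypothetical, `taxonomy_preview` is a two-row stand-in, and the only mapping included is the one described above.

```r
library(dplyr)

# Stand-in for the matched taxonomy table
taxonomy_preview <- data.frame(
    asv = c("asv.1", "asv.2"),
    taxonomy = c("Eukaryota", "Clausocalanus furcatus"),
    scientificName = c(NA, "Clausocalanus furcatus"),
    scientificNameID = c(NA, "urn:lsid:marinespecies.org:taxname:104503")
)

# Lookup table of manual fixes, keyed on the verbatim name
manual_fixes <- data.frame(
    taxonomy = c("Eukaryota", "undef Eukaryota", ""),
    fixedName = "Incertae sedis",
    fixedID = "urn:lsid:marinespecies.org:taxname:12"
)

# Where a manual fix exists it takes precedence, otherwise keep the match
fixed <- taxonomy_preview %>%
    left_join(manual_fixes, by = "taxonomy") %>%
    mutate(
        scientificName = coalesce(fixedName, scientificName),
        scientificNameID = coalesce(fixedID, scientificNameID)
    ) %>%
    select(-fixedName, -fixedID)
fixed
```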

In [16]:
taxonomy %>%
    filter(is.na(scientificNameID)) %>%
    group_by(taxonomy) %>%
    summarize(n = n()) %>%
    arrange(desc(n))

taxonomy,n
<chr>,<int>
undef Oomycota,31
Cypretta maya,21
undef Bacteria_bacteria,19
Navicula,8
Clathria genus,6
Aerococcus urinae,5
Paravannella minima,4
Calonectria colhounii,3
Culex impudicus,3
Nitzschia sp._BOLD:AAO7110,3


In [17]:
head(taxonomy)

,asv,taxonomy,scientificName,scientificNameID,match_type,acceptedNameUsageID
,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
248,asv.1,Eukaryota,Incertae sedis,urn:lsid:marinespecies.org:taxname:12,,
162,asv.2,Clausocalanus furcatus,Clausocalanus furcatus,urn:lsid:marinespecies.org:taxname:104503,exact,104503
252,asv.3,Eurotatoria,Eurotatoria,urn:lsid:marinespecies.org:taxname:368537,exact,368537
49,asv.4,Arthropoda,Arthropoda,urn:lsid:marinespecies.org:taxname:1065,exact,1065
248.1,asv.5,Eukaryota,Incertae sedis,urn:lsid:marinespecies.org:taxname:12,,
255,asv.6,Farranula gibbula,Farranula gibbula,urn:lsid:marinespecies.org:taxname:346477,exact,346477
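
As a final check, the `scientificNameID` values can be tested against the WoRMS LSID format seen in the table above. This is a minimal sketch: the pattern is derived from the identifiers shown in this document, and the vector `ids` is a hypothetical example that includes two deliberately invalid entries.

```r
# Example identifiers: two valid LSIDs, one missing value, one bare number
ids <- c(
    "urn:lsid:marinespecies.org:taxname:12",
    "urn:lsid:marinespecies.org:taxname:104503",
    NA,
    "104503"
)

# A WoRMS LSID is the fixed prefix followed by a numeric AphiaID
lsid_pattern <- "^urn:lsid:marinespecies\\.org:taxname:[0-9]+$"
is_valid <- !is.na(ids) & grepl(lsid_pattern, ids)

# Extract the numeric AphiaID from the valid identifiers only
aphia_id <- ifelse(is_valid, as.integer(sub("^.*:", "", ids)), NA_integer_)

is_valid
aphia_id
```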
