# Processing an eDNA dataset to Darwin Core

<div class="alert alert-warning">
If you are new to R, make sure to read these info boxes.
</div>

First load some dependencies and create the output directory:

In [2]:
library(dplyr)
library(readxl)

# make sure to change the output directory to your own
output_dir <- "../dwc/ngisiange"
dir.create(output_dir)

# limit number of rows in notebook output
options(repr.matrix.max.rows = 10, repr.matrix.max.cols = 20)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


“'../dwc/ngisiange' already exists”


<div class="alert alert-warning">
In R, packages can be loaded using <code>library()</code>. If a package is not installed, you can install it from CRAN using <code>install.packages()</code>. The <code>dplyr</code> package is a commonly used package for data manipulation, <code>readxl</code> is required for reading the Excel file.
</div>
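
As a minimal sketch of the install-if-missing idea described in the box, the helper below reports which packages are not installed yet. The function name `missing_packages()` is made up for this example; `requireNamespace()` and `install.packages()` are base R functions.

```r
# Return the packages from `pkgs` that are not installed yet.
# `missing_packages` is a hypothetical helper, not part of any package.
missing_packages <- function(pkgs) {
    installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
    pkgs[!installed]
}

to_install <- missing_packages(c("dplyr", "readxl"))
if (length(to_install) > 0) {
    message("Not installed: ", paste(to_install, collapse = ", "))
    # install.packages(to_install)  # uncomment to install from CRAN
}
```

Keeping the `install.packages()` call commented out means the notebook never installs anything without you noticing.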

## Reading the original dataset
### List all dataset files

In [3]:
list.files("../dataset", full.names = TRUE)

<div class="alert alert-warning">
In a relative file path, <code>..</code> indicates the parent directory.
</div>
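
A small sketch of working with relative paths in base R, assuming the same `../dataset` layout used in this notebook. `file.path()` builds the path from its parts, and `normalizePath()` shows which absolute location it resolves to.

```r
# Build the path from its parts instead of typing separators by hand.
seqtab_path <- file.path("..", "dataset", "seqtab.txt")
seqtab_path

# Show the absolute path that the relative path resolves to.
# mustWork = FALSE avoids an error when the file does not exist.
normalizePath(seqtab_path, mustWork = FALSE)

# Check that the file is there before trying to read it.
file.exists(seqtab_path)
```

If `file.exists()` returns `FALSE`, the working directory is probably not the one you expect; `getwd()` prints it.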

### Read the ASV table

`../dataset/seqtab.txt` contains the ASV table. It has one row per ASV and one column per sample, and each cell holds the number of reads of that ASV in that sample.

In [5]:
seqtab <- read.table("../dataset/seqtab.txt", sep = "\t", header = TRUE)
seqtab

asv,EE0493,EE0495
<chr>,<int>,<int>
asv.1,0,0
asv.2,14,2447
asv.3,0,0
asv.4,0,0
asv.5,40587,1857
⋮,⋮,⋮
asv.16981,0,0
asv.16982,0,0
asv.16983,0,0
asv.16984,0,0


<div class="alert alert-warning">
<code>read.table()</code> reads a delimited text file to a data frame. <code>sep = "\t"</code> means that the file is tab delimited.
</div>
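
To try `read.table()` without a file, the sketch below reads a tiny tab-delimited table from an inline string through the `text` argument. The values are taken from the first rows shown above. The `quote = ""` and `comment.char = ""` arguments are an optional precaution that is not used in the cells of this notebook: by default `read.table()` treats `'` and `"` as quote characters and `#` as the start of a comment, which can break rows if names contain those characters.

```r
# Two ASVs and two samples, tab-delimited, as inline text.
toy_text <- "asv\tEE0493\tEE0495\nasv.1\t0\t0\nasv.2\t14\t2447\n"

toy_seqtab <- read.table(
    text = toy_text,
    sep = "\t",
    header = TRUE,
    quote = "",           # do not treat ' or " as quote characters
    comment.char = "",    # do not treat # as the start of a comment
    stringsAsFactors = FALSE
)

str(toy_seqtab)
```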

### Read the taxonomy file

`../dataset/taxonomy.txt` contains a taxon name for each ASV.

In [6]:
taxonomy <- read.table("../dataset/taxonomy.txt", sep = "\t", header = TRUE)
taxonomy

asv,taxonomy
<chr>,<chr>
asv.1,Eukaryota
asv.2,Clausocalanus_furcatus
asv.3,Eurotatoria
asv.4,Arthropoda
asv.5,Eukaryota
⋮,⋮
asv.16981,Metazoa
asv.16982,Metazoa
asv.16983,Metazoa
asv.16984,Eukaryota


These names originate from the reference database and will have to be matched to WoRMS later.

### Read the sample metadata

We also have an Excel file with sample info.

In [7]:
samples <- read_excel("../dataset/samples.xlsx")
samples

name,size,event_begin,area_name,area_longitude,area_latitude,area_uncertainty,parent_area_name,dna,depth,temperature
<chr>,<dbl>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>
EE0493,1450,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,7.23,10,26.3
EE0495,1500,02/04/2023,Settlement beach,46.20605,9.400901,98,Aldabra Atoll,15.83,12,25.1


## Joining the tables

At this point we could start quality control on the individual tables, but if we first join and map the tables to Darwin Core occurrence terms, the quality control code will be easier to read.

### Event fields

Let's start with the sample table. This table has sample identifiers, time, coordinates, coordinate uncertainty, locality, and higher geography which can all be mapped to Darwin Core. Keep the `dna` field for later.

In [8]:
event <- samples %>%
    select(
        eventID = name,
        materialSampleID = name,
        eventDate = event_begin,
        locality = area_name,
        decimalLongitude = area_longitude,
        decimalLatitude = area_latitude,
        coordinateUncertaintyInMeters = area_uncertainty,
        higherGeography = parent_area_name,
        minimumDepthInMeters = depth,
        maximumDepthInMeters = depth,
        sampleSizeValue = size,
        dna,
        temperature
    ) %>%
    mutate(sampleSizeUnit = "ml")
event

eventID,materialSampleID,eventDate,locality,decimalLongitude,decimalLatitude,coordinateUncertaintyInMeters,higherGeography,minimumDepthInMeters,maximumDepthInMeters,sampleSizeValue,dna,temperature,sampleSizeUnit
<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,10,10,1450,7.23,26.3,ml
EE0495,EE0495,02/04/2023,Settlement beach,46.20605,9.400901,98,Aldabra Atoll,12,12,1500,15.83,25.1,ml


<div class="alert alert-warning">
The code above uses several <code>dplyr</code> functions. <code>select()</code> selects and optionally renames a set of columns from the dataframe. <code>mutate()</code> creates a new column. <code>%>%</code> is the pipe operator which is used to string functions together.
</div>
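
The same three building blocks on a toy table, as a minimal sketch. The `toy_samples` data frame is made up for illustration and only reuses a few values from the sample sheet.

```r
library(dplyr)

# A small stand-in for the sample sheet.
toy_samples <- data.frame(
    name = c("EE0493", "EE0495"),
    size = c(1450, 1500),
    depth = c(10, 12)
)

toy_event <- toy_samples %>%
    # keep two columns and rename them (new name = old name)
    select(eventID = name, sampleSizeValue = size) %>%
    # add a column with the same value in every row
    mutate(sampleSizeUnit = "ml")

toy_event
```

Columns that are not listed in `select()`, such as `depth` here, are dropped from the result.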

### Occurrence fields

Next is the ASV table. This table is in a wide format with ASVs as rows and samples as columns. We will convert this to a long format, with one row per occurrence and the number of sequence reads as `organismQuantity`. We will use the sample identifier as `eventID` and the combination of sample identifier and ASV number as the `occurrenceID`.

<div class="alert alert-warning">
To go from a wide to a long table, use the <code>gather()</code> function from the <code>tidyr</code> package. <code>paste0()</code> is used to combine character strings.
</div>

In [9]:
library(tidyr)

occurrence <- seqtab %>%
    gather(eventID, organismQuantity, 2:3) %>%
    filter(organismQuantity > 0) %>%
    mutate(
        occurrenceID = paste0(eventID, "_", asv),
        organismQuantityType = "sequence reads"
    )
occurrence

asv,eventID,organismQuantity,occurrenceID,organismQuantityType
<chr>,<chr>,<int>,<chr>,<chr>
asv.2,EE0493,14,EE0493_asv.2,sequence reads
asv.5,EE0493,40587,EE0493_asv.5,sequence reads
asv.6,EE0493,7,EE0493_asv.6,sequence reads
asv.7,EE0493,29367,EE0493_asv.7,sequence reads
asv.8,EE0493,72378,EE0493_asv.8,sequence reads
⋮,⋮,⋮,⋮,⋮
asv.16949,EE0495,1,EE0495_asv.16949,sequence reads
asv.16958,EE0495,1,EE0495_asv.16958,sequence reads
asv.16961,EE0495,1,EE0495_asv.16961,sequence reads
asv.16962,EE0495,1,EE0495_asv.16962,sequence reads
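
The wide-to-long step above uses `gather()`. The tidyr documentation describes `gather()` as superseded by `pivot_longer()`, so here is a sketch of the same conversion with the newer function, on a toy table built from three rows of the ASV table. This is an alternative for illustration, not the code used in this notebook.

```r
library(dplyr)
library(tidyr)

toy_seqtab <- data.frame(
    asv = c("asv.1", "asv.2", "asv.5"),
    EE0493 = c(0L, 14L, 40587L),
    EE0495 = c(0L, 2447L, 1857L)
)

toy_long <- toy_seqtab %>%
    pivot_longer(
        cols = -asv,                    # every column except the ASV identifier
        names_to = "eventID",           # former column names become eventID
        values_to = "organismQuantity"  # cell values become read counts
    ) %>%
    filter(organismQuantity > 0) %>%
    mutate(occurrenceID = paste0(eventID, "_", asv))

toy_long
```

The rows come out grouped by ASV rather than by sample, so the order differs from the `gather()` result, but the set of occurrences is the same.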


We can now add the taxonomic names to our occurrence table.

In [12]:
taxonomy <- taxonomy %>%
    select(asv, verbatimIdentification = taxonomy)

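“Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
  # Was:
  data %>% select(taxonomy)

  # Now:
  data %>% select(all_of(taxonomy))

See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.”


ERROR: Error in `select()`:
! Can't select columns with `taxonomy`.
✖ `taxonomy` must be numeric or character, not a <data.frame> object.

<div class="alert alert-warning">
This warning and error only appear when the cell is run a second time. After the first run the data frame no longer has a <code>taxonomy</code> column, so <code>select()</code> picks up the data frame named <code>taxonomy</code> instead of a column. Run the cell once on a freshly read taxonomy table.
</div>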


In [11]:
occurrence <- occurrence %>%
    left_join(taxonomy, by = "asv")
occurrence

asv,eventID,organismQuantity,occurrenceID,organismQuantityType,verbatimIdentification
<chr>,<chr>,<int>,<chr>,<chr>,<chr>
asv.2,EE0493,14,EE0493_asv.2,sequence reads,Clausocalanus_furcatus
asv.5,EE0493,40587,EE0493_asv.5,sequence reads,Eukaryota
asv.6,EE0493,7,EE0493_asv.6,sequence reads,Farranula_gibbula
asv.7,EE0493,29367,EE0493_asv.7,sequence reads,Eukaryota
asv.8,EE0493,72378,EE0493_asv.8,sequence reads,Metazoa
⋮,⋮,⋮,⋮,⋮,⋮
asv.16949,EE0495,1,EE0495_asv.16949,sequence reads,Metazoa
asv.16958,EE0495,1,EE0495_asv.16958,sequence reads,Eukaryota
asv.16961,EE0495,1,EE0495_asv.16961,sequence reads,Eukaryota
asv.16962,EE0495,1,EE0495_asv.16962,sequence reads,Mantoniella_squamata


<div class="alert alert-warning">
<code>left_join()</code> joins two dataframes by matching columns. The <code>by</code> argument specifies the columns to match on.
</div>
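
A minimal sketch of what happens to rows without a match, using made-up toy tables. `left_join()` keeps every row of the left table and fills the new columns with `NA` where no match is found. `anti_join()`, also from `dplyr`, lists the unmatched rows.

```r
library(dplyr)

toy_occurrence <- data.frame(
    asv = c("asv.2", "asv.5", "asv.9"),
    organismQuantity = c(14L, 40587L, 3L)
)
toy_taxonomy <- data.frame(
    asv = c("asv.2", "asv.5"),
    verbatimIdentification = c("Clausocalanus_furcatus", "Eukaryota")
)

# asv.9 has no entry in toy_taxonomy, so its name becomes NA.
joined <- toy_occurrence %>%
    left_join(toy_taxonomy, by = "asv")
joined

# Rows of toy_occurrence that have no match in toy_taxonomy.
unmatched <- toy_occurrence %>%
    anti_join(toy_taxonomy, by = "asv")
unmatched
```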

### Joining event and occurrence fields

In [13]:
occurrence <- event %>%
    left_join(occurrence, by = "eventID")
occurrence

eventID,materialSampleID,eventDate,locality,decimalLongitude,decimalLatitude,coordinateUncertaintyInMeters,higherGeography,minimumDepthInMeters,maximumDepthInMeters,sampleSizeValue,dna,temperature,sampleSizeUnit,asv,organismQuantity,occurrenceID,organismQuantityType,verbatimIdentification
<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>
EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,10,10,1450,7.23,26.3,ml,asv.2,14,EE0493_asv.2,sequence reads,Clausocalanus_furcatus
EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,10,10,1450,7.23,26.3,ml,asv.5,40587,EE0493_asv.5,sequence reads,Eukaryota
EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,10,10,1450,7.23,26.3,ml,asv.6,7,EE0493_asv.6,sequence reads,Farranula_gibbula
EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,10,10,1450,7.23,26.3,ml,asv.7,29367,EE0493_asv.7,sequence reads,Eukaryota
EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,10,10,1450,7.23,26.3,ml,asv.8,72378,EE0493_asv.8,sequence reads,Metazoa
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
EE0495,EE0495,02/04/2023,Settlement beach,46.20605,9.400901,98,Aldabra Atoll,12,12,1500,15.83,25.1,ml,asv.16949,1,EE0495_asv.16949,sequence reads,Metazoa
EE0495,EE0495,02/04/2023,Settlement beach,46.20605,9.400901,98,Aldabra Atoll,12,12,1500,15.83,25.1,ml,asv.16958,1,EE0495_asv.16958,sequence reads,Eukaryota
EE0495,EE0495,02/04/2023,Settlement beach,46.20605,9.400901,98,Aldabra Atoll,12,12,1500,15.83,25.1,ml,asv.16961,1,EE0495_asv.16961,sequence reads,Eukaryota
EE0495,EE0495,02/04/2023,Settlement beach,46.20605,9.400901,98,Aldabra Atoll,12,12,1500,15.83,25.1,ml,asv.16962,1,EE0495_asv.16962,sequence reads,Mantoniella_squamata


## Adding metadata

Populate `samplingProtocol` with a link to the eDNA Expeditions protocol.

In [14]:
occurrence$samplingProtocol <- "https://github.com/BeBOP-OBON/UNESCO_protocol_collection"

## Quality control

### Taxon matching

Let's first match the taxa with WoRMS. This can be done using the `obistools` package. Before matching with WoRMS we will replace the underscores in the scientific names with spaces. Some names contain more than one underscore, so we use `str_replace_all()`.

In [15]:
taxon_names <- stringr::str_replace_all(occurrence$verbatimIdentification, "_", " ")
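
Whether only the first or every underscore is replaced matters for names such as `undef_Bacteria_bacteria`. The base R sketch below shows the difference: `sub()` replaces the first match only, like `stringr::str_replace()`, while `gsub()` replaces all matches, like `stringr::str_replace_all()`.

```r
names_raw <- c("Clausocalanus_furcatus", "undef_Bacteria_bacteria")

# Replace only the first underscore in each name.
first_only <- sub("_", " ", names_raw, fixed = TRUE)
first_only

# Replace every underscore in each name.
all_replaced <- gsub("_", " ", names_raw, fixed = TRUE)
all_replaced
```

For names with a single underscore both approaches give the same result, which is why the difference is easy to overlook.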

Now match the names. This can take a few minutes.

In [16]:
matched <- obistools::match_taxa(taxon_names, ask = FALSE) %>%
    select(scientificName, scientificNameID)

matched

433 names, 0 without matches, 10 with multiple matches



,scientificName,scientificNameID
,<chr>,<chr>
106,Clausocalanus furcatus,urn:lsid:marinespecies.org:taxname:104503
163,,
168,Farranula gibbula,urn:lsid:marinespecies.org:taxname:346477
163.1,,
265,Metazoa,urn:lsid:marinespecies.org:taxname:1486573
⋮,⋮,⋮
265.1833,Metazoa,urn:lsid:marinespecies.org:taxname:1486573
163.8662,,
163.8663,,
262.3,Mantoniella squamata,urn:lsid:marinespecies.org:taxname:134563


In [17]:
occurrence <- bind_cols(occurrence, matched)
occurrence

,eventID,materialSampleID,eventDate,locality,decimalLongitude,decimalLatitude,coordinateUncertaintyInMeters,higherGeography,minimumDepthInMeters,maximumDepthInMeters,⋯,temperature,sampleSizeUnit,asv,organismQuantity,occurrenceID,organismQuantityType,verbatimIdentification,samplingProtocol,scientificName,scientificNameID
,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,⋯,<dbl>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
106,EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,10,10,⋯,26.3,ml,asv.2,14,EE0493_asv.2,sequence reads,Clausocalanus_furcatus,https://github.com/BeBOP-OBON/UNESCO_protocol_collection,Clausocalanus furcatus,urn:lsid:marinespecies.org:taxname:104503
163,EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,10,10,⋯,26.3,ml,asv.5,40587,EE0493_asv.5,sequence reads,Eukaryota,https://github.com/BeBOP-OBON/UNESCO_protocol_collection,,
168,EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,10,10,⋯,26.3,ml,asv.6,7,EE0493_asv.6,sequence reads,Farranula_gibbula,https://github.com/BeBOP-OBON/UNESCO_protocol_collection,Farranula gibbula,urn:lsid:marinespecies.org:taxname:346477
163.1,EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,10,10,⋯,26.3,ml,asv.7,29367,EE0493_asv.7,sequence reads,Eukaryota,https://github.com/BeBOP-OBON/UNESCO_protocol_collection,,
265,EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,10,10,⋯,26.3,ml,asv.8,72378,EE0493_asv.8,sequence reads,Metazoa,https://github.com/BeBOP-OBON/UNESCO_protocol_collection,Metazoa,urn:lsid:marinespecies.org:taxname:1486573
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋱,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
265.1833,EE0495,EE0495,02/04/2023,Settlement beach,46.20605,9.400901,98,Aldabra Atoll,12,12,⋯,25.1,ml,asv.16949,1,EE0495_asv.16949,sequence reads,Metazoa,https://github.com/BeBOP-OBON/UNESCO_protocol_collection,Metazoa,urn:lsid:marinespecies.org:taxname:1486573
163.8662,EE0495,EE0495,02/04/2023,Settlement beach,46.20605,9.400901,98,Aldabra Atoll,12,12,⋯,25.1,ml,asv.16958,1,EE0495_asv.16958,sequence reads,Eukaryota,https://github.com/BeBOP-OBON/UNESCO_protocol_collection,,
163.8663,EE0495,EE0495,02/04/2023,Settlement beach,46.20605,9.400901,98,Aldabra Atoll,12,12,⋯,25.1,ml,asv.16961,1,EE0495_asv.16961,sequence reads,Eukaryota,https://github.com/BeBOP-OBON/UNESCO_protocol_collection,,
262.3,EE0495,EE0495,02/04/2023,Settlement beach,46.20605,9.400901,98,Aldabra Atoll,12,12,⋯,25.1,ml,asv.16962,1,EE0495_asv.16962,sequence reads,Mantoniella_squamata,https://github.com/BeBOP-OBON/UNESCO_protocol_collection,Mantoniella squamata,urn:lsid:marinespecies.org:taxname:134563


In [15]:
non_matches <- occurrence %>%
    filter(is.na(scientificNameID)) %>%
    group_by(verbatimIdentification) %>%
    summarize(n = n()) %>%
    arrange(desc(n))

write.table(non_matches, file = file.path(output_dir, "nonmatches.txt"), sep = "\t", row.names = FALSE, na = "", quote = FALSE)

non_matches

verbatimIdentification,n
<chr>,<int>
Eukaryota,8664
undef_Eukaryota,447
,283
undef_Oomycota,30
Navicula,9
⋮,⋮
Lobophora_brown_algae,1
Nitzschia_diatoms,1
Synchaetomella_acerina,1
undef_Gastropoda,1


Normally we have to resolve these names one by one, but for this exercise we will just fix the most common errors. For example, records annotated as eukaryotes can be populated with scientificName `Incertae sedis` and scientificNameID `urn:lsid:marinespecies.org:taxname:12`.

In [18]:
occurrence <- occurrence %>%
    mutate(
        scientificName = case_when(verbatimIdentification %in% c("Eukaryota", "undef_Eukaryota", "") ~ "Incertae sedis", .default = scientificName),
        scientificNameID = case_when(verbatimIdentification %in% c("Eukaryota", "undef_Eukaryota", "") ~ "urn:lsid:marinespecies.org:taxname:12", .default = scientificNameID)
    )
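
The `case_when()` call above can be tried on a toy table first. This is a minimal sketch with made-up rows. The `.default` argument assumes dplyr 1.1.0 or later; in older versions the usual fallback is a final `TRUE ~ scientificName` condition.

```r
library(dplyr)

toy <- data.frame(
    verbatimIdentification = c("Eukaryota", "Metazoa", ""),
    scientificName = c(NA, "Metazoa", NA)
)

toy <- toy %>%
    mutate(
        scientificName = case_when(
            # rows matching the condition get the fixed name
            verbatimIdentification %in% c("Eukaryota", "") ~ "Incertae sedis",
            # all other rows keep their current value
            .default = scientificName
        )
    )

toy
```

Note that `NA %in% c("Eukaryota", "")` is `FALSE`, so rows where `verbatimIdentification` is `NA` rather than an empty string would need an extra `is.na()` condition.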

In [20]:
occurrence %>%
    filter(is.na(scientificNameID)) %>%
    group_by(verbatimIdentification) %>%
    summarize(n = n()) %>%
    arrange(desc(n))

verbatimIdentification,n
<chr>,<int>
undef_Oomycota,30
Navicula,9
undef_Bacteria_bacteria,7
Clathria_genus,5
Nitzschia_sp._BOLD:AAO7110,4
⋮,⋮
Lobophora_brown_algae,1
Nitzschia_diatoms,1
Synchaetomella_acerina,1
undef_Gastropoda,1


In [21]:
occurrence

eventID,materialSampleID,eventDate,locality,decimalLongitude,decimalLatitude,coordinateUncertaintyInMeters,higherGeography,minimumDepthInMeters,maximumDepthInMeters,⋯,temperature,sampleSizeUnit,asv,organismQuantity,occurrenceID,organismQuantityType,verbatimIdentification,samplingProtocol,scientificName,scientificNameID
<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,⋯,<dbl>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,10,10,⋯,26.3,ml,asv.2,14,EE0493_asv.2,sequence reads,Clausocalanus_furcatus,https://github.com/BeBOP-OBON/UNESCO_protocol_collection,Clausocalanus furcatus,urn:lsid:marinespecies.org:taxname:104503
EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,10,10,⋯,26.3,ml,asv.5,40587,EE0493_asv.5,sequence reads,Eukaryota,https://github.com/BeBOP-OBON/UNESCO_protocol_collection,Incertae sedis,urn:lsid:marinespecies.org:taxname:12
EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,10,10,⋯,26.3,ml,asv.6,7,EE0493_asv.6,sequence reads,Farranula_gibbula,https://github.com/BeBOP-OBON/UNESCO_protocol_collection,Farranula gibbula,urn:lsid:marinespecies.org:taxname:346477
EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,10,10,⋯,26.3,ml,asv.7,29367,EE0493_asv.7,sequence reads,Eukaryota,https://github.com/BeBOP-OBON/UNESCO_protocol_collection,Incertae sedis,urn:lsid:marinespecies.org:taxname:12
EE0493,EE0493,24/04/2023,Ile esprit,46.22536,9.42518,20,Aldabra Atoll,10,10,⋯,26.3,ml,asv.8,72378,EE0493_asv.8,sequence reads,Metazoa,https://github.com/BeBOP-OBON/UNESCO_protocol_collection,Metazoa,urn:lsid:marinespecies.org:taxname:1486573
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋱,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
EE0495,EE0495,02/04/2023,Settlement beach,46.20605,9.400901,98,Aldabra Atoll,12,12,⋯,25.1,ml,asv.16949,1,EE0495_asv.16949,sequence reads,Metazoa,https://github.com/BeBOP-OBON/UNESCO_protocol_collection,Metazoa,urn:lsid:marinespecies.org:taxname:1486573
EE0495,EE0495,02/04/2023,Settlement beach,46.20605,9.400901,98,Aldabra Atoll,12,12,⋯,25.1,ml,asv.16958,1,EE0495_asv.16958,sequence reads,Eukaryota,https://github.com/BeBOP-OBON/UNESCO_protocol_collection,Incertae sedis,urn:lsid:marinespecies.org:taxname:12
EE0495,EE0495,02/04/2023,Settlement beach,46.20605,9.400901,98,Aldabra Atoll,12,12,⋯,25.1,ml,asv.16961,1,EE0495_asv.16961,sequence reads,Eukaryota,https://github.com/BeBOP-OBON/UNESCO_protocol_collection,Incertae sedis,urn:lsid:marinespecies.org:taxname:12
EE0495,EE0495,02/04/2023,Settlement beach,46.20605,9.400901,98,Aldabra Atoll,12,12,⋯,25.1,ml,asv.16962,1,EE0495_asv.16962,sequence reads,Mantoniella_squamata,https://github.com/BeBOP-OBON/UNESCO_protocol_collection,Mantoniella squamata,urn:lsid:marinespecies.org:taxname:134563


### Location

Now let's check the coordinates by plotting the distinct coordinate pairs on a map.

In [22]:
library(leaflet)

stations <- occurrence %>%
    distinct(locality, decimalLongitude, decimalLatitude)

stations

leaflet() %>%
    addTiles() %>%
    addMarkers(lng = stations$decimalLongitude, lat = stations$decimalLatitude, popup = stations$locality)

locality,decimalLongitude,decimalLatitude
<chr>,<dbl>,<dbl>
Ile esprit,46.22536,9.42518
Settlement beach,46.20605,9.400901


There's clearly something wrong with the coordinates: Aldabra Atoll lies south of the equator, but the latitudes are positive. Longitude looks fine, so let's try flipping the sign of the latitude.

In [23]:
occurrence <- occurrence %>%
    mutate(decimalLatitude = -decimalLatitude)

stations <- occurrence %>%
    distinct(locality, decimalLongitude, decimalLatitude)
stations

leaflet() %>%
    addTiles() %>%
    addMarkers(lng = stations$decimalLongitude, lat = stations$decimalLatitude, popup = stations$locality)

locality,decimalLongitude,decimalLatitude
<chr>,<dbl>,<dbl>
Ile esprit,46.22536,-9.42518
Settlement beach,46.20605,-9.400901


The latitudes in the `occurrence` table are now corrected.
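
The visual check on the map can be complemented with a simple programmatic check. The sketch below tests whether stations fall inside an expected bounding box. The box limits and the helper name `inside_bbox()` are assumptions made for this example; replace the limits with values that suit your own study area.

```r
# Hypothetical bounding box around the expected study area (assumed values,
# in decimal degrees); adjust for your own dataset.
expected_bbox <- list(lon_min = 46, lon_max = 47, lat_min = -10, lat_max = -9)

# Station coordinates as they appear in the original sample sheet.
stations <- data.frame(
    locality = c("Ile esprit", "Settlement beach"),
    decimalLongitude = c(46.22536, 46.20605),
    decimalLatitude = c(9.42518, 9.400901)
)

# TRUE for every row whose coordinates fall inside the box.
inside_bbox <- function(df, bbox) {
    df$decimalLongitude >= bbox$lon_min & df$decimalLongitude <= bbox$lon_max &
        df$decimalLatitude >= bbox$lat_min & df$decimalLatitude <= bbox$lat_max
}

inside_bbox(stations, expected_bbox)

# The same check after flipping the sign of the latitude.
flipped <- transform(stations, decimalLatitude = -decimalLatitude)
inside_bbox(flipped, expected_bbox)
```

A check like this catches sign errors in datasets with too many stations to inspect on a map one by one.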

### Time

Now check the event dates.

In [24]:
obistools::check_eventdate(occurrence)

level,row,field,message
<chr>,<int>,<chr>,<chr>
error,1,eventDate,eventDate 24/04/2023 does not seem to be a valid date
error,2,eventDate,eventDate 24/04/2023 does not seem to be a valid date
error,3,eventDate,eventDate 24/04/2023 does not seem to be a valid date
error,4,eventDate,eventDate 24/04/2023 does not seem to be a valid date
error,5,eventDate,eventDate 24/04/2023 does not seem to be a valid date
⋮,⋮,⋮,⋮
error,13371,eventDate,eventDate 02/04/2023 does not seem to be a valid date
error,13372,eventDate,eventDate 02/04/2023 does not seem to be a valid date
error,13373,eventDate,eventDate 02/04/2023 does not seem to be a valid date
error,13374,eventDate,eventDate 02/04/2023 does not seem to be a valid date


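It looks like `eventDate` is in the wrong format: Darwin Core expects ISO 8601 dates (for example `2023-04-24`), but the values are written as day/month/year. Use the `lubridate` package to parse the current date format and convert it.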

In [25]:
library(lubridate)

occurrence <- occurrence %>%
    mutate(eventDate = format_ISO8601(parse_date_time(eventDate, "%d/%m/%Y"), precision = "ymd", usetz = FALSE))

unique(occurrence$eventDate)


Attaching package: ‘lubridate’


The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union




In [23]:
head(occurrence)

eventID,materialSampleID,eventDate,locality,decimalLongitude,decimalLatitude,coordinateUncertaintyInMeters,higherGeography,minimumDepthInMeters,maximumDepthInMeters,⋯,temperature,sampleSizeUnit,asv,organismQuantity,occurrenceID,organismQuantityType,verbatimIdentification,samplingProtocol,scientificName,scientificNameID
<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,⋯,<dbl>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
EE0493,EE0493,2023-04-24,Ile esprit,46.22536,-9.42518,20,Aldabra Atoll,10,10,⋯,26.3,ml,asv.2,14,EE0493_asv.2,sequence reads,Clausocalanus_furcatus,https://github.com/BeBOP-OBON/UNESCO_protocol_collection,Clausocalanus furcatus,urn:lsid:marinespecies.org:taxname:104503
EE0493,EE0493,2023-04-24,Ile esprit,46.22536,-9.42518,20,Aldabra Atoll,10,10,⋯,26.3,ml,asv.5,40587,EE0493_asv.5,sequence reads,Eukaryota,https://github.com/BeBOP-OBON/UNESCO_protocol_collection,Incertae sedis,urn:lsid:marinespecies.org:taxname:12
EE0493,EE0493,2023-04-24,Ile esprit,46.22536,-9.42518,20,Aldabra Atoll,10,10,⋯,26.3,ml,asv.6,7,EE0493_asv.6,sequence reads,Farranula_gibbula,https://github.com/BeBOP-OBON/UNESCO_protocol_collection,Farranula gibbula,urn:lsid:marinespecies.org:taxname:346477
EE0493,EE0493,2023-04-24,Ile esprit,46.22536,-9.42518,20,Aldabra Atoll,10,10,⋯,26.3,ml,asv.7,29367,EE0493_asv.7,sequence reads,Eukaryota,https://github.com/BeBOP-OBON/UNESCO_protocol_collection,Incertae sedis,urn:lsid:marinespecies.org:taxname:12
EE0493,EE0493,2023-04-24,Ile esprit,46.22536,-9.42518,20,Aldabra Atoll,10,10,⋯,26.3,ml,asv.8,72378,EE0493_asv.8,sequence reads,Metazoa,https://github.com/BeBOP-OBON/UNESCO_protocol_collection,Metazoa,urn:lsid:marinespecies.org:taxname:1486573
EE0493,EE0493,2023-04-24,Ile esprit,46.22536,-9.42518,20,Aldabra Atoll,10,10,⋯,26.3,ml,asv.9,35970,EE0493_asv.9,sequence reads,Eukaryota,https://github.com/BeBOP-OBON/UNESCO_protocol_collection,Incertae sedis,urn:lsid:marinespecies.org:taxname:12
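
For comparison, the same date conversion can be sketched in base R without `lubridate`. This is an alternative shown for illustration, using the two date strings from the sample sheet.

```r
raw_dates <- c("24/04/2023", "02/04/2023")

# Parse day/month/year strings; values that do not fit the format become NA.
parsed <- as.Date(raw_dates, format = "%d/%m/%Y")

# Stop early if any date could not be parsed.
stopifnot(!anyNA(parsed))

# Write the dates back as ISO 8601 strings.
iso_dates <- format(parsed, "%Y-%m-%d")
iso_dates
```

Stating the input format explicitly matters here because `02/04/2023` is ambiguous: it is 2 April in this dataset, not 4 February.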


### Missing fields

Let's check if any required fields are missing. The `occurrenceStatus` and `basisOfRecord` fields are not in the table yet, so we add them.

In [28]:
obistools::check_fields(occurrence)

In [27]:
occurrence <- occurrence %>%
    mutate(
        occurrenceStatus = "present",
        basisOfRecord = "MaterialSample"
    )
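
The idea behind a required-fields check can be sketched in base R with `setdiff()`. The list of terms below is an assumption made for this example and is not taken from `obistools`; consult the OBIS manual for the authoritative list.

```r
# Assumed list of required terms for this sketch.
required_terms <- c(
    "occurrenceID", "eventDate", "decimalLongitude", "decimalLatitude",
    "scientificName", "scientificNameID", "occurrenceStatus", "basisOfRecord"
)

# A one-row stand-in for the occurrence table before the fields are added.
toy_occurrence <- data.frame(
    occurrenceID = "EE0493_asv.2",
    eventDate = "2023-04-24",
    decimalLongitude = 46.22536,
    decimalLatitude = -9.42518,
    scientificName = "Clausocalanus furcatus",
    scientificNameID = "urn:lsid:marinespecies.org:taxname:104503"
)

# Terms that are required but not present as columns.
missing_terms <- setdiff(required_terms, names(toy_occurrence))
missing_terms
```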

## MeasurementOrFact

We have several measurements that can be added to the MeasurementOrFact extension: sequence reads, sample volume, DNA extract concentration, and seawater temperature.

In [30]:
mof_reads <- occurrence %>%
    select(occurrenceID, measurementValue = organismQuantity) %>%
    mutate(
        measurementType = "sequence reads"
    )

mof_samplesize <- occurrence %>%
    select(occurrenceID, measurementValue = sampleSizeValue, measurementUnit = sampleSizeUnit) %>%
    mutate(
        measurementType = "sample size",
        measurementTypeID = "http://vocab.nerc.ac.uk/collection/P01/current/VOLWBSMP/",
        measurementUnit = "ml",
        measurementUnitID = "http://vocab.nerc.ac.uk/collection/P06/current/VVML/"
    )

mof_dna <- occurrence %>%
    select(occurrenceID, measurementValue = dna) %>%
    mutate(
        measurementType = "DNA concentration",
        measurementTypeID = "http://vocab.nerc.ac.uk/collection/P01/current/A260DNAX/",
        measurementUnit = "ng/μl",
        measurementUnitID = "http://vocab.nerc.ac.uk/collection/P06/current/UNUL/"
    )

mof_temperature <- occurrence %>%
    select(occurrenceID, measurementValue = temperature) %>%
    mutate(
        measurementType = "seawater temperature",
        measurementTypeID = "http://vocab.nerc.ac.uk/collection/P01/current/TEMPPR01/",
        measurementUnit = "degrees Celsius",
        measurementUnitID = "http://vocab.nerc.ac.uk/collection/P06/current/UPAA/"
    )

# combine all four measurement tables, including temperature
mof <- bind_rows(mof_reads, mof_samplesize, mof_dna, mof_temperature)
mof

occurrenceID,measurementValue,measurementType,measurementUnit,measurementTypeID,measurementUnitID
<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>
EE0493_asv.2,14,sequence reads,,,
EE0493_asv.5,40587,sequence reads,,,
EE0493_asv.6,7,sequence reads,,,
EE0493_asv.7,29367,sequence reads,,,
EE0493_asv.8,72378,sequence reads,,,
⋮,⋮,⋮,⋮,⋮,⋮
EE0495_asv.16949,25.1,seawater temperature,degrees Celsius,http://vocab.nerc.ac.uk/collection/P01/current/TEMPPR01/,http://vocab.nerc.ac.uk/collection/P06/current/UPAA/
EE0495_asv.16958,25.1,seawater temperature,degrees Celsius,http://vocab.nerc.ac.uk/collection/P01/current/TEMPPR01/,http://vocab.nerc.ac.uk/collection/P06/current/UPAA/
EE0495_asv.16961,25.1,seawater temperature,degrees Celsius,http://vocab.nerc.ac.uk/collection/P01/current/TEMPPR01/,http://vocab.nerc.ac.uk/collection/P06/current/UPAA/
EE0495_asv.16962,25.1,seawater temperature,degrees Celsius,http://vocab.nerc.ac.uk/collection/P01/current/TEMPPR01/,http://vocab.nerc.ac.uk/collection/P06/current/UPAA/
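
The measurement blocks above all follow the same pattern, so they could be built with a small helper. The sketch below is one possible way to do that in base R. The function name `make_mof()` and its arguments are made up for this example, and the toy table only reuses a few values from the dataset.

```r
# Hypothetical helper: build one MeasurementOrFact block from one column.
make_mof <- function(df, value_col, type,
                     type_id = NA_character_,
                     unit = NA_character_,
                     unit_id = NA_character_) {
    data.frame(
        occurrenceID = df$occurrenceID,
        measurementValue = df[[value_col]],
        measurementType = type,
        measurementTypeID = type_id,
        measurementUnit = unit,
        measurementUnitID = unit_id,
        stringsAsFactors = FALSE
    )
}

toy_occurrence <- data.frame(
    occurrenceID = c("EE0493_asv.2", "EE0495_asv.2"),
    organismQuantity = c(14L, 2447L),
    temperature = c(26.3, 25.1)
)

# Stack one block per measurement type.
toy_mof <- rbind(
    make_mof(toy_occurrence, "organismQuantity", "sequence reads"),
    make_mof(
        toy_occurrence, "temperature", "seawater temperature",
        type_id = "http://vocab.nerc.ac.uk/collection/P01/current/TEMPPR01/",
        unit = "degrees Celsius",
        unit_id = "http://vocab.nerc.ac.uk/collection/P06/current/UPAA/"
    )
)

toy_mof
```

With a helper like this, every measurement that is prepared also has to be listed in the final `rbind()` call, which makes a forgotten block easier to spot.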


## DNADerivedData
### Reading sequence data

In [31]:
library(Biostrings)

fasta_file <- readDNAStringSet("../dataset/sequences.fasta")
fasta <- data.frame(asv = names(fasta_file), DNA_sequence = paste(fasta_file))
fasta

asv,DNA_sequence
<chr>,<chr>
asv.1,TTTATCAGGTATTGAAAGTCATTCAGGAGGATCTGTTGATTTAGCTATTTTTAGTTTACATTTAGCAGGTGCATCATCTTTATTAGGTGCTATGAATTTTATTACTACTGTAATAAATATGAGAGCTCCTGGTATTAAAATGCATAGAATTCCATTGTTCGTTTGAGCGTTATTTATAACTGCAATATTATTATTATTATCTTTACCAGTTTTAGCAGGAGCAATAACAATGTTATTAACTGATAGAAATTTTAATACTACTTTCTTTGATCCAGCAGGTGGAGGAGATCCAATTTTATATCAACATTTATTT
asv.2,TCTATCAAGAAACGTGGCCCATGCAGGAGGGTCTGTAGACTTTGCAATTTTTTCTCTACACTTAGCAGGTGTTAGATCAATTTTAGGGGCTGTAAACTTTATTAGTACGTTAGGTAATCTACGGGTATTCGGAATACTTTTAGACCGTATCCCTTTATTTGCCTGGTCCGTCTTGGTGACAGCTATTTTATTGTTATTGTCTTTACCTGTACTTGCCGGTGCCATTACGATATTATTAACTGACCGAAATTTAAATACTTCATTTTATGATGTTGGAGGGGGAGGAGACCCTATTCTCTACCAACACCTATTT
asv.3,TTTGTCAAATTCAGTTTATCATTTTGGTGGATCAATTGATTTGGCAATTTTTAGATTACATGTAGCTGGTGTATCTTCAATTTTAGGTGGGATTAATTTTATTACTACATGTTTGAAAGGGAAAATTAGATATGTTTTGAGTTTAGAGTTTTTAACTTTATTTGTATGGGCTATAGTTGTTACTAGATTTTTGTTAGTTTTGAGGTTACCAGTTTTAGCTGGTGGGATTACTATATTATTATTAGATCGAAATTTTGGTTCTTCTTTTTTTGATCCTAGTGGGGGTGGGAATCCTATTTTATATCAACATTTGTTT
asv.4,TTTATCAAGAAATTTATCTCATAGAGGCCCAGCAGTAGATATGGCTATTTTTTCTCTTCATTTAGCAGGTGCATCTTCTATTTTAGGCTCTATTAATTTTATATCAACAATTAAAAATATACGACCAAAAGAAATGAACCCAGAAAATGTTCCACTATTTGTGTGGGCAGTAGTGATTACTACTCTTCTTTTACTTCTTTCTCTTCCAGTCTTAGCCGGGGCTATTACTATACTTTTAACAGACCGTAATTTAAATACATCCTTCTTTGATCCTGCCGGAGGGGGAGATCCAATTCTATACCAGCACTTATTT
asv.5,ATTAGCAAGTATTGCATTCCACTCAGGAGGAGCAGTTGATTGTGCAATTTTTGCTCTTCACGTTGCAGGAGCTTCATCAATTCTTGGAGCGGTAAACTTCATTACAACTGTAATGAACATGCGAGCTCCAGGAATTAGCCTTCACCGAATGCCTTTATTCGTATGAAGTGTTTTCGTAACTGTTGTACTTCTTTTATTAGCTGTACCTGTACTTGCAGGAGCAATTACAATGCTATTAACAGATCGAAACTTCAACACAACTTTCTTTGATCCAGCTGGAGGAGGAGATCCTGTACTTTATCAGCACCTTTTC
⋮,⋮
asv.16981,GATCAGTTAAAAGCATAGTCAGAGCACCGGCAAGAACTGGTAAGGATAACAAAAGCAAATAAGCAGTTATCAAAACAGCTCAAGCAAATAAGGGAATACGGTGCTCTAACATACCCAAACAGCGCATATTAGTAATAGTAGCAATAAAAT
asv.16982,ATATCTACGCATTTCACCGCTCCACCTAGAGTTCCAGGCACCTCTACCAACCTCGAGCACGGCAGTACACCAAGCAGTTCCACGTTTGAGACGTGGGATTTCACAAGGTACTTACCGAGCCGTCTACACGCGCTTTAAGCCCAGTGATTC
asv.16983,TGATCTATTTTAGTTACTACGATTCTTCTTTTACTATCATTACCAGTTCTTGCCGGAGCAATCACAATGCTTCTTTTAGATCGAAATTTTAACACTTCTTTCTTCGACCCAGCAAGAGGGGGAGACCCTGTATTATATCAACATTTATTT
asv.16984,TGATCAATTTTAGTTACAGCTTTTTTATTATTACTTTCTCTACCAGTTCTAGCGGGGGCAATAACAATGCTCTTGACCGATAGAAATTTTAATACCGCTTTTTTTGATCCAGCGGGAGGAGGGGATCCTATTTTATACCAACATCTATTC
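
To make clear what reading a FASTA file involves, here is a minimal base R sketch that builds the same two-column table without `Biostrings`. The function name `read_fasta_simple()` is made up for this example. It assumes that every record starts with a `>` header line followed by at least one sequence line; for real work the `Biostrings` reader above is the safer choice.

```r
# Minimal FASTA reader (hypothetical helper). Assumes each record has a
# ">" header line followed by one or more sequence lines.
read_fasta_simple <- function(lines) {
    lines <- lines[nzchar(lines)]          # drop empty lines
    is_header <- startsWith(lines, ">")
    record <- cumsum(is_header)            # record number for every line
    # Concatenate the sequence lines that belong to the same record.
    sequences <- tapply(lines[!is_header], record[!is_header], paste, collapse = "")
    data.frame(
        asv = sub("^>", "", lines[is_header]),
        DNA_sequence = as.vector(sequences),
        stringsAsFactors = FALSE
    )
}

# Toy input with one sequence split over two lines.
toy_lines <- c(">asv.1", "TTTATCAGG", "TATTGAAAG", ">asv.2", "TCTATCAAG")
toy_fasta <- read_fasta_simple(toy_lines)
toy_fasta
```

On a real file the input would come from `readLines()`, for example `read_fasta_simple(readLines("../dataset/sequences.fasta"))`.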


In [32]:
dna <- occurrence %>%
    select(occurrenceID, asv, concentration = dna) %>%
    left_join(fasta, by = "asv")

dna

occurrenceID,asv,concentration,DNA_sequence
<chr>,<chr>,<dbl>,<chr>
EE0493_asv.2,asv.2,7.23,TCTATCAAGAAACGTGGCCCATGCAGGAGGGTCTGTAGACTTTGCAATTTTTTCTCTACACTTAGCAGGTGTTAGATCAATTTTAGGGGCTGTAAACTTTATTAGTACGTTAGGTAATCTACGGGTATTCGGAATACTTTTAGACCGTATCCCTTTATTTGCCTGGTCCGTCTTGGTGACAGCTATTTTATTGTTATTGTCTTTACCTGTACTTGCCGGTGCCATTACGATATTATTAACTGACCGAAATTTAAATACTTCATTTTATGATGTTGGAGGGGGAGGAGACCCTATTCTCTACCAACACCTATTT
EE0493_asv.5,asv.5,7.23,ATTAGCAAGTATTGCATTCCACTCAGGAGGAGCAGTTGATTGTGCAATTTTTGCTCTTCACGTTGCAGGAGCTTCATCAATTCTTGGAGCGGTAAACTTCATTACAACTGTAATGAACATGCGAGCTCCAGGAATTAGCCTTCACCGAATGCCTTTATTCGTATGAAGTGTTTTCGTAACTGTTGTACTTCTTTTATTAGCTGTACCTGTACTTGCAGGAGCAATTACAATGCTATTAACAGATCGAAACTTCAACACAACTTTCTTTGATCCAGCTGGAGGAGGAGATCCTGTACTTTATCAGCACCTTTTC
EE0493_asv.6,asv.6,7.23,TAAGTGGAAACCTTTCCCACTCAGGAGCTTCCGTAGACTACGCAATTTTCTCTCTTCACTTAGCAGGAGTTTCATCTTTGTTAGGAGCTGTGAATTTTATTAGCACTCTTAGAAATCTTCGAGTTTTTGGGATAATGCTAGACCGTTTACCTCTATTCGCTTGAGCAGTTTTAGTTACTGCTATTTTACTTCTTCTTTCTTTACCAGTATTAGCTGGGGCTATCACTATGCTGCTCACTGACCGAAATTTTAATACGTCTTTTTACGACCCAAGAGGGGGAGGAGACCCACTTCTTTACCAGCACTTATTT
EE0493_asv.7,asv.7,7.23,ACTAGCAAGTATTGCATTCCACTCAGGAGGAGCAGTTGATTGTGCAATTTTCGCTCTTCACGTTGCAGGAGCTTCTTCAATTCTTGGAGCAGTAAACTTCATTACAACTGTAATGAACATGCGAGCTCCAGGAATTAGTCTTCACCGAATGCCTCTATTCGTATGAAGTATCTTTGTAACTGTTGTACTTCTTTTATTAGCTGTACCTGTACTTGCAGGAGCAATTACAATGCTATTAACAGATCGAAACTTTAACACAACTTTCTTTGACCCAGCAGGGGGAGGAGATCCTGTACTTTACCAGCACCTTTTC
EE0493_asv.8,asv.8,7.23,CCTAGCAGGTAATCTTGCTCACGCAGGACCTTCTGTAGACTTAGCTATTTTTTCGCTTCACCTGGCTGGGATTTCATCCATCTTGGGTGCCCTTAACTTTATTACTACAGTTATTAATATGCGATGAAAAGGACTCCGTCTAGAACGAATCCCGTTATTTGTATGAGCCGTAGTAATCACCGCAGTCCTTTTACTTCTATCGCTTCCAGTTCTTGCTGGAGCAATTACTATGCTCCTGACCGACCGTAATTTAAATACTGCATTCTTCGACCCTGCCGGAGGGGGAGACCCTATCTTATACCAGCATCTCTTT
⋮,⋮,⋮,⋮
EE0495_asv.16949,asv.16949,15.83,ATATCTACGAATTTCACCTCTACACTAGGAATTCCACACTCCCCTCCCGGATTCTAGATGAACAGTTTTAAAGGCAGTTCCCAGGTTGAGCCCGGGGCTTTCACCTCTAACTTGTCCATCCGCCTACACGCCCTTTACGCCCAGTGATTC
EE0495_asv.16958,asv.16958,15.83,TGGTCGGTCTTAATTACAGCTTTTCTGTTGCTACTTTCTCTTCCGGTTTTGGCTGGTGGTATTACCATGCTGCTGACCGACAGAAACTTCAATACCACTTTCTTTGACCCCGCCGGTGGTGGTGATCCTGTGCTTTACCAGCATTTGTTC
EE0495_asv.16961,asv.16961,15.83,TGAGCTGTTTTCATTACTGCATTTTTATTACTACTTTCTTTACCAGTATTAGCTGGAGCAATTACTATGTTATTAACAGATAGAAATTTTAACACTTCATTTTTTGACCCTGCTGGAGGAGGAGACCCTGTTTTATACCAGCATTTGTTT
EE0495_asv.16962,asv.16962,15.83,TGGAGTGTGCTTATTACAGCATTCCTTCTGCTTCTTTCTCTTCCAGTACTTGCGGGAGCAATTACAATGCTTCTCACCGATCGTAACTTTTCAACAAGTTTCTTCGATCCAAGTGGAGGAGGTGATCCAATTTTATATCAGCACCTTTTC


### Adding metadata

We have a file with some sequencing metadata. Print the file contents and add the corresponding fields to the DNADerivedData table.

In [33]:
cat(paste0(readLines("../dataset/metadata.txt"), collapse = "\n"))

eDNA Expeditions sequencing info

target gene: COI
forward primer: GGWACWGGWTGAACWGTWTAYCCYCC
reverse primer: TANACYTCNGGRTGNCCRAARAAYCA
forward primer name: mlCOIintF
reverse primer name: dgHCO2198
primer reference: doi:10.1186/1742-9994-10-34
library layout: paired
sequencing platform: Illumina NovaSeq6000
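
Instead of copying the values by hand, the `key: value` lines could also be parsed into a named vector. The sketch below is one possible approach in base R, using a few of the lines printed above as inline input. It assumes that keys and values are separated by the first `: ` on each line.

```r
metadata_lines <- c(
    "eDNA Expeditions sequencing info",
    "",
    "target gene: COI",
    "forward primer: GGWACWGGWTGAACWGTWTAYCCYCC",
    "primer reference: doi:10.1186/1742-9994-10-34",
    "library layout: paired"
)

# Keep only the lines that contain a "key: value" pair.
kv_lines <- grep(": ", metadata_lines, fixed = TRUE, value = TRUE)

# Split each line at the first ": ".
keys <- sub(": .*$", "", kv_lines)
values <- sub("^[^:]*: ", "", kv_lines)

metadata <- setNames(values, keys)
metadata[["target gene"]]
metadata[["primer reference"]]
```

The values can then be used directly, for example `target_gene = metadata[["target gene"]]` inside `mutate()`, which avoids typing errors in long primer sequences.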

In [34]:
dna <- dna %>%
    mutate(
        # each value is a single string, so mutate() repeats it on every row
        concentrationUnit = "ng/μl",
        lib_layout = "paired",
        target_gene = "COI",
        pcr_primers = "FWD:GGWACWGGWTGAACWGTWTAYCCYCC;REV:TANACYTCNGGRTGNCCRAARAAYCA",
        seq_meth = "Illumina NovaSeq6000",
        ref_db = "https://github.com/iobis/edna-reference-databases",
        pcr_primer_forward = "GGWACWGGWTGAACWGTWTAYCCYCC",
        pcr_primer_reverse = "TANACYTCNGGRTGNCCRAARAAYCA",
        pcr_primer_name_forward = "mlCOIintF",
        pcr_primer_name_reverse = "dgHCO2198",
        pcr_primer_reference = "doi:10.1186/1742-9994-10-34"
    ) %>%
    # drop the asv column, which is not part of the DNADerivedData output
    select(-asv)

dna

occurrenceID,concentration,DNA_sequence,concentrationUnit,lib_layout,target_gene,pcr_primers,seq_meth,ref_db,pcr_primer_forward,pcr_primer_reverse,pcr_primer_name_forward,pcr_primer_name_reverse,pcr_primer_reference
<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
EE0493_asv.2,7.23,TCTATCAAGAAACGTGGCCCATGCAGGAGGGTCTGTAGACTTTGCAATTTTTTCTCTACACTTAGCAGGTGTTAGATCAATTTTAGGGGCTGTAAACTTTATTAGTACGTTAGGTAATCTACGGGTATTCGGAATACTTTTAGACCGTATCCCTTTATTTGCCTGGTCCGTCTTGGTGACAGCTATTTTATTGTTATTGTCTTTACCTGTACTTGCCGGTGCCATTACGATATTATTAACTGACCGAAATTTAAATACTTCATTTTATGATGTTGGAGGGGGAGGAGACCCTATTCTCTACCAACACCTATTT,ng/μl,paired,COI,FWD:GGWACWGGWTGAACWGTWTAYCCYCC;REV:TANACYTCNGGRTGNCCRAARAAYCA,Illumina NovaSeq6000,https://github.com/iobis/edna-reference-databases,GGWACWGGWTGAACWGTWTAYCCYCC,TANACYTCNGGRTGNCCRAARAAYCA,mlCOIintF,dgHCO2198,doi:10.1186/1742-9994-10-34
EE0493_asv.5,7.23,ATTAGCAAGTATTGCATTCCACTCAGGAGGAGCAGTTGATTGTGCAATTTTTGCTCTTCACGTTGCAGGAGCTTCATCAATTCTTGGAGCGGTAAACTTCATTACAACTGTAATGAACATGCGAGCTCCAGGAATTAGCCTTCACCGAATGCCTTTATTCGTATGAAGTGTTTTCGTAACTGTTGTACTTCTTTTATTAGCTGTACCTGTACTTGCAGGAGCAATTACAATGCTATTAACAGATCGAAACTTCAACACAACTTTCTTTGATCCAGCTGGAGGAGGAGATCCTGTACTTTATCAGCACCTTTTC,ng/μl,paired,COI,FWD:GGWACWGGWTGAACWGTWTAYCCYCC;REV:TANACYTCNGGRTGNCCRAARAAYCA,Illumina NovaSeq6000,https://github.com/iobis/edna-reference-databases,GGWACWGGWTGAACWGTWTAYCCYCC,TANACYTCNGGRTGNCCRAARAAYCA,mlCOIintF,dgHCO2198,doi:10.1186/1742-9994-10-34
EE0493_asv.6,7.23,TAAGTGGAAACCTTTCCCACTCAGGAGCTTCCGTAGACTACGCAATTTTCTCTCTTCACTTAGCAGGAGTTTCATCTTTGTTAGGAGCTGTGAATTTTATTAGCACTCTTAGAAATCTTCGAGTTTTTGGGATAATGCTAGACCGTTTACCTCTATTCGCTTGAGCAGTTTTAGTTACTGCTATTTTACTTCTTCTTTCTTTACCAGTATTAGCTGGGGCTATCACTATGCTGCTCACTGACCGAAATTTTAATACGTCTTTTTACGACCCAAGAGGGGGAGGAGACCCACTTCTTTACCAGCACTTATTT,ng/μl,paired,COI,FWD:GGWACWGGWTGAACWGTWTAYCCYCC;REV:TANACYTCNGGRTGNCCRAARAAYCA,Illumina NovaSeq6000,https://github.com/iobis/edna-reference-databases,GGWACWGGWTGAACWGTWTAYCCYCC,TANACYTCNGGRTGNCCRAARAAYCA,mlCOIintF,dgHCO2198,doi:10.1186/1742-9994-10-34
EE0493_asv.7,7.23,ACTAGCAAGTATTGCATTCCACTCAGGAGGAGCAGTTGATTGTGCAATTTTCGCTCTTCACGTTGCAGGAGCTTCTTCAATTCTTGGAGCAGTAAACTTCATTACAACTGTAATGAACATGCGAGCTCCAGGAATTAGTCTTCACCGAATGCCTCTATTCGTATGAAGTATCTTTGTAACTGTTGTACTTCTTTTATTAGCTGTACCTGTACTTGCAGGAGCAATTACAATGCTATTAACAGATCGAAACTTTAACACAACTTTCTTTGACCCAGCAGGGGGAGGAGATCCTGTACTTTACCAGCACCTTTTC,ng/μl,paired,COI,FWD:GGWACWGGWTGAACWGTWTAYCCYCC;REV:TANACYTCNGGRTGNCCRAARAAYCA,Illumina NovaSeq6000,https://github.com/iobis/edna-reference-databases,GGWACWGGWTGAACWGTWTAYCCYCC,TANACYTCNGGRTGNCCRAARAAYCA,mlCOIintF,dgHCO2198,doi:10.1186/1742-9994-10-34
EE0493_asv.8,7.23,CCTAGCAGGTAATCTTGCTCACGCAGGACCTTCTGTAGACTTAGCTATTTTTTCGCTTCACCTGGCTGGGATTTCATCCATCTTGGGTGCCCTTAACTTTATTACTACAGTTATTAATATGCGATGAAAAGGACTCCGTCTAGAACGAATCCCGTTATTTGTATGAGCCGTAGTAATCACCGCAGTCCTTTTACTTCTATCGCTTCCAGTTCTTGCTGGAGCAATTACTATGCTCCTGACCGACCGTAATTTAAATACTGCATTCTTCGACCCTGCCGGAGGGGGAGACCCTATCTTATACCAGCATCTCTTT,ng/μl,paired,COI,FWD:GGWACWGGWTGAACWGTWTAYCCYCC;REV:TANACYTCNGGRTGNCCRAARAAYCA,Illumina NovaSeq6000,https://github.com/iobis/edna-reference-databases,GGWACWGGWTGAACWGTWTAYCCYCC,TANACYTCNGGRTGNCCRAARAAYCA,mlCOIintF,dgHCO2198,doi:10.1186/1742-9994-10-34
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
EE0495_asv.16949,15.83,ATATCTACGAATTTCACCTCTACACTAGGAATTCCACACTCCCCTCCCGGATTCTAGATGAACAGTTTTAAAGGCAGTTCCCAGGTTGAGCCCGGGGCTTTCACCTCTAACTTGTCCATCCGCCTACACGCCCTTTACGCCCAGTGATTC,ng/μl,paired,COI,FWD:GGWACWGGWTGAACWGTWTAYCCYCC;REV:TANACYTCNGGRTGNCCRAARAAYCA,Illumina NovaSeq6000,https://github.com/iobis/edna-reference-databases,GGWACWGGWTGAACWGTWTAYCCYCC,TANACYTCNGGRTGNCCRAARAAYCA,mlCOIintF,dgHCO2198,doi:10.1186/1742-9994-10-34
EE0495_asv.16958,15.83,TGGTCGGTCTTAATTACAGCTTTTCTGTTGCTACTTTCTCTTCCGGTTTTGGCTGGTGGTATTACCATGCTGCTGACCGACAGAAACTTCAATACCACTTTCTTTGACCCCGCCGGTGGTGGTGATCCTGTGCTTTACCAGCATTTGTTC,ng/μl,paired,COI,FWD:GGWACWGGWTGAACWGTWTAYCCYCC;REV:TANACYTCNGGRTGNCCRAARAAYCA,Illumina NovaSeq6000,https://github.com/iobis/edna-reference-databases,GGWACWGGWTGAACWGTWTAYCCYCC,TANACYTCNGGRTGNCCRAARAAYCA,mlCOIintF,dgHCO2198,doi:10.1186/1742-9994-10-34
EE0495_asv.16961,15.83,TGAGCTGTTTTCATTACTGCATTTTTATTACTACTTTCTTTACCAGTATTAGCTGGAGCAATTACTATGTTATTAACAGATAGAAATTTTAACACTTCATTTTTTGACCCTGCTGGAGGAGGAGACCCTGTTTTATACCAGCATTTGTTT,ng/μl,paired,COI,FWD:GGWACWGGWTGAACWGTWTAYCCYCC;REV:TANACYTCNGGRTGNCCRAARAAYCA,Illumina NovaSeq6000,https://github.com/iobis/edna-reference-databases,GGWACWGGWTGAACWGTWTAYCCYCC,TANACYTCNGGRTGNCCRAARAAYCA,mlCOIintF,dgHCO2198,doi:10.1186/1742-9994-10-34
EE0495_asv.16962,15.83,TGGAGTGTGCTTATTACAGCATTCCTTCTGCTTCTTTCTCTTCCAGTACTTGCGGGAGCAATTACAATGCTTCTCACCGATCGTAACTTTTCAACAAGTTTCTTCGATCCAAGTGGAGGAGGTGATCCAATTTTATATCAGCACCTTTTC,ng/μl,paired,COI,FWD:GGWACWGGWTGAACWGTWTAYCCYCC;REV:TANACYTCNGGRTGNCCRAARAAYCA,Illumina NovaSeq6000,https://github.com/iobis/edna-reference-databases,GGWACWGGWTGAACWGTWTAYCCYCC,TANACYTCNGGRTGNCCRAARAAYCA,mlCOIintF,dgHCO2198,doi:10.1186/1742-9994-10-34
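
Before writing the table, it can be worth running a few sanity checks on it. The following is a minimal sketch, assuming each occurrence has a single DNADerivedData row. `dna_example` is a hypothetical stand-in for `dna` with truncated sequences, and the check names are illustrative.

```r
# Hypothetical stand-in for the dna table (sequences truncated for illustration)
dna_example <- data.frame(
    occurrenceID = c("EE0493_asv.2", "EE0493_asv.5", "EE0495_asv.16962"),
    DNA_sequence = c("TCTATCAAGAAACG", "ATTAGCAAGTATTG", "TGGAGTGTGCTTAT"),
    pcr_primer_forward = "GGWACWGGWTGAACWGTWTAYCCYCC",
    stringsAsFactors = FALSE
)

# every occurrenceID should appear only once
ids_unique <- !any(duplicated(dna_example$occurrenceID))

# sequences and primers should only contain IUPAC nucleotide codes
iupac <- "^[ACGTURYSWKMBDHVN]+$"
sequences_valid <- all(grepl(iupac, dna_example$DNA_sequence))
primers_valid <- all(grepl(iupac, dna_example$pcr_primer_forward))

# stop with an error if any of the checks fails
stopifnot(ids_unique, sequences_valid, primers_valid)
```

The primers use degenerate bases such as `W`, `Y`, `R` and `N`, which is why the pattern covers the full IUPAC alphabet and not just `ACGT`.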


## Output

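Write the occurrence, MeasurementOrFact, and DNADerivedData tables to tab-delimited text files, then package them as a zipped Darwin Core Archive.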

In [35]:
# drop the columns that should not be written to the occurrence file
occurrence <- occurrence %>%
    select(-asv, -dna, -temperature)

write.table(occurrence, file = file.path(output_dir, "occurrence.txt"), sep = "\t", row.names = FALSE, na = "", quote = FALSE)
write.table(mof, file = file.path(output_dir, "measurementorfact.txt"), sep = "\t", row.names = FALSE, na = "", quote = FALSE)
write.table(dna, file = file.path(output_dir, "dnaderiveddata.txt"), sep = "\t", row.names = FALSE, na = "", quote = FALSE)
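
Because the files are written with `quote = FALSE` and `na = ""`, a quick round trip can confirm that they read back with the same shape. The following is a minimal sketch using a temporary file; `small_table` is a hypothetical two-row table, and its missing concentration is made up to show how `NA` values are handled.

```r
# Hypothetical two-row table; the NA is invented to illustrate missing values
small_table <- data.frame(
    occurrenceID = c("EE0493_asv.2", "EE0493_asv.5"),
    concentration = c(7.23, NA),
    target_gene = "COI",
    stringsAsFactors = FALSE
)

# write with the same settings as above, but to a temporary file
path <- tempfile(fileext = ".txt")
write.table(small_table, file = path, sep = "\t", row.names = FALSE, na = "", quote = FALSE)

# read back with quoting disabled and empty fields treated as missing
roundtrip <- read.table(path, sep = "\t", header = TRUE, quote = "",
                        na.strings = "", stringsAsFactors = FALSE)

same_shape <- identical(dim(roundtrip), dim(small_table))
same_names <- identical(names(roundtrip), names(small_table))
```

With quoting turned off, a value that itself contains a tab or a line break would shift the columns, so a mismatch in shape is a hint to inspect the text fields.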

In [36]:
library(dwcawriter)

archive <- list(
    eml = '<eml:eml packageId="https://obis.org/dummydataset/v1.0" scope="system" system="http://gbif.org" xml:lang="en" xmlns:dc="http://purl.org/dc/terms/" xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 http://rs.gbif.org/schema/eml-gbif-profile/1.2/eml.xsd">
        <dataset>
        <title xml:lang="en">Dummy Dataset</title>
        </dataset>
    </eml:eml>',
    core = list(
        name = "occurrence",
        type = "https://rs.gbif.org/core/dwc_occurrence_2022-02-02.xml",
        index = which(names(occurrence) == "occurrenceID"),
        data = occurrence
    ),
    extensions = list(
        list(
            name = "measurementorfact",
            type = "https://rs.gbif.org/extension/obis/extended_measurement_or_fact_2023-08-28.xml",
            index = which(names(mof) == "occurrenceID"),
            data = mof
        ),
        list(
            name = "dnaderiveddata",
            type = "https://rs.gbif.org/extension/gbif/1.0/dna_derived_data_2022-02-23.xml",
            index = which(names(dna) == "occurrenceID"),
            data = dna
        )
    )
)

write_dwca(archive, file.path(output_dir, "archive.zip"))
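
The `index` entries above are computed with `which()`, which returns the 1-based position of the `occurrenceID` column, or an empty vector if the column is missing. The following is a minimal sketch of a stricter lookup. `example_names` and `find_id_column()` are hypothetical and not part of `dwcawriter`; how `write_dwca()` itself reacts to an empty index is not covered here.

```r
# Hypothetical column layout; the real tables have more columns
example_names <- c("eventID", "occurrenceID", "scientificName")

# which() returns the 1-based position of the matching column
id_index <- which(example_names == "occurrenceID")

# helper that fails loudly instead of silently returning integer(0)
find_id_column <- function(column_names, id = "occurrenceID") {
    index <- which(column_names == id)
    if (length(index) != 1) {
        stop("expected exactly one '", id, "' column, found ", length(index))
    }
    index
}
```

In the archive definition, `find_id_column(names(occurrence))` could then replace `which(names(occurrence) == "occurrenceID")`, so that a missing or duplicated identifier column is caught before the archive is written.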