
Run write_dwc() from CSV files #348

@peterdesmet


Rationale

In the March 31 VLIZ-INBO meeting I suggested having write_dwc() run from the CSV files generated by download_acoustic_dataset() rather than from the database. This is in line with the idea that a Marine Data Archive for an animal acoustic project will contain both the source data and the Darwin Core Archive data:

# source data, generated with download_acoustic_dataset()
datapackage.json
animals.csv
tags.csv
acoustic_detections.csv
archival_data.csv
deployments.csv
receivers.csv
projects.csv

# DwC-A data, generated with write_dwc()
dwc_occurrence.csv
dwc_emof.csv
meta.xml

Running from the CSV files has several advantages:

  • Only need to query data from DB once (in download_acoustic_dataset()).
  • Darwin Core data will always be consistent with the CSV files. Currently the two can drift apart, e.g. when write_dwc() is run weeks later (and DB data have been updated in the meantime) or when the scientific_name argument was used in download_acoustic_dataset() (that argument is not available in write_dwc()).
  • Can update datapackage.json to reference Darwin Core files.
  • Once you have the CSV files, it's faster to run write_dwc().
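
The third point (referencing the Darwin Core files from datapackage.json) could look roughly like the sketch below. It is written in Python purely for illustration; the actual change would live in the R package. The helper name `add_dwc_resources` is hypothetical, and the sketch assumes that listing each CSV as a resource with a `name` and `path` is enough. meta.xml is left out.

```python
import json
from pathlib import Path


def add_dwc_resources(descriptor, dwc_files):
    """Return a copy of a Data Package descriptor that also lists the DwC files.

    Hypothetical helper for illustration; not part of etn or frictionless.
    """
    updated = dict(descriptor)
    resources = list(descriptor.get("resources", []))
    existing = {resource.get("name") for resource in resources}
    for filename in dwc_files:
        name = Path(filename).stem  # e.g. "dwc_occurrence"
        if name not in existing:  # avoid duplicates when run twice
            resources.append({"name": name, "path": filename})
    updated["resources"] = resources
    return updated


# Minimal descriptor standing in for the real datapackage.json
descriptor = {"resources": [{"name": "animals", "path": "animals.csv"}]}
updated = add_dwc_resources(descriptor, ["dwc_occurrence.csv", "dwc_emof.csv"])
print(json.dumps(updated, indent=2))
```

The original descriptor is left untouched, so the function is safe to call repeatedly while iterating on the output.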

The process would thus be:

  1. Run download_acoustic_dataset()
  2. Quality assurance
  3. Fix errors in database
  4. Repeat steps 1-3 until everything is correct
  5. Run write_dwc() on local CSV files

Implementation

Implementation would be similar to https://inbo.github.io/movepub/reference/write_dwc.html, where a Frictionless Data Package is provided.

  • Discuss with @PietrH what branch to use

Parameters

  • package (no default): a Frictionless Data Package, as returned by frictionless::read_package(). Alternatively, we ask the user for an input directory.
  • connection: remove
  • animal_project_code: remove, context is provided by package
  • directory (no default): output directory
  • contact (cf. movepub): not sure this is needed
  • rights_holder (default NULL)
  • license (default "CC-BY")

Error checking

  • Check that all required resources are available. I assume those will be at least animals and detections.
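
A minimal sketch of such a check, in Python for illustration only (the real check would be written in R). The resource names `animals` and `acoustic_detections` are assumptions based on the CSV file names listed above, and both function names are hypothetical.

```python
# Assumed resource names, derived from animals.csv and acoustic_detections.csv
REQUIRED_RESOURCES = ("animals", "acoustic_detections")


def missing_resources(descriptor, required=REQUIRED_RESOURCES):
    """Return the required resource names that the descriptor does not list."""
    available = {resource.get("name") for resource in descriptor.get("resources", [])}
    return sorted(set(required) - available)


def check_required_resources(descriptor, required=REQUIRED_RESOURCES):
    """Raise a ValueError naming every missing resource at once."""
    missing = missing_resources(descriptor, required)
    if missing:
        raise ValueError(
            "Data Package is missing required resources: " + ", ".join(missing)
        )


# Example: a package that only contains animals
descriptor = {"resources": [{"name": "animals", "path": "animals.csv"}]}
print(missing_resources(descriptor))
```

Reporting all missing resources in one error saves the user from fixing them one run at a time.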

Transformation

  • Convert dwc_occurrence.sql (https://github.com/inbo/etn/blob/main/inst/sql/dwc_occurrence.sql) to dplyr.
  • Test that all necessary information is available in the source CSVs. If not, then download_acoustic_dataset() should be updated.
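
The second point could start as a simple header check per CSV file. The sketch below is in Python for illustration only. The function name `missing_columns` and the column names in the example are hypothetical placeholders, not the fields that dwc_occurrence.sql actually uses.

```python
import csv
import io


def missing_columns(csv_file, required):
    """Return the required column names absent from the CSV header row."""
    header = next(csv.reader(csv_file), [])  # empty file -> empty header
    return [column for column in required if column not in header]


# In-memory stand-in for a source CSV; column names are placeholders
sample = io.StringIO("animal_id,scientific_name\n1,Anguilla anguilla\n")
print(missing_columns(sample, ["animal_id", "scientific_name", "tag_serial_number"]))
```

Any column reported here would be a candidate for adding to the output of download_acoustic_dataset().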

Testing
