
Data Set Archiving



Description of the DAC and NCEI data set archiving process.


Background

The National Centers for Environmental Information (NCEI) archives real-time and delayed-mode DAC glider data sets. Archived data sets are assigned a permanent accession record that identifies all versions of each data set. The Glider DAC creates archival packages, which are picked up by NCEI according to a pre-defined archiving schedule. Archival packages are distributed via the NCEI GeoPortal, a web application that provides both a graphical search interface and an API.

NCEI provides documentation on searching via both the GUI and the API; however, we have also created a simplified set of searchable web pages containing all Glider DAC accession records.

Archival Process

Data sets archived by NCEI are assigned an accession record, which is a permanent identifier for the archival package and all subsequent revisions to the data set. The accession record points to a single archive location containing all versions of both the real-time and delayed-mode data sets. For the hypothetical data set with id peggyo-19731210T2000, all of the following versions of the data set belong to the same accession record:

  1. Uncorrected, low resolution data set submitted in real-time
  2. QC'd, low resolution data set submitted just after recovery
  3. QC'd, high resolution data set submitted after recovery
  4. QC'd with new algorithms, high resolution data set submitted after #3

The accession record will point to the same archival package, which will contain all four versions of the data set, available under the Lineage tab of the NCEI GeoPortal web application.

Glider DAC Archiving Process

Each data set located at the DAC is archived if the following criteria are met:

  • Marked as 'Complete' on the data set registration page
  • Marked as 'Submit to NCEI on Completion' on the data set registration page
  • All NetCDF files submitted by the user pass the IOOS Glider DAC compliance check
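
As a hedged illustration of the third criterion, the sketch below (not the DAC's production code) shows how a single profile file might be checked from Python by calling the IOOS compliance checker command-line tool. The "gliderdac" test name, the assumption that the CLI exits non-zero when any check fails, and the example file name are assumptions made for illustration.

    # Sketch: run the IOOS compliance checker on a single profile NetCDF file.
    # Assumes the compliance-checker package and its glider plugin are installed.
    import subprocess
    import sys

    def passes_gliderdac_check(nc_file: str) -> bool:
        """Return True if the profile file passes the 'gliderdac' compliance test."""
        result = subprocess.run(
            ["compliance-checker", "--test", "gliderdac", nc_file],
            capture_output=True,
            text=True,
        )
        # Assumption: the CLI exits with a non-zero status when any check fails.
        return result.returncode == 0

    if __name__ == "__main__":
        # Hypothetical example file name following the DATASET_ID convention.
        nc_file = sys.argv[1] if len(sys.argv) > 1 else "peggyo-19731210T2000.nc"
        status = "passes" if passes_gliderdac_check(nc_file) else "fails"
        print(f"{nc_file} {status} the IOOS Glider DAC compliance check")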

Assuming these criteria are met, all individual profile-based NetCDF files submitted by the data provider are aggregated into a TrajectoryProfile discrete sampling geometry NetCDF-3 file. An md5 sum is calculated and stored as an extended file attribute, user.md5sum, on the NetCDF file. The file naming convention for this file is:

    DATASET_ID.ncCF.nc3.nc

and the file is stored in the following location:

    /data/data/pub_erddap/USERNAME/DATASET_ID

which is symlinked to:

    /data/data/archive

An md5 checksum is calculated for each aggregated NetCDF data set and written to a file with the following naming convention (a sketch of this checksum step appears after the path listings below):

    DATASET_ID.ncCF.nc3.nc.md5

The .md5 files are written directly (not symlinked) to:

    /data/data/archive
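
The following is a minimal sketch of the checksum step described above (not the DAC's production code): compute the md5 of the aggregated file, store it in the user.md5sum extended attribute, and write the sidecar .md5 file into the archive directory. The function names are illustrative, and os.setxattr requires a Linux filesystem with extended-attribute support.

    # Sketch: checksum an aggregated NetCDF file and record the hash in two places.
    import hashlib
    import os

    def md5sum(path: str, chunk_size: int = 1024 * 1024) -> str:
        """Compute the md5 hex digest of a file without reading it all into memory."""
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def write_checksums(username: str, dataset_id: str) -> str:
        nc_name = f"{dataset_id}.ncCF.nc3.nc"
        nc_path = f"/data/data/pub_erddap/{username}/{dataset_id}/{nc_name}"
        checksum = md5sum(nc_path)

        # Store the hash as an extended file attribute on the NetCDF file itself.
        os.setxattr(nc_path, "user.md5sum", checksum.encode("ascii"))

        # Write the sidecar .md5 file directly (no symlink) to the archive directory.
        with open(f"/data/data/archive/{nc_name}.md5", "w") as f:
            f.write(f"{checksum}  {nc_name}\n")
        return checksum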

NCEI picks up the archival packages from:

    /home/ncei/archive

which is a symlink to:

    /data/data/archive

NCEI Archiving Process

SSH access to the DAC data server is provided to NCEI. The NCEI archiving process is as follows:

  1. NCEI logs into the DAC production server via SSH

  2. Archival packages are contained in:

     /home/ncei/archive
    
  3. An archival package consists of:

    • An aggregated NetCDF file (DATASET_ID.ncCF.nc3.nc)
    • A file containing the md5 hash of the NetCDF file (DATASET_ID.ncCF.nc3.nc.md5)
  4. New archival packages are transferred to NCEI, and the md5 hash is recalculated after transfer. The two hashes are compared to verify that the NetCDF file was transferred successfully.

  5. For existing NCEI accessions, the md5 hash is compared to the one currently held at NCEI to determine whether the file contents have changed. Typical changes include:

    • Addition of new or modification of existing metadata
    • Additional profiles added to the aggregation

    If the hashes match, there is no new data in the archival package. If they do not match, the contents of the package are assumed to have changed: the new archival package is transferred to NCEI and the accession is updated with version 2 (or 3, 4, etc.) of the data set. This versioning of the same data set preserves its processing history. A sketch of this comparison logic follows.
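
The sketch below illustrates the decision logic of steps 4 and 5 under stated assumptions; how NCEI actually transfers packages and stores the previously archived hash is not shown, so the function names and the classification strings are illustrative only.

    # Sketch: verify a transferred package and classify it against the existing accession.
    import hashlib
    from typing import Optional

    def md5sum(path: str, chunk_size: int = 1024 * 1024) -> str:
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def verify_transfer(transferred_path: str, expected_md5: str) -> None:
        """Step 4: recompute the md5 after transfer and compare it to the DAC's hash."""
        actual = md5sum(transferred_path)
        if actual != expected_md5:
            raise RuntimeError(f"md5 mismatch after transfer: {actual} != {expected_md5}")

    def classify_package(dac_md5: str, archived_md5: Optional[str]) -> str:
        """Step 5: decide what to do with a package relative to the existing accession."""
        if archived_md5 is None:
            return "new data set: create a new accession record"
        if dac_md5 == archived_md5:
            return "unchanged: nothing new to archive"
        return "changed: archive as the next version of the existing accession"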

Archival Process Schedule

The following is the schedule of NCEI archiving processes as of 2023-02-15:

  1. Monday - Sunday @ 19:10 UTC: Comparison of NCEI archive contents with the archival packages available at the DAC to determine which, if any, new archival packages are ready for archiving.
  2. Sunday - Thursday @ 08:50 UTC: New archival packages (determined in Step #1) are downloaded, archived, and assigned accession records. The new archival packages are then distributed via the NCEI GeoPortal.

Existing Issues

  1. If a data set fails the compliance check, there is no automated process for creating an archival package after the data set has been corrected using extra_atts.json.

Proposed Fixes

  1. Split archiving into separate pipelines for real-time and delayed-mode data sets.
  2. Delayed-mode data set archiving is run once per week and, potentially, becomes a manual process to ensure that the data set contains all necessary metadata. This requires DAC-initiated communication with the data provider to ensure compliance.
  3. Real-time data set archiving is run once per day. Errors in the archiving process are written to formatted log files listing all outstanding issues. Subsequent attempts at archiving the data set may require communication with the data provider before proceeding with a new attempt.
  4. Store information on archived data set packages in a database? (A sketch of one possible record layout follows.)
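
As an illustration of proposal 4 only, the sketch below shows one possible record layout using SQLite; the table name, columns, and choice of SQLite are assumptions made for discussion, not an agreed design.

    # Sketch: a minimal table for tracking archival packages (illustrative only).
    import sqlite3

    conn = sqlite3.connect("archive_packages.db")
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS archival_packages (
            dataset_id    TEXT NOT NULL,      -- e.g. peggyo-19731210T2000
            delayed_mode  INTEGER NOT NULL,   -- 0 = real-time, 1 = delayed-mode
            md5sum        TEXT NOT NULL,      -- hash of the aggregated NetCDF file
            created_at    TEXT NOT NULL,      -- when the archival package was built
            archived_at   TEXT,               -- when NCEI picked it up (NULL if pending)
            accession_url TEXT,               -- NCEI GeoPortal accession record, if any
            PRIMARY KEY (dataset_id, md5sum)
        )
        """
    )
    conn.commit()
    conn.close()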