Internal DAC Administration Space


Determining Data Set QC Status

Added 2023-05-10 @benjwadams

Files are periodically inspected to determine whether they require QC. This is done by looking at the geophysical variables and referencing the variables (if any) listed in their ancillary_variables attributes. Each file is then checked for a Linux extended attribute (referred to as an xattr hereafter) named user.qc_run. If this xattr exists, the file is considered to have already been checked for QC and is omitted. If the xattr does not exist, the standard names of these variables are checked to see if any end with "status_flag" or "quality_flag". If no variables match these criteria, automatic QC is scheduled. The xattr method is much quicker for determining whether a file has been QC'd than opening and inspecting each netCDF file individually. The current process could be improved further by determining whether existing completed deployments have already been QC'd (e.g. at the folder level).
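
A minimal sketch of the xattr check, assuming Linux and Python's os module (the user.qc_run name comes from the process described above; the helper names are illustrative, not the DAC's actual code):

    import os

    QC_XATTR = "user.qc_run"

    def needs_qc(path):
        """Return True if the file has not yet been marked as QC'd."""
        try:
            os.getxattr(path, QC_XATTR)  # raises OSError if the xattr is absent
            return False                 # xattr present: already QC'd, skip
        except OSError:
            return True                  # xattr absent: candidate for QC

    def mark_qc_run(path):
        """Set the user.qc_run xattr so the file is skipped on later passes."""
        os.setxattr(path, QC_XATTR, b"1")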

If automatic QC is run, the QARTOD flat line, gross range, spike, and rate of change tests are applied. Of these, only the gross range test is configured per geophysical variable, in the data/qc_config.yml file. The flat line test is also defined in qc_config.yml, but is set to a constant value across the various geophysical variables. The rate of change and spike test thresholds are derived per profile from computed statistical measures such as the mean and standard deviation. Note that these values are computed per profile rather than across the entire deployment. The user.qc_run xattr is then set.
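
A hedged sketch of the per-profile test application, assuming the ioos_qc package provides these tests (the thresholds shown are purely illustrative; the real values come from data/qc_config.yml or from the per-profile statistics described above):

    import numpy as np
    from ioos_qc import qartod

    def run_profile_qc(values, times):
        """Apply the four QARTOD tests to a single profile (illustrative only)."""
        std = np.nanstd(values)
        return {
            # per-variable spans are configured in data/qc_config.yml
            "gross_range": qartod.gross_range_test(
                inp=values, fail_span=[0, 45], suspect_span=[5, 40]),
            # constant configuration across geophysical variables
            "flat_line": qartod.flat_line_test(
                inp=values, tinp=times, tolerance=0.001,
                suspect_threshold=300, fail_threshold=900),
            # spike/rate-of-change thresholds derived per profile
            "spike": qartod.spike_test(
                inp=values, suspect_threshold=2 * std, fail_threshold=4 * std),
            "rate_of_change": qartod.rate_of_change_test(
                inp=values, tinp=times, threshold=3 * std),
        }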

If automatic QC is not run because QC flags already exist, the user.qc_run xattr is still set, since QC has already been performed.

The script that generates the ERDDAP catalog adds in the QC variables. If no QC variables are present in a data set, none are added; if only a few are present, a reduced subset is added. Unlike the QARTOD application scripts, the ERDDAP QC variables are identified by variable name: a variable whose name starts with "qartod" or ends with "_qc" is considered a QC variable. It may be desirable to unify the QC variable detection behavior between the QARTOD code and the script that generates datasets.xml.
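
The name-based detection used by the catalog script reduces to a simple predicate; a sketch (the function name is illustrative):

    def is_qc_variable(name):
        """Name-based QC detection used when building datasets.xml."""
        return name.startswith("qartod") or name.endswith("_qc")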

There is also a variable ending in _qc that does not adhere to QARTOD flag conventions and is not generated as part of the QARTOD QC process. This should likely be retired in favor of qartod_<geophysical_variable>_primary_flag, which corresponds to the aggregate flag (standard name aggregate_quality_flag in CF).

Proposed QC Process

The following is a proposal for the entire QC application process as it applies to active real-time data sets. The application of DAC-supplied QC to all other data sets will be done in a separate process to be determined.

  • Determine the names of all active real-time data sets. This should be done by querying the database for all registered data sets with the following conditions (see the query sketch after this list):
    • "delayed_mode" = False or null
    • "completed" = False
    • data set name does not end in "_delayed"
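
A sketch of this query, assuming (for illustration only) a MongoDB-style deployments collection; the field names come from the conditions above, and everything else is hypothetical:

    # Assumes a pymongo connection `db`; the collection name is illustrative.
    active_realtime = db.deployments.find({
        "delayed_mode": {"$in": [False, None]},     # "delayed_mode" = False or null
        "completed": False,                         # "completed" = False
        "name": {"$not": {"$regex": "_delayed$"}},  # name does not end in "_delayed"
    })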

For all of the data sets that meet these criteria, we need to create a list of data-provider-submitted NetCDF files that need to be QC'd by the DAC qc package. The following shell script, run as user glider, creates these lists (one for each data set), each containing the files that have not previously been QC'd:

/home/glider/qc/bin/build_deployment_qc_queue.sh

Each list, provided there are files that need to be QC'd, is written to:

/home/glider/qc/queue

Full documentation on build_deployment_qc_queue.sh can be viewed using:

> /home/glider/qc/bin/build_deployment_qc_queue.sh -h

These files should be used as the inputs that supply the list of files to be QC'd.

Catalog Description

ERDDAP serves data sets, which are aggregations of individual files, described in the datasets.xml catalog file. The datasets.xml file is an XML file that describes the contents and metadata of a glider data set in a format that ERDDAP can understand. In addition to new data sets, the XML descriptions of existing data sets may need to be modified to add missing metadata or to correct erroneous metadata associated with the data set.

The datasets.xml file consists of 3 parts (a skeletal example follows this list):

  1. Header
    • <erddapDatasets> opening tag
    • Tags specifying IP blacklists, the maximum number of simultaneous connections allowed from a single IP, etc.
  2. Body
    • Individual data set descriptions enclosed in <dataset /> tags
      • <reloadEveryNMinutes />
        • high reload frequency for real-time incomplete data sets
        • low reload frequency for real-time completed data sets
        • low reload frequency for delayed-mode data sets regardless of complete/incomplete status
      • Metadata
        • use of the extra_atts.json file to correct metadata and add missing metadata
      • <metadataFrom>first|last</metadataFrom>
        • Tells ERDDAP to read all data set metadata, not specifically defined in the <dataset /> tag, from either the earliest (first) or latest (last) data provider submitted NetCDF file.
      • Attributes not present in the source files can be added to an ERDDAP dataset via the <addAttributes /> element.
      • Variable descriptions are provided through one or more <dataVariable /> elements that describe the variable name, type, and attributes.
  3. Footer
    • Closing </erddapDatasets> tag
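
A skeletal illustration of that structure (dataset IDs, reload values, and attribute values are placeholders; the element names are standard ERDDAP datasets.xml elements):

    <erddapDatasets>
      <!-- Header: blacklists, connection limits, etc. -->
      <requestBlacklist>...</requestBlacklist>

      <!-- Body: one <dataset /> element per data set -->
      <dataset type="EDDTableFromNcFiles" datasetID="example-deployment" active="true">
        <reloadEveryNMinutes>15</reloadEveryNMinutes>  <!-- high frequency: active real-time -->
        <metadataFrom>last</metadataFrom>
        <addAttributes>
          <att name="title">Example deployment</att>
        </addAttributes>
        <dataVariable>
          <sourceName>temperature</sourceName>
          <destinationName>temperature</destinationName>
          <dataType>double</dataType>
        </dataVariable>
      </dataset>

      <!-- Footer -->
    </erddapDatasets>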

Building the ERDDAP Catalog

Creation of the individual <dataset /> elements must be performed on demand in order to reflect changes in the source files submitted by the data provider. For example, one or more new variables can be included in a file submitted by the data provider during the course of a real-time deployment. If these new variables are not described in the <dataset /> element, they will not be displayed in the resulting aggregation.

Typically, only a small fraction of data sets are classified as real-time active data sets. The rest of the data sets fall into the following classifications:

  • Delayed mode: typically submitted in a single large batch upload
  • Real-Time Completed: data sets for which files are no longer being submitted

The following process is followed in order to create an ERDDAP catalog that accurately reflects the contents of all submitted data sets.

Catalog Creation

IMPORTANT: All data sets will be QC'd by the DAC. Variables created by the DAC QC process can therefore be "hardcoded" into the <dataset /> template. In the event that a newly registered data set has had one or more files uploaded by the provider but the DAC QC process has not yet run, these variables will appear in the ERDDAP data set but will contain only _FillValue values.

The following is a description of the proposed steps to create an accurate datasets.xml file. This system can be triggered either by an event (e.g. the upload of one or more new NetCDF files by a provider) or on a schedule via the crontab. The creation of the individual <dataset /> element files can be done asynchronously to further speed up the entire process, but the assembly of the datasets.xml file must be serialized to ensure no race conditions exist that would result in an incomplete <dataset /> element being written to the file.

  1. Determine the names of all active real-time data sets. This should be done by querying the database for all registered data sets with the following conditions:
    • "delayed_mode" = False or null
    • "completed" = False
    • data set name does not end in "_delayed"

All data sets that do not satisfy these criteria should be ignored.

  2. Either by event-based notification (inotify) or by comparing the most recent mod time of all files submitted by the data provider to the "latest_file_mtime" field of the data set record, determine whether the <dataset /> element should be re-created.
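
     A minimal sketch of the mtime comparison (the helper and argument names are illustrative):

         import os

         def needs_rebuild(nc_files, latest_file_mtime):
             """True if any submitted file is newer than the recorded mtime."""
             newest = max(os.path.getmtime(f) for f in nc_files)
             return newest > latest_file_mtime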

  3. Get the list of the X largest files submitted by the data provider, regardless of mtime. Since each NetCDF file represents a single profile, file sizes will not vary much with the number of observation records stored in each file; the variation will be due almost entirely to the overhead of defining variables and their attributes. The largest files will therefore likely contain the largest number of variables.
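
     For illustration, selecting the X largest files might look like this (x is a tuning parameter, not a value specified above):

         import os

         def largest_files(nc_files, x=10):
             """Return the x largest files by size, regardless of mtime."""
             return sorted(nc_files, key=os.path.getsize, reverse=True)[:x]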

  4. Extract and create the union of all variables from the files identified in #3. This can be done in a variety of ways, but I have found that dumping the files as valid CDL and pulling out the variable names, data types, and _FillValues in the shell is fastest. I have a shell script that does this. Store this variable set in a CSV or YAML file.
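
     The shell/CDL approach described above is reportedly fastest; purely as an illustration of the same idea, a sketch using the netCDF4 Python package:

         import csv
         from netCDF4 import Dataset

         def variable_union(paths, out_csv):
             """Write the union of variable names/dtypes/_FillValues to a CSV."""
             seen = {}
             for path in paths:
                 with Dataset(path) as nc:
                     for name, var in nc.variables.items():
                         fill = getattr(var, "_FillValue", None)
                         seen.setdefault(name, (str(var.dtype), fill))
             with open(out_csv, "w", newline="") as f:
                 writer = csv.writer(f)
                 writer.writerow(["name", "dtype", "_FillValue"])
                 for name, (dtype, fill) in sorted(seen.items()):
                     writer.writerow([name, dtype, fill])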

  5. Read in the file from #4 and render the XML containing a <dataVariable /> element for each of the variables. Include the _FillValue as an attribute; this will prevent fill values from being displayed in the observation records in the event that a submitted NetCDF file is missing one or more parameters. This is based on the fact that ERDDAP, as currently configured, will only read metadata for variables from the most recently modified file. I have a template for this to use directly or as an example to build a more customized one for the DAC.
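
     A sketch of the rendering step; the element layout follows ERDDAP's datasets.xml conventions, but this helper is illustrative and is not the template referenced above:

         def render_data_variable(name, dtype, fill_value):
             """Render one <dataVariable /> element with its _FillValue attribute."""
             return (
                 "<dataVariable>\n"
                 f"  <sourceName>{name}</sourceName>\n"
                 f"  <destinationName>{name}</destinationName>\n"
                 f"  <dataType>{dtype}</dataType>\n"
                 "  <addAttributes>\n"
                 f'    <att name="_FillValue" type="{dtype}">{fill_value}</att>\n'
                 "  </addAttributes>\n"
                 "</dataVariable>"
             )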

  6. The rendered <dataset /> XML element should be written to a file and stored in a directory somewhere on the system. All <dataset /> elements that will be included in the resulting datasets.xml file must be stored in this directory. For the purposes of this exercise, assume the directory is:

     /data/data/datasets-xml
    
  7. After steps 2 - 6 have been performed for all data sets identified in step 1, the following workflow should be used to create the datasets.xml file:

     > cp datasets.xml datasets.xml.previous
     > cat header.xml > datasets.xml.tmp
     > cat /data/data/datasets-xml/*.xml >> datasets.xml.tmp
     > cat footer.xml >> datasets.xml.tmp
     > mv datasets.xml.tmp datasets.xml