Manuel Luciano edited this page Feb 21, 2024 · 62 revisions

Purpose

This document describes how iDigBio identifies known data quality issues in ingested specimen data and represents them in the iDigBio Search API. During ingestion, iDigBio often encounters data that are missing, inconsistent, factually incorrect, or out of compliance with metadata standards and controlled vocabularies. To facilitate indexing, iDigBio corrects these data and flags the affected records in the Search API. For example, taxonomic names are added from the GBIF Backbone Taxonomy, and common misspellings are replaced (e.g. "Flordia" is replaced with "Florida").

The following summary shows how often each flag has been assigned to records in iDigBio:

http://search.idigbio.org/v2/summary/top/records?top_fields=[%22flags%22]&count=1000
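
The summary URL above can also be assembled in a script. The following is a minimal Python sketch, not an official client: the function names `flag_summary_url` and `fetch_flag_summary` are hypothetical, and only the endpoint and the `top_fields` and `count` parameters come from the URL on this page. The structure of the JSON response is not described here, so the sketch returns the decoded JSON as-is.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the summary URL shown on this page.
SUMMARY_TOP_RECORDS = "http://search.idigbio.org/v2/summary/top/records"


def flag_summary_url(count=1000):
    """Build the summary URL for the top `count` values of the flags field."""
    params = {
        # top_fields is a JSON-encoded list of field names.
        "top_fields": json.dumps(["flags"], separators=(",", ":")),
        "count": count,
    }
    return SUMMARY_TOP_RECORDS + "?" + urlencode(params)


def fetch_flag_summary(count=1000, timeout=30):
    """Fetch the summary and return the decoded JSON (needs network access)."""
    with urlopen(flag_summary_url(count), timeout=timeout) as response:
        return json.load(response)
```

`urlencode` percent-encodes the brackets as well as the quotes, so the generated URL looks slightly different from the one above but decodes to the same parameters.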

General guidelines for flag names:

  1. A flag whose name contains "added" means the field was empty in the provided data and iDigBio added a value to help fully populate the record. This enhances searching and discovery.
  2. A flag whose name contains "replaced" means the field contained data from the provider and iDigBio attempted to make it more consistent by replacing the value. Note that the original data values are always available in the raw data. The replaced values are designed to enhance searching and discovery.
  3. A flag whose name contains "truncated" means part of a record was removed. This can happen when a field name contains unsupported characters, such as dots (periods, ".").
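
The naming guidelines above lend themselves to a small helper for grouping flags. This is a sketch only: `flag_kind` is a hypothetical name, and it assumes the guideline word is the last underscore-separated part of the flag name, which holds for the flags listed on this page.

```python
def flag_kind(flag):
    """Classify a flag name as 'added', 'replaced', 'truncated', or 'other'.

    Assumption: the guideline word is the final underscore-separated
    token of the flag name.
    """
    for kind in ("added", "replaced", "truncated"):
        if flag.endswith("_" + kind):
            return kind
    return "other"
```

For example, `flag_kind("dwc_country_added")` returns `"added"`, while a flag such as `geopoint_bounds` falls into `"other"`.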

Flags

The table below describes the flags that might be added to records in iDigBio:

Flag Definition
datecollected_bounds Date Collected out of bounds (not between 1500-01-02 and the date of indexing). Date Collected is generally composed from dwc:year, dwc:month, and dwc:day, or taken from dwc:eventDate.
dwc_acceptednameusageid_added Accepted Name Usage ID (dwc:acceptedNameUsageID) added where none was provided.
dwc_basisofrecord_invalid Darwin Core Basis of Record (dwc:basisOfRecord) missing or not a value from controlled vocabulary.
dwc_basisofrecord_paleo_conflict Darwin Core Basis of Record (dwc:basisOfRecord) is not FossilSpecimen but the record contains paleo context terms.
dwc_basisofrecord_removed Darwin Core Basis of Record (dwc:basisOfRecord) removed because of invalid value.
dwc_class_added Darwin Core Class (dwc:class) added where none was provided.
dwc_class_replaced Darwin Core Class (dwc:class) replaced with a standardized value from GBIF Backbone Taxonomy.
dwc_continent_added Darwin Core Continent (dwc:continent) added where none was provided.
dwc_continent_replaced Darwin Core Continent (dwc:continent) replaced with a standardized value.
dwc_country_added Darwin Core Country (dwc:country) added where none was provided.
dwc_country_replaced Darwin Core Country (dwc:country) replaced with a standardized value from Getty Thesaurus of Geographic Names.
dwc_datasetid_added Darwin Core Dataset ID (dwc:datasetID) added where none was provided.
dwc_datasetid_replaced Darwin Core Dataset ID (dwc:datasetID) replaced with value from ? TBD
dwc_family_added Darwin Core Family (dwc:family) added where none was provided.
dwc_family_replaced Darwin Core Family (dwc:family) replaced with a standardized value from GBIF Backbone Taxonomy.
dwc_genus_added Darwin Core Genus (dwc:genus) added where none was provided.
dwc_genus_replaced Darwin Core Genus (dwc:genus) replaced with a standardized value from GBIF Backbone Taxonomy.
dwc_infraspecificepithet_added Darwin Core Infraspecific Epithet (dwc:infraspecificEpithet) added where none was provided.
dwc_infraspecificepithet_replaced Darwin Core Infraspecific Epithet (dwc:infraspecificEpithet) replaced with a standardized value from GBIF Backbone Taxonomy.
dwc_kingdom_added Darwin Core Kingdom (dwc:kingdom) added where none was provided.
dwc_kingdom_replaced Darwin Core Kingdom (dwc:kingdom) replaced with a standardized value from GBIF Backbone Taxonomy.
dwc_kingdom_suspect Darwin Core Kingdom (dwc:kingdom) not replaced with a standardized value from GBIF Backbone Taxonomy due to insufficient confidence level.
dwc_multimedia_added TBD
dwc_order_added Darwin Core Order (dwc:order) added where none was provided.
dwc_order_replaced Darwin Core Order (dwc:order) replaced with a standardized value from GBIF Backbone Taxonomy.
dwc_originalnameusageid_added Darwin Core Original Name Usage ID (dwc:originalNameUsageID) added where none was provided.
dwc_parentnameusageid_added Darwin Core Parent Name Usage ID (dwc:parentNameUsageID) added where none was provided.
dwc_phylum_added Darwin Core Phylum (dwc:phylum) added where none was provided.
dwc_phylum_replaced Darwin Core Phylum (dwc:phylum) replaced with a standardized value from GBIF Backbone Taxonomy.
dwc_scientificnameauthorship_added Darwin Core Scientific Name Authorship (dwc:scientificNameAuthorship) added where none was provided.
dwc_specificepithet_added Darwin Core Specific Epithet (dwc:specificEpithet) added where none was provided.
dwc_specificepithet_replaced Darwin Core Specific Epithet (dwc:specificEpithet) replaced with a standardized value from GBIF Backbone Taxonomy.
dwc_stateprovince_replaced Darwin Core State or Province (dwc:stateProvince) replaced with a standardized value.
dwc_taxonid_added Darwin Core Taxon ID (dwc:taxonID) added where none was provided.
dwc_taxonid_replaced Darwin Core Taxon ID (dwc:taxonID) replaced with a standardized value from GBIF Backbone Taxonomy.
dwc_taxonomicstatus_added Darwin Core Taxonomic Status (dwc:taxonomicStatus) added where none was provided.
dwc_taxonomicstatus_replaced Darwin Core Taxonomic Status (dwc:taxonomicStatus) replaced with a standardized value from GBIF Backbone Taxonomy.
dwc_taxonrank_added Darwin Core Taxon Rank (dwc:taxonRank) added where none was provided.
dwc_taxonrank_invalid The supplied Darwin Core Taxon Rank (dwc:taxonRank) is not contained in controlled vocabulary (Taxonomic Rank GBIF Vocabulary).
dwc_taxonrank_removed Darwin Core Taxon Rank (dwc:taxonRank) removed because it is not contained in controlled vocabulary (Taxonomic Rank GBIF Vocabulary).
dwc_taxonrank_replaced Darwin Core Taxon Rank (dwc:taxonRank) replaced with a standardized value from GBIF Backbone Taxonomy.
dwc_taxonremarks_added Darwin Core Taxon Remarks (dwc:taxonRemarks) added where none was provided.
dwc_taxonremarks_replaced Darwin Core Taxon Remarks (dwc:taxonRemarks) replaced with a standardized value from GBIF Backbone Taxonomy.
gbif_canonicalname_added GBIF Canonical Name added from GBIF Backbone Taxonomy.
gbif_genericname_added GBIF Generic Name added from GBIF Backbone Taxonomy.
gbif_reference_added GBIF Reference added from GBIF Backbone Taxonomy.
gbif_taxon_corrected A match in GBIF Backbone Taxonomy was found. Inverse of taxon_match_failed flag.
gbif_vernacularname_added GBIF Vernacular Name (common name) added.
geopoint_0_coord Geographic Coordinate contains literal '0' values.
geopoint_bounds Geographic Coordinate out of bounds (valid range is -90 to 90 lat, -180 to 180 long).
geopoint_datum_error Geographic Coordinate Datum (dwc:geodeticDatum) is Unknown or coordinate cannot be converted to WGS84.
geopoint_datum_missing Geographic Coordinate is missing Geodetic Datum (dwc:geodeticDatum) (Assumed to be WGS84).
geopoint_low_precision Geographic Coordinate contains a Low Precision value.
geopoint_pre_flip Geographic Coordinate latitude and longitude replaced with swapped values. Prior to examining other factors, the magnitude of latitude was determined to be greater than 180, and the longitude was less than 90.
geopoint_similar_coord Geographic Coordinate latitude and longitude are similar (+/- lat == +/- lon) and likely reflect a data entry issue.
idigbio_isocountrycode_added iDigBio ISO 3166-1 alpha-3 Country Code added.
idigbio_obis_extendedmeasurementorfact_truncated Record truncated due to problematic field name.
idigbio_chrono_chronometricage_truncated Record truncated due to problematic field name.
rev_geocode_both_sign Geographic Coordinate Latitude and Longitude negated to place point in correct country.
rev_geocode_corrected Geographic Coordinate placed within stated country by reverse geocoding process.
rev_geocode_eez Geographic Coordinate is outside land boundaries of stated country but does fall inside the country's exclusive economic zone water boundary (approx. 200 miles from shore) based on reverse geocoding process.
rev_geocode_eez_corrected The reverse geocoding process was able to find a coordinate operation that placed the point within the stated country's exclusive economic zone.
rev_geocode_failure Geographic Coordinate could not be reverse geocoded to a particular country.
rev_geocode_flip Geographic Coordinate Latitude and Longitude replaced with swapped values to place point in stated country by reverse geocoding process.
rev_geocode_flip_both_sign Geographic Coordinate Latitude and Longitude replaced with both swapped and negated values to place point in stated country by reverse geocoding process.
rev_geocode_flip_lat_sign Geographic Coordinate Latitude and Longitude replaced with swapped values, Latitude negated, to place point in stated country by reverse geocoding process.
rev_geocode_flip_lon_sign Geographic Coordinate Latitude and Longitude replaced with swapped values, Longitude negated, to place it in stated country by reverse geocoding process.
rev_geocode_lat_sign Geographic Coordinate Latitude negated to place point in stated country by reverse geocoding process.
rev_geocode_lon_sign Geographic Coordinate had its Longitude negated to place it in stated country.
rev_geocode_mismatch Geographic Coordinate did not reverse geocode to stated country.
scientificname_added Scientific Name (dwc:scientificName) added where none was provided; the value is constructed by concatenating the stated genus and species.
taxon_match_failed Unable to match a taxon in GBIF Backbone Taxonomy. Inverse of gbif_taxon_corrected flag.
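
Several of the definitions above are simple range checks. The sketch below illustrates three of them (geopoint_bounds, geopoint_similar_coord, datecollected_bounds) in Python. It is not iDigBio's implementation: the function name `bounds_flags` and its parameters are hypothetical, the date range is assumed to be inclusive at both ends, and "similar" is interpreted as equal absolute values, following the "+/- lat == +/- lon" wording.

```python
from datetime import date

# Lower bound for Date Collected, as stated for datecollected_bounds.
EARLIEST_DATE = date(1500, 1, 2)


def bounds_flags(lat=None, lon=None, collected=None, indexed_on=None):
    """Return the bounds-related flags that would apply to the given values."""
    flags = []
    if lat is not None and lon is not None:
        # Valid range is -90 to 90 for latitude, -180 to 180 for longitude.
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            flags.append("geopoint_bounds")
        # Assumption: "similar" means the magnitudes are identical.
        if abs(lat) == abs(lon):
            flags.append("geopoint_similar_coord")
    if collected is not None:
        # Assumption: both ends of the date range are inclusive.
        latest = indexed_on or date.today()
        if not (EARLIEST_DATE <= collected <= latest):
            flags.append("datecollected_bounds")
    return flags
```

For example, `bounds_flags(lat=95.0, lon=10.0)` returns `["geopoint_bounds"]`, because the latitude falls outside the valid range.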

Query Examples

Searching records for the flag scientificname_added:

{
  "flags":"scientificname_added"
}
http://search.idigbio.org/v2/search/records?rq={%22flags%22:%22scientificname_added%22}

Searching the records of a specific recordset that are flagged with scientificname_added:

{
  "flags":"scientificname_added",
  "recordset":"c38b867b-05f3-4733-802e-d8d2d3324f84"
}
http://search.idigbio.org/v2/search/records?rq={%22flags%22:%22scientificname_added%22,%22recordset%22:%22c38b867b-05f3-4733-802e-d8d2d3324f84%22}
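
The query URLs above can be generated rather than written by hand. The following Python sketch builds the `rq` parameter from a dictionary and URL-encodes it. The helper name `flag_search_url` is hypothetical; the endpoint, the `rq` parameter, and the `flags` and `recordset` keys are taken from the examples above.

```python
import json
from urllib.parse import urlencode

# Endpoint taken from the query examples on this page.
SEARCH_RECORDS = "http://search.idigbio.org/v2/search/records"


def flag_search_url(flag, recordset=None):
    """Build a record search URL for one flag, optionally within a recordset."""
    rq = {"flags": flag}
    if recordset is not None:
        rq["recordset"] = recordset
    # rq is passed as a compact JSON string in the query string.
    query = urlencode({"rq": json.dumps(rq, separators=(",", ":"))})
    return SEARCH_RECORDS + "?" + query
```

Calling `flag_search_url("scientificname_added", "c38b867b-05f3-4733-802e-d8d2d3324f84")` produces a URL equivalent to the second example. `urlencode` also percent-encodes the braces, colons, and commas, so the text differs from the hand-written URL but decodes to the same query.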