Data are stored in Parquet files in Azure Blob Storage in the West Europe Azure region, in the following blob container:

https://ai4edataeuwest.blob.core.windows.net/gbif

Within that container, the periodic occurrence snapshots are stored in occurrence/YYYY-MM-DD, where YYYY-MM-DD corresponds to the date of the snapshot.

The snapshot includes all CC0 and CC BY licensed data published through GBIF whose coordinates passed automated quality checks.

Each snapshot contains a citation.txt with instructions on how best to cite the data, and the data files themselves in Parquet format: occurrence.parquet/*.

Therefore, the data files for the first snapshot are at

https://ai4edataeuwest.blob.core.windows.net/gbif/occurrence/2021-04-13/occurrence.parquet/*

and the citation information is at

https://ai4edataeuwest.blob.core.windows.net/gbif/occurrence/2021-04-13/citation.txt
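The path layout described above can be sketched as a small helper. This is a minimal illustration, not part of any GBIF or Azure tooling: the function name `snapshot_urls` and the dictionary keys are hypothetical, and only the container URL and the `occurrence/YYYY-MM-DD` layout come from this document.

```python
from datetime import date

# Blob container given in this document.
CONTAINER_URL = "https://ai4edataeuwest.blob.core.windows.net/gbif"


def snapshot_urls(snapshot_date: str) -> dict:
    """Return the data prefix and citation URL for one occurrence snapshot.

    `snapshot_urls` is a hypothetical helper name used for illustration.
    """
    # Parse the date to validate it, then re-format it as YYYY-MM-DD.
    parsed = date.fromisoformat(snapshot_date)
    base = f"{CONTAINER_URL}/occurrence/{parsed.isoformat()}"
    return {
        # The Parquet part files live under this prefix.
        "data_prefix": f"{base}/occurrence.parquet/",
        "citation": f"{base}/citation.txt",
    }


urls = snapshot_urls("2021-04-13")
print(urls["data_prefix"])
print(urls["citation"])
```

The helper only builds strings; it does not check that a snapshot for the given date actually exists in the container.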

The Parquet file schema is described below. Most field names correspond to terms from the Darwin Core standard, and their values have been interpreted by GBIF's systems to align taxonomy, location, dates, etc. Additional information may be retrieved using the GBIF API.
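As a rough sketch of how the identifier fields in the schema could be used to look up additional information, the helper below builds GBIF API URLs from a record's gbifid, datasetkey and taxonkey. It assumes the public GBIF REST API rooted at `https://api.gbif.org/v1`, which is not stated in this document; the function name `gbif_api_urls` and the sample identifier values are hypothetical.

```python
import uuid

# Assumption: base URL of the public GBIF REST API (not given in this document).
GBIF_API = "https://api.gbif.org/v1"


def gbif_api_urls(gbifid, datasetkey, taxonkey=None) -> dict:
    """Build lookup URLs for one record (hypothetical helper)."""
    # datasetkey is a UUID string in the schema; normalise and validate it.
    dataset_uuid = str(uuid.UUID(datasetkey))
    urls = {
        "occurrence": f"{GBIF_API}/occurrence/{int(gbifid)}",
        "dataset": f"{GBIF_API}/dataset/{dataset_uuid}",
    }
    # taxonkey is nullable, so only add the taxon URL when it is present.
    if taxonkey is not None:
        urls["taxon"] = f"{GBIF_API}/species/{int(taxonkey)}"
    return urls


# Placeholder values for illustration only; they do not refer to real records.
example = gbif_api_urls(
    gbifid=1234567890,
    datasetkey="00000000-0000-0000-0000-000000000000",
    taxonkey=212,
)
for name, url in example.items():
    print(name, url)
```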

| Field¹ | Type | Nullable | Description |
| --- | --- | --- | --- |
| gbifid | BigInt | N | GBIF's identifier for the occurrence. |
| datasetkey | String (UUID) | N | GBIF's UUID for the dataset containing this occurrence. |
| publishingorgkey | String (UUID) | N | GBIF's UUID for the organization publishing this occurrence. |
| occurrencestatus | String | N | See dwc:occurrenceStatus. Either the value PRESENT or ABSENT. Many users will wish to filter for PRESENT data. |
| basisofrecord | String | N | See dwc:basisOfRecord. One of PRESERVED_SPECIMEN, FOSSIL_SPECIMEN, LIVING_SPECIMEN, OBSERVATION, HUMAN_OBSERVATION, MACHINE_OBSERVATION, MATERIAL_SAMPLE, LITERATURE, UNKNOWN. |
| kingdom | String | Y | See dwc:kingdom. This field has been aligned with the GBIF backbone taxonomy. |
| phylum | String | Y | See dwc:phylum. This field has been aligned with the GBIF backbone taxonomy. |
| class | String | Y | See dwc:class. This field has been aligned with the GBIF backbone taxonomy. |
| order | String | Y | See dwc:order. This field has been aligned with the GBIF backbone taxonomy. |
| family | String | Y | See dwc:family. This field has been aligned with the GBIF backbone taxonomy. |
| genus | String | Y | See dwc:genus. This field has been aligned with the GBIF backbone taxonomy. |
| species | String | Y | See dwc:species. This field has been aligned with the GBIF backbone taxonomy. |
| infraspecificepithet | String | Y | See dwc:infraspecificEpithet. This field has been aligned with the GBIF backbone taxonomy. |
| taxonrank | String | Y | See dwc:taxonRank. This field has been aligned with the GBIF backbone taxonomy. |
| scientificname | String | Y | See dwc:scientificName. This field has been aligned with the GBIF backbone taxonomy. |
| verbatimscientificname | String | Y | The scientific name as provided by the data publisher. |
| verbatimscientificnameauthorship | String | Y | The scientific name authorship provided by the data publisher. |
| taxonkey | Integer | Y | The numeric identifier for the taxon in GBIF's backbone taxonomy corresponding to scientificname. |
| specieskey | Integer | Y | The numeric identifier for the taxon in GBIF's backbone taxonomy corresponding to species. |
| typestatus | String | Y | See dwc:typeStatus. |
| countrycode | String | Y | See dwc:countryCode. GBIF's interpretation has set this to an ISO 3166-1 alpha-2 code. |
| locality | String | Y | See dwc:locality. |
| stateprovince | String | Y | See dwc:stateProvince. |
| decimallatitude | Double | N² | See dwc:decimalLatitude. GBIF's interpretation has normalized this to a WGS84 coordinate. |
| decimallongitude | Double | N² | See dwc:decimalLongitude. GBIF's interpretation has normalized this to a WGS84 coordinate. |
| coordinateuncertaintyinmeters | Double | Y | See dwc:coordinateUncertaintyInMeters. |
| coordinateprecision | Double | Y | See dwc:coordinatePrecision. |
| elevation | Double | Y | See dwc:elevation. If provided by the data publisher, GBIF's interpretation has normalized this value to metres. |
| elevationaccuracy | Double | Y | See dwc:elevationAccuracy. If provided by the data publisher, GBIF's interpretation has normalized this value to metres. |
| depth | Double | Y | See dwc:depth. If provided by the data publisher, GBIF's interpretation has normalized this value to metres. |
| depthaccuracy | Double | Y | See dwc:depthAccuracy. If provided by the data publisher, GBIF's interpretation has normalized this value to metres. |
| eventdate | String | Y | See dwc:eventDate. GBIF's interpretation has normalized this value to an ISO 8601 date with a local time. |
| year | Integer | Y | See dwc:year. |
| month | Integer | Y | See dwc:month. |
| day | Integer | Y | See dwc:day. |
| individualcount | Integer | Y | See dwc:individualCount. |
| establishmentmeans | String | Y | See dwc:establishmentMeans. |
| occurrenceid | String | Y³ | See dwc:occurrenceID. |
| institutioncode | String | Y³ | See dwc:institutionCode. |
| collectioncode | String | Y³ | See dwc:collectionCode. |
| catalognumber | String | Y³ | See dwc:catalogNumber. |
| recordnumber | String | Y | See dwc:recordNumber. |
| recordedby | String | Y | See dwc:recordedBy. |
| identifiedby | String | Y | See dwc:identifiedBy. |
| dateidentified | String | Y | See dwc:dateIdentified. An ISO 8601 date. |
| mediatype | String array | N⁴ | See dwc:mediaType. May contain StillImage, MovingImage or Sound (from an enumeration), indicating whether the occurrence has media of that type available. |
| issue | String array | N⁴ | A list of issues encountered by GBIF in processing this record. More details are available on these issues and flags in this blog post. |
| license | String | N | See dwc:license. Either CC0_1_0 or CC_BY_4_0. CC_BY_NC_4_0 records are not present in this snapshot. |
| rightsholder | String | Y | See dwc:rightsHolder. |
| lastinterpreted | String | N | The ISO 8601 date when the record was last processed by GBIF. Data are reprocessed for several reasons, including changes to the backbone taxonomy, so this date is not necessarily the date the occurrence record last changed. |

¹ Field names are lower case, but in later snapshots this may change to camelCase, for consistency with Darwin Core and the GBIF API.

² Occurrences without coordinates are excluded from this snapshot, although this may change in the future.

³ Either occurrenceID, or institutionCode + collectionCode + catalogNumber, or both, will be present on every record.

⁴ The array may be empty.
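The schema notes above suggest a few common record-level checks: filtering for PRESENT data, the identifier rule in footnote ³, and the possibly empty issue array in footnote ⁴. The sketch below applies them to plain Python dictionaries keyed by the lower-case field names. The function names and sample records are hypothetical, and treating an empty string as a missing identifier is an assumption made for illustration. Dropping every record with a flagged issue is a deliberately strict choice, not a GBIF recommendation.

```python
def has_identifier(record: dict) -> bool:
    """Footnote 3: occurrenceid, or the institution/collection/catalog triplet."""
    # Assumption: None and "" both count as "not present".
    if record.get("occurrenceid"):
        return True
    triplet = ("institutioncode", "collectioncode", "catalognumber")
    return all(record.get(name) for name in triplet)


def is_clean_presence(record: dict) -> bool:
    """True for a PRESENT record whose issue array is empty (footnote 4)."""
    return record["occurrencestatus"] == "PRESENT" and len(record["issue"]) == 0


# Made-up sample records for illustration; not taken from a real snapshot.
records = [
    {"gbifid": 1, "occurrencestatus": "PRESENT", "issue": [],
     "occurrenceid": "urn:example:1"},
    {"gbifid": 2, "occurrencestatus": "ABSENT", "issue": [],
     "occurrenceid": "urn:example:2"},
    {"gbifid": 3, "occurrencestatus": "PRESENT", "issue": ["COORDINATE_ROUNDED"],
     "occurrenceid": "urn:example:3"},
    {"gbifid": 4, "occurrencestatus": "PRESENT", "issue": [],
     "occurrenceid": None, "institutioncode": "INST",
     "collectioncode": "COLL", "catalognumber": "0001"},
]

kept = [r["gbifid"] for r in records if has_identifier(r) and is_clean_presence(r)]
print(kept)  # [1, 4]
```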