Data are stored in Parquet files in Azure Blob Storage in the West Europe Azure region, in the following blob container:
https://ai4edataeuwest.blob.core.windows.net/gbif
Within that container, the periodic occurrence snapshots are stored in occurrence/YYYY-MM-DD, where YYYY-MM-DD corresponds to the date of the snapshot.
The snapshot includes all CC0 and CC-BY licensed data published through GBIF that have coordinates which passed automated quality checks.
Each snapshot contains a citation.txt with instructions on how best to cite the data, and the data files themselves in Parquet format: occurrence.parquet/*.
Therefore, the data files for the first snapshot are at
https://ai4edataeuwest.blob.core.windows.net/gbif/occurrence/2021-04-13/occurrence.parquet/*
and the citation information is at
https://ai4edataeuwest.blob.core.windows.net/gbif/occurrence/2021-04-13/citation.txt
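The path layout above can be sketched as a small helper that builds the URLs for a given snapshot date. This is a minimal illustration only: the function name `snapshot_urls` and the dictionary keys are hypothetical, and the code does not check that a snapshot actually exists for the date passed in.

```python
from datetime import date
from typing import Dict

# Container URL as given in the text above.
CONTAINER_URL = "https://ai4edataeuwest.blob.core.windows.net/gbif"


def snapshot_urls(snapshot_date: date) -> Dict[str, str]:
    """Build the data prefix and citation URL for one snapshot (hypothetical helper)."""
    # Snapshots live under occurrence/YYYY-MM-DD; isoformat() yields that form.
    prefix = f"{CONTAINER_URL}/occurrence/{snapshot_date.isoformat()}"
    return {
        # The Parquet part files sit under this prefix (occurrence.parquet/*).
        "data_prefix": f"{prefix}/occurrence.parquet/",
        "citation": f"{prefix}/citation.txt",
    }


first_snapshot = snapshot_urls(date(2021, 4, 13))
print(first_snapshot["data_prefix"])
print(first_snapshot["citation"])
```

The data prefix is a directory of Parquet part files rather than a single file, so a reader would typically list the blobs under it before loading them.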
The Parquet file schema is described below. Most field names correspond to terms from the Darwin Core standard, and have been interpreted by GBIF's systems to align taxonomy, location, dates, etc. Additional information may be retrieved using the GBIF API.
Field¹ | Type | Nullable | Description |
---|---|---|---|
gbifid | BigInt | N | GBIF's identifier for the occurrence. |
datasetkey | String (UUID) | N | GBIF's UUID for the dataset containing this occurrence. |
publishingorgkey | String (UUID) | N | GBIF's UUID for the organization publishing this occurrence. |
occurrencestatus | String | N | See dwc:occurrenceStatus. Either the value PRESENT or ABSENT. Many users will wish to filter for PRESENT data. |
basisofrecord | String | N | See dwc:basisOfRecord. One of PRESERVED_SPECIMEN, FOSSIL_SPECIMEN, LIVING_SPECIMEN, OBSERVATION, HUMAN_OBSERVATION, MACHINE_OBSERVATION, MATERIAL_SAMPLE, LITERATURE, UNKNOWN. |
kingdom | String | Y | See dwc:kingdom. This field has been aligned with the GBIF backbone taxonomy. |
phylum | String | Y | See dwc:phylum. This field has been aligned with the GBIF backbone taxonomy. |
class | String | Y | See dwc:class. This field has been aligned with the GBIF backbone taxonomy. |
order | String | Y | See dwc:order. This field has been aligned with the GBIF backbone taxonomy. |
family | String | Y | See dwc:family. This field has been aligned with the GBIF backbone taxonomy. |
genus | String | Y | See dwc:genus. This field has been aligned with the GBIF backbone taxonomy. |
species | String | Y | See dwc:species. This field has been aligned with the GBIF backbone taxonomy. |
infraspecificepithet | String | Y | See dwc:infraspecificEpithet. This field has been aligned with the GBIF backbone taxonomy. |
taxonrank | String | Y | See dwc:taxonRank. This field has been aligned with the GBIF backbone taxonomy. |
scientificname | String | Y | See dwc:scientificName. This field has been aligned with the GBIF backbone taxonomy. |
verbatimscientificname | String | Y | The scientific name as provided by the data publisher. |
verbatimscientificnameauthorship | String | Y | The scientific name authorship provided by the data publisher. |
taxonkey | Integer | Y | The numeric identifier for the taxon in GBIF's backbone taxonomy corresponding to scientificname. |
specieskey | Integer | Y | The numeric identifier for the taxon in GBIF's backbone taxonomy corresponding to species. |
typestatus | String | Y | See dwc:typeStatus. |
countrycode | String | Y | See dwc:countryCode. GBIF's interpretation has set this to an ISO 3166-1 alpha-2 code. |
locality | String | Y | See dwc:locality. |
stateprovince | String | Y | See dwc:stateProvince. |
decimallatitude | Double | Y² | See dwc:decimalLatitude. GBIF's interpretation has normalized this to a WGS84 coordinate. |
decimallongitude | Double | Y² | See dwc:decimalLongitude. GBIF's interpretation has normalized this to a WGS84 coordinate. |
coordinateuncertaintyinmeters | Double | Y | See dwc:coordinateUncertaintyInMeters. |
coordinateprecision | Double | Y | See dwc:coordinatePrecision. |
elevation | Double | Y | See dwc:elevation. If provided by the data publisher, GBIF's interpretation has normalized this value to metres. |
elevationaccuracy | Double | Y | See dwc:elevationAccuracy. If provided by the data publisher, GBIF's interpretation has normalized this value to metres. |
depth | Double | Y | See dwc:depth. If provided by the data publisher, GBIF's interpretation has normalized this value to metres. |
depthaccuracy | Double | Y | See dwc:depthAccuracy. If provided by the data publisher, GBIF's interpretation has normalized this value to metres. |
eventdate | String | Y | See dwc:eventDate. GBIF's interpretation has normalized this value to an ISO 8601 date with a local time. |
year | Integer | Y | See dwc:year. |
month | Integer | Y | See dwc:month. |
day | Integer | Y | See dwc:day. |
individualcount | Integer | Y | See dwc:individualCount. |
establishmentmeans | String | Y | See dwc:establishmentMeans. |
occurrenceid | String | Y³ | See dwc:occurrenceID. |
institutioncode | String | Y³ | See dwc:institutionCode. |
collectioncode | String | Y³ | See dwc:collectionCode. |
catalognumber | String | Y³ | See dwc:catalogNumber. |
recordnumber | String | Y | See dwc:recordNumber. |
recordedby | String | Y | See dwc:recordedBy. |
identifiedby | String | Y | See dwc:identifiedBy. |
dateidentified | String | Y | See dwc:dateIdentified. An ISO 8601 date. |
mediatype | String array | N⁴ | See dwc:mediaType. May contain StillImage, MovingImage or Sound (from an enumeration), indicating which media types are available for the occurrence. |
issue | String array | N⁴ | A list of issues encountered by GBIF in processing this record. More details are available on these issues and flags in this blog post. |
license | String | N | See dwc:license. Either CC0_1_0 or CC_BY_4_0. CC_BY_NC_4_0 records are not present in this snapshot. |
rightsholder | String | Y | See dwc:rightsHolder. |
lastinterpreted | String | N | The ISO 8601 date when the record was last processed by GBIF. Data are reprocessed for several reasons, including changes to the backbone taxonomy, so this date is not necessarily the date the occurrence record last changed. |
¹ Field names are lower case, but in later snapshots this may change to camelCase, for consistency with Darwin Core and the GBIF API.
² Occurrences without coordinates are excluded from this snapshot, although this may change in the future.
³ Either occurrenceID, or institutionCode + collectionCode + catalogNumber, or both, will be present on every record.
⁴ The array may be empty.
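A few of the schema rules above (the PRESENT/ABSENT status, the identifier rule in footnote ³, and the possibly empty arrays in footnote ⁴) can be sketched as simple checks on a record represented as a Python dictionary. This is an illustrative sketch, not GBIF code: the helper names are hypothetical, the sample records are invented, and it assumes the lower-case field names used in this snapshot.

```python
from typing import Any, List, Mapping

Record = Mapping[str, Any]


def is_present(record: Record) -> bool:
    """True when occurrencestatus is PRESENT (many users filter on this)."""
    return record.get("occurrencestatus") == "PRESENT"


def has_identifier(record: Record) -> bool:
    """Footnote 3: occurrenceid, or the full institution/collection/catalog triplet."""
    if record.get("occurrenceid") is not None:
        return True
    triplet = ("institutioncode", "collectioncode", "catalognumber")
    return all(record.get(key) is not None for key in triplet)


def media_types(record: Record) -> List[str]:
    """Footnote 4: mediatype is an array that may be empty."""
    return list(record.get("mediatype") or [])


# Invented sample records, for illustration only.
records = [
    {"gbifid": 1, "occurrencestatus": "PRESENT",
     "occurrenceid": "urn:example:1", "mediatype": ["StillImage"]},
    {"gbifid": 2, "occurrencestatus": "ABSENT",
     "institutioncode": "X", "collectioncode": "Y", "catalognumber": "Z",
     "mediatype": []},
]

present = [r for r in records if is_present(r)]
print([r["gbifid"] for r in present])
```

If later snapshots switch to camelCase field names, as footnote ¹ suggests may happen, the keys used in these helpers would need to change accordingly.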