# German Mammals GBIF

Using data from GBIF to create a list of German mammals.

In [41]:
from pygbif import species
from pygbif import occurrences as occ  # we refer to the occurrences module as occ

import os



Finding the keys for the different taxon levels can be a little tricky if you do not know where or how to look.

For example, here we want to find mammals in Germany. We can check the documentation of occ.search() by following this link: https://pygbif.readthedocs.io/en/latest/docs/usecases.html

There we find that the class key is an integer: "classKey – [int] Class classification key". But we do not have a list of classes and their corresponding integers, which is confusing (if such a list exists, please send me the link). If you are like me, you will try entering a string like 'mammalia' anyway, only to get a traceback. What you can do instead is go to the GBIF page for the taxon of interest and pull the key from its URL. For mammals that page is https://www.gbif.org/species/359, so classKey = 359.
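
One way to avoid hunting through URLs might be to ask the GBIF backbone for the key directly. The sketch below assumes that `species.name_backbone()` from pygbif returns a dictionary with `rank` and `usageKey` entries for a match; the parameter and field names may differ between pygbif versions. `key_for_rank` and `lookup_key` are hypothetical helpers, not part of pygbif.

```python
def key_for_rank(match, rank):
    """Return the usage key from a backbone match if it has the expected rank.

    `match` is assumed to be a dict shaped like a GBIF backbone match,
    with at least 'rank' and 'usageKey' entries (an assumption).
    """
    if match.get("rank", "").upper() != rank.upper():
        raise ValueError(f"expected rank {rank!r}, got {match.get('rank')!r}")
    return int(match["usageKey"])


def lookup_key(name, rank):
    """Query the GBIF backbone for `name` (needs network access and pygbif)."""
    from pygbif import species  # imported here so key_for_rank works offline

    return key_for_rank(species.name_backbone(name=name, rank=rank), rank)


# Offline illustration with a hand-written match; 359 is the key taken
# from https://www.gbif.org/species/359 above.
sample_match = {"usageKey": 359, "scientificName": "Mammalia", "rank": "CLASS"}
class_key = key_for_rank(sample_match, "class")
print(class_key)
```

With network access, `lookup_key('Mammalia', 'class')` would be expected to return the same key, which can then be passed as classKey.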

occ.search() allows us to specify several parameters, including country. Here the documentation is fairly straightforward: 'country – [str] The 2-letter country code (as per ISO-3166-1) of the country in which the occurrence was recorded. See here http://en.wikipedia.org/wiki/ISO_3166-1_alpha-2'

We can visit the link and find that the two-letter code for Germany is 'DE', or that the code for Samoa is 'WS'.

There are several other arguments that are important:
1. limit: the number of records to return. The default is 300, which is also the most the GBIF occurrence search API returns per request.
2. offset: the index of the record to start from
3. q: a word or phrase to search for
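
Since a single call returns only one page, limit and offset can be combined to walk through a larger result set. The following is a minimal sketch, not pygbif functionality: `fetch_records` is a hypothetical helper that accepts any search function returning a dictionary with 'results' and 'endOfRecords', and `fake_search` stands in for occ.search() so the example runs without network access. The page size of 300 mirrors the default mentioned above.

```python
def fetch_records(search, max_records, page_size=300, **filters):
    """Collect up to `max_records` results by paging with limit and offset.

    `search` is any callable that accepts offset/limit plus filters and
    returns a dict with 'results' and 'endOfRecords', like occ.search().
    """
    records = []
    offset = 0
    while len(records) < max_records:
        limit = min(page_size, max_records - len(records))
        page = search(offset=offset, limit=limit, **filters)
        results = page.get("results", [])
        records.extend(results)
        if page.get("endOfRecords", True) or not results:
            break  # nothing more to fetch
        offset += len(results)
    return records


# Stand-in for occ.search so the sketch runs offline: serves 700 fake records.
def fake_search(offset=0, limit=300, **filters):
    data = [{"key": i} for i in range(700)]
    chunk = data[offset:offset + limit]
    return {"offset": offset, "limit": limit,
            "endOfRecords": offset + len(chunk) >= len(data),
            "results": chunk}


first_650 = fetch_records(fake_search, max_records=650, classKey=359, country="DE")
print(len(first_650))
```

Against the live API the call would look like `fetch_records(occ.search, max_records=650, classKey=359, country='DE')`. For very large result sets a download request is likely the better tool.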

Alright, now let's try it by looking at the first 10 mammal records.

In [42]:
occ.search(classKey=359,country='DE', limit=10)

{'offset': 0,
 'limit': 10,
 'endOfRecords': False,
 'count': 767703,
 'results': [{'key': 5028885891,
   'datasetKey': 'aa6c5ee6-d4d7-4a65-a04f-379cffbf4842',
   'publishingOrgKey': '2754e9c0-0e43-4f65-968a-6f16b9c378ce',
   'installationKey': 'dcceb601-2fb0-49dc-9cd2-7c00056f2b2c',
   'hostingOrganizationKey': '2754e9c0-0e43-4f65-968a-6f16b9c378ce',
   'publishingCountry': 'DE',
   'protocol': 'BIOCASE',
   'lastCrawled': '2025-05-09T16:19:17.871+00:00',
   'lastParsed': '2025-05-09T16:32:59.595+00:00',
   'crawlId': 345,
   'extensions': {},
   'basisOfRecord': 'HUMAN_OBSERVATION',
   'occurrenceStatus': 'PRESENT',
   'taxonKey': 5220126,
   'kingdomKey': 1,
   'phylumKey': 44,
   'classKey': 359,
   'orderKey': 731,
   'familyKey': 5298,
   'genusKey': 2440927,
   'speciesKey': 5220126,
   'acceptedTaxonKey': 5220126,
   'scientificName': 'Capreolus capreolus (Linnaeus, 1758)',
   'acceptedScientificName': 'Capreolus capreolus (Linnaeus, 1758)',
   'kingdom': 'Animalia',
   'phylum

The above output is not particularly readable, nor is it in the table format I would like for a list of species. It is also not obvious how many records exist.
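
For a quick look, the 'results' list of the response can be turned into a small table. This is a sketch under the assumption that each result is a flat dictionary like the one printed above; `results_to_table` is a hypothetical helper, and the sample response is hand-trimmed from the fields visible in that output.

```python
import pandas as pd


def results_to_table(response, columns):
    """Flatten the 'results' list of an occ.search() response into a table.

    Requested columns that are missing from the records are filled with NaN.
    """
    return pd.DataFrame(response.get("results", [])).reindex(columns=columns)


# Hand-trimmed sample built from the fields visible in the output above.
sample_response = {
    "count": 767703,
    "results": [
        {
            "key": 5028885891,
            "basisOfRecord": "HUMAN_OBSERVATION",
            "classKey": 359,
            "scientificName": "Capreolus capreolus (Linnaeus, 1758)",
            "kingdom": "Animalia",
        }
    ],
}

table = results_to_table(sample_response, ["key", "scientificName", "basisOfRecord"])
print(sample_response["count"], "records in total")
print(table)
```

With a live response this would be `results_to_table(occ.search(classKey=359, country='DE', limit=10), [...])`.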

To find the number of records, let's use occ.count(). Note that it has no classKey argument; we use taxonKey instead.

In [43]:
occ.count(taxonKey=359,country='DE')

767703

767,703 records is a lot.

In [44]:
occ.count(taxonKey=359,country='DE', isGeoreferenced=True)

690231

We can see there are fewer records if we require the records to be georeferenced.
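
To put the two counts side by side, a little arithmetic on the numbers returned above shows how many records lack coordinates. The values are copied from the two occ.count() outputs; nothing is queried here.

```python
total_records = 767703          # occ.count(taxonKey=359, country='DE')
georeferenced_records = 690231  # same query with isGeoreferenced=True

missing_coordinates = total_records - georeferenced_records
share_georeferenced = 100 * georeferenced_records / total_records

print(f"{missing_coordinates} records lack coordinates")
print(f"{share_georeferenced:.1f}% of records are georeferenced")
```

So roughly one record in ten would be lost by requiring coordinates.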

Let's see if we can download the data and then simplify the output.

An interesting quirk of the occ.download() method is that filters are not keyword arguments as in occ.search(). Instead they are passed as a list of query strings such as 'taxonKey = 359'.
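
If the filters already live in a dictionary, the query strings can be generated rather than typed by hand. `build_queries` below is a hypothetical helper, not part of pygbif; it only assumes the 'key = value' format and the lower-case 'true' used in the download cell of this notebook.

```python
def build_queries(filters):
    """Turn a dict of filters into 'key = value' query strings.

    Booleans are written in lower case ('true'/'false'), matching the
    query strings used for occ.download() in this notebook.
    """
    queries = []
    for key, value in filters.items():
        if isinstance(value, bool):
            value = str(value).lower()
        queries.append(f"{key} = {value}")
    return queries


queries = build_queries({"taxonKey": 359, "country": "DE", "hasCoordinate": True})
print(queries)
```

The resulting list could be passed as the first argument of occ.download().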

In [45]:
# Set GBIF credentials
os.environ["GBIF_USER"] = "your_gbif_username"
os.environ["GBIF_PWD"] = "your_gbif_password"
os.environ["GBIF_EMAIL"] = "your_gbif_email"

# Create download
download_key = occ.download(
    [
        'taxonKey = 359',
        'country = DE',
        'hasCoordinate = true'
    ],
    format="DWCA",  # or "SIMPLE_CSV", "SPECIES_LIST"
    user=os.environ["GBIF_USER"],
    pwd=os.environ["GBIF_PWD"],
    email=os.environ["GBIF_EMAIL"]
)



Exception: error: , with error status code 401check your number of active downloads.
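
Status code 401 is the HTTP code for an unauthorized request, so the exception above most likely means the credentials were rejected, for example because the placeholder values were still in place. A small check before calling occ.download() can catch that earlier. The sketch below is an assumption about how one might do this: `gbif_credentials` is a hypothetical helper, and the example values are made up.

```python
import os


def gbif_credentials(env=None):
    """Read GBIF credentials from environment variables, failing early if unset.

    The placeholder check is a hypothetical safeguard for this notebook,
    where the example values start with 'your_gbif_'.
    """
    env = os.environ if env is None else env
    names = {"user": "GBIF_USER", "pwd": "GBIF_PWD", "email": "GBIF_EMAIL"}
    credentials = {}
    for argument, variable in names.items():
        value = env.get(variable, "")
        if not value or value.startswith("your_gbif_"):
            raise RuntimeError(f"{variable} is not set to a real value")
        credentials[argument] = value
    return credentials


# Offline illustration with made-up values instead of the real environment.
example_env = {"GBIF_USER": "jdoe", "GBIF_PWD": "not-a-real-password",
               "GBIF_EMAIL": "jdoe@example.org"}
print(sorted(gbif_credentials(example_env)))
```

With real values exported in the shell, the download call could then take `**gbif_credentials()` in place of the three explicit keyword arguments, which also keeps the password out of the notebook.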

In [None]:
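# The value returned by occ.download() is indexable; its first element is the download key
print(download_key)
download_key = download_key[0]
print(download_key)
# Fetch the metadata for the download, including its status
occ.download_meta(download_key)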

{'key': '0030743-250426092105405',
 'doi': '10.15468/dl.rtjz43',
 'license': 'http://creativecommons.org/licenses/by-nc/4.0/legalcode',
 'request': {'predicate': {'type': 'and',
   'predicates': [{'type': 'equals',
     'key': 'TAXON_KEY',
     'value': '359',
     'matchCase': False},
    {'type': 'equals', 'key': 'COUNTRY', 'value': 'DE', 'matchCase': False},
    {'type': 'equals',
     'key': 'HAS_COORDINATE',
     'value': 'true',
     'matchCase': False}]},
  'sendNotification': True,
  'format': 'DWCA',
  'type': 'OCCURRENCE',
  'verbatimExtensions': []},
 'created': '2025-05-14T14:20:48.086+00:00',
 'modified': '2025-05-14T14:43:03.634+00:00',
 'eraseAfter': '2025-11-14T14:20:47.960+00:00',
 'status': 'SUCCEEDED',
 'downloadLink': 'https://api.gbif.org/v1/occurrence/download/request/0030743-250426092105405.zip',
 'size': 202831810,
 'totalRecords': 690231,
 'numberDatasets': 601}
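
The metadata above reports 'status': 'SUCCEEDED', but a freshly submitted download usually needs some time before it reaches that state (here 'created' and 'modified' are about 22 minutes apart). One option is to poll the metadata until the job finishes. This is a sketch, not pygbif functionality: `wait_for_download` is a hypothetical helper, `fake_meta` stands in for occ.download_meta(), and the status names other than 'SUCCEEDED' are assumptions.

```python
import time


def wait_for_download(get_meta, download_key, poll_seconds=60, max_polls=120,
                      sleep=time.sleep):
    """Poll download metadata until the job finishes, then return the metadata.

    `get_meta` is any callable like occ.download_meta(). The status names
    other than 'SUCCEEDED' are assumptions about what GBIF may report.
    """
    failed_states = {"FAILED", "KILLED", "CANCELLED"}
    for _ in range(max_polls):
        meta = get_meta(download_key)
        status = meta.get("status")
        if status == "SUCCEEDED":
            return meta
        if status in failed_states:
            raise RuntimeError(f"download {download_key} ended with {status}")
        sleep(poll_seconds)
    raise TimeoutError(f"download {download_key} not ready after {max_polls} polls")


# Stand-in for occ.download_meta so the sketch runs offline.
statuses = iter(["PREPARING", "RUNNING", "SUCCEEDED"])


def fake_meta(key):
    return {"key": key, "status": next(statuses)}


meta = wait_for_download(fake_meta, "0030743-250426092105405", sleep=lambda s: None)
print(meta["status"])
```

In the notebook this would be called as `wait_for_download(occ.download_meta, download_key)`.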

Because our request succeeded ('status': 'SUCCEEDED'), we can use occ.download_get() to download the zip file.

In [None]:
occ.download_get(
    download_key
)

INFO:Download file size: 202831810 bytes
INFO:On disk at ./0030743-250426092105405.zip


{'path': './0030743-250426092105405.zip',
 'size': 202831810,
 'key': '0030743-250426092105405'}

The download worked, so now we have a zip file.

In [None]:
import zipfile

with zipfile.ZipFile("0030743-250426092105405.zip", "r") as zip_ref:
    zip_ref.extractall("gbif_data")
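
Before unpacking an archive of about 200 MB, it can help to look at what is inside. The sketch below lists file names and uncompressed sizes with the standard library; `archive_listing` is a hypothetical helper, and the tiny demo archive is a stand-in with dummy contents so the example runs without the real download.

```python
import os
import tempfile
import zipfile


def archive_listing(zip_path):
    """Return (name, uncompressed size in bytes) for each file in the archive."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        return [(info.filename, info.file_size) for info in zip_ref.infolist()]


# Offline illustration: build a tiny stand-in archive in a temporary folder.
# The file names mimic a Darwin Core Archive but the contents are dummies.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_zip = os.path.join(tmp_dir, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as zip_ref:
        zip_ref.writestr("occurrence.txt", "gbifID\tscientificName\n")
        zip_ref.writestr("meta.xml", "<archive/>")
    listing = archive_listing(demo_zip)

print(listing)
```

For the real file the call would be `archive_listing("0030743-250426092105405.zip")`.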




In [53]:
import pandas as pd

df = pd.read_csv("gbif_data/occurrence.txt", delimiter="\t")
df.head()





| | gbifID | accessRights | bibliographicCitation | language | license | modified | publisher | references | rightsHolder | type | ... | publishedByGbifRegion | level0Gid | level0Name | level1Gid | level1Name | level2Gid | level2Name | level3Gid | level3Name | iucnRedListCategory |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 296571211 | | | | CC_BY_4_0 | | GEO-Tag der Artenvielfalt | | | | ... | EUROPE | DEU | Germany | DEU.10_1 | Nordrhein-Westfalen | DEU.10.40_1 | Rhein-Erft-Kreis | DEU.10.40.9_1 | Pulheim | LC |
| 1 | 296571162 | | | | CC_BY_4_0 | | GEO-Tag der Artenvielfalt | | | | ... | EUROPE | DEU | Germany | DEU.10_1 | Nordrhein-Westfalen | DEU.10.40_1 | Rhein-Erft-Kreis | DEU.10.40.9_1 | Pulheim | LC |
| 2 | 296571180 | | | | CC_BY_4_0 | | GEO-Tag der Artenvielfalt | | | | ... | EUROPE | DEU | Germany | DEU.10_1 | Nordrhein-Westfalen | DEU.10.40_1 | Rhein-Erft-Kreis | DEU.10.40.9_1 | Pulheim | LC |
| 3 | 164193672 | | | | CC_BY_4_0 | | GEO-Tag der Artenvielfalt | | | | ... | EUROPE | DEU | Germany | DEU.13_1 | Sachsen-Anhalt | DEU.13.12_1 | Salzlandkreis | DEU.13.12.8_1 | Könnern | LC |
| 4 | 164250194 | | | | CC_BY_4_0 | | GEO-Tag der Artenvielfalt | | | | ... | EUROPE | DEU | Germany | DEU.7_1 | Hessen | DEU.7.11_1 | Kassel | DEU.7.11.16_1 | Kaufungen | LC |


Check the documentation on this page to find the column names and their meanings:
https://dwc.tdwg.org/terms/

In [55]:
selectedColumns = [
    'order',
    'family',
    'scientificName',
    'vernacularName',
    'decimalLatitude',
    'decimalLongitude'
]
dfv1 = df[selectedColumns]
dfv1.head()

| | order | family | scientificName | vernacularName | decimalLatitude | decimalLongitude |
|---|---|---|---|---|---|---|
| 0 | Rodentia | Sciuridae | Sciurus vulgaris Linnaeus, 1758 | | 50.949373 | 6.757536 |
| 1 | Artiodactyla | Cervidae | Capreolus capreolus (Linnaeus, 1758) | | 50.949373 | 6.757536 |
| 2 | Lagomorpha | Leporidae | Lepus europaeus Pallas, 1778 | | 50.949373 | 6.757536 |
| 3 | Rodentia | Cricetidae | Microtus arvalis (Pallas, 1779) | | 51.707617 | 11.681411 |
| 4 | Soricomorpha | Talpidae | Talpa europaea Linnaeus, 1758 | | 51.288684 | 9.610235 |


The above is super nice, but we only want the unique species.

In [56]:
# Collapse vernacular names per species
species_list = (
    dfv1.groupby(['order', 'family','scientificName'], as_index=False)
      .agg({'vernacularName': lambda x: ', '.join(sorted(set(x.dropna())))})
)
species_list.head()

#rowCount =  len(species_list)
#print(f"Number of species: {rowCount}")

| | order | family | scientificName | vernacularName |
|---|---|---|---|---|
| 0 | Afrosoricida | Tenrecidae | Echinops telfairi Martin, 1838 | |
| 1 | Afrosoricida | Tenrecidae | Tenrec ecaudatus (Schreber, 1778) | |
| 2 | Artiodactyla | Anoplotheriidae | Anoplotherium Cuvier, 1804 | |
| 3 | Artiodactyla | Anoplotheriidae | Dacrytherium ovinum (Owen, 1857) | |
| 4 | Artiodactyla | Anoplotheriidae | Diplobune Rütimeyer, 1862 | |


Now there is a list of species, which we can write to a CSV file.

In [51]:
species_list.to_csv("MammalsOfGermany.csv", index=False)

The data is not clean and needs further processing: the scientific name for the same species is reported in several formats. I will try a second method for hopefully cleaner data.
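
One possible way to reduce the format differences is to strip the authorship from each scientific name before grouping. The function below is only a heuristic sketch and a hypothetical helper: it keeps the second word only when it looks like a lower-case epithet, so it will mishandle unusual cases such as authors written in lower case or hyphenated epithets. The names are taken from the tables above.

```python
def canonical_name(scientific_name):
    """Strip authorship from a scientific name using a simple heuristic.

    The first word is taken as the genus; the second word is kept only if
    it is all lower case, which is how species epithets are written.
    """
    words = scientific_name.split()
    if not words:
        return ""
    if len(words) > 1 and words[1].isalpha() and words[1].islower():
        return f"{words[0]} {words[1]}"
    return words[0]


names = [
    "Capreolus capreolus (Linnaeus, 1758)",
    "Sciurus vulgaris Linnaeus, 1758",
    "Anoplotherium Cuvier, 1804",
]
print([canonical_name(name) for name in names])
```

Applied to the table, this could look like `dfv1['scientificName'].map(canonical_name)`, after which grouping would merge rows that differ only in how the authorship is written.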

In [58]:
selectedColumns = [
    'order',
    'family',
    'genus',
    'speciesEpithet',
    'vernacularName',
    'decimalLatitude',
    'decimalLongitude'
]
df = df[selectedColumns]
df.head()

KeyError: "['speciesEpithet'] not in index"
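
The KeyError is consistent with a misspelled column name: the Darwin Core term listed at https://dwc.tdwg.org/terms/ is `specificEpithet`, not `speciesEpithet`. The sketch below assumes the GBIF export carries that column and uses a small hand-made stand-in table instead of the real data; `missing_columns` is a hypothetical helper for checking names before selecting them.

```python
import pandas as pd


def missing_columns(frame, wanted):
    """List the requested column names that the DataFrame does not have."""
    return [name for name in wanted if name not in frame.columns]


# Small stand-in for the GBIF table; the real one has many more columns.
sample = pd.DataFrame(
    {
        "order": ["Rodentia", "Artiodactyla", "Rodentia"],
        "family": ["Sciuridae", "Cervidae", "Sciuridae"],
        "genus": ["Sciurus", "Capreolus", "Sciurus"],
        "specificEpithet": ["vulgaris", "capreolus", "vulgaris"],
    }
)

print(missing_columns(sample, ["genus", "speciesEpithet"]))   # the misspelled name
print(missing_columns(sample, ["genus", "specificEpithet"]))  # the Darwin Core term

# Build one binomial per row, then keep each species once.
species = (
    sample.assign(binomial=sample["genus"] + " " + sample["specificEpithet"])
    .drop_duplicates(subset="binomial")
    .reset_index(drop=True)
)
print(species[["order", "family", "binomial"]])
```

Running `missing_columns(df, selectedColumns)` on the real table before the selection would show which names need correcting.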