# Final assignment
Hannah Weiser \
Heidelberg University \
Institute of Geography \
Advanced Geoscripting \
Summer term 2020

The goal of this project is to perform an exploratory data analysis of selected digger wasp species and their prey, using the API of GBIF (Global Biodiversity Information Facility). GBIF provides free and open access to biodiversity data (https://www.gbif.org/).

First, we import the required packages.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pygbif import occurrences as occ  # python client for the GBIF API 
from pygbif import maps
import mplleaflet
import geopandas as gpd
%matplotlib inline

In [2]:
isodontia_dict = occ.search(scientificName = 'Isodontia mexicana')
isodontia_occs = isodontia_dict['results']
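
A note on the query above: as far as we understand the GBIF API, a single `occ.search` call returns only one page of results (300 records by default), so `isodontia_occs` is probably not the complete set of occurrences. The following is a minimal sketch of how further pages could be collected. It assumes that the search function accepts `limit` and `offset` and returns a dictionary with the keys `results` and `endOfRecords`; the helper name `fetch_all_occurrences` and its parameters `page_size` and `max_records` are our own and not part of pygbif.

```python
def fetch_all_occurrences(search, page_size=300, max_records=1000, **query):
    """Collect occurrence records page by page.

    `search` is any callable with the keyword interface assumed for
    `pygbif.occurrences.search`: it accepts `limit` and `offset` and
    returns a dict with the keys 'results' and 'endOfRecords'.
    """
    records = []
    offset = 0
    while len(records) < max_records:
        # Never request more than is still needed to reach max_records.
        limit = min(page_size, max_records - len(records))
        page = search(limit=limit, offset=offset, **query)
        results = page.get("results", [])
        records.extend(results)
        offset += len(results)
        # Stop on the last page, or if a page unexpectedly comes back empty.
        if page.get("endOfRecords", True) or not results:
            break
    return records
```

With pygbif, this could be called as `fetch_all_occurrences(occ.search, scientificName='Isodontia mexicana')`. The cap `max_records` is there to keep the number of requests small while exploring.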

Let's display all keys of the first occurrence.

In [3]:
display(isodontia_occs[0].keys())

dict_keys(['key', 'datasetKey', 'publishingOrgKey', 'installationKey', 'publishingCountry', 'protocol', 'lastCrawled', 'lastParsed', 'crawlId', 'extensions', 'basisOfRecord', 'occurrenceStatus', 'taxonKey', 'kingdomKey', 'phylumKey', 'classKey', 'orderKey', 'familyKey', 'genusKey', 'speciesKey', 'acceptedTaxonKey', 'scientificName', 'acceptedScientificName', 'kingdom', 'phylum', 'order', 'family', 'genus', 'species', 'genericName', 'specificEpithet', 'taxonRank', 'taxonomicStatus', 'dateIdentified', 'decimalLongitude', 'decimalLatitude', 'coordinateUncertaintyInMeters', 'stateProvince', 'year', 'month', 'day', 'eventDate', 'issues', 'modified', 'lastInterpreted', 'references', 'license', 'identifiers', 'media', 'facts', 'relations', 'gadm', 'geodeticDatum', 'class', 'countryCode', 'recordedByIDs', 'identifiedByIDs', 'country', 'rightsHolder', 'identifier', 'http://unknown.org/nick', 'verbatimEventDate', 'datasetName', 'gbifID', 'collectionCode', 'verbatimLocality', 'occurrenceID', 'tax

What is the geodetic datum (the coordinate reference) of the first occurrence?

In [4]:
print(isodontia_occs[0]['geodeticDatum'])

WGS84
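
The check above only looks at the first record. To see whether all records share the same datum, we could tally the field over the whole list. This is a small sketch using only the standard library; `tally_field` is a hypothetical helper name, and records without the field are counted under `None`.

```python
from collections import Counter


def tally_field(records, field="geodeticDatum"):
    """Count how often each value of `field` occurs in a list of occurrence dicts."""
    # Records lacking the field are counted under None rather than skipped.
    return Counter(record.get(field) for record in records)
```

Calling `tally_field(isodontia_occs)` would then show at a glance whether anything other than WGS84 occurs.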


For further analyses, we want to work with a dataframe instead of a dictionary.

In [5]:
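df_isodontia = pd.DataFrame.from_dict(isodontia_occs)
display(df_isodontia.head())
# Parse the date columns; values that cannot be parsed become NaT instead of raising.
# (astype with a unitless 'datetime64' is rejected by recent pandas versions.)
for col in ['eventDate', 'dateIdentified']:
    df_isodontia[col] = pd.to_datetime(df_isodontia[col], errors='coerce')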

Unnamed: 0,key,datasetKey,publishingOrgKey,installationKey,publishingCountry,protocol,lastCrawled,lastParsed,crawlId,extensions,...,sex,lifeStage,eventRemarks,taxonRemarks,organismRemarks,collectionID,individualCount,continent,dataGeneralizations,higherClassification
0,2597845815,50c9509d-22c7-4a22-a47d-8c48425ef4a7,28eb1a3f-1c15-4a95-931a-4af90ecb574d,997448a8-f762-11e1-a439-00145eb45e9a,US,DWC_ARCHIVE,2020-11-01T14:40:23.781+0000,2020-11-01T15:56:23.096+0000,238,{},...,,,,,,,,,,
1,2603300247,50c9509d-22c7-4a22-a47d-8c48425ef4a7,28eb1a3f-1c15-4a95-931a-4af90ecb574d,997448a8-f762-11e1-a439-00145eb45e9a,US,DWC_ARCHIVE,2020-11-01T14:40:23.781+0000,2020-11-01T16:12:28.193+0000,238,{},...,,,,,,,,,,
2,2626578683,50c9509d-22c7-4a22-a47d-8c48425ef4a7,28eb1a3f-1c15-4a95-931a-4af90ecb574d,997448a8-f762-11e1-a439-00145eb45e9a,US,DWC_ARCHIVE,2020-11-01T14:40:23.781+0000,2020-11-01T16:13:51.389+0000,238,{},...,,,,,,,,,,
3,2634131701,50c9509d-22c7-4a22-a47d-8c48425ef4a7,28eb1a3f-1c15-4a95-931a-4af90ecb574d,997448a8-f762-11e1-a439-00145eb45e9a,US,DWC_ARCHIVE,2020-11-01T14:40:23.781+0000,2020-11-01T15:57:04.229+0000,238,{},...,,,,,,,,,,
4,2634531585,50c9509d-22c7-4a22-a47d-8c48425ef4a7,28eb1a3f-1c15-4a95-931a-4af90ecb574d,997448a8-f762-11e1-a439-00145eb45e9a,US,DWC_ARCHIVE,2020-11-01T14:40:23.781+0000,2020-11-01T16:13:38.922+0000,238,{},...,,,,,,,,,,


Let's also save the data right away. The same query may return different records later, so a local copy makes our workflow reproducible.

In [6]:
df_isodontia.to_csv("gbif_occ_isodontia_mexicana.csv")

For now, we are only interested in the locations and the event dates, and we only want occurrences within Europe. To work with the locations, we convert our dataframe to a geodataframe.

For plotting later, we have to make sure that the latitude and longitude columns do not contain NaN values; otherwise, we will not get a plot on the mplleaflet basemap.
Let's check whether there are any NaNs.

In [7]:
print(df_isodontia[['decimalLatitude', 'decimalLongitude']].isnull().sum().sum())

2


It looks like some coordinate values are missing (two NaNs in total). We remove the affected rows with the `DataFrame.dropna()` method.

In [8]:
df_isodontia = df_isodontia.dropna(subset=['decimalLatitude', 'decimalLongitude'])
gdf_isodontia = gpd.GeoDataFrame(df_isodontia, geometry=gpd.points_from_xy(df_isodontia.decimalLongitude, df_isodontia.decimalLatitude))
display(gdf_isodontia.head())

Unnamed: 0,key,datasetKey,publishingOrgKey,installationKey,publishingCountry,protocol,lastCrawled,lastParsed,crawlId,extensions,...,lifeStage,eventRemarks,taxonRemarks,organismRemarks,collectionID,individualCount,continent,dataGeneralizations,higherClassification,geometry
0,2597845815,50c9509d-22c7-4a22-a47d-8c48425ef4a7,28eb1a3f-1c15-4a95-931a-4af90ecb574d,997448a8-f762-11e1-a439-00145eb45e9a,US,DWC_ARCHIVE,2020-11-01T14:40:23.781+0000,2020-11-01T15:56:23.096+0000,238,{},...,,,,,,,,,,POINT (-100.38803 20.52219)
1,2603300247,50c9509d-22c7-4a22-a47d-8c48425ef4a7,28eb1a3f-1c15-4a95-931a-4af90ecb574d,997448a8-f762-11e1-a439-00145eb45e9a,US,DWC_ARCHIVE,2020-11-01T14:40:23.781+0000,2020-11-01T16:12:28.193+0000,238,{},...,,,,,,,,,,POINT (-97.88640 30.20881)
2,2626578683,50c9509d-22c7-4a22-a47d-8c48425ef4a7,28eb1a3f-1c15-4a95-931a-4af90ecb574d,997448a8-f762-11e1-a439-00145eb45e9a,US,DWC_ARCHIVE,2020-11-01T14:40:23.781+0000,2020-11-01T16:13:51.389+0000,238,{},...,,,,,,,,,,POINT (-92.92551 45.00075)
3,2634131701,50c9509d-22c7-4a22-a47d-8c48425ef4a7,28eb1a3f-1c15-4a95-931a-4af90ecb574d,997448a8-f762-11e1-a439-00145eb45e9a,US,DWC_ARCHIVE,2020-11-01T14:40:23.781+0000,2020-11-01T15:57:04.229+0000,238,{},...,,,,,,,,,,POINT (7.83430 48.01588)
4,2634531585,50c9509d-22c7-4a22-a47d-8c48425ef4a7,28eb1a3f-1c15-4a95-931a-4af90ecb574d,997448a8-f762-11e1-a439-00145eb45e9a,US,DWC_ARCHIVE,2020-11-01T14:40:23.781+0000,2020-11-01T16:13:38.922+0000,238,{},...,,,,,,,,,,POINT (8.68147 49.39040)


In [9]:
print(gdf_isodontia[['decimalLatitude', 'decimalLongitude']].isnull().sum().sum())

0
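
The text above says that we only want occurrences within Europe, but none of the cells applies that restriction yet. One simple way to do it is a bounding-box filter on the coordinate columns, sketched below. The box limits are a rough assumption of ours, not an official definition of Europe, and `EUROPE_BBOX` and `filter_to_bbox` are hypothetical names.

```python
import pandas as pd

# Rough, assumed bounding box for Europe in degrees (WGS84); adjust as needed.
EUROPE_BBOX = {"min_lon": -25.0, "max_lon": 45.0, "min_lat": 34.0, "max_lat": 72.0}


def filter_to_bbox(df, bbox=EUROPE_BBOX):
    """Return only the rows whose coordinates fall inside `bbox`."""
    in_lon = df["decimalLongitude"].between(bbox["min_lon"], bbox["max_lon"])
    in_lat = df["decimalLatitude"].between(bbox["min_lat"], bbox["max_lat"])
    return df[in_lon & in_lat]
```

Since a GeoDataFrame supports the same column operations, `filter_to_bbox(gdf_isodontia)` should work as well. The `continent` column would be an alternative, but it appears to be empty in the rows shown above.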


Let's see if we can plot it now. We colour the occurrence points by `eventDate` to get an idea of how the species has spread.

In [10]:
fig, ax = plt.subplots()
gdf_isodontia.plot(ax=ax, c=gdf_isodontia['eventDate'], cmap='Blues')
#plt.colorbar()
mplleaflet.display(fig=fig)

The get_offset_position function was deprecated in Matplotlib 3.3 and will be removed two minor releases later.
  offset_order = offset_dict[collection.get_offset_position()]
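
Colour maps operate on numbers, so it can help to convert the dates to a numeric value before colouring by them. The sketch below turns date-like values into approximate decimal years; `dates_to_decimal_year` is a hypothetical helper, and dividing by 365.25 is a deliberate simplification that ignores the exact length of each year.

```python
import pandas as pd


def dates_to_decimal_year(dates):
    """Convert date-like values to approximate fractional years."""
    # Values that cannot be parsed become NaT and end up as NaN in the result.
    parsed = pd.to_datetime(pd.Series(dates), errors="coerce")
    return parsed.dt.year + (parsed.dt.dayofyear - 1) / 365.25
```

The result could be stored in a new column, for example `decimal_year`, and passed to the geopandas plot through its `column` argument together with `cmap='Blues'`. We have not checked how mplleaflet renders such a plot.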


We now want to look at prey species of this wasp and their distribution. Let's get the data. We write a function so that we can easily retrieve the data for several species.

In [11]:
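def species_occurences_to_gdf(species_name):
    """Fetch GBIF occurrences for a species, save them as CSV and return a GeoDataFrame."""
    occ_dict = occ.search(scientificName=species_name)
    occ_df = pd.DataFrame.from_dict(occ_dict['results'])
    # Parse the dates; values that cannot be parsed become NaT instead of raising.
    occ_df['dateIdentified'] = pd.to_datetime(occ_df['dateIdentified'], errors='coerce')
    # Drop records without coordinates, since they cannot be plotted.
    occ_df = occ_df.dropna(subset=['decimalLatitude', 'decimalLongitude'])
    occ_df.to_csv("gbif_occ_" + species_name.replace(" ","_") + ".csv")
    occ_gdf = gpd.GeoDataFrame(occ_df, geometry=gpd.points_from_xy(occ_df.decimalLongitude, occ_df.decimalLatitude))
    return occ_gdf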

In [12]:
colors=["blue", "orange", "red"]
plots = []
fig, ax = plt.subplots()
species_list = ['Isodontia mexicana', 'Meconema meridionale', 'Meconema thalassinum']
for i, species in enumerate(species_list): 
    df = species_occurences_to_gdf(species)
    plots.append(df.plot(ax=ax, c=colors[i]))

#plt.legend(handles=plots, labels=species)
#plt.show()
mplleaflet.display(fig=fig)

The get_offset_position function was deprecated in Matplotlib 3.3 and will be removed two minor releases later.
  offset_order = offset_dict[collection.get_offset_position()]


(Unfortunately, mplleaflet does not let us plot legends.)
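
If the interactive basemap is not essential, a static matplotlib figure does support legends. The following is a minimal sketch without a basemap; `plot_with_legend` is a hypothetical helper that takes plain longitude and latitude sequences per species instead of the geodataframes used above.

```python
import matplotlib.pyplot as plt


def plot_with_legend(points_by_species, colors):
    """Scatter one point cloud per species on a static axis and add a legend.

    `points_by_species` maps a species name to a (longitudes, latitudes) pair.
    """
    fig, ax = plt.subplots()
    for (name, (lons, lats)), color in zip(points_by_species.items(), colors):
        # The label set here is what ax.legend() picks up below.
        ax.scatter(lons, lats, color=color, label=name, s=10)
    ax.set_xlabel("Longitude")
    ax.set_ylabel("Latitude")
    ax.legend()
    return fig, ax
```

For the geodataframes above, the coordinates could be taken from the `decimalLongitude` and `decimalLatitude` columns of each species.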