![PANGAEA_Banner.png](https://github.com/pangaea-data-publisher/community-workshop-material/raw/master/banner.png)

# **pangaeapy practical**
## **How to search and download data from PANGAEA**

Version: 0.1.0<br>
By: Kathrin Riemann-Campe and Michael Oellermann<br>
Last updated: 2024-05-02

This notebook shows you how to retrieve diverse earth and environmental data and their metadata from the [PANGAEA data repository](https://www.pangaea.de) using Python. It uses the [PangaeaPy package](https://pypi.org/project/pangaeapy/) to facilitate the data download.

Run this notebook in:
* [GoogleColab](https://colab.research.google.com/github/pangaea-data-publisher/community-workshop-material/blob/master/Python/PANGAEApy_practical/PANGAEApy_practical.ipynb): <a target="_blank" href="https://colab.research.google.com/github/pangaea-data-publisher/community-workshop-material/blob/master/Python/PANGAEApy_practical/pangaeapy_practical.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

### Content of this notebook
1. Import libraries
2. Query for data in PANGAEA
3. Get metadata
4. Download datasets
5. Download binary files

# 1. Import libraries

In [None]:
### general libraries
import os
import pandas as pd
import numpy as np
import requests 
from urllib.request import urlopen, urlretrieve

In [None]:
### plotting
from matplotlib import pyplot as plt
import plotly.express as px

In [None]:
### PANGAEApy
## if you need to install PANGAEApy use pip
!pip install pangaeapy # comment to not install pangaeapy

## if you need to upgrade PANGAEApy use 
#!pip install pangaeapy --upgrade # Uncomment to upgrade pangaeapy

## check version of PANGAEApy
#!pip show pangaeapy

## for details see https://pypi.org/project/pangaeapy/ 

import pangaeapy as pan
from pangaeapy.pandataset import PanDataSet
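
As an alternative to `!pip show pangaeapy`, the installed version can also be checked from within Python. The following is a minimal sketch using only the standard library; the helper name `installed_version` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


pangaeapy_version = installed_version("pangaeapy")
if pangaeapy_version is None:
    print("pangaeapy is not installed; run the pip install line first.")
else:
    print(f"pangaeapy {pangaeapy_version} is installed.")
```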

To view the PANGAEApy documentation, uncomment one of the following lines:

In [None]:
#help(pan)
### or 
#help(pan.panquery)
### or
#help(pan.pandataset)

In [None]:
# ignore warnings in this script
import warnings
from pandas.errors import SettingWithCopyWarning
warnings.simplefilter(action='ignore', category=(SettingWithCopyWarning))
warnings.simplefilter(action='ignore', category=FutureWarning)

# 2. Query for data in PANGAEA

AIM: What data can I find for a particular topic such as a species, location or author?

This mirrors the query via the [PANGAEA website](https://pangaea.de/)

## 2.1 Simple query
Note:
* limit = the maximum number of datasets returned by a query
    * default limit = 10
    * maximum limit = 500
    * To retrieve more than 500 results, use the offset attribute, e.g. pan.PanQuery("Triticum", limit = 500, offset=500)
* type: 
    * collection = dataset collection
    * member = individual dataset which can be part of a dataset collection 
* score: Indicates how well the dataset matched the query term
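
To illustrate how the `type` and `score` fields can be used, here is a small sketch on plain Python dictionaries. The three records are made up for illustration and only mimic the shape of the entries in `query.result` (keys `URI`, `score`, `type`); they are not real PANGAEA results.

```python
from collections import Counter

# Hypothetical records shaped like the entries of query.result
# (values are illustrative, not real PANGAEA results).
example_results = [
    {"URI": "doi:10.1594/PANGAEA.000001", "score": 12.5, "type": "member"},
    {"URI": "doi:10.1594/PANGAEA.000002", "score": 30.1, "type": "collection"},
    {"URI": "doi:10.1594/PANGAEA.000003", "score": 18.0, "type": "member"},
]

# Count how many results are collections and how many are individual datasets
type_counts = Counter(record["type"] for record in example_results)

# Sort so that the best match (highest score) comes first
best_first = sorted(example_results, key=lambda record: record["score"], reverse=True)

# Keep only individual datasets
members_only = [record for record in example_results if record["type"] == "member"]

print(dict(type_counts))
print([record["URI"] for record in best_first])
print(len(members_only))
```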

In [None]:
# query database for "Triticum"
query = pan.PanQuery("Triticum")
print(f'There are {query.totalcount} query results.')

# save query as dataframe
query_results = pd.DataFrame(query.result)
print(f'Total length of data frame query_results is {len(query_results)}.')

In [None]:
query_results

## 2.2 More complex queries

See the [PANGAEA search documentation](https://wiki.pangaea.de/wiki/PANGAEA_search) for more information on how to query with keywords.
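
Query strings in this notebook combine free-text terms with fielded terms such as `citation:author:Herzschuh` or `location:'arctic ocean'`, separated by spaces. The helper below is a minimal sketch for assembling such strings; `build_query` is a hypothetical convenience function, not part of pangaeapy, and it only concatenates text without checking that a field name exists.

```python
def build_query(terms=(), fielded=None):
    """Join free-text terms and field:value filters into one query string."""
    parts = list(terms)
    for field, value in (fielded or {}).items():
        # Quote values that contain spaces, as in location:'arctic ocean'
        if " " in value:
            value = f"'{value}'"
        parts.append(f"{field}:{value}")
    return " ".join(parts)


print(build_query(["Globigerina", "bulloides"]))
print(build_query(fielded={"citation:author": "Herzschuh"}))
print(build_query(fielded={"basis": "tara", "location": "arctic ocean"}))
```

The resulting string can be passed to `pan.PanQuery` like any hand-written query.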


#### Multiple query terms

In [None]:
# find datasets that contain both "marine" and "geology"
query = pan.PanQuery("marine geology")
print(f'There are {query.totalcount} query results.')

#### Optional query terms

In [None]:
# find datasets that contain "Globigerina" and either "falconensis" or "bulloides" 
query = pan.PanQuery("Globigerina AND (falconensis OR bulloides)")
print(f'There are {query.totalcount} query results.')

#### Uncertain spelling

In [None]:
# find datasets when a single letter of the term is uncertain (? stands for one character)
query = pan.PanQuery("Gl?bigerina")
print(f'There are {query.totalcount} query results.')

In [None]:
# finds datasets with "Neogloboquadrina" regardless of your spelling mistake
query = pan.PanQuery("~Neogloboqadrina")
print(f'There are {query.totalcount} query results.') 

#### Specific author

In [None]:
# find datasets of author "Herzschuh"
query = pan.PanQuery("citation:author:Herzschuh")
print(f'There are {query.totalcount} query results.') 

#### Within geolocation

In [None]:
# query database for "Globigerina bulloides" within a certain geolocation aka bounding box
# bounding box: bbox=(minlon, minlat,  maxlon, maxlat)
query = pan.PanQuery("Globigerina bulloides", limit = 500, bbox=(17.7, 67.7, 21, 69))
print(f'There are {query.totalcount} query results.')
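
Because the bounding box is a plain tuple in the order (minlon, minlat, maxlon, maxlat), values are easily mixed up. The following sketch adds a few basic range checks before the tuple is passed on. `make_bbox` is a hypothetical helper, not part of pangaeapy. It catches slips such as a latitude outside ±90 or reversed min/max latitudes; the order of the longitudes is deliberately not checked.

```python
def make_bbox(minlon, minlat, maxlon, maxlat):
    """Return (minlon, minlat, maxlon, maxlat) after basic range checks."""
    if not (-180 <= minlon <= 180 and -180 <= maxlon <= 180):
        raise ValueError("longitudes must lie between -180 and 180")
    if not (-90 <= minlat <= 90 and -90 <= maxlat <= 90):
        raise ValueError("latitudes must lie between -90 and 90")
    if minlat > maxlat:
        raise ValueError("minlat must not be larger than maxlat")
    return (minlon, minlat, maxlon, maxlat)


# Bounding box used in the query above
bbox = make_bbox(17.7, 67.7, 21, 69)
print(bbox)
```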

## 2.3 Queries exceeding 500 results

### How to query PANGAEA without result limitations
* A single query returns at most 500 datasets.  
* Retrieve further results in chunks of 500 via the offset option.  
* Combine all chunks in one data frame.
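
The chunking can be sketched without contacting PANGAEA. The example below computes the offsets and chunk sizes for a made-up total of 1234 results; `page_offsets` is a hypothetical helper name.

```python
def page_offsets(total_count, page_size=500):
    """Return the offset of each chunk needed to cover total_count results."""
    return list(range(0, total_count, page_size))


total = 1234  # made-up number of query results
for offset in page_offsets(total):
    chunk_size = min(500, total - offset)
    print(f"offset={offset}: {chunk_size} results")
```

Each offset corresponds to one `pan.PanQuery(..., limit = 500, offset=offset)` call.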

In [None]:
# query database for project "PAGES_C-PEAT" 
query = pan.PanQuery("project:label:PAGES_C-PEAT", limit = 500)
print(f'There are {query.totalcount} query results.')
print(f'Currently query consists of {len(query.result)} entries.')

In [None]:
# Get all results and combine them in data frame.

# create empty data frame
df_query_results_all = pd.DataFrame()

# loop over all results in steps of 500
for i in np.arange(0,query.totalcount,500):
    # store result of individual step in qs
    qs = pan.PanQuery("project:label:PAGES_C-PEAT", limit = 500, offset=i)
    # convert qs result with up to 500 entries to data frame df_qs
    df_qs = pd.DataFrame(qs.result)
    # concatenate all individual df_qs into one data frame named df_query_results_all
    df_query_results_all = pd.concat([df_query_results_all,df_qs],ignore_index=True)
    
print(f'There are {query.totalcount} query results.')
print(f'df_query_results_all consists of {len(df_query_results_all)} results.')

In [None]:
# show first 3 lines
df_query_results_all.head(3)

In [None]:
# show last 3 lines
df_query_results_all.tail(3)

## 2.4. Quiz

See the [PANGAEA search documentation](https://wiki.pangaea.de/wiki/PANGAEA_search) for more information on how to query with keywords.

### 2.4.1 How many datasets contain "Octopus vulgaris"?

In [None]:
# Your solution

### 2.4.2 How many datasets contain "sea ice" in the title only?

In [None]:
# Your solution

### 2.4.3 How many datasets has the author Antje Boetius published?

In [None]:
# Your solution

### 2.4.4 How many datasets measured "Temperature, water" using a CTD/Rosette?

In [None]:
# Your solution

# 3. Get metadata

A long list of metadata can be retrieved with PanDataSet.  
Find a comprehensive list in the built-in documentation  
_help(pan.pandataset)_    

or in this notebook full of examples: [pangaeapy_detailed_metadata_search.ipynb](https://github.com/pangaea-data-publisher/community-workshop-material/tree/master/Python/PANGAEApy_practical/pangaeapy_detailed_metadata_search.ipynb)  

An additional example shows how to extract project-specific information from PANGAEA datasets: [PANGAEA_access_metadata_per_project.ipynb](https://github.com/pangaea-data-publisher/community-workshop-material/blob/master/Python/PANGAEApy_practical/PANGAEA_access_metadata_per_project.ipynb)

## 3.1 Get metadata of individual dataset

#### Example dataset from PANGAEA https://doi.pangaea.de/10.1594/PANGAEA.923033

In [None]:
# Example dataset from PANGAEA
#ds = PanDataSet('https://doi.pangaea.de/10.1594/PANGAEA.923033', include_data=False) # metadata only
#ds = PanDataSet('doi:10.1594/PANGAEA.923033', include_data=False) # metadata only
ds = PanDataSet(923033, include_data=False) # metadata only

### Basic metadata retrieval

In [None]:
# Title
print(f'Title: {ds.title}')
# Abstract
print(f'Abstract: {ds.abstract}')
# Publication date
print(f'Publication date: {ds.date}')
# Authors
print(f'Authors: {"; ".join([x.fullname for x in ds.authors])}')
# Author orcids
print(f'Orcids: {"; ".join([x.ORCID if x.ORCID else "no ORCID" for x in ds.authors])}')
# Citation
print(f'Citation: {ds.citation}')
# doi
print(f'doi: {ds.doi}')
# Geolocation
print(f'Latitude: {ds.geometryextent["meanLatitude"]}')
print(f'Longitude: {ds.geometryextent["meanLongitude"]}')
# Parameters
params = "; ".join([f'{param.name} [{param.unit}]' if param.unit else param.name for param in ds.params.values()])
print(f'Parameters: {params}')
# Event devices
print(f'Event devices: {"; ".join(set([device for device in ds.getEventsAsFrame()["device"]]))}')

### Store metadata in data frame

In [None]:
# create empty data frame
df = pd.DataFrame()

# store metadata in df
df.loc[0,'dataset title'] = ds.title
df.loc[0,'abstract'] = ds.abstract
df.loc[0,'publication date'] = ds.date

# ds.authors is a list
df.loc[0,'first author fullname'] = ds.authors[0].fullname
df.loc[0,'all authors fullnames'] = "; ".join([x.fullname for x in ds.authors])

# authors orcids is a list
df.loc[0,'all authors orcids'] = "; ".join([x.ORCID if x.ORCID else "no ORCID" for x in ds.authors])

df.loc[0,'citation'] = ds.citation
df.loc[0,'dataset DOI'] = ds.doi
df.loc[0,'mean latitude'] = ds.geometryextent["meanLatitude"]
df.loc[0,'mean longitude'] = ds.geometryextent["meanLongitude"]

# ds.params is a dictionary of parameter objects
df.loc[0,'parameters'] = "; ".join([f'{param.name} [{param.unit}]' if param.unit else param.name for param in ds.params.values()])

# event devices
df.loc[0,'device'] = "; ".join(set([device for device in ds.getEventsAsFrame()["device"]]))

In [None]:
df
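
Instead of filling the data frame cell by cell, the metadata can first be collected in a dictionary. The sketch below shows this pattern on a stand-in object with made-up values, so it runs without contacting PANGAEA. `metadata_record` and `fake_ds` are hypothetical names; the attribute names follow those used above (`title`, `doi`, `authors`, `geometryextent`). Missing ORCIDs and an empty `geometryextent` are handled explicitly.

```python
from types import SimpleNamespace


def metadata_record(ds):
    """Collect a few metadata fields of a dataset-like object into a dict."""
    extent = ds.geometryextent or {}
    return {
        "dataset title": ds.title,
        "dataset DOI": ds.doi,
        "first author fullname": ds.authors[0].fullname if ds.authors else None,
        "all authors orcids": "; ".join(a.ORCID or "no ORCID" for a in ds.authors),
        "mean latitude": extent.get("meanLatitude"),
        "mean longitude": extent.get("meanLongitude"),
    }


# Stand-in object with made-up values; a real PanDataSet would be passed instead.
fake_ds = SimpleNamespace(
    title="Example title",
    doi="https://doi.org/10.1594/PANGAEA.000000",
    authors=[SimpleNamespace(fullname="Doe, Jane", ORCID=None)],
    geometryextent={},
)

record = metadata_record(fake_ds)
print(record)
```

A list of such records can be turned into a data frame in one step with `pd.DataFrame([record])`.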

### Save dataframe as file

In [None]:
# Create data directory
data_directory = "PANGAEA_data"
# Check if it already exists before creating it
if not os.path.isdir(data_directory):
    os.mkdir(data_directory)
    
# Save as csv (comma-separated values) and as tab-delimited text
df.to_csv(os.path.join(data_directory, f'PANGAEA_metadata_{ds.id}.csv'), encoding='utf-8')
df.to_csv(os.path.join(data_directory, f'PANGAEA_metadata_{ds.id}.txt'), sep='\t', encoding='utf-8', index=False)
print(f'PANGAEA metadata of {ds.id} saved')

##### Find out more about output formats, e.g. Excel, at https://pandas.pydata.org/pandas-docs/stable/reference/io.html
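
The directory check used above can also be written with `os.makedirs(..., exist_ok=True)`, which creates the directory only if it is missing. The sketch below shows this together with a tab-delimited export using only the standard library. It writes into a temporary directory, and the row content is made up.

```python
import csv
import os
import tempfile

# Made-up metadata row for illustration
rows = [{"dataset title": "Example title", "publication date": "2020-01-01"}]

with tempfile.TemporaryDirectory() as tmp:
    data_directory = os.path.join(tmp, "PANGAEA_data")
    os.makedirs(data_directory, exist_ok=True)
    os.makedirs(data_directory, exist_ok=True)  # second call does nothing

    out_path = os.path.join(data_directory, "PANGAEA_metadata_example.txt")
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0]), delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)

    # Read the file back to check what was written
    with open(out_path, encoding="utf-8") as fh:
        saved_lines = fh.read().splitlines()

print(saved_lines)
```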

## 3.2 Getting metadata for multiple datasets

In [None]:
query = pan.PanQuery("basis:tara location:'arctic ocean'", limit=500)
print(f'There are {query.totalcount} query results.')

In [None]:
# store query results in dataframe
df = pd.DataFrame(query.result)

In [None]:
df.head()

#### Loop over all entries in df and get metadata for each entry

NOTE: As a safety precaution, the number of metadata requests within a given time period is limited. If you exceed the limit, a message like the following appears:

_Received too many (metadata) requests error (429)...waiting 30s -_

For larger requests, expect waiting times or use a different tool, e.g. OAI-PMH (https://wiki.pangaea.de/wiki/OAI-PMH).
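
The message above suggests that pangaeapy already waits by itself when the limit is hit. For your own request code, a generic wait-and-retry pattern could look like the sketch below. It does not show pangaeapy internals: the exception class `TooManyRequests` and the function `call_with_retry` are hypothetical, and the waiting function is replaceable so that the example runs without actually pausing.

```python
import time


class TooManyRequests(Exception):
    """Hypothetical stand-in for an HTTP 429 response."""


def call_with_retry(func, attempts=3, wait_seconds=30, sleep=time.sleep):
    """Call func(); on TooManyRequests wait and try again, up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except TooManyRequests:
            if attempt == attempts:
                raise  # give up after the last attempt
            sleep(wait_seconds)


# Demonstration with a function that fails twice before it succeeds
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TooManyRequests()
    return "metadata"


waits = []  # record the waits instead of sleeping
result = call_with_retry(flaky, sleep=waits.append)
print(result, waits)
```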

In [None]:
for ind,value in df['URI'].items():
    #print(value)
    
    # get metadata 
    ds = PanDataSet(id=value, include_data=False) # just metadata
   
    # store metadata in df in new column
    df.loc[ind,'dataset title'] = ds.title
    df.loc[ind,'abstract'] = ds.abstract
    df.loc[ind,'publication date'] = ds.date

    # ds.authors is a list
    df.loc[ind,'first author fullname'] = ds.authors[0].fullname
    df.loc[ind,'all authors fullnames'] = "; ".join([x.fullname for x in ds.authors])

    # authors orcids is a list
    df.loc[ind,'all authors orcids'] = "; ".join([x.ORCID if x.ORCID else "no ORCID" for x in ds.authors])

    df.loc[ind,'citation'] = ds.citation
    df.loc[ind,'dataset DOI'] = ds.doi
    
    # some datasets contain binaries and no events => ds.geometryextent is empty 
    if bool(ds.geometryextent):
        df.loc[ind,'mean latitude'] = ds.geometryextent["meanLatitude"]
        df.loc[ind,'mean longitude'] = ds.geometryextent["meanLongitude"]

    # ds.params is a dictionary of parameter objects
    df.loc[ind,'parameters'] = "; ".join([f'{param.name} [{param.unit}]' if param.unit else param.name for param in ds.params.values()])

    # some older datasets have no events => ds.getEventsAsFrame() is empty 
    if not ds.getEventsAsFrame().empty:
        # event devices
        df.loc[ind,'device'] = "; ".join(set([device if device else "no device" for device in ds.getEventsAsFrame()["device"]]))

In [None]:
df.head(2)

### Save dataframe as file

In [None]:
# Create data directory
data_directory = "PANGAEA_data"
# Check if it already exists before creating it
if not os.path.isdir(data_directory):
    os.mkdir(data_directory)
    
# Save as csv (comma-separated values) and as tab-delimited text
df.to_csv(os.path.join(data_directory, 'PANGAEA_metadata_df_all.csv'), encoding='utf-8')
df.to_csv(os.path.join(data_directory, 'PANGAEA_metadata_df_all.txt'), sep='\t', encoding='utf-8', index=False)
print('PANGAEA metadata df_all saved')

##### Find out more about output formats, e.g. Excel, at https://pandas.pydata.org/pandas-docs/stable/reference/io.html

## 3.3 Quiz

### 3.3.1 What is the title of this dataset?
https://doi.pangaea.de/10.1594/PANGAEA.937210

In [None]:
# Your solution

### 3.3.2 What is the publication date of this dataset?
https://doi.pangaea.de/10.1594/PANGAEA.863967

In [None]:
# Your solution

### 3.3.3 Did they measure temperature in this dataset?
https://doi.pangaea.de/10.1594/PANGAEA.863975

In [None]:
# Your solution

# 4. Download datasets

## 4.1 Download single dataset
* download open access dataset
* apply authentication token

AIM: How can I download a single dataset right into Python or to my hard drive?

### Search for datasets

In [None]:
# query database for "Deep-sea Sponge Microbiome Project" 
query = pan.PanQuery("Deep-sea Sponge Microbiome Project", limit = 50)
query_results = pd.DataFrame(query.result)
query_results

### Download dataset from PANGAEA
Example dataset: https://doi.pangaea.de/10.1594/PANGAEA.923033

Using the full url

In [None]:
ds = PanDataSet("https://doi.pangaea.de/10.1594/PANGAEA.923033")
# ds contains data and metadata
# see section 3 above on how to get metadata
print(type(ds))

# ds.data is data frame
print(type(ds.data))

# the column header consists of parameter short names without units
ds.data.head(3)

Using the doi

In [None]:
ds = PanDataSet("doi:10.1594/PANGAEA.923033")
ds.data.head(3)

Using the PANGAEA ID

In [None]:
ds = PanDataSet(923033)
ds.data.head(3)
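
All three notations refer to the same dataset. If you need the numeric PANGAEA ID regardless of the notation, it can be extracted with a regular expression. A minimal sketch follows; `pangaea_id` is a hypothetical helper, not part of pangaeapy.

```python
import re


def pangaea_id(identifier):
    """Extract the numeric PANGAEA ID from an integer, a DOI string, or a DOI URL."""
    if isinstance(identifier, int):
        return identifier
    match = re.search(r"PANGAEA\.(\d+)", str(identifier), flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"no PANGAEA ID found in {identifier!r}")
    return int(match.group(1))


for notation in ("https://doi.pangaea.de/10.1594/PANGAEA.923033",
                 "doi:10.1594/PANGAEA.923033",
                 923033):
    print(pangaea_id(notation))
```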

### Translate to long parameter names
By default, the column names are abbreviated parameter names without units.

In [None]:
# Translate short parameter names to long names including unit
def get_long_parameters(ds):
    """Translate short parameter names to long names including unit.

    The column names of ds.data are replaced in place.

    Args:
        ds (PanDataSet): PANGAEA dataset
    """
    ds.data.columns =  [f'{param.name} [{param.unit}]' if param.unit else param.name for param in ds.params.values()]


In [None]:
ds.data.head(2)

In [None]:
get_long_parameters(ds)

In [None]:
ds.data.head(2)
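
`get_long_parameters` assigns the new names by position and changes `ds.data` in place. An alternative is to build a mapping from short to long names first. The sketch below does this for a stand-in parameter dictionary and rests on an assumption: that the keys of `ds.params` are the short column names. The short names used here are made up for illustration.

```python
from types import SimpleNamespace


def long_name(param):
    """Return 'name [unit]' if the parameter has a unit, else just the name."""
    return f"{param.name} [{param.unit}]" if param.unit else param.name


# Stand-in for ds.params (assumption: keys are the short column names)
params = {
    "Depth water": SimpleNamespace(name="DEPTH, water", unit="m"),
    "Species": SimpleNamespace(name="Species", unit=None),
}

rename_map = {short: long_name(param) for short, param in params.items()}
print(rename_map)
```

With such a mapping, `ds.data.rename(columns=rename_map)` returns a renamed copy and leaves the original data frame untouched.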

### What is an authentication token and what is it good for?

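An authentication token identifies your PANGAEA account, so that datasets with access restrictions that you are permitted to access can be downloaded.

Example dataset with access restriction: https://doi.pangaea.de/10.1594/PANGAEA.960280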

Extract from help(pan.pandataset):  
_class PanDataSet(builtins.object)  
        PanDataSet(id=None, paramlist=None, deleteFlag='', enable_cache=False, include_data=True, expand_terms=[], auth_token=None, cache_expiry_days=1)_

Find **your** temporary authentication token at https://pangaea.de/user/
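
To avoid pasting the token into a notebook that you might share, it can be read from an environment variable. This is a minimal sketch: the variable name `PANGAEA_TOKEN` and the helper `read_token` are assumptions made for this example, not a pangaeapy convention.

```python
import os


def read_token(env_var="PANGAEA_TOKEN"):
    """Return the token stored in the environment variable, or None if unset or empty."""
    token = os.environ.get(env_var, "").strip()
    return token or None


my_token = read_token()
if my_token is None:
    print("No token found; datasets with access restrictions may not be available.")
```

The value can then be passed as `auth_token=my_token`; `None` corresponds to the default shown in the signature above.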

In [None]:
my_token = ''
ds = PanDataSet("https://doi.pangaea.de/10.1594/PANGAEA.960280", auth_token=my_token)

In [None]:
ds.data.head()

### Display location of dataset samples
Example dataset: https://doi.pangaea.de/10.1594/PANGAEA.923033

In [None]:
ds = PanDataSet(923033)
get_long_parameters(ds)

# Plot sampling points on interactive plotly map
fig = px.scatter_mapbox(ds.data, lat="LATITUDE", lon="LONGITUDE", 
                        hover_name="Event label", 
                        hover_data=['LATITUDE', 'LONGITUDE', 'DEPTH, water [m]', 'Species', 'Gear'], 
                        zoom=0, height=300)
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

### Save data

In [None]:
# Create data directory
data_directory = "PANGAEA_data"
# Check if it already exists before creating it
if not os.path.isdir(data_directory):
    os.mkdir(data_directory)
# Save as csv (comma-separated values)
ds.data.to_csv(os.path.join(data_directory, f'PANGAEA_dataset_{ds.id}.csv'),index=False)
print(f'PANGAEA dataset {ds.id} saved')

## 4.2 Download multiple datasets

AIM: How can I download multiple datasets right into Python or to my hard drive?

### Perform query

In [None]:
# query database for Thermosalinograph data published from 2020 onwards  
query_string = "device:thermosalinograph citation:year:202*"

# run the query once to get the total number of results
query = pan.PanQuery(query_string, limit = 500)
print(f'There are {query.totalcount} query results.')

# Get all results and combine them in data frame.
df_all = pd.DataFrame()

# loop over all results in steps of 500
for i in np.arange(0,query.totalcount,500):

    # store result of individual step in qs
    qs = pan.PanQuery(query_string, limit = 500, offset=i)
    
    # convert qs result with up to 500 entries to data frame df_qs
    df_qs = pd.DataFrame(qs.result)
    
    # concatenate all individual df_qs into one data frame named df_all
    df_all = pd.concat([df_all,df_qs],ignore_index=True)
    
df_all.head(2)

In [None]:
df_all.tail(2)

### Case 1: Download multiple datasets and treat them as individuals
Note: 
* Data collections and restricted datasets cannot be downloaded

In [None]:
# check whether df_all consists of collections
df_all[df_all['type']=='collection']

In [None]:
# Create dictionary to store dataframes in
data_dict = {}
# Loop over DOIs and download datasets
#for pangaea_doi in df_all['URI']:
for pangaea_doi in df_all['URI'][0:3]: # loop only over first 3 datasets
    print("-" * 40)
    print(f'PANGAEA DOI: {pangaea_doi}')
    # Download dataset and keep a cached copy
    ds = PanDataSet(pangaea_doi, enable_cache=True)
    # Translate to long parameter names
    get_long_parameters(ds)
    print(f'Dataset title: {ds.title}')
    print(ds.data.head(2))
    pangaea_id = pangaea_doi.split('A.')[1]
    data_dict[pangaea_id] = ds.data

In [None]:
list(data_dict)

In [None]:
data_dict['910965'].head()

### Save multiple datasets as individuals

In [None]:
# Create data directory
data_directory = "PANGAEA_data"
if not os.path.isdir(data_directory):
    os.mkdir(data_directory)
    
# Loop over each dataset in the dictionary and save to csv
for key, df in data_dict.items():
    # Save to csv
    df.to_csv(os.path.join(data_directory, f'PANGAEA_dataset_{key}.csv'),index=False)
    print(f'PANGAEA dataset {key} saved')

### Case 2: Download multiple datasets and combine them in 1 data frame

In [None]:
df_all.head()

In [None]:
# Create one data frame for all datasets
data_all = pd.DataFrame()

# Loop over DOIs and download datasets
#for pangaea_doi in df_all['URI']:
for pangaea_doi in df_all['URI'][0:3]: # loop only over first 3 datasets
    print("-" * 40)
    print(f'PANGAEA DOI: {pangaea_doi}')
    
    # Download dataset and keep a cached copy
    ds = PanDataSet(pangaea_doi, enable_cache=True)
    
    # Translate to long parameter names
    get_long_parameters(ds)
    print(f'Dataset title: {ds.title}')
    #print(ds.data.head(2))
    #print(data_all.count())
    data_all = pd.concat([data_all,ds.data])  # add ignore_index=True to renumber the rows

In [None]:
data_all.head(3)

In [None]:
data_all.tail(3)

### Save data frame

In [None]:
# Create data directory
data_directory = "PANGAEA_data"
if not os.path.isdir(data_directory):
    os.mkdir(data_directory)
    
data_all.to_csv(os.path.join(data_directory, 'PANGAEA_dataset_all.csv'),index=False)

## 4.3 Quiz

### 4.3.1 Download this dataset and identify the first event name
https://doi.pangaea.de/10.1594/PANGAEA.947275

In [None]:
# Your solution

### 4.3.2 Download this dataset and identify the number of sampling points >1000m
https://doi.pangaea.de/10.1594/PANGAEA.943624

In [None]:
# Your solution

### 4.3.3 Was there a sampling point in Australia for this dataset?
https://doi.pangaea.de/10.1594/PANGAEA.943455

In [None]:
# Your solution

# 5. Download binary files

## 5.1 Download PANGAEA dataset with image data
Dataset: https://doi.pangaea.de/10.1594/PANGAEA.943250

In [None]:
# Download dataset from PANGAEA
pan_id = 943250
ds = PanDataSet(pan_id)
# Spell out abbreviated parameters
get_long_parameters(ds)
# select three rows; copy them so that a new column can be added later
df = ds.data.iloc[22:25,:].copy()
df.head(2)

## 5.2 Download images

In [None]:
# Create data directory
data_directory = "PANGAEA_data"
if not os.path.isdir(data_directory):
    os.mkdir(data_directory)

# Create file urls
df["image_url"] = [f'https://download.pangaea.de/dataset/{pan_id}/files/{img}' for img in df['Image']]
# Download images
for i, file_url in enumerate(df["image_url"]):
    urlretrieve(file_url, os.path.join(data_directory, df["Image"].iloc[i]))
    print(f'{file_url} downloaded')
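
When a download is interrupted and restarted, files that already exist do not need to be fetched again. The sketch below builds the file URLs with the pattern used above and lists only the files that are still missing locally. `image_url` and `pending_downloads` are hypothetical helpers, the file names are made up, and percent-encoding the file name with `quote` is a precaution added here, not something the notebook requires.

```python
import os
from urllib.parse import quote


def image_url(pan_id, filename):
    """Build the download URL for one file of a dataset, following the pattern above."""
    return f"https://download.pangaea.de/dataset/{pan_id}/files/{quote(filename)}"


def pending_downloads(pan_id, filenames, data_directory):
    """Return (url, target_path) pairs for files not yet present in data_directory."""
    pairs = []
    for name in filenames:
        target = os.path.join(data_directory, name)
        if not os.path.exists(target):
            pairs.append((image_url(pan_id, name), target))
    return pairs


# Made-up file names for illustration
for url, target in pending_downloads(943250, ["example_01.jpg", "example_02.jpg"], "PANGAEA_data"):
    print(f"{url} -> {target}")
```

Each pair can then be passed to `urlretrieve(url, target)`.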