![PANGAEA_Banner.png](https://github.com/pangaea-data-publisher/community-workshop-material/raw/master/banner.png)

# pangaeapy practical
**How to search and download data from PANGAEA**

By: Kathrin Riemann-Campe
Last updated: 2025-05-12

This notebook shows how to retrieve diverse earth and environmental data and their metadata from the [PANGAEA data repository](https://www.pangaea.de) using Python. It uses the [PangaeaPy package](https://pypi.org/project/pangaeapy/) to facilitate the data download.

Run this notebook in:
[GoogleColab](https://colab.research.google.com/github/pangaea-data-publisher/community-workshop-material/blob/master/Python/PANGAEApy_practical/pangaeapy_practical.ipynb): <a target="_blank" href="https://colab.research.google.com/github/pangaea-data-publisher/community-workshop-material/blob/master/Python/PANGAEApy_practical/pangaeapy_practical.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

### Content of this notebook
1. Import libraries
2. Query for data in PANGAEA
3. Get metadata
4. Download datasets
5. Download binary files

## 1. Import libraries

In [None]:
### general libraries
import os
import pandas as pd
import numpy as np
import requests 
from urllib.request import urlopen, urlretrieve

In [None]:
### PANGAEApy
## if you need to install PANGAEApy use pip
!pip install pangaeapy # comment to not install pangaeapy

## if you need to upgrade PANGAEApy use 
# !pip install pangaeapy --upgrade # Uncomment to upgrade pangaeapy

## check version of PANGAEApy
# !pip show pangaeapy

## for details see https://pypi.org/project/pangaeapy/ 

import pangaeapy as pan
from pangaeapy.pandataset import PanDataSet

### PANGAEApy documentation
To call the PANGAEApy documentation uncomment one of the following lines

In [None]:
# help(pan)
### or 
# help(pan.panquery)
### or
# help(pan.pandataset)

In [None]:
### ignore warnings in this script
import warnings
from pandas.errors import SettingWithCopyWarning
warnings.simplefilter(action='ignore', category=(SettingWithCopyWarning))
warnings.simplefilter(action='ignore', category=FutureWarning)

## 2. Query for data in PANGAEA

AIM: What data can I find for a particular topic such as a species, location or author?

This mirrors the query via the [PANGAEA website](https://pangaea.de/)  

**Note:** The search term is enclosed with single quotes '. If your search term includes a blank, use additional double quotes " inside the single quotes.  
Example: _'sea ice'_ vs. _'"sea ice"'_  
Example: _'parameter:Temperature, water method:CTD/Rosette'_ vs. _'parameter:"Temperature, water" method:CTD/Rosette'_
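
The quoting rule above can be sketched as a small helper. This is only an illustration: `quote_phrase` and `fielded` are hypothetical names that are not part of PANGAEApy, and the resulting string would simply be passed on to `pan.PanQuery`.

```python
def quote_phrase(term):
    """Wrap a search term in double quotes if it contains a blank."""
    term = term.strip()
    already_quoted = len(term) >= 2 and term.startswith('"') and term.endswith('"')
    if " " in term and not already_quoted:
        return f'"{term}"'
    return term


def fielded(field, value):
    """Build a field:value search term, quoting the value if needed."""
    return f"{field}:{quote_phrase(value)}"


# Combine two fielded terms into one query string
query_string = " ".join([
    fielded("parameter", "Temperature, water"),
    fielded("method", "CTD/Rosette"),
])
print(query_string)
```

The printed string can then be used as the search term, e.g. `pan.PanQuery(query_string)`.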

### General info on query
Note:
* limit = the maximum number of datasets returned by a query; it cannot exceed 500.
    * default limit = 10
    * To retrieve more than 500 results, use the offset attribute, e.g. pan.PanQuery("Triticum", limit = 500, offset=500)
* type: 
    * collection = dataset collection
    * member = individual dataset which can be part of a dataset collection 
* score: Indicates how well the dataset matched the query term
* help(pan.panquery)
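
As a rough sketch of how limit and offset work together: the offsets needed to page through all results follow directly from the total count. `page_offsets` is a hypothetical helper name, and the total of 1234 is an arbitrary number chosen for illustration.

```python
def page_offsets(totalcount, page_size=500):
    """Return the offsets needed to page through totalcount results."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    return list(range(0, totalcount, page_size))


# 1234 results need three requests, starting at result 0, 500 and 1000
offsets = page_offsets(1234)
print(offsets)
```

Each offset would then go into one query of the form `pan.PanQuery(term, limit=500, offset=offset)`.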

### 2.1 Basic queries

In [None]:
### query database for "Geochemistry"
query = pan.PanQuery('Geochemistry')
## compare with https://pangaea.de/?q=Geochemistry

In [None]:
### query is a PANGAEApy object with built-in attributes
print(query)

In [None]:
### you can ask for the following attributes
## totalcount, query, error, result
print(query.query)

In [None]:
print(f'There are {query.totalcount} query results.')

In [None]:
### put query results into dataframe
query_results = pd.DataFrame(query.result)
print(f'Total length of data frame query_results is {len(query_results)}.')

In [None]:
query_results

#### Query PANGAEA with combinations of keywords
[More information](https://wiki.pangaea.de/wiki/PANGAEA_search) on how to query with keywords

In [None]:
### find datasets that contain both "Geochemistry" and "sediment core"
## remember how to use the different quotes:
## The search term is enclosed with single quotes '. If your search term includes a blank, use additional double quotes " inside the single quotes.
query = pan.PanQuery('Geochemistry "sediment core"')
print(f'There are {query.totalcount} query results.')

#### Optional query terms

In [None]:
### find datasets that contain "Geochemistry" and either "Spitzbergen" or "Svalbard" 
query = pan.PanQuery('Geochemistry AND (Spitzbergen OR Svalbard)')
print(f'There are {query.totalcount} query results.')

#### Uncertain spelling

In [None]:
### find datasets with uncertain spelling of single letter
query = pan.PanQuery('M?ller')
print(f'There are {query.totalcount} query results.')

In [None]:
# finds datasets with "Neogloboquadrina" regardless of your spelling mistake
query = pan.PanQuery('~Neogloboqadrina')
print(f'There are {query.totalcount} query results.') 

#### Specific author

In [None]:
# find datasets of author "Boetius"
query = pan.PanQuery('citation:author:Boetius')
print(f'There are {query.totalcount} query results.') 

#### Within geographical coordinates a.k.a bounding box

In [None]:
### query database for "Geochemistry" and "sediment core" within a certain geolocation a.k.a. bounding box
## bounding box: bbox=(minlon, minlat,  maxlon, maxlat)
query = pan.PanQuery('Geochemistry "sediment core"', limit = 500, bbox=(-60, 50, -10, 70))
print(f'There are {query.totalcount} query results.')
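
Because the order of the four bounding box values is easy to mix up, a simple plausibility check before querying can help. The following is a minimal sketch; `check_bbox` is a hypothetical helper and not part of PANGAEApy. It deliberately does not compare the two longitudes, since a box crossing the date line may have minlon > maxlon.

```python
def check_bbox(bbox):
    """Check a (minlon, minlat, maxlon, maxlat) tuple for plausible values."""
    minlon, minlat, maxlon, maxlat = bbox
    if not (-180 <= minlon <= 180 and -180 <= maxlon <= 180):
        raise ValueError("longitudes must lie between -180 and 180")
    if not (-90 <= minlat <= 90 and -90 <= maxlat <= 90):
        raise ValueError("latitudes must lie between -90 and 90")
    if minlat > maxlat:
        raise ValueError("minlat must not exceed maxlat")
    return (minlon, minlat, maxlon, maxlat)


# the bounding box used in the query above passes the check
bbox = check_bbox((-60, 50, -10, 70))
print(bbox)
```

The checked tuple can be passed on unchanged, e.g. `pan.PanQuery(term, bbox=bbox)`.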

### 2.2 How to query PANGAEA without result limitations
* A single query returns at most 500 datasets.  
* Retrieve datasets in chunks of 500 via offset option.  
* Put all datasets in one data frame.

In [None]:
### query database for project "PAGES_C-PEAT" 
query = pan.PanQuery('project:label:PAGES_C-PEAT', limit = 500)
print(f'There are {query.totalcount} query results.')
print(f'Currently query consists of {len(query.result)} entries.')

In [None]:
### Get all results and combine them in data frame.

### create empty data frame
df_query_results_all = pd.DataFrame()

### loop over all results in steps of 500
for i in np.arange(0,query.totalcount,500):
    # store result of individual step in qs
    qs = pan.PanQuery('project:label:PAGES_C-PEAT', limit = 500, offset=i)
    # convert qs result with 500 entries to data frame df_qs
    df_qs = pd.DataFrame(qs.result)
    # concatenate all individual df_qs into one data frame named df_query_results_all
    df_query_results_all = pd.concat([df_query_results_all,df_qs],ignore_index=True)
    
print(f'There are {query.totalcount} query results.')
print(f'df_query_results_all consists of {len(df_query_results_all)} results.')

In [None]:
### show first 3 lines
df_query_results_all.head(3)

In [None]:
### show last 3 lines
df_query_results_all.tail(3)

### 2.3 Quiz
[More information](https://wiki.pangaea.de/wiki/PANGAEA_search) on how to query with keywords

#### 2.3.1 How many datasets contain "geological investigations"?
Hint: "geological investigations" **not** "geological" and "investigations"

In [None]:
# Your solution

In [None]:
### solution
query = pan.PanQuery('"geological investigations"')
print(query.totalcount)

#### 2.3.2 How many datasets contain "geological investigations" in the title only?

In [None]:
# Your solution

In [None]:
### solution
query = pan.PanQuery('citation:title:"geological investigations"')
print(query.totalcount)

#### 2.3.3 How many datasets measured "Temperature, water" using a CTD/Rosette?

In [None]:
# Your solution

In [None]:
### solution
query = pan.PanQuery('parameter:"Temperature, water" method:CTD/Rosette')
print(query.totalcount)

## 3. Get metadata
PanDataSet gives access to a long list of metadata. 
Find a comprehensive list in the internal documentation  
_help(pan.pandataset)_    

or in this notebook full of examples: [pangaeapy_detailed_metadata_search.ipynb](https://github.com/pangaea-data-publisher/community-workshop-material/tree/master/Python/PANGAEApy_practical/pangaeapy_detailed_metadata_search.ipynb)  

### 3.1 Get metadata of individual dataset

##### Example dataset from PANGAEA https://doi.pangaea.de/10.1594/PANGAEA.918423

In [None]:
### 3 ways to ask for dataset metadata
## complete URL
# ds = PanDataSet('https://doi.pangaea.de/10.1594/PANGAEA.918423', include_data=False) 
## just URI
# ds = PanDataSet('doi:10.1594/PANGAEA.918423', include_data=False)
## just PANGAEA id number of dataset
ds = PanDataSet(918423, include_data=False) 

#### Basic metadata retrieval

In [None]:
### Title
ds.title

In [None]:
### Authors
print(f'Authors: {"; ".join([x.fullname for x in ds.authors])}')

In [None]:
### Full Reference
ds.citation

In [None]:
### Geolocation
print(f'Latitude: {ds.geometryextent["meanLatitude"]}')
print(f'Longitude: {ds.geometryextent["meanLongitude"]}')

In [None]:
### Parameters
params = "; ".join([f'{param.name} [{param.unit}]' if param.unit else param.name for param in ds.params.values()])
print(f'Parameters: {params}')

In [None]:
### Event as dataframe
ds.getEventsAsFrame()

In [None]:
### Event as PanEvent object
print(ds.events)

for event in ds.events:
    print(event.label)
    print(event.method.name)
    print(event.basis.name)

#### Store metadata in data frame

In [None]:
### create empty data frame
df = pd.DataFrame()

### store metadata in df
df.loc[0,'dataset title'] = ds.title
df.loc[0,'abstract'] = ds.abstract

### ds.authors is a list
df.loc[0,'first author fullname'] = ds.authors[0].fullname
df.loc[0,'all authors fullnames'] = "; ".join([x.fullname for x in ds.authors])

### authors orcids is a list
df.loc[0,'all authors orcids'] = "; ".join([x.ORCID if x.ORCID else "no ORCID" for x in ds.authors])

df.loc[0,'citation'] = ds.citation
df.loc[0,'dataset DOI'] = ds.doi
df.loc[0,'west bound longitude'] = ds.geometryextent["westBoundLongitude"]
df.loc[0,'east bound longitude'] = ds.geometryextent["eastBoundLongitude"]
df.loc[0,'south bound latitude'] = ds.geometryextent["southBoundLatitude"]
df.loc[0,'north bound latitude'] = ds.geometryextent["northBoundLatitude"]
### parameters is a list
df.loc[0,'parameters'] = "; ".join([f'{param.name} [{param.unit}]' if param.unit else param.name for param in ds.params.values()])

### event devices
df.loc[0,'label'] = "; ".join(set([device for device in ds.getEventsAsFrame()["label"]]))

In [None]:
df

#### Save dataframe as file

In [None]:
### Create data directory
data_directory = "PANGAEA_data"
# Check if it already exists before creating it
if not os.path.isdir(data_directory):
    os.mkdir(data_directory)
    
### Save as csv (comma separated values) and as tab-delimited text file
df.to_csv(os.path.join(data_directory, f'PANGAEA_metadata_{ds.id}.csv'), encoding='utf-8', index=False)
df.to_csv(os.path.join(data_directory, f'PANGAEA_metadata_{ds.id}.txt'), sep='\t', encoding='utf-8', index=False)
print(f'PANGAEA metadata of "{ds.title}" saved')

##### find out more about output formats e.g. excel at https://pandas.pydata.org/pandas-docs/stable/reference/io.html

### 3.2 Getting metadata for multiple datasets

In [None]:
### remember the limit!
query = pan.PanQuery("basis:tara location:'arctic ocean'", limit=5)
print(f'There are {query.totalcount} query results.')

In [None]:
# store query results in dataframe
df = pd.DataFrame(query.result)

In [None]:
df.head()

#### Loop over all entries in df and get metadata for each entry
NOTE: As a safety precaution, the number of metadata requests is limited for a specific time period. 

_Received too many (metadata) requests error (429)...waiting 30s -_

If you have larger requests, prepare to wait or use a different tool e.g. OAI-PMH (https://wiki.pangaea.de/wiki/OAI-PMH).
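
The general wait-and-retry pattern behind this message can be sketched as follows. This is not how PANGAEApy is implemented internally; `TooManyRequests`, `call_with_wait` and `flaky_request` are hypothetical names, and the request is simulated so that the example runs without network access and without actually waiting.

```python
import time


class TooManyRequests(Exception):
    """Hypothetical stand-in for a 'too many requests' (429) failure."""


def call_with_wait(func, retries=3, wait_seconds=30, sleep=time.sleep):
    """Call func; after a TooManyRequests error wait and try again."""
    for attempt in range(retries + 1):
        try:
            return func()
        except TooManyRequests:
            if attempt == retries:
                raise  # give up after the last allowed attempt
            sleep(wait_seconds)


# simulated request that is rejected twice before it succeeds
calls = {"n": 0}

def flaky_request():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TooManyRequests()
    return "metadata"


# record the waiting times instead of really sleeping
waits = []
result = call_with_wait(flaky_request, sleep=waits.append)
print(result, waits)
```

Passing the sleep function as an argument keeps the sketch testable; in real use the default `time.sleep` would pause the loop.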

In [None]:
### Create one data frame for all datasets
data_all = pd.DataFrame()

### loop over all datasets in df
for ind,value in df['URI'].items():
    
    ## use PanDataSet to get the metadata only (include_data=False skips the data table)
    ds = PanDataSet(value, include_data=False)

    print(ind, ds.doi)

    ## put metadata into df in new columns
    df.loc[ind,'Title'] = ds.title
    df.loc[ind,'Publication date'] = ds.date
    df.loc[ind,'Authors'] = "; ".join([x.fullname for x in ds.authors])
    df.loc[ind,'Citation'] = ds.citation
    df.loc[ind,'DOI'] = ds.doi
    df.loc[ind,'Parameters'] = "; ".join([f'{param.name} [{param.unit}]' if param.unit else param.name for param in ds.params.values()])
    if ds.events:
        df.loc[ind,'Event'] = "; ".join([x.label for x in ds.events])


In [None]:
df.head(2)

In [None]:
### drop columns no longer needed
df = df.drop(['URI','score','html','type','position'],axis=1)
df

#### Save dataframe as file

In [None]:
### Create data directory
data_directory = "PANGAEA_data"
### Check if it already exists before creating it
if not os.path.isdir(data_directory):
    os.mkdir(data_directory)
    
### Save as csv (comma separated values) and as tab-delimited text file
df.to_csv(os.path.join(data_directory, f'PANGAEA_metadata_df_all.csv'), encoding='utf-8', index=False)
df.to_csv(os.path.join(data_directory, f'PANGAEA_metadata_df_all.txt'), sep='\t', encoding='utf-8', index=False)
print(f'PANGAEA metadata saved')

##### find out more about output formats e.g. excel at https://pandas.pydata.org/pandas-docs/stable/reference/io.html

### 3.3 Quiz

#### 3.3.1 What is the title of this dataset?
https://doi.pangaea.de/10.1594/PANGAEA.937210

In [None]:
# Your solution

In [None]:
### solution
ds = PanDataSet("https://doi.pangaea.de/10.1594/PANGAEA.937210", include_data=False)
ds.title

#### 3.3.2 What is the name of the second author of this dataset?
https://doi.pangaea.de/10.1594/PANGAEA.863967

In [None]:
# Your solution

In [None]:
### solution
ds = PanDataSet("https://doi.pangaea.de/10.1594/PANGAEA.863967", include_data=False)
ds.authors[1].fullname

#### 3.3.3 Did they measure temperature in this dataset?
https://doi.pangaea.de/10.1594/PANGAEA.863975

In [None]:
# Your solution

In [None]:
### solution
ds = PanDataSet("https://doi.pangaea.de/10.1594/PANGAEA.863975")

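### ds.params is a dictionary; take the long parameter names from its values
list_params = [param.name for param in ds.params.values()]
print(list_params)

### compare case-insensitively to also match e.g. "Temperature, water"
if any('temperature' in name.lower() for name in list_params):
    print('temperature included')
else:
    print('no temperature')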

# print("".join(40*["-"]))

# list_parameter = ("; ".join([f'{param.name} [{param.unit}]' if param.unit else param.name for param in ds.params.values()]))
# print(list_parameter)

# if 'temperature' in list_parameter:
#     print('temperature included')
# else:
#     print('no temperature')

## 4. Download datasets

### 4.1 Download single dataset
* download open access dataset
* apply authentication token

#### Search for datasets

In [None]:
# query database for "Deep-sea Sponge Microbiome Project" 
query = pan.PanQuery('"Deep-sea Sponge Microbiome Project"', limit = 50)
query_results = pd.DataFrame(query.result)
query_results

#### Download dataset from PANGAEA
Example dataset: https://doi.pangaea.de/10.1594/PANGAEA.923033

In [None]:
ds = PanDataSet("https://doi.pangaea.de/10.1594/PANGAEA.923033")
### ds contains data and metadata
### see section 3 on how to get metadata
print(type(ds))

### ds.data is data frame
print(type(ds.data))

### dataset header consists of parameter short names without unit
ds.data.head(3)

#### Translate to long parameter names
By default, the column names are abbreviated parameter names without units.

In [None]:
# Translate short parameters names to long names including unit
def get_long_parameters(ds):
    """Translate short parameters names to long names including unit

    Args:
        ds (PANGAEA dataset): PANGAEA dataset
    """
    ds.data.columns =  [f'{param.name} [{param.unit}]' if param.unit else param.name for param in ds.params.values()]


In [None]:
ds.data.head(2)

In [None]:
get_long_parameters(ds)

In [None]:
ds.data.head(2)

#### What is an authentication token and what is it good for?

Example dataset with access restriction: https://doi.pangaea.de/10.1594/PANGAEA.960280

extract from help(pan.pandataset)  
_class PanDataSet(builtins.object)  
        PanDataSet(id=None, paramlist=None, deleteFlag='', enable_cache=False, include_data=True, expand_terms=[], auth_token=None, cache_expiry_days=1)_

Find **your** temporary authentication token at https://pangaea.de/user/
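
To avoid typing the token into a notebook that might be shared, one option is to read it from an environment variable. This is a minimal sketch: the variable name `PANGAEA_TOKEN` and the helper `read_token` are assumptions made for illustration, not a PANGAEApy convention.

```python
import os


def read_token(env_name="PANGAEA_TOKEN"):
    """Return the token stored in an environment variable, or None if unset."""
    token = os.environ.get(env_name, "").strip()
    return token or None


my_token = read_token()
print("token found" if my_token else "no token set")
```

Returning `None` for a missing token matches the default `auth_token=None` shown in the signature above, so the value can be passed on as `PanDataSet(..., auth_token=my_token)` in both cases.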

In [None]:
my_token = ''
ds = PanDataSet("https://doi.pangaea.de/10.1594/PANGAEA.960280", auth_token=my_token)

In [None]:
ds.data.head()

#### Save data

In [None]:
### Create data directory
data_directory = "PANGAEA_data"
### Check if it already exists before creating it
if not os.path.isdir(data_directory):
    os.mkdir(data_directory)
### Save as csv (comma separated values)
ds.data.to_csv(os.path.join(data_directory, f'PANGAEA_dataset_{ds.id}.csv'),index=False)
print(f'PANGAEA dataset "{ds.title}" saved')

### 4.2 Download multiple datasets

#### Perform query

In [None]:
# query database for Thermosalinograph data published from 2020 onwards  
search_term = 'device:thermosalinograph citation:year:202* project:"dam underway"'

# run the query once to find out how many results there are
query = pan.PanQuery(search_term, limit = 500)

# Get all results and combine them in data frame.
df_all = pd.DataFrame()

# loop over all results in steps of 500
for i in np.arange(0,query.totalcount,500):

    # store result of individual step in qs
    qs = pan.PanQuery(search_term, limit = 500, offset=i)
    
    # convert qs result with 500 entries to data frame df_qs
    df_qs = pd.DataFrame(qs.result)
    
    # concatenate all individual df_qs into one data frame named df_all
    df_all = pd.concat([df_all,df_qs],ignore_index=True)
    
pd.concat([df_all.head(2),df_all.tail(2)])

#### 4.2.1 Download multiple datasets and treat them as individuals
Note: Data collections and restricted datasets cannot be downloaded
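
Collections can be filtered out before the download loop. The sketch below works on plain records and assumes that each query result is a dictionary with a `type` key, as the columns of the query data frame suggest. The records and the helper name `downloadable_records` are made up for illustration. Restricted datasets cannot be recognised this way.

```python
def downloadable_records(records):
    """Keep only records that are not dataset collections."""
    return [rec for rec in records if rec.get("type") != "collection"]


# made-up records that only mimic the structure of query results
example_records = [
    {"URI": "example-uri-1", "type": "member"},
    {"URI": "example-uri-2", "type": "collection"},
    {"URI": "example-uri-3", "type": "member"},
]

kept = downloadable_records(example_records)
print([rec["URI"] for rec in kept])
```

With a data frame, the equivalent filter would be `df_all[df_all['type'] != 'collection']`.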

In [None]:
# check whether df_all contains collections
df_all[df_all['type']=='collection']

In [None]:
### Create dictionary to store dataframes in
data_dict = {}
### Loop over DOIs and download datasets
# for pangaea_doi in df_all['URI']:
for pangaea_doi in df_all['URI'][0:3]: # loop only over first 3 datasets
    print("".join(40*["-"]))
    print(f'PANGAEA ID: {pangaea_doi}')
    ### Cache
    ds = PanDataSet(pangaea_doi, enable_cache=True)
    ### Translate to long parameter names
    get_long_parameters(ds)
    print(f'Dataset title: {ds.title}')
    print(ds.data.head(2))
    pangaea_id = pangaea_doi.split('A.')[1]
    data_dict[pangaea_id] = ds.data
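
Splitting the DOI at `'A.'` works for the DOIs used here. A slightly more explicit alternative is a regular expression, sketched below. `pangaea_id_from_doi` is a hypothetical helper, and the sketch assumes that the identifier ends in `PANGAEA.` followed by digits, as in the examples of this notebook.

```python
import re


def pangaea_id_from_doi(doi):
    """Extract the numeric dataset id from a PANGAEA DOI or DOI URL."""
    match = re.search(r"PANGAEA\.(\d+)", doi)
    if match is None:
        raise ValueError(f"no PANGAEA id found in {doi!r}")
    return match.group(1)


# both notations of the same dataset give the same id
print(pangaea_id_from_doi("doi:10.1594/PANGAEA.918423"))
print(pangaea_id_from_doi("https://doi.pangaea.de/10.1594/PANGAEA.918423"))
```

Raising an error for unexpected input makes a malformed DOI visible immediately instead of producing a wrong dictionary key.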

In [None]:
list(data_dict)

In [None]:
### show the first downloaded dataset, whatever its id is
data_dict[list(data_dict)[0]].head()

#### Save multiple datasets as individuals

In [None]:
# Create data directory
data_directory = "PANGAEA_data"
if not os.path.isdir(data_directory):
    os.mkdir(data_directory)
    
# Loop over each dataset in the dictionary and save to csv
for key, df_dataset in data_dict.items():
    # Save to csv
    df_dataset.to_csv(os.path.join(data_directory, f'PANGAEA_dataset_{key}.csv'),index=False)
    print(f'PANGAEA dataset {key} saved')

#### 4.2.2 Download and combine data and metadata of query results

NOTE: As a safety precaution, the number of metadata requests is limited for a specific time period. 

_Received too many (metadata) requests error (429)...waiting 30s -_

If you have larger requests, prepare to wait or use a different tool e.g. OAI-PMH (https://wiki.pangaea.de/wiki/OAI-PMH).

In [None]:
### Create one data frame for all datasets
data_all = pd.DataFrame()

### loop over all datasets in df_all
# for ind,value in df_all['URI'].items():
### only download first 3 results during workshop
for ind,value in df_all['URI'][0:3].items(): 
    
    ## use PanDataSet to get metadata and data and put them into 2 different dataframes
    ds = PanDataSet(value)

    print(ind, ds.doi)

    ## put metadata into df_all in new columns
    df_all.loc[ind,'Title'] = ds.title
    df_all.loc[ind,'Publication date'] = ds.date
    df_all.loc[ind,'Authors'] = "; ".join([x.fullname for x in ds.authors])
    df_all.loc[ind,'Citation'] = ds.citation
    df_all.loc[ind,'DOI'] = ds.doi
    df_all.loc[ind,'Parameters'] = "; ".join([f'{param.name} [{param.unit}]' if param.unit else param.name for param in ds.params.values()])
    if ds.events:
        df_all.loc[ind,'Event'] = "; ".join([x.label for x in ds.events])
    
    ### Translate default short parameter names to long parameter names and add unit
    get_long_parameters(ds)
    
    ### take the data table of this query result and tag every row with the dataset DOI
    df_data = ds.data
    df_data['DOI'] = ds.doi

    ### combine all datasets into one dataframe
    data_all = pd.concat([data_all,df_data], ignore_index=True)


In [None]:
### show first two rows of metadata in df_all
df_all.head(2)

In [None]:
### show first two rows of data in data_all
data_all.head(2)

#### Save data

In [None]:
### Create data directory
data_directory = "PANGAEA_data"
### Check if it already exists before creating it
if not os.path.isdir(data_directory):
    os.mkdir(data_directory)
    
### Save as tab delimited text file
## set filenames
filename1 = 'thermosalinograph_metadata.txt'
filename2 = 'thermosalinograph_data.txt'

df_all.to_csv(os.path.join(data_directory, filename1), sep='\t', encoding='utf-8', index=False)
data_all.to_csv(os.path.join(data_directory, filename2), sep='\t', encoding='utf-8', index=False)

### 4.3 Quiz

#### 4.3.1 Download this dataset and identify the first event name
https://doi.pangaea.de/10.1594/PANGAEA.947275

In [None]:
# Your solution

In [None]:
### solution
ds = PanDataSet("https://doi.pangaea.de/10.1594/PANGAEA.947275")
print(ds.data.head(3))
print(f'The first event name is {ds.data.loc[0]["Event"]}')

#### 4.3.2 Download this dataset and identify the number of sampling points >1000m
https://doi.pangaea.de/10.1594/PANGAEA.943624

In [None]:
# Your solution

In [None]:
### solution
# Download and store data table in df
ds = PanDataSet("https://doi.pangaea.de/10.1594/PANGAEA.943624")
df = ds.data
# Filter sampling points above 1000m
df_1000 = df[df["Elevation"]>1000]
#print(df_1000.head(3))
# Count the number of sampling points
print(f'There are {len(df_1000)} sampling points above 1000m')


## 5. Download binary files

### 5.1 Download PANGAEA dataset with image data
Dataset: https://doi.pangaea.de/10.1594/PANGAEA.943250

In [None]:
# Download dataset from PANGAEA
pan_id = 943250
ds = PanDataSet(pan_id)
# Spell out abbreviated parameters
get_long_parameters(ds)
df = ds.data.iloc[22:25,:]
df.head(2)

### 5.2 Download images

In [None]:
# Create data directory
data_directory = "PANGAEA_data"
if not os.path.isdir(data_directory):
    os.mkdir(data_directory)

### example to download only 1 image at specific time step
## .copy() so that the new column is added to an independent data frame
df = ds.data[ds.data['DATE/TIME']=='2019-10-01 23:15:06'].copy()

### Create file urls
df["image_url"] = [f'https://download.pangaea.de/dataset/{ds.id}/files/{img}' for img in df['Image']]

### download images
for image_name, file_url in zip(df['Image'], df['image_url']):
    response = requests.get(file_url)
    ## stop if the download failed, e.g. file not found
    response.raise_for_status()
    ### save image
    with open(os.path.join(data_directory, image_name), 'wb') as f:
        f.write(response.content)
    print(f'{image_name} downloaded')