![PANGAEA CWS Banner](https://github.com/pangaea-data-publisher/community-workshop-material/raw/master/banner.png)

# **How to retrieve data from PANGAEA**

Version: 0.1.0<br>
By: Michael Oellermann, Kathrin Riemann-Campe<br>
Last updated: 2023-05-12

This notebook shows how to retrieve diverse earth and environmental data, together with its metadata, from the [PANGAEA data repository](https://www.pangaea.de) using Python. It uses the [PangaeaPy package](https://github.com/pangaea-data-publisher/pangaeapy) to facilitate the data download.

Run this notebook in:
* [Google Colab](https://colab.research.google.com/github/pangaea-data-publisher/community-workshop-material/blob/master/Python/Get_pangaea_data/get_pangaea_data.ipynb): <a target="_blank" href="https://colab.research.google.com/github/pangaea-data-publisher/community-workshop-material/blob/master/Python/Get_pangaea_data/get_pangaea_data.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# 1. Import libraries

In [None]:
import os
import pandas as pd
import openpyxl

# Plotting
from matplotlib import pyplot as plt
import seaborn as sns
import plotly.express as px

# Pangaeapy
!pip install pangaeapy # Installs pangaeapy; comment out this line if it is already installed
import pangaeapy as pan
from pangaeapy.pandataset import PanDataSet

# Web scraping
from bs4 import BeautifulSoup
from urllib.request import urlopen, urlretrieve
import json
import requests
from pandas import json_normalize

# To access GenBank records
!pip install biopython # Installs the biopython library; comment out if already installed
from Bio import Entrez
from Bio import SeqIO
Entrez.email = "your_email@example.com"

# 2. Query for data in PANGAEA

AIM: What data can I find for a particular topic such as a species, location or author?

This mirrors the query via the [PANGAEA website](https://pangaea.de/)

## 2.1 Simple query
Note:
* limit: the maximum number of datasets returned by a single query is 500.
    * To retrieve more than 500 results, use the offset argument, e.g. pan.PanQuery("Triticum", limit = 500, offset=500)
* type: 
    * parent = data collection
    * child = data set as part of a data collection 
* score: Indicates how well the dataset matched the query term
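
The offset arithmetic behind the 500-result limit can be sketched without any network access. `page_offsets` is a hypothetical helper, not part of pangaeapy; it only lists the offset values that would be passed to successive `pan.PanQuery(..., limit = 500, offset = ...)` calls.

```python
def page_offsets(total_count, page_size=500):
    """Return the offsets needed to page through `total_count` results."""
    return list(range(0, total_count, page_size))

# 1200 results need three requests, starting at these offsets
print(page_offsets(1200))
```

Each offset corresponds to one request, so the length of the list is also the number of requests needed.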

In [None]:
# Query database for "Triticum"
query = pan.PanQuery("Triticum")
print(f'There have been {query.totalcount} query results')
# Save query as dataframe
query_results = pd.DataFrame(query.result)
query_results.head(4)

## 2.2 More complex queries

[More information](https://wiki.pangaea.de/wiki/PANGAEA_search) on how to query with keywords


Multiple query terms

In [None]:
# Finds datasets that contain both "marine" and "geology"
query = pan.PanQuery("marine geology")
print(f'There have been {query.totalcount} query results')

Optional query terms

In [None]:
# Find datasets that contain "Globigerina" and either "falconensis" or "bulloides" 
query = pan.PanQuery("Globigerina AND (falconensis OR bulloides)")
print(f'There have been {query.totalcount} query results')

Uncertain spelling

In [None]:
# Finds datasets with uncertain spelling of single letter
query = pan.PanQuery("Gl?bigerina")
print(f'There have been {query.totalcount} query results')

In [None]:
# Finds datasets with "Neogloboquadrina" regardless of your spelling mistake
query = pan.PanQuery("~Neogloboqadrina")
print(f'There have been {query.totalcount} query results') 

Specific author

In [None]:
# Finds datasets of author "Herzschuh"
query = pan.PanQuery("citation:author:Herzschuh")
print(f'There have been {query.totalcount} query results') 
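
Query strings like the ones above can also be assembled programmatically. The sketch below uses a hypothetical helper, `build_query`, that only concatenates strings; it assumes that a `citation:author:` field can be combined with other terms via `AND`, which is not shown in this notebook and should be checked against the PANGAEA search documentation.

```python
def build_query(all_of=(), any_of=(), author=None):
    """Combine search terms into a single query string."""
    parts = list(all_of)
    if any_of:
        # Optional terms are grouped in parentheses and joined with OR
        parts.append("(" + " OR ".join(any_of) + ")")
    if author:
        # Assumption: the author field can be combined with other terms
        parts.append(f"citation:author:{author}")
    return " AND ".join(parts)

print(build_query(all_of=["Globigerina"], any_of=["falconensis", "bulloides"]))
```

The returned string can then be passed to `pan.PanQuery` like any handwritten query.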

Within geolocation

In [None]:
# Query database for "Globigerina bulloides" within a certain geolocation
query = pan.PanQuery("Globigerina bulloides", limit = 500, bbox=(17.7, 67.7, 21, 69))
print(f'There have been {query.totalcount} query results')
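
A bounding box with swapped or out-of-range coordinates silently returns the wrong datasets, so it can help to check it first. This sketch assumes that `bbox` follows the order (minimum longitude, minimum latitude, maximum longitude, maximum latitude); `check_bbox` is a hypothetical helper and not part of pangaeapy.

```python
def check_bbox(bbox):
    """Validate a (min_lon, min_lat, max_lon, max_lat) bounding box."""
    min_lon, min_lat, max_lon, max_lat = bbox
    if not (-180 <= min_lon <= 180 and -180 <= max_lon <= 180):
        raise ValueError("Longitude must be between -180 and 180")
    if not (-90 <= min_lat <= max_lat <= 90):
        raise ValueError("Latitude must be between -90 and 90, with min_lat <= max_lat")
    return bbox

print(check_bbox((17.7, 67.7, 21, 69)))
```

The longitudes are deliberately not compared with each other, because a box that crosses the antimeridian has a minimum longitude larger than its maximum longitude.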

## 2.3 Queries exceeding 500 results

### Function to query PANGAEA beyond the 500-result limit

In [None]:
# Function to query PANGAEA for datasets
# A single request returns at most 500 datasets, so larger queries are paged with the offset
def query_pangaea(query_term = "", limit = 500, exclude_collection = True):
    query = pan.PanQuery(query_term, limit = min(limit, 500) if limit else 500)
    print(f'{query.totalcount} total query results. Query limited to {limit} results.')

    # Save query as dataframe
    query_results = pd.DataFrame(query.result)

    # Number of results to retrieve: everything, or at most `limit` if a limit is set
    n_results = int(query.totalcount)
    if limit:
        n_results = min(n_results, limit)

    # If more than 500 results are needed, increase the offset to page through them
    for offset in range(500, n_results, 500):
        # New query with increased offset, not exceeding the requested number of results
        query = pan.PanQuery(query_term, offset=offset, limit = min(500, n_results - offset))
        # Attach further query results
        query_results = pd.concat([query_results, pd.DataFrame(query.result)])

    # Exclude data collection (parents) if true
    if exclude_collection:        
        query_results = query_results[query_results.type == "child"]
        print(f'{len(query_results)} child datasets extracted')

    # Delete redundant columns
    query_results = query_results.drop(["html", "position"], axis = 1)

    # Add query term to table
    query_results["query_term"] = query_term
    
    return query_results.reset_index(drop=True)

Perform query

In [None]:
# Perform PANGAEA query
query_term = "citation:author:Herzschuh"
query_results = query_pangaea(query_term, limit = 50, exclude_collection=True)
query_results.head(2)

### Add Pangaea ID (optional for labeling)

In [None]:
# Function to extract and add PANGAEA ID to query result dataframe (modifies it in place)
def add_pangaea_id(query_df):
    # Extract the PANGAEA dataset ID from the end of each URI, unless already present
    if "pangaea_id" not in query_df.columns:
        query_df.insert(0, "pangaea_id", [int(uri.split(".")[-1]) for uri in query_df.URI])

# Add pangaea dataset ids
add_pangaea_id(query_results)
query_results.head(2)

## 2.4. Quiz

[More information](https://wiki.pangaea.de/wiki/PANGAEA_search) on how to query with keywords

### 2.4.1 How many datasets contain "Octopus vulgaris"?

In [None]:
# Your solution

### 2.4.2 How many datasets contain "Gadus morhua" in the title only?

In [None]:
# Your solution

### 2.4.3 How many datasets did the author Hannes Grobe publish?

In [None]:
# Your solution

### 2.4.4 How many datasets measured "Temperature, water" using a CTD/Rosette?

In [None]:
# Your solution

# 3. Download datasets

## 3.1 Download single dataset

AIM: How can I download a single dataset directly into Python or to my hard drive?

### Search for datasets

In [None]:
# Perform PANGAEA query
query_term = "Deep-sea Sponge Microbiome Project"
query_results = query_pangaea(query_term, limit = 50, exclude_collection=True)
# Add pangaea dataset ids
add_pangaea_id(query_results)
query_results.head(2)

### Download dataset from PANGAEA
Dataset: https://doi.pangaea.de/10.1594/PANGAEA.923033

Using the full url

In [None]:
ds = PanDataSet("https://doi.pangaea.de/10.1594/PANGAEA.923033")
print(ds.data.head(3))

Using the doi

In [None]:
ds = PanDataSet("doi:10.1594/PANGAEA.923033")
print(ds.data.head(3))

Using the PANGAEA ID

In [None]:
ds = PanDataSet(923033)
print(ds.data.head(3))
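
The three identifier forms above all end in the same numeric ID. As a minimal sketch, the hypothetical helper `to_pangaea_id` (not part of pangaeapy) reduces any of them to that number, which is convenient for file names and dictionary keys.

```python
import re

def to_pangaea_id(identifier):
    """Return the numeric PANGAEA ID from a URL, DOI, or plain ID."""
    match = re.search(r"(\d+)$", str(identifier).strip())
    if match is None:
        raise ValueError(f"No PANGAEA ID found in {identifier!r}")
    return int(match.group(1))

for identifier in ["https://doi.pangaea.de/10.1594/PANGAEA.923033",
                   "doi:10.1594/PANGAEA.923033", 923033]:
    print(to_pangaea_id(identifier))
```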

### Translate to long parameter names
By default, parameter names are abbreviated and shown without units. The function below replaces them with the long names, including the unit.

In [None]:
# Translate short parameters names to long names including unit
def get_long_parameters(ds):
    """Translate short parameters names to long names including unit

    Args:
        ds (PANGAEA dataset): PANGAEA dataset
    """
    ds.data.columns =  [f'{param.name} [{param.unit}]' if param.unit else param.name for param in ds.params.values()]

print(ds.data.columns[:10])
get_long_parameters(ds)
ds.data.columns[:10]
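
The label formatting used in `get_long_parameters` can be tried without downloading anything. In this sketch, `Param` is a made-up stand-in that carries only the `name` and `unit` attributes used above, and the dictionary keys are invented short names.

```python
from collections import namedtuple

# Stand-in for a pangaeapy parameter; only the two attributes used above
Param = namedtuple("Param", ["name", "unit"])

def long_name(param):
    """Format a parameter as 'name [unit]', or just 'name' without a unit."""
    return f"{param.name} [{param.unit}]" if param.unit else param.name

# Made-up parameters, keyed by short name
params = {"Depth water": Param("DEPTH, water", "m"), "Species": Param("Species", None)}
print([long_name(param) for param in params.values()])
```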

### Display location of dataset samples

In [None]:
# Plot sampling points on interactive plotly map
fig = px.scatter_mapbox(ds.data, lat="LATITUDE", lon="LONGITUDE", 
                        hover_name="Event label", 
                        hover_data=['LATITUDE', 'LONGITUDE', 'DEPTH, water [m]', 'Species', 'Gear'], 
                        zoom=0, height=300)
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

### Save data

In [None]:
# Create data folder
data_folder = "pangaea_data"
# Check if it already exists before creating it
if not os.path.isdir(data_folder):
    os.mkdir(data_folder)
# Save to csv
ds.data.to_csv(os.path.join(data_folder, f'Pangaea_dataset_{ds.id}.csv'))
print(f'PANGAEA dataset {ds.id} saved')

## 3.2 Download multiple datasets

AIM: How can I download multiple datasets directly into Python or to my hard drive?

### Perform query

In [None]:
# Perform PANGAEA query
query_term = "Deep-sea Sponge Microbiome Project"
query_results = query_pangaea(query_term, limit = 50, exclude_collection=True)
query_results.head(2)

### Download multiple datasets
Note: 
* Data collections and restricted datasets cannot be downloaded

In [None]:
# Add pangaea dataset ids
add_pangaea_id(query_results)

# Create dictionary to store dataframes in
data_dict = {}
# Loop over IDs and download datasets
for pangaea_id in query_results.pangaea_id[:4]:
    print("-" * 40)
    print(f'Pangaea ID: {pangaea_id}')
    # Download the dataset and cache it locally to avoid repeated downloads
    ds = PanDataSet(pangaea_id, enable_cache=True)
    # Translate to long parameter names
    get_long_parameters(ds)
    print(f'Dataset title: {ds.title}')
    print(ds.data.head(2))
    data_dict[pangaea_id] = ds.data
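
Because data collections and restricted datasets cannot be downloaded, a loop over many IDs may stop at the first failure. The following sketch shows one way to keep going and record the failures. `download_all` and `fake_fetch` are hypothetical; in the notebook, the `fetch` argument would be a function that wraps `PanDataSet`, and the kind of error it raises is not known here.

```python
def download_all(ids, fetch):
    """Call `fetch` for each ID and collect successes and failures separately."""
    data, failed = {}, {}
    for dataset_id in ids:
        try:
            data[dataset_id] = fetch(dataset_id)
        except Exception as error:  # broad on purpose: the error type is an unknown here
            failed[dataset_id] = str(error)
    return data, failed

def fake_fetch(dataset_id):
    """Stand-in for a real download; fails for odd IDs to simulate restricted data."""
    if dataset_id % 2:
        raise PermissionError("restricted")
    return [dataset_id]

data, failed = download_all([2, 3, 4], fake_fetch)
print(sorted(data), failed)
```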

### Save multiple datasets

In [None]:
# Create data folder
data_folder = "pangaea_data"
if not os.path.isdir(data_folder):
    os.mkdir(data_folder)
# Loop over each dataset in the dictionary and save to csv
for key, df in data_dict.items():
    # Save to csv
    df.to_csv(os.path.join(data_folder, f'Pangaea_dataset_{key}.csv'))
    print(f'PANGAEA dataset {key} saved')

## 3.3 Quiz

### 3.3.1 Download this dataset and identify the first event name
https://doi.pangaea.de/10.1594/PANGAEA.947275

In [None]:
# Your solution

### 3.3.2 Download this dataset and identify the number of sampling points >1000m
https://doi.pangaea.de/10.1594/PANGAEA.943624

In [None]:
# Your solution

### 3.3.3 Was there a sampling point in Australia for this dataset?
https://doi.pangaea.de/10.1594/PANGAEA.943455

In [None]:
# Your solution

# 4. Get metadata

## 4.1 Get metadata for each dataset

### Download dataset

In [None]:
# Retrieve only the metadata of the dataset from PANGAEA
# (include_data=False means that the data table itself is not downloaded)
ds = PanDataSet(923033, include_data=False)
ds.data

### Basic metadata retrieval

In [None]:
# Title
print(f'Title: {ds.title}')
# Abstract
print(f'Abstract: {ds.abstract}')
# Publication date
print(f'Publication date: {ds.date}')
# Authors
print(f'Authors: {"; ".join([x.fullname for x in ds.authors])}')
# Author orcids
print(f'Orcids: {"; ".join([x.ORCID if x.ORCID else "no ORCID" for x in ds.authors])}')
# Citation
print(f'Citation: {ds.citation}')
# doi
print(f'doi: {ds.doi}')
# Geolocation
print(f'Latitude: {ds.geometryextent["meanLatitude"]}')
print(f'Longitude: {ds.geometryextent["meanLongitude"]}')
# Parameters
params = "; ".join([f'{param.name} [{param.unit}]' if param.unit else param.name for param in ds.params.values()])
print(f'Parameters: {params}')
# Event devices
print(f'Event devices: {"; ".join(set([device for device in ds.getEventsAsFrame()["device"]]))}')

## 4.2 Getting metadata for multiple datasets

### 4.2.1 Using Pangaeapy

#### Function to extract PANGAEA metadata

In [None]:
# Function to extract metadata from Pangaea dataset
def get_pangaea_meta(pangaea_id):
    try:
        print(f'Extract metadata for Pangaea ID: {pangaea_id}')
        # Get metadata for pangaea dataset
        ds = pan.PanDataSet(pangaea_id, enable_cache=True, include_data=False)
        # Create data frame to store metadata
        meta = pd.DataFrame({"pangaea_id": [pangaea_id]})
        # Extract and add metadata    
        meta["year"] = ds.year
        meta["authors"] = "; ".join([x.fullname for x in ds.authors])
        meta["title"] = ds.title
        meta["abstract"] = ds.abstract 
        meta["citation"] = ds.citation
        meta["parameters"]= "; ".join([f'{param.name} [{param.unit}]' if param.unit else param.name for param in ds.params.values()])
        meta["publication_date"] = ds.date
        # Check if there are geometry metadata
        if ds.geometryextent:
            meta["mean_latitude"] = ds.geometryextent["meanLatitude"]
            meta["mean_longitude"] = ds.geometryextent["meanLongitude"]
        # Check if events are available
        events = ds.getEventsAsFrame()
        if not events.empty:
            # Join the values of all events so that they fit into the single metadata row
            for column, name in [("label", "events"), ("device", "event_device"), ("elevation", "elevation"),
                                 ("campaign", "campaign"), ("location", "location")]:
                meta[name] = "; ".join(events[column].dropna().astype(str).unique())
        meta["doi"] = ds.doi
        meta["datastatus"] = ds.datastatus
    except AttributeError:
        meta = pd.DataFrame()
    
    return meta

#### Function to download metadata from multiple datasets

In [None]:
# Function to download multiple Pangaea metadata
def get_pangaea_meta_df(query_term, pangaea_id_list, folder = "PANGAEA_metadata"):
    # Create folder for metadata
    if os.path.isdir(folder):
        print(f'{folder} already exists')
    # if not create it    
    else:
        os.mkdir(folder)
    
    # Create file path inside the metadata folder
    file_path = os.path.join(os.getcwd(), folder, f'metadata_{query_term.replace(":", "_")}.csv')
    print(file_path)
    # Add check if data have already been downloaded
    if os.path.isfile(file_path):
        print("File already exists")
        meta_df = pd.read_csv(file_path)
    else:
        meta_dict = {}
        # Retrieve and store metadata in dictionary
        for pangaea_id in pangaea_id_list:
            meta_dict[pangaea_id] = get_pangaea_meta(pangaea_id)
        # Join all metadata into single dataframe
        meta_df = pd.concat(meta_dict).reset_index(drop=True)
        # Save metadata to csv file
        meta_df.to_csv(file_path, index=False)
        print(f'Pangaea metadata saved as {file_path}')
    return meta_df
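
The function above derives the file name from the query term by replacing colons. More complex queries also contain spaces and parentheses, which can be awkward in file names. A minimal sketch of a stricter clean-up, using a hypothetical helper `safe_filename`:

```python
import re

def safe_filename(query_term, prefix="metadata_", suffix=".csv"):
    """Replace characters that are problematic in file names with underscores."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", query_term).strip("_")
    return f"{prefix}{cleaned}{suffix}"

print(safe_filename("citation:author:Herzschuh"))
print(safe_filename("Globigerina AND (falconensis OR bulloides)"))
```

For the author query used here, this yields the same file name as the simple colon replacement.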

#### Use these functions to download metadata

In [None]:
# Perform PANGAEA query
query_term = "citation:author:Herzschuh"
query_results = query_pangaea(query_term, limit = 50)
# Add pangaea dataset ids
add_pangaea_id(query_results)
# Extract metadata for all query results
meta_df = get_pangaea_meta_df(query_term = query_term, pangaea_id_list = query_results["pangaea_id"])
meta_df

## 4.3 Alternative way to retrieve metadata

### 4.3.1 HTML Scraping
* To access metadata directly via web scraping
* To apply this more generic approach to scraping data from other repositories

#### First define scraping functions

In [None]:
# Function to extract the full PANGAEA dataset web content
def get_html(url):
    """Function to download and parse html web content

    Args:
        url (str): URL of the PANGAEA dataset landing page

    Returns:
        bs4.BeautifulSoup: Parsed html content of PANGAEA dataset
    """
    page = urlopen(url)
    html = page.read().decode("utf-8")
    return BeautifulSoup(html, "html.parser")

# Function to extract PANGAEA metadata
def get_pan_metadata(dataset_html, metadata):
    """Function to scrape metadata from PANGAEA dataset html content

    Args:
        dataset_html (bs4.BeautifulSoup): Parsed html content of PANGAEA dataset
        metadata (str): Name of the meta tag to be extracted, e.g. "title"

    Returns:
        str: Extracted metadata
    """
    return dataset_html.find("meta", attrs={"name": metadata}).get("content")
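
To see what the meta-tag lookup does without contacting a server, the same idea can be sketched with Python's built-in `html.parser` module in place of BeautifulSoup. The HTML snippet is made up for illustration and `MetaTagParser` is a hypothetical class; real PANGAEA pages contain many more tags.

```python
from html.parser import HTMLParser

class MetaTagParser(HTMLParser):
    """Collect the name/content pairs of all <meta> tags."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Only meta tags with a name attribute carry named metadata
        if tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content")

# Made-up HTML snippet, for illustration only
sample_html = ('<html><head><meta name="title" content="Example dataset">'
               '<meta name="author" content="Doe, Jane"><meta charset="utf-8"></head></html>')
parser = MetaTagParser()
parser.feed(sample_html)
print(parser.meta)
```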

#### See what metadata are available
Note:
* Example dataset: https://doi.org/10.1594/PANGAEA.923035
* You can view the source code in the browser by pressing CTRL + U (in Firefox)

In [None]:
# Scrape PANGAEA dataset
html = get_html("https://doi.org/10.1594/PANGAEA.923035")
# Extract all available metadata types
for meta in html.find_all("meta"):
    if meta.has_attr("name"):
        print(meta.attrs["name"])

In [None]:
# Get the title
html.find("meta", attrs={"name": "title"}).get("content")

#### Scrape the html content for multiple PANGAEA datasets

Perform query

In [None]:
# Perform PANGAEA query
query_term = "Deep-sea Sponge Microbiome Project"
query_results = query_pangaea(query_term, limit = 5, exclude_collection=True)
query_results

In [None]:
# Generate url from uri
query_results["url"] = [f'https://doi.org/{uri.split(":")[-1:][0]}' for uri in query_results.URI]
# Scrape the html content for each PANGAEA dataset
query_results["html"] = [get_html(url) for url in query_results["url"]]

#### Extract desired metadata from each dataset

In [None]:
# Extract desired metadata from dataset html
for counter, metadata in enumerate(["title", "author", "date", "geo.position", "description"]):
    # Check if metadata already exist
    if metadata not in query_results.columns:
        query_results.insert(counter+1, metadata, [get_pan_metadata(html, metadata) for html in query_results["html"]])

# Extract the abstract
query_results["abstract"] = [html.find("div", attrs={"class": "abstract"}).get_text() for html in query_results["html"]]
query_results.head(2)

#### Save metadata

In [None]:
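# Save metadata (create the folder first in case it does not exist yet)
os.makedirs("PANGAEA_metadata", exist_ok=True)
query_results.to_csv(os.path.join("PANGAEA_metadata", 'Pangaea_metadata_html.csv'), index=False)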
query_results

### 4.3.2 Metadata from json

In [None]:
# Extract json string from html
query_results["json"] = [html.find("script", attrs={"type": "application/ld+json"}).string for html in query_results["html"]]
# Alternative way of doing the same thing (it is 5 times slower though)
#query_results["json"] = [requests.get(url, headers={'Accept': 'application/ld+json'}).json() for url in query_results["url"]]

# See what metadata are available
print(json.loads(query_results["json"][0]))
# Extract the dataset name from each json record
print([json.loads(json_str)["name"] for json_str in query_results["json"]])
# Extract nested metadata such as ORCID ID
print(json_normalize(json.loads(query_results["json"][0])["creator"])["@id"])
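
Nested fields such as the creators can also be read with the standard `json` module alone. The record below is made up and heavily shortened, and the ORCID is a placeholder. The sketch assumes that creators without an ORCID simply lack the `@id` key and that a single creator may be stored as a dictionary rather than a list.

```python
import json

# Made-up JSON-LD record; real records contain many more fields
sample_json = """
{"@type": "Dataset",
 "name": "Example dataset",
 "creator": [{"@type": "Person", "name": "Jane Doe", "@id": "https://orcid.org/0000-0000-0000-0000"},
             {"@type": "Person", "name": "John Doe"}]}
"""
record = json.loads(sample_json)
creators = record.get("creator", [])
# A single creator may be stored as a dictionary rather than a list
if isinstance(creators, dict):
    creators = [creators]
orcids = [creator.get("@id", "no ORCID") for creator in creators]
print(record["name"], orcids)
```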

## 4.4 Quiz

### 4.4.1 What is the title of this dataset?
https://doi.pangaea.de/10.1594/PANGAEA.937210

In [None]:
# Your solution

### 4.4.2 What is the publication date of this dataset?
https://doi.pangaea.de/10.1594/PANGAEA.863967

In [None]:
# Your solution

### 4.4.3 Did they measure temperature in this dataset?
https://doi.pangaea.de/10.1594/PANGAEA.863975

In [None]:
# Your solution

# 5. Download specific parameters across multiple datasets

## 5.1 Check the frequency of parameters

Combine data headers from all data frames 

In [None]:
params = pd.DataFrame()
# Extract and combine headers of all data sets
for key, df in data_dict.items():
    params = pd.concat([params, df.columns.to_frame()], ignore_index=True, axis=0)
# Rename the parameter column
params = params.rename(columns={0: "parameters"})
# Show the first 10 parameters
params.head(10)

Plot parameter frequency

In [None]:
# Calculate the parameter frequency
param_count = params["parameters"].value_counts()
print(param_count.head(10))
# Plot the parameter frequency
plt.figure(figsize=(10,3))
count_plot = sns.barplot(x = param_count.index[:15], y = param_count.values[:15], color="darkcyan")
count_plot = count_plot.set_xticklabels(count_plot.get_xticklabels(), rotation=90)
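
The counting step can be illustrated on a small scale with `collections.Counter`. The headers below are made up and stand in for the column names of three downloaded datasets.

```python
from collections import Counter

# Made-up column headers of three datasets, keyed by dataset ID
headers = {1: ["LATITUDE", "LONGITUDE", "Salinity"],
           2: ["LATITUDE", "LONGITUDE"],
           3: ["LATITUDE", "Species"]}
# Count how often each parameter occurs across all datasets
param_count = Counter(column for columns in headers.values() for column in columns)
print(param_count.most_common(2))
```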

## 5.2 Extract and combine parameters from data frames

#### Function to extract specific parameter(s) from dataframes
**This is a key benefit of harmonised parameters in a well curated data repository**

In [None]:
# Function to find and extract desired parameters across all dataframes
def get_param_data(data_dict, params):
    """Function to find and extract desired parameters across all dataframes

    Args:
        data_dict (dict): Data frames to search, keyed by PANGAEA dataset ID
        params (list): List of parameters to be extracted

    Returns:
        pandas.core.frame.DataFrame: Data frame containing data for all parameters
    """

    # Define empty dictionary to temporarily store extracted data
    extracted_data = {}
    # Convert parameter names to lowercase to improve matching
    params_lower = [x.lower() for x in params]
    # Loop over all dataframes, look for and extract parameters
    for key, df in data_dict.items():
        # Work on a renamed copy so that the original headers stay unchanged
        df = df.rename(columns=str.lower)
        # Find parameters that exist in the dataset, keeping the requested order
        found_params = [x for x in params_lower if x in df.columns]
        if found_params:
            print(f'Found the parameters {found_params} in dataset {key}')
            # Copy found parameters to new dataframe
            df_sub = df[found_params].copy()
            # Insert PANGAEA dataset ID
            df_sub.insert(0, "Pangaea_dataset_id", key)
            # Store extracted data in dictionary
            extracted_data[key] = df_sub

    # Return an empty dataframe if none of the parameters was found
    if not extracted_data:
        return pd.DataFrame()
    # Join all dataframes in dictionary
    return pd.concat(extracted_data, ignore_index = True)
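
The case-insensitive matching at the heart of this function can be sketched on plain lists. `match_columns` is a hypothetical helper that returns the original column names instead of lowercased ones.

```python
def match_columns(columns, wanted):
    """Return the columns whose lowercase name is in `wanted`, in original order."""
    wanted_lower = {name.lower() for name in wanted}
    return [column for column in columns if column.lower() in wanted_lower]

print(match_columns(["Latitude", "LONGITUDE", "Species"], ["latitude", "longitude", "Salinity"]))
```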

#### Extract and save specific parameters from dataframes

In [None]:
# Enter all parameters to be extracted
extracted_data = get_param_data(data_dict, ["LATITUDE", "LONGITUDE", 'DATE/TIME', "DEPTH, water [m]", "Salinity"])
# Save extracted data parameters
extracted_data.to_csv(os.path.join(data_folder, 'Extracted_data.csv'), index=False)
extracted_data

# 6. Download linked genetic data

## 6.1 Download PANGAEA dataset with genetic accession numbers
Dataset: https://doi.pangaea.de/10.1594/PANGAEA.937551

In [None]:
# Download dataset from PANGAEA
ds = PanDataSet(937551)
get_long_parameters(ds)
# Work on a copy of the first four rows to avoid modifying the original data
df = ds.data.head(4).copy()
df.head(2)

## 6.2 Download GenBank records

In [None]:
# Extract NCBI accession number from dataset
df.loc[:,"Accession number, genetics"] = [x.split(":")[1] for x in df.loc[:,"Accession number, genetics"]]
# Fetch gene records from NCBI
records = []
for acc_id in df["Accession number, genetics"]:
    print(acc_id)
    handle = Entrez.efetch(db="nucleotide", rettype="fasta", retmode="text",
                          id=acc_id)
    records.append(SeqIO.read(handle, 'fasta'))
    # Close the connection once the record has been read
    handle.close()

## 6.3 Add genetic records to PANGAEA data frame

In [None]:
# Add gene description
df.loc[:, "Gene"] = [record.description for record in records]
# Add genetic sequence as plain text
df.loc[:, "Sequence"] = [str(record.seq) for record in records]
# Save to file
df.to_csv(os.path.join(data_folder, 'PANGAEA_NCBI_data.csv'), index=False)
df.head(2)

# 7. Download binary files

## 7.1 Download PANGAEA dataset with image data
Dataset: https://doi.pangaea.de/10.1594/PANGAEA.943250

In [None]:
# Download dataset from PANGAEA
pan_id = 943250
ds = PanDataSet(pan_id)
# Spell out abbreviated parameters
get_long_parameters(ds)
# Copy three example rows to avoid modifying the original data
df = ds.data.iloc[22:25,:].copy()
df.head(2)

## 7.2 Download images

In [None]:
# Create file urls
df["image_url"] = [f'https://download.pangaea.de/dataset/{pan_id}/files/{img}' for img in df['Image']]
# Download images
for file_name, file_url in zip(df["Image"], df["image_url"]):
    urlretrieve(file_url, os.path.join(data_folder, file_name))
    print(f'{file_url} downloaded')
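
File names that contain spaces or other special characters need to be escaped before they are placed in a URL. The sketch below reuses the URL pattern from the cell above with `urllib.parse.quote`. `file_url_for` is a hypothetical helper, the file name is made up, and whether PANGAEA file names actually require escaping is an assumption.

```python
from urllib.parse import quote

BASE_URL = "https://download.pangaea.de/dataset"

def file_url_for(pan_id, file_name):
    """Build the download URL of a file, escaping unsafe characters."""
    return f"{BASE_URL}/{pan_id}/files/{quote(file_name)}"

# Made-up file name containing a space
print(file_url_for(943250, "image 01.jpg"))
```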