## An example of how to access project-specific metadata of PANGAEA datasets

By: Flavia Höring

Last updated: 2024-04-29

This notebook shows how to extract project-specific information from PANGAEA datasets using the Python modules 'pangaeapy' and 'pandas'. Such information is relevant, for example, to a project's data managers who need to report statistics on the datasets published within a specific project to funding agencies. The notebook shows how to 1) search for the list of datasets belonging to a specific project, 2) write a full list of dataset citations to a text file, and 3) create a pandas data frame with the metadata of interest for further statistical analysis or plotting.

For more examples on metadata extraction with 'pangaeapy', see [pangaeapy_detailed_metadata_search.ipynb](https://github.com/pangaea-data-publisher/community-workshop-material/blob/master/Python/PANGAEApy_practical/pangaeapy_detailed_metadata_search.ipynb)



In [None]:
# install missing packages
!pip install pandas
!pip install pangaeapy

In [None]:
# import packages
import pandas as pd
import pangaeapy as pan
from pangaeapy.pandataset import PanDataSet


### Search for dataset list of specific project

In [None]:
query = pan.PanQuery("project:label:CDRmare", limit=100)

In [None]:
query.totalcount
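
Because the search above uses `limit=100`, `query.totalcount` can be larger than the number of entries actually returned in `query.result`. The following is a minimal sketch of paging through all hits. It assumes that `PanQuery` accepts an `offset` argument, which should be checked against the installed pangaeapy version. The names `fetch_all`, `fetch_page` and `fake_page` are hypothetical, and the search function is passed in so that the loop can be tried without network access.

```python
def fetch_all(fetch_page, page_size=100):
    """Collect results page by page until a page comes back short or empty.

    fetch_page(offset, limit) must return a list of result dictionaries.
    """
    results = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        results.extend(page)
        if len(page) < page_size:
            # a short (or empty) page means there is nothing left to fetch
            break
        offset += page_size
    return results


# Offline stand-in for the PANGAEA search; the URIs are placeholders only.
fake_hits = [{"URI": f"placeholder-{i}"} for i in range(250)]


def fake_page(offset, limit):
    return fake_hits[offset:offset + limit]


all_hits = fetch_all(fake_page, page_size=100)
print(len(all_hits))

# With pangaeapy, the page function might look like this (assumption: the
# 'offset' keyword exists in the installed version):
# fetch_page = lambda offset, limit: pan.PanQuery(
#     "project:label:CDRmare", limit=limit, offset=offset).result
```

Comparing `len(all_hits)` with `query.totalcount` is a quick way to confirm that no dataset was left out of the report.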

In [None]:
query.query

In [None]:
# show first entry of query result (query.result = list of dictionaries)
query.result[0]

In [None]:
# list the metadata fields available for each search hit
query.result[0].keys()

In [None]:
# get a list of URIs for the query result
l_dois = [d.get('URI') for d in query.result]

In [None]:
l_dois

In [None]:
len(l_dois)  # number of datasets returned (at most `limit`)
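
For reports, it can be handy to turn the entries of `l_dois` into clickable links or plain numeric dataset IDs. The sketch below assumes that the `URI` values have the form `doi:10.1594/PANGAEA.<number>`; this should be verified against the actual query result. The helper names `pangaea_id` and `doi_url` are hypothetical, and the example DOI is a placeholder.

```python
import re

# Assumed URI shape: "doi:10.1594/PANGAEA.<number>"
_PANGAEA_ID = re.compile(r"PANGAEA\.(\d+)", re.IGNORECASE)


def pangaea_id(uri):
    """Return the numeric dataset ID from a PANGAEA DOI string, or None."""
    match = _PANGAEA_ID.search(uri)
    return int(match.group(1)) if match else None


def doi_url(uri):
    """Turn 'doi:10.1594/...' into a resolvable 'https://doi.org/10.1594/...' link."""
    doi = uri[4:] if uri.lower().startswith("doi:") else uri
    return "https://doi.org/" + doi


example = "doi:10.1594/PANGAEA.123456"  # placeholder, not a real dataset
print(pangaea_id(example))
print(doi_url(example))
```

Applied to the whole list, this would read `[doi_url(d) for d in l_dois]`.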

### Print a full list of dataset citations to a .txt file

In [None]:
# write all citations to a .txt file, one citation per line
with open("citations_CDRmare.txt", "w", encoding="utf-8") as file:
    for doi in l_dois:
        # include_data=False loads only the metadata, not the data table
        ds = PanDataSet(doi, include_data=False)
        file.write(ds.citation + "\n")
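
When many datasets are requested in a row, a single failed request would otherwise abort the whole loop. The following sketch collects citations and keeps track of the URIs that could not be loaded. `collect_citations` and `FakeDataSet` are hypothetical names. Whether `PanDataSet` raises an exception or only logs an error on a failed request may depend on the pangaeapy version, so the error handling here is an assumption.

```python
def collect_citations(uris, load_dataset):
    """Return (citations, failed).

    citations: list of citation strings for datasets that loaded
    failed:    list of (uri, error message) for datasets that did not
    """
    citations, failed = [], []
    for uri in uris:
        try:
            ds = load_dataset(uri)
            citations.append(ds.citation)
        except Exception as err:  # broad on purpose: network and parsing errors alike
            failed.append((uri, str(err)))
    return citations, failed


class FakeDataSet:
    """Offline stand-in for PanDataSet, used only to demonstrate the loop."""

    def __init__(self, uri):
        if uri == "bad":
            raise ValueError("could not load")
        self.citation = f"Citation for {uri}"


citations, failed = collect_citations(["a", "bad", "b"], FakeDataSet)
print(citations)
print(failed)

# With pangaeapy, the loader might be:
# load_dataset = lambda uri: PanDataSet(uri, include_data=False)
```

The `failed` list can then be retried later or reported alongside the citation file.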

### Extract the DOIs, publication dates and the project names in table format for further analysis

In [None]:
# read selected fields from the XML metadata of each dataset via pangaeapy
records = []

for doi in l_dois:
    ds = PanDataSet(doi, include_data=False)
    records.append({
        'DOI': doi,
        'Publication_Date': ds.date,
        # a dataset can belong to several projects; join their labels
        'Project': ', '.join(pro.label for pro in ds.projects),
    })

# build the data frame once instead of growing it row by row
df_metadata = pd.DataFrame(records, columns=['DOI', 'Publication_Date', 'Project'])

df_metadata

In [None]:
# export metadata table as a tab-delimited text file
df_metadata.to_csv('Metadata_datapub_CDRmare.txt', sep='\t', index=False)
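
As one example of the further statistical analysis mentioned above, the number of datasets published per year can be counted with pandas. This is a minimal sketch that assumes the `Publication_Date` values are strings that `pd.to_datetime` can parse. The rows in `df_example` are placeholders standing in for `df_metadata`, not real datasets.

```python
import pandas as pd

# Placeholder rows with the same columns as df_metadata (not real datasets)
df_example = pd.DataFrame({
    "DOI": ["doi:placeholder-1", "doi:placeholder-2", "doi:placeholder-3"],
    "Publication_Date": ["2022-03-01", "2022-11-15", "2023-06-30"],
    "Project": ["CDRmare", "CDRmare", "CDRmare"],
})

# Parse the dates; unparseable values become NaT instead of raising an error
dates = pd.to_datetime(df_example["Publication_Date"], errors="coerce")

# Number of datasets per publication year, sorted by year
per_year = dates.dt.year.value_counts().sort_index()
print(per_year)
```

Replacing `df_example` with `df_metadata` gives the per-year counts for the project, and `per_year.plot(kind="bar")` would draw them if matplotlib is installed.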