![PANGAEA_Banner.png](https://github.com/pangaea-data-publisher/community-workshop-material/raw/master/banner.png)

# PANGAEApy
## Introduction and examples for PANGAEA community workshop 2022

### This script shows how to search for and download multiple files via pangaeapy, and how to get information on meta data

#### For more information and examples on pangaeapy: https://github.com/pangaea-data-publisher/pangaeapy


### Overview
* How to search for specific data sets in PANGAEA
* Search by project
* Convert search results into one easy to read table
* Export search results to csv file
* Get data including meta data of a single data set
* Refine search with geographical coordinates by applying a bounding box
* Refine search further: filter only datasets with "Geochemistry" in title
* Get multiple data sets
* Download many binary files
* Download many files

In [None]:
# import necessary packages
import pangaeapy as pan
from pangaeapy.pandataset import PanDataSet
import pandas as pd
import re
import os

In [None]:
# show functions of pangaeapy
help(pan)

<br/>

# How to search for specific data sets in PANGAEA

In [None]:
# show functions of pangaeapy.panquery
help(pan.panquery)

##  Search by project
### Example: search for project "PAGES_C-PEAT"
pan.PanQuery("PAGES_C-PEAT", limit = 500)
vs. 
pan.PanQuery("project:label:PAGES_C-PEAT", limit = 500)


Note: the default is limit = 10; the maximum is limit = 500

In [None]:
search1 = pan.PanQuery("PAGES_C-PEAT", limit = 500)
print(search1.totalcount)
print(search1.error)
print(search1.query)

In [None]:
search1.result[0:2] #show only first 2 results

#### Documentation on search with keywords
https://wiki.pangaea.de/wiki/PANGAEA_search

In [None]:
# refined search with project label
# same as search on website https://www.pangaea.de/?q=project:label:PAGES_C-PEAT
search2 = pan.PanQuery("project:label:PAGES_C-PEAT", limit = 500)
print(search2.totalcount)
print(search2.error)
print(search2.query)

In [None]:
search2.result[0:2] #show only first 2 results

#### Hint: specify your search with facet filter at https://www.pangaea.de and refine your search query with PANGAEApy

### What if list of search results exceeds limit of 500?
If a search returns more than 500 results (totalcount > 500), split it into several queries and use offset to skip the results already retrieved

In [None]:
PAGES1=pan.PanQuery("project:label:PAGES_C-PEAT", limit = 500)
print(PAGES1.totalcount)
print(PAGES1.error)
print(PAGES1.query)
print(PAGES1.result[0]['URI'])

In [None]:
PAGES2=pan.PanQuery("project:label:PAGES_C-PEAT", limit = 500, offset=500)
print(PAGES2.totalcount)
print(PAGES2.error)
print(PAGES2.query)
print(PAGES2.result[0]['URI'])
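The two queries above hard-code the offsets 0 and 500. For longer result lists the offsets can be computed from totalcount. The following is a minimal sketch: `page_offsets` is a hypothetical helper, not part of pangaeapy, and each value it returns would be passed as `offset=` to `pan.PanQuery` as in the cells above.

```python
def page_offsets(totalcount, limit=500):
    """Return the offsets needed to page through `totalcount` results.

    Hypothetical helper (not part of pangaeapy).
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    return list(range(0, totalcount, limit))


# 1200 hits with the maximum page size of 500 need three queries
print(page_offsets(1200))
```

Collecting one data frame per offset and concatenating them once at the end keeps the rest of the workflow unchanged.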

In [None]:
print(type(PAGES1))
print(type(PAGES1.result))
print(PAGES1.result[0])

<br/>

### Convert search results into one easy to read table
convert list of dictionaries into data frame 

In [None]:
df1 = pd.DataFrame(PAGES1.result)
df2 = pd.DataFrame(PAGES2.result)

information on size of data frames and columns

In [None]:
df1.columns

In [None]:
df1.count()

In [None]:
df2.count()

merge both data frames into one

In [None]:
df=pd.concat([df1,df2],ignore_index=True)
df.count()

show first 5 lines of data frame

In [None]:
df.head()

which information is in column html?

In [None]:
df.html[0]

#### get information on title and author out of html code and add to data frame
use regular expressions

In [None]:
# create column: title (text between the closing </strong> and the closing </a>)
df['title'] = df.html.str.extract(r'"citation">(?:.*?)<\/strong>(.*?)<\/a>', expand=False)

# create column: author (text before the year in parentheses)
df['author(s)'] = df.html.str.extract(r'"dataset-link"><strong>(.*?)\([0-9]{4}\):<\/strong>', expand=False)

# create column: year of publication (first four-digit number in parentheses)
df['year of publication'] = df.html.str.extract(r'(\([0-9]{4}\))', expand=False)

# create column: PANGAEA ID (six digits at the end of the DOI link)
df['PANGAEA ID'] = df.html.str.extract(r'class="citation"><a href="https:\/\/doi\.pangaea\.de\/10\.1594\/PANGAEA\.([0-9]{6})', expand=False)

#print(df.columns)
# adapt position of columns 
df = df[['PANGAEA ID','author(s)', 'title','year of publication','URI','type','score','position', 'html']]

In [None]:
df
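To see what each regular expression captures, the same patterns can be tried on a single string with the standard `re` module. This is only a sketch: the HTML snippet below is made up to mimic the structure the patterns expect (author, title and ID are invented), and unlike the cell above it strips surrounding whitespace from each match.

```python
import re

# Made-up HTML snippet that only mimics the structure the patterns expect;
# author, title and ID are invented for illustration.
html = (
    '<div class="citation"><a href="https://doi.pangaea.de/10.1594/PANGAEA.123456" '
    'class="dataset-link"><strong>Doe, J (2018):</strong> '
    'Example peat core record</a></div>'
)

patterns = {
    "title": r'"citation">(?:.*?)</strong>(.*?)</a>',
    "author(s)": r'"dataset-link"><strong>(.*?)\([0-9]{4}\):</strong>',
    "year of publication": r'(\([0-9]{4}\))',
    "PANGAEA ID": r'class="citation"><a href="https://doi\.pangaea\.de/10\.1594/PANGAEA\.([0-9]{6})',
}

fields = {}
for name, pattern in patterns.items():
    match = re.search(pattern, html)
    # str.extract yields NaN when nothing matches; None plays that role here
    fields[name] = match.group(1).strip() if match else None

print(fields)
```

Note that the year pattern keeps the parentheses, because they are inside the capture group.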

type = child means that the data set is part of a data collection

score indicates how well the data set matches the search query

<br/>

## Export search results to csv file

Find out what your current path is and alter it to your liking.

In [None]:
# what is my Current Working Directory ?
print(os.getcwd())

Define the path and file name where the output will be stored.

##### NOTE: If you are working on a Windows machine, write the path with / instead of \, and do not forget the trailing /

In [None]:
datapath='<your_specific_path>'
outfile='search_result_PAGES.txt'

Export the table to a tab-delimited text file

In [None]:
df.to_csv((datapath+outfile),sep='\t',index=False)
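Joining the directory and the file name with `+` only works if the directory ends with a slash. As an alternative sketch, `pathlib` from the standard library inserts the separator itself; `output_path` and the directory name `results` are hypothetical and used only for illustration.

```python
from pathlib import Path


def output_path(directory, filename):
    """Join directory and file name; a trailing slash on `directory` is optional."""
    return Path(directory) / filename


# Both spellings of the (hypothetical) directory give the same target path
with_slash = output_path("results/", "search_result_PAGES.txt")
without_slash = output_path("results", "search_result_PAGES.txt")
print(with_slash == without_slash)
```

The resulting path object can be passed to `to_csv` in place of the concatenated string.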

<br/>

## Get data including meta data of a single data set
#### use function PanDataSet

In [None]:
#help(PanDataSet)

In [None]:
Joey_core12 = PanDataSet(890405)
print(Joey_core12.title)
print(Joey_core12.citation)
print(Joey_core12.isParent)

In [None]:
Joey_core12.data.head()

Parameter long names and units are given in lists 

In [None]:
long_names = []
for param in Joey_core12.params.values():
    print(param.name)
    print(param.unit)
    long_names.append(str(param.name) + ' [' + str(param.unit) + ']')
    
#print(long_names)
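In the loop above, `str(param.unit)` turns a missing unit into literal text inside the brackets. Assuming that pangaeapy reports a missing unit as `None` (an assumption, not checked here), a small helper can leave the brackets out in that case. The (name, unit) pairs below are made up and stand in for `param.name` and `param.unit`.

```python
def column_label(name, unit):
    """Build a 'name [unit]' header, omitting the brackets when there is no unit."""
    return f"{name} [{unit}]" if unit else str(name)


# Made-up (name, unit) pairs standing in for param.name and param.unit
example_params = [("Depth", "m"), ("Age", "ka BP"), ("Event label", None)]
labels = [column_label(name, unit) for name, unit in example_params]
print(labels)
```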

### download table as tab-delimited txt file

In [None]:
# what is my Current Working Directory ?
print(os.getcwd())

In [None]:
datapath='<your_specific_path>'
outfile_joey='Joey_core12.txt'

Joey_core12.data.to_csv((datapath+outfile_joey),sep='\t',index=False,header=long_names)


<br/>

## Refine search with geographical coordinates by applying a bounding box

bbox: set the bounding box to define geographical search constraints following the GeoJSON specs


bbox=(minlon, minlat, maxlon, maxlat)

In [None]:
# datasets in northern Sweden
PAGES_Sweden = pan.PanQuery("project:label:PAGES_C-PEAT", limit = 500, bbox=(17.7, 67.7, 21, 69))
print(PAGES_Sweden.totalcount)
print(PAGES_Sweden.error)
print(PAGES_Sweden.query)
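The order of the four bbox values is easy to mix up. A small check before the query can catch some mistakes. This is a sketch with a hypothetical helper, `check_bbox`, that is not part of pangaeapy; it cannot detect every swap, because swapped longitudes and latitudes may still be valid numbers.

```python
def check_bbox(bbox):
    """Validate a (minlon, minlat, maxlon, maxlat) tuple and return it unchanged."""
    minlon, minlat, maxlon, maxlat = bbox
    if not (-180 <= minlon <= 180 and -180 <= maxlon <= 180):
        raise ValueError("longitudes must lie between -180 and 180")
    if not (-90 <= minlat <= maxlat <= 90):
        raise ValueError("latitudes must satisfy -90 <= minlat <= maxlat <= 90")
    # minlon > maxlon is not rejected: GeoJSON allows it for boxes crossing the antimeridian
    return bbox


# bounding box of the northern Sweden query above
print(check_bbox((17.7, 67.7, 21, 69)))
```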


loop over result list and take PANGAEA data set ID from URI

In [None]:
panID = []
title = []
for entry in PAGES_Sweden.result:
    # the data set ID is the part of the URI after the last dot
    pan_id = int(entry['URI'].split('.')[-1])
    panID.append(pan_id)

    # load the data set to read its title (one request per data set)
    ds = PanDataSet(pan_id)
    title.append(ds.title)

#print(panID)
#print(title)
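The ID extraction can be tried without querying PANGAEA. The sketch below assumes that a URI ends in `PANGAEA.<ID>`, as the loop above implies; `pangaea_id` is a hypothetical helper, and the first URI form shown is an assumption about what the result list contains.

```python
def pangaea_id(uri):
    """Return the numeric data set ID at the end of a PANGAEA URI or DOI."""
    tail = uri.rsplit(".", 1)[-1]
    if not tail.isdigit():
        raise ValueError(f"no numeric ID at the end of {uri!r}")
    return int(tail)


# Both forms end in the ID of the data set used earlier in this notebook
print(pangaea_id("doi:10.1594/PANGAEA.890405"))
print(pangaea_id("https://doi.pangaea.de/10.1594/PANGAEA.890405"))
```

Splitting only at the last dot keeps the helper working for both forms, whereas unpacking into exactly three parts fails for the second one.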

In [None]:
df_sweden_meta = pd.DataFrame(panID,columns=['panID'])
df_sweden_meta['Title'] = title

In [None]:
df_sweden_meta.head()

In [None]:
df_sweden_meta.count()

<br/>

## Refine search further: filter only datasets with "Geochemistry" in title

In [None]:
PAGES_Sweden_geo = df_sweden_meta[df_sweden_meta['Title'].str.contains('Geochemistry')]

In [None]:
PAGES_Sweden_geo.head()

In [None]:
PAGES_Sweden_geo.count()

<br/>

## Get multiple data sets

combine all data of PAGES_Sweden_geo search results into a single data frame


In [None]:
# collect the data of every matching data set, then concatenate once
# (Series.iteritems() was removed in pandas 2.0, so iterate over the column directly)
frames = []

for id_pan in PAGES_Sweden_geo['panID']:
    #print(id_pan)
    ds = PanDataSet(id_pan)
    # keep DOI and citation with every row so the source stays traceable
    ds.data['DOI'] = ds.doi
    ds.data['citation'] = ds.citation
    frames.append(ds.data)

# new data frame
PAGES_Sweden_data = pd.concat(frames, axis=0, ignore_index=True)


In [None]:
# rearrange order of columns
print(PAGES_Sweden_data.columns)
PAGES_Sweden_data = PAGES_Sweden_data[['Depth', 'Age', 'DBD', 'OM', 'OM dens', 'TC', 'TN', 'Corg dens', 'Peat',
       'Peat_2', 'Samp thick', 'LOI', 'C', 'Event', 'Latitude', 'Longitude', 'Elevation', 'DOI', 'citation']]

In [None]:
PAGES_Sweden_data.count()

In [None]:
PAGES_Sweden_data.head()

download table as tab-delimited text file

In [None]:
# what is my Current Working Directory ?
print(os.getcwd())

In [None]:
datapath='<your_specific_path>'
outfile_sweden='Sweden_geochem.txt'

PAGES_Sweden_data.to_csv((datapath+outfile_sweden),sep='\t',index=False)

<br/>

## Download many binary files

download the images from a single dataset https://doi.pangaea.de/10.1594/PANGAEA.919398

In [None]:
df_image = PanDataSet(919398)

In [None]:
df_image.data.head()

In [None]:
df_image.data.count()

download files listed in the column "IMAGE"

set the prefix first (see .tab file) https://doi.pangaea.de/10.1594/PANGAEA.919398?format=textfile

In [None]:
prefix = 'https://download.pangaea.de/dataset/919398/files/'

download only images when "fauna" is listed in column "Content"

In [None]:
df_fauna = df_image.data[df_image.data['Content'].str.contains('fauna')]

In [None]:
# what is my Current Working Directory ?
print(os.getcwd())

In [None]:
import urllib.request 

datapath='<your_specific_path>'

for i, image_name in df_fauna['IMAGE'].items():  # .iteritems() was removed in pandas 2.0
    print(image_name)
    urllib.request.urlretrieve((prefix+image_name), (datapath+image_name))
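For larger downloads it can help to separate planning from fetching, and to skip files that already exist so an interrupted run can be resumed. The following is a sketch under assumptions: `plan_downloads` and `download_missing` are hypothetical helpers, and the file names and the `images` directory are made up. Only `plan_downloads` is called here, so nothing is downloaded.

```python
from pathlib import Path
from urllib.parse import quote
from urllib.request import urlretrieve


def plan_downloads(prefix, file_names, target_dir):
    """Return (url, local_path) pairs without touching the network."""
    target = Path(target_dir)
    # quote() percent-encodes characters such as spaces in the file name
    return [(prefix + quote(name), target / name) for name in file_names]


def download_missing(jobs):
    """Fetch each URL unless the local file already exists."""
    for url, local_path in jobs:
        if local_path.exists():
            continue
        local_path.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, str(local_path))


# Made-up file names; in the notebook they would come from df_fauna['IMAGE']
jobs = plan_downloads(
    "https://download.pangaea.de/dataset/919398/files/",
    ["image 01.jpg", "image_02.jpg"],
    "images",
)
for url, local_path in jobs:
    print(url, "->", local_path)
```

Calling `download_missing(jobs)` would then fetch only the files that are not yet in the target directory.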

<br/>

## Download many files

this concerns datasets published before 2020 (in .tab file the full path is given, not just the file name)

example: https://doi.pangaea.de/10.1594/PANGAEA.910179

In [None]:
file_list = PanDataSet(910179)

In [None]:
file_list.data.head()

In [None]:
file_list.data.count()

In [None]:
# what is my Current Working Directory ?
print(os.getcwd())

In [None]:
import urllib.request 

datapath='<your_specific_path>'

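# iterate over URL and file name together, so the pairing does not depend on the index
# (.iteritems() was removed in pandas 2.0)
for file_url, file_name in zip(file_list.data['URL file'], file_list.data['File name']):
    print(file_url)
    urllib.request.urlretrieve(file_url, (datapath+file_name))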
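The loop above takes the local file name from the column 'File name'. Where only the full URL is at hand, the name can be derived from the URL itself. A minimal sketch with a hypothetical helper and a made-up URL:

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def file_name_from_url(url):
    """Return the last path segment of a URL, ignoring query string and fragment."""
    path = urlsplit(url).path
    return unquote(PurePosixPath(path).name)


# Hypothetical URL, used only to show the parsing
print(file_name_from_url("https://example.org/store/cruise%20track.zip?download=1"))
```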