# Download data packages from the Movebank data repository

In [1]:
# %matplotlib inline

import requests
import json
# from io import StringIO
import pandas as pd
# import matplotlib.pyplot as plt
# plt.style.use('seaborn-white')
# from bs4 import BeautifulSoup
# import re

## Get list of data packages

We want to get a list of all relevant data packages deposited in the [Movebank data repository](https://www.datarepository.movebank.org/). This repository, which is built on top of [Dryad's DSpace](https://github.com/datadryad/dryad-repo), offers some services for this, such as [OAI-PMH](https://www.datarepository.movebank.org/oai/request?verb=ListRecords&from=2000-01-01&metadataPrefix=oai_dc). However, since all relevant data packages are also registered with [DataCite](https://www.datacite.org/), we use [their more convenient API](http://api.datacite.org/) instead.

Applied search filters:

* `publisher-id` = `tib.ukon`: All Movebank packages are published by the Universität Konstanz - Bibliothek.
* `resource-type-id` = `dataset`: To exclude collections and other works published by `tib.ukon`.
* `query` = `Movebank Data Repository`: All relevant packages have this value in `contributor`. Movebank packages that lack it are test records (see [issue 2](https://github.com/peterdesmet/movebank2gbif/issues/2)).

**Note: Still need to add `rows` to get all packages, not just 27.**
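
A minimal sketch of how the query could be extended with a page size. It assumes the API accepts a `rows` parameter, as the note above suggests; the value `1000` is an arbitrary illustrative upper bound, not a documented limit. The URL is only built and printed here, no request is sent.

```python
from urllib.parse import urlencode

# Same filters as used in this notebook, plus a page size.
# ASSUMPTION: `rows` is accepted by the API; 1000 is an illustrative value.
parameters = {
    'publisher-id': 'tib.ukon',
    'resource-type-id': 'dataset',
    'query': 'Movebank Data Repository',
    'rows': 1000,
}

# Build the URL without sending a request, to inspect the query string
url = 'https://api.datacite.org/works?' + urlencode(parameters)
print(url)
```

The same dictionary can be passed to `requests.get(..., params=parameters)`, which encodes it in the same way.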

In [2]:
parameters = {
    'publisher-id': 'tib.ukon',
    'resource-type-id': 'dataset',
    'query': 'Movebank Data Repository'
}
results = requests.get('https://api.datacite.org/works', params=parameters)
results.url

'https://api.datacite.org/works?query=Movebank+Data+Repository&resource-type-id=dataset&publisher-id=tib.ukon'

In [3]:
data_packages_dict = results.json()

In [4]:
for package in data_packages_dict['data']:
    print(package['id'])

http://doi.org/10.5441/001/1.NF80477P/2
http://doi.org/10.5441/001/1.3HP3S250/1
http://doi.org/10.5441/001/1.NF80477P
http://doi.org/10.5441/001/1.NF80477P/1
http://doi.org/10.5441/001/1.3HP3S250/2
http://doi.org/10.5441/001/1.F3550B4F
http://doi.org/10.5441/001/1.F3550B4F/1
http://doi.org/10.5441/001/1.F3550B4F/2
http://doi.org/10.5441/001/1.8C56F72S
http://doi.org/10.5441/001/1.8C56F72S/1
http://doi.org/10.5441/001/1.8C56F72S/2
http://doi.org/10.5441/001/1.PR1VJ29N
http://doi.org/10.5441/001/1.PR1VJ29N/1
http://doi.org/10.5441/001/1.PR1VJ29N/2
http://doi.org/10.5441/001/1.62S17B4V
http://doi.org/10.5441/001/1.62S17B4V/1
http://doi.org/10.5441/001/1.62S17B4V/2
http://doi.org/10.5441/001/1.F321PF80
http://doi.org/10.5441/001/1.F321PF80/1
http://doi.org/10.5441/001/1.F321PF80/2
http://doi.org/10.5441/001/1.5JD56S8H
http://doi.org/10.5441/001/1.5JD56S8H/1
http://doi.org/10.5441/001/1.5JD56S8H/2
http://doi.org/10.5441/001/1.5JD56S8H/3
http://doi.org/10.5441/001/1.PV048Q7V
dataset
tib.ukon

In [5]:
# Load dict directly into df
data_packages_df = pd.DataFrame(data_packages_dict['data'])

In [6]:
data_packages_df

| | attributes | id | type |
|---|---|---|---|
| 0 | {'description': 'Gagliardo A, Bried J, Lambard... | http://doi.org/10.5441/001/1.NF80477P/2 | works |
| 1 | {'description': 'Dodge S, Bohrer G, Weinzierl ... | http://doi.org/10.5441/001/1.3HP3S250/1 | works |
| 2 | {'description': 'Gagliardo A, Bried J, Lambard... | http://doi.org/10.5441/001/1.NF80477P | works |
| 3 | {'description': 'Gagliardo A, Bried J, Lambard... | http://doi.org/10.5441/001/1.NF80477P/1 | works |
| 4 | {'description': 'Dodge S, Bohrer G, Weinzierl ... | http://doi.org/10.5441/001/1.3HP3S250/2 | works |
| 5 | {'description': 'Bartlam-Brooks HLA, Beck PSA,... | http://doi.org/10.5441/001/1.F3550B4F | works |
| 6 | {'description': 'Bartlam-Brooks HLA, Beck PSA,... | http://doi.org/10.5441/001/1.F3550B4F/1 | works |
| 7 | {'description': 'Bartlam-Brooks HLA, Beck PSA,... | http://doi.org/10.5441/001/1.F3550B4F/2 | works |
| 8 | {'description': 'Spiegel OM, Harel R, Centeno-... | http://doi.org/10.5441/001/1.8C56F72S | works |
| 9 | {'description': 'Spiegel OM, Harel R, Centeno-... | http://doi.org/10.5441/001/1.8C56F72S/1 | works |
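
The `attributes` column holds a nested dict per record, which is awkward to query as a single column. The sketch below shows one way to flatten such a record in plain Python. The helper `flatten_record` and the `attributes.` key prefix are hypothetical names chosen for illustration, and the sample description is a placeholder rather than the real (truncated) value.

```python
def flatten_record(record):
    """Merge the nested `attributes` dict into the top level of a record."""
    flat = {key: value for key, value in record.items() if key != 'attributes'}
    # Prefix attribute keys so they cannot clash with top-level keys
    for key, value in record.get('attributes', {}).items():
        flat['attributes.' + key] = value
    return flat

# Illustrative record: id and type are taken from the table above;
# the description is a placeholder, not the actual value.
sample = {
    'id': 'http://doi.org/10.5441/001/1.NF80477P/2',
    'type': 'works',
    'attributes': {'description': 'placeholder description'},
}

flat = flatten_record(sample)
print(sorted(flat))  # ['attributes.description', 'id', 'type']
```

A list of such flattened records can be passed to `pd.DataFrame`. Recent pandas versions also provide `pd.json_normalize`, which may do the same in one call.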


**Note: Why do I get back non-`works` types (such as the `dataset` and `tib.ukon` entries printed above)?**
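
Until that question is answered, one defensive option is to keep only records whose `id` looks like a DOI URL. This is a sketch, not an explanation of the API's behaviour: the helper `is_doi_record` is a hypothetical name, and the prefix check is based only on the ids printed earlier in this notebook.

```python
def is_doi_record(record):
    """Return True if the record's id looks like a DOI URL."""
    return str(record.get('id', '')).startswith('http://doi.org/')

# Ids copied from the printed output above; other keys omitted for brevity
records = [
    {'id': 'http://doi.org/10.5441/001/1.PV048Q7V'},
    {'id': 'dataset'},
    {'id': 'tib.ukon'},
]

# Keep only the records that point at a DOI
doi_records = [record for record in records if is_doi_record(record)]
print([record['id'] for record in doi_records])
```

Applied to `data_packages_dict['data']`, the same comprehension would drop the two non-DOI entries before the DataFrame is built.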