Skip to content

Python library to access data in the GEOROC database

License

Notifications You must be signed in to change notification settings

pofatu/pygeoroc

Repository files navigation

pygeoroc

Build Status codecov PyPI

Python library to access GEOROC data as archived at https://data.goettingen-research-online.de/dataverse/digis.

Cite GEOROC data as specified for each pre-compiled dataset and pygeoroc as

Robert Forkel. (2022). pofatu/pygeoroc: Programmatic access to GEOROC data. Zenodo. http://doi.org/10.5281/zenodo.3744586

Install

Install pygeoroc from the Python Package Index running

pip install pygeoroc

preferably in a separate virtual environment, to keep your system's Python installation unaffected.

Installing pygeoroc will also install a command line program georoc, which provides functionality to curate a local copy of the GEOROC data. Run

georoc -h

for details on usage.

Overview

GEOROC provides its downloadable content in precompiled files organized in datasets.

pygeoroc provides functionality to

  • download this data to a local directory, called repository,
  • access the data in the repository programmatically from Python code,
  • load the data from the repository into a SQLite database for scalable and performant analysis.

Downloading GEOROC data

Running

$ georoc --repos tmp download

will create or update a local repository, i.e. a directory with the following layout:

$ tree -L 1 .
.
├── csv
└── datasets.json

The repository contains a "table of contents" in datasets.json. The checksum recorded per file in this table is checked when running georoc download again, making sure only new versions of files are fetched.

The local repository can be inspected running georoc ls, e.g.

$ georoc --repos tmp/ ls --samples --references --format pipe

will output the table included below

file dataset size last modified # samples # references path
doi:10.25625/1KRR1P/QIINIE GEOROC Compilation: Archaean Cratons 211.7KB 2022-06-20 242 8 2022-06-1KRR1P_ALDAN_SHIELD_ARCHEAN.csv
doi:10.25625/1KRR1P/9YG7VJ GEOROC Compilation: Archaean Cratons 494.3KB 2022-06-20 564 23 2022-06-1KRR1P_AMAZONIAN_CRATON.csv
doi:10.25625/1KRR1P/E7NI47 GEOROC Compilation: Archaean Cratons 3.3MB 2022-06-20 3776 66 2022-06-1KRR1P_BALTIC_SHIELD_ARCHEAN.csv
doi:10.25625/1KRR1P/FY7PXY GEOROC Compilation: Archaean Cratons 53.8KB 2022-06-20 61 5 2022-06-1KRR1P_BASTAR_CRATON_ARCHEAN.csv

...

Loading GEOROC data into SQLite

Since GEOROC contains data about hundreds of thousands of samples, querying it is made a lot easier by loading it into a SQLite database:

$ georoc --repos tmp/ createdb
100%|██████████████████████████████████████████████████| 5/5 [00:00<00:00, 25.32it/s]
INFO    tmp/georoc.sqlite

The resulting database has 4 tables:

  • file: Info about a CSV file, basically the data from index.csv.
  • sample: Info about individual samples.
  • reference: Info about sources of the data
  • citation: The association table relating samples with references.

The schema can be inspected running:

$ sqlite3 tmp/georoc.sqlite
SQLite version 3.11.0 2016-02-15 17:29:24
Enter ".help" for usage hints.
sqlite> .schema

and looks as follows:

CREATE TABLE file (id TEXT PRIMARY KEY, date TEXT, section TEXT);
CREATE TABLE reference (id INTEGER PRIMARY KEY, reference TEXT);
CREATE TABLE sample (
    id TEXT PRIMARY KEY,
    file_id TEXT,
    `AGE` TEXT,
`ALTERATION` TEXT,
`DRILL_DEPTHAX` TEXT,
`DRILL_DEPTH_MIN` TEXT,
`ELEVATION_MAX` TEXT,
`ELEVATION_MIN` TEXT,
`EPSILON_ND` TEXT,
`ERUPTION_DAY` TEXT,
`ERUPTION_MONTH` TEXT,
`ERUPTION_YEAR` TEXT,
`GEOL.` TEXT,
`LAND_OR_SEA` TEXT,
`LATITUDE_MAX` TEXT,
`LATITUDE_MIN` TEXT,
`LOCATION` TEXT,
`LOCATION_COMMENT` TEXT,
`LONGITUDE_MAX` TEXT,
`LONGITUDE_MIN` TEXT,
`MATERIAL` TEXT,
`MINERAL` TEXT,
`ROCK_NAME` TEXT,
`ROCK_TEXTURE` TEXT,
`ROCK_TYPE` TEXT,
`TECTONIC_SETTING` TEXT,
`AR40_K40` TEXT,
`HE3_HE4` TEXT,
`HE4_HE3` TEXT,
`HF176_HF177` TEXT,
`K40_AR40` TEXT,
`ND143_ND144` TEXT,
`ND143_ND144_INI` TEXT,
`OS184_OS188` TEXT,
`OS186_OS188` TEXT,
`OS187_OS186` TEXT,
`OS187_OS188` TEXT,
`PB206_PB204` TEXT,
`PB206_PB204_INI` TEXT,
`PB207_PB204` TEXT,
`PB207_PB204_INI` TEXT,
`PB208_PB204` TEXT,
`PB208_PB204_INI` TEXT,
`RE187_OS186` TEXT,
`RE187_OS188` TEXT,
`SR87_SR86` TEXT,
`SR87_SR86_INI` TEXT,
`AG(PPM)` TEXT,
`AL(PPM)` TEXT,
`AS(PPM)` TEXT,
`AU(PPM)` TEXT,
`B(PPM)` TEXT,
`BA(PPM)` TEXT,
`BE(PPM)` TEXT,
`BI(PPM)` TEXT,
`BR(PPM)` TEXT,
`C(PPM)` TEXT,
`CA(PPM)` TEXT,
`CAO(WT%)` TEXT,
`CD(PPM)` TEXT,
`CE(PPM)` TEXT,
`CL(PPM)` TEXT,
`CL(WT%)` TEXT,
`CO(PPM)` TEXT,
`CR(PPM)` TEXT,
`CS(PPM)` TEXT,
`CU(PPM)` TEXT,
`DY(PPM)` TEXT,
`ER(PPM)` TEXT,
`EU(PPM)` TEXT,
`F(PPM)` TEXT,
`F(WT%)` TEXT,
`FE(PPM)` TEXT,
`FEO(WT%)` TEXT,
`FEOT(WT%)` TEXT,
`GA(PPM)` TEXT,
`GD(PPM)` TEXT,
`GE(PPM)` TEXT,
`HE(CCM/G)` TEXT,
`HE(CCMSTP/G)` TEXT,
`HE(NCC/G)` TEXT,
`HF(PPM)` TEXT,
`HG(PPM)` TEXT,
`HO(PPM)` TEXT,
`I(PPM)` TEXT,
`IN(PPM)` TEXT,
`IR(PPM)` TEXT,
`K(PPM)` TEXT,
`LA(PPM)` TEXT,
`LI(PPM)` TEXT,
`LOI(WT%)` TEXT,
`LU(PPM)` TEXT,
`MAX._AGE_(YRS.)` TEXT,
`MG(PPM)` TEXT,
`MGO(WT%)` TEXT,
`MIN._AGE_(YRS.)` TEXT,
`MN(PPM)` TEXT,
`MNO(WT%)` TEXT,
`MO(PPM)` TEXT,
`NA(PPM)` TEXT,
`NB(PPM)` TEXT,
`ND(PPM)` TEXT,
`NI(PPM)` TEXT,
`NIO(WT%)` TEXT,
`O(WT%)` TEXT,
`OH(WT%)` TEXT,
`OS(PPM)` TEXT,
`OTHERS(WT%)` TEXT,
`P(PPM)` TEXT,
`PB(PPM)` TEXT,
`PD(PPM)` TEXT,
`PR(PPM)` TEXT,
`PT(PPM)` TEXT,
`RB(PPM)` TEXT,
`RE(PPM)` TEXT,
`RH(PPM)` TEXT,
`RU(PPM)` TEXT,
`S(PPM)` TEXT,
`S(WT%)` TEXT,
`SB(PPM)` TEXT,
`SC(PPM)` TEXT,
`SE(PPM)` TEXT,
`SM(PPM)` TEXT,
`SN(PPM)` TEXT,
`SR(PPM)` TEXT,
`TA(PPM)` TEXT,
`TB(PPM)` TEXT,
`TE(PPM)` TEXT,
`TH(PPM)` TEXT,
`TI(PPM)` TEXT,
`TL(PPM)` TEXT,
`TM(PPM)` TEXT,
`U(PPM)` TEXT,
`V(PPM)` TEXT,
`VOLATILES(WT%)` TEXT,
`W(PPM)` TEXT,
`Y(PPM)` TEXT,
`YB(PPM)` TEXT,
`ZN(PPM)` TEXT,
`ZR(PPM)` TEXT,
`AL2O3(WT%)` TEXT,
`B2O3(WT%)` TEXT,
`CH4(WT%)` TEXT,
`CL2(WT%)` TEXT,
`CO1(WT%)` TEXT,
`CO2(PPM)` TEXT,
`CO2(WT%)` TEXT,
`CR2O3(WT%)` TEXT,
`FE2O3(WT%)` TEXT,
`H2O(WT%)` TEXT,
`H2OM(WT%)` TEXT,
`H2OP(WT%)` TEXT,
`H2OT(WT%)` TEXT,
`HE3(AT/G)` TEXT,
`HE3(CCMSTP/G)` TEXT,
`HE3_HE4(R/R(A))` TEXT,
`HE4(AT/G)` TEXT,
`HE4(CCM/G)` TEXT,
`HE4(CCMSTP/G)` TEXT,
`HE4(MOLE/G)` TEXT,
`HE4(NCC/G)` TEXT,
`HE4_HE3(R/R(A))` TEXT,
`K2O(WT%)` TEXT,
`NA2O(WT%)` TEXT,
`P2O5(WT%)` TEXT,
`SIO2(WT%)` TEXT,
`SO2(WT%)` TEXT,
`SO3(WT%)` TEXT,
`SO4(WT%)` TEXT,
`TIO2(WT%)` TEXT,
    FOREIGN KEY (file_id) REFERENCES file(id)
);
CREATE TABLE citation (
    sample_id TEXT,
    reference_id INTEGER,
    FOREIGN KEY (sample_id) REFERENCES sample(id),
    FOREIGN KEY (reference_id) REFERENCES reference(id)
);

Thus, information similar to what is reported by georoc ls can be obtained by running SQL queries.

E.g. to determine the paper which contributed the highest number of samples, we can run

SELECT
    r.reference, COUNT(c.sample_id) AS c
FROM
    reference as r, citation as c
WHERE
    r.id = c.reference_id
GROUP BY r.reference
ORDER BY c DESC
LIMIT 1;

which yields

FONTIJN K., MCNAMARA K., TADESSE A. Z., PYLE D. M., DESSALEGN F., HUTCHISON W., MATHER T. A., YIRGU G.:    CONTRASTING STYLES OF POST-CALDERA VOLCANISM ALONG THE MAIN ETHIOPIAN RIFT: IMPLICATIONS FOR CONTEMPORARY VOLCANIC HAZARDS  J. VOLCANOL. GEOTHERM. RES. 356   [2018] 90-113    doi: 10.1016/j.jvolgeores.2018.02.001|3978

Accessing GEOROC data programmatically

pygeoroc provides a Python API to access local GEOROC data:

>>> from pygeoroc import GEOROC
>>> api = GEOROC('tmp/')
>>> for sample in api.iter_samples():
>>> for sample, file in api.iter_samples():
...     print(sample)
...     print(file)
...     break
... 
Sample(id='138180', name='s_27-261-33,CC,PC. 2 [2118]', citations=['2118'], data={'SR(PPM)': '', ... })
File(name='Ocean_Basin_Flood_Basalts_comp__ARGO_ABYSSAL_PLAIN;INDIAN_OCEAN.csv', date=datetime.date(2020, 3, 9), section='Ocean Basin Flood Basalts')

Converters

For both access modes - SQLite and the python API - pygeoroc provides a mechanism to specify "converters", i.e. python callables to convert the values for specific columns in GEOROC CSV data. This mechanism can be used to fix errata - e.g. missing negative signs in geographic coordinates - or cast data to more suitable datatypes.

This mechanism is implemented in pygeoroc.errata and works as follows: If the data repository contains a python module called converters.py, this module is loaded and two python dicts are looked up by name:

  • FIELDS: dict mapping CSV column names to converter functions.
  • COORDINATES: dict mapping CSV filenames (as stored in the repository) to dicts specifying converter functions for keys latitude and longitude.

A "converter function" is a python callable with the following signature:

def conv(old, data, fname):
    """
    @param old: old value for the respective field in the sample data
    @param data: full `dict` of the sample data in one row in the GEOROC CSV
    @param fname: name of the GEOROC CSV file containing the row
    @return: the new value for "field" in "data"
    """

Some useful converter functions are available as attributes of pygeoroc.errata.CONVERTERS.

So, to specify that values for the column LAND_OR_SEA must be all uppercase, and latitudes in the file BISMARCK_ARC_NEW_BRITAIN_ARC.csv must be negative (look up filenames in your repository's csv/ directory or by running SELECT id FROM file; in the SQLite db), you would put the following python code in your repository's converters.py:

from pygeoroc.errata import CONVERTERS

FIELDS = {
    'LAND_OR_SEA': CONVERTERS.upper,
}

COORDINATES = {
    "BISMARCK_ARC_NEW_BRITAIN_ARC.csv": {
        'latitude': CONVERTERS.negative,
    },
}