# Geochemical Data - Importing, Processing and 'Munging'

In this notebook we'll go through some of the tasks which `pyrolite` can make a bit easier when getting your data ready for analysis.


## Importing Data

`pyrolite` is largely built around `pandas`, so you will typically be working with pandas dataframes. Pandas can read [a variety of file types](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html), some more performant than others, and is also happy to digest humble CSV and Excel files (with the functions `pandas.read_csv` and `pandas.read_excel`, respectively). Some of these functions can read remote files directly (e.g. a CSV at a URL) or query a database connection. You'll see one or two examples of fetching a remote CSV file directly below and in other notebooks.

## Cleaning Up Column Names

One of the challenges of working with larger datasets is being able to quickly find the right data when you need it. `pyrolite` provides some functions for this, but for the time being it depends on being able to recognise compositional columns by looking for element names, oxide names and isotopes (without unit annotations, delimiters and other markup). Here we show some of the steps which might be required to get your dataframe into a standardised format (using examples [from GEOROC](http://georoc.mpch-mainz.gwdg.de/georoc/Entry.html)). Notably, this can be the most difficult step of any analysis workflow, so doing it in a repeatable way might save you a decent amount of time if you have to do it again in the future!

In [1]:
import numpy as np
import pandas as pd
df = pd.read_csv('http://georoc.mpch-mainz.gwdg.de/georoc/Csv_Downloads/Continental_Flood_Basalts_comp/CENTRAL_ATLANTIC_MAGMATIC_PROVINCE_-_CAMP.csv',
                 encoding='cp1252',
                 skip_blank_lines=False) # get some data from GEOROC directly
df = df.loc[:np.argmax(df.iloc[:, 0].isnull())-1] # omit the abbreviations and references after the blank line in this file
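The trimming step above relies on the data block ending at the first blank row. As a minimal sketch of that idea using only the standard library, the example below applies it to a small made-up CSV string (the contents are invented for illustration and are not real GEOROC data). Note that `np.argmax` returns 0 for an all-False array, so the one-liner above assumes that a blank row is actually present; the fallback here keeps the whole table if there is none.

```python
import csv
import io

# Made-up miniature of the file layout: data rows, a blank line, then notes.
raw = "SAMPLE,SIO2(WT%)\nA1,49.5\nA2,51.2\n\nAbbreviations:\nWR = whole rock\n"

rows = list(csv.reader(io.StringIO(raw)))
header, body = rows[0], rows[1:]

# Index of the first empty row; if there is none, keep everything.
first_blank = next((i for i, row in enumerate(body) if not any(row)), len(body))
data = body[:first_blank]

print(header)
print(data)
```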

Quickly looking at this dataframe, we can see that all the column names are capitalised, `_` is used as a delimiter for isotope ratios, and units are given for the geochemical parameters in the format `(UNIT)`. We also have a unique ID column, and an extra redundant column on the right-hand side of our table (which we can drop):

In [2]:
df.head(2)

Unnamed: 0,CITATIONS,TECTONIC SETTING,LOCATION,LOCATION COMMENT,LATITUDE MIN,LATITUDE MAX,LONGITUDE MIN,LONGITUDE MAX,LAND OR SEA,ELEVATION MIN,...,RE187_OS188,HF176_HF177,HE3_HE4,HE3_HE4(R/R(A)),HE4_HE3,HE4_HE3(R/R(A)),K40_AR40,AR40_K40,UNIQUE_ID,Unnamed: 171
0,[20054],CONTINENTAL FLOOD BASALT,CENTRAL ATLANTIC MAGMATIC PROVINCE - CAMP / SI...,"NEAR LEONFORTE VILLAGE, ALONG THE SOUTHERN SLO...",37.6403,37.6403,134.3278,134.3278,SAE,,...,,,,,,,,,1015617,
1,[20054],CONTINENTAL FLOOD BASALT,CENTRAL ATLANTIC MAGMATIC PROVINCE - CAMP / SI...,"NEAR LEONFORTE VILLAGE, ALONG THE SOUTHERN SLO...",37.6403,37.6403,134.3278,134.3278,SAE,,...,,,,,,,,,1015618,


In [3]:
df.drop(columns=[c for c in df.columns if 'Unnamed' in c], inplace=True) # drop our redundant column

We can alter the column names into a form which `pyrolite` will recognise, so that we can use some of its more automated methods. Specifically, `pyrolite` expects element and oxide names in 'title case' (e.g. `Mg`, `MgO`, not `MG`, `MGO`). Similarly, for isotope ratios, it expects something along the lines of `87Sr/86Sr`. Here I've written a function which will attempt to find relevant element and oxide names among the capitalised versions we find here.

In [4]:
from pyrolite.geochem.ind import __common_elements__, __common_oxides__ # indexes of elements and oxides which we'll check against

def rename_columns(df):
    """
    Rename the columns which pyrolite can access so we can use the indexing and transformation functions.
    
    Parameters
    ----------
    df : pandas.DataFrame
        Dataframe with columns you'd like to rename.
    
    Returns
    -------
    pandas.DataFrame
        Dataframe with columns renamed.
    """
    _elements, _oxides = {e.upper(): e for e in __common_elements__}, {o.upper(): o for o in __common_oxides__} # these will serve as lookup tables for our capitalised components
    
    element_columns = {c: _elements[c[:c.find("(PPM")]] for c in df.columns if c[:c.find("(PPM")] in _elements} # all of the elemental values are in ppm here
    oxide_columns = {c: _oxides[c[:c.find("(WT%")]] for c in df.columns if c[:c.find("(WT%")] in _oxides} # this omits some gas measurements, but gets the major oxides
    isotope_columns = {c: '/'.join(part.title() for part in c.split('_')) for c in df.pyrochem.list_isotope_ratios if len(c.split('_'))==2} # e.g. SR87_SR86 -> Sr87/Sr86
    
    return df.rename(columns = {**element_columns, **oxide_columns, **isotope_columns})
    

df = rename_columns(df)
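To make the string handling easier to follow, here is a minimal pure-Python sketch of the same lookup idea applied to a handful of column names. The lookup tables are small hypothetical stand-ins for `__common_elements__` and `__common_oxides__`, and `tidy_name` is an illustrative helper rather than part of `pyrolite`; it also parses the unit suffix slightly differently from `rename_columns` above.

```python
# Hypothetical stand-ins for pyrolite's element and oxide indexes.
ELEMENTS = {"NI": "Ni", "LA": "La", "SR": "Sr"}
OXIDES = {"SIO2": "SiO2", "MGO": "MgO"}


def tidy_name(column):
    """Return a title-case name for a GEOROC-style column, or the input unchanged."""
    stem = column.split("(")[0]  # text before any unit annotation such as "(PPM)"
    if column.endswith("(PPM)") and stem in ELEMENTS:
        return ELEMENTS[stem]
    if column.endswith("(WT%)") and stem in OXIDES:
        return OXIDES[stem]
    parts = column.split("_")
    # Isotope ratios look like "SR87_SR86": two parts, each letters then digits.
    if len(parts) == 2 and all(p[:1].isalpha() and p[-1:].isdigit() for p in parts):
        return "/".join(p.title() for p in parts)
    return column


columns = ["SIO2(WT%)", "NI(PPM)", "SR87_SR86", "UNIQUE_ID", "LOCATION"]
renamed = {c: tidy_name(c) for c in columns}
print(renamed)
```

Columns that match none of the patterns (such as `UNIQUE_ID`) pass through unchanged, which mirrors how `DataFrame.rename` leaves unmapped columns alone.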

In [5]:
df.head(2)

Unnamed: 0,CITATIONS,TECTONIC SETTING,LOCATION,LOCATION COMMENT,LATITUDE MIN,LATITUDE MAX,LONGITUDE MIN,LONGITUDE MAX,LAND OR SEA,ELEVATION MIN,...,Re187/Os186,Re187/Os188,Hf176/Hf177,He3/He4,He3/He4(R/R(A)),He4/He3,He4/He3(R/R(A)),K40/Ar40,Ar40/K40,UNIQUE_ID
0,[20054],CONTINENTAL FLOOD BASALT,CENTRAL ATLANTIC MAGMATIC PROVINCE - CAMP / SI...,"NEAR LEONFORTE VILLAGE, ALONG THE SOUTHERN SLO...",37.6403,37.6403,134.3278,134.3278,SAE,,...,,,,,,,,,,1015617
1,[20054],CONTINENTAL FLOOD BASALT,CENTRAL ATLANTIC MAGMATIC PROVINCE - CAMP / SI...,"NEAR LEONFORTE VILLAGE, ALONG THE SOUTHERN SLO...",37.6403,37.6403,134.3278,134.3278,SAE,,...,,,,,,,,,,1015618


Finally, it might be a good idea to use the Unique ID as the index for our dataframe:

In [6]:
df = df.set_index('UNIQUE_ID', drop=True)

You could package this up into a function to fetch one of these GEOROC files:

In [7]:
def fetch_GEOROC_csv(filepath):
    """
    Fetch a GEOROC csv from a local file or URL.
    
    Parameters
    ----------
    filepath : str | pathlib.Path
        Filepath to a GEOROC csv - can be used to directly fetch a URL thanks to `pandas.read_csv`.
    
    Returns
    -------
    pandas.DataFrame
        Dataframe formatted for `pyrolite`.
    """
    df = pd.read_csv(filepath, encoding='cp1252', skip_blank_lines=False) # get some data from GEOROC directly
    df = df.loc[:np.argmax(df.iloc[:, 0].isnull())-1] # omit the abbreviations and references after the blank line in this file
    df.drop(columns=[c for c in df.columns if 'Unnamed' in c], inplace=True) # drop our redundant column
    df = rename_columns(df)
    df = df.set_index('UNIQUE_ID', drop=True)
    return df

Now we can use this to fetch another CSV - this time from the Kermadec Arc. Note that the column names which `pyrolite` can work with are already converted to usable versions:

In [8]:
kermadec_df = fetch_GEOROC_csv('http://georoc.mpch-mainz.gwdg.de/georoc/Csv_Downloads/Convergent_Margins_comp/KERMADEC_ARC.csv')
kermadec_df.head(2)

Unnamed: 0_level_0,CITATIONS,TECTONIC SETTING,LOCATION,LOCATION COMMENT,LATITUDE MIN,LATITUDE MAX,LONGITUDE MIN,LONGITUDE MAX,LAND OR SEA,ELEVATION MIN,...,Os187/Os188,Re187/Os186,Re187/Os188,Hf176/Hf177,He3/He4,He3/He4(R/R(A)),He4/He3,He4/He3(R/R(A)),K40/Ar40,Ar40/K40
UNIQUE_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
10203-46325,[13460][9592],CONVERGENT MARGIN,KERMADEC ARC / KERMADEC ISLANDS / RAOUL / RAOUL,,-29.25,-29.25,-177.87,-177.87,SAE,,...,,,,,,,,,,
10203-46329,[13460][9592],CONVERGENT MARGIN,KERMADEC ARC / KERMADEC ISLANDS / RAOUL / RAOUL,,-29.25,-29.25,-177.87,-177.87,SAE,,...,,,,,,,,,,


## Selecting Subsets of your Data

The `pyrolite.pyrochem` API provides access to indexing and transformation functions. This allows easy subsetting of geochemical datasets which can otherwise be unwieldy (especially as the number of columns increases). To provide a simple illustration we generate a synthetic dataset to work from, which contains an array of typical geochemical measures: oxide components, element components (here as ppm), element ratios and isotope ratios. While a dataset of this size is manageable, some of the indexing tools `pyrolite` provides make it straightforward to pull out different parts of the dataset.

In [9]:
import numpy as np
from pyrolite.util.synthetic import normal_frame

df = normal_frame(columns=['CaO', 'MgO', 'SiO2', 'FeO','Na2O', 'Ni', 'Ti', 'La', 'Lu', 'Te']) * 100
df[['Ni', 'Ti', 'La', 'Lu', 'Te']] *= 10
df.pyrochem.add_ratio('Mg/Fe') # one way to add an element ratio to a dataframe!
df['Sr87/Sr86'] = 0.0700  / 0.0986 + np.random.randn(df.index.size) * 0.0001
df

Unnamed: 0,CaO,MgO,SiO2,FeO,Na2O,Ni,Ti,La,Lu,Te,Mg/Fe,Sr87/Sr86
0,1.655055,7.086181,5.57225,1.753955,16.903489,313.245685,149.330266,117.027877,77.029714,13.65715,3.134337,0.709945
1,2.018815,7.260948,6.385096,1.956095,16.1316,315.263957,130.808577,121.702827,80.410158,14.288939,2.879753,0.710012
2,1.883454,7.621294,5.876502,1.975933,16.684251,283.236364,160.895403,119.091654,80.107607,16.254629,2.992321,0.710067
3,1.820512,7.599999,5.748971,2.018871,16.515718,317.596513,136.269178,116.929177,78.694511,13.469901,2.920497,0.70983
4,1.672767,7.236712,5.454837,1.858761,16.467654,327.623953,134.538619,120.285869,77.824026,12.820223,3.020435,0.709914
5,1.849501,7.662645,5.544457,2.008992,16.931303,307.457489,133.573533,123.913256,80.864561,14.222178,2.95905,0.709852
6,1.759604,7.224562,5.578504,1.889394,16.426477,313.777911,144.190118,120.94659,78.503193,13.796774,2.966477,0.709879
7,1.662329,6.846866,5.726762,1.67786,18.618896,295.671929,154.964956,112.604728,76.161235,15.270037,3.165833,0.709929
8,1.847861,7.425272,5.771191,1.9354,16.286983,314.511146,139.492888,120.125835,79.590426,13.612632,2.976414,0.709965
9,1.883241,7.343225,5.700526,1.976209,15.192967,316.259645,140.002246,128.340952,80.312341,14.123132,2.882742,0.710008


In [10]:
df.pyrochem.oxides

Unnamed: 0,CaO,MgO,SiO2,FeO,Na2O
0,1.655055,7.086181,5.57225,1.753955,16.903489
1,2.018815,7.260948,6.385096,1.956095,16.1316
2,1.883454,7.621294,5.876502,1.975933,16.684251
3,1.820512,7.599999,5.748971,2.018871,16.515718
4,1.672767,7.236712,5.454837,1.858761,16.467654
5,1.849501,7.662645,5.544457,2.008992,16.931303
6,1.759604,7.224562,5.578504,1.889394,16.426477
7,1.662329,6.846866,5.726762,1.67786,18.618896
8,1.847861,7.425272,5.771191,1.9354,16.286983
9,1.883241,7.343225,5.700526,1.976209,15.192967


In [11]:
df.pyrochem.elements

Unnamed: 0,Ni,Ti,La,Lu,Te
0,313.245685,149.330266,117.027877,77.029714,13.65715
1,315.263957,130.808577,121.702827,80.410158,14.288939
2,283.236364,160.895403,119.091654,80.107607,16.254629
3,317.596513,136.269178,116.929177,78.694511,13.469901
4,327.623953,134.538619,120.285869,77.824026,12.820223
5,307.457489,133.573533,123.913256,80.864561,14.222178
6,313.777911,144.190118,120.94659,78.503193,13.796774
7,295.671929,154.964956,112.604728,76.161235,15.270037
8,314.511146,139.492888,120.125835,79.590426,13.612632
9,316.259645,140.002246,128.340952,80.312341,14.123132


In [12]:
df.pyrochem.REE

Unnamed: 0,La,Lu
0,117.027877,77.029714
1,121.702827,80.410158
2,119.091654,80.107607
3,116.929177,78.694511
4,120.285869,77.824026
5,123.913256,80.864561
6,120.94659,78.503193
7,112.604728,76.161235
8,120.125835,79.590426
9,128.340952,80.312341


In [None]:
df.pyrochem.compositional

In [None]:
df.pyrochem.isotope_ratios

Notably, these dataframe accessors can also be used to re-assign values back to the dataframe. Here we transform element components to wt% equivalents by dividing by 10000, and note that the change has been incorporated into our dataframe:

In [None]:
df.pyrochem.elements /= 10000
df.pyrochem.elements

If you're just after a list of the relevant column names, there are corresponding `list_*` properties for that too:

In [None]:
df.pyrochem.list_oxides
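Conceptually, these listings amount to filtering the column names against known sets of component names. The sketch below illustrates that idea in plain Python; the three sets are small hand-written stand-ins (an assumption for illustration), and this is not a description of how `pyrolite` implements its accessors.

```python
# Hand-written stand-ins for the much larger indexes a library would hold.
KNOWN_OXIDES = {"CaO", "MgO", "SiO2", "FeO", "Na2O"}
KNOWN_ELEMENTS = {"Ni", "Ti", "La", "Lu", "Te"}
KNOWN_REE = {"La", "Lu"}

columns = ["CaO", "MgO", "SiO2", "FeO", "Na2O",
           "Ni", "Ti", "La", "Lu", "Te", "Mg/Fe", "Sr87/Sr86"]

# Preserve the dataframe's column order while filtering.
oxides = [c for c in columns if c in KNOWN_OXIDES]
elements = [c for c in columns if c in KNOWN_ELEMENTS]
ree = [c for c in columns if c in KNOWN_REE]

print(oxides)
print(elements)
print(ree)
```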

## Unit Scales

While you can transform element and oxide abundance units easily when you remember the relative scales, `pyrolite` provides some functions so that you don't have to rely on your memory. Here we create a copy of the dataframe and within it revert the change we made above, so these should be the original ppm values. This method provides an easy way to explicitly declare your intention when changing units, and makes sure the relative scales are correct!

In [None]:
df.pyrochem.elements.pyrochem.scale('wt%', 'ppm') # wt% to ppm
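The arithmetic behind a unit change like this can be sketched with a small lookup table. The example below is a hypothetical illustration (the `PARTS_PER` table and `scale_factor` helper are made up for this sketch and are not `pyrolite` functions); it expresses each unit as "parts per N" and takes the ratio.

```python
import math

# Each unit expressed as "parts per N" of the whole (assumed set of units).
PARTS_PER = {"wt%": 100, "ppm": 1_000_000, "ppb": 1_000_000_000}


def scale_factor(from_unit, to_unit):
    """Factor by which to multiply a value in `from_unit` to express it in `to_unit`."""
    return PARTS_PER[to_unit] / PARTS_PER[from_unit]


wt_to_ppm = scale_factor("wt%", "ppm")
ni_wt = 0.0313                # a made-up Ni abundance in wt%
ni_ppm = ni_wt * wt_to_ppm    # the same abundance expressed in ppm

print(wt_to_ppm, ni_ppm)
```

Naming both units explicitly makes the direction of the conversion obvious, which is the same benefit the `scale` call above is aiming for.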

## Converting Chemical Components 

`pyrolite` provides some straightforward methods to calculate element-oxide conversions (e.g. to transform Ti abundance to TiO2 abundance), assuming that the system is open to oxygen (i.e. in this case the extra oxygen will be added to the composition). This interface also allows the user to quickly add ratios and specify redox pairs at the same time. For example, we can transform a copy of our dataframe to include extra ratios and change some of our oxide components to elements:

In [None]:
df.pyrochem.convert_chemistry(
    to=["MgO", "SiO2", "FeO", "Ca", "Te", "Na", "Na/Te", "MgO/SiO2"]
)
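For reference, the arithmetic behind an element-to-oxide conversion is a ratio of molar masses. The sketch below works this through for Ti to TiO2 and Na to Na2O; the atomic masses are approximate standard values typed in by hand, and `element_to_oxide_factor` is a hypothetical helper for illustration, not a `pyrolite` function.

```python
import math

# Approximate standard atomic masses in g/mol (entered by hand for illustration).
ATOMIC_MASS = {"O": 15.999, "Ti": 47.867, "Na": 22.990}


def element_to_oxide_factor(element, n_cations, n_oxygens):
    """Mass of oxide produced per unit mass of element (system open to oxygen)."""
    cation_mass = n_cations * ATOMIC_MASS[element]
    oxide_mass = cation_mass + n_oxygens * ATOMIC_MASS["O"]
    return oxide_mass / cation_mass


ti_to_tio2 = element_to_oxide_factor("Ti", 1, 2)   # Ti -> TiO2
na_to_na2o = element_to_oxide_factor("Na", 2, 1)   # 2 Na -> Na2O

ti_wt = 0.9                   # a made-up Ti abundance in wt%
tio2_wt = ti_wt * ti_to_tio2  # equivalent TiO2 abundance in wt%

print(round(ti_to_tio2, 4), round(na_to_na2o, 4), round(tio2_wt, 3))
```

Because oxygen is added, the oxide abundance is always larger than the element abundance it was derived from; going the other way (oxide to element) uses the reciprocal factor.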

In a similar way, we can also specify the molar speciation for redox species (so far just iron; others could be incorporated if they'll be useful). Here we adjust the total iron within our compositions (currently specified as FeO) to have a $\mathrm{Fe}^{2+}/\mathrm{Fe}^{3+}$ ratio of 9:1 (roughly what you might expect from a normal mantle-derived magma):

In [None]:
df.pyrochem.convert_chemistry(to=[{"FeO": 0.9, "Fe2O3": 0.1}])
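To show what such a split means numerically, here is a rough sketch of the underlying mass balance. It assumes the fractions refer to the molar proportion of iron atoms assigned to each species (an assumption made for this sketch, not a statement about `pyrolite` internals), uses approximate atomic masses, and `split_iron` is a hypothetical helper.

```python
import math

# Approximate standard atomic masses in g/mol (entered by hand for illustration).
M_FE, M_O = 55.845, 15.999
M_FEO = M_FE + M_O            # one Fe per formula unit
M_FE2O3 = 2 * M_FE + 3 * M_O  # two Fe per formula unit


def split_iron(feo_total, fe2_fraction=0.9):
    """Split total iron (as FeO, wt%) into FeO and Fe2O3 by molar Fe fraction."""
    mol_fe = feo_total / M_FEO                          # moles of Fe per 100 g
    feo = mol_fe * fe2_fraction * M_FEO                 # Fe2+ kept as FeO
    fe2o3 = mol_fe * (1 - fe2_fraction) / 2 * M_FE2O3   # Fe3+ paired up as Fe2O3
    return feo, fe2o3


feo, fe2o3 = split_iron(10.0)  # 10 wt% total FeO, a made-up value
print(round(feo, 3), round(fe2o3, 3))
```

The two components sum to slightly more than the original total because Fe2O3 carries more oxygen per iron atom than FeO, consistent with treating the system as open to oxygen.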