# 2. Data Aggregation
Compiled by [Morgan Williams](mailto:morgan.williams@csiro.au) for C3DIS 2018 

As with 

In [2]:
%matplotlib inline
%load_ext autoreload
%autoreload 2
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

sys.path.insert(0, './src')
from geochem import *
from text_accessories import titlecase



## Unification: Standardisation and Aggregation of Columns

A number of complications arise due to database format and academic norms:
* Geochemical data come in a variety of formats, with multiple valid representations. For example, compositions typically use oxides (e.g. $SiO_2$) over elements (e.g. $Si$) for major components, whereas trace elements are almost exclusively reported in the elemental form.
* Compositions can validly be represented as molar (e.g. mol%) and weight-based (e.g. wt%, parts per million) quantities.
* Slight differences between approaches, and the presence of 'minor elements' (which can be either significant enough to express as oxides, or insignificant enough to express as elements), lead to databases which can contain values for multiple forms of the same element.
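
To make the relationship between these representations concrete, the sketch below converts an oxide abundance to its elemental equivalent and from wt% to ppm. It is illustrative only: the atomic masses are rounded, and `oxide_to_element_factor` and `wt_pct_to_ppm` are hypothetical helpers, not functions from the `geochem` module used in this notebook.

```python
# Approximate standard atomic masses (g/mol), rounded for illustration.
ATOMIC_MASS = {"Si": 28.085, "Mg": 24.305, "O": 15.999}


def oxide_to_element_factor(cation, n_cation, n_oxygen):
    """Mass fraction of the cation in an oxide, e.g. SiO2 has 1 Si and 2 O."""
    cation_mass = n_cation * ATOMIC_MASS[cation]
    oxide_mass = cation_mass + n_oxygen * ATOMIC_MASS["O"]
    return cation_mass / oxide_mass


def wt_pct_to_ppm(wt_pct):
    """1 wt% corresponds to 10,000 ppm by mass."""
    return wt_pct * 10_000


sio2_wt = 50.0  # wt% SiO2 (made-up value)
si_wt = sio2_wt * oxide_to_element_factor("Si", 1, 2)
si_ppm = wt_pct_to_ppm(si_wt)
print(f"{sio2_wt} wt% SiO2 = {si_wt:.2f} wt% Si = {si_ppm:.0f} ppm Si")
```

The same factor, inverted, converts an elemental abundance back to its oxide form.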

In [63]:
df = pd.read_csv(r'./data/examples/processingexample_1.csv')
df = df.loc[:, [c for c in df.columns if not ("_" in c)]]
df.columns = [titlecase(h, abbrv=['ID', 'IGSN'], split_on=r"[\s_-]+") for h in df.columns]  # Establish naming convention for headers (raw string avoids an invalid-escape warning for \s)
element_translation = {e.upper(): e for e in common_oxides(output='str') + common_elements(output='str')}
df.columns = [element_translation[h.upper()] if h.upper() in element_translation else h for h in df.columns]
print(df.columns.values)

['SampleID' 'IGSN' 'Source' 'Reference' 'CruiseID' 'Latitude' 'Longitude'
 'LocPrec' 'MinAge' 'Age' 'MaxAge' 'Method' 'Material' 'Type'
 'Composition' 'RockName' 'Mineral' 'SiO2' 'TiO2' 'Al2O3' 'Fe2O3' 'Fe2O3T'
 'FeO' 'FeOT' 'MgO' 'CaO' 'U238' 'Na2O' 'K2O' 'P2O5' 'MnO' 'LOI' 'H2O'
 'Cr2O3' 'La' 'NiO' 'Caco3' 'Ce' 'Pr' 'Nd' 'Sm' 'Eu' 'Gd' 'Tb' 'Dy' 'Ho'
 'Er' 'Tm' 'Yb' 'Lu' 'Li' 'Be' 'B' 'C' 'CO2' 'F' 'Th230' 'Cl' 'K' 'Ca'
 'Mg' 'Sc' 'Ti' 'V' 'Fe' 'Cr' 'Mn' 'Co' 'Ni' 'Zn' 'Cu' 'Zr' 'Ga' 'Ra226'
 'Pa231' 'Th232' 'Ba' 'W' 'Au' 'Hg' 'Ta' 'Sb' 'Se' 'Sn' 'S' 'U' 'Re' 'I'
 'P' 'Y' 'Mo' 'Pd' 'Te' 'Pt' 'Hf' 'Ir' 'Pb' 'Indium' 'Ag' 'Th' 'Tl' 'As'
 'Rb' 'Al' 'Cs' 'Sr' 'Bi' 'Nb' 'Os' 'Cd' 'Quartz']


In [64]:
major_components = [i for i in common_oxides(output='str') if i in df.columns]
elemental_components = [i for i in common_elements(output='str') if i in df.columns]
metadata_headers = df.columns[:list(df.columns).index('SiO2')].values  # Everything before SiO2
print(metadata_headers)
print(major_components)

['SampleID' 'IGSN' 'Source' 'Reference' 'CruiseID' 'Latitude' 'Longitude'
 'LocPrec' 'MinAge' 'Age' 'MaxAge' 'Method' 'Material' 'Type'
 'Composition' 'RockName' 'Mineral']
['H2O', 'CO2', 'Na2O', 'MgO', 'Al2O3', 'SiO2', 'P2O5', 'K2O', 'CaO', 'TiO2', 'Cr2O3', 'MnO', 'FeO', 'Fe2O3', 'NiO', 'FeOT', 'Fe2O3T', 'LOI']


Redox-sensitive elements are commonly reported in one or more oxidation states (e.g. Fe is commonly listed as Fe, FeO, Fe2O3, FeOT or Fe2O3T). Where multiple oxidation states are listed, they carry geological information (i.e. the oxidation state of the rock, which could be extracted and added as a secondary feature). However, this is not common, and effective comparison with other rocks requires unification to one state (typically either FeOT or Fe2O3T).

In [65]:
df = recalculate_redox(df, to_oxidised=False, renorm=False)
df = renormalise(df, components=major_components)
major_components = [i for i in common_oxides(output='str') if i in df.columns]
print(f"We've now adjusted the major oxide components to: {major_components}")

We've now adjusted the major oxide components to: ['H2O', 'CO2', 'Na2O', 'MgO', 'Al2O3', 'SiO2', 'P2O5', 'K2O', 'CaO', 'TiO2', 'Cr2O3', 'MnO', 'NiO', 'FeOT', 'LOI']
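
As a rough illustration of what unification to a single iron state involves, the sketch below converts Fe2O3 to its FeO equivalent using molar masses and sums the result into FeOT. This is an assumption about the arithmetic, not the implementation of `recalculate_redox`; `to_total_feo` is a hypothetical helper, and missing values are treated as zero here for simplicity.

```python
import pandas as pd

# Approximate molar masses (g/mol)
M_FEO = 55.845 + 15.999             # FeO
M_FE2O3 = 2 * 55.845 + 3 * 15.999   # Fe2O3
# One mole of Fe2O3 holds the same iron as two moles of FeO (factor is about 0.8998)
FE2O3_TO_FEO = 2 * M_FEO / M_FE2O3


def to_total_feo(frame):
    """Return a copy with FeO and Fe2O3 merged into one FeOT column (hypothetical helper)."""
    out = frame.copy()
    feo = out["FeO"].fillna(0.0)      # assumption: missing means absent
    fe2o3 = out["Fe2O3"].fillna(0.0)
    out["FeOT"] = feo + fe2o3 * FE2O3_TO_FEO
    return out.drop(columns=["FeO", "Fe2O3"])


example = pd.DataFrame({"SiO2": [50.0], "FeO": [8.0], "Fe2O3": [2.0]})
converted = to_total_feo(example)
print(converted)
```

Because oxygen is lost in the conversion, the row total changes slightly, which is why a renormalisation step follows the redox recalculation above.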


Hydration is one of the most common reactions encountered in geology. In some rocks, this can be approximated as simple addition of water. However, because the data are compositional, the abundances of the other components are altered by this process. As a first pass, volatile components (e.g. H2O, CO2, LOI) can be removed to readjust the bulk-rock composition:

In [66]:
df = devolatilise(df, exclude=['H2O', 'H2O_PLUS', 'H2O_MINUS', 'CO2', 'LOI'], renorm=False)
df = renormalise(df, components=major_components)
major_components = [i for i in common_oxides(output='str') if i in df.columns]
print(f"We've now adjusted the major oxide components to: {major_components}")

We've now adjusted the major oxide components to: ['Na2O', 'MgO', 'Al2O3', 'SiO2', 'P2O5', 'K2O', 'CaO', 'TiO2', 'Cr2O3', 'MnO', 'NiO', 'FeOT']
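
The effect of removing volatiles and re-closing the composition can be sketched with plain pandas. This is a minimal sketch assuming closure to 100 wt%; `drop_volatiles_and_close` is a hypothetical stand-in and may differ from how `devolatilise` and `renormalise` behave.

```python
import pandas as pd


def drop_volatiles_and_close(frame, volatiles, components, total=100.0):
    """Drop volatile columns, then rescale the remaining components to sum to `total`."""
    out = frame.drop(columns=[c for c in volatiles if c in frame.columns])
    kept = [c for c in components if c in out.columns]
    row_sums = out[kept].sum(axis=1)
    out[kept] = out[kept].div(row_sums, axis=0) * total
    return out


# Made-up hydrated composition (wt%)
hydrated = pd.DataFrame({"SiO2": [45.0], "MgO": [40.0], "H2O": [10.0], "LOI": [5.0]})
anhydrous = drop_volatiles_and_close(
    hydrated,
    volatiles=["H2O", "CO2", "LOI"],
    components=["SiO2", "MgO", "H2O", "LOI"],
)
print(anhydrous)
```

In this example the SiO2 and MgO values both increase after renormalisation even though nothing was added, which is the compositional effect described above.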


Furthermore, elements are commonly present in more than one form; to compare them they should be aggregated to a single series:

In [67]:
mutual_elements = check_multiple_cation_inclusion(df)
print(f'Elements in both oxide and trace form: {mutual_elements}')

for el in mutual_elements:
    df = aggregate_cation(df, el, form='oxide', unit_scale=1/10000)

print(f'After aggregation: {check_multiple_cation_inclusion(df)}')


Elements in both oxide and trace form: [Mg, Al, P, K, Ca, Ti, Cr, Mn, Ni]
After aggregation: []
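
The `unit_scale=1/10000` argument reflects the conversion from ppm to wt%. The sketch below shows one way an elemental series could be folded into its oxide column for a single element. It assumes the elemental values are in ppm, the oxide values are in wt%, and that an existing oxide value takes precedence; `merge_mg_into_mgo` is a hypothetical helper, and `aggregate_cation` may resolve overlaps differently.

```python
import numpy as np
import pandas as pd

M_MG, M_O = 24.305, 15.999        # approximate atomic masses (g/mol)
MG_TO_MGO = (M_MG + M_O) / M_MG   # mass of MgO per unit mass of Mg (about 1.658)


def merge_mg_into_mgo(frame, unit_scale=1 / 10000):
    """Fill gaps in MgO (wt%) using Mg (ppm), then drop the Mg column."""
    out = frame.copy()
    from_element = out["Mg"] * unit_scale * MG_TO_MGO  # ppm -> wt% -> oxide
    out["MgO"] = out["MgO"].fillna(from_element)       # assumption: keep reported oxide
    return out.drop(columns=["Mg"])


example = pd.DataFrame({"MgO": [8.0, np.nan], "Mg": [np.nan, 30000.0]})
merged = merge_mg_into_mgo(example)
print(merged)
```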


Ideally, chemical components would be present in the same form throughout a database, and in the same units. Here we stop short of this for simplicity.

In [68]:
geochemical_components = [comp for comp in common_oxides(output='str') + common_elements(output='str') if comp in df.columns]
print(f"Final components: {geochemical_components}")

Final components: ['Na2O', 'MgO', 'Al2O3', 'SiO2', 'P2O5', 'K2O', 'CaO', 'TiO2', 'Cr2O3', 'MnO', 'NiO', 'FeOT', 'Li', 'Be', 'B', 'C', 'F', 'S', 'Cl', 'Sc', 'V', 'Fe', 'Co', 'Cu', 'Zn', 'Ga', 'As', 'Se', 'Rb', 'Sr', 'Y', 'Zr', 'Nb', 'Mo', 'Pd', 'Ag', 'Cd', 'Sn', 'Sb', 'Te', 'I', 'Cs', 'Ba', 'La', 'Ce', 'Pr', 'Nd', 'Sm', 'Eu', 'Gd', 'Tb', 'Dy', 'Ho', 'Er', 'Tm', 'Yb', 'Lu', 'Hf', 'Ta', 'W', 'Re', 'Os', 'Ir', 'Pt', 'Au', 'Hg', 'Tl', 'Pb', 'Bi', 'Th', 'U']


## Aggregation of Records

Where multiple records correspond to the same sample (as is common in geochemistry; major components are commonly measured separately to trace components), it may be valid to aggregate them:
1. If the records contain the same components, the aggregation can be performed using a weighted mean of the log-transformed components.
2. If the records contain mutually exclusive components, they may be combined subject to an adjusted closure parameter.
3. If neither of these is the case, the situation becomes more complex:
    * If the records contain just one common component, the aggregation is equivalent to 'internal standardisation', a technique commonly used in geochemical analysis. The records are scaled such that the common component is equivalent (typically scaled to the most accurate value), and from there the aggregation proceeds as in situation 2.
    * If the records contain more than one common component, an alternative method is needed. In this case, 'best fit' scaling is the simplest approach where additional information is not present.
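
Situations 1 and 3 (single common component) above can be sketched as follows. This is a minimal illustration under stated assumptions: records are strictly positive (the log transform is undefined for zeros), closure is to 100, and `close`, `weighted_log_mean` and `internal_standardise` are hypothetical helpers rather than part of the `geochem` module.

```python
import numpy as np


def close(x, total=100.0):
    """Rescale a composition so its components sum to `total`."""
    x = np.asarray(x, dtype=float)
    return x / x.sum(axis=-1, keepdims=True) * total


def weighted_log_mean(records, weights):
    """Situation 1: weighted mean of log-transformed components, then re-closed."""
    records = np.asarray(records, dtype=float)   # shape (n_records, n_components)
    weights = np.asarray(weights, dtype=float)   # one weight per record
    log_mean = np.average(np.log(records), axis=0, weights=weights)
    return close(np.exp(log_mean))


def internal_standardise(reference, other, common):
    """Situation 3, one shared component: scale `other` to match `reference`, then merge."""
    scale = reference[common] / other[common]
    merged = dict(reference)
    merged.update({k: v * scale for k, v in other.items() if k != common})
    return merged


# Two analyses of the same sample with identical components (made-up values)
combined = weighted_log_mean([[60.0, 40.0], [40.0, 60.0]], weights=[1.0, 1.0])

# Major and trace records sharing only CaO (made-up values, same units)
majors = {"SiO2": 50.0, "MgO": 8.0, "CaO": 11.0}
traces = {"CaO": 10.0, "Sr": 0.02, "Ba": 0.01}
merged_record = internal_standardise(majors, traces, common="CaO")
print(combined, merged_record)
```

After merging, the combined record would still need its closure adjusted, as noted for situation 2.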