# 5. The Value of Domain Knowledge
Compiled by [Morgan Williams](mailto:morgan.williams@csiro.au) for C3DIS 2018 

Thus far, much of the content has concerned applied chemistry and statistics. Here we touch on some aspects of domain-specific knowledge that are valuable for geochemical data analysis. We also examine the potential for data to be used to validate existing domain methods and knowledge.

### Classification

*"All science is either physics or stamp collecting"* **Ernest Rutherford**

Geologists, like most other scientists who conduct observation-based investigations, have a habit of 'stamp collecting'. Methods to classify rocks provide geologists with a vocabulary with which to discuss and focus their investigations, but models developed in the past are rarely subjected to continued testing and refinement. Geological samples and derived geochemical data are commonly stored with pre-assigned classifications. Here we re-derive these classifications from their underlying models, and assess i) the quality of database classification information, and ii) the potential relevance of historical classification models in an era which is no longer data-limited.

Modern databases provide a foundation for i) testing historical classification schemes and ii) implementing new classification schemes based on inherent data relationships and statistical divisions.

In [4]:
%matplotlib inline
%load_ext autoreload
%autoreload 2
import os, sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from compositions import *
from geochem import *
from classification import Geochemistry
from alteration import *
from datasource import load_df



In [5]:
df = load_df('EarthChemGlobal.pkl')

In [7]:
cm = Geochemistry.TAS()
classnames = cm.clsf.fclasses + ['none']
df['TotalAlkali'] = df['Na2O'] + df['K2O']  # item assignment; attribute assignment would not create a column
df['TAS'] = cm.classify(df).astype('category')
df['TAScolors'] = df['TAS'].map(lambda x: classnames.index(x)) # Use the index in the list for a simple colormap
df['TASRock'] = df['TAS'].map(lambda x: cm.clsf.fields.get(x,{'names': ['N/A']}).get('names')[0])
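
The `cm.classify` call above assigns each sample to a field on the total alkali-silica (TAS) diagram. The internals of `Geochemistry.TAS` are not shown here, so the following is only a minimal sketch of the general idea, assuming a point-in-polygon test. The `fields` dictionary and the `classify_tas` function are hypothetical names, and the vertices are approximate values for two fields, included for illustration only.

```python
import numpy as np
from matplotlib.path import Path

# Approximate vertices (SiO2, Na2O + K2O in wt%) for two TAS fields.
# Illustrative only; not taken from the Geochemistry.TAS model.
fields = {
    'basalt': [(45, 0), (45, 5), (52, 5), (52, 0)],
    'basaltic andesite': [(52, 0), (52, 5), (57, 5.9), (57, 0)],
}

def classify_tas(sio2, total_alkali):
    """Return the name of the first field containing each point, else 'none'."""
    points = np.column_stack([sio2, total_alkali])
    labels = np.full(len(points), 'none', dtype=object)
    for name, vertices in fields.items():
        # Path treats the polygon as closed even without a repeated first vertex.
        inside = Path(vertices).contains_points(points)
        labels[inside & (labels == 'none')] = name
    return labels

labels = classify_tas(np.array([48.0, 55.0, 70.0]), np.array([3.0, 4.0, 8.0]))
```

Samples that fall outside every polygon keep the label `'none'`, which mirrors the extra `'none'` entry appended to `classnames` above.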

From here we can assess a range of questions regarding the data quality, and potentially the quality of classification methods.

Note: The following cell takes a while to run. Plotting hundreds of thousands of points is not matplotlib's strong suit.

In [None]:
fig, ax = plt.subplots(3, 2, figsize=(12, 12), sharex=True, sharey=True)
ax=ax.flat
for a in ax:
    a.set_ylabel('$Na_2O + K_2O$')
    a.set_xlabel('$SiO_2$')
    cm.add_to_axes(a, alpha=0.2, edgecolor='k', linewidth=1, zorder=-1)

filters = []
for rockname in ['BASALT', 'RHYOLITE']:
    filters.append((rockname, (df.loc[:, 'RockName'] == rockname) & (df.loc[:, 'Material'] == 'IGNEOUS')))
    
for composition in ['ULTRAMAFIC', 'FELSIC']:
    filters.append((composition, (df.loc[:, 'Composition'] == composition) & (df.loc[:, 'Material'] == 'IGNEOUS')))
    
for ty in ['PLUTONIC', 'VOLCANIC']:
    filters.append((ty, (df.loc[:, 'Type'] == ty) & (df.loc[:, 'Material'] == 'IGNEOUS'))) 
    
for ix, (name, filt) in enumerate(filters):
    ax[ix].annotate(name, xy=(0.1, 0.9), ha='left', xycoords=ax[ix].transAxes)
    ax[ix].scatter(df.loc[filt, 'SiO2'],
                   df.loc[filt, 'TotalAlkali'],
                   c=df.loc[filt, 'TAScolors'],
                   alpha=0.5, marker='D', s=8, cmap='tab20c')
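
The panels above compare database labels with model-derived fields visually. The same comparison can be summarised numerically with a cross-tabulation. The sketch below uses a small hypothetical table (`toy`) rather than the EarthChem data, and assumes the database and model labels can be matched by ignoring case, which may not hold for the real `RockName` and `TASRock` columns.

```python
import pandas as pd

# Hypothetical records: a database label and a TAS-derived label per sample.
toy = pd.DataFrame({
    'RockName': ['BASALT', 'BASALT', 'BASALT', 'RHYOLITE', 'RHYOLITE'],
    'TASRock':  ['Basalt', 'Basalt', 'Andesite', 'Rhyolite', 'Dacite'],
})

# Rows: database label; columns: model-derived label.
confusion = pd.crosstab(toy['RockName'], toy['TASRock'])

# Fraction of samples whose two labels agree once case is ignored.
agreement = (toy['RockName'].str.lower() == toy['TASRock'].str.lower()).mean()
```

Off-diagonal counts in `confusion` point to samples where either the stored label or the classification model deserves a closer look.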


### Give me a Map: Incorporating Reference Frames

Communicating the significance of data in a broader context requires some reference frame. This need is common to all scientific disciplines, and geologists in general address it relatively well.

The simplest case of such a reference frame in geochemistry is the composition of a known reference material, which could be a mineral, a particular rock type or our estimate for the average composition of the solar system. While such reference frames are present throughout the literature, they are rarely built into analysis tools, and the basic information needed to reproduce them commonly has to be reconstructed manually.
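
As a minimal sketch of what building a reference composition into an analysis could look like, the example below divides sample concentrations by a reference composition. The `reference` values are placeholders chosen for easy arithmetic and are not a published estimate; the `samples` table is likewise hypothetical.

```python
import pandas as pd

# Placeholder reference values (ppm); NOT a published estimate.
reference = pd.Series({'La': 0.5, 'Ce': 1.0, 'Nd': 0.8})

# Hypothetical sample compositions (ppm).
samples = pd.DataFrame(
    {'La': [5.0, 10.0], 'Ce': [12.0, 20.0], 'Nd': [8.0, 4.0]},
    index=['sample_1', 'sample_2'],
)

# Divide each column by the matching reference value (aligned on column name).
normalised = samples.div(reference, axis='columns')
```

Storing the reference as a labelled `Series` lets pandas align elements by name, so the column order of the sample table does not matter.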

Here we begin constructing a few reference frames for different purposes. The first is general orientation: where do our compositions lie relative to others (e.g. mineral compositions, average compositions globally)? To some extent, the classifications mentioned above can help with this. The second is a reference frame that illustrates the potential effects of a specific process (e.g. if I mixed two compositions, what would the range of expected compositions look like?).
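
The mixing question can be sketched with simple two-component mass balance, where each mixture is a weighted average of two end-members. The end-member compositions below are hypothetical, and the approach assumes concentrations expressed as mass fractions (wt%), for which mixing is linear.

```python
import numpy as np
import pandas as pd

# Hypothetical end-member compositions (wt%).
end_a = pd.Series({'SiO2': 50.0, 'MgO': 8.0, 'K2O': 0.5})
end_b = pd.Series({'SiO2': 72.0, 'MgO': 0.5, 'K2O': 4.5})

# Mass fraction of end-member A in each mixture, from 0 (pure B) to 1 (pure A).
fractions = np.linspace(0, 1, 5)

# One row per mixture: f * A + (1 - f) * B.
mixtures = pd.DataFrame(
    [f * end_a + (1 - f) * end_b for f in fractions],
    index=pd.Index(fractions, name='fraction_a'),
)
```

Plotting the rows of `mixtures` on any two-oxide diagram gives a straight mixing line between the end-members, which can serve as a reference frame for interpreting sample scatter.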

### Beyond Data QA

Controlling data quality is the first step, but for various reasons specific samples can be unsuitable for investigating specific geochemical features or processes. Here we note a group of simple alteration proxies which can be used to create data-derived features for use in data filtering/classification.

These proxies are intended to capture the effects of chemical alteration, and most are tied to specific mineralogical weathering processes. As a first pass we can visualise the distribution of these proxies across the database samples.

In [None]:
xvar = 'FeOT'
alteration_index_functions = [CIA, CIW, PIA, SAR, SiTiIndex, WIP]
lenfs = len(alteration_index_functions)
plotswide=3
unit = 2
nrows = -(-lenfs // plotswide)  # ceiling division: enough rows for every panel
fig, ax = plt.subplots(nrows, plotswide,
                       figsize=((1 + plotswide) * unit, (1 + nrows) * unit),
                       sharex=True)
ax = ax.flat
major_components = [i for i in df.columns if i in common_oxides(output=str)]
for ix, f in enumerate(alteration_index_functions):
    fname = f.__name__
    if fname not in df.columns:
        df[fname] = f(to_molecular(df.loc[:, major_components]))
    
    filt = (df.Material == 'IGNEOUS')
    ax[ix].annotate(fname, xy=(0.9, 0.9), ha='right', xycoords=ax[ix].transAxes)
    ax[ix].scatter(df.loc[filt, xvar], df.loc[filt, fname], alpha=0.01, color='0.5')
    ax[ix].set_xlabel(xvar)
    ax[ix].set_ylabel(fname)
    
ax[2].set_ylim((-100, 100))
ax[3].set_ylim((0, 100))
ax[0].set_xlim((0, 30))
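
For reference, the best known of these proxies is the Chemical Index of Alteration (CIA), commonly defined in molar proportions as 100 × Al2O3 / (Al2O3 + CaO* + Na2O + K2O), where CaO* is the silicate-hosted CaO. The sketch below is a standalone illustration and not necessarily how the `CIA` function imported above is implemented. The function name `cia_from_wt` and the `toy` data are hypothetical, and the sketch assumes all CaO is silicate-hosted (no carbonate or apatite correction).

```python
import pandas as pd

# Molar masses (g/mol) of the oxides used by the index.
MOLAR_MASS = {'Al2O3': 101.961, 'CaO': 56.077, 'Na2O': 61.979, 'K2O': 94.196}

def cia_from_wt(df_wt):
    """CIA = 100 * Al2O3 / (Al2O3 + CaO + Na2O + K2O), in molar proportions.

    Assumes all CaO is silicate-hosted (no carbonate/apatite correction).
    """
    # Convert wt% to molar proportions by dividing by each oxide's molar mass.
    mol = df_wt[list(MOLAR_MASS)].div(pd.Series(MOLAR_MASS), axis='columns')
    return 100 * mol['Al2O3'] / mol.sum(axis=1)

# Hypothetical compositions (wt%): one with all four oxides present,
# and one residue containing only Al2O3 among them.
toy = pd.DataFrame({'Al2O3': [15.0, 40.0], 'CaO': [5.0, 0.0],
                    'Na2O': [3.0, 0.0], 'K2O': [2.0, 0.0]})
cia = cia_from_wt(toy)
```

Values near 50 are typical of unaltered feldspar-bearing rocks, while values approaching 100 indicate near-complete loss of Ca, Na and K relative to Al.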