# Geochemical Data Analytics

**Intro** [Start](0.0_StartHere.ipynb) • <mark>[Intro](1.1_Introduction.ipynb)</mark>
<br> **pyrolite** [Geochem](2.1_pyroliteGeochem.ipynb) • [Visualisation](2.2_pyroliteVis.ipynb) • [alphaMELTS](2.3_pyroliteMELTS.ipynb) • [lambdas](2.4_lambdas.ipynb) <br> **Machine Learning** [Intro](4.0_MachineLearning.ipynb) • [Features](4.1_Features.ipynb) • [High-D Vis](4.3_HighDVis.ipynb) • [Classification](4.4_Classification.ipynb) • [Regression & Prediction](4.5_Regression.ipynb) • [Clustering](4.6_Clustering.ipynb) <br> **GitHub** [geochem4nickel](https://github.com/morganjwilliams/geochem4nickel) • [pyrolite](https://github.com/morganjwilliams/pyrolite)

#### <i class="fa fa-twitter" aria-hidden="true"><a href="https://twitter.com/metasomite" style="font-family:Courier New,Courier,Lucida Sans Typewriter,Lucida Typewriter,monospace;"> @metasomite</a></i>

The presentation section covers:

* **Why?** Motivations behind the approach

* What does '**data driven**' mean?

* How can we derive **constraints on geology** from geochemical data?

This section of the workshop focuses on geochemical data analysis tools, and particularly on taking a data science approach to geochemical problems. The latter part of the section covers some tools and includes examples of applying them in practice.

1. Data Driven Understanding
1. Adopting a Programmatic Approach
1. Data Sources and Data Mining
1. An Example: Machine Tectonic Discrimination
1. pyrolite

## Data Driven Understanding from Geochemistry

### Data driven?

> Understanding derived from the data itself, <br>rather than our idea of what the data represents.

Geochemists love classifying, binning and 'butterfly collecting', but how many of these divisions make sense in the natural world (e.g. consider the Total Alkali - Silica diagram)? 

A data driven approach instead considers divisions according to the natural clustering of the data, where such clusters exist.

### Data Science in Exploration Geochemistry

Interpretation is required to understand the why and how, but there are many practical questions to ask of an exploration dataset:

* Is this more enriched/depleted in Y than expected?

* What rocks are most similar to X?

* Which chemical signatures are associated with mineralisation?

* Which features provide the most valuable information for prediction? At what scale?

* What setting did this form in?


### Quantifying Geochemical Context

Context is particularly important in geochemistry, because compositions are often expressed in relative terms: a measurement is most useful when you have something to compare it to.

* We discuss 'enrichment' and 'depletion', relative to some reference point.

* We normalise trace element chemistry to 'chondrite', 'primitive mantle' or 'MORB' to visualise the *relative* effects of geological processes
    * *'Let's ignore the effects of nucleosynthesis and planet formation for now..'*

* We use the term 'signature' to refer to the affinity or 'flavour' of geochemical compositions (e.g. MORB, E-MORB, N-MORB, 'Chondritic')

What we infer from our geochemistry is commonly based on local changes or gradients (e.g. alteration haloes, mineral zoning) or similarity to some reference composition (e.g. basalts with 'ocean island basalt signatures').

In both of these instances, we express geochemistry relative to some reference point, but the relationship between our data and these reference compositions could often be better quantified. 
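As a rough illustration of expressing a composition relative to a reference point, the sketch below divides a sample by a reference composition and reports a log ratio for each element. The element list and all concentrations are placeholder numbers invented for this example (they are not published chondrite, primitive mantle or MORB values), and `normalise` is a hypothetical helper rather than part of any library.

```python
import numpy as np

# Placeholder concentrations in ppm, invented for illustration only;
# these are NOT published reference values.
elements = ["La", "Ce", "Nd", "Yb"]
reference = np.array([0.25, 0.60, 0.50, 0.20])  # stand-in 'reference' composition
sample = np.array([5.0, 12.0, 7.5, 0.1])        # stand-in sample composition


def normalise(sample, reference):
    """Return the element-wise ratio of a sample to a reference composition."""
    sample = np.asarray(sample, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return sample / reference


ratios = normalise(sample, reference)
# log10 ratio > 0 means enriched, < 0 means depleted, relative to the reference
log_enrichment = np.log10(ratios)

for el, r, le in zip(elements, ratios, log_enrichment):
    state = "enriched" if le > 0 else "depleted"
    print(f"{el}: {r:.1f}x reference ({state}, log10 ratio = {le:+.2f})")
```

Working with log ratios puts enrichment and depletion on a symmetric scale: a factor of two either way is the same distance from zero.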

To highlight some of these issues, consider the second case, which is related to the problem of tectonic discrimination (which will come up a bit later):

* What is an 'ocean island basalt signature'?
    * *Ocean island basalts have a range of compositions, do you take the average? Median?*
    * *How close does it have to be to have an 'ocean island basalt signature'?*

* What if it also has an 'oceanic plateau' like signature?
    * *What's the probability that this is an ocean island basalt?*
    * *With what confidence can you say that this is an ocean island basalt, and not an oceanic plateau?*
    * *If you had analysed more elements, would you have the same confidence in the 'OIB signature'?*
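One way to start putting numbers on these questions is to treat each reference suite as a distribution and ask how probable a sample is under each. The sketch below is a minimal example under strong assumptions: both suites are synthetic, each is modelled as a single multivariate normal in a two-variable log-ratio space, and the prior probabilities are equal. The names `suite_a`, `suite_b`, `log_gaussian_density` and `membership_probabilities` are hypothetical, and none of the numbers represent real ocean island or oceanic plateau basalts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for two reference suites in a two-variable log-ratio space.
# Means and covariances are invented for illustration, not real geochemical data.
suite_a = rng.multivariate_normal([1.0, 0.5], [[0.10, 0.03], [0.03, 0.08]], size=500)
suite_b = rng.multivariate_normal([0.4, 0.2], [[0.08, 0.02], [0.02, 0.10]], size=500)


def log_gaussian_density(x, data):
    """Log-density of x under a multivariate normal fitted to the rows of data."""
    mean = data.mean(axis=0)
    cov = np.cov(data, rowvar=False)
    diff = x - mean
    mahalanobis_sq = diff @ np.linalg.solve(cov, diff)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (mahalanobis_sq + logdet + len(mean) * np.log(2 * np.pi))


def membership_probabilities(x, suites):
    """Posterior probability of each suite for sample x, assuming equal priors."""
    log_like = np.array([log_gaussian_density(x, s) for s in suites])
    log_like -= log_like.max()  # stabilise the exponentials
    weights = np.exp(log_like)
    return weights / weights.sum()


unknown = np.array([0.7, 0.35])  # hypothetical sample lying between the two suites
p_a, p_b = membership_probabilities(unknown, [suite_a, suite_b])
print(f"P(suite A) = {p_a:.2f}, P(suite B) = {p_b:.2f}")
```

A sample between the two suites gets a split probability instead of a hard label. Adding more measured variables changes the fitted distributions, and so changes the confidence in any one 'signature'.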

### Beyond Three Dimensions: Multidimensional Analysis

One of the historical limitations for geochemical classification and regression problems has been the need to graphically represent data in 2D.


Multidimensional geochemical data analysis can overcome previous human-centric limitations and inaccuracies.
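As a rough illustration of why a two-dimensional view can hide groups that are otherwise separable, the sketch below builds two synthetic classes that are identical in their first two variables and differ only in a third. It then compares a simple nearest-centroid classification using two versus three variables. The data, the class offsets and the helper `nearest_centroid_accuracy` are assumptions made for this example, and accuracy is measured on the same data used to compute the centroids, so it is optimistic.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Two synthetic 'settings' that are identical in the first two variables
# and differ only in the third; all values are invented for illustration.
class_a = rng.normal(loc=[0.0, 0.0, 0.0], scale=1.0, size=(n, 3))
class_b = rng.normal(loc=[0.0, 0.0, 3.0], scale=1.0, size=(n, 3))

X = np.vstack([class_a, class_b])
y = np.repeat([0, 1], n)


def nearest_centroid_accuracy(X, y):
    """Fraction of samples whose nearest class centroid matches their label."""
    centroids = np.array([X[y == label].mean(axis=0) for label in (0, 1)])
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return float((distances.argmin(axis=1) == y).mean())


acc_2d = nearest_centroid_accuracy(X[:, :2], y)  # what a 2D diagram can 'see'
acc_3d = nearest_centroid_accuracy(X, y)         # using all three variables
print(f"2D accuracy: {acc_2d:.2f}, 3D accuracy: {acc_3d:.2f}")
```

With only the first two variables the classifier does little better than guessing, while adding the third separates most samples. Real settings are rarely this clean, but the same logic applies when the discriminating information sits in elements that a given diagram does not plot.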

#### Trace Element Discrimination Diagrams

* Pearce's Th-Yb-Nb diagram offers some insight into the issues with using 2D classification schemes.

<div class="row">
<div class="column">
    <img src="../img/Smithies2018Fig1.png" width="30%" style="float: left; margin: 0px 15px 15px 0px;"/>
    </div>
<div class="column">
    <img src="../img/Li2015Fig9.png" width="30%" style="float: left; margin: 0px 15px 15px 0px;"/>
    </div>
</div> 

* The significant degree of overlap in these two dimensions between different tectonic settings renders this approach futile for generalised discrimination.

Figures from [Smithies (2018)] (after [Pearce (2008)]) and [Li et al. (2015)].

[Li et al. (2015)]: https://doi.org/10.1016/j.lithos.2015.06.022 "Li, C., Arndt, N.T., Tang, Q., Ripley, E.M., 2015. Trace element indiscrimination diagrams. Lithos 232, 76–83. "

[Smithies (2018)]: https://doi.org/10.1016/j.epsl.2018.01.034 "Smithies, R.H., Ivanic, T.J., Lowrey, J.R., Morris, P.A., Barnes, S.J., Wyche, S., Lu, Y.-J., 2018. Two distinct origins for Archean greenstone belts. Earth and Planetary Science Letters 487, 106–116."

[Pearce (2008)]: https://doi.org/10.1016/j.lithos.2007.06.016 "Pearce, J.A., 2008. Geochemical fingerprinting of oceanic basalts with applications to ophiolite classification and the search for Archean oceanic crust. Lithos 100, 14–48."

### Summary

* Our reference points are often distributions
* Classification is often not straightforward
* Some of these issues are due to examining a limited set of components

## Adopting a Programmatic Approach

### Is it worth the effort?

* We make fewer errors
> "One in five genetics papers contains errors thanks to Microsoft Excel" [DOI:10.1186/s13059-016-1044-7](https://doi.org/10.1186/s13059-016-1044-7)

* Using a code-based approach to data analysis means that we can
    * effectively 'version' the process and 
    * identify errors after the fact
    
* Our analysis is reproducible

* We can develop new tools and integrate them into our own workflows
* We can automate the repetitive work, and use analysis for decision support
* We *can* become more productive

* We can better quantify uncertainty
* We can quantify how well our data support our models
* We can test and develop models iteratively
* It can change how we see the system as a whole
    * *e.g. beyond 2D and 3D diagrams to multidimensional analysis*

* **Changes the questions we ask**
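To make the point about quantifying uncertainty concrete, the sketch below estimates a percentile bootstrap confidence interval for the median of a set of concentrations. It is a minimal example on synthetic data: the log-normal parameters are arbitrary, and `bootstrap_ci` is a hypothetical helper written for this illustration, not a library function.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic, log-normally distributed 'concentrations' (ppm); not real assay data.
concentrations = rng.lognormal(mean=3.0, sigma=0.5, size=200)


def bootstrap_ci(values, statistic=np.median, n_resamples=2000, level=0.95, rng=rng):
    """Percentile bootstrap confidence interval for a statistic of a 1D sample."""
    values = np.asarray(values, dtype=float)
    estimates = np.empty(n_resamples)
    for i in range(n_resamples):
        # Resample with replacement to mimic drawing a new dataset of the same size
        resample = rng.choice(values, size=values.size, replace=True)
        estimates[i] = statistic(resample)
    alpha = (1.0 - level) / 2.0
    return np.quantile(estimates, [alpha, 1.0 - alpha])


low, high = bootstrap_ci(concentrations)
print(f"median = {np.median(concentrations):.1f} ppm, 95% CI = [{low:.1f}, {high:.1f}] ppm")
```

The same function works for other statistics by passing, for example, `statistic=np.mean`. Doing this for every element in a spreadsheet would be tedious, whereas in code it is a loop.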

Using open source software, and adopting open science practices:

* Can socialise data, ideas, methods, code and analyses
* The repeatable paper: share your data and analysis
* Build in interoperability and flexibility (open data formats, common standards)
* Build a community and develop consensus on **best practice** in a new era of geoscience

### Summary 

* Gaining a few new skills and writing a bit of code could help you get more value out of your data.
* It will take a bit of learning and won't happen overnight, but there's lots to gain.

## Data Sources and Data Mining

Next: [pyrolite: Python for Geochemistry](2.1_pyroliteGeochem.ipynb)

Specificity: could you be sure someone else means the same thing?

Repeatability: would someone else come to the same conclusion?

### How do you like your geochemical data distributed..?

Acknowledging and accounting for the log-normally distributed nature of geochemical data can help deal with some spurious correlation, and allow for more robust geochemical statistics.

<img src="https://pyrolite.readthedocs.io/en/develop/_images/CompositionalDistributions.png" width="45%" style="float: left; margin: 0px 15px 15px 0px;"/>
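The effect of a log-normal distribution on simple statistics can be sketched with synthetic data. The example below draws log-normal 'concentrations' with arbitrary parameters, compares the arithmetic mean with the geometric mean and the median, and checks the skewness before and after a log transform. It illustrates only the distributional point, not the treatment of spurious correlation, and `skewness` is a small helper defined here for the purpose.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic log-normal 'concentrations' (ppm); parameters chosen for illustration only.
values = rng.lognormal(mean=2.0, sigma=1.0, size=5000)

arithmetic_mean = values.mean()
geometric_mean = np.exp(np.log(values).mean())  # mean taken in log space
median = np.median(values)


def skewness(x):
    """Sample skewness (third standardised moment)."""
    centred = x - x.mean()
    return float((centred**3).mean() / centred.std() ** 3)


print(
    f"arithmetic mean = {arithmetic_mean:.1f}, "
    f"geometric mean = {geometric_mean:.1f}, median = {median:.1f}"
)
print(
    f"skewness: raw = {skewness(values):.2f}, "
    f"log-transformed = {skewness(np.log(values)):.2f}"
)
```

For data like these, the arithmetic mean is pulled upwards by the long tail, while the geometric mean stays close to the median. After the log transform the values are roughly symmetric, which is the setting most standard statistics assume.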