# Introduction: Geochemical Data Analytics

**Intro** [Start](0.0_StartHere.ipynb) • <mark>[Intro](1.1_Introduction.ipynb)</mark>
<br> **pyrolite** [Geochem](2.1_pyroliteGeochem.ipynb) • [Visualisation](2.2_pyroliteVis.ipynb) • [alphaMELTS](2.3_pyroliteMELTS.ipynb) <br>  **Comparative Geochem**  [Databases, Data Mining and Deriving Context](3.1_ComparativeGeochemData.ipynb)<br>  **Machine Learning** [Intro](4.0_MachineLearning.ipynb) • [Features](4.1_Features.ipynb) • [Dimensional Reduction](4.2_DimReduction.ipynb) • [High-D Vis](4.3_HighDVis.ipynb) • [Classification](4.4_Classification.ipynb) • [Regression & Prediction](4.5_Regression.ipynb) • [Clustering](4.6_Clustering.ipynb) <br>  **GitHub** [geochem4nickel](https://github.com/morganjwilliams/geochem4nickel) • [pyrolite](https://github.com/morganjwilliams/pyrolite)

#### <i class="fa fa-twitter" aria-hidden="true"><a href="https://twitter.com/metasomite" style="font-family:Courier New,Courier,Lucida Sans Typewriter,Lucida Typewriter,monospace;"> @metasomite</a></i>

The introductory section covers:

* **Why?** Motivations behind the approach

* What does '**data driven**' mean?

* How can we derive **constraints on geology** from geochemical data?

* What are good **exploration-style questions** which we can work towards answering with this approach, and where will this be less useful?

This section of the workshop focuses on geochemical data analysis tools, and particularly on taking a data science approach to geochemical problems. The latter part of the section covers some tools and includes examples of applying them in practice.

## A Programmatic Approach to Geochemistry


**Executive Summary**:
* While some geochemists have numerical tendencies, it's not a common trait. 
* Gaining a few new skills and writing a bit of code could help you get more value out of your data.
* It will take a bit of learning and won't happen overnight, but there's lots to gain (and I try to make it simple)

#### Is it worth the effort?

* We make fewer errors
> "One in five genetics papers contains errors thanks to Microsoft Excel" [DOI:10.1186/s13059-016-1044-7](https://doi.org/10.1186/s13059-016-1044-7)
* Our analysis is reproducible
* Codifying data analysis means that we can i) effectively 'version' the process and ii) identify errors after the fact

* Develop new tools, which you can integrate into your own workflows
* Automate the repetitive work, and use analysis for decision support
* We *can* become more productive

* We can better quantify uncertainty, and start to examine the confidence we should have in models
* Test hypotheses iteratively and build better models
* Can change how we see the system as a whole
    * *e.g. beyond 2D and 3D diagrams to multidimensional analysis*

* **Changes the questions we ask**

Using open source software, and adopting open science practices:

* Can socialise data, ideas, methods, code & analyses
* Build in interoperability & flexibility
* The repeatable paper: Share your data & analysis
* Build a community and develop consensus on **best practice** in a new era of geoscience

# Data Driven Understanding from Geochemistry

### Data driven?

> Understanding derived from the data itself, rather than our idea of what the data represents

Geochemists love classifying, binning and 'butterfly collecting', but how many of these divisions make sense in the natural world? For example, consider the field boundaries on the TAS (total alkali-silica) diagram. A data driven approach instead considers divisions according to the natural clustering of data (where it exists).
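
As a minimal sketch of what 'natural clustering' can mean in practice, the block below runs a plain k-means (Lloyd's algorithm) on a handful of made-up silica and total alkali values. The data, the choice of two clusters and the starting centroids are all assumptions for illustration; this is not the method used later in the workshop, and real data would usually be scaled or transformed first.

```python
import math

# Hypothetical (SiO2, Na2O + K2O) pairs in wt%; NOT real analyses.
samples = [
    (48.5, 2.9), (49.2, 3.1), (50.1, 2.7), (49.8, 3.4),
    (71.0, 8.1), (72.4, 7.8), (70.2, 8.5), (73.1, 8.0),
]

def k_means(points, initial_centroids, max_iter=100):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    centroids = list(initial_centroids)  # copy so the caller's list is untouched
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid (Euclidean distance).
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster empties
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Assumption: two clusters, seeded from the first and last samples.
labels, centroids = k_means(samples, [samples[0], samples[-1]])
print(labels)
print(centroids)
```

The groups here come from the data rather than from field boundaries drawn in advance, although the number of clusters is still a choice made by the analyst.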

### Quantifying Geochemical Context

Context is particularly important in geochemistry, because geochemical data is often expressed in relative terms: a composition is most useful when you have something else to compare it to.

* We discuss 'enrichment' and 'depletion', relative to some reference point.

* We normalise trace element chemistry to 'chondrite', 'primitive mantle' or 'MORB' to visualise the *relative* effects of geological processes
    * *'Let's ignore the effects of nucleosynthesis and planet formation for now..'*

* We use the term 'signature' to refer to the affinity or 'flavour' of geochemical compositions (e.g. MORB, E-MORB, N-MORB, 'Chondritic')
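
Normalisation to a reference composition is simply element-wise division. The sketch below shows the arithmetic using plain dictionaries; the `reference` values are placeholders invented for this example (not a published chondrite, primitive mantle or MORB composition), and the `normalise` helper is hypothetical.

```python
# Hypothetical reference values (ppm), placeholders only: substitute a published
# chondrite, primitive mantle or MORB composition before using this for real work.
reference = {"La": 0.25, "Ce": 0.60, "Nd": 0.45, "Sm": 0.15, "Yb": 0.16}

# Hypothetical sample (ppm).
sample = {"La": 10.0, "Ce": 21.0, "Nd": 11.25, "Sm": 3.0, "Yb": 1.6}

def normalise(sample, reference):
    """Divide each element by its reference value; skip elements with no reference."""
    return {el: value / reference[el] for el, value in sample.items() if el in reference}

normalised = normalise(sample, reference)

# A normalised ratio such as (La/Yb)_N summarises the slope of the pattern.
la_yb_n = normalised["La"] / normalised["Yb"]

for element, value in normalised.items():
    print(f"{element}: {value:.1f} x reference")
print(f"(La/Yb)_N = {la_yb_n:.2f}")
```

Values above 1 are 'enriched' and values below 1 are 'depleted', but only with respect to the chosen reference; changing the reference changes every number.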

What we infer based on our geochemistry is commonly based on local changes (e.g. alteration haloes) or similarity to some reference composition (e.g. basalts with 'ocean island basalt signatures'). In both of these instances, we express geochemistry relative to some reference point, but the relationship between our data and these reference compositions could often be better quantified. 

To highlight some of these issues, consider the second case, which is related to the problem of tectonic discrimination (to be found later in the practical section):
* What is an 'ocean island basalt signature'?
    * *Ocean island basalts have a range of compositions, do you take the average? Median?*
    * *How close does it have to be to have an 'ocean island basalt signature'?*
* What if it also has an 'oceanic plateau' like signature?
    * *What's the probability that this is an ocean island basalt?*
    * *With what confidence can you say that this is an ocean island basalt, and not an oceanic plateau?*
    * *If you had analysed more elements, would you have the same confidence in the 'OIB signature'?*

Key points:
* Our reference points are often distributions
* Classification is often not straightforward
* Some of these issues are due to examining a limited set of components

### Beyond Three Dimensions: Multidimensional Analysis

One of the historical limitations for geochemical classification and regression problems has been the need to represent data graphically in 2D.

Extending these same approaches to multidimensional geochemical data analysis can overcome some of these limitations and inaccuracies.

### How do you like your geochemical data distributed..?

Acknowledging the log-normally distributed nature of geochemical data can help account for some spurious correlation.
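
The sketch below illustrates one source of spurious correlation using synthetic data. It assumes the effect in question arises from closure (expressing components as proportions of a total), which the text above does not spell out. Three independent log-normal components are generated, closed to 100%, and compared; all names and parameters are made up for this example.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

random.seed(0)  # fixed seed so the sketch is repeatable
n = 2000

# Three independent, log-normally distributed synthetic components (arbitrary units).
dominant = [random.lognormvariate(math.log(50.0), 1.0) for _ in range(n)]
comp_a = [random.lognormvariate(math.log(5.0), 0.2) for _ in range(n)]
comp_b = [random.lognormvariate(math.log(5.0), 0.2) for _ in range(n)]

# Closure: express each component as a percentage of the total.
totals = [d + a + b for d, a, b in zip(dominant, comp_a, comp_b)]
closed_a = [100.0 * a / t for a, t in zip(comp_a, totals)]
closed_b = [100.0 * b / t for b, t in zip(comp_b, totals)]

r_raw = pearson(comp_a, comp_b)        # independent by construction
r_closed = pearson(closed_a, closed_b)  # correlated through the shared total

# The log-ratio of two components is unchanged by closure: the total cancels.
log_ratio_raw = [math.log(a / b) for a, b in zip(comp_a, comp_b)]
log_ratio_closed = [math.log(a / b) for a, b in zip(closed_a, closed_b)]

print(f"correlation before closure: {r_raw:.2f}")
print(f"correlation after closure:  {r_closed:.2f}")
```

The two minor components are unrelated by construction, yet after closure they appear to vary together because both are diluted by the same dominant component. Working with logarithms of ratios sidesteps this particular effect, since the shared total cancels.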

# Data Science Applied to Geochemical Exploration Questions

* Is something new/novel?
* Is this enriched/depleted? Relative to what?
* What setting did this form in?
* What rocks are most similar to X?
* Which chemical signatures are associated with mineralisation?
* Which features provide the most valuable information? At what scale?
* What does this distribution mean?
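
As a small illustration of how one of these questions ('What rocks are most similar to X?') can be posed in code, the sketch below ranks a few made-up compositions by their distance to a target in log-concentration space. The rock names, values, choice of elements and the distance measure are all assumptions for this example.

```python
import math

# Hypothetical compositions (ppm); names and values are illustrative only.
library = {
    "rock_A": {"Ni": 1200.0, "Cr": 2500.0, "Zr": 20.0},
    "rock_B": {"Ni": 150.0, "Cr": 300.0, "Zr": 90.0},
    "rock_C": {"Ni": 20.0, "Cr": 40.0, "Zr": 250.0},
}
target = {"Ni": 1000.0, "Cr": 2000.0, "Zr": 25.0}  # hypothetical 'X'

def log_distance(a, b, elements):
    """Euclidean distance between two compositions in log-concentration space."""
    return math.sqrt(sum((math.log(a[el]) - math.log(b[el])) ** 2 for el in elements))

def rank_by_similarity(target, library, elements):
    """Return library names ordered from most to least similar to the target."""
    return sorted(library, key=lambda name: log_distance(target, library[name], elements))

ranking = rank_by_similarity(target, library, ["Ni", "Cr", "Zr"])
print(ranking)
```

The same function works for any number of elements, so the comparison is not limited to what can be drawn on a 2D diagram.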

Next: [pyrolite: Python for Geochemistry](2.1_pyroliteGeochem.ipynb)