%matplotlib widget
import numpy as np
import matplotlib.pyplot as plt
from util import render_audio_sample

# Digging Into Deep Time and Deep Cover

<span id='authors'><b>Morgan Williams <a class="fa fa-twitter" aria-hidden="true" href="https://twitter.com/metasomite" title="@metasomite"></a></b>, Jens Klump, Steve Barnes and Fang Huang; </span>
<span id='affiliation'><em>CSIRO Mineral Resources</em></span>


| [**Abstract**](./00_overview.ipynb) | **Introduction**                                                    | [**Examples**](./00_overview.ipynb#Examples)            | [**Tools**](./00_overview.ipynb#Tools) | [**Insights**](./00_overview.ipynb#Insights) |
|:-----|:-----|:-----|:-----|:-----|
|  | [Minerals Exploration](./00_overview.ipynb#Mineral-Exploration)  | [Classification](./011_classification.ipynb) |  |  |
|  | [Data Driven Geochem](./00_overview.ipynb#Data-Driven-Geochemistry) | [Regression](./012_regression.ipynb) |  |  |
|  |  | [Data Exploration](./013_dataexploration.ipynb) | |  |

    
<details class='alert alert-info'>
    <summary><b>About this Presentation</b></summary>
 
This is a Binder-enabled repository to accompany the abstract
<a href="https://goldschmidt.info/2020/abstracts/abstractView?id=2020003649">"Digging into Deep Time & Deep Cover"</a> for <a href="https://goldschmidt.info/2020/program/programViewThemes#period_472_4730_12338">Goldschmidt 2020 Session 06h</a> (Development of Big Data Geochemical Networks and new Analysis and Visualization
tools: Innovative approaches for 21st Century Multidimensional and Transdisciplinary
Science; 13:45-14:45 Thursday June 25 AWST / 19:45-20:45 Wednesday June 24 HST). To view the un-rendered notebooks, have a look at the <a href="https://github.com/morganjwilliams/gs2020_diggingdeeper">repository on GitHub</a>.

This presentation has been constructed using <a href="https://jupyter.org">Jupyter notebooks</a> and <a href="https://voila.readthedocs.io">Voilà</a> as an experiment in combining what would typically be found in a conference presentation with live-rendered interactive elements as a way to demonstrate aspects of software.
    
</details>

## Abstract 
Increasing volumes of open data and improved data quality allow geochemists to use data-driven approaches to address large-scale geological problems. At the same time, exploration for both base and critical metals is moving into under-explored areas with deeper cover, as a significant fraction of readily identifiable near-surface resources have likely already been discovered. With the cost of discovery increasing, predictive mineral systems science looks to better integrate and utilise both new and existing data to constrain the subsurface environment. Where we can use this data to restrict viable exploration spaces, exploration efforts may be focused to reduce cost and potentially time-to-discovery.

We will demonstrate a series of data-driven and machine learning approaches to classical geological problems, principally using data from global geochemical data repositories. We’ll use multivariate whole-rock geochemistry to distinguish tectonic environments, examine shifts in global basaltic geochemistry through time, use dimensional reduction and network techniques to visualize and better understand the relationship between samples and endmembers, and use multi-modal drill core data for predictive geochemistry. These examples illustrate some of the common challenges encountered while working with geochemical data: working across different scales, and linking geochemistry to spatiotemporal domains. We focus these examples on extending existing methods through the use of multivariate statistics and visualization methods, addressing model uncertainties, and acknowledging the potential impacts of common confounding effects (such as evolution, alteration and deformation).

We highlight where we can readily gain useful insight, where we may be able to transfer methods and learning to new problems and scales, and how we can use data to drive geological knowledge, extract latent features, and perhaps identify some 'unknown unknowns'. Finally, we demonstrate some tools which can make these methods more approachable for geochemists, such that the methods can be better integrated into established geochemical scientific workflows.

## Introduction & Context

This presentation largely focuses on one question:

> #### *How can we get the most from our geochemical data?*

As we discuss below and in other sections of this presentation, this is partly about asking the right questions, but also about how we value, use and enrich our data. In particular, we discuss how we incorporate geochemical data to build and evaluate models of how our planet operates and evolves, and how we can adopt more modern methods for assessing these models in a relatively data-rich age.

### Data Driven Geochemistry

> Understanding derived from the data itself, <br>rather than our idea of what the data represents.

The best place to start is a good question. Questions around 'why' and 'how' typically require complicated reasoning, and are more often the domain of simulation and modeling than data analysis. As interpretation is required to answer these kinds of questions, here we largely focus on the 'what', 'where' and 'when'. For example, there are many practical questions to ask of an exploration dataset:

* Is this more enriched/depleted in Y than expected?
* What rocks are most similar to X?
* Which chemical signatures are associated with mineralization?
* Which features provide the most valuable information for prediction? At what scale?
* What setting did this form in?
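
As a minimal sketch of how a question like "What rocks are most similar to X?" can be put to a dataset, the example below ranks samples by their distance from a target composition. It assumes one common choice for compositional data, a centred log-ratio (clr) transform followed by Euclidean distance; this is an illustration rather than the method used in the linked examples, and the sample names and values are made up.

```python
import math


def clr(composition):
    """Centred log-ratio transform of a strictly positive composition."""
    logs = [math.log(x) for x in composition]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def aitchison_distance(a, b):
    """Euclidean distance between two compositions in clr space."""
    return math.dist(clr(a), clr(b))


def most_similar(target, candidates):
    """Return candidate names sorted by increasing distance from the target."""
    return sorted(
        candidates,
        key=lambda name: aitchison_distance(target, candidates[name]),
    )


# Hypothetical three-component compositions (wt%), for illustration only.
samples = {
    "sample_A": [50.0, 15.0, 8.0],
    "sample_B": [72.0, 13.0, 0.5],
    "sample_C": [49.0, 16.0, 7.5],
}
target_x = [51.0, 15.5, 7.8]

ranking = most_similar(target_x, samples)
print(ranking)
```

Working in log-ratio space means the distance depends only on the relative proportions of the components, so rescaling a composition (for example, after renormalising to 100%) does not change the ranking.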

### Mineral Exploration

At the same time as we're adopting increasingly digital approaches to research and amassing large collections of data, we're encountering the growing challenge of sustainably extracting the resources needed to support this technology. A steady demand for 'tech metals', along with a shift towards renewable energy (and with it, increasing demand for battery metals) and limited supply (either due to geographical heterogeneity, or simply because many of these metals are produced as by-products of major commodities), renders many of these resources 'critical metals'. To ensure future metal supply, continued exploration for both well-known and potentially novel mineral deposits will be required. However, like many sectors, mineral exploration teams are expected to do more with lower budgets. They're also often working in relatively data-poor scenarios (at least relative to the scale they're working at). Below we discuss some of the challenges in exploration, and some approaches which might be useful in 'putting our data to work' to mitigate some of the risk of mineral exploration, potentially reduce time to discovery, and provide more certainty for the inputs to critical minerals pipelines.

The rate of discovery for major deposits (those needed to fulfill future demand) has declined over the last few decades. While there certainly remain many undiscovered deposits which are likely amenable to discovery following traditional exploration approaches targeting near-surface deposits, many of the deposits that have been discovered recently are of lower quality or volume than some of the large high-grade deposits which global resources companies have been built on. As a result, the search space for mineral exploration is expanding to include areas with more significant cover, and with it approaches to exploration are changing. As exploration pursues deeper targets, one of the principal challenges is the relative lack of geological information, especially considering classical approaches are geared towards finding deposits with readily-identifiable surface expressions (e.g. deposits you can 'kick'). Exploration geochemistry still has a key role to play in these scenarios, but increasingly the integration of data from various sources will be key to identifying signatures of buried deposits (e.g. geophysics, remote sensing, groundwater and regolith chemistry and detrital indicators; all of these provide information over a variety of scales!). The shallow subsurface is an accessible frontier, and typically remains under-explored.

Despite the evolving challenges of mineral exploration, budgets for exploration are increasingly tight and continue to be at least partly tied to resources cycles. Using existing data to enrich exploration processes and reduce exploration search spaces is one way in which exploration teams can adapt to 'do more with less'. Further, using data to make decisions around sampling activities and targeting in near-real time (i.e. active sampling) could allow adaptive exploration campaigns which return higher information value and reduce search spaces faster, with lower costs and potentially lower risk. Mineral exploration is the beginning of the resource value chain, with a relatively long lead time to resource development and extraction. While there are no guarantees, honing approaches to mineral exploration to better cover search spaces and make the most of both new and existing data should decrease average times to discovery, and solidify the longer-term viability of growing and emergent technologies dependent on sustained supply of critical metals.

While it's difficult to provide useful constraints on questions along the lines of 'What is there left to discover?', useful constraints on 'Where could it be?' are attainable. From a high level, the use of geochemistry to understand geological reservoirs and processes can provide first-order constraints to reduce the exploration search space. Particularly in older terranes, geochemistry has a larger role in providing geological context which has been lost through deformation, alteration and the passage of time. However, the certainty with which we can use geochemistry to provide information about geological environments into deep time is limited, and increasingly so the further back we go. The magnitude of evolution from the early Earth to the modern environments we can directly observe necessitates that we consider what some of these limits may be, and how appropriate our models based on modern geological systems are in deep time. In exploration these problems typically have a strong spatial component, but here we focus largely on the geochemistry.

The remainder of this presentation focuses on data-driven approaches to provide these first-order constraints, and how we can adapt our approach to classical geological problems with modern data analysis and visualization techniques.

<!--
Prospectivity and Fertility - While mineral deposits are often considered to be 'unique', the environments we find them in exhibit similarities we can exploit on larger scales. Even where an area may be prospective, the system may not be 'fertile' for forming mineral deposits
-->


### Adopting a Programmatic Approach

While there exists a range of data analysis software one could use to help derive insight from datasets, there are a number of advantages to adopting a programmatic approach ("writing code to ask questions of our data"):

* We can **avoid making simple errors** (e.g. see "One in five genetics papers contains errors thanks to Microsoft Excel" [doi: 10.1186/s13059-016-1044-7](https://doi.org/10.1186/s13059-016-1044-7)).
* We can **make our analysis repeatable** - it's not dependent on a particular sequence of mouse clicks and potentially unrecorded data analysis/reduction options. By recording the environment under which it was done (e.g. software versions, platform etc.), you can also make sure that someone else can get the same results and come to the same conclusions, **making the analysis reproducible**.
* Using version control together with a code-based approach to data analysis means we can effectively **version the process** (and potentially tie versions of data analysis pipelines to versions of datasets). This allows our data analysis to evolve without fear of lost history, and lets us identify after the fact where errors may have been present.
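
As a minimal sketch of recording the environment an analysis was run under, the snippet below collects the Python version, platform and a few package versions using only the standard library. The function name `describe_environment` and the choice of packages to report are illustrative assumptions, not part of this repository.

```python
import json
import platform
import sys
from importlib import metadata


def describe_environment(packages):
    """Collect interpreter, platform and package versions as a plain dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed here
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


# The package list is an assumption; report whatever your analysis imports.
environment = describe_environment(["numpy", "matplotlib"])
print(json.dumps(environment, indent=2))
```

Saving this output alongside results (and committing it with the analysis code) gives a later reader a starting point for rebuilding a comparable environment.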

Beyond this, the flexibility of a programmatic approach also lends itself towards developing new tools which you can integrate into your own workflows, automating repetitive work, using analysis for decision support, and potentially productivity gains (although this is never guaranteed; 'better science' is largely our goal here).

However, perhaps one of the key benefits of a programmatic approach for data-driven geochemistry is that it **changes our perspective** (e.g. beyond 2D and 3D diagrams to multidimensional analysis), and changes the questions we ask. In particular, it allows us to more easily quantify or estimate uncertainties and investigate how well our data support our models, and it better supports iterative testing and model development.
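
To illustrate how code can make uncertainty estimates routine, here is a minimal sketch of a percentile bootstrap interval for a sample mean, using only the standard library. The bootstrap is one of several possible approaches and is chosen here purely as an example; the measurement values are made up.

```python
import random
import statistics


def bootstrap_mean_interval(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of a small sample."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lower_index = int((1 - level) / 2 * n_resamples)
    upper_index = int((1 + level) / 2 * n_resamples) - 1
    return means[lower_index], means[upper_index]


# Hypothetical concentrations (ppm), for illustration only.
measurements = [12.1, 9.8, 11.4, 10.9, 13.2, 10.1, 11.7, 12.5]
low, high = bootstrap_mean_interval(measurements)
sample_mean = statistics.fmean(measurements)
print(f"mean = {sample_mean:.2f}, 95% interval = ({low:.2f}, {high:.2f})")
```

Because the whole procedure is a function, the same estimate can be rerun whenever the data change, or applied to a different statistic by swapping out `statistics.fmean`.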

Finally, community adoption of open science practices and open source software will contribute to the socialization of data, ideas, methods, code and analyses. It is also a viable pathway to developing consensus on **best practice** in a new era of data-driven geoscience, and developing community-driven research software considering interoperability and flexibility (open data formats, common standards).

## Examples

We've chosen to provide a few practical examples to illustrate how we might adapt our approaches to common problems in geochemistry using modern data analysis and machine learning, and have included links to these below.

<div class='alert alert-info'> <b>Note:</b> the links below will open separate notebooks, and take a short while to execute and load.</div>
 
The first concerns geochemical classification. While it is relevant to minerals exploration, it applies more generally to the classification of rock and mineral chemistry where natural (*or artificial*) groupings and segmentations are known (supervised classification). Some of the later sections of this example could also be directly applied to understand some of the relationships between samples for unsupervised classification/clustering.

<!--
The second example discusses making predictions, and partly relates to sampling

The last example we present here briefly discusses uses for network analysis.
-->

## Tools

For more information and interactive demonstration of some of these tools, have a look at our other Goldschmidt 2020 presentation ([live presentation](https://mybinder.org/v2/gh/morganjwilliams/gs2020-python4geochem/master?urlpath=/voila/render/00_overview.ipynb), [repository](https://github.com/morganjwilliams/gs2020-python4geochem/)).

<!--
### Knowledge from Data

We have a variety of formal and informal models for how we understand the workings of our (and other) planets, but with new data we should continually revisit some of these to see whether we can improve them or learn something new


### Transferring Methods

From other fields to geochemistry

From other scales

### "Unknown Unknowns"

What does this mean?

How do you find them?
-->
