# Exploratory Multivariate Analysis of Geochemical Datasets
Compiled by [Morgan Williams](mailto:morgan.williams@csiro.au) for C3DIS 2018 

This collection of Jupyter notebooks illustrates some common, simple problems encountered with geochemical data, along with some solutions. The notebooks cover the majority of the workflow outlined below. The associated data is sourced solely from the [EarthChem data portal](http://ecp.iedadata.org/) and is stored here in an S3 bucket for simplicity.

### The Workflow

The data analysis workflow shown below lists some common tasks needed to derive useful insight from geochemical data. Much of this is common to any data science workflow but, due to the nature of geochemical data, a few of these processes are still active research problems. Our research aims not to introduce radical changes in methodology, but simply to streamline and standardise the process so that we can use geochemistry robustly to address geological problems.

![Workflow Image](images/Workflow.png)

### The Problem

Much has happened since our planet was a primitive ball of molten rock, including the origin of plate tectonics, the modern atmosphere and life. This extended geological history has been encoded in the chemical signatures of rocks and minerals, which may then be used to (partially) reconstruct the past.

Inverting geochemistry to infer the geological past is commonly an underdetermined problem (especially prior to the advent of modern geochemical analysis instrumentation), and is hindered by complex geological histories.

Modern analytical methods have higher throughput and greater sensitivity and precision. As a result, established publicly accessible geochemical databases are growing steadily. However, the potential value of aggregating this increasing volume of high-quality data has not yet been fully realised.

### The Other Problems...

Before we can tackle the geological problems, we must first have a dataset which is consistently formatted and which contains relevant data of sufficient accuracy (lest we achieve simply *"garbage in, garbage out"*). These notebooks illustrate some of these processing steps, and demonstrate some approaches for the initial stages of data exploration.

### The Data

If you wish to download a subset of the EarthChem data to this binder server (approx. 300 MB as a sparse dataframe) so that it can be accessed in later notebooks, do so below. Otherwise, it will be downloaded *on-run* as necessary. Please note that this can take more than a minute, even on a good day.

In [1]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

In [2]:
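%%time
import sys

# Make the helper modules in ./src importable
sys.path.insert(0, './src')
from datasource import download_data, load_df

# Download the EarthChem subset to this server (approx. 300 MB)
download_data('EarthChemGlobal.pkl', 'EarthChemGlobal.pkl')

# Optionally load the dataframe and print a summary (info() prints directly)
#df = load_df('EarthChemGlobal.pkl')
#df.info()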

Wall time: 1min 39s
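
The download-once behaviour described above (fetch the file now so later notebooks can reuse it, otherwise fetch it on first use) can be sketched with the standard library alone. This is a minimal illustration and not the implementation in `src/datasource`, which is not shown here; the names `ensure_local` and `fetch` are hypothetical, and `fetch` stands in for whatever function actually retrieves the file.

```python
from pathlib import Path
from typing import Callable, Union


def ensure_local(path: Union[str, Path], fetch: Callable[[Path], None]) -> Path:
    """Return the local path, calling `fetch` only if the file is missing.

    `fetch` is a hypothetical callable that writes the data to the given path.
    """
    target = Path(path)
    if not target.exists():
        # Create the parent directory first so the fetch can write into it.
        target.parent.mkdir(parents=True, exist_ok=True)
        fetch(target)
    return target
```

Under these assumptions, a second call to `ensure_local` with the same path returns without fetching again, which is why downloading once up front avoids the wait in later notebooks.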
