| [**Overview**](./00_overview.ipynb) | [Getting Started](./01_jupyter_python.ipynb) | **Examples:** | [Access](./02_accessing_indexing.ipynb) | [Transform](./03_transform.ipynb) | [Plotting](./04_simple_vis.ipynb) | [Norm-Spiders](./05_norm_spiders.ipynb) | [Minerals](./06_minerals.ipynb) | **Workflows:** | [lambdas](./07_lambdas.ipynb) | [CIPW](./08_CIPW_Norm.ipynb)  | [ML](./11_geochem_ML.ipynb) | [Spatial Data](./12_spatial_geochem.ipynb) |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |

# Transforming Geochemical Data


In [None]:
import numpy as np
import pandas as pd
import pyrolite
from synthdata import get_synthetic_data

pd.options.display.precision = 3  # smaller graphical outputs

# synthetic data with a multivariate normal distribution in log-transformed space
df = get_synthetic_data(
    columns=["CaO", "MgO", "SiO2", "FeO", "Na2O", "Ni", "Ti", "La", "Lu", "Te"]
)
df

----
### Using Indexers, Scaling

You can also use these indexers for assignment, where the dimensionality of the dataset doesn't change. While you can transform element and oxide abundance units easily when you remember the relative scales, `pyrolite` provides some functions so that you don't have to rely on your memory. Here we create a copy of the dataframe and within it revert the change we made above - so these should be the original ppm values. This method provides an easy way to explicitly declare your intention when changing units - and makes sure the relative scales are correct!

In [None]:
from pyrolite.util.units import scale

# get a copy of just the elements from the dataframe, we'll then edit this version
els = df.pyrochem.elements.copy()
els

We can use the `scale` function to convert between known unit systems:

In [None]:
els.pyrochem.elements * scale("ppm", "wt%")

We can also assign this in place (`A *= B` is the shorthand equivalent of `A = A * B`):

In [None]:
els.pyrochem.elements *= scale("ppm", "wt%")

In [None]:
df.pyrochem.elements, els.pyrochem.elements
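For reference, the conversion itself is a plain multiplication. The sketch below does not use `pyrolite`; it hard-codes a small, hypothetical lookup of mass-fraction units (the dictionary `UNIT_FRACTIONS` and the helper `scale_factor` are assumptions made for illustration) to show the size of the factor you should expect for a ppm to wt% conversion:

```python
# Hypothetical lookup of unit sizes as mass fractions; not pyrolite's implementation.
UNIT_FRACTIONS = {"wt%": 1e-2, "ppm": 1e-6, "ppb": 1e-9}


def scale_factor(source: str, target: str) -> float:
    """Multiplier that converts a value in `source` units to `target` units."""
    return UNIT_FRACTIONS[source] / UNIT_FRACTIONS[target]


factor = scale_factor("ppm", "wt%")  # 1 wt% corresponds to 10,000 ppm
ni_ppm = 1500.0  # illustrative Ni abundance
ni_wt_pct = ni_ppm * factor  # about 0.15 wt%
print(factor, ni_wt_pct)
```

Naming both the source and target units in the call is what makes the intent explicit, compared with multiplying by a bare `1e-4`.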

---
### Converting Chemical Components 

`pyrolite` provides some straightforward methods to calculate element-oxide conversions (e.g. to transform Ti abundance to TiO2 abundance), assuming that the system is open to oxygen (i.e. in this case the extra oxygen will be added to the composition). This interface also allows the user to quickly add ratios and specify redox pairs at the same time. For example, we can transform a copy of our dataframe to include extra ratios and change some of our oxide components to elements:

In [None]:
df.pyrochem.convert_chemistry(
    to=["MgO", "SiO2", "FeO", "Ca", "Te", "Na", "Na/Te", "MgO/SiO2"]
)
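To make the "open to oxygen" assumption concrete, here is the arithmetic for a single element-to-oxide conversion done by hand. This is a minimal sketch using approximate atomic masses and an illustrative Ti value, not a description of how `pyrolite` performs the conversion internally:

```python
# Approximate standard atomic masses in g/mol.
M_TI, M_O = 47.867, 15.999

# One mole of TiO2 contains one mole of Ti and two moles of O.
M_TIO2 = M_TI + 2 * M_O

ti_ppm = 6000.0  # illustrative Ti abundance
# Oxygen is added to the component, so the mass abundance increases.
tio2_ppm = ti_ppm * M_TIO2 / M_TI
print(round(M_TIO2 / M_TI, 4), round(tio2_ppm, 1))
```

Going the other way (oxide to element) uses the inverse factor, and oxides with more than one cation per formula unit (e.g. Na2O) need the cation count in the denominator.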

In a similar way, we can also specify the molar speciation for redox species (so far just iron; others could be incorporated if they'll be useful). Here we adjust the total iron within our compositions (currently specified as FeO) to have a $Fe^{2+}/Fe^{3+}$ ratio of 9:1 (roughly what you might expect from a ~normal mantle-derived magma) - note that columns which aren't specified as *abundances* (be they metadata or things like ratios) would also be returned here:

In [None]:
df.pyrochem.convert_chemistry(to=[{"FeO": 0.9, "Fe2O3": 0.1}])
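The same bookkeeping can be sketched by hand for the iron split. The example below assumes that the 0.9 and 0.1 fractions refer to the proportion of iron *atoms* in each species (our reading of "molar speciation" here, which may differ in detail from `pyrolite`), and uses approximate atomic masses and an illustrative total:

```python
# Approximate standard atomic masses in g/mol.
M_FE, M_O = 55.845, 15.999
M_FEO = M_FE + M_O  # one Fe per formula unit
M_FE2O3 = 2 * M_FE + 3 * M_O  # two Fe per formula unit

feo_total = 10.0  # wt% total iron expressed as FeO (illustrative)
mol_fe = feo_total / M_FEO  # moles of Fe per 100 g

# Assumption: 90% of the Fe atoms stay as FeO, 10% are recast as Fe2O3.
feo = 0.9 * mol_fe * M_FEO
fe2o3 = 0.1 * mol_fe / 2 * M_FE2O3
print(round(feo, 3), round(fe2o3, 3))
```

Note that `feo + fe2o3` is slightly larger than the starting 10 wt%, because the ferric iron carries extra oxygen.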

----
<div class='alert alert-warning'> <font size="+1" color="black"><b> Checkpoint & Time Check</b><br>How are things going?</font></div>

----

### Compositional Data

We only have time to touch on compositional data analysis here, but there's a bit more information in [the pyrolite documentation](https://pyrolite.readthedocs.io/en/main/examples/index.html#compositional-data-examples).
 
First, let's look at a simple demonstration of the utility of compositional data analysis at a scale where it's eminently feasible - the measurement and estimation of isotope ratios. While in larger multivariate datasets we often encounter a high relative degree of missing data (one principal hurdle of compositional data analysis), for this scenario we can expect low to no missingness. We'll use a synthetic dataset for demonstration purposes - but we encourage you to try it out on your own dataset!


In [None]:
from synthdata import count_based_signal

In [None]:
oxygen = count_based_signal(
    columns=["18O", "17O", "16O"],
    bias=np.array([np.log(498.81), np.log(0.189)]),
)
oxygen.head()

From this we can calculate isotope ratios, and optionally transform these into delta/permille values relative to a standard reference:

In [None]:
ratio_18_16 = oxygen["18O"] / oxygen["16O"]

We can transform this into a delta representation (relative to a known standard value, which happens to be what we've used as our estimate of composition above):

In [None]:
delta = ((ratio_18_16 / 498.81) - 1) * 1000

In [None]:
ax = delta.plot.hist(bins=20)
ax.set(xlabel=r"$\delta^{18}O$ ($\perthousand$)")
ax.axvline(delta.mean(), color="k")
for a in [
    delta.mean() - delta.std() / (delta.size**0.5),
    delta.mean() + delta.std() / (delta.size**0.5),
]:
    ax.axvline(a, color="0.5", ls="--")

In [None]:
mean_18_16 = ratio_18_16.mean()
mean_18_16

Equally, you could have chosen (if you ignored conventions) to take the ratio of $^{16}O$ and $^{18}O$; and you'd expect that the means are invertible - right?

In [None]:
ratio_16_18 = oxygen["16O"] / oxygen["18O"]
mean_16_18 = ratio_16_18.mean()
mean_16_18

Inverting this to get a comparable estimate gives:

In [None]:
1 / mean_16_18

This isn't quite the same number. In terms of permille, we're around 0.05 ‰ off - not a great deal, but a problem when you're looking to do high precision analysis.

In [None]:
(mean_18_16 - (1 / mean_16_18)) / mean_18_16 * 1000

So what's going on here? These peculiarities result from incorrect assumptions regarding the distribution of the data: ratios of compositional components are typically lognormally distributed, rather than normally distributed, and the compositional components themselves commonly have a Poisson distribution or similar. These distributions contrast significantly with the normal distribution at the core of most statistical tests. To some extent, part of this makes sense - the normal distribution has one immediate failure for geochemical data, in that it has non-zero probability density below 0, and we know that you can't have negative atoms! We can plot normal, lognormal and Poisson distributions with similar expected values and variances side by side to see how they differ (log-transformed variables below):

<img src="https://pyrolite.readthedocs.io/en/main/_images/sphx_glr_compositional_data_002.png" style="display:inline; margin: 20px 10px 10px 20px;" width="60%"/>

We can see that by taking the logarithm of a lognormally distributed variable we get a variable which is normally distributed in log space. If we take the natural logarithm of the ratios before we take an average (accounting for the expected lognormal distribution), and then the exponent of the mean - we can see that the situation improves:

In [None]:
logmean_18_16 = np.exp(np.log(ratio_18_16).mean())
logmean_16_18 = np.exp(np.log(ratio_16_18).mean())
logmean_18_16, 1 / logmean_16_18
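The same behaviour can be reproduced without any isotope data. The sketch below draws a seeded lognormal sample with NumPy (the centre and spread are arbitrary choices for illustration) and compares how well the arithmetic and geometric means of a ratio and its inverse agree:

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative lognormal ratios centred near 2.
ratio = rng.lognormal(mean=np.log(2.0), sigma=0.3, size=1000)
inverse = 1 / ratio

# Arithmetic means: the product would be exactly 1 if the means were invertible.
arith_mismatch = ratio.mean() * inverse.mean() - 1

# Geometric means (mean in log space, then back-transform).
geo = np.exp(np.log(ratio).mean())
geo_inv = np.exp(np.log(inverse).mean())
geo_mismatch = geo * geo_inv - 1

print(arith_mismatch, geo_mismatch)
```

Because `log(1 / x) == -log(x)`, the two log-space means are exact negatives of each other, so the geometric means are invertible up to floating-point error; the arithmetic means are not.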

In reality the magnitude of this difference will be dependent on the range of values within your population; but depending on what you're looking at, this could be significant! It will be a bit different for things with low counts (e.g. Pb/U ratios in geochronology or rare trace elements) or strongly compositionally covariant variables (e.g. major elements, where an increase in one means an equal decrease in the other(s)). Note that this also means the uncertainties on the mean of ratios of compositional variables will be *asymmetric*! 
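The asymmetry is easy to see numerically. In this sketch (again using an arbitrary seeded lognormal sample rather than real data) we compute a standard error in log space and back-transform the interval around the geometric mean:

```python
import numpy as np

rng = np.random.default_rng(1)
ratio = rng.lognormal(mean=np.log(2.0), sigma=0.3, size=500)

log_ratio = np.log(ratio)
log_mean = log_ratio.mean()
log_se = log_ratio.std(ddof=1) / np.sqrt(log_ratio.size)

# Back-transform the centre and the symmetric log-space interval.
centre = np.exp(log_mean)
lower, upper = np.exp(log_mean - log_se), np.exp(log_mean + log_se)

# Distances from the centre in ratio space are no longer equal.
minus, plus = centre - lower, upper - centre
print(centre, minus, plus)
```

The interval is symmetric in log space, but the upper arm is always longer than the lower arm once it is expressed as a ratio.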

### Dealing with These Types of Issues in `pyrolite`

`pyrolite` includes a few functions for dealing with compositional data, at the heart of which are i) closure (i.e. everything sums to 100%) and ii) transformation (commonly log-transformations) to deal with the compositional space.

The commonly used log-transformations include the Additive Log-Ratio (`ALR`), Centred Log-Ratio (`CLR`), and Isometric Log-Ratio (`ILR`). Let's have a look at one of the log-transforms, which can be accessed directly from your dataframes (via the `df.pyrocomp` API). A key thing to note here is that everything should start in the same units and sum to one if you want to be able to back-transform it! Note we're using `df.pyrochem.compositional` to extract the elements and oxides but leave other columns alone:

In [None]:
scaled_df = df.copy()
scaled_df.pyrochem.elements *= scale("ppm", "wt%")
scaled_df.pyrochem.compositional = (
    scaled_df.pyrochem.compositional.pyrocomp.renormalise(scale=1)
)
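Closure itself is just a row-wise renormalisation. A minimal NumPy sketch of the idea follows (the helper name `close` and the example values are ours, and this is not `pyrolite`'s implementation):

```python
import numpy as np


def close(X, total=1.0):
    """Rescale each row of X so that it sums to `total`."""
    X = np.asarray(X, dtype=float)
    return total * X / X.sum(axis=-1, keepdims=True)


# Two illustrative three-component samples which don't sum to a constant.
raw = np.array([[52.0, 8.0, 10.0], [45.0, 12.0, 9.0]])
closed = close(raw)
print(closed.sum(axis=1))
```

Closure changes the absolute values but leaves every ratio between components untouched, which is why logratio methods can work with closed data.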

In [None]:
scaled_df.head()

In [None]:
lr_df = scaled_df.pyrochem.compositional.pyrocomp.CLR()
lr_df.head()

In [None]:
back_transformed = lr_df.pyrocomp.inverse_CLR()
back_transformed.head()
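If you'd like to see what the centred logratio does under the hood, the transform and its inverse fit in a few lines of NumPy. This is a sketch of the standard definition (the function names `clr` and `inverse_clr` are ours), not the `pyrolite` source:

```python
import numpy as np


def clr(X):
    """Centred logratio: log of each part relative to the row's geometric mean."""
    logX = np.log(X)
    return logX - logX.mean(axis=-1, keepdims=True)


def inverse_clr(Y):
    """Back-transform: exponentiate, then close each row to sum to one."""
    expY = np.exp(Y)
    return expY / expY.sum(axis=-1, keepdims=True)


comp = np.array([[0.6, 0.3, 0.1], [0.5, 0.25, 0.25]])  # rows sum to one
transformed = clr(comp)
recovered = inverse_clr(transformed)
print(transformed.sum(axis=1), recovered)
```

Each transformed row sums to zero, and the round trip only recovers the input because the rows were closed to one beforehand.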

One of the key areas where these logratio transforms might be useful is in deriving statistical properties from your geochemical data, for example calculating a mean. You could perform the transform, calculate the mean, and perform an inverse transform - but there's also a specific function dedicated to this:

In [None]:
scaled_df.pyrochem.compositional.pyrocomp.logratiomean()
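Conceptually, a logratio mean is the closed geometric mean of the samples. The sketch below computes it directly with NumPy on a small made-up dataset and compares it with the ordinary arithmetic mean; we assume here that this matches the CLR-based route described above, though `pyrolite`'s function may differ in its defaults:

```python
import numpy as np

# Three illustrative closed compositions (rows sum to one).
comp = np.array([[0.6, 0.3, 0.1], [0.5, 0.25, 0.25], [0.7, 0.2, 0.1]])

# Geometric mean of each component, then closure back to a composition.
geometric = np.exp(np.log(comp).mean(axis=0))
logratio_mean = geometric / geometric.sum()

arithmetic_mean = comp.mean(axis=0)
print(logratio_mean, arithmetic_mean)
```

The two estimates are both valid compositions, but they generally differ, most noticeably for components with a wide relative spread.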

### Compositional Data - Spherical Transformation

pyrolite also includes a spherical transformation, which is particularly useful for scenarios where zeros are valid, which is one scenario where logratio methods fall down (e.g., for mineralogy):

In [None]:
sphered = scaled_df.pyrochem.compositional.pyrocomp.sphere()

In [None]:
sphered

In [None]:
sphered.pyrocomp.inverse_sphere()
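The idea behind the spherical approach is that the square roots of a closed composition have unit length, so each sample sits on the positive part of a unit (hyper)sphere. The NumPy sketch below only illustrates that mapping with made-up values; it does not reproduce the angular coordinates returned by `sphere`:

```python
import numpy as np

# Closed compositions; note the valid zero in the first row.
comp = np.array([[0.7, 0.3, 0.0], [0.2, 0.5, 0.3]])

on_sphere = np.sqrt(comp)  # finite even where a component is zero
norms = np.linalg.norm(on_sphere, axis=1)  # each row has unit length

recovered = on_sphere**2  # squaring returns the original composition
print(norms, recovered)
```

Unlike a log-transform, the square root is defined at zero, which is why this representation tolerates absent components.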

We can have a quick look at how these compare between a 'flat' ternary diagram perspective and the analogous view of the (hyper)spherical variant:

In [None]:
import matplotlib.pyplot as plt
from pyrolite.plot.color import process_color
from pyrolite.util.plot.helpers import init_spherical_octant

colors = process_color(scaled_df["MgO"], cmap="RdBu")["c"]

fig = plt.figure(figsize=(10, 5))
ax0 = fig.add_subplot(121)
ax1 = fig.add_subplot(122, projection="3d")

init_spherical_octant(labels=[c[2:] for c in sphered.columns[:3]], ax=ax1)

scaled_df.iloc[:, 1:4].pyroplot.scatter(ax=ax0, c=colors)
ax1.scatter(*np.sqrt(scaled_df.values[:, 1:4]).T, c=colors)


### Bonus: Extra Compositional Examples

We can see the non-invertibility of means related to compositional data in a few more examples - here for a less precise dataset:

In [None]:
majors = count_based_signal(columns=["MgO", "CaO", "FeO"], strength=1000)
MgFe = majors["MgO"] / majors["FeO"]
FeMg = majors["FeO"] / majors["MgO"]
ax = MgFe.plot.hist(bins=20)

This is ~0.5% off:

In [None]:
(FeMg.mean() * MgFe.mean() - 1) * 100

This phenomenon occurs regardless of the signal generation process, e.g. generating a signal from random integers with a set initial bias:

In [None]:
arr = np.array([100, 200, 300]) + np.random.randint(-10, 10, size=(100, 3)).astype(
    float
)
arr /= arr.sum(axis=1)[:, None]
int_random_signal = pd.DataFrame(arr, columns=["A", "B", "C"])
int_random_signal.head()

In [None]:
B_A = (int_random_signal["B"] / int_random_signal["A"]).mean()
A_B = (int_random_signal["A"] / int_random_signal["B"]).mean()

In [None]:
B_A  # already a mean; should be about 2

This is about 0.4% different to the inverted A/B (the exact value changes from run to run, as the random signal is not seeded):

In [None]:
(B_A - (1 / A_B)) / B_A * 100