# Geochemistry - Automated ML Clustering 

## Using the PyCaret framework


Hierarchical clustering of XRF geochemistry with depth constraints.

+ Features
    + Custom geochemistry dataframe I/O
    + Automated ML framework built on PyCaret
    + Depth constraints for hierarchical clusters 
    + Visualization / interpretation utilities 

N.C. Howes   
November 2021

## Configuration

Load custom geochemistry module from digital-core package. 

In [None]:
import sys, os 
sys.path.append("..")

#For development 
%reload_ext autoreload
%autoreload 2

In [None]:
from digitalcore import GeochemML

Read sample geochemistry data from CSV: the OOLDEA2 core from South Australia.

In [None]:
#Input path 
filepath = '../data/OOLDEA2_1m_intervals.csv'

# Output path
fullpath_to_product = '../data/data-products/OOLDEA2_1m_intervals-labeled.csv'

## Workflow

In [None]:
this = GeochemML.read_csv( filepath )

The custom geochemistry dataframe provides convenience methods for preview, visualization, and feature mapping. We'll preview the header, plot an element series, and list the subset of variables used as features in the clustering analysis. To see the masked or "ignored" variables, call `this.get_ignorefeatures()`. Ignored variables remain available for visualization and interpretation but are excluded from the model fits. By default, all metadata, depth, mdl, and 2SE variables are ignored.

View the first 5 rows (instances). Use `this.tail()` to view the last n rows.

In [None]:
this.head()

Plot an element series

In [None]:
this.plot_element( "U" )

List the features used for the fit.

In [None]:
features = this.get_features()
print( features )
print( len(features) )
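
The default masking described above can be illustrated with a small standalone sketch. The column names and the prefix/suffix rules below are assumptions made for illustration only; they are not taken from the OOLDEA2 file or from the `digitalcore` implementation.

```python
# Hypothetical column names; the real OOLDEA2 headers may differ.
columns = [
    "sample_id", "depth_from", "depth_to",
    "Fe", "Fe_2SE", "Fe_mdl", "Si", "Si_2SE", "U", "U_mdl",
]

# Assumed masking rules: metadata by name, depth by prefix, mdl/2SE by suffix.
METADATA = {"sample_id"}
IGNORED_PREFIXES = ("depth",)
IGNORED_SUFFIXES = ("_mdl", "_2SE")

def split_features(columns):
    """Return (features, ignored) following the assumed masking rules."""
    features, ignored = [], []
    for name in columns:
        masked = (
            name in METADATA
            or name.startswith(IGNORED_PREFIXES)
            or name.endswith(IGNORED_SUFFIXES)
        )
        (ignored if masked else features).append(name)
    return features, ignored

features, ignored = split_features(columns)
print(features)
print(ignored)
```

Only the element concentrations survive as model features, while the ignored columns stay in the table for plotting and interpretation.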

## Prepare experiment 

Specify the **data preparation** parameters to be used for the experiment.   

Parameters to explore include:
+ Feature scaling/normalization: zscore, minmax 
+ Dimension reduction: principal component analysis, etc... 
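
To make the two scaling options concrete, here is a minimal pure-Python sketch of z-score and min-max scaling applied to a single feature. The concentrations are made up, and the use of the population standard deviation is an assumption of this sketch rather than a statement about what the framework does internally.

```python
from statistics import mean, pstdev

def zscore(values):
    """Center on the mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    # Note: a constant column has sigma == 0 and would need special handling.
    return [(v - mu) / sigma for v in values]

def minmax(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Made-up concentrations for two elements on very different scales.
fe_pct = [2.0, 4.0, 6.0, 8.0]
u_ppm = [10.0, 30.0, 20.0, 40.0]

fe_z = zscore(fe_pct)
u_mm = minmax(u_ppm)
print(fe_z)
print(u_mm)
```

Either transform puts percent-level and ppm-level elements on a comparable footing, so that distance-based clustering is not dominated by the elements with the largest raw values.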


It is also possible to pass a custom preparation pipeline as an input option. See the PyCaret documentation on cluster setup for more options: [PyCaret Cluster Setup](https://pycaret.readthedocs.io/en/latest/api/clustering.html)

In [None]:
this.dataopts = dict(
    normalize=True,
    pca = False
    )

this.prepare( silent=True )

## Prepare models


Review the list of available cluster models.

In [None]:
this.get_listmodels()

Specify the **model types** to be used for the experiment. Customize model parameters by assigning a dictionary to the `modelopts` attribute, otherwise the models will run with default parameter settings. The modelopts are global parameters and are applied to all cluster models in the experiment.

All agglomerative cluster models are depth/strat constrained. 

Parameters to explore include:
+ k-means
    + n-clusters
+ agglomerative clustering
    + linkage method
    + dissimilarity metric 
    + linkage threshold
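
The notebook does not show how the depth constraint is implemented. One common approach is to allow merges only between clusters that are adjacent in depth, which the pure-Python sketch below illustrates. The function name `constrained_ward`, the Ward merge cost, and the sample values are all assumptions made for this example; it is not the `digitalcore` implementation.

```python
def constrained_ward(rows, n_clusters):
    """Agglomerative clustering where only depth-adjacent clusters may merge.

    rows must already be sorted by depth; returns one integer label per row.
    """
    # Each cluster is [size, centroid]; start with one cluster per sample.
    clusters = [[1, list(r)] for r in rows]

    def merge_cost(a, b):
        # Ward's criterion: increase in within-cluster sum of squares.
        (na, ca), (nb, cb) = a, b
        sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
        return na * nb / (na + nb) * sq_dist

    while len(clusters) > n_clusters:
        # Only neighbours in the depth-ordered list are merge candidates.
        i = min(range(len(clusters) - 1),
                key=lambda j: merge_cost(clusters[j], clusters[j + 1]))
        (na, ca), (nb, cb) = clusters[i], clusters[i + 1]
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)]
        clusters[i:i + 2] = [[na + nb, centroid]]

    # Expand cluster sizes back into per-row labels.
    labels = []
    for label, (size, _) in enumerate(clusters):
        labels.extend([label] * size)
    return labels

# Made-up, already-scaled samples ordered by increasing depth.
rows = [
    (0.1, 0.2), (0.0, 0.1), (0.2, 0.1),  # shallow unit
    (2.0, 2.1), (2.1, 1.9),              # middle unit
    (0.1, 0.0), (0.0, 0.2),              # deep unit, chemically like the shallow one
]
labels = constrained_ward(rows, n_clusters=3)
print(labels)
```

Because merges are restricted to neighbours, the deep samples get their own label even though they resemble the shallow ones chemically. An unconstrained fit would tend to group them together, which is usually not what a stratigraphic interpretation wants.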

In [None]:
# Add models. Specify as a list 
this.name = ["hclust"]

# Add optional model configuration. 
this.modelopts = dict( num_clusters = 3 )

In [None]:
this.modelopts


Fit a cluster model or array of models (autoML)

In [None]:
this.create()

### Assign labels

After fitting the models, we append cluster labels to the end of the dataframe.

In [None]:
this.label()

In [None]:
this.data.iloc[10:18, [0,1,2,-2,-1] ]

## Evaluate results

In [None]:
this.active = 0
this.get_activemodel()


### Cluster PCA Plot (2d)  

In [None]:
this.plottype = "cluster"
this.plotmodel()

### Cluster t-SNE (3d)

In [None]:
if this.get_activemodel() != "hclust_Cluster":
    this.plottype = "tsne"
    #this.plotmodel()

### Elbow Plot

In [None]:
if this.get_activemodel() != "hclust_Cluster":
    this.plottype = "elbow"
    this.plotmodel()

### Distance Plot

In [None]:
this.plottype = "distance"
#this.plotmodel()

### Distribution Plot

In [None]:
this.plottype = "distribution"
this.plotmodel()

In [None]:
this.plot_element( "U", labels=True )

## Interpretation 

Aggregate element statistics by cluster.

In [None]:
df = this.aggregate( output="unstack" )
df
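
As a rough picture of what aggregating by cluster involves, the sketch below groups labeled rows by cluster and reduces each element to its mean. The data are made up, and the mean is only one possible statistic; the statistics and output layout produced by `this.aggregate()` are not shown in this notebook and may differ.

```python
from collections import defaultdict
from statistics import mean

# Made-up labeled rows: (cluster label, {element: concentration}).
rows = [
    (0, {"Fe": 2.0, "Si": 30.0}),
    (0, {"Fe": 4.0, "Si": 34.0}),
    (1, {"Fe": 10.0, "Si": 20.0}),
    (1, {"Fe": 12.0, "Si": 22.0}),
    (1, {"Fe": 14.0, "Si": 18.0}),
]

# Collect each element's values per cluster.
grouped = defaultdict(lambda: defaultdict(list))
for label, elements in rows:
    for element, value in elements.items():
        grouped[label][element].append(value)

# Reduce every (cluster, element) group to its mean.
cluster_means = {
    label: {element: mean(values) for element, values in by_element.items()}
    for label, by_element in grouped.items()
}
print(cluster_means)
```

Comparing these per-cluster summaries element by element is what lets each cluster be read as a geochemical unit, for example an iron-rich versus a silica-rich interval.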

In [None]:
this.plot_aggregates( by="cluster", type="pct" )

In [None]:
this.plot_aggregates( by="cluster", type="ppm" )

In [None]:
this.plot_aggregates( by="feature", type="pct" )

In [None]:
this.plot_aggregates( by="feature", type="ppm" )

In [None]:
this.plot_scatter("Fe", "Si")

## Export

Save/write to disk. 

In [None]:
this.data.to_csv(fullpath_to_product, index=False)