# Quantitative analysis, visualization, and modelling of detrital geochronology data
# GSA 2022 Short Course
---
### Application: detritalPy tutorial
---
Dr. Glenn Sharman, University of Arkansas





detritalPy is an open-source, Python-based toolset for visualizing and analyzing detrital geo-thermochronologic data. More information can be found in [this article](https://onlinelibrary.wiley.com/doi/full/10.1002/dep2.45) published in 2018 in The Depositional Record and on the [detritalPy GitHub site](https://github.com/grsharman/detritalPy).

To run a cell with code, first select the cell and then either click the arrow button or press Shift+Enter.

## 1. Import required modules

In [None]:
import detritalpy
import detritalpy.detritalFuncs as dFunc
import pathlib
import matplotlib
%matplotlib inline
%config InlineBackend.figure_format = 'retina' # For improving matplotlib figure resolution
matplotlib.rcParams['pdf.fonttype'] = 42 # For allowing preservation of fonts upon importing into Adobe Illustrator
matplotlib.rcParams['ps.fonttype'] = 42
print('detritalPy version: ',detritalpy.__version__)

## 2. Import the dataset

In [None]:
# Specify file paths to data input file(s)
dataToLoad = ['ExampleDataset_1.xlsx',
              'ExampleDataset_2.xlsx']

main_df, main_byid_df, samples_df, analyses_df = dFunc.loadDataExcel(dataToLoad)

detritalPy makes a Pandas dataframe that contains a row for each sample. Execute the next cell to see what data is available for the first sample.

In [None]:
main_byid_df.head(1)

## 3. Select Samples
There are two ways to select samples in detritalPy. (1) Individual samples can be specified in a list, or (2) sample groups can be defined as tuples that pair a list of sample IDs with a group label, as shown below.

In [None]:
sampleList = ['POR-1','POR-2','POR-3','BUT-5','BUT-4','BUT-3','BUT-2','BUT-1']

#The code below returns a list of arrays used in subsequent calculations
ages, errors, numGrains, labels = dFunc.sampleToData(sampleList,
                                                     main_byid_df,
                                                     sigma = '1sigma')

In [None]:
print('The first sample has this many analyses :',len(ages[0]))

In [None]:
sampleList = [(['POR-1','POR-2','POR-3'],'Point of Rocks Sandstone'),
              (['BUT-5','BUT-4','BUT-3','BUT-2','BUT-1'],'Butano Sandstone')]


#The code below returns a list of arrays used in subsequent calculations
ages, errors, numGrains, labels = dFunc.sampleToData(sampleList,
                                                     main_byid_df,
                                                     sigma = '1sigma')

In [None]:
print('The first sample group has this many analyses :',len(ages[0]))
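
As a rough illustration of what the two selection styles imply, the sketch below pools ages for each entry of a sample list. It is not detritalPy's implementation: `collect_ages` and the `toy_ages` dictionary are hypothetical stand-ins for `dFunc.sampleToData` and `main_byid_df`, and the ages are made up.

```python
# Toy stand-in for per-sample ages (Ma); values are made up for illustration
toy_ages = {
    'POR-1': [95.0, 102.0, 150.0],
    'POR-2': [98.0, 1700.0],
    'BUT-1': [80.0, 110.0, 165.0, 1400.0],
}

def collect_ages(sample_list, ages_by_sample):
    """Return (ages, labels), with one entry per sample or per sample group."""
    ages, labels = [], []
    for entry in sample_list:
        if isinstance(entry, tuple):
            # (list of sample IDs, group label): pool all analyses together
            sample_ids, label = entry
            pooled = [a for s in sample_ids for a in ages_by_sample[s]]
        else:
            # A single sample ID: use its analyses as-is
            pooled, label = list(ages_by_sample[entry]), entry
        ages.append(pooled)
        labels.append(label)
    return ages, labels

individual_ages, individual_labels = collect_ages(['POR-1', 'POR-2'], toy_ages)
grouped_ages, grouped_labels = collect_ages(
    [(['POR-1', 'POR-2'], 'Point of Rocks Sandstone')], toy_ages)

# Individual selection keeps samples apart; grouping merges their analyses
print(len(individual_ages[0]), len(grouped_ages[0]))
```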

## 4. Plot Detrital Age Distributions
Run the following cell to execute a function that plots detrital age distributions using default options.

In [None]:
fig = dFunc.plotAll(sampleList, ages, errors, numGrains, labels)

The `plotAll()` function contains many optional keyword arguments. We will go through these optional arguments in the following cells.

---
The `whatToPlot` variable specifies whether to plot age distributions as cumulative, relative, or both. The default is `whatToPlot = 'both'`, but we will choose `whatToPlot = 'relative'` instead.

In [None]:
whatToPlot = 'relative' # Options: cumulative, relative, or both

fig = dFunc.plotAll(sampleList, ages, errors, numGrains, labels, whatToPlot=whatToPlot)

Similarly, setting `whatToPlot = 'cumulative'` results in only the cumulative age distribution being plotted.
Notice that the optional keyword arguments are placed inside the `dFunc.plotAll()` call. Defaults are used for any arguments you do not specify.

In [None]:
whatToPlot = 'cumulative' # Options: cumulative, relative, or both

fig = dFunc.plotAll(sampleList, ages, errors, numGrains, labels, whatToPlot=whatToPlot)

To adjust the x-axis, you can specify the plot starting and ending range via the `x1` and `x2` variables. Let's zoom in on the young (0-300 Ma) part of the plot where most of the data is.

In [None]:
# Enter plot options below
whatToPlot = 'relative' # Options: cumulative, relative, or both

# Specify the age range (Myr) that you want to plot
x1 = 0
x2 = 300

fig = dFunc.plotAll(sampleList, ages, errors, numGrains, labels, whatToPlot=whatToPlot, x1=x1, x2=x2)

There are two ways of plotting relative age distributions in detritalPy. (1) If `separateSubplots = True`, each sample or sample group is plotted within its own "subplot". (2) If `separateSubplots = False`, then plots will be stacked on top of each other in a shared subplot (similar to how plots are made by the Arizona LaserChron Excel macros).


If `separateSubplots = False`, then age distributions are always normalized (i.e., area under the curves is equal). If `separateSubplots = True`, then you have a choice of whether to normalize the plots, which can be set using `normPlots`.

**Exercise:** Try making `separateSubplots` both `False` and `True` and compare the difference. When `separateSubplots = True`, try making `normPlots` both `False` and `True` and compare the difference.

In [None]:
# Enter plot options below
whatToPlot = 'relative' # Options: cumulative, relative, or both
separateSubplots = False # Set to True to plot each relative age distribution in a separate subplot (allows histogram and pie)
normPlots = False

fig = dFunc.plotAll(sampleList, ages, errors, numGrains, labels, whatToPlot=whatToPlot, x1=x1, x2=x2, separateSubplots=separateSubplots, agebins=[1,2], normPlots=normPlots)

Did you notice how the Butano Sandstone age distribution fills the entire subplot when `normPlots = False` but only extends about 2/3 up the subplot when `normPlots = True`? This is because the Point of Rocks Sandstone has a "higher" age probability peak near 100 Ma, and thus the Butano Sandstone age distribution must be more "spread out" for the area under both curves to be equal.
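
The effect of equal-area normalization can be checked with a small calculation. The sketch below is only an illustration with made-up spreads (4 and 6 Myr), not values from the Point of Rocks or Butano data: for two curves of equal area, the broader one has the lower peak.

```python
import math

def gaussian_peak_height(sigma):
    """Peak height of a Gaussian with unit area and standard deviation sigma."""
    return 1.0 / (sigma * math.sqrt(2.0 * math.pi))

narrow_peak = gaussian_peak_height(4.0)  # tightly clustered ages (toy value)
broad_peak = gaussian_peak_height(6.0)   # more spread-out ages (toy value)

# With equal areas, the broader curve reaches only part of a shared y-axis
print(round(broad_peak / narrow_peak, 3))
```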

The x-axis can be converted to a log scale by setting `plotLog = True`. If you do this, then `x1` cannot equal 0 (it will be set to 0.1 Ma by default, if you forget to change it from 0).

In [None]:
# Enter plot options below
whatToPlot = 'both' # Options: cumulative, relative, or both
separateSubplots = False # Set to True to plot each relative age distribution in a separate subplot (allows histogram and pie)

# Specify the age range (Myr) that you want to plot
x1 = 10
x2 = 3000
plotLog = True

fig = dFunc.plotAll(sampleList, ages, errors, numGrains, labels, whatToPlot=whatToPlot, separateSubplots=separateSubplots, 
                    agebins=[1,2], x1=x1, x2=x2, plotLog=plotLog)
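
The reason `x1` cannot be 0 on a log axis is that the logarithm of zero is undefined. A minimal sketch of the kind of guard described above follows; `log_axis_lower_bound` is a hypothetical helper written for this illustration, not a detritalPy function.

```python
def log_axis_lower_bound(x1, fallback=0.1):
    """Return a usable lower bound (Ma) for a log-scaled age axis."""
    # log10(0) is undefined, so a non-positive bound is replaced by the fallback
    return x1 if x1 > 0 else fallback

print(log_axis_lower_bound(0))
print(log_axis_lower_bound(10))
```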

Plot dimensions can also be adjusted using several variables: `w` for the width of the plot, `c` for the height of the CDF panel (if plotted), and `h` for the height of the relative panel (if `separateSubplots = False`).

In [None]:
# Enter plot options below
whatToPlot = 'both' # Options: cumulative, relative, or both
separateSubplots = True # Set to True to plot each relative age distribution in a separate subplot (allows histogram and pie)

# Specify the age range (Myr) that you want to plot
x1 = 0
x2 = 300
plotLog = False

# Relative distribution options
normPlots = True # Will normalize the PDP/KDE if equals True (if separateSubplots is True)

# Specify the plot dimensions
w = 5 # width of the plot
c = 4 # height of CDF panel
h = 5 # height of the relative panel (only required if separateSubplots is False). Options: 'auto' or an integer

fig = dFunc.plotAll(sampleList, ages, errors, numGrains, labels, whatToPlot=whatToPlot, separateSubplots=separateSubplots, 
                    agebins=[1,2], x1=x1, x2=x2, plotLog=plotLog, w=w, c=c, h=h, normPlots=normPlots)

detritalPy can plot relative age distributions as PDPs, KDEs, and/or histograms. Set `plotKDE = True` to plot the KDE. The bandwidth (`bw`) can be set to a number (bandwidth in Myr) or to one of several automatic bandwidth selection options:

`bw = 'optimizedFixed'` uses the optimized (single-value) bandwidth selection of [Shimazaki and Shinomoto (2010)](https://pubmed.ncbi.nlm.nih.gov/19655238/), as implemented via the adaptiveKDE Python module (https://pypi.org/project/adaptivekde/). I have found that this option is a good choice for many datasets. *Warning: This algorithm is slow to run!*

`bw = 'optimizedVariable'` uses a locally variable bandwidth ([Shimazaki and Shinomoto, 2010](https://pubmed.ncbi.nlm.nih.gov/19655238/)) as implemented via the adaptiveKDE Python module (https://pypi.org/project/adaptivekde/). I have found that this option provides inconsistent results for samples or sample groups with relatively few analyses. *Warning: This algorithm is very slow to run!*

`bw = 'ISJ'` uses the Improved Sheather-Jones algorithm as implemented by KDEpy (https://kdepy.readthedocs.io/en/latest/). 

**Exercise:**  Try out several different bandwidth options and compare the results

In [None]:
# Enter plot options below
whatToPlot = 'relative' # Options: cumulative, relative, or both

# Specify the plot dimensions
w = 10 # width of the plot
normPlots = False

plotKDE = True # Set to True if want to plot KDE
colorKDE = True # Will color KDE according to same coloration as used in CDF plotting
bw = 'optimizedFixed' # Specify the KDE bandwidth. Options are 'optimizedFixed', 'optimizedVariable', 'ISJ', or a number (bandwidth in Myr)

plotPDP = False # Set to True if want to plot PDP
colorPDP = True # Will color PDP according to same coloration as used in CDF plotting

plotHist = True # Set to True to plot a histogram (only available when separateSubplots is True)
b = 5 # Specify the histogram bin size (Myr)

fig = dFunc.plotAll(sampleList, ages, errors, numGrains, labels, whatToPlot=whatToPlot, separateSubplots=separateSubplots, 
                    agebins=[1,2], x1=x1, x2=x2, plotLog=plotLog, w=w, c=c, h=h, normPlots=normPlots, plotKDE=plotKDE, colorKDE=colorKDE,
                    bw=bw, plotPDP=plotPDP, colorPDP=colorPDP, plotHist=plotHist, b=b)
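
To make the difference between a PDP and a KDE concrete, here is a minimal sketch in plain Python. It assumes Gaussian kernels and a 1 Myr grid, and it uses made-up ages and errors; the function names are hypothetical and this is not how detritalPy computes its curves internally.

```python
import math

def gaussian(x, mu, sigma):
    """Normal probability density with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def pdp(x, ages, errors):
    """Probability density plot: each grain is smoothed by its own 1-sigma error."""
    return sum(gaussian(x, a, e) for a, e in zip(ages, errors)) / len(ages)

def kde(x, ages, bw):
    """Kernel density estimate: every grain is smoothed by the same bandwidth."""
    return sum(gaussian(x, a, bw) for a in ages) / len(ages)

toy_ages = [95.0, 98.0, 102.0, 150.0, 155.0]  # Ma, made-up values
toy_errors = [1.0, 1.5, 1.2, 3.0, 2.5]        # 1-sigma in Myr, made-up values

grid = [float(x) for x in range(0, 301)]      # 1 Myr spacing, 0-300 Ma
pdp_curve = [pdp(x, toy_ages, toy_errors) for x in grid]
kde_narrow = [kde(x, toy_ages, 2.5) for x in grid]
kde_wide = [kde(x, toy_ages, 20.0) for x in grid]

# A wider bandwidth lowers and broadens the peaks
print(max(kde_narrow) > max(kde_wide))
```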

There are several options for what type of cumulative distribution to plot: `plotCDF = True` for a "raw" CDF, `plotCPDP = True` for a cumulative PDP, and `plotCKDE = True` for a cumulative KDE. Note: if plotting a CKDE, you specify the bandwidth through the `bw` variable (see above).

**Exercise:** Compare the CDF with the CPDP and CKDE

In [None]:
# Enter plot options below
whatToPlot = 'cumulative'

# Cumulative distribution options
plotCDF = False # Plot the CDF discretized at xdif interval
plotCPDP = True # Plot the cumulative PDP
plotCKDE = False # Plot the cumulative KDE

fig = dFunc.plotAll(sampleList, ages, errors, numGrains, labels, whatToPlot=whatToPlot, separateSubplots=separateSubplots, 
                    agebins=[1,2], x1=x1, x2=x2, plotLog=plotLog, w=w, c=c, h=h, normPlots=normPlots, plotCDF=plotCDF,
                    plotCPDP=plotCPDP, plotCKDE=plotCKDE)
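
The contrast between a "raw" CDF and a cumulative PDP can be sketched as follows. This assumes Gaussian uncertainties and evaluates the cumulative normal directly with `math.erf`; detritalPy's own calculation (discretized at the `xdif` interval) may differ in detail, and the ages and errors are made up.

```python
import math

def empirical_cdf(x, ages):
    """Fraction of ages less than or equal to x (a step function)."""
    return sum(1 for a in ages if a <= x) / len(ages)

def cumulative_pdp(x, ages, errors):
    """Mean of the cumulative normal distributions of the individual analyses."""
    return sum(0.5 * (1.0 + math.erf((x - a) / (e * math.sqrt(2.0))))
               for a, e in zip(ages, errors)) / len(ages)

toy_ages = [95.0, 98.0, 102.0, 150.0]  # Ma, made-up values
toy_errors = [1.0, 1.5, 1.2, 3.0]      # 1-sigma in Myr, made-up values

# The CDF jumps at each age; the cumulative PDP rises smoothly around it
for x in (90.0, 98.0, 120.0, 200.0):
    print(x, empirical_cdf(x, toy_ages), round(cumulative_pdp(x, toy_ages, toy_errors), 3))
```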

If you want to see the actual data points and their errors, you can set `plotAgesOnCDF = True`.

In [None]:
whatToPlot = 'both'
plotAgesOnCDF = True

fig = dFunc.plotAll(sampleList, ages, errors, numGrains, labels, whatToPlot=whatToPlot, separateSubplots=separateSubplots, 
                    agebins=[1,2], x1=x1, x2=x2, plotLog=plotLog, w=w, c=c, h=h, normPlots=normPlots, plotCDF=plotCDF,
                    plotCPDP=plotCPDP, plotCKDE=plotCKDE, plotAgesOnCDF=plotAgesOnCDF)

Coloring beneath relative age distributions can help visualize patterns in the data. Both PDPs and KDEs can be colored by setting `colorKDEbyAge = True` or `colorPDPbyAge = True`. Note that the KDE and/or PDP will only be colored if `plotKDE = True` or `plotPDP = True`, respectively.

There are two ways to specify the boundaries of different age ranges and their colors. The first is to set `agebins` to a list of numbers, where the numbers indicate the starting and ending points of the age categories. The `agebinsc` variable holds a list of the colors you want to use. Click [here](https://matplotlib.org/3.3.0/gallery/color/named_colors.html) for a list of color names compatible with matplotlib.

**Exercise:** Try plotting both a colored PDP and KDE. Try adjusting the age bin boundaries and colors.

In [None]:
whatToPlot = 'relative'

plotPDP = True
colorPDP = False
colorPDPbyAge = True # Will color PDP according to age populations if set to True
plotKDE = False
colorKDE = False
colorKDEbyAge = True # Will color KDE according to age populations if set to True
bw = 2.5 # Specify the KDE bandwidth. Options are 'optimizedFixed', 'optimizedVariable', 'ISJ', or a number (bandwidth in Myr)

plotPIE = True # Will plot a pie diagram (only available when separateSubplots is True)

# Specify  age categories for colored KDE, PDP, and/or pie plots
# Sharman et al. 2015 scheme
agebins = [0, 23, 65, 85, 100, 135, 200, 300, 500, 4500]
agebinsc = ['slategray','royalblue','gold','red','darkred','purple','navy','gray','saddlebrown']

fig = dFunc.plotAll(sampleList, ages, errors, numGrains, labels, whatToPlot=whatToPlot, separateSubplots=separateSubplots, 
                    x1=x1, x2=x2, plotLog=plotLog, w=w, c=c, h=h, normPlots=normPlots, plotCDF=plotCDF,
                    plotCPDP=plotCPDP, plotCKDE=plotCKDE, plotKDE=plotKDE, colorKDE=colorKDE, colorKDEbyAge=colorKDEbyAge,
                    bw=bw, plotPDP=plotPDP, colorPDP=colorPDP, colorPDPbyAge=colorPDPbyAge, plotHist=plotHist, b=b,
                    plotPIE=plotPIE, agebins=agebins, agebinsc=agebinsc)

The other way to specify age categories is to set starting and ending values for each range, in the format shown below in the `agebins` variable. This makes it possible to leave gaps in the color plotting. *Note: it is possible to have overlapping age ranges, but this is not recommended.*

In [None]:
plotPDP = False
colorPDPbyAge = False # Will color PDP according to age populations if set to True
plotKDE = True
colorKDE = False
colorKDEbyAge = True # Will color KDE according to age populations if set to True
bw = 2.5 # Specify the KDE bandwidth. Options are 'optimizedFixed', 'optimizedVariable', 'ISJ', or a number (bandwidth in Myr)

plotPIE = True # Will plot a pie diagram (only available when separateSubplots is True)

# Specify  age categories for colored KDE, PDP, and/or pie plots
# Custom age ranges given as [start, end] pairs, with gaps between some categories
agebins = [[40,50],[65,85],[90,100],[105,125],[145,175],[240,280]]
agebinsc = ['royalblue','gold','red','darkred','purple','navy','gray','saddlebrown']

fig = dFunc.plotAll(sampleList, ages, errors, numGrains, labels, whatToPlot=whatToPlot, separateSubplots=separateSubplots, 
                    x1=x1, x2=x2, plotLog=plotLog, w=w, c=c, h=h, normPlots=normPlots, plotCDF=plotCDF,
                    plotCPDP=plotCPDP, plotCKDE=plotCKDE, plotKDE=plotKDE, colorKDE=colorKDE, colorKDEbyAge=colorKDEbyAge,
                    bw=bw, plotPDP=plotPDP, colorPDP=colorPDP, colorPDPbyAge=colorPDPbyAge, plotHist=plotHist, b=b,
                    plotPIE=plotPIE, agebins=agebins, agebinsc=agebinsc)
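
The two `agebins` formats can be compared by counting the fraction of grains in each category, which is roughly what a pie diagram summarizes. The sketch below is an assumption-laden illustration: the helper names are hypothetical, the intervals are treated as half-open `[start, end)`, and the ages are made up.

```python
def bin_edges_to_ranges(agebins):
    """Accept either a flat list of edges or a list of [start, end] pairs."""
    if all(isinstance(b, (list, tuple)) for b in agebins):
        return [tuple(b) for b in agebins]
    # Flat list of edges: consecutive values bound each category
    return list(zip(agebins[:-1], agebins[1:]))

def fraction_per_bin(ages, agebins):
    """Fraction of ages falling in each [start, end) category."""
    ranges = bin_edges_to_ranges(agebins)
    return [sum(1 for a in ages if lo <= a < hi) / len(ages) for lo, hi in ranges]

toy_ages = [45.0, 70.0, 95.0, 98.0, 110.0, 160.0, 260.0, 1700.0]  # Ma, made-up

contiguous = fraction_per_bin(toy_ages, [0, 23, 65, 85, 100, 135, 200, 300, 500, 4500])
with_gaps = fraction_per_bin(
    toy_ages, [[40, 50], [65, 85], [90, 100], [105, 125], [145, 175], [240, 280]])

print(sum(contiguous))  # 1.0: every grain falls in some category
print(sum(with_gaps))   # 0.875: the 1700 Ma grain falls in a gap
```

Note that a flat list of ten edges defines nine categories, which is why the contiguous scheme is paired with nine colors.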

Set `plotColorBar = True` to plot age ranges as vertical colored bars that extend through the CDF and/or relative age plots.

In [None]:
whatToPlot = 'both'
colorKDEbyAge = False # Will color KDE according to age populations if set to True

plotColorBar = True

plotPIE = True # Will plot a pie diagram (only available when separateSubplots is True)

# Specify  age categories for colored KDE, PDP, and/or pie plots
# Custom age ranges given as [start, end] pairs, with gaps between some categories
agebins = [[40,50],[65,85],[90,100],[105,125],[145,175],[240,280]]
agebinsc = ['royalblue','gold','red','darkred','purple','navy','gray','saddlebrown']

fig = dFunc.plotAll(sampleList, ages, errors, numGrains, labels, whatToPlot=whatToPlot, separateSubplots=separateSubplots, 
                    x1=x1, x2=x2, plotLog=plotLog, w=w, c=c, h=h, normPlots=normPlots, plotCDF=plotCDF,
                    plotCPDP=plotCPDP, plotCKDE=plotCKDE, plotKDE=plotKDE, colorKDE=colorKDE, colorKDEbyAge=colorKDEbyAge,
                    bw=bw, plotPDP=plotPDP, colorPDP=colorPDP, colorPDPbyAge=colorPDPbyAge, plotHist=plotHist, b=b,
                    plotPIE=plotPIE, agebins=agebins, agebinsc=agebinsc, plotColorBar=plotColorBar)

detritalPy can also plot PDPs or KDEs as heatmaps where coloration corresponds to the maximum PDP value. I haven't seen many studies use this plotting option, but I'll leave it up to you whether it is useful or not! Check out [this link](https://matplotlib.org/tutorials/colors/colormaps.html) for a list of available heatmap options.

**Exercise:** Experiment between KDE and PDP heatmaps. Try a few different heatmap options to find one that you like. 

In [None]:
whatToPlot = 'relative'

plotColorBar = False
plotHist = False
plotPIE = False
plotKDE = True

plotHeatMap = True
heatMapType = 'KDE' # Options: 'PDP' or 'KDE'
heatMap = 'inferno_r'

fig = dFunc.plotAll(sampleList, ages, errors, numGrains, labels, whatToPlot=whatToPlot, separateSubplots=separateSubplots, 
                    x1=x1, x2=x2, plotLog=plotLog, w=w, c=c, h=h, normPlots=normPlots, plotCDF=plotCDF,
                    plotCPDP=plotCPDP, plotCKDE=plotCKDE, plotKDE=plotKDE, colorKDE=colorKDE, colorKDEbyAge=colorKDEbyAge,
                    bw=bw, plotPDP=plotPDP, colorPDP=colorPDP, colorPDPbyAge=colorPDPbyAge, plotHist=plotHist, b=b,
                    plotPIE=plotPIE, agebins=agebins, agebinsc=agebinsc, plotColorBar=plotColorBar, plotHeatMap=plotHeatMap,
                    heatMapType=heatMapType, heatMap=heatMap)

detritalPy can find age peaks using the [peakutils library](https://peakutils.readthedocs.io/en/latest/tutorial_a.html). Set `plotAgePeaks = True` and specify `agePeakOptions`. `distType` can be either 'KDE' or 'PDP' (depending on which type of age distribution you want to find peaks on). `threshold` refers to a y-axis cutoff below which potential age peaks are excluded. `minDist` refers to an x-axis cutoff that controls how close adjacent age peaks can be. `minPeakSize` refers to the minimum peak height threshold for an age peak. The age of the age peak (Ma) will be plotted if `labels` is set to `True`.

**Exercise:** Try adjusting the `agePeakOptions` and observe the result.

In [None]:
plotAgePeaks = True # Will identify and plot age peaks
agePeakOptions = ['KDE', 0.05, 5, 2, True] # [distType, threshold, minDist, minPeakSize, labels]
w = 20 # Width of the plot

fig = dFunc.plotAll(sampleList, ages, errors, numGrains, labels, whatToPlot=whatToPlot, separateSubplots=separateSubplots, 
                    x1=x1, x2=x2, plotLog=plotLog, w=w, c=c, h=h, normPlots=normPlots, plotCDF=plotCDF,
                    plotCPDP=plotCPDP, plotCKDE=plotCKDE, plotKDE=plotKDE, colorKDE=colorKDE, colorKDEbyAge=colorKDEbyAge,
                    bw=bw, plotPDP=plotPDP, colorPDP=colorPDP, colorPDPbyAge=colorPDPbyAge, plotHist=plotHist, b=b,
                    plotPIE=plotPIE, agebins=agebins, agebinsc=agebinsc, plotAgePeaks=plotAgePeaks, 
                    agePeakOptions=agePeakOptions)
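
To build intuition for `threshold` and `minDist`, the following sketch finds local maxima on a toy curve. It is a simplified stand-in written for this tutorial, not the peakutils algorithm: `find_peaks` is a hypothetical function, and the threshold is assumed to be a fraction of the tallest value.

```python
def find_peaks(y, threshold=0.05, min_dist=5):
    """Indices of local maxima above threshold * max(y), at least min_dist apart."""
    cutoff = threshold * max(y)
    # Candidates: higher than the left neighbor and not lower than the right one
    candidates = [i for i in range(1, len(y) - 1)
                  if y[i] > y[i - 1] and y[i] >= y[i + 1] and y[i] > cutoff]
    # Accept the tallest candidates first and drop any that crowd an accepted peak
    accepted = []
    for i in sorted(candidates, key=lambda i: y[i], reverse=True):
        if all(abs(i - j) >= min_dist for j in accepted):
            accepted.append(i)
    return sorted(accepted)

# Toy curve on a 1 Myr grid: a tall peak at 100 Ma, a minor one at 103 Ma, one at 160 Ma
curve = [0.0] * 200
curve[99:102] = [0.5, 1.0, 0.5]
curve[103] = 0.6
curve[160] = 0.4

print(find_peaks(curve, threshold=0.05, min_dist=5))  # the 103 Ma peak is too close to 100 Ma
print(find_peaks(curve, threshold=0.05, min_dist=1))
print(find_peaks(curve, threshold=0.5, min_dist=1))   # the 160 Ma peak is below the cutoff
```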

detritalPy allows one to split the x-axis in 1 or more places. This can be useful when trying to visualize both very young (precise) and very old (imprecise) ages. To split the axis, set `x1` and `x2` to lists that give the starting and ending values of each axis segment. The example below shows a split at 300 Ma. The `w` variable (width of the plot) must also be changed to a list. The first number gives the width of the pie plot column *and must equal 1*, and the remaining numbers give the widths of the split axis portions.

**Exercise:** Try changing the split axis options. Make 2 splits instead of just 1. Adjust the `w` variable and see what happens.

Note: detritalPy automatically scales the y-axis of the plot such that 

In [None]:
# Specify the age range (Myr) that you want to plot
x1 = [0,300]
x2 = [300,2000]

# Specify the plot dimensions
w = [1,6,4] # width of the plot

plotKDE = True
colorKDE = False
colorKDEbyAge = True
bw = 'ISJ' # Specify the KDE bandwidth. Options are 'optimizedFixed', 'optimizedVariable', 'ISJ', or a number (bandwidth in Myr)

plotHist = False
b = [5,50]

plotAgePeaks = False

# Sharman et al. 2015 scheme
agebins = [0, 23, 65, 85, 100, 135, 200, 300, 500, 4500]
agebinsc = ['slategray','royalblue','gold','red','darkred','purple','navy','gray','saddlebrown']

fig = dFunc.plotAll(sampleList, ages, errors, numGrains, labels, whatToPlot=whatToPlot, separateSubplots=separateSubplots, 
                    x1=x1, x2=x2, plotLog=plotLog, w=w, c=c, h=h, normPlots=normPlots, plotCDF=plotCDF,
                    plotCPDP=plotCPDP, plotCKDE=plotCKDE, plotKDE=plotKDE, colorKDE=colorKDE, colorKDEbyAge=colorKDEbyAge,
                    bw=bw, plotPDP=plotPDP, colorPDP=colorPDP, colorPDPbyAge=colorPDPbyAge, plotHist=plotHist, b=b,
                    plotPIE=plotPIE, agebins=agebins, agebinsc=agebinsc, plotAgePeaks=plotAgePeaks, 
                    agePeakOptions=agePeakOptions)
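
One way to reason about the list form of `w` is to pair each axis segment with its share of the figure width. The sketch below assumes the entries of `w` act as relative column widths, which is an interpretation rather than a documented guarantee; `split_axis_layout` is a hypothetical helper.

```python
def split_axis_layout(x1, x2, w):
    """Pair each axis segment with its share of the total figure width."""
    if len(x1) != len(x2):
        raise ValueError('x1 and x2 must have the same number of entries')
    if len(w) != len(x1) + 1:
        raise ValueError('w needs one entry for the pie column plus one per segment')
    total = sum(w)
    segments = list(zip(x1, x2))
    shares = [width / total for width in w[1:]]  # skip the pie column
    return list(zip(segments, shares))

for (start, end), share in split_axis_layout([0, 300], [300, 2000], [1, 6, 4]):
    print(start, end, round(share, 3))
```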

**Exercise:** Try plotting two KDEs with different bandwidths in conjunction with the split axis options.

Multiple bandwidths can be specified in a list, and the age axis location at which to switch from one KDE to the other is provided in the `bw_x` variable.

Requires v1.3.27 or later

In [None]:
# Specify the age range (Myr) that you want to plot
x1 = [0,300]
x2 = [300,2000]

# Specify the plot dimensions
w = [1,6,4] # width of the plot

plotKDE = True
colorKDE = False
colorKDEbyAge = True
bw = [2.5, 20] # Specify the KDE bandwidth. Options are 'optimizedFixed', 'optimizedVariable', 'ISJ', or a number (bandwidth in Myr)
bw_x = [300]

plotHist = False
b = [5,50]

plotAgePeaks = False

# Sharman et al. 2015 scheme
agebins = [0, 23, 65, 85, 100, 135, 200, 300, 500, 4500]
agebinsc = ['slategray','royalblue','gold','red','darkred','purple','navy','gray','saddlebrown']

fig = dFunc.plotAll(sampleList, ages, errors, numGrains, labels, whatToPlot=whatToPlot, separateSubplots=separateSubplots, 
                    x1=x1, x2=x2, plotLog=plotLog, w=w, c=c, h=h, normPlots=normPlots, plotCDF=plotCDF,
                    plotCPDP=plotCPDP, plotCKDE=plotCKDE, plotKDE=plotKDE, colorKDE=colorKDE, colorKDEbyAge=colorKDEbyAge,
                    bw=bw, plotPDP=plotPDP, colorPDP=colorPDP, colorPDPbyAge=colorPDPbyAge, plotHist=plotHist, b=b,
                    plotPIE=plotPIE, agebins=agebins, agebinsc=agebinsc, plotAgePeaks=plotAgePeaks, 
                    agePeakOptions=agePeakOptions, bw_x=bw_x)
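
The pairing of `bw` and `bw_x` can be thought of as a lookup: each part of the age axis is assigned one bandwidth, and the switch happens at the values in `bw_x`. The sketch below illustrates that idea with hypothetical helpers; how detritalPy stitches the curves together internally is not shown here, and treating an age exactly at the split as belonging to the older segment is an assumption.

```python
import bisect
import math

def bandwidth_at(x, bw, bw_x):
    """Pick the bandwidth that applies at age x, switching at each value in bw_x."""
    return bw[bisect.bisect_right(bw_x, x)]

def piecewise_kde(x, ages, bw, bw_x):
    """Evaluate a Gaussian KDE at x using the bandwidth assigned to that part of the axis."""
    h = bandwidth_at(x, bw, bw_x)
    norm = h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - a) / h) ** 2) for a in ages) / (norm * len(ages))

bw = [2.5, 20]  # Myr
bw_x = [300]    # Ma

print(bandwidth_at(150, bw, bw_x))   # young part of the axis: narrow bandwidth
print(bandwidth_at(1200, bw, bw_x))  # old part of the axis: wide bandwidth
```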

It's easy to filter data in detritalPy. If you only want to look at the old ages, for example, you can filter the younger ones out. The example below keeps only analyses older than 300 Ma.

In [None]:
analyses_df = analyses_df.loc[(analyses_df['BestAge'] >300)]
main_byid_df = dFunc.loadData(samples_df, analyses_df, ID_col='Sample_ID')

We then have to reselect the samples for the change to take effect.

In [None]:
sampleList = [(['POR-1','POR-2','POR-3'],'Point of Rocks Sandstone'),
              (['BUT-5','BUT-4','BUT-3','BUT-2','BUT-1'],'Butano Sandstone')]


#The code below returns a list of arrays used in subsequent calculations
ages, errors, numGrains, labels = dFunc.sampleToData(sampleList,
                                                     main_byid_df,
                                                     sigma = '1sigma')

Finally, let's make the plot again, with the full set of options.

In [None]:
# Enter plot options below
whatToPlot = 'both' # Options: cumulative, relative, or both
separateSubplots = True # Set to True to plot each relative age distribution in a separate subplot (allows histogram and pie)

# Specify the age range (Myr) that you want to plot
x1 = 0
x2 = 3000
plotLog = False # Set to True to plot the x-axis as a log scale

# Specify the plot dimensions
w = 10 # width of the plot
c = 4 # height of CDF panel
h = 5 # height of the relative panel (only required if separateSubplots is False). Options: 'auto' or an integer

# Specify the interval (Myr) over which distributions are calculated
xdif = 1 # Note: an interval of 1 Myr is recommended

# Cumulative distribution options
plotCDF = True # Plot the CDF discretized at xdif interval
plotCPDP = False # Plot the cumulative PDP
plotCKDE = False # Plot the cumulative KDE
plotDKW = False # Plot the 95% confidence interval of the CDF (Dvoretzky-Kiefer-Wolfowitz inequality)

# Relative distribution options
normPlots = False # Will normalize the PDP/KDE if equals True (if separateSubplots is True)

plotKDE = True # Set to True if want to plot KDE
colorKDE = False # Will color KDE according to same coloration as used in CDF plotting
colorKDEbyAge = True # Will color KDE according to age populations if set to True
bw = 10 # Specify the KDE bandwidth. Options are 'optimizedFixed', 'optimizedVariable', 'ISJ', or a number (bandwidth in Myr)
bw_x = None # Change to a list with x-axis split locations (Ma) if using more than one KDE bandwidth (e.g., bw_x = [300])

plotPDP = False # Set to True if want to plot PDP
colorPDP = False # Will color PDP according to same coloration as used in CDF plotting
colorPDPbyAge = False # Will color PDP according to age populations if set to True

plotColorBar = False # Color age categories as vertical bars, can add white bars to create blank space between other colored bars

plotHist = True # Set to True to plot a histogram (only available when separateSubplots is True)
b = 50 # Specify the histogram bin size (Myr)

plotPIE = False # Will plot a pie diagram (only available when separateSubplots is True)

# Specify  age categories for colored KDE, PDP, and/or pie plots
# Sharman et al. 2015 scheme
agebins = [0, 23, 65, 85, 100, 135, 200, 300, 500, 4500]
agebinsc = ['slategray','royalblue','gold','red','darkred','purple','navy','gray','saddlebrown']

plotAgePeaks = False # Will identify and plot age peaks
agePeakOptions = ['KDE', 0.05, 5, 2, True] # [distType, threshold, minDist, minPeakSize, labels]

plotHeatMap = False
heatMapType = 'KDE'
heatMap = 'inferno_r'

fig = dFunc.plotAll(sampleList, ages, errors, numGrains, labels, whatToPlot, separateSubplots, plotCDF, plotCPDP, plotCKDE, 
                    plotDKW, normPlots, plotKDE, colorKDE, colorKDEbyAge, plotPDP, colorPDP, colorPDPbyAge, plotColorBar, 
                    plotHist, plotLog, plotPIE, x1, x2, b, bw, xdif, agebins, agebinsc, w, c, h, plotAgePeaks, agePeakOptions,
                    CDFlw=3, KDElw=1, PDPlw=1, plotHeatMap=plotHeatMap, heatMapType=heatMapType, heatMap=heatMap, bw_x=bw_x)
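
The `plotDKW` option refers to the Dvoretzky-Kiefer-Wolfowitz inequality, which bounds how far an empirical CDF of `n` observations is likely to be from the true CDF. A minimal sketch of the standard band half-width, sqrt(ln(2/alpha) / (2n)), is given below; the function names are hypothetical and this is not taken from detritalPy's source.

```python
import math

def dkw_epsilon(n, alpha=0.05):
    """Half-width of the DKW confidence band for an empirical CDF of n observations."""
    return math.sqrt(math.log(2.0 / alpha) / (2.0 * n))

def dkw_band(cdf_value, n, alpha=0.05):
    """Lower and upper band limits, clipped to the valid probability range [0, 1]."""
    eps = dkw_epsilon(n, alpha)
    return max(cdf_value - eps, 0.0), min(cdf_value + eps, 1.0)

# The band narrows as the number of analyses grows
for n in (50, 100, 300):
    print(n, round(dkw_epsilon(n), 3))
```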

# Application 1: Plotting detrital zircon U-Pb ages from the Bengal Fan
Blum et al. (2018): Scientific Reports present a very nice dataset of detrital zircon U-Pb ages from the Bengal Fan. You can download the Open Access article [here](https://www.nature.com/articles/s41598-018-25819-5). These samples are well characterized (between 200 and 300 analyses per sample), and the grain ages span a wide range of geologic time, presenting some challenges with plotting. The exercise below is for you to try and decide how *you* would like to plot these data.

In [None]:
dataToLoad = ['INPUT_Blum et al. (2018) Scientific Reports.xlsx']
main_df, main_byid_df, samples_df, analyses_df = dFunc.loadDataExcel(dataToLoad)

In [None]:
dFunc.plotSampleDist(main_byid_df, numBins=25)

This dataset has a very young analysis (0.6 Ma) with an error that is listed as 0 Ma. This is likely a rounding issue, but will result in problems later on if left in the dataset. For convenience, I will filter that analysis out and move on.

In [None]:
analyses_df = analyses_df.loc[(analyses_df['BestAge_err'] >0)]
main_byid_df = dFunc.loadData(samples_df, analyses_df)
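
One plausible reason a zero uncertainty causes trouble is that a Gaussian density divides by the standard deviation. The sketch below demonstrates that failure with made-up analyses and then applies the same kind of filter in plain Python; whether detritalPy fails in exactly this way is an assumption.

```python
import math

def gaussian(x, mu, sigma):
    """Normal probability density; undefined when sigma is zero."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

# Toy (age, 1-sigma error) pairs in Ma; the first mimics an error rounded to 0
analyses = [(0.6, 0.0), (15.2, 0.4), (480.0, 6.0)]

try:
    gaussian(1.0, *analyses[0])
except ZeroDivisionError:
    print('zero uncertainty breaks the density calculation')

# Keep only analyses with a positive uncertainty
usable = [(age, err) for age, err in analyses if err > 0]
print(len(usable))
```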

In [None]:
dFunc.plotSampleDist(main_byid_df, numBins=25)

In [None]:
main_byid_df.head(1)

Each of the cells below has a different way of selecting samples: all the data grouped together, each sample listed individually, or samples grouped according to age.

In [None]:
# All samples, listed in approximate stratigraphic order
sampleList = list(main_byid_df.Sample_ID)
ages, errors, numGrains, labels = dFunc.sampleToData(sampleList, main_byid_df, sigma = '1sigma');

In [None]:
# All data lumped together into a single group
sampleList = [(list(main_byid_df.Sample_ID),'Bengal Fan')]
ages, errors, numGrains, labels = dFunc.sampleToData(sampleList, main_byid_df, sigma = '1sigma');

In [None]:
# Samples grouped by age
sampleList = [(['M01_GR'],'Modern Ganges River'),
              (['J03'],'Modern Brahmaputra River'),
              (['U1451A_4H-6H','U1450A_6F-8F','U1452B_8F','U1453A_11F','U1451A_13F_combined','U1453A_26F'],'Mid-Pleis'),
              (['U1452B_38F','U1453_32F','U1449A_29,30,31F_combined'],'Early-Ples'),
              (['U1450A_70F','U1450A_78,79,80F','U1450_98F','U1450A_124F'],'Pliocene'),
              (['U1451A_37F','U1451A_41F','U1451A_47,48,49F_pilot','U1451A_60F','U1451A_66F','U1451B_3X','U1451A_80F','U1451A_102F','U1451B_22R'],'Late Miocene'),
              (['U1451B_41R','U1451B_51_54R','U1451B_62R'],'Early- to Mid-Miocene')]

ages, errors, numGrains, labels = dFunc.sampleToData(sampleList, main_byid_df, sigma = '1sigma');

**Exercise:** Try adjusting keyword arguments to make a plot that best communicates detrital zircon U-Pb age distributions in this dataset. Consider all of your options:

*   Separate subplots or not?
*   Plot cumulative and/or relative age distributions?
*   Change the x-axis scale? Split into subaxes? Plot as a log scale?
*   Adjust plot dimensions?
*   PDP vs KDE vs different KDE bandwidth options?
*   Use a histogram or not?
*   Color beneath the relative age distribution(s)? Plot a pie diagram?
*   Plot age peaks?
*   Plot a heat map?

In [None]:
# Enter plot options below
whatToPlot = 'relative' # Options: cumulative, relative, or both
separateSubplots = True # Set to True to plot each relative age distribution in a separate subplot (allows histogram and pie)

# Specify the age range (Myr) that you want to plot
x1 = 0
x2 = 4000
plotLog = False # Set to True to plot the x-axis as a log scale

# Specify the plot dimensions
w = 10 # width of the plot
c = 4 # height of CDF panel
h = 5 # height of the relative panel (only required if separateSubplots is False). Options: 'auto' or an integer

# Specify the interval (Myr) over which distributions are calculated
xdif = 1 # Note: an interval of 1 Myr is recommended

# Cumulative distribution options
plotCDF = False # Plot the CDF discretized at xdif interval
plotCPDP = False # Plot the cumulative PDP
plotCKDE = False # Plot the cumulative KDE
plotDKW = False # Plot the 95% confidence interval of the CDF (Dvoretzky-Kiefer-Wolfowitz inequality)

# Relative distribution options
normPlots = True # Will normalize the PDP/KDE if equals True (if separateSubplots is True)

plotKDE = False # Set to True if want to plot KDE
colorKDE = False # Will color KDE according to same coloration as used in CDF plotting
colorKDEbyAge = True # Will color KDE according to age populations if set to True
bw = 'ISJ' # Specify the KDE bandwidth. Options are 'optimizedFixed', 'optimizedVariable', 'ISJ', or a number (bandwidth in Myr)
bw_x = None

plotPDP = True # Set to True if want to plot PDP
colorPDP = False # Will color PDP according to same coloration as used in CDF plotting
colorPDPbyAge = False # Will color PDP according to age populations if set to True

plotColorBar = False # Color age categories as vertical bars, can add white bars to create blank space between other colored bars

plotHist = False # Set to True to plot a histogram (only available when separateSubplots is True)
b = 50 # Specify the histogram bin size (Myr)

plotPIE = False # Will plot a pie diagram (only available when separateSubplots is True)

# Specify  age categories for colored KDE, PDP, and/or pie plots
# Sharman et al. 2015 scheme
agebins = [0, 23, 65, 85, 100, 135, 200, 300, 500, 4500]
agebinsc = ['slategray','royalblue','gold','red','darkred','purple','navy','gray','saddlebrown']

plotAgePeaks = False # Will identify and plot age peaks
agePeakOptions = ['KDE', 0.05, 5, 2, True] # [distType, threshold, minDist, minPeakSize, labels]

plotHeatMap = False
heatMapType = 'PDP'
heatMap = 'Reds'

fig = dFunc.plotAll(sampleList, ages, errors, numGrains, labels, whatToPlot, separateSubplots, plotCDF, plotCPDP, plotCKDE, 
                    plotDKW, normPlots, plotKDE, colorKDE, colorKDEbyAge, plotPDP, colorPDP, colorPDPbyAge, plotColorBar, 
                    plotHist, plotLog, plotPIE, x1, x2, b, bw, xdif, agebins, agebinsc, w, c, h, plotAgePeaks, agePeakOptions,
                    CDFlw=3, KDElw=1, PDPlw=1, plotHeatMap=plotHeatMap, heatMapType=heatMapType, heatMap=heatMap, bw_x=bw_x)
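
The `plotDKW` option above refers to the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality, which bounds how far an empirical CDF built from n grains is likely to lie from the true CDF. The following is a minimal sketch of that band in plain NumPy, not detritalPy's own implementation; the helper name `dkw_band` and the synthetic ages are hypothetical and used only for illustration.

```python
import numpy as np

def dkw_band(ages, alpha=0.05):
    """Empirical CDF with a DKW confidence band (hypothetical helper)."""
    x = np.sort(np.asarray(ages, dtype=float))
    n = x.size
    ecdf = np.arange(1, n + 1) / n
    # DKW half-width: depends only on the number of grains and alpha
    eps = np.sqrt(np.log(2.0 / alpha) / (2.0 * n))
    lower = np.clip(ecdf - eps, 0.0, 1.0)
    upper = np.clip(ecdf + eps, 0.0, 1.0)
    return x, ecdf, lower, upper, eps

# Synthetic ages (Ma) for illustration only, not real data
rng = np.random.default_rng(0)
demo_ages = rng.uniform(100, 2000, size=100)

x, ecdf, lower, upper, eps = dkw_band(demo_ages)
print('Half-width of the 95% band for n = 100:', round(eps, 3))
```

Because the half-width shrinks with the square root of n, quadrupling the number of grains only halves the width of the band.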

In [None]:
pathlib.Path('Output').mkdir(parents=True, exist_ok=True) # Recursively creates the directory and does not raise an exception if the directory already exists 
fig.savefig('Output/DZageDistributions_Blum_etal_2018.pdf')

Want to share your plot with the group? Please visit this [shared Google Presentation](https://docs.google.com/presentation/d/1k-oNKAyb7p4XAE4dXOvoDgEFgQ6amyDn/edit?usp=sharing) and paste your plot!

# Application 2: Plotting detrital zircon U-Pb ages from the Colorado Plateau, USA
Gehrels et al. (2020, Geochronology) include a dataset of detrital zircon U-Pb ages from drill core taken from Petrified Forest National Park (Permian-Triassic). You can download the Open Access article [here](https://gchron.copernicus.org/articles/2/257/2020/). This dataset illustrates some challenges with plotting data that contain both young (precise) and old (imprecise) ages. In particular, the abundance of Triassic ages, corresponding to Cordilleran arc volcanism, is highly variable in this dataset. The exercise below is for you to try and decide how *you* would like to plot these data.
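
To see why mixing precise and imprecise ages is awkward, recall that a PDP is built by summing one Gaussian per analysis, with the analytical uncertainty as the standard deviation. The sketch below is a plain NumPy illustration rather than detritalPy's implementation; the function `pdp` and the two ages and 1-sigma uncertainties are hypothetical values chosen only to make the effect obvious.

```python
import numpy as np

def pdp(ages, errors_1s, x):
    """Probability density plot: the average of one Gaussian per analysis."""
    ages = np.asarray(ages, dtype=float)[:, None]
    errs = np.asarray(errors_1s, dtype=float)[:, None]
    gauss = np.exp(-0.5 * ((x - ages) / errs) ** 2) / (errs * np.sqrt(2 * np.pi))
    return gauss.mean(axis=0)

x = np.arange(0, 4001, 1.0)   # age axis at 1 Myr spacing

# Hypothetical analyses: one precise young grain, one imprecise old grain
demo_ages = [220.0, 1700.0]   # Ma
demo_errors = [1.0, 20.0]     # 1-sigma, Myr

density = pdp(demo_ages, demo_errors, x)
peak_young = density[220]
peak_old = density[1700]
print('Peak height ratio (young / old):', peak_young / peak_old)
```

Each grain contributes the same area, so the peak height scales inversely with the uncertainty. On a linear 0-4000 Ma axis, a precise Triassic population can therefore dwarf the older peaks, which is one reason to experiment with `x1`, `x2`, `plotLog`, and KDEs in the cells below.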

In [None]:
dataToLoad = ['INPUT_Gehrels et al. (2020) GChron.xlsx']
main_df, main_byid_df, samples_df, analyses_df = dFunc.loadDataExcel(dataToLoad)

In [None]:
dFunc.plotSampleDist(main_byid_df, numBins=25)

In [None]:
# All samples, listed in approximate stratigraphic order
sampleList = list(main_byid_df.Sample_ID)
ages, errors, numGrains, labels = dFunc.sampleToData(sampleList, main_byid_df, sigma = '1sigma');

In [None]:
# All data lumped together into a single group
sampleList = [(list(main_byid_df.Sample_ID),'Colorado Plateau - Permian to Triassic')]
ages, errors, numGrains, labels = dFunc.sampleToData(sampleList, main_byid_df, sigma = '1sigma');

In [None]:
# Samples grouped by unit

sampleList = [(['52-2','66-1'],'Chinle - Petrified Forest (Black Forest Bed)'),
              (['84-2','92-2','104-3','116-1','131-2'],'Chinle - Petrified Forest'),
              (['158-2','169-1','177-1','182-1','188-2','195-2'],'Chinle - Sonsela (upper)'),
              (['196-3','201-1','210-1','215-2','227-3','243-3'],'Chinle - Sonsela (lower)'),
              (['261-1','287-2','297-2'],'Chinle - Blue Mesa'),
              (['305-2'],'Chinle - Mesa Redondo'),
              (['319-2','327-2','335-1','349-3'],'Moenkopi - Holbrook'),
              (['383-2'],'Moenkopi(?) - Wupatki(?)'),
              (['390-1'],'Coconino')]

ages, errors, numGrains, labels = dFunc.sampleToData(sampleList, main_byid_df, sigma = '1sigma');

In [None]:
# Enter plot options below
whatToPlot = 'relative' # Options: cumulative, relative, or both
separateSubplots = False # Set to True to plot each relative age distribution in a separate subplot (allows histogram and pie)

# Specify the age range (Myr) that you want to plot
x1 = 0
x2 = 4000
plotLog = False # Set to True to plot the x-axis as a log scale

# Specify the plot dimensions
w = 10 # width of the plot
c = 4 # height of CDF panel
h = 5 # height of the relative panel (only required if separateSubplots is False). Options: 'auto' or an integer

# Specify the interval (Myr) over which distributions are calculated
xdif = 1 # Note: an interval of 1 Myr is recommended

# Cumulative distribution options
plotCDF = False # Plot the CDF discretized at xdif interval
plotCPDP = False # Plot the cumulative PDP
plotCKDE = False # Plot the cumulative KDE
plotDKW = False # Plot the 95% confidence interval of the CDF (Dvoretzky-Kiefer-Wolfowitz inequality)

# Relative distribution options
normPlots = True # Will normalize the PDP/KDE if equals True (if separateSubplots is True)

plotKDE = False # Set to True if want to plot KDE
colorKDE = False # Will color KDE according to same coloration as used in CDF plotting
colorKDEbyAge = True # Will color KDE according to age populations if set to True
bw = 'optimizedFixed' # Specify the KDE bandwidth. Options are 'optimizedFixed', 'optimizedVariable', or a number (bandwidth in Myr)

plotPDP = True # Set to True if want to plot PDP
colorPDP = False # Will color PDP according to same coloration as used in CDF plotting
colorPDPbyAge = False # Will color PDP according to age populations if set to True

plotColorBar = False # Color age categories as vertical bars, can add white bars to create blank space between other colored bars

plotHist = False # Set to True to plot a histogram (only available when separateSubplots is True)
b = 50 # Specify the histogram bin size (Myr)

plotPIE = False # Will plot a pie diagram (only available when separateSubplots is True)

# Specify age categories for colored KDE, PDP, and/or pie plots
# Sharman et al. 2015 scheme
agebins = [0, 23, 65, 85, 100, 135, 200, 300, 500, 4500]
agebinsc = ['slategray','royalblue','gold','red','darkred','purple','navy','gray','saddlebrown']

plotAgePeaks = False # Will identify and plot age peaks
agePeakOptions = ['KDE', 0.05, 5, 2, True] # [distType, threshold, minDist, minPeakSize, labels]

plotHeatMap = False
heatMapType = 'PDP'
heatMap = 'Reds'

fig = dFunc.plotAll(sampleList, ages, errors, numGrains, labels, whatToPlot, separateSubplots, plotCDF, plotCPDP, plotCKDE, 
                    plotDKW, normPlots, plotKDE, colorKDE, colorKDEbyAge, plotPDP, colorPDP, colorPDPbyAge, plotColorBar, 
                    plotHist, plotLog, plotPIE, x1, x2, b, bw, xdif, agebins, agebinsc, w, c, h, plotAgePeaks, agePeakOptions,
                    CDFlw=3, KDElw=1, PDPlw=1, plotHeatMap=plotHeatMap, heatMapType=heatMapType, heatMap=heatMap)

In [None]:
pathlib.Path('Output').mkdir(parents=True, exist_ok=True) # Recursively creates the directory and does not raise an exception if the directory already exists 
fig.savefig('Output/DZageDistributions_Gehrels_etal_2020.pdf')

Want to share your plot with the group? Please visit this [shared Google Presentation](https://docs.google.com/presentation/d/11OiSxgh7iv81ul8MnfzE7VWm4A0hPwhDBfVIVQh93uQ/edit?usp=sharing) and paste your plot!

**Bonus:** If you're satisfied with the plot(s) you've made, and we still have time, then why not try your hand at Multi-Dimensional Scaling (which will be covered in more detail later).

First, run the model. Note that this may take a while, depending on the options chosen. By default, `metric=False` and `n_init='metric'`, meaning that non-metric MDS is used, with the results from metric MDS as the starting configuration. Refer to the [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.MDS.html) documentation for more information.

In [None]:
model = dFunc.MDS_class(ages, errors, labels, sampleList, metric=False, criteria='Vmax', bw='optimizedFixed', n_init='metric', 
                        max_iter=1000, x1=0, x2=4500, xdif=1, min_dim=1, max_dim=3, dim=2)

After the model has been run, a number of figures can be generated, in any order.

The QQ matrix plots each sample CDF against the others. A perfect match falls along the dashed line. *Note: This is not recommended for large datasets (e.g., >>30 samples or sample groups).*

In [None]:
model.QQplot(figsize=(12,12), savePlot=False, fileName='QQplot.pdf', halfMatrix=True)

A heatmap of the sample dissimilarity matrix gives an indication of the data going into the MDS algorithm. The default dissimilarity metric is Vmax, which will be discussed later in the afternoon.
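
For orientation, Vmax is taken here to be the Kuiper statistic: the largest positive offset plus the largest negative offset between two CDFs. The sketch below builds a small dissimilarity matrix under that assumption using plain NumPy. It is not detritalPy's internal code, and `ecdf_on_grid`, `kuiper_v`, and the three synthetic samples are hypothetical.

```python
import numpy as np

def ecdf_on_grid(ages, grid):
    """Empirical CDF of one sample evaluated at each grid age."""
    sorted_ages = np.sort(np.asarray(ages, dtype=float))
    return np.searchsorted(sorted_ages, grid, side='right') / sorted_ages.size

def kuiper_v(cdf1, cdf2):
    """Kuiper statistic: largest positive plus largest negative CDF offset."""
    return np.max(cdf1 - cdf2) + np.max(cdf2 - cdf1)

grid = np.arange(0, 4501, 1.0)   # age axis at 1 Myr spacing

# Synthetic samples for illustration only: A and B share a source, C does not
rng = np.random.default_rng(42)
demo_samples = {
    'A': rng.normal(1100, 80, size=100),
    'B': rng.normal(1100, 80, size=100),
    'C': rng.normal(1700, 80, size=100),
}

names = list(demo_samples)
cdfs = [ecdf_on_grid(demo_samples[name], grid) for name in names]
vmax = np.array([[kuiper_v(ci, cj) for cj in cdfs] for ci in cdfs])

print(names)
print(np.round(vmax, 2))
```

The matrix is symmetric with zeros on the diagonal, and the A-C and B-C entries should be much larger than the A-B entry. A heat map is simply a colored rendering of such a matrix.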

In [None]:
model.heatMap(figsize=(10,10), savePlot=False, fileName='HeatMapPlot.pdf', plotValues=True,
              plotType='dissimilarity', fontsize=10)

By setting the keyword argument `plotType='distance'`, the heat map instead shows the Euclidean distance between each pair of samples as measured on the MDS plot. There should be a general correlation with the heat map above, as samples that are more different should be farther apart.

In [None]:
model.heatMap(figsize=(10,10), savePlot=False, fileName='HeatMapPlot.pdf', plotValues=True, plotType='distance', fontsize=10)

A stress plot indicates the goodness-of-fit and how it varies with the number of dimensions modeled.

In [None]:
model.stressPlot(figsize=(6,6), savePlot=False, fileName='stressPlot.pdf', stressType='sklearn')

A Shepard plot compares x-y distance on the MDS plot against the dissimilarity metric. Ideally, sample pairs that are far apart on the MDS plot (large distance) will also be the most dissimilar, and vice versa. The amount of scatter gives a sense of the stress value: a lot of scatter means that there is a lot of variance in how well distance on the MDS plot characterizes sample dissimilarity.
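
The link between Shepard scatter and stress can be made concrete with a toy example. The sketch below uses one common definition, Kruskal's stress-1, with the raw dissimilarities as target values. Non-metric MDS instead compares against monotonically transformed dissimilarities, and detritalPy's `stressType` options may be defined differently. The helper names and the four-point configuration are hypothetical.

```python
import numpy as np

def upper_triangle_distances(coords):
    """Euclidean distance for every unique pair of rows in coords."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    rows, cols = np.triu_indices(coords.shape[0], k=1)
    return dist[rows, cols]

def stress1(dissimilarity, coords):
    """Kruskal stress-1, using the raw dissimilarities as the target values."""
    d = upper_triangle_distances(coords)
    return np.sqrt(np.sum((d - dissimilarity) ** 2) / np.sum(d ** 2))

# Hypothetical dissimilarities: four samples at the corners of a unit square
square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
dissimilarity = upper_triangle_distances(square)

# Shepard pairs (dissimilarity, distance) for a 2-D and a 1-D configuration.
# The 1-D case simply drops the y coordinate; it is not an optimized solution.
shepard_2d = np.column_stack([dissimilarity, upper_triangle_distances(square)])
shepard_1d = np.column_stack([dissimilarity, upper_triangle_distances(square[:, :1])])

stress_2d = stress1(dissimilarity, square)
stress_1d = stress1(dissimilarity, square[:, :1])
print('stress in 2-D:', stress_2d)
print('stress in 1-D:', round(stress_1d, 3))
```

In two dimensions every Shepard pair falls on the 1:1 line and the stress is zero. Squeezing the same samples into one dimension scatters the pairs and raises the stress, which is the pattern a stress plot summarizes across dimensions.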

In [None]:
model.shepardPlot(figsize=(6,6), savePlot=False, fileName='shepardPlot.pdf', plotOneToOneLine=False)

The MDS plot is a depiction of sample similarity and dissimilarity (refer to Vermeesch, 2013: Chemical Geology for a more complete description).

In [None]:
model.MDSplot(figsize=(10,10), savePlot=False, fileName='MDSplot.pdf', plotLabels=True, equalAspect=False, 
              stressType='sklearn')

Samples can also be plotted as pie diagrams where bins correspond to different age categories. Note: You may have to revisit what values you used for the `agebins` and `agebinsc` variables.
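
As a quick check of what the pie slices represent, the proportion of grains in each age category can be computed directly with NumPy. This is a sketch rather than detritalPy's implementation; the variables `agebins_demo` and `agebinsc_demo` copy the scheme used earlier, and the ten ages are made up for illustration.

```python
import numpy as np

# Same bin edges and colors as the scheme used above: 10 edges define 9 categories
agebins_demo = [0, 23, 65, 85, 100, 135, 200, 300, 500, 4500]
agebinsc_demo = ['slategray', 'royalblue', 'gold', 'red', 'darkred',
                 'purple', 'navy', 'gray', 'saddlebrown']

# Made-up ages (Ma) for a single hypothetical sample
demo_ages = np.array([18, 70, 95, 98, 150, 220, 230, 410, 1100, 1700], dtype=float)

counts, _ = np.histogram(demo_ages, bins=agebins_demo)
fractions = counts / counts.sum()

for young, old, color, frac in zip(agebins_demo[:-1], agebins_demo[1:],
                                   agebinsc_demo, fractions):
    print(f'{young:>4}-{old:<4} Ma  {color:<12} {frac:.0%}')
```

Note that the number of colors must be one fewer than the number of bin edges. With this scheme all Triassic and Permian grains fall into the single 200-300 Ma category, so you may want finer bins in that range for this dataset.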

In [None]:
model.MDSplot(figsize=(10,10), savePlot=False, fileName='MDSplot.pdf', plotLabels=True, 
              plotPie=True, pieType='Age', pieSize=0.04, agebins=agebins, agebinsc=agebinsc, equalAspect=False)