# Day 3-Part 1: Statistics

Statistics is a powerful tool to understand or learn from your data. This notebook covers basic concepts of statistics, which will be useful for the second part of the course on machine learning. The material in this notebook comes from the book [Introduction to Python in Earth Science Data Analysis](https://link.springer.com/book/10.1007/978-3-030-78055-5) by Maurizio Petrelli.

# Univariate analysis

Univariate analysis, also called descriptive statistics, is the statistical analysis of a single variable. It involves both qualitative (plots) and quantitative (calculation of parameters) analyses. To illustrate these analyses, we'll use the same dataset we used before, of trace element concentrations in tephra deposits from a volcano in Italy, and specifically the concentration of Zr. Let's load the dataset and get the Zr concentration:

In [None]:
# import required libraries
import os
import pandas as pd
import matplotlib.pyplot as plt

# load dataset and read Zr
path = os.path.join("..", "data", "Smith_glass_post_NYT_data.xlsx")
my_dataset = pd.read_excel(path, sheet_name="Supp_traces")
zr = my_dataset.Zr # Zr concentration

## Visualizing univariate sample distributions

Now let's visualize the distribution of Zr as a histogram of absolute frequencies (counts per bin, plot 1), a histogram of probability density (plot 2), or a cumulative density distribution (plot 3). Notice that the probability density histogram is constructed by dividing the bin's count by the total number of counts and the bin width, so that the area under the histogram integrates to 1. The cumulative density histogram is normalized such that the last bin equals 1.

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(15,4)) # make figure

# histogram of absolute frequencies
ax[0].hist(zr, bins=20, color="#c7ddf4", edgecolor="k")
ax[0].set_xlabel("Zr [ppm]")
ax[0].set_ylabel("Counts")
ax[0].set_title("Plot 1: Absolute frequencies")

# histogram of probability density
ax[1].hist(zr, bins=20, density=True, color="#c7ddf4", edgecolor="k")
ax[1].set_xlabel("Zr [ppm]")
ax[1].set_ylabel("Probability density")
ax[1].set_title("Plot 2: Probability density")

# cumulative density distribution
ax[2].hist(zr, bins=20, density=True, histtype="step", linewidth=2, cumulative=True)
ax[2].set_xlabel("Zr [ppm]")
ax[2].set_ylabel("Likelihood of occurrence")
ax[2].set_title("Plot 3: Cumulative density distribution")

fig.tight_layout();
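
The normalization described above can be checked numerically. The block below is a minimal sketch that uses a handful of hypothetical concentration values (not taken from the dataset) and rebuilds the probability density and cumulative histograms from the raw counts:

```python
import numpy as np

# Hypothetical concentrations in ppm, used only for illustration
values = np.array([12.0, 15.0, 15.5, 18.0, 21.0, 22.5, 30.0, 41.0])

# Absolute frequencies (counts per bin) and bin widths
counts, edges = np.histogram(values, bins=4)
widths = np.diff(edges)

# Probability density: counts divided by (total count * bin width)
density = counts / (counts.sum() * widths)

# Cumulative distribution: running sum of the area of each density bar
cumulative = np.cumsum(density * widths)

print("Area under the density histogram:", np.sum(density * widths))
print("Last cumulative bin:", cumulative[-1])
```

The `density` array computed by hand should match what `np.histogram(values, bins=4, density=True)` returns.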

We can also visualize histograms of all trace elements with just one operation in `pandas`:

In [None]:
# visualize histograms of all traces 
traces = ["Sc","Rb","Sr","Y","Zr","Nb","Cs","Ba","La","Ce","Pr","Nd","Sm","Eu",
          "Gd","Tb","Dy","Ho","Er","Tm","Yb","Lu","Hf","Ta","Pb","Th","U"]

my_dataset[traces].hist(figsize=(17,14), bins=20, color="#c7ddf4", edgecolor="k")
plt.tight_layout();

## Location

In statistics, it is useful to represent an entire dataset with a single value describing its location. This single value is defined as the central tendency. The mean, median, and mode fall into this category.

### Mean

The arithmetic mean is the average of all numbers in a dataset:

### <div align="center">$\mu_{\mathrm{A}}=\bar{z}=\frac{1}{n} \sum_{i=1}^n z_i=\frac{z_1+z_2+\cdots+z_n}{n}$</div>

The geometric mean indicates the location of a dataset by using the product of their values:

### <div align="center">$\mu_G=\left(z_1 z_2 \cdots z_n\right)^{\frac{1}{n}}$</div>

And the harmonic mean is:

### <div align="center">$\mu_H=\frac{n}{\frac{1}{z_1}+\frac{1}{z_2}+\cdots+\frac{1}{z_n}}$</div>

The arithmetic mean can be computed using the `pandas.Series.mean` method. The geometric and harmonic means can be computed using the `gmean` and `hmean` functions from `scipy.stats.mstats`:

In [None]:
# compute the mean, geometric and harmonic means of Zr

from scipy.stats.mstats import gmean, hmean

a_mean = zr.mean()
g_mean = gmean(zr)
h_mean = hmean(zr)

print ("arithmetic mean = {:.1f} [ppm]".format(a_mean))
print ("geometric mean = {:.1f} [ppm]".format(g_mean))
print ("harmonic mean = {:.1f} [ppm]".format(h_mean))

fig, ax = plt.subplots()
ax.hist(zr, bins=20, density=True, color="#c7ddf4", edgecolor="k", label="Measurements Hist")
ax.axvline(a_mean, color="purple", label="Arithmetic mean", linewidth=2)
ax.axvline(g_mean, color="orange", label="Geometric mean", linewidth=2)
ax.axvline(h_mean, color="green", label="Harmonic mean", linewidth=2)
ax.set_xlabel("Zr [ppm]")
ax.set_ylabel("Probability density")
ax.legend();
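
To see how the three definitions differ, here is a minimal sketch that evaluates the formulas directly with `numpy` on three hypothetical values. For positive data, the harmonic mean is never larger than the geometric mean, which is never larger than the arithmetic mean:

```python
import numpy as np

# Hypothetical positive values, used only for illustration
z = np.array([1.0, 2.0, 4.0])
n = z.size

# Arithmetic mean: sum of the values divided by n
arithmetic = z.sum() / n

# Geometric mean: n-th root of the product, computed through logarithms
geometric = np.exp(np.log(z).mean())

# Harmonic mean: n divided by the sum of the reciprocals
harmonic = n / np.sum(1.0 / z)

print("arithmetic = {:.3f}".format(arithmetic))
print("geometric  = {:.3f}".format(geometric))
print("harmonic   = {:.3f}".format(harmonic))
```

Computing the geometric mean through logarithms avoids overflow when the product of many values becomes very large.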

### Median

The median is the number in the middle of a dataset after ordering the data from the lowest to the highest value. If the number of data values is odd, then the median is the middle value in the ordered list; if it is even, then the median is the average of the two middle values. The median can be calculated using the `pandas.Series.median` method:

In [None]:
# compute median of Zr

median = zr.median()

print ("Median = {:.1f} [ppm]".format(median))

fig, ax = plt.subplots()
ax.hist(zr, bins=20, density=True, color="#c7ddf4", edgecolor="k", label="Measurements Hist")
ax.axvline(median, color="orange", label="Median", linewidth=2)
ax.set_xlabel("Zr [ppm]")
ax.set_ylabel("Probability density")
ax.legend();
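
The odd and even cases of the definition can be written out explicitly. The function below is a small sketch (the name `median_by_sorting` and the values are hypothetical) that follows the definition step by step and can be compared with `np.median`:

```python
import numpy as np

def median_by_sorting(values):
    """Median computed from the definition: sort, then take the middle."""
    ordered = np.sort(values)
    n = ordered.size
    mid = n // 2
    if n % 2 == 1:
        # odd number of values: the middle value
        return ordered[mid]
    # even number of values: average of the two middle values
    return 0.5 * (ordered[mid - 1] + ordered[mid])

odd_sample = np.array([7.0, 1.0, 5.0])        # sorted: 1, 5, 7
even_sample = np.array([7.0, 1.0, 5.0, 3.0])  # sorted: 1, 3, 5, 7

print("odd sample median  =", median_by_sorting(odd_sample))
print("even sample median =", median_by_sorting(even_sample))
```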

### Mode

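The mode of a dataset is the value that appears most frequently. For continuous measurements such as concentrations, exact values rarely repeat, so the mode is usually estimated from a histogram as the midpoint of the bin with the highest count. Note that this estimate depends on the number of bins: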

In [None]:
# compute mode of Zr

import numpy as np

hist, bin_edges = np.histogram(zr, bins=20, density=True)
modal_value = (bin_edges[hist.argmax()] + bin_edges[hist.argmax()+1])/2

print ("Modal value: {:.1f} [ppm]".format(modal_value))

fig, ax = plt.subplots()
ax.hist(zr, bins=20, density=True, color="#c7ddf4", edgecolor="k", label="Measurements Hist")
ax.axvline(modal_value, color="orange", label="Modal value", linewidth=2)
ax.set_xlabel("Zr [ppm]")
ax.set_ylabel("Probability density")
ax.legend();
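
Because the modal value is read from a histogram, it changes with the binning. The sketch below wraps the same midpoint calculation in a hypothetical helper, `histogram_mode`, and applies it to a few made-up values with two different bin counts:

```python
import numpy as np

def histogram_mode(values, bins):
    """Midpoint of the most populated histogram bin."""
    counts, edges = np.histogram(values, bins=bins)
    k = counts.argmax()
    return 0.5 * (edges[k] + edges[k + 1])

# Hypothetical values, used only for illustration
values = np.array([1.0, 1.0, 1.0, 2.0, 5.0, 9.0])

mode_4_bins = histogram_mode(values, bins=4)  # bins are 2 units wide
mode_8_bins = histogram_mode(values, bins=8)  # bins are 1 unit wide

print("Modal value with 4 bins:", mode_4_bins)
print("Modal value with 8 bins:", mode_8_bins)
```

It is therefore worth checking that a reported modal value is stable over a reasonable range of bin counts.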

## Dispersion

After having introduced several measures of the central tendency of a dataset, we can now look at measures of its variability. The range, variance, and standard deviation are all estimators of the dispersion (variability) of a dataset.

### Range

The range is the difference between the highest and lowest values in a dataset:

### <div align="center">$R = z_{max} - z_{min}$</div>

It can be computed as follows:

In [None]:
# compute range of Zr

my_range = zr.max()- zr.min()

fig, ax = plt.subplots()
ax.hist(zr, bins=20, density=True, color="#c7ddf4", edgecolor="k", label="Measurements Hist")
ax.axvline(zr.max(), color="purple", label="Max value", linewidth=2)
ax.axvline(zr.min(), color="green", label="Min value", linewidth=2)
ax.axvspan(zr.min(), zr.max(), alpha=0.1, color="orange", label="Range = {:.0f} ppm".format(my_range))
ax.set_xlabel("Zr [ppm]")
ax.set_ylabel("Probability density")
ax.legend();

### Variance and standard deviation

The sample variance is defined as:

### <div align="center">$\sigma^2=\frac{\sum_{i=1}^n\left(z_i-\mu\right)^2}{n-1}$</div>

And the sample standard deviation is the square root of the sample variance:

### <div align="center">$\sigma=\sqrt{\sigma^2}=\sqrt{\frac{\sum_{i=1}^n\left(z_i-\mu\right)^2}{n-1}}$</div>

The sample variance and standard deviation can be estimated using the `pandas` `var` and `std` methods, respectively:

In [None]:
# compute variance and standard deviation of Zr

variance = zr.var() # sample variance, for population variance use ddof = 0
stddev = zr.std() # sample standard deviation, for population stddev use ddof = 0

print ("Variance = {:.1f} [ppm^2]".format(variance))
print ("Standard Deviation = {:.1f} [ppm]".format(stddev))

fig, ax = plt.subplots()
ax.hist(zr, bins= 20, density = True, color="#c7ddf4", edgecolor="k", label="Measurements Hist")
ax.axvline(a_mean-stddev, color="purple", label=r"mean - 1$\sigma$", linewidth=2)
ax.axvline(a_mean+stddev, color="green", label=r"mean + 1$\sigma$", linewidth=2)
ax.axvspan(a_mean-stddev, a_mean+stddev, alpha=0.1, color="orange", label=r"mean $\pm$ 1$\sigma$")
ax.set_xlabel("Zr [ppm]")
ax.set_ylabel("Probability density")
ax.legend();
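
The `ddof` argument mentioned in the comments controls the denominator: `ddof=1` divides by $n-1$ (sample estimate) and `ddof=0` divides by $n$ (population value). Note that `pandas` defaults to `ddof=1`, whereas `numpy` defaults to `ddof=0`. Here is a minimal sketch with hypothetical values:

```python
import numpy as np

# Hypothetical values with mean 5, used only for illustration
z = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = z.size

# Sum of squared deviations from the mean
ss = np.sum((z - z.mean()) ** 2)

sample_var = ss / (n - 1)    # same as np.var(z, ddof=1)
population_var = ss / n      # same as np.var(z, ddof=0), the numpy default

print("sample variance     = {:.3f}".format(sample_var))
print("population variance = {:.3f}".format(population_var))
print("sample std dev      = {:.3f}".format(np.sqrt(sample_var)))
```

The variance has the squared units of the data, while the standard deviation has the same units as the data.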

### Inter-quartile range

In descriptive statistics, the inter-quartile range is the difference between the 75th and 25th percentiles, or between the third and first quartiles. It can be calculated as follows:

In [None]:
# compute inter-quartile range of Zr

# using numpy

q1 = np.percentile(zr, 25, method="midpoint") # 1st quartile (NumPy >= 1.22; older versions use interpolation="midpoint")
q3 = np.percentile(zr, 75, method="midpoint") # 3rd quartile

# using pandas
#q1 = zr.quantile(0.25)
#q3 = zr.quantile(0.75)

iqr = q3 - q1 # inter-quartile range

print ("Inter-quartile range = {:.1f} [ppm]".format(iqr))

fig, ax = plt.subplots()
ax.hist(zr, bins= 20, density = True, color="#c7ddf4", edgecolor="k", label="Measurements Hist")
ax.axvline(q1, color="purple", label="Q1", linewidth=2)
ax.axvline(q3, color="green", label="Q3", linewidth=2)
ax.axvspan(q1, q3, alpha=0.1, color="orange", label="Inter-quartile range (IQR)")
ax.set_xlabel("Zr [ppm]")
ax.set_ylabel("Probability density")
ax.legend();

A boxplot uses the inter-quartile range to describe groups of numerical data. The bottom and top of the box represent the first and third quartiles, respectively, and a line is drawn inside the box to represent the second quartile (i.e. the median). Lines extending from the box (i.e. whiskers) indicate the variability outside the first and third quartiles. Outliers are plotted as individual symbols. To construct a boxplot of Zr concentrations divided by epochs, we can use the `seaborn.boxplot` function as follows:

In [None]:
# draw boxplot for Zr

import seaborn as sns

fig, ax = plt.subplots()
ax = sns.boxplot(data=my_dataset, x="Zr", y="Epoch", palette="Set3")

We can also do this for all trace elements:

In [None]:
# box plots for all trace elements

for trace in traces:
    sns.boxplot(data=my_dataset, x=trace, y="Epoch", palette="Set3")
    plt.show()
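
To make the role of the whiskers and outliers more concrete, the sketch below computes the quantities behind a boxplot by hand. It assumes the common convention, which is also the default in `seaborn` and `matplotlib` (`whis=1.5`), that whiskers reach the most extreme data points within 1.5 times the inter-quartile range from the box. The values are hypothetical:

```python
import numpy as np

# Hypothetical values with one unusually high measurement
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])

# Quartiles (default linear interpolation) and inter-quartile range
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the box (assumed whisker convention)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme points inside the fences
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Points outside the fences are drawn as individual symbols
outliers = data[(data < lower_fence) | (data > upper_fence)]

print("Q1 = {:.2f}, Q3 = {:.2f}, IQR = {:.2f}".format(q1, q3, iqr))
print("Whiskers from {:.1f} to {:.1f}".format(whisker_low, whisker_high))
print("Outliers:", outliers)
```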

## Skewness

Having introduced several parameters providing information about the central tendency and the variability of a dataset, we can now look at the skewness, which reflects the shape of the distribution. 

The skewness provides information about the symmetry of a distribution of values. In a symmetric, unimodal distribution, the mean, median, and mode coincide. Conversely, when these three parameters do not coincide, the distribution is skewed.

In the case of the concentration distribution of Zr, the mean ($\mu = \mu_A$), median ($M_e$), and mode ($M_o$) do not coincide, and a tail appears on the right side:

In [None]:
# skewness of Zr

fig, ax = plt.subplots()
ax.hist(zr, bins=20, density=True, color="#c7ddf4", edgecolor="k", label="Measurements Hist")
ax.axvline(a_mean, color="green", label="Arithmetic mean", linewidth=2)
ax.axvline(median, color="purple", label="Median Value", linewidth=2)
ax.axvline(modal_value, color="orange", label="Modal value", linewidth=2)
ax.set_xlabel("Zr [ppm]")
ax.set_ylabel("Probability density")
ax.legend();

By comparison with the figure below, this corresponds to a positive skewness:

<img src='../figures/skewness.png' style="width:600px" align="center">

The skewness can also be quantified. One parameter is Pearson's first coefficient of skewness, where $\sigma_s$ is the sample standard deviation:

### <div align="center">$\alpha_1=\frac{(\mu - M_o)}{\sigma_s}$</div>

A second parameter is Pearson's second moment of skewness:

### <div align="center">$\alpha_2=\frac{3(\mu - M_e)}{\sigma_s}$</div>

And a third parameter is the Fisher-Pearson coefficient of skewness:

### <div align="center">$g_1=\frac{m_3}{m_2^{3/2}}$</div>

where $m_k$ is the $k$-th central moment of the sample:

### <div align="center">$m_k=\frac{1}{n} \sum_{i=1}^n\left(z_i-\mu\right)^k$</div>

In Python, these three parameters can be determined as follows:

In [None]:
# compute Pearson's and Fisher-Pearson skewness coefficients

from scipy.stats import skew

a1 = (a_mean - modal_value) / stddev
a2 = 3 * (a_mean - median) / stddev
g1 = skew(zr)

print ("Pearson 1st coeff. of skewness = {:.2f}".format(a1))
print ("Pearson 2nd moment of skewness = {:.2f}".format(a2))
print ("Fisher-Pearson coeff. of skewness = {:.2f}".format(g1))
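
The Fisher-Pearson coefficient can also be evaluated directly from the central moments. The sketch below does this with `numpy` for two hypothetical samples, one with a tail to the right and one that is symmetric. To my understanding, `scipy.stats.skew` with its default `bias=True` uses the same moment-based formula, so the results should agree:

```python
import numpy as np

def central_moment(values, k):
    """k-th central moment: mean of (value - mean)**k."""
    return np.mean((values - values.mean()) ** k)

def fisher_pearson_skewness(values):
    """g1 = m3 / m2**(3/2)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

# Hypothetical samples, used only for illustration
right_tailed = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 10.0])
symmetric = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

g1_right = fisher_pearson_skewness(right_tailed)
g1_symmetric = fisher_pearson_skewness(symmetric)

# Pearson's second coefficient for the right-tailed sample
a2_right = 3 * (right_tailed.mean() - np.median(right_tailed)) / right_tailed.std(ddof=1)

print("g1 (right-tailed) = {:.2f}".format(g1_right))
print("g1 (symmetric)    = {:.2f}".format(g1_symmetric))
print("a2 (right-tailed) = {:.2f}".format(a2_right))
```

A positive value indicates a tail on the right side, and a value near zero indicates a roughly symmetric sample.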

# Bivariate analysis

Bivariate statistics investigates the relationships between two variables. Let's look at the concentrations of La, Ce, Sc, and U:

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(10,4)) # make figure

# plot La vs Ce
ax[0].scatter(my_dataset.La, my_dataset.Ce, marker="o", edgecolor="k", color="#c7ddf4")
ax[0].set_xlabel("La [ppm]")
ax[0].set_ylabel("Ce [ppm]")

# plot Sc vs U
ax[1].scatter(my_dataset.Sc, my_dataset.U, marker="o", edgecolor="k", color="#c7ddf4")
ax[1].set_xlabel("Sc [ppm]")
ax[1].set_ylabel("U [ppm]")

fig.tight_layout();

Another way to visualize these data is using the `seaborn.pairplot` function. This function plots pairwise relationships from a dataset:

In [None]:
# plot pairwise relationships for La, Ce, Sc, and U

my_sub_dataset = my_dataset[["La","Ce","Sc","U"]] 
sns.pairplot(my_sub_dataset);

The plot of La versus Ce clearly shows an increase in Ce as La increases, while U seems to decrease as Sc increases, although this second correlation is not as clear. To capture the relationships between these variables, we can use the concepts of covariance and correlation.

## Covariance and Correlation

The covariance of two sets of univariate samples $x$ and $y$ derived from two random variables $X$ and $Y$ is a measure of their joint variability:

### <div align="center">$\operatorname{Cov}_{x y}=\frac{\sum_{i=1}^n\left(y_i-\bar{y}\right)\left(x_i-\bar{x}\right)}{n-1}$</div>

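${Cov}_{xy} > 0$ indicates a positive relationship between $Y$ and $X$. In contrast, if ${Cov}_{xy} < 0$, the relationship is negative. If $X$ and $Y$ are statistically independent, their covariance is zero, so the sample covariance ${Cov}_{xy}$ will be close to 0. The converse does not hold: a covariance of zero only rules out a linear relationship.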

Note that the covariance depends on the magnitudes of the two variables inspected. Consequently, it does not tell us much about the strength of the relationship. The normalized version of the covariance is the correlation coefficient. This coefficient ranges from -1 to 1 and shows, by its magnitude, the strength of the linear relation. The correlation coefficient $r_{xy}$ for two univariate sets of data, $x$ and $y$, characterized by a covariance ${Cov}_{xy}$ and sample standard deviations $\sigma_{sx}$ and $\sigma_{sy}$ is:

### <div align="center">$r_{x y}=\frac{\operatorname{Cov}_{x y}}{\sigma_{s x} \sigma_{s y}}=\frac{\sum_{i=1}^n\left(y_i-\bar{y}\right)\left(x_i-\bar{x}\right)}{\sqrt{\sum_{i=1}^n\left(y_i-\bar{y}\right)^2 \sum_{i=1}^n\left(x_i-\bar{x}\right)^2}}$</div>

With `pandas`, the covariance and the correlation coefficient can be easily computed using the `cov` and `corr` methods. These methods calculate the covariance and the correlation matrices for a `DataFrame`, respectively. A covariance matrix is a table showing the covariances ${Cov}_{xy}$ between variables in the `DataFrame`. Each cell in the table shows the covariance between two variables. The correlation matrix follows the same logic, but gives the correlation coefficients. In the correlation matrix, the diagonal cells all contain unity, which corresponds to self-correlation. We can use the `seaborn.heatmap` function to graphically display these matrices:

In [None]:
# plot covariance and correlation matrices for La, Ce, Sc, and U

cov = my_sub_dataset.cov()
cor = my_sub_dataset.corr()

fig, ax = plt.subplots(1, 2, figsize=(11,5))

ax[0].set_title("Covariance Matrix")
sns.heatmap(cov, annot=True, cmap="cividis", ax=ax[0])

ax[1].set_title("Correlation Matrix")
sns.heatmap(cor, annot=True, vmin= -1, vmax=1, cmap="coolwarm", ax=ax[1])

fig.tight_layout();
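
To connect the matrices with the formulas above, the sketch below evaluates the covariance and the correlation coefficient by hand for two short hypothetical series and compares them with `np.cov` and `np.corrcoef`:

```python
import numpy as np

# Hypothetical paired measurements, used only for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
n = x.size

# Deviations from the means
dx = x - x.mean()
dy = y - y.mean()

# Sample covariance (denominator n - 1)
cov_xy = np.sum(dx * dy) / (n - 1)

# Correlation coefficient: covariance normalized by the standard deviations
r_xy = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

print("Cov_xy = {:.3f} (numpy: {:.3f})".format(cov_xy, np.cov(x, y)[0, 1]))
print("r_xy   = {:.3f} (numpy: {:.3f})".format(r_xy, np.corrcoef(x, y)[0, 1]))
```

Rescaling one of the variables (for example, converting ppm to weight percent) changes the covariance but leaves the correlation coefficient unchanged.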

## Linear regression

Considering a response variable $Y$ and a predictor $X$, we can define a linear model as follows:

### <div align="center">$Y = \beta_0 + \beta_1 X + \epsilon$</div>

$\beta_0$ is the intercept or predicted value of $Y$ at $X = 0$. $\beta_1$ is the slope or the change in $Y$ per unit change in $X$. The quantity $\epsilon$ is the residual error. Using the least-squares method (i.e. minimizing the sum of squares of the vertical distances from each point to the line defined by the equation above), $\beta_1$ and $\beta_0$ are estimated as:

### <div align="center">$\beta_1=\frac{\sum_{i=1}^n\left(y_i-\bar{y}\right)\left(x_i-\bar{x}\right)}{\sum_{i=1}^n\left(x_i-\bar{x}\right)^2}=\frac{\operatorname{Cov}_{x y}}{\sigma_{s x}^2}=r_{x y} \frac{\sigma_{s y}}{\sigma_{s x}}$</div>

### <div align="center">$\beta_0 = \bar{y} - \beta_1\bar{x}$</div>

The square of the correlation coefficient, $r^2_{xy}$, is a value between 0 and 1, and it is typically used to make a preliminary estimate of the quality of the regression model. 
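
The least-squares estimates above can be evaluated in a few lines. This is a minimal sketch on hypothetical data that applies the formulas for $\beta_1$ and $\beta_0$ directly and checks them against `np.polyfit` with degree 1:

```python
import numpy as np

# Hypothetical predictor (x) and response (y), used only for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

dx = x - x.mean()
dy = y - y.mean()

# Slope: sum of cross-deviations over sum of squared x-deviations
b1 = np.sum(dy * dx) / np.sum(dx ** 2)

# Intercept: the fitted line passes through the point of means
b0 = y.mean() - b1 * x.mean()

# Squared correlation coefficient
r_squared = np.sum(dy * dx) ** 2 / (np.sum(dy ** 2) * np.sum(dx ** 2))

# Cross-check with numpy's polynomial fit (returns slope, then intercept)
slope_np, intercept_np = np.polyfit(x, y, 1)

print("b1 = {:.2f}, b0 = {:.2f}, r^2 = {:.2f}".format(b1, b0, r_squared))
print("polyfit: slope = {:.2f}, intercept = {:.2f}".format(slope_np, intercept_np))
```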

Let's do a linear regression to the La versus Ce data.

In [None]:
# linear regression of La versus Ce

from scipy.stats import linregress

fig, ax = plt.subplots()

# plot La vs Ce
ax.scatter(my_dataset.La, my_dataset.Ce, marker="o", edgecolor="k", color="#c7ddf4")

# linear regression of La vs Ce
b1, b0, rho_value, p_value, std_err = linregress(my_dataset.La, my_dataset.Ce)

# best-fit line
xr = np.linspace(my_dataset["La"].min(),my_dataset.La.max(), 10) # 10 values between min and max La (x values)
yr = b0 + b1*xr # 10 Ce (y) values that follow the best-fit line

# plot best-fit line and add to the legend b0, b1 and r^2
ax.plot(xr, yr, linewidth=1, color="#ff464a", linestyle="--", label=r"fit param.: $\beta_0$ = " 
            + "{:.1f}".format(b0) + r", $\beta_1$ = " + "{:.1f}".format(b1) 
            + r", $r_{xy}^{2}$ = " + "{:.2f}".format(rho_value**2))

ax.set_xlabel("La [ppm]")
ax.set_ylabel("Ce [ppm]")
ax.legend(loc = "upper left");

And a linear regression to the Sc versus U data:

In [None]:
# linear regression of Sc versus U

fig, ax = plt.subplots()

# plot Sc vs U
ax.scatter(my_dataset.Sc, my_dataset.U, marker="o", edgecolor="k", color="#c7ddf4")

# linear regression of Sc vs U
b1, b0, rho_value, p_value, std_err = linregress(my_dataset.Sc, my_dataset.U)

# best-fit line
xr = np.linspace(my_dataset.Sc.min(),my_dataset.Sc.max(), 10) # 10 values between min and max Sc (x values)
yr = b0 + b1*xr # 10 U (y) values that follow the best-fit line

# plot best-fit line and add to the legend b0, b1 and r^2
ax.plot(xr, yr, linewidth=1, color="#ff464a", linestyle="--", label=r"fit param.: $\beta_0$ = " 
            + "{:.1f}".format(b0) + r", $\beta_1$ = " + "{:.1f}".format(b1) 
            + r", $r_{xy}^{2}$ = " + "{:.2f}".format(rho_value**2))

ax.set_xlabel("Sc [ppm]")
ax.set_ylabel("U [ppm]")
ax.legend(loc = "upper left");

So as we concluded by visual inspection before, the correlation is stronger for La versus Ce than for Sc versus U.


# Probability Density Functions

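A probability density function (PDF) is a function associated with a continuous random variable. Its value at any point in the sample space (i.e. the set of possible values of the random variable) gives the relative likelihood that the variable takes a value near that point. The probability that the variable falls within an interval is the integral of the PDF over that interval, and the PDF integrates to 1 over the whole sample space.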

## The Normal Distribution

The normal distribution is a bell-shaped PDF that occurs in many situations. The normal PDF is defined as follows:

### <div align="center">$\operatorname{PDF}_{\mathrm{N}}(x, \mu, \sigma)=\frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2 \sigma^2}}$</div>

where $\mu$ and $\sigma$ are the mean and the standard deviation, respectively.

The `scipy.stats.norm.pdf` function provides the PDF for a normal distribution. In the code below, we also use the `numpy.random.normal` function to generate a random sample from a normal distribution:

In [None]:
# normal distribution

from scipy.stats import norm

mu = 0 # mean
sigma = 1 # standard deviation
normal_sample = np.random.normal(mu, sigma, 15000) # random sample from normal distribution

# plot the histogram of the sample distribution
fig, ax = plt.subplots()
ax.hist(normal_sample, bins="auto", density=True, color="#c7ddf4", edgecolor="k",
        label="Random sample with normal distribution")

# probability density function
x = np.arange(-5,5, 0.01)
normal_pdf = norm.pdf(x, loc= mu, scale = sigma)
ax.plot(x, normal_pdf, color="#ff464a", linewidth=1.5, linestyle="--",
        label=r"Normal PDF with $\mu$=0 and $\sigma$=1")
ax.legend(title="Normal Distribution")
ax.set_xlabel("x")
ax.set_ylabel("Probability Density")
ax.set_xlim(-5,5)
ax.set_ylim(0,0.6);

# Descriptive statistics
s_mean = normal_sample.mean()
s_stddev = normal_sample.std(ddof=1) # ddof=1 gives the sample standard deviation (NumPy defaults to ddof=0)
print("Sample mean = {:.4f}".format(s_mean))
print("Sample standard deviation = {:.4f}".format(s_stddev))
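
The normal PDF formula can also be coded directly. The sketch below uses hypothetical parameter values and a simple Riemann sum to check two properties numerically: the total area under the curve is 1, and about 68% of the probability lies within one standard deviation of the mean. It also shows that the density itself can exceed 1 when $\sigma$ is small, since it is a density and not a probability:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Normal PDF evaluated from its definition."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# Hypothetical parameters, used only for illustration
mu, sigma = 2.0, 0.2

# Fine grid covering mu +/- 8 sigma
dx = 0.0001
x = np.arange(mu - 8 * sigma, mu + 8 * sigma, dx)
pdf = normal_pdf(x, mu, sigma)

# Riemann-sum approximations of the integrals
total_area = np.sum(pdf) * dx
within_1_sigma = np.abs(x - mu) <= sigma
area_1_sigma = np.sum(pdf[within_1_sigma]) * dx

print("Total area          = {:.3f}".format(total_area))
print("Area within 1 sigma = {:.3f}".format(area_1_sigma))
print("Peak density        = {:.3f}".format(pdf.max()))
```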

## The Log-Normal Distribution

The log-normal distribution is a continuous probability distribution of a random variable whose logarithm is normally distributed. The PDF for a log-normal distribution is:

### <div align="center">$\operatorname{logPDF}_{\mathrm{N}}\left(x, \mu_n, \sigma_n\right)=\frac{1}{x} \frac{1}{\sqrt{2 \pi \sigma_n^2}} \exp \left\{-\frac{\left[\log (x)-\mu_n\right]^2}{2 \sigma_n^2}\right\}$</div>

where $\mu_n$ and $\sigma_n$ are the mean and the standard deviation of the normal distribution followed by the natural logarithm (log) of the random variable.

The `scipy.stats.lognorm.pdf` function provides the PDF for a log-normal distribution, and the `scipy.stats.lognorm.rvs` function generates random numbers following the log-normal distribution. The code below generates random samples with log-normal distributions for several values of $\mu_n$ and $\sigma_n$:

In [None]:
# log normal distribution

from scipy.stats import lognorm

colors = ["#342a77", "#ff464a", "#4881e9"]
normal_mu = [0,0.5,1]
normal_sigma = [0.5,0.4,0.3]
x = np.arange(0.001, 7, .001) # for the log-normal PDF 
x1 = np.arange(-2.5, 2.5, .001) # for the normal PDF

fig, ax = plt.subplots(1, 2, figsize=(14,4))

for mu_n, sigma_n, color in zip(normal_mu, normal_sigma, colors):
    
    # log-normal distribution of x
    lognorm_pdf = lognorm.pdf(x, s=sigma_n, scale=np.exp(mu_n)) # pdf
    r = lognorm.rvs(s=sigma_n, scale=np.exp(mu_n), size=15000) # random sample
    ax[0].plot(x, lognorm_pdf, color=color, label=r"$\mu_n$ =" + str(mu_n) + r"- $\sigma_n$ =" + str(sigma_n))
    ax[0].hist(r, bins="auto", density=True, color=color, edgecolor="k", alpha=0.5)
    
    # normal distribution of log(x)
    normal_pdf = norm.pdf(x1, loc= mu_n, scale = sigma_n) # pdf
    logr= np.log(r) # random sample
    ax[1].plot(x1, normal_pdf, color=color, label=r"$\mu_n$ =" + str(mu_n) + r"- $\sigma_n$ =" + str(sigma_n))
    ax[1].hist(logr, bins="auto", density=True, color=color, edgecolor="k", alpha=0.5)
    my_mu = logr.mean()
    ax[1].axvline(x=my_mu, color=color, linestyle="--", label=r"calculated $\mu_n$ =" + str(round(my_mu,3)))
    
ax[0].legend(title="log-normal distributions") 
ax[0].set_xlabel("x") 
ax[0].set_ylabel("Probability Density") 
ax[1].legend(title="normal distributions") 
ax[1].set_xlabel("ln(x)") 
ax[1].set_ylabel("Probability Density")

fig.tight_layout(); 
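
The link between the two parameterizations can be verified with `numpy` alone. The sketch below draws a log-normal sample with `numpy.random.Generator.lognormal` (hypothetical parameter values and seed), recovers $\mu_n$ and $\sigma_n$ from the logarithm of the sample, and compares the sample mean with the theoretical mean of a log-normal distribution, $\exp(\mu_n + \sigma_n^2/2)$:

```python
import numpy as np

# Hypothetical parameters of the underlying normal distribution
mu_n, sigma_n = 0.5, 0.4

# Reproducible log-normal sample
rng = np.random.default_rng(seed=0)
sample = rng.lognormal(mean=mu_n, sigma=sigma_n, size=100_000)

# The logarithm of the sample should be normally distributed
log_sample = np.log(sample)
mu_est = log_sample.mean()
sigma_est = log_sample.std(ddof=1)

# Theoretical mean of the log-normal variable itself
theoretical_mean = np.exp(mu_n + sigma_n ** 2 / 2)

print("estimated mu_n    = {:.3f}".format(mu_est))
print("estimated sigma_n = {:.3f}".format(sigma_est))
print("sample mean = {:.3f}, theoretical mean = {:.3f}".format(sample.mean(), theoretical_mean))
```

Note that the mean of the log-normal variable is larger than $\exp(\mu_n)$, which is its median, because of the long right tail.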

The [`scipy.stats`](https://docs.scipy.org/doc/scipy/reference/stats.html) module has many other probability distributions that are useful in geological applications.


## Kernel density estimators

Kernel density estimators (KDE) are used to estimate a PDF from the observed data. Assume $(x_1, x_2, \ldots, x_i, \ldots, x_n)$ is a univariate, independent, and identically distributed sample (i.e. all $x_i$ have the same probability distribution) drawn from a distribution with unknown PDF. We are interested in estimating the shape $\hat{f}$ of this PDF. The equation defining a KDE is:

### <div align="center">$\hat{f}(x)=\frac{1}{n h} \sum_{i=1}^n K\left(\frac{x-x_i}{h}\right)$</div>

where $K$ is the kernel, a non-negative function that integrates to unity (i.e. $\int_{-\infty}^{\infty} K(x) d x=1$), and $h>0$ is a smoothing parameter called the bandwidth. A range of kernel functions are commonly used: normal, uniform, triangular, etc. 
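
The KDE equation translates almost literally into `numpy`. The sketch below is a hand-written estimator with a Gaussian kernel (the function names, sample values, and bandwidths are hypothetical). It checks that the estimate integrates to approximately 1 and shows that a smaller bandwidth gives a sharper, higher peak:

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density, a kernel that integrates to 1."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kde_estimate(x_eval, sample, h):
    """f_hat(x) = 1/(n h) * sum_i K((x - x_i) / h), evaluated on x_eval."""
    u = (x_eval[:, None] - sample[None, :]) / h
    return gaussian_kernel(u).sum(axis=1) / (sample.size * h)

# Hypothetical observations forming two clusters
sample = np.array([2.0, 2.5, 3.0, 6.0, 6.5])

# Evaluation grid, wide enough to capture the tails
x_eval = np.linspace(-5, 15, 2001)
dx = x_eval[1] - x_eval[0]

for h in (0.3, 1.0):
    f_hat = kde_estimate(x_eval, sample, h)
    area = np.sum(f_hat) * dx  # Riemann-sum approximation of the integral
    print("h = {:.1f}: area = {:.3f}, peak = {:.3f}".format(h, area, f_hat.max()))
```

A bandwidth that is too small produces a spiky estimate that follows individual observations, while one that is too large smooths away real structure such as the two clusters here.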

Python offers several implementations of KDE. The code below plots the kernel functions available in the `KDEUnivariate` class of the `statsmodels` package:

In [None]:
# plot kernel functions

from statsmodels.nonparametric.kde import KDEUnivariate 

kernels = ["gau", "epa", "uni", "tri", "biw", "triw"] # kernel functions
kernels_names = ["Gaussian", "Epanechnikov", "Uniform", "Triangular", "Biweight", "Triweight"] 

fig, ax = plt.subplots()

for kernel, kernel_name in zip(kernels, kernels_names):
    # kernels
    kde = KDEUnivariate([0]) # kernel at x = 0
    kde.fit(kernel= kernel, bw=1, fft=False, gridsize=2**10) 
    ax.plot(kde.support, kde.density, label = kernel_name, linewidth=1.5, alpha=0.8)

ax.set_xlim(-2,2)
ax.grid()
ax.legend(title="Kernel functions");

The code below shows the application of KDE to the data of Zr concentrations, and how the bandwidth affects the resulting KDE estimate:

In [None]:
# Application of the kernel to Zr concentrations

zr_eval = np.arange(0, 1100, 1)

fig, ax = plt.subplots(2, 1, figsize=(7,8))

# Density Histogram
ax[0].hist(zr, bins=20, density=True, label="Density Histogram", color="#c7ddf4", edgecolor="k")
# KDE
kde = KDEUnivariate(zr)
kde.fit() # This uses the Gaussian kernel and normal_reference bandwidth
my_kde = kde.evaluate(zr_eval)
ax[0].plot(zr_eval, my_kde, linewidth=1.5, color="#ff464a", label="gaussian KDE - auto bandwidth selection") 
ax[0].set_xlabel("Zr [ppm]") 
ax[0].set_ylabel("Probability density") 
ax[0].set_ylim(0, 0.006)
ax[0].legend();

# Density Histogram
ax[1].hist(zr, bins=20, density=True, label="Density Histogram", color="#c7ddf4", edgecolor="k")
# KDE
# Effect of bandwidth
for my_bw in [10,50,100]:
    kde = KDEUnivariate(zr)
    kde.fit(bw = my_bw)
    my_kde = kde.evaluate(zr_eval)
    ax[1].plot(zr_eval, my_kde, linewidth = 1.5, label="gaussian KDE - bw: " + str(my_bw))
    
ax[1].set_xlabel("Zr [ppm]") 
ax[1].set_ylabel("Probability density") 
ax[1].set_ylim(0, 0.006)
ax[1].legend()

fig.tight_layout();

That's it. In the next notebook, we will look at uncertainties or errors and their propagation.

To practice, try the exercise in [lab3_1](../lab/lab3_1.ipynb).