# Module 5: Statistics

## Overview: Descriptive vs. Comparative 
This notebook focuses on simple descriptive and comparative statistics. By descriptive statistics, I mean things like distribution shape, quantiles, mean, and median - numbers that summarize a single distribution (you will sometimes hear these loosely called parameters). So then, what are comparative statistics? Exactly what they sound like: a statistic that compares two distributions. We will focus on familiar things like Student's t-test (a parametric test, which assumes a normal distribution parameterized by a mean and standard deviation) and also some non-parametric methods that don't assume a distribution shape and are often much more useful. The KS (Kolmogorov-Smirnov) test is a great example of a non-parametric method, and it works well for geoscience data, which are often not normally distributed. Don't be too scared of the jargon here; it is relatively easy to understand and use.

For questions on this notebook, ask them on the [GEOL 557 slack](https://join.slack.com/t/minesgeo/shared_invite/zt-cqawm4lu-Zcfpf4mBLwjnksY6_umlKA)<a href="https://join.slack.com/t/minesgeo/shared_invite/zt-cqawm4lu-Zcfpf4mBLwjnksY6_umlKA">
<img src="https://cdn.brandfolder.io/5H442O3W/as/pl546j-7le8zk-ex8w65/Slack_RGB.svg" alt="Go to the GEOL 557 slack" width="100">
</a>

## Instructions
Work through this notebook - there will be several places where you need to fill-in-the-blank or write some code into an open cell. When you are finished, make sure to use the Colab menu (not the browser-level menu) to do the following:
- Expand all the sections (in the Colab menu, choose View --> Expand sections)
- Save the notebook as a pdf, again using the Colab menu, using File --> Print --> Save as PDF. 

--- 
## Course
**GEOL 557 Earth Resource Data Science I: Fundamentals**. GEOL 557 forms part 2 of the four-part course series for the "Earth Resource Data Science" online graduate certificate at Mines - [learn more about the certificate here](https://online.mines.edu/er/)

Notebook created by **Zane Jobe** and **Thomas Martin**, [CoRE research group](https://core.mines.edu), Colorado School of Mines

[![Twitter URL](https://img.shields.io/twitter/url/https/twitter.com/ZaneJobe.svg?style=social&label=Follow%20%40ZaneJobe)](https://twitter.com/ZaneJobe)
and [![Twitter URL](https://img.shields.io/twitter/url/https/twitter.com/ThomasM_geo.svg?style=social&label=Follow%20%40ThomasM_geo)](https://twitter.com/ThomasM_geo) on Twitter  

-------------------
# OK, let's do some coding!
## First, let's import some packages and data

Updated October - 2023 - Thomas Martin

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

import seaborn as sns
import scipy.stats as stats

In [None]:
# these next two things shouldn't need to be changed if you set up your Google Drive folder correctly (see Module 1)
folder_path = '../1_data/' # makes a path
file_name = 'Fryer_and_Jobe_2019_turbidite_beds_partial.csv' # file name

df=pd.read_csv(folder_path + file_name) # uses pandas to read in the csv as a 'DataFrame' called df

df.head()

### And make a simple plot
Let's look at the data using a CDF

In [None]:
# set up a two-panel figure: linear thickness on the left, log10 thickness on the right

fig, [ax, ax2] = plt.subplots(1, 2, figsize=[12, 6])

# linear data
sns.kdeplot(
    df.thickness_m[df.lithology == 'mud'],
    ax=ax,
    cumulative=True,
    color="xkcd:grey",
    fill=True,
    linewidth=5,
    label='mud',
)
sns.kdeplot(
    df.thickness_m[df.lithology == 'sand'],
    ax=ax,
    cumulative=True,
    color="xkcd:yellow",
    fill=True,
    linewidth=5,
    label='sand',
)

ax.set_xlabel('Bed Thickness (m)', fontsize=14)
ax.set_ylabel('Cumulative probability', fontsize=12)
ax.set_title('Bed Thickness Distribution (linear)', fontsize=18)

ax.axhline(y=0.1, color='k', linestyle='--', linewidth=0.5)
ax.text(10, 0.11, r'$ P_{10}$', fontsize=14)

ax.axhline(y=0.5, color='k', linestyle='--', linewidth=0.5)
ax.text(10, 0.51, r'$ P_{50}$', fontsize=14)

ax.axhline(y=0.9, color='k', linestyle='--', linewidth=0.5)
ax.text(10, 0.91, r'$ P_{90}$', fontsize=14)

ax.text(10, 0.7, 'mean = ' + str(round(df.thickness_m[df.lithology == 'sand'].mean(), 2)) + ' m')

# logged data
sns.kdeplot(
    df.log10thickness_m[df.lithology == 'mud'],
    ax=ax2,
    cumulative=True,
    color="xkcd:grey",
    fill=True,
    linewidth=5,
    label='mud',
)
sns.kdeplot(
    df.log10thickness_m[df.lithology == 'sand'],
    ax=ax2,
    cumulative=True,
    color="xkcd:yellow",
    fill=True,
    linewidth=5,
    label='sand',
)

ax2.set_xlabel('Bed Thickness (m)', fontsize=14)
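ax2.set_xlim(-3, 2)
ax2.set_xticks(range(-3, 3))  # pin the tick positions so the labels below line up with them
ax2.set_xticklabels([10**p for p in range(-3, 3)], fontsize=12, rotation=45)
ax2.set_ylabel('Cumulative probability', fontsize=12)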
ax2.set_title('Bed Thickness Distribution (log10)', fontsize=18)

ax2.axhline(y=0.1, color='k', linestyle='--', linewidth=0.5)
ax2.text(1, 0.11, r'$ P_{10}$', fontsize=14)

ax2.axhline(y=0.5, color='k', linestyle='--', linewidth=0.5)
ax2.text(1, 0.51, r'$ P_{50}$', fontsize=14)

ax2.axhline(y=0.9, color='k', linestyle='--', linewidth=0.5)
ax2.text(1, 0.91, r'$ P_{90}$', fontsize=14)

ax2.text(0, 0.7, 'mean = ' + str(round(10**df.log10thickness_m[df.lithology == 'sand'].mean(), 2)) + ' m')

plt.tight_layout()
plt.show()


The mean is printed above in the plot - let's calculate it again here:


In [None]:
print('The mean sand bed thickness is',round(np.mean(df.thickness_m[df.lithology=='sand']),3),'meters')

print('Converted back to meters, the mean sand bed thickness calculated from the log10 converted data is'
,round(10**df.log10thickness_m[df.lithology=='sand'].mean(),3),'m')

### Now you comment
What?? Why are they different? Which one is right? Please leave a comment below explaining why this is:

your comment here

### Distributions
You can see below how to use `numpy` to get the exact percentiles for the data, both the non-logged data and the logged data:


In [None]:
print(10**np.percentile(df.log10thickness_m[df.lithology=='sand'], [10,50,90]))
print(np.percentile(df.thickness_m[df.lithology=='sand'], [10,50,90]))

Note that the percentiles calculated using the non-logged data and the log10 data are essentially the same (any small differences come from how `np.percentile` interpolates between data points). 
A few questions to answer as a comment:
- Why is that? 
- Which one of the two plots above would you use to estimate P10, P50, and P90 values?
- What would that tell you about the 'normality' of this data?

your comment here

## Now you try (coding)
You already can see that the data is log-normal - see if you can find a descriptive statistic for the non-logged data that approximates the mean of the logged data 

(hint 1 - there are at least two metrics you can use)

(hint 2 - we imported `scipy.stats` above - maybe there is something in there to use?) 

In [None]:
# your code here

### Q-Q plots
It's also super easy to make Q-Q plots to test if your data is normally distributed or not. If you think your data is log10-normal, you can log10 the data and then Q-Q plot it, like I do below:

In [None]:
fig, ax = plt.subplots(1,2, figsize=[10,5])
stats.probplot(df.thickness_m[df.lithology=='sand'], plot=ax[0], dist='norm')
ax[0].set_title('Sand bed thickness')
stats.probplot(df.log10thickness_m[df.lithology=='sand'], plot=ax[1], dist='norm')
ax[1].set_title('Log10 sand bed thickness')

Leave a comment below explaining what these plots are telling you about the distribution of the data:

your comment here
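If you want a number to go with the visual check, here is a minimal sketch using made-up log-normal data (not the turbidite dataset). When you call `stats.probplot` without a `plot` argument, it returns the Q-Q points plus a least-squares fit, and the correlation coefficient `r` of that fit gets closer to 1 as the points fall closer to a straight line. The variable names here are just for illustration.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable

# synthetic, made-up "thickness" values drawn from a log-normal distribution
fake_thickness = rng.lognormal(mean=0.0, sigma=1.0, size=500)

# with no `plot` argument, probplot returns the Q-Q points and a straight-line fit
_, (slope_raw, intercept_raw, r_raw) = stats.probplot(fake_thickness, dist='norm')
_, (slope_log, intercept_log, r_log) = stats.probplot(np.log10(fake_thickness), dist='norm')

print('r for raw values  :', round(r_raw, 3))
print('r for log10 values:', round(r_log, 3))
```

Treat `r` as a rough guide rather than a formal normality test - the plot itself is still the best way to see where the data depart from the line.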

### Now you try (coding and comment)
Make a Q-Q plot for mud interval thickness, and leave a comment describing this plot

In [None]:
# your code here

your comment here

## OK, one final plot to summarize the thickness data
Using a hist-kde combo:


In [None]:
fig, [ax1, ax2] = plt.subplots(1,2, figsize=[12,6], sharex=True)
# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True draws the same hist-kde combo
sns.histplot(df.log10thickness_m[df.lithology=='mud'], ax=ax1, kde=True, stat='density', color='xkcd:dark grey', edgecolor='black', line_kws=dict(linewidth=5))
sns.histplot(df.log10thickness_m[df.lithology=='sand'], ax=ax2, kde=True, stat='density', color='xkcd:dark yellow', edgecolor='black', line_kws=dict(linewidth=5))
plt.show()

Don't those look nice? Would be nice to have some statistics on these though. 

## Now you try
Calculate the P10, P50, P90, arithmetic mean, and geometric mean for the `sand` and `mud` bed thickness data plotted above - should you use the log10 data or the non-logged data? 

Then think about how you would compare them using comparative methods (but don't do any of that now, we will do it below)...

In [None]:
# your code here

![halfway there](https://media1.tenor.com/images/34ed039f6dddf44f77fd34394f0a9322/tenor.gif?itemid=13850588)

# Comparative statistics

OK, now you are a pro at descriptive stats, let's do some comparisons!

The most interesting comparison you might want to make is between sand and mud bed thickness data you described above. 

First, let's just plot them again, this time just for submarine channel settings:

In [None]:
df.environment.value_counts()

In [None]:
channel = df[df.environment == 'Channel']
sns.kdeplot(channel.log10thickness_m[channel.lithology=='mud'], color='xkcd:grey')
sns.kdeplot(channel.log10thickness_m[channel.lithology=='sand'], color='xkcd:yellow');

OK, looks like the sand and mud distributions are quite different! Let's do a simple linear regression using the `regplot` in seaborn

In [None]:
# x and y are passed as keywords (recent seaborn versions require this)
sns.regplot(x=channel.log10thickness_m[channel.lithology=='mud'],
            y=channel.log10thickness_m[channel.lithology=='sand'])

# THIS WILL ERROR - read below to find out why

Hmm, an error. Why doesn't that work? What can we do to allow the data to plot?

---

Resampling is clearly the way to go, but how do we do that? 

In [None]:
# first a primer on resampling:

np.random.seed(0)

mu = 200
sigma = 25

# make up some data
x = np.random.normal(mu, sigma, size=100)

fig = plt.figure(figsize=[15,5])

# let's make a quick function to avoid copy-pasting plotting routines below
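def plot_and_sort_legend(ax,marker):
  # empirical CDF: sorted values on x, cumulative fraction of points on y
  ax.plot(np.sort(x), np.linspace(0, 1, len(x), endpoint=False), marker, label='Empirical CDF')
  sns.kdeplot(x, cumulative=True, ax=ax, label='KDE model of CDF')
  ax.legend() # attach the legend to this axis rather than whichever axis is current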

ax1 = plt.subplot(131)
plot_and_sort_legend(ax1,'-')
ax1.set_ylim([-0.05, 1.05])
ax1.set_title('A: Comparing CDFs')
ax1.add_patch(Rectangle((180,0.3),30,0.2,linewidth=1,edgecolor='r',facecolor='none'))
ax1.text(180,0.45,'part B')
ax1.add_patch(Rectangle((130,0),50,0.2,linewidth=1,edgecolor='r',facecolor='none'))
ax1.text(130,0.15,'part C')

ax2=plt.subplot(132)
plot_and_sort_legend(ax2,'-o')
ax2.set_ylim([0.3, 0.5])
ax2.set_xlim([180, 210])
ax2.set_title('B: Comparing CDFs - zoom in 1')

ax3=plt.subplot(133)
plot_and_sort_legend(ax3,'-o')
ax3.set_ylim([0, 0.2])
ax3.set_xlim([130, 180])
ax3.set_title('C: Comparing CDFs - zoom in 2')

plt.tight_layout()
plt.show()

As you can see above, the blue is the original data (the empirical CDF), and the orange is a smoothed, modeled version of that data (using a KDE). The modeled curve (orange) pretty closely matches the original data (blue), but there are a few places where it's not perfect. This is an issue that doesn't go away and propagates through future calculations when you resample data, so keep that in mind...

That is how a KDE works, by modeling the data - keep that in mind too, that it is a model, not data. It's so nice and smooth though, so that's why people love them...
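To put a number on that mismatch, here is a rough sketch that compares the empirical CDF to a KDE-based CDF on made-up normal data. It assumes `scipy.stats.gaussian_kde` with its default bandwidth, so the curve may not be identical to the one seaborn draws above, and the variable names are just for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
x = np.sort(rng.normal(200, 25, size=100))  # made-up data, same mu and sigma as the primer

# empirical CDF: fraction of points less than or equal to each sorted value
ecdf = np.arange(1, len(x) + 1) / len(x)

# KDE model of the CDF: integrate the fitted density from -infinity up to each value
kde = gaussian_kde(x)
kde_cdf = np.array([kde.integrate_box_1d(-np.inf, xi) for xi in x])

max_gap = np.max(np.abs(ecdf - kde_cdf))
print('largest gap between empirical CDF and KDE CDF:', round(max_gap, 3))
```

The gap is small but not zero, which is the point: the KDE is a smooth model of the data, not the data itself.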

OK, back to resampling. Here is a more concrete example. It's pretty easy to do this with `numpy` using the random choice function - this function basically collects random samples from a distribution, and there are MANY options for how to do this. This one is about the simplest way to do it:

In [None]:
n=1000 # how many samples to make when doing the resampling

channel_sand_resamp = np.random.choice(channel.log10thickness_m[channel.lithology=='sand'], n)
channel_mud_resamp = np.random.choice(channel.log10thickness_m[channel.lithology=='mud'], n)

print('number of original sand thickness samples:',len(channel.log10thickness_m[channel.lithology=='sand']))
print('length of resampled sand thickness:',len(channel_sand_resamp)) # this is the same as n

We chose to resample the data using 1000 samples - you could resample to 10 or 1 million, but it's usually not wise to resample your data beyond its original length. 
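Here is a small sketch of what that random choice is doing, using made-up data in place of the log10 thickness column. It assumes NumPy's newer `default_rng` generator rather than the `np.random.choice` call above, but the idea is the same: values are drawn with replacement, so `n` can be smaller or larger than the original sample.

```python
import numpy as np

rng = np.random.default_rng(42)
original = rng.normal(0.0, 0.5, size=200)  # made-up stand-in for log10 thickness

for n in [10, 100, 1000]:
    # replace=True (the default) draws with replacement, so n can exceed len(original)
    resamp = rng.choice(original, size=n, replace=True)
    print(n, 'samples -> mean', round(resamp.mean(), 3), '| original mean', round(original.mean(), 3))

# every resampled value is one of the original values; no new information is created
all_from_original = np.isin(resamp, original).all()
print('all resampled values come from the original data:', all_from_original)
```

Small `n` gives a noisy picture of the original, while a very large `n` just repeats the same values over and over - that is why resampling far beyond the original length doesn't buy you anything.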

Let's make a plot to see how closely the original data matches the resampled data. If you want, change the `n=1000` to a different value and rerun these cells to see how it changes the accuracy of the resampling

In [None]:
sns.kdeplot(channel.log10thickness_m[channel.lithology=='mud'],label='all mud data') # data
sns.kdeplot(channel.log10thickness_m[channel.lithology=='sand'],label='all sand data') #data
sns.kdeplot(channel_mud_resamp, label='resamp mud data'); # resampled
sns.kdeplot(channel_sand_resamp, label='resamp sand data'); #resampled

You can see the original data, and the resampled data - they look very similar, right? That is good, it means the resampling is doing a good job of representing the data. 

IMPORTANTLY: This type of resampling is the way you can plot and compare two datasets with each other that have different lengths. So, the way to fix the error we got above is to resample the data, and then make a new linear regression:

In [None]:
# let's do a regression
sns.regplot(x=channel_mud_resamp,y=channel_sand_resamp);

No error! But, that looks pretty gross. The reason this doesn't really work is that you are plotting x-y pairs that have been randomly resampled from a distribution - so, this is NOT a good plot to make. 

Instead, we need to use a comparative statistic that compares the *distributions*, not individual x-y pairs
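A quick sketch of why the x-y pairing is meaningless, using made-up mud and sand values of different lengths (the names and numbers are illustrative only). Because each resample is drawn independently, the pairs are arbitrary and the correlation between them should sit close to zero no matter how different the two distributions are.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1)
fake_mud = rng.normal(-1.0, 0.4, size=300)   # made-up log10 thickness values
fake_sand = rng.normal(-0.3, 0.5, size=150)  # a different length on purpose

n = 1000
mud_resamp = rng.choice(fake_mud, n)
sand_resamp = rng.choice(fake_sand, n)

# the pairing is arbitrary, so the correlation should be close to zero
r, p = stats.pearsonr(mud_resamp, sand_resamp)
print('Pearson r between randomly paired resamples:', round(r, 3))
```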

## Comparing distributions

First, let's start with a parametric test:

In [None]:
print('log10 data =',stats.ttest_ind(df.log10thickness_m[df.lithology=='mud'],df.log10thickness_m[df.lithology=='sand']))

print('non-logged data =',stats.ttest_ind(df.thickness_m[df.lithology=='mud'],df.thickness_m[df.lithology=='sand']),'\n') # add a blank line with \n

print('🙁🙁 - T test results are way different if you use log10-converted data')

So let's try a non-parametric test instead and see what happens:

In [None]:
print('log10 data =',stats.ks_2samp(df.log10thickness_m[df.lithology=='mud'],df.log10thickness_m[df.lithology=='sand']))

print('non-logged data =',stats.ks_2samp(df.thickness_m[df.lithology=='mud'],df.thickness_m[df.lithology=='sand']),'\n') # add a blank line with \n

print('😀😀 - Same p-value for KS Test - gotta love non-parametric methods')
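If you want to convince yourself this isn't a quirk of the turbidite data, here is a sketch with made-up log-normal samples (names and parameters are illustrative). The KS statistic only depends on how the two samples are ordered relative to each other, and taking log10 doesn't change that ordering, whereas the t-test works on means and variances, which do change.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(2)
fake_mud = rng.lognormal(mean=-2.0, sigma=1.0, size=400)   # made-up thickness values
fake_sand = rng.lognormal(mean=-1.0, sigma=1.0, size=300)

# KS test on raw and log10-transformed values
ks_raw = stats.ks_2samp(fake_mud, fake_sand)
ks_log = stats.ks_2samp(np.log10(fake_mud), np.log10(fake_sand))

# t-test on raw and log10-transformed values
t_raw = stats.ttest_ind(fake_mud, fake_sand)
t_log = stats.ttest_ind(np.log10(fake_mud), np.log10(fake_sand))

print('KS statistic, raw vs log10:', ks_raw.statistic, ks_log.statistic)
print('t statistic,  raw vs log10:', t_raw.statistic, t_log.statistic)
```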

### Now you try
Take the resampled data from above, and compare it to the original data using a t-test and a KS-test. Also, go to the [scipy-stats docs](https://docs.scipy.org/doc/scipy/reference/stats.html) and choose one other non-parametric comparative statistic, and implement it as well. So, three tests total!

Also - try changing the `n` value for resampling above, and re-running these tests to see how `n` affects the results. You don't need to show that in your answer, but try it out just to see! 

In [None]:
# your code here

In [None]:
# Bonus - it's also handy to loop these tests for multiple distributions:

for groups, values in df.groupby('environment'):
  mud_vals=values["log10thickness_m"][values["lithology"]=='mud']
  sand_vals=values["log10thickness_m"][values["lithology"]=='sand']
  print(groups,'sand vs mud KS test',stats.ks_2samp(mud_vals,sand_vals)) # KS Test
  print(groups,'sand vs mud Kruskal-Wallis test',stats.kruskal(mud_vals,sand_vals)) # Kruskal-Wallis
  print(' ') # white space
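Printed results get hard to read once you have more than a few groups, so one option is to collect them in a table instead. This is just a sketch on a made-up DataFrame that borrows the notebook's column names - the 'Levee' environment and all the values are invented for illustration.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

rng = np.random.default_rng(3)
# made-up table with the same column names as the notebook's df
fake = pd.DataFrame({
    'environment': rng.choice(['Channel', 'Levee'], size=400),
    'lithology': rng.choice(['sand', 'mud'], size=400),
    'log10thickness_m': rng.normal(-0.5, 0.5, size=400),
})

rows = []
for env, values in fake.groupby('environment'):
    mud_vals = values.loc[values.lithology == 'mud', 'log10thickness_m']
    sand_vals = values.loc[values.lithology == 'sand', 'log10thickness_m']
    ks = stats.ks_2samp(mud_vals, sand_vals)
    kw = stats.kruskal(mud_vals, sand_vals)
    # one row of results per environment
    rows.append({'environment': env, 'n_mud': len(mud_vals), 'n_sand': len(sand_vals),
                 'ks_p': ks.pvalue, 'kruskal_p': kw.pvalue})

results = pd.DataFrame(rows).set_index('environment')
print(results.round(3))
```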

![Glad that's over](https://media1.tenor.com/images/0fc841206cdd725df4aea0c17b99bfc0/tenor.gif?itemid=15513320)