# Module 4: Data viz

## Overview: `matplotlib` and `seaborn`
Matplotlib is the workhorse for Python plotting, and seaborn makes things look pretty. Since you will spend lots of time plotting things to explore datasets and show off your results, these two packages are very, very useful to be familiar with. 

For questions on this notebook, ask them on the [GEOL 557 slack](https://join.slack.com/t/minesgeo/shared_invite/zt-cqawm4lu-Zcfpf4mBLwjnksY6_umlKA)<a href="https://join.slack.com/t/minesgeo/shared_invite/zt-cqawm4lu-Zcfpf4mBLwjnksY6_umlKA">
<img src="https://cdn.brandfolder.io/5H442O3W/as/pl546j-7le8zk-ex8w65/Slack_RGB.svg" alt="Go to the GEOL 557 slack" width="100">
</a>

## Instructions
Work through this notebook - there will be several places where you need to fill-in-the-blank or write some code into an open cell. When you are finished, make sure to use the Colab menu (not the browser-level menu) to do the following:
- Expand all the sections (in the Colab menu, choose View --> Expand sections) 
- Save the notebook as a pdf, again using the Colab menu, using File --> Print --> Save as PDF. 

--- 
## Course
**GEOL 557 Earth Resource Data Science I: Fundamentals**. GEOL 557 forms part 2 of the four-part course series for the "Earth Resource Data Science" online graduate certificate at Mines - [learn more about the certificate here](https://online.mines.edu/er/)

Notebook created by **Zane Jobe** and **Thomas Martin**, [CoRE research group](https://core.mines.edu), Colorado School of Mines

[![Twitter URL](https://img.shields.io/twitter/url/https/twitter.com/ZaneJobe.svg?style=social&label=Follow%20%40ZaneJobe)](https://twitter.com/ZaneJobe)
and [![Twitter URL](https://img.shields.io/twitter/url/https/twitter.com/ThomasM_geo.svg?style=social&label=Follow%20%40ThomasM_geo)](https://twitter.com/ThomasM_geo) on Twitter 

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
# these next two things shouldn't need to be changed if you set up your Google Drive folder correctly (see Module 1)
folder_path = 'gdrive/My Drive/GEOL557_F22/data/' # makes a path
file_name = 'Sharman_ExampleDataset_1.xlsx' # file name

df=pd.read_excel(folder_path + file_name, sheet_name='ZrUPb')
df.info()

In [None]:
df.plot(x='75Age', y='68Age')

Hmm, that doesn't look awesome, as all the points are connected with a line by default. Since the plotting functions in pandas are built using matplotlib anyway, let's just use matplotlib to do the plotting - that way, we will have more control over how things look. 

In [None]:
plt.plot(df['75Age'], df['68Age'])

See how that gives us exactly the same thing? That's because pandas just calls the matplotlib `plot` function. We can make it look way better, but first let's learn about:

## Anatomy of a figure
### Read this:
This article is a great place to [learn about the components of a matplotlib figure](https://matplotlib.org/tutorials/introductory/usage.html#sphx-glr-tutorials-introductory-usage-py).
Here is the key diagram:
![Anatomy of a figure](https://matplotlib.org/_images/anatomy.png)
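
To connect the diagram to code, here is a minimal sketch that touches the main pieces (Figure, Axes, titles, axis labels, ticks, grid, legend). The data are made up purely for illustration and are not part of the course dataset:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data, just to have something to draw
x = np.linspace(0, 10, 50)
y = np.sin(x)

fig, ax = plt.subplots(figsize=(6, 4))        # the Figure (canvas) and one Axes (plot area)
line, = ax.plot(x, y, 'ro', label='sin(x)')   # a Line2D artist that lives on the Axes
ax.set_title('Axes title')                    # title of this Axes
ax.set_xlabel('x-axis label')
ax.set_ylabel('y-axis label')
ax.set_xticks(np.arange(0, 11, 2))            # major tick locations on the x-axis
ax.grid(True)
ax.legend()
fig.suptitle('Figure title')                  # title of the whole Figure
```

Try changing one line at a time and find the matching label in the diagram above.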



In [None]:
# now try it with red circles
plt.plot(df['75Age'], df['68Age'], 'ro')

See that text above the plot `[<matplotlib.lines.Line2D at random_numbers>]`? We can make that go away by either adding a semicolon to the end of the plotting statement, or having a line that says `plt.show()`

In [None]:
# or blue stars
plt.plot(df['75Age'], df['68Age'], 'b*')
plt.show()

In [None]:
# It's generally good to set things up this way

# Initialize a new figure
fig, ax = plt.subplots()

fig.suptitle('Hello, I am a figure title')

# Make the plots, one for points, and another for the line
ax.plot(df['75Age'], df['68Age'], 'ro')
ax.plot([0, 3500], [0, 3500], '--b') # 1:1 line

# Set the label for the x-axis
ax.set_xlabel("75 Age")

# Set the label for the y-axis
ax.set_ylabel("68 Age")

# this cleans stuff up - comment this out to see what happens
plt.show()

### Now you try!
Create an x-y plot with two subplots, using any data in the DataFrame. See some examples of subplots in the matplotlib [docs](https://matplotlib.org/3.1.0/gallery/subplots_axes_and_figures/subplots_demo.html)


In [None]:
# your code here

Now let's look at `scatter`. [plt.scatter](https://matplotlib.org/3.3.2/api/_as_gen/matplotlib.pyplot.scatter.html) is handy for scatterplots where you want to control the color or size of each dot (e.g., so-called 'bubble plots'). 

In [None]:
fig, ax = plt.subplots(1,2)

ax[0].scatter(df['75Age'], df['68Age'],
              facecolors='none', 
              edgecolors='r')

ax[1].scatter(df['75Age'], df['68Age'],
              c=df['BestAge'], 
              s=df['BestAge']/100, 
              facecolor=None)

plt.tight_layout()

and `hist`, which makes pretty nice histograms - see examples [here](https://matplotlib.org/3.3.1/tutorials/introductory/pyplot.html#working-with-text) 

In [None]:
plt.hist(df.BestAge)
plt.show()

In [None]:
# Let's mess with the bin spacing and subset the data
plt.hist(df.BestAge[df.BestAge<500],bins=50); # only plotting ages less than 500 Ma

## Let's get some new data
This is automated mineralogy data from Katha Pfaff's amazing [TIMA lab](https://geology.mines.edu/laboratories/automated-mineralogy-laboratory/). Katha does great work, consider using her lab! This particular data comes from Clark Gilbert's PhD thesis, and we will use it to make a stacked bar chart.

In [None]:
# these next two things shouldn't need to be changed if you set up your Google Drive folder correctly (see Module 1)
folder_path = 'gdrive/My Drive/GEOL557_F22/data/' # makes a path
file_name = 'min_data_mod.csv' # file name

ventura = pd.read_csv(folder_path + file_name)
ventura.head(10)

In [None]:
# Bar charts

bar_names = ventura.name                  # pull out the names to use as labels
pos = np.arange(len(bar_names))           # make a vector for the labeling
plt.bar(pos,ventura.Quartz)               # plot the bar
plt.xticks(pos, bar_names, rotation=90)   # re-label the x-axis
plt.show()                                # clean up

In [None]:
"""
Usually we would do a stacked bar like this - https://matplotlib.org/3.1.1/gallery/lines_bars_and_markers/bar_stacked.html 
But sometimes pandas is the easiest way! 
Doing it in pandas avoids having to loop through the DataFrame and plotting each bar on top of the other bar.
As with all python, there are always several ways to do something, but sometimes there is an easy way and a hard way
""" 

ventura.drop(columns='name_ind').plot.bar(stacked=True, legend=False) # one-liner!!
plt.xticks(pos, bar_names, rotation=90);
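
For comparison, the plain-matplotlib route mentioned in the cell above stacks each series on top of the previous one with the `bottom` argument of `bar`. A minimal sketch of that loop, using invented mineral percentages for three hypothetical samples (these numbers are not from the Ventura dataset):

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented mineral percentages for three hypothetical samples
samples = ['A', 'B', 'C']
minerals = {
    'Quartz':      np.array([40.0, 55.0, 30.0]),
    'Plagioclase': np.array([25.0, 20.0, 35.0]),
    'Orthoclase':  np.array([15.0, 10.0, 20.0]),
}

fig, ax = plt.subplots()
pos = np.arange(len(samples))
bottom = np.zeros(len(samples))        # running top of the stack for each bar
for name, values in minerals.items():
    ax.bar(pos, values, bottom=bottom, label=name)
    bottom += values                   # the next mineral starts where this one ended
ax.set_xticks(pos)
ax.set_xticklabels(samples)
ax.set_ylabel('Percent')
ax.legend()
```

The pandas one-liner does this bookkeeping for you, which is why it is so much shorter.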

In [None]:
# let's make a new variable that subsets ventura
QPO = ventura[['name','Quartz','Plagioclase','Orthoclase']]
QPO_thin_sections = QPO.iloc[[2, 4, 6, 9, 11, 13, 15]]
QPO_thin_sections

In [None]:
QPO_thin_sections.plot.bar(stacked=True)
plt.title('These don\'t add up to 100')

bar_names = QPO_thin_sections.name
pos = np.arange(len(bar_names))
plt.xticks(pos, bar_names, rotation=90)
plt.ylim([0, 100]);

## Assignment
1. Create a new dataset that uses thin sections only, and uses the following minerals:
  1. Quartz
  1. Plagioclase
  1. Orthoclase
  1. and sums all the other minerals into an `Other` column
  1. Then, normalize the dataframe 
    1. ([hint 1](https://stackoverflow.com/questions/35678874/normalize-rows-of-pandas-data-frame-by-their-sums/35679163))
    1. hint 2 - you can't use sum or div on a string, so you need to only do it on the float values (not the names...) 
1. Make a figure with two subplots
  1. a stacked bar for the `thin_sections` dataset, but normalized to 100. 
  1. A pie chart for the `Vasquez_Alternate_Thin_Section`

In [None]:
# your code here
thin_sections = ventura.iloc[[2, 4, 6, 9, 11, 13, 15]] # this should get you started
thin_sections.head()

## Dealing with logged axes
This is an example using fake data, but we will get into other examples using real data in the next cell below.

In [None]:
# Original code from https://matplotlib.org/tutorials/introductory/pyplot.html#logarithmic-and-other-nonlinear-axes

# Fixing random state for reproducibility
import numpy as np
np.random.seed(0)

# make up some data in the open interval (0, 1)
y = np.random.normal(loc=0.5, scale=0.4, size=1000) # this is normally-distributed data
y = y[(y > 0) & (y < 1)]
y.sort()
x = np.arange(len(y)) # just integers increasing from 0 to the length of y

# plot with various axes scales
plt.figure(figsize=[12,8])

# linear
plt.subplot(221)
plt.plot(x, y)
plt.yscale('linear')
plt.title('linear y axis')
plt.grid(True)

# log
plt.subplot(222)
plt.plot(x, y)
plt.yscale('log') # this is the key line
plt.title('logged y axis')
plt.grid(True)

In [None]:
fig, ax = plt.subplots(1,2, figsize=[12,6])

fig.suptitle('Which one looks like the better option for showing the data?',fontsize=14)

#plot linear data
ax[0].hist(np.sort(df['BestAge']), bins=20)
ax[0].set_title('Linear x axis')
ax[0].set_xlabel('U-Pb zircon Ages (Ma)')

# now plot log10 data
ax[1].hist(np.sort(np.log10(df['BestAge'])), bins=20);
ax[1].set_title('Log10 x axis');
ax[1].set_xlabel('U-Pb zircon Ages (Ma)')

# the only tricky thing is to make sure and get the x axis right
ax[1].set_xlim(1, 4)
ax[1].set_xticks(np.arange(1,5))
ax[1].set_xticklabels([10**p for p in range(1, 5)]) # this is the tricky one
plt.tight_layout(pad=3)
plt.show()
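
An alternative to relabeling the ticks by hand is to keep the data in linear units, build log-spaced bins, and switch the axis itself to a log scale. A minimal sketch of that approach, assuming synthetic log-normal "ages" (invented values, not the zircon dataset):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
ages = 10 ** rng.normal(loc=2.5, scale=0.5, size=1000)   # synthetic values centered near 300

# 20 bins whose edges are evenly spaced in log10 units, from 1 to 100,000
bins = np.logspace(0, 5, 21)

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(ages, bins=bins)
ax.set_xscale('log')          # matplotlib places and labels the ticks as powers of ten
ax.set_xlabel('Age (Ma)')
```

With this approach the tick labels cannot drift out of sync with the data, at the cost of having to choose the bin edges yourself.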

## Assignment

Go to the [xkcd plotting library page](https://matplotlib.org/3.1.1/gallery/showcase/xkcd.html#sphx-glr-gallery-showcase-xkcd-py) and learn how to make an xkcd-style figure. Then, be imaginative, and make your own!


In [None]:
# xkcd plot goes here
with plt.xkcd():

  fig = plt.figure()
  # your code goes here
  # your code goes here
  # your code goes here
  # your code goes here
  # your code goes here
  # your code goes here
  plt.show()

That is only scratching the surface of matplotlib, but you can see the power of plotting automation... Now for a random GIF
![Ricky Ross](https://media1.tenor.com/images/93427cc9e205aee20c6e03cfa82ea00e/tenor.gif?itemid=9743366)

# Seaborn
Seaborn is amazing. 
The [seaborn tutorial](https://seaborn.pydata.org/tutorial.html) is very good, and I encourage you to go through it to learn about the power of seaborn. 

When seaborn is combined with pandas dataframes, you can do some pretty amazing things with very little code, as you will see below. 


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from google.colab import drive
drive.mount('/content/gdrive')

### Turbidite bed thickness dataset
The dataset we will load in is from Rosie Fryer's MS thesis at Colorado School of Mines entitled *Quantification of the bed‐scale architecture of submarine depositional environments* - Rosie made almost 30,000 individual measurements of bed thickness and thinning rate for turbidite depositional environments - if you want more detail on this study, [here is the open-access article](https://doi.org/10.1002/dep2.70). 

In [None]:
# these next two things shouldn't need to be changed if you set up your Google Drive folder correctly (see Module 1)
folder_path = 'gdrive/My Drive/GEOL557_F22/data/' # makes a path
file_name = 'Fryer_and_Jobe_2019_turbidite_beds_partial.csv' # file name

df=pd.read_csv(folder_path + file_name) # uses pandas to read in the csv as a 'DataFrame' called df

df.info() # that's a bunch of data!

Let's use seaborn to plot some bed thickness distributions, using both a histogram and a kernel-density estimate:

In [None]:
# Set up the matplotlib figure
f, axes = plt.subplots(2, 2, figsize=(7, 7))

# Now use seaborn to plot things
# first, linear thickness
sns.histplot(df.thickness_m, color="b", ax=axes[0, 0])
sns.kdeplot(df.thickness_m, color="b", ax=axes[0, 1])
# now the logged data
sns.histplot(df.log10thickness_m, color="b", ax=axes[1, 0])
sns.kdeplot(df.log10thickness_m, color="b", ax=axes[1, 1])

# clean it up
axes[0,0].set_title('Histogram')
axes[0,1].set_title('KDE')
axes[1,0].set_title('Histogram (log10)')
axes[1,1].set_title('KDE (log10)')
plt.setp(axes, yticks=[])
plt.tight_layout()
plt.show()
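
A KDE is essentially a smoothed histogram: each observation is replaced by a small bell curve, and the curves are averaged. Here is a rough numpy-only sketch of that idea, assuming synthetic data and a hand-picked bandwidth (seaborn picks its bandwidth automatically, so its curve will differ in detail):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=200)   # synthetic sample
bandwidth = 0.4                                   # hand-picked smoothing width (an assumption)

grid = np.linspace(-5, 5, 501)
# one Gaussian bump per observation, evaluated on the grid: shape (200, 501)
bumps = np.exp(-0.5 * ((grid - data[:, None]) / bandwidth) ** 2)
bumps /= bandwidth * np.sqrt(2 * np.pi)           # each bump integrates to 1
density = bumps.mean(axis=0)                      # average of the bumps = the KDE

fig, ax = plt.subplots()
ax.hist(data, bins=20, density=True, alpha=0.4, label='histogram')
ax.plot(grid, density, 'b', label='hand-rolled KDE')
ax.legend()
```

Try a smaller or larger `bandwidth` to see the curve get spikier or smoother.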

In [None]:
# I like combining the KDE with a cumulative distribution function (CDF)
fig, ax = plt.subplots(figsize=[10,6])

sns.kdeplot(df.log10thickness_m[df.lithology=='mud'].values, color="xkcd:grey", fill=True, label='mud')
sns.kdeplot(df.log10thickness_m[df.lithology=='mud'].values, cumulative=True, color="xkcd:grey", fill=False, linewidth=5, label='mud')

sns.kdeplot(df.log10thickness_m[df.lithology=='sand'].values, color="xkcd:yellow", fill=True, label='sand')
sns.kdeplot(df.log10thickness_m[df.lithology=='sand'].values, cumulative=True, color="xkcd:yellow", fill=False, linewidth=5, label='sand')

ax.set_xlabel('Bed Thickness (m)',fontsize=14)
ax.set_xlim(-3, 2)
ax.set_xticks(np.arange(-3, 3)) # pin the tick positions so the labels below line up with them
ax.set_xticklabels([10**p for p in range(-3, 3)],fontsize=12,rotation=45)
ax.set_ylabel('Frequency',fontsize=12)
ax.set_title('Bed Thickness Distribution',fontsize=18)
plt.show()

In [None]:
# if you are unsure, and want to test to make sure the axis tick labels are correct, you can add text and a point to test it
fig, ax = plt.subplots()
sns.kdeplot(df.log10thickness_m[df.lithology=='sand'].values, ax=ax, cumulative=False, color="xkcd:yellow", fill=True)
ax.set_xlim(-3, 2)
ax.set_xticks(np.arange(-3, 3)) # pin the tick positions so the labels below line up with them
ax.set_xticklabels([10**p for p in range(-3, 3)],fontsize=12,rotation=45)
ax.set_xlabel('Bed Thickness (m)',fontsize=14)

plt.plot(df.log10thickness_m[df.lithology=='sand'].min(), 0.1, 'ro')
plt.text(df.log10thickness_m[df.lithology=='sand'].min(), 0.1, str(10**df.log10thickness_m[df.lithology=='sand'].min()))

plt.plot(df.log10thickness_m[df.lithology=='sand'].max(), 0.1, 'ro')
plt.text(df.log10thickness_m[df.lithology=='sand'].max(), 0.1, str(10**df.log10thickness_m[df.lithology=='sand'].max()))

In [None]:
# Another handy thing is axvline, which makes a vertical line at a certain value using matplotlib

# first create the median, as we are going to use it a bunch
median_sand = df.log10thickness_m[df.lithology=='sand'].median()

# create the figure and plot the kde
fig, ax = plt.subplots()
sns.kdeplot(df.log10thickness_m[df.lithology=='sand'], ax=ax, cumulative=False, color="xkcd:yellow", fill=True)

# this makes a nice vertical line
ax.axvline(median_sand,ymin=0, ymax=0.8,color='k')

# text placement
ax.text(median_sand, 0.5, str('median is '+str(round(10**median_sand,3))+' m'))

# clean it up
ax.set_xlim(-3, 2)
ax.set_xticks(np.arange(-3, 3)) # pin the tick positions so the labels below line up with them
ax.set_xticklabels([10**p for p in range(-3, 3)],fontsize=12,rotation=45)
ax.set_xlabel('Sand Bed Thickness (m)',fontsize=14);

In [None]:
# box plots are awesome, and amazingly easy to make using pandas and seaborn
sns.boxplot(x=df.environment,y=df.log10thinning_rate);

In [None]:
"""
try doing this in Excel...
or in matplotlib for that matter, the loop you would have to create would really suck;
using a combination of pandas and seaborn is super nice here
"""

fig, ax = plt.subplots(figsize=[8,6])

sns.violinplot(x=df.environment,y=df.log10thinning_rate, 
               split=True, 
               hue=df.lithology, 
               palette={'mud' : 'xkcd:grey','sand' : 'xkcd:yellow'}, 
               inner=None, 
               ax=ax)

ax.set_title('And boom goes the dynamite (google it if you don\'t recognize the meme)')
ax.set_xlabel('Environment',fontsize=14)
plt.show()

### 2D histograms
Seaborn is great for visualizing the density of x-y data, and can plot contours or colors to show you where data sits in a parameter-space:

In [None]:
# 2D contour plot (a bit computationally intensive, so takes about a minute to run)
fig, ax = plt.subplots()
ax.scatter(df.log10thickness_m, df.log10thinning_rate, s=0.5, color='xkcd:light grey')
sns.kdeplot(x=df.log10thickness_m, y=df.log10thinning_rate, ax=ax);
ax.text(-4,-7,str('n= '+str(len(df))+' measurements'));

In [None]:
#hex plot, much faster, and has 'jointplots' as histograms on the axes too - nice!

sns.jointplot(x="log10thickness_m", y="log10thinning_rate", data=df, kind="hex");

In [None]:
# Another format for a contour plot (a bit computationally intensive, so takes about a minute to run)

g = sns.JointGrid(x="log10thickness_m", y="log10thinning_rate", data=df)
g = g.plot_joint(sns.kdeplot, cmap="Purples_d")
g = g.plot_marginals(sns.kdeplot, color="m", fill=True)
g.ax_joint.set_xlim([-3,1])
g.ax_joint.set_ylim([-7,0])
g.ax_joint.text(-2,-6,str('n= '+str(len(df))+' measurements'));

### Now you try!
Read the [jointplot documentation](https://seaborn.pydata.org/generated/seaborn.jointplot.html), and create your own 2D plot that shows a jointplot of sand (colored yellow) and mud (colored grey). Label the axes appropriately if they need to be in log10 scale.

In [None]:
# your code here

df.lithology.unique() # a hint


### Pairplots
[Pair plots are amazing](https://medium.com/@jaimejcheng/data-exploration-and-visualization-with-seaborn-pair-plots-40e6d3450f6d), and a quick way to visualize if there are correlations between different variables. They are used extensively in data exploration, and seaborn makes pretty nice versions of the pair plot. 

In [None]:
keep_cols = ['log10distance_m','log10thickness_m','log10thinning_rate','environment','lithology']
df_dropped = df.drop(columns=df.columns.difference(keep_cols))

sns.pairplot(df_dropped, hue='environment')
# this takes a minute to run, be patient

In [None]:
sns.pairplot(df_dropped, 
             hue='lithology', 
             palette = {'mud' : 'xkcd:grey',
                        'sand' : 'xkcd:dark yellow'},
             plot_kws=dict(s=2, edgecolor=None, linewidth=2, alpha=0.1)
)
plt.show()
# this takes a minute to run, be patient

## Simple regressions
Just a quick example to show you that seaborn can plot simple linear regressions too - we will do much more of this next week using scipy and seaborn, so this is just a teaser...

In [None]:
# lmplot returns a seaborn FacetGrid (not a matplotlib Axes), so call it g
g = sns.lmplot(x='log10distance_m', 
               y='log10thinning_rate', 
               data=df, 
               hue='lithology', 
               markers=[".", "."], 
               palette={'mud' : 'xkcd:grey',
                        'sand' : 'xkcd:dark yellow'
                        }
               )
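
The straight lines drawn above are, by default, ordinary least-squares fits. A minimal sketch of the same idea with `np.polyfit`, assuming synthetic data with a known slope and intercept (the numbers are invented and unrelated to the turbidite dataset):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(0, 4, 100)
# true slope -1.5 and intercept 2.0, plus some Gaussian noise
y = -1.5 * x + 2.0 + rng.normal(scale=0.3, size=x.size)

# least-squares straight line: y ≈ slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)

fig, ax = plt.subplots()
ax.plot(x, y, '.', color='xkcd:grey')
ax.plot(x, slope * x + intercept, 'k-')
ax.set_title(f'slope = {slope:.2f}, intercept = {intercept:.2f}')
```

The fitted values should land close to the true slope and intercept; seaborn additionally draws a confidence band around its line, which this sketch leaves out.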

![Boom](https://media1.tenor.com/images/0185c37f7d3fb9a3fe9b2253bbe6b853/tenor.gif?itemid=8478631)