<a href="https://colab.research.google.com/github/odu-cs432-websci/public/blob/main/REU_DataVis_Python_Seaborn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# REU Site Data Vis with Python using Seaborn Notebook

## Setup

Initial setup includes loading the needed libraries (Matplotlib, NumPy, Pandas, Seaborn) and setting the default chart style.

In [None]:
import matplotlib.pyplot as plt  # will need some Matplotlib functions
import numpy as np 
import seaborn as sns
import pandas as pd              # will use Pandas for data manipulation
sns.set(rc={"figure.figsize":(8, 6)}) # width=8, height=6
sns.set_style("white")   # white background, no grid

For a few of the charts, we'll use a dataset about the characteristics of penguins from the Palmer Station in Antarctica.  It's one of the datasets directly accessible by the Seaborn function `load_dataset()`. 

Original penguin data: https://github.com/allisonhorst/palmerpenguins   
Seaborn datasets: https://github.com/mwaskom/seaborn-data


In [None]:
penguins = sns.load_dataset("penguins")
penguins.head()

## Scatterplot

Here's a basic scatterplot, showing the relationship between flipper length and body mass.  

`scatterplot()` - https://seaborn.pydata.org/generated/seaborn.scatterplot.html

In [None]:
sns.scatterplot(data=penguins, x="flipper_length_mm", y="body_mass_g")

As we would expect, the longer the flipper, the heavier the penguin.

Now we're going to color the dots based on the species of penguin. Since species is categorical data, we map it to color with the `hue` parameter.

The `scatterplot()` function returns an Axes object, so we'll name the resulting plot `ax`.  Then we can use `ax` to modify the axis labels and give the chart a title that describes what's going on.

In [None]:
ax = sns.scatterplot(data=penguins, x="flipper_length_mm", y="body_mass_g", hue="species")
ax.set_xlabel('Flipper Length (mm)')
ax.set_ylabel('Penguin Body Mass (g)')
ax.set_title('The longer the flipper, the heavier the penguin');   # trailing semicolon keeps the notebook from printing the returned Text object

Here's one difference between the Axes-level function `scatterplot()` and the Figure-level function `relplot()`: `relplot()` returns a FacetGrid rather than an Axes.  To set the title and labels, we need to reach the Axes object inside it, using `g.ax`.

`relplot()` - https://seaborn.pydata.org/generated/seaborn.relplot.html

In [None]:
g = sns.relplot(data=penguins, x="flipper_length_mm", y="body_mass_g", hue="species", kind="scatter")
g.ax.set_xlabel('Flipper Length (mm)')
g.ax.set_ylabel('Penguin Body Mass (g)')
g.ax.set_title('The longer the flipper, the heavier the penguin');
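
The difference in return types can be checked directly. This is a minimal sketch on a tiny made-up table (the `toy` DataFrame and its column names are placeholders, not part of the penguins data): the Axes-level call hands back a Matplotlib Axes, while the Figure-level call creates its own figure and hands back a Seaborn `FacetGrid`.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Tiny made-up dataset (hypothetical values, not the penguins data)
toy = pd.DataFrame({
    "length": [180, 190, 200, 210],
    "mass": [3500, 3800, 4500, 5000],
    "group": ["a", "a", "b", "b"],
})

# Axes-level: draws on the current Axes and returns it
ax = sns.scatterplot(data=toy, x="length", y="mass", hue="group")
ax.set_title("Axes-level")

# Figure-level: creates a new figure and returns a FacetGrid
g = sns.relplot(data=toy, x="length", y="mass", hue="group", kind="scatter")
g.ax.set_title("Figure-level")   # g.ax is the single Axes inside the grid

print(isinstance(ax, plt.Axes))        # True
print(isinstance(g, sns.FacetGrid))    # True
```

Because the Figure-level call always creates a new figure, the two plots above end up in separate figures.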

## Bar Chart

Given a set of data, `barplot()` will calculate the mean and, by default, a 95% confidence interval for each group of observations.  In this example, it's showing the average body mass (g) over all penguins of each species in our dataset.

`barplot()` - https://seaborn.pydata.org/generated/seaborn.barplot.html

In [None]:
sns.barplot(data=penguins, x="species", y="body_mass_g");
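
The bar heights are ordinary per-group means, so they can be cross-checked with a Pandas `groupby()`. Here is a small sketch with made-up numbers (the `toy` table is hypothetical; with the real data you would group `penguins` the same way):

```python
import pandas as pd

# Made-up measurements (hypothetical, not the real penguin data)
toy = pd.DataFrame({
    "species": ["A", "A", "B", "B", "B"],
    "body_mass_g": [3000, 4000, 5000, 5500, 6000],
})

# Mean of body_mass_g within each species: the height of each bar
means = toy.groupby("species")["body_mass_g"].mean()
print(means)
```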

Seaborn also has a `countplot()` function that will create a bar chart of counts.  In this example, we're showing the number of penguins in each species.

`countplot()` - https://seaborn.pydata.org/generated/seaborn.countplot.html

In [None]:
sns.countplot(data=penguins, x="species");

By default, the bars appear in the order in which the species are first encountered in the data, which here happens to be alphabetical.  To order them by count instead, we can add the `order` parameter.

In [None]:
sns.countplot(x="species", data=penguins, order=penguins['species'].value_counts().index);
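
To see why this works, note that `value_counts()` returns the counts sorted from most to least frequent, so its index is already the list of categories in the order we want. A minimal sketch with made-up labels (the `species` Series below is a placeholder, not the penguins column):

```python
import pandas as pd

# Made-up labels: A appears 3 times, B twice, C once
species = pd.Series(["B", "A", "A", "C", "A", "B"], name="species")

counts = species.value_counts()   # sorted from most to least frequent
order = counts.index.tolist()     # category names in that order
print(order)                      # ['A', 'B', 'C']
```

Passing `ascending=True` to `value_counts()` would give the reverse order, smallest bar first.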

For charts that have long x-axis labels, you can turn the chart sideways for a horizontal bar chart. Just pass the categorical column as `y` instead of `x`.  This example also uses the `color` parameter to make all bars the same color.

In [None]:
ax = sns.countplot(y="species", data=penguins, order=penguins['species'].value_counts().index, color="steelblue")
ax.set_xlabel("Number of Penguins")
ax.set_ylabel("Species")
ax.set_title("There are over twice as many Adelie penguins as Chinstrap penguins");

Both `barplot()` and `countplot()` are Axes-level functions that sit under the Figure-level function `catplot()` (short for categorical plot).

`catplot()` - https://seaborn.pydata.org/generated/seaborn.catplot.html

## Line Chart

Line charts are typically used for time series, where the x-axis is temporal.  For this, we'll use a different dataset.  The flights dataset shows the number of passengers per month between Jan 1949 and Dec 1960.

In [None]:
flights = sns.load_dataset("flights")
flights.head()

If we want to examine the most popular months for flying, we can group the data by month and sum the number of passengers.  `reset_index()` just puts the data back into a DataFrame.

In [None]:
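per_month = flights.groupby('month', observed=False)['passengers'].sum().reset_index(name='passengers')  # month is categorical; observed=False keeps every month and avoids a FutureWarning in recent pandas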
per_month
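
The role of `reset_index()` is easier to see on a small example. In this sketch (the `toy` table and its values are made up), the grouped sum is a Series indexed by month, and `reset_index(name=...)` turns it into a two-column DataFrame that the plotting functions can take via `data=`.

```python
import pandas as pd

# Made-up passenger counts for two months over two years
toy = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "passengers": [100, 120, 110, 130],
})

totals = toy.groupby("month")["passengers"].sum()   # Series, indexed by month
table = totals.reset_index(name="passengers")       # DataFrame: month, passengers

print(type(totals).__name__)   # Series
print(list(table.columns))     # ['month', 'passengers']
```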

We'll use the Axes-level function `lineplot()`, which has a similar setup as the other charts we've seen so far. 

`lineplot()` - https://seaborn.pydata.org/generated/seaborn.lineplot.html

In [None]:
sns.lineplot(data=per_month, x="month", y="passengers")

We can also pass in the entire dataset, and `lineplot()` will calculate the mean number of passengers in each month along with a 95% confidence interval.

In [None]:
sns.lineplot(data=flights, x="month", y="passengers");

If we want to split this out and show a line for each year, we can use the `hue` option to specify that.  Note that to get the full set of items in the legend to display, we have to use `legend="full"`. The second line shows how to move the legend outside of the chart area.

In [None]:
ax = sns.lineplot(data=flights, x="month", y="passengers", hue="year", legend="full")
plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left')
ax.set_xlabel("")
ax.set_ylabel("Passengers")
ax.set_title("Summer months see an increase in passengers over all years");

Note that the Figure-level `relplot()` puts the legend outside the plot by default.  This example also demonstrates the `palette` option, which chooses the colormap.

In [None]:
g = sns.relplot(data=flights, x="month", y="passengers", hue="year", legend="full", kind="line", palette="tab20")
g.ax.set_xlabel ("")
g.ax.set_ylabel ("Passengers")
g.ax.set_title ("Summer months see an increase in passengers over all years");

## Stacked Bar Chart

Seaborn doesn't directly support stacked bar charts. You can find pages online that talk about creating charts for the bars separately and plotting them on top of each other, but that's a lot of trouble.  Pandas `plot()`, though, supports stacked bars directly.

We want to create a stacked bar chart that shows the number of male and female penguins in each species. First, we'll group our penguins by species and sex and count the rows in each set.

In [None]:
num_species_sex = penguins.groupby(['species', 'sex']).size().reset_index(name='count')
num_species_sex

For the stacked bar chart, we need the data in a different format, so we can use `pivot_table()` to transform the data.

In [None]:
penguin_pivot = pd.pivot_table(data=num_species_sex, index=['species'], columns=['sex'], values='count')
penguin_pivot
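
As an aside, Pandas also has `pd.crosstab()`, which counts combinations of two columns and returns the wide table in one step. This is an alternative to the `groupby()` plus `pivot_table()` route used above, not the notebook's method, and the sketch below runs on a made-up `toy` table rather than the penguins data.

```python
import pandas as pd

# Made-up rows, one per animal (hypothetical values)
toy = pd.DataFrame({
    "species": ["A", "A", "A", "B", "B"],
    "sex": ["Female", "Male", "Male", "Female", "Female"],
})

# Rows become species, columns become sex, cells hold the counts
wide = pd.crosstab(toy["species"], toy["sex"])
print(wide)
```

Combinations that never occur (here, male animals of species B) show up as 0.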

With our data in the proper format, we can create the stacked bar chart using `plot.bar()` and the `stacked=True` option.

`plot.bar()` - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.bar.html

In [None]:
penguin_pivot.plot.bar(stacked=True);

Use `plot.barh()` to create a horizontal stacked bar chart.

`plot.barh()` - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.barh.html

In [None]:
penguin_pivot.plot.barh(stacked=True);

## Grouped Bar Chart

For a grouped bar chart, we just need to add the `hue` option to a bar chart (either `countplot` or `barplot`).  In this example, we'll look at which island the penguins were found on.

In [None]:
sns.countplot(data=penguins, x="species", hue="island")

## Pie Chart

Pandas Plot has a `pie()` function to generate pie charts. 

`plot.pie()` - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.pie.html  
*Note: The example pie charts on that page make no sense.*

In [None]:
data = penguins.groupby("species").size()
data

`plot.pie()` works on a Series as well as a DataFrame, so we can plot the result of `size()` directly without transforming it.

In [None]:
data.plot.pie(autopct="%.1f%%");
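
The `autopct` format string is applied to each slice's share of the total, expressed as a percentage. A small sketch of that arithmetic with made-up counts (the `counts` Series is a placeholder for the species counts):

```python
import pandas as pd

# Made-up counts per category
counts = pd.Series({"A": 50, "B": 30, "C": 20})

# Each slice's share of the whole, in percent
percent = counts / counts.sum() * 100

# "%.1f%%" means one decimal place followed by a literal percent sign
labels = ["%.1f%%" % p for p in percent]
print(labels)   # ['50.0%', '30.0%', '20.0%']
```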

## Heatmap

For a heatmap, we need the data formatted so that we have 2D data (2 keys).  We can do this using `pivot()` to create a pivot table.  We specify the row attribute (`index`, key 1), the column attribute (`columns`, key 2), and the `values` that fill the cells.

In [None]:
flights.head()

In [None]:
pivot = flights.pivot(index="month", columns="year", values="passengers")  # keyword arguments; pandas 2.0+ no longer accepts these positionally
pivot.head()
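
To make the reshaping concrete, here is a minimal sketch of `pivot()` on a four-row table in the same long format (the `long` DataFrame is a small illustrative stand-in for `flights`). Each distinct `index` value becomes a row, each distinct `columns` value becomes a column, and `values` fills the cells.

```python
import pandas as pd

# Long format: one row per (year, month) pair (illustrative values)
long = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [112, 118, 115, 126],
})

# Wide format: months as rows, years as columns
wide = long.pivot(index="month", columns="year", values="passengers")
print(wide)
print(wide.loc["Jan", 1950])   # 115
```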

Once we have the data in the proper format, we can use the `heatmap()` function to generate the heatmap.

`heatmap()` - https://seaborn.pydata.org/generated/seaborn.heatmap.html

In [None]:
sns.heatmap(pivot);

We can use the `cmap` option to choose a Matplotlib colormap. See "Choosing Colormaps in Matplotlib", https://matplotlib.org/stable/tutorials/colors/colormaps.html

In [None]:
sns.heatmap(pivot, cmap="plasma");

## Scatterplot Matrix

We can use the `pairplot()` function to plot a scatterplot matrix.  It plots "pairwise relationships in a dataset". On the diagonal, where an attribute would be plotted against itself, it plots a histogram of that attribute instead.

`pairplot()` - https://seaborn.pydata.org/generated/seaborn.pairplot.html

In [None]:
sns.pairplot(penguins);

By default, `pairplot()` plots all quantitative attributes, but you can specify particular attributes to include using the `vars` option.

In [None]:
g = sns.pairplot(penguins, vars=['bill_length_mm', 'flipper_length_mm', 'body_mass_g']);

## Histogram

A histogram uses the data from just a single attribute.  We use the `histplot()` function and pass in a specific column of the data.

`histplot()` - https://seaborn.pydata.org/generated/seaborn.histplot.html

In [None]:
sns.histplot(penguins['body_mass_g']);

We can use the `bins` option to change the number of bins in the histogram.

In [None]:
sns.histplot(penguins['body_mass_g'], bins=100);
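
What changing the number of bins does can be sketched with NumPy's `np.histogram()`, assuming equal-width bins between the smallest and largest value. The `values` array is made up, and this only illustrates the counting, not how `histplot()` is implemented.

```python
import numpy as np

# Made-up measurements
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5])

# Split the range 1..5 into 4 equal-width bins and count the values in each
counts, edges = np.histogram(values, bins=4)
print(edges)    # 5 edges delimit 4 bins
print(counts)   # how many values fall in each bin

# More bins means narrower bins, but the total count stays the same
fine_counts, fine_edges = np.histogram(values, bins=8)
print(fine_counts.sum() == counts.sum())   # True
```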

## Boxplot

We can use `boxplot()` to create a boxplot for an attribute.  If you only supply the dataset, the function will create a boxplot for each of the quantitative attributes.  If you supply the `y` option, it will create a single boxplot for that attribute only.  If you add in `x` with a categorical attribute, it will split the data accordingly and produce a boxplot for each category.

`boxplot()` - https://seaborn.pydata.org/generated/seaborn.boxplot.html

In [None]:
sns.boxplot(data=penguins, y="body_mass_g");

In [None]:
sns.boxplot(data=penguins, x="species", y="body_mass_g"); 
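
The numbers behind a boxplot can be computed by hand. This sketch uses a made-up `values` Series and assumes the usual convention (which I believe is also Seaborn's default) that whiskers reach at most 1.5 times the interquartile range beyond the box, with anything further out drawn as an individual point.

```python
import pandas as pd

# Made-up measurements with one unusually large value
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# The box spans the first to third quartile; the line inside is the median
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Limits for the whiskers under the 1.5 * IQR convention
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Values outside the limits would be drawn as separate points
outliers = values[(values < lower_limit) | (values > upper_limit)]
print(q1, median, q3)      # 3.25 5.5 7.75
print(outliers.tolist())   # [30]
```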