# Introduction to Data Visualization in Python
## [dataservices.library.jhu.edu](https://dataservices.library.jhu.edu/)
### Pete Lawson and Srinithi Srinivasan, JHU Data Services
### Date: November 16, 2021
This course will introduce data visualization in Python with the libraries `matplotlib` and `seaborn`.

----
## Table of Contents

#### Introduction
[Software and materials](#Software-and-materials)   
[Prerequisites](#Prerequisites:)   
[Learning objectives](#Learning-Objectives)   

#### Section 1: Matplotlib
[Introduction to Matplotlib](#Introduction-to-Matplotlib)<br>
[Exercise 1: Plotting a Line](#Exercise-1:-Plot-a-Lineplot-with-Matplotlib)<br>
[Palmer Penguins Dataset](#Meet-the-Penguins!)<br>
[Scatter Plots in Pyplot](#Let's-make-a-scatter-plot-comparing-bill-length-to-bill-depth)<br>
[Exercise 2: Plot a Scatterplot Comparing Bill Length to Depth](#Exercise-2:-Plot-a-Scatterplot-Comparing-Bill-Length-to-Bill-Depth)<br>
[Matplotlib Object Oriented Interface](#Matplotlib-Object-Oriented-Interface)

#### Section 2: Seaborn
[Introduction to Seaborn](#Introduction-to-Seaborn)<br>
[Replicate our Scatterplot in Seaborn](#Seaborn-Scatterplot)<br>
[Seaborn Regression Plot](#Seaborn-Regression-Plot)<br>
[Seaborn Joint Plot](#Seaborn-Joint-Plot)

## Software and materials   

- Jupyter Notebook ([Anaconda distribution](https://www.anaconda.com/products/individual) recommended)   
    - Please install the following libraries:
        - `seaborn` 
        - `matplotlib`
        - `palmerpenguins`
- [Zip folder](https://github.com/jhu-data-services/intro-to-data-visualization-in-python/archive/refs/heads/main.zip) containing:
    - data-visualization-in-python.ipynb

## Prerequisites:

- Knowledge of basic programming concepts
    - Data types
    - Variable assignment
    - Function calls
- Introductory experience in Python or R (like Data Services' Intro to Python or Intro to R workshops)

## Learning Objectives
<div class="alert alert-block alert-warning">
    <b>Learning Objectives:</b>
    <br>
    <ol type="1">
        <li>Understand the different plotting libraries in Python, and when each is appropriate.</li>
        <li>Understand the difference between the pyplot imperative syntax (state-based interface) and the Object-Oriented syntax.</li>
        <li>Be able to perform a simple visualization using provided data in both Matplotlib and Seaborn.</li>
        <li>Be able to apply some recommended practices for data visualization to plots in Matplotlib and Seaborn.</li>
    </ol>
</div>

## Introduction to Matplotlib
`matplotlib` is a Python library for producing publication-quality figures.

There are two interfaces to `matplotlib`:
1. `matplotlib.pyplot`, and
2. an *Object Oriented* interface.

The `pyplot` interface provides a simpler (but less flexible) means of interfacing with `matplotlib`. 

We will explore the `pyplot` interface first, identify some of its limitations, and then explore the additional flexibility and functionality provided by the *Object Oriented* interface.

### Importing `matplotlib.pyplot`

In [1]:
#This allows us to display our plots inline in our Jupyter Notebook
%matplotlib inline

import matplotlib.pyplot as plt


`matplotlib.pyplot` is a submodule of `matplotlib`: a collection of functions that make `matplotlib` work like MATLAB's plotting commands. By common convention, we import `matplotlib.pyplot` under the alias `plt`.

Each pyplot function makes some change to a figure, for example: 
- creates a figure
- creates a plotting area in a figure 
- plots some lines or points in a plotting area
- decorates the plot with labels

### Let's make a simple line plot:

<div class="alert alert-block alert-info"><b>Tip:</b> The syntax below for our <code>y_values</code> list might look a little strange. Instead of typing each value into the list, we write an expression that populates it. This technique is called a <b>list comprehension</b>, and it makes it much easier to build one list from another. <br><br>The simplest list comprehension looks like <code>[<i>expression</i> for <i>value</i> in <i>collection</i>]</code>. <br><br>The <i>expression</i>, in our case <code>val ** 2</code>, is evaluated once for every <i>value</i> in the <i>collection</i>, in this case the list <code>x_values</code>, and each result becomes an element of the new list. So in our example the new list is created by squaring every element of <code>x_values</code>.</div>

In [2]:
x_values = [1, 2, 3, 4, 5]
y_values = [val ** 2 for val in x_values]

Note that each x-value is squared, so $1^{2}=1$, $2^{2}=4$, $3^{2}=9$ and so on.

The syntax for the `pyplot` plot function is:

`plt.plot(x, y)` where `x` and `y` are your x and y-values respectively. 

We imported `matplotlib.pyplot` under the alias `plt`. Because `plot` is a function in the `pyplot` submodule, we need to prefix it with the alias and a dot, so we call `plt.plot()`, not `plot()`. 


### Exercise 1: Plot a Lineplot with Matplotlib

<div class="alert alert-block alert-success"><b>Exercise 1:</b> Plot your <code>x_values</code> and <code>y_values</code> using the <code>plt.plot(x, y)</code> function.</div>

In [None]:
# Exercise 1 Code
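One possible solution is sketched below. It rebuilds `x_values` and `y_values` so that the cell runs on its own; the variable name `lines` is only there for illustration. In a notebook, the figure renders when the cell finishes.

```python
import matplotlib.pyplot as plt

# Same data as above: the x-values and their squares
x_values = [1, 2, 3, 4, 5]
y_values = [val ** 2 for val in x_values]

# plt.plot draws a line through the (x, y) pairs and
# returns a list of the Line2D objects it created
lines = plt.plot(x_values, y_values)
```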

## Meet the Penguins!
We will be using the Palmer Penguins dataset for today's visualization exercises. Data were collected and made available by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and the [Palmer Station, Antarctica LTER](https://pal.lternet.edu/), a member of the [Long Term Ecological Research Network](https://lternet.edu/). 

We will use the library `pandas` to load the data into a dataframe. First, import `pandas`:

In [4]:
import pandas as pd

Then read in `palmerpenguins.csv` from the `../Data` folder:

In [5]:
penguins = pd.read_csv("../Data/palmerpenguins.csv")

The dataset `penguins` contains data for 344 penguins. There are 3 different species of penguins in this dataset, collected from 3 islands in the Palmer Archipelago, Antarctica.

<img src="https://camo.githubusercontent.com/1d187452ac3929cfde8f5760b79f37cc117c1a332227d37a8c50db50d3db632a/68747470733a2f2f616c6c69736f6e686f7273742e6769746875622e696f2f70616c6d657270656e6775696e732f7265666572656e63652f666967757265732f6c7465725f70656e6775696e732e706e67" alt="Palmer Penguins" width="500" />

Let's explore the contents of the dataset:

In [2]:
# Dataset exploration
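A minimal sketch of a few common exploration calls follows. So that it runs without the CSV file, it builds a small stand-in dataframe: the six rows are illustrative placeholders, not values copied from the real dataset, and only a subset of columns is included (column names are assumed to match the published Palmer Penguins data). In the notebook you would call the same methods on the `penguins` dataframe loaded above.

```python
import pandas as pd

# Stand-in for the real dataframe; the values are illustrative placeholders
penguins = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"],
    "bill_length_mm": [39.0, 40.5, 48.5, 50.0, 46.0, 49.5],
    "bill_depth_mm": [18.5, 17.5, 18.0, 19.5, 14.0, 15.5],
    "body_mass_g": [3700, 3900, 3600, 3900, 4900, 5400],
})

print(penguins.head())      # first five rows
print(penguins.shape)       # (number of rows, number of columns)
print(penguins.dtypes)      # data type of each column
print(penguins.describe())  # summary statistics for the numeric columns
```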

### Now let's look at some penguin characteristics:

<img src="https://allisonhorst.github.io/palmerpenguins/reference/figures/culmen_depth.png" alt="Bill diagram" width="500">

### Let's make a scatter plot comparing bill length to bill depth
Let's begin by selecting two variables for visualizing, in this case **bill length** and **bill depth**.

We can retrieve the **bill length** by selecting that column of the `penguins` dataframe:
`penguins['bill_length_mm']`. 

Let's take a look at the contents of the `penguins['bill_length_mm']` column:

In [3]:
# Bill length

Let's assign that column of data to a variable `bill_length`

In [4]:
# Assign bill length

We can do the same for **bill depth**:

In [5]:
# Assign bill depth
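The three steps above (inspect the column, assign bill length, assign bill depth) might look like the sketch below. It uses a small stand-in dataframe with placeholder values so that it runs on its own, and it assumes the bill depth column is named `bill_depth_mm`, as in the published Palmer Penguins data.

```python
import pandas as pd

# Stand-in for the real dataframe; the values are illustrative placeholders
penguins = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"],
    "bill_length_mm": [39.0, 40.5, 48.5, 50.0, 46.0, 49.5],
    "bill_depth_mm": [18.5, 17.5, 18.0, 19.5, 14.0, 15.5],
})

# Selecting a single column with square brackets returns a pandas Series
print(penguins["bill_length_mm"])

# Assign each column to its own variable for plotting
bill_length = penguins["bill_length_mm"]
bill_depth = penguins["bill_depth_mm"]
```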

### Exercise 2: Plot a Scatterplot Comparing Bill Length to Bill Depth

<div class="alert alert-block alert-success"><b>Exercise 2:</b> Plot the <b>bill length</b> on the x-axis and <b>bill depth</b> on the y-axis using the <code>plt.scatter(x,y)</code> function.</div>

In [6]:
# Exercise 2 Result
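One way to approach Exercise 2 is sketched here, again on a small stand-in dataframe with placeholder values rather than the real data. The name `points` is only for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the real dataframe; the values are illustrative placeholders
penguins = pd.DataFrame({
    "bill_length_mm": [39.0, 40.5, 48.5, 50.0, 46.0, 49.5],
    "bill_depth_mm": [18.5, 17.5, 18.0, 19.5, 14.0, 15.5],
})

bill_length = penguins["bill_length_mm"]
bill_depth = penguins["bill_depth_mm"]

# plt.scatter draws one marker per (x, y) pair
points = plt.scatter(bill_length, bill_depth)
```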

Now let's add some axis labels:

In [7]:
# Add axis labels

Now let's make this a more publication-friendly figure by removing the top and right borders (or *spines* in matplotlib language). 

In [8]:
# Remove spines

### Exercise 4: Remove the Top and Right Spine from the Scatterplot

<div class="alert alert-block alert-success"><b>Exercise 4:</b> Using the syntax <code>plt.gca().spines['location'].set_visible(False)</code> remove the <b>top</b> and <b>right</b> spines.</div>

In [11]:
# Exercise 4 Result
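The labeling and spine-removal steps could be combined as in the sketch below. The label text is a suggestion, and the data are a stand-in with placeholder values. `plt.gca()` ("get current axes") returns the axes that the most recent `pyplot` call drew on.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the real dataframe; the values are illustrative placeholders
penguins = pd.DataFrame({
    "bill_length_mm": [39.0, 40.5, 48.5, 50.0, 46.0, 49.5],
    "bill_depth_mm": [18.5, 17.5, 18.0, 19.5, 14.0, 15.5],
})

plt.scatter(penguins["bill_length_mm"], penguins["bill_depth_mm"])

# Axis labels, with units
plt.xlabel("Bill length (mm)")
plt.ylabel("Bill depth (mm)")

# Hide the top and right spines of the current axes
plt.gca().spines["top"].set_visible(False)
plt.gca().spines["right"].set_visible(False)
```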

### Exercise 5: Practice Creating a Scatterplot

<div class="alert alert-block alert-success"><b>Exercise 5:</b> Create a scatter plot that compares <b>body mass</b> on the x-axis to <b>bill length</b> on the y-axis. </div>

In [13]:
# Exercise 5 Result

When we called the `plt.plot` function, and later the `plt.scatter` function, a couple of things were happening in the background: 

- matplotlib created a **figure** instance; the figure is a container which holds everything you see. 
- matplotlib created an **axes** instance; the axes are the part of the figure that holds the data. This is the "canvas" on which we will paint with our data.

These two pieces make up our matplotlib figures:

<img src="https://matplotlib.org/stable/_images/anatomy.png" alt="matplotlib fig anatomy" width="600">

Image Source: https://matplotlib.org/stable/_images/anatomy.png

The **figure** object (`fig`) is a container which holds everything you see. This is the top level container for all other elements that make up the graphic.

The blue elements are called **artists**. The Figure object can be thought of as a canvas, upon which different artists act to create the final graphic image. 

Each plot we see in a figure is an **axes** object. We can have one **figure** (container) but multiple **axes** (plots).

In [15]:
# Show Figure
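A minimal sketch of this step, assuming the figure is created with `plt.figure()`:

```python
import matplotlib.pyplot as plt

# Create an empty figure: a container with no axes inside it
fig = plt.figure()

# The list of axes attached to the figure is empty
print(fig.axes)
```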

Notice that when we run this cell, no plot is actually produced. This is because we haven't put anything into our container yet! 

The **axes** (`ax`) is the part of the figure that holds the data. This is the "canvas" on which we will paint with our data.

In [16]:
# Show Axes
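A sketch of this step using `plt.subplots`, which creates a figure and an axes together and returns both:

```python
import matplotlib.pyplot as plt

# With no arguments, plt.subplots returns one figure containing one axes
fig, ax = plt.subplots()

# The axes is now attached to the figure, but holds no data yet
print(fig.axes)
print(ax.has_data())
```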

Now our plot shows up! We don't have any data yet, so our canvas is blank right now. We used `plt.subplots`, which can return multiple axes for a single figure. By default it creates only one axes, although we can request more:

In [17]:
# Multiple Axes
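For example, a figure with two side-by-side axes could be requested as below. The grid shape and `figsize` are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt

# One row, two columns: `axes` is a NumPy array holding two Axes objects
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

# Each axes can be addressed by index and drawn on independently
axes[0].set_title("Left axes")
axes[1].set_title("Right axes")
```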

### Matplotlib Object Oriented Interface

Now, let's re-build our plot with the Object Oriented API.

In [18]:
# Rebuild our scatterplot using the Object Oriented API
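A sketch of the body mass versus bill length scatter plot rebuilt with the Object Oriented API follows, on stand-in data with placeholder values. Each `plt.` call from the `pyplot` version becomes a method call on the `ax` object; the label text is a suggestion.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the real dataframe; the values are illustrative placeholders
penguins = pd.DataFrame({
    "bill_length_mm": [39.0, 40.5, 48.5, 50.0, 46.0, 49.5],
    "body_mass_g": [3700, 3900, 3600, 3900, 4900, 5400],
})

# Create the figure and axes explicitly, then draw on the axes
fig, ax = plt.subplots()
ax.scatter(penguins["body_mass_g"], penguins["bill_length_mm"])

# plt.xlabel / plt.ylabel become ax.set_xlabel / ax.set_ylabel
ax.set_xlabel("Body mass (g)")
ax.set_ylabel("Bill length (mm)")

# No need for plt.gca(): we already hold a reference to the axes
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
```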

This just seems like a more confusing way of creating the same plot. Why would we want to do this?

We can actually plot multiple axes. We are already comparing **Body Mass** on the x-axis and **Bill Length** on the y-axis.
Let's add another y-axis with **Bill Depth** as well!

In [19]:
# Plot with multiple axes

That is a little difficult to read, so let's add some transparency, otherwise known as `alpha`.

In [20]:
# Add transparency
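One way to add the second y-axis and the transparency is sketched below, assuming the second axis is created with `Axes.twinx`, which returns a new axes that shares the same x-axis. The data are a stand-in with placeholder values, and the colors and `alpha=0.5` are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the real dataframe; the values are illustrative placeholders
penguins = pd.DataFrame({
    "bill_length_mm": [39.0, 40.5, 48.5, 50.0, 46.0, 49.5],
    "bill_depth_mm": [18.5, 17.5, 18.0, 19.5, 14.0, 15.5],
    "body_mass_g": [3700, 3900, 3600, 3900, 4900, 5400],
})

fig, ax = plt.subplots()

# Left y-axis: bill length; alpha ranges from 0 (transparent) to 1 (opaque)
ax.scatter(penguins["body_mass_g"], penguins["bill_length_mm"],
           color="tab:blue", alpha=0.5)
ax.set_xlabel("Body mass (g)")
ax.set_ylabel("Bill length (mm)", color="tab:blue")

# Right y-axis: bill depth, drawn on a twin axes sharing the x-axis
ax2 = ax.twinx()
ax2.scatter(penguins["body_mass_g"], penguins["bill_depth_mm"],
            color="tab:orange", alpha=0.5)
ax2.set_ylabel("Bill depth (mm)", color="tab:orange")
```

Coloring each y-axis label to match its markers is one simple way to show which scale belongs to which variable.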

## Introduction to Seaborn

`Seaborn` is a Python data visualization library that extends the `Matplotlib` library and provides a high-level interface for plotting statistical graphics with attractive default settings.

<img src="https://camo.githubusercontent.com/03fc087cdb5874584b09518e5acbd74abea968ff677c0359247ba853f2b57a91/68747470733a2f2f6d69726f2e6d656469756d2e636f6d2f6d61782f313033302f312a3356674377635a726141307535684d487052684a63772e706e67" alt="Seaborn example plots">

### Seaborn Plot Types
![Seaborn Plot Types](https://seaborn.pydata.org/_images/function_overview_8_0.png)

### Seaborn Scatterplot

Recreate a scatter plot that compares `body mass` on the x-axis to `bill length` on the y-axis using `seaborn`.

We will want to:
* Add axis labels
* Scale the font size to make it more readable
* Remove the extraneous spines

In [21]:
# Recreate a scatterplot using seaborn

Add axis labels:

In [22]:
# Add axis labels

Scale the font size:

In [23]:
# Scale fontsize

Remove the spines:

In [24]:
# Remove spines
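The four steps above could be combined as in the following sketch. It assumes `seaborn` is imported under its conventional alias `sns`, uses a stand-in dataframe with placeholder values, and picks `font_scale=1.5` arbitrarily. Note that the font scaling is set before plotting so that it applies to the new figure.

```python
import pandas as pd
import seaborn as sns

# Stand-in for the real dataframe; the values are illustrative placeholders
penguins = pd.DataFrame({
    "bill_length_mm": [39.0, 40.5, 48.5, 50.0, 46.0, 49.5],
    "body_mass_g": [3700, 3900, 3600, 3900, 4900, 5400],
})

# Scale the fonts; font_scale multiplies the default font sizes
sns.set_context("notebook", font_scale=1.5)

# seaborn takes the dataframe plus column names, and returns a matplotlib Axes
ax = sns.scatterplot(data=penguins, x="body_mass_g", y="bill_length_mm")

# Replace the default labels (the column names) with readable ones
ax.set(xlabel="Body mass (g)", ylabel="Bill length (mm)")

# despine removes the top and right spines by default
sns.despine(ax=ax)
```

Because `sns.scatterplot` returns an ordinary matplotlib axes, everything from the Object Oriented section still applies to it.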

### Color by species
Now that we have a plot similar to the one from our `Matplotlib` section, let's also color the points by species:

In [25]:
# Color by Species
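A sketch of coloring by species, assuming the `hue` parameter of `sns.scatterplot` is used and that the species column is named `species`. The data are a stand-in with placeholder values.

```python
import pandas as pd
import seaborn as sns

# Stand-in for the real dataframe; the values are illustrative placeholders
penguins = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"],
    "bill_length_mm": [39.0, 40.5, 48.5, 50.0, 46.0, 49.5],
    "body_mass_g": [3700, 3900, 3600, 3900, 4900, 5400],
})

# hue maps each distinct value in the column to a color and adds a legend
ax = sns.scatterplot(data=penguins, x="body_mass_g", y="bill_length_mm",
                     hue="species")
ax.set(xlabel="Body mass (g)", ylabel="Bill length (mm)")
sns.despine(ax=ax)
```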

### Seaborn Regression Plot

In [26]:
# Plot regression plot
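A minimal sketch using `sns.regplot`, which draws a scatter plot together with a fitted linear regression line and a confidence band. The choice of body mass and bill length as the two variables is an assumption carried over from the previous plots, and the data are a stand-in with placeholder values.

```python
import pandas as pd
import seaborn as sns

# Stand-in for the real dataframe; the values are illustrative placeholders
penguins = pd.DataFrame({
    "bill_length_mm": [39.0, 40.5, 48.5, 50.0, 46.0, 49.5],
    "body_mass_g": [3700, 3900, 3600, 3900, 4900, 5400],
})

# Scatter plot plus a linear fit of bill length on body mass
ax = sns.regplot(data=penguins, x="body_mass_g", y="bill_length_mm")
ax.set(xlabel="Body mass (g)", ylabel="Bill length (mm)")
sns.despine(ax=ax)
```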

### Seaborn Joint Plot

In [27]:
 # Plot joint plot