Copyright 2020 Andrew M. Olney, Natasha A. Sahr and made available under [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0) for text and [Apache-2.0](http://www.apache.org/licenses/LICENSE-2.0) for code.

# Plotting

Data visualization is the practice of representing data graphically so that patterns, trends, and correlations that might otherwise go undetected can be exposed.

Data visualization is an important tool to understand data.

Charts, plots, graphs, and maps (and many more) are all types of data visualizations. 

There are many facets involved in data visualization; this tutorial is just the introduction to your Python plotting journey. 

Today we will focus on the most often used plots:

- Scatter plots
- Bar plots
- Line plots
- Histograms

**Each type of plot requires a specific type of data and has a specific purpose.**

<!-- By the end of this introduction, you will have mastered:

- plot basics and how to read a graph
- data transformation
- a normal vs. not normal distribution
- misleading graphs -->

## Plotly

In Python, there are many options for visualizing data, and it is often challenging to choose which library to use. 

<!-- The most common libraries used for plotting include:

- <a href="https://altair-viz.github.io" target="_top">`altair`</a>
- <a href="https://docs.bokeh.org/en/latest/" target="_top">`bokeh`</a>
- <a href="http://ggplot.yhathq.com" target="_top">`ggplot`</a>
- <a href="https://matplotlib.org" target="_top">`matplotlib`</a>
- <a href="https://pandas.pydata.org" target="_top">`pandas`</a>
- <a href="https://plotly.com" target="_top">`plotly`</a>
- <a href="http://www.pygal.org/en/stable/" target="_top">`pygal`</a>
- <a href="http://seaborn.pydata.org" target="_top">`seaborn`</a>

The documentation for each library can be found in the links above.  -->

For the purpose of this tutorial, we will focus on understanding, programming, and interpreting plots from `plotly`.

`plotly` is a Python library that produces interactive plots.

That means you can use your mouse to interact with the plot after you've created it.

<!-- It has a robust API, including one for python. Versions of `plotly`  $< 4$ are online. Versions of `plotly`  $\geq 4$ are offline, and the online functionality has been moved to the `chart-studio` library.  -->

To use `plotly`, 

- `import plotly.express` as `px`

**Make sure you run the cell using the &#9658; button or Shift + Enter**

<!-- We will cover plotting and reading the following data visualizations in `plotly`:

- scatter plots 
- line graphs
- bar charts
- box plots
- plots for distributions (histograms and density plots)

We will specifically use the `plotly.express` library for quick and easy plotting. 

Keep in mind: some of the same plots can be done `plotly.graph_objs` and `plotly.figure_factory`. These libraries also allow for additional (and more complicated) plots or have additional flexibility.  -->

## Iris Data

We'll use the classic `iris` dataset to illustrate some plots.

The `iris` dataset contains 5 variables describing iris flowers:

| Variable    | Type    | Description           |
|:-------------|:---------|:-----------------------|
| SepalLength | Ratio   | the sepal length (cm) |
| SepalWidth  | Ratio   | the sepal width (cm)  |
| PetalLength | Ratio   | the petal length (cm) |
| PetalWidth  | Ratio   | the petal width (cm)  |
| Species     | Nominal | the flower species    |

<div style="text-align:center;font-size: smaller">
 <b>Source:</b> This dataset was taken from the <a href="https://archive.ics.uci.edu/ml/datasets/iris">UCI Machine Learning Repository library
    </a></div>
<br>


In order to plot, we need to load the data into a dataframe, so the first step is to import the dataframe library, `pandas`:

- `import pandas` as `pd`

Now we can read the dataset into a dataframe:

- Create variable `dta_iris`
- Set it to `with pd do read_csv` using `datasets/iris.csv`
- Place a `dta_iris` block below it in order to display the dataframe

**Remember the `with ... do` block is in VARIABLES**
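For reference, these blocks correspond roughly to the Python below.
This is only a sketch: it reads a six-row sample from an in-memory string so that it runs on its own.
The sample rows and the lowercase species labels are assumptions standing in for `datasets/iris.csv`, which is the full dataset.

```python
import io

import pandas as pd

# Six-row stand-in for datasets/iris.csv (assumed to share its five columns).
sample_csv = io.StringIO(
    "SepalLength,SepalWidth,PetalLength,PetalWidth,Species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
    "4.9,3.0,1.4,0.2,setosa\n"
    "7.0,3.2,4.7,1.4,versicolor\n"
    "6.4,3.2,4.5,1.5,versicolor\n"
    "6.3,3.3,6.0,2.5,virginica\n"
    "5.8,2.7,5.1,1.9,virginica\n"
)

# In the notebook, pass "datasets/iris.csv" here instead of sample_csv.
dta_iris = pd.read_csv(sample_csv)

# Display the dataframe (in a notebook cell, a bare `dta_iris` is enough).
print(dta_iris)
```

The rest of this section describes the output for the full file rather than this small sample.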

We can see there are 150 rows in this dataset.
Each row is a datapoint (also called an observation).

When we plot the data, we will typically use all the datapoints, but we typically only use 1-2 variables (i.e. columns).

Each plot allows us to look at properties of a variable or relationships between variables.

In a dataset with as many variables as `iris`, we would expect to do many plots if we wanted to explore all of these relationships.

## Scatter Plots

Scatter plots are one of the most basic and useful plots for looking at the relationship between two variables.

Scatterplots:

- Require each variable to be on an interval or ratio scale
- Show each datapoint

A simple scatter plot in `plotly.express` is defined by three things:

- the dataframe
- the x (or independent) variable
- the y (or dependent) variable


For example, if we want to evaluate the relationship between sepal width and sepal length, we will have:

- `dta_iris`
- `x="SepalWidth"`
- `y="SepalLength"`

These three pieces of information are called **arguments** in programming terminology.

Two important things to note:

- We need to put these arguments in a list block (from LISTS)
- For the last two of these, we need to use a FREESTYLE block as highlighted below

![image.png](attachment:image.png)

Let's continue this example with actual code.

Follow these steps:

- Get a `with px do scatter using` block
- Inside that block put a `create list with` block, and inside that block put
    - `dta_iris` (from VARIABLES)
    - a freestyle block with `x="SepalWidth"` in it
    - a freestyle block with `y="SepalLength"` in it
    
**It may take ~10 seconds for the plot to appear. You know Jupyter is working on it because `[*]` will appear next to the cell.**

Try hovering your mouse over each datapoint to see its values.
This is just one of plotly's many interactive features; others can be accessed from the popup menu at the top right of the plot.

From this plot, it looks like perhaps sepal width and sepal length increase together, because you can imagine a diagonal line going from the bottom left to the top right through the datapoints.

However, it also appears like there might be two groups of datapoints, an upper and a lower group.

Let's make some tweaks to this plot to illustrate some of what is possible in plots.

Copy the block for the plot above using the following steps:

- Click on the code cell
- Click on the block that appears
- Press Ctrl-C to copy
- Click on the empty code cell below
- Press Ctrl-V to paste
- Click "Blocks to Code" to save your blocks in the code cell

Once you have copied the block, add two more slots to the list and fill them with two more freestyle blocks:

- `title="Relationship between Sepal Length and Sepal Width"`
- `color="Species"`

The title that we have added is just one example of how plots can be annotated with descriptive text.
We could use custom labels for our x/y variables as well, e.g. "Sepal Width (cm)".

The color component we added is more interesting and gives us a clearer view into the data.
We can now clearly see three groups corresponding to the three species of iris:

- Versicolor and virginica are very similar in terms of their relationship between sepal width and length
- Setosa is distinctly different from these other two species

It's worth stressing that we now have three variables represented in the scatterplot: sepal width, sepal length, and species.
While sepal width and length are represented by position on the x and y axes, species is represented by color.
Color works well for species because it is a categorical variable.

This example just scratches the surface of [what is possible with scatterplots in plotly](https://plotly.com/python/line-and-scatter/#set-size-and-color-with-column-names).

## Bar plots

Bar plots are very commonly used in both science and the business world.

Bar plots:

- Require the x to be discrete values
- Require the y to be a single number per x
- Are best for showing summary values like averages

In other words, while scatterplots show all the datapoints, bar plots only show a summary value of y for each x.

Let's make a bar plot using the average, or `mean` of the variables as a summary value.

First, let's look at the mean by itself:

- `with dta_iris do mean using`

We can see the mean for each variable, but notice this output is not formatted like a dataframe, because there are no column labels.

Instead, this is something `pandas` calls a **series**, which is like a single column in a dataframe.
The difference here is that the variable names, e.g. `SepalWidth`, serve as the axis labels in place of the numeric axis labels we've seen previously.

Since `dta_iris.mean()` is a series, it has no column names we can use for x and y in our plot.
However, `plotly` is smart enough to plot it anyways, like this:

- `with px do bar using with dta_iris do mean using`

**If you have trouble connecting these blocks, move your mouse slowly as you make the final connection, and try letting go even if you don't hear the click.**

Notice that our axis labels and legend aren't very nice because no x or y names were given, but the plotted data is correct.

While this usage of a bar plot is interesting, we can do something even better: group datapoints by species and then calculate the mean.

First, let's group by species:

- Create variable `groups`
- Set it to `with dta_iris do groupby using "Species"`

`groups` now contains three dataframes' worth of datapoints, one for each species.
To get the mean of each of these, do this:

- `with groups do mean using`

The output here is a dataframe showing the average values of the four numeric variables separated out by species, giving us 12 values.
In contrast, the earlier output we looked at combined all species, so we only had 4 values.

We can now make a more interesting bar chart:

- Get a `with px do bar using` block
- Inside that block put a `create list with` block, and inside that block put
    - `with groups do mean using`
    - a freestyle block with `y="PetalWidth"` in it

**Notice we omitted x because we want to use the axis labels as x.**

The plot very nicely shows the increase in petal width across species.

## Line plots

Line plots are virtually identical to bar plots in usage because they:

- Require the x to be discrete values
- Require the y to be a single number per x
- Are best for showing summary values like averages

However, line plots, unlike bar plots, have the advantage that you can show multiple **sets** of lines at once.
In a bar plot, these would be overlapping, and potentially difficult to see.

To make a line plot with multiple sets of lines:

- Get a `with px do line using` block
- Inside that block put a `create list with` block, and inside that block put
    - `with groups do mean using`

`plotly` nicely draws each variable in its own color, so we can see that all variables except `SepalWidth` seem to increase across species.

There are two important points to make here:

- Normally in line plots, the x axis is an ordered variable, like year. With a nominal variable like `Species`, we are fortunate to get such nice lines and not "spaghetti."

- Drawing multiple lines at once on one plot only makes sense if the variables have the same units of measurement, here centimeters. Otherwise the plot can mislead anyone not looking closely at the y axis.

## Histograms

Histograms introduce a new idea, **probability distributions**, into the discussion.
A probability distribution is simply a table listing the probability that a variable will have a particular value.

In our work, you can think in terms of **count distributions** or the number of times a variable has a particular value.
We will use the term **distribution** to refer to either count or probability distributions interchangeably. 

There are as many different types of distributions as there are different types of animals in the zoo!
For our purposes, we highlight five general shapes of distributions:

- **Uniform:** a flat distribution where every value is equally likely
- **Normal:** a bell curve distribution where values toward the middle are most likely
- **Skewed right:** a declining distribution where small values are likely and large values unlikely
- **Skewed left:** the opposite of skewed right
- **Mixtures:** appear as two or more of the above distributions

The purpose of generating histograms is to visually determine the approximate distribution of a variable. 
Histograms can reveal extreme values, missing ranges, or skew that may require special care in later analysis.

Histograms:

- Require x 
- Automatically determine bar widths for x
- Automatically define y as the count of values for x
- Are used to show the distribution of a **single** variable

Let's first look at the distribution of `Species` in our dataset:

- `with px do histogram using` a list containing
    - `dta_iris`
    - freestyle `x="Species"`

The histogram shows a uniform distribution because each species is equally represented in our data.
Recall that there are 150 rows, or datapoints, and the histogram shows each species corresponds to 50 of those datapoints.

This histogram is a little unusual because the x axis is not numeric. 
Let's look at a numeric example.

- Copy the blocks above but change `"Species"` to `"PetalLength"`

This distribution appears to be a mixture of two normal distributions.
As before, we can test this idea by adding color based on species:

- Copy the blocks above and add freestyle `color="Species"`

Now it is clear that there are different distributions related to species.
It is worth noting that when distributions overlap, as they do here, part of the coloring may be obscured by the overlapping color.

Finally, let's look at a simple histogram:

- `with px do histogram using` a list containing
    - `dta_iris`
    - freestyle `x="SepalLength"`

This distribution appears to be approximately normal, or bell curve shaped.
Not only does it have a clear middle peak, but the distribution is basically symmetric.

Let's pause for a moment and point out something about histograms using numeric values: the counts are "binned" rather than exact counts.

Hover your mouse over the leftmost bar in the plot above. 
Notice it says "SepalLength=4-4.4" and "count=4".
That means that there are four datapoints that have a sepal length between 4 and 4.4.

Using bins in this way **smooths** histograms, which would otherwise have a jagged appearance for small datasets. 
In most cases, the bin width provided by `plotly` is reasonable, but you should be aware that very wide bins can distort distributions, e.g. make mixed distributions look normal.
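To see how bin width can change the apparent shape, here is a small sketch that counts the same values with narrow and wide bins.
It uses `numpy` rather than `plotly` so the counts can be printed directly, and the values are hypothetical, chosen to form two clusters; they are not taken from the iris data.

```python
import numpy as np

# Hypothetical values with two clusters (near 2.5-3.0 and near 4.0-4.5).
values = [2.0, 2.6, 2.7, 2.8, 2.9, 3.4, 4.0, 4.1, 4.2, 4.3, 5.0]

# Narrow bins of width 0.5 keep the two clusters apart.
narrow_edges = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5]
narrow_counts, _ = np.histogram(values, bins=narrow_edges)
print(narrow_counts)  # [1 4 1 0 4 0 1]

# Wide bins of width 2.0 merge both clusters into one central bar.
wide_edges = [0.5, 2.5, 4.5, 6.5]
wide_counts, _ = np.histogram(values, bins=wide_edges)
print(wide_counts)  # [1 9 1]
```

With the wide bins, the two clusters collapse into a single middle peak, which is how a mixture can end up looking like one bell curve.
In `plotly.express`, the `nbins` argument of `px.histogram` can be used to request a different number of bins.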



## Summing up

There are many types of plots, and which you should choose depends on the variables you want to visualize as well as the purpose of your visualization:

- Scatterplots are the only plot we covered that show individual data points. However, they require x and y to be numeric
- Bar plots show a single value for each x, typically an average or other summary value
- Line plots are like bar plots but have an advantage for showing multiple lines at once
- Histograms are the only plot we covered that uses a single variable. Histograms show the distribution of a variable
