<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Principles of Data Visualization With Python

### Learning Objectives
*After this lesson, you will be able to:*
- Describe why data visualization is important.
- Identify the characteristics of a great data visualization.
- Describe when you would use a bar chart, pie chart, scatter plot, and histogram.

### Lesson Guide

- [Why Use Data Visualization?](#why)
- [Anscombe's Quartet](#anscombe)
- [Attributes of Good Visualization](#good)
- [Choosing the Right Chart](#choosing)
- [Visualization Programming Libraries](#visualization)
- [Conclusion](#conclusion)
- [Next Step](#next)

<a id='why'></a>
### Question: Why Use Data Visualization?

---

**Paired Exercise *(~5 mins)***

- **In pairs, find a data visualization that you have used or enjoyed.**
- **Ask yourself:** Why do you think data visualization is useful? Why is it important?
- **Slack me the URL** to that visualization.


In [None]:
# Data are just visualizations waiting to happen.
# Let's look at what we usually start with: a table of raw values.
import pandas as pd

# check out my awesome text-based data viz below
df = pd.read_csv("./datasets/sales_info.csv")
df.head()

### Adam's Top 5 Plots (of the day)

#### Figure 1A
Some plots are great because they are simple, and ask *good questions*.
<img src="https://johnwmillr.github.io/assets/images/FreqPlot_beer_and_truck.png" style="width: 500px;"/>

#### Figure 1B
<img src="https://johnwmillr.github.io/assets/images/FreqPlot_girl_and_love.png" style="width: 500px;"/>

#### Figure 2
Other plots are great because, while they are more complicated, they *clearly convey information*.
<img src="http://www.randalolson.com/wp-content/uploads/percent-bachelors-degrees-women-usa.png" style="width: 700px;"/>

#### Figure 3
Sometimes plots are great purely out of **dumb luck**.
<img src="https://fivethirtyeight.com/wp-content/uploads/2018/04/hickey-cage-01.png?w=1150" style="width: 500px;"/>

#### Figure 4
But often, plots are great because they manage to clearly represent otherwise ***complicated relationships.***
<img src="https://fivethirtyeight.com/wp-content/uploads/2018/05/wezerek-derby-01.png?w=2048" style="width: 800px;"/>

#### Figure 5
And more and more often, plots are great because they offer a greater range of information through **interactive** inputs.

In case anyone is [***getting hungry***](https://fivethirtyeight.com/burrito/#brackets-view)...

### Answer: Why Use Data Visualization?

---

Because of the way the human brain processes information, charts or graphs that visualize large amounts of complex data are easier to understand than spreadsheets or reports. 

- Data visualization is a quick, easy way to convey concepts in a universal manner, and you can experiment with different scenarios by making slight adjustments.

- Here's a helpful overview of the importance of data visualization: [SAS: Data Visualization](http://www.sas.com/en_us/insights/big-data/data-visualization.html)

#### Challenger Shuttle Disaster: A Case Study

*Edward Tufte* has made great contributions to our understanding of the visual communication of information through numerous books and articles. One of his best-known contributions is his critique of how data were presented before the **1986 Challenger Shuttle disaster**.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/fd/STS-51-L_-_Space_Shuttle_Challenger_on_the_Crawler-Transporter.jpg/220px-STS-51-L_-_Space_Shuttle_Challenger_on_the_Crawler-Transporter.jpg" style="width: 250px;"/>

In his critique, he highlights the poor representation of data by NASA scientists and engineers in the days leading up to the launch, which killed all seven crew members. An example of the representations they used is shown below:

<img src="./assets/o_ring_rockets.jpg" style="width: 450px;"/>

> **After thorough inspection** of the figure above, it would seem NASA engineers had collected data indicating that ***rubber O-rings failed to seal correctly in cold weather.***
> - But these charts were unable to sufficiently convey the danger of a liftoff in the colder-than-average weather that morning. 
> - For example, the vertical text is hard to read, and the rocket cartoons obscure information.

*Most importantly,* had the data on O-ring performance been arranged by the most critical factor, temperature, instead of by launch date, decision makers would have had a much better chance of seeing that a launch in weather below 66 degrees Fahrenheit would very likely lead to O-ring failure (see below).

<img src="./assets/ring_damage.jpg" style="width: 700px;"/>

Instead, as we all know, the launch proceeded when the temperature was 36 degrees Fahrenheit, and the shuttle exploded 73 seconds after liftoff.

[Source](https://stanfordmag.org/contents/elemental-evidence)

#### *Heavy,* right?

> Rest assured, most of our visualizations won't be involved in life-and-death decision-making like those at NASA...
>
> <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/98/Mort.jpg/440px-Mort.jpg" style="width: 250px;"/>

<a id='anscombe'></a>
### _Refresher:_ Anscombe's Quartet

---

Below are the summary statistics for four data sets. What do you think a plot of each data set would look like?

<img src="./assets/anscombs_quartet.png" style="width: 500px;"/>


You can probably already guess the answer: although the four data sets have the same summary statistics, they look completely different. This becomes clear when we plot them side by side.

<img src="./assets/anscombs_viz.png" style="width: 700px;"/>

These descriptive statistics come from a data set constructed in 1973 by the statistician Francis Anscombe. It is a classic demonstration of the importance of data visualization.

- It highlights the limitations of summary statistics.
- It shows the effect of outliers on statistical properties.
- Anscombe's intention was to attack the impression among statisticians that "numerical calculations are exact, but graphs are rough."
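
The claim that the four data sets share their summary statistics can be checked directly. The sketch below types in the quartet's published values by hand (it does not read the lesson's image assets) and uses only the standard library. The helper name `pearson_r` and the `summary` dictionary are our own choices for illustration.

```python
from statistics import mean, variance

# Anscombe's quartet: sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Collect the same four statistics for every data set.
summary = {
    name: {
        "mean_x": mean(xs),
        "var_x": variance(xs),  # sample variance
        "mean_y": mean(ys),
        "r": pearson_r(xs, ys),
    }
    for name, (xs, ys) in quartet.items()
}

for name, stats in summary.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The printed rows should agree to two decimal places, even though the scatter plots above look nothing alike.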

<a id='choosing'></a>
### Choosing the Right Chart

---


In addition to considering data visualization attributes, you should also carefully choose the type of chart or graph you'll use. 
- Below is a really useful flow-chart to help you select the right visualization for your problem. 

<img src="http://www.perceptualedge.com/blog/wp-content/uploads/2015/07/Abelas-Chart-Selection-Diagram.jpg" style="width: 900px;"/>

PDF version [here.](http://extremepresentation.typepad.com/files/choosing-a-good-chart-09.pdf)

<a id='good'></a>
### Attributes of Good Visualization

---

What are some attributes you think are important for data visualizations to have? 

Let's take a look at what Jeffrey Shaffer, who teaches data visualization at the University of Cincinnati, thinks:

<img src="./assets/data_attributes.png" style="width: 700px;"/>

Interestingly, some attributes have more of an effect on our brains than others. The ones we tend to focus on most are:

1. #### Position
2. #### Size
3. #### Color


Let's take a look at three visualizations. Which one catches your attention most? Why?

## 1.
<img src="./assets/mixed_shapes.png" style="width: 200px;"/>

## 2.
<img src="./assets/stacked_shapes.png" style="width: 200px;"/>

## 3.
<img src="./assets/colored_shapes.png" style="width: 200px;"/>


#### Focusing on *color*
- Generally, in data visualizations, you’re going to use color in one of three ways: *sequential, divergent, or categorical.*

***Sequential colors*** are used to show values ordered from low to high.

<img src="./assets/sequential.png" style="width: 700px;"/>

***Divergent colors*** are used to show ordered values that have a critical midpoint, like an average or zero.

<img src="./assets/divergent.png" style="width: 700px;"/>

***Categorical colors*** are used to distinguish data that falls into distinct groups.

<img src="./assets/categorical.png" style="width: 700px;"/>


[Images via MediaShift](http://mediashift.org/2016/02/checklist-does-your-data-visualization-say-what-you-think-it-says/)
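
In Matplotlib, these three uses of color correspond to different families of built-in colormaps. The sketch below is a hypothetical helper, `suggest_colormap`, that picks one colormap name per use. The names `"viridis"`, `"coolwarm"`, and `"tab10"` are real Matplotlib colormaps, but choosing these particular ones, and the simple rules used to decide between them, are assumptions made for illustration.

```python
def suggest_colormap(values, midpoint=None):
    """Return the name of a Matplotlib colormap suited to `values`.

    Hypothetical helper: the rules are deliberately simple.
    """
    # Text labels describe distinct groups, so use a categorical palette.
    if all(isinstance(v, str) for v in values):
        return "tab10"
    # Numbers that fall on both sides of a meaningful midpoint
    # (zero, an average, a target) call for a divergent palette.
    if midpoint is not None and min(values) < midpoint < max(values):
        return "coolwarm"
    # Otherwise the numbers simply run from low to high.
    return "viridis"


print(suggest_colormap(["north", "south", "east"]))
print(suggest_colormap([-3.2, 0.5, 4.1], midpoint=0))
print(suggest_colormap([10, 250, 900]))
```

The returned string can be passed as the `cmap` argument that many Matplotlib plotting functions accept.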

<a id='chart_choice'></a>

### Choosing the Right Chart

---

In addition to considering data visualization attributes, you should also carefully choose the type of chart or graph you'll use. Let's look at a few commonly used charts and graphs.

<img src="https://imgs.xkcd.com/comics/stove_ownership.png" style="width: 500px;"/>


### Bar Charts

Bar charts are one of the most common ways of visualizing data. 

> ***Why?*** 
> - Because they make it easy to compare information, revealing highs and lows quickly. 
> - Bar charts are most effective when you have numerical data that splits neatly into different categories.

<img src="./assets/bar_chart.png" style="width: 800px;"/>
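
As a rough sketch of "numerical data that splits neatly into categories", the example below totals a numeric column per category and sorts the result so the highs and lows are easy to spot. The records are made up for illustration and are not taken from the lesson's sales data. `plot_totals` is a hypothetical helper that assumes Matplotlib is installed, and it is defined but not called here.

```python
from collections import defaultdict

# Made-up (region, sale amount) records, for illustration only.
records = [("East", 120), ("West", 95), ("East", 80), ("North", 60), ("West", 40)]

# Sum the numeric value within each category.
sums = defaultdict(int)
for region, amount in records:
    sums[region] += amount

# Sort from highest to lowest so the bars read left to right by size.
totals = dict(sorted(sums.items(), key=lambda item: item[1], reverse=True))
print(totals)


def plot_totals(totals):
    """Draw one bar per category (requires Matplotlib)."""
    import matplotlib.pyplot as plt  # imported lazily so the code above runs without it

    fig, ax = plt.subplots()
    ax.bar(list(totals.keys()), list(totals.values()))
    ax.set_xlabel("Region")
    ax.set_ylabel("Total sales")
    return ax
```

In a notebook, calling `plot_totals(totals)` should draw the chart. Sorting the bars is a design choice that suits unordered categories; categories with a natural order, such as months, are usually better left in that order.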


### Pie Charts

Pie charts are the most *commonly misused* chart type. 
- They should only be used to show relative proportions or percentages of a whole.

> **If you want to compare data, leave it to bars or stacked bars.**
> - If your viewer has to work to translate pie wedges into relevant data or compare pie charts to one another, the key points you're trying to convey might go unnoticed.
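
A pie chart only makes sense when the slices are parts of one whole. As a minimal sketch, the hypothetical helper `percentage_shares` below converts raw counts into percentages, which makes that requirement explicit. The survey counts are invented for illustration.

```python
def percentage_shares(counts):
    """Convert raw counts into each category's share of the whole, in percent."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("a pie chart needs a positive total")
    return {label: 100 * n / total for label, n in counts.items()}


# Made-up survey counts, for illustration only.
responses = {"Yes": 45, "No": 30, "Undecided": 25}
shares = percentage_shares(responses)
print(shares)
```

If the shares do not describe one whole, for example when respondents could pick several answers, a bar chart is the safer choice.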

### The Best Use of a Pie Chart

<img src="http://i.imgur.com/uhTf6Ek.jpg" style="width: 550px;"/>


<img src="./assets/pie_chart.jpg" style="width: 500px;"/>

Source: [Pie chart via TV.com](http://www.tv.com/news/learning-about-the-2013-pilot-season-through-pie-charts-136243394841/)

### Scatter Plots

Scatter plots are a great way to give you a sense of trends, concentrations, and outliers. This will provide a clear idea of what you may want to investigate further. 

<img src="./assets/scatter_plot.png" style="width: 500px;"/>

[Scatter plot via Wikibooks](https://en.wikibooks.org/wiki/Statistics/Displaying_Data/Scatter_Graphs)

### Histograms 

Histograms are useful when you want to see how the values of a numerical variable are distributed across ranges of values (bins).

<img src="./assets/histogram.png" style="width: 600px;"/>

> **Pro-tip:** A useful rule of thumb for deciding on how many bins to use in our histogram is to use the square root of the number of values.
>
>$$\sqrt{N_{values}}$$
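
The rule of thumb above can be sketched in a few lines. The function names `sqrt_bin_count` and `histogram_counts` are our own, the sample values are made up, and rounding the square root up is an assumption, since the rule itself does not say how to round.

```python
import math


def sqrt_bin_count(n_values):
    """Square-root rule: number of bins is about sqrt(N), rounded up here."""
    return max(1, math.ceil(math.sqrt(n_values)))


def histogram_counts(values, n_bins):
    """Count how many values fall into each of `n_bins` equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1  # avoid a zero width when all values are equal
    counts = [0] * n_bins
    for v in values:
        # The maximum value would land one past the end, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Made-up sample of 16 values, for illustration only.
values = [2, 3, 3, 4, 5, 5, 5, 6, 6, 7, 7, 8, 9, 10, 12, 14]
n_bins = sqrt_bin_count(len(values))
counts = histogram_counts(values, n_bins)
print(n_bins, counts)
```

In practice you would rarely bin by hand. NumPy's `numpy.histogram`, for instance, accepts `bins="sqrt"` to apply a square-root estimator.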

This is not an all-inclusive list of chart and graph types, but the point is to remember that you have options. 
- You should always consider which one is most appropriate for representing a particular data set. 

[Charts and graphs via Tableau](https://drive.google.com/file/d/0Bx2SHQGVqWasT1l4NWtLclJJcWM/view)

[Another suggestion](http://extremepresentation.typepad.com/files/choosing-a-good-chart-09.pdf)

<a id='visualization'></a>
### Visualization Programming Libraries

---

In this lesson, we will use the Python libraries [Matplotlib](https://matplotlib.org/) (Python plotting) and [Seaborn](https://seaborn.pydata.org/) (statistical data visualization).

Many other Python libraries exist for making visualizations. Some of the most popular include:

- **[Bokeh](http://bokeh.pydata.org/en/latest/):** Python visualization library that targets the web browser (e.g., in Jupyter). Makes interactive plots, dashboards, data applications, etc.

- **[Graphviz](http://graphviz.readthedocs.io/en/stable/manual.html):** Popular visualization library for graph data structures (i.e., vertices and edges). Has Python extensions.

- **[Basemap](http://matplotlib.org/basemap/):** Python Matplotlib extension for drawing static maps. There are many other Python libraries for plotting geographic data, including ones that might be easier to use, but many are not actively developed.

One of the most popular libraries for interactive visualizations in the web browser is D3. Because web browsers only natively run JavaScript, D3 requires knowledge of JavaScript:

- **[D3.js](https://d3js.org/):** JavaScript library for interactive web visualizations. See the [examples gallery](https://github.com/mbostock/d3/wiki/Gallery).

### Other visualization tools

Although this course emphasizes a Python approach to data science, a variety of non-programming tools are also used in industry. Often, these tools can be applied much more quickly than creating a custom Python solution. For example:

- **Excel:** For quick data cleaning and simple graphs
- **Power BI:** A suite of business analytics tools
- **Tableau:** Business intelligence and analytics software
- **Periscope Data:** Data analysis platform
- **Plotly:** Create charts and dashboards


<a id='conclusion'></a>

### Conclusion: Things to consider

---

- Why is data visualization so important? 
- What are some considerations to keep in mind when creating a visualization? 
- When would you use the following types of charts or graphs?
    - Bar chart
    - Pie chart
    - Scatter plot
    - Histogram 

<a id='next'></a>

### Next step: Python plotting with pandas and seaborn

---

Open up the [independent research notebook](./practice/python-data-viz-lab.ipynb) to explore plotting the sales data with Python. 
