# DS104 Data Wrangling and Visualization : Lesson Four Companion Notebook

### Table of Contents <a class="anchor" id="DS104L4_toc"></a>

* [Table of Contents](#DS104L4_toc)
    * [Page 1 - Introduction](#DS104L4_page_1)
    * [Page 2 - Histograms](#DS104L4_page_2)
    * [Page 3 - Histograms in Python](#DS104L4_page_3)
    * [Page 4 - Bar Charts](#DS104L4_page_4)
    * [Page 5 - Bar Charts in R](#DS104L4_page_5)
    * [Page 6 - Bar Charts in Python](#DS104L4_page_6)
    * [Page 7 - Stacked Bar Charts](#DS104L4_page_7)
    * [Page 8 - Stacked Bar Charts in R and Python](#DS104L4_page_8)
    * [Page 9 - The Pareto Principle](#DS104L4_page_9)
    * [Page 10 - Scatter Plots](#DS104L4_page_10)
    * [Page 11 - Scatter Plots in Python](#DS104L4_page_11)
    * [Page 12 - Line Charts](#DS104L4_page_12)
    * [Page 13 - Line Charts with Multiple Dependent Variables](#DS104L4_page_13)
    * [Page 14 - Line Charts in R](#DS104L4_page_14)
    * [Page 15 - Line Charts in Python](#DS104L4_page_15)
    * [Page 16 - Area Charts](#DS104L4_page_16)
    * [Page 17 - Google Trends](#DS104L4_page_17)
    * [Page 18 - Key Terms ](#DS104L4_page_18)
    * [Page 19 - Lesson 4 Hands-On](#DS104L4_page_19)    

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 1 - Introduction<a class="anchor" id="DS104L4_page_1"></a>

[Back to Top](#DS104L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

In [1]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Histogram and Bar Charts
VimeoVideo('241240235', width=720, height=480)

# Introduction

Usually the first thing people want to do with a set of data is get a feel for the distribution of those data. Whether the data are quantitative or categorical, there is a way to visualize the most important findings.

In this lesson, you will learn how to create: 

* Histograms
* Bar charts
* Stacked bar charts
* Scatter plots
* Line charts

You will also learn about the concepts behind: 

* Pareto plots
* Area charts
* Google Trends

This lesson will culminate in a hands-on in which you will create several different charts in the language of your choice.

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>You may want to watch this <a href="https://vimeo.com/432027835">recorded live workshop that goes over the Python material</a> in this lesson or this <a href="https://vimeo.com/438414531">recorded live workshop on the R material in this lesson</a>.</p>
    </div>
</div>



In [2]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Histogram and Bar Charts
VimeoVideo('432027835', width=720, height=480)

In [3]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Histogram and Bar Charts
VimeoVideo('438414531', width=720, height=480)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 2 - Histograms<a class="anchor" id="DS104L4_page_2"></a>

[Back to Top](#DS104L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Histograms

A histogram is a graphical representation of the distribution of quantitative data. It was first introduced by Karl Pearson. To construct a histogram, the first step is to "bin" the range of values — that is, divide the entire range of values into a series of intervals — and then count how many values fall into each interval.

The bins are usually specified as consecutive, non-overlapping intervals of a variable. The bins must be adjacent, and are usually of equal size. 

A typical simple histogram looks like this:

![A histogram of monthly salary. The x-axis represents gross monthly salary and the y-axis represents frequency. The x-axis ranges from 800 dollars to 1800 dollars in 11 units. The y-axis ranges from 0 to 200 in five units.](Media/L03-01.png)

The horizontal axis shows the range of values for the quantitative variable that is being displayed. The vertical axis shows the count of values that fall into each bin. 
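To make the binning step concrete, here is a minimal sketch that counts values per bin without drawing anything. The salary numbers are made up for illustration and are not the data behind the figure above; ```np.histogram``` is just one convenient way to do the counting.

```python
import numpy as np

# Made-up monthly salaries, purely for illustration
salaries = np.array([850, 920, 1010, 1100, 1150, 1180, 1250, 1300, 1420, 1750])

# Split the range 800-1800 into 5 adjacent, equal-width bins
counts, edges = np.histogram(salaries, bins=5, range=(800, 1800))

# Each bin is described by its left edge, right edge, and count
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.0f} to {right:6.0f}: {count}")
```

The counts are exactly the bar heights a histogram of these values would show, and they always add up to the number of observations.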

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 3 - Histograms in Python<a class="anchor" id="DS104L4_page_3"></a>

[Back to Top](#DS104L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Histograms in Python

Since you have already learned how to create a histogram in R, you will now go over how to create them in Python. You will be using **[data from a human resources department](https://repo.exeterlms.com/documents/V2/DataScience/Data-Wrang-Visual/HR_data.zip)**. 

---

## Histograms with Pandas

The easiest way to create a histogram in Python is to use the ```pandas``` package, with the function ```.hist()```.  All you need to do is specify the dataset name, then the variable name and the function ```.hist()```: 

```python
HR.satisfaction_level.hist()
```

And here is the result: 

![A bar chart represents 0 to 1.0 with six units on the x-axis and 0 to 2000 with 8 units on the y-axis.](Media/quant1.png)

You may encounter a situation where the histogram doesn't display and you see only a line of text, something like this: 

```text
<matplotlib.axes._subplots.AxesSubplot at 0x18ff12bfc50>
```

If that happens, don't panic! All that is required is to re-run the cell, and you'll be good to go.
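The snippet above assumes a data frame named ```HR``` is already loaded. As a rough sketch, the cell below builds a small synthetic stand-in so the call can be tried on its own. The random values and the file name mentioned in the comment are assumptions, not part of the lesson data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the HR data. The real file would be loaded with
# something like pd.read_csv("HR_data.csv") (the file name is an assumption).
rng = np.random.default_rng(0)
HR = pd.DataFrame({"satisfaction_level": rng.uniform(0.09, 1.0, size=500).round(2)})

# bins= controls how many intervals the range is divided into
ax = HR["satisfaction_level"].hist(bins=10)
ax.set_xlabel("satisfaction_level")
```

Because ```.hist()``` returns a matplotlib axes object, you can keep it in a variable and adjust the plot afterwards.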

---

## Single Histogram with Seaborn

The next way to make histograms is to use a package called ```seaborn```.  You will need to import it with this import statement: 

```python
import seaborn as sns
```

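Then you can use the function ```distplot``` from seaborn to draw an automated histogram with a smoothed density curve. Newer seaborn releases (0.11 and later) deprecate ```distplot``` in favor of ```histplot``` and ```displot```, so you may see a warning when you run this: 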

```python
sns.distplot(HR['satisfaction_level'])
```

And here is the result: 

![A graph depicts the satisfaction level. The x-axis ranges from 0 to 1.0 in six units and the y-axis ranges from 0 to 2.00 in 8 units.](Media/quant2.png)

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>You will get an error if you try to run seaborn plots with missing data!</p>
    </div>
</div>

---

## Multiple Histograms with Seaborn

What if you wanted to see histograms for all your quantitative data at once? Well, then you can make use of the ```seaborn``` function ```pairplot()```, run on the entire ```HR``` dataset: 

```python
sns.pairplot(HR)
```

And here's what you'll receive in return for that little snippet of code! The histograms are shown on the middle diagonal.

![Sixty-four graphs depicting the plot of satisfaction level, last evaluation, number project, average monthly hours, time spend company, work accident, left, and promoters list five years.](Media/quant3.png)

---

## Histograms with Matplotlib

While the histogram options in ```pandas``` and ```seaborn``` are quick and easy to use, they offer less direct control over titles, labels, colors, and other parameters.  For that, you will need the holy grail of data visualization in Python: the package ```matplotlib```. 

Start by importing it and its related subpackages: 

```python
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
```

Then you can start to get funky! You can create a variable for the number of bins you'd like in your histogram, like this: 

```python
num_bins = 5
```

Then specify the details of the plot. Call the function ```plt.hist()``` with your variable, followed by your bins variable, ```num_bins```. You can also set the ```facecolor=``` argument to the color you'd like and the ```alpha=``` argument to control transparency. 

```python
n, bins, patches = plt.hist(HR['satisfaction_level'], num_bins, facecolor='blue', alpha=.05)
```

Then you will call this plot with the function ```.show()```:

```python
plt.show()
```

These three snippets of code can all go in the same Jupyter Notebook cell and be run together.  When you do so, here is the result: 

![A graph in which the x-axis ranges from 0 to 1.0 in six units and the y-axis ranges from 500 to 4000 in eight units. The plot shows a stairs-like structure.](Media/quant4.png)

The ```alpha=``` argument adjusts the color transparency on a scale from 0 (fully transparent) to 1 (fully opaque). If you leave it at .05, you get the pretty washed out colors that you see above.  But if you change it to something closer to 1, like .9, you get a much deeper color, like below: 

![A graph in which the x-axis ranges from 0 to 1.0 in six units and the y-axis ranges from 500 to 4000 in eight units. The plot shows a stairs-like structure.](Media/quant5.png)
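If you'd like to compare transparency settings side by side, a sketch along these lines may help. It uses random made-up values rather than the HR data, and the two ```alpha=``` values are chosen only for contrast.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
values = rng.uniform(0, 1, size=1000)  # made-up stand-in for satisfaction_level

# Two panels that share a y-axis, one per alpha setting
fig, (ax_faint, ax_solid) = plt.subplots(1, 2, figsize=(8, 3), sharey=True)

ax_faint.hist(values, bins=5, facecolor="blue", alpha=0.05)  # nearly transparent
ax_faint.set_title("alpha=0.05")

ax_solid.hist(values, bins=5, facecolor="blue", alpha=0.9)   # nearly opaque
ax_solid.set_title("alpha=0.9")
```

Both panels contain the same bars; only the opacity of the fill differs.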

---

### Adding Labels and a Title

If you want to add labels to any of your ```matplotlib``` graphs, you can do so like this: 

```python
plt.xlabel('Satisfaction Level')
plt.ylabel('Frequency')
```

Here you are using the function ```xlabel()``` to label the x-axis and the function ```ylabel()``` to label the y-axis.  Just specify the text you'd like inside of the parentheses and quotes.

If you'd like to add a title to the graph, you can do so with the function ```title()```, which functions the same way as the labels - just place your text in the parens and quotes: 

```python
plt.title('Histogram of Satisfaction Level')
```

When you combine it all together, your code might look something like this for the whole enchilada: 

```python
num_bins = 5
n, bins, patches = plt.hist(HR['satisfaction_level'], num_bins, facecolor='blue', alpha=.05)
plt.xlabel('Satisfaction Level')
plt.ylabel('Frequency')
plt.title('Histogram of Satisfaction Level')
plt.show()
```

And here is what the graph would look like: 

![A histogram of satisfaction level. The x-axis represents the satisfaction level and it ranges from 0 to 1.0 in five units and the y-axis represents the frequency that ranges from 0 to 4000 in eight units.](Media/quant6.png)

---


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 4 - Bar Charts<a class="anchor" id="DS104L4_page_4"></a>

[Back to Top](#DS104L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Bar Charts

At a glance, a bar chart looks a lot like a histogram. But when you get into the details, there is one huge difference between bar charts and histograms: 

> A histogram is a visualization of a quantitative variable, whereas a bar chart is a visualization of a categorical variable.

Bar charts can be represented either vertically, like this: 

![A bar chart depicts the grades of students. The x-axis represents four bars labeled A, B, C, and D and the y-axis represents students and it ranges from 0 to 15 in four equal intervals. The first bar reaches four on the y-axis, the second bar reaches 12 units on the y-axis, the third bar reaches 10 units on the y-axis, and the fourth bar reaches 2 units on the y-axis.](Media/L03-05.png)

The bar chart above shows student grades. The horizontal axis lists the grades received, and the vertical axis shows a count of students earning that particular grade. 

Bar charts can also be represented horizontally, like this:  

![A bar chart with four units on the x-axis and three units on the y-axis. The x-axis ranges from 0 to 20. The y-axis has three units labeled giraffes, orangutans, and monkeys. The bar for monkeys crosses 20 units on the x-axis. The bar for orangutans about to reach 15 units on the x-axis and the bar for giraffes reaches 20 units on the x-axis.](Media/L03-06.png)

The height (or length) of each individual bar is a representation of how many units are in that category.

---

## Ordering Bars 

How do you know what order to put the categories in? There are no rules, but several options: 

* **Logical progression:** You can order them according to some sort of logical progression. For example, the student grades graph shown above is ordered logically, from A to D.
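* **Height:** You can order the bars from tallest to shortest (or the reverse), which makes it easy to see how the categories rank.
* **Time:** If you created a bar graph of death tolls from several wars, it would make sense to order the bars chronologically by when each war took place.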
* **Alphabetically:** Lots of software programs will default to ordering the categories alphabetically, which may or may not make any sense. 
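Several of these orderings can be sketched in ```pandas``` before any plotting happens. The grades below are made up for illustration, and the variable names are only suggestions.

```python
import pandas as pd

# Made-up grades for illustration
grades = pd.Series(["B", "C", "B", "A", "D", "B", "C", "A", "C", "B"])
counts = grades.value_counts()

by_height = counts.sort_values(ascending=False)  # tallest bar first
alphabetical = counts.sort_index()               # A, B, C, D
logical = counts.reindex(["D", "C", "B", "A"])   # a custom, meaningful order

print(by_height.index.tolist())
```

Whichever ordering you pick, the bars will be drawn in the order of the index when you plot the result as a bar chart.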

---

## Multiple Categorical Variables

Bar charts are also great at showing differences between groups.  Now, instead of a single categorical variable being tracked, you are tracking two different categorical variables in the bar chart. This is often called a *grouped bar chart*. Check out this bar chart:

![A bar chart depicts the sales of chocolate, vanilla, and strawberry flavors for Harpo, Chico, and Groucho. The sales of chocolate flavor for Chico is high, Groucho is low, and Harpo is average. The sales of Vanilla flavor for Harpo is high, Chico is medium, and Groucho is low. The sales of strawberry flavor for Chico is high, Groucho is medium, and Harpo is low.](Media/L03-07.png)

As you can see, the two categorical variables are ```flavor``` and ```Marx brother```. The vertical axis shows the sales for each of these products, and by person. This type of bar chart enhances the ability to perform eyeball analysis, because you can quickly and easily compare between flavors, brothers, and the interaction between flavor and brother.  
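A grouped bar chart like this one can be sketched in ```pandas``` by putting one categorical variable in the rows and the other in the columns. The sales figures below are invented to roughly echo the picture; they are not the values behind it.

```python
import pandas as pd

# Hypothetical sales figures, invented for illustration only
sales = pd.DataFrame(
    {"Harpo": [20, 35, 10], "Chico": [40, 25, 30], "Groucho": [10, 15, 20]},
    index=["Chocolate", "Vanilla", "Strawberry"],
)

# One cluster of bars per flavor (row), one bar per brother (column)
ax = sales.plot.bar(rot=0)
ax.set_xlabel("Flavor")
ax.set_ylabel("Sales")
ax.set_title("Sales by Flavor and Brother")
```

Swapping rows and columns with ```sales.T``` would cluster the bars by brother instead, which is worth trying when a different comparison matters more.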

---
 


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 5 - Bar Charts in R<a class="anchor" id="DS104L4_page_5"></a>

[Back to Top](#DS104L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">




# Bar Charts in R

There are many ways to create bar charts in R, with a wide variety of packages available for plotting.  However, you will learn two systems: one building on your ```ggplot``` knowledge, and one using the ```lattice``` library.

---

## Bar Charts with ggplot

You can add the ```geom_bar()``` function to your ```ggplot()``` function to make a bar chart, like this: 

```{r}
ggplot(HR_data, aes(salary))+ 
  geom_bar() +
  ggtitle("Frequency of Salary") +
  xlab("Salary Category") +
  ylab("Frequency")
```

And here is the result: 

![A bar chart depicts the frequency of salary. The x-axis represents the salary category as high, low, and medium. The y-axis represents the frequency from 0 to 6000 in four units. The biggest bar is the category low, the second biggest bar is the category medium, and the smallest bar is the category high.](Media/quant7.png)

---

## Bar Charts with Lattice

You can also install and work with a library called ```lattice``` for plotting in R.  It is a quick and easy way to plot. All you really need to do is specify your variable, using the function ```barchart()```:

```{r}
barchart(HR_data$salary)
```

And here is the resulting graph below.  Notice that by default the bar chart is horizontal, not vertical. 

![A bar chart depicts the frequency of salary. The y-axis represents the salary category as high, low, and medium. The x-axis represents the frequency from 0 to 6000 in four units. The biggest bar is the category low, the second biggest bar is the category medium, and the smallest bar is the category high.](Media/quant9.png)

A word of caution - your variable must be a factor in order to use the ```barchart()``` function from ```lattice```.  If you use a variable in a different format, even if it represents categorical data, things will come up pretty screwy.  Take a look at what happens when you chart ```Work_accident```, which is a categorical variable, but currently structured as numeric: 

```{r}
barchart(HR_data$Work_accident)
```

![A graph depicts the HR data work accident. The x-axis has five units and it ranges from 0.0 to 1.0. A rectangular bar plotted till it reaches point 1.0 on the x-axis. The bar is divided at the point X equals 0.0.](Media/quant8.png)

Quite confusing and nonsensical. 

---

### Adding Labels and a Title

Now, if you want to pretty up your ```lattice``` chart, you can add some arguments to it to do so.  ```main=``` will provide a title, ```ylab=``` will allow you to label the y-axis, and ```xlab=``` will allow you to label the x-axis.  You can also add in a ```col=``` argument to change the color from the default glaring blue. Maybe you'd prefer neon green instead?

```{r}
barchart(HR_data$salary, main="Frequency of Salary", ylab = "Salary Category", xlab = "Frequency", col="green")
```

![A bar chart depicts the frequency of salary. The y-axis represents the salary category as high, low, and medium. The x-axis represents the frequency from 0 to 6000 in four units. The biggest bar is the category low, the second biggest bar is the category medium, and the smallest bar is the category high.](Media/quant10.png)

---



<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 6 - Bar Charts in Python<a class="anchor" id="DS104L4_page_6"></a>

[Back to Top](#DS104L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">




# Bar Charts in Python

There are a few different ways to make bar charts in Python as well.  The easiest is probably to use ```pandas```. Select the variable you want, call the function ```.value_counts()``` on it, and then call ```plot()``` on the result, specifying ```kind='bar'``` as an argument. (Recent versions of ```pandas``` expect the keyword form; passing ```'bar'``` positionally raises an error.) Altogether, it should look like this:  

```python
HR['salary'].value_counts().plot(kind='bar')
```

And here is what it will end up looking like:

![A bar chart depicts the frequency of salary. The x-axis represents the salary category as low, medium, and high. The y-axis represents the frequency from 0 to 7000 in seven units. The biggest bar is the category low, the second biggest bar is the category medium, and the smallest bar is the category high.](Media/quant11.png)

---

### Adding Labels and a Title

To add labels and a title, you will first need to give your graph a name, so that you can call the functions on your graph.  You will call your graph ```SalaryFreq```. 

```python
SalaryFreq = HR['salary'].value_counts().plot(kind='bar')
SalaryFreq.set_title("Salary Frequency")
SalaryFreq.set_xlabel("Salary Categories")
SalaryFreq.set_ylabel("Frequency")
```

And then you can use the functions ```set_title()``` to add a title to your graph, ```set_xlabel()``` to add a label for the x-axis, and ```set_ylabel()``` to add a label for the y-axis.

Now the result is a beautifully labeled graph. 

![A bar chart depicts the frequency of salary. The x-axis represents the salary category as low, medium, and high. The y-axis represents the frequency from 0 to 7000 in seven units. The biggest bar is the category low, the second biggest bar is the category medium, and the smallest bar is the category high.](Media/quant12.png)

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 7 - Stacked Bar Charts<a class="anchor" id="DS104L4_page_7"></a>

[Back to Top](#DS104L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Stacked Bar Charts

One other variety of bar chart that is useful to compare two different categorical variables at the same time is called a *stacked bar chart*. A stacked bar chart looks like this:

![A bar chart depicts the number of referrals in March, April, and May for four persons. The y-axis represents the number of referrals and the x-axis represents Bob, Alice, Tony, and Cathy. Each bar has three subdivisions represent March, April, and May. The bar for Bob reaches 45 on the y-axis. The bar is divided in such a way that the month May takes the majority of the portion, then, April, and then March. The bar for Alice reaches 45 on the y-axis. The bar is divided in such a way that the month May and March have almost equal portions. The bar for Tony reaches 35 on the y-axis. The bar is divided in such a way that three divisions are almost equal. The bar for Cathy reaches 40 units on the y-axis. The bar is divided in such a way that referrals in March are a little higher than in April.](Media/quant13.jpg)

Two categorical variables are being measured.  The first is the person, which is labeled along the x-axis, and the second is the month, which is color-coded and has a key in the upper left. The frequencies, in number of referrals, are shown on the y-axis. 

One caution about using a stacked bar graph is this: For all colors except for the bottom color (March, in this example), it can be challenging to determine which bar is higher, because the bottom of each colored section is often at different places vertically.

For example, look at ```April``` for ```Bob``` and ```Tony```. Can you tell for sure which is taller? It's hard to say, because they both start at different points on the graph. Stacked bar charts may rob you of the ability to compare between bars for a certain group. Sure, you could include a table of values, or put the values into the colored portions of the bars themselves. But the whole point of visualization is to enhance interpretation. If you cannot interpret the findings without explicitly including the data, have you really gained anything?

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 8 - Stacked Bar Charts in R and Python<a class="anchor" id="DS104L4_page_8"></a>

[Back to Top](#DS104L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Stacked Bar Charts in R

Stacked bar charts are easy to create in R using ```ggplot```. Simply add a ```mapping=``` argument to your ```geom_bar()``` function that allows you to specify your ```x=``` variable and the variable that you will be stacking, which goes in the argument ```fill=```. Here's what that code will look like:

```{r}
ggplot(data=HR_data) +
  geom_bar(mapping = aes(x = sales, fill=salary)) + 
  ggtitle("Sales Categories by Salary Level") +
  xlab("Sales Category") +
  ylab("Frequency") 
```

Which will then produce this graphic: 

![A bar chart depicts the plot of sales category by salary level. The x-axis represents the salary category. The salary categories are labeled accounting, hr, IT, management, marketing, product mng, Rand D, sales, support, and technical. The y-axis represents the frequency. Each bar has three divisions represent high, medium, and low. The frequency for the category sales is the highest. The frequency for the category management is the lowest.](Media/quant14.png)

---

## Making Bar Heights the Same

One of the difficulties with stacked bar charts is interpretation: it's hard to compare segments when the bars are all varying heights! If you set the bar heights the same, each segment shows a proportion of its bar, and comparison becomes much easier. Simply add another argument to ```geom_bar()``` called ```position=``` and set it to ```"fill"```: 

```{r}
ggplot(data=HR_data) +
  geom_bar(mapping = aes(x = sales, fill=salary), position = "fill") + 
  ggtitle("Sales Categories by Salary Level") +
  xlab("Sales Category") +
  ylab("Proportion")  
```

Now you can see that it is much easier to compare the amount in each colored bar! 

![A bar chart depicts the plot of sales category by salary level. The x-axis represents the salary category. The salary categories are labeled accounting, hr, IT, management, marketing, product mng, Rand D, sales, support, and technical. The y-axis represents the frequency. Each bar has three divisions represent high, medium, and low and all the bars reach 1.00 on the y-axis.](Media/quant15.png)

---

## Multiple Categorical Variables

When you do a stacked bar chart, you are including two categorical variables, one of which is stacked. You can also compare them side by side if you'd rather.  That's also an easy fix in ```ggplot```: just change the ```position=``` argument to ```"dodge"```, which will place your bars side by side.  If you have the room, this is one of the best ways to compare the differences between categories. Here is the code: 

```{r}
ggplot(data=HR_data) +
  geom_bar(mapping = aes(x = sales, fill=salary), position = "dodge") + 
  ggtitle("Sales Categories by Salary Level") +
  xlab("Sales Category") +
  ylab("Frequency")  
```

And it will produce a plot that looks like this: 

![A bar chart depicts the plot of sales category by salary level. The x-axis represents the salary category. The salary categories are labeled accounting, hr, IT, management, marketing, product mng, Rand D, sales, support, and technical. The y-axis represents the frequency. Each category has three bars represent high, medium, and low and all the bars reach 1.00 on the y-axis.](Media/quant16.png)

# Stacked Bar Charts in Python

Stacked bar charts are easy to create with ```pandas```! Use ```pd.crosstab``` to build a cross table (a data frame of counts) of two categorical variables, then call ```plot.bar(stacked=True)``` on that data frame like so:

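```python
import pandas as pd

# Count each sales/salary combination, then stack the counts per sales category
crosstab_df = pd.crosstab(HR_data['sales'], HR_data['salary'])
crosstab_df.plot.bar(stacked=True)
```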

Which will produce an image like the one below.

![Stacked Bar Chart in Python](Media/StackedBCiPy.png "Stacked Bar Chart in Python")

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>If you would like to make stacked bar charts with matplotlib and seaborn, please visit <a href="https://randyzwitch.com/creating-stacked-bar-chart-seaborn/">this site.</a></p>
    </div>
</div>
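The equal-height and side-by-side variations shown for R have rough ```pandas``` counterparts. The sketch below uses a tiny made-up sample in place of the HR data, and the variable names are only suggestions.

```python
import pandas as pd

# Small made-up sample standing in for the HR data
HR = pd.DataFrame({
    "sales":  ["IT", "IT", "IT", "hr", "hr", "sales", "sales", "sales", "sales"],
    "salary": ["low", "medium", "low", "high", "low", "low", "medium", "medium", "high"],
})

# normalize="index" turns each row of counts into proportions that sum to 1
proportions = pd.crosstab(HR["sales"], HR["salary"], normalize="index")
ax_fill = proportions.plot.bar(stacked=True)  # equal-height stacked bars
ax_fill.set_ylabel("Proportion")

# Leaving out stacked=True places the bars side by side instead
counts = pd.crosstab(HR["sales"], HR["salary"])
ax_dodge = counts.plot.bar()
ax_dodge.set_ylabel("Frequency")
```

The first chart plays the role of ```position = "fill"``` and the second the role of ```position = "dodge"```.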

---


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 9 - The Pareto Principle<a class="anchor" id="DS104L4_page_9"></a>

[Back to Top](#DS104L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# The Pareto Principle

Vilfredo Pareto was an Italian engineer and philosopher who lived in the late 19th and early 20th century. He is credited with creating the *80/20* rule, which also came to be known as the *Pareto principle*.

The Pareto principle was built on the observation that about 80% of the land in Italy at the time was owned by about 20% of the landowners. It turned out that this sort of breakdown applies to all sorts of natural phenomena. Pareto also noticed that in his garden, 20% of the peapods contained 80% of the peas.

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Fun Fact!</h3>
    </div>
    <div class="panel-body">
        <p>The Pareto principle is sometimes called 'the law of the vital few,' or the 'principle of factor sparsity.'</p>
    </div>
</div>

---

## Applications of the Pareto Principle

The Pareto principle has applications in the following areas:

* **Business management:** 80% of sales come from 20% of clients
* **Economics:** 80% of the wealth is controlled by 20% of the population
* **Software:** Microsoft noted that fixing the 20% most common bugs would fix 80% of the errors and crashes
* **Sports training:** roughly 20% of the exercises and habits have 80% of the impact on a trainee
* **Occupational Health:** 20% of the hazards account for 80% of the injuries
* **Health care:** 20% of the patients use 80% of the resources
* **Crime:** 80% of all crimes are committed by 20% of criminals

---

# The Pareto Chart

As with a bar chart, the vertical axis of a Pareto chart usually shows the frequency of occurrence, but it can also show another quantitative measure, such as cost. What sets a Pareto chart apart is that the bars are arranged in descending order, from the largest category to the smallest.

Pareto charts are one of the 'seven basic tools of quality.' They are especially useful in problem solving. Here is an example:

![A Pareto chart of late arrivals by reported cause. The x-axis represents traffic, child care, public transportation, weather, overslept, and emergency. The left side of the chart ranges from 0 to 60 with 10 unit intervals and the right side of the chart ranges from 0-percent to 100-percent. The bars are arranged in decreasing order and a curve drawn in an increasing pattern.](Media/quant17.png)

This Pareto chart shows the causes of late arrival, listed along the horizontal axis. The left vertical axis shows the count of late arrivals for each cause. One other feature of Pareto charts is that they have a cumulative percentage axis (usually on the right) and a line that shows the cumulative percentage of the total, accumulating from the left.

So, if you wanted to decrease the number of late arrivals to your company, then you could take a peek at the heavy hitting categories of ```Traffic```, ```Child Care```, and ```Public Transportation```, and brainstorm how you could impact those factors.  For instance, perhaps if you staggered employee start times a little later in the day, rush hour traffic could be avoided, and most likely employees would be happier as well! 
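To see how the pieces of a Pareto chart fit together, here is a minimal sketch in Python. The counts are hypothetical (they are not read from the figure above), and using ```twinx()``` for the percentage axis is just one way to build the chart.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical counts of late arrivals by cause (not read from the figure)
late = pd.Series({"Traffic": 50, "Child Care": 30, "Public Transportation": 10,
                  "Weather": 5, "Overslept": 3, "Emergency": 2})

late = late.sort_values(ascending=False)           # tallest bar first
cumulative_pct = late.cumsum() / late.sum() * 100  # running share of the total

fig, ax = plt.subplots()
ax.bar(late.index, late.values)
ax.set_ylabel("Late arrivals")
ax.tick_params(axis="x", rotation=45)

ax_pct = ax.twinx()                                # second y-axis on the right
ax_pct.plot(late.index, cumulative_pct.values, color="red", marker="o")
ax_pct.set_ylim(0, 100)
ax_pct.set_ylabel("Cumulative %")
```

With these made-up numbers, the first three causes account for 90% of the late arrivals, which is the kind of concentration the cumulative line makes easy to spot.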

---


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 10 - Scatter Plots<a class="anchor" id="DS104L4_page_10"></a>

[Back to Top](#DS104L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

In [4]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Scatterplots
VimeoVideo('241240275', width=720, height=480)

# Scatter Plots

Have you ever been curious about how two quantitative variables interact? A graphical way to represent two quantitative variables and their interaction is a *scatterplot*. A scatter plot (also called a scatter graph, scatter chart, scattergram, or scatter diagram) is a type of plot used to display values for two continuous variables for a set of data. The data are displayed as a collection of points, with the value of one variable determining the position on the horizontal axis, and the value of the other variable determining the position on the vertical axis.

![A graph depicts the plot of height against weight. The unit of height is represented in meters and the unit of weight is represented in kilograms. Fifteen points are plotted on the graph in an increasing pattern.](Media/quant18.png)

This graph is a basic scatterplot, where two quantitative variables are being plotted against each other. In this case, the horizontal variable, or x, is ```Height```, and the vertical variable, or y, is ```Weight```. Each dot on the scatterplot represents a single person. Drawing an imaginary line straight down and another straight to the left from a data point will tell you the height and weight of that person.

---

## Using Color in Scatter Plots

There are lots of great options with scatterplots. Using colors or symbols or both, you can create scatterplots that have multiple groups on them, which can be really useful in telling a story. Here is an example:

![A graph depicts the plot of sepal length against the sepal width. The x-axis represents the sepal length and the y-axis represents the sepal width. The x-axis ranges from 0 to 7.8 and the y-axis ranges from 2 to 4.4. The points are plotted for three categories labeled Iris-setosa, Iris-versicolor, and Iris-virginica.](Media/L05-21.png)

This is a scatterplot where two quantitative variables _and_ a categorical variable are demonstrated on the same graph. 
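One way to sketch a grouped scatterplot like this in Python is to draw each category with its own call to ```scatter()```. The handful of rows below is a tiny illustrative sample, not the full dataset behind the figure, and the column names are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Tiny illustrative sample in the spirit of the iris figure above
flowers = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 5.0, 6.4, 6.9, 5.5, 6.3, 7.1, 6.5],
    "sepal_width":  [3.5, 3.0, 3.6, 3.2, 3.1, 2.3, 3.3, 3.0, 3.0],
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
})

fig, ax = plt.subplots()
# One scatter call per group, so each group gets its own color and legend entry
for name, group in flowers.groupby("species"):
    ax.scatter(group["sepal_length"], group["sepal_width"], label=name)
ax.set_xlabel("Sepal length")
ax.set_ylabel("Sepal width")
ax.legend(title="Species")
```

Matplotlib cycles through its default colors automatically, so each species ends up with a distinct color without any extra work.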

---

## Scatterplot Matrices

One variation not covered so far is the scatterplot matrix. There may be a time when you have more than two quantitative variables in your dataset, and you want to plot all possible combinations of two variables. Scatterplot matrices will give you graphs that look like this: 

![Ten graphs are placed in four rows. The first row has four graphs and they are labeled Sepal.length, R equals minus 0.12, R equals 0.87, and R equals 0.82. The second row has three graphs and they are labeled Sepal.width, R equals minus 0.43, R equals minus 0.37. The third row has two graphs and they are labeled Petal.Length, R equals 0.96. The last row has one graph labeled Petal.width.](Media/L05-22.png)

Note that there are four different quantitative variables in the data set, so there are six possible combinations of those four variables, hence six scatterplots. 
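The count of six can be checked, and a basic scatterplot matrix drawn, with a short sketch like the one below. The data are random made-up numbers with borrowed column names, and ```scatter_matrix``` from ```pandas.plotting``` is one of several ways to draw such a grid.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(2)
# Four made-up quantitative variables
data = pd.DataFrame(
    rng.normal(size=(50, 4)),
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)

# Every distinct pair of variables, ignoring order
pairs = list(combinations(data.columns, 2))
print(len(pairs))  # 6

# Full grid of scatterplots, with histograms on the diagonal
axes = scatter_matrix(data, figsize=(8, 8), diagonal="hist")
```

Unlike the figure above, this grid shows each pair twice (once above and once below the diagonal), so the 4 by 4 layout still contains only six distinct scatterplots.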


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 11 - Scatter Plots in Python<a class="anchor" id="DS104L4_page_11"></a>

[Back to Top](#DS104L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Scatter Plots in Python

Since you have previously learned how to create scatter plots in R, you'll now learn how to do scatter plots in Python! You will use the [USDA Nutrition Dataset](https://repo.exeterlms.com/documents/V2/DataScience/Data-Wrang-Visual/USDA_nutrition.zip), which lists the nutritional content for a lot of different foods.


The simplest way to make scatterplots in Python is to use ```pandas```.  A simple scatter plot can be created by specifying the name of the data frame, calling the ```.plot.scatter()``` function on it, and passing an ```x=``` and a ```y=``` variable as arguments: 

```python
Plot = nutrition.plot.scatter(x='fats', y='sat fats')
Plot.set_title("Fats by Saturated Fats")
Plot.set_xlabel("Fats")
Plot.set_ylabel("Saturated Fats")
```

And here is the resulting plot: 

![A graph depicts the plot of fats by saturated fats. The x-axis represents fats and the y-axis represents saturated fats. The x-axis ranges from 0 to 100 with five equal unit intervals and the y-axis ranges from 0 to 80 with four-unit intervals.](Media/quant19.png)

---

## Scatter Plots with a Third Quantitative Variable

You can add a third quantitative variable to your scatter plot if you are feeling wild and crazy! It can be added as either a size or a color component.  Here's the simplest color component, which can be created by adding the argument ```c=```: 

```python
Plot = nutrition.plot.scatter(x='fats', y='sat fats', c='protein')
Plot.set_title("Fats by Saturated Fats and Protein")
Plot.set_xlabel("Fats")
Plot.set_ylabel("Saturated Fats")
```

You'll get a graph that is in greyscale, and the depth of the grey indicates the amount of protein in the data points: 

![A graph depicts the plot of fats by saturated fats and protein. The y-axis represents saturated fats and it ranges from 0 to 80 with four-unit intervals. The points plotted are in different shades. A color scale with different shades is placed on the right side of the figure. The color bar is labeled protein. The scale ranges from 0 to 35.](Media/quant20.png)

---

### Adding Color

Want to add some pizzazz to that grey on black color scheme? Then you are in luck, because you can specify a color palette if you'd like, with the argument ```cmap=```. 

```python
Plot = nutrition.plot.scatter(x='fats', y='sat fats', c='protein', cmap='coolwarm')
Plot.set_title("Fats by Saturated Fats and Protein")
Plot.set_xlabel("Fats")
Plot.set_ylabel("Saturated Fats")
```

Things may be a little clearer now on a blue to red color theme: 

![A graph depicts the plot of fats by saturated fats and protein. The y-axis represents saturated fats and it ranges from 0 to 80 with four-unit intervals. The points plotted are in different shades. A color scale with different shades is placed on the right side of the figure. The color bar is labeled protein. The scale ranges from 0 to 35.](Media/quant21.png)

---

### By Size

You can also display that third variable by size instead if you'd like, with the argument ```s=```. Here's what that would look like:

```python
Plot = nutrition.plot.scatter(x='fats', y='sat fats', s=nutrition['protein'])
Plot.set_title("Fats by Saturated Fats and Protein")
Plot.set_xlabel("Fats")
Plot.set_ylabel("Saturated Fats")
```

And the graph it creates:

![A graph depicts the plot of fats by saturated fats. The x-axis represents fats and the y-axis represents saturated fats. The x-axis ranges from 0 to 100 with five equal unit intervals and the y-axis ranges from 0 to 80 with four-unit intervals.](Media/quant22.png)

If the dots are too small to read, you can always change the size by adding a multiplier. Just add ```*``` to the end and the number by which you want to multiply: 

```python
Plot = nutrition.plot.scatter(x='fats', y='sat fats', s=nutrition['protein']*2)
Plot.set_title("Fats by Saturated Fats and Protein")
Plot.set_xlabel("Fats")
Plot.set_ylabel("Saturated Fats")
```

And the result is a little more readable: 

![A graph depicts the plot of fats by saturated fats. The x-axis represents fats and the y-axis represents saturated fats. The x-axis ranges from 0 to 100 with five equal unit intervals and the y-axis ranges from 0 to 80 with four-unit intervals.](Media/quant23.png)
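
One detail worth knowing when picking a multiplier: in matplotlib, which pandas uses for plotting, ```s``` sets the marker *area* in points squared rather than the width. The sketch below uses made-up protein values (not taken from the USDA data) to show that doubling ```s``` widens each dot by only about 1.41 times.

```python
import math

# Hypothetical protein values, for illustration only.
protein = [5.0, 12.0, 20.0, 35.0]
multiplier = 2

# Marker areas (in points squared) after applying the multiplier.
scaled_sizes = [p * multiplier for p in protein]

# Area grows with the square of the width, so the width grows by
# the square root of the multiplier.
width_ratio = math.sqrt(multiplier)

print(scaled_sizes)           # [10.0, 24.0, 40.0, 70.0]
print(round(width_ratio, 2))  # 1.41
```

If the dots still look small after a modest multiplier, this is why: a multiplier of 4 is needed to double their width.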

---


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 12 - Line Charts <a class="anchor" id="DS104L4_page_12"></a>

[Back to Top](#DS104L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

In [5]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Line Charts
VimeoVideo('241241450', width=720, height=480)

# Line Charts

A *line chart* or *line graph* is a type of chart that displays information as a series of data points called *markers* connected by straight line segments. A line graph differs from a scatterplot in that a line graph will have a single data point for each horizontal axis value, and a scatterplot can have multiple data points for each horizontal axis value.

The horizontal axis of the line graph must be some sort of ordered quantitative data, and it is usually time. However, it can be other types of quantitative data, too. The consecutive points are usually connected by a line in dot-to-dot fashion. This is done because it draws the eye to the trend. If you think about it, connecting the points on a scatterplot would turn most scatterplots into a rat's nest of a mess, because there is no sort of progression on the x-axis.
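
The "one data point per horizontal value" rule is easy to check before choosing a chart type. Here is a minimal sketch with a hypothetical helper, `has_one_point_per_x()`, and made-up data; it is an illustration, not part of any plotting library.

```python
from collections import Counter

def has_one_point_per_x(points):
    """Return True when every x value appears exactly once."""
    x_counts = Counter(x for x, _ in points)
    return all(count == 1 for count in x_counts.values())

# Hypothetical (x, y) pairs, for illustration only.
yearly_scores = [(1, 339), (2, 350), (3, 361)]       # one score per grade
heights_weights = [(64, 120), (64, 135), (70, 160)]  # two people share a height

print(has_one_point_per_x(yearly_scores))    # True
print(has_one_point_per_x(heights_weights))  # False
```

Data that pass this check are candidates for a line graph; data that fail it are usually better shown as a scatterplot.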

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Fun Fact!</h3>
    </div>
    <div class="panel-body">
        <p>If a line graph uses a time stamp as the horizontal axis variable, it is sometimes called a run chart.</p>
    </div>
</div>

---

## Single Variable Line Charts

You will start with the most basic line chart. A line chart is used to track a quantitative variable that is measured on a frequent (and preferably consistent) basis through time or space. That description probably seems a bit goofy, so here are some examples.

---

### Time Example

Suppose you work in a school district, and are tracking student scores on a standardized exam. You have a student's scores for each year, starting from 1<sup>st</sup> grade through 9<sup>th</sup> grade. Maybe your data table looks like this:

![A table has ten columns and two rows. The column headings range from A to J and the row headings are 1 and 2. The row entries are as follows. Row 1, Grade, 1, 2, 3, 4, 5, 6, 7, 8, 9. Row 2, Jerry, 339, 350, 361, 366,381,390, 398, 421, 446.](Media/L06-01.png)

You could plot this on a scatterplot like you did in the previous lesson, and it would look like this:

![A graph depicts the plot of Jerry's score by grade. The x-axis represents the grade and it ranges from 0 to 10 in five units with 2 units of interval. The y-axis represents the score and it ranges from 0 to 500 with 50 intervals of units. Nine points are plotted on the graph. The plots represent the scores 339, 350, 361, 366,381,390, 398, 421, and 446.](Media/L06-02.png)

Note that there is exactly one data point for each value on the horizontal axis, and there is a 'continuation' of sorts, in that as you move from left to right, time is marching forward at a constant pace. This is the ideal setting for a trend chart, and it makes sense to connect the points with a dot-to-dot line:

![A graph depicts the plot of Jerry's score by grade. The x-axis represents the grade and it ranges from 0 to 10 in five units with 2 units of interval. The y-axis represents the score and it ranges from 0 to 500 with 50 intervals of units. Nine points are plotted on the graph and they are connected. The plots represent the scores 339, 350, 361, 366,381,390, 398, 421, and 446.](Media/L06-03.png)

Now, to be clear, the line was not really necessary here. It was easy enough to see what is happening in the trend without the connected points; this example is for illustrative purposes only. But if you include the connected trend, it does a couple of things:

* It draws your eye naturally to the direction and movement of the trend. 
* It allows you to create a trend graph without showing the individual data points at all. This can be especially helpful if there is a ton of data on the trend graph. It is easy to get the picture all gummed up with individual data points, to the extent where you lose sight of the bigger picture. 

---

### Distance Example

Okay, you will now look at an example where the horizontal axis is not a time stamp. Suppose you are an agricultural scientist, and you are working on a new way to compost organic waste.

You have created a large compost pile, and introduced an enzyme that should speed up the process of breaking down the organic waste to mulch. Temperature is a big factor in decomposition, and you have devised a probe that has several high-tech thermometers at evenly spaced intervals along its length. You can stick the probe into the waste pile, press a button, and have it simultaneously read the temperature at several different locations.

![A mountain range covered with mist.](Media/L06-04.png)

One day, you stick the probe into the pile, and click the button. It measures the temperature every eight inches. Your data may look something like this:

![A graph depicts the plot of compost pile temperature. The x-axis represents the distance from the surface in inches and the y-axis represents temperature. The x-axis ranges from 0 to 120 with 8 units of interval and the y-axis ranges from 60 to 120. The plotted curve starts at the point just below 80 on the y-axis and ends at point 110 on the y-axis.](Media/L06-05.png)

Here again, there is exactly one data point for each unique value on the horizontal axis. To reiterate the contrast, a scatterplot potentially has multiple data points for each unique value on the horizontal axis.

---

### No Individual Data Points 

Here is an example of a trend chart with so much data, that there is no reason to include the individual data points: the trend is enough.

![A graph depicts the plot of IBM stock price. The x-axis represents the date from 8-12-2013 to 6-1-2017 and the y-axis represents the price and it ranges from 100 to 200.](Media/L06-07.png)

This trend chart shows almost four years of daily high stock prices for IBM. As you can see, the individual data points aren't necessary. In fact, the data are so packed together that the dot-to-dot connections all appear to be straight vertical line segments. But the large amount of data gives a pretty good indication of the general trend.

---

## Notes about Trend Charts

Here are a couple of things to think about with trend charts:

* It is preferred, but not required, that the horizontal axis variable has consistent spacing. If the spacing is not consistent, it is a good idea to include the actual data points in addition to the trend line, otherwise you might be guilty of misleading readers.

* If 'time' is your horizontal axis variable, then formatting is important. Most software recognizes time stamps and adjusts the spacing automatically. But if your time stamp is something like year and workweek without any spaces, slash, or hyphen, then you would have the end of 2016 look like this: 201649, 201650, 201651, 201652. Then, the first weeks of 2017 would be 201701, 201702, 201703, etc. Your graph would then look something like this:

    ![A graph depicts the plot of measure by workweek. The x-axis represents workweek and it ranges from 201638 to 201708 and the y-axis represents measure and it ranges from 54 to 66. The plotted points are connected with lines.](Media/L06-06.png)

    Which is definitely what you don't want, because it is misleading.

* Trend charts with a time stamp in the horizontal axis are much more common than trend charts with some sort of a spacing variable in the horizontal axis.
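
To see why an unformatted year-and-workweek label misleads, compare the labels as plain numbers and as dates. This is a minimal sketch using Python's standard library; the helper name `workweek_to_date()` is hypothetical, and it assumes weeks start on Monday and are numbered the way Python's ```%W``` directive numbers them, which may not match your organization's workweek calendar.

```python
from datetime import datetime

def workweek_to_date(label):
    """Convert a 'YYYYWW' label to the Monday of that week (assumed convention)."""
    # Append "1" (Monday) so strptime has a weekday to anchor on.
    return datetime.strptime(label + "1", "%Y%W%w").date()

labels = ["201651", "201652", "201701", "201702"]

# Treated as plain numbers, the year boundary looks like a jump of 49.
numeric_gaps = [int(b) - int(a) for a, b in zip(labels, labels[1:])]

# Treated as dates, every step is the same seven days.
dates = [workweek_to_date(label) for label in labels]
day_gaps = [(b - a).days for a, b in zip(dates, dates[1:])]

print(numeric_gaps)  # [1, 49, 1]
print(day_gaps)      # [7, 7, 7]
```

Plotting against the converted dates instead of the raw labels keeps the spacing on the horizontal axis honest.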

---


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 13 - Line Charts with Multiple Dependent Variables<a class="anchor" id="DS104L4_page_13"></a>

[Back to Top](#DS104L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Line Charts with Multiple Dependent Variables

Sometimes it is helpful to include several different trends on the same chart for comparison. Here is an example:

![A graph depicts the plot of the product line global revenue. Three curves are plotted for quantum computer, atom synthesizer, and proton cannon. The x-axis represents the month from Jan to Dec. The y-axis represents revenue in US millions and it ranges from 0 to 100.](Media/L06-09.png)

In this chart, monthly revenue for multiple products is plotted on the same set of axes. Comparison is easy.

This next chart is interesting. The creator is obviously trying to point out the huge difference in monthly enrollments year over year. The 2015 enrollments really took off in April. 

![A graph depicts the plot for enrolments. Two curves are plotted for the years 2014 and 2015. The x-axis ranges from January to June and the y-axis ranges from 0 to 100 in 10 units.](Media/L06-10.png)

---

## Example of Disparate Axes

One thing to look for when putting multiple line trends on the same axis is that the values be approximately the same. Most software tools will automatically fit the range of the vertical axis to include all the data. If you have a trend whose values are much larger than others, it completely washes out what might otherwise be important information.

Take a look at this chart, for example:

![A graph depicts the sales by state. Two lines are graphed representing Nevada sales and California sales. The x-axis ranges from 1/1/2013 to 7/1/2015 and the y-axis ranges from 0 to 30000.](Media/L06-11.png)

This chart shows sales by state for a 32-month period. With the much larger ```California``` sales stretching the vertical axis, you can't even really see the ```Arizona``` sales, because the ```Nevada``` sales are lying directly on top of them. This chart shows absolutely nothing noteworthy.

However, if you plot the same data, but just include the Nevada and Arizona sales this time, here is what happens:

![A graph depicts the plot of sales by state. Two curves are graphed representing Nevada sales and Arizona sales. The x-axis ranges from 1/1/2013 to 7/1/2015 and the y-axis ranges from 0 to 30000.](Media/L06-12.png)

This chart is nearly identical to the previous one, but the ```California``` trend has been removed, and the vertical axis was allowed to be automatically set by the software. Now, something really important emerges. Clearly, something happened around October of 2014 to the ```Arizona``` sales. The ```Nevada``` sales are flat, but the ```Arizona``` sales increased by about 100 units per month. This is going to be really important to someone in the company, but they would have never seen it if the trend charts weren't done with a bit of thought put into the process.
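
The washing-out effect can be put into numbers. The sketch below uses made-up sales figures and two hypothetical helpers to estimate how much of the automatically fitted vertical axis one trend actually occupies, with and without the large series.

```python
# Hypothetical monthly sales figures, for illustration only.
sales = {
    "California": [24000, 25500, 27000, 28500],
    "Nevada": [900, 910, 905, 915],
    "Arizona": [880, 890, 990, 1090],
}

def axis_range(series_names):
    """Return the (min, max) that an auto-fitted axis would need to cover."""
    values = [v for name in series_names for v in sales[name]]
    return min(values), max(values)

def share_of_axis(name, series_names):
    """Fraction of the vertical axis span that one series actually uses."""
    low, high = axis_range(series_names)
    series = sales[name]
    return (max(series) - min(series)) / (high - low)

all_states = ["California", "Nevada", "Arizona"]
small_states = ["Nevada", "Arizona"]

print(round(share_of_axis("Arizona", all_states), 3))    # 0.008
print(round(share_of_axis("Arizona", small_states), 3))  # 1.0
```

In this made-up case the Arizona increase uses less than 1% of the axis when California is included, which is why it disappears from view.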

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 14 - Line Charts in R<a class="anchor" id="DS104L4_page_14"></a>

[Back to Top](#DS104L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Line Charts in R

Line charts in R can be created using ```ggplot```'s ```geom_line()``` function.  To create your own line graph, you'll need some information with a date in it, so you will investigate **[Earthquakes in India](https://repo.exeterlms.com/documents/V2/DataScience/Data-Wrang-Visual/earthquakes.zip)**.

Things can get a little tricky when you try it with a date (which is, of course, when you'd want a line graph the most)! Here's what happens when you try and create a line graph without specifying that your ```Date``` column needs to be a ```date``` type in R: 

![A graph depicts the plot of the magnitude of earthquakes over time. The x-axis represents the date of an earthquake and the y-axis represents earthquake magnitude. A vertical line is plotted at the center of the graph.](Media/quant27.png)

---

## Data Wrangling

That isn't looking right at all, is it? Well, the solution is to format your ```Date``` column as a ```date```, and you can do this right within your ```ggplot()``` function.  You'll also need to make sure that R has categorized ```M``` as numeric, and change it to numeric if it hasn't.  

You can determine the data type by calling the ```str()``` function, which stands for structure.  It gives you some structural details about your data: 

```{r}
str(earthquakes)
```

Here is the result: 

```text
'data.frame':	25 obs. of  8 variables:
 $ Date    : Factor w/ 24 levels "10-Aug-09","12-May-15",..: 20 17 18 2 13 12 12 9 11 23 ...
 $ Time    : Factor w/ 23 levels "00:35 IST","01:21 IST",..: 23 9 6 14 15 13 12 21 7 16 ...
 $ Location: Factor w/ 24 levels "Andaman and Nicobar Islands",..: 15 20 6 19 17 16 18 1 1 14 ...
 $ Lat     : Factor w/ 24 levels "12.50Â°N","14.1Â°N",..: 7 22 8 11 11 13 12 23 24 15 ...
 $ Long    : Factor w/ 24 levels "66.383Â°E","69.8Â°E",..: 20 4 17 14 14 13 12 22 21 7 ...
 $ Deaths  : Factor w/ 22 levels "~1,000",">2,000",..: 7 14 3 12 22 22 20 3 3 4 ...
 $ Comments: Factor w/ 19 levels "","3 injured in Assam earthquake, tremors felt in West Bengal, Meghalaya and Bhutan",..: 15 1 2 9 5 6 10 13 7 14 ...
 $ M       : Factor w/ 19 levels "5.2","5.6","6",..: 8 15 2 12 8 7 16 8 5 1 ...
```

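As you can see, this shows that ```M``` is being treated as a factor, not numeric.  Be careful with the fix: calling ```as.numeric()``` directly on a factor returns the internal level codes (1, 2, 3, and so on), not the magnitudes themselves.  Convert the factor to character first, and then to numeric: 

```{r}
# as.character() recovers the original labels (such as "5.6"); as.numeric() then parses them
earthquakes$M <- as.numeric(as.character(earthquakes$M))
```

And you can prove to yourself it worked by running the ```str()``` function again.  ```M``` should now be listed as ```num```, with magnitudes such as 5.6 rather than level codes.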

---

## Date Formats

Here are the most commonly used date/time format symbols that R accepts: 

<table class="table table-striped">
    <tr>
        <th>Format</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>%a</td>
        <td>Abbreviated weekday.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>%b</td>
        <td>Abbreviated month.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>%c</td>
        <td>Locale-specific date and time.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>%H</td>
        <td>Decimal hours (24 hour).</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>%j</td>
        <td>Decimal day of the year.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>%M</td>
        <td>Decimal minute.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>%S</td>
        <td>Decimal second.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>%w</td>
        <td>Decimal weekday, with 0 = Sunday.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>%x</td>
        <td>Locale-specific date.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>%y</td>
        <td>2-digit year.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>%z</td>
        <td>Offset from GMT.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>%A</td>
        <td>Full weekday.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>%B</td>
        <td>Full Month.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>%d</td>
        <td>Decimal day of the month.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>%I</td>
        <td>Decimal hours (12 hour).</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>%m</td>
        <td>Decimal month.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>%p</td>
        <td>Locale-specific AM/PM.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>%U</td>
        <td>Decimal week of the year (starting on Sunday).</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>%W</td>
        <td>Decimal week of the year (starting on Monday).</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>%X</td>
        <td>Locale-specific time.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>%Y</td>
        <td>4-digit year.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>%Z</td>
        <td>Time zone (character).</td>
    </tr>
</table>

If you look at the ```Date``` column in the ```earthquakes``` dataset, it is structured like this: 

```text
3-Jan-16
```

Lining the above date up with the appropriate symbols, the format will look like this: 

```text
%d-%b-%y
```

This is because ```%d``` is the symbol for the decimal (or numerical) day of the month, ```%b``` is the symbol for the month abbreviation, and ```%y``` is the symbol for the two-digit year.  (```%j``` would be the wrong choice here, because it stands for the day of the year, from 1 to 366.)  Each of these is separated by a hyphen, because that is what is used as a separator in the data itself.  If you had ```/``` as separators instead, which is also common for dates, then you would place ```/``` in, and have a format that looks like this: 

```text
%d/%b/%y
```

---

## Plotting

Now with your variable type correct, it is time to begin plotting.  Instead of just putting ```Date``` into the ```aes()``` function by itself, the trick is to call ```as.Date()``` in ```aes()``` on the ```Date``` variable and specify the format the date should take.  Here's what the code as a whole looks like: 

```{r}
ggplot(earthquakes, aes(as.Date(Date, "%d-%b-%y"), M)) +
  geom_line() + 
  xlab("Date of Earthquake") + 
  ylab("Earthquake Magnitude") + 
  ggtitle("Magnitude of Earthquakes over Time")
```

Going over the above line by line, you will call the function ```ggplot()```, then specify your dataset, ```earthquakes```, and then use the function ```aes()``` to include the variables you will be plotting.  Before you specify ```Date```, however, you will need to use ```as.Date()``` on the ```Date``` column.  Then specify the format of the date, which will be ```"%d-%b-%y"```, and specify your y variable, ```M```.  The rest should be much more familiar - you have a call to ```geom_line()```, then are specifying labels and titles.

Here is the resulting plot: 

![A graph depicts the plot of the magnitude of earthquakes over time. The x-axis represents the date of an earthquake and the y-axis represents earthquake magnitude. The x-axis ranges from 1980 to 2060 and the y-axis ranges from 5 to 15. A vertical line is plotted at the center of the graph.](Media/quant26.png)

It looks much more like a line graph should! However, do you notice anything funny along the x-axis? Maybe that the years extend well beyond the current year, into 2020 and beyond? No, you are not predicting any earthquakes into the future - this is a completely historical dataset! This highlights one of the important pitfalls of using a two-digit year with your data - it can be misinterpreted.  R's ```%y``` reads the values 00 to 68 as 2000 to 2068 and the values 69 to 99 as 1969 to 1999, so an earthquake recorded with a year of 50, for example, lands in 2050 instead of 1950.  If you have a choice, always enter and use dates with a four-digit year for clarity!
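
The same two-digit-year pitfall exists outside of R. As a minimal sketch, here is how Python's ```datetime.strptime()``` handles it. The date strings are made up for illustration, and the example assumes English month abbreviations, since ```%b``` depends on the locale.

```python
from datetime import datetime

# Hypothetical date strings, for illustration only.
two_digit = ["3-Jan-16", "15-Jan-60"]
four_digit = ["3-Jan-2016", "15-Jan-1960"]

# %y reads a two-digit year: 00-68 become 2000-2068, 69-99 become 1969-1999.
parsed_short = [datetime.strptime(d, "%d-%b-%y").year for d in two_digit]

# %Y reads a four-digit year, so there is nothing to guess.
parsed_long = [datetime.strptime(d, "%d-%b-%Y").year for d in four_digit]

print(parsed_short)  # [2016, 2060]
print(parsed_long)   # [2016, 1960]
```

A year written as 60 silently becomes 2060, while the four-digit version is read exactly as written.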

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 15 - Line Charts in Python<a class="anchor" id="DS104L4_page_15"></a>

[Back to Top](#DS104L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Line Charts in Python

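You can use the ```matplotlib``` package in Python to make a pretty good basic line chart, using the ```plt.plot()``` function. Once ```matplotlib.pyplot``` has been imported as ```plt```, this one line of code will give you the basic plot: 

```python
import matplotlib.pyplot as plt  # provides plt.plot(), plt.xlabel(), plt.ylabel(), plt.title()
plt.plot(Earthquakes['Date'], Earthquakes['M'])
```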

And here it is:

![A graph depicts the plot of the magnitude of earthquakes over time. The x-axis represents the date of an earthquake and the y-axis represents earthquake magnitude. The y-axis ranges from 6.7 to 8.1. A pattern is plotted to represent earthquake magnitude over time.](Media/quant24.png)

Here, you are graphing the ```Date``` column on the x-axis and the magnitude of the earthquake, ```M```, on the y-axis.  If you want to add labels, that is easily done! ```plt.xlabel()``` will label your x-axis, ```plt.ylabel()``` will label your y-axis, and ```plt.title()``` will plunk a title on your graph. 

```python
plt.plot(Earthquakes['Date'], Earthquakes['M'])
plt.xlabel('Date')
plt.ylabel('Magnitude')
plt.title("Earthquake Magnitude over Time")
```

The result is a labeled graph:

![A graph depicts the plot of the magnitude of earthquakes over time. The x-axis represents the date of an earthquake and the y-axis represents earthquake magnitude. The y-axis ranges from 6.7 to 8.1. A pattern is plotted to represent earthquake magnitude over time.](Media/quant25.png)
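
A line chart connects points in the order they appear, so it usually helps to parse date strings into real dates and sort by them before plotting. The following is a minimal sketch using only the standard library, with made-up rows and an assumed day-month-year text format; the actual layout of ```Earthquakes['Date']``` may differ.

```python
from datetime import datetime

# Hypothetical (date string, magnitude) rows, for illustration only.
rows = [("12-May-2015", 7.3), ("3-Jan-2016", 6.7), ("25-Apr-2015", 7.8)]

# Parse the text into date objects so the rows sort chronologically,
# not alphabetically.
parsed = [(datetime.strptime(d, "%d-%b-%Y").date(), m) for d, m in rows]
parsed.sort(key=lambda row: row[0])

dates = [d for d, _ in parsed]
magnitudes = [m for _, m in parsed]

print(dates)
print(magnitudes)  # [7.8, 7.3, 6.7]
```

The sorted ```dates``` and ```magnitudes``` lists could then be passed to ```plt.plot()``` in place of the raw columns.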

---



<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 16 - Area Charts<a class="anchor" id="DS104L4_page_16"></a>

[Back to Top](#DS104L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Area Charts

An area chart is a special type of line chart where there is an additive quality to the data. Anytime there is a reason to look at data additively, area charts can give a good feel for how much each item contributes to the whole pile. They differ from a line chart with multiple trends in that the trends are stacked rather than lying on top of each other, which is what sometimes makes the lines in a multi-trend chart hard to distinguish. Take a look at this example, of microbreweries in Washington from 2010-2014:

![A graph depicts the plot of years against barrels. The x-axis represents years from 2010 to 2014 and the y-axis represents barrels. A few patterns are plotted to represent Redhook, Georgetown, Mac and Jacks, Elysian, Fish, and Kona. The pattern for Redhook takes more space and then Georgetown. The least space is taken for Kona.](Media/quant28.png)

This example shows the number of barrels produced by each brewery. Notice that it is difficult to see any trends that may have taken place in the lower producing breweries, because they are all smooshed together.  It's also a little hard to understand the trends for the top producing breweries, because the trends below them are not flat, so you have an uneven surface from which to compare. 
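
The "uneven surface" comes from how the bands are stacked: each band starts where the running total of the bands beneath it ends. Here is a minimal sketch of that bookkeeping for a single year, using made-up brewery names and barrel counts.

```python
from itertools import accumulate

# Hypothetical barrels per brewery for a single year, for illustration only.
barrels = {"Brewery A": 200, "Brewery B": 120, "Brewery C": 50}

# Each band in a stacked area chart sits on top of the running total
# of the bands beneath it.
tops = list(accumulate(barrels.values()))
bottoms = [0] + tops[:-1]

for name, bottom, top in zip(barrels, bottoms, tops):
    print(f"{name}: drawn from {bottom} to {top}")

print(tops)  # [200, 320, 370]
```

Only the bottom band starts from a flat baseline of zero, so it is the only one whose shape can be read directly.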

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 17 - Google Trends<a class="anchor" id="DS104L4_page_17"></a>

[Back to Top](#DS104L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Google Trends

Google has put together a fun website called **[Google Trends](https://trends.google.com/trends/?geo=US)**. Have you ever seen a topic or person described as 'trending'? Google Trends does a sort of web scrape that will give you nearly instant feedback on how popular a certain topic is on the web. It is probably not a good tool for thorough analysis, but it can be interesting. For example, here is a search for 'cucumber soup' with the time range set to the past 5 years. Here is what came up:

![Two boxes labeled cucumber soup and compare. The next bar has four dropdown list boxes labeled worldwide, past 5 years, all categories, and web search. A graph labeled interest over time are placed next to the dropdown list boxes. The x-axis represents the date and it ranges from Nov 4, 2012 to Apr 30 2017. The y-axis ranges from 25 to 100.](Media/L06-25.png)

Apparently, cucumber soup is a big deal every year in late July and early August! A little further digging revealed that cucumber soup is a traditional Polish dish, popular among vegans, and is served hot or chilled, like a gazpacho. 

Note that the line trend has a scale on the left side, where the tallest peak is at 100. This is a scaling factor of sorts, and not a raw measure of popularity. Google Trends takes the highest point on the trend, sets it to 100, and then compares every other data point on the trend to that high-water mark. It says nothing about the absolute popularity of the word or phrase. In fact, you can compare multiple phrases on the same graph, and it will give you the relative popularity of the phrases by setting the highest point of any phrase to 100 and scaling the rest of the data to that.

For example, look at the same graph with a comparison thrown in: data science.

![Three boxes labeled cucumber soup, data science, and add comparison. The next bar has four dropdown list boxes labeled worldwide, past 5 years, all categories, and web search. A graph and a bar chart labeled interest over time is placed next to the dropdown list boxes. The x-axis represents the date and it ranges from Nov 4, 2012 to Apr 30 2017. The y-axis ranges from 25 to 100.](Media/L06-26.png)

It looks like in late July of 2013, 'data science' and 'cucumber soup' were at about the same level of popularity on the web, but clearly 'data science' has really taken off since then.

Now add a pop culture icon:

![Three boxes labeled cucumber soup, data science, Selena Gomez, and add comparison. The next bar has four dropdown list boxes labeled worldwide, past 5 years, all categories, and web search. A graph and a bar chart labeled interest over time are placed next to the dropdown list boxes. The x-axis represents the date and it ranges from Nov 4, 2012 to Apr 30 2017. The y-axis ranges from 25 to 100.](Media/L06-27.png)

It looks like cucumber soup is now basically flat-lined compared to Selena Gomez, and data science is not doing so well, either. Add one more trend, just for some perspective:

![Three boxes labeled cucumber soup, data science, Selena Gomez, and iPhone. The next bar has four dropdown list boxes labeled worldwide, past 5 years, all categories, and web search. A graph and a bar chart labeled interest over time is placed next to the dropdown list boxes. The x-axis represents the date and it ranges from Nov 4, 2012 to Apr 30 2017. The y-axis ranges from 25 to 100.](Media/L06-28.png)

Tossing 'iPhone' into the mix makes everything else either inconsequential, or nearly so. As they say, everything is relative...
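
The rescaling described above can be sketched in a few lines. This is a simplified illustration of the idea with made-up numbers and a hypothetical function name, not Google's actual method.

```python
def scale_to_peak(series_by_term):
    """Rescale all series so the single highest value becomes 100."""
    peak = max(value for series in series_by_term.values() for value in series)
    return {
        term: [round(100 * value / peak) for value in series]
        for term, series in series_by_term.items()
    }

# Hypothetical raw interest counts, for illustration only.
raw = {
    "cucumber soup": [10, 40, 10],
    "data science": [40, 120, 200],
}

alone = scale_to_peak({"cucumber soup": raw["cucumber soup"]})
together = scale_to_peak(raw)

print(alone["cucumber soup"])     # [25, 100, 25]
print(together["cucumber soup"])  # [5, 20, 5]
print(together["data science"])   # [20, 60, 100]
```

The underlying numbers for the first term never change, yet its peak drops from 100 to 20 once a more popular term sets the scale.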

---

## Summary

* Histograms are used to show a distribution of a single quantitative variable.
* Bar charts are used to show the frequency of a single categorical variable.
* Stacked bar charts and grouped bar charts allow the creator to show the frequency of two categorical variables on a single graph.
* Pareto charts are a specialized type of bar chart, where the bars are in order from tallest to shortest (moving left to right). There is also a cumulative trend line that is helpful in facilitating the 80/20 rule.
* Scatter plots are used to display two quantitative variables.
* On a scatter plot, each data point shows the values of the two quantitative variables for a single experimental unit.
* Line charts are used to visualize a trend through time or space.
* Line charts can be simple and contain a single trend, or can be complex with multiple trends on the same chart.
* If line trends are mutually exclusive, it sometimes makes sense to create a stacked area chart to show breakdown by category.

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 18 - Key Terms <a class="anchor" id="DS104L4_page_18"></a>

[Back to Top](#DS104L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Key Terms 

Below is a list and short description of the important keywords learned in this lesson. Please read through and go back and review any concepts you do not fully understand. Great Work!

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Pareto Principle</td>
        <td>The 80/20 rule, in which 20% of something causes 80% of all the problems.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Pareto Chart</td>
        <td>A bar chart that has the bars ordered in descending order by frequency and has a cumulative percent line running over top.</td>
    </tr>
</table>
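
The two definitions above can be sketched in Python. This is a minimal illustration, not the lesson's own code: the complaint categories and counts are invented, and `pareto_table` is a hypothetical helper name. It sorts the counts in descending order, computes the cumulative percent, and draws the bars with the cumulative line on a second y-axis.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical complaint counts, invented for illustration only
complaints = pd.Series(
    {"Wrong item": 15, "Late delivery": 45, "Other": 3, "Damaged item": 30, "Billing error": 7}
)

def pareto_table(counts):
    """Sort counts from largest to smallest and add a cumulative percent column."""
    ordered = counts.sort_values(ascending=False)
    cumulative_pct = ordered.cumsum() / ordered.sum() * 100
    return pd.DataFrame({"count": ordered, "cumulative_pct": cumulative_pct})

table = pareto_table(complaints)

# Bars in descending order of frequency
fig, ax = plt.subplots()
ax.bar(table.index, table["count"])
ax.set_ylabel("Frequency")
ax.set_title("Pareto Chart of Complaints (hypothetical data)")
ax.tick_params(axis="x", rotation=45)

# Cumulative percent line on a second y-axis that shares the same x-axis
ax2 = ax.twinx()
ax2.plot(table.index, table["cumulative_pct"], color="red", marker="o")
ax2.set_ylim(0, 110)
ax2.set_ylabel("Cumulative percent")

fig.tight_layout()
plt.show()
```

With this invented data, the first two categories account for 75% of all complaints, which is the kind of pattern the cumulative line is meant to reveal.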

---

## Key R Code

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>barchart()</td>
        <td>A function in lattice that creates a bar chart.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>main=</td>
        <td>An argument in a lattice chart to add a title.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>ylab=</td>
        <td>An argument in a lattice chart to add a y-axis label.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>xlab=</td>
        <td>An argument in a lattice chart to add an x-axis label.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>col=</td>
        <td>An argument in a lattice chart to add a color.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>fill=</td>
        <td>An argument in the aes() function of ggplot() that maps a variable to the fill color of the bars, used to build a stacked bar chart.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>position="fill"</td>
        <td>An argument in geom_bar() that makes the heights of all the bars in a stacked bar chart the same.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>position="dodge"</td>
        <td>An argument in geom_bar() that makes the bars go side by side for a categorical variable instead of stacked.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>as.numeric()</td>
        <td>Changes a variable into a number.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>as.Date()</td>
        <td>Changes a variable into a date format.</td>
    </tr>
</table>

---

## Key R Libraries

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Lattice</td>
        <td>An easy-to-use data visualization library with some customization ability.</td>
    </tr>
</table>



---

## Key Python Code

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>.hist()</td>
        <td>A pandas function that creates a histogram.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>sns.displot()</td>
        <td>Creates a histogram in seaborn; adding kde=True overlays a smoothed density curve.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>sns.pairplot()</td>
        <td>Creates a grid of pairwise scatter plots for the numeric variables in your dataset, with a histogram of each variable on the diagonal.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>plt.hist()</td>
        <td>Creates a histogram in matplotlib.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>facecolor=</td>
        <td>An argument for plt.hist() that changes the color of the histogram.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>alpha=</td>
        <td>An argument for plt.hist() that changes the transparency of the histogram.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>plt.xlabel()</td>
        <td>Adds an x-axis label to a graph in matplotlib.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>plt.ylabel()</td>
        <td>Adds a y-axis label to a graph in matplotlib.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>plt.title()</td>
        <td>Adds a title to a graph in matplotlib.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>plt.show()</td>
        <td>Displays a plot in matplotlib.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>.value_counts().plot(kind='bar')</td>
        <td>Creates a bar chart in pandas from the frequency of each category.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>set_title()</td>
        <td>Adds a title when graphing in pandas.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>set_xlabel()</td>
        <td>Adds an x-axis label when graphing in pandas.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>set_ylabel()</td>
        <td>Adds a y-axis label when graphing in pandas.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>.scatter()</td>
        <td>Creates a scatter plot in pandas.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>c=</td>
        <td>An argument for .scatter() that provides a third variable by color.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>cmap=</td>
        <td>An argument when graphing in pandas to add a color scheme to a graph.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>s=</td>
        <td>An argument for .scatter() that provides a third variable by size.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>plt.plot()</td>
        <td>Creates a line chart in matplotlib.</td>
    </tr>
</table>
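
Several of the keywords in the table above can be seen working together in the sketch below. It is only an illustration under stated assumptions: the DataFrame and its column names (`height`, `weight`, `age`) are synthetic and do not come from the lesson's datasets.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data, generated only so the sketch is self-contained
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "height": rng.normal(170, 10, 200),
    "weight": rng.normal(70, 12, 200),
    "age": rng.integers(18, 65, 200),
})

# Histogram in matplotlib: facecolor= sets the color, alpha= the transparency
counts, bin_edges, _ = plt.hist(df["height"], bins=10, facecolor="green", alpha=0.75)
plt.xlabel("Height (cm)")
plt.ylabel("Frequency")
plt.title("Distribution of Height (synthetic data)")
plt.show()

# Scatter plot in pandas: c= colors the points by a third variable,
# cmap= picks the color scheme, and s= sizes the points (here also by age)
ax = df.plot.scatter(x="height", y="weight", c="age", cmap="viridis", s=df["age"])
ax.set_title("Height vs. Weight by Age (synthetic data)")
ax.set_xlabel("Height (cm)")
ax.set_ylabel("Weight (kg)")
plt.show()
```

Note the two labeling styles: the `plt.` functions act on the current matplotlib figure, while the `set_` methods are called on the axes object that pandas returns.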

---

## Key Python Packages

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Seaborn</td>
        <td>An easy-to-use data visualization package with minimal modification abilities.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Matplotlib</td>
        <td>A complex data visualization package with high ability to modify.</td>
    </tr>
</table>


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 19 - Lesson 4 Hands-On<a class="anchor" id="DS104L4_page_19"></a>

[Back to Top](#DS104L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

---
# Lesson 4 Hands-On

In this hands-on, you will be practicing your quantitative data visualization skills.  Feel free to use R, Python, or both to complete this project. This Hands-On **will** be graded, so make sure you complete each part. When you are done, please submit one document with all of your findings for grading.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

---

## Part 1

Here's **[a listing of the number of power boats](https://repo.exeterlms.com/documents/V2/DataScience/Data-Wrang-Visual/L5P1.zip)** registered in Florida (in thousands) for each year, covering 35 years of data. Tinker with the settings until you have created a histogram with 7 bars.

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>Remember, you can specify the number of bins in Python if you use matplotlib, and in R, geom_histogram() in ggplot2 takes a binwidth= argument that will let you explore bins!</p>
    </div>
</div>
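
As a rough sketch of the Python half of the tip, the `bins=` argument of `plt.hist()` controls how many bars are drawn. The values below are randomly generated placeholders, not the hands-on dataset, so only the pattern carries over.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical yearly values, invented for illustration; not the hands-on dataset
rng = np.random.default_rng(seed=1)
values = rng.uniform(400, 1000, size=35)

# bins=7 asks matplotlib for seven equal-width bins spanning the data range
counts, bin_edges, _ = plt.hist(values, bins=7, facecolor="steelblue", alpha=0.8)
plt.xlabel("Value")
plt.ylabel("Number of years")
plt.title("Histogram with 7 Bins (hypothetical data)")
plt.show()

print(len(counts))         # number of bars drawn
print(np.diff(bin_edges))  # width of each bin
```

The printed bin width is the data range divided by the number of bins, which is the quantity you would adjust through `binwidth=` when working in R.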


---

## Part 2

Use the **[following data](https://repo.exeterlms.com/documents/V2/DataScience/Data-Wrang-Visual/L5P2.zip)** to create a bar graph. 

---

## Part 3

Use the **[following data](https://repo.exeterlms.com/documents/V2/DataScience/Data-Wrang-Visual/L5Part3.zip)** to create a stacked bar chart using either Python or R. 

---

## Part 4

The **[following dataset](https://repo.exeterlms.com/documents/V2/DataScience/Data-Wrang-Visual/crocodiles.zip)** contains data for estuarine crocodiles, recording their head length and body length. Visualize the data with a clearly labeled scatter plot.

---

## Part 5

Here's **[data tracking heart attacks](https://repo.exeterlms.com/documents/V2/DataScience/Data-Wrang-Visual/L5P5.zip)** treated at a hospital chain in a large US city. Use this data to create a line chart. Be sure to label the graph, and label both axes. 

---
Create a report, make note of any interesting findings, and submit it for grading.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>

