In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab06.ipynb")

# Lab 06: The Grammar of Graphics
Welcome to Lab 6 of DATA 271!

## Overview
The phrase "Grammar of Graphics" was coined by statistician Leland Wilkinson, who authored a book of that name in the 1990s that had a broad impact on statistics and data science.  The book codifies a consistent way to represent and think about statistical graphs.  The grammar laid out in this book is the foundation of the R graphics package ggplot2, first released by Hadley Wickham in 2007. The key idea is that graphs are broken into semantic components such as scales and layers. Python has several packages that support a general grammar of graphics approach to visualization, but `plotnine` was specifically created to mimic the ggplot2 package in R. For this lab, we will use `plotnine`.

The `plotnine` documentation can be found [here](https://plotnine.org/reference/). 

### In today's lab, we will...
- Begin to understand what makes an effective graph and how to represent multiple variables (potentially both numeric and categorical) on the same plot.
- Refresh the rules of "the grammar of graphics."
- Create and modify a plotnine object.
- Build complex plots using a step-by-step approach following the grammar of graphics.
- Intelligently break plots into meaningful subplots to extract insights from data.
- Create scatter plots, box plots, bubble plots, and line plots.


### Install and import required packages

If you haven't done so in a while on your JupyterHub account, you may have to reinstall `plotnine`. To do so, type
```bash
pip install plotnine
pip install matplotlib==3.8.3
```
in the terminal. Ask Dr. Johnson if you can't remember how to do this.
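
If you are unsure whether the install worked, one way to check is to ask Python which versions it can see. The snippet below is a minimal sketch that uses only the standard library; the helper name `installed_version` is made up for this illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report the two packages this lab relies on.
for name in ("plotnine", "matplotlib"):
    found = installed_version(name)
    print(f"{name}: {found if found is not None else 'not installed'}")
```

If either line reports `not installed`, rerun the `pip` commands above in the terminal and restart your kernel.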

In [None]:
from plotnine import *
from plotnine.data import *

import numpy as np
import pandas as pd
import warnings 
warnings.filterwarnings('ignore') 

### 1. Introduction and recap of the Grammar of Graphics

In the article ["A Layered Grammar of Graphics"](https://byrneslab.net/classes/biol607/readings/wickham_layered-grammar.pdf) by Hadley Wickham, the author highlights the main components of a statistical graphic:

- data
- aesthetic mapping
- scales
- geometric objects
- statistical transformations
- facets
- coordinate system
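
To make the link between aesthetic mappings and scales concrete, here is a minimal sketch in plain Python (no plotting library). An aesthetic mapping says which variable drives which visual property; a scale turns each data value into an actual position or color. The function names, the pixel range, the palette, and the tiny table are all assumptions made up for this illustration, and this is not how `plotnine` is implemented internally.

```python
def continuous_scale(values, out_min, out_max):
    """Linearly rescale numbers from their data range to [out_min, out_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: place everything in the middle of the range.
        return [(out_min + out_max) / 2 for _ in values]
    return [out_min + (v - lo) / (hi - lo) * (out_max - out_min) for v in values]


def discrete_scale(values, palette):
    """Assign one palette entry to each distinct category, in order of appearance."""
    levels = list(dict.fromkeys(values))  # unique values, order preserved
    lookup = {level: palette[i % len(palette)] for i, level in enumerate(levels)}
    return [lookup[v] for v in values]


# Hypothetical data: three cars with an engine size and a drive type.
displ = [2.0, 3.0, 6.0]
drv = ["f", "4", "f"]

# Like aes(x='displ', color='drv'): displ drives the x position, drv drives the color.
x_pixels = continuous_scale(displ, out_min=0, out_max=400)
colors = discrete_scale(drv, palette=["red", "blue", "green"])

print(x_pixels)  # [0.0, 100.0, 400.0]
print(colors)    # ['red', 'blue', 'red']
```

Notice that the numeric variable is stretched smoothly over a range, while the categorical variable gets one distinct value per level.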

Review what each of these components means and answer the following questions.

<!-- BEGIN QUESTION -->

**Question 1.1:** What are examples of aesthetics? How are these related to scales?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.2:** What are examples of geometric objects?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.3:** What is a coordinate system? Give an example of a coordinate system beyond Cartesian. 

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.4:** Give two examples of a statistical transformation.  

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.5:** What are facets? Explain why using facets can lead to clearer visualizations. 

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### 2. Appetizers: ggplot basics with the mpg dataset
`plotnine.data` contains a number of data sets.  One example was collected by the Environmental Protection Agency on different models of cars. 

In [None]:
mpg.head()

**Question 2.1:** How many rows and columns are in the `mpg` DataFrame?

In [None]:
num_rows = ...
num_cols = ...

num_rows,num_cols

In [None]:
grader.check("q2_1")

<!-- BEGIN QUESTION -->

**Question 2.2:** What do the `cyl`, `hwy`, `displ`, and `drv` columns stand for?

*HINT:* Check the plotnine documentation using the link above or type `mpg?` in a cell. 

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 2.3:** Use a plotnine ggplot to create a scatter plot of `cyl` mapped to the x-axis and `hwy` mapped to the y-axis. Do you notice anything interesting about your plot? (How many data points do you expect to see?)

In [None]:
cyl_hwy = ...
cyl_hwy.draw()

In [None]:
grader.check("q2_3")

**Question 2.4:** Use a plotnine ggplot to create a scatter plot of `drv` mapped to the x-axis and `class` mapped to the y-axis. Is this a useful plot? Explain.

In [None]:
class_drv = ...
class_drv.draw()

In [None]:
grader.check("q2_4")

**Question 2.5:** Use a plotnine ggplot to create a scatter plot of `displ` mapped to the x-axis and `hwy` mapped to the y-axis.

In [None]:
hwy_displ = ...
hwy_displ.draw()

In [None]:
grader.check("q2_5")

**Question 2.6:** Which variables in the `mpg` dataset are categorical, which are numeric? Assign `categorical` to a Python list of strings containing the names of the categorical columns in `mpg`. Assign `numeric` to a Python list of strings containing the names of the numeric columns in `mpg`.  

*NOTE:* Although certain variables could go either way, assume the variables of type int or float are numeric for this problem.

In [None]:
categorical = ...
numeric = ...

In [None]:
grader.check("q2_6")

**Question 2.7:** If we inspect our scatterplot from Question 2.5, there are some points that fall between 20 and 30 on the `hwy` variable and also have somewhat large engines (with `displ` between 5 and 7).  They seem to be outside of the linear trend we observed.  Is there something different about these cars?

Aesthetic mappings (*aes*) allow us to add additional variables to our visualization. Add `class` to the scatterplot by mapping it to a shape aesthetic.

In [None]:
hwy_displ_class = ...
hwy_displ_class.draw()

In [None]:
grader.check("q2_7")

**Question 2.8:** Interestingly, the plot from the previous problem shows us that most of those vehicles with abnormally large engine sizes are 2-seaters. What kind of cars are those?

Add `manufacturer` to the scatterplot by mapping it to a color aesthetic. 

*HINT:* To see everything in the legends, you might find it helpful to add to your ggplot object
```python
+ theme(figure_size=(8, 8))
```

In [None]:
hwy_displ_class_manuf = ...
                         ...
                         ...
hwy_displ_class_manuf.draw()

In [None]:
grader.check("q2_8")

**Question 2.9:**  What happens if you map a numeric variable to the color aesthetic? Create the same plot as the one above, but map `cty` to the color aesthetic instead of `manufacturer`. Try other numeric variables too. What do you notice?

In [None]:
hwy_displ_class_cty = ...
                         ...
hwy_displ_class_cty.draw()

In [None]:
grader.check("q2_9")

If you find those lighter colors in the plot above difficult to see, it is also possible to set different scales for your color mapping. `plotnine` supports the matplotlib [colormaps](https://matplotlib.org/stable/users/explain/colors/colormaps.html). For example, we could use the `seismic` colormap.

In [None]:
(ggplot(mpg, aes(x='displ', y='hwy', shape='class', color='cty'))
 + geom_point()
 + scale_color_cmap(cmap_name="seismic")
).draw()
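
To build some intuition for what a continuous color scale does with a numeric variable, here is a minimal sketch in plain Python. It blends between two colors in proportion to the data value. This is a simplified two-color gradient, not the `seismic` colormap itself, and the function name, the RGB values, and the mileage numbers are assumptions made up for this illustration.

```python
def gradient_color(value, lo, hi, start_rgb, end_rgb):
    """Blend two RGB colors according to where value falls between lo and hi."""
    t = (value - lo) / (hi - lo)  # 0.0 at lo, 1.0 at hi
    return tuple(round(s + t * (e - s)) for s, e in zip(start_rgb, end_rgb))


# Hypothetical city mileage values and a dark-blue to light-blue gradient.
cty_values = [10, 20, 30]
dark_blue, light_blue = (0, 0, 100), (100, 200, 250)

shades = [gradient_color(v, 10, 30, dark_blue, light_blue) for v in cty_values]
print(shades)  # [(0, 0, 100), (50, 100, 175), (100, 200, 250)]
```

Every value between the minimum and the maximum gets its own in-between shade, which is why the legend for a numeric variable is a color bar rather than a list of separate keys.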

<!-- BEGIN QUESTION -->

**Question 2.10:** Play around with mapping different variables to different aesthetics. How do these aesthetics behave differently for categorical vs. numeric variables?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 2.11:** We've seen that a way to add an additional variable is with aesthetics.  Another option, particularly useful for categorical variables, is to split the plot into facets. `facet_wrap()` allows us to do this (as long as the variable we pass in is discrete). This will create subplots that each display one subset of the data. 

Create a plot with `displ` mapped to the x-axis, `hwy` mapped to the y-axis, and faceted by the `class` variable. Use this to create a subplot structure with 2 rows.
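
Conceptually, faceting partitions the rows by a discrete variable and then lays the resulting panels out on a grid. The sketch below shows both steps in plain Python on a made-up table. The helper name `grid_shape` and the records are assumptions for illustration only; `plotnine` handles all of this for you.

```python
import math
from collections import defaultdict

# Hypothetical records: (class, displ, hwy).
rows = [
    ("compact", 1.8, 29),
    ("suv", 5.3, 20),
    ("compact", 2.0, 31),
    ("pickup", 4.7, 17),
    ("suv", 4.0, 19),
]

# Step 1: split the rows into one subset per category.
panels = defaultdict(list)
for vehicle_class, displ, hwy in rows:
    panels[vehicle_class].append((displ, hwy))


def grid_shape(n_panels, nrow):
    """Return (nrow, ncol) with just enough columns to hold every panel."""
    return nrow, math.ceil(n_panels / nrow)


# Step 2: arrange the panels on a grid with a fixed number of rows.
print(sorted(panels))              # ['compact', 'pickup', 'suv']
print(len(panels["compact"]))      # 2
print(grid_shape(len(panels), 2))  # (2, 2)
```

Each panel only ever sees its own subset of the data, which is what makes trends within a single category easier to read.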

In [None]:
hwy_displ_class_fac = ...
                        ...
                        ...
hwy_displ_class_fac.draw()

In [None]:
grader.check("q2_11")

**Question 2.12:** Geometric objects (*geoms*) are used to represent data in different ways.  We often describe the type of plot by the geom it uses; for example, bar charts use bar geoms (`geom_bar`), line charts use line geoms (`geom_line`), boxplots use boxplot geoms (`geom_boxplot`), scatterplots use the point geom (`geom_point`), etc.

Create a bar plot to show the number of vehicles in the dataset for each `class`. 
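
A bar chart of counts relies on a statistical transformation that tallies the rows in each category before any bars are drawn; in `plotnine`, `geom_bar` does this counting by default. Here is a minimal sketch of that tally in plain Python, using a made-up list of classes rather than the real `mpg` data:

```python
from collections import Counter

# Hypothetical vehicle classes, one entry per row of a small table.
classes = ["suv", "compact", "suv", "pickup", "compact", "suv"]

# Count how many rows fall in each category; each count becomes a bar height.
counts = Counter(classes)

# Print a rough text version of the bar chart, one bar per category.
for vehicle_class, n in sorted(counts.items()):
    print(f"{vehicle_class:8s} {'#' * n} ({n})")
```

This is why you only need to map the categorical variable to the x-axis: the y values are computed for you.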

In [None]:
class_count = ...
class_count.draw()

In [None]:
grader.check("q2_12")

Statistical transformations (*stats*) allow us to summarize or smooth the data. For example, `stat_summary()` summarizes the $y$ values at each unique $x$ value. In the example below, it plots the median, the minimum, and the maximum of $y$ for each $x$.

In [None]:
( 
    ggplot(data=mpg, mapping=aes(x="displ", y="hwy")) 
    + stat_summary(
    fun_ymin=np.min,
    fun_ymax=np.max,
    fun_y=np.median)
).draw()
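
To see what this summary computes before anything is drawn, here is a minimal sketch in plain Python using the standard `statistics` module. The (displ, hwy) pairs are made up for illustration and are not rows from `mpg`.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (displ, hwy) pairs; several cars share the same engine size.
points = [(2.0, 29), (2.0, 31), (2.0, 26), (3.0, 25), (3.0, 27)]

# Group the y values by their x value.
y_by_x = defaultdict(list)
for x, y in points:
    y_by_x[x].append(y)

# For each x, compute the three numbers the summary draws:
# the bottom of the line (min), the dot (median), and the top of the line (max).
summary = {x: (min(ys), median(ys), max(ys)) for x, ys in sorted(y_by_x.items())}

print(summary)  # {2.0: (26, 29, 31), 3.0: (25, 26.0, 27)}
```

Five raw points collapse into two summarized positions, one per unique $x$ value.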

**Question 2.13:** Stats also allow us to smooth the data.

Create a scatter plot with `displ` mapped to the x-axis, `hwy` mapped to the y-axis and a smooth stat. Try using different "methods" of smoothing. You can find the options [here](https://plotnine.org/reference/stat_smooth#plotnine.stat_smooth).

In [None]:
hwy_displ_smooth = ...
                        ...
                        ...
hwy_displ_smooth.draw()

In [None]:
grader.check("q2_13")

**Question 2.14:** Create the same plot as above, but this time map `class` to the color aesthetic. What happens to your smoother lines?

In [None]:
hwy_displ_smooth2 = ...
                        ...
                        ...
hwy_displ_smooth2.draw()

In [None]:
grader.check("q2_14")

### Other things that affect the style of your graph
There are several things you can do to adjust the style of your `plotnine` ggplot. Some of them are demonstrated below. 

In [None]:
# add a title to your plot
(ggplot(mpg,aes('displ','hwy')) 
 + geom_point()
 + ggtitle('Highway MPG vs Engine Size')).draw()

In [None]:
# update axis labels
(ggplot(mpg,aes('displ','hwy')) 
 + geom_point()
 + ggtitle('Highway MPG vs Engine Size')
 + xlab('Engine Size')
 + ylab('Highway MPG')).draw()

In [None]:
# adjust the theme (options can be found at https://plotnine.org/reference/#themes)
(ggplot(mpg,aes('displ','hwy')) 
 + geom_point()
 + ggtitle('Highway MPG vs Engine Size')
 + xlab('Engine Size')
 + ylab('Highway MPG')
 + theme_classic()).draw()

### 3. Main Course: Analysis on Hans Rosling's TED Talk data  
Hans Rosling's TED talk ends with data from 2003.  Two decades have passed, and new data is available.
Gapminder is an independent Swedish foundation with no political, religious, or economic affiliations.  It was founded in 2005 by Hans Rosling and others.  In 2007, some of its software was acquired by Google, and the Gapminder team assisted Google in improving their search to return better results for global statistics from big data providers.  Rosling and coauthors also released the book Factfulness in 2018, which became an international best seller.
Gapminder provides data curated from a number of reputable sources.  Data can be accessed on this website: https://www.gapminder.org/data/.
A csv file called `gapminder_all.csv` has been placed in your local directory if you are working on JupyterHub. If you are working on your own device, download the data or read it in from online. Use that csv file to answer the following questions. 

**Question 3.1:** Import the `gapminder_all.csv` data. 

In [None]:
df = ...
df.head()

In [None]:
grader.check("q3_1")

**Question 3.2:** Consider the data for GDP and life expectancy in 2007.  Make a scatterplot with life expectancy (`lifeExp_2007`) as a function of GDP (`gdpPercap_2007`).  Add the variable `continent` as a color.  What trends do you notice?  Discuss your observations in a few sentences.

In [None]:
scatter_2007 = ...
            ...
scatter_2007.draw()

In [None]:
grader.check("q3_2")

**Question 3.3:** Create a new graph with the same information as above and with each country's 2007 population (`pop_2007`) mapped to a size aesthetic. Explain any trends you see. 

*NOTE:* This will create something called a bubble plot. If you want the points to have a little transparency, you can use `geom_point(alpha = 0.7)`.

In [None]:
bubble_2007 = ...
            ...
bubble_2007.draw()

In [None]:
grader.check("q3_3")

**Question 3.4:** Dr. Rosling warned us it can be problematic to use average data and that context is important.  Therefore, instead of grouping countries by continent, let's focus on a specific continent and look at differences between countries.  

Make a similar graph for only the African countries in the year 2007.  Use size of dots to represent country population, color of dots to represent different countries, and rename axes, plot title and legend titles so they are clear.  Legend labels can be renamed with `labs(color = 'your name', size = 'your name')`. Note, you might have to adjust the figure size too. 

In [None]:
africa_2007 = ...
            ...
            ...
            +ggtitle('Africa in 2007')
            ...
            ...
            ...
africa_2007.draw()

In [None]:
grader.check("q3_4")

**Question 3.5:** Repeat the task above, but this time, focus on the Americas.  

Which two countries in the Americas had an extremely high GDP and an extremely high life expectancy in 2007?  Which country had a very low GDP and a very low life expectancy?

In [None]:
americas_2007 = ...
            ...
            ...
            ...
            ...
            ...
            ...
americas_2007.draw()

In [None]:
grader.check("q3_5")

<!-- BEGIN QUESTION -->

**Question 3.6:** Do you have a hypothesis about what factors impact life expectancy in the two countries with the highest GDPs in 2007?  Why do you think they are ordered as they are?  (Feel free to use Google as you brainstorm.)  What about the country with the lowest GDP/life expectancy?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## You're done! 

Gus says congratulations on finishing the lab this week, and he wants you to have these flowers. Run the cell below and submit to Canvas. 


<img src="gus_gives_flowers.JPG" alt="drawing" width="500"/>

### References
- Wickham, Hadley. "A layered grammar of graphics." Journal of Computational and Graphical Statistics 19.1 (2010): 3-28. https://byrneslab.net/classes/biol607/readings/wickham_layered-grammar.pdf
- Wilkinson, Leland. The grammar of graphics. Springer Berlin Heidelberg, 2012.
- Rosling, Hans, Ola Rosling, and Anna Rosling Rönnlund. Factfulness: Ten Reasons We're Wrong About the World and Why Things Are Better Than You Think. 2018.
- Wickham, Hadley, Danielle Navarro, and Thomas Lin Pedersen. ggplot2: Elegant Graphics for Data Analysis. 3rd ed. Springer. https://ggplot2-book.org/index.html

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(run_tests=True)