# Visualization Exercises
----

There are multiple ways, across multiple packages, to complete these exercises.  Some answers are given, but there are other possibilities.  

Note: if you have better solutions to any of these, please [let me know](mailto:christina.maimone@northwestern.edu).

## Imports

A few imports to avoid having to do them for each exercise.

In [1]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

import datetime as dt
import matplotlib.dates as dates

import seaborn as sns

%matplotlib inline

## Get Some Data

The first exercises use data from [Gapminder](http://www.gapminder.org). Read `gapminder_5y_tidy.csv` from https://github.com/nuitrcs/pythonworkshops/raw/master/dataanalysis/datasets/gapminder_5y_tidy.csv (or the datasets directory of the repository) into a pandas data frame called `gapminder`.  Look at the first few observations.

In [3]:
gapminder = pd.read_csv("../datasets/gapminder_5y_tidy.csv")
gapminder.head()

| Unnamed: 0 | country | year | pop | continent | lifeExp | gdpPercap |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | 1952 | 8425333.0 | Asia | 28.801 | 779.445314 |
| 1 | Afghanistan | 1957 | 9240934.0 | Asia | 30.332 | 820.85303 |
| 2 | Afghanistan | 1962 | 10267083.0 | Asia | 31.997 | 853.10071 |
| 3 | Afghanistan | 1967 | 11537966.0 | Asia | 34.02 | 836.197138 |
| 4 | Afghanistan | 1972 | 13079460.0 | Asia | 36.088 | 739.981106 |


## Exercise: Scatter Plot

Using the `gapminder` data, plot life expectancy (`lifeExp`) vs. GDP per capita (`gdpPercap`) for the year 2002.  Make sure to label the axes and give the plot a title.

Challenge: Redo the plot, coloring the points by continent.  Make sure to add a legend.  Hint: there are a few ways to do this.  If you get stuck, take a look at https://matplotlib.org/examples/lines_bars_and_markers/scatter_with_legend.html or https://stackoverflow.com/questions/26139423/plot-different-color-for-different-categorical-levels-using-matplotlib for some approaches.  Hint 2: This may be easiest using Seaborn's `lmplot` with the `fit_reg` option set to `False`.

Challenge (independent of the challenge above): Redo the plot with the GDP axis on a log scale (alter the axis scale, not the data).
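
One possible approach to both challenges, sketched with plain matplotlib. The data frame `toy` below is a synthetic stand-in that only borrows the Gapminder column names; its values are random and not real data. With the real data you would first filter `gapminder` to the year 2002.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in rows with Gapminder-style column names (NOT real data).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "continent": ["Asia"] * 5 + ["Europe"] * 5,
    "gdpPercap": rng.uniform(500, 40000, size=10),
    "lifeExp": rng.uniform(40, 80, size=10),
})

fig, ax = plt.subplots()

# One scatter call per continent, so each group gets its own
# color from the default cycle and its own legend entry.
for name, group in toy.groupby("continent"):
    ax.scatter(group["gdpPercap"], group["lifeExp"], label=name)

ax.set_xscale("log")  # changes the axis scale; the data are untouched
ax.set_xlabel("GDP per capita")
ax.set_ylabel("Life expectancy")
ax.set_title("Life expectancy vs. GDP per capita (synthetic data)")
ax.legend(title="Continent")
```

Looping over `groupby` is only one of several options; the links in the hint above describe others.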

## Exercise: Line Plot

Using the gapminder data, plot the average life expectancy in Asia (average across countries) over time.

To start, you'll need to select just the observations for Asia, then group by year, then calculate the mean for each group (you can do this in one line).

Hint: if you use `groupby` in calculating the mean, the groups become the index of the data frame.  You can use function `reset_index()` to make an index into a column again that you can reference when plotting.
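
A minimal sketch of the filter, group, and average steps, and of what `reset_index()` does afterwards. The rows in `toy` are made up for illustration (placeholder country names, invented values); only the column names match the Gapminder file.

```python
import pandas as pd

# Synthetic stand-in rows (NOT real Gapminder values).
toy = pd.DataFrame({
    "country": ["A", "B", "A", "B", "C"],
    "continent": ["Asia", "Asia", "Asia", "Asia", "Europe"],
    "year": [1952, 1952, 1957, 1957, 1952],
    "lifeExp": [30.0, 40.0, 32.0, 44.0, 60.0],
})

# Filter, group, and average in one chained expression.
asia_mean = toy[toy["continent"] == "Asia"].groupby("year")["lifeExp"].mean()

# At this point "year" is the index of the result, not a column.
print(asia_mean.index.name)

# reset_index() turns the index back into an ordinary column.
asia_mean = asia_mean.reset_index()
print(list(asia_mean.columns))
```

After `reset_index()`, both `asia_mean["year"]` and `asia_mean["lifeExp"]` can be passed directly to `plt.plot`.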

## Exercise: Box Plot

Using the gapminder data, make a box plot of life expectancy by continent for the year 2002.  Hint: use Seaborn.
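
For comparison with the Seaborn route, here is a sketch of the same kind of plot in plain matplotlib, which needs the values split into one array per group. This swaps in matplotlib for Seaborn, and the `toy` values are invented, not real Gapminder data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Synthetic stand-in rows for a single year (NOT real Gapminder values).
toy = pd.DataFrame({
    "continent": ["Asia"] * 4 + ["Europe"] * 4,
    "lifeExp": [50.0, 55.0, 60.0, 70.0, 68.0, 72.0, 75.0, 79.0],
})

# matplotlib's boxplot wants one array of values per box.
names = sorted(toy["continent"].unique())
values = [toy.loc[toy["continent"] == n, "lifeExp"].to_numpy() for n in names]

fig, ax = plt.subplots()
ax.boxplot(values)
ax.set_xticklabels(names)  # boxes are drawn at positions 1, 2, ...
ax.set_ylabel("Life expectancy")
ax.set_title("Life expectancy by continent (synthetic data)")
```

With Seaborn the grouping is handled for you, along the lines of `sns.boxplot(x="continent", y="lifeExp", data=toy)`.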

## Exercise: Heat Map

Using Seaborn and the gapminder data, make a heat map of a matrix with years as columns, continents as rows, and average life expectancy as the cell value.  You're aiming for something that looks like:

![heatmap](../Images/heatmap.png)

Hint: the trick here is getting the data in the right format; you may need to `unstack` some grouped data.
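
The reshaping step in the hint might look like the sketch below: group by two keys, average, then `unstack` one of the index levels into columns. The `toy` rows are invented for illustration and are not real Gapminder values.

```python
import pandas as pd

# Synthetic stand-in rows (NOT real Gapminder values).
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Asia", "Europe", "Europe", "Europe"],
    "year": [1952, 1952, 1957, 1952, 1957, 1957],
    "lifeExp": [30.0, 40.0, 45.0, 60.0, 64.0, 66.0],
})

# Grouping by two keys gives a Series with a two-level index
# (continent, year); unstack("year") pivots the year level into
# columns, leaving continents as rows.
matrix = toy.groupby(["continent", "year"])["lifeExp"].mean().unstack("year")
print(matrix)
```

A data frame in this shape can be passed straight to `sns.heatmap(matrix)`.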

## Challenging Exercise: Replicate a Plot

Replicate the plot below using matplotlib.  Some code is provided for categories, colors, and offsets.  The notes below point to some of the steps involved, but you may instead want to start by making a basic version of the plot and then clean it up one step at a time.  This exercise is really about steps you can take to make a plot look better and convey information more effectively.

![degrees](http://www.randalolson.com/wp-content/uploads/percent-bachelors-degrees-women-usa.png)

Some steps/notes:

* You typically want a plot to be ~1.33x wider than tall. This plot is an exception because of the number of lines being plotted on it.  Size: 12 x 14   
  
* Remove the plot frame lines. They are unnecessary.    
  
* Ensure that the axis ticks only show up on the bottom and left of the plot.  Ticks on the right and top of the plot are generally unnecessary.    
   
* Limit the range of the plot to only where the data is.  Avoid unnecessary whitespace.    
   
* Make sure your axis tick labels are large enough to be easily read.  You don't want your viewers squinting to read your plot.  Format your y-axis labels with a %.
       
* Provide tick lines (dotted grid lines) across the plot to help your viewers trace along the axis ticks. Make sure that the lines are light and small so they don't obscure the primary data lines. Then remove the tick marks; they are unnecessary with the tick lines we just plotted. 

* Make sure plot labels are big enough to read.

* Create the title using `text()` instead of `title()` so you can control the positioning.
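
Several of the steps above can be tried out on a single line before tackling the full plot. The sketch below uses made-up numbers (`years` and `share` are placeholders, not the degree data), and the specific font sizes, grid styling, and title position are arbitrary choices rather than the original author's values.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

# Made-up numbers for illustration only.
years = [1970, 1980, 1990, 2000, 2010]
share = [10, 25, 40, 50, 55]

fig, ax = plt.subplots(figsize=(12, 14))
ax.plot(years, share, lw=2.5)

# Remove the plot frame lines.
for spine in ax.spines.values():
    spine.set_visible(False)

# Limit the range of the plot to where the data is.
ax.set_xlim(1970, 2010)
ax.set_ylim(0, 60)

# Append a % sign to the y tick labels, enlarge the labels,
# and hide the tick marks themselves (length=0).
ax.yaxis.set_major_formatter(PercentFormatter())
ax.tick_params(axis="both", labelsize=14, length=0)

# Light dashed grid lines in place of the tick marks.
ax.grid(axis="y", linestyle="--", linewidth=0.5, color="black", alpha=0.3)

# A title placed with text() at data coordinates (x=1990, y=62).
ax.text(1990, 62, "Made-up share over time", fontsize=17, ha="center")
```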


In [None]:
gender_degree_data = pd.read_csv("http://www.randalolson.com/wp-content/uploads/percent-bachelors-degrees-women-usa.csv")    
  
# These are the "Tableau 20" colors as RGB.    
tableau20 = [(31, 119, 180), (174, 199, 232), (255, 127, 14), (255, 187, 120),    
             (44, 160, 44), (152, 223, 138), (214, 39, 40), (255, 152, 150),    
             (148, 103, 189), (197, 176, 213), (140, 86, 75), (196, 156, 148),    
             (227, 119, 194), (247, 182, 210), (127, 127, 127), (199, 199, 199),    
             (188, 189, 34), (219, 219, 141), (23, 190, 207), (158, 218, 229)]    
  
# Scale the RGB values to the [0, 1] range, which is the format matplotlib accepts.    
for i in range(len(tableau20)):    
    r, g, b = tableau20[i]    
    tableau20[i] = (r / 255., g / 255., b / 255.)    
       
# majors in order of the highest % in the final year.    
majors = ['Health Professions', 'Public Administration', 'Education', 'Psychology',    
          'Foreign Languages', 'English', 'Communications\nand Journalism',    
          'Art and Performance', 'Biology', 'Agriculture',    
          'Social Sciences and History', 'Business', 'Math and Statistics',    
          'Architecture', 'Physical Sciences', 'Computer Science',    
          'Engineering']    

# You'll want to set up the plot here

# Actually plot
for rank, column in enumerate(majors):    
    # Replace the line below with code to actually draw each line.
    # rank gives you an index 0, 1, ...
    pass
  
    # Add text labels for each line
    # To get the text labels positioned nicely, you may need to 
    # offset them a bit to keep them from overlapping
    y_pos = 0 # replace 0 with an expression to get the end of each line
    
    # you may need to adjust these, but they are what the plot author used originally
    if column == "Foreign Languages":    
        y_pos += 0.5    
    elif column == "English":    
        y_pos -= 0.5    
    elif column == "Communications\nand Journalism":    
        y_pos += 0.75    
    elif column == "Art and Performance":    
        y_pos -= 0.25    
    elif column == "Agriculture":    
        y_pos += 1.25    
    elif column == "Social Sciences and History":    
        y_pos += 0.25    
    elif column == "Business":    
        y_pos -= 0.75    
    elif column == "Math and Statistics":    
        y_pos += 0.75    
    elif column == "Architecture":    
        y_pos -= 0.75    
    elif column == "Computer Science":    
        y_pos += 0.75    
    elif column == "Engineering":    
        y_pos -= 0.25    
   
    # create the label now that you know where it goes

# you might have some finishing touches to do down here

This exercise is taken from http://www.randalolson.com/2014/06/28/how-to-make-beautiful-data-visualizations-in-python-with-matplotlib/ and the full answer is there.

## Exercise: Interpolated 3D Surface from Points

Generate 20 random points in 3 dimensions and plot them in 3D.  

Then change the orientation (tilt, rotation) of the plot to a pleasing angle.  [hint](https://matplotlib.org/examples/mplot3d/rotate_axes3d_demo.html)

Challenge: Then use the points to interpolate a 3D surface and plot that (interpolate z given x and y). [hint](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.interpolate.griddata.html)  

You may want to investigate effects of different grid sizes and interpolation methods.
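
One way the pieces could fit together, as a sketch rather than a reference solution. The grid size `n`, the view angles, and the choice to fill gaps with nearest-neighbor values are all assumptions made for this example; linear interpolation returns NaN outside the convex hull of the points, which is why the fill step is there.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (registers "3d" on older matplotlib)
from scipy.interpolate import griddata

# 20 random points in the unit cube.
rng = np.random.default_rng(42)
x, y, z = rng.random((3, 20))

fig = plt.figure(figsize=(10, 5))

# Left panel: the raw points, viewed from a chosen tilt and rotation.
ax1 = fig.add_subplot(1, 2, 1, projection="3d")
ax1.scatter(x, y, z)
ax1.view_init(elev=30, azim=45)

# Regular grid covering the x/y extent of the points.
n = 30  # grid size: an arbitrary choice worth experimenting with
xi, yi = np.meshgrid(np.linspace(x.min(), x.max(), n),
                     np.linspace(y.min(), y.max(), n))

# Interpolate z on the grid; fill the NaNs that linear interpolation
# leaves outside the convex hull with nearest-neighbor values.
zi = griddata((x, y), z, (xi, yi), method="linear")
nearest = griddata((x, y), z, (xi, yi), method="nearest")
zi = np.where(np.isnan(zi), nearest, zi)

# Right panel: the interpolated surface.
ax2 = fig.add_subplot(1, 2, 2, projection="3d")
ax2.plot_surface(xi, yi, zi, cmap="viridis")
ax2.view_init(elev=30, azim=45)
```

Swapping `method="linear"` for `"cubic"` or `"nearest"`, or changing `n`, is a quick way to see how much the surface depends on these choices.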

## Exercise: xkcd for fun (and learning styling and annotations)

Can you produce [this xkcd comic](https://imgs.xkcd.com/comics/self_description.png) in `matplotlib`?  A copy is in the repository as `Images/xkcd.png`.  There's an xkcd style for `matplotlib`.  

![chartviz](../Images/xkcd.png)

Notes:
* For the first panel, you might not be able to get the lines from the text to the plot exactly like the reference image with the `annotate` function, but you can get close.  
* For panel 3, putting arrows on the spines (axis lines) doesn't work well with the XKCD style, so you might want to skip that (the solution skips it -- if you find a way other than manually drawing arrow heads, [let me know](mailto:christina.maimone@northwestern.edu)).
* Drawing a border around each subplot is a manual process (there is no built-in matplotlib function), so skip that.

The plot produced by the solution looks like:
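
As a starting point, the xkcd style and the `annotate` function can be combined roughly as follows. This is a small sketch with placeholder bars and a placeholder label, not a reproduction of any panel of the comic; if the xkcd fonts are not installed, matplotlib falls back to another font and may warn about it.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

# plt.xkcd() works as a context manager, so the sketchy style
# applies only to figures created inside the with-block.
with plt.xkcd():
    fig, ax = plt.subplots()
    ax.bar([0, 1], [3, 1])
    ax.set_xticks([0, 1])
    ax.set_xticklabels(["A", "B"])

    # A text label connected to a point on the plot by a plain line.
    note = ax.annotate(
        "placeholder label",
        xy=(0, 3),            # point the line ends at
        xytext=(0.8, 2.5),    # where the text sits
        arrowprops=dict(arrowstyle="-"),
    )

    # Hide the top and right spines for a cleaner look.
    for side in ("top", "right"):
        ax.spines[side].set_visible(False)
```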

![chartviz](../Images/xkcd_solution.png)

(If you come up with a better answer, [let me know](mailto:christina.maimone@northwestern.edu).)