# Lecture 8: Data Visualization

**Tools**

* `seaborn` - generating plots
* `pandas` - wrangling data
* `matplotlib` - fine tuning plots

**Plotting**

* Quantitative data
* Categorical data

**Customizing Visualizations**

A good visualization can help you:

* Identify anomalies in your data
* Better understand your own data
* Communicate your findings

## Quick Introduction

95%+ of the plots you'll make fall into just a few types:

* Single variable
    * Continuous
    * Discrete
* Discrete vs discrete
* Discrete vs continuous
* Continuous vs continuous

## Basic Visualizations

* Histograms
* Density plots
* Scatterplots
* Bar plots
    * Grouped bar plots
    * Stacked bar plots
* Box plots (and related things like violin plots, etc.)
* Line plots

## Variable Types: Plots

* Statistical distribution of quantitative variables
    * Single variable
        * Histogram
        * Density plot
    * Single variable x Categorical variable
        * Box plot
* Count data
    * Count data x Categorical variable
        * Bar plot
    * Count data x 2 Categorical variables
        * Grouped bar plot
        * Stacked bar plot
* Directly view quantitative variables
    * One variable x Time
        * Line plot
    * One variable x Time X Categorical variable
        * Multiple lines on the same plot
    * Two (or maybe 3) quantitative variables
        * Scatterplot
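
The outline above can be read as a lookup from variable types to plot type. Here is a minimal sketch of that idea; the dictionary keys and the `suggest_plot` helper are hypothetical names made up for illustration, not part of any library.

```python
# Hypothetical lookup table mirroring the outline above (illustrative only)
PLOT_FOR = {
    ('quantitative',): 'histogram or density plot',
    ('quantitative', 'categorical'): 'box plot',
    ('count', 'categorical'): 'bar plot',
    ('count', 'categorical', 'categorical'): 'grouped or stacked bar plot',
    ('quantitative', 'time'): 'line plot',
    ('quantitative', 'time', 'categorical'): 'multiple lines on the same plot',
    ('quantitative', 'quantitative'): 'scatterplot',
}


def suggest_plot(*variable_types):
    """Return the plot type the outline lists for this combination of variable types."""
    return PLOT_FOR.get(tuple(variable_types), 'no suggestion in this outline')


print(suggest_plot('count', 'categorical'))
print(suggest_plot('quantitative', 'categorical'))
```

The point is only that the choice of plot follows from the types of the variables involved, which is what the clicker questions below practice.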

### Histograms

Information about a **single quantitative** variable

### Density Plots

Information about a **single quantitative** variable

### Scatterplot

Relationship between **two quantitative** variables

### Bar Plot

Count of values within a **single categorical** variable

### Grouped Bar Plot

Count of values broken down across **two categorical** variables

### Stacked Bar Plot

Count/proportion of values broken down across **two categorical** variables

### Box Plot

Summary of a **quantitative variable** broken down by a **categorical variable**

### Line Plot

**Quantitative** trend over **time**

* Single series
* Two series
* Multiple series

**CLICKER QUESTION:**

You want to visualize how many people in your dataset prefer chocolate chip cookies and how many prefer oatmeal raisin cookies.

What type of visualization would be most appropriate?

A) Histogram

B) Scatterplot

**C) Bar Plot**

D) Box Plot

E) Line Plot

**CLICKER QUESTION:**

You're interested in visualizing how many servings of milk an individual drinks each day among those who prefer chocolate chip cookies and those who prefer oatmeal raisin cookies.

What type of visualization would be most appropriate?

A) Histogram

B) Scatterplot

C) Bar Plot

**D) Box Plot**

E) Line Plot

**CLICKER QUESTION:**

You're interested in visualizing how many servings of milk an individual drinks each year over the course of their life.

What type of visualization would be most appropriate?

A) Histogram

B) Scatterplot

C) Bar Plot

D) Box Plot

**E) Line Plot**

## Plotting in Python: Getting Started

In [None]:
# import working with data libraries
import pandas as pd
import numpy as np

# import seaborn
import seaborn as sns

# import matplotlib
import matplotlib.pyplot as plt
import matplotlib as mpl

# improve resolution
# comment this line if erroring on your machine
%config InlineBackend.figure_format = 'retina'

In [None]:
sns.__version__

### Class Data

With the libraries imported, the first dataset we'll use today is data from the COGS 108 class survey from the Spring of 2019.

In [None]:
df = pd.read_csv("https://github.com/COGS108/Lectures-Sp22/raw/master/03_ethics/data/df_for_viz.csv")

In [None]:
df.shape

In [None]:
df.head()

Wrangling that's been done:

* Removed lots of identifying information
* Standardized gender and job
* Separated out programming responses

In [None]:
df.describe()

### Quantitative Variables

* Histograms
* Density Plots
* Scatterplots

### Histograms and Density Plots

Histograms and density plots are helpful for visualizing information about a *single quantitative variable*.

We can use seaborn's `histplot` function (`distplot` in older versions of `seaborn`).

In [None]:
# set plotting size parameter
plt.rcParams['figure.figsize'] = (17, 7)    # default plot size to output

In [None]:
sns.set_theme(context='notebook', style='white', font_scale=2, rc={'axes.spines.right': False, 'axes.spines.top' : False})

In [None]:
# histogram
# `distplot` in older versions of `seaborn`
sns.histplot(df['statistics'], bins=10, kde=False)

One thing to note about histograms is that the number of bins displayed plays a large role in what the viewer takes away from the visualization.
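
To see the effect of bin count without drawing anything, here is a small sketch using `np.histogram` on synthetic integer ratings. The data are made up (they are not the class survey), and the names `ratings` and `hist_counts` are assumptions for illustration.

```python
import numpy as np

# Synthetic stand-in for a 1-10 comfort rating (NOT the class survey data)
rng = np.random.default_rng(0)
ratings = rng.integers(1, 11, size=200)

# Bin the same data three different ways
hist_counts = {}
for n_bins in (5, 10, 20):
    counts, edges = np.histogram(ratings, bins=n_bins)
    hist_counts[n_bins] = counts
    print(n_bins, counts)
```

Because the ratings take at most ten distinct integer values, the 20-bin version has to leave at least half of its bins empty. Those gaps come from the binning choice, not from the data.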

In [None]:
# `distplot` in older versions of seaborn
# just histogram - set kde = False
sns.histplot(df['statistics'], bins = 20)

# alternative approach using pandas
# df['statistics'].hist(bins=10)

This doesn't follow "visualization best practices."

### Visualization Best Practices

* Choose the right type of visualization
* Be mindful when choosing colors
* Label your axes
* Make text big enough
* Keep it simple
* Less is more:
    * Aim to improve your data:ink ratio
    * Everything on the page should serve a purpose. If it doesn't, remove it.

### Best Practices Examples

**Example #1: Pie Chart**

Ideas:

* Pros:
    * Consistent colors from left to right
    * Values provided for each slice
    * Overall picture
* Cons:
    * Text size
    * Legend not ideal
    * Colors are not intuitive
    * Pie chart not ideal because of number of categories

Suggestions:

* Different visualization: Stacked Barplot?

## Less Is More

The *less is more* approach suggests that we should get rid of the background color and remove the gridlines. We'll use the less-is-more approach as we work through the other types of visualizations.

Let's improve our original plot now...

In [None]:
# `distplot` in older versions of seaborn
# change color to dark gray
ax = sns.histplot(df['statistics'], kde=False, bins=10, color='#686868')

# remove the top and right lines
sns.despine()

# add title and axis labels (modify x-axis label)
ax.set_title('Most COGS 108 students are moderately comfortable with statistics.')
ax.set_ylabel('Count')
ax.set_xlabel('How comfortable are you with statistics?')

In [None]:
# kdeplot to only display the density plot
ax = sns.kdeplot(df['programming'], color='#686868')

# remove the top and right lines
sns.despine()

# add title and axis labels (modify x-axis label)
ax.set_title('Most COGS 108 students are pretty comfortable with programming.')
ax.set_ylabel('Density')
ax.set_xlabel('How comfortable are you with programming?')

## Scatterplots

Scatterplots can help visualize the relationship between **two quantitative variables**.

In [None]:
sns.scatterplot(x='programming', y='statistics', data=df, alpha=0.1)

# alternative with pandas
df.plot.scatter('programming', 'statistics')

In [None]:
# jitter points to see relationship, try different levels of it
sns.lmplot(x='programming', y='statistics', data=df, fit_reg=False, height=6, aspect=2, x_jitter=0.15, y_jitter=0.15)

In [None]:
# fit a linear model, showing the line of best fit
# and also a 95% confidence interval on the fit
sns.lmplot(x='programming', y='statistics', data=df, fit_reg=True, height=6, aspect=2, x_jitter=0.20, y_jitter=0.20)
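
Jitter is useful here because survey ratings are integers, so many points land on exactly the same coordinates. As a rough sketch of what jittering does, the snippet below adds small uniform noise to made-up values with NumPy. The variable names are illustrative, and this mimics the idea rather than reproducing seaborn's internals.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up integer ratings with heavy overlap
x = np.array([3, 3, 3, 4, 4, 5], dtype=float)

# Shift each point by a random amount in [-jitter, +jitter], for display only
jitter = 0.15
x_jittered = x + rng.uniform(-jitter, jitter, size=x.size)

print(np.unique(x).size, 'distinct positions before jitter')
print(np.unique(x_jittered).size, 'distinct positions after jitter')
```

The jitter level is a trade-off: too little and the points still overlap, too much and neighboring ratings blur together.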

## Scatterplots (By A Categorical Variable)

When you want to plot two numeric values but want to get some insight about a *third* categorical variable, you can color the points on the plot by the categorical variable.

In [None]:
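# control color palette: one fixed color per category across all plots
# (pd.concat replaces Series.append, which was removed in pandas 2.0)
unique = pd.concat([df['lecture_attendance'], df['gender']]).unique()
palette = dict(zip(unique, sns.color_palette()))
palette.update({'Total' : 'k'})
print(palette)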

In [None]:
# color points by gender
sns.lmplot(x='programming', y='statistics', data=df, hue = 'gender', fit_reg=True, height=6, aspect=2, x_jitter=0.5, y_jitter=0.5, palette=palette)

In [None]:
# color points by lecture attendance
sns.lmplot(x='programming', y='statistics', data=df, hue = 'lecture_attendance', fit_reg=True, height=6, aspect=2, x_jitter=0.5, y_jitter=0.5, palette=palette)

## Categorical Variables

### Barplots

In [None]:
# generate default barplot
sns.countplot(x='lecture_attendance', data=df)

In [None]:
ax = sns.countplot(x='lecture_attendance', 
                   data=df, color = '#686868')

# add title and axis labels (modify x-axis label)
ax.set_title('Most COGS108 students prefer to attend lecture')
ax.set_ylabel('Count')
ax.set_xlabel('Lecture Attendance Preference')
# set tick labels
ax.set_xticklabels(("attend", "not attend"))

In [None]:
ax = sns.countplot(x='gender', data=df, color='#686868')

# add title and axis labels (modify x-axis label)
ax.set_title('There are more males than females in COGS108')
ax.set_ylabel('Count') 
ax.set_xlabel('Gender')

It's often a good idea to order axes from largest to smallest for categorical data.

In [None]:
ax = sns.countplot(x='gender', data=df, color = '#686868',
             order=['male', 'female', 'other or prefer not to say'])

# add title and axis labels (modify x-axis label)
ax.set_title('Male is the most prevalent gender in COGS108.')
ax.set_ylabel('Count')
ax.set_xlabel('Gender')
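
The `order` list above is typed by hand. As a sketch, it can instead be derived from the data with `value_counts()`, which sorts from most to least frequent by default. The toy DataFrame below is made up for illustration.

```python
import pandas as pd

# Made-up responses standing in for the survey's `gender` column
toy = pd.DataFrame({'gender': ['female', 'male', 'male',
                               'other or prefer not to say', 'male', 'female']})

# value_counts() returns categories sorted from most to least frequent
order = toy['gender'].value_counts().index.tolist()
print(order)
```

Passing `order=order` to `sns.countplot` then keeps the bars sorted even if the counts change.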

In [None]:
# warning: not seaborn
# pandas approach
# proportion of the class familiar with each programming language
a = df.iloc[:,5:11].sum()/len(df)
a = a.sort_values(axis=0, ascending=False)
a.plot.bar(color='#686868', rot=0)

### Grouped Barplots

In [None]:
# same color palette as defined earlier
# generate grouped barplot by specifying hue
ax = sns.countplot(x='lecture_attendance', hue='gender',
                   data=df, palette=palette)

# add title and axis labels (modify x-axis label)
ax.set_title('Most COGS108 students prefer to attend lecture')
ax.set_ylabel('Count')
ax.set_xlabel('Lecture Attendance Preference')
ax.set_xticklabels(('attend', 'not attend'))

Because we have different numbers of males and females, comparing counts is not all that helpful...

We need proportions.

### Stacked Barplots

In [None]:
# warning: this is not seaborn
df2 = df.groupby([ 'lecture_attendance','gender'])['lecture_attendance'].count().unstack('gender').fillna(0)
sub_df2 = np.transpose(df2.div(df2.sum()))

# generate plot
ax = sub_df2.plot(kind='bar', stacked=True, rot=0,
                  title='Lecture Attendance does not appear to differ by gender')

# customize plot
ax.legend(('not attend','attend'), loc='center left', bbox_to_anchor=(1.0, 0.5))
ax.set_ylabel("Proportion of students")
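
As an alternative to the `groupby`/`unstack` chain, within-group proportions can also be computed with `pd.crosstab` and `normalize='index'`. This is a sketch on made-up data, not the cell used in lecture.

```python
import pandas as pd

# Made-up toy data with the same column names as the survey
toy = pd.DataFrame({
    'gender': ['female', 'female', 'male', 'male', 'male', 'male'],
    'lecture_attendance': ['attend', 'not attend',
                           'attend', 'attend', 'attend', 'not attend'],
})

# normalize='index' divides each row by its row total,
# so every gender's proportions sum to 1
props = pd.crosstab(toy['gender'], toy['lecture_attendance'], normalize='index')
print(props)
```

Calling `props.plot(kind='bar', stacked=True, rot=0)` on the result would draw one stacked bar per gender.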

## More Plots

* Boxplots (quantitative + categorical)
* Lineplots (quantitative over time)

### Box Plots

By default, the box delineates the 25th and 75th percentile. The line down the middle represents the median. "Whiskers" extend to show the range for the rest of the data, excluding outliers. Outliers are marked as individual points outside of the whiskers.

In [None]:
# generate boxplots
sns.boxplot(y='statistics', x='gender', data=df)

### Outlier Determination

Outliers show up as individual points on boxplots. But, we don't see any on this boxplot. Let's see why...

In [None]:
# determine the 25th and 75th percentiles
lower, upper = np.percentile(df['statistics'], [25, 75])
lower, upper

In [None]:
# calculate IQR
iqr = upper - lower
iqr

Typically, the inter-quartile range (IQR) is used to determine which values get marked as outliers. The IQR is the 75th percentile minus the 25th percentile. Values more than 1.5 x IQR above the 75th percentile or more than 1.5 x IQR below the 25th percentile are marked as outliers.

In [None]:
# calculate lower cutoff
# values below this are outliers 
lower_cutoff = lower - 1.5 * iqr

# calculate upper cutoff
# values above this are outliers 
upper_cutoff = upper + 1.5 * iqr

lower_cutoff, upper_cutoff
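
To see the rule flag something, here is the same calculation on a small made-up array that contains one extreme value. The numbers and variable names are illustrative only.

```python
import numpy as np

# Made-up values with one obvious extreme point
values = np.array([3, 4, 5, 5, 6, 6, 7, 7, 8, 20])

# Quartiles and IQR
q1, q3 = np.percentile(values, [25, 75])
iqr_demo = q3 - q1

# 1.5 x IQR cutoffs
low_cut = q1 - 1.5 * iqr_demo
high_cut = q3 + 1.5 * iqr_demo

# Boolean mask selects values outside the cutoffs
outliers = values[(values < low_cut) | (values > high_cut)]
print(low_cut, high_cut, outliers)
```

Here the quartiles are 5 and 7, so the cutoffs are 2 and 10, and only the value 20 is flagged.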

Boxplots really shine when you want to look at the range of typical values for a quantitative variable, broken down by a separate categorical variable.

In [None]:
# generate boxplots, this time with a title and axis labels
ax = sns.boxplot(x='gender', y='statistics', data=df)

ax.set_title('Gender not related to comfort with statistics')
ax.set_ylabel('Comfort with Statistics')
ax.set_xlabel('Gender')

In [None]:
# generate boxplots
# we can make sure the colors match what we used earlier for the same groups
ax = sns.boxplot(x='gender', y='statistics', data=df, palette=palette)

ax.set_title('Gender not related to comfort with statistics')
ax.set_ylabel('Comfort with Statistics')
ax.set_xlabel('Gender')

## Histograms (By A Categorical Variable)

The same data plotted as a histogram are not so easily interpretable.

In [None]:
# `distplot` in older versions of `seaborn`
sns.histplot(df.loc[df['gender'] == 'female', 'statistics'], kde=True, color="red")
sns.histplot(df.loc[df['gender'] == 'male', 'statistics'], kde=True, color="purple")
sns.histplot(df.loc[df['gender'] == 'other or prefer not to say', 'statistics'], kde=True)

## Customization: `births` data

Now that we're getting the hang of this, let's see how complicated things can get. We'll return to using a line chart to look at birth patterns over time.

In [None]:
# get the data
births = pd.read_csv('https://github.com/COGS108/Lectures-Sp22/raw/master/03_ethics/data/births.csv')
births.head()
births.year.max()

In [None]:
from datetime import datetime

# calculate values & wrangle
quartiles = np.percentile(births['births'], [25, 50, 75])
mu, sig = quartiles[1], 0.74 * (quartiles[2] - quartiles[0])
births = births.query('(births > @mu - 5 * @sig) & (births < @mu + 5 * @sig)')

births['day'] = births['day'].astype(int)

births.index = pd.to_datetime(10000 * births.year +
                              100 * births.month +
                              births.day, format='%Y%m%d')
births_by_date = births.pivot_table('births',
                                    [births.index.month, births.index.day])
births_by_date.index = [datetime(2012, month, day)
                        for (month, day) in births_by_date.index]


# plot the thing
fig, ax = plt.subplots(figsize=(22, 5))
births_by_date.plot(ax = ax)
ax.get_legend().remove()

What are all those dips? Let's annotate the plot to get a better sense of what's going on.

In [None]:

# plot the thing
fig, ax = plt.subplots(figsize=(22, 7))
births_by_date.plot(ax=ax)
ax.get_legend().remove()

# define style
style = dict(size=16, color='gray')

# add annotation
ax.text('2012-1-1', 3950, "New Year's Day", **style)
ax.text('2012-7-4', 4250, "Independence Day", ha='center', **style)
ax.text('2012-9-4', 4850, "Labor Day", ha='center', **style)
ax.text('2012-10-31', 4600, "Halloween", ha='right', **style)
ax.text('2012-11-25', 4450, "Thanksgiving", ha='center', **style)
ax.text('2012-12-25', 3850, "Christmas ", ha='right', **style)

# label the axes
ax.set(title='USA births by day of year (1969-1988)',
       ylabel='average daily births')

# format the x axis with centered month labels
ax.xaxis.set_major_locator(mpl.dates.MonthLocator())
ax.xaxis.set_minor_locator(mpl.dates.MonthLocator(bymonthday=15))
ax.xaxis.set_major_formatter(plt.NullFormatter())
ax.xaxis.set_minor_formatter(mpl.dates.DateFormatter('%h'))

Annotation directly on plots can help explain the plot to viewers.
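
A note on the wrangling step in the births cell: it estimates the spread as `0.74 * IQR` and drops rows more than 5 of those units from the median. For normally distributed data the IQR is about 1.349 standard deviations, so `0.74 * IQR` approximates the standard deviation while being far less sensitive to junk values. The sketch below checks this on synthetic data; all numbers and names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic "daily counts" with a known standard deviation of 500
sample = rng.normal(loc=4800, scale=500, size=100_000)

# Append a few junk entries, like data-entry errors
sample_with_junk = np.append(sample, [1.0, 2.0, 90_000.0])

# Robust estimate: median for the center, 0.74 * IQR for the spread
q25, q50, q75 = np.percentile(sample_with_junk, [25, 50, 75])
robust_sigma = 0.74 * (q75 - q25)

# Ordinary standard deviation is inflated by the junk entries
plain_sigma = sample_with_junk.std()

# Keep values within 5 robust sigmas of the median
kept = sample_with_junk[np.abs(sample_with_junk - q50) < 5 * robust_sigma]

print(round(robust_sigma), round(plain_sigma), sample_with_junk.size - kept.size)
```

The robust estimate stays close to 500 while the ordinary standard deviation is pulled upward, which is why the clipping is based on quartiles.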

## Saving Plots

While we're using a Jupyter notebook right now, you won't always be. So, you'll need to know how to save figures.

In [None]:
# save fig to images directory
# this will only work if you have 
# an images directory in your working directory
fig.savefig('images/my_figure.png', dpi=300)

Note that the file format is inferred from the extension you specify in the filename.
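
Outside a notebook, a script along these lines would create the output directory first and save the same figure in two formats. This is a sketch: the temporary directory, the file name `demo_figure`, and the plotted values are assumptions chosen so the example runs anywhere.

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt

# Create the output directory up front so savefig does not fail
out_dir = Path(tempfile.mkdtemp()) / 'images'
out_dir.mkdir(parents=True, exist_ok=True)

# A throwaway figure with made-up values
demo_fig, demo_ax = plt.subplots()
demo_ax.plot([1, 2, 3], [2, 4, 8])

# The extension alone determines the file format
saved = []
for ext in ('png', 'svg'):
    path = out_dir / f'demo_figure.{ext}'
    demo_fig.savefig(path, dpi=300, bbox_inches='tight')
    saved.append(path)

plt.close(demo_fig)
print([p.name for p in saved])
```

`bbox_inches='tight'` trims surrounding whitespace, which helps when labels or legends sit outside the axes.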

To see which file types are supported:

In [None]:
fig.canvas.get_supported_filetypes()

## Viewing Saved Plots

Once a plot is saved, it may be helpful to view it through IPython or your notebook, using the cell below.

You can also embed a saved image in a Markdown cell with `![my figure](images/my_figure.png)`, or with an HTML `<img>` tag.

In [None]:
# to see contents of a saved image
from IPython.display import Image
Image('https://github.com/COGS108/Lectures-Sp22/raw/master/03_ethics/images/my_figure.png')