# Data Visualization with Matplotlib and Seaborn

Agenda: 
- Explain what types of graphs best convey specific relationships
- Use the subplots syntax to create a graph
    - Line
    - Bar/hbar
    - Scatter
    - Hist
- Customize different aspects of a graph
    - labels (title, axis)
    - Linestyle 
    - Colors
- Create multiple graphs in one figure


Take 2 minutes to peruse some plot examples here:
[Python Graphing Gallery](https://python-graph-gallery.com) or [Data Viz Project](https://datavizproject.com/)

Then, write down what types of plots may be appropriate to visualize the scenarios below.

### Scenario 1: You would like to display counts of coffee shops in each Chicago zip code

In [None]:
 # what are some appropriate plots?

### Scenario 2: You would like to visualize the correlation between miles per gallon of a car and horsepower

In [None]:
# what are some appropriate plots?

### Scenario 3: You would like to visualize the distribution of blood pressure readings of American males between 25 and 35

In [None]:
# what are some appropriate plots?

## Why Visualize Data?
or why can’t we just hand someone a table of data?

Let's load the iris dataset. This is a famous dataset, bundled with scikit-learn, that is often used to learn about classification.

In [None]:
# One of several libraries you will get very used to importing.
# https://matplotlib.org/3.1.1/index.html
import matplotlib.pyplot as plt

# Two well worn data sets
from sklearn.datasets import load_iris, load_wine
import pandas as pd

data = load_iris()
df_iris = pd.DataFrame(data['data'], columns=data['feature_names'])
df_iris['target'] = data['target']

Here is an image of a virginica iris, a species distinguished by its relative petal and sepal lengths.

![virginica_iris](iris_virginica.jpg)

### Dataframe vs Graph: Which do you prefer?

In [None]:
# I like to use sample rather than head because it gives me a better idea of the distribution of observations
df_iris.sample(5)

In [None]:
fig, ax = plt.subplots(figsize=(8, 5))

# Iterate through each type of flower and plot them using different colors
for flower in df_iris['target'].unique():
    subset_df = df_iris[df_iris['target'] == flower]
    x = subset_df['sepal length (cm)']
    y = subset_df['petal length (cm)']
    
    ax.scatter(x, y, label=data['target_names'][flower])

# Label your axes!
ax.set_ylabel('petal length (cm)')
ax.set_xlabel('sepal length (cm)')
ax.set_title('Petal length vs Sepal Length for Three Species of Flowers')
ax.legend();

What information in this graph jumps out to you?

In [None]:
# your thoughts here

In [None]:
#__SOLUTION__
'''
some ideas:
   - The data is separated into three categories.
   - It looks like there are several clear lines that can be drawn to separate the groups along petal length.
   - Separation of versicolor and virginica is more difficult than that between setosa and the other two.
   - There is a slight upward correlation between petal length and sepal length in the versicolor and virginica groups.
   - There appears to be a fairly balanced number of samples across the groups.
   - There appears to be one potential outlier.
'''

In your presentation decks, you will no doubt be tempted to print out the head of a dataframe, take a screenshot, and plop it in the middle of a slide. We all have that instinct; the dataframe will become one of your most cherished objects. If you put one in your deck, you will no doubt hear one of us gently request its replacement with some other figure.

## The Effectiveness of Visualizations

- People are highly visual and can synthesize visual information much more quickly than rows and columns of numbers
- Precognitive understanding of the data
- Visual representations can be much more viscerally persuasive 

## What Makes an Effective Visualization?

- Each graph should have a clear point it is trying to make. Understanding the insight you are trying to convey will guide the decision making process for what kind of graph will be most effective

- Know your audience! Come up with a use case and audience to pitch your visualizations

- Choose the correct graph for the relationship you are trying to communicate

- Label your axes and graph! It should not be difficult for someone to understand what your graph is trying to represent

- People have unconscious responses to visuals which will affect the way they interpret information. Good visualization makes use of these natural shortcuts in cognition to convey information more efficiently
    - Red and down tend to be read as negative, while green and up are read as positive
    - Lighter hues are seen as lower values and darker hues as higher values
    - Axes start at zero
        
__Note:__ All of these 'rules' can be broken but know that you will be working against most people's first instinct
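
As a rough illustration of the "darker means higher" convention, the sketch below colors a scatter plot with a sequential colormap so that larger values get darker markers. The data is made up for this example, and `Blues` is just one of matplotlib's built-in sequential colormaps.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = rng.uniform(0, 10, 50)
values = x + y  # the quantity we encode with color

fig, ax = plt.subplots()
# A sequential colormap maps low values to light hues and high values to dark hues
points = ax.scatter(x, y, c=values, cmap='Blues')
fig.colorbar(points, ax=ax, label='x + y')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Darker markers encode higher values')
```

Swapping in a colormap that runs from dark to light would work against the reader's first instinct, which is the kind of rule-breaking the note above warns about.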

## How to Lie with Graphs

- Graphs can be misleading
- Consciously or unconsciously, people make design decisions that lead viewers towards their own conclusions about the data

- Examples of dark patterns
    - Changing the axis scale
    - Using two different y-axis scales to compare trends
    - Showing cumulative data, which never decreases for non-negative quantities, to hide a downturn in a trend
    - Pie charts (comparing angles is not something people are good at); just use a bar chart
    - Inconsistent units
    - Not showing all of the data for motivated reasons
    - Percentages not adding up to 100
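
To see how much the axis scale alone can change a message, here is a minimal sketch that draws the same two made-up values twice: once with a truncated y-axis and once with the axis starting at zero. The group names and numbers are invented for illustration.

```python
import matplotlib.pyplot as plt

# Made-up values for illustration only
groups = ['A', 'B']
values = [50, 52]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))

# Truncated y-axis: a 4% difference looks enormous
ax1.bar(groups, values)
ax1.set_ylim(49, 53)
ax1.set_title('Truncated axis')

# Zero-based y-axis: the same difference looks modest
ax2.bar(groups, values)
ax2.set_ylim(0, 60)
ax2.set_title('Axis starting at zero')

plt.tight_layout()
```

Both panels are technically accurate; only the second one matches what most readers assume a bar's height means.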

<img src="data/pie-chart-misleading.png">

image: http://flowingdata.com/2009/11/26/fox-news-makes-the-best-pie-chart-ever/

_____



<img src="data/usa-today-2.png">

# Common Charts and Their Uses

## Bar charts

Bar charts are everywhere: PowerPoint slides, billboards, and the evening news. They are used to show the relationship between a numerical and a categorical variable.

For example, a bar chart can show how a numerical variable, such as total sales, changes across time periods.

There is a lot of code below. For now we will focus on the output: two bar charts of total shampoo sales, one by year and one by month.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

shampoo = pd.read_csv('data/sales-of-shampoo-over-a-three-ye.csv')

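# Year and month are combined in a single string column in this shampoo dataset.
# Use custom functions to extract each part.

def get_year(date):
    # The first character holds the year
    return date[0]

shampoo['year'] = shampoo['Month'].apply(get_year)
# numeric_only=True keeps pandas from trying to sum the text columns
total_sales_per_year = shampoo.groupby('year').sum(numeric_only=True)[:-1]

def get_month(date):
    # Characters from index 2 onward hold the month
    return date[2:]

shampoo['month'] = shampoo['Month'].apply(get_month)
total_sales_per_month = shampoo.groupby('month').sum(numeric_only=True)[:-1].sort_values(by='month')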
months = ['January', 'February', 'March', 'April', 
         'May', 'June', 'July', 'August', 'September', 
         'October', 'November', 'December']


fig, (ax1,ax2) = plt.subplots(1,2)
ax1.bar(x = list(total_sales_per_year.index), height=total_sales_per_year.values.flatten())
ax1.set_title('Shampoo Sales Over\n Three Year Period')
ax1.set_xlabel('Year')
ax1.set_ylabel('Total Sales')

# If we have long labels on the x-axis, we can get around it with horizontal bar charts.

ax2.barh(y = list(total_sales_per_month.index), width=total_sales_per_month.values.flatten())
ax2.set_yticks(range(0,12))
ax2.set_yticklabels(months)
ax2.set_title('Total shampoo sales per month')
ax2.set_xlabel('Total Sales')
ax2.set_ylabel('Month')
plt.tight_layout()

## Scatter Plots

Scatter plots are also very common.  They allow one to visualize the relationship of two variables. 

In the plots below, we see different correlations between variables:



In [None]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston
boston = load_boston()
boston_df = pd.DataFrame(boston['data'])
boston_df.columns = boston['feature_names']

fig,(ax1,ax2,ax3) = plt.subplots(1,3, figsize=(10,5))

ax1.scatter(boston_df.RM, boston_df.LSTAT, c='red')
ax1.set_xlabel('Rooms')
ax1.set_ylabel('LSTAT')
ax1.set_title('Negative Correlation')

ax2.scatter(boston_df.INDUS, boston_df.NOX, c='green')

ax2.set_xlabel('INDUS')
ax2.set_ylabel('NOX')
ax2.set_title('Positive Correlation')

ax3.scatter(boston_df.RM, boston_df.DIS)

ax3.set_xlabel('Rooms')
ax3.set_ylabel('Distance to Major Employment Centers')
ax3.set_title('Low Correlation')


plt.tight_layout()

## Line Plot

Line plots track the change of a single variable over time. They are generally better than bar graphs over shorter periods of time.


In [None]:
fig, ax = plt.subplots()

ax.plot(shampoo.index, shampoo.iloc[:,1], color='g')
ax.set_xticks(range(0,36), minor=True)
ax.set_xticks(range(12,36, 12))
ax.set_xticklabels(['year_2', 'year_3'])
ax.set_title('Shampoo Sales Across 3 Years')

## Histograms

We will get further into histograms in mod 2, but it is good to get familiar with them sooner rather than later.

Histograms are often confused with bar charts, since they look somewhat similar. The big difference, however, is that histograms visualize the distribution of a continuous variable, whereas bar charts show values for discrete categories. You can remember this because the bins of a histogram don't have spaces between them.
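
As a small sketch of how binning works, the example below draws the same made-up continuous data with two different `bins` values. The `heights` data is randomly generated for illustration; `ax.hist` returns the count per bin and the bin edges, which makes it easy to inspect what was drawn.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up continuous data for illustration only
rng = np.random.default_rng(42)
heights = rng.normal(loc=170, scale=10, size=500)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))

# hist returns the count in each bin, the bin edges, and the drawn patches
counts_few, edges_few, _ = ax1.hist(heights, bins=5)
ax1.set_title('5 bins')

counts_many, edges_many, _ = ax2.hist(heights, bins=30)
ax2.set_title('30 bins')

for ax in (ax1, ax2):
    ax.set_xlabel('Height (cm)')
    ax.set_ylabel('Count')
plt.tight_layout()
```

Too few bins can hide the shape of the distribution, while too many can make it look noisy, so it is usually worth trying a few values.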

Here is the distribution of the target variable, sales price, from the famous Boston Housing dataset.

In [None]:
from sklearn.datasets import load_boston
boston = load_boston()
sales_price = boston['target']

fig, ax = plt.subplots()
ax.hist(sales_price)
ax.set_xlabel('House Price ($1000s)');
ax.set_ylabel('Count')
ax.set_title('Distribution of Boston House Prices');

## Box Plots

Box plots (or box-and-whisker plots), like histograms, show the distribution of a continuous variable. They have a median line, where half the data falls above and half below. The box represents the interquartile range (IQR), and the whiskers, by matplotlib's default, extend to the most extreme data points within 1.5 times the IQR of the box edges; points beyond the whiskers are drawn individually. We can detect skew from a box plot, and it is also a quick way to detect outliers.

Again, we will get further into boxplots in mod 2.
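
To make the pieces of a box plot concrete, here is a minimal sketch that computes the quartiles and the 1.5 × IQR whisker limits by hand and compares them with `matplotlib.cbook.boxplot_stats`, the helper matplotlib uses to compute box plot statistics. The `values` array is made up for illustration.

```python
import numpy as np
from matplotlib.cbook import boxplot_stats

# Made-up data with one obvious outlier
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Default whisker rule: reach to the farthest data point within 1.5 * IQR of the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

# boxplot_stats returns one dictionary of statistics per dataset
stats = boxplot_stats(values)[0]
print(median, stats['med'])
print(outliers, stats['fliers'])
```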

In [None]:
fig, ax = plt.subplots()
ax.boxplot(sales_price)
# The values run along the y axis in a vertical box plot, so label that axis
ax.set_ylabel('House Price ($1000s)')
ax.set_title('Distribution of Boston House Prices');

## Pie Charts
Love em or hate em, you'll no doubt see em.
We won't spend much time on them, but know that you can use matplotlib to plot them.

In [None]:
fig, ax = plt.subplots()
# sort_index orders the counts by target code so they line up with target_names
ax.pie(df_iris['target'].value_counts().sort_index(), labels=list(data['target_names']), autopct='%1.0f%%');

ax.set_title('Target Distribution of Iris Dataset')
plt.tight_layout()


## How to Matplotlib

<img src="data/matplotlib_anatomy.png">

Explanation of non-obvious terms

__Figure__ - This is the sheet of paper all of your graphing sits on. 

__Axes__ - An Axes is an individual plot (not to be confused with an axis, the x or y number line). You can have multiple Axes on one figure

__Major/Minor Ticks__ - The large and small dashes on the x and y axes

__Markers__ - In a scatter plot, each of the points is referred to as a marker

__Spines__ - The lines that bound each Axes
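
The terms above map directly onto objects you can touch in code. The following is a small sketch with made-up data, showing where the figure, an Axes, ticks, markers, and spines show up:

```python
import matplotlib.pyplot as plt

# Figure: the whole canvas; ax: one Axes (plot) on it
fig, ax = plt.subplots()

# Each data point is drawn with a marker
ax.plot([0, 1, 2, 3], [0, 1, 4, 9], marker='o')

# Major ticks every 1 unit on the x axis, minor ticks halfway between
ax.set_xticks([0, 1, 2, 3])
ax.set_xticks([0.5, 1.5, 2.5], minor=True)

# Spines are the four lines that frame the Axes; hide the top and right ones
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

print(fig.axes)          # the list of Axes that live on this Figure
print(list(ax.spines))   # names of the spines
```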

## Plotting Syntax

- There are many different ways to create plots, but we strongly suggest using the subplots method
    - This is useful for extensibility
    - Gives you access to the figure and the individual Axes in a plot
    - More fine-grained control over customizing your plot
    - Easily create additional Axes on your figure
    - This syntax is a good level of abstraction
        - You can go deeper into the API, but this should give you immediate access to most tools you will need for whatever plot you are making
    - Flatiron Specific
        - Plotting code will be more easily readable for other students and instructors
        - You don’t need to remember many different ways to organize your code
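
As a quick sketch of why the subplots syntax extends well, the example below asks for a 2 × 2 grid and draws a different plot type on each Axes. The `x` and `y` values are made up for illustration.

```python
import matplotlib.pyplot as plt

# Made-up data for illustration only
x = [1, 2, 3, 4]
y = [2, 4, 1, 3]

# With nrows and ncols both greater than 1, subplots returns a 2D array of Axes
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(8, 6))

axes[0, 0].plot(x, y)
axes[0, 0].set_title('Line')

axes[0, 1].bar(x, y)
axes[0, 1].set_title('Bar')

axes[1, 0].scatter(x, y)
axes[1, 0].set_title('Scatter')

axes[1, 1].hist(y, bins=4)
axes[1, 1].set_title('Hist')

# Figure-level methods act on the whole figure, Axes-level methods on one plot
fig.suptitle('Four plot types on one figure')
fig.tight_layout()
```

The same `set_title`, `set_xlabel`, and `set_ylabel` methods work on every Axes, whichever plot type it holds.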

Let's recreate the Boston Housing scatter plot from above, focusing on what the code is doing.

Here are links to the [matplotlib documentation](https://matplotlib.org/index.html) as well as the [Axes object documentation](https://matplotlib.org/api/axes_api.html):

In [None]:
# use boston data
boston_df = pd.DataFrame(boston['data'])
boston_df.columns = boston['feature_names']

# Create figure and axis objects
# We recommend using this syntax.  You could use plt.scatter(), but it is less flexible.
fig, ax = plt.subplots()

# Use axes object to visualize a scatter plot

ax.scatter(boston_df.RM, boston_df.LSTAT, c='red');


That is the meat of the plot, but we need to add some labels.
We do this via the axes object methods set_xlabel, set_ylabel.

Use the above mentioned methods to set the xlabel and ylabel to 'Rooms' and 'LSTAT' respectively.
Set the title using set_title to whatever you deem appropriate.

In [None]:
# Create figure and axis objects
boston_df = pd.DataFrame(boston['data'])
boston_df.columns = boston['feature_names']


fig, ax = plt.subplots()
# Your code here

In [None]:
#__SOLUTION__

# Create figure and axis objects
boston_df = pd.DataFrame(boston['data'])
boston_df.columns = boston['feature_names']


fig, ax = plt.subplots()

ax.scatter(boston_df.RM, boston_df.LSTAT, c='red')
ax.set_xlabel('Rooms')
ax.set_ylabel('LSTAT')
ax.set_title('Negative Correlation')

Each type of graph has some fun levers to tweak.  Let's try changing the opacity, marker size, and color of the scatter.

In [None]:
boston_df = pd.DataFrame(boston['data'])
boston_df.columns = boston['feature_names']

fig, ax = plt.subplots()

ax.scatter(boston_df.RM, boston_df.LSTAT, c='pink', s=50, alpha=.5)
ax.set_xlabel('Rooms')
ax.set_ylabel('LSTAT')
ax.set_title('Messing with parameters')
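
Line plots have their own levers. As a minimal sketch with made-up sales numbers, the example below varies `color`, `linestyle`, `linewidth`, and `marker` on two lines drawn on the same Axes:

```python
import matplotlib.pyplot as plt

# Made-up data for illustration only
months = [1, 2, 3, 4, 5, 6]
sales_a = [10, 12, 15, 14, 18, 21]
sales_b = [8, 9, 9, 13, 12, 16]

fig, ax = plt.subplots()

# ax.plot returns a list of Line2D objects; the trailing comma unpacks the single line
line_a, = ax.plot(months, sales_a, color='purple', linestyle='--', linewidth=2, label='Product A')
line_b, = ax.plot(months, sales_b, color='orange', linestyle=':', marker='o', label='Product B')

ax.set_xlabel('Month')
ax.set_ylabel('Sales')
ax.set_title('Messing with line parameters')
ax.legend()
```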

Now let's use the subplots syntax to create two plots side by side.  The first plot will be the one you created above; the second, which should appear to the right, will be a histogram of the target value, sales price.  Play around with the bins parameter to see how the plot is affected.

In [None]:
boston_df = pd.DataFrame(boston['data'])
boston_df.columns = boston['feature_names']
sales_price = boston['target']
fig, (ax1, ax2) = plt.subplots(1,2)

# Your code here


In [None]:
#__SOLUTION__

boston_df = pd.DataFrame(boston['data'])
boston_df.columns = boston['feature_names']
sales_price = boston['target']


fig, (ax1, ax2) = plt.subplots(1,2)

ax1.scatter(boston_df.RM, boston_df.LSTAT, c='red')
ax1.set_xlabel('Rooms')
ax1.set_ylabel('LSTAT')
ax1.set_title('Negative Correlation')

ax2.hist(sales_price, bins=50)
ax2.set_title('Distribution of Boston\n Sales Prices')
plt.tight_layout()

## Let's work with an imagined bar chart of outcomes for people who contracted a life-threatening disease

Through the barplot, we will learn to manipulate the figure size, tick labels, and color of specific parts of the plot.

In [None]:
# Let's imagine these are counts of our target variable categories in a classification problem

outcomes = [100,500,200,20]
outcome_labels = ['recovered', 'partially_recovered', 'deceased', 'other']


In [None]:

fig, ax = plt.subplots()

# we set our labels to be outcome labels
ax.bar(outcome_labels, outcomes)

In [None]:
# Let's make the figure a little bit taller

fig, ax = plt.subplots()
fig.set_figheight(5)
fig.set_figwidth(5)
ax.bar(outcome_labels, outcomes)

Uh oh, that messed up our x-tick labels.
We can use the tick_params method to rotate the labels.
[tick_params](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.tick_params.html#matplotlib.axes.Axes.tick_params)

In [None]:
# Your code here

In [None]:
#__SOLUTION__
fig, ax = plt.subplots()
fig.set_figheight(5)
fig.set_figwidth(5)
ax.bar(outcome_labels, outcomes)
ax.tick_params(axis='x', labelrotation=45)

In [None]:
# Let's say we want to put emphasis on the people who recovered by changing the color of a bar.
fig, ax = plt.subplots()
fig.set_figheight(5)
fig.set_figwidth(5)
bars = ax.bar(outcome_labels, outcomes)
ax.tick_params(axis='x', labelrotation=45)
bars[0].set_color('green')

## Layering

![cake](https://media.giphy.com/media/XMgCFjsCSARxK/giphy.gif)

If we want to add multiple plots on one Axes, we can simply call the plotting methods one after the other.
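
Here is a minimal sketch of layering, using made-up quarterly numbers: a bar chart is drawn first, then a line is added on top of it by calling a second plotting method on the same Axes.

```python
import matplotlib.pyplot as plt

# Made-up data for illustration only
quarters = ['Q1', 'Q2', 'Q3', 'Q4']
sales = [120, 150, 90, 180]
target = [130, 130, 130, 130]

fig, ax = plt.subplots()

# First layer: bars for the actual sales
ax.bar(quarters, sales, color='lightblue', label='Sales')
# Second layer: a line for the target, drawn on the same Axes
ax.plot(quarters, target, color='red', linestyle='--', label='Target')

ax.set_xlabel('Quarter')
ax.set_ylabel('Sales')
ax.set_title('Sales vs target')
ax.legend()
```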

### Quick note: style sheets are cool

Find another style from the Docs and set the style. Once you've set the style try rerunning older graphs:

[Style Sheets](https://matplotlib.org/3.1.1/gallery/style_sheets/style_sheets_reference.html)

In [None]:
style = 'fivethirtyeight'
plt.style.use(style)
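
`plt.style.use` changes the style for every figure drawn afterwards. If you only want to try a style on one figure, one option is `plt.style.context`, sketched below with the built-in `ggplot` style and made-up data:

```python
import matplotlib.pyplot as plt

# plt.style.available lists the style names that matplotlib knows about
print('ggplot' in plt.style.available)

# plt.style.context applies a style only inside the with-block,
# so other figures keep whatever style was set before
with plt.style.context('ggplot'):
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [2, 4, 3])
    ax.set_title('Drawn with the ggplot style')
```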

## Saving your figures

Let's split the shampoo sales into years, and plot three line plots, one on top of the other

In [None]:
fig, ax = plt.subplots()
# Plot each year's sales as its own line on the same Axes
for year, year_df in shampoo.groupby('year'):
    ax.plot(year_df.month, year_df.iloc[:,1])
    
ax.legend(['year_1', 'year_2', 'year_3'])
ax.set_xlabel('Month')
ax.set_ylabel('Shampoo Sales')
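
Once a figure looks the way you want, `fig.savefig` writes it to disk. The sketch below is self-contained and uses a stand-in figure with made-up data; the file name and the temporary folder are hypothetical choices for this example, so point `out_path` wherever you keep your images.

```python
import os
import tempfile
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [3, 1, 2])
ax.set_title('Figure to save')

# Hypothetical output location; a temporary folder keeps the example self-contained
out_path = os.path.join(tempfile.mkdtemp(), 'example_plot.png')

# dpi sets the resolution; bbox_inches='tight' trims extra whitespace around the plot
fig.savefig(out_path, dpi=150, bbox_inches='tight')
print(os.path.exists(out_path))
```

The file extension determines the output format, so changing `.png` to `.pdf` or `.svg` would save a vector version instead.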