<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Guided Practice: Explore Python Data Visualization

_Authors: Alexander Combs (New York City), Dave Yerrington (San Francisco), and Kevin Markham (Washington, D.C.)_

---

In this guided practice lab you will use Pandas, Matplotlib, and Seaborn to create simple plots.

We'll cover plotting line plots, scatter plots, bar plots, and histograms, and how to manipulate the style of your plots with Matplotlib.

## Learning Objectives

- **Practice** using different types of plots.
- **Use** Pandas methods for plotting.
- **Create** line plots, bar plots, histograms, and box plots.
- **Know** when to use Seaborn or advanced Matplotlib.

## Lesson Guide

- [Line Plots](#line-plots)
- [Bar Plots](#bar-plots)
- [Histograms](#histograms)
    - [Grouped Histograms](#grouped-histograms)
- [Box Plots](#box-plots)
    - [Grouped Box Plots](#grouped-box-plots)
- [Scatter Plots](#scatter-plots)
- [Using Seaborn](#seaborn)
- [OPTIONAL: Understanding Matplotlib (Figures, Subplots, and Axes)](#matplotlib)
- [OPTIONAL: Additional Topics](#additional-topics)
- [OPTIONAL: Plotly Express](#plotly-express)
- [Summary](#summary)

### Introduction

In this lab, we introduce how plotting works in Pandas and Matplotlib. Pandas uses Matplotlib behind the scenes to make plots, so Pandas plotting methods often use the same parameter names as the corresponding Matplotlib methods. You can also combine Matplotlib functions with Pandas methods to alter a plot after drawing it. For example, Matplotlib's `xlabel` and `title` functions set the plot's x-axis label and title, respectively, after the plot is drawn.

As we explore different types of plots, notice:

1. Different types of plots are drawn very similarly -- they even tend to share parameter names.
2. In Pandas, calling `plot()` on a `DataFrame` is different from calling it on a `Series`. Although both methods are named `plot`, they may take different parameters.
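
As a rough illustration of these points, the sketch below draws a `DataFrame` and a `Series` on the same axes and then labels the result with Matplotlib functions. The `toy` DataFrame and its column names are made up for this example and are not part of the lab's datasets.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data, used only for this illustration.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(10, 2)), columns=['a', 'b'])

# DataFrame.plot draws one line per column and returns a Matplotlib Axes.
ax = toy.plot()

# Series.plot draws a single line; passing ax= reuses the same Axes.
toy['a'].cumsum().plot(ax=ax, style='--', label='cumulative a')
ax.legend()

# Matplotlib functions act on the current axes, so they can label the plot afterwards.
plt.xlabel('Row index')
plt.title('Toy line plot')
```

Because every Pandas plotting call returns a Matplotlib `Axes`, anything Matplotlib can do to an `Axes` is available after the Pandas call.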

Toward the end of the lab, we will show some motivational plots using Seaborn, a popular statistical plotting library, and go into more depth on how Matplotlib works.

### Pandas Plotting Documentation

[Link to Documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html)

In [None]:
from IPython.display import HTML

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

# Increase default figure and font sizes for easier viewing.
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14

### Load Datasets for Lesson

#### Create fake data for examples.

In [None]:
df = pd.DataFrame(np.random.randn(10, 4), 
                  columns=['col1', 'col2', 'col3', 'col4'],
                  index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])

#### Load in "real" data sets for visualization examples.

In [None]:
# Read in the Boston housing data.
housing_csv = '../datasets/boston_housing_data.csv'
housing = pd.read_csv(housing_csv)

# Read in the drinks data.
drink_cols = ['country', 'beer', 'spirit', 'wine', 'liters', 'continent']
drinks_csv = '../datasets/drinks.csv'
drinks = pd.read_csv(drinks_csv, header=0, names=drink_cols, na_filter=False)

# Read in the ufo data.
ufo_csv = '../datasets/ufo.csv'
ufo = pd.read_csv(ufo_csv)
# Convert Time to a datetime and extract the year into a new Year column.
ufo['Time'] = pd.to_datetime(ufo.Time)
ufo['Year'] = ufo.Time.dt.year

# Read in the wine data.
wine_csv = '../datasets/winequality.csv'
wine = pd.read_csv(wine_csv)

#### Boston Housing Data Dictionary
This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Mass. It was obtained from the StatLib archive (http://lib.stat.cmu.edu/datasets/boston) and has been used extensively throughout the literature to benchmark algorithms. However, these comparisons were primarily done outside of Delve and are thus somewhat suspect. The dataset is small, with only 506 cases.

- CRIM - per capita crime rate by town
- ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS - proportion of non-retail business acres per town.
- CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
- NOX - nitric oxides concentration (parts per 10 million)
- RM - average number of rooms per dwelling
- AGE - proportion of owner-occupied units built prior to 1940
- DIS - weighted distances to five Boston employment centres
- RAD - index of accessibility to radial highways
- TAX - full-value property-tax rate per \\$10,000
- PTRATIO - pupil-teacher ratio by town
- LSTAT - % lower status of the population
- MEDV - Median value of owner-occupied homes in \\$1000's


#### Wine Quality Data Dictionary
https://archive.ics.uci.edu/ml/datasets/wine+quality
- type (red or white)
- fixed acidity
- volatile acidity
- citric acid
- residual sugar
- chlorides
- free sulfur dioxide
- total sulfur dioxide
- density
- pH
- sulphates
- alcohol
##### Output variable (based on sensory data):
- quality (score between 0 and 10)


<a id="line-plots"></a>
## Line plots: Show the trend of a numerical variable over time
---

#### Line Plot With `DataFrame`

In [None]:
df.plot();

#### Change Plot Size

In [None]:
# Technically the figsize is 15 "inches" (width) by 8 "inches" (height)
#   The figure is specified in inches for printing -- you set a dpi (dots/pixels per inch) elsewhere
df.plot(figsize=(15,8)); # width, height
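
To make the inches-versus-pixels relationship concrete, here is a minimal sketch that multiplies the figure size by the dpi. The dpi value of 100 is an assumption chosen for round numbers; your environment's default may differ.

```python
import matplotlib.pyplot as plt

# Assumed values for illustration: a 15 x 8 inch figure at 100 dots per inch.
fig, ax = plt.subplots(figsize=(15, 8), dpi=100)

# The size in pixels is the size in inches multiplied by the dpi.
width_in, height_in = fig.get_size_inches()
width_px = width_in * fig.dpi
height_px = height_in * fig.dpi
print(f'{width_in} x {height_in} inches -> {width_px:.0f} x {height_px:.0f} pixels')
```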

#### Change Plot Color

In [None]:
df['col1'].plot(color='crimson');

In [None]:
df['col2'].plot(color='orange');

In [None]:
df['col1'].plot(color='crimson');
df['col2'].plot(color='orange');

#### Change Individual Line Style With Dictionary

In [None]:
# : - dotted line, v - triangle_down
# r - red, b - blue
df[['col1', 'col4']].plot(style={'col1': ':r', 'col4': ':vb'});

#### How to change the style of individual lines
+ [Line styles](https://matplotlib.org/gallery/lines_bars_and_markers/line_styles_reference.html)
+ [Marker styles](https://matplotlib.org/api/markers_api.html)
+ [Colors](https://matplotlib.org/api/colors_api.html)
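
The style strings above pack a line style, a marker, and a color into one short code. As a hedged sketch (the `toy` DataFrame is hypothetical), the following draws two styled lines and then prints what Matplotlib stored for each one, which is a handy way to check how a style string was interpreted.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data, used only for this illustration.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(10, 2)), columns=['a', 'b'])

# ':vb' = dotted line (:), triangle_down markers (v), blue (b).
# '--r' = dashed line (--), no marker, red (r).
ax = toy.plot(style={'a': ':vb', 'b': '--r'})

# Inspect the properties Matplotlib recorded for each line.
for line in ax.lines:
    print(line.get_label(), line.get_linestyle(), line.get_marker(), line.get_color())
```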

### Line Plots with UFO Dataset

In [None]:
ufo.head()

In [None]:
# Count the number of ufo reports each year (and sort by year).
ufo.Year.value_counts().sort_index()

In [None]:
# Compare with line plot -- UFO sightings by year. (Ordering by year makes sense.)
ufo.Year.value_counts().sort_index().plot();

**COMMON MISTAKE:** Using a line plot when the x-axis has no meaningful order.

For example, plotting the counts by state below suggests a trend where none exists.

In [None]:
ufo.State.value_counts().plot();

**Important:** A line plot is the wrong type of plot for this data. Any set of states can be rearranged to suggest a misleading trend, as the descending order of counts does here. A bar plot is more appropriate for this data because it does not imply a trend based on order.
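
The contrast can be seen side by side in a small sketch. The state labels below are made-up sample values, not rows from the UFO dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical category labels, used only for this illustration.
counts = pd.Series(['CA', 'TX', 'CA', 'NY', 'TX', 'CA']).value_counts()

fig, (ax_line, ax_bar) = plt.subplots(1, 2, figsize=(12, 4))

# The line connects unrelated categories and suggests a downward trend.
counts.plot(ax=ax_line, title='Line plot (misleading)')

# The bars compare categories without implying any ordering between them.
counts.plot(kind='bar', ax=ax_bar, title='Bar plot (appropriate)')
```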

#### Sorting Alphabetically

In [None]:
ufo.State.value_counts().sort_index(ascending=False).plot(figsize=(15,8), kind='barh');

#### Sorting by Value

In [None]:
ufo.State.value_counts(ascending=True).plot(figsize=(15,8), kind='barh');

### Exercises

#### 1. UFO Sighting

In [None]:
# Sort UFO dataset by Time


In [None]:
# Sort UFO Dataset by time for the year 1980


In [None]:
#Sort UFO dataset by Time and format the line as a green solid line using Dictionary method. 
#Set the figure size to 12 x 8


In [None]:
#Sort UFO dataset by Time and format the line as a green solid line using non-Dictionary method. 


<a id="bar-plots"></a>
## Bar Plots: Show a numerical comparison across different categories
---

In [None]:
drinks.head()

In [None]:
# Count the number of countries in each continent.
drinks.continent.value_counts()

In [None]:
# Compare with bar plot.
drinks.continent.value_counts().plot(kind='bar');

In [None]:
# Change the color to all the same color
drinks.continent.value_counts().plot(kind='bar', color='gray');

In [None]:
# Position horizontally
drinks.continent.value_counts().plot(kind='barh', color='gray');

In [None]:
# Change sort order
drinks.continent.value_counts(ascending=True).plot(kind='barh', color='gray');

In [None]:
#Emphasize a bar
drinks.continent.value_counts(ascending=True).plot(kind='barh', color = ['gray','gray','gray','gray','gray','red']);

In [None]:
# Calculate the mean alcohol amounts for each continent.
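# numeric_only=True skips the non-numeric country column, which recent Pandas versions cannot average.
drinks.groupby('continent').mean(numeric_only=True)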

In [None]:
# Side-by-side bar plots
drinks.groupby('continent').mean(numeric_only=True).plot(kind='bar');

In [None]:
# Sort the continent x-axis by a particular column.
drinks.groupby('continent').mean(numeric_only=True).sort_values('beer').plot(kind='bar');

In [None]:
# Stacked bar plot (with the liters comparison removed!)
drinks.groupby('continent').mean(numeric_only=True).drop('liters', axis=1).plot(kind='bar', stacked=True);

In [None]:
# Stacked bar plot (with the liters comparison removed!)  Controlling the colors.
drinks.groupby('continent').mean(numeric_only=True).drop('liters', axis=1).plot(kind='bar', stacked=True, color=['blue','gray','crimson']);
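
To see what the group-then-plot pattern produces, here is a minimal sketch on a tiny made-up table. The rows and values are hypothetical and only mimic the shape of the drinks data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical rows that mimic the structure of the drinks data.
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'beer': [100, 200, 50, 150],
    'wine': [10, 30, 20, 40],
    'continent': ['EU', 'EU', 'AS', 'AS'],
})

# One row per continent, one column per numeric variable.
means = toy.groupby('continent').mean(numeric_only=True)
print(means)

# Each row becomes one stacked bar; each column becomes one colored segment.
ax = means.plot(kind='bar', stacked=True)
```

The shape of the grouped table determines the plot: two continents and two numeric columns give two bars with two segments each.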

### Using a `DataFrame` and Matplotlib commands, we can get fancy.

In [None]:
ax = df.plot(kind='bar', figsize=(15,3));

# Set the title.
ax.set_title('Some Kinda Plot Thingy', fontsize=21, y=1.01);

# Move the legend.
ax.legend(loc=1);

# y-axis labels
ax.set_ylabel('Important y-axis info', fontsize=16);

# x-axis labels
ax.set_xlabel('Meaningless x-axis info', fontsize=16);

### Exercises
#### 2. Bar Happy Hour

In [None]:
# Create a horizontal bar plot of average beer drunk by continent


In [None]:
# Add a title and x and y axis label


# Set the title.



# y-axis labels




# x-axis labels


In [None]:
# Create a horizontal stacked bar plot of mean drinks by continent (with the liters comparison removed!)
# Only keep 'EU' and 'NA'


<a id="histograms"></a>
## Histograms: Show the distribution of a numerical variable
---


In [None]:
# Sort the beer column and mentally split it into three groups.
drinks.beer.sort_values().values

In [None]:
# Compare the above with histogram.
# About how many of the points above are in the groups 0-125, 125-250, and 250-376?
drinks.beer.plot(kind='hist', bins=3);

In [None]:
# Try more bins — it takes the range of the data and divides it into 20 evenly spaced bins.
drinks.beer.plot(kind='hist', bins=20);
plt.xlabel('Beer Servings');
plt.ylabel('Frequency');
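
The binning step can be reproduced by hand with NumPy, which helps when checking what a histogram is actually counting. The values below are made up for illustration; only the range of 0 to 376 is chosen to echo the beer column.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical values spanning the range 0 to 376.
values = np.array([0, 10, 20, 120, 130, 200, 260, 300, 376])

# np.histogram splits the data range into evenly spaced bins and counts each one.
counts, edges = np.histogram(values, bins=3)
print('bin edges:', edges)
print('counts:   ', counts)

# The histogram bars have the same heights as the counts above.
ax = pd.Series(values).plot(kind='hist', bins=3)
bar_heights = [patch.get_height() for patch in ax.patches]
```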

In [None]:
# Making histograms of DataFrames — histogram of random data
df.hist(figsize=(16,8));

### Single Histogram

In [None]:
norm = np.random.standard_normal(5000)

In [None]:
pd.Series(norm).hist(figsize=(16,4), bins=50);

<a id="grouped-histograms"></a>
### Grouped histograms: Show one histogram for each group.

In [None]:
# Reminder: Overall histogram of beer servings
drinks.beer.plot(kind='hist');

In [None]:
# Histogram of beer servings grouped by continent -- how might these graphs be misleading?
drinks.hist(column='beer', by='continent');

In [None]:
# Share the x- and y-axes.
drinks.hist(column='beer', by='continent', sharex=True, sharey=True, layout=(2, 3));

### Exercises

#### 3. Housing Histogram
Create a histogram using `MEDV` in the housing data.
- Set the bins to 20.

In [None]:
# Create a histogram using 'MEDV' in the housing data. Set the bins to 20.
#housing['MEDV'].hist(bins=20);


In [None]:
# Create a set of histograms using 'AGE', 'CHAS', 'INDUS', and 'ZN' in the housing data.
# Set the bins to 10.


<a id="box-plots"></a>
## Box Plots: Show quartiles (and outliers) for one or more numerical variables
---

We can use boxplots to quickly summarize distributions.

**Five-number summary:**

- min = minimum value
- 25% = first quartile (Q1) = median of the lower half of the data
- 50% = second quartile (Q2) = median of the data
- 75% = third quartile (Q3) = median of the upper half of the data
- max = maximum value

(It's more useful than mean and standard deviation for describing skewed distributions.)

**Interquartile Range (IQR)** = Q3 - Q1

**Outliers:**

- below Q1 - 1.5 * IQR
- above Q3 + 1.5 * IQR
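
The outlier rule above can be worked through on a small made-up series. Note that Pandas computes quartiles by linear interpolation by default, so its values can differ slightly from the "median of each half" definition given in the list.

```python
import pandas as pd

# Hypothetical data with one obviously extreme value.
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

# Quartiles (Pandas interpolates linearly between data points by default).
q1, q2, q3 = s.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Fences: points beyond these limits are drawn as outliers in a box plot.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = s[(s < lower_fence) | (s > upper_fence)]
print(f'Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}')
print(f'fences: [{lower_fence}, {upper_fence}]')
print('outliers:', outliers.tolist())
```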

In [None]:
df.boxplot();

### Let's see how box plots are generated so we can best interpret them.

In [None]:
# Sort the spirit column.
drinks.spirit.sort_values().values

In [None]:
# Show "five-number summary" for spirit.
drinks.spirit.describe()

In [None]:
# Compare with box plot.
drinks.spirit.plot(kind='box');

In [None]:
# Include multiple variables.
drinks.drop('liters', axis=1).plot(kind='box');

### How to use a box plot to preview the distributions in the housing data

In [None]:
housing.boxplot();

In [None]:
housing.plot(kind='box',figsize=(15,10));

<a id="grouped-box-plots"></a>
### Grouped box plots: Show one box plot for each group.

In [None]:
# Reminder: box plot of beer servings
drinks.beer.plot(kind='box');

In [None]:
# Box plot of beer servings grouped by continent
drinks.boxplot(column='beer', by='continent');

In [None]:
drinks[drinks.continent=='AF']

In [None]:
# Box plot of all numeric columns grouped by continent
drinks.boxplot(by='continent');

<a id="scatter-plots"></a>
## Scatter plots: Show the relationship between two numerical variables
---


In [None]:
# Select the beer and wine columns and sort by beer.
drinks[['beer', 'wine']].sort_values('beer').values

In [None]:
# Compare with scatter plot.
drinks.plot(kind='scatter', x='beer', y='wine');

In [None]:
# Add transparency (great for plotting several graphs on top of each other, or for illustrating density!).
drinks.plot(kind='scatter', x='beer', y='wine', alpha=0.3);

In [None]:
# Vary point color by spirit servings.
drinks.plot(kind='scatter', x='beer', y='wine', c='spirit', colormap='Blues');
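
As a small sketch of how the `c` and `colormap` arguments behave, the example below colors points by a third column of made-up data. The `toy` DataFrame and its columns are hypothetical.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: two coordinates and a third value used for color.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'x': rng.uniform(0, 100, size=50),
    'y': rng.uniform(0, 100, size=50),
    'z': rng.uniform(0, 10, size=50),
})

# Passing a column name to c= maps that column onto the colormap.
ax = toy.plot(kind='scatter', x='x', y='y', c='z', colormap='Blues', alpha=0.8)

# Pandas adds a colorbar, which Matplotlib stores as a second axes on the figure.
n_axes = len(ax.figure.axes)
```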

In [None]:
# Scatter matrix of three numerical columns
pd.plotting.scatter_matrix(drinks[['beer', 'spirit', 'wine']], figsize=(10, 8));

### Exercises

#### 4. Scattered Housing

In [None]:
# View the association between the variables `ZN` and `INDUS` using a scatter plot


In [None]:
# Create a Scatter matrix of 'AGE','CHAS','ZN', and 'INDUS'


# Using Seaborn

<a id="seaborn"></a>
## `pairplot`

---

- **Objective:** Know when to use Seaborn or advanced Matplotlib.

With the `DataFrame` object `wine`, we will render a pairplot using the Seaborn library.
What does each of the elements represent? Is this more or less useful than the previous plot?

In [None]:
sns.pairplot(wine);

**Answer:** _What does each of the elements represent?  Is this more or less useful than the previous plot?_
> A pair plot shows the relationship between every _pair_ of variables, with each variable's own distribution on the diagonal. This is very useful for quickly spotting which variables are correlated during exploratory data analysis. However, if you only care about a single feature, the pair plot is harder to read and interpret than a single histogram.

## `heatmap`
---

When you have too many variables, a pairplot or scatter matrix can become impossible to read. We can still gauge linear correlation using a heatmap of the correlation matrix.

In [None]:
# Make a heatmap on the correlations between variables in the wine data:
# Keep only numeric columns; the text column `type` cannot be correlated.
wine_correlations = wine.select_dtypes('number').corr()
sns.heatmap(wine_correlations);

In [None]:
wine_correlations
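
When a correlation matrix is large, it can help to list the pairs directly rather than scan colors. The following is one possible sketch on synthetic data; the column names and the way the columns are generated are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'pos' rises with x, 'neg' falls with x, 'noise' is unrelated.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
toy = pd.DataFrame({
    'x': x,
    'pos': 2 * x + rng.normal(scale=0.1, size=200),
    'neg': -x + rng.normal(scale=0.1, size=200),
    'noise': rng.normal(size=200),
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

print(pairs.sort_values())
print('most positive pair:', pairs.idxmax())
print('most negative pair:', pairs.idxmin())
```

Pairs near +1 or -1 are good candidates for a follow-up scatter plot, while pairs near 0 show no linear relationship.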

### Challenge: Create a scatter plot of two heatmap entries that appear to have a very positive correlation.

In [None]:
wine.plot();

- Now, create a scatter plot of two heatmap entries that appear to have negative correlation.

In [None]:
wine.plot();

## `lmplot`

In [None]:
sns.lmplot(x='alcohol', y='fixed acidity', data=wine)

In [None]:
sns.lmplot(x='alcohol', y='citric acid', data=wine, fit_reg=False)

In [None]:
# Scatterplot arguments, coloring dots by a specified variable in hue
sns.lmplot(x='alcohol', y='chlorides', data=wine,
           fit_reg=False,
           hue='quality')

## `violinplot`
"A violin plot is a method of plotting numeric data. It is similar to a box plot, with the addition of a rotated kernel density plot on each side. Violin plots are similar to box plots, except that they also show the probability density of the data at different values, usually smoothed by a kernel density estimator." - Wikipedia

In [None]:
sns.violinplot(data=wine['pH'])

## `distplot`

In [None]:
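# Note: distplot is deprecated in recent Seaborn releases;
# sns.histplot(wine['pH'], bins=25, kde=True) is the suggested replacement.
sns.distplot(wine['pH'], bins=25, kde=True)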

## `boxplot`

In [None]:
sns.boxplot(x='alcohol',y='type',data=wine, palette="Set2", hue='type')

<a id="matplotlib"></a>
## OPTIONAL: Understanding Matplotlib (Figures, Subplots, and Axes)

---

Matplotlib uses a blank canvas called a figure.

In [None]:
# plt.subplots returns both the figure and its axes, so unpack the pair.
fig, ax = plt.subplots(1,1, figsize=(16,8));

Within this canvas, we can place smaller objects called axes.

In [None]:
fig, axes = plt.subplots(2,3, figsize=(16,8));

Pandas allows us to plot to a specified axes if we pass the object to the `ax` parameter.

In [None]:
fig, axes = plt.subplots(2,3, figsize=(16,8))
df.plot(ax=axes[0][0]);
df['col1'].plot(ax=axes[0][1]);
df['col2'].plot(ax=axes[1][1]);
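
A grid of axes is returned as a NumPy array, so it can be indexed by row and column. The sketch below, which uses a hypothetical `toy` DataFrame, plots into two cells of a 2 x 3 grid and then checks which cells contain lines.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data, used only for this illustration.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(10, 2)), columns=['a', 'b'])

# axes is a 2 x 3 NumPy array of Axes objects.
fig, axes = plt.subplots(2, 3, figsize=(12, 6))

toy.plot(ax=axes[0, 0])        # top-left cell
toy['a'].plot(ax=axes[1, 2])   # bottom-right cell

# Record which cells have had something drawn on them.
used = [[len(ax.lines) > 0 for ax in row] for row in axes]
print(used)
```

Cells that are never plotted to stay as empty axes, which is why the unused panels still show blank frames.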

## Let's use a bit more customization.
---

In [None]:
fig, axes = plt.subplots(2,2, figsize=(16,8))

# We can change the ticks' size.
df['col2'].plot(figsize=(16,4), color='purple', fontsize=21, ax=axes[0][0])

# We can also change which ticks are visible.
# Let's show only the even ticks. ('idx % 2 == 0' only if 'idx' is even.)
ticks_to_show = [idx for idx, _ in enumerate(df['col2'].index) if idx % 2 == 0]
df['col2'].plot(figsize=(16,4), color='purple', xticks=ticks_to_show, fontsize=16, ax=axes[0][1])

# We can change the label rotation.
df.plot(figsize=(15,7), title='Big Rotated Labels - Tiny Title',
        fontsize=20, rot=-50, ax=axes[1][0])

# We have to use ".set_title()" to fix title size.
df.plot(figsize=(16,8), fontsize=20, rot=-50, ax=axes[1][1])\
       .set_title('Better-Sized Title', fontsize=21, y=1.01);

<a id="additional-topics"></a>
## OPTIONAL: Additional Topics

In [None]:
# Saving a plot to a file
drinks.beer.plot(kind='hist', bins=20, title='Histogram of Beer Servings');
plt.xlabel('Beer Servings');
plt.ylabel('Frequency');
plt.savefig('beer_histogram.png');    # Save to file!
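
Saving can also be done through the figure object, which avoids relying on whichever figure happens to be current. This is a minimal sketch with made-up data; the file name, the temporary directory, and the `dpi` value are assumptions for the example.

```python
import os
import tempfile

import pandas as pd
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# Hypothetical values, used only to have something to save.
pd.Series([1, 2, 2, 3, 3, 3]).plot(kind='hist', bins=3, ax=ax, title='Toy histogram')

# Write to a temporary directory so the example does not clutter the project folder.
out_path = os.path.join(tempfile.mkdtemp(), 'toy_histogram.png')

# dpi controls the resolution; bbox_inches='tight' trims surrounding whitespace.
fig.savefig(out_path, dpi=150, bbox_inches='tight')
plt.close(fig)
print('saved to', out_path)
```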

In [None]:
# List available plot styles
plt.style.available

In [None]:
# Change to a different style.
plt.style.use('ggplot')
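
`plt.style.use` changes the style for the rest of the session. If you only want a style for one plot, Matplotlib also offers `plt.style.context`, sketched below with made-up data.

```python
import pandas as pd
import matplotlib.pyplot as plt

before = plt.rcParams['axes.facecolor']

# The style applies only inside the with-block.
with plt.style.context('ggplot'):
    inside = plt.rcParams['axes.facecolor']
    ax = pd.Series([3, 1, 2]).plot(kind='bar', title='Styled with ggplot')

# Outside the block, the previous settings are restored.
after = plt.rcParams['axes.facecolor']
print(before, inside, after)
```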

<a id="plotly-express"></a>
## Optional: Plotly Express

Plotly Express is a high-level wrapper for Plotly that lets you build interactive charts with simple syntax. You will need to install Plotly before the cells below will run; Plotly Express is included in the `plotly` package.

The installation instructions and the graphs below come directly from the [Plotly Express site](https://plotly.com/python/plotly-express/).

In [None]:
import plotly.express as px
# Note: this reassigns `df`, replacing the random DataFrame used earlier in the lab.
df = px.data.iris()


In [None]:
#Building a basic scatter chart
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species")
fig.show()

In [None]:
#Interactive scatter matrix
fig = px.scatter_matrix(df, dimensions=["sepal_width", "sepal_length", "petal_width", "petal_length"], color="species")
fig.show()

In [None]:
#Viewing patterns within the data
fig = px.parallel_coordinates(
    df,
    color="species_id",
    labels={"species_id": "Species",
            "sepal_width": "Sepal Width", "sepal_length": "Sepal Length",
            "petal_width": "Petal Width", "petal_length": "Petal Length"},
    color_continuous_scale=px.colors.diverging.Tealrose,
    color_continuous_midpoint=2)
fig.show()

In [None]:
#Building graphs from geo-data
#Can also do tile-based maps
df = px.data.gapminder()
fig = px.choropleth(df, locations="iso_alpha", color="lifeExp", hover_name="country", animation_frame="year", range_color=[20,80])
fig.show()

<a id="summary"></a>
### Summary

In this lesson, we showed examples of how to create a variety of plots using Pandas, Matplotlib, and Seaborn. We also showed how to use each plot to display data effectively.

Do not be concerned if you do not remember everything; this will come with practice! Although there are many plot types, they are drawn in very similar ways. For example, they share most of their parameters, and the same Matplotlib functions are used to modify the plot area.

We looked at:
- Line plots
- Bar plots
- Histograms
- Box plots
- Scatter plots
- Special Seaborn plots
- How Matplotlib works