## Week 4: Data Visualization

*Author: Kartik Jindgar*


---



In this guided practice lab you will use Pandas, Matplotlib, and Seaborn to create simple plots.

We'll cover plotting line plots, scatter plots, bar plots, and histograms, and how to manipulate the style of your plots with Matplotlib.

## Introduction

In this lab, we will introduce how plotting works in Pandas and Matplotlib. Pandas uses Matplotlib behind the scenes to make plots, so Pandas plotting methods often share parameter names with Matplotlib functions. You can also combine Matplotlib functions with Pandas methods to alter a plot after drawing it. For example, you can use Matplotlib's `xlabel` and `title` functions to set the plot's x-axis label and title, respectively, after it is drawn.
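
The interplay between Pandas and Matplotlib described above can be sketched as follows. This is a minimal illustration rather than part of the lab's datasets: the `example` frame and its column names are made up for the demonstration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical example data: 10 rows of random numbers in two columns.
rng = np.random.default_rng(0)
example = pd.DataFrame(rng.standard_normal((10, 2)), columns=['a', 'b'])

# Pandas draws the plot and returns the Matplotlib Axes it drew on.
ax = example.plot()

# pyplot functions act on the current Axes, which is the one Pandas just created.
plt.xlabel('Row index')
plt.title('Random data')
```

Because `plot()` returns a Matplotlib `Axes`, the same labels could also be set with `ax.set_xlabel(...)` and `ax.set_title(...)`.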

As we explore different types of plots, notice:

1. Different types of plots are drawn very similarly -- they even tend to share parameter names.
2. In Pandas, calling `plot()` on a `DataFrame` is different than calling it on a `Series`. Although the methods are both named `plot`, they may take different parameters.
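
As a small sketch of the second point, the example below plots a `DataFrame` and a `Series` side by side. The data and column names are invented for illustration. The `x` and `y` parameters name columns, so they are only meaningful when plotting a `DataFrame`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical example data with three numeric columns.
rng = np.random.default_rng(1)
example = pd.DataFrame(rng.standard_normal((10, 3)), columns=['a', 'b', 'c'])

fig, (ax_frame, ax_series) = plt.subplots(1, 2, figsize=(12, 4))

# DataFrame.plot: x and y select columns; one line is drawn per y column.
example.plot(x='a', y=['b', 'c'], ax=ax_frame)

# Series.plot: a single line, plotted against the Series index.
example['a'].plot(ax=ax_series)
```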

Toward the end of the lab, we will show some motivational plots using Seaborn, a popular statistical plotting library, and go into more depth on how Matplotlib works.

### Pandas Plotting Documentation

[Link to Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html)


In [2]:
from IPython.display import HTML

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')
%matplotlib inline

# Increase default figure and font sizes for easier viewing.
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14

Creating dummy data for examples

In [3]:
df = pd.DataFrame(np.random.randn(10, 4), 
                  columns=['col1', 'col2', 'col3', 'col4'],
                  index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])

Loading other datasets that will be used in this notebook

In [14]:
# Read in the Boston housing data.
housing_csv = '../boston_housing_data.csv'
housing = pd.read_csv(housing_csv)

# Read in the drinks data.
drink_cols = ['country', 'beer', 'spirit', 'wine', 'liters', 'continent']
url = '../drinks.csv'
drinks = pd.read_csv(url, header=0, names=drink_cols, na_filter=False)

# Read in the ufo data.
ufo = pd.read_csv('../ufo.csv')
ufo['Time'] = pd.to_datetime(ufo.Time)
ufo['Year'] = ufo.Time.dt.year

Line Plot with dummy data

In [None]:
df.plot()

How to change the size of a plot

In [None]:
df.plot(figsize=(15,8)) #(width, height)

How to change the color of a plot


Useful [link](https://pandas.pydata.org/docs/dev/reference/api/pandas.DataFrame.plot.line.html) to understand the different possible ways you can set color values for different columns of the dataframe

In [None]:
df['col1'].plot(color='crimson', figsize=(16,8));

How to change the style of individual plots

In [None]:
# : - dotted line, v - triangle_down
# r - red, b - blue
df[['col1', 'col4']].plot(figsize=(15,7), style={'col1': ':r', 'col4': 'vb'});

Practice problem

Create a line plot of ZN and INDUS in the housing data.
- For ZN, use a solid green line. For INDUS, use a blue dashed line.
- Change the figure size to a width of 12 and height of 8.
- Change the style sheet to something you find [here](https://tonysyu.github.io/raw_content/matplotlib-style-gallery/gallery.html).

## Bar Plots: Show a numerical comparison across different categories

In [19]:
# Count the number of countries in each continent.

In [None]:
drinks.continent.value_counts()

In [None]:
# Compare with bar plot.
drinks.continent.value_counts().plot(kind='bar');

In [22]:
# Calculate the mean alcohol amounts for each continent.

In [None]:
# numeric_only=True skips the text column 'country', which cannot be averaged.
drinks.groupby('continent').mean(numeric_only=True)

In [None]:
# Side-by-side bar plots
drinks.groupby('continent').mean(numeric_only=True).plot(kind='bar');

In [None]:
# Sort the continent x-axis by a particular column.
drinks.groupby('continent').mean(numeric_only=True).sort_values('beer').plot(kind='bar');

In [None]:
# Stacked bar plot (with the liters comparison removed!)
drinks.groupby('continent').mean(numeric_only=True).drop('liters', axis=1).plot(kind='bar', stacked=True);

Using a `DataFrame` and Matplotlib commands, we can get fancy.

In [None]:
ax = df.plot(kind='bar', figsize=(15,3));

# Set the title.
ax.set_title('Some Kinda Plot Thingy', fontsize=21,y=1);

# Move the legend.
ax.legend(loc=1);

# y-axis label
ax.set_ylabel('y-axis info', fontsize=16);

# x-axis label
ax.set_xlabel('x-axis info', fontsize=16);

Practice Problem

Create a bar chart using `col1` and `col2`.

- Give the plot a large title of your choosing. 
- Move the legend to the lower-left corner.
- Do the same thing but with horizontal bars.
- Move the legend to the upper-right corner.

## Histograms: Show the distribution of a numerical variable

In [None]:
# Sort the beer column and mentally split it into three groups.
drinks.beer.sort_values().values

In [None]:
# Compare the above with histogram.
# About how many of the points above are in the groups 1-125, 125-250, and 250-376?
drinks.beer.plot(kind='hist', bins=3);

In [None]:
# Try more bins — it takes the range of the data and divides it into 20 evenly spaced bins.
drinks.beer.plot(kind='hist', bins=20);
plt.xlabel('Beer Servings');
plt.ylabel('Frequency');
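
To make "evenly spaced bins" concrete, the sketch below computes bin edges and counts directly with NumPy's `np.histogram`. It assumes the plotted histogram follows the same binning rule. The `values` array is made-up data, not the drinks dataset.

```python
import numpy as np

# Hypothetical values standing in for a numeric column.
values = np.array([0, 5, 20, 25, 60, 100, 150, 200, 300, 400])

# With bins=4, the range 0..400 is split into four bins of equal width.
counts, edges = np.histogram(values, bins=4)

# Every bin has the same width: (max - min) / number of bins.
widths = np.diff(edges)

print(edges)   # bin boundaries
print(counts)  # number of values falling in each bin
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.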

In [None]:
# Compare with density plot (smooth version of a histogram).
drinks.beer.plot(kind='density', xlim=(0, 500));

In [None]:
# Making histograms of DataFrames — histogram of random data
df.hist(figsize=(16,8));

Single **Histogram**

In [None]:
norm = np.random.standard_normal(5000)
pd.Series(norm).hist(figsize=(16,4), bins=50);

In [None]:
pd.Series(norm).hist(figsize=(16,4), bins=20)

Practice Problem

Create a histogram with Pandas using MEDV in the housing data.
- Set the bins to 20

## Grouped Histograms: Show one histogram for each group

In [None]:
# Reminder: Overall histogram of beer servings
drinks.beer.plot(kind='hist');

In [None]:
# Histogram of beer servings grouped by continent -- how might these graphs be misleading?
drinks.hist(column='beer', by='continent');

In [None]:
# Share the x- and y-axes.
drinks.hist(column='beer', by='continent', sharex=True, sharey=True, layout=(2, 3));

## Box Plots: Show quartiles (and outliers) for one or more numerical variables

---

We can use boxplots to quickly summarize distributions.

**Five-number summary:**

- min = minimum value
- 25% = first quartile (Q1) = median of the lower half of the data
- 50% = second quartile (Q2) = median of the data
- 75% = third quartile (Q3) = median of the upper half of the data
- max = maximum value

(It's more useful than mean and standard deviation for describing skewed distributions.)

**Interquartile Range (IQR)** = Q3 - Q1

**Outliers:**

- below Q1 - 1.5 * IQR
- above Q3 + 1.5 * IQR
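
The outlier rule above can be worked through by hand on a small made-up sample. Note one assumption: Pandas' `quantile` interpolates linearly between data points by default, so its quartiles can differ slightly from the "median of each half" description.

```python
import pandas as pd

# Hypothetical sample with one suspiciously large value.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

q1 = values.quantile(0.25)   # first quartile
q3 = values.quantile(0.75)   # third quartile
iqr = q3 - q1                # interquartile range

# Fences beyond which a point is flagged as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(q1, q3, iqr, lower_fence, upper_fence)
print(outliers.tolist())
```

Here only the value 50 lies beyond the upper fence, so it is the single point a box plot of this sample would draw separately.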

In [None]:
df.boxplot()

In [None]:
# Sort the spirit column.
drinks.spirit.sort_values().values

In [None]:
# Show "five-number summary" for spirit.
drinks.spirit.describe()

In [None]:
# Compare with box plot.
drinks.spirit.plot(kind='box');

In [None]:
# Include multiple variables.
drinks.drop('liters', axis=1).plot(kind='box');

How to use a box plot to preview the distributions in the housing data

In [None]:
housing.boxplot()

## Grouped Box Plots: Show one box plot for each group

In [None]:
# Reminder: box plot of beer servings
drinks.beer.plot(kind='box');

In [None]:
# Box plot of beer servings grouped by continent
drinks.boxplot(column='beer', by='continent');

In [None]:
# Box plot of all numeric columns grouped by continent
drinks.boxplot(by='continent');

## Scatter Plots: Show the relationship between two numerical variables

In [None]:
# Select the beer and wine columns and sort by beer.
drinks[['beer', 'wine']].sort_values('beer').values

In [None]:
# Compare with scatter plot.
drinks.plot(kind='scatter', x='beer', y='wine');

In [None]:
# Add transparency (great for plotting several graphs on top of each other, or for illustrating density!).
drinks.plot(kind='scatter', x='beer', y='wine', alpha=0.3);

In [None]:
# Vary point color by spirit servings.
drinks.plot(kind='scatter', x='beer', y='wine', c='spirit', colormap='Blues');

In [None]:
# Scatter matrix of three numerical columns
pd.plotting.scatter_matrix(drinks[['beer', 'spirit', 'wine']], figsize=(10, 8));

Plotting `DataFrame`s

In [None]:
df.plot(x='col3', y='col4', kind='scatter', color='dodgerblue',figsize=(15,7), s=250)

How to view the association between the variables `ZN` and `INDUS` using a scatter plot

In [None]:
housing.plot(x='ZN', y='INDUS', kind='scatter', 
           color='dodgerblue', figsize=(15,7), s=100);

How to use a list comprehension to change the size of the scatter plot dots based on `DIS`

In [None]:
# This list comprehension sets the point sizes ('s') to be the squares of the values in housing['DIS']
housing.plot(x='ZN', y='INDUS', kind='scatter', 
           color='dodgerblue', figsize=(15,7), s=[x**2 for x in housing['DIS']]);
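
As an alternative to the list comprehension, the same point sizes can be computed with vectorized Pandas arithmetic. The sketch below uses invented data and a hypothetical `size_col` column rather than the housing dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: random coordinates plus a column that drives point size.
rng = np.random.default_rng(2)
example = pd.DataFrame({
    'x': rng.random(20),
    'y': rng.random(20),
    'size_col': rng.uniform(1, 10, 20),
})

# List comprehension: square each value one at a time.
sizes_loop = [v**2 for v in example['size_col']]

# Vectorized: square the whole column in a single expression.
sizes_vec = example['size_col'] ** 2

# 's' accepts an array-like with one size per point.
ax = example.plot(kind='scatter', x='x', y='y', s=sizes_vec, alpha=0.5)
```

Both approaches produce the same sizes. The vectorized form is usually shorter and faster on large columns.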

<a id="seaborn"></a>
## Seaborn `pairplot`

---


With the `DataFrame` object `housing`, we will render a pairplot using the Seaborn library.
What does each of the elements represent? Is this more or less useful than the previous plot?

In [None]:
sns.pairplot(housing)

## Seaborn `heatmap`
---

When you have too many variables, a pairplot or scatter matrix can become impossible to read. We can still gauge linear correlation using a heatmap of the correlation matrix.

In [None]:
# Make a heatmap on the correlations between variables in the housing data:
housing_correlations = housing.corr();
sns.heatmap(housing_correlations);
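
A correlation heatmap is often easier to read with a fixed color scale and printed values. The sketch below shows one way to do this on synthetic data. The column names are invented, and the choice of the `coolwarm` colormap is only a suggestion.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: 'pos' rises with x, 'neg' falls with x, 'noise' is unrelated.
rng = np.random.default_rng(3)
x = rng.standard_normal(200)
example = pd.DataFrame({
    'x': x,
    'pos': 2 * x + rng.normal(scale=0.5, size=200),
    'neg': -x + rng.normal(scale=0.5, size=200),
    'noise': rng.standard_normal(200),
})

corr = example.corr()

# vmin/vmax pin the color scale to the full correlation range [-1, 1];
# a diverging colormap separates positive from negative values;
# annot=True writes each coefficient inside its cell.
ax = sns.heatmap(corr, vmin=-1, vmax=1, cmap='coolwarm', annot=True)
```

Pinning the scale to [-1, 1] keeps colors comparable between heatmaps of different datasets.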

Practice Problem

Create a scatter plot of two heatmap entries that appear to have a strong positive correlation.

Practice Problem

Now, create a scatter plot of two heatmap entries that appear to have a negative correlation.

## Optional Section: Understanding Matplotlib

---

- Matplotlib uses a blank canvas called a figure
- Within this canvas, we can obtain smaller objects called axes
- Pandas allows us to plot to a specified axes if we pass the object to the ax parameter


In [None]:
# plt.subplots returns a (figure, axes) pair, so unpack both.
fig, ax = plt.subplots(1,1, figsize=(16,8));

In [None]:
fig, axes = plt.subplots(2,3, figsize=(16,8));

In [None]:
fig, axes = plt.subplots(2,3, figsize=(16,8))
df.plot(ax=axes[0][0]);
df['col1'].plot(ax=axes[0][1]);
df['col2'].plot(ax=axes[1][1]);
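
When every panel gets the same kind of plot, the grid of axes can be looped over instead of indexed by hand. This is a minimal sketch on made-up data; it relies on `plt.subplots` returning the axes as a NumPy array when the grid has more than one row and column.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data shaped like the lab's dummy DataFrame.
rng = np.random.default_rng(4)
example = pd.DataFrame(rng.standard_normal((10, 4)),
                       columns=['col1', 'col2', 'col3', 'col4'])

fig, axes = plt.subplots(2, 2, figsize=(12, 6))

# axes is a 2x2 array of Axes; flatten() turns it into a flat sequence
# so each column can be paired with one panel.
for ax, column in zip(axes.flatten(), example.columns):
    example[column].plot(ax=ax, title=column)

# Reduce overlap between panel titles and tick labels.
fig.tight_layout()
```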

Using a bit more customization

In [None]:
fig, axes = plt.subplots(2,2, figsize=(16,8))

# We can change the ticks' size.
df['col2'].plot(figsize=(16,4), color='purple', fontsize=21, ax=axes[0][0])

# We can also change which ticks are visible.
# Let's show only the even ticks. ('idx % 2 == 0' only if 'idx' is even.)
ticks_to_show = [idx for idx, _ in enumerate(df['col2'].index) if idx % 2 == 0]
df['col2'].plot(figsize=(16,4), color='purple', xticks=ticks_to_show, fontsize=16, ax=axes[0][1])

# We can change the label rotation.
# (No backslashes are needed: a call can span lines inside its parentheses.)
df.plot(figsize=(15,7), title='Big Rotated Labels - Tiny Title',
        fontsize=20, rot=-50, ax=axes[1][0])

# We have to use ".set_title()" to fix title size.
df.plot(figsize=(16,8), fontsize=20, rot=-50, ax=axes[1][1])\
       .set_title('Better-Sized Title', fontsize=21, y=1.01);

## Saving a plot to a file

In [None]:
drinks.beer.plot(kind='hist', bins=20, title='Histogram of Beer Servings');
plt.xlabel('Beer Servings');
plt.ylabel('Frequency');
plt.savefig('beer_histogram.png');    # Save to file!
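
`savefig` also accepts options that control output quality. The sketch below uses random data and a hypothetical file name, written to the system's temporary directory so it does not clutter the working folder.

```python
import os
import tempfile

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data to plot.
rng = np.random.default_rng(5)
values = pd.Series(rng.standard_normal(500))

fig, ax = plt.subplots(figsize=(8, 6))
values.plot(kind='hist', bins=20, title='Histogram of Random Values', ax=ax)
ax.set_xlabel('Value')

# Hypothetical output location; change this to wherever the image should go.
output_path = os.path.join(tempfile.gettempdir(), 'example_histogram.png')

# dpi sets the resolution; bbox_inches='tight' trims surrounding whitespace.
fig.savefig(output_path, dpi=150, bbox_inches='tight')
```

Calling `savefig` on the figure object, rather than through `plt`, makes it explicit which figure is written when several are open.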

## Thank You!