# Chapter 03 - Visualizing Data

## The `matplotlib` Package

In [None]:
# By convention, matplotlib's `pyplot` module is imported as `plt`
import matplotlib.pyplot as plt

In [None]:
# The `pyplot` module, in its simplest use, maintains an internal state
# that lets a developer build up a visualization step by step.
#
# When finished, one can save the figure by calling `savefig()` or display
# it by calling `show()`
years = range(1950, 2011, 10)  # From 1950 through 2010 by 10 years
gdp = [300.2, 543.3, 1075.9, 2862.5, 5957.6, 10289.7, 14958.3]

In [None]:
# Create a line chart with years on the x-axis and GDP on the y-axis
# In the REPL, `plot()` does not actually display the figure; it only
# updates pyplot's internal state. In a Jupyter notebook using the inline
# backend, however, the figure appears in the output cell once the cell
# finishes running. (One may be able to change this behavior by changing
# a setting.)

# Plot a green, solid line with circles at each data point
plt.plot(years, gdp, color='green', marker='o', linestyle='solid')

# Add a title
plt.title('Nominal GDP')

# Add a label to the y-axis
plt.ylabel('Billions of $')

# And now we show the figure
plt.show()
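
The comments above mention `savefig()` as the alternative to `show()`. A minimal sketch of that path follows, assuming a non-interactive backend (`Agg`) and a hypothetical output file name, `nominal_gdp.png`, written to the system's temporary directory; neither choice comes from the original notebook.

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend, no window opens
import matplotlib.pyplot as plt

years = range(1950, 2011, 10)
gdp = [300.2, 543.3, 1075.9, 2862.5, 5957.6, 10289.7, 14958.3]

# Build up the figure exactly as before
plt.plot(years, gdp, color='green', marker='o', linestyle='solid')
plt.title('Nominal GDP')
plt.ylabel('Billions of $')

# Hypothetical output location; any writable path would do
output_path = os.path.join(tempfile.gettempdir(), 'nominal_gdp.png')
plt.savefig(output_path)  # write the current figure to disk
plt.close()  # discard the figure so later plots start from a clean state
```

Calling `plt.close()` afterward is a design choice: because `pyplot` keeps internal state, closing the figure keeps later plotting calls from drawing on top of this one.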

## Bar Charts

A bar chart is a good choice when you want to show how some quantity varies among a **discrete** set of items.

In [None]:
# Oscars for different movies
movies = ['Annie Hall', 'Ben-Hur', 'Casablanca', 'Gandhi', 'West Side Story']
oscars_count = [5, 11, 3, 8, 10]

In [None]:
# Plot Oscars for different movies

# Plot bars centered at x-coordinates [0, 1, 2, 3, 4] (the default
# alignment since matplotlib 2.0) with heights given by `oscars_count`
plt.bar(range(len(movies)), oscars_count)

plt.title('My Favorite Movies')  # add a title
plt.ylabel('# of Academy Awards')  # label the y-axis

# Label the x-axis with movie names at center of each bar
plt.xticks(range(len(movies)), movies)

# Finally, plot the figure
plt.show()

A bar chart can also be a good choice for plotting histograms of bucketed numeric values:

In [None]:
from collections import Counter

In [None]:
# A sequence of grades for a class
grades = [83, 95, 91, 87, 70, 0, 85, 82, 100, 67, 73, 77, 0]

In [None]:
# Bucket grades by decile, but put scores of 100 in the 90's bucket
# Remember that the `//` operator is **floor** (integer) division
histogram = Counter(min(grade // 10 * 10, 90) for grade in grades)

plt.bar(
    # Center a 10-unit bar by shifting center of bars from the lower bound
    # of the decile to the middle. For example, the decile [10, 20) is
    # centered at 15 and occupies all the "space" from 10 to 20.
    [x + 5 for x in histogram.keys()],  # shift bars right by 5
    histogram.values(),  # give each bar the correct height
    10,  # each bar has a width of 10 which completely fills the decile
    edgecolor=(0, 0, 0)  # Each bar has black edges
)

# Plot x-axis from -5 to 105 and y-axis from 0 to 5
# Plotting the x-axis range from a value less than the minimum to a value
# greater than the maximum puts space around the plot horizontally.
# Similarly, plotting the y-axis range to a value greater than the maximum
# puts space at the top of the plot.
plt.axis([-5, 105, 0, 5])

plt.xticks([10 * i for i in range(11)])  # x-axis labels at 0, 10, 20, ..., 100
plt.xlabel('Decile')
plt.ylabel('# of Students')
plt.title('Distribution of Exam 1 Grades')
plt.show()
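
To see what the bucketing expression produces before it is handed to `plt.bar`, the counts can be inspected directly. The following is a minimal sketch using only the standard library; the names `bar_centers` and `bar_heights` are illustrative and do not appear in the notebook.

```python
from collections import Counter

grades = [83, 95, 91, 87, 70, 0, 85, 82, 100, 67, 73, 77, 0]

# Floor each grade to the lower bound of its decile; cap at 90 so that a
# score of 100 lands in the 90's bucket.
histogram = Counter(min(grade // 10 * 10, 90) for grade in grades)

# Hypothetical names: the x-position (decile midpoint) and the height of
# each bar, ordered by decile.
bar_centers = [decile + 5 for decile in sorted(histogram)]
bar_heights = [histogram[decile] for decile in sorted(histogram)]

for center, height in zip(bar_centers, bar_heights):
    print(center, height)
```

The tallest bar has a height of 4, which is why a y-axis upper limit of 5 leaves a little headroom at the top of the plot.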

Be careful in your choice of boundaries supplied to `axis`. In particular, it is considered especially bad form for your y-axis **not** to start at 0, since this choice can mislead people.

In [None]:
# Mentions per year
mentions = [500, 505]
years = [2017, 2018]

In [None]:
plt.bar(years, mentions, 0.8)
plt.xticks(years)
plt.ylabel('# of times I heard someone say "data science"')

# Misleading y-axis shows only the part from 499 to 506
plt.axis([2016.5, 2018.5, 499, 506])
plt.title('Look at the "Huge" increase!')
plt.show()

With more sensible axes, the difference looks far less impressive.

In [None]:
plt.bar(years, mentions, 0.8)
plt.xticks(years)
plt.ylabel('# of times I heard someone say "data science"')

plt.axis([2016.5, 2018.5, 0, 550])
plt.title('Not so impressive anymore')
plt.show()
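
One way to quantify how much a truncated axis exaggerates the change is to compare the ratio of the drawn bar heights with the ratio of the underlying values. This is only an illustrative sketch; the helper `drawn_height_ratio` is a hypothetical name, not part of `matplotlib`.

```python
mentions = [500, 505]


def drawn_height_ratio(values, y_min):
    """Ratio of the visible heights of the last and first bars when the
    y-axis starts at `y_min` (hypothetical helper for illustration)."""
    first, last = values[0], values[-1]
    return (last - y_min) / (first - y_min)


truncated = drawn_height_ratio(mentions, y_min=499)  # y-axis starts at 499
honest = drawn_height_ratio(mentions, y_min=0)  # y-axis starts at 0

print(truncated)
print(honest)
```

With the axis starting at 499, the second bar is drawn six times as tall as the first, although the underlying count grew by only 1%.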

## Line Charts

Line charts are a good choice for showing trends.

In [None]:
# Variance, bias, and total error
variance = [1, 2, 4, 8, 16, 32, 64, 128, 256]
bias_squared = [256, 128, 64, 32, 16, 8, 4, 2, 1]
total_error = [x + y for x, y in zip(variance, bias_squared)]
xs = [i for i, _ in enumerate(variance)]

In [None]:
# We make multiple calls to `plt.plot` to show multiple curves on the
# same figure
plt.plot(xs, variance, 'g-', label='variance')  # green, solid line
plt.plot(xs, bias_squared, 'r-.', label='bias^2')  # red, dot-dashed line
plt.plot(xs, total_error, 'b:', label='total error')  # blue dotted line

# Because we've assigned labels to each curve, we can get a legend
# for "free" (loc=9 means "upper center")
plt.legend(loc=9)
plt.xlabel('model complexity')
plt.xticks([])
plt.title('The Bias-Variance Tradeoff')

plt.show()
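
The chart suggests that total error is smallest where the variance and bias-squared curves cross. As a quick check, and assuming the list index stands in for model complexity as it does in the plot, the minimum can be located directly. The name `best_complexity` is illustrative only.

```python
variance = [1, 2, 4, 8, 16, 32, 64, 128, 256]
bias_squared = [256, 128, 64, 32, 16, 8, 4, 2, 1]
total_error = [x + y for x, y in zip(variance, bias_squared)]

# Index (our stand-in for model complexity) with the smallest total error
best_complexity = min(range(len(total_error)), key=total_error.__getitem__)

print(best_complexity, total_error[best_complexity])
```

For this data the minimum falls at the index where variance and squared bias are equal.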

## Scatterplots

A scatter plot is the right choice for visualizing the relationship between two paired sets of data.

In [None]:
# The relationship between the number of friends and the minutes spent
# on a social media site every day
friends = [70, 65, 72, 63, 71, 64, 60, 64, 67]
minutes = [175, 170, 205, 120, 120, 130, 105, 145, 190]
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']

In [None]:
plt.scatter(friends, minutes)

# Label each point
for label, friend_count, minute_count in zip(labels, friends, minutes):
    plt.annotate(
        label,
        xy=(friend_count, minute_count),  # put the label with its point
        xytext=(5, -5),  # but slightly offset
        textcoords='offset points'
    )

plt.title('Daily Minutes vs. Number of Friends')
plt.xlabel('# of friends')
plt.ylabel('daily minutes on the site')

plt.show()

**BEWARE:** If you're scattering comparable variables, you might get a misleading picture if you let `matplotlib` choose the scale.

In [None]:
# Scatter plot of grades on first exam against grades on second exam
test_1_grades = [99, 90, 85, 97, 80]
test_2_grades = [100, 85, 60, 90, 70]

plt.scatter(test_1_grades, test_2_grades)
plt.title('Axes are not comparable')  # plot titles do not render Markdown
plt.xlabel('test 1 grade')
plt.ylabel('test 2 grade')
plt.show()

I'm uncertain what "comparable" (or "not comparable") means here. Since scatter plots are intended to show relationships between pairs of values, I think "comparable" refers to axes that span similar ranges: not necessarily equal, but similar. For example, if the y-axis ranges from 0 to y-max, then the x-axis should also range from 0 to x-max.

I think this notion of "comparable" is related to the idea of visual "skewness." We are accustomed to a 1:1 relationship appearing as a 45-degree line. Consequently, if the y-axis goes from 60 to 100 but the x-axis goes from 80 to 100, the graph is visually "skewed"; that is, a unit change along one axis does **not** cover the same distance on the page as a unit change along the other axis.

This is vague, but I think it is heading "in the right direction."

In [None]:
# The same data with comparable axes
plt.scatter(test_1_grades, test_2_grades)
plt.title('Axes are comparable')
plt.xlabel('test 1 grade')
plt.ylabel('test 2 grade')
plt.axis('equal')
plt.show()

With "equal" axes, one can see that the variation in grades for the second test is much larger than the variation in grades for the first test.
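
The visual impression can be checked numerically. The sketch below uses the standard library's `statistics` module to compare the spread of the two sets of grades; the `spread` helper is an illustrative name only.

```python
from statistics import pvariance

test_1_grades = [99, 90, 85, 97, 80]
test_2_grades = [100, 85, 60, 90, 70]


def spread(values):
    """Range (max minus min) of a list of numbers (illustrative helper)."""
    return max(values) - min(values)


# Compare the two tests by range and by population variance
print(spread(test_1_grades), spread(test_2_grades))
print(pvariance(test_1_grades), pvariance(test_2_grades))
```

The second test's range (40 points) is roughly twice that of the first (19 points), which matches what the plot with equal axes shows.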