# Problem Set 3.1: Basic Plotting

[Click here to open this notebook in your browser](https://leifwalsh.github.io/data-analysis-problem-sets/lab/index.html?path=3-visualization-basics/3.1-basic-plotting/3.1-basic-plotting.ipynb)

Let's start creating plots with the `pandas` APIs.

In [None]:
import pandas as pd

One famous dataset used in teaching data science is the "mpg dataset". It's ubiquitous, and you can get it out of the [`seaborn` package](https://seaborn.pydata.org/generated/seaborn.load_dataset.html).

In [None]:
import seaborn as sns
# mpg = sns.load_dataset('mpg')
mpg = pd.read_csv("mpg.csv")
mpg

## The `pandas` plotting API

The first API we'll explore is the one built right into `pandas`. The drawing itself is handled by another library, `matplotlib`, which we'll see shortly, but `pandas` wraps it in an easy-to-use API for simple plots. You can discover it by typing `.plot.` after a DataFrame and tab-completing.

We'll only scratch the surface here. If you want to learn more, read the [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html).

### Scatter Plots

Let's dive right in with a simple scatter plot. With `pandas`, we can use the `DataFrame.plot` accessor to make all kinds of plots, and the arguments we provide describe how each row's columns will be visually encoded.

Here, we're asking for a scatter plot, so each row will show up as a dot in the plot: its `weight` will determine the x-coordinate, and its `horsepower` will determine the y-coordinate.

In [None]:
mpg.plot.scatter(x="weight", y="horsepower")

We can set other properties of each point, like the color, by giving it another channel to encode. Let's show each country of origin as its own color.

Here, we need to map each `origin` value to a different color, so we should prepare our dataframe first.

In [None]:
mpg["origin"].unique()

In [None]:
colors = {
    "usa": "lightblue",
    "japan": "lime",
    "europe": "violet",
}
mpg["color"] = mpg["origin"].map(colors)
mpg

In [None]:
mpg.plot.scatter(x="weight", y="horsepower", c="color")

We can also encode columns directly into the point colors, and they'll be mapped via a "colormap". And we can encode a fourth column, `acceleration`, to control the size.

In [None]:
mpg.plot.scatter(x="weight", y="horsepower", c="displacement", s="acceleration")
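
The colormap can also be chosen explicitly. The sketch below uses a small made-up DataFrame (not the real mpg data) with the same column names, picks the `viridis` colormap via the `colormap` argument, and scales the size column so the differences are easier to see. The factor of 5 is an arbitrary choice.

```python
import pandas as pd

# Tiny made-up sample with the same column names as the mpg dataset
cars = pd.DataFrame({
    "weight": [2100, 2800, 3500, 4300],
    "horsepower": [70, 95, 140, 190],
    "displacement": [90, 150, 300, 430],
    "acceleration": [19.0, 16.0, 13.0, 10.0],
})

# c= names a numeric column, so it is mapped through the chosen colormap;
# s= takes point sizes, here scaled up from the acceleration column
ax = cars.plot.scatter(
    x="weight",
    y="horsepower",
    c="displacement",
    colormap="viridis",
    s=cars["acceleration"] * 5,
)
```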

What do you notice about this plot? What does each of the encoded dimensions tell us about how these columns relate to one another?

#### Output Control

A quick note: you can get rid of that `<Axes: ...>` text at the top with a semicolon at the end of the line.

The reason is that calling the plotting function by itself makes the plot appear, but it still returns something (a `matplotlib` object). Jupyter wants to display the last thing in a cell that's returned, so it prints the representation of the plot as well.

You can suppress this by "returning" `None` from the cell, like this:

In [None]:
mpg.plot.scatter(x="weight", y="horsepower", c="displacement", s="acceleration")
None

But a more concise way is to "add another empty statement" to the cell by adding a semicolon. I'll often do this when I'm preparing a notebook to be shared:

In [None]:
mpg.plot.scatter(x="weight", y="horsepower", c="displacement", s="acceleration");
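
To see what that returned object is, you can capture it in a variable instead of letting Jupyter print it. A minimal sketch, using a small made-up DataFrame rather than the real data:

```python
import pandas as pd
from matplotlib.axes import Axes

# Made-up values, just enough to draw something
cars = pd.DataFrame({"weight": [2100, 2800, 3500], "horsepower": [70, 95, 140]})

# Assigning the result means this line leaves nothing for Jupyter to display
ax = cars.plot.scatter(x="weight", y="horsepower")

# The returned object is a matplotlib Axes, which we can keep customizing
ax.set_title("Weight vs. horsepower")
print(isinstance(ax, Axes))  # prints True
```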

### Other familiar plots: lines, bars, box plots

`pandas` has many other plots available in its simple API; you can see them all by tab-completing after `df.plot.`. Here is a sample:

#### Line plots

For this dataset, a line plot isn't great. Line plots are usually better for time series data, where each time point has a single value. Here, many cars share the same `model_year`, so the line zigzags up and down within each year and you get lots of jaggedness and overlap.

In [None]:
mpg.plot.line(x="model_year", y="mpg")

We can do something useful with this dataset, though: first aggregate by year, then plot the aggregates.

In [None]:
# The aggregate is a Series indexed by model_year, so no x= or y= is needed
mpg.groupby("model_year")["mpg"].mean().plot.line(title="Average MPG")

#### Bar plots

Bar plots are good for big chunky groups of data. Here are a few examples:

In [None]:
mpg["origin"].value_counts().plot.bar()

In [None]:
mpg.groupby("cylinders")["mpg"].mean().plot.bar()

#### Distribution plots: box plots and histograms

You can provide a sketch of a distribution with either a box plot or a histogram:

In [None]:
mpg.plot.box(by="origin", column="mpg")

In [None]:
mpg["acceleration"].plot.hist()

### Plotting multiple series

Often you want to compare multiple things by drawing them as separate objects (separate lines or colors on the plot). You can do this with the `pandas` API by giving it multiple things to plot.

Let's try this with a line plot, one line per country of origin. To start, we'll need a table where each country is in its own column, which we can get with `pivot_table`:

In [None]:
weights_by_country = mpg.pivot_table(index="model_year", columns="origin", values="weight", aggfunc="mean")
weights_by_country
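
To see what `pivot_table` does to the shape of the data, here is a sketch on a tiny made-up table (the numbers are invented for illustration). Each distinct `origin` becomes a column, each `model_year` becomes a row, and rows that share both are averaged.

```python
import pandas as pd

# Long format: one row per car (values invented for illustration)
cars = pd.DataFrame({
    "model_year": [70, 70, 70, 71, 71],
    "origin": ["usa", "usa", "japan", "usa", "japan"],
    "weight": [3000, 4000, 2000, 3500, 2200],
})

# Wide format: one row per year, one column per origin, mean weight in each cell
wide = cars.pivot_table(index="model_year", columns="origin", values="weight", aggfunc="mean")
print(wide)
```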

In [None]:
weights_by_country.plot.line()

In [None]:
weights_by_country.plot.bar()

In [None]:
weights_by_country.plot.box()

#### **IMPORTANT QUESTION**

Why should you never do the previous cell? What is actually being visualized here? Write your answer below:

## The `matplotlib` API

So far we've used the `pandas` convenience methods for plotting data, but the underlying drawing library it uses, `matplotlib`, can do a lot more. When you really want to control your visualization, you may need to use the `matplotlib` APIs.

**Disclaimer:** I am very much not a `matplotlib` expert, and the library is very old and very expansive. I can only show you a few things. If you want fine-grained control, you're going to have to read its documentation closely. They have a great set of [Examples](https://matplotlib.org/stable/gallery/index.html).

Personally, I much prefer the library `altair`, which we'll see in the next notebook, and that's what I use most of the time.

The `matplotlib` library has a couple of "styles" or "entrypoints" into its API - if you want to learn it, read some docs or watch some videos. The one I know best starts with this import statement:

In [None]:
# I'm told the lineage of this API traces back to trying to be like MATLAB
from matplotlib import pyplot as plt

Start by creating a `Figure` and an `Axes`:

In [None]:
fig, ax = plt.subplots()  # I don't know why it's called subplots. There's no "plt.plots()"

We can draw on this by passing the `Axes` to the `pandas` plotting APIs through the `ax` argument:

In [None]:
fig, ax = plt.subplots()
mpg.plot.scatter(x="weight", y="horsepower", ax=ax)

One useful consequence is that you can reuse the same `Axes` object to draw multiple things on one plot:

In [None]:
fig, ax = plt.subplots()
mpg.loc[mpg["origin"] == "usa"].plot.scatter(x="weight", y="horsepower", c="lightblue", ax=ax)
mpg.loc[mpg["origin"] == "japan"].plot.scatter(x="weight", y="horsepower", c="lime", ax=ax)
mpg.loc[mpg["origin"] == "europe"].plot.scatter(x="weight", y="horsepower", c="violet", ax=ax)
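
The three calls above differ only in the origin and the color, so they can be written as a loop. The sketch below does that on a small made-up DataFrame and also passes `label`, which lets `pandas` build a legend. The `colors` dictionary mirrors the one defined earlier.

```python
import pandas as pd
from matplotlib import pyplot as plt

# Made-up sample with two cars per origin
cars = pd.DataFrame({
    "origin": ["usa", "usa", "japan", "japan", "europe", "europe"],
    "weight": [3500, 4300, 2100, 2300, 2200, 2800],
    "horsepower": [140, 190, 70, 88, 75, 110],
})
colors = {"usa": "lightblue", "japan": "lime", "europe": "violet"}

fig, ax = plt.subplots()
for origin, color in colors.items():
    subset = cars.loc[cars["origin"] == origin]
    # Same Axes every time, so each group is drawn on top of the previous ones
    subset.plot.scatter(x="weight", y="horsepower", color=color, label=origin, ax=ax)
```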

You can also pass other options to `plt.subplots`, such as `figsize`, which sets the width and height of the figure in inches:

In [None]:
fig, ax = plt.subplots(figsize=(16, 9))
mpg.loc[mpg["origin"] == "usa"].plot.scatter(x="weight", y="horsepower", c="lightblue", ax=ax)
mpg.loc[mpg["origin"] == "japan"].plot.scatter(x="weight", y="horsepower", c="lime", ax=ax)
mpg.loc[mpg["origin"] == "europe"].plot.scatter(x="weight", y="horsepower", c="violet", ax=ax)

Of course, the `pandas` API takes many `matplotlib` arguments already, so you could have just done this:

In [None]:
mpg.plot.scatter(x="weight", y="horsepower", c="color", figsize=(16, 9))

You can also call other methods on the `Axes` object to change properties of the plot after drawing it:

In [None]:
fig, ax = plt.subplots(figsize=(16, 9))
mpg.plot.scatter(x="weight", y="horsepower", c="color", ax=ax)
ax.set_xlim(0, 6000)
ax.tick_params(axis="x", labelrotation=30)
ax.set_title("Cool Plot")

I'm going to be completely honest here, I have no idea what the `Figure` is for.
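
One partial answer, going by the `matplotlib` documentation: the `Figure` is the top-level container for everything that gets drawn, including one or more `Axes`. The sketch below (on made-up data) shows where it tends to matter: a grid of several `Axes`, which also hints at where the name `subplots` comes from, plus a title for the whole figure. Saving to a file with `fig.savefig(...)` is another common use.

```python
import pandas as pd
from matplotlib import pyplot as plt

# Made-up values for illustration
cars = pd.DataFrame({
    "weight": [2100, 2800, 3500, 4300],
    "horsepower": [70, 95, 140, 190],
    "mpg": [33.0, 26.0, 18.0, 13.0],
})

# One Figure holding a 1x2 grid of Axes
fig, (left, right) = plt.subplots(1, 2, figsize=(10, 4))
cars.plot.scatter(x="weight", y="horsepower", ax=left)
cars.plot.scatter(x="weight", y="mpg", ax=right)

# Figure-level operations apply to the whole canvas, not to a single Axes
fig.suptitle("Two views of the same cars")
fig.tight_layout()
```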

## An even higher level API: `seaborn`

There's a library called `seaborn` that has a lot of pre-canned kinds of statistical plots you can use. It also takes `pandas` data as input and draws using `matplotlib`. You should check the documentation's [Gallery](https://seaborn.pydata.org/examples/index.html) to see all the things it can do. We'll see a small sample here.

We usually import it with the name `sns`:

In [None]:
import seaborn as sns

In [None]:
sns.displot(mpg, x="mpg", y="displacement")

Like `pandas`, since `seaborn` uses `matplotlib` to draw, you can also control aspects of the plot with the `matplotlib` API, which is another good reason to familiarize yourself with it a bit.

In [None]:
fig, ax = plt.subplots(figsize=(16, 9))
sns.boxplot(x="model_year", y="mpg", hue="origin", data=mpg, ax=ax)

### Configuring `matplotlib`

If you want all the plots in your notebook to have the same style, you can use `matplotlib`'s [`rcParams`](https://matplotlib.org/stable/users/explain/customizing.html) to customize things globally:

In [None]:
from matplotlib import rcParams

In [None]:
rcParams["figure.figsize"] = (10, 3)
rcParams["lines.linestyle"] = "--"

In [None]:
mpg.pivot_table(index="model_year", columns="origin", values="weight", aggfunc="mean").plot.line()
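
Assignments to `rcParams` stay in effect for the rest of the session. If you only want a style for one plot, `matplotlib` also provides `plt.rc_context`, which applies settings inside a `with` block and restores the previous values afterwards. A minimal sketch on made-up data:

```python
import pandas as pd
from matplotlib import pyplot as plt

# Made-up mean weights, shaped like the pivot table above
weights = pd.DataFrame(
    {"japan": [2000, 2200, 2300], "usa": [3500, 3400, 3300]},
    index=pd.Index([70, 71, 72], name="model_year"),
)

before = plt.rcParams["lines.linestyle"]

# Settings passed to rc_context only apply inside the with block
with plt.rc_context({"lines.linestyle": ":", "figure.figsize": (10, 3)}):
    ax = weights.plot.line()

# Outside the block, the previous value is back
after = plt.rcParams["lines.linestyle"]
```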

## Exercises

### Exercise 1

Make a plot that shows the distribution of acceleration separately for each manufacturer.

### Exercise 2

Make a bar plot of something with error bars. Then make a box plot of the same thing.

### Exercise 3

Use `seaborn` to show the joint distribution between `mpg` and `weight`. Use `hue` to show some interesting property.

### Exercise 4

Visit the [`pandas` Chart Visualization page](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html) and make three plots of these data that interest you.

### Exercise 5

Visit the [`seaborn` Gallery](https://seaborn.pydata.org/examples/index.html) and make three plots of these data that interest you.