# Section 1: Getting Started With Matplotlib

We will begin by familiarizing ourselves with Matplotlib. Moving beyond the default options, we will explore how to customize various aspects of our visualizations. By the end of this section, you will be able to generate plots using the Matplotlib API directly, as well as customize the plots that libraries like pandas and Seaborn create for you.

## Why start with Matplotlib?

There are many libraries for creating data visualizations in Python (even more if you include those that build on top of them). In this section, we will learn about Matplotlib's role in the Python data visualization ecosystem before diving into the library itself.

<figure>
  <blockquote cite="https://matplotlib.org/stable/index.html" style="border-left: none; box-shadow: none;">
    Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. &#91;It&#93; makes easy things easy and hard things possible.
  </blockquote>
  <figcaption style="text-align: right">&ndash; <a href="https://matplotlib.org/stable/index.html" target="_blank" rel="noopener noreferrer">Matplotlib documentation</a></figcaption>
</figure>

We will start by working with the `stackoverflow.zip` dataset, which contains the title and tags for all Stack Overflow questions tagged with a select few Python libraries since Stack Overflow's inception (Sept. 2008) through Sept. 12, 2021. The data comes from the [Stack Overflow API](https://api.stackexchange.com/docs/search) – more information can be found in [this](../data/collection/stackoverflow.ipynb) notebook. Here, we are aggregating the data monthly to get the total number of questions per library per month:

In [None]:
import pandas as pd

stackoverflow_monthly = pd.read_csv(
    '../data/stackoverflow.zip', parse_dates=True, index_col='creation_date'
).loc[:'2021-08','pandas':'bokeh'].resample('1M').sum()
stackoverflow_monthly.sample(5, random_state=1)

*Source: [Stack Exchange Network](https://api.stackexchange.com/docs/search)*

Those familiar with pandas have likely used the `plot()` method to generate visualizations. Here, we plot monthly Matplotlib questions over time:

In [None]:
%config InlineBackend.figure_formats = ['svg']
%matplotlib inline
stackoverflow_monthly.matplotlib.plot(
    figsize=(8, 2), xlabel='creation date', ylabel='total questions', 
    title='Matplotlib Questions per Month\n(since the creation of Stack Overflow)',
)

*Tip: The previous example used the `%matplotlib inline` IPython magic command to embed inline Matplotlib images (in SVG format as specified by the `%config` line). Curious how it works? Check out the source code [here](https://github.com/ipython/matplotlib-inline).*

Notice that this returns a Matplotlib `Axes` object since pandas is using Matplotlib as a plotting backend. This means that pandas takes care of a lot of the legwork for us &ndash; some examples include the following:

- Creating the figure: [source code](https://github.com/pandas-dev/pandas/blob/f5c224215ad0b3728173c67330ffcf13b35bdb2e/pandas/plotting/_matplotlib/core.py#L373-L392)
- Calling the `Axes.plot()` method: [source code](https://github.com/pandas-dev/pandas/blob/f5c224215ad0b3728173c67330ffcf13b35bdb2e/pandas/plotting/_matplotlib/core.py#L759-L760)
- Adding titles/labels: [source code](https://github.com/pandas-dev/pandas/blob/f5c224215ad0b3728173c67330ffcf13b35bdb2e/pandas/plotting/_matplotlib/core.py#L576-L591)

While pandas can do a lot of the work for us, there are benefits to understanding how to work with Matplotlib directly.

#### Flexibility

We can use other data structures (such as NumPy arrays) without the overhead of converting to a pandas data structure just to plot.
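As a minimal sketch of this point, the following plots two NumPy arrays without involving pandas at all. The sine wave data and the variable names are made up for illustration and are not part of the workshop dataset; `plt.subplots()` is covered later in this section.

```python
import matplotlib.pyplot as plt
import numpy as np

# illustrative data: 100 evenly spaced points and their sine values
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

fig, ax = plt.subplots(figsize=(8, 2))
lines = ax.plot(x, y)  # the NumPy arrays are passed straight to Matplotlib
```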

#### Customization

Even if we use pandas to make the initial plot, we can use Matplotlib commands on the `Axes` object that is returned to tweak other parts of the visualization. This is also the case for any library that uses Matplotlib as its plotting backend &ndash; examples of which include the following:
- [Cartopy](https://scitools.org.uk/cartopy/docs/latest/): geospatial data processing to produce map visualizations
- [ggplot](https://github.com/yhat/ggplot): Python version of the popular `ggplot2` R package
- [HoloViews](http://holoviews.org/): interactive visualizations with minimal code
- [Seaborn](https://seaborn.pydata.org/): high-level interface for creating statistical visualizations with Matplotlib
- [Yellowbrick](https://www.scikit-yb.org/): extension of Scikit-Learn for creating visualizations to analyze machine learning performance

*Note: Matplotlib maintains a list of such libraries [here](https://matplotlib.org/stable/thirdpartypackages/index.html). We will cover HoloViews later in this workshop, and examples with Seaborn can be found in [this pandas workshop](https://github.com/stefmolin/pandas-workshop).*

#### Extensibility

You can also build on top of Matplotlib for personal/work libraries. This might mean defining custom plot themes or functionality to create commonly-used visualizations.

Furthermore, if you want to contribute to open source data visualization libraries (like the aforementioned), knowledge of Matplotlib will come in handy. An example is the addition of the `refline()` method in the Seaborn library. This method makes it possible to draw horizontal/vertical reference lines on all subplots at once. The Matplotlib methods `axhline()` and `axvline()` are the basis of [this contribution](https://github.com/mwaskom/seaborn/commit/a626c0ae29b8c777b8e1342948e1611b984bf27b):

<div style="text-align: center;">
    <img width="60%" src="https://pbs.twimg.com/media/FBSw8BPX0AAa3wn?format=jpg&name=medium" alt="Seaborn refline() example" style="min-width: 300px">
    <div><small><em><a href="https://twitter.com/chris1610/status/1446976863365124098">Source</a></em></small></div>
</div>

## Matplotlib basics

In this workshop, we will explore the static and animated visualization functionality to gain a breadth of knowledge of the library. While we won't go too in depth, additional resources will be provided throughout. Now, let's get started with the basics.

The `Figure` object is the container for all components of our visualization. It contains one or more `Axes` objects, which can be thought of as the (sub)plots, as well as other [*Artists*](https://matplotlib.org/stable/tutorials/intermediate/artists.html), which draw on the plot canvas (x-axis, y-axis, legend, lines, etc.). The following image from the Matplotlib documentation illustrates the different components of a figure:

<div style="text-align: center;">
    <img width="50%" src="https://raw.githubusercontent.com/stefmolin/python-data-viz-workshop/main/media/figure_anatomy.png" alt="Matplotlib figure anatomy" style="min-width: 400px">
    <div><small><em><a href="https://matplotlib.org/stable/tutorials/introductory/usage.html#parts-of-a-figure">Source</a></em></small></div>
</div>

There are two ways we can build visualizations with Matplotlib:
1. **Functional**: call functions provided by the `matplotlib.pyplot` module
2. **Object-oriented**: call methods on `Figure` and `Axes` objects

While the object-oriented (OO) approach is recommended for non-interactive use (i.e., not in a Jupyter Notebook), either approach is valid &ndash; you should, however, try to avoid mixing them. Note that different use cases lend themselves to different approaches, so we will explore examples of both.

First, we will import the `pyplot` module:

In [None]:
import matplotlib.pyplot as plt

#### Functional approach

In [None]:
# figsize is determined by rcParams for plt.plot()
plt.plot(stackoverflow_monthly.index, stackoverflow_monthly.matplotlib)

_ = plt.xlabel('creation date')
_ = plt.ylabel('total questions')
_ = plt.title('Matplotlib Questions per Month\n(since the creation of Stack Overflow)')

*Note: Since we ran `%matplotlib inline` earlier, we don't need to do anything to display our plot here. If we hadn't, we would need to call `plt.show()` to do so.*

#### Object-oriented approach

In [None]:
# creates the Figure and adds a single Axes object
fig, ax = plt.subplots(figsize=(8, 2))

ax.plot(stackoverflow_monthly.index, stackoverflow_monthly.matplotlib)

ax.set_xlabel('creation date')
ax.set_ylabel('total questions')
ax.set_title('Matplotlib Questions per Month\n(since the creation of Stack Overflow)')

*Tip: Take note that each of the plotting commands is returning something. These are Matplotlib objects that we can use to further customize the visualization as well.*
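To make the tip above concrete, here is a small sketch that captures the return values and uses them afterward. The data is illustrative only, and the particular tweaks (dashed line, bold title) are arbitrary choices for the example.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 2))

# plot() returns a list of Line2D objects (one per line drawn)
(line,) = ax.plot([0, 1, 2], [1, 3, 2])
# set_title() returns the Text object for the title
title = ax.set_title('Illustrative data')

# use the returned objects to tweak the plot after the fact
line.set_linestyle('dashed')
title.set_fontweight('bold')
```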

As mentioned before, we can use Matplotlib code to modify the plot that pandas created for us. Here, we will use the object-oriented approach to remove the top and right spines and to start the y-axis at 0, while keeping the current setting for the end: 

In [None]:
ax = stackoverflow_monthly.matplotlib.plot(
    figsize=(8, 2), xlabel='creation date', ylabel='total questions', 
    title='Matplotlib Questions per Month\n(since the creation of Stack Overflow)',
)
ax.set_ylim(0, None) # this can also be done with pandas

# hide some of the spines (must be done with Matplotlib)
for spine in ['top', 'right']:
    ax.spines[spine].set_visible(False)

*Tip: You can use the functional approach to change the y-axis limits by replacing `ax.set_ylim(0, None)` with `plt.ylim(0, None)`.*

Now that we have the basics down, let's see how to create other plot types and add additional components to them like legends, reference lines, and annotations. Note that the figure anatomy diagram we looked at earlier will help when moving from idea to implementation, since it helps identify the right keywords to search for. It may also be helpful to bookmark [this](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Matplotlib_Cheat_Sheet.pdf) Matplotlib cheat sheet.

## Plotting with Matplotlib

Now that we understand a little bit of how Matplotlib works, we will walk through some more involved examples, which include legends, reference lines, and/or annotations, building them up step by step. Note that while using a library like pandas to do the initial plot creation can make things easier, we will focus on using Matplotlib exclusively to get more familiar with it.

Each example in this section will showcase both how to build a specific plot with Matplotlib directly and how to customize it with some of the more advanced plotting techniques available. In particular, we will learn how to build and customize the following plot types:
- line plots
- scatter plots
- area plots
- bar plots
- stacked bar plots
- histograms
- box plots

### Line plot

The Stack Overflow data we have been working with thus far is a time series, so the first set of visualizations will be for studying the evolution of the data over time. However, rather than using a monthly aggregate like before, we will use daily data, so we will read in the data once more and this time aggregate it daily:

In [None]:
stackoverflow_daily = pd.read_csv(
    '../data/stackoverflow.zip', parse_dates=True, index_col='creation_date'
).loc[:,'pandas':'bokeh'].resample('1D').sum()
stackoverflow_daily.tail()

We are going to visualize how the rolling 30-day mean number of Matplotlib questions changed over time, along with the standard deviation. To do so, we first need to calculate these data points using pandas:

In [None]:
avgs = stackoverflow_daily.rolling('30D').mean()
stds = stackoverflow_daily.rolling('30D').std()

avgs.tail()

Now, we can proceed to building this visualization. We will work through the following steps over the next few slides:
1. Create the line plot.
2. Add a shaded region for $\pm$2 standard deviations from the mean.
3. Set the axis labels, y-axis limits, plot title, and despine the plot.

#### 1. Create the line plot.

By default, the `plot()` method will return a line plot:

In [None]:
fig, ax = plt.subplots(figsize=(8, 2))
ax.plot(avgs.index, avgs.matplotlib)

#### 2. Add a shaded region for $\pm$2 standard deviations from the mean.

Next, we use the `fill_between()` method to shade the region $\pm$2 standard deviations from the mean. Note that we also set `alpha=0.25` to make the region 25% opaque &ndash; transparent enough to easily see the line for the rolling 30-day mean:

In [None]:
fig, ax = plt.subplots(figsize=(8, 2))
ax.plot(avgs.index, avgs.matplotlib)
ax.fill_between(
    avgs.index, avgs.matplotlib - 2 * stds.matplotlib, 
    avgs.matplotlib + 2 * stds.matplotlib, alpha=0.25
)

#### 3. Set the axis labels, y-axis limits, plot title, and despine the plot.

Now for the final touches. While in previous examples we used `ax.set_xlabel()`, `ax.set_ylabel()`, etc., here we use `ax.set()`, which allows us to set multiple attributes of the plot in a single method call.

In [None]:
fig, ax = plt.subplots(figsize=(8, 2))
ax.plot(avgs.index, avgs.matplotlib)
ax.fill_between(
    avgs.index, avgs.matplotlib - 2 * stds.matplotlib, 
    avgs.matplotlib + 2 * stds.matplotlib, alpha=0.25
)

ax.set(
    xlabel='creation date', ylabel='total questions', ylim=(0, None),
    title='Rolling 30-Day Average of Matplotlib Questions per Day'
)

for spine in ['top', 'right']:
    ax.spines[spine].set_visible(False)

Next, we will make a utility function to remove the top and right spines of our plots more easily going forward. It's considered good practice to return the `Axes` object:

In [None]:
def despine(ax):
    for spine in ['top', 'right']:
        ax.spines[spine].set_visible(False)
    return ax

*Note: Since we are working in a Jupyter Notebook, our figures are automatically closed after we run the cell. However, if you are working elsewhere, make sure to call `plt.close()` to free up those resources when you are finished.*
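For example, outside of a notebook, the pattern might look like the following sketch (the data is illustrative, and what you do with the figure before closing it is up to you):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 2))
ax.plot([0, 1, 2], [1, 3, 2])  # illustrative data

# ... save or display the figure here ...

plt.close(fig)  # release the resources held by this figure
# plt.close('all') would close every open figure instead
```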

### Scatter plot

The `plot()` method can also be used to create scatter plots, but we have to pass in some additional information. Let's build up to a scatter plot of monthly Matplotlib questions with some "best fit" lines:

1. Create the scatter plot.
2. Convert to Matplotlib dates.
3. Add the best fit lines.
4. Label the axes, add a legend, and despine.
5. Format both the x- and y-axis tick labels.

#### 1. Create the scatter plot.

So far, we have passed x and y as positional arguments to the `plot()` method; however, there is a third argument we haven't explored: the format string (`fmt`) is a shorthand for specifying the marker (shape of the point), line style, and color to use for the plot. We can use this to create a scatter plot with the `plot()` method.

Note that while there is some flexibility in the order these are specified, it is recommended that we specify them in the following order: 

```python
fmt = '[marker][line][color]'
```

Here, we use the format string `ok` to create a scatter plot with black (`k`) circles (`o`); notice that we don't specify a line style because we don't want lines this time:

In [None]:
fig, ax = plt.subplots(figsize=(9, 3))
ax.plot(
    stackoverflow_monthly.index,
    stackoverflow_monthly.matplotlib, 
    'ok', label=None, alpha=0.5
)

*Tip: As an alternative, the `scatter()` method can be used to create a scatter plot, in which case we don't need to specify the format string (`fmt`).*
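The two options can be compared side by side with a sketch like this one, which uses made-up data rather than the Stack Overflow dataset. With `scatter()`, the color and marker are passed as keyword arguments instead of a format string.

```python
import matplotlib.pyplot as plt

# illustrative data (not from the Stack Overflow dataset)
x = [1, 2, 3, 4, 5]
y = [2, 4, 3, 6, 5]

fig, axes = plt.subplots(1, 2, figsize=(9, 3))

# plot() with the 'ok' format string: black circles, no connecting line
axes[0].plot(x, y, 'ok', alpha=0.5)
# scatter() draws markers only, so color and marker are passed as keywords
points = axes[1].scatter(x, y, color='black', marker='o', alpha=0.5)
```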

#### 2. Convert to Matplotlib dates.

In the previous example, we used `stackoverflow_monthly.index` as our x values. While Matplotlib was able to correctly show the years on the x-axis, we will run into issues when we try to add the best fit lines. This is because Matplotlib represents dates internally as floating-point numbers (days since an epoch), and fitting the lines requires numeric x values rather than timestamps. To get around this, we will convert the dates to Matplotlib dates while we build up the plot; then, at the end, we will format them into a human-readable format.

We can use the `date2num()` function in the `matplotlib.dates` module to convert to Matplotlib dates:

In [None]:
import matplotlib.dates as mdates

x_axis_dates = mdates.date2num(stackoverflow_monthly.index)
x_axis_dates[:5]
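To get a feel for what these numbers mean, here is a small sketch on two arbitrary dates (chosen only for illustration). It assumes a recent Matplotlib version, where the default epoch is 1970-01-01:

```python
import datetime as dt

import matplotlib.dates as mdates

# Matplotlib dates are floats counting days since an epoch
as_number = mdates.date2num(dt.datetime(2021, 9, 12))
one_day_later = mdates.date2num(dt.datetime(2021, 9, 13))

# consecutive days differ by exactly 1
difference = one_day_later - as_number

# num2date() converts back to a timezone-aware datetime (UTC by default)
round_trip = mdates.num2date(as_number)
```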

Now, let's update our plot to use these dates:

In [None]:
fig, ax = plt.subplots(figsize=(9, 3))
ax.plot(
    x_axis_dates, stackoverflow_monthly.matplotlib, 
    'ok', label=None, alpha=0.5
)

#### 3. Add the best fit lines.

We will use NumPy to obtain the best fit lines, which will be a first degree and a second degree polynomial. The `polyfit()` function fits a polynomial of the specified degree to our data and `poly1d()` instantiates a polynomial that we can evaluate at different x values to obtain the y values for the best fit line:

```python
import numpy as np

degree = 1
poly = np.poly1d(
    np.polyfit(x_axis_dates, stackoverflow_monthly.matplotlib, degree)
)
```
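If these two functions are new to you, the following sketch on made-up points that lie exactly on the line y = 2x + 1 may help show what each one returns:

```python
import numpy as np

# illustrative points that lie exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1

coefficients = np.polyfit(x, y, 1)  # highest power first: [slope, intercept]
poly = np.poly1d(coefficients)      # callable polynomial built from them

predicted = poly(4.0)  # evaluate the fitted line at a new x value
```

Here, `coefficients` comes out (up to floating-point error) as a slope of 2 and an intercept of 1, so evaluating the polynomial at 4 gives approximately 9.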

For each of these best fit lines, we will call the `plot()` method to add them to the scatter plot:

In [None]:
import numpy as np

fig, ax = plt.subplots(figsize=(9, 3))
ax.plot(x_axis_dates, stackoverflow_monthly.matplotlib, 'ok', label=None, alpha=0.5)

for degree, linestyle in zip([1, 2], ['solid', 'dashed']):
    poly = np.poly1d(np.polyfit(x_axis_dates, stackoverflow_monthly.matplotlib, degree))
    ax.plot(
        x_axis_dates, poly(x_axis_dates), label=degree, 
        linestyle=linestyle, linewidth=2, alpha=0.9
    )

*Tip: We also specified `linestyle` to differentiate between the lines and `linewidth` to make them thicker. More info on the `zip()` function is available [here](https://realpython.com/python-zip-function/).*

Before moving on, let's package up this logic in a function:

In [None]:
def add_best_fit_lines(ax, x, y):
    for degree, linestyle in zip([1, 2], ['solid', 'dashed']):
        poly = np.poly1d(np.polyfit(x, y, degree))
        ax.plot(
            x, poly(x), label=degree, 
            linestyle=linestyle, linewidth=2, alpha=0.9
        )
    return ax

#### 4. Label the axes, add a legend, and despine.

Next, we need to add a legend so we can tell the best fit lines apart. Now is also a good time to label our axes, give our plot a title, and adjust the limits of both the x- and y-axis (`xlim`/`ylim`). Here, we define a function that will add all of this to our plot and use the first date in the data as the start of the x-axis (we will pass this in as `xmin`):

In [None]:
def add_labels(ax, xmin):
    ax.set(
        xlabel='creation date', ylabel='total questions',
        xlim=(xmin, None), ylim=(0, None),
        title='Matplotlib Questions per Month\n(since the creation of Stack Overflow)'
    )
    ax.legend(title='degree') # add legend and give it a title
    return ax

Let's call this after the plotting code we built up so far and also despine our plot:

In [None]:
fig, ax = plt.subplots(figsize=(9, 3))
ax.plot(x_axis_dates, stackoverflow_monthly.matplotlib, 'ok', label=None, alpha=0.5)

add_best_fit_lines(ax, x_axis_dates, stackoverflow_monthly.matplotlib)

add_labels(ax, x_axis_dates[0])
despine(ax)

#### 5. Format both the x- and y-axis tick labels.

All that remains now is to clean up the tick labels on the axes: the x-axis should have human-readable dates, and the y-axis can be improved by formatting the numbers for readability. For both, we will need to access the `Axis` objects contained in the `Axes` object via the `xaxis`/`yaxis` attribute:

```python
ax.xaxis # access the x-axis
ax.yaxis # access the y-axis
```

From there, we will use two methods to customize the *major* tick labels (as opposed to *minor*, which our plot isn't currently showing). We call the `set_major_locator()` method to adjust where the ticks are located, and the `set_major_formatter()` method to adjust the format of the tick labels. For the x-axis, we will place ticks at 16-month intervals and format the labels as `%b\n%Y`, which places the month abbreviation above the year. This functionality comes from the `matplotlib.dates` module:

```python
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=16))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b\n%Y'))
```
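As a quick sanity check of the date format, a formatter can also be called directly on a single Matplotlib date. This is only a sketch on an arbitrary date; note that the month abbreviation produced by `%b` depends on your locale:

```python
import datetime as dt

import matplotlib.dates as mdates

formatter = mdates.DateFormatter('%b\n%Y')

# a formatter takes the tick value (a Matplotlib date) and returns the label
label = formatter(mdates.date2num(dt.datetime(2021, 9, 1)))

# the newline splits the label into the month abbreviation and the year
month, year = label.split('\n')
```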

The `matplotlib.ticker` module contains classes for tick location and formatting for non-dates. Here, we use the `StrMethodFormatter` class to provide a format string just as we would see with the `str.format()` method. This particular format specifies that the labels should be floats with commas as the thousands separator and zero digits after the decimal:

```python
from matplotlib import ticker

ax.yaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.0f}'))
```
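To see what this format string produces, we can try it on a single value. The number below is arbitrary, and calling the formatter directly (outside of a plot) is done here purely for illustration:

```python
from matplotlib import ticker

fmt = '{x:,.0f}'

# the same format specification works with str.format()
plain = fmt.format(x=1234567.891)

# StrMethodFormatter fills in `x` with the tick value
formatter = ticker.StrMethodFormatter(fmt)
label = formatter(1234567.891)
```

Both `plain` and `label` are the string `'1,234,568'`: the value is rounded to zero decimal places and commas are inserted as thousands separators.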

Now, let's put everything together in a function:

In [None]:
from matplotlib import ticker

def format_axes(ax):
    ax.xaxis.set_major_locator(mdates.MonthLocator(interval=16))
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%b\n%Y'))
    ax.yaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.0f}'))
    return ax

*Tip: Use `EngFormatter` instead of `StrMethodFormatter` for engineering notation.*
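The difference between the two formatters can be seen by applying both to the same value. This is a rough sketch with an arbitrary number; the exact spacing of the `EngFormatter` label may vary with its settings and the Matplotlib version:

```python
from matplotlib import ticker

plain = ticker.StrMethodFormatter('{x:,.0f}')
eng = ticker.EngFormatter()

value = 12_500
plain_label = plain(value)  # thousands separator, no decimals
eng_label = eng(value)      # scaled to a power of 1000 with an SI prefix (k, M, ...)
```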

We now have all the pieces for our final visualization:

In [None]:
fig, ax = plt.subplots(figsize=(9, 3))
ax.plot(x_axis_dates, stackoverflow_monthly.matplotlib, 'ok', label=None, alpha=0.5)

add_best_fit_lines(ax, x_axis_dates, stackoverflow_monthly.matplotlib)

add_labels(ax, x_axis_dates[0])
despine(ax)
format_axes(ax)

### Exercise 1.1

##### Using the data in `weather.csv`, plot the daily average temperature (`TAVG`) for both LA and NYC. Fill in all sections where NYC's daily average temperature was higher than LA's in 2020.

In [None]:
# Complete exercise 1.1 

In [None]:
# TIP: the `despine()` function is available in utils.py

### Area plot

We have just been using the Matplotlib questions time series, but it's also interesting to look at trends for multiple libraries. Since the libraries in this dataset vary in age, popularity, and number of Stack Overflow questions, a good option to view many at once is an area plot. This will give us an idea of both the overall trend for these types of libraries and the libraries themselves. Let's start by subsetting our daily Stack Overflow questions data to the top four libraries by number of questions:

In [None]:
subset = stackoverflow_daily.sum().nlargest(4)
top_libraries_monthly = stackoverflow_monthly.reindex(columns=subset.index)
top_libraries_monthly.head()

Now, we can build up our plot. Once again, we will break this down in steps:
1. Create the area plot.
2. Label and format the axes, provide a title, and despine the plot.
3. Add annotations.

#### 1. Create the area plot.

First, we use `stackplot()` to create the area plot as our starting point. Note that we are using Matplotlib dates from the start rather than switching when we add the annotations:

In [None]:
fig, ax = plt.subplots(figsize=(12, 3))
ax.stackplot(
    mdates.date2num(top_libraries_monthly.index),
    top_libraries_monthly.to_numpy().T, # each element is a library's time series
    labels=top_libraries_monthly.columns
)

#### 2. Label and format the axes, provide a title, and despine the plot.
Next, we will handle labels and formatting before working on the annotations. This should look familiar from previous examples:

In [None]:
fig, ax = plt.subplots(figsize=(12, 3))
ax.stackplot(
    mdates.date2num(top_libraries_monthly.index), top_libraries_monthly.to_numpy().T, 
    labels=top_libraries_monthly.columns
)
ax.set(xlabel='', ylabel='tagged questions', title='Stack Overflow Questions per Month')
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
ax.yaxis.set_major_formatter(ticker.EngFormatter())
despine(ax)

This will be the basis of a couple of visualizations we do in this section, so let's make a function for what we have so far:

In [None]:
def area_plot(data):
    fig, ax = plt.subplots(figsize=(12, 3))
    ax.stackplot(
        mdates.date2num(data.index),
        data.to_numpy().T, 
        labels=data.columns
    )
    ax.set(
        xlabel='', ylabel='tagged questions',
        title='Stack Overflow Questions per Month'
    )
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
    ax.yaxis.set_major_formatter(ticker.EngFormatter())
    despine(ax)
    return ax

#### 3. Add annotations.
Rather than use a legend for this plot, we are going to use annotations to label each area and provide the median value in 2021. To create annotations, we use the `annotate()` method with the following arguments:

- The first argument (`text`) is the annotation text as a string.
- The `xy` argument is a tuple of the coordinates for the data point that we are annotating.
- The `xytext` argument is a tuple of the coordinates where we want to place the annotation text.
- When providing `xytext`, we can optionally provide `arrowprops`, which defines the style to use for the arrow pointing from `xytext` to `xy`.
- We can also customize alignment of the text horizontally (`ha`) and vertically (`va`).
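Before applying this to the area plot, here is a minimal sketch of these arguments on made-up data; the text and coordinates are arbitrary and chosen only to show where each argument ends up:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot([0, 1, 2, 3], [1, 4, 2, 3])  # illustrative data

annotation = ax.annotate(
    'peak',                            # text
    xy=(1, 4),                         # the point being annotated
    xytext=(2, 3.8),                   # where the text is placed
    arrowprops=dict(arrowstyle='->'),  # arrow from xytext to xy
    ha='left', va='center'             # text alignment
)
```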

We will annotate pandas, NumPy, and Matplotlib alongside their respective areas, but Seaborn will be moved higher up and to the right using an arrow to point to its area (since it is thin). Once again, we will place this logic in a function for reuse later:

In [None]:
def annotate(ax, data):
    total = 0
    for library in data.columns:
        last = data.last('1D')[library]
        last_day, last_value = last.index[0], last.iat[0]
        if library in ['pandas', 'numpy', 'matplotlib']:
            kwargs = {}
        else:
            kwargs = dict(
                xytext=(last_day + pd.Timedelta(days=20), (last_value + total) * 1.1),
                arrowprops=dict(arrowstyle='->')
            )

        ax.annotate(
            f' {library}: {data.loc["2021", library].median():,.0f}',
            xy=(last_day, last_value/2 + total), ha='left', va='center', **kwargs
        )
        total += last_value
    return ax

Now, let's see what our plot looks like so far:

In [None]:
ax = area_plot(top_libraries_monthly)
annotate(ax, top_libraries_monthly)

*Tip: You can use $LaTeX$ symbols when providing text (annotations, titles, etc.) to Matplotlib commands, e.g., using `r'$\alpha$'` will be rendered as $\alpha$. See [this](https://matplotlib.org/stable/tutorials/text/mathtext.html) page in the documentation for more information.*
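For instance, a title and axis label using math text might look like the following sketch (the data and the text are illustrative only):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 2))
ax.plot([0, 1, 2], [0, 1, 4])  # illustrative data

# raw strings keep the backslashes intact for Matplotlib's mathtext parser
title = ax.set_title(r'Illustrative title with $\alpha$ and $x^2$')
label = ax.set_ylabel(r'$y = x^2$')
```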

Before we move on to our next plot, which builds upon this one, let's update our `area_plot()` function to include the call to the `annotate()` function:

In [None]:
def area_plot(data):
    fig, ax = plt.subplots(figsize=(12, 3))
    ax.stackplot(
        mdates.date2num(data.index),
        data.to_numpy().T, 
        labels=data.columns
    )
    ax.set(
        xlabel='', ylabel='tagged questions',
        title='Stack Overflow Questions per Month'
    )
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
    ax.yaxis.set_major_formatter(ticker.EngFormatter())
    despine(ax)
    annotate(ax, data)
    return ax

#### More Annotations
The Stack Overflow community is very active, and you will frequently see old questions updated with new answers reflecting the latest solutions. The dataset we have been working with contains several examples of this: Seaborn makes some plotting tasks a lot easier than Matplotlib, so some of the Stack Overflow questions that are currently tagged "seaborn" were originally posted before Seaborn's first release ([v0.1](https://github.com/mwaskom/seaborn/releases/tag/v0.1)) on October 28, 2013. See the [Color by Column Values in Matplotlib](https://stackoverflow.com/questions/14885895/color-by-column-values-in-matplotlib) question for an example.

Let's use reference lines and a shaded region to highlight this section on the area plot.

First, we will add a dashed vertical line for Seaborn's first release (October 28, 2013) using the `axvline()` method:

In [None]:
import datetime as dt

ax = area_plot(top_libraries_monthly)
    
# mark when seaborn was created
seaborn_released = dt.date(2013, 10, 28)
ax.axvline(seaborn_released, ymax=0.6, color='gray', linestyle='dashed')
ax.annotate('seaborn v0.1', xy=(seaborn_released, 4750), rotation=-90, va='top')

Next, we'll make an additional vertical line for the oldest question that was retroactively tagged "seaborn":

In [None]:
ax = area_plot(top_libraries_monthly)

seaborn_released = dt.date(2013, 10, 28)
ax.axvline(seaborn_released, ymax=0.6, color='gray', linestyle='dashed')
ax.annotate('seaborn v0.1', xy=(seaborn_released, 4750), rotation=-90, va='top')

# oldest question tagged "seaborn"
first_seaborn_qs = top_libraries_monthly.query('seaborn >= 1')\
    .first('1D').index[0].to_pydatetime().date()
ax.axvline(first_seaborn_qs, ymax=0.6, color='gray', linestyle='dashed')

Let's package this reference line logic up in a function before looking at how to shade the region between them:

In [None]:
def add_reflines(ax, data):
    seaborn_released = dt.date(2013, 10, 28)
    ax.axvline(seaborn_released, ymax=0.6, color='gray', linestyle='dashed')
    ax.annotate('seaborn v0.1', xy=(seaborn_released, 4750), rotation=-90, va='top')

    first_seaborn_qs = \
        data.query('seaborn >= 1').first('1D').index[0].to_pydatetime().date()
    ax.axvline(first_seaborn_qs, ymax=0.6, color='gray', linestyle='dashed')
    return ax

Finally, we use the `axvspan()` method to shade in the region between the lines:

In [None]:
ax = area_plot(top_libraries_monthly)
add_reflines(ax, top_libraries_monthly)

# shade the region of posts that were retroactively tagged "seaborn"
ax.axvspan(
    ymax=0.6, xmin=mdates.date2num(first_seaborn_qs),
    xmax=mdates.date2num(seaborn_released), color='gray', alpha=0.25
)
middle = (seaborn_released - first_seaborn_qs)/2 + first_seaborn_qs
ax.annotate(
    'posts retroactively\ntagged "seaborn"', 
    xy=(mdates.date2num(middle), 3500), 
    va='top', ha='center'
)
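Note that the `ymax=0.6` used in the calls above is not in data coordinates: for both `axvline()` and `axvspan()`, the x values are in data coordinates, while `ymin`/`ymax` are fractions of the `Axes` height. The following sketch shows this on made-up data:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot([0, 10], [0, 100])  # illustrative data

# x is in data coordinates; ymin/ymax are fractions of the Axes height (0 to 1)
line = ax.axvline(4, ymax=0.6, color='gray', linestyle='dashed')
# xmin/xmax are in data coordinates; ymin/ymax are again fractions of the height
span = ax.axvspan(xmin=4, xmax=6, ymax=0.6, color='gray', alpha=0.25)
```

Even though the y-axis here runs to 100, the line and the shaded region stop 60% of the way up the `Axes`, not at y = 0.6.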

### Bar plot

The upper left region of the area plot we just worked on has a lot of whitespace. We can use this space to provide additional information with an inset plot. Let's work on adding an inset bar plot that shows total questions per library:

1. Add the inset `Axes` object to the `Figure` object.
2. Create the horizontal bar plot.
3. Label and format the plot.

#### 1. Add the inset `Axes` object to the `Figure` object.

First, we need to modify our `area_plot()` function to return the `Figure` object as well:

In [None]:
def area_plot(data):
    fig, ax = plt.subplots(figsize=(12, 3))
    ax.stackplot(
        mdates.date2num(data.index),
        data.to_numpy().T, 
        labels=data.columns
    )
    ax.set(
        xlabel='', ylabel='tagged questions',
        title='Stack Overflow Questions per Month'
    )
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
    ax.yaxis.set_major_formatter(ticker.EngFormatter())
    despine(ax)
    annotate(ax, data)
    return fig, ax

Now, we can call our updated `area_plot()` function and use the `Figure` object that it returns to add the inset plot via the `add_axes()` method. This method receives the dimensions as a sequence of 4 values, expressed as fractions of the `Figure` width and height:
1. `left`: Offset from the left edge of the `Figure` (i.e., the `x`).
2. `bottom`: Offset from the bottom edge of the `Figure` (i.e., the `y`).
3. `width`: The width of the inset.
4. `height`: The height of the inset.

In [None]:
fig, ax = area_plot(top_libraries_monthly)
inset_ax = fig.add_axes([0.2, 0.6, 0.2, 0.2])

*Tip: Check out the [axes_grid1 toolkit](https://matplotlib.org/stable/tutorials/toolkits/axes_grid.html#insetlocator) if you want the inset to contain a "zoomed in" version of the data.*
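To see how the four values passed to `add_axes()` translate to actual sizes, consider this sketch on an empty figure of the same size as our area plot (the variable names are made up for illustration):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 3))

# [left, bottom, width, height] relative to the Figure's width and height
inset_ax = fig.add_axes([0.2, 0.6, 0.2, 0.2])

# convert the inset's relative size to inches using the figure size
fig_width, fig_height = fig.get_size_inches()
left, bottom, width, height = inset_ax.get_position().bounds
inset_size_inches = (width * fig_width, height * fig_height)
```

On this 12 x 3 inch figure, a width and height of 0.2 each yield an inset of roughly 2.4 x 0.6 inches, which is why the inset is wider than it is tall.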

#### 2. Create the horizontal bar plot.

Next, we use the `barh()` method to add horizontal bars to the inset. These bars will represent total questions for each of the libraries in the area plot, so we will also need to make sure the colors align. For this, we use the `collections` attribute to access each of the sections of the area plot and grab their colors with the `get_facecolor()` method:

In [None]:
fig, ax = area_plot(top_libraries_monthly)
inset_ax = fig.add_axes([0.2, 0.6, 0.2, 0.2])
colors = {area.get_label(): area.get_facecolor() for area in ax.collections}

# populate the inset with a bar plot of total questions
total_qs = top_libraries_monthly.sum()
inset_ax.barh(
    total_qs.index, total_qs.to_numpy(),
    color=[colors[label] for label in total_qs.index]
)
inset_ax.invert_yaxis() # sort bars in descending order
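
As a quick way to see what the `collections` attribute holds, the following sketch builds a small area plot from made-up data (the labels `first` and `second` are placeholders, not libraries from the dataset) and prints the label and face color of each stacked area:

```python
import matplotlib.pyplot as plt
import numpy as np

# made-up data: two series stacked over five x values
x = np.arange(5)
y = np.array([[1, 2, 3, 4, 5], [2, 2, 2, 2, 2]])

fig, ax = plt.subplots()
ax.stackplot(x, y, labels=['first', 'second'])

# each stacked area is a collection with its own label and face color
colors = {area.get_label(): area.get_facecolor() for area in ax.collections}
for label, color in colors.items():
    print(label, color)
```

Each value in the dictionary is an array of RGBA values, keyed by the label we passed to `stackplot()`.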

#### 3. Label and format the plot.
Labeling and formatting the inset works the same as we've seen before:

In [None]:
fig, ax = area_plot(top_libraries_monthly)
inset_ax = fig.add_axes([0.2, 0.6, 0.2, 0.2])
colors = {area.get_label(): area.get_facecolor() for area in ax.collections}

total_qs = top_libraries_monthly.sum()
inset_ax.barh(
    total_qs.index, total_qs.to_numpy(), 
    color=[colors[label] for label in total_qs.index]
)
inset_ax.invert_yaxis()
despine(inset_ax)
inset_ax.xaxis.set_major_formatter(ticker.EngFormatter())
inset_ax.set_xlabel('total questions')

Final remarks on this example:

- More complicated subplot layouts can be created with [GridSpec](https://matplotlib.org/stable/gallery/subplots_axes_and_figures/gridspec_multicolumn.html#sphx-glr-gallery-subplots-axes-and-figures-gridspec-multicolumn-py) or the [Figure.subplot_mosaic()](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplot_mosaic.html) method, which was added in v3.3 to make it easier to create complex layouts.
- We used both dictionary and list comprehensions here &ndash; check out [this article](https://www.geeksforgeeks.org/comprehensions-in-python/) for more information on comprehensions in Python.

#### Annotating bars

Our inset only shows data for four of the libraries in the Stack Overflow dataset to match the area plot. Due to the smaller scale of the other libraries, it didn't make sense to include them in the area plot; however, we can visualize total questions for each of them with a bar plot if we use a log scale for the x-axis. This time our bar plot won't just be an inset, and we will explore how to annotate the bars.

Our data looks like this:

In [None]:
questions_per_library = pd.read_csv(
    '../data/stackoverflow.zip', parse_dates=True, index_col='creation_date'
).loc[:,'pandas':'bokeh'].sum().sort_values()
questions_per_library

We will work through the following steps to create this visualization:

1. Create the bar plot.
2. Label, format, and apply a log scale to the plot.
3. Annotate each of the bars.

#### 1. Create the bar plot.
We once again use the `barh()` method to create horizontal bars; however, note that the `bar()` method can be used to create vertical bars:

In [None]:
fig, ax = plt.subplots(figsize=(7, 4))
ax.barh(questions_per_library.index, questions_per_library.to_numpy())

#### 2. Label, format, and apply a log scale to the plot.
To change the scale of an axis, we either pass a value to the `set_xscale()`/`set_yscale()` method or specify `xscale`/`yscale` as a keyword argument to the `set()` method:

```python
ax.set_xscale('log')
ax.set(xscale='log')
```

Now, we can actually see all the bars:

In [None]:
fig, ax = plt.subplots(figsize=(7, 4))
ax.barh(questions_per_library.index, questions_per_library.to_numpy())
ax.set(xlabel='total questions', xscale='log')
despine(ax)

#### 3. Annotate each of the bars.

To annotate the bars, we need to grab them off the `Axes` object, similar to how we grabbed the colors from the area plot. To access the bars, we use the `patches` attribute. Here, we write a function to iterate over each of these patches and annotate each bar with the total number of questions for that library:

In [None]:
def annotate_bars(ax):
    for bar in ax.patches:
        x, y = bar.get_xy()
        ax.text(
            x + bar.get_width(), y + bar.get_height()/2, f'{bar.get_width():,d} ',
            va='center', ha='right', color='white'
        )
    return ax
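
If it is unclear what the `patches` attribute contains, this small sketch with made-up values may help. It collects the width and vertical center of each bar, which are the two quantities the annotation logic relies on:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.barh(['a', 'b', 'c'], [10, 200, 3000])  # made-up values

# for horizontal bars, the width is the value and (x, y) is the lower-left corner
bar_info = [
    (bar.get_width(), bar.get_y() + bar.get_height() / 2)
    for bar in ax.patches
]
print(bar_info)
```

The bars come back in the order they were drawn, so the first entry corresponds to the first category.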

Putting everything together, our final visualization looks like this:

In [None]:
fig, ax = plt.subplots(figsize=(7, 4))
ax.barh(questions_per_library.index, questions_per_library.to_numpy())
ax.set(xlabel='total questions', xscale='log')
despine(ax)
annotate_bars(ax)

### Exercise 1.2

##### Using the data in `weather.csv`, make a vertical bar plot showing total monthly precipitation (`PRCP`) in Seattle. Annotate the bars.

In [None]:
# Complete exercise 1.2 

In [None]:
# TIP: the `despine()` function is available in utils.py

### Stacked bar plot
For the next few examples, we will be creating a stacked bar plot showing co-occurrences of the library tags. This gives us an idea of relationships between the libraries and how people use them. Converting the original dataset into the adjacency matrix we will need for these visualizations is not important for this training, so we will just read in a file with the adjacency matrix; however, those interested in the code behind this can find it in the [stackoverflow.ipynb](../data/collection/stackoverflow.ipynb) notebook.

Our data looks as follows. Here, we see that the largest co-occurrence with **hvplot** is **holoviews** followed by **bokeh**.

In [None]:
co_occur = pd.read_csv(
    '../data/stackoverflow_tag_co_occurrences.csv',
    index_col='library'
)
co_occur

Note that the diagonal contains all zeros so that summing each row gives the percentage of questions for that library that were also tagged with another library in this dataset. For example, most questions tagged with **hvplot** or **geoviews** were also tagged with another library in our list, but questions tagged with **pandas** were only tagged with another library 12.3% of the time:

In [None]:
co_occur.sum(axis=1)

In this example, we will also see another way of formatting an axis with `ticker` and learn how to customize colors. Let's work through the following steps:
1. Create the stacked bar plot.
2. Add the legend.
3. Label and format the plot.
4. Annotate the bars.
5. Change the color scheme.

#### 1. Create the stacked bar plot.
To make our stacked bar plot, we can still use the `barh()` method; however, we have to make multiple calls to it, each time specifying where the starting point should be (i.e., the end of the previous portion of the bar). We will package this logic up in a function:

In [None]:
def stacked_bars(data):
    fig, ax = plt.subplots(figsize=(6, 3))
    libraries = data.index

    last = 0
    for library in libraries:
        co_occurring_library = data[library]
        ax.barh(libraries, co_occurring_library, label=library, left=last)
        last += co_occurring_library
    
    ax.invert_yaxis()
    return despine(ax)

Calling our function gives us the start of our stacked bar plot visualization:

In [None]:
ax = stacked_bars(co_occur)
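
The running offset is the key idea here, so it may help to see it in isolation. This is a minimal sketch with made-up numbers (the names `groups` and `segments` are hypothetical) showing that each new segment starts where the previous one ended:

```python
import matplotlib.pyplot as plt
import numpy as np

groups = ['x', 'y']
segments = {
    'first': np.array([0.2, 0.1]),
    'second': np.array([0.3, 0.4]),
}

fig, ax = plt.subplots()
last = np.zeros(len(groups))  # where the next segment starts for each group
for name, values in segments.items():
    ax.barh(groups, values, left=last, label=name)
    last = last + values

# the starting point of each rectangle, in the order the bars were drawn
starts = [patch.get_x() for patch in ax.patches]
print(starts)
```

The first two rectangles start at zero, while the next two start at the widths of the segments drawn before them.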

#### 2. Add the legend.
Next, we add a legend to understand what the colors mean:

In [None]:
ax = stacked_bars(co_occur)
ax.legend(bbox_to_anchor=(1.35, 0.5), loc='center right', framealpha=0.5)

#### 3. Label and format the plot.
Then, we label our x-axis and set the limits of the x-axis (`xlim`) so that it goes from 0% to 100%. We also use `ticker.PercentFormatter` to display our x-axis tick labels as percentages. Note that we pass in `xmax=1` when instantiating it because our data is already expressed as proportions (1 corresponds to 100%); if the data contained raw values instead, passing in the value that corresponds to 100% would have the formatter calculate the percentages:

In [None]:
ax = stacked_bars(co_occur)
ax.legend(bbox_to_anchor=(1.35, 0.5), loc='center right', framealpha=0.5)
ax.set(xlabel='percentage of questions with co-occurrences', xlim=(0, 1))
ax.xaxis.set_major_formatter(ticker.PercentFormatter(xmax=1))
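
To illustrate the role of `xmax`, here is a small sketch that formats a single value with two differently configured formatters. The numbers are made up, and `decimals=0` is set explicitly so the output does not depend on the axis range:

```python
from matplotlib import ticker

# data already expressed as proportions: 1 corresponds to 100%
proportion_formatter = ticker.PercentFormatter(xmax=1, decimals=0)
from_proportion = proportion_formatter.format_pct(0.25, 1)

# raw counts out of a total of 200: 200 corresponds to 100%
count_formatter = ticker.PercentFormatter(xmax=200, decimals=0)
from_count = count_formatter.format_pct(50, 200)

print(from_proportion, from_count)
```

Both calls produce the same label because 0.25 out of 1 and 50 out of 200 are the same share of the whole.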

Before moving on, let's update our function to include what we have so far:

In [None]:
def stacked_bars(data):
    fig, ax = plt.subplots(figsize=(6, 3))
    libraries = data.index

    last = 0
    for library in libraries:
        co_occurring_library = data[library]
        ax.barh(libraries, co_occurring_library, label=library, left=last)
        last += co_occurring_library
    
    ax.invert_yaxis()
    ax.legend(bbox_to_anchor=(1.35, 0.5), loc='center right', framealpha=0.5)
    ax.set(xlabel='percentage of questions with co-occurrences', xlim=(0, 1))
    ax.xaxis.set_major_formatter(ticker.PercentFormatter(xmax=1))
    return despine(ax)

#### 4. Annotate the bars.
As with the previous example, we annotate our bars after accessing each via the `patches` attribute on the `Axes` object. However, this time we only label bars whose values surpass a threshold &ndash; note that the value of a bar here is its width:

In [None]:
ax = stacked_bars(co_occur)

for patch in ax.patches:
    width = patch.get_width()
    if width > .09:
        ax.text(
            patch.get_x() + width/2, patch.get_y() + patch.get_height()/2,
            f'{width:.1%}', va='center', ha='center', color='ivory', fontsize=11
        )

Let's package up the annotation logic before moving on:

In [None]:
def annotate_bars(ax, threshold):
    for patch in ax.patches:
        width = patch.get_width()
        if width > threshold:
            ax.text(
                patch.get_x() + width/2, patch.get_y() + patch.get_height()/2,
                f'{width:.1%}', va='center', ha='center', color='ivory', fontsize=11
            )
    return ax
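
The threshold check and the `.1%` format specification can be tried out without a plot. A minimal sketch with made-up widths:

```python
widths = [0.05, 0.123, 0.5]  # made-up bar widths (proportions)
threshold = 0.09

# `.1%` multiplies by 100, keeps one decimal place, and appends a percent sign
labels = [f'{width:.1%}' for width in widths if width > threshold]
print(labels)
```

Only the two widths above the threshold get a label, which is how we avoid cramming text into narrow segments.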

#### 5. Change the color scheme.
Throughout this workshop, we have been using default colormaps. For this visualization, we will take a look at how to change the colormap. Note that this is only one way of working with colormaps.

We will start by retrieving a colormap by name with the `plt.get_cmap()` function (the full list of colormaps is [here](https://matplotlib.org/stable/gallery/color/colormap_reference.html#sphx-glr-gallery-color-colormap-reference-py)). For this example, we will select the `tab10` qualitative colormap and reverse its order:

In [None]:
# cm.get_cmap() was removed in Matplotlib 3.9; plt.get_cmap() works across versions
cmap = plt.get_cmap('tab10').reversed()

*Note: The colors in the `tab10` colormap are the same as those in Matplotlib's default color cycle, but reversing it will change the order in which the colors are assigned. Since we have nine elements and ten colors, we will see a new color this time.*

The colormap object is a **callable**:

In [None]:
[cmap(i) for i in range(10)]

*Tip: Learn more about callables [here](https://www.pythonmorsels.com/topics/callables/).*
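
As a small illustration of calling a colormap, the sketch below looks up a single color by index and checks how reversing changes the order. The variable names are illustrative only:

```python
import matplotlib.pyplot as plt
from matplotlib.colors import to_hex

tab10 = plt.get_cmap('tab10')
reversed_tab10 = tab10.reversed()

first_color = tab10(0)  # an (R, G, B, A) tuple of floats between 0 and 1
print(to_hex(first_color))

# reversing flips the order, so the first color is now at the last index
print(reversed_tab10(9) == first_color)
```

For a qualitative colormap like this one, passing in an integer selects the color at that position.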

This means that when we update our `stacked_bars()` function to accept a colormap, we also need to update our `for` loop:

In [None]:
def stacked_bars(data, cmap):
    fig, ax = plt.subplots(figsize=(6, 3))
    libraries = data.index

    last = 0
    for i, library in enumerate(libraries):
        co_occurring_library = data[library]
        ax.barh(
            libraries, co_occurring_library, 
            label=library, left=last, color=cmap(i)
        )
        last += co_occurring_library
    
    ax.invert_yaxis()
    ax.legend(bbox_to_anchor=(1.35, 0.5), loc='center right', framealpha=0.5)
    ax.set(xlabel='percentage of questions with co-occurrences', xlim=(0, 1))
    ax.xaxis.set_major_formatter(ticker.PercentFormatter(xmax=1))
    return despine(ax)

Calling the updated function with the reversed colormap changes the colors on each of the bars:

In [None]:
ax = stacked_bars(co_occur, cmap)
annotate_bars(ax, threshold=0.09)

*Tip: The full list of colors Matplotlib recognizes by name can be found [here](https://matplotlib.org/stable/gallery/color/named_colors.html#sphx-glr-gallery-color-named-colors-py).*

### Exercise 1.3

##### Using the data in `weather.csv`, create a stacked horizontal bar plot of total precipitation per city per quarter (each city will have four segments &ndash; one for the total precipitation in each quarter of the year). Add a vertical line at Seattle's total precipitation.

In [None]:
# Complete exercise 1.3 

In [None]:
# TIP: the `despine()` function is available in utils.py

### Histogram

We will now be switching to a different dataset for the final two plot types that we will be discussing. The new dataset we will be working with contains NYC subway entrances and exits per borough per day for 2017-2021. It was resampled from [this](https://www.kaggle.com/eddeng/nyc-subway-traffic-data-20172021?select=NYC_subway_traffic_2017-2021.csv) Kaggle dataset created through some extensive data wrangling by Kaggle user [Edden](https://www.kaggle.com/eddeng). Our dataset looks like this:

In [None]:
subway = pd.read_csv(
    '../data/NYC_subway_daily.csv', parse_dates=['Datetime'], 
    index_col=['Borough', 'Datetime']
)
subway_daily = subway.unstack(0)
subway_daily.head()

We will build a histogram of daily subway entries in Manhattan with this data using the following steps:
1. Create the histogram.
2. Label and format the plot.
3. Explore the use of subplots.

#### 1. Create the histogram.
To create a histogram, we use the `hist()` method:

In [None]:
fig, ax = plt.subplots(figsize=(6, 3))
ax.hist(subway_daily.loc['2018', 'Entries']['M'], ec='black')

#### 2. Label and format the plot.
Next, we clean up the plot by labeling axes and formatting the tick labels:

In [None]:
fig, ax = plt.subplots(figsize=(6, 3))
ax.hist(subway_daily.loc['2018', 'Entries']['M'], ec='black')
ax.set(
    xlabel='Entries', ylabel='Frequency',
    title='Histogram of Daily Subway Entries in Manhattan'
)
ax.xaxis.set_major_formatter(ticker.EngFormatter())
despine(ax)

#### 3. Explore the use of subplots.
The histogram of daily subway entries in Manhattan shows a distribution that is clearly bimodal. Let's use subplots to separate out the weekday and weekend distributions that combine to form the shape that we are seeing.

First, we will need to create a Boolean mask to be able to filter our data by weekday versus weekend:

In [None]:
weekday_mask = subway_daily.index.weekday < 5
weekday_mask
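
To see how such a mask behaves, here is a minimal sketch on four made-up dates spanning a weekend (the names `dates`, `example_mask`, and `values` are not part of the subway dataset):

```python
import pandas as pd

# Friday, January 5, 2018 through Monday, January 8, 2018
dates = pd.date_range('2018-01-05', periods=4, freq='D')
example_mask = dates.weekday < 5  # Monday is 0 and Sunday is 6

values = pd.Series([10, 20, 30, 40], index=dates)  # made-up values
print(values[example_mask])   # weekdays only
print(values[~example_mask])  # weekends only
```

Negating the mask with `~` selects the complementary rows, so a single mask is enough to split the data into both groups.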

Next, we will need to update our call to `plt.subplots()` to specify one row of two columns as our layout, with all subplots sharing the same x-axis range (`sharex=True`). Since there are fewer weekend days than weekdays in the year, we don't share the y-axis (`sharey=False`). As before, we call the `hist()` method to add the histogram to each of the subplots:

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(8, 3), sharex=True, sharey=False)

for ax, mask in zip(axes, [~weekday_mask, weekday_mask]):
    ax.hist(subway_daily[mask].loc['2018', 'Entries']['M'], ec='black')

*Tip: When iterating over layouts with multiple rows and columns (e.g., a 2x2 layout), call the `flatten()` method on the NumPy ndarray of `Axes` objects to iterate over the `Axes` objects one by one rather than row by row.*
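
A short sketch of that pattern, using an empty 2x2 layout and placeholder titles:

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(6, 4), sharex=True)
print(axes.shape)

# flatten() yields the Axes in row-major order: top row first, left to right
for i, ax in enumerate(axes.flatten()):
    ax.set_title(f'subplot {i}')
```

Without `flatten()`, iterating over `axes` here would yield one row (an array of two `Axes` objects) at a time.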

Now, let's label and format the subplots. We will use the x-axis label to distinguish between the weekday and weekend distributions, and we will only provide a label for the y-axis of the leftmost plot to reduce clutter. This requires that we include a label in our `for` loop. Here, we are also updating the x-axis tick label format to use engineering notation:

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(8, 3), sharex=True, sharey=False)

for ax, mask, label in zip(axes, [~weekday_mask, weekday_mask], ['Weekend', 'Weekday']):
    ax.hist(subway_daily[mask].loc['2018', 'Entries']['M'], ec='black')
    ax.set_xlabel(f'Entries per {label}')
    ax.xaxis.set_major_formatter(ticker.EngFormatter())
    despine(ax)
axes[0].set_ylabel('Frequency')

Since we have multiple subplots, we need to call the `suptitle()` method on the `Figure` object to provide a title for the whole visualization. This shows the clear shift in subway usage between weekdays and weekends, along with the effect of the bridge-and-tunnel crowd:

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(8, 3), sharex=True, sharey=False)

for ax, mask, label in zip(axes, [~weekday_mask, weekday_mask], ['Weekend', 'Weekday']):
    ax.hist(subway_daily[mask].loc['2018', 'Entries']['M'], ec='black')
    ax.set_xlabel(f'Entries per {label}')
    ax.xaxis.set_major_formatter(ticker.EngFormatter())
    despine(ax)
axes[0].set_ylabel('Frequency')
fig.suptitle('Histogram of Daily Subway Entries in Manhattan')

### Box plot
As an alternative to the previous visualization, we will create box plots. Our initial code looks very similar &ndash; we just call the `boxplot()` method instead of the `hist()` method:

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(6, 2.5), sharey=True)
for ax, mask, label in zip(axes, [~weekday_mask, weekday_mask], ['Weekend', 'Weekday']):
    ax.boxplot(subway_daily[mask].loc['2018', 'Entries']['M'])
    ax.set_xlabel(label)
    ax.yaxis.set_major_formatter(ticker.EngFormatter())
    despine(ax)
axes[0].set_ylabel('daily subway entries')
fig.suptitle('Box Plot of Daily Subway Entries in Manhattan')

However, this time each of the subplots has a single tick with the label **1**. Rather than setting the x-axis label, we will need to use the `set_xticklabels()` method:

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(6, 2.5), sharey=True)
for ax, mask, label in zip(axes, [~weekday_mask, weekday_mask], ['Weekend', 'Weekday']):
    ax.boxplot(subway_daily[mask].loc['2018', 'Entries']['M'])
    ax.set_xticklabels([label]) # label the ticks instead of the axis this time
    ax.yaxis.set_major_formatter(ticker.EngFormatter())
    despine(ax)
axes[0].set_ylabel('daily subway entries')
fig.suptitle('Box Plot of Daily Subway Entries in Manhattan')

The final tweak we will cover in this section is the `tight_layout()` method, which will adjust the layout of the visualization to make better use of the space (more info [here](https://matplotlib.org/stable/tutorials/intermediate/tight_layout_guide.html?highlight=tight%20layout%20guide)). Notice that here it reduced the space between the subplots by adjusting the length of the x-axis of each subplot. This method can also be useful when labels are partially covered:

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(6, 2.5), sharey=True)
for ax, mask, label in zip(axes, [~weekday_mask, weekday_mask], ['Weekend', 'Weekday']):
    ax.boxplot(subway_daily[mask].loc['2018', 'Entries']['M'])
    ax.set_xticklabels([label])
    ax.yaxis.set_major_formatter(ticker.EngFormatter())
    despine(ax)
axes[0].set_ylabel('daily subway entries')
fig.suptitle('Box Plot of Daily Subway Entries in Manhattan')
fig.tight_layout()

*Tip: Save any of the visualizations we've built by calling the `plt.savefig()` function or the `savefig()` method on the `Figure` object as the last line in the cell generating the plot.*
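
As a minimal sketch of saving a plot, the example below writes a PNG to an in-memory buffer so that it doesn't create any files; the data is made up, and passing a file path to `savefig()` instead works the same way:

```python
import io

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 2.5))
ax.boxplot([1, 2, 3, 4, 10])  # made-up values
fig.tight_layout()

# a file path such as 'box_plot.png' (a hypothetical name) could be used instead
buffer = io.BytesIO()
fig.savefig(buffer, format='png', bbox_inches='tight')
print(buffer.getbuffer().nbytes)
```

Here, `bbox_inches='tight'` trims extra whitespace around the saved image.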

### Exercise 1.4

##### Using the data in `weather.csv`, generate histograms for the daily average wind (`AWND`) in each of the cities. Make sure to use subplots that share both the x- and y-axis.

In [None]:
# Complete exercise 1.4 

In [None]:
# TIP: the `despine()` function is available in utils.py