# Plotting with `DataFrame`s

Pandas `DataFrame`s also provide a plotting API through the [`.plot()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html) method. More specialized methods like [`.plot.bar()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.bar.html) are also available.

In the following, we will use what we learned earlier about selecting data from `DataFrame`s to explore some (hopefully interesting) plotting capabilities.

In [None]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot as plt

mpl.style.use("seaborn-v0_8-colorblind")

f"Pandas version: {pd.__version__ = }, Numpy version: {np.__version__ = }"

The simplest approach is to access each column individually; this gives us a `Series`, for which we already know how to create plots. We start by plotting the distribution of each measured quantity as a histogram in a separate graph.

In [None]:
df_iris = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
    names=["sepal length", "sepal width", "petal length", "petal width", "species"],
)

In [None]:
# Fill the gap!

We can achieve this kind of plot even more simply. As we can see, many of the parameters used above to specify the subplot layout with `plt.subplots()` can also be passed directly to the plotting method. We use the Boolean parameter `subplots=True` (in combination with `layout`) to generate a separate graph for each column. This makes it convenient to plot multiple columns from a single `DataFrame`.

There is one difference, namely in the bin widths. Above, the bins are computed individually for each plot, while in the plots below the bins are the same for every plot: pandas computes a single set of bin edges from the values of all plotted columns together.
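As a minimal sketch of the `subplots=True` and `layout` parameters described above, the following example uses a small synthetic `DataFrame` instead of the Iris data. The column names `a` to `d`, the random values, and the choice of 15 bins are assumptions made purely for illustration. The last lines compare the left bar edges of two subplots to check whether the bins are indeed shared.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook

import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical columns, not the Iris measurements).
rng = np.random.default_rng(seed=0)
df_demo = pd.DataFrame(
    {
        "a": rng.normal(loc=5.0, scale=0.8, size=150),
        "b": rng.normal(loc=3.0, scale=0.4, size=150),
        "c": rng.normal(loc=4.0, scale=1.7, size=150),
        "d": rng.normal(loc=1.0, scale=0.7, size=150),
    }
)

# One histogram per column, arranged in a 2x2 grid of subplots.
axes = df_demo.plot.hist(
    subplots=True,
    layout=(2, 2),
    bins=15,
    figsize=(8, 6),
)

# Each bar is a rectangle patch; its x position is the left bin edge.
left_edges_a = [patch.get_x() for patch in axes[0, 0].patches]
left_edges_b = [patch.get_x() for patch in axes[0, 1].patches]

# Prints whether both subplots use the same bin edges.
print(np.allclose(left_edges_a, left_edges_b))
```

With `layout` given, the call returns a two-dimensional array of `Axes` objects, so individual subplots can be customized afterwards via indexing such as `axes[0, 0]`.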

In [None]:
# Fill the gap!

For completeness we show the distributions in a single graph.

In [None]:
# Fill the gap!

## Exercises

:::{note} Please use the `DataFrame` plotting API whenever possible in the exercises below. For some of the tasks it will be helpful to know how to select values from the `DataFrame`. Refer to [this section](dataframe-acessing-rows-and-columns) if you need a refresher.
:::

### Boxplots

Create boxplots for each of the measured quantities: sepal length, sepal width, petal length, and petal width. Add grid lines along the y-axis. Also make sure to add a title and axis labels (with units?) where suitable.

There are two methods we can use, [`.plot.box()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.box.html) and [`.boxplot()`](https://pandas.pydata.org/docs/reference/api/pandas.plotting.boxplot.html).
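To get a feeling for how the two methods relate, here is a small side-by-side sketch on synthetic data. The column names `length` and `width` and the random values are hypothetical and only serve as an illustration. One visible difference is that `.boxplot()` draws grid lines by default (`grid=True` in its signature), whereas `.plot.box()` leaves the grid to the active Matplotlib style.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Synthetic stand-in data (hypothetical columns, not the Iris measurements).
rng = np.random.default_rng(seed=42)
df_demo = pd.DataFrame(
    {
        "length": rng.normal(loc=5.0, scale=1.0, size=100),
        "width": rng.normal(loc=3.0, scale=0.5, size=100),
    }
)

fig, (ax_left, ax_right) = plt.subplots(ncols=2, figsize=(10, 4), sharey=True)

# Variant 1: the box plot accessor of the general plotting API.
result_left = df_demo.plot.box(ax=ax_left, title="DataFrame.plot.box()")

# Variant 2: the dedicated boxplot method, which enables the grid by default.
result_right = df_demo.boxplot(ax=ax_right)
ax_right.set_title("DataFrame.boxplot()")
```

Both calls return a Matplotlib `Axes` object by default, so further customization (titles, labels, grid lines) works the same way for either variant.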

In [None]:
# Fill the gap!

In [None]:
ax = df_iris.boxplot(
    grid=False,
    ylabel="observed values / cm",
)

ax.grid(which="major", axis="y")
ax.set_title("Spread of lengths and widths in the Iris dataset")

Next, we are interested in the distribution of lengths and widths for each species. Generate a box plot for each species, just as above for the full dataset. Arrange the plots in a suitable grid of subplots.

In [None]:
# Fill the gap!

In [None]:
axes = df_iris.boxplot(
    by="species",
    ylabel="observed value / cm",
    sharey=True,
    figsize=(20, 4),
    layout=(1, 4),
)

# Get the figure containing the subplots from an axes object. We
# use the figure to customize the title of the full plot.
(
    axes[0]
    .get_figure()
    .suptitle(
        "Spread of lengths and widths in Iris dataset grouped by species",
        verticalalignment="bottom",
    )
)

for ax in axes:
    # Show grid lines along the y-axis only, as in the plot above.
    ax.grid(False)
    ax.grid(which="major", axis="y")

### Violin plot

Violin plots convey a similar type of information as boxplots or histograms (to some extent they are a combination of the two). While histograms bin the data, count the number of values that fall into each bin, and plot that count, violin plots show a "smoothed" version of this. These "smoothed histograms" estimate the observed data distribution by means of a kernel density estimate (KDE).
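The relation between a histogram and a violin plot can be illustrated with a short sketch on synthetic data. The bimodal sample below is made up for this purpose and is not taken from the Iris dataset; the number of bins is an arbitrary choice. The histogram is drawn horizontally so that its value axis lines up with that of the violin.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook

import numpy as np
from matplotlib import pyplot as plt

# Synthetic bimodal sample (an assumption for illustration only).
rng = np.random.default_rng(seed=1)
sample = np.concatenate(
    [
        rng.normal(loc=2.0, scale=0.4, size=200),
        rng.normal(loc=5.0, scale=0.8, size=300),
    ]
)

fig, (ax_hist, ax_violin) = plt.subplots(ncols=2, figsize=(8, 4), sharey=True)

# Binned counts, drawn horizontally to share the value axis with the violin.
ax_hist.hist(sample, bins=25, orientation="horizontal")
ax_hist.set_xlabel("count")
ax_hist.set_ylabel("value")
ax_hist.set_title("Histogram")

# Kernel density estimate of the same sample, mirrored around x = 1.
parts = ax_violin.violinplot([sample], showmedians=True)
ax_violin.set_xticks([1])
ax_violin.set_xticklabels(["sample"])
ax_violin.set_title("Violin plot")
```

The dictionary returned by `violinplot()` holds the drawn artists (for example under the key `"bodies"`), which is the usual entry point for customizing colors and line styles.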

Matplotlib provides us with the [`violinplot()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.violinplot.html) function, which can also be called as a method of `Axes` objects. Read through the documentation and try to generate a violin plot of the observed data distribution of the sepal length, petal length, sepal width, and petal width.

Also make sure to have a look at the examples at the bottom of the [violin plot documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.violinplot.html) to learn about violin plots themselves and how to customize them.

In [None]:
# Fill the gap!

### Scatter plots

[Earlier](plotting-data-from-iris-dataset) in this course we created scatter plots to check whether there is a correlation between different measured quantities in the Iris dataset.

Create scatter plots of all reasonable combinations of sepal length, sepal width, petal length, and petal width. Arrange the plots in a suitable grid of subplots.

In [None]:
# Fill the gap!