# Plotting and Data Visualization

A picture is worth a thousand words. When exploring an unknown data set, visualizing the data is a powerful way to gain a deeper understanding, and it is an important part of data science.

The Python ecosystem includes several low- and high-level plotting/visualization libraries. The most feature-complete and popular one is [**matplotlib**](https://matplotlib.org). Among the alternatives are [**bokeh**](https://bokeh.pydata.org/en/latest/) and [**plotly**](https://plot.ly/python/), which focus on interactive visualizations. A plotly example is given at the end of this notebook.

Libraries like [**seaborn**](http://seaborn.pydata.org) are built on matplotlib and provide a high-level interface for visual data analysis. The [**pandas**](http://pandas.pydata.org/) library also provides a high-level plotting interface that uses matplotlib. If you want a more general overview of libraries and interfaces, have a look at this [blog entry](https://www.anaconda.com/blog/python-data-visualization-2018-why-so-many-libraries).

In [None]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn

Here we configure some settings for the following plots:

In [None]:
seaborn.set_style("ticks")
plt.rcParams["figure.figsize"] = (16.0, 6.0)
plt.rcParams["axes.grid"] = True

#### Load the dataset
The [Wine Quality Data Set](https://archive.ics.uci.edu/ml/datasets/wine+quality) contains roughly 1600 red and roughly 4900 white wine samples; the `WINE_COLOR` variable below selects which of the two files is loaded. The columns hold chemical and physical characteristics. In addition, a quality score is given for each wine.

In [None]:
WINE_COLOR = "red"
df = pd.read_csv(f"../.assets/data/winequality/{WINE_COLOR}.csv.zip", sep=";")

In [None]:
df.head()

_Note:_  [Documentation of the data set and additional information](https://files.point-8.de/trainings/data-science-101/wine-quality/INFO.md)

## Types of Plots

### Bar Chart

As a first example we will check how the wine quality is distributed. To do so we use a **bar chart**, because we have discrete values to denote the wine quality. Here, we use the plotting functionality provided by the `pandas` library, which basically uses `matplotlib` under the hood.

In [None]:
df["quality"].value_counts().sort_index().plot(kind="bar")
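To make clearer what the bar chart actually displays, here is a minimal sketch of the counting step on its own. The scores in `quality` are made up for illustration; in the notebook the same calls run on `df["quality"]`.

```python
import pandas as pd

# Hypothetical quality scores (illustration only, not the wine data).
quality = pd.Series([5, 6, 5, 7, 6, 5, 4, 6, 5, 8])

# value_counts() counts how often each distinct score occurs.
# sort_index() orders the result by score instead of by frequency,
# so the bars appear in ascending score order along the x-axis.
counts = quality.value_counts().sort_index()
print(counts)
```

Without `sort_index()` the bars would be ordered by frequency, which makes the shape of the distribution harder to read.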

### Box plot

A good way to visualize the statistical distribution of a data set is the [**box plot**](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.boxplot.html). By definition, the _box_ itself contains the central 50% of the data. In the plots below, the blue box therefore spans all data points from the 0.25-quantile (Q1) to the 0.75-quantile (Q3). Its length is called the _interquartile range_ (IQR).

![](https://upload.wikimedia.org/wikipedia/commons/thumb/1/1a/Boxplot_vs_PDF.svg/704px-Boxplot_vs_PDF.svg.png)
[_Source_: Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Boxplot_vs_PDF.svg)


In our example the **median** (Q2, 0.5-quantile) is drawn as a solid green line and the **mean** as a green dotted line. Each whisker ends at the last data point that still lies within the maximum whisker length. There is no universal definition of this maximum length; it is often set to 1.5 x IQR. All data points beyond the whisker ends can be classified as outliers (marked by `x` in the plots below).

In [None]:
df["pH"].plot(
    kind="box", 
    showmeans=True, 
    meanline=True, 
    flierprops={"alpha": 0.5, "marker": "x"}
)
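The quantities behind a box plot can also be computed by hand. The following is a minimal sketch using `pandas` only, assuming the common 1.5 x IQR whisker rule; the sample values are hypothetical and stand in for a column such as `df["pH"]`.

```python
import pandas as pd

# Hypothetical sample values (illustration only).
values = pd.Series([3.0, 3.1, 3.2, 3.2, 3.3, 3.3, 3.4, 3.5, 3.6, 4.4])

# The box spans Q1 to Q3; its length is the interquartile range.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Maximum reach of the whiskers under the 1.5 x IQR rule.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Each whisker ends at the last data point inside these limits.
inside = values[(values >= lower_limit) & (values <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything beyond the limits would be drawn as an outlier.
outliers = values[(values < lower_limit) | (values > upper_limit)]
print(f"IQR={iqr:.3f}, whiskers=({whisker_low}, {whisker_high})")
print(outliers)
```

Note that the whiskers end at actual data points, not at the computed limits themselves.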

In [None]:
df["total sulfur dioxide"].plot(
    kind="box", 
    showmeans=True, 
    meanline=True, 
    flierprops={"alpha": 0.5, "marker": "x"}
)

Note that until now we have only called methods of `pandas`. They provide a high-level interface to the most commonly used plots. In many cases, we can get the visualizations we want by passing the right parameters to these high-level methods.

However, if we want more customized plots, we might have to go down one level and call `matplotlib` directly. The example below does this: it shows the distribution of every variable as a box plot and combines all of them in a single figure.

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=df.columns.size)
for idx, c in enumerate(df.columns):
    axes[idx].boxplot(
        df[c],
        showmeans=True,
        meanline=True,
        whis=1.5,  # whiskers extend up to 1.5 times the IQR
        tick_labels=[c],
        flierprops={"alpha": 0.5, "marker": "x"},
    )
plt.tight_layout()

### Histogram

Another possibility to visualize the distribution of data is the [**histogram**](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html). In this case we have a continuous range of pH values and want to visualize how the wines are distributed over this range. Each data point is added to its associated value range (**bin**). The height of each bar corresponds to the number of entries per bin.

In [None]:
df["pH"].hist(bins=30)
plt.title("pH values")

We can customize the binning scheme by passing a `numpy` array of bin edges. For example, `np.linspace(2.5, 4, 16)` produces 16 equally spaced edges between 2.5 and 4, i.e. 15 bins of width 0.1. When several histograms are drawn in the same plot, automatically chosen bins can differ between them, so setting the edges explicitly like this is recommended.

In [None]:
df["pH"].hist(bins=np.linspace(2.5, 4, 16))
plt.title("pH values")
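The relation between bin edges and bins is easy to get wrong, so here is a small sketch that checks it with `numpy` alone. The two samples are hypothetical; the point is that explicit edges give both of them identical bins.

```python
import numpy as np

# 16 equally spaced edges between 2.5 and 4 define 15 bins.
edges = np.linspace(2.5, 4, 16)
n_bins = len(edges) - 1
width = edges[1] - edges[0]

# Two hypothetical samples (illustration only).
sample_a = np.array([2.9, 3.05, 3.1, 3.12, 3.3])
sample_b = np.array([3.31, 3.5, 3.9, 3.95])

# Passing the same edges guarantees that both histograms use the same
# bins, so their bar heights can be compared directly.
counts_a, _ = np.histogram(sample_a, bins=edges)
counts_b, _ = np.histogram(sample_b, bins=edges)
print(n_bins, round(width, 3))
```

To get exactly 16 bins over the same range, one would pass 17 edges instead.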

### Violin plot

Another interesting option for showing distributions is the [**violin plot**](https://matplotlib.org/devdocs/api/_as_gen/matplotlib.pyplot.violinplot.html). It can be thought of as a combination of a box plot and a smoothed histogram. Here, the `seaborn` library is used.

In [None]:
seaborn.violinplot(
    x=df["alcohol"]).set(title='distribution of alcohol content');

### Scatter plot

To relate the distributions of two variables we can use [**scatter plots**](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.scatter.html). Here, we look at how the residual sugar concentration relates to the volumetric alcohol concentration.

In [None]:
df.plot(
    kind="scatter",
    x="alcohol",
    y="residual sugar",
    title="Alcohol Content vs Residual Sugar",
)

## Fine tuning with matplotlib
So far, we have plotted the data directly with the high-level `pandas` API, which uses `matplotlib` under the hood, as mentioned before. Often, however, you want a more elaborate plot. Instead of building it from scratch with `matplotlib`, you can combine the convenience of plotting with `pandas` with fine-tuning in `matplotlib`. To do that, create your plot in a more [object-oriented way](https://matplotlib.org/3.2.1/tutorials/introductory/lifecycle.html) and adjust the [matplotlib.axes object](https://matplotlib.org/api/axes_api.html) afterwards.

In [None]:
# Define your axes-object
fig, axes = plt.subplots(dpi=100)

# Plot with the known pandas interface and set the `axes`-object
df["pH"].hist(bins=30, ax=axes)

# Use matplotlib commands to fine-tune your plot, for example:
axes.set_title("Histogram pH values")
axes.set_xlabel("pH value")
axes.set_ylabel("Count")
axes.axhline(100, color="red")
axes.set_xlim(2.6, 4.2)
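A related pattern that may be useful: the `pandas` plotting methods return the `matplotlib` axes object they drew on, so you can also fine-tune a plot without creating the axes yourself first. This is a minimal sketch with hypothetical values in place of `df["pH"]`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical values (illustration only).
values = pd.Series([3.1, 3.2, 3.2, 3.3, 3.4, 3.4, 3.5])

# Series.plot returns the matplotlib Axes it has drawn on.
ax = values.plot(kind="hist", bins=5)

# From here on, the usual matplotlib Axes methods apply.
ax.set_xlabel("pH value")
ax.set_ylabel("Count")

# The parent figure is reachable from the axes if needed.
fig = ax.get_figure()
```

Creating the axes explicitly with `plt.subplots` remains the better choice when several plots should share one figure.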

## Save your plot
Instead of only showing your plot in the notebook, you can save it to disk. With [savefig](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.savefig.html) you can export the plot in different file formats (e.g. jpg, png, pdf), set the resolution (dpi), add edge and background colors, and more.

In [None]:
## Save to disk
df["pH"].hist(bins=30)
plt.savefig("example_plot.png", dpi=300, facecolor="lightgrey", edgecolor="black")
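As a variation, the figure object itself also has a `savefig` method, and the file format is inferred from the file extension. The sketch below writes the same figure in three formats into a temporary directory; the plotted data, the directory, and the file names are assumptions chosen for illustration.

```python
import tempfile
from pathlib import Path

import matplotlib.pyplot as plt

# Hypothetical data (illustration only).
fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])

# Temporary output directory so the sketch leaves no files behind
# in the working directory.
out_dir = Path(tempfile.mkdtemp())

saved = []
for suffix in (".png", ".pdf", ".svg"):
    target = out_dir / f"example_plot{suffix}"
    # bbox_inches="tight" trims surplus whitespace around the plot.
    fig.savefig(target, dpi=150, bbox_inches="tight")
    saved.append(target)

plt.close(fig)
print([p.name for p in saved])
```

Vector formats such as pdf and svg scale without loss, so `dpi` mainly matters for raster formats like png.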

## 2D-histogram as a _Heatmap_

Alternatively, we can visualize the point density using a [**2-dimensional histogram**](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist2d.html) (also called a **heatmap**). The value range is again divided into discrete bins, and the number of entries per bin is visualized using a [color map](https://matplotlib.org/users/colormaps.html) (_cmap_). The name heatmap indicates that, depending on the color scheme (e.g. blue to red), areas with more data appear "hotter".

In [None]:
plt.figure(figsize=(12, 9))
cmap = "viridis"  # alternative: "coolwarm"
plt.hist2d(x=df["alcohol"], y=df["residual sugar"], bins=30, cmap=cmap)
plt.xlabel("Alcohol [%]")
plt.ylabel("Residual sugar [g/l]")
cb = plt.colorbar()
cb.set_label("Number of entries per bin")
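The numbers behind such a heatmap can be inspected without plotting. The sketch below uses `np.histogram2d` on randomly generated stand-in data (not the wine data) to show that every point lands in exactly one cell of the 30 x 30 grid.

```python
import numpy as np

# Randomly generated stand-in data (illustration only).
rng = np.random.default_rng(seed=0)
x = rng.normal(10.5, 1.0, size=500)
y = rng.exponential(3.0, size=500)

# bins=30 splits each axis into 30 intervals, giving a 30 x 30 grid.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=30)

print(counts.shape)       # grid of per-bin counts
print(int(counts.sum()))  # total number of binned points
print(int(counts.max()))  # entries in the "hottest" bin
```

The color map then simply translates each entry of `counts` into a color.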

For roughly circular point clouds, hexagonal bins can give a better impression of the data than rectangular ones. In this example you can see that there is not exactly one "hotspot"; instead, the "hotspot" is spread over a number of bins.

In [None]:
plt.figure(figsize=(12, 9))
plt.hexbin(x=df["alcohol"], y=df["residual sugar"], gridsize=30, cmap="viridis")
plt.xlabel("Alcohol [%]")
plt.ylabel("Residual sugar [g/l]")
cb = plt.colorbar()
cb.set_label("Number of entries per bin")

# Interactive Plots

In some data analysis applications, the ability to explore the plots interactively is helpful - think about interactive dashboards. `plotly` is a library that is similar in its functionality to `matplotlib`, only that it outputs interactive plots. 

You can have a look at some [plotly examples](https://plotly.com/python/). Furthermore, you can set up your own dashboards with [plotly-dash](https://dash-gallery.plotly.host/Portal/). As already mentioned in the introduction of this chapter, there is also [bokeh](https://docs.bokeh.org/en/latest/docs/gallery.html) for interactive plots. But let's start with some examples with `plotly`. With [plotly-express](https://plotly.com/python/plotly-express/), there is an easy-to-use, high-level interface.

In [None]:
import plotly.express as px

### Histogram

This example is analogous to our histogram plot from above. With `plotly`, however, we get interactivity. Check out different [histogram styles](https://plotly.com/python/histograms/).

In [None]:
fig = px.histogram(df, x="pH", nbins=30)
fig.show()

### Box Plot
Same as above to get started. More [boxplot styles](https://plotly.com/python/box-plots/).

In [None]:
fig = px.box(df, x=["pH"])
fig.show()

### Scatter Plot
More [scatter styles](https://plotly.com/python/line-and-scatter/).

In [None]:
fig = px.scatter(df, x="alcohol", y="residual sugar")
fig.show()

## Export notebook as HTML
One nice property of `plotly` is that it is based on web technologies. When you export the notebook as HTML and open the exported file in a standard browser, the plot is not embedded as a static image but as an interactive plot driven by JavaScript. Keep in mind, though, that the data must then be stored within the file, which increases the file size.

# Matplotlib-style parameters

The following configuration parameters were changed from their default values at the beginning of this notebook:

``` 
seaborn.set_style("ticks")
plt.rcParams["figure.figsize"] = (16.0, 6.0)
plt.rcParams["axes.grid"] = True
```

If you want a more detailed look at the matplotlib style parameters, the [Matplotlib Style Configurator](https://matplotlib-style-configurator.herokuapp.com) can be a good starting point. You can change the parameters, preview what your plots will look like, and export the style sheet, which you can then include in your code so that your plots are generated with this specific style.
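Setting `plt.rcParams` changes the style for every plot that follows. If a setting should only apply to a single plot, `plt.rc_context` can be used instead. Here is a minimal sketch that reuses the two `rcParams` values from this notebook.

```python
import matplotlib.pyplot as plt

# Remember the current default so we can compare afterwards.
default_size = tuple(plt.rcParams["figure.figsize"])

# Inside the with-block the overrides are active ...
with plt.rc_context({"figure.figsize": (16.0, 6.0), "axes.grid": True}):
    fig, ax = plt.subplots()
    size_inside = tuple(fig.get_size_inches())
    plt.close(fig)

# ... and afterwards the previous settings are restored.
size_after = tuple(plt.rcParams["figure.figsize"])
print(size_inside, size_after == default_size)
```

This keeps style experiments local and avoids side effects on later cells.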

---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_