# Data Visualization
## Advanced Python for Life Sciences @ Physalia courses (Summer 2025)
### Marco Chierici, Fondazione Bruno Kessler - Data Science For Health

# Objectives

- Be able to create simple plots with Matplotlib and tweak them
- Know about object-oriented vs pyplot interfaces of Matplotlib
- Be able to adapt gallery examples
- Know that other tools exist

## Repeatability/reproducibility

From [Claus O. Wilke: "Fundamentals of Data Visualization"](https://clauswilke.com/dataviz/):

> *One thing I have learned over the years is that automation is your friend. I
> think figures should be autogenerated as part of the data analysis pipeline
> (which should also be automated), and they should come out of the pipeline
> ready to be sent to the printer, no manual post-processing needed.*

- **No manual post-processing**. This will bite you when you need to regenerate 50
  figures one day before submission deadline or regenerate a set of figures
  after the person who created them left the group.
- There is no single perfect language and **no single perfect library** for everything.
- Within Python, many libraries exist:
  - [Matplotlib](https://matplotlib.org/gallery.html):
    probably the most standard and most widely used
  - [Seaborn](https://seaborn.pydata.org/examples/index.html):
    high-level interface to Matplotlib, statistical functions built in
  - [Altair](https://altair-viz.github.io/gallery/index.html):
    declarative visualization (R users will be more at home), statistics built in
  - [Plotly](https://plotly.com/python/):
    interactive graphs
  - [Bokeh](https://demo.bokeh.org/):
    also good for interactivity
  - [plotnine](https://plotnine.readthedocs.io/):
    implementation of a grammar of graphics in Python, it is based on [ggplot2](https://ggplot2.tidyverse.org/)
  - [ggplot](https://yhat.github.io/ggpy/):
    R users will be more at home
  - ...
- Two main families of libraries: procedural (e.g. Matplotlib) and declarative
  (using grammar of graphics).


## Why Matplotlib?

- Matplotlib is perhaps the most "standard" Python plotting library.
- Many libraries build on top of Matplotlib.
- MATLAB users will feel familiar.
- Even if you choose to use another library (see the list above), chances are high
  that you will need to adapt somebody else's Matplotlib plot.
- Libraries that are built on top of Matplotlib may need knowledge of Matplotlib
  for custom adjustments.

However it is a relatively low-level interface for
drawing (in terms of abstractions, not in terms of quality) and does not
provide statistical functions. Some figures require typing and tweaking many lines of code.

Many other visualization libraries exist, each with its own strengths; the choice is also a
matter of personal preference. **Later we will also try other libraries.**

# Getting started with Matplotlib

Let us create our first plot:

In [None]:
# canonical import for matplotlib
import matplotlib.pyplot as plt

# this line tells Jupyter to display matplotlib figures in the notebook
%matplotlib inline

# this is Anscombe's quartet dataset 1 from
# https://en.wikipedia.org/wiki/Anscombe%27s_quartet
data_x = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
data_y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

# create a blank figure (canvas)
fig, ax = plt.subplots()

# add a plot
ax.scatter(x=data_x, y=data_y, c="#E69F00")

# customize aesthetics
ax.set_xlabel("we should label the x axis")
ax.set_ylabel("we should label the y axis")
ax.set_title("some title")

plt.show()
# uncomment the next line if you would like to save the figure to disk
# fig.savefig("my-first-plot.png")

When running a Matplotlib script on a remote server without a
"display" (e.g. compute cluster), you may need to add this line:

```python
import matplotlib
matplotlib.use("Agg")  # select the non-interactive backend before importing pyplot
import matplotlib.pyplot as plt

# ... rest of the script
```

To add semi-automatic legends, we first need to add a label to the data used in the scatterplot. Then, we call `ax.legend()`:

In [None]:
fig, ax = plt.subplots()

ax.scatter(x=data_x, y=data_y, c="#E69F00", label="set 1")

ax.set_xlabel("we should label the x axis")
ax.set_ylabel("we should label the y axis")
ax.set_title("some title")
ax.legend()

plt.show()

A legend is not really informative or necessary with just one data set, but it will come in handy once we plot several.

## Exercise

Extend the previous plot by also plotting this set of values but this time
  using a different color (`#56B4E9`):
  
  ```python
  data2_y = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
  ```

Then add another color (`#009E73`) which plots the second dataset, scaled
  by 2.0.
  ```python
  # here we multiply all elements of data2_y by 2.0
  data2_y_scaled = [y*2.0 for y in data2_y]
  ```

Try to add a legend to the plot with `ax.legend()`

Optionally save the plot to a file.

At the end it should look like [this one](https://drive.google.com/uc?export=view&id=1BRUlUkRArpVrFZwlnISccI4OuvTvsiVG)

In [None]:
# this is dataset 2
data2_y = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
# here we multiply all elements of data2_y by 2.0
data2_y_scaled = [y * 2.0 for y in data2_y]

fig, ax = plt.subplots()

ax.scatter(x=data_x, y=data_y, c="#E69F00", label="set 1")
ax.scatter(x=data_x, y=data2_y, c="#56B4E9", label="set 2")
ax.scatter(x=data_x, y=data2_y_scaled, c="#009E73", label="set 2 (scaled)")

ax.set_xlabel("we should label the x axis")
ax.set_ylabel("we should label the y axis")
ax.set_title("some title")
ax.legend()

plt.show()

---

### Why these colors?

This qualitative color palette is optimized for all color-vision
deficiencies, see
- <https://clauswilke.com/dataviz/color-pitfalls.html> and
- [Okabe, M., and K. Ito. 2008. "Color Universal Design (CUD):
How to Make Figures and Presentations That Are Friendly to Colorblind People."](http://jfly.iam.u-tokyo.ac.jp/color/).

---

## Matplotlib has two different interfaces

When plotting with Matplotlib, it is useful to know that there are **two approaches**, even though the reasons for this dual approach are outside the scope of this lesson.

The more modern option is an **object-oriented interface** (OO) - the one we just used. The `fig` and `ax` objects can be configured separately and passed around to functions:

In [None]:
fig, ax = plt.subplots()

ax.scatter(x=data_x, y=data_y, c="#E69F00")

ax.set_xlabel("we should label the x axis")
ax.set_ylabel("we should label the y axis")
ax.set_title("some title")

plt.show()

The more traditional option mimics MATLAB plotting and uses the **`pyplot` interface** - also known as MATLAB-like interface, where `plt` carries the global settings:

In [None]:
fig = plt.figure(figsize=(8, 4), dpi=100)

plt.scatter(x=data_x, y=data_y, c="#E69F00")

plt.xlabel("we should label the x axis")
plt.ylabel("we should label the y axis")
plt.title("some title")

plt.show()

When searching for help on the internet, you will find both approaches.

Although the `pyplot` interface looks more compact, **I recommend learning and using the object-oriented interface.**

Regardless of the interface you choose, **stick to it** in your code and avoid mixing interfaces.

### Why do I emphasize this?

One day you may want to write functions which wrap around Matplotlib function calls and then you can send `fig` and `ax` into these functions: there is less risk that adjusting figures changes global settings also for unrelated figures created in other functions.

When using the `pyplot` interface, settings are modified for the entire `plt` package. 

The latter is acceptable for linear scripts but may yield surprising results when introducing functions to enhance/abstract Matplotlib calls.
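
A minimal sketch of such a wrapper, assuming a hypothetical helper named `labelled_scatter` (not part of Matplotlib) and made-up data. Because the function receives the Axes it should draw on, it only touches that Axes and leaves other figures alone:

```python
import matplotlib.pyplot as plt


def labelled_scatter(ax, x, y, title, color="#E69F00"):
    """Hypothetical helper: draw a labelled scatterplot on the Axes it is given."""
    ax.scatter(x, y, c=color)
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.set_title(title)
    return ax


# two independent figures; each call only modifies the Axes passed in
fig_a, ax_a = plt.subplots()
fig_b, ax_b = plt.subplots()

labelled_scatter(ax_a, [1, 2, 3], [2, 4, 6], title="first figure")
labelled_scatter(ax_b, [1, 2, 3], [9, 4, 1], title="second figure", color="#56B4E9")
```

The same helper written with `plt.title()` and friends would act on whichever Axes happens to be "current" when it is called, which is harder to reason about once several figures are open.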

### Key principles

- Avoid manual post-processing, script everything.
- Browse a number of example galleries to help you choose the library
  that fits best your work/style.
- Figures for presentation slides and figures for manuscripts have
  different requirements.
- Think about color-vision deficiencies when choosing colors. Use
  existing solutions for this problem.

---

# Further steps with Matplotlib

## Reading data

We will experiment with some example weather data obtained from [Norsk KlimaServiceSenter](https://seklima.met.no/observations/), Meteorologisk institutt (MET) (CC BY 4.0).

In this example, we will load a CSV file directly from the web instead of using a local file name.

In [None]:
import pandas as pd

url = "https://raw.githubusercontent.com/coderefinery/data-visualization-python/main/data/tromso-daily.csv"
df_tromso = pd.read_csv(url)

df_tromso

## Combine dataframes

Using pandas we can **merge, join, concatenate, and compare** dataframes, see our dedicated lecture and also <https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html>.

Let us try to concatenate two dataframes: one for Tromsø weather data (which we already loaded into `df_tromso`) and one for Oslo, which we load right now:

In [None]:
url = "https://raw.githubusercontent.com/coderefinery/data-visualization-python/main/data/oslo-daily.csv"
df_oslo = pd.read_csv(url)
df_oslo

Now we can concatenate them:

In [None]:
df = pd.concat([df_tromso, df_oslo], axis=0, ignore_index=True)
df

`axis=0` above means to combine dataframes along the 1st axis, i.e., the rows.
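
To see the effect of `axis=0` and `ignore_index` in isolation, here is a small sketch on two toy dataframes (the names `df_a`, `df_b` and their values are made up for illustration):

```python
import pandas as pd

# toy stand-ins for the two weather tables (made-up values)
df_a = pd.DataFrame({"name": ["A", "A"], "value": [1.0, 2.0]})
df_b = pd.DataFrame({"name": ["B", "B"], "value": [3.0, 4.0]})

# axis=0 stacks the rows; ignore_index=True renumbers them 0..n-1
stacked = pd.concat([df_a, df_b], axis=0, ignore_index=True)

# without ignore_index, the original row labels are kept and repeat
stacked_keep = pd.concat([df_a, df_b], axis=0)

print(stacked.shape)             # (4, 2)
print(list(stacked.index))       # [0, 1, 2, 3]
print(list(stacked_keep.index))  # [0, 1, 0, 1]
```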

**Mini-exercise** Once the dataframes are combined, we can do interesting queries: for example, compute the summary statistics of all variables, breaking down by the "name" column. (hint: `describe`, `groupby`)

In [None]:
# solution
df.groupby("name").describe().transpose()

## Data cleaning

Before plotting the data, we realize that the `date` data type is not inferred as date or time by pandas, but rather as "object" (in pandas lingo, that's a string).

In [None]:
df["date"]

So we convert the dates into proper "datetime" format:

In [None]:
# convert dd.mm.yyyy strings to proper datetime values
df["date"] = pd.to_datetime(df["date"], format="%d.%m.%Y")
df["date"]
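
The conversion can be checked on a couple of made-up dates written in the same day.month.year layout. This is only a sketch; the variable names `raw` and `parsed` are illustrative:

```python
import pandas as pd

# made-up dates in the same day.month.year layout as the weather files
raw = pd.Series(["01.02.2023", "15.11.2023"])

# an explicit format removes any day/month ambiguity
parsed = pd.to_datetime(raw, format="%d.%m.%Y")

print(parsed.dt.day.tolist())    # [1, 15]
print(parsed.dt.month.tolist())  # [2, 11]
```

Once the column holds datetime values, the `.dt` accessor gives access to components such as year, month and day.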

## Exercise: Plotting data with Matplotlib

Consider now the Tromsø dataframe only.

1. Convert its 'date' column to proper date and time format. Then, use Matplotlib to create a simple line plot to visualize the maximum temperature as a function of time.

2. Add meaningful x & y axis labels and a title.

3. (Optional) Save the output figure to a file named "temperatures-tromso.png".

Hints:
* besides `scatter`, Matplotlib has a more general `plot` function with similar arguments
* use the object-oriented interface (i.e., the one with `ax`), or the MATLAB-like interface (i.e., the one with `plt`), or try both!
* see the beginning of the notebook for how to save figures to files

In [None]:
# object-oriented interface
fig, ax = plt.subplots()

# make sure the date axis is in the right format
# we fixed it above for df but not for df_tromso !
df_tromso["date"] = pd.to_datetime(df_tromso["date"], format="%d.%m.%Y")

ax.plot(df_tromso["date"], df_tromso["max temperature"])

ax.set_xlabel("Date")
ax.set_ylabel("Max temperature")
ax.set_title("Temperatures in Tromsø")

fig.savefig("temperatures-tromso_oob.png")

# ending the cell with plt.show() keeps Jupyter from echoing the return value of the last call, e.g. Text(0.5, 1.0, 'some title')
plt.show()

In [None]:
# MATLAB-like interface
fig = plt.figure()

plt.plot(df_tromso["date"], df_tromso["max temperature"])

plt.xlabel("Date")
plt.ylabel("Max temperature")
plt.title("Temperatures in Tromsø")

# comment out the next line if you do not want to save the figure to disk
fig.savefig("temperatures-tromso.png")

# ending the cell with plt.show() keeps Jupyter from echoing the return value of the last call, e.g. Text(0.5, 1.0, 'some title')
plt.show()

---

# Seaborn quickstart

[Seaborn](https://seaborn.pydata.org/examples/index.html) is a visualization library built **on top of Matplotlib**, offering **more modern plot styles** and color defaults, as well as better-looking color palettes. In fact, the look-and-feel of Matplotlib plots is a little old-fashioned in the context of modern data visualization.

Another advantage of Seaborn is its **native handling of Pandas** DataFrames, whereas Matplotlib is not designed for use with them.

Moreover, Seaborn's API is more high-level than Matplotlib's, thus allowing you to create **even complex visualizations with less boilerplate code** than Matplotlib.

Let's start right away and compare vanilla Matplotlib with Seaborn

In [None]:
# data creation
import numpy as np

np.random.seed(0)
x = np.linspace(0, 10, 500)
y = np.cumsum(np.random.randn(500, 6), 0)

# plot with Matplotlib defaults
plt.plot(x, y)
plt.title("Matplotlib lineplot")
plt.legend("ABCDEF", ncol=2, loc="upper left")
plt.show()

Let's recreate the same plot in Seaborn. There is only one little extra step to put the input data into a pandas DataFrame - since Seaborn is designed to work with DataFrames! Moreover, the DataFrame should be **tidy**, so in long format.

In [None]:
x.shape

In [None]:
x[:10]

In [None]:
y.shape

In [None]:
data = pd.DataFrame(y, index=x, columns=["A", "B", "C", "D", "E", "F"])
data

Pandas `.melt()` takes a dataframe where one or more columns are identifier variables (in our case, the index) and all other columns are considered measured variables (in our case, A:F). These latter columns are unpivoted to the row axis, thus leaving just two non-identifier columns, "variable" and "value".

In [None]:
data = data.reset_index().melt(id_vars="index")
data
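
What `.melt()` does is easier to follow on a tiny table. The sketch below uses made-up values and the illustrative names `wide` and `long`:

```python
import pandas as pd

# a tiny wide table: one identifier column and two measured columns (made-up values)
wide = pd.DataFrame({"index": [0.0, 0.5], "A": [1, 2], "B": [3, 4]})

# every non-identifier column is unpivoted into (variable, value) pairs
long = wide.melt(id_vars="index")

print(long.shape)                 # (4, 3)
print(list(long.columns))         # ['index', 'variable', 'value']
print(long["variable"].tolist())  # ['A', 'A', 'B', 'B']
```

Each row of the long table now holds exactly one measurement, which is the layout Seaborn expects for its `x`, `y` and `hue` arguments.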

In [None]:
# canonical import
import seaborn as sns

plot = sns.lineplot(x="index", y="value", hue="variable", data=data)
plot.set_title("Seaborn lineplot")
plt.show()

### Exercise

Recreate the plot "Temperatures in Tromsø" using Seaborn instead of Matplotlib.

In [None]:
plot = sns.lineplot(x="date", y="max temperature", data=df_tromso)

plot.set_title("Temperatures in Tromsø")
plot.set_xlabel("Date")
plot.set_ylabel("Max temp")

# uncomment next line to save the figure
# plot.figure.savefig("temperatures.png")

---

Let's experiment now with *plotting multiple data* on the same plot.

As we saw earlier with `hue="variable"`, the `hue` parameter of Seaborn plotting functions splits the data by the categories of a column. Here we pass `hue="name"` to break the plot down by the `name` variable of the dataframe.

In [None]:
plot = sns.lineplot(x="date", y="max temperature", hue="name", data=df)

**We got both timeseries nicely drawn in the same plot.**

Let's experiment a little more. What if we wanted the two timeseries in separate panels?

We use the versatile function `sns.relplot()`:

In [None]:
plot = sns.relplot(
    x="date",
    y="max temperature",
    hue="name",
    col="name",  # try commenting this out or changing col to row
    kind="line",
    data=df,
)

# uncomment next line to save the figure
# plot.savefig("temperatures-side-by-side.png")

Note how by changing two lines we can change the data plotted (`x` or `y`) and the plot type (`kind`).

**Your turn!** 

Now plot the snow depth vs. time as a scatter plot instead of a line plot. Optionally save the figure to the file "snow-depth.png".

In [None]:
# solution
plot = sns.relplot(
    x="date",
    y="snow depth",  # we now plot the snow depth instead of the max temperature
    hue="name",
    col="name",
    kind="scatter",  # we changed "line" to "scatter"
    data=df,
)

# uncomment next line to save the figure
# plot.savefig("snow-depth.png")

---

# More on Matplotlib

## Anatomy of a Figure

![](https://matplotlib.org/stable/_images/anatomy.png)

##  Axes

A figure can have one or more subplots inside it called Axes, arranged in rows and columns. Every figure has at least one Axes. Don't confuse Axes with X and Y axis: they are different!

The Axes objects, such as the `ax` returned by `plt.subplots()` in the earlier examples, are what you think of as 'a plot': the region of the image with the data space. A given Figure can contain many Axes, but a given Axes object can only be in one Figure. Each **Axes** contains two (or three in the case of 3D) **Axis** objects (be aware of the difference between Axes and Axis!), which take care of the data limits (the data limits can also be controlled via the `set_xlim()` and `set_ylim()` Axes methods).

Each Axes has:

1. a title (set via `set_title()`);
1. an x-label (set via `set_xlabel()`);
1. a y-label (set via `set_ylabel()`).

The Axes class and its member functions are the primary entry point to working with the object-oriented programming (OOP) interface.
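
The relationships between Figure, Axes and Axis can be inspected directly. The following is a small sketch; the names `left` and `right` are only illustrative:

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2)
left, right = axes

# a Figure keeps a list of its Axes; each Axes points back to its Figure
print(len(fig.axes))       # 2
print(left.figure is fig)  # True

# each Axes owns one Axis object per direction
print(type(left.xaxis).__name__)  # XAxis
print(type(left.yaxis).__name__)  # YAxis

# data limits are set on the Axes and can be read back
left.set_xlim(0, 10)
print(left.get_xlim())
```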

## Axis

These are the number-line-like objects. They take care of setting the graph limits and generating the ticks (the marks on the axis) and ticklabels (strings labeling the ticks). The location of the ticks is determined by a Locator object and the ticklabel strings are formatted by a Formatter. The combination of the correct Locator and Formatter gives very fine control over the tick locations and labels.
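
As a sketch of how a Locator and a Formatter work together, the example below uses two classes from `matplotlib.ticker` on made-up data. The tick spacing of 5 and the one-decimal format are arbitrary choices for illustration:

```python
import matplotlib.pyplot as plt
from matplotlib.ticker import FormatStrFormatter, MultipleLocator

fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3, 4, 5], [0, 1, 4, 9, 16, 25], "o")

# Locator: place a major tick at every multiple of 5 on the y axis
ax.yaxis.set_major_locator(MultipleLocator(5))
# Formatter: write every y tick label with one decimal place
ax.yaxis.set_major_formatter(FormatStrFormatter("%.1f"))
```

Other Locator and Formatter classes in `matplotlib.ticker` follow the same pattern: set one on `ax.xaxis` or `ax.yaxis` and Matplotlib applies it whenever the figure is drawn.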


## How to draw two scatterplots in different panels

Suppose I want to draw our two sets of points in two separate plots side-by-side instead of the same plot. 

You can do that by creating two separate subplots, or Axes, using `plt.subplots(1, 2)`, which creates 1 row with 2 subplots. The command returns two objects:

1. the figure
1. the axes (subplots) inside the figure

Previously, we could simply call `plt.plot()`: since there was only one Axes, Matplotlib drew the points on it.

But now, since you want the points drawn on different subplots (axes), you have to call the plot function **on the respective axes**.

In [None]:
# create some data
x = np.linspace(0, 5, 10)
y = x**2

# Create Figure and Subplots, capturing them in separate variables
fig, axes = plt.subplots(1, 2)

# Grab the individual Axes objects from the array returned by plt.subplots
ax1 = axes[0]
ax2 = axes[1]

# Plot
ax1.plot(x, y, "o")
ax2.plot(x, y, "o")

plt.show()

The above code seems quite repetitive and can be further optimized:

In [None]:
fig, axes = plt.subplots(1, 2)

for ax in axes:
    ax.plot(x, y, "o")

plt.show()

Let's add titles and x-y labels:

In [None]:
fig, axes = plt.subplots(1, 2)

for ax in axes:
    ax.plot(x, y, "o")
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.set_title("Scatterplot")

plt.show()

Quite good, but the axes and labels of the two panels overlap: we fix this with the `fig.tight_layout` method, which automatically adjusts the positions of the axes on the figure canvas so that there is no overlapping content:

In [None]:
fig, axes = plt.subplots(1, 2)

for ax in axes:
    ax.plot(x, y, "o")
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.set_title("Scatterplot")

fig.tight_layout()
plt.show()

Better, but we notice that the y-axis labels of the right-hand panel are redundant: in the following example, we create two Axes sharing the y axis.

In [None]:
fig, axes = plt.subplots(1, 2, sharey=True)

for ax in axes:
    ax.plot(x, y, "o")
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.set_title("Scatterplot")

fig.tight_layout()
plt.show()
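
In the cell above, `set_ylabel` is still called for both panels, so the axis title "y" appears twice even though the tick labels are shared. A possible variant, not part of the original example, labels only the leftmost panel:

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 5, 10)
y = x**2

fig, axes = plt.subplots(1, 2, sharey=True)

for ax in axes:
    ax.plot(x, y, "o")
    ax.set_xlabel("x")
    ax.set_title("Scatterplot")

# with a shared y axis, one y-axis label on the leftmost panel is enough
axes[0].set_ylabel("y")

fig.tight_layout()
```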

That sounds like a lot of functions to learn! It's actually quite easy to remember them.

The `ax1` and `ax2` objects, like `plt`, have equivalent `set_title`, `set_xlabel` and `set_ylabel` methods. In fact, `plt.title()` simply calls `set_title()` on the current Axes to do the job.

* `plt.xlabel()` → `ax.set_xlabel()`
* `plt.ylabel()` → `ax.set_ylabel()`
* `plt.xlim()` → `ax.set_xlim()`
* `plt.ylim()` → `ax.set_ylim()`
* `plt.title()` → `ax.set_title()`
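
The correspondence can be verified with a short sketch: after `plt.subplots()`, the new Axes is the "current" one, so the `pyplot` calls end up on it.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# pyplot functions act on the "current" Axes, here the one just created
plt.title("set via pyplot")
plt.xlabel("x from pyplot")

print(plt.gca() is ax)  # True
print(ax.get_title())   # set via pyplot
print(ax.get_xlabel())  # x from pyplot
```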

Alternatively, to save keystrokes, you can set multiple things in one go using `ax.set()`:

In [None]:
fig, axes = plt.subplots(1, 2, sharey=True)

for ax in axes:
    ax.plot(x, y, "o")
    ax.set(title="Scatterplot", xlabel="x", ylabel="y", xlim=(0, 6), ylim=(0, 30))

fig.tight_layout()
plt.show()

---

# Plotting with Pandas

We can plot pandas dataframes directly! This implicitly uses `matplotlib.pyplot`.

Plotting in pandas is as simple as using a `.plot()` method on a series or dataframe.

To demonstrate pandas plotting, let's reload the subset of the breast cancer dataset that we prepared earlier this week:

In [None]:
from pathlib import Path

DATADIR = Path("data")

bc_data = pd.read_csv(DATADIR / "bc_new_table.txt", sep="\t", index_col=0)
bc_data.head()

`pandas` incorporates part of the `matplotlib` library to create quick data visualizations. 

We call the `.plot(x, y, kind)` method directly on a dataframe and specify the column names we want to plot (`x` and `y` arguments) and the type of plot (`kind`).

Some of the supported `kind`s of plots are:
- 'scatter' for scatterplots,
- 'hist' for histograms,
- 'bar' for vertical barplots,
- 'barh' for horizontal barplots,
- 'box' for boxplots,
- 'kde' for kernel density estimate plots (or simply density plots)

Some kinds of plots need both `x` and `y`, others just `y`.

In [None]:
# to build a scatter plot of 'texture' vs 'radius'
bc_data.plot(kind="scatter", x="radius", y="texture");

In [None]:
# by exploring the parameters, we can customize the plot
bc_data.plot(
    kind="scatter",
    x="radius",
    y="texture",
    figsize=(6, 5),  # custom figure size
    alpha=0.6,  # opacity
    color="blue",  # fill color
    edgecolor="w",
    s=80,  # point size
)

What about customizing titles and x/y axis labels? How to save a pandas plot to file?

This has to be done using Matplotlib syntax, as below.

In [None]:
# we need to assign the pandas plot to a variable
plot = bc_data.plot(
    kind="scatter",
    x="radius",
    y="texture",
    figsize=(6, 5),  # custom figure size
    alpha=0.6,  # opacity
    color="blue",  # fill color
    edgecolor="w",
    s=80,  # point size
)

plt.title("Custom Scatterplot")
plt.xlabel("Radius")
plt.ylabel("Texture")

# get the current figure from the plot
fig = plot.get_figure()
# save the current figure to file
fig.savefig("test.png")
# plt.show()
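
Since `DataFrame.plot` returns a Matplotlib Axes, the same customization can also be written with the object-oriented interface, which avoids mixing the two styles. The sketch below assumes a toy dataframe with made-up values in place of `bc_data`:

```python
import pandas as pd

# made-up values standing in for two columns of the real table
toy = pd.DataFrame(
    {"radius": [10.0, 12.5, 14.0, 17.2], "texture": [15.1, 18.3, 20.0, 22.4]}
)

# DataFrame.plot returns a Matplotlib Axes, so its methods can be used directly
ax = toy.plot(kind="scatter", x="radius", y="texture", figsize=(6, 5))

ax.set_title("Custom Scatterplot")
ax.set_xlabel("Radius")
ax.set_ylabel("Texture")

# the parent Figure is reachable from the Axes; uncomment to save it
fig = ax.get_figure()
# fig.savefig("test-oo.png")
```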

### Exercise

Experiment on pandas' `plot` capabilities. 

1. Following the rationale above, first create a histogram of the `radius` variable. Make the bars light blue with black borders.

In [None]:
bc_data.plot(kind="hist", y="radius", color="lightblue", edgecolor="black", width=1.5);

2. Create an alternate version of the above by selecting the `radius` column from the Pandas dataframe just before calling the `.plot` method. Modify then the `.plot` arguments accordingly. Use the default color for the bars this time, with black borders.

In [None]:
bc_data["radius"].plot(kind="hist", edgecolor="black", width=1.5);

3. Try plotting multiple histograms on the same plot! One should be light blue and the other orange. 

Hint: (version 1) you can pass more than one column name to the `y` and `color` arguments of the `plot` method. Or (version 2) you can select more than one column from the dataframe and then call `plot`.

Add some transparency (`alpha`) to help tell the two histograms apart.

In [None]:
# version 1
bc_data.plot(
    kind="hist",
    y=["radius", "texture"],
    bins=15,
    alpha=0.8,
    width=1.8,
    color=["lightblue", "orange"],
    edgecolor="black",
);

In [None]:
# version 2
bc_data[["radius", "texture"]].plot(
    kind="hist",
    bins=15,
    alpha=0.8,
    width=1.8,
    color=["lightblue", "orange"],
    edgecolor="black",
);

4. (optional) It is also possible to put the `.plot()` right after some other functions applied to the dataframe: e.g. `df.some_func().plot()`. For example, create a bar plot to visualize the number of benign and malignant diagnoses in the dataset.

Hint: `value_counts()` or `value_counts(normalize=True)`

Make the bars light blue with blue edges.

In [None]:
bc_data["diagnosis"].value_counts(normalize=True).plot(
    kind="bar", color="lightblue", edgecolor="blue", fontsize=12, width=0.5, rot=0
)

---

# Seaborn plot gallery

With matplotlib, you may have much more control over colors, axes, sizes and shapes; on the other hand, it is much more user friendly to use Seaborn or Pandas plotting.

## Heatmaps

In [None]:
# correlation matrix
corr_table = bc_data.corr(numeric_only=True)

# plot it as a heatmap with seaborn
sns.heatmap(corr_table);

In [None]:
# change a few parameters
sns.heatmap(corr_table, cmap="Blues", linewidth=0.1);

The argument `cmap` accepts a [matplotlib colormap name](https://matplotlib.org/stable/users/explain/colors/colormaps.html), or a list of colors.
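
For correlation matrices, one common choice is a diverging colormap with the color limits fixed at -1 and 1, so that zero correlation sits at the neutral midpoint. This is a sketch on randomly generated data (the dataframe `toy` and its column names are made up), not on `bc_data`:

```python
import numpy as np
import pandas as pd
import seaborn as sns

# made-up data standing in for the numeric columns of the real table
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 4)), columns=["a", "b", "c", "d"])
corr = toy.corr()

# diverging colormap, limits fixed to the range of a correlation coefficient,
# and the values written inside the cells with two decimals
ax = sns.heatmap(corr, cmap="coolwarm", vmin=-1, vmax=1, annot=True, fmt=".2f")
```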

## Pair plots

Pair plots, also called "scatterplot matrices", are useful to explore correlations in multidimensional data.

In [None]:
sns.pairplot(bc_data);

**Your turn!**

Remake the scatterplot matrix above by coloring the points according to the diagnosis status.

In [None]:
sns.pairplot(bc_data, hue="diagnosis");

## Boxplots

`seaborn` can be extremely helpful as a first sneak peek at the correlation between multiple variables. In addition, seaborn also provides an efficient way to plot box plots directly from our dataframes using `sns.boxplot` function.

In [None]:
# create a simple boxplot with seaborn:
# by default they already look good

sns.boxplot(
    y="area",
    x="diagnosis",
    hue="diagnosis",
    palette=[
        "tomato",
        "lightgreen",
    ],  # colors to use for the levels of the ``hue`` variable
    data=bc_data,
    width=0.5,
);

In [None]:
# we can also add a stripplot to make it more informative
sns.boxplot(
    y="area",
    x="diagnosis",
    hue="diagnosis",
    palette=["tomato", "lightgreen"],
    data=bc_data,
    width=0.5,
)

sns.stripplot(
    y="area",
    x="diagnosis",
    hue="diagnosis",
    palette=["tomato", "lightgreen"],
    alpha=0.6,
    data=bc_data,
)

plt.show()

Notice how Seaborn "adds" plots to the same figure above.

In [None]:
# tweak the plotting parameters to make it more appealing
sns.boxplot(
    y="area",
    x="diagnosis",
    hue="diagnosis",
    palette=["tomato", "lightgreen"],
    data=bc_data,
    width=0.5,
)

sns.stripplot(
    y="area",
    x="diagnosis",
    hue="diagnosis",
    palette="dark:black",
    alpha=0.5,
    size=4,
    edgecolor="w",
    linewidth=0.5,
    data=bc_data,
);

## Histograms and density plots

Seaborn produces histograms with `sns.histplot` and density plots with `sns.kdeplot`.

### Exercise

Using the same rationale above, create a histogram of the `radius` variable with Seaborn.

In [None]:
sns.histplot(data=bc_data, x="radius")

Now, add to the same plot the histogram of the variable `texture`. Hint: see the boxplot example above.

In [None]:
sns.histplot(data=bc_data, x="radius")
sns.histplot(data=bc_data, x="texture")

See what happens if you add the optional argument `kde=True`.

In [None]:
sns.histplot(data=bc_data, x="radius", kde=True)
sns.histplot(data=bc_data, x="texture", kde=True)

Create now a density plot of the variable `area`. Try with and without the optional argument `fill=True`.

In [None]:
sns.kdeplot(data=bc_data, x="area", fill=True)

What happens if you create a density plot using two variables? e.g. `area` and `smoothness`. Again, try with and without the optional argument `fill=True`.

In [None]:
# a bivariate plot
sns.kdeplot(data=bc_data, x="area", y="smoothness")

## Plots with marginal distributions

These are obtained with `sns.jointplot`.

Assigning a `hue` variable will add conditional colors to the plots and an automatic legend.

In [None]:
sns.jointplot(x="concavity", y="texture", data=bc_data);

In [None]:
sns.jointplot(x="concavity", y="texture", hue="diagnosis",
              data=bc_data);

In [None]:
sns.jointplot(x="concavity", y="texture",
              kind="kde", # hist, hex, reg (regression)
              fill=True,
              data=bc_data);

## Faceted plots

To demonstrate faceted plots, we load the dataset "tips" from Seaborn:

In [None]:
tips = sns.load_dataset("tips")
tips.head()

Before proceeding, we compute the tip percentage and add it as a new `tip_pct` column. Then, we plot it as a histogram using Seaborn, with light gray bars and a black edge.

In [None]:
tips["tip_pct"] = 100 * tips["tip"] / tips["total_bill"]
sns.histplot(data=tips, x="tip_pct", color="lightgray", edgecolor="black")

Now we want to draw 4 histograms of the tip percentage broken down by both `time` (Lunch and Dinner) and `sex` (Male and Female): we would like to have `time` across the columns and `sex` across the rows.

We use seaborn's `FacetGrid()` to create the layout and then we use `map()` on the grid to apply a plotting function (`sns.histplot`) to each facet's subset of the data.

In [None]:
grid = sns.FacetGrid(tips, col="time", row="sex")
grid.map(sns.histplot, "tip_pct");

In [None]:
# margin_titles may not work in all situations
grid = sns.FacetGrid(tips, col="time", row="sex", margin_titles=True)
grid.map(sns.histplot, "tip_pct");

In [None]:
grid = sns.FacetGrid(tips, row="sex", col="time")
grid.map_dataframe(sns.histplot, x="tip_pct");

This time I used `map_dataframe`, which acts like `map` but is meant for plotting functions that accept a long-form DataFrame as a `data` keyword argument and access the
data in that DataFrame using string variable names (`x="tip_pct"`).

### Exercise (optional)

Following the above examples, try plotting:

1. two scatterplots in one row, showing `tip` vs. `total_bill` faceted by `sex` (on the columns), and colored by `smoker` (use `hue` in the call to `FacetGrid`)

In [None]:
grid = sns.FacetGrid(tips, col="sex", hue="smoker")
grid.map(sns.scatterplot, "total_bill", "tip", alpha=0.7)
grid.add_legend()
plt.show()

2. four barplots in one row, showing `total_bill` by `sex`, faceted by `day` (on the columns)

In [None]:
g = sns.FacetGrid(tips, col="day", height=4, aspect=.5)
g.map(sns.barplot, "sex", "total_bill", order=["Male", "Female"])
plt.show()

---

# Interactive plotting: plotly

Plotly is a powerful tool designed to create interactive, publication-quality graphs. 

From simple line charts to complex three-dimensional surfaces, plotly provides an extensive range of chart types and can be used in Jupyter Notebooks, as a standalone HTML, and integrated directly into web applications. 

What sets plotly apart is its ability to create **interactive plots** that the user can engage with; for example, by *zooming in* on areas of interest or *hovering* over data points to reveal more information.

One of plotly's greatest strengths is its compatibility with numerous data science libraries such as Pandas, NumPy, and SciPy, allowing for seamless integration with the broader Python data science ecosystem.

Plotly has various modules for different purposes:

1. `plotly.graph_objects` (imported as `go`) is part of Plotly’s core functionality and is used for creating a wide range of charts and graphs. It offers fine-grained control over the appearance of the plots, including colors, layouts, and annotations.
2. `plotly.express` (imported as `px`) is a higher level, function oriented API providing a simpler, more streamlined syntax for creating common types of charts and visualizations.

## Installation

```
conda install -c conda-forge plotly
```

or

```
pip install plotly
```

For use within JupyterLab, you will also need `anywidget`:

```
conda install -c conda-forge jupyterlab anywidget
```

```
pip install jupyterlab anywidget
```

## First steps with plotly.express

In [None]:
# import numpy as np
# import pandas as pd
# import matplotlib.pyplot as plt

In [None]:
# canonical import
import plotly.express as px

# x, y are list-like objects
x = np.arange(5)
y = x**2

fig = px.scatter(x=x, y=y)
fig.show()

In [None]:
# x, y are columns of a pandas dataframe
df = px.data.iris()  # iris is a pandas DataFrame
fig = px.scatter(df, x="sepal_width", y="sepal_length")
fig.show()

In [None]:
# tweaking color and size options creates a bubble chart
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species",
                 size="petal_length", hover_data=["petal_width"])
fig.show()

In [None]:
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species", symbol="species")
fig.show()

In [None]:
# marginal distribution plot
fig = px.scatter(df, x="sepal_length", y="sepal_width", marginal_x="histogram", marginal_y="rug")
fig.show()

In [None]:
# a bar plot
categories = ["A", "B", "C", "D", "E"]
values = [15, 22, 18, 12, 28]

fig = px.bar(x=categories, y=values)

fig.show()

Styling plotly express plots:

In [None]:
df = px.data.tips()
fig = px.histogram(df, x="day", y="total_bill", color="sex")
fig.show()

In [None]:
fig = px.histogram(df, x="day", y="total_bill", color="sex",
            title="Receipts by Payer Gender and Day of Week",
            width=600, height=400,
            labels={  # replaces default labels by column name
                "sex": "Payer Gender",  "day": "Day of Week", "total_bill": "Receipts"
            },
            category_orders={  # replaces default order by column name
                "day": ["Thur", "Fri", "Sat", "Sun"], "sex": ["Male", "Female"]
            },
            template="simple_white"
            )
fig.show()

Available themes and templates:

In [None]:
import plotly.io as pio
pio.templates

Update or modify plots:

In [None]:
fig = px.histogram(df, x="day", y="total_bill", color="sex",
            title="Receipts by Payer Gender and Day of Week",
            width=600, height=400,
            labels={ # replaces default labels by column name
                "sex": "Payer Gender",  "day": "Day of Week", "total_bill": "Receipts"
            },
            category_orders={ # replaces default order by column name
                "day": ["Thur", "Fri", "Sat", "Sun"], "sex": ["Male", "Female"]
            },
            template="simple_white"
            )

fig.update_layout(  # customize the font family
    font_family="Rockwell",
)

fig.add_shape( # add a horizontal line
    type="line", line_color="black", line_width=3, opacity=1, line_dash="dot",
    x0=0, x1=1, xref="paper", y0=950, y1=950, yref="y"
)

fig.add_annotation( # add a text callout with arrow
    text="below target", x="Fri", y=400, arrowhead=1, showarrow=True
)

fig.show()

## Saving and embedding Plotly charts

In [None]:
fig.write_html("customized_histogram.html")

In [None]:
%%html
<iframe
  src="customized_histogram.html"
  width="800"
  height="600"
  title="chart name"
  style="border:none">
</iframe>

## Saving static Plotly charts

You can easily export a static version of your Plotly charts, for example in PDF format.

To do this, first install Kaleido in your environment: it is the library Plotly uses to generate static images.

Type the following in a code cell:

```
!pip install -U kaleido
```

if you want to use `pip`, or

```
!conda install -y -c conda-forge python-kaleido
```

if you use `conda`.

You may need to *restart the kernel* afterwards (i.e., in Jupyter Lab's menu: Kernel > Restart Kernel and run up to selected cell...).
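To check whether the running kernel can actually see the newly installed package, something along these lines may help. It is only a sketch: `has_module` is a hypothetical helper written for this example, not part of Plotly or Kaleido.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` can be found by the current kernel."""
    return importlib.util.find_spec(name) is not None


if has_module("kaleido"):
    print("kaleido is available: static export should work")
else:
    print("kaleido not found: install it and restart the kernel")
```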

Then, you can go ahead and export any Plotly chart to a static format such as PDF or PNG.

In [None]:
fig.write_image("customized_histogram.pdf")  # the format is inferred from file extension

In [None]:
fig.write_image("customized_histogram.png")  # the format is inferred from file extension

## Exercise

Use Plotly with the `bc_data` dataset to recreate the marginal distribution plot of `texture` vs. `concavity`.

In [None]:
fig = px.scatter(
    bc_data,
    x="concavity",
    y="texture",
    color="diagnosis",  # optional, same as 'hue' in seaborn
    marginal_x="histogram",
    marginal_y="histogram",
)
fig.show()

---

# Constructing biologically relevant plots


## Manhattan plot

Manhattan plots are a variation of scatterplots used to display dense data. They typically show the p-values of an entire genome-wide association study (GWAS) on a genomic scale: $-\log_{10}(P)$ on the y axis and genomic position, grouped by chromosome, on the x axis. We'll use the package `qmplot`:

```
!pip install qmplot
```

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/plotly/dash-bio-docs-files/master/manhattan_data.csv")
df.head()

In [None]:
df.describe().transpose()

In [None]:
from qmplot import manhattanplot

ax = manhattanplot(
    data=df,
    chrom="CHR",
    pos="BP",
    pv="P",
    snp="SNP",
    xticklabel_kws={"rotation": "vertical"},
    # suggestiveline=None,
    # genomewideline=None,
    # CHR="7",  # single chromosome
    sign_marker_p=1e-8,  # threshold for SNP annotation
    is_annotate_topsnp=True,
)

plt.savefig("output_manhattan_plot.png")
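To see what such a plot encodes, here is a rough sketch of the two transformations behind a Manhattan plot: the $-\log_{10}$ transform of the p-values and a cumulative x coordinate that places chromosomes side by side. The `toy` table and its values are invented, and `qmplot` handles all of this internally, possibly in a different way.

```python
import numpy as np
import pandas as pd

# made-up table with the same column names as the GWAS data above
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "CHR": np.repeat([1, 2, 3], 4),
        "BP": np.tile([100, 200, 300, 400], 3),
        "P": rng.uniform(1e-6, 1, size=12),
    }
)

# y axis: small p-values become large positive numbers
toy["-log10P"] = -np.log10(toy["P"])

# x axis: shift each chromosome by the total length of the previous ones
chrom_len = toy.groupby("CHR")["BP"].max()
offsets = chrom_len.cumsum().shift(1, fill_value=0)
toy["x"] = toy["BP"] + toy["CHR"].map(offsets)

print(toy.head())
```

Plotting `x` against `-log10P` with one color per chromosome would give a basic Manhattan plot.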

## Volcano plot

Volcano plots are a special case of scatterplots. They are typically used to display the results of a differential gene expression analysis in terms of statistical significance ($-\log_{10}(P)$) vs. magnitude of change (fold change, effect size).

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/plotly/dash-bio-docs-files/master/volcano_data1.csv")
df.head()

In [None]:
df.describe().transpose()

In [None]:
df["-log10P"] = -np.log10(df["P"])

In [None]:
# determine "top genes" based on sensible thresholds
condition = (df["EFFECTSIZE"] >= 1) & (df["-log10P"] > 7)
# add a column
df["signif"] = "ns"
df.loc[condition, "signif"] = "Up"
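The cell above labels only the up-regulated genes. As a sketch of one possible extension, the mirrored threshold can be used to label down-regulated genes as well. The `toy` table is invented for illustration and the thresholds simply reuse the ones above.

```python
import numpy as np
import pandas as pd

# made-up results with the same column names as the volcano data
toy = pd.DataFrame(
    {
        "GENE": ["g1", "g2", "g3", "g4"],
        "EFFECTSIZE": [1.8, -2.1, 0.2, 1.5],
        "P": [1e-9, 1e-10, 1e-12, 1e-3],
    }
)
toy["-log10P"] = -np.log10(toy["P"])

# same significance threshold as above, applied to both directions
significant = toy["-log10P"] > 7
toy["signif"] = np.select(
    [
        significant & (toy["EFFECTSIZE"] >= 1),
        significant & (toy["EFFECTSIZE"] <= -1),
    ],
    ["Up", "Down"],
    default="ns",
)
print(toy)
```

With three classes, the `palette` dictionary passed to `sns.scatterplot` would need a third entry for `"Down"`.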

In [None]:
ax = sns.scatterplot(
    df,
    x="EFFECTSIZE",
    y="-log10P",
    hue="signif",
    palette={"Up": "red", "ns": "darkblue"},  # optional palette
    alpha=0.6,
    legend=None,
)

ax.axvline(x=1, color="grey", linestyle="--")
ax.axvline(x=-1, color="grey", linestyle="--")
ax.axhline(y=7, color="grey", linestyle="--")

df_top = df.loc[condition]
for _, row in df_top.iterrows():
    ax.text(row["EFFECTSIZE"], row['-log10P']+0.1, row["GENE"], fontsize=9)


In [None]:
# optional: automatic text repel with adjustText
# pip install adjustText
from adjustText import adjust_text

ax = sns.scatterplot(
    df,
    x="EFFECTSIZE",
    y="-log10P",
    hue="signif",
    palette={"Up": "red", "ns": "darkblue"},  # optional palette
    alpha=0.6,
    legend=None,
)

ax.axvline(x=1, color="grey", linestyle="--")
ax.axvline(x=-1, color="grey", linestyle="--")
ax.axhline(y=7, color="grey", linestyle="--")

df_top = df.loc[condition]
texts = [ax.text(row["EFFECTSIZE"], row['-log10P']+0.1, row["GENE"], fontsize=9) for _, row in df_top.iterrows()]

adjust_text(texts, arrowprops=dict(arrowstyle='-'));

## Clustergram: heatmap + dendrogram

Clustergrams are typically used for gene expression data. The hierarchical clustering represented by the dendrograms can be used to identify groups of genes with related expression levels represented by the heatmap.

In [None]:
df = pd.read_csv("data/hbr_uhr_top_deg_normalized_counts.csv", index_col=0)
df.head()

In [None]:
df.describe().transpose().head()

In [None]:
sns.clustermap(
    df,
    z_score=0,  # scale the rows by z-score (1 = scale the columns)
    cmap="viridis",  # custom color palette
    figsize=(8, 8),
    vmin=-1.5,  # min value on the colorbar
    vmax=1.5,  # max value on the colorbar
    cbar_kws={"label": "z score"},  # keyword arguments for the colorbar (here, its label)
    cbar_pos=(0.855, 0.8, 0.025, 0.15),  # optional position of the colorbar axes in the figure (left, bottom, width, height)
)
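To make the `z_score=0` option more concrete, the sketch below standardizes each row of a small made-up table by hand, which is, to a first approximation, what the option does before clustering. The `toy` values are invented and the internals of `sns.clustermap` may differ in detail.

```python
import pandas as pd

# made-up expression table: rows are genes, columns are samples
toy = pd.DataFrame(
    [[1.0, 2.0, 3.0], [10.0, 20.0, 60.0]],
    index=["geneA", "geneB"],
    columns=["s1", "s2", "s3"],
)

# row-wise z-score: subtract each row's mean, divide by each row's standard deviation
row_means = toy.mean(axis=1)
row_stds = toy.std(axis=1)
row_z = toy.sub(row_means, axis=0).div(row_stds, axis=0)

print(row_z)
```

After scaling, every gene has mean 0 and standard deviation 1 across samples, so the heatmap colors reflect relative rather than absolute expression levels.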

---

# Enrichment analysis

Suppose you have gene expression data for a cohort of patients, collected before and after an intervention, and you have identified a set of differentially expressed genes (DEGs) between the two conditions. Enrichment analysis asks which known pathways or functional categories are over-represented among those DEGs.

Many web tools and databases can be queried for this purpose, such as:

- [Database for Annotation, Visualization and Integrated Discovery (DAVID)](https://david.ncifcrf.gov/)
- [Ingenuity Pathway Analysis (IPA)](https://digitalinsights.qiagen.com/products-overview/discovery-insights-portfolio/analysis-and-visualization/qiagen-ipa/)
- [Metacore](https://clarivate.com/cortellis/solutions/early-research-intelligence-solutions/)
- [PandaOmics](https://insilico.com/pandaomics)
- [Enrichr](https://maayanlab.cloud/Enrichr/)
- [GSEA](https://www.gsea-msigdb.org/gsea/index.jsp)
- [STRING](https://string-db.org/)
- [Panther](http://www.pantherdb.org/)
- [Reactome](https://reactome.org/)
- [KEGG](https://www.genome.jp/kegg/)

...to name a few!

Here we'll set up an enrichment pipeline based on two ingredients: Enrichr and [GSEApy](https://gseapy.readthedocs.io/en/latest/introduction.html).

Enrichr is a user-friendly, free-to-use resource that lets you query gene lists against a large collection of gene-set libraries. GSEApy provides a Python wrapper for Enrichr, so you can query the characteristics of your DEGs from within Python.
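The basic idea behind this kind of over-representation query can be sketched with a hypergeometric tail probability: given a background of genes, how surprising is the overlap between the DEG list and a gene set? The function `hypergeom_sf` and all the numbers below are hypothetical, written only for this illustration. Enrichr computes its own statistics and reports additional scores.

```python
from math import comb


def hypergeom_sf(overlap: int, background: int, set_size: int, list_size: int) -> float:
    """P(X >= overlap) when drawing `list_size` genes from `background` genes,
    of which `set_size` belong to the gene set."""
    upper = min(set_size, list_size)
    favourable = sum(
        comb(set_size, i) * comb(background - set_size, list_size - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(background, list_size)


# hypothetical numbers: 20000 background genes, a 100-gene set,
# 200 DEGs, 8 of which fall in the gene set (1 expected by chance)
p_value = hypergeom_sf(overlap=8, background=20000, set_size=100, list_size=200)
print(f"p = {p_value:.2e}")
```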

To install GSEApy, run this command in a code cell:

```
!pip install gseapy
```

In [None]:
import gseapy as gp
import matplotlib.pyplot as plt

Enrichr allows you to query across multiple databases. To list the gene-set libraries available for querying, run:

In [None]:
names = gp.get_library_name()
print(names)

In [None]:
len(names)

As a proof of concept, we will use the transcriptomics dataset published by [Zak et al., PNAS, 2012](https://www.pnas.org/content/109/50/E3503), examining how seropositive and seronegative subjects respond to the Ad5 vaccine across various time points.

The fold change, ratio, p-value and adjusted p-value (q-value) are calculated with respect to baseline (timepoint = 0). We will filter for DEGs at day 1 post-vaccination: "up" DEGs have fold change > 1.5 and q-value < 0.05, while "down" DEGs have fold change < -1.5 and q-value < 0.05:

In [None]:
df = pd.read_csv("data/Ad5_seroneg.csv", index_col=0)

DEGs_up = (df[(df["fc_1d"] > 1.5) & (df["qval_1d"] < 0.05)]).index.tolist()
DEGs_down = (df[(df["fc_1d"] < -1.5) & (df["qval_1d"] < 0.05)]).index.tolist()

Gene Ontology (GO) and Reactome are very popular databases. GO Biological Processes (GOBP), GO Molecular Functions (GOMF) and GO Cellular Components (GOCC) usually give a rough idea of what the DEGs do and where their products localise within the cell. To query the upregulated DEGs against these 4 libraries (the three GO ones plus Reactome), run:

In [None]:
enr_GOBP_up = gp.enrichr(
    gene_list=DEGs_up,
    gene_sets=["GO_Biological_Process_2025"],
    organism="Human",
    outdir="out/enr_DEGs_GOBP_up",
    cutoff=0.05,
)

enr_GOMF_up = gp.enrichr(
    gene_list=DEGs_up,
    gene_sets=["GO_Molecular_Function_2025"],
    organism="Human",
    outdir="out/enr_DEGs_GOMF_up",
    cutoff=0.05,
)

enr_GOCC_up = gp.enrichr(
    gene_list=DEGs_up,
    gene_sets=["GO_Cellular_Component_2025"],
    organism="Human",
    outdir="out/enr_DEGs_GOCC_up",
    cutoff=0.05,
)

enr_Reactome_up = gp.enrichr(
    gene_list=DEGs_up,
    gene_sets=["Reactome_2022"],
    organism="Human",
    outdir="out/enr_DEGs_Reactome_up",
    cutoff=0.05,
)

Let's break these commands down individually to understand what they do.

Firstly, `gp.enrichr` sends the gene list assigned to `DEGs_up` to Enrichr. We then specify the gene-set library to query against (GOBP, GOMF, GOCC or Reactome) via `gene_sets`; here we make one call per library, hence the 4 distinct function calls.
Since the data come from humans, we set `organism` to `"Human"`.
Next we set the output path with `outdir`. In this example, the results are stored in subfolders of a folder named `out`. Finally, we set a significance cutoff of 0.05 via `cutoff`.
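Since the four calls differ only in the library name and output folder, they could also be generated in a loop. The sketch below only builds the keyword arguments; `enrichr_kwargs` is a hypothetical helper and `example_genes` holds placeholder names, not real results. The actual call to `gp.enrichr` is left as a comment because it needs GSEApy and network access.

```python
# library names as used in the calls above
libraries = {
    "GOBP": "GO_Biological_Process_2025",
    "GOMF": "GO_Molecular_Function_2025",
    "GOCC": "GO_Cellular_Component_2025",
    "Reactome": "Reactome_2022",
}


def enrichr_kwargs(gene_list, tag, library, direction="up"):
    """Build the keyword arguments for one Enrichr query (hypothetical helper)."""
    return {
        "gene_list": gene_list,
        "gene_sets": [library],
        "organism": "Human",
        "outdir": f"out/enr_DEGs_{tag}_{direction}",
        "cutoff": 0.05,
    }


example_genes = ["GENE1", "GENE2"]  # placeholders; use DEGs_up in practice
all_kwargs = {
    tag: enrichr_kwargs(example_genes, tag, library)
    for tag, library in libraries.items()
}

# With GSEApy installed and an internet connection:
# import gseapy as gp
# results_up = {tag: gp.enrichr(**kwargs) for tag, kwargs in all_kwargs.items()}
```

Keeping one result per library in a dictionary makes it easy to repeat the same plots for every library later on.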

In [None]:
enr_GOBP_up.results.head(5)

The table above provides the key statistics, giving you a quick sense of which pathways are enriched.

A bar plot is among the most widely used ways to visualize enriched terms: it shows the enrichment scores (e.g. p-values) of the top terms.

In [None]:
from gseapy.plot import barplot, dotplot

barplot(enr_GOBP_up.res2d, title='GO BP seroneg day 1 (up)', color='r');

The `barplot()` function plots the top 10 enriched terms with adjusted p-value < 0.05.
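The selection behind such a plot can be approximated by hand. The sketch below assumes a results table with `Term` and `Adjusted P-value` columns, as in the Enrichr output shown earlier; the `toy_results` values and the `top_terms` helper are invented for illustration and do not reproduce GSEApy's implementation.

```python
import numpy as np
import pandas as pd

# made-up enrichment results
toy_results = pd.DataFrame(
    {
        "Term": ["term A", "term B", "term C", "term D"],
        "Adjusted P-value": [1e-6, 0.2, 1e-3, 0.04],
    }
)


def top_terms(results, cutoff=0.05, top_n=10):
    """Keep the `top_n` most significant terms below `cutoff` (hypothetical helper)."""
    kept = results[results["Adjusted P-value"] < cutoff]
    kept = kept.nsmallest(top_n, "Adjusted P-value").copy()
    # the bar length is usually shown on a -log10 scale
    kept["-log10(adj P)"] = -np.log10(kept["Adjusted P-value"])
    return kept


selected = top_terms(toy_results)
print(selected)
```

The resulting table could then be drawn with any plotting library, for example as horizontal bars with Matplotlib.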

To determine the enriched pathways in another database, for example Reactome:

In [None]:
barplot(enr_Reactome_up.res2d, title='Reactome seroneg day 1 (up)', color='r');

Dot plots are similar to bar plots with the capability to encode one additional score as dot size.

In [None]:
dotplot(enr_GOBP_up.res2d)

To save your figure, make sure that you pass a file name to `ofname`. The image format is automatically determined by the extension you use:

In [None]:
dotplot(enr_GOBP_up.res2d, ofname="dotplot_test.pdf")

In [None]:
dotplot(enr_GOBP_up.res2d, ofname="dotplot_test.png")

---

# Credits

Partially abridged from the "Python for Scientific Computing" course (Aalto Scientific Computing, 2020), [Top 50 matplotlib Visualizations – The Master Plots](https://www.machinelearningplus.com/plots/top-50-matplotlib-visualizations-the-master-plots-python/), [Arthur Turrell](https://aeturrell.github.io/coding-for-economists/vis-common-plots.html), [Machine Learning Plus](https://www.machinelearningplus.com/plots/matplotlib-tutorial-complete-guide-python-plot-examples/), [matplotlib.org](https://matplotlib.org/tutorials/introductory/usage.html#sphx-glr-tutorials-introductory-usage-py).

---

# Further resources

- [The Python graph gallery](https://www.python-graph-gallery.com/): a great site with 100s of charts made with Python using mostly Matplotlib and Seaborn, organized in multiple sections with their associated reproducible code.
- [Matplotlib tutorials](https://matplotlib.org/stable/tutorials/index)
- [Seaborn gallery](https://seaborn.pydata.org/examples/index.html)