**NOTE: This notebook is written for the Google Colab platform. However it can also be run (possibly with minor modifications) as a standard Jupyter notebook.** 



In [None]:
#@title -- Installation of Packages -- { display-mode: "form" }
import sys
!{sys.executable} -m pip install git+https://github.com/michalgregor/class_utils.git
!{sys.executable} -m pip install umap_learn missingno

In [None]:
#@title -- Import of Necessary Packages -- { display-mode: "form" }
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
from class_utils import corr_heatmap, ColGrid, sorted_order, crosstab_plot, RainCloud
from umap import UMAP
import missingno as msno

In [None]:
#@title -- Downloading Data -- { display-mode: "form" }
DATA_HOME = "https://github.com/michalgregor/ml_notebooks/blob/main/data/{}?raw=1"

from class_utils.download import download_file_maybe_extract
download_file_maybe_extract(DATA_HOME.format("titanic.zip"), directory="data/titanic")

# also create a directory for storing any outputs
import os
os.makedirs("output", exist_ok=True)

## Exploratory Data Analysis: Visualization

In this notebook we are going to look at some common ways to visualize data in Python. Visualization is one of the most powerful tools in **exploratory data analysis**  (EDA). We have already seen some kinds of visualization in the automatic reports generated in the previous notebook. Here we are going to look at other ways to visualize data and get further insights.

To create the plots, we will be using `seaborn` – a powerful visualization library designed to be used with `pandas` – and `matplotlib` – a popular visualization library that `seaborn` is based on.

We will again be using the Titanic dataset.



In [None]:
numeric_inputs = ["Pclass", "Age", "SibSp", "Parch", "Fare"]
categorical_inputs = ["Sex", "Embarked"]
ignored = ["PassengerId", "Name", "Ticket", "Cabin"]
output = "Survived"

df = pd.read_csv("data/titanic/train.csv")
df.head()

### The Correlation Matrix

One of the most useful visualizations – and one that we generally want to do among the first – is the correlation matrix.

#### The Standard Correlation Matrix

One option is to use the standard kind of correlation matrix, which uses colours and annotations to visualize the correlations between all pairs of numeric variables. Values close to zero mean that there is little or no linear relationship between the two variables. Large positive and negative correlations are both of interest though: a large negative correlation means that the two variables are strongly related, but in opposite directions (as one increases, the other tends to decrease).
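
To make the meaning of the coefficient concrete, here is a minimal sketch on synthetic data (the variables `x`, `y_neg` and `y_none` are made up for illustration and have nothing to do with the Titanic dataset). It uses `np.corrcoef` to compute the Pearson correlation for a strongly decreasing relationship and for pure noise.

```python
import numpy as np

# Synthetic, made-up data: y_neg falls as x rises; y_none is unrelated noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 200)
y_neg = -2.0 * x + rng.normal(scale=1.0, size=x.size)
y_none = rng.normal(size=x.size)

# np.corrcoef returns the full correlation matrix; [0, 1] is the x-vs-y entry.
r_neg = np.corrcoef(x, y_neg)[0, 1]
r_none = np.corrcoef(x, y_none)[0, 1]

print(f"strong negative relationship: r = {r_neg:.2f}")
print(f"no relationship:              r = {r_none:.2f}")
```

The first coefficient should come out close to -1 and the second close to 0.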

The standard correlation matrix for our dataset might look as follows:



In [None]:
plt.figure(figsize=(10, 10))
corr_heatmap(df, map_type='standard', mask_diagonal=False);
plt.savefig("output/corr_unmasked.svg", bbox_inches="tight", pad_inches=0)

Note that the values at the diagonal are always going to be ones. These can be distracting and – especially if other correlations are rather smaller – it can even mess up the colour scale. It is therefore generally a good idea to mask the diagonal out.



In [None]:
plt.figure(figsize=(10, 10))
corr_heatmap(df, map_type='standard');
plt.savefig("output/corr.svg", bbox_inches="tight", pad_inches=0)

While the correlation coefficients give us an idea of what linear relationships exist in our data and how strong they are, some of these correlations may actually occur by chance. Given a finite amount of data, there is usually no way to be 100% sure that the observed correlations are real.

However, we can at least use the tools made available to us through statistics and compute the **statistical significance** of the correlations. Then, if it turns out that the **p-values** of some correlations are high (there is a good chance that these correlations occurred by chance), we can mask them out of the correlation matrix.

Here is what our correlation matrix would look like if we masked out correlations with p-values greater than 0.01. Note how this also makes our matrix sparser and therefore a bit easier to read.



In [None]:
plt.figure(figsize=(10, 10))
corr_heatmap(df, map_type='standard', p_bound=0.01);
plt.savefig("output/corr_pbound_masked.svg", bbox_inches="tight", pad_inches=0)
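
The idea behind the masking can be sketched without any plotting. The snippet below is only an illustration on synthetic data: it estimates the p-value with a simple permutation test, which is a technique we swap in here for clarity and not necessarily what `corr_heatmap` does internally. The helper `permutation_p_value` and the variables are hypothetical names.

```python
import numpy as np

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation of x and y."""
    rng = np.random.default_rng(seed)
    observed = abs(np.corrcoef(x, y)[0, 1])
    # Shuffling y destroys any real relationship, so the shuffled correlations
    # show what "correlation by chance" looks like for this sample size.
    hits = sum(
        abs(np.corrcoef(x, rng.permutation(y))[0, 1]) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)

# Synthetic data: one variable that depends on x, one that does not.
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y_related = x + rng.normal(scale=0.5, size=100)
y_unrelated = rng.normal(size=100)

p_bound = 0.01
for name, y in [("related", y_related), ("unrelated", y_unrelated)]:
    p = permutation_p_value(x, y)
    verdict = "keep" if p <= p_bound else "mask out"
    print(f"{name}: p = {p:.4f} -> {verdict}")
```

A correlation is kept only if shuffled data almost never produces a coefficient as large as the observed one.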

#### A SweetViz-Like Association Matrix

If we want a more powerful way to visualize relationships in the dataset, we can use an association matrix similar to that provided in SweetViz reports. This extends the standard correlation matrix in two important ways:

* It is able to display relationships that involve categorical variables:
  * Using the **correlation ratio** for numeric vs. categorical interactions;
  * Using the **uncertainty coefficient** for categorical vs. categorical interactions;
* It uses shapes (circles for numeric vs. numeric, rectangles for the rest) and sizes (to indicate magnitude) as well as colours to encode the values, which makes it much easier to read.

Note that the **uncertainty coefficient** is asymmetrical. You can pass `sym_u=True` to get its **symmetric version**.



In [None]:
plt.figure(figsize=(10, 10))
corr_heatmap(df, categorical_inputs=categorical_inputs);
plt.savefig("output/assoc.svg", bbox_inches="tight", pad_inches=0)
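
For readers who want to see what the two measures compute, here is a rough sketch using their textbook definitions. The function names and the toy data are our own, and `corr_heatmap` may differ in details (e.g. whether it reports the correlation ratio or its square).

```python
import numpy as np
import pandas as pd

def correlation_ratio(categories, values):
    """Eta: square root of the share of variance explained by category means."""
    values = pd.Series(values, dtype=float).reset_index(drop=True)
    categories = pd.Series(categories).reset_index(drop=True)
    group_means = values.groupby(categories).transform("mean")
    ss_between = ((group_means - values.mean()) ** 2).sum()
    ss_total = ((values - values.mean()) ** 2).sum()
    return float(np.sqrt(ss_between / ss_total)) if ss_total > 0 else 0.0

def entropy(series):
    """Shannon entropy (in nats) of the observed category frequencies."""
    p = series.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log(p)).sum())

def uncertainty_coefficient(x, y):
    """Theil's U(x | y): fraction of the uncertainty in x removed by knowing y."""
    x = pd.Series(x).reset_index(drop=True)
    y = pd.Series(y).reset_index(drop=True)
    h_x = entropy(x)
    if h_x == 0:
        return 1.0
    # Conditional entropy H(x | y): entropy of x within each y group,
    # weighted by the group's share of the rows.
    h_x_given_y = sum(len(g) / len(x) * entropy(g) for _, g in x.groupby(y))
    return (h_x - h_x_given_y) / h_x

# Toy data: the deck fully determines the side, but not the other way round.
deck = ["A", "A", "B", "B", "C", "C"]
side = ["port", "port", "port", "port", "starboard", "starboard"]
print(uncertainty_coefficient(side, deck))  # knowing deck tells us the side
print(uncertainty_coefficient(deck, side))  # knowing side only narrows the deck
print(correlation_ratio(["a", "a", "b", "b"], [1, 1, 3, 3]))
```

The two uncertainty coefficients differ, which is exactly the asymmetry mentioned above.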

### Visualizing Missingness

The `missingno` package provides a couple of interesting kinds of plots that can be used to visualize missingness of data. It can help you to get a quick feel for which columns have missing data, how much of it is missing and even if there are any interesting patterns to the missingness.

#### The Missingness Matrix

The first of these plots is the missingness matrix, which is a collection of stripes arranged according to the rows and columns in the dataset. White stripes indicate missing values, while black stripes indicate non-missing values. On the right, there is a plot that summarizes completeness/missingness of entire rows – this can help e.g. to spot rows with the maximum and minimum missingness in the dataset.




In [None]:
msno.matrix(df)
plt.savefig("output/missingness_matrix.svg", bbox_inches="tight", pad_inches=0)

The package includes several different kinds of plots, including a missingness bar plot, a dendrogram-style plot that clusters columns based on the correlation of their missingness patterns, etc.



In [None]:
msno.bar(df)
plt.savefig("output/missingness_barplot.svg", bbox_inches="tight", pad_inches=0)
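
If we want the numbers behind these plots, plain `pandas` is enough. The following sketch uses a tiny made-up dataframe (`toy` is not the Titanic data) to compute the share of missing values per column and the number of missing values per row.

```python
import numpy as np
import pandas as pd

# A tiny, made-up dataframe with some missing entries.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Share of missing values in each column, most incomplete column first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Number of missing values in each row.
missing_per_row = toy.isna().sum(axis=1)

print(missing_share)
print(missing_per_row)
```

The same two lines can be applied to `df` to get exact figures to go with the matrix and bar plots.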

### Distribution Plots

To explore all the individual variables, we can use various kinds of distribution plots.

#### Distributions of Numeric Variables

For single numeric variables, distributions are usually visualized using histograms. Histograms partition the continuous variable into a finite number of discrete bins and then plot the count (or the proportion) of points in each bin using a bar plot.

The histogram is also often overlaid with a smooth curve (kernel density estimate; KDE), which tries to approximate the underlying probability density function. This can make the plot a bit easier to read.



In [None]:
sns.histplot(x='Age', data=df, kde=True)
plt.savefig("output/hist.svg", bbox_inches="tight", pad_inches=0)
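
To see what happens under the hood, here is a small sketch of both ingredients using only `numpy`: binning via `np.histogram` and a hand-written Gaussian KDE. The sample values and the bandwidth of 8.0 are arbitrary choices made for illustration; `seaborn` selects the bandwidth automatically, so its curve will generally look different.

```python
import numpy as np

# A small made-up sample of ages.
ages = np.array([2, 18, 19, 22, 24, 25, 28, 30, 31, 35, 40, 54, 62], dtype=float)

# Histogram: split the range into 5 equal-width bins and count the points.
counts, edges = np.histogram(ages, bins=5)
print("bin edges:", edges)
print("counts:   ", counts)

def gaussian_kde_1d(sample, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample point."""
    z = (grid[:, None] - sample[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return kernels.mean(axis=1) / bandwidth

grid = np.linspace(-40.0, 110.0, 601)
density = gaussian_kde_1d(ages, grid, bandwidth=8.0)

# A density should integrate to (approximately) one.
area = density.sum() * (grid[1] - grid[0])
print(f"area under the KDE curve: {area:.3f}")
```

Note that the KDE curve extends beyond the smallest and largest observed values, because each bump has tails.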

One can typically configure the number of bins to make the plot more or less granular or specify the bins manually. 



In [None]:
sns.histplot(x="Age", data=df, kde=True, bins=5)
plt.savefig("output/hist_5bin.svg", bbox_inches="tight", pad_inches=0)

To run the same plotting function on more columns and display the results in a grid, we can use a `ColGrid` object, specifying the dataframe, the columns to use and the number of columns in the grid (`col_wrap`).



In [None]:
g = ColGrid(df, numeric_inputs, col_wrap=2)
g.map_dataframe(sns.histplot, kde=True);
plt.savefig("output/hist_colgrid.svg", bbox_inches="tight", pad_inches=0)

#### Distributions of Categorical Variables: Bar Plots

For categorical variables, we use **bar plots**  (using `sns.countplot`) instead of histograms. These are similar to histograms, but there is no need for binning since the variables are already discrete.



In [None]:
sns.countplot(x="Embarked", data=df)
plt.savefig("output/count.svg", bbox_inches="tight", pad_inches=0)

And we can again use `ColGrid` to make the same kind of plot for all categorical variables.



In [None]:
g = ColGrid(df, categorical_inputs + [output], col_wrap=2)
g.map_dataframe(sns.countplot);
plt.savefig("output/count_colgrid.svg", bbox_inches="tight", pad_inches=0)

##### Using `hue` in a `countplot`

Bar plots created using `countplot` (like most other `seaborn` plots, actually) also accept a `hue` argument, which can be used to break up a single column into multiple coloured columns by some discrete variable.

For instance, we can plot our "embarked" counts again, but break them up by passenger class this time. The resulting plot will tell us e.g. that most 3rd class passengers embarked in Southampton.



In [None]:
sns.countplot(x="Embarked", hue="Pclass", data=df)
plt.savefig("output/count_hue.svg", bbox_inches="tight", pad_inches=0)

### Interaction Plots

The next thing we can visualize are the various ways in which pairs of variables interact.

#### Numeric vs. Numeric: A Scatter Plot

To plot numeric columns against each other, we can use scatter plots, where a column goes on each axis and each row becomes a point in the plot.



In [None]:
sns.scatterplot(x='Age', y='Fare', data=df)
plt.savefig("output/scatter.svg", bbox_inches="tight", pad_inches=0)

Now, let us make scatter plots for all combinations of numeric columns. We will use `ColGrid` again and specify `interact='comb'`. This will avoid duplicate plots, e.g. we do not want both `Age` vs. `Fare` and `Fare` vs. `Age` – that would be redundant.



In [None]:
g = ColGrid(df, numeric_inputs, interact='comb', col_wrap=3)
g.map_dataframe(sns.scatterplot);
plt.savefig("output/scatter_colgrid.svg", bbox_inches="tight", pad_inches=0)
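
The set of column pairs that such a grid covers can be enumerated with the standard library. This is only a sketch of the idea of "combinations without duplicates"; we are assuming that `interact='comb'` behaves this way, as described above, and have not inspected how `ColGrid` implements it.

```python
from itertools import combinations

numeric_inputs = ["Pclass", "Age", "SibSp", "Parch", "Fare"]

# Each unordered pair appears exactly once: ("Age", "Fare") but not ("Fare", "Age").
pairs = list(combinations(numeric_inputs, 2))

for x_col, y_col in pairs:
    print(f"{x_col} vs. {y_col}")
print(len(pairs), "plots in total")
```

With 5 columns this gives 10 pairs, instead of the 20 ordered pairs we would get if both orientations were plotted.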

##### Ordinal Categorical Variables

Note that in our case, `Age` and `Fare` are the only continuous numeric columns. Our other numeric columns are discrete and they each only have a small number of values. As we are going to see later on, some of them will be better plotted if we treat them as **ordinal categorical variables** .

Ordinal categorical variables represent a specific kind of categorical variables, where there is some natural **ordering**  of the values. E.g. for a categorical variable `height` with values `short`, `medium` and `tall`, it is clear that `short` is the smallest value and `tall` the largest in some sense: even though the variable is categorical.

Let us note down which of our numeric columns could instead be treated as ordinal categorical variables and we can try different plots for them in the next section.



In [None]:
ordinal_inputs = ["Pclass", "SibSp", "Parch"]

#### Numeric vs. Numeric: A Regression Plot

There is a special kind of plot, which combines a scatter plot with a linear regression plot. In some cases this makes the plot easier to read – it makes it more obvious what kind of linear trend there might be in the data. For the linear regression plot, one usually also visualizes the confidence intervals using a shaded region.



In [None]:
sns.regplot(x='SibSp', y='Fare', data=df)
plt.savefig("output/regplot.svg", bbox_inches="tight", pad_inches=0)

As you can see, the SibSp variable is numeric, but discrete. This makes our plot quite difficult to read – one thing we can do in such cases is to add some jitter to the points. That is to say, we are going to add a little bit of random noise to each point: in our case to each point's x-coordinate, since that is the axis where the variable is discrete. This will make the points spread out and the plot will be easier to read. We can also add some transparency to the points: this will also give us a better idea of how many points are in each area.



In [None]:
sns.regplot(
    x='SibSp', y='Fare', data=df,
    x_jitter=0.25, scatter_kws={'alpha': 0.25}
)
plt.savefig("output/regplot_jitter_alpha.svg", bbox_inches="tight", pad_inches=0)
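
Jitter itself is nothing more than a small random offset. The sketch below adds uniform noise of up to 0.25 in either direction to a made-up discrete variable, which mirrors what the `x_jitter=0.25` argument is documented to do; the array `sibsp` is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up discrete variable with many repeated values.
sibsp = np.array([0, 0, 0, 1, 1, 2, 0, 1, 3, 0])

# Add uniform noise in [-0.25, 0.25] so that overlapping points spread out.
jitter = rng.uniform(-0.25, 0.25, size=sibsp.size)
sibsp_jittered = sibsp + jitter

print(np.round(sibsp_jittered, 2))
```

Because the noise is smaller than half the gap between neighbouring values, every point still clearly belongs to its original value. The jitter is applied for display only and should not be fed back into the data.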

The regplot function is also a convenient way to create standard scatter plots with jitter, since the regression line can be turned off.



In [None]:
sns.regplot(
    x='SibSp', y='Fare', data=df,
    x_jitter=0.25, scatter_kws={'alpha': 0.25},
    fit_reg=False
)
plt.savefig("output/regplot_jitter_alpha_noreg.svg", bbox_inches="tight", pad_inches=0)

#### Numeric vs. Numeric: A Line Plot

One of the most common kinds of plots is the standard line plot. However, when making a plot of this kind, we need to make sure that the points are ordered correctly (otherwise the lines might not connect up the way they should) and that for each value of the horizontal-axis variable we only plot a single value of the vertical-axis variable. In a scatter plot we were able to plot as many values as we liked: in a line plot, it would make no sense.

##### Proper Ordering

Let us make a naïve line plot of 'Age' vs. 'Fare' to illustrate the problem:



In [None]:
df.plot.line(x='Age', y='Fare')

The plot looks dreadful because the points were connected up in the order they appear in the dataset. To make them connect up properly, we need to sort them first:



In [None]:
df.sort_values(by='Age').plot.line(x='Age', y='Fare')

##### Taking the Means

Now, this is still not a good plot because we ignored the fact that for many values of 'Age' we have multiple values of 'Fare'. What we should do is group the points by age first and compute the means of the corresponding fares. Then the output will actually result in a well-defined plot. Note also how the two outliers with fares greater than 500 were now smoothed away by the averaging.



In [None]:
df[['Age', 'Fare']].groupby(by='Age').mean().plot.line()

##### Confidence Intervals and Smoothing

Now, this is still not ideal because unlike in a scatter plot, we have no idea about how much the fares varied for any given age. We can work around that by plotting **confidence intervals**. Seaborn does this automatically (using bootstrapping): the shaded area corresponds to the 95% confidence interval for the mean. Roughly speaking, if we could repeat the data collection many times and construct the interval each time, about 95% of those intervals would contain the true mean.



In [None]:
sns.lineplot(x='Age', y='Fare', data=df)
plt.savefig("output/line.svg", bbox_inches="tight", pad_inches=0)
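
To illustrate what bootstrapping means here, the following sketch computes a percentile bootstrap interval for the mean of a handful of made-up fares. The helper `bootstrap_ci_mean` is a hypothetical name, and `seaborn`'s own implementation may differ in details such as the number of resamples.

```python
import numpy as np

def bootstrap_ci_mean(values, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample the data with replacement n_boot times and average each resample.
    idx = rng.integers(0, values.size, size=(n_boot, values.size))
    boot_means = values[idx].mean(axis=1)
    alpha = (1.0 - level) / 2.0
    low, high = np.quantile(boot_means, [alpha, 1.0 - alpha])
    return low, high

# Made-up fares for passengers of one particular age.
fares = np.array([7.25, 7.9, 8.05, 13.0, 26.0, 26.55, 30.0, 52.0, 71.3, 83.2])
low, high = bootstrap_ci_mean(fares)
print(f"mean = {fares.mean():.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

In the line plot, an interval of this kind is computed separately for every value on the horizontal axis, which is why the band is wide where few passengers share the same age.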

For very noisy plots, it can also make sense to overlay the plot with a smoothed version. E.g. using the moving average:



In [None]:
sns.lineplot(x='Age', y='Fare', data=df)
df_grouped = df[['Age', 'Fare']].groupby(by='Age').mean()
moving_average = df_grouped.rolling(window=10, min_periods=1).mean()
moving_average.plot.line(ax=plt.gca(), linewidth=4)
plt.legend(['original fare', 'moving average'])
plt.savefig("output/line_ma.svg", bbox_inches="tight", pad_inches=0)

#### Categorical vs. Categorical: A Crosstab Plot

To visualize interaction between two categorical variables, we can crosstabulate them and plot the resulting matrix. Each cell of the matrix corresponds to the number of co-occurrences of the two corresponding values in the dataset.



In [None]:
crosstab_plot(x='Sex', y='Survived', data=df);
plt.savefig("output/crosstab.svg", bbox_inches="tight", pad_inches=0)

This plot clearly indicates that there is a very strong association between being male and not surviving. There is also a rather strong association between being female and surviving. This already gives us a lot of information.

To get interactions of each categorical input with the output variable, we can use a 2-argument `ColGrid`. We can also include the ordinal categorical variables that we identified in the previous section.



In [None]:
g = ColGrid(df, categorical_inputs + ordinal_inputs, output, col_wrap=2)
g.map_dataframe(crosstab_plot);
plt.savefig("output/crosstab_colgrid.svg", bbox_inches="tight", pad_inches=0)
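
The table behind such a plot can be obtained directly with `pd.crosstab`. Here is a short sketch on a made-up six-row dataframe (`toy` is illustrative, not the Titanic data), showing both raw counts and row-normalized shares.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Raw co-occurrence counts: one row per value of Survived, one column per Sex.
counts = pd.crosstab(toy["Survived"], toy["Sex"])
print(counts)

# Shares within each sex (every row sums to 1), which are easier to compare
# when the groups have different sizes.
row_shares = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")
print(row_shares)
```

Looking at normalized shares alongside raw counts helps to avoid mistaking a large group for a strong association.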

#### Numeric vs. Categorical

For numeric vs. categorical we are going to look at two kinds of plots. They are somewhat similar in nature, but each has its own strengths.

##### Box Plot

The better known of the two plots is the box plot, which visualizes the distribution of a numeric variable across different values of a categorical variable. The plot displays boxes and whiskers. The boxes range from the 25th to the 75th percentile (i.e. the middle 50% of the data lies within the box). The line drawn through the box represents the median (i.e. 50% of the data lies below that line).

Given that the box ranges from the 1st quartile (i.e. the 25th percentile) to the 3rd quartile (the 75th percentile), the height of the box represents the interquartile range (IQR). The whiskers extend from the box to the most extreme data points that lie within 1.5 times the IQR of the box (measured from the 1st quartile at the bottom and the 3rd quartile at the top). Any points outside this range are considered outliers and are plotted individually.



In [None]:
sns.boxplot(x="Sex", y="Age", data=df)
plt.savefig("output/box.svg", bbox_inches="tight", pad_inches=0)
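
The quantities drawn in a box plot are easy to compute by hand. The sketch below does so with `numpy` for a small made-up sample that contains one obvious outlier; the variable names are our own.

```python
import numpy as np

# A made-up sample with one extreme value.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# The box: 1st quartile, median and 3rd quartile.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# The whiskers may reach at most 1.5 * IQR beyond the box ...
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# ... and end at the most extreme data points inside those limits.
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything beyond the limits is drawn as an individual outlier.
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"box: {q1} to {q3}, median {median}, IQR {iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}")
print("outliers:", outliers)
```

Here the box spans 3.25 to 7.75, the upper whisker stops at 9 and the value 100 is reported as an outlier.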

To make box plots more readable, especially when there are a lot of categories, it is often better to sort the categories by medians. We have an auxiliary function called `sorted_order`, which does just that:



In [None]:
sorted_order(sns.boxplot)(x="Sex", y="Age", data=df)
plt.savefig("output/box_sorted.svg", bbox_inches="tight", pad_inches=0)
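
If `class_utils` is not available, a similar effect can be achieved by computing the order ourselves. The sketch below shows one way to do it with `pandas` on made-up data; we have not looked at how `sorted_order` is implemented, so treat this as an approximation of its behaviour.

```python
import pandas as pd

# Made-up data: three categories with different typical ages.
toy = pd.DataFrame({
    "Embarked": ["S", "C", "Q", "S", "C", "Q"],
    "Age": [30.0, 40.0, 20.0, 34.0, 44.0, 24.0],
})

# Median of the numeric variable per category, sorted in ascending order.
order = toy.groupby("Embarked")["Age"].median().sort_values().index.tolist()
print(order)
```

The resulting list can then be passed to the plotting function through its `order` argument, e.g. `sns.boxplot(x="Embarked", y="Age", data=toy, order=order)`.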

##### Violin Plot

Violin plots are somewhat similar to box plots, but they give a fuller idea of what the distribution of the numeric variable looks like. They are essentially like rotated density plots (created using kernel density estimation; KDE): the thickness of the violin represents the estimated density of the data at the corresponding numeric values. Violin plots are especially useful when the distribution is multimodal (there are multiple local maxima), which a box plot cannot visualize.

**Note: To make sure that you do not misread a violin plot, note that the violin is padded a little at the top and the bottom. This is because of the kernel density estimation (KDE) smoothing. It means, however, that the minimum and the maximum values are not indicated by the points where the violin ends, but rather by the line segment plotted inside the violin.** Compare the violin plot to the box plot we have shown above to get a fuller idea. If you want to cut the violin so that it does not reach beyond the minimum and the maximum, you can pass `cut=0` to the `violinplot` function, but this is not the default behaviour.

The inside of the violin typically contains a small boxplot, where the box is replaced with a thick line, the whiskers with a thin line and the median is indicated by a white circle.



In [None]:
sorted_order(sns.violinplot)(x="Sex", y="Age", data=df)
plt.savefig("output/violin.svg", bbox_inches="tight", pad_inches=0)

Naturally, we can again put both box plots and violin plots on a `ColGrid`.



In [None]:
g = ColGrid(df, categorical_inputs, "Age", col_wrap=2)
g.map_dataframe(sorted_order(sns.violinplot));
plt.savefig("output/violin_categorical.svg", bbox_inches="tight", pad_inches=0)

##### Violin Plots for Ordinal Categorical Variables

We can do the same for ordinal categorical variables. The visualizations will be much more informative than the scatter plots we made earlier. Note though that ordinal variables are already ordered by nature so it makes no sense to use `sorted_order` in this case to reorder them.



In [None]:
g = ColGrid(df, ordinal_inputs, "Age", col_wrap=2)
g.map_dataframe(sns.violinplot);
plt.savefig("output/violin_ordinal.svg", bbox_inches="tight", pad_inches=0)

##### Raincloud Plot

Now, if you want to make your plot even more fancy – but actually also more informative – you can make a **raincloud plot** : a new, recently published kind of plot that combines three aspects:

* a (half) violin plot;
* a boxplot; and
* a real-data plot.

That way you get a nice visual summary, which gives you all the most important information at a glance.

For more about raincloud plots see [Raincloud Plots at GitHub](https://github.com/RainCloudPlots/RainCloudPlots) or the paper itself:
Allen M, Poggiali D, Whitaker K et al. Raincloud plots: a multi-platform tool for robust data visualization [version 2; peer review: 2 approved]. Wellcome Open Res 2021, 4:63. DOI: 10.12688/wellcomeopenres.15191.2

See also [raincloud_tutorial_python.ipynb](https://github.com/pog87/PtitPrince/blob/master/tutorial_python/raincloud_tutorial_python.ipynb) for advanced usage.



In [None]:
RainCloud(x="Sex", y="Age", data=df)
plt.savefig("output/raincloud.svg", bbox_inches="tight", pad_inches=0)

Or using a `ColGrid`:



In [None]:
g = ColGrid(df, categorical_inputs, "Age", col_wrap=2)
g.map_dataframe(RainCloud);

### Facet Grids

Seaborn's facet grids work similarly to the ColGrids that we have already used. The main difference is that where ColGrids would go over combinations of columns, facet grids go over the different values of a single discrete variable or a pair of discrete variables. We can plot all the various kinds of plots in a facet grid.

#### Age Distribution by Passenger Class

For instance, we could plot "Age" histograms across different passenger classes and compare.



In [None]:
g = sns.FacetGrid(df, col="Pclass")
g.map_dataframe(sns.histplot, x="Age", kde=True)
g.set_axis_labels("Age", "Count")
plt.savefig("output/facet.svg", bbox_inches="tight", pad_inches=0)

##### Automatic Column Wrapping

If we wanted to do the same kind of plot, but with "SibSp", there would be too many different values to fit into a compact plot. Luckily, we can again use the `col_wrap` argument to automatically wrap columns into multiple rows.



In [None]:
g = sns.FacetGrid(df, col="SibSp", col_wrap=4)
g.map_dataframe(sns.histplot, x="Age", kde=True)
g.set_axis_labels("Age", "Count")
plt.savefig("output/facet_wrap.svg", bbox_inches="tight", pad_inches=0)

##### 2-Dimensional Facet Grids

It is also possible to create a 2-dimensional facet grid, where columns represent one discrete variable and the rows a different one. For instance, we could have "Pclass" change across columns and "Embarked" across rows.



In [None]:
g = sns.FacetGrid(df, col="Pclass", row="Embarked")
g.map_dataframe(sns.histplot, x="Age", kde=True)
g.set_axis_labels("Age", "Count")
plt.savefig("output/facet_2d.svg", bbox_inches="tight", pad_inches=0)

##### Other Kinds of Plots

The true power of `FacetGrid` is, of course, that just like `ColGrid` it works with any kind of plot, e.g.:



In [None]:
g = sns.FacetGrid(df, col="Pclass", row="Embarked")
g.map_dataframe(sns.countplot, x="Survived")
g.set_axis_labels("Survived", "Count")
plt.savefig("output/facet_count.svg", bbox_inches="tight", pad_inches=0)

In [None]:
g = sns.FacetGrid(df, col="Pclass", row="Embarked")
g.map_dataframe(sorted_order(sns.violinplot), x="Survived", y="Age")
g.set_axis_labels("Survived", "Age")
plt.savefig("output/facet_violin.svg", bbox_inches="tight", pad_inches=0)

### Pie Charts Using Pandas

Finally, `seaborn` has no support for pie charts, but `pandas` does. To display the proportion of the male and female sex in the dataset, you might count the values of "Sex" using `df["Sex"].value_counts()` and then plot this using `.plot.pie`. We can specify `autopct='%1.0f%%'` to have the percentages displayed in the plot and use the `explode` argument to move some pieces of the pie outwards.



In [None]:
df["Sex"].value_counts()

In [None]:
df["Sex"].value_counts().plot.pie(autopct='%1.0f%%', explode=[0, 0.05]);
plt.savefig("output/pie.svg", bbox_inches="tight", pad_inches=0)

Given that these charts do not have the standard `seaborn` interface, they cannot be used with `ColGrid` or `FacetGrid`. We can still display them in a grid though, using `matplotlib`'s `subplots`, which is a little more laborious, but works just as well.



In [None]:
# the columns to plot
cols_to_plot = categorical_inputs + [output]

# number of columns and rows in the subplots grid
ncols=3
nrows=int(np.ceil(len(cols_to_plot)/ncols))
fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(12, 8))

# plot each column in a separate subplot
for col, ax in zip(cols_to_plot, np.ravel(axes)):
    df[col].value_counts().plot.pie(autopct='%1.0f%%', ax=ax)

# to hide unused axes
for ax in np.ravel(axes)[len(cols_to_plot):]:
    ax.axis('off')
    
plt.savefig("output/pie_subplots.svg", bbox_inches="tight", pad_inches=0)

**Note that the use of pie charts is generally discouraged, because they are more difficult to read than bar plots (it is more difficult to compare the size of the portions).**

### Stacked Bar Plots

If you want to show proportions without using a pie chart or a standard bar plot, you may also want to consider a stacked bar plot as an alternative.

To recreate the pie charts above, we could write:



In [None]:
df_percentages = df["Sex"].value_counts(normalize=True) * 100
pd.DataFrame(df_percentages).T.plot(kind='bar', stacked=True)
plt.grid(ls='--')
plt.gca().set_axisbelow(True)
plt.ylabel("percentages")
plt.savefig("output/stacked_bar_plot_sex.svg", bbox_inches="tight", pad_inches=0)

In [None]:
# the columns to plot
cols_to_plot = categorical_inputs + [output]

# number of columns and rows in the subplots grid
ncols=3
nrows=int(np.ceil(len(cols_to_plot)/ncols))
fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(12, 8))

# plot each column in a separate subplot
for col, ax in zip(cols_to_plot, np.ravel(axes)):
    df_percentages = df[col].value_counts(normalize=True) * 100
    pd.DataFrame(df_percentages).T.plot(kind='bar', stacked=True, ax=ax)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['bottom'].set_visible(False)
    ax.spines['left'].set_visible(False)
    ax.grid(ls='--')
    ax.set_axisbelow(True)
    
# to hide unused axes
for ax in np.ravel(axes)[len(cols_to_plot):]:
    ax.axis('off')

plt.savefig("output/stacked_bar_plot_grid.svg", bbox_inches="tight", pad_inches=0)

Naturally, you can create a stacked bar plot using any data – it does not need to be proportional. E.g. we could create a pivot table, where we'd have the place of embarkation in the rows, the sex of the passenger in the columns and the values would be the survival counts:



In [None]:
df_pivot = pd.pivot_table(df, index="Embarked", columns="Sex", values="Survived", aggfunc='sum')
df_pivot

We could then create a stacked bar plot out of this pivot table simply by calling:



In [None]:
df_pivot.plot(kind='bar', stacked=True)
plt.grid(ls='--')
plt.gca().set_axisbelow(True)
plt.savefig("output/stacked_bar_plot_pivot.svg", bbox_inches="tight", pad_inches=0)

### Other Plots

Note that there are other, more advanced plots that can be useful to you when doing exploratory analysis. We are going to discuss some of them in later modules. They include e.g.:

* Reducing the dimensionality of the original data (using e.g. PCA or UMAP) and plotting it in 2D;
* Doing hierarchical clustering on the data and displaying the resulting dendrogram (by itself or as a part of a heatmap);
* ...
