# Seaborn

Seaborn is an easy-to-use Python data visualisation library that is well suited to data science. It is built on top of Matplotlib and designed to work with pandas DataFrames.
Check out the [`documentation`](https://seaborn.pydata.org/) for a more in-depth understanding.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## load_dataset()
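Seaborn comes with a handy way to quickly get some example datasets to play with, but please note that this is NOT the normal way of loading a CSV file. Usually we'd use `pandas.read_csv()`, as we've seen so far. Like `read_csv()`, however, `load_dataset()` returns a pandas DataFrame. The example datasets are fetched from seaborn's online repository, so the first call needs an internet connection.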

In [None]:
tips = sns.load_dataset("tips")  # A dataset on the tipping behaviour of diners

In [None]:
tips
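
For comparison, the usual `read_csv()` route can be sketched as follows. The CSV content here is made up and read from an in-memory string, so nothing depends on a file on disk; with a real file you would pass its path to `pd.read_csv()` instead.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a real file on disk
csv_text = """total_bill,tip,smoker
16.99,1.01,No
10.34,1.66,No
21.01,3.50,Yes
"""

# read_csv returns a DataFrame, the same type that sns.load_dataset() hands back
my_tips = pd.read_csv(io.StringIO(csv_text))
print(type(my_tips).__name__)
print(my_tips.dtypes)
```

Either way you end up with a DataFrame that can be passed to the `data` argument of the seaborn functions used below.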

In [None]:
penguins = sns.load_dataset("penguins") #This loads a dataset about a penguin population

## Scatterplots

First we would like to use the default seaborn theme. Without it this figure looks like a plain pandas plot.

In [None]:
sns.set_theme()

In [None]:
sns.scatterplot(data=tips, x="total_bill", y="tip")

A very useful feature of seaborn is the ability to distinguish between features with the help of several arguments. The first one we will look at is `hue`, which uses colours to separate the data.

In [None]:
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="smoker")

In [None]:
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="sex")

You can also pass non-binary data to the `hue` argument. However, this can become rather confusing to interpret.

In [None]:
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="day")

Apart from `hue`, you can also use `style` to distinguish your data.

In [None]:
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="sex", style="time")

In [None]:
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="sex", style="sex")

In some cases it makes sense to change the marker size according to the input. In this case we are using the size of the group dining together.

In [None]:
sns.scatterplot(data=tips, x="total_bill", y="tip", size="size")

In [None]:
sns.scatterplot(data=tips, x="total_bill", y="tip", size="size", hue="sex") #It seems men still pay the bills.

## Line plots

In pandas or Matplotlib it would be quite cumbersome to create a line plot from data in a DataFrame. You would first have to do a groupby reduction before passing the result to Matplotlib. You can also plot straight from pandas; however, you do not have a lot of control over how the plot looks in the end.

In [None]:
flights = sns.load_dataset("flights")  # A dataset with the number of airline passengers for every month of each year

In [None]:
flights.head()

## Seaborn Line Plots

Seaborn automatically performs the aggregation and reduction. The default estimator is the mean. By default you also get a 95% confidence interval displayed.

In [None]:
flights = sns.load_dataset("flights")

In [None]:
sns.lineplot(data=flights, x="year", y="passengers")

In [None]:
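# Assumption: `trips` is seaborn's built-in "taxis" dataset; the cell that loads it is missing here
trips = sns.load_dataset("taxis")
trips.head(3)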

In [None]:
flights.groupby("year")["passengers"].mean().plot()

In [None]:
sns.lineplot(data=flights, x="year", y="passengers", estimator="sum")

In [None]:
sns.lineplot(data=flights, x="year", y="passengers", hue="month")

## Relplots

Seaborn offers a fantastic collection of [`relational plots`](https://seaborn.pydata.org/generated/seaborn.relplot.html), which give you a much better insight into your dataset. What you end up with is a facet grid: the result of a figure-level function that arranges subplots in rows and columns.

In [None]:
trips["hour"] = trips["pickup"].dt.hour

At first glance, the plot looks very similar to what we have already created earlier. However, as soon as you pass data to the `row` and `col` arguments, you end up with something truly useful.

In [None]:
sns.relplot(data=tips, x="total_bill", y="tip", kind="scatter")

Let's split up our data into male and female:

In [None]:
trips.head()

We can break this up even further, by bringing `hue` into play again.

In [None]:
sns.lineplot(data=trips, x="hour", y="total", hue="payment", style="color", errorbar=None)  # errorbar replaces the deprecated ci (seaborn 0.12+)

Now let's introduce rows as well. You end up with a 2x2 grid, because there are only two sexes and two different meals in our dataset.
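
The three steps described above (columns, then `hue`, then rows) can be sketched like this. `mini_tips` is a made-up miniature stand-in for the tips dataset so that the example runs on its own; the choice of `smoker` for `hue` is an assumption made for illustration.

```python
import pandas as pd
import seaborn as sns

# Made-up miniature stand-in for the tips dataset (values are illustrative only)
mini_tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 15.0, 25.0, 35.0, 12.0, 22.0],
    "tip": [1.5, 3.0, 4.5, 2.0, 3.5, 5.0, 2.0, 3.0],
    "sex": ["Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"],
    "time": ["Lunch", "Lunch", "Dinner", "Dinner", "Lunch", "Lunch", "Dinner", "Dinner"],
    "smoker": ["Yes", "No", "No", "Yes", "No", "Yes", "Yes", "No"],
})

# Step 1: one column of subplots per sex
g_cols = sns.relplot(data=mini_tips, x="total_bill", y="tip", kind="scatter", col="sex")

# Step 2: the same split, with colour encoding a further variable
g_hue = sns.relplot(
    data=mini_tips, x="total_bill", y="tip", kind="scatter", col="sex", hue="smoker"
)

# Step 3: rows for the meal and columns for sex give a 2x2 grid
g_grid = sns.relplot(
    data=mini_tips, x="total_bill", y="tip", kind="scatter",
    col="sex", row="time", hue="smoker",
)
print(g_grid.axes.shape)
```

With the real dataset loaded earlier, replace `mini_tips` with `tips`.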

## The Figure-Level Relplot( ) Method

Let's get back to the taxi trips dataset. This time we would like to split up the data according to the pick-up borough and the payment method.

In [None]:
tips.head()

This time we'll go completely nuts by introducing some rows as well. This might take a while to complete...

In [None]:
sns.relplot(
    data=trips, 
    x="hour",
    y="total", 
    kind="line", 
    col="pickup_borough", 
    hue="payment", 
    row="dropoff_borough"
)

## Changing Plot Sizes

When it comes to axes-level plots, such as line or scatter plots, we have to use `plt.figure(figsize=(width, height))` to determine the size of the figure.

In [None]:
plt.figure(figsize=(8,5))
sns.lineplot(data=flights, x="year", y="passengers", hue="month")

However, when it comes to figure-level plots, such as relplots, we need to use the keyword `height`, which determines the height of one facet (subplot). To set the width, we use the keyword `aspect`, the ratio of width to height of one facet (default = 1).

In [None]:
sns.relplot(
    data=tips, 
    x="total_bill", 
    y="tip", 
    kind="scatter", 
    hue="smoker", 
    col="day",
    row="sex",
    height=2,
    aspect=1.3
)
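
As a rough check of how `height` and `aspect` combine, the sketch below compares the resulting figure size with the arithmetic described above. It uses made-up data and assumes that no legend is drawn, since seaborn may widen the figure to make room for one.

```python
import pandas as pd
import seaborn as sns

# Made-up miniature data with two categories to facet on
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [2, 1, 4, 3, 6, 5],
    "group": ["a", "a", "a", "b", "b", "b"],
})

height, aspect = 2, 1.5
g = sns.relplot(data=df, x="x", y="y", col="group", height=height, aspect=aspect)

# Each facet is `height` inches tall and `height * aspect` inches wide,
# so the whole figure scales with the number of rows and columns
n_rows, n_cols = g.axes.shape
fig_width, fig_height = g.figure.get_size_inches()
print(fig_width, fig_height)
print(n_cols * height * aspect, n_rows * height)
```

So to make a grid larger, you increase `height` (and adjust `aspect`) rather than passing a `figsize`.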

## Histograms

Let's briefly delve into one of the classics when it comes to distribution plots, the histogram.
The arguments are very much like the ones we got to know earlier.

In [None]:
sns.histplot(data=tips, x="tip")

We can split up the data here as well (in this case with `hue`). It might, however, not always be useful to have the bars overlapping, so we can use the `multiple` argument to change the default e.g. to stack the bars, or display them side by side.

In [None]:
sns.histplot(data=tips, x="tip", hue="time")

In [None]:
sns.histplot(data=tips, x="tip", hue="smoker", multiple="stack")

In [None]:
sns.histplot(data=tips, x="tip", hue="smoker", multiple="dodge")

We can also set the number of bins (`bins`) or the width of each bin (`binwidth`).

In [None]:
sns.histplot(data=tips, x="tip", bins=20)

If you want to display a kernel density estimate (KDE) on top of your histogram, set kde=True.

In [None]:
sns.histplot(data=tips, x="tip", bins=20, kde=True)

## KDE Plots

Kernel density estimate (KDE) plots show a smooth, continuous estimate of the distribution's density, rather than the discrete bins of a histogram.

In [None]:
trips.head()

In [None]:
sns.kdeplot(data=trips, x="hour")

In [None]:
sns.kdeplot(data=trips, x="hour", hue="payment")

In [None]:
sns.kdeplot(data=trips, x="hour", hue="payment", bw_adjust=0.4)

In [None]:
sns.kdeplot(data=trips, x="hour", hue="pickup_borough", multiple="stack")

## Multivariate Distributions

Let's take a look at our dataset about penguins.

In [None]:
penguins.head()

In [None]:
penguins.species.unique()

Now we would like to create some ordinary histograms, first with the flipper length and then with the body mass.

In [None]:
sns.histplot(data=penguins, x="flipper_length_mm")

In [None]:
sns.histplot(data=penguins, x="body_mass_g")

The histogram can also be created with two variables. What you get is very similar to a heatmap, where the darker areas show a higher concentration than the lighter ones.

In [None]:
sns.histplot(data=penguins, x="body_mass_g", y="flipper_length_mm")

The same goes for the kdeplot, which is now very much like a contour plot.

In [None]:
sns.kdeplot(data=penguins, x="body_mass_g", y="flipper_length_mm")

How about splitting the data by means of the colour?

In [None]:
sns.kdeplot(data=penguins, x="bill_length_mm", y="flipper_length_mm", hue="species")

In [None]:
sns.histplot(data=penguins, x="bill_length_mm", y="flipper_length_mm", hue="species")

In [None]:
sns.rugplot(data=tips, x="tip", height=0.2)

In [None]:
sns.rugplot(data=tips, y="tip", height=0.2)

In [None]:
sns.kdeplot(data=tips, x="total_bill")
sns.rugplot(data=tips, x="total_bill", height=0.07)

In [None]:
sns.scatterplot(data=tips, x="total_bill", y="tip")
sns.rugplot(data=tips, x="total_bill", y="tip")

In [None]:
diamonds = sns.load_dataset("diamonds")
sns.scatterplot(data=diamonds, x="carat", y="price", s=5)
sns.rugplot(data=diamonds, x="carat", y="price", lw=1, alpha=.005)

## Displots

Similar to Relplots, we can use [`Displots`](https://seaborn.pydata.org/generated/seaborn.displot.html), a figure-level function that creates subplots depicting the distribution of data. You just need to say which kind of distribution plot you would like with the `kind` parameter. Without the column (`col`) or row (`row`) parameters, the outcome is the same as with e.g. the `histplot()` function.

In [None]:
sns.displot(kind="hist", data=penguins, x="body_mass_g", height=3, aspect=2)

In [None]:
sns.displot(
    kind="hist", 
    data=penguins, 
    hue="sex",
    x="body_mass_g", 
    col="species"
)

Rug plots show you every data point and give you a better understanding of the distribution. To enable them, set `rug=True`.

In [None]:
sns.displot(data=tips, kind="kde", x="tip", col="time", rug=True)

In [None]:
sns.displot(data=tips, kind="kde", x="total_bill", y="tip", rug=True)

## Countplots

Countplots do exactly what the name suggests: they count the number of observations for each category of categorical data.

In [None]:
titanic = sns.load_dataset("titanic")

In [None]:
titanic.head()

In [None]:
sns.countplot(data=titanic, x="class")

In [None]:
sns.countplot(data=titanic, x="class", hue="sex")

We can also switch to horizontal.

In [None]:
sns.countplot(data=titanic, y="class", hue="sex")
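
The bar heights correspond to what `value_counts()` gives you in pandas. A minimal sketch with made-up passenger classes (reading the heights back from `ax.patches` assumes that, without `hue`, the bars are the only patches on the axes):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up passenger classes, for illustration only
mini_titanic = pd.DataFrame(
    {"class": ["Third", "First", "Third", "Second", "Third", "First"]}
)

# pandas: count the observations per category
counts = mini_titanic["class"].value_counts()
print(counts)

# seaborn: the same counts, drawn as one bar per category
fig, ax = plt.subplots()
sns.countplot(data=mini_titanic, x="class", ax=ax)
bar_heights = sorted(float(bar.get_height()) for bar in ax.patches)
print(bar_heights)
```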

## Stripplot & Swarmplot

Scatterplots are not very useful when we are dealing with categorical data. In such cases we might want to switch to Strip- or Swarmplots, which give us a sense of the distribution as well.

In [None]:
sns.scatterplot(data=trips, x="pickup_borough", y="distance")

In [None]:
plt.figure(dpi=100)
sns.stripplot(data=trips, x="pickup_borough", y="distance")
plt.title("Taxi Trip Distance By Borough")

If the dataset is too large, the stripplot quickly becomes uninterpretable. The same goes for the swarmplot. So, let's reduce the amount of data in our case.

In [None]:
trips_sample = trips.nlargest(600, "total")

In [None]:
plt.figure(figsize=(14,5))
sns.swarmplot(data=trips_sample, x="pickup_borough", y="distance")

In [None]:
plt.figure(figsize=(12,5))
sns.stripplot(data=trips_sample, x="pickup_borough", y="distance")
plt.title("Taxi Trips By Borough")

In [None]:
plt.figure(figsize=(12,5))

sns.swarmplot(data=titanic, x="pclass", y="age", hue="sex")

## Boxplots

Boxplots only really make sense when you compare data. However, you can of course also just create a single boxplot.

In [None]:
sns.boxplot(data=titanic, x="age")

In [None]:
sns.boxplot(data=trips, x="pickup_borough", y="total")

You can also set the whisker length as a proportion of the interquartile range (IQR) with `whis`. Everything outside of this range will be classified as outliers (fliers).

In [None]:
sns.boxplot(data=trips, x="pickup_borough", y="total", whis=2.5, fliersize=2)
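
To make the whisker rule concrete, here is a small sketch that computes the fliers by hand with pandas. The values are made up; `1.5` is the default for `whis`, and `2.5` is the wider setting used above.

```python
import pandas as pd

# Made-up totals with one moderate and one extreme value
totals = pd.Series([8.0, 9.5, 10.0, 11.0, 12.5, 13.0, 14.0, 22.0, 60.0])

q1, q3 = totals.quantile(0.25), totals.quantile(0.75)
iqr = q3 - q1


def fliers(values, whis):
    """Return the values lying more than whis * IQR beyond the quartiles."""
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return values[(values < lower) | (values > upper)]


print(fliers(totals, whis=1.5).tolist())
print(fliers(totals, whis=2.5).tolist())
```

A larger `whis` therefore means longer whiskers and fewer points drawn as fliers.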

In [None]:
plt.figure(figsize=(12,5))
sns.swarmplot(data=titanic, x="class", y="age")
sns.boxplot(data=titanic, x="class", y="age")

## Boxenplots

[`Boxenplots`](https://seaborn.pydata.org/generated/seaborn.boxenplot.html) work just like boxplots, but show additional quantiles, which give you a better understanding of the distribution of your data.

In [None]:
sns.boxplot(data=trips, x="pickup_borough", y="total")

For the boxenplot to display correctly, we need to adjust the figsize.

In [None]:
plt.figure(figsize=(10,6))
sns.boxenplot(data=trips, x="pickup_borough", y="total")

## Violinplots

In simple terms, a [`violinplot`](https://seaborn.pydata.org/generated/seaborn.violinplot.html) is an alternative to the boxenplot that shows the distribution of quantitative data across the levels of a categorical variable. It enables you to compare distributions easily and intuitively. The violin plot features a kernel density estimate (KDE) of the underlying distribution.

In [None]:
sns.violinplot(data=titanic, x="age")
# sns.boxplot(data=titanic, x="age")

In [None]:
sns.violinplot(data=titanic, x="class", y="age")

In [None]:
sns.violinplot(data=titanic, x="class", y="age", hue="sex")

One nice feature of the violin plot is that we can split the violins by a two-level `hue` variable to get a more compact depiction.

In [None]:
plt.figure(figsize=(10,4))
sns.violinplot(data=titanic, x="class", y="age", hue="sex", split=True, palette="muted")

## Barplots

[`Barplots`](https://seaborn.pydata.org/generated/seaborn.barplot.html) give you a different output than you might expect from other APIs. By default you get the aggregated mean of your data. You can change the reduction with the `estimator` parameter (e.g. to "sum").
In addition, you also get some uncertainty information in the form of an error bar by default. Should you wish to change the type of error, adjust the `errorbar` parameter (called `ci` before seaborn 0.12), or set it to `None` to hide the bars.

In [None]:
sns.barplot(data=trips, x="pickup_borough", y="distance")

In [None]:
sns.barplot(data=trips, x="pickup_borough", y="total")

In [None]:
sns.barplot(data=trips, x="pickup_borough", y="total", estimator=sum, errorbar=None)  # errorbar replaces the deprecated ci (seaborn 0.12+)

Again, you could achieve something similar by using groupby in pandas; Seaborn is just much less cumbersome.

In [None]:
trips.groupby("pickup_borough")["total"].sum().plot(kind="bar")

In [None]:
sns.barplot(data=trips, x="pickup_borough", y="distance", hue="color")

You can change the orientation, by switching x and y.

In [None]:
sns.barplot(data=trips, y="pickup_borough", x="distance", hue="color", dodge=False)

That doesn't work, however, if both variables are numeric: seaborn then cannot infer which one holds the categories and falls back to vertical bars. In such a case, you have to state the orientation explicitly by setting `orient="h"`.
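
A minimal sketch of that case, using made-up data in which the passenger class is stored as a number:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up data: passenger class is stored as a number, just like age
mini_titanic = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3, 3],
    "age": [40.0, 50.0, 30.0, 34.0, 20.0, 26.0],
})

fig, ax = plt.subplots()
# orient="h" states that y holds the categories, so the bars run horizontally
sns.barplot(data=mini_titanic, x="age", y="pclass", orient="h", ax=ax)

# For horizontal bars the aggregated mean is the bar width
bar_lengths = sorted(float(bar.get_width()) for bar in ax.patches)
print(bar_lengths)
```

The printed bar lengths are the mean ages per class, which confirms that `pclass` was treated as the categorical variable.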

## Catplots

[`Catplots`](https://seaborn.pydata.org/generated/seaborn.catplot.html#) are the third group of figure-level plots and work similarly to Relplots and Displots. The result is a facet grid of categorical plots. Again, you need to set `col` and/or `row` to create subplots.

In [None]:
sns.catplot(data=titanic, x="sex", y="survived", kind="bar")

In [None]:
sns.catplot(data=titanic, x="sex", y="survived", kind="bar", col="class")

In [None]:
sns.catplot(
    data=trips, 
    kind="strip", 
    x="pickup_borough", 
    y="distance", 
    col="color",
    height=5,
    aspect=1.5
)

In [None]:
sns.catplot(data=trips,
            kind="violin",
            x="pickup_borough",
            y="distance",
            hue="payment",
            split=True,
            col="color")

In [None]:
sns.catplot(
    data=titanic, 
    kind="bar",
    x="who",
    y="survived",
    col="class",
    errorbar=None,  # replaces the deprecated ci (seaborn 0.12+)
)

## Pairplots

[`Pairplots`](https://seaborn.pydata.org/generated/seaborn.pairplot.html) are very useful in feature engineering for machine learning, as they plot pairwise relationships in a dataset.
From the Seaborn documentation: "By default, this function will create a grid of Axes such that each numeric variable in data will by shared across the y-axes across a single row and the x-axes across a single column. The diagonal plots are treated differently: a univariate distribution plot is drawn to show the marginal distribution of the data in each column."

In [None]:
iris = sns.load_dataset("iris")

In [None]:
iris.head()

In [None]:
sns.pairplot(data=iris, hue="species")

## Heatmaps

[`Heatmaps`](https://seaborn.pydata.org/generated/seaborn.heatmap.html) can be really useful to visualise correlations between features.

In [None]:
mpg = sns.load_dataset("mpg")

In [None]:
mpg.head()

In [None]:
mpg.corr(numeric_only=True)  # skip the text columns; pandas 2.0+ raises an error otherwise

In [None]:
plt.figure(figsize=(8,6))
sns.heatmap(data=mpg.corr(numeric_only=True), linewidths=0.5, annot=True)
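
The mpg data contains text columns (`origin` and `name`), which cannot be correlated. The sketch below shows, on a made-up miniature stand-in, how `numeric_only=True` (available in pandas 1.5 and later) restricts the correlation matrix to the numeric columns before it is passed to the heatmap. Pinning the colour scale with `vmin` and `vmax` is an optional choice made here for illustration.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up miniature stand-in for the mpg data, including one text column
mini_mpg = pd.DataFrame({
    "mpg": [18.0, 15.0, 26.0, 32.0, 24.0],
    "weight": [3504, 3693, 2234, 1836, 2430],
    "horsepower": [130.0, 165.0, 95.0, 65.0, 90.0],
    "origin": ["usa", "usa", "europe", "japan", "japan"],
})

# numeric_only=True skips the text column instead of trying to correlate it
corr = mini_mpg.corr(numeric_only=True)
print(corr.columns.tolist())

fig, ax = plt.subplots(figsize=(4, 3))
# vmin/vmax pin the colour scale to the full correlation range
sns.heatmap(data=corr, vmin=-1, vmax=1, annot=True, ax=ax)
```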

## Save to File

In [None]:
import matplotlib.pyplot as plt

In [None]:
mpg = sns.load_dataset("mpg")

In [None]:
figure = plt.figure(figsize=(8,6))
ax = sns.heatmap(data=mpg.corr(numeric_only=True), linewidths=0.5, annot=True)
plt.savefig("seaborn_heatmap.svg", format="svg")
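
Figure-level functions such as `relplot()` return a grid object that has its own `savefig()` method. A minimal sketch with made-up data; the output path is a hypothetical location in a temporary directory, chosen so that no existing file is overwritten.

```python
import os
import tempfile

import pandas as pd
import seaborn as sns

# Made-up data, for illustration only
df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [1, 4, 9, 16],
    "group": ["a", "a", "b", "b"],
})

# Figure-level functions return a grid object that owns its figure
g = sns.relplot(data=df, x="x", y="y", col="group", height=2)

# Hypothetical output location inside a temporary directory
out_path = os.path.join(tempfile.mkdtemp(), "seaborn_relplot.png")
g.savefig(out_path, dpi=150)
print(os.path.exists(out_path))
```

Saving through the grid object avoids relying on whichever figure Matplotlib currently considers active.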