# Intro to Seaborn

> Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

(from [Seaborn homepage](https://seaborn.pydata.org/))

Seaborn (usually) makes it very easy to create clean and visually appealing plots. It supports themes and tries to relieve the user of many low-level details.

For reference, keep in mind the official material:

* [seaborn website](https://seaborn.pydata.org/)
* [example gallery](https://seaborn.pydata.org/examples/index.html)
* [tutorials](https://seaborn.pydata.org/tutorial.html)


# If Seaborn is so cool, why do we study pyplot?

Let's discuss pros and cons.

# Setup

In [None]:
#the default Colab seaborn lags a bit behind...
#(the quotes are needed: without them the shell reads > as an output redirection)
!pip install "seaborn>=0.12.0"

#the standard seaborn import
import seaborn as sns
print(sns.__version__)

#other stuff that we'll use for sure
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Data

We are going to use the Canada immigration dataset again, as seen with pyplot.

In [None]:
#reading the data in
df_canada = pd.read_excel(
    'https://github.com/ne1s0n/dataviz_python/raw/main/resources/Canada.xlsx',
    sheet_name = 'Canada by Citizenship',  #the file contains three sheets
    skiprows = range(20), #skip the first twenty rows
    skipfooter = 2        #skip the last two rows
)

#renaming a column
df_canada.rename(columns = {'OdName':'Country'}, inplace = True)

#using Country as index
df_canada.set_index('Country', inplace = True)

#a handy variable to select years
years = df_canada.columns[8:42]

#adding a "Total" column
df_canada['Total'] = df_canada.loc[:, years].sum(axis=1)

#let's check the result
print(df_canada.shape)
df_canada.head()

# Wide vs. long data format

Seaborn and many other graphical libraries prefer data in long format.
With reference to our immigration data, we would have:

**WIDE FORMAT**

| Country      | 1980  | 1981  | 1982  |  1983 | 
|--------------|------:|------:|------:|------:|
| Italy        |   1   |     2 |     3 |     4 |
| France       |   5   |     6 |     7 |     8 |
| Spain        |   9   |    10 |    11 |    12 |


**LONG FORMAT**

| Country      | Year  | Immigrants |
|--------------|------:|------:|
| Italy        | 1980  |     1 |
| Italy        | 1981  |     2 |
| Italy        | 1982  |     3 |
| Italy        | 1983  |     4 |
| France       | 1980  |     5 |
| France       | 1981  |     6 |
| France       | 1982  |     7 |
| France       | 1983  |     8 |
| Spain        | 1980  |     9 |
| Spain        | 1981  |    10 |
| Spain        | 1982  |    11 |
| Spain        | 1983  |    12 |

The long format is easily obtained using the [melt()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.melt.html) method of Pandas dataframes.

In [None]:
#selecting interesting columns
df_canada.loc[:, years].head()

In [None]:
#putting back the index as a regular column
df_canada.loc[:, years].reset_index().head()

In [None]:
#melting
df_canada.loc[:, years].reset_index().melt(id_vars = 'Country').head()

In [None]:
#melting using more meaningful column names
df_canada_long = df_canada.loc[:, years].reset_index().melt(id_vars = 'Country', var_name='Year', value_name='Immigrants')
df_canada_long.head()

In [None]:
df_canada_long.shape
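
The two toy tables shown at the start of this section can be reproduced in a few lines. This is only a sketch on the made-up numbers from those tables (the names `wide`, `long` and `back` are chosen here for illustration); it also pivots the long table back to check that nothing was lost. Note that `melt()` stacks the data one year after the other, so the row order differs from the long table printed above.

```python
import pandas as pd

# Toy wide table copied from the example above (made-up numbers, not real data)
wide = pd.DataFrame(
    {
        "Country": ["Italy", "France", "Spain"],
        1980: [1, 5, 9],
        1981: [2, 6, 10],
        1982: [3, 7, 11],
        1983: [4, 8, 12],
    }
)

# wide -> long: one row per (Country, Year) pair
long = wide.melt(id_vars="Country", var_name="Year", value_name="Immigrants")
print(long.shape)  # 3 countries x 4 years = 12 rows, 3 columns

# long -> wide again, to check that no information was lost
back = long.pivot(index="Country", columns="Year", values="Immigrants")
print(back)
```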

# Quick example: barplot

We'll redo a barplot with the number of immigrants from three countries (to select them, check the [.isin()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isin.html) method in Pandas dataframes). The corresponding Seaborn function is called, unsurprisingly, [barplot()](https://seaborn.pydata.org/generated/seaborn.barplot.html).

We'll pass three arguments:

* data: the full dataframe
* x: name of the column to be used for X axis
* y: name of the column to be used for Y axis

In [None]:
#limiting to three countries instead of two hundred...
countries = ['Italy', 'France', 'Spain']
small_df = df_canada_long[df_canada_long['Country'].isin(countries)]

sns.barplot(x = 'Country', y = 'Immigrants', data = small_df)

if False:
  #alternative coding: pass the columns themselves instead of their names
  tmp = df_canada_long[df_canada_long['Country'].isin(countries)]
  sns.barplot(x = tmp['Country'], y = tmp['Immigrants'])

Notice the error bars: they show the 95% confidence interval. The bar heights (the "estimators") are just the mean values over all the years.
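
To convince yourself that the bars are means, you can compute them by hand and compare. The following is a minimal sketch on made-up numbers (the `toy` dataframe is not part of the Canada dataset), reading the bar heights back from the matplotlib axes.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up long-format data: three yearly values per country
toy = pd.DataFrame(
    {
        "Country": ["Italy"] * 3 + ["France"] * 3 + ["Spain"] * 3,
        "Immigrants": [10, 20, 30, 100, 200, 300, 5, 5, 20],
    }
)

# What barplot() aggregates by default: the mean of y for each x category
means = toy.groupby("Country", sort=False)["Immigrants"].mean()
print(means)

# The bar heights match those means; the thin segments on top are the error bars
fig, ax = plt.subplots()
sns.barplot(data=toy, x="Country", y="Immigrants", ax=ax)
heights = [bar.get_height() for bar in ax.patches]
print(heights)
```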

Let's change the color.

In [None]:
#fixed color
sns.barplot(x = 'Country', y = 'Immigrants', data = small_df, color = 'red')

Using [palettes](https://seaborn.pydata.org/tutorial/color_palettes.html)

In [None]:
sns.barplot(x = 'Country', y = 'Immigrants', data = small_df, palette = 'mako')

Grouping using the `hue` parameter.

In [None]:
plt.figure(figsize = (20, 5))
sns.barplot(x = 'Year', y = 'Immigrants', hue = 'Country', data = small_df)

--- 

# ASSIGNMENT! Customize the barplot

Redo the three-column aggregate barplot, but make it:

* horizontal
* with a different palette
* showing the max instead of the mean

---

In [None]:
#your solution here

# Styling a plot

Seaborn offers several ways to change the plot appearance. Many are directly inherited from pyplot, so it's going to be easy.

## Title, axis labels, figure size

This is straightforward from pyplot. 

In [None]:
plt.figure(figsize = (10, 5))
sns.barplot(x = 'Country', y = 'Immigrants', data = small_df)
plt.title('My very complex plot')
plt.xlabel('Countries in this plot')
plt.ylabel('People immigrated to Canada')

## Themes

Seaborn has a set of predefined themes, configurable via [`.set_style()`](https://seaborn.pydata.org/generated/seaborn.set_style.html)

See also [Seaborn's tutorial](https://seaborn.pydata.org/tutorial/aesthetics.html) on how to control figure aesthetics in general terms.

In [None]:
#the supported themes
themes = ["whitegrid", "darkgrid", "white", "dark", "ticks"]

#testing all of them, one at a time
for current_theme in themes:
  sns.set_style(current_theme)
  sns.barplot(x = 'Country', y = 'Immigrants', data = small_df)
  plt.title(current_theme)
  #this is needed, otherwise jupyter would only show the last one
  plt.show()
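
`set_style()` changes the theme for every plot that follows. If you only want a theme for one figure, seaborn also offers `axes_style()`, which can be used in a `with` block. A minimal sketch, on made-up data:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up data, only to have something to draw
toy = pd.DataFrame({"Country": ["Italy", "France", "Spain"],
                    "Immigrants": [3, 1, 2]})

# The style applies only to axes created inside the "with" block
with sns.axes_style("whitegrid"):
    grid_inside = plt.rcParams["axes.grid"]
    fig, ax = plt.subplots()
    sns.barplot(data=toy, x="Country", y="Immigrants", ax=ax)
    ax.set_title("whitegrid, only here")

# Outside the block the previous settings are back
grid_outside = plt.rcParams["axes.grid"]
print(grid_inside, grid_outside)
```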

## Multiple charts

It is always possible to treat seaborn plots as pyplot subplots:

In [None]:
plt.figure(figsize = (12, 5))

plt.subplot(1, 2, 1)
sns.barplot(x = 'Country', y = 'Immigrants', data = small_df)
plt.title('First subplot')

plt.subplot(1, 2, 2)
sns.barplot(x = 'Country', y = 'Immigrants', data = small_df)
plt.title('Second subplot')
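
The same layout can be written in pyplot's object-oriented style, passing each axes explicitly through the `ax` parameter that functions like `barplot()` accept. This is just a sketch on made-up data; the names `ax_left` and `ax_right` are arbitrary.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up data: two values per country
toy = pd.DataFrame({"Country": ["Italy", "France", "Spain"] * 2,
                    "Immigrants": [3, 1, 2, 5, 2, 4]})

# One figure, two axes side by side
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(12, 5))

# Each seaborn call is told where to draw
sns.barplot(data=toy, x="Country", y="Immigrants", ax=ax_left)
ax_left.set_title("First subplot")

sns.barplot(data=toy, x="Country", y="Immigrants", color="red", ax=ax_right)
ax_right.set_title("Second subplot")
```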

However, a number of Seaborn functions support "faceting", meaning that they will automatically create subplots based on the values of a (discrete) variable. When that's the case the function will support a `col` parameter. Here "col" stands for columns, not color.

An easy example comes from function [`.relplot()`](https://seaborn.pydata.org/generated/seaborn.relplot.html)


In [None]:
sns.relplot(data = small_df, x = 'Year', y = 'Immigrants', col = 'Country')

Each specific subplot can support multiple series, as usual, via the `hue` parameter.

In [None]:
#producing a dataset with two levels hierarchy of series
#can you guess the final content/shape of variable tmp?
#summing only the year columns, so that text columns are not aggregated
tmp = df_canada.groupby(['AreaName', 'RegName'])[list(years)].sum()
tmp = tmp.reset_index().melt(id_vars = ['AreaName', 'RegName'], var_name='Year', value_name='Immigrants')

#col controls the number of panels
#hue controls the number of lines in each panel
sns.relplot(data = tmp, x = 'Year', y = 'Immigrants', col = 'AreaName', hue = 'RegName', col_wrap = 3, kind = 'line')

## Adding notes to the plots

Again, this is done using pyplot. The main function for custom notes is [`.annotate()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.annotate.html). See also the annotation [tutorial](https://matplotlib.org/stable/tutorials/text/annotations.html#plotting-guide-annotation) for boxes, labels and other decorations.


In [None]:
sns.barplot(x = 'Country', y = 'Immigrants', data = small_df)
plt.annotate('A simple note', xy = (1, 3000))
plt.annotate('A note with\nan arrow', xy = (1, 600), xytext = (1.5, 1500), 
              arrowprops=dict(facecolor='black', shrink=0.05),
              horizontalalignment='center', verticalalignment='top'
             )

# A gallery of interesting plots

Mostly stuff that is hard to do in basic pyplot.

## Countplot

Show the counts of observations in each categorical bin using bars. See the [countplot()](https://seaborn.pydata.org/generated/seaborn.countplot.html) function.

In [None]:
#vertical
plt.figure(figsize = (15, 5))
sns.countplot(data = df_canada, x = 'AreaName')

In [None]:
#horizontal
plt.figure(figsize = (15, 5))
sns.countplot(data = df_canada, y = 'AreaName')
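
The numbers behind a countplot are the same ones that pandas returns with `value_counts()`. A small sketch on a made-up table (six invented rows, not the Canada data) to compare the two:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up table: one row per country, with its continent
toy = pd.DataFrame(
    {
        "Country": ["Italy", "France", "Spain", "China", "India", "Kenya"],
        "AreaName": ["Europe", "Europe", "Europe", "Asia", "Asia", "Africa"],
    }
)

# How many rows fall in each category
counts = toy["AreaName"].value_counts()
print(counts)

# countplot() draws one bar per category, as tall as the count
fig, ax = plt.subplots()
sns.countplot(data=toy, x="AreaName", ax=ax)
heights = [bar.get_height() for bar in ax.patches]
print(heights)
```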

## Plotting distributions

We are going to use [`histplot()`](https://seaborn.pydata.org/generated/seaborn.histplot.html) and NOT `distplot()` (deprecated, but you still find tutorials online with it).

In [None]:
sns.histplot(data = small_df[small_df['Country'] == 'Italy'], x='Immigrants', kde = True)

In [None]:
plt.figure(figsize = (15, 8))
sns.histplot(data = small_df, x='Immigrants', hue = 'Country', multiple = 'dodge', shrink = 0.8)

## Heatmaps

Heatmaps are color coded matrix-like data. The function is [`.heatmap()`](https://seaborn.pydata.org/generated/seaborn.heatmap.html)

We use data in the wide format, since it's already matrix-like.

In [None]:
#twenty countries to keep everything manageable
cnt_20 = df_canada.index[0:20]

plt.figure(figsize = (15, 10))
sns.heatmap(df_canada.loc[cnt_20, years])

For small heatmaps you can annotate the actual values.

In [None]:
#four countries, 5 years
cnt_4 = df_canada.index[0:4]
years_5 = years[0:5]

plt.figure(figsize = (15, 10))
sns.heatmap(df_canada.loc[cnt_4, years_5], annot = True)
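
If your data is already in long format, it needs to go back to wide before reaching `heatmap()`. One way to do that is the pandas `pivot()` method; here is a sketch on made-up numbers (the `toy_long` and `toy_wide` names are chosen for illustration).

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up long-format data
toy_long = pd.DataFrame(
    {
        "Country": ["Italy", "Italy", "France", "France", "Spain", "Spain"],
        "Year": [1980, 1981, 1980, 1981, 1980, 1981],
        "Immigrants": [1, 2, 5, 6, 9, 10],
    }
)

# long -> wide: countries on the rows, years on the columns
toy_wide = toy_long.pivot(index="Country", columns="Year", values="Immigrants")
print(toy_wide)

# the wide table is matrix-like, so it can be passed straight to heatmap()
fig, ax = plt.subplots()
sns.heatmap(toy_wide, annot=True, ax=ax)
```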

## Linear regression plot

This is usually a scatter plot (which would be obtained using the [`.scatterplot()`](https://seaborn.pydata.org/generated/seaborn.scatterplot.html) function) with a linear regression drawn on top of it, with or without confidence intervals. It's obtained using the [`.lmplot()`](https://seaborn.pydata.org/generated/seaborn.lmplot.html) function (here lm stands for linear model).

For this kind of plot both the X and Y variables must be numeric (a regression wouldn't make sense otherwise). So far the `Year` column has not been handled as integers, but it can easily be converted. We do that in a copy of the dataframe to keep the original data format.

In [None]:
#convert Year to int; astype() returns a new dataframe, so small_df is left untouched
tmp = small_df.astype({"Year": int})

sns.lmplot(data = tmp, x='Year', y='Immigrants', hue = 'Country')

We can raise the order of the fitted polynomial. Here we fit parabolas:

In [None]:
sns.lmplot(data = tmp, x='Year', y='Immigrants', hue = 'Country', order = 2)
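
`lmplot()` draws the fitted curve but hands back the plot, not the coefficients. If you need the numbers, one option is to run a separate least-squares fit, for example with `numpy.polyfit`. The sketch below uses a made-up series (an exact parabola, so the result is easy to check) and is not read back from seaborn.

```python
import numpy as np

# Made-up yearly series: an exact parabola, so the fit is easy to check
year = np.arange(1980, 1990)
t = year - year[0]  # years since 1980, keeps the numbers small
immigrants = 3 * t**2 + 2 * t + 100

# Degree 1 is a straight line, degree 2 matches order = 2 in the plot above
line_coef = np.polyfit(t, immigrants, deg=1)
parab_coef = np.polyfit(t, immigrants, deg=2)

print(line_coef)   # slope, intercept
print(parab_coef)  # quadratic, linear and constant terms
```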

## Violin plot

A violin plot combines a box plot and a kernel density estimate. Violin plots can easily become beautiful graphics, and just as easily become hard to interpret. The seaborn function to use is called
[`.violinplot()`](https://seaborn.pydata.org/generated/seaborn.violinplot.html)

In [None]:
sns.violinplot(data = small_df, x='Country', y='Immigrants')

Let's move to a different choice of countries, something that lets us appreciate the plot better.

In [None]:
countries = ['China', 'India']
tmp = df_canada_long[df_canada_long['Country'].isin(countries)]

sns.violinplot(data = tmp, x='Country', y='Immigrants')

It doesn't make sense to have negative immigrants, so let's limit the violins to the actual recorded values. Moreover, let's get rid of the internal boxplot and put the true observations there.

In [None]:
#cut = 0 stops the density at the smallest and largest observed values
sns.violinplot(data = tmp, x='Country', y='Immigrants', cut = 0, inner = 'point')
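
Why did the first China/India violin reach below zero at all? A kernel density estimate smooths each observation into a small bump, and the bumps spill over the edges of the data. Here is a hand-rolled sketch with made-up values and an arbitrary bandwidth (this is not seaborn's internal code, just the same idea in plain NumPy):

```python
import numpy as np

# Made-up, non-negative observations with a long right tail
values = np.array([0, 10, 20, 30, 1000], dtype=float)
bandwidth = 200.0  # arbitrary smoothing width, chosen only for illustration


def gaussian_kde_at(x, data, bw):
    """Average of Gaussian bumps centred on each observation."""
    z = (x - data) / bw
    return np.mean(np.exp(-0.5 * z**2) / (bw * np.sqrt(2 * np.pi)))


# The smoothed density is still positive where no observation exists
density_below_zero = gaussian_kde_at(-100.0, values, bandwidth)
print(values.min(), density_below_zero)
```

The `cut` parameter of `violinplot()` controls how far past the extreme observations this smoothed curve is drawn.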

## Pairplot

A quick and powerful tool for exploratory data analysis on a set of variables. It shows univariate distributions and bivariate scatterplots. It's obtained using the function [`pairplot()`](https://seaborn.pydata.org/generated/seaborn.pairplot.html)

In [None]:
countries = ['Italy', 'France', 'Spain', 'China', 'India']
sns.pairplot(df_canada.loc[countries, years].transpose())

Beware of your data shape! `pairplot()` works with wide-format data and considers each *column* as a variable. Moreover, it will create a number of subplots that grows with the **square** of the number of variables.

In the code above, if we didn't transpose and we just ran:

```
sns.pairplot(df_canada.loc[countries, years])
```

it would have created a year vs. year pairplot, so 34 x 34 = 1156 subplots!

## Jointplot

A more specialized version of pairplot, [`jointplot()`](https://seaborn.pydata.org/generated/seaborn.jointplot.html) allows for an in-depth comparison of two variables.

In [None]:
sns.jointplot(data = df_canada.loc[:, years].transpose(), x = 'Italy', y = 'France')

Using KDE to extract level curves. 

If you are interested in this kind of plot there's also a [`kdeplot()`](https://seaborn.pydata.org/generated/seaborn.kdeplot.html) without the marginal distributions.

In [None]:
sns.jointplot(data = df_canada.loc[:, years].transpose(), x = 'Italy', y = 'France', kind = 'kde')

--- 

# ASSIGNMENT! Analysis by year

Investigate the distribution of immigrants to Canada for years 1980, 1990, 2000, 2010. Produce at least one plot showing all four selected years and one plot with two of them (you choose which ones).

---

In [None]:
#your solution here