# Visualisation (Python &ndash; seaborn)

## 1. Examples for the lecture

Here are examples of computations and graphs used in the lecture ***Tools for EDA & visualisation***. Study and run them; they may be useful for your work in the next section.

Complete tutorials for pandas and seaborn can be found at these links:

* [Pandas](https://pandas.pydata.org)
* [Seaborn](https://seaborn.pydata.org)

First we import the packages, set up the environment, and read and adjust the data.

In [4]:
### Setup
%matplotlib inline
# should enable plotting without explicit call .show()

# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# classes for special types
from pandas.api.types import CategoricalDtype

# Apply the default theme (pass font_scale, e.g. sns.set_theme(font_scale=1.2), for a bigger font)
sns.set_theme()

# Reading and adjusting data
K = pd.read_csv("application_train.csv")
K = K[0:500] # only first 500 records as a sample
K.columns = K.columns.str.lower() # column names to lowercase
# new columns with more intuitive values
K["age"] = -K["days_birth"] / 365.25 
K["yrs_employed"] = -K["days_employed"] / 365.25
K["yrs_employed"] = np.where(K["yrs_employed"] < 0, np.nan, K["yrs_employed"]) # cleaning from nonsense values
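
If the CSV file is not at hand, the transformation above can be tried on a tiny hand-made frame. This is only a sketch: the rows are made up, and the value 365243 is assumed here to be the placeholder that produces the negative (nonsense) years.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows mimicking the raw columns (all values are made up);
# 365243 is assumed to be a placeholder meaning "no employment recorded"
toy = pd.DataFrame({
    "DAYS_BIRTH": [-9461, -16765, -19046],
    "DAYS_EMPLOYED": [-637, -1188, 365243],
})
toy.columns = toy.columns.str.lower()  # column names to lowercase

# days are counted backwards, hence the minus sign
toy["age"] = -toy["days_birth"] / 365.25
toy["yrs_employed"] = -toy["days_employed"] / 365.25
# a positive day count gives negative years, which we replace by NaN
toy["yrs_employed"] = np.where(toy["yrs_employed"] < 0, np.nan, toy["yrs_employed"])

print(toy.round(1))
```

The last row ends up with a missing value in *yrs_employed*, which pandas and seaborn then skip in statistics and plots.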

### 1.1 Distributions of individual variables

The basic seaborn function for plotting the distribution of a single variable is [displot](https://seaborn.pydata.org/tutorial/distributions.html). It can plot both categorical and numeric variables.

Let's start with some **categorical variables**. We make a frequency table that combines absolute and relative frequencies. For an ordinal (ordered) variable, it may also be meaningful to compute cumulative frequencies.

In [None]:
# Categorical variable - frequency table
freqtab = K.groupby("name_type_suite").agg(count=("sk_id_curr", "count")) # absolute frequencies (counts)
freqtab["count_rel"] = freqtab["count"] / freqtab["count"].sum() # relative frequencies
freqtab
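
As an alternative sketch, the same kind of table can be built with the pandas method *value_counts*. The toy column below is made up and only mimics *name_type_suite*; the name *freqtab2* is chosen for illustration.

```python
import pandas as pd

# Hypothetical toy column (values are made up, including one missing value)
s = pd.Series(["Family", "Unaccompanied", "Unaccompanied", None, "Family", "Unaccompanied"],
              name="name_type_suite")

freqtab2 = pd.DataFrame({
    "count": s.value_counts(),                    # absolute frequencies
    "count_rel": s.value_counts(normalize=True),  # relative frequencies
})
print(freqtab2)

# missing values are dropped by default; dropna=False counts them as a separate category
print(s.value_counts(dropna=False))
```

Note that both approaches ignore missing values unless told otherwise, so the relative frequencies refer to the non-missing records only.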

In [None]:
# for cumulative frequencies, the variable needs to be ordered
cat_type = CategoricalDtype(categories=["Lower secondary", "Secondary / secondary special",
                                        "Incomplete higher", "Higher education"],
                            ordered=True)
K["education"] = K["name_education_type"].astype(cat_type)
# frequency table
freqtab = K.groupby("education", observed=False).agg(count=("sk_id_curr", "count")) # absolute frequencies (counts); observed=False keeps empty categories
freqtab["count_cum"] = freqtab["count"].cumsum() # cumulative frequencies
freqtab["count_rel"] = freqtab["count"] / freqtab["count"].sum() # relative frequencies
freqtab["count_relcum"] = freqtab["count_rel"].cumsum() # cumulative relative frequencies
freqtab
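
Why the ordering matters can be seen in a small sketch with made-up levels (the level names and values below are hypothetical): plain strings sort alphabetically, while an ordered categorical sorts by the declared order, so only the latter gives meaningful cumulative frequencies.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

levels = ["low", "medium", "high"]  # hypothetical ordered levels
raw = pd.Series(["medium", "high", "low", "medium", "medium", "high"])

ordered = raw.astype(CategoricalDtype(categories=levels, ordered=True))

# plain strings sort alphabetically: high, low, medium
alphabetical = raw.value_counts().sort_index()
# ordered categories sort by their declared order: low, medium, high
by_level = ordered.value_counts().sort_index()

cum_rel = (by_level / by_level.sum()).cumsum()  # cumulative relative frequencies
print(cum_rel)

# order-aware comparisons work as well: share of observations at "medium" or above
print((ordered >= "medium").mean())
```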

Visualising frequencies is simple &ndash; we use a barplot, either standard (bars side by side) or stacked (useful for cumulative frequencies). The variable name is assigned to either the *x* or the *y* parameter, which makes the bars vertical or horizontal, respectively.

In [None]:
# graphs for absolute and relative frequencies
# done directly from DataFrame, no need to compute frequency table
g = sns.displot(data=K, y="name_type_suite") # absolute freqs
g = sns.displot(data=K, y="name_type_suite", stat='probability') # relative freqs - only the scale of the frequency axis differs

In [None]:
# frequencies for ordinal variable
g = sns.displot(data=K, y="education", stat="probability") # relative frequencies directly from DataFrame

# for stacked barplot, we use frequency table computed above
freqtab["hlp"] = [""] * len(freqtab) # dummy variable, just for filling the seaborn parameter
# seaborn can use the named index "education" of freqtab as a variable
g = sns.displot(data=freqtab, x="hlp", hue="education", multiple="stack", weights="count_rel")

# for stacked absolute frequencies, use "count" instead of "count_rel"

If we want to annotate the graph, we may use the *set* methods. For more fine-tuning (colors etc.), see the seaborn tutorial.

In [None]:
g = sns.displot(data=freqtab, x="hlp", hue="education", multiple="stack", weights="count_rel") \
    .set_axis_labels("Education", "Relative frequency") \
    .set(title="Distribution of education")

Now we treat some **numeric variables**. We make a bunch of graphs with different levels of detail and smoothing. Many of them use the *displot* function, whose parameter *kind* changes the type of graph (ECDF, density etc.) from the default, which is a histogram. Some graphs use *catplot* instead, because stripplot and swarmplot are available through *catplot*, not *displot*.

If a variable is numeric but has only a few unique values, we can treat it as categorical &ndash; note the *discrete* parameter, which adjusts the bar positions in the histogram.

In [None]:
### Numerical discrete variable
# treated as categorical
g = sns.displot(data=K, x="cnt_fam_members") # not so pretty
g = sns.displot(data=K, x="cnt_fam_members", discrete=True) # better adjusted bars

A continuous numeric variable can be plotted in many ways, depending on how much detail we need.

In [None]:
# a rug-like plot of individual observations can be drawn by stripplot (here without jitter)
g = sns.stripplot(data=K, x="ext_source_1", jitter=False, size=2)
# to reduce overlapping, use catplot (a stripplot with jitter by default) or a swarmplot
g = sns.catplot(data=K, x="ext_source_1")
g = sns.catplot(data=K, x="ext_source_1", kind="swarm")

In [None]:
# ecdf with rug
g = sns.displot(data=K, x="ext_source_1", kind="ecdf", rug=True)
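
What the ECDF curve shows can be computed by hand in a few lines. This is a minimal sketch on made-up uniform data standing in for *ext_source_1*; real data would first need its missing values dropped.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=200)  # hypothetical stand-in for ext_source_1

x_sorted = np.sort(x)
# ecdf[i] is the share of observations that are <= x_sorted[i]
ecdf = np.arange(1, len(x_sorted) + 1) / len(x_sorted)

# reading a quantile off the curve: the first x at which the ECDF reaches 0.5
median_from_ecdf = x_sorted[np.searchsorted(ecdf, 0.5)]
print(median_from_ecdf, np.median(x))
```

The two printed numbers are close but not identical, because *np.median* averages the two middle observations when the sample size is even.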

In [None]:
# histogram
g = sns.displot(data=K, x="ext_source_1", bins=5)
# for less smoothing, use bigger number of bins:
g = sns.displot(data=K, x="ext_source_1", bins=20)
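
The effect of the number of bins can also be inspected numerically with *np.histogram*. The sketch below uses made-up data in the range 0 to 1 as a stand-in for *ext_source_1*.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.beta(2, 3, size=300)  # hypothetical stand-in for ext_source_1 (values between 0 and 1)

for bins in (5, 20):
    counts, edges = np.histogram(x, bins=bins)
    print(bins, "bins of width", round(edges[1] - edges[0], 3), "->", counts)

# explicit bin edges give "round" boundaries instead of data-driven ones
counts, edges = np.histogram(x, bins=np.linspace(0, 1, 11))
print(edges)
```

The *bins* parameter of *displot* accepts a sequence of bin edges in the same way, which is handy when the boundaries should fall on round values.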

In [None]:
# density with rug
g = sns.displot(data=K, x="ext_source_1", kind="kde", rug=True, fill=True, bw_adjust=1.5)
# for less smoothing, use a smaller bandwidth multiplier (bw_adjust):
g = sns.displot(data=K, x="ext_source_1", kind="kde", rug=True, fill=True, bw_adjust=0.5)
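
To see what *bw_adjust* does, here is a hand-rolled sketch of a Gaussian kernel density estimate. It is not seaborn's implementation; the helper *gaussian_kde_1d* is hypothetical, and the use of Scott's rule for the base bandwidth is an assumption meant to mirror seaborn's default behaviour.

```python
import numpy as np

def gaussian_kde_1d(sample, grid, bw_adjust=1.0):
    """Gaussian kernel density estimate on `grid` (hypothetical helper, not a seaborn function)."""
    sample = np.asarray(sample, dtype=float)
    n = len(sample)
    # Scott's rule of thumb for the base bandwidth, scaled by bw_adjust
    h = bw_adjust * sample.std(ddof=1) * n ** (-1 / 5)
    z = (grid[:, None] - sample[None, :]) / h
    # average of one Gaussian bump per observation
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(2)
# two made-up clusters of values
x = np.concatenate([rng.normal(0.3, 0.05, 150), rng.normal(0.7, 0.05, 150)])
grid = np.linspace(-0.5, 1.5, 2001)

smooth = gaussian_kde_1d(x, grid, bw_adjust=1.5)
rough = gaussian_kde_1d(x, grid, bw_adjust=0.5)
print(smooth.max(), rough.max())  # less smoothing gives higher, narrower peaks
```

A larger multiplier widens every bump, so separate clusters start to merge; a smaller one keeps them apart but makes the curve wigglier.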

For a numeric variable, the information about its distribution can usually be "compressed" into a few numbers (statistics).

In [None]:
# computing statistical characteristics of distribution
print("Min and max age: ", "%.1f" % K["age"].min(), "--", "%.1f" % K["age"].max())
print("Mean age: ", "%.1f" % K["age"].mean())
print("Median age: ", "%.1f" % K["age"].median())
print("Std. dev. of age: ", "%.1f" % K["age"].std())

print("Deciles of age:\n")
hlp_10s = [i/10.0 for i in range(0, 11)]
print(K["age"].quantile(hlp_10s))

For a skewed distribution, quantiles are more useful than the mean or standard deviation. They can be plotted as an ECDF (quantiles can be read off via the Y axis) or as a boxplot.

In [None]:
# quantiles for skewed distribution - ecdf, boxplot
g = sns.displot(data=K, x="yrs_employed", kind="ecdf") \
    .refline(y=0.25)
g = sns.catplot(data=K, y="yrs_employed", kind="box")
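
The numbers behind a boxplot can be computed directly. The sketch below uses made-up right-skewed data in place of *yrs_employed* and assumes the usual convention (Tukey's rule with 1.5 times the interquartile range) for the whiskers.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
yrs = pd.Series(rng.exponential(scale=5, size=400))  # made-up right-skewed "years employed"

q1, med, q3 = yrs.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1
# whiskers reach the most extreme points within 1.5 * IQR of the box
lower_whisker = yrs[yrs >= q1 - 1.5 * iqr].min()
upper_whisker = yrs[yrs <= q3 + 1.5 * iqr].max()
n_outliers = ((yrs < lower_whisker) | (yrs > upper_whisker)).sum()

print(f"mean {yrs.mean():.2f} vs median {med:.2f}")  # the long right tail pulls the mean up
print(f"box {q1:.2f}..{q3:.2f}, whiskers {lower_whisker:.2f}..{upper_whisker:.2f}, outliers {n_outliers}")
```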

### 1.2 Relationships of variables

Methods for analysis and plotting differ depending on the type (categorical or numeric) of both variables.

* If one of the variables is categorical, the basic strategy is to split the data into categories by this variable, study the distribution of the other variable within each category, and compare the distributions across categories.
* If both variables are numeric, we use bivariate plots and compute statistics such as correlation.

Let's start with the case where both variables are categorical. In this case we usually compute a contingency table (a 2-D frequency table).

In [None]:
# contingency table with absolute frequencies
pd.crosstab(K["name_family_status"], K["code_gender"])

In [None]:
# for relative frequencies in contingency table, use parameter normalize:
pd.crosstab(K["name_family_status"], K["code_gender"], normalize="columns") # relative by columns
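
The options of the *normalize* parameter are easy to compare on a toy table. The data below are made up, and the column names *status* and *gender* are chosen only for this sketch.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
toy = pd.DataFrame({
    "status": ["Married", "Single", "Married", "Married", "Single", "Married"],
    "gender": ["F", "F", "M", "F", "M", "M"],
})

by_col = pd.crosstab(toy["status"], toy["gender"], normalize="columns")  # each column sums to 1
by_row = pd.crosstab(toy["status"], toy["gender"], normalize="index")    # each row sums to 1
overall = pd.crosstab(toy["status"], toy["gender"], normalize="all")     # whole table sums to 1

# margins=True appends row and column totals to the absolute counts
with_totals = pd.crosstab(toy["status"], toy["gender"], margins=True)
print(with_totals)
```

Which normalisation to use depends on the question: "columns" answers "what is the status mix within each gender", while "index" answers "what is the gender mix within each status".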

A contingency table, like a frequency table, can be visualised by some kind of barplot. Bars can be:

* placed side by side
* stacked within each category as absolute counts
* stacked within each category as relative counts (each stacked bar sums to 1)

In [None]:
# barplot with bars side by side
g = sns.displot(data=K, x="code_gender", hue="name_family_status", multiple="dodge")\
    .refline(x=0.5) # auxiliary line to split categories

In [None]:
# barplot with stacked bars as absolute counts
g = sns.displot(data=K, x="code_gender", hue="name_family_status", multiple="stack")

In [None]:
# barplot stacked as relative counts (sums up to 1)
# needs data preparation
hlp_df = pd.crosstab(K["name_family_status"], K["code_gender"], normalize="columns")
print(hlp_df)
# for plotting stacked barplot, we need to transform this "wide" format to "long" format
hlp_df.reset_index(inplace=True)
hlp_df = pd.melt(hlp_df, id_vars="name_family_status", var_name="code_gender", value_name="prop")
print(hlp_df)

g = sns.displot(data=hlp_df, x="code_gender", hue="name_family_status", multiple="stack", weights="prop")
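
The wide-to-long step can be checked in isolation. The sketch below starts from a made-up table of column-wise proportions; the names *wide_df* and *long_df* are chosen for illustration.

```python
import pandas as pd

# Hypothetical table of column-wise proportions (values are made up)
wide_df = pd.DataFrame(
    {"F": [0.6, 0.4], "M": [0.7, 0.3]},
    index=pd.Index(["Married", "Single"], name="name_family_status"),
)

# one row per (status, gender) cell, which is the shape seaborn expects
long_df = wide_df.reset_index().melt(id_vars="name_family_status",
                                     var_name="code_gender", value_name="prop")
print(long_df)

# the stacked bars reach 1 because the proportions sum to 1 within each gender
print(long_df.groupby("code_gender")["prop"].sum())
```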

Another idea is to make a *heatmap* &ndash; replace each cell of the contingency table by a color tone according to the value in the cell. This works well for absolute frequencies but may be confusing for relative ones.

In [None]:
# discrete heatmap
g = sns.displot(data=K, x="code_gender", y="name_family_status", cbar=True)
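
A possible alternative, sketched here on made-up data, is to compute the contingency table first and pass it to *sns.heatmap*, which can also print the count into each cell.

```python
import pandas as pd
import seaborn as sns

# Hypothetical toy data (values are made up)
toy = pd.DataFrame({
    "name_family_status": ["Married", "Single", "Married", "Married", "Single", "Married"],
    "code_gender": ["F", "F", "M", "F", "M", "M"],
})

ct = pd.crosstab(toy["name_family_status"], toy["code_gender"])
# annot=True writes the value into each cell, fmt="d" formats it as an integer
ax = sns.heatmap(ct, annot=True, fmt="d", cbar=True)
ax.set_title("Counts by family status and gender")
```

Unlike *displot*, *heatmap* works on an already aggregated table, so the same call can show relative frequencies if the table is computed with *normalize* (with a float format such as `fmt=".2f"`).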

With one categorical and one numeric variable, we can split the data by the categorical variable and compute statistics by categories. There are several ways to split by categories when plotting:

* multiple lines (curves), possibly overlapping
* one axis used for categories (sections inside one graph), with a separate distribution graph in each section
* the figure split into separate graphs

We can use either *displot* with the parameter *hue* or *col*, or *catplot* with the category variable as *x* (or *y*, if we want to split the graph horizontally).

In [None]:
# numeric vs. category as overlapping lines/curves
g = sns.displot(data=K, x="ext_source_1", hue="code_gender") # overlapping histograms
g = sns.displot(data=K, x="ext_source_1", hue="code_gender", kind="kde") # overlapping KDE


In [None]:
# numeric vs. category as separate graphs
g = sns.displot(data=K, x="ext_source_1", col="code_gender",
                stat="probability", common_norm=False) # separate histograms

In [None]:
# numeric vs. category as sections of one graph
g = sns.catplot(data=K, x="code_gender", y="ext_source_1") # stripplot
g = sns.catplot(data=K, x="code_gender", y="ext_source_1", kind="violin") # violinplot
g = sns.catplot(data=K.assign(temp=""), x="temp", y="ext_source_1", hue="code_gender", kind="violin", split=True)

We may want to compute statistics such as the mean, median or SD by categories and compare them. Computing them is easy with the pandas *groupby* and *agg* methods. For plotting in seaborn we can use a *barplot*, which is available through *catplot* with *kind="bar"*.

In [None]:
# statistics by categories
K.groupby("code_gender").agg({"ext_source_1": ["mean", "median", "std"]})

In [None]:
# barplot (mean by category) and boxplot (quartiles by category)
g = sns.catplot(data=K, x="code_gender", y="ext_source_1", kind="bar")
g = sns.catplot(data=K, x="code_gender", y="yrs_employed", kind="box")
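
The bar height in the barplot is the mean of the category, and the error bar is, by default in recent seaborn versions, a bootstrap confidence interval. The idea can be sketched by hand on made-up values for a single category; the details (1000 resamples, a 95% percentile interval) are assumptions meant to mirror the defaults.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(0.5, 0.2, size=120)  # made-up values for one category

# resample with replacement and recompute the mean many times
boot_means = np.array([rng.choice(x, size=len(x), replace=True).mean() for _ in range(1000)])
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(f"bar height (mean): {x.mean():.3f}")
print(f"95% bootstrap interval: {ci_low:.3f} .. {ci_high:.3f}")
```

The interval describes the uncertainty of the mean, not the spread of the data, which is why it is much narrower than the box in a boxplot.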

When both variables are numeric, we use the *relplot* or *displot* function. There are two basic cases:

* each x value can have several observations &ndash; *scatterplot* (a cloud of points), heatmap, contourplot
* each x value has only one observation, or we want to aggregate over the y axis &ndash; *lineplot* (time series)

In [None]:
g = sns.relplot(data=K, x="age", y="ext_source_1") # scatterplot
g = sns.displot(data=K, x="age", y="ext_source_1", cbar=True) # heatmap
g = sns.displot(data=K, x="age", y="ext_source_1", kind="kde") # contourplot
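
Besides plotting, the strength of the relationship between two numeric variables can be summarised by a correlation coefficient. A minimal sketch on made-up data (the columns *score* and *age_cubed* are hypothetical) compares Pearson and Spearman correlation:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
age = rng.uniform(20, 70, size=300)
toy = pd.DataFrame({
    "age": age,
    "score": 0.01 * age + rng.normal(0, 0.15, size=300),  # made-up noisy linear relationship
    "age_cubed": age ** 3,                                 # monotonic but non-linear in age
})

pearson = toy.corr()                    # linear association
spearman = toy.corr(method="spearman")  # rank-based, captures any monotonic association
print(pearson.round(2))
print(spearman.round(2))
```

For the non-linear but monotonic pair, the Spearman coefficient equals 1 while the Pearson coefficient stays below 1, which is why the rank-based version is often preferred for skewed variables.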

A scatterplot or contourplot can be combined with graphs of the individual distributions (histogram, density). This is what the *jointplot* function does.

In [None]:
# jointplot - both scatterplot and individual distributions
g = sns.jointplot(data=K, x="age", y="ext_source_1")

In [None]:
# lineplot
# try yourself :-)

## 2. Tasks for you

We will use the same data as above (file *application_train.csv* from *kaggle_home_credit.zip*), but a larger sample of it.

1. Read the file *application_train.csv* again and draw a random sample of 5 000 records from it.
2. Transform the data as above: *days_birth* -> *age*, *days_employed* -> *yrs_employed*.
3. Explore the distribution of *age* by ECDF, density estimation, histogram and boxplot:
   + In the histogram, use bins of 5 years and try to give them reasonable boundaries (e. g. 20-25 etc., see parameter *bins*).
   + In the density estimation, limit the curve to the range of the variable (see parameter *cut* in *kdeplot*). Try various amounts of smoothing.
   + For one graph (no matter which one), add neat annotation (proper title, axis labels) and try to change the theme (*set_theme* function), the font size (*font_scale* in the *set* function) and the color (find out how yourself).
4. Is the distribution of *age* closer to Gaussian, or is it skewed? Do the 1-sigma and 2-sigma rules hold for it?
5. Explore the distribution of *cnt_children*, treating it as an ordered categorical variable &ndash; make frequency table(s) and graphs.
6. Explore the relationships of *flag_own_car*, *name_family_status*, *yrs_employed* and *ext_source_1* to answer the following questions:
   - What is the share of car owners in groups by family status? (Compute the owner shares as decimal numbers and plot them by categories.)
   - Plot *ext_source_1* against *yrs_employed*, first for all data together and then distinguishing car ownership as a category. (Hint: a log scale on one of the axes may help.)
   - What are the distributions of *ext_source_1* in groups by family status (make a plot)? Which statistics describe these distributions well? Compute them for each group.
   - Do the same for *yrs_employed* instead of *ext_source_1*. Do we use the same or different statistics to describe the distribution of *yrs_employed*? Again, compute them.
7. Make a plot of the *age* distribution grouped by *code_gender* and *cnt_children* (together, i. e. nested grouping).