# Visualisation (Python – seaborn)

## 2. Tasks for you

We will use the same data as above (file *application_train.csv* from *kaggle_home_credit.zip*), but a larger sample of it.

1. Read the file *application_train.csv* again and draw a random sample of 5 000 records from it.

In [None]:
### Setup
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns

# classes for special types
from pandas.api.types import CategoricalDtype

# Apply the default theme
sns.set_theme()

# Reading and adjusting data
K = pd.read_csv("application_train.csv")
K = K.sample(n=5000, axis=0) # random sample of 5000 records
K.columns = K.columns.str.lower() # column names to lowercase

2. Transform the data as above: *days_birth* -> *age*, *days_employed* -> *yrs_employed* (replace nonsense values with np.nan).

In [None]:
K["age"] = -K["days_birth"] / 365.25 
K["yrs_employed"] = -K["days_employed"] / 365.25
K["yrs_employed"] = np.where(K["yrs_employed"] < 0, np.nan, K["yrs_employed"]) # cleaning from nonsense values

3. Explore the distribution of *age* using an ECDF, a density estimate, a histogram and a boxplot:
   - In the histogram, use bins of 5 years and choose sensible bin boundaries (e.g. 20-25 etc., see parameter *bins*).
   - In the density estimate, limit the curve to the range of the variable (see parameter *cut* in *kdeplot*).
   - In the density estimate, try various amounts of smoothing.
   - For one graph (no matter which one), add neat annotation (proper title, axis labels), and try changing the theme (*set_theme* method), the font size (*font_scale* in *set* method) and the color (find out how yourself).

In [None]:
g = sns.displot(data=K, x="age", kind="ecdf")
g = sns.displot(data=K, x="age", kind="kde", cut=0, bw_adjust=0.8)
g = sns.displot(data=K, x="age", bins=range(20, 75, 5), color="green") \
    .set_axis_labels("Age [years]", "Count") \
    .set(title="Distribution of applicants' age")
g = sns.catplot(data=K, y="age", kind="box")

4. Is the distribution of *age* closer to Gaussian, or is it skewed? Do the 1-sigma and 2-sigma rules hold for it?

In [None]:
# In the histogram the distribution looks like something between a uniform and a normal distribution:
# it is roughly symmetric rather than skewed, so it is reasonably close to Gaussian.
# For the 1-sigma and 2-sigma rules, let's compute the mean and SD ("sigma"):
age_mean = K["age"].mean()
age_std = K["age"].std()
print("Mean age: ", "%.1f" % age_mean)
print("SD age: ", "%.1f" % age_std)
print("Share of records within 1 sigma:", "%.3f" % np.mean(np.abs(K["age"] - age_mean) < age_std))
print("Share of records within 2 sigma:", "%.3f" % np.mean(np.abs(K["age"] - age_mean) < 2*age_std))
# The data approximately satisfy both rules
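
To judge whether the observed shares "hold" the rules, it helps to have reference values at hand. The sketch below computes the theoretical share within k standard deviations for an exactly normal and an exactly uniform distribution, the two shapes the histogram is compared with. The helper names `normal_share` and `uniform_share` are introduced here for illustration only.

```python
import math

def normal_share(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

def uniform_share(k):
    """Same share for a uniform distribution, whose SD equals range / sqrt(12)."""
    # The interval mean +/- k*SD covers 2k/sqrt(12) = k/sqrt(3) of the range, at most all of it.
    return min(1.0, k / math.sqrt(3))

for k in (1, 2):
    print(f"within {k} sigma: normal {normal_share(k):.3f}, uniform {uniform_share(k):.3f}")
```

Observed shares that fall between the normal and the uniform reference values would be consistent with a distribution that is symmetric but flatter than a Gaussian.
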

5. Explore the distribution of *cnt_children*, treating it as an ordered categorical variable: make frequency table(s) and graphs.

In [None]:
# frequency table(s)
hlp_df = K.groupby("cnt_children").agg(cnt_abs=("sk_id_curr", "count"))
hlp_df["cnt_cum"] = hlp_df["cnt_abs"].cumsum()
hlp_df["cnt_rel"] = hlp_df["cnt_abs"] / hlp_df["cnt_abs"].sum()
hlp_df["cnt_rel_cum"] = hlp_df["cnt_rel"].cumsum()
print(hlp_df)

# graphs
g = sns.displot(data=K, x="cnt_children", discrete=True)
hlp_df["hlp"] = ""
g = sns.displot(data=hlp_df, x="hlp", hue="cnt_children", multiple="stack", weights="cnt_rel")
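
The setup cell imports `CategoricalDtype`, which is one way to make the "ordered categorical" treatment explicit. The following is a minimal sketch of that alternative, not the solution used above. The `children` series is a hypothetical stand-in for `K["cnt_children"]`, and taking the category levels from the observed values is an assumption.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical stand-in for K["cnt_children"]; replace with the real column.
children = pd.Series([0, 0, 1, 2, 0, 1, 0, 3, 0, 1], name="cnt_children")

# Ordered categorical type whose levels are the observed values in ascending order.
children_type = CategoricalDtype(categories=sorted(children.unique()), ordered=True)
children_cat = children.astype(children_type)

# sort=False keeps the category order instead of sorting by frequency.
freq = children_cat.value_counts(sort=False).to_frame("cnt_abs")
freq["cnt_rel"] = freq["cnt_abs"] / freq["cnt_abs"].sum()
freq["cnt_rel_cum"] = freq["cnt_rel"].cumsum()
print(freq)
```

With an ordered type, cumulative shares are meaningful because the rows follow the natural order of the levels.
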

6. Explore the relationships among *flag_own_car*, *name_family_status*, *yrs_employed* and *ext_source_1* to answer the following questions:
   - What is the share of car owners in each family status group? (Compute the owner shares as decimal numbers and plot them as category means: the mean of a 0/1 variable is, in fact, the share of 1's.)
   - Plot *ext_source_1* against *yrs_employed*, first for all records together and then distinguishing car ownership as a category.
   - What is the distribution of *ext_source_1* in each family status group (make a plot)? Which statistics describe this distribution well? Compute them for each group.
   - Do the same for *yrs_employed* instead of *ext_source_1*. Should we use the same or different statistics to describe the distribution of *yrs_employed*? Again, compute them.
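
One possible approach to the questions above is sketched below. The `demo` frame is a hypothetical stand-in for `K` filled with random values, so its numbers say nothing about the real data. The sketch also assumes that *flag_own_car* is coded as "Y"/"N", and the helper column `own_car` is introduced here for illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for K; replace `demo` with the real sample.
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "flag_own_car": rng.choice(["Y", "N"], size=n),
    "name_family_status": rng.choice(["Married", "Single / not married", "Widow"], size=n),
    "yrs_employed": rng.exponential(scale=6, size=n),
    "ext_source_1": rng.beta(2, 2, size=n),
})

# Share of car owners = mean of a 0/1 indicator (assumes "Y" marks an owner).
demo["own_car"] = (demo["flag_own_car"] == "Y").astype(int)
shares = demo.groupby("name_family_status")["own_car"].mean()
print(shares)
g = sns.catplot(data=demo, x="name_family_status", y="own_car", kind="bar")

# ext_source_1 against yrs_employed, first together, then split by car ownership.
g = sns.relplot(data=demo, x="yrs_employed", y="ext_source_1")
g = sns.relplot(data=demo, x="yrs_employed", y="ext_source_1", hue="flag_own_car")

# Distributions by family status, plus summary statistics per group.
g = sns.catplot(data=demo, x="name_family_status", y="ext_source_1", kind="box")
g = sns.catplot(data=demo, x="name_family_status", y="yrs_employed", kind="box")
stats = demo.groupby("name_family_status")[["ext_source_1", "yrs_employed"]].describe()
print(stats)
```

`describe()` returns both kinds of summary at once. Mean and SD usually suit a roughly symmetric variable, while median and quartiles are the safer choice for a skewed one, so the plots should guide which columns to read for each variable.
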

7. Plot the distribution of *age* grouped by *code_gender* and *cnt_children* together (i.e. nested grouping).

In [None]:
# split=True assumes exactly two hue levels (older seaborn versions raise an error otherwise)
g = sns.catplot(data=K, x="cnt_children", y="age", hue="code_gender", kind="violin", split=True)
g = sns.catplot(data=K, x="cnt_children", y="age", hue="code_gender", kind="box")