# Data understanding

We will analyze the *titanic* dataset:

* to find out what information we have (statistical units, variables)
* to check data quality and reliability of data
* to understand distributions of variables and their relationships
* to suggest steps for data cleaning
* to suggest useful data transformations

## 0. What is our goal?

Analysis of data follows from the goal set during **business understanding**. So first we set that goal:

> We analyse Titanic data to find out how survival for each passenger can be predicted from his or her attributes.

Let's start with loading data and making a quick overview.

In [None]:
### Setup
%matplotlib inline
# should enable plotting without an explicit call to .show()

# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# classes for special types
from pandas.api.types import CategoricalDtype

# Apply the default theme
sns.set_theme()

# Reading and inspecting data
df = pd.read_csv("titanic_train.csv")
df

## 1. Basic overview of the data

1. Rows: How many? What are statistical units? How can a unit be identified?
2. Columns: How many? What are their names, types, meanings? At the first glance, do values seem plausible? Are all of them useful for our purpose?

Summary: do we need to carry out any initial transformations? (e.g. to take a sample of rows or columns; to convert column names to lowercase; to add an ID column; to remove some columns etc.)
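
The transformations listed in the summary can be sketched as follows. This is only a minimal illustration on a tiny made-up frame: the `toy` data, its column names and the `row_id` column are hypothetical and are not part of the Titanic file.

```python
import pandas as pd

# Hypothetical toy data standing in for a raw file with untidy column names
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid", "Dee"],
    "Age ": [30, 41, 25, 37],
    "Note": ["x", "y", "z", "w"],
})

# convert column names to lowercase and strip stray whitespace
toy.columns = toy.columns.str.strip().str.lower()

# provide an explicit ID column when the data has none
toy.insert(0, "row_id", range(1, len(toy) + 1))

# remove a column that is useless for the goal
toy = toy.drop(columns=["note"])

# draw a reproducible random sample of rows (here one half)
toy_sample = toy.sample(frac=0.5, random_state=0)

print(toy.columns.tolist())
print(len(toy_sample))
```

Whether any of these steps is needed depends on the dataset at hand; for a small, tidy file none of them may be necessary.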

In [None]:
print(df.shape) # count of rows and columns
# units are passengers, it seems they can be identified by passenger_id
# but are the passenger_id values unique?
print(df[["passenger_id"]].nunique()) # nunique method, DataFrame with one column only
print(len(np.unique(df["passenger_id"]))) # other way using NumPy unique function
# number of unique values is equal to number of rows => ok

In [None]:
# column names and types
print(df.dtypes)
# names are in lowercase, meaningful and short, no need to adjust

# column meaning
# at the first sight, meaning is clear and values seem plausible for all
# - except sibsp, parch, embarked, boat, body
# we have to get an explanation or extra info - can be found on the Internet
# sibsp: Number of siblings/spouses aboard
# parch: Number of parents/children aboard
# embarked: Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
# boat: Lifeboat
# body: Body Identification Number

# all columns seem to be useful for our goal

# column types
# all types seem proper except body and boat (we would expect integers if they are ID numbers)
# moreover the variable *embarked* could be ordered (the order of stops is S, C, Q)
# let's consider it later, at the moment of variable inspection

## 2. Checking the data quality

* Are there any duplicated rows (ignoring the ID)?
* What are the counts and shares of missing values in the dataset columns?
* Are the counts of missing values expected and acceptable?
* Are any columns or rows (almost) empty, so that they may be removed as useless?
* In which columns should we consider fixing values (correction, filling)?

In [None]:
# duplicated rows?
sum(df.duplicated(subset=['pclass', 'name', 'sex', 'age', 'sibsp', 'parch',
       'ticket', 'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest',
       'survived']))
# 0 duplicated rows
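
If the count of duplicates were not zero, the next step would be to look at the affected rows. A minimal sketch on made-up data (the `toy` frame is hypothetical) shows how `keep=False` marks every member of a duplicate group, while the default only marks the repeats:

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 differ only in the ID
toy = pd.DataFrame({
    "passenger_id": [1, 2, 3],
    "name": ["Ann", "Bob", "Ann"],
    "age": [30.0, 41.0, 30.0],
})

# every column except the ID
cols = list(toy.columns.drop("passenger_id"))

# keep=False marks all members of a duplicate group, not only the repeats
dup_mask = toy.duplicated(subset=cols, keep=False)

# default keep="first" counts only the second and later occurrences
n_repeats = toy.duplicated(subset=cols).sum()

print(toy[dup_mask])
print(n_repeats)
```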

In [None]:
# counts of missing values
# absolute counts: len(df) - df.count()
# relative counts (shares):
print(1 - df.count()/len(df))

# occurrences of missing values: age (rather many => problem), fare and embarked (1 case each => not much of a problem),
# - cabin, boat and body (high but expected => not much of a problem),
# - home.dest (around a half but maybe not important)

# the most important variable *survived* is available for all records
# => no need to remove rows
# some columns contain many missing values but still may be useful
# - we can use the fact of missingness as a variable/feature
# - (e. g. for *cabin* or *boat*)

# *age*, *fare* and *embarked* deserve fixing
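
The two ideas from this cell, a table of absolute and relative missing counts and the use of missingness as a feature, can be sketched together. The `toy` frame and the `has_cabin` column name are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "cabin": ["B28", None, None, None],
    "survived": [1, 0, 1, 0],
})

# absolute and relative counts of missing values side by side
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "share_missing": toy.isna().mean(),
}).sort_values("share_missing", ascending=False)
print(missing)

# missingness itself as a feature: 1 if a cabin is recorded, 0 otherwise
toy["has_cabin"] = toy["cabin"].notna().astype(int)
print(toy[["cabin", "has_cabin"]])
```

An indicator like this keeps the information carried by a sparse column without requiring the missing values to be filled.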

In [None]:
# rows with many missing values
df.count(axis=1).value_counts()

# no row is complete but this is not surprising (*boat* and *body* are sparse)
# minimum of non-missing is 10 - looks like we have no "almost empty" rows

After all these checks we can summarize the data quality and make recommendations for preprocessing (cleaning, fixing) the data. Some of them can be carried out immediately if they are necessary or useful for the analysis.

In [None]:
# the variable *embarked* could be ordered (the order of stops is S, C, Q)
embarked_type = CategoricalDtype(categories=["S", "C", "Q"], ordered=True)
df["embarked"] = df["embarked"].astype(embarked_type)
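
What the ordering buys us can be seen on a small standalone example. The `ports` series below is made up; only the category order S, C, Q is taken from the cell above.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# same category order as above: Southampton, Cherbourg, Queenstown
port_type = CategoricalDtype(categories=["S", "C", "Q"], ordered=True)

# hypothetical values for illustration
ports = pd.Series(["Q", "S", "C", "S"]).astype(port_type)

# sorting follows the declared category order, not the alphabet
print(ports.sort_values().tolist())

# comparisons and min/max are defined for ordered categoricals
print((ports > "S").tolist())
print(ports.min(), ports.max())
```

Frequency tables and plots also list the categories in the declared order, which makes cumulative frequencies meaningful.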

## 3. Checking variable distributions

It's a good idea to start with the most important variables: the target (*survived*) and the ones we expect to be highly informative about the target while being (almost) complete (*sex*, *pclass*, *fare*, *embarked*). Then we move on to variables that are more complicated or need fixing (*age*).

For each of the six variables above, try to do the following:

* Compute descriptive statistics of the distribution and draw a suitable graph.
* Consider whether the distribution is as expected and seems plausible (no strange or obviously invalid values).
* If the variable has missing values, try to figure out the reasons for them and suggest a fix, if necessary.

In [None]:
# Example: embarked
# frequency table
freqtab = df.groupby("embarked").agg(count=("passenger_id", "count")) # absolute frequencies (counts)
freqtab["count_cum"] = freqtab["count"].cumsum() # cumulative frequencies
freqtab["count_rel"] = freqtab["count"] / sum(freqtab["count"]) # relative frequencies
freqtab["count_relcum"] = freqtab["count_rel"].cumsum() # cumulative relative frequencies
print(freqtab)

# graph
g = sns.displot(data=df, y="embarked", stat="proportion") # relative frequencies directly from DataFrame

# for stacked barplot, we use frequency table computed above
g = sns.displot(data=freqtab.assign(hlp=""),
                x="hlp", hue="embarked", multiple="stack", weights="count_rel")
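
For a numeric variable such as *fare* or *age*, the counterpart of a frequency table is a set of summary statistics, optionally followed by a binned table. The sketch below uses made-up fares and arbitrarily chosen bin edges; both are assumptions for illustration, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical fares, including one missing value and one extreme value
fare = pd.Series([7.25, 8.05, 13.0, 26.0, 71.28, 512.33, np.nan], name="fare")

# count, mean, std, min, quartiles, max (missing values are skipped)
summary = fare.describe()
print(summary)

# a mean far above the median signals a right-skewed distribution
print(fare.mean() > fare.median())

# binned frequency table; the bin edges are an arbitrary choice
bins = [0, 10, 50, 100, 1000]
binned = pd.cut(fare, bins=bins).value_counts(sort=False)
print(binned)
```

On the real data, a histogram (for example `sns.displot(data=df, x="fare")`) would be the matching graph.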

In [None]:
# one missing value - attempt to fix
# who has an empty *embarked*?
print(df[df["embarked"].isna()])
# did she share the ticket or the cabin with anyone else?
print(df[df["ticket"]=="113572"])
print(df[df["cabin"]=="B28"])
# unfortunately, no one else travelled on that ticket or in that cabin

# we can try to estimate the place of embarkation from fare, class and the first letter of the cabin
df[(df["pclass"]==1) & (df["cabin"].str.slice(stop=1) == "B")]
# there is a large share of Cherbourg embarkations among females with a similar fare and "B" cabins
# so we can fill in "C" for the case with the missing *embarked* value
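
The reasoning above (find similar passengers, take their most frequent port, fill the gap) can be sketched in code. The `toy` frame is made up, and choosing the most frequent value of a comparison group is just one possible imputation rule, not necessarily the best one for the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one first-class "B" cabin passenger lacks a port
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 1, 3],
    "cabin": ["B28", "B35", "B5", "B49", None],
    "embarked": [np.nan, "C", "C", "S", "Q"],
})

# comparison group: first class with a cabin on deck "B"
similar = toy[(toy["pclass"] == 1) & (toy["cabin"].str.slice(stop=1) == "B")]

# shares of ports within the group (missing values are ignored)
shares = similar["embarked"].value_counts(normalize=True)
print(shares)

# fill the gap with the most frequent port of the comparison group
best_guess = shares.idxmax()
toy["embarked"] = toy["embarked"].fillna(best_guess)
print(toy)
```

If *embarked* has already been converted to a categorical type, `fillna` works the same way as long as the filled value is one of the declared categories.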

## 4. Analysis of relationships

The last part of this practice section is to analyze relationships between variables. Check how *survived* is related to each of the five remaining variables considered in the previous part (*sex*, *pclass*, *fare*, *embarked*, *age*).

In [None]:
# Example: survival by class
print(df.groupby("pclass").agg(surv_class=("survived", "mean")))
g = sns.catplot(data=df, x="pclass", y="survived", kind="bar", errorbar=None)
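
The same idea extends to two explanatory variables at once. The sketch below uses a made-up `toy` frame, so the numbers say nothing about the real passengers. It shows a pivot table of survival rates and a row-normalized contingency table.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "female", "male"],
    "pclass": [1, 3, 1, 3, 1, 3],
    "survived": [1, 0, 0, 0, 1, 1],
})

# survival rate for every combination of sex and class
rates = toy.pivot_table(index="sex", columns="pclass",
                        values="survived", aggfunc="mean")
print(rates)

# row-normalized contingency table: shares of survived (1) and not survived (0) within each sex
ct = pd.crosstab(toy["sex"], toy["survived"], normalize="index")
print(ct)
```

For a numeric variable such as *fare* or *age*, binning it first (for instance with `pd.cut`) lets the same tables be used.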