# Data understanding

We will analyze the *titanic* dataset:

* to find out what information we have (statistical units, variables)
* to check the quality and reliability of the data
* to understand the distributions of the variables and their relationships
* to suggest steps for data cleaning
* to suggest useful data transformations

## 0. What is our goal?

Data analysis follows from the goal set in the **business understanding** phase, so we state that goal first:

> We analyse Titanic data to find out how survival for each passenger can be predicted from his or her attributes.

Let's start by loading the data and taking a quick look at it.

In [None]:
### Setup
%matplotlib inline
# render plots inline, without an explicit call to plt.show()

# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# classes for special types
from pandas.api.types import CategoricalDtype

# Apply the default theme
sns.set_theme()

# Reading and inspecting data
df = pd.read_csv("titanic_train.csv")
df

## 1. Basic overview of the data

1. Rows: How many? What are the statistical units? How can a unit be identified?
2. Columns: How many? What are their names, types, and meanings? At first glance, do the values seem plausible? Are all of them useful for our purpose?

Summary: do we need to carry out any initial transformations (e.g. take a sample of rows or columns, convert column names to lowercase, add an ID column, remove some columns)?
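
The questions above can be answered with a few pandas calls. The sketch below is only an illustration: it uses a tiny made-up table (`toy`) instead of `titanic_train.csv`, and the column names in it are assumptions, so they should be replaced with whatever the loaded `df` actually contains.

```python
import pandas as pd

# Hypothetical miniature stand-in for the real data; all values are made up.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0],
})

# Rows and columns: how many?
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# Column names and types
print(toy.dtypes)

# Can a unit be identified? A usable ID column is unique and has no gaps.
id_is_key = toy["PassengerId"].is_unique and toy["PassengerId"].notna().all()
print("ID identifies rows:", id_is_key)

# One possible initial transformation: lowercase column names.
toy = toy.rename(columns=str.lower)
print(list(toy.columns))
```

The same calls work on the full `df`; `df.info()` gives a similar overview in one step.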

## 2. Checking the data quality

* Are there any duplicated rows (ignoring the ID column)?
* What are the counts and shares of missing values in each column?
* Are the counts of missing values expected and acceptable?
* Are any columns or rows (almost) empty, so that they can be removed as useless?
* In which columns should we consider fixing values (correcting or filling them)?

After all these checks we can summarize the data quality and recommend preprocessing steps (cleaning, fixing). Some of them can be done immediately if they are necessary or useful for the analysis.
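
A minimal sketch of these checks, again on a made-up table rather than the real data. The column names and the 30 % threshold for "almost empty" are assumptions chosen only for the illustration.

```python
import pandas as pd

# Hypothetical miniature table; all values are made up.
toy = pd.DataFrame({
    "passengerid": [1, 2, 3, 4, 5],
    "pclass": [3, 1, 3, 3, 2],
    "sex": ["male", "female", "male", "male", "female"],
    "age": [22.0, 38.0, 30.0, 30.0, None],
    "cabin": [None, "C85", "E46", "E46", None],
})

# Duplicated rows, ignoring the ID column
n_duplicates = toy.drop(columns="passengerid").duplicated().sum()
print("duplicated rows:", n_duplicates)

# Counts and shares of missing values per column
missing = pd.DataFrame({
    "count": toy.isna().sum(),
    "share": toy.isna().mean(),
})
print(missing)

# Columns that are "almost empty" under an arbitrary, illustrative threshold
threshold = 0.3
mostly_empty = missing.index[missing["share"] > threshold].tolist()
print("mostly empty columns:", mostly_empty)
```

What counts as an acceptable share of missing values depends on the variable and on the goal, so the threshold should be chosen per dataset rather than copied from here.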

## 3. Checking variable distributions

It's a good idea to start with the most important variables: the target (*survived*) and those we expect to be highly informative about the target while being complete (*sex*, *pclass*, *fare*, *embarked*). Then we move on to variables that are more complicated or need fixing (*age*).

For each of the six variables above, try to do the following:

* Compute descriptive statistics of the distribution and draw a suitable graph.
* Consider whether the distribution is as expected and seems plausible (no strange or obviously invalid values).
* If the variable has missing values, try to figure out the reasons and suggest a fix, if necessary.
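
As a rough sketch of the first two steps, the example below summarizes one categorical, one ordinal, and one numeric variable on a made-up table. The values and the plausibility bounds for *age* (0 to 100) are assumptions for illustration only.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical miniature table; all values are made up.
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0, 0, 1],
    "pclass": [3, 1, 3, 2, 3, 1],
    "age": [22.0, 38.0, None, 35.0, 54.0, 4.0],
})

# Categorical target: absolute counts and shares
survived_counts = toy["survived"].value_counts()
survived_shares = toy["survived"].value_counts(normalize=True)
print(survived_counts)
print(survived_shares)

# Ordinal variable: declare the category order, then count in that order
pclass_type = CategoricalDtype(categories=[1, 2, 3], ordered=True)
toy["pclass"] = toy["pclass"].astype(pclass_type)
pclass_counts = toy["pclass"].value_counts().sort_index()
print(pclass_counts)

# Numeric variable: describe() skips missing values
age_stats = toy["age"].describe()
print(age_stats)

# Plausibility check with assumed bounds
implausible = toy[(toy["age"] < 0) | (toy["age"] > 100)]
print("implausible ages:", len(implausible))
```

For the graph, `sns.countplot(data=toy, x="pclass")` suits categorical variables and `sns.histplot(data=toy, x="age")` suits numeric ones.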

## 4. Analysis of relationships

The last part of this practice section is to analyze relationships between variables. Check how *survival* is related to each of the five remaining variables considered in the previous part (*sex*, *pclass*, *fare*, *embarked*, *age*).
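
One possible way to approach this, sketched on a made-up table: compare survival rates across the levels of a categorical variable, and compare a summary of a numeric variable between survivors and non-survivors. The data below are invented, so the resulting numbers say nothing about the real passengers.

```python
import pandas as pd

# Hypothetical miniature table; all values are made up.
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0, 0, 1, 1, 0],
    "sex": ["male", "female", "female", "male",
            "male", "female", "male", "female"],
    "pclass": [3, 1, 3, 2, 3, 1, 1, 3],
    "fare": [7.25, 71.28, 7.92, 13.0, 8.05, 53.1, 26.55, 7.75],
})

# Categorical predictor: the mean of a 0/1 target is the survival rate
rate_by_sex = toy.groupby("sex")["survived"].mean()
print(rate_by_sex)

# Contingency table with row-wise shares (each row sums to 1)
pclass_table = pd.crosstab(toy["pclass"], toy["survived"], normalize="index")
print(pclass_table)

# Numeric predictor: compare its typical value between the two outcomes
fare_by_survival = toy.groupby("survived")["fare"].median()
print(fare_by_survival)
```

The same pattern applies to *embarked* (as a categorical variable) and *age* (as a numeric one). Plots such as `sns.barplot(data=toy, x="sex", y="survived")` or `sns.boxplot(data=toy, x="survived", y="fare")` show the same comparisons graphically.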