# Exploratory Data Analysis with Pandas
Now that we have the fundamental knowledge about Python, and in particular Pandas, we can start to do some exploratory data analysis (EDA). EDA is the first step in the data science process, and it is essential for understanding the data we are working with. Data scientists use EDA to understand the data, identify patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.


In this notebook, we will see how to load data and do an initial exploration of it, using the Titanic dataset. In 1912, during its maiden voyage, the RMS Titanic, widely considered "unsinkable", sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
With this analysis, we will be able to answer some questions like:
- What is the distribution of the passengers by age?
- Who were the passengers in the Titanic?
- What deck were the passengers on and how does that relate to their class?
- Where did the passengers come from?
- Who was alone and who was with family?
- What factors helped someone survive the sinking?
- Did the deck have an effect on the passengers survival rate?
- Did having a family member increase the odds of surviving the crash?
- etc.

References:
- https://www.kaggle.com/learn/pandas
- Navlani, A., Fandango, A., Idris, I. (2021). Python Data Analysis: Perform data collection, data processing, wrangling, visualization, and model building using Python. Packt. 3rd Edition
- Brandt, S. (2014). Data Analysis: Statistical and Computational Methods for Scientists and Engineers. Springer. 4th Edition
- https://eugenelohh.medium.com/data-analysis-on-the-titanic-dataset-using-python-7593633135f2
- https://medium.datadriveninvestor.com/hypothesis-testing-intuitively-explained-using-the-titanic-dataset-in-python-5afa1e580ba6

## Load the data
Let's start with the Titanic data set. This is a very famous data set that is often used to demonstrate data analysis and machine learning. It is very small, but it is a good place to start. The data set is available, e.g., on Kaggle (https://www.kaggle.com/datasets/vinicius150987/titanic3).

In [None]:
# load necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import pandas_bokeh
pandas_bokeh.output_notebook()

In [None]:
df = pd.read_excel('./data/titanic/Titanic.xls')
df.head()

Our data set has 1309 rows and 14 columns. Let's see what the columns and their data types are.

In [None]:
df.info()

The columns are:
- pclass: Passenger class (1 = 1st; 2 = 2nd; 3 = 3rd)
- survived: Survival (0 = No; 1 = Yes)
- name: Name
- sex: Sex  (male = Male; female = Female)
- age: Age  (in years)
- sibsp: Number of siblings/spouses aboard. The dataset defines family relations in this way:
    - Sibling = brother, sister, stepbrother, stepsister
    - Spouse = husband, wife (mistresses and fiancés were ignored)
- parch: Number of Parents/Children Aboard. The dataset defines family relations in this way:
    - Parent = mother, father
    - Child = daughter, son, stepdaughter, stepson
    Some children travelled only with a nanny, therefore parch=0 for them.
- ticket: Ticket Number
- fare: Passenger Fare
- cabin: Cabin Number
- embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
- boat: Lifeboat Number
- body: Body Number (if did not survive and body was recovered)
- home.dest: Home/Destination

The output of `info()` shows that some columns, such as age and body, have missing values (their non-null counts are below 1309). This is also visible when running the `describe` method. The `include='all'` argument adds the non-numerical columns to the summary, with the extra statistics unique (number of unique values), top (most frequent value), and freq (frequency of the most frequent value); the remaining statistics are self-explanatory.

In [None]:
df.describe(include='all')
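A more direct way to count missing values per column is to combine `isna()` with `sum()`. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not the Titanic file), so the numbers are only illustrative; the same two calls can be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration (not the Titanic file)
toy = pd.DataFrame({
    'age': [22.0, np.nan, 35.0, np.nan],
    'fare': [7.25, 71.28, 8.05, 53.10],
    'cabin': [None, 'C85', None, 'C123'],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Fraction of missing values in each column
print(toy.isna().mean())
```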

Let us add a new column to the data frame with the family size. This is the sum of the number of siblings/spouses and the number of parents/children, plus 1 for the passenger themselves.

In [None]:
df['family_size'] = df['sibsp'] + df['parch'] + 1

## Plotting the data (a first look)
We can plot the **histogram** of the numerical columns to see how the data is distributed. (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.hist.html)

In [None]:
# plot histograms of the numerical columns
_ = df.hist(figsize=(15, 10), bins=10)

<span style="color:red"> - what conclusion can you draw from the histogram of the age column? </span>
<span style="color:red"> - what conclusion can you draw from the histogram of the fare column? </span>

It is also useful to plot the bar charts of the categorical columns. We can do this using the value_counts() method from Pandas to get the count of unique values in each column and then plot the bar charts. (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.bar.html)

In [None]:
# get the categorical columns
cat_cols = df.select_dtypes(include=['object']).columns

# plot the bar charts of the categorical columns
for col in cat_cols:
    df_temp = df[col].value_counts()
    # plot only the columns with fewer than 30 unique values - this skips, e.g., the name, ticket and cabin columns
    if df_temp.size < 30:
        df_temp.plot(kind='bar', title=col, figsize=(30, 5))
        plt.show()

<span style="color:red"> - what conclusion can you draw from the bar charts? </span>

## Pivot tables and group by methods

Pivot tables and group by methods are very useful tools to summarize data.
### Pivot table
The pivot_table method summarizes data by grouping it by one or more columns and applying an aggregation function to the other columns. It is very similar to the groupby method. Its arguments include:
- index: column(s) to group by on the rows
- columns: column(s) to group by on the columns
- values: column(s) to apply the aggregation function on
- aggfunc: aggregation function to apply (default is mean). Other possible values include count, sum, min, max, std, var, median, first, last, nunique, and size.
- fill_value: value used to replace missing values in the resulting table (default is None, which keeps them as NaN).
- margins: add row/column subtotals and grand total (default is False)
- margins_name: name of the row/column subtotals and grand total (default is 'All')
- dropna: do not include columns whose entries are all NaN (default is True)
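The effect of the `aggfunc` and `fill_value` arguments can be sketched on a small made-up frame (`toy` and its values are hypothetical, chosen only so that one sex/class combination has no rows):

```python
import pandas as pd

# Hypothetical toy data: there is no female passenger in class 3
toy = pd.DataFrame({
    'sex': ['female', 'female', 'male', 'male', 'male'],
    'pclass': [1, 1, 1, 3, 3],
    'fare': [80.0, 100.0, 50.0, 8.0, 10.0],
})

# With no aggfunc given, the mean is used; the empty (female, 3) cell is NaN
pt_mean = toy.pivot_table(index='sex', columns='pclass', values='fare')

# Counting rows instead, and replacing empty cells by 0
pt_count = toy.pivot_table(index='sex', columns='pclass', values='fare',
                           aggfunc='count', fill_value=0)

print(pt_mean)
print(pt_count)
```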

For example, if you want to summarize the number of survived passengers into a pivot table with sex and pclass as the index and columns, respectively, you can do the call below. In this case, the aggfunc is sum: survived is a 0/1 column (1 = survived, 0 = did not survive), so summing it counts the survivors.

In [None]:
pt = df.pivot_table(index='sex', values='survived', columns='pclass', aggfunc='sum', margins=True, margins_name='Total')
pt

<span style="color:red"> What was the total number of passengers that survived? </span>

<span style="color:red"> How many passengers from class 1 survived? </span>

<span style="color:red"> How many female passengers survived? </span>

<span style="color:red"> Who survived in greater numbers: females or males? </span>

To present the pivot table with percentages we can divide it by the total number of passengers that survived.

In [None]:
number_of_survived = df['survived'].sum()
pt / number_of_survived
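As a quick sanity check of this normalization, the sketch below repeats it on a made-up frame (`toy` is hypothetical) without margins, so that the resulting shares add up to 1:

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    'sex': ['female', 'female', 'male', 'male', 'female'],
    'pclass': [1, 2, 1, 2, 1],
    'survived': [1, 1, 0, 1, 1],
})

# Number of survivors per sex and class (no Total row/column here)
counts = toy.pivot_table(index='sex', columns='pclass',
                         values='survived', aggfunc='sum')

# Share of all survivors that falls into each cell
shares = counts / toy['survived'].sum()
print(shares)
print(shares.to_numpy().sum())
```

When the table includes margins, as `pt` does, the Total row and column are divided as well, so the bottom-right cell becomes 1.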

<span style="color:red"> In a similar manner, how can you summarize the passengers that did not survive? </span>

The pivot table can be plotted using a bar plot showing the number of survived passengers by gender and passenger class. The drop method is used to drop the Total row and column from the pivot table.

Besides, we are also using __pandas Bokeh__ to plot the pivot table. This is a wrapper around the bokeh library that makes it possible to plot pandas data frames and series. The plot_bokeh method takes approximately the same arguments as the pandas plot method. It returns a bokeh figure object that can be used to customize the plot. For example, we can change the title of the plot and the labels of the axes.

In [None]:
_ = pt.drop('Total', axis=1).drop('Total', axis=0).plot_bokeh(kind='bar', title='Survived passengers per gender and class')

The stacked bar plot is useful to more easily see the proportion of survived passengers within each gender and passenger class.

In [None]:
# pt.drop('Total', axis=1).drop('Total', axis=0).plot(kind='bar', stacked=True)
_ = pt.drop('Total', axis=1).drop('Total', axis=0).plot_bokeh(kind='bar', stacked=True, title='Survived passengers per gender and class')


The transpose of the pivot table allows for a different visualization (note the ".T"). Now, each bar is associated with a class.

In [None]:
pt.T

And the corresponding bar plot is:

In [None]:
# pt.drop('Total', axis=1).drop('Total', axis=0).T.plot(kind='bar', stacked=True)
_ = pt.drop('Total', axis=1).drop('Total', axis=0).T.plot_bokeh(kind='bar', stacked=True)

Pie plots are another interesting option.

In [None]:
#pt.drop('Total', axis=1).drop('Total', axis=0).plot(kind='pie', subplots=True, figsize=(15, 10), autopct='%1.1f%%')
_ = pt.drop('Total', axis=1).drop('Total', axis=0).plot_bokeh(kind='pie')

In [None]:
_ = pt.T.drop('Total', axis=1).drop('Total', axis=0).plot_bokeh(kind='pie')

<span style="color:red"> In which class did a greater proportion of males survive compared to females? </span>

We can make a pivot table with multiple index levels. For example, we can index the rows by survived and sex, use pclass for the columns, and take ticket as the values (we will just be counting the tickets and, remember, the ticket column has no missing values).

In [None]:
pt_multi = df.pivot_table(index=['survived','sex'], values='ticket', columns='pclass', aggfunc='count', margins=True, margins_name='Total')
pt_multi

<span style="color:red"> how many passengers were in class 1?</span>
<span style="color:red"> how many female passengers did not survive? And how many males?</span>
<span style="color:red"> how many female passengers from class 1 did not survive? And how many males?</span>

And the pie plot is given by

In [None]:
_ = pt_multi.drop('Total', axis=1).drop('Total', axis=0).plot_bokeh(kind='pie', stacked=True)

<span style="color:red"> What is represented by the largest orange "arc"?</span>

### Group by method

The group by method is another way to summarize data. It is more flexible than the pivot table method, but somewhat harder to use. The groupby method splits the data into groups based on one or more columns, so that an aggregation function can then be applied to each group. Called on a data frame, it returns a DataFrameGroupBy object; selecting a single column from it (as done below) gives a SeriesGroupBy object. The aggregation function can be applied to all the columns or to a specific column. Some of its arguments are:
- by: column(s) to group by
- axis: axis to group by (default is 0)
- as_index: group by as index (default is True)
- sort: sort group keys (default is True)
- observed: only show observed values for categorical groupers (default is False)

For example, grouping by passengers class and gender and summing the survived column gives a similar result as the pivot table above.

In [None]:
grp = df.groupby(['sex', 'pclass'])['survived'].sum()
grp

To get the same layout as the pivot table, we can unstack the resulting Series, which moves the pclass level of its index to the columns.

In [None]:
grp = grp.unstack()
grp

The second example shown above can also be done using groupby and unstack.

In [None]:
df.groupby(['survived', 'sex', 'pclass'])['ticket'].count().unstack()
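The correspondence between the two approaches can be checked on a small made-up frame (`toy` is hypothetical). This is only a sketch for the simple case without margins; the Total row and column of a pivot table have no direct counterpart in the groupby result.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    'sex': ['female', 'female', 'male', 'male', 'male', 'female'],
    'pclass': [1, 3, 1, 3, 3, 1],
    'survived': [1, 0, 0, 1, 0, 1],
})

# Survivors per sex and class, computed in two ways
via_pivot = toy.pivot_table(index='sex', columns='pclass',
                            values='survived', aggfunc='sum')
via_groupby = toy.groupby(['sex', 'pclass'])['survived'].sum().unstack()

print(via_groupby)

# Both tables hold the same numbers in the same layout
same_values = (via_pivot.to_numpy() == via_groupby.to_numpy()).all()
print(same_values)
```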

### Exercises
<span style="color:red"> Find the minimum/maximum fare paid by each passenger class and gender.</span>

<span style="color:red"> Find the minimum/maximum fare paid by adult (age >= 18) passengers, by class and gender.</span>

<span style="color:red"> Add a level to the index of the previous pivot table dividing the results into adult and juvenile.</span>

In [None]:
df.pivot_table(index=['pclass', 'sex'], values='fare', aggfunc=['min', 'max'])

In [None]:
df[df['age'] >= 18].pivot_table(index=['pclass', 'sex'], values='fare', aggfunc=['min', 'max'])

In [None]:
df_with_age_group = df.copy()
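# na_action='ignore' keeps passengers with unknown age as NaN instead of labelling them 'juvenile'
df_with_age_group['age_group'] = df['age'].map(lambda x: 'adult' if x >= 18 else 'juvenile', na_action='ignore')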
df_with_age_group.pivot_table(index=['age_group','pclass', 'sex'], values='fare', aggfunc=['min', 'max'])

## Statistical analysis

Exploratory data analysis (EDA) is not only about visualizing the data, but also about understanding the data. This is done by performing statistical analysis on it.

### Types of variables

Data types are fundamental concepts in statistical analysis, being divided into the following main categories:
- **Nominal attributes** refer to variables that are categorized by names or labels. These variables have categorical, qualitative, and unordered values, such as brand names, product names, zip codes, gender, or marital status. The value of a nominal attribute can be represented by the symbol or name of an item. It is not meaningful to calculate the mean or median values for nominal attributes, but data analysts can calculate the mode, which is the value that appears most frequently.

- **Ordinal attributes** are variables that have names or labels with a meaningful order or ranking, but their exact magnitude is unknown. These attributes measure subjective qualities, which make them ideal for surveys that collect information on customer satisfaction, product ratings, and movie reviews. For example, customer satisfaction ratings may range from very dissatisfied to very satisfied, or the size of a drink may be classified as small, medium, or large. The median and mode are the only measures of central tendency that should be used for ordinal attributes, as the mean cannot be calculated due to their qualitative nature.

- **Numeric attributes** are variables that are quantitatively represented as either integer or real values. For example, the number of children in a family is a numeric attribute. The mean, median, and mode are all appropriate measures of central tendency for numeric attributes.

#### Discrete and continuous variables

Numeric variables can be further divided into two main categories:
- **discrete variables** are variables that can take on only a countable number of distinct values. For example, the number of children in a family is a discrete variable, as it can only take on the values 0, 1, 2, 3, etc.
- **continuous variables** are variables that can take on any value within an interval, so between any two values there are infinitely many others. For example, the height of a person is a continuous variable, as it can take on any real value within a range (e.g., anywhere between 1.70 and 1.71 meters).


### Measures of central tendency
#### Mean
The mean is the most common measure of central tendency. It is the sum of all values divided by the number of values. The mean is appropriate for numeric variables, both discrete and continuous. However, it is sensitive to outliers, so for strongly skewed data the median is often a more representative value. The mean is calculated using the following formula:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i$$
where $x_i$ is the $i$-th value of the variable and $n$ is the number of values.

For example, the mean of the age of the passengers is given by

In [None]:
df['age'].mean()


#### Median
The median is the middle value of a sorted list of values. If the number of values is even, the median is the average of the two middle values. The median is a good measure of central tendency for both discrete and continuous variables, and it is robust to outliers. The median is calculated using the following formula:
$$\text{median} = \begin{cases} \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2} & \text{if } n \text{ is even} \\ x_{\frac{n+1}{2}} & \text{if } n \text{ is odd} \end{cases}$$
where $x_i$ is the $i$-th value of the sorted list and $n$ is the number of values.

For example, the median of the age of the passengers is given by

In [None]:
df['age'].median()
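The difference in how the mean and the median react to an extreme value can be seen on a small made-up sample (the numbers below are hypothetical and not taken from the data set):

```python
import pandas as pd

# Two hypothetical samples that differ only in their largest value
fares = pd.Series([7.0, 8.0, 8.0, 10.0, 12.0])
fares_with_outlier = pd.Series([7.0, 8.0, 8.0, 10.0, 512.0])

# The mean is pulled towards the outlier, the median is not
print(fares.mean(), fares.median())
print(fares_with_outlier.mean(), fares_with_outlier.median())

# With an even number of values, the median averages the two middle ones
even_median = pd.Series([1, 2, 3, 4]).median()
print(even_median)
```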

Along with the median it is usual to define the percentile of a variable. The $p$-th percentile of a variable is the value $x_p$ such that $p$% of the values are less than or equal to $x_p$. For example, the 25th percentile of the age of the passengers is given by the value such that 25% of the values are less than or equal to it.

To get the 25th percentile of the age of the passengers we can use the `quantile` method with the parameter q=0.25.

In [None]:
print("25% of the passengers are younger than", df['age'].quantile(q=0.25), "years old.")

#### Mode
The mode is the value that appears most frequently in a list of values. The mode is a good measure of central tendency for nominal variables.

For example, the mode of the embarked column is given by

In [None]:
df['embarked'].mode()

The `value_counts` method can be used to count the number of times each value appears in a column and check if the mode is correct.

In [None]:
df['embarked'].value_counts()

### Measures of dispersion

#### Range
The range is the difference between the maximum and minimum values of a variable. It is simple to compute, but since it depends only on the two most extreme values it is very sensitive to outliers. The range is calculated using the following formula:
$$\text{range}(x) = x_{\text{max}} - x_{\text{min}}$$
where $x_{\text{max}}$ is the maximum value of the variable and $x_{\text{min}}$ is the minimum value of the variable.

For example, the range of the age of the passengers is given by

In [None]:
df['age'].max() - df['age'].min()

#### Variance and standard deviation
The variance is the average of the squared differences from the mean. The variance is a good measure of dispersion for continuous variables. The variance is calculated using the following formula:
$$\text{var}(x) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
where $x_i$ is the $i$-th value of the variable, $\bar{x}$ is the mean of the variable, and $n$ is the number of values. Note that the pandas `var` method computes, by default, the sample variance, which divides by $n-1$ instead of $n$ (parameter `ddof=1`).

For example, the variance of the age of the passengers is given by

In [None]:
df['age'].var()
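The pandas `var` method has a `ddof` parameter that controls the divisor: the sum of squared differences is divided by `n - ddof`, and the default is `ddof=1`. The effect can be checked on a small made-up sample (the values are hypothetical):

```python
import pandas as pd

# Hypothetical sample with mean 5
x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

population_var = x.var(ddof=0)  # divides by n
sample_var = x.var()            # pandas default, ddof=1: divides by n - 1

# The 1/n formula written out by hand
manual_var = ((x - x.mean()) ** 2).sum() / len(x)

print(population_var, sample_var, manual_var)
```

For a data set with more than a thousand rows the two versions differ very little, but the distinction matters for small samples.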

On the other hand, the standard deviation is the square root of the variance. The standard deviation is also a good measure of dispersion for continuous variables. The standard deviation is calculated using the following formula:
$$\text{std}(x) = \sqrt{\text{var}(x)}.$$

The standard deviation is one of the most common measures of dispersion, and it is expressed in the same units as the variable.

For example, the standard deviation of the age of the passengers is given by

In [None]:
df['age'].std()

#### Interquartile range (IQR)
The interquartile range is the difference between the 75th and 25th percentiles of a variable. The interquartile range is a good measure of dispersion for continuous variables. The interquartile range is calculated using the following formula:
$$\text{IQR}(x) = x_{75} - x_{25}$$
where $x_{75}$ is the 75th percentile of the variable and $x_{25}$ is the 25th percentile of the variable.

For example, the interquartile range of the age of the passengers is given by

In [None]:
df['age'].quantile(0.75) - df['age'].quantile(0.25)
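Because the interquartile range ignores the lowest and highest quarters of the values, it reacts much less to a single extreme value than the range does. A minimal sketch on made-up ages (hypothetical values):

```python
import pandas as pd

# Two hypothetical samples that differ only in their largest value
ages = pd.Series([20, 22, 25, 28, 30, 31, 35, 40, 45])
ages_with_outlier = pd.Series([20, 22, 25, 28, 30, 31, 35, 40, 80])

def value_range(s):
    """Difference between the largest and the smallest value."""
    return s.max() - s.min()

def iqr(s):
    """Difference between the 75th and the 25th percentiles."""
    return s.quantile(0.75) - s.quantile(0.25)

print(value_range(ages), iqr(ages))
print(value_range(ages_with_outlier), iqr(ages_with_outlier))
```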

### Measures of skewness and kurtosis

#### Skewness

Skewness measures the asymmetry of a distribution. A distribution is symmetric if it looks the same to the left and right of the center point. A distribution is skewed if it is longer in one tail than the other. The skewness of a distribution is:
  - **positive if the tail on the right side of the distribution is longer** (that is, outliers are skewed to the right and data stacked up on the left) and
  - **negative if the tail on the left side of the distribution is longer**.
  - The skewness of a distribution is **zero** if the tails on both sides of the distribution are the same length.

Furthermore, positive skewness typically occurs when the mean is greater than the median and the mode, and negative skewness when the mean is less than the median and the mode.
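The sign convention can be illustrated with three small made-up samples (hypothetical values; `left_tailed` is simply the mirror image of `right_tailed`):

```python
import pandas as pd

# Hypothetical samples
right_tailed = pd.Series([1, 1, 2, 2, 3, 3, 4, 20])  # one large value on the right
left_tailed = 21 - right_tailed                      # mirrored: long tail on the left
symmetric = pd.Series([1, 2, 3, 4, 5])

for name, s in [('right_tailed', right_tailed),
                ('left_tailed', left_tailed),
                ('symmetric', symmetric)]:
    # skew() is positive, negative and zero, respectively
    print(name, s.skew(), s.mean(), s.median())
```

In the right-tailed sample the mean lies above the median, and in the left-tailed sample below it.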

Let us calculate the skewness of the numeric attributes of the titanic dataset.

In [None]:
numeric_attributes = df.select_dtypes(include=['int64', 'float64'])

df[numeric_attributes.columns].skew()

Without looking at the histogram plot:
<span style="color:red"> what conclusions can you draw about the distribution of the pclass?</span>
<span style="color:red"> what conclusions can you draw about the distribution of the fare?</span>
<span style="color:red"> what conclusions can you draw about the distribution of the age?</span>

<span style="color:red"> Replot the histogram of the pclass, fare and age attributes and comment on the skewness of the distributions.</span>

In [None]:
# TODO


#### Kurtosis
Kurtosis measures the tail heaviness of a distribution, i.e., whether the tails are heavy or light relative to a normal distribution. The value computed by pandas is the excess kurtosis, which is zero for a normal distribution: it is positive if the tails are heavier than those of a normal distribution and negative if they are lighter.
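To get a feeling for these values, the sketch below draws random samples from three well-known distributions and computes their kurtosis with the pandas `kurt` method. The choice of distributions, the sample size and the seed are assumptions made only for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible
n = 100_000

samples = pd.DataFrame({
    'uniform': rng.uniform(-1, 1, n),  # lighter tails than a normal distribution
    'normal': rng.normal(0, 1, n),     # the reference distribution
    'laplace': rng.laplace(0, 1, n),   # heavier tails than a normal distribution
})

# Negative for the uniform, close to zero for the normal, positive for the Laplace sample
excess_kurtosis = samples.kurt()
print(excess_kurtosis)
```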

Let us calculate the kurtosis of the numeric attributes of the titanic dataset.


In [None]:
df[numeric_attributes.columns].kurt()

In [None]:
def plot_histogram_and_normal_dist(df, column, bins=10):
    """Overlay the histogram of `column` with that of a normal sample with the same mean and std."""
    mean = df[column].mean()
    std = df[column].std()

    # random sample from a normal distribution with the same mean and standard deviation
    normal_dist = np.random.normal(mean, std, 10000)

    title = (f'Histogram of a normal distribution and "{column}", which has '
             f'a skewness of {df[column].skew():.2f} and a kurtosis of {df[column].kurt():.2f}')
    df[column].plot(kind='hist', figsize=(15, 5), bins=bins, density=True, alpha=0.5, color='red', title=title)
    plt.hist(normal_dist, bins=2*bins, density=True, alpha=0.5, color='blue')

In [None]:
plot_histogram_and_normal_dist(df, 'age', bins=20)

In [None]:
plot_histogram_and_normal_dist(df, 'fare', bins=20)

### Understanding relationships between variables
Measuring the relationship between two variables is important in order to understand the data. There are several ways to measure it. The covariance and the correlation coefficient are two of the most common.

#### Covariance

The covariance is a measure of the joint variability of two random variables. It shows the degree to which two variables change together, i.e., whether the two variables tend to increase together or decrease together and by how much.

The covariance is calculated using the following formula:
$$\text{cov}(x,y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$
where $x_i$ is the $i$-th value of the variable $x$, $\bar{x}$ is the mean of the variable $x$, $y_i$ is the $i$-th value of the variable $y$, $\bar{y}$ is the mean of the variable $y$, and $n$ is the number of values. As with the variance, the pandas `cov` method divides by $n-1$ instead of $n$ by default.

The covariance varies between -$\infty$ and $\infty$. The covariance is positive if the two variables tend to increase together, and negative if one variable tends to increase as the other decreases. If the two variables are independent, the covariance is zero (the converse does not hold in general).

The problem with the covariance is that it is not normalized and therefore difficult to interpret: its magnitude depends on the units and scales of the variables, so the covariance of two variables is not comparable to the covariance of two other variables. For example, the covariance of the fare and the age of the passengers is 143.3, while the covariance of the fare and the sibsp of the passengers is 8.64. The first is much higher than the second. However, the age and the fare of the passengers are not more strongly related than, e.g., the age and the pclass of the passengers, as we can see next from the correlation matrix.


In [None]:
# numeric_only=True skips the text columns (required in pandas >= 2.0)
df.cov(numeric_only=True)


#### Correlation matrix
A correlation matrix shows the correlation between the numerical columns. The correlation coefficient ranges from -1 to 1. A value of 1 means that there is a perfect positive linear correlation between the two columns, a value of -1 means that there is a perfect negative linear correlation between the two columns, and a value of 0 means that there is no linear correlation between the two columns. The correlation matrix is symmetric, so it is enough to look at the upper (or lower) triangle of the matrix.

The correlation between two variables is calculated using the following formula:
$$\text{corr}(x,y) = \frac{\text{cov}(x,y)}{\sigma_x \sigma_y}$$
where $\sigma_x$ is the standard deviation of the variable $x$ and $\sigma_y$ is the standard deviation of the variable $y$. The correlation coefficient is normalized, so it is comparable between different variables.
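The effect of the normalization can be checked on a small made-up example (the values of `x` and `y` are hypothetical): rescaling one variable, e.g. by changing its units, multiplies the covariance but leaves the correlation unchanged.

```python
import pandas as pd

# Hypothetical paired observations
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 1.0, 4.0, 3.0, 5.0])

cov_xy = x.cov(y)
corr_xy = x.corr(y)

# The correlation is the covariance divided by both standard deviations
manual_corr = cov_xy / (x.std() * y.std())

# Rescaling x by a factor of 100 (e.g. a change of units)
cov_scaled = (100 * x).cov(y)
corr_scaled = (100 * x).corr(y)

print(cov_xy, cov_scaled)    # the covariance grows by the same factor
print(corr_xy, corr_scaled)  # the correlation stays the same
```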

In [None]:
df.drop('body', axis=1).corr(numeric_only=True)

Plotting the correlation matrix makes the correlations between the numerical columns easier to see. The seaborn library provides a heatmap function that can be used for this.

In [None]:
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')

This can also be done using the style.background_gradient() method from Pandas.

<span style="color:red"> - what conclusion can you draw from the correlation matrix? E.g., was the correlation between age and survived to be expected or...?</span>

<span style="color:red"> - Would you be expecting a "high" correlation between age and body?</span>

<span style="color:red"> - Why is the correlation between survived and body `nan`?</span>

In [None]:
numeric_attributes = df.select_dtypes(include=['int64', 'float64']).columns
df[numeric_attributes].corr().style.background_gradient(cmap='coolwarm')

Another way to get a good idea of the correlation between the numerical columns is to make a scatter plot matrix. In this case, however, due to the discrete nature of the data, the scatter plot matrix is not very useful or easy to interpret. Further, the scatter plot does not show the density of the data (e.g., you cannot conclude from the scatter plot that the majority of the passengers were in the 3rd class, although the histogram of the pclass column shows that this is the case).

<span style="color:red"> - can you discern the "higher" correlations?   </span>

In [None]:
_ = pd.plotting.scatter_matrix(df, figsize=(15, 10))

#### Spearman's rank correlation

The Spearman's rank correlation is a nonparametric measure of the monotonicity of the relationship between two variables. It is the correlation coefficient defined above, computed on the ranks of the values instead of the values themselves:
$$\text{corr}(x,y) = \frac{\text{cov}(\text{rank}(x),\text{rank}(y))}{\sigma_{\text{rank}(x)} \sigma_{\text{rank}(y)}}$$
where $\text{rank}(x)$ is the rank of the variable $x$ and $\sigma_{\text{rank}(x)}$ is the standard deviation of the rank of the variable $x$.

For the Spearman's rank correlation, the variables do not need to be normally distributed. Because it is computed from ranks rather than from the raw values, it is much less sensitive to outliers than the correlation coefficient above, and it is unchanged by strictly increasing transformations of the variables (e.g., taking the logarithm of a positive variable).
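These properties can be verified on a made-up example in which `y` is a strictly increasing but strongly non-linear function of `x` (both series are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data: y grows exponentially with x
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.exp(x)

pearson = x.corr(y)                      # linear correlation, below 1 here
spearman = x.corr(y, method='spearman')  # rank correlation, equal to 1 here

# Same result when correlating the ranks directly
spearman_via_ranks = x.rank().corr(y.rank())

print(pearson, spearman, spearman_via_ranks)
```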

For example

In [1]:
df.corr(method='spearman', numeric_only=True).style.background_gradient(cmap='coolwarm')


# Exercises

[05_exercise_adult_part_1.ipynb](05_exercise_adult_part_1.ipynb)