# 01. Exploratory data analysis

http://patbaa.web.elte.hu/physdm/data/titanic.csv

On the link above you will find a dataset about the Titanic passengers. Your task is to explore the dataset.

Help for the columns:
 - `SibSp` - number of siblings/spouses on the ship
 - `Parch` - number of parents/children on the ship
 - `Cabin` - the cabin they slept in (if they had a cabin)
 - `Embarked` - harbour of entering the ship
 - `Pclass` - passenger class (like on trains)
 
#### Source and more information on the dataset
https://public.opendatasoft.com/explore/dataset/titanic-passengers/

In [None]:
import os
import datetime
import numpy as np
import pandas as pd
from tabulate import tabulate

import seaborn as sns
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
from mpl_toolkits.axes_grid1 import make_axes_locatable

from IPython.display import display

In [None]:
data = './data/'
out = './out/'

# Bold print for Jupyter Notebook
b1 = '\033[1m'
b0 = '\033[0m'

### Just some matplotlib and seaborn parameter tuning

In [None]:
axistitlesize = 20
axisticksize = 17
axislabelsize = 26
axislegendsize = 23
axistextsize = 20
axiscbarfontsize = 15

# Set axtick dimensions
major_size = 6
major_width = 1.2
minor_size = 3
minor_width = 1
mpl.rcParams['xtick.major.size'] = major_size
mpl.rcParams['xtick.major.width'] = major_width
mpl.rcParams['xtick.minor.size'] = minor_size
mpl.rcParams['xtick.minor.width'] = minor_width
mpl.rcParams['ytick.major.size'] = major_size
mpl.rcParams['ytick.major.width'] = major_width
mpl.rcParams['ytick.minor.size'] = minor_size
mpl.rcParams['ytick.minor.width'] = minor_width

mpl.rcParams.update({'figure.autolayout': False})

# Seaborn style settings
sns.set_style({'axes.axisbelow': True,
               'axes.edgecolor': '.8',
               'axes.facecolor': 'white',
               'axes.grid': True,
               'axes.labelcolor': '.15',
               'axes.spines.bottom': True,
               'axes.spines.left': True,
               'axes.spines.right': True,
               'axes.spines.top': True,
               'figure.facecolor': 'white',
               'font.family': ['sans-serif'],
               'font.sans-serif': ['Arial',
                'DejaVu Sans',
                'Liberation Sans',
                'Bitstream Vera Sans',
                'sans-serif'],
               'grid.color': '.8',
               'grid.linestyle': '--',
               'image.cmap': 'rocket',
               'lines.solid_capstyle': 'round',
               'patch.edgecolor': 'w',
               'patch.force_edgecolor': True,
               'text.color': '.15',
               'xtick.bottom': True,
               'xtick.color': '.15',
               'xtick.direction': 'in',
               'xtick.top': True,
               'ytick.color': '.15',
               'ytick.direction': 'in',
               'ytick.left': True,
               'ytick.right': True})

# Colorpalettes, colormaps, etc.
sns.set_palette(palette='rocket')

# To use with plots with `Pclass` on X-axis
class_palette = {
    1 : '#FFCC00',
    2 : 'gainsboro',
    3 : 'peru'
}
# To use with plots with `Survived` on X-axis
survival_palette = {
    0 : 'indianred',
    1 : 'cornflowerblue'
}
# To use with plots with `Embarked` on X-axis
embarked_palette = {
    'C' : 'tab:orange',
    'Q' : 'dodgerblue',
    'S' : 'forestgreen'
}
# The same, but for the map
embarked_map = {
    'C' : 'orange',
    'Q' : 'darkblue',
    'S' : 'darkgreen'
}
# To use when visualizing NaN counts
nan_color = 'firebrick'

# Set alpha on barplots
alpha = 0.9

#### Lookup Tables for Seaborn boxplots

In [None]:
# To enumerate classes on the ship
classes_lut = {
    '1' : '1st class',
    '2' : '2nd class',
    '3' : '3rd class' 
}

# Historically well-known locations
embark_lut = {
    'C' : 'Cherbourg',
    'Q' : 'Queenstown,\nIreland',
    'S' : 'Southampton'
}

## 1. Preprocessing

Load the above-linked csv file as a pandas dataframe. Check & plot if any of the columns has missing values. If they have, investigate if the missingness is random or not. 

Impute the missing values in a sensible way:
 - if only a very small percentage is missing, imputing with the column-wise mean makes sense, as does removing the rows with missing values
 - if almost all the entries in a row are missing, it is worth removing that row
 - if a larger portion is missing from a column, it is usually worth encoding that with a value that does not appear in the dataset (e.g. -1). 
 
The imputing method affects different machine learning models in different ways, but now we are interested only in EDA, so try to keep as much information as possible!
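
The three strategies listed above can be sketched on a small toy frame. The values, the `thresh=2` cutoff and the `'Unknown'` sentinel are assumptions made only for illustration; they are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values (not the Titanic data)
toy = pd.DataFrame({
    'Age':   [22.0, np.nan, 30.0, 40.0, np.nan],
    'Fare':  [7.25, 71.28, np.nan, 8.05, np.nan],
    'Cabin': ['C85', np.nan, np.nan, np.nan, np.nan],
})

# 1) Only a few values missing: fill them with the column-wise mean
mean_filled = toy['Age'].fillna(toy['Age'].mean())

# 2) Rows with almost everything missing: keep only rows
#    that have at least 2 known entries (the cutoff is an assumption)
rows_kept = toy.dropna(thresh=2)

# 3) Large share missing: encode with a sentinel value
#    that does not occur in the column
cabin_filled = toy['Cabin'].fillna('Unknown')
```

For a numeric column the sentinel could be `-1`, as the task description suggests; for a string column like `Cabin` a label such as `'Unknown'` plays the same role.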

### 1./a. Load in and describe dataset

In [None]:
os.listdir(data)

In [None]:
df = pd.read_csv(data + 'titanic.csv', sep=',')

In [None]:
display(df.head())
display(df.tail())

#### Some notes on this dataset

The exact number of passengers and crew members who sailed on the maiden voyage of the RMS Titanic is still a debated topic and varies slightly from source to source. Since a lot of bodies were lost to the sea forever and there are numerous cases of ought-to-be passengers with already redeemed tickets who simply couldn't embark onto the ship, we can only estimate the real number of passengers. According to the database maintained by the *Encyclopedia Titanica*[1], there were $324$ first, $284$ second, and $709$ third-class passengers embarked on the ship. Taking everything into account, the passengers numbered $1\,317$ people and were accompanied by approximately $890$ crew members, giving a total of $2\,207$ people on board. A total of $712$ people survived the disaster, while approximately $1\,503$ died during the events. The dataset used in this notebook encompasses data of $891$ passengers out of the total of $1\,317$.

---
#### Sources
[1] : ["Titanic Passengers and Crew Listings"](https://www.encyclopedia-titanica.org/manifest.php?q=1). *Encyclopedia Titanica*. Retrieved 12 September 2020.

### 1./b. Explore some columns in the dataset

#### Age distribution

In [None]:
# Collect non-NaN age values
age_col = df[~df['Age'].isna()]['Age']

In [None]:
nrows = 1
ncols = 1
fig, axes = plt.subplots(nrows, ncols, figsize=(ncols*15, nrows*9))

axes.hist(age_col, bins=(int(np.max(age_col)) - int(np.min(age_col))), density=True,
          color='tab:blue', alpha=alpha,
          ec='black', lw=0.5, ls='--')

axes.set_title('Fig. 1. Distribution of the age of passengers on Titanic',
               fontsize=axistitlesize, y=-0.20)
axes.set_xlabel('Age of passengers [ $years$ ]', fontsize=axislabelsize, fontweight='bold')
axes.set_ylabel('Probability density', fontsize=axislabelsize, fontweight='bold')
axes.tick_params(axis='both', which='major', labelsize=axisticksize, pad=8)

plt.show()

#### Place of embarkation

In [None]:
embark_col = df['Embarked']
embark_num = np.unique(embark_col[~embark_col.isna()], return_counts=True)
embark_types = [embark_lut[str(c)] for c in embark_num[0]]

In [None]:
nrows = 1
ncols = 1
fig, axes = plt.subplots(nrows, ncols, figsize=(ncols*14, nrows*6))

axes.bar(embark_types, embark_num[1], width=0.9,
         color=list(embarked_palette.values()), alpha=alpha,
         ec='black', lw=0.5, align='center')
# Write height value of bars over them for clarity
for i, v in enumerate(embark_num[1]):
    axes.text(x=i, y=(v+axistextsize), s=str(int(v)),
              ha='center', va='center',
              color='black',
              fontsize=axistextsize, fontweight='bold')
# Just for the over-bar number to look better
axes.set_ylim(None, 700)

axes.set_title('Fig. 2. Number of people by place of their embarkation',
               fontsize=axistitlesize, y=-0.28)
axes.set_xlabel('Place of embarkation', fontsize=axislabelsize, fontweight='bold')
axes.set_ylabel('Count of passengers', fontsize=axislabelsize, fontweight='bold')
axes.tick_params(axis='both', which='major', labelsize=axisticksize, pad=8)
    
plt.show()

#### Show them on a map

In [None]:
import folium
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent='titanic')

In [None]:
lat = []
long = []
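for c in list(embark_lut.values()):
    # The LUT entries may contain line breaks for plotting; remove them for the query
    loc = geolocator.geocode(c.replace('\n', ' '))
    lat.append(loc.latitude)
    long.append(loc.longitude)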

In [None]:
# Creating a Folium map object
# --------
# Center the map placed to the mean coordinates of the places of embarkation 
lat_c = np.mean(lat)
long_c = np.mean(long)
# Here I am using CartoDB's Positron map style for the folium map
T_map = folium.Map(location=[lat_c, long_c], width='100%', height='100%',
                   zoom_start=7, tiles='CartoDB Positron',
                   control_scale=True, scrollWheelZoom=False,
                  )

# Mark cities on map
for i in range(len(embark_lut)):
    marker = folium.Marker(location=[lat[i], long[i]],
                           icon=folium.Icon(icon='map-marker-alt', prefix='fa', color=list(embarked_map.values())[i]),
                           tooltip=list(embark_lut.values())[i],
                          )
    marker.add_to(T_map)

display(T_map)

#### Number of siblings/spouses

In [None]:
sibling_col = df['SibSp']
sibling_count = np.unique(sibling_col, return_counts=True)

In [None]:
nrows = 1
ncols = 1
fig, axes = plt.subplots(nrows, ncols, figsize=(ncols*14, nrows*6))

axes.bar(sibling_count[0], sibling_count[1], width=0.9,
         color='tab:blue', alpha=alpha,
         ec='black', lw=0.5, align='center')
# Write height value of bars over them for clarity
for i, (sct, n) in enumerate(zip(sibling_count[0], sibling_count[1])):
    axes.text(x=sct, y=(n+axistextsize), s=str(int(n)),
              ha='center', va='center',
              color='black',
              fontsize=axistextsize, fontweight='bold')
# Just for the over-bar number to look better
axes.set_ylim(None, 700)

# Format X ticks
axes.set_xticks(range(np.max(sibling_count[0]) + 1))
axes.set_xticklabels(range(np.max(sibling_count[0]) + 1))

axes.set_title('Fig. 3. Number of passengers with given number of siblings/spouses.\nThe figure shows that there aren\'t any passengers with\n$6$ or $7$ siblings/spouses.',
               fontsize=axistitlesize, y=-0.45)
axes.set_xlabel('Number of siblings/spouses', fontsize=axislabelsize, fontweight='bold')
axes.set_ylabel('Count of passengers', fontweight='bold', fontsize=axislabelsize)
axes.tick_params(axis='both', which='major', labelsize=axisticksize, pad=8)
    
plt.show()

In [None]:
parch_col = df['Parch']
parch_count = np.unique(parch_col, return_counts=True)

In [None]:
parch_count

In [None]:
nrows = 1
ncols = 1
fig, axes = plt.subplots(nrows, ncols, figsize=(ncols*14, nrows*6))

axes.bar(parch_count[0], parch_count[1], width=0.9,
         color='tab:blue', alpha=alpha,
         ec='black', lw=0.5, align='center')
# Write height value of bars over them for clarity
for i, (pct, n) in enumerate(zip(parch_count[0], parch_count[1])):
    axes.text(x=pct, y=(n+axistextsize), s=str(int(n)),
              ha='center', va='center',
              color='black',
              fontsize=axistextsize, fontweight='bold')
# Just for the over-bar number to look better
axes.set_ylim(None, 750)

# Format X ticks
axes.set_xticks(range(np.max(parch_count[0]) + 1))
axes.set_xticklabels(range(np.max(parch_count[0]) + 1))

axes.set_title('Fig. 4. Number of passengers with given number of parents/children',
               fontsize=axistitlesize, y=-0.28)
axes.set_xlabel('Number of parents/children', fontsize=axislabelsize, fontweight='bold')
axes.set_ylabel('Count of passengers', fontweight='bold', fontsize=axislabelsize)
axes.tick_params(axis='both', which='major', labelsize=axisticksize, pad=8)
    
plt.show()

#### Distribution of fare prices

Fare prices on the Titanic were not standardized and depended on numerous factors[2], like place of embarkation, size and location of the rented cabin, furnishing, provision, etc. This is the cause of the diverse spectrum of fares seen on the distributions below.

---
#### Sources
[2] : [Encyclopedia Titanica Forums](https://www.encyclopedia-titanica.org/discus/messages/5660/90776.html?1238981549). *Encyclopedia Titanica*. Retrieved 12 September 2020.

In [None]:
fare_col = df['Fare']

In [None]:
nrows = 1
ncols = 1
fig, axes = plt.subplots(nrows, ncols, figsize=(ncols*15, nrows*9))

axes.hist(fare_col, bins=50, density=True,
          color='tab:blue', alpha=alpha,
          ec='black', lw=0.5, ls='--')
axes.set_yscale('log')

axes.set_title('Fig. 5. Distribution of fare prices',
               fontsize=axistitlesize, y=-0.20)
axes.set_xlabel('Fare of passengers [ $GBP$ ]', fontsize=axislabelsize, fontweight='bold')
axes.set_ylabel('Probability density', fontsize=axislabelsize, fontweight='bold')
axes.tick_params(axis='both', which='major', labelsize=axisticksize, pad=8)

plt.show()

#### Fare prices on different classes

In [None]:
nrows = 1
ncols = 1
fig, axes = plt.subplots(nrows, ncols, figsize=(ncols*14, nrows*9))

sns.boxplot(x='Pclass', y='Fare', data=df,
            palette=class_palette,
            showfliers=False,
            ax=axes)

# Format X ticks
# First get the auto-generated labels, then simply translate them using the LUT
axes.set_xticklabels(list(map(
                                classes_lut.get, # This is the LUT
                                [i.get_text() for i in axes.xaxis.get_ticklabels()] # This is the auto-gen. X-ticks
                             )))

axes.set_title('Fig. 6. Fare price distribution by classes without outliers.\n' +
               'The boxes span the interquartile range (first to third quartile).',
               fontsize=axistitlesize, y=-0.28)
axes.tick_params(axis='both', which='major', labelsize=axisticksize, pad=10)
axes.set_xlabel('Rented class', fontsize=axislabelsize, fontweight='bold')
axes.set_ylabel('Fare price distribution [ $GBP$ ]', fontsize=axislabelsize, fontweight='bold')

plt.show()

### 1./c. Explore missing entries

In [None]:
# Create a mask to analyze missing entries easier
nan_mask = df.isna()

#### Number of missing values

In [None]:
nan_count = nan_mask.sum()

In [None]:
print('Count of missing values:\n' +
      '========================')
print(tabulate([[c, nan_count[c]] for c in nan_count.index], headers=['Feature', 'Count of NaNs']))

In [None]:
nrows = 1
ncols = 1
fig, axes = plt.subplots(nrows, ncols, figsize=(ncols*14, nrows*6))

axes.bar(nan_count.index, nan_count.values, width=0.9,
         color=nan_color, alpha=alpha,
         ec='black', lw=0.5, align='center')
# Write height value of bars over them for clarity
for i, v in enumerate(nan_count):
    if v != 0:
        axes.text(x=i, y=(v+axistextsize), s=str(v),
                  ha='center', va='center',
                  color='black',
                  fontsize=axistextsize, fontweight='bold')
# Just for the over-bar number to look better
axes.set_ylim(None, 750)
    
axes.set_title('Fig. 7. Number of missing values in the dataset by columns',
               fontsize=axistitlesize, y=-0.45)
axes.set_xlabel('Feature columns', fontsize=axislabelsize, fontweight='bold', labelpad=-20)
axes.set_ylabel('Count of NaNs', fontsize=axislabelsize, fontweight='bold')
axes.tick_params(axis='both', which='major', labelsize=axisticksize)
axes.tick_params(axis='x', which='major', rotation=42)

plt.show()

#### Correlation of missing entries

Since there are effectively only two columns with missing entries, I have to deal only with these in my analysis. The figure above shows that there are $177$ rows with missing `Age` entries and $687$ with missing `Cabin` entries in the dataset. To discover any correlation between the missingness in these columns, we can count how many rows are missing both of them and compare that count to what independent missingness would predict, which is about $177 \cdot 687 / 891 \approx 136$ rows. If noticeably more rows are missing both values, the missingness of entries in these columns is probably correlated.

In [None]:
# Count rows where only `Cabin`, only `Age`, or both entries are missing
nan_rows_cab = (nan_mask['Cabin'] & ~nan_mask['Age']).sum()
nan_rows_age = (nan_mask['Age'] & ~nan_mask['Cabin']).sum()
nan_rows_both = (nan_mask['Cabin'] & nan_mask['Age']).sum()

# Collecting all into lists (labels and counts in the same order)
corr_plot_x = ['Cabin only', 'Age only', 'Both']
corr_plot_y = [nan_rows_cab, nan_rows_age, nan_rows_both]

In [None]:
nrows = 1
ncols = 1
fig, axes = plt.subplots(nrows, ncols, figsize=(ncols*14, nrows*6))

axes.bar(corr_plot_x, corr_plot_y, width=0.9,
         color=nan_color, alpha=alpha,
         ec='black', lw=0.5, align='center')
# Write height value of bars over them for clarity
for i, v in enumerate(corr_plot_y):
    axes.text(x=i, y=(v+axistextsize), s=str(int(v)),
              ha='center', va='center',
              color='black',
              fontsize=axistextsize, fontweight='bold')
# Just for the over-bar number to look better
axes.set_ylim(None, 600)

axes.set_title('Fig. 8. Number of missing values in the `Cabin` and `Age` columns.',
               fontsize=axistitlesize, y=-0.28)
axes.set_xlabel('Position of NaN values', fontsize=axislabelsize, fontweight='bold')
axes.set_ylabel('Count of NaNs', fontsize=axislabelsize, fontweight='bold')
axes.tick_params(axis='both', which='major', labelsize=axisticksize, pad=8)
    
plt.show()

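Judging by these raw counts, I see no strong correlation between the missingness of values in the features `Age` and `Cabin`. The counts alone cannot settle the question though; a comparison with the count expected under independent missingness would be needed to confirm this.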
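
One way to make such a check quantitative is to compare the observed number of rows missing both values with the number expected if the two kinds of missingness were independent, and to compute the correlation of the two missingness indicators. The sketch below uses a toy frame with hypothetical values, not the Titanic data; the variable names are chosen here only for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values: NaNs in `Age` and `Cabin`
toy = pd.DataFrame({
    'Age':   [22.0, np.nan, 30.0, np.nan, 41.0, np.nan, 19.0, 35.0],
    'Cabin': ['C85', np.nan, np.nan, np.nan, 'E46', np.nan, np.nan, 'B20'],
})

age_missing = toy['Age'].isna()
cabin_missing = toy['Cabin'].isna()

# 2x2 contingency table of the two missingness indicators
table = pd.crosstab(age_missing, cabin_missing)

# Observed count of rows missing both values, and the count
# expected if the two kinds of missingness were independent
observed_both = int((age_missing & cabin_missing).sum())
expected_both = age_missing.sum() * cabin_missing.sum() / len(toy)

# Pearson correlation of the two 0/1 indicators (phi coefficient)
phi = age_missing.astype(int).corr(cabin_missing.astype(int))

print(table)
print(observed_both, expected_both, phi)
```

Applied to the real dataset, the same lines would use `nan_mask['Age']` and `nan_mask['Cabin']` in place of the toy indicators.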

## 2. Heatmap
Create a heatmap which shows how many people survived and died in the different Pclass categories. You need to create a table where the columns indicate if a person survived or not, the rows indicate the different Pclass values and the cell values contain the number of people belonging to that given category. The table should be colored based on the value of the cells in the table.

### 2./a. Exploring the `Pclass` column

In [None]:
classes_col = df['Pclass']
classes_count = np.unique(classes_col, return_counts=True)
classes_types = list(classes_lut.values())

In [None]:
nrows = 1
ncols = 1
fig, axes = plt.subplots(nrows, ncols, figsize=(ncols*14, nrows*6))

axes.bar(classes_types, classes_count[1], width=0.9,
         color=list(class_palette.values()), alpha=alpha,
         ec='black', lw=0.5, align='center')
# Write height value of bars over them for clarity
for i, v in enumerate(classes_count[1]):
    axes.text(x=i, y=(v+axistextsize), s=str(int(v)),
              ha='center', va='center',
              color='black',
              fontsize=axistextsize, fontweight='bold')
# Just for the over-bar number to look better
axes.set_ylim(None, 600)

axes.set_title('Fig. 9. Number of passengers on different classes',
               fontsize=axistitlesize, y=-0.28)
axes.set_xlabel('Rented class', fontsize=axislabelsize, fontweight='bold')
axes.set_ylabel('Count of passengers', fontweight='bold', fontsize=axislabelsize)
axes.tick_params(axis='both', which='major', labelsize=axisticksize, pad=8)
    
plt.show()

### 2./b. Creating heatmap

The task description asks us to create a data table with two columns (which indicates the number of survived and deceased people) and $N$ number of rows, where $N$ is the number of the different `Pclass` values. In this case, there are $3$ different classes on the ship, thus the final dimensions of the data table would be $3 \times 2$.
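
As a cross-check, a table of this shape can also be built in one step with `pd.crosstab`. This is only a sketch on a toy frame with hypothetical values; note that the column order here is (`Deceased`, `Survived`), which differs from the order used in the rest of this notebook.

```python
import pandas as pd

# Toy frame with hypothetical values (not the Titanic data)
toy = pd.DataFrame({
    'Pclass':   [1, 1, 2, 3, 3, 3, 2, 1],
    'Survived': [1, 0, 1, 0, 0, 1, 0, 1],
})

# Rows: classes, columns: survival flag, cells: number of passengers
counts = pd.crosstab(toy['Pclass'], toy['Survived'])
counts = counts.rename(columns={0: 'Deceased', 1: 'Survived'})

print(counts)
```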

In [None]:
# Group classes together and count the number of survived people
df_heatmap = df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=True).sum()
df_heatmap['Deceased'] = classes_count[1] - df_heatmap['Survived']

# Convert it to a matrix
heatmap = df_heatmap.values

In [None]:
fig, axes = plt.subplots(figsize=(12,12))

# Hide grid and render tiles as squares on the figure
axes.grid(False)
axes.set_aspect('equal')

im = axes.imshow(heatmap)#, cmap='viridis')
# Loop over data dimensions and create text annotations.
for x in range(heatmap.shape[0]):
    for y in range(heatmap.shape[1]):
        axes.text(y, x, heatmap[x, y], fontsize=30,
                  ha='center', va='center', color='white', fontweight='bold', 
                  bbox=dict(color=np.array((0,0,0,0.2)), lw=0)
                 )

# Format X and Y ticks to fit the 2D matrix plot
axes.set_xticks([0, 1])
axes.set_xticklabels(df_heatmap.columns)
axes.set_yticks([0, 1, 2])
axes.set_yticklabels(classes_types)

axes.set_title('Fig. 10. Heatmap of the number of survived and\ndeceased people on different classes',
               fontsize=axistitlesize, y=-0.16)
axes.tick_params(axis='both', which='major', labelsize=axisticksize, pad=10)

# Create an axis on the right side of `axes`. The width of `cax` will be 5%
# of `axes` and the padding between `cax` and axes will be fixed at 0.1 inch
divider = make_axes_locatable(axes)
cax = divider.append_axes('right', size='5%', pad=0.1)
cbar = plt.colorbar(mappable=im, cax=cax)
cbar.ax.tick_params(labelsize=axiscbarfontsize, colors='black')
        
plt.show()

### 2./c. Visualizing the data above on a stacked barplot

Just because that's much clearer and more meaningful in my opinion.

In [None]:
nrows = 1
ncols = 1
fig, axes = plt.subplots(nrows, ncols, figsize=(ncols*14, nrows*9))

colors = list(survival_palette.values())
ax1 = axes.bar(classes_types, df_heatmap['Survived'], width=0.9,
               color=colors[1], alpha=alpha,
               ec='white', label='Survived')
ax2 = axes.bar(classes_types, df_heatmap['Deceased'], bottom=df_heatmap['Survived'], width=0.9,
               color=colors[0], alpha=alpha,
               ec='white', label='Deceased')
# Write height of bars in their centers
# Clear code from : https://medium.com/@priteshbgohil/stacked-bar-chart-in-python-ddc0781f7d5f
for r1, r2 in zip(ax1, ax2):
    h1 = r1.get_height()
    h2 = r2.get_height()
    plt.text(r1.get_x() + r1.get_width() / 2., h1 / 2., "%d" % h1, fontsize=axistextsize,
             ha='center', va='center', fontweight='bold')
    plt.text(r2.get_x() + r2.get_width() / 2., h1 + h2 / 2., "%d" % h2, fontsize=axistextsize,
             ha='center', va='center', fontweight='bold')
    

axes.set_title('Fig. 11. Number of survived and deceased people by classes',
               fontsize=axistitlesize, y=-0.22)
axes.tick_params(axis='both', which='major', labelsize=axisticksize, pad=10)
axes.set_xlabel('Rented class', fontsize=axislabelsize, fontweight='bold')
axes.set_ylabel('Count of passengers', fontsize=axislabelsize, fontweight='bold')

axes.legend(fontsize=axislegendsize)

plt.show()

The survival rate seemingly correlates with the class of the passengers. This is hardly surprising, since there were numerous factors which evidently favored first-class travelers. First-class passengers were accommodated nearest to the deck of the ship, while people in the third class were placed much closer to the bottom of the ship. The first-class passengers also had priority when boarding the lifeboats.
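
Since the classes have different sizes, the comparison is easier to read as a survival rate per class than as raw counts. A minimal sketch on a toy frame with hypothetical values (not the Titanic data):

```python
import pandas as pd

# Toy frame with hypothetical values (not the Titanic data)
toy = pd.DataFrame({
    'Pclass':   [1, 1, 2, 3, 3, 3, 2, 1],
    'Survived': [1, 0, 1, 0, 0, 1, 0, 1],
})

# `Survived` is a 0/1 flag, so its mean within a class
# is the fraction of survivors in that class
survival_rate = toy.groupby('Pclass')['Survived'].mean()

print(survival_rate)
```

On the real data, `df.groupby('Pclass')['Survived'].mean()` would give the corresponding rates.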

## 3. Boxplots

Create boxplots for each different Pclass. The boxplot should show the age distribution for the given Pclass. Plot all of these next to each other in a row to make it easier to compare!

In [None]:
nrows = 1
ncols = 1
fig, axes = plt.subplots(nrows, ncols, figsize=(ncols*14, nrows*9))

sns.boxplot(x='Pclass', y='Age', data=df,
            palette=class_palette,
            showfliers=True,
            ax=axes)

# Format X ticks
# First get the auto-generated labels, then simply translate them using the LUT
axes.set_xticklabels(list(map(
                                classes_lut.get, # This is the LUT
                                [i.get_text() for i in axes.xaxis.get_ticklabels()] # This is the auto-gen. X-ticks
                             )))

axes.set_title('Fig. 12. Age distribution of people by classes.\n' +
               'The boxes span the interquartile range, while\n'+
                'outliers are also plotted outside of the whiskers.',
               fontsize=axistitlesize, y=-0.28)
axes.tick_params(axis='both', which='major', labelsize=axisticksize, pad=10)
axes.set_xlabel('Rented class', fontsize=axislabelsize, fontweight='bold')
axes.set_ylabel('Age distribution [ $years$ ]', fontsize=axislabelsize, fontweight='bold')

plt.show()
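
The elements drawn by such a boxplot can also be computed by hand, which helps when reading the figure. The sketch below assumes the default whisker rule of matplotlib and seaborn (whiskers reach to the most extreme data point within $1.5 \times IQR$ of the box) and uses hypothetical ages, not the Titanic data.

```python
import numpy as np

# Hypothetical ages (not the Titanic data)
ages = np.array([2, 18, 21, 24, 27, 30, 33, 38, 45, 80], dtype=float)

# The box: first quartile, median and third quartile
q1, median, q3 = np.percentile(ages, [25, 50, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR from the box edges
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences,
# everything outside is drawn as an outlier
whisker_low = ages[ages >= lower_fence].min()
whisker_high = ages[ages <= upper_fence].max()
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(q1, median, q3, whisker_low, whisker_high, outliers)
```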

## 4. Correlation matrix

Calculate the correlation matrix for the numerical columns. Show it also as a heatmap described at the 1st task. Which feature seems to play the most important role in surviving/not surviving? Explain how and why could that feature be important! 

#### Step 1. Handle NaN entries

Since `Age` is a numerical column, we should first replace the missing entries in it to allow the exploration of correlations between `Age` and other columns. The goal is to find the feature which has the highest impact on survival rate. To minimize the bias introduced by artificially adding new data, I'll sample ages randomly from the per-class age distributions shown on Fig. 12 above. This keeps the age distribution within each class the same as before and should not substantially alter the measurable correlation between `Age` and other variables. Or at least I hope so. I can't think of anything better currently (besides dropping all entries with NaN values, but that would heavily reduce the size of this already small dataset...).
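
The idea of sampling replacements from the observed values of the same class can be sketched on a toy frame as follows. The helper `fill_by_sampling`, the seed and the values are assumptions for illustration only; a seeded generator is used so that the imputed values are reproducible between runs.

```python
import numpy as np
import pandas as pd

# Seeded generator, so the sampled values are reproducible (seed is arbitrary)
rng = np.random.default_rng(seed=0)

# Toy frame with hypothetical values (not the Titanic data)
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3, 3],
    'Age':    [38.0, np.nan, 54.0, 22.0, np.nan, 26.0, np.nan],
})

def fill_by_sampling(ages):
    """Replace NaNs in one group by draws from the observed values of that group."""
    observed = ages.dropna().to_numpy()
    missing = ages.isna()
    filled = ages.copy()
    filled.loc[missing] = rng.choice(observed, size=int(missing.sum()))
    return filled

# Apply the helper separately within each class, keeping the original index
toy['Age_filled'] = toy.groupby('Pclass')['Age'].transform(fill_by_sampling)

print(toy)
```

Every imputed value is one of the ages already observed in the same class, so the set of possible ages per class does not change.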

In [None]:
# Non-NaN age values sorted by `Pclass` values
class_age = [df[(~df['Age'].isna()) & (df['Pclass'] == (i+1))]['Age'] for i in range(3)]
# NaN age values sorted by `Pclass` values
class_age_nan = [df[(df['Age'].isna()) & (df['Pclass'] == (i+1))]['Age'] for i in range(3)]

In [None]:
# Create a new DataFrame for the correlation analysis
df_no_nan = df.copy()

In [None]:
for i in range(3):
    # Number of NaN values where `Pclass` == i + 1
    n_nan = len(class_age_nan[i].values)
    # Replace the NaN entries in the corresponding slice of the pd.DataFrame `df_no_nan`,
    # while keeping the indices in this slice of the pd.DataFrame unchanged.
    # The values are sampled from the observed (non-NaN) `Age` values of the same class.
    df_no_nan.loc[class_age_nan[i].index, 'Age'] = np.random.choice(class_age[i], size=n_nan)

#### Step 2. Create correlation matrix

In [None]:
# Collect numeric cols from the dataset (`Sex` is converted to numeric below)
numeric_cols = ['Survived', 'Pclass', 'Age', 'Sex', 'SibSp', 'Parch', 'Fare']
# Take a copy, so the assignment below does not write into a slice of `df_no_nan`
df_numeric = df_no_nan[numeric_cols].copy()
# Sex could be also numeric, if we denote (male, female) by (0, 1)
df_numeric['Sex'] = df_numeric['Sex'].map({'male': 0, 'female': 1})

#### Step 3. Visualize correlation matrix

In [None]:
# Create a correlation matrix using the built-in method of Pandas
df_corr = df_numeric.corr(method='pearson')
np_corr = np.array(df_corr)

In [None]:
fig, axes = plt.subplots(figsize=(12,12))

# Hide grid and render tiles as squares on the figure
axes.grid(False)
axes.set_aspect('equal')

im = axes.imshow(np_corr, vmin=-1, vmax=1)#, cmap='viridis')
# Loop over data dimensions and create text annotations.
for x in range(np_corr.shape[0]):
    for y in range(np_corr.shape[1]):
        axes.text(y, x, '{0:.4f}'.format(np_corr[x, y]), fontsize=15,
                  ha='center', va='center', color='white', fontweight='bold', 
                  bbox=dict(color=np.array((0,0,0,0.2)), lw=0)
                 )

# Format X and Y ticks to fit the 2D correlation plot
axes.set_xticks(range(len(df_corr.columns)))
axes.set_xticklabels(df_corr.columns)
axes.set_yticks(range(len(df_corr.columns)))
axes.set_yticklabels(df_corr.columns)
# Place X ticks on top
axes.xaxis.tick_top()

axes.set_title('Fig. 13. Correlation matrix of numeric columns in the dataset',
               fontsize=axistitlesize, y=-0.16)
axes.tick_params(axis='both', which='major', labelsize=axisticksize, pad=10)

# Create an axis on the right side of `axes`. The width of `cax` will be 5%
# of `axes` and the padding between `cax` and axes will be fixed at 0.1 inch
divider = make_axes_locatable(axes)
cax = divider.append_axes('right', size='5%', pad=0.1)
cbar = plt.colorbar(mappable=im, cax=cax)
cbar.ax.tick_params(labelsize=axiscbarfontsize, colors='black')

plt.show()

This map needs some explanation to be understood correctly.

It seems that being female is the main factor in survival on the Titanic, which is completely understandable in light of the historical events that took place during the night of 14-15 April 1912. It also seems that age mattered the least in survival, which sounds surprising in light of the same events. The latter could be explained by the discrepancy between the infamous, unwritten nautical rule "Women and children first" and the actual survival rate of children. Although almost every child in the first and second classes survived, only very few were saved in the third class[3].

The negative correlations between `Age` and `Pclass` and between `Fare` and `Pclass` originate from the same coding issue. `Pclass` is considered better when its value is lower (1st class is the best, while 3rd is the worst). However, smaller `Pclass` values are accompanied by higher `Fare` prices. In simple terms, this means that more luxurious classes cost more, which is rather uninteresting common knowledge. But on this plot the negative coefficient can easily be deceptive if one looks only at its sign, because in reality there is an evident positive relationship between fare price and the quality of provision. The negative correlation between `Pclass` and `Age` arises from the same thing. Younger people simply didn't have as much money as older people. That's why older passengers were more likely to buy a ticket in better classes (denoted by lower numbers).  
The negative correlation between `Pclass` and `Survived` could also be deceptive. As we've seen above, people in higher-tier classes had an advantage at boarding the lifeboats. Also, first-class cabins were the nearest to the deck of the ship.
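
That the sign of these coefficients depends only on how the classes are numbered can be shown with a small sketch. The toy values and the recoded column `Quality` are hypothetical and serve only as an illustration: reversing the coding, so that a higher number means a better class, flips the sign of the Pearson coefficient and leaves its magnitude unchanged.

```python
import pandas as pd

# Toy frame with hypothetical values (not the Titanic data)
toy = pd.DataFrame({
    'Pclass': [1, 1, 2, 2, 3, 3],
    'Fare':   [80.0, 60.0, 20.0, 15.0, 8.0, 7.0],
})

# Original coding: a lower number means a better class
r_raw = toy['Pclass'].corr(toy['Fare'])

# Hypothetical recoding: a higher number means a better class
toy['Quality'] = 4 - toy['Pclass']
r_recoded = toy['Quality'].corr(toy['Fare'])

print(r_raw, r_recoded)
```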

There is a negative correlation between `Age` and `SibSp`, which could be simply explained by the fact that younger people tend to have more siblings. Child mortality was still high in the early 20th century, while life expectancy was still much lower than nowadays. Mothers gave birth to more children, but a lot of them died early and only a few people lived over $60$ years. The result of all of this was a society with a lot of children, and just a handful of elderly people. That's why it was much more likely to find youngsters with siblings on the ship.

---
#### Sources
[3] : ["Children on the Titanic"](https://www.encyclopedia-titanica.org/children-on-titanic/). *Encyclopedia Titanica*. Retrieved 12 September 2020.

## 5. Interpretation

Create two plots which you think are meaningful. Interpret both of them. (Eg.: older people buy more expensive ticket? people buying more expensive ticket survive more? etc.)

### Ideas

It would be interesting to explore the connection between the columns that are still unexplored. I'm really running out of ideas, actually. But here are some graphs that I found interesting.

#### 5./a. If you paid more, you had a much better chance for survival. Oof.

In [None]:
nrows = 1
ncols = 1
fig, axes = plt.subplots(nrows, ncols, figsize=(ncols*14, nrows*9))

sns.boxplot(x='Survived', y='Fare', data=df,
            palette=survival_palette,
            showfliers=False,
            ax=axes)

# Format X ticks
axes.set_xticklabels(df_heatmap.columns[::-1])

axes.set_title('Fig. 14. Correlation of fare price and survival without outliers.\n' +
               'The boxes span the interquartile range (first to third quartile).',
               fontsize=axistitlesize, y=-0.28)
axes.tick_params(axis='both', which='major', labelsize=axisticksize, pad=10)
axes.set_xlabel('Survival', fontsize=axislabelsize, fontweight='bold')
axes.set_ylabel('Fare price distribution [ $GBP$ ]', fontsize=axislabelsize, fontweight='bold')

plt.show()

#### 5./b. Where did wealthy people embark on the ship?

It seems that Cherbourg was the home of the wealthiest people. Or was it? At least it seems to be a robust statement that people who embarked there paid the most for their tickets.

In [None]:
nrows = 1
ncols = 1
fig, axes = plt.subplots(nrows, ncols, figsize=(ncols*14, nrows*9))

sns.boxplot(x='Embarked', y='Fare', data=df,
            palette=embarked_palette,
            showfliers=False,
            ax=axes)

# Format X ticks
# First get the auto-generated labels, then simply translate them using the LUT
axes.set_xticklabels(list(map(
                                embark_lut.get, # This is the LUT
                                [i.get_text() for i in axes.xaxis.get_ticklabels()] # This is the auto-gen. X-ticks
                             )))

axes.set_title('Fig. 15. Fare price distribution by place of embarkation without outliers.\n' +
               'The boxes span the interquartile range (first to third quartile).',
               fontsize=axistitlesize, y=-0.28)
axes.tick_params(axis='both', which='major', labelsize=axisticksize, pad=10)
axes.set_xlabel('Place of embarkation', fontsize=axislabelsize, fontweight='bold')
axes.set_ylabel('Fare price distribution [ $GBP$ ]', fontsize=axislabelsize, fontweight='bold')

plt.show()

#### 5./c. Where did older people embark on the ship?

There's really no significant difference here. Most of the people embarked in Southampton, which would probably just shift that distribution closer to the true age distribution of people living in England at the time.

In [None]:
nrows = 1
ncols = 1
fig, axes = plt.subplots(nrows, ncols, figsize=(ncols*14, nrows*9))

sns.boxplot(x='Embarked', y='Age', data=df,
            palette=embarked_palette,
            showfliers=False,
            ax=axes)

# Format X ticks
# First get the auto-generated labels, then simply translate them using the LUT
axes.set_xticklabels(list(map(
                                embark_lut.get, # This is the LUT
                                [i.get_text() for i in axes.xaxis.get_ticklabels()] # This is the auto-gen. X-ticks
                             )))

axes.set_title('Fig. 16. Age distribution by place of embarkation without outliers.\n' +
               'The boxes span the interquartile range (first to third quartile).',
               fontsize=axistitlesize, y=-0.28)
axes.tick_params(axis='both', which='major', labelsize=axisticksize, pad=10)
axes.set_xlabel('Place of embarkation', fontsize=axislabelsize, fontweight='bold')
axes.set_ylabel('Age distribution [ $years$ ]', fontsize=axislabelsize, fontweight='bold')

plt.show()

### Hints:
 - In total you can get 10 points for fully completing all tasks.
 - Decorate your notebook with questions, explanations etc., make it self-contained and understandable!
 - Comment your code when necessary
 - Write functions for repetitive tasks!
 - Use the pandas package for data loading and handling
 - Use matplotlib and seaborn for plotting or bokeh and plotly for interactive investigation
 - Use the scikit learn package for almost everything
 - Use for loops only if it is really necessary!
 - Code sharing is not allowed between students! Sharing code will result in zero points.
 - If you use code found on web, it is OK, but, make its source clear! 