# Disclaimer  
* This project is a personal challenge for educational purposes only; it has no other pretension.
* My goal was (obviously) not to reinvent the wheel, so you will find nothing really new here. (Almost) everything comes from public Kaggle kernels, and this 'work' is heavily inspired by several (good) readings.
* In the end, the objective was also to discover, understand and improve my personal skills in data exploration and correlation, and in handling the pandas and seaborn packages.

## Good readings before starting
* [Kaggle kernel from Manav Sehgal](https://www.kaggle.com/startupsci/titanic-data-science-solutions/data)
* [PyconUK tutorial notebooks](https://nbviewer.jupyter.org/github/savarin/pyconuk-introtutorial/tree/master/notebooks/)
* This [EDA notebook](https://www.kaggle.com/ash316/eda-to-prediction-dietanic)
* Other public Kaggle [kernels/tutorials](https://www.kaggle.com/c/titanic#tutorials) to follow

In [28]:
import pandas as pd

## Setup libs + loading datasets
Allow *pandas* to display output wider than the console width, then load both datasets (we assume they are stored in a folder called `'datasets'` located in the same directory as this notebook).

In [29]:
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
train_df = pd.read_csv('./datasets/train.csv')
test_df = pd.read_csv('./datasets/test.csv')
nb_rows = 5

## Display basic information on the datasets
Let's start with a basic and simple display of information.

In [30]:
# First 'nb_rows' rows of the 'Training' dataset:
train_df.head(nb_rows)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [31]:
# And 'Test' dataset looks like:
test_df.head(nb_rows)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [32]:
# Shapes of the 'Training' and 'Test' datasets are, respectively:
train_df.shape

(891, 12)

In [33]:
test_df.shape

(418, 11)
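The two shapes differ by one column. A minimal sketch of how to check which one, assuming nothing beyond pandas: the helper `column_difference` and the tiny `toy_train` / `toy_test` frames are hypothetical stand-ins for `train_df` / `test_df`.

```python
import pandas as pd


def column_difference(left: pd.DataFrame, right: pd.DataFrame) -> list:
    """Return the columns present in `left` but absent from `right`, in `left`'s order."""
    return [col for col in left.columns if col not in right.columns]


# Tiny hypothetical frames standing in for train_df / test_df.
toy_train = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "Pclass": [3, 1]})
toy_test = pd.DataFrame({"PassengerId": [892, 893], "Pclass": [3, 3]})

print(column_difference(toy_train, toy_test))  # ['Survived']
```

On the real data, `column_difference(train_df, test_df)` should point at the target column that the 'Test' dataset lacks.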

In [34]:
# Quick statistics summary for the 'Training' dataset:
train_df.describe(include='all')

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,891.0,891.0,891.0,891,891,714.0,891.0,891.0,891,891.0,204,889
unique,,,,891,2,,,,681,,147,3
top,,,,"Milling, Mr. Jacob Christian",male,,,,1601,,C23 C25 C27,S
freq,,,,1,577,,,,7,,4,644
mean,446.0,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
std,257.353842,0.486592,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
min,1.0,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,223.5,0.0,2.0,,,20.125,0.0,0.0,,7.9104,,
50%,446.0,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,
75%,668.5,1.0,3.0,,,38.0,1.0,0.0,,31.0,,


In [35]:
# Same for 'Test':
test_df.describe(include='all')

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,418.0,418.0,418,418,332.0,418.0,418.0,418,417.0,91,418
unique,,,418,2,,,,363,,76,3
top,,,"Collett, Mr. Sidney C Stuart",male,,,,PC 17608,,B57 B59 B63 B66,S
freq,,,1,266,,,,5,,3,270
mean,1100.5,2.26555,,,30.27259,0.447368,0.392344,,35.627188,,
std,120.810458,0.841838,,,14.181209,0.89676,0.981429,,55.907576,,
min,892.0,1.0,,,0.17,0.0,0.0,,0.0,,
25%,996.25,1.0,,,21.0,0.0,0.0,,7.8958,,
50%,1100.5,3.0,,,27.0,0.0,0.0,,14.4542,,
75%,1204.75,3.0,,,39.0,1.0,0.0,,31.5,,


In [36]:
# Other interesting information for 'Training' dataset:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In [37]:
# and for 'Test':
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


## First observations
The first step is to analyze our data and look for correlations, and to check whether there are missing values (if so, we will need to decide what to do with them).

In [38]:
# Columns containing 'null' data for the 'Training' dataset are:
train_df.columns[train_df.isnull().any()]

Index(['Age', 'Cabin', 'Embarked'], dtype='object')

In [39]:
# In 'Test' dataset:
test_df.columns[test_df.isnull().any()]

Index(['Age', 'Fare', 'Cabin'], dtype='object')
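The two cells above only list which columns contain nulls. A possible sketch that also counts them and reports how many distinct values each column holds; the helper name `column_overview` and the `toy` frame are hypothetical and only serve the illustration.

```python
import pandas as pd


def column_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column count and share of missing values, plus number of distinct non-null values."""
    overview = pd.DataFrame({
        "missing": df.isnull().sum(),
        "missing_pct": (df.isnull().mean() * 100).round(1),
        "n_unique": df.nunique(),
    })
    # Columns with the most missing values come first.
    return overview.sort_values(by="missing", ascending=False)


# Tiny hypothetical frame; on the real data one would call column_overview(train_df).
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Ticket": ["A1", "A1", "B2", "C3"],
})
print(column_overview(toy))
```

A column whose `n_unique` is lower than its non-null count contains repeated values, which is a quick way to spot duplicates.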

==> **What have we seen?**  
1) It seems that we can get rid of information that is useless for this problem:
* 'Name' column has no duplicates across the dataset (unique = count), so it will not help us to categorize anything
* 'Ticket' column has no missing values but contains duplicates (families?): only 681 unique values for 891 rows in 'Training', i.e. 210 rows (23.6%) repeat a ticket number already seen
* 'Cabin' column has a lot of missing values (687 out of 891 in the 'Training' dataset) and contains duplicates

2) In the 'Training' dataset, 38% of passengers survived (*mean* value of 'Survived')

3) 'Sex' and 'Embarked' are non-numerical and categorical/qualitative => need to map them to numerical values
* In 'Training', most passengers were men (577/891 = 64.8%)
* In 'Training', most passengers boarded at port 'S' (top = 'S' and freq = 644 -> ~72%)

4) 'Pclass' is qualitative and ordinal (value = 1, 2 or 3)
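The observations above could translate into something like the following sketch. The integer codes in `SEX_CODES` and `EMBARKED_CODES` and the helper `encode_and_trim` are assumptions made for illustration; any consistent mapping would do.

```python
import pandas as pd

# Hypothetical integer codes, not taken from the notebook.
SEX_CODES = {"female": 0, "male": 1}
EMBARKED_CODES = {"S": 0, "C": 1, "Q": 2}


def encode_and_trim(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with 'Sex' and 'Embarked' mapped to codes and text-heavy columns dropped."""
    out = df.copy()
    out["Sex"] = out["Sex"].map(SEX_CODES)
    # Missing 'Embarked' values stay NaN here; filling them is a separate decision.
    out["Embarked"] = out["Embarked"].map(EMBARKED_CODES)
    # errors="ignore" lets the same function run on frames lacking some of these columns.
    return out.drop(columns=["Name", "Ticket", "Cabin"], errors="ignore")


# Tiny hypothetical frame standing in for train_df.
toy = pd.DataFrame({
    "Name": ["a", "b"],
    "Sex": ["male", "female"],
    "Embarked": ["S", None],
    "Cabin": [None, "C85"],
})
encoded = encode_and_trim(toy)
print(encoded)
```

Working on a copy keeps the original frame intact, so the raw columns remain available if they turn out to be useful later.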

## Specific analysis for a few features

In [42]:
# How many people survived, grouped by their 'Pclass' category?
# ==> the higher the class, the better the chance of survival --> this feature has an impact on the 'decision/prediction'
train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=True).mean().sort_values(by='Survived', ascending=False)

Pclass,Survived
1,0.62963
2,0.472826
3,0.242363
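The same group-then-average pattern can be applied to any feature. A small sketch that wraps it in a helper, with the group size added so that rates computed on very few passengers can be spotted; `survival_rate_by` and the `toy` frame are hypothetical names.

```python
import pandas as pd


def survival_rate_by(df: pd.DataFrame, feature: str) -> pd.DataFrame:
    """Survival rate and passenger count per value of `feature`, best rate first."""
    summary = df.groupby(feature)["Survived"].agg(rate="mean", passengers="count")
    return summary.sort_values(by="rate", ascending=False)


# Tiny hypothetical frame; on the real data: survival_rate_by(train_df, "Pclass").
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
})
result = survival_rate_by(toy, "Pclass")
print(result)
```

The `passengers` column matters when reading the rates: a group of five people with a rate of 0.0 says much less than a group of several hundred.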


In [43]:
# What about the gender?
# ==> women had a better chance of surviving (74% vs. 19% for men) [Remember that the 'Training' dataset is quite
# representative, with 38% survival (of the 2224 people on board, 1502 died, so 722/2224 = 32.5% survived)]
train_df[["Sex", "Survived"]].groupby(['Sex'], as_index=True).mean().sort_values(by='Survived', ascending=False)

Sex,Survived
female,0.742038
male,0.188908


In [44]:
# Same for 'Embarked'
# ==> better chance of survival when boarding at 'C' [Remember that 'S' is the most frequent port, at least in the 'Training'
# dataset (~72%), so there is perhaps no direct relation: most of the people who died came from 'S', but that does not mean
# that 'S' is more linked to death]
train_df[["Embarked", "Survived"]].groupby(['Embarked'], as_index=True).mean().sort_values(by='Survived', ascending=False)

Embarked,Survived
C,0.553571
Q,0.38961
S,0.336957
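One way to probe whether the port effect is direct or merely reflects another feature is to look at the class mix per port. This is only a sketch of that idea, not something the notebook does; `class_mix_by_port` and the `toy` frame are hypothetical.

```python
import pandas as pd


def class_mix_by_port(df: pd.DataFrame) -> pd.DataFrame:
    """Share of each 'Pclass' among the passengers of each 'Embarked' port (rows sum to 1)."""
    return pd.crosstab(df["Embarked"], df["Pclass"], normalize="index")


# Tiny hypothetical frame; on the real data: class_mix_by_port(train_df).
toy = pd.DataFrame({
    "Embarked": ["C", "C", "S", "S", "S", "S"],
    "Pclass": [1, 1, 1, 3, 3, 3],
})
mix = class_mix_by_port(toy)
print(mix)
```

If one port turns out to carry a much larger share of first-class passengers, part of its higher survival rate may come from 'Pclass' rather than from the port itself.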


In [52]:
# Does 'Age' have any impact?
# train_df[["Age", "Survived"]].groupby(['Age'], as_index=True).mean().sort_values(by='Age', ascending=True)
# ==> Irrelevant to display it this way: there are 88 different values for 'Age', better to visualize it (later)
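Short of a plot, grouping ages into a handful of bands already makes the table readable. A minimal sketch, assuming arbitrary band edges of 16 years (the `edges` default and the helper `survival_by_age_band` are hypothetical choices):

```python
import pandas as pd


def survival_by_age_band(df: pd.DataFrame, edges=(0, 16, 32, 48, 64, 80)) -> pd.Series:
    """Mean of 'Survived' per age band; rows with missing 'Age' are left out."""
    known = df.dropna(subset=["Age"])
    # pd.cut builds right-closed intervals: (0, 16], (16, 32], ...
    bands = pd.cut(known["Age"], bins=list(edges))
    # observed=True keeps only the bands that actually contain passengers.
    return known.groupby(bands, observed=True)["Survived"].mean()


# Tiny hypothetical frame; on the real data: survival_by_age_band(train_df).
toy = pd.DataFrame({
    "Age": [5, 10, 20, 30, None, 70],
    "Survived": [1, 1, 0, 1, 1, 0],
})
by_band = survival_by_age_band(toy)
print(by_band)
```

Ages outside the chosen edges would fall into no band and be silently dropped, so the edges should cover the observed range.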

In [47]:
# Finally, let's try a family analysis: was the chance of survival better for people travelling alone or not?
# Let's start with siblings and spouses
# [Sibling = brother, sister, stepbrother, stepsister / Spouse = husband, wife (mistresses and fiancés were ignored)]
train_df[["SibSp", "Survived"]].groupby(['SibSp'], as_index=True).mean().sort_values(by='SibSp', ascending=True)

SibSp,Survived
0,0.345395
1,0.535885
2,0.464286
3,0.25
4,0.166667
5,0.0
8,0.0


In [48]:
# Do the same with parents and children
# [Parent = mother, father / Child = daughter, son, stepdaughter, stepson
# Some children travelled only with a nanny, therefore parch=0 for them]
train_df[["Parch", "Survived"]].groupby(['Parch'], as_index=True).mean().sort_values(by='Parch', ascending=True)

Parch,Survived
0,0.343658
1,0.550847
2,0.5
3,0.6
4,0.0
5,0.2
6,0.0
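To answer the "alone or not" question directly, the two columns can be combined. The sketch below follows a pattern common in public Titanic kernels; the column names `FamilySize` and `IsAlone` and the helper `add_family_features` are assumptions, not part of the dataset.

```python
import pandas as pd


def add_family_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with 'FamilySize' (passenger included) and a 0/1 'IsAlone' flag."""
    out = df.copy()
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    out["IsAlone"] = (out["FamilySize"] == 1).astype(int)
    return out


# Tiny hypothetical frame; on the real data: add_family_features(train_df).
toy = pd.DataFrame({
    "SibSp": [0, 1, 0, 4],
    "Parch": [0, 0, 2, 2],
    "Survived": [0, 1, 1, 0],
})
with_family = add_family_features(toy)
print(with_family.groupby("IsAlone")["Survived"].mean())
```

Note that a child travelling only with a nanny has `Parch` = 0 and may therefore be flagged as alone, so the flag is an approximation.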
