<a id='top'></a>

# Titanic Case Study
The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

## Table of Contents
1. [Dataset Description](#dataset) 
2. [Data Cleaning](#cleaning) 
    1. [Missing Values](#mv)
    2. [Feature Engineering](#fe)
    3. [Feature Reshaping](#fr)
3. [Exploratory Analysis](#ea)
    1. [Features Distributions](#fd)
    2. [Dispersion and Outliers](#do)
    3. [Correlations](#cc)

In [19]:
# let's import useful packages

%matplotlib inline
import numpy as np
import pandas as pd 
import scipy as sp
import sklearn as sk # data mining tools
import matplotlib.pyplot as plt # plotting
import seaborn as sns # advanced plotting
from pandas.plotting import scatter_matrix

import warnings
warnings.filterwarnings("ignore")

<a id='dataset'></a>
## 1. Dataset description ([to top](#top))
As a first step, we load the whole Titanic dataset and get familiar with its features.

In [20]:
# load the dataset 
titanic = pd.read_csv("data/titanic.csv")
titanic.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


Each record is described by 12 variables:

- The ``Survived`` variable is a binary nominal datatype: 1 for survived and 0 for did not survive.

- The ``PassengerId`` and ``Ticket`` variables are assumed to be random identifiers, so they will be excluded from the analysis.

- The ``Pclass`` variable is an ordinal datatype for the ticket class, a proxy for socio-economic status (SES), representing 1 = upper class, 2 = middle class, and 3 = lower class.

- The ``Name`` variable is a nominal datatype. It could be used in feature engineering to derive the gender from title, family size from surname, and SES from titles like doctor or master. 

- The ``Sex`` and ``Embarked`` variables are nominal datatypes. **Embarked** indicates the port (harbor) where a passenger boarded. They will be converted to numerical codes for mathematical calculations.

- The ``Age`` and ``Fare`` variables are continuous quantitative datatypes. **Fare** is cumulative for each family.

- The ``SibSp`` variable represents the number of siblings/spouses aboard and ``Parch`` represents the number of parents/children aboard. Both are discrete quantitative datatypes. They can be used in feature engineering to create a family-size variable and an is-alone variable.

- The ``Cabin`` variable is a nominal datatype that can be used in feature engineering for approximate position on ship when the incident occurred and SES from deck levels. However, since there are many null values, it does not add value and thus is excluded from analysis.
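
A quick way to cross-check the variable descriptions above is to tabulate, for each column, its pandas dtype, the number of distinct values, and the share of missing values. The sketch below uses a small hand-made frame (`sample`, with illustrative values that are not taken from the dataset) so that it runs on its own; on the real data the same three calls could be applied to `titanic`.

```python
import pandas as pd

# Toy rows mirroring a few Titanic columns (values are illustrative only)
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0],
    "Cabin": [None, "C85", None, None],
})

# One row per column: dtype, number of distinct non-null values, share of nulls
overview = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "n_unique": sample.nunique(),
    "missing_share": sample.isnull().mean(),
})
print(overview)
```

A column with a high `missing_share`, such as ``Cabin`` here, is a natural candidate for exclusion.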

<a id='cleaning'></a>
## 2. Data Cleaning ([to top](#top))

In this stage, we will clean our data by 
 1. handling missing information, 
 2. creating new features for analysis, and 
 3. converting fields to the correct format for calculations and presentation.

<a id='mv'></a>
### 2.A Missing Values ([to top](#top))
Reviewing the data, there does not appear to be any aberrant or non-acceptable data inputs.

Are there null values or missing data?

In [None]:
titanic.isnull().sum()

In [None]:
# concise summary of the dataset
titanic.describe(include = 'all')

In our scenario it is safe to *impute* the missing values, i.e. to replace them with a representative value such as the median or the mode.

In [None]:
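# fill null values in Age with the median
# (assigning the result back avoids chained in-place modification, which recent pandas versions warn about)
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())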
titanic.Age.isnull().sum()

In [None]:
# fill null values in Embarked with the mode (most frequent value)
titanic['Embarked'] = titanic['Embarked'].fillna(titanic['Embarked'].mode()[0])
titanic.Embarked.isnull().sum()

In [None]:
# fill null values in Fare with the median
titanic['Fare'] = titanic['Fare'].fillna(titanic['Fare'].median())
titanic.Fare.isnull().sum()
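
The three imputation steps above follow the same pattern, so they can be wrapped in a small helper. This is only a sketch: `impute_columns` and the toy frame are hypothetical names introduced for illustration, and the helper returns a copy instead of modifying the input.

```python
import numpy as np
import pandas as pd

def impute_columns(df, median_cols=(), mode_cols=()):
    """Return a copy of df with nulls replaced by the column median or mode."""
    out = df.copy()
    for col in median_cols:
        out[col] = out[col].fillna(out[col].median())
    for col in mode_cols:
        # mode() returns a Series (there may be ties); take the first value
        out[col] = out[col].fillna(out[col].mode()[0])
    return out

toy = pd.DataFrame({
    "Age": [20.0, np.nan, 30.0, 40.0],
    "Embarked": ["S", "C", None, "S"],
})
filled = impute_columns(toy, median_cols=["Age"], mode_cols=["Embarked"])
print(filled)
```

The median is used for numeric columns because it is less sensitive to extreme values than the mean.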

Moreover, not all the columns in our dataframe are useful for our analysis...

In [None]:
drop_column = ['PassengerId', 'Cabin', 'Ticket']
titanic.drop(drop_column, axis=1, inplace = True) # it modifies the dataframe directly

In [None]:
titanic.isnull().sum()

<a id='fe'></a>
### 2.B Feature Engineering ([to top](#top))
Feature engineering is when we use existing features to create new features to determine if they provide new signals to predict our outcome. 

To make the information hidden in the original data more explicit, we engineer some new features.

#### Creating discrete variables as combinations of existing ones

In [None]:
titanic['FamilySize'] = titanic['SibSp'] + titanic['Parch'] + 1

In [None]:
titanic['IsAlone'] = 1 # initialize to yes/1 is alone
# update to no/0 if family size is greater than 1 (a single .loc call avoids chained assignment)
titanic.loc[titanic['FamilySize'] > 1, 'IsAlone'] = 0

titanic[['FamilySize', 'IsAlone']].head()
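
As an alternative to initializing the column and then overwriting part of it, the flag can be derived in one vectorized step from a boolean comparison. A minimal sketch on toy values (the frame `toy` is made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"SibSp": [1, 0, 0, 3], "Parch": [0, 0, 2, 1]})

# the passenger counts as one member of the family
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# True where the passenger travels alone, converted to 1/0
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)
print(toy)
```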

Since the ``Fare`` value is cumulative for each family, we can now compute the fare paid per passenger by dividing it by the family size.

In [2]:
titanic['Fare'] = titanic['Fare']/titanic['FamilySize']
titanic.Fare.head()

#### Transform Categorical (String) variables

In [3]:
titanic.Name.head()

In [None]:
# Identify title names (Mr., Miss., Mrs., etc.)
# Split the title from the name: it sits between ", " and "."

titanic['Title'] = titanic['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]

titanic[['Name', 'Title']].head()
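
The same extraction can be written with a single regular expression instead of two chained splits. The sketch below assumes that every name follows the "Surname, Title. Given names" layout seen in the rows above; the three sample names are copied from the table at the top of the notebook.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
])

# capture everything between the comma (plus spaces) and the first period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

Names that do not match the pattern yield a null title, which makes malformed entries easy to spot.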

In [4]:
# clean up rare title names

stat_min = 10 # the minimum frequency of a title
title_names = (titanic['Title'].value_counts() < stat_min) # boolean series indexed by title: True if the title is rare
titanic['Title'] = titanic['Title'].apply(lambda x: 'Misc' if title_names.loc[x] else x)

titanic['Title'].value_counts()
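
The rare-title cleanup can also be done without a Python-level `apply`, by mapping each row to the frequency of its title and masking the rare ones. This is a sketch on made-up titles, with a hypothetical threshold `min_count` chosen small enough for the toy data:

```python
import pandas as pd

titles = pd.Series(["Mr", "Mr", "Mr", "Miss", "Miss", "Dr", "Rev"])
min_count = 2  # hypothetical threshold for this toy example

# frequency of each row's own title, aligned with the original rows
counts = titles.map(titles.value_counts())

# keep frequent titles, replace the others with 'Misc'
cleaned = titles.where(counts >= min_count, "Misc")
print(cleaned.value_counts())
```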

<a id='fr'></a>
### 2.C Feature Reshaping ([to top](#top))

Last, but certainly not least, we'll deal with formatting. Our categorical data is imported as objects (strings), which cannot be used directly in mathematical calculations. We will convert these object columns to numerical codes.
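
Label encoding, used in the rest of this section, assigns one integer per category, which implicitly suggests an ordering. Dummy (one-hot) variables are a common alternative when the categories have no natural order. The sketch below shows that alternative on a made-up `Embarked` column; it is not the approach this notebook takes.

```python
import pandas as pd

toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# one 0/1 column per category; exactly one of them is 1 in every row
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked", dtype=int)
print(dummies)
```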

#### Convert categorical variables to numerical ones using Label Encoder

In [5]:
from sklearn.preprocessing import LabelEncoder

# encode labels with value between 0 and n_classes-1.
sex_encoder = LabelEncoder()
embarked_encoder = LabelEncoder()
title_encoder = LabelEncoder()

titanic['Sex_Code'] = sex_encoder.fit_transform(titanic['Sex'])
titanic['Embarked_Code'] = embarked_encoder.fit_transform(titanic['Embarked'])
titanic['Title_Code'] = title_encoder.fit_transform(titanic['Title'])

titanic.head()

In [6]:
# we can also invert the encoding
sex_encoder.inverse_transform([0, 1]) # expects a 1-D array of codes; the encoder must be fitted first

<a id='ea'></a>
## 3. Exploratory Analysis ([to top](#top))
Now that our data is cleaned, we will explore our data with descriptive and graphical statistics to describe and summarize our variables. 

<a id='fd'></a>
### 3.A Features Distributions ([to top](#top))

To understand how the values of a continuous feature are distributed, we can use a KDE (Kernel Density Estimate) plot.

In [7]:
age = titanic['Age'].plot.kde()

In [None]:
fare = titanic['Fare'].plot.kde()
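
To see what a KDE plot draws, the estimate can be computed directly with SciPy's `gaussian_kde`, which pandas relies on for `plot.kde()`. The sketch below uses synthetic values (not Titanic data) and checks that the estimated density integrates to roughly 1, as any probability density should.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
ages = rng.normal(loc=30, scale=10, size=500)  # synthetic values, not Titanic data

kde = gaussian_kde(ages)  # bandwidth is chosen by Scott's rule by default

# evaluate the density on a grid that extends well beyond the data range
grid = np.linspace(ages.min() - 30, ages.max() + 30, 1000)
density = kde(grid)

# trapezoidal rule: area under the estimated density
area = float(np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid)))
print(round(area, 3))
```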

#### Conditional Feature Distribution

We can also build KDE plots by grouping the values of a feature with respect to a categorical variable.

For instance, we can check whether the Age and Fare distributions differ between surviving and dead passengers, or between male and female passengers.

In [8]:
ax = titanic.groupby(['Survived']).Age.plot.kde()
plt.legend()
plt.show()

In [9]:
ax = titanic.groupby(['Sex']).Age.plot.kde()
plt.legend()
plt.show()

In [10]:
ax = titanic.groupby(['Survived']).Fare.plot.kde()
plt.legend()
plt.show()

In [None]:
ax = titanic.groupby(['Sex']).Fare.plot.kde()
plt.legend()
plt.show()

In [11]:
ax = titanic.groupby(['Sex', 'Survived']).Age.plot.kde()
plt.legend()
plt.show()

#### Histogram plot
We can also use histograms instead of KDE plots to show the binned distribution of a feature.

In [12]:
sx = titanic.FamilySize.plot.hist()
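
The binning behind a histogram can be inspected with `np.histogram`, which returns the count per bin and the bin edges. The values below are illustrative, and 4 bins are used to keep the output short (`plot.hist()` uses 10 bins by default).

```python
import numpy as np

family_sizes = np.array([1, 1, 2, 2, 2, 3, 5, 8])  # illustrative values

# equal-width bins spanning the range [min, max] of the data
counts, edges = np.histogram(family_sizes, bins=4)
print(counts)
print(edges)
```

Each bin is half-open on the right, except the last one, which also includes the maximum.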

#### (Conditional, Stacked) histograms

In [13]:
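def conditional_histogram(df, column):
    # values of `column` for dead and surviving passengers, as two separate columns
    # (use the `df` argument rather than the global `titanic` frame)
    booldf1 = pd.DataFrame(df[df['Survived']==0][column])
    booldf1.columns = ['Dead']
    booldf2 = pd.DataFrame(df[df['Survived']==1][column])
    booldf2.columns = ['Survived']
    row_concat = pd.concat([booldf1, booldf2], axis=1)
    
    # stacked histogram: the two groups are piled up within each bin
    ax = row_concat.plot.hist(stacked=True, alpha=0.6)
    ax.set_xlabel(column)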

In [14]:
conditional_histogram(titanic, 'Fare')

In [15]:
conditional_histogram(titanic, 'Embarked_Code')

In [16]:
conditional_histogram(titanic, 'FamilySize')

In [None]:
conditional_histogram(titanic, 'Age')

In [17]:
conditional_histogram(titanic, 'Sex_Code')

#### Bar charts
Unlike histograms, which plot quantitative data grouped into bins or intervals, bar charts plot categorical data.


In [None]:
# Survived by sex

sx = titanic.groupby(['Sex']).Survived.sum().plot.barh()

In [None]:
# Survived count

sx = titanic.groupby(['Survived']).Survived.count().plot.barh()

In [None]:
# Alone passengers

sx = titanic.groupby(['IsAlone']).IsAlone.count().plot.barh()

In [None]:
# Alone passengers grouped by sex

sx = titanic.groupby(['IsAlone', 'Sex']).IsAlone.count().plot.barh()

In [None]:
# Does being alone affect the survival rate?

sx = titanic.groupby(['IsAlone', 'Survived']).IsAlone.count().plot.barh()

#### (Conditional and Normalized) Bar plot

In [None]:
def conditional_bar_plot(df, columns, by):
    t1 = pd.DataFrame(df[columns].groupby(by).sum())
    t1.columns = ['Survived']
    t2 = pd.DataFrame(df[columns].groupby(by).count()) # use the `df` argument, not the global frame
    t2.columns = ['Total']
    row_concat = pd.concat([t1, t2], axis=1)
    row_concat['Percentage'] = row_concat['Survived'] / row_concat['Total']
    return row_concat['Percentage'].plot.barh()

In [None]:
# Survival rate per Class

sp = conditional_bar_plot(titanic, ['Survived', 'Pclass'], ['Pclass'])

In [None]:
# Survival rate per Embarked

se = conditional_bar_plot(titanic, ['Survived', 'Embarked'], ['Embarked'])

In [None]:
# Survival rate per Family Size

sf = conditional_bar_plot(titanic, ['Survived', 'FamilySize'], ['FamilySize'])
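
Because ``Survived`` is coded as 0/1, the ratio of its sum to its count within a group is simply its mean. The percentage computed by `conditional_bar_plot` can therefore be obtained in one step, as in this sketch on made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 1, 0, 0, 1, 0],
})

# mean of a 0/1 column = share of ones = survival rate within each group
rate = toy.groupby("Pclass")["Survived"].mean()
print(rate)
```

Calling `.plot.barh()` on `rate` would then give the same kind of chart as the helper above.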

<a id='do'></a>
### 3.B Dispersion and Outliers ([to top](#top))

A box plot is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles.

Box plots are non-parametric: they display variation in samples of a statistical population without making any assumptions of the underlying statistical distribution. The spacings between the different parts of the box indicate the degree of dispersion (spread) and skewness in the data, and show outliers.
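
The quantities a box plot displays can be computed by hand. The sketch below uses illustrative fare values and the common 1.5 × IQR rule, which is also Matplotlib's default for the whiskers: points beyond the fences are the ones drawn as outliers (fliers).

```python
import numpy as np

# illustrative values, not taken from the dataset
fares = np.array([7.0, 8.0, 8.0, 10.0, 13.0, 15.0, 26.0, 30.0, 80.0, 512.0])

# the box spans the first to third quartile, with a line at the median
q1, median, q3 = np.percentile(fares, [25, 50, 75])
iqr = q3 - q1  # interquartile range: a measure of dispersion

# values outside these fences are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = fares[(fares < lower_fence) | (fares > upper_fence)]
print(q1, median, q3, outliers)
```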

In [None]:
fare_box = titanic.boxplot(['Fare'], showfliers=True)

In [None]:
age_box = titanic.boxplot(['Age'], showfliers=True)

In [None]:
fs_box = titanic.boxplot(['FamilySize'], showfliers=True)

#### Conditional box plots

In [None]:
fare_by_cs = titanic.boxplot(['Fare'], by=['Pclass', 'Survived'])

In [None]:
age_by_cs = titanic.boxplot(['Age'], by=['Pclass', 'Survived'])

In [None]:
family_by_cs = titanic.boxplot(['FamilySize'], by=['Pclass', 'Survived'])

<a id='cc'></a>
### 3.C Correlations ([to top](#top))

A correlation coefficient is a numerical measure of some type of correlation, meaning a statistical relationship between two variables.

Several types of correlation coefficients exist, each with its own definition, range of applicability, and characteristics. What they have in common is that they take values in the range from −1 to +1, where +1 indicates the strongest possible agreement and −1 the strongest possible disagreement. By default, pandas uses the Pearson correlation.
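
To illustrate how two coefficients can differ, the sketch below compares Pearson (linear association) with Spearman (rank-based, monotonic association) on a made-up relationship that is perfectly monotonic but not linear.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5, 6])
toy = pd.DataFrame({"x": x, "y": x ** 3})  # y grows monotonically but not linearly

pearson = toy.corr().loc["x", "y"]                    # default method is 'pearson'
spearman = toy.corr(method="spearman").loc["x", "y"]  # computed on ranks
print(round(pearson, 3), round(spearman, 3))
```

Spearman reaches 1 because the ranks of `x` and `y` agree exactly, while Pearson stays below 1 because the points do not lie on a straight line.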

In [None]:
# Target label
Target = ['Survived']

titanic_1 = titanic[['Sex','Pclass', 'Embarked', 'Title', 'Age', 'Fare', 'FamilySize', 'IsAlone', 'Survived']]
titanic_1.head()

In [None]:
# Discrete Variable Correlation by Survival 

for x in titanic_1:
    if titanic_1[x].dtype != 'float64' and x!=Target[0]:
        print('\nSurvival Correlation by:', x)
        cor = titanic_1[[x, Target[0]]].groupby(x).mean()
        print(cor)

We can observe that the following groups are more likely to survive:
 - Female passengers
 - 1st class passengers
 - Passengers who embarked at C
 - Those who are not alone
 - Those with a FamilySize between 2 and 4
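
A contingency table is another compact way to read such group-wise survival rates. The sketch below uses `pd.crosstab` with row normalization on made-up rows, so each row shows the share of dead (0) and surviving (1) passengers within that group.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex":      ["female", "female", "male", "male", "male", "female"],
    "Survived": [1, 1, 0, 0, 1, 0],
})

# normalize='index' divides each row by its total, so rows sum to 1
table = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")
print(table)
```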

#### Correlation matrix

The correlation matrix computes the Pearson correlation coefficients of the columns of a matrix. That is, row i and column j of the correlation matrix is the correlation between column i and column j of the original matrix. Note that the diagonal elements of the correlation matrix will be 1 (since they are the correlation of a column with itself). The correlation matrix is also symmetric since the correlation of column i with column j is the same as the correlation of column j with column i.
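
The properties described above can be verified numerically. The following sketch builds a correlation matrix from random data (the frame `toy` and its column names are made up) and checks the unit diagonal, the symmetry, and that an entry matches the pairwise coefficient of the two columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])

corr = toy.corr()  # Pearson by default

unit_diagonal = np.allclose(np.diag(corr.values), 1.0)  # each column vs itself
is_symmetric = np.allclose(corr.values, corr.values.T)  # corr(i, j) == corr(j, i)
pairwise = toy["a"].corr(toy["b"])                      # same value as corr.loc['a', 'b']
print(unit_diagonal, is_symmetric)
```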

In [None]:
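# correlations are defined only for numeric columns, so select them explicitly
# (recent pandas versions raise an error on string columns such as Sex or Title)
corr = titanic_1.select_dtypes(include='number').corr()
plt.subplots(figsize =(14, 12))
hm = sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values, annot=True)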

#### Scatter plots

A scatter plot (also called a scatterplot, scatter graph, scatter chart, scattergram, or scatter diagram) is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data.

In [None]:
from pandas.plotting import scatter_matrix

In [None]:
sm = scatter_matrix(titanic_1)

In [None]:
af = titanic.plot.scatter(x='Age', y='Fare')

In [None]:
af = titanic.plot.scatter(x='Age', y='FamilySize')