# Titanic Data #

## Questions on data ##

   - Was it more likely to survive as a male or female passenger?
   - Was it more likely to survive as a first, second or third class passenger?
   - Was it more likely to survive depending on the port of embarkation?
   - Was it more likely to survive as a family member or as a single person?
   - Was it more likely to survive in certain age groups?

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.stats.api as sms
from scipy import stats

filename = 'titanic-data.csv'
titanic_df = pd.read_csv(filename)

## Analysis of data in dataset ##

### Structure of dataset ###

|PassengerId|Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked|
|-|-|-|-|-|-|-|-|-|-|-|-|
|Passenger Id|Survival (0 = No; 1 = Yes)|Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)|Name|Sex|Age|Number of Siblings/Spouses Aboard|Number of Parents/Children Aboard|Ticket Number|Passenger Fare|Cabin|Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)|

#### Example data ####

In [None]:
titanic_df.head()

### Problems in the dataset ###

In [None]:
titanic_df.count()

'Age' is not available for all passengers.

'Cabin' is not available for all passengers. (This does not matter, as we are not interested in this field.)

'Embarked' is not available for all passengers.
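
The gaps listed above can also be counted directly. The following is a minimal sketch on a hypothetical toy frame (`toy_df` and its values are made up for illustration and are not the Titanic data); it assumes missing entries are stored as NaN, as `pd.read_csv` does by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for titanic_df
toy_df = pd.DataFrame({
    'Age': [22.0, np.nan, 38.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Embarked': ['S', 'C', np.nan, 'Q'],
})

# Number of missing entries per column
missing_counts = toy_df.isnull().sum()
print(missing_counts)
```

On the real data, `titanic_df.isnull().sum()` would report the missing entries per column, complementing the non-missing counts from `count()`.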

### Data Cleaning and Wrangling ###

We remove the columns Name, Ticket and Fare, as we do not need them to answer our questions.

We extract the deck from the cabin and remove the Cabin column. We derive a new field that indicates whether someone is a family member, based on Parch and SibSp. We remove those two fields afterwards.

We derive age groups from the Age column and remove the Age column afterwards.

In [None]:
del titanic_df['Name']
del titanic_df['Ticket']
del titanic_df['Fare']

In [None]:
def get_deck(row):
    # The deck is the first letter of the cabin; missing cabins (NaN) stay NaN
    if pd.notnull(row['Cabin']):
        return row['Cabin'][0]
    return row['Cabin']
    
# We add a new column to the dataset
titanic_df['Deck'] = titanic_df.apply(get_deck, axis=1)


del titanic_df['Cabin']
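
As an alternative to the row-wise `apply`, the first letter can be taken with pandas string methods. This is only a sketch on a hypothetical toy frame (`toy_df` and its cabin values are illustrative assumptions), not the notebook's own method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the cabin values are illustrative only
toy_df = pd.DataFrame({'Cabin': ['C85', np.nan, 'E46', 'B57 B59']})

# .str[0] takes the first character of each string and leaves NaN untouched
toy_df['Deck'] = toy_df['Cabin'].str[0]
print(toy_df)
```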

In [None]:
# Definition: A passenger is a family member if either SibSp or Parch is > 0
def is_family_member(row):
    return not(row['Parch'] == 0 and row['SibSp'] == 0)

# We add a new column to the dataset
titanic_df['IsFamilyMember'] = titanic_df.apply(is_family_member, axis=1)

# We remove the Parch and SibSp columns
del titanic_df['Parch']
del titanic_df['SibSp']
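
The same definition can be expressed without a row-wise function. The sketch below uses a hypothetical toy frame (`toy_df` is an assumption for illustration); for non-negative counts it gives the same result as `is_family_member`.

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative only
toy_df = pd.DataFrame({'SibSp': [0, 1, 0, 2], 'Parch': [0, 0, 3, 1]})

# A passenger is a family member if either SibSp or Parch is > 0
toy_df['IsFamilyMember'] = (toy_df['SibSp'] > 0) | (toy_df['Parch'] > 0)
print(toy_df)
```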

In [None]:
def age_group(row):
    age = row['Age']
    if age > 0 and age <= 13:
        return 'Child'
    elif age > 13 and age <= 18:
        return 'Teenager'
    elif age > 18 and age <= 21:
        return 'Young Adult'
    elif age > 21 and age <= 55:
        return 'Adult'
    elif age > 55:
        return 'Senior'
    else:
        # Missing ages (NaN) fail every comparison above and are passed through
        return row['Age']

titanic_df['AgeGroup'] = titanic_df.apply(age_group, axis=1)
del titanic_df['Age']
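
The same binning can be sketched with `pd.cut`, which by default uses right-closed intervals such as (0, 13] and (13, 18], matching the boundaries in `age_group`. The series `ages` below is hypothetical test input, not data from the file.

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to hit the bin boundaries
ages = pd.Series([0.5, 13, 14, 18, 21, 22, 55, 56, np.nan])

bins = [0, 13, 18, 21, 55, np.inf]
labels = ['Child', 'Teenager', 'Young Adult', 'Adult', 'Senior']

# Right-closed intervals: (0, 13], (13, 18], (18, 21], (21, 55], (55, inf]
age_groups = pd.cut(ages, bins=bins, labels=labels)
print(age_groups)
```

A side effect of this variant is that the result is an ordered categorical, so plots and group-by output follow the order of `labels` rather than alphabetical order.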

#### Wrangled Data ####

In [None]:
titanic_df.head(50)

## Questions on data ##

### Overall Survival Probability ###

Survival Probability of all passengers in dataset:

In [None]:
titanic_df['Survived'].mean()

###  Was it more likely to survive as male or female passenger? ###

In [None]:
survival_df_female = titanic_df[titanic_df['Sex'] == 'female']['Survived']
survival_df_male = titanic_df[titanic_df['Sex'] == 'male']['Survived']

Survival rate by sex:

In [None]:
titanic_df.groupby('Sex')['Survived'].mean()

In [None]:
%matplotlib inline
sns.catplot(x = 'Sex', y = 'Survived', data = titanic_df, kind = 'bar')

#### Statistical Test ####

##### Null Hypothesis: #####

$H_0 : \mu_{female} - \mu_{male} = 0$

The survival probability of female and male passengers is the same.

##### Alternate Hypothesis: #####

$H_A : \mu_{female} - \mu_{male} > 0$

The survival probability of female passengers is higher than the survival probability of male passengers.

We will perform a **two-tailed unpaired t-test**. We use a t-test because we do not know the population standard deviation, and the samples are large enough for the sample means to be approximately normally distributed. It is unpaired because the groups of female and male passengers are independent of each other. Although $H_A$ is directional, we report the two-tailed p-value, which is the more conservative choice. We choose an alpha level $\alpha = .01$.


###### Mean 

In [None]:
titanic_df.groupby('Sex')['Survived'].mean()

$\bar{x}_{female} = 0.742038$

$\bar{x}_{male} = 0.188908$

###### Mean difference

In [None]:
survival_df_female.mean() - survival_df_male.mean()

$\bar{x}_{female} - \bar{x}_{male} = 0.5531300709799203$

###### Standard deviation

In [None]:
titanic_df.groupby('Sex')['Survived'].std()

$S_{female} = 0.438211$

$S_{male} = 0.391775$

###### Standard error of the mean

In [None]:
stats.sem(survival_df_female)

$SEM_{female} = 0.024729688908190332$

In [None]:
stats.sem(survival_df_male)

$SEM_{male} = 0.016309818218993685$

###### Standard error of the difference

In [None]:
# Unpooled standard error of the difference between the two group means
np.sqrt(survival_df_female.var() / survival_df_female.count() + survival_df_male.var() / survival_df_male.count())

$SE = 0.029623768899862974$
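
As a check, this value can be reproduced by hand from the standard deviations reported earlier. The group sizes $n_{female} = 314$ and $n_{male} = 577$ are an assumption here: they are not printed in the notebook, but they are consistent with $df = 889$ and with the two standard errors of the mean.

$$SE = \sqrt{\frac{S_{female}^2}{n_{female}} + \frac{S_{male}^2}{n_{male}}} = \sqrt{\frac{0.438211^2}{314} + \frac{0.391775^2}{577}} \approx 0.0296$$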

###### Degrees of freedom

In [None]:
df = survival_df_female.count() + survival_df_male.count() - 2
df

$df = n_{female} + n_{male} - 2 = 889$

###### t and p value

In [None]:
t, p = stats.ttest_ind(survival_df_female,survival_df_male)

In [None]:
t

$t = 19.297816550123351$

In [None]:
p


$p = 1.4060661308802594e-69$
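
Note that `stats.ttest_ind` pools the two variances by default (`equal_var=True`), so the standard error it uses differs slightly from the unpooled standard error of the difference computed earlier. The sketch below recomputes the pooled t statistic by hand on hypothetical 0/1 samples (`group_a` and `group_b` are made up, not the Titanic data) and also shows the one-sided p-value that corresponds to a directional $H_A$.

```python
import numpy as np
from scipy import stats

# Hypothetical 0/1 samples for illustration only (not the Titanic data)
group_a = np.array([1, 1, 1, 0, 1, 1, 0, 1])
group_b = np.array([0, 0, 1, 0, 0, 1, 0, 0, 0, 0])

n_a, n_b = len(group_a), len(group_b)
dof = n_a + n_b - 2

# Pooled variance, as used by ttest_ind when equal_var=True (the default)
pooled_var = ((n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1)) / dof
se_pooled = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

t_manual = (group_a.mean() - group_b.mean()) / se_pooled
p_two_sided = 2 * stats.t.sf(abs(t_manual), dof)
p_one_sided = stats.t.sf(t_manual, dof)  # for H_A: mean_a - mean_b > 0

t_scipy, p_scipy = stats.ttest_ind(group_a, group_b)
print(t_manual, t_scipy)
print(p_two_sided, p_scipy, p_one_sided)
```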

###### Confidence interval

In [None]:
cm = sms.CompareMeans(sms.DescrStatsW(survival_df_female), sms.DescrStatsW(survival_df_male))
cm.tconfint_diff(alpha=.01, alternative='two-sided')
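
The interval is the mean difference plus or minus the critical t value times the standard error. The following sketch computes it by hand with a pooled standard error on hypothetical samples (`group_a` and `group_b` are made up); that this matches the default behaviour of `tconfint_diff` is an assumption.

```python
import numpy as np
from scipy import stats

# Hypothetical 0/1 samples for illustration only (not the Titanic data)
group_a = np.array([1, 1, 1, 0, 1, 1, 0, 1])
group_b = np.array([0, 0, 1, 0, 0, 1, 0, 0, 0, 0])

n_a, n_b = len(group_a), len(group_b)
dof = n_a + n_b - 2

# Pooled standard error of the difference between the means
pooled_var = ((n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1)) / dof
se_pooled = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

alpha = 0.01
t_crit = stats.t.ppf(1 - alpha / 2, dof)  # two-sided critical value

diff = group_a.mean() - group_b.mean()
ci_low, ci_high = diff - t_crit * se_pooled, diff + t_crit * se_pooled
print(ci_low, ci_high)
```

With the values reported in this notebook (difference of about 0.553, pooled standard error of about 0.553 / 19.30 ≈ 0.0287, critical value of roughly 2.58 for 889 degrees of freedom) the same formula gives approximately (0.48, 0.63).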

###### Effect size

In [None]:
pow(t,2)/(pow(t,2) + df)

$r^2 = \frac{t^2}{t^2 + df} = .30$
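
The arithmetic behind this value can be checked in isolation with the t value and degrees of freedom reported above (the variable names are chosen here for illustration):

```python
# Values copied from the t-test results above
t_value = 19.297816550123351
dof = 889

# Proportion of variance in survival explained by sex
r_squared = t_value ** 2 / (t_value ** 2 + dof)
print(round(r_squared, 2))  # 0.3
```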

#### Result ####

$t(889) = 19.30,\ p < .0001$, two-tailed
    
$99\% CI = (0.48, 0.63)$

$r^2 = .30$

I **reject** $H_0$, because the t value is in the critical region with p < .0001. As p is below .0001, the difference is considered to be **extremely statistically significant**. The effect size is $r^2 = .30$, i.e. sex accounts for about 30% of the variance in survival.

This means that the survival rate of female passengers in the dataset (74.2%) is significantly higher than the survival rate of male passengers (18.9%), and also above the overall survival rate (38.4%). This is consistent with female passengers (and children) being rescued first. (["Women and children first" code of conduct](https://en.wikipedia.org/wiki/Women_and_children_first))

### Was it more likely to survive as a first, second or third class passenger? ###

Survival rate by passenger class:

In [None]:
titanic_df.groupby('Pclass')['Survived'].mean()

In [None]:
%matplotlib inline
sns.catplot(x = 'Pclass', y ='Survived', data = titanic_df, kind = 'bar')

#### Where are the cabins of each passenger class located?

##### Titanic deck layout

![Titanic deck layout](https://upload.wikimedia.org/wikipedia/commons/thumb/8/84/Titanic_cutaway_diagram.png/400px-Titanic_cutaway_diagram.png "Titanic deck layout")

In [None]:
# We drop all NaN entries in the Deck subset, we also drop the entry with the deck 'T' as 
# there was no deck 'T'
deck_titanic_df = titanic_df[titanic_df.Deck != 'T'].dropna(subset=['Deck'])

In [None]:
%matplotlib inline
sns.countplot(y="Deck", hue="Pclass", data=deck_titanic_df.sort_values(by='Deck'));

In [None]:
deck_titanic_df.groupby(['Pclass']).Survived.count()

In [None]:
deck_titanic_df[deck_titanic_df['Pclass'] == 1].groupby(['Pclass','Deck']).Survived.mean()
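
Because some decks contain only a few passengers with a known cabin, it can help to show the group size next to each survival rate. This is a sketch on a hypothetical toy frame (`toy_df` and its values are illustrative assumptions):

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative only
toy_df = pd.DataFrame({
    'Deck': ['A', 'A', 'B', 'B', 'B', 'C'],
    'Survived': [1, 0, 1, 1, 0, 1],
})

# Survival rate and number of passengers per deck
deck_summary = toy_df.groupby('Deck')['Survived'].agg(['mean', 'count'])
print(deck_summary)
```

A rate based on a single passenger, as for deck C in the toy data, says little on its own.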

#### Conclusion ####

The survival probability of first class passengers was the highest (63.0%). On the other hand, the third class had a very low probability of surviving the disaster (24.2%). One possible explanation is that first class passengers were rescued preferentially because of their social status.

Another idea is that the passengers with cabins at the top of the ship were rescued first. Therefore we checked on which decks the cabins of each passenger class were located.


### Was it more likely to survive depending on the port of embarkation? ###

Survival rate by port of embarkation:

In [None]:
titanic_df.groupby('Embarked')['Survived'].mean()

In [None]:
%matplotlib inline
sns.catplot(x = 'Embarked', y = 'Survived', data = titanic_df, kind = 'bar')

Share of passengers in each class by port of embarkation:

In [None]:
titanic_df.groupby(['Embarked','Pclass',]).size() / titanic_df.groupby('Embarked').size()
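
The same shares can be laid out as a table with one row per port. The sketch below uses `pd.crosstab` with `normalize='index'` on a hypothetical toy frame (`toy_df` and its values are illustrative assumptions):

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative only
toy_df = pd.DataFrame({
    'Embarked': ['C', 'C', 'C', 'S', 'S', 'Q'],
    'Pclass':   [1, 1, 3, 3, 2, 3],
})

# Share of each passenger class within each port; every row sums to 1
class_share = pd.crosstab(toy_df['Embarked'], toy_df['Pclass'], normalize='index')
print(class_share)
```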

#### Conclusion ####

The probability of surviving the Titanic disaster was highest for passengers who boarded in Cherbourg.
A likely reason is that a large share of the passengers who embarked in Cherbourg had first class tickets.

### Was it more likely to survive as a family member or as a single person? ###

In [None]:
titanic_df.groupby('IsFamilyMember')['Survived'].mean()

In [None]:
titanic_df.groupby(['Pclass','IsFamilyMember'])['Survived'].mean()
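
To compare family members and single passengers side by side within each class, the grouped result can be pivoted with `unstack()`. This is a sketch on a hypothetical toy frame (`toy_df` and its values are illustrative assumptions):

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative only
toy_df = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'IsFamilyMember': [True, False, True, True, False, False],
    'Survived': [1, 1, 0, 1, 0, 0],
})

# Rows: passenger class, columns: IsFamilyMember, cells: survival rate
survival_table = (
    toy_df.groupby(['Pclass', 'IsFamilyMember'])['Survived'].mean().unstack()
)
print(survival_table)
```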

In [None]:
%matplotlib inline
sns.catplot(x = 'IsFamilyMember', y ='Survived', data = titanic_df, kind = 'bar')

### Survival by Age Group

In [None]:
titanic_df.groupby('AgeGroup')['Survived'].count().sort_values()

In [None]:
titanic_df.groupby('AgeGroup')['Survived'].mean().sort_values()

In [None]:
%matplotlib inline
sns.catplot(x = 'AgeGroup', y ='Survived', data = titanic_df, kind = 'bar')