# Titanic Data #

## Questions on data ##

   - Were male or female passengers more likely to survive?
   - Were first, second or third class passengers more likely to survive?
   - Did the likelihood of survival differ by port of embarkation?
   - Were family members or single travellers more likely to survive?
   
   

In [39]:
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.stats.api as sms
from scipy import stats

filename = 'titanic-data.csv'
titanic_df = pd.read_csv(filename)

## Analysis of data in dataset ##

### Structure of dataset ###

|PassengerId|Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked|
|-|-|-|-|-|-|-|-|-|-|-|-|
|Passenger Id|Survival (0 = No; 1 = Yes)|Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)|Name|Sex|Age|Number of Siblings/Spouses Aboard|Number of Parents/Children Aboard|Ticket Number|Passenger Fare|Cabin|Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)|

#### Example data ####

In [40]:
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### Problems in the dataset ###

In [41]:
titanic_df.count()

PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64

'Age' is missing for 177 of the 891 passengers.

'Cabin' is missing for 687 passengers. This does not matter, as we are not interested in this field.

'Embarked' is missing for 2 passengers.
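
The missing values can also be counted directly instead of being read off the `count()` output. The following is a minimal sketch on a made-up toy frame (`toy_df` and its values are assumptions for illustration, not rows from `titanic-data.csv`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for titanic_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [None, 'C85', None, None],
    'Embarked': ['S', 'C', None, 'S'],
})

# isnull() marks missing cells; sum() counts them per column.
missing_counts = toy_df.isnull().sum()
print(missing_counts)
```

Applied to `titanic_df`, the same call should report the gaps in 'Age', 'Cabin' and 'Embarked' in one step.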

### Data Wrangling ###

We remove the columns Name, Ticket, Fare and Cabin, as we do not need them to answer our questions.

In [42]:
# Drop the columns that are not needed for the analysis
titanic_df = titanic_df.drop(columns=['Name', 'Ticket', 'Fare', 'Cabin'])

#### Wrangled Data ####

In [43]:
titanic_df.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Embarked
0,1,0,3,male,22.0,1,0,S
1,2,1,1,female,38.0,1,0,C
2,3,1,3,female,26.0,0,0,S
3,4,1,1,female,35.0,1,0,S
4,5,0,3,male,35.0,0,0,S
5,6,0,3,male,,0,0,Q
6,7,0,1,male,54.0,0,0,S
7,8,0,3,male,2.0,3,1,S
8,9,1,3,female,27.0,0,2,S
9,10,1,2,female,14.0,1,0,C


## Questions on data ##

### Overall Survival Probability ###

Survival Probability of all passengers in dataset:

In [44]:
# Select the column first so that non-numeric columns (Sex, Embarked) are not averaged
titanic_df['Survived'].mean()

0.38383838383838381

### Were male or female passengers more likely to survive? ###

In [45]:
survival_df_female = titanic_df[titanic_df['Sex'] == 'female']['Survived']
survival_df_male = titanic_df[titanic_df['Sex'] == 'male']['Survived']

Survival rate by sex:

In [46]:
titanic_df.groupby('Sex')['Survived'].mean()

Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64

In [47]:
# factorplot was renamed to catplot in newer seaborn releases
sns.catplot(x='Sex', y='Survived', data=titanic_df, kind='bar')

<seaborn.axisgrid.FacetGrid at 0xc3596d8>

#### Statistical Test ####

##### Null Hypothesis: #####

$H_0 : \mu_{female} - \mu_{male} = 0$

The survival probability of female and male passengers is the same.

##### Alternate Hypothesis: #####

$H_A : \mu_{female} - \mu_{male} > 0$

The survival probability of female passengers is higher than the survival probability of male passengers.

We will perform an **unpaired t-test**. 'Survived' is a 0/1 variable and therefore not normally distributed itself, but the samples are large (891 passengers in total), so the sample means are approximately normal; we use a t-test because we do not know the population standard deviation. It is unpaired because the groups of female and male passengers are independent of each other. We choose an alpha level of $\alpha = .05$. Note that $H_A$ is directional, while `stats.ttest_ind` reports a two-tailed p-value by default; for a positive $t$, the one-tailed p-value is half of the reported value.


In [48]:
titanic_df.groupby('Sex')['Survived'].mean()

Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64

$\bar{x}_{female} = 0.742038$

$\bar{x}_{male} = 0.188908$

In [49]:
survival_df_female.mean() - survival_df_male.mean()

0.5531300709799203

$\bar{x}_{female} - \bar{x}_{male} = 0.5531300709799203$

In [50]:
titanic_df.groupby('Sex')['Survived'].std()

Sex
female    0.438211
male      0.391775
Name: Survived, dtype: float64

$S_{female} = 0.438211$

$S_{male} = 0.391775$

In [51]:
stats.sem(survival_df_female)

0.024729688908190332

$SEM_{female} = 0.024729688908190332$

In [52]:
stats.sem(survival_df_male)

0.016309818218993685

$SEM_{male} = 0.016309818218993685$

In [58]:
# Standard error of the difference in means (unpooled variances)
(survival_df_female.std() ** 2 / survival_df_female.count() + survival_df_male.std() ** 2 / survival_df_male.count()) ** 0.5

0.029623768899862974

$SE = 0.029623768899862974$

In [54]:
survival_df_female.count() + survival_df_male.count() - 2

889

$df = n_{female} + n_{male} - 2 = 889$

In [60]:
stats.ttest_ind(survival_df_female,survival_df_male)

Ttest_indResult(statistic=19.297816550123351, pvalue=1.4060661308802594e-69)

$t = 19.297816550123351$

$p = 1.4060661308802594e-69$
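
Dividing the mean difference by the SE computed above gives about 18.67, not the 19.30 reported by `stats.ttest_ind`. The likely reason is that `ttest_ind` pools the two sample variances by default (`equal_var=True`), whereas the SE above uses the unpooled formula. The following sketch reproduces the pooled calculation by hand and derives the one-tailed p-value that matches $H_A$. It runs on synthetic 0/1 data (`group_a`, `group_b` and their parameters are assumptions for illustration, not the Titanic groups):

```python
import numpy as np
from scipy import stats

# Synthetic 0/1 outcomes standing in for the two groups (not the Titanic data).
rng = np.random.default_rng(0)
group_a = rng.binomial(1, 0.7, size=300)
group_b = rng.binomial(1, 0.2, size=500)

n_a, n_b = len(group_a), len(group_b)
dof = n_a + n_b - 2
mean_diff = group_a.mean() - group_b.mean()

# Pooled variance: weighted average of the two sample variances (ddof=1).
pooled_var = ((n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1)) / dof
se_pooled = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
t_manual = mean_diff / se_pooled

# scipy's default (equal_var=True) uses the same pooled standard error.
t_scipy, p_two_tailed = stats.ttest_ind(group_a, group_b)

# One-tailed p-value for H_A: mean_a - mean_b > 0 (upper tail of the t distribution).
p_one_tailed = stats.t.sf(t_manual, df=dof)

print(t_manual, t_scipy, p_one_tailed, p_two_tailed)
```

With `survival_df_female` and `survival_df_male` in place of the synthetic arrays, `t_manual` should match the statistic reported above.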

In [62]:
cm = sms.CompareMeans(sms.DescrStatsW(survival_df_female), sms.DescrStatsW(survival_df_male))
cm.tconfint_diff(usevar='unequal')

ValueError: usevar can only be "pooled" or "unequal"
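
The call above raised a `ValueError` in the recorded run, so no confidence interval is shown. As a fallback, the interval for the difference in means can be computed directly with scipy. This is a minimal sketch assuming unequal variances (Welch's approach with Satterthwaite degrees of freedom); the helper name `welch_diff_confint` and the sample values are hypothetical:

```python
import numpy as np
from scipy import stats

def welch_diff_confint(sample_a, sample_b, alpha=0.05):
    """Two-sided (1 - alpha) CI for mean(sample_a) - mean(sample_b), unequal variances."""
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    # Variance of each sample mean
    var_mean_a = a.var(ddof=1) / len(a)
    var_mean_b = b.var(ddof=1) / len(b)
    se = np.sqrt(var_mean_a + var_mean_b)
    # Welch-Satterthwaite approximation of the degrees of freedom
    dof = (var_mean_a + var_mean_b) ** 2 / (
        var_mean_a ** 2 / (len(a) - 1) + var_mean_b ** 2 / (len(b) - 1)
    )
    t_crit = stats.t.ppf(1 - alpha / 2, dof)
    diff = a.mean() - b.mean()
    return diff - t_crit * se, diff + t_crit * se

# Made-up 0/1 outcomes, only to show the call.
low, high = welch_diff_confint([1, 1, 0, 1, 1, 0], [0, 0, 1, 0, 0, 0])
print(low, high)
```

Calling `welch_diff_confint(survival_df_female, survival_df_male)` would give the interval for the Titanic groups under the same assumption.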

#### Conclusion ####

The survival rate of the female passengers in the dataset (74.2%) is higher than the overall survival rate (38.4%) and much higher than that of the male passengers (18.9%). The t-test ($t \approx 19.30$, $p \approx 1.4 \times 10^{-69}$) rejects the null hypothesis at $\alpha = .05$. A likely explanation is that women (and children) were rescued first (["Women and children first" code of conduct](https://en.wikipedia.org/wiki/Women_and_children_first)).

### Were first, second or third class passengers more likely to survive? ###

Survival rate by passenger class:

In [None]:
titanic_df.groupby('Pclass')['Survived'].mean()

In [None]:
# factorplot was renamed to catplot in newer seaborn releases
sns.catplot(x='Pclass', y='Survived', data=titanic_df, kind='bar')

#### Conclusion ####

The survival probability of first class passengers was the highest (63.0 %). Third class passengers, on the other hand, had a very low probability of surviving the disaster (24.2 %). One possible explanation is that first class passengers were accommodated higher up in the ship; social status most probably played a role as well. According to [The layout of the titanic](http://www.dummies.com/education/history/titanic-facts-the-layout-of-the-ship/), the first class cabins were located on Deck C, while second and third class passengers were located on Decks D and E. As the cabins of the second and third class were on the same decks, yet the survival probability was much lower for the third class, cabin location alone does not seem to explain the difference.

![Titanic deck layout](https://upload.wikimedia.org/wikipedia/commons/thumb/8/84/Titanic_cutaway_diagram.png/400px-Titanic_cutaway_diagram.png "Titanic deck layout")



### Did the likelihood of survival differ by port of embarkation? ###

Survival rate by port of embarkation:

In [None]:
titanic_df.groupby('Embarked')['Survived'].mean()

In [None]:
# factorplot was renamed to catplot in newer seaborn releases
sns.catplot(x='Embarked', y='Survived', data=titanic_df, kind='bar')

Share of passengers by port of embarkation and class:

In [None]:
# Share of each passenger class within each port of embarkation
titanic_df.groupby(['Embarked', 'Pclass']).size().div(titanic_df.groupby('Embarked').size(), level='Embarked')

#### Conclusion ####

The probability of surviving the Titanic disaster was highest for passengers who boarded in Cherbourg.
A likely reason is that a large share of the passengers who embarked in Cherbourg held first class tickets.
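
One way to check whether class composition explains the difference between ports is to compare survival rates within each class across ports. The following is a minimal sketch on a made-up toy frame (`toy_df` and its rows are assumptions for illustration, not the Titanic data):

```python
import pandas as pd

# Invented rows, only to demonstrate the two tables.
toy_df = pd.DataFrame({
    'Embarked': ['C', 'C', 'C', 'S', 'S', 'S', 'S', 'Q'],
    'Pclass':   [1, 1, 3, 1, 3, 3, 3, 3],
    'Survived': [1, 1, 0, 1, 0, 0, 1, 0],
})

# Share of each class within each port (every row sums to 1).
class_share = pd.crosstab(toy_df['Embarked'], toy_df['Pclass'], normalize='index')

# Survival rate per (port, class) cell; unstack() puts the classes in columns.
survival_by_port_class = (
    toy_df.groupby(['Embarked', 'Pclass'])['Survived'].mean().unstack()
)

print(class_share)
print(survival_by_port_class)
```

If the survival rates within a class are similar across ports, the port effect is mostly a class effect; cells without passengers show up as NaN.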

### Were family members or single travellers more likely to survive? ###

In [None]:
# Definition: a passenger is a family member if either SibSp or Parch is > 0.
# The comparison is vectorized, so no row-wise apply() is needed.
# We add the result as a new column to the dataset.
titanic_df['IsFamilyMember'] = (titanic_df['Parch'] > 0) | (titanic_df['SibSp'] > 0)

In [None]:
titanic_df.groupby('IsFamilyMember')['Survived'].mean()

In [None]:
titanic_df.groupby(['Pclass', 'IsFamilyMember'])['Survived'].mean()

In [None]:
# factorplot was renamed to catplot in newer seaborn releases
sns.catplot(x='IsFamilyMember', y='Survived', data=titanic_df, kind='bar')
