# Titanic Data Set - Statistics Review

[Variable Descriptions](https://www.kaggle.com/c/titanic/data).

In [5]:
import scipy as sp
import numpy as np
import pandas as pd
from scipy import stats
from matplotlib import pyplot as plt
%matplotlib inline

**Describe the data.**
- How big?
- What are the columns and what do they mean?

In [39]:
df = pd.read_csv('titanic.csv')
df.shape
df.dtypes
df.head()

(891, 12)

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


**What’s the average age of:**

- Any Titanic passenger
- A survivor
- A non-surviving first-class passenger
- Male survivors older than 30 from anywhere but Queenstown

In [63]:
avg_ages = pd.Series([df.Age.mean(),
                      df[df['Survived'] == 1].Age.mean(),
                      df[(df['Survived'] == 0) & (df['Pclass'] == 1)].Age.mean(),
                      df[(df['Sex'] == 'male') &
                         (df['Survived'] == 1) &
                         (df['Age'] > 30) &
                         (df['Embarked'] != 'Q')].Age.mean()])
type(avg_ages)
avg_ages.index = ['All Passengers', 'Survivors', 'Non-Survivors, 1st', 'Male, Survivors, 30+, Not QT']
avg_ages

pandas.core.series.Series

All Passengers                  29.699118
Survivors                       28.343690
Non-Survivors, 1st              43.695312
Male, Survivors, 30+, Not QT    41.487805
dtype: float64

**For the groups from the previous task, how far (in years) are the average ages from the median ages?**

In [69]:
median_ages = pd.Series([df.Age.median(),
                         df[df['Survived'] == 1].Age.median(),
                         df[(df['Survived'] == 0) & (df['Pclass'] == 1)].Age.median(),
                         df[(df['Sex'] == 'male') &
                            (df['Survived'] == 1) &
                            (df['Age'] > 30) &
                            (df['Embarked'] != 'Q')].Age.median()])
median_ages.index = ['All Passengers', 'Survivors', 'Non-Survivors, 1st', 'Male, Survivors, 30+, Not QT']
diff_ages = median_ages - avg_ages
diff_ages

All Passengers                 -1.699118
Survivors                      -0.343690
Non-Survivors, 1st              1.554688
Male, Survivors, 30+, Not QT   -3.487805
dtype: float64


**What’s the most common:**
- Passenger class
- Port of Embarkation
- Number of siblings or spouses aboard for survivors

In [73]:
df.Pclass.mode()
df.Embarked.mode()
df[df['Survived'] == 1].SibSp.mode()  # restrict to survivors, as the question asks

0    3
dtype: int64

0    S
dtype: object

0    0
dtype: int64

**Within what range of standard deviations from the mean (0-1, 1-2, 2-3) is the median ticket price? Is it above or below the mean?**

In [77]:
mu = df.Fare.mean()
sd = df.Fare.std()
med = df.Fare.median()
(med - mu) / sd  # z-score of the median fare; a negative value means it lies below the mean

-0.3571902456652297

**How much more expensive was the 90th percentile ticket than the 5th percentile ticket? Are they the same class?**

In [81]:
ninetieth = df.Fare.quantile(q=.9)
fifth = df.Fare.quantile(q=.05)  # 5th percentile, as the question asks
ninetieth
fifth
ninetieth - fifth


77.9583

7.225

70.7333


**The highest average ticket price was paid by passengers from which port? Null ports don’t count.**

In [84]:
pd.pivot_table(df, index='Embarked')

| Embarked | Age | Fare | Parch | PassengerId | Pclass | SibSp | Survived |
|---|---|---|---|---|---|---|---|
| C | 30.814769 | 59.954144 | 0.363095 | 445.357143 | 1.886905 | 0.386905 | 0.553571 |
| Q | 28.089286 | 13.27603 | 0.168831 | 417.896104 | 2.909091 | 0.428571 | 0.38961 |
| S | 29.445397 | 27.079812 | 0.413043 | 449.52795 | 2.350932 | 0.571429 | 0.336957 |
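
The pivot table above averages every numeric column, so the answer has to be read off the `Fare` column by eye. A narrower sketch, using a small made-up frame (`sample` is hypothetical, not the Titanic data), groups only the fares and asks for the port with the largest mean. `groupby` drops null keys by default, which matches the "null ports don't count" condition.

```python
import pandas as pd

# Hypothetical mini-sample standing in for the Titanic frame; not the real data.
sample = pd.DataFrame({
    "Embarked": ["C", "C", "Q", "S", "S", None],
    "Fare": [80.0, 40.0, 10.0, 20.0, 30.0, 500.0],
})

# groupby excludes rows whose key is null by default, so the None port is ignored.
mean_fare_by_port = sample.groupby("Embarked")["Fare"].mean()

# idxmax returns the index label (the port) with the largest mean fare.
top_port = mean_fare_by_port.idxmax()

print(mean_fare_by_port)
print("Highest average fare:", top_port)
```

On the real frame the same two lines would be `df.groupby('Embarked')['Fare'].mean()` followed by `.idxmax()`.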


**What is the most common passenger class for each port?**

In [92]:
pd.pivot_table(df, index='Embarked', values='Pclass', aggfunc=stats.mode)

| Embarked | Pclass |
|---|---|
| C | ([1], [85]) |
| Q | ([3], [72]) |
| S | ([3], [353]) |
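
Each cell above is the raw `(mode, count)` pair returned by `stats.mode`. If only the class itself is wanted, one possible alternative is to aggregate with the pandas `Series.mode` method. The frame below is a hypothetical stand-in, and taking the first mode on ties is an assumption of this sketch.

```python
import pandas as pd

# Hypothetical mini-sample; not the real Titanic data.
sample = pd.DataFrame({
    "Embarked": ["C", "C", "C", "Q", "Q", "S", "S", "S"],
    "Pclass":   [1,   1,   3,   3,   3,   3,   2,   3],
})

# Series.mode() returns all tied values in sorted order;
# .iat[0] keeps the smallest one (an arbitrary tie-breaking choice).
modal_class = sample.groupby("Embarked")["Pclass"].agg(lambda s: s.mode().iat[0])

print(modal_class)
```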



**What fraction of surviving 1st-class males paid lower than double the overall median ticket price?**

In [102]:
df.Fare.median()
2 * df.Fare.median()
df1 = df[(df['Sex'] == 'male') & (df['Survived'] == 1) & (df['Pclass'] == 1)].Fare
df1.head()
stats.percentileofscore(df1, 2 * df.Fare.median(), kind='strict')

14.4542

28.9084

23     35.5000
55     35.5000
97     63.3583
187    26.5500
209    31.0000
Name: Fare, dtype: float64

24.444444444444443
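
Note that `stats.percentileofscore` reports a percentage, so the value above corresponds to a fraction of roughly 0.244. As a cross-check, the same quantity can be sketched as the mean of a boolean mask. The `fares` values below are hypothetical; only the threshold is taken from the cell output above.

```python
import pandas as pd

# Hypothetical fares for illustration; not the real subset of passengers.
fares = pd.Series([10.0, 20.0, 28.9084, 35.5, 63.0])
threshold = 28.9084  # double the overall median fare, as printed above

# A strict "<" mirrors kind='strict'; the mean of a boolean Series is the fraction of True values.
fraction_below = (fares < threshold).mean()

print(fraction_below)
```

On the real data this would be `(df1 < 2 * df.Fare.median()).mean()`.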

**How much older/younger was the average surviving passenger with family members than the average non-surviving passenger without them?**

In [112]:
df_surv_wfam = df[(df['Survived'] == 1) & ((df['SibSp'] > 0) | (df['Parch'] > 0))].Age
df_surv_wfam.head()
df_surv_wfam.mean()
df_nsurv_wofam = df[(df['Survived'] != 1) & ((df['SibSp'] == 0) & (df['Parch'] == 0))].Age
df_nsurv_wofam.head()
df_nsurv_wofam.mean()
df_surv_wfam.mean() - df_nsurv_wofam.mean()

1     38.0
3     35.0
8     27.0
9     14.0
10     4.0
Name: Age, dtype: float64

25.526062500000002

4     35.0
5      NaN
6     54.0
12    20.0
14    14.0
Name: Age, dtype: float64

32.41423357664234

-6.888171076642337

**Display the relationship (i.e. make a plot) between survival rate and the quantile of the ticket price for 20 integer quantiles.**
- To be clear: split the fares into 20 quantiles and, for each quantile, divide the number of survivors by the total number of passengers in that quantile. That gives the survival rate for that quantile.
- Then plot a line of the survival rate against the ticket fare quantiles.
- Make sure you label your axes.
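
One way the steps above might be sketched is shown here. The helper name `survival_rate_by_fare_quantile` and the synthetic `demo` frame are illustrative assumptions, not part of the assignment. Many Titanic fares are identical, which can make `pd.qcut` fail on duplicate bin edges, so this sketch ranks the fares first; that tie-breaking rule is a choice, not the only valid one.

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

def survival_rate_by_fare_quantile(frame, n_quantiles=20):
    """Return the survival rate within each of n_quantiles equal-sized fare bins."""
    # Ranking with method="first" breaks ties between identical fares,
    # so qcut can always form n_quantiles bins (an assumption of this sketch).
    ranks = frame["Fare"].rank(method="first")
    quantile = (pd.qcut(ranks, n_quantiles, labels=False) + 1).rename("fare_quantile")
    # Survived is 0/1, so its mean is survivors / passengers in each bin.
    return frame.groupby(quantile)["Survived"].mean()

# Synthetic stand-in data; replace `demo` with the real DataFrame.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Fare": rng.exponential(scale=30.0, size=400).round(2),
    "Survived": rng.integers(0, 2, size=400),
})

rates = survival_rate_by_fare_quantile(demo)

fig, ax = plt.subplots()
ax.plot(rates.index, rates.values, marker="o")
ax.set_xlabel("Ticket fare quantile (1 = cheapest 5%)")
ax.set_ylabel("Survival rate")
ax.set_title("Survival rate by fare quantile")
```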

**For each of the following characteristics, find the median in the data:**
- Age
- Ticket price
- Siblings/spouses
- Parents/children

**If you were to use these medians to draw numerical boundaries separating survivors from non-survivors, which of these characteristics would be the best choice and why?**
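
A possible starting point for the two questions above: compute each median, then compare the survival rate on either side of it. The helper `median_split_report` and the `demo` rows are hypothetical, and using the gap in survival rates as the yardstick is only one reasonable criterion.

```python
import numpy as np
import pandas as pd

def median_split_report(frame, columns):
    """For each column, compare survival rates above vs. at-or-below its median."""
    rows = []
    for col in columns:
        med = frame[col].median()              # NaN values are skipped by default
        known = frame[frame[col].notna()]      # ignore passengers with a missing value
        above = known.loc[known[col] > med, "Survived"].mean()
        at_or_below = known.loc[known[col] <= med, "Survived"].mean()
        rows.append({"feature": col, "median": med,
                     "rate_above": above, "rate_at_or_below": at_or_below,
                     "gap": abs(above - at_or_below)})
    return pd.DataFrame(rows).set_index("feature")

# Hypothetical rows for illustration; replace `demo` with the real DataFrame.
demo = pd.DataFrame({
    "Age":      [22, 38, 26, 35, np.nan, 54, 2, 27],
    "Fare":     [7.25, 71.28, 7.93, 53.10, 8.05, 51.86, 21.08, 11.13],
    "SibSp":    [1, 1, 0, 1, 0, 0, 3, 0],
    "Parch":    [0, 0, 0, 0, 0, 0, 1, 2],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1],
})

report = median_split_report(demo, ["Age", "Fare", "SibSp", "Parch"])
print(report)
```

A larger gap suggests the median of that feature separates survivors from non-survivors more cleanly, though a column whose median is 0 leaves very few distinct values on one side.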

**Plot the distribution of passenger ages. Choose visually-meaningful bin sizes and label your axes.**

**Find the probability that:**
- A passenger survived
- A passenger was male
- A passenger was female and had at least one sibling or spouse on board
- A survivor was from Cherbourg
- A passenger was less than 10 years old
- A passenger was between 25 and 40 years old
- A passenger was either younger than 20 years old or older than 50
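
Each probability in the list above can be estimated as the mean of a boolean mask. The sketch below uses a small hypothetical frame. It reads "a survivor was from Cherbourg" as a conditional probability and computes the age probabilities only over passengers with a recorded age; both are assumptions about what the question intends.

```python
import pandas as pd

# Hypothetical rows for illustration; replace `demo` with the real DataFrame.
demo = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 1, 0, 0, 1],
    "Sex": ["female", "male", "female", "male", "male", "male", "female", "female"],
    "SibSp": [1, 0, 0, 1, 0, 0, 2, 0],
    "Embarked": ["C", "S", "S", "Q", "C", "S", "S", "C"],
    "Age": [8.0, 30.0, 45.0, None, 19.0, 60.0, 26.0, 38.0],
})

p_survived = (demo["Survived"] == 1).mean()
p_male = (demo["Sex"] == "male").mean()
p_female_with_sibsp = ((demo["Sex"] == "female") & (demo["SibSp"] >= 1)).mean()

# Conditional on survival: restrict to survivors first, then take the fraction from Cherbourg.
survivors = demo[demo["Survived"] == 1]
p_cherbourg_given_survived = (survivors["Embarked"] == "C").mean()

# Age has missing values; dropping them is an assumption about the intended denominator.
age = demo["Age"].dropna()
p_under_10 = (age < 10).mean()
p_25_to_40 = age.between(25, 40).mean()          # inclusive on both ends
p_young_or_old = ((age < 20) | (age > 50)).mean()
```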

**Knowing nothing else about the passengers aside from the survival rate of the population (see question above), if I choose 100 passengers at random from the passenger list, what’s the probability that exactly 42 passengers survive?**

**What’s the probability that at least 42 of those 100 passengers survive?**
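
If each of the 100 passengers is treated as an independent draw with the population survival rate, the number of survivors follows a binomial distribution, and both questions reduce to calls on `scipy.stats.binom`. The sketch assumes the rate is 342 survivors out of 891 rows; on the real frame, `df.Survived.mean()` would be used instead. Sampling from a finite list without replacement is strictly hypergeometric, so the binomial model is an approximation.

```python
from scipy import stats

n = 100
p = 342 / 891  # assumed population survival rate; use df.Survived.mean() on the real data

# P(X = 42): probability mass at exactly 42 survivors.
p_exactly_42 = stats.binom.pmf(42, n, p)

# P(X >= 42) = P(X > 41): the survival function sf(k) gives P(X > k).
p_at_least_42 = stats.binom.sf(41, n, p)

print(p_exactly_42, p_at_least_42)
```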

**Take random samples of 100 passengers and find out how many you need before the fraction of those samples where at least 42 passengers survive matches the probability you calculated previously (within Δp≈0.05).**

Answers will vary based on chosen seeds. What would happen if you drew every sample with the same seed?

Plot the survival fraction vs the number of random samples.
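
A minimal simulation sketch for the sampling task above. The `survived` array is a stand-in for `df.Survived` (assuming 342 ones and 549 zeros), and the seed, the 2,000 draws, and the use of the binomial tail as the target are all arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats
from matplotlib import pyplot as plt

# Stand-in for df.Survived.values: 342 survivors among 891 passengers (assumption).
survived = np.array([1] * 342 + [0] * 549)

rng = np.random.default_rng(seed=42)  # one generator, reused for every draw
n_samples = 2000
hits = np.empty(n_samples, dtype=bool)
for i in range(n_samples):
    sample = rng.choice(survived, size=100, replace=False)
    hits[i] = sample.sum() >= 42      # did at least 42 of the 100 survive?

# Fraction of samples so far with at least 42 survivors.
running_fraction = np.cumsum(hits) / np.arange(1, n_samples + 1)

# Target probability from the binomial model.
target = stats.binom.sf(41, 100, survived.mean())

# First sample count at which the running fraction is within 0.05 of the target.
within = np.abs(running_fraction - target) <= 0.05
first_match = int(np.argmax(within)) + 1 if within.any() else None

fig, ax = plt.subplots()
ax.plot(np.arange(1, n_samples + 1), running_fraction, label="Simulated fraction")
ax.axhline(target, linestyle="--", label="Binomial probability")
ax.set_xlabel("Number of random samples")
ax.set_ylabel("Fraction of samples with at least 42 survivors")
ax.legend()
```

The first match can happen early by chance, so it is worth checking where the curve stays inside the tolerance band rather than where it first touches it. Re-creating the generator with the same seed inside the loop would reproduce the identical sample every time, so the running fraction would stay at exactly 0 or 1.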

**Is there a statistically significant difference between:**
- The ages of male and female survivors?
- The fares paid by passengers from Queenstown and the passengers from Cherbourg?

**Use a 95% confidence level.**

**Accompany your p-values with histograms showing the distributions of both compared populations.**
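
One way to approach these comparisons is a two-sided Welch's t-test, which does not assume equal variances, paired with overlaid histograms. The helper `compare_groups` and the synthetic arrays are hypothetical. Choosing Welch's test is an assumption of this sketch, and a rank-based test may suit the heavily skewed fares better.

```python
import numpy as np
from scipy import stats
from matplotlib import pyplot as plt

def compare_groups(a, b, labels, xlabel, alpha=0.05):
    """Two-sided Welch's t-test plus overlaid histograms of both groups."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a, b = a[~np.isnan(a)], b[~np.isnan(b)]   # drop missing values (e.g. unknown ages)

    t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)

    fig, ax = plt.subplots()
    ax.hist(a, bins=20, alpha=0.5, label=labels[0])
    ax.hist(b, bins=20, alpha=0.5, label=labels[1])
    ax.set_xlabel(xlabel)
    ax.set_ylabel("Number of passengers")
    ax.legend()
    # At a 95% confidence level, the difference is significant when p < 0.05.
    return float(p_value), bool(p_value < alpha)

# Synthetic stand-in ages; on the real data these would be, for example,
# df[(df.Survived == 1) & (df.Sex == 'male')].Age and the female counterpart.
rng = np.random.default_rng(1)
male_ages = rng.normal(loc=28, scale=14, size=100)
female_ages = rng.normal(loc=29, scale=14, size=200)

p_value, significant = compare_groups(
    male_ages, female_ages, labels=("Male survivors", "Female survivors"), xlabel="Age (years)"
)
print(p_value, significant)
```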

**Did survivors pay more for their tickets than those who did not survive? Use a 95% confidence level.**

**Did a given first-class passenger have fewer family members on board than a given third-class passenger? Use a 95% confidence level.**
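
The last two questions are directional ("more", "fewer"), so a one-sided test is a natural fit. The sketch below wraps Welch's t-test with `alternative="greater"`, which requires SciPy 1.6 or newer. The helper name `one_sided_welch` and the synthetic fares are illustrative assumptions. For the family question, the family size could be taken as `SibSp + Parch`, with third class passed as the `greater` group.

```python
import numpy as np
from scipy import stats

def one_sided_welch(greater, lesser, alpha=0.05):
    """Test H1: mean(greater) > mean(lesser) using Welch's t-test."""
    greater = np.asarray(greater, dtype=float)
    lesser = np.asarray(lesser, dtype=float)
    greater = greater[~np.isnan(greater)]   # drop missing values before testing
    lesser = lesser[~np.isnan(lesser)]
    result = stats.ttest_ind(greater, lesser, equal_var=False, alternative="greater")
    # Reject the null at the 95% confidence level when p < 0.05.
    return float(result.pvalue), bool(result.pvalue < alpha)

# Synthetic stand-in fares; on the real data these would be
# df[df.Survived == 1].Fare and df[df.Survived == 0].Fare.
rng = np.random.default_rng(2)
survivor_fares = rng.exponential(scale=50.0, size=300)
non_survivor_fares = rng.exponential(scale=20.0, size=500)

p_value, significant = one_sided_welch(survivor_fares, non_survivor_fares)
print(p_value, significant)
```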