# Question: My Heart Will Go On

![](https://camo.githubusercontent.com/78ca11f9a2e6c36bbee928124a7d3f9abc3abb2b/68747470733a2f2f696d672d73332e6f6e6564696f2e636f6d2f69642d3537616336353563393365613835613733323935343639652f7265762d302f7261772f732d613730613530323939633033303464336535383266356230373338613366653730396533613564662e6a7067)

The RMS Titanic was a British passenger liner that sank in the North Atlantic Ocean in the early morning hours of 15 April 1912, after it collided with an iceberg during its maiden voyage from Southampton to New York City. There were an estimated 2,224 passengers and crew aboard the ship, and more than 1,500 died, making it one of the deadliest commercial peacetime maritime disasters in modern history. The RMS Titanic was the largest ship afloat at the time it entered service and was the second of three Olympic-class ocean liners operated by the White Star Line. The Titanic was built by the Harland and Wolff shipyard in Belfast. Thomas Andrews, her architect, died in the disaster.

The film Titanic is a 1997 American epic romance and disaster film directed, written, co-produced, and co-edited by James Cameron. Based on accounts of the sinking of the RMS Titanic, it incorporates both historical and fictionalized aspects and stars Leonardo DiCaprio and Kate Winslet as members of different social classes who fall in love aboard the ship during its ill-fated maiden voyage.

**Titanic dataset (titanic.xlsx)**
The file `titanic.xlsx` contains detailed information on the passengers aboard. Each column is described in the data dictionary below.

Data Dictionary 

| Variable        | Definition           | Key  |
| ------------- |:-------------:| -----:|
| survived      | Survival | 0 = No, 1 = Yes |
| pclass      | Ticket class      |   1 = 1st, 2 = 2nd, 3 = 3rd |
| sex         | Gender   |      |
| age | Age in years      |     |
| sibsp | # of siblings / spouses aboard the Titanic      |   Sibling = brother, sister Spouse = husband, wife |
| parch | # of parents / children aboard the Titanic      |     |
| fare | Passenger fare      |     |
| cabin | Cabin number      |     |
| embarked | Port of Embarkation     |   C = Cherbourg, Q = Queenstown, S = Southampton  |
| class | Class of tickets      |  First, Second, Third class   |
| who   | Identity              |  man, woman, child            |
| adult_male |  Is male adult or not | True, False              |
| embark_town | The town of embarkation  | Cherbourg, Queenstown, Southampton |
| alive       | same as the survived  | no, yes |
| alone       | Is alone or not       | True, False | 

Read `titanic.xlsx` and show how many passenger records the data contains.

Because of errors in the historical archives, there are several problems you need to address first in order to obtain the correct data:

1. In the column of *`sibsp`*, the value of 1 is mistakenly recorded as -1
2. In the column of *`survived`*, the value of 0 is mistakenly recorded as NaN
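
The two corrections above, together with the record count, could be handled roughly as follows. This is a minimal sketch on a small made-up frame: the name `toy` and its values are illustrative assumptions, not rows from `titanic.xlsx`, and the same calls would be applied to `titanicDF` after it is loaded.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the raw sheet; the values are invented for illustration.
toy = pd.DataFrame({
    "survived": [1.0, np.nan, np.nan, 1.0],
    "sibsp": [-1, 0, 2, -1],
})

# 1. A recorded -1 in sibsp stands for 1.
toy["sibsp"] = toy["sibsp"].replace(-1, 1)

# 2. A missing value in survived stands for 0; cast back to integers afterwards.
toy["survived"] = toy["survived"].fillna(0).astype(int)

# The number of passenger records is the number of rows.
n_records = len(toy)
print(f"There are {n_records} passenger records.")
```

Replacing only the exact value -1 leaves every other `sibsp` value untouched, and filling before casting avoids the error that `astype(int)` raises on NaN.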

In [2]:
import pandas as pd
import numpy as np

In [3]:
features= ['survived','pclass','sex','age','sibsp','parch','fare','cabin','embarked','class','who','adult_male','embark_town','alive','alone']
titanicDF = pd.read_excel('titanic.xlsx',header = 20).set_index('Unnamed: 0')

titanicDF

| Unnamed: 0 | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True |  | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False |  | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True |  | Southampton | no | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 0 | 2 | male | 27.0 | 0 | 0 | 13.0000 | S | Second | man | True |  | Southampton | no | True |
| 887 | 1 | 1 | female | 19.0 | 0 | 0 | 30.0000 | S | First | woman | False | B | Southampton | yes | True |
| 888 | 0 | 3 | female |  | 1 | 2 | 23.4500 | S | Third | woman | False |  | Southampton | no | False |
| 889 | 1 | 1 | male | 26.0 | 0 | 0 | 30.0000 | C | First | man | True | C | Cherbourg | yes | True |


Show the percentage of male and female passengers:

In [4]:
male = titanicDF[titanicDF['sex']=='male']
male_perc = (len(male)/len(titanicDF))*100
female = titanicDF[titanicDF['sex']=='female']
female_perc = (len(female)/len(titanicDF))*100

print(f'In terms of percentage, there are {round(female_perc,2)}% of females and {round(male_perc,2)}% of male aboard the Titanic.')

In terms of percentage, there are 35.24% of females and 64.76% of male aboard the Titanic.


Show the average number of siblings/spouses for passengers who embarked from Southampton:

In [5]:
southamptonInfo = titanicDF[titanicDF['embark_town']=='Southampton']
av_sibsp = southamptonInfo['sibsp'].sum()/len(southamptonInfo)
print(f'The average number of siblings/spouses for passengers embarked from Southampton is {round(av_sibsp,2)} per passenger.')

The average number of siblings/spouses for passengers embarked from Southampton is 0.57 per passenger.


Show the median age of adult male passengers:

In [6]:
adult_male = titanicDF[titanicDF['adult_male']== True]
# Select the numeric column first; DataFrame.median() fails on text columns in recent pandas
median_age = adult_male['age'].median()
print(f'The median age of passengers that are adult male is {median_age} years')

The median age of passengers that are adult male is 30.0 years


Show the difference in mean fare between First Class and Third Class passengers:

In [7]:
fare_class = titanicDF[['pclass','fare']].groupby('pclass').mean()
fare_dif = fare_class.iloc[0] - fare_class.iloc[2]

fare_dif

fare    70.479137
dtype: float64

Show the survival status of the passengers with the 10 highest fares:

In [8]:
high_fare = titanicDF[['fare','survived']].sort_values('fare',ascending=False)
top10 = high_fare.head(10).set_index('fare')
top10

| fare | survived |
| --- | --- |
| 512.3292 | 1 |
| 512.3292 | 1 |
| 512.3292 | 1 |
| 263.0 | 1 |
| 263.0 | 0 |
| 263.0 | 1 |
| 263.0 | 0 |
| 262.375 | 1 |
| 262.375 | 1 |
| 247.5208 | 0 |


Show the survival rate of men, women and children, respectively:

In [18]:
man_sur = titanicDF[(titanicDF['who']=='man') & (titanicDF['survived']== 1)]
man = titanicDF[(titanicDF['who']=='man')]
man_rate = (len(man_sur)/len(man))*100

woman_sur = titanicDF[(titanicDF['who']=='woman') & (titanicDF['survived']== 1)]
woman = titanicDF[(titanicDF['who']=='woman')]
woman_rate = (len(woman_sur)/len(woman))*100

child_sur = titanicDF[(titanicDF['who']=='child') & (titanicDF['survived']== 1)]
child = titanicDF[(titanicDF['who']=='child')]
child_rate = (len(child_sur)/len(child))*100

print(f' Male rate: {round(man_rate,2)}% \n Female rate: {round(woman_rate,2)}% \n Children rate: {round(child_rate,2)}%')


 Male rate: 16.39% 
 Female rate: 75.65% 
 Children rate: 59.04%
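
Because `survived` is coded as 0/1, its mean within a group is that group's survival rate, so the three filtered computations above can also be written as a single `groupby`. The sketch below runs on a made-up frame (`toy` and its values are assumptions for illustration, not rows from the dataset):

```python
import pandas as pd

# Hypothetical rows, invented only to demonstrate the pattern.
toy = pd.DataFrame({
    "who": ["man", "man", "woman", "woman", "child", "man"],
    "survived": [0, 1, 1, 1, 1, 0],
})

# Mean of a 0/1 column per group = share of survivors in that group.
rates = toy.groupby("who")["survived"].mean().mul(100).round(2)
print(rates)
```

On the real data the equivalent call would be `titanicDF.groupby('who')['survived'].mean()`, assuming `survived` has already been cleaned to contain only 0 and 1.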


Did the number of siblings/spouses aboard affect the odds of survival for male and female passengers? Show the survival rate by number of siblings/spouses for male and female passengers:

In [10]:
male_sib = male[['survived','sibsp']].groupby('sibsp').sum()
male_sib_total = male[['survived','sibsp']].groupby('sibsp').count()

female_sib = female[['survived','sibsp']].groupby('sibsp').sum()
female_sib_total = female[['survived','sibsp']].groupby('sibsp').count()

male_sib_survivalrate = (male_sib/male_sib_total)*100
female_sib_survivalrate = (female_sib/female_sib_total)*100

print(f'Male survival rate per sibling: \n\n{male_sib_survivalrate}')
print(f'\nFemale survival rate per sibling: \n\n{female_sib_survivalrate}')  

Male survival rate per sibling: 

        survived
sibsp           
0      16.820276
1      31.067961
2      20.000000
3       0.000000
4       8.333333
5       0.000000
8       0.000000

Female survival rate per sibling: 

        survived
sibsp           
0      78.735632
1      75.471698
2      76.923077
3      36.363636
4      33.333333
5       0.000000
8       0.000000


Show the number of passengers across groups of cabin class, gender, and port of embarkation:

In [11]:
embark_town = titanicDF[['embark_town','fare']].groupby('embark_town').count()
embark_town

| embark_town | fare |
| --- | --- |
| Cherbourg | 168 |
| Queenstown | 77 |
| Southampton | 644 |


In [12]:
sex = titanicDF[['sex','fare']].groupby('sex').count()
sex

| sex | fare |
| --- | --- |
| female | 314 |
| male | 577 |
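
The two cells above count passengers by port and by sex separately. If the question is read as asking for counts across all three characteristics jointly, one option is to group on the three columns together and take the group sizes. A minimal sketch on a made-up frame (`toy` and its rows are assumptions, not data from `titanic.xlsx`):

```python
import pandas as pd

# Hypothetical passengers, invented for illustration.
toy = pd.DataFrame({
    "class": ["First", "Third", "Third", "First", "Second"],
    "sex": ["female", "male", "male", "female", "male"],
    "embark_town": ["Cherbourg", "Southampton", "Southampton",
                    "Cherbourg", "Queenstown"],
})

# size() counts rows per group, regardless of missing values in other columns.
counts = toy.groupby(["class", "sex", "embark_town"]).size()
print(counts)
```

Unlike `count()` on a single column such as `fare`, `size()` does not skip rows where that column happens to be missing.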


Now you need to show the top 5 demographic groups with the highest survival rate. The demographic group should consider the following three characteristics: *sex, ticket class and whether the passenger is alone*:

In [13]:
demographic = titanicDF[['sex','class','alone','survived']].groupby(['sex','class','alone']).sum()
demographic_total = titanicDF[['sex','class','alone','survived']].groupby(['sex','class','alone']).count()
demographic_rate = round((demographic/demographic_total)*100,2)

demographic_sorted = demographic_rate.sort_values('survived',ascending=False)
demographic_sorted.head(5)

| sex | class | alone | survived |
| --- | --- | --- | --- |
| female | First | True | 97.06 |
| female | First | False | 96.67 |
| female | Second | False | 93.18 |
| female | Second | True | 90.62 |
| female | Third | True | 61.67 |


Now show the survival rate of passengers across the five quintiles of ticket fare, that is, the survival rate among passengers whose fares fall in the top 20%, 20-40%, 40-60%, 60-80% and bottom 20%. Can you identify a difference between male and female passengers?

In [16]:
titanicDF['quantile'] = pd.qcut(titanicDF['fare'],q=[0,0.2,0.4,0.6,0.8,1],labels=['0% - 20%', '20%-40%', '40%-60%', '60%-80%','80%-100%'])
titanicDF.groupby(['sex','quantile'])[['survived']].sum()/titanicDF.groupby(['sex','quantile'])[['survived']].count()

| sex | quantile | survived |
| --- | --- | --- |
| female | 0% - 20% | 0.684211 |
| female | 20%-40% | 0.527778 |
| female | 40%-60% | 0.701299 |
| female | 60%-80% | 0.685714 |
| female | 80%-100% | 0.924731 |
| male | 0% - 20% | 0.092199 |
| male | 20%-40% | 0.121622 |
| male | 40%-60% | 0.2 |
| male | 60%-80% | 0.290909 |
| male | 80%-100% | 0.325301 |
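
To make the binning step easier to follow: `pd.qcut` cuts a column at its own quantiles, so each of the five bins holds roughly the same number of passengers. The sketch below shows this on ten made-up fares (the `toy` frame and its values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical fares 1..10 with invented survival flags.
toy = pd.DataFrame({
    "fare": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "survived": [0, 0, 0, 1, 0, 1, 1, 0, 1, 1],
})

labels = ["0%-20%", "20%-40%", "40%-60%", "60%-80%", "80%-100%"]
# q=5 splits at the 20th, 40th, 60th and 80th percentiles of fare.
toy["quantile"] = pd.qcut(toy["fare"], q=5, labels=labels)

# Each bin receives two of the ten fares.
bin_sizes = toy["quantile"].value_counts().sort_index()

# Mean of the 0/1 survived column per bin = survival rate per bin.
rate = toy.groupby("quantile", observed=True)["survived"].mean()
print(bin_sizes)
print(rate)
```

Note that the `0%-20%` label marks the cheapest fifth of fares and `80%-100%` the most expensive fifth.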
