Homework 1: Analysis of 'titanic.csv' file using Python and Pandas.

The code below performs the required analysis of the Titanic passenger pool and answers each of the questions in the assignment.

Author: Jose Solomon

In [15]:
# Import pandas, numpy, seaborn, and matplotlib
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# In-line plotting
%matplotlib inline

In [16]:
# Load the data using 'pandas'
dfTitanic = pd.read_csv('titanic.csv')

In [17]:
#1. How many passengers are in our passenger list? 
dfTitanic.PassengerId.count()

891

In [19]:
# 2. What is the overall survival rate? 

# Given that a passenger who survived is marked as 1 and one who did not is marked as 0, taking the mean of
# the 'Survived' column gives the survival rate
survivalRate = dfTitanic.Survived.mean()
print("%.2f%%" % (survivalRate * 100))

38.38%
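
The claim that the mean of a 0/1 column equals the survival rate can be checked on a tiny hand-made frame. This is only a minimal sketch: the `toy` frame and its values are made up for illustration and are not taken from 'titanic.csv'.

```python
import pandas as pd

# Hypothetical toy data: 1 = survived, 0 = did not survive
toy = pd.DataFrame({'Survived': [1, 0, 0, 1, 0]})

# Rate computed explicitly as survivors divided by passengers
explicit_rate = float(toy.Survived.sum()) / len(toy)

# Rate computed as the mean of the 0/1 column
mean_rate = toy.Survived.mean()

print("%.2f%%" % (mean_rate * 100))
```

Both expressions give the same value because summing a 0/1 column counts the ones.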


In [20]:
# A key distinction is the total number of passengers in the log versus the passengers that actually boarded
# the ship. The log has an entry under 'Embarked' for the port at which each passenger boarded. As
# stated on Kaggle: C = Cherbourg, Q = Queenstown, S = Southampton.

# If a passenger does not have an embarkation entry, I assume the person purchased a ticket but did not board.

# For every question that defines the population pool in terms of 'onboard' members, I check the
# 'Embarked' entry of the passenger log. I do, however, count the passengers who did not embark as survivors, since
# they did in fact survive the ordeal, as indicated by their values in the 'Survived' column.

# How many people got onboard?
print("Total number of people on list that embarked:")
print(dfTitanic.Embarked.count())

Total number of people on list that embarked:
889


In [21]:
# Who did not get onboard?
luckyPeople = dfTitanic[dfTitanic.Embarked.isnull()]
luckyPeople

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
61,62,1,1,"Icard, Miss. Amelie",female,38,0,0,113572,80,B28,
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62,0,0,113572,80,B28,


So two lucky ladies did not board the ship: both were in first class and, for some reason, share the same ticket number and the same cabin. I count these two passengers as survivors, but I exclude them from the solution set whenever a question specifies the 'onboard' stipulation.

In [22]:
# 3. How many male passengers were onboard?
dfTitanic[(dfTitanic.Sex == 'male')].Embarked.count()

577

In [23]:
# 4. How many female passengers were onboard?
dfTitanic[(dfTitanic.Sex == 'female')].Embarked.count()

312
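
The two counts for questions 3 and 4 could also be produced in one step by filtering on the embarkation entry first and then grouping by sex. The sketch below uses a made-up `toy` frame (not the real data) in which one passenger has no 'Embarked' value; the names `onboard_mask`, `onboard`, and `onboard_by_sex` are mine.

```python
import numpy as np
import pandas as pd

# Hypothetical toy log: the third passenger has no embarkation entry
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Survived': [0, 1, 1, 0],
    'Embarked': ['S', 'C', np.nan, 'Q'],
})

# Boolean mask: True where an embarkation port is recorded
onboard_mask = toy.Embarked.notnull()
onboard = toy[onboard_mask]

# Head count per sex among the passengers treated as onboard
onboard_by_sex = onboard.groupby('Sex').size()
print(onboard_by_sex)
```

Keeping the mask in its own variable makes the 'onboard' assumption explicit and easy to reuse in later questions.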

In [24]:
# Just for my own insight
# dfTitanic[(dfTitanic.Survived == 1)].groupby(['Sex']).count()

In [25]:
# The following two questions are answered together in the table below:

# 5. What is the overall survival rate of male passengers?
# 6. What is the overall survival rate of female passengers?

# Count the survivors and all passengers per sex
survived = dfTitanic[dfTitanic.Survived == 1].groupby('Sex').Sex.count()
total = dfTitanic.groupby('Sex').Sex.count()

# Index the counts by label ('female', 'male') rather than by position
survivalFrame = pd.DataFrame([[float(survived['female']) / total['female']],
                              [float(survived['male']) / total['male']]],
                             ['Female', 'Male'], columns=['Survival Rate'])
                                                                    
survivalFrame.head()

Unnamed: 0,Survival Rate
Female,0.742038
Male,0.188908
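
Since 'Survived' is a 0/1 column, the same table could probably be obtained more directly by grouping on 'Sex' and taking the mean. The sketch below runs on a made-up `toy` frame, so its numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy data, not taken from 'titanic.csv'
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male', 'male'],
    'Survived': [1, 1, 0, 1, 0],
})

# Mean of the 0/1 'Survived' column within each sex = survival rate per sex
rate_by_sex = toy.groupby('Sex').Survived.mean()

# Turn the resulting Series into a one-column table
survival_frame = rate_by_sex.to_frame('Survival Rate')
print(survival_frame)
```

This avoids building the counts by hand and does not depend on the order in which the groups are listed.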


In [26]:
# 7. What is the average age of all passengers onboard?
#    * How did you calculate this average age?
#    * Note that some of the passengers do not have an age value. 
#      How did you deal with this? What are some other ways of dealing with this?

# Create a data frame with no nulls for the 'age' entry
noNullAge = dfTitanic[dfTitanic.Age.notnull()]

# Now check for those passengers that did not get on the ship
embarkedPassengers = noNullAge[noNullAge.Embarked.notnull()]

# Now calculate the average age of those that have an age recorded in the log
print("%.2f years old" % embarkedPassengers.Age.mean())

# Another way to deal with this is to take the average directly from the 'Age' column, since
# pandas skips NaN values by default when it calculates a mean. Note that this only handles the missing
# ages: the two passengers without an embarkation entry have recorded ages, so they still have to be
# filtered out separately. A further option is to fill the missing ages with an estimate such as the median age.

29.64 years old
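
The options for handling missing ages can be compared side by side. The following is a minimal sketch on made-up ages (the `toy` frame is hypothetical); median imputation is shown as one possible alternative, not as the method used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages with two missing values
toy = pd.DataFrame({'Age': [20.0, np.nan, 30.0, 70.0, np.nan]})

# Option 1: drop the missing ages explicitly, then average
mean_dropped = toy.Age.dropna().mean()

# Option 2: rely on mean() skipping NaN values by default (skipna=True)
mean_skipna = toy.Age.mean()

# Option 3: fill the missing ages with the median of the known ages
ages_imputed = toy.Age.fillna(toy.Age.median())
mean_imputed = ages_imputed.mean()

print("dropped: %.1f, skipna: %.1f, imputed: %.1f"
      % (mean_dropped, mean_skipna, mean_imputed))
```

Options 1 and 2 agree by construction, while imputing with the median pulls the mean toward the median whenever the two differ.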


In [27]:
# The following two questions are answered together in the table below:

# 8. What is the average age of passengers who survived?
# 9. What is the average age of passengers who did not survive?

# Note: I include the two passengers that did not embark as survivors... very lucky they overslept.

# Average age of the passengers that survived and of those that did not
survived = dfTitanic[(dfTitanic.Survived == 1)].Age.mean()
notSurvived = dfTitanic[(dfTitanic.Survived == 0)].Age.mean()

survivalFrame = pd.DataFrame([survived, notSurvived],['Average age of those that survived',
                                                     'Average age of those that did not'],
                             columns=['Average Ages'])
survivalFrame.head()

Unnamed: 0,Average Ages
Average age of those that survived,28.34369
Average age of those that did not,30.626179
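
The two averages could also come from a single groupby on 'Survived'. This is a sketch on a made-up `toy` frame; as in the cell above, missing ages are skipped by `mean()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age
toy = pd.DataFrame({
    'Survived': [1, 1, 1, 0, 0],
    'Age': [20.0, 30.0, np.nan, 40.0, 60.0],
})

# Average age within each outcome group (0 = did not survive, 1 = survived)
age_by_outcome = toy.groupby('Survived').Age.mean()
print(age_by_outcome)
```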


Question 10. At this (early) point in our analysis, what might you infer about any patterns you are seeing?

1. Female passengers had a much higher chance of survival than their male counterparts (about 74% versus 19%).
2. Younger passengers tended to fare slightly better than older ones: survivors averaged about 28.3 years of age
    versus 30.6 for those who did not survive. It would be interesting to
    see how the ticket class of the passenger affects the survival rate.
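
One way to probe the age pattern further would be to bin the ages and compare survival rates per band. The sketch below is an assumption on my part rather than part of the assignment: the `toy` data, the bin edges, and the band labels are all made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, not taken from 'titanic.csv'
toy = pd.DataFrame({
    'Age': [4, 10, 25, 35, 50, 70],
    'Survived': [1, 1, 0, 1, 0, 0],
})

# Illustrative bin edges: (0, 18], (18, 40], (40, 100]
bins = [0, 18, 40, 100]
labels = ['child', 'adult', 'senior']

# Assign each passenger to an age band; store the band as a plain string
toy['AgeBand'] = pd.cut(toy.Age, bins=bins, labels=labels).astype(str)

# Survival rate within each age band
rate_by_band = toy.groupby('AgeBand').Survived.mean()
print(rate_by_band)
```

A comparison by band can reveal differences that a single average age per outcome hides.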

In [28]:
# 11. How many passengers are in each of the three classes of service (e.g. First,
# Second, Third?)
classCounts = dfTitanic.groupby('Pclass').Pclass.count()
# Look the counts up by class label (1, 2, 3) rather than by position
totalFirst = int(classCounts.loc[1])
totalSecond = int(classCounts.loc[2])
totalThird = int(classCounts.loc[3])
classFrame = pd.DataFrame([totalFirst, totalSecond, totalThird],
                          ['1st Class', '2nd Class', '3rd Class'],
                          columns=['Passengers'])
classFrame.head()

Unnamed: 0,Passengers
1st Class,216
2nd Class,184
3rd Class,491


In [29]:
# 12. What is the survival rate for passengers in each of the three classes of service?
survivedCounts = dfTitanic[dfTitanic.Survived == 1].groupby('Pclass').Pclass.count()
# Divide the survivors in each class by the class totals from question 11
survivalR1st = float(survivedCounts.loc[1]) / totalFirst
survivalR2nd = float(survivedCounts.loc[2]) / totalSecond
survivalR3rd = float(survivedCounts.loc[3]) / totalThird
survivalClassFrame = pd.DataFrame([survivalR1st, survivalR2nd, survivalR3rd],
                                  ['1st Class', '2nd Class', '3rd Class'],
                                  columns=['Survival Rate'])
survivalClassFrame.head()

Unnamed: 0,Survival Rate
1st Class,0.62963
2nd Class,0.472826
3rd Class,0.242363
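
The class sizes and the per-class survival rates could also be computed together with one groupby and two aggregations. This is a minimal sketch on a made-up `toy` frame; the column names 'Passengers' and 'Survival Rate' are my own labels.

```python
import pandas as pd

# Hypothetical toy data, not taken from 'titanic.csv'
toy = pd.DataFrame({
    'Pclass': [1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 0, 0, 1, 0],
})

# 'count' gives the class size, 'mean' of the 0/1 column gives the survival rate
summary = toy.groupby('Pclass').Survived.agg(['count', 'mean'])
summary.columns = ['Passengers', 'Survival Rate']
print(summary)
```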


Question 13. What else might you conclude?

As expected, passengers in a higher class had a much greater chance of survival:
roughly 63% in first class, 47% in second class, and 24% in third class.



Question 14. Last, if we were to build a predictive model, which features in the data do you
think we should include in the model and which can we leave out? Why?

From this cursory review of the data, the boarding class, gender,
and age appear to be key features for analyzing the survival rate of the passengers. The ticket
number on its own may not be very valuable and could potentially be left out of further studies.

That being said, it would be interesting to look at which deck each passenger was on
in terms of the ship layout, an aspect that could be tied to the ticket number or, more directly, to the 'Cabin' entry (for example, B28). Determining the proximity to the ship's deck could add insight into how the passengers' survival rates were affected by exit route.
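
As a rough sketch of the deck idea, the leading letter of the 'Cabin' entry (such as the B in B28 from the table earlier) could be pulled out as a candidate feature. I am assuming here that this letter denotes the deck; the `toy` frame and the 'Deck' column name are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; one passenger has no cabin recorded
toy = pd.DataFrame({
    'Cabin': ['B28', 'C85', np.nan, 'E46'],
    'Survived': [1, 1, 0, 0],
})

# First character of the cabin code, assumed here to denote the deck
toy['Deck'] = toy.Cabin.str[0]

# Keep passengers without a cabin entry in their own group
toy['Deck'] = toy.Deck.fillna('Unknown')

# Survival rate per assumed deck
rate_by_deck = toy.groupby('Deck').Survived.mean()
print(rate_by_deck)
```

Whether such a feature helps would depend on how many passengers actually have a cabin recorded.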



