In [95]:
import pandas as pd

In [96]:
# 1. The passengers most likely to survive the Titanic disaster were first class females; the average survivor was
# about 28 years old. There is also evidence to suggest that having one or more family members aboard Titanic
# was associated with a higher survival rate.

# Explanation of analysis: The dataset (train.csv) contains 891 passenger records; 342 of those passengers survived.
# Of those 342 survivors, 233 were female and 109 were male. Overall, 74.2 percent of females survived versus
# 18.9 percent of males. The mean ticket class of survivors is 1.95 versus 2.53 for non-survivors. In other words,
# while not all survivors were first class passengers, first class passengers had a higher survival rate.
# For example, of the 216 first class passengers, 136 survived (63.0 percent survival rate); of the 491 third class
# passengers, only 119 survived (24.2 percent survival rate). The average survivor was 28.3 years old (among the
# 290 survivors with a recorded age).
# Independent of other attributes, the average female aboard Titanic was 27.9 years old versus an average age of
# 30.7 years for males. So while sex was strongly associated with survival rate, the females aboard Titanic were
# also, on average, younger than the males. As for passengers with families, 354 passengers had at least one family
# member on Titanic. 179 of those 354 passengers survived, which is a 50.6 percent survival rate. Of the 537
# passengers who did not have any family aboard Titanic, only 30.4 percent survived (163 passengers). Passengers with
# at least one family member aboard therefore survived at a rate 20.2 percentage points higher than those without.

# 2. For this analysis, I grouped the DataFrame by the ‘Embarked’ attribute to explore whether or not a passenger’s
# port of embarkation was related to overall survival rate. The rows of the grouped table are the three ports
# (Cherbourg, Queenstown and Southampton) and the columns are ‘Survived’, ‘Pclass’, ‘Age’ and ‘TotalFam’.
# Taking the mean of each column, we see the average survival rate, ticket class, age and total family members
# aboard for Cherbourg (France), Queenstown (Ireland) and Southampton (England). My hypothesis for ‘Embarked’ was
# that a passenger’s “passport” affected their survival rate. The data show that 55.4 percent of Cherbourg
# passengers survived whereas only 33.7 percent and 39.0 percent of Southampton and Queenstown passengers survived,
# respectively. At first glance, Cherbourg passengers appear to have had an advantage. However, the data also
# show that the average Cherbourg passenger held a first- or second-class ticket (mean class 1.89) whereas those who
# embarked from Southampton and Queenstown held, on average, second- or third-class tickets (means 2.35 and 2.91).
# The apparent Cherbourg advantage may therefore reflect ticket class rather than port of embarkation.

# Attributes Used: Survived; Pclass; Sex; Age; Embarked; TotalFam (SibSp + Parch).

# 3. I engineered the attribute ‘TotalFam’, which is the sum of ‘SibSp’ and ‘Parch’. The purpose of generating
# this new attribute was to explore whether or not having at least one family member aboard Titanic was associated
# with a higher survival rate. 354 passengers had at least one family member on Titanic. 179 of those 354 passengers
# survived, which is a 50.6 percent survival rate. Of the 537 passengers who did not have any family aboard
# Titanic, only 30.4 percent survived (163 passengers). Passengers with at least one family member aboard
# therefore survived at a rate 20.2 percentage points higher than those without.

# 4. Relevant attributes such as ‘Age’ and ‘Embarked’ were missing values. Pandas represents these missing values
# as NaN when it reads the CSV, and aggregations such as mean() skip NaN by default, so the gaps do not distort the
# averages. I checked one of my grouped tables against the same table with NaN values dropped. There was no
# difference in the output, meaning none of the grouped averages is itself NaN. Note that the two passengers with
# no recorded ‘Embarked’ value are left out of the port grouping (168 + 77 + 644 = 889 passengers).

In [97]:
# Loading data.

In [98]:
titanic_data = pd.read_csv('train.csv', sep=',')
titanic_data.head(3)

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [99]:
# Checking the DataFrame for null values. Most columns have 891 non-null records (one per passenger);
# 'Age', 'Cabin' and 'Embarked' have missing values.

In [100]:
titanic_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
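
# A small sketch of how the missing-value counts follow from the info() output above: subtracting each column's
# non-null count from the 891 rows gives the number of missing entries. The counts are copied from that output;
# the variable names are illustrative only.

```python
import pandas as pd

# Non-null counts copied from the info() output (only the columns with gaps).
total_rows = 891
non_null = pd.Series({'Age': 714, 'Cabin': 204, 'Embarked': 889})

# Missing entries per column = total rows - non-null entries.
missing = total_rows - non_null
print(missing)

# On the loaded DataFrame, titanic_data.isna().sum() should report the same
# per-column counts directly.
```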


In [101]:
# Filtering DataFrame to only reflect survivors.

In [102]:
titanic_survived = titanic_data[titanic_data['Survived'] > 0]
titanic_survived.head(3)

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S


In [103]:
# Checking the survivors-only DataFrame for null values. Most columns have 342 non-null records (342 passengers
# who survived); 'Age', 'Cabin' and 'Embarked' have missing values.

In [104]:
titanic_survived.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 342 entries, 1 to 889
Data columns (total 12 columns):
PassengerId    342 non-null int64
Survived       342 non-null int64
Pclass         342 non-null int64
Name           342 non-null object
Sex            342 non-null object
Age            290 non-null float64
SibSp          342 non-null int64
Parch          342 non-null int64
Ticket         342 non-null object
Fare           342 non-null float64
Cabin          136 non-null object
Embarked       340 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 34.7+ KB


In [105]:
# Counting survivors by ticket class.

In [106]:
titanic_survived['Pclass'].value_counts()

1    136
3    119
2     87
Name: Pclass, dtype: int64

In [107]:
# Counting ticket class frequencies (non-survivors included). Compare against passengers who survived.

In [108]:
titanic_data['Pclass'].value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64
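
# The two value_counts() outputs above can be combined into a survival rate per ticket class. This is a minimal
# sketch that re-enters the printed counts by hand; the variable names are illustrative only.

```python
import pandas as pd

# Counts copied from the two value_counts() outputs above.
survivors_by_class = pd.Series({1: 136, 2: 87, 3: 119})
passengers_by_class = pd.Series({1: 216, 2: 184, 3: 491})

# Series division aligns on the index (ticket class), so the differing
# row order of the two outputs does not matter.
survival_rate_by_class = (survivors_by_class / passengers_by_class * 100).round(1)
print(survival_rate_by_class)

# On the loaded DataFrame, titanic_data.groupby('Pclass')['Survived'].mean()
# gives the same rates as fractions rather than percentages.
```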

In [109]:
# Counting survivors by gender.

In [110]:
titanic_survived['Sex'].value_counts()

female    233
male      109
Name: Sex, dtype: int64

In [111]:
# Counting passenger gender frequencies (non-survivors included). Compare against passengers who survived.

In [112]:
titanic_data['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [113]:
# Grouping DataFrame by survivors.

In [114]:
titanic_grouped_survivors = titanic_data.groupby('Survived')

In [115]:
titanic_grouped_survivors.groups

{0: Int64Index([  0,   4,   5,   6,   7,  12,  13,  14,  16,  18,
             ...
             877, 878, 881, 882, 883, 884, 885, 886, 888, 890],
            dtype='int64', length=549),
 1: Int64Index([  1,   2,   3,   8,   9,  10,  11,  15,  17,  19,
             ...
             865, 866, 869, 871, 874, 875, 879, 880, 887, 889],
            dtype='int64', length=342)}

In [116]:
# Computing column means to compare survivor averages versus non-survivor averages.

In [117]:
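# numeric_only=True restricts the mean to numeric columns (pandas 2.x raises an error on text columns otherwise).
titanic_grouped_survivors.mean(numeric_only=True)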

Survived,PassengerId,Pclass,Age,SibSp,Parch,Fare
0,447.016393,2.531876,30.626179,0.553734,0.32969,22.117887
1,444.368421,1.950292,28.34369,0.473684,0.464912,48.395408


In [118]:
# Engineering a new column, "TotalFam", which totals the family members (SibSp + Parch) each passenger had aboard.

In [119]:
titanic_data['TotalFam'] = titanic_data['SibSp'] + titanic_data['Parch']

In [120]:
titanic_data.head(3)

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,TotalFam
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0


In [121]:
# Confirming number of records for all passengers who had at least one family member aboard.

In [122]:
titanic_data[titanic_data['TotalFam'] > 0].count()

PassengerId    354
Survived       354
Pclass         354
Name           354
Sex            354
Age            310
SibSp          354
Parch          354
Ticket         354
Fare           354
Cabin          110
Embarked       354
TotalFam       354
dtype: int64

In [123]:
# Confirming number of records for all passengers who had no family members aboard.

In [124]:
titanic_data[titanic_data['TotalFam'] == 0].count()

PassengerId    537
Survived       537
Pclass         537
Name           537
Sex            537
Age            404
SibSp          537
Parch          537
Ticket         537
Fare           537
Cabin           94
Embarked       535
TotalFam       537
dtype: int64

In [125]:
# Confirming number of records for survived passengers with at least one family member aboard.

In [126]:
titanic_data[(titanic_data['TotalFam'] > 0) & (titanic_data['Survived'] > 0)].count()

PassengerId    179
Survived       179
Pclass         179
Name           179
Sex            179
Age            160
SibSp          179
Parch          179
Ticket         179
Fare           179
Cabin           81
Embarked       179
TotalFam       179
dtype: int64

In [127]:
# Confirming number of records for survived passengers with no family members aboard.

In [128]:
titanic_data[(titanic_data['TotalFam'] == 0) & (titanic_data['Survived'] > 0)].count()

PassengerId    163
Survived       163
Pclass         163
Name           163
Sex            163
Age            130
SibSp          163
Parch          163
Ticket         163
Fare           163
Cabin           55
Embarked       161
TotalFam       163
dtype: int64
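
# The four count() outputs above give the figures behind the family comparison. A short worked version of that
# arithmetic, with the counts re-entered by hand (variable names are illustrative only):

```python
# Counts copied from the four count() outputs above.
with_family_total, with_family_survived = 354, 179
alone_total, alone_survived = 537, 163

# Survival rate within each group, as a percentage.
rate_with_family = with_family_survived / with_family_total * 100
rate_alone = alone_survived / alone_total * 100

# Difference between the two rates, in percentage points.
gap = rate_with_family - rate_alone

print(f"With family: {rate_with_family:.1f}%")    # With family: 50.6%
print(f"No family:   {rate_alone:.1f}%")          # No family:   30.4%
print(f"Gap: {gap:.1f} percentage points")        # Gap: 20.2 percentage points
```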

In [129]:
# Grouping DataFrame by sex. Computing column means to see how survival rate differed by sex.

In [130]:
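# Selecting the numeric columns before mean() avoids averaging text columns (an error in pandas 2.x).
Sex_mean = titanic_data.groupby('Sex')[['Survived', 'Pclass', 'Age', 'TotalFam']].mean()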

In [131]:
Sex_mean

Sex,Survived,Pclass,Age,TotalFam
female,0.742038,2.159236,27.915709,1.343949
male,0.188908,2.389948,30.726645,0.665511
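
# The summary names first class females as the likeliest survivors, while the tables here group by one attribute
# at a time. A sketch of grouping by sex and ticket class together follows. The helper name survival_rate_table
# and the toy rows are hypothetical (the rows are NOT taken from train.csv); they only demonstrate the call.

```python
import pandas as pd

def survival_rate_table(df, row_key, col_key):
    """Mean of 'Survived' for each (row_key, col_key) pair, as a 2-D table."""
    return df.groupby([row_key, col_key])['Survived'].mean().unstack(col_key)

# Hypothetical toy rows, not real passenger data.
toy = pd.DataFrame({
    'Sex':      ['female', 'female', 'female', 'male', 'male', 'male', 'male', 'female'],
    'Pclass':   [1, 1, 3, 1, 3, 3, 3, 3],
    'Survived': [1, 1, 0, 1, 0, 0, 1, 1],
})

rates = survival_rate_table(toy, 'Sex', 'Pclass')
print(rates)
```

# With the notebook's data, the equivalent call would be survival_rate_table(titanic_data, 'Sex', 'Pclass').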


In [132]:
# Grouping DataFrame by where passengers embarked. Computing column means to see how survival rate differed by
# port of embarkation.

In [133]:
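# Selecting the numeric columns before mean() avoids averaging text columns (an error in pandas 2.x).
embark_mean = titanic_data.groupby('Embarked')[['Survived', 'Pclass', 'Age', 'TotalFam']].mean()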

In [134]:
embark_mean

Embarked,Survived,Pclass,Age,TotalFam
C,0.553571,1.886905,30.814769,0.75
Q,0.38961,2.909091,28.089286,0.597403
S,0.336957,2.350932,29.445397,0.984472


In [135]:
# Confirming that the grouped table contains no NaN values (dropna() removes no rows).

In [136]:
embark_mean.dropna()

Embarked,Survived,Pclass,Age,TotalFam
C,0.553571,1.886905,30.814769,0.75
Q,0.38961,2.909091,28.089286,0.597403
S,0.336957,2.350932,29.445397,0.984472
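
# A sketch of how pandas treats missing values during grouping, which is why dropna() changes nothing above.
# The toy rows are hypothetical (NOT taken from train.csv). The dropna=False argument of groupby assumes
# pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical toy rows: one missing age and one missing port.
toy = pd.DataFrame({
    'Embarked': ['C', 'C', 'S', None],
    'Age':      [20.0, None, 40.0, 50.0],
    'Survived': [1, 0, 1, 1],
})

# mean() skips NaN ages, and rows whose group key is missing are left out.
means = toy.groupby('Embarked')[['Age', 'Survived']].mean()
print(means)

# dropna=False keeps the missing-port rows as their own NaN group.
with_missing_port = toy.groupby('Embarked', dropna=False)[['Age', 'Survived']].mean()
print(with_missing_port)
```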


In [137]:
# Counting total passengers from each port (point of embarkation).

In [138]:
embark_count = titanic_data.groupby('Embarked').count()[['PassengerId']]

In [139]:
embark_count

Embarked,PassengerId
C,168
Q,77
S,644
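
# As a consistency check, the per-port survival rates and passenger counts above can be multiplied back into
# survivor counts. This sketch re-enters the printed figures by hand (variable names are illustrative only);
# the total should match the 340 survivors with a recorded port in the survivors-only info() output.

```python
import pandas as pd

# Figures copied from the embark_mean and embark_count outputs above.
survival_rate = pd.Series({'C': 0.553571, 'Q': 0.389610, 'S': 0.336957})
passengers = pd.Series({'C': 168, 'Q': 77, 'S': 644})

# Rate times group size recovers the number of survivors per port.
survivors = (survival_rate * passengers).round().astype(int)
print(survivors)

# Passenger-weighted overall survival rate among passengers with a known port.
overall_rate = survivors.sum() / passengers.sum()
print(survivors.sum(), round(overall_rate, 3))
```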
