In [1]:
# 'pip': The PyPA recommended tool for installing Python packages
# the '!' runs the line the same as in the terminal

!pip install seaborn



In [183]:
# imports a library 'pandas', names it as 'pd'

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# enables inline plots, without it plots don't show up in the notebook
%matplotlib inline

In [189]:
# column names in the Titanic data, listed for reference
# (titanic.csv has its own header row, so read_csv below picks them up)
cols = ['PassengerId' , 'Survived' , 'Pclass' , 'Name' , 'Sex' , 'Age' , 
        'SibSp' , 'Parch' , 'Ticket' , 'Fare' , 'Cabin' , 'Embarked']

In [187]:
df = pd.read_csv('titanic.csv')

In [188]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 90.5+ KB


In [190]:
df.count()

PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64

Number 1: There are 891 passengers on our passenger list.

In [191]:
df.Survived.value_counts()

0    549
1    342
dtype: int64

In [192]:
# survivors divided by all passengers (549 + 342 = 891)
342/891.00

0.3838383838383838

Number 2: The overall survival rate is 38.4 percent (342 of 891 passengers).
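
The same rate can be read straight off the Survived column, because the mean of a 0/1 column is the fraction of ones. The sketch below rebuilds a stand-in column from the two counts reported above instead of loading titanic.csv; the variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Stand-in for df.Survived, rebuilt from the counts above:
# 549 passengers with Survived == 0 and 342 with Survived == 1.
survived = pd.Series([0] * 549 + [1] * 342, name="Survived")

# The mean of a 0/1 column is the share of ones, i.e. the survival rate.
survival_rate = survived.mean()

# normalize=True returns the share of each value instead of raw counts.
shares = survived.value_counts(normalize=True)

print(round(survival_rate, 4))
print(shares)
```

On the real data frame the equivalent calls would be `df.Survived.mean()` and `df.Survived.value_counts(normalize=True)`.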

In [193]:
df.Sex.value_counts()

male      577
female    314
dtype: int64

Number 3: There were 577 male passengers onboard.

Number 4: There were 314 female passengers onboard.

In [194]:
df.groupby(['Sex', 'Survived']).Age.agg(['count'])

Sex,Survived,count
female,0,64
female,1,197
male,0,360
male,1,93


In [99]:
93.00/(93+360)

0.2052980132450331

In [100]:
197.00/(64+197)

0.7547892720306514

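Number 5: The survival rate of male passengers is 20.5 percent. Counting over the Age column leaves out the 177 passengers whose age is missing, so this rate covers 453 of the 577 men.

Number 6: The survival rate of female passengers is 75.5 percent, computed the same way over the 261 of 314 women with a recorded age.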
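
The two divisions above can be done in one step by pivoting the count table and dividing the survivor column by the row totals. This is a minimal sketch that retypes the four counts from the groupby output by hand; the names `counts`, `table`, and `rates` are illustrative. Note that these counts come from `.Age`, so they only include passengers with a recorded age (64 + 197 = 261 women and 360 + 93 = 453 men).

```python
import pandas as pd

# Counts copied from the groupby output above (known-age passengers only).
counts = pd.DataFrame(
    {
        "Sex": ["female", "female", "male", "male"],
        "Survived": [0, 1, 0, 1],
        "count": [64, 197, 360, 93],
    }
)

# One row per sex, one column per Survived value.
table = counts.pivot(index="Sex", columns="Survived", values="count")

# Survivors divided by everyone in that row.
rates = table[1] / table.sum(axis=1)

print(rates.round(3))
```

To base the rates on every passenger instead of only those with a known age, something like `df.groupby('Sex').Survived.mean()` on the full data frame should work, since it does not depend on the Age column.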

In [195]:
null_df = df[df.Age.isnull()]

In [196]:
df.Age

0     22
1     38
2     26
3     35
4     35
5    NaN
6     54
7      2
8     27
9     14
10     4
11    58
12    20
13    39
14    14
...
876    20
877    19
878   NaN
879    56
880    25
881    33
882    22
883    28
884    25
885    39
886    27
887    19
888   NaN
889    26
890    32
Name: Age, Length: 891, dtype: float64

In [201]:
df.Age.fillna(df.Age.mean())

0     22.000000
1     38.000000
2     26.000000
3     35.000000
4     35.000000
5     29.699118
6     54.000000
7      2.000000
8     27.000000
9     14.000000
10     4.000000
11    58.000000
12    20.000000
13    39.000000
14    14.000000
...
876    20.000000
877    19.000000
878    29.699118
879    56.000000
880    25.000000
881    33.000000
882    22.000000
883    28.000000
884    25.000000
885    39.000000
886    27.000000
887    19.000000
888    29.699118
889    26.000000
890    32.000000
Name: Age, Length: 891, dtype: float64

In [202]:
df.Age.mean()

29.69911764705882

Number 7: The average age of all passengers on board is 29.7 years old.

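Number 7a: I averaged all the non-null ages in the Age column (714 of the 891 rows).

Number 7b: I imputed the column mean for the null values in Age. We could have also used the median or excluded those rows from the analysis. Note that fillna returns a new Series, so the imputed values only persist if the result is assigned back to df.Age.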
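
The three options mentioned in 7b (mean, median, or dropping the rows) can be compared side by side. The sketch below uses only the first eight Age values from the output above as a small stand-in for the full column; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# First eight Age values from the output above; index 5 is missing.
ages = pd.Series([22, 38, 26, 35, 35, np.nan, 54, 2], name="Age")

# fillna returns a new Series; `ages` itself is left unchanged.
mean_filled = ages.fillna(ages.mean())
median_filled = ages.fillna(ages.median())

# Alternative: drop the rows with a missing age instead of filling them.
dropped = ages.dropna()

print(ages.isnull().sum())         # missing values still in the original
print(mean_filled.isnull().sum())  # missing values after mean imputation
print(len(dropped))                # rows left after dropping
```

Filling with the mean leaves the column mean unchanged, which is why `df.Age.mean()` gives the same 29.7 before and after imputation. To keep the filled values in the data frame, the result would need to be assigned back, for example `df['Age'] = df.Age.fillna(df.Age.mean())`.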

In [208]:
df.groupby(['Survived']).Age.agg(['mean'])

Survived,mean
0,30.626179
1,28.34369


Number 8: The average age of those who survived is 28.3 years old.

Number 9: The average age of those who did not survive is 30.6 years old.

Number 10: At this early point in the analysis we can see that women were more likely to survive than men. We can also see that survivors were slightly younger on average than those who did not survive.

In [212]:
df.groupby(['Pclass']).Pclass.agg(['count'])

Pclass,count
1,216
2,184
3,491


Number 11: There were 216 passengers in 1st class, 184 passengers in 2nd class, and 491 passengers in 3rd class.

In [214]:
df.groupby(['Pclass','Survived']).Pclass.agg(['count'])

Pclass,Survived,count
1,0,80
1,1,136
2,0,97
2,1,87
3,0,372
3,1,119


In [215]:
136.00/216

0.6296296296296297

In [216]:
87.00/184

0.47282608695652173

In [217]:
119.00/491

0.24236252545824846

Number 12: The survival rate for passengers in the 1st class is 63.0 percent, for passengers in the 2nd class is 47.3 percent, and for passengers in the 3rd class is 24.2 percent.
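
The three divisions can be written as a single loop over the class counts, which also takes care of the rounding. This is a plain-Python sketch using the counts from the Pclass/Survived table above; the dictionary layout is just one convenient way to hold them.

```python
# (died, survived) counts per class, copied from the table above.
counts = {1: (80, 136), 2: (97, 87), 3: (372, 119)}

# Survival rate = survivors / all passengers in that class.
rates = {
    pclass: survived / (died + survived)
    for pclass, (died, survived) in counts.items()
}

for pclass, rate in rates.items():
    # .1% multiplies by 100 and rounds to one decimal place.
    print(f"class {pclass}: {rate:.1%}")
```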

Number 13: We can also conclude that the higher a passenger's class, the more likely they were to survive. A reasonable hypothesis, which the tables above do not test directly, is that women in the 1st class were the most likely to survive out of all the passengers.
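
The point about women in the 1st class combines two variables, so it could be checked by grouping on Pclass and Sex together. The sketch below shows the shape of that check on a tiny made-up frame; the rows are hypothetical and are not Titanic records, so the printed numbers say nothing about the real data.

```python
import pandas as pd

# Made-up rows for illustration only; these are NOT Titanic records.
toy = pd.DataFrame(
    {
        "Pclass": [1, 1, 1, 1, 3, 3, 3, 3],
        "Sex": ["female", "female", "male", "male",
                "female", "female", "male", "male"],
        "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
    }
)

# Mean of the 0/1 Survived column within each (Pclass, Sex) group
# is that group's survival rate.
rates = toy.groupby(["Pclass", "Sex"])["Survived"].mean()

# Reshape so classes are rows and sexes are columns.
print(rates.unstack("Sex"))
```

Running the same groupby on the real `df` in place of `toy` would give the survival rate for each class and sex combination.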

Number 14: I think we should include Sex and Pclass in the predictive model and leave out Age. Survival rates differ substantially between the two sexes and across the three classes, while the mean age of survivors and non-survivors differs by only about 2.3 years (28.3 versus 30.6).