# Titanic Kaggle Competition

This notebook contains some exploratory data analysis of the Titanic dataset. The goal is to find out whether any features in the dataset correlate strongly with a passenger's survival.

https://www.kaggle.com/c/titanic

# Feature engineering and data exploration

In [1]:
import pandas as pd
import seaborn as sns

In [2]:
dataset = pd.read_csv('train.csv')
testset = pd.read_csv('test.csv')
dataset = dataset.drop(['PassengerId'], axis=1)
testset = testset.drop(['PassengerId'], axis=1)

In [3]:
dataset.head()

,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In Kaggle's instructional video for the Titanic competition, it is said that female passengers were more likely to survive the wreck than male passengers. Let's explore this.

In [4]:
# numeric_only=True skips text columns such as Name and Ticket
dataset.groupby('Sex').mean(numeric_only=True)

Sex,Survived,Pclass,Age,SibSp,Parch,Fare
female,0.742038,2.159236,27.915709,0.694268,0.649682,44.479818
male,0.188908,2.389948,30.726645,0.429809,0.235702,25.523893


As can be seen from the table above, at least in the training dataset, about 74% of the female passengers survived, whereas only about 19% of the male passengers survived. So the 'Sex' feature will probably have a strong effect on predicting who survives.
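To read the survival rate next to the group sizes, one option is to select the 'Survived' column before aggregating. This is a minimal sketch on a tiny made-up frame. The rows in `toy` are illustrative and not taken from train.csv, and `by_sex` is a name chosen here.

```python
import pandas as pd

# Hypothetical toy data using the same column names as train.csv.
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1, 0],
})

# Survival rate and number of passengers per sex.
by_sex = toy.groupby("Sex")["Survived"].agg(rate="mean", n="count")
print(by_sex)
```

Showing the count next to the rate makes it easier to judge how much weight a group's rate deserves.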

Let's explore whether the ticket class correlates with the survival of a passenger.

In [5]:
dataset.groupby('Pclass').mean(numeric_only=True)

Pclass,Survived,Age,SibSp,Parch,Fare
1,0.62963,38.233441,0.416667,0.356481,84.154687
2,0.472826,29.87763,0.402174,0.380435,20.662183
3,0.242363,25.14062,0.615071,0.393075,13.67555


It seems that first-class passengers had a better chance of surviving (about 63%) than passengers in second class (about 47%) or third class (about 24%).

Passengers boarded the Titanic at three ports: Cherbourg (C), Queenstown (Q), and Southampton (S). Let's explore whether the port of embarkation correlates with the survival of a passenger.

Let's first check, though, whether the port of embarkation is known for each passenger, and also find out whether other features have missing values.

In [6]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 76.7+ KB


Features 'Age', 'Cabin' and 'Embarked' contain empty values. The port of embarkation is missing for only two passengers, so let's fill in the most frequent value for them. Since almost 700 passengers (687 of 891) lack the 'Cabin' data, let's just drop that column.
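The missing-value counts can also be computed directly instead of being read off `info()`. The following is a sketch on a small made-up frame. The `toy` rows and the names `missing`, `share` and `most_common_port` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in the same columns as the training set.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Missing values per column, as a count and as a share of all rows.
missing = toy.isna().sum()
share = toy.isna().mean()

# Fill the missing port with the most frequent value (the mode).
most_common_port = toy["Embarked"].mode()[0]
filled = toy["Embarked"].fillna(most_common_port)
print(missing, share, filled, sep="\n")
```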

In [7]:
testset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Pclass    418 non-null    int64  
 1   Name      418 non-null    object 
 2   Sex       418 non-null    object 
 3   Age       332 non-null    float64
 4   SibSp     418 non-null    int64  
 5   Parch     418 non-null    int64  
 6   Ticket    418 non-null    object 
 7   Fare      417 non-null    float64
 8   Cabin     91 non-null     object 
 9   Embarked  418 non-null    object 
dtypes: float64(2), int64(3), object(5)
memory usage: 32.8+ KB


The testset also contains one null Fare.

In [8]:
dataset['Age'].describe()

count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64

In [9]:
# Since the spread of 'Age' is large, median is a better choice than mean.
dataset['Age'] = dataset['Age'].fillna(dataset['Age'].median())
testset['Age'] = testset['Age'].fillna(testset['Age'].median())
testset['Fare'] = testset['Fare'].fillna(testset['Fare'].median())

In [10]:
dataset['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [11]:
def emb(e):
    # Replace missing or unexpected ports with 'S', the most frequent value.
    good = ['S', 'C', 'Q']

    if e not in good:
        return 'S'
    else:
        return e

In [12]:
dataset['Embarked'] = dataset['Embarked'].apply(emb)
testset['Embarked'] = testset['Embarked'].apply(emb)
dataset = dataset.drop(['Cabin'], axis=1)
testset = testset.drop(['Cabin'], axis=1)

In [13]:
dataset.groupby('Embarked').mean(numeric_only=True)

Embarked,Survived,Pclass,Age,SibSp,Parch,Fare
C,0.553571,1.886905,30.178095,0.386905,0.363095,59.954144
Q,0.38961,2.909091,28.032468,0.428571,0.168831,13.27603
S,0.339009,2.346749,29.307663,0.569659,0.411765,27.243651


As we can see, passengers who embarked at Cherbourg had a higher chance of surviving (about 55%) than passengers who embarked at the other two ports (34-39%).

The spread of 'Fare', i.e. the ticket price, is also wide. Let's round 'Fare' to a whole number in order to group the fares more easily.
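The Cherbourg group also has the lowest mean 'Pclass' in the table above (about 1.89), so part of the difference may come from ticket class rather than from the port itself. One way to check this is to break survival down by both features at once. The sketch below uses made-up rows, so the `toy` frame and the name `rates` are illustrative assumptions, not results from the dataset.

```python
import pandas as pd

# Hypothetical toy data using the same column names as train.csv.
toy = pd.DataFrame({
    "Embarked": ["C", "C", "C", "S", "S", "S", "S", "Q"],
    "Pclass":   [1, 1, 3, 1, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 1, 0],
})

# Mean survival for every (port, class) combination.
# Combinations with no passengers show up as NaN.
rates = toy.pivot_table(
    index="Embarked", columns="Pclass", values="Survived", aggfunc="mean"
)
print(rates)
```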

In [14]:
dataset['Fare'].describe()

count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: Fare, dtype: float64

In [15]:
dataset['RoundFare'] = dataset['Fare'].round()
testset['RoundFare'] = testset['Fare'].round()

In [16]:
dataset.head()

,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,RoundFare
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S,7.0
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,71.0
2,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S,8.0
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S,53.0
4,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S,8.0


In [17]:
dataset.groupby('RoundFare').mean(numeric_only=True)

RoundFare,Survived,Pclass,Age,SibSp,Parch,Fare
0.0,0.066667,1.933333,31.333333,0.000000,0.000000,0.00000
4.0,0.000000,3.000000,20.000000,0.000000,0.000000,4.01250
5.0,0.000000,1.000000,33.000000,0.000000,0.000000,5.00000
6.0,0.000000,3.000000,38.100000,0.200000,0.000000,6.42332
7.0,0.169231,3.000000,27.492308,0.076923,0.030769,7.17834
...,...,...,...,...,...,...
228.0,0.750000,1.000000,31.500000,0.250000,0.000000,227.52500
248.0,0.500000,1.000000,37.000000,0.000000,1.000000,247.52080
262.0,1.000000,1.000000,19.500000,2.000000,2.000000,262.37500
263.0,0.500000,1.000000,32.500000,2.500000,2.500000,263.00000


In [18]:
dataset['RoundFare'].value_counts()

8.0      206
7.0       65
26.0      46
13.0      46
10.0      43
        ... 
76.0       1
62.0       1
51.0       1
222.0      1
4.0        1
Name: RoundFare, Length: 90, dtype: int64

In [19]:
counts = dataset['RoundFare'].value_counts()

for fare, count in counts.items():
  print(fare, count)

8.0 206
7.0 65
26.0 46
13.0 46
10.0 43
16.0 28
14.0 28
9.0 27
27.0 17
28.0 16
31.0 16
30.0 16
21.0 15
0.0 15
12.0 14
24.0 10
52.0 9
40.0 9
15.0 9
56.0 9
23.0 9
19.0 9
29.0 8
20.0 8
53.0 8
47.0 7
70.0 7
79.0 6
11.0 6
34.0 6
83.0 5
74.0 5
39.0 5
50.0 5
6.0 5
80.0 5
77.0 5
18.0 5
78.0 5
120.0 4
37.0 4
22.0 4
42.0 4
36.0 4
228.0 4
152.0 4
111.0 4
263.0 4
25.0 4
90.0 4
57.0 4
134.0 4
55.0 3
86.0 3
136.0 3
33.0 3
113.0 3
512.0 3
82.0 3
17.0 3
211.0 3
153.0 3
71.0 3
91.0 2
147.0 2
65.0 2
262.0 2
106.0 2
165.0 2
94.0 2
67.0 2
89.0 2
69.0 2
248.0 2
109.0 2
58.0 2
32.0 2
35.0 2
61.0 2
59.0 1
75.0 1
63.0 1
212.0 1
5.0 1
38.0 1
76.0 1
62.0 1
51.0 1
222.0 1
4.0 1


At a quick glance, a little over half of the passengers (472) paid a fare between 7 and 16. Another 22 passengers paid less than 7. So the first fare group is 0-16. The second group covers fares from 17 to 31 and consists of 186 passengers, or about 21%. Among the remaining fares no larger clusters stand out, so the third group is "the rest", i.e. over 31, up to 512.
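The three fare groups described above could be built with `pd.cut`. This is only a sketch: the notebook's own cells do not create such a column, and the series `round_fare`, the label names and the variable `fare_group` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rounded fares covering the edges of each group.
round_fare = pd.Series([0.0, 7.0, 16.0, 17.0, 31.0, 32.0, 512.0])

# Right-closed bins: (-1, 16], (16, 31], (31, inf).
fare_group = pd.cut(
    round_fare,
    bins=[-1, 16, 31, np.inf],
    labels=["fare_0_16", "fare_17_31", "fare_over_31"],
)
print(fare_group.value_counts())
```

The lower edge is set to -1 so that a fare of 0 still falls into the first bin.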

In [20]:
def age_group(age):
  # Three age bands: under 17, 17 to 31, and 32 or older.
  if age < 17:
    return "group_1"
  elif age < 32:
    return "group_2"
  else:
    return "group_3"

In [21]:
dataset['AgeGroup'] = dataset['Age'].apply(age_group)
testset['AgeGroup'] = testset['Age'].apply(age_group)

In [22]:
dataset.head()

,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,RoundFare,AgeGroup
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S,7.0,group_2
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,71.0,group_3
2,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S,8.0,group_2
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S,53.0,group_3
4,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S,8.0,group_3


In [23]:
dataset['AgeGroup'].value_counts() / dataset.shape[0]

group_2    0.566779
group_3    0.320988
group_1    0.112233
Name: AgeGroup, dtype: float64

In [24]:
dataset.groupby('AgeGroup').mean(numeric_only=True)

AgeGroup,Survived,Pclass,Age,SibSp,Parch,Fare,RoundFare
group_1,0.55,2.61,8.0067,1.57,1.14,31.588877,31.55
group_2,0.338614,2.475248,25.456436,0.417822,0.227723,26.262135,26.289109
group_3,0.405594,1.909091,43.723776,0.342657,0.388112,42.91148,42.909091


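As we can see above, passengers under 17 (group_1) had the best chance of surviving (55%). The trend is not monotonic, though: passengers aged 32 or over (group_3, about 41%) fared somewhat better than those aged 17 to 31 (group_2, about 34%). Note that group_2 also contains every passenger whose missing age was filled with the median (28).

Feature 'SibSp' is the number of siblings / spouses on board, and 'Parch' is the number of parents / children on board. Let's combine them into a new feature, 'FamilySize'. Note that it does not count the passenger themselves, so a value of 0 means travelling alone.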

In [25]:
dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch']
testset['FamilySize'] = testset['SibSp'] + testset['Parch']

In [26]:
dataset.head()

,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,RoundFare,AgeGroup,FamilySize
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S,7.0,group_2,1
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,71.0,group_3,1
2,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S,8.0,group_2,0
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S,53.0,group_3,1
4,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S,8.0,group_3,0


In [27]:
dataset.groupby('FamilySize').mean(numeric_only=True)

FamilySize,Survived,Pclass,Age,SibSp,Parch,Fare,RoundFare
0,0.303538,2.400372,31.175047,0.0,0.0,21.242689,21.268156
1,0.552795,1.919255,30.928075,0.763975,0.236025,49.894129,49.857143
2,0.578431,2.22549,26.209118,0.872549,1.127451,39.692482,39.696078
3,0.724138,2.068966,18.945517,1.344828,1.655172,54.86451,54.965517
4,0.2,2.666667,22.733333,2.133333,1.866667,58.094453,57.8
5,0.136364,2.590909,18.409091,2.818182,2.181818,73.722727,73.818182
6,0.333333,3.0,15.166667,3.25,2.75,29.366667,29.083333
7,0.0,3.0,15.666667,4.333333,2.666667,46.9,47.0
10,0.0,3.0,28.0,8.0,2.0,69.55,70.0


In [28]:
dataset['FamilySize'].value_counts()

0     537
1     161
2     102
3      29
5      22
4      15
6      12
10      7
7       6
Name: FamilySize, dtype: int64

It seems that passengers with a small "entourage" (one to three family members) had the best chance of surviving, while those with four or more family members on board fared clearly worse. Also, over half of the passengers travelled alone, and only about 30% of them survived.

Since over half of the passengers travelled alone, let's create one more new feature, 'TravelledAlone'.

Each passenger's name also seems to include some sort of title. After that, let's see whether the title correlates with survival.

In [29]:
def alone(family_size):
  if family_size == 0:
    return "alone"
  else:
    return "not_alone"

In [30]:
dataset['TravelledAlone'] = dataset['FamilySize'].apply(alone)
testset['TravelledAlone'] = testset['FamilySize'].apply(alone)

In [31]:
dataset.head()

,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,RoundFare,AgeGroup,FamilySize,TravelledAlone
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S,7.0,group_2,1,not_alone
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,71.0,group_3,1,not_alone
2,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S,8.0,group_2,0,alone
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S,53.0,group_3,1,not_alone
4,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S,8.0,group_3,0,alone


In [32]:
dataset.groupby('TravelledAlone').mean(numeric_only=True)

TravelledAlone,Survived,Pclass,Age,SibSp,Parch,Fare,RoundFare,FamilySize
alone,0.303538,2.400372,31.175047,0.0,0.0,21.242689,21.268156,0.0
not_alone,0.50565,2.169492,26.61065,1.316384,0.960452,48.832275,48.819209,2.276836


Travelling alone was riskier, although even among the passengers who were not alone, only about half survived.

In [33]:
def title(name):
  return name.split(".")[0].split(",")[1].strip()

In [34]:
dataset['Title'] = dataset['Name'].apply(title)
testset['Title'] = testset['Name'].apply(title)

In [35]:
dataset['Title'].value_counts()

Mr              517
Miss            182
Mrs             125
Master           40
Dr                7
Rev               6
Major             2
Col               2
Mlle              2
Ms                1
Capt              1
Lady              1
Sir               1
the Countess      1
Jonkheer          1
Don               1
Mme               1
Name: Title, dtype: int64

Let's form five title groups: Mr, Miss, Mrs, Master and Other. (In the test dataset, there is also one person with the title Dona.)
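The same extraction and grouping can also be written with vectorized string methods. This is a sketch under a few assumptions: the regular expression, the `common` list and the last name in the sample ("Doe, Dr. John") are made up for illustration. Unlike a list of rare titles, this variant keeps the four common titles and sends everything else to "Other", so a title that only appears in the test data is covered as well.

```python
import pandas as pd

# Three names from the training data plus one hypothetical name with a rare title.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Cumings, Mrs. John Bradley (Florence Briggs Th...",
    "Doe, Dr. John",  # hypothetical example
])

# The title is the text between the comma and the first period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Keep the four common titles and map all others to "Other".
common = ["Mr", "Miss", "Mrs", "Master"]
title_group = titles.where(titles.isin(common), "Other")
print(title_group.tolist())
```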

In [36]:
def title_group(title):
  rest = ["Dr", "Rev", "Mlle", "Major", "Col", "the Countess", "Mme", "Jonkheer", "Sir", "Lady", "Ms", "Don", "Dona", "Capt"]

  if title not in rest:
    return title
  else:
    return "Other"

In [37]:
dataset['TitleGroup'] = dataset['Title'].apply(title_group)
testset['TitleGroup'] = testset['Title'].apply(title_group)

In [38]:
dataset['TitleGroup'].value_counts()

Mr        517
Miss      182
Mrs       125
Master     40
Other      27
Name: TitleGroup, dtype: int64

In [39]:
dataset.groupby('TitleGroup').mean(numeric_only=True)

TitleGroup,Survived,Pclass,Age,SibSp,Parch,Fare,RoundFare,FamilySize
Master,0.575,2.625,6.91675,2.3,1.375,34.703125,34.725,3.675
Miss,0.697802,2.307692,23.005495,0.714286,0.549451,43.797873,43.796703,1.263736
Mr,0.156673,2.410058,31.362669,0.288201,0.152805,24.44156,24.464217,0.441006
Mrs,0.792,2.0,34.824,0.696,0.832,45.138533,45.104,1.528
Other,0.444444,1.333333,41.851852,0.296296,0.074074,39.111422,39.148148,0.37037


So once again: women and children are way more likely to survive than men.

In [40]:
dataset.head()

,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,RoundFare,AgeGroup,FamilySize,TravelledAlone,Title,TitleGroup
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S,7.0,group_2,1,not_alone,Mr,Mr
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,71.0,group_3,1,not_alone,Mrs,Mrs
2,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S,8.0,group_2,0,alone,Miss,Miss
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S,53.0,group_3,1,not_alone,Mrs,Mrs
4,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S,8.0,group_3,0,alone,Mr,Mr


Let's drop the columns / features we don't need anymore. Title can be dropped because the information we use from it is contained in TitleGroup. Ticket has too many distinct values to derive features from easily, so it is dropped as well. Name is not needed anymore either.

In [41]:
dataset = dataset.drop(['Name', 'Title', 'Ticket'], axis=1)
testset = testset.drop(['Name', 'Title', 'Ticket'], axis=1)

In [42]:
# index=False keeps the row index out of the saved files
dataset.to_csv('dataset.csv', index=False)
testset.to_csv('testset.csv', index=False)