# Predict which passengers survived the Titanic tragedy

---

### Contents:
1. Overview
2. Understand the data
3. Exploratory analysis
4. Feature engineering
5. Predict using different machine learning models
6. Summary








# 1. Overview
### Background (copied from the Kaggle competition description)
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this tutorial, we will analyze what sorts of people were likely to survive.

The dataset is downloaded from a public website and is already in tabular format, so we skip data collection and data wrangling.

# 2. Understand the data 

In this section, we will complete the following tasks:
1. Load the data and look at the overall data info
2. Explain each column
3. Check for duplicates and null values
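
As a minimal sketch of the third task, the checks for duplicates and nulls can be expressed with a few pandas calls. The `toy` frame below is a hypothetical stand-in with made-up values, not rows from the real `train.csv`; the same calls would apply to `df_train` once it is loaded.

```python
import pandas as pd

# Hypothetical toy frame (illustrative values only, not the real dataset).
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 3],
    "Age": [22.0, None, 26.0, 26.0],
    "Cabin": [None, "C85", "E46", "E46"],
})

# Number of rows that are exact copies of an earlier row.
n_duplicate_rows = int(toy.duplicated().sum())

# Whether the identifier column on its own contains no repeated values.
ids_unique = toy["PassengerId"].is_unique

# Null count per column.
null_counts = toy.isnull().sum()

print("duplicate rows:", n_duplicate_rows)
print("ids unique:", ids_unique)
print(null_counts)
```

Checking full-row duplicates and identifier uniqueness separately is a deliberate choice: two rows can share an id while differing elsewhere, which `duplicated()` alone would not flag.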

In [2]:
import pandas as pd

In [3]:
df_train = pd.read_csv("../../data/Titanic/train.csv")
df_test = pd.read_csv("../../data/Titanic/test.csv")

Let's look at each column:
- PassengerId: unique identifier, no duplicates
- Survived: survival flag. The background states that 1502 of 2224 people (about 68%) died, and 0 is the majority value in the training set (about 62%), so 0 means did not survive and 1 means survived
- Pclass: ticket class, 3 classes (1, 2, 3)
- Name: passenger's name
- Sex: gender
- Age: age, 177 rows have no age
- SibSp: number of siblings and spouses aboard
- Parch: number of parents and children aboard
- Ticket: ticket number
- Fare: ticket fare
- Cabin: cabin number, only 204 rows have an identified cabin
- Embarked: port of embarkation (S, C, Q), 2 rows are missing it

In [6]:
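print("train set size:", df_train.shape)
print("test set size:", df_test.shape)
print("="*30)
# Note: DataFrame.ftypes was removed in pandas 1.0; on newer versions use df_train.dtypes
print(df_train.ftypes)
print("="*30)
# info() prints its report itself and returns None, so the wrapping print() also emits "None"
print(df_train.info(verbose=True))
print("="*30)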

train set size: (891, 12)
test set size: (418, 11)
PassengerId      int64:dense
Survived         int64:dense
Pclass           int64:dense
Name            object:dense
Sex             object:dense
Age            float64:dense
SibSp            int64:dense
Parch            int64:dense
Ticket          object:dense
Fare           float64:dense
Cabin           object:dense
Embarked        object:dense
dtype: object
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None
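
The shapes above show 12 columns in the training set and 11 in the test set. One way to see which column differs is to compare the column indexes. The sketch below uses two hypothetical empty frames with a shortened column list; it assumes the missing column is the target `Survived`, which is worth confirming on the real `df_train` and `df_test`.

```python
import pandas as pd

# Hypothetical frames with a shortened column list (no real data).
toy_train = pd.DataFrame(columns=["PassengerId", "Survived", "Pclass", "Name"])
toy_test = pd.DataFrame(columns=["PassengerId", "Pclass", "Name"])

# Columns present in the training frame but absent from the test frame.
only_in_train = toy_train.columns.difference(toy_test.columns)

print(list(only_in_train))
```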

In [21]:
df_train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [22]:
df_train.tail()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |


In [27]:
# missing data
total_null_records = df_train.isnull().sum()
percent = total_null_records / df_train.isnull().count()
missing_data = pd.concat([total_null_records,percent], axis=1, keys=['total', 'percent']).sort_values(by='percent', ascending=False)
missing_data.head(20)

| | total | percent |
|---|---|---|
| Cabin | 687 | 0.771044 |
| Age | 177 | 0.198653 |
| Embarked | 2 | 0.002245 |
| PassengerId | 0 | 0.0 |
| Survived | 0 | 0.0 |
| Pclass | 0 | 0.0 |
| Name | 0 | 0.0 |
| Sex | 0 | 0.0 |
| SibSp | 0 | 0.0 |
| Parch | 0 | 0.0 |
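
The missing-value fraction can also be computed without the explicit division, because the mean of a boolean mask is the share of `True` values. The sketch below builds the same kind of summary on a hypothetical toy frame (made-up values, not the real data).

```python
import pandas as pd

# Hypothetical toy frame (illustrative values only).
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Sex": ["male", "female", "female", "male"],
})

# isnull().sum() counts the nulls; isnull().mean() gives their fraction per column.
missing = pd.DataFrame({
    "total": toy.isnull().sum(),
    "percent": toy.isnull().mean(),
}).sort_values(by="percent", ascending=False)

print(missing)
```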


In [23]:
# PassengerId
no_id_unique = df_train['PassengerId'].nunique()
print(no_id_unique)

891


In [24]:
# Survived
df_train['Survived'].value_counts(normalize=True)

0    0.616162
1    0.383838
Name: Survived, dtype: float64
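
The encoding of `Survived` can be sanity-checked with simple arithmetic: the background quotes 1502 deaths among 2224 people aboard, so the majority outcome is death, and the majority value in the training set should correspond to it. A small sketch, using only the figures quoted in this document:

```python
# Figures quoted in the background text.
killed, aboard = 1502, 2224
death_rate_overall = killed / aboard

# Share of rows with Survived == 0 in the training set (from the output above).
share_zero_train = 0.616162

print(round(death_rate_overall, 3))
print(round(share_zero_train, 3))

# Both shares are above one half, which is consistent with 0 meaning "did not survive".
zero_means_died = death_rate_overall > 0.5 and share_zero_train > 0.5
print(zero_means_died)
```

The two rates are not expected to match exactly, since the training set is only a sample of the passengers and the quoted totals include crew.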

In [25]:
# Pclass
df_train['Pclass'].value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

In [26]:
# Sex
df_train['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [26]:
# Age
print(df_train['Age'].value_counts(bins=10, ascending=False).sort_index(ascending=True))
print("="*30)
no_age_null = df_train['Age'].isnull().sum()
print(no_age_null)

(0.339, 8.378]       54
(8.378, 16.336]      46
(16.336, 24.294]    177
(24.294, 32.252]    169
(32.252, 40.21]     118
(40.21, 48.168]      70
(48.168, 56.126]     45
(56.126, 64.084]     24
(64.084, 72.042]      9
(72.042, 80.0]        2
Name: Age, dtype: int64
177
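
`value_counts(bins=10)` produces equal-width bins whose edges (such as 8.378) are hard to read. As an alternative sketch, `pd.cut` accepts explicit edges and labels. The edges, the labels, and the `ages` series below are all illustrative assumptions, not choices made in this notebook.

```python
import pandas as pd

# Hypothetical ages, including one missing value.
ages = pd.Series([0.42, 5.0, 15.0, 22.0, 38.0, 54.0, 80.0, None])

# Assumed bin edges and labels; intervals are right-closed, e.g. (0, 12].
edges = [0, 12, 18, 40, 60, 80]
labels = ["child", "teen", "adult", "middle-aged", "senior"]

age_group = pd.cut(ages, bins=edges, labels=labels)

# sort=False keeps the bins in their natural order instead of sorting by count.
counts = age_group.value_counts(sort=False)
print(counts)

# Missing ages stay missing after binning.
print(int(age_group.isnull().sum()))
```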


In [28]:
# SibSp
print(df_train['SibSp'].value_counts())

0    608
1    209
2     28
4     18
3     16
8      7
5      5
Name: SibSp, dtype: int64


In [29]:
# Parch
print(df_train['Parch'].value_counts(normalize=True))

0    0.760943
1    0.132435
2    0.089787
5    0.005612
3    0.005612
4    0.004489
6    0.001122
Name: Parch, dtype: float64


In [30]:
# Ticket
print(df_train['Ticket'].value_counts())

1601                 7
347082               7
CA. 2343             7
347088               6
CA 2144              6
3101295              6
S.O.C. 14879         5
382652               5
113781               4
W./C. 6608           4
113760               4
LINE                 4
4133                 4
17421                4
347077               4
349909               4
2666                 4
PC 17757             4
19950                4
363291               3
C.A. 34651           3
110152               3
29106                3
110413               3
PC 17755             3
239853               3
248727               3
13502                3
PC 17582             3
C.A. 31921           3
                    ..
113509               1
11813                1
PC 17756             1
C 7076               1
2700                 1
14313                1
31028                1
PC 17318             1
36967                1
236853               1
347063               1
250652               1
345774     

# Fare
print(df_train['Fare'].value_counts(bins=10))

In [32]:
# Cabin
print(df_train['Cabin'].isnull().sum())
print("="*30)
print(df_train['Cabin'].value_counts())

687
C23 C25 C27        4
G6                 4
B96 B98            4
D                  3
E101               3
F33                3
C22 C26            3
F2                 3
C93                2
C2                 2
C83                2
D20                2
F4                 2
D35                2
E25                2
C78                2
B57 B59 B63 B66    2
E33                2
E121               2
B18                2
C68                2
B5                 2
E24                2
D33                2
E67                2
C92                2
E8                 2
B22                2
D36                2
C124               2
                  ..
D50                1
C99                1
D11                1
A19                1
C104               1
C30                1
E17                1
E77                1
C110               1
B42                1
E10                1
E36                1
D19                1
T                  1
E34                1
D28                1
C128     
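
Because almost every cabin value is distinct, the raw counts are hard to interpret. One possible way to summarize the column is to keep only the leading letter of each value. The sketch below assumes that this letter identifies the deck, which the source does not state, and uses a few cabin strings taken from the output above plus one missing value.

```python
import pandas as pd

# A few cabin strings from the output above, plus one missing value.
cabins = pd.Series(["C23 C25 C27", "G6", "B96 B98", "D", "T", None])

# First character of each string; missing cabins remain missing.
deck = cabins.str[0]

print(deck.value_counts(dropna=False))
```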

In [33]:
# Embarked
print(df_train['Embarked'].isnull().sum())
print("="*30)
print(df_train['Embarked'].value_counts(normalize=True))

2
S    0.724409
C    0.188976
Q    0.086614
Name: Embarked, dtype: float64
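
The single-letter port codes can be made more readable by mapping them to port names. The mapping below follows the Kaggle data dictionary (S = Southampton, C = Cherbourg, Q = Queenstown); the `embarked` series is a hypothetical sample, not the real column.

```python
import pandas as pd

# Port codes as documented in the Kaggle data dictionary.
port_names = {"S": "Southampton", "C": "Cherbourg", "Q": "Queenstown"}

# Hypothetical sample with one missing value.
embarked = pd.Series(["S", "C", "Q", "S", None])

# Codes not found in the mapping (including missing values) become NaN.
readable = embarked.map(port_names)

print(readable.value_counts(normalize=True, dropna=False))
```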


**Going through every column like this is time-consuming, but it gives us a feel for the dataset.**