# Titanic - Machine Learning from Disaster
### Public Leaderboard Score: 0.78468

[Kaggle Page](https://www.kaggle.com/c/titanic)

This notebook works through the code from the following post as practice.

<https://teddylee777.github.io/kaggle/kaggle(캐글)-Titanic-생존자예측-81-이상-달성하기>

In [1]:
import pandas as pd

## Read data

In [2]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
gender_submission = pd.read_csv('gender_submission.csv')

In [3]:
train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [4]:
test.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


In [5]:
gender_submission.head()

| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 1 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |


In [6]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [7]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


## Preprocessing

### All columns
Make all column names lowercase for easier coding.

In [8]:
train.columns = train.columns.str.lower()
test.columns = test.columns.str.lower()

### Sex
Change dtype from object to category and assign numeric values.

In [9]:
train['sex'].dtype

dtype('O')

In [10]:
train['sex'].value_counts()

male      577
female    314
Name: sex, dtype: int64

In [11]:
test['sex'].value_counts()

male      266
female    152
Name: sex, dtype: int64

In [12]:
train['sex_code'] = train['sex'].astype('category').cat.codes
test['sex_code'] = test['sex'].astype('category').cat.codes

In [13]:
train['sex_code'].value_counts()

1    577
0    314
Name: sex_code, dtype: int64

In [14]:
test['sex_code'].value_counts()

1    266
0    152
Name: sex_code, dtype: int64
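
Encoding each frame separately with `.cat.codes` works here because both frames contain the same two categories. As a hedged alternative, the sketch below uses an explicit mapping so the codes cannot drift if one frame is missing a category. The toy frames and the `SEX_CODES` name are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy frames standing in for train/test (assumption: the real ones come from the CSV files).
train_demo = pd.DataFrame({'sex': ['male', 'female', 'female']})
test_demo = pd.DataFrame({'sex': ['male', 'male']})  # no 'female' rows at all

# Hypothetical explicit mapping, chosen to match the codes shown above.
SEX_CODES = {'female': 0, 'male': 1}

train_demo['sex_code'] = train_demo['sex'].map(SEX_CODES)
test_demo['sex_code'] = test_demo['sex'].map(SEX_CODES)

# For comparison: per-frame category codes give 'male' the code 0 in the all-male frame.
naive = test_demo['sex'].astype('category').cat.codes

print(test_demo['sex_code'].tolist(), naive.tolist())
```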

### Embarked
Fill na values with the most common value, 'S'. Then change dtype from object to category and assign numeric values.

In [15]:
train['embarked'].isna().sum()

2

In [16]:
test['embarked'].isna().sum()

0

In [17]:
train['embarked'].value_counts()

S    644
C    168
Q     77
Name: embarked, dtype: int64

In [18]:
test['embarked'].value_counts()

S    270
C    102
Q     46
Name: embarked, dtype: int64

In [19]:
train['embarked'] = train['embarked'].fillna('S')

In [20]:
train['embarked'].isna().sum()

0

In [21]:
train['embarked_code'] = train['embarked'].astype('category').cat.codes
test['embarked_code'] = test['embarked'].astype('category').cat.codes

In [22]:
train['embarked_code'].value_counts()

2    646
0    168
1     77
Name: embarked_code, dtype: int64

In [23]:
test['embarked_code'].value_counts()

2    270
0    102
1     46
Name: embarked_code, dtype: int64
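
The same steps can be sketched with a shared categorical dtype, so that both frames are guaranteed to use the same port-to-code assignment even if a port is absent from one of them. This is a minimal sketch on toy data; `port_dtype` and the demo frames are illustrative names, not from the original notebook.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Toy frames (assumption): one missing port in train, and no 'C' in test.
train_demo = pd.DataFrame({'embarked': ['S', None, 'C', 'S', 'Q']})
test_demo = pd.DataFrame({'embarked': ['Q', 'S']})

# Fill missing ports with the most common value observed in the training frame.
most_common = train_demo['embarked'].mode()[0]
train_demo['embarked'] = train_demo['embarked'].fillna(most_common)

# One dtype shared by both frames: C -> 0, Q -> 1, S -> 2, as in the outputs above.
port_dtype = CategoricalDtype(categories=['C', 'Q', 'S'])
for df in (train_demo, test_demo):
    df['embarked_code'] = df['embarked'].astype(port_dtype).cat.codes

print(train_demo['embarked_code'].tolist(), test_demo['embarked_code'].tolist())
```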

### SibSp, Parch
Combine `sibsp` and `parch`, and add 1 (to count the passenger themselves) to create a new column, 'family'.

In [24]:
train['sibsp'].value_counts()

0    608
1    209
2     28
4     18
3     16
8      7
5      5
Name: sibsp, dtype: int64

In [25]:
train['parch'].value_counts()

0    678
1    118
2     80
5      5
3      5
4      4
6      1
Name: parch, dtype: int64

In [26]:
test['sibsp'].value_counts()

0    283
1    110
2     14
3      4
4      4
8      2
5      1
Name: sibsp, dtype: int64

In [27]:
test['parch'].value_counts()

0    324
1     52
2     33
3      3
4      2
9      2
6      1
5      1
Name: parch, dtype: int64

In [28]:
train['family'] = train['sibsp'] + train['parch'] + 1
test['family'] = test['sibsp'] + test['parch'] + 1

In [29]:
train['family'].value_counts()

1     537
2     161
3     102
4      29
6      22
5      15
7      12
11      7
8       6
Name: family, dtype: int64

In [30]:
test['family'].value_counts()

1     253
2      74
3      57
4      14
5       7
7       4
11      4
6       3
8       2
Name: family, dtype: int64

Add a 'solo' column to indicate whether the passenger is traveling alone.

In [31]:
train['solo'] = (train['family'] == 1)
test['solo'] = (test['family'] == 1)

In [32]:
train['solo'].value_counts()

True     537
False    354
Name: solo, dtype: int64

In [33]:
test['solo'].value_counts()

True     253
False    165
Name: solo, dtype: int64
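
As a small worked check of the two derived columns, the sketch below applies the same arithmetic to three rows from the `head()` outputs shown earlier (train rows 0 and 2, test row 4). The `demo` frame is only an illustration.

```python
import pandas as pd

# sibsp / parch values copied from train rows 0 and 2 and test row 4 shown earlier.
demo = pd.DataFrame(
    {'sibsp': [1, 0, 1], 'parch': [0, 0, 1]},
    index=['Braund', 'Heikkinen', 'Hirvonen'],
)

# Family size counts siblings/spouses, parents/children, and the passenger.
demo['family'] = demo['sibsp'] + demo['parch'] + 1
# A passenger is solo when the family consists of only themselves.
demo['solo'] = demo['family'] == 1

print(demo)
```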

### Fare
Fill the missing fare with the median fare of the passenger's pclass. Then bin the fares into 5 quantile buckets (roughly equal passenger counts per bucket) and assign a numeric code to each bucket.

In [34]:
train['fare'].isna().sum()

0

In [35]:
test['fare'].isna().sum()

1

In [36]:
test.loc[test['fare'].isnull()]

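| | passengerid | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | sex_code | embarked_code | family | solo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 152 | 1044 | 3 | Storey, Mr. Thomas | male | 60.5 | 0 | 0 | 3701 | NaN | NaN | S | 1 | 2 | 1 | True |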


In [37]:
test.groupby('pclass')['fare'].median()

pclass
1    60.0000
2    15.7500
3     7.8958
Name: fare, dtype: float64

In [38]:
test['fare'] = test.groupby('pclass')['fare'].transform(lambda x: x.fillna(x.median()))

In [39]:
test.groupby('pclass')['fare'].median()

pclass
1    60.0000
2    15.7500
3     7.8958
Name: fare, dtype: float64

In [40]:
test.loc[152:152, ]

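| | passengerid | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | sex_code | embarked_code | family | solo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 152 | 1044 | 3 | Storey, Mr. Thomas | male | 60.5 | 0 | 0 | 3701 | 7.8958 | NaN | S | 1 | 2 | 1 | True |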


In [41]:
pd.qcut(train['fare'], 5).value_counts()

(7.854, 10.5]        184
(21.679, 39.688]     180
(-0.001, 7.854]      179
(39.688, 512.329]    176
(10.5, 21.679]       172
Name: fare, dtype: int64

In [42]:
pd.qcut(test['fare'], 5).value_counts()

(-0.001, 7.796]     85
(10.667, 21.196]    84
(46.34, 512.329]    84
(21.196, 46.34]     83
(7.796, 10.667]     82
Name: fare, dtype: int64

In [43]:
train['fare_code'] = pd.qcut(train['fare'], 5).cat.codes 
test['fare_code'] = pd.qcut(test['fare'], 5).cat.codes 

In [44]:
train['fare_code'].value_counts()

1    184
3    180
0    179
4    176
2    172
Name: fare_code, dtype: int64

In [45]:
test['fare_code'].value_counts()

0    85
2    84
4    84
3    83
1    82
Name: fare_code, dtype: int64
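
Calling `pd.qcut` on each frame separately produces different bin edges (compare the two interval listings above), so the same fare can receive different codes in train and test. One possible variant, sketched below on illustrative fares, learns the edges on the training fares with `retbins=True` and reuses them on the test fares with `pd.cut`. This is an assumption about how one might align the bins, not what the original notebook does.

```python
import pandas as pd

# Illustrative fares (assumption); the real values come from train['fare'] / test['fare'].
train_fare = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05,
                        8.4583, 51.8625, 21.075, 11.1333, 30.0708])
test_fare = pd.Series([7.8292, 7.0, 9.6875, 8.6625, 12.2875])

# Learn quantile edges on the training fares and get integer codes directly.
train_code, edges = pd.qcut(train_fare, 5, labels=False, retbins=True)

# Open the outer edges so test fares outside the training range still land in a bin.
bins = [float('-inf'), *edges[1:-1], float('inf')]
test_code = pd.cut(test_fare, bins=bins, labels=False)

print(train_code.tolist())
print(test_code.tolist())
```

With this approach a given code always refers to the same fare range in both frames.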

### Name
Extract the title from each name, using the fact that every title ends with a period. Then group the rare titles into 'other' and convert the titles to numeric values.

In [46]:
train['name'].str.extract('([A-Za-z]+)[.]')

| | 0 |
|---|---|
| 0 | Mr |
| 1 | Mrs |
| 2 | Miss |
| 3 | Mrs |
| 4 | Mr |
| ... | ... |
| 886 | Rev |
| 887 | Miss |
| 888 | Miss |
| 889 | Mr |


In [47]:
train['name_code'] = train['name'].str.extract('([A-Za-z]+)[.]')

In [48]:
train['name_code'].value_counts()

Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Mlle          2
Major         2
Col           2
Countess      1
Capt          1
Ms            1
Sir           1
Lady          1
Mme           1
Don           1
Jonkheer      1
Name: name_code, dtype: int64

In [49]:
train['name_code'].value_counts().index

Index(['Mr', 'Miss', 'Mrs', 'Master', 'Dr', 'Rev', 'Mlle', 'Major', 'Col',
       'Countess', 'Capt', 'Ms', 'Sir', 'Lady', 'Mme', 'Don', 'Jonkheer'],
      dtype='object')

In [50]:
train['name_code'] = train['name_code'].replace(['Dr', 'Rev', 'Mlle', 'Major', 'Col',
       'Countess', 'Capt', 'Ms', 'Sir', 'Lady', 'Mme', 'Don', 'Jonkheer'], 'other')

In [51]:
train['name_code'].value_counts()

Mr        517
Miss      182
Mrs       125
Master     40
other      27
Name: name_code, dtype: int64

In [52]:
train['name_code'] = train['name_code'].astype('category').cat.codes

In [53]:
train['name_code'].value_counts()

2    517
1    182
3    125
0     40
4     27
Name: name_code, dtype: int64

In [54]:
test['name'].str.extract('([A-Za-z]+)[.]')

| | 0 |
|---|---|
| 0 | Mr |
| 1 | Mrs |
| 2 | Mr |
| 3 | Mr |
| 4 | Mrs |
| ... | ... |
| 413 | Mr |
| 414 | Dona |
| 415 | Mr |
| 416 | Mr |


In [55]:
test['name_code'] = test['name'].str.extract('([A-Za-z]+)[.]')

In [56]:
test['name_code'].value_counts()

Mr        240
Miss       78
Mrs        72
Master     21
Col         2
Rev         2
Ms          1
Dr          1
Dona        1
Name: name_code, dtype: int64

In [57]:
test['name_code'].value_counts().index

Index(['Mr', 'Miss', 'Mrs', 'Master', 'Col', 'Rev', 'Ms', 'Dr', 'Dona'], dtype='object')

In [58]:
test['name_code'] = test['name_code'].replace(['Col', 'Rev', 'Ms', 'Dr', 'Dona'], 'other')

In [59]:
test['name_code'].value_counts()

Mr        240
Miss       78
Mrs        72
Master     21
other       7
Name: name_code, dtype: int64

In [60]:
test['name_code'] = test['name_code'].astype('category').cat.codes

In [61]:
test['name_code'].value_counts()

2    240
1     78
3     72
0     21
4      7
Name: name_code, dtype: int64
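
The train and test cells above repeat the same steps with hand-written lists of rare titles. A minimal sketch of a shared helper follows; it treats every title outside the four common ones as 'other', so titles that appear in only one frame (such as 'Dona') need no special handling. The names `TITLE_CODES` and `title_code` are hypothetical, and the last sample name is invented for illustration.

```python
import pandas as pd

# Codes chosen to match the category codes shown above.
TITLE_CODES = {'Master': 0, 'Miss': 1, 'Mr': 2, 'Mrs': 3, 'other': 4}

def title_code(names: pd.Series) -> pd.Series:
    """Extract the word before the first period and map it to a numeric code."""
    titles = names.str.extract(r'([A-Za-z]+)\.', expand=False)
    # Anything that is not one of the common titles (including no match) becomes 'other'.
    titles = titles.where(titles.isin(list(TITLE_CODES)), 'other')
    return titles.map(TITLE_CODES)

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Wilkes, Mrs. James (Ellen Needs)',
    'Example, Dona. Hypothetical',  # invented name with a rare title
])
print(title_code(names).tolist())
```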

### Age
Fill na values with the median age of their 'name_code'. Then apply binning.

In [62]:
train['age'].isna().sum()

177

In [63]:
test['age'].isna().sum()

86

In [64]:
train['age'] = train.groupby('name_code')['age'].transform(lambda x: x.fillna(x.median()))
test['age'] = test.groupby('name_code')['age'].transform(lambda x: x.fillna(x.median()))

In [65]:
train['age'].isna().sum()

0

In [66]:
test['age'].isna().sum()

0
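
The cell above computes the group medians separately for each frame. A hedged variant, sketched below on toy data, computes the medians once on the training frame and reuses them for both frames, so the test set is imputed with the same values the model saw during training. The frames and `median_by_title` are illustrative names only.

```python
import pandas as pd

# Toy frames (assumption): name_code as produced in the Name section, some ages missing.
train_demo = pd.DataFrame({'name_code': [2, 2, 2, 1, 1],
                           'age': [22.0, 35.0, None, 26.0, None]})
test_demo = pd.DataFrame({'name_code': [2, 1],
                          'age': [None, 47.0]})

# Median age per title code, learned from the training frame only.
median_by_title = train_demo.groupby('name_code')['age'].median()

# Fill each missing age with the training median of the passenger's title code.
for df in (train_demo, test_demo):
    df['age'] = df['age'].fillna(df['name_code'].map(median_by_title))

print(train_demo['age'].tolist(), test_demo['age'].tolist())
```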

In [67]:
train.loc[ train['age'] <= 10, 'age_code'] = 0
train.loc[(train['age'] > 10) & (train['age'] <= 16), 'age_code'] = 1
train.loc[(train['age'] > 16) & (train['age'] <= 20), 'age_code'] = 2
train.loc[(train['age'] > 20) & (train['age'] <= 26), 'age_code'] = 3
train.loc[(train['age'] > 26) & (train['age'] <= 30), 'age_code'] = 4
train.loc[(train['age'] > 30) & (train['age'] <= 36), 'age_code'] = 5
train.loc[(train['age'] > 36) & (train['age'] <= 40), 'age_code'] = 6
train.loc[(train['age'] > 40) & (train['age'] <= 46), 'age_code'] = 7
train.loc[(train['age'] > 46) & (train['age'] <= 50), 'age_code'] = 8
train.loc[(train['age'] > 50) & (train['age'] <= 60), 'age_code'] = 9
train.loc[ train['age'] > 60, 'age_code'] = 10

test.loc[ test['age'] <= 10, 'age_code'] = 0
test.loc[(test['age'] > 10) & (test['age'] <= 16), 'age_code'] = 1
test.loc[(test['age'] > 16) & (test['age'] <= 20), 'age_code'] = 2
test.loc[(test['age'] > 20) & (test['age'] <= 26), 'age_code'] = 3
test.loc[(test['age'] > 26) & (test['age'] <= 30), 'age_code'] = 4
test.loc[(test['age'] > 30) & (test['age'] <= 36), 'age_code'] = 5
test.loc[(test['age'] > 36) & (test['age'] <= 40), 'age_code'] = 6
test.loc[(test['age'] > 40) & (test['age'] <= 46), 'age_code'] = 7
test.loc[(test['age'] > 46) & (test['age'] <= 50), 'age_code'] = 8
test.loc[(test['age'] > 50) & (test['age'] <= 60), 'age_code'] = 9
test.loc[ test['age'] > 60, 'age_code'] = 10

In [68]:
train['age_code'].value_counts()

4.0     209
3.0     176
5.0     127
2.0      79
0.0      68
7.0      53
6.0      45
9.0      42
1.0      36
8.0      34
10.0     22
Name: age_code, dtype: int64

In [69]:
test['age_code'].value_counts()

4.0     103
3.0      99
5.0      36
2.0      35
7.0      29
6.0      29
0.0      26
9.0      20
8.0      18
1.0      12
10.0     11
Name: age_code, dtype: int64
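
The eleven `.loc` assignments per frame can also be expressed as a single `pd.cut` call with the same right-inclusive boundaries. The sketch below is an equivalent formulation under that assumption; `AGE_EDGES` is a hypothetical name, and the sample ages were picked to sit on the bin boundaries. Note that it yields integer codes rather than the float codes shown above.

```python
import pandas as pd

# Same cut points as the .loc version: (-inf, 10], (10, 16], ..., (50, 60], (60, inf).
AGE_EDGES = [float('-inf'), 10, 16, 20, 26, 30, 36, 40, 46, 50, 60, float('inf')]

# Sample ages on the boundaries to check the right-inclusive behavior.
ages = pd.Series([0.42, 10, 16, 20, 26, 30, 36, 40, 46, 50, 60, 61])

# labels=False returns the integer bin index (0..10) instead of Interval labels.
age_code = pd.cut(ages, bins=AGE_EDGES, labels=False)

print(age_code.tolist())
```

The same call would apply unchanged to `train['age']` and `test['age']`.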

### Cabin
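Take the first letter of the cabin value and convert it to a numeric code. Then use the median code of each pclass to handle the missing values. Note that the `transform` calls below return the group median for every row, so known codes are overwritten as well and the column ends up with a single value per pclass.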

In [70]:
train['cabin'].isna().sum()

687

In [71]:
test['cabin'].isna().sum()

327

In [72]:
train['cabin'].value_counts()

B96 B98        4
G6             4
C23 C25 C27    4
C22 C26        3
F33            3
              ..
E34            1
C7             1
C54            1
E36            1
C148           1
Name: cabin, Length: 147, dtype: int64

In [73]:
test['cabin'].value_counts()

B57 B59 B63 B66    3
B45                2
C89                2
C55 C57            2
A34                2
                  ..
E52                1
D30                1
E31                1
C62 C64            1
C105               1
Name: cabin, Length: 76, dtype: int64

In [74]:
train['cabin_code'] = train['cabin'].str[:1]
train['cabin_code'].value_counts()

C    59
B    47
D    33
E    32
A    15
F    13
G     4
T     1
Name: cabin_code, dtype: int64

In [75]:
test['cabin_code'] = test['cabin'].str[:1]
test['cabin_code'].value_counts()

C    35
B    18
D    13
E     9
F     8
A     7
G     1
Name: cabin_code, dtype: int64

In [76]:
mapping = {
    'A': 0,
    'B': 1,
    'C': 2,
    'D': 3,
    'E': 4,
    'F': 5,
    'G': 6,
    'T': 7
}

In [77]:
train['cabin_code'] = train['cabin_code'].map(mapping)
train['cabin_code']

0      NaN
1      2.0
2      NaN
3      2.0
4      NaN
      ... 
886    NaN
887    1.0
888    NaN
889    2.0
890    NaN
Name: cabin_code, Length: 891, dtype: float64

In [78]:
test['cabin_code'] = test['cabin_code'].map(mapping)
test['cabin_code']

0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
      ... 
413    NaN
414    2.0
415    NaN
416    NaN
417    NaN
Name: cabin_code, Length: 418, dtype: float64

In [79]:
train['cabin_code'] = train.groupby('pclass')['cabin_code'].transform(lambda x: x.median())
train['cabin_code'].value_counts()

5.0    491
2.0    216
4.5    184
Name: cabin_code, dtype: int64

In [80]:
test['cabin_code'] = test.groupby('pclass')['cabin_code'].transform(lambda x: x.median())
test['cabin_code'].value_counts()

5.0    311
2.0    107
Name: cabin_code, dtype: int64
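
To make the difference concrete, the sketch below compares the `transform(lambda x: x.median())` call used above with a `fillna` variant that only touches the missing entries. The toy frame is an assumption for illustration.

```python
import pandas as pd

# Toy frame (assumption): two classes, each with one missing cabin code.
demo = pd.DataFrame({
    'pclass': [1, 1, 1, 3, 3, 3],
    'cabin_code': [2.0, 1.0, None, 5.0, None, 6.0],
})

grouped = demo.groupby('pclass')['cabin_code']

# As in the cells above: every row receives its class median, known values included.
overwritten = grouped.transform(lambda x: x.median())

# Variant: keep known codes and fill only the gaps with the class median.
filled = grouped.transform(lambda x: x.fillna(x.median()))

print(overwritten.tolist())
print(filled.tolist())
```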

### Features & label

In [81]:
train.columns

Index(['passengerid', 'survived', 'pclass', 'name', 'sex', 'age', 'sibsp',
       'parch', 'ticket', 'fare', 'cabin', 'embarked', 'sex_code',
       'embarked_code', 'family', 'solo', 'fare_code', 'name_code', 'age_code',
       'cabin_code'],
      dtype='object')

In [82]:
features = [
    'pclass', 
#     'sibsp',
#     'parch', 
    'sex_code',
    'embarked_code', 
    'family', 
#     'solo', 
    'fare_code', 
    'name_code', 
    'age_code',
#     'cabin_code'
]

label = 'survived'

### Modeling

In [83]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

In [84]:
x_train = train[features]
y_train = train[label]

k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=0)
cross_val_score(model, x_train, y_train, cv=k_fold, scoring='accuracy').mean()

0.8271535580524345
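
A single mean hides how much the accuracy varies between folds. The sketch below keeps the per-fold scores and reports their spread as well. It runs on synthetic data from `make_classification` as a stand-in for `x_train` / `y_train`, so the numbers it prints say nothing about the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in (assumption): 891 rows and 7 features, like the feature list above.
X_demo, y_demo = make_classification(n_samples=891, n_features=7, random_state=0)

k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=0)

# One accuracy value per fold instead of only the mean.
scores = cross_val_score(model, X_demo, y_demo, cv=k_fold, scoring='accuracy')

print(f'mean={scores.mean():.3f}  std={scores.std():.3f}')
```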

In [85]:
model.fit(x_train, y_train)

RandomForestClassifier(max_depth=6, n_estimators=50, random_state=0)

In [86]:
x_test = test[features]
gender_submission['Survived'] = model.predict(x_test)
gender_submission.to_csv('titanic_submission.csv', index=False)