https://dacon.io/competitions/open/235539/overview/description

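The Titanic is one of the most famous shipwrecks in history.

On April 15, 1912, the Titanic sank after colliding with an iceberg during its voyage. It was a tragedy in which 1,502 of the 2,224 people on board died, and it prompted improvements to ship safety regulations.

In this task, you analyze what kinds of people were more likely to survive.
Then use machine learning to predict which passengers survived.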

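1. train.csv / test.csv : personal information for a subset of Titanic passengers, plus the survival outcome (test.csv omits Survived)
PassengerId : unique passenger ID
Survived : whether the passenger survived (0: died, 1: survived)
Pclass : ticket class
Name : name
Sex : sex
Age : age
SibSp : number of siblings and spouses aboard
Parch : number of parents and children aboard
Ticket : ticket number
Fare : ticket fare
Cabin : cabin number
Embarked : port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

2. sample_submission.csv : example of the submission file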

In [84]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder

import tensorflow as tf
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate

from sklearn.linear_model    import LogisticRegression
from sklearn.linear_model    import SGDClassifier
from sklearn.metrics         import mean_squared_error

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import optimizers

In [2]:
path = './dataset/타이타닉/'

In [3]:
df_train = pd.read_csv(path + 'train.csv')
df_test = pd.read_csv(path + 'test.csv')
df_submission = pd.read_csv(path + 'submission.csv')

In [4]:
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [5]:
df_test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [6]:
df_submission.head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1


## Data preprocessing

In [7]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [8]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


### - Drop Name (judged unrelated to survival) and Cabin (mostly missing)

In [9]:
df_train.drop(columns=['Name', 'Cabin'], inplace= True)
df_test.drop(columns=['Name', 'Cabin'], inplace= True)
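
The `info()` output above shows Cabin with 204 non-null values out of 891 training rows, so roughly 77% of it is missing. A minimal sketch of how that ratio can be computed per column, on a small made-up frame (the `toy` data and the `missing_ratio` name are illustrative, not from the notebook):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train.csv; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

# Fraction of missing values per column, largest first.
missing_ratio = toy.isna().mean().sort_values(ascending=False)
print(missing_ratio)
```

Running the same two lines on `df_train` before the drop would give the per-column ratios for the real data.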

In [10]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Sex          891 non-null    object 
 4   Age          714 non-null    float64
 5   SibSp        891 non-null    int64  
 6   Parch        891 non-null    int64  
 7   Ticket       891 non-null    object 
 8   Fare         891 non-null    float64
 9   Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(3)
memory usage: 69.7+ KB


### - Drop rows that contain NaN values

In [11]:
df_train.dropna(axis=0, how='any', inplace=True)
df_test.dropna(axis=0, how='any', inplace=True)
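
Dropping rows shrinks the training set from 891 to 712 rows, and it also removes test passengers, which matters if a prediction is needed for every row of test.csv. The sketch below contrasts row-wise dropping with median imputation on made-up data. Imputation is an alternative shown for comparison, not what this notebook does, and the `toy`, `dropped`, and `filled` names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame; the values are made up for illustration only.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, np.nan, 53.10],
})

# Option 1 (what the notebook does): drop any row that contains a NaN.
dropped = toy.dropna(axis=0, how="any")

# Option 2 (an alternative, not the notebook's method):
# fill each column's NaN with that column's median, keeping every row.
filled = toy.fillna(toy.median(numeric_only=True))

print(len(toy), len(dropped), len(filled))
```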

In [12]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 712 entries, 0 to 890
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  712 non-null    int64  
 1   Survived     712 non-null    int64  
 2   Pclass       712 non-null    int64  
 3   Sex          712 non-null    object 
 4   Age          712 non-null    float64
 5   SibSp        712 non-null    int64  
 6   Parch        712 non-null    int64  
 7   Ticket       712 non-null    object 
 8   Fare         712 non-null    float64
 9   Embarked     712 non-null    object 
dtypes: float64(2), int64(5), object(3)
memory usage: 61.2+ KB


### - Convert Ticket to a numeric type

In [13]:
df_train['Ticket'] = [i.split()[-1] for i in list(df_train['Ticket'])]
df_test['Ticket'] = [i.split()[-1] for i in list(df_test['Ticket'])]

In [14]:
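# 'LINE' tickets have no numeric part, so drop them; .copy() avoids SettingWithCopyWarning on the next assignment
df_train = df_train[df_train['Ticket'] != 'LINE'].copy()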

In [15]:
df_train['Ticket'] = df_train['Ticket'].astype('int64')
df_test['Ticket'] = df_test['Ticket'].astype('int64')
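
The conversion above keeps the last whitespace-separated token of each ticket and casts it to an integer, which fails on tokens such as 'LINE'. As a rough sketch of the same idea, `pd.to_numeric` with `errors="coerce"` marks non-numeric tokens as NaN instead of raising. The ticket strings come from the `head()` output above, and the variable names are illustrative.

```python
import pandas as pd

# Ticket strings taken from the head() output above, plus the 'LINE' case.
tickets = pd.Series(["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "LINE"])

# Keep the last whitespace-separated token, as the notebook does.
last_token = tickets.str.split().str[-1]

# errors="coerce" turns non-numeric tokens such as 'LINE' into NaN instead of raising.
numeric = pd.to_numeric(last_token, errors="coerce")
print(numeric)
```

Rows that end up as NaN can then be dropped or handled explicitly before casting to `int64`.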

In [16]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 708 entries, 0 to 890
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  708 non-null    int64  
 1   Survived     708 non-null    int64  
 2   Pclass       708 non-null    int64  
 3   Sex          708 non-null    object 
 4   Age          708 non-null    float64
 5   SibSp        708 non-null    int64  
 6   Parch        708 non-null    int64  
 7   Ticket       708 non-null    int64  
 8   Fare         708 non-null    float64
 9   Embarked     708 non-null    object 
dtypes: float64(2), int64(6), object(2)
memory usage: 60.8+ KB


### - Label-encode Sex and Embarked

In [17]:
df_train['Sex'] = np.where(df_train['Sex'] == 'male', 1 , 0)
df_test['Sex'] = np.where(df_test['Sex'] == 'male', 1 , 0)

In [18]:
le = LabelEncoder()

le.fit(df_train['Embarked'])

df_train['Embarked'] = le.transform(df_train['Embarked'])
df_test['Embarked'] = le.transform(df_test['Embarked'])
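
`LabelEncoder` assigns integer codes in sorted order of the labels it sees, so the three ports map to 0, 1, and 2. A small sketch on a hand-written list shows the resulting mapping (the `mapping` and `encoded` names are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["S", "C", "Q", "S"])

# classes_ holds the labels in sorted order; the position of each label is its code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

encoded = le.transform(["S", "C", "Q"]).tolist()
print(encoded)
```

Fitting on the training column and reusing the same encoder on the test column, as the cell above does, keeps the codes consistent across both frames.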

### - Check the correlation coefficients

In [28]:
df_train.corr(method='pearson')

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
PassengerId,1.0,0.029619,-0.033548,0.026582,0.03246,-0.084239,-0.012923,-0.029173,0.008102,-0.000585
Survived,0.029619,1.0,-0.356557,-0.537611,-0.081042,-0.016561,0.094666,-0.104992,0.265914,-0.181642
Pclass,-0.033548,-0.356557,1.0,0.147527,-0.368364,0.068242,0.026348,0.322058,-0.551491,0.242228
Sex,0.026582,-0.537611,0.147527,1.0,0.098609,-0.104193,-0.247958,0.081889,-0.180174,0.107675
Age,0.03246,-0.081042,-0.368364,0.098609,1.0,-0.307621,-0.187863,-0.106422,0.094094,-0.03318
SibSp,-0.084239,-0.016561,0.068242,-0.104193,-0.307621,1.0,0.382362,0.095227,0.138106,0.034743
Parch,-0.012923,0.094666,0.026348,-0.247958,-0.187863,0.382362,1.0,-0.033503,0.205149,0.01331
Ticket,-0.029173,-0.104992,0.322058,0.081889,-0.106422,0.095227,-0.033503,1.0,-0.161993,0.191212
Fare,0.008102,0.265914,-0.551491,-0.180174,0.094094,0.138106,0.205149,-0.161993,1.0,-0.282148
Embarked,-0.000585,-0.181642,0.242228,0.107675,-0.03318,0.034743,0.01331,0.191212,-0.282148,1.0


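- Judging by the correlation coefficients, the features most strongly associated with survival are Sex (-0.54), Pclass (ticket class, -0.36), Fare (ticket fare, 0.27), and Embarked (port of embarkation, -0.18).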
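
Since both strongly negative and strongly positive values indicate association, it helps to rank features by absolute correlation. A minimal sketch using the Survived row of the table above, with the values copied by hand (`corr_with_survived` and `ranked` are illustrative names):

```python
import pandas as pd

# Correlation of each feature with Survived, copied from the table above.
corr_with_survived = pd.Series({
    "PassengerId": 0.029619,
    "Pclass": -0.356557,
    "Sex": -0.537611,
    "Age": -0.081042,
    "SibSp": -0.016561,
    "Parch": 0.094666,
    "Ticket": -0.104992,
    "Fare": 0.265914,
    "Embarked": -0.181642,
})

# Rank by magnitude so that the sign does not affect the ordering.
ranked = corr_with_survived.abs().sort_values(ascending=False)
print(ranked)
```

On the real frame the same ranking could be obtained with `df_train.corr()['Survived'].drop('Survived').abs().sort_values(ascending=False)`.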

## Train & Evaluate

In [60]:
x, y = df_train.loc[:, ['PassengerId', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Embarked']], df_train.loc[:, 'Survived']

In [86]:
# shuffle=True without random_state gives different folds on every run
stf = StratifiedKFold(n_splits=5, shuffle= True)

print('stf.get_n_splits  : ', stf.get_n_splits(x, y))

model_lr = LogisticRegression()

# Keras single-layer logistic regression (defined here but not trained below)
model_lr_dl = Sequential()
model_lr_dl.add(Dense(1, input_dim=x.shape[1], activation='sigmoid'))

# `lr` is deprecated in favor of `learning_rate`
sgd = optimizers.SGD(learning_rate=0.01)
model_lr_dl.compile(optimizer=sgd, loss='binary_crossentropy', metrics=['binary_accuracy'])


score = []
for train_idx, test_idx in stf.split(x, y):
    X_train, X_test = x.values[train_idx], x.values[test_idx]
    y_train, y_test = y.values[train_idx], y.values[test_idx]

#     model_lr_dl.fit(X_train, y_train, epochs=200)
#     pred_y = model_lr_dl.predict(X_test)
    
    model_lr.fit(X_train, y_train)
    pred_y = model_lr.predict(X_test)
    
    # On 0/1 labels, MSE equals the misclassification rate (1 - accuracy)
    score.append(mean_squared_error(y_test, pred_y))
    

stf.get_n_splits  :  5


In [111]:
np.mean(score)

0.3855558885226251
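
Because `predict` returns 0/1 class labels, the mean squared error here is the fraction of misclassified passengers, so a score of about 0.386 corresponds to an accuracy of about 0.614. A short check of that identity on made-up labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Made-up labels and predictions; two of the five predictions are wrong.
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 0])

mse = mean_squared_error(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)

# For 0/1 labels each squared error is either 0 or 1, so MSE = 1 - accuracy.
print(mse, acc)
```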

### - Drop features with an absolute correlation of about 0.1 or less, then retrain

In [90]:
df_train.corr(method='pearson').iloc[:, 1]

PassengerId    0.029619
Survived       1.000000
Pclass        -0.356557
Sex           -0.537611
Age           -0.081042
SibSp         -0.016561
Parch          0.094666
Ticket        -0.104992
Fare           0.265914
Embarked      -0.181642
Name: Survived, dtype: float64

In [133]:
df_train_modify = df_train.drop(columns= ['PassengerId', 'Age', 'SibSp', 'Parch', 'Ticket'])

In [134]:
df_train_modify.keys()

Index(['Survived', 'Pclass', 'Sex', 'Fare', 'Embarked'], dtype='object')

In [139]:
x_1, y_1 = df_train_modify.loc[:, ['Pclass', 'Sex', 'Fare', 'Embarked']], df_train_modify.loc[:, 'Survived']

In [147]:
# shuffle=True without random_state gives different folds on every run
stf = StratifiedKFold(n_splits=5, shuffle= True)

print('stf.get_n_splits  : ', stf.get_n_splits(x_1, y_1))

model_lr = LogisticRegression()
model_sgd = SGDClassifier()

# Keras single-layer logistic regression (defined here but not trained below)
model_lr_dl = Sequential()
model_lr_dl.add(Dense(1, input_dim=x_1.shape[1], activation='sigmoid'))

# `lr` is deprecated in favor of `learning_rate`
sgd = optimizers.SGD(learning_rate=0.01)
model_lr_dl.compile(optimizer=sgd, loss='binary_crossentropy', metrics=['binary_accuracy'])


score = []
for train_idx, test_idx in stf.split(x_1, y_1):
    X_train, X_test = x_1.values[train_idx], x_1.values[test_idx]
    y_train, y_test = y_1.values[train_idx], y_1.values[test_idx]

#     model_lr_dl.fit(X_train, y_train, epochs=200)
#     pred_y = model_lr_dl.predict(X_test)
    
    # This run evaluates SGDClassifier, not LogisticRegression
    model_sgd.fit(X_train, y_train)
    pred_y = model_sgd.predict(X_test)
    
    # On 0/1 labels, MSE equals the misclassification rate (1 - accuracy)
    score.append(mean_squared_error(y_test, pred_y))
    

stf.get_n_splits  :  5


In [148]:
np.mean(score)

0.3248326840475477
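
`SGDClassifier` is sensitive to feature scale, and Fare spans a much wider range than Pclass or Sex. One option, not used in this notebook, is to standardize the features inside a pipeline and fix the random seeds so the folds are reproducible. The sketch below runs on synthetic data from `make_classification`, so its numbers say nothing about the Titanic scores. The pipeline layout and all variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four selected features; this is not the Titanic data.
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Scaling lives inside the pipeline, so each fold is scaled using only its training split.
pipe = make_pipeline(StandardScaler(), SGDClassifier(random_state=0))

# A fixed random_state makes the shuffled folds reproducible.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
print(scores.mean())
```

With the real data, `X` and `y` would be replaced by `x_1.values` and `y_1.values`.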

In [132]:
df_train_modify.corr(method='pearson').iloc[:, 0]

Survived    1.000000
Pclass     -0.356557
Sex        -0.537611
Ticket     -0.104992
Fare        0.265914
Embarked   -0.181642
Name: Survived, dtype: float64