https://dacon.io/competitions/open/235539/overview/description

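The Titanic is one of the most famous shipwrecks in history.

On April 15, 1912, the Titanic sank after colliding with an iceberg during its voyage. Of the 2,224 people on board, 1,502 died; the tragedy led to improved ship safety regulations.

In this task, you analyze what kinds of people were more likely to survive.
Then use machine learning to predict which passengers survived.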

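1. train.csv / test.csv : personal information for a subset of Titanic passengers (survival status is in train.csv only)
PassengerId : unique passenger ID
Survived : whether the passenger survived (0: died, 1: survived)
Pclass : passenger class
Name : name
Sex : sex
Age : age
SibSp : number of siblings and spouses aboard
Parch : number of parents and children aboard
Ticket : ticket number
Fare : ticket fare
Cabin : cabin number
Embarked : port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

2. sample_submission.csv : example of the submission file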
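
The first part of the task (finding which kinds of passengers survived more often) can be sketched as a group-wise mean of `Survived`. The frame below is an assumption made for illustration: it hard-codes only the five rows that `df_train.head()` prints in this notebook, so the rates are not representative of the full data.

```python
import pandas as pd

# Five rows copied from the df_train.head() output; illustration only.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
})

# The mean of a 0/1 column is the share of survivors in each group.
rate_by_sex = sample.groupby("Sex")["Survived"].mean()
rate_by_class = sample.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```

Running the same two `groupby` calls on the full `df_train` would give the survival rate per sex and per passenger class.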

In [34]:
# !pip install tensorflow
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder

import tensorflow as tf
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate

from sklearn.linear_model    import LogisticRegression
from sklearn.linear_model    import SGDClassifier
from sklearn.metrics         import mean_squared_error
from sklearn.metrics         import accuracy_score 

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import optimizers

In [2]:
path = './dataset/타이타닉/'

In [3]:
df_train = pd.read_csv(path + 'train.csv')
df_test = pd.read_csv(path + 'test.csv')
df_submission = pd.read_csv(path + 'submission.csv')

In [4]:
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [5]:
df_test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [6]:
df_submission.head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1


## Data preprocessing

In [7]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [8]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


### - Drop Name, judged to be unrelated to survival

In [9]:
df_train.drop(columns=['Name'], inplace= True)
df_test.drop(columns=['Name'], inplace= True)

In [10]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Sex          891 non-null    object 
 4   Age          714 non-null    float64
 5   SibSp        891 non-null    int64  
 6   Parch        891 non-null    int64  
 7   Ticket       891 non-null    object 
 8   Fare         891 non-null    float64
 9   Cabin        204 non-null    object 
 10  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(4)
memory usage: 76.7+ KB


### - Fill missing Age values with the mean

In [11]:
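# Assign the result back: a chained fillna(inplace=True) does not update the DataFrame under pandas Copy-on-Write.
df_train['Age'] = df_train['Age'].fillna(df_train['Age'].mean())
df_test['Age'] = df_test['Age'].fillna(df_test['Age'].mean())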
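
A variation worth considering, not what this notebook does: fill both sets with the mean computed on the training data only, so that no statistic from the test set enters preprocessing. The sketch below uses small made-up frames (`train`, `test`) as assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for df_train and df_test.
train = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0]})
test = pd.DataFrame({"Age": [34.5, np.nan]})

# mean() skips NaN by default, so this is the mean of the known ages.
train_mean = train["Age"].mean()

# Assign the filled column back instead of using inplace=True.
train["Age"] = train["Age"].fillna(train_mean)
test["Age"] = test["Age"].fillna(train_mean)
```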

In [12]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Sex          891 non-null    object 
 4   Age          891 non-null    float64
 5   SibSp        891 non-null    int64  
 6   Parch        891 non-null    int64  
 7   Ticket       891 non-null    object 
 8   Fare         891 non-null    float64
 9   Cabin        204 non-null    object 
 10  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(4)
memory usage: 76.7+ KB


### - Convert Ticket to a numeric type

In [13]:
df_train['Ticket'] = [i.split()[-1] for i in list(df_train['Ticket'])]
df_test['Ticket'] = [i.split()[-1] for i in list(df_test['Ticket'])]

In [14]:
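# Drop tickets recorded as 'LINE' (no number); .copy() avoids SettingWithCopyWarning on the next assignment.
df_train = df_train[df_train['Ticket'] != 'LINE'].copy()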

In [15]:
df_train['Ticket'] = df_train['Ticket'].astype('int64')
df_test['Ticket'] = df_test['Ticket'].astype('int64')
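
The conversion above keeps the last whitespace-separated token of each ticket and casts it to an integer, which fails on values such as `LINE`. A minimal sketch of an alternative, assuming it is acceptable to turn non-numeric tickets into `NaN` rather than dropping the rows up front, uses `pd.to_numeric` with `errors="coerce"`. The ticket strings are taken from the outputs shown earlier, plus `LINE`.

```python
import pandas as pd

tickets = pd.Series(["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "LINE"])

# Keep the last whitespace-separated token, e.g. "A/5 21171" -> "21171".
last_token = tickets.str.split().str[-1]

# Non-numeric tokens such as "LINE" become NaN instead of raising an error.
numeric = pd.to_numeric(last_token, errors="coerce")
print(numeric)
```

The resulting column is float because of the `NaN`; rows with `NaN` can then be dropped or imputed explicitly.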

In [16]:
df_train

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,male,22.000000,1,0,21171,7.2500,,S
1,2,1,1,female,38.000000,1,0,17599,71.2833,C85,C
2,3,1,3,female,26.000000,0,0,3101282,7.9250,,S
3,4,1,1,female,35.000000,1,0,113803,53.1000,C123,S
4,5,0,3,male,35.000000,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,male,27.000000,0,0,211536,13.0000,,S
887,888,1,1,female,19.000000,0,0,112053,30.0000,B42,S
888,889,0,3,female,29.699118,1,2,6607,23.4500,,S
889,890,1,1,male,26.000000,0,0,111369,30.0000,C148,C


### - Label-encode Sex, Cabin, and Embarked

In [17]:
df_train['Sex'] = np.where(df_train['Sex'] == 'male', 1 , 0)
df_test['Sex'] = np.where(df_test['Sex'] == 'male', 1 , 0)

In [18]:
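# Mark missing cabins with the placeholder 'n'; assign back instead of a chained inplace fillna.
df_train['Cabin'] = df_train['Cabin'].fillna('n')
df_test['Cabin'] = df_test['Cabin'].fillna('n')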

In [19]:
df_train['Cabin'] = df_train['Cabin'].str[:1]
df_test['Cabin'] = df_test['Cabin'].str[:1]

In [20]:
feature = ['Cabin', 'Embarked']

for i in feature:
    le = LabelEncoder()
    le = le.fit(df_train[i])
    df_train[i] = le.transform(df_train[i])
    df_test[i] = le.transform(df_test[i])
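
To make the Cabin encoding concrete, here is a small sketch on a handful of cabin values from the outputs above (the `cabin` Series itself is an assumption for illustration). `LabelEncoder` assigns integers in sorted order of the labels, so the codes carry no meaning beyond identity.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

cabin = pd.Series(["C85", None, "C123", None, "B42"])

# Same steps as the notebook: placeholder for missing values, then deck letter only.
deck = cabin.fillna("n").str[:1]

le = LabelEncoder().fit(deck)
codes = le.transform(deck)

# classes_ holds the sorted labels; the index of each label is its code.
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(mapping)
print(codes)
```

Note that `transform` raises a `ValueError` for labels that were not present when the encoder was fitted, so a test set containing a new deck letter would need separate handling.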

### - Check correlation coefficients

In [21]:
df_train.corr(method='pearson')

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
PassengerId,1.0,-0.004948,-0.033674,0.044442,0.032096,-0.058525,-0.002551,-0.022994,0.011458,-0.030008,0.014209
Survived,-0.004948,1.0,-0.338637,-0.544165,-0.068536,-0.036,0.081241,-0.0964,0.257248,-0.301241,-0.163238
Pclass,-0.033674,-0.338637,1.0,0.129507,-0.333193,0.085026,0.020252,0.286279,-0.548447,0.746279,0.155302
Sex,0.044442,-0.544165,0.129507,1.0,0.083756,-0.113249,-0.244337,0.07733,-0.180568,0.121563,0.102327
Age,0.032096,-0.068536,-0.333193,0.083756,1.0,-0.232747,-0.179189,-0.103852,0.092396,-0.250207,-0.022813
SibSp,-0.058525,-0.036,0.085026,-0.113249,-0.232747,1.0,0.414244,0.046018,0.158494,0.042705,0.067984
Parch,-0.002551,0.081241,0.020252,-0.244337,-0.179189,0.414244,1.0,-0.033529,0.215153,-0.031468,0.039614
Ticket,-0.022994,-0.0964,0.286279,0.07733,-0.103852,0.046018,-0.033529,1.0,-0.156916,0.190335,0.192277
Fare,0.011458,0.257248,-0.548447,-0.180568,0.092396,0.158494,0.215153,-0.156916,1.0,-0.522303,-0.219903
Cabin,-0.030008,-0.301241,0.746279,0.121563,-0.250207,0.042705,-0.031468,0.190335,-0.522303,1.0,0.185903


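- Judging by the correlation coefficients, the features most strongly associated with survival are Sex (-0.544), Pclass (-0.339), Fare (0.257), and Embarked (-0.163); Cabin (-0.301) shows a comparable correlation.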

## Train & Evaluate
- Drop features whose absolute correlation with Survived is 0.1 or below
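
The 0.1 cutoff can be applied programmatically rather than by reading the table. The sketch below hard-codes the `Survived` row of the correlation table above as a Series (an assumption made so the example is self-contained; in the notebook it would come from `df_train.corr()['Survived']`).

```python
import pandas as pd

# Correlation of each feature with Survived, copied from the table above.
corr_with_survived = pd.Series({
    "PassengerId": -0.004948,
    "Pclass": -0.338637,
    "Sex": -0.544165,
    "Age": -0.068536,
    "SibSp": -0.036,
    "Parch": 0.081241,
    "Ticket": -0.0964,
    "Fare": 0.257248,
    "Cabin": -0.301241,
    "Embarked": -0.163238,
})

# Keep features whose absolute correlation exceeds the cutoff.
threshold = 0.1
selected = corr_with_survived[corr_with_survived.abs() > threshold].index.tolist()
print(selected)
```

The selected names match the feature list used in the next cell.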

In [43]:
# x, y = df_train.loc[:, ['PassengerId', 'Pclass', 'Sex', 'SibSp', 'Parch',
#        'Ticket', 'Fare', 'Cabin', 'Embarked']], df_train.loc[:, 'Survived']  0.4104176210108414

# x, y = df_train.loc[:, ['Pclass', 'Sex', 'SibSp', 'Parch',
#        'Ticket', 'Fare', 'Cabin', 'Embarked']], df_train.loc[:, 'Survived']

x, y = df_train.loc[:, ['Pclass', 'Sex','Fare', 'Cabin', 'Embarked']], df_train.loc[:, 'Survived']

In [44]:
stf = StratifiedKFold(n_splits=3, shuffle= True)

print('stf.get_n_splits  : ', stf.get_n_splits(x, y))

model_lr = LogisticRegression()

model_lr_dl = Sequential()
model_lr_dl.add(Dense(1, input_dim=x.shape[1], activation='sigmoid'))

sgd = optimizers.SGD(learning_rate=0.01)  # 'lr' is deprecated in favor of 'learning_rate'
model_lr_dl.compile(optimizer=sgd, loss='binary_crossentropy', metrics=['binary_accuracy'])


score = []
for train_idx, test_idx in stf.split(x, y):
    X_train, X_test = x.values[train_idx], x.values[test_idx]
    y_train, y_test = y.values[train_idx], y.values[test_idx]

#     model_lr_dl.fit(X_train, y_train, epochs=300)
#     pred_y = model_lr_dl.predict(X_test)
    
    model_lr.fit(X_train, y_train)
    pred_y = model_lr.predict(X_test)
    score.append(accuracy_score(y_test, pred_y))
#     score.append(mean_squared_error(y_test, pred_y))
    

stf.get_n_splits  :  3



In [45]:
np.mean(score)
# score

0.7835585585585586
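
The manual fold loop above can also be written with `cross_val_score`. The following is a sketch under stated assumptions, not the notebook's exact setup: it uses synthetic data from `make_classification` in place of the Titanic features, fixes `random_state` so the folds are reproducible, and adds a `StandardScaler` step, which the notebook does not use.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the five selected features and the Survived label.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Same 3-fold stratified split as the notebook, but with a fixed seed.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Scaling inside the pipeline is refit on each training fold only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
mean_score = scores.mean()
print(scores, mean_score)
```

Without a fixed `random_state`, a shuffled `StratifiedKFold` produces different folds on each run, so the mean accuracy varies slightly between executions.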