
## Mini Project

- Let's take on the `titanic` competition!

- In this project we solve a simple classification problem.

- We implement machine learning models with sklearn.

- We follow the Machine Learning Workflow step by step.


Source : https://www.kaggle.com/c/titanic

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.simplefilter('ignore')

In [None]:
# Load the titanic data
base_path = './titanic_data/'
train = pd.read_csv(base_path + 'train.csv')
test = pd.read_csv(base_path + 'test.csv')
submission = pd.read_csv(base_path + 'gender_submission.csv')
# pd.read_csv('./titanic_data/test.csv')

In [None]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


### Data Preprocessing

1. Handling missing values


2. Feature selection (drop the columns that will not be used in the analysis)

In [None]:
train[train.isna().any(axis=1)] # rows in which at least one feature is NaN

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 884 | 885 | 0 | 3 | Sutehall, Mr. Henry Jr | male | 25.0 | 0 | 0 | SOTON/OQ 392076 | 7.0500 | NaN | S |
| 885 | 886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39.0 | 0 | 5 | 382652 | 29.1250 | NaN | Q |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |


In [None]:
# Count the missing values in each column of the titanic data.

train.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [None]:
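# Find the rows where the Embarked column is NaN.
train[train['Embarked'].isna()] # both are first-class females with the same ticket number --> they most likely boarded at the same port


# train['Embarked'].value_counts() --> option 1: fill with the most frequent value (S)


## Option 2: check where passengers with Pclass == 1 and Sex == 'female' (similar rows) mostly boarded

# train[(train['Sex'] == 'female') & (train['Pclass'] == 1)]['Embarked'].value_counts()


## Conclusion: most passengers boarded at port S, and S is also the most common port among
## passengers with similar attributes, so fill the NaN values with S.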

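| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 61 | 62 | 1 | 1 | Icard, Miss. Amelie | female | 38.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN |
| 829 | 830 | 1 | 1 | Stone, Mrs. George Nelson (Martha Evelyn) | female | 62.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN |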


In [None]:
train.loc[train['Embarked'].isna(), 'Embarked'] = 'S'

In [None]:
train[(train['Sex'] == 'female') & (train['Pclass'] == 1)]['Embarked'].value_counts()

S    50
C    43
Q     1
Name: Embarked, dtype: int64

In [None]:
train.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64

In [None]:
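# Handle the remaining missing values.
# For each column: drop it, or fill it?

## Cabin is missing in 687 of 891 rows and is judged to have little effect on predicting survival, so drop the whole column.
## Also drop the other columns judged to have no effect on predicting survival.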

train = train.drop(columns=['Cabin', 'Ticket', 'Name', 'PassengerId'])

In [None]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    object 
 3   Age       714 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
 6   Fare      891 non-null    float64
 7   Embarked  891 non-null    object 
dtypes: float64(2), int64(4), object(2)
memory usage: 55.8+ KB


In [None]:
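## Fill the 'Age' column --> use the mean (other strategies are equally valid)

# Fill only 'Age'; calling fillna on the whole frame would put the mean age into any other column with NaN.
train['Age'] = train['Age'].fillna(train['Age'].mean())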

In [None]:
train.isna().sum()

Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64
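
Mean imputation is only one of several options. As a minimal sketch of one alternative, the block below fills each missing age with the median age of passengers sharing the same `Pclass` and `Sex`. The `toy` frame and its values are made up for illustration and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are illustrative only.
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Sex': ['female', 'female', 'female', 'male', 'male', 'male'],
    'Age': [30.0, 40.0, np.nan, 20.0, np.nan, 24.0],
})

# Median age within each (Pclass, Sex) group, broadcast back to every row.
group_median = toy.groupby(['Pclass', 'Sex'])['Age'].transform('median')

# Fill only the missing ages; observed ages are left untouched.
toy['Age'] = toy['Age'].fillna(group_median)
print(toy)
```

Whether a group-wise median actually helps on this dataset is an empirical question; it would need to be checked with a validation score.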

### Feature Engineering

1. Categorical feature encoding

2. Normalization

In [None]:
## Ordinal Encoding -> used for ordinal features (categories with a natural order), e.g. preference ratings
## One-hot Encoding -> used for nominal features (categories without an order), e.g. sex, department
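
The difference between the two encodings can be sketched on a small made-up frame. The column names `satisfaction` and `department` and the `order` mapping are assumptions chosen for illustration; they do not come from the Titanic data.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
toy = pd.DataFrame({
    'satisfaction': ['low', 'high', 'medium', 'low'],  # ordinal: has a natural order
    'department': ['sales', 'hr', 'sales', 'dev'],     # nominal: no order
})

# Ordinal encoding: map each level to an integer that preserves the order.
order = {'low': 0, 'medium': 1, 'high': 2}
toy['satisfaction_encoded'] = toy['satisfaction'].map(order)

# One-hot encoding: one indicator column per category.
encoded = pd.get_dummies(toy, columns=['department'], dtype=int)
print(encoded)
```

Encoding a nominal feature with integers would impose an order that does not exist (for example `dev < hr < sales`), which is why one-hot encoding is used for `Sex` and `Embarked` in this notebook.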

In [None]:
## categorical-feature --> one-hot Encoding

train_OHE = pd.get_dummies(data=train, columns=['Sex', 'Embarked'], drop_first=True)
train_OHE

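| | Survived | Pclass | Age | SibSp | Parch | Fare | Sex_male | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 22.000000 | 1 | 0 | 7.2500 | 1 | 0 | 1 |
| 1 | 1 | 1 | 38.000000 | 1 | 0 | 71.2833 | 0 | 0 | 0 |
| 2 | 1 | 3 | 26.000000 | 0 | 0 | 7.9250 | 0 | 0 | 1 |
| 3 | 1 | 1 | 35.000000 | 1 | 0 | 53.1000 | 0 | 0 | 1 |
| 4 | 0 | 3 | 35.000000 | 0 | 0 | 8.0500 | 1 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 0 | 2 | 27.000000 | 0 | 0 | 13.0000 | 1 | 0 | 1 |
| 887 | 1 | 1 | 19.000000 | 0 | 0 | 30.0000 | 0 | 0 | 1 |
| 888 | 0 | 3 | 29.699118 | 1 | 2 | 23.4500 | 0 | 0 | 1 |
| 889 | 1 | 1 | 26.000000 | 0 | 0 | 30.0000 | 1 | 0 | 0 |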


In [None]:
# Normalization --> Min-Max scaling

from sklearn.preprocessing import MinMaxScaler

In [None]:
X = train_OHE.drop(columns=['Survived']) # input matrix
y = train_OHE['Survived'] # target vector

In [None]:
model = MinMaxScaler()
scaled_data = model.fit_transform(X)

In [None]:
scaled_df = pd.DataFrame(data=scaled_data, columns=X.columns)
scaled_df.head(2)

## Every feature is now scaled to the range [0, 1]

| | Pclass | Age | SibSp | Parch | Fare | Sex_male | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.271174 | 0.125 | 0.0 | 0.014151 | 1.0 | 0.0 | 1.0 |
| 1 | 0.0 | 0.472229 | 0.125 | 0.0 | 0.139136 | 0.0 | 0.0 | 0.0 |
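
Min-max scaling maps each column with $x' = (x - x_{min}) / (x_{max} - x_{min})$, so the column minimum becomes 0 and the maximum becomes 1. The sketch below checks this formula against `MinMaxScaler` on a made-up single-column array; the `fare` values are illustrative and are not the Titanic fares.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature column; the values are illustrative only.
fare = np.array([[0.0], [10.0], [25.0], [100.0]])

# Manual min-max scaling: (x - min) / (max - min), computed per column.
manual = (fare - fare.min(axis=0)) / (fare.max(axis=0) - fare.min(axis=0))

# The same transformation via scikit-learn.
scaled = MinMaxScaler().fit_transform(fare)

print(manual.ravel())
print(np.allclose(manual, scaled))
```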


### Training 

In [None]:
# Import the classification models covered in the sklearn lessons.
# Linear classifier (SGDClassifier)
# LogisticRegression
# DecisionTreeClassifier
# RandomForestClassifier
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [None]:
## Evaluation metric

from sklearn.metrics import accuracy_score

In [None]:
clf_1 = SGDClassifier()
clf_2 = LogisticRegression()
clf_3 = DecisionTreeClassifier()
clf_4 = RandomForestClassifier()

In [None]:
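# Train (note: the models are fit on the unscaled X; scaled_df is not used here)
clf_1.fit(X, y)
clf_2.fit(X, y)
clf_3.fit(X, y)
clf_4.fit(X, y)


## Predictions on the training data (training performance)

pred_1 = clf_1.predict(X)
pred_2 = clf_2.predict(X)
pred_3 = clf_3.predict(X)
pred_4 = clf_4.predict(X)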

In [None]:
# Evaluate (accuracy on the training data itself)

accuracy_score(y, pred_1), accuracy_score(y, pred_2), accuracy_score(y, pred_3), accuracy_score(y, pred_4)

(0.6161616161616161,
 0.8035914702581369,
 0.9820426487093153,
 0.9820426487093153)
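
These scores are measured on the same rows the models were fit on, so the high values for the tree-based models may partly reflect memorization rather than generalization. A minimal sketch of a less optimistic estimate, using 5-fold cross-validation, is shown below. It runs on synthetic data from `make_classification` as a stand-in for `X` and `y`; the names `X_demo`, `y_demo`, and the hyperparameters are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (X, y); this is not the Titanic data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# Accuracy on the rows the model was fit on (optimistic).
train_acc = clf.fit(X_demo, y_demo).score(X_demo, y_demo)

# 5-fold cross-validated accuracy: each fold is scored on rows not used for fitting.
cv_acc = cross_val_score(clf, X_demo, y_demo, cv=5, scoring='accuracy')

print(train_acc, cv_acc.mean())
```

The same call could be applied to each of `clf_1` through `clf_4` with the notebook's `X` and `y` to compare the models on held-out folds.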

### Test (Predict)

In [None]:
# Apply the same preprocessing to the test data, starting with dropping the unused columns.
test = test.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'])


In [None]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Pclass    418 non-null    int64  
 1   Sex       418 non-null    object 
 2   Age       332 non-null    float64
 3   SibSp     418 non-null    int64  
 4   Parch     418 non-null    int64  
 5   Fare      417 non-null    float64
 6   Embarked  418 non-null    object 
dtypes: float64(2), int64(3), object(2)
memory usage: 23.0+ KB


In [None]:
test['Age'] = test['Age'].fillna(train['Age'].mean())
test['Fare'] = test['Fare'].fillna(train['Fare'].mean())

In [None]:
test.isna().sum()

Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64

In [None]:
## One-hot Encoding

test = pd.get_dummies(data=test, columns=['Sex', 'Embarked'], drop_first=True)

In [None]:
test.head()

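| | Pclass | Age | SibSp | Parch | Fare | Sex_male | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 34.5 | 0 | 0 | 7.8292 | 1 | 1 | 0 |
| 1 | 3 | 47.0 | 1 | 0 | 7.0 | 0 | 0 | 1 |
| 2 | 2 | 62.0 | 0 | 0 | 9.6875 | 1 | 1 | 0 |
| 3 | 3 | 27.0 | 0 | 0 | 8.6625 | 1 | 0 | 1 |
| 4 | 3 | 22.0 | 1 | 1 | 12.2875 | 0 | 0 | 1 |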
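
Encoding train and test separately works here because both sets contain the same categories, so the resulting columns match. When a category is absent from one set, the columns can differ. The sketch below shows one way to align them with `DataFrame.reindex`; the frames `train_toy` and `test_toy` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frames; the test frame has no passenger from port 'Q'.
train_toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q'], 'Fare': [7.25, 71.28, 8.46]})
test_toy = pd.DataFrame({'Embarked': ['S', 'C'], 'Fare': [7.83, 9.69]})

train_enc = pd.get_dummies(train_toy, columns=['Embarked'], dtype=int)
test_enc = pd.get_dummies(test_toy, columns=['Embarked'], dtype=int)

# Reorder the test columns to match train and add any missing indicator as all zeros.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_enc)
```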


In [None]:
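## Scaling

# Reuse the scaler that was fit on the training features (transform only).
# Refitting on the test data would map the same raw value to different scaled values in train and test.
scaled_data = model.transform(test)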

scaled_df_2 = pd.DataFrame(data=scaled_data, columns=test.columns)
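
A scaler is meant to be fit once on the training data and then reused on new data, so that a given raw value maps to the same scaled value in both sets. The sketch below contrasts reusing a fitted scaler with refitting one on the test data. The arrays `train_demo` and `test_demo` are made-up values for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature arrays; the values are illustrative only.
train_demo = np.array([[0.0], [50.0], [100.0]])
test_demo = np.array([[10.0], [20.0]])

# Learn min=0 and max=100 from the training data only.
scaler = MinMaxScaler().fit(train_demo)

# Test values scaled with the training range.
reused = scaler.transform(test_demo)

# Test values scaled with their own range (inconsistent with training).
refit = MinMaxScaler().fit_transform(test_demo)

print(reused.ravel())
print(refit.ravel())
```

With the reused scaler a raw value of 10 becomes 0.1, as it would in the training set. With the refit scaler the same value becomes 0.0, which a model trained on the training scale would misread.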

In [None]:
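# clf_1 to clf_4 were fit on the unscaled X, so predict on the unscaled test features,
# which have the same columns as X. (scaled_df holds the 891 training rows, not the test rows.)
# To predict from scaled_df_2 instead, refit the models on scaled_df first.
pred_1 = clf_1.predict(test)
pred_2 = clf_2.predict(test)
pred_3 = clf_3.predict(test)
pred_4 = clf_4.predict(test)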

In [None]:
# Put the predictions into the submission frame (it is written to submission.csv in the last cell).
submission['Survived'] = pred_4
submission

| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 1 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |
| ... | ... | ... |
| 413 | 1305 | 0 |
| 414 | 1306 | 1 |
| 415 | 1307 | 0 |
| 416 | 1308 | 0 |


- Once all training is finished, you can submit the results.

- Why not submit the one model with the best test performance among those you built?

[Submit here] https://www.kaggle.com/c/titanic

In [None]:
submission.to_csv('./submission.csv', index=False)