
## Mini Project

- Let's take on the `titanic` competition!

- In this project we solve a simple classification problem.

- We implement machine learning models with sklearn.

- We follow the machine learning workflow.


Source : https://www.kaggle.com/c/titanic

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
# Load the titanic data
import pandas as pd

base_path = "/content/drive/MyDrive/study/머신러닝_딥러닝_강의자료/2. 머신러닝/data/titanic/"

train = pd.read_csv(base_path + "train.csv")
test = pd.read_csv(base_path + "test.csv")
submission = pd.read_csv(base_path + "gender_submission.csv")

In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


### Data Preprocessing

1. Handling missing values


2. Feature selection (drop the columns that are not used in the analysis)

In [5]:
# Find the missing values in the titanic data.
# any(axis=1) flags every row that has at least one missing value.
train.isnull().any(axis=1)

0       True
1      False
2       True
3      False
4       True
       ...  
886     True
887    False
888     True
889    False
890     True
Length: 891, dtype: bool
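
The row-wise flags above say which passengers have a gap, but not which columns are responsible. A minimal sketch of a per-column count, using a hypothetical toy frame (`toy`) in place of the real `train` data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic training data.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Fraction of missing values in each column.
missing_ratio = toy.isnull().mean()
print(missing_ratio)
```

On the real data the same two calls would be `train.isnull().sum()` and `train.isnull().mean()`.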

In [22]:
# Find the rows where the Embarked column is NaN.
# Embarked is the port of embarkation.
train[train.Embarked.isnull()]

# train.Embarked.value_counts() --> "S"

# Passengers with Pclass 1 and Sex female

# train.loc[(train.Pclass == 1) & (train.Sex  == "female") , "Embarked"].value_counts() --> "S"
train.loc[train.Embarked.isnull(), "Embarked"] ="S"
train.loc[(train.Pclass == 1) & (train.Sex  == "female") , "Embarked"].value_counts()

S    50
C    43
Q     1
Name: Embarked, dtype: int64
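
The cell above hard-codes "S" after looking at the value counts by eye. A sketch of the same idea done programmatically, assuming a hypothetical toy frame (`toy`) and taking the most frequent port within the matching group:

```python
import pandas as pd

# Hypothetical toy frame; the real notebook would use `train`.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3],
    "Sex": ["female", "female", "female", "male", "male"],
    "Embarked": ["S", "S", None, "Q", "C"],
})

# Most frequent port among first-class female passengers.
# mode() ignores missing values by default.
group = toy[(toy.Pclass == 1) & (toy.Sex == "female")]
most_common = group["Embarked"].mode()[0]

# Fill the missing ports with that value.
toy["Embarked"] = toy["Embarked"].fillna(most_common)
print(toy)
```

When several values tie, `mode()` returns all of them and `[0]` picks the first in sorted order, so the choice is worth checking by hand on small groups.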

In [27]:
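# Handle the missing values.
# Should we drop the column or fill it in?

# train.Cabin.value_counts() --> Drop (only 204 of 891 values are present)
# Which columns should we drop?
# PassengerId, Name, Ticket, Cabin
# Run this once; re-running raises a KeyError because the columns are already gone.
train = train.drop(columns=["Ticket", "Cabin", "PassengerId", "Name"])
train.info()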

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    object 
 3   Age       714 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
 6   Fare      891 non-null    float64
 7   Embarked  891 non-null    object 
dtypes: float64(2), int64(4), object(2)
memory usage: 55.8+ KB


In [29]:
# Fill the "Age" column
# Fill the missing ages with the mean age
train["Age"] = train["Age"].fillna(train["Age"].mean())
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    object 
 3   Age       891 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
 6   Fare      891 non-null    float64
 7   Embarked  891 non-null    object 
dtypes: float64(2), int64(4), object(2)
memory usage: 55.8+ KB
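
A scalar passed to `DataFrame.fillna` is written into every column that has NaN values, not only the column the scalar was computed from. A minimal sketch of the difference between a scalar fill and a per-column fill, on a hypothetical toy frame (`toy`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in two different columns.
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Fare": [10.0, 30.0, np.nan],
})

# Scalar fill: the Age mean (30.0) also lands in the Fare column.
filled_all = toy.fillna(toy["Age"].mean())

# Dict fill: each column receives its own mean.
filled_per_column = toy.fillna({
    "Age": toy["Age"].mean(),
    "Fare": toy["Fare"].mean(),
})

print(filled_all)
print(filled_per_column)
```

The scalar form happens to be safe when only one column has gaps, but the dict form (or a column-wise assignment) stays correct when more columns have them.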


### Feature Engineering

1. Categorical feature encoding

2. Normalization

In [32]:
# categorical feature --> One-hot Encoding, Ordinal Encoding

# Ordinal Encoding => used for ordinal features, e.g. education level, preference rating, ...
# One-hot Encoding => used for nominal features, e.g. sex, department, school attended, ... (values that differ but have no meaningful order)

train_OHE = pd.get_dummies(train, columns=["Embarked", "Sex"])
train_OHE

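| | Survived | Pclass | Age | SibSp | Parch | Fare | Embarked_C | Embarked_Q | Embarked_S | Sex_female | Sex_male |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 22.000000 | 1 | 0 | 7.2500 | 0 | 0 | 1 | 0 | 1 |
| 1 | 1 | 1 | 38.000000 | 1 | 0 | 71.2833 | 1 | 0 | 0 | 1 | 0 |
| 2 | 1 | 3 | 26.000000 | 0 | 0 | 7.9250 | 0 | 0 | 1 | 1 | 0 |
| 3 | 1 | 1 | 35.000000 | 1 | 0 | 53.1000 | 0 | 0 | 1 | 1 | 0 |
| 4 | 0 | 3 | 35.000000 | 0 | 0 | 8.0500 | 0 | 0 | 1 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 0 | 2 | 27.000000 | 0 | 0 | 13.0000 | 0 | 0 | 1 | 0 | 1 |
| 887 | 1 | 1 | 19.000000 | 0 | 0 | 30.0000 | 0 | 0 | 1 | 1 | 0 |
| 888 | 0 | 3 | 29.699118 | 1 | 2 | 23.4500 | 0 | 0 | 1 | 1 | 0 |
| 889 | 1 | 1 | 26.000000 | 0 | 0 | 30.0000 | 1 | 0 | 0 | 0 | 1 |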
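
The comments in the cell above distinguish ordinal encoding from one-hot encoding. A small sketch of both on a hypothetical toy frame; the `education` and `department` columns and the category order are assumptions chosen to mirror the examples in those comments, not part of the Titanic data:

```python
import pandas as pd

# Hypothetical toy frame with one ordinal and one nominal feature.
toy = pd.DataFrame({
    "education": ["high school", "bachelor", "master", "bachelor"],
    "department": ["sales", "hr", "sales", "dev"],
})

# Ordinal encoding: the categories have a natural order,
# so each one maps to its rank in that order.
order = ["high school", "bachelor", "master"]
toy["education_code"] = pd.Categorical(
    toy["education"], categories=order, ordered=True
).codes

# One-hot encoding: the categories have no order,
# so each one becomes its own indicator column.
encoded = pd.get_dummies(toy, columns=["department"])
print(encoded)
```

Encoding a nominal feature with integer ranks would suggest an order that does not exist, which is why `Embarked` and `Sex` are one-hot encoded in this notebook.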


In [36]:
# Normalization --> Min-Max scaling

X, y = train_OHE.drop(columns="Survived"), train_OHE.Survived

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# scaler.fit()
# scaler.transform()
# X.Age = scaler.fit_transform(X.Age)    # fails: the scaler expects 2-D input
# X.Fare = scaler.fit_transform(X.Fare)  # fails for the same reason

temp = scaler.fit_transform(X.loc[:, ["Age", "Fare"]])
X["Age"] = temp[:,0]
X["Fare"] = temp[:,1]
X

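| | Pclass | Age | SibSp | Parch | Fare | Embarked_C | Embarked_Q | Embarked_S | Sex_female | Sex_male |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 0.271174 | 1 | 0 | 0.014151 | 0 | 0 | 1 | 0 | 1 |
| 1 | 1 | 0.472229 | 1 | 0 | 0.139136 | 1 | 0 | 0 | 1 | 0 |
| 2 | 3 | 0.321438 | 0 | 0 | 0.015469 | 0 | 0 | 1 | 1 | 0 |
| 3 | 1 | 0.434531 | 1 | 0 | 0.103644 | 0 | 0 | 1 | 1 | 0 |
| 4 | 3 | 0.434531 | 0 | 0 | 0.015713 | 0 | 0 | 1 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 2 | 0.334004 | 0 | 0 | 0.025374 | 0 | 0 | 1 | 0 | 1 |
| 887 | 1 | 0.233476 | 0 | 0 | 0.058556 | 0 | 0 | 1 | 1 | 0 |
| 888 | 3 | 0.367921 | 1 | 2 | 0.045771 | 0 | 0 | 1 | 1 | 0 |
| 889 | 1 | 0.321438 | 0 | 0 | 0.058556 | 1 | 0 | 0 | 0 | 1 |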
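
Min-max scaling maps each value to $x' = (x - x_{min}) / (x_{max} - x_{min})$, where the minimum and maximum come from the data the scaler was fitted on. A sketch that checks `MinMaxScaler` against this formula on hypothetical toy values (`toy`, `toy_scaler` and `new` are illustrative names):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy values standing in for the Age and Fare columns.
toy = pd.DataFrame({"Age": [10.0, 20.0, 50.0], "Fare": [0.0, 5.0, 20.0]})

toy_scaler = MinMaxScaler()
scaled = toy_scaler.fit_transform(toy[["Age", "Fare"]])

# The same result computed by hand, column by column.
manual = (toy - toy.min()) / (toy.max() - toy.min())
print(np.allclose(scaled, manual.to_numpy()))

# New data is scaled with the min and max learned during fit,
# so values outside the fitted range fall outside [0, 1].
new = pd.DataFrame({"Age": [60.0], "Fare": [10.0]})
scaled_new = toy_scaler.transform(new)
print(scaled_new)
```

This is why the test data later in the notebook goes through `scaler.transform` rather than a second `fit_transform`: both sets must share the same scale.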


### Training 

In [37]:
# Import the classification models we learned from sklearn.
# 1. Linear Classifier
from sklearn.linear_model import SGDClassifier

# 2. Logistic Regression
from sklearn.linear_model import LogisticRegression

# 3. Decision Tree
from sklearn.tree import DecisionTreeClassifier

# 4. Random Forest
from sklearn.ensemble import RandomForestClassifier

# Evaluation metric
from sklearn.metrics import accuracy_score

In [39]:
clf = SGDClassifier()
clf2 = LogisticRegression()
clf3 = DecisionTreeClassifier()
clf4 = RandomForestClassifier()

In [40]:
clf.fit(X,y)
clf2.fit(X,y)
clf3.fit(X,y)
clf4.fit(X,y)

pred = clf.predict(X)
pred2 = clf2.predict(X)
pred3 = clf3.predict(X)
pred4 = clf4.predict(X)

In [42]:
print("1. Linear Classifier, Accuracy for training : %.4f" % accuracy_score(y, pred))
print("2. Logistic Regression, Accuracy for training : %.4f" % accuracy_score(y, pred2))
print("3. Decision Tree, Accuracy for training : %.4f" % accuracy_score(y, pred3))
print("4. Random Forest, Accuracy for training : %.4f" % accuracy_score(y, pred4))

1. Linear Classifier, Accuracy for training : 0.8058
2. Logistic Regression, Accuracy for training : 0.8013
3. Decision Tree, Accuracy for training : 0.9820
4. Random Forest, Accuracy for training : 0.9820
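
These scores are measured on the same rows the models were trained on, so the 0.98 of the tree-based models may largely reflect memorization rather than generalization. A sketch of a hold-out check, assuming synthetic data from `make_classification` in place of the notebook's `X` and `y`:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed features and labels.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

# Keep 20% of the rows aside; the model never sees them during fit.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, tree.predict(X_tr))
val_acc = accuracy_score(y_val, tree.predict(X_val))
print("train accuracy      : %.4f" % train_acc)
print("validation accuracy : %.4f" % val_acc)
```

A large gap between the two numbers is the usual sign of overfitting; the validation score is the better guide when choosing a model.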


### Test (Predict)

In [43]:
test

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 413 | 1305 | 3 | Spector, Mr. Woolf | male | NaN | 0 | 0 | A.5. 3236 | 8.0500 | NaN | S |
| 414 | 1306 | 1 | Oliva y Ocana, Dona. Fermina | female | 39.0 | 0 | 0 | PC 17758 | 108.9000 | C105 | C |
| 415 | 1307 | 3 | Saether, Mr. Simon Sivertsen | male | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 | NaN | S |
| 416 | 1308 | 3 | Ware, Mr. Frederick | male | NaN | 0 | 0 | 359309 | 8.0500 | NaN | S |


In [55]:
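# Apply the same feature engineering to the test data.

# Imputation
# Run the drop once; re-running raises a KeyError because the columns are already gone.
test = test.drop(columns=["Ticket", "Cabin", "PassengerId", "Name"])
# Fill each column with its own statistic from the training data.
# test.fillna(value) on the whole frame would write that one value into every column.
test["Age"] = test["Age"].fillna(train["Age"].mean())  # (***)
test["Fare"] = test["Fare"].fillna(train["Fare"].mean())  # (***)

# Categorical feature encoding
test_OHE = pd.get_dummies(data=test, columns=["Embarked", "Sex"])

# Normalization (reuse the scaler fitted on the training data)
temp = scaler.transform(test_OHE.loc[:, ["Age", "Fare"]])
test_OHE["Age"] = temp[:, 0]
test_OHE["Fare"] = temp[:, 1]

test_OHE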

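| | Pclass | Age | SibSp | Parch | Fare | Embarked_C | Embarked_Q | Embarked_S | Sex_female | Sex_male |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 0.428248 | 0 | 0 | 0.015282 | 0 | 1 | 0 | 0 | 1 |
| 1 | 3 | 0.585323 | 1 | 0 | 0.013663 | 0 | 0 | 1 | 1 | 0 |
| 2 | 2 | 0.773813 | 0 | 0 | 0.018909 | 0 | 1 | 0 | 0 | 1 |
| 3 | 3 | 0.334004 | 0 | 0 | 0.016908 | 0 | 0 | 1 | 0 | 1 |
| 4 | 3 | 0.271174 | 1 | 1 | 0.023984 | 0 | 0 | 1 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 413 | 3 | 0.367921 | 0 | 0 | 0.015713 | 0 | 0 | 1 | 0 | 1 |
| 414 | 1 | 0.484795 | 0 | 0 | 0.212559 | 1 | 0 | 0 | 1 | 0 |
| 415 | 3 | 0.478512 | 0 | 0 | 0.014151 | 0 | 0 | 1 | 0 | 1 |
| 416 | 3 | 0.367921 | 0 | 0 | 0.015713 | 0 | 0 | 1 | 0 | 1 |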
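
`pd.get_dummies` only creates columns for the categories it actually sees. Here the test set contains every port and both sexes, so its columns match the training columns, but that is not guaranteed in general. A sketch of aligning the two frames with `reindex`, on hypothetical toy frames (`train_toy`, `test_toy`):

```python
import pandas as pd

# Hypothetical toy frames; the test frame has no "C" or "Q" passengers.
train_toy = pd.DataFrame({"Embarked": ["S", "C", "Q"], "Pclass": [3, 1, 2]})
test_toy = pd.DataFrame({"Embarked": ["S", "S"], "Pclass": [3, 2]})

train_enc = pd.get_dummies(train_toy, columns=["Embarked"])
test_enc = pd.get_dummies(test_toy, columns=["Embarked"])

# Force the test frame to have exactly the training columns, in the same order.
# Indicator columns missing from the test frame are created and filled with 0.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_enc.columns))
print(list(test_aligned.columns))
```

The fitted classifiers expect the same features in the same order at prediction time, so a check like `list(test_OHE.columns) == list(X.columns)` before calling `predict` can catch a mismatch early.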


In [56]:
# prediction
result = clf.predict(test_OHE)
result2 = clf2.predict(test_OHE)
result3 = clf3.predict(test_OHE)
result4 = clf4.predict(test_OHE)

In [59]:
# Create the result file, submission.csv.
submission["Survived"] = result4

- Once all training is done, you can submit your results.

- Shall we submit the one model with the best test performance among those we built?

[Submit here] https://www.kaggle.com/c/titanic
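
The test labels are not available locally, so test performance has to be estimated before submitting. One common option is cross-validation on the training data. A sketch under that assumption, with synthetic data from `make_classification` standing in for `X` and `y`, and hyperparameters such as `max_iter=1000` and `n_estimators=50` chosen only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed training features and labels.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

# The same four model families used in this notebook.
candidates = {
    "sgd": SGDClassifier(random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate.
cv_means = {
    name: cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

best_name = max(cv_means, key=cv_means.get)
for name, score in cv_means.items():
    print("%-20s %.4f" % (name, score))
print("best:", best_name)
```

The model with the highest cross-validated score is a reasonable first candidate to refit on all training rows and submit, although the leaderboard score can still differ.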

In [61]:
submission.to_csv("submission.csv", index=False)