# Assignment: (Kaggle) Titanic Survival Prediction
https://www.kaggle.com/c/titanic

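# [Assignment Goal]
- Following the example, observe how filling missing values and applying standardization / min-max scaling affect the numeric features in the Titanic survival prediction task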

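# [Key Points]
- Observe how switching between imputation methods affects the features (In[4]~In[6], Out[4]~Out[6])
- Observe how switching between feature scaling methods affects the features (In[7]~In[8], Out[7]~Out[8])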
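
To make the first key point concrete, here is a minimal sketch of how the fill value shifts a column's summary statistics. The `age` values are made up for illustration and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy column with two missing values (made-up numbers, not Titanic data)
age = pd.Series([20.0, 30.0, np.nan, 40.0, np.nan], name="Age")

# Three fill strategies to compare
filled = {
    "fill -1": age.fillna(-1),
    "fill 0": age.fillna(0),
    "fill mean": age.fillna(age.mean()),  # mean() skips NaN by default
}

# Mean and standard deviation of the column after each fill
summary = pd.DataFrame(
    {name: {"mean": s.mean(), "std": s.std()} for name, s in filled.items()}
).T
print(summary)
```

Filling with the mean leaves the column mean unchanged and shrinks the spread, while a constant such as -1 or 0 pulls the mean toward that constant.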

In [1]:
# All preparation needed before feature engineering (same as the previous example)
import pandas as pd
import numpy as np
import copy
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Load the training data (df_train) and the test data (df_test)
data_path = 'data/'
df_train = pd.read_csv(data_path + 'titanic_train.csv')
df_test = pd.read_csv(data_path + 'titanic_test.csv')

# Pull out the columns we need later
train_Y = df_train['Survived']
ids = df_test['PassengerId']

# Drop the columns already pulled out from the DataFrames
df_train = df_train.drop(['PassengerId', 'Survived'] , axis=1)
df_test = df_test.drop(['PassengerId'] , axis=1)

# Concatenate the training and test data
df = pd.concat([df_train,df_test])
print("DataFrame info")
df.info()
print("\r\n")
print("DataFrame preview")
df.head()

DataFrame info
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 10 columns):
Pclass      1309 non-null int64
Name        1309 non-null object
Sex         1309 non-null object
Age         1046 non-null float64
SibSp       1309 non-null int64
Parch       1309 non-null int64
Ticket      1309 non-null object
Fare        1308 non-null float64
Cabin       295 non-null object
Embarked    1307 non-null object
dtypes: float64(2), int64(3), object(5)
memory usage: 112.5+ KB


DataFrame preview


| | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [4]:
# Keep only the int64 and float64 numeric columns, stored in num_features
num_features = []
for dtype, feature in zip(df.dtypes, df.columns):
    if dtype == 'float64' or dtype == 'int64':
        num_features.append(feature)
print(f'{len(num_features)} Numeric Features : {num_features}\n')

5 Numeric Features : ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
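
The dtype loop above can also be written with `DataFrame.select_dtypes`. A minimal sketch on a hypothetical three-column frame (`toy` and its values are illustrative only):

```python
import pandas as pd

# Toy frame with mixed dtypes (hypothetical values)
toy = pd.DataFrame({
    "Pclass": [3, 1],
    "Name": ["A", "B"],
    "Age": [22.0, 38.0],
})

# Select columns by dtype instead of looping over df.dtypes
num_features = toy.select_dtypes(include=["int64", "float64"]).columns.tolist()
print(f"{len(num_features)} Numeric Features : {num_features}")
```

Passing `include="number"` instead would also pick up other numeric widths such as int32, which may or may not be what is wanted.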



In [5]:
# Drop the text columns, keeping only the numeric ones
df = df[num_features]
train_num = train_Y.shape[0] # number of training rows, used later to slice the training part back out

df.head()

| | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|
| 0 | 3 | 22.0 | 1 | 0 | 7.25 |
| 1 | 1 | 38.0 | 1 | 0 | 71.2833 |
| 2 | 3 | 26.0 | 0 | 0 | 7.925 |
| 3 | 1 | 35.0 | 1 | 0 | 53.1 |
| 4 | 3 | 35.0 | 0 | 0 | 8.05 |
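
Because the training and test frames were concatenated without resetting the index, index labels repeat (the `info()` output shows 1309 entries labelled 0 to 417). The later cells therefore recover the training rows by position with `[:train_num]`. A small sketch of that idea on hypothetical miniature frames:

```python
import pandas as pd

# Hypothetical miniature train/test frames
toy_train = pd.DataFrame({"Age": [22.0, 38.0, 26.0]})
toy_test = pd.DataFrame({"Age": [34.5, 47.0]})

combined = pd.concat([toy_train, toy_test])
n_train = toy_train.shape[0]

# Index labels repeat after concat, so slice by position rather than by label
train_part = combined.iloc[:n_train]
test_part = combined.iloc[n_train:]

print(list(combined.index))
print(train_part)
```

`combined[:n_train]`, as used in this notebook, selects the same rows; `.iloc` only makes the positional intent explicit.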


# Exercise 1
* In the imputation block, swap in and run at least two different ways of filling missing values, and see which works better.

In [9]:
# Fill missing values with -1, then fit logistic regression
df_m1 = df.fillna(-1)  # replace every NaN with -1
train_X = df_m1[:train_num]
estimator = LogisticRegression(solver='lbfgs')
scores1 = cross_val_score(estimator, train_X, train_Y, cv=5).mean()

"""
Your Code Here
"""
print(df.isnull().sum().sort_values(ascending=False))
print(df.describe())

# Age has by far the most missing values; try filling each column with its own mean
df_m2 = df.fillna(df.mean())
df_m2.describe()
train_X = df_m2[:train_num]
estimator = LogisticRegression(solver='lbfgs')
scores2 =cross_val_score(estimator, train_X, train_Y, cv=5).mean()

# Try filling with 0
df_m3 = df.fillna(0)
df_m3.describe()
train_X = df_m3[:train_num]
estimator = LogisticRegression(solver='lbfgs')
scores3 =cross_val_score(estimator, train_X, train_Y, cv=5).mean()

print(scores1) # baseline (fill with -1)
print(scores2) # slightly worse
print(scores3) # slightly better?

Age       263
Fare        1
Parch       0
SibSp       0
Pclass      0
dtype: int64
            Pclass          Age        SibSp        Parch         Fare
count  1309.000000  1046.000000  1309.000000  1309.000000  1308.000000
mean      2.294882    29.881138     0.498854     0.385027    33.295479
std       0.837836    14.413493     1.041658     0.865560    51.758668
min       1.000000     0.170000     0.000000     0.000000     0.000000
25%       2.000000    21.000000     0.000000     0.000000     7.895800
50%       3.000000    28.000000     0.000000     0.000000    14.454200
75%       3.000000    39.000000     1.000000     0.000000    31.275000
max       3.000000    80.000000     8.000000     9.000000   512.329200
0.6982644788418415
0.6959413955734954
0.6993817972775958
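
The three experiments above repeat the same fill / slice / fit / score steps. One way to reduce the repetition is a small helper. The sketch below is an assumption-laden illustration: `score_fill` is a hypothetical name, and the data is randomly generated rather than the Titanic set.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def score_fill(features, target, fill_value):
    """Mean 5-fold CV accuracy after filling NaN with fill_value.

    fill_value may be a scalar or a per-column Series such as features.mean().
    """
    filled = features.fillna(fill_value)
    estimator = LogisticRegression(solver="lbfgs")
    return cross_val_score(estimator, filled, target, cv=5).mean()

# Synthetic stand-in data (random, NOT the Titanic set)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
y = pd.Series((X["a"] + X["b"] > 0).astype(int))
X.loc[X.index[::7], "c"] = np.nan  # knock out every 7th value of column c

results = {
    "fill -1": score_fill(X, y, -1),
    "fill 0": score_fill(X, y, 0),
    "fill mean": score_fill(X, y, X.mean()),
}
for name, score in results.items():
    print(f"{name}: {score:.4f}")
```

With the notebook's own variables, the equivalent call would be along the lines of `score_fill(df[:train_num], train_Y, -1)`.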


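# Exercise 2
* Using different scaling methods (raw values / min-max scaling / standardization) together with a logistic regression model, which one works best?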
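
As a reminder of what the two scalers compute, the following sketch compares the scikit-learn transformers with the formulas written out by hand. The array `x` holds made-up values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One toy feature column (made-up values)
x = np.array([[0.0], [5.0], [10.0], [25.0]])

# Min-max scaling: (x - min) / (max - min), so the result lies in [0, 1]
minmax_manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
minmax_sklearn = MinMaxScaler().fit_transform(x)

# Standardization: (x - mean) / std, using the population std (ddof=0)
standard_manual = (x - x.mean(axis=0)) / x.std(axis=0)
standard_sklearn = StandardScaler().fit_transform(x)

print(np.allclose(minmax_manual, minmax_sklearn))
print(np.allclose(standard_manual, standard_sklearn))
```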

In [13]:
"""
Your Code Here
"""
# Fill missing values with -1, then apply min-max scaling
df_m4 = df.fillna(-1)
df_temp = MinMaxScaler().fit_transform(df_m4)
train_X = df_temp[:train_num]
estimator = LogisticRegression(solver='lbfgs')
scores4 =cross_val_score(estimator, train_X, train_Y, cv=5).mean()

# Fill missing values with -1, then apply standardization
df_m5 = df.fillna(-1)
df_temp = StandardScaler().fit_transform(df_m5)
train_X = df_temp[:train_num]
estimator = LogisticRegression(solver='lbfgs')
scores5 =cross_val_score(estimator, train_X, train_Y, cv=5).mean()


print("Fill with -1:            {}".format(scores1)) # baseline
print("Fill with mean:          {}".format(scores2))
print("Fill with 0:             {}".format(scores3))
print("Min-max scaling (-1):    {}".format(scores4))
print("Standardization (-1):    {}".format(scores5))

Fill with -1:            0.6982644788418415
Fill with mean:          0.6959413955734954
Fill with 0:             0.6993817972775958
Min-max scaling (-1):    0.7005053927832138
Standardization (-1):    0.6982582017719778
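
In the cells above, each scaler is fit on the combined training and test rows before cross-validation. One alternative, not used in this notebook, is to place the scaler inside a scikit-learn pipeline so that it is re-fit on the training portion of each fold. The sketch below uses randomly generated stand-in data, and the `max_iter=1000` setting is an assumption added only to give the solver more room on unscaled features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in data with very different column scales (NOT Titanic data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y = (X[:, 0] + X[:, 1] / 10.0 > 0).astype(int)

def make_model():
    # max_iter raised from the default as an assumption (see lead-in)
    return LogisticRegression(solver="lbfgs", max_iter=1000)

pipelines = {
    "raw": make_pipeline(make_model()),
    "min-max": make_pipeline(MinMaxScaler(), make_model()),
    "standard": make_pipeline(StandardScaler(), make_model()),
}

# cross_val_score clones and re-fits the whole pipeline on each training fold
pipeline_scores = {
    name: cross_val_score(pipe, X, y, cv=5).mean()
    for name, pipe in pipelines.items()
}
for name, score in pipeline_scores.items():
    print(f"{name}: {score:.4f}")
```

Whether this changes the ranking on the Titanic features is not something the outputs above can tell us; it would need to be run on the actual data.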
