# Assignment: (Kaggle) Titanic Survival Prediction
https://www.kaggle.com/c/titanic

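# [Assignment Goal]
- Following the example's approach, observe how filling missing values and applying standardization / min-max scaling affect the numeric values in the Titanic survival prediction task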

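# [Key Points]
- Observe how switching the missing-value fill strategy affects the features (In[4]~In[6], Out[4]~Out[6])
- Observe how switching the feature scaling method affects the features (In[7]~In[8], Out[7]~Out[8])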
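
To make the two scaling methods named above concrete, the sketch below applies both formulas by hand to the first five `Fare` values of the training data. The variable names are illustrative, and the population standard deviation (`ddof=0`) is assumed, which is also what `StandardScaler` uses.

```python
import numpy as np

# First five Fare values from the training data (used only for illustration)
fare = np.array([7.25, 71.2833, 7.925, 53.1, 8.05])

# Min-max scaling: (x - min) / (max - min), maps the column onto [0, 1]
fare_minmax = (fare - fare.min()) / (fare.max() - fare.min())

# Standardization: (x - mean) / std, gives mean 0 and standard deviation 1
# np.std defaults to the population standard deviation (ddof=0)
fare_standard = (fare - fare.mean()) / fare.std()

print(fare_minmax.round(3))
print(fare_standard.round(3))
```

Min-max scaling keeps the shape of the distribution but pins it to a fixed range, so one extreme value compresses all the others. Standardization leaves the range unbounded.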

In [2]:
# All preparation before feature engineering (same as the previous example)
import pandas as pd
import numpy as np
import copy
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

data_path = '../data/'
df_train = pd.read_csv(data_path + 'titanic_train.csv')
df_test = pd.read_csv(data_path + 'titanic_test.csv')

train_Y = df_train['Survived']
ids = df_test['PassengerId']
df_train = df_train.drop(['PassengerId', 'Survived'] , axis=1)
df_test = df_test.drop(['PassengerId'] , axis=1)
df = pd.concat([df_train,df_test])
df.head()

| | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [3]:
# Keep only the int64 and float64 numeric columns, stored in num_features
num_features = []
for dtype, feature in zip(df.dtypes, df.columns):
    if dtype == 'float64' or dtype == 'int64':
        num_features.append(feature)
print(f'{len(num_features)} Numeric Features : {num_features}\n')

5 Numeric Features : ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
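
The same selection can also be written with pandas' `select_dtypes`. The sketch below uses a small stand-in frame (`toy`, with made-up rows) rather than the real data, and `num_features_alt` is a hypothetical name.

```python
import pandas as pd

# Toy frame with the same mix of column types as the Titanic data (values are illustrative)
toy = pd.DataFrame({
    'Pclass': [3, 1, 3],
    'Sex': ['male', 'female', 'female'],
    'Age': [22.0, 38.0, None],
    'Fare': [7.25, 71.2833, 7.925],
})

# select_dtypes keeps only the columns whose dtype is in the include list
num_features_alt = toy.select_dtypes(include=['int64', 'float64']).columns.tolist()
print(num_features_alt)
```

This avoids the explicit loop over `df.dtypes` while applying the same int64 / float64 criterion.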



In [4]:
# Drop the text columns, keeping only the numeric ones
df = df[num_features]
train_num = train_Y.shape[0]
df.head()

| | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|
| 0 | 3 | 22.0 | 1 | 0 | 7.25 |
| 1 | 1 | 38.0 | 1 | 0 | 71.2833 |
| 2 | 3 | 26.0 | 0 | 0 | 7.925 |
| 3 | 1 | 35.0 | 1 | 0 | 53.1 |
| 4 | 3 | 35.0 | 0 | 0 | 8.05 |


In [5]:
df.shape

(1309, 5)

In [6]:
train_Y

0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: Survived, Length: 891, dtype: int64

# Assignment 1
* In the missing-value block, swap in and run at least two different fill strategies. Which one works better?

In [7]:
# Fill missing values with -1, then fit logistic regression
df_m1 = df.fillna(-1)
train_X = df_m1[:train_num]
estimator = LogisticRegression()
cross_val_score(estimator, train_X, train_Y, cv=5).mean()

"""
Your Code Here
"""

'\nYour Code Here\n'
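
Before comparing model scores, it can help to see how each fill strategy changes the feature itself. The sketch below is one possible way to do that, using a toy `Age` column with made-up values. The names `age`, `fill_values` and `summary` are illustrative and do not come from the notebook.

```python
import pandas as pd

# Toy Age column with missing values (illustrative numbers, not the real data)
age = pd.Series([22.0, 38.0, None, 35.0, None, 54.0], name='Age')

# Candidate fill values: two constants and two statistics of the observed values
fill_values = {
    'fill -1': -1,
    'fill 0': 0,
    'fill mean': age.mean(),      # mean() skips NaN by default
    'fill median': age.median(),  # median() skips NaN by default
}

# Apply each strategy to the original series so that one fill never affects the next
summary = pd.DataFrame({
    name: age.fillna(value).agg(['mean', 'std', 'min', 'max'])
    for name, value in fill_values.items()
})
print(summary.round(2))
```

Constant fills such as -1 or 0 pull the mean down and widen the spread. Filling with the mean leaves the mean unchanged but shrinks the standard deviation.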

# Assignment 2
* Using different scaling options (raw values / min-max scaling / standardization) with a logistic regression model, which performs best?

In [8]:
"""
Your Code Here
"""

'\nYour Code Here\n'
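
One possible shape for this comparison is sketched below. Synthetic data from `make_classification` stands in for `train_X` and `train_Y`. The pipeline wrapper and `max_iter=1000` are choices made for this sketch and do not come from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in for the filled numeric features and the Survived labels
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# None means "no scaling"; the other entries are the two scalers named in the exercise
candidates = {'raw': None, 'min-max': MinMaxScaler(), 'standard': StandardScaler()}

results = {}
for name, scaler in candidates.items():
    model = LogisticRegression(max_iter=1000)
    # Putting the scaler in a pipeline refits it on the training part of each fold
    estimator = model if scaler is None else make_pipeline(scaler, model)
    results[name] = cross_val_score(estimator, X, y, cv=5).mean()

for name, score in results.items():
    print(f'{name}: {score:.4f}')
```

Wrapping the scaler in a pipeline keeps the validation fold out of the scaler's fit. Calling `fit_transform` on the full training set before cross-validation does not.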

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import PowerTransformer

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression

In [10]:
def multi_scalers(df):
    # Each entry is (label, scaler); None marks the unscaled control run
    scalers = [('no_Scaler', None),
               ('StandardScaler', StandardScaler()),
               ('MinMaxScaler', MinMaxScaler()),
               ('MaxAbsScaler', MaxAbsScaler()),
               ('RobustScaler', RobustScaler()),
               ('QuantileTransformer-Normal', QuantileTransformer(output_distribution='normal')),
               ('QuantileTransformer-Uniform', QuantileTransformer(output_distribution='uniform')),
               ('PowerTransformer-Yeo-Johnson', PowerTransformer(method='yeo-johnson')),
               ('Normalizer', Normalizer())]

    estimator = LogisticRegression()
    cv = 5

    # Create the result containers up front so they do not depend on the order of `scalers`
    cv_score = []
    class_score = {}

    # Cross-validate the estimator once per scaler (train_Y is the global label Series)
    for name, scaler in scalers:
        train_X = df if scaler is None else scaler.fit_transform(df)
        score = cross_val_score(estimator, train_X, train_Y, cv=cv).mean()
        cv_score.append(score)
        class_score[name] = score

    return cv_score, class_score

In [11]:
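# Fill missing values with -1, 0, the mean, and the median, then run logistic regression with each scaler
df = df.iloc[:train_num, :]  # keep only the training rows
scores = {}
fill = [("fill with -1", -1), ("fill with 0", 0), ("fill with mean", df.mean()), ("fill with median", df.median())]
for name, value in fill:
    # Fill a fresh copy each time; reassigning df would leave nothing for later strategies to fill
    df_filled = df.fillna(value)
    cv_score, class_score = multi_scalers(df_filled)
    scores.update({name: cv_score})

df_scores = pd.DataFrame(scores, index=list(class_score.keys()), columns=list(scores.keys()))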


In [12]:
df_scores

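| | fill with 0 | fill with -1 | fill with mean | fill with median |
|---|---|---|---|---|
| no_Scaler | 0.69818 | 0.69818 | 0.69818 | 0.69818 |
| StandardScaler | 0.698173 | 0.698173 | 0.698173 | 0.698173 |
| MinMaxScaler | 0.700414 | 0.700414 | 0.700414 | 0.700414 |
| MaxAbsScaler | 0.702674 | 0.702674 | 0.702674 | 0.702674 |
| RobustScaler | 0.697056 | 0.697056 | 0.697056 | 0.697056 |
| QuantileTransformer-Normal | 0.715034 | 0.715034 | 0.715034 | 0.715034 |
| QuantileTransformer-Uniform | 0.711663 | 0.711663 | 0.711663 | 0.711663 |
| PowerTransformer-Yeo-Johnson | 0.713923 | 0.713923 | 0.713923 | 0.713923 |
| Normalizer | 0.675758 | 0.675758 | 0.675758 | 0.675758 |

Note: the four columns are identical because the run that produced this output reassigned `df` on the first `fillna` call, so the later fill values found no missing entries left to fill. Every column therefore reflects only the first fill value, -1.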
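
To read the score table programmatically, the scores can be ranked with pandas. The sketch below copies one column of the recorded output into a Series; `recorded`, `best_scaler` and `ranked` are hypothetical names.

```python
import pandas as pd

# Scores copied from the first column of the recorded df_scores output
recorded = pd.Series({
    'no_Scaler': 0.69818,
    'StandardScaler': 0.698173,
    'MinMaxScaler': 0.700414,
    'MaxAbsScaler': 0.702674,
    'RobustScaler': 0.697056,
    'QuantileTransformer-Normal': 0.715034,
    'QuantileTransformer-Uniform': 0.711663,
    'PowerTransformer-Yeo-Johnson': 0.713923,
    'Normalizer': 0.675758,
})

# idxmax returns the label of the highest score; sort_values gives the full ranking
best_scaler = recorded.idxmax()
ranked = recorded.sort_values(ascending=False)

print(best_scaler)
print(ranked)
```

On the recorded numbers, `QuantileTransformer-Normal` ranks first (0.715034) and `Normalizer` last (0.675758). The gaps between most scalers are small enough that they may fall within fold-to-fold variation.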
