# Assignment: (Kaggle) Titanic Survival Prediction
***
- The score reported by the Kaggle site is the one that counts, so please upload your submission file (*.csv) and try it out  
https://www.kaggle.com/c/titanic/submit

# [Assignment Goal]
- Following the example code, observe how blending is written and how well it works for Titanic survival prediction

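# [Assignment Focus]
- Check whether the blending accuracy (In[100]) is higher than the accuracy of each single model (In[81]~In[83])  
- Besides the weights we provide, try adjusting the weights yourself (note: the weights must sum to 1) and see what changes
- Hint: besides the weights, what else can be tuned when adjusting the classification prediction?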
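
The weight constraint above can be sketched as a small helper. This is only an illustration under stated assumptions, not the course's reference solution: `blend_probabilities` is a hypothetical name, and the toy probabilities are made-up numbers rather than model output.

```python
import numpy as np

def blend_probabilities(preds, weights):
    """Weighted average of per-model probabilities (hypothetical helper)."""
    preds = np.asarray(preds, dtype=float)      # shape: (n_models, n_samples)
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0) or weights.sum() == 0:
        raise ValueError("weights must be non-negative and not all zero")
    weights = weights / weights.sum()           # rescale so the weights sum to 1
    return weights @ preds

# Toy probabilities for three models on four passengers (made-up numbers)
toy_preds = [[0.9, 0.2, 0.6, 0.4],
             [0.8, 0.1, 0.4, 0.7],
             [0.7, 0.3, 0.5, 0.6]]
blended = blend_probabilities(toy_preds, [0.1, 0.6, 0.3])

# Turn the blended probability into a 0/1 label with a 0.5 cut-off
labels = (blended > 0.5).astype(int)
```

Rescaling inside the helper means a weight triple that accidentally sums to 0.95 or 1.05 still yields a valid probability.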

In [71]:
import os
import pandas as pd
import numpy as np
import copy, time
import warnings
warnings.filterwarnings('ignore')
from IPython.display import display
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn import datasets, metrics

# Set data directory
dir_data = r'D:\Document\AI\Marathon100D\Assignment\Day_049\data'  # raw string so backslashes are not treated as escapes
# Set the full data file name
f_app_train = os.path.join(dir_data, 'titanic_train.csv')
f_app_test = os.path.join(dir_data, 'titanic_test.csv')

# Create training data frame by reading CSV file
df_train = pd.read_csv(f_app_train)
# Create test data frame  by reading CSV file
df_test = pd.read_csv(f_app_test)

# Create target data frame by extracting the target column
train_Y = df_train['Survived']

# Create primary key data frame
ids = df_test['PassengerId']

# Accuracy checker
# df_verify = df_train[['PassengerId', 'Survived'] ]

# Drop primary key and target data column from training data frame
df_train = df_train.drop(['PassengerId', 'Survived'] , axis=1)

# Drop primary key from test data frame
df_test = df_test.drop(['PassengerId'] , axis=1)

# Combine the two data frames; ignore_index=True gives every row a unique label
df = pd.concat([df_train, df_test], ignore_index=True)

# Show top N rows
df.head()

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [72]:
# Check the DataFrame for missing values
# Report the percentage of missing values per column (top 10 shown)
def na_check(df_data):
    data_na = (df_data.isnull().sum() / len(df_data)) * 100
    data_na = data_na.drop(data_na[data_na == 0].index).sort_values(ascending=False)
    missing_data = pd.DataFrame({'Missing Ratio' :data_na})
    display(missing_data.head(10))

# Call function to check data frame
na_check(df)

Unnamed: 0,Missing Ratio
Cabin,77.463713
Age,20.091673
Embarked,0.152788
Fare,0.076394


In [73]:
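# Cells In[73]~In[80] below are just one set of feature engineering for the Titanic prediction, and the model parameters were tuned for it; if you switch to different feature engineering, the parameters in In[80] need to be searched again
# Sex: map male to 0 and female to 1
# Encode the column, converting strings to numbers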
df["Sex"] = df["Sex"].map({"male": 0, "female":1})

# Fare: take the log to reduce skew; a value of 0 stays 0
# Natural logarithm for values > 0, otherwise 0 (a missing Fare also becomes 0, because NaN > 0 is False)
df["Fare"] = df["Fare"].map(lambda i: np.log(i) if i > 0 else 0)

# Age: fill missing values with the median of the column
df["Age"] = df["Age"].fillna(df['Age'].median())
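
The Fare rule above has one side effect worth seeing in isolation: a missing value is silently mapped to 0. The sketch below uses a made-up four-value series, and the `log1p` variant is an alternative offered as an assumption, not what this notebook does.

```python
import numpy as np
import pandas as pd

fare = pd.Series([7.25, 0.0, np.nan, 71.2833])

# Same rule as the cell above: log for positive fares, 0 otherwise.
# NaN > 0 evaluates to False, so a missing fare also becomes 0.
log_fare = fare.map(lambda v: np.log(v) if v > 0 else 0)

# Alternative (an assumption, not the notebook's choice): impute the median
# first, then use log1p, which is defined at 0 and needs no special case.
log1p_fare = np.log1p(fare.fillna(fare.median()))
```

With the first rule a free ticket and an unknown fare are indistinguishable; the second keeps them apart at the cost of an imputation step.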

In [74]:
# Title feature engineering: group the titles by type, then one-hot encode
# Build a list of titles by splitting the "Name" column; possible values are Mr, Miss, Master, Mrs, etc.
df_title = [i.split(",")[1].split(".")[0].strip() for i in df["Name"]]

# Add the titles as a new column; assigning the list keeps row order regardless of the index labels
df["Title"] = df_title

# Replace certain values by "rare"
df["Title"] = df["Title"].replace(['Lady', 'the Countess','Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

# Map each title string to a group code
df["Title"] = df["Title"].map({"Master":0, "Miss":1, "Ms" : 1 , "Mme":1, "Mlle":1, "Mrs":1, "Mr":2, "Rare":3})

# Make sure the group codes are stored as integers
df["Title"] = df["Title"].astype(int)

# Convert categorical variable into dummy/indicator columns ( Title_0, Title_1, Title_2, Title_3 )
df = pd.get_dummies(df, columns = ["Title"])
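
One thing to watch when adding a derived column to a stacked frame: `pd.concat` keeps the original row labels unless `ignore_index=True` is passed, and pandas aligns a `Series` on labels rather than on position. The sketch below uses four sample names; `extract_title` is a hypothetical helper wrapping the split logic from the cell above.

```python
import pandas as pd

train = pd.DataFrame({"Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"]})
test = pd.DataFrame({"Name": ["Kelly, Mr. James", "Wilkes, Mrs. James (Ellen Needs)"]})

def extract_title(name):
    # "Surname, Title. Given names" -> "Title"
    return name.split(",")[1].split(".")[0].strip()

stacked = pd.concat([train, test])            # row labels are 0, 1, 0, 1
titles = [extract_title(n) for n in stacked["Name"]]

# A Series carries labels 0..3, so rows labelled 0 and 1 each receive the
# first two titles twice; the last two titles are dropped.
stacked["ByLabel"] = pd.Series(titles)

# A plain list has no labels, so it is assigned row by row.
stacked["ByPosition"] = titles
```

Here `ByLabel` repeats the training titles for the test rows, while `ByPosition` carries the intended values.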

In [75]:
# New feature: family size (Fsize), plus a separate indicator column for each size group
# Add a column equal to the sum of two columns plus 1
df["Fsize"] = df["SibSp"] + df["Parch"] + 1

# Add a column (Single) based on family size: 1 if the size is 1, else 0.
df['Single'] = df['Fsize'].map(lambda s: 1 if s == 1 else 0)

# Add a column (Small Family) based on family size: 1 if the size is 2, else 0.
df['SmallF'] = df['Fsize'].map(lambda s: 1 if  s == 2  else 0)

# Add a column (Medium Family) based on family size: 1 if the size is between 3 and 4, else 0.
df['MedF'] = df['Fsize'].map(lambda s: 1 if 3 <= s <= 4 else 0)

# Add a column (Large Family) based on family size: 1 if the size is >= 5, else 0.
df['LargeF'] = df['Fsize'].map(lambda s: 1 if s >= 5 else 0)
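
The four indicator columns above amount to binning `Fsize` and one-hot encoding the bins. As a rough equivalent (an alternative formulation, not the notebook's code), the same groups can be produced with `pd.cut`; the sample sizes below are made up.

```python
import pandas as pd

fsize = pd.Series([1, 2, 3, 4, 5, 8], name="Fsize")

# Bin edges mirror the cell above: 1 -> Single, 2 -> SmallF, 3-4 -> MedF, 5+ -> LargeF
bins = [0, 1, 2, 4, float("inf")]
labels = ["Single", "SmallF", "MedF", "LargeF"]
groups = pd.cut(fsize, bins=bins, labels=labels)   # intervals are right-closed

# One indicator column per group, stored as 0/1 integers
indicators = pd.get_dummies(groups).astype(int)
```

Keeping the bin edges in one list makes it easier to try different group boundaries later.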

In [76]:
# Ticket: if the value is not purely digits, take the string before the first space (with '.' and '/' removed); if it is purely digits, use 'X'; then one-hot encode
# Create an empty list
Ticket = []

# Loop through all values of data frame's Ticket column
for i in list(df.Ticket):
    # If the value is not digit
    if not i.isdigit() :
        # Add array element by removing . and /, and removing the tail part
        Ticket.append(i.replace(".","").replace("/","").strip().split(' ')[0])
    # If the value is digit
    else:
        # Add array element by "X" string value
        Ticket.append("X")

# Convert the column value using the array
df["Ticket"] = Ticket

# Convert categorical variable into dummy/indicator columns
df = pd.get_dummies(df, columns = ["Ticket"], prefix="T")
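
The ticket rule is easier to check on a few values when it is wrapped in a function. `ticket_prefix` is a hypothetical name; the logic follows the loop above, and the sample tickets are typical values from the data set.

```python
def ticket_prefix(ticket):
    """Return the cleaned prefix of a ticket, or "X" for all-digit tickets."""
    if ticket.isdigit():
        return "X"
    # Drop '.' and '/', then keep everything before the first space
    return ticket.replace(".", "").replace("/", "").strip().split(" ")[0]

samples = ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "LINE"]
prefixes = [ticket_prefix(t) for t in samples]
print(prefixes)
```

A ticket with no digits at all, such as `LINE`, becomes its own category.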

In [77]:
# Cabin: categorize by the first character, then one-hot encode
# Reset column values: if the value is null, use 'X', otherwise use the first letter of the original value
df["Cabin"] = [i[0] if not pd.isnull(i) else 'X' for i in df['Cabin']]

# Convert categorical variable into dummy/indicator columns
df = pd.get_dummies(df, columns = ["Cabin"], prefix="Cabin")

In [78]:
# Embarked, Pclass: one-hot encode
# Convert categorical variable into dummy/indicator columns
df = pd.get_dummies(df, columns = ["Embarked"], prefix="Em")

# Convert the column data type to category
df["Pclass"] = df["Pclass"].astype("category")

# Convert categorical variable into dummy/indicator columns
df = pd.get_dummies(df, columns = ["Pclass"], prefix="Pc")

# Drop column "Name"
df.drop(labels = ["Name"], axis = 1, inplace = True)

In [79]:
# Confirm the missing values and the current contents of the table
# Check the missing-value ratio of the data frame
na_check(df)

# List top N rows of data frame
df.head()

# df.shape (1782, 61)
# train_Y.shape ( 891, )

Unnamed: 0,Missing Ratio


Unnamed: 0,Sex,Age,SibSp,Parch,Fare,Title_0,Title_1,Title_2,Title_3,Fsize,...,Cabin_F,Cabin_G,Cabin_T,Cabin_X,Em_C,Em_Q,Em_S,Pc_1,Pc_2,Pc_3
0,0,22.0,1,0,1.981001,0,0,1,0,2,...,0,0,0,1,0,0,1,0,0,1
1,1,38.0,1,0,4.266662,0,1,0,0,2,...,0,0,0,0,1,0,0,1,0,0
2,1,26.0,0,0,2.070022,0,1,0,0,1,...,0,0,0,1,0,0,1,0,0,1
3,1,35.0,1,0,3.972177,0,1,0,0,2,...,0,0,0,0,0,0,1,1,0,0
4,0,35.0,0,0,2.085672,0,0,1,0,1,...,0,0,0,1,0,0,1,0,0,1


In [80]:
# Apply min-max scaling to the data frame (the result is a NumPy array)
df = MinMaxScaler().fit_transform(df)

# Split the transformed data df back into train_X and test_X
# Number of training rows
train_num = train_Y.shape[0]

# Create training data by extracting data from the begining to training data row number
train_X = df[:train_num]

# Create test data by extracting data from the training data row number to the end.
test_X = df[train_num:]

# Use three models: logistic regression / gradient boosting machine / random forest, with parameters found by Random Search
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Create logistic regression model
lr = LogisticRegression(tol=0.001, penalty='l2', fit_intercept=True, C=1.0)

# Create gradient boosting classifier model
gdbt = GradientBoostingClassifier(tol=100, subsample=0.75, n_estimators=250, max_features=20,
                                  max_depth=6, learning_rate=0.03)

# Create random forest classifier model
rf = RandomForestClassifier(n_estimators=100, min_samples_split=2, min_samples_leaf=1, 
                            max_features='sqrt', max_depth=6, bootstrap=True)
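
The notebook imports `cross_val_score` but relies on Kaggle for scoring. A local estimate for each model could look roughly like the sketch below. It runs on synthetic data from `make_classification` as a stand-in for `train_X` / `train_Y`, and the model settings are placeholders rather than the tuned parameters above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for train_X / train_Y (assumption: not the Titanic data)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Placeholder settings, chosen only to keep the example fast
models = {
    "lr": LogisticRegression(max_iter=1000),
    "gdbt": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "rf": RandomForestClassifier(n_estimators=50, max_depth=6, random_state=0),
}

cv_scores = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy on the training data
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

A cross-validated score gives a baseline to compare against before spending Kaggle submissions.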

In [81]:
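# Logistic regression prediction file (results are partly random, so treat the Kaggle score as authoritative; the same applies to the models below)
# Train the logistic regression model
lr.fit(train_X, train_Y)

# Predict the survival probability for each test row
lr_pred = lr.predict_proba(test_X)[:,1]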

# Create a data frame based on the primary key plus the test result
sub = pd.DataFrame({'PassengerId': ids, 'Survived': lr_pred})

# Convert the column value to 1 if it is > 0.5, else set it to 0
sub['Survived'] = sub['Survived'].map(lambda x:1 if x>0.5 else 0) 

# Export data to CSV
sub.to_csv('titanic_lr.csv', index=False) 

In [82]:
# Gradient boosting machine prediction file
# Train the gradient boosting classifier model
gdbt.fit(train_X, train_Y)

# Predict the survival probability for each test row
gdbt_pred = gdbt.predict_proba(test_X)[:,1]

# Create a data frame based on the primary key and prediction value from test
sub = pd.DataFrame({'PassengerId': ids, 'Survived': gdbt_pred})

# Convert the target data to 1 if it is > 0.5, else set it to 0
sub['Survived'] = sub['Survived'].map(lambda x:1 if x>0.5 else 0) 

# Export the data frame to CSV
sub.to_csv('titanic_gdbt.csv', index=False)

In [83]:
# Random forest prediction file
# Train the random forest classifier model
rf.fit(train_X, train_Y)

# Predict the survival probability for each test row
rf_pred = rf.predict_proba(test_X)[:,1]

# Create a data frame based on the primary key and predicted value from testing
sub = pd.DataFrame({'PassengerId': ids, 'Survived': rf_pred})

# Convert the target data to 1 if it is > 0.5, else set it to 0
sub['Survived'] = sub['Survived'].map(lambda x:1 if x>0.5 else 0) 

# Export the data frame to CSV.
sub.to_csv('titanic_rf.csv', index=False)

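# Assignment
* Although both are blending, classification prediction differs from regression prediction in many ways,
because the Titanic prediction is 'survived/died', so the output is either 0 or 1  
Weighted blending therefore has to combine the models in the form of probabilities, which is why the earlier cells of this assignment already write each model's prediction as a probability  
(Please complete the code below and submit the result to Kaggle to see the score)

* But is that alone enough for blending on a classification problem to beat a single model?  
Midterms are coming up, so here is a challenge: can you find any other way to improve the blending result?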
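
To see why the probabilities matter, compare blending probabilities with blending labels that were already cut to 0/1. The numbers below are made up for illustration, and the weights are arbitrary.

```python
import numpy as np

# Made-up probabilities: one row per model, one column per passenger
probs = np.array([[0.95, 0.45, 0.55],
                  [0.40, 0.45, 0.10],
                  [0.45, 0.60, 0.52]])
weights = np.array([0.2, 0.4, 0.4])

# Soft blending: average the probabilities, then threshold once
soft = (weights @ probs > 0.5).astype(int)

# Hard blending: threshold each model first, then average the 0/1 votes
votes = (probs > 0.5).astype(int)
hard = (weights @ votes > 0.5).astype(int)

print("soft:", soft)   # confidence of each model is taken into account
print("hard:", hard)   # confidence is discarded before the weights apply
```

In this toy case the two approaches disagree on every passenger, because thresholding first throws away how confident each model was.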

In [100]:
# Blending prediction file
"""
Your Code Here
"""
# blending_pred = lr_pred*0.1  + gdbt_pred*0.6 + rf_pred*0.3
# blending_pred = lr_pred*0.05  + gdbt_pred*0.8 + rf_pred*0.15 # 0.67942
# blending_pred = lr_pred*0.05  + gdbt_pred*0.15 + rf_pred*0.85  # note: these weights sum to 1.05, not 1
# blending_pred = lr_pred*0.05  + gdbt_pred*0.9 + rf_pred*0.05
# blending_pred = lr_pred*0.2  + gdbt_pred*0.4 + rf_pred*0.4
# blending_pred = lr_pred*0.0  + gdbt_pred*0.9 + rf_pred*0.1  # 0.67
# blending_pred = lr_pred*0.05  + gdbt_pred*0.85 + rf_pred*0.10 # 0.67464
# blending_pred = lr_pred*0.05  + gdbt_pred*0.75 + rf_pred*0.20 # 0.67942
# blending_pred = lr_pred*0.05  + gdbt_pred*0.70 + rf_pred*0.20 # 0.68421 (note: these weights sum to 0.95, not 1)
# blending_pred = lr_pred*0.05  + gdbt_pred*0.70 + rf_pred*0.25 # 0.67942
# blending_pred = lr_pred*0.15  + gdbt_pred*0.70 + rf_pred*0.15 # 0.67464
blending_pred = lr_pred*0.0  + gdbt_pred*0.80 + rf_pred*0.2     # 0.75598

sub = pd.DataFrame({'PassengerId': ids, 'Survived': blending_pred})

# Convert the column value to 1 if it is > 0.5, else set it to 0
sub['Survived'] = sub['Survived'].map(lambda x:1 if x>0.5 else 0) 

# Export data to CSV
sub.to_csv('titanic_blending.csv', index=False)
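
Trying weights one submission at a time is slow and tunes against the leaderboard. If held-out probabilities and labels were available (this notebook does not create a validation split, so that part is an assumption), the weights and the cut-off could be searched locally. In the sketch below `search_blend` is a hypothetical helper and the validation data is simulated with random numbers.

```python
import itertools
import numpy as np

def search_blend(val_probs, val_y, step=0.1, thresholds=(0.4, 0.5, 0.6)):
    """Exhaustive search over weight triples (summing to 1) and thresholds."""
    n_steps = int(round(1 / step))
    best = (-1.0, None, None)
    for i, j in itertools.product(range(n_steps + 1), repeat=2):
        if i + j > n_steps:
            continue                       # third weight would be negative
        w = np.array([i, j, n_steps - i - j]) / n_steps
        blended = w @ val_probs
        for t in thresholds:
            acc = np.mean((blended > t).astype(int) == val_y)
            if acc > best[0]:
                best = (acc, w, t)
    return best

# Simulated validation data (assumption: random numbers, not Titanic output)
rng = np.random.default_rng(0)
val_y = rng.integers(0, 2, size=200)
scale = np.array([[0.45], [0.35], [0.40]])          # one noise level per model
noise = rng.normal(0.0, scale, size=(3, 200))
val_probs = np.clip(val_y + noise, 0.0, 1.0)

best_acc, best_w, best_t = search_blend(val_probs, val_y)
single_acc = [np.mean((p > 0.5).astype(int) == val_y) for p in val_probs]
print("best weights:", best_w, "threshold:", best_t, "accuracy:", best_acc)
print("single-model accuracy:", single_acc)
```

Because the grid contains the weight triples that select a single model at the 0.5 cut-off, the best blend on the validation set is never worse than the best single model there; whether that carries over to the test set still has to be checked.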

In [97]:
#acc = metrics.accuracy_score(df_verify, sub)
#print("Accuracy: ", acc)