# 作業 : (Kaggle)鐵達尼生存預測
***
https://www.kaggle.com/c/titanic

# [作業目標]
- 試著模仿範例寫法, 在鐵達尼生存預測中, 觀察計數編碼與特徵雜湊的效果

# [作業重點]
- 仿造範例, 完成自己挑選特徵的群聚編碼 (In[2], Out[2])
- 觀察群聚編碼, 搭配邏輯斯回歸, 看看有什麼影響 (In[5], Out[5], In[6], Out[6]) 

# 作業1
* 試著使用鐵達尼號的例子，創立兩種以上的群聚編碼特徵( mean、median、mode、max、min、count 均可 )

In [7]:
import os
import pandas as pd
import numpy as np
import copy
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Set data directory
dir_data = 'D:\Document\AI\Marathon100D\Assignment\Day_027\data'

# Set the full data file name
f_app_train = os.path.join(dir_data, 'titanic_train.csv')
df = pd.read_csv(f_app_train)

# Extract target data from training data frame
train_Y = df['Survived']

# Drop two columns
df = df.drop(['PassengerId', 'Survived'] , axis=1)

# Show top few rows
df.head()

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [8]:
# 取一個類別型欄位, 與一個數值型欄位, 做群聚編碼
"""
Your Code Here
"""
# Group by Ticket, and add aggreate data to encode Age column
# Fill up null value by None string
df['Ticket'] = df['Ticket'].fillna('None')
# Fill up null value by mean value
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Create a data frame by grouping by Ticket and calculate the mean value of column Age
mean_df = df.groupby(['Ticket'])['Age'].mean().reset_index()

# Create a data frame by grouping by Ticket and calculate the mode value of column Age
mode_df = df.groupby(['Ticket'])['Age'].apply(lambda x: x.mode()[0]).reset_index()

# Create a data frame by grouping by Ticket and calculate the median value of column Age
median_df = df.groupby(['Ticket'])['Age'].median().reset_index()

# Create a data frame by grouping by Ticket and calculate the max value of column Age
max_df = df.groupby(['Ticket'])['Age'].max().reset_index()

# Create a data frame by grouping by Ticket and calculate the min value of column Age
min_df = df.groupby(['Ticket'])['Age'].min().reset_index()

# Merge data frame, the previous one on the left
temp = pd.merge(mean_df, mode_df, how='left', on=['Ticket'])
# Merge data frame, the previous one on the left
temp = pd.merge(temp, median_df, how='left', on=['Ticket'])
# Merge data frame, the previous one on the left
temp = pd.merge(temp, max_df, how='left', on=['Ticket'])
# Merge data frame, the previous one on the left
temp = pd.merge(temp, min_df, how='left', on=['Ticket'])
# Rename columns
temp.columns = ['Ticket', 'Age_Ticket_Mean', 'Age_Ticket_Mode', 'Age_Ticket_Median', 'Age_Ticket_Max', 'Age_Ticket_Min']
temp.head()

Unnamed: 0,Ticket,Age_Ticket_Mean,Age_Ticket_Mode,Age_Ticket_Median,Age_Ticket_Max,Age_Ticket_Min
0,110152,26.333333,16.0,30.0,33.0,16.0
1,110413,36.333333,18.0,39.0,52.0,18.0
2,110465,38.349559,29.699118,38.349559,47.0,29.699118
3,110564,28.0,28.0,28.0,28.0,28.0
4,110813,60.0,60.0,60.0,60.0,60.0


In [9]:
# Merge the original data with grouped by aggregated columns
df = pd.merge(df, temp, how='left', on=['Ticket'])

# Drop the grouped by column
df = df.drop(['Ticket'] , axis=1)

# Show top few rows
df.head()

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked,Age_Ticket_Mean,Age_Ticket_Mode,Age_Ticket_Median,Age_Ticket_Max,Age_Ticket_Min
0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,,S,22.0,22.0,22.0,22.0,22.0
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C85,C,38.0,38.0,38.0,38.0,38.0
2,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.925,,S,26.0,26.0,26.0,26.0,26.0
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1,C123,S,36.0,35.0,36.0,37.0,35.0
4,3,"Allen, Mr. William Henry",male,35.0,0,0,8.05,,S,35.0,35.0,35.0,35.0,35.0


In [10]:
#只取 int64, float64 兩種數值型欄位, 存於 num_features 中
# Initialize a list
num_features = []

# Loop through all data types and columns
for dtype, feature in zip(df.dtypes, df.columns):
    # If data type is numeric
    if dtype == 'float64' or dtype == 'int64':
        # Append the column name to the list
        num_features.append(feature)
        
# Print the member in the list
print(f'{len(num_features)} Numeric Features : {num_features}\n')

# 削減文字型欄位, 只剩數值型欄位
# Extract the data frame to include the numeric columns only
df = df[num_features]

# Fill up null value using -1
df = df.fillna(-1)

# Create a minmax data scaler
MMEncoder = MinMaxScaler()

# Print top few rows
df.head()

10 Numeric Features : ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Age_Ticket_Mean', 'Age_Ticket_Mode', 'Age_Ticket_Median', 'Age_Ticket_Max', 'Age_Ticket_Min']



Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,Age_Ticket_Mean,Age_Ticket_Mode,Age_Ticket_Median,Age_Ticket_Max,Age_Ticket_Min
0,3,22.0,1,0,7.25,22.0,22.0,22.0,22.0,22.0
1,1,38.0,1,0,71.2833,38.0,38.0,38.0,38.0,38.0
2,3,26.0,0,0,7.925,26.0,26.0,26.0,26.0,26.0
3,1,35.0,1,0,53.1,36.0,35.0,36.0,37.0,35.0
4,3,35.0,0,0,8.05,35.0,35.0,35.0,35.0,35.0


# 作業2
* 將上述的新特徵，合併原有的欄位做生存率預估，結果是否有改善?
Yes, slightly.

In [11]:
# 原始特徵 + 邏輯斯迴歸
"""
Your Code Here
"""
# The dataframe without the newly added columns is called df_minus
# Create a data frame by dropping the group by aggregated column
df_minus = df.drop(['Age_Ticket_Mean', 'Age_Ticket_Mean', 'Age_Ticket_Median', 'Age_Ticket_Max', 'Age_Ticket_Min'] , axis=1)

# Create a training data frame by applying the minmax scaler
train_X = MMEncoder.fit_transform(df_minus)

# Create a logistic regression model
estimator = LogisticRegression()

# Calculate mean value of cross validation score using logistic regression model
cross_val_score(estimator, train_X, train_Y, cv=5).mean()

  return self.partial_fit(X, y)


0.7049176764060548

In [12]:
# 新特徵 + 邏輯斯迴歸
"""
Your Code Here
"""
# Create a training data frame by applying the minmax scaler ( without group by aggregate column )
train_X = MMEncoder.fit_transform(df)

# Calculate mean value of cross validation score using logistic regression model
cross_val_score(estimator, train_X, train_Y, cv=5).mean()

  return self.partial_fit(X, y)


0.7049239534759185