# Assignment: (Kaggle) Titanic Survival Prediction
https://www.kaggle.com/c/titanic

# [Assignment Goal]
- Following the example, observe the effect of mean encoding on Titanic survival prediction

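# [Assignment Focus]
- Following the example, complete the predictions using label encoding and mean encoding, each paired with logistic regression
- Observe how label encoding and mean encoding each affect the number of features, the logistic regression score, and the logistic regression run time (In[3], Out[3], In[4], Out[4])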

# Assignment 1
* Following the example, re-implement the categorical features of the Titanic example using mean encoding
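
As a quick illustration of the technique named above, the following minimal sketch mean-encodes a single categorical column. The toy data frame and its values are made up for illustration and are not taken from the Titanic files.

```python
import pandas as pd

# Hypothetical toy data: one categorical feature and a binary target.
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'male'],
    'Survived': [0, 1, 1, 0, 1],
})

# Mean target value per category: female -> 1.0, male -> 1/3.
category_means = toy.groupby('Sex')['Survived'].mean()

# Replace each category with the mean target value of its group.
toy['Sex_mean'] = toy['Sex'].map(category_means)
print(toy)
```

Each category is replaced by a single number, so the feature count stays the same as with label encoding.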

In [1]:
import os
import pandas as pd
import numpy as np
import copy, time
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder


# Set data directory
dir_data = r'D:\Document\AI\Marathon100D\Assignment\Day_023\data'

# Set the full data file name
f_app_train = os.path.join(dir_data, 'titanic_train.csv')
f_app_test = os.path.join(dir_data, 'titanic_test.csv')

# Read CSV into data frame
df_train = pd.read_csv(f_app_train)
df_test = pd.read_csv(f_app_test)

# Extract the target column from the training data frame
train_Y = df_train['Survived']

# Extract data frame containing just the primary key of test data frame
ids = df_test['PassengerId']

# Drop the primary key and target data column from the training data frame
df_train = df_train.drop(['PassengerId', 'Survived'] , axis=1)

# Drop the primary key from test data frame
df_test = df_test.drop(['PassengerId'] , axis=1)

# Concat (append) the training and test data frame
df = pd.concat([df_train,df_test])

# Show top few rows
df.head()

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [2]:
# Keep only the categorical (object) columns and store their names in object_features
# Initialize an empty list to store column names
object_features = []

# Loop over each column's data type and name
for dtype, feature in zip(df.dtypes, df.columns):
    # If the column holds strings (object dtype)
    if dtype == 'object':
        # Add the column name to the list
        object_features.append(feature)
        
print(f'{len(object_features)} Object Features : {object_features}\n')

# Keep only the categorical columns
# Select the columns listed in object_features
df = df[object_features]

# Fill missing values with the string 'None'
df = df.fillna('None')

# Get the number of rows in the training set
train_num = train_Y.shape[0]

# Show top few rows
df.head()

5 Object Features : ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']



Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
0,"Braund, Mr. Owen Harris",male,A/5 21171,,S
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,PC 17599,C85,C
2,"Heikkinen, Miss. Laina",female,STON/O2. 3101282,,S
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,113803,C123,S
4,"Allen, Mr. William Henry",male,373450,,S
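
The column selection in the cell above can also be written without an explicit loop. The sketch below is one possible alternative using `select_dtypes`, shown on a made-up toy frame; the string columns are created with `dtype=object` explicitly (an assumption of this sketch) so that the selection does not depend on the default string dtype of the installed pandas version.

```python
import pandas as pd

# Hypothetical toy frame; string columns use dtype=object explicitly.
toy = pd.DataFrame({
    'Pclass': [3, 1, 3],
    'Sex': pd.Series(['male', 'female', 'female'], dtype=object),
    'Cabin': pd.Series([None, 'C85', None], dtype=object),
})

# Select the object-typed columns in one call instead of looping over dtypes.
object_features = toy.select_dtypes(include='object').columns.tolist()

# Keep only those columns and mark missing values with the string 'None'.
toy_cat = toy[object_features].fillna('None')
print(object_features)
print(toy_cat)
```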


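# Assignment 2
* In the Titanic survival prediction, compare mean encoding with label encoding: which one performs better, and what might be the reason?

Mean encoding gives the better score here (about 0.835 versus 0.780 for label encoding, once Name and Ticket are dropped). A likely reason is that it replaces each category with that category's survival rate, so the encoded feature is directly tied to the target, whereas label-encoded integers carry no such meaning. The same property makes it prone to overfitting on columns with many unique values.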

In [3]:
# Control group: label encoding + logistic regression
# Initialize a temporary data frame
df_temp = pd.DataFrame()

# Loop through all columns of the data frame
for c in df.columns:
    
    # Label-encode the string column and store the integer codes in the temporary data frame
    df_temp[c] = LabelEncoder().fit_transform(df[c])

# Keep only the rows that belong to the training data
train_X = df_temp[:train_num]

# Create a logistic regression model
estimator = LogisticRegression()

# Get current time as the start time
start = time.time()

# Print row count and column count
print(f'shape : {train_X.shape}')

# Print mean value of cross validation score
print(f'score : {cross_val_score(estimator, train_X, train_Y, cv=5).mean()}')

# Print the duration required to complete the task
print(f'time : {time.time() - start} sec')

shape : (891, 5)
score : 0.780004837244799
time : 0.1651172637939453 sec
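
To make the control group concrete: label encoding replaces each distinct string with an integer, and `LabelEncoder` numbers the classes in sorted order. The sketch below reproduces that mapping on a made-up column, using `pd.factorize(sort=True)` as a stand-in so that the example needs only pandas.

```python
import pandas as pd

# Hypothetical toy column.
sex = pd.Series(['male', 'female', 'female', 'male'])

# sort=True numbers the categories in sorted order: female -> 0, male -> 1.
codes, classes = pd.factorize(sex, sort=True)
print(list(classes))
print(codes.tolist())
```

The integers are arbitrary identifiers, yet a linear model such as logistic regression treats them as ordered magnitudes, which is one reason the label-encoded score can lag behind.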




In [7]:
# Mean encoding + logistic regression
# Concatenate the feature columns with the target column
data = pd.concat([df[:train_num], train_Y], axis=1)

# Loop through all columns in data frame
for c in df.columns:
    
    # Create a data frame by calculating the mean target value group by each column
    mean_df = data.groupby([c])['Survived'].mean().reset_index()
    
    # Rename the columns to [original column name] and [original column name]_mean
    mean_df.columns = [c, f'{c}_mean']

    # Merge data by adding the columns being converted to their mean value ( data on the left , mean_df on the right)
    data = pd.merge(data, mean_df, on=c, how='left')

    # Drop the original column
    data = data.drop([c] , axis=1)

    
# Drop the target column
data = data.drop(['Survived'] , axis=1)

# Create a logistic regression model
estimator = LogisticRegression()

# Get the current time as start time
start = time.time()

# Print row count and column count
print(f'shape : {data.shape}')

# Print mean value of cross validation score
print(f'score : {cross_val_score(estimator, data, train_Y, cv=5).mean()}')

# Print the duration required to complete the task
print(f'time : {time.time() - start} sec')

# Note: the score rises to 1.0, which is a sign of overfitting: Name is unique for every row, so Name_mean simply reproduces the target.

shape : (891, 5)
score : 1.0
time : 0.01795172691345215 sec
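
The perfect score can be reproduced on a tiny made-up example. In the sketch below (toy data, not the Titanic files), the categorical column has one distinct value per row, so every group mean equals that row's own target value.

```python
import pandas as pd

# Hypothetical toy data: 'Name' is unique per row, like the Titanic Name column.
toy = pd.DataFrame({
    'Name': ['a', 'b', 'c', 'd'],
    'Survived': [0, 1, 1, 0],
})

# With one row per category, each group mean is just that row's own target.
name_mean = toy['Name'].map(toy.groupby('Name')['Survived'].mean())

# Check whether the encoded feature is an exact copy of the label.
print((name_mean == toy['Survived']).all())
```

A model given such a feature only needs to read the label back, which says nothing about how it would perform on unseen passengers.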




In [8]:
# The previous result is overfitted.
# Columns with too many unique values need to be removed before running the model.
# Check the number of unique values in each column
train_X.nunique()

# Note that Name and Ticket have too many unique values

Name        891
Sex           2
Ticket      681
Cabin       148
Embarked      4
dtype: int64
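
Dropping high-cardinality columns is one remedy. Another commonly used option, which this notebook does not use, is to smooth each category mean toward the global mean so that rare categories carry less weight. The sketch below is only an illustration under that assumption; the function name `smoothed_mean_encode` and the smoothing weight `m` are hypothetical.

```python
import pandas as pd

def smoothed_mean_encode(frame, column, target, m=10):
    """Blend each category mean with the global mean, weighted by category size."""
    global_mean = frame[target].mean()
    stats = frame.groupby(column)[target].agg(['mean', 'count'])
    # Small categories are pulled strongly toward the global mean.
    smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)
    return frame[column].map(smoothed)

# Hypothetical toy data: one rare cabin and one frequent 'None' placeholder.
toy = pd.DataFrame({
    'Cabin': ['C85', 'None', 'None', 'None', 'None'],
    'Survived': [1, 0, 0, 1, 0],
})
encoded = smoothed_mean_encode(toy, 'Cabin', 'Survived', m=2)
print(encoded)
```

With `m=2`, the single-row category `C85` is encoded as 0.6 instead of its raw mean of 1.0, because the global mean of 0.4 dominates.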

In [10]:
# Redo the mean encoding, this time removing the columns with too many unique values
# Concatenate the feature columns with the target column
data = pd.concat([df[:train_num], train_Y], axis=1)

# Loop through all columns in data frame
for c in df.columns:
    
    # Create a data frame by calculating the mean target value group by each column
    mean_df = data.groupby([c])['Survived'].mean().reset_index()
    
    # Rename the columns to [original column name] and [original column name]_mean
    mean_df.columns = [c, f'{c}_mean']

    # Merge data by adding the columns being converted to their mean value ( data on the left , mean_df on the right)
    data = pd.merge(data, mean_df, on=c, how='left')

    # Drop the original column
    data = data.drop([c] , axis=1)

    
# Drop the target column and the mean-encoded Name and Ticket columns
data = data.drop(['Survived', 'Name_mean', 'Ticket_mean'] , axis=1)

# Create a logistic regression model
estimator = LogisticRegression()

# Get the current time as start time
start = time.time()

# Print row count and column count
print(f'shape : {data.shape}')

# Print mean value of cross validation score
print(f'score : {cross_val_score(estimator, data, train_Y, cv=5).mean()}')

# Print the duration required to complete the task
print(f'time : {time.time() - start} sec')

# Note: the score is higher than the control group (label encoding), but some overfitting may remain because the category means are computed from the same rows that are then scored.

shape : (891, 3)
score : 0.8350366889413987
time : 0.016953468322753906 sec
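
In the cells above, the category means are computed on the full training set before cross-validation, so each row's own target contributes to its encoded value. One common way to limit this leakage, not used in this notebook, is out-of-fold encoding: each row is encoded with means computed from the other folds only. The sketch below is a minimal illustration; the function name `out_of_fold_mean_encode`, its parameters, and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def out_of_fold_mean_encode(frame, column, target, n_splits=5, seed=0):
    """Encode each row using category means computed on the other folds only."""
    rng = np.random.default_rng(seed)
    # Assign every row to one of n_splits folds at random.
    fold_ids = rng.permutation(len(frame)) % n_splits
    encoded = pd.Series(np.nan, index=frame.index, dtype=float)
    for fold in range(n_splits):
        in_fold = fold_ids == fold
        other_rows = frame.loc[~in_fold]
        means = other_rows.groupby(column)[target].mean()
        fold_values = frame.loc[in_fold, column].map(means)
        # Categories unseen in the other folds fall back to their overall mean.
        fallback = other_rows[target].mean()
        encoded.loc[in_fold] = fold_values.fillna(fallback).to_numpy()
    return encoded

# Hypothetical toy data with a column that is unique per row.
toy = pd.DataFrame({
    'Name': [f'passenger_{i}' for i in range(10)],
    'Survived': [0, 1] * 5,
})
encoded = out_of_fold_mean_encode(toy, 'Name', 'Survived')
print(encoded)
```

Because every name is unseen in the other folds, the encoded values fall back to a fold-level average rather than copying the label, so a unique column no longer yields a perfect score by construction.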


