# Titanic Model V3 - Main Notebook

## Section 1: Import Libraries (Tool Setup)
*What tools does this task need?*

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from datetime import datetime

## Section 2: Load Raw Data (Reading the Task Data)
*What is my input? What am I trying to predict?*

In [5]:
train = pd.read_csv("../data/train.csv")
test = pd.read_csv("../data/test.csv")

# Keep the passenger IDs for the submission file written later
test_passenger_ids = test["PassengerId"]

## Section 3: Check Missing Values (Data Health Check)
*Which columns are unusable? Which ones need imputing first?*

In [7]:
def check_missing(df, name="Data"):
    print(f"Missing values in {name}:")
    missing = df.isnull().sum()
    missing = missing[missing > 0]
    if missing.empty:
        print("✅ No missing values.")
    else:
        print(missing)
    print("-" * 40)

check_missing(train, "Train - Raw")
check_missing(test, "Test - Raw")

Missing values in Train - Raw:
Age         177
Cabin       687
Embarked      2
dtype: int64
----------------------------------------
Missing values in Test - Raw:
Age       86
Fare       1
Cabin    327
dtype: int64
----------------------------------------
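Raw counts are easier to judge next to the share of rows they represent: Cabin is missing in 687 of the 891 training rows (about 77%), while Embarked is missing in only 2. The sketch below adds a percentage column to the same kind of check. The helper name `missing_summary` and the toy frame are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame standing in for the Titanic data (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})
result = missing_summary(toy)
print(result)
```

A column that is mostly empty, as Cabin is here, is usually a candidate for dropping or for reducing to a has-value flag rather than for imputation.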


## Section 4: Feature Engineering (Variable Transformation)
*Which new columns would explain the target better?*

In [8]:
def extract_title(df):
    df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
    df['Title_Grouped'] = df['Title'].replace({
        'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs',
        'Lady': 'Rare', 'Countess': 'Rare', 'Capt': 'Rare',
        'Col': 'Rare', 'Don': 'Rare', 'Dr': 'Rare', 'Major': 'Rare',
        'Rev': 'Rare', 'Sir': 'Rare', 'Jonkheer': 'Rare', 'Dona': 'Rare'
    })
    return df

def add_family_features(df):
    df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
    df['IsAlone'] = (df['FamilySize'] == 1).astype(int)
    df['FarePerPerson'] = df['Fare'] / df['FamilySize']
    return df

def engineer_features(df):
    df = extract_title(df)
    df = add_family_features(df)
    return df

train = engineer_features(train)
test = engineer_features(test)
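To see what the two helpers produce, the following sketch applies the same regular expression and the same family formulas to a few hand-written rows. The sample names and fares are illustrative values chosen for this example, and only one entry of the title mapping is reproduced.

```python
import pandas as pd

# Sample names in the dataset's "Surname, Title. Given names" layout (illustrative)
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# The title is the first run of letters that follows a space and ends with a period
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)
grouped = titles.replace({"Countess": "Rare"})  # one entry of the notebook's mapping
print(titles.tolist())
print(grouped.tolist())

# Family features on two toy passengers
family = pd.DataFrame({"SibSp": [1, 0], "Parch": [0, 0], "Fare": [7.25, 7.925]})
family["FamilySize"] = family["SibSp"] + family["Parch"] + 1   # +1 counts the passenger
family["IsAlone"] = (family["FamilySize"] == 1).astype(int)
family["FarePerPerson"] = family["Fare"] / family["FamilySize"]
print(family)
```

Because `FarePerPerson` is derived from `Fare`, any row with a missing fare also ends up with a missing `FarePerPerson`.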


## Section 5: Impute Missing Values (Imputation Strategy)
*How do I fill missing values on a reasonable basis?*

In [10]:
# Impute Age: use the median age of each Title_Grouped group (learned from train only)
age_imputer_map = train.groupby('Title_Grouped')['Age'].median()

def impute_age(df):
    df['Age'] = df.apply(
        lambda row: age_imputer_map[row['Title_Grouped']] if pd.isnull(row['Age']) else row['Age'], axis=1
    )
    return df

train = impute_age(train)
test = impute_age(test)

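# Impute Fare (1 missing row in test)
# Note: FarePerPerson was computed in Section 4, before this fill, so that row's
# FarePerPerson stays NaN here and is left to the pipeline's SimpleImputer.
test['Fare'] = test['Fare'].fillna(test['Fare'].median())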

# Impute Embarked (2 missing rows in train) with the most frequent value
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].mode()[0])
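The row-wise `apply` above works, but it raises a `KeyError` if a frame contains a title group that never appeared in the training data. A vectorized variant with a fallback could look like the sketch below. The function name `impute_age_vectorized`, the toy frames, and the choice of the overall training median as fallback are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

train_toy = pd.DataFrame({
    "Title_Grouped": ["Mr", "Mr", "Miss", "Miss", "Rare"],
    "Age": [30.0, 40.0, 20.0, np.nan, 50.0],
})
test_toy = pd.DataFrame({
    "Title_Grouped": ["Mr", "Miss", "Unseen"],   # "Unseen" is absent from train_toy
    "Age": [np.nan, np.nan, np.nan],
})

# Medians are learned from the training frame only
age_by_title = train_toy.groupby("Title_Grouped")["Age"].median()
overall_median = train_toy["Age"].median()

def impute_age_vectorized(df):
    """Fill missing Age from the group median, falling back to the overall median."""
    out = df.copy()
    fill = out["Title_Grouped"].map(age_by_title).fillna(overall_median)
    out["Age"] = out["Age"].fillna(fill)
    return out

train_filled = impute_age_vectorized(train_toy)
test_filled = impute_age_vectorized(test_toy)
print(test_filled)
```

Working on a copy also keeps the input frame unchanged, which makes the cell safe to re-run.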

## Section 6: Check Missing Values (Post-Cleaning Check)

In [11]:
check_missing(train, "Train - After Cleaning")
check_missing(test, "Test - After Cleaning")

Missing values in Train - After Cleaning:
Cabin    687
dtype: int64
----------------------------------------
Missing values in Test - After Cleaning:
Cabin            327
FarePerPerson      1
dtype: int64
----------------------------------------


## Section 7: Feature Selection (Choosing Modeling Columns)
*Which columns can the model understand? How are categorical and numeric columns handled?*

In [20]:
features = [
    'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare',
    'Title_Grouped', 'FamilySize', 'IsAlone', 'FarePerPerson'
]

X = train[features]
y = train['Survived']
X_test_final = test[features]



## Section 8: Preprocessing Pipeline (Preprocessing Flow)
*How do I handle data transformations systematically?*

In [15]:
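# Note: 'SibSp' and 'Parch' are in `features` but in neither list below, so
# ColumnTransformer's default remainder='drop' discards them; they reach the
# model only indirectly through FamilySize, IsAlone and FarePerPerson.
numeric_features = ['Age', 'Fare', 'FamilySize', 'FarePerPerson']
categorical_features = ['Pclass', 'Sex', 'Title_Grouped', 'IsAlone']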

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])
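A quick way to check what a `ColumnTransformer` passes on is to fit it on a handful of rows and look at the output shape. The sketch below mirrors the structure above on a toy frame. The column subsets and values are illustrative assumptions, and `SibSp` is deliberately left out of both groups to show the default `remainder='drop'` behaviour.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "SibSp": [1, 1, 0, 1],            # not listed in either group below
    "Sex": ["male", "female", "female", "female"],
    "Pclass": [3, 1, 3, 1],
})

num_cols = ["Age", "Fare"]
cat_cols = ["Sex", "Pclass"]

pre = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])  # remainder defaults to "drop", so SibSp is discarded

Xt = pre.fit_transform(toy)
# 2 scaled numeric columns + 2 Sex levels + 2 Pclass levels = 6 columns
print(Xt.shape)
```

Passing `remainder='passthrough'` would keep unlisted columns instead of dropping them.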

## Section 9: Model Training - XGBoost + GridSearchCV (Model Building + Tuning)
*Which parameter combinations should I try? Do I care most about accuracy or generalization?*

In [17]:
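# Recent XGBoost releases ignore use_label_encoder (hence the warning in the output); it can be dropped.
xgb = XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42)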

param_grid = {
    'xgb__n_estimators': [100, 200],
    'xgb__max_depth': [3, 5, 7],
    'xgb__learning_rate': [0.01, 0.1, 0.2],
}

clf = Pipeline(steps=[('preprocessor', preprocessor), ('xgb', xgb)])
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy', verbose=1, n_jobs=-1)
grid_search.fit(X, y)

print("Best Parameters:", grid_search.best_params_)

Fitting 5 folds for each of 18 candidates, totalling 90 fits


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


Best Parameters: {'xgb__learning_rate': 0.2, 'xgb__max_depth': 3, 'xgb__n_estimators': 200}
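`best_params_` names the winning combination, but the full picture of how the candidates compare lives in `cv_results_`, and the cross-validated accuracy of the winner is in `best_score_`. The sketch below shows one way to read them. To stay free of extra dependencies it uses a synthetic dataset and swaps in a scikit-learn `DecisionTreeClassifier` for XGBoost; `X_demo`, `y_demo`, `pipe`, `grid` and `search` are illustrative names, and the scores it prints say nothing about the Titanic model.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Titanic features)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("tree", DecisionTreeClassifier(random_state=42)),
])
# Step-name prefix + double underscore addresses a parameter inside the pipeline
grid = {"tree__max_depth": [2, 4, 6], "tree__min_samples_leaf": [1, 5]}

search = GridSearchCV(pipe, grid, cv=5, scoring="accuracy", return_train_score=True)
search.fit(X_demo, y_demo)

columns = ["params", "mean_train_score", "mean_test_score",
           "std_test_score", "rank_test_score"]
results = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_score")
print(results.head())
print("Best CV accuracy:", search.best_score_)
```

A large gap between `mean_train_score` and `mean_test_score` for a candidate is a hint that it overfits the training folds.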



## Section 10: Evaluation on Training Data (Evaluation Results)
*Is my model accurate? Is it overfitting? How should I tune it?*

In [18]:
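# These metrics are computed on the same rows the model was fit on, so they
# overstate generalization; grid_search.best_score_ holds the cross-validated accuracy.
y_pred_train = grid_search.predict(X)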
print("Accuracy:", accuracy_score(y, y_pred_train))
print(confusion_matrix(y, y_pred_train))
print(classification_report(y, y_pred_train))


Accuracy: 0.9292929292929293
[[528  21]
 [ 42 300]]
              precision    recall  f1-score   support

           0       0.93      0.96      0.94       549
           1       0.93      0.88      0.90       342

    accuracy                           0.93       891
   macro avg       0.93      0.92      0.92       891
weighted avg       0.93      0.93      0.93       891
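The figures in the report follow directly from the confusion matrix. As a worked check, the sketch below recomputes accuracy and the class 1 precision and recall from the four counts printed above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the output above: rows are true classes, columns are predictions
cm = np.array([[528, 21],
               [42, 300]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()     # correct predictions over all 891 rows
precision_1 = tp / (tp + fp)        # of predicted survivors, how many survived
recall_1 = tp / (tp + fn)           # of actual survivors, how many were found

print(round(accuracy, 4), round(precision_1, 2), round(recall_1, 2))
# prints: 0.9293 0.93 0.88
```

The 42 false negatives against 21 false positives explain why recall for class 1 (0.88) is lower than its precision (0.93).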



## Section 11: Prediction & Submission File (Producing Predictions)
*Do the predictions need reformatting? Are any IDs missing?*

In [22]:
y_test_pred = grid_search.predict(X_test_final)

submission = pd.DataFrame({
    'PassengerId': test_passenger_ids,
    'Survived': y_test_pred
})

timestamp = datetime.now().strftime('%Y-%m-%d')
# Use the computed timestamp instead of a hard-coded date
filename = f"../output/titanic_submission_v3-a_xgb_tuned_{timestamp}.csv"
submission.to_csv(filename, index=False)
print(f"✅ Submission saved as: {filename}")

✅ Submission saved as: ../output/titanic_submission_v3-a_xgb_tuned_2025-07-31.csv
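The two questions at the top of this section (format and missing IDs) can be turned into explicit checks before the file is written. The following is a minimal sketch, assuming a hypothetical helper `validate_submission` and made-up IDs; only the column names `PassengerId` and `Survived` come from the notebook.

```python
import pandas as pd

def validate_submission(submission, expected_ids):
    """Raise AssertionError if the submission frame is malformed."""
    assert list(submission.columns) == ["PassengerId", "Survived"]
    assert len(submission) == len(expected_ids)                    # no rows lost or duplicated
    assert submission["PassengerId"].tolist() == list(expected_ids)
    assert submission["Survived"].isin([0, 1]).all()               # binary labels only
    assert not submission.isnull().any().any()
    return True

# Toy example with made-up IDs
ids = pd.Series([892, 893, 894])
demo = pd.DataFrame({"PassengerId": ids, "Survived": [0, 1, 0]})
ok = validate_submission(demo, ids)
print("Submission looks valid:", ok)
```

In the notebook this would be called as `validate_submission(submission, test_passenger_ids)` just before `to_csv`.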
