# Kaggle_Competition_2.ipynb

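## Project overview

This notebook walks through the modeling process on a more advanced dataset, using Kaggle's well-known house price dataset (the Ames Housing Dataset) as the example. It covers:
1. Data loading and preprocessing
2. Feature engineering (missing-value handling, categorical encoding, feature scaling, and so on)
3. Building baseline models (linear regression, random forest, Gradient Boosting, XGBoost, LightGBM)
4. Using ensemble methods (Voting, Stacking) to improve prediction accuracy
5. Model evaluation and comparison

This example uses the Ames Housing Dataset.
If you are working in a Kaggle Kernel, you can attach the data directly through Datasets or download it with the Kaggle API.
If you are working locally, download the Ames Housing dataset beforehand and place it in a suitable location.

Ames Housing Dataset reference: https://www.kaggle.com/c/house-prices-advanced-regression-techniques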

Make sure the required packages are installed:
```bash
!pip install numpy pandas matplotlib seaborn scikit-learn xgboost lightgbm
```

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, VotingRegressor
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import lightgbm as lgb
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)
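# The 'seaborn' style name is no longer available in recent Matplotlib releases,
# so apply the seaborn theme through seaborn itself.
sns.set_theme()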


### Loading the data
This assumes the files are already in the `./data` folder
and are named `train.csv` and `test.csv` (matching the Kaggle House Prices dataset).

In [None]:
train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/test.csv')

### Exploratory data analysis (EDA)

In [None]:
print("Train set shape:", train.shape)
print("Test set shape:", test.shape)
print(train['SalePrice'].describe())

# Log-transform SalePrice with log(1 + x) to reduce its right skew
train['SalePrice'] = np.log1p(train['SalePrice'])
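
Because the target is now on a log scale, every RMSE reported later is a log-scale error, and predictions have to be mapped back with `np.expm1` before submission. The following is a minimal sketch of that round trip; the prices and predictions are made-up values for illustration only.

```python
import numpy as np

# Hypothetical sale prices and model predictions, for illustration only.
prices = np.array([100000.0, 200000.0, 350000.0])
preds = np.array([110000.0, 190000.0, 360000.0])

log_prices = np.log1p(prices)    # forward transform: log(1 + x)
restored = np.expm1(log_prices)  # inverse transform: exp(x) - 1

# RMSE on the log scale penalizes relative errors rather than absolute ones.
rmse_log = np.sqrt(np.mean((np.log1p(preds) - log_prices) ** 2))
print("restored:", restored)
print("log-scale RMSE:", round(float(rmse_log), 4))
```

A 10,000 error on a 100,000 house contributes more to this score than the same error on a 350,000 house, which is the usual motivation for modeling the log of the price.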

### Feature engineering
Combine train and test (excluding the target) so that feature engineering is applied consistently to both.

In [None]:
train_ID = train['Id']
test_ID = test['Id']

y = train['SalePrice'].copy()
train.drop(['SalePrice','Id'], axis=1, inplace=True)
test.drop(['Id'], axis=1, inplace=True)

# ignore_index=True gives the combined frame a unique row index
all_data = pd.concat([train, test], axis=0, ignore_index=True)
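
One reason for combining the two splits is that encoders then see every category, so train and test end up with identical columns. Here is a small sketch of the combine, encode, and split-back pattern on toy frames; the `toy_*` names and values are hypothetical.

```python
import pandas as pd

# Toy frames: the "Grvl" category appears only in the training rows.
toy_train = pd.DataFrame({"Street": ["Pave", "Grvl"], "LotArea": [8450, 9600]})
toy_test = pd.DataFrame({"Street": ["Pave"], "LotArea": [11250]})

combined = pd.concat([toy_train, toy_test], axis=0, ignore_index=True)
encoded = pd.get_dummies(combined)  # dummy columns are derived from all rows

# Split back by position: the first len(toy_train) rows are the training rows.
n_train = len(toy_train)
train_part = encoded.iloc[:n_train]
test_part = encoded.iloc[n_train:]
print(list(test_part.columns))
```

Encoding the two frames separately would give the test frame no `Street_Grvl` column, and the models would then reject it for having a different feature set.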

In [None]:
missing = all_data.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print("Missing-value summary:")
print(missing)

Fill in the missing values (following common practice from the Kaggle discussion forums).

In [None]:
# Categorical columns where a missing value means the feature is absent
for col in ('PoolQC','MiscFeature','Alley','Fence','FireplaceQu','GarageFinish','GarageQual','GarageCond','GarageType','BsmtExposure','BsmtCond','BsmtQual','BsmtFinType1','BsmtFinType2','MasVnrType'):
    all_data[col] = all_data[col].fillna('None')

# Numeric columns where a missing value is filled with 0
for col in ('GarageYrBlt','GarageArea','GarageCars','BsmtFullBath','BsmtHalfBath','BsmtFinSF1','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF','MasVnrArea'):
    all_data[col] = all_data[col].fillna(0)

# Fill LotFrontage with the median of its Neighborhood
all_data['LotFrontage'] = all_data.groupby('Neighborhood')['LotFrontage'].transform(lambda x: x.fillna(x.median()))

# Remaining categorical columns: fill with the most frequent value
for col in ['MSZoning','Electrical','KitchenQual','Exterior1st','Exterior2nd','Functional','SaleType','Utilities']:
    all_data[col] = all_data[col].fillna(all_data[col].mode()[0])

# Add a derived feature: total square footage across basement and both floors
all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']
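
The group-wise fill used for `LotFrontage` can be easier to follow on a tiny example. The sketch below applies the same `groupby(...).transform(...)` pattern to a toy frame with made-up values.

```python
import numpy as np
import pandas as pd

# Toy data: one missing LotFrontage value in each neighborhood.
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "LotFrontage": [60.0, np.nan, 80.0, 40.0, np.nan],
})

# transform returns a result aligned with the original rows, so it can be
# assigned straight back to the column.
toy["LotFrontage"] = toy.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)
print(toy)
```

The gap in neighborhood A is filled with 70.0 (the median of 60 and 80) and the gap in B with 40.0, so each house borrows from its own neighborhood rather than from the global median.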

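Label-encode the features that carry an ordinal relationship. Note that `LabelEncoder` assigns integer codes in sorted (alphabetical) label order, so the resulting codes do not necessarily follow the quality ranking.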

In [None]:
cols_to_label_encode = (
    'FireplaceQu','BsmtQual','BsmtCond','GarageQual','GarageCond',
    'ExterQual','ExterCond','HeatingQC','PoolQC','KitchenQual','BsmtFinType1',
    'BsmtFinType2','Functional','Fence','BsmtExposure','GarageFinish','LandSlope',
    'LotShape','PavedDrive','Street','Alley','CentralAir','MasVnrType','MiscFeature','SaleType','SaleCondition'
)

for c in cols_to_label_encode:
    lbl = LabelEncoder() 
    all_data[c] = lbl.fit_transform(all_data[c].astype(str))

# One-hot encode the remaining categorical features
all_data = pd.get_dummies(all_data, drop_first=True)
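
If the ordering of the quality levels matters to the model, an explicit mapping is one alternative to `LabelEncoder`. The sketch below contrasts the two on the quality codes used by several Ames columns; the `quality_order` dictionary is an assumption introduced here for illustration, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

quality = pd.Series(["Ex", "Gd", "TA", "Fa", "Po", "None"])

# LabelEncoder sorts the labels first, so the codes follow alphabetical order.
alphabetical = LabelEncoder().fit_transform(quality)

# Hypothetical explicit mapping that preserves the ranking from worst to best.
quality_order = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
ordinal = quality.map(quality_order)

print(pd.DataFrame({"label": quality, "alphabetical": alphabetical, "ordinal": ordinal}))
```

Tree-based models can often cope with either coding, while linear models are more likely to benefit from codes that increase with quality.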

Split the combined data back into train and test sets, then scale the numeric values.

In [None]:
X = all_data.iloc[:len(y), :]
X_test = all_data.iloc[len(y):, :]

# Fit the scaler on the labeled training rows, then apply the same scaling to the test rows
scaler = StandardScaler()
X = scaler.fit_transform(X)
X_test = scaler.transform(X_test)

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
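
A single hold-out split can give a noisy estimate, and fitting the scaler before splitting lets validation rows influence the scaling statistics. One possible alternative, sketched here on synthetic data rather than the housing features, is to wrap the scaler and model in a pipeline and score it with cross-validation, so the scaler is refit inside each fold.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data, used only to keep the sketch self-contained.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

# The pipeline refits StandardScaler on the training part of every fold.
model = make_pipeline(StandardScaler(), Ridge(alpha=10))
scores = cross_val_score(
    model, X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
)
cv_rmse = -scores  # scikit-learn returns negated errors so that higher is better
print("CV RMSE per fold:", cv_rmse)
print("Mean CV RMSE:", cv_rmse.mean())
```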

### Model training and comparison

In [None]:
# LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_valid)
print("LinearRegression RMSE:", np.sqrt(mean_squared_error(y_valid, y_pred_lr)))

# Ridge
ridge = Ridge(alpha=10)
ridge.fit(X_train, y_train)
y_pred_ridge = ridge.predict(X_valid)
print("Ridge RMSE:", np.sqrt(mean_squared_error(y_valid, y_pred_ridge)))

# Lasso
lasso = Lasso(alpha=0.001)
lasso.fit(X_train, y_train)
y_pred_lasso = lasso.predict(X_valid)
print("Lasso RMSE:", np.sqrt(mean_squared_error(y_valid, y_pred_lasso)))

# RandomForest
rf = RandomForestRegressor(n_estimators=300, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_valid)
print("RandomForest RMSE:", np.sqrt(mean_squared_error(y_valid, y_pred_rf)))

# GradientBoosting
gbr = GradientBoostingRegressor(n_estimators=300, random_state=42)
gbr.fit(X_train, y_train)
y_pred_gbr = gbr.predict(X_valid)
print("GradientBoosting RMSE:", np.sqrt(mean_squared_error(y_valid, y_pred_gbr)))

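# XGBoost
# Recent XGBoost releases take early_stopping_rounds in the constructor rather than in fit().
xgbr = xgb.XGBRegressor(n_estimators=1000, learning_rate=0.05, random_state=42, early_stopping_rounds=50)
xgbr.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)
y_pred_xgbr = xgbr.predict(X_valid)
print("XGBoost RMSE:", np.sqrt(mean_squared_error(y_valid, y_pred_xgbr)))
# Disable early stopping again so the ensembles below can refit clones without an eval_set,
# and keep the number of trees found by early stopping.
xgbr.set_params(early_stopping_rounds=None, n_estimators=xgbr.best_iteration + 1)

# LightGBM
# Recent LightGBM releases configure early stopping through callbacks rather than fit() arguments.
lgbr = lgb.LGBMRegressor(n_estimators=1000, learning_rate=0.05, random_state=42)
lgbr.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], callbacks=[lgb.early_stopping(stopping_rounds=50, verbose=False)])
y_pred_lgbr = lgbr.predict(X_valid)
print("LightGBM RMSE:", np.sqrt(mean_squared_error(y_valid, y_pred_lgbr)))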
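
The cell above repeats the same fit, predict, and score steps for every model. As a rough sketch of how that could be factored out, the hypothetical helper `evaluate_models` below loops over a dictionary of estimators; it runs on synthetic data and uses only scikit-learn models so that it stays self-contained.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def evaluate_models(models, X_tr, y_tr, X_va, y_va):
    """Fit each model and return its validation RMSE, keyed by model name."""
    results = {}
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        mse = mean_squared_error(y_va, model.predict(X_va))
        results[name] = float(np.sqrt(mse))
    return results


# Synthetic regression data, used only to keep the sketch runnable.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.2, size=300)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=10),
    "rf": RandomForestRegressor(n_estimators=50, random_state=42),
}
results = evaluate_models(demo_models, X_tr, y_tr, X_va, y_va)

# Print the models from best (lowest RMSE) to worst.
for name, rmse in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name}: {rmse:.4f}")
```

Models that need extra `fit` arguments, such as an `eval_set` for early stopping, would still have to be handled separately.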

### Model ensembling (Ensemble)

In [None]:
# Voting Ensemble
voting_reg = VotingRegressor(estimators=[
    ('ridge', ridge), 
    ('lasso', lasso), 
    ('rf', rf), 
    ('gbr', gbr), 
    ('xgbr', xgbr), 
    ('lgbr', lgbr)
])
voting_reg.fit(X_train, y_train)
y_pred_voting = voting_reg.predict(X_valid)
print("VotingRegressor RMSE:", np.sqrt(mean_squared_error(y_valid, y_pred_voting)))

# Stacking Ensemble
from sklearn.ensemble import StackingRegressor
stacking_reg = StackingRegressor(
    estimators=[
        ('ridge', ridge),
        ('lasso', lasso),
        ('rf', rf),
        ('gbr', gbr),
        ('xgbr', xgbr),
        ('lgbr', lgbr)
    ],
    final_estimator=Ridge(alpha=10),
    passthrough=False
)
stacking_reg.fit(X_train, y_train)
y_pred_stacking = stacking_reg.predict(X_valid)
print("StackingRegressor RMSE:", np.sqrt(mean_squared_error(y_valid, y_pred_stacking)))
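
To make the two ensemble strategies more concrete, the sketch below reproduces them by hand on synthetic data. With default settings, `VotingRegressor` averages the predictions of its fitted base models, while stacking trains a meta-model on out-of-fold predictions from the base models. The data and the two base models here are placeholders, not the models trained above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

# Synthetic regression data, used only to keep the sketch self-contained.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([1.5, -2.0, 0.0, 1.0]) + rng.normal(scale=0.3, size=200)

base = [
    ("ridge", Ridge(alpha=1.0)),
    ("rf", RandomForestRegressor(n_estimators=20, random_state=0)),
]

# Voting: the ensemble prediction is the unweighted mean of the base predictions.
voting = VotingRegressor(estimators=base).fit(X_demo, y_demo)
manual_vote = np.mean([est.predict(X_demo) for est in voting.estimators_], axis=0)
print("voting equals manual mean:", np.allclose(manual_vote, voting.predict(X_demo)))

# Stacking by hand: out-of-fold base predictions become the meta-model's features,
# so the meta-model never sees a prediction made on a row the base model trained on.
oof = np.column_stack(
    [cross_val_predict(est, X_demo, y_demo, cv=5) for _, est in base]
)
meta = Ridge(alpha=10).fit(oof, y_demo)
print("meta-model weights:", dict(zip([name for name, _ in base], meta.coef_)))
```

The meta-model weights give a rough indication of how much each base model contributes to the stacked prediction.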

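### Final prediction and output
Choose the best-performing model (this example uses the Stacking Ensemble), predict on the test set, and write the results to a file.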

In [None]:
final_pred = stacking_reg.predict(X_test)
final_pred = np.expm1(final_pred)

submission = pd.DataFrame({
    'Id': test_ID,
    'SalePrice': final_pred
})
submission.to_csv('submission.csv', index=False)
print("Done! 'submission.csv' has been written and can be uploaded to Kaggle for evaluation.")
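
Before uploading, a few quick sanity checks on the submission frame can catch common mistakes such as forgetting the inverse log transform. The sketch below runs such checks on a small made-up frame; the IDs and log-scale predictions are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical IDs and log-scale predictions, for illustration only.
demo_ids = pd.Series([1461, 1462, 1463], name="Id")
demo_log_pred = np.array([11.7, 12.0, 12.2])

demo_submission = pd.DataFrame({
    "Id": demo_ids,
    "SalePrice": np.expm1(demo_log_pred),  # back to the original price scale
})

checks = {
    "has expected columns": list(demo_submission.columns) == ["Id", "SalePrice"],
    "no missing values": not demo_submission.isna().any().any(),
    "ids are unique": demo_submission["Id"].is_unique,
    "prices are positive": bool((demo_submission["SalePrice"] > 0).all()),
    # Prices still on the log scale would be around 12, far below 1000.
    "prices look untransformed": bool((demo_submission["SalePrice"] > 1000).all()),
}
for name, passed in checks.items():
    print(f"{name}: {'OK' if passed else 'FAILED'}")
```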