# Feature Engineering: A Complete Guide

This notebook walks through a complete feature engineering workflow, covering:
1. Feature Selection
2. Feature Scaling
3. Feature Combination (including date-derived features)
4. Categorical Encoding
5. Text Feature Processing
6. Introduction to Embeddings
7. Dimensionality Reduction via PCA
8. Putting It Together with Pipeline

We start from the `iris` dataset and add a few synthetic features (categorical, date, text) so that each technique has something to work on.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(style="whitegrid", font_scale=1.2)

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif, RFE, VarianceThreshold
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, OneHotEncoder, OrdinalEncoder
from sklearn.preprocessing import PolynomialFeatures, LabelEncoder
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import category_encoders as ce  # third-party package: pip install category_encoders

np.random.seed(42)

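## Loading and Preparing the Data

We load the iris dataset and add synthetic categorical, date, and text features to demonstrate the various feature engineering techniques.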

In [None]:
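# Load the iris dataset
iris = load_iris()
df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])
df['target'] = pd.Categorical.from_codes(iris['target'], categories=iris['target_names'])

# Add synthetic categorical features
df['color'] = np.random.choice(['Red','Green','Blue'], size=len(df))
df['region'] = np.random.choice(['North','South','East','West'], size=len(df))

# Add a synthetic date feature: one date per row, in random order.
# np.random.permutation returns a shuffled copy; shuffling `dates.values` in place
# is unreliable because that array may not be writable, depending on the pandas version.
dates = pd.date_range(start='2021-01-01', periods=len(df))
df['date'] = np.random.permutation(dates.to_numpy())

# Add a synthetic text feature (short Chinese customer-feedback phrases)
feedback_choices = ['產品品質優良', '服務態度良好', '價格合理', '出貨速度快',
                   '產品有瑕疵', '服務需改善', '價格偏高', '出貨太慢']
df['feedback'] = np.random.choice(feedback_choices, size=len(df))

# Add a constant feature (used to demonstrate removal of zero-variance features)
df['constant_feature'] = 1.0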

df.head()

## Train/Test Split

Split the data into training and test sets, with `target` as the label column.

In [None]:
X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
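
The split above is purely random. As a minimal sketch of an alternative, assuming you want each class represented in the same proportion in both sets, `train_test_split` accepts a `stratify` argument. The names `X_demo`, `y_demo`, and the `*_counts` variables are illustrative and not part of the notebook.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Plain iris arrays, used only for this illustration
X_demo, y_demo = load_iris(return_X_y=True)

# stratify=y_demo keeps the class ratio of the full data in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_counts = Counter(y_tr.tolist())
test_counts = Counter(y_te.tolist())
print("train class counts:", dict(train_counts))
print("test class counts:", dict(test_counts))
```

Because iris has 50 samples per class, a stratified 80/20 split places 40 samples of each class in the training set and 10 in the test set.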

## 1. Feature Selection

This section covers:
- VarianceThreshold
- SelectKBest (ANOVA F-value, Mutual Information)
- RFE (Recursive Feature Elimination)
- Model-based feature importance (RandomForest)


In [None]:
# Use only the numeric features for the feature selection demos
X_train_num = X_train.select_dtypes(include=[np.number])
y_train_num = y_train

# 1. VarianceThreshold: drop features whose variance does not exceed the threshold
vt = VarianceThreshold(threshold=0.0)
X_vt = vt.fit_transform(X_train_num)
removed_features = X_train_num.columns[~vt.get_support()]
print("Features removed by VarianceThreshold:", removed_features.tolist())

In [None]:
# 2. SelectKBest (ANOVA F-value)
# f_classif and mutual_info_classif are already imported in the first cell
skb_f = SelectKBest(score_func=f_classif, k=4)
X_skb_f = skb_f.fit_transform(X_train_num, y_train_num)
selected_features_f = X_train_num.columns[skb_f.get_support()]
print("Features selected by ANOVA F-value:", selected_features_f.tolist())

# Mutual information can be used as the scoring function instead
skb_mi = SelectKBest(score_func=mutual_info_classif, k=4)
X_skb_mi = skb_mi.fit_transform(X_train_num, y_train_num)
selected_features_mi = X_train_num.columns[skb_mi.get_support()]
print("Features selected by mutual information:", selected_features_mi.tolist())
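
SelectKBest only reports which features it kept. To see why, it can help to look at the scores themselves. The following is a small sketch on the plain iris measurements (without the synthetic columns); `iris_demo` and `score_table` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

iris_demo = load_iris()

# f_classif returns one ANOVA F-statistic and one p-value per feature
f_scores, p_values = f_classif(iris_demo.data, iris_demo.target)

score_table = (
    pd.DataFrame({"feature": iris_demo.feature_names, "F": f_scores, "p_value": p_values})
    .sort_values("F", ascending=False)
    .reset_index(drop=True)
)
print(score_table)
```

A larger F-statistic means the class means differ more relative to the within-class spread, so the top rows of the table are the features SelectKBest would keep first.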

In [None]:
# 3. RFE (with logistic regression as the base estimator)
# RFE is already imported in the first cell
lr = LogisticRegression(max_iter=10000)
rfe = RFE(estimator=lr, n_features_to_select=4)
X_rfe = rfe.fit_transform(X_train_num, y_train_num)
selected_features_rfe = X_train_num.columns[rfe.get_support()]
print("Features selected by RFE:", selected_features_rfe.tolist())

In [None]:
# 4. Model-based feature importance (RandomForest)
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_num, y_train_num)
importances = rf.feature_importances_
imp_df = pd.DataFrame({'feature': X_train_num.columns, 'importance': importances})
imp_df = imp_df.sort_values('importance', ascending=False)
print("RandomForest feature importances:")
print(imp_df)

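## 2. Feature Scaling

Compare StandardScaler, MinMaxScaler, and RobustScaler. All three are linear rescalings, so the shape of each histogram stays the same; what changes is the location and scale of the axis.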

In [None]:
numeric_cols = X_train_num.columns
scalers = {
    'StandardScaler': StandardScaler(),
    'MinMaxScaler': MinMaxScaler(),
    'RobustScaler': RobustScaler()
}

# One row of panels per feature: the original histogram, then one panel per scaler
plt.figure(figsize=(15, 5*len(numeric_cols)))
for i, col in enumerate(numeric_cols):
    data = X_train_num[col].values.reshape(-1, 1)  # scalers expect a 2-D array
    plt.subplot(len(numeric_cols), 4, i*4 + 1)
    sns.histplot(x=data.ravel(), bins=30)  # flatten to 1-D so seaborn plots a single series
    plt.title(f'Original - {col}')
    for j, (name, scaler) in enumerate(scalers.items()):
        scaled_data = scaler.fit_transform(data)
        plt.subplot(len(numeric_cols), 4, i*4 + j + 2)
        sns.histplot(x=scaled_data.ravel(), bins=30)
        plt.title(f'{name} - {col}')

plt.tight_layout()
plt.show()
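
To make the difference between the three scalers concrete, here is a sketch that recomputes each transformation by hand with NumPy and compares it with scikit-learn's output. The toy column `x` (with one deliberate outlier) and the `manual_*` names are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Toy column with one outlier (illustrative data)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# StandardScaler: subtract the mean, divide by the population std (ddof=0)
manual_standard = (x - x.mean()) / x.std()

# MinMaxScaler: map the minimum to 0 and the maximum to 1
manual_minmax = (x - x.min()) / (x.max() - x.min())

# RobustScaler: subtract the median, divide by the interquartile range
q1, q3 = np.percentile(x, [25, 75])
manual_robust = (x - np.median(x)) / (q3 - q1)

print("standard matches:", np.allclose(manual_standard, StandardScaler().fit_transform(x)))
print("minmax matches:  ", np.allclose(manual_minmax, MinMaxScaler().fit_transform(x)))
print("robust matches:  ", np.allclose(manual_robust, RobustScaler().fit_transform(x)))
print("minmax values:", manual_minmax.ravel())
print("robust values:", manual_robust.ravel())
```

With the outlier present, MinMaxScaler squeezes the four ordinary values into a narrow band near 0, while RobustScaler keeps them evenly spaced around 0 because the median and IQR are barely affected by a single extreme value.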

## 3. Feature Combination

Use PolynomialFeatures to generate polynomial terms, then build a few interaction and date-derived features by hand.

In [None]:
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_train_num)
print("Number of original numeric features:", X_train_num.shape[1])
print("Number of features after polynomial expansion:", X_poly.shape[1])

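You can also create features by hand, for example the ratio of two measurements, or the year, month, and day of week extracted from a date.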

In [None]:
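# Work on an explicit copy so that adding columns does not trigger SettingWithCopyWarning
X_train = X_train.copy()

# Date-derived features
X_train['year'] = X_train['date'].dt.year
X_train['month'] = X_train['date'].dt.month
X_train['dayofweek'] = X_train['date'].dt.dayofweek

# Numeric interaction: ratio of petal length to sepal length
# (the dataset has no income or age column, so two existing measurements are used)
X_train['petal_sepal_length_ratio'] = X_train['petal length (cm)'] / (X_train['sepal length (cm)'] + 0.1)
# The +0.1 offset is for illustration only; in practice, make sure the denominator can never be zero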
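
Month and day of week are cyclic: December is adjacent to January, yet their integer codes (12 and 1) are far apart. One common remedy, shown here as a hedged sketch rather than a required step, is to project the value onto a circle with sine and cosine. The frame `dates_demo` and the `month_sin` and `month_cos` columns are illustrative names.

```python
import numpy as np
import pandas as pd

dates_demo = pd.DataFrame(
    {"date": pd.to_datetime(["2021-01-15", "2021-06-15", "2021-12-15"])}
)
month = dates_demo["date"].dt.month

# Map months 1..12 onto angles around a circle, then take sine and cosine
angle = 2 * np.pi * (month - 1) / 12
dates_demo["month_sin"] = np.sin(angle)
dates_demo["month_cos"] = np.cos(angle)

coords = dates_demo[["month_sin", "month_cos"]].to_numpy()
dist_jan_dec = np.linalg.norm(coords[0] - coords[2])
dist_jan_jun = np.linalg.norm(coords[0] - coords[1])
print(dates_demo)
print("Jan-Dec distance:", dist_jan_dec)
print("Jan-Jun distance:", dist_jan_jun)
```

In the encoded space January sits closer to December than to June, which matches the calendar. Whether this helps depends on the model; tree-based models often cope fine with the raw integer.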

## 4. Categorical Encoding

This section demonstrates several encoders:
- One-Hot Encoding
- Ordinal Encoding
- Target Encoding (requires the external package category_encoders)
- Binary Encoding (also from category_encoders)


In [None]:
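cat_cols = ['color', 'region']

# One-Hot Encoding
# `sparse_output` replaced the `sparse` argument in scikit-learn 1.2;
# on older versions, pass `sparse=False` instead.
ohe = OneHotEncoder(drop='first', sparse_output=False)
X_ohe = ohe.fit_transform(X_train[cat_cols])
print("Shape after One-Hot encoding:", X_ohe.shape)

# Ordinal Encoding
# Without a `categories` argument the integer codes follow sorted (alphabetical) order;
# pass `categories=[...]` when the categories have a meaningful order.
ord_enc = OrdinalEncoder()
X_ord = ord_enc.fit_transform(X_train[cat_cols])
print("Shape after Ordinal encoding:", X_ord.shape)

# Target Encoding
# Target encoding averages the target per category, so it needs a numeric target.
# This demo uses a binary "is setosa" indicator (an illustrative choice).
y_train_binary = (y_train == 'setosa').astype(int)
te = ce.TargetEncoder()
X_te = te.fit_transform(X_train[cat_cols], y_train_binary)
print("Shape after Target encoding:", X_te.shape)

# Binary Encoding
be = ce.BinaryEncoder()
X_be = be.fit_transform(X_train[cat_cols])
print("Shape after Binary encoding:", X_be.shape)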
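
To show what target encoding computes, here is a minimal pandas sketch of the unsmoothed version: each category is replaced by the mean of the target observed for that category in the training data. The toy frames `train` and `test` and the column `color_te` are made up for illustration. Library encoders typically add smoothing on top of this, so their numbers will differ.

```python
import pandas as pd

# Toy training data with a binary target (illustrative values)
train = pd.DataFrame(
    {"color": ["Red", "Red", "Blue", "Blue", "Green"], "y": [1, 0, 1, 1, 0]}
)
# "Purple" never appears in the training data
test = pd.DataFrame({"color": ["Red", "Blue", "Purple"]})

# Learn the mapping from the training data only, to avoid leaking test labels
global_mean = train["y"].mean()
category_means = train.groupby("color")["y"].mean()

# Unseen categories fall back to the global mean
test["color_te"] = test["color"].map(category_means).fillna(global_mean)
print(category_means)
print(test)
```

The mapping is fitted on the training set and merely applied to the test set, the same fit/transform separation that the scikit-learn style encoders enforce.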

## 5. Text Feature Processing

CountVectorizer or TfidfVectorizer gives a simple bag-of-words representation of a text feature. When more is needed, embedding methods such as Word2Vec or BERT are an option.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

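X_text = X_train['feedback']
# The default tokenizer splits on non-word characters, so each unsegmented
# Chinese phrase here is treated as a single token
cv = CountVectorizer()
X_text_cv = cv.fit_transform(X_text)
print("CountVectorizer feature matrix shape:", X_text_cv.shape)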

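## 6. Introduction to Embeddings

Embeddings map categories or vocabulary items to vectors in a low-dimensional space. In NLP, tools such as Word2Vec are commonly used to learn word vectors. The example below trains a small Word2Vec model with gensim on character-tokenized text; it is a toy demonstration, not a model you would use in practice.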

In [None]:
!pip install gensim

In [None]:
from gensim.models import Word2Vec

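# Naive tokenization for the demo: split each string into a list of characters
sentences = X_train['feedback'].apply(list).tolist()
model_w2v = Word2Vec(sentences, vector_size=50, window=5, min_count=1, workers=4)
# Tokens are single characters, so look up '產'; a two-character word such as '產品'
# is not in the vocabulary and would raise a KeyError
word_vec = model_w2v.wv['產']
print("Vector for '產':", word_vec)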

In practice, proper word segmentation and a pretrained language model give much better embeddings.
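
Once per-token vectors exist, a model still needs one fixed-length vector per row. A simple way to get one, sketched here with NumPy only, is to average the token vectors. The `embedding_matrix` below is random and stands in for trained vectors; `vocab` and `embed_sentence` are hypothetical names for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Character vocabulary and a random (untrained) embedding table, for illustration only
vocab = {ch: idx for idx, ch in enumerate(sorted(set("產品有瑕疵價格合理")))}
embedding_matrix = rng.normal(size=(len(vocab), 8))


def embed_sentence(text):
    """Average the vectors of known characters; return zeros if none are known."""
    rows = [embedding_matrix[vocab[ch]] for ch in text if ch in vocab]
    if not rows:
        return np.zeros(embedding_matrix.shape[1])
    return np.mean(rows, axis=0)


vec = embed_sentence("產品有瑕疵")
print("sentence vector shape:", vec.shape)
print("unknown text vector:", embed_sentence("xyz"))
```

Mean pooling ignores token order, which is a deliberate simplification; it is often a reasonable baseline before moving to sequence models.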

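## 7. Dimensionality Reduction with PCA

Run PCA on the numeric features and look at how the data is distributed in the low-dimensional space.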

In [None]:
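# Use only the four iris measurements; the constant column and any randomly
# generated date-derived columns would only add noise to the components
X_num_train = X_train[iris['feature_names']]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_num_train)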

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

pca_df = pd.DataFrame(X_pca, columns=['PC1','PC2'])
pca_df['target'] = y_train.reset_index(drop=True)

plt.figure(figsize=(8,6))
sns.scatterplot(x='PC1', y='PC2', hue='target', data=pca_df)
plt.title('PCA Result')
plt.show()

print('Explained variance ratio:', pca.explained_variance_ratio_)
print('Cumulative explained variance ratio:', np.cumsum(pca.explained_variance_ratio_))
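
The cumulative explained variance can also drive the choice of the number of components. As a sketch on the standardized iris measurements, passing a float between 0 and 1 as `n_components` asks PCA to keep just enough components to reach that fraction of variance. The names `X_iris`, `pca_95`, and `X_reduced` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the four iris measurements before PCA
X_iris = StandardScaler().fit_transform(load_iris().data)

# A float in (0, 1) is interpreted as the target fraction of explained variance
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X_iris)

print("components kept:", pca_95.n_components_)
print("cumulative explained variance:", np.cumsum(pca_95.explained_variance_ratio_))
```

The 0.95 threshold is an arbitrary illustration; a suitable value depends on how much information loss the downstream model can tolerate.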

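## 8. Putting It Together with Pipeline

A Pipeline chains the feature engineering steps and model training, which keeps the workflow concise and reproducible.

The example is a simple Pipeline that:
1. Applies StandardScaler to the numeric features
2. Applies One-Hot Encoding to the categorical features
3. Trains a Logistic Regression model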

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

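# Numeric columns of the original frame; these exist in both X_train and X_test
numeric_features = df.select_dtypes(include=[np.number]).columns.tolist()
cat_features = ['color','region']

numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(drop='first')

# remainder='drop' discards unlisted columns (date, feedback text, and any
# columns added only to X_train), which LogisticRegression cannot consume
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, cat_features)
], remainder='drop')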

pipe = Pipeline([
    ('preprocess', preprocessor),
    ('clf', LogisticRegression(max_iter=10000))
])

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print("Accuracy on test set:", accuracy_score(y_test, y_pred))
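
A single train/test split gives one accuracy number. Because the preprocessing lives inside the Pipeline, the whole thing can be passed to cross-validation, and the scaler and encoder are refitted on each training fold. The sketch below is self-contained and rebuilds a small version of the setup; `X_cv`, `y_cv`, `pipe_cv`, and the synthetic `color` column are illustrative names.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

iris_cv = load_iris(as_frame=True)
X_cv = iris_cv.data.copy()
y_cv = iris_cv.target

# Synthetic categorical column, unrelated to the target (illustration only)
rng = np.random.default_rng(42)
X_cv["color"] = rng.choice(["Red", "Green", "Blue"], size=len(X_cv))

num_cols = list(iris_cv.feature_names)
pipe_cv = Pipeline([
    ("preprocess", ColumnTransformer([
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["color"]),
    ])),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Each fold refits the scaler and encoder on its own training portion
scores = cross_val_score(pipe_cv, X_cv, y_cv, cv=5)
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```

Keeping preprocessing inside the Pipeline is what prevents statistics from the validation fold leaking into the fitted scaler.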

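## Summary

This notebook demonstrated a range of feature engineering techniques: feature selection, feature scaling, feature combination, categorical encoding, text feature processing, date feature extraction, an introduction to embeddings, and dimensionality reduction with PCA. A Pipeline then tied the steps together so that feature engineering and model training are repeatable and easy to maintain.

In practice, choose among these methods according to the characteristics of the data, the needs of the model, and the goals of the project. Feature engineering draws on experimentation, domain knowledge, and intuition, and finding a good strategy usually takes repeated trial and comparison.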