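## **Project 2: Feature Importance**

### **Motivation**

When developing quantitative trading strategies, financial market data usually comes with a large number of technical indicators and derived features. Feeding all of them directly into a machine learning model tends to cause overfitting and high computational cost, and makes it hard to see which features contribute most to the predictions.  
This project evaluates a diverse set of technical indicators and statistical features with several feature importance methods (MDI, MDA, SFI) and selects the feature subset that is most representative for subsequent Meta-Labeling or other model training.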

---

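### **Process Overview**

1. **Data loading and event generation**  
   - Load the clustering results produced by Project1.  
   - Combine the raw multi-instrument price data (Dollar Bars) with the corresponding events so that each event's time span and label are available for the feature calculations that follow.

2. **Technical indicators and feature engineering**  
   - Compute a range of technical indicators (such as RSI, ATR, SMA, EMA, Bollinger Bands, and volume-based indicators) over each event's lifespan and over specified lookback windows before and after its trigger.  
   - Apply fractional differentiation to non-stationary indicators, making them stationary while preserving their memory.
   - Use PCA to remove collinearity among features and thereby reduce the substitution effect between them.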

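3. **Train / validation split**  
   - Split the event data in chronological order following Purged K-Fold Cross-Validation to avoid data leakage:  
     1. First make a coarse split by time period (for example, by year);  
     2. Within each fold, discard training events whose lifespan overlaps the test-set events (purging);  
     3. Use uniqueness to further balance the sample weights.  
   - This yields multiple Train / Validation folds for the feature importance calculations.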

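4. **Feature importance calculation**  
   - **MDI (Mean Decrease in Impurity)**:  
     - Fit a Random Forest on the training set and accumulate, for each feature, the entropy reduction achieved by the node splits in every decision tree.  
   - **MDA (Mean Decrease in Accuracy, Permutation Importance)**:  
     - On the validation set, permute the values of one feature at a time and measure how much the model's predictive score drops. The larger the drop, the more the feature contributes to model performance.  
   - **SFI (Single Feature Importance)**:  
     - Train the model on one feature alone, evaluate its classification performance on the validation set, and rank the features by predictive power.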

5. **Visualization and output**  
   - Plot a ranking bar chart for each importance method.  
   - Export the results as CSV files: `mdi.csv`, `mda.csv`, and `sfi.csv`.
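
Step 3 mentions balancing sample weights by uniqueness. The project's own weighting helper lives in the private `utils` library and is not shown, so the following is only a minimal sketch of average uniqueness, assuming each event is described by a start label (the index) and an end label (`t1`). The function name `average_uniqueness` and the toy integer clock are illustrative assumptions.

```python
import pandas as pd

def average_uniqueness(t1: pd.Series, bar_index: pd.Index) -> pd.Series:
    """Average uniqueness of each event over its lifespan (illustrative helper).

    t1: index = event start label, values = event end label (inclusive).
    bar_index: labels of all bars.
    """
    # Concurrency: how many events are alive at each bar.
    concurrency = pd.Series(0.0, index=bar_index)
    for start, end in t1.items():
        concurrency.loc[start:end] += 1.0
    # Uniqueness at a bar is 1 / concurrency; average it over the event's lifespan.
    out = pd.Series(index=t1.index, dtype=float)
    for start, end in t1.items():
        out.loc[start] = (1.0 / concurrency.loc[start:end]).mean()
    return out

bars = pd.RangeIndex(6)                      # toy bars 0..5 (integers instead of timestamps)
t1 = pd.Series([2, 3, 5], index=[0, 2, 4])   # events span [0, 2], [2, 3], [4, 5]
u = average_uniqueness(t1, bars)
print(u)
```

The first two events share bar 2, so their average uniqueness falls below 1, while the third event overlaps nothing and keeps a uniqueness of 1.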

---

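### **Key Features**

- **Purged K-Fold Cross-Validation**  
  - To prevent leakage from events that overlap the test set, each fold removes the training events that overlap the test period.  
  - Following the recommendation in *Advances in Financial Machine Learning*, events are first split by time and the overlapping ranges are then removed, so the model never peeks at future data.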

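- **Fractional Differentiation (FFD)**  
  - Conventional integer differencing can strip too much long-term memory from a series and lose potential signal. FFD offers non-integer-order differencing, with the differencing order chosen by a stepwise search that uses autocorrelation and ADF unit-root test results.  
  - Advantages of FFD:  
    1. **Preserves long memory**: the autocorrelation structure of the series is not wiped out completely;  
    2. **Stationarizes**: turns a non-stationary series into an approximately stationary one, which is what most regression and machine learning models assume of their input features;  
    3. **Reduces information loss**: retains more of the series' original variation than integer differencing.  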

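- **PCA to remove collinearity**  
  - Financial technical indicators are often highly correlated. Feeding them to the model as-is lets the substitution effect distort the importance of individual variables, because importance is shared among substitutes and the ranking becomes unreliable.  
  - PCA (Principal Component Analysis) extracts the principal components that retain more than 95% of the variance, reducing dimensionality and making the subsequent feature importance calculations more stable.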

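- **Three feature importance methods**  
  1. **MDI (Mean Decrease in Impurity)**  
     - Uses the Gini or entropy reduction at the split nodes of the decision trees, obtained while training the Random Forest, as the measure of each feature's contribution.  
  2. **MDA (Permutation Importance)**  
     - Permutes one feature at a time on the validation set and measures the drop in the model's `neg_log_loss`; the larger the drop, the more the feature contributes.  
  3. **SFI (Single Feature Importance)**  
     - Trains a single-variable model to check each feature's stand-alone predictive power, giving the most direct measure of what one feature can do on its own.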
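
The fractional differentiation described above can be illustrated with the fixed-width-window weighting scheme from *Advances in Financial Machine Learning*. This is a hedged sketch only: the project's actual FFD code sits inside the private `compute_talib_features` helper and is not shown, the names `ffd_weights` and `frac_diff_ffd` are illustrative, and the search for the differencing order (the ADF step) is left out.

```python
import numpy as np

def ffd_weights(d: float, threshold: float = 1e-4) -> np.ndarray:
    """Weights of a fixed-width-window fractional difference of order d.

    Recursion: w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k.
    Weights are dropped once |w_k| falls below `threshold`.
    Returned oldest-first so they can be dotted with a window of the series.
    """
    w = [1.0]
    k = 1
    while True:
        w_k = -w[-1] * (d - k + 1) / k
        if abs(w_k) < threshold:
            break
        w.append(w_k)
        k += 1
    return np.array(w[::-1])

def frac_diff_ffd(x: np.ndarray, d: float, threshold: float = 1e-4) -> np.ndarray:
    """Apply the weights over a rolling window; early values without a full window are NaN."""
    w = ffd_weights(d, threshold)
    width = len(w)
    out = np.full(len(x), np.nan)
    for t in range(width - 1, len(x)):
        out[t] = w @ x[t - width + 1 : t + 1]
    return out

# d = 1 collapses to the ordinary first difference (weights -1 and 1).
print(ffd_weights(1.0))
print(frac_diff_ffd(np.array([1.0, 3.0, 6.0, 10.0]), d=1.0))
# A fractional order keeps a longer tail of small weights, which is what preserves memory.
print(ffd_weights(0.5, threshold=0.1))
```

With `d` between 0 and 1 the weights decay gradually instead of cutting off after one lag, so each transformed value still carries information from older observations.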

---



### **Load Data**
- Find the cluster that contains gold (XAUUSD) and convert the data of every instrument in that cluster into dollar bars
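
Since `get_dollar_bars` comes from a private library that is not part of this portfolio, here is a rough sketch of what a dollar-bar aggregation can look like. Everything in it is an assumption for illustration: the column schema (`time`, `open`, `high`, `low`, `close`, `volume`), the `bar_size` parameter, the simple bucketing rule, and the name `dollar_bars_sketch`.

```python
import pandas as pd

def dollar_bars_sketch(df: pd.DataFrame, bar_size: float) -> pd.DataFrame:
    """Group M1 rows into bars that each hold roughly `bar_size` of traded value."""
    dollar = df["close"] * df["volume"]                  # traded value per row
    # The bar id steps up each time cumulative traded value passes a multiple of bar_size.
    bar_id = (dollar.cumsum() // bar_size).astype(int)
    bars = df.groupby(bar_id).agg(
        time=("time", "last"),
        open=("open", "first"),
        high=("high", "max"),
        low=("low", "min"),
        close=("close", "last"),
        volume=("volume", "sum"),
    )
    return bars.set_index("time")

m1 = pd.DataFrame({
    "time": pd.date_range("2024-01-01", periods=6, freq="min"),
    "open": [10.0] * 6,
    "high": [10.0, 11.0, 10.0, 12.0, 10.0, 13.0],
    "low": [9.0, 10.0, 8.0, 10.0, 7.0, 10.0],
    "close": [10.0] * 6,
    "volume": [4, 4, 4, 4, 4, 4],
})
bars = dollar_bars_sketch(m1, bar_size=100.0)
print(bars)
```

Each toy row trades a value of 40, so the six one-minute rows collapse into three bars of two rows each. Sampling by traded value rather than by clock time produces more bars when the market is active and fewer when it is quiet.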

In [None]:
import sys
import os

project_root = os.path.abspath(os.path.join("..", "QuantCommon"))
if project_root not in sys.path:
    sys.path.append(project_root)

# utils is my own library of common tools; its files are not included in this portfolio
from utils.processing import get_dollar_bars 
import numpy as np
import pandas as pd


clusters = pd.read_csv("results/clusters.csv", index_col=0)
print(f'XAUUSD is in cluster {clusters.loc["XAUUSD_M1", "cluster"]}')

# keep every instrument that shares a cluster with XAUUSD
group = clusters[clusters["cluster"] == clusters.loc["XAUUSD_M1", "cluster"]]
data = {}
for i in group.index:
    print(f"Processing {i[:-3]}...")  # strip the "_M1" suffix for display
    filepath = os.path.join(project_root, "data", "FI", "M1",f"{i}.csv")
    df = pd.read_csv(filepath, parse_dates=True)
    df['time'] = pd.to_datetime(df['time'])
    df = get_dollar_bars(df)
    data[i] = df

XAUUSD is in cluster 2


### **Labeling and Feature Computation**
- Label the data with the triple-barrier method and compute indicators as features
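
To make the triple-barrier idea concrete, the sketch below labels a single event on a toy price path. It is not the project's `get_events` / `get_bins` implementation, which is not shown. The function name `triple_barrier_label`, its parameters, and the choice to return 0 when the vertical barrier expires are all assumptions of this example.

```python
import pandas as pd

def triple_barrier_label(close: pd.Series, t0, t1, target: float,
                         pt: float = 1.0, sl: float = 1.0) -> int:
    """Label one event that starts at t0 and has its vertical barrier at t1.

    Returns +1 if the profit-taking barrier is touched first,
    -1 if the stop-loss barrier is touched first,
    and 0 if neither is touched before t1 (an assumption of this sketch).
    """
    path = close.loc[t0:t1] / close.loc[t0] - 1.0   # return path measured from entry
    upper, lower = pt * target, -sl * target
    for r in path.iloc[1:]:                          # skip the entry bar itself
        if r >= upper:
            return 1
        if r <= lower:
            return -1
    return 0

close = pd.Series([100.0, 100.5, 101.2, 99.0], index=range(4))
print(triple_barrier_label(close, t0=0, t1=3, target=0.01))  # upper barrier touched at bar 2
print(triple_barrier_label(close, t0=2, t1=3, target=0.01))  # lower barrier touched at bar 3
print(triple_barrier_label(close, t0=0, t1=1, target=0.01))  # vertical barrier expires first
```

In the notebook, `pt_sl = [1, 1]` scales both horizontal barriers by the volatility estimate `vol`, which plays the role of `target` here.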

In [3]:
from utils.metalabeling import add_vertical_barrier, get_events, get_bins
from utils.processing import apply_cusum_filter, getDailyVol, cal_weights, compute_talib_features

feats_list, labels_list, weights_list, t1_list = [], [], [], []

for symbol,df in data.items():
    print(f"Processing {symbol[:-3]}...")
    vol = getDailyVol(df["close"], span0=20)
    cusum_events  = apply_cusum_filter(df, volatility=vol).index
    vertical_barriers = add_vertical_barrier(cusum_events, df, num_days=2)
    pt_sl = [1, 1]
    min_ret = 0.003
    triple_barrier_events = get_events(close=df["close"],
                                                t_events=cusum_events,
                                                pt_sl=pt_sl,
                                                target=vol,
                                                min_ret=min_ret,
                                                num_threads=4,
                                                vertical_barrier_times=vertical_barriers,
                                                side_prediction=None)
    labels  = get_bins(triple_barrier_events, df["close"])
    weights = cal_weights(triple_barrier_events, df["close"])
    feats = compute_talib_features(df,
                               periods=[7,28,50,100],
                               apply_ffd=True)
    
    # normalize features
    for col in feats.columns:
        # run rolling.apply on each column separately: percentile rank of the latest value in a trailing 200-bar window
        feats[col] = (
            feats[col]
            .rolling(window=200, min_periods=1)
            .apply(lambda arr: (arr <= arr[-1]).sum() / len(arr), raw=True)
        )
    idx = feats.index.intersection(labels.index)
    feats = feats.loc[idx]
    labels = labels.loc[idx]["bin"]
    weights = weights.loc[idx]["weight"]
    weights = weights / weights.mean() # normalize weights
    t1 = triple_barrier_events.loc[idx]["t1"]

    feats_list.append(feats)
    labels_list.append(labels.rename("bin"))
    weights_list.append(weights.rename("weight"))
    t1_list.append(t1.rename("t1"))

    
feats = pd.concat(feats_list)
labels = pd.concat(labels_list)
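# NOTE: `weights` on the right-hand side is still the last symbol's Series from the loop above,
# so the concatenated weights are divided by that symbol's sample count.
weights = pd.concat(weights_list)/len(weights)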
t1 = pd.concat(t1_list)

combined_features = pd.concat(
    [feats, labels, weights, t1],
    axis=1
)
combined_features.sort_index(inplace=True)
combined_features.to_csv("intermediate_results/combined_features.csv", index=True)

labels = combined_features['bin']   
weights = combined_features['weight']
t1 = combined_features['t1']
feats = combined_features.drop(columns=['bin', 'weight', 't1'])

Processing AUDUSD...
Processing EURGBP...
Processing EURUSD...
Processing GBPUSD...
Processing HK50...
Processing NZDUSD...
Processing UK100...
Processing US2000...
Processing USDCAD...
Processing XAGUSD...
Processing XAUUSD...
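
The normalization loop in the cell above replaces every feature value with its percentile rank inside a trailing window. A toy version of the same `rolling(...).apply(...)` expression, with a window of 3 instead of 200 and made-up numbers, shows what the transformation does:

```python
import pandas as pd

s = pd.Series([3.0, 1.0, 2.0, 5.0, 4.0])

# Share of values in the trailing window that are <= the most recent value.
rank = s.rolling(window=3, min_periods=1).apply(
    lambda arr: (arr <= arr[-1]).sum() / len(arr), raw=True
)
print(rank.tolist())
```

The result always lies in (0, 1], and each value depends only on current and past observations, so the scaling itself does not look ahead.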


In [None]:
feats.head()

### **PCA**

In [4]:
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

# === Pipeline : z-score → PCA ===
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca",    PCA(n_components=0.95, whiten=False)),
])


X = pipe.fit_transform(feats)
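
As a small illustration of what the scaler plus `PCA(n_components=0.95)` pipeline does to collinear inputs, the sketch below runs the same two steps on synthetic data. The data and the names `toy` and `toy_pipe` are made up for this example and are not the project's features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 3))
# Six columns: three independent signals plus three near-duplicates of them.
toy = np.hstack([base, base + 0.01 * rng.normal(size=(500, 3))])

toy_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
])
Z = toy_pipe.fit_transform(toy)

pca = toy_pipe.named_steps["pca"]
print("components kept:", pca.n_components_)
print("cumulative explained variance:", pca.explained_variance_ratio_.cumsum())
# The component scores are mutually uncorrelated on the data they were fitted on.
print(np.corrcoef(Z, rowvar=False).round(6))
```

The near-duplicate columns collapse into shared components, which is the mechanism relied on here to weaken substitution effects before importance is measured.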


In [None]:
from joblib import dump
# save the fitted pipeline
dump(pipe, "models/pipeline_scaler_pca.joblib")


['models/pipeline_scaler_pca.joblib']

### **Define the Required Functions**


- PurgedKFold

In [6]:
import numpy as np
import pandas as pd

class PurgedKFold:
    def __init__(self, n_splits=3, t1=None, pct_embargo=0.0):
        if not isinstance(t1, pd.Series):
            raise ValueError("t1 must be a pandas Series")
        self.n_splits = n_splits
        self.t1 = t1.sort_index()
        self.pct_embargo = pct_embargo

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

    def split(self, X, y=None, groups=None):
        if not X.index.equals(self.t1.index):
            raise ValueError("X and t1 must have the same index")
        n_samples = len(X)
        indices = np.arange(n_samples)
        # divide indices into contiguous chunks
        test_slices = np.array_split(indices, self.n_splits)
        mbrg = int(n_samples * self.pct_embargo)

        for slice_ in test_slices:
            i, j = slice_[0], slice_[-1] + 1
            test_idx = indices[i:j]

            # start‐time of test block
            t0 = self.t1.index[i]
            # end‐time of test block
            t1_max = self.t1.iloc[test_idx].max()
            # find the position just after t1_max
            max_t1_pos = self.t1.index.searchsorted(t1_max)

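            # training before test block: keep only events that END at or
            # before t0, so events still alive inside the test block are purged
            train_before = indices[(self.t1 <= t0).to_numpy()]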
            # training after test + embargo
            train_after = indices[max_t1_pos + mbrg :]

            train_idx = np.concatenate([train_before, train_after])
            yield train_idx, test_idx
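
The purging rule from the process overview (drop training events whose lifespan intersects the test block) can be traced by hand on a toy example. The sketch below is a standalone illustration on an integer clock, not a call into the class above. The variable names and the one-sample embargo are assumptions for this example.

```python
import numpy as np
import pandas as pd

# Ten toy events: event k starts at time k and ends at time k + 2.
starts = np.arange(10)
t1_toy = pd.Series(starts + 2, index=starts)

test_idx = np.arange(4, 7)                  # test block: events 4, 5, 6
t0 = t1_toy.index[test_idx[0]]              # first start time in the test block
t1_max = t1_toy.iloc[test_idx].max()        # last end time in the test block

# Left side: keep events that have already ended when the test block starts.
before = np.flatnonzero((t1_toy <= t0).to_numpy())
# Right side: keep events that start once the last test event has ended, after an embargo.
first_after = t1_toy.index.searchsorted(t1_max)
embargo = 1
after = np.arange(first_after + embargo, len(t1_toy))

train_idx = np.concatenate([before, after])
print(train_idx)  # [0 1 2 9]
```

Event 3 starts before the test block but ends inside it, and event 7 starts before the last test event has ended, so both are purged. Event 8 is removed by the embargo.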


- CVscore
    - sklearn's built-in CV scoring handles sample weights inconsistently between fitting and scoring, so a custom function is needed

In [None]:
import numpy as np
from sklearn.base import clone
from sklearn.metrics import log_loss, accuracy_score

def cv_score(clf,
             X,
             y,
             sample_weight=None,
             scoring="neg_log_loss",
             t1=None,
             cv=3,
             pct_embargo=0.01):

    if scoring not in ["neg_log_loss", "accuracy"]:
        raise ValueError('scoring must be "neg_log_loss" or "accuracy"')

    pkf = PurgedKFold(n_splits=cv, t1=t1, pct_embargo=pct_embargo)
    scores = []

    for train_idx, test_idx in pkf.split(X):
        # clone a fresh, unfitted copy of the model
        model = clone(clf)
        # fit
        if sample_weight is None:
            model.fit(X.iloc[train_idx], y.iloc[train_idx])
        else:
            model.fit(X.iloc[train_idx],
                      y.iloc[train_idx],
                      sample_weight=sample_weight.iloc[train_idx].values)
        # predict + score
        if scoring == "neg_log_loss":
            prob = model.predict_proba(X.iloc[test_idx])
            sc = -log_loss(y.iloc[test_idx],
                           prob,
                           sample_weight=(None if sample_weight is None else sample_weight.iloc[test_idx].values),
                           labels=model.classes_)
        else:
            pred = model.predict(X.iloc[test_idx])
            sc = accuracy_score(y.iloc[test_idx],
                                pred,
                                sample_weight=(None if sample_weight is None else sample_weight.iloc[test_idx].values))
        scores.append(sc)
    return np.array(scores)





- MDA MDI SFI 

In [None]:
from tqdm import tqdm

# 1) MDI 
def feat_imp_mdi(fit, feat_names):
    # collect feature_importances_ from every tree
    df0 = pd.DataFrame(
        [tree.feature_importances_ for tree in fit.estimators_],
        columns=feat_names
    ).replace(0, np.nan)  # with max_features=1, some trees report 0 for a feature
    imp = pd.concat({
        "mean": df0.mean(),
        "std" : df0.std() * df0.shape[0]**-0.5
    }, axis=1)
    # normalize to sum=1
    imp["mean"] /= imp["mean"].sum()
    imp.sort_values(by="mean", ascending=False, inplace=True)
    return imp


# 2) MDA: 
def feat_imp_mda(clf,
                 X,
                 y,
                 sample_weight=None,
                 t1=None,
                 cv: int = 5,
                 pct_embargo: float = 0.01,
                 scoring: str = "neg_log_loss"
                ) -> pd.DataFrame:
    # --- 1) numpy → pandas ---
    if isinstance(X, np.ndarray):
        X = pd.DataFrame(X)
    if not isinstance(y, pd.Series):
        y = pd.Series(y, index=X.index)
    if sample_weight is not None and not isinstance(sample_weight, pd.Series):
        sample_weight = pd.Series(sample_weight, index=X.index)
    if t1 is not None and not isinstance(t1, pd.Series):
        t1 = pd.Series(t1, index=X.index)

    feat_names = list(X.columns)

    # --- 2) baseline score ---
    base_scores = cv_score(clf, X, y,
                           sample_weight=sample_weight,
                           scoring=scoring,
                           t1=t1,
                           cv=cv,
                           pct_embargo=pct_embargo)
    base_mean = base_scores.mean()

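    # --- 3) permute each feature, with a progress bar ---
    # Note: the column is shuffled over the whole sample, so cv_score refits on the permuted data.
    diffs = []
    for col in tqdm(feat_names, desc="MDA permuting features"):
        Xp = X.copy()
        # assign a shuffled copy; shuffling `.values` in place may act on a temporary array
        Xp[col] = np.random.permutation(Xp[col].to_numpy())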
        perm_scores = cv_score(clf, Xp, y,
                               sample_weight=sample_weight,
                               scoring=scoring,
                               t1=t1,
                               cv=cv,
                               pct_embargo=pct_embargo)
        diffs.append(base_scores - perm_scores)

    diffs = np.vstack(diffs)
    imp_df = pd.DataFrame({
        "mean": diffs.mean(axis=1),
        "std" : diffs.std(axis=1) * diffs.shape[1]**-0.5
    }, index=feat_names)
    imp_df.sort_values(by="mean", ascending=False, inplace=True)
    return imp_df

# 3) SFI
def SFI(feat_names: list,
                 clf,
                 X: pd.DataFrame,
                 y: pd.Series,
                 sample_weight=None,
                 t1=None,
                 cv: int = 5,
                 pct_embargo: float = 0.01,
                 scoring: str = "neg_log_loss"
                ) -> pd.DataFrame:
    if isinstance(X, np.ndarray):
        X = pd.DataFrame(X, columns=feat_names)
    if not isinstance(y, pd.Series):
        y = pd.Series(y, index=X.index)
    if sample_weight is not None and not isinstance(sample_weight, pd.Series):
        sample_weight = pd.Series(sample_weight, index=X.index)
    if t1 is not None and not isinstance(t1, pd.Series):
        t1 = pd.Series(t1, index=X.index)

    imp = pd.DataFrame(columns=["mean", "std"], dtype=float)
    for featName in feat_names:
        # score a model trained on this single feature only
        dfo = cv_score(clf, X=X[[featName]], y=y,
                       sample_weight=sample_weight,
                       scoring=scoring, t1=t1, cv=cv,
                       pct_embargo=pct_embargo)
        imp.loc[featName, "mean"] = dfo.mean()
        imp.loc[featName, "std"] = dfo.std() * dfo.shape[0]**-0.5
    # sort once, after every feature has been scored
    imp.sort_values(by="mean", ascending=False, inplace=True)
    return imp
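
The process overview describes MDA as permuting a feature on the validation set. For comparison, here is a hedged sketch of that variant on a single chronological split: the model is fitted once on unpermuted training data, and only the validation column is shuffled. The synthetic data, the variable names, and the use of `LogisticRegression` as a fast stand-in for the bagged trees are all assumptions of this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(42)
n = 600
X_toy = rng.normal(size=(n, 2))        # column 0 is informative, column 1 is pure noise
y_toy = (X_toy[:, 0] + 0.3 * rng.normal(size=n) > 0).astype(int)

# Chronological split: the first 400 rows train, the last 200 validate.
X_tr, X_va = X_toy[:400], X_toy[400:]
y_tr, y_va = y_toy[:400], y_toy[400:]

model = LogisticRegression().fit(X_tr, y_tr)   # fitted once, on unpermuted data
base = -log_loss(y_va, model.predict_proba(X_va), labels=model.classes_)

importance = []
for j in range(X_va.shape[1]):
    X_perm = X_va.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])   # shuffle one validation column only
    perm = -log_loss(y_va, model.predict_proba(X_perm), labels=model.classes_)
    importance.append(base - perm)                 # drop in neg_log_loss

print(importance)
```

Shuffling the informative column degrades the score sharply, while shuffling the noise column barely moves it. Fitting once per fold is also much cheaper than refitting for every permuted feature.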


### **Build the Model and Compute Feature Importance**

In [None]:
col = [f"PCA_{i}" for i in range(X.shape[1])]
X = pd.DataFrame(X, columns= col, index=feats.index)
y = labels.values
weights = weights.values
# t1 was already defined above

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
avgU = weights.mean()
tree = DecisionTreeClassifier(criterion="entropy", max_features="sqrt", class_weight="balanced")
RF = BaggingClassifier(estimator=tree, n_estimators=1000, max_samples=avgU)

In [None]:
import matplotlib.pyplot as plt


# 1. MDI
RF_fit = RF.fit(X, y, sample_weight=weights)
mdi_imp = feat_imp_mdi(RF_fit, col)
print("Mean Decrease in Impurity (MDI):")
print(mdi_imp)
mdi_imp.to_csv("results/mdi.csv")

mdi_imp = pd.read_csv("results/mdi.csv", index_col=0)
mdi_sorted = mdi_imp.sort_values(by="mean", ascending=False)

plt.figure(figsize=(12, 6))
plt.bar(mdi_sorted.index, mdi_sorted["mean"], color="steelblue")
plt.xticks(rotation=45)
plt.ylabel("MDI Importance")
plt.title("Mean Decrease in Impurity (MDI) Feature Importances")
plt.tight_layout()


plt.show()

Mean Decrease in Impurity (MDI):
            mean       std
PCA_28  0.036056  0.009382
PCA_17  0.035991  0.009206
PCA_9   0.034368  0.008391
PCA_13  0.033959  0.009764
PCA_29  0.033876  0.008070
PCA_4   0.033355  0.008488
PCA_10  0.032779  0.008886
PCA_15  0.032716  0.008518
PCA_19  0.032278  0.008728
PCA_26  0.032180  0.007766
PCA_24  0.032159  0.008717
PCA_23  0.032131  0.008056
PCA_30  0.031801  0.008355
PCA_20  0.031785  0.008110
PCA_22  0.031701  0.007981
PCA_21  0.031630  0.008596
PCA_12  0.031346  0.008854
PCA_18  0.031295  0.009183
PCA_14  0.031191  0.009223
PCA_6   0.031179  0.009147
PCA_27  0.031099  0.007842
PCA_16  0.030587  0.009740
PCA_11  0.029985  0.008939
PCA_2   0.029756  0.008952
PCA_7   0.029366  0.009330
PCA_0   0.028944  0.007892
PCA_3   0.028658  0.008977
PCA_8   0.028592  0.009641
PCA_1   0.028241  0.009165
PCA_25  0.027978  0.008794
PCA_31  0.026762  0.008687
PCA_5   0.026258  0.008511


In [None]:
# 2. MDA
mda_imp = feat_imp_mda(
    RF, X, y, cv=5,
    sample_weight=weights,
    t1=t1, pct_embargo=0.01,
    scoring="neg_log_loss"
)
print("Mean Decrease Accuracy (MDA):")
print(mda_imp)
mda_imp.to_csv("results/mda.csv")



mda_imp = pd.read_csv("results/mda.csv", index_col=0)

mda_sorted = mda_imp.sort_values("mean", ascending=False)

# draw the bar chart with standard-deviation error bars
plt.figure(figsize=(12, 6))
plt.bar(
    mda_sorted.index,
    mda_sorted["mean"],
    yerr=mda_sorted["std"],
    capsize=3,
)
plt.xticks(rotation=45)
plt.ylabel("Permutation Importance (Mean Decrease in Accuracy)")
plt.title("Mean Decrease in Accuracy (MDA) Feature Importances")
plt.tight_layout()

plt.show()

MDA permuting features: 100%|██████████| 32/32 [22:50<00:00, 42.81s/it]

Mean Decrease Accuracy (MDA):
            mean       std
PCA_18  0.000565  0.000259
PCA_31  0.000479  0.000418
PCA_10  0.000292  0.000242
PCA_14  0.000275  0.000510
PCA_15  0.000268  0.000549
PCA_25  0.000225  0.000216
PCA_9   0.000188  0.000427
PCA_1   0.000181  0.000448
PCA_8   0.000137  0.000617
PCA_21  0.000131  0.000271
PCA_27  0.000124  0.000252
PCA_22  0.000067  0.000359
PCA_28  0.000032  0.000280
PCA_2   0.000018  0.000471
PCA_29 -0.000027  0.000217
PCA_4  -0.000029  0.000239
PCA_11 -0.000030  0.000272
PCA_19 -0.000031  0.000215
PCA_6  -0.000052  0.000260
PCA_30 -0.000089  0.000168
PCA_13 -0.000120  0.000317
PCA_0  -0.000144  0.000479
PCA_24 -0.000165  0.000228
PCA_5  -0.000169  0.000291
PCA_3  -0.000179  0.000312
PCA_17 -0.000247  0.000354
PCA_12 -0.000263  0.000116
PCA_23 -0.000286  0.000434
PCA_20 -0.000317  0.000305
PCA_7  -0.000360  0.000171
PCA_26 -0.000537  0.000521
PCA_16 -0.000571  0.000220





In [None]:
# 3. SFI
sfi_imp = SFI(
    X.columns, RF, X, y, 
    scoring="neg_log_loss", 
    sample_weight=weights , 
    cv=5, t1 = t1, pct_embargo=0.01)
print("Single Factor Importance (SFI):")
print(sfi_imp)
sfi_imp.to_csv("results/sfi.csv")

sfi_imp = pd.read_csv("results/sfi.csv", index_col=0)


sfi_sorted = sfi_imp.sort_values("mean", ascending=False)

plt.figure(figsize=(12, 6))
plt.bar(
    sfi_sorted.index,
    sfi_sorted["mean"],
    yerr=sfi_sorted["std"],  # error bars to match the "Mean ± Std" axis label
    capsize=3,
)
plt.xticks(rotation=45)
plt.ylabel("Single Feature Importance (Mean ± Std)")
plt.title("Single Factor Importance (SFI) Feature Importances")
plt.tight_layout()

plt.show()

Single Factor Importance (SFI):
            mean       std
PCA_15 -0.693138  0.000212
PCA_20 -0.693355  0.000139
PCA_29 -0.693374  0.000139
PCA_1  -0.693443  0.000193
PCA_22  -0.69352  0.000294
PCA_11 -0.693526  0.000148
PCA_9  -0.693595  0.000118
PCA_7  -0.693615  0.000272
PCA_10 -0.693628  0.000362
PCA_3  -0.693682   0.00013
PCA_18 -0.693729  0.000156
PCA_4  -0.693793  0.000112
PCA_8  -0.693797  0.000397
PCA_5  -0.693808  0.000164
PCA_13 -0.693848  0.000301
PCA_19 -0.693925  0.000161
PCA_17 -0.693972  0.000053
PCA_28 -0.694027   0.00017
PCA_14 -0.694036  0.000436
PCA_16 -0.694053  0.000229
PCA_2  -0.694056  0.000142
PCA_6  -0.694077  0.000342
PCA_31 -0.694181  0.000211
PCA_26 -0.694219   0.00024
PCA_27 -0.694267  0.000175
PCA_23 -0.694285  0.000205
PCA_24 -0.694302  0.000448
PCA_25 -0.694311  0.000181
PCA_0  -0.694333  0.000389
PCA_21 -0.694346  0.000235
PCA_30 -0.694497  0.000284
PCA_12  -0.69463  0.000428
