## Traditional Machine Learning – Decision Tree

1. Load the data
2. Manual features with a decision tree
3. Extract features with tsfresh
4. Compare with the decision tree
5. Save the results

In [3]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from statsmodels.iolib.summary import summary

from tsfresh import extract_features
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, balanced_accuracy_score
from sklearn.utils.class_weight import compute_class_weight

# global random seeds
SEED_1 = 42
SEED_2 = 43
SEED_3 = 44

SEEDS = [SEED_1, SEED_2, SEED_3]


### 1. Load data

In [4]:
# 1. Data Loading
from src.data_loader import load_sensor_data, load_target

sensor_data = load_sensor_data()
y = load_target()
y.head()

0    100
1    100
2    100
3    100
4    100
Name: 1, dtype: int64

### 2. Sensor selection
- 2.1 Baseline: all 17 sensors, each with 9 fixed statistical features
- 2.2 Sensor ablation experiment
    - remove one sensor at a time and retrain the decision tree
    - compare the results and identify the most important sensor (PS2)
- 2.3 Cumulative sensor selection: pairwise comparison with PS2
    - build sensor pairs: PS2 + each other sensor
    - compare the results and identify the best combination (PS2 + TS2)
    - significance test for PS2 vs. PS2 + TS2
- 2.4 Cumulative sensor selection with PS2 and TS2
    - PS2 + TS2 + each other sensor
    - compare the results
- 2.5 Conclusion:
    - which sensors to keep for the subsequent tsfresh feature extraction

#### 2.1 Baseline with manual features and a decision tree
- all 17 sensors
- 9 fixed descriptive-statistics features per sensor
- decision tree evaluated on 3 train/test splits
- save the results
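
The baseline cell further down passes `class_weight="balanced"` weights to the tree. As a rough sketch of what that heuristic computes, each class receives the weight `n_samples / (n_classes * class_count)`. The class counts used here (360, 360, 360, 1125) are an assumption derived from the 2205 samples and the class proportions printed later in this notebook, and `balanced_class_weights` is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Assumed class counts (not loaded from the real data set).
labels = [73] * 360 + [80] * 360 + [90] * 360 + [100] * 1125
weights = balanced_class_weights(labels)

for cls in sorted(weights):
    print(cls, round(weights[cls], 4))
```

Under these assumed counts the three minority classes get a weight above 1 and the majority class (100) a weight below 1, so misclassifying a rare class costs the tree more during training.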

In [5]:
# Processing pipeline for manual feature extraction (baseline: window=1, i.e. no smoothing)
def smooth_rolling_rowwise(df: pd.DataFrame, window: int = 1) -> pd.DataFrame:
    if window <= 1:
        return df
    # Rolling mean along the time axis (columns). `rolling(axis=1)` is deprecated
    # in recent pandas versions, so transpose, roll over rows, and transpose back.
    return df.T.rolling(window=window, min_periods=1).mean().T

In [6]:
# simple descriptive-statistics feature extraction
def extract_features_from_sensor(df: pd.DataFrame, prefix: str) -> pd.DataFrame:
    """
    df: shape (n_samples, n_timesteps)
    return: shape (n_samples, n_features)
    """
    x = df.to_numpy(dtype=float)
    n, t = x.shape

    # descriptive statistics
    mean = x.mean(axis=1)
    std = x.std(axis=1)
    vmin = x.min(axis=1)
    vmax = x.max(axis=1)
    rms = np.sqrt(np.mean(x**2, axis=1))

    # slope of the straight line from the first to the last point (endpoint slope, not a regression fit)
    if t > 1:
        slope = (x[:, -1] - x[:, 0]) / (t - 1)
    else:
        slope = np.zeros(n)

    # 25th, 50th, 75th percentiles
    q25 = np.quantile(x, 0.25, axis=1)
    q50 = np.quantile(x, 0.50, axis=1)
    q75 = np.quantile(x, 0.75, axis=1)

    # 9 features per sensor in total
    return pd.DataFrame({
        f"{prefix}__mean": mean,
        f"{prefix}__std": std,
        f"{prefix}__min": vmin,
        f"{prefix}__max": vmax,
        f"{prefix}__rms": rms,
        f"{prefix}__slope": slope,
        f"{prefix}__q25": q25,
        f"{prefix}__q50": q50,
        f"{prefix}__q75": q75,
    })
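
The `slope` feature above is the endpoint slope, `(last - first) / (t - 1)`, which ignores every sample between the first and the last. If a trend estimate that uses all points is wanted, an ordinary least-squares slope over the time index is one option. The helper names below are hypothetical, and this is a dependency-free sketch rather than a drop-in replacement for the feature extractor.

```python
def endpoint_slope(series):
    """Slope of the straight line through the first and last sample."""
    t = len(series)
    if t < 2:
        return 0.0
    return (series[-1] - series[0]) / (t - 1)

def least_squares_slope(series):
    """Ordinary least-squares slope of the series against the index 0..t-1."""
    t = len(series)
    if t < 2:
        return 0.0
    x_mean = (t - 1) / 2
    y_mean = sum(series) / t
    cov = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(series))
    var = sum((i - x_mean) ** 2 for i in range(t))
    return cov / var

# A flat series with a jump in the last sample: the two estimates disagree.
jump = [0.0, 0.0, 0.0, 0.0, 4.0]
print(endpoint_slope(jump), least_squares_slope(jump))

# A perfect ramp: both estimates agree.
ramp = [0.0, 2.0, 4.0, 6.0]
print(endpoint_slope(ramp), least_squares_slope(ramp))
```

The endpoint version is cheaper and matches what the notebook computes; the least-squares version is less sensitive to a single noisy first or last reading.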

In [7]:
# Feature Extraction Pipeline

SMOOTH_WINDOW = 1  # no smoothing for now; to experiment, try 5, 11, 21, etc.
feature_blocks = []
for sensor_name, df in sensor_data.items():
    df_sp = smooth_rolling_rowwise(df, window=SMOOTH_WINDOW)
    feats = extract_features_from_sensor(df_sp, prefix=sensor_name)
    feature_blocks.append(feats)

X = pd.concat(feature_blocks, axis=1)

print("X shape:", X.shape)    # (2205, n_features)
print("y shape:", y.shape)    # (2205,)

X shape: (2205, 153)
y shape: (2205,)


In [64]:
# Baseline: decision tree with manual features and 3 splits
from sklearn.metrics import f1_score

def run_decision_tree_3splits(X, y, seeds):
    results = []

    classes = np.unique(y)
    class_weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
    class_weight_dict = {c: w for c, w in zip(classes, class_weights)}

    n_splits = len(seeds)

    for k, rs in enumerate(seeds):
        print(f"\n--- Split {k+1}/{n_splits} | random_seed={rs} ---")

        X_train, X_test, y_train, y_test = train_test_split(
            X, y,
            test_size=0.2,
            random_state=rs,
            stratify=y
        )

        clf = DecisionTreeClassifier(
            random_state=rs,
            class_weight=class_weight_dict,
            max_depth=None,      # baseline
            min_samples_leaf=1
        )

        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)

        bal_acc = balanced_accuracy_score(y_test, y_pred)

        f1_macro = f1_score(y_test, y_pred, average="macro")
        f1_weighted = f1_score(y_test, y_pred, average="weighted")

        results.append({
            "split": k + 1,
            "random_seed": rs,
            "balanced_accuracy": bal_acc,
            "f1_macro": f1_macro,
            "f1_weighted": f1_weighted
        })


        print("Balanced Acc:", round(bal_acc, 4))
        print(classification_report(y_test, y_pred, digits=4, zero_division=0))

    # Per-split results
    results_df = pd.DataFrame(results)

    print("\nPer-split results:")
    print(results_df)

    # Summary over the three splits
    summary_df = pd.DataFrame({
        "balanced_accuracy_mean": [results_df["balanced_accuracy"].mean()],
        "balanced_accuracy_std": [results_df["balanced_accuracy"].std(ddof=1)],
        "f1_macro_mean":          [results_df["f1_macro"].mean()],
        "f1_macro_std":           [results_df["f1_macro"].std(ddof=1)],
        "f1_weighted_mean":       [results_df["f1_weighted"].mean()],
        "f1_weighted_std":        [results_df["f1_weighted"].std(ddof=1)],
    })

    print("\nSummary over 3 splits:")
    print(summary_df)

    return results_df, summary_df

baseline_results_df, baseline_summary_df = run_decision_tree_3splits(X, y, SEEDS)
# baseline_mean = float(baseline_summary_df["balanced_accuracy_mean"].values[0])


--- Split 1/3 | random_seed=42 ---
Balanced Acc: 0.9164
              precision    recall  f1-score   support

          73     0.9452    0.9583    0.9517        72
          80     0.8590    0.9306    0.8933        72
          90     0.8267    0.8611    0.8435        72
         100     0.9581    0.9156    0.9364       225

    accuracy                         0.9161       441
   macro avg     0.8972    0.9164    0.9062       441
weighted avg     0.9184    0.9161    0.9167       441


--- Split 2/3 | random_seed=43 ---
Balanced Acc: 0.9022
              precision    recall  f1-score   support

          73     0.9333    0.9722    0.9524        72
          80     0.8472    0.8472    0.8472        72
          90     0.8356    0.8472    0.8414        72
         100     0.9593    0.9422    0.9507       225

    accuracy                         0.9161       441
   macro avg     0.8939    0.9022    0.8979       441
weighted avg     0.9166    0.9161    0.9162       441


--- Split 3/3 |
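
Balanced accuracy is the unweighted mean of the per-class recalls, so the majority class (100) does not dominate the score. As a quick check, the recall column of the split 1 report above reproduces the printed value. The recalls are copied from that report and are rounded to four digits, so the result only matches to that precision.

```python
# Per-class recall values copied from the split 1 classification report.
recall_per_class = {73: 0.9583, 80: 0.9306, 90: 0.8611, 100: 0.9156}

# Balanced accuracy = mean of the per-class recalls (macro-averaged recall).
balanced_accuracy = sum(recall_per_class.values()) / len(recall_per_class)
print(round(balanced_accuracy, 4))
```

This is also why the `macro avg` recall in each report equals the printed `Balanced Acc` value.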

#### 2.2 Sensor ablation experiment
- Remove one sensor at a time and retrain the decision tree
- Compare the results and identify the most important sensor (PS2)
- Save the results
- Draw conclusions

In [9]:
# Helper that removes one sensor
def drop_sensor_features(X: pd.DataFrame, sensor_name: str) -> pd.DataFrame:
    """
    Drop all features of the given sensor from the DataFrame X.
    Parameters:
        X: pd.DataFrame, shape (n_samples, n_features)
        sensor_name: str, e.g. "PS1"
    Returns:
        pd.DataFrame, shape (n_samples, n_features - n_sensor_features)
    """
    cols_to_drop = [c for c in X.columns if c.startswith(f"{sensor_name}__")]
    return X.drop(columns=cols_to_drop)


In [65]:
def run_sensor_ablation_study(
    X_full: pd.DataFrame,
    y: pd.Series,
    sensors: list,
    seeds: list,
    baseline_mean: float
):
    rows = []

    rows.append({
        "removed_sensor": "None (baseline)",
        "mean_bal_acc": baseline_mean,
        "delta_vs_baseline": 0.0
    })

    for sensor in sensors:
        print("\n" + "="*60)
        print(f"Ablation: remove sensor {sensor}")
        print("="*60)

        X_reduced = drop_sensor_features(X_full, sensor)

        results_df, summary_df = run_decision_tree_3splits(
            X_reduced, y, seeds
        )

        mean_acc = summary_df["balanced_accuracy_mean"].iloc[0]
        std_acc  = summary_df["balanced_accuracy_std"].iloc[0]

        rows.append({
            "removed_sensor": sensor,
            "mean_bal_acc": mean_acc,
            "std_bal_acc": std_acc,
            "delta_vs_baseline": mean_acc - baseline_mean
        })

    ablation_df = pd.DataFrame(rows)
    ablation_df = ablation_df.sort_values(
        by="delta_vs_baseline",
        ascending=True
    ).reset_index(drop=True)

    return ablation_df


In [11]:
baseline_mean = baseline_summary_df["balanced_accuracy_mean"].iloc[0]

# 2) Ablation Study
ALL_SENSORS = list(sensor_data.keys())

ablation_df = run_sensor_ablation_study(
    X_full=X,
    y=y,
    sensors=ALL_SENSORS,
    seeds=SEEDS,
    baseline_mean=baseline_mean
)

ablation_sorted_perf = ablation_df.sort_values(
    by="mean_bal_acc",
    ascending=False
).reset_index(drop=True)

print("\n")
print("Ablation experiment results for the decision tree (sorted by mean balanced accuracy):")
print("one sensor removed at a time, decision tree retrained each time")
ablation_sorted_perf


Ablation: remove sensor PS1

--- Split 1/3 | random_seed=42 ---
Balanced Acc: 0.9149
              precision    recall  f1-score   support

          73     0.9189    0.9444    0.9315        72
          80     0.8841    0.8472    0.8652        72
          90     0.8462    0.9167    0.8800        72
         100     0.9727    0.9511    0.9618       225

    accuracy                         0.9274       441
   macro avg     0.9055    0.9149    0.9096       441
weighted avg     0.9288    0.9274    0.9277       441


--- Split 2/3 | random_seed=43 ---
Balanced Acc: 0.8883
              precision    recall  f1-score   support

          73     0.9444    0.9444    0.9444        72
          80     0.8923    0.8056    0.8467        72
          90     0.7949    0.8611    0.8267        72
         100     0.9381    0.9422    0.9401       225

    accuracy                         0.9070       441
   macro avg     0.8924    0.8883    0.8895       441
weighted avg     0.9083    0.9070    0.907

| | removed_sensor | mean_bal_acc | delta_vs_baseline | std_bal_acc |
|---|---|---|---|---|
| 0 | FS1 | 0.936759 | 0.02713 | 0.02991 |
| 1 | PS4 | 0.914259 | 0.00463 | 0.003072 |
| 2 | EPS1 | 0.911111 | 0.001481 | 0.007641 |
| 3 | PS5 | 0.909954 | 0.000324 | 0.019872 |
| 4 | None (baseline) | 0.90963 | 0.0 | |
| 5 | VS1 | 0.90963 | 0.0 | 0.007882 |
| 6 | CP | 0.908102 | -0.001528 | 0.005932 |
| 7 | TS4 | 0.905694 | -0.003935 | 0.006324 |
| 8 | SE | 0.905185 | -0.004444 | 0.02047 |
| 9 | TS2 | 0.902731 | -0.006898 | 0.003451 |


##### Conclusion of the sensor ablation experiment
- FS1, PS4, EPS1 and PS5 should be removed, since removing each of them improves the mean balanced accuracy over the baseline.
- VS1 should also be removed, since the performance without it equals the baseline.
- The remaining sensors appear to matter, since removing any of them lowers the performance; that still leaves too many sensors, so further investigation is needed.
- PS2 is the most important sensor.
- PS3 and PS6 are also important, but they are highly correlated with each other.
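
The remark that PS3 and PS6 are highly correlated can be checked with a Pearson correlation between matching features of the two sensors. The sketch below uses made-up values for two feature columns (hypothetical stand-ins for `PS3__mean` and `PS6__mean`); with the real feature matrix, the same check could be run on the corresponding columns of `X`.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Invented example values, not taken from the real sensors.
ps3_mean = [1.0, 2.0, 3.0, 4.0, 5.0]
ps6_mean = [2.1, 3.9, 6.2, 8.0, 9.8]

r = pearson(ps3_mean, ps6_mean)
print(round(r, 3))
```

A coefficient close to 1 or -1 suggests that the two sensors carry largely redundant information, so keeping both is unlikely to help a tree much.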

#### 2.4 Cumulative ablation experiment
- Based on the single-sensor ablation results, the sensors are ranked by importance.
- From this ranking we pick the most important sensors: PS2, PS3, PS6, TS3, CE, FS2, PS1, TS1, TS2, SE, TS4, CP.
- PS2 is the most important sensor and is always kept.
- We then compare the performance of sensor pairs in which one sensor is PS2.

##### 2.4.1 Pairwise sensor comparison with PS2
- Compare PS2 + each other sensor.

In [12]:
def keep_only_sensors(X: pd.DataFrame, sensors: list[str]) -> pd.DataFrame:
    keep_cols = []
    for s in sensors:
        keep_cols.extend([c for c in X.columns if c.startswith(f"{s}__")])
    return X[keep_cols]


In [66]:
def run_pair_search_with_ps2(X_full, y, seeds, candidates):
    rows = []

    # PS2-only baseline
    X_ps2 = keep_only_sensors(X_full, ["PS2"])
    _, summary_ps2 = run_decision_tree_3splits(X_ps2, y, seeds)
    ps2_mean = summary_ps2["balanced_accuracy_mean"].iloc[0]
    ps2_std  = summary_ps2["balanced_accuracy_std"].iloc[0]

    rows.append({
        "sensors_used": "PS2 only",
        "mean_bal_acc": ps2_mean,
        "std_bal_acc": ps2_std,
        "delta_vs_ps2_only": 0.0
    })

    # PS2 + one candidate
    for s in candidates:
        if s == "PS2":
            continue

        X_pair = keep_only_sensors(X_full, ["PS2", s])
        _, summary = run_decision_tree_3splits(X_pair, y, seeds)
        mean_acc = summary["balanced_accuracy_mean"].iloc[0]
        std_acc  = summary["balanced_accuracy_std"].iloc[0]

        rows.append({
            "sensors_used": f"PS2 + {s}",
            "mean_bal_acc": mean_acc,
            "std_bal_acc": std_acc,
            "delta_vs_ps2_only": mean_acc - ps2_mean
        })

    df = pd.DataFrame(rows).sort_values("mean_bal_acc", ascending=False).reset_index(drop=True)
    return df


In [14]:
CANDIDATES = ["PS3", "PS6", "TS3", "CE", "FS2", "PS1", "TS2", "TS1"]
pair_df = run_pair_search_with_ps2(X, y, SEEDS, CANDIDATES)
pair_df


--- Split 1/3 | random_seed=42 ---
Balanced Acc: 0.9572
              precision    recall  f1-score   support

          73     0.9730    1.0000    0.9863        72
          80     0.9848    0.9028    0.9420        72
          90     0.9178    0.9306    0.9241        72
         100     0.9825    0.9956    0.9890       225

    accuracy                         0.9705       441
   macro avg     0.9645    0.9572    0.9604       441
weighted avg     0.9707    0.9705    0.9703       441


--- Split 2/3 | random_seed=43 ---
Balanced Acc: 0.9851
              precision    recall  f1-score   support

          73     0.9863    1.0000    0.9931        72
          80     0.9726    0.9861    0.9793        72
          90     0.9589    0.9722    0.9655        72
         100     0.9955    0.9822    0.9888       225

    accuracy                         0.9841       441
   macro avg     0.9783    0.9851    0.9817       441
weighted avg     0.9843    0.9841    0.9842       441


--- Split 3/3 |

| | sensors_used | mean_bal_acc | std_bal_acc | delta_vs_ps2_only |
|---|---|---|---|---|
| 0 | PS2 + TS2 | 0.970463 | 0.009719 | 4.6e-05 |
| 1 | PS2 only | 0.970417 | 0.014021 | 0.0 |
| 2 | PS2 + PS6 | 0.968657 | 0.014478 | -0.001759 |
| 3 | PS2 + CE | 0.967361 | 0.008438 | -0.003056 |
| 4 | PS2 + FS2 | 0.967361 | 0.011684 | -0.003056 |
| 5 | PS2 + PS3 | 0.965093 | 0.017973 | -0.005324 |
| 6 | PS2 + PS1 | 0.964352 | 0.013236 | -0.006065 |
| 7 | PS2 + TS3 | 0.962824 | 0.015356 | -0.007593 |
| 8 | PS2 + TS1 | 0.962454 | 0.012794 | -0.007963 |


##### 2.4.2 Significance test for PS2 vs. PS2 + TS2
- PS2 + TS2 performs best in the pairwise comparison with PS2.
- Test whether the improvement is significant.
- Null hypothesis: the performance of PS2 + TS2 equals that of PS2 alone.
- Alternative hypothesis: the performance of PS2 + TS2 differs from that of PS2 alone.
- Significance level: α = 0.05

In [67]:
# Prepare the data for the hypothesis test

# PS2-only
res_ps2_df, sum_ps2_df = run_decision_tree_3splits(
    keep_only_sensors(X, ["PS2"]), y, SEEDS
)

a = res_ps2_df["balanced_accuracy"].to_numpy()
print("PS2-only per split:", a)

# PS2+TS2
res_pair_df, sum_pair_df = run_decision_tree_3splits(
    keep_only_sensors(X, ["PS2", "TS2"]), y, SEEDS
)

b = res_pair_df["balanced_accuracy"].to_numpy()

print("PS2+TS2 per split:", b)
print("diff mean:", b.mean() - a.mean())
print("diff per split:", b - a)



--- Split 1/3 | random_seed=42 ---
Balanced Acc: 0.9572
              precision    recall  f1-score   support

          73     0.9730    1.0000    0.9863        72
          80     0.9848    0.9028    0.9420        72
          90     0.9178    0.9306    0.9241        72
         100     0.9825    0.9956    0.9890       225

    accuracy                         0.9705       441
   macro avg     0.9645    0.9572    0.9604       441
weighted avg     0.9707    0.9705    0.9703       441


--- Split 2/3 | random_seed=43 ---
Balanced Acc: 0.9851
              precision    recall  f1-score   support

          73     0.9863    1.0000    0.9931        72
          80     0.9726    0.9861    0.9793        72
          90     0.9589    0.9722    0.9655        72
         100     0.9955    0.9822    0.9888       225

    accuracy                         0.9841       441
   macro avg     0.9783    0.9851    0.9817       441
weighted avg     0.9843    0.9841    0.9842       441


--- Split 3/3 |

In [68]:
# Wilcoxon signed-rank test on the paired per-split scores (very small sample, n=3)

from scipy.stats import wilcoxon

stat, p = wilcoxon(b, a, alternative="two-sided")
print("Wilcoxon p-value:", p)

Wilcoxon p-value: 1.0


###### Conclusion
- PS2 + TS2 has the best mean score, but the improvement over PS2 alone is tiny (about 0.00005 in balanced accuracy) and may well be due to chance.
- The p-value of 1.0 is greater than 0.05, so we cannot reject the null hypothesis; the test shows no significant difference between PS2 and PS2 + TS2. With only three paired splits the test has very little power, so this result is weak evidence either way.
- PS2 alone is sufficient.
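
With only three paired splits, the exact Wilcoxon signed-rank test cannot produce a two-sided p-value below 0.25, so it could never reject at α = 0.05 regardless of the data. The sketch below enumerates all sign assignments for n = 3 untied, non-zero differences to show this. It is an illustrative enumeration with a hypothetical helper name, not the implementation used by `scipy.stats.wilcoxon`.

```python
from itertools import product

def exact_two_sided_p_values(n):
    """Map each possible W+ (sum of positive ranks) to its exact two-sided p-value."""
    ranks = range(1, n + 1)
    # Under the null hypothesis every rank is positive or negative with probability 1/2.
    w_plus_values = [
        sum(r for r, positive in zip(ranks, signs) if positive)
        for signs in product([False, True], repeat=n)
    ]
    total = len(w_plus_values)
    max_w = n * (n + 1) // 2
    p_values = {}
    for w in range(max_w + 1):
        lower = sum(1 for v in w_plus_values if v <= w) / total
        upper = sum(1 for v in w_plus_values if v >= w) / total
        p_values[w] = min(1.0, 2 * min(lower, upper))
    return p_values

p_by_statistic = exact_two_sided_p_values(3)
smallest_p = min(p_by_statistic.values())
print(smallest_p)
```

More splits (or repeated cross-validation) would be needed before a signed-rank test on per-split scores can separate two feature sets at the 5% level.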

##### 2.4.3 Cumulative sensor selection with PS2 and TS2 (not strictly necessary)
- Although PS2 + TS2 is not significantly better than PS2 alone, we try PS2 + TS2 as the base for the cumulative sensor selection.
- Compare PS2 + TS2 with each additional sensor to confirm that a third sensor is unnecessary.
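
The pairwise and three-sensor searches together amount to a greedy forward selection: start from a fixed base set, try each remaining candidate, and keep the best one only if it improves the score. A minimal sketch of that loop follows. `score_fn` is a hypothetical stand-in for training and evaluating the tree on a sensor subset, and the toy scores are invented for illustration.

```python
def greedy_forward_selection(base, candidates, score_fn, min_gain=0.0):
    """Grow `base` one sensor at a time while the score improves by more than `min_gain`."""
    selected = list(base)
    best_score = score_fn(selected)
    remaining = [c for c in candidates if c not in selected]
    while remaining:
        scored = [(score_fn(selected + [c]), c) for c in remaining]
        top_score, top_candidate = max(scored)
        if top_score - best_score <= min_gain:
            break  # no candidate helps, stop growing the set
        selected.append(top_candidate)
        remaining.remove(top_candidate)
        best_score = top_score
    return selected, best_score

# Toy scores (invented): "A" carries most of the signal, "B" adds a little, "C" hurts.
toy_gain = {"A": 0.90, "B": 0.03, "C": -0.02}

def toy_score(sensors):
    return sum(toy_gain[s] for s in sensors)

chosen, chosen_score = greedy_forward_selection(["A"], ["B", "C"], toy_score)
print(chosen, round(chosen_score, 2))
```

Setting `min_gain` above zero is one way to require that an extra sensor pays for itself, which matters when differences are as small as the ones reported here.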

In [69]:
CANDIDATES_3RD = ["PS6", "PS3", "CE", "FS2", "PS1", "TS3", "TS1"]
def run_cumulative_search_with_ps2_ts2(X_full, y, seeds, candidates):
    rows = []

    # PS2 + TS2 baseline
    X_base = keep_only_sensors(X_full, ["PS2", "TS2"])
    _, summary_base = run_decision_tree_3splits(X_base, y, seeds)
    base_mean = summary_base["balanced_accuracy_mean"].iloc[0]
    base_std  = summary_base["balanced_accuracy_std"].iloc[0]

    rows.append({
        "sensors_used": "PS2 + TS2",
        "mean_bal_acc": base_mean,
        "std_bal_acc": base_std,
        "delta_vs_base": 0.0
    })

    # PS2 + TS2 + one candidate
    for s in candidates:
        X_cum = keep_only_sensors(X_full, ["PS2", "TS2", s])
        _, summary = run_decision_tree_3splits(X_cum, y, seeds)
        mean_acc = summary["balanced_accuracy_mean"].iloc[0]
        std_acc  = summary["balanced_accuracy_std"].iloc[0]

        rows.append({
            "sensors_used": f"PS2 + TS2 + {s}",
            "mean_bal_acc": mean_acc,
            "std_bal_acc": std_acc,
            "delta_vs_base": mean_acc - base_mean
        })

    df = pd.DataFrame(rows).sort_values("mean_bal_acc", ascending=False).reset_index(drop=True)
    return df


In [18]:
drei_sensoren_df = run_cumulative_search_with_ps2_ts2(X, y, SEEDS, CANDIDATES_3RD)
drei_sensoren_df


--- Split 1/3 | random_seed=42 ---
Balanced Acc: 0.9643
              precision    recall  f1-score   support

          73     0.9595    0.9861    0.9726        72
          80     0.9718    0.9583    0.9650        72
          90     0.9306    0.9306    0.9306        72
         100     0.9866    0.9822    0.9844       225

    accuracy                         0.9705       441
   macro avg     0.9621    0.9643    0.9632       441
weighted avg     0.9706    0.9705    0.9705       441


--- Split 2/3 | random_seed=43 ---
Balanced Acc: 0.9817
              precision    recall  f1-score   support

          73     0.9730    1.0000    0.9863        72
          80     0.9859    0.9722    0.9790        72
          90     0.9459    0.9722    0.9589        72
         100     0.9955    0.9822    0.9888       225

    accuracy                         0.9819       441
   macro avg     0.9751    0.9817    0.9783       441
weighted avg     0.9822    0.9819    0.9819       441


--- Split 3/3 |

| | sensors_used | mean_bal_acc | std_bal_acc | delta_vs_base |
|---|---|---|---|---|
| 0 | PS2 + TS2 | 0.970463 | 0.009719 | 0.0 |
| 1 | PS2 + TS2 + PS1 | 0.967407 | 0.013473 | -0.003056 |
| 2 | PS2 + TS2 + PS6 | 0.963148 | 0.010085 | -0.007315 |
| 3 | PS2 + TS2 + CE | 0.962731 | 0.013834 | -0.007731 |
| 4 | PS2 + TS2 + PS3 | 0.96125 | 0.016235 | -0.009213 |
| 5 | PS2 + TS2 + FS2 | 0.958611 | 0.004654 | -0.011852 |
| 6 | PS2 + TS2 + TS3 | 0.957083 | 0.004035 | -0.01338 |
| 7 | PS2 + TS2 + TS1 | 0.955833 | 0.005878 | -0.01463 |


##### Conclusion of the sensor selection process
- Training on the single sensor PS2 is already sufficient.
- TS2 can be added as a second sensor, but the improvement is not significant.
- The other sensors are not needed.
- We continue with PS2 only: extract features with tsfresh and train the decision tree on them.

### 3. Feature extraction with tsfresh
- use only the PS2 sensor
- extract features with tsfresh

In [19]:
# Resampling function
def resample_to_fixed_length(df: pd.DataFrame, target_len: int = 60) -> np.ndarray:
    '''
    Parameters
    ----------
    df : pd.DataFrame
        Shape (n_samples, old_len); each row is one measurement cycle
    target_len : int
        Target length of the time series after resampling (e.g. 60)

    Returns
    -------
    np.ndarray
        Array of shape (n_samples, target_len) with dtype float32
    '''

    x = df.to_numpy(dtype=np.float32)
    n, old_len = x.shape

    # nothing to do if the length already matches
    if old_len == target_len:
        return x

    # otherwise resample by linear interpolation
    old_idx = np.linspace(0, 1, old_len, dtype=np.float32)
    new_idx = np.linspace(0, 1, target_len, dtype=np.float32)

    out = np.empty((n, target_len), dtype=np.float32)
    for i in range(n):
        out[i] = np.interp(new_idx, old_idx, x[i])
    return out

In [39]:
from tqdm import tqdm

def prepare_for_tsfresh_df(df: pd.DataFrame, idx_name: str = "idx") -> pd.DataFrame:
    """Prepare one sensor DataFrame for tsfresh.

    Input df: shape (n_samples, n_timesteps)
    Output: long-format DataFrame with columns: [idx_name, time, value]
    """
    x = df.to_numpy(dtype=np.float32)
    n_samples, n_time = x.shape

    # Collect the per-sample frames in a list and concatenate once at the end;
    # concatenating inside the loop copies the growing frame on every iteration.
    frames = []
    for idx in tqdm(range(n_samples), total=n_samples):
        df_temp = pd.DataFrame({"value": x[idx]})
        df_temp[idx_name] = idx
        df_temp["time"] = df_temp.index
        frames.append(df_temp)

    if not frames:
        return pd.DataFrame(columns=["value", idx_name, "time"])
    return pd.concat(frames, axis="index", ignore_index=True)


In [40]:
from tsfresh import extract_features
from tsfresh.feature_extraction import EfficientFCParameters, MinimalFCParameters

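LOAD_FROM_FILE = False  # set manually; True reloads the cached CSV files
SENSOR_NAME = "PS2"

# (Strongly recommended) resample every series to length 60 first, so the extraction does not run all night
TARGET_LEN = 60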
df_sensor = sensor_data[SENSOR_NAME]
df_sensor_60 = pd.DataFrame(resample_to_fixed_length(df_sensor, target_len=TARGET_LEN))

# 1) prepare raw long df
if not LOAD_FROM_FILE:
    df_raw = prepare_for_tsfresh_df(df_sensor_60, idx_name="idx")
    df_raw.to_csv(f"{SENSOR_NAME}_raw_long.csv", sep=";", decimal=",", index=False)

if LOAD_FROM_FILE:
    df_raw = pd.read_csv(f"{SENSOR_NAME}_raw_long.csv", sep=";", decimal=",")

# 2) extract features
if not LOAD_FROM_FILE:
    features = extract_features(
        df_raw,
        default_fc_parameters=EfficientFCParameters(),  # if too slow, switch to MinimalFCParameters() first
        column_id="idx",
        column_sort="time",
        column_value="value",
        n_jobs=1,   # n_jobs=-1 fails in this environment, so use 1 for now; once it runs, try 4 or 8
    )
    features.to_csv(f"{SENSOR_NAME}_tsfresh_features.csv", sep=";", decimal=",")

if LOAD_FROM_FILE:
    features = pd.read_csv(f"{SENSOR_NAME}_tsfresh_features.csv", sep=";", decimal=",", index_col=0)

print("features shape:", features.shape)


100%|██████████| 2205/2205 [00:00<00:00, 2872.33it/s]
Feature Extraction: 100%|██████████| 2205/2205 [01:00<00:00, 36.16it/s]


features shape: (2205, 777)


### 4. Data Processing
- 4.1 Handle invalid values
- 4.2 Remove features with no variance
- 4.3 Normalize data

#### 4.1 Handle invalid values

In [41]:
# drop columns that contain only NaN values
num_cols= features.shape[1]
features_clean = features.dropna(how="all", axis="columns")
print(f"Dropped {num_cols - features_clean.shape[1]} columns with all NaN values.")

Dropped 282 columns with all NaN values.


In [42]:
# keep only columns in which at least half of the rows hold a valid (non-NaN) value
num_cols= features_clean.shape[1]
features_clean = features_clean.dropna(thresh=features_clean.shape[0] // 2, axis="columns")
print(f"Dropped {num_cols - features_clean.shape[1]} columns with >= 50% invalid values.")

Dropped 0 columns with >= 50% invalid values.


In [43]:
# fill the remaining NaN values by forward and backward fill along the sample axis (values come from neighbouring rows)
features_clean = features_clean.ffill(axis="index")
features_clean = features_clean.bfill(axis="index")

#### 4.2 Remove features with no variance

In [44]:
# drop columns with no variance
num_cols= features_clean.shape[1]
dynamic_features = [col for col in features_clean if features_clean[col].nunique() > 1]
features_clean = features_clean[dynamic_features]
print(f"Dropped {num_cols - features_clean.shape[1]} columns with no variance.")


Dropped 69 columns with no variance.


In [45]:
print("features_clean shape after cleaning:", features_clean.shape)

features_clean shape after cleaning: (2205, 426)


#### 4.3 Normalization
- train/test split
- check the class distribution
- normalize
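
The scaler is fitted on the training portion only and then applied unchanged to the test portion, so no test statistics leak into training. One detail helps when reading the printed checks: `StandardScaler` divides by the population standard deviation (ddof = 0), whereas pandas `.std()` uses ddof = 1 by default, so a correctly scaled training column shows a sample std of sqrt(n / (n - 1)) instead of exactly 1. That is consistent with the value 1.0003 printed for seeds 43 and 44 further down. The sketch below illustrates this with the standard library only; the training size of 1764 is an assumption derived from 2205 samples and `test_size=0.2`, and the data are synthetic.

```python
import math
import random
import statistics

random.seed(0)
n_train = 1764  # assumed: 2205 samples minus a 20% test split of 441
train = [random.gauss(50.0, 5.0) for _ in range(n_train)]
test = [random.gauss(50.0, 5.0) for _ in range(441)]

# Fit on the training data only (population std, as StandardScaler does).
mu = statistics.fmean(train)
sigma = statistics.pstdev(train)

train_scaled = [(v - mu) / sigma for v in train]
test_scaled = [(v - mu) / sigma for v in test]  # reuse the training statistics

# Sample std (ddof = 1) of the scaled training data, as pandas would report it.
sample_std = statistics.stdev(train_scaled)
expected = math.sqrt(n_train / (n_train - 1))
print(round(sample_std, 4), round(expected, 4))
```

The scaled test data will generally not have mean 0 and std 1, because it is transformed with the training statistics; small deviations there are expected and not a sign of a bug.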

In [46]:
def print_class_distribution(y, name=""):
    '''
    Print the class distribution of the target as a sanity check.
    '''
    dist = y.value_counts(normalize=True).sort_index()
    print(f"{name} distribution:")
    print(dist)

In [47]:
SEEDS = [SEED_1, SEED_2, SEED_3]

print_class_distribution(y, "Full Data")

# train-test splits
splits = {}

for i, seed in enumerate(SEEDS, start=1):
    X_train, X_test, y_train, y_test = train_test_split(
        features_clean,
        y,
        test_size=0.2,
        random_state=seed,
        stratify=y
    )

    print("\n" + "-" * 40)
    print(f"Split {i} | Seed {seed}")
    print("-" * 40)

    print_class_distribution(y_train, f"Train (Seed {seed})")
    print_class_distribution(y_test,  f"Test  (Seed {seed})")

    # store this split for the following steps
    splits[seed] = {
        "X_train": X_train,
        "X_test": X_test,
        "y_train": y_train,
        "y_test": y_test,
    }


Full Data distribution:
1
73     0.163265
80     0.163265
90     0.163265
100    0.510204
Name: proportion, dtype: float64

----------------------------------------
Split 1 | Seed 42
----------------------------------------
Train (Seed 42) distribution:
1
73     0.163265
80     0.163265
90     0.163265
100    0.510204
Name: proportion, dtype: float64
Test  (Seed 42) distribution:
1
73     0.163265
80     0.163265
90     0.163265
100    0.510204
Name: proportion, dtype: float64

----------------------------------------
Split 2 | Seed 43
----------------------------------------
Train (Seed 43) distribution:
1
73     0.163265
80     0.163265
90     0.163265
100    0.510204
Name: proportion, dtype: float64
Test  (Seed 43) distribution:
1
73     0.163265
80     0.163265
90     0.163265
100    0.510204
Name: proportion, dtype: float64

----------------------------------------
Split 3 | Seed 44
----------------------------------------
Train (Seed 44) distribution:
1
73     0.163265
80     0.1

In [48]:
# normalize features with z-score
from sklearn.preprocessing import StandardScaler

# normalize every split stored in `splits` (the dict keys are the seeds)
splits_scaled = {}

for i, data in splits.items():
    X_train = data["X_train"]
    X_test  = data["X_test"]

    # a fresh scaler for every split
    scaler_neu = StandardScaler()

    # fit on the training data only, then transform it
    X_train_scaled = pd.DataFrame(
        scaler_neu.fit_transform(X_train),
        columns=X_train.columns,
        index=X_train.index
    )

    # transform the test data with the statistics learned from train
    X_test_scaled = pd.DataFrame(
        scaler_neu.transform(X_test),
        columns=X_test.columns,
        index=X_test.index
    )

    splits_scaled[i] = {
        "X_train": X_train_scaled,
        "X_test": X_test_scaled,
        "y_train": splits[i]["y_train"],
        "y_test": splits[i]["y_test"],
    }

    print(
        f"Seed {i} | "
        f"train mean={X_train.mean().mean():.4f}, "
        f"train std={X_train.std().mean():.4f}"
    )

    print(
        f"Seed {i} | "
        f"train scaled mean={X_train_scaled.mean().mean():.4f}, "
        f"train scaled std={X_train_scaled.std().mean():.4f}"
    )

    print("\n")

    print(
        f"Seed {i} | "
        f"test  mean={X_test.mean().mean():.4f}, "
        f"test  std={X_test.std().mean():.4f}"
    )

    print(
        f"Seed {i} | "
        f"test scaled mean={X_test_scaled.mean().mean():.4f}, "
        f"test scaled std={X_test_scaled.std().mean():.4f}"
    )

    print("\n")


print("Feature normalization with z-score completed.")

Seed 42 | train mean=15707.0886, train std=2732.0026
Seed 42 | train scaled mean=0.0000, train scaled std=0.9979


Seed 42 | test  mean=15580.4814, test  std=2533.2744
Seed 42 | test scaled mean=0.0018, test scaled std=1.0028


Seed 43 | train mean=15661.6021, train std=2684.4415
Seed 43 | train scaled mean=0.0000, train scaled std=1.0003


Seed 43 | test  mean=15762.4272, test  std=2730.5432
Seed 43 | test scaled mean=0.0025, test scaled std=0.9737


Seed 44 | train mean=15681.4742, train std=2688.6594
Seed 44 | train scaled mean=-0.0000, train scaled std=1.0003


Seed 44 | test  mean=15682.9388, test  std=2715.7658
Seed 44 | test scaled mean=0.0094, test scaled std=1.1284


Feature normalization with z-score completed.


#### 4.4 Feature selection with RFECV and modeling
- After removing all-NaN columns, columns with at least 50% invalid values, and columns without variance, 426 features remain.
- RFECV is used to select the important features among these 426.
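
Because RFECV is fitted separately for each seed, the selected feature sets can differ between splits. One simple way to quantify how stable the selection is would be the Jaccard overlap between the sets. The feature names below are invented placeholders; with the real run, the sets would come from the `selected_features` entry stored for each seed.

```python
from itertools import combinations

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Placeholder feature sets per seed (invented, not taken from the actual run).
selected_by_seed = {
    42: {"value__mean", "value__maximum", "value__quantile__q_0.9"},
    43: {"value__mean", "value__maximum", "value__median"},
    44: {"value__mean", "value__standard_deviation"},
}

for s1, s2 in combinations(sorted(selected_by_seed), 2):
    overlap = jaccard(selected_by_seed[s1], selected_by_seed[s2])
    print(s1, s2, round(overlap, 3))

# Features chosen in every split are the most robust candidates.
stable = set.intersection(*selected_by_seed.values())
print(sorted(stable))
```

A low overlap would suggest that many features are interchangeable, which is plausible when hundreds of tsfresh features are computed from one signal.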

In [58]:
# Feature Auswahl
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier

splits_selected_mit_RFECV = {}

for seed, data in splits_scaled.items():
    X_train = data["X_train"]
    y_train = data["y_train"]
    X_test  = data["X_test"]
    y_test  = data["y_test"]

    print("=" * 60)
    print(f"Feature Selection | Seed {seed}")
    print("=" * 60)

    selector = RFECV(
        estimator=DecisionTreeClassifier(random_state=seed),
        step=10,
        min_features_to_select=1,
        scoring="balanced_accuracy",  # be sure to use balanced accuracy here, since the classes are imbalanced
        cv=3,
        n_jobs=-1
    )

    # fit only on train
    X_train_sel = selector.fit_transform(X_train, y_train)
    X_test_sel  = selector.transform(X_test)

    selected_features = X_train.columns[selector.support_]


    # Print selected features

    print(f"Selected {len(selected_features)} / {X_train.shape[1]} features")

    splits_selected_mit_RFECV[seed] = {
        "X_train": X_train_sel,
        "X_test": X_test_sel,
        "y_train": y_train,
        "y_test": y_test,
        "selected_features": selected_features,
        "selector": selector
    }


Feature Selection | Seed 42
Selected 16 / 426 features
Feature Selection | Seed 43
Selected 16 / 426 features
Feature Selection | Seed 44
Selected 26 / 426 features


#### 4.5 Feature selection with tsfresh `select_features` and modeling
- After removing all-NaN columns, columns with at least 50% invalid values, and columns without variance, 426 features remain.
- As an alternative to RFECV, tsfresh's `select_features` is used to select the relevant features among these 426.
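
tsfresh's `select_features` tests each feature individually against the target and then applies a multiple-testing correction to decide which features to keep. As far as the tsfresh documentation describes it, the default is the Benjamini-Yekutieli procedure at a false discovery rate of 0.05; treat that as an assumption to verify for the installed version. The sketch below shows the procedure itself on invented p-values. `benjamini_yekutieli` is a hypothetical helper, not a tsfresh function.

```python
def benjamini_yekutieli(p_values, fdr_level=0.05):
    """Return a list of booleans: True where the hypothesis is rejected (feature kept)."""
    m = len(p_values)
    if m == 0:
        return []
    # Harmonic correction factor that keeps the procedure valid under dependence.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= k / (m * c_m) * fdr_level.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / (m * c_m) * fdr_level:
            largest_k = rank
    rejected = [False] * m
    for idx in order[:largest_k]:
        rejected[idx] = True
    return rejected

p_values = [0.041, 0.001, 0.6, 0.008, 0.039]  # invented per-feature p-values
keep = benjamini_yekutieli(p_values)
print(keep)
```

Note that features with p-values just under 0.05 can still be dropped: the threshold for each rank is far stricter than the raw significance level when many features are tested.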

In [62]:
from tsfresh.feature_selection.selection import select_features
from tsfresh.utilities.dataframe_functions import impute

splits_selected_selection = {}

for seed, data in splits_scaled.items():
    X_train = data["X_train"].copy()
    X_test  = data["X_test"].copy()

    # 关键：对齐 y 的 index（否则 tsfresh 会报错）
    y_train = data["y_train"].copy()
    y_test  = data["y_test"].copy()
    y_train = y_train.loc[X_train.index]
    y_test  = y_test.loc[X_test.index]

    print("=" * 60)
    print(f"tsfresh Feature Selection | Seed {seed}")
    print("=" * 60)

    # 1) Recommended by tsfresh: impute first (handles NaN/inf, in place)
    impute(X_train)
    impute(X_test)

    # 2) select_features: supervised selection on train only (never sees test)
    X_train_sel = select_features(X_train, y_train)

    print("X_train index equals y_train index:", X_train.index.equals(y_train.index))


    # 3) Restrict test to the columns selected on train (equivalent to selector.transform)
    selected_features = list(X_train_sel.columns)
    X_test_sel = X_test[selected_features]

    print(f"Selected {len(selected_features)} / {X_train.shape[1]} features")

    splits_selected_selection[seed] = {
        "X_train": X_train_sel,  # DataFrame
        "X_test": X_test_sel,  # DataFrame
        "y_train": y_train,
        "y_test": y_test,
        "selected_features": selected_features
    }


tsfresh Feature Selection | Seed 42
X_train index equals y_train index: True
Selected 334 / 426 features
tsfresh Feature Selection | Seed 43
X_train index equals y_train index: True
Selected 331 / 426 features
tsfresh Feature Selection | Seed 44
X_train index equals y_train index: True
Selected 329 / 426 features
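
In the cell above, `impute(X_test)` derives its fill values from the test frame itself. tsfresh's `impute` replaces `+inf` with the column maximum, `-inf` with the column minimum and `NaN` with the column median. The sketch below reproduces that rule with plain pandas but takes the fill values from the training frame only, so nothing from the test split feeds into preprocessing. The helper names `fit_fill_values` and `apply_fill_values` are hypothetical, and the tiny frames are made up for illustration.

```python
import numpy as np
import pandas as pd

def fit_fill_values(train: pd.DataFrame) -> dict:
    """Compute per-column fill values from the finite entries of the training frame."""
    finite = train.replace([np.inf, -np.inf], np.nan)
    return {"max": finite.max(), "min": finite.min(), "median": finite.median()}

def apply_fill_values(df: pd.DataFrame, fill: dict) -> pd.DataFrame:
    """Replace +inf, -inf and NaN column-wise with train-derived max, min and median."""
    out = df.copy()
    for col in out.columns:
        out[col] = out[col].replace(np.inf, fill["max"][col])
        out[col] = out[col].replace(-np.inf, fill["min"][col])
        out[col] = out[col].fillna(fill["median"][col])
    return out

# Hypothetical toy data, not the PS2 features.
train = pd.DataFrame({"f1": [1.0, 2.0, 3.0, np.nan], "f2": [10.0, np.inf, 30.0, 20.0]})
test = pd.DataFrame({"f1": [np.nan, 5.0], "f2": [-np.inf, 40.0]})

fill = fit_fill_values(train)
train_clean = apply_fill_values(train, fill)
test_clean = apply_fill_values(test, fill)
print(test_clean)
```

Since invalid values were already removed before the split, this likely changes little here, but it keeps the preprocessing strictly train-only.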


In [51]:
print(features_clean.index[:10])
print(y.index[:10])

Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
RangeIndex(start=0, stop=10, step=1)


### Model Training

In [71]:
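from sklearn.metrics import f1_score  # needed below; harmless if already imported earlier

def train_decision_tree_3splits(splits, max_depth=None, min_samples_leaf=1):
    """Train one Decision Tree per split and report per-split and aggregated metrics."""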
    rows = []

    for k, (seed, d) in enumerate(sorted(splits.items()), start=1):
        X_train = d["X_train"]
        y_train = d["y_train"]
        X_test  = d["X_test"]
        y_test  = d["y_test"]

        clf = DecisionTreeClassifier(
            random_state=seed,
            max_depth=max_depth,
            min_samples_leaf=min_samples_leaf
        )

        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)

        bal_acc = balanced_accuracy_score(y_test, y_pred)
        f1_macro = f1_score(y_test, y_pred, average="macro")
        f1_weighted = f1_score(y_test, y_pred, average="weighted")

        print("\n" + "="*60)
        print(f"Decision Tree | Split {k}/3 | SEED={seed}")
        print("="*60)
        print("Balanced Accuracy:", round(bal_acc, 4))
        print(classification_report(y_test, y_pred, digits=4, zero_division=0))

        rows.append({
            "split": k,
            "seed": seed,
            "balanced_accuracy": bal_acc,
            "f1_macro": f1_macro,
            "f1_weighted": f1_weighted
        })

    results_df = pd.DataFrame(rows)

    summary_df = pd.DataFrame({
        "balanced_accuracy_mean": [results_df["balanced_accuracy"].mean()],
        "balanced_accuracy_std": [results_df["balanced_accuracy"].std(ddof=1)],
        "f1_macro_mean":          [results_df["f1_macro"].mean()],
        "f1_macro_std":           [results_df["f1_macro"].std(ddof=1)],
        "f1_weighted_mean":       [results_df["f1_weighted"].mean()],
        "f1_weighted_std":        [results_df["f1_weighted"].std(ddof=1)],
    })

    print("\n=== Per-split results ===")
    print(results_df)

    print("\n=== Summary (3 splits) ===")
    print(summary_df)

    return results_df, summary_df


In [76]:
# With RFECV-selected features (no depth limit)
results_mit_RFECV_df, summary_mit_RFECV_df = train_decision_tree_3splits(
    splits_selected_mit_RFECV,
    max_depth=None,
    min_samples_leaf=1
)


Decision Tree | Split 1/3 | SEED=42
Balanced Accuracy: 0.9129
              precision    recall  f1-score   support

          73     1.0000    0.9861    0.9930        72
          80     0.8571    0.9167    0.8859        72
          90     0.7317    0.8333    0.7792        72
         100     0.9763    0.9156    0.9450       225

    accuracy                         0.9138       441
   macro avg     0.8913    0.9129    0.9008       441
weighted avg     0.9208    0.9138    0.9161       441


Decision Tree | Split 2/3 | SEED=43
Balanced Accuracy: 0.9012
              precision    recall  f1-score   support

          73     0.9861    0.9861    0.9861        72
          80     0.8553    0.9028    0.8784        72
          90     0.7500    0.7917    0.7703        72
         100     0.9585    0.9244    0.9412       225

    accuracy                         0.9093       441
   macro avg     0.8875    0.9012    0.8940       441
weighted avg     0.9121    0.9093    0.9104       441


Dec
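
The balanced accuracy printed in these reports is the unweighted mean of the per-class recall values. The sketch below checks this in two ways: with the four recall values from the Split 1 report of the RFECV run, and with a small label vector (`y_true` and `y_pred` are hypothetical and only serve as illustration).

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, recall_score

# Per-class recall copied from the Split 1 report of the RFECV run.
recall_per_class = {73: 0.9861, 80: 0.9167, 90: 0.8333, 100: 0.9156}
mean_recall = float(np.mean(list(recall_per_class.values())))
print("Mean recall from the report:", round(mean_recall, 4))

# Same identity on a tiny hypothetical label vector.
y_true = np.array([73, 73, 80, 80, 90, 90, 100, 100, 100, 100])
y_pred = np.array([73, 73, 80, 90, 90, 90, 100, 100, 100, 90])
macro_recall = recall_score(y_true, y_pred, average="macro")
bal_acc = balanced_accuracy_score(y_true, y_pred)
print("Macro recall:", macro_recall, "| Balanced accuracy:", bal_acc)
```

Because every class counts equally, the weaker recall on class 90 pulls the score down more than it would in plain accuracy, where class 100 dominates with 225 of 441 samples.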

In [77]:
# Without feature selection: all 426 scaled features (no depth limit)
results_ohne_selektion_df, summary_ohne_selektion_df = train_decision_tree_3splits(
    splits_scaled,
    max_depth=None,
    min_samples_leaf=1
)


Decision Tree | Split 1/3 | SEED=42
Balanced Accuracy: 0.8886
              precision    recall  f1-score   support

          73     1.0000    0.9722    0.9859        72
          80     0.8101    0.8889    0.8477        72
          90     0.7368    0.7778    0.7568        72
         100     0.9537    0.9156    0.9342       225

    accuracy                         0.8980       441
   macro avg     0.8752    0.8886    0.8811       441
weighted avg     0.9024    0.8980    0.8996       441


Decision Tree | Split 2/3 | SEED=43
Balanced Accuracy: 0.8644
              precision    recall  f1-score   support

          73     0.9863    1.0000    0.9931        72
          80     0.8714    0.8472    0.8592        72
          90     0.6711    0.7083    0.6892        72
         100     0.9144    0.9022    0.9083       225

    accuracy                         0.8776       441
   macro avg     0.8608    0.8644    0.8624       441
weighted avg     0.8794    0.8776    0.8783       441


Dec

In [78]:
# With features chosen by tsfresh select_features (no depth limit)
results_mit_selection_df, summary_mit_selection_df = train_decision_tree_3splits(
    splits_selected_selection,
    max_depth=None,
    min_samples_leaf=1
)


Decision Tree | Split 1/3 | SEED=42
Balanced Accuracy: 0.9106
              precision    recall  f1-score   support

          73     0.9861    0.9861    0.9861        72
          80     0.8873    0.8750    0.8811        72
          90     0.7209    0.8611    0.7848        72
         100     0.9764    0.9200    0.9474       225

    accuracy                         0.9138       441
   macro avg     0.8927    0.9106    0.8999       441
weighted avg     0.9217    0.9138    0.9163       441


Decision Tree | Split 2/3 | SEED=43
Balanced Accuracy: 0.8704
              precision    recall  f1-score   support

          73     0.9863    1.0000    0.9931        72
          80     0.9000    0.8750    0.8873        72
          90     0.6500    0.7222    0.6842        72
         100     0.9128    0.8844    0.8984       225

    accuracy                         0.8753       441
   macro avg     0.8623    0.8704    0.8658       441
weighted avg     0.8798    0.8753    0.8771       441


Dec

### Evaluation

In [83]:
# Baseline: all 17 sensors with 9 features each
print("=========Baseline: all 17 sensors with 9 fixed statistical features=========")
baseline_summary_df




|   | balanced_accuracy_mean | balanced_accuracy_std | f1_macro_mean | f1_macro_std | f1_weighted_mean | f1_weighted_std |
|---|---|---|---|---|---|---|
| 0 | 0.90963 | 0.007106 | 0.903202 | 0.004597 | 0.918784 | 0.004041 |


In [85]:
# Ablation experiment
print("=========Ablation experiment: remove one sensor at a time, 9 fixed statistical features=========")
ablation_df



|   | removed_sensor | mean_bal_acc | delta_vs_baseline | std_bal_acc |
|---|---|---|---|---|
| 0 | PS2 | 0.816481 | -0.093148 | 0.013351 |
| 1 | PS3 | 0.89875 | -0.01088 | 0.007616 |
| 2 | PS6 | 0.900046 | -0.009583 | 0.009697 |
| 3 | TS3 | 0.90037 | -0.009259 | 0.003019 |
| 4 | CE | 0.900787 | -0.008843 | 0.006187 |
| 5 | FS2 | 0.901111 | -0.008519 | 0.01356 |
| 6 | PS1 | 0.901389 | -0.008241 | 0.013269 |
| 7 | TS1 | 0.902315 | -0.007315 | 0.011111 |
| 8 | TS2 | 0.902731 | -0.006898 | 0.003451 |
| 9 | SE | 0.905185 | -0.004444 | 0.02047 |
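
The `delta_vs_baseline` column is the mean balanced accuracy after removing a sensor minus the baseline mean of 0.90963. As a sketch, the column can be recomputed from the printed values for a few of the rows. Small differences in the last digit against the table are expected, because the printed numbers are already rounded.

```python
import pandas as pd

baseline = 0.90963  # balanced_accuracy_mean of the 17-sensor baseline

# A subset of the ablation rows, copied from the table above.
ablation = pd.DataFrame({
    "removed_sensor": ["TS2", "PS3", "SE", "PS2", "PS6"],
    "mean_bal_acc": [0.902731, 0.89875, 0.905185, 0.816481, 0.900046],
})

# Negative delta: the model gets worse once this sensor is removed.
ablation["delta_vs_baseline"] = ablation["mean_bal_acc"] - baseline
ablation = ablation.sort_values("delta_vs_baseline").reset_index(drop=True)
print(ablation.round(4))
```

Sorting by the delta puts the sensor whose removal hurts most at the top, which is PS2 with a drop of roughly nine percentage points.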


In [91]:
# Two sensors only
print("=========PS2 plus one other sensor, 9 fixed statistical features=========")
pair_df



|   | sensors_used | mean_bal_acc | std_bal_acc | delta_vs_ps2_only |
|---|---|---|---|---|
| 0 | PS2 + TS2 | 0.970463 | 0.009719 | 4.6e-05 |
| 1 | PS2 only | 0.970417 | 0.014021 | 0.0 |
| 2 | PS2 + PS6 | 0.968657 | 0.014478 | -0.001759 |
| 3 | PS2 + CE | 0.967361 | 0.008438 | -0.003056 |
| 4 | PS2 + FS2 | 0.967361 | 0.011684 | -0.003056 |
| 5 | PS2 + PS3 | 0.965093 | 0.017973 | -0.005324 |
| 6 | PS2 + PS1 | 0.964352 | 0.013236 | -0.006065 |
| 7 | PS2 + TS3 | 0.962824 | 0.015356 | -0.007593 |
| 8 | PS2 + TS1 | 0.962454 | 0.012794 | -0.007963 |


In [87]:
# PS2 only, tsfresh features without selection
print("=========Sensor PS2 only, with 426 features==========")
summary_ohne_selektion_df



|   | balanced_accuracy_mean | balanced_accuracy_std | f1_macro_mean | f1_macro_std | f1_weighted_mean | f1_weighted_std |
|---|---|---|---|---|---|---|
| 0 | 0.874352 | 0.012657 | 0.872105 | 0.009375 | 0.889054 | 0.010617 |


In [88]:
print("=========Sensor PS2 only, features chosen with tsfresh select_features==========")
summary_mit_selection_df



|   | balanced_accuracy_mean | balanced_accuracy_std | f1_macro_mean | f1_macro_std | f1_weighted_mean | f1_weighted_std |
|---|---|---|---|---|---|---|
| 0 | 0.881296 | 0.025615 | 0.878561 | 0.018564 | 0.893536 | 0.02038 |


In [90]:
print("=========Sensor PS2 only, features chosen with RFECV==========")
summary_mit_RFECV_df



|   | balanced_accuracy_mean | balanced_accuracy_std | f1_macro_mean | f1_macro_std | f1_weighted_mean | f1_weighted_std |
|---|---|---|---|---|---|---|
| 0 | 0.900556 | 0.012723 | 0.893353 | 0.007754 | 0.907612 | 0.010144 |


- Sensor PS2 is the most important sensor; PS2 alone is sufficient for training.
- PS2 with 9 fixed descriptive statistical features and a Decision Tree reaches a mean balanced accuracy of about 97%, on par with the best sensor pair (PS2 + TS2).
- tsfresh does not perform as well in this project (also with PS2 only):
    1. We used only PS2 for feature extraction with tsfresh, which yielded 777 features.
    2. We then handled invalid values and removed features with no variance, which left 426 features.
    3. Next, we created three train/test splits and normalized the features.

- With these normalized (scaled) features and no feature selection, the mean balanced accuracy is 87.44%.
- After normalization we applied two feature selection methods:
    - tsfresh `select_features` gives a mean balanced accuracy of 88.13%, below the 97% reached with PS2 and the 9 statistical features.
    - RFECV gives a mean balanced accuracy of 90.06%, also below the 97% reached with PS2 and the 9 statistical features.

In summary, PS2 with basic statistical features performs clearly better than feature extraction with tsfresh, at a comparable spread across the three splits.
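
The comparison can be collected in one table. The sketch below copies the balanced accuracy means and standard deviations from the outputs above. The setup labels and the `gap_to_best` column are additions for illustration only.

```python
import pandas as pd

# Mean and std of balanced accuracy over the three splits, copied from the outputs above.
summary = pd.DataFrame(
    [
        ("17 sensors, 9 statistics each", 0.909630, 0.007106),
        ("PS2 only, 9 statistics", 0.970417, 0.014021),
        ("PS2, tsfresh, 426 features", 0.874352, 0.012657),
        ("PS2, tsfresh + select_features", 0.881296, 0.025615),
        ("PS2, tsfresh + RFECV", 0.900556, 0.012723),
    ],
    columns=["setup", "bal_acc_mean", "bal_acc_std"],
)

# Distance of each setup to the best mean balanced accuracy.
summary["gap_to_best"] = summary["bal_acc_mean"].max() - summary["bal_acc_mean"]
summary = summary.sort_values("bal_acc_mean", ascending=False).reset_index(drop=True)
print(summary.round(4))
```

With only three splits per setup, the standard deviations are rough estimates, so small gaps between the tsfresh variants should be read with caution.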