# **Notebook Overview: Rule-Based Trading with Metaheuristic Rule Discovery**

This notebook presents a complete demonstration of a **rule-based trading system** applied to Ethereum price data. The objective is to illustrate how **Genetic Algorithms (GA)** can be used to *automatically discover interpretable IF–THEN trading rules* based on engineered features from **Phase 1 of the course** .

The implementation follows the conceptual framework introduced in the accompanying lecture materials on **Rule Discovery with Metaheuristics** .

---

## **Purpose of the Notebook**

The notebook serves as a educational example demonstrating:

1. **How trading rules are encoded as chromosomes** within a GA.
2. **How conditions, actions, TP/SL levels, and position sizing are represented genetically.**
3. **How a rule list is decoded and executed via a backtesting engine.**
4. **How the GA iteratively improves candidate rule sets** using fitness feedback from historical performance.

Although the demonstration uses only a subset of features, students may extend the system to incorporate the full set of engineered indicators produced in Phase 1.

---

## **Chromosome Structure and Representation**

Each chromosome represents a **complete ordered rule list**, where:

* Each rule consists of multiple **conditions**, specifying a feature, operator, and threshold.
* Each rule also encodes its **action parameters**:

  * Take-profit (TP),
  * Stop-loss (SL),
  * **Fraction of capital allocated to the trade** (position size).
* Activation flags control whether a rule or condition is included in the effective strategy.

The genetic representation enables exploration of a very large combinational search space while retaining **transparent, human-readable rule structures** once decoded.

---

## **Training Data and Features**

For this demonstration, we use only a small set of the available features to keep the example focused and computationally manageable.
However, students are encouraged to:

* Use **all available engineered features**,
* Modify or extend the chromosome structure,
* Experiment with different operator sets, thresholds, or rule depths.

The system is fully modular, and feature selection is handled implicitly via the genetic encoding.

---

## **Optimization and Fitness Function**

The GA is trained exclusively on the **training dataset**.
The objective function (fitness) is defined as:

> **Final equity obtained after backtesting the rule set**, starting with an initial capital of $1000.

This formulation incorporates:

* Profitability of trades,
* Trading frequency,
* Position sizing decisions,
* Compounding through equity updates.

After training, students must evaluate their discovered rule sets on the **separate test dataset** to assess out-of-sample performance and detect overfitting.

---

## **Student Instructions**

* You may run this notebook directly in **Google Colab**, but remember to mount your Google Drive before accessing datasets.
* All configuration parameters (e.g., population size, mutation rates, TP/SL ranges, position-size bounds) can be modified to explore different design choices.
* The implementation is modular; students may:

  * Add new features,
  * Adjust how rules are represented,
  * Implement alternative fitness functions,
  * Replace GA with other metaheuristic algorithms such as PSO or DE.

---

## **Educational Goals**

By working through this notebook, you will gain hands-on understanding of:

* How rule-based trading systems are formalized and optimized,
* How metaheuristic methods operate on structured, interpretable solutions,
* How backtesting interacts with rule logic, signal generation, and position management,
* How to evaluate trading strategies rigorously using both train and test datasets.

This forms the foundation for the **Phase 2 Rule Discovery Project**.

In [104]:
# from google.colab import drive
# drive.mount('/content/drive')

In [105]:
import numpy as np
import json
import pandas as pd
from dataclasses import dataclass
from typing import List, Optional, Tuple
import random

In [106]:
# === CONFIG: rule structure, GA, and trading ===

MAX_RULES = 2        # N (max rules in the rule list)
MAX_CONDS = 4         # K (max conditions per rule)

TP_MIN, TP_MAX = 0.02, 0.04   # 2% .. 5%
SL_MIN, SL_MAX = 0.04, 0.06   # 1% .. 4%

STARTING_CAPITAL = 1000.0    # starting money for the strategy

POS_MIN_FRAC = 0.05          # 10% of capital per trade (min)
POS_MAX_FRAC = 0.30          # 50% of capital per trade (max)

POP_SIZE = 30
N_GENERATIONS = 40
TOURNAMENT_SIZE = 10
CROSSOVER_RATE = 0.85
MUTATION_RATE = 0.08   # base mutation probability per gene

RANDOM_SEED = 42
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)

In [107]:
# === 1. Load ETH data with engineered features ===

def load_eth_features(csv_path: str,
                      feature_cols: List[str],
                      close_col: str = "close"
                      ) -> Tuple[pd.DataFrame, List[str]]:
    """
    Load ETH OHLCV + engineered features.
    We keep only 'close' and selected feature columns.

    Parameters
    ----------
    csv_path : str
        Path to CSV containing at least 'close' and feature columns.
    feature_cols : list of str
        Subset of ~50 features you want to use in the demo.
    close_col : str
        Name of the close price column.

    Returns
    -------
    df : pd.DataFrame
        DataFrame indexed by time with 'close' and selected features.
    feature_cols_used : list of str
        The actual feature columns we keep (intersection of requested + available).
    """
    df = pd.read_csv(csv_path, parse_dates=True, index_col=0)
    df.columns = [c.lower() for c in df.columns]

    if close_col.lower() not in df.columns:
        raise ValueError(f"Close column '{close_col}' not found in data.")

    # Intersect requested features with available columns
    feature_cols_lower = [c.lower() for c in feature_cols]
    available_features = [c for c in feature_cols_lower if c in df.columns]

    if len(available_features) == 0:
        raise ValueError("None of the requested feature columns are present in the data.")

    cols_to_keep = [close_col.lower()] + available_features
    df = df[cols_to_keep].sort_index()

    return df, available_features

In [108]:
# === 2. Gene and Rule structures (phenotype) ===

@dataclass
class Condition:
    # active: bool
    feature_idx: int
    operator: str      # "<" or ">"
    q: float    # ADD FOR QUANTILE
    threshold: float = None   # numeric threshold


@dataclass
class Rule:
    # active: bool
    conditions: List[Condition]
    side: str          # "BUY" or "SELL"
    tp: float          # take-profit as decimal (e.g. 0.03 for 3%)
    sl: float          # stop-loss as decimal
    size_frac: float   # fraction of current capital to allocate (0.0–1.0)

In [109]:
# === 2b. Chromosome representation (genotype) ===

@dataclass
class Chromosome:
    # Rule-level genes
    rule_active: np.ndarray      # (MAX_RULES,)  {0,1}
    side_gene: np.ndarray        # (MAX_RULES,)  {0,1}  0=BUY, 1=SELL
    tp_gene: np.ndarray          # (MAX_RULES,)  [0,1]
    sl_gene: np.ndarray          # (MAX_RULES,)  [0,1]
    size_gene: np.ndarray        # (MAX_RULES,)  [0,1]  --> position size fraction

    # Condition-level genes
    cond_active: np.ndarray      # (MAX_RULES, MAX_CONDS)  {0,1}
    feature_idx_gene: np.ndarray # (MAX_RULES, MAX_CONDS)
    operator_gene: np.ndarray    # (MAX_RULES, MAX_CONDS)  {0,1}
    # threshold_gene: np.ndarray   # (MAX_RULES, MAX_CONDS)  [0,1]
    q_gene: np.ndarray

In [110]:
# === 3. Mapping genes to actual TP/SL and thresholds ===

def map_tp_gene(tp_gene: float) -> float:
    """Map [0,1] gene to TP% in [TP_MIN, TP_MAX]."""
    return TP_MIN + tp_gene * (TP_MAX - TP_MIN)


def map_sl_gene(sl_gene: float) -> float:
    """Map [0,1] gene to SL% in [SL_MIN, SL_MAX]."""
    return SL_MIN + sl_gene * (SL_MAX - SL_MIN)


def map_size_gene(size_gene: float) -> float:
    """
    Map [0,1] size_gene to a fraction of capital to allocate per trade.
    For example, POS_MIN_FRAC=0.05, POS_MAX_FRAC=0.5 => 5%..50%.
    """
    return POS_MIN_FRAC + size_gene * (POS_MAX_FRAC - POS_MIN_FRAC)


def map_operator_gene(op_gene: int) -> str:
    """0 -> '<', 1 -> '>'."""
    return "<" if op_gene == 0 else ">"


def map_threshold_gene_to_value(feature_series: pd.Series, thr_gene: float) -> float:
    """
    Map a [0,1] threshold_gene to a numeric threshold using feature quantiles.

    thr_gene ~ 0.0 => low quantile (e.g. oversold RSI)
    thr_gene ~ 1.0 => high quantile (e.g. overbought RSI)
    """
    # np.nanquantile handles NaNs gracefully
    return float(np.nanquantile(feature_series.values, thr_gene))

In [111]:
def compute_condition_thresholds(
    rules,
    df_reference: pd.DataFrame,
    feature_cols
):
    """
    Compute real thresholds using quantiles from df_reference only.
    Prevents data leakage.
    """
    for r in rules:
        for cond in r.conditions:
            feat = feature_cols[cond.feature_idx]
            series = df_reference[feat].dropna()

            if len(series) == 0:
                cond.threshold = None
                continue

            q = min(max(cond.q, 0.01), 0.99)
            cond.threshold = float(series.quantile(q))


In [112]:
# === Decode Chromosome to Rule List (Quantile-based) ===

def decode_chromosome(
    chrom: Chromosome,
    df: pd.DataFrame,
    feature_cols: List[str]
) -> List[Rule]:
    """
    Decode chromosome into a list of Rule objects.
    IMPORTANT:
    - This function does NOT compute numeric thresholds.
    - It only assigns quantile genes (q).
    - Real thresholds are computed later using compute_condition_thresholds().
    """

    rules: List[Rule] = []

    for r in range(MAX_RULES):
        if chrom.rule_active[r] == 0:
            continue

        # --- rule-level genes ---
        side = "BUY" if chrom.side_gene[r] == 0 else "SELL"
        tp = map_tp_gene(chrom.tp_gene[r])
        sl = map_sl_gene(chrom.sl_gene[r])
        size_frac = map_size_gene(chrom.size_gene[r])

        conds: List[Condition] = []

        for c in range(MAX_CONDS):
            if chrom.cond_active[r, c] == 0:
                continue

            feat_idx = int(chrom.feature_idx_gene[r, c]) % len(feature_cols)
            op = map_operator_gene(int(chrom.operator_gene[r, c]))

            # ⭐ Quantile gene (NOT numeric threshold)
            q = float(chrom.q_gene[r, c])   # اگر اسم را عوض نکردی: threshold_gene

            conds.append(
                Condition(
                    feature_idx=feat_idx,
                    operator=op,
                    q=q,              # quantile ∈ (0,1)
                    threshold=None    # computed later
                )
            )

        # discard empty rules
        if len(conds) == 0:
            continue

        rules.append(
            Rule(
                conditions=conds,
                side=side,
                tp=tp,
                sl=sl,
                size_frac=size_frac,
            )
        )

    return rules

In [113]:
def rule_fires(
    rule: Rule,
    df: pd.DataFrame,
    feature_cols: List[str],
    t: int
) -> bool:
    """
    Check if a rule fires at row index t.
    All conditions must be true.

    Quantile-based note:
    - cond.threshold MUST be pre-computed (e.g., by compute_condition_thresholds).
    - If threshold is None, rule does not fire (prevents leakage / undefined behavior).
    """
    row = df.iloc[t]

    for cond in rule.conditions:
        # threshold must exist
        thr = getattr(cond, "threshold", None)
        if thr is None or np.isnan(thr):
            return False

        feat_name = feature_cols[cond.feature_idx]
        x = row[feat_name]

        # missing feature => don't fire
        if x is None or (isinstance(x, float) and np.isnan(x)):
            return False

        op = cond.operator
        if op == "<":
            if not (x < thr):
                return False
        elif op == ">":
            if not (x > thr):
                return False
        else:
            # unknown operator => safe fail
            return False

    return True

In [114]:
# def backtest_rule_list(
#     rules: List[Rule],
#     df: pd.DataFrame,
#     feature_cols: List[str],
#     starting_capital: float = STARTING_CAPITAL
# ) -> Tuple[List[float], float]:
#     """
#     Backtest a rule list with capital and position sizing.

#     Returns
#     -------
#     trade_returns : list of float
#         Per-trade returns (in % terms, like before).
#     final_equity : float
#         Final money after all trades.
#     """
#     if len(rules) == 0:
#         return [], starting_capital

#     close = df["close"].values
#     n = len(df)

#     equity = starting_capital

#     position = 0          # 0 = flat, +1 = long, -1 = short
#     entry_price = None
#     entry_rule: Optional[Rule] = None
#     entry_capital = None  # amount of capital allocated to this trade

#     trade_returns: List[float] = []

#     for t in range(n):
#         price = close[t]
#         if np.isnan(price):
#             continue

#         if position == 0:
#             # --- FLAT: look for entry ---
#             for rule in rules:
#                 if rule_fires(rule, df, feature_cols, t):
#                     # Capital to allocate = fraction of current equity
#                     size_frac = rule.size_frac
#                     trade_capital = equity * size_frac

#                     if trade_capital <= 0:
#                         break  # nothing to allocate

#                     position = 1 if rule.side == "BUY" else -1
#                     entry_price = price
#                     entry_rule = rule
#                     entry_capital = trade_capital
#                     break

#         else:
#             # --- IN POSITION: manage trade ---
#             assert entry_rule is not None and entry_price is not None and entry_capital is not None

#             # Return in direction of position (+ for profit)
#             ret = position * (price / entry_price - 1.0)

#             tp_hit = ret >= entry_rule.tp
#             sl_hit = ret <= -entry_rule.sl

#             if tp_hit or sl_hit:
#                 # Close trade
#                 pnl = ret * entry_capital      # profit or loss in dollars
#                 equity += pnl                  # update money
#                 trade_returns.append(ret)

#                 # Reset position
#                 position = 0
#                 entry_price = None
#                 entry_rule = None
#                 entry_capital = None

#     return trade_returns, equity

def backtest_rule_list(
    rules: List[Rule],
    df: pd.DataFrame,
    feature_cols: List[str],
    starting_capital: float = STARTING_CAPITAL
) -> Tuple[List[float], float, int]:
    """
    Backtest a rule list with capital and position sizing.

    Returns
    -------
    equity_curve : list of float
        True equity over time (incremental)
    final_equity : float
        Final money after all trades
    n_trades : int
        Number of closed trades
    """
    if len(rules) == 0:
        return [starting_capital], starting_capital, 0

    close = df["close"].values
    n = len(df)

    equity = float(starting_capital)
    equity_curve = [equity]

    position = 0          # 0 = flat, +1 = long, -1 = short
    entry_price = None
    entry_rule: Optional[Rule] = None
    entry_capital = None

    n_trades = 0

    for t in range(n):
        price = close[t]
        if np.isnan(price):
            equity_curve.append(equity)
            continue

        if position == 0:
            # --- FLAT: look for entry ---
            for rule in rules:
                if rule_fires(rule, df, feature_cols, t):
                    size_frac = rule.size_frac
                    trade_capital = equity * size_frac

                    if trade_capital <= 0:
                        break

                    position = 1 if rule.side == "BUY" else -1
                    entry_price = price
                    entry_rule = rule
                    entry_capital = trade_capital
                    break

        else:
            # --- IN POSITION: manage trade ---
            ret = position * (price / entry_price - 1.0)

            tp_hit = ret >= entry_rule.tp
            sl_hit = ret <= -entry_rule.sl

            if tp_hit or sl_hit:
                pnl = ret * entry_capital
                equity += pnl

                position = 0
                entry_price = None
                entry_rule = None
                entry_capital = None

                n_trades += 1

        equity_curve.append(equity)

    # --- force close at last price (important!) ---
    if position != 0 and entry_price is not None and entry_capital is not None:
        final_price = close[-1]
        if not np.isnan(final_price):
            ret = position * (final_price / entry_price - 1.0)
            pnl = ret * entry_capital
            equity += pnl
            equity_curve[-1] = equity
            n_trades += 1

    return equity_curve, equity, n_trades

In [115]:
# def compute_fitness(
#     chrom: Chromosome,
#     df: pd.DataFrame,
#     feature_cols: List[str]
# ) -> float:
#     """
#     Decode chromosome -> rule list -> backtest -> final equity.

#     Fitness = final money (higher is better).
#     """
#     rules = decode_chromosome(chrom, df, feature_cols)

#     trade_returns, final_equity = backtest_rule_list(
#         rules, df, feature_cols, starting_capital=STARTING_CAPITAL
#     )

#     # Optional: if you want to slightly penalize "do nothing" strategies:
#     if len(trade_returns) == 0:
#         return STARTING_CAPITAL - 1.0  # tiny penalty

#     return final_equity

def compute_fitness(
    chrom: Chromosome,
    df: pd.DataFrame,
    feature_cols: List[str]
) -> float:
    """
    Walk-forward + Quantile-based fitness.
    Robust against overfitting and scale drift.
    """

    # 1) Decode chromosome → rules (WITHOUT numeric thresholds)
    rules = decode_chromosome(chrom, df, feature_cols)

    # 2) Walk-forward split
    n = len(df)
    if n < 100:  # safety guard
        return -1e6

    split = int(0.7 * n)
    df_A = df.iloc[:split]
    df_B = df.iloc[split:]

    # 3) ⭐ Compute REAL thresholds from df_A ONLY (quantile-based)
    compute_condition_thresholds(rules, df_A, feature_cols)

    # 4) Backtest on A (train part)
    eq_A_curve, final_A, trades_A = backtest_rule_list(
        rules, df_A, feature_cols
    )

    # 5) Backtest on B (forward / pseudo-test)
    eq_B_curve, final_B, trades_B = backtest_rule_list(
        rules, df_B, feature_cols
    )

    # 6) Hard rejection (no-trade or degenerate strategies)
    if trades_A == 0 or trades_B == 0:
        return -1e6

    # 7) Drawdown on B only (future-facing risk)
    eq_B_curve = np.asarray(eq_B_curve, dtype=float)
    running_max = np.maximum.accumulate(eq_B_curve)
    drawdowns = (running_max - eq_B_curve) / np.clip(running_max, 1e-12, None)
    max_dd_B = float(np.max(drawdowns))

    # ---------------- penalties ----------------

    # (1) drawdown penalty (soft, realistic)
    LAMBDA_DD = 0.4
    penalty_dd = LAMBDA_DD * max_dd_B * STARTING_CAPITAL

    # (2) complexity penalty (rules + conditions)
    n_rules = len(rules)
    n_conds = sum(
        len(r.conditions)
        for r in rules
        if hasattr(r, "conditions") and r.conditions is not None
    )
    penalty_complexity = 1.2 * n_rules + 0.4 * n_conds

    # (3) trade-count pressure (soft, on B only)
    penalty_trades = 0.0
    if trades_B > 120:
        penalty_trades += 2.0 * (trades_B - 120)
    elif trades_B < 10:
        penalty_trades += 50.0

    # 8) Final fitness (future weighted more)
    fitness = (
        0.4 * final_A
        + 0.6 * final_B
        - penalty_dd
        - penalty_complexity
        - penalty_trades
    )

    return float(fitness)


In [116]:
# === 6. GA: initialization ===

def random_chromosome(n_features: int) -> Chromosome:
    rule_active = np.random.randint(0, 2, size=(MAX_RULES,))
    if rule_active.sum() == 0:
        rule_active[np.random.randint(0, MAX_RULES)] = 1

    side_gene = np.random.randint(0, 2, size=(MAX_RULES,))
    tp_gene = np.random.rand(MAX_RULES)
    sl_gene = np.random.rand(MAX_RULES)
    size_gene = np.random.rand(MAX_RULES)  # NEW: position size genes in [0,1]

    cond_active = np.random.randint(0, 2, size=(MAX_RULES, MAX_CONDS))
    for r in range(MAX_RULES):
        if rule_active[r] == 1 and cond_active[r].sum() == 0:
            cond_active[r, np.random.randint(0, MAX_CONDS)] = 1

    feature_idx_gene = np.random.randint(0, n_features, size=(MAX_RULES, MAX_CONDS))
    operator_gene = np.random.randint(0, 2, size=(MAX_RULES, MAX_CONDS))
    q_gene = np.random.rand(MAX_RULES, MAX_CONDS)

    return Chromosome(
        rule_active=rule_active,
        side_gene=side_gene,
        tp_gene=tp_gene,
        sl_gene=sl_gene,
        size_gene=size_gene,
        cond_active=cond_active,
        feature_idx_gene=feature_idx_gene,
        operator_gene=operator_gene,
        q_gene=q_gene,   # ⭐
    )

In [117]:
# === 6b. Parent selection (tournament) ===

def tournament_selection(population: List[Chromosome],
                         fitnesses: List[float],
                         k: int = TOURNAMENT_SIZE) -> Chromosome:
    """
    Tournament selection: pick k random individuals, return the best.
    """
    idxs = np.random.choice(len(population), size=k, replace=False)
    best_idx = idxs[0]
    best_fit = fitnesses[best_idx]
    for i in idxs[1:]:
        if fitnesses[i] > best_fit:
            best_fit = fitnesses[i]
            best_idx = i
    return population[best_idx]


In [118]:
def crossover(
    parent1: Chromosome,
    parent2: Chromosome
) -> Tuple[Chromosome, Chromosome]:
    """
    Rule-aware uniform crossover with gentle condition mixing.
    Quantile-aware (uses q_gene instead of threshold_gene).
    """

    # ---------- no crossover → clone ----------
    if np.random.rand() >= CROSSOVER_RATE:
        child1 = Chromosome(
            rule_active=parent1.rule_active.copy(),
            side_gene=parent1.side_gene.copy(),
            tp_gene=parent1.tp_gene.copy(),
            sl_gene=parent1.sl_gene.copy(),
            size_gene=parent1.size_gene.copy(),
            cond_active=parent1.cond_active.copy(),
            feature_idx_gene=parent1.feature_idx_gene.copy(),
            operator_gene=parent1.operator_gene.copy(),
            q_gene=parent1.q_gene.copy(),
        )
        child2 = Chromosome(
            rule_active=parent2.rule_active.copy(),
            side_gene=parent2.side_gene.copy(),
            tp_gene=parent2.tp_gene.copy(),
            sl_gene=parent2.sl_gene.copy(),
            size_gene=parent2.size_gene.copy(),
            cond_active=parent2.cond_active.copy(),
            feature_idx_gene=parent2.feature_idx_gene.copy(),
            operator_gene=parent2.operator_gene.copy(),
            q_gene=parent2.q_gene.copy(),
        )
        return child1, child2

    # ---------- create empty children ----------
    child1 = Chromosome(
        rule_active=np.zeros_like(parent1.rule_active),
        side_gene=np.zeros_like(parent1.side_gene),
        tp_gene=np.zeros_like(parent1.tp_gene),
        sl_gene=np.zeros_like(parent1.sl_gene),
        size_gene=np.zeros_like(parent1.size_gene),
        cond_active=np.zeros_like(parent1.cond_active),
        feature_idx_gene=np.zeros_like(parent1.feature_idx_gene),
        operator_gene=np.zeros_like(parent1.operator_gene),
        q_gene=np.zeros_like(parent1.q_gene),
    )

    child2 = Chromosome(
        rule_active=np.zeros_like(parent1.rule_active),
        side_gene=np.zeros_like(parent1.side_gene),
        tp_gene=np.zeros_like(parent1.tp_gene),
        sl_gene=np.zeros_like(parent1.sl_gene),
        size_gene=np.zeros_like(parent1.size_gene),
        cond_active=np.zeros_like(parent1.cond_active),
        feature_idx_gene=np.zeros_like(parent1.feature_idx_gene),
        operator_gene=np.zeros_like(parent1.operator_gene),
        q_gene=np.zeros_like(parent1.q_gene),
    )

    # ---------- rule-level uniform crossover ----------
    for r in range(MAX_RULES):
        if np.random.rand() < 0.5:
            src1, src2 = parent1, parent2
        else:
            src1, src2 = parent2, parent1

        # rule-level genes
        child1.rule_active[r] = src1.rule_active[r]
        child1.side_gene[r]   = src1.side_gene[r]
        child1.tp_gene[r]     = src1.tp_gene[r]
        child1.sl_gene[r]     = src1.sl_gene[r]
        child1.size_gene[r]   = src1.size_gene[r]

        child2.rule_active[r] = src2.rule_active[r]
        child2.side_gene[r]   = src2.side_gene[r]
        child2.tp_gene[r]     = src2.tp_gene[r]
        child2.sl_gene[r]     = src2.sl_gene[r]
        child2.size_gene[r]   = src2.size_gene[r]

        # ---------- condition-level gentle mixing ----------
        for c in range(MAX_CONDS):

            # ---- child1 ----
            if child1.rule_active[r] == 0:
                # inactive rule → copy directly
                child1.cond_active[r, c]      = src1.cond_active[r, c]
                child1.feature_idx_gene[r, c] = src1.feature_idx_gene[r, c]
                child1.operator_gene[r, c]    = src1.operator_gene[r, c]
                child1.q_gene[r, c]           = src1.q_gene[r, c]
            else:
                donor = src1 if np.random.rand() < 0.7 else src2
                child1.cond_active[r, c]      = donor.cond_active[r, c]
                child1.feature_idx_gene[r, c] = donor.feature_idx_gene[r, c]
                child1.operator_gene[r, c]    = donor.operator_gene[r, c]
                child1.q_gene[r, c]           = donor.q_gene[r, c]

            # ---- child2 ----
            if child2.rule_active[r] == 0:
                child2.cond_active[r, c]      = src2.cond_active[r, c]
                child2.feature_idx_gene[r, c] = src2.feature_idx_gene[r, c]
                child2.operator_gene[r, c]    = src2.operator_gene[r, c]
                child2.q_gene[r, c]           = src2.q_gene[r, c]
            else:
                donor = src2 if np.random.rand() < 0.7 else src1
                child2.cond_active[r, c]      = donor.cond_active[r, c]
                child2.feature_idx_gene[r, c] = donor.feature_idx_gene[r, c]
                child2.operator_gene[r, c]    = donor.operator_gene[r, c]
                child2.q_gene[r, c]           = donor.q_gene[r, c]

    return child1, child2

In [119]:
def mutate(chrom: Chromosome, n_features: int):
    """
    Quantile-aware, rule-aware mutation.
    """

    for r in range(MAX_RULES):

        # ---- mutate rule-level genes ----
        if np.random.rand() < MUTATION_RATE:
            chrom.rule_active[r] = 1 - chrom.rule_active[r]

        if np.random.rand() < MUTATION_RATE:
            chrom.side_gene[r] = 1 - chrom.side_gene[r]

        if np.random.rand() < MUTATION_RATE:
            chrom.tp_gene[r] = np.clip(
                chrom.tp_gene[r] + np.random.normal(scale=0.05),
                0.001,
                0.10,
            )

        if np.random.rand() < MUTATION_RATE:
            chrom.sl_gene[r] = np.clip(
                chrom.sl_gene[r] + np.random.normal(scale=0.05),
                0.001,
                0.10,
            )

        if np.random.rand() < MUTATION_RATE:
            chrom.size_gene[r] = np.clip(
                chrom.size_gene[r] + np.random.normal(scale=0.05),
                0.05,
                0.50,
            )

        # ---- mutate condition-level genes ----
        for c in range(MAX_CONDS):

            # only mutate conditions of active rules
            if chrom.rule_active[r] == 0:
                continue

            if np.random.rand() < MUTATION_RATE:
                chrom.cond_active[r, c] = 1 - chrom.cond_active[r, c]

            if np.random.rand() < MUTATION_RATE:
                chrom.feature_idx_gene[r, c] = np.random.randint(0, n_features)

            if np.random.rand() < MUTATION_RATE:
                chrom.operator_gene[r, c] = 1 - chrom.operator_gene[r, c]

            # ⭐ Quantile mutation (replaces threshold_gene)
            if np.random.rand() < MUTATION_RATE:
                chrom.q_gene[r, c] = np.clip(
                    chrom.q_gene[r, c] + np.random.normal(scale=0.08),
                    0.05,
                    0.95,
                )

In [120]:
# === 7. GA main loop ===

def run_ga(df: pd.DataFrame,
           feature_cols: List[str]
           ) -> Tuple[Chromosome, float]:
    """
    Run a simple GA to discover a good rule list.

    Returns best_chromosome, best_fitness.
    """
    n_features = len(feature_cols)

    # --- Initialize population ---


    initial_pop_size = POP_SIZE * 5
    population: List[Chromosome] = [
        random_chromosome(n_features) for _ in range(initial_pop_size)
    ]

    # Evaluate initial population
    fitnesses = [
        compute_fitness(chrom, df, feature_cols) for chrom in population
    ]

    sorted_idx = np.argsort(fitnesses)[::-1]  # descending
    population = [population[i] for i in sorted_idx[:POP_SIZE]]
    fitnesses = [fitnesses[i] for i in sorted_idx[:POP_SIZE]]


    best_idx = int(np.argmax(fitnesses))
    best_chrom = population[best_idx]
    best_fit = fitnesses[best_idx]

    print(f"Initial best fitness (from expanded pool): {best_fit:.6f}")


    for gen in range(1, N_GENERATIONS + 1):
        new_population: List[Chromosome] = []

        # --- Elitism: keep the best individual ---
        new_population.append(best_chrom)

        # --- Generate rest of population ---
        while len(new_population) < POP_SIZE:
            # Select parents
            p1 = tournament_selection(population, fitnesses)
            p2 = tournament_selection(population, fitnesses)

            # Crossover
            child1, child2 = crossover(p1, p2)

            # Mutation
            mutate(child1, n_features)
            mutate(child2, n_features)

            new_population.append(child1)
            if len(new_population) < POP_SIZE:
                new_population.append(child2)

        population = new_population
        fitnesses = [
            compute_fitness(chrom, df, feature_cols) for chrom in population
        ]

        gen_best_idx = int(np.argmax(fitnesses))
        gen_best_fit = fitnesses[gen_best_idx]

        # Update global best
        if gen_best_fit > best_fit:
            best_fit = gen_best_fit
            best_chrom = population[gen_best_idx]

        print(f"Generation {gen:3d}: best fitness = {gen_best_fit:.6f}, global best = {best_fit:.6f}")

    return best_chrom, best_fit


In [121]:
def pretty_print_rules(rules, feature_cols):
    for i, rule in enumerate(rules, 1):
        cond_strs = []
        for cond in rule.conditions:
            feat_name = feature_cols[cond.feature_idx]

            if cond.threshold is None:
                cond_strs.append(
                    f"{feat_name} {cond.operator} q={cond.q:.2f}"
                )
            else:
                cond_strs.append(
                    f"{feat_name} {cond.operator} {cond.threshold:.4f}"
                )

        cond_part = " AND ".join(cond_strs)
        print(f"Rule {i}: IF {cond_part}")
        print(
            f"    THEN {rule.side} "
            f"TP={rule.tp:.3f} SL={rule.sl:.3f} SIZE={rule.size_frac:.2f}"
        )

# Main program

In [122]:
# Example usage (you adapt the paths and feature names):
FEATURE_COLS = [
 "cand_body_sign_5",
 "mom_cci_20",
 "vol_atr_14",
 "trend_lin_slope_50",
 "rob_zret_60",
 "rob_skew_30",
 "trend_dema_20",
 "vol_vwap_20",
 "cand_candle_dir_1",
 "vol_obv_1",
 "cand_gap_1",
 "trend_sma_5",
 "mom_chande_14",
# "vol_var_20",
# "cand_close_open_ratio_1",
# "rob_median_abs_dev_30",
# "trend_ema_12",
# "mom_roc_10",
# "vol_cmf_20",
# "rob_iqr_20",
# "trend_ema_26",
# "vol_mfi_14",
# "rob_kurt_30",
# "trend_sma_20",
# "rob_hurst_100",
# "vol_pvo_12_26",
# "vol_vpt_1",
# "vol_std_20",
# "cand_shadow_lower_1",
# "trend_tema_20",
# "trend_hma_21",
# "cand_range_1",
# "vol_zclose_60",
# "mom_willr_14",
# "mom_macd_12_26",
# "mom_stoch_d_14_3_3",
# "cand_up_down_vol_ratio_20",
# "rob_autocorr_20",
# "ent_return_30",
# "vol_bbw_20_2",
# "vol_logret_std_20",
# "vol_range_ratio_14",
# "cand_shadow_upper_1",
# "mom_ppo_12_26",
# "trend_wma_14",
# "vol_high_low_corr_20",
# "vol_vroc_10",
# "mom_rsi_14",
# "vol_adi_14",
# "vol_kc_width_20_2",
# "cand_pinbar_flag_1",
# "cand_shaven_head_flag_1",
# "cand_shaven_bottom_flag_1",
# "ent_trend_slope_dir_30",
# "liq_inverse_20",
# "flow_vpin_proxy_50",
# "ent_volume_dir_30",
# "gap_body_ratio_1",
# "mom_tsi_25_13",
# "mom_vol_adj_roc_10",
# "mom_stc_10_23",
# "price_midhl_dist_1",
# "reg_lin_intercept_50",
# "reg_resid_std_50",
# "range_contraction_10",
# "mom_kst_10_15_20_30",
# "price_vwap_dist_20",
# "mom_ultimate_osc_7_14_28",
# "risk_dd_speed_10",
# "rob_median_range_20",
# "rob_close_percentile_50",
# "trend_aroon_down_25",
# "spec_fft_hf_ratio_32",
# "regime_chop_index_14",
# "rob_range_percentile_30",
# "rob_return_cv_30",
# "regime_atr_spike_flag_20",
# "spec_hl_band_energy_24",
# "vol_bipower_var_20",
# "trend_psar_dir_002_02",
# "vol_atr_slope_14",
# "trend_aroon_up_25",
# "trend_frama_20",
# "trend_donchian_pos_20",
# "trend_trix_15",
# "trend_chande_forecast_10",
# "vol_atr_rel_14_50",
# "vol_force_index_13",
# "vol_eom_14",
# "vol_skew_30",
# "vol_ulcer_index_14",
# "vol_vfi_26",
# "vol_kurt_30",
# "vol_turnover_chg_20",
# "band_donchian_width_20",
# "band_kc_pos_20_2",
# "vol_yang_zhang_20",
# "vol_volume_percentile_100",
# "trend_kama_21",
# "vol_kvo_34_55",
# "equal_lows_tightness_20",
# "filt_savgol_11_3",
# "candle_engulf_strength_5",
# "breaker_retest_flag_20",
# "ent_perm_close_30",
# "filt_gauss_close_20",
# "fvg_creation_flag_1",
# "channel_reg_upper_dist_50",
# "fib_extension_near_1_618",
# "fib_retracement_near_0_618",
# "fib_extension_near_1_272",
# "displacement_strength_10",
# "fib_retracement_near_0_500",
# "equal_highs_tightness_20",
# "filt_dema_20",
# "channel_reg_lower_dist_50",
# "ichimoku_cloud_thickness_52",
# "ichimoku_kijun_dist_26",
# "ichimoku_tenkan_dist_9",
# "ichimoku_span_a_dist_52",
# "ichimoku_span_b_dist_52",
# "liq_daily_zone_touch_flag_1d",
# "liq_zone_strength_50",
# "liq_zone_touch_flag_50",
# "liq_weekly_zone_touch_flag_1w",
# "fvg_fill_ratio_30",
# "internal_range_shift_20",
# "micro_range_stack_count_20",
# "liquidity_sweep_wick_ratio_20",
# "liquidity_rebuild_speed_20",
# "liquidity_grab_efficiency_10",
# "market_structure_break_count_50",
# "mom_rsi_div_flag_14_5",
# "pivot_classic_s2_dist_1d",
# "orderblock_freshness_score_50",
# "pivot_classic_r1_dist_1d",
# "price_prev_low_dist_1",
# "range_high_dist_50",
# "pivot_classic_r2_dist_1d",
# "range_breakout_flag_50",
# "premium_discount_balance_50",
# "pivot_classic_s1_dist_1d",
# "mom_volume_trend_div_flag_20",
# "range_low_dist_50",
# "pivot_confluence_score_1d",
# "pivot_classic_pp_dist_1d",
# "price_prev_high_dist_1",
# "prior_range_overlap_ratio_50",
# "range_rotation_index_20",
# "reg_shift_flag_50",
# "regime_range_flag_adx_14",
# "range_tagging_bias_50",
# "regime_range_flag_bb_20_q20",
# "regime_trend_down_flag_adx_14",
# "regime_trend_down_flag_slope_50_atr_14",
# "reg_trending_flag_30",
# "regime_trend_up_flag_slope_50_atr_14",
# "session_asian_high_dist_1d",
# "session_displacement_ratio_1d",
# "regime_trend_up_flag_adx_14",
# "session_initial_balance_breakout_flag_1d",
# "session_high_low_shift_dir_3d",
# "session_asian_low_dist_1d",
# "smc_liquidity_void_depth_50",
# "session_killzone_activity_index",
# "structural_hh_hl_trend_score_50",
# "sweep_and_break_flag_20",
# "swing_failure_pattern_flag_20",
# "swing_leg_efficiency_ratio_30",
# "structure_shift_score_30",
# "time_dow_sin",
# "time_hour_sin",
# "trend_sma_cross_flag_5_20",
# "trendline_touch_flag_100",
# "trendline_slope_100",
# "volprof_poc_dist_100",
# "volprof_vah_dist_100",
# "trendline_break_rsi_14",
# "volprof_val_dist_100",
# "band_gauss_upper_dist_20_2",
# "wick_rejection_intensity_10",
# "break_prev_high_flag_1",
# "break_prev_low_flag_1",
# "band_gauss_lower_dist_20_2",
# "breaker_block_distance_20",
# "microstr_noise_20",
# "eff_ratio_20",
# "vol_weight_mom_20",
# "vol_parkinson_20",
# "rvap_20",
# "osc_adaptive_20",
# "rev_index_50",
# "range_expansion_30",
# "div_pv_20",
# "gap_momentum_5",
# "price_hl_ratio",
# "price_range_position",
# "vol_change_1",
# "price_ohlc_mean",
# "cand_body_ratio",
# "vol_sma_5",
# "mom_acceleration_regime_30",
# "corr_inter_timeframe_coherence",
# "price_abs_change_1",
# "wave_elliott_proxy_impulse_50",
# "adapt_kalman_price_estimate_20",
# "price_open_close_ratio",
# "mom_persistence_score_60",
# "price_max_10",
# "vol_realized_vol_cone_20",
# "hybrid_rsi_volume_weighted_14",
# "chaos_largest_lyapunov_exponent_50",
# "info_mutual_info_price_volume_30",
# "struct_wyckoff_phase_indicator_100",
# "vol_turnover_ratio_20",
# "pattern_inside_bar_strength_10",
# "price_median_vs_close_20",
# "wave_dwt_approx_coeff_level2_20",
# "wave_dwt_detail_energy_level1_20",
# "wave_cwt_ridge_frequency_32",
# "struc_doji_flag_1",
# "range_ratio_prev_1",
# "seq_consecutive_up_3",
# "ret_return_3",
# "risk_drawdown_100",
# "struc_marubozu_bear_1",
# "struc_midpoint_dev_1",
# "gap_down_flag_1",
# "seq_consecutive_green_5",
# "struc_upper_shadow_1",
# "struc_marubozu_bull_1",
# "reg_slope_norm_50",
# "stat_percentile_close_50",
# "ret_cumprod_5",
# "range_zscore_24",
# "struc_body_ratio_1",
# "struc_lower_shadow_1",
# "ent_sign_change_30",
# "gap_up_flag_1",
# "trend_hma_55",
# "microstruct_roll_volskew_30",
# "vol_zclose_30",
# "micro_price_jump_intensity_50",
# "pivot_dynamic_reversal_score_20",
# "mom_roc_3",
# "mom_rsi_21",
# "ar1_resid_volatility_30",
# "trend_sma_13",
# "reg_lin_slope_100",
# "nonlinear_reg_curve_change_40",
# "tail_event_frequency_100",
# "vol_bbw_10_1",
# "vol_atr_21",
# "vol_flow_impulse_ratio_25",
# "spec_multiband_entropy_20_60",
# "fractal_vol_surface_64",
# "regime_entropy_transition_50",
# "trend_ema_34",
# "seasonal_strength_quarterhour",
# "hilbert_freq_variability_30",
# "vol_stdret_10",
# "adaptive_kalman_beta_20",
# "spectral_coherence_multi_32",
# "orderflow_exec_eff_30",
# "fric_slippage_proxy_20",
# "stat_bimodality_coef_80",
# "time_cycle_phase_96",
# "spec_recurring_cycle_strength_72",
# "struct_price_memory_decay_40",
# "vol_intrabar_energy_ratio_20",
# "time_session_transition_volatility_1d",
# "pattern_shadow_symmetry_divergence_10",
# "flow_volume_lead_lag_asymmetry_30",
# "img_candlestick_texture_glcm_entropy_20",
# "img_price_contour_complexity_sobel_30",
# "img_candlestick_morphology_erosion_dilation_15",
# "img_price_hough_trendline_votes_50",
# "img_volume_histogram_equalization_contrast_30",
# "thermodynamic_entropy_price_45",
# "astrophysical_black_hole_68",
# "band_donch_ma_lower_50",
# "band_rvwap_lower_60_20_30",
# "band_smc_equilibrium_50",
# "band_session_range_pct",
# "band_donch_ma_upper_50",
# "band_rvwap_position_60_20",
# "band_donch_ma_mid_50",
# "band_rvwap_upper_60_20_70",
# "band_vwap_zone_n2",
# "band_smc_discount_zone_50",
# "band_smc_range_position_50",
# "band_smc_premium_zone_50",
# "band_donch_ma_position_50",
# "band_rvwap_median_60_20",
# "band_vwap_zone_n1",
# "band_vwap_zone_p2",
# "blockchain_consensus_sim_25",
# "biological_neural_oscillation_36",
# "band_vwap_zone_p1",
# "consciousness_awareness_field_44",
# "chaos_theory_bifurcation_75",
# "band_vwap_zone_n3",
# "band_vwap_zone_p3",
# "cycle_even_odd_skew_24",
# "ecological_predator_prey_55",
# "gravitational_pull_metrics_60",
# "liquidity_volume_persistence_50",
# "liquidity_shock_absorption_20",
# "entropy_range_uniformity_45",
# "entropy_directional_shift_55",
# "molecular_dynamics_simulation_52",
# "mom_adx_14",
# "mom_adx_extreme_14_75",
# "mom_adx_quality_long_14_25",
# "mom_adx_strong_14_25",
# "mom_adx_vstrong_14_50",
# "mom_adx_rising_14_3",
# "mom_adx_quality_short_14_25",
# "mom_adx_weakening_14_3",
# "mom_amo_cross_ama_bear_14_9",
# "mom_ama_14_9",
# "mom_amo_14_9",
# "mom_amo_above_zero_14_9",
# "mom_amo_ama_spread_14_9",
# "mom_amo_cross_ama_bull_14_9",
# "mom_di_spread_14",
# "mom_donch_ma_break_bear_50",
# "mom_donch_ma_break_bull_50",
# "mom_macd_div_bear_12_26_9",
# "mom_macd_div_hbear_12_26_9",
# "mom_macd_div_bull_12_26_9",
# "mom_macd_div_hbull_12_26_9",
# "mom_smc_fvg_bear",
# "mom_session_momentum_10",
# "mom_pdi_cross_ndi_bull_14",
# "mom_smc_order_block_bear_50",
# "mom_smc_fvg_bull",
# "mom_ndi_cross_pdi_bear_14",
# "mom_smc_order_block_bull_50",
# "mom_tl_break_bear_30",
# "mom_tl_break_bear_60",
# "mom_tl_break_bull_10",
# "mom_tl_break_bear_10",
# "mom_tl_ll_10",
# "mom_tl_hh_60",
# "mom_tl_break_bull_60",
# "mom_tl_hh_30",
# "mom_tl_ll_30",
# "mom_tl_hh_10",
# "mom_tl_break_bull_30",
# "mom_ursi_above_ema99_14",
# "mom_ursi_os_14_20",
# "mom_trendline_break_bull_14",
# "mom_tl_ll_60",
# "mom_ursi_14",
# "momentum_relative_velocity_12",
# "mom_trendline_break_bear_14",
# "neural_network_activation_32",
# "mom_ursi_ob_14_80",
# "pattern_bear_engulfing",
# "pattern_bear_confidence",
# "pattern_bear_harami",
# "pattern_bull_confidence",
# "pattern_bull_harami",
# "pattern_bull_engulfing",
# "pattern_falling_three",
# "pattern_rising_three",
# "pattern_inverted_hammer",
# "pattern_hanging_man",
# "pattern_hammer",
# "pattern_morning_star",
# "pattern_evening_star",
# "pattern_net_bias",
# "pattern_shooting_star",
# "pattern_three_black_crows",
# "pattern_three_white_soldiers",
# "pattern_tweezer_bottom",
# "quantum_annealing_optimization_48",
# "quantum_computation_oracle_42",
# "quantum_entanglement_proxy_50",
# "pattern_tweezer_top",
# "quantum_field_fluctuation_28",
# "quantum_gravity_unification_72",
# "quantum_error_correction_38",
# "quantum_teleportation_fidelity_33",
# "trend_bullish_streak_active",
# "relativistic_momentum_50",
# "regime_cusum_drift_60",
# "quantum_tunneling_prob_30",
# "quantum_superposition_state_40",
# "trend_donch_ma_50",
# "trend_ema_angle_21",
# "trend_long_streak_5",
# "trend_session_bias",
# "trend_session_streak_length",
# "trend_session_body_pct",
# "trend_smc_internal_5",
# "trend_session_streak",
# "trend_tl_direction_10",
# "trend_session_streak_strength",
# "trend_tl_direction_30",
# "trend_tl_direction_60",
# "trend_trendline_breaks_osc_14_9_20",
# "trend_tl_confluence",
# "trend_trendlines_bear_5_10",
# "trend_smc_structure_50",
# "trend_smc_structure_confluence",
# "trend_trendlines_bear_cross_5_10",
# "trend_trendlines_bull_5_10",
# "trend_trendlines_bull_cross_5_10",
# "trend_trendlines_net_5_10",
# "trend_trendlines_signal_5_10",
# "trend_volume_coupling_30",
# "vol_donch_ma_width_50",
# "vol_efficiency_ratio_30",
# "vol_smc_fvg_size",
# "vol_rvwap_width_60_20",
# "state_trend_triple_20",
# "markov_trend_transition_prob_30",
# "physics_price_velocity_10",
# "vector_angular_momentum_15",
# "sentiment_greed_fear_simple",
# "mom_full_stoch_14_3_3",
# "trend_adx_14",
# "band_bb_percent_b_20_2",
# "volume_obv_smoothed_10",
# "volume_mfi_14",
# "mom_awesome_osc_5_34",
# "volume_force_index_13",
# "mom_kst_10_15_20_30_10",
# "volume_chaikin_mf_20",
# "vol_keltner_channels_20_2",
# "mom_stoch_rsi_14_14_3",
# "mom_williams_r_14",
# "volume_vwap_daily",
# "volume_mfi_enhanced_14",
# "mom_trix_15_9",
# "vol_bb_relative_width_20_2",
# "trend_dmi_14",
# "volume_obv_ma_ratio_20",
# "mom_pmo_35_20_10",
# "mom_macd_histogram_12_26_9",
# "trend_adx_line_14",
# "trend_elder_ray_13",
# "volume_relative_20",
# "candle_body_total_ratio",
# "price_abs_change_5",
# "ratio_profit_loss_10",
# "vol_bb_squeeze_20_2",
# "mom_price_10",
# "advanced_hurst_exponent",
# "volume_sma_10",
# "advanced_fisher_transform",
# "price_typical",
# "price_hl_spread",
# "momentum_mfi",
# "momentum_tsi",
# "volatility_atr_14",
# "trend_vortex",
# "advanced_efficiency_ratio",
# "momentum_roc_10",
# "volatility_mass_index",
# "trend_aroon",
# "advanced_parabolic_sar",
# "momentum_demarker",
# "momentum_ultimate_oscillator",
# "momentum_connors_rsi",
# "momentum_schaff_trend_cycle",
# "momentum_cmo",
# "volume_accumulation_distribution",
# "trend_linear_regression_slope",
# "volume_ease_of_movement",
# "volatility_choppiness_index",
# "trend_ma_envelope",
# "trend_ichimoku_base",
# "advanced_disparity_index",
# "trend_donchian_channels",
# "price_position",
# "volume_negative_volume_index",
# "volume_balance_of_power",
# "volatility_standard_deviation",
# "volume_positive_volume_index",
# "volume_vroc",
# "trend_awesome_oscillator",
# "trend_sma_50",
# "trend_ema_20",
# "trend_wma_20",
# "volatility_skew_30",
# "volatility_kurt_30",
# "momentum_zscore_return_20",
# "volume_obv",
# "price_drawdown_20",
# "candlestick_upper_shadow_ratio",
# "candlestick_lower_shadow_ratio",
# "trend_ema_ratio_12_26",
# "momentum_percent_rank_20",
# "volatility_mad_return_20",
# "volatility_volume_cv_20",
# "momentum_acceleration_5",
# "price_drawup_20",
# "entropy_return_abs_30",
# "spectral_power_ratio_30",
# "robust_zscore_close_30",
# "trend_r2_50",
# "volume_std_20",
# "volume_change_rate_5",
# "volatility_true_range_20",
# "trend_ichimoku_conversion_9",
# "trend_ichimoku_span_a_9_26",
# "mom_ppo_12_26_9",
# "volm_vwap_distance",
# "trend_hlc3_sma_20",
# "trend_highest_close_flag_25",
# "trend_price_reg_curvature_20",
# "trend_abs_trend_strength_15",
# "trend_median_filter_7",
# "trend_drawdown",
# "trend_smooth_step_10",
# "trend_zscore_hl_mid_20",
# "trend_open_close_position_20",
# "trend_pullback_strength_20",
# "trend_reg_angle_30",
# "trend_rising_window_frac_10",
# "reg_squeeze_flag_20",
# "tod_sin_daily",
# "trend_amplitude_20",
# "trend_price_slope_25",
# "spec_hilbert_phase_vel_20",
# "rob_mad_close_20",
# "stat_skew_50",
# "trend_hl_ratio_50",
# "trend_candle_wick_ratio_15",
# "trend_low_higherlow_flag_20",
# "trend_reg_r2_30",
# "trend_high_ret_strength_20",
# "risk_max_drawdown_100",
# "trend_price_slope_5",
# "trend_up_down_strength_15",
# "trend_break_strength_20",
# "volm_volume_spike_flag_3",
# "stat_change_point_volatility_30",
# "mom_log_return_10",
# "intrabar_body_norm_10",
# "vol_pricecorr_20",
# "trend_sma_slope_20",
# "trend_volscaled_trend_20",
# "volm_volume_pressure_5",
# "mom_log_return_1",
# "stat_autocorr1_50",
# "trend_candle_body_strength_30",
# "stat_tailfrac_50",
# "trend_ema_3_10_ratio",
# "trend_range_slope_20",
# "trend_log_trend_30",
# "vol_bandwidth_bb_20_2",
# "trend_rise_rate_15",
# "stat_kurt_50",
# "vol_maxdraw_50",
# "vol_stdret_20",
# "vol_range_norm_10",
# "trend_hl_mid_acc_10",
# "mom_acceleration_5",
# "reg_trend_flag_20",
# "trend_price_reversal_score_15",
# "trend_price_above_sma_50",
# "trend_sma_percent_change_15",
# "trend_close_range_frac_20",
# "vol_rsi_volatility_14",
# "trend_low_ret_strength_20",
# "trend_pct_change_12",
# "trend_falling_window_frac_10",
# "vol_true_range_1",
# "vol_kurtosis_30",
# "trend_candle_upper_shadow_20",
# "mom_stochk_14",
# "trend_midprice_reg_slope_20",
# "trend_hl_equilibrium_10",
# "trend_lowest_close_flag_25",
# "spec_wavelet_energy_ratio_10_40",
# "trend_price_distribution_kurt_20",
# "trend_maxmin_center_dist_20",
# "vol_rv_20",
# "trend_high_break_ratio_20",
# "mom_std_momentum_20",
# "trend_close_vs_high_20",
# "mom_velocity_5",
# "tod_cos_daily",
# "trend_hlc3_slope_20",
# "trend_ema_double_slope_15",
# "trend_reg_slope_15",
# "mom_abs_momentum_14",
# "trend_hl_center_offset_15",
# "mom_accel_2",
# "trend_close_angle_10",
# "trend_sma_stability_30",
# "reg_breakout_flag_20",
# "ent_range_30",
# "trend_close_shifted_corr_10",
# "vol_spike_5",
# "trend_sma_5_20_cross_dist",
# "trend_centered_range_ratio_20",
# "trend_close_minus_open_20",
# "trend_candle_body_strength_10",
# "price_breakout_20",
# "trend_ema_10",
# "trend_vol_norm_price_15",
# "trend_range_zscore_20",
# "spec_pow_low_64",
# "trend_price_normalized_rolling_20",
# "trend_sma_ratio_10_30",
# "trend_ma_kurtosis_20",
# "trend_reg_residual_std_20",
# "trend_mid_vs_sma20",
# "band_kelt_dist_20_2",
# "trend_zscore_midprice_20",
# "risk_recovery_ratio_50",
# "trend_reg_momentum_20",
# "trend_hl_spread_norm_20",
# "trend_hl_slope_10",
# "trend_tp_range_ratio_20",
# "trend_rolling_min_slope_30",
# "trend_scaled_close_20",
# "trend_close_minus_open_30",
# "trend_price_oscillator_12_26",
# "trend_reg_slope_normalized_20",
# "trend_price_distribution_skew_20",
# "volm_volume_rate_20",
# "trend_price_persistence_10",
# "trend_ema_slope_30",
# "trend_reg_intercept_15",
# "trend_sma_diff_ratio_10_30",
# "trend_normalized_hl_spread_15",
# "trend_midline_slope_15",
# "trend_ma_dist_20",
# "trend_sma_curvature_20",
# "trend_candle_shape_balance_15",
# "vol_mean_30",
# "gap_open_pct",
# "trend_high_low_ratio_20",
# "mom_diff_1",
# "trend_reg_residual_20",
# "stat_interquartile_range_20",
# "trend_sma_diff_10_30",
# "dow_sin_weekly",
# "trend_close_minus_open_10",
# "trend_range_to_body_ratio_20",
# "vol_range_pct_20",
# "vol_range_pct_1",
# "trend_ema_diff_12_50",
# "trend_drawup_20",
# "trend_channel_mid_slope_20",
# "trend_direction_consistency_10",
# "trend_close_to_sma_ratio_30",
# "trend_midprice_rolling_20",
# "spec_bandpower_high_20",
# "spec_zero_cross_rate_20",
# "trend_price_displacement_20",
# "trend_sma_var_20",
# "vol_obv",
# "mom_zscore_close_20",
# "trend_low_break_ratio_20",
# "trend_sma_reg_diff_20",
# "reg_volatility_regime_30",
# "trend_midprice_zscore_15",
# "price_clv_sum_20",
# "volm_obv_change_10",
# "reg_squeeze_bbkc_20",
# "trend_sma_displacement_60",
# "trend_ema_zero_cross_20",
# "trend_close_vs_low_20",
# "spec_energy_10",
# "price_clv",
# "volm_volume_acceleration_10",
# "dow_cos_weekly",
# "trend_price_acceleration_10",
# "trend_closeto_mid_slope_20",
# "rob_pct_rank_close_20",
# "trend_hl_ratio_10",
# "stat_frac_up_moves_20",
# "trend_channel_position_20",
# "trend_low_break_strength_30",
# "vol_expand_ratio_10",
# "trend_normalized_hl_gap_20",
# "trend_sma_undershoot_20",
# "mom_williamsR_14",
# "trend_sma_stddev_30",
# "vol_pctchange_10",
# "trend_hl_coherence_10",
# "trend_high_break_strength_30",
# "trend_trough_detect_15",
# "trend_price_slope_12",
# "spec_bandpower_low_20",
# "volm_zscore_20",
# "trend_sma_velocity_20",
# "vol_close_std_30",
# "trend_ma_convergence_15",
# "trend_smooth_diff_30",
# "trend_reg_line_distance_30",
# "reg_reversal_flag_10",
# "trend_candle_lower_shadow_20",
# "trend_sma_shifted_diff_20",
# "stat_varq05_100",
# "trend_sma_overshoot_20",
# "trend_sma_bandwidth_20",
# "trend_sma_rolling_skew_20",
# "mom_rsi_slope_14",
# "mom_zscore_20",
# "trend_ema_slope_20",
# "reg_high_vol_20",
# "stat_rolling_entropy_20",
# "trend_peak_detect_15",
# "vol_downstd_50",
# "trend_vol_adjusted_trend_20",
# "time_decay_momentum_20",
# "state_trend_flip_rate_50",
# "spec_wavelet_smoothness_30",
# "vol_shock_indicator_30",
# "autocorr_absreturn_20",
# "reg_residual_vol_50",
# "range_center_pull_20",
# "vol_directional_skew_30",
# "price_velocity_norm_10",
# "higher_moment_energy_20",
# "rob_trimmed_mean_return_30",
# "hilbert_trendline_offset_20",
# "micro_bidask_proxy_1",
# "corr_return_volume_lag1_30",
# "pattern_slope_change_flag_10",
# "compressibility_score_30",
# "volatility_of_volatility_30",
# "diff_of_diff_norm_10",
# "swing_amplitude_ratio_20",
# "frac_multiscale_roughness_20_60",
# "ent_vol_sign_entropy_40",
# "chain_return_stability_30",
# "risk_drawdown_duration_100",
# "filt_butter_lowpass_20",
# "vol_price_interaction_20",
# "vol_regime_intensity_40",
# "mean_reversion_speed_20",
# "breakout_intensity_score_20",
# "persistence_ratio_15",
# "rsi_regular_div_flag",
# "rsi_hidden_div_flag",
# "rsi_div_magnitude",
# "rsi_div_persistence",
# "rsi_strength_score",
# "macd_regular_div_flag",
# "macd_hidden_div_flag",
# "macd_div_mag_norm",
# "macd_slope_div_score",
# "macd_div_confidence",
# "momentum_div_score",
# "psar_div_signal",
# "peak_trough_correlation",
# "linear_slope_div_flag",
# "extremum_ratio",
# "vol_hammer_confirm",
# "vol_shooting_confirm",
# "vol_bull_engulf",
# "vol_bear_engulf",
# "vol_doji_diverg",
# "vol_morning_confirm",
# "vol_evening_confirm",
# "vol_double_top_conf",
# "vol_double_bottom_conf",
# "vol_abnormal_z",
# "vol_abnormal_pct",
# "vol_pattern_strength",
# "vol_cluster_high",
# "vol_volume_slope",
# "vol_obv_pattern_align",
# "vol_volume_confidence",
# "vol_price_vol_corr",
# "vol_volume_ratio",
# "vol_candle_volume_gap",
# "mom_time_since_breakout_50",
# "ddn_depth_200",
# "reg_hurst_rs_128",
# "corr_close_volume_50",
# "ent_range_state_30",
# "mom_run_up_frac_20",
# "struct_body_wick_ratio",
# "flt_kz_slope_15_3",
# "spec_pow_ratio_64_1_4",
# "rob_mad_ret_50",
# "vol_exhaust_gap_ratio",
# "whale_signed_spike",
# "vol_volatility_phase_lag",
# "vol_vol_decouple_index",
# "asymm_volume_surface",
# "wick_volume_skew",
# "absorption_entropy_peak",
# "exhaustion_price_clustering",
# "volume_pressure_gradient",
# "rob_median_spread_50",
# "vol_price_velocity_20",
# "spec_fft_energy_ratio_30",
# "mom_range_persistence_10",
# "rev_intraday_reversal_flag",
# "trend_slope_stability_50",
# "vol_corr_range_vol_30",
# "band_zscore_ratio_20",
# "ent_direction_change_50",
# "reg_residual_skew_40",
# "frac_detrended_dim_100",
# "liq_order_imbalance_20",
# "sym_time_reversal_30",
# "burst_volatility_clustering_50",
# "info_mutual_return_volume_40",
# "geom_path_monotonicity_20",
# "geom_arc_to_chord_30",
# "geom_range_overlap_ewm_20",
# "info_runlen_entropy_50",
# "pat_break_density_20_50",
# "spec_haar_energy_ratio_32",
# "ret_signed_streak_20",
# "ret_autocorr_lag1_50",
# "ret_skew_30",
# "range_body_ratio_20",
# "close_location_shift",
# "atr_regime_ratio_14_50",
# "vol_of_vol_30",
# "price_position_in_range_100",
# "breakout_strength_atr_20",
# "overnight_intraday_vol_ratio_20",
# "volume_persistence_20",
# "volume_price_corr_absret_20",
# "volume_dryup_score_20",
# "liquidity_shock_index_20",
# "range_volume_efficiency_20",
# "intraday_volatility_clustering_20",
# "trend_volatility_coupling_30",
# "candle_shape_upper_shadow_20",
# "candle_shape_lower_shadow_20",
# "trend_consistency_50",
# "ret_kurtosis_40",
# "up_down_vol_asym_30",
# "downside_es_30",
# "drawdown_depth_200",
# "realized_parkinson_vol_ratio_30",
# "ret_sign_entropy_50",
# "volume_regime_entropy_50",
# "vol_weighted_direction_20",
# "path_efficiency_ratio_20",
# "range_percentile_50",
# "volume_percentile_50",
# "downside_vol_ratio_30",
# "gap_volatility_20",
# "atr_persistence_20",
# "intraday_reversal_magnitude_20",
# "close_midrange_bias_20",
# "volume_signed_range_corr_20",
# "trapped_volume_pressure",
# "vol_persistent_confirm",
# "leverage_effect_corr_20",
# "divergence_time_to_confirmation",
# "log_price_curvature_50",
# "log_trend_r2_30",
# "log_return_high",
# "log_return_low",
# "log_return_open",
# "log_return_open_std",
# "log_return_high_std",
# "log_return_low_std",
# "bollinger_close_band_width_20",
# "bollinger_high_band_width_20",
# "bollinger_low_band_width_20",
# "bollinger_open_band_width_20",
# "vol_volume_surge_ratio_5",
# "pattern_inside_bar_flag_1",
# "mom_balance_of_power_14",
# "vol_true_range_zscore_20",
# "stat_entropy_directional_30",
# "flow_vol_sma_ratio_20",
# "flow_volatility_volume_ratio_20",
# "info_range_entropy_20",
# "kurt_ret_30",
# "osc_rsi_slope_14",
# "press_price_volume_impact",
# "risk_atr_14",
# "risk_volatility_ratio_20",
# "shift_close_leadlag_5",
# "sign_volcorr_30",
# "smooth_hl_mid_10",
# "speed_close_zscore_20",
# "trend_cum_return",
# "trend_ema_angle_20",
# "trend_supertrend_flag_14_3",
# "var_skewret_30",
# "volu_accel_10",
# "vol_mad_ratio_30",
# "reg_ewm_slope_40",
# "vol_rel_volume_10_30",
# "vol_hl_compression_15",
# "pat_cluster_5",
# "mom_ret_zscore_20",
# "trend_hma_slope_21",
# "vol_rtr_pos_14",
# "pat_upper_shadow_10",
# "ent_vol_sign_50",
# "mom_ema_accel_14",
# "trend_mm_mid_50",
# "vol_logvol_delta_1",
# "vol_vroc_20",
# "reg_persist_dir_10",
# "pat_rb_prop_20",
# "trend_dpo_30",
# "vol_iqr_20",
# "vol_vcr_15",
# "trend_lin_resid_30",
# "mom_signflip_12",
# "vol_mmvd_25",
# "pat_lower_shadow_pct_15",
# "vol_vp_corr_20",
# "vol_rrr_10",
# "pat_bwr_8",
# "trend_epd_26",
# "mom_ret_accel_12",
# "trend_pvt_slope_20",
# "vol_atr_norm_14",
# "pat_engulf_strength_5",
# "reg_vstate_30",
# "trend_reversal_prob_12",
# "vol_vsi_20",
# "pat_crc_18",
# "vol_varret_25",
# "trend_spr_30",
# "mom_rdi_40",
# "pat_hl_bias_20",
# "vol_vmo_14",
# "reg_skewret_50",
# "trend_wclose_slope_30",
# "pat_wick_sym_12",
# "mom_ntrm_16",
# "reg_rvolent_35",
# "dyn_trend_strength_ratio_50",
# "geo_body_wick_ratio",
# "dyn_trend_reversal_index_40",
# "geo_direction_consistency_10",
# "geo_range_ratio_25",
# "dyn_trend_slope_ema_20",
# "geo_price_position_20",
# "dyn_trend_consistency_25",
# "dyn_trend_angle_30",
# "geo_trend_strength_balance_30",
# "hybrid_vol_mom_regime_score_50",
# "hybrid_entropy_trend_momentum_30",
# "info_directional_variability_25",
# "geo_struct_regime_score_50",
# "hybrid_price_volume_momentum_15",
# "hybrid_price_absorption_strength_25",
# "hybrid_liq_volatility_interplay_20",
# "info_joint_entropy_20",
# "info_return_direction_entropy_40",
# "info_return_entropy_20",
# "liq_depth_ratio_15",
# "liq_absorption_ratio_30",
# "info_range_complexity_30",
# "liq_flow_balance_20",
# "info_volume_entropy_20",
# "liq_spike_persistence_15",
# "liq_pressure_index_25",
# "mom_energy_osc_20",
# "mom_inertia_30",
# "mom_impulse_strength_10",
# "liq_volatility_corr_20",
# "mom_entropy_weighted_roc_15",
# "mom_vol_weighted_roc_15",
# "liq_turnover_rate_20",
# "mom_cumulative_push_20",
# "mom_persistence_index_25",
# "vol_smooth_ratio_40",
# "vol_vol_of_vol_50",
# "trend_nonlin_slope_var_40",
# "trend_swing_efficiency_30",
# "vol_cluster_density_40",
# "vol_dynamic_range_index_30",
# "vol_range_volatility_20",
# "vol_spike_density_25",
# "trend_channel_touch_freq_50",
# "vol_mean_reversion_ratio_30",
# "entropy_return_30",
# "entropy_hour_agg",
# "mom_signed_energy_20",
# "hybrid_volume_trend_conflict_20",
# "geo_price_reversal_flag_15",
# "spec_fft_power_5",
# "spec_dct_energy_10",
# "rob_vol_zscore_50",
# "rob_trimmed_mean_20",
# "rob_rolling_skew_25",
# "relative_strength_pair_20",
# "reg_vol_trend_corr_40",
# "mom_log_return_30",
# "mom_cum_return_50",
# "mom_cmo_14",
# "flag_regime_volatility",
# "flag_price_stagnation_20",
# "flag_breakout_range_50",
# "flag_bearish_harami",
# "ent_volume_spike_30",
# "ent_volatility_state_60",
# "ent_trade_direction_30",
# "ent_return_direction_50",
# "ent_polarity_switch_40",
# "downside_volatility_30",
# "conditional_entropy_30",
# "calmar_ratio_50",
# "beta_to_index_50",
# "band_keltner_width_20",
# "vol_true_range_norm_14",
# "vol_rolling_iqr_20",
# "vol_rel_volume_20",
# "vol_range_ratio_10",
# "vol_garman_klass_20",
# "trend_sma_slope_30",
# "trend_rvi_14",
# "trend_price_angle_15",
# "trend_high_low_diff_14",
# "trend_double_ema_20",
# "spec_autocorr_decay_30",
# "sortino_ratio_30",
# "sharpe_ratio_30",
# "rob_zscore_close_60",
# "rob_quantile_spread_40",
# "rob_median_dev_30",
# "reg_polyfit_curvature_50",
# "reg_lin_r2_50",
# "mom_rsi_rate_14",
# "mom_kurtosis_30",
# "flag_bullish_engulf",
# "corr_with_index_50",
# "band_price_position_20_2",
# "spec_wavelet_var_10",
# "mutual_info_price_volume_20",
# "stat_spec_midband_power_ret30",
# "stat_spec_lowfreq_power_ret60",
# "band_bb_percB_20_2",
# "stat_dominant_freq_idx_30",
# "cycle_vol_coherence_60",
# "entropy30",
# "gap_shock_score_20",
# "fib_prox_gauss_618",
# "mom_kaufman_er_14",
# "mom_acf_lag5_10",
# "entsign_ret_50",
# "mom_drawdown_frac_60",
# "ent_ratio_ohlcv",
# "mom_ratio_5_20",
# "mom_sign_multilag_autocorr_20",
# "mom_sign_var_25",
# "mom_close_range_pos_20",
# "mom_change_spike_ratio_14_50",
# "mom_volscale_count_20",
# "pat_double_switch",
# "pattern_bullish_upper_range_20",
# "pattern_doji_count_10",
# "pv_leadlag_corr_spread_10_40",
# "pattern_peak_count_20",
# "stat_fft_band_ratio_60",
# "stat_downstreak_break_5",
# "range_breakout_intensity_50" ,
# "rollcorr_ret_range20",
# "stat_fft_trend_cycle_ratio_40",
# "risk_max_drawdown_flag_1000",
# "stat_extreme_outlier_30_3sigma",
# "stat_fractal_dim_close_30",
# "mom_streak_duration",
# "stat_intraday_low_dist",
# "stat_inv_spearman_high_low_40",
# "stat_kurtosis_ret_20",
# "stat_lz_complexity_30",
# "stat_norm_sign_entropy_20",
# "stat_leadlag_vol_corr_10_5",
# "stat_lz_complexity_50",
# "stat_nvg_avg_degree_30",
# "stat_price_volume_div_1bar",
# "stat_pairdiff_energy_ratio_16_64",
# "stat_ret_outlier_15",
# "stat_pct_greenbars_64",
# "stat_skewness_ret_30",
# "stat_regime_change_flag_10_10",
# "stat_vol_concentration_20_top3" ,
# "stat_vol_concentration_60_top10" ,
# "stat_winloss_balance_60",
# "trend_candle_dir_mean_15",
# "surprisemove10",
# "stat_winloss_balance_20",
# "trend_chop_index_40",
# "trend_convexity_poly2_20",
# "surprisevol15",
# "trend_curvature_quad_50",
# "trend_ichimoku_tenkan_rel_9",
# "trend_ma_confirmation_5_30",
# "trend_ma_coherence_5_30",
# "trend_local_median_polarity_21",
# "trend_ma_spread_norm_5_20",
# "trend_price_rank_20",
# "trend_roc_norm_7",
# "trend_runlength_20",
# "trend_run_max_12",
# "vol_directional_imbal_20",
# "vol_atr_vol_ratio_14_25",
# "atr2vol",
# "vol_diversity_causal_30_200",
# "vol_burst_tight_range_score_30",
# "vol_burst_period_30",
# "vol_consolidation_flag_30",
# "vol_breakout_threat_60",
# "vol_drop_15",
# "vol_chaikin_norm_10_20",
# "vol_hl_swing_rel_10",
# "vol_liq_range_norm_20",
# "vol_multiscale_range_persist_10_20_40",
# "vol_kvo_norm_34_55",
# "vol_jump_parity_21_2x",
# "vol_price_boundary_touch_7",
# "vol_pairdiff_energy_diff_16_64",
# "vol_persist_rel_10_30",
# "vol_norm_atr14_to_close",
# "vol_price_efficiency_15",
# "vol_price_range_touch_7",
# "vol_range_compress_30_0p5",
# "vol_range_ratio_5_20",
# "vol_ratio_5",
# "vol_rel_last_breakout",
# "vol_range14_to_meanvol25",
# "vol_range_shift_32",
# "vol_shock_energy_32",
# "vol_spike_fast_20" ,
# "vol_rel_last_high_break",
# "vol_skew20",
# "vol_regime_persistence_30",
# "vol_trend_normalized_10",
# "vol_spike_50",
# "vol_stat_rcv_20",
# "vol_tw_range_mean_10",
# "vol_zscore_range_50",
# "mom_macd_hvix_20_12_26",
# "sentiment_momentum_30",
# "order_flow_imbalance_20",
# "price_responsiveness_25",
# "liquidity_rotation_25",
# "lead_lag_volume_price_15",
# "trend_slope_atr_norm_50",
# "roll_hurst_exp_100",
# "geom_curvature_60",
# "geom_slope_abs_40",
# "flag_cci_extreme_200_20",
# "rob_zscore_mad_30",
# "mom_macdsig_hvix_20_12_26_9",
# "vol_range_std_30",
# "reg_slope_ratio_20_60",
# "trend_vol_adj_50",
# "flow_dir_vol_20",
# "rob_close_mad_20",
# "vol_return_var_30",
# "mom_sign_10",
# "anomaly_score_zprice",
# "inter_corr_eth_50",
# "fear_greed_index",
# "inter_sp500_corr_50",
# "inter_market_ratio_gold",
# "liquidity_gap_flag",
# "mean_reversion_index",
# "pattern_three_crows",
# "pattern_marubozu_flag",
# "liquidity_index",
# "regime_switch_flag",
# "price_spread_ratio",
# "rob_range_mad_20",
# "pattern_three_soldiers",
# "rsi_volume_conflict",
# "rob_median_close_20",
# "trend_strength_index",
# "volume_imbalance_index",
# "sentiment_smooth_index",
# "vol_spike_density_30",
# "wave_denoise_close_db4_L2",
# "volume_volatility_correlation",
# "volatility_ratio_15_60",
# "wave_denoise_close_db4_L1",
# "wave_denoise_open_db4_L1",
# "wave_denoise_close_db4_L3",
# "wave_denoise_db4_L4",
# "wave_denoise_open_db4_L2",
]

DoNotUse_FEATURE_COLS = [
    "pivot_dynamic_reversal_score_20",
    "trend_tl_confluence",
    "trend_trendlines_bear_cross_5_10",
    "relative_strength_pair_20",
    "calmar_ratio_50",
    "beta_to_index_50",
    "corr_with_index_50",
    "inter_corr_eth_50",
    "fear_greed_index_5m",
    "inter_sp500_corr_50",
    "inter_market_ratio_gold",
    "sentiment_smooth_index",
]

In [123]:
# === Data Preprocessing Function ===

def preprocess_trading_data(
    df: pd.DataFrame,
    feature_cols: List[str],
    is_train: bool = True,
    scaler_params: Optional[dict] = None
) -> Tuple[pd.DataFrame, dict]:
    """
    Comprehensive preprocessing for trading data with features.
    
    Parameters
    ----------
    df : pd.DataFrame
        DataFrame with 'close' and feature columns.
    feature_cols : list of str
        List of feature column names to preprocess.
    is_train : bool
        If True, compute statistics from data. If False, use provided scaler_params.
    scaler_params : dict, optional
        Dictionary containing scaling parameters (mean, std, min, max, etc.) from training set.
        Required when is_train=False.
    
    Returns
    -------
    df_processed : pd.DataFrame
        Preprocessed DataFrame.
    scaler_params : dict
        Dictionary with scaling parameters for each feature (for use on test set).
    """
    df_processed = df.copy()
    
    if scaler_params is None:
        scaler_params = {}
    
    # 1. Handle infinite values (replace with NaN for proper handling)
    print(f"Processing {'TRAIN' if is_train else 'TEST'} data...")
    for col in feature_cols:
        if col in df_processed.columns:
            # Replace inf/-inf with NaN
            df_processed[col] = df_processed[col].replace([np.inf, -np.inf], np.nan)
    
    # 2. Handle missing values
    # Strategy: Forward fill first (use previous valid value), then backward fill, then fill remaining with median
    for col in feature_cols:
        if col in df_processed.columns:
            n_missing_before = df_processed[col].isna().sum()
            
            if n_missing_before > 0:
                # Forward fill (use last valid observation)
                df_processed[col] = df_processed[col].ffill()
                
                # Backward fill for any remaining NaNs at the start
                df_processed[col] = df_processed[col].bfill()
                
                # If still NaNs exist, fill with median (computed from train or stored)
                if df_processed[col].isna().any():
                    if is_train:
                        median_val = df_processed[col].median()
                        scaler_params[f"{col}_median"] = median_val
                    else:
                        median_val = scaler_params.get(f"{col}_median", 0.0)
                    
                    df_processed[col] = df_processed[col].fillna(median_val)
                
                n_missing_after = df_processed[col].isna().sum()
                if n_missing_before > 0:
                    print(f"  {col}: filled {n_missing_before} missing values -> {n_missing_after} remaining")
    
    # 3. Remove outliers (clip to reasonable ranges based on percentiles)
    # This prevents extreme values from distorting the strategy
    for col in feature_cols:
        if col in df_processed.columns:
            if not pd.api.types.is_numeric_dtype(df_processed[col]):
                continue

        # ⚠️ Boolean هم numeric محسوب می‌شود، پس حذفش کن
            if pd.api.types.is_bool_dtype(df_processed[col]):
                continue

            if is_train:
                # Compute 1st and 99th percentiles
                lower_bound = df_processed[col].quantile(0.01)
                upper_bound = df_processed[col].quantile(0.99)
                scaler_params[f"{col}_lower"] = lower_bound
                scaler_params[f"{col}_upper"] = upper_bound
            else:
                lower_bound = scaler_params.get(f"{col}_lower", df_processed[col].min())
                upper_bound = scaler_params.get(f"{col}_upper", df_processed[col].max())
            
            # Clip values
            df_processed[col] = df_processed[col].clip(lower=lower_bound, upper=upper_bound)
    
    # 4. Standardization (Z-score normalization)
    # This ensures all features have similar scales for threshold mapping
    # 4. Standardization (Z-score normalization)
    for col in feature_cols:
        if col in df_processed.columns:

            if not pd.api.types.is_numeric_dtype(df_processed[col]):
                continue
            if pd.api.types.is_bool_dtype(df_processed[col]):
                continue

            if is_train:
                mean_val = df_processed[col].mean()
                std_val = df_processed[col].std()
                if std_val == 0 or np.isnan(std_val):
                    std_val = 1.0
                scaler_params[f"{col}_mean"] = mean_val
                scaler_params[f"{col}_std"] = std_val
            else:
                mean_val = scaler_params.get(f"{col}_mean", 0.0)
                std_val = scaler_params.get(f"{col}_std", 1.0)
        df_processed[col] = (df_processed[col] - mean_val) / std_val

    # 5. Final check: ensure no NaNs or infs remain
    for col in feature_cols:
        if col in df_processed.columns:
            n_invalid = df_processed[col].isna().sum() + np.isinf(df_processed[col]).sum()
            if n_invalid > 0:
                print(f"  WARNING: {col} still has {n_invalid} invalid values after preprocessing!")
                # Final fallback: fill with 0
                df_processed[col] = df_processed[col].replace([np.inf, -np.inf], 0).fillna(0)
    
    print(f"Preprocessing complete. Shape: {df_processed.shape}")
    print(f"Date range: {df_processed.index[0]} to {df_processed.index[-1]}")
    print(f"Number of features: {len(feature_cols)}")
    
    return df_processed, scaler_params

In [124]:
# Load raw data
df, FEATURE_COLS = load_eth_features("./eth_5m_with_features.csv", list(set(FEATURE_COLS) - set(DoNotUse_FEATURE_COLS)))
df_test, FEATURE_COLS = load_eth_features("./eth_5m_with_features_test.csv", list(set(FEATURE_COLS) - set(DoNotUse_FEATURE_COLS)))

print(f"Raw train data shape: {df.shape}")
print(f"Raw test data shape: {df_test.shape}")

# Apply preprocessing
df, scaler_params = preprocess_trading_data(df, FEATURE_COLS, is_train=True)
df_test, _ = preprocess_trading_data(df_test, FEATURE_COLS, is_train=False, scaler_params=scaler_params)

print(f"\nFinal train data shape: {df.shape}")
print(f"Final test data shape: {df_test.shape}")

Raw train data shape: (52992, 14)
Raw test data shape: (26208, 14)
Processing TRAIN data...
  vol_vwap_20: filled 19 missing values -> 0 remaining
  rob_skew_30: filled 29 missing values -> 0 remaining
  trend_sma_5: filled 4 missing values -> 0 remaining
  mom_cci_20: filled 19 missing values -> 0 remaining
  cand_body_sign_5: filled 4 missing values -> 0 remaining
  trend_lin_slope_50: filled 49 missing values -> 0 remaining
  mom_chande_14: filled 14 missing values -> 0 remaining
  cand_gap_1: filled 1 missing values -> 0 remaining
  rob_zret_60: filled 60 missing values -> 0 remaining
  vol_atr_14: filled 13 missing values -> 0 remaining
Preprocessing complete. Shape: (52992, 14)
Date range: 2025-03-01 00:00:00 to 2025-08-31 23:55:00
Number of features: 13
Processing TEST data...
  vol_vwap_20: filled 19 missing values -> 0 remaining
  rob_skew_30: filled 29 missing values -> 0 remaining
  trend_sma_5: filled 4 missing values -> 0 remaining
  mom_cci_20: filled 19 missing values ->

In [125]:
# 1) Run GA
best_chrom, best_fit = run_ga(df, FEATURE_COLS)
print(f"\nBest fitness found: {best_fit:.6f}")

# 2) Decode and show final rule list
best_rules = decode_chromosome(best_chrom, df, FEATURE_COLS)
compute_condition_thresholds(best_rules, df, FEATURE_COLS)

pretty_print_rules(best_rules, FEATURE_COLS)

# 3) Evaluate on TRAIN and TEST (3-month out-of-sample)
train_equity_curve, final_train_eq, n_train_trades = backtest_rule_list(
    best_rules, df, FEATURE_COLS
)

test_equity_curve, final_test_eq, n_test_trades = backtest_rule_list(
    best_rules, df_test, FEATURE_COLS
)

print(f"Train final equity: {final_train_eq:.2f}, Number of positions: {n_train_trades}")
print(f"Test  final equity: {final_test_eq:.2f}, Number of positions: {n_test_trades}")

# 4) Export best rules to JSON (schema-compliant)
rules_json = {"rules": []}

for rule in best_rules:
    rule_dict = {"conditions": [], "action": {}}

    for cond in rule.conditions:
        feat_name = FEATURE_COLS[cond.feature_idx]
        rule_dict["conditions"].append({
            "feature": feat_name,
            "op": cond.operator,
            "threshold": cond.threshold
        })

    rule_dict["action"] = {
        "side": rule.side,
        "tp": rule.tp,
        "sl": rule.sl,
        "size": rule.size_frac
    }

    rules_json["rules"].append(rule_dict)

with open("rules_G09.json", "w") as f:
    json.dump(rules_json, f, indent=4)

print("Best rules exported to rules_G09.json")

Initial best fitness (from expanded pool): 1121.261070
Generation   1: best fitness = 1121.261070, global best = 1121.261070
Generation   2: best fitness = 1133.706185, global best = 1133.706185
Generation   3: best fitness = 1133.706185, global best = 1133.706185
Generation   4: best fitness = 1133.706185, global best = 1133.706185
Generation   5: best fitness = 1139.325966, global best = 1139.325966
Generation   6: best fitness = 1139.520898, global best = 1139.520898
Generation   7: best fitness = 1139.520898, global best = 1139.520898
Generation   8: best fitness = 1148.583450, global best = 1148.583450
Generation   9: best fitness = 1148.583450, global best = 1148.583450
Generation  10: best fitness = 1148.583450, global best = 1148.583450
Generation  11: best fitness = 1148.583450, global best = 1148.583450
Generation  12: best fitness = 1148.583450, global best = 1148.583450
Generation  13: best fitness = 1148.583450, global best = 1148.583450
Generation  14: best fitness = 1150