# Feature Engineering: Operationalizing Causality

In this notebook, we detail the construction of the feature vector used by TD2C. The model combines evidence from four domains: information theory, error analysis, higher-order statistics, and linear models.

## 1. Generalized Transfer Entropy

Motivated by the path asymmetry hypothesis, we define a set of descriptors based on a generalized form of Transfer Entropy (TE). We measure how much information the cause's immediate past ($\mathbf{z}_i^{(t-1)}$) carries about the effect ($\mathbf{z}_j^{(t)}$), while varying the lag $k$ of the effect's own past that we condition on.

For each lag $k \in [k_{min}, k_{max}]$, we compute:

$$
\text{TE}_{\text{fwd}}^{(k)} = I(\mathbf{z}_i^{(t-1)}; \mathbf{z}_j^{(t)} \mid \mathbf{z}_j^{(t-k)})
$$

$$
\text{TE}_{\text{bwd}}^{(k)} = I(\mathbf{z}_j^{(t-1)}; \mathbf{z}_i^{(t)} \mid \mathbf{z}_i^{(t-k)})
$$

The primary **asymmetry feature** averages the forward-minus-backward difference over all lags:

$$
\text{TE}_{\text{asy}} = \mathbb{E}_{k \in [k_{min}, k_{max}]}\left[\text{TE}_{\text{fwd}}^{(k)} - \text{TE}_{\text{bwd}}^{(k)}\right]
$$
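
The estimator used for the conditional mutual information is not specified above. As a minimal sketch, assuming jointly Gaussian variables (so that $I(X;Y \mid Z) = -\tfrac{1}{2}\log(1-\rho_{XY\cdot Z}^2)$, with $\rho_{XY\cdot Z}$ the partial correlation), the three quantities could be computed as follows. The helper names `gaussian_cmi` and `te_asymmetry`, the default lag range, and the toy data are illustrative assumptions, not part of the TD2C API.

```python
import numpy as np

def gaussian_cmi(x, y, z):
    """I(x; y | z) in nats, ASSUMING joint Gaussianity (illustrative estimator)."""
    # Regress x and y on z (with intercept) and correlate the residuals:
    # this yields the partial correlation of x and y given z.
    Z = np.column_stack([np.ones(len(z)), z])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    r = np.corrcoef(rx, ry)[0, 1]
    return -0.5 * np.log(1.0 - r**2)

def te_asymmetry(zi, zj, k_min=1, k_max=3):
    """Return (TE_fwd per lag, TE_bwd per lag, TE_asy) for k in [k_min, k_max], k_min >= 1."""
    fwd, bwd = [], []
    for k in range(k_min, k_max + 1):
        t = np.arange(k, len(zi))  # time indices for which t-1 and t-k both exist
        fwd.append(gaussian_cmi(zi[t - 1], zj[t], zj[t - k]))
        bwd.append(gaussian_cmi(zj[t - 1], zi[t], zi[t - k]))
    fwd, bwd = np.array(fwd), np.array(bwd)
    return fwd, bwd, float(np.mean(fwd - bwd))

# Toy data (hypothetical): z_i drives z_j with a one-step delay.
rng = np.random.default_rng(0)
n = 2000
zi, zj = np.zeros(n), np.zeros(n)
for t in range(1, n):
    zi[t] = 0.5 * zi[t - 1] + rng.normal()
    zj[t] = 0.8 * zi[t - 1] + 0.3 * zj[t - 1] + rng.normal()

fwd, bwd, asy = te_asymmetry(zi, zj)
print("TE_fwd per lag:", fwd)
print("TE_bwd per lag:", bwd)
print("TE_asy:", asy)
```

On this toy system the forward terms dominate the backward ones, so the averaged asymmetry comes out positive. A nonparametric estimator (for example a k-nearest-neighbor one) would be needed for nonlinear dependencies, since the Gaussian formula only captures linear ones.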

## 2. Error-Based Descriptors

Inspired by Additive Noise Models (ANMs), these features look for structure in the residuals.

1.  **Partial Correlation:** Isolates the direct linear association by removing the effect of temporal neighbors $\mathbf{S}_{ij}^{(t)} = \mathbf{M}_i^{(t)} \cup \mathbf{M}_j^{(t)}$.
    $$ \rho_{\text{partial}} = \text{Corr}(\mathbf{z}_i^{(t)} - f_i(\mathbf{S}_{ij}^{(t)}), \mathbf{z}_j^{(t)} - f_j(\mathbf{S}_{ij}^{(t)})) $$

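2.  **Residual-Input Correlation:** Tests whether the effect's residual, after regressing the effect on the cause and $\mathbf{M}_i^{(t)}$, is still correlated with the cause. In ANMs, a residual that is independent of the input is the signature of the correct causal direction.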
    $$ \rho_{\text{resid}} = \text{Corr}(\mathbf{z}_j^{(t)} - f_{\text{Ridge}}(\mathbf{z}_i^{(t)}, \mathbf{M}_i^{(t)}), \mathbf{z}_i^{(t)}) $$
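
Both descriptors above reduce to "regress, then correlate residuals". The sketch below is one possible reading, assuming that $f_i$ and $f_j$ are (near-)unregularized linear fits and that $f_{\text{Ridge}}$ is a standard closed-form ridge regression. The function names, the `alpha` values, and the toy data are assumptions made for illustration and are not taken from the TD2C code.

```python
import numpy as np

def ridge_fit_predict(X, y, alpha=1.0):
    """In-sample ridge prediction via the closed form; the intercept is not penalized."""
    X = np.asarray(X, dtype=float).reshape(len(y), -1)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(Xc.shape[1]), Xc.T @ (y - y_mean))
    return y_mean + Xc @ coef

def partial_correlation(zi, zj, S, alpha=1e-8):
    """Correlation of z_i and z_j after removing what S explains linearly."""
    ri = zi - ridge_fit_predict(S, zi, alpha)
    rj = zj - ridge_fit_predict(S, zj, alpha)
    return float(np.corrcoef(ri, rj)[0, 1])

def residual_input_correlation(zi, zj, Mi, alpha=1.0):
    """Correlation between the residual of z_j (given z_i and M_i) and z_i."""
    X = np.column_stack([zi, Mi])
    resid = zj - ridge_fit_predict(X, zj, alpha)
    return float(np.corrcoef(resid, zi)[0, 1])

# Toy data (hypothetical): S confounds z_i and z_j; z_j_linked also depends on z_i.
rng = np.random.default_rng(1)
n = 5000
S = rng.normal(size=(n, 2))
zi = S @ np.array([1.0, 0.5]) + rng.normal(size=n)
zj = S @ np.array([-1.0, 2.0]) + rng.normal(size=n)
zj_linked = zj + 0.9 * zi

rho_indep = partial_correlation(zi, zj, S)          # only confounded by S
rho_linked = partial_correlation(zi, zj_linked, S)  # direct link present
rho_resid = residual_input_correlation(zi, zj_linked, S[:, :1])
print(rho_indep, rho_linked, rho_resid)
```

Here the partial correlation is close to zero when the only shared signal comes through `S`, and clearly nonzero once a direct link is added. Note that with a linear fit the Pearson correlation between the residual and a regressor is close to zero by construction, so a nonlinear dependence measure may be more informative in practice.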

## 3. Higher-Order Moments

To capture non-Gaussianity, we compute higher-order cross-moments between standardized variables. For non-negative integer exponents $p$ and $q$:

$$
\text{HOC}_{p,q} = \mathbb{E}\left[\left(\frac{\mathbf{z}_i^{(t)} - \mu_i}{\sigma_i}\right)^p \left(\frac{\mathbf{z}_j^{(t)} - \mu_j}{\sigma_j}\right)^q\right]
$$

Specific features include $\text{HOC}_{3,1}$, $\text{HOC}_{1,2}$, Skewness, and Kurtosis.
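
As a minimal sketch, the moment above can be estimated by replacing the expectation with a sample mean. The two subscripts of $\text{HOC}$ are treated as the exponents applied to the standardized $\mathbf{z}_i$ and $\mathbf{z}_j$ (named `p` and `q` here); the function name `hoc` and the toy arrays are illustrative assumptions.

```python
import numpy as np

def hoc(zi, zj, p, q):
    """Sample estimate of E[u_i**p * u_j**q] for standardized u_i, u_j."""
    ui = (zi - zi.mean()) / zi.std()  # population std (ddof=0)
    uj = (zj - zj.mean()) / zj.std()
    return float(np.mean(ui**p * uj**q))

# Toy data (hypothetical)
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([0.5, -1.0, 0.2, 1.5, 3.0])

hoc_11 = hoc(x, y, 1, 1)   # equals the Pearson correlation
skew_x = hoc(x, y, 3, 0)   # q = 0 leaves the skewness of x
hoc_31 = hoc(x, y, 3, 1)
hoc_12 = hoc(x, y, 1, 2)
print(hoc_11, skew_x, hoc_31, hoc_12)
```

Two sanity checks follow from the definition: with exponents (1, 1) the value coincides with the Pearson correlation, and with exponents (3, 0) it reduces to the skewness of the first variable, which is zero for the symmetric `x` used here.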

## 4. The Complete Descriptor Set

The final feature vector is a combination of the novel descriptors above and legacy D2C descriptors tailored for time series. Below is the full mathematical definition of the feature space.

### Information Flow & Transfer Entropy
| Feature Name | Mathematical Formulation |
| :--- | :--- |
| TE Asymmetry | $\mathbb{E}_{k}\left[I(\mathbf{z}_i^{(t-1)}; \mathbf{z}_j^{(t)} \mid \mathbf{z}_j^{(t-k)}) - I(\mathbf{z}_j^{(t-1)}; \mathbf{z}_i^{(t)} \mid \mathbf{z}_i^{(t-k)})\right]$ |
| TE Forward (Lag 1) | $I(\mathbf{z}_i^{(t-1)}; \mathbf{z}_j^{(t)} \mid \mathbf{z}_j^{(t-1)})$ |
| TE Backward (Lag 1) | $I(\mathbf{z}_j^{(t-1)}; \mathbf{z}_i^{(t)} \mid \mathbf{z}_i^{(t-1)})$ |
| TE Difference | $\text{TE}_{\text{fwd}}^{(1)} - \text{TE}_{\text{bwd}}^{(1)}$ |

### Statistical & Error-Based
| Feature Name | Mathematical Formulation |
| :--- | :--- |
| Partial Correlation | $\operatorname{Corr}(\mathbf{z}_i - \operatorname{proj}_{\mathbf{MB}_i}(\mathbf{z}_i), \mathbf{z}_j - \operatorname{proj}_{\mathbf{MB}_j}(\mathbf{z}_j))$ |
| Error-Input Corr | $\operatorname{Corr}(\mathbf{z}_j - \operatorname{proj}_{\mathbf{MB}_i}(\mathbf{z}_j), \mathbf{z}_i)$ |
| Linear Coeff (fwd) | $b$ from $\mathbf{z}_j = b \cdot \mathbf{z}_i + \sum \mathbf{m}_j \cdot \beta + \epsilon$ |
| Linear Coeff (bwd) | $b$ from $\mathbf{z}_i = b \cdot \mathbf{z}_j + \sum \mathbf{m}_i \cdot \beta + \epsilon$ |
| Cross-Moments | $\text{HOC}_{3,1}, \text{HOC}_{1,2}, \text{HOC}_{2,1}, \text{HOC}_{1,3}$ |
| Higher Order Stats | Kurtosis($\mathbf{z}_i$), Kurtosis($\mathbf{z}_j$), Skewness($\mathbf{z}_i$), Skewness($\mathbf{z}_j$) |
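
The two linear coefficients in the table can be read as the weight on the other variable in a multiple regression that also includes the target's Markov blanket members. The sketch below assumes ordinary least squares with an intercept; the table does not say whether the fit is regularized, and the function name `linear_coeff` and the toy data are hypothetical.

```python
import numpy as np

def linear_coeff(target, source, M):
    """Coefficient b on `source` in: target = a + b * source + M @ beta + eps (OLS)."""
    X = np.column_stack([np.ones(len(target)), source, M])
    beta = np.linalg.lstsq(X, target, rcond=None)[0]
    return float(beta[1])  # index 0 is the intercept, index 1 is b

# Toy data (hypothetical): z_j depends on z_i with true coefficient 2.0.
rng = np.random.default_rng(2)
n = 1000
Mj = rng.normal(size=(n, 2))   # stand-in for Markov blanket members of z_j
Mi = rng.normal(size=(n, 1))   # stand-in for Markov blanket members of z_i
zi = rng.normal(size=n)
zj = 2.0 * zi + Mj @ np.array([0.5, -1.0]) + 0.1 * rng.normal(size=n)

b_fwd = linear_coeff(zj, zi, Mj)  # z_j regressed on z_i and m_j
b_bwd = linear_coeff(zi, zj, Mi)  # z_i regressed on z_j and m_i
print(b_fwd, b_bwd)
```

On this toy example `b_fwd` recovers a value close to the true coefficient of 2.0.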

### Markov Blanket Interactions (Information Theoretic)
*Note: $\mathbf{MB}_i$ denotes the Markov blanket of $\mathbf{z}_i$, and $\mathbf{m}_i$ refers to a member of $\mathbf{MB}_i$.*

| Feature Concept | Formula |
| :--- | :--- |
| **Direct Interaction** | $I(\mathbf{z}_i; \mathbf{z}_j)$, $I(\mathbf{z}_i; \mathbf{z}_j \mid \mathbf{MB}_i \cap \mathbf{MB}_j)$ |
| **Conditioned on MB** | $I(\mathbf{z}_j; \mathbf{z}_i \mid \mathbf{MB}_j)$, $I(\mathbf{z}_i; \mathbf{z}_j \mid \mathbf{MB}_i)$ |
| **Cause's MB stats** | Mean/Std of $\{I(\mathbf{z}_i; \mathbf{m}_j) \mid \mathbf{m}_j \in \mathbf{MB}_j\}$ |
| **Effect's MB stats** | Mean/Std of $\{I(\mathbf{z}_j; \mathbf{m}_i) \mid \mathbf{m}_i \in \mathbf{MB}_i\}$ |
| **Interaction 3-way** | Mean/Std of $\{I(\mathbf{m}_i; \mathbf{m}_j \mid \mathbf{z}_j)\}$ |
| **Temporal Parents/Children** | $I(\mathbf{z}_i^{(t-1)}; \mathbf{z}_j^{(t)} \mid \mathbf{z}_j^{(t-1)})$, $I(\mathbf{z}_i^{(t-1)}; \mathbf{z}_j^{(t+1)} \mid \mathbf{z}_j^{(t)})$ |
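
The "Mean/Std" rows summarize a set of pairwise mutual information values with two numbers. A minimal sketch for the "Cause's MB stats" row follows, assuming a Gaussian mutual information estimate, $I(X;Y) = -\tfrac{1}{2}\log(1-\rho^2)$; the estimator actually used is not stated in the table. The names `gaussian_mi` and `mb_mi_stats` and the stand-in Markov blanket are hypothetical.

```python
import numpy as np

def gaussian_mi(x, y):
    """I(x; y) in nats, ASSUMING joint Gaussianity (illustrative estimator)."""
    r = np.corrcoef(x, y)[0, 1]
    return -0.5 * np.log(1.0 - r**2)

def mb_mi_stats(z, MB):
    """Mean and std of I(z; m) over the columns m of MB (n_samples x n_members)."""
    values = np.array([gaussian_mi(z, MB[:, c]) for c in range(MB.shape[1])])
    return float(values.mean()), float(values.std())

# Toy data (hypothetical): one blanket member is related to z_i, the other is not.
rng = np.random.default_rng(3)
n = 2000
zi = rng.normal(size=n)
MB_j = np.column_stack([zi + rng.normal(size=n), rng.normal(size=n)])

mean_mi, std_mi = mb_mi_stats(zi, MB_j)
print(mean_mi, std_mi)
```

The same helper applies to the "Effect's MB stats" row by passing $\mathbf{z}_j$ and the members of $\mathbf{MB}_i$ instead.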

In [None]:
# Inspect the features generated for a handful of variable pairs
from IPython.display import display  # explicit import so the cell also runs outside Jupyter
from td2c.descriptors import D2C

# Re-using the engine from the Quick Start:
# `df_train` is the descriptors DataFrame generated earlier
feature_df = df_train.head(5)

# Show information-theoretic features (selected by column-name substring)
mi_cols = [c for c in feature_df.columns if "entropy" in c or "mean" in c]
print("Subset of Information Theoretic Descriptors:")
display(feature_df[mi_cols])

# Show error-based features (selected by column-name substring)
err_cols = [c for c in feature_df.columns if "resid" in c or "error" in c]
print("\nSubset of Error-based Descriptors:")
display(feature_df[err_cols])

### Key Descriptor: Generalized Transfer Entropy
One of our strongest features is the asymmetry in Transfer Entropy at various lags:

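$$ \Delta TE = I(X_{t-1} ; Y_t \mid Y_{t-k}) - I(Y_{t-1} ; X_t \mid X_{t-k}) $$

Here $X$ and $Y$ play the roles of $\mathbf{z}_i$ and $\mathbf{z}_j$, and $k$ is the conditioning lag. If $X \to Y$, this value tends to be positive; if $Y \to X$, it tends to be negative; if there is no link in either direction, it is near zero.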