# The Supervised Model: From Descriptors to Probabilities

Once each pair of variables has been converted into a 63-dimensional feature vector, causal discovery reduces to a standard **binary classification** task: for each pair, predict the `is_causal` label from its descriptors.

### Class Imbalance
In any sparse graph (such as a gene network), *non-causal* pairs far outnumber *causal* ones. A standard classifier can therefore reach high accuracy by predicting "No Link" every time, while detecting no causal pairs at all.
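
To make this accuracy trap concrete, here is a minimal sketch with made-up class counts (950 non-causal and 50 causal pairs; these numbers are illustrative and not taken from the dataset). A predictor that always answers "No Link" looks accurate yet recovers none of the causal pairs:

```python
# Illustrative labels: 1 = causal, 0 = non-causal (counts are assumed, not real data)
y_true = [0] * 950 + [1] * 50

# A trivial "classifier" that always predicts the majority class
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)  # fraction of causal pairs recovered

print(f"accuracy = {accuracy:.2f}, recall on causal pairs = {recall:.2f}")
```

This is why the per-class recall and precision in a classification report are more informative here than overall accuracy.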

To counter this, we use a **Balanced Random Forest**, which undersamples the majority (non-causal) class in each bootstrap sample so that every tree is trained on a roughly balanced set.
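
The idea behind balanced bootstrapping can be sketched in a few lines of plain Python. This is a simplified illustration, not imbalanced-learn's internal code, and the exact sampling details in the library may differ. The helper name `balanced_bootstrap` and the class counts are hypothetical, and label 1 is assumed to be the minority class:

```python
import random

def balanced_bootstrap(labels, rng):
    """Return row indices for one class-balanced bootstrap sample."""
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    # Bootstrap the minority class (sampling with replacement) ...
    minority_draw = rng.choices(minority, k=len(minority))
    # ... and draw only as many majority rows, so both classes carry equal weight
    majority_draw = rng.choices(majority, k=len(minority))
    return minority_draw + majority_draw

labels = [0] * 950 + [1] * 50  # illustrative class counts
rng = random.Random(42)
sample = balanced_bootstrap(labels, rng)
n_causal = sum(labels[i] for i in sample)
print(len(sample), n_causal)
```

Each tree would be trained on a different sample of this kind, so across the whole forest most majority-class rows are still used even though every individual tree sees a balanced subset.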

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# 1. Load or Generate Data (Small scale for demonstration)
# (In a real scenario, load 'descriptors_df_train.pkl')
from td2c.data_generation.builder import TSBuilder
from td2c.descriptors import DataLoader, D2C

print("Generating dataset...")
builder = TSBuilder(n_variables=5, maxlags=2, observations_per_time_series=150, time_series_per_process=10, verbose=False)
builder.build()
loader = DataLoader(n_variables=5, maxlags=2)
loader.from_tsbuilder(builder)

d2c = D2C(loader.get_observations(), loader.get_dags(), n_variables=5, maxlags=2, n_jobs=1, full=True, dynamic=True, mb_estimator="ts")
d2c.initialize()
df = d2c.get_descriptors_df()

# 2. Split Data
# Note: a random split can put pairs from the same graph in both train and test,
# which leaks information. Splitting by 'graph_id' avoids this; the random split
# below is used only to keep the demo simple.
X = df.drop(columns=["graph_id", "edge_source", "edge_dest", "is_causal"])
y = df["is_causal"]

# stratify=y keeps the causal/non-causal ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# 3. Train Model
clf = BalancedRandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
clf.fit(X_train, y_train)

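# 4. Evaluate
y_pred = clf.predict(X_test)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Class probabilities: column k corresponds to clf.classes_[k]
y_proba = clf.predict_proba(X_test)
print("Classes:", clf.classes_)
print("First five probability rows:\n", y_proba[:5])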
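
The note in step 2 mentions splitting by `graph_id` to avoid leakage. A minimal sketch of such a group-aware split is shown below in plain Python. The helper `split_by_group` and the toy graph IDs are hypothetical and only illustrate the principle:

```python
import random

def split_by_group(groups, test_fraction=0.3, seed=42):
    """Split row indices so that no group appears in both train and test."""
    unique_groups = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique_groups)
    n_test = max(1, round(test_fraction * len(unique_groups)))
    test_groups = set(unique_groups[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx

# Toy example: 10 graphs with 4 candidate pairs each (illustrative values)
graph_ids = [g for g in range(10) for _ in range(4)]
train_idx, test_idx = split_by_group(graph_ids)

train_graphs = {graph_ids[i] for i in train_idx}
test_graphs = {graph_ids[i] for i in test_idx}
print(len(train_idx), len(test_idx), train_graphs & test_graphs)
```

With the descriptor DataFrame, the groups would come from the `graph_id` column and the returned indices could be passed to `.iloc`. scikit-learn's `GroupShuffleSplit` offers the same kind of split for production use.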

### Feature Importance
Does the model actually learn causal structure, or is it just memorizing noise? Analyzing **feature importance** shows which descriptors the forest relies on most.

Typically, we see a mix of:
1.  **Error-based features** (e.g., `parcorr_errors`) capturing linear signals.
2.  **Information-theoretic features** (e.g., `transfer_entropy`) capturing non-linear/asymmetric signals.
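
The importances reported by a random forest's `feature_importances_` attribute are impurity-based: they accumulate how much the splits on each feature reduce impurity, averaged over trees and normalized. As a rough sketch of the quantity being accumulated (toy labels and hypothetical helper names, not the library's code), the Gini impurity decrease of a single split can be computed as follows:

```python
def gini(labels):
    """Gini impurity of a list of binary labels (0 or 1)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def impurity_decrease(parent, left, right):
    """Weighted drop in Gini impurity achieved by splitting parent into left/right."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

# Toy node with 4 causal and 4 non-causal pairs (illustrative values)
parent = [1, 1, 1, 1, 0, 0, 0, 0]
informative = impurity_decrease(parent, [1, 1, 1, 0], [1, 0, 0, 0])
uninformative = impurity_decrease(parent, [1, 1, 0, 0], [1, 1, 0, 0])
print(informative, uninformative)
```

A split that separates the classes better yields a larger decrease, so features that often produce such splits rank higher. Because these scores are computed on training data and can favor features with many distinct values, permutation importance on held-out data is a common cross-check.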

In [None]:
# Extract Feature Importances
importances = clf.feature_importances_
feature_names = X.columns
feat_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
feat_df = feat_df.sort_values(by='Importance', ascending=False).head(15)

# Plot
plt.figure(figsize=(10, 8))
sns.barplot(x='Importance', y='Feature', data=feat_df, palette='viridis')
plt.title("Top 15 Most Important Causal Descriptors")
plt.xlabel("Gini Importance")
plt.show()