# GTWR + GNN for Spatiotemporal Socioeconomic Forecasting


**Abstract.** We study a hybrid approach that couples Geographically and Temporally Weighted Regression (GTWR) with a learnable Graph Neural Network (GNN) prior to forecast regional socioeconomic indicators. This notebook documents the full pipeline in a paper-like narrative: we motivate the method, introduce the mathematical formulations, and accompany each section with executable code. The workflow follows the experimental flow previously prototyped in `Untitled-1-.ipynb`, but all supporting utilities are reproduced inline so the analysis is self-contained.


## 1. Introduction
Spatiotemporal panels often exhibit location-dependent dependencies that evolve across time. GTWR extends classical regression by assigning kernel weights that decay with spatial and temporal distance, yielding localized parameter estimates. Recent advances in graph representation learning suggest replacing fixed kernels with learnable adjacency structures. We therefore blend a GTWR-style local Weighted Least Squares (WLS) solver with a neural encoder that adapts the weight matrix from data, enabling flexible propagation of information across space-time while retaining interpretability.

This notebook formalizes the approach, details the dataset, and benchmarks several training configurations along with out-of-sample (OOS) evaluation protocols.


## 2. Data Description and Panel Construction
We work with the `Data BPS Laporan KP - Coded.xlsx` panel spanning yearly observations. Each record contains latitude (`lat`), longitude (`lon`), a time index (`Tahun`), the target response (`y`), and eight covariates (`X1`--`X8`). Our goal is to model 2019--2022 (train/validation/test split) and reserve 2023 for OOS evaluation. Before touching the data we discuss how balanced panels are formed.

### Theory: Balanced Spatiotemporal Panel
For a set of times $\mathcal{T}$ and locations $\mathcal{L}$ we require that each $(t, \ell) \in \mathcal{T} \times \mathcal{L}$ appears exactly once. Raw administrative data may include missing locations per year; naively truncating to the first $N$ rows per year (as our earlier prototype did) silently misaligns coordinates. Instead we intersect the coordinate keys across all years to guarantee consistent ordering:

1. Sort within each year by latitude and longitude.
2. Compute the intersection of location keys across times.
3. Align rows by the joint key so that feature and target arrays line up across years.

This ensures that downstream kernels and GNN weights operate on coherent spatial indices.


In [1]:
import math
import os
import random
from pathlib import Path

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from IPython.display import display
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

warnings_enabled = False
if not warnings_enabled:
    import warnings
    warnings.filterwarnings("ignore")

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device in use: {device}")
BASE_DIR = Path(".").resolve()
DATA_PATH = BASE_DIR / "Data BPS Laporan KP - Coded.xlsx"


Device in use: cpu


### Theory: Safe Invocation & Evaluation Metrics
Reusable experimentation benefits from helper routines. `safe_call` filters keyword arguments based on function signatures, protecting against API drift. We also define tensor-to-NumPy conversion and standard regression metrics (RMSE, MAE, $R^2$) for consistent reporting.


In [2]:
import inspect

def safe_call(fn, /, *args, **kwargs):
    """Call `fn` while discarding unexpected keyword arguments."""
    sig = inspect.signature(fn)
    allowed = {k: v for k, v in kwargs.items() if k in sig.parameters}
    return fn(*args, **allowed)

def to_numpy(x):
    if torch.is_tensor(x):
        return x.detach().cpu().numpy()
    return np.asarray(x)

def regression_metrics(y_true, y_pred):
    if y_true is None or y_pred is None or len(y_true) == 0:
        return math.nan, math.nan, math.nan
    return (
        float(np.sqrt(mean_squared_error(y_true, y_pred))),
        float(mean_absolute_error(y_true, y_pred)),
        float(r2_score(y_true, y_pred)),
    )


### Theory: Data Loading and Panel Balancing
We encapsulate data ingestion in two steps:

- `load_panel_xlsx` reads the Excel file, selects relevant columns, and drops incomplete rows.
- `build_panel_arrays` aligns coordinates across years. Let $\mathbf{X}_t \in \mathbb{R}^{N \times p}$, $\mathbf{y}_t\in\mathbb{R}^N$, and $\mathbf{c}_t \in \mathbb{R}^{N \times 2}$ denote features, targets, and coordinates for year $t$. The function returns stacked arrays $\mathbf{X} \in \mathbb{R}^{TN \times p}$ and $\mathbf{y} \in \mathbb{R}^{TN}$ alongside per-year blocks.

Algorithm highlights:
1. Compute location keys $(lat, lon)$ for every record.
2. Build the intersection of keys across all requested years.
3. Reindex each yearly block using the shared key ordering.
4. Stack the balanced blocks.


In [3]:
def load_panel_xlsx(path_xlsx, lat_col, lon_col, time_col, target_col, feature_cols):
    df = pd.read_excel(path_xlsx)
    cols = [lat_col, lon_col, time_col, target_col] + list(feature_cols)
    df = df[cols].dropna().copy()
    return df

def build_panel_arrays(df, time_col, target_col, feature_cols, lat_col, lon_col, times_sorted=None, atol=1e-6):
    if times_sorted is None:
        times_sorted = sorted(df[time_col].unique())

    df_sorted = df.sort_values([time_col, lat_col, lon_col]).reset_index(drop=True)
    df_sorted["coord_key"] = list(zip(df_sorted[lat_col], df_sorted[lon_col]))

    agg_spec = {target_col: "mean", lat_col: "first", lon_col: "first"}
    agg_spec.update({feat: "mean" for feat in feature_cols})

    blocks_by_time = {}
    keys_per_time = {}
    for t in times_sorted:
        block = df_sorted[df_sorted[time_col] == t].copy()
        block = block.groupby("coord_key", sort=True).agg(agg_spec)
        block.reset_index(drop=False, inplace=True)
        if block["coord_key"].duplicated().any():
            raise ValueError(f"Duplicate coordinate keys remain for year {t}; check data quality.")
        blocks_by_time[t] = block
        keys_per_time[t] = set(block["coord_key"])

    shared_keys = set.intersection(*keys_per_time.values()) if keys_per_time else set()
    if len(shared_keys) == 0:
        raise ValueError("No common coordinates found across all time periods; cannot build balanced panel.")

    key_list = sorted(shared_keys, key=lambda xy: (round(xy[0] / atol) * atol, round(xy[1] / atol) * atol))

    X_blocks, y_blocks, C_blocks = [], [], []
    for t in times_sorted:
        block = blocks_by_time[t].set_index("coord_key").reindex(key_list)
        block.reset_index(drop=False, inplace=True)
        X_blocks.append(block[feature_cols].values.astype(np.float32))
        y_blocks.append(block[target_col].values.astype(np.float32))
        C_blocks.append(block[[lat_col, lon_col]].values.astype(np.float32))

    X_all = np.vstack(X_blocks)
    y_all = np.concatenate(y_blocks)
    coords_all = np.vstack(C_blocks)

    return {
        "X_all": X_all,
        "y_all": y_all,
        "coords_all": coords_all,
        "coords_blocks": C_blocks,
        "times": np.array(times_sorted),
        "N_per_year": X_blocks[0].shape[0],
    }

def year_rows(times_sorted, N_per_year, target_year):
    offsets = []
    cursor = 0
    for t in times_sorted:
        if t == target_year:
            offsets.extend(range(cursor, cursor + N_per_year))
        cursor += N_per_year
    return np.array(offsets, dtype=int)

def split_train_val_test(times_sorted, N_per_year, use_val=True):
    test_year = times_sorted[-1]
    val_year = times_sorted[-2] if use_val else None
    train_years = times_sorted[:-2] if use_val else times_sorted[:-1]
    train_rows = np.concatenate([year_rows(times_sorted, N_per_year, y) for y in train_years])
    val_rows = year_rows(times_sorted, N_per_year, val_year) if use_val else np.array([], dtype=int)
    test_rows = year_rows(times_sorted, N_per_year, test_year)
    return {
        "train_rows": train_rows,
        "val_rows": val_rows,
        "test_rows": test_rows,
        "train_years": train_years,
        "val_year": val_year,
        "test_year": test_year,
    }


## 3. Spatiotemporal Kernel Prior
GTWR uses distance-based kernels to assign weights. We adopt a separable Gaussian kernel across space and time:

$$K_{ij} = \exp\left(-\tfrac{1}{2}\left(\frac{d^{\text{geo}}_{ij}}{h_S}\right)^2\right) \cdot \exp\left(-\tfrac{1}{2}\left(\frac{|t_i - t_j|}{h_T}\right)^2\right),$$

where $h_S$ and $h_T$ are bandwidths. We estimate $h_S$ from the median pairwise great-circle distance (haversine) and $h_T$ from temporal gaps. To encourage sparsity we keep the top-$k$ neighbors per row and renormalize to obtain a row-stochastic prior matrix $A_{prior}$.


In [4]:
def haversine(lat1, lon1, lat2, lon2):
    R = 6371.0
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * R * np.arcsin(np.sqrt(a))

def _pairwise_block(coords_a, coords_b):
    n_a, n_b = len(coords_a), len(coords_b)
    out = np.zeros((n_a, n_b), dtype=np.float64)
    for i in range(n_a):
        lat1, lon1 = coords_a[i]
        for j in range(n_b):
            lat2, lon2 = coords_b[j]
            out[i, j] = haversine(lat1, lon1, lat2, lon2)
    return out

def _sparsify_knn(W, k_neighbors, self_weight):
    n = W.shape[0]
    if n <= 1:
        return np.eye(n) * self_weight
    k_eff = max(1, min(k_neighbors, n - 1))
    W_sparse = np.zeros_like(W)
    for i in range(n):
        row = W[i].copy()
        row[i] = -np.inf
        idx = np.argpartition(-row, kth=k_eff - 1)[:k_eff]
        W_sparse[i, idx] = W[i, idx]
    np.fill_diagonal(W_sparse, self_weight)
    row_sum = W_sparse.sum(axis=1, keepdims=True)
    return W_sparse / np.where(row_sum > 0, row_sum, 1.0)

def build_spatiotemporal_kernel(coords_blocks, times, tau_s=1.0, tau_t=1.0, k_neighbors=8, prior_self_weight=1.0, verbose=False):
    times = np.array(times, dtype=float)
    bandwidth_samples = []
    for coords in coords_blocks:
        if len(coords) > 1:
            D = _pairwise_block(coords, coords)
            bandwidth_samples.append(D[D > 0])
    if bandwidth_samples:
        hS = np.median(np.concatenate(bandwidth_samples))
    else:
        hS = 1.0
    hS = max(hS / max(tau_s, 1e-6), 1e-6)

    Dt = np.abs(times[:, None] - times[None, :])
    hT = np.median(Dt[Dt > 0]) if (Dt > 0).any() else 1.0
    hT = max(hT / max(tau_t, 1e-6), 1e-6)

    N_per_year = coords_blocks[0].shape[0]
    total = N_per_year * len(coords_blocks)
    W_full = np.zeros((total, total), dtype=np.float64)
    row_offset = 0
    for i, (Ci, ti) in enumerate(zip(coords_blocks, times)):
        col_offset = 0
        for j, (Cj, tj) in enumerate(zip(coords_blocks, times)):
            Dij = _pairwise_block(Ci, Cj)
            Ks = np.exp(-0.5 * (Dij / hS) ** 2)
            Kt = math.exp(-0.5 * ((abs(ti - tj) / hT) ** 2))
            W_full[row_offset:row_offset + N_per_year, col_offset:col_offset + N_per_year] = Ks * Kt
            col_offset += N_per_year
        row_offset += N_per_year

    W_sparse = _sparsify_knn(W_full, k_neighbors, prior_self_weight)
    if verbose:
        sparsity = np.mean(W_sparse == 0)
        print(f"Kernel constructed with sparsity {sparsity:.3f}")
    return W_sparse


## 4. Local Weighted Least Squares (WLS)
Given a weight matrix $W \in \mathbb{R}^{N \times N}$ and covariates $X$, GTWR solves
$$\hat{\beta}_i = \arg\min_\beta \sum_j W_{ij} (y_j - x_j^\top \beta)^2 + \lambda \|\beta\|_2^2,$$
which yields local predictions $\hat{y}_i = x_i^\top \hat{\beta}_i$. We implement ridge and iteratively reweighted Huber variants, exposed via `solve_local_wls`.


In [5]:
def local_wls_ridge(X, y, W, ridge=5.0, return_betas=True):
    N, p = X.shape
    device = X.device
    I = ridge * torch.eye(p, device=device)
    y_hat = torch.zeros(N, device=device)
    betas = torch.zeros(N, p, device=device) if return_betas else None
    for i in range(N):
        w = W[i]
        ws = torch.sqrt(w + 1e-12)
        Xw = X * ws.unsqueeze(1)
        yw = y * ws
        XtWX = Xw.t() @ Xw + I
        XtWy = Xw.t() @ yw
        try:
            beta = torch.linalg.solve(XtWX, XtWy)
        except RuntimeError:
            beta = torch.linalg.lstsq(XtWX, XtWy.unsqueeze(1)).solution.squeeze()
        y_hat[i] = X[i] @ beta
        if return_betas:
            betas[i] = beta
    return (y_hat, betas) if return_betas else y_hat

def local_wls_huber(X, y, W, ridge=5.0, delta=1.0, iters=3, return_betas=True):
    N, p = X.shape
    device = X.device
    I = ridge * torch.eye(p, device=device)
    y_hat = torch.zeros(N, device=device)
    betas = torch.zeros(N, p, device=device) if return_betas else None
    for i in range(N):
        w = W[i].clone()
        beta = None
        for _ in range(iters):
            ws = torch.sqrt(w + 1e-12)
            Xw = X * ws.unsqueeze(1)
            yw = y * ws
            XtWX = Xw.t() @ Xw + I
            XtWy = Xw.t() @ yw
            try:
                beta = torch.linalg.solve(XtWX, XtWy)
            except RuntimeError:
                beta = torch.linalg.lstsq(XtWX, XtWy.unsqueeze(1)).solution.squeeze()
            residual = y - X @ beta
            abs_res = torch.abs(residual) + 1e-12
            w = W[i] * torch.where(abs_res <= delta, torch.ones_like(abs_res), (delta / abs_res))
        y_hat[i] = X[i] @ beta
        if return_betas:
            betas[i] = beta
    return (y_hat, betas) if return_betas else y_hat

def solve_local_wls(X, y, W, kind="ridge", ridge=5.0, huber_delta=1.0, huber_iters=3, return_betas=True):
    if kind == "ridge":
        return local_wls_ridge(X, y, W, ridge=ridge, return_betas=return_betas)
    if kind == "huber":
        return local_wls_huber(X, y, W, ridge=ridge, delta=huber_delta, iters=huber_iters, return_betas=return_betas)
    raise ValueError(f"Unknown WLS kind: {kind}")


## 5. GNN Weight Generator
We parameterize the adaptive weight matrix via `MathematicallyCorrectGNNWeightNet`, a lightweight encoder producing embeddings $H = f_\theta(X)$. Cosine similarity combined with a learnable temperature $\tau$ yields logits, which are blended with the log prior using a mixing coefficient $\alpha$ (constrained to $(0,1)$ via a sigmoid). A row-wise softmax enforces stochasticity.

This design is inspired by attention mechanisms yet keeps the prior in log-space to avoid degenerate weights.


In [6]:
class MathematicallyCorrectGNNWeightNet(nn.Module):
    def __init__(self, d_in, spa_hid=32, emb=16, tau=1.2, alpha_init=0.30):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(d_in, spa_hid),
            nn.ReLU(),
            nn.Linear(spa_hid, emb),
        )
        self.log_tau = nn.Parameter(torch.log(torch.tensor(float(tau))))
        self.raw_alpha = nn.Parameter(torch.tensor(math.log(alpha_init / (1.0 - alpha_init))))
        nn.init.kaiming_uniform_(self.encoder[0].weight, a=math.sqrt(5))
        nn.init.zeros_(self.encoder[0].bias)
        nn.init.xavier_uniform_(self.encoder[2].weight)
        nn.init.zeros_(self.encoder[2].bias)

    @property
    def tau(self):
        return torch.exp(self.log_tau).clamp(min=0.1, max=10.0)

    @property
    def alpha(self):
        return torch.sigmoid(self.raw_alpha)

    def forward(self, X, A_prior):
        H = self.encoder(X)
        H_norm = F.normalize(H, p=2, dim=1)
        S = H_norm @ H_norm.t()
        logits = S / self.tau
        log_prior = torch.log(A_prior + 1e-12)
        log_blend = self.alpha * log_prior + (1.0 - self.alpha) * logits
        W = F.softmax(log_blend, dim=1)
        return W, H

def _row_normalize(W, eps=1e-12):
    row_sum = W.sum(dim=1, keepdim=True)
    return W / (row_sum + eps)

def topk_rows(W, k):
    if k is None:
        return W
    n = W.shape[0]
    k_eff = max(1, min(k, n))
    values, indices = torch.topk(W, k_eff, dim=1)
    mask = torch.zeros_like(W)
    mask.scatter_(1, indices, 1.0)
    W_pruned = W * mask
    return _row_normalize(W_pruned)

def symmetrize_rows(W):
    W_sym = 0.5 * (W + W.t())
    W_sym = torch.clamp(W_sym, min=0.0)
    return _row_normalize(W_sym)


## 6. Training Objective
We minimize a composite loss over training rows:

1. **Supervised MSE** between predictions and labels on train indices.
2. **Entropy regularizer** encouraging diffuse weights (controlled by `ent_w`).
3. **Spatial smoothness** on per-year regression coefficients: for year $t$,
   $$\mathcal{L}_{smooth} = \sum_{i,j} A^{(t)}_{ij} \|\beta^{(t)}_i - \beta^{(t)}_j\|^2,$$
   scaled by `smooth_w`.

Early stopping monitors validation RMSE (or training RMSE when validation is absent). We expose `graph_topk` and `graph_symmetrize` to post-process the learned adjacency.


In [7]:
def train_model(
    model,
    X_all,
    y_all,
    A_prior,
    train_rows,
    val_rows=None,
    test_rows=None,
    epochs=200,
    lr=1e-3,
    ridge_lambda=5.0,
    ent_w=5e-3,
    smooth_w=1e-3,
    N_per_year=None,
    times=None,
    print_every=25,
    early_stop=True,
    es_patience=80,
    wls_kind="ridge",
    huber_delta=1.0,
    huber_iters=3,
    graph_topk=None,
    graph_symmetrize=False,
    device=None,
):
    device = device or next(model.parameters()).device
    X_t = torch.tensor(X_all, dtype=torch.float32, device=device)
    y_t = torch.tensor(y_all, dtype=torch.float32, device=device)
    A_t = torch.tensor(A_prior, dtype=torch.float32, device=device)

    opt = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=1e-4)
    best_val, best_state, patience = float("inf"), None, 0
    history = []
    T = len(times) if times is not None else None

    for ep in range(1, epochs + 1):
        model.train()
        opt.zero_grad()
        W, _ = model(X_t, A_t)
        if graph_topk is not None:
            W = topk_rows(W, graph_topk)
            if graph_symmetrize:
                W = symmetrize_rows(W)

        y_hat, betas = solve_local_wls(
            X_t,
            y_t,
            W,
            kind=wls_kind,
            ridge=ridge_lambda,
            huber_delta=huber_delta,
            huber_iters=huber_iters,
            return_betas=True,
        )

        sup_loss = F.mse_loss(y_hat[train_rows], y_t[train_rows])
        Wn = _row_normalize(W)
        ent = -torch.sum(Wn * torch.log(Wn + 1e-12), dim=1).mean()
        ent_loss = -ent_w * ent

        if T is not None and N_per_year is not None:
            smooth = 0.0
            beta_mat = betas.reshape(T, N_per_year, -1)
            for t_idx in range(T):
                slice_start = t_idx * N_per_year
                slice_end = (t_idx + 1) * N_per_year
                W_spatial = A_t[slice_start:slice_end, slice_start:slice_end]
                bt = beta_mat[t_idx]
                diff = bt.unsqueeze(1) - bt.unsqueeze(0)
                smooth = smooth + torch.sum(W_spatial.unsqueeze(-1) * diff.pow(2))
            spatial_loss = smooth_w * smooth
        else:
            spatial_loss = 0.0

        total_loss = sup_loss + ent_loss + spatial_loss
        total_loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        opt.step()

        model.eval()
        with torch.no_grad():
            W_eval, _ = model(X_t, A_t)
            if graph_topk is not None:
                W_eval = topk_rows(W_eval, graph_topk)
                if graph_symmetrize:
                    W_eval = symmetrize_rows(W_eval)
            y_eval = solve_local_wls(
                X_t,
                y_t,
                W_eval,
                kind=wls_kind,
                ridge=ridge_lambda,
                return_betas=False,
            )
            y_eval_np = y_eval.detach().cpu().numpy()
            rmse_tr = float(np.sqrt(mean_squared_error(y_all[train_rows], y_eval_np[train_rows])))
            if val_rows is not None and len(val_rows) > 0:
                rmse_va = float(np.sqrt(mean_squared_error(y_all[val_rows], y_eval_np[val_rows])))
            else:
                rmse_va = float("inf")
            if test_rows is not None and len(test_rows) > 0:
                rmse_te = float(np.sqrt(mean_squared_error(y_all[test_rows], y_eval_np[test_rows])))
            else:
                rmse_te = math.nan

        history.append(
            {
                "epoch": ep,
                "loss": float(total_loss.item()),
                "rmse_tr": rmse_tr,
                "rmse_va": rmse_va,
                "rmse_te": rmse_te,
                "alpha": float(model.alpha.item()),
                "tau": float(model.tau.item()),
            }
        )
        if (ep % print_every == 0) or (ep == 1):
            print(
                f"Epoch {ep:03d} | Loss {total_loss.item():.4f} | RMSE train {rmse_tr:.3f} | val {rmse_va:.3f} | alpha {model.alpha.item():.3f} | tau {model.tau.item():.3f}"
            )

        score = rmse_va if val_rows is not None and len(val_rows) > 0 else rmse_tr
        if score < best_val - 1e-6:
            best_val = score
            best_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
            patience = 0
        else:
            patience += 1
            if early_stop and patience >= es_patience:
                print(f"Early stopping at epoch {ep}")
                break

    if best_state is not None:
        model.load_state_dict({k: v.to(device) for k, v in best_state.items()})

    model.eval()
    with torch.no_grad():
        W_final, _ = model(X_t, A_t)
        if graph_topk is not None:
            W_final = topk_rows(W_final, graph_topk)
            if graph_symmetrize:
                W_final = symmetrize_rows(W_final)
        y_final, betas_final = solve_local_wls(
            X_t,
            y_t,
            W_final,
            kind=wls_kind,
            ridge=ridge_lambda,
            return_betas=True,
        )

    return {
        "model": model,
        "W": W_final,
        "y_hat": y_final,
        "betas": betas_final,
        "history": history,
        "best_state": best_state,
    }


### Theory: Transductive Fine-Tuning with Future Nodes
To adapt the trained model when new-year nodes become available (without leaking their labels into the loss), we fine-tune on the expanded panel. The graph prior is rebuilt on 2019--2023, but the loss masks future rows while still allowing them to receive messages during forward passes.


In [8]:
def finetune_transductive_with_future(
    model,
    X_all_full,
    y_all_full,
    coords_blocks_full,
    times_full,
    train_rows,
    val_rows,
    future_rows,
    lr=1e-4,
    epochs=150,
    ridge_lambda=5.0,
    ent_w=5e-3,
    smooth_w=1e-3,
    knn_k=8,
    tau_s=1.0,
    tau_t=1.0,
    prior_self_weight=1.0,
    N_per_year=None,
    print_every=25,
    patience=40,
    wls_kind="ridge",
    huber_delta=1.0,
    huber_iters=3,
    graph_topk=None,
    graph_symmetrize=False,
    device=None,
):
    device = device or next(model.parameters()).device
    A_prior_np = build_spatiotemporal_kernel(
        coords_blocks_full,
        times_full,
        tau_s=tau_s,
        tau_t=tau_t,
        k_neighbors=knn_k,
        prior_self_weight=prior_self_weight,
        verbose=False,
    )
    A = torch.tensor(A_prior_np, dtype=torch.float32, device=device)
    X = torch.tensor(X_all_full, dtype=torch.float32, device=device)
    y = torch.tensor(y_all_full, dtype=torch.float32, device=device)

    opt = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=1e-4)
    best_val, best_state, patience_ctr = float("inf"), None, 0
    T = len(times_full)

    mask_train = torch.zeros(X.shape[0], dtype=torch.bool, device=device)
    mask_train[train_rows] = True
    mask_val = torch.zeros_like(mask_train)
    mask_val[val_rows] = True
    mask_future = torch.zeros_like(mask_train)
    mask_future[future_rows] = True

    for ep in range(1, epochs + 1):
        model.train()
        opt.zero_grad()
        W, _ = model(X, A)
        if graph_topk is not None:
            W = topk_rows(W, graph_topk)
            if graph_symmetrize:
                W = symmetrize_rows(W)

        y_hat, betas = solve_local_wls(
            X,
            y,
            W,
            kind=wls_kind,
            ridge=ridge_lambda,
            huber_delta=huber_delta,
            huber_iters=huber_iters,
            return_betas=True,
        )
        sup_loss = F.mse_loss(y_hat[mask_train], y[mask_train])
        Wn = _row_normalize(W)
        ent = -torch.sum(Wn * torch.log(Wn + 1e-12), dim=1).mean()
        ent_loss = -ent_w * ent

        smooth = 0.0
        if N_per_year is not None and T is not None:
            beta_mat = betas.reshape(T, N_per_year, -1)
            for t_idx in range(T):
                start = t_idx * N_per_year
                end = (t_idx + 1) * N_per_year
                W_block = A[start:end, start:end]
                bt = beta_mat[t_idx]
                diff = bt.unsqueeze(1) - bt.unsqueeze(0)
                smooth = smooth + torch.sum(W_block.unsqueeze(-1) * diff.pow(2))
        spatial_loss = smooth_w * smooth

        total_loss = sup_loss + ent_loss + spatial_loss
        total_loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        opt.step()

        model.eval()
        with torch.no_grad():
            W_eval, _ = model(X, A)
            if graph_topk is not None:
                W_eval = topk_rows(W_eval, graph_topk)
                if graph_symmetrize:
                    W_eval = symmetrize_rows(W_eval)
            y_eval = solve_local_wls(
                X,
                y,
                W_eval,
                kind=wls_kind,
                ridge=ridge_lambda,
                return_betas=False,
            )
            y_eval_np = y_eval.detach().cpu().numpy()
            rmse_tr = float(np.sqrt(mean_squared_error(y_all_full[train_rows], y_eval_np[train_rows])))
            rmse_va = float(np.sqrt(mean_squared_error(y_all_full[val_rows], y_eval_np[val_rows])))
            rmse_fu = float(np.sqrt(mean_squared_error(y_all_full[future_rows], y_eval_np[future_rows])))

        if (ep % print_every == 0) or (ep == 1):
            print(
                f"Fine-tune Ep {ep:03d} | Loss {total_loss.item():.4f} | RMSE train {rmse_tr:.3f} | val {rmse_va:.3f} | future {rmse_fu:.3f}"
            )

        if rmse_va < best_val - 1e-6:
            best_val = rmse_va
            best_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
            patience_ctr = 0
        else:
            patience_ctr += 1
            if patience_ctr >= patience:
                print(f"Fine-tune early stop at epoch {ep}")
                break

    if best_state is not None:
        model.load_state_dict({k: v.to(device) for k, v in best_state.items()})

    model.eval()
    with torch.no_grad():
        W_fin, _ = model(X, A)
        if graph_topk is not None:
            W_fin = topk_rows(W_fin, graph_topk)
            if graph_symmetrize:
                W_fin = symmetrize_rows(W_fin)
        y_fin, betas_fin = solve_local_wls(
            X,
            y,
            W_fin,
            kind=wls_kind,
            ridge=ridge_lambda,
            return_betas=True,
        )

    return {
        "model": model,
        "W": W_fin,
        "y_hat": y_fin,
        "betas": betas_fin,
        "best_state": best_state,
        "A_prior": A,
        "X": X,
        "y": y,
    }


## 7. Inference Strategies
We evaluate three deployment settings:

1. **OOS-Transductive**: freeze old rows, allow new nodes to attend to old ones only.
2. **OOS-Fullgraph**: augment the graph with new nodes and run a single forward pass.
3. **Prior-only**: use the kernel prior without the GNN (sanity baseline).

All strategies ultimately solve a local WLS system mixing old targets with new covariates.


In [9]:
def _estimate_bandwidths(coords_blocks_old, times_old, tau_s=1.0, tau_t=1.0):
    samples = []
    for C in coords_blocks_old:
        n = len(C)
        for i in range(n):
            for j in range(i + 1, n):
                samples.append(haversine(*C[i], *C[j]))
    hS = np.median(samples) if samples else 100.0
    hS = max(hS / max(tau_s, 1e-6), 1e-6)
    t_unique = np.sort(np.unique(times_old))
    Dt = np.abs(t_unique[:, None] - t_unique[None, :])
    hT = np.median(Dt[Dt > 0]) if (Dt > 0).any() else 1.0
    hT = max(hT / max(tau_t, 1e-6), 1e-6)
    return hS, hT

def predict_new_fullgraph(
    model,
    X_train,
    y_train,
    coords_train,
    times_train,
    new_df,
    feature_cols,
    time_col,
    lat_col,
    lon_col,
    tau_s=1.0,
    tau_t=1.0,
    knn_k=8,
    prior_self_weight=1.0,
    wls_kind="ridge",
    ridge_lambda=5.0,
    huber_delta=1.0,
    huber_iters=3,
    graph_topk=None,
    graph_symmetrize=False,
    device=None,
):
    device = device or next(model.parameters()).device
    X_new = new_df[feature_cols].values.astype(np.float32)
    coords_new = new_df[[lat_col, lon_col]].values.astype(np.float32)
    times_new = new_df[time_col].values.astype(float)
    X_comb = np.vstack([X_train, X_new]).astype(np.float32)
    coords_comb = np.vstack([coords_train, coords_new]).astype(np.float32)
    times_comb = np.concatenate([times_train, times_new]).astype(float)

    unique_times = np.sort(np.unique(times_comb))
    coords_blocks = [coords_comb[times_comb == t] for t in unique_times]
    A_prior_ext = build_spatiotemporal_kernel(
        coords_blocks,
        unique_times,
        tau_s=tau_s,
        tau_t=tau_t,
        k_neighbors=knn_k,
        prior_self_weight=prior_self_weight,
        verbose=False,
    )
    X_comb_t = torch.tensor(X_comb, dtype=torch.float32, device=device)
    A_prior_ext_t = torch.tensor(A_prior_ext, dtype=torch.float32, device=device)
    n_old = len(X_train)

    with torch.no_grad():
        W_learned, _ = model(X_comb_t, A_prior_ext_t)
        if graph_topk is not None:
            W_learned = topk_rows(W_learned, graph_topk)
            if graph_symmetrize:
                W_learned = symmetrize_rows(W_learned)
        y_stub = torch.tensor(
            np.concatenate([y_train, np.zeros(len(X_new), dtype=np.float32)]),
            dtype=torch.float32,
            device=device,
        )
        y_hat = solve_local_wls(
            X_comb_t,
            y_stub,
            W_learned,
            kind=wls_kind,
            ridge=ridge_lambda,
            huber_delta=huber_delta,
            huber_iters=huber_iters,
            return_betas=False,
        )
    return y_hat[n_old:].cpu().numpy()


In [10]:
def predict_new_oos_transductive(
    model,
    X_train,
    y_train,
    coords_train,
    times_train,
    new_df,
    feature_cols,
    time_col,
    lat_col,
    lon_col,
    tau_s=1.0,
    tau_t=1.0,
    knn_k=8,
    prior_self_weight=1.0,
    lambda_blend=0.8,
    wls_kind="ridge",
    ridge_lambda=5.0,
    huber_delta=1.0,
    huber_iters=3,
    graph_topk=None,
    graph_symmetrize=False,
    device=None,
    cross_topk=None,
    new_self_weight=0.0,
):
    device = device or next(model.parameters()).device
    X_new = new_df[feature_cols].values.astype(np.float32)
    coords_new = new_df[[lat_col, lon_col]].values.astype(np.float32)
    times_new = new_df[time_col].values.astype(float)
    n_old, n_new = len(X_train), len(X_new)

    unique_times_old = np.sort(np.unique(times_train))
    coords_blocks_old = [coords_train[times_train == t] for t in unique_times_old]
    A_prior_old_np = build_spatiotemporal_kernel(
        coords_blocks_old,
        unique_times_old,
        tau_s=tau_s,
        tau_t=tau_t,
        knn_k=knn_k,
        prior_self_weight=prior_self_weight,
        verbose=False,
    )
    A_prior_old = torch.tensor(A_prior_old_np, dtype=torch.float32, device=device)
    X_old_t = torch.tensor(X_train, dtype=torch.float32, device=device)

    with torch.no_grad():
        W_old, H_old = model(X_old_t, A_prior_old)
        if graph_topk is not None:
            W_old = topk_rows(W_old, graph_topk)
            if graph_symmetrize:
                W_old = symmetrize_rows(W_old)

    hS, hT = _estimate_bandwidths(coords_blocks_old, times_train, tau_s, tau_t)
    A_cross = np.zeros((n_new, n_old), dtype=np.float32)
    for i in range(n_new):
        lat_i, lon_i, t_i = coords_new[i, 0], coords_new[i, 1], times_new[i]
        for j in range(n_old):
            lat_j, lon_j, t_j = coords_train[j, 0], coords_train[j, 1], times_train[j]
            d_spa = haversine(lat_i, lon_i, lat_j, lon_j)
            d_tmp = abs(t_i - t_j)
            A_cross[i, j] = math.exp(-0.5 * (d_spa / hS) ** 2) * math.exp(-0.5 * (d_tmp / hT) ** 2)

    if cross_topk is not None and cross_topk > 0:
        k_eff = min(cross_topk, max(1, A_cross.shape[1] - 1))
        idx = np.argpartition(-A_cross, kth=k_eff - 1, axis=1)[:, :k_eff]
        mask = np.zeros_like(A_cross)
        rows = np.arange(n_new)[:, None]
        mask[rows, idx] = 1.0
        A_cross = A_cross * mask

    A_cross = A_cross / (A_cross.sum(axis=1, keepdims=True) + 1e-12)

    with torch.no_grad():
        H_new = model.encoder(torch.tensor(X_new, dtype=torch.float32, device=device))
        H_old_n = F.normalize(H_old, p=2, dim=1)
        H_new_n = F.normalize(H_new, p=2, dim=1)
        S = H_new_n @ H_old_n.t()
        logits = S / model.tau
        log_prior_cross = torch.log(torch.tensor(A_cross, dtype=torch.float32, device=device) + 1e-12)
        alpha = model.alpha
        log_blend = alpha * log_prior_cross + (1 - alpha) * logits
        W_new2old_gnn = F.softmax(log_blend, dim=1)
        if lambda_blend is not None and 0.0 <= lambda_blend <= 1.0:
            W_new2old = lambda_blend * W_new2old_gnn + (1 - lambda_blend) * torch.tensor(A_cross, dtype=torch.float32, device=device)
        else:
            W_new2old = W_new2old_gnn

    if new_self_weight and new_self_weight > 0:
        W_new2new = torch.eye(n_new, device=device) * float(new_self_weight)
    else:
        W_new2new = torch.zeros((n_new, n_new), device=device)

    W_full = torch.zeros((n_old + n_new, n_old + n_new), device=device)
    W_full[:n_old, :n_old] = W_old
    W_new_row = torch.cat([W_new2old, W_new2new], dim=1)
    W_new_row = W_new_row / (W_new_row.sum(dim=1, keepdim=True) + 1e-12)
    W_full[n_old:, :] = W_new_row

    X_comb = np.vstack([X_train, X_new]).astype(np.float32)
    X_comb_t = torch.tensor(X_comb, dtype=torch.float32, device=device)
    y_stub = torch.tensor(
        np.concatenate([y_train, np.zeros(n_new, dtype=np.float32)]),
        dtype=torch.float32,
        device=device,
    )

    with torch.no_grad():
        y_hat = solve_local_wls(
            X_comb_t,
            y_stub,
            W_full,
            kind=wls_kind,
            ridge=ridge_lambda,
            huber_delta=huber_delta,
            huber_iters=huber_iters,
            return_betas=False,
        )
    return y_hat[n_old:].cpu().numpy()


In [11]:
def predict_new_prior_only(
    X_train,
    y_train,
    coords_train,
    times_train,
    new_df,
    feature_cols,
    time_col,
    lat_col,
    lon_col,
    tau_s=1.0,
    tau_t=1.0,
    knn_k=8,
    wls_kind="ridge",
    ridge_lambda=5.0,
    huber_delta=1.0,
    huber_iters=3,
    cross_topk=None,
    new_self_weight=0.0,
    device=None,
):
    device = device or torch.device("cpu")
    X_new = new_df[feature_cols].values.astype(np.float32)
    coords_new = new_df[[lat_col, lon_col]].values.astype(np.float32)
    times_new = new_df[time_col].values.astype(float)
    n_old, n_new = len(X_train), len(X_new)

    unique_times_old = np.sort(np.unique(times_train))
    coords_blocks_old = [coords_train[times_train == t] for t in unique_times_old]
    hS, hT = _estimate_bandwidths(coords_blocks_old, times_train, tau_s, tau_t)

    A_cross = np.zeros((n_new, n_old), dtype=np.float32)
    for i in range(n_new):
        lat_i, lon_i, t_i = coords_new[i, 0], coords_new[i, 1], times_new[i]
        for j in range(n_old):
            lat_j, lon_j, t_j = coords_train[j, 0], coords_train[j, 1], times_train[j]
            d_spa = haversine(lat_i, lon_i, lat_j, lon_j)
            d_tmp = abs(t_i - t_j)
            A_cross[i, j] = math.exp(-0.5 * (d_spa / hS) ** 2) * math.exp(-0.5 * (d_tmp / hT) ** 2)

    if cross_topk is not None and cross_topk > 0:
        k_eff = min(cross_topk, max(1, A_cross.shape[1] - 1))
        idx = np.argpartition(-A_cross, kth=k_eff - 1, axis=1)[:, :k_eff]
        mask = np.zeros_like(A_cross)
        rows = np.arange(n_new)[:, None]
        mask[rows, idx] = 1.0
        A_cross = A_cross * mask

    A_cross = A_cross / (A_cross.sum(axis=1, keepdims=True) + 1e-12)

    if new_self_weight and new_self_weight > 0:
        W_new2new = torch.eye(n_new) * float(new_self_weight)
    else:
        W_new2new = torch.zeros((n_new, n_new))

    W_full = torch.zeros((n_old + n_new, n_old + n_new), dtype=torch.float32)
    W_full[:n_old, :n_old] = torch.eye(n_old)
    W_full[n_old:, :n_old] = torch.tensor(A_cross, dtype=torch.float32)
    W_full[n_old:, n_old:] = W_new2new
    W_full[n_old:, :] = W_full[n_old:, :] / (W_full[n_old:, :].sum(dim=1, keepdim=True) + 1e-12)

    X_comb = np.vstack([X_train, X_new]).astype(np.float32)
    X_comb_t = torch.tensor(X_comb, dtype=torch.float32, device=device)
    y_stub = torch.tensor(np.concatenate([y_train, np.zeros(n_new, dtype=np.float32)]), dtype=torch.float32, device=device)

    with torch.no_grad():
        y_hat = solve_local_wls(
            X_comb_t,
            y_stub,
            torch.tensor(W_full, dtype=torch.float32, device=device),
            kind=wls_kind,
            ridge=ridge_lambda,
            huber_delta=huber_delta,
            huber_iters=huber_iters,
            return_betas=False,
        )
    return y_hat[len(X_train):].cpu().numpy()


## 8. Experimental Protocol
We now instantiate the pipeline:

1. Load the full panel and split 2019--2022 into train/validation/test years.
2. Build the GTWR prior and tensor representations.
3. Establish a GTWR-only baseline.
4. Train multiple GNN configurations.
5. Evaluate OOS strategies on 2023 (transductive, full-graph, fine-tuned).

We reuse the configuration logic from the earlier prototype and wrap it here to keep the notebook self-contained.


In [12]:
LAT_COL, LON_COL = "lat", "lon"
TIME_COL, TARGET_COL = "Tahun", "y"
FEATURE_COLS = ["X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8"]

df_full = load_panel_xlsx(DATA_PATH, LAT_COL, LON_COL, TIME_COL, TARGET_COL, FEATURE_COLS)
print(f"Loaded {len(df_full)} records with columns: {df_full.columns.tolist()}")

mask_pre2023 = df_full[TIME_COL] < 2023
mask_2023 = df_full[TIME_COL] == 2023

df_2019_2022 = df_full[mask_pre2023].copy()
df_2023 = df_full[mask_2023].copy().sort_values([LAT_COL, LON_COL]).reset_index(drop=True)

times_2019_2022 = sorted(df_2019_2022[TIME_COL].unique())
P = build_panel_arrays(
    df_2019_2022,
    TIME_COL,
    TARGET_COL,
    FEATURE_COLS,
    LAT_COL,
    LON_COL,
    times_2019_2022,
)
X_all, y_all = P["X_all"], P["y_all"]
coords_blocks, times, N_per_year = P["coords_blocks"], P["times"], P["N_per_year"]
coords_all = np.vstack(coords_blocks)

print(f"Years covered: {times} | locations per year: {N_per_year} | X shape: {X_all.shape}")

split = split_train_val_test(times, N_per_year, use_val=True)
train_rows, val_rows, test_rows = split["train_rows"], split["val_rows"], split["test_rows"]
print(f"Split sizes -> train: {len(train_rows)}, val: {len(val_rows)}, test: {len(test_rows)}")

W_prior = build_spatiotemporal_kernel(
    coords_blocks,
    times,
    tau_s=1.0,
    tau_t=1.0,
    k_neighbors=8,
    prior_self_weight=1.0,
    verbose=True,
)

X_t = torch.tensor(X_all, dtype=torch.float32, device=device)
y_t = torch.tensor(y_all, dtype=torch.float32, device=device)
A_t = torch.tensor(W_prior, dtype=torch.float32, device=device)


Loaded 595 records with columns: ['lat', 'lon', 'Tahun', 'y', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8']
Years covered: [2019 2020 2021 2022] | locations per year: 104 | X shape: (416, 8)
Split sizes -> train: 208, val: 104, test: 104
Kernel constructed with sparsity 0.978


### GTWR Baseline (Theory)
Setting the learned weights aside, the GTWR baseline solves the local ridge WLS system using only the prior matrix $A_{prior}$. This quantifies the contribution of distance-based smoothing alone.


In [13]:
y_gtwr = local_wls_ridge(X_t, y_t, A_t, ridge=10.0, return_betas=False).detach().cpu().numpy()

def summarize_split(pred_np, rows):
    if rows is None or len(rows) == 0:
        return (math.nan, math.nan, math.nan)
    return regression_metrics(y_all[rows], pred_np[rows])

rmse_tr, mae_tr, r2_tr = summarize_split(y_gtwr, train_rows)
rmse_va, mae_va, r2_va = summarize_split(y_gtwr, val_rows)
rmse_te, mae_te, r2_te = summarize_split(y_gtwr, test_rows)

baseline_summary = pd.DataFrame([
    {
        "name": "GTWR (prior only)",
        "rmse_tr": rmse_tr,
        "mae_tr": mae_tr,
        "r2_tr": r2_tr,
        "rmse_va": rmse_va,
        "mae_va": mae_va,
        "r2_va": r2_va,
        "rmse_te": rmse_te,
        "mae_te": mae_te,
        "r2_te": r2_te,
    }
])
print("In-sample baseline metrics (2019-2022):")
display(baseline_summary)


In-sample baseline metrics (2019-2022):


Unnamed: 0,name,rmse_tr,mae_tr,r2_tr,rmse_va,mae_va,r2_va,rmse_te,mae_te,r2_te
0,GTWR (prior only),0.823835,0.630092,0.90352,0.940072,0.743148,0.867758,1.041443,0.821363,0.778349


### GNN Configuration Logic
We define a helper to train multiple settings while insulating against return-type variations. Each configuration mirrors the experiments from the prototype notebook, showcasing different WLS solvers, graph truncation, and regularization.


In [14]:
def unwrap_train_output(res, model_seed, ridge_lambda=10.0):
    model_tr = model_seed
    y_hat_t = None
    history = []
    if isinstance(res, dict):
        model_tr = res.get("model", model_seed)
        y_hat_t = res.get("y_hat", None)
        history = res.get("history", [])
    elif isinstance(res, (tuple, list)):
        if len(res) >= 1 and hasattr(res[0], "state_dict"):
            model_tr = res[0]
        if len(res) >= 2:
            history = res[1]
        if len(res) >= 3:
            pack = res[2]
            if torch.is_tensor(pack) and pack.shape[0] == X_t.shape[0]:
                y_hat_t = pack
            elif isinstance(pack, (tuple, list)):
                for item in pack:
                    if torch.is_tensor(item) and item.shape[0] == X_t.shape[0]:
                        y_hat_t = item
                        break
    if y_hat_t is None:
        with torch.no_grad():
            W_learned, _ = model_tr(X_t, A_t)
            y_hat_t = local_wls_ridge(X_t, y_t, W_learned, ridge=ridge_lambda, return_betas=False)
    return model_tr, y_hat_t, history

def run_one_config(
    cfg_name,
    wls_kind="ridge",
    ridge_lambda=10.0,
    huber_delta=1.0,
    huber_iters=5,
    ent_w=5e-3,
    smooth_w=1e-3,
    graph_topk=None,
    graph_symmetrize=False,
    epochs=200,
    lr=1e-3,
    early_stop_patience=80,
    print_every=25,
    **extra_kwargs,
):
    model = MathematicallyCorrectGNNWeightNet(d_in=X_all.shape[1]).to(device)
    train_kwargs = dict(
        model=model,
        X_all=X_all,
        y_all=y_all,
        A_prior=W_prior,
        train_rows=train_rows,
        val_rows=val_rows,
        test_rows=test_rows,
        epochs=epochs,
        lr=lr,
        ridge_lambda=ridge_lambda,
        wls_kind=wls_kind,
        huber_delta=huber_delta,
        huber_iters=huber_iters,
        ent_w=ent_w,
        smooth_w=smooth_w,
        N_per_year=N_per_year,
        times=times,
        graph_topk=graph_topk,
        graph_symmetrize=graph_symmetrize,
        early_stop=True,
        es_patience=early_stop_patience,
        device=device,
        print_every=print_every,
    )
    train_kwargs.update(extra_kwargs)
    res = safe_call(train_model, **train_kwargs)
    model_tr, y_hat_t, history = unwrap_train_output(res, model, ridge_lambda=ridge_lambda)
    y_pred = to_numpy(y_hat_t)

    def collect(rows):
        return regression_metrics(y_all[rows], y_pred[rows])

    summary = {
        "name": cfg_name,
        "rmse_tr": collect(train_rows)[0],
        "mae_tr": collect(train_rows)[1],
        "r2_tr": collect(train_rows)[2],
        "rmse_va": collect(val_rows)[0],
        "mae_va": collect(val_rows)[1],
        "r2_va": collect(val_rows)[2],
        "rmse_te": collect(test_rows)[0],
        "mae_te": collect(test_rows)[1],
        "r2_te": collect(test_rows)[2],
        "history": history,
    }
    return summary, model_tr

experiments = []

summ_B, model_B = run_one_config(
    "GNN + Ridge + Entropy (TopK=12)",
    wls_kind="ridge",
    ridge_lambda=10.0,
    ent_w=5e-3,
    smooth_w=1e-3,
    graph_topk=12,
    graph_symmetrize=False,
)
experiments.append((summ_B, model_B))

summ_C, model_C = run_one_config(
    "GNN + Huber + TopK=12 + Sym",
    wls_kind="huber",
    ridge_lambda=10.0,
    huber_delta=1.0,
    huber_iters=5,
    ent_w=5e-3,
    smooth_w=1e-3,
    graph_topk=12,
    graph_symmetrize=True,
)
experiments.append((summ_C, model_C))

summ_D, model_D = run_one_config(
    "GNN + Ridge + TopK=8 (KL placeholder)",
    wls_kind="ridge",
    ridge_lambda=10.0,
    ent_w=0.0,
    smooth_w=1e-3,
    graph_topk=8,
    graph_symmetrize=False,
)
experiments.append((summ_D, model_D))

summ_E, model_E = run_one_config(
    "GNN + Ridge + TopK=6",
    wls_kind="ridge",
    ridge_lambda=7.0,
    ent_w=5e-3,
    smooth_w=1e-3,
    graph_topk=6,
    graph_symmetrize=False,
)
experiments.append((summ_E, model_E))

in_sample_df = pd.concat([baseline_summary] + [pd.DataFrame([e[0]]) for e in experiments], ignore_index=True)
print("In-sample comparison across configurations:")
display(in_sample_df.drop(columns=["history"], errors="ignore"))


Epoch 001 | Loss 0.6859 | RMSE train 0.832 | val 0.949 | alpha 0.300 | tau 1.199
Epoch 025 | Loss 0.6856 | RMSE train 0.832 | val 0.949 | alpha 0.305 | tau 1.171
Epoch 050 | Loss 0.6853 | RMSE train 0.832 | val 0.949 | alpha 0.311 | tau 1.144
Epoch 075 | Loss 0.6853 | RMSE train 0.832 | val 0.949 | alpha 0.316 | tau 1.119
Epoch 100 | Loss 0.6850 | RMSE train 0.832 | val 0.949 | alpha 0.321 | tau 1.098
Epoch 125 | Loss 0.6850 | RMSE train 0.832 | val 0.949 | alpha 0.327 | tau 1.079
Epoch 150 | Loss 0.6847 | RMSE train 0.832 | val 0.949 | alpha 0.332 | tau 1.063
Epoch 175 | Loss 0.6845 | RMSE train 0.832 | val 0.948 | alpha 0.338 | tau 1.049
Epoch 200 | Loss 0.6843 | RMSE train 0.831 | val 0.948 | alpha 0.343 | tau 1.038
Epoch 001 | Loss 0.7672 | RMSE train 0.870 | val 0.978 | alpha 0.300 | tau 1.199
Epoch 025 | Loss 0.7672 | RMSE train 0.869 | val 0.978 | alpha 0.305 | tau 1.171
Epoch 050 | Loss 0.7670 | RMSE train 0.869 | val 0.978 | alpha 0.311 | tau 1.144
Epoch 075 | Loss 0.7668 | RM

Unnamed: 0,name,rmse_tr,mae_tr,r2_tr,rmse_va,mae_va,r2_va,rmse_te,mae_te,r2_te
0,GTWR (prior only),0.823835,0.630092,0.90352,0.940072,0.743148,0.867758,1.041443,0.821363,0.778349
1,GNN + Ridge + Entropy (TopK=12),0.831431,0.638732,0.901733,0.948296,0.751823,0.865434,1.048434,0.827314,0.775363
2,GNN + Huber + TopK=12 + Sym,0.86895,0.629164,0.892664,0.977247,0.72429,0.857092,1.0753,0.826705,0.763703
3,GNN + Ridge + TopK=8 (KL placeholder),0.784879,0.599951,0.912429,0.902076,0.731433,0.878232,0.96671,0.763701,0.809019
4,GNN + Ridge + TopK=6,0.625257,0.472106,0.944426,0.764782,0.588519,0.912477,0.841994,0.627066,0.855117


### OOS Evaluation Protocol
We next evaluate each trained model on 2023 using the three inference modes. For the fine-tune variant we reconstruct the 2019--2023 panel so that future rows can receive graph edges while their labels remain masked during optimization.


In [15]:
def eval_oos_for_model(model, cfg_name, cross_topk=12, lambda_blend=0.8, new_self_weight=0.0):
    if df_2023 is None or len(df_2023) == 0:
        return {
            "name": cfg_name,
            "rmse_oos_trans": math.nan,
            "mae_oos_trans": math.nan,
            "r2_oos_trans": math.nan,
            "rmse_oos_full": math.nan,
            "mae_oos_full": math.nan,
            "r2_oos_full": math.nan,
            "rmse_oos_reft": math.nan,
            "mae_oos_reft": math.nan,
            "r2_oos_reft": math.nan,
        }

    y_trans = safe_call(
        predict_new_oos_transductive,
        model=model,
        X_train=X_all,
        y_train=y_all,
        coords_train=coords_all,
        times_train=np.repeat(times, N_per_year),
        new_df=df_2023,
        feature_cols=FEATURE_COLS,
        time_col=TIME_COL,
        lat_col=LAT_COL,
        lon_col=LON_COL,
        tau_s=1.0,
        tau_t=1.0,
        knn_k=8,
        prior_self_weight=1.0,
        lambda_blend=lambda_blend,
        cross_topk=cross_topk,
        new_self_weight=new_self_weight,
        wls_kind="ridge",
        ridge_lambda=10.0,
        huber_delta=1.0,
        huber_iters=5,
        device=device,
    )
    y_trans = to_numpy(y_trans)

    y_full = safe_call(
        predict_new_fullgraph,
        model=model,
        X_train=X_all,
        y_train=y_all,
        coords_train=coords_all,
        times_train=np.repeat(times, N_per_year),
        new_df=df_2023,
        feature_cols=FEATURE_COLS,
        time_col=TIME_COL,
        lat_col=LAT_COL,
        lon_col=LON_COL,
        tau_s=1.0,
        tau_t=1.0,
        knn_k=8,
        prior_self_weight=1.0,
        wls_kind="ridge",
        ridge_lambda=10.0,
        huber_delta=1.0,
        huber_iters=5,
        device=device,
    )
    y_full = to_numpy(y_full)

    df_2019_2023 = df_full[df_full[TIME_COL] <= 2023].copy()
    times_2019_2023 = sorted(df_2019_2023[TIME_COL].unique())
    P23 = build_panel_arrays(
        df_2019_2023,
        TIME_COL,
        TARGET_COL,
        FEATURE_COLS,
        LAT_COL,
        LON_COL,
        times_2019_2023,
    )
    X_all_23, y_all_23 = P23["X_all"], P23["y_all"]
    coords_blocks_23, times_23, N_per_year_23 = P23["coords_blocks"], P23["times"], P23["N_per_year"]

    train_rows_ft = np.concatenate([year_rows(times_23, N_per_year_23, t) for t in times_23 if t <= 2021])
    val_rows_ft = year_rows(times_23, N_per_year_23, 2022)
    future_rows_ft = year_rows(times_23, N_per_year_23, 2023)

    ft_res = safe_call(
        finetune_transductive_with_future,
        model=model,
        X_all_full=X_all_23,
        y_all_full=y_all_23,
        coords_blocks_full=coords_blocks_23,
        times_full=times_23,
        train_rows=train_rows_ft,
        val_rows=val_rows_ft,
        future_rows=future_rows_ft,
        lr=1e-4,
        epochs=150,
        ridge_lambda=10.0,
        ent_w=5e-3,
        smooth_w=1e-3,
        knn_k=8,
        tau_s=1.0,
        tau_t=1.0,
        prior_self_weight=1.0,
        N_per_year=N_per_year_23,
        print_every=25,
        wls_kind="ridge",
        graph_topk=None,
        graph_symmetrize=False,
        device=device,
    )

    if isinstance(ft_res, dict):
        y_hat_23 = ft_res.get("y_hat", None)
        model_ft = ft_res.get("model", model)
    else:
        model_ft = model
        y_hat_23 = None

    if y_hat_23 is None:
        with torch.no_grad():
            A23 = build_spatiotemporal_kernel(
                coords_blocks_23,
                times_23,
                tau_s=1.0,
                tau_t=1.0,
                k_neighbors=8,
                prior_self_weight=1.0,
                verbose=False,
            )
            X23_t = torch.tensor(X_all_23, dtype=torch.float32, device=device)
            A23_t = torch.tensor(A23, dtype=torch.float32, device=device)
            W_learned, _ = model_ft(X23_t, A23_t)
            y_hat_23 = local_wls_ridge(X23_t, torch.tensor(y_all_23, dtype=torch.float32, device=device), W_learned, ridge=10.0, return_betas=False)

    y_hat_23_np = to_numpy(y_hat_23)
    yt = df_2023[TARGET_COL].values.astype(np.float32)

    rmse_o1, mae_o1, r2_o1 = regression_metrics(yt, y_trans)
    rmse_o2, mae_o2, r2_o2 = regression_metrics(yt, y_full)
    rmse_o3, mae_o3, r2_o3 = regression_metrics(y_all_23[future_rows_ft], y_hat_23_np[future_rows_ft])

    return {
        "name": cfg_name,
        "rmse_oos_trans": rmse_o1,
        "mae_oos_trans": mae_o1,
        "r2_oos_trans": r2_o1,
        "rmse_oos_full": rmse_o2,
        "mae_oos_full": mae_o2,
        "r2_oos_full": r2_o2,
        "rmse_oos_reft": rmse_o3,
        "mae_oos_reft": mae_o3,
        "r2_oos_reft": r2_o3,
    }

oos_rows = [eval_oos_for_model(model, summary["name"]) for summary, model in experiments]
oos_df = pd.DataFrame(oos_rows)

summary_all = in_sample_df.merge(oos_df, on="name", how="outer")
print("Combined in-sample and OOS metrics:")
display(summary_all)

output_dir = Path("exp_outputs")
output_dir.mkdir(exist_ok=True)
summary_path = output_dir / "gtwr_gnn_experiment_summary.csv"
summary_all.to_csv(summary_path, index=False)
print(f"Summary saved to {summary_path}")


TypeError: build_spatiotemporal_kernel() got an unexpected keyword argument 'knn_k'

## 9. Discussion
- **Panel alignment:** The revised balancing ensures that each spatial index corresponds to the same coordinate across years, avoiding silent corruption from earlier truncation.
- **Regularization:** Exposing `smooth_w` in the loss highlights its effect; tuning this parameter now genuinely shapes the model.
- **Model comparison:** Inspecting `summary_all` reveals how the GNN augmentations alter both in-sample fit and OOS generalization relative to the GTWR prior.

Future improvements could incorporate explicit KL penalties toward the prior or data-driven selection of the spatial bandwidths, as well as uncertainty quantification via bootstrap resampling.


## 10. References
- Fotheringham, A. Stewart, et al. *Geographically Weighted Regression*. Wiley, 2002.
- Huang, Bo, et al. "Geographically and Temporally Weighted Regression". *Geographical Analysis* 42.2 (2010).
- Veli\v{c}kovi\'c, Petar, et al. "Graph Attention Networks". *ICLR* (2018).
- Kipf, Thomas N., and Max Welling. "Semi-Supervised Classification with Graph Convolutional Networks". *ICLR* (2017).
