# 06 – TabNet with Tabular + Text Embeddings (Fusion Model)

**Goal**

In this notebook:

1. Load the **tabular splits** (`train_multimodal.csv`, `val_multimodal.csv`, `test_multimodal.csv`).
2. Load the per-listing **text features** exported in `05_text_encoder.ipynb`  
   (`txt_has`, `txt_pred_log`, `txt_emb_000...`) from `PROC_DIR / "multimodal_features"`.
3. Merge tabular + text features on `listing_id`.
4. Build a joint feature matrix:
   - Numeric features from the tabular pipeline.
   - Text meta features (`txt_has`, `txt_pred_log`).
   - Text embedding dimensions (`txt_emb_*`).
5. Train a **TabNetRegressor** that consumes all of these features jointly.
6. Evaluate performance in:
   - log-price space (`log_sold_price`), and
   - back-transformed dollar space (`sold_price`).

This is the “TabNet + text embeddings” counterpart of `04_TabNet_tabularONLY.ipynb`.


In [None]:
!pip install -q pytorch-tabnet wget

from pathlib import Path
import json
import os

import numpy as np
import pandas as pd

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

import torch
from pytorch_tabnet.tab_model import TabNetRegressor

np.random.seed(0)
torch.manual_seed(0)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(0)

print("Torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())


Torch version: 2.9.0+cu126
CUDA available: True


In [None]:
try:
    from google.colab import drive
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    drive.mount("/content/drive")
    PROJECT_ROOT = Path("/content/drive/My Drive/SH")
else:
    # Adjust this path if running locally
    PROJECT_ROOT = Path(".").resolve()

DATA_DIR = PROJECT_ROOT / "data"
PROC_DIR = DATA_DIR / "processed"
TXT_FEAT_DIR = PROC_DIR / "multimodal_features"

print("PROJECT_ROOT:", PROJECT_ROOT)
print("PROC_DIR:", PROC_DIR)
print("TXT_FEAT_DIR:", TXT_FEAT_DIR)


Mounted at /content/drive
PROJECT_ROOT: /content/drive/My Drive/SH
PROC_DIR: /content/drive/My Drive/SH/data/processed
TXT_FEAT_DIR: /content/drive/My Drive/SH/data/processed/multimodal_features


## 1. Load tabular splits and preparation summary

Load:

- `train_multimodal.csv`
- `val_multimodal.csv`
- `test_multimodal.csv`

and `multimodal_prep_summary.json`, whose `criteria` entry contains:

- `target_column` (e.g., `sold_price`)
- `log_target_column` (e.g., `log_sold_price`)
- `numeric_features`
- `categorical_features`
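
As a rough illustration of the layout this notebook expects, the sketch below parses a made-up summary and filters the feature lists against the columns that are present. The feature names and the `available_columns` set are hypothetical; only the key names (`criteria`, `target_column`, `log_target_column`, `numeric_features`, `categorical_features`) come from the loading code, and the real file may hold additional keys.

```python
import json

# Hypothetical summary content; only the key names mirror what this notebook reads.
example_summary = json.loads("""
{
  "criteria": {
    "target_column": "sold_price",
    "log_target_column": "log_sold_price",
    "numeric_features": ["beds", "sqft", "price_per_sqft"],
    "categorical_features": ["city", "state"]
  }
}
""")

example_criteria = example_summary["criteria"]

# Stand-in for train_df.columns (assumed for this example).
available_columns = {"beds", "sqft", "city", "state", "sold_price", "log_sold_price"}

# Keep only the listed features that actually exist in the loaded split.
example_numeric = [c for c in example_criteria["numeric_features"] if c in available_columns]
example_categorical = [c for c in example_criteria["categorical_features"] if c in available_columns]

print("numeric:", example_numeric)
print("categorical:", example_categorical)
```

Filtering against the loaded columns keeps the notebook from failing when the summary lists a feature that was dropped from the CSV splits.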


In [None]:
train_path = PROC_DIR / "train_multimodal.csv"
val_path   = PROC_DIR / "val_multimodal.csv"
test_path  = PROC_DIR / "test_multimodal.csv"

train_df = pd.read_csv(train_path)
val_df   = pd.read_csv(val_path)
test_df  = pd.read_csv(test_path)

print("Train shape:", train_df.shape)
print("Val shape  :", val_df.shape)
print("Test shape :", test_df.shape)

summary_path = PROC_DIR / "multimodal_prep_summary.json"
with open(summary_path, "r") as f:
    prep_summary = json.load(f)

criteria = prep_summary["criteria"]

TARGET_RAW_COL = criteria["target_column"]        # e.g., "sold_price"
TARGET_LOG_COL = criteria["log_target_column"]    # e.g., "log_sold_price"

NUMERIC_FEATURES_TAB = criteria["numeric_features"]
CATEGORICAL_FEATURES = criteria["categorical_features"]

# Keep only columns that are actually present
NUMERIC_FEATURES_TAB = [c for c in NUMERIC_FEATURES_TAB if c in train_df.columns]
CATEGORICAL_FEATURES = [c for c in CATEGORICAL_FEATURES if c in train_df.columns]

print("Target (raw):", TARGET_RAW_COL)
print("Target (log):", TARGET_LOG_COL)
print("Tabular numeric features:", len(NUMERIC_FEATURES_TAB))
print("Tabular categorical features:", len(CATEGORICAL_FEATURES))


Train shape: (143643, 27)
Val shape  : (17955, 27)
Test shape : (17956, 27)
Target (raw): sold_price
Target (log): log_sold_price
Tabular numeric features: 9
Tabular categorical features: 9


## 2. Load and merge text features

Use the text features exported by `05_text_encoder.ipynb`:

- `txt_features_train_*.csv`
- `txt_features_val_*.csv`
- `txt_features_test_*.csv`

Each file contains (at least):

- `listing_id`
- `txt_has` (0/1 flag for non-empty description)
- `txt_pred_log` (text-only prediction in log space)
- `txt_emb_000 ... txt_emb_{D-1}` (text embedding)

Merge these into the tabular splits on `listing_id` and then extend
our numeric feature set with:

- `txt_has`
- `txt_pred_log`
- all `txt_emb_*` columns.
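
The merge and the column selection described above can be sketched on toy data. The frames and values below are made up, and `indicator=True` is an extra check that this notebook does not use; it is shown here only as one possible way to count tabular rows that receive no text features.

```python
import pandas as pd

# Toy stand-ins for one tabular split and its text-feature file (values are made up).
tab = pd.DataFrame({"listing_id": [1, 2, 3], "sqft": [900, 1500, 2100]})
txt = pd.DataFrame({
    "listing_id": [1, 2],
    "txt_has": [1, 0],
    "txt_pred_log": [12.1, 12.6],
    "txt_emb_000": [0.3, -0.1],
})

# validate="one_to_one" raises pandas.errors.MergeError if listing_id repeats on either side.
# indicator=True adds a "_merge" column that flags tabular rows with no text match.
merged = tab.merge(txt, on="listing_id", how="left", validate="one_to_one", indicator=True)

n_unmatched = int((merged["_merge"] == "left_only").sum())
print("rows without text features:", n_unmatched)

# Same selection logic as in the notebook: meta columns first, then embedding dimensions.
txt_emb_cols = [c for c in merged.columns if c.startswith("txt_emb_")]
txt_meta_cols = [c for c in ["txt_has", "txt_pred_log"] if c in merged.columns]
extra_numeric = txt_meta_cols + txt_emb_cols
print(extra_numeric)
```

Matching row counts and unique-id counts do not by themselves prove that the ids line up. With a left merge, any unmatched row would carry NaN text features, which the median imputation in section 4 would then fill silently.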


In [None]:
def pick_single_file(pattern):
    matches = sorted(TXT_FEAT_DIR.glob(pattern))
    if not matches:
        raise FileNotFoundError(f"No files match pattern: {pattern}")
    if len(matches) > 1:
        print("Warning: multiple matches found; using the first one:")
        for m in matches:
            print("  ", m.name)
    return matches[0]

txt_train_path = pick_single_file("txt_features_train_*.csv")
txt_val_path   = pick_single_file("txt_features_val_*.csv")
txt_test_path  = pick_single_file("txt_features_test_*.csv")

print("Using text feature files:")
print("  train:", txt_train_path.name)
print("  val  :", txt_val_path.name)
print("  test :", txt_test_path.name)

txt_train = pd.read_csv(txt_train_path)
txt_val   = pd.read_csv(txt_val_path)
txt_test  = pd.read_csv(txt_test_path)

for name, df in [("train_df", train_df), ("val_df", val_df), ("test_df", test_df)]:
    assert "listing_id" in df.columns, f"{name} is missing 'listing_id'"

for name, df in [("txt_train", txt_train), ("txt_val", txt_val), ("txt_test", txt_test)]:
    assert "listing_id" in df.columns, f"{name} is missing 'listing_id'"

print("\nChecking uniqueness of listing_id in each split...")
for name, df in [("train_df", train_df), ("val_df", val_df), ("test_df", test_df),
                 ("txt_train", txt_train), ("txt_val", txt_val), ("txt_test", txt_test)]:
    print(f"{name}: n_rows={len(df)}, n_unique listing_id={df['listing_id'].nunique()}")

# Merge text features into tabular splits
train_df = train_df.merge(txt_train, on="listing_id", how="left", validate="one_to_one")
val_df   = val_df.merge(txt_val,   on="listing_id", how="left", validate="one_to_one")
test_df  = test_df.merge(txt_test, on="listing_id", how="left", validate="one_to_one")

print("\nAfter merge:")
print("Train shape:", train_df.shape)
print("Val shape  :", val_df.shape)
print("Test shape :", test_df.shape)

# Identify text columns
TXT_EMB_COLS = [c for c in train_df.columns if c.startswith("txt_emb_")]
TXT_META_COLS = [c for c in ["txt_has", "txt_pred_log"] if c in train_df.columns]

print("Number of text embedding dims:", len(TXT_EMB_COLS))
print("Text meta features:", TXT_META_COLS)


Using text feature files:
  train: txt_features_train_spbpe_vocab16000_d128.csv
  val  : txt_features_val_spbpe_vocab16000_d128.csv
  test : txt_features_test_spbpe_vocab16000_d128.csv

Checking uniqueness of listing_id in each split...
train_df: n_rows=143643, n_unique listing_id=143643
val_df: n_rows=17955, n_unique listing_id=17955
test_df: n_rows=17956, n_unique listing_id=17956
txt_train: n_rows=143643, n_unique listing_id=143643
txt_val: n_rows=17955, n_unique listing_id=17955
txt_test: n_rows=17956, n_unique listing_id=17956

After merge:
Train shape: (143643, 157)
Val shape  : (17955, 157)
Test shape : (17956, 157)
Number of text embedding dims: 128
Text meta features: ['txt_has', 'txt_pred_log']


## 3. Define final TabNet feature lists

1. Start from the tabular numeric features from the prep summary.
2. Add:
   - text meta features (`txt_has`, `txt_pred_log`), and
   - text embedding dimensions (`txt_emb_*`).
3. Keep the original categorical features.
4. Optionally remove known leaky features (e.g., `price_per_sqft`).

These combined feature lists will be used to build the TabNet inputs.


In [None]:
# Start from numeric features derived from tabular pipeline
NUMERIC_FEATURES_BASE = list(NUMERIC_FEATURES_TAB)

# Extend numeric features with text meta + text embeddings
NUMERIC_FEATURES_TEXT = TXT_META_COLS + TXT_EMB_COLS

NUMERIC_FEATURES = NUMERIC_FEATURES_BASE + NUMERIC_FEATURES_TEXT

# Remove any known leaky features
LEAKY = {"price_per_sqft"}
NUMERIC_FEATURES = [c for c in NUMERIC_FEATURES if c not in LEAKY]

# Ensure features exist in the merged dataframes
NUMERIC_FEATURES = [c for c in NUMERIC_FEATURES if c in train_df.columns]
CATEGORICAL_FEATURES = [c for c in CATEGORICAL_FEATURES if c in train_df.columns]

FEATURES = NUMERIC_FEATURES + CATEGORICAL_FEATURES

print("Final numeric features:", len(NUMERIC_FEATURES))
print("Final categorical features:", len(CATEGORICAL_FEATURES))
print("Total features:", len(FEATURES))

print("\nExample numeric features (including text):", NUMERIC_FEATURES[:15])
print("Example categorical features:", CATEGORICAL_FEATURES[:15])


Final numeric features: 139
Final categorical features: 9
Total features: 148

Example numeric features (including text): ['beds', 'full_baths', 'half_baths', 'sqft', 'year_built', 'days_on_mls', 'lot_sqft', 'hoa_fee', 'fed_funds_rate', 'txt_has', 'txt_pred_log', 'txt_emb_000', 'txt_emb_001', 'txt_emb_002', 'txt_emb_003']
Example categorical features: ['city', 'state', 'zip_code', 'status', 'style', 'parking_garage', 'new_construction', 'stories', 'county']


## 4. Preprocessing

Apply simple preprocessing, fitting every statistic and mapping on the train split only:

1. **Numeric features**
   - Fill missing values using the **median computed on the train split**.
2. **Categorical features**
   - Convert to string, fill missing with a special token.
   - Build a per-column mapping on the train split:
     - each category → integer index
     - reserve the last index for `"__UNKNOWN__"` for unseen categories in val/test.
   - Apply the mapping to train/val/test.
3. Build `cat_idxs` and `cat_dims` in the order of `FEATURES` for TabNet.
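
A minimal sketch of these three steps on toy data, assuming a single numeric column and a single categorical column (all values are made up). It is meant to show how a category that never appears in train ends up at the reserved last index.

```python
import pandas as pd

train = pd.DataFrame({"sqft": [900.0, None, 2100.0], "city": ["Austin", "Boston", None]})
val = pd.DataFrame({"sqft": [None, 1200.0], "city": ["Chicago", "Austin"]})

# 1) Numeric: the median comes from train only, then fills both splits.
sqft_median = train["sqft"].median()
train["sqft"] = train["sqft"].fillna(sqft_median)
val["sqft"] = val["sqft"].fillna(sqft_median)

# 2) Categorical: fit the mapping on train and reserve the last index for unseen values.
train_city = train["city"].astype("string").fillna("__MISSING__")
mapping = {v: i for i, v in enumerate(sorted(train_city.unique().tolist()))}
unknown_idx = len(mapping)
mapping["__UNKNOWN__"] = unknown_idx

val_city = val["city"].astype("string").fillna("__MISSING__")
train["city"] = train_city.map(mapping).astype("int64")
val["city"] = val_city.map(lambda v: mapping.get(v, unknown_idx)).astype("int64")

# 3) TabNet metadata: position of each categorical column and its cardinality.
features = ["sqft", "city"]
cat_idxs = [features.index("city")]
cat_dims = [unknown_idx + 1]

print(val)
print("cat_idxs:", cat_idxs, "cat_dims:", cat_dims)
```

Here "Chicago" is absent from train, so it maps to the reserved index, while a missing city in train gets its own `__MISSING__` category. The cardinality passed to TabNet counts the reserved slot, so every index seen at validation or test time stays within the embedding table.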


In [None]:

def fill_numeric_with_train_median(train_df, other_dfs, numeric_cols):
    """
    Fill NaNs in numeric_cols with train medians.
    """
    med = train_df[numeric_cols].median(numeric_only=True)
    train_df[numeric_cols] = train_df[numeric_cols].fillna(med)
    for df in other_dfs:
        df[numeric_cols] = df[numeric_cols].fillna(med)
    return med


def fit_safe_category_maps(train_df, cat_cols):
    """
    Fit per-column mapping on train; reserve last index for UNKNOWN.

    Returns:
      maps: dict[col] -> dict[value(str)] -> int
      dims: dict[col] -> int (num_categories + 1 for UNKNOWN)
    """
    maps = {}
    dims = {}
    for col in cat_cols:
        s = train_df[col].astype("string").fillna("__MISSING__")
        cats = pd.Index(sorted(s.unique().tolist()))
        mapping = {v: i for i, v in enumerate(cats)}
        unknown_idx = len(mapping)
        mapping["__UNKNOWN__"] = unknown_idx
        maps[col] = mapping
        dims[col] = unknown_idx + 1  # total categories including UNKNOWN

        # Apply to train directly
        train_df[col] = s.map(mapping).astype("int64")

    return maps, dims


def apply_safe_category_maps(df, cat_cols, maps):
    """
    Apply pre-fitted maps to df; unseen values go to "__UNKNOWN__".
    """
    for col in cat_cols:
        mapping = maps[col]
        unknown_idx = mapping["__UNKNOWN__"]
        s = df[col].astype("string").fillna("__MISSING__")

        df[col] = s.map(lambda v: mapping.get(v, unknown_idx)).astype("int64")

    return df


In [None]:
# 1) Numeric imputation
_ = fill_numeric_with_train_median(train_df, [val_df, test_df], NUMERIC_FEATURES)

# 2) Categorical encoding
if CATEGORICAL_FEATURES:
    cat_maps, cat_dims_by_col = fit_safe_category_maps(train_df, CATEGORICAL_FEATURES)
    val_df  = apply_safe_category_maps(val_df,  CATEGORICAL_FEATURES, cat_maps)
    test_df = apply_safe_category_maps(test_df, CATEGORICAL_FEATURES, cat_maps)

    # Build TabNet categorical metadata aligned with the order of FEATURES
    cat_idxs = [FEATURES.index(c) for c in CATEGORICAL_FEATURES]
    cat_dims = [cat_dims_by_col[c] for c in CATEGORICAL_FEATURES]
else:
    cat_maps = {}
    cat_dims_by_col = {}
    cat_idxs = []
    cat_dims = []

print("Number of categorical features:", len(cat_idxs))
print("Example cat idxs:", cat_idxs[:10])
print("Example cat dims:", cat_dims[:10])


Number of categorical features: 9
Example cat idxs: [139, 140, 141, 142, 143, 144, 145, 146, 147]
Example cat dims: [868, 2, 613, 2, 3, 28, 3, 11, 27]


## 5. Build NumPy matrices for TabNet

Train TabNet on the **log target** (e.g., `log_sold_price`): the log transform compresses
the wide range of sale prices, which stabilizes the regression.

Keep the raw target (e.g., `sold_price`) to compute dollar metrics later.


In [None]:
X_train = train_df[FEATURES].to_numpy(dtype=np.float32)
X_val   = val_df[FEATURES].to_numpy(dtype=np.float32)
X_test  = test_df[FEATURES].to_numpy(dtype=np.float32)

y_train = train_df[TARGET_LOG_COL].to_numpy(dtype=np.float32).reshape(-1, 1)
y_val   = val_df[TARGET_LOG_COL].to_numpy(dtype=np.float32).reshape(-1, 1)
y_test  = test_df[TARGET_LOG_COL].to_numpy(dtype=np.float32).reshape(-1, 1)

print("X_train:", X_train.shape, "y_train:", y_train.shape)
print("X_val  :", X_val.shape,   "y_val  :", y_val.shape)
print("X_test :", X_test.shape,  "y_test :", y_test.shape)


X_train: (143643, 148) y_train: (143643, 1)
X_val  : (17955, 148) y_val  : (17955, 1)
X_test : (17956, 148) y_test : (17956, 1)


## 6. Configure and train TabNet

Configure TabNet with:

- decision/attention units (`n_d`, `n_a`),
- number of steps (`n_steps`),
- sparse regularization (`lambda_sparse`),
- Adam optimizer with learning rate / weight decay,
- categorical embeddings (`cat_idxs`, `cat_dims`, `cat_emb_dim`),
- early stopping on validation RMSE.

The model is trained on the joint feature vector (tabular + text embeddings).
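
To make the early-stopping setting concrete, here is a generic patience rule in plain Python. The function `early_stop_epoch` and the validation curve are hypothetical, and the sketch is not pytorch-tabnet's own implementation; it only assumes that training stops once `patience` consecutive epochs pass without a new best validation RMSE.

```python
def early_stop_epoch(val_rmse_by_epoch, patience):
    """Return (stop_epoch, best_epoch, best_value) for a patience-based stopping rule."""
    best_value = float("inf")
    best_epoch = -1
    for epoch, value in enumerate(val_rmse_by_epoch):
        if value < best_value:
            # New best validation score: remember it and reset the waiting period.
            best_value, best_epoch = value, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop here.
            return epoch, best_epoch, best_value
    # The curve ended before patience ran out.
    return len(val_rmse_by_epoch) - 1, best_epoch, best_value


# Made-up validation curve: improves until epoch 3, then plateaus.
curve = [0.40, 0.35, 0.31, 0.30, 0.32, 0.31, 0.33, 0.34]
print(early_stop_epoch(curve, patience=3))
```

Under the same rule, `patience = 30` with a best epoch of 60 would stop training at epoch 90. The metrics that matter are those of the best epoch, not of the last one.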


In [None]:
tabnet_params = dict(
    n_d=32,
    n_a=32,
    n_steps=5,
    gamma=1.5,
    n_independent=2,
    n_shared=2,
    lambda_sparse=1e-4,  # changing from 1e-4 to 5e-5 worsened the val RMSE
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=2e-2, weight_decay=1e-5),
    mask_type="entmax",
)

model = TabNetRegressor(
    **tabnet_params,
    cat_idxs=cat_idxs,
    cat_dims=cat_dims,
    cat_emb_dim=8,
    seed=0,
    verbose=10,
    device_name="cuda" if torch.cuda.is_available() else "cpu",
)

max_epochs = 200
patience = 30
batch_size = 1024
virtual_batch_size = 128

model.fit(
    X_train=X_train,
    y_train=y_train,
    eval_set=[(X_train, y_train), (X_val, y_val)],
    eval_name=["train", "val"],
    eval_metric=["rmse"],
    max_epochs=max_epochs,
    patience=patience,
    batch_size=batch_size,
    virtual_batch_size=virtual_batch_size,
    num_workers=2,
    drop_last=False,
)




epoch 0  | loss: 11.08956| train_rmse: 0.3105  | val_rmse: 0.38115 |  0:00:16s
epoch 10 | loss: 0.05369 | train_rmse: 0.21001 | val_rmse: 0.29897 |  0:02:59s
epoch 20 | loss: 0.06567 | train_rmse: 0.21397 | val_rmse: 0.29541 |  0:05:46s
epoch 30 | loss: 0.04893 | train_rmse: 0.19802 | val_rmse: 0.28201 |  0:08:26s
epoch 40 | loss: 0.05412 | train_rmse: 0.20992 | val_rmse: 0.2882  |  0:11:14s
epoch 50 | loss: 0.05011 | train_rmse: 0.31202 | val_rmse: 0.36807 |  0:14:00s
epoch 60 | loss: 0.04119 | train_rmse: 0.18633 | val_rmse: 0.2683  |  0:16:41s
epoch 70 | loss: 0.04349 | train_rmse: 0.24175 | val_rmse: 0.30959 |  0:19:24s
epoch 80 | loss: 0.04422 | train_rmse: 0.20361 | val_rmse: 0.28452 |  0:22:04s
epoch 90 | loss: 0.0454  | train_rmse: 0.20922 | val_rmse: 0.28674 |  0:24:53s

Early stopping occurred at epoch 90 with best_epoch = 60 and best_val_rmse = 0.2683




## 7. Evaluation in log space

First evaluate the model on the transformed target (log space) to compare
against other models trained on `log_sold_price`.
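
For reference, the three metrics reported below can be written out directly with NumPy. This is a small worked example on made-up log prices, intended only to show the definitions; the notebook itself relies on the scikit-learn functions.

```python
import numpy as np

y_true = np.array([12.0, 12.5, 13.0, 13.5])  # made-up log prices
y_pred = np.array([12.1, 12.4, 13.3, 13.5])  # made-up predictions

err = y_pred - y_true

# Root mean squared error: penalizes large errors more strongly.
rmse = np.sqrt(np.mean(err ** 2))

# Mean absolute error: the typical size of an error in log units.
mae = np.mean(np.abs(err))

# R^2: one minus residual sum of squares over total sum of squares.
r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"RMSE={rmse:.4f}  MAE={mae:.4f}  R2={r2:.4f}")

# In log space, an absolute error e corresponds roughly to a factor exp(e) in price.
print(f"typical multiplicative error: x{np.exp(mae):.3f}")
```

Because the target is a log price, these errors are relative rather than absolute, which is why section 8 also reports dollar-space metrics.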


In [None]:
def regression_metrics(y_true, y_pred):
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae  = mean_absolute_error(y_true, y_pred)
    r2   = r2_score(y_true, y_pred)
    return rmse, mae, r2

def eval_split(name, X, y):
    pred = model.predict(X).reshape(-1, 1)
    rmse, mae, r2 = regression_metrics(y, pred)
    print(f"{name} RMSE (log): {rmse:.3f}")
    print(f"{name} MAE  (log): {mae:.3f}")
    print(f"{name} R²         : {r2:.3f}")
    return {"rmse": float(rmse), "mae": float(mae), "r2": float(r2)}

print("=== TabNet performance (log target, tabular + text embeddings) ===")
metrics_log = {}
metrics_log["train"] = eval_split("Train", X_train, y_train)
metrics_log["val"]   = eval_split("Val",   X_val,   y_val)
metrics_log["test"]  = eval_split("Test",  X_test,  y_test)


=== TabNet performance (log target, tabular + text embeddings) ===
Train RMSE (log): 0.186
Train MAE  (log): 0.140
Train R²         : 0.926
Val RMSE (log): 0.268
Val MAE  (log): 0.193
Val R²         : 0.849
Test RMSE (log): 0.270
Test MAE  (log): 0.195
Test R²         : 0.847


## 8. Back-transform predictions to dollars

The prep summary contains both:

- the raw target (e.g., `sold_price`),
- the log-transformed target (e.g., `log_sold_price`),

but does not store whether the transform used `log` or `log1p`.

Detect which transform was used and then:

1. Back-transform predictions to dollar space.
2. Compute MAE (dollars) and MAPE (%).
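
The detection and back-transform steps can be illustrated on a few made-up prices. In this sketch the log column is built with `np.log1p`, so the comparison should pick `log1p` and `np.expm1` as its inverse; the prediction offsets are invented for the example.

```python
import numpy as np

prices = np.array([250_000.0, 480_000.0, 1_200_000.0])  # made-up sale prices
log_prices = np.log1p(prices)  # assumed transform for this example

# Compare both candidate transforms against the stored log column.
diff_log = np.mean(np.abs(log_prices - np.log(prices)))
diff_log1p = np.mean(np.abs(log_prices - np.log1p(prices)))
kind = "log" if diff_log <= diff_log1p else "log1p"
inverse = np.exp if kind == "log" else np.expm1
print("detected:", kind)

# Pretend model output: log-space predictions that are off by +0.1, -0.2 and 0.0.
pred_log = log_prices + np.array([0.1, -0.2, 0.0])
pred = inverse(pred_log)

# Dollar-space metrics; the denominator is clamped to avoid division by zero.
mae = np.mean(np.abs(prices - pred))
mape = np.mean(np.abs(prices - pred) / np.maximum(np.abs(prices), 1.0)) * 100.0
print(f"MAE ($): {mae:,.0f}   MAPE (%): {mape:.2f}")
```

For prices in the hundreds of thousands, `log` and `log1p` differ by roughly `1 / price`, so the two candidates are separated only by a very small mean absolute difference. Comparing the two differences against each other is more robust than testing one of them against a fixed threshold.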


In [None]:
def detect_log_transform(df, raw_col, log_col, n=2000):
    """
    Heuristically detect whether log_col ≈ log(raw_col) or log1p(raw_col).
    """
    sub = df[[raw_col, log_col]].dropna()
    # Cap the sample size by the rows left after dropna, otherwise sample() can raise.
    sub = sub.sample(min(n, len(sub)), random_state=0)
    raw = sub[raw_col].to_numpy(dtype=np.float64)
    logv = sub[log_col].to_numpy(dtype=np.float64)

    # If raw has non-positive values, log(raw) is invalid; fall back to log1p
    if np.any(raw <= 0):
        diff_log1p = np.nanmean(np.abs(logv - np.log1p(np.maximum(raw, 0))))
        return "log1p", diff_log1p

    diff_log  = np.nanmean(np.abs(logv - np.log(raw)))
    diff_log1p = np.nanmean(np.abs(logv - np.log1p(raw)))
    if diff_log <= diff_log1p:
        return "log", diff_log
    else:
        return "log1p", diff_log1p

log_kind, err = detect_log_transform(train_df, TARGET_RAW_COL, TARGET_LOG_COL)
print("Detected log transform:", log_kind, "| mean abs diff:", err)

inv = (np.exp if log_kind == "log" else np.expm1)

def dollar_metrics(name, X, df):
    y_true = df[TARGET_RAW_COL].to_numpy(dtype=np.float64)
    y_pred_log = model.predict(X).reshape(-1)
    y_pred = inv(y_pred_log)

    mae = mean_absolute_error(y_true, y_pred)
    denom = np.maximum(np.abs(y_true), 1.0)
    mape = np.mean(np.abs((y_true - y_pred) / denom)) * 100.0

    print(f"{name} MAE ($)  : {mae:,.0f}")
    print(f"{name} MAPE (%) : {mape:.2f}")
    return {"mae_dollars": float(mae), "mape_pct": float(mape)}

print("\n=== Back-transformed ($) metrics ===")
metrics_dollars = {}
metrics_dollars["val"]  = dollar_metrics("Val",  X_val,  val_df)
metrics_dollars["test"] = dollar_metrics("Test", X_test, test_df)


Detected log transform: log1p | mean abs diff: 5.337952302397752e-16

=== Back-transformed ($) metrics ===
Val MAE ($)  : 94,806
Val MAPE (%) : 20.33
Test MAE ($)  : 95,425
Test MAPE (%) : 20.58


## 9. Global feature importance

TabNet exposes **feature importances** that sum to 1 across all features.

List the top features across:

- tabular numeric features,
- categorical features,
- text meta and embedding dimensions.
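
One possible way to total the importances per group is a `groupby` over a group label, sketched here on made-up values. The helper `feature_group` and the toy importances are hypothetical; the grouping rules follow the column-name conventions used in this notebook.

```python
import pandas as pd

# Made-up importances that sum to 1, mimicking the shape of feature_importances_.
toy = pd.DataFrame({
    "feature": ["sqft", "city", "txt_has", "txt_emb_000", "txt_emb_001"],
    "importance": [0.30, 0.10, 0.05, 0.35, 0.20],
})


def feature_group(name):
    """Assign a feature to one of three groups based on its column name."""
    if name.startswith("txt_emb_"):
        return "text_emb"
    if name in {"txt_has", "txt_pred_log"}:
        return "text_meta"
    return "tabular"


# Sum the importances within each group.
by_group = toy.groupby(toy["feature"].map(feature_group))["importance"].sum()
print(by_group)
```

Since the per-feature importances sum to 1, the group totals can be read as shares of the model's total attribution.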


In [None]:
fi = model.feature_importances_
fi_df = pd.DataFrame({"feature": FEATURES, "importance": fi}).sort_values(
    "importance", ascending=False
)

print("Top 30 features by importance:")
fi_df.head(30)


Top 30 features by importance:


|     | feature        | importance |
|-----|----------------|------------|
| 118 | txt_emb_107    | 0.2211581  |
| 55  | txt_emb_044    | 0.1897342  |
| 1   | full_baths     | 0.120699   |
| 67  | txt_emb_056    | 0.1047303  |
| 101 | txt_emb_090    | 0.07594568 |
| 8   | fed_funds_rate | 0.03986638 |
| 30  | txt_emb_019    | 0.03798416 |
| 64  | txt_emb_053    | 0.03622269 |
| 123 | txt_emb_112    | 0.0326599  |
| 86  | txt_emb_075    | 0.02988683 |


In [None]:
fi = model.feature_importances_
fi_df = pd.DataFrame({"feature": FEATURES, "importance": fi})

def group_importance(df, group_name, mask):
    return df.loc[mask, "importance"].sum()

is_text_emb  = fi_df["feature"].str.startswith("txt_emb_")
is_text_meta = fi_df["feature"].isin(["txt_has", "txt_pred_log"])
is_tabular   = ~(is_text_emb | is_text_meta)

print("Total importance (tabular):    ", group_importance(fi_df, "tabular", is_tabular))
print("Total importance (text meta): ", group_importance(fi_df, "text_meta", is_text_meta))
print("Total importance (text emb):  ", group_importance(fi_df, "text_emb", is_text_emb))


Total importance (tabular):     0.16792966061022477
Total importance (text meta):  0.0
Total importance (text emb):   0.8320703393897751
