# PRISM-SLAM — Hybrid Scene-Coordinate + Primitive-Shape Mapping
**A Colab-style short paper with dataset & model I/O specs (Updated)**  
*Draft — August 2025*


## Abstract
We propose a hybrid visual localization and mapping approach that combines (i) a **scene-coordinate head** for geometry-tight 6-DoF pose estimation via PnP/RANSAC, and (ii) a **primitive-shape head** that predicts a small set of parametric 3D shapes (planes, cuboids, cylinders, superquadrics). The result is a **compact global map** that is well-suited for navigation and planning, without photorealistic detail or large point clouds.
We provide a concrete dataset format, precise **model inputs/outputs**, loss definitions, training schedule, and evaluation metrics. The notebook includes reference code stubs for dataset validation, primitive parameterization, silhouette rasterization ideas, and set-matching (Hungarian fallback).


## 0. Motivation — Realtime World Model (Monocular)
We target a **realtime world model** built from **monocular RGB**, producing both the **camera pose** and a **compact map of static geometry** while being **robust to dynamic objects**.

We propose a **progressive, anytime pipeline** with **multi-level outputs** emitted at different points during a single forward pass:

- **Level 0 — Coarse anti-collision (early exit, highest FPS)**: obstacle likelihood / free-space in a coarse frustum or BEV, plus a dynamic/static mask.  
- **Level 1 — Medium resolution + classification + map registration**: primitives (planes/cuboids/cylinders/superquadrics) with **class labels**, and a **pose update** (scene-coordinates → PnP).  
- **Level 2 — High resolution identification (lowest FPS)**: fine-grained categories or attributes within class, shape refinements, and optional instance tracking.

Levels have **different update rates** (e.g., L0 ~ 30–60 Hz, L1 ~ 10–20 Hz, L2 ~ 2–10 Hz). The model **emits them progressively** via **early-exit heads**; downstream logic consumes the freshest available level.


## 0.1 Progressive Multi-Level Outputs — Is it possible?
**Yes.** Implement with **early-exit / anytime prediction** over a shared backbone:

- **Backbone + FPN** (strides 4/8/16, plus 32 if Head L0 taps the coarsest level).  
- **Head L0 (early):** at stride 16/32 → small decoder → coarse occupancy/free-space + dynamic mask.  
- **Head L1 (mid):** at stride 8 → set-prediction primitives (+ classes) and **scene-coordinates** + uncertainty for **PnP/DSAC**.  
- **Head L2 (late):** at stride 4 → refinements (primitive params, sub-class IDs, attributes).

**Training**: anytime distillation (L0/L1 mimic L2), loss balancing (uncertainty/GradNorm), dynamic masking, latency budgeting.  
**Runtime**: emit **L0 ASAP** for safety, **L1** to update pose & map, **L2** opportunistically.

### Pseudocode sketch
```python
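# Sketch only: backbone, head_l*, solve_pnp, emit and budget_allows are placeholders.
# For a true early exit, the backbone would yield p16 before the finer levels are computed.
features = backbone(rgb)  # dict of FPN levels keyed by stride: "p4", "p8", "p16"
l0 = head_l0(features["p16"])
emit("L0", l0)  # coarse occupancy/free-space + dynamic mask, emitted first
l1 = head_l1(features["p8"])
pose = solve_pnp(l1.scene_coords, K)  # scene coordinates -> PnP pose
emit("pose", pose)
emit("L1", l1)
if budget_allows():
    l2 = head_l2(features["p4"])
    emit("L2", l2)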
```


## 1. Introduction
Traditional SLAM often stores large **point clouds** or **voxel volumes**. For embedded navigation, we target a **compact map** made of **few parametric shapes** while keeping tight pose accuracy via **scene-coordinates** (PnP/DSAC). Dynamic objects are masked out so that only static geometry enters the map.


## 2. Related Work (brief)
Related threads: scene-coordinate regression (DSAC/DSAC++), primitive-based mapping (planes, cuboids, cylinders, superquadrics), set prediction (DETR), and loop closure with pose graphs.


## 3. Approach Overview
Shared backbone → two heads:
1) **Scene-Coord Head:** sparse 2D→3D with uncertainty → PnP(+RANSAC/DSAC) for pose.  
2) **Primitive Set Head:** predicts planes/cuboids/cylinders/superquadrics (camera frame), transformed to world with pose; data-associated into a **shape-graph** map. Loop closures via retrieval → PGO (Sim(3) for mono).
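
As a rough illustration of the camera-to-world step for one primitive type, the sketch below moves a plane between frames. It rests on two assumptions that the text does not fix: the plane is written as `n·x + d = 0`, and the 4x4 matrix passed in maps camera coordinates to world coordinates (`x_w = R x_c + t`). If the stored `T_cw` uses the opposite direction, it would need to be inverted first. The function name `plane_cam_to_world` is hypothetical.

```python
import numpy as np

def plane_cam_to_world(n_c, d_c, T_wc):
    """Map a plane n.x + d = 0 from camera to world coordinates.

    Assumption: T_wc is a 4x4 rigid camera-to-world transform (x_w = R x_c + t).
    """
    T_wc = np.asarray(T_wc, dtype=np.float64)
    R, t = T_wc[:3, :3], T_wc[:3, 3]
    n_w = R @ np.asarray(n_c, dtype=np.float64)  # normals rotate with R
    d_w = float(d_c) - float(n_w @ t)            # offset shifts by the translation
    return n_w, d_w

# Example: camera sits 2 m above the world origin, no rotation.
T_wc = np.eye(4)
T_wc[:3, 3] = [0.0, 0.0, 2.0]
# Plane z_c = -2 in the camera frame, i.e. n = (0, 0, 1), d = 2.
n_w, d_w = plane_cam_to_world([0.0, 0.0, 1.0], 2.0, T_wc)
print(n_w, d_w)  # the same plane is z_w = 0 in the world frame
```

Cuboids, cylinders and superquadrics would follow the same pattern, with centers transformed as points and orientations composed with `R`.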


## 4. Dataset Specification
- **Camera (OpenCV):** +x right, +y down, +z forward.  **World (ENU):** +X east, +Y north, +Z up.
- **Per-frame meta** (`meta/{frame}.json`): `frame_id`, `rgb_path`, optional `depth_path`, `dyn_mask_path`, `K (3x3)`, `T_cw (4x4)`, `scene_id`, `split`.
- **Scene-coordinates** (`scene_coords/{frame}.npz`): `uv (N,2)`, `XYZ (N,3)`, `valid (N,)`, optional `cov (N,3,3)`.
- **Primitives** (`primitives/{frame}.json` or per-scene): items with `type` ∈ {plane,cuboid,cylinder,superquadric} and their parameters.
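
A shape check for the scene-coordinate arrays might look like the sketch below. It only encodes the shapes listed above; the function name `check_scene_coords` and the finite-value check are additions made for illustration, not part of the dataset spec. It accepts any mapping, so the object returned by `np.load` on an `.npz` file should work as well as a plain dict.

```python
import numpy as np

def check_scene_coords(arrays):
    """Return a list of problems in a scene-coordinate record (empty if none)."""
    errs = [f"Missing {k}" for k in ("uv", "XYZ", "valid") if k not in arrays]
    if errs:
        return errs
    uv, xyz, valid = (np.asarray(arrays[k]) for k in ("uv", "XYZ", "valid"))
    if valid.ndim != 1:
        return ["valid must have shape (N,)"]
    n = valid.shape[0]
    if uv.shape != (n, 2):
        errs.append(f"uv must have shape ({n}, 2), got {uv.shape}")
    if xyz.shape != (n, 3):
        errs.append(f"XYZ must have shape ({n}, 3), got {xyz.shape}")
    if "cov" in arrays and np.asarray(arrays["cov"]).shape != (n, 3, 3):
        errs.append(f"cov must have shape ({n}, 3, 3)")
    # Extra sanity check (an assumption): valid points should be finite.
    if not errs and not np.isfinite(xyz[valid.astype(bool)]).all():
        errs.append("non-finite XYZ among valid points")
    return errs

record = {
    "uv": np.zeros((4, 2), np.float32),
    "XYZ": np.ones((4, 3), np.float32),
    "valid": np.array([1, 1, 0, 1], bool),
}
bad = dict(record, XYZ=np.ones((3, 3), np.float32))  # wrong point count
print(check_scene_coords(record))
print(check_scene_coords(bad))
```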


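```python
# Lightweight JSON schema validators (example)
FRAME_META_SCHEMA = {
    "type": "object",
    "required": ["frame_id", "rgb_path", "K", "T_cw", "scene_id"],
}
PRIMITIVE_SCHEMA = {
    "type": "object",
    "required": ["type", "params"],
}
PRIMITIVE_TYPES = {"plane", "cuboid", "cylinder", "superquadric"}

def _missing_keys(obj, schema):
    """Return one error string per required key absent from obj."""
    return [f"Missing {k}" for k in schema["required"] if k not in obj]

def validate_frame_meta(meta):
    """Check required per-frame keys; returns (ok, errors)."""
    errs = _missing_keys(meta, FRAME_META_SCHEMA)
    return len(errs) == 0, errs

def validate_primitive(p):
    """Check required keys and that 'type' is a known primitive; returns (ok, errors)."""
    errs = _missing_keys(p, PRIMITIVE_SCHEMA)
    if "type" in p and p["type"] not in PRIMITIVE_TYPES:
        errs.append(f"Unknown primitive type: {p['type']}")
    return len(errs) == 0, errs
```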


```python
# Primitive helpers (stubs)
import numpy as np

def _norm(v, eps=1e-8):
    """Return v scaled to (approximately) unit length as float32."""
    v = np.asarray(v, np.float32)
    n = np.linalg.norm(v) + eps  # eps avoids division by zero
    return (v / n).astype(np.float32)

def plane_from_params(n, d):
    return _norm(n), float(d)

def cuboid_from_params(c, s, q):
    # Sizes are clamped to stay positive; q is renormalized.
    return np.asarray(c, np.float32), np.maximum(np.asarray(s, np.float32), 1e-3), _norm(q)

def cylinder_from_params(p, v, r, h):
    return np.asarray(p, np.float32), _norm(v), float(max(r, 1e-5)), float(max(h, 1e-5))

def superquadric_from_params(c, q, a, e):
    return (np.asarray(c, np.float32), _norm(q),
            np.maximum(np.asarray(a, np.float32), 1e-4),
            np.maximum(np.asarray(e, np.float32), 0.1))
```


## 5. Model: Inputs & Outputs
**Inputs**: RGB (B,3,H,W); K (B,3,3); optional dynamic mask.  
**Head A (scene-coords)**: predict K sparse 3D points + uncertainty (here K is the point count, not the intrinsics matrix K) → **PnP/DSAC** → pose T_cw.  
**Head B (primitives)**: M slots, set-prediction (types + params + conf), predicted in camera frame → transformed to world via T_cw.
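
Set prediction needs a one-to-one assignment between the M predicted slots and the ground-truth primitives before the losses are computed. The sketch below shows one way this could be wired, under assumptions: it uses SciPy's `linear_sum_assignment` when SciPy is installed and otherwise falls back to a greedy matcher. The cost matrix values, `greedy_match` and `match_slots` are all hypothetical, and the greedy fallback is not guaranteed to be optimal.

```python
import numpy as np

try:
    from scipy.optimize import linear_sum_assignment
except ImportError:  # SciPy not installed: use the greedy fallback below
    linear_sum_assignment = None

def greedy_match(cost):
    """Greedy assignment: repeatedly take the cheapest unused (slot, gt) pair."""
    cost = np.asarray(cost, dtype=np.float64)
    pairs, used_rows, used_cols = [], set(), set()
    for flat in np.argsort(cost, axis=None):
        r, c = divmod(int(flat), cost.shape[1])
        if r not in used_rows and c not in used_cols:
            pairs.append((r, c))
            used_rows.add(r)
            used_cols.add(c)
    return sorted(pairs)

def match_slots(cost):
    """Return sorted (slot, gt) pairs; unmatched slots are simply absent."""
    if linear_sum_assignment is not None:
        rows, cols = linear_sum_assignment(np.asarray(cost, dtype=np.float64))
        return sorted(zip(rows.tolist(), cols.tolist()))
    return greedy_match(cost)

# 3 predicted slots vs 2 ground-truth primitives (illustrative costs).
cost = [[0.9, 0.1],
        [0.2, 0.8],
        [0.7, 0.6]]
print(match_slots(cost))  # slot 2 stays unmatched and would be trained as "no object"
```

In practice the cost would combine the type classification term and a parameter distance, in the manner of DETR-style matching.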


## 6. Losses
- Scene-coords: ℓ1 loss on XYZ + reprojection error; uncertainty weighting; DSAC pose loss.  
- Primitives: classification CE; parameter ℓ1/ℓ2 loss; Chamfer on sampled surfaces; silhouette/occupancy BCE/IoU; regularizers (unit-norm, positivity).  
- Total: weighted sum with GradNorm/uncertainty balancing.
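
For the Chamfer term, a brute-force NumPy version is sketched below. It assumes the squared-distance variant, averaged in each direction and summed; other definitions (unsquared, or with a max instead of a sum) are also common, and the text does not say which one is intended. A training loop would use a differentiable framework rather than NumPy.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance (squared-distance variant) between
    point sets a (N,3) and b (M,3)."""
    a = np.asarray(a, np.float64)
    b = np.asarray(b, np.float64)
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)  # (N, M) pairwise squared distances
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

# Points sampled on a unit square in z = 0, and the same square lifted by 0.1.
square = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], np.float64)
lifted = square + np.array([0.0, 0.0, 0.1])
print(chamfer_distance(square, square))  # identical sets give zero
print(chamfer_distance(square, lifted))  # about 2 * 0.1**2
```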


## 7. Training
AdamW (lr 2e-4, weight decay 1e-2), cosine LR schedule; 150–200k iterations; batch size 8–16 at 512×512 resolution.  
Augment: color jitter, noise/blur, intrinsics jitter; dynamic masks.  
Curriculum: planes/cuboids first, then cylinders/superquadrics.  
Anytime distillation for L0/L1 from L2 if present.
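
The cosine schedule could be written as a plain function of the step count, as sketched below. The base rate 2e-4 comes from the settings above; the optional `warmup_steps` and `min_lr` parameters are assumptions added for illustration and default to having no effect.

```python
import math

def cosine_lr(step, total_steps, base_lr=2e-4, warmup_steps=0, min_lr=0.0):
    """Learning rate at `step` for cosine decay from base_lr to min_lr.

    warmup_steps and min_lr are illustrative extras, not taken from the text.
    """
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps  # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(max(progress, 0.0), 1.0)  # clamp past the end of training
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

total = 150_000
for step in (0, total // 2, total):
    print(step, cosine_lr(step, total))
```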


## 8. Inference & Mapping
Per frame: L0 is emitted early for anti-collision; L1 updates pose and primitives; L2 refinements run if the latency budget allows.  
Data-associate primitives to a **shape-graph**; loop closures via retrieval → PGO; optional coarse ESDF for planners.


## 9. Evaluation
Pose: ATE/RPE.  Primitives: Chamfer/IoU/normal error; map size (bytes).  
Navigation: collision-free rate; ESDF coverage.  Ablations: dynamic masking off, no superquadrics, etc.
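
For ATE, the sketch below aligns the estimated trajectory to ground truth with a closed-form similarity (Umeyama) fit and reports the RMSE of the position residuals. Using a Sim(3) alignment here is an assumption motivated by the monocular scale ambiguity; an SE(3) alignment would fix the scale at 1. The function names are hypothetical and the inputs are camera positions only, shape (N,3).

```python
import numpy as np

def umeyama_sim3(src, dst):
    """Closed-form scale s, rotation R, translation t minimizing ||dst - (s R src + t)||."""
    src = np.asarray(src, np.float64)
    dst = np.asarray(dst, np.float64)
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    sc, dc = src - mu_s, dst - mu_d
    cov = dc.T @ sc / len(src)                # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                        # guard against a reflection
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / (sc ** 2).sum(axis=1).mean()
    t = mu_d - s * (R @ mu_s)
    return s, R, t

def ate_rmse(est, gt):
    """RMSE of position error after Sim(3) alignment of est onto gt."""
    est = np.asarray(est, np.float64)
    gt = np.asarray(gt, np.float64)
    s, R, t = umeyama_sim3(est, gt)
    aligned = s * (est @ R.T) + t
    return float(np.sqrt(((aligned - gt) ** 2).sum(axis=1).mean()))

# Synthetic check: the estimate is the ground truth rotated, shrunk by 2x and shifted.
gt = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1], [0, 1, 1]], np.float64)
c, s_ = np.cos(np.pi / 6), np.sin(np.pi / 6)
Rz = np.array([[c, -s_, 0.0], [s_, c, 0.0], [0.0, 0.0, 1.0]])
est = 0.5 * (gt @ Rz.T) + np.array([1.0, 2.0, 3.0])
print(ate_rmse(est, gt))  # near zero, since the distortion is an exact similarity
```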


## 10. Repro Checklist
- Export RGB/K/T_cw, dynamic masks, scene-coords, primitive GT.  
- Validate JSONs; train with M=32 primitive slots, K=2k scene-coordinate points, H=W=512.  
- Report pose + primitive metrics + map size.


```python
# Example: validate a tiny meta + primitive JSON in-memory
meta = {"frame_id": 1, "rgb_path": "images/000001.png", "scene_id": "scene_000",
        "K": [[600, 0, 320], [0, 600, 240], [0, 0, 1]],
        "T_cw": [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]}
ok, errs = validate_frame_meta(meta)
print("meta valid:", ok, errs)
p = {"type": "plane", "params": {"n": [0, 0, 1], "d": 0.0, "sx": 20.0, "sy": 20.0}}
ok, errs = validate_primitive(p)
print("primitive valid:", ok, errs)
```
