# Hybrid Scene-Coordinate + Primitive-Shape Mapping for Compact, Navigation-Grade SLAM
**A Colab-style short paper with dataset & model I/O specs**  
*Draft — August 2025*


## Abstract
We propose a hybrid visual localization and mapping approach that combines (i) a **scene-coordinate head** for geometry-tight 6-DoF pose estimation via PnP/RANSAC, and (ii) a **primitive-shape head** that predicts a small set of parametric 3D shapes (planes, cuboids, cylinders, superquadrics). The result is a **compact global map** that is well-suited for navigation and planning, without photorealistic detail or large point clouds.
We provide a concrete dataset format (files, JSON schemas), precise **model inputs/outputs**, loss definitions, training schedule, and evaluation metrics. The notebook also includes reference code stubs for dataset validation, primitive parameterization, surface sampling with Chamfer distance, and set matching (Hungarian, with a greedy fallback).


## 1. Introduction
Traditional structure-from-motion and SLAM systems often accumulate large **point clouds** or **voxel volumes**.
For embedded navigation, we target a **compact map** made of **few parametric shapes** while keeping tight pose accuracy.
We adopt a **hybrid** design:
(1) a **Scene-Coordinate Regression** head predicts sparse 2D→3D correspondences to recover pose with PnP(+RANSAC/DSAC), and
(2) a **Set-Prediction Primitive** head outputs a small number of 3D shapes that fuse over time into a **shape-graph** map.
Dynamic objects are masked and excluded from mapping. The final map is tiny (dozens–hundreds of parameters per scene) yet supports planning (free-space, clearance).


## 2. Related Work (very brief)
- **Scene Coordinates & DSAC/DSAC++**: Learn pixel→world 3D mappings enabling PnP; strong for precise pose.
- **Primitive-based Mapping**: Plane/Cuboid/Cylinder fitting (e.g., PlaneRCNN, CubeSLAM), and shape abstraction via superquadrics/quadric assemblies for compact modeling.
- **Set Prediction**: DETR-style bipartite matching for variable-cardinality object sets, used here for 3D geometric primitives.


## 3. Approach Overview
At each frame, given RGB, intrinsics **K**, and a dynamic mask (optional), the shared backbone produces features for two heads:
1) **Scene-Coord Head (pose):** predicts 3D world coordinates for a sparse set of pixels and per-point uncertainty. Pose is estimated with PnP + (differentiable) RANSAC (DSAC++).  
2) **Primitive Head (map):** predicts up to **M** parametric shapes with a set-prediction head. Each primitive is in the **camera frame** and then transformed to **world** via the estimated pose.
Predicted shapes are associated with a persistent **shape-graph**, their parameters are refined, and loop closure is performed with image retrieval and pose-graph optimization (Sim(3) for monocular).


## 4. Dataset Specification

### 4.1 Coordinate Conventions
- **Camera (OpenCV):** +x right, +y down, +z forward.  
- **World (ENU):** +X east, +Y north, +Z up.  
- We store poses as **T_cw**, the camera pose in the world frame (it maps camera-frame points to world: `X_w = T_cw · X_c`), unless stated otherwise.
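The conventions above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming `T_cw` maps camera-frame points into the world frame, so projecting a world point needs its inverse. The helper names `cam_to_world` and `world_to_pixel` are hypothetical and not part of the notebook.

```python
import numpy as np

def cam_to_world(T_cw, X_c):
    """Map camera-frame points (N,3) to the world frame with T_cw (camera pose in world)."""
    X_c = np.asarray(X_c, dtype=np.float64)
    return X_c @ T_cw[:3, :3].T + T_cw[:3, 3]

def world_to_pixel(K, T_cw, X_w):
    """Project world points (N,3) to pixels (N,2); also returns depth along camera +z."""
    R, t = T_cw[:3, :3], T_cw[:3, 3]
    X_c = (np.asarray(X_w, dtype=np.float64) - t) @ R   # row-vector form of R^T (X_w - t)
    uvw = X_c @ K.T
    return uvw[:, :2] / uvw[:, 2:3], X_c[:, 2]

# Example pose: camera at ENU position (1, 2, 3), looking east.
# Columns of R are the camera axes in world: +x right = south, +y down = -Z, +z forward = east.
T_example = np.eye(4)
T_example[:3, :3] = [[0, 0, 1], [-1, 0, 0], [0, -1, 0]]
T_example[:3, 3] = [1.0, 2.0, 3.0]
K_example = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
```

With this pose, a point 5 m ahead on the optical axis lands 5 m east of the camera and projects to the principal point.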

### 4.2 Directory Layout
```
dataset_root/
  scene_000/
    images/{frame:06d}.png
    depth/{frame:06d}.exr            # optional, meters
    dyn_mask/{frame:06d}.png         # optional, 1=dynamic
    meta/{frame:06d}.json            # per-frame metadata (K, T_cw, etc.)
    primitives/{frame:06d}.json      # per-frame primitive GT (optional if per-scene file used)
  scene_000_primitives.json          # optional: per-scene GT primitives
  schema/                            # JSON schemas (frames, primitives)
```

### 4.3 Per-Frame Metadata (`meta/{frame}.json`)
```json
{
  "frame_id": 123,
  "timestamp": 1725000000.033,
  "rgb_path": "images/000123.png",
  "depth_path": "depth/000123.exr",
  "dyn_mask_path": "dyn_mask/000123.png",
  "K": [[fx,0,cx],[0,fy,cy],[0,0,1]],
  "T_cw": [[r11,r12,r13,tx],[r21,r22,r23,ty],[r31,r32,r33,tz],[0,0,0,1]],
  "scene_id": "scene_000",
  "split": "train",
  "notes": ""
}
```

### 4.4 Scene-Coordinate GT (sparse)
- For a sparse grid (e.g., every 4th pixel) or a fixed number of sampled pixels per frame (denoted K in later sections, not to be confused with the intrinsics **K**), store valid 2D→3D pairs and (optional) per-point covariance.
- File: `scene_coords/{frame}.npz` with arrays:
  - `uv` : shape (N,2) float32 pixel coords
  - `XYZ`: shape (N,3) float32 world points
  - `valid`: shape (N,) uint8
  - `cov`: shape (N,3,3) float32 (optional)
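
A minimal sketch of writing and reading this file with shape checks follows. The array names and dtypes come from the list above; the function names are hypothetical, and the choice to raise `ValueError` on a shape mismatch is an assumption.

```python
import numpy as np

def check_scene_coords(d):
    """Raise ValueError if the arrays do not have the shapes listed in Section 4.4."""
    n = d["uv"].shape[0]
    expected = {"uv": (n, 2), "XYZ": (n, 3), "valid": (n,)}
    if "cov" in d:                      # covariance is optional
        expected["cov"] = (n, 3, 3)
    for name, shape in expected.items():
        if d[name].shape != shape:
            raise ValueError(f"{name} has shape {d[name].shape}, expected {shape}")
    return d

def save_scene_coords(file, uv, XYZ, valid, cov=None):
    """Write sparse scene-coordinate GT to an .npz path or a file-like object."""
    d = {"uv": np.asarray(uv, np.float32),
         "XYZ": np.asarray(XYZ, np.float32),
         "valid": np.asarray(valid, np.uint8)}
    if cov is not None:
        d["cov"] = np.asarray(cov, np.float32)
    check_scene_coords(d)
    np.savez_compressed(file, **d)

def load_scene_coords(file):
    """Read the arrays back into a plain dict and re-check their shapes."""
    with np.load(file) as z:
        return check_scene_coords({k: z[k] for k in z.files})
```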

### 4.5 Primitive Ground Truth
A primitive is a dict with at least: `id`, `type`, `params`, `frame_id` (if per-frame) or without `frame_id` if per-scene static.

**Supported primitive types and parameters (world frame unless noted):**
- `plane`: normal (nx,ny,nz), offset `d` (s.t. n·X + d = 0), in-plane extents `(sx, sy)` and orientation for extents `(u,v)` if not axis-aligned.
- `cuboid`: center `c=(cx,cy,cz)`, size `s=(sx,sy,sz)`, rotation `q=(qw,qx,qy,qz)`.
- `cylinder`: point on axis `p`, axis direction `v` (unit), radius `r`, height `h`.
- `superquadric`: center `c`, rotation `q`, semi-axes `a=(a1,a2,a3)`, exponents `e=(eps1,eps2)`.
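
For the superquadric, the standard inside-outside function gives a quick way to test whether a point lies inside the shape. The sketch below assumes the common convention in which `eps1` shapes the profile along the local z axis and `eps2` shapes the xy cross-section; that ordering is an assumption, not something this document fixes. Rotation `q` is omitted for brevity, and `superquadric_F` is a hypothetical name.

```python
import numpy as np

def superquadric_F(X, c, a, e):
    """Inside-outside value for points X (N,3) of an axis-aligned superquadric.

    F < 1 inside, F == 1 on the surface, F > 1 outside.
    c: center (3,), a: semi-axes (a1, a2, a3), e: exponents (eps1, eps2).
    """
    local = np.abs(np.asarray(X, np.float64) - np.asarray(c, np.float64))
    x, y, z = (local / np.asarray(a, np.float64)).T
    e1, e2 = e
    return (x ** (2.0 / e2) + y ** (2.0 / e2)) ** (e2 / e1) + z ** (2.0 / e1)
```

With `e=(1,1)` the shape reduces to an ellipsoid, which makes a convenient sanity check.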

**Example `primitives/{frame}.json`:**
```json
{
  "frame_id": 123,
  "primitives": [
    {"id":"pl_0","type":"plane",
     "params":{"n":[0,0,1],"d":-0.0,"sx":20.0,"sy":20.0}},
    {"id":"cb_1","type":"cuboid",
     "params":{"c":[5.0,2.0,0.0],"s":[4.0,3.0,10.0],"q":[1,0,0,0]}}
  ]
}
```

**Optional per-scene GT (`scene_XXX_primitives.json`)** for static structures:
```json
{
  "scene_id":"scene_000",
  "primitives":[ ... ]  // same structure as above, no frame_id
}
```

### 4.6 JSON Schemas
We provide JSON Schemas in this notebook to validate `meta/*.json` and `primitives/*.json` during data prep.

### 4.7 Dynamic Masks
Binary mask (1 = dynamic). Masked pixels are ignored when sampling scene coordinates, and primitives are not fitted to moving objects.
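
As a minimal sketch, sparse samples can be filtered against the mask as below. It assumes `uv` stores `(u, v) = (column, row)` in pixels and that rounding to the nearest pixel is acceptable; `keep_static` is a hypothetical helper.

```python
import numpy as np

def keep_static(uv, dyn_mask):
    """Return a boolean (N,) array that is True where (u, v) hits a static pixel (mask == 0)."""
    uv = np.asarray(uv, dtype=np.float64)
    H, W = dyn_mask.shape
    cols = np.clip(np.round(uv[:, 0]).astype(int), 0, W - 1)   # u indexes columns
    rows = np.clip(np.round(uv[:, 1]).astype(int), 0, H - 1)   # v indexes rows
    return dyn_mask[rows, cols] == 0
```

The returned array can be combined with the `valid` flags of Section 4.4, for example `valid & keep_static(uv, mask)`.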


In [None]:
# (Optional) Colab: install extras if needed
# !pip install numpy pillow shapely==2.0.4
# If using SciPy for Hungarian: !pip install scipy
# If using Pydantic:           !pip install pydantic


In [None]:
import json, os, math, numpy as np
from typing import Dict, List, Tuple, Optional

# ---- JSON Schemas (lightweight) ----

FRAME_META_SCHEMA = {
    "type": "object",
    "required": ["frame_id","rgb_path","K","T_cw","scene_id"],
    "properties": {
        "frame_id": {"type":"integer"},
        "timestamp": {"type":["number","integer"]},
        "rgb_path": {"type":"string"},
        "depth_path": {"type":["string","null"]},
        "dyn_mask_path": {"type":["string","null"]},
        "K": {"type":"array","minItems":3,"maxItems":3},
        "T_cw": {"type":"array","minItems":4,"maxItems":4},
        "scene_id": {"type":"string"},
        "split": {"type":"string"}
    }
}

PRIMITIVE_SCHEMA = {
    "type": "object",
    "required": ["type","params"],
    "properties": {
        "id": {"type":["string","null"]},
        "type": {"type":"string","enum":["plane","cuboid","cylinder","superquadric"]},
        "params": {"type":"object"}
    }
}

def validate_frame_meta(meta: Dict) -> Tuple[bool, List[str]]:
    errs = []
    for k in FRAME_META_SCHEMA["required"]:
        if k not in meta: errs.append(f"Missing required key: {k}")
    if "K" in meta:
        ok = isinstance(meta["K"], list) and len(meta["K"])==3 and all(isinstance(r,list) and len(r)==3 for r in meta["K"])
        if not ok: errs.append("K must be 3x3 list")
    if "T_cw" in meta:
        ok = isinstance(meta["T_cw"], list) and len(meta["T_cw"])==4 and all(isinstance(r,list) and len(r)==4 for r in meta["T_cw"])
        if not ok: errs.append("T_cw must be 4x4 list")
    return (len(errs)==0, errs)

def validate_primitive(p: Dict) -> Tuple[bool, List[str]]:
    errs = []
    if "type" not in p or "params" not in p:
        return False, ["Primitive requires 'type' and 'params'"]
    t = p["type"]
    params = p["params"]
    if t=="plane":
        for k in ["n","d","sx","sy"]:
            if k not in params: errs.append(f"plane missing {k}")
        if "n" in params and len(params["n"])!=3: errs.append("plane n must be len-3")
    elif t=="cuboid":
        for k in ["c","s","q"]:
            if k not in params: errs.append(f"cuboid missing {k}")
    elif t=="cylinder":
        for k in ["p","v","r","h"]:
            if k not in params: errs.append(f"cylinder missing {k}")
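    elif t=="superquadric":
        for k in ["c","q","a","e"]:
            if k not in params: errs.append(f"superquadric missing {k}")
    else:
        # Reject types outside the enum in PRIMITIVE_SCHEMA instead of silently accepting them
        errs.append(f"unknown primitive type: {t}")
    return (len(errs)==0, errs)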

print("Schemas loaded. Use validate_frame_meta() / validate_primitive().")


In [None]:
import numpy as np

def normalize(v, eps=1e-8):
    v = np.asarray(v, dtype=np.float32)
    n = np.linalg.norm(v) + eps
    return (v / n).astype(np.float32)

# ---- Primitive parameterization helpers ----

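def plane_from_params(n, d):
    # Rescale d together with n so that n·X + d = 0 still describes the same plane
    n = np.asarray(n, dtype=np.float32)
    scale = float(np.linalg.norm(n)) + 1e-8
    return (n / scale).astype(np.float32), float(d) / scale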

def cuboid_from_params(c, s, q):
    c = np.asarray(c, np.float32)
    s = np.maximum(np.asarray(s,np.float32), 1e-3)  # sizes positive
    q = normalize(q)                                 # unit quaternion
    return c, s, q

def cylinder_from_params(p, v, r, h):
    p = np.asarray(p, np.float32)
    v = normalize(v)
    r = float(max(r, 1e-5))
    h = float(max(h, 1e-5))
    return p, v, r, h

def superquadric_from_params(c, q, a, e):
    c = np.asarray(c, np.float32)
    q = normalize(q)
    a = np.maximum(np.asarray(a, np.float32), 1e-4)     # axes
    e = np.maximum(np.asarray(e, np.float32), 0.1)      # exponents lower-bounded
    return c, q, a, e

# ---- Surface sampling (very light stubs) ----

def sample_points_plane(n, d, sx, sy, num=512):
    # Orthonormal basis on plane
    n = normalize(n)
    # pick arbitrary vector not parallel to n
    a = np.array([1,0,0], np.float32) if abs(n[0])<0.9 else np.array([0,1,0], np.float32)
    u = normalize(np.cross(n, a))
    v = normalize(np.cross(n, u))
    # plane point (closest to origin)
    p0 = -d * n
    # sample in rectangle centered at p0
    us = (np.random.rand(num)-0.5)*sx
    vs = (np.random.rand(num)-0.5)*sy
    pts = p0 + np.outer(us,u) + np.outer(vs,v)
    return pts.astype(np.float32)

def quat_to_rotmat(q):
    # Rotation matrix from a quaternion q = (qw, qx, qy, qz); q is normalized first
    w, x, y, z = normalize(q)
    return np.array([
        [1-2*(y*y+z*z), 2*(x*y-w*z),   2*(x*z+w*y)],
        [2*(x*y+w*z),   1-2*(x*x+z*z), 2*(y*z-w*x)],
        [2*(x*z-w*y),   2*(y*z+w*x),   1-2*(x*x+y*y)],
    ], dtype=np.float32)

def sample_points_cuboid(c, s, q, num=1024):
    # Sample num//6 points on each of the 6 faces (equal count per face, not area-weighted)
    # in the cuboid's local frame, then rotate by q and translate by c.
    c = np.asarray(c, np.float32)
    half = 0.5 * np.asarray(s, np.float32)
    n = max(int(num)//6, 1)
    faces = []
    for axis in range(3):
        for sign in (-1.0, 1.0):
            P = np.random.uniform(-1.0, 1.0, (n,3)) * half
            P[:, axis] = sign * half[axis]   # pin one coordinate to the face
            faces.append(P)
    P_local = np.vstack(faces)
    return (P_local @ quat_to_rotmat(q).T + c).astype(np.float32)

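def chamfer_distance(A, B):
    # Symmetric squared Chamfer distance between point sets A (Na,3) and B (Nb,3)
    A = np.asarray(A, dtype=np.float64)
    B = np.asarray(B, dtype=np.float64)
    if len(A)==0 or len(B)==0:
        return float('inf')
    from scipy.spatial import cKDTree as KDTree  # requires SciPy; raises ImportError otherwise
    d_b_to_a, _ = KDTree(A).query(B, k=1)   # for each point in B: distance to nearest point in A
    d_a_to_b, _ = KDTree(B).query(A, k=1)   # for each point in A: distance to nearest point in B
    return float(np.mean(d_a_to_b**2) + np.mean(d_b_to_a**2))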

print("Primitive helpers ready (simple stubs).")


In [None]:
import numpy as np

def greedy_match(cost):
    # Greedy matching as a fallback when SciPy is not available (not optimal in general).
    # cost: (M, N) matrix (pred vs gt). Returns list of (i_pred, j_gt).
    cost = np.array(cost, dtype=np.float64)   # work on a copy
    M, N = cost.shape
    pairs = []
    for _ in range(min(M, N)):
        i, j = np.unravel_index(np.argmin(cost), cost.shape)
        if not np.isfinite(cost[i, j]):
            break
        pairs.append((int(i), int(j)))
        cost[i, :] = np.inf   # row i and column j are now used
        cost[:, j] = np.inf
    return pairs

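def hungarian_match(cost):
    # Optimal assignment via SciPy; fall back to greedy matching only if SciPy is missing,
    # so that errors caused by an invalid cost matrix (e.g. NaNs) are not silently hidden.
    try:
        from scipy.optimize import linear_sum_assignment
    except ImportError:
        return greedy_match(cost)
    r, c = linear_sum_assignment(cost)
    return list(zip(r.tolist(), c.tolist()))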

# Example usage:
# cost = np.random.rand(5,4)
# pairs = hungarian_match(cost)
# print(pairs)


## 5. Model Specification: Inputs & Outputs

### 5.1 Inputs
- **RGB**: `float32` in `[0,1]`, shape **(B, 3, H, W)**.
- **Intrinsics `K`**: shape **(B, 3, 3)**.
- **(Optional) Dynamic mask**: **(B, 1, H, W)**.
- **(Optional) Previous keyframe memory**: features or pose prior.

### 5.2 Shared Backbone
- ConvNeXt-T or ViT-B/16 + FPN features at strides {4, 8, 16}.

### 5.3 Head A — Scene-Coordinate Regression (Pose Head)
**Output (at stride s, e.g., s=8):**
- **Dense scene-coords (optional)**: `(B, H/s, W/s, 3)` in **world**.
- **Uncertainty map**: `(B, H/s, W/s, 1)`.

**OR** (recommended for efficiency): predict on a **sparse lattice** of K pixels per frame:
- `uv_pred`: `(B, K, 2)` pixel centers,
- `XYZ_pred`: `(B, K, 3)` world coords,
- `sigma_pred`: `(B, K, 1)` uncertainties.

**Pose estimation:** PnP + (differentiable) RANSAC (DSAC++).  
**Pose output:** rotation `R_cw` and translation `t_cw`, i.e. an SE(3) pose; use Sim(3) when monocular scale must be handled.
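
To make the pose step concrete, here is a minimal NumPy sketch of PnP inside a plain RANSAC loop. It is a non-differentiable stand-in for DSAC++, not the method itself: it uses a linear DLT solver on 6-point samples and a fixed pixel threshold. All function names, the threshold, and the iteration count are assumptions. The returned `T_cw` follows Section 4.1 (camera pose in world).

```python
import numpy as np

def dlt_pnp(XYZ_w, uv, K):
    """Linear (DLT) PnP. Returns world-to-camera (R, t). Needs >= 6 non-coplanar points."""
    n = len(XYZ_w)
    xn = (np.linalg.inv(K) @ np.c_[uv, np.ones(n)].T).T   # normalized image coordinates
    Xh = np.c_[XYZ_w, np.ones(n)]                         # homogeneous world points
    A = np.zeros((2 * n, 12))
    A[0::2, 0:4] = Xh
    A[0::2, 8:12] = -xn[:, 0:1] * Xh
    A[1::2, 4:8] = Xh
    A[1::2, 8:12] = -xn[:, 1:2] * Xh
    P = np.linalg.svd(A, full_matrices=False)[2][-1].reshape(3, 4)  # null vector of A
    U, S, Vt = np.linalg.svd(P[:, :3])
    R, t = U @ Vt, P[:, 3] / S.mean()                     # remove the unknown scale
    if np.linalg.det(R) < 0:                              # resolve the sign ambiguity
        R, t = -R, -t
    return R, t

def reproj_error(R, t, XYZ_w, uv, K):
    """Per-point reprojection error in pixels; inf for points behind the camera."""
    X_c = XYZ_w @ R.T + t
    with np.errstate(divide="ignore", invalid="ignore"):
        px = (X_c @ K.T)[:, :2] / X_c[:, 2:3]
        err = np.linalg.norm(px - uv, axis=1)
    return np.where(X_c[:, 2] > 0, err, np.inf)

def ransac_pnp(XYZ_w, uv, K, iters=200, thresh_px=2.0, seed=0):
    """RANSAC over 6-point DLT hypotheses, then a refit on all inliers. Returns (T_cw, inliers)."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(uv), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(uv), size=6, replace=False)
        R, t = dlt_pnp(XYZ_w[idx], uv[idx], K)
        inliers = reproj_error(R, t, XYZ_w, uv, K) < thresh_px
        if inliers.sum() > best.sum():
            best = inliers
    if best.sum() < 6:
        raise RuntimeError("too few inliers for a pose estimate")
    R, t = dlt_pnp(XYZ_w[best], uv[best], K)
    T_cw = np.eye(4)
    T_cw[:3, :3] = R.T            # invert world-to-camera to get camera-in-world
    T_cw[:3, 3] = -R.T @ t
    return T_cw, best
```

In practice the predicted `sigma_pred` could be used to bias sampling toward confident points, and a nonlinear refinement would normally follow the linear refit.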

### 5.4 Head B — Primitive Set-Prediction (Mapping Head)
- Up to **M** slots, DETR-style bipartite matching with a **no-object** class.
- For each slot: `type` ∈ {none, plane, cuboid, cylinder, superquadric} and a parameter vector.
- We predict primitives in the **camera frame** and transform to **world** with the pose from Head A.

**Slot outputs (per item):**
- `type_logits`: `(B, M, T)` where `T=1+num_types` (no-object + types)
- `params`: flattened vector per type; packed as `(B, M, P_max)` with masks
- `conf`: `(B, M, 1)` confidence

**Typical parameterizations (camera frame):**
- Plane: unit normal `n`, offset `d`, extents `(sx,sy)`; reparam via `n = v/||v||`, `sx,sy = exp(.)`.
- Cuboid: center `c`, size `s=exp(.)`, rotation quaternion `q/||q||`.
- Cylinder: axis dir `v/||v||`, point `p`, `r=softplus(.)`, `h=softplus(.)`.
- Superquadric: `c`, `q`, semi-axes `a_i=softplus(.)`, exponents `eps_i=softplus(.)` with bounds.

After pose composition: **world-frame primitives** are sent to data association & map update.
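
The reparameterizations and the camera-to-world step can be sketched as follows. The raw vector layouts (`[v, d, log sx, log sy]` for planes, `[c, log s, q]` for cuboids) and the function names are assumptions made for illustration; the plane transform follows from substituting `X_c = R^T (X_w - t)` into `n·X_c + d = 0`.

```python
import numpy as np

def decode_plane(raw):
    """raw (6,): [v(3), d, log_sx, log_sy] -> unit normal, offset, positive extents."""
    raw = np.asarray(raw, dtype=np.float64)
    n = raw[:3] / (np.linalg.norm(raw[:3]) + 1e-8)
    return {"n": n, "d": float(raw[3]),
            "sx": float(np.exp(raw[4])), "sy": float(np.exp(raw[5]))}

def decode_cuboid(raw):
    """raw (10,): [c(3), log_s(3), q(4)] -> center, positive size, unit quaternion."""
    raw = np.asarray(raw, dtype=np.float64)
    q = raw[6:10] / (np.linalg.norm(raw[6:10]) + 1e-8)
    return {"c": raw[:3], "s": np.exp(raw[3:6]), "q": q}

def plane_cam_to_world(n_c, d_c, T_cw):
    """Transform a plane n·X + d = 0 from the camera frame to the world frame."""
    R, t = T_cw[:3, :3], T_cw[:3, 3]
    n_w = R @ np.asarray(n_c, dtype=np.float64)
    d_w = float(d_c) - float(n_w @ t)   # keeps n_w·X_w + d_w = 0 on the same surface
    return n_w, d_w
```

For example, a wall 1.5 m in front of a camera that sits at world X = 1 and looks east becomes the world plane X = 2.5.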


## 6. Losses

### 6.1 Scene-Coordinate / Pose Losses
- **Scene-coord regression:** L1 on valid pixels:  \(\mathcal{L}_{xyz} = \frac{1}{N}\sum \|\hat{\mathbf{X}}_w - \mathbf{X}_w\|_1\).
- **Uncertainty weighting:** scale each residual \(\Delta_i = \hat{\mathbf{X}}_{w,i} - \mathbf{X}_{w,i}\) by its predicted \(\sigma_i\): \(\mathcal{L}_{unc} = \sum_i \left( \frac{\|\Delta_i\|_2^2}{\sigma_i^2} + \log \sigma_i^2 \right)\).
- **DSAC pose loss:** negative log inlier probability and reprojection residual under best hypothesis.
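- **Reprojection auxiliary:** \(\mathcal{L}_{reproj} = \sum \|\pi(K, T_{cw}^{-1}, \hat{\mathbf{X}}_w) - \mathbf{u}\|_1\). The inverse appears because \(T_{cw}\) is the camera pose in the world (Section 4.1), while the projection \(\pi\) needs the world-to-camera transform.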
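
A minimal NumPy sketch of the three closed-form terms above, for one frame of sparse predictions. It assumes `T_cw` is the camera pose in world, so world points are mapped to the camera with its inverse before projection. The DSAC term is omitted and the function name is hypothetical.

```python
import numpy as np

def scene_coord_losses(X_pred, X_gt, sigma, valid, uv, K, T_cw):
    """Return (L_xyz, L_unc, L_reproj) over the valid points of one frame.

    X_pred, X_gt: (N,3) world coords; sigma: (N,) or (N,1); valid: (N,); uv: (N,2).
    """
    m = np.asarray(valid).astype(bool)
    diff = X_pred[m] - X_gt[m]
    l_xyz = np.abs(diff).sum(axis=1).mean()                    # mean L1 residual
    s2 = np.asarray(sigma, dtype=np.float64).reshape(-1)[m] ** 2
    l_unc = ((diff ** 2).sum(axis=1) / s2 + np.log(s2)).sum()  # sum of (residual/sigma^2 + log sigma^2)
    R, t = T_cw[:3, :3], T_cw[:3, 3]
    X_c = (X_pred[m] - t) @ R                                  # R^T (X_w - t), i.e. T_cw^{-1}
    proj = X_c @ K.T
    px = proj[:, :2] / proj[:, 2:3]
    l_reproj = np.abs(px - uv[m]).sum()                        # L1 in pixels
    return float(l_xyz), float(l_unc), float(l_reproj)
```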

### 6.2 Primitive Set Losses (Hungarian-matched)
For a predicted set \( \hat{\mathcal{S}} \) and GT set \( \mathcal{S} \):
- **Type/class CE**: classification cost.
- **Param losses** (per type): L1/L2 on canonical params (e.g., plane normal cosine, offsets, sizes).
- **Shape similarity**: Chamfer distance between sampled surface points.
- **Silhouette/occupancy**: rasterize to a coarse grid (in camera or world) and use BCE/IoU.
- **Regularizers**: unit-norm for normals/axis/quaternion, positivity for sizes/radii/axes (via softplus).

**Total:** \(\mathcal{L} = \lambda_{xyz}\mathcal{L}_{xyz} + \lambda_{pose}\mathcal{L}_{pose} + \lambda_{cls}\mathcal{L}_{cls} + \lambda_{param}\mathcal{L}_{param} + \lambda_{geom}\mathcal{L}_{chamfer} + \lambda_{sil}\mathcal{L}_{sil}\).
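
The Hungarian matching behind Section 6.2 could be sketched as below. The cost is a DETR-style combination of negative class probability and an L1 parameter distance; it assumes all parameter vectors are padded to a common length and that type index 0 is the no-object class. The weights, function names, and the brute-force fallback (exact, but only practical for tiny sets) are assumptions.

```python
import itertools
import numpy as np

def match_cost(type_logits, params_pred, gt_types, params_gt, w_cls=1.0, w_param=1.0):
    """Cost matrix (M,N) between M predicted slots and N ground-truth primitives."""
    z = type_logits - type_logits.max(axis=1, keepdims=True)
    prob = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)      # softmax over types
    cost_cls = -prob[:, gt_types]                                 # (M,N)
    cost_param = np.abs(params_pred[:, None, :] - params_gt[None, :, :]).sum(axis=-1)
    return w_cls * cost_cls + w_param * cost_param

def assign(cost):
    """Optimal one-to-one assignment as a sorted list of (i_pred, j_gt)."""
    try:
        from scipy.optimize import linear_sum_assignment
        rows, cols = linear_sum_assignment(cost)
        return sorted(zip(rows.tolist(), cols.tolist()))
    except ImportError:
        M, N = cost.shape
        if M < N:
            raise ValueError("brute-force fallback expects at least as many slots as GT items")
        # Try every way of giving each GT column a distinct prediction row.
        best = min(itertools.permutations(range(M), N),
                   key=lambda p: sum(cost[i, j] for j, i in enumerate(p)))
        return sorted((i, j) for j, i in enumerate(best))
```

Slots left unmatched by the assignment would be supervised toward the no-object class.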


## 7. Training Schedule & Augmentation
- **Backbone**: ImageNet init; AdamW (lr 2e-4, wd 1e-2), cosine decay; 150–200k iters.
- **Batch**: 8–16 @ 512×512 (mixed precision).
- **Sampling**: K=1–2k scene-coord pixels per image (avoid dynamic mask).
- **Augmentations**: color jitter, blur, noise, exposure, small intrinsics jitter; motion blur; random occluders.
- **Balancing**: gradient normalization or loss weights (\(\lambda\)) tuned to equalize magnitudes across heads.
- **Curriculum**: start with planes & cuboids; add cylinders/superquadrics in stage 2.
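
For reference, the cosine decay named above can be written as a small function. The base learning rate matches the list; the absence of warmup, the zero floor, and the choice of 150k total steps (the lower end of the stated range) are assumptions.

```python
import math

def cosine_lr(step, total_steps=150_000, base_lr=2e-4, min_lr=0.0):
    """Cosine-decayed learning rate; step is clamped to [0, total_steps]."""
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```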


## 8. Inference & Mapping
**Per frame:**
1. Dynamic mask (if available).
2. Head A → scene-coords → PnP/DSAC → pose \(T_{cw}\). If the inlier count is low, keep the last good pose and rely more heavily on step 4.
3. Head B → M primitives in **camera**; transform to **world** with \(T_{cw}\).
4. **Data association** with existing primitives (IoU in BEV/silhouette, Chamfer on samples). Update via EKF/LM. Spawn new primitives as needed.
5. Periodically: image retrieval → loop closure → pose-graph optimization (Sim(3) for mono). Jointly adjust primitive params.

**Outputs to downstream:**
- Current pose; **compact map**: list of primitives with covariances.
- Optional coarse ESDF/occupancy derived from primitives for planners.
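
A coarse occupancy grid can be derived from the primitive list with very little code. The sketch below rasterizes cuboid footprints into a bird's-eye-view grid and also gives the grid IoU that data association can use. It assumes cuboids rotate only about world +Z (yaw), and the tuple layout `(cx, cy, sx, sy, yaw)` and function names are hypothetical.

```python
import numpy as np

def bev_occupancy(cuboids, x_min, y_min, res, nx, ny):
    """Boolean (ny, nx) grid; a cell is occupied if its center lies in any cuboid footprint."""
    xs = x_min + (np.arange(nx) + 0.5) * res          # cell-center coordinates
    ys = y_min + (np.arange(ny) + 0.5) * res
    X, Y = np.meshgrid(xs, ys, indexing="xy")
    occ = np.zeros((ny, nx), dtype=bool)
    for cx, cy, sx, sy, yaw in cuboids:
        dx, dy = X - cx, Y - cy
        lx = np.cos(yaw) * dx + np.sin(yaw) * dy      # rotate offsets into the cuboid frame
        ly = -np.sin(yaw) * dx + np.cos(yaw) * dy
        occ |= (np.abs(lx) <= sx / 2) & (np.abs(ly) <= sy / 2)
    return occ

def grid_iou(a, b):
    """Intersection over union of two boolean grids (1.0 if both are empty)."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 1.0
```

A distance transform over the free cells of this grid would give an approximate 2D ESDF for a planner.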


## 9. Evaluation
- **Pose**: ATE (abs trajectory error), RPE (rel pose error), rotation/translation medians.
- **Primitives**: Chamfer on sampled surfaces; silhouette IoU; plane normal error; cuboid IoU; cylinder axis angle; **map size** in bytes (number of float32 parameters × 4 B).
- **Navigation**: collision-free rate on planned paths; clearance min/max; ESDF coverage.
- **Ablations**: no dynamic mask; no superquadrics; DSAC vs RANSAC; sparse vs dense scene-coords; set size M.
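
Two of these metrics are simple enough to sketch directly. The ATE version below assumes the estimated and ground-truth trajectories are already expressed in the same frame; a rigid or Sim(3) alignment step, which is usually applied first, is left out. The map-size helper counts every scalar under `params` as one float32. Both function names are hypothetical.

```python
import numpy as np

def ate_rmse(T_est, T_gt):
    """RMSE of camera positions for two (N,4,4) stacks of T_cw poses, without alignment."""
    p_est = np.asarray(T_est, dtype=np.float64)[:, :3, 3]
    p_gt = np.asarray(T_gt, dtype=np.float64)[:, :3, 3]
    return float(np.sqrt(np.mean(np.sum((p_est - p_gt) ** 2, axis=1))))

def count_scalars(v):
    """Count scalar leaves in nested dicts, lists, tuples, or arrays."""
    if isinstance(v, dict):
        return sum(count_scalars(x) for x in v.values())
    if isinstance(v, (list, tuple, np.ndarray)):
        return sum(count_scalars(x) for x in v)
    return 1

def map_size_bytes(primitives, bytes_per_float=4):
    """Map size as (number of scalar parameters) x 4 bytes."""
    return bytes_per_float * sum(count_scalars(p["params"]) for p in primitives)
```

For the two-primitive example used in this notebook (one plane with 6 scalars, one cuboid with 10), the map size comes to 64 bytes.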


## 10. Repro Checklist
- Export RGB/K/T_cw, dynamic masks, scene-coords (sparse), and primitive GT (per-frame or per-scene).
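- Verify the exported JSON files with `validate_frame_meta()` and `validate_primitive()` (defined in Section 4; a usage example follows this checklist).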
- Start training with {planes, cuboids} and M=32 slots; H=W=512; K=2k scene-coord points.
- Evaluate on held-out sequences; report pose + primitive metrics + map size.


In [None]:
# Example: build a tiny example frame meta and primitive list and validate
meta = {
  "frame_id": 1,
  "timestamp": 1725001234.0,
  "rgb_path": "images/000001.png",
  "depth_path": None,
  "dyn_mask_path": None,
  "K": [[600.0,0.0,320.0],[0.0,600.0,240.0],[0.0,0.0,1.0]],
  "T_cw": [[1,0,0,0],[0,1,0,0],[0,0,1,0],[0,0,0,1]],
  "scene_id": "scene_000",
  "split": "train"
}
ok, errs = validate_frame_meta(meta)
print("Frame meta valid:", ok, errs)

prims = [
  {"id":"pl0","type":"plane","params":{"n":[0,0,1],"d":0.0,"sx":20.0,"sy":20.0}},
  {"id":"cb1","type":"cuboid","params":{"c":[5,2,0],"s":[4,3,10],"q":[1,0,0,0]}}
]
for p in prims:
    ok, errs = validate_primitive(p)
    print(p["id"], "valid:", ok, errs)
