- **3D vision**: autonomous driving, robotics, AR/VR, medical imaging, virtual production
- **Challenges**: depth ambiguity, occlusion (parts of objects are hidden), viewpoint variation (the same object looks very different from different angles), sparse measurements (LiDAR gives points, not surfaces, and many regions are missing), data scarcity
- **3D shape representations**:
    - **Depth maps**: one value per pixel, giving the distance to the visible surface. Used because they are simple, compact, and useful for navigation; they are a first step toward 3D, and some sensors (e.g. Kinect) output depth directly. Train with a per-pixel loss.
        - **Problem**: absolute scale is ambiguous when depth is predicted from an image.
        - **Solution**: use **scale-invariant loss** and focus on **relative depth**. 
        - **Cons**: only visible surfaces, not full 3D, scale ambiguity.
    - **Voxels**: the 3D version of pixels: a grid of size V × V × V in which each cell is either empty or occupied. A simple idea that represents the full volume, is useful in robotics, makes it easy to fuse multiple views, and is already the format of medical scans.
        - **Process** with 3D convolutions (like 2D CNNs, but with 3D kernels).
        - **Problem**: memory heavy, since the cell count grows as V³. Fix: octrees (fine cells near the surface, coarse cells elsewhere).
        - **Pros**: simple, full space representation, works with 3D CNNs. 
        - **Cons**: memory heavy, hard to get fine detail.
    - **Point clouds**: unordered sets of P 3D points, stored as a P × 3 array with no grid. Used in LiDAR, robotics, autonomous driving.
        - **Process** with a PointNet-style approach: apply the same MLP to each point, pool (max/mean) across points, get a global feature.
        - **Tasks**: classification, segmentation, scene understanding. To predict from images, a CNN extracts features and regresses 3D points; the prediction is compared with the ground truth using Chamfer distance.
        - **Pros**: efficient, flexible resolution, good for large scenes, matches real sensors. 
        - **Cons**: no explicit surface, sensitive to noise, uneven density.
    - **Meshes**: vertices (points) + faces (triangles); they explicitly define the surface and are the standard in graphics. Used because they give a clear surface, support textures, are memory efficient and easy to render, and are common in games/AR/VR/movies.
        - **Process** as graphs with **Graph Neural Networks**.
        - **Predict from images** by starting with a simple shape and predicting vertex offsets.
        - **Loss**: sample points from the mesh and compare with Chamfer distance.
        - **Pros**: explicit surface, high detail, graphics-friendly, interpretable.
        - **Cons**: hard to process with NNs, topology is often fixed, no volume info.
    - **Implicit surfaces**: store a function (e.g. a Signed Distance Function, SDF); the surface is where the function equals 0. Used because they are continuous, resolution-free, high quality, smooth, and flexible in topology.
        - **Learn** by sampling 3D points and predicting the SDF value at each with a neural network.
        - **Pros**: very flexible, smooth, detailed, great for scenes.
        - **Cons**: slow inference, hard to edit, hard to convert to mesh, careful sampling needed.
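
For the depth-map entry, one common scale-invariant loss works in log space (the form popularised by Eigen et al.'s monocular depth work). The sketch below is a minimal pure-Python version, assuming depths arrive as flat lists of positive numbers; the function name `scale_invariant_loss` and the weight `lam` are illustrative choices, not something these notes define.

```python
import math

def scale_invariant_loss(pred, target, lam=1.0):
    """Scale-invariant log-depth loss over flat lists of positive depths.

    lam=1.0 cancels any global scale factor between pred and target;
    lam=0.0 reduces to a plain mean squared error in log space.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and the same length")
    # Per-pixel difference in log depth.
    diffs = [math.log(p) - math.log(t) for p, t in zip(pred, target)]
    n = len(diffs)
    mean_sq = sum(d * d for d in diffs) / n   # average squared log error
    sq_mean = (sum(diffs) / n) ** 2           # squared average log error
    return mean_sq - lam * sq_mean

target = [1.0, 2.0, 4.0]
scaled = [3.0 * t for t in target]  # correct relative depth, wrong absolute scale

print(scale_invariant_loss(scaled, target))           # close to 0
print(scale_invariant_loss(scaled, target, lam=0.0))  # about (log 3)^2
```

A prediction that is off by a constant factor keeps the correct relative depth, so with `lam=1.0` it is not penalised, while the plain log-space error still is.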

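The PointNet-style recipe from the point-cloud entry (same MLP on every point, then pool) can be sketched as follows. This is a toy with a single random linear layer plus ReLU standing in for the MLP; `shared_mlp`, `global_feature`, and the feature width of 4 are hypothetical names and values chosen for illustration. The point of the example is that max pooling makes the global feature independent of point order.

```python
import random

def shared_mlp(point, weights, biases):
    """Apply one linear layer + ReLU to a single 3D point."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, point)) + b)
        for row, b in zip(weights, biases)
    ]

def global_feature(points, weights, biases):
    """Encode every point with the same layer, then max-pool per channel."""
    per_point = [shared_mlp(p, weights, biases) for p in points]
    return [max(channel) for channel in zip(*per_point)]

rng = random.Random(0)
feat_dim = 4  # illustrative feature width
weights = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(feat_dim)]
biases = [rng.uniform(-1, 1) for _ in range(feat_dim)]

cloud = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.5], [0.2, 0.3, 0.9]]
shuffled = cloud[:]
rng.shuffle(shuffled)

# Max pooling ignores point order, so both orderings give the same feature.
same = global_feature(cloud, weights, biases) == global_feature(shuffled, weights, biases)
print(same)
```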
- **Chamfer distance**: for two point sets, the sum of the L2 distance from each point to its nearest neighbour in the other set, taken in both directions.
- **Implicit surfaces**: instead of storing points or faces, store a **function** such as a **Signed Distance Function** (SDF); the surface is where the function equals 0. The representation is continuous, resolution-free, high quality, smooth, and flexible in topology. Used in DeepSDF, NeRF (which stores density and colour rather than an SDF), and scene reconstruction. Learned by sampling 3D points and training a neural network to predict the SDF value at each.

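A minimal sketch of the Chamfer distance as defined above, using brute-force nearest-neighbour search over plain Python tuples. The function name `chamfer_distance` is an illustrative choice; real implementations often use squared distances or average over points instead of summing, so treat this as one variant rather than a reference definition.

```python
import math

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two non-empty point sets."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    # For each point in a, distance to its nearest neighbour in b, and vice versa.
    a_to_b = sum(min(math.dist(p, q) for q in b) for p in a)
    b_to_a = sum(min(math.dist(q, p) for p in a) for q in b)
    return a_to_b + b_to_a

a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 2.0, 0.0)]

print(chamfer_distance(a, a))  # 0.0
print(chamfer_distance(a, b))  # 2.0
```

The brute-force search costs O(|a| · |b|) distance evaluations, which is fine for small sets; larger clouds would typically use a spatial index or a batched GPU implementation.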

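To make "surface is where the function equals 0" concrete, here is the analytic SDF of a sphere together with a sampler that produces (point, SDF value) pairs of the kind a network would be trained on. The sign convention (negative inside, positive outside) is the usual one but is an assumption here, and `sphere_sdf` and `sample_training_pairs` are hypothetical helper names.

```python
import math
import random

def sphere_sdf(point, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance to a sphere: negative inside, 0 on the surface, positive outside."""
    return math.dist(point, center) - radius

def sample_training_pairs(n, half_extent=1.5, seed=0):
    """Sample n points uniformly in a cube and label each with its SDF value."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        p = tuple(rng.uniform(-half_extent, half_extent) for _ in range(3))
        pairs.append((p, sphere_sdf(p)))
    return pairs

print(sphere_sdf((0.0, 0.0, 0.0)))  # -1.0 (inside)
print(sphere_sdf((1.0, 0.0, 0.0)))  # 0.0 (on the surface)
print(sphere_sdf((2.0, 0.0, 0.0)))  # 1.0 (outside)

pairs = sample_training_pairs(5)
```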
| Representation | Strength         | Weakness                 |
| -------------- | ---------------- | ------------------------ |
| Depth map      | Simple           | Only visible surfaces    |
| Voxels         | Full space       | Huge memory              |
| Point cloud    | Efficient        | No explicit surface      |
| Mesh           | Explicit surface | Hard to process with NNs |
| Implicit       | High quality     | Slow inference           |

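The "Huge memory" entry for voxels follows from the V × V × V cell count. The arithmetic below assumes a dense grid with one byte per cell, which is an illustrative choice (a packed occupancy grid could use one bit, a feature grid far more); `voxel_grid_bytes` is a hypothetical helper.

```python
def voxel_grid_bytes(v, bytes_per_cell=1):
    """Memory for a dense V x V x V grid at the given bytes per cell."""
    return v ** 3 * bytes_per_cell

for v in (32, 128, 512):
    print(f"V={v}: {voxel_grid_bytes(v) / 2**20:g} MiB")
```

Each doubling of V multiplies the memory by 8, so going from V = 128 to V = 512 costs 64 times as much.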