### Point Cloud Self-supervised Learning via 3D to Multi-view Masked Autoencoder

Link to the [paper](https://arxiv.org/pdf/2311.10887v1).

In [None]:
import torch.nn as nn
import torch
import open3d as o3d
import numpy as np

In [None]:
# example multi-view projection of 3D point clouds into multiple 2D images from different angles using open3d lib
import open3d as o3d
import numpy as np

# Load or create a point cloud
pcd = o3d.io.read_point_cloud("model.ply")  # or use your own point cloud

# Define camera viewpoints
camera_positions = [
    [0, 0, 1],  # front
    [1, 0, 0],  # side
    [0, 1, 0],  # top
]

# Render projections
images = []
for pos in camera_positions:
    vis = o3d.visualization.Visualizer()
    vis.create_window(visible=False)
    vis.add_geometry(pcd)

    ctr = vis.get_view_control()
    ctr.set_lookat(pcd.get_center())
    ctr.set_front(pos)
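    # The up vector must not be parallel to the viewing direction,
    # so switch to the y-axis when looking along z.
    ctr.set_up([0, 1, 0] if pos == [0, 0, 1] else [0, 0, 1])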
    ctr.set_zoom(0.5)

    vis.poll_events()
    vis.update_renderer()
    image = vis.capture_screen_float_buffer(False)
    images.append(np.asarray(image))
    vis.destroy_window()

In [None]:
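# or if using PyTorch for implementing from scratch
import torch

# Suppose P is an (N, 3) point cloud; random placeholder points in front of the camera
P = torch.rand(1024, 3) + torch.tensor([0.0, 0.0, 2.0])
P_hom = torch.cat([P, torch.ones(P.shape[0], 1)], dim=1)  # Make it homogeneous (N, 4)

# Camera projection matrix (intrinsics @ extrinsics), shape (3, 4).
# get_projection_matrix(view_angle) is not defined in this notebook, so an
# identity camera stands in for it here.
projection_matrix = torch.eye(3, 4)

# Get 2D projections
projected = (projection_matrix @ P_hom.T).T  # (N, 3) homogeneous image coordinates
projected_2d = projected[:, :2] / projected[:, 2:3]  # Normalize by depth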
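
The helper `get_projection_matrix` is not defined anywhere in this notebook. One way it might look is sketched below, assuming the camera orbits the origin around the y-axis at a fixed distance. The parameters `focal`, `cx`, `cy`, and `distance` are illustrative assumptions, not values from the paper.

```python
import math
import torch

def get_projection_matrix(view_angle, focal=200.0, cx=112.0, cy=112.0, distance=2.0):
    """Hypothetical helper: returns a (3, 4) matrix K @ [R | t].

    view_angle: rotation about the y-axis, in radians.
    """
    c, s = math.cos(view_angle), math.sin(view_angle)
    # Rotation about the y-axis (world -> camera).
    R = torch.tensor([[c, 0.0, s],
                      [0.0, 1.0, 0.0],
                      [-s, 0.0, c]])
    # Push the scene along +z so it sits in front of the camera.
    t = torch.tensor([[0.0], [0.0], [distance]])
    # Assumed pinhole intrinsics.
    K = torch.tensor([[focal, 0.0, cx],
                      [0.0, focal, cy],
                      [0.0, 0.0, 1.0]])
    return K @ torch.cat([R, t], dim=1)  # (3, 4)

P = torch.tensor([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0]])       # (N, 3)
P_hom = torch.cat([P, torch.ones(P.shape[0], 1)], dim=1)   # (N, 4)
proj = (get_projection_matrix(0.0) @ P_hom.T).T            # (N, 3)
uv = proj[:, :2] / proj[:, 2:3]                            # pixel coordinates
print(uv)
```

With `view_angle = 0` the origin lands on the principal point `(112, 112)`, which is a quick sanity check for the assumed intrinsics.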

Important vocabulary of the paper:
- Projection: rendering a 3D point cloud into 2D views using virtual cameras.
- Multi-view: using several camera angles to get a more complete understanding of the shape.
- In code: simulated with Open3D or PyTorch3D by capturing rasterized 2D images from different viewpoints.
- Purpose: feed these 2D views into powerful 2D ViTs, then decode back to 3D, which enables masked self-supervised learning in 3D.


### Depth map projection phase

$?$ How do we project 3D point clouds to multi-view 2D depth images?

#### From 3D to 2D via Perspective Projection

We want to render a depth map from a point cloud by simulating how a virtual camera would view the scene from multiple viewpoints. This is a classic perspective projection task.

Here’s what happens:

1. Define camera intrinsics and extrinsics
   - Intrinsic matrix K defines the camera’s internal parameters: focal length, principal point, etc.
   - Extrinsic matrix [R|t] defines the camera’s position and orientation in space.

2. Transform the 3D point cloud into camera coordinates
   - Apply the extrinsic matrix to bring the point cloud into the camera’s local coordinate frame.

3. Project onto the 2D image plane
   - Use the intrinsic matrix to project the 3D coordinates to 2D.
   - The z coordinate after transformation is used as the depth at that 2D pixel.

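Let’s say a 3D point is $P_{3D} = [X, Y, Z]^T$ (or $P = [X, Y, Z, 1]^T$ in homogeneous coordinates), and we have:
 - Extrinsic matrix: $E = [R | t] \in \mathbb{R}^{3 \times 4}$
 - Intrinsic matrix: $K \in \mathbb{R}^{3 \times 3}$

Then,
$$P_{\text{cam}} = R \cdot P_{3D} + t = E \cdot P$$
$$p_{\text{img}} = [x, y, z]^T = K \cdot P_{\text{cam}} \quad \text{(homogeneous coords)}$$
$$(u, v) = \left(\frac{x}{z}, \frac{y}{z}\right), \quad \text{depth} = z$$

Because the last row of $K$ is $[0, 0, 1]$, $z$ is exactly the depth of the point in camera coordinates.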
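
As a sanity check, the three equations can be evaluated by hand for a single point. The numbers below (focal length 100, principal point (64, 64), camera offset of 2 units) are made up for illustration and do not come from the paper.

```python
import numpy as np

# Illustrative values only (not from the paper).
K = np.array([[100.0, 0.0, 64.0],
              [0.0, 100.0, 64.0],
              [0.0, 0.0, 1.0]])      # fx = fy = 100, principal point (64, 64)
R = np.eye(3)                         # camera axes aligned with world axes
t = np.array([0.0, 0.0, 2.0])         # scene pushed 2 units along the camera z-axis

P_3d = np.array([0.5, -0.25, 0.0])
P_cam = R @ P_3d + t                  # [0.5, -0.25, 2.0]
x, y, z = K @ P_cam                   # homogeneous image coordinates [178, 103, 2]
u, v, depth = x / z, y / z, z
print(u, v, depth)                    # 89.0 51.5 2.0
```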

In [None]:
# example code for this phase -> 3.2 3D to multi-view projection and encoding
import numpy as np
import matplotlib.pyplot as plt

def get_camera_matrix(fx, fy, cx, cy):
    return np.array([
        [fx, 0, cx],
        [0, fy, cy],
        [0,  0,  1]
    ])

def project_point_cloud_to_depth_map(points, intrinsic, extrinsic, H, W):
    """
    points: (N, 3) numpy array of 3D points
    intrinsic: (3, 3) camera intrinsic matrix
    extrinsic: (4, 4) camera extrinsic matrix
    H, W: height and width of output depth map
    """
    N = points.shape[0]
    
    # Convert to homogeneous coordinates
    points_hom = np.concatenate([points, np.ones((N, 1))], axis=1).T  # (4, N)

    # Transform to camera coordinates
    cam_coords = extrinsic @ points_hom  # (4, N)
    cam_coords = cam_coords[:3, :]  # (3, N)

    # Project to 2D
    pixels = intrinsic @ cam_coords  # (3, N)
    pixels = pixels / pixels[2, :]  # Normalize by depth

    u = np.round(pixels[0, :]).astype(int)
    v = np.round(pixels[1, :]).astype(int)
    z = cam_coords[2, :]

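    # Create depth map (0 marks pixels that no point projects to)
    depth_map = np.zeros((H, W), dtype=np.float32)
    for i in range(N):
        x, y = u[i], v[i]
        # Skip points behind the camera and points that fall outside the image
        if z[i] > 0 and 0 <= x < W and 0 <= y < H:
            if depth_map[y, x] == 0 or z[i] < depth_map[y, x]:  # Take nearest depth
                depth_map[y, x] = z[i]
    return depth_map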
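
To get several depth maps from different viewpoints, one extrinsic matrix per camera is needed. The sketch below builds them with a hypothetical `look_at_extrinsic` helper and renders with a vectorized z-buffer (`render_depth`, also an illustrative name) instead of the Python loop above. The camera positions, intrinsics, and the unit-sphere test cloud are assumptions chosen only to make the example self-contained.

```python
import numpy as np

def look_at_extrinsic(cam_pos, target=(0.0, 0.0, 0.0), up=(0.0, 1.0, 0.0)):
    """Hypothetical helper: (4, 4) world-to-camera matrix whose z-axis points at target."""
    cam_pos, target, up = (np.asarray(a, dtype=np.float64) for a in (cam_pos, target, up))
    z_axis = target - cam_pos
    z_axis /= np.linalg.norm(z_axis)
    x_axis = np.cross(up, z_axis)            # requires up not parallel to z_axis
    x_axis /= np.linalg.norm(x_axis)
    y_axis = np.cross(z_axis, x_axis)
    R = np.stack([x_axis, y_axis, z_axis])   # rows are the camera axes in world coords
    E = np.eye(4)
    E[:3, :3] = R
    E[:3, 3] = -R @ cam_pos
    return E

def render_depth(points, K, E, H, W):
    """Vectorized z-buffer: nearest depth per pixel, 0 where nothing projects."""
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)  # (N, 4)
    cam = (E @ pts_h.T)[:3]                  # (3, N) camera coordinates
    front = cam[2] > 1e-6                    # drop points behind the camera
    cam = cam[:, front]
    pix = K @ cam
    u = np.round(pix[0] / pix[2]).astype(int)
    v = np.round(pix[1] / pix[2]).astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    depth = np.full((H, W), np.inf, dtype=np.float32)
    # Unbuffered minimum so that repeated pixel indices keep the smallest z.
    np.minimum.at(depth, (v[inside], u[inside]), cam[2, inside].astype(np.float32))
    depth[np.isinf(depth)] = 0.0
    return depth

rng = np.random.default_rng(0)
points = rng.normal(size=(2048, 3))
points /= np.linalg.norm(points, axis=1, keepdims=True)   # points on a unit sphere

K = np.array([[64.0, 0.0, 32.0], [0.0, 64.0, 32.0], [0.0, 0.0, 1.0]])
camera_positions = [(0.0, 0.0, -3.0), (3.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
ups = [(0.0, 1.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
depth_maps = [render_depth(points, K, look_at_extrinsic(p, up=up_vec), 64, 64)
              for p, up_vec in zip(camera_positions, ups)]
print([d.shape for d in depth_maps])
```

`np.minimum.at` is used rather than plain fancy-index assignment because several points can fall on the same pixel, and only the nearest one should survive.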

Part segmentation with Point-MAE:

<img src=./images/part_segmenation_Point-MAE.png width=650>

In the upsampling stage, Point-MAE uses **coordinate-based interpolation** (likely nearest-neighbor or k-NN) to propagate the sparse patch-level features back to every individual point. This is how it bridges the patch-level representation and dense per-point prediction.
This is analogous to image segmentation, where a low-resolution feature map (e.g., 1/8 of the image resolution) is upsampled back to full resolution (e.g., *via bilinear interpolation or transposed convolutions*).

A widely used upsampling scheme is **Inverse Distance Weighted Interpolation**:

Use the 3 or 4 nearest center points $c_1, c_2, \dots, c_k$ and blend their features based on distance:

$$f(p_i) = \frac{\sum_{j=1}^k w_j f(c_j)}{\sum_{j=1}^k w_j} \quad \text{where } w_j = \frac{1}{\|p_i - c_j\| + \epsilon}$$
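
A minimal PyTorch sketch of this formula follows. The function name `idw_interpolate` and the toy tensors are illustrative; the weights use the plain distance exactly as written above, whereas some implementations use the squared distance instead.

```python
import torch

def idw_interpolate(points, centers, center_feats, k=3, eps=1e-8):
    """Inverse-distance-weighted upsampling.

    points:       (N, 3) query coordinates p_i
    centers:      (M, 3) patch-center coordinates c_j
    center_feats: (M, D) features f(c_j)
    returns:      (N, D) interpolated features f(p_i)
    """
    dists = torch.cdist(points, centers)                      # (N, M) Euclidean distances
    knn_dists, knn_idx = dists.topk(k, dim=1, largest=False)  # k nearest centers per point
    weights = 1.0 / (knn_dists + eps)                         # w_j = 1 / (||p_i - c_j|| + eps)
    weights = weights / weights.sum(dim=1, keepdim=True)      # divide by the sum of the weights
    neighbor_feats = center_feats[knn_idx]                    # (N, k, D)
    return (weights.unsqueeze(-1) * neighbor_feats).sum(dim=1)

centers = torch.tensor([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [5.0, 5.0, 5.0]])
center_feats = torch.tensor([[1.0], [2.0], [3.0], [100.0]])
points = torch.tensor([[0.0, 0.0, 0.0], [0.5, 0.5, 0.0]])
feats = idw_interpolate(points, centers, center_feats, k=3)
print(feats)
```

The first query coincides with a center, so its feature is essentially copied. The second is equidistant from its three nearest centers, so it receives their plain average.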

After this interpolation, some models (e.g., PointNet++) apply a learnable MLP:

$$f'(p_i) = \text{MLP}(f(p_i), p_i)$$

This lets the network refine the interpolated features with positional information. The learnable MLP can draw on both global features (avg/max pooling together with the global class token) and local features (the upsampled tokens from the encoder).

In [None]:
class FeaturePropagationMLP(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 128),
            nn.ReLU(),
            nn.Linear(128, out_dim)
        )

    # interpolated_feats --> per-point features from upsampled point cloud
    # concatenated by coords --> 3D coordinates of the point cloud
    # this way, our MLP is able to learn from both local and global features
    # and can be used to predict the final features for each point in the point cloud
    # in_dim = D + 3, out_dim = D
    # where D is the number of features per point in the point cloud
    # and 3 is the number of coordinates (x, y, z)
    def forward(self, interpolated_feats, coords):
        x = torch.cat([interpolated_feats, coords], dim=-1)  # (N, D+3)
        return self.mlp(x) # this now can learn from local and global features.
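
To make the "local and global" idea concrete, here is a rough sketch of a per-point head that concatenates the upsampled features with a pooled global feature before the MLP. The class name `SegmentationHeadSketch`, the layer sizes, and the choice to pool over encoder tokens are assumptions for illustration, not the exact Point-MAE head.

```python
import torch
import torch.nn as nn

class SegmentationHeadSketch(nn.Module):
    """Hypothetical head: per-point features + pooled global feature -> part logits."""

    def __init__(self, feat_dim, num_parts, hidden=128):
        super().__init__()
        # Input: upsampled feature (D) + max-pooled (D) + mean-pooled (D) + xyz (3)
        self.mlp = nn.Sequential(
            nn.Linear(3 * feat_dim + 3, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_parts),
        )

    def forward(self, point_feats, token_feats, coords):
        # point_feats: (N, D) upsampled features
        # token_feats: (M, D) encoder tokens
        # coords:      (N, 3) point coordinates
        global_feat = torch.cat([token_feats.max(dim=0).values,
                                 token_feats.mean(dim=0)])           # (2D,)
        global_feat = global_feat.expand(point_feats.shape[0], -1)   # (N, 2D)
        x = torch.cat([point_feats, global_feat, coords], dim=-1)    # (N, 3D + 3)
        return self.mlp(x)                                           # (N, num_parts)

head = SegmentationHeadSketch(feat_dim=32, num_parts=4)
logits = head(torch.randn(1024, 32), torch.randn(64, 32), torch.randn(1024, 3))
print(logits.shape)  # torch.Size([1024, 4])
```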

$?$ Why use an MLP here?
1. Learn geometry-aware feature mappings
    - Nearby points might have different meanings depending on object shape.
2. Add positional context
    - Raw coordinates plus interpolated features help localize features better.
3. Handle complex patterns
    - The MLP can learn to denoise, sharpen, or smooth features as needed.

*So it makes upsampling learnable, adaptive, and context-aware.*