### You Only Need One Thing One Click: 
#### Self-Training for Weakly Supervised 3D Scene Understanding

source [paper](https://arxiv.org/pdf/2303.14727)

Long story short: **just one point needs to be annotated for each object in the scene.**

**Question:** can we achieve performance comparable to a fully supervised baseline given such extremely sparse annotations?

To meet such a challenge, we propose to design a self-training
approach with a label-propagation mechanism for weakly supervised
semantic segmentation. On the one hand, with the prediction result
of the model, the pseudo labels can be expanded to unknown regions
through our graph propagation module. On the other hand, with
richer and higher quality labels being generated, the model performance can be further improved. Thus, we conduct the label propagation and network training iteratively, forming a closed loop to boost
the performance of each other.

Core Contributions

This research introduces a self-training framework for 3D scene understanding, particularly focusing on semantic and instance segmentation tasks, using extremely sparse annotations. The key innovations include:
1. Minimal Annotation Strategy: The approach requires annotating only a single point per object, significantly reducing the manual labeling effort compared to traditional methods.
2. Iterative Self-Training with Label Propagation: The model employs an iterative training process where it alternates between training on the current set of labels and propagating labels to unlabeled data points. This propagation is facilitated by a graph-based module that considers the spatial and feature similarities between points.
3. Category Prototype Generation: A relation network is used to generate per-category prototypes, which help in enhancing the quality of pseudo-labels during the training iterations.
4. Extension to Instance Segmentation: Beyond semantic segmentation, the framework is adapted for instance segmentation tasks by incorporating a point-clustering strategy, allowing the model to distinguish between different object instances within the same category.

core innovation: the **Category-Aware Label Propagation**
- To spread labels from the few clicked points to nearby points, the model uses:
  - Graph-based label propagation, considering spatial and feature similarity.
  - Category prototypes: these are learned representations (or “embeddings”) of each object category, guiding the label spreading process.

**super-voxel over-segmentation**:

A super-voxel is the 3D equivalent of a super-pixel in image processing.
- In 2D, super-pixels group adjacent pixels with similar color or texture into coherent regions.
- In 3D, super-voxels do the same for points in a point cloud: they group geometrically and visually similar 3D points into compact regions.

These super-voxels are:
- Geometrically homogeneous: points have similar surface normals or curvature
- Compact in 3D space
- Non-overlapping and complete:
$$\bigcup_j v_j = X,\quad v_j \cap v_{j'} = \emptyset \text{ for } j \ne j'$$

So, each point $p_i \in X$ belongs to exactly one super-voxel $v_j$.
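As a minimal sketch of this partition property, a super-voxel over-segmentation can be stored as one integer id per point. The array `sv_ids` and the toy values below are hypothetical and only illustrate the completeness and non-overlap conditions.

```python
import numpy as np

# Hypothetical toy data: 8 points, each assigned to one of 3 super-voxels.
# sv_ids[i] = j means point p_i belongs to super-voxel v_j.
sv_ids = np.array([0, 0, 1, 1, 1, 2, 2, 0])

# Recover each super-voxel as an array of point indices.
supervoxels = [np.flatnonzero(sv_ids == j) for j in range(sv_ids.max() + 1)]

# Completeness: the union of all super-voxels is the whole point set X.
union = np.sort(np.concatenate(supervoxels))
is_complete = np.array_equal(union, np.arange(len(sv_ids)))

# Non-overlap: no point index appears in two super-voxels.
is_disjoint = len(union) == len(np.unique(union))

print(is_complete, is_disjoint)
```

Storing one id per point makes the partition hold by construction, since a point cannot carry two ids.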

Overview architecture:

<img src=./images/One-thing_One-click.png width=650>

The most important parts of this diagram can be summarized as follows:
- 3D semantic segmentation network $\Theta$ (3D U-Net)
- Relation network $\mathcal{R}$

The 3D U-Net $\Theta$ is trained with a cross-entropy loss:

$$L_s = -\frac{1}{N} \sum_{i=1}^{N} \log P(y_i, \bar{c} | p_i, c_i, \Theta).$$

where $\bar{c}$ is the ground-truth category for the point $p_i$
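A minimal NumPy sketch of this loss under sparse supervision follows. It assumes that $N$ counts only the points that currently carry a label (clicked points or pseudo labels) and that unlabeled points are marked with a hypothetical `ignore_index` of `-1`; neither convention is stated in these notes.

```python
import numpy as np

def sparse_cross_entropy(logits, labels, ignore_index=-1):
    """Mean negative log-likelihood over labeled points only."""
    labeled = labels != ignore_index          # mask of points that have a label
    z = logits[labeled]
    z = z - z.max(axis=1, keepdims=True)      # stabilise the softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(labeled.sum()), labels[labeled]]
    return -picked.mean()

# Toy example: 4 points, 3 categories, only points 0 and 2 are labeled.
logits = np.array([[2.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0],
                   [0.0, 3.0, 0.0],
                   [1.0, 1.0, 1.0]])
labels = np.array([0, -1, 1, -1])
loss = sparse_cross_entropy(logits, labels)
print(loss)
```

Unlabeled points contribute nothing to the loss, which is why the pseudo-label expansion described later matters so much for training.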

**Pseudo Label Propagation** part: (graph-model based or transformer based)

**3D U-Net for Semantic Label Prediction** (Blue Path)

Purpose:
- Learns semantic features at the point level from the raw 3D input (geometry, color, etc.).
- Produces point-wise semantic predictions (e.g., “chair”, “table”, etc.).

Architecture:
- This is a 3D adaptation of the well-known U-Net architecture:
    - Encoder: progressively downsamples spatial resolution while increasing feature dimensionality.
    - Decoder: upsamples and combines features from the encoder (via skip connections) to preserve fine details.
- Operates on 3D volumes or point-wise features (depending on implementation, often voxelized input is used).

Why U-Net?
- U-Net preserves both local (fine-grained) and global (contextual) features, which is crucial for semantic segmentation.
- It gives strong point-level features that will later be aggregated over super-voxels.
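The aggregation of point-level features over super-voxels can be sketched as a mean pooling. The function name `pool_by_supervoxel` and the choice of mean pooling are assumptions made for illustration.

```python
import numpy as np

def pool_by_supervoxel(point_feats, sv_ids, num_sv):
    """Average point features inside each super-voxel."""
    sums = np.zeros((num_sv, point_feats.shape[1]))
    np.add.at(sums, sv_ids, point_feats)              # scatter-add per super-voxel
    counts = np.bincount(sv_ids, minlength=num_sv)[:, None]
    return sums / np.maximum(counts, 1)               # guard against empty super-voxels

# Toy example: 5 points with 2-D features, grouped into 2 super-voxels.
point_feats = np.array([[1.0, 0.0], [3.0, 0.0],
                        [0.0, 2.0], [0.0, 4.0], [0.0, 6.0]])
sv_ids = np.array([0, 0, 1, 1, 1])
sv_feats = pool_by_supervoxel(point_feats, sv_ids, num_sv=2)
print(sv_feats)
```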


**Relation Net for Super-Voxel Similarity Learning** (Green Path)

Purpose:
- Learns a feature embedding per super-voxel, capturing semantic and relational structure.
- Used to compute pairwise affinities between super-voxels, which later guides label propagation.

Architecture & Inputs:
- Takes input from:
    - Super-voxel partition of the point cloud
    - Possibly pooled U-Net features from inside the super-voxel
    - Geometric information (e.g., centroid, normal, curvature)
- Learns a super-voxel-level representation using:
    - MLPs or small CNNs
    - Contrastive or similarity-based loss (depending on paper details)

What’s “Relation” About It?
- It learns to encode how similar or different two super-voxels are.
- This relation is used later in graph-based label propagation, where:
    - Unary potentials come from the 3D U-Net
    - Pairwise terms (affinities) are computed via Relation Net
    - Both are combined via graph inference or a Transformer
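The combination of unary and pairwise terms can be sketched with a simple mean-field-style update. This is a simplified stand-in and not the paper's exact graph inference: the cosine affinity, the `weight` parameter, and the function name `propagate` are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def propagate(unary_probs, embeddings, n_iters=10, weight=2.0):
    """Refine per-super-voxel class probabilities using pairwise affinities."""
    f = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    affinity = np.clip(f @ f.T, 0.0, None)    # cosine similarity, negatives dropped
    np.fill_diagonal(affinity, 0.0)           # no self-messages
    log_unary = np.log(unary_probs + 1e-12)
    q = unary_probs.copy()
    for _ in range(n_iters):
        # Each super-voxel receives votes from similar super-voxels.
        q = softmax(log_unary + weight * affinity @ q)
    return q

unary = np.array([[0.9, 0.1],    # confident: class 0
                  [0.5, 0.5],    # uncertain
                  [0.1, 0.9]])   # confident: class 1
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
labels = propagate(unary, emb).argmax(axis=1)
print(labels)
```

In this toy case the uncertain super-voxel takes the label of the neighbour whose embedding it most resembles, which is the behaviour the pairwise term is meant to provide.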

More on the relation net $\mathcal{R}$:

The relation network $\mathcal{R}$ shares the same backbone architecture as
the 3D U-Net $\Theta$ except for removing the last category-wise prediction
layer. It aims to predict a category-related embedding $f_j$ for each
super-voxel $v_j$ as the similarity measurement. $f_j$ is the per-super-voxel
pooled feature in $\mathcal{R}$. In other words, the relation network groups
the embeddings of the same category, while pushing those of different
categories apart. To this end, we propose to learn a prototypical
embedding for each category.

We then use the category-related embedding $f_j$ generated by $\mathcal{R}$ as the query and the prototype $k_c$ stored in the memory bank as the key. The two modules are optimized simultaneously with **contrastive learning**:

$$L_c = \frac{1}{M} \sum_{j} \left( -\log \frac{\exp(f_j \cdot k_{\bar{c}}/\tau)}{\sum_{c} \exp(f_j \cdot k_c/\tau)} \right)$$

where $\tau$ is the temperature hyperparameter and $k_{\bar{c}}$ is the prototype of the ground-truth category $\bar{c}$ of super-voxel $v_j$. This contrastive loss is equivalent to a $c$-way softmax classification task.
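Because the loss is a softmax classification over prototypes, it can be sketched as a temperature-scaled cross-entropy. The function name, the value `tau=0.07`, and the one-hot toy prototypes are assumptions for illustration only.

```python
import numpy as np

def prototype_contrastive_loss(f, prototypes, targets, tau=0.07):
    """Cross-entropy over similarities between embeddings f_j and prototypes k_c."""
    logits = f @ prototypes.T / tau                       # (M, C) scaled similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_softmax[np.arange(len(targets)), targets].mean()

prototypes = np.eye(3)                        # toy prototypes k_c, one per category
f = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])               # toy super-voxel embeddings f_j
good = prototype_contrastive_loss(f, prototypes, np.array([0, 1]))
bad = prototype_contrastive_loss(f, prototypes, np.array([1, 0]))
print(good, bad)
```

The loss is small when each embedding is closest to the prototype of its own category and large otherwise, which is what pulls same-category embeddings together.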

Our relation net complements the 3D U-Net. It measures the
relations between super-voxels using different training strategies and
losses, while the 3D U-Net aims to project the inputs into the latent feature space for category assignment. The prediction of the relation
network is further combined with the prediction of the 3D U-Net by
multiplying the predicted probabilities of each category to boost the
performance. In addition, the relation net offers a stronger measurement of the pairwise term in the CRF than handcrafted features such as colours, and it also complements the 3D U-Net features.
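The multiplication of per-category predictions can be sketched as below. Renormalising the product so that it sums to one is an assumption added here for readability; the notes only state that the two predictions are multiplied.

```python
import numpy as np

def fuse_predictions(p_unet, p_relation):
    """Multiply two per-category distributions and renormalise (assumed step)."""
    fused = p_unet * p_relation
    return fused / fused.sum(axis=1, keepdims=True)

# Toy example: the U-Net slightly prefers class 0, the relation net prefers class 1.
p_unet = np.array([[0.5, 0.4, 0.1]])
p_relation = np.array([[0.2, 0.7, 0.1]])
p_fused = fuse_predictions(p_unet, p_relation)
print(p_fused.argmax(axis=1))
```

A category only scores well after fusion if both networks assign it reasonable probability.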

**the training step in Relation Network**:

- The Relation Network outputs an embedding $f_j$ for every super-voxel.
- For each super-voxel $j$, we compute the similarity with all prototypes $p_k$:
$\text{sim}(f_j, p_k)$
- Use softmax over similarities to treat this as a probability distribution.
- Use the U-Net’s UnaryToken to guide the correct prototype:
$$\text{Loss}_{\text{relation}} = \text{KL-Divergence}(\text{UnaryToken}_j \parallel \text{softmax}(\text{sim}(f_j, p_k)))$$

This is called **Unary-Guided Contrastive Learning**.
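A minimal sketch of this KL-based objective is given below. It assumes that $\text{sim}$ is a dot product and that `unary` holds the U-Net's per-category distribution for one super-voxel; both are illustrative choices, not details confirmed by the notes.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) along the last axis."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

prototypes = np.eye(3)                        # toy prototypes p_k
f = np.array([[2.0, 0.1, 0.0]])               # toy embedding f_j
unary = np.array([[0.8, 0.1, 0.1]])           # toy U-Net distribution for super-voxel j

pred = softmax(f @ prototypes.T)              # softmax over sim(f_j, p_k)
loss = kl_divergence(unary, pred).mean()
print(loss)
```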

In short:
- U-Net learns semantic segmentation from sparse points.
- Relation Network learns object-level grouping using the U-Net’s predictions and contrastive embedding losses.


The 3D U-Net and Relation Network work in tandem within a self-training framework. The U-Net first generates pseudo-labels for unlabeled points based on the sparse annotations. The Relation Network then uses these pseudo-labels to learn relational patterns, refining the predictions. The refined predictions are fed back into the U-Net as pseudo-labels for the next iteration, creating a feedback loop that improves both networks’ performance over time. This synergy allows the model to generalize better with minimal supervision.

**Transformer-based label propagation**:

Unlike the graph model-based approach that learns the affinity
among super-voxels, where the size of the affinity matrix $M \times M$
grows quadratically relative to the number of super-voxels $M$, the
transformer-based label propagation aims to learn the correlation
between a super-voxel $v_j$ and a category prototype $k_c$. Therefore,
the size of the attention map $M \times c$ grows proportionally to $M$,
significantly improving efficiency in terms of memory and inference
time. Additionally, transformer-based label propagation can be optimised end-to-end, further improving the performance of 3D semantic segmentation.

$$\hat{f}_j = \sum_{c} \text{softmax} \left( \frac{Q(F_j) K(k_c)}{\sqrt{d_l}} \right) V(k_c)$$

where $Q$, $K$, and $V$ represent MLP layers, while $F_j$ represents
the feature vector of the 3D U-Net. The transformer then aggregates the category prototypes $k_c$ based on the similarity between $F_j$
and $k_c$. The resulting output feature $\hat{f}_j$ is then concatenated with $F_j$
to make the final prediction for the semantic category.
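The attention over category prototypes can be sketched in NumPy as follows. Single random linear maps stand in for the MLPs $Q$, $K$, and $V$, and all sizes are hypothetical; the point is only to show that the attention map has shape $M \times c$.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
M, C, d_in, d_l = 5, 3, 8, 4                  # super-voxels, categories, feature dim, attention dim

F = rng.normal(size=(M, d_in))                # stand-in for U-Net features F_j
prototypes = rng.normal(size=(C, d_in))       # stand-in for category prototypes k_c

# Assumption: one linear layer each instead of the MLPs Q, K, V.
W_q, W_k, W_v = (rng.normal(size=(d_in, d_l)) for _ in range(3))

scores = (F @ W_q) @ (prototypes @ W_k).T / np.sqrt(d_l)   # (M, C) attention logits
attn = softmax(scores, axis=1)                              # each row sums to 1 over categories
f_hat = attn @ (prototypes @ W_v)                           # (M, d_l) aggregated prototypes
fused = np.concatenate([F, f_hat], axis=1)                  # concatenated with F_j for the classifier

print(attn.shape, fused.shape)
```

The memory cost of `attn` is linear in the number of super-voxels, in contrast to the $M \times M$ affinity matrix of the graph-based variant.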



#### self-training mechanism

With label propagation in place (using the U-Net features $F_j$ and the relation net), a self-training approach
updates the networks $\Theta$ and $\mathcal{R}$ and the pseudo labels $Y$ iteratively.

First, a closer look at the **CRF (Conditional Random Field)**:

A Conditional Random Field (CRF) is a type of probabilistic graphical model used for structured prediction. Unlike models that make independent predictions (e.g., logistic regression), CRFs jointly model the distribution of all outputs $Y$ given inputs $X$, while capturing dependencies between the outputs.

Formal Definition:
$$P(Y | X) = \frac{1}{Z(X)} \exp(-E(Y | X))$$

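Here $E(Y \mid X)$ is the energy of a labeling (typically a sum of unary and pairwise potentials) and $Z(X)$ is the partition function that normalizes the distribution. CRFs model label dependencies, not just individual label likelihoods.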
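A tiny brute-force example makes the definition concrete. It assumes an energy made of unary costs plus a Potts-style penalty for neighbouring nodes with different labels; the costs, the single edge, and the penalty of 1.5 are hypothetical values.

```python
import itertools
import numpy as np

def energy(labels, unary, pairwise_weight, edges):
    """Unary costs plus a fixed penalty for each edge whose endpoints disagree."""
    e = sum(unary[i, y] for i, y in enumerate(labels))
    e += sum(pairwise_weight for i, j in edges if labels[i] != labels[j])
    return e

# unary[i, y]: cost of assigning label y to node i (3 nodes, 2 labels).
unary = np.array([[0.1, 2.0],
                  [1.0, 1.0],    # node 1 has no preference on its own
                  [2.0, 0.1]])
edges = [(0, 1)]                 # node 1 is connected to node 0 only

labelings = list(itertools.product(range(2), repeat=3))
energies = np.array([energy(y, unary, 1.5, edges) for y in labelings])
Z = np.exp(-energies).sum()                  # partition function
probs = np.exp(-energies) / Z                # P(Y | X) for every joint labeling
best = labelings[int(probs.argmax())]
print(best)
```

Node 1 is undecided by its unary term alone, so the pairwise term decides its label; this is the dependency that independent per-node prediction would miss. Enumerating all labelings is only feasible for toy problems.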

self-training loop:

With $\Theta$ and $\mathcal{R}$ fixed, label propagation is conducted to minimise the energy function. Then, the predictions
with high confidence are taken as the updated pseudo labels for
training the two networks in the following iteration. The confidence of super-voxel $v_j$, denoted as $C_j$, is the average
log probability of all $n_j$ points in $v_j$ after the label propagation (a value closer to zero means higher confidence):

$$C_j = \frac{1}{n_j} \sum_{i=1}^{n_j} \log P(y_i \mid \mathbf{p}_i, \mathbf{V}, \mathbf{\Theta}, \mathcal{R}, \mathbf{G}), \quad \text{where} \quad \mathbf{p}_i \in \mathbf{v}_j$$

where $G$ denotes the graph propagation. With pseudo labels $Y$, $\Theta$ and $\mathcal{R}$ are optimised, respectively.
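The confidence-based selection of pseudo labels can be sketched as follows, using the formula for $C_j$ as written above. The threshold value and the function name are hypothetical; the notes do not say how the confidence cut-off is chosen.

```python
import numpy as np

def supervoxel_confidence(point_probs, sv_ids, num_sv):
    """Average log probability of the predicted label over the points of each super-voxel."""
    log_p = np.log(point_probs)
    sums = np.bincount(sv_ids, weights=log_p, minlength=num_sv)
    counts = np.bincount(sv_ids, minlength=num_sv)
    return sums / np.maximum(counts, 1)

# point_probs[i]: probability of the predicted label of point i after propagation.
point_probs = np.array([0.95, 0.9, 0.55, 0.5, 0.6])
sv_ids = np.array([0, 0, 1, 1, 1])
conf = supervoxel_confidence(point_probs, sv_ids, num_sv=2)

threshold = np.log(0.8)          # hypothetical cut-off
keep = conf >= threshold         # super-voxels whose predictions become pseudo labels
print(conf, keep)
```

Only the first super-voxel passes the cut-off here, so only its prediction would be added to the pseudo labels $Y$ for the next training round.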

**3D instance segmentation**:

Instance segmentation follows the [**PointGroup**](https://arxiv.org/abs/2004.01658) approach: an MLP head predicts, for each point, the offset to the centroid of its instance; points are then clustered on both the original and the offset-shifted coordinates (dual-set clustering), and Non-Maximum Suppression (NMS) removes duplicate clusters based on their IoU (Intersection over Union) overlap.
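The NMS step can be sketched as a greedy filter over candidate instance masks. This is a generic mask-level NMS, not PointGroup's own code; the masks, scores, and the IoU threshold of 0.5 are hypothetical.

```python
import numpy as np

def mask_iou(a, b):
    """Intersection over union of two boolean point masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def nms(masks, scores, iou_threshold=0.5):
    """Keep the highest-scoring clusters, dropping ones that overlap a kept cluster."""
    order = np.argsort(scores)[::-1]          # best score first
    kept = []
    for idx in order:
        if all(mask_iou(masks[idx], masks[k]) <= iou_threshold for k in kept):
            kept.append(int(idx))
    return kept

# Toy example: 3 candidate clusters over 6 points; clusters 0 and 1 overlap heavily.
masks = np.array([[1, 1, 1, 0, 0, 0],
                  [1, 1, 1, 1, 0, 0],
                  [0, 0, 0, 0, 1, 1]], dtype=bool)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(masks, scores)
print(kept)
```

Clusters 0 and 1 share three of four points, so only the higher-scoring one survives, while the disjoint cluster 2 is kept.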