# DiffusionGS

<div align="center">
  <img src="https://raw.githubusercontent.com/pleasure97/3D-AI-ML-Code-Implementation/main/2025/DiffusionGS/assets/pipeline.jpg" alt="Pipeline of DiffusionGS">
</div>

# 1. Training

## 1.1 3D Diffusion
---
* $\mathbf{x}_{\text{con}} \in \mathbb{R}^{H \times W \times 3}$ - one clean condition view
* $\mathcal{X}_t=\left\{\mathbf{x}_t^{(1)}, \mathbf{x}_t^{(2)}, \cdots, \mathbf{x}_t^{(N)}\right\}$ - $N$ noisy views
  * $\mathcal{X}_0=\left\{\mathbf{x}_0^{(1)}, \mathbf{x}_0^{(2)}, \cdots, \mathbf{x}_0^{(N)}\right\}$ - *Concatenated with $\mathcal{X}_t$*
* $\mathbf{v}_{\text{con}} \in \mathbb{R}^{H \times W \times 6}$ - viewpoint condition of the condition view
  * $\mathcal{V}=\left\{\mathbf{v}^{(1)}, \mathbf{v}^{(2)}, \cdots, \mathbf{v}^{(N)}\right\}$ - viewpoint conditions of the $N$ noisy views

$$\mathbf{x}_t^{(i)}=\sqrt{\overline{\alpha_t}}\, \mathbf{x}_0^{(i)}+\sqrt{1-\overline{\alpha_t}}\, \epsilon_t^{(i)}$$
* $\overline{\alpha_t}$ - pre-scheduled hyper-parameter
* $\epsilon_t^{(i)} \sim \mathcal{N}(0, \mathbf{I})$ and $i=1,2, \cdots, N$
* $t$ - timestep
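
The noising step can be sketched in NumPy as follows. This is a minimal illustration, not the reference implementation: it assumes the standard DDPM form in which the clean view is scaled by $\sqrt{\overline{\alpha_t}}$, and the linear beta schedule, the function names `make_alpha_bar` and `add_noise`, and the toy image size are all assumptions made for the example.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0_views, t, alpha_bar, rng):
    """Noise all N clean views at the same timestep t.

    x0_views: array of shape (N, H, W, 3), the clean views X_0.
    Returns (x_t, eps), both of shape (N, H, W, 3).
    """
    eps = rng.standard_normal(x0_views.shape)  # independent noise per view
    a = alpha_bar[t]
    x_t = np.sqrt(a) * x0_views + np.sqrt(1.0 - a) * eps
    return x_t, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8, 3))  # N = 3 toy views of size 8 x 8
x_t, eps = add_noise(x0, t=500, alpha_bar=alpha_bar, rng=rng)
```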
---
$$\mathcal{G}_\theta\left(\mathcal{X}_t \mid \mathbf{x}_{\text{con}}, \mathbf{v}_{\text{con}}, t, \mathcal{V}\right)=\left\{G_t^{(k)}\left(\mu_t^{(k)}, \boldsymbol{\Sigma}_t^{(k)}, \alpha_t^{(k)}, c_t^{(k)}\right)\right\}$$
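* $\theta$ - the learnable parameters of the denoiser
* $\mathcal{G}_\theta$ - the 3D Gaussians predicted by the denoiser
* $1 \leq k \leq N_g$
* $N_g=(N+1) H W$ - the number of per-pixel Gaussians $G_t^{(k)}$: one per pixel of the $N$ noisy views plus the condition view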
* $H , W$ - Height and Width of the image
* $\mu_t^{(k)} \in \mathbb{R}^3$ - the center position of each $G_t^{(k)}$
* $\boldsymbol{\Sigma}_t^{(k)} \in \mathbb{R}^{3 \times 3}$ - the covariance of each $G_t^{(k)}$ controlling its shape
  * parameterized by a rotation matrix $\mathbf{R}_t^{(k)}$ and a scaling matrix $\mathbf{S}_t^{(k)}$
* $\alpha_t^{(k)} \in \mathbb{R}$ - the opacity of each $G_t^{(k)}$ characterizing the transmittance
* $c_t^{(k)} \in \mathbb{R}^3$ - the RGB color of each $G_t^{(k)}$
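
As a small sketch of the quantities above, the snippet below computes $N_g$ and builds a covariance from a rotation and a scaling. It assumes the factorization $\boldsymbol{\Sigma}=\mathbf{R}\mathbf{S}\mathbf{S}^{\top}\mathbf{R}^{\top}$ used in the original 3D Gaussian Splatting formulation; the notes above only say that $\boldsymbol{\Sigma}_t^{(k)}$ is parameterized by $\mathbf{R}_t^{(k)}$ and $\mathbf{S}_t^{(k)}$. The function names and the toy values are hypothetical.

```python
import numpy as np

def num_gaussians(n_noisy_views, height, width):
    """N_g = (N + 1) * H * W: one Gaussian per pixel of every input view."""
    return (n_noisy_views + 1) * height * width

def covariance_from_rs(rotation, scales):
    """Build Sigma = R S S^T R^T (assumed 3DGS-style factorization).

    rotation: (3, 3) rotation matrix R.
    scales:   (3,) per-axis scales, the diagonal of S.
    """
    m = rotation @ np.diag(scales)
    return m @ m.T

# Toy example: a 90-degree rotation about the z axis swaps the x and y extents.
rot_z_90 = np.array([[0.0, -1.0, 0.0],
                     [1.0,  0.0, 0.0],
                     [0.0,  0.0, 1.0]])
sigma = covariance_from_rs(rot_z_90, np.array([0.1, 0.2, 0.3]))
n_g = num_gaussians(n_noisy_views=3, height=256, width=256)
```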
---
$$\mu_t^{(k)}=o^{(k)}+u_t^{(k)} d^{(k)}$$
* $o^{(k)}$ - the origin of the $k$-th pixel-aligned ray
* $d^{(k)}$ - the direction of the $k$-th pixel-aligned ray
---
$$u_t^{(k)}=w_t^{(k)} u_{\text{near}}+\left(1-w_t^{(k)}\right) u_{\text{far}}$$
* $w_t^{(k)} \in \mathbb{R}$ - the weight that interpolates the ray distance $u_t^{(k)}$ between $u_{\text{near}}$ and $u_{\text{far}}$
* $u_{\text{near}}$ - the nearest distance along the ray
* $u_{\text{far}}$ - the farthest distance along the ray
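
The two formulas above combine into a short computation: interpolate a distance along each ray, then walk that distance from the ray origin. The sketch below assumes the weights lie in $[0, 1]$ (for example after a sigmoid), which the notes do not state; the function name `gaussian_centers` and the near/far values are toy choices for illustration.

```python
import numpy as np

def gaussian_centers(origins, directions, weights, u_near, u_far):
    """Compute per-pixel Gaussian centers mu = o + u * d.

    origins, directions: (K, 3) ray origins o and directions d.
    weights: (K,) interpolation weights w, assumed to lie in [0, 1].
    """
    depth = weights * u_near + (1.0 - weights) * u_far   # u_t^(k)
    return origins + depth[:, None] * directions         # mu_t^(k)

# Three rays from the origin looking down +z.
origins = np.zeros((3, 3))
directions = np.tile([0.0, 0.0, 1.0], (3, 1))
weights = np.array([1.0, 0.5, 0.0])
centers = gaussian_centers(origins, directions, weights, u_near=1.0, u_far=5.0)
```

With these toy inputs, a weight of 1 places the Gaussian at the near distance, a weight of 0 at the far distance, and 0.5 halfway between.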

## 1.2 Denoiser

---

* $L$ - the number of transformer blocks
* Each transformer block contains one multi-head self-attention (MSA) layer, one MLP, and two layer normalization (LN) layers.
* $\hat{\mathcal{H}}=\left\{\hat{\mathbf{H}}_{\text{con}}, \hat{\mathbf{H}}^{(1)}, \cdots, \hat{\mathbf{H}}^{(N)}\right\}$ - per-pixel Gaussian maps
  * $\hat{\mathbf{H}}_{\text{con}}$, $\hat{\mathbf{H}}^{(i)} \in \mathbb{R}^{H \times W \times 14}$
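
A single block with one MSA, one MLP, and two LN layers could look roughly like the NumPy sketch below. Several details are assumptions not given in these notes: the pre-norm ordering with residual connections, the GELU activation, the 4x MLP expansion, and the omission of biases and learnable LN parameters. The parameter names in `params` are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token over its feature dimension (no learnable affine)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def multi_head_self_attention(x, w_qkv, w_out, num_heads):
    """x: (T, D) tokens; w_qkv: (D, 3D); w_out: (D, D)."""
    num_tokens, dim = x.shape
    head_dim = dim // num_heads
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)           # each (T, D)
    # Reshape (T, D) -> (heads, T, head_dim).
    q, k, v = [a.reshape(num_tokens, num_heads, head_dim).transpose(1, 0, 2)
               for a in (q, k, v)]
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(head_dim))  # (heads, T, T)
    out = (attn @ v).transpose(1, 0, 2).reshape(num_tokens, dim)
    return out @ w_out

def transformer_block(x, params, num_heads):
    """Pre-norm block: x + MSA(LN(x)), then x + MLP(LN(x))."""
    x = x + multi_head_self_attention(layer_norm(x), params["w_qkv"],
                                      params["w_out"], num_heads)
    x = x + gelu(layer_norm(x) @ params["w_fc1"]) @ params["w_fc2"]
    return x

rng = np.random.default_rng(0)
num_tokens, dim, heads = 16, 32, 4
params = {
    "w_qkv": rng.standard_normal((dim, 3 * dim)) * 0.02,
    "w_out": rng.standard_normal((dim, dim)) * 0.02,
    "w_fc1": rng.standard_normal((dim, 4 * dim)) * 0.02,
    "w_fc2": rng.standard_normal((4 * dim, dim)) * 0.02,
}
tokens = rng.standard_normal((num_tokens, dim))
out = transformer_block(tokens, params, heads)
```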
---
$$\hat{\mathcal{X}}_{(0, t)}=\left\{\hat{\mathbf{x}}_{(0, t)}^{(1)}, \hat{\mathbf{x}}_{(0, t)}^{(2)}, \cdots, \hat{\mathbf{x}}_{(0, t)}^{(N)}\right\}$$
* $\hat{\mathcal{X}}_{(0, t)}$ - the denoised multi-view images

---
$$\hat{\mathbf{x}}_{(0, t)}^{(i)}=F_r\left(\mathbf{M}_{\text{ext}}^{(i)}, \mathbf{M}_{\text{int}}^{(i)}, \mathcal{G}_\theta\left(\mathcal{X}_t \mid \mathbf{x}_{\text{con}}, \mathbf{v}_{\text{con}}, t, \mathcal{V}\right)\right)$$
* $F_r$ - the differentiable rasterization function
* $1 \leq i \leq N$
* $\mathbf{M}_{\text{ext}}^{(i)}$ - the extrinsic matrix of the viewpoint $\mathbf{c}^{(i)}$.
* $\mathbf{M}_{\text{int}}^{(i)}$ - the intrinsic matrix of the viewpoint $\mathbf{c}^{(i)}$.
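
A full differentiable rasterizer is beyond a short example, but the role of the two camera matrices can be illustrated by projecting Gaussian centers to pixel coordinates, which is the first step of rasterization. The sketch assumes a 4x4 world-to-camera extrinsic matrix and a 3x3 pinhole intrinsic matrix; the convention, the function name `project_centers`, and the toy camera are assumptions for illustration.

```python
import numpy as np

def project_centers(centers, m_ext, m_int):
    """Project world-space Gaussian centers to pixel coordinates.

    centers: (K, 3) world-space positions.
    m_ext:   (4, 4) world-to-camera matrix (assumed convention).
    m_int:   (3, 3) pinhole intrinsic matrix.
    Returns (uv, depth) with shapes (K, 2) and (K,).
    """
    ones = np.ones((len(centers), 1))
    homo = np.concatenate([centers, ones], axis=1)   # (K, 4) homogeneous points
    cam = (m_ext @ homo.T).T[:, :3]                  # camera-space points
    pix = (m_int @ cam.T).T                          # homogeneous pixel coords
    return pix[:, :2] / pix[:, 2:3], cam[:, 2]

# Toy camera at the world origin with focal length 100 and principal point (32, 32).
m_ext = np.eye(4)
m_int = np.array([[100.0,   0.0, 32.0],
                  [  0.0, 100.0, 32.0],
                  [  0.0,   0.0,  1.0]])
centers = np.array([[0.0, 0.0, 2.0],
                    [0.5, -0.5, 2.0]])
uv, depth = project_centers(centers, m_ext, m_int)
```

Here the first center lies on the optical axis and lands on the principal point, while the second is shifted by `focal * x / z` pixels in each direction.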

---
$$\boldsymbol{\Sigma}_t^{\prime(k, i)}=\mathbf{J}_t^{(i)} \mathbf{W}_t^{(i)} \boldsymbol{\Sigma}_t^{(k)} {\mathbf{W}_t^{(i)}}^{\top} {\mathbf{J}_t^{(i)}}^{\top}$$
* $\boldsymbol{\Sigma}_t^{(k)}$ - the 3D covariance matrix of each $G_t^{(k)}$ in the world coordinate system
* $\boldsymbol{\Sigma}_t^{\prime(k, i)} \in \mathbb{R}^{3 \times 3}$ - the 3D covariance matrix of each $G_t^{(k)}$ at viewpoint $\mathbf{c}^{(i)}$ in the camera coordinate system
* $\mathbf{J}_t^{(i)} \in \mathbb{R}^{3 \times 3}$ - the Jacobian matrix of the affine approximation of the projective transformation
* $\mathbf{W}_t^{(i)} \in \mathbb{R}^{3 \times 3}$ - the viewing transformation
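
The covariance transform can be written out directly. The sketch below assumes a pinhole camera with focal lengths `fx` and `fy` and follows the common 3D Gaussian Splatting convention in which the third row of the Jacobian is zero and the upper-left 2x2 block of the result serves as the 2D image-space covariance; these conventions and the function names are assumptions, not taken from the notes above.

```python
import numpy as np

def projection_jacobian(t_cam, fx, fy):
    """Jacobian of the perspective projection at camera-space point t_cam.

    The zero third row follows the usual 3DGS implementation (assumed).
    """
    tx, ty, tz = t_cam
    return np.array([
        [fx / tz, 0.0,     -fx * tx / tz**2],
        [0.0,     fy / tz, -fy * ty / tz**2],
        [0.0,     0.0,      0.0],
    ])

def project_covariance(sigma_world, w_view, t_cam, fx, fy):
    """Sigma' = J W Sigma W^T J^T."""
    j = projection_jacobian(t_cam, fx, fy)
    return j @ w_view @ sigma_world @ w_view.T @ j.T

sigma_world = np.diag([0.01, 0.04, 0.09])
w_view = np.eye(3)                      # camera axes aligned with the world
sigma_cam = project_covariance(sigma_world, w_view,
                               t_cam=np.array([0.0, 0.0, 2.0]),
                               fx=100.0, fy=100.0)
cov_2d = sigma_cam[:2, :2]              # image-space covariance in pixels^2
```

For a Gaussian on the optical axis at depth 2, the Jacobian reduces to a scaling by `fx / tz = 50`, so the world-space variances 0.01 and 0.04 become 25 and 100 in squared pixels.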

## 1.3 Scene-Object Mixed Training Strategy

