# DiffusionGS - Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation

Official project page - https://caiyuanhao1998.github.io/project/DiffusionGS/

<div align="center">
  <img src="https://raw.githubusercontent.com/pleasure97/3D-AI-ML-Code-Implementation/main/2025/DiffusionGS/assets/pipeline.JPG" alt="Pipeline of DiffusionGS">
</div>

# 1. Dataset

On page 6 of the paper:

> We use **Objaverse and MVImgNet** as the training sets for objects.

> We center and scale each 3D object of Objaverse into $[-1, 1]^3$, and render 32 images at random viewpoints with 50 FOV.

> For MVImgNet, we crop the object, remove the background, normalize the cameras, and center and scale the object to $[-1, 1]^3$.
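
The centering and scaling step quoted above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the helper name `normalize_to_unit_cube` is hypothetical, and centering on the bounding box with one uniform scale factor is an assumption, since the paper excerpt does not say how the center and scale are computed.

```python
import numpy as np

def normalize_to_unit_cube(vertices: np.ndarray) -> np.ndarray:
    """Center an (N, 3) vertex array and scale it uniformly into [-1, 1]^3.

    Assumption: the center is the bounding-box center and one scale factor
    is shared by all axes, so the aspect ratio of the object is preserved.
    """
    vmin = vertices.min(axis=0)
    vmax = vertices.max(axis=0)
    center = (vmin + vmax) / 2.0              # bounding-box center
    half_extent = (vmax - vmin).max() / 2.0   # half of the longest box side
    return (vertices - center) / half_extent

# Toy "mesh" with three vertices; the longest side of its bounding box is 4.
vertices = np.array([[0.0, 0.0, 0.0],
                     [4.0, 2.0, 1.0],
                     [2.0, 1.0, 0.5]])
normalized = normalize_to_unit_cube(vertices)
print(normalized.min(axis=0), normalized.max(axis=0))
```

Only the longest axis reaches exactly -1 and 1; the shorter axes stay inside the cube.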

## 1.1 Objaverse Dataset

For more details on loading the Objaverse dataset, see this Colab notebook - https://colab.research.google.com/drive/1ZLA4QufsiI_RuNlamKqV7D7mn40FbWoY

In [1]:
!pip install objaverse --upgrade --quiet

import objaverse
objaverse.__version__


'0.1.7'

Each object has a unique corresponding ID (universal identifier).

In [6]:
uids = objaverse.load_uids()
print(f"length of uids : {len(uids)}")
print(f"type of uids : {type(uids)}")

length of uids : 798759
type of uids : <class 'list'>


In [5]:
uids[:10]

['8476c4170df24cf5bbe6967222d1a42d',
 '8ff7f1f2465347cd8b80c9b206c2781e',
 'c786b97d08b94d02a1fa3b87d2e86cf1',
 '139331da744542009f146018fd0e05f4',
 'be2c02614d774f9da672dfdc44015219',
 'efd35e7d21ac482688c294e3b6c9f74e',
 '21d5f90dbc9f4f229b0faa7b56b67f3e',
 'dcd33159a0864de388de3a08f55e604a',
 'a7ad32b5d4d84ee5a40ebbd86da4dbe4',
 '7d6a14874eed48c2b720f0d1adfe6dd9']

We can get the annotations for each object using `objaverse.load_annotations()`.

In [11]:
annotations = objaverse.load_annotations(uids[:1])
annotations

{'8476c4170df24cf5bbe6967222d1a42d': {'uri': 'https://api.sketchfab.com/v3/models/8476c4170df24cf5bbe6967222d1a42d',
  'uid': '8476c4170df24cf5bbe6967222d1a42d',
  'name': 'Iain_Dawson_Kew_Road_Formby',
  'staffpickedAt': None,
  'viewCount': 4,
  'likeCount': 0,
  'animationCount': 0,
  'viewerUrl': 'https://sketchfab.com/3d-models/8476c4170df24cf5bbe6967222d1a42d',
  'embedUrl': 'https://sketchfab.com/models/8476c4170df24cf5bbe6967222d1a42d/embed',
  'commentCount': 0,
  'isDownloadable': True,
  'publishedAt': '2021-03-18T09:36:25.430631',
  'tags': [{'name': 'stair',
    'slug': 'stair',
    'uri': 'https://api.sketchfab.com/v3/tags/stair'},
   {'name': 'staircase',
    'slug': 'staircase',
    'uri': 'https://api.sketchfab.com/v3/tags/staircase'},
   {'name': 'staircon',
    'slug': 'staircon',
    'uri': 'https://api.sketchfab.com/v3/tags/staircon'}],
  'categories': [],
  'thumbnails': {'images': [{'uid': '606cf3aaaea14bb598913e803c7b26af',
     'size': 37800,
     'width': 1920

We're going to use multiprocessing to download the objects.

In [13]:
import multiprocessing
processes = multiprocessing.cpu_count()
processes

2

`objaverse.load_objects()` takes a list of object UIDs and, optionally, the number of download processes, and returns a map from each object UID to the location of its `.glb` file on disk.

In [14]:
objects = objaverse.load_objects(uids=uids[:10],
                                 download_processes=processes)
objects

starting download of 10 objects with 2 processes
Downloaded 1 / 10 objects
Downloaded 2 / 10 objects
Downloaded 3 / 10 objects
Downloaded 4 / 10 objects
Downloaded 5 / 10 objects
Downloaded 6 / 10 objects
Downloaded 7 / 10 objects
Downloaded 8 / 10 objects
Downloaded 9 / 10 objects
Downloaded 10 / 10 objects


{'8476c4170df24cf5bbe6967222d1a42d': '/root/.objaverse/hf-objaverse-v1/glbs/000-023/8476c4170df24cf5bbe6967222d1a42d.glb',
 '8ff7f1f2465347cd8b80c9b206c2781e': '/root/.objaverse/hf-objaverse-v1/glbs/000-023/8ff7f1f2465347cd8b80c9b206c2781e.glb',
 'c786b97d08b94d02a1fa3b87d2e86cf1': '/root/.objaverse/hf-objaverse-v1/glbs/000-023/c786b97d08b94d02a1fa3b87d2e86cf1.glb',
 '139331da744542009f146018fd0e05f4': '/root/.objaverse/hf-objaverse-v1/glbs/000-023/139331da744542009f146018fd0e05f4.glb',
 'be2c02614d774f9da672dfdc44015219': '/root/.objaverse/hf-objaverse-v1/glbs/000-023/be2c02614d774f9da672dfdc44015219.glb',
 'efd35e7d21ac482688c294e3b6c9f74e': '/root/.objaverse/hf-objaverse-v1/glbs/000-023/efd35e7d21ac482688c294e3b6c9f74e.glb',
 '21d5f90dbc9f4f229b0faa7b56b67f3e': '/root/.objaverse/hf-objaverse-v1/glbs/000-023/21d5f90dbc9f4f229b0faa7b56b67f3e.glb',
 'dcd33159a0864de388de3a08f55e604a': '/root/.objaverse/hf-objaverse-v1/glbs/000-023/dcd33159a0864de388de3a08f55e604a.glb',
 'a7ad32b5d4d84e

Let's load up one of the `.glb` files to visualize it.

In [15]:
!pip install trimesh --quiet

In [18]:
import trimesh
trimesh.load(list(objects.values())[0]).show()

## 1.2 MVImgNet Dataset

To download MVImgNet, fill in the request form at the following link - https://docs.google.com/forms/d/e/1FAIpQLSfU9BkV1hY3r75n5rc37IvlzaK2VFYbdsvohqPGAjb2YWIbUg/viewform?usp=sf_link

# 2. Training

## 2.1 3D Diffusion
---
* $\mathbf{x}_{\text{con}} \in \mathbb{R}^{H \times W \times 3}$ - one clean condition view
* $\mathcal{X}_t=\left\{\mathbf{x}_t^{(1)}, \mathbf{x}_t^{(2)}, \cdots, \mathbf{x}_t^{(N)}\right\}$ - $N$ noisy views
  * $\mathcal{X}_0=\left\{\mathbf{x}_0^{(1)}, \mathbf{x}_0^{(2)}, \cdots, \mathbf{x}_0^{(N)}\right\}$ - the clean counterparts of $\mathcal{X}_t$
* $\mathbf{v}_{\text{con}} \in \mathbb{R}^{H \times W \times 6}$ - the viewpoint condition of the condition view, concatenated with $\mathbf{x}_{\text{con}}$
  * $\mathcal{V}=\left\{\mathbf{v}^{(1)}, \mathbf{v}^{(2)}, \cdots, \mathbf{v}^{(N)}\right\}$ - the viewpoint conditions of the $N$ noisy views, concatenated with $\mathcal{X}_t$

$$\mathbf{x}_t^{(i)}=\overline{\alpha_t} \mathbf{x}_0^{(i)}+\sqrt{1-\overline{\alpha_t}} \epsilon_t^{(i)}$$
* $\overline{\alpha_t}$ - pre-scheduled hyper-parameter
* $\epsilon_t^{(i)} \sim \mathcal{N}(0, \mathbf{I})$ and $i=1,2, \cdots, N$
* $t$ - timestep
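
The forward noising step can be sketched in NumPy as below. The sketch follows the equation exactly as written in this notebook; note that the common DDPM formulation scales $\mathbf{x}_0$ by $\sqrt{\bar{\alpha}_t}$ instead. The function name `add_noise`, the toy tensor sizes, and the value `alpha_bar_t=0.9` are illustrative assumptions.

```python
import numpy as np

def add_noise(x0: np.ndarray, alpha_bar_t: float, rng: np.random.Generator):
    """Noise N clean views x0 of shape (N, H, W, 3) at one timestep.

    Implements x_t = alpha_bar_t * x0 + sqrt(1 - alpha_bar_t) * eps,
    i.e. the equation as written in this notebook.
    """
    eps = rng.standard_normal(x0.shape)   # eps ~ N(0, I), drawn independently per view
    x_t = alpha_bar_t * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return x_t, eps

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(4, 8, 8, 3))   # N = 4 toy views of size 8 x 8
x_t, eps = add_noise(x0, alpha_bar_t=0.9, rng=rng)
print(x_t.shape)
```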
---
$$\mathcal{G}_\theta\left(\mathcal{X}_t \mid \mathbf{x}_{\text{con}}, \mathbf{v}_{\text{con}}, t, \mathcal{V}\right)=\left\{G_t^{(k)}\left(\mu_t^{(k)}, \boldsymbol{\Sigma}_t^{(k)}, \alpha_t^{(k)}, c_t^{(k)}\right)\right\}$$
* $\theta$ - the denoiser (its learnable parameters)
* $\mathcal{G}_\theta$ - the 3D Gaussians predicted by the denoiser
* $1 \leq k \leq N_g$
* $N_g=(N+1) H W$ - the number of per-pixel Gaussians $G_t^{(k)}$: one per pixel of the condition view and of each of the $N$ noisy views
* $H, W$ - the height and width of the image
* $\mu_t^{(k)} \in \mathbb{R}^3$ - the center position of each $G_t^{(k)}$ (clipped into $[-1, 1]^3$)
* $\boldsymbol{\Sigma}_t^{(k)} \in \mathbb{R}^{3 \times 3}$ - the covariance of each $G_t^{(k)}$ controlling its shape
  * parameterized by a rotation matrix $\mathbf{R}_t^{(k)}$ and a scaling matrix $\mathbf{S}_t^{(k)}$
* $\alpha_t^{(k)} \in \mathbb{R}$ - the opacity of each $G_t^{(k)}$ characterizing the transmittance
* $c_t^{(k)} \in \mathbb{R}^3$ - the RGB color of each $G_t^{(k)}$
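
As an illustration of how a covariance can be built from a rotation matrix and a scaling matrix, the sketch below uses the factorization $\boldsymbol{\Sigma} = \mathbf{R}\mathbf{S}\mathbf{S}^{\top}\mathbf{R}^{\top}$ from the original 3D Gaussian Splatting formulation. That DiffusionGS uses exactly this factorization is an assumption here, and the helper name and toy values are hypothetical.

```python
import numpy as np

def covariance_from_rotation_scale(R: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Build Sigma = R S S^T R^T from a 3x3 rotation R and per-axis scales."""
    S = np.diag(scales)   # scaling matrix
    M = R @ S
    return M @ M.T        # symmetric positive semi-definite by construction

# Rotate by 90 degrees about the z-axis, so the x and y extents swap.
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
sigma = covariance_from_rotation_scale(R, np.array([2.0, 1.0, 0.5]))
print(np.round(sigma, 6))
```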
---
$$\mu_t^{(k)}=o^{(k)}+u_t^{(k)} d^{(k)}$$
* $o^{(k)}$ - the origin of the $k$-th pixel-aligned ray
* $d^{(k)}$ - the direction of the $k$-th pixel-aligned ray
---
$$u_t^{(k)}=w_t^{(k)} u_{\text{near}}+\left(1-w_t^{(k)}\right) u_{\text{far}}$$
* $u_t^{(k)}$ - the distance of the $k$-th Gaussian center along its ray
* $w_t^{(k)} \in \mathbb{R}$ - the weight that controls $u_t^{(k)}$
* $u_{\text{near}}$ - the nearest distance
* $u_{\text{far}}$ - the farthest distance
* For the object-level Gaussian decoder, $[u_{\text{near}}, u_{\text{far}}] = [0.1, 4.2]$
* For the scene-level Gaussian decoder, $[u_{\text{near}}, u_{\text{far}}] = [0, 500]$
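
The two equations above can be combined into a short sketch that turns per-pixel weights into Gaussian centers. The function name `gaussian_centers` is hypothetical, the weights are assumed to lie in $[0, 1]$ (for example after a sigmoid), and the default range is the object-level $[0.1, 4.2]$ quoted above.

```python
import numpy as np

def gaussian_centers(origins, directions, weights, u_near=0.1, u_far=4.2):
    """Compute depths u and centers mu = o + u * d for K pixel-aligned rays.

    origins, directions: (K, 3) arrays; weights: (K,) array, assumed in [0, 1].
    """
    depths = weights * u_near + (1.0 - weights) * u_far   # u_t^{(k)}
    centers = origins + depths[:, None] * directions      # mu_t^{(k)}
    return depths, centers

origins = np.array([[0.0, 0.0, -2.0], [0.0, 0.0, -2.0]])
directions = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
weights = np.array([1.0, 0.5])

depths, centers = gaussian_centers(origins, directions, weights)
clipped = np.clip(centers, -1.0, 1.0)   # centers are clipped into [-1, 1]^3
print(depths, centers[:, 2], clipped[:, 2])
```

A weight of 1 places the center at the near bound and a weight of 0 at the far bound; the first toy center falls outside the cube and is pulled back by the clipping step.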

## 2.2 Denoiser

---

* $L$ - the number of transformer blocks
* Each transformer block contains one multi-head self-attention (MSA) layer, one MLP, and two layer normalizations (LN).
* $\hat{\mathcal{H}}=\left\{\hat{\mathbf{H}}_{\text{con}}, \hat{\mathbf{H}}^{(1)}, \cdots, \hat{\mathbf{H}}^{(N)}\right\}$ - the per-pixel Gaussian maps
  * $\hat{\mathbf{H}}_{\text{con}}, \hat{\mathbf{H}}^{(i)} \in \mathbb{R}^{H \times W \times 14}$



In [None]:
import torch
import torch.nn as nn

In [None]:
class PatchEmbedding(nn.Module):
  """Turns a 2D input image into a 1D sequence of learnable patch embedding vectors.
  Args:
    in_channels (int) - Number of color channels for the input images. Defaults to 3.
    patch_size (int) - Size of patches to convert input images into. Defaults to 16.
    embedding_dim (int) - Size of embedding to turn image into. Defaults to 768.
  """
  def __init__(self, in_channels: int=3, patch_size: int=16, embedding_dim: int=768):
    super().__init__()

    self.in_channels = in_channels
    self.patch_size = patch_size
    self.embedding_dim = embedding_dim

    self.patchify = nn.Conv2d(in_channels=self.in_channels,
                              out_channels=self.embedding_dim,
                              kernel_size=self.patch_size,
                              stride=self.patch_size,
                              padding=0)

    self.flatten = nn.Flatten(start_dim=2, end_dim=3)

  def forward(self, x: torch.Tensor):
    image_resolution = x.shape[-1]
    assert image_resolution % self.patch_size == 0, \
      f"Input size must be divisible by patch size, image size : {image_resolution}, patch size : {self.patch_size}"

    x_patched = self.patchify(x)           # [batch_size, embedding_dim, H / patch_size, W / patch_size]
    x_flattened = self.flatten(x_patched)  # [batch_size, embedding_dim, num_patches]

    return x_flattened.permute(0, 2, 1)    # [batch_size, num_patches, embedding_dim]

In [None]:
!pip install torchinfo --quiet

In [None]:
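from torchinfo import summary

# `image` is not defined earlier in this notebook; a random 224x224 RGB tensor is assumed here as a stand-in.
image = torch.randn(3, 224, 224)

# Use a lowercase instance name so that the PatchEmbedding class is not shadowed.
patch_embedding = PatchEmbedding()
height, width = image.shape[1], image.shape[2]
num_patches = (height * width) // patch_embedding.patch_size ** 2

# summary(patch_embedding,
#         input_size=(1, 3, height, width),
#         col_names=["input_size", "output_size", "num_params", "trainable"],
#         col_width=20,
#         row_settings=["var_names"])

In [None]:
# No class token is prepended, so there is one positional embedding per patch.
positional_embedding = nn.Parameter(torch.ones(1, num_patches, patch_embedding.embedding_dim), requires_grad=True)
# Embed the image (with a batch dimension added) and add the positional embedding: [1, num_patches, embedding_dim]
patch_and_positional_embedding = patch_embedding(image.unsqueeze(0)) + positional_embedding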

In [None]:
class MultiHeadSelfAttentionBlock(nn.Module):
  def __init__(self,
               embedding_dim: int=768,
               num_heads: int=12,
               attention_dropout: float=0.):
    super().__init__()

    self.embedding_dim = embedding_dim
    self.num_heads = num_heads
    self.attention_dropout = attention_dropout

    self.layer_norm = nn.LayerNorm(normalized_shape=self.embedding_dim)

    self.multihead_attention = nn.MultiheadAttention(embed_dim=self.embedding_dim,
                                                     num_heads=self.num_heads,
                                                     dropout=self.attention_dropout,
                                                     batch_first=True)

  def forward(self, x: torch.Tensor):
    x = self.layer_norm(x)
    attention_output, _ = self.multihead_attention(query=x, key=x, value=x, need_weights=False)
    return attention_output

In [None]:
class MLPBlock(nn.Module):
  def __init__(self,
               embedding_dim: int=768,
               mlp_size: int=3072,
               dropout: float=0.1):
    super().__init__()

    self.embedding_dim = embedding_dim
    self.mlp_size = mlp_size
    self.dropout = dropout

    self.layer_norm = nn.LayerNorm(normalized_shape=embedding_dim)

    self.mlp = nn.Sequential(
        nn.Linear(in_features=self.embedding_dim, out_features=self.mlp_size),
        nn.GELU(),
        nn.Dropout(p=dropout),
        nn.Linear(in_features=self.mlp_size, out_features=self.embedding_dim),
        nn.Dropout(p=dropout)
    )

  def forward(self, x: torch.Tensor):
    x = self.layer_norm(x)
    x = self.mlp(x)
    return x

In [None]:
class TransformerBlock(nn.Module):
  def __init__(self,
               embedding_dim: int=768,
               num_heads: int=12,
               mlp_size: int=3072,
               mlp_dropout: float=0.1,
               attention_dropout: float=0.):
    super().__init__()

    self.embedding_dim = embedding_dim
    self.num_heads = num_heads
    self.mlp_size = mlp_size
    self.mlp_dropout = mlp_dropout
    self.attention_dropout = attention_dropout

    self.MSABlock = MultiHeadSelfAttentionBlock(embedding_dim=self.embedding_dim,
                                                num_heads=self.num_heads,
                                                attention_dropout=self.attention_dropout)

    # MLPBlock names its dropout argument `dropout`.
    self.MLPBlock = MLPBlock(embedding_dim=self.embedding_dim,
                             mlp_size=self.mlp_size,
                             dropout=self.mlp_dropout)

  def forward(self, x: torch.Tensor):

    x = self.MSABlock(x) + x
    x = self.MLPBlock(x) + x

    return x

In [None]:
class TransformerLayer(nn.Module):
  def __init__(self,
               img_size: int=224,
               in_channels: int=3,
               patch_size: int=16,  # must divide img_size (224 / 16 = 14)
               num_transformer_layers: int=12,
               embedding_dim: int=768,
               mlp_size: int=3072,
               num_heads: int=12,
               attention_dropout: float=0.,
               mlp_dropout: float=0.1,
               embedding_dropout: float=0.1,
               num_outputs: int=3072):
    super().__init__()

    assert img_size % patch_size == 0, f"Image size must be divisible by patch size, image size: {img_size}, patch size: {patch_size}."

    self.img_size = img_size
    self.in_channels = in_channels
    self.patch_size = patch_size
    self.num_transformer_layers = num_transformer_layers
    self.embedding_dim = embedding_dim
    self.mlp_size = mlp_size
    self.num_heads = num_heads
    self.attention_dropout = attention_dropout
    self.mlp_dropout = mlp_dropout
    self.embedding_dropout = embedding_dropout
    self.num_outputs = num_outputs

    self.num_patches = (self.img_size * self.img_size) // self.patch_size ** 2
    # No class token is prepended in forward(), so there is one positional embedding per patch.
    self.PositionEmbedding = nn.Parameter(data=torch.randn(1, self.num_patches, self.embedding_dim), requires_grad=True)
    self.EmbeddingDropout = nn.Dropout(p=self.embedding_dropout)
    self.PatchEmbedding = PatchEmbedding(in_channels=self.in_channels,
                                         patch_size=self.patch_size,
                                         embedding_dim=self.embedding_dim)
    self.TransformerBlocks = nn.Sequential(*[TransformerBlock(embedding_dim=self.embedding_dim,
                                                             num_heads=self.num_heads,
                                                             mlp_size=self.mlp_size,
                                                             mlp_dropout=self.mlp_dropout,
                                                             attention_dropout=self.attention_dropout) for _ in range(self.num_transformer_layers)])
    self.classifier = nn.Sequential(nn.LayerNorm(normalized_shape=self.embedding_dim),
                                    nn.Linear(in_features=self.embedding_dim, out_features=self.num_outputs))

  def forward(self, x: torch.Tensor):
    x = self.PatchEmbedding(x)      # [batch_size, num_patches, embedding_dim]
    x = self.PositionEmbedding + x  # learnable positional embedding, broadcast over the batch
    x = self.EmbeddingDropout(x)
    x = self.TransformerBlocks(x)
    x = self.classifier(x)          # [batch_size, num_patches, num_outputs]
    return x

---
$$\hat{\mathcal{X}}_{(0, t)}=\left\{\hat{\mathbf{x}}_{(0, t)}^{(1)}, \hat{\mathbf{x}}_{(0, t)}^{(2)}, \cdots, \hat{\mathbf{x}}_{(0, t)}^{(N)}\right\}$$
* $\hat{\mathcal{X}}_{(0, t)}$ - the denoised multi-view images

---
$$\hat{\mathbf{x}}_{(0, t)}^{(i)}=F_r\left(\mathbf{M}_{\text{ext}}^{(i)}, \mathbf{M}_{\text{int}}^{(i)}, \mathcal{G}_\theta\left(\mathcal{X}_t \mid \mathbf{x}_{\text{con}}, \mathbf{v}_{\text{con}}, t, \mathcal{V}\right)\right)$$
* $F_r$ - the differentiable rasterization function
* $1 \leq i \leq N$
* $\mathbf{M}_{e x t}^{(i)}$ - the extrinsic matrix of the viewpoint $\mathbf{c}^{(i)}$.
* $\mathbf{M}_{i n t}^{(i)}$ - the intrinsic matrix of the viewpoint $\mathbf{c}^{(i)}$.

In [None]:
!git clone https://github.com/graphdeco-inria/diff-gaussian-rasterization.git diff-gaussian-rasterization

---
$$\boldsymbol{\Sigma}_t^{\prime(k, i)}=\mathbf{J}_t^{(i)} \mathbf{W}_t^{(i)} \boldsymbol{\Sigma}_t^{(k)} \mathbf{W}_t^{(i)^{\top}} \mathbf{J}_t^{(i)^{\top}}$$
* $\boldsymbol{\Sigma}_t^{(k)}$ - the 3D covariance matrix of each $G_t^{(k)}$ in the world coordinate system
* $\boldsymbol{\Sigma}_t^{\prime(k, i)} \in \mathbb{R}^{3 \times 3}$  - the 3D covariance matrix of each $G_t^{(k)}$ at viewpoint $\mathbf{c}^{(i)}$ in the camera coordinate system
*  $\mathbf{J}_t^{(i)} \in \mathbb{R}^{3 \times 3}$ - the Jacobian matrix of the affine approximation of the projective transformation
* $\mathbf{W}_t^{(i)} \in \mathbb{R}^{3 \times 3}$ - the viewing transformation
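
The covariance transformation above can be sketched in NumPy as follows. The Jacobian here is the first-order (affine) approximation of a pinhole projection $(x, y, z) \mapsto (f_x x / z, f_y y / z)$, padded with a zero third row to match the $3 \times 3$ shape stated above. The function names, the focal lengths, and the identity viewing transformation are illustrative assumptions, not values from the paper.

```python
import numpy as np

def perspective_jacobian(point_cam: np.ndarray, fx: float, fy: float) -> np.ndarray:
    """Jacobian of (x, y, z) -> (fx * x / z, fy * y / z), padded to 3x3 with a zero row."""
    x, y, z = point_cam
    return np.array([[fx / z, 0.0, -fx * x / z**2],
                     [0.0, fy / z, -fy * y / z**2],
                     [0.0, 0.0, 0.0]])

def project_covariance(sigma_world: np.ndarray, W: np.ndarray, J: np.ndarray) -> np.ndarray:
    """Compute Sigma' = J W Sigma W^T J^T."""
    T = J @ W
    return T @ sigma_world @ T.T

sigma_world = np.diag([0.01, 0.04, 0.09])   # toy world-space covariance
W = np.eye(3)                               # assumed: camera axes aligned with world axes
J = perspective_jacobian(np.array([0.0, 0.0, 2.0]), fx=100.0, fy=100.0)
sigma_cam = project_covariance(sigma_world, W, J)
print(sigma_cam)
```

With this padded Jacobian, the upper-left $2 \times 2$ block of the result is the footprint of the Gaussian on the image plane, and the third row and column are zero.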

## 2.3 Scene-Object Mixed Training Strategy

---

### Viewpoint Selecting

$$\theta_{c d}^{(i)} \leq \theta_1, \quad \theta_{d n}^{(i, j)} \leq \theta_2,$$

* The first constraint bounds the angles between viewpoint positions.
* $\theta_{c d}^{(i)}$ - the angle between the $i$-th noisy view position and the condition view position
* $\theta_{d n}^{(i, j)}$ - the angle between the $i$-th noisy view position and the $j$-th novel view position
* $\theta_{1}, \theta_{2}$ - hyperparameters
* $1 \leq i \leq N$
* $1 \leq j \leq M$, where $M$ is the number of novel views

$$
\frac{\vec{z}_{\text{con}} \cdot \vec{z}_{\text{noise}}^{(i)}}{\left|\vec{z}_{\text{con}}\right| \cdot\left|\vec{z}_{\text{noise}}^{(i)}\right|} \geq \cos \left(\varphi_1\right), \quad \frac{\vec{z}_{\text{con}} \cdot \vec{z}_{\text{nv}}^{(j)}}{\left|\vec{z}_{\text{con}}\right| \cdot\left|\vec{z}_{\text{nv}}^{(j)}\right|} \geq \cos \left(\varphi_2\right)
$$

* The second constraint bounds the angles between viewpoint orientations.
* $\vec{z}_{\text{con}}$ - the forward direction vector of the condition view
* $\vec{z}_{\text{noise}}^{(i)}$ - the forward direction vector of the $i$-th noisy view
* $\vec{z}_{\text{nv}}^{(j)}$ - the forward direction vector of the $j$-th novel view
* $\varphi_1$, $\varphi_2$ - hyperparameters
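
Both constraints can be checked together with a small NumPy sketch. Several things here are assumptions: the angle between two viewpoint positions is measured between their position vectors from the world origin, the helper names are hypothetical, and the thresholds and camera layout are toy values, not the hyperparameters of the paper. Because $\cos$ is decreasing on $[0, \pi]$, the condition $\cos(\text{angle}) \geq \cos(\varphi)$ is checked as $\text{angle} \leq \varphi$.

```python
import numpy as np

def pairwise_angles(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Angles in radians between every row of a (N, 3) and every row of b (M, 3)."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.arccos(np.clip(a_unit @ b_unit.T, -1.0, 1.0))

def views_are_valid(p_con, z_con, p_noisy, z_noisy, p_novel, z_novel,
                    theta_1, theta_2, phi_1, phi_2) -> bool:
    """Check the position constraint and the orientation constraint for one sample."""
    position_ok = ((pairwise_angles(p_noisy, p_con[None]) <= theta_1).all()
                   and (pairwise_angles(p_noisy, p_novel) <= theta_2).all())
    orientation_ok = ((pairwise_angles(z_noisy, z_con[None]) <= phi_1).all()
                      and (pairwise_angles(z_novel, z_con[None]) <= phi_2).all())
    return bool(position_ok and orientation_ok)

def camera_on_circle(angle_deg: float, radius: float = 2.0):
    """Toy camera on a circle in the xy-plane, looking at the origin."""
    a = np.deg2rad(angle_deg)
    position = radius * np.array([np.cos(a), np.sin(a), 0.0])
    return position, -position / radius   # forward vector is a unit vector

p_con, z_con = camera_on_circle(0.0)
noisy = [camera_on_circle(a) for a in (20.0, 40.0)]
novel = [camera_on_circle(30.0)]
p_noisy, z_noisy = np.stack([p for p, _ in noisy]), np.stack([z for _, z in noisy])
p_novel, z_novel = np.stack([p for p, _ in novel]), np.stack([z for _, z in novel])

deg = np.deg2rad
valid = views_are_valid(p_con, z_con, p_noisy, z_noisy, p_novel, z_novel,
                        deg(45), deg(15), deg(45), deg(45))
too_strict = views_are_valid(p_con, z_con, p_noisy, z_noisy, p_novel, z_novel,
                             deg(30), deg(15), deg(45), deg(45))
print(valid, too_strict)
```

In the toy layout the noisy views sit 20 and 40 degrees away from the condition view, so the sample passes with $\theta_1 = 45°$ and is rejected with $\theta_1 = 30°$.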

---

### Reference-Point Plücker Coordinate (RPPC)

$$r=(o-(o \cdot d) d, d)$$

* $r$ - the pixel-aligned ray embeddings
* $o$ - the position of the ray landing on the pixel
* $d$ - the direction of the ray landing on the pixel
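
A minimal sketch of the RPPC formula is given below. The function name `rppc` is hypothetical, and normalizing $d$ to unit length is an assumption: with a unit direction, $o-(o \cdot d) d$ is the point on the ray closest to the world origin, which serves as the reference point. The six output channels match the $H \times W \times 6$ shape of the viewpoint conditions in Section 2.1.

```python
import numpy as np

def rppc(origins: np.ndarray, directions: np.ndarray) -> np.ndarray:
    """Return r = (o - (o . d) d, d) for rays given as (..., 3) arrays."""
    d = directions / np.linalg.norm(directions, axis=-1, keepdims=True)  # unit direction (assumed)
    projection = np.sum(origins * d, axis=-1, keepdims=True)             # o . d
    reference_point = origins - projection * d                           # o - (o . d) d
    return np.concatenate([reference_point, d], axis=-1)                 # (..., 6)

origin = np.array([[1.0, 2.0, 3.0]])
direction = np.array([[0.0, 0.0, 2.0]])   # not unit length on purpose
r = rppc(origin, direction)
print(r)
```

For comparison, standard Plücker coordinates use the moment $o \times d$ in place of the reference point.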

---

### Overall Training Objective

$$\mathcal{L}_{\text{pd}} = \mathbb{E}_k \left[ l_t^{(k)} - \frac{\mathbb{E}_k[l_t^{(k)}] - \sigma_0 + \mathbb{E}[o^{(k)}]}{\sqrt{\text{Var}(l_t^{(k)})}} \right]$$

* $\mathcal{L}_{\text{pd}}$ - the point distribution loss for training warm-up
* $\mathbb{E}$ - the mean value
* $l_t^{(k)} = |u_t^{(k)} d^{(k)}|$
* $\text{Var}$ - the variance
* $\sigma_{0}$ - the target standard deviation (set to 0.5)

$$\mathcal{L} = (\mathcal{L}_{\text{de}} + \mathcal{L}_{\text{nv}}) \cdot \mathbb{1}_{\text{iter} > \text{iter}_0} + \mathcal{L}_{\text{pd}} \cdot \mathbb{1}_{\text{iter} \leq \text{iter}_0} \cdot \mathbb{1}_{\text{object}}$$

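* $\mathcal{L}$ - the overall training objective
* $\mathcal{L}_{\text{de}}$ - the denoising loss
* $\mathcal{L}_{\text{nv}}$ - the novel view loss
* $\mathbb{1}_{\text{iter} > \text{iter}_0}$ - the indicator function, which equals 1 if the current training iteration is greater than the threshold $\text{iter}_0$ and 0 otherwise
* $\mathbb{1}_{\text{iter} \leq \text{iter}_0}$ - the complementary indicator function, which equals 1 during the warm-up phase (iteration at most $\text{iter}_0$) and 0 otherwise
* $\mathbb{1}_{\text{object}}$ - the indicator function that equals 1 for object-level training data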
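
The gating in the overall objective can be sketched as plain Python. The function and argument names are hypothetical, and the loss values are toy numbers rather than real training losses; only the switching logic follows the equation above.

```python
def total_loss(loss_de: float, loss_nv: float, loss_pd: float,
               iteration: int, warmup_iters: int, is_object: bool) -> float:
    """Combine the losses according to the indicator functions in the objective.

    After warm-up (iteration > warmup_iters): denoising + novel view loss.
    During warm-up: point distribution loss, applied to object-level data only.
    """
    if iteration > warmup_iters:
        return loss_de + loss_nv
    return loss_pd if is_object else 0.0

# Toy values: warm-up lasts for the first 100 iterations.
print(total_loss(1.0, 2.0, 5.0, iteration=10, warmup_iters=100, is_object=True))
print(total_loss(1.0, 2.0, 5.0, iteration=10, warmup_iters=100, is_object=False))
print(total_loss(1.0, 2.0, 5.0, iteration=101, warmup_iters=100, is_object=True))
```

As written, the objective contributes no loss for scene-level data during warm-up, since $\mathcal{L}_{\text{pd}}$ is gated by the object indicator.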


# References

* Loading Objaverse Dataset - https://colab.research.google.com/drive/1ZLA4QufsiI_RuNlamKqV7D7mn40FbWoY
* Paper Replicating - https://github.com/mrdbourke/pytorch-deep-learning/blob/main/08_pytorch_paper_replicating.ipynb
* 3D Gaussian Rasterization - https://github.com/graphdeco-inria/diff-gaussian-rasterization