LN3Diff: Scalable Latent Neural Fields Diffusion for Speedy 3D Generation (ECCV 2024)

Yushi Lan¹ Fangzhou Hong¹ Shuai Yang² Shangchen Zhou¹ Xuyi Meng¹
Bo Dai³ Xingang Pan¹ Chen Change Loy¹

S-Lab, Nanyang Technological University¹;
Wangxuan Institute of Computer Technology, Peking University²;
Shanghai Artificial Intelligence Laboratory³

LN3Diff (Latent Neural Fields 3D Diffusion) is a generic, feedforward 3D LDM framework that creates high-quality 3D object meshes from text within seconds.


Sample text prompts for the results figure: "The eiffel tower." • "A stone waterfall with wooden shed." • "A plate of sushi" • "A wooden chest with golden trim" • "A blue plastic chair."

For more visual results, check out our project page 📃


This repository contains the official implementation of LN3Diff: Scalable Latent Neural Fields Diffusion for Speedy 3D Generation.

[Project Page] • [arXiv] • [Gradio Demo]

📣 Updates

[10/2024] Further organized the code and added support for loading checkpoints directly from Hugging Face.

[08/2024] We have released the ZeroGPU Hugging Face demo for I23D; please try it on the Gradio space. To run it locally, simply run bash shell_scripts/final_release/inference/gradio_sample_obajverse_i23d_dit.sh.

[08/2024] We have released the new 3D VAE trained on the full G-Objaverse set, and the corresponding DiT-based T23D and I23D models, trained with flow matching. Please check the samples below.

[06/2024] LN3Diff got accepted to ECCV 2024 🥳!

[04/2024] Inference and training code for Objaverse, ShapeNet and FFHQ is released, including pre-trained models and training datasets.

[03/2024] Initial code release.

Demo

Check out our online demo on the Gradio space. To run the demo locally, follow the installation instructions below, and then call:

bash shell_scripts/final_release/inference/gradio_sample_obajverse_i23d_dit.sh

🐪 TODO

🖥️ Requirements

NVIDIA GPUs are required for this project. We conducted all training on NVIDIA V100-32GiB (ShapeNet, FFHQ) and NVIDIA A100-80GiB (G-Objaverse) GPUs. We have tested the inference code on NVIDIA V100. We recommend using Anaconda to manage the Python environments.

The environment can be created via conda env create -f environment_ln3diff.yml, and activated via conda activate ln3diff. If you want to reuse your own PyTorch environment, install the following packages in your environment:

pip install -r requirements.txt
# then, install apex from https://github.com/NVIDIA/apex. Note that you should build with cuda support.

🏃‍♀️ Inference

(Recommended) Single-click inference on Objaverse-trained models

The easiest way to run inference with the Objaverse models is to launch the Gradio demo locally. The checkpoints will be downloaded directly from Hugging Face.

bash shell_scripts/final_release/inference/gradio_sample_obajverse_i23d_dit.sh

For CLI image-to-3D inference:

bash shell_scripts/final_release/inference/sample_obajverse_i23d_dit.sh

For CLI text-to-3D inference:

bash shell_scripts/final_release/inference/sample_obajverse_t23d_dit.sh

Download Models manually & ShapeNet models inference

The pretrained checkpoints can be downloaded via OneDrive.

Put the downloaded checkpoints under the checkpoints folder for inference. The checkpoints directory layout should be:

checkpoints
├── objaverse
│     ├── model_rec1890000.pt # DiT/L-based 3D VAE
│     └── objaverse-dit
│           ├── t23d/model_joint_denoise_rec_model3820000.pt
│           └── i23d/model_joint_denoise_rec_model2990000.pt
├── shapenet
│     ├── car
│     │     └── model_joint_denoise_rec_model1580000.pt
│     ├── chair
│     │     └── model_joint_denoise_rec_model2030000.pt
│     └── plane
│           └── model_joint_denoise_rec_model770000.pt
├── ffhq
│     └── objaverse-vae/model_joint_denoise_rec_model1580000.pt
└── ...
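
Before running inference, it can help to confirm that the manually downloaded files landed where the scripts expect them. The following is a minimal sketch, not part of the repository: the helper name missing_checkpoints is hypothetical, and the path list is copied from the layout above (the elided "..." entry is intentionally not filled in).

```python
from pathlib import Path

# Relative paths copied from the directory layout in this README.
# The trailing "..." entry of the layout is elided in the source and is
# therefore not listed here.
EXPECTED_CHECKPOINTS = [
    "objaverse/model_rec1890000.pt",
    "objaverse/objaverse-dit/t23d/model_joint_denoise_rec_model3820000.pt",
    "objaverse/objaverse-dit/i23d/model_joint_denoise_rec_model2990000.pt",
    "shapenet/car/model_joint_denoise_rec_model1580000.pt",
    "shapenet/chair/model_joint_denoise_rec_model2030000.pt",
    "shapenet/plane/model_joint_denoise_rec_model770000.pt",
    "ffhq/objaverse-vae/model_joint_denoise_rec_model1580000.pt",
]


def missing_checkpoints(root="checkpoints"):
    """Return the expected checkpoint paths that are not files under root.

    Hypothetical helper for illustration; the repository does not ship it.
    """
    root = Path(root)
    return [rel for rel in EXPECTED_CHECKPOINTS if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoints:")
        for rel in missing:
            print("  checkpoints/" + rel)
    else:
        print("All listed checkpoints found.")
```

Run it from the repository root. If you only need one dataset (for example ShapeNet), the other entries will be reported as missing, which is harmless.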

Inference Commands

Note that to extract the mesh, 24GiB VRAM is required.

(New) Inference: (Single) Image-to-3D

We train a single-image-conditioned DiT-L/2 on the extracted VAE latents using the flow-matching framework, for more controllable 3D generation. To run inference, please run

bash shell_scripts/final_release/inference/sample_obajverse_i23d_dit.sh

which reconstructs the 3D assets from the input images in assets/i23d_examples/for_demo_inference. The input images are borrowed from InstantMesh. The model outputs are shown below (inputs in the next row):

To run 3D reconstruction on your own data, change $eval_path in the above bash file. For example, setting eval_path=./assets/i23d_examples/instant_mesh_samples runs 3D reconstruction on more real images from InstantMesh. Tuning the cfg scale through $unconditional_guidance_scale balances generation fidelity and diversity.

We have uploaded the inference results on some common I23D images (from InstantMesh) to OneDrive, including the condition images, rendered images/videos and the corresponding extracted textured meshes (with 4 different seeds, and cfg=5.0). Feel free to use them for comparison with your own method.

Inference: Text-to-3D

We train a text-conditioned 3D latent diffusion model on top of the stage-1 extracted latents. In the bash inference files below, set --save_img True to extract a textured mesh from the generated tri-plane. To change the text prompt, set the prompt variable. For unconditional sampling, set the cfg guidance scale unconditional_guidance_scale=0. Feel free to tune the cfg guidance scale to trade off diversity and fidelity.
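
To make the role of the guidance scale concrete, here is a minimal sketch of the classifier-free guidance combination commonly used in LDM-style samplers. It is an illustration under assumptions, not the repository's sampler: the function name apply_cfg is hypothetical, plain Python lists stand in for model outputs, and whether the flow-matching sampler in this codebase applies exactly this formula is not confirmed here. The formula is consistent with the note above that a scale of 0 gives unconditional sampling.

```python
def apply_cfg(cond, uncond, scale):
    """Combine conditional and unconditional predictions.

    guided = uncond + scale * (cond - uncond)

    scale = 0 returns the unconditional prediction, scale = 1 returns the
    conditional prediction, and larger values push further toward the
    condition (typically higher fidelity, lower diversity).
    Hypothetical helper for illustration only.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    cond = [1.0, 2.0]    # stand-in for the text-conditioned prediction
    uncond = [0.0, 0.0]  # stand-in for the unconditional prediction
    for scale in (0, 1, 5.0):
        print(scale, apply_cfg(cond, uncond, scale))
```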

Note that the diffusion sampling batch size is set to 4, which costs around 16GiB VRAM. The mesh extraction of a single instance costs 24GiB VRAM.
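
As a rough planning aid, the two figures above can be turned into a small check for a given GPU. This is a back-of-the-envelope sketch only: the function plan_vram is hypothetical, and it assumes that sampling memory scales linearly with batch size (about 16 GiB / 4 per sample) and ignores fixed overhead such as model weights, neither of which is stated in this README.

```python
# Figures quoted in this README.
SAMPLING_GIB_AT_BATCH_4 = 16   # diffusion sampling with batch size 4
MESH_EXTRACTION_GIB = 24       # mesh extraction of a single instance


def plan_vram(available_gib):
    """Estimate what fits into available_gib of VRAM.

    Assumption (not from the README): sampling memory grows linearly with
    batch size, so one sample costs roughly 16 / 4 = 4 GiB.
    """
    per_sample_gib = SAMPLING_GIB_AT_BATCH_4 / 4
    return {
        "max_batch_size": int(available_gib // per_sample_gib),
        "can_extract_mesh": available_gib >= MESH_EXTRACTION_GIB,
    }


if __name__ == "__main__":
    for gib in (16, 24, 32, 80):
        print(gib, "GiB ->", plan_vram(gib))
```

Treat the batch size as an upper bound to try first, and reduce it if you hit out-of-memory errors.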

text-to-3D on Objaverse

bash shell_scripts/final_release/inference/sample_obajverse_t23d_dit.sh

which should reproduce the results shown in Fig. 5 of our paper, using the same text prompts. The results may differ slightly depending on the random seed, but the quality is the same. Some output samples are shown in the top figure.

Note that the text prompts are hard-coded in scripts/vit_triplane_diffusion_sample_objaverse.py.

text-to-3D on ShapeNet

For text-to-3D on ShapeNet, run one of the following commands (which conduct T23D on car, chair and plane, respectively):

bash shell_scripts/final_release/inference/sample_shapenet_car_t23d.sh

bash shell_scripts/final_release/inference/sample_shapenet_chair_t23d.sh

bash shell_scripts/final_release/inference/sample_shapenet_plane_t23d.sh

The output samples used for the FID and COV/MMD calculations are uploaded here; they should reproduce the quantitative results in Tab. 1 of the paper.

text-to-3D on FFHQ

For text-to-3D on FFHQ, run

bash shell_scripts/final_release/inference/sample_ffhq_t23d.sh

Stage-1 VAE 3D reconstruction

To run (Objaverse) stage-1 VAE 3D reconstruction and extract the VAE latents for diffusion learning, please run

bash shell_scripts/final_release/inference/sample_obajverse.sh

which shall give the following result:

The marching-cube extracted mesh can be visualized with Blender/MeshLab:

The above VAE input and reconstruction outputs can be found in the assets/stage1_vae_reconstruction folder.

!! We have uploaded the pre-extracted VAE latents here, which contain the corresponding VAE latents (with shape 32x32x12) of 176K G-buffer Objaverse objects. Feel free to use them in your own task.

For more G-buffer Objaverse examples, download the demo data.

🏃‍♀️ Training

For training stage-1 VAE

For Objaverse, we use the renderings provided by G-buffer Objaverse. We process the data into multi-view chunks for faster loading, and the pre-processed data (176K instances) can be downloaded here. Note that you need 450 GiB of storage to download the dataset.

For ShapeNet, we render our own data with foreground mask for training, which can be downloaded from here. For training, we convert the raw data to LMDB for faster data loading. The pre-processed LMDB file can be downloaded from here.

For FFHQ, we use the pre-processed dataset from EG3D and compress it into LMDB, which can also be found at the OneDrive link above.

For training stage-2 LDM

Pre-extracted latents

We have uploaded the pre-extracted VAE latents here, which contain the corresponding VAE latents (with shape 32x32x3x4) of 176K G-buffer Objaverse objects. Feel free to use them in the LDM training.
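
The latent shape is quoted as 32x32x12 in the inference section and 32x32x3x4 here. The sketch below checks that the two describe the same number of values per object and gives a rough size estimate for the full latent set. It rests on assumptions that are not stated in this README: float32 storage without compression, exactly 176,000 objects as a stand-in for "176K", and the reading of 3x4 as three planes of four channels.

```python
import math

FLAT_SHAPE = (32, 32, 12)        # shape quoted in the stage-1 VAE section
FACTORED_SHAPE = (32, 32, 3, 4)  # shape quoted here; possibly 3 planes x 4 channels
NUM_OBJECTS = 176_000            # assumption: "176K" taken as exactly 176,000


def num_values(shape):
    """Number of scalar values in a latent of the given shape."""
    return math.prod(shape)


def dataset_size_gib(shape, num_objects, bytes_per_value=4):
    """Uncompressed size in GiB, assuming float32 (4 bytes) by default."""
    return num_values(shape) * bytes_per_value * num_objects / 2**30


if __name__ == "__main__":
    # Both shapes hold the same number of values, so one is a reshape of the other.
    print(num_values(FLAT_SHAPE), num_values(FACTORED_SHAPE))
    print(f"{dataset_size_gib(FLAT_SHAPE, NUM_OBJECTS):.2f} GiB")
```

Under these assumptions the full set of latents is on the order of 8 GiB, far smaller than the 450 GiB multi-view dataset, which is one practical reason to train the stage-2 LDM from the pre-extracted latents.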

text-to-3D

The Cap3D captions can be downloaded from here. Please place the file at './datasets/text_captions_cap3d.json'.

image-to-3D

We directly use the G-Objaverse rendered images for training, so you may need to download their data for this experiment.

Training Commands

Coming soon.

More discussions of the proposed method

Compared to existing 3D generation frameworks such as SDS-based (DreamFusion), multi-view generation-based (MVDream, Zero123++, Instant3D) and feedforward 3D reconstruction-based (LRM, InstantMesh, LGM) methods, LN3Diff is a native 3D diffusion framework. Like 2D/video AIGC pipelines, LN3Diff first trains a 3D VAE and then conducts LDM training (text/image conditioned) on the learned latent space. Some related methods from industry (Shap-E, CLAY, Meta 3D Gen) follow the same paradigm. Although native 3D LDM works currently perform worse overall than reconstruction-based methods, we believe the proposed method has considerable potential, scales better with more data and compute, and may yield better 3D editing performance due to its compatibility with diffusion models.

🤝 BibTex

If you find our work useful for your research, please consider citing the paper:

@inproceedings{lan2024ln3diff,
    title={LN3Diff: Scalable Latent Neural Fields Diffusion for Speedy 3D Generation}, 
    author={Yushi Lan and Fangzhou Hong and Shuai Yang and Shangchen Zhou and Xuyi Meng and Bo Dai and Xingang Pan and Chen Change Loy},
    year={2024},
    booktitle={ECCV},
}

🗞️ License

Distributed under the NTU S-Lab License. See LICENSE for more information.

Contact

If you have any questions, please feel free to contact us via lanyushi15@gmail.com or GitHub issues.
