This is an official PyTorch implementation of the following paper:
Y. Deng, J. Yang, J. Xiang, and X. Tong, GRAM: Generative Radiance Manifolds for 3D-Aware Image Generation, IEEE Computer Vision and Pattern Recognition (CVPR), 2022. (Oral Presentation)
Project page | Paper | Video
Abstract: 3D-aware image generative modeling aims to generate 3D-consistent images with explicitly controllable camera poses. Recent works have shown promising results by training neural radiance field (NeRF) generators on unstructured 2D images, but still cannot generate highly-realistic images with fine details. A critical reason is that the high memory and computation cost of volumetric representation learning greatly restricts the number of point samples for radiance integration during training. Deficient sampling not only limits the expressive power of the generator to handle fine details but also impedes effective GAN training due to the noise caused by unstable Monte Carlo sampling. We propose a novel approach that regulates point sampling and radiance field learning on 2D manifolds, embodied as a set of learned implicit surfaces in the 3D volume. For each viewing ray, we calculate ray-surface intersections and accumulate their radiance generated by the network. By training and rendering such radiance manifolds, our generator can produce high quality images with realistic fine details and strong visual 3D consistency.
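The accumulation step mentioned in the abstract can be illustrated with a minimal sketch. It assumes each ray-surface intersection yields an occupancy value `alpha` in [0, 1] and an RGB radiance, and that they are composited front to back. The function name and the plain-float inputs are hypothetical stand-ins for network outputs, not code from this repository.

```python
from typing import Sequence, Tuple

RGB = Tuple[float, float, float]


def composite_ray(alphas: Sequence[float], colors: Sequence[RGB]) -> RGB:
    """Front-to-back alpha compositing over ray-surface intersections.

    alphas[i] and colors[i] are the (assumed) occupancy and radiance at the
    i-th intersection, ordered from nearest to farthest along the ray.
    """
    if len(alphas) != len(colors):
        raise ValueError("alphas and colors must have the same length")
    transmittance = 1.0  # fraction of light that has not been absorbed yet
    out = [0.0, 0.0, 0.0]
    for alpha, color in zip(alphas, colors):
        weight = transmittance * alpha  # contribution of this intersection
        for channel in range(3):
            out[channel] += weight * color[channel]
        transmittance *= 1.0 - alpha  # light remaining behind this surface
    return (out[0], out[1], out[2])
```

Because each ray only needs one network evaluation per surface, the number of samples per ray stays small and fixed, which is the memory argument the abstract makes.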
- Currently only Linux is supported.
- A 64-bit installation of Python 3.6 or newer. We recommend using Anaconda3.
- One or more high-end NVIDIA GPUs, NVIDIA drivers, and CUDA toolkit 10.1 or newer. We recommend using 8 Tesla V100 GPUs with 32 GB memory for training to reproduce the results in the paper.
Clone the repository and set up a conda environment with all dependencies as follows:
git clone https://github.com/microsoft/GRAM.git
cd GRAM
conda env create -f environment.yml
source activate gram
Alternatively, we provide a Dockerfile to build an image with the required dependencies.
Checkpoints for pre-trained models used in our paper (default settings) are as follows.
Dataset | Config | Resolution | Training iterations | Batch size | FID 20k | KID 20k (x100) | Download |
---|---|---|---|---|---|---|---|
FFHQ | FFHQ_default | 256x256 | 150k | 32 | 14.5 | 0.65 | GitHub link |
Cats | CATS_default | 256x256 | 80k | 16 | 14.6 | 0.75 | GitHub link |
CARLA | CARLA_default | 128x128 | 70k | 32 | 26.3 | 1.15 | GitHub link |
Run the following script to render multi-view images of generated subjects using a pre-trained model:
# face images are generated by default (FFHQ_default)
python render_multiview_images.py
# custom setting for image generation
python render_multiview_images.py --config=<CONFIG_NAME> --generator_file=<GENERATOR_PATH.pth> --output_dir=<OUTPUT_FOLDER> --seeds=0,1,2
By default, the script generates images with watermarks. Use the --no_watermark argument to remove them.
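When rendering many models or seeds, it can help to assemble the command line programmatically. The sketch below only uses the flags shown above; the helper name `build_render_command` is hypothetical, and any paths you pass in are your own.

```python
import shlex
from typing import List, Optional, Sequence


def build_render_command(
    config: Optional[str] = None,
    generator_file: Optional[str] = None,
    output_dir: Optional[str] = None,
    seeds: Sequence[int] = (),
    no_watermark: bool = False,
) -> List[str]:
    """Build an argument list for render_multiview_images.py.

    Options left unset are omitted so the script falls back to its defaults.
    """
    cmd = ["python", "render_multiview_images.py"]
    if config is not None:
        cmd.append(f"--config={config}")
    if generator_file is not None:
        cmd.append(f"--generator_file={generator_file}")
    if output_dir is not None:
        cmd.append(f"--output_dir={output_dir}")
    if seeds:
        cmd.append("--seeds=" + ",".join(str(s) for s in seeds))
    if no_watermark:
        cmd.append("--no_watermark")
    return cmd


if __name__ == "__main__":
    cmd = build_render_command(config="CATS_default", seeds=[0, 1, 2])
    # Print a shell-safe version; pass `cmd` to subprocess.run to execute it.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```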
- FFHQ: Download the original 1024x1024 images. We additionally provide the 5 detected facial landmarks (Google Drive) for image preprocessing and the face poses (Google Drive) estimated by Deep3DFaceRecon for training. Download all files and organize them as follows:
GRAM/
│
└─── raw_data/
     │
     └─── ffhq/
          │
          ├─── *.png    # original 1024x1024 images
          │
          ├─── lm5p/    # detected 5 facial landmarks
          │    │
          │    └─── *.txt
          │
          └─── poses/   # estimated face poses
               │
               └─── *.mat
- Cats: Download the original cat images and provided landmarks using this link and organize all files as follows:
GRAM/
│
└─── raw_data/
     │
     └─── cats/
          │
          ├─── *.jpg        # original images
          │
          └─── *.jpg.cat    # provided landmarks
- CARLA: Download the original images and poses from GRAF and organize all files as follows:
GRAM/
│
└─── raw_data/
     │
     └─── carla/
          │
          ├─── *.png    # original images
          │
          └─── poses/   # provided poses
               │
               └─── *_extrinsics.npy
Finally, run the following script for data preprocessing:
python preprocess_dataset.py --raw_dataset_path=./raw_data/<CATEGORY> --cate=<CATEGORY>
The script aligns all images and saves them, together with the estimated or provided poses, into ./datasets for the later training process.
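Before running the preprocessing, it may be worth checking that the raw data matches the directory trees above. The following sketch is not part of the repository. The glob patterns are read off those trees, and the lowercase category keys are an assumption based on the folder names.

```python
from pathlib import Path
from typing import Dict, List

# Glob patterns relative to raw_data/<CATEGORY>, taken from the trees above.
# The category keys mirror the folder names and are an assumption.
EXPECTED_LAYOUT: Dict[str, List[str]] = {
    "ffhq": ["*.png", "lm5p/*.txt", "poses/*.mat"],
    "cats": ["*.jpg", "*.jpg.cat"],
    "carla": ["*.png", "poses/*_extrinsics.npy"],
}


def missing_patterns(raw_dataset_path: str, category: str) -> List[str]:
    """Return the expected glob patterns that match no file under the path."""
    root = Path(raw_dataset_path)
    return [
        pattern
        for pattern in EXPECTED_LAYOUT[category]
        if not any(root.glob(pattern))
    ]


if __name__ == "__main__":
    for category in EXPECTED_LAYOUT:
        missing = missing_patterns(f"./raw_data/{category}", category)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{category}: {status}")
```

The check only confirms that at least one file matches each pattern. It does not verify that every image has a matching landmark or pose file.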
Run the following script to train a generator from scratch using the preprocessed data:
python train.py --config=<CONFIG_NAME> --output_dir=<OUTPUT_FOLDER>
The code automatically detects all available GPUs and trains with DistributedDataParallel (DDP). You can use the default configs provided in configs.py or add your own config. By default, we use the batch split suggested by pi-GAN to increase the effective batch size during training.
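Batch splitting is usually realized as gradient accumulation: the batch is processed in smaller chunks and the chunk gradients are summed before one optimizer step. The toy sketch below assumes this is what the batch split amounts to. It uses a scalar linear model with a mean-squared-error loss rather than the GRAM networks, and all function names are hypothetical. It shows that the accumulated gradient equals the full-batch gradient.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (input x, target y)


def mse_grad(w: float, batch: Sequence[Sample]) -> float:
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def split_batch(batch: Sequence[Sample], num_splits: int) -> List[Sequence[Sample]]:
    """Cut the batch into num_splits equally sized chunks."""
    if num_splits <= 0 or len(batch) % num_splits != 0:
        raise ValueError("batch size must be divisible by num_splits")
    size = len(batch) // num_splits
    return [batch[i:i + size] for i in range(0, len(batch), size)]


def accumulated_grad(w: float, batch: Sequence[Sample], num_splits: int) -> float:
    """Sum per-chunk gradients, each scaled by 1 / num_splits."""
    total = 0.0
    for chunk in split_batch(batch, num_splits):
        # Only one chunk needs to be held in memory at a time.
        total += mse_grad(w, chunk) / num_splits
    return total


if __name__ == "__main__":
    data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 4.0)]
    print(mse_grad(0.5, data), accumulated_grad(0.5, data, num_splits=2))
```

In a PyTorch training loop the same idea corresponds to calling `backward()` on each chunk's scaled loss and stepping the optimizer once afterwards, so peak memory depends on the chunk size rather than the full batch size.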
The following table lists training times for different configs using 8 NVIDIA Tesla V100 GPUs (32GB memory):
Config | Resolution | Training iterations | Batch size | Training time |
---|---|---|---|---|
FFHQ_default | 256x256 | 150k | 32 | 12d 4h |
CATS_default | 256x256 | 80k | 16 | 4d 6h |
CARLA_default | 128x128 | 70k | 32 | 3d 15h |
Training GRAM at 256x256 image resolution requires around 30GB of memory for a typical forward-backward cycle with a batch size of 1 using PyTorch Automatic Mixed Precision. To enable training on GPUs with limited memory, we provide an alternative that runs the forward and backward passes at patch level (see here for a detailed explanation):
python train.py --config=<CONFIG_NAME> --output_dir=<OUTPUT_FOLDER> --patch_split=<NUMBER_OF_PATCHES>
Currently the patch split must be a power of 2 (e.g. 2, 4, 8, ...). It reduces the memory cost at the price of a slight increase in training time.
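The power-of-2 constraint and the idea of processing an image in equal parts can be sketched as follows. How the repository actually partitions pixels is not described here, so the contiguous split over flattened pixel indices and both function names are assumptions made for illustration only.

```python
from typing import List


def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ... (a single set bit in the binary form)."""
    return n > 0 and (n & (n - 1)) == 0


def split_pixels(height: int, width: int, patch_split: int) -> List[range]:
    """Partition flattened pixel indices into patch_split equal chunks.

    Hypothetical layout: chunk i covers a contiguous run of indices, so each
    chunk can be rendered and back-propagated on its own.
    """
    if not is_power_of_two(patch_split):
        raise ValueError("patch_split must be a power of 2")
    num_pixels = height * width
    if num_pixels % patch_split != 0:
        raise ValueError("patch_split must divide the number of pixels")
    size = num_pixels // patch_split
    return [range(i * size, (i + 1) * size) for i in range(patch_split)]


if __name__ == "__main__":
    patches = split_pixels(256, 256, patch_split=4)
    print(len(patches), len(patches[0]))
```

With a 256x256 image and a split of 4, each chunk holds a quarter of the rays, which is where the memory saving comes from.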
Run the following script for FID and KID calculation:
python fid_evaluation.py --no_watermark --config=<CONFIG_NAME> --generator_file=<GENERATOR_PATH.pth> --output_dir=<OUTPUT_FOLDER>
By default, 8000 real images and 1000 generated images from the EMA model are used for evaluation. You can adjust the number of images according to your own needs.
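For reference, the standard definitions of the two metrics are given below. Here $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denote the mean and covariance of Inception features of real and generated images, and $d$ is the feature dimension. That fid_evaluation.py follows exactly these formulas is an assumption; lower is better for both metrics.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)

\mathrm{KID} = \mathrm{MMD}^2(\text{real}, \text{generated}),
\qquad k(x, y) = \left( \tfrac{1}{d}\, x^\top y + 1 \right)^3
```

The KID values in the table above are reported multiplied by 100.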
The goal of this work is to study generative modelling of 3D objects from 2D images, and to provide a method for generating multi-view images of non-existing, virtual objects. It is not intended to manipulate existing images nor to create content that is used to mislead or deceive. The method has no understanding of, or control over, the generated content. Thus, adding targeted facial expressions or mouth movements is out of the scope of this work. However, the method, like all other related AI image generation techniques, could still potentially be misused for impersonating humans. Currently, the images generated by this method contain visual artifacts, unnatural texture patterns, and other unpredictable failures that can be spotted by humans and fake image detection algorithms. We also plan to investigate applying this technology to advance 3D- and video-based forgery detection.
Per concerns about misuse of this method, the code is available for use under a research-only license.
Please cite the following paper if this work helps your research:
@inproceedings{deng2022gram,
title={GRAM: Generative Radiance Manifolds for 3D-Aware Image Generation},
author={Deng, Yu and Yang, Jiaolong and Xiang, Jianfeng and Tong, Xin},
booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2022}
}
If you have any questions, please contact Yu Deng (dengyu2008@hotmail.com) and Jiaolong Yang (jiaoyan@microsoft.com).
We thank Harry Shum for the fruitful advice and discussion to improve the paper. This implementation takes pi-GAN as a reference. We thank the authors for their excellent work.