Face Anything: 4D Face Reconstruction from Any Image Sequence

ECCV 2026

teaser

Overview

Face Anything is a unified feed-forward model for high-fidelity 4D face reconstruction and dense tracking from arbitrary image sequences. The key idea is canonical facial point prediction, a representation that assigns each pixel a normalized facial coordinate in a shared canonical space. This formulation transforms dense tracking and dynamic reconstruction into a single canonical reconstruction problem, producing temporally consistent geometry and reliable correspondences.
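To make the link between canonical coordinates and dense tracking concrete, here is a minimal sketch of how correspondences could be read off two canonical maps: pixels in different frames that land on (nearly) the same canonical coordinate are treated as the same facial point. The array layout (an H×W×3 canonical map plus an H×W validity mask), the function name `match_by_canonical`, and the brute-force search are assumptions for illustration, not the repository's implementation. The default distance of 0.01 simply mirrors the documented default of --track-threshold.

```python
import numpy as np

def match_by_canonical(canon_a, valid_a, canon_b, valid_b, max_dist=0.01):
    """Match pixels of frame A to pixels of frame B via nearest canonical coordinate.

    canon_*: (H, W, 3) float arrays of canonical coordinates (assumed layout).
    valid_*: (H, W) boolean masks marking usable pixels.
    Returns two (N, 2) integer arrays of (row, col) positions, one per frame.
    """
    pix_a = np.argwhere(valid_a)  # (Na, 2), same row-major order as canon_a[valid_a]
    pix_b = np.argwhere(valid_b)
    if len(pix_a) == 0 or len(pix_b) == 0:
        empty = np.empty((0, 2), dtype=int)
        return empty, empty
    pts_a = canon_a[valid_a]      # (Na, 3)
    pts_b = canon_b[valid_b]      # (Nb, 3)
    # Brute-force pairwise distances: fine for a small demo, too slow for full
    # frames, where a spatial index such as a KD-tree would be the usual choice.
    dist = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=-1)
    nearest = dist.argmin(axis=1)
    keep = dist[np.arange(len(pts_a)), nearest] <= max_dist
    return pix_a[keep], pix_b[nearest[keep]]
```

Because every frame is expressed in the same canonical space, the same lookup works between any pair of frames, which is why tracking does not need a separate frame-to-frame flow estimate in this formulation.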

What It Produces

The teaser above is a single grand tour run: one orbit that morphs through every modality. A full run writes:

| Output | Description |
| --- | --- |
| `videos/pointcloud.mp4` | Colored 3D point-cloud reconstruction (orbiting camera) |
| `videos/tracks.mp4` | 3D reconstruction with colorful, temporally-consistent point tracks |
| `videos/canonical.mp4` | 3D reconstruction colored by canonical facial coordinates |
| `videos/depth.mp4` | 3D reconstruction colored by depth |
| `videos/normals.mp4` | 3D reconstruction colored by surface normals (from depth) |
| `videos/grand_tour.mp4` | One orbit that morphs through all modalities (pointcloud → tracks → canonical → depth → normals) |
| `videos/grand_tour_2d.mp4` | The grand tour in image space (2D) |
| `videos/*_2d.mp4`, `maps/*` | Per-modality image-space (2D) videos and per-frame maps |
| `ply/{geometry,canonical,tracks}/frame_XXXX.ply` | Colored point clouds per timestamp (geometry, canonical, and tracks) |
| `cameras.npz`, `cameras.json` | Per-frame camera intrinsics and extrinsics |
| `raw_predictions.npz` | Raw depth, intrinsics, extrinsics, canonical, conf, valid |
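The raw predictions can be turned back into 3D points with a standard pinhole unprojection. The sketch below is hedged on several assumptions that the list above does not confirm: that the archive keys are literally `depth` and `intrinsics`, that they are stacked per frame as (T, H, W) and (T, 3, 3), that depth is z-depth along the optical axis, and that pixel coordinates start at zero. Inspect the archive's actual keys and shapes before relying on it.

```python
import numpy as np

def unproject_depth(depth, K):
    """Lift a depth map to camera-space points with a pinhole model.

    depth: (H, W) z-depth per pixel (assumed convention).
    K:     (3, 3) intrinsics [[fx, 0, cx], [0, fy, cy], [0, 0, 1]].
    Returns an (H, W, 3) array of (x, y, z) points in the camera frame.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u: column index, v: row index
    x = (u - K[0, 2]) / K[0, 0] * depth
    y = (v - K[1, 2]) / K[1, 1] * depth
    return np.stack([x, y, depth], axis=-1)

def load_frame_points(npz_path, frame=0):
    """Read one frame from the archive and unproject it (key names are assumptions)."""
    with np.load(npz_path) as data:
        depth = data["depth"][frame]
        K = data["intrinsics"][frame]
    return unproject_depth(depth, K)
```

A call such as `load_frame_points("output/demo/raw_predictions.npz")` would then give camera-space points for the first frame; moving them into a shared world frame additionally needs the extrinsics, whose convention (camera-to-world or world-to-camera) is not stated here.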

Installation

Face Anything requires a CUDA-capable GPU. It has been tested with Python 3.11 and PyTorch 2.9 (CUDA 12.8).

One-line setup (creates the conda env, installs everything, and downloads the checkpoint):

git clone https://github.com/kocasariumut/FaceAnything.git
cd FaceAnything

bash install.sh

Or do it manually:

git clone https://github.com/kocasariumut/FaceAnything.git
cd FaceAnything

conda create -n faceanything python=3.11 -y
conda activate faceanything

# install PyTorch matching your CUDA, then the rest
pip install torch==2.9.0 torchvision==0.24.0 --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt

# install this package (exposes `faceanything` and `depth_anything_3`)
pip install -e .

The final pip install -e . step is optional: run_inference.py adds src/ to the import path automatically, so once the dependencies are installed you can run the script directly.

The backbone architecture and config are loaded from the public Hugging Face model depth-anything/DA3-GIANT-1.1, which is downloaded automatically on first run; its weights are then overwritten by our checkpoint. To use a local cache instead of downloading, run export HF_HOME=/path/to/hf_cache before the first run.

Checkpoint

Download the released checkpoint.pt (~15 GB) and place it at checkpoints/checkpoint.pt, the default load path. install.sh fetches it automatically; to download it manually, use either option below.

Option 1: Google Drive

pip install gdown
gdown --fuzzy "https://drive.google.com/file/d/1PdQQxzm-tU50RmJhgeoMCYVRlEiW3f8p/view?usp=sharing" \
    -O checkpoints/checkpoint.pt

Option 2: Hugging Face (UmutKocasari/FaceAnything)

huggingface-cli download UmutKocasari/FaceAnything checkpoint.pt --local-dir checkpoints

(The Hugging Face model repository becomes public when the code is released.)

To load the checkpoint from a different location, pass --checkpoint /path/to/checkpoint.pt.

Usage

python run_inference.py --input <INPUT> --output <OUT_DIR>

<INPUT> can be a single image, a folder of images, or a video file (.mp4/.mov/.avi/.mkv/...). For a single image, the reconstruction is rendered as a turntable orbit.

# image folder
python run_inference.py --input path/to/images --output output/demo
# video
python run_inference.py --input clip.mp4 --output output/clip
# single image
python run_inference.py --input face.jpg --output output/face
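When scripting many runs, it can help to check up front which of the three input kinds a path will be treated as. The helper below is a hypothetical sketch, not code from run_inference.py: the video extensions are only the four named above (the real list is longer), and the image extensions are an assumption.

```python
from pathlib import Path

# Only the four video extensions named in this README; the script accepts more.
VIDEO_EXTS = {".mp4", ".mov", ".avi", ".mkv"}
# Assumed image extensions; the README does not list the supported formats.
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}

def classify_input(path):
    """Return 'image_folder', 'video', or 'single_image' for an --input path."""
    p = Path(path)
    if p.is_dir():
        return "image_folder"
    ext = p.suffix.lower()
    if ext in VIDEO_EXTS:
        return "video"
    if ext in IMAGE_EXTS:
        return "single_image"
    raise ValueError(f"unrecognized input type: {path!r}")
```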

Background removal

Backgrounds are removed automatically. If you don't provide masks, Face Anything runs Robust Video Matting to segment the foreground and saves the generated masks to <OUT_DIR>/masks. You can also supply your own masks with --mask-dir /path/to/masks, or pass --no-background-removal to disable background removal entirely and reconstruct the full frame.

Selecting outputs

Everything is produced by default. To produce a subset, pass a comma-separated list to --outputs (choices: pointcloud, depth, normals, canonical, tracks, grandtour, ply, raw):

python run_inference.py --input clip.mp4 --output output/clip \
    --outputs pointcloud,canonical,grandtour
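A wrapper script may want to validate an --outputs value before launching a long run. The following sketch parses the comma-separated list against the eight documented choices. The function name `parse_outputs` is hypothetical, and treating a missing value or the literal string `all` as "everything" is an assumption based on the documented default, not confirmed behavior of the script.

```python
VALID_OUTPUTS = (
    "pointcloud", "depth", "normals", "canonical",
    "tracks", "grandtour", "ply", "raw",
)

def parse_outputs(value=None):
    """Turn a comma-separated --outputs string into a validated list of names."""
    # Assumption: no value (or "all") selects every output.
    if value is None or value.strip() in ("", "all"):
        return list(VALID_OUTPUTS)
    requested = [name.strip() for name in value.split(",") if name.strip()]
    unknown = [name for name in requested if name not in VALID_OUTPUTS]
    if unknown:
        raise ValueError(f"unknown outputs: {', '.join(unknown)}")
    # Drop duplicates while keeping the order the user wrote.
    return list(dict.fromkeys(requested))
```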

Processing modes & detail

--process-mode controls how frames are fed to the model, which trades off surface detail against temporal/3D consistency:

  • all-at-once (default): all frames are processed jointly, giving more 3D-consistent results across the sequence but less detailed surfaces.
  • one-by-one: each frame is processed independently, giving more detailed surfaces but less 3D-consistent results across frames. It also uses less memory, so it pairs well with a larger --process-res for higher-resolution, more detailed outputs.
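One way to see the trade-off is to render the same clip in both modes and compare the results side by side. The sketch below only assembles the two command lines using flags documented in this README. The helper names are hypothetical, and the resolution 756 is purely illustrative, since the README does not state which --process-res values are valid.

```python
import subprocess
import sys

def build_command(input_path, output_dir, process_mode="all-at-once", process_res=504):
    """Assemble the argument list for one run_inference.py invocation."""
    if process_mode not in ("all-at-once", "one-by-one"):
        raise ValueError(f"unknown process mode: {process_mode!r}")
    return [
        sys.executable, "run_inference.py",
        "--input", str(input_path),
        "--output", str(output_dir),
        "--process-mode", process_mode,
        "--process-res", str(process_res),
    ]

def run(cmd):
    """Execute a command from the repository root; raises if the script fails."""
    subprocess.run(cmd, check=True)

# Joint pass for 3D consistency, then a per-frame pass at a larger resolution for detail.
consistent = build_command("clip.mp4", "output/clip_joint")
detailed = build_command(
    "clip.mp4", "output/clip_detail",
    process_mode="one-by-one", process_res=756,  # illustrative value only
)
```

Calling `run(consistent)` and then `run(detailed)` writes the two result sets to separate output folders so they can be compared directly.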

All options

Command-line options

| Flag | Default | Description |
| --- | --- | --- |
| `--outputs` | all | subset of outputs to generate |
| `--max-frames` | all | cap the number of frames |
| `--process-res` | 504 | model resolution (increase for more detailed outputs) |
| `--process-mode` | all-at-once | all-at-once (joint: more 3D-consistent, less detail) or one-by-one (per-frame: more detailed surfaces, less 3D-consistent; lower memory) |
| `--stride` | 1 | use every N-th frame |
| `--render-size` | 1024 | output video resolution |
| `--point-size` | auto | point splat size (auto from resolution) |
| `--render-backend` | auto | open3d (smooth, default) or torch fallback |
| `--fps` | 10 | output frame rate |
| `--orbit` | sway | sway / turntable / none camera motion |
| `--orbit-amplitude` | 80 | sway amplitude in degrees (0 → -amp → 0 → +amp → 0) |
| `--orbit-frames` | 80 | number of orbit frames for a single-image input |
| `--n-tracks` | 100 | number of seeded point tracks |
| `--track-k` | 25 | canonical nearest-neighbours recolored per track |
| `--track-threshold` | 0.01 | max canonical distance for a correspondence |
| `--mask-dir` | (none) | use foreground masks from this directory |
| `--remove-background` | off | force RVM mask regeneration |
| `--no-background-removal` | off | reconstruct the full frame (no masks) |
| `--use-predicted-poses` | off | multi-view-consistent world frame instead of monocular |
| `--checkpoint` | checkpoints/checkpoint.pt | model checkpoint |
| `--save-frames` | off | also dump the individual rendered orbit frames |

Run python run_inference.py --help for the full list.
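The sway pattern described for --orbit-amplitude (0, then -amp, back to 0, then +amp, back to 0) can be illustrated with a single sine period. This is a sketch under the assumption of a smooth sinusoidal schedule; the renderer may interpolate differently (for example, piecewise linearly), and the function name `sway_angles` is hypothetical.

```python
import math

def sway_angles(n_frames, amplitude_deg=80.0):
    """Yaw angle in degrees per frame, following 0 → -amp → 0 → +amp → 0."""
    return [
        -amplitude_deg * math.sin(2.0 * math.pi * i / n_frames)
        for i in range(n_frames)
    ]
```

With the documented defaults (80 frames, amplitude 80), this schedule reaches -80 degrees a quarter of the way through and +80 degrees at three quarters, so a smaller amplitude narrows the swing while the number of frames stays the same.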

Dataset

We release the canonical maps for the selected timestamps used in Face Anything. To request access to the FaceAnything NeRSemble dataset, please fill out this form.

Acknowledgments

Face Anything builds on Depth-Anything-3. We also thank the authors of FLAME, COLMAP, and Robust Video Matting for their open-source work, which we use in training data preparation and background segmentation.

Citation

@article{kocasari2026face,
  title={Face Anything: 4D Face Reconstruction from Any Image Sequence},
  author={Kocasari, Umut and Giebenhain, Simon and Shaw, Richard and Nie{\ss}ner, Matthias},
  journal={arXiv preprint arXiv:2604.19702},
  year={2026}
}

License

CC BY-NC 4.0

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

