Deep Learning for 3D Human Pose Estimation and Mesh Recovery: A Survey

Authors: Yang Liu, Changzhen Qiu, Zhiyong Zhang*

School of Electronics and Communication Engineering, Sun Yat-sen University, Shenzhen, Guangdong, China

Overview

This is the regularly updated project page for Deep Learning for 3D Human Pose Estimation and Mesh Recovery: A Survey, a review of deep learning approaches to 3D human pose estimation and human mesh recovery. The survey covers state-of-the-art publications from 2019 onward in mainstream computer vision conferences and journals.

Please open an issue if you have any suggestions!

3D human pose estimation

Single Person
- In Images
  - Solving Depth Ambiguity
    - Optical-aware: VI-HC [paper], Ray3D [paper]
    - Appropriate feature representation: HEMlets [paper]
    - Joint aware: JRAN [paper]
  - Solving Body Structure Understanding
    - Limb aware: Wu et al. [paper], Deep grammar network [paper]
    - Orientation keypoints: Fisch et al. [paper]
    - Graph-based: Liu et al. [paper], LCN [paper], Modulated-CNN [paper], Skeletal-GNN [paper], HopFIR [paper], RS-Net [paper]
  - Solving Occlusion Problems
    - Learnable-triangulation [paper]
    - RPMS [paper]
    - Lightweight multi-view [paper]
    - AdaFuse [paper]
    - Bartol et al. [paper]
    - 3D pose consensus [paper]
    - Probabilistic triangulation model [paper]
  - Solving Data Lacking
    - Unsupervised learning: Kudo et al. [paper], Chen et al. [paper], ElePose [paper]
    - Self-supervised learning: EpipolarPose [paper], Wang et al. [paper], MRP-Net [paper], PoseTriplet [paper]
    - Weakly-supervised learning: Hua et al. [paper], CameraPose [paper]
    - Transfer learning: AdaptPose [paper]
- In Videos
  - Solving Single-frame Limitation
    - VideoPose3D [paper]
    - PoseFormer [paper]
    - UniPose+ [paper]
    - MHFormer [paper]
    - MixSTE [paper]
    - Honari et al. [paper]
    - HSTFormer [paper]
    - STCFormer [paper]
  - Solving Real-time Problems
    - Temporally sparse sampling: Einfalt et al. [paper]
    - Spatio-temporal sparse sampling: MixSynthFormer [paper]
  - Solving Body Structure Understanding
    - Motion loss: Wang et al. [paper]
    - Human-joint affinity: Dc-Net [paper]
    - Anatomy-aware: Chen et al. [paper]
    - Part aware attention: Xue et al. [paper]
  - Solving Occlusion Problems
    - Optical-flow consistency constraint: Cheng et al. [paper]
    - Multi-view: MTF-Transformer [paper]
  - Solving Data Lacking
    - Unsupervised learning: Yu et al. [paper]
    - Weakly-supervised learning: Chen et al. [paper]
    - Semi-supervised learning: MCSS [paper]
    - Self-supervised learning: Kundu et al. [paper], P-STMO [paper]
    - Meta-learning: Cho et al. [paper]
    - Data augmentation: PoseAug [paper], Zhang et al. [paper]
Multi-person
- Top-down
  - Solving Real-time Problems
    - Multi-view: Chen et al. [paper]
    - Whole body: AlphaPose [paper]
  - Solving Representation Limitation
    - VoxelTrack [paper]
  - Solving Occlusion Problems
    - Wu et al. [paper]
  - Solving Data Lacking
    - Single-shot: PandaNet [paper]
    - Optical-aware: Moon et al. [paper]
- Bottom-up
  - Solving Real-time Problems
    - Fabbri et al. [paper]
  - Solving Supervisory Limitation
    - HMOR [paper]
  - Solving Data Lacking
    - Single-shot: SMAP [paper], Benzine et al. [paper]
  - Solving Occlusion Problems
    - Mehta et al. [paper]
    - LCR-Net++ [paper]
- Others
  - Single Stage
    - Jin et al. [paper]
  - Top-down & Bottom-up
    - Cheng et al. [paper]

Human Mesh Recovery

Template-based
- Naked
  - Multimodal Methods
    - Hybrid annotations: Rong et al. [paper]
    - Optical flow: DTS-VIBE [paper]
    - Silhouettes: LASOR [paper]
    - Cropped image and bounding box: CLIFF [paper]
  - Utilizing Attention Mechanism
    - Part-driven attention: PARE [paper]
    - Graph attention: Mesh Graphormer [paper]
    - Spatio-temporal attention: MPS-Net [paper], PSVT [paper]
    - Efficient architecture: FastMETRO [paper], Xue et al. [paper]
    - End-to-end structure: METRO [paper]
  - Exploiting Temporal Information
    - Temporally encoding features: Kanazawa et al. [paper]
    - Self-attention temporal: VIBE [paper]
    - Temporally consistent: TCMR [paper]
    - Multi-level spatial-temporal attention: MAED [paper]
    - Temporally embedded live stream: TePose [paper]
    - Short-term and long-term temporal correlations: GLoT [paper]
  - Multi-view Methods
    - Confidence-aware majority voting mechanism: Dong et al. [paper]
    - Probabilistic-based multi-view: Sengupta et al. [paper]
    - Dynamic physics-geometry consistency: Huang et al. [paper]
    - Cross-view fusion: Zhuo et al. [paper]
  - Boosting Efficiency
    - Sparse constrained formulation: SCOPE [paper]
    - Single-stage model: BMP [paper]
    - Process heatmap inputs: HeatER [paper]
    - Removing redundant tokens: TORE [paper]
  - Developing Various Representations
    - Texture map: TexturePose [paper]
    - UV map: Zhang et al. [paper], DecoMR [paper], Zhang et al. [paper]
    - Heat map: Sun et al. [paper], 3DCrowdNet [paper]
    - Uniform representation: DSTFormer [paper]
  - Utilizing Structural Information
    - Part-based: HoloPose [paper]
    - Skeleton disentangling: Sun et al. [paper]
    - Hybrid inverse kinematics: HybrIK [paper], NIKI [paper]
    - Uncertainty-aware: Lee et al. [paper]
    - Kinematic tree structure: Sengupta et al. [paper]
    - Kinematic chains: SGRE [paper]
  - Choosing Appropriate Learning Strategies
    - Self-improving: SPIN [paper], ReFit [paper], You et al. [paper]
    - Novel losses: Zanfir et al. [paper], Jiang et al. [paper]
    - Unsupervised learning: Madadi et al. [paper], Yu et al. [paper]
    - Bilevel online adaptation: Guan et al. [paper]
    - Single-shot: Pose2UV [paper]
    - Contrastive learning: JOTR [paper]
    - Domain adaptation: Nam et al. [paper]
- Detailed
  - With Clothes
    - Alldieck et al. [paper]
    - Multi-Garment Network (MGN) [paper]
    - Texture map: Tex2Shape [paper]
    - Layered garment representation: BCNet [paper]
    - Temporal span: H4D [paper]
  - With Hands
    - Linguistic priors: SGNify [paper]
    - Two-hands interaction: [paper]
    - Hand-object interaction: [paper]
  - Whole Body
    - PROX [paper]
    - ExPose [paper]
    - FrankMocap [paper]
    - PIXIE [paper]
    - Moon et al. [paper]
    - PyMAF [paper]
    - OSX [paper]
    - HybrIK-X [paper]
Template-free
- Regression-based
  - FACSIMILE [paper], PeeledHuman [paper], GTA [paper], NSF [paper]
- Optimization-based Differentiable
  - DiffPhy [paper], AG3D [paper]
- Implicit Representations
  - PIFu [paper], PIFuHD [paper]
  - Canonical space: ARCH [paper], ARCH++ [paper], CAR [paper]
  - Geometric priors: GeoPIFu [paper]
  - Novel representations: Peng et al. [paper], 3DNBF [paper]
- Neural Radiance Fields
  - Volume deformation scheme [paper]
  - ActorsNeRF [paper]
- Diffusion Models
  - HMDiff [paper]
- Implicit + Explicit
  - HMD [paper], IP-Net [paper], PaMIR [paper], Zhu et al. [paper], ICON [paper], ECON [paper], DELTA [paper], GETAvatar [paper]
- Diffusion + Explicit
  - DINAR [paper]
- NeRF + Explicit
  - TransHuman [paper]
- Gaussian Splatting + Explicit
  - Animatable 3D Gaussian [paper]

Overview of the mainstream datasets.

Dataset	Type	Data	Total frames	Feature	Download link
Human3.6M	3D/Mesh	Video	3.6M	multi-view	Website
3DPW	3D/Mesh	Video	51K	multi-person	Website
MPI-INF-3DHP	2D/3D	Video	2K	in-the-wild	Website
HumanEva	3D	Video	40K	multi-view	Website
CMU-Panoptic	3D	Video	1.5M	multi-view/multi-person	Website
MuCo-3DHP	3D	Image	8K	multi-person/occluded scene	Website
SURREAL	2D/3D/Mesh	Video	6.0M	synthetic model	Website
3DOH50K	2D/3D/Mesh	Image	51K	object-occluded	Website
3DCP	Mesh	Mesh	190	contact	Website
AMASS	Mesh	Motion	11K	soft-tissue dynamics	Website
DensePose	Mesh	Image	50K	multi-person	Website
UP-3D	3D/Mesh	Image	8K	sport scene	Website
THuman2.0	Mesh	Image	7K	textured surface	Website

Comparisons of 3D pose estimation methods on Human3.6M (errors in mm).

Method	Year	Publication	Highlight	MPJPE↓	PMPJPE↓	Code
GraFormer	2022	CVPR'22	graph-based transformer	35.2	-	Code
GLA-GCN	2023	ICCV'23	adaptive GCN	34.4	37.8	Code
PoseDA	2023	arXiv'23	domain adaptation	49.4	34.2	Code
GFPose	2023	CVPR'23	gradient fields	35.6	30.5	Code
TP-LSTMs	2022	TPAMI'22	pose similarity metric	40.5	31.8	-
FTCM	2023	TCSVT'23	frequency-temporal collaborative	28.1	-	Code
VideoPose3D	2019	CVPR'19	semi-supervised	46.8	36.5	Code
PoseFormer	2021	ICCV'21	spatio-temporal transformer	44.3	34.6	Code
STCFormer	2023	CVPR'23	spatio-temporal transformer	40.5	31.8	Code
3Dpose_ssl	2020	TPAMI'20	self-supervised	63.6	63.7	Code
MTF-Transformer	2022	TPAMI'22	multi-view temporal fusion	26.2	-	Code
AdaptPose	2022	CVPR'22	cross datasets	42.5	34.0	Code
3D-HPE-PAA	2022	TIP'22	part aware attention	43.1	33.7	Code
DeciWatch	2022	ECCV'22	efficient framework	52.8	-	Code
Diffpose	2023	CVPR'23	pose refine	36.9	28.7	Code
ElePose	2022	CVPR'22	unsupervised	-	36.7	Code
Uplift and Upsample	2023	CVPR'23	efficient transformers	48.1	37.6	Code
RS-Net	2023	TIP'23	regular splitting graph network	48.6	38.9	Code
HSTFormer	2023	arXiv'23	spatial-temporal transformers	42.7	33.7	Code
PoseFormerV2	2023	CVPR'23	frequency domain	45.2	35.6	Code
DiffPose	2023	ICCV'23	diffusion models	42.9	30.8	Code
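
The two error columns above can be read as MPJPE (mean per-joint position error) and its Procrustes-aligned variant. The following is a minimal NumPy sketch of both, assuming a single pose given as a (J, 3) array in millimetres. The function names are illustrative and do not come from any of the listed codebases, and individual papers may differ in details such as root alignment or which joints are evaluated.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean Euclidean distance between matching joints; pred and gt are (J, 3)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def procrustes_align(pred, gt):
    """Similarity-align pred to gt (scale, rotation, translation)."""
    mu_pred, mu_gt = pred.mean(axis=0), gt.mean(axis=0)
    x, y = pred - mu_pred, gt - mu_gt
    # Optimal rotation comes from the SVD of the 3x3 cross-covariance matrix.
    u, s, vt = np.linalg.svd(x.T @ y)
    # Flip the last singular direction if needed so the result is a proper
    # rotation (determinant +1) rather than a reflection.
    d = np.sign(np.linalg.det(u @ vt))
    diag = np.array([1.0, 1.0, d])
    rot = u @ np.diag(diag) @ vt
    # Least-squares optimal isotropic scale for the chosen rotation.
    scale = (s * diag).sum() / (x ** 2).sum()
    return scale * x @ rot + mu_gt


def pa_mpjpe(pred, gt):
    """MPJPE measured after rigid alignment of the prediction to the ground truth."""
    return mpjpe(procrustes_align(pred, gt), gt)
```

Because the alignment removes global rotation, translation and scale, the aligned error isolates the quality of the articulated pose itself, which is why it is usually lower than plain MPJPE for the same method.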

Comparisons of 3D pose estimation methods on MPI-INF-3DHP (MPJPE in mm; PCK and AUC in %).

Method	Year	Publication	Highlight	MPJPE↓	PCK↑	AUC↑	Code
HSTFormer	2023	arXiv'23	spatial-temporal transformers	28.3	98.0	78.6	Code
PoseFormerV2	2023	CVPR'23	frequency domain	27.8	97.9	78.8	Code
Uplift and Upsample	2023	CVPR'23	efficient transformers	46.9	95.4	67.6	Code
RS-Net	2023	TIP'23	regular splitting graph network	-	85.6	53.2	Code
Diffpose	2023	CVPR'23	pose refine	29.1	98.0	75.9	Code
FTCM	2023	TCSVT'23	frequency-temporal collaborative	31.2	97.9	79.8	Code
STCFormer	2023	CVPR'23	spatio-temporal transformer	23.1	98.7	83.9	Code
PoseDA	2023	arXiv'23	domain adaptation	61.3	92.0	62.5	Code
TP-LSTMs	2022	TPAMI'22	pose similarity metric	48.8	82.6	81.3	-
AdaptPose	2022	CVPR'22	cross datasets	77.2	88.4	54.2	Code
3D-HPE-PAA	2022	TIP'22	part aware attention	69.4	90.3	57.8	Code
ElePose	2022	CVPR'22	unsupervised	54.0	86.0	50.1	Code
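
For readers unfamiliar with the PCK and AUC columns, here is a minimal sketch of how such scores are typically computed from per-joint errors. It assumes poses as (J, 3) arrays in millimetres, a 150 mm PCK threshold, and an AUC averaged over 31 evenly spaced thresholds from 0 to 150 mm. These defaults follow common practice for this benchmark but are assumptions here, and the function names are illustrative.

```python
import numpy as np


def pck(pred, gt, threshold=150.0):
    """Percentage of joints whose 3D error is within `threshold` (mm)."""
    dist = np.linalg.norm(pred - gt, axis=-1)
    return float((dist <= threshold).mean() * 100.0)


def auc(pred, gt, max_threshold=150.0, steps=31):
    """Mean PCK over evenly spaced thresholds in [0, max_threshold]."""
    thresholds = np.linspace(0.0, max_threshold, steps)
    return float(np.mean([pck(pred, gt, t) for t in thresholds]))
```

PCK rewards every joint that falls inside the threshold equally, while AUC also credits how far inside the threshold the joints are, so two methods with the same PCK can still differ in AUC.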

Comparisons of human mesh recovery methods on Human3.6M and 3DPW (errors in mm).

Method	Publication	Highlight	Human3.6M MPJPE↓	Human3.6M PA-MPJPE↓	3DPW MPJPE↓	3DPW PA-MPJPE↓	3DPW PVE↓	Code
VirtualMarker	CVPR'23	novel intermediate representation	47.3	32.0	67.5	41.3	77.9	Code
NIKI	CVPR'23	inverse kinematics	-	-	71.3	40.6	86.6	Code
TORE	ICCV'23	efficient transformer	59.6	36.4	72.3	44.4	88.2	Code
JOTR	ICCV'23	contrastive learning	-	-	76.4	48.7	92.6	Code
HMDiff	ICCV'23	reverse diffusion processing	49.3	32.4	72.7	44.5	82.4	Code
ReFit	ICCV'23	recurrent fitting network	48.4	32.2	65.8	41.0	-	Code
PyMAF-X	TPAMI'23	regression-based one-stage whole body	-	-	74.2	45.3	87.0	Code
PointHMR	CVPR'23	vertex-relevant feature extraction	48.3	32.9	73.9	44.9	85.5	-
PLIKS	CVPR'23	inverse kinematics	47.0	34.5	60.5	38.5	73.3	Code
ProPose	CVPR'23	learning analytical posterior probability	45.7	29.1	68.3	40.6	79.4	Code
POTTER	CVPR'23	pooling attention transformer	56.5	35.1	75.0	44.8	87.4	Code
PoseExaminer	ICCV'23	automated testing of out-of-distribution	-	-	74.5	46.5	88.6	Code
MotionBERT	ICCV'23	pretrained human representations	43.1	27.8	68.8	40.6	79.4	Code
3DNBF	ICCV'23	analysis-by-synthesis approach	-	-	88.8	53.3	-	Code
FastMETRO	ECCV'22	efficient architecture	52.2	33.7	73.5	44.6	84.1	Code
CLIFF	ECCV'22	multi-modality inputs	47.1	32.7	69.0	43.0	81.2	Code
PARE	ICCV'21	part-driven attention	-	-	74.5	46.5	88.6	Code
Mesh Graphormer	ICCV'21	GCNN-reinforced transformer	51.2	34.5	74.7	45.6	87.7	Code
PSVT	CVPR'23	spatio-temporal encoder	-	-	73.1	43.5	84.0	-
GLoT	CVPR'23	short-term and long-term temporal correlations	67.0	46.3	80.7	50.6	96.3	Code
MPS-Net	CVPR'23	temporally adjacent representations	69.4	47.4	91.6	54.0	109.6	Code
MAED	ICCV'21	multi-level attention	56.4	38.7	79.1	45.7	92.6	Code
Lee et al.	ICCV'21	uncertainty-aware	58.4	38.4	92.8	52.2	106.1	-
TCMR	CVPR'21	temporal consistency	62.3	41.1	95.0	55.8	111.3	-
VIBE	CVPR'20	self-attention temporal network	65.6	41.4	82.9	51.9	99.1	Code
ImpHMR	CVPR'23	implicitly imagine person in 3D space	-	-	74.3	45.4	87.1	-
SGRE	ICCV'23	sequentially global rotation estimation	-	-	78.4	49.6	93.3	Code
PMCE	ICCV'23	pose and mesh co-evolution network	53.5	37.7	69.5	46.7	84.8	Code

Citation

Please cite the paper if our work is useful for your research.

@misc{liu2024deep,
      title={Deep Learning for 3D Human Pose Estimation and Mesh Recovery: A Survey}, 
      author={Yang Liu and Changzhen Qiu and Zhiyong Zhang},
      year={2024},
      eprint={2402.18844},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Repository: liuyangme/SOTA-3DHPE-HMR
