This model presents a new task for the library, so there might be some iterations and discussions on what the inputs and outputs should look like. The model translation should be fairly straightforward though, so I'd suggest starting with a PR that implements that and then on the PR we can figure out what works best.
Model description
ViTPose is a model for 2D human pose estimation, a subset of the keypoint detection task #24044.
It provides a simple baseline for vision transformer-based human pose estimation. It uses a pretrained vision transformer backbone to extract features and a simple decoder head to process them. Despite having no elaborate design, ViTPose achieves state-of-the-art (SOTA) performance of 80.9 AP on the MS COCO Keypoint test-dev set.
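To help the discussion about what the outputs could look like, here is a rough sketch of one possible post-processing step. It assumes the decoder head emits one heatmap per keypoint, and it turns each heatmap into an `(x, y, score)` triple by taking the argmax and rescaling to image coordinates. The function name `decode_keypoints`, the nested-list heatmap layout, and the `(width, height)` argument are all hypothetical choices for illustration, not the original repository's code or an existing library API.

```python
from typing import List, Tuple

Heatmap = List[List[float]]  # assumed layout: rows (y) of columns (x)


def decode_keypoints(
    heatmaps: List[Heatmap], image_size: Tuple[int, int]
) -> List[Tuple[float, float, float]]:
    """Convert per-keypoint heatmaps into (x, y, score) in image pixels.

    Hypothetical helper: plain argmax decoding, with no sub-pixel refinement.
    """
    img_w, img_h = image_size
    keypoints = []
    for heatmap in heatmaps:
        hm_h, hm_w = len(heatmap), len(heatmap[0])
        best_x, best_y, best_score = 0, 0, float("-inf")
        # Find the location of the highest activation in this heatmap.
        for y, row in enumerate(heatmap):
            for x, value in enumerate(row):
                if value > best_score:
                    best_x, best_y, best_score = x, y, value
        # Rescale heatmap coordinates to the input image resolution.
        keypoints.append((best_x * img_w / hm_w, best_y * img_h / hm_h, best_score))
    return keypoints


# Two toy 2x3 heatmaps standing in for the head's output on a 6x4 image.
toy_heatmaps = [
    [[0.0, 0.1, 0.0], [0.0, 0.0, 0.9]],
    [[0.7, 0.0, 0.0], [0.0, 0.2, 0.0]],
]
toy_keypoints = decode_keypoints(toy_heatmaps, image_size=(6, 4))
```

Whether the library should return raw heatmaps, decoded keypoints like these, or both is the kind of question that can be settled on the PR.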
Open source status
Provide useful links for the implementation
Code and weights: https://github.com/ViTAE-Transformer/ViTPose
Paper: https://arxiv.org/abs/2204.12484
@Annbless